
Alibaba's open-source large speech model supports zero-shot voice cloning from about 3 seconds of reference audio, multiple languages, and instruction-based emotion control, with streaming synthesis at latencies as low as 150 ms.
CosyVoice is a next-generation multilingual large speech generation model developed and open-sourced by Alibaba's FunAudioLLM team at the FunAudio Lab. It is released under the Apache-2.0 license and provides full-stack speech synthesis capabilities, covering inference, training, and deployment.
As of June 2026, CosyVoice has become one of the most powerful models in the open-source text-to-speech field, surpassing most competitors in timbre (speaker) similarity.

| Comparison Dimension | CosyVoice | GPT-SoVITS | FishSpeech | F5-TTS |
|---|---|---|---|---|
| Core Architecture | LLM + FSQ + Flow Matching (autoregressive + streaming) | GPT + VITS | VQ + LLM + VQGAN | DiT + Flow Matching (purely non-autoregressive) |
| Inference Latency | Extremely low (approx. 150 ms to first packet) | Relatively high (approx. 1,200 ms) | Medium (approx. 350 ms) | Higher (noticeable on CPU) |
| Resource Usage | Peak VRAM approx. 2.1 GB | Peak VRAM approx. 5.5 GB | Peak VRAM approx. 3.8 GB | Compact model, friendly to Apple M-series chips |
| Languages/Dialects | 9+ languages and 18 Chinese dialects | Primarily Chinese, with community-driven English support | Chinese, English, and Japanese | Chinese and English |
| Strengths Summary | Low latency, very natural-sounding Chinese, strong dialect support, and excellent streaming performance | Fine-tuning on a small dataset works well, but inference cost is high | Good balance between multilingual support and audio quality | Minimalist architecture, lightweight deployment, and no robotic sound |
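
The "first packet" latency figures in the table refer to how long a streaming synthesizer takes to emit its first audio chunk. The sketch below shows one way such a number could be measured. It does not call CosyVoice or any of the other models: `fake_streaming_tts` is a hypothetical stand-in that yields placeholder bytes, and `measure_first_packet_latency` is an illustrative helper name, not part of any library. To benchmark a real model, you would pass in a function that wraps that model's own streaming interface.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def fake_streaming_tts(text: str, chunk_delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming synthesizer (one chunk per word)."""
    for _word in text.split():
        time.sleep(chunk_delay_s)   # simulate per-chunk compute time
        yield b"\x00" * 320         # placeholder bytes, not real audio


def measure_first_packet_latency(
    synthesize: Callable[[str], Iterable[bytes]], text: str
) -> Tuple[float, float, int]:
    """Return (first_packet_ms, total_ms, n_chunks) for one streaming call."""
    start = time.perf_counter()
    first_packet_ms = None
    n_chunks = 0
    for _chunk in synthesize(text):
        if first_packet_ms is None:
            # Time from the request until the first chunk is available.
            first_packet_ms = (time.perf_counter() - start) * 1000.0
        n_chunks += 1
    total_ms = (time.perf_counter() - start) * 1000.0
    if first_packet_ms is None:
        raise ValueError("synthesizer produced no audio chunks")
    return first_packet_ms, total_ms, n_chunks


if __name__ == "__main__":
    first_ms, total_ms, n = measure_first_packet_latency(
        fake_streaming_tts, "hello streaming speech synthesis"
    )
    print(f"first packet: {first_ms:.1f} ms, total: {total_ms:.1f} ms, chunks: {n}")
```

Measured values depend heavily on hardware, text length, and warm-up, so a fair comparison would average several runs per model on the same machine.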







