CosyVoice

Alibaba's open-source large speech model supports zero-shot voice cloning from a 3-second sample, multilingual synthesis, and instruction-based emotional control, with streaming synthesis at a first-packet latency as low as 150 ms.

Language: zh, en

Collection time: 2026-06-25

What is CosyVoice?

CosyVoice is a next-generation multilingual large speech generation model developed and open-sourced by Alibaba's FunAudioLLM team at the FunAudio Lab. It is released under the Apache-2.0 license and provides full-stack speech synthesis capabilities, covering inference, training, and deployment.

As of June 2026, CosyVoice is one of the most powerful models in the open-source text-to-speech field, surpassing most competitors in timbre (speaker) similarity.

Key Features of CosyVoice

Zero-Shot Voice Cloning: A clear audio sample of 3 seconds or longer is enough to extract a speaker's timbre and reproduce it, with no separate training step. Cross-language voice cloning is also supported.
Multilingual and Dialect Synthesis: Supports single-language and mixed-language generation in multiple languages, including Chinese, English, Japanese, and Korean, as well as 18 Chinese dialects (such as Cantonese, Sichuanese, and Shanghainese).
Instruction-Based Emotional Control: Supports fine-grained control of the prosody and emotion of generated speech (such as laughter or sadness) through natural-language instructions or rich-text tags.
Ultra-Low-Latency Streaming Synthesis: Offline and streaming synthesis share a single unified model, with a first-packet latency as low as 150 ms, so speech starts playing almost as soon as text is entered.
Voice Design and Customization: Supports generating original, custom voices from scratch using text descriptions (such as "a gentle, intelligent female voice").
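
The "first-packet latency" figure above refers to the time between submitting text and receiving the first chunk of audio, as opposed to the time needed to synthesize the whole utterance. A minimal sketch of how that could be measured is shown below. `fake_streaming_tts` is a hypothetical stand-in that yields placeholder bytes; it is not the CosyVoice API, and the delay value is an arbitrary assumption.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_streaming_tts(text: str, chunk_delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical streaming synthesizer: yields one placeholder chunk per word."""
    for word in text.split():
        time.sleep(chunk_delay_s)   # pretend to synthesize this chunk
        yield word.encode("utf-8")  # placeholder bytes, not real audio


def measure_latency(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Return (first_packet_latency_s, total_time_s, chunk_count)."""
    start = time.perf_counter()
    first_packet = None
    count = 0
    for _ in chunks:
        if first_packet is None:
            # Time until the first chunk arrives.
            first_packet = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    if first_packet is None:  # the stream produced no chunks
        first_packet = total
    return first_packet, total, count


if __name__ == "__main__":
    first, total, n = measure_latency(fake_streaming_tts("one two three four"))
    print(f"first packet: {first * 1000:.1f} ms, total: {total * 1000:.1f} ms, chunks: {n}")
```

In a streaming setup, playback can begin after the first-packet time while the remaining chunks are still being generated, which is why this number matters more than total synthesis time for interactive use.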

CosyVoice's Core Technology

Unified LLM + Flow Matching Architecture: A pre-trained large language model (such as Qwen2.5-0.5B) serves as the backbone and converts text into discrete speech tokens; a Conditional Flow Matching (CFM) model then synthesizes audio from those tokens. Building on an LLM backbone strengthens the system's semantic understanding.
Finite Scalar Quantization (FSQ): The speech tokenizer uses FSQ instead of traditional vector quantization (VQ). Codebook utilization approaches 100%, which significantly improves pronunciation accuracy and content consistency.
Cross-Language Cloning Technology: Timbre is decoupled from language. A general-purpose speaker encoder extracts timbre features, so a single voice can be adapted to the pronunciation rules and prosodic patterns of different languages.
Contrastive Learning and Reinforcement Learning: A speaker encoder trained with contrastive learning, together with preference-optimization techniques such as DPO (Direct Preference Optimization), further improves speaker similarity and content consistency.
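
To illustrate why FSQ can reach near-full codebook utilization, here is a minimal sketch of the general FSQ idea: each latent dimension is bounded and rounded to a small fixed set of integer levels, and the combination of levels acts as the token id. There is no learned codebook that can collapse, and every combination of levels is a valid code. The level counts and function names below are illustrative assumptions, not CosyVoice's actual tokenizer configuration.

```python
import math
from itertools import product
from typing import List, Sequence


def fsq_quantize(z: Sequence[float], levels: Sequence[int]) -> List[int]:
    """Round each latent dimension to one of levels[i] integers (odd level counts assumed)."""
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2          # grid is -half .. +half
        bounded = math.tanh(value) * half   # squash into (-half, +half)
        codes.append(int(round(bounded)))   # snap to the nearest integer level
    return codes


def fsq_index(codes: Sequence[int], levels: Sequence[int]) -> int:
    """Combine per-dimension codes into one token id via mixed-radix encoding."""
    index = 0
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index = index * n_levels + (code + half)  # shift code into 0 .. n_levels-1
    return index


if __name__ == "__main__":
    levels = [3, 3, 3]  # hypothetical: 3 dimensions with 3 levels each
    codes = fsq_quantize([0.0, 5.0, -5.0], levels)
    print(codes, fsq_index(codes, levels))

    # Every combination of levels maps to a distinct id, so the implicit
    # codebook has prod(levels) entries and none of them is unreachable.
    grid = product(*[range(-(n // 2), n // 2 + 1) for n in levels])
    print(len({fsq_index(c, levels) for c in grid}))
```

By contrast, a learned VQ codebook can leave many entries unused during training, which is the utilization problem FSQ is meant to avoid.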

Use Cases for CosyVoice

Content Creation and Personal Media: Suited to short-video voiceovers, vlog narration, and audiobook production, with support for cross-language voiceovers and multi-character performances.
Smart Interaction and Customer Service: Suited to real-time interaction scenarios such as intelligent customer service, in-car navigation, and voice assistants, where it provides low-latency, natural-sounding voice feedback.
Government, Enterprise, and On-Premises Use: Supports on-premises deployment for scenarios with strict data-privacy requirements, such as Party-building outreach, internal meeting minutes, and virtual streaming.
Cross-Regional Communication and Accessibility: Supports real-time transcription and speech synthesis in multiple dialects, helping overcome regional accent barriers in field research, customer interviews, and similar settings.

CosyVoice Project URLs

GitHub repository: https://github.com/FunAudioLLM/CosyVoice
ModelScope (model hub in China): https://www.modelscope.cn/models/iic/CosyVoice-300M
Hugging Face (international model hub): https://huggingface.co/FunAudioLLM

Comparison of Similar Products

In the field of open-source text-to-speech (TTS), CosyVoice is often compared with models such as GPT-SoVITS, FishSpeech, and F5-TTS:

Comparison Dimension	CosyVoice	GPT-SoVITS	FishSpeech	F5-TTS
Core Architecture	LLM + FSQ + Flow Matching (autoregressive + streaming)	GPT + VITS	VQ + LLM + VQGAN	DiT + Flow Matching (fully non-autoregressive)
Inference Latency	Very low (about 150 ms for the first packet)	Relatively high (about 1,200 ms)	Medium (about 350 ms)	Relatively high (noticeable on CPU)
Resource Usage	Peak GPU memory: about 2.1 GB	Peak GPU memory: about 5.5 GB	Peak GPU memory: about 3.8 GB	Compact model, friendly to Apple M-series chips
Languages/Dialects	Supports 9+ languages and 18 Chinese dialects	Primarily Chinese, with community-driven English extensions	Chinese, English, and Japanese	Chinese and English
Strengths Summary	Low latency, very natural-sounding Chinese, strong dialect support, and excellent streaming performance	Fine-tunes well on small datasets, but inference is costly	Good balance between multilingual support and audio quality	Minimalist architecture, lightweight deployment, and no robotic sound
