
What is Qwen3-ASR-Flash?
Qwen3-ASR-Flash is the latest speech recognition model in Alibaba's Tongyi Qianwen (Qwen) series. Trained on millions of hours of multimodal data, it transcribes multiple languages and dialects with high accuracy. It supports 11 languages, including Chinese (Mandarin plus several dialects), English, Japanese, Korean, and Arabic, and maintains a low error rate in complex scenarios such as noisy environments, background music, and overlapping speech. The model not only recognizes everyday speech accurately but also excels at songs, technical terminology, and dialects. It also supports contextual customization: users can supply keywords or documents to improve recognition of proper nouns, and the model is robust to irrelevant text.
Qwen3-ASR-Flash provides API and SDK access, supports real-time streaming transcription, and suits a wide range of scenarios such as meeting minutes, interview transcription, online education, intelligent customer service, medical dictation, gaming commentary, and music content analysis. It is a general-purpose speech recognition solution that combines accuracy, flexibility, and ease of use.
Main features of Qwen3-ASR-Flash
- Multilingualism and Dialect Recognition
A single model supports 11 languages: Chinese (Mandarin, Sichuanese, Southern Min, Wu, Cantonese, etc.), English (American, British, etc.), French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, and Arabic.
- High recognition accuracy
- Standard Mandarin: word error rate (WER) of about 3.97%
- Chinese dialects: about 3.48%
- English: about 3.81%
- Lyrics recognition: error rate below 8% for clean vocals or singing over background music; about 9.96% on an internal full-song test
- Excellent noise robustness
Accurate recognition even in complex acoustic environments (e.g., inside a car, background music, overlapping conversations, various noise disturbances).
For example, it can transcribe audio such as game narration, English rap, dialect interludes, and chemistry lessons.
- Intelligent contextual customization
Users can provide contextual text in any format (keyword lists, paragraphs, documents, even irrelevant text), and the model intelligently uses this context to improve recognition of named entities and terminology while remaining robust to irrelevant input.
- Automatic language detection and non-speech filtering
The model supports automatic language identification (`enable_lid`) and can filter out non-speech content such as silence and background noise.
- Additional supported features
- Inverse text normalization (ITN) for Chinese and English
- Punctuation prediction
- Streaming output
- Multiple audio formats and call methods (Java/Python SDK or HTTP API)
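The contextual customization described above can be sketched as a request builder that attaches free-form biasing text. This is a hypothetical illustration: the field names (`model`, `audio_url`, `parameters`, `context`) are assumptions, not the documented schema — consult the Alibaba Cloud Model Studio API reference for the real request format.

```python
# Hypothetical sketch of supplying biasing context to an ASR request.
# Field names here are illustrative assumptions, not the official schema.

def build_asr_request(audio_url, context=""):
    """Assemble a transcription request carrying optional context text."""
    payload = {
        "model": "qwen3-asr-flash",
        "audio_url": audio_url,
        "parameters": {},
    }
    if context:
        # Free-form context: keyword lists, paragraphs, or whole documents.
        # The model is said to tolerate irrelevant text, so no cleanup is needed.
        payload["parameters"]["context"] = context
    return payload

# Biasing a chemistry-lecture transcription toward domain terminology:
request = build_asr_request(
    "https://example.com/lecture.wav",
    context="Key terms: chromatography, titration, Erlenmeyer flask",
)
```

Because the model is robust to irrelevant text, the context string can simply be a pasted document rather than a curated keyword list.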
Qwen3-ASR-Flash Usage Scenarios
- Multilingual meeting minutes: suited to cross-language, multi-accent meetings; transcribes meeting content automatically and accurately for participants speaking different languages.
- News interviews: quickly and accurately turn interview recordings into publishable text.
- Online education: convert lectures into real-time subtitles to support multilingual students.
- Intelligent customer service: transcribe user speech in real time for automatic archiving, analysis, and response.
- Medical records: quickly convert physicians' dictation into text to power electronic medical records and data analytics.
- Gaming commentary: recognize professional terminology and commentary in complex audio environments; real examples include combining in-game background text with recognition.
- Lyrics and music: highly accurate lyric recognition over singing and background music, ideal for music content production and analysis.
Qwen3-ASR-Flash project address
- ModelScope (魔搭 community)
  - Address: https://modelscope.cn/studios/Qwen/Qwen3-ASR-Demo
  - Features: Alibaba Cloud's model community offers a free online demo with real-time speech recognition; users can upload audio files or record directly to test the model.
- Hugging Face
  - Address: https://huggingface.co/spaces/Qwen/Qwen3-ASR-Demo
  - Features: the well-known international AI community hosts an interactive demo page where users can quickly try Qwen3-ASR-Flash's multilingual recognition and view the documentation.
- Alibaba Cloud Model Studio (Bailian) API
  - Address: https://bailian.console.aliyun.com/?tab=doc#/doc/
  - Features: Alibaba Cloud's official API platform supports calling Qwen3-ASR-Flash via API, suitable for enterprise application development. Users must register an account and obtain an API key; the documentation provides a detailed call guide and parameter reference.
How to use Qwen3-ASR-Flash?
- Access methods
  - API call: invoke the `qwen3-asr-flash` model through Alibaba Cloud Model Studio (Bailian).
  - Online demo: try the model via the ModelScope or Hugging Face demo pages.
- Parameter description
  - `language`: specify the language when it is known, to improve accuracy.
  - `enable_lid`: enable automatic language detection.
  - `enable_itn`: enable inverse text normalization for Chinese and English.
  - `stream=true`: enable streaming output, suitable for real-time transcription scenarios.
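The switches above can be sketched as a small helper that assembles a parameter dict. Only the parameter names (`language`, `enable_lid`, `enable_itn`, `stream`) come from the text; how they are packaged into an actual API call is an assumption — check the official call guide for the real SDK signatures.

```python
# A minimal sketch combining the documented switches into one parameter
# dict. The wrapper function itself is hypothetical; only the key names
# are taken from the parameter description above.

def asr_parameters(language=None, detect_language=False, itn=False, streaming=False):
    """Build the optional-parameter dict for a transcription request."""
    params = {}
    if language is not None:
        params["language"] = language   # set when the language is known
    if detect_language:
        params["enable_lid"] = True     # automatic language identification
    if itn:
        params["enable_itn"] = True     # inverse text normalization (zh/en)
    if streaming:
        params["stream"] = True         # incremental output for real-time use
    return params

# Known-language, real-time transcription with normalized numbers/dates:
print(asr_parameters(language="en", itn=True, streaming=True))
```

Note that `language` and `enable_lid` are alternatives in spirit: specify the language when you know it, or let the model detect it.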
- Notes on restrictions
- The audio must be no longer than 3 minutes in length and no larger than 10 MB in file size.
- Supported formats include aac, mp3, wav, flac and many other mainstream audio formats.
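A client-side pre-flight check mirroring the stated limits (3 minutes, 10 MB, mainstream formats) can avoid rejected uploads. This sketch takes the duration as a given value; in practice you would read it from the file's metadata, and the allowed-format set shown is only the subset the text names explicitly.

```python
# Pre-flight validation against the documented limits. ALLOWED_FORMATS
# lists only the formats named in the text; the service accepts more.

MAX_SECONDS = 3 * 60                  # audio no longer than 3 minutes
MAX_BYTES = 10 * 1024 * 1024          # file no larger than 10 MB
ALLOWED_FORMATS = {"aac", "mp3", "wav", "flac"}

def check_audio(filename, size_bytes, duration_seconds):
    """Return a list of limit violations; an empty list means the clip passes."""
    problems = []
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext not in ALLOWED_FORMATS:
        problems.append(f"possibly unsupported format: {ext}")
    if size_bytes > MAX_BYTES:
        problems.append("file exceeds 10 MB")
    if duration_seconds > MAX_SECONDS:
        problems.append("audio exceeds 3 minutes")
    return problems

# A 170-second, 8 MB mp3 passes all three checks:
print(check_audio("talk.mp3", 8_000_000, 170))
```

Longer recordings would need to be split into sub-3-minute segments before submission.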
Recommended Reasons
- Leading performance: in multiple language and scenario benchmarks, its recognition error rate is significantly better than competitors such as Gemini-2.5-Pro and GPT-4o-Transcribe.
- Versatility and adaptability: handles multiple audio types, complex environments, dialects, and noise interference, making it a strong choice for general-purpose scenarios.
- Intelligent customization: accepts context input in any format, improving recognition hit rates without pre-processing.
- Comprehensive development support: SDK, HTTP API, and online demos enable fast, low-friction integration.
- Continuous optimization: the Qwen team continuously iterates to improve general recognition accuracy.