
What is Qwen3-ASR-Flash?
Qwen3-ASR-Flash is the latest speech recognition model in Alibaba's Tongyi Qianwen (Qwen) series. Trained on millions of hours of multimodal data, it transcribes multiple languages and dialects with high accuracy. It supports 11 languages, including Chinese (Mandarin plus several dialects), English, Japanese, Korean, and Arabic, and maintains a low error rate in complex scenarios such as noisy environments, background music, and overlapping speech. The model not only recognizes everyday speech accurately but also excels at songs, technical terminology, and dialects. It also supports contextual customization: users can supply keywords or documents to improve recognition of proper nouns, and the model is robust to irrelevant text.
Qwen3-ASR-Flash provides API and SDK access, supports real-time streaming transcription, and suits a wide range of scenarios such as meeting minutes, interview transcription, online education, intelligent customer service, medical dictation, gaming commentary, and music content analysis. It is a general-purpose speech recognition solution that combines accuracy, flexibility, and ease of use.
Main features of Qwen3-ASR-Flash
- Multilingualism and Dialect Recognition
A single model supports 11 languages: Chinese (Mandarin, Sichuanese, Southern Min, Wu, Cantonese, etc.), English (American, British, etc.), French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, and Arabic.
- High recognition accuracy
- Standard Mandarin: word error rate (WER) of about 3.97%
- Chinese dialects: about 3.48%
- English: about 3.81%
- Lyrics recognition: error rate below 8% for clean vocals or singing over background music; about 9.96% on an internal full-song test
- Excellent noise robustness
Accurate recognition even in complex acoustic environments (e.g., inside a car, background music, overlapping conversations, various noise disturbances).
For example, it can transcribe audio such as game narration, English rap, dialect interludes, and chemistry lessons.
- Intelligent contextual customization
Users can provide contextual text in any format (keyword lists, paragraphs, documents, even irrelevant text), and the model intelligently uses this context to improve recognition of named entities and terminology while remaining robust to irrelevant input.
- Automatic language detection and non-speech filtering
The model supports automatic language identification (`enable_lid`) and can filter out non-speech content such as silence and background noise.
- Additional supported features
- Inverse text normalization (ITN) for Chinese and English
- Punctuation prediction
- Streaming output
- Multiple audio formats and call methods (Java/Python SDK or HTTP API)
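The contextual customization described above can be sketched as a request builder that attaches free-form biasing text. This is a hypothetical illustration: the field names (`model`, `audio_url`, `parameters`, `context`) are assumptions, not the documented schema — consult the Alibaba Cloud Model Studio API reference for the real request format.

```python
# Hypothetical sketch of supplying biasing context to an ASR request.
# Field names here are illustrative assumptions, not the official schema.

def build_asr_request(audio_url, context=""):
    """Assemble a transcription request carrying optional context text."""
    payload = {
        "model": "qwen3-asr-flash",
        "audio_url": audio_url,
        "parameters": {},
    }
    if context:
        # Free-form context: keyword lists, paragraphs, or whole documents.
        # The model is said to tolerate irrelevant text, so no cleanup is needed.
        payload["parameters"]["context"] = context
    return payload

# Biasing a chemistry-lecture transcription toward domain terminology:
request = build_asr_request(
    "https://example.com/lecture.wav",
    context="Key terms: chromatography, titration, Erlenmeyer flask",
)
```

Because the model is robust to irrelevant text, the context string can simply be a pasted document rather than a curated keyword list.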
Qwen3-ASR-Flash Usage Scenarios
- Multilingual meeting minutes: suited to cross-language, multi-accent meetings; transcribes meeting content automatically and accurately for participants speaking different languages.
- News interviews: quickly and accurately turn interview recordings into publishable text.
- Online education: convert lectures into real-time subtitles to support multilingual students.
- Intelligent customer service: transcribe user speech in real time for automatic archiving, analysis, and response.
- Medical records: quickly convert physicians' dictation into text to power electronic medical records and data analytics.
- Gaming commentary: recognize professional terminology and commentary in complex audio environments; real examples include combining in-game background text with recognition.
- Lyrics and music: highly accurate lyric recognition over singing and background music, ideal for music content production and analysis.
Qwen3-ASR-Flash project address
- ModelScope (魔搭 community)
  - Address: https://modelscope.cn/studios/Qwen/Qwen3-ASR-Demo
  - Features: Alibaba Cloud's model community offers a free online demo with real-time speech recognition; users can upload audio files or record directly to test the model.
- Hugging Face
  - Address: https://huggingface.co/spaces/Qwen/Qwen3-ASR-Demo
  - Features: the well-known international AI community hosts an interactive demo page where users can quickly try Qwen3-ASR-Flash's multilingual recognition and view the documentation.
- Alibaba Cloud Model Studio (Bailian) API
  - Address: https://bailian.console.aliyun.com/?tab=doc#/doc/
  - Features: Alibaba Cloud's official API platform supports calling Qwen3-ASR-Flash via API, suitable for enterprise application development. Users must register an account and obtain an API key; the documentation provides a detailed call guide and parameter reference.
How to use Qwen3-ASR-Flash?
- Access methods
  - API call: invoke the `qwen3-asr-flash` model through Alibaba Cloud Model Studio (Bailian).
  - Online demo: try the model via the ModelScope or Hugging Face demo pages.
- Parameter description
  - `language`: specify the language when it is known, to improve accuracy.
  - `enable_lid`: enable automatic language detection.
  - `enable_itn`: enable inverse text normalization for Chinese and English.
  - `stream=true`: enable streaming output, suitable for real-time transcription scenarios.
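The switches above can be sketched as a small helper that assembles a parameter dict. Only the parameter names (`language`, `enable_lid`, `enable_itn`, `stream`) come from the text; how they are packaged into an actual API call is an assumption — check the official call guide for the real SDK signatures.

```python
# A minimal sketch combining the documented switches into one parameter
# dict. The wrapper function itself is hypothetical; only the key names
# are taken from the parameter description above.

def asr_parameters(language=None, detect_language=False, itn=False, streaming=False):
    """Build the optional-parameter dict for a transcription request."""
    params = {}
    if language is not None:
        params["language"] = language   # set when the language is known
    if detect_language:
        params["enable_lid"] = True     # automatic language identification
    if itn:
        params["enable_itn"] = True     # inverse text normalization (zh/en)
    if streaming:
        params["stream"] = True         # incremental output for real-time use
    return params

# Known-language, real-time transcription with normalized numbers/dates:
print(asr_parameters(language="en", itn=True, streaming=True))
```

Note that `language` and `enable_lid` are alternatives in spirit: specify the language when you know it, or let the model detect it.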
- Notes on restrictions
- The audio must be no longer than 3 minutes in length and no larger than 10 MB in file size.
- Supported formats include aac, mp3, wav, flac and many other mainstream audio formats.
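A client-side pre-flight check mirroring the stated limits (3 minutes, 10 MB, mainstream formats) can avoid rejected uploads. This sketch takes the duration as a given value; in practice you would read it from the file's metadata, and the allowed-format set shown is only the subset the text names explicitly.

```python
# Pre-flight validation against the documented limits. ALLOWED_FORMATS
# lists only the formats named in the text; the service accepts more.

MAX_SECONDS = 3 * 60                  # audio no longer than 3 minutes
MAX_BYTES = 10 * 1024 * 1024          # file no larger than 10 MB
ALLOWED_FORMATS = {"aac", "mp3", "wav", "flac"}

def check_audio(filename, size_bytes, duration_seconds):
    """Return a list of limit violations; an empty list means the clip passes."""
    problems = []
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext not in ALLOWED_FORMATS:
        problems.append(f"possibly unsupported format: {ext}")
    if size_bytes > MAX_BYTES:
        problems.append("file exceeds 10 MB")
    if duration_seconds > MAX_SECONDS:
        problems.append("audio exceeds 3 minutes")
    return problems

# A 170-second, 8 MB mp3 passes all three checks:
print(check_audio("talk.mp3", 8_000_000, 170))
```

Longer recordings would need to be split into sub-3-minute segments before submission.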
Recommended Reasons
- Leading performance: in multiple language and scenario benchmarks, its recognition error rate is significantly better than competitors such as Gemini-2.5-Pro and GPT-4o-Transcribe.
- Versatility and adaptability: handles multiple audio types, complex environments, dialects, and noise interference, making it a strong choice for general-purpose scenarios.
- Intelligent customization: accepts context input in any format, improving recognition hit rates without pre-processing.
- Comprehensive development support: SDK, HTTP API, and online demos enable fast, low-friction integration.
- Continuous optimization: the Qwen team continuously iterates to improve general recognition accuracy.