
What is PrismAudio?
PrismAudio is a video-to-audio framework released by Alibaba Tongyi Lab on March 24, 2026, focused on synthesizing ambient sound and sound effects. As the first model to deeply combine reinforcement learning with chain-of-thought reasoning, PrismAudio achieves tight synchronization between sound and video content through a “think before you generate” paradigm, addressing the audio-visual inconsistency and inefficiency of traditional models. The work has been accepted at ICLR 2026, and the code will be open-sourced soon.
PrismAudio's main features
- Ambient sound and sound-effect synthesis
- Automatically generates background sound effects that match the picture, such as hoofbeats, wind and rain, and metal clangs, replacing traditional Foley work.
- Supports complex sound generation for multi-event, multi-source scenes while maintaining stable output.
- Four-dimensional collaborative optimization
- Semantic alignment: ensures the sound content accurately corresponds to the objects and actions in the video (e.g., recognizing “hoofbeats” rather than “bird calls”).
- Temporal synchronization: precisely aligns the timing of sounds with visual events, down to the millisecond.
- Aesthetic optimization: generates natural, layered, high-quality audio free of electronic artifacts to enhance the listening experience.
- Spatial orientation: supports stereo output and automatically adjusts the left and right channels according to the sound source's position in the frame, so listeners can “hear where the sound comes from”.
- Highly efficient and lightweight
- The model has only 518 million parameters and generates 9 seconds of audio in just 0.63 seconds, nearly twice as fast as comparable models and suitable for real-time applications.
- Chain-of-thought reasoning
- Using a “decompositional chain-of-thought” technique, the model first generates structured reasoning text (e.g., sound content, timing, texture, and orientation) and then generates the audio, making the process interpretable and controllable.
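To make the spatial-orientation idea concrete, a minimal constant-power panning sketch shows how a source's horizontal position in the frame can set the left and right channel gains. This is a standard audio technique used for illustration, not PrismAudio's published method:

```python
import math

def pan_stereo(sample: float, position: float) -> tuple[float, float]:
    """Constant-power pan: position -1.0 (full left) .. 1.0 (full right)."""
    angle = (position + 1.0) * math.pi / 4.0  # maps position to 0 .. pi/2
    return sample * math.cos(angle), sample * math.sin(angle)

# A source centered in the frame reaches both channels at equal gain;
# a source at the left edge reaches only the left channel.
left, right = pan_stereo(1.0, 0.0)
```

Constant-power panning keeps perceived loudness roughly constant as the source moves across the stereo field, which is why it is preferred over simple linear gain splitting.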
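The structured reasoning text described above can be pictured as a small schema covering content, timing, texture, and orientation. The field names and example values below are assumptions for illustration, not PrismAudio's actual output format:

```python
from dataclasses import dataclass

@dataclass
class SoundEvent:
    content: str      # what sounds (e.g., "hoofbeats")
    onset_s: float    # when it starts, in seconds
    duration_s: float # how long it lasts
    texture: str      # timbre/quality description
    position: float   # -1.0 (left) .. 1.0 (right) in the frame

# Hypothetical reasoning output for a horse crossing the frame in wind:
plan = [
    SoundEvent("hoofbeats", 0.0, 4.5, "dry gravel, rhythmic", -0.8),
    SoundEvent("wind gust", 2.0, 3.0, "soft broadband", 0.0),
]
```

Generating such a plan before the audio itself is what makes the process inspectable: each field can be checked or edited before synthesis.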
Scenarios for using PrismAudio
- Film and television post-production
- Automatically generates ambient sound for movies, documentaries, and trailers, reducing post-production cost and time.
- Short video creation
- Quickly adds ambient sound to silent videos such as vlogs, food, and travel clips to enhance immersion and reach.
- Game development
- Generates dynamic sound effects for scene transitions and CG promos, matching real-time ambient sound to forests, cities, battlefields, and other scenes and reducing repetitive work for sound designers.
- Advertising and marketing
- Automatically adds operational sound effects to product demo videos and supports rapid iteration of multiple audio-track versions, improving ad-testing efficiency and creative flexibility.
- Education and training
- Adds prompts and background sound to instructional videos and demonstrations, enriching the auditory experience of multimedia courseware and improving learner focus.
How do I use PrismAudio?
- Input Requirements
- The input video needs to contain clear visual events (e.g., actions, object movement) for the model to recognize and generate corresponding sound effects.
- Parameter adjustment
- Users can adjust parameters such as sound style (e.g., natural, sci-fi, horror), sound intensity, and stereo effect as needed.
- Output format
- Supports common audio formats (e.g., WAV, MP3) for direct use in video editing software or game engines.
- Efficient training algorithm (Fast-GRPO)
- Training efficiency is optimized with the Fast-GRPO algorithm, which reduces the cost of random sampling and adapts quickly to different scenarios.
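Since the output is standard audio, saving a generated stereo clip for an editor or game engine is straightforward. This sketch writes two float channels to a 16-bit WAV using only Python's standard library; the sine tone is placeholder data, not model output:

```python
import math
import struct
import wave

def write_stereo_wav(path, left, right, rate=44100):
    """Write two float sequences (-1.0 .. 1.0) as a 16-bit stereo WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(2)
        wf.setsampwidth(2)       # 16-bit samples
        wf.setframerate(rate)
        frames = bytearray()
        for l, r in zip(left, right):
            frames += struct.pack("<hh", int(l * 32767), int(r * 32767))
        wf.writeframes(frames)

# 0.1 s of a 440 Hz tone, panned toward the right channel
n = 4410
tone = [0.5 * math.sin(2 * math.pi * 440 * i / 44100) for i in range(n)]
write_stereo_wav("demo.wav", [s * 0.4 for s in tone], [s * 0.9 for s in tone])
```

The resulting file can be dropped directly onto a timeline in any video editor that accepts WAV input.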
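Fast-GRPO's specifics are not yet published, but GRPO-style training in general scores a group of sampled outputs and normalizes each reward against its own group rather than a learned value function. The sketch below shows only that generic group-relative advantage step, as an assumption about the family of algorithms, not PrismAudio's exact method:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each reward within its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a degenerate group
    return [(r - mean) / std for r in rewards]

# Rewards for three sampled audio clips from the same video prompt:
adv = group_relative_advantages([0.2, 0.5, 0.8])
```

Because advantages are computed relative to the group mean, the advantages in each group sum to zero, and only the within-group ranking of samples drives the policy update.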
PrismAudio's project address
- Project website: https://prismaudio-project.github.io/
- GitHub repository: https://github.com/FunAudioLLM/ThinkSound/tree/prismaudio
- HuggingFace model: https://huggingface.co/FunAudioLLM/PrismAudio
- arXiv technical paper: https://arxiv.org/pdf/2511.18833
- Online demo: https://huggingface.co/spaces/FunAudioLLM/PrismAudio
Recommended Reasons
- Technological breakthrough
- The first framework to deeply combine chain-of-thought reasoning with reinforcement learning, solving the audio-visual inconsistency and inefficiency of traditional models and representing the latest progress in video-to-audio generation.
- Superior performance
- Outperforms the best existing models on authoritative benchmarks such as VGGSound and AudioCanvas, especially in complex scenes.
- Lightweight and real-time
- With only 518 million parameters, generation is fast and well suited to real-time applications (e.g., live streaming, gaming).
- Multi-scenario applicability
- Covering film and television, games, advertising, education, and other fields, it lowers the technical barrier to audio-visual content creation.
- Open source and community support
- The code will be open-sourced soon, allowing developers to build on the model and helping the technology reach a wider audience.
