SAM Audio

Meta introduces the world's first unified multimodal audio separation model that supports text, visual, and time cues to accurately separate target sounds from complex audio and video.

Language: en
Collection time: 2026-01-08
What is SAM Audio?

SAM Audio is a unified multimodal audio separation model from Meta, described as the world's first of its kind. It parses complex audio scenes and extracts sounds interactively by fusing text, visual, and temporal cues. The core goal is to let users isolate a specific target sound from mixed audio or video as if they were “listening with their eyes”: for example, by clicking a musical instrument on screen, typing a text description of the sound source, or marking a time segment.

SAM Audio's Key Features

  1. Multimodal cue support:
    • Text cue: The user specifies the target sound with a natural language description (e.g. “dog barking”, “human voice singing”), and the system extracts the corresponding sound source.
    • Visual cue: Clicking on a sound-producing object in the video frame (e.g., a speaker, a hand on a drum) separates its audio.
    • Time span cue: The user marks a time interval in which the target sound appears (e.g., “3:12 to 3:18”), and the model processes similar sounds across the entire recording.
  2. High-precision audio separation:
    • Extracts target sounds from complex audio environments and also produces the remaining (residual) track.
    • Reported to outperform existing approaches on general-purpose audio separation tasks, particularly in specialized areas such as instrument separation and speaker separation.
  3. Flexible application scenarios:
    • Supports scenarios such as audio cleanup, background noise removal, music production, sound processing, and accessibility technology.
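
To make the three cue types above concrete, the sketch below models a separation request as a plain data structure. It is illustrative only: `SeparationRequest`, `parse_timestamp`, and every field name are hypothetical and are not taken from the SAM Audio API; the project repository documents the actual interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


def parse_timestamp(text: str) -> float:
    """Convert a "M:SS" or "H:MM:SS" string into seconds."""
    seconds = 0.0
    for part in text.strip().split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


@dataclass
class SeparationRequest:
    """Hypothetical container for cues; any combination may be set."""
    text: Optional[str] = None                   # text cue, e.g. "dog barking"
    click_xy: Optional[Tuple[int, int]] = None   # visual cue: pixel clicked in a frame
    spans: List[Tuple[float, float]] = field(default_factory=list)  # time cues in seconds

    def add_span(self, start: str, end: str) -> None:
        """Record a time span cue given as timestamp strings."""
        s, e = parse_timestamp(start), parse_timestamp(end)
        if e <= s:
            raise ValueError("span end must be after span start")
        self.spans.append((s, e))

    def cue_types(self) -> List[str]:
        """List which kinds of cue this request carries."""
        kinds = []
        if self.text:
            kinds.append("text")
        if self.click_xy is not None:
            kinds.append("visual")
        if self.spans:
            kinds.append("span")
        return kinds


# Combine a text cue with the "3:12 to 3:18" time span from the example above.
request = SeparationRequest(text="dog barking")
request.add_span("3:12", "3:18")
print(request.cue_types(), request.spans)
```

Keeping every field optional mirrors the claim that cues can be used individually or in combination; how the real model weighs combined cues is not described here.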

SAM Audio's Core Technology

  1. Perception Encoder Audiovisual (PE-AV) engine:
    • Extended from Meta's open-source Perception Encoder computer vision model, PE-AV integrates visual understanding with audio signals.
    • Aligns video frames with audio at precise points in time, providing semantically rich feature representations for cross-modal sound localization and separation.
  2. Generative modeling framework:
    • A generative framework built on a flow-matching diffusion transformer works together with a DAC-VAE encoder, which compresses the audio into a compact representation while preserving sound quality.
    • The training data covers speech, music, and general sound events. An automated audio mixing pipeline and multimodal cue generation are used to make the model robust in real-world environments.
  3. Time span encoding:
    • A span encoding scheme converts time information into a text-like sequence, with each time point labeled as “active” or “silent”.
    • This lets the model interpret user-specified time information accurately and gives control at the frame level.
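
The time span encoding described above can be pictured with a minimal sketch that turns user-marked intervals into one “active” or “silent” label per frame. This is an assumption-laden illustration, not the model's actual encoder: the function name `encode_spans`, the frame rate, and the rounding rules are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def encode_spans(
    spans: Sequence[Tuple[float, float]],
    duration: float,
    frame_rate: float = 25.0,  # assumed value; the real frame rate is not stated
) -> List[str]:
    """Label every frame as "active" if it overlaps a marked span, else "silent"."""
    n_frames = int(round(duration * frame_rate))
    labels = ["silent"] * n_frames
    for start, end in spans:
        # Clamp to the recording so out-of-range spans cannot index past the ends.
        first = max(0, int(start * frame_rate))
        last = min(n_frames, int(math.ceil(end * frame_rate)))
        for i in range(first, last):
            labels[i] = "active"
    return labels


# A 4-second clip at 2 frames per second, with the target marked from 1.0 s to 2.5 s.
labels = encode_spans([(1.0, 2.5)], duration=4.0, frame_rate=2.0)
print(labels)
# ['silent', 'silent', 'active', 'active', 'active', 'silent', 'silent', 'silent']
```

A label sequence like this has the same shape as a token sequence, which is one plausible reading of the “text-like” description of the encoding.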

Application Scenarios for SAM Audio

  1. Audio cleanup and background noise removal:
    • Podcast creators can remove background sounds such as barking dogs and street noise from their recordings to improve clarity.
    • Text descriptions and time stamps make this kind of editing quick to carry out.
  2. Creative media production:
    • Music producers and video creators can use SAM Audio for creative audio processing.
    • Specific instrument tracks or vocals can be extracted from a song for remixing or adding effects.
    • With a visual cue, clicking on the guitarist in a video extracts the guitar sound.
  3. Accessibility technology:
    • Collaboration with hearing aid manufacturers can help people with hearing loss better understand audio content through audio separation technology.
    • In noisy environments, hearing aids could automatically separate out human voices, letting people with hearing loss follow conversations more clearly.
  4. Video editing:
    • In video production, SAM Audio separates the sound of specific objects.
    • Editors can click on a specific person or object in the video to extract its sound, keeping audio and picture accurately matched.
    • For example, extracting the speaker's voice while removing other noise makes the video content clearer.

SAM Audio's Project Links

  • Project website: https://ai.meta.com/samaudio/
  • GitHub repository: https://github.com/facebookresearch/sam-audio

Reasons to Recommend

  1. Technical innovation:
    • SAM Audio is presented as the world's first unified multimodal audio separation model. It brings the ways people naturally interact with sound (looking, describing, pointing, and selecting) into a single AI system.
    • Its Perception Encoder Audiovisual (PE-AV) engine enables cross-modal sound localization and separation.
  2. Powerful and flexible:
    • Supports three types of cues (text, visual, and time span), which can be used individually or in combination to suit different separation scenarios.
    • Its high-precision separation is reported to perform strongly in specialized areas such as instrument separation and speaker separation.
  3. Wide range of application scenarios:
    • Suitable for audio cleanup, background noise removal, music production, sound processing, and accessibility technology.
    • Offers practical audio processing tools for music creators, podcast editors, film and TV producers, and researchers.
  4. Open source and community support:
    • Alongside the model, Meta has open-sourced two key tools, SAM Audio-Bench and SAM Audio Judge, giving the industry a shared evaluation benchmark and an automated assessment model.
    • Because the project is open source, developers can build their own audio-visual applications on top of it.
