Stanford Prodigy Founds AI Company; Cartesia Launches Speech Model Sonic-3 and Raises $100 Million Series B from NVIDIA and Others


On October 29, Karan Goel, founder and CEO of the US speech-generation startup Cartesia, announced the launch of a new voice model, Sonic-3, on the social media platform X. He also revealed that the company has completed a US$100 million (approximately RMB 712 million) Series B financing round, with NVIDIA among the participating investors.

Cartesia was founded in 2023 by five members of the Stanford AI Lab, including Chris Ré, who was the others' mentor, and Albert Gu and Brandon Yang, who are both of Chinese descent. Notably, Albert Gu, Cartesia's chief scientist and co-founder, is one of the authors of Mamba, and Brandon Yang previously worked on the Google Brain team.

▲Cartesia's founding team, from left to right, Brandon Yang, Karan Goel, Albert Gu and Arjun Desai (Source: Cartesia)

Previously, in December 2024, Cartesia raised a US$27 million (approximately RMB 192 million) seed round led by Index Ventures. Less than three months later, in March 2025, Cartesia announced the closing of a US$64 million (approximately RMB 456 million) Series A round.

According to the foreign media outlet AIM Media House, Cartesia offers speech generation and speech recognition models based on the SSM (State Space Model) architecture. Sonic-3 uses a non-Transformer architecture aimed at real-time dialogue and voice interaction applications.

Thousands of organizations, including the cloud computing platform ServiceNow and the AI customer service platforms Cresta and Decagon, currently use the Sonic model to handle millions of conversations per month.

We put Sonic-3 to the test right away. Wise Things asked Sonic-3 to tell a story in Chinese; it took only two seconds to generate and play the audio, but the fluency of its spoken Chinese leaves something to be desired. When asked to read a random documentary narration in English, by contrast, the output was smooth and natural, and it was almost impossible to tell that it was AI-generated.

I. A Stanford All-Star Lineup, Including a Mamba Author

Cartesia was founded to turn years of SSM research at the Stanford AI Lab into products.

The Cartesia co-founders met at Stanford. The team consists of two members of Chinese descent, two of Indian descent, and their shared mentor. While still at the university, they developed SSMs for training higher-quality and more efficient large models.

Albert Gu, Cartesia's chief scientist and co-founder, is one of the lead authors of Mamba. In contrast to the traditional Transformer model, Mamba's SSM approach achieves low-latency, high-precision sequence prediction. Albert Gu was also named to Time magazine's 2024 list of the most influential people in AI.

Dr. Karan Goel, Cartesia's CEO and co-founder, is a Stanford University graduate. He was awarded the Siebel Scholarship while pursuing his master's degree at Carnegie Mellon University, and has also received recognition from Emma Brunskill, associate professor of computer science at Stanford, Fei-Fei Li, director of the Stanford Institute for Human-Centered AI, and many other distinguished professors.

Over the past four years, the Cartesia team has been building the theory behind SSMs and extending it to a wide range of modalities, including text, audio, video, image, and time-series data, achieving state-of-the-art results. Drawing on their SSM research at Stanford, the founding team focused on the SSM architecture and speech models from the start.

Cartesia provides users with an enterprise-grade AI speech platform where they can use models to convert between speech and text (the text-to-speech model Sonic and the speech-to-text model Ink) and build voice agents.

II. 42 Languages, Custom Pronunciation, and Response Times Under 0.2 s

Cartesia is moving quickly: alongside its latest funding round, the company has introduced the new Sonic-3 model.

The Sonic-3 model has advantages in the number of supported languages, controllability, and speed. Users can choose from 42 languages and more than 500 voices for text-to-speech, a large increase from the 15 languages supported by Sonic-2.

▲Languages supported by Cartesia (Source: Cartesia)

The Sonic-3 voice library offers 10 Chinese voices in total, while the more numerous English voices are further divided into 11 regional accents.

▲English with 11 accents (Source: Cartesia)

In terms of controllability, the model goes beyond basic speech generation: volume, speech rate, and emotion can be finely controlled through API parameters and SSML tags. It can accurately render human expressiveness, including laughter, intonation, and subtle emotional shifts, and it supports custom pronunciation.
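
As a rough illustration of tag-based control, the sketch below builds a small SSML snippet using the `<prosody>` element from the W3C SSML specification. The article does not say which tags or attribute values Sonic-3 actually accepts, so the helper name `build_ssml` and the values used here are assumptions for illustration only, not Cartesia's API.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, rate: str = "medium", volume: str = "medium") -> str:
    """Wrap text in a generic SSML <prosody> element (hypothetical helper).

    The rate/volume keywords come from the W3C SSML spec; whether a given
    speech service honors them is an assumption, not something stated here.
    """
    body = escape(text)  # escape &, <, > so the markup stays well-formed
    return (
        "<speak>"
        f"<prosody rate={quoteattr(rate)} volume={quoteattr(volume)}>"
        f"{body}"
        "</prosody>"
        "</speak>"
    )


if __name__ == "__main__":
    # Slow, loud delivery of a short sentence.
    print(build_ssml("Hello & welcome", rate="slow", volume="loud"))
```

The resulting string would then be passed as the input text to whatever text-to-speech endpoint is in use, provided that endpoint accepts SSML.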

Sonic-3's model latency is only 90 milliseconds, and its total end-to-end response time is within 190 milliseconds. According to the foreign media outlet AIM Media House, this places the model among the world's fastest real-time voice AI systems.

Sonic-3 also supports voice cloning, along with fine-tuning that makes the cloned voice more faithful to the original reference voice. In addition, the new model can automatically buffer and continue generated speech, which makes real-time speech processing more efficient and natural.

▲ Voice cloning (Source: Cartesia)

Unlike most speech models, which rely on the Transformer architecture, Sonic-3 is based on the SSM architecture. Transformer-based models predict the next word by looking back over the entire preceding dialogue, which introduces latency and inefficiency in speech generation. SSMs (including S4 and Mamba), by contrast, work more like human thinking: they maintain a continuous understanding of the topic and the dialogue without having to review everything from the beginning, which allows Sonic-3 to generate speech that is both natural and fast.
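
The contrast can be sketched with a toy linear state space recurrence, h_t = A h_{t-1} + B x_t and y_t = C · h_t. This is a deliberately simplified illustration of the general SSM idea, not Sonic-3, S4, or Mamba themselves; the function names and the matrices in the demo are made up for the example.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def ssm_step(A: Matrix, B: Vector, C: Vector, h: Vector, x: float) -> Tuple[Vector, float]:
    """One recurrence step: h' = A h + B x, then y = C . h'."""
    Ah = matvec(A, h)
    new_h = [ah + b * x for ah, b in zip(Ah, B)]
    y = sum(c * hi for c, hi in zip(C, new_h))
    return new_h, y


def run_ssm(A: Matrix, B: Vector, C: Vector, xs: List[float]) -> List[float]:
    """Process a sequence while carrying only a fixed-size state h."""
    h = [0.0] * len(B)  # state size does not grow with sequence length
    ys = []
    for x in xs:
        h, y = ssm_step(A, B, C, h, x)  # no look-back over earlier inputs
        ys.append(y)
    return ys


if __name__ == "__main__":
    # Toy 2-dimensional state; the values are arbitrary illustrative choices.
    A = [[0.5, 0.0], [0.0, 0.25]]
    B = [1.0, 1.0]
    C = [1.0, 1.0]
    print(run_ssm(A, B, C, [1.0, 0.0, 0.0]))
```

Each step touches only the current input and the fixed-size state, so the cost per generated step stays constant however long the conversation runs, whereas an attention layer attends over every earlier position.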

Leveraging the Sonic model, Cartesia's platform helps organizations build voice agents with the ability to handle complex tasks, including customer support, scheduling, and even light-hearted pranks.

▲Creating a personalized Agent (Source: Cartesia)

Conclusion: Cartesia Sets Out to Revolutionize Real-Time Speech Models

The AI audio generation field has no shortage of strong competitors, such as MiniMax, Genspark, and ElevenLabs. With Cartesia's new funding round and the release of Sonic-3, competition among speech models has become even more intense.

According to Ravi Krishnamurthy, VP of Products at ServiceNow, “Cartesia's SSM architecture brings enterprise-grade speed and quality to our Voice Agents.”

In recent years, Cartesia has focused on the SSM architecture. As demand for real-time conversation grows rapidly, this technology may offer businesses and other users a more accurate and faster solution.

Source: AIM Media House
