Google strikes again! Gemini 2.5 Deep Think model crushes OpenAI o3 and Grok 4!
On the evening of August 1, Google announced that it is rolling out the Deep Think feature to Google AI Ultra subscribers. The Gemini 2.5 Deep Think model won gold at this year's International Mathematical Olympiad (IMO).
Google Debuts Its Gold-Medal IMO Model
Google says Deep Think is its most advanced AI reasoning model, capable of answering questions by exploring and weighing multiple ideas at once, then using that output to choose the best answer.
As of yesterday, Google's $250 per month Ultra subscribers will be able to access Gemini 2.5 Deep Think in the Gemini app.
Gemini 2.5 Deep Think, which debuted at the Google I/O developer conference in May 2025, is Google's first publicly available multi-agent model. These systems spin up multiple AI agents to tackle a problem in parallel, a process that consumes more compute than a single agent but often produces better answers.
In addition to Gemini 2.5 Deep Think, Google said it will release the model variant it used in the International Mathematical Olympiad (IMO) to a select group of mathematicians and academics.
Google said the AI model "reasoned for hours" rather than seconds or minutes like most consumer-facing AI models. The company hopes the IMO model will enhance research efforts and aims to gather feedback on how to improve multi-agent systems for academic use cases.
Google noted that the Gemini 2.5 Deep Think model is a significant improvement over the one it unveiled at the I/O conference. The company also claims to have developed "novel reinforcement learning techniques" to encourage Gemini 2.5 Deep Think to better utilize its inference paths.
In a blog post shared with TechCrunch, Google said, "Deep Think helps people solve problems that require creativity, strategic planning, and incremental improvement."
How Deep Think Works: Extending Gemini's Parallel "Thinking Time"
Just as people take the time to solve complex problems by exploring different perspectives, weighing potential solutions, and ultimately refining their answers, Deep Think pushes the boundaries of thinking ability by utilizing parallel thinking techniques. This approach allows Gemini to generate multiple ideas and think about them simultaneously, even modifying or integrating different ideas over time to arrive at the best possible answer.
In addition, by extending reasoning time, or "thinking time," the DeepMind R&D team gives Gemini more time to explore different hypotheses and find creative solutions to complex problems.
In addition, Google has developed novel reinforcement learning techniques that encourage models to utilize these extended inference paths, thus making Deep Think a better, more intuitive problem solver over time.
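The parallel-thinking pattern described above can be sketched in code. Google has not published implementation details, so the sketch below is purely illustrative: `call_model` and `score_answer` are hypothetical placeholders standing in for a real LLM sampling call and a real verifier or reranking pass; only the overall shape (sample several candidates concurrently, then select the best) reflects the description in the article.

```python
# Illustrative sketch of "parallel thinking": sample several candidate
# answers concurrently, then pick the best one with a scoring pass.
# call_model and score_answer are placeholders, not real Gemini APIs.
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str, seed: int) -> str:
    # Placeholder for a single model sample; a real system would call an
    # LLM with temperature > 0 so each seed explores a different
    # reasoning path.
    return f"candidate answer {seed} for: {prompt}"

def score_answer(answer: str) -> float:
    # Placeholder verifier/reranker; real systems might use a reward
    # model or a self-critique pass here.
    return float(len(answer))

def parallel_think(prompt: str, n: int = 4) -> str:
    # Generate n candidates in parallel, then return the highest-scoring one.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda s: call_model(prompt, s), range(n)))
    return max(candidates, key=score_answer)

print(parallel_think("Prove the triangle inequality.", n=3))
```

A real system would also let candidates be merged or revised across rounds, as the article notes, rather than simply selecting one in a single pass.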
How did Deep Think perform?
Deep Think can help people solve problems that require creativity, strategic planning, and incremental improvement, for example:
- Iterative development and design: We were impressed with how well Deep Think handled tasks that required building complex content piece by piece. For example, the technical team observed that Deep Think could simultaneously improve the aesthetics and functionality of web development tasks.

Deep Think in the Gemini app uses parallel thinking techniques to provide more detailed, creative and thoughtful responses.
- Scientific and mathematical discovery: Because Deep Think is capable of reasoning about highly complex problems, it can be a powerful tool for researchers. It can help construct and explore mathematical conjectures or reason about complex scientific literature, potentially accelerating the process of discovery.
- Algorithm development and code: Deep Think is particularly adept at solving tricky coding problems where problem formulation and careful consideration of tradeoffs and time complexity are critical.
Deep Think also excels in challenging benchmark tests that measure coding, scientific, knowledge and reasoning skills.
For example, among models that do not use tools, Gemini 2.5 Deep Think achieved the best performance on both LiveCodeBench V6, which measures competitive-programming ability, and Humanity's Last Exam (HLE), a challenging test designed to measure AI's ability to answer thousands of crowdsourced questions in math, the humanities, and science.

Google claims that its model scored 34.8% on HLE (without tools), while xAI's Grok 4 scored 25.4% and OpenAI's o3 scored 20.3%.
Google also said that Gemini 2.5 Deep Think outperformed AI models from OpenAI, xAI, and Anthropic on LiveCodeBench V6. Google's model scored 87.6%, Grok 4 scored 79%, and OpenAI's o3 scored 72%.
What do netizens think?
Google's newly released Gemini 2.5 Deep Think model has sparked heated discussions on social media and tech forums, especially on platforms such as Hacker News, Reddit and X (formerly Twitter). Many netizens were the first to test it and share their experiences and opinions.
On X, some users who have tried Gemini 2.5 Deep Think say that its context window is shorter than Gemini 2.5 Pro's.


Some users think the new model is excellent and are considering buying an Ultra subscription.

Other users think some of the model's benchmark results are shockingly good, and are surprised Google hasn't publicized them more loudly.

However, some netizens were not convinced, arguing that its performance is not especially competitive with the other top models. One user wrote:
"I started some experiments with this new Deep Think agent, but reached my daily usage limit after five prompts. The price of $250 per month is unacceptable. Compared to o3-pro and Grok 4 Heavy, it's just not competitive. In terms of performance, so far I haven't been able to see any significant advantage. I gave it a tough organizational problem my company was facing, along with background information, and it did present a clear, well-thought-out solution that was consistent with what we were discussing internally. It is worth noting, however, that o3 came to an equally valid conclusion at a much lower cost, although its report was slightly less 'comprehensive.' It looks like I'll have to wait until tomorrow to learn more about the actual performance of this agent."
There are also users who believe Google's new model shouldn't be expected to be perfect, since even the best models sometimes stumble, and who note that "type in a question, get code back" is not new: large models could do it before, just not this well. One wrote:
"They perform very poorly on data that was seen but weighted lightly in the training set. Even the best models, such as Opus 4, which performs well, and Qwen and K2, which surprise from time to time, stumble in less conspicuous ways.
The most obvious example is probably build systems: you can tell at a glance which models have 'seen' a lot of nixpkgs data. And even the best models seem to struggle with Bazel, and sometimes even CMake.
These top agents burn over a hundred dollars a day, and I think they're a significant improvement over pre-SEO-era Google or Stack Overflow... but they're not 'way ahead' of a really good search index. At one time, the first page of Google Search showed source code, documentation, and troubleshooting information for almost every programming topic. The experience was like typing a question into that magic search box and immediately getting a piece of code that worked. During the golden age of FAANG, there was also a superb internal grep tool that had a similar effect.
I feel like there's a generation or two of people who will think 'typing in a question to generate code' is a novelty. But really, it's not new at all; we just haven't had it for the last five to ten years."