
Gemini 3 DeepThink In-Depth Review: Another Breakthrough in AI Reasoning Capabilities

12/5/2025
Author: Lydia
Category: AI

This was an unexpected yet perfectly logical surprise.

Frankly, over the past year, we've become accustomed to seeing various large models "chasing" each other on the benchmark charts, often with only a few tenths of a percentage point difference. But this DeepThink release felt completely different. It wasn't simply adding to existing features; it was attempting to reconstruct the way models "think."

More Than Just "Slow Thinking"

If earlier chain-of-thought techniques taught models to "think step by step," Gemini 3 DeepThink adds parallel reasoning.

This is a crucial difference. With earlier reasoning models (such as the early o1 series), the analysis often felt linear: the model dug deeper along one line, like a student stubbornly sticking to a single logical path. This works for some problems, but if the initial direction is wrong, the model strays further and further down the wrong path and eventually ends in hallucinations or logical collapse.

Google's "parallel reasoning" technology gives the model the ability to "split itself." When faced with a complex mathematical problem or logical trap, DeepThink no longer stakes everything on one line of attack but constructs multiple hypothetical paths at the same time. You can imagine it as a Go master simultaneously considering a dozen possibilities for the next five moves. It assesses the feasibility of each path, quickly discards the dead ends, and then concentrates computing power on the most promising path for further exploration.

The direct result of this mechanism is a very high rate of self-correction. In my initial tests, DeepThink rarely fell for logic problems built on a deliberately false premise, because with several paths running in parallel it is highly likely that at least one of them catches the flawed premise.
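To make the "explore several paths, prune, then concentrate" idea concrete, here is a minimal toy sketch. Google has not published how DeepThink implements parallel reasoning, so this is only a beam-search-style stand-in: the names `explore_paths`, `expand`, `score`, and the digit-sum puzzle are all assumptions invented for illustration.

```python
import heapq
from typing import Callable, List, Tuple

Path = Tuple[int, ...]


def explore_paths(
    start: Path,
    expand: Callable[[Path], List[Path]],
    score: Callable[[Path], float],
    beam_width: int = 3,
    depth: int = 3,
) -> Path:
    """Keep several candidate paths alive and prune the weak ones each round."""
    frontier = [start]
    for _ in range(depth):
        # Extend every surviving path by one step.
        candidates = [nxt for path in frontier for nxt in expand(path)]
        if not candidates:
            break
        # Discard unpromising paths; the next round only extends the best few.
        frontier = heapq.nlargest(beam_width, candidates, key=score)
    return max(frontier, key=score)


# Hypothetical toy puzzle: choose three digits whose sum is as close to 23 as possible.
TARGET = 23


def expand(path: Path) -> List[Path]:
    return [path + (digit,) for digit in range(10)]


def score(path: Path) -> float:
    # Higher is better; 0 means the sum hits the target exactly.
    return -abs(TARGET - sum(path))


best = explore_paths((), expand, score, beam_width=3, depth=3)
print(best, sum(best))
```

A single greedy path corresponds to `beam_width=1`; widening the beam is what lets a search recover when the first choice turns out to be a poor one.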

The Truth Behind the Data: When GPT-5 Meets a Formidable Rival

Talking about technical principles alone is somewhat abstract; let's look at the benchmark data. Google's results this time are very strong, especially compared with top contenders like Claude Sonnet 4.5, GPT-5 Pro, and even GPT-5.1.

The following image shows three highly representative benchmark comparisons with a wealth of data; I recommend you take a close look:

Detailed Evaluation Data Comparison

Google's official evaluation data covers three key dimensions: reasoning and knowledge, scientific knowledge, and visual reasoning ability. Below is the complete data comparison:

1. Humanity's Last Exam (Reasoning and Knowledge Test)

Test Conditions: Tools off (No external tools used)

This test, known as "Humanity's Last Exam," aims to evaluate a model's autonomous reasoning ability on open-ended questions. DeepThink leads with a score of 41.0%, 3.5 percentage points higher than second-place Gemini 3 Pro and a full 10.3 percentage points higher than GPT-5 Pro.

Humanity's Last Exam benchmark results comparison chart, showing Gemini 3 DeepThink leading with a score of 41.0%, followed by Gemini 3 Pro (37.5%), GPT-5 Pro (30.7%), GPT-5.1 (26.5%), Gemini 2.5 Pro (21.6%), and Claude Sonnet 4.5 (13.7%). DeepThink leads its strongest non-Google competitor by more than 10 percentage points.

2. GPQA Diamond (Science Knowledge Test)

Test Conditions: Tools off (No external tools used)

GPQA Diamond is a graduate-level science knowledge test covering multiple disciplines such as physics, chemistry, and biology. DeepThink tops the list with an accuracy rate of 93.8%. In this high score range, each additional percentage point meaningfully reduces expert-level hallucinations.

GPQA Diamond Science Knowledge Test Results Comparison Chart: Gemini 3 DeepThink ranks first with 93.8%, followed by Gemini 3 Pro (91.9%), GPT-5 Pro (88.4%), GPT-5.1 (88.1%), Gemini 2.5 Pro (86.4%), and Claude Sonnet 4.5 (83.4%). All models scored above 80%.

3. ARC-AGI-2 (Visual Reasoning Puzzle)

Test Conditions: Tools on (Code execution allowed)

ARC-AGI-2 Visual Reasoning Test Results Comparison Chart: Gemini 3 DeepThink achieves 45.1% in Tools on mode and 31.1% in Tools off mode

ARC-AGI-2 is considered the closest benchmark to testing "general fluid intelligence," requiring models to understand abstract visual patterns never seen before. DeepThink's 45.1% score is a milestone breakthrough: about 2.6 times that of GPT-5.1 (17.6%), and an astonishing 820% improvement over its predecessor, Gemini 2.5 Pro (4.9%). The image below is an official comparison chart, clearly showing the performance of the three tests:

Google's official performance comparison of Gemini 3 DeepThink in three key benchmark tests: 1. In Humanity's Last Exam (Tools off), Gemini 3 DeepThink leads by a wide margin with a score of 41.0%, surpassing GPT-5 Pro (30.7%) and Claude Sonnet 4.5 (13.7%). 2. In the GPQA Diamond (Tools off) scientific knowledge test, Gemini 3 DeepThink ranks first with 93.8%. 3. In the ARC-AGI-2 visual reasoning test, Gemini 3 DeepThink demonstrates a remarkable advantage, reaching 45.1% in Tools on mode, far exceeding GPT-5.1's 17.6% and Claude Sonnet 4.5's 13.6%.

Let's focus on some of the most disruptive data points.

First is ARC-AGI-2. This isn't just a score; it's a watershed moment.

Those familiar with AI evaluations know that ARC (Abstraction and Reasoning Corpus) has always been a nightmare for LLMs (large language models). It doesn't test rote memorization; it tests pure visual logic and the ability to learn on the spot, which is commonly called "general fluid intelligence." For a long time, even the strongest models managed only single-digit scores, or little better, on this leaderboard.

Looking at the data in the chart, Gemini 2.5 Pro scores only 4.9%, which essentially means it could not make sense of the puzzles at all. Claude Sonnet 4.5 and GPT-5 Pro improved to the 13-15% range; that is progress, but it still feels like guessing a few answers right.

However, Gemini 3 DeepThink pushes this number to 45.1% (Tools on, meaning code execution is allowed). Even without tools it achieves 31.1%, roughly 1.8 times GPT-5.1's 17.6%. This suggests the model is no longer just predicting text: it grasps abstract patterns and can translate that understanding into code to verify its conjectures. This leap in capability is mainly attributed to "parallel reasoning," that is, trying multiple candidate transformation rules on an unfamiliar puzzle until one fits.
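For readers who want to check the multiples themselves, the short sketch below recomputes them from the ARC-AGI-2 scores quoted in this article. The variable and function names are my own; only the four scores come from the published chart.

```python
# ARC-AGI-2 scores quoted in this article (percent of puzzles solved).
deepthink_tools_on = 45.1
deepthink_tools_off = 31.1
gpt_5_1 = 17.6
gemini_2_5_pro = 4.9


def ratio(new: float, old: float) -> float:
    """How many times larger `new` is than `old`."""
    return new / old


def relative_gain_pct(new: float, old: float) -> float:
    """Percentage improvement of `new` over `old`."""
    return (new - old) / old * 100


print(f"Tools on vs GPT-5.1:    {ratio(deepthink_tools_on, gpt_5_1):.2f}x")    # 2.56x
print(f"Tools off vs GPT-5.1:   {ratio(deepthink_tools_off, gpt_5_1):.2f}x")   # 1.77x
print(f"Gain vs Gemini 2.5 Pro: {relative_gain_pct(deepthink_tools_on, gemini_2_5_pro):.0f}%")  # 820%
```

Note the difference between "times as large" and "percent improvement": 45.1% is about 9.2 times 4.9%, which corresponds to an 820% gain.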

Now let's look at Humanity's Last Exam. While the name might sound a bit dramatic, it's widely recognized as one of the most challenging integrated reasoning tests, designed to test a model's autonomous exploration capabilities without specific instructions.

DeepThink achieved 41.0% here. In comparison, Claude Sonnet 4.5 reached only 13.7%, and even the powerful GPT-5 Pro reached only 30.7%. A gap of about 10 percentage points is enormous at this level of competition. It suggests that when problems become extremely complex, requiring interdisciplinary knowledge and multi-step logical leaps, DeepThink's "multi-path thinking" mode is markedly more stable.

As for GPQA Diamond (scientific knowledge), the gap is smaller (DeepThink 93.8% vs. GPT-5 Pro 88.4%), but it is still a significant improvement. Above 90%, each additional point removes a large share of the remaining expert-level hallucinations. For users who rely on AI-assisted scientific research, this 5.4-point accuracy gain could be the difference between "experimental success" and "three days of error checking."
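One way to see why a few points matter so much near the top of the scale is to look at error rates instead of accuracy. The sketch below uses only the two GPQA Diamond scores quoted above; treating "100 minus accuracy" as the error rate is my own simplification, not something from Google's report.

```python
def error_rate(accuracy_pct: float) -> float:
    """Share of questions answered incorrectly, in percent."""
    return 100.0 - accuracy_pct


deepthink_err = error_rate(93.8)  # about 6.2% of questions wrong
gpt5_pro_err = error_rate(88.4)   # about 11.6% of questions wrong

# Relative reduction in wrong answers when moving from GPT-5 Pro to DeepThink.
relative_reduction = (gpt5_pro_err - deepthink_err) / gpt5_pro_err * 100

print(f"Error rates: {gpt5_pro_err:.1f}% -> {deepthink_err:.1f}%")  # 11.6% -> 6.2%
print(f"Relative reduction in errors: {relative_reduction:.1f}%")   # 46.6%
```

Seen this way, a 5.4-point accuracy gain removes nearly half of the wrong answers on this benchmark.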

From Math Olympiad Gold Medals to Everyday Productivity

If it were just about chart-topping performance, DeepThink wouldn't excite me so much. What truly makes me feel its "future is promising" is its underlying technological foundation.

The official blog (https://blog.google/products/gemini/gemini-3-deep-think/) reveals that Gemini 3 DeepThink's technical base evolved from the Gemini 2.5 variant that achieved gold-medal-level performance at the International Mathematical Olympiad (IMO) and the International Collegiate Programming Contest (ICPC).

What does this mean? It means that this system was originally designed for "solving difficult problems," not for "chatting."

How will this capability translate into real-world applications?

If you are a programmer and throw a piece of code full of subtle bugs at DeepThink, it won't give you a vague fix the way a typical model might. It might pursue three different debugging hypotheses in parallel in the background (i.e., during its thought process):

  1. Suspect a memory leak;

  2. Suspect a race condition;

  3. Suspect a third-party library version incompatibility.

It will attempt to verify all three hypotheses and finally tell you: "Although it looks like a memory issue, verification shows this is actually a rare concurrent deadlock, and we recommend modifying it this way..." This is the overwhelming advantage that parallel reasoning brings on the engineering side.
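The debugging scenario above can be mimicked in ordinary code, which may help clarify what "verifying three hypotheses in parallel" means in practice. This is only an analogy and not how DeepThink works internally: the diagnostics in `OBSERVATIONS`, the threshold, and every function name below are hypothetical values made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Verdict:
    hypothesis: str
    confirmed: bool
    evidence: str


# Made-up diagnostics from an imaginary failing service (illustration only).
OBSERVATIONS: Dict[str, Any] = {
    "heap_growth_mb_per_hour": 0.2,
    "lock_wait_cycle_detected": True,
    "dependency_versions_match_lockfile": True,
}


def check_memory_leak(obs: Dict[str, Any]) -> Verdict:
    growth = obs["heap_growth_mb_per_hour"]
    # The 50 MB/h threshold is an arbitrary assumption.
    return Verdict("memory leak", growth > 50, f"heap grows {growth} MB/h")


def check_concurrency_bug(obs: Dict[str, Any]) -> Verdict:
    cycle = obs["lock_wait_cycle_detected"]
    return Verdict("concurrency bug (race or deadlock)", bool(cycle),
                   f"lock-wait cycle detected: {cycle}")


def check_library_mismatch(obs: Dict[str, Any]) -> Verdict:
    versions_ok = obs["dependency_versions_match_lockfile"]
    return Verdict("third-party library version mismatch", not versions_ok,
                   f"versions match lockfile: {versions_ok}")


CHECKS: List[Callable[[Dict[str, Any]], Verdict]] = [
    check_memory_leak,
    check_concurrency_bug,
    check_library_mismatch,
]


def diagnose(obs: Dict[str, Any], checks: List[Callable[[Dict[str, Any]], Verdict]]) -> List[Verdict]:
    """Run every hypothesis check concurrently and collect the verdicts in order."""
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = [pool.submit(check, obs) for check in checks]
        return [future.result() for future in futures]


verdicts = diagnose(OBSERVATIONS, CHECKS)
confirmed = [v.hypothesis for v in verdicts if v.confirmed]
for v in verdicts:
    status = "confirmed" if v.confirmed else "ruled out"
    print(f"{v.hypothesis}: {status} ({v.evidence})")
```

The point of the structure is that no hypothesis is abandoned on a hunch: each one is checked against the evidence, and only the survivors are reported.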

For researchers, when conducting literature reviews or generating hypotheses, DeepThink can simultaneously interpret the same set of experimental data from different theoretical frameworks, providing more multi-dimensional perspectives, rather than just offering the most common explanation.

The Cost and Future of Thinking

Of course, DeepThink isn't without its barriers. As a feature only available to Ultra subscribers, its inference costs are clearly high. In actual use, the response speed is indeed slower in "Deep Think" mode compared to normal mode. The flashing thinking status icon on the screen constantly reminds you that a significant amount of computing power is being consumed in the background.

But this wait is worthwhile. We are witnessing a full evolution of AI from "System 1" (fast thinking, intuitive reaction) to "System 2" (slow thinking, logical reasoning).

The emergence of Gemini 3 DeepThink, to some extent, breaks the GPT series' monopoly on high-end reasoning. The 45.1% ARC-AGI-2 score is a signal that, on the road to AGI, innovation in algorithmic architecture still has enormous potential alongside scaling compute and data. This parallel reasoning mechanism, which mimics human "deliberate thinking," is one example.

For the average AI user, the choices have become more interesting: if you simply need to write an email or polish an article, a standard Pro model is more than sufficient; but if you've encountered a truly challenging problem and need an assistant that can work alongside you, even thinking more comprehensively than you, Gemini 3 DeepThink is undoubtedly one of the most worthwhile tools to try.

References

https://blog.google/products/gemini/gemini-3-deep-think/

https://x.com/GeminiApp/status/1996656314983109003
