Gemini 3.0
With each new frontier model release, the question is always the same: is it better than what we currently have? And how do we quantify "better"? To answer that, labs rely on a battery of benchmarks that map to different cognitive abilities, or branches of human intelligence.
Below you will find each benchmark shown in the model presentation and what it measures; a toy example of aggregating these scores follows the table.
Frontier LLM Benchmark Comparison
| Benchmark | What is evaluated & How? | Human Cognitive Analog | Official Link / Source | Gemini 3 Pro | Claude Sonnet 4.5 | GPT-5.1 |
|---|---|---|---|---|---|---|
| Humanity’s Last Exam | Academic Reasoning: Hard, “un-googleable” abstract reasoning across varied subjects. (Score shown is No Tools / With Tools) | General Intelligence (g) | Humanity's Last Exam | 37.5% (45.8% with tools) | 13.7% | 26.5% |
| ARC-AGI-2 | Visual Reasoning Puzzles: Solving novel grid pattern puzzles never seen during training. | Fluid Intelligence | ARC Prize | 31.1% | 13.6% | 17.6% |
| GPQA Diamond | Scientific Knowledge: Graduate-level QA in biology, physics, and chemistry. | Expertise (PhD) | GPQA | 91.9% | 83.4% | 88.1% |
| AIME 2025 | Mathematics: Challenging math competition word problems. (Score shown is No Tools / With Code) | Mathematical Logic | AIME | 95.0% (100% with code) | 87.0% (100% with code) | 94.0% |
| MathArena Apex | Contest Math: Extremely difficult math contest problems (Olympiad level). | Math Creativity | MathArena | 23.4% | 1.6% | 1.0% |
| MMMU-Pro | Multimodal Reasoning: Expert-level questions requiring analysis of images, diagrams, and text. | Visual-Verbal Integration | MMMU Benchmark | 81.0% | 68.0% | 76.0% |
| ScreenSpot-Pro | Screen Understanding: Identifying UI elements (buttons, menus) on computer screens. | Interface Understanding | ScreenSpot-Pro | 72.7% | 36.2% | 3.5% |
| CharXiv Reasoning | Chart Reasoning: Interpreting complex scientific charts and plots from research papers. | Data Literacy | CharXiv | 81.4% | 68.5% | 69.5% |
| OmniDocBench 1.5 | OCR: Optical character recognition and document parsing. (Error metric; lower is better) | Visual Text Recognition | arXiv | 0.115 | 0.145 | 0.147 |
| Video-MMMU | Video Understanding: Acquiring knowledge from videos; understanding events and temporal flow. | Temporal Perception | MMMU Benchmark | 87.6% | 77.8% | 80.4% |
| LiveCodeBench Pro | Competitive Coding: Coding problems from contests (Elo Rating). | Algorithmic Logic | LiveCodeBench | 2,439 | 1,418 | 2,243 |
| Terminal-Bench 2.0 | Terminal Coding: Using a Linux command line to manage files/scripts. | Computer Operation | TBench | 54.2% | 42.8% | 47.6% |
| SWE-Bench Verified | Software Engineering: Fixing real GitHub issues/bugs. | Professional Engineering | SWE-Bench | 76.2% | 77.2% | 76.3% |
| τ2-bench (Tau-2) | Agentic Tool Use: Using external tools (search, calendar) to fulfill requests. | Tool Manipulation | Tau-Bench | 85.4% | 84.7% | 80.2% |
| Vending-Bench 2 | Long-Horizon Tasks: Managing complex, stateful systems over time (Net Worth generated). | Executive Function | Andon Labs | $5,478.16 | $3,838.74 | $1,473.43 |
| FACTS Suite | Grounding: Verifying if answers are supported by source text (Factuality). | Fact Verification | DeepMind Research | 70.5% | 50.4% | 50.8% |
| SimpleQA Verified | Parametric Knowledge: Answering factual questions without hallucination. | Semantic Memory | arXiv | 72.1% | 29.3% | 34.9% |
| MMMLU | Multilingual Q&A: MMLU-style questions across different languages. | Linguistic Intelligence | Hugging Face | 91.8% | 89.1% | 91.0% |
| Global PIQA | Common Sense: Physical common sense across 100+ cultures. | Cultural Awareness | arXiv | 93.4% | 90.1% | 90.9% |
| MRCR v2 (8-needle) | Long Context: Finding specific details in massive amounts of text. | Working Memory | Hugging Face | 77.0% | 47.1% | 61.6% |
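To make "better" a little more concrete, here is a minimal sketch of one way you could aggregate a subset of these scores yourself: min-max normalize each benchmark across the three models (inverting lower-is-better metrics such as OmniDocBench's error score) and take an equal-weight average. The normalization scheme and equal weighting are assumptions for illustration only, not how any lab actually ranks its models; the numbers are copied from the table above.

```python
# Toy aggregation: normalize each benchmark to [0, 1] across the three models
# and average, so heterogeneous units (%, Elo, dollars, error rates) become comparable.
# Equal weights and min-max normalization are illustrative assumptions.

scores = {
    # benchmark: ((Gemini 3 Pro, Claude Sonnet 4.5, GPT-5.1), higher_is_better)
    "Humanity's Last Exam": ((37.5, 13.7, 26.5), True),
    "ARC-AGI-2":            ((31.1, 13.6, 17.6), True),
    "GPQA Diamond":         ((91.9, 83.4, 88.1), True),
    "OmniDocBench 1.5":     ((0.115, 0.145, 0.147), False),      # error metric, lower is better
    "LiveCodeBench Pro":    ((2439, 1418, 2243), True),          # Elo rating
    "Vending-Bench 2":      ((5478.16, 3838.74, 1473.43), True), # net worth in USD
}

models = ["Gemini 3 Pro", "Claude Sonnet 4.5", "GPT-5.1"]
totals = [0.0] * len(models)

for values, higher_is_better in scores.values():
    lo, hi = min(values), max(values)
    for i, v in enumerate(values):
        norm = (v - lo) / (hi - lo) if hi != lo else 0.5
        totals[i] += norm if higher_is_better else 1.0 - norm

for name, total in zip(models, totals):
    print(f"{name}: {total / len(scores):.2f}")
```

Any ranking produced this way is only as meaningful as the choice of benchmarks and weights, which is exactly why the labs report the full table rather than a single score.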
Full release details: https://blog.google/technology/developers/gemini-3-developers/