
Gemini 3.0

With each new frontier model release, the question is always the same: is it better than what we currently have? And how do we quantify "better"? To answer that, labs use a battery of benchmarks, each mapping to a different cognitive ability or branch of human intelligence.

Below is each benchmark shown in the model presentation, along with what it actually measures.

Frontier LLM Benchmark Comparison

| Benchmark | What is evaluated & how? | Human Cognitive Analog | Official Link / Source | Gemini 3 Pro | Claude Sonnet 4.5 | GPT-5.1 |
|---|---|---|---|---|---|---|
| Humanity's Last Exam | Academic reasoning: hard, "un-googleable" abstract reasoning across varied subjects. (Score shown is no tools / with tools.) | General Intelligence (g) | Humanity's Last Exam | 37.5% (45.8% with tools) | 13.7% | 26.5% |
| ARC-AGI-2 | Visual reasoning puzzles: solving novel grid-pattern puzzles never seen during training. | Fluid Intelligence | ARC Prize | 31.1% | 13.6% | 17.6% |
| GPQA Diamond | Scientific knowledge: graduate-level QA in biology, physics, and chemistry. | Expertise (PhD) | GPQA | 91.9% | 83.4% | 88.1% |
| AIME 2025 | Mathematics: challenging math-competition word problems. (Score shown is no tools / with code.) | Mathematical Logic | AIME | 95.0% (100% with code) | 87.0% (100% with code) | 94.0% |
| MathArena Apex | Contest math: extremely difficult, Olympiad-level contest problems. | Math Creativity | MathArena | 23.4% | 1.6% | 1.0% |
| MMMU-Pro | Multimodal reasoning: expert-level questions requiring analysis of images, diagrams, and text. | Visual-Verbal Integration | MMMU Benchmark | 81.0% | 68.0% | 76.0% |
| ScreenSpot-Pro | Screen understanding: identifying UI elements (buttons, menus) on computer screens. | Interface Understanding | ScreenSpot-Pro | 72.7% | 36.2% | 3.5% |
| CharXiv Reasoning | Chart synthesis: interpreting complex scientific charts and plots from research papers. | Data Literacy | CharXiv | 81.4% | 68.5% | 69.5% |
| OmniDocBench 1.5 | OCR: optical character recognition and document parsing. (Lower is better: the score is an error rate.) | Visual Text Recognition | arXiv | 0.115 | 0.145 | 0.147 |
| Video-MMMU | Video understanding: acquiring knowledge from events and temporal flow in video clips. | Temporal Perception | MMMU Benchmark | 87.6% | 77.8% | 80.4% |
| LiveCodeBench Pro | Competitive coding: coding problems from contests (Elo rating). | Algorithmic Logic | LiveCodeBench | 2,439 | 1,418 | 2,243 |
| Terminal-Bench 2.0 | Terminal coding: using a Linux command line to manage files/scripts. | Computer Operation | TBench | 54.2% | 42.8% | 47.6% |
| SWE-Bench Verified | Software engineering: fixing real GitHub issues/bugs. | Professional Engineering | SWE-Bench | 76.2% | 77.2% | 76.3% |
| τ2-bench (Tau-2) | Agentic tool use: using external tools (search, calendar) to fulfill requests. | Tool Manipulation | Tau-Bench | 85.4% | 84.7% | 80.2% |
| Vending-Bench 2 | Long-horizon tasks: managing complex, stateful systems over time (net worth generated). | Executive Function | Andon Labs | $5,478.16 | $3,838.74 | $1,473.43 |
| FACTS Suite | Grounding: verifying that answers are supported by source text (factuality). | Fact Verification | DeepMind Research | 70.5% | 50.4% | 50.8% |
| SimpleQA Verified | Parametric knowledge: answering factual questions without hallucination. | Semantic Memory | arXiv | 72.1% | 29.3% | 34.9% |
| MMMLU | Multilingual Q&A: MMLU-style questions across different languages. | Linguistic Intelligence | Hugging Face | 91.8% | 89.1% | 91.0% |
| Global PIQA | Common sense: physical common sense across 100+ cultures. | Cultural Awareness | arXiv | 93.4% | 90.1% | 90.9% |
| MRCR v2 (8-needle) | Long context: finding specific details in massive amounts of text. | Working Memory | Hugging Face | 77.0% | 47.1% | 61.6% |
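
Since the question is ultimately "who leads where," here is a minimal Python sketch that transcribes the scores above and tallies benchmark wins per model. This is my own illustration, not anything from the release: it uses the no-tools numbers where both are reported, flags OmniDocBench as the one metric where lower is better, and only compares values within a benchmark, since Elo ratings (LiveCodeBench) and dollars (Vending-Bench) are not percentages.

```python
# Tally which model leads on each benchmark from the table above.
# Scores are transcribed verbatim; only the no-tools variants are used.

MODELS = ["Gemini 3 Pro", "Claude Sonnet 4.5", "GPT-5.1"]

# (benchmark, higher_is_better, [Gemini 3 Pro, Claude Sonnet 4.5, GPT-5.1])
RESULTS = [
    ("Humanity's Last Exam", True, [37.5, 13.7, 26.5]),
    ("ARC-AGI-2", True, [31.1, 13.6, 17.6]),
    ("GPQA Diamond", True, [91.9, 83.4, 88.1]),
    ("AIME 2025", True, [95.0, 87.0, 94.0]),
    ("MathArena Apex", True, [23.4, 1.6, 1.0]),
    ("MMMU-Pro", True, [81.0, 68.0, 76.0]),
    ("ScreenSpot-Pro", True, [72.7, 36.2, 3.5]),
    ("CharXiv Reasoning", True, [81.4, 68.5, 69.5]),
    ("OmniDocBench 1.5", False, [0.115, 0.145, 0.147]),     # error rate: lower wins
    ("Video-MMMU", True, [87.6, 77.8, 80.4]),
    ("LiveCodeBench Pro", True, [2439, 1418, 2243]),        # Elo, not a percentage
    ("Terminal-Bench 2.0", True, [54.2, 42.8, 47.6]),
    ("SWE-Bench Verified", True, [76.2, 77.2, 76.3]),
    ("τ2-bench", True, [85.4, 84.7, 80.2]),
    ("Vending-Bench 2", True, [5478.16, 3838.74, 1473.43]), # net worth in USD
    ("FACTS Suite", True, [70.5, 50.4, 50.8]),
    ("SimpleQA Verified", True, [72.1, 29.3, 34.9]),
    ("MMMLU", True, [91.8, 89.1, 91.0]),
    ("Global PIQA", True, [93.4, 90.1, 90.9]),
    ("MRCR v2 (8-needle)", True, [77.0, 47.1, 61.6]),
]

def winner(higher_is_better: bool, values: list[float]) -> int:
    """Index of the best-scoring model for one benchmark."""
    best = max if higher_is_better else min
    return values.index(best(values))

wins = {model: 0 for model in MODELS}
for _name, higher, values in RESULTS:
    wins[MODELS[winner(higher, values)]] += 1

for model, n in sorted(wins.items(), key=lambda kv: -kv[1]):
    print(f"{model}: leads on {n} of {len(RESULTS)} benchmarks")
```

Running it confirms the headline: on these numbers, Gemini 3 Pro leads every benchmark in the table except SWE-Bench Verified, where Claude Sonnet 4.5 edges it out.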

Full release details: https://blog.google/technology/developers/gemini-3-developers/

Tags: LLMs