
Gemini 3.0

With each new frontier model release, the question is always the same: is it better than what we currently have? And how do we quantify "better"? To answer that, labs use a battery of benchmarks, each mapping to a different cognitive ability or branch of human intelligence.

Below is each benchmark shown in the model presentation, along with what it actually measures.

Frontier LLM Benchmark Comparison

| Benchmark | What is evaluated & how? | Human Cognitive Analog | Official Link / Source | Gemini 3 Pro | Claude Sonnet 4.5 | GPT-5.1 |
|---|---|---|---|---|---|---|
| Humanity's Last Exam | Academic reasoning: hard, "un-googleable" abstract reasoning across varied subjects. (Score shown is no tools / with tools.) | General Intelligence (g) | Humanity's Last Exam | 37.5% (45.8% with tools) | 13.7% | 26.5% |
| ARC-AGI-2 | Visual reasoning puzzles: solving novel grid-pattern puzzles never seen during training. | Fluid Intelligence | ARC Prize | 31.1% | 13.6% | 17.6% |
| GPQA Diamond | Scientific knowledge: graduate-level QA in biology, physics, and chemistry. | Expertise (PhD) | GPQA | 91.9% | 83.4% | 88.1% |
| AIME 2025 | Mathematics: challenging math-competition word problems. (Score shown is no tools / with code.) | Mathematical Logic | AIME | 95.0% (100% with code) | 87.0% (100% with code) | 94.0% |
| MathArena Apex | Contest math: extremely difficult, Olympiad-level contest problems. | Math Creativity | MathArena | 23.4% | 1.6% | 1.0% |
| MMMU-Pro | Multimodal reasoning: expert-level questions requiring analysis of images, diagrams, and text. | Visual-Verbal Integration | MMMU Benchmark | 81.0% | 68.0% | 76.0% |
| ScreenSpot-Pro | Screen understanding: identifying UI elements (buttons, menus) on computer screens. | Interface Understanding | ScreenSpot-Pro | 72.7% | 36.2% | 3.5% |
| CharXiv Reasoning | Chart synthesis: interpreting complex scientific charts and plots from research papers. | Data Literacy | CharXiv | 81.4% | 68.5% | 69.5% |
| OmniDocBench 1.5 | OCR: optical character recognition and document parsing. (Lower is better: the score is an error rate.) | Visual Text Recognition | arXiv | 0.115 | 0.145 | 0.147 |
| Video-MMMU | Video understanding: acquiring knowledge from events and temporal flow in video clips. | Temporal Perception | MMMU Benchmark | 87.6% | 77.8% | 80.4% |
| LiveCodeBench Pro | Competitive coding: coding problems from contests (Elo rating). | Algorithmic Logic | LiveCodeBench | 2,439 | 1,418 | 2,243 |
| Terminal-Bench 2.0 | Terminal coding: using a Linux command line to manage files/scripts. | Computer Operation | TBench | 54.2% | 42.8% | 47.6% |
| SWE-Bench Verified | Software engineering: fixing real GitHub issues/bugs. | Professional Engineering | SWE-Bench | 76.2% | 77.2% | 76.3% |
| τ2-bench (Tau-2) | Agentic tool use: using external tools (search, calendar) to fulfill requests. | Tool Manipulation | Tau-Bench | 85.4% | 84.7% | 80.2% |
| Vending-Bench 2 | Long-horizon tasks: managing complex, stateful systems over time (net worth generated). | Executive Function | Andon Labs | $5,478.16 | $3,838.74 | $1,473.43 |
| FACTS Suite | Grounding: verifying that answers are supported by source text (factuality). | Fact Verification | DeepMind Research | 70.5% | 50.4% | 50.8% |
| SimpleQA Verified | Parametric knowledge: answering factual questions without hallucination. | Semantic Memory | arXiv | 72.1% | 29.3% | 34.9% |
| MMMLU | Multilingual Q&A: MMLU-style questions across different languages. | Linguistic Intelligence | Hugging Face | 91.8% | 89.1% | 91.0% |
| Global PIQA | Common sense: physical common sense across 100+ cultures. | Cultural Awareness | arXiv | 93.4% | 90.1% | 90.9% |
| MRCR v2 (8-needle) | Long context: finding specific details in massive amounts of text. | Working Memory | Hugging Face | 77.0% | 47.1% | 61.6% |
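
Since the question is ultimately "who leads where," here is a minimal Python sketch that transcribes the scores above and tallies benchmark wins per model. This is my own illustration, not anything from the release: it uses the no-tools numbers where both are reported, flags OmniDocBench as the one metric where lower is better, and only compares values within a benchmark, since Elo ratings (LiveCodeBench) and dollars (Vending-Bench) are not percentages.

```python
# Tally which model leads on each benchmark from the table above.
# Scores are transcribed verbatim; only the no-tools variants are used.

MODELS = ["Gemini 3 Pro", "Claude Sonnet 4.5", "GPT-5.1"]

# (benchmark, higher_is_better, [Gemini 3 Pro, Claude Sonnet 4.5, GPT-5.1])
RESULTS = [
    ("Humanity's Last Exam", True, [37.5, 13.7, 26.5]),
    ("ARC-AGI-2", True, [31.1, 13.6, 17.6]),
    ("GPQA Diamond", True, [91.9, 83.4, 88.1]),
    ("AIME 2025", True, [95.0, 87.0, 94.0]),
    ("MathArena Apex", True, [23.4, 1.6, 1.0]),
    ("MMMU-Pro", True, [81.0, 68.0, 76.0]),
    ("ScreenSpot-Pro", True, [72.7, 36.2, 3.5]),
    ("CharXiv Reasoning", True, [81.4, 68.5, 69.5]),
    ("OmniDocBench 1.5", False, [0.115, 0.145, 0.147]),     # error rate: lower wins
    ("Video-MMMU", True, [87.6, 77.8, 80.4]),
    ("LiveCodeBench Pro", True, [2439, 1418, 2243]),        # Elo, not a percentage
    ("Terminal-Bench 2.0", True, [54.2, 42.8, 47.6]),
    ("SWE-Bench Verified", True, [76.2, 77.2, 76.3]),
    ("τ2-bench", True, [85.4, 84.7, 80.2]),
    ("Vending-Bench 2", True, [5478.16, 3838.74, 1473.43]), # net worth in USD
    ("FACTS Suite", True, [70.5, 50.4, 50.8]),
    ("SimpleQA Verified", True, [72.1, 29.3, 34.9]),
    ("MMMLU", True, [91.8, 89.1, 91.0]),
    ("Global PIQA", True, [93.4, 90.1, 90.9]),
    ("MRCR v2 (8-needle)", True, [77.0, 47.1, 61.6]),
]

def winner(higher_is_better: bool, values: list[float]) -> int:
    """Index of the best-scoring model for one benchmark."""
    best = max if higher_is_better else min
    return values.index(best(values))

wins = {model: 0 for model in MODELS}
for _name, higher, values in RESULTS:
    wins[MODELS[winner(higher, values)]] += 1

for model, n in sorted(wins.items(), key=lambda kv: -kv[1]):
    print(f"{model}: leads on {n} of {len(RESULTS)} benchmarks")
```

Running it confirms the headline: on these numbers, Gemini 3 Pro leads every benchmark in the table except SWE-Bench Verified, where Claude Sonnet 4.5 edges it out.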

Full release details: https://blog.google/technology/developers/gemini-3-developers/

Tags: LLMs