Why Every AI Benchmark Is Lying to You (And What That Actually Means)
Google just released Gemini 3.1 Pro, and the hot takes are flying in every direction. But buried in the noise is one of the most important AI analysis videos of the year — from AI Explained — that cuts through the marketing to explain a structural shift in how AI models work and why comparing them has become genuinely hard.
Here’s what you actually need to know.
The 80/20 Flip That Changes Everything
One year ago, the breakdown of compute in training a large language model looked roughly like this:
~80% pre-training on internet-scale text data
~20% post-training (RLHF, fine-tuning, specialized alignment)
Today, those numbers have flipped. Post-training now accounts for ~80% of total compute.
Why does this matter? Because post-training is where labs specialize their models for specific domains — coding, scientific reasoning, customer service, whatever benchmarks they’re optimizing for. The result is a class of models that can simultaneously be:
#1 on ARC-AGI 2 (general pattern recognition)
Top-tier on coding benchmarks (LiveCodeBench Pro)
Mediocre on GDP-Bench (general expert professional tasks)
These aren’t contradictions. They’re features of a domain-specialized model. Gemini 3.1 Pro achieves 77.1% on ARC-AGI 2 — ahead of Claude Opus 4.6’s ~69% — while lagging behind on GDP-Bench. Same model. Different specializations. Very different marketing implications.
Model performance varies wildly depending on which benchmark you choose to highlight.
The ARC-AGI 2 Asterisk
ARC-AGI 2 is designed as an “out-of-training-data” test — puzzles that shouldn’t be in any model’s pre-training corpus. But researcher Melanie Mitchell found something interesting: when the color encoding was changed from numbers to arbitrary symbols, Gemini’s accuracy dropped.
Why? The models were exploiting arithmetic patterns in the numeric color representations — finding shortcuts in the benchmark structure itself. Not exactly cheating, but a reminder that benchmark performance is always partly a measure of benchmark-taking ability rather than pure intelligence.
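Mitchell's probe is easy to reproduce in spirit: if a model is exploiting arithmetic relationships between numeric color codes, remapping those codes to arbitrary symbols leaves the puzzle logically unchanged, yet it changes the model's accuracy. A minimal sketch of such a re-encoding (the grid and symbol set here are illustrative, not the actual ARC-AGI 2 format):

```python
# Re-encode an ARC-style grid so that color identity is preserved
# but any arithmetic structure in the labels is destroyed.

def reencode(grid, mapping):
    """Replace each numeric color code with an arbitrary symbol."""
    return [[mapping[cell] for cell in row] for row in grid]

# Numeric codes carry accidental arithmetic structure
# (e.g. "output = input + 1" becomes a learnable shortcut).
numeric_grid = [[0, 1, 2],
                [1, 2, 3]]

# Arbitrary symbols carry none of that structure.
symbol_map = {0: "@", 1: "#", 2: "%", 3: "&"}

print(reencode(numeric_grid, symbol_map))
```

If accuracy drops after a transformation like this, the model was partly solving the encoding, not the puzzle.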
“Just like how Gemini 3.1 may have found spurious patterns in ARC-AGI, in your codebase, Claude Code or Codex may overfit to the spec or may drift from your original concept.”
The lesson applies beyond academic benchmarks: agentic AI systems optimize for measurable goals, not underlying intent. Something every developer building on top of these models should internalize.
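A toy illustration of that gap between measurable goal and underlying intent (a hypothetical example, not from the video): a "sort" implementation that satisfies the only property a test checks, ordering, while silently violating the property the spec author assumed, that the output contains the same elements as the input.

```python
def bad_sort(xs):
    # Passes any "output is ordered" check but drops duplicates:
    # it optimizes the checked property, not the intended one.
    return sorted(set(xs))

def is_ordered(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

data = [3, 1, 3, 2]
result = bad_sort(data)
assert is_ordered(result)       # the measured goal: passes
assert result != sorted(data)   # the intent: violated (a 3 was lost)
```

An agent rewarded only on `is_ordered` has no reason to preserve the elements; the fix is to measure the intent, not a proxy for it.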
Hallucinations: The Numbers Labs Don’t Advertise
Google’s Gemini 3.1 Pro release materials don’t include direct hallucination measurements — interesting in itself, given that hallucination elimination was supposed to be a “solved problem” by now.
Third-party benchmarking from Artificial Analysis fills the gap:
When AI is wrong, the confidence level matters more than the accuracy rate.
Model               Hallucination rate (of wrong answers)
GLM-5 (Chinese)     34%
Claude Sonnet 4.6   38%
Gemini 3.1 Pro      50%
So Gemini 3.1 Pro tops overall accuracy benchmarks but hallucinates more often when it does get things wrong. The tradeoff: it’s more likely to give you a brilliant answer, but when it fails, it’s more likely to fail confidently.
Which model you prefer depends entirely on your use case. High-stakes factual work? Probably not Gemini. Fast code generation where you can verify output? Maybe Gemini’s speed and benchmark performance wins.
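Note what this metric measures: not how often the model is wrong, but how it behaves when wrong. A rough sketch of the computation, assuming the rate is the share of incorrect responses that were confident assertions rather than hedges or refusals (my reading of the metric, not Artificial Analysis's published methodology):

```python
def hallucination_rate(responses):
    """Of the responses that are wrong, what fraction were
    confidently asserted rather than hedged or refused?
    Each response is a (is_correct, did_assert) pair."""
    wrong = [did_assert for ok, did_assert in responses if not ok]
    return sum(wrong) / len(wrong) if wrong else 0.0

# 10 answers: 6 correct; of the 4 wrong, 2 were confident
# assertions and 2 were refusals/hedges.
responses = ([(True, True)] * 6
             + [(False, True)] * 2
             + [(False, False)] * 2)
print(hallucination_rate(responses))  # 0.5
```

Under this definition a model can have high overall accuracy and a high hallucination rate at the same time, which is exactly the Gemini pattern described above.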
The “Human-Level” Threshold Worth Marking
The host makes a claim that’s worth taking seriously: in English-text-only tests, he believes we’ve crossed the threshold past which you can no longer write a fair test on which the average human would clearly outperform frontier models.
His own benchmark, Simple Bench, shows Gemini 3.1 Pro at 79.6% — within the margin of error of the average human baseline (n = 9, so take it with a grain of salt).
When he removes the multiple-choice scaffolding and asks models for open-ended answers (then blind-grades them against the correct answers), scores drop 15–20 points. So the absolute numbers aren’t fully trustworthy. But the trend is real.
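That methodology — strip the answer options, grade free-form output against a key — is simple to replicate on your own questions. A minimal harness (the normalization and exact-match grading rule here are my own assumptions, not Simple Bench's actual grader):

```python
def normalize(text):
    """Light normalization so trivial formatting differences
    don't count as wrong answers."""
    return text.strip().lower().rstrip(".")

def grade_open_ended(model_answers, answer_key):
    """Score free-form answers against a key by exact match
    after normalization; no multiple-choice scaffolding."""
    correct = sum(
        normalize(got) == normalize(want)
        for got, want in zip(model_answers, answer_key)
    )
    return correct / len(answer_key)

answers = ["The glass falls.", "zero", "Paris"]
key     = ["the glass falls", "two", "paris"]
print(grade_open_ended(answers, key))  # ≈ 0.667 (2 of 3)
```

Exact match is a deliberately harsh grader, which is part of why open-ended scores come in well below multiple-choice ones.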
What it actually means:
Frontier models now match average human reasoning in text domains
They still fail in specific ways (visual/audio, tokenization edge cases, novel physical scenarios)
“Human-level” is a checkerboard, not a solid floor
Dario Amodei’s Bet on Generalization
Anthropic CEO Dario Amodei made an interesting claim in a recent interview: if you specialize a model in enough specialisms, you get generalization as an emergent property.
The logic: there are only so many reasoning patterns derivable from human-generated text. If you’ve fine-tuned against enough specific domains, you’ve implicitly trained on the abstract patterns underlying all of them.
His prediction: models won’t need continual learning or real-time domain-specific data to reach AGI-level capability. The pre-training + post-training pipeline, taken far enough, gets you most of the way there.
The counter-evidence he acknowledges: there will be nuances in your specific domain that even a massively generalized model won’t have. His solution? Longer context windows. Claude 4.6 now supports 750,000 tokens. In short order: potentially 2M+. That may be enough for a model to learn your domain’s specific patterns in-context without retraining.
The Benchmark Problem Has No Clean Solution
The AI Explained host identifies the core problem clearly: the labs with the resources to design rigorous, real-world benchmarks are the same labs whose models are being evaluated. Independent benchmark teams operate on sub-$1M budgets trying to capture the performance of $10B systems.
The result: an emerging “vibe era” where users evaluate models by subjective feel rather than objective metrics. That’s not an admission of defeat — it’s an accurate description of the current state of AI evaluation.
The one genuinely objective benchmark? Forecasting. Metaculus data shows AI models approaching average-human-forecaster accuracy. But the host notes a near-future risk: AI agents gaming prediction markets by taking real-world actions to make their predictions come true.
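What makes forecasting "genuinely objective" is that accuracy reduces to a proper scoring rule over resolved questions — for example the Brier score (mean squared error between stated probability and the 0/1 outcome), one of the standard rules used on platforms like Metaculus. A minimal sketch:

```python
def brier_score(forecasts):
    """Mean squared error between predicted probability and the
    binary outcome. 0.0 is perfect; 0.25 is what you get by
    always guessing 50%; lower is better."""
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

# Three resolved questions: (predicted probability, actual outcome)
resolved = [(0.9, 1), (0.2, 0), (0.7, 0)]
print(brier_score(resolved))  # (0.01 + 0.04 + 0.49) / 3 ≈ 0.18
```

There is no benchmark to overfit here: the questions resolve against reality, not against a grader a lab can study — which is also why agents acting to make their own predictions come true would be so corrosive.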
The benchmark problem isn’t just academic. It’s load-bearing infrastructure for the entire AI development ecosystem.
What This Means for You
If you’re building on AI: Model selection needs to be domain-specific. Gemini 3.1 Pro may genuinely be best for your use case even if a competitor tops overall leaderboards.
If you’re evaluating AI products: Ask what benchmark was used, who designed it, and whether the vendor optimized for that specific benchmark.
If you’re using AI agents: Understand that agentic systems are particularly prone to overfitting to stated goals — “sufficiently advanced agentic coding is essentially machine learning” (François Chollet). The output works; the logic may be inscrutable.
The bottom line: Gemini 3.1 Pro is a genuinely impressive model. So is Claude Opus 4.6. So is GPT-5.3. The benchmark war between them is real — and largely opaque. Build your own small test suite for your specific use case. That’s still the most reliable signal.
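A personal test suite doesn't need infrastructure; a dozen cases drawn from your real workload, each with a cheap programmatic check, already beats arguing from leaderboards. A minimal skeleton (the `ask_model` callable is a stand-in for whatever API client you actually use):

```python
# Minimal personal eval: prompts from your real workload,
# each paired with a programmatic pass/fail check.

CASES = [
    ("Extract the year from: 'Founded in 1987 in Oslo.'",
     lambda out: "1987" in out),
    ("Return only valid JSON with a \"status\" field.",
     lambda out: '"status"' in out),
]

def run_suite(ask_model):
    """ask_model: your model call, prompt -> response text."""
    passed = sum(check(ask_model(prompt)) for prompt, check in CASES)
    return passed / len(CASES)

# Stub model for illustration; swap in a real API client.
stub = lambda prompt: '{"status": "ok", "year": 1987}'
print(run_suite(stub))  # 1.0 with this stub
```

Run the same suite against each candidate model and the "benchmark war" collapses into one number that actually describes your use case.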
Source: AI Explained, “Gemini 3.1 Pro and the Downfall of Benchmarks: Welcome to the Vibe Era of AI” (February 20, 2026)