№ 017 · April 18, 2026

The frontier becomes a plateau: three models tie at 57

Claude Opus 4.7, Gemini 3.1 Pro, and GPT-5.4 all score 57 on the Artificial Analysis Intelligence Index—the first three-way tie at the top in eighteen months. Intelligence parity masks stark differences in price and speed.

For the first time in eighteen months, three frontier models share the top of a major public benchmark. Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.4 all score 57 on the Artificial Analysis Intelligence Index—a composite of ten evaluations including Humanity’s Last Exam, GPQA Diamond, and Terminal-Bench Hard.

The next-closest model, GPT-5.3 Codex, sits three points behind at 54. Below that comes Claude Opus 4.6 at 53. The average score across all 133 models evaluated is 31, meaning the leaders sit 84% above the field. This isn’t a crowded summit; it’s a plateau with exactly three occupants and a steep drop to fourth place.
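The arithmetic behind that spread is simple, using only the two scores quoted above:

```python
# Leaders at 57 vs. a field average of 31 across the 133 evaluated models.
top_score, field_average = 57, 31
print(f"Leaders sit {top_score / field_average - 1:.0%} above the field")  # -> 84%
```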

What the tie obscures

Intelligence parity masks divergence in almost everything else.

Gemini 3.1 Pro runs at 130 tokens per second—more than twice the speed of Opus 4.7’s 52 tokens per second. GPT-5.4 falls between them at 75. The average across all models is 61.
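In wall-clock terms, those throughput numbers translate directly into waiting time. A minimal sketch, assuming a 1,000-token response; the response length is an illustrative assumption, not a figure from the benchmark:

```python
# Time to stream a fixed-length response at each model's measured output speed.
speeds_tps = {"Gemini 3.1 Pro": 130, "GPT-5.4": 75, "Claude Opus 4.7": 52}
response_tokens = 1_000  # assumed response length, for illustration only

for model, tps in speeds_tps.items():
    print(f"{model}: {response_tokens / tps:.1f}s")
# Gemini 3.1 Pro: 7.7s, GPT-5.4: 13.3s, Claude Opus 4.7: 19.2s
```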

Pricing shows an even wider spread. At blended rates (a 3:1 input-to-output ratio), Gemini costs $4.50 per million tokens. GPT-5.4 costs $5.63. Opus 4.7 costs $10.00—more than double Gemini’s price for the same score.
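A blended rate is a weighted average of input and output per-token prices, here weighted three parts input to one part output. The sketch below shows the calculation; the per-token split used for Gemini is hypothetical, since only the blended totals are quoted:

```python
def blended_rate(input_per_m: float, output_per_m: float, ratio: float = 3.0) -> float:
    """Blended $/M tokens, assuming `ratio` input tokens per output token."""
    return (ratio * input_per_m + output_per_m) / (ratio + 1)

# One hypothetical input/output split that yields Gemini's $4.50 blended figure:
print(blended_rate(2.00, 12.00))  # -> 4.5
```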

The cost to evaluate each model on the full Intelligence Index illustrates the gap in practice: Artificial Analysis spent $892 running Gemini through its benchmarks, $2,851 for GPT-5.4, and $4,406 for Opus 4.7. Same intelligence score. Nearly five times the evaluation cost between cheapest and most expensive.
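That multiple follows directly from the figures quoted:

```python
# Evaluation cost for the full Intelligence Index run, per the article.
eval_cost = {"Gemini 3.1 Pro": 892, "GPT-5.4": 2_851, "Claude Opus 4.7": 4_406}
spread = eval_cost["Claude Opus 4.7"] / eval_cost["Gemini 3.1 Pro"]
print(f"{spread:.1f}x between cheapest and most expensive")  # -> 4.9x
```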

“Claude Opus 4.7 is amongst the leading models in intelligence, but particularly expensive compared to other models of similar intelligence.” — Artificial Analysis

The reasoning era

All three models achieve 57 through reasoning-enabled configurations: Opus 4.7 at “Adaptive Reasoning, Max Effort,” GPT-5.4 at “xhigh,” and Gemini 3.1 Pro in its Preview reasoning mode. Anthropic describes Opus 4.7 as “our most capable Opus model yet.” Google positions Gemini 3.1 Pro as “best for complex tasks and bringing creative concepts to life,” the culmination of a progression from Gemini 1’s multimodality through Gemini 2’s agentic capabilities.

The models reached 57 on different timelines. Gemini 3.1 Pro Preview shipped in February 2026, GPT-5.4 in March, Opus 4.7 in April. Each new release matched rather than exceeded the others—a pattern that produced the tie.

Context windows have converged too: 1 million tokens for Opus and Gemini, 1.05 million for GPT-5.4. Multimodal capabilities differ—Gemini accepts text, image, speech, and video input—but the headline measure, raw intelligence as benchmarked, shows no daylight.

What to watch

The three-way tie marks the end of the clear-leader era that defined 2024 and 2025. When GPT-4 shipped in March 2023, it held a commanding lead. When Claude 3 Opus arrived in March 2024, it briefly claimed the crown. Leads lasted months; now they last weeks, if that.

The competition shifts. If intelligence is a commodity at the frontier—if 57 is the ceiling everyone hits—then differentiation moves to price, speed, and specialization. Gemini’s 130-token-per-second throughput matters for latency-sensitive applications. Opus’s $10-per-million cost matters for budget-constrained ones. The benchmarks say they’re tied. The invoices and the stopwatches say otherwise.
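That shift can be made concrete. The sketch below filters the three tied models by a cost ceiling and a speed floor; the two thresholds are hypothetical, while the prices and speeds are the figures quoted above:

```python
# With intelligence tied at 57, selection comes down to price and speed.
models = {
    "Claude Opus 4.7": {"blended_usd_per_m": 10.00, "tokens_per_s": 52},
    "Gemini 3.1 Pro":  {"blended_usd_per_m": 4.50,  "tokens_per_s": 130},
    "GPT-5.4":         {"blended_usd_per_m": 5.63,  "tokens_per_s": 75},
}

MAX_PRICE = 6.00  # hypothetical budget ceiling, $/M blended tokens
MIN_SPEED = 70    # hypothetical latency floor, tokens/s

for name, m in models.items():
    fits = m["blended_usd_per_m"] <= MAX_PRICE and m["tokens_per_s"] >= MIN_SPEED
    print(f"{name}: {'fits' if fits else 'ruled out'}")
# Gemini and GPT-5.4 clear both bars; Opus is ruled out on price alone.
```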

The Intelligence Index will update. New evaluations will join the composite. One of these models, or a successor, will eventually break 57 and claim a solo lead. Until then, the frontier is a plateau—and the three companies standing on it are competing on everything except the one metric that used to matter most.