Diffusion LLMs hit 1,000 tokens/sec and suddenly the typewriter era looks 🐢

Inception Labs unveiled Mercury 2 on Thursday, describing it as the world's fastest reasoning language model at roughly 1,000 tokens per second, compared with about 89 tokens per second for Anthropic's Claude Haiku 4.5 Reasoning and 71 for OpenAI's GPT-5 Mini. Inception framed the launch as a vindication of its long-standing parallel-generation bet, posting on X on June 18, 2026 that "we bet on parallel generation years ago, when it was a contrarian idea. It's great to see the industry arrive," and adding that Mercury 2 "continues to lead the Pareto frontier for quality, speed, and cost among publicly available diffusion LLMs."
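To put those throughput figures in perspective, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the speeds are the vendor-reported numbers quoted above, the 2,000-token response length is a hypothetical chosen for the example, and the calculation ignores time-to-first-token, network latency and batching.

```python
# Vendor-reported decoding speeds quoted in the article, in tokens per second.
SPEEDS = {
    "Mercury 2": 1000,
    "Claude Haiku 4.5 Reasoning": 89,
    "GPT-5 Mini": 71,
}


def seconds_to_generate(n_tokens, tokens_per_second):
    """Idealized wall-clock time to emit n_tokens at a constant rate."""
    return n_tokens / tokens_per_second


def speedup(fast_tps, slow_tps):
    """How many times faster the first rate is than the second."""
    return fast_tps / slow_tps


if __name__ == "__main__":
    n_tokens = 2000  # hypothetical response length, not from the article
    for name, tps in SPEEDS.items():
        secs = seconds_to_generate(n_tokens, tps)
        ratio = speedup(SPEEDS["Mercury 2"], tps)
        print(f"{name}: {secs:.1f} s for {n_tokens} tokens "
              f"(Mercury 2 is {ratio:.1f}x this speed)")
```

On the quoted figures, the gap works out to roughly 11x over Claude Haiku 4.5 Reasoning and roughly 14x over GPT-5 Mini.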

Mercury 2 and Google's DiffusionGemma both abandon the sequential word-by-word decoding used by standard chatbots, instead filling text blocks with placeholder tokens and refining them across parallel passes, a technique derived from image-generation diffusion models. Inception was founded on research by Stanford professor Stefano Ermon, who co-authored foundational score-based diffusion work, and has raised $50 million from investors including Nvidia's venture arm, Andrew Ng and Andrej Karpathy.
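The placeholder-and-refine idea can be illustrated with a toy sketch. It is not Inception's or Google's implementation: `toy_denoiser` is a hypothetical stand-in that already knows the answer and attaches random confidences, whereas a real diffusion LLM would predict tokens with a neural network. The sketch only shows the control flow, in which a block starts fully masked and several positions are committed in each pass rather than one token at a time.

```python
import random

MASK = "<mask>"


def toy_denoiser(block, target, rng):
    """Hypothetical model stand-in: for every masked position, propose a
    token and a confidence score. A real model would predict these; here
    the token is read from `target` and the confidence is random."""
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(block)
        if tok == MASK
    }


def diffusion_decode(target, passes=3, seed=0):
    """Fill a fully masked block over a fixed number of parallel passes."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    per_pass = -(-len(target) // passes)  # ceiling division
    history = []
    for _ in range(passes):
        proposals = toy_denoiser(block, target, rng)
        if not proposals:
            break  # nothing left to fill
        # Commit the most confident proposals; the rest stay masked
        # and are reconsidered on the next pass.
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:per_pass]:
            block[i] = proposals[i][0]
        history.append(list(block))
    return block, history


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    final, steps = diffusion_decode(words, passes=3)
    for n, snapshot in enumerate(steps, start=1):
        print(f"pass {n}: {' '.join(snapshot)}")
```

In this example, nine tokens are produced in three passes instead of nine sequential steps, which is the source of the speed advantage the technique claims.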

On AIME 2026, a benchmark built from American Invitational Mathematics Examination problems and scored by percentage solved, Mercury 2 reached 90% versus 69.1% for DiffusionGemma, while standard non-diffusion Gemma 4 scored 88.3%. On GPQA, a PhD-level science benchmark scored the same way, Mercury 2 posted 77% against DiffusionGemma's 73.2%. Google's own developer guide, however, recommends standard Gemma 4 for applications that demand maximum quality and concedes that DiffusionGemma trails it across the board.

Outside benchmarks, AI coding-agent firm Augment Code said it replaced Anthropic's Claude Opus 4.7 with Mercury 2 on its context-compaction subagent and recorded an 82% drop in latency, a 90% cut in cost and unchanged output quality, per a joint case study. Taken together, the benchmark gaps and the integration result suggest that the more consequential architectural change is at the subagent layer, with complex AI systems increasingly operating as coordinated ensembles of smaller models rather than as a single large one.
