China’s AI labs are shipping faster than you can benchmark them

Three major Chinese AI releases landed in the first four months of 2026. Each one makes a different bet on what matters next: reasoning depth, coding stamina, or raw cost efficiency. I have been watching the benchmark sheets fill up. Here is what actually checks out.

Alibaba Qwen3-Max-Thinking: the skeptic’s darling

Alibaba dropped Qwen3-Max-Thinking on January 26. It is a proprietary model, not open weights, and it uses what Alibaba calls “test-time scaling” — basically letting the model think longer and refine its own reasoning paths instead of spitting out the first plausible answer.
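
Alibaba has not published the mechanics, so what follows is only a toy sketch of the general idea behind spending more compute at inference time: draft an answer, then keep refining it while a scorer says it is getting better. The functions `toy_draft`, `toy_refine`, and `toy_score` are hypothetical stand-ins I made up for illustration (a Newton iteration toward the square root of 2), not anything from Qwen.

```python
from typing import Callable


def think_longer(
    draft: Callable[[], float],
    refine: Callable[[float], float],
    score: Callable[[float], float],
    max_rounds: int,
) -> tuple[float, int]:
    """Refine an answer for up to max_rounds, stopping early if it stops improving."""
    best = draft()
    rounds_used = 0
    for _ in range(max_rounds):
        candidate = refine(best)
        rounds_used += 1
        if score(candidate) <= score(best):
            break  # the extra round did not help, so keep the previous answer
        best = candidate
    return best, rounds_used


# Toy stand-ins (assumptions for illustration only): approximate sqrt(2).
TARGET = 2.0


def toy_draft() -> float:
    return 1.0  # the "first plausible answer"


def toy_refine(x: float) -> float:
    return 0.5 * (x + TARGET / x)  # one Newton step


def toy_score(x: float) -> float:
    return -abs(x * x - TARGET)  # closer to zero is better


quick_answer, _ = think_longer(toy_draft, toy_refine, toy_score, max_rounds=1)
patient_answer, _ = think_longer(toy_draft, toy_refine, toy_score, max_rounds=5)
print(f"1 round:  {quick_answer:.6f}")
print(f"5 rounds: {patient_answer:.6f}")
```

The only point of the toy is the shape of the trade: a bigger round budget buys a better answer, and you pay for it in compute and latency.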

The headline number everyone latched onto was Humanity’s Last Exam: 58.3 with web search enabled, beating GPT-5.2-Thinking at 45.5 and Gemini 3 Pro at 45.8. That gap looks dramatic. But HLE is one curated, exam-style benchmark, and independent evaluations tell a more modest story.

Artificial Analysis, which runs independent head-to-heads, puts Qwen3-Max-Thinking at 40 on its Intelligence Index. That trails Kimi K2.5 at 47, DeepSeek V3.2 at 42, and GLM-4.7 at 42. On agentic tasks measured by GDPval-AA, it scores 1170 Elo, behind Kimi K2.5 at 1316. It also hallucinates more than its peers, scoring -34 on the Omniscience Index versus Kimi K2.5’s -11.
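
To put that Elo gap in perspective, here is a small sketch using the standard Elo expected-score formula. Whether GDPval-AA uses exactly this scale (base 10, divisor 400) is an assumption on my part, so treat the result as rough intuition rather than a reproduction of Artificial Analysis's method.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings quoted in this article for GDPval-AA.
KIMI_K2_5 = 1316
QWEN3_MAX_THINKING = 1170

kimi_vs_qwen = elo_expected_score(KIMI_K2_5, QWEN3_MAX_THINKING)
print(f"Kimi K2.5 expected score vs Qwen3-Max-Thinking: {kimi_vs_qwen:.2f}")
```

Under that assumption, a 146-point gap means Kimi K2.5 would be preferred in roughly 70% of head-to-head comparisons. That is a clear lead, not a rout.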

So is it a revolution? No. Is it a solid step up from Alibaba’s previous work? Yes. The pricing helps: $1.20 per million input tokens and $6.00 per million output tokens. That undercuts GPT-5.2 by a wide margin. For enterprises that need a reasoning model without bleeding their API budget dry, it is a viable option.

Kimi K2.6: the coding workhorse

Moonshot AI released Kimi K2.6 on April 20. It is open-source under a modified MIT license, weighs in at 1 trillion parameters, and runs on a mixture-of-experts architecture. The context window is 262K tokens in both directions.

Moonshot’s own benchmarks claim #1 spots on AIME 2026, MathVision, and a visual reasoning test. I do not trust vendor-reported numbers blindly. But the independent Arena leaderboards are more telling: Kimi K2.6 ranks #5 in coding, #6 in reasoning, and #4 in search. On SWE-bench Verified — real-world software engineering tasks — it hits 80.2%, just behind Claude Opus 4.6 at 80.8% and DeepSeek-V4-Pro-Max at 80.6%.

The agent swarm pitch is where Moonshot gets ambitious. K2.6 supports up to 300 sub-agents running in parallel across roughly 4,000 coordinated steps. In a demo, it reportedly coded for 13 hours straight, modifying over 4,000 lines. I have not replicated that. My gut says the 300-agent claim is a ceiling, not a daily driver. But the direction is clear: Moonshot wants K2.6 to be the backend for long-horizon automation, not just chat.
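
I have no visibility into how Moonshot schedules its sub-agents, so the following is just a generic sketch of the pattern the claim implies: fan a large batch of steps out to workers while capping how many run at once. The function names, the cap of 300, and the 4,000 tasks are illustrative; `run_subagent` is a hypothetical stand-in where a real model call would go.

```python
import asyncio


async def run_subagent(task_id: int, limiter: asyncio.Semaphore) -> str:
    """Run one sub-agent step, waiting for a free slot first."""
    async with limiter:
        await asyncio.sleep(0)  # stand-in for a real model or tool call
        return f"task-{task_id}: done"


async def run_swarm(num_tasks: int, max_parallel: int) -> list[str]:
    """Run num_tasks steps with at most max_parallel in flight at a time."""
    limiter = asyncio.Semaphore(max_parallel)
    jobs = (run_subagent(i, limiter) for i in range(num_tasks))
    return await asyncio.gather(*jobs)  # results come back in task order


results = asyncio.run(run_swarm(num_tasks=4000, max_parallel=300))
print(len(results), results[0], results[-1])
```

The cap is the interesting knob. Raising it only helps until the sub-agents start waiting on each other or on shared tools, which is why I read 300 as a ceiling.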

Pricing is aggressive: $0.95 per million input tokens, $4.00 per million output tokens. That is about 21% cheaper on input and 33% cheaper on output than Qwen3-Max-Thinking, and dramatically cheaper than Claude or GPT.
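
For a sense of what those list prices mean on an actual bill, here is a quick sketch that prices the same workload on both models. The monthly workload (200M input tokens, 50M output tokens) is a made-up example; the per-million prices are the ones quoted in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Total USD cost for the given token counts."""
        total = input_tokens * self.input_per_m + output_tokens * self.output_per_m
        return total / 1_000_000


QWEN3_MAX_THINKING = Pricing(input_per_m=1.20, output_per_m=6.00)
KIMI_K2_6 = Pricing(input_per_m=0.95, output_per_m=4.00)

# Hypothetical monthly workload (an assumption, not a measured figure).
INPUT_TOKENS = 200_000_000
OUTPUT_TOKENS = 50_000_000

qwen_cost = QWEN3_MAX_THINKING.cost(INPUT_TOKENS, OUTPUT_TOKENS)
kimi_cost = KIMI_K2_6.cost(INPUT_TOKENS, OUTPUT_TOKENS)
savings_pct = 100 * (qwen_cost - kimi_cost) / qwen_cost

print(f"Qwen3-Max-Thinking: ${qwen_cost:,.2f}")
print(f"Kimi K2.6:          ${kimi_cost:,.2f}")
print(f"Savings:            {savings_pct:.1f}%")
```

On that mix the bill drops from about $540 to about $390. The more output-heavy the workload, the bigger the saving, because the output price is where the two models differ most.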

DeepSeek V4: the cost killer

DeepSeek released V4-Pro and V4-Flash on April 24. Both are open-source. V4-Pro packs 1.6 trillion parameters with 490 billion activated per forward pass. The context window jumped from 128K to 1M tokens. DeepSeek also added KV cache sliding windows and compression to cut attention overhead.
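
DeepSeek's actual cache design is not described here, so this is only a generic sketch of what a sliding-window KV cache does: keep keys and values for the most recent W tokens and evict older ones, so memory stays flat as the sequence grows. The class name, the tiny window, and the one-element vectors are all illustrative.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps key/value entries for only the most recent `window` positions."""

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        # A deque with maxlen drops the oldest entry automatically when full.
        self._entries: deque = deque(maxlen=window)

    def append(self, position: int, key: list[float], value: list[float]) -> None:
        self._entries.append((position, key, value))

    def positions(self) -> list[int]:
        """Token positions that attention can still see."""
        return [position for position, _, _ in self._entries]

    def __len__(self) -> int:
        return len(self._entries)


cache = SlidingWindowKVCache(window=4)
for pos in range(10):
    cache.append(pos, key=[float(pos)], value=[float(pos)])

print(cache.positions())  # [6, 7, 8, 9]
```

The trade-off is that anything older than the window is gone unless something else, such as a compressed summary, carries it forward. That is presumably where the compression half of the design comes in, though I am guessing at that.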

The real story is not the model. It is the silicon. DeepSeek V4 is the first trillion-parameter model to launch with same-day support for Huawei Ascend 950 and Ascend A3 supernodes. No CUDA required. The Ascend 950 chips cost roughly one-fourth as much as comparable NVIDIA hardware, and Huawei claims 2.87x better per-card throughput than the export-restricted H20.

DeepSeek’s API pricing for V4 is 0.25 yuan per million input tokens. Converted, that is roughly $0.035 per million tokens. Compare that to GPT-5.5 Pro at around $30 per million tokens. DeepSeek itself admits V4 trails the top closed-source models by 3 to 6 months. But when the price gap runs to several hundred times, most enterprises will not care.
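
The exact size of that gap depends on the exchange rate and on which price tiers you compare. A minimal sketch of the arithmetic, assuming a rate of 7.2 yuan per dollar (my assumption, not a figure from DeepSeek) and the two prices quoted here:

```python
CNY_PER_USD = 7.2  # assumed exchange rate, adjust to the current one


def cny_to_usd(amount_cny: float, cny_per_usd: float = CNY_PER_USD) -> float:
    """Convert a yuan amount to US dollars at the given rate."""
    return amount_cny / cny_per_usd


deepseek_v4_cny = 0.25  # yuan per million input tokens, as quoted
gpt_5_5_pro_usd = 30.0  # USD per million tokens, as quoted

deepseek_v4_usd = cny_to_usd(deepseek_v4_cny)
price_ratio = gpt_5_5_pro_usd / deepseek_v4_usd

print(f"DeepSeek V4 input price: ${deepseek_v4_usd:.4f} per million tokens")
print(f"Price ratio: about {price_ratio:.0f}x")
```

On these inputs the multiple lands in the high hundreds. It shifts with the exchange rate and with whether you compare input or output prices, so treat any single figure as approximate.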

Huawei says it has already locked in orders for 500,000 Ascend 950 chips, with a target of 750,000. The A-share compute sector traded over 120 billion yuan in a single day after the V4 launch. Whether that enthusiasm holds depends on whether Ascend clusters can actually sustain production workloads at scale. Early adopters I have spoken to say latency is higher than NVIDIA baselines, but not by enough to be a deal-breaker.

What this means

Chinese labs are no longer chasing Western benchmarks as a primary goal. They are optimizing for three things: cost per token, domestic silicon compatibility, and agentic endurance. The gap at the absolute top end still exists — Claude Opus 4.6 and GPT-5.4 lead on several reasoning and coding tasks — but the margin is shrinking while the price differential is exploding.

My take: Qwen3-Max-Thinking is overhyped on HLE but fairly priced for enterprise use. Kimi K2.6 is the most interesting open-source release of the quarter if you care about coding agents. DeepSeek V4 is not the smartest model, but it is the most strategically significant, because it proves Chinese labs can ship trillion-parameter models on entirely domestic hardware.

The next six months will tell us whether Ascend clusters can handle real traffic, or whether the several-hundred-fold cost advantage evaporates under production load. I am watching the latency numbers.