China’s Open-Source AI Labs Are Eating the Benchmarks

DeepSeek, Kimi, Qwen, ByteDance. Four names. Four models released or refreshed since July 2025 that are sitting at or near the top of every public leaderboard that matters. I have been watching this space for two years, and I do not remember a period when Chinese open-source models dominated the conversation this completely.

Here is what actually dropped, with numbers I could verify.

DeepSeek V3.2 arrived on December 1, 2025. Two variants: the standard V3.2 and V3.2-Speciale, which is API-only and built for maximum reasoning. DeepSeek claims 96.0% on AIME 2025, 99.2% on the Harvard-MIT Math Tournament, and gold-medal scores on IMO, CMO, and ICPC problems. The model is a 685B-parameter Mixture-of-Experts with only 37B active per token. It is open weights, MIT license. API pricing is $0.14 per million input tokens and $0.70 per million output tokens. Claude Sonnet 4 costs $3.00 and $15.00 respectively. Do the math: that is roughly a 21-fold gap on both input and output.
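For anyone who wants to do that math explicitly, here is a minimal sketch. The prices are the ones quoted above; the workload size (10 million input tokens, 2 million output tokens) and the function names are my own assumptions, picked only for illustration.

```python
# Per-million-token list prices in USD, as quoted in the article.
PRICES = {
    "DeepSeek V3.2": {"input": 0.14, "output": 0.70},
    "Claude Sonnet 4": {"input": 3.00, "output": 15.00},
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a workload at the quoted per-million-token prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def price_ratio(expensive: str, cheap: str, kind: str) -> float:
    """How many times more the expensive model charges for `kind` tokens."""
    return PRICES[expensive][kind] / PRICES[cheap][kind]


if __name__ == "__main__":
    # Hypothetical workload: 10M input tokens, 2M output tokens.
    for model in PRICES:
        cost = workload_cost(model, 10_000_000, 2_000_000)
        print(f"{model}: ${cost:.2f}")
    for kind in ("input", "output"):
        ratio = price_ratio("Claude Sonnet 4", "DeepSeek V3.2", kind)
        print(f"{kind} price ratio: {ratio:.1f}x")
```

On that hypothetical workload the bill comes to about $2.80 versus $60.00. List prices ignore caching discounts and batch tiers, so treat the ratio as a ceiling-level comparison, not an invoice.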

The training story is the real headline. DeepSeek trained this on H800 GPUs, the cut-down NVIDIA chips designed to comply with U.S. export rules for China. The recipe combines FP8 mixed precision, Multi-head Latent Attention, and a synthetic data pipeline spanning 1,800 environments and 85,000 complex instructions. They spent roughly $5.6 million on GPU hours. Meta spent 11 times that on Llama 3 405B. My gut says export controls did not slow Chinese labs down. They just forced them to write better software.

Kimi K2.5 from Moonshot AI shipped on January 27, 2026. This is a 1.04-trillion-parameter MoE, 32B active, with a 256K context window and native multimodal inputs. The standout feature is Agent Swarm, which spins up as many as 100 parallel sub-agents for complex tasks. On SWE-Bench Verified it scored 76.8%, which was open-weight state of the art at launch. AIME 2025: 96.1%. Humanity’s Last Exam: 50.2%, beating Claude Opus 4.5 at 32.0% and GPT-5.2 High at 41.7%. Pricing is $0.60 per million input tokens, $3.00 output. Again, a fraction of Western rates.
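The "total versus active" parameter counts are worth a quick calculation, because they explain how a trillion-parameter model can be priced this low. The sketch below uses only the figures quoted in this article; the dictionary layout and function name are mine.

```python
# (total parameters, active parameters per token), as quoted in the article.
MODELS = {
    "DeepSeek V3.2": (685e9, 37e9),
    "Kimi K2.5": (1.04e12, 32e9),
}


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's weights that participate in each token."""
    return active_params / total_params


if __name__ == "__main__":
    for name, (total, active) in MODELS.items():
        share = active_fraction(total, active)
        print(f"{name}: {share:.1%} of weights active per token")
```

That works out to roughly 5.4% for DeepSeek V3.2 and 3.1% for Kimi K2.5. Per-token compute tracks the active count far more closely than the headline total, though memory footprint still scales with the full model.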

Moonshot’s CEO Yang Zhilin said they rebuilt the reinforcement learning infrastructure from scratch and optimized training algorithms for efficiency. The Agent Swarm system uses a technique called PARL, Parallel-Agent Reinforcement Learning, to stop the orchestrator from defaulting to serial execution. It works. BrowseComp jumps from 60.6% to 78.4% with the swarm enabled, a gain of 17.8 points.
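To make "serial versus parallel orchestration" concrete, here is a toy sketch. It is not PARL and not Moonshot’s code: there is no reinforcement learning in it at all. It only contrasts an orchestrator that awaits sub-agents one at a time with one that fans them out under a concurrency cap. Every name is hypothetical, `sub_agent` is a stand-in that just sleeps, and the cap of 100 simply mirrors the figure quoted for Agent Swarm.

```python
import asyncio
import time

MAX_PARALLEL = 100  # mirrors the sub-agent cap quoted in the article


async def sub_agent(task_id: int) -> str:
    """Hypothetical sub-agent: sleeps briefly in place of real work."""
    await asyncio.sleep(0.01)
    return f"result-{task_id}"


async def run_serial(n_tasks: int) -> list:
    """Orchestrator that waits for each sub-agent before starting the next."""
    return [await sub_agent(i) for i in range(n_tasks)]


async def run_swarm(n_tasks: int, limit: int = MAX_PARALLEL) -> list:
    """Orchestrator that fans out, with at most `limit` sub-agents in flight."""
    gate = asyncio.Semaphore(limit)

    async def guarded(i: int) -> str:
        async with gate:
            return await sub_agent(i)

    # gather() preserves the order of its arguments in the returned list.
    return await asyncio.gather(*(guarded(i) for i in range(n_tasks)))


def timed(orchestrator, n_tasks: int):
    """Run an orchestrator to completion and return (results, seconds)."""
    start = time.perf_counter()
    results = asyncio.run(orchestrator(n_tasks))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    serial_results, serial_s = timed(run_serial, 20)
    swarm_results, swarm_s = timed(run_swarm, 20)
    print(f"serial: {serial_s:.3f}s, swarm: {swarm_s:.3f}s")
```

Both orchestrators return the same results; only the wall-clock time differs. The hard part PARL is described as solving, getting a learned orchestrator to choose the fan-out path in the first place, is exactly what this sketch leaves out.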

Qwen3 from Alibaba got a mid-cycle update on July 22, 2025. The 235B-A22B model, in its non-thinking mode, now beats Kimi K2 and DeepSeek-V3 on GPQA knowledge, AIME25 math, LiveCodeBench coding, Arena-Hard human preference, and BFCL agent tasks. Context window is 256K. Alibaba also pushed it to Hugging Face and their ModelScope platform the same day. No press release theatrics. Just a model swap and a benchmark table.

ByteDance’s Doubao 2.0 launched on February 14, 2026. The Pro version claims gold-level scores on IMO, CMO, and ICPC, plus a 54.2 on HLE-Text, which ByteDance says is the highest score on that benchmark. Multimodal understanding, real-time video stream analysis, and agent capabilities are all bundled in. Pricing is 3.2 yuan per million input tokens and 16 yuan per million output. That is roughly $0.45 and $2.25 at current rates. The Lite version is 0.6 yuan per million input tokens and outperforms Doubao 1.8, which was ByteDance’s own flagship two months prior.
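A quick sanity check on those conversions, as a minimal sketch. The exchange rate below is not a quoted market rate; it is simply the rate implied by the article’s own yuan and dollar figures, and the dollar price for the Lite tier is my extrapolation from it.

```python
def implied_rate(price_cny: float, price_usd: float) -> float:
    """Yuan-per-dollar rate implied by a price quoted in both currencies."""
    return price_cny / price_usd


def to_usd(price_cny: float, cny_per_usd: float) -> float:
    """Convert a yuan price to dollars at the given rate."""
    return price_cny / cny_per_usd


if __name__ == "__main__":
    # Doubao 2.0 Pro prices per million tokens, as quoted in the article.
    rate_from_input = implied_rate(3.2, 0.45)
    rate_from_output = implied_rate(16.0, 2.25)
    print(f"implied rate (input):  {rate_from_input:.2f} CNY/USD")
    print(f"implied rate (output): {rate_from_output:.2f} CNY/USD")

    # Lite tier: 0.6 yuan per million input tokens, converted at that rate.
    lite_usd = to_usd(0.6, rate_from_input)
    print(f"Doubao 2.0 Lite input: ${lite_usd:.3f} per million tokens")
```

Both Pro prices imply the same rate, about 7.11 yuan to the dollar, so the two conversions are at least internally consistent. At that rate the Lite tier lands near $0.08 per million input tokens.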

What ties these four together? They are all available as open weights or through a public API, they all cost a fraction of Western equivalents, and they are all competing on agentic capabilities, not just chat quality. The race has moved from “can it write a poem” to “can it fix a codebase, browse the web, and coordinate 100 sub-agents without human hand-holding.” Chinese labs are answering yes, with receipts.

There are caveats. None of these labs publish full training data or source code. The benchmark scores come from their own reports or third-party evals, and I have not independently reproduced them. Some of the more eye-popping claims, like DeepSeek’s 99.2% on HMMT, need external verification. And the licensing on Kimi-Dev-72B sparked a minor controversy when it emerged the model was fine-tuned on Qwen 2.5-72B, raising questions about whether Qwen’s commercial restrictions applied. Qwen’s team called it a “historical legacy issue” and promised Qwen3 would go fully Apache 2.0.

Still. The direction is clear. In late 2024, Chinese open-source models accounted for about 1.2% of global usage. By late 2025, estimates put that near 30%, roughly a 25-fold increase. LMArena, the crowdsourced blind-test leaderboard, had Kimi K2, DeepSeek R1, and Qwen3 in the top three open-source slots at various points last year.

Washington’s chip restrictions were supposed to create a moat. Instead, they created a discount. Chinese labs learned to do more with less, and now they are giving it away for pennies. The real question is not whether these models are good. It is whether OpenAI and Anthropic can justify their pricing when $0.14 per million input tokens gets you within shouting distance of the same capability.