DeepSeek and Moonshot drop massive new models, and the parameter wars are back

China’s AI labs have spent the last month one-upping each other with releases so large they sound like typos. DeepSeek dropped a 1.6-trillion-parameter model. Moonshot followed with a 1-trillion-parameter multimodal agent. StepFun put out a 196-billion-parameter “flash” model back in February that still looks competitive. If you thought the race for scale was over, it isn’t. It just moved to a different stadium.

DeepSeek-V4: 1.6T parameters, one million tokens of context

DeepSeek released a preview of its V4 series in late April 2026. The headline model, DeepSeek-V4-Pro, has 1.6 trillion total parameters with 49 billion activated per token. It supports a context length of one million tokens. That is not a marketing rounding error. The company says that at 1M tokens, V4-Pro needs only 27 percent of the single-token inference FLOPs and 10 percent of the KV cache that DeepSeek-V3.2 does.
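Percentages like these are easier to compare as reduction factors. The snippet below is only a back-of-envelope conversion of the two figures DeepSeek reports; the function and variable names are my own, and nothing here measures the model.

```python
# Fractions DeepSeek reports for V4-Pro relative to V3.2 at 1M tokens.
reported_fraction = {
    "single-token inference FLOPs": 0.27,
    "KV cache": 0.10,
}


def reduction_factor(fraction: float) -> float:
    """Turn 'uses this fraction of the baseline' into 'N times less than the baseline'."""
    return 1.0 / fraction


reductions = {name: reduction_factor(f) for name, f in reported_fraction.items()}

for name, factor in reductions.items():
    print(f"{name}: about {factor:.1f}x less than V3.2")
```

In other words, the claim amounts to roughly a 3.7x cut in per-token compute and a 10x cut in KV cache memory at full context length.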

The architecture mixes Compressed Sparse Attention and Heavily Compressed Attention, plus something called Manifold-Constrained Hyper-Connections. DeepSeek also trained with the Muon optimizer. The model was pre-trained on more than 32 trillion tokens and post-trained with a two-stage pipeline: domain-specific experts are first cultivated via supervised fine-tuning (SFT) and GRPO-based reinforcement learning, then unified into a single model through on-policy distillation.

On benchmarks, DeepSeek-V4-Pro-Max scores 87.5 on MMLU-Pro, 90.1 on GPQA Diamond, and 93.5 on LiveCodeBench, and it reaches a Codeforces rating of 3206. It hits 80.6 percent on SWE-Bench Verified and 83.5 on MRCR 1M. Those numbers place it near or above GPT-5.4 xHigh and Claude Opus 4.6 on several coding and reasoning tasks, though it still trails Gemini-3.1-Pro High on MMLU-Pro and SimpleQA-Verified. A smaller variant, DeepSeek-V4-Flash, has 284B total parameters and 13B activated. It runs faster and cheaper, and its Max reasoning mode gets within shouting distance of the Pro on several math benchmarks.

My gut says the 1M context claim is the real story here. Long-context benchmarks are easy to game, but DeepSeek is publishing CorpusQA and MRCR numbers at 1M tokens. If those hold up in real use, this is a genuine leap for document analysis and for work across large codebases.

Kimi K2.6: Moonshot bets on agent swarms and coding

Moonshot AI open-sourced Kimi K2.6 in May 2026. It is a 1 trillion parameter MoE model with 32 billion parameters activated per token. The pitch is not raw scale. It is long-horizon coding, coding-driven design, and agent swarms. Moonshot claims K2.6 can scale horizontally to 300 sub-agents executing 4,000 coordinated steps in a single run.

K2.6 scores 89.6 on LiveCodeBench v6, 80.2 on SWE-Bench Verified, 58.6 on SWE-Bench Pro, and 66.7 on Terminal-Bench 2.0. On agentic tasks, it hits 83.2 on BrowseComp, 92.5 F1 on DeepSearchQA, and 54.0 on HLE-Full with tools. Those agent numbers are ahead of Kimi K2.5 by a wide margin. The model also carries a vision encoder, MoonViT, with 400M parameters, and supports a 256K context window.

I tried the Kimi Code terminal agent briefly. It is fast, and its multi-file edit suggestions are less brittle than K2.5’s. The “agent swarm” hype is harder to verify. 300 sub-agents sounds impressive until you ask who is paying for the inference bill. Still, the coding benchmarks are real, and the gap between K2.5 and K2.6 on Terminal-Bench 2.0 and SWE-Bench Pro is substantial.
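To put a rough shape on that inference bill, here is a minimal estimator. Only the 4,000-step figure comes from Moonshot's claim, and I am reading it as the total across the whole swarm, which is an assumption. The tokens-per-step and price values are placeholders I made up, not Moonshot figures; swap in your own.

```python
from dataclasses import dataclass


@dataclass
class SwarmRun:
    total_steps: int               # coordinated steps in one run (4,000 per Moonshot's claim)
    tokens_per_step: int           # HYPOTHETICAL average tokens processed per step
    usd_per_million_tokens: float  # HYPOTHETICAL blended input/output price

    def total_tokens(self) -> int:
        return self.total_steps * self.tokens_per_step

    def cost_usd(self) -> float:
        return self.total_tokens() / 1_000_000 * self.usd_per_million_tokens


# Placeholder inputs: 20K tokens per step at $1 per million tokens.
run = SwarmRun(total_steps=4_000, tokens_per_step=20_000, usd_per_million_tokens=1.0)
print(f"{run.total_tokens():,} tokens, roughly ${run.cost_usd():,.2f} per run")
```

The point is the scaling, not the dollar figure: cost grows linearly with steps and with context carried per step, so a swarm that re-reads a large codebase at every step gets expensive quickly.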

Step 3.5 Flash: the efficiency dark horse

StepFun released Step 3.5 Flash in February 2026, and it deserves a mention because it is so much smaller yet so close on several benchmarks. It has 196B total parameters and only 11B activated per token. It uses multi-token prediction to hit 100 to 300 tok/s in typical usage, peaking at 350 tok/s for coding. StepFun claims the decoding cost at 128K context on Hopper GPUs is 6x lower than DeepSeek-V3.2 and 18.9x lower than Kimi K2.5.
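A quick sketch of what those throughput and cost figures imply. The 2,000-token response length is a hypothetical I picked for illustration, and I am reading "6x lower" and "18.9x lower" as cost ratios of 6:1 and 18.9:1 against Step 3.5 Flash, which is my interpretation of StepFun's wording.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock decode time at a steady generation rate."""
    return num_tokens / tokens_per_second


RESPONSE_TOKENS = 2_000  # HYPOTHETICAL response length

# Rates quoted by StepFun: typical range and coding peak.
for rate in (100, 300, 350):
    secs = seconds_to_generate(RESPONSE_TOKENS, rate)
    print(f"{rate} tok/s -> {secs:.1f} s for {RESPONSE_TOKENS} tokens")

# Relative decoding cost at 128K context, normalized so Step 3.5 Flash = 1.0.
relative_cost = {"Step 3.5 Flash": 1.0, "DeepSeek-V3.2": 6.0, "Kimi K2.5": 18.9}

# Implied ratio between the two larger models under the same reading.
k25_vs_v32 = relative_cost["Kimi K2.5"] / relative_cost["DeepSeek-V3.2"]
print(f"Kimi K2.5 vs DeepSeek-V3.2: about {k25_vs_v32:.2f}x")
```

Under that reading, the same claim also implies Kimi K2.5 costs roughly three times as much to decode as DeepSeek-V3.2 at 128K context.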

On SWE-Bench Verified, Step 3.5 Flash scores 74.4. On Terminal-Bench 2.0, it gets 51.0. On AIME 2025, it scores 97.3. Those are not flagship numbers, but they are flagship-adjacent for a model with 11B active parameters. If the efficiency claims are accurate, this is the model Chinese startups will actually deploy.

What this means

Three trends are visible. First, MoE is now the default architecture for Chinese frontier labs. Dense models at this scale are too expensive to train and serve. Second, the benchmark targets have shifted from MMLU to coding, agentic tool use, and long-context retrieval. Third, the gap between open-weight Chinese models and closed Western frontier models is narrowing on coding and reasoning, though knowledge-heavy tasks like SimpleQA still favor Gemini and GPT.
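The MoE point is easiest to see in the ratio of activated to total parameters. This snippet just divides the figures quoted above for each model; the dictionary layout and names are mine.

```python
# (total parameters, activated parameters per token), both in billions,
# as quoted in this article.
models = {
    "DeepSeek-V4-Pro": (1600, 49),
    "DeepSeek-V4-Flash": (284, 13),
    "Kimi K2.6": (1000, 32),
    "Step 3.5 Flash": (196, 11),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active_b / total_b


fractions = {name: active_fraction(t, a) for name, (t, a) in models.items()}

for name, frac in sorted(fractions.items(), key=lambda item: item[1]):
    print(f"{name:<18} {frac * 100:.1f}% of parameters active per token")
```

All four models activate well under 6 percent of their weights per token, which is why a 1.6T model can be served at something closer to the per-token compute of a 49B dense one, though the full weights still have to sit in memory.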

One caveat: these are mostly self-reported numbers. I have not seen independent Chatbot Arena (LMSYS) data for V4-Pro or K2.6 yet. The models are fresh. Give the community a couple of weeks to stress-test them before you rewrite your stack.