China’s AI Model Arms Race: Alibaba, Baidu, and Moonshot Drop Big Numbers

China’s AI labs have spent the last six months one-upping each other on parameter counts. The results are real. The hype is also real. Here is what actually shipped.

Alibaba goes all-in at Cloud Computing Conference 2025

Alibaba Cloud held its annual conference in Hangzhou on September 24, 2025. CEO Wu Yongming announced the company is pushing 380 billion yuan into AI infrastructure and plans to spend even more. The headline drop was Qwen3-Max, a trillion-parameter flagship trained on 36 trillion tokens. Alibaba claims it hit 69.6 on SWE-Bench Verified for coding and 74.8 on Tau2-Bench for agent tool use, surpassing Claude Opus 4 and DeepSeek-V3.1 on the latter. It also scored full marks on AIME 25 and HMMT math tests, which Alibaba says is a first for a domestic model.

My gut says take the “full marks” claim with a grain of salt. AIME and HMMT are hard, but perfect scores on both from a single model smell like cherry-picked runs or specific prompting setups. I have not seen independent verification yet.

Alibaba also released Qwen3-VL for vision, Qwen3-Coder with 256K context, Qwen3-Omni as a native multimodal model, and Wan2.5-Preview for video generation with native audio sync. The open-source count is now 300-plus models, 600 million downloads, and 170,000 derivative models. That last number is not nothing. It means Qwen has become the default base for a lot of builders.

Baidu’s Ernie 5.0: 2.4 trillion parameters, less than 3% active

Baidu dropped Ernie 5.0 on November 13, 2025. It is a native multimodal model with 2.4 trillion total parameters and a super-sparse MoE architecture that activates under 3% of them per forward pass. Baidu says it scored 1432 on the LMArena text leaderboard, placing it second globally and first in China, tied with GPT-4.5-preview and Claude Opus 4.1.

The native multimodal pitch matters. Baidu is claiming it trained text, image, video, and audio together from the start, not by stitching separate models together later. The demo videos show it locating specific seconds in a two-hour video, spotting facial expressions in diving footage, and calculating prices from low-res market stall clips. The preview version only outputs text and images so far; audio and video generation are promised for the full release.

2.4 trillion sounds absurd. But with sub-3% activation, the inference cost is manageable. Whether the quality holds across all modalities is the real question. I have not tested it myself.
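
To put the sub-3% figure in concrete terms, here is a back-of-the-envelope sketch in Python. It uses only Baidu's two headline numbers. The function name is my own, and 3% is an upper bound rather than a measured value, so treat the result as a ceiling.

```python
# Figures quoted by Baidu for Ernie 5.0 (as reported above).
TOTAL_PARAMS = 2.4e12          # 2.4 trillion total parameters
MAX_ACTIVE_FRACTION = 0.03     # "under 3%" active per forward pass (upper bound)


def active_params_upper_bound(total_params: float, max_fraction: float) -> float:
    """Return the largest number of parameters that could be active per pass."""
    return total_params * max_fraction


bound = active_params_upper_bound(TOTAL_PARAMS, MAX_ACTIVE_FRACTION)
print(f"Active parameters per forward pass: under {bound / 1e9:.0f}B")
```

That ceiling is about 72 billion parameters per forward pass, which is why the per-token compute looks more like a large dense model than a multi-trillion one. The full set of weights still has to be stored and served, so this only speaks to compute, not memory.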

Moonshot’s Kimi K2: trillion parameters, open weights, and a content pivot

Moonshot AI released Kimi K2 in July 2025. It has 1.04 trillion total parameters, 32 billion active, and a 128K context window. It was trained on 15.5 trillion tokens. The weights are fully open. Moonshot also published a technical report where Kimi K2 itself is listed as a co-author, which is either clever marketing or a sign of where the industry is heading.

The model topped the LMArena open-source rankings at launch. Independent tests from 302.AI in November 2025 gave K2 Thinking a weighted total of 27.94, second only to Claude Sonnet 4.5 at 28.66. It beat Claude and GPT-5 on a geometric sequence reasoning task. On coding, it produced pretty UIs but sometimes missed the underlying audio synthesis logic. The price is aggressive: $0.575 per million input tokens, about 19% of Claude Sonnet 4.5’s rate.
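
For a sense of what that price gap means on a bill, here is a rough Python sketch. It uses only the two numbers above ($0.575 and "about 19%"), so the Claude Sonnet 4.5 rate is implied rather than quoted, and the 50-million-token workload is a hypothetical I picked for illustration. It covers input tokens only, since that is the only price given here.

```python
# Numbers taken from the paragraph above.
K2_INPUT_PRICE = 0.575       # USD per million input tokens
K2_SHARE_OF_SONNET = 0.19    # "about 19%" of Claude Sonnet 4.5's rate

# Hypothetical workload, chosen only for illustration.
WORKLOAD_TOKENS = 50_000_000


def implied_rate(price: float, share: float) -> float:
    """Back out the comparison rate implied by 'price is share of it'."""
    return price / share


def input_cost(tokens: int, rate_per_million: float) -> float:
    """Cost in USD for a given number of input tokens."""
    return tokens / 1_000_000 * rate_per_million


sonnet_rate = implied_rate(K2_INPUT_PRICE, K2_SHARE_OF_SONNET)
k2_bill = input_cost(WORKLOAD_TOKENS, K2_INPUT_PRICE)
sonnet_bill = input_cost(WORKLOAD_TOKENS, sonnet_rate)

print(f"Implied Sonnet 4.5 input rate: ${sonnet_rate:.2f} per million tokens")
print(f"Kimi K2 input bill:    ${k2_bill:.2f}")
print(f"Sonnet 4.5 input bill: ${sonnet_bill:.2f}")
```

The implied rate comes out at roughly $3 per million input tokens. Because "about 19%" is rounded, the comparison bill is an estimate, not a quote.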

Here is the context most people miss. Moonshot is bleeding users. After DeepSeek’s January 2025 surge, Kimi’s monthly active users dropped to 18 million by March 2025, against DeepSeek’s 190 million and Tencent Yuanbao’s 40 million. Moonshot is now testing a content community inside Kimi with AI-generated posts and invited channel accounts. It is also cutting prices and partnering with Huawei on the Pura X handset. The K2 release is a technical win, but Moonshot is still searching for a business model.

DeepSeek keeps shipping quietly

DeepSeek released V3.1 on August 19, 2025. It is a 685B-parameter hybrid reasoning model with a 128K context window, double the previous 64K. On the Aider programming benchmark it scored a 71.6% pass rate, the highest among non-reasoning models, at roughly $0.0045 per test case. That is about 68 times cheaper than Claude Opus for comparable results.
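
The 68x figure is easier to feel when scaled up. A minimal Python sketch, assuming the two numbers above are taken at face value: the Claude Opus per-case cost is derived from the multiplier, not quoted from a price sheet, and the 1,000-case run is a hypothetical size, not the actual size of the Aider suite.

```python
# Numbers taken from the paragraph above.
DEEPSEEK_COST_PER_CASE = 0.0045   # USD per test case, approximate
CHEAPER_FACTOR = 68               # "about 68 times cheaper"


def implied_comparison_cost(cost_per_case: float, factor: float) -> float:
    """Per-case cost of the pricier model implied by the multiplier."""
    return cost_per_case * factor


def run_costs(n_cases: int) -> tuple[float, float]:
    """Return (DeepSeek total, implied Claude Opus total) in USD."""
    opus_per_case = implied_comparison_cost(DEEPSEEK_COST_PER_CASE, CHEAPER_FACTOR)
    return n_cases * DEEPSEEK_COST_PER_CASE, n_cases * opus_per_case


# Hypothetical run of 1,000 test cases.
deepseek_total, opus_total = run_costs(1000)
print(f"DeepSeek V3.1: ${deepseek_total:.2f}")
print(f"Claude Opus (implied): ${opus_total:.2f}")
```

Under those assumptions the implied Claude Opus cost is about $0.31 per test case, so a thousand cases is a few dollars on one side and a few hundred on the other.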

DeepSeek does not do launch events. It posts a changelog and lets the benchmarks talk. That approach has worked better than anyone expected.

The real story

China’s labs are now competing on three axes at once: raw benchmark scores, parameter-count theater, and inference cost. Alibaba has the broadest model family and the biggest cloud to run it on. Baidu is betting everything on native multimodal architecture. Moonshot is trying to stay relevant with open weights and low prices. DeepSeek is still the one everyone else is chasing.

The parameter counts are getting silly. Trillion-parameter models are becoming normal. What matters is whether anyone can turn them into products people pay for. So far, the answer is mostly “not yet.”