A recent evaluation by a U.S. government body suggests that China's leading artificial intelligence models trail their U.S. counterparts by approximately eight months, with the gap expected to widen over time. Experts, however, question the methodology behind that estimate.
The Center for AI Standards and Innovation (CAISI), part of NIST, published its assessment of DeepSeek V4 Pro on May 1, stating that the model lags the state of the art by about eight months. Although CAISI calls it the most advanced Chinese AI model it has assessed to date, the agency's evaluation methodology diverges from typical practice. Instead of averaging benchmark scores, CAISI uses Item Response Theory (IRT), a statistical method borrowed from standardized testing, to estimate each model's latent capability from its problem-solving performance across nine benchmarks in five areas: cybersecurity, software engineering, natural sciences, abstract reasoning, and mathematics.
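To illustrate the general idea (not CAISI's exact implementation, which has not been published), here is a minimal sketch of a Rasch-style one-parameter IRT fit in Python. The toy response matrix, model labels, and regularization term are assumptions for demonstration only.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: rows = models, columns = benchmark problems (1 = solved).
# Entirely hypothetical; CAISI's actual response data is not public.
responses = np.array([
    [1, 1, 1, 1, 0, 1],  # "model_a"
    [1, 1, 1, 0, 0, 0],  # "model_b"
    [1, 1, 0, 0, 0, 0],  # "model_c"
])
n_models, n_items = responses.shape

def neg_log_likelihood(params):
    # theta = latent model abilities, b = per-problem difficulties.
    theta, b = params[:n_models], params[n_models:]
    # Rasch model: P(solve) rises with ability, falls with difficulty.
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    ll = responses * np.log(p) + (1 - responses) * np.log(1 - p)
    # Small ridge penalty pins down the scale (the model is otherwise
    # shift-invariant) and keeps estimates finite for extreme items.
    return -ll.sum() + 1e-2 * np.sum(params**2)

fit = minimize(neg_log_likelihood, x0=np.zeros(n_models + n_items))
print("Estimated abilities:", np.round(fit.x[:n_models], 2))
```

The key property is that solving a hard problem moves a model's ability estimate more than solving an easy one. CAISI then maps such latent-ability estimates onto an Elo-like scale; that conversion step is omitted here.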
According to the IRT-estimated Elo scores, GPT-5.5 leads with 1,260 points, followed by Anthropic's Claude Opus 4.6 at 999 points. DeepSeek V4 Pro trails with approximately 800 (±28) points, placing it near the older GPT-5.4 mini at 749 points. Under CAISI's framework, this puts DeepSeek closer to an earlier generation of GPT models than to Opus.
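For context, if CAISI's Elo-like scores follow the conventional Elo scaling (an assumption on my part; the report does not spell this out), the point gaps translate into expected head-to-head "win" rates as follows:

```python
def elo_win_prob(r_a: float, r_b: float) -> float:
    """Expected win probability for A under the standard 400-point Elo logistic."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# Under conventional scaling, a ~460-point lead (GPT-5.5 at 1,260 vs.
# DeepSeek V4 Pro at ~800) implies the stronger model would be expected
# to come out ahead on a given item roughly 93% of the time.
print(f"{elo_win_prob(1260, 800):.2%}")  # ~93%
```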
CAISI's scoring system mirrors standardized test evaluation: problems are weighted by their difficulty and by which models solve them, producing relative capability estimates across competing models. The results cannot be fully replicated because two of the nine benchmarks are not public; on one of these, a cybersecurity test, GPT-5.5 significantly outperformed DeepSeek, scoring 71% versus 32%.
Public benchmarks present a different picture. On GPQA-Diamond, a PhD-level science reasoning task, DeepSeek scored 90%, narrowly trailing Opus 4.6's 91%. On math olympiad evaluations (OTIS-AIME-2025, PUMaC 2024, and SMT 2025), DeepSeek achieved 97%, 96%, and 96%. On SWE-Bench Verified, which tests fixes for real GitHub bugs, DeepSeek resolved 74% of issues compared to GPT-5.5's 81%. In its own report, DeepSeek claims parity with Opus 4.6 and GPT-5.4.
For its cost-effectiveness comparison, CAISI excluded U.S. models that either performed significantly worse than DeepSeek or cost more per token, leaving GPT-5.4 mini as the only comparable model. Even so, DeepSeek was the most economical choice on five of seven benchmarks, undercutting even OpenAI's smallest offering.
Criticism of CAISI's methodology does not automatically vindicate DeepSeek, but some in the field reject the gap framing outright. The AI developer Eric, who posts as @Ex0byt, disputed the findings in a May 2, 2026 post: "There's no 'gap', and no one's 8 months behind. We've been trolled on every closed U.S drop and flexed with open weights."
The Artificial Analysis Intelligence Index v4.0, which measures frontier model intelligence across ten evaluations, shows OpenAI near 60 points and DeepSeek in the low 50s as of May 2026, a gap notably smaller than it was a year earlier on standardized benchmarks.
When DeepSeek first appeared in January 2025, it raised questions about whether China had caught up to U.S. AI development, and U.S. labs raced to respond. The Stanford 2026 AI Index, released April 13, shows the gap between Claude Opus 4.6 and Dola-Seed-2.0 Preview narrowing to just 2.7%. CAISI says a comprehensive methodology write-up is forthcoming.