OpenAI's GPT-5.2 Benchmarks Questioned Over Token Usage and Performance Claims
Accusations of "false marketing" have emerged against OpenAI's GPT-5.2, following claims that its benchmark scores were inflated through excessive token usage compared to competitors like Google's Gemini 3.0 Pro. The controversy, which has ignited discussion within the AI community, centers on whether GPT-5.2's reported superior performance reflects genuine advancement or merely a "brute-force computation" advantage.
A user's analysis suggested that OpenAI might have allocated significantly more computational resources to GPT-5.2 during benchmark runs by raising the model's "inference strength" parameter. This adjustment reportedly drove up the number of tokens consumed per task, potentially skewing performance comparisons in GPT-5.2's favor.
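For context, here is a minimal sketch of what such a knob looks like in practice, assuming an OpenAI-style "reasoning_effort" parameter as exposed in the openai Python SDK. The "gpt-5.2" model name and the "xhigh" value are taken from the benchmark reports; whether the public API actually accepts them is an assumption.

```python
# Hypothetical sketch: running one benchmark-style task at a chosen
# "inference strength". The gpt-5.2 model name and the "xhigh" effort
# value come from the benchmark reports and are assumptions here, not
# confirmed API options.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def solve_task(prompt: str, effort: str = "medium") -> tuple[str, int]:
    """Run one task and report how many completion tokens it consumed."""
    response = client.chat.completions.create(
        model="gpt-5.2",          # hypothetical model name
        reasoning_effort=effort,  # e.g. "low", "medium", "high", "xhigh"
        messages=[{"role": "user", "content": prompt}],
    )
    # Higher effort settings reportedly inflate per-task token counts,
    # which is the crux of the benchmarking dispute.
    return response.choices[0].message.content, response.usage.completion_tokens
```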
Benchmark Discrepancies
Charts cited in the analysis indicate that OpenAI used at least twice as many tokens as Gemini 3.0 Pro in benchmark evaluations. In the ARC AGI 2 test, for instance, the GPT-5.2 xhigh variant scored 52.9% while consuming roughly 135,000 tokens per task, a computational cost of about $1.90 per task at API pricing. Google's Gemini 3.0 Pro reached comparable results with roughly 67,000 tokens, about twice the token efficiency.
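As a sanity check, the reported numbers are easy to reproduce. The per-million-token price below is reverse-engineered from the article's roughly $1.90-per-task figure and is an assumption, not a confirmed API rate:

```python
# Back-of-the-envelope check on the reported ARC AGI 2 figures.
GPT52_TOKENS_PER_TASK = 135_000   # reported GPT-5.2 xhigh usage
GEMINI_TOKENS_PER_TASK = 67_000   # reported Gemini 3.0 Pro usage
ASSUMED_PRICE_PER_MTOK = 14.0     # USD per million tokens (assumed)

cost_per_task = GPT52_TOKENS_PER_TASK / 1_000_000 * ASSUMED_PRICE_PER_MTOK
token_ratio = GPT52_TOKENS_PER_TASK / GEMINI_TOKENS_PER_TASK

print(f"GPT-5.2 cost per task: ${cost_per_task:.2f}")  # -> $1.89
print(f"Token usage vs Gemini: {token_ratio:.2f}x")    # -> 2.01x
```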
If computational input were standardized, the two models' performance would appear nearly equivalent. Moreover, even with the larger token budget, GPT-5.2 reportedly trailed in tests such as HLE, MMMU-Pro, Video-MMMU, and Frontier Math Tier 4. On GPQA the models were roughly even, and on Frontier Math Tier 3, GPT-5.2 xhigh scored only 2.7% higher than Gemini 3.0 Pro. The lone clear exception was GDPVal, a test set developed by OpenAI itself.
Ilya Sutskever, a co-founder of OpenAI, previously noted in an interview that current large language models are often optimized for leaderboards, leading to inflated results. This sentiment suggests that the "arms race" in AI benchmarks has moved beyond pure technical competition, with various manufacturers introducing evaluation standards that may favor their own models. Similar questions have been raised about Google's Gemini 2.5 Pro surpassing GPT-5 in the FACTS Benchmark.
User Experience vs. Benchmarks
Beyond the benchmark numbers, user experience with GPT-5.2 has also drawn criticism. Some users reported that the model hallucinated heavily when reviewing code and fell short of expected improvements over GPT-5.1. Other feedback described GPT-5.2 as a regression rather than an upgrade, with some users saying they preferred GPT-4o.
Concerns about "mismatch between goods and description" have been raised, with observations that benchmark tests for GPT-5.1 and GPT-5.2 used high inference strengths, while paid users received access to less powerful versions. The current GPT-5.2 version, with its "xhigh" inference strength in benchmarks, reportedly delivers performance that exceeds the actual experience of ChatGPT paid users.
Internal Shifts at OpenAI
The controversy surrounding GPT-5.2's benchmarks coincides with internal shifts at OpenAI regarding its approach to research and public communication. According to information reviewed by toolmesh.ai, OpenAI has faced internal dissent over prioritizing commercial interests and product promotion over independent academic research, particularly concerning the potential negative impacts of AI.
Tom Cunningham, a key member of OpenAI's economic research team and co-author of a report on AI's impact on industries, resigned, citing increased scrutiny of "negative research." He stated that the team was pressured to "soften the language" or shelve topics exploring how AI might displace white-collar workers. Cunningham's farewell message on Slack described the team as having shifted from rigorous academic research to functioning as the company's "propaganda department."
Other former employees, including policy research director Miles Brundage and Superalignment team member William Saunders, have also departed, expressing concerns about the company's focus on new products and perceived neglect of user risks. Former safety researcher Steven Adler publicly criticized ChatGPT for potentially inducing "mental crises and delusions."
OpenAI's Chief Strategy Officer, Jason Kwon, responded to Cunningham's resignation in a memo, emphasizing the company's responsibility to build solutions rather than solely focusing on problems. This response has been interpreted by some as a directive to prioritize positive narratives about AI's benefits.
The company, which is reportedly aiming for a $1 trillion valuation and preparing for a potential IPO, has received significant investments and made substantial financial commitments. This financial context suggests that "honesty" about AI's potential downsides may be seen as a luxury that could impede commercial objectives.
In contrast, rival Anthropic's CEO, Dario Amodei, has publicly warned that AI could displace half of entry-level white-collar workers by 2030. OpenAI's economic research team is now led by Chris Lehane, a former Clinton advisor and crisis public relations expert, suggesting a strategic shift towards managing public perception of AI's societal impact.
