GPT-5.2 Faces Widespread Criticism and Underperforms Against Gemini 3 Pro in Benchmarks

Victor Zhang
[Image: GPT-5.2 vs. Gemini 3 Pro benchmark comparison, showing GPT-5.2 underperforming amid OpenAI's "red alert"]

OpenAI's GPT-5.2 has drawn significant negative feedback since its release: complaints across the internet and third-party evaluations indicate it falls short of expectations, particularly when compared to Google's Gemini 3 Pro.

Epoch AI's latest report gives GPT-5.2 an Epoch Capability Index (ECI) score of 152, placing it second behind Gemini 3 Pro. Across individual benchmarks, GPT-5.2 did not achieve broad dominance. On the FrontierMath challenge, for instance, it excelled only on Tiers 1-3, while Tier 4 remained Gemini 3's strength. And although GPT-5.2 took the top score on Chess Puzzles, it scored worse than GPT-5.1 on SimpleQA Verified, suggesting the new iteration has regressed in factual reliability.

Performance Shortcomings

Multiple third-party benchmark evaluations, including OCR Arena, SimpleBench, and LiveBench, show GPT-5.2 performing below expectations and failing to surpass Gemini 3. In some instances, it even ranked behind Claude Opus 4.5.

The community reaction has been largely critical, with developers expressing disappointment. In response, OpenAI reportedly issued a "red alert," prioritizing improvements to ChatGPT, halting internal AGI research, and pausing Sora's development for eight weeks. Despite these measures, the company appears to remain on the back foot. One GPT-5 user quipped that "GPT-5.2 is not far from becoming a rock."

Google co-founder Sergey Brin recently acknowledged the company's "biggest mistake" with AI, saying Google had been "too afraid of AI saying the wrong thing" and consequently missed out on an era. Now, with Gemini 3 Pro and Nano Banana Pro, Google has re-emerged as an AI frontrunner.

Reports from The Information had previously indicated that GPT-5.2, codenamed Garlic, was originally slated to debut early next year. Rumors in Silicon Valley held that pre-training gains at OpenAI had stalled, and that GPT-5.1, which showed minimal improvement, may have been built on a pre-training run dating back to the GPT-4o era. This would suggest OpenAI encountered a scaling bottleneck in pre-training.

Pre-training and Post-training Dynamics

Earlier reports claimed OpenAI had resolved key pre-training issues for GPT-5.2, folding in fixes from the "Shallotpeat" effort and drawing on its accumulated pre-training experience. While official benchmarks suggested some pre-training gains relative to Gemini 3, third-party evaluations and user feedback point to no breakthrough in GPT-5.2's underlying technology.

In another Epoch AI evaluation, this one measuring long-horizon tasks, Gemini 3 Pro led with a 4.9-hour task horizon, compared with 3.5 hours for GPT-5.2 and 2.6 hours for Opus 4.5. Engineer Dan Mac attributed Gemini 3 Pro's deeper general intelligence to Google's strong pre-training, and GPT-5.2's more specialized intelligence to OpenAI's post-training optimization.

Market Position and Future Plans

The New York Times reported that OpenAI plans to keep focusing on ChatGPT optimization in the coming weeks while preparing a larger release for early next year. The company is fighting a "two-front war" across both business-to-business (B2B) and business-to-consumer (B2C) markets, exploring initiatives in advertising and e-commerce. Wary of user complaints, it is investigating "more restrained" monetization methods, such as facilitating purchases through ChatGPT and taking a cut of each transaction. In the enterprise sector, OpenAI is adapting its AI technology for enterprise software.

Data indicates over 800 million weekly ChatGPT users, representing approximately 76% market share. An AI expert noted that "Consumer AI is almost synonymous with OpenAI; if it loses this, the company will not have its current value." However, numerous AI startups globally have developed technologies that rival or surpass OpenAI's leading models in certain aspects. The emergence of Google's Gemini 3 Pro is seen as a significant challenge to OpenAI's business.

Comparative Performance and User Feedback

User testing suggests significant room for improvement for GPT-5.2. Some users have criticized its tone as "cold" and its language as "constantly regressing."

In visual reasoning, Gemini 3 Pro reportedly outperforms GPT-5.2. For 3D model generation, GPT-5.2 is described as slower, more expensive, and generally inferior to Gemini 3. In generating transgressive fiction, GPT-5.2 ranked last behind Gemini 3 Pro, Claude Opus 4.5, and Grok 4. Gemini 3 also showed a significant lead in front-end code generation: in a single-prompt comparison of fitness dashboard homepage designs, GPT-5.2 consistently ranked last behind Gemini 3 and Claude Opus 4.5. Developer Mattia, surveying comments from users of the AI search engine Perplexity, found Gemini 3 to be the preferred choice.

On the betting website Polymarket, most users anticipate Google will have the leading AI model by the end of the year. In the Dubesor benchmark shared by user Lisan al Gaib, Gemini 3 Pro ranked first, while GPT-5.2 came in 16th. The CAIS AI Dashboard from the Center for AI Safety likewise showed Gemini 3 Pro leading on the text and visual capability indices, trailing GPT-5.2 only on the risk index.

In the text capability tests, Gemini 3 Pro fell behind only on ARC-AGI-2, while GPT-5.2 saw a more pronounced decline. In the visual capability tests, Gemini 3 Pro largely prevailed, averaging 4.5 points higher than GPT-5.2. On the risk index, GPT-5.2 led Gemini 3 Pro but trailed Claude Opus 4.5 and Claude Sonnet 4.5. On Terminus, a platform for evaluating language models in terminal environments, Gemini 3.0 Pro and GPT-5.2 were nearly on par, with Gemini 3.0 Pro averaging 0.2% higher in high-reasoning mode. Other benchmarks, such as SWE-Bench and IUMB, likewise show GPT-5.2 underperforming Gemini 3 in several key areas.

Altman's "Christmas Surprise"

On the day GPT-5.2 was released, Altman hinted at a "Christmas gift" coming the following week, possibly the next-generation GPT Image v2 model. Two mysterious AI image models, "Chestnut" and "Hazelnut," have been spotted in testing on the LM Arena platform. Developer testing, however, suggests OpenAI's image model is underperforming at image generation and editing, lagging behind Nano Banana Pro, which is powered by Gemini 3. Its outputs reportedly suffer from yellow tones, poor logic, weak consistency, low image quality, and insufficient world knowledge. The model's base is believed to be GPT-4o.