Peking University Unveils SUPERChem Benchmark, GPT-5 Achieves 38.5% in Chemistry Reasoning

A research team at Peking University has introduced SUPERChem, a multimodal, high-difficulty benchmark designed to assess the chemistry reasoning capabilities of large language models (LLMs). The benchmark aims to address limitations in existing chemistry evaluations, which often lack systematic assessment of deep reasoning and multimodal understanding.
Initial tests on SUPERChem showed that even the strongest model evaluated, GPT-5 (High), reached only 38.5% accuracy, slightly below the 40.3% scored by junior undergraduate chemistry students at Peking University. The results indicate that current LLMs remain weak at advanced chemical reasoning.
Developing a New Chemistry Benchmark
The development of SUPERChem involved collaboration between Peking University's College of Chemistry and Molecular Engineering, the Computing Center, the School of Computer Science, and Yuanpei College. The team sought to create a system for evaluating LLMs' chemical reasoning, moving beyond basic question-answering to complex, multi-step reasoning.
To construct the benchmark, the researchers drew on the expertise of top-tier undergraduate and graduate students from Peking University's chemistry department. Nearly a hundred teachers and students developed the SUPERChem question bank through a three-stage process of question writing, solution drafting, and strict validation.
Questions were sourced from non-public exams and professional literature, adapted with anti-leakage designs to prevent models from relying on memorized information. To account for the multimodal nature of chemical information, datasets with interleaved text and images, as well as pure text versions, were provided.
SUPERChem currently includes 500 expert-selected questions across four core chemical domains: structure and properties, chemical reactions and synthesis, chemical principles and calculations, and experimental design and analysis.
To evaluate the LLM's thought process, SUPERChem introduced the Reasoning Path Fidelity (RPF) metric. This metric assesses the consistency between a model's chain of thought and detailed, expert-written solutions, including key checkpoints.
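The article does not give the formula behind RPF, so the following is only a minimal sketch of the general idea: score a model's reasoning by the fraction of expert checkpoints it covers. The `Checkpoint` class, the keyword matching, and the toy chemistry example are all assumptions made for illustration, not the SUPERChem implementation, which presumably uses a more robust comparison than substring search.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """One key step of an expert solution (hypothetical structure)."""
    name: str
    keywords: tuple[str, ...]  # phrases assumed to signal that the step was made


def checkpoint_hit(reasoning: str, checkpoint: Checkpoint) -> bool:
    """Return True if any keyword of the checkpoint appears in the reasoning."""
    text = reasoning.lower()
    return any(keyword.lower() in text for keyword in checkpoint.keywords)


def path_fidelity(reasoning: str, checkpoints: list[Checkpoint]) -> float:
    """Fraction of expert checkpoints found in the model's reasoning (0.0 to 1.0)."""
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    hits = sum(checkpoint_hit(reasoning, c) for c in checkpoints)
    return hits / len(checkpoints)


# Toy example: the question, keywords, and reasoning text are illustrative only.
expert_path = [
    Checkpoint("identify nucleophile", ("nucleophile",)),
    Checkpoint("recognize SN2 mechanism", ("sn2", "backside attack")),
    Checkpoint("predict inversion", ("inversion",)),
    Checkpoint("name the product", ("(r)-2-butanol",)),
]
model_reasoning = (
    "Hydroxide acts as the nucleophile and displaces bromide by backside attack, "
    "so the stereocenter undergoes inversion."
)
score = path_fidelity(model_reasoning, expert_path)
print(f"Checkpoint coverage: {score:.2f}")
```

In this toy case the reasoning covers three of the four checkpoints, so a model could land on the right answer while still skipping steps an expert would consider essential, which is the kind of gap a process-level metric is meant to expose.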
Evaluation Findings
The evaluation showed that SUPERChem is both challenging and discriminative. In a closed-book test, junior undergraduate chemistry students at Peking University achieved 40.3% accuracy. Among the LLMs tested, the top performer, GPT-5 (High), achieved 38.5%, which puts its chemical reasoning ability roughly on par with that of a junior undergraduate.
Analysis with the RPF metric showed that the quality of the reasoning process varied among models. Gemini-2.5-Pro and GPT-5 (High) combined higher accuracy with reasoning that tracked the expert solution paths more closely. DeepSeek-V3.1-Think reached similar accuracy but a lower RPF score, indicating a tendency toward heuristic reasoning.
The study also explored the impact of multimodal information. For models with strong reasoning capabilities, such as Gemini-2.5-Pro, image input could improve accuracy. For models with weaker reasoning abilities, such as GPT-4o, visual information sometimes introduced interference. This suggests that, in scientific tasks, the input modality should be matched to the model's capabilities.
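One simple way to quantify such a modality effect is to score the same questions twice, once as pure text and once with images, and compare the accuracies. The sketch below assumes per-question correctness flags are available; the function names and the toy results are hypothetical and do not reproduce any figures from the study.

```python
def accuracy(results: list[bool]) -> float:
    """Share of questions answered correctly."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def modality_gain(text_only: list[bool], with_images: list[bool]) -> float:
    """Accuracy change in percentage points when images are added.

    Both lists must cover the same questions in the same order.
    """
    if len(text_only) != len(with_images):
        raise ValueError("both runs must cover the same questions")
    return 100 * (accuracy(with_images) - accuracy(text_only))


# Made-up correctness flags for two hypothetical models on five questions.
strong_model_gain = modality_gain(
    text_only=[True, False, True, False, False],
    with_images=[True, True, True, False, False],
)
weak_model_gain = modality_gain(
    text_only=[True, False, False, False, False],
    with_images=[False, False, False, False, False],
)
print(f"strong model: {strong_model_gain:+.1f} points")
print(f"weak model: {weak_model_gain:+.1f} points")
```

A positive value means the images helped and a negative value means they interfered, matching the two patterns described above.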
Reasoning breakpoint analysis revealed that LLM failures were concentrated in advanced chemical reasoning tasks, including product structure prediction, reaction mechanism identification, and structure-activity relationship analysis. This indicates that current LLMs still face challenges in core tasks involving reactivity and molecular structure understanding.
The SUPERChem project was led by Zhao Zehua, Huang Zhixian, Li Junren, and Lin Siyu from Peking University. The team included 77 doctoral students and senior undergraduates, among them 3 International Chemistry Olympiad (IChO) award winners and 64 China Chemistry Olympiad (CChO) final award winners. The human baseline test involved 174 junior undergraduate students from Peking University's chemistry department. The project was guided by Professors Pei Jian, Gao Zhen, Ma Hao, and Yang Tong.