Hugging Face AI Insight Talk to Feature OCR Innovations

Victor Zhang

As AI systems move beyond text to encompass more complex data types, the field of Optical Character Recognition (OCR) continues to evolve. Hugging Face, in collaboration with OpenMMLab, ModelScope, Zhihu, and Jizhilu, is set to host the sixth session of its AI Insight Talk series, focusing on advanced OCR technologies. This special session will analyze three technical solutions aimed at advancing OCR capabilities from general recognition to specialized analysis, and from single-language to multilingual support. The event will also include a round table discussion with developers.

The live broadcast is scheduled for December 4, 2025, from 20:00 to 22:00 Beijing Time.

Key Points

The session will feature presentations on three distinct OCR advancements:

  • HunyuanOCR: Presented by Li Gengluo, Algorithm Engineer at Hunyuan Vision Large Model, HunyuanOCR is described as a lightweight, commercial-grade, open-source Vision-Language Model (VLM) with 1 billion parameters. It uses a pure end-to-end architecture that pairs a native-resolution ViT with a lightweight Large Language Model (LLM), covering text detection, document parsing, information extraction, Visual Question Answering (VQA), and image-text translation within a single unified framework. Because the model does not depend on the preprocessing stages of a traditional pipeline, it aims to avoid the error accumulation those stages introduce. HunyuanOCR reportedly surpassed commercial APIs and larger models in several evaluations, taking first place in the ICDAR 2025 DIMT small model track and achieving leading performance on OCRBench among models with fewer than 3 billion parameters. The model is now open-sourced on Hugging Face.

  • PaddleOCR-VL: Sun Ting, Senior Engineer at Baidu, will introduce PaddleOCR-VL, a lightweight multimodal document parsing solution supporting 109 languages. Its core component, PaddleOCR-VL-0.9B, is a compact VLM designed for precise element recognition. It integrates a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model. The model aims to recognize complex elements, including text, tables, formulas, and charts, while keeping resource consumption low. On public and internal benchmarks, PaddleOCR-VL reportedly achieves state-of-the-art performance in both page-level document parsing and element-level recognition, remains competitive with other vision-language models, and offers fast inference.

  • MinerU: He Tianyao, Algorithm Engineer at Shanghai AI Lab, will discuss MinerU, an efficient and accurate document parsing technology. MinerU has evolved from a pipeline solution to an end-to-end VLM (version 2.0) and subsequently to a two-stage, decoupled, native-resolution model (version 2.5). The 2.5 iteration aims for finer layout detection and more precise parsing of complex elements. The accompanying OmniDocBench evaluation set covers diverse document scenarios, and MinerU2.5 reportedly scores above 90 points in parsing accuracy on it. The presentation will cover MinerU's technical roadmap, underlying principles, and future prospects.
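
The error-accumulation argument behind the move from pipelines to end-to-end models can be illustrated with a little arithmetic. The sketch below is only an illustration: it assumes stage errors are independent, the function name `pipeline_accuracy` is hypothetical, and the stage names and 95% figures are made-up numbers, not measurements of any model mentioned above.

```python
from math import prod


def pipeline_accuracy(stage_accuracies):
    """Probability that every stage of a cascaded pipeline succeeds.

    Assumes stage errors are independent, which is a simplification:
    real OCR stages often fail on the same hard inputs.
    """
    accuracies = list(stage_accuracies)
    for a in accuracies:
        if not 0.0 <= a <= 1.0:
            raise ValueError("each accuracy must lie in [0, 1]")
    return prod(accuracies)


# Hypothetical per-stage accuracies for a traditional OCR pipeline.
stages = {"detection": 0.95, "recognition": 0.95, "layout analysis": 0.95}

overall = pipeline_accuracy(stages.values())
print(f"cascaded pipeline accuracy: {overall:.3f}")  # prints 0.857
```

Under these assumptions, three stages that are each 95% accurate yield a pipeline that is right only about 86% of the time, which is the kind of compounding loss a single end-to-end model aims to avoid.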

What Comes Next

Following the presentations, a round table discussion is planned, inviting attendees to engage with the authors and community members. The event aims to foster an exchange of ideas and insights within the OCR development community.