Baidu's new open-source model PaddleOCR-VL tops the charts overnight, recognizing 109 languages with the world's top overall score
On October 16th, Baidu open-sourced the multilingual document parsing model PaddleOCR-VL, which has topped the Hugging Face Trending chart for three days in a row.

PaddleOCR-VL recognizes complex elements such as text, tables, formulas, and charts in 109 languages, covering all major world languages as well as Russian, Arabic, and Hindi. On the latest OmniDocBench leaderboard, a benchmark for evaluating the parsing of diverse documents in real-world scenarios, PaddleOCR-VL takes first place globally with a composite score of 92.6, ranking number one on both OmniDocBench v1.5 and OmniDocBench v1.0.
PaddleOCR-VL achieves overall, text, formula, table, and reading order SOTA performance on OmniDocBench v1.5, outperforming existing pipeline tools, generic VLMs, and other specialized document parsing models on all key metrics.

The paper mentions that PaddleOCR-VL achieves optimal performance in document parsing tasks and excels at recognizing complex document elements such as text, tables, formulas, and charts, including challenging content types like handwritten text and historical documents.
In the official handwritten-text example given by Baidu, the handwriting in the image is relatively neat with little illegible text, and the model's recognition results contain few errors.

▲Handwritten text (left), recognition result (right)
The author then uploaded a piece of Su Shi's handwriting. Compared with the image above, it is harder to make out with the naked eye and contains more traditional Chinese characters, so the model's recognition results contain more errors.

▲Handwritten text (top), recognition result (bottom left), and original text from GuruPoetry.com (bottom right)
The core component of this solution, PaddleOCR-VL-0.9B, is built from a NaViT-style visual encoder and the ERNIE-4.5-0.3B language model, delivering fast inference and low resource consumption for practical deployment.
For the training data, the researchers used open-source datasets, synthetic datasets, web-accessible datasets, and internal datasets. They also developed a high-quality training data construction pipeline, collecting more than 30 million training samples through public data collection and data synthesis, and guiding general-purpose large models to perform automatic annotation based on the recognition results of expert models.
Hugging Face open-source address: https://huggingface.co/PaddlePaddle/PaddleOCR-VL
Experience address: https://aistudio.baidu.com/application/detail/98365
I. Accurate recognition of complex formulas and multiple languages, with a few errors on unclear or reflective text
The author tested PaddleOCR-VL's document parsing and element-level recognition. The model achieves high recognition accuracy on English, Chinese, Korean, complex formulas, charts, and more, with occasional errors when images are reflective or unclear.
The author uploaded the first page of the PaddleOCR-VL paper; in the recognition results, the model automatically recognized the links and email addresses and accurately segmented the chart.

Here is a physics question: the model automatically recognizes the slogans in the header section, and the subheadings, diagrams, and complex formulas are all recognized accurately.

Among the element-level recognition capabilities, let's look first at chart recognition: every part of the chart, both content and numbers, is represented clearly and accurately.

For text recognition, the author uploaded Chinese and Korean samples. Below is a handwritten Korean image, which the model recognizes accurately.

For formula recognition, the author uploaded an image containing a formula, and the model identified all the details of the complex formula accurately.

Next, Chinese recognition on an unclear image: there is a crease in the upper-left corner of the bag shown below, and the model mistakenly recognizes the first character, "full", as "gold", while the rest of the text is recognized accurately.

The image below is taken from the side, so the text on the right side is reflective, and the model incorrectly recognizes "文" as "大". However, even with the reflection and a variant form of the character for "thing", the model's recognition is correct, and the English text below is also recognized completely correctly.

II. Previous document recognition technologies have drawbacks; Baidu proposes a document parsing solution based on a vision-language model
The exponential growth in complexity and volume of documents as core information carriers makes document parsing an indispensable and critical technology. The main goal of document parsing is to gain an in-depth understanding of the structure and semantics of the document layout, including recognizing different blocks and columns of text, distinguishing between formulas, tables, charts, and images, determining the correct reading order, and detecting key elements.
However, modern documents are complex, containing dense text, intricate tables and charts, mathematical expressions, multiple languages, and handwritten text. Two technical approaches currently dominate this area: pipeline approaches based on specialized modular expert models, which are hampered by integration complexity, cumulative error propagation, and inherent limitations when dealing with highly complex documents; and end-to-end approaches using multimodal models, which simplify the workflow and enable joint optimization. However, end-to-end methods often struggle to maintain the correct text order, can even hallucinate when faced with lengthy or complex layouts, and introduce significant computational overhead for long output sequences.
Against this backdrop, Baidu researchers introduced PaddleOCR-VL, a high-performance, resource-efficient document parsing solution based on a vision-language model, which combines a layout analysis model with the vision-language model PaddleOCR-VL-0.9B.
First, PaddleOCR-VL performs layout detection and reading order prediction, obtaining the positional coordinates and reading order of text blocks, tables, formulas, charts, and other elements. The paper mentions that PaddleOCR-VL's method offers faster inference, cheaper training, and easier extension to new layout classes than multimodal methods that rely on grounding and sequence outputs.
This scheme then crops the elements according to their locations and feeds them into PaddleOCR-VL-0.9B for recognition. Designed for resource-efficient inference, PaddleOCR-VL-0.9B excels at element recognition in document parsing. It improves recognition capability and decoding efficiency by combining a NaViT-style dynamic high-resolution visual encoder with a lightweight ERNIE-4.5-0.3B language model.

▲Overview of PaddleOCR-VL
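The crop-then-recognize step described above can be sketched as follows. This is a minimal illustration only: the function names, the layout dictionary schema, and the `recognize` callback are hypothetical stand-ins, not PaddleOCR-VL's actual API.

```python
from PIL import Image

# Hypothetical sketch of the crop-then-recognize step. `layout` and
# `recognize` are illustrative stand-ins for PP-DocLayoutV2's output
# and the PaddleOCR-VL-0.9B recognizer; the real API differs.
def parse_page(page: Image.Image, layout, recognize):
    """Crop each detected element and recognize it in reading order."""
    results = []
    # Follow the reading order predicted by the layout stage.
    for block in sorted(layout, key=lambda b: b["order"]):
        crop = page.crop(block["bbox"])  # (x1, y1, x2, y2) in pixels
        results.append({"type": block["type"],
                        "content": recognize(crop, block["type"])})
    return results
```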
To train robust multimodal models, the researchers developed a high-quality training data construction pipeline that collected more than 30 million training samples through public data collection and data synthesis, guiding general-purpose large models to perform automatic annotation based on the recognition results of expert models. Data cleaning was also performed to remove low-quality or inconsistent annotations. In addition, the researchers designed an evaluation engine that classifies each element into finer-grained categories via the evaluation set, and on this basis analyzes the current model's training performance in different scenarios.
Finally, a small number of extreme cases are manually labeled to complete the construction of the training data.
III. Both document parsing and element recognition use a two-stage training scheme, with four types of training data sources
PaddleOCR-VL decomposes the document parsing task into two stages: the first, PP-DocLayoutV2, is responsible for layout analysis, locating semantic regions and predicting their reading order; the second, PaddleOCR-VL-0.9B, uses these layout predictions for fine-grained recognition of the various contents. Finally, a lightweight post-processing module aggregates the outputs of the two stages and formats the final document into structured Markdown and JSON.
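The final aggregation step, serializing recognized elements into Markdown, might look something like this minimal sketch. The element dictionaries and type names here are illustrative assumptions, not PaddleOCR-VL's actual output schema.

```python
# Minimal sketch of the post-processing aggregation: turning recognized
# elements (already sorted into reading order) into Markdown. The dict
# schema and type names are assumptions for illustration only.
def to_markdown(elements):
    parts = []
    for el in elements:
        if el["type"] == "title":
            parts.append("# " + el["content"])
        elif el["type"] == "formula":
            parts.append("$$" + el["content"] + "$$")  # display math
        else:  # plain text, serialized tables, chart descriptions, ...
            parts.append(el["content"])
    return "\n\n".join(parts)
```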
For the layout analysis training scheme, the researchers used the PP-DocLayoutV2 model to perform layout element localization, classification, and reading order prediction. PP-DocLayoutV2 extends RT-DETR (a Transformer-based real-time object detection model) by adding a pointer network that is responsible for predicting the reading order of detected elements.
Its training process adopts a two-stage strategy: the core RT-DETR model is first trained for layout detection and classification, then its parameters are frozen and the pointer network is trained separately for reading order prediction.
In the first stage, the researchers follow RT-DETR's training strategy, initializing the model with PP-DocLayout_Plus-L pre-trained weights and training it for 100 epochs on a self-constructed dataset of more than 20,000 high-quality samples. In the second stage, the model outputs a matrix representing the pairwise ordering relationship between any two elements, computes a generalized cross-entropy loss against the ground-truth labels, and is trained for 200 epochs with a constant learning rate of 2e-4 and the AdamW optimizer.
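To make the pairwise ordering matrix concrete: one simple way to decode a global reading order from such a matrix is to sort elements by how many others each is predicted to precede. This greedy rule is an assumption for illustration, not necessarily PP-DocLayoutV2's actual decoding step.

```python
# Illustrative decoding of a reading order from a pairwise score matrix
# like the one the pointer network outputs. The greedy rule (sort by how
# many elements each one precedes) is an assumption for illustration,
# not necessarily PP-DocLayoutV2's exact decoding procedure.
def decode_reading_order(scores):
    """scores[i][j] > 0.5 means element i is predicted to precede j."""
    n = len(scores)
    # Count, for each element, how many others it is predicted to precede.
    precedes = [sum(1 for j in range(n) if j != i and scores[i][j] > 0.5)
                for i in range(n)]
    # An element that precedes more others should appear earlier.
    return sorted(range(n), key=lambda i: -precedes[i])
```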
As for the training scheme of PaddleOCR-VL-0.9B for element recognition: the model contains three modules, a visual encoder, a projector, and a language model. It adopts a strategy of adapting pre-trained models, with the visual encoder initialized from Keye-VL weights and the language model initialized from ERNIE-4.5-0.3B weights.
Its training is divided into two phases. The first phase focuses on pre-training alignment, where the model learns to associate visual information in an image with the corresponding textual representation, a critical step built on a massive dataset of 29 million high-quality image-text pairs. In the second phase, after pre-training, the model is instruction fine-tuned to adapt its general multimodal understanding to specific downstream element recognition tasks; this phase uses a dataset of 2.7 million samples.

▲Phase 1 and 2 training setups
The researchers used four main sources of data: open source datasets, synthetic datasets, web-accessible datasets, and internal datasets.
After acquiring the raw data, the researchers use an automated annotation process for large-scale labeling. First, the data is processed with the expert model PP-StructureV3 to generate pseudo-labels that may contain errors; then, through prompt engineering, prompts containing the original images and their associated pseudo-labels are constructed and submitted to the more advanced multimodal large language models ERNIE-4.5-VL and Qwen2.5-VL.

▲PaddleOCR-VL-0.9B training data construction process
These models refine and enhance the initial results by analyzing the image content to generate higher-quality labels. Finally, to ensure label quality, the system performs a hallucination filtering step that removes potentially erroneous content generated by the large models.
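The three-step annotation flow above can be sketched as a simple loop. Every function passed in here (`run_expert_model`, `refine_with_vlm`, `is_hallucinated`) is a hypothetical placeholder standing in for PP-StructureV3, the refining VLMs, and the filtering step respectively.

```python
# Hedged sketch of the automated annotation flow described above.
# All three callbacks are hypothetical placeholders, not real APIs:
# run_expert_model ~ PP-StructureV3 pseudo-labeling,
# refine_with_vlm  ~ ERNIE-4.5-VL / Qwen2.5-VL refinement,
# is_hallucinated  ~ the hallucination filtering step.
def build_training_labels(images, run_expert_model, refine_with_vlm,
                          is_hallucinated):
    labeled = []
    for img in images:
        pseudo = run_expert_model(img)          # step 1: expert pseudo-labels
        refined = refine_with_vlm(img, pseudo)  # step 2: VLM refinement
        if not is_hallucinated(img, refined):   # step 3: quality filter
            labeled.append((img, refined))
    return labeled
```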
IV. PaddleOCR-VL achieves SOTA on document parsing test sets
To evaluate the effectiveness of PaddleOCR-VL, the researchers compared its performance on page-level document parsing and element-level recognition.
Starting with page-level document parsing, the researchers evaluated the end-to-end document parsing capabilities of PaddleOCR-VL using three benchmarks: OmniDocBench v1.5, OmniDocBench v1.0, and olmOCR-Bench.
OmniDocBench v1.5 is a comprehensive test set for evaluating document parsing capabilities. PaddleOCR-VL achieves overall, text, formula, table, and reading order SOTA performance on OmniDocBench v1.5, outperforming existing pipeline tools, generic VLMs, and other specialized document parsing models on all key metrics.
Specifically, the PaddleOCR-VL model achieved the highest overall score of 92.56, surpassing the second-ranked MinerU2.5-1.2B (90.67). PaddleOCR-VL set new SOTA scores across subtasks, including the lowest Text-Edit distance and the highest Formula-CDM, Table-TEDS, and Table-TEDS-S scores. The paper mentions that this demonstrates the model's high accuracy in text recognition, formula recognition, and complex table structure analysis.

▲OmniDocBench v1.5 Comprehensive Evaluation of Document Parsing
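The Text-Edit metric above is a normalized edit distance: Levenshtein distance divided by the length of the longer string, so lower is better and 0 means an exact match. A minimal sketch of this standard definition (benchmark implementations may differ in normalization details):

```python
def normalized_edit_distance(pred, ref):
    """Levenshtein distance divided by the longer string's length
    (lower is better; 0.0 means an exact match). This is the standard
    definition; benchmark implementations may differ in detail."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    dp = list(range(n + 1))  # distances for the empty prefix of pred
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + (pred[i - 1] != ref[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, n)
```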
OmniDocBench v1.0 is specifically designed to evaluate real-world document parsing capabilities. PaddleOCR-VL achieves SOTA performance on OmniDocBench v1.0 on almost all metrics across overall, text, formula, table, and reading order.
PaddleOCR-VL achieved an average overall edit distance of 0.115. The model achieved the best SOTA score (0.062) for Chinese text edit distance and a comparable SOTA score (0.041) for English. However, on the English table TEDS metric the model scored only 88; the paper attributes this to spelling-related labeling errors in OmniDocBench v1.0.

▲OmniDocBench v1.0 Comprehensive Evaluation of Document Parsing
In terms of reading order edit distance, the model achieved a best score of 0.063 in Chinese and a comparable SOTA best score of 0.045 in English.
The olmOCR-Bench evaluates tools and models primarily through simple, clear, machine-verifiable unit tests. PaddleOCR-VL achieved the highest overall score of 80.0 ± 1.0 on olmOCR-Bench, leading in ArXiv (85.7) and headers and footers (97.0), and ranking second in multi-column text (79.9) and long tiny text (85.7).

▲olmOCR-Bench Comprehensive Evaluation of Document Parsing
Next is element-level evaluation. In text recognition, PaddleOCR-VL achieves the lowest error rate in almost all categories evaluated by OmniDocBench-OCR-block; on Baidu's internal self-built text evaluation dataset, the model demonstrates high accuracy on both multilingual and text-type metrics.

▲Overall comparison of OmniDocBench-OCR-block performance
Ocean-OCR-Handwritten is a line- and paragraph-level handwriting evaluation dataset. The model achieves the best edit distance of 0.118 in English, performs well on F1 score, precision, recall, BLEU, and METEOR, and achieves an edit distance of 0.034 in Chinese.

▲Comparison of English and Chinese OCR Handwriting Recognition Performance on Ocean-OCR-Bench
For table recognition, PaddleOCR-VL leads the OmniDocBench-Table-block benchmark, surpassing models such as Seed1.6; on Baidu's internal table evaluation dataset, the model achieves the highest scores in overall TEDS, structure TEDS, overall edit distance, and structure edit distance. For formula recognition, the model achieves the best CDM score of 0.9453 on OmniDocBench-Formula-block. For chart recognition, on Baidu's internal dataset, PaddleOCR-VL not only outperforms specialized OCR VLMs but even surpasses some 72B-scale multimodal language models.

▲OmniDocBench-Table-block Performance Comparison
For inference performance, the researchers measured end-to-end inference speed and GPU usage on the OmniDocBench v1.0 dataset, processing PDF documents in batches of 512 on a single NVIDIA A100 GPU. PaddleOCR-VL demonstrates clear and consistent advantages in both processing speed and memory efficiency. Deployed on the vLLM backend, it improves page throughput by 15.8% and token throughput by 14.2% compared to the leading baseline MinerU2.5. In addition, PaddleOCR-VL's GPU memory footprint is about 40% lower than that of dots.ocr.

▲ Comparison of end-to-end inference performance
Conclusion: may accelerate the efficient extraction of complex document information
With PaddleOCR-VL, the researchers have enhanced the model's recognition capability and decoding efficiency, reducing computational requirements while maintaining high recognition accuracy, which makes it well suited for efficient, practical document processing applications.
PaddleOCR-VL's extensive multilingual support and strong performance are expected to drive the application and development of multimodal document processing technology, and may significantly improve the performance and stability of RAG systems, making it more efficient for researchers to extract information from complex documents and thus providing more reliable data support for future AI applications.
Note: This article is sourced from Zhidx (WeChat official account: zhidxcom).
© Copyright notes
The copyright of the article belongs to the author, please do not reprint without permission.