Qianfan-VL: Domain-Enhanced Universal Vision-Language Models

Domain Capability Enhancement through Continuous Pre-training | 3B to 70B Parameter Scale
Document Understanding & OCR Enhancement | Reasoning Capability Support

Released August 2025 | Baidu AI Cloud Qianfan

Core Features

The Qianfan-VL model series is a family of general-purpose multimodal large models enhanced for enterprise-level applications. The models retain solid general capabilities while being deeply optimized for high-frequency scenarios in industrial deployment, meeting multimodal understanding needs across scenarios through three core features.

Multi-Size Models

Provides 3B, 8B, and 70B model variants to meet different scenario requirements

OCR & Document Understanding Enhancement

Full-scenario OCR recognition and intelligent understanding capabilities, covering documents, natural scenes, and various application scenarios

Reasoning Capability

Supports chain-of-thought capabilities, demonstrating excellent performance in complex scenarios like mathematics and reasoning calculations

Multi-Size Models Meet Different Scenario Requirements

Provides 3B, 8B, and 70B model variants, allowing enterprises and developers of different scales to find suitable solutions

| Model Name | Context Length | Reasoning Support | Application Scenarios |
|---|---|---|---|
| Qianfan-VL-3B | 32k | Not Supported | Edge real-time scenarios, OCR text recognition |
| Qianfan-VL-8B | 32k | Supported | Server-side general scenarios, fine-tuning optimization scenarios |
| Qianfan-VL-70B | 32k | Supported | Offline data synthesis, complex reasoning computation scenarios |

General Capability Benchmark Performance

Comprehensive comparison of Qianfan-VL models of all scales with mainstream models on standard multimodal benchmarks

| Benchmark | Qianfan-VL-3B | Qianfan-VL-8B | Qianfan-VL-70B | InternVL3-8B | InternVL3-78B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
|---|---|---|---|---|---|---|---|
| A-Bench_VAL | 75.65 | 75.72 | 78.1 | 75.86 | 75.86 | 76.49 | 79.22 |
| CCBench | 66.86 | 70.39 | 80.98 | 77.84 | 70.78 | 57.65 | 73.73 |
| SEEDBench_IMG | 76.55 | 78.02 | 79.13 | 77.00 | 77.52 | 76.98 | 78.34 |
| SEEDBench2_Plus | 67.59 | 70.97 | 73.17 | 69.52 | 68.47 | 70.93 | 73.25 |
| MMVet | 48.17 | 53.21 | 57.34 | 80.28 | 78.90 | 70.64 | 75.69 |
| MMMU_VAL | 46.44 | 47.11 | 58.33 | 56.11 | 60.78 | 51.0 | 65.78 |
| ScienceQA_TEST | 95.19 | 97.62 | 98.76 | 97.97 | 97.17 | 85.47 | 92.51 |
| ScienceQA_VAL | 93.85 | 97.62 | 98.81 | 97.81 | 95.14 | 83.59 | 91.32 |
| MMT-Bench_VAL | 62.23 | 63.22 | 71.06 | 65.17 | 63.67 | 61.40 | 69.49 |
| MTVQA_TEST | 26.5 | 30.14 | 32.18 | 30.30 | 27.62 | 29.08 | 31.48 |
| BLINK | 49.97 | 56.81 | 59.44 | 55.87 | 51.87 | 54.55 | 63.02 |
| MMStar | 57.93 | 64.07 | 69.47 | 68.40 | 66.07 | 61.53 | 66.00 |
| RealWorldQA | 65.75 | 70.59 | 71.63 | 71.11 | 74.25 | 69.28 | 73.86 |
| Q-Bench1_VAL | 73.51 | 75.25 | 77.46 | 75.99 | 77.99 | 78.10 | 79.93 |
| POPE | 85.08 | 86.06 | 88.97 | 90.59 | 88.87 | 85.97 | 83.35 |
| RefCOCO (Avg) | 85.94 | 89.37 | 91.01 | 89.65 | 91.40 | 86.56 | 90.25 |

OCR & Document Understanding Enhancement

Qianfan-VL focuses on two distinctive capabilities: full-scenario OCR recognition and complex-layout document understanding. It performs strongly across multiple benchmarks, providing high-precision visual understanding for enterprise-level applications.

Full-Scenario OCR Tasks

  • Handwriting Recognition: Chinese and English handwriting recognition, supporting various fonts like cursive and regular script
  • Formula Recognition: Precise mathematical formula recognition and conversion to LaTeX format
  • Natural Scene Text Recognition: Text detection in complex environments like street views, signs, and markers
  • Card/Document Information Extraction: Structured information extraction from ID cards, driver's licenses, business licenses, etc.

Complex Layout Document Understanding

  • Layout Analysis: Automatic recognition of layout elements like titles, paragraphs, charts, and tables
  • Table Understanding: Complex table structure parsing, supporting merged cells and multi-level headers
  • Chart Understanding: Data extraction and analysis of bar charts, line charts, pie charts, etc.
  • Document Q&A: Intelligent question answering and information retrieval based on document content
  • Document Parsing: Structured parsing of PDF, Word, and other format documents

OCR & Document Understanding Benchmark Performance

Comprehensive comparison of Qianfan-VL models of all scales with mainstream models on OCR and document understanding professional benchmarks

| Benchmark | Qianfan-VL-3B | Qianfan-VL-8B | Qianfan-VL-70B | Qwen2.5-VL-3B | InternVL3-8B | InternVL3-78B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
|---|---|---|---|---|---|---|---|---|
| OCRBench | 831 | 854 | 873 | 810 | 881 | 847 | 883 | 874 |
| AI2D_TEST | 81.38 | 85.07 | 87.73 | 77.07 | 85.07 | 83.55 | 80.472 | 83.84 |
| OCRVQA_TEST | 66.15 | 68.98 | 74.06 | 69.24 | 39.03 | 35.58 | 71.02 | 66.8 |
| TextVQA_VAL | 80.11 | 82.13 | 84.48 | 79.09 | 82.15 | 83.52 | 84.962 | 83.26 |
| DocVQA_VAL | 90.85 | 93.54 | 94.75 | 92.71 | 92.04 | 83.82 | 94.91 | 95.75 |
| ChartQA_TEST | 81.79 | 87.72 | 89.6 | 83.4 | 85.76 | 82.04 | 86.68 | 87.16 |

Reasoning Capability

The 8B and 70B models support chain-of-thought activation through special tokens, covering complex chart understanding, visual reasoning, mathematical problem solving, and other scenarios that typically require combinatorial reasoning over visual information and external knowledge. We synthesized extensive visual and textual reasoning data and integrated it into Qianfan-VL's post-training, significantly improving performance on reasoning- and computation-related tasks, as the benchmark results below show.
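A minimal sketch of what this switch can look like at inference time, assuming an InternVL-style `model.chat` interface as in the Quick Start section below; the `enable_thinking` flag is an assumption for illustration, not the documented API:

```python
# Illustrative only: toggling chain-of-thought on the 8B/70B models.
# `model`, `tokenizer`, and `pixel_values` are prepared as in Quick Start;
# the real special token / parameter name is defined by the released API.
question = "Based on the chart, what is the year-over-year growth in 2024?"

# Direct answer (all model sizes)
answer = model.chat(tokenizer, pixel_values, question,
                    generation_config=dict(max_new_tokens=256))

# Chain-of-thought activated (8B/70B only; hypothetical switch)
cot_answer = model.chat(tokenizer, pixel_values, question,
                        generation_config=dict(max_new_tokens=1024),
                        enable_thinking=True)
```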

Core Reasoning Application Scenarios

Complex Chart Understanding & Reasoning
  • Data Analysis: Extract key information from complex charts for reasoning analysis
  • Trend Prediction: Trend judgment and prediction based on historical data charts
  • Correlation Reasoning: Cross-analysis and correlation reasoning of multi-chart data
  • Statistical Computation: Statistical analysis and quantitative calculation of chart data
Mathematical Problem-Solving & Visual Reasoning
  • Geometric Reasoning: Spatial figure relationship understanding and theorem application
  • Formula Recognition: Precise recognition and understanding of complex mathematical formulas
  • Step-by-step Solution: Clear problem-solving process and step presentation
  • Logical Inference: Logic reasoning and problem-solving based on visual cues

Mathematical Problem-Solving Benchmark Performance

| Benchmark | Qianfan-VL-8B | Qianfan-VL-70B | InternVL3-8B | InternVL3-78B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
|---|---|---|---|---|---|---|
| Mathvista-mini | 69.19 | 78.6 | 69.5 | 71.1 | 69.5 | 70.1 |
| Mathvision | 32.82 | 50.29 | 21.48 | 33.48 | 29.61 | 34.8 |
| Mathverse | 48.4 | 61.04 | 30.96 | 43.32 | 43.68 | 49.26 |
| ChartQA Pro | 50.41 | 52 | 19.38 | 47.92 | 37.32 | 44.43 |
| HallusionBench | 51.72 | 54.52 | 49.7 | 40.5 | 49.2 | 40.2 |
| InHouse Dataset A | 59.87 | 71.78 | 26 | 43.40 | 40.64 | 41.47 |
| InHouse Dataset B | 61.33 | 75.6 | 26.81 | 39.7 | 36.25 | 42.65 |

Model Architecture Design & Technical Features

Through advanced multimodal architecture design and three major technical innovations, Qianfan-VL achieves domain-enhanced general vision-language capabilities

Overall Architecture

[Figure: Qianfan-VL overall architecture]

Qianfan-VL adopts an advanced multimodal architecture that integrates industry best practices with in-house innovations.

Core Architecture Components

Language Model

Based on the Llama 3.1 architecture, enhanced through vocabulary expansion and localization on a 3T-token Chinese-English corpus, supporting mixed Chinese-English understanding

Vision Encoder

Initialized from InternViT, supporting dynamic patching of images at different resolutions, with input up to 4K resolution

Cross-modal Fusion

An MLP adapter bridges the vision and language modalities, ensuring accurate and efficient information transfer
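The sketch below shows, conceptually, how these three components compose; it is a simplified PyTorch rendering of the description above, with illustrative module names and dimensions rather than the released implementation:

```python
import torch
import torch.nn as nn

class QianfanVLSketch(nn.Module):
    """Conceptual composition: vision encoder -> MLP adapter -> LLM."""
    def __init__(self, vision_encoder, language_model,
                 vit_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # InternViT-initialized
        self.language_model = language_model   # Llama 3.1-based
        # MLP adapter projecting vision features into the LLM embedding space
        self.adapter = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, pixel_values, text_embeds):
        feats = self.vision_encoder(pixel_values)   # (B, N_img, vit_dim)
        img_tokens = self.adapter(feats)            # (B, N_img, llm_dim)
        # Prepend projected image tokens to the text embeddings
        fused = torch.cat([img_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=fused)
```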

Technical Innovation & Features

Capability Enhancement Training Pipeline

Innovative four-stage training strategy that significantly enhances domain capabilities while maintaining general capabilities

High-Precision Data Synthesis Technology

Combines traditional CV models with programmatic generation to efficiently construct high-quality training data

Large-Scale Kunlun Chip Training

Completed training entirely using Baidu's self-developed Kunlun P800 chips, demonstrating the mature capabilities of domestic AI infrastructure

Capability Enhancement Training Pipeline

Innovative four-stage progressive training strategy that significantly enhances domain capabilities while maintaining general capabilities

[Figure: Qianfan-VL four-stage training pipeline]

Stage 1: Cross-modal Alignment - This stage establishes the basic vision-language mapping: only the MLP adapter is updated while the vision encoder and LLM remain frozen, training on 100B tokens of general knowledge data. Skipping this stage degrades overall performance.

Stage 2: General Knowledge Injection - This stage prioritizes the volume and coverage of injected data, updating all parameters on 2.66T tokens of general knowledge data. It builds the model's strong foundational capabilities and includes a sufficient proportion of pure-text corpus to prevent catastrophic forgetting of the LLM's knowledge.

Stage 3: Domain-Enhanced Knowledge Injection - This stage carefully selects high-quality data for the domains to be enhanced, mixing task data for those domains with sampled general data to preserve general knowledge and prevent catastrophic forgetting. All parameters are updated on 0.32T tokens of domain-specific and sampled general data, yielding a significant boost in professional capabilities.

Stage 4: Post-training - This stage improves instruction following and preference alignment, updating all parameters on 1B tokens of instruction fine-tuning data. The high-quality alignment data spans complex instruction following, writing, Q&A, programming, OCR, information extraction, mathematics, and reasoning computation, plus sufficient pure-text instruction data to maintain text-only capabilities.
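For reference, the four-stage schedule condensed into a plain data structure (token budgets are taken from the text above; the field names are ours):

```python
# Four-stage capability-enhancement pipeline, as described above.
TRAINING_STAGES = [
    {"stage": 1, "name": "Cross-modal Alignment",
     "trainable": ["mlp_adapter"], "frozen": ["vision_encoder", "llm"],
     "tokens": "100B", "data": "general knowledge"},
    {"stage": 2, "name": "General Knowledge Injection",
     "trainable": ["all"], "frozen": [],
     "tokens": "2.66T", "data": "general knowledge + pure-text corpus"},
    {"stage": 3, "name": "Domain-Enhanced Knowledge Injection",
     "trainable": ["all"], "frozen": [],
     "tokens": "0.32T", "data": "domain-specific + sampled general"},
    {"stage": 4, "name": "Post-training",
     "trainable": ["all"], "frozen": [],
     "tokens": "1B", "data": "instruction fine-tuning + pure-text instructions"},
]
```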

High-Precision Data Synthesis Technology

Qianfan-VL's data synthesis pipeline covers core multimodal tasks such as document recognition, mathematical problem solving, chart understanding, table recognition, formula recognition, and natural scene OCR. Refined pipeline design and construction of intermediate-process data enable efficient production of high-quality training data.

Multi-task Data Synthesis Pipeline

Document Recognition OCR Pipeline
Core Tasks: Comprehensive analysis, image-to-Markdown, and document Q&A
  • Comprehensive Analysis: Multi-dimensional analysis integrating layout, category, and content, supporting multiple languages and handwritten scanned documents
  • Image-to-Markdown: Efficient conversion of single/multi-page documents to structured Markdown
  • Document Q&A: Deep understanding supporting summarization, reasoning, and multi-turn dialogue
Mathematical Problem-Solving Pipeline
Core Advantages: Customized educational data construction + enhanced visual mathematical reasoning
  • Educational Data Preprocessing: Collect multilingual high-quality problem-solving data, standardize terminology and symbols, structure problems/conditions/steps/formulas
  • Problem-Solving Data Synthesis: Combine knowledge systems to synthesize photo problem-solving scenario data through structured expression→LaTeX→HTML→image pipeline
  • Visual Extraction Enhancement: For complex scenarios like charts, formulas, and geometry, construct high-quality data through formal description languages combined with HTML rendering
Chart Understanding Pipeline
Core Objective: Generate high-quality chart Q&A pairs covering data retrieval, visual attributes, and computational Q&A
  • Data Expansion: Open source dataset sampling + Baidu Image Search API expansion + deduplication processing
  • Chart Summary: Pre-trained VLM generates structured summaries containing visual and numerical information
  • Two-stage Generation: Generate questions based on summaries → Generate answers based on questions and summaries
Table Recognition Pipeline
Core Tasks: Table structure recognition and table Q&A
  • Table Structuring: Precise recovery of image tables to HTML/LaTeX, supporting complex layouts like borderless tables and contract tables
  • Table Q&A: Numerical computation, comparative analysis, and information retrieval based on table images
  • Content Generation: Random table structure + Faker library/LLM filling + random cell merging with professional CSS theme rendering
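A hedged sketch of that content-generation step: random table shape, Faker-filled cells, and random cell merges emitted as HTML ready for CSS-themed rendering (the helper and column choices are illustrative):

```python
import random
from faker import Faker

fake = Faker()

def synth_table_html(n_rows=5, n_cols=4, merge_prob=0.15):
    """One synthetic table: random structure + Faker content + merged cells."""
    headers = "".join(f"<th>{fake.word().title()}</th>" for _ in range(n_cols))
    rows = []
    for _ in range(n_rows):
        cells, col = [], 0
        while col < n_cols:
            # Occasionally merge adjacent columns to mimic complex layouts
            span = 2 if (col < n_cols - 1 and random.random() < merge_prob) else 1
            cells.append(f'<td colspan="{span}">{fake.word()}</td>')
            col += span
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return (f"<table><thead><tr>{headers}</tr></thead>"
            f"<tbody>{''.join(rows)}</tbody></table>")

print(synth_table_html())  # render with a random CSS theme, then screenshot
```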
Formula Recognition Pipeline
Core Capabilities: Integrated symbol recognition + syntax parsing + semantic understanding
  • Symbol Recognition: Precise recognition of mathematical symbols, Greek letters, and special notations
  • Structure Parsing: Complex structures like fractions, radicals, superscripts/subscripts, matrices
  • Multi-engine Rendering: MathJax/KaTeX ensuring rendering consistency
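As a self-contained stand-in for the rendering step (the pipeline itself uses the MathJax/KaTeX engines named above), the sketch below rasterizes a LaTeX formula to a training image with matplotlib's mathtext:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

def render_formula(latex: str, path: str, fontsize: int = 24):
    """Rasterize a LaTeX formula string to an image file."""
    fig = plt.figure(figsize=(4, 1))
    fig.text(0.5, 0.5, f"${latex}$", ha="center", va="center", fontsize=fontsize)
    fig.savefig(path, dpi=200, bbox_inches="tight")
    plt.close(fig)

render_formula(r"\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}", "formula.png")
```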
Natural Scene OCR Pipeline
Core Innovation: A SynthText-style pipeline for systematic text-image synthesis
  • Background Filtering: Lightweight OCR model + image type detection to exclude samples with text/non-static content
  • Scene Understanding: Semantic segmentation model + monocular depth estimation for region division and 3D structure
  • Real Projection: Plane detection + perspective projection + random text style natural projection
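A minimal OpenCV sketch of the real-projection step: a pre-rendered text patch is perspective-warped onto a scene plane. The plane corners are hard-coded for illustration; the actual pipeline derives them from semantic segmentation and depth estimation:

```python
import cv2
import numpy as np

scene = cv2.imread("scene.jpg")              # background photo
text_img = cv2.imread("rendered_text.png")   # pre-rendered text patch
h, w = text_img.shape[:2]

src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
# Four corners of the target plane in the scene (illustrative values)
dst = np.float32([[420, 180], [640, 200], [630, 320], [415, 290]])

M = cv2.getPerspectiveTransform(src, dst)
warped = cv2.warpPerspective(text_img, M, (scene.shape[1], scene.shape[0]))

# Composite the warped text onto the scene wherever it has content
mask = cv2.cvtColor(warped, cv2.COLOR_BGR2GRAY) > 0
scene[mask] = warped[mask]
cv2.imwrite("synth_sample.jpg", scene)
```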

Large-Scale Kunlun Chip Parallel Training

Built on Baidu's self-developed Kunlun P800 chips, an industry-leading ultra-large-scale distributed training system achieves efficient training through innovative parallel strategies and operator optimization.

  • Cluster Scale: 5000+ Kunlun P800 chips in parallel
  • Training Data Scale: 3T tokens of training data
  • Scaling Efficiency: 90%+ large-scale cluster scaling efficiency

3D Parallel Training Strategy

Uses a combination of Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP):
  • Dynamic load balancing optimizes distribution based on model layer characteristics
  • Gradient synchronization optimization reduces AllReduce communication time by 60%, combined with ZeRO-3 state sharding for memory optimization
  • Pipeline scheduling uses the 1F1B strategy, with the bubble rate controlled below 5%
  • Sequence-dimension partitioning halves memory usage for long-sequence training
  • Dynamic batching adaptively adjusts batch size based on sequence length
  • Selective activation recomputation optimizes checkpointing
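For orientation, a Megatron-style configuration sketch of such a 3D split; the degrees below are hypothetical, since the text does not publish the actual DP/TP/PP partitioning:

```python
# Hypothetical 3D-parallel launch configuration for a 5000+-chip cluster.
WORLD_SIZE = 5120

parallel_config = {
    "tensor_parallel_size": 8,      # TP: split attention/MLP matmuls
    "pipeline_parallel_size": 8,    # PP: 1F1B schedule, <5% bubble target
    "zero_stage": 3,                # ZeRO-3 optimizer/parameter sharding
    "sequence_parallel": True,      # halves long-sequence activation memory
    "recompute": "selective",       # selective activation recomputation
}

# DP degree follows from the world size and the TP x PP split
dp = WORLD_SIZE // (parallel_config["tensor_parallel_size"]
                    * parallel_config["pipeline_parallel_size"])
print(f"data-parallel replicas: {dp}")  # 5120 / (8 * 8) = 80
```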

Kunlun Chip Communication-Computation Fusion Technology

Architecture Advantages: In the P800 architecture, communication operators and matrix-multiplication operators run on separate hardware units, a significant departure from traditional GPGPU designs, where communication and computation often compete for the same resources and block each other during execution. By physically separating dedicated communication units from matrix-multiplication units, the P800 achieves true communication-computation parallelism. This yields resource isolation: communication operators execute entirely unaffected by matrix multiplications, avoiding the contention seen in traditional architectures. Data transmission and matrix operations can proceed simultaneously, significantly improving hardware utilization. Most importantly, the architecture supports overlap techniques that hide communication latency behind computation.

GEMM Communication-Computation Fusion Technology: By establishing additional bypass streams (BypassStream), communication operators can be seamlessly scheduled before and after matrix multiplications. The core idea is an independent scheduling system: bypass streams run alongside the main computation stream without blocking the matrix-multiplication pipeline. Data prefetching initiates communication early so that inputs arrive in time for computation, and result transmission starts immediately after computation completes, forming an end-to-end pipeline.

Multi-stream Optimization Implementation: Take the fusion of AllGather with matrix multiplication as an example. Traditional methods must complete the entire AllGather and wait for all data to arrive before starting the GEMM. The fused method instead decomposes the data into blocks through fine-grained chunking: as soon as one block's communication completes, its corresponding computation begins, so the matrix multiplication never waits for the full data set and communication and computation run as a true pipeline.
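A hedged PyTorch analogue of this chunked overlap is sketched below; the production kernels target Kunlun P800 bypass streams rather than CUDA streams, and the chunking scheme here is purely illustrative:

```python
import torch
import torch.distributed as dist

def overlapped_allgather_matmul(x_shard, weight, n_chunks=4):
    """Gather a row-sharded activation X and compute X @ W, starting each
    chunk's GEMM as soon as that chunk's AllGather has landed."""
    world = dist.get_world_size()
    comm_stream = torch.cuda.Stream()
    gathered, ready = [], []

    # Issue all per-chunk AllGathers on a dedicated communication stream
    with torch.cuda.stream(comm_stream):
        for chunk in x_shard.chunk(n_chunks, dim=0):
            bufs = [torch.empty_like(chunk) for _ in range(world)]
            dist.all_gather(bufs, chunk.contiguous())
            done = torch.cuda.Event()
            done.record(comm_stream)
            gathered.append(bufs)
            ready.append(done)

    # Compute stream: each GEMM waits only for its own chunk, not for all
    outs = []
    for bufs, done in zip(gathered, ready):
        torch.cuda.current_stream().wait_event(done)
        outs.append(torch.cat(bufs, dim=0) @ weight)
    # Note: output rows come back chunk-major across ranks; a production
    # kernel preserves the canonical layout.
    return torch.cat(outs, dim=0)
```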

Quick Start

Functional Example Code

For complete usage examples and code, please refer to our Cookbook: Qianfan-VL Example Notebook
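As a starting point, a minimal inference sketch in the Hugging Face style; the model ID and the `chat` interface are assumptions, and the cookbook above is the authoritative reference:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "baidu/Qianfan-VL-8B"  # hypothetical hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()

# InternVL-style chat call (assumed interface); pixel_values would come
# from the checkpoint's image preprocessor, e.g. for a document photo:
# response = model.chat(tokenizer, pixel_values,
#                       "Extract all text from this document as Markdown.",
#                       generation_config=dict(max_new_tokens=512))
```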

API Parameter Description

For detailed API parameter descriptions and calling documentation, please refer to: Qianfan ModelBuilder API Documentation

Summary

Qianfan-VL is positioned as a domain-enhanced, general-purpose multimodal large language model, offered in 3B, 8B, and 70B sizes to cover multi-scale, full-scenario applications. Focused on B2B customer needs, it significantly enhances task capabilities in intelligent office and K-12 education scenarios, including OCR recognition, document parsing, photo-based problem solving, chart understanding, and complex table parsing. For scenarios requiring complex reasoning, the thinking capability can be enabled on the 8B and 70B models to further boost performance.

On the technical level, Qianfan-VL adopts multi-stage progressive continued pre-training, steadily raising the proportion of domain-specific data while maintaining general capabilities, which yields significant gains in domain ability. Using traditional small CV models and programmatic synthesis methods, the Qianfan-VL team constructed a large volume of high-precision training data, substantially increasing data density for long-tail scenarios and improving model generalization. All model sizes were trained via large-scale parallel training on 5000+ Kunlun chips, and the models run efficient inference on Kunlun chips, GPUs, and other processors.

The Qianfan-VL series demonstrates strong generality among models of the same parameter scale, with excellent performance on specialized domain benchmarks and even stronger results on real business benchmarks. Through its domain-enhancement technical route, Qianfan-VL provides high-performance solutions that combine generality and specialization for enterprise-level multimodal AI applications.