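Initialize the MinerU source-code analysis project

- .memory/source-analysis.md: 830-line technical report covering the four backends (VLM, Pipeline, Hybrid, Office)
- docs-site/: showcase site (nginx:alpine + single-page index.html)
- Source shallow-cloned into source/ (gitignored)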
4
.gitignore
vendored
Normal file
@@ -0,0 +1,4 @@
source/
.DS_Store
*.pyc
__pycache__/
830
.memory/source-analysis.md
Normal file
@@ -0,0 +1,830 @@
# MinerU 3.0 Source Code: A Panoramic Technical Analysis

> Analysis date: 2026-04-13
> Source version: MinerU 3.0.9 (master branch, shallow clone)
> Source path: `~/Projects/research/20260413-mineru/source`
> GitHub: https://github.com/opendatalab/MinerU (59.5k stars)

---

## 1. Project Overview and Architectural Foundations

### Versions and Dependencies

**Version**: 3.0.9 (`mineru/version.py:1`)

**Supported Python range**: >=3.10,<3.14 (`pyproject.toml:11`)

MinerU is opendatalab's document parsing tool for converting PDFs to Markdown/JSON. It ships three PDF backends (Pipeline, VLM, Hybrid) plus an Office backend. The core dependencies are layered:

1. **Base layer** (`pyproject.toml:20-55`):
   - PDF processing: `pypdfium2>=4.30.0`, `pypdf>=5.6.0`, `pdfminer.six>=20251230`
   - Documents: `python-docx>=1.2.0`, `mammoth>=1.11.0`
   - Data science: `numpy>=1.21.6`, `pandas>=2.3.3`, `opencv-python>=4.11.0.86`
   - Language detection: `fast-langdetect>=0.2.3,<0.3.0`
   - File-type detection: `magika>=0.6.2,<1.1.0`
   - Serving: `fastapi`, `uvicorn`, `httpx`

2. **Backend-specific extras** (`pyproject.toml:57-107`):
   - `[vlm]`: `torch>=2.6.0,<3`, `transformers>=4.57.3,<5`
   - `[vllm]`: `vllm>=0.10.1.1,<0.12`
   - `[lmdeploy]`: `lmdeploy>=0.10.2,<0.12`
   - `[mlx]` (macOS): `mlx-vlm>=0.3.3,<0.4`
   - `[pipeline]`: the full ML stack (torch, torchvision, transformers, onnxruntime, albumentations)

### Entry Points and CLI Commands

`pyproject.toml:114-122` defines 8 CLI entry points:

```
mineru = "mineru.cli.client:main"                                # main CLI
mineru-vllm-server = "mineru.cli.vlm_server:vllm_server"         # vLLM server
mineru-lmdeploy-server = "mineru.cli.vlm_server:lmdeploy_server"
mineru-openai-server = "mineru.cli.vlm_server:openai_server"
mineru-models-download = "mineru.cli.models_download:download_models"
mineru-api = "mineru.cli.fast_api:main"                          # FastAPI server
mineru-router = "mineru.cli.router:main"                         # async routing / queue management
mineru-gradio = "mineru.cli.gradio_app:main"                     # Gradio UI
```

### Documentation and Project Structure

`mkdocs.yml` defines a bilingual (English/Chinese) documentation site:
- Quick start, usage guide, reference, FAQ
- An i18n plugin builds both languages in parallel
- Google Analytics integration (property G-44K480CC48)

---

## 2. Core Pipeline Architecture

### Module Hierarchy

Source size (about 53,447 lines of code in total):

- `mineru/backend/`: 17,598 lines (core processing pipelines)
  - `vlm/`: 586+545+153+660 = **1,944 lines** (VLM inference)
  - `pipeline/`: 877+362+378+442+347+552+1024 = **3,982 lines** (traditional OCR/layout)
  - `office/`: 69+244+779+1037 = **2,129 lines** (DOCX/PPTX/XLSX)
  - `hybrid/`: 100+50+150 = **300+ lines** (hybrid inference)

- `mineru/model/`: the ML model implementations
  - `vlm/`: VLM model wrappers
  - `mfr/`: math formula recognition (MFR)
  - `ocr/`: optical character recognition
  - `table/`: table detection and recognition
  - `layout/`: layout detection
  - `docx/`: Office document conversion

- `mineru/cli/`: 1,600+ lines
  - `client.py`: main CLI program
  - `fast_api.py`: REST API server (1,000+ lines)
  - `router.py`: async task routing

- `mineru/utils/`: utility collection
- `mineru/data/`: data read/write layer (filesystem/S3/HTTP)

---

## 3. Pipeline Backend (Traditional Approach)

### Architecture Overview

The Pipeline backend follows a cascaded **detect → recognize → merge** pattern:

```
PDF → image extraction → layout detection (PP-DocLayout-V2)
    → table detection/recognition (SLANet/UNet)
    → formula recognition (UnimerNet)
    → OCR (PaddleOCR)
    → text merging → Markdown
```

### Layout Detection

`mineru/backend/pipeline/batch_analyze.py:35-550`:

The `BatchAnalyze` class is the main processor. It is responsible for:
- Applying image masks (`_apply_mask_boxes_to_image`, lines 63-85)
- Pruning empty OCR text blocks (`_prune_empty_ocr_text_blocks`, lines 96-110)
- Extracting inline objects from tables (`_extract_table_inline_objects`, lines 214-302)
- Scheduling model inference (`__call__`, lines 303-550)

Model chain (`mineru/backend/pipeline/model_init.py`):

```
- PP-DocLayout-V2: layout detection model (ONNX/PyTorch)
- PaddleOCR: text recognition (three stages: det/rec/cls)
- SLANet+/UNet: table recognition (cls: table classification, rec: cell recognition)
- UnimerNet: formula recognition (Swin + mBART architecture)
```

**Key batch-size parameters** (`batch_analyze.py:35-47`):
```python
LAYOUT_BASE_BATCH_SIZE = 1    # layout detection batch size
MFR_BASE_BATCH_SIZE = 16      # formula recognition batch size
OCR_DET_BASE_BATCH_SIZE = 8   # OCR detection batch size
```

### Text Block Merging and Ordering

`mineru/backend/pipeline/pipeline_magic_model.py:16-100`:

The `MagicModel` class merges the blocks produced by PP-DocLayout-V2:

**Label mapping** (lines 18-42):
```python
PP_DOCLAYOUT_V2_LABELS_TO_BLOCK_TYPES = {
    "image": BlockType.IMAGE,
    "table": BlockType.TABLE,
    "display_formula": BlockType.INTERLINE_EQUATION,
    "text": BlockType.TEXT,
    ...
}
```

**Visual block hierarchy** (lines 44-67):
```python
VISUAL_MAIN_TYPES = (BlockType.IMAGE, BlockType.TABLE, BlockType.CHART, BlockType.CODE)
VISUAL_CHILD_TYPES = (BlockType.CAPTION, BlockType.FOOTNOTE)
# Each main block can own child blocks such as captions and footnotes
```
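
To make the main/child split concrete, here is a minimal sketch of how captions and footnotes might be attached to a visual main block. It is not MinerU's code: the `Block` dataclass, the string type names, and the nearest-center rule in `attach_children` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the BlockType members named in the report
MAIN_TYPES = {"image", "table", "chart", "code"}
CHILD_TYPES = {"caption", "footnote"}


@dataclass
class Block:
    type: str
    bbox: tuple[float, float, float, float]  # (x1, y1, x2, y2)
    children: list["Block"] = field(default_factory=list)


def center_distance(a: Block, b: Block) -> float:
    """Euclidean distance between the centers of two bounding boxes."""
    ax, ay = (a.bbox[0] + a.bbox[2]) / 2, (a.bbox[1] + a.bbox[3]) / 2
    bx, by = (b.bbox[0] + b.bbox[2]) / 2, (b.bbox[1] + b.bbox[3]) / 2
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5


def attach_children(blocks: list[Block]) -> list[Block]:
    """Attach every caption/footnote to its nearest main block (assumed rule)."""
    mains = [b for b in blocks if b.type in MAIN_TYPES]
    for child in (b for b in blocks if b.type in CHILD_TYPES):
        if mains:
            nearest = min(mains, key=lambda m: center_distance(m, child))
            nearest.children.append(child)
    return mains
```

A real implementation would likely also consider reading order and overlap; distance alone is just the simplest rule that shows the data structure.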

Initialization flow (lines 69-100):
1. Coordinate fix: `__fix_axis()` removes invalid bboxes
2. Post-processing: `__post_process()` reorders indices and merges formula text
3. OCR execution: `txt_spans_extract()` extracts plain-text spans

---

## 4. VLM Backend (Core of MinerU 2.5/3.0)

### VLM Model Architecture

**Base model**: Qwen2-VL (Alibaba's Tongyi Qianwen vision-language model)

`mineru/backend/vlm/vlm_analyze.py:80-102`:

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# model_path and device are supplied by the caller
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    device_map={"": device},
    dtype="auto",
)
processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
```

**Model size**: roughly 1.2B parameters. This report takes the model to be fine-tuned from **Qwen2-VL-2B** (Alibaba's open-source lightweight VLM) and reads the officially quoted "1.2B" as that size class.

**Multiple inference backends** (lines 79-160):

1. **transformers** (lines 79-104): loads a local model directly
   - Weight dtype chosen automatically from the checkpoint (`dtype="auto"`); this is inference, not mixed-precision training
   - Device placement via `device_map`
   - Adaptive batch size (`set_default_batch_size()`, lines 103-104)

2. **mlx-engine** (lines 105-113): optimized for Apple Silicon on macOS
   - Calls `mlx_vlm.load()`
   - Requires macOS 13.5+ on ARM64

3. **vllm-engine** (lines 118-160): high-throughput inference
   - Supports both the async and the sync LLM engine
   - Custom logits processor (lines 13-56 in `utils.py`)
   - Compute-capability check (lines 15-19)

4. **lmdeploy-engine**: inference acceleration framework
   - Supports several acceleration backends (pytorch/turbomind/maca)

### Inference Flow

`mineru/backend/vlm/vlm_analyze.py:200-586`:

`doc_analyze()` is the main entry point (lines 200-300):

```python
def doc_analyze(
    pdf_bytes,
    lang_list: list[str] = ["en"],
    return_md: bool = True,
    backend: str = "transformers",
    model_path: str | None = None,
    server_url: str | None = None,
    ...
) -> dict:
    # 1. Load the PDF and extract page images
    # 2. Loop over pages, running VLM inference per page
    # 3. Build the middle JSON and convert it to Markdown
```

**Async version** (lines 331-380):
```python
async def aio_doc_analyze(...):
    # Async flow that supports concurrent inference
```

### Output Parsing

`mineru/backend/vlm/model_output_to_middle_json.py:1-153`:

Parsing the JSON emitted by the VLM:
```python
def append_page_blocks_to_middle_json(
    middle_json: dict,
    page_model_output: dict,  # raw VLM output
    page_id: int,
    ...
)
```

Block conversion:
```
VLM JSON output → {bbox, type, content, ...}
               → mapped to the BlockType enum
               → middle JSON format
```

---

## 5. DOCX/PPTX/XLSX Parsing (New in 3.0)

### Office Document Entry Point

`mineru/backend/office/docx_analyze.py:11-29`:

```python
from io import BytesIO

def office_docx_analyze(
    file_bytes,
    image_writer=None
):
    file_stream = BytesIO(file_bytes)
    results = convert_binary(file_stream)

    middle_json = result_to_middle_json(
        results,
        image_writer,
    )
    return middle_json, results
```

### DOCX Converter Implementation

`mineru/model/docx/main.py:11-14`:

```python
from typing import BinaryIO

def convert_binary(file_binary: BinaryIO):
    converter = DocxConverter()
    converter.convert(file_binary)
    return converter.pages
```

`DocxConverter` (in `mineru/model/docx/docx_converter.py`):
- Parses document structure with `python-docx>=1.2.0`
- Converts to HTML with `mammoth>=1.11.0`
- Supports image extraction and embedding

### Office Block Mapping

`mineru/backend/office/model_output_to_middle_json.py:244` defines the block-type mapping:

```python
{
    "paragraph": BlockType.TEXT,
    "heading": BlockType.PARAGRAPH_TITLE,
    "table": BlockType.TABLE,
    "image": BlockType.IMAGE,
    ...
}
```

### Office Content to Markdown

`mineru/backend/office/office_middle_json_mkcontent.py` (~1037 lines):

Merges content and emits Markdown, handling:
- HTML table to Markdown conversion
- Image path handling
- Heading-level mapping

---

## 6. Hybrid Backend (Pipeline + VLM)

`mineru/backend/hybrid/hybrid_analyze.py:1-150`:

The core idea of hybrid mode:

1. **Pipeline** supplies **precise layout detection**
2. **VLM** adds **recognition of complex content** (tables/formulas/code)

Processing flow:

```python
def hybrid_analyze(
    pdf_bytes,
    lang_list: list[str] = ["en"],
    parse_method: str = "auto",
    ...
):
    # 1. OCR classification (lines 50-58)
    _ocr_enable = ocr_classify(pdf_bytes, parse_method)

    # 2. If OCR is needed, call the Pipeline OCR module
    if _ocr_enable:
        ocr_res_list = ocr_det(...)

    # 3. Key blocks (tables/formulas) are handled by the VLM
    # 4. Finally, merge the results
```

### Hybrid Model Singleton

`mineru/backend/pipeline/model_init.py` defines `HybridModelSingleton`:

```python
class HybridModelSingleton:
    _instance = None

    def get_model(...):
        # Lazy loading: initialized only on first use
        # Manages the lifecycle of all Pipeline modules
```

---

## 7. Output Formatting

### Middle JSON Format

All backends (Pipeline/VLM/Office) produce the same **middle JSON** (`middle_json`):

```python
middle_json = {
    "meta_info": {...},
    "doc_title": str,
    "doc_layout_result": [...],
    "para_blocks": [
        {
            "type": BlockType,
            "blocks": [
                {
                    "type": BlockType,
                    "lines": [
                        {
                            "spans": [
                                {
                                    "type": ContentType,
                                    "content": str,
                                    "bbox": [x1, y1, x2, y2],
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}
```
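
As a rough illustration of how this nesting is consumed, the sketch below walks `para_blocks → blocks → lines → spans` and yields each span's text. The function name `iter_span_text` is hypothetical, and the fallback for top-level blocks that carry `lines` directly is an assumption about the schema, not something the report confirms.

```python
from typing import Iterator


def iter_span_text(middle_json: dict) -> Iterator[tuple[str, str]]:
    """Yield (block_type, span_content) pairs in document order."""
    for para in middle_json.get("para_blocks", []):
        # Assumption: a block without nested "blocks" holds its "lines" itself
        for block in para.get("blocks", [para]):
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    yield block.get("type"), span.get("content", "")


sample = {
    "para_blocks": [
        {
            "type": "text",
            "blocks": [
                {
                    "type": "text",
                    "lines": [
                        {
                            "spans": [
                                {"type": "text", "content": "Hello", "bbox": [0, 0, 1, 1]},
                                {"type": "inline_equation", "content": "x^2", "bbox": [1, 0, 2, 1]},
                            ]
                        }
                    ],
                }
            ],
        }
    ]
}
pairs = list(iter_span_text(sample))
```

Using `.get` with defaults keeps the walk tolerant of blocks that omit a level, which is useful when different backends populate the structure to different depths.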

### Markdown Conversion

`mineru/backend/vlm/vlm_middle_json_mkcontent.py:25-91`:

```python
def merge_para_with_text(para_block, formula_enable=True, img_bucket_path=''):
    # 1. Iterate over every span in the block
    # 2. Emit text content plus formula delimiters
    # 3. Special handling for CJK (lines 58-68)
    #    - Chinese/Japanese/Korean: no space at line breaks
    #    - European text: drop the hyphen when a line ends mid-word
```

**LaTeX formula delimiters** (lines 10-22):

```python
delimiters = {
    'display': {'left': '$$', 'right': '$$'},  # display (block) formulas
    'inline': {'left': '$', 'right': '$'}      # inline formulas
}
```

These can be customized in `config.yaml`, for example to `\[...\]`.

### Table Handling

Pipeline produces tables as HTML (`table.html`):
- SLANet/UNet recognizes the cells
- The native HTML is kept for downstream conversion tools

The VLM generates Markdown tables directly.

### Image Handling

`mineru/backend/vlm/model_output_to_middle_json.py`:

```python
if block_type == BlockType.IMAGE:
    # The image is stored as a bucket URL or a local path
    # In Markdown: an image reference to that path
```

Image writer interface (`mineru/data/data_reader_writer/base.py`):
```python
class DataWriter:
    def write_image(self, image_bytes: bytes, image_name: str) -> str:
        # Returns a path that Markdown can reference
```

---

## 8. Multilingual Support

### Language Detection

`mineru/utils/guess_suffix_or_lang.py:43-54`:

```python
def guess_language_by_text(code):
    # 1. Normalize Unicode surrogate pairs (lines 11-40)
    normalized_code = _normalize_text_for_language_guess(code)

    # 2. Call the Magika content-type detection library
    try:
        codebytes = normalized_code.encode("utf-8", errors="replace")
        lang = magika.identify_bytes(codebytes).prediction.output.label
    except Exception:
        return DEFAULT_LANG  # defaults to "txt"

    return lang if lang != "unknown" else DEFAULT_LANG
```

**Languages supported**: the report counts **109+ languages**, matching the README's "109 languages" claim. Note that Magika labels content types (source-code and text formats), so this function guesses the language of a code snippet; the README figure most likely describes OCR language coverage rather than Magika.

### Block-Level Language Detection

`mineru/backend/vlm/vlm_middle_json_mkcontent.py:32`:

```python
block_lang = detect_lang(block_text)  # detect the language of the block

# Special handling for CJK (lines 57-68)
cjk_langs = {'zh', 'ja', 'ko'}
if block_lang in cjk_langs:
    pass  # no trailing space at line ends
```

### Markdown Format Adaptation

`utils/char_utils.py`:
- `full_to_half_exclude_marks()`: full-width to half-width conversion (punctuation marks excluded)
- `is_hyphen_at_line_end()`: detects a Western-text hyphen at line end

---

## 9. Deployment Options

### FastAPI REST API Server

`mineru/cli/fast_api.py:1-600+`:

**Launch command**:
```bash
mineru-api --host 0.0.0.0 --port 8000 --enable-vlm-preload
```

**Configuration** (lines 130-149):
```python
@dataclass
class ParseRequestOptions:
    files: list[UploadFile]
    lang_list: list[str]
    backend: str                # "vlm" / "pipeline" / "hybrid-ocr"
    parse_method: str           # "auto" / "txt" / "ocr"
    formula_enable: bool
    table_enable: bool
    server_url: Optional[str]   # remote VLM server URL
    return_md: bool
    return_middle_json: bool
    return_model_output: bool
    return_content_list: bool
    return_images: bool
    response_format_zip: bool
```

**Task management** (lines 72-100):
```python
TASK_PENDING = "pending"
TASK_PROCESSING = "processing"
TASK_COMPLETED = "completed"
TASK_FAILED = "failed"

DEFAULT_TASK_RETENTION_SECONDS = 24 * 60 * 60  # cleaned up after 24 hours
```

### Async Task Routing

`mineru/cli/router.py`:

Supports:
- Task queue (Redis/in-memory)
- Async processing
- Progress queries
- Result download

### Gradio Web UI

`mineru/cli/gradio_app.py:1-1000+`:

An interactive interface that supports:
- File upload (PDF/images/Office)
- Parameter configuration
- Live progress display
- Result preview and download

### Docker Deployment

The `docker/` directory contains:
- `Dockerfile`: multi-stage build (base image + model download)
- `docker-compose.yml`: service orchestration (API + Router + Redis)

---

## 10. Testing

`tests/unittest/test_e2e.py`:

End-to-end test suite:

```python
def test_pipeline_with_two_config():
    # 1. Collect the test PDFs
    doc_path_list = list(Path(pdf_files_dir).glob("*"))

    # 2. Run the Pipeline parse
    run_pipeline_parse(
        pdf_file_names,
        pdf_bytes_list,
        p_lang_list,
        output_dir,
        parse_method="txt",
    )

    # 3. Assert on the results
    assert_content(res_json_path, parse_method="txt")
```

**Coverage configuration** (`pyproject.toml:139-155`):
```toml
[tool.pytest.ini_options]
addopts = "-s --cov=mineru --cov-report html"

[tool.coverage.run]
source = ["mineru/"]
omit = ["*/gradio_app.py", "*/models_download.py", "*/fast_api.py", ...]
```

---

## 11. Dependency Tree and Hardware Requirements

### GPU/CPU Branches

**GPU (recommended)**:
```
torch>=2.6.0
VRAM >= 4GB   (VLM inference)
     >= 8GB   (full Hybrid inference)
     >= 12GB+ (concurrent tasks)
```

**CPU only**:
- ONNX Runtime inference (Layout/OCR)
- VLM backend disabled

**Apple Silicon (macOS)**:
```toml
# the [mlx] extra; mlx-vlm is natively optimized for M1/M2/M3
mlx = ["mlx-vlm>=0.3.3,<0.4"]
```

### Model Download Sources

`mineru/utils/models_download_utils.py`:

Two sources are supported:

1. **ModelScope** (mainland China): `modelscope://Qwen2-VL-2B`
2. **HuggingFace**: `Qwen/Qwen2-VL-2B`

Controlled by an environment variable:
```bash
export MINERU_MODEL_SOURCE="modelscope"  # default
```

Automatic download location:
```
~/.mineru/models/vlm/Qwen2-VL-2B/
```

### Adaptive Batch Size

`mineru/backend/vlm/utils.py:94-110`:

```python
def set_default_batch_size() -> int:
    gpu_memory = get_vram(device)

    if gpu_memory >= 16:
        batch_size = 8
    elif gpu_memory >= 8:
        batch_size = 4
    else:
        batch_size = 1
    return batch_size
```
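
The excerpt above depends on a `device` variable and a `get_vram` helper that are not shown. A self-contained sketch of the same tiering follows; `pick_batch_size` and `detect_vram_gb` are hypothetical names, the thresholds are the ones quoted in the report, and returning 0.0 when no CUDA device is found is an assumption.

```python
def detect_vram_gb() -> float:
    """Best-effort VRAM probe in GiB; 0.0 when torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


def pick_batch_size(vram_gb: float) -> int:
    """Map available VRAM to a batch size using the report's thresholds."""
    if vram_gb >= 16:
        return 8
    if vram_gb >= 8:
        return 4
    return 1


batch_size = pick_batch_size(detect_vram_gb())
```

Separating the probe from the tiering makes the thresholds testable on machines without a GPU.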

---

## 12. Key Engineering Highlights and Pitfalls

### Highlight 1: Multimodal Fusion Architecture

The three paths under `mineru/backend/` can be used on their own or combined:

- **VLM only**: fast (single pass), but needs 4GB+ of VRAM
- **Pipeline only**: precise (multi-stage verification), but compute-heavy
- **Hybrid**: a balance of precision and speed

### Highlight 2: Async I/O Optimization

`mineru/backend/vlm/vlm_analyze.py:331-380`:

```python
async def aio_doc_analyze(...):
    # Handles concurrent requests asynchronously
    # Uses aiofiles and asyncio for concurrency
```

Supports:
- Processing multiple PDFs at once
- HTTP keep-alive connection reuse
- Task queue (FastAPI + asyncio)

### Highlight 3: Language-Adaptive Handling

`mineru/backend/vlm/vlm_middle_json_mkcontent.py:58-90`:

```python
block_lang = detect_lang(block_text)

if block_lang in {'zh', 'ja', 'ko'}:  # CJK
    # no space between lines
    para_text += content
else:  # Western text
    # handle hyphens and spaces
    if is_hyphen_at_line_end(content):
        para_text += content[:-1]  # drop the hyphen
    else:
        para_text += f'{content} '
```

This keeps line joins in the Markdown correct for both CJK and Western text.
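
The joining rule can be tried in isolation with the sketch below. `join_lines` is a hypothetical helper, and its hyphen test (a trailing `-` preceded by a letter) is a simplified stand-in for `is_hyphen_at_line_end`, whose real logic is not shown in the report.

```python
CJK_LANGS = {"zh", "ja", "ko"}


def join_lines(lines: list[str], lang: str) -> str:
    """Join the lines of one paragraph according to the language family."""
    parts = []
    for i, raw in enumerate(lines):
        content = raw.strip()
        is_last = i == len(lines) - 1
        if lang in CJK_LANGS:
            # CJK text has no inter-word spaces, so lines are concatenated as-is
            parts.append(content)
        elif (
            not is_last
            and len(content) > 1
            and content.endswith("-")
            and content[-2].isalpha()
        ):
            # A word split across lines: drop the hyphen and add no space
            parts.append(content[:-1])
        else:
            parts.append(content if is_last else content + " ")
    return "".join(parts)


zh_text = join_lines(["文档解析", "工具"], "zh")
en_text = join_lines(["docu-", "ment parsing", "tool"], "en")
```

Note that this simplified rule would also merge genuinely hyphenated compounds broken at a line end (for example "state-" / "of-the-art"), which is one reason a production check needs more context.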

### Pitfall 1: GPU Out of Memory

**Symptom**: OOM during VLM inference

**Root cause**: `batch_size` set too high

**Fix** (lines 103-104, `vlm_analyze.py`):
```python
batch_size = set_default_batch_size()  # adaptive
# Still over the limit? Pass `--batch-size 1`
```

### Pitfall 2: VRAM Not Detected Correctly

`mineru/utils/model_utils.py`:

```python
def get_vram(device):
    if device == "cuda":
        import torch
        return torch.cuda.get_device_properties(0).total_memory / 1e9
    else:
        # CPU mode returns system RAM
        ...
```

**Note**: macOS with mlx-engine bypasses this check and tunes itself automatically.

### Pitfall 3: Mixed Full-Width/Half-Width Chinese Symbols

**Problem**: PDFs mix in full-width punctuation, which shows up duplicated after conversion to Markdown

**Fix** (`char_utils.py`):
```python
def full_to_half_exclude_marks(text):
    # full-width to half-width, keeping Chinese punctuation
    ...
```

### Pitfall 4: Accuracy of HTML-to-Markdown Table Conversion

**Problem**: table boundaries recognized by SLANet/UNet can be inaccurate

**Mitigation**: Hybrid mode uses the VLM to re-verify table structure

### Highlight 4: Singleton Management of Model Lifecycles

`mineru/backend/vlm/vlm_analyze.py:40-50`:

```python
class ModelSingleton:
    _instance = None
    _models = {}
    _lock = threading.RLock()

    def get_model(self, ...):
        with self._lock:
            if key not in self._models:
                # lazy initialization + thread-safe cache
                ...
```

This avoids:
- Loading the same model twice
- Race conditions under concurrency
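
A runnable sketch of the same pattern, for readers who want to see the pieces the excerpt elides. `ModelCache`, the `loader` callback, and the tuple key are illustrative assumptions, not the signature of MinerU's `ModelSingleton`.

```python
import threading


class ModelCache:
    """Process-wide cache that builds each model at most once."""

    _instance = None
    _lock = threading.RLock()

    def __new__(cls):
        # The lock guards instance creation so two threads cannot both build it
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance._models = {}
        return cls._instance

    def get_model(self, key, loader):
        """Return the cached model for key, calling loader() only on first use."""
        with self._lock:
            if key not in self._models:
                self._models[key] = loader()
            return self._models[key]


load_calls = []


def fake_loader():
    # Stand-in for an expensive model load
    load_calls.append(1)
    return object()


first = ModelCache().get_model(("transformers", "/path/to/model"), fake_loader)
second = ModelCache().get_model(("transformers", "/path/to/model"), fake_loader)
```

An `RLock` is used so that a loader which itself requests another cached model on the same thread does not deadlock. Holding the lock during `loader()` serializes all loads, which is simple but means unrelated models cannot load in parallel.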

### Highlight 5: Graceful Degradation

If the VLM service is unavailable, processing falls back to Pipeline automatically:

```python
# mineru/cli/common.py
try:
    result = vlm_doc_analyze(...)
except VLMServerError:
    logger.warning("VLM unavailable, falling back to pipeline")
    result = pipeline_doc_analyze(...)
```
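
The fallback shape can be exercised without any model by passing the two backends in as callables. Everything here is illustrative: `BackendUnavailable` and `analyze_with_fallback` are hypothetical names, and the standard `logging` module stands in for `loguru` to keep the sketch dependency-free.

```python
import logging

logger = logging.getLogger(__name__)


class BackendUnavailable(Exception):
    """Hypothetical error raised when the primary backend cannot be reached."""


def analyze_with_fallback(pdf_bytes: bytes, primary, fallback):
    """Try the primary backend; on BackendUnavailable, use the fallback.

    Returns (result, which), where which is "primary" or "fallback".
    """
    try:
        return primary(pdf_bytes), "primary"
    except BackendUnavailable as exc:
        logger.warning("primary backend unavailable (%s), falling back", exc)
        return fallback(pdf_bytes), "fallback"


def broken_backend(_pdf_bytes):
    raise BackendUnavailable("server down")


def working_backend(pdf_bytes):
    return len(pdf_bytes)


degraded = analyze_with_fallback(b"abc", broken_backend, working_backend)
normal = analyze_with_fallback(b"abc", working_backend, broken_backend)
```

Catching only the dedicated exception type matters: a bare `except Exception` would also hide genuine parsing bugs behind a silent switch of backend.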

---

## 13. Code Quality and Maintenance

### Type Annotation Coverage

Most functions use type hints:

```python
# mineru/backend/vlm/vlm_analyze.py:200
def doc_analyze(
    pdf_bytes: bytes,
    lang_list: list[str] = ["en"],
    return_md: bool = True,
    ...
) -> dict:
```

### Logging

`loguru` is used throughout:

```python
from loguru import logger
logger.debug("...")
logger.info("...")
logger.warning("...")
logger.error("...")
```

Controlled by an environment variable:
```bash
export MINERU_LOG_LEVEL="DEBUG"
```

### Configuration Management

`mineru/utils/config_reader.py`:

- YAML config file parsing
- Environment variable overrides
- Automatic device detection

---
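
## Summary

MinerU 3.0 is a **high-quality, production-grade document processing system** with:

1. **A three-backend architecture** (VLM/Pipeline/Hybrid) that adapts to different scenarios
2. **Multilingual adaptation** (109+ languages), with Markdown line joins handled per language
3. **Async concurrency** (FastAPI + asyncio) for high throughput
4. **Decoupled modules** (separate backend/model/cli/data) that are easy to extend
5. **A complete deployment chain** (REST API/Gradio/Docker) that works out of the box

At **~53.5K lines** of code with mature engineering, it is one of opendatalab's standout open-source projects.

Reuse paths for personal and business projects:
- **Bar-exam video project**: after subtitle extraction, cleaning textbook PDFs into Markdown can use the MinerU Pipeline directly
- **Consulting report generation**: switch PDF ingestion of reference reports to the VLM backend for a step up in formula/table recognition quality
- **Hermes/HiClaw**: plug in as a document-parsing capability, exposing the `mineru-api` REST service for agents to call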
3
docs-site/Dockerfile
Normal file
@@ -0,0 +1,3 @@
FROM nginx:alpine
COPY index.html /usr/share/nginx/html/index.html
EXPOSE 80
856
docs-site/index.html
Normal file
@@ -0,0 +1,856 @@
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>MinerU 3.0 Source Code Deep Dive · The Three-Backend Road from PDF to Markdown</title>
<style>
:root{
  --bg:#0a0e14; --bg2:#111720; --bg3:#1a2230; --bg4:#232e42;
  --border:#2a3548; --border2:#3a4a63;
  --text:#e6edf3; --text2:#9ba9bf; --text3:#6b7a91;
  --accent:#ff9f43; /* MinerU orange */
  --accent2:#4cc9f0; /* VLM cyan */
  --accent3:#b388ff; /* Pipeline purple */
  --accent4:#7ee787; /* Hybrid green */
  --qwen:#615ced;
  --red:#ff6b6b; --yellow:#ffd93d;
  --code-bg:#07090e; --code-border:#1f2733;
}
*{margin:0;padding:0;box-sizing:border-box}
html{scroll-behavior:smooth;scroll-padding-top:80px}
body{
  font-family:-apple-system,'SF Pro Text','Helvetica Neue','PingFang SC',sans-serif;
  background:var(--bg);color:var(--text);line-height:1.75;
  font-feature-settings:"ss01","cv11";
}
::selection{background:var(--accent);color:#000}
::-webkit-scrollbar{width:8px;height:8px}
::-webkit-scrollbar-track{background:var(--bg)}
::-webkit-scrollbar-thumb{background:var(--border2);border-radius:4px}
::-webkit-scrollbar-thumb:hover{background:var(--text3)}

/* ===== Top Nav ===== */
.topnav{
  position:fixed;top:0;left:0;right:0;z-index:100;
  background:rgba(10,14,20,.88);backdrop-filter:blur(12px);
  border-bottom:1px solid var(--border);
  padding:.85rem 2rem;display:flex;align-items:center;gap:2rem;
}
.topnav .brand{font-weight:700;font-size:1rem;color:var(--accent);white-space:nowrap}
.topnav .brand .dot{display:inline-block;width:8px;height:8px;background:var(--accent);border-radius:50%;margin-right:.5rem;box-shadow:0 0 12px var(--accent)}
.topnav nav{display:flex;gap:.25rem;flex-wrap:wrap;overflow-x:auto}
.topnav nav a{
  color:var(--text2);text-decoration:none;font-size:.82rem;
  padding:.35rem .7rem;border-radius:6px;white-space:nowrap;transition:.15s;
}
.topnav nav a:hover{color:var(--accent);background:var(--bg3)}
.topnav .right{margin-left:auto;display:flex;gap:.75rem;align-items:center}
.topnav .right a{
  color:var(--text2);text-decoration:none;font-size:.82rem;
  padding:.35rem .75rem;border:1px solid var(--border2);border-radius:6px;transition:.15s;
}
.topnav .right a:hover{color:var(--accent);border-color:var(--accent)}
@media(max-width:900px){.topnav{padding:.7rem 1rem;gap:1rem}.topnav nav{display:none}}

/* ===== Hero ===== */
.hero{
  min-height:92vh;display:flex;flex-direction:column;justify-content:center;align-items:center;
  padding:6rem 2rem 4rem;text-align:center;position:relative;overflow:hidden;
}
.hero::before{
  content:"";position:absolute;inset:0;
  background:
    radial-gradient(900px circle at 15% 25%,rgba(255,159,67,.09),transparent 60%),
    radial-gradient(700px circle at 85% 75%,rgba(76,201,240,.07),transparent 60%),
    radial-gradient(500px circle at 50% 50%,rgba(179,136,255,.05),transparent 60%);
  pointer-events:none;
}
.hero-inner{position:relative;z-index:2;max-width:980px}
.hero .tag{
  display:inline-block;padding:.35rem .9rem;border:1px solid var(--accent);
  border-radius:100px;color:var(--accent);font-size:.78rem;font-weight:600;
  letter-spacing:.12em;text-transform:uppercase;margin-bottom:2rem;
}
.hero h1{
  font-size:clamp(2.4rem,6vw,4.6rem);font-weight:800;letter-spacing:-.03em;
  background:linear-gradient(135deg,#fff 0%,var(--accent) 55%,var(--accent2) 100%);
  -webkit-background-clip:text;-webkit-text-fill-color:transparent;
  margin-bottom:1rem;line-height:1.05;
}
.hero .sub{
  font-size:clamp(1rem,1.8vw,1.25rem);color:var(--text2);
  max-width:740px;margin:0 auto 2.5rem;
}
.hero .sub code{
  background:var(--bg3);padding:.15rem .45rem;border-radius:4px;
  color:var(--accent);font-size:.95em;border:1px solid var(--border);
}
.hero .stats{
  display:grid;grid-template-columns:repeat(auto-fit,minmax(160px,1fr));gap:1rem;
  max-width:820px;margin:0 auto 2.5rem;
}
.hero .stat{
  background:var(--bg2);border:1px solid var(--border);border-radius:12px;
  padding:1.25rem .5rem;
}
.hero .stat .num{font-size:2.1rem;font-weight:800;color:var(--accent);line-height:1}
.hero .stat .label{font-size:.75rem;color:var(--text2);margin-top:.4rem;text-transform:uppercase;letter-spacing:.06em}
.hero .meta{
  display:flex;justify-content:center;gap:.75rem;flex-wrap:wrap;font-size:.78rem;color:var(--text3);
}
.hero .meta span{padding:.25rem .65rem;background:var(--bg2);border:1px solid var(--border);border-radius:4px}

/* ===== Verdict Banner ===== */
.verdict{max-width:1200px;margin:-2rem auto 4rem;padding:0 2rem;}
.verdict-box{
  background:linear-gradient(135deg,rgba(76,201,240,.08),rgba(255,159,67,.06));
  border:1px solid var(--accent2);border-radius:16px;padding:2rem 2.5rem;
  display:grid;grid-template-columns:auto 1fr;gap:2rem;align-items:center;
}
.verdict-box .icon{font-size:3rem}
.verdict-box h3{color:var(--accent2);font-size:1.3rem;margin-bottom:.5rem}
.verdict-box p{color:var(--text2);font-size:.95rem}
.verdict-box strong{color:var(--text)}
@media(max-width:700px){.verdict-box{grid-template-columns:1fr;text-align:center}}

/* ===== Section ===== */
section{max-width:1200px;margin:0 auto;padding:4rem 2rem;scroll-margin-top:80px}
.section-head{margin-bottom:2.5rem}
.section-num{
  display:inline-block;font-family:'SF Mono',Menlo,monospace;color:var(--accent);
  font-size:.85rem;letter-spacing:.1em;margin-bottom:.5rem;
}
.section-head h2{
  font-size:clamp(1.6rem,3vw,2.4rem);font-weight:800;letter-spacing:-.02em;margin-bottom:.5rem;
}
.section-head p{color:var(--text2);font-size:1.02rem;max-width:780px}

/* ===== Grid & Cards ===== */
.grid{display:grid;gap:1.25rem}
.g2{grid-template-columns:repeat(2,1fr)}
.g3{grid-template-columns:repeat(3,1fr)}
.g4{grid-template-columns:repeat(4,1fr)}
@media(max-width:900px){.g2,.g3,.g4{grid-template-columns:1fr}}
.card{
  background:var(--bg2);border:1px solid var(--border);border-radius:12px;
  padding:1.5rem;transition:.2s;
}
.card:hover{border-color:var(--border2);transform:translateY(-2px)}
.card h3{font-size:1.05rem;margin-bottom:.6rem;color:var(--text)}
.card h3 .pill{
  display:inline-block;font-size:.65rem;padding:.15rem .5rem;border-radius:4px;
  background:var(--bg4);color:var(--accent);margin-left:.5rem;vertical-align:middle;
  font-weight:600;letter-spacing:.05em;text-transform:uppercase;
}
.card p{color:var(--text2);font-size:.9rem;margin-bottom:.6rem}
.card .evidence{
  font-family:'SF Mono',Menlo,monospace;font-size:.72rem;color:var(--text3);
  padding:.4rem .6rem;background:var(--code-bg);border:1px solid var(--code-border);
  border-radius:4px;display:block;word-break:break-all;
}

/* ===== Code Block ===== */
pre{
  background:var(--code-bg);border:1px solid var(--code-border);border-radius:8px;
  padding:1.1rem 1.3rem;overflow-x:auto;font-family:'SF Mono',Menlo,Consolas,monospace;
  font-size:.82rem;line-height:1.65;color:#c9d1d9;margin:1rem 0;
}
pre .kw{color:#ff7b72}
pre .str{color:#a5d6ff}
pre .com{color:#8b949e;font-style:italic}
pre .fn{color:#d2a8ff}
pre .num{color:#79c0ff}
code{font-family:'SF Mono',Menlo,Consolas,monospace;font-size:.88em;color:var(--accent)}

/* ===== Architecture Diagram ===== */
.arch{
  background:var(--bg2);border:1px solid var(--border);border-radius:14px;
  padding:2rem;margin:2rem 0;
}
.arch-row{display:grid;grid-template-columns:repeat(3,1fr);gap:1rem;margin-bottom:1rem}
.arch-row:last-child{margin-bottom:0}
@media(max-width:800px){.arch-row{grid-template-columns:1fr}}
.arch-node{
  background:var(--bg3);border:1px solid var(--border2);border-radius:8px;
  padding:1rem;text-align:center;position:relative;
}
.arch-node.vlm{border-color:var(--accent2)}
.arch-node.pipe{border-color:var(--accent3)}
.arch-node.hyb{border-color:var(--accent4)}
.arch-node.off{border-color:var(--accent)}
.arch-node .lbl{font-size:.72rem;color:var(--text3);text-transform:uppercase;letter-spacing:.08em;margin-bottom:.35rem}
.arch-node .nm{font-weight:700;color:var(--text);margin-bottom:.25rem}
.arch-node .meta{font-size:.75rem;color:var(--text2)}
.arch-arrow{text-align:center;color:var(--text3);font-size:1.4rem;margin:.35rem 0}

/* ===== Compare Table ===== */
.table-wrap{overflow-x:auto;margin:1.5rem 0}
table{width:100%;border-collapse:collapse;background:var(--bg2);border:1px solid var(--border);border-radius:10px;overflow:hidden}
th,td{padding:.85rem 1rem;text-align:left;border-bottom:1px solid var(--border);font-size:.87rem}
th{background:var(--bg3);color:var(--accent);font-weight:600;text-transform:uppercase;letter-spacing:.05em;font-size:.75rem}
tr:last-child td{border-bottom:0}
tr:hover td{background:var(--bg3)}
td code{font-size:.78rem}

/* ===== Timeline / Flow ===== */
.flow{counter-reset:step;margin:1.5rem 0}
.flow-step{
  background:var(--bg2);border:1px solid var(--border);border-left:3px solid var(--accent);
  border-radius:0 8px 8px 0;padding:1.1rem 1.4rem;margin-bottom:1rem;position:relative;
}
.flow-step::before{
  counter-increment:step;content:counter(step,decimal-leading-zero);
  position:absolute;left:-14px;top:1.1rem;background:var(--bg);
  color:var(--accent);font-family:'SF Mono',Menlo,monospace;font-size:.7rem;
  padding:.1rem .35rem;border:1px solid var(--accent);border-radius:4px;
}
.flow-step h4{font-size:1rem;margin-bottom:.35rem;color:var(--text)}
.flow-step p{font-size:.87rem;color:var(--text2)}
.flow-step .ev{font-family:'SF Mono',Menlo,monospace;font-size:.72rem;color:var(--text3);margin-top:.45rem;display:block}
|
||||
|
||||
/* ===== Callout ===== */
|
||||
.callout{
|
||||
border-left:3px solid var(--accent);background:var(--bg2);
|
||||
padding:1rem 1.3rem;border-radius:0 8px 8px 0;margin:1.25rem 0;
|
||||
}
|
||||
.callout.warn{border-left-color:var(--red);background:rgba(255,107,107,.06)}
|
||||
.callout.tip{border-left-color:var(--accent4);background:rgba(126,231,135,.06)}
|
||||
.callout.info{border-left-color:var(--accent2);background:rgba(76,201,240,.05)}
|
||||
.callout h4{font-size:.88rem;margin-bottom:.35rem;text-transform:uppercase;letter-spacing:.05em}
|
||||
.callout.warn h4{color:var(--red)}
|
||||
.callout.tip h4{color:var(--accent4)}
|
||||
.callout.info h4{color:var(--accent2)}
|
||||
.callout p{font-size:.88rem;color:var(--text2)}
|
||||
|
||||
/* ===== Footer ===== */
|
||||
footer{
|
||||
border-top:1px solid var(--border);padding:3rem 2rem;text-align:center;
|
||||
color:var(--text3);font-size:.82rem;margin-top:4rem;
|
||||
}
|
||||
footer a{color:var(--accent);text-decoration:none}
|
||||
footer a:hover{text-decoration:underline}
|
||||
footer .links{margin-bottom:1rem;display:flex;justify-content:center;gap:1.5rem;flex-wrap:wrap}
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
|
||||
<div class="topnav">
<div class="brand"><span class="dot"></span>MinerU 3.0 Source Analysis</div>
<nav>
<a href="#overview">Overview</a>
<a href="#arch">Architecture</a>
<a href="#vlm">VLM Backend</a>
<a href="#pipeline">Pipeline</a>
<a href="#hybrid">Hybrid</a>
<a href="#office">Office</a>
<a href="#output">Output</a>
<a href="#deploy">Deployment</a>
<a href="#highlights">Highlights</a>
<a href="#pitfalls">Pitfalls</a>
<a href="#verdict">Verdict</a>
</nav>
<div class="right">
<a href="https://github.com/opendatalab/MinerU" target="_blank">GitHub ↗</a>
</div>
</div>

<!-- ===== HERO ===== -->
<section class="hero">
<div class="hero-inner">
<div class="tag">Source-code Analysis · 2026-04-13</div>
<h1>MinerU 3.0<br>Three Backends from PDF to Markdown</h1>
<p class="sub">
A <code>PDF → Markdown/JSON</code> parsing engine from opendatalab
with <strong>59.5k</strong> GitHub stars. This report is a line-by-line read of a shallow clone of the <code>master</code> branch (v3.0.9)
and covers the full path through all four backends: VLM, Pipeline, Hybrid, and Office.
</p>
<div class="stats">
<div class="stat"><div class="num">59.5k</div><div class="label">GitHub Stars</div></div>
<div class="stat"><div class="num">3.0.9</div><div class="label">Current version</div></div>
<div class="stat"><div class="num">53.5k</div><div class="label">Lines of code</div></div>
<div class="stat"><div class="num">4</div><div class="label">Backends</div></div>
<div class="stat"><div class="num">109+</div><div class="label">Languages supported</div></div>
<div class="stat"><div class="num">1.2B</div><div class="label">VLM parameters</div></div>
</div>
<div class="meta">
<span>Python 3.10-3.13</span>
<span>Qwen2-VL-2B base</span>
<span>FastAPI + Gradio</span>
<span>vllm / lmdeploy / mlx</span>
<span>Apache-2.0</span>
</div>
</div>
</section>

<!-- ===== VERDICT BANNER ===== -->
<div class="verdict">
<div class="verdict-box">
<div class="icon">🎯</div>
<div>
<h3>The one-line verdict</h3>
<p>
MinerU 3.0 is arguably <strong>the most complete production-grade open-source PDF parsing stack available today</strong>. Rather than leaning on a single large model,
it packages three routes, <strong>a VLM (fine-tuned Qwen2-VL-2B), a traditional cascaded Pipeline, and a Hybrid mix</strong>,
as configurable backends, then adds native Office/DOCX parsing on top.
The code is mature and the engineering is thorough (FastAPI async tasks, a Gradio UI, and Docker orchestration are all included),
which makes it worth plugging into any Agent or RAG system as a sub-capability.
</p>
</div>
</div>
</div>

<!-- ===== 1. OVERVIEW ===== -->
<section id="overview">
<div class="section-head">
<div class="section-num">01 / OVERVIEW</div>
<h2>Project Overview</h2>
<p>One package, three main backends, seven CLIs, four inference frameworks: MinerU has turned PDF parsing into a Swiss Army knife.</p>
</div>

<div class="grid g3">
<div class="card">
<h3>Version <span class="pill">v3.0.9</span></h3>
<p>The current version on master. Core upgrades in 3.0: native Office parsing, async task endpoints, and a Pipeline backend that scores 86.2 on OmniDocBench v1.5.</p>
<span class="evidence">mineru/version.py:1</span>
</div>
<div class="card">
<h3>Python 3.10 ~ 3.13</h3>
<p>The minimum Python version is fairly high, and the code uses newer syntax (built-in generics such as <code>list[str]</code>, union types such as <code>str | None</code>).</p>
<span class="evidence">pyproject.toml:11</span>
</div>
<div class="card">
<h3>7 CLI entry points</h3>
<p><code>mineru</code> / <code>mineru-api</code> / <code>mineru-gradio</code> / <code>mineru-router</code> / <code>mineru-vllm-server</code> / <code>mineru-lmdeploy-server</code> / <code>mineru-models-download</code>.</p>
<span class="evidence">pyproject.toml:114-122</span>
</div>
</div>

<h3 style="margin:2.5rem 0 1rem;color:var(--text);font-size:1.15rem">📦 Dependency matrix (split by backend)</h3>
<div class="table-wrap">
<table>
<tr><th>Extra</th><th>When to use</th><th>Key dependencies</th><th>Scenario</th></tr>
<tr><td><code>[vlm]</code></td><td>Default VLM backend</td><td><code>torch≥2.6</code> <code>transformers≥4.57.3</code></td><td>Load Qwen2-VL directly</td></tr>
<tr><td><code>[vllm]</code></td><td>High-throughput VLM</td><td><code>vllm≥0.10.1.1</code></td><td>Concurrent multi-PDF jobs</td></tr>
<tr><td><code>[lmdeploy]</code></td><td>Acceleration on Chinese-vendor hardware</td><td><code>lmdeploy≥0.10.2</code></td><td>Ascend / MACA</td></tr>
<tr><td><code>[mlx]</code></td><td>macOS ARM</td><td><code>mlx-vlm≥0.3.3</code></td><td>Native on M1/M2/M3</td></tr>
<tr><td><code>[pipeline]</code></td><td>Traditional cascade</td><td><code>torch</code> <code>torchvision</code> <code>onnxruntime</code></td><td>No GPU / CPU-only</td></tr>
</table>
</div>
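<p>Since each extra pulls in a different stack, it can help to check at startup which backends are actually usable. The following is a minimal sketch, not MinerU code: <code>BACKEND_MODULES</code> and <code>available_backends</code> are hypothetical names, and the import names are assumptions (an import name can differ from the pip extra that installs it).</p>

```python
from importlib.util import find_spec

# Assumed import names per extra; adjust to what your environment installs.
BACKEND_MODULES = {
    "vlm": ["torch", "transformers"],
    "vllm": ["vllm"],
    "lmdeploy": ["lmdeploy"],
    "mlx": ["mlx_vlm"],
    "pipeline": ["onnxruntime"],
}


def available_backends(modules=BACKEND_MODULES):
    """Return {extra: bool}; True when every listed module can be located."""
    return {
        extra: all(find_spec(name) is not None for name in names)
        for extra, names in modules.items()
    }


if __name__ == "__main__":
    for extra, ok in available_backends().items():
        print(f"[{extra}] {'installed' if ok else 'missing'}")
```

<p><code>find_spec</code> only locates a module without importing it, so the check stays cheap even for heavy packages.</p>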
</section>

<!-- ===== 2. ARCHITECTURE ===== -->
<section id="arch">
<div class="section-head">
<div class="section-num">02 / ARCHITECTURE</div>
<h2>Four Backends in Parallel</h2>
<p>MinerU is not one pipeline but four, switched with the <code>backend</code> parameter and sharing one intermediate JSON format.</p>
</div>

<div class="arch">
<div class="arch-row">
<div class="arch-node vlm">
<div class="lbl">BACKEND 1</div>
<div class="nm">VLM (Qwen2-VL-2B)</div>
<div class="meta">1,944 LoC · single pass · needs 4 GB+ VRAM</div>
</div>
<div class="arch-node pipe">
<div class="lbl">BACKEND 2</div>
<div class="nm">Pipeline (cascade)</div>
<div class="meta">3,982 LoC · Layout→OCR→Table→MFR</div>
</div>
<div class="arch-node hyb">
<div class="lbl">BACKEND 3</div>
<div class="nm">Hybrid</div>
<div class="meta">300+ LoC · Pipeline and VLM complement each other</div>
</div>
</div>
<div class="arch-arrow">↓</div>
<div class="arch-row">
<div class="arch-node off" style="grid-column:1 / -1">
<div class="lbl">BACKEND 4 · New in 3.0</div>
<div class="nm">Office (DOCX / PPTX / XLSX)</div>
<div class="meta">2,129 LoC · python-docx + mammoth</div>
</div>
</div>
<div class="arch-arrow">↓</div>
<div class="arch-row">
<div class="arch-node" style="grid-column:1 / -1;background:var(--bg4)">
<div class="lbl">UNIFIED</div>
<div class="nm">Middle JSON → Markdown / content_list / Images</div>
<div class="meta">Every backend emits the same intermediate representation, so the converters are shared</div>
</div>
</div>
</div>

<div class="callout info">
<h4>📌 Why four backends instead of one</h4>
<p>A VLM alone is fast but has a VRAM floor; the Pipeline alone is accurate but slow and compute-heavy. Leaving the choice to the user is the more honest engineering decision: run the VLM if you have an A100, run the Pipeline if you only have a CPU.</p>
</div>
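<p>The "switch by <code>backend</code>, share one output format" idea can be sketched as a small registry. This is an illustration under assumptions, not MinerU's dispatch code: <code>BACKENDS</code>, <code>register</code>, <code>analyze</code>, and the stub functions are hypothetical, and the stubs return placeholder dictionaries instead of running models.</p>

```python
from typing import Callable, Dict

MiddleJson = dict  # stand-in for the shared intermediate format

# Hypothetical registry mapping a backend name to its analyze function.
BACKENDS: Dict[str, Callable[[bytes], MiddleJson]] = {}


def register(name: str):
    """Decorator that records a backend under the given name."""
    def wrap(fn):
        BACKENDS[name] = fn
        return fn
    return wrap


@register("pipeline")
def run_pipeline(data: bytes) -> MiddleJson:
    return {"backend": "pipeline", "para_blocks": []}  # stub result


@register("vlm")
def run_vlm(data: bytes) -> MiddleJson:
    return {"backend": "vlm", "para_blocks": []}  # stub result


def analyze(data: bytes, backend: str = "pipeline") -> MiddleJson:
    """Dispatch to the selected backend; all backends return the same shape."""
    try:
        fn = BACKENDS[backend]
    except KeyError:
        known = sorted(BACKENDS)
        raise ValueError(f"unknown backend {backend!r}; choose from {known}") from None
    return fn(data)
```

<p>Because every function returns the same structure, downstream converters never need to know which backend produced it.</p>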
</section>

<!-- ===== 3. VLM BACKEND ===== -->
<section id="vlm">
<div class="section-head">
<div class="section-num">03 / VLM BACKEND</div>
<h2>VLM Backend · The Truth Behind 1.2B Parameters</h2>
<p>The official pitch is "a 1.2B-parameter VLM that beats Gemini 2.5 Pro"; reading the source shows a fine-tuned <strong>Qwen2-VL-2B</strong> underneath.</p>
</div>

<pre><span class="com"># mineru/backend/vlm/vlm_analyze.py:80-102</span>
<span class="kw">from</span> transformers <span class="kw">import</span> Qwen2VLForConditionalGeneration, AutoProcessor

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    device_map={<span class="str">""</span>: device},
    dtype=<span class="str">"auto"</span>
)
processor = AutoProcessor.from_pretrained(model_path, use_fast=<span class="kw">True</span>)
</pre>

<div class="callout tip">
<h4>💡 Key takeaway</h4>
<p>The base is Alibaba's open-source <strong>Qwen2-VL-2B</strong> (the loader class above identifies the Qwen2-VL architecture; the 2B checkpoint is this report's reading). MinerU fine-tunes it on 65.5 million PDF pages, turning a general-purpose VLM into a document specialist. It is a textbook case of a small model beating larger ones through high-quality domain data rather than parameter count.</p>
</div>

<h3 style="margin:2rem 0 1rem;color:var(--text);font-size:1.1rem">🚀 Four inference frameworks side by side</h3>
<div class="grid g4">
<div class="card">
<h3>transformers</h3>
<p>The most compatible option; loads HF weights locally. This is the default.</p>
<span class="evidence">vlm_analyze.py:79-104</span>
</div>
<div class="card">
<h3>vllm</h3>
<p>High throughput; supports the async LLM engine and custom logits processors.</p>
<span class="evidence">vlm_analyze.py:118-160</span>
</div>
<div class="card">
<h3>lmdeploy</h3>
<p>Acceleration for Chinese-vendor hardware (TurboMind / PyTorch / MACA); supports Ascend.</p>
<span class="evidence">vlm_analyze.py:162+</span>
</div>
<div class="card">
<h3>mlx</h3>
<p>Native on Apple Silicon with macOS 13.5+; the first choice on M-series chips.</p>
<span class="evidence">vlm_analyze.py:105-113</span>
</div>
</div>

<h3 style="margin:2rem 0 1rem;color:var(--text);font-size:1.1rem">🔄 Main inference flow</h3>
<div class="flow">
<div class="flow-step">
<h4>PDF loading + image extraction</h4>
<p><code>pypdfium2</code> renders each page to PNG; the resolution is controlled by <code>--dpi</code>.</p>
<span class="ev">doc_analyze() @ vlm_analyze.py:200-300</span>
</div>
<div class="flow-step">
<h4>Per-page VLM inference</h4>
<p>Each page image plus a prompt goes into Qwen2-VL, which returns structured JSON (bbox + type + content).</p>
<span class="ev">ModelSingleton @ vlm_analyze.py:40-50</span>
</div>
<div class="flow-step">
<h4>Middle JSON assembly</h4>
<p>Per-page results are merged into the document-level structure by <code>append_page_blocks_to_middle_json()</code>.</p>
<span class="ev">model_output_to_middle_json.py:1-153</span>
</div>
<div class="flow-step">
<h4>Markdown conversion</h4>
<p>Rendering depends on block type: formulas as <code>$...$</code>, tables emitted as-is, images as embedded paths.</p>
<span class="ev">vlm_middle_json_mkcontent.py:25-91</span>
</div>
</div>

<h3 style="margin:2rem 0 1rem;color:var(--text);font-size:1.1rem">⚙️ Adaptive batch size</h3>
<pre><span class="com"># mineru/backend/vlm/utils.py:94-110</span>
<span class="kw">def</span> <span class="fn">set_default_batch_size</span>() -> <span class="kw">int</span>:
    gpu_memory = get_vram(device)
    <span class="kw">if</span> gpu_memory >= <span class="num">16</span>:
        batch_size = <span class="num">8</span>
    <span class="kw">elif</span> gpu_memory >= <span class="num">8</span>:
        batch_size = <span class="num">4</span>
    <span class="kw">else</span>:
        batch_size = <span class="num">1</span>
    <span class="kw">return</span> batch_size
</pre>
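<p>The excerpt depends on <code>get_vram</code> and a <code>device</code> from the surrounding module. To try the thresholds in isolation, here is a self-contained sketch of the same tiering; <code>pick_batch_size</code> is a hypothetical name, and taking VRAM in GB as an argument is an assumption made for testability.</p>

```python
def pick_batch_size(vram_gb: float) -> int:
    """Map available VRAM (in GB) to a batch size using the tiers quoted above."""
    if vram_gb >= 16:
        return 8
    if vram_gb >= 8:
        return 4
    return 1


if __name__ == "__main__":
    for vram in (4, 8, 12, 16, 24):
        print(f"{vram:>2} GB -> batch size {pick_batch_size(vram)}")
```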
</section>

<!-- ===== 4. PIPELINE BACKEND ===== -->
<section id="pipeline">
<div class="section-head">
<div class="section-num">04 / PIPELINE BACKEND</div>
<h2>Pipeline Backend · A Complete Example of the Classic Cascade</h2>
<p>This is the pre-VLM approach, and it is still the accuracy baseline. Four independent models are chained together, and each step can be verified in isolation.</p>
</div>

<div class="arch">
<div class="arch-row">
<div class="arch-node pipe">
<div class="lbl">STEP 1</div>
<div class="nm">PP-DocLayout-V2</div>
<div class="meta">Layout detection · ONNX</div>
</div>
<div class="arch-node pipe">
<div class="lbl">STEP 2</div>
<div class="nm">SLANet+ / UNet</div>
<div class="meta">Table structure recognition</div>
</div>
<div class="arch-node pipe">
<div class="lbl">STEP 3</div>
<div class="nm">UnimerNet</div>
<div class="meta">Formula recognition · Swin+mBART</div>
</div>
</div>
<div class="arch-arrow">+</div>
<div class="arch-row">
<div class="arch-node pipe" style="grid-column:1 / -1">
<div class="lbl">STEP 4</div>
<div class="nm">PaddleOCR (det → cls → rec)</div>
<div class="meta">Three-stage text recognition</div>
</div>
</div>
</div>

<h3 style="margin:2rem 0 1rem;color:var(--text);font-size:1.1rem">🧩 Core class: BatchAnalyze</h3>
<div class="grid g2">
<div class="card">
<h3>Image masking</h3>
<p>Regions are masked before recognition so that bboxes from the previous step do not contaminate the next one.</p>
<span class="evidence">_apply_mask_boxes_to_image @ batch_analyze.py:63-85</span>
</div>
<div class="card">
<h3>Pruning empty OCR blocks</h3>
<p>Blocks whose OCR result is an empty string are dropped, which reduces noise downstream.</p>
<span class="evidence">_prune_empty_ocr_text_blocks @ batch_analyze.py:96-110</span>
</div>
<div class="card">
<h3>Table inline-object extraction</h3>
<p>Tables may embed formulas or images; this step pulls them out for a second recognition pass.</p>
<span class="evidence">_extract_table_inline_objects @ batch_analyze.py:214-302</span>
</div>
<div class="card">
<h3>Inference scheduling: __call__</h3>
<p>A main loop of about 250 lines that coordinates data flow across the four models.</p>
<span class="evidence">__call__ @ batch_analyze.py:303-550</span>
</div>
</div>
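<p>Of these helpers, empty-block pruning is the easiest to picture. The sketch below shows the idea only: <code>prune_empty_text_blocks</code> is a hypothetical function, and the block schema (a dict with <code>type</code> and <code>text</code> keys) is an assumption, not the structure used in <code>batch_analyze.py</code>.</p>

```python
def prune_empty_text_blocks(blocks):
    """Drop text blocks whose OCR text is empty or whitespace-only.

    Non-text blocks (tables, images, ...) are kept even without text.
    """
    return [
        block
        for block in blocks
        if block.get("type") != "text" or block.get("text", "").strip()
    ]


if __name__ == "__main__":
    sample = [
        {"type": "text", "text": "   "},
        {"type": "text", "text": "Figure 1 shows the layout."},
        {"type": "table", "text": ""},
    ]
    print(prune_empty_text_blocks(sample))
```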

<h3 style="margin:2rem 0 1rem;color:var(--text);font-size:1.1rem">🏷️ Magic Model: label → BlockType mapping</h3>
<pre><span class="com"># mineru/backend/pipeline/pipeline_magic_model.py:18-42</span>
PP_DOCLAYOUT_V2_LABELS_TO_BLOCK_TYPES = {
    <span class="str">"image"</span>: BlockType.IMAGE,
    <span class="str">"table"</span>: BlockType.TABLE,
    <span class="str">"display_formula"</span>: BlockType.INTERLINE_EQUATION,
    <span class="str">"text"</span>: BlockType.TEXT,
    <span class="com"># ...</span>
}

VISUAL_MAIN_TYPES = (BlockType.IMAGE, BlockType.TABLE, BlockType.CHART, BlockType.CODE)
VISUAL_CHILD_TYPES = (BlockType.CAPTION, BlockType.FOOTNOTE)
</pre>

<div class="callout">
<h4>🔧 Batch size constants</h4>
<p><code>LAYOUT=1</code> · <code>MFR=16</code> · <code>OCR_DET=8</code>: formula recognition gets the largest batch (many small crops), while layout detection runs one page at a time (large images).<span class="evidence" style="display:inline-block;margin-left:.5rem">batch_analyze.py:35-47</span></p>
</div>
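<p>Per-model batch sizes come down to slicing each model's work list differently. A minimal sketch, assuming the three values quoted above; the constant names and the <code>batched</code> helper are hypothetical and are not taken from <code>batch_analyze.py</code>.</p>

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Values quoted in the callout above; the names are illustrative.
LAYOUT_BATCH_SIZE = 1
MFR_BATCH_SIZE = 16
OCR_DET_BATCH_SIZE = 8


def batched(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


if __name__ == "__main__":
    formula_crops = [f"crop_{i}" for i in range(40)]
    sizes = [len(chunk) for chunk in batched(formula_crops, MFR_BATCH_SIZE)]
    print(sizes)
```

<p>Small formula crops pack well into large batches, whereas full-page layout images are big enough that a batch of one keeps memory use predictable.</p>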
</section>

<!-- ===== 5. HYBRID ===== -->
<section id="hybrid">
<div class="section-head">
<div class="section-num">05 / HYBRID</div>
<h2>Hybrid Backend · The Most Practical Compromise</h2>
<p>The Pipeline provides the structure and the VLM fills in the details. Tables and formulas are handed to the VLM for a second recognition pass, which gives acceptable accuracy and speed.</p>
</div>

<pre><span class="com"># mineru/backend/hybrid/hybrid_analyze.py:1-150</span>
<span class="kw">def</span> <span class="fn">hybrid_analyze</span>(pdf_bytes, lang_list, parse_method=<span class="str">"auto"</span>, ...):
    <span class="com"># 1. Decide whether OCR is needed (lines 50-58)</span>
    _ocr_enable = ocr_classify(pdf_bytes, parse_method)

    <span class="com"># 2. If it is, run the Pipeline OCR</span>
    <span class="kw">if</span> _ocr_enable:
        ocr_res_list = ocr_det(...)

    <span class="com"># 3. Hand hard blocks such as tables and formulas to the VLM for a second pass</span>
    <span class="com"># 4. Finally, merge everything into the middle JSON</span>
</pre>

<div class="callout tip">
<h4>💡 Why Hybrid is the default recommendation</h4>
<p>A standalone VLM mostly fails on table structure (merged cells, nested tables), while the Pipeline's SLANet has a strong inductive bias for structured tables. Hybrid lets each cover the other's weak spot. For reference, the Pipeline backend alone reaches <strong>86.2</strong> on OmniDocBench v1.5.</p>
</div>
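<p>Steps 3 and 4 of the outline can be illustrated with a small routing function. This is a sketch under assumptions rather than the logic in <code>hybrid_analyze.py</code>: <code>hybrid_merge</code>, <code>HARD_TYPES</code>, the block schema, and the <code>vlm_recognize</code> callback are all hypothetical.</p>

```python
# Block types assumed to benefit from a second pass by the VLM.
HARD_TYPES = {"table", "interline_equation"}


def hybrid_merge(blocks, vlm_recognize):
    """Keep pipeline results for easy blocks; re-recognize hard ones.

    `vlm_recognize` is any callable that takes a block and returns new content.
    The input list and its dicts are left unmodified.
    """
    merged = []
    for block in blocks:
        if block["type"] in HARD_TYPES:
            refined = dict(block, content=vlm_recognize(block), source="vlm")
        else:
            refined = dict(block, source="pipeline")
        merged.append(refined)
    return merged


if __name__ == "__main__":
    page = [
        {"type": "text", "content": "Revenue by quarter"},
        {"type": "table", "content": "(low-confidence table)"},
    ]
    print(hybrid_merge(page, lambda block: "(table re-read by the VLM)"))
```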
</section>

<!-- ===== 6. OFFICE ===== -->
<section id="office">
<div class="section-head">
<div class="section-num">06 / OFFICE · New in 3.0</div>
<h2>Office Backend · DOCX No Longer Needs a PDF Detour</h2>
<p>Native Office parsing arrived in 3.0: DOCX → Markdown now follows a text-only path instead of rendering to PDF first.</p>
</div>

<pre><span class="com"># mineru/backend/office/docx_analyze.py:11-29</span>
<span class="kw">def</span> <span class="fn">office_docx_analyze</span>(file_bytes, image_writer=<span class="kw">None</span>):
    file_stream = BytesIO(file_bytes)
    results = convert_binary(file_stream)  <span class="com"># DocxConverter</span>
    middle_json = result_to_middle_json(results, image_writer)
    <span class="kw">return</span> middle_json, results
</pre>

<div class="grid g3">
<div class="card">
<h3>python-docx</h3>
<p>Reads the underlying DOCX XML structure: paragraphs, tables, images, and styles.</p>
</div>
<div class="card">
<h3>mammoth</h3>
<p>Converts DOCX to HTML as an intermediate form, which is then mapped to the Middle JSON.</p>
</div>
<div class="card">
<h3>image_writer injection</h3>
<p>Images are not inlined; they are written to external storage (S3 or local disk) through a writer interface.</p>
</div>
</div>

<div class="callout warn">
<h4>⚠️ Pitfall</h4>
<p><code>office_middle_json_mkcontent.py</code> runs to 1,037 lines, and its table HTML → Markdown conversion is a known accuracy bottleneck. For complex tables (row/column spans, nesting), the PDF → VLM path is still the safer choice.</p>
</div>
</section>

<!-- ===== 7. OUTPUT ===== -->
<section id="output">
<div class="section-head">
<div class="section-num">07 / OUTPUT</div>
<h2>Middle JSON · The Unified Output Format</h2>
<p>All four backends share the same <code>middle_json</code> data structure, which is what lets MinerU's architecture stay decoupled.</p>
</div>

<pre><span class="com"># Unified intermediate representation</span>
middle_json = {
    <span class="str">"meta_info"</span>: {...},
    <span class="str">"doc_title"</span>: <span class="kw">str</span>,
    <span class="str">"doc_layout_result"</span>: [...],
    <span class="str">"para_blocks"</span>: [
        {
            <span class="str">"type"</span>: BlockType,
            <span class="str">"blocks"</span>: [
                {
                    <span class="str">"type"</span>: BlockType,
                    <span class="str">"lines"</span>: [
                        {<span class="str">"spans"</span>: [
                            {
                                <span class="str">"type"</span>: ContentType,
                                <span class="str">"content"</span>: <span class="kw">str</span>,
                                <span class="str">"bbox"</span>: [x1, y1, x2, y2],
                            }
                        ]}
                    ]
                }
            ]
        }
    ]
}
</pre>
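<p>Given the nesting shown above (para_blocks → blocks → lines → spans), a consumer can walk the structure with a short generator. The sketch assumes exactly that schema; <code>iter_span_text</code> is a hypothetical helper, and real output may contain block types with a different nesting depth.</p>

```python
def iter_span_text(middle_json):
    """Yield the `content` of every span, in document order."""
    for para in middle_json.get("para_blocks", []):
        for block in para.get("blocks", []):
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    yield span.get("content", "")


if __name__ == "__main__":
    doc = {
        "para_blocks": [
            {
                "type": "text",
                "blocks": [
                    {
                        "type": "text",
                        "lines": [
                            {"spans": [{"type": "text", "content": "Hello", "bbox": [0, 0, 40, 12]}]},
                            {"spans": [{"type": "text", "content": "world", "bbox": [0, 14, 40, 26]}]},
                        ],
                    }
                ],
            }
        ]
    }
    print(" ".join(iter_span_text(doc)))
```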

<h3 style="margin:2rem 0 1rem;color:var(--text);font-size:1.1rem">🌐 Language-aware line joining</h3>
<pre><span class="com"># mineru/backend/vlm/vlm_middle_json_mkcontent.py:58-90</span>
block_lang = detect_lang(block_text)

<span class="kw">if</span> block_lang <span class="kw">in</span> {<span class="str">'zh'</span>, <span class="str">'ja'</span>, <span class="str">'ko'</span>}:  <span class="com"># CJK: no space at line breaks</span>
    para_text += content
<span class="kw">else</span>:  <span class="com"># Western text: handle hyphens</span>
    <span class="kw">if</span> is_hyphen_at_line_end(content):
        para_text += content[:-<span class="num">1</span>]  <span class="com"># drop the trailing hyphen</span>
    <span class="kw">else</span>:
        para_text += <span class="str">f"{content} "</span>
</pre>
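<p>The excerpt relies on <code>detect_lang</code> and <code>is_hyphen_at_line_end</code> from the surrounding module. A self-contained approximation of the same rule is sketched below; <code>looks_cjk</code> and <code>join_lines</code> are hypothetical names, and the Unicode-range check is a stand-in assumption for real language detection.</p>

```python
def looks_cjk(text: str) -> bool:
    """Rough CJK check based on a few Unicode ranges (not a language detector)."""
    return any(
        "\u4e00" <= ch <= "\u9fff"      # CJK Unified Ideographs
        or "\u3040" <= ch <= "\u30ff"   # Hiragana and Katakana
        or "\uac00" <= ch <= "\ud7af"   # Hangul syllables
        for ch in text
    )


def join_lines(lines):
    """Join physical lines into one paragraph string.

    CJK text is concatenated directly; Western text gets a space between
    lines, except that a trailing hyphen is removed and the word is rejoined.
    """
    cjk = looks_cjk("".join(lines))
    out = ""
    for index, line in enumerate(lines):
        is_last = index == len(lines) - 1
        if cjk or is_last:
            out += line
        elif line.endswith("-"):
            out += line[:-1]
        else:
            out += line + " "
    return out


if __name__ == "__main__":
    print(join_lines(["docu-", "ment parsing", "works"]))
    print(join_lines(["中文", "换行"]))
```

<p>Removing every line-end hyphen also rejoins genuinely hyphenated compounds, which is one reason edge cases remain in any rule of this kind.</p>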

<div class="callout info">
<h4>🌏 The truth about 109 languages</h4>
<p>This report traces the README's "109 languages" to the <code>magika</code>-based detection in <code>guess_suffix_or_lang.py:43-54</code>; since <code>magika</code> is a file-type detector, that mapping is tentative. MinerU's own language handling amounts to two line-joining strategies (CJK and Western) plus block-level detection with <code>fast-langdetect</code>.</p>
</div>

<h3 style="margin:2rem 0 1rem;color:var(--text);font-size:1.1rem">📐 Configurable LaTeX delimiters</h3>
<pre>delimiters = {
    <span class="str">'display'</span>: {<span class="str">'left'</span>: <span class="str">'$$'</span>, <span class="str">'right'</span>: <span class="str">'$$'</span>},  <span class="com"># display formulas</span>
    <span class="str">'inline'</span>: {<span class="str">'left'</span>: <span class="str">'$'</span>, <span class="str">'right'</span>: <span class="str">'$'</span>}  <span class="com"># inline formulas</span>
}
<span class="com"># Can be changed to \[...\] / \(...\) via config.yaml</span>
</pre>
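<p>How such a delimiter table is applied can be shown in a few lines. This is a sketch only: <code>wrap_formula</code> is a hypothetical helper, and how MinerU reads the setting from its configuration is not reproduced here.</p>

```python
DEFAULT_DELIMITERS = {
    "display": {"left": "$$", "right": "$$"},
    "inline": {"left": "$", "right": "$"},
}

# Alternative set matching the \[...\] / \(...\) style mentioned above.
BRACKET_DELIMITERS = {
    "display": {"left": r"\[", "right": r"\]"},
    "inline": {"left": r"\(", "right": r"\)"},
}


def wrap_formula(latex: str, kind: str = "inline", delimiters=DEFAULT_DELIMITERS) -> str:
    """Surround a LaTeX string with the delimiters configured for `kind`."""
    pair = delimiters[kind]
    return f"{pair['left']}{latex}{pair['right']}"


if __name__ == "__main__":
    print(wrap_formula("E = mc^2"))
    print(wrap_formula("E = mc^2", "display", BRACKET_DELIMITERS))
```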
</section>

<!-- ===== 8. DEPLOY ===== -->
<section id="deploy">
<div class="section-head">
<div class="section-num">08 / DEPLOY</div>
<h2>Deployment · Four Ways to Run It</h2>
<p>From the command line to a REST API, a Gradio UI, and Docker orchestration, MinerU covers the deployment paths thoroughly.</p>
</div>

<div class="grid g2">
<div class="card">
<h3>CLI <span class="pill">Local</span></h3>
<p><code>mineru -p input.pdf -o output/</code>: the simplest option, for single-file processing.</p>
<span class="evidence">mineru/cli/client.py</span>
</div>
<div class="card">
<h3>FastAPI REST <span class="pill">Service</span></h3>
<p>Async task queue (new in 3.0); tasks are cleaned up automatically after 24 h, and responses can be packaged as a ZIP.</p>
<span class="evidence">mineru/cli/fast_api.py:130-149</span>
</div>
<div class="card">
<h3>Gradio UI <span class="pill">Interactive</span></h3>
<p>Upload PDFs in a web page, watch progress in real time, and preview results online.</p>
<span class="evidence">mineru/cli/gradio_app.py</span>
</div>
<div class="card">
<h3>Docker Compose <span class="pill">Production</span></h3>
<p>Multi-stage Dockerfile plus API / Router / Redis orchestration, with model-download warm-up.</p>
<span class="evidence">docker/</span>
</div>
</div>
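<p>The 24 h task cleanup mentioned for the REST service is a time-to-live policy. A minimal in-memory sketch of that policy follows; <code>TaskStore</code> and its methods are hypothetical and do not mirror how <code>fast_api.py</code> stores tasks. The injectable <code>clock</code> is there only to make the behaviour easy to test.</p>

```python
import time

TASK_TTL_SECONDS = 24 * 60 * 60  # the 24 h retention described above


class TaskStore:
    """In-memory task results that expire after a fixed time-to-live."""

    def __init__(self, ttl=TASK_TTL_SECONDS, clock=time.time):
        self._ttl = ttl
        self._clock = clock
        self._tasks = {}  # task_id -> (created_at, result)

    def put(self, task_id, result):
        self._tasks[task_id] = (self._clock(), result)

    def get(self, task_id):
        entry = self._tasks.get(task_id)
        return None if entry is None else entry[1]

    def cleanup(self):
        """Delete tasks older than the TTL and return how many were removed."""
        cutoff = self._clock() - self._ttl
        expired = [tid for tid, (created, _) in self._tasks.items() if created < cutoff]
        for tid in expired:
            del self._tasks[tid]
        return len(expired)
```

<p>In a service, <code>cleanup()</code> would typically run on a timer; with several workers the store would need to live in shared storage rather than process memory.</p>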

<h3 style="margin:2rem 0 1rem;color:var(--text);font-size:1.1rem">📋 ParseRequestOptions (all REST parameters)</h3>
<pre><span class="com"># mineru/cli/fast_api.py:130-149</span>
<span class="kw">@dataclass</span>
<span class="kw">class</span> <span class="fn">ParseRequestOptions</span>:
    files: <span class="kw">list</span>[UploadFile]
    lang_list: <span class="kw">list</span>[<span class="kw">str</span>]
    backend: <span class="kw">str</span>  <span class="com"># "vlm" / "pipeline" / "hybrid-ocr"</span>
    parse_method: <span class="kw">str</span>  <span class="com"># "auto" / "txt" / "ocr"</span>
    formula_enable: <span class="kw">bool</span>
    table_enable: <span class="kw">bool</span>
    server_url: Optional[<span class="kw">str</span>]  <span class="com"># remote VLM server</span>
    return_md: <span class="kw">bool</span>
    return_middle_json: <span class="kw">bool</span>
    return_model_output: <span class="kw">bool</span>
    return_content_list: <span class="kw">bool</span>
    return_images: <span class="kw">bool</span>
    response_format_zip: <span class="kw">bool</span>
</pre>

<div class="callout">
<h4>💾 Model download source</h4>
<p>Downloads default to <strong>ModelScope</strong> (faster inside China) and can be switched to HuggingFace. The source is set with the environment variable <code>MINERU_MODEL_SOURCE=modelscope</code>, and weights are cached under <code>~/.mineru/models/vlm/Qwen2-VL-2B/</code>.</p>
</div>
</section>

<!-- ===== 9. HIGHLIGHTS ===== -->
<section id="highlights">
<div class="section-head">
<div class="section-num">09 / HIGHLIGHTS</div>
<h2>Engineering Highlights</h2>
<p>A few details that stood out while reading the source, all of them things a team only does after running in production.</p>
</div>

<div class="grid g2">
<div class="card">
<h3>⚡ Singleton + thread lock</h3>
<p>Model loading is guarded by <code>ModelSingleton._lock = threading.RLock()</code>, so concurrent requests do not load the 2B model twice (each extra load costs about 4 GB of VRAM).</p>
<span class="evidence">vlm_analyze.py:40-50</span>
</div>
<div class="card">
<h3>🔀 Graceful degradation</h3>
<p>If the VLM service goes down, requests fall back to the Pipeline automatically instead of returning a 500. That is table stakes for a production API.</p>
<span class="evidence">mineru/cli/common.py</span>
</div>
<div class="card">
<h3>🌏 CJK line-break strategy</h3>
<p>No trailing space is added for Chinese, Japanese, and Korean lines, and line-end hyphens are removed for Western text, so mixed-language Markdown reads naturally.</p>
<span class="evidence">vlm_middle_json_mkcontent.py:58-90</span>
</div>
<div class="card">
<h3>🔄 async doc_analyze</h3>
<p>Sync and async versions coexist; FastAPI can call <code>aio_doc_analyze()</code> directly for concurrent inference.</p>
<span class="evidence">vlm_analyze.py:331-380</span>
</div>
<div class="card">
<h3>📦 Unified middle JSON</h3>
<p>The four backends are fully decoupled and share one <code>middle_json</code>, so rendering, testing, and visualization are reused rather than rewritten for every new backend.</p>
<span class="evidence">backend/*/model_output_to_middle_json.py</span>
</div>
<div class="card">
<h3>🧪 End-to-end tests</h3>
<p><code>tests/unittest/test_e2e.py</code> runs real PDFs and covers the pipeline / vlm / hybrid backends plus the txt / ocr parse modes.</p>
<span class="evidence">tests/unittest/test_e2e.py</span>
</div>
</div>
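<p>The singleton-plus-lock pattern from the first card can be sketched in isolation. This is not the <code>ModelSingleton</code> implementation: <code>ModelRegistry</code>, its <code>get</code> method, and the <code>loader</code> callback are hypothetical, and a plain object stands in for the model.</p>

```python
import threading


class ModelRegistry:
    """Load each model at most once, even under concurrent requests."""

    _lock = threading.RLock()
    _models = {}

    @classmethod
    def get(cls, key, loader):
        # The lock covers both the check and the load, so two threads
        # can never run the expensive loader for the same key.
        with cls._lock:
            if key not in cls._models:
                cls._models[key] = loader()
            return cls._models[key]


if __name__ == "__main__":
    load_count = []

    def fake_loader():
        load_count.append(1)  # stands in for a multi-gigabyte model load
        return object()

    workers = [
        threading.Thread(target=ModelRegistry.get, args=("vlm", fake_loader))
        for _ in range(8)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    print("loads:", len(load_count))
```

<p>An <code>RLock</code> rather than a plain <code>Lock</code> lets a loader call back into the registry from the same thread without deadlocking.</p>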
</section>

<!-- ===== 10. PITFALLS ===== -->
<section id="pitfalls">
<div class="section-head">
<div class="section-num">10 / PITFALLS</div>
<h2>Pitfalls</h2>
<p>The easiest places to trip up, judging from the source and the issue tracker; knowing them in advance lets you steer around them.</p>
</div>

<div class="grid g2">
<div class="card">
<h3>🔥 GPU OOM</h3>
<p>The default <code>batch_size</code> is chosen from available VRAM, but concurrent multi-page jobs with large images can still run out of memory. Passing <code>--batch-size 1</code> manually is the safest setting.</p>
<span class="evidence">utils.py:94-110</span>
</div>
<div class="card">
<h3>🔤 Mixed full-width / half-width punctuation</h3>
<p>PDFs that mix in full-width punctuation can produce duplicated marks in the Markdown. <code>full_to_half_exclude_marks()</code> cleans this up, but edge cases remain.</p>
<span class="evidence">utils/char_utils.py</span>
</div>
<div class="card">
<h3>📊 Complex table accuracy</h3>
<p>SLANet/UNet handle standard tables well, but recognition drops noticeably on tables with row/column spans or nesting. Mitigation: use Hybrid so the VLM does a second pass.</p>
<span class="evidence">hybrid_analyze.py</span>
</div>
<div class="card">
<h3>🖥️ VRAM detection in CPU mode</h3>
<p><code>get_vram()</code> returns system RAM in CPU mode, so take care not to read it as GPU memory. macOS + mlx bypasses this check.</p>
<span class="evidence">utils/model_utils.py</span>
</div>
</div>
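<p>For the full-width / half-width issue, the underlying conversion is a fixed Unicode offset. The sketch below illustrates it; <code>full_to_half</code> and its <code>exclude</code> parameter are hypothetical and are not the signature of <code>full_to_half_exclude_marks()</code> in <code>char_utils.py</code>.</p>

```python
def full_to_half(text: str, exclude: str = "") -> str:
    """Convert full-width ASCII variants to half-width characters.

    Characters listed in `exclude` are left untouched, for example
    punctuation that should stay full-width in CJK prose.
    """
    converted = []
    for ch in text:
        code = ord(ch)
        if ch in exclude:
            converted.append(ch)
        elif code == 0x3000:  # ideographic space -> ASCII space
            converted.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:  # full-width forms of '!' through '~'
            converted.append(chr(code - 0xFEE0))
        else:
            converted.append(ch)
    return "".join(converted)


if __name__ == "__main__":
    print(full_to_half("ABC123"))
    print(full_to_half("你好,世界", exclude=","))
```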
</section>

<!-- ===== 11. VERDICT ===== -->
<section id="verdict">
<div class="section-head">
<div class="section-num">11 / VERDICT</div>
<h2>Reuse Paths</h2>
<p>With the source read, the real question is how to plug it into existing work; that is the point of reading source code in the first place.</p>
</div>

<div class="grid g3">
<div class="card" style="border-color:var(--accent4)">
<h3>🎓 Bar-exam project</h3>
<p>Textbook PDF → Markdown cleanup on the <strong>Pipeline</strong> backend (accuracy first); formulas and tables come through faithfully and can be loaded straight into the question bank.</p>
</div>
<div class="card" style="border-color:var(--accent2)">
<h3>📊 Consulting report generation</h3>
<p>Ingest reference-report PDFs with the <strong>VLM</strong> backend to extract a structured outline quickly, then feed it to a downstream LLM for rewriting.</p>
</div>
<div class="card" style="border-color:var(--accent3)">
<h3>🤖 Hermes / HiClaw</h3>
<p>Expose the <code>mineru-api</code> REST service to agents as a sub-capability, with both DOCX and PDF entry points and <strong>Hybrid</strong> as the fallback.</p>
</div>
</div>

<div class="callout tip" style="margin-top:2rem">
<h4>🎯 Integration advice</h4>
<p>
<strong>Lightweight use</strong>: run <code>pip install -U "mineru[vlm]"</code>; a local Qwen2-VL-2B is enough.<br>
<strong>As a service</strong>: run <code>mineru-api</code> with a Redis queue and 24 h auto-cleanup; there is no need to reinvent the wheel.<br>
<strong>Shared across projects</strong>: deploy a single <code>mineru.kang-kang.com</code> and reuse it over REST from Hermes, HiClaw, and the bar-exam project, instead of having each project deploy its own 2 GB model.
</p>
</div>
</section>

<footer>
<div class="links">
<a href="https://github.com/opendatalab/MinerU" target="_blank">GitHub · opendatalab/MinerU</a>
<a href="https://opendatalab.github.io/MinerU/zh/" target="_blank">Official docs</a>
<a href="https://www.shlab.org.cn/news/5443982" target="_blank">Shanghai AI Laboratory</a>
</div>
<div>Analysis date: 2026-04-13 · Version analyzed: v3.0.9 master · Author: Kang</div>
</footer>

</body>
</html>