MinerU: The PDF Parser That Doesn't Mangle Your Formulas

MinerU converts scientific PDFs, financial reports, and scanned documents into structured Markdown without destroying formulas or multi-column layouts. Originally built by OpenDataLab to generate training data for InternLM, it's become the go-to open-source alternative to commercial document APIs—despite GPU hunger and table extraction quirks that catch newcomers off guard.


You know the moment: you've just fed a batch of academic PDFs into your RAG pipeline, and what comes out the other end is mangled LaTeX rendered as Unicode soup, tables exported as JPEG artifacts, and a two-column layout reassembled in complete narrative chaos. PyPDF gives you gibberish. Commercial APIs want $3,000 a month. And your knowledge base is still broken.

MinerU emerged from a different kind of pressure test. When OpenDataLab needed to generate training data for InternLM—China's homegrown large language model—they faced a problem: scientific literature is dense with symbols, layouts, and multi-element pages that standard extractors can't handle. They built MinerU not to impress at a demo, but to ingest symbol-heavy papers at scale without losing fidelity. That origin story matters, because it means the tool was battle-tested on exactly the documents that break everything else.

What MinerU Actually Gets Right (and Wrong)

The architecture is a multi-model fusion pipeline: layout analysis, OCR, formula recognition (UniMERNet for LaTeX conversion), and table extraction, all orchestrated to reconstruct semantic structure rather than just scraping text. When you throw a scanned PDF with embedded digital layers, chemical equations, and multi-column conference proceedings at it, MinerU typically preserves reading order and outputs clean Markdown with LaTeX formulas intact.
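The fusion idea can be sketched as stage composition: layout analysis assigns blocks a reading order, specialist models rewrite each block, and a renderer emits Markdown in that order. The function and type names below are illustrative placeholders, not MinerU's internal API:

```python
from dataclasses import dataclass

@dataclass
class Block:
    kind: str      # "text", "formula", or "table" (placeholder taxonomy)
    content: str
    order: int     # reading-order index assigned by layout analysis

def layout_analysis(page: str) -> list[Block]:
    # Placeholder: a real layout model segments regions and orders them.
    return [Block("text", page, 0)]

def formula_to_latex(block: Block) -> Block:
    # Placeholder for a UniMERNet-style formula recognizer.
    return Block("formula", f"${block.content}$", block.order)

def render_markdown(blocks: list[Block]) -> str:
    # Reassemble blocks in reading order, not raw extraction order.
    return "\n\n".join(b.content for b in sorted(blocks, key=lambda b: b.order))

def parse_page(page: str) -> str:
    blocks = layout_analysis(page)
    blocks = [formula_to_latex(b) if b.kind == "formula" else b for b in blocks]
    return render_markdown(blocks)
```

The point of the sketch is the orchestration: each stage is a swappable model, and the reading-order sort is what keeps two-column layouts from interleaving into narrative chaos.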

The warts are real, though. Tables sometimes export as images instead of structured HTML—a documented pain point that shows up repeatedly in user reports. Version 1.3.6 introduced a CPU-mode regression where processing times ballooned from 3 minutes to 18 minutes due to heavier OCR models. GPU acceleration is strongly recommended; without it, you're looking at 6x slower throughput and potential out-of-memory errors on large documents. Apple Silicon support via MPS has been unstable enough that maintainers are backing away from it in favor of CUDA. Installation involves multi-model downloads and configuration that lighter tools don't demand.

These aren't dealbreakers—they're table stakes for the problem space. Document parsing is infrastructure, not innovation, and the tradeoffs reflect the genuine difficulty of extracting structure from PDFs that were never designed to preserve it.

MinerU vs. the Field: Marker, Docling, LlamaParse

Comparative evaluations in 2024–2025 RAG tooling studies position MinerU as the general-purpose, balanced option. Marker has limitations with reading order on multi-column layouts. LlamaParse excels at table extraction but is weaker on formula fidelity. Docling is lighter-weight but narrower in scope. MinerU sits in the middle: not the fastest, not the most specialized, but reliable across document types—scientific papers, financial reports, technical docs, mixed scanned-digital PDFs.

The same team behind MinerU built OmniDocBench, a widely referenced document-parsing benchmark, which adds credibility when they claim their approach rivals commercial APIs in fidelity while staying fully local. Community tutorials show independent developers deploying MinerU in dataset-creation pipelines and RAG systems, packaged via Docker and exposed through cloud APIs—signals of real adoption beyond GitHub stars.

The 2.x Upgrade and Why Momentum Matters

Recent releases overhauled the architecture: unified intermediate format, a sub-1B parameter multimodal parsing model, >50% speed gains on supported GPUs, better table and layout parsing. The ecosystem is expanding too: MinerU-HTML extends the approach to web content extraction, and the tool keeps appearing in 2024–2025 comparison studies. The 50,000 stars aren't hype—they validate that the community desperately needed a middle ground between naive extractors and per-page API pricing.

When to Reach for MinerU (and When to Walk Away)

Use MinerU if you're ingesting academic papers, financial reports, or technical docs at scale for RAG or training data, and you have GPU budget. Walk away if you only need simple text extraction—PyPDF is fine—or if you can't afford the GPU overhead. Perfect table extraction might still require LlamaParse plus post-processing.
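If you do adopt it, a thin wrapper around the CLI keeps MinerU swappable in your ingestion pipeline. The `mineru -p <input> -o <output>` shape below reflects the 2.x command line, but treat the exact flags as an assumption to verify against `mineru --help` on your installed version:

```python
import subprocess
from pathlib import Path

def build_mineru_cmd(pdf: Path, out_dir: Path) -> list[str]:
    # Assumed 2.x CLI shape: `mineru -p <input.pdf> -o <output_dir>`.
    # Check the flags against your installed version before relying on this.
    return ["mineru", "-p", str(pdf), "-o", str(out_dir)]

def parse_pdf(pdf: Path, out_dir: Path) -> None:
    # Shell out rather than importing internals, so the parser can be
    # replaced (e.g. with LlamaParse for table-heavy corpora) later.
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_mineru_cmd(pdf, out_dir), check=True)
```

Shelling out instead of importing MinerU's Python internals means the GPU-heavy dependency stays isolated from the rest of your RAG code.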

Setup friction is the price of quality. Document parsing is the unsexy infrastructure holding together half the AI pipelines in production. MinerU does the job.


opendatalab/MinerU

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

61.1k stars · 5.1k forks

Topics: ai4science, document-analysis, docx, extract-data, layout-analysis