DeepSeek OCR Vision Encoder
80M-parameter windowed SAM plus 300M-parameter CLIP-Large align local glyph detail with global layout features, retaining fidelity in dense legal, financial, and scientific PDFs.
Tiny → Base → Large → Gundam progression visualizes how DeepSeek OCR maintains low token counts while scaling visual fidelity.
DeepSeek OCR is a two-stage transformer-based document AI that compresses page images into compact vision tokens before decoding them with a high-capacity mixture-of-experts language model. Stage 1 merges a windowed SAM vision transformer with a dense CLIP-Large encoder and a 16× convolutional compressor; Stage 2 uses the DeepSeek-3B-MoE decoder (~570M active parameters per token) to reconstruct text, HTML, and figure annotations with minimal loss.
Trained on 30 million real PDF pages plus synthetic charts, formulas, and diagrams, DeepSeek OCR preserves layout structure, tables, chemistry (SMILES strings), and geometry tasks. Its CLIP heritage maintains multimodal competence—captions and object grounding remain intact even after aggressive compression.
By reducing a 1024×1024 page to just 256 tokens, DeepSeek OCR enables long-document ingestion that would overwhelm conventional OCR pipelines, keeping global semantics while slashing compute requirements.
More than 100 languages—including Latin, CJK, Cyrillic, and specialized scientific scripts—benefit from DeepSeek OCR’s training distribution, enabling global digitization and data generation projects.
From Tiny (64 tokens) to Gundam (multi-viewport tiling), DeepSeek OCR allows precision tuning between speed and fidelity for invoices, blueprints, and large-format scans.
Outputs HTML tables, Markdown charts, SMILES chemistry, and geometry annotations, enabling direct ingestion into analytics pipelines without manual reconstruction.
MIT-licensed weights let organizations run DeepSeek OCR on-premises, sidestepping the regulatory scrutiny that hosted access via DeepSeek's China-based infrastructure can attract.
Rasterized pages (up to 1280×1280) are split into up to 4096 patches and compressed 16× into 256–400 tokens. Local windows ensure glyph accuracy while CLIP-Large preserves page semantics.
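For a rough sense of the token budget per mode, the back-of-the-envelope calculation below assumes 16×16-pixel patches and the 16× compressor described above; the mode resolutions follow DeepSeek's published figures, but treat the arithmetic as illustrative.

```python
# Back-of-the-envelope vision-token math for DeepSeek OCR modes.
# Assumes square pages, 16x16-pixel SAM patches, and a 16x compressor;
# real mode budgets may differ slightly from these rounded figures.
PATCH_SIZE = 16          # assumed patch edge in pixels
COMPRESSION_FACTOR = 16  # 16x convolutional downsampling of patch tokens

def vision_tokens(page_px: int) -> int:
    """Estimate vision tokens for a page_px x page_px rasterized page."""
    patches = (page_px // PATCH_SIZE) ** 2
    return patches // COMPRESSION_FACTOR

for mode, size in {"Tiny": 512, "Base": 1024, "Large": 1280}.items():
    print(f"{mode}: {size}x{size} px -> {vision_tokens(size)} vision tokens")
# Base: 1024x1024 px -> 4096 patches -> 256 tokens, matching the figures above.
```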
The mixture-of-experts decoder activates ~570M parameters per token, reconstructing text, layout tags, and captions. FlashAttention and CUDA optimizations sustain GPU throughput.
CLIP pretraining lets DeepSeek OCR align textual summaries with diagrams, charts, and figures—vital for scientific documents and data visualization handoffs.
The compression-to-decoding pipeline keeps context intact (a toy sketch of the compression step follows the list):
1. High-resolution PDF page (640–1280 px) enters SAM patch extraction.
2. 16× convolutional compression reduces the patches to 64–400 tokens (context optical compression).
3. DeepSeek OCR MoE decoding (~570M active parameters) runs with FlashAttention acceleration.
4. Output is structured HTML, Markdown, or captions with layout-preserving results.
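As a toy illustration of step 2, the snippet below downsamples a 64×64 grid of patch features to 256 vision tokens with two stride-2 convolutions; the channel widths and layer choices are assumptions for readability, not DeepSeek's actual compressor.

```python
import torch
import torch.nn as nn

# Toy illustration of the 16x token compression step, not the real model.
# A 1024x1024 page at 16-pixel patches gives a 64x64 grid of patch features;
# two stride-2 convolutions shrink it to 16x16 = 256 vision tokens (16x fewer).
patch_grid = torch.randn(1, 256, 64, 64)   # (batch, feature dim, 64x64 patch grid)

compressor = nn.Sequential(
    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),   # 64x64 -> 32x32
    nn.GELU(),
    nn.Conv2d(512, 1024, kernel_size=3, stride=2, padding=1),  # 32x32 -> 16x16
)

vision_tokens = compressor(patch_grid).flatten(2).transpose(1, 2)
print(vision_tokens.shape)  # torch.Size([1, 256, 1024]): 256 tokens for the decoder
```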
Benchmark studies indicate DeepSeek OCR delivers state-of-the-art accuracy on structured documents while maintaining low token budgets.
| OCR System | Accuracy Snapshot | Speed / Throughput | Core Strengths | Deployment |
|---|---|---|---|---|
| DeepSeek OCR | ~97% exact match at 10× compression | ~200k pages/day per NVIDIA A100 | Layout-rich OCR, tables, formulas, diagrams, multilingual | Open-source (MIT); local GPU or DeepSeek API |
| Google Cloud Vision | ~98% on mixed benchmarks | Elastic cloud throughput | Enterprise support, multilingual APIs | Proprietary pay-per-use API |
| AWS Textract | ~97–99% on forms | Managed cloud scaling | Invoice & form extraction with JSON output | Proprietary pay-per-use API |
| Azure OCR | ~99.8% on clean typed text | Azure ecosystem integrations | Strong for printed pages; handwriting variance | Proprietary pay-per-use API |
| Tesseract OSS | ~90–95% depending on scans | Local CPU/GPU | Open-source, handwriting friendly | Open-source (Apache 2.0) |
Sources: Fox compression benchmark, OmniDocBench, AI Multiple accuracy reviews, DeepSeek documentation.
Clone the DeepSeek OCR GitHub repo, download the 6.7 GB safetensors checkpoint, and configure PyTorch 2.6+ with FlashAttention. Base mode runs on 8–10 GB GPUs, while Gundam tiling benefits from 40 GB A100s.
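A minimal local-inference sketch, assuming the Hugging Face transformers loading path with trust_remote_code; the model id, prompt format, and infer() arguments follow the public repo's README as best understood and may differ across releases, so verify against your checkout.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative local-inference sketch; the model id, prompt format, and
# infer() keyword arguments are assumptions based on the public repo README.
MODEL_ID = "deepseek-ai/DeepSeek-OCR"  # assumed Hugging Face model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    use_safetensors=True,
    _attn_implementation="flash_attention_2",  # requires FlashAttention installed
)
model = model.eval().cuda().to(torch.bfloat16)

# Base mode (1024x1024) is a reasonable default on an 8-10 GB GPU.
result = model.infer(
    tokenizer,
    prompt="<image>\nConvert the document to markdown.",
    image_file="invoice_page.png",   # hypothetical input page
    output_path="./ocr_out",
    base_size=1024,
    image_size=640,
    crop_mode=False,                 # enable for Gundam-style tiling
    save_results=True,
)
```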
Utilize DeepSeek’s OpenAI-compatible API endpoints to submit images and receive structured text. Pricing mirrors the platform’s token billing (~$0.028 per million input tokens for cache hits).
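A hedged sketch of that flow with the openai Python client; the base_url, model name, and image-message support are assumptions to confirm against DeepSeek's current API documentation.

```python
import base64
from openai import OpenAI

# Sketch of submitting a page image through an OpenAI-compatible endpoint.
# The base_url, model name, and image-message support are assumptions;
# confirm them against DeepSeek's API docs before relying on this.
client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

with open("contract_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="deepseek-ocr",  # hypothetical model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Convert this page to HTML, preserving tables."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```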
Convert OCR outputs to JSON, link SMILES strings to cheminformatics pipelines, or auto-caption diagrams for bilingual publishing—all using DeepSeek OCR’s structured results.
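As one example, an HTML table emitted by the model can be flattened to JSON with standard tooling; the snippet assumes well-formed <table> markup and a pandas install with lxml available.

```python
from io import StringIO
import pandas as pd

# Turn an HTML table from the OCR output into JSON records.
# Assumes well-formed <table> markup; pd.read_html needs lxml (or bs4+html5lib).
ocr_html = """
<table>
  <tr><th>Invoice</th><th>Amount</th></tr>
  <tr><td>INV-001</td><td>1,250.00</td></tr>
  <tr><td>INV-002</td><td>980.50</td></tr>
</table>
"""

tables = pd.read_html(StringIO(ocr_html))      # one DataFrame per <table>
records = tables[0].to_json(orient="records")  # one JSON object per table row
print(records)
```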
Compress thousands of words per page into compact tokens for downstream search, summarization, and knowledge graph pipelines.
Extract geometry reasoning, engineering annotations, and chemical SMILES from visual assets to support scientific analysis.
Build global corpora across 100+ languages, scanning books or surveys to create training data for downstream language models.
Embed into invoice, contract, or form-processing platforms to emit layout-aware JSON and HTML ready for automation.
Browse glimpses of DeepSeek OCR in action—architecture diagrams, benchmark dashboards, and real-world conversions.
Accuracy drops to ~60% at 20× compression; opt for Large or Gundam modes when microtext or dense tables are present.
Fine vector charts remain tough; combine with vector-native parsers when CAD precision is essential.
Primarily trained on printed text; supplement with handwriting OCR tools for cursive-heavy workloads.
Real-time throughput requires modern GPUs. Batch processing or DeepSeek’s managed API can smooth compute needs.
Download the ~6.7 GB safetensors checkpoint and operate DeepSeek OCR locally without license fees, customizing workflows to your compliance standards.
Hosted access follows DeepSeek’s token pricing (~$0.028 per million input tokens for cache hits). Plan budgets around compression mode and document volume.
Hardware planning: a single A100 (~200k pages/day) can drive enterprise queues, while 20 nodes × 8 A100s reach ~33 million pages/day for large-scale digitization.
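To make that planning concrete, here is a quick back-of-the-envelope calculation using the throughput and pricing figures above (the per-page token budget is an assumption):

```python
# Capacity and cost planning using the figures quoted above.
PAGES_PER_A100_PER_DAY = 200_000
NODES, GPUS_PER_NODE = 20, 8

fleet_pages_per_day = PAGES_PER_A100_PER_DAY * NODES * GPUS_PER_NODE
print(f"Fleet throughput: ~{fleet_pages_per_day / 1e6:.0f}M pages/day")  # same scale as the ~33M quoted

# Hosted-API input cost at ~$0.028 per million input tokens (cache hits);
# output tokens are billed separately. 400 tokens/page assumes Large mode.
COST_PER_M_INPUT_TOKENS = 0.028
TOKENS_PER_PAGE = 400
pages = 1_000_000
cost = pages * TOKENS_PER_PAGE / 1e6 * COST_PER_M_INPUT_TOKENS
print(f"Input-token cost for {pages:,} pages: ~${cost:,.2f}")
```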
DeepSeek OCR slices pages into patches, applies 16× convolutional downsampling, and forwards only 64–400 vision tokens to the MoE decoder, retaining layout cues while cutting context size tenfold.
NVIDIA A100 (40 GB) offers peak throughput (~200k pages/day), while RTX 30-series cards with ≥8 GB VRAM can handle Base mode for moderate loads.
Handwriting is not a core focus; performance remains limited compared to specialized cursive OCR tools. Pair DeepSeek OCR with handwriting engines when needed.
Yes. Tests show near-lossless HTML/Markdown reproduction for tables and chart structures, enabling analytics pipelines without manual clean-up.
DeepSeek OCR covers roughly 100 languages, spanning Latin, CJK, Cyrillic, and scientific notation, thanks to its extensive real and synthetic training data.
DeepSeek OCR can emit plain text, HTML, Markdown, structured JSON, SMILES chemistry strings, and contextual captions, depending on prompts.
Local deployment keeps data on-prem under the MIT license. When using DeepSeek’s API, consult compliance guidance due to scrutiny of the company’s cloud infrastructure.
It matches or exceeds cloud competitors on complex documents while using far fewer vision tokens, making it ideal for GPU-constrained operations.
Hugging Face Spaces, community notebooks, and “awesome DeepSeek” repositories showcase demos, while SDKs integrate with Adobe, Figma, and Python clients.
Yes. Store conversations as images to expand LLM context windows, and let DeepSeek OCR reconstruct the text when required.
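A toy sketch of that idea with Pillow; the page size, margins, and default font are arbitrary assumptions, and the saved image would later be passed back through DeepSeek OCR to recover the text.

```python
from PIL import Image, ImageDraw

# Toy "optical context" sketch: render older chat turns onto a page image,
# so an LLM can drop the raw text and recover it later via DeepSeek OCR.
# Page size, margins, and the default font are arbitrary assumptions.
def render_history(turns, width=1024, height=1024, margin=32, line_height=18):
    page = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(page)
    y = margin
    for speaker, text in turns:
        draw.text((margin, y), f"{speaker}: {text}", fill="black")
        y += line_height
        if y > height - margin:   # stop when the page is full
            break
    return page

history = [("user", "Summarize the Q3 revenue table."),
           ("assistant", "Q3 revenue grew 12% quarter over quarter...")]
render_history(history).save("context_page.png")  # later re-read via OCR
```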
Practitioners and researchers across the globe are sharing how DeepSeek OCR’s context optical compression shifts their document workflows. Explore a curated feed of reactions captured from X (Twitter).
The big blue whale is back with something wild this time!

DeepSeek built an OCR model that can compress text by 10x using vision tokens.

Let me explain:

They had a core insight - A picture containing text requires far fewer tokens to represent than the raw text itself.

Now,… pic.twitter.com/tIYtq437qX
— Unwind AI (@unwind_ai_) October 21, 2025
DeepSeek-OCR is seriously impressive. By converting long-form context into image tokens, it achieves roughly 10× compression with almost no degradation, and keeps about 60% accuracy even at 20× compression. This makes dramatic improvements to LLM long-context processing possible. On top of that, it also looks extremely capable as an ordinary OCR model. pic.twitter.com/Ya6ae3Mbwz
— 石川陽太 Yota Ishikawa (@ytiskw) October 20, 2025
The name deepseek-ocr is too understated; without digging in you would assume it is just another OCR model. Yet this model achieves a 10× information compression ratio: one image token can stand in for ten text tokens, which is a big deal, and it blew up on Hacker News. DeepSeek also proposes using image blurring to mimic how human memory fades over time, so the same image can be read by expert models at different resolutions. https://t.co/y2xt9IwiF7 pic.twitter.com/4D8tNe7Oki
— Datou (@Datou) October 20, 2025
Unlike closed AI labs, DeepSeek proves they are truly open research

Their OCR paper treats paragraphs as pixels and is 60x leap more efficient than traditional LLMs

Small super efficient models are the future pic.twitter.com/RY7PJoeH3E
— Bindu Reddy (@bindureddy) October 21, 2025
DeepSeek OCR! Open source is a gift that keeps on giving! AWESOME! I just converted a 400 page PDF into markdown using this fine new open source model. It took under 4 minutes! pic.twitter.com/QuxcDhVlPG
— Dr. Tristan Behrens (@DrTBehrens) October 20, 2025
🚀 DeepSeek-OCR — the new frontier of OCR from @deepseek_ai , exploring optical context compression for LLMs, is running blazingly fast on vLLM ⚡ (~2500 tokens/s on A100-40G) — powered by vllm==0.8.5 for day-0 model support.

🧠 Compresses visual contexts up to 20× while keeping… pic.twitter.com/bx3d7LnfaR
— vLLM (@vllm_project) October 20, 2025
Dive deeper into the context optical compression paradigm, architecture, and benchmarks by downloading the official PDF. Review it offline to explore detailed experiments, ablations, and deployment guidance straight from the DeepSeek OCR team.
Digitize, analyze, and restructure complex PDFs, charts, and multilingual archives using context optical compression.