Engineering

Which AI Model Is Best for Document Extraction? We Ran the Numbers.

Yonas Yeneneh · May 10, 2026 · 10 min read

ParseApi supports multiple AI backends for document extraction: Anthropic's Claude family, OpenAI's GPT-4 series, Google's Gemini, and any OpenAI-compatible endpoint — Ollama, Groq, Together AI, vLLM, and others. We benchmark these models regularly to inform our default routing rules and give you honest data for your own configuration decisions.

Here's what we found in our most recent benchmark run, conducted in April 2026.

Methodology

We assembled 400 real-world documents across four categories:

| Category | Count | Notes |
| --- | --- | --- |
| Invoices | 100 | Varied vendor templates; mix of digital-native and scanned |
| Resumes | 100 | 1–3 pages, including two-column and table-heavy layouts |
| Contracts | 100 | 2–15 pages; dense legalese, definition sections, exhibits |
| Medical intake forms | 100 | Structured forms, some with handwritten fields |

Each document was run through four models:

  • Claude 3.5 Sonnet (claude-3-5-sonnet-20241022)
  • GPT-4o (gpt-4o-2024-08-06)
  • Gemini 1.5 Pro (gemini-1.5-pro-002)
  • Gemini 1.5 Flash (included as a cost-performance reference point)

We used an identical extraction prompt for all models, with a pre-confirmed schema per document category. Ground truth was established by human reviewers on a stratified sample, and scores were computed against those human-verified values.

Metrics:

  1. Field accuracy — extracted value matches ground truth exactly (or within normalization tolerance for dates/numbers)
  2. Hallucination rate — model returns a value that doesn't appear anywhere in the source document
  3. Schema conformance — model returns the correct field types without requiring a retry
  4. p50/p95 latency — time from API call to complete response
  5. Effective cost per page — based on published per-token pricing at time of benchmark
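To make the first two metrics concrete, here is a minimal sketch of how field accuracy and hallucination rate can be scored. The function names, normalization rules, and date formats are illustrative assumptions, not the exact harness we used:

```python
from datetime import datetime

def normalize(value):
    """Normalize dates and numbers so pure formatting differences don't count as misses."""
    if isinstance(value, str):
        v = value.strip()
        # Try a couple of common date formats and canonicalize to ISO
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
            try:
                return datetime.strptime(v, fmt).date().isoformat()
            except ValueError:
                pass
        # Strip currency symbols / thousands separators from numbers
        stripped = v.replace("$", "").replace(",", "")
        try:
            return float(stripped)
        except ValueError:
            return v.lower()
    return value

def score_document(extracted: dict, truth: dict, source_text: str):
    """Return (correct, hallucinated, total) field counts for one document.

    A field is hallucinated when the model produced a non-null value
    that appears nowhere in the source document.
    """
    correct = hallucinated = 0
    for field, true_value in truth.items():
        got = extracted.get(field)
        if normalize(got) == normalize(true_value):
            correct += 1
        elif got is not None and str(got) not in source_text:
            hallucinated += 1
    return correct, hallucinated, len(truth)
```

A real harness needs fuzzier matching (substring containment after normalization, for instance), but the split between "wrong" and "hallucinated" is the part that matters: only values absent from the source count toward the hallucination rate.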

Field accuracy

| Model | Invoices | Resumes | Contracts | Medical | Average |
| --- | --- | --- | --- | --- | --- |
| Claude 3.5 Sonnet | 96.1% | 93.8% | 91.4% | 95.5% | 94.2% |
| GPT-4o | 93.2% | 92.4% | 89.7% | 91.4% | 91.7% |
| Gemini 1.5 Pro | 90.4% | 89.1% | 86.2% | 87.5% | 88.3% |
| Gemini 1.5 Flash | 84.7% | 83.2% | 79.1% | 82.3% | 82.3% |

Claude leads on accuracy, particularly for contracts and medical forms where dense or overlapping information increases the chance of extraction errors. The gap widens most on long, multi-page documents — Claude appears to handle context accumulation better at higher page counts.


Hallucination rate

Hallucination — where the model returns a value that doesn't appear anywhere in the source document — is the most dangerous failure mode in document extraction. A hallucinated invoice total or a fabricated medication name is actively harmful.

| Model | Hallucination rate |
| --- | --- |
| Claude 3.5 Sonnet | 0.8% |
| GPT-4o | 1.4% |
| Gemini 1.5 Pro | 2.1% |
| Gemini 1.5 Flash | 3.7% |

Claude's instruction-following keeps hallucination lowest. We observed that Claude more consistently returns null for missing fields rather than guessing a plausible value — which is the correct behavior when a field isn't present in the source.
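The null-for-missing behavior is something you can push every model toward in the prompt. Here's a hedged sketch of a prompt builder that makes it explicit; the schema fields and wording are illustrative, not our production prompt:

```python
import json

# Hypothetical schema for a medical intake form (field names are illustrative)
schema = {
    "patient_name": "string or null",
    "date_of_birth": "string (YYYY-MM-DD) or null",
    "medications": "array of strings, or null",
}

def build_extraction_prompt(schema: dict, document_text: str) -> str:
    """Build a prompt that tells the model to return null rather than guess."""
    return (
        "Extract the following fields from the document below and reply with "
        "JSON only, matching this schema:\n"
        f"{json.dumps(schema, indent=2)}\n\n"
        "If a field is not present in the document, return null for it. "
        "Never infer or invent a value that does not appear in the text.\n\n"
        f"Document:\n{document_text}"
    )
```

An explicit "return null" instruction narrows the gap between models, but in our runs it didn't close it: some models still guess plausible values even when told not to.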


Latency

| Model | p50 (ms) | p95 (ms) |
| --- | --- | --- |
| Gemini 1.5 Flash | 890 | 1,900 |
| Gemini 1.5 Pro | 1,200 | 2,600 |
| GPT-4o | 1,800 | 3,900 |
| Claude 3.5 Sonnet | 2,100 | 4,800 |

Gemini wins on speed. For use cases where extraction happens asynchronously in the background — which is the common case in ParseApi — latency matters less than accuracy. But for interactive scenarios where a user watches a progress indicator, the difference between 2 seconds and 1 second is noticeable.
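For readers reproducing these numbers: p50 and p95 are just quantiles over raw per-request latency samples. A minimal stdlib version (the helper name is ours):

```python
import statistics

def p50_p95(latencies_ms):
    """Compute median and 95th-percentile latency from raw samples (ms)."""
    # quantiles(n=100) returns the 99 percentile cut points; index 94 is p95
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return statistics.median(latencies_ms), qs[94]
```

One caveat when comparing providers this way: measure from the same network location and include any queueing your client does, since p95 is dominated by the slow tail rather than the typical request.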


Cost per page

| Model | Est. cost per page |
| --- | --- |
| Gemini 1.5 Flash | ~$0.0002 |
| Gemini 1.5 Pro | ~$0.0019 |
| GPT-4o | ~$0.0038 |
| Claude 3.5 Sonnet | ~$0.0045 |

Estimates assume ~1,500 input tokens per page (varies with document density) and ~200 output tokens. Claude is approximately 2.4× more expensive per page than Gemini Pro — but accuracy differences at scale can easily justify the premium, depending on the cost of downstream errors.
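The arithmetic behind these estimates is simple enough to sketch. The prices below are placeholders, and the table's figures also fold in benchmark-time pricing and per-category token counts, so don't expect this toy function to reproduce them exactly:

```python
def cost_per_page(input_price_per_mtok: float, output_price_per_mtok: float,
                  input_tokens: int = 1500, output_tokens: int = 200) -> float:
    """Estimated extraction cost for one page, given $/million-token prices."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# Placeholder prices in $ per million tokens -- check your provider's
# current pricing page before relying on any of these numbers.
estimate = cost_per_page(3.00, 15.00)
```

Since input tokens dominate at ~1,500 per page, the input price is the main lever; output pricing only matters for verbose schemas.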


What ParseApi routes to by default

Our default routing uses Claude 3.5 Sonnet for documents where accuracy is critical — contracts, medical records, financial documents. We fall back to GPT-4o for simpler documents and as a retry provider. For high-volume lower-stakes extraction (basic forms, simple receipts), we route to Gemini 1.5 Pro.

Admins can override routing per-provider, per-document-type, and per-folder. Users on the Scale plan can provide their own API keys (BYOK) and route to any model they want.
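If you're building your own routing on top of BYOK, the default policy described above boils down to a lookup table plus a retry fallback. This sketch is hypothetical: the document-type keys and structure are ours, not ParseApi's actual configuration format:

```python
# Hypothetical routing table -- keys and structure are illustrative,
# not ParseApi's actual configuration format.
ROUTING_RULES = {
    "contract": "claude-3-5-sonnet-20241022",
    "medical_form": "claude-3-5-sonnet-20241022",
    "financial_doc": "claude-3-5-sonnet-20241022",
    "simple_form": "gemini-1.5-pro-002",
    "receipt": "gemini-1.5-pro-002",
}
DEFAULT_MODEL = "gpt-4o-2024-08-06"   # simpler documents
RETRY_MODEL = "gpt-4o-2024-08-06"     # second attempt goes to a different provider

def route(document_type: str, attempt: int = 1) -> str:
    """Pick a model for a document type, switching providers on retry."""
    if attempt > 1:
        return RETRY_MODEL
    return ROUTING_RULES.get(document_type, DEFAULT_MODEL)
```

Routing retries to a different provider is deliberate: a document that one model's parser chokes on often goes through cleanly elsewhere, so a cross-provider retry recovers more failures than retrying the same model.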


Recommendations by use case

| Use case | Recommended model | Reason |
| --- | --- | --- |
| Invoices with line items | Claude 3.5 Sonnet | Best line-item and nested-object accuracy |
| High-volume simple forms | Gemini 1.5 Flash | Lowest cost; acceptable accuracy for simple schemas |
| Resumes | GPT-4o | Good balance of speed and accuracy for variable layouts |
| Legal contracts | Claude 3.5 Sonnet | Best at dense multi-page extraction |
| Real-time interactive | Gemini 1.5 Pro | Best latency among the higher-accuracy models |
| Cost-sensitive batch work | Gemini 1.5 Pro | ~2.4× cheaper than Claude, with a ~6-point accuracy trade-off |

A note on model churn

These benchmarks reflect models available in April 2026. Model performance changes with every version update, and new models are released frequently. ParseApi re-runs benchmarks quarterly and updates default routing accordingly.

If you're configuring your own routing rules, treat this table as a starting point, not a permanent truth.

Try ParseApi free

100 pages per month at no cost. No credit card required.

Get started