# Which AI Model Is Best for Document Extraction? We Ran the Numbers.
ParseApi supports multiple AI backends for document extraction: Anthropic's Claude family, OpenAI's GPT-4 series, Google's Gemini, and any OpenAI-compatible endpoint — Ollama, Groq, Together AI, vLLM, and others. We benchmark these models regularly to inform our default routing rules and give you honest data for your own configuration decisions.
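For reference, "OpenAI-compatible" means any server that speaks the OpenAI API can be targeted by swapping the base URL. Here's a generic sketch of that pattern (not ParseApi code; the endpoint and model name are placeholders for a local Ollama server):

```python
# Generic OpenAI-compatible pattern; the URL and model name are
# placeholders for whatever server you run (Ollama shown here).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",  # most local servers accept any non-empty key
)

response = client.chat.completions.create(
    model="llama3.1",  # placeholder model name
    messages=[{"role": "user", "content": "Extract the invoice total from: ..."}],
)
print(response.choices[0].message.content)
```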
Here's what we found in our most recent benchmark run, conducted in April 2026.
## Methodology
We assembled 400 real-world documents across four categories:
| Category | Count | Notes |
|---|---|---|
| Invoices | 100 | Varied vendor templates; mix of digital-native and scanned |
| Resumes | 100 | 1–3 pages; includes two-column and table-heavy layouts |
| Contracts | 100 | 2–15 pages; dense legalese, definition sections, exhibits |
| Medical intake forms | 100 | Structured forms; some with handwritten fields |
Each document was run through four models:
- Claude 3.5 Sonnet (`claude-3-5-sonnet-20241022`)
- GPT-4o (`gpt-4o-2024-08-06`)
- Gemini 1.5 Pro (`gemini-1.5-pro-002`)
- Gemini 1.5 Flash (included as a cost-performance reference point)
We used an identical extraction prompt across all models, against a pre-confirmed schema for each document category. Ground truth was established by human reviewers on a stratified sample, and scores were computed against those human-verified values.
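For concreteness, a schema of the kind we pre-confirm per category looks roughly like this. The field names below are illustrative, not the exact benchmark schemas:

```python
# Illustrative invoice schema; the actual benchmark schemas are not
# reproduced here, but they follow this shape: field name -> expected type.
INVOICE_SCHEMA = {
    "vendor_name": "string",
    "invoice_number": "string",
    "invoice_date": "date",   # normalized to ISO 8601 before comparison
    "total": "number",
    "line_items": [
        {"description": "string", "quantity": "number", "unit_price": "number"}
    ],
}
```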
Metrics:
- Field accuracy — extracted value matches ground truth exactly, or within normalization tolerance for dates and numbers (see the sketch after this list)
- Hallucination rate — model returns a value that doesn't appear anywhere in the source document
- Schema conformance — model returns the correct field types without requiring a retry
- p50/p95 latency — time from API call to complete response
- Effective cost per page — based on published per-token pricing at time of benchmark
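To make the normalization tolerance concrete, here's a sketch of how field accuracy can be scored. The normalization rules shown (comma and currency-symbol stripping, a few date formats) are illustrative assumptions, not our exact implementation:

```python
# Sketch of field accuracy with normalization tolerance. The rules
# below are illustrative; a real scorer handles more formats.
from datetime import datetime

def normalize(value, field_type):
    """Reduce formatting variation before exact comparison."""
    if value is None:
        return None
    if field_type == "number":
        # "$1,234.50" and "1234.5" should compare equal
        try:
            return float(str(value).replace(",", "").replace("$", ""))
        except ValueError:
            return str(value)
    if field_type == "date":
        # accept a few common formats, compare as ISO dates
        for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"):
            try:
                return datetime.strptime(str(value), fmt).date().isoformat()
            except ValueError:
                continue
        return str(value)
    return str(value).strip().lower()

def field_accuracy(extracted: dict, ground_truth: dict, schema: dict) -> float:
    """Fraction of scalar fields whose normalized values match ground truth."""
    hits = sum(
        normalize(extracted.get(field), ftype) == normalize(ground_truth[field], ftype)
        for field, ftype in schema.items()
    )
    return hits / len(schema)
```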
## Field accuracy
| Model | Invoices | Resumes | Contracts | Medical | Average |
|---|---|---|---|---|---|
| Claude 3.5 Sonnet | 96.1% | 93.8% | 91.4% | 95.5% | 94.2% |
| GPT-4o | 93.2% | 92.4% | 89.7% | 91.4% | 91.7% |
| Gemini 1.5 Pro | 90.4% | 89.1% | 86.2% | 87.5% | 88.3% |
| Gemini 1.5 Flash | 84.7% | 83.2% | 79.1% | 82.3% | 82.3% |
Claude leads on accuracy, particularly for contracts and medical forms where dense or overlapping information increases the chance of extraction errors. The gap widens most on long, multi-page documents — Claude appears to handle context accumulation better at higher page counts.
## Hallucination rate
Hallucination — where the model returns a value that doesn't appear anywhere in the source document — is the most dangerous failure mode in document extraction. A hallucinated invoice total or a fabricated medication name is actively harmful.
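A minimal grounding check looks like the sketch below. Plain substring matching is an oversimplification; a production check would reuse the normalization logic above:

```python
# Sketch of a grounding check: a value that appears nowhere in the
# source text is flagged as a candidate hallucination.
def is_grounded(value, source_text: str) -> bool:
    if value is None:
        return True  # null for a missing field is correct, not hallucinated
    return str(value).lower() in source_text.lower()

def hallucination_rate(extractions: list[dict], sources: list[str]) -> float:
    """Fraction of extracted non-null values not found in their source."""
    total = flagged = 0
    for fields, text in zip(extractions, sources):
        for value in fields.values():
            if value is None:
                continue
            total += 1
            flagged += not is_grounded(value, text)
    return flagged / total if total else 0.0
```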
| Model | Hallucination rate |
|---|---|
| Claude 3.5 Sonnet | 0.8% |
| GPT-4o | 1.4% |
| Gemini 1.5 Pro | 2.1% |
| Gemini 1.5 Flash | 3.7% |
Claude's stronger instruction-following keeps its hallucination rate lowest. We observed that Claude more consistently returns null for missing fields rather than guessing a plausible value — the correct behavior when a field isn't present in the source.
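To illustrate with made-up values: given an invoice that contains no purchase-order number, the first output below is correct and the second is a hallucination, even though both satisfy the schema.

```python
# Illustrative only: for a source document with no PO number,
# returning null is correct; inventing a plausible-looking one is a
# hallucination even though it "fits" the schema.
correct   = {"invoice_total": 1240.00, "po_number": None}
incorrect = {"invoice_total": 1240.00, "po_number": "PO-88123"}  # fabricated
```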
## Latency
| Model | p50 (ms) | p95 (ms) |
|---|---|---|
| Gemini 1.5 Flash | 890 | 1,900 |
| Gemini 1.5 Pro | 1,200 | 2,600 |
| GPT-4o | 1,800 | 3,900 |
| Claude 3.5 Sonnet | 2,100 | 4,800 |
Gemini wins on speed. For use cases where extraction happens asynchronously in the background — which is the common case in ParseApi — latency matters less than accuracy. But for interactive scenarios where a user watches a progress indicator, the difference between 2 seconds and 1 second is noticeable.
## Cost per page
| Model | Est. cost per page |
|---|---|
| Gemini 1.5 Flash | ~$0.0002 |
| Gemini 1.5 Pro | ~$0.0019 |
| GPT-4o | ~$0.0038 |
| Claude 3.5 Sonnet | ~$0.0045 |
Estimates assume ~1,500 input tokens per page (varies with document density) and ~200 output tokens. Claude is approximately 2.4× more expensive per page than Gemini Pro — but accuracy differences at scale can easily justify the premium, depending on the cost of downstream errors.
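The arithmetic is simple enough to sketch. The per-token prices below are placeholders, not the rates we used in the benchmark; plug in current provider pricing:

```python
# Sketch of the per-page cost estimate. Prices are placeholders in
# dollars per million tokens; substitute current provider pricing.
def cost_per_page(input_price_per_mtok: float,
                  output_price_per_mtok: float,
                  input_tokens: int = 1_500,
                  output_tokens: int = 200) -> float:
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# Example with placeholder prices ($1/M input, $5/M output):
print(f"${cost_per_page(1.00, 5.00):.4f} per page")  # -> $0.0025 per page
```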
## What ParseApi routes to by default
Our default routing uses Claude 3.5 Sonnet for documents where accuracy is critical — contracts, medical records, financial documents. We fall back to GPT-4o for simpler documents and as a retry provider. For high-volume lower-stakes extraction (basic forms, simple receipts), we route to Gemini 1.5 Pro.
Admins can override routing per-provider, per-document-type, and per-folder. Users on the Scale plan can provide their own API keys (BYOK) and route to any model they want.
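As a sketch only (this is plain Python data, not ParseApi's actual configuration syntax), the default routing described above has roughly this shape:

```python
# Hypothetical routing rules, expressed as plain data for illustration.
# ParseApi's real override mechanism is configured per provider,
# document type, and folder; this just shows the shape of the decision.
ROUTING_RULES = {
    "contracts":    "claude-3-5-sonnet-20241022",  # accuracy-critical
    "medical":      "claude-3-5-sonnet-20241022",
    "resumes":      "gpt-4o-2024-08-06",
    "simple_forms": "gemini-1.5-pro-002",          # high volume, lower stakes
}
RETRY_MODEL = "gpt-4o-2024-08-06"  # fallback when the primary call fails

def pick_model(document_type: str) -> str:
    return ROUTING_RULES.get(document_type, RETRY_MODEL)
```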
## Recommendations by use case
| Use case | Recommended model | Reason |
|---|---|---|
| Invoices with line items | Claude 3.5 Sonnet | Best line-item and nested-object accuracy |
| High-volume simple forms | Gemini 1.5 Flash | Lowest cost; acceptable accuracy for simple schemas |
| Resumes | GPT-4o | Good balance of speed and accuracy for variable layouts |
| Legal contracts | Claude 3.5 Sonnet | Best at dense multi-page extraction |
| Real-time interactive | Gemini 1.5 Pro | Best latency at reasonable accuracy |
| Cost-sensitive batch work | Gemini 1.5 Pro | ~2.4× lower cost than Claude with a ~6-point accuracy trade-off |
## A note on model churn
These benchmarks reflect models available in April 2026. Model performance changes with every version update, and new models are released frequently. ParseApi re-runs benchmarks quarterly and updates default routing accordingly.
If you're configuring your own routing rules, treat this table as a starting point, not a permanent truth.