# Which AI Model Is Best for Document Extraction? We Ran the Numbers.
ParseApi supports multiple AI backends for document extraction: Anthropic's Claude family, OpenAI's GPT-4 series, Google's Gemini, and any OpenAI-compatible endpoint — Ollama, Groq, Together AI, vLLM, and others. We benchmark these models regularly to inform our default routing rules and give you honest data for your own configuration decisions.
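For reference, "OpenAI-compatible" means any server that speaks the OpenAI API can be targeted by swapping the base URL. Here's a generic sketch of that pattern (not ParseApi code; the endpoint and model name are placeholders for a local Ollama server):

```python
# Generic OpenAI-compatible pattern; the URL and model name are
# placeholders for whatever server you run (Ollama shown here).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",  # most local servers accept any non-empty key
)

response = client.chat.completions.create(
    model="llama3.1",  # placeholder model name
    messages=[{"role": "user", "content": "Extract the invoice total from: ..."}],
)
print(response.choices[0].message.content)
```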
Here's what we found in our most recent benchmark run, conducted in April 2026.
## Methodology
We assembled 400 real-world documents across four categories:
| Category | Count | Notes |
|---|---|---|
| Invoices | 100 | Varied vendor templates; mix of digital-native and scanned |
| Resumes | 100 | 1–3 pages; includes two-column and table-heavy layouts |
| Contracts | 100 | 2–15 pages; dense legalese, definition sections, exhibits |
| Medical intake forms | 100 | Structured forms; some with handwritten fields |
Each document was run through four models:
- Claude 3.5 Sonnet (`claude-3-5-sonnet-20241022`)
- GPT-4o (`gpt-4o-2024-08-06`)
- Gemini 1.5 Pro (`gemini-1.5-pro-002`)
- Gemini 1.5 Flash (included as a cost-performance reference point)
We used an identical extraction prompt across all models, against a pre-confirmed schema for each document category. Ground truth was established by human reviewers on a stratified sample, and scores were computed against those human-verified values.
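For concreteness, a schema of the kind we pre-confirm per category looks roughly like this. The field names below are illustrative, not the exact benchmark schemas:

```python
# Illustrative invoice schema; the actual benchmark schemas are not
# reproduced here, but they follow this shape: field name -> expected type.
INVOICE_SCHEMA = {
    "vendor_name": "string",
    "invoice_number": "string",
    "invoice_date": "date",   # normalized to ISO 8601 before comparison
    "total": "number",
    "line_items": [
        {"description": "string", "quantity": "number", "unit_price": "number"}
    ],
}
```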
Metrics:
- Field accuracy — extracted value matches ground truth exactly, or within normalization tolerance for dates and numbers (see the sketch after this list)
- Hallucination rate — model returns a value that doesn't appear anywhere in the source document
- Schema conformance — model returns the correct field types without requiring a retry
- p50/p95 latency — time from API call to complete response
- Effective cost per page — based on published per-token pricing at time of benchmark
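To make the normalization tolerance concrete, here's a sketch of how field accuracy can be scored. The normalization rules shown (comma and currency-symbol stripping, a few date formats) are illustrative assumptions, not our exact implementation:

```python
# Sketch of field accuracy with normalization tolerance. The rules
# below are illustrative; a real scorer handles more formats.
from datetime import datetime

def normalize(value, field_type):
    """Reduce formatting variation before exact comparison."""
    if value is None:
        return None
    if field_type == "number":
        # "$1,234.50" and "1234.5" should compare equal
        try:
            return float(str(value).replace(",", "").replace("$", ""))
        except ValueError:
            return str(value)
    if field_type == "date":
        # accept a few common formats, compare as ISO dates
        for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"):
            try:
                return datetime.strptime(str(value), fmt).date().isoformat()
            except ValueError:
                continue
        return str(value)
    return str(value).strip().lower()

def field_accuracy(extracted: dict, ground_truth: dict, schema: dict) -> float:
    """Fraction of scalar fields whose normalized values match ground truth."""
    hits = sum(
        normalize(extracted.get(field), ftype) == normalize(ground_truth[field], ftype)
        for field, ftype in schema.items()
    )
    return hits / len(schema)
```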
## Field accuracy
| Model | Invoices | Resumes | Contracts | Medical | Average |
|---|---|---|---|---|---|
| Claude 3.5 Sonnet | 96.1% | 93.8% | 91.4% | 95.5% | 94.2% |
| GPT-4o | 93.2% | 92.4% | 89.7% | 91.4% | 91.7% |
| Gemini 1.5 Pro | 90.4% | 89.1% | 86.2% | 87.5% | 88.3% |
| Gemini 1.5 Flash | 84.7% | 83.2% | 79.1% | 82.3% | 82.3% |
Claude leads on accuracy, particularly for contracts and medical forms where dense or overlapping information increases the chance of extraction errors. The gap widens most on long, multi-page documents — Claude appears to handle context accumulation better at higher page counts.
## Hallucination rate
Hallucination — where the model returns a value that doesn't appear anywhere in the source document — is the most dangerous failure mode in document extraction. A hallucinated invoice total or a fabricated medication name is actively harmful.
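A minimal grounding check looks like the sketch below. Plain substring matching is an oversimplification; a production check would reuse the normalization logic above:

```python
# Sketch of a grounding check: a value that appears nowhere in the
# source text is flagged as a candidate hallucination.
def is_grounded(value, source_text: str) -> bool:
    if value is None:
        return True  # null for a missing field is correct, not hallucinated
    return str(value).lower() in source_text.lower()

def hallucination_rate(extractions: list[dict], sources: list[str]) -> float:
    """Fraction of extracted non-null values not found in their source."""
    total = flagged = 0
    for fields, text in zip(extractions, sources):
        for value in fields.values():
            if value is None:
                continue
            total += 1
            flagged += not is_grounded(value, text)
    return flagged / total if total else 0.0
```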
| Model | Hallucination rate |
|---|---|
| Claude 3.5 Sonnet | 0.8% |
| GPT-4o | 1.4% |
| Gemini 1.5 Pro | 2.1% |
| Gemini 1.5 Flash | 3.7% |
Claude's stronger instruction-following keeps its hallucination rate lowest. We observed that Claude more consistently returns null for missing fields rather than guessing a plausible value — the correct behavior when a field isn't present in the source.
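To illustrate with made-up values: given an invoice that contains no purchase-order number, the first output below is correct and the second is a hallucination, even though both satisfy the schema.

```python
# Illustrative only: for a source document with no PO number,
# returning null is correct; inventing a plausible-looking one is a
# hallucination even though it "fits" the schema.
correct   = {"invoice_total": 1240.00, "po_number": None}
incorrect = {"invoice_total": 1240.00, "po_number": "PO-88123"}  # fabricated
```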
## Latency
| Model | p50 (ms) | p95 (ms) |
|---|---|---|
| Gemini 1.5 Flash | 890 | 1,900 |
| Gemini 1.5 Pro | 1,200 | 2,600 |
| GPT-4o | 1,800 | 3,900 |
| Claude 3.5 Sonnet | 2,100 | 4,800 |
Gemini wins on speed. For use cases where extraction happens asynchronously in the background — which is the common case in ParseApi — latency matters less than accuracy. But for interactive scenarios where a user watches a progress indicator, the difference between 2 seconds and 1 second is noticeable.
## Cost per page
| Model | Est. cost per page |
|---|---|
| Gemini 1.5 Flash | ~$0.0002 |
| Gemini 1.5 Pro | ~$0.0019 |
| GPT-4o | ~$0.0038 |
| Claude 3.5 Sonnet | ~$0.0045 |
Estimates assume ~1,500 input tokens per page (varies with document density) and ~200 output tokens. Claude is approximately 2.4× more expensive per page than Gemini Pro — but accuracy differences at scale can easily justify the premium, depending on the cost of downstream errors.
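The arithmetic is simple enough to sketch. The per-token prices below are placeholders, not the rates we used in the benchmark; plug in current provider pricing:

```python
# Sketch of the per-page cost estimate. Prices are placeholders in
# dollars per million tokens; substitute current provider pricing.
def cost_per_page(input_price_per_mtok: float,
                  output_price_per_mtok: float,
                  input_tokens: int = 1_500,
                  output_tokens: int = 200) -> float:
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# Example with placeholder prices ($1/M input, $5/M output):
print(f"${cost_per_page(1.00, 5.00):.4f} per page")  # -> $0.0025 per page
```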
## What ParseApi routes to by default
Our default routing uses Claude 3.5 Sonnet for documents where accuracy is critical — contracts, medical records, financial documents. We fall back to GPT-4o for simpler documents and as a retry provider. For high-volume lower-stakes extraction (basic forms, simple receipts), we route to Gemini 1.5 Pro.
Admins can override routing per-provider, per-document-type, and per-folder. Users on the Scale plan can provide their own API keys (BYOK) and route to any model they want.
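As a sketch only (this is plain Python data, not ParseApi's actual configuration syntax), the default routing described above has roughly this shape:

```python
# Hypothetical routing rules, expressed as plain data for illustration.
# ParseApi's real override mechanism is configured per provider,
# document type, and folder; this just shows the shape of the decision.
ROUTING_RULES = {
    "contracts":    "claude-3-5-sonnet-20241022",  # accuracy-critical
    "medical":      "claude-3-5-sonnet-20241022",
    "resumes":      "gpt-4o-2024-08-06",
    "simple_forms": "gemini-1.5-pro-002",          # high volume, lower stakes
}
RETRY_MODEL = "gpt-4o-2024-08-06"  # fallback when the primary call fails

def pick_model(document_type: str) -> str:
    return ROUTING_RULES.get(document_type, RETRY_MODEL)
```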
## Recommendations by use case
| Use case | Recommended model | Reason |
|---|---|---|
| Invoices with line items | Claude 3.5 Sonnet | Best line-item and nested-object accuracy |
| High-volume simple forms | Gemini 1.5 Flash | Lowest cost; acceptable accuracy for simple schemas |
| Resumes | GPT-4o | Good balance of speed and accuracy for variable layouts |
| Legal contracts | Claude 3.5 Sonnet | Best at dense multi-page extraction |
| Real-time interactive | Gemini 1.5 Pro | Best latency at reasonable accuracy |
| Cost-sensitive batch work | Gemini 1.5 Pro | ~2.4× lower cost than Claude with a ~6-point accuracy trade-off |
## A note on model churn
These benchmarks reflect models available in April 2026. Model performance changes with every version update, and new models are released frequently. ParseApi re-runs benchmarks quarterly and updates default routing accordingly.
If you're configuring your own routing rules, treat this table as a starting point, not a permanent truth.