How ParseApi Auto-Detects Your Document Schema
When you create a new ParseApi folder, there's no schema editor waiting for you — no form fields to fill in, no JSON Schema to hand-write. You just upload documents. The schema emerges from the content.
This post explains how that works.
The three-document bootstrap
Schema detection triggers after your third unique document upload in a new folder. Before that threshold, ParseApi runs each document through an "open" extraction: the AI is instructed to extract every piece of structured information it can identify, without being constrained by a predetermined schema.
The results from three documents are then sent to a synthesis pass. The synthesis prompt asks the model to look across all three extractions and identify:
- Fields that appear in all three documents (core schema)
- Fields that appear in two of three (common but not universal)
- The most appropriate data type for each field: string, number, date, boolean, array, or object
- Whether any fields are nested objects or arrays of objects (such as line_items in an invoice)
The output of this synthesis is a proposed schema: a flat-or-nested field list with types.
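To make the shape of that output concrete, here is a minimal Python sketch of the same classification done deterministically. ParseApi performs this step with a model prompt, so this is an illustration only. The function names, the "tier" labels, and the choice to drop fields seen in just one document are assumptions made for the example.

```python
from collections import Counter
from datetime import date

def infer_type(value):
    """Map a Python value to one of the six schema type names."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        try:
            date.fromisoformat(value)
            return "date"
        except ValueError:
            return "string"
    return "string"

def propose_schema(extractions):
    """Classify fields by how many of the open extractions contain them."""
    counts = Counter(name for doc in extractions for name in doc)
    schema = {}
    for name, seen in counts.items():
        if seen < 2:
            continue  # assumption: fields seen in only one document are left out
        types = Counter(infer_type(doc[name]) for doc in extractions if name in doc)
        schema[name] = {
            "type": types.most_common(1)[0][0],
            "tier": "core" if seen == len(extractions) else "common",
        }
    return schema
```

Called with three open-extraction results, propose_schema returns one entry per field that appears at least twice, each with its majority type and whether it was universal ("core") or not ("common").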
What you see after synthesis
A schema review panel appears in the folder settings after synthesis completes. The proposed schema is shown as an editable field list. For each field you can:
- Rename it (e.g., change invoice_no to invoice_number)
- Change the type (e.g., demote total from number to string if your values include currency symbols)
- Mark it optional or required
- Delete it entirely
- Add a new field manually
Until you confirm the schema, extractions remain in draft state and won't appear on your public API endpoint. Once you confirm, the schema is locked in and all future uploads are extracted against it. You can still edit it later (see Schema evolution below).
How subsequent extractions use the schema
With a confirmed schema, the extraction prompt changes structure. Instead of "extract what you can", the prompt becomes:
"You are a document extraction engine. Extract the following fields from the document: [field list with types and descriptions]. If a field cannot be found, return null. Do not invent values."
This constrained prompt reduces hallucination significantly. The model is no longer guessing at what you want — it's looking for specific named fields. The schema also becomes the validation contract: if the model returns a field with the wrong type, the extraction engine issues a corrective prompt and retries.
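The validate-and-retry loop can be sketched in Python as follows. This is a rough illustration, not ParseApi's engine: call_model is a hypothetical client you would supply, the schema is simplified to a name-to-type mapping without descriptions, and the wording of the corrective prompt and the retry limit are assumptions.

```python
import json
from datetime import date

def _is_date(value):
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False

# One checker per schema type. Null is always accepted, because the prompt
# tells the model to return null for fields it cannot find.
CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "date": _is_date,
    "boolean": lambda v: isinstance(v, bool),
    "array": lambda v: isinstance(v, list),
    "object": lambda v: isinstance(v, dict),
}

def build_prompt(schema):
    fields = ", ".join(f"{name} ({ftype})" for name, ftype in schema.items())
    return (
        "You are a document extraction engine. Extract the following fields "
        f"from the document: {fields}. If a field cannot be found, return null. "
        "Do not invent values."
    )

def type_errors(schema, result):
    """Return the names of fields whose value does not match the schema type."""
    return [
        name for name, ftype in schema.items()
        if result.get(name) is not None and not CHECKS[ftype](result[name])
    ]

def extract(schema, document, call_model, max_retries=2):
    """call_model(prompt, document) -> JSON string (hypothetical model client)."""
    prompt = build_prompt(schema)
    for _ in range(max_retries + 1):
        result = json.loads(call_model(prompt, document))
        bad = type_errors(schema, result)
        if not bad:
            return result
        # Corrective prompt: name the offending fields and ask again.
        prompt = (build_prompt(schema)
                  + f" Your previous answer had the wrong type for: {', '.join(bad)}.")
    raise ValueError(f"fields still mistyped after retries: {bad}")
```

Treating the schema as the single source for both the prompt and the type check keeps the two from drifting apart when a field is renamed or retyped.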
Field-level confidence scores
Each extracted field receives a confidence score from 0 to 1, stored alongside the result:
{
  "result": {
    "invoice_number": "INV-2024-00841",
    "total": 4632.50,
    "due_date": "2024-12-14"
  },
  "confidence": {
    "invoice_number": 0.97,
    "total": 0.82,
    "due_date": 0.61
  }
}
Fields with a confidence below 0.6 are flagged in the dashboard for human review. These flags are also available via the API, so your application can route low-confidence extractions to a review queue before acting on them.
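If you prefer to apply your own threshold, the confidence map in the payload is enough to do the routing client-side. A minimal Python sketch, assuming the response shape shown above (the function name is made up for this example):

```python
REVIEW_THRESHOLD = 0.6  # same cut-off the dashboard uses

def fields_needing_review(extraction, threshold=REVIEW_THRESHOLD):
    """Return the field names whose confidence is strictly below the threshold."""
    return sorted(
        name
        for name, score in extraction["confidence"].items()
        if score < threshold
    )

extraction = {
    "result": {
        "invoice_number": "INV-2024-00841",
        "total": 4632.50,
        "due_date": "2024-12-14",
    },
    "confidence": {"invoice_number": 0.97, "total": 0.82, "due_date": 0.61},
}

print(fields_needing_review(extraction))        # [] (0.61 is not below 0.6)
print(fields_needing_review(extraction, 0.85))  # ['due_date', 'total']
```

A stricter threshold like 0.85 may suit workflows where a wrong total or due date is costly, at the price of a longer review queue.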
Schema evolution
Your document templates change. A new vendor starts including a purchase_order_number field your old schema doesn't have. You edit the schema to add it.
When you add a field, you can opt to re-run extraction on existing documents. ParseApi re-processes only the documents where that field is absent from the current result — documents that already have the field are not re-billed.
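The selection rule is simple enough to sketch in Python. The document records here (an "id" plus the current "result") are a hypothetical shape, and treating "absent" as "key missing" rather than "value is null" is an assumption.

```python
def documents_to_reprocess(documents, new_field):
    """Pick the documents whose current result lacks the newly added field."""
    return [doc["id"] for doc in documents if new_field not in doc["result"]]

documents = [
    {"id": "doc_1", "result": {"invoice_number": "INV-1"}},
    {"id": "doc_2", "result": {"invoice_number": "INV-2",
                               "purchase_order_number": "PO-77"}},
]

print(documents_to_reprocess(documents, "purchase_order_number"))  # ['doc_1']
```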
Field deletions are non-destructive: the raw extraction result still contains the original data; the deleted field is just excluded from the API response. The raw result is always preserved in extraction.raw_result, which is immutable.
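One way to picture this is as a projection of the preserved raw result onto whatever fields the schema currently lists. The Python sketch below is illustrative only; the function name is an assumption, and so is returning None for schema fields the raw result lacks.

```python
def api_response(raw_result, schema_fields):
    """Build the response from the raw result without modifying it."""
    return {name: raw_result.get(name) for name in schema_fields}

raw_result = {"invoice_number": "INV-1", "total": 10.0, "notes": "net 30"}

# "notes" was deleted from the schema, so it is simply not projected.
print(api_response(raw_result, ["invoice_number", "total"]))
```

Because the response is derived rather than stored, re-adding a deleted field only requires putting its name back in the schema.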
The schema as an API contract
Your folder's schema is published at:
GET /v1/{user}/{folder}/schema
And embedded in the auto-generated OpenAPI specification at:
GET /v1/{user}/{folder}/openapi.json
This means your API consumers can discover the expected response shape without asking you. Tools like Postman, Insomnia, and OpenAPI code generators can import the spec and immediately understand what the endpoint returns — and what types to expect for each field.
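A consumer could fetch the schema with nothing but the Python standard library. In this sketch the host is a placeholder, authentication is omitted because this post does not cover it, and the helper names are made up for illustration.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://api.example.com"  # placeholder host, not the real one

def schema_url(user, folder, base_url=BASE_URL):
    """Build the schema endpoint URL, escaping the path segments."""
    return f"{base_url}/v1/{quote(user, safe='')}/{quote(folder, safe='')}/schema"

def fetch_schema(user, folder, base_url=BASE_URL):
    """GET the published schema and decode the JSON body."""
    with urlopen(schema_url(user, folder, base_url)) as response:
        return json.load(response)
```

Swapping the final path segment for openapi.json would retrieve the OpenAPI document in the same way.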
What the schema doesn't do
The schema describes the expected structure of extracted data. It doesn't validate the content — whether an invoice total is positive, whether a date is in the future, whether an email address is valid. Content validation is the responsibility of your downstream application.
If you need content-level rules, use the webhook payload and validate in your own handler.
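As an example of such a handler-side check, here is a small Python sketch. It assumes the webhook payload carries the same "result" object shown earlier, which this post does not specify, and the two rules are arbitrary examples rather than recommendations.

```python
from datetime import date

def content_errors(result, today=None):
    """Return a list of content-rule violations for one extraction result."""
    today = today or date.today()
    errors = []

    total = result.get("total")
    if total is not None and total <= 0:
        errors.append("total must be positive")

    due = result.get("due_date")
    if due is not None and date.fromisoformat(due) < today:
        errors.append("due_date is in the past")

    return errors
```

Null fields are skipped here rather than reported, since the extraction prompt returns null for anything it cannot find; whether a missing value is itself an error is a decision for your application.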
The next release
We're working on schema templates for common document types: standard invoices, W-9 forms, pay stubs, driver's licenses, medical intake forms. Templates will pre-populate the schema with known field names and types, reducing the cold-start friction for documents you process frequently.
Watch this blog for the announcement.