JSON Schema Guide for Document Extraction
Your JSON Schema is the most important part of the extraction process. It tells the extraction engine exactly what data to look for and how to structure the output. A well-designed schema gets you accurate, consistent results. A poorly designed one leads to missed fields and messy data.
This guide covers how schemas drive extraction, how to name and type fields, how to structure flat, nested, and repeating data, and which common mistakes to avoid.
How Schemas Work in Extraction
When you submit a document with a schema, the extraction engine:
- Reads the document content
- Looks at your schema to understand what data you want
- Identifies relevant information in the document
- Maps that information to your schema fields
- Returns structured JSON matching your schema exactly
The field names, types, and structure in your schema all serve as hints. The more descriptive and well-structured your schema, the better the results.
Schema Basics
Field Names Matter
The extraction engine uses field names as context. Compare:
// Vague — the engine has less context
{
"field1": { "type": "string" },
"field2": { "type": "number" },
"items": { "type": "array" }
}
// Descriptive — the engine knows what to look for
{
"vendor_name": { "type": "string" },
"total_amount": { "type": "number" },
"line_items": { "type": "array" }
}
Use names that clearly describe the data: invoice_date, shipping_address, unit_price, employee_id.
Data Types
Use the right type for each field:
| Type | Use For | Example |
|---|---|---|
| string | Text, names, IDs, dates | "vendor_name": { "type": "string" } |
| number | Amounts, prices, percentages | "total": { "type": "number" } |
| integer | Counts, quantities | "quantity": { "type": "integer" } |
| boolean | Yes/no values | "is_paid": { "type": "boolean" } |
| array | Lists, repeating items | "line_items": { "type": "array" } |
| object | Grouped, related fields | "vendor": { "type": "object" } |
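To confirm that an extraction result actually uses the types in the table above, a small recursive check can help. The following is a minimal sketch in plain Python, not part of the extraction engine; the names matches_type and type_errors are hypothetical helpers, and the check is strict (a null value in a typed field is reported as a mismatch).

```python
# Map JSON Schema type names to the Python types that json.loads() produces.
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def matches_type(value, type_name):
    """Return True if value is an instance of the given JSON Schema type."""
    # bool is a subclass of int in Python, so handle it before numeric checks.
    if isinstance(value, bool):
        return type_name == "boolean"
    return isinstance(value, JSON_TYPES[type_name])


def type_errors(value, schema, path="$"):
    """Recursively list the paths where value does not match the schema's types."""
    expected = schema.get("type")
    if expected is not None and not matches_type(value, expected):
        return [f"{path}: expected {expected}, got {type(value).__name__}"]
    errors = []
    if isinstance(value, dict):
        # Only fields present in the result are checked; missing fields are skipped.
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                errors += type_errors(value[key], sub_schema, f"{path}.{key}")
    elif isinstance(value, list) and "items" in schema:
        for index, item in enumerate(value):
            errors += type_errors(item, schema["items"], f"{path}[{index}]")
    return errors
```

Calling type_errors(result, schema) returns an empty list when every present field has the declared type, which makes it easy to use in a quick test while iterating on a schema.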
Format Hints
Add format to string fields for specific data types:
{
"invoice_date": { "type": "string", "format": "date" },
"vendor_email": { "type": "string", "format": "email" },
"company_website": { "type": "string", "format": "uri" }
}
The "format": "date" hint tells the extraction engine to return dates in ISO 8601 format (2025-01-15) rather than in whatever format appears in the document ("January 15, 2025" or "15.01.2025").
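To make the normalization concrete, here is a sketch of the same conversion done by hand in Python. It only illustrates the target format and is not how the engine works internally; the function name to_iso_date and the list of accepted formats are assumptions chosen to match the examples above.

```python
from datetime import datetime

# Formats assumed for illustration; extend the list for your own documents.
# %B (full month name) depends on the process locale; the default is English.
KNOWN_FORMATS = ["%Y-%m-%d", "%B %d, %Y", "%d.%m.%Y"]


def to_iso_date(text):
    """Return text as an ISO 8601 date string, or None if no format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(to_iso_date("January 15, 2025"))  # 2025-01-15
print(to_iso_date("15.01.2025"))        # 2025-01-15
```

With the format hint in the schema, this kind of cleanup code should not be needed downstream, which is the main reason to declare it.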
Structuring Your Schema
Flat Schemas
For simple documents with a handful of fields:
{
"type": "object",
"properties": {
"document_title": { "type": "string" },
"author": { "type": "string" },
"date": { "type": "string", "format": "date" },
"summary": { "type": "string" }
}
}
Best for: memos, simple letters, certificates.
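If you generate flat schemas in code rather than writing them by hand, a small builder keeps them consistent. This is a sketch only; flat_schema is a hypothetical helper, not a function of any SDK.

```python
import json


def flat_schema(fields):
    """Build a flat object schema from {field_name: type name or property dict}."""
    properties = {}
    for name, spec in fields.items():
        # A bare string is shorthand for {"type": <string>}; dicts are copied as-is.
        properties[name] = {"type": spec} if isinstance(spec, str) else dict(spec)
    return {"type": "object", "properties": properties}


schema = flat_schema({
    "document_title": "string",
    "author": "string",
    "date": {"type": "string", "format": "date"},
    "summary": "string",
})
print(json.dumps(schema, indent=2))
```

The printed schema is equivalent to the flat example above.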
Nested Objects
Group related fields together:
{
"type": "object",
"properties": {
"vendor": {
"type": "object",
"properties": {
"name": { "type": "string" },
"address": { "type": "string" },
"tax_id": { "type": "string" }
}
},
"customer": {
"type": "object",
"properties": {
"name": { "type": "string" },
"address": { "type": "string" }
}
}
}
}
Nesting helps the extraction engine understand relationships. It knows name under vendor refers to the vendor's name, not the customer's.
Arrays for Repeating Data
Use arrays when a document contains a list of similar items:
{
"type": "object",
"properties": {
"line_items": {
"type": "array",
"items": {
"type": "object",
"properties": {
"description": { "type": "string" },
"quantity": { "type": "number" },
"unit_price": { "type": "number" },
"total": { "type": "number" }
}
}
}
}
}
Arrays work for: invoice line items, contract clauses, employee certifications, table rows, list entries.
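Because each array element has the same shape, extracted arrays convert easily to tabular formats. The sketch below turns a line_items array into CSV text using Python's standard csv module; line_items_to_csv is a hypothetical helper and the sample values are made up for illustration.

```python
import csv
import io


def line_items_to_csv(line_items, columns):
    """Render a list of line-item dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for item in line_items:
        # Keep only the requested columns; missing fields become empty cells.
        writer.writerow({column: item.get(column, "") for column in columns})
    return buffer.getvalue()


items = [
    {"description": "Widget", "quantity": 2, "unit_price": 4.5, "total": 9.0},
    {"description": "Gadget", "quantity": 1, "unit_price": 20.0},
]
print(line_items_to_csv(items, ["description", "quantity", "unit_price", "total"]))
```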
Arrays of Strings
For simple lists:
{
"type": "object",
"properties": {
"key_findings": {
"type": "array",
"items": { "type": "string" }
},
"recommendations": {
"type": "array",
"items": { "type": "string" }
}
}
}
Real-World Schema Patterns
Invoice Schema
{
"type": "object",
"properties": {
"vendor": {
"type": "object",
"properties": {
"name": { "type": "string" },
"vat_id": { "type": "string" }
}
},
"invoice_number": { "type": "string" },
"date": { "type": "string", "format": "date" },
"due_date": { "type": "string", "format": "date" },
"line_items": {
"type": "array",
"items": {
"type": "object",
"properties": {
"description": { "type": "string" },
"quantity": { "type": "number" },
"unit_price": { "type": "number" },
"total": { "type": "number" }
}
}
},
"subtotal": { "type": "number" },
"tax_amount": { "type": "number" },
"total": { "type": "number" }
}
}
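Since this schema returns amounts as numbers, you can sanity-check an extracted invoice arithmetically. The sketch below assumes that the subtotal equals the sum of line totals and that total equals subtotal plus tax, which does not hold for every invoice (discounts and shipping break it). It also assumes subtotal, tax_amount, and total are all present. reconcile_invoice is a hypothetical helper.

```python
from decimal import Decimal


def _dec(value):
    # Convert via str so 0.1 becomes Decimal("0.1") rather than a float artifact.
    return Decimal(str(value))


def reconcile_invoice(invoice, tolerance=Decimal("0.01")):
    """Return a list of arithmetic inconsistencies found in an extracted invoice."""
    problems = []
    subtotal = _dec(invoice["subtotal"])
    tax = _dec(invoice["tax_amount"])
    total = _dec(invoice["total"])
    line_sum = sum(
        (_dec(item["total"]) for item in invoice.get("line_items", [])),
        Decimal("0"),
    )
    if abs(line_sum - subtotal) > tolerance:
        problems.append(f"line items sum to {line_sum}, subtotal is {subtotal}")
    if abs(subtotal + tax - total) > tolerance:
        problems.append(f"subtotal + tax is {subtotal + tax}, total is {total}")
    return problems
```

An empty list means the numbers agree within the tolerance; anything else is a good candidate for manual review.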
Personnel Record Schema
{
"type": "object",
"properties": {
"employee": {
"type": "object",
"properties": {
"full_name": { "type": "string" },
"employee_id": { "type": "string" },
"department": { "type": "string" },
"position": { "type": "string" },
"start_date": { "type": "string", "format": "date" }
}
},
"compensation": {
"type": "object",
"properties": {
"annual_salary": { "type": "number" },
"currency": { "type": "string" },
"pay_grade": { "type": "string" }
}
},
"certifications": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"year": { "type": "integer" }
}
}
}
}
}
Report Summary Schema
{
"type": "object",
"properties": {
"report_title": { "type": "string" },
"date": { "type": "string", "format": "date" },
"key_metrics": {
"type": "array",
"items": {
"type": "object",
"properties": {
"metric_name": { "type": "string" },
"value": { "type": "string" },
"trend": { "type": "string" }
}
}
},
"executive_summary": { "type": "string" },
"action_items": {
"type": "array",
"items": { "type": "string" }
}
}
}
Common Mistakes
1. Too Many Fields
Don't try to extract everything at once. Start with the fields you'll actually use. You can always expand the schema later.
Too much:
{
"vendor_name": {}, "vendor_address": {}, "vendor_phone": {},
"vendor_fax": {}, "vendor_email": {}, "vendor_website": {},
"vendor_tax_id": {}, "vendor_bank_name": {}, "vendor_iban": {},
"vendor_bic": {}, "vendor_contact_person": {}, ...
}
Right-sized:
{
"vendor": {
"type": "object",
"properties": {
"name": { "type": "string" },
"vat_id": { "type": "string" }
}
}
}
2. Generic Field Names
Avoid names like data, info, value, or item. These don't give the extraction engine enough context.
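A quick way to catch such names is to scan the schema before you submit it. The following sketch flags the generic names listed above plus numbered placeholders such as field1; generic_field_names is a hypothetical helper. Treat its output as a prompt for review rather than an error list, since a short name nested under a descriptive parent can still carry enough context.

```python
import re

GENERIC_NAMES = {"data", "info", "value", "item"}


def generic_field_names(schema, path=""):
    """Return paths of properties with generic or numbered placeholder names."""
    flagged = []
    for name, sub_schema in schema.get("properties", {}).items():
        full = f"{path}.{name}" if path else name
        lowered = name.lower()
        if lowered in GENERIC_NAMES or re.fullmatch(r"field_?\d+", lowered):
            flagged.append(full)
        # Descend into nested objects.
        flagged += generic_field_names(sub_schema, full)
        # Descend into array element schemas, marking the path with [].
        if isinstance(sub_schema.get("items"), dict):
            flagged += generic_field_names(sub_schema["items"], full + "[]")
    return flagged
```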
3. Wrong Data Types
Using string for amounts means you'll get "1,234.56" instead of 1234.56. Use number for anything you'll do math with.
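For comparison, this is the kind of cleanup you end up writing when an amount comes back as a string. The sketch assumes US-style formatting (comma as thousands separator, dot as decimal point); a European-style value such as "1.234,56" would be parsed incorrectly. parse_amount is a hypothetical helper.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Parse a US-style amount such as "1,234.56" into a Decimal, or return None."""
    # Assumption: commas are thousands separators and can simply be dropped.
    cleaned = text.strip().replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


print(parse_amount("1,234.56"))  # 1234.56
```

Declaring the field as number in the schema moves this ambiguity out of your code.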
4. Flat Structure for Complex Data
If a document has grouped information (e.g., vendor details and customer details), use nested objects. A flat schema with vendor_name, vendor_address, customer_name, customer_address works, but nested objects produce cleaner, more maintainable output.
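If you already have results from a flat schema, you can regroup them while you migrate. The sketch below nests keys by prefix; nest_by_prefix is a hypothetical helper and the sample values are invented.

```python
def nest_by_prefix(flat, prefixes):
    """Group keys like vendor_name under {"vendor": {"name": ...}}."""
    nested = {}
    for key, value in flat.items():
        for prefix in prefixes:
            if key.startswith(prefix + "_"):
                # Strip "<prefix>_" and store the remainder under the group.
                nested.setdefault(prefix, {})[key[len(prefix) + 1:]] = value
                break
        else:
            # No prefix matched: keep the field at the top level.
            nested[key] = value
    return nested


flat = {
    "vendor_name": "Acme GmbH",
    "vendor_address": "1 Main St",
    "customer_name": "Jane Doe",
    "total": 10,
}
print(nest_by_prefix(flat, ["vendor", "customer"]))
```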
Schema Design Workflow
- Look at a sample document — What data does it contain?
- Identify the fields you need — Not everything, just what your workflow requires
- Group related fields — Vendor info, line items, totals
- Choose appropriate types — String for text, number for amounts, arrays for lists
- Test with real documents — Try your schema in the Playground
- Iterate — Refine field names and structure based on results
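When testing and iterating, it helps to measure how often each field is actually filled across a batch of sample results. The sketch below computes a per-field fill rate. It assumes that a field the engine could not find comes back missing, null, or empty, which you should verify against your own results; fill_rates is a hypothetical helper.

```python
def fill_rates(results, field_names):
    """Return the fraction of results in which each top-level field is non-empty."""
    rates = {}
    for name in field_names:
        # Missing keys, None, empty strings, and empty containers count as unfilled.
        filled = sum(1 for result in results if result.get(name) not in (None, "", [], {}))
        rates[name] = filled / len(results) if results else 0.0
    return rates


samples = [
    {"vendor_name": "Acme GmbH", "due_date": None},
    {"vendor_name": "Globex", "due_date": "2025-01-15"},
]
print(fill_rates(samples, ["vendor_name", "due_date"]))
```

Fields with a low fill rate are the first candidates for a more descriptive name or a different structure.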
Try It Now
Head to the Playground to design and test schemas interactively. Upload a document, define your schema, and see results in seconds.
For the full list of JSON Schema features that Smole supports, see the API documentation.
