JSON Schema Guide for Document Extraction

Your JSON Schema is the most important part of the extraction process. It tells the extraction engine exactly what data to look for and how to structure the output. A well-designed schema gets you accurate, consistent results. A poorly designed one leads to missed fields and messy data.

This guide covers field naming, data types, format hints, schema structure, real-world patterns, and the most common design mistakes.

How Schemas Work in Extraction

When you submit a document with a schema, the extraction engine:

1. Reads the document content
2. Looks at your schema to understand what data you want
3. Identifies relevant information in the document
4. Maps that information to your schema fields
5. Returns structured JSON matching your schema exactly

The field names, types, and structure in your schema all serve as hints. The more descriptive and well-structured your schema, the better the results.

Schema Basics

Field Names Matter

The extraction engine uses field names as context. Compare:

// Vague — the engine has less context
{
  "field1": { "type": "string" },
  "field2": { "type": "number" },
  "items": { "type": "array" }
}

// Descriptive — the engine knows what to look for
{
  "vendor_name": { "type": "string" },
  "total_amount": { "type": "number" },
  "line_items": { "type": "array" }
}

Use names that clearly describe the data: `invoice_date`, `shipping_address`, `unit_price`, `employee_id`.

Data Types

Use the right type for each field:

Type	Use For	Example
`string`	Text, names, IDs, dates	`"vendor_name": { "type": "string" }`
`number`	Amounts, prices, percentages	`"total": { "type": "number" }`
`integer`	Counts, quantities	`"quantity": { "type": "integer" }`
`boolean`	Yes/no values	`"is_paid": { "type": "boolean" }`
`array`	Lists, repeating items	`"line_items": { "type": "array" }`
`object`	Grouped, related fields	`"vendor": { "type": "object" }`
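
To make the table concrete, here is a minimal Python sketch of how each JSON Schema type shows up once the returned JSON has been parsed with `json.loads`. The `matches_type` helper is hypothetical and not part of any extraction API. It also simplifies the rules: a float such as `3.0` is not treated as an integer here.

```python
# Maps JSON Schema type names to the Python types that json.loads() produces.
JSON_TO_PYTHON = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def matches_type(value, json_type):
    """Return True if a parsed JSON value has the given JSON Schema type."""
    # bool is a subclass of int in Python, so handle it before the numeric types.
    if isinstance(value, bool):
        return json_type == "boolean"
    return isinstance(value, JSON_TO_PYTHON[json_type])
```

For example, `matches_type(1234.56, "number")` is true, while `matches_type("1,234.56", "number")` is false because the value is still text.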

Format Hints

Add format to string fields for specific data types:

{
  "invoice_date": { "type": "string", "format": "date" },
  "vendor_email": { "type": "string", "format": "email" },
  "company_website": { "type": "string", "format": "uri" }
}

The `"format": "date"` hint tells the extraction engine to return dates in ISO 8601 format (2025-01-15) rather than whatever format appears in the document ("January 15, 2025" or "15.01.2025").
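
The Python sketch below illustrates what that normalization means for the two example dates. It does not describe how the extraction engine works internally. The `to_iso_date` helper is hypothetical, and the `%B` directive assumes an English locale for month names.

```python
from datetime import date, datetime


def to_iso_date(text, source_format):
    """Parse a date written in source_format and return it as YYYY-MM-DD."""
    return datetime.strptime(text, source_format).date().isoformat()


# Both document-style dates normalize to the same ISO 8601 string.
long_form = to_iso_date("January 15, 2025", "%B %d, %Y")
dotted_form = to_iso_date("15.01.2025", "%d.%m.%Y")

# An ISO 8601 date needs no format string on the consuming side.
parsed = date.fromisoformat("2025-01-15")
```

With ISO dates in the output, downstream code can call `date.fromisoformat` directly instead of guessing which format each document used.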

Structuring Your Schema

Flat Schemas

For simple documents with a handful of fields:

{
  "type": "object",
  "properties": {
    "document_title": { "type": "string" },
    "author": { "type": "string" },
    "date": { "type": "string", "format": "date" },
    "summary": { "type": "string" }
  }
}

Best for: memos, simple letters, certificates.

Nested Objects

Group related fields together:

{
  "type": "object",
  "properties": {
    "vendor": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "address": { "type": "string" },
        "tax_id": { "type": "string" }
      }
    },
    "customer": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "address": { "type": "string" }
      }
    }
  }
}

Nesting helps the extraction engine understand relationships. It knows that `name` under `vendor` refers to the vendor's name, not the customer's.
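
On the consuming side, nested output keeps the two `name` fields apart. The Python sketch below uses made-up sample values and a hypothetical `get_path` helper. It returns `None` for an absent key, but the guide does not say how missing fields are represented, so treat that as an assumption.

```python
# Hypothetical extraction result shaped like the nested schema above.
result = {
    "vendor": {"name": "Acme Supplies", "address": "1 Main St", "tax_id": "X-123"},
    "customer": {"name": "Globex", "address": "9 Side Rd"},
}


def get_path(data, *keys):
    """Follow keys into nested dicts; return None if any level is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data


vendor_name = get_path(result, "vendor", "name")
customer_name = get_path(result, "customer", "name")
```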

Arrays for Repeating Data

Use arrays when a document contains a list of similar items:

{
  "type": "object",
  "properties": {
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" },
          "total": { "type": "number" }
        }
      }
    }
  }
}

Arrays work for: invoice line items, contract clauses, employee certifications, table rows, list entries.
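
Because every array element has the same shape, the output is easy to loop over. As a rough Python sketch, the hypothetical helper below flags line items whose `total` does not equal `quantity * unit_price`. The sample values and the half-cent tolerance are assumptions.

```python
import math

# Hypothetical extracted line items; the second total is deliberately wrong.
line_items = [
    {"description": "Widget", "quantity": 2, "unit_price": 19.99, "total": 39.98},
    {"description": "Gadget", "quantity": 1, "unit_price": 5.00, "total": 5.50},
]


def inconsistent_items(items, tolerance=0.005):
    """Return descriptions of items whose total differs from quantity * unit_price."""
    return [
        item["description"]
        for item in items
        if not math.isclose(
            item["quantity"] * item["unit_price"], item["total"], abs_tol=tolerance
        )
    ]


flagged = inconsistent_items(line_items)
```

A check like this can catch rows where the extraction, or the source document itself, is inconsistent.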

Arrays of Strings

For simple lists:

{
  "type": "object",
  "properties": {
    "key_findings": {
      "type": "array",
      "items": { "type": "string" }
    },
    "recommendations": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}

Real-World Schema Patterns

Invoice Schema

{
  "type": "object",
  "properties": {
    "vendor": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "vat_id": { "type": "string" }
      }
    },
    "invoice_number": { "type": "string" },
    "date": { "type": "string", "format": "date" },
    "due_date": { "type": "string", "format": "date" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" },
          "total": { "type": "number" }
        }
      }
    },
    "subtotal": { "type": "number" },
    "tax_amount": { "type": "number" },
    "total": { "type": "number" }
  }
}

Personnel Record Schema

{
  "type": "object",
  "properties": {
    "employee": {
      "type": "object",
      "properties": {
        "full_name": { "type": "string" },
        "employee_id": { "type": "string" },
        "department": { "type": "string" },
        "position": { "type": "string" },
        "start_date": { "type": "string", "format": "date" }
      }
    },
    "compensation": {
      "type": "object",
      "properties": {
        "annual_salary": { "type": "number" },
        "currency": { "type": "string" },
        "pay_grade": { "type": "string" }
      }
    },
    "certifications": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "year": { "type": "integer" }
        }
      }
    }
  }
}

Report Summary Schema

{
  "type": "object",
  "properties": {
    "report_title": { "type": "string" },
    "date": { "type": "string", "format": "date" },
    "key_metrics": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "metric_name": { "type": "string" },
          "value": { "type": "string" },
          "trend": { "type": "string" }
        }
      }
    },
    "executive_summary": { "type": "string" },
    "action_items": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}

Common Mistakes

1. Too Many Fields

Don't try to extract everything at once. Start with the fields you'll actually use. You can always expand the schema later.

Too much:

{
  "vendor_name": {}, "vendor_address": {}, "vendor_phone": {},
  "vendor_fax": {}, "vendor_email": {}, "vendor_website": {},
  "vendor_tax_id": {}, "vendor_bank_name": {}, "vendor_iban": {},
  "vendor_bic": {}, "vendor_contact_person": {}, ...
}

Right-sized:

{
  "vendor": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "vat_id": { "type": "string" }
    }
  }
}

2. Generic Field Names

Avoid names like `data`, `info`, `value`, or `item`. These don't give the extraction engine enough context.
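
A quick lint pass can catch such names before a schema is used. This is a minimal Python sketch. The `find_generic_names` helper is hypothetical, and the name list simply mirrors the examples in this guide plus numbered names such as `field1`.

```python
import re

GENERIC = {"data", "info", "value", "item"}
NUMBERED = re.compile(r"^field\d+$")  # matches field1, field2, ...


def find_generic_names(schema, path=""):
    """Return dotted paths of schema properties that have uninformative names."""
    found = []
    for name, sub in schema.get("properties", {}).items():
        full = f"{path}.{name}" if path else name
        if name in GENERIC or NUMBERED.match(name):
            found.append(full)
        found.extend(find_generic_names(sub, full))
    if "items" in schema:  # descend into the element schema of an array
        found.extend(find_generic_names(schema["items"], path + "[]"))
    return found


schema = {
    "type": "object",
    "properties": {
        "field1": {"type": "string"},
        "vendor_name": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "info": {"type": "string"},
                    "unit_price": {"type": "number"},
                },
            },
        },
    },
}

generic = find_generic_names(schema)
```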

3. Wrong Data Types

Using `string` for amounts means you'll get "1,234.56" instead of 1234.56. Use `number` for anything you'll do math with.
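
The Python sketch below shows the cleanup that a string-typed amount forces on downstream code. The `parse_amount` helper is hypothetical and assumes `,` is a thousands separator. A European-style value such as "1.234,56" would need different handling, and that ambiguity is a further reason to request a `number`.

```python
def parse_amount(text):
    """Convert a string like "1,234.56" to a float, treating ',' as a thousands separator."""
    return float(text.replace(",", ""))


as_string = "1,234.56"
as_number = parse_amount(as_string)

doubled_number = as_number * 2   # arithmetic works on the number
doubled_string = as_string * 2   # on a string, * repeats the text instead
```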

4. Flat Structure for Complex Data

If a document has grouped information (e.g., vendor details and customer details), use nested objects. A flat schema with `vendor_name`, `vendor_address`, `customer_name`, `customer_address` works, but nested objects produce cleaner, more maintainable output.
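
To show how the two shapes relate, this Python sketch regroups a flat result by key prefix. The `nest_by_prefix` helper and the sample values are hypothetical. With a nested schema the output arrives in this shape already, so no such step is needed.

```python
def nest_by_prefix(flat, groups):
    """Move keys such as vendor_name into nested dicts such as {"vendor": {"name": ...}}."""
    nested = {}
    for key, value in flat.items():
        for group in groups:
            prefix = group + "_"
            if key.startswith(prefix):
                nested.setdefault(group, {})[key[len(prefix):]] = value
                break
        else:
            # No group prefix matched, so keep the key at the top level.
            nested[key] = value
    return nested


flat_result = {
    "vendor_name": "Acme Supplies",
    "vendor_address": "1 Main St",
    "customer_name": "Globex",
    "total": 10,
}

nested_result = nest_by_prefix(flat_result, ["vendor", "customer"])
```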

Schema Design Workflow

1. Look at a sample document: what data does it contain?
2. Identify the fields you need: not everything, just what your workflow requires.
3. Group related fields: vendor info, line items, totals.
4. Choose appropriate types: string for text, number for amounts, arrays for lists.
5. Test with real documents: try your schema in the Playground.
6. Iterate: refine field names and structure based on results.

Try It Now

Head to the Playground to design and test schemas interactively. Upload a document, define your schema, and see results in seconds.

For the full JSON Schema specification supported by Smole, see the API documentation.
