JSON Schema Guide for Document Extraction

February 7, 2026 | Smole Team

Your JSON Schema is the most important part of the extraction process. It tells the extraction engine exactly what data to look for and how to structure the output. A well-designed schema gets you accurate, consistent results. A poorly designed one leads to missed fields and messy data.

This guide covers how schemas drive extraction, the basic building blocks (field names, types, and formats), ways to structure a schema, real-world patterns, and common mistakes.

How Schemas Work in Extraction

When you submit a document with a schema, the extraction engine:

  1. Reads the document content
  2. Looks at your schema to understand what data you want
  3. Identifies relevant information in the document
  4. Maps that information to your schema fields
  5. Returns structured JSON matching your schema exactly

The field names, types, and structure in your schema all serve as hints. The more descriptive and well-structured your schema, the better the results.
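Since the output is expected to mirror the schema, it can be useful to sanity-check a result on your side before using it. The following is a minimal sketch in Python, not a full JSON Schema validator: it only understands the type, properties, and items keywords used in this guide, and the conforms helper is a hypothetical name rather than part of any Smole API.

```python
# Minimal structural check: does a result match the schema's types and nesting?
# Covers only the keywords used in this guide (type, properties, items).

JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def conforms(value, schema):
    """Return True if value matches the schema's type, properties, and items."""
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected in ("number", "integer") and isinstance(value, bool):
            return False
        # Note: a float such as 2.0 is rejected for "integer" here, which is
        # stricter than JSON Schema itself.
        if not isinstance(value, JSON_TYPES[expected]):
            return False
    if isinstance(value, dict):
        for name, subschema in schema.get("properties", {}).items():
            # Fields absent from the result are skipped, not treated as errors.
            if name in value and not conforms(value[name], subschema):
                return False
    if isinstance(value, list) and "items" in schema:
        return all(conforms(item, schema["items"]) for item in value)
    return True
```

For example, conforms({"total_amount": "12.5"}, schema) returns False when the schema declares total_amount as a number, which catches a string sneaking in where arithmetic is expected.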

Schema Basics

Field Names Matter

The extraction engine uses field names as context. Compare:

// Vague — the engine has less context
{
  "field1": { "type": "string" },
  "field2": { "type": "number" },
  "items": { "type": "array" }
}

// Descriptive — the engine knows what to look for
{
  "vendor_name": { "type": "string" },
  "total_amount": { "type": "number" },
  "line_items": { "type": "array" }
}

Use names that clearly describe the data: invoice_date, shipping_address, unit_price, employee_id.

Data Types

Use the right type for each field:

Type      Use For                        Example
-------   ----------------------------   -----------------------------------
string    Text, names, IDs, dates        "vendor_name": { "type": "string" }
number    Amounts, prices, percentages   "total": { "type": "number" }
integer   Counts, quantities             "quantity": { "type": "integer" }
boolean   Yes/no values                  "is_paid": { "type": "boolean" }
array     Lists, repeating items         "line_items": { "type": "array" }
object    Grouped, related fields        "vendor": { "type": "object" }

Format Hints

Add a format keyword to string fields that hold dates, email addresses, or URLs:

{
  "invoice_date": { "type": "string", "format": "date" },
  "vendor_email": { "type": "string", "format": "email" },
  "company_website": { "type": "string", "format": "uri" }
}

The format: "date" hint tells the extraction engine to return dates in ISO 8601 format (2025-01-15) rather than whatever format appears in the document ("January 15, 2025" or "15.01.2025").
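To make the normalization concrete, here is a small Python sketch of what converting those two example strings to ISO 8601 looks like. It illustrates the target format only and says nothing about how the extraction engine works internally. The to_iso_date helper and the list of input patterns are assumptions made for this example, and month names are parsed according to the default (English) locale.

```python
from datetime import datetime

# Input patterns to try, in order. These are illustrative; extend as needed.
KNOWN_FORMATS = ["%Y-%m-%d", "%B %d, %Y", "%d.%m.%Y"]


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no known pattern matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(to_iso_date("January 15, 2025"))  # 2025-01-15
print(to_iso_date("15.01.2025"))        # 2025-01-15
```

With format: "date" in the schema, this kind of cleanup should not be needed in your own code, because every date arrives in one predictable shape.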

Structuring Your Schema

Flat Schemas

For simple documents with a handful of fields:

{
  "type": "object",
  "properties": {
    "document_title": { "type": "string" },
    "author": { "type": "string" },
    "date": { "type": "string", "format": "date" },
    "summary": { "type": "string" }
  }
}

Best for: memos, simple letters, certificates.

Nested Objects

Group related fields together:

{
  "type": "object",
  "properties": {
    "vendor": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "address": { "type": "string" },
        "tax_id": { "type": "string" }
      }
    },
    "customer": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "address": { "type": "string" }
      }
    }
  }
}

Nesting helps the extraction engine understand relationships. It knows that name under vendor refers to the vendor's name, not the customer's.

Arrays for Repeating Data

Use arrays when a document contains a list of similar items:

{
  "type": "object",
  "properties": {
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" },
          "total": { "type": "number" }
        }
      }
    }
  }
}

Arrays work for: invoice line items, contract clauses, employee certifications, table rows, list entries.

Arrays of Strings

For simple lists:

{
  "type": "object",
  "properties": {
    "key_findings": {
      "type": "array",
      "items": { "type": "string" }
    },
    "recommendations": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}

Real-World Schema Patterns

Invoice Schema

{
  "type": "object",
  "properties": {
    "vendor": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "vat_id": { "type": "string" }
      }
    },
    "invoice_number": { "type": "string" },
    "date": { "type": "string", "format": "date" },
    "due_date": { "type": "string", "format": "date" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" },
          "total": { "type": "number" }
        }
      }
    },
    "subtotal": { "type": "number" },
    "tax_amount": { "type": "number" },
    "total": { "type": "number" }
  }
}
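Because this schema captures both the line items and the totals, an extracted invoice can be cross-checked arithmetically. The sketch below is one way to do that in Python, assuming every field in the schema is present in the result and that the invoice has no discounts or per-line taxes. The check_invoice_totals helper and the 0.01 tolerance are choices made for this example.

```python
from decimal import Decimal


def check_invoice_totals(invoice, tolerance=Decimal("0.01")):
    """Return a list of human-readable arithmetic mismatches (empty if none)."""
    problems = []

    def dec(value):
        # Go through str() so floats like 0.1 do not carry binary noise.
        return Decimal(str(value))

    line_sum = Decimal("0")
    for index, item in enumerate(invoice.get("line_items", [])):
        expected = dec(item["quantity"]) * dec(item["unit_price"])
        if abs(expected - dec(item["total"])) > tolerance:
            problems.append(f"line {index}: quantity * unit_price != total")
        line_sum += dec(item["total"])

    subtotal = dec(invoice["subtotal"])
    if abs(line_sum - subtotal) > tolerance:
        problems.append("line item totals do not add up to subtotal")

    grand_total = subtotal + dec(invoice["tax_amount"])
    if abs(grand_total - dec(invoice["total"])) > tolerance:
        problems.append("subtotal + tax_amount != total")
    return problems
```

An empty list means the numbers are internally consistent. A non-empty list is a prompt to review the document or the schema, not proof of an extraction error, since real invoices sometimes include discounts or rounding that this check does not model.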

Personnel Record Schema

{
  "type": "object",
  "properties": {
    "employee": {
      "type": "object",
      "properties": {
        "full_name": { "type": "string" },
        "employee_id": { "type": "string" },
        "department": { "type": "string" },
        "position": { "type": "string" },
        "start_date": { "type": "string", "format": "date" }
      }
    },
    "compensation": {
      "type": "object",
      "properties": {
        "annual_salary": { "type": "number" },
        "currency": { "type": "string" },
        "pay_grade": { "type": "string" }
      }
    },
    "certifications": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "year": { "type": "integer" }
        }
      }
    }
  }
}

Report Summary Schema

{
  "type": "object",
  "properties": {
    "report_title": { "type": "string" },
    "date": { "type": "string", "format": "date" },
    "key_metrics": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "metric_name": { "type": "string" },
          "value": { "type": "string" },
          "trend": { "type": "string" }
        }
      }
    },
    "executive_summary": { "type": "string" },
    "action_items": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}

Common Mistakes

1. Too Many Fields

Don't try to extract everything at once. Start with the fields you'll actually use. You can always expand the schema later.

Too much:

{
  "vendor_name": {}, "vendor_address": {}, "vendor_phone": {},
  "vendor_fax": {}, "vendor_email": {}, "vendor_website": {},
  "vendor_tax_id": {}, "vendor_bank_name": {}, "vendor_iban": {},
  "vendor_bic": {}, "vendor_contact_person": {}, ...
}

Right-sized:

{
  "vendor": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "vat_id": { "type": "string" }
    }
  }
}

2. Generic Field Names

Avoid names like data, info, value, or item. These don't give the extraction engine enough context.
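A quick lint pass can catch such names before a schema reaches production. The following Python sketch walks a schema and reports property names from the list above, plus numbered placeholders such as field1. The find_generic_names helper and the placeholder pattern are assumptions made for this example.

```python
import re

# Names this guide calls out, plus numbered placeholders such as "field1".
GENERIC_NAMES = {"data", "info", "value", "item"}
PLACEHOLDER = re.compile(r"^(field|col|column|item|value)_?\d+$")


def find_generic_names(schema, path=""):
    """Return dotted paths of properties whose names carry little context."""
    flagged = []
    for name, subschema in schema.get("properties", {}).items():
        full = f"{path}.{name}" if path else name
        lowered = name.lower()
        if lowered in GENERIC_NAMES or PLACEHOLDER.match(lowered):
            flagged.append(full)
        # Recurse into nested objects and arrays.
        flagged.extend(find_generic_names(subschema, full))
    # Descend into array item schemas as well.
    if "items" in schema:
        flagged.extend(find_generic_names(schema["items"], path + "[]"))
    return flagged
```

Treat a flagged name as a prompt to review rather than a hard error: a short name nested under a descriptive parent may already carry enough context.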

3. Wrong Data Types

Using string for amounts means you'll get "1,234.56" instead of 1234.56. Use number for anything you'll do math with.
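If an amount field is typed as string, the cleanup falls to your own code. Here is a minimal Python sketch of that cleanup, assuming US-style formatting with a comma as the thousands separator. The parse_amount helper is hypothetical and written only for this example.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Convert a US-style amount string such as "1,234.56" to a Decimal."""
    cleaned = text.strip().replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


print(parse_amount("1,234.56"))  # 1234.56
print(parse_amount("1.234,56"))  # 1.23456 (European formatting is misread)
```

The second call shows the risk: a European-style amount is silently parsed as a much smaller number. Declaring the field as number in the schema avoids having to guess the formatting convention yourself.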

4. Flat Structure for Complex Data

If a document has grouped information (e.g., vendor details and customer details), use nested objects. A flat schema with vendor_name, vendor_address, customer_name, customer_address works, but nested objects produce cleaner, more maintainable output.
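To see how the two shapes relate, the Python sketch below regroups a flat result into the nested form by key prefix. It is meant as an illustration, for instance when migrating code that already consumes flat output. The nest_by_prefix helper and its prefixes argument are assumptions made for this example.

```python
def nest_by_prefix(flat, prefixes):
    """Group keys like "vendor_name" under {"vendor": {"name": ...}}."""
    nested = {}
    for key, value in flat.items():
        for prefix in prefixes:
            marker = prefix + "_"
            if key.startswith(marker):
                # Strip the prefix and file the value under its group.
                nested.setdefault(prefix, {})[key[len(marker):]] = value
                break
        else:
            # Keys without a known prefix stay at the top level.
            nested[key] = value
    return nested


flat = {
    "vendor_name": "Acme GmbH",
    "vendor_address": "1 Main St",
    "customer_name": "Jane Doe",
    "invoice_number": "INV-1",
}
print(nest_by_prefix(flat, ["vendor", "customer"]))
```

Defining the nested structure in the schema itself makes this post-processing step unnecessary.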

Schema Design Workflow

  1. Look at a sample document — What data does it contain?
  2. Identify the fields you need — Not everything, just what your workflow requires
  3. Group related fields — Vendor info, line items, totals
  4. Choose appropriate types — String for text, number for amounts, arrays for lists
  5. Test with real documents — Try your schema in the Playground
  6. Iterate — Refine field names and structure based on results
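When you iterate on a schema many times, building it in code can be less error-prone than hand-editing deeply nested JSON. The Python sketch below covers steps 3 and 4 with three tiny helpers. The names field, obj, and arr are hypothetical and written for this example; the output is ordinary JSON Schema that you can paste wherever a schema is accepted.

```python
import json


def field(type_name, **extra):
    """A leaf field, e.g. field("string", format="date")."""
    return {"type": type_name, **extra}


def obj(**properties):
    """An object with the given named properties."""
    return {"type": "object", "properties": properties}


def arr(items):
    """An array whose elements all follow the items schema."""
    return {"type": "array", "items": items}


schema = obj(
    vendor=obj(name=field("string"), vat_id=field("string")),
    invoice_number=field("string"),
    date=field("string", format="date"),
    line_items=arr(
        obj(
            description=field("string"),
            quantity=field("number"),
            unit_price=field("number"),
            total=field("number"),
        )
    ),
    total=field("number"),
)

print(json.dumps(schema, indent=2))
```

Renaming a field or regrouping a set of fields then becomes a one-line change, which keeps the refine-and-retest loop short.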

Try It Now

Head to the Playground to design and test schemas interactively. Upload a document, define your schema, and see results in seconds.

For the full JSON Schema specification supported by Smole, see the API documentation.