
Building Effective JSON Schemas for Invoice Extraction

January 25, 2024 · Smole Team

The quality of your JSON schema directly impacts extraction accuracy. A well-designed schema acts as a guide for the AI, helping it understand exactly what data you need and where to find it.

Schema Design Principles

1. Use Descriptive Field Names

Field names serve as hints for the AI. Compare these two approaches:

Less effective:

{
  "f1": "string",
  "f2": "number",
  "f3": "string"
}

More effective:

{
  "vendor_name": "string",
  "total_amount": "number",
  "invoice_date": "string"
}

The AI uses these names to work out what you're looking for: a name like total_amount points it at a specific value on the page, while f2 tells it nothing.

2. Match Your Document Structure

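Mirror the layout of the document. If your invoices have line items, give the schema a line_items array, and group related fields such as vendor details into a nested object: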

{
  "invoice_number": "string",
  "vendor": {
    "name": "string",
    "address": "string",
    "tax_id": "string"
  },
  "line_items": [
    {
      "description": "string",
      "quantity": "number",
      "unit_price": "number",
      "total": "number"
    }
  ],
  "subtotal": "number",
  "tax": "number",
  "total": "number"
}
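
A schema that mirrors the invoice also makes the result easy to sanity-check. The sketch below is one possible post-processing step, not part of the extraction itself. It assumes the result has already been parsed into a Python dict with the field names shown above, that every numeric field is present, and that the totals relate as quantity × unit_price = line total, sum of line totals = subtotal, and subtotal + tax = total. The function name and the tolerance are hypothetical.

```python
from math import isclose


def check_invoice_totals(invoice: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of arithmetic problems; an empty list means the numbers agree.

    Assumes every numeric field is present (not null).
    """
    problems = []
    items = invoice.get("line_items") or []

    # Each line total should equal quantity * unit_price.
    for i, item in enumerate(items):
        expected = item["quantity"] * item["unit_price"]
        if not isclose(expected, item["total"], abs_tol=tolerance):
            problems.append(f"line_items[{i}]: expected {expected}, got {item['total']}")

    # The subtotal should equal the sum of the line totals.
    line_sum = sum(item["total"] for item in items)
    if not isclose(line_sum, invoice["subtotal"], abs_tol=tolerance):
        problems.append(f"subtotal: expected {line_sum}, got {invoice['subtotal']}")

    # The grand total should equal subtotal + tax.
    expected_total = invoice["subtotal"] + invoice["tax"]
    if not isclose(expected_total, invoice["total"], abs_tol=tolerance):
        problems.append(f"total: expected {expected_total}, got {invoice['total']}")

    return problems
```

A non-empty list does not necessarily mean the extraction was wrong (the invoice itself may contain an error or a discount the schema does not capture), but it is a useful signal for manual review.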

3. Be Specific About Data Types

  • Use "number" for amounts, quantities, and percentages
  • Use "string" for text, dates, and IDs
  • Use arrays [] for repeating items
  • Use nested objects {} for grouped data
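
The four rules above can also be checked mechanically on the output. The following is a minimal sketch of a recursive checker that compares an extracted result with a template written in the style used in this post. It assumes arrays in the template hold exactly one example element, and it treats null as acceptable because missing fields come back as null. The `validate` function is a hypothetical helper, not part of any API.

```python
def validate(template, value, path="$") -> list[str]:
    """Compare an extracted value with a template; return a list of mismatches."""
    if value is None:
        return []  # missing fields are returned as null, so null is allowed

    if template == "string":
        if not isinstance(value, str):
            return [f"{path}: expected string, got {type(value).__name__}"]
    elif template == "number":
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return [f"{path}: expected number, got {type(value).__name__}"]
    elif isinstance(template, list):
        if not isinstance(value, list):
            return [f"{path}: expected array, got {type(value).__name__}"]
        errors = []
        for i, item in enumerate(value):
            # template[0] describes every element of the array
            errors += validate(template[0], item, f"{path}[{i}]")
        return errors
    elif isinstance(template, dict):
        if not isinstance(value, dict):
            return [f"{path}: expected object, got {type(value).__name__}"]
        errors = []
        for key, sub_template in template.items():
            errors += validate(sub_template, value.get(key), f"{path}.{key}")
        return errors
    return []
```

A common mismatch this catches is an amount returned as the string "12.50" where the template asks for a number.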

Common Invoice Schema Patterns

Basic Invoice

For simple extraction needs:

{
  "invoice_number": "string",
  "date": "string",
  "vendor_name": "string",
  "total_amount": "number"
}
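
On the consuming side, a flat schema like this maps naturally onto a small typed record. The sketch below assumes the extraction result arrives as a JSON string with the four keys above. The `BasicInvoice` class and `parse_basic_invoice` function are hypothetical names used for illustration, and every field is optional because a field the AI cannot find may be null.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BasicInvoice:
    invoice_number: Optional[str]
    date: Optional[str]
    vendor_name: Optional[str]
    total_amount: Optional[float]


def parse_basic_invoice(raw_json: str) -> BasicInvoice:
    """Load an extraction result, ignoring unknown keys and defaulting missing ones to None."""
    data = json.loads(raw_json)
    return BasicInvoice(**{f.name: data.get(f.name) for f in fields(BasicInvoice)})
```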

Detailed Invoice

For comprehensive extraction:

{
  "invoice_number": "string",
  "invoice_date": "string",
  "due_date": "string",
  "vendor": {
    "name": "string",
    "address": "string",
    "phone": "string",
    "email": "string"
  },
  "customer": {
    "name": "string",
    "address": "string"
  },
  "line_items": [
    {
      "sku": "string",
      "description": "string",
      "quantity": "number",
      "unit_price": "number",
      "discount": "number",
      "total": "number"
    }
  ],
  "subtotal": "number",
  "discount_total": "number",
  "tax_rate": "number",
  "tax_amount": "number",
  "shipping": "number",
  "total": "number",
  "payment_terms": "string",
  "notes": "string"
}
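
Because invoice_date and due_date come back as strings, downstream code usually has to normalize them before doing anything with them. The sketch below tries a short list of formats in order. The format list is an assumption and should be adjusted to the invoices you actually receive; in particular, it reads 01/02/2024 as day/month/year, which is wrong for US-style dates. Both function names are hypothetical.

```python
from datetime import date, datetime
from typing import Optional

# Assumed formats; %B matches English month names under the default locale.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")


def parse_invoice_date(text: Optional[str]) -> Optional[date]:
    """Return a date for the first matching format, or None if nothing matches."""
    if not text:
        return None
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def days_until_due(invoice: dict) -> Optional[int]:
    """Days between invoice_date and due_date, or None if either is missing or unparseable."""
    issued = parse_invoice_date(invoice.get("invoice_date"))
    due = parse_invoice_date(invoice.get("due_date"))
    if issued is None or due is None:
        return None
    return (due - issued).days
```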

Tips for Better Accuracy

  1. Start minimal, then expand - Begin with essential fields and add more as needed
  2. Test with real documents - Use actual invoices from your workflow
  3. Handle variations - Some invoices may not have all fields
  4. Use consistent naming - Pick either snake_case or camelCase and use it for every field
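
The naming tip is easy to check automatically. This sketch walks a schema and groups its keys by style using a deliberately simple heuristic: a key with an underscore counts as snake_case, a key with an uppercase letter counts as camelCase, and single lowercase words such as total are treated as neutral. The `naming_styles` function is a hypothetical helper.

```python
def naming_styles(template) -> dict[str, list[str]]:
    """Collect schema keys by naming style so mixed conventions are easy to spot."""
    styles = {"snake_case": [], "camelCase": []}

    def walk(node):
        if isinstance(node, dict):
            for key, child in node.items():
                if "_" in key:
                    styles["snake_case"].append(key)
                elif key != key.lower():
                    styles["camelCase"].append(key)
                # single lowercase words fit both styles, so they are skipped
                walk(child)
        elif isinstance(node, list):
            for child in node:
                walk(child)

    walk(template)
    return styles
```

If both lists come back non-empty, the schema mixes conventions and is worth cleaning up before testing further.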

Handling Edge Cases

Not all invoices are created equal. Some tips:

  • Missing fields: The AI will return null for fields it can't find
  • Multiple formats: The same schema works across different invoice layouts
  • Handwritten notes: AI extraction generally handles handwriting better than traditional OCR, though results still depend on legibility
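
Since missing fields come back as null, it helps to list them explicitly before passing a result downstream. A minimal sketch, assuming the result has been parsed into Python dicts and lists (so null becomes None); the `missing_fields` function and its path format are hypothetical:

```python
def missing_fields(value, path: str = "") -> list[str]:
    """Return the path of every null field in an extraction result."""
    if value is None:
        return [path]
    found = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            found += missing_fields(child, child_path)
    elif isinstance(value, list):
        for i, child in enumerate(value):
            found += missing_fields(child, f"{path}[{i}]")
    return found
```

For example, a result whose vendor object lacks a tax ID would report the path vendor.tax_id, which can then be routed to manual review or filled from another source.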

Try It Yourself

Head to the Playground to test your schemas with real documents. You can iterate quickly and see exactly what the AI extracts.