How to Convert PDFs to JSON with an API

February 24, 2026 · Smole Team

PDFs are everywhere — invoices, contracts, reports, forms — but the data inside them is trapped. You can't query a PDF. You can't pipe it into a database. You need that data as structured JSON.

This guide covers how to convert PDFs to JSON using a REST API, from simple text-based PDFs to scanned documents that need OCR.

Why PDF to JSON?

PDFs are designed for display, not for data. That becomes a problem when you need to:

  • Pull line items from invoices into your accounting system
  • Extract terms and dates from contracts into a database
  • Feed report data into dashboards or analytics tools
  • Automate data entry from scanned forms

You need a way to turn that PDF into structured data your code can work with.

The Two Types of PDFs

Not all PDFs are equal. How you handle them depends on what's inside:

Digital (text-based) PDFs

These PDFs contain actual text data. You can select and copy text from them. They're created by software like Word, Google Docs, or reporting tools.

Processing: Text is extracted directly — fast and accurate.

Scanned PDFs (image-based)

These are essentially images wrapped in a PDF. They come from scanners, phone cameras, or fax machines. You can't select text in them.

Processing: OCR (Optical Character Recognition) converts the image to text first, then extraction happens on the OCR output.

Smole handles both types automatically. Upload any PDF, and the system detects whether OCR is needed.

Converting a PDF to JSON: Step by Step

1. Create a Schema

First, define what data you want to extract. A JSON Schema tells the API what fields to look for:

{
  "type": "object",
  "properties": {
    "document_title": { "type": "string" },
    "date": { "type": "string", "format": "date" },
    "author": { "type": "string" },
    "sections": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "heading": { "type": "string" },
          "content": { "type": "string" }
        }
      }
    }
  }
}

Register it via the API:

curl -X POST https://api.smole.tech/api/schemas \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "name": "report-schema", "schema": { ... } }'
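If you'd rather register the schema from JavaScript, the sketch below builds the same request as the curl call. It only constructs the request, so you can inspect it before sending anything. `buildSchemaRequest` is a hypothetical helper name, and the response format (including where the schema ID appears) isn't shown in this guide, so check the documentation for that part.

```javascript
// Hypothetical helper: builds the fetch() arguments for the schema
// registration call shown in the curl example.
function buildSchemaRequest(apiKey, name, schema) {
  return {
    url: 'https://api.smole.tech/api/schemas',
    options: {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json'
      },
      // Same body shape as the curl -d payload: { name, schema }.
      body: JSON.stringify({ name, schema })
    }
  };
}

// A trimmed version of the report schema from step 1.
const reportSchema = {
  type: 'object',
  properties: {
    document_title: { type: 'string' },
    date: { type: 'string', format: 'date' }
  }
};

const request = buildSchemaRequest('YOUR_API_KEY', 'report-schema', reportSchema);
// To send it: const response = await fetch(request.url, request.options);
```

Keeping the request builder separate from the network call makes it easy to log or unit test the payload without touching the API.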

2. Upload and Extract

Send your PDF to the pipeline endpoint:

curl -X POST https://api.smole.tech/api/pipeline/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@report.pdf" \
  -F "schemaId=YOUR_SCHEMA_ID"

This single call handles everything: conversion, OCR (if needed), and extraction.

3. Get Your JSON

Poll for the result, replacing PIPELINE_ID with the ID of the pipeline run you started in step 2:

curl https://api.smole.tech/api/pipeline/PIPELINE_ID \
  -H "Authorization: Bearer YOUR_API_KEY"
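In code, polling usually means repeating that request with a pause in between. Here is a minimal sketch of a generic polling helper. The `status` field and its `completed` value in `waitForPipeline` are assumptions for illustration, not taken from the API documentation, so adjust the check to whatever the real response contains.

```javascript
// Generic polling helper: calls fetchOnce() until isDone(result) is true
// or maxAttempts is reached, waiting intervalMs between attempts.
async function pollUntil(fetchOnce, isDone, { intervalMs = 2000, maxAttempts = 30 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await fetchOnce();
    if (isDone(result)) return result;
    if (attempt < maxAttempts) {
      await new Promise(resolve => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Pipeline not finished after ${maxAttempts} attempts`);
}

// Assumed usage: the "status" field and the "completed" value are guesses
// about the response shape, not documented behavior.
async function waitForPipeline(pipelineId, apiKey) {
  return pollUntil(
    async () => {
      const res = await fetch(`https://api.smole.tech/api/pipeline/${pipelineId}`, {
        headers: { 'Authorization': `Bearer ${apiKey}` }
      });
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return res.json();
    },
    body => body.status === 'completed'
  );
}
```

The attempt cap matters: without it, a pipeline that never finishes would keep your script polling forever. If the API reports a failed state, extend the `isDone` check so you stop early on failure too.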

You'll get back clean JSON matching your schema:

{
  "document_title": "Q4 Sales Report",
  "date": "2025-12-31",
  "author": "Finance Team",
  "sections": [
    {
      "heading": "Revenue Summary",
      "content": "Total revenue for Q4 was €2.4M, a 15% increase over Q3..."
    },
    {
      "heading": "Regional Breakdown",
      "content": "DACH region contributed 45% of total revenue..."
    }
  ]
}
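Before writing the result to a database, it can be worth checking that the values have the types your schema promises. The sketch below is a deliberately small checker, not a full JSON Schema validator: it only understands `type`, `properties`, and `items`, which covers the schemas in this guide. `findTypeErrors` is a hypothetical name; for production use, a dedicated validator library is the safer choice.

```javascript
// Minimal structural check, NOT a full JSON Schema validator. It only
// understands "type" (object, array, string, number, integer),
// "properties", and "items".
function findTypeErrors(schema, value, path = '$') {
  const errors = [];
  switch (schema.type) {
    case 'object':
      if (typeof value !== 'object' || value === null || Array.isArray(value)) {
        return [`${path}: expected object`];
      }
      for (const [key, sub] of Object.entries(schema.properties ?? {})) {
        // Absent fields are skipped: the schemas here mark nothing as required.
        if (key in value) errors.push(...findTypeErrors(sub, value[key], `${path}.${key}`));
      }
      break;
    case 'array':
      if (!Array.isArray(value)) return [`${path}: expected array`];
      value.forEach((item, i) => {
        if (schema.items) errors.push(...findTypeErrors(schema.items, item, `${path}[${i}]`));
      });
      break;
    case 'integer':
      if (!Number.isInteger(value)) errors.push(`${path}: expected integer`);
      break;
    case 'number':
      if (typeof value !== 'number' || Number.isNaN(value)) errors.push(`${path}: expected number`);
      break;
    case 'string':
      if (typeof value !== 'string') errors.push(`${path}: expected string`);
      break;
  }
  return errors;
}

const resultSchema = {
  type: 'object',
  properties: {
    document_title: { type: 'string' },
    sections: {
      type: 'array',
      items: {
        type: 'object',
        properties: { heading: { type: 'string' }, content: { type: 'string' } }
      }
    }
  }
};

const goodResult = {
  document_title: 'Q4 Sales Report',
  sections: [{ heading: 'Revenue Summary', content: 'Total revenue for Q4...' }]
};
const badResult = {
  document_title: 'Q4 Sales Report',
  sections: [{ heading: 'Revenue Summary', content: 42 }]
};

const goodErrors = findTypeErrors(resultSchema, goodResult); // []
const badErrors = findTypeErrors(resultSchema, badResult);   // ['$.sections[0].content: expected string']
```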

Handling Different PDF Types

Invoices

{
  "type": "object",
  "properties": {
    "vendor_name": { "type": "string" },
    "invoice_number": { "type": "string" },
    "date": { "type": "string", "format": "date" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" },
          "total": { "type": "number" }
        }
      }
    },
    "subtotal": { "type": "number" },
    "tax": { "type": "number" },
    "total": { "type": "number" }
  }
}
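Because the invoice schema captures both line items and totals, you can cross-check the extracted numbers against each other. The sketch below assumes a simple invoice where each line total is quantity times unit price and the grand total is subtotal plus tax; invoices with discounts or fees won't fit that model. `checkInvoiceTotals`, the half-cent tolerance, and the sample data are all made up for illustration.

```javascript
// Compares two amounts with a small tolerance to absorb floating-point
// rounding (0.005 is an assumed tolerance of half a cent).
function closeEnough(a, b, tolerance = 0.005) {
  return Math.abs(a - b) <= tolerance;
}

// Returns a list of arithmetic inconsistencies found in an extracted invoice.
function checkInvoiceTotals(invoice) {
  const problems = [];
  let lineSum = 0;
  invoice.line_items.forEach((item, i) => {
    if (!closeEnough(item.quantity * item.unit_price, item.total)) {
      problems.push(`line_items[${i}]: quantity * unit_price != total`);
    }
    lineSum += item.total;
  });
  if (!closeEnough(lineSum, invoice.subtotal)) {
    problems.push('sum of line totals != subtotal');
  }
  if (!closeEnough(invoice.subtotal + invoice.tax, invoice.total)) {
    problems.push('subtotal + tax != total');
  }
  return problems;
}

// Made-up sample data in the shape of the invoice schema.
const sampleInvoice = {
  vendor_name: 'Example GmbH',
  invoice_number: 'INV-001',
  date: '2026-01-15',
  line_items: [
    { description: 'Widget', quantity: 3, unit_price: 19.99, total: 59.97 },
    { description: 'Shipping', quantity: 1, unit_price: 5.0, total: 5.0 }
  ],
  subtotal: 64.97,
  tax: 12.34,
  total: 77.31
};

const invoiceProblems = checkInvoiceTotals(sampleInvoice); // [] when everything adds up
```

An empty list doesn't prove the extraction is right, but a non-empty one is a cheap signal that a document deserves a manual look.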

Contracts

{
  "type": "object",
  "properties": {
    "parties": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "role": { "type": "string" }
        }
      }
    },
    "effective_date": { "type": "string", "format": "date" },
    "term_months": { "type": "integer" },
    "governing_law": { "type": "string" },
    "key_obligations": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}
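Once `effective_date` and `term_months` come back as typed values, derived fields are easy to compute. As a small sketch, the hypothetical `contractEndDate` helper below adds the term to the effective date. It assumes the term ends on the same calendar day a number of months later, clamped to the end of shorter months; your contracts may define the end date differently (for example, as the day before).

```javascript
// Adds termMonths to effectiveDate (YYYY-MM-DD) and returns YYYY-MM-DD.
// Works in UTC so the local time zone cannot shift the date.
function contractEndDate(effectiveDate, termMonths) {
  const [year, month, day] = effectiveDate.split('-').map(Number);
  const targetMonthIndex = (month - 1) + termMonths;
  // Day 0 of the following month is the last day of the target month.
  const lastDay = new Date(Date.UTC(year, targetMonthIndex + 1, 0)).getUTCDate();
  // Clamp the day so that Jan 31 + 1 month gives Feb 28, not Mar 3.
  const end = new Date(Date.UTC(year, targetMonthIndex, Math.min(day, lastDay)));
  return end.toISOString().slice(0, 10);
}

const twoYearTerm = contractEndDate('2025-03-15', 24); // '2027-03-15'
const clampedTerm = contractEndDate('2026-01-31', 1);  // '2026-02-28'
```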

Reports and Forms

The same approach works for any document type. Define the fields you care about, and the extraction handles the rest — regardless of layout or formatting.

Batch Processing

For high-volume scenarios, process multiple PDFs through the same pipeline:

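import { readFile } from 'node:fs/promises';

const files = ['invoice1.pdf', 'invoice2.pdf', 'invoice3.pdf'];

// Build the same multipart body as the curl example (Node 18+).
async function createFormData(file, schemaId) {
  const form = new FormData();
  form.append('file', new Blob([await readFile(file)]), file);
  form.append('schemaId', schemaId);
  return form;
}

// API_KEY and schemaId are defined elsewhere; allSettled keeps one failed upload from rejecting the batch.
const results = await Promise.allSettled(
  files.map(async file =>
    fetch('https://api.smole.tech/api/pipeline/file', {
      method: 'POST',
      headers: { 'Authorization': `Bearer ${API_KEY}` },
      body: await createFormData(file, schemaId)
    })
  )
);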

Each document is processed independently, so failures in one don't affect the others.
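Firing every upload at once is fine for a handful of files, but with hundreds you may want to cap how many requests are in flight. This guide doesn't cover rate limits, so treat the limit below as an assumption to tune. The sketch shows a hypothetical `mapWithLimit` helper that returns results in the same `{ status, value | reason }` shape as `Promise.allSettled`.

```javascript
// Runs worker(item) for every item with at most `limit` calls in flight.
// Resolves to an array of { status, value | reason } objects in input order.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner pulls the next unclaimed index until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { status: 'fulfilled', value: await worker(items[index]) };
      } catch (reason) {
        // A failed item is recorded and the runner moves on to the next one.
        results[index] = { status: 'rejected', reason };
      }
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Assumed usage, where uploadPdf(file) wraps the pipeline call:
// const results = await mapWithLimit(files, 3, file => uploadPdf(file));
```

Because failures are captured per item, you can retry only the rejected entries instead of rerunning the whole batch.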

Tips for Better Results

  1. Be specific with field names: invoice_date works better than date1 because the name gives context about what to look for
  2. Match your schema to the document: if the document has a table, use an array in your schema
  3. Use appropriate types: "type": "number" for amounts, "format": "date" for dates
  4. Start simple: extract the most important fields first, then expand your schema

Try It Now

Upload a PDF in the Playground and see the extraction results in seconds. No setup required — just drag, drop, and define your schema.

For API integration details, see the documentation.