How to Convert PDFs to JSON with an API

PDFs are everywhere — invoices, contracts, reports, forms — but the data inside them is trapped. You can't query a PDF. You can't pipe it into a database. You need that data as structured JSON.

This guide covers how to convert PDFs to JSON using a REST API, from simple text-based PDFs to scanned documents that need OCR.

Why PDF to JSON?

PDFs are designed for display, not for data. That becomes a problem when you need to:

Pull line items from invoices into your accounting system
Extract terms and dates from contracts into a database
Feed report data into dashboards or analytics tools
Automate data entry from scanned forms

In each case, you need a way to turn the PDF into structured data your code can work with.

The Two Types of PDFs

Not all PDFs are equal. How you handle them depends on what's inside:

Digital (text-based) PDFs

These PDFs contain actual text data. You can select and copy text from them. They're created by software like Word, Google Docs, or reporting tools.

Processing: Text is extracted directly — fast and accurate.

Scanned PDFs (image-based)

These are essentially images wrapped in a PDF. They come from scanners, phone cameras, or fax machines. You can't select text in them.

Processing: OCR (Optical Character Recognition) converts the image to text first, then extraction happens on the OCR output.

Smole handles both types automatically. Upload any PDF, and the system detects whether OCR is needed.

Converting a PDF to JSON: Step by Step

1. Create a Schema

First, define what data you want to extract. A JSON Schema tells the API what fields to look for:

{
  "type": "object",
  "properties": {
    "document_title": { "type": "string" },
    "date": { "type": "string", "format": "date" },
    "author": { "type": "string" },
    "sections": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "heading": { "type": "string" },
          "content": { "type": "string" }
        }
      }
    }
  }
}

Register the schema with the API:

curl -X POST https://api.smole.tech/api/schemas \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "name": "report-schema", "schema": { ... } }'
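The same request can be made from JavaScript. This is a minimal sketch, assuming Node 18+ (for the global fetch) and that the endpoint accepts the same JSON body as the curl call above. The helper names buildSchemaRequest and createSchema are hypothetical, and the response shape isn't shown in this guide, so the function simply returns whatever JSON comes back.

```javascript
// Schema from the example above
const reportSchema = {
  type: 'object',
  properties: {
    document_title: { type: 'string' },
    date: { type: 'string', format: 'date' },
    author: { type: 'string' },
    sections: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          heading: { type: 'string' },
          content: { type: 'string' }
        }
      }
    }
  }
};

// Hypothetical helper: builds the fetch arguments without sending anything
function buildSchemaRequest(apiKey, name, schema) {
  return {
    url: 'https://api.smole.tech/api/schemas',
    options: {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ name, schema })
    }
  };
}

// Sends the request and returns the parsed JSON response, whatever its shape
async function createSchema(apiKey, name, schema) {
  const { url, options } = buildSchemaRequest(apiKey, name, schema);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Schema creation failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

A call would look like createSchema(process.env.SMOLE_API_KEY, 'report-schema', reportSchema), where the environment variable name is only an assumption.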

2. Upload and Extract

Send your PDF to the pipeline endpoint:

curl -X POST https://api.smole.tech/api/pipeline/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@report.pdf" \
  -F "schemaId=YOUR_SCHEMA_ID"

This single call handles everything: conversion, OCR (if needed), and extraction.
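In JavaScript, the upload might look like the sketch below. It assumes Node 18+ (global fetch, FormData, and Blob) and the same two form fields as the curl call, file and schemaId. The helper names are hypothetical, and the response is returned as parsed JSON without assuming its shape.

```javascript
// Hypothetical helper: builds the multipart body with the same fields as the curl call
function buildUploadForm(pdfBytes, fileName, schemaId) {
  const form = new FormData();
  form.append('file', new Blob([pdfBytes], { type: 'application/pdf' }), fileName);
  form.append('schemaId', schemaId);
  return form;
}

// Uploads one PDF and returns the parsed JSON response
async function uploadPdf(apiKey, pdfBytes, fileName, schemaId) {
  const response = await fetch('https://api.smole.tech/api/pipeline/file', {
    method: 'POST',
    // No Content-Type header here: fetch sets the multipart boundary itself
    headers: { 'Authorization': `Bearer ${apiKey}` },
    body: buildUploadForm(pdfBytes, fileName, schemaId)
  });
  if (!response.ok) {
    throw new Error(`Upload failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

The pdfBytes argument can come from fs.promises.readFile('report.pdf'), which returns a Buffer that Blob accepts directly.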

3. Get Your JSON

Poll for the result:

curl https://api.smole.tech/api/pipeline/PIPELINE_ID \
  -H "Authorization: Bearer YOUR_API_KEY"
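A polling loop in JavaScript could be sketched as follows. This guide doesn't show how the response signals completion, so the check is left to a caller-supplied isDone function. The interval, the attempt limit, and the helper name pollPipeline are all assumptions made for illustration.

```javascript
// Polls a pipeline until isDone(result) returns true or the attempts run out.
// fetchImpl and isDone are injected so the loop can be tested without the network.
async function pollPipeline(pipelineId, apiKey, isDone, opts = {}) {
  const { intervalMs = 2000, maxAttempts = 30, fetchImpl = fetch } = opts;
  const url = `https://api.smole.tech/api/pipeline/${encodeURIComponent(pipelineId)}`;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = await fetchImpl(url, {
      headers: { 'Authorization': `Bearer ${apiKey}` }
    });
    if (!response.ok) {
      throw new Error(`Polling failed with HTTP ${response.status}`);
    }

    const result = await response.json();
    if (isDone(result)) return result;

    // Wait before the next attempt, but not after the last one
    if (attempt < maxAttempts) {
      await new Promise(resolve => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Pipeline ${pipelineId} not finished after ${maxAttempts} attempts`);
}
```

For example, pollPipeline(id, apiKey, r => r.status === 'completed') would work if the response carried a status field with that value, which is a hypothetical here; check the API documentation for the actual field.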

You'll get back clean JSON matching your schema:

{
  "document_title": "Q4 Sales Report",
  "date": "2025-12-31",
  "author": "Finance Team",
  "sections": [
    {
      "heading": "Revenue Summary",
      "content": "Total revenue for Q4 was €2.4M, a 15% increase over Q3..."
    },
    {
      "heading": "Regional Breakdown",
      "content": "DACH region contributed 45% of total revenue..."
    }
  ]
}
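Before passing the result downstream, a quick sanity check can catch fields that came back missing or with an unexpected type. The sketch below compares only the top-level properties against the schema. It is deliberately minimal and is not a full JSON Schema validator: it ignores nested objects and formats, and it treats absent fields as problems even though the schema above declares none as required. The function name checkTopLevelTypes is hypothetical.

```javascript
// Returns a list of mismatches between a result and the schema's top-level properties
function checkTopLevelTypes(schema, data) {
  const problems = [];
  for (const [key, spec] of Object.entries(schema.properties ?? {})) {
    const value = data[key];
    if (value === undefined || value === null) {
      problems.push(`${key}: missing`);
      continue;
    }
    // Map the JavaScript value onto a JSON Schema type name
    let actual = typeof value;
    if (Array.isArray(value)) {
      actual = 'array';
    } else if (spec.type === 'integer' && Number.isInteger(value)) {
      actual = 'integer';
    }
    if (actual !== spec.type) {
      problems.push(`${key}: expected ${spec.type}, got ${actual}`);
    }
  }
  return problems;
}
```

An empty array means every top-level field is present with the declared type. For production use, a complete JSON Schema validator library is the safer choice.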

Handling Different PDF Types

Invoices

{
  "type": "object",
  "properties": {
    "vendor_name": { "type": "string" },
    "invoice_number": { "type": "string" },
    "date": { "type": "string", "format": "date" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" },
          "total": { "type": "number" }
        }
      }
    },
    "subtotal": { "type": "number" },
    "tax": { "type": "number" },
    "total": { "type": "number" }
  }
}
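Because the schema types the amounts as numbers, you can cross-check the arithmetic of an extracted invoice. The sketch below assumes the simplest case: each line total equals quantity times unit price, the subtotal is the sum of line totals, and the total is subtotal plus tax. Invoices with discounts, shipping, or per-line rounding will be flagged, so treat a mismatch as a prompt for review rather than proof of an extraction error. The function name checkInvoiceTotals is hypothetical.

```javascript
// Rounds to integer cents so float noise (such as 0.1 + 0.2) doesn't cause false mismatches
const toCents = amount => Math.round(amount * 100);

// Returns a list of arithmetic inconsistencies found in an extracted invoice
function checkInvoiceTotals(invoice) {
  const problems = [];
  let lineSum = 0;
  for (const [i, item] of (invoice.line_items ?? []).entries()) {
    if (toCents(item.quantity * item.unit_price) !== toCents(item.total)) {
      problems.push(`line_items[${i}]: quantity * unit_price != total`);
    }
    lineSum += toCents(item.total);
  }
  if (lineSum !== toCents(invoice.subtotal)) {
    problems.push('subtotal != sum of line totals');
  }
  if (toCents(invoice.subtotal) + toCents(invoice.tax) !== toCents(invoice.total)) {
    problems.push('subtotal + tax != total');
  }
  return problems;
}
```

An empty array means the numbers are internally consistent under those assumptions.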

Contracts

{
  "type": "object",
  "properties": {
    "parties": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "role": { "type": "string" }
        }
      }
    },
    "effective_date": { "type": "string", "format": "date" },
    "term_months": { "type": "integer" },
    "governing_law": { "type": "string" },
    "key_obligations": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}
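Once effective_date and term_months are extracted, derived values such as the end of the term are easy to compute. The sketch below adds the term to the start date and clamps to the last day of the target month. It returns the anniversary date; whether a term legally ends on that day or the day before depends on the contract, so this is an illustration and not a rule. The function name contractEndDate is hypothetical.

```javascript
// Adds termMonths to effectiveDate (YYYY-MM-DD), clamping to the last day of the
// target month so that Jan 31 + 1 month gives Feb 28/29 and doesn't roll into March
function contractEndDate(effectiveDate, termMonths) {
  const [year, month, day] = effectiveDate.split('-').map(Number);
  const monthIndex = (month - 1) + termMonths;
  const targetYear = year + Math.floor(monthIndex / 12);
  const targetMonth = ((monthIndex % 12) + 12) % 12; // 0-based month
  // Day 0 of the following month is the last day of the target month
  const lastDay = new Date(Date.UTC(targetYear, targetMonth + 1, 0)).getUTCDate();
  const end = new Date(Date.UTC(targetYear, targetMonth, Math.min(day, lastDay)));
  return end.toISOString().slice(0, 10);
}
```

Working in UTC keeps the result independent of the machine's time zone.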

Reports and Forms

The same approach works for any document type. Define the fields you care about, and the extraction handles the rest — regardless of layout or formatting.

Batch Processing

For high-volume scenarios, process multiple PDFs through the same pipeline:

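import { readFile } from 'node:fs/promises';

const files = ['invoice1.pdf', 'invoice2.pdf', 'invoice3.pdf'];

// Hypothetical helper: same form fields as the curl call; API_KEY and schemaId are defined elsewhere
async function createFormData(file, schemaId) {
  const form = new FormData();
  form.append('file', new Blob([await readFile(file)]), file);
  form.append('schemaId', schemaId);
  return form;
}

// allSettled keeps one failed upload from rejecting the whole batch
const results = await Promise.allSettled(
  files.map(async file =>
    fetch('https://api.smole.tech/api/pipeline/file', {
      method: 'POST',
      headers: { 'Authorization': `Bearer ${API_KEY}` },
      body: await createFormData(file, schemaId)
    })
  )
);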

Each document is processed independently, so failures in one don't affect the others.
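For larger batches, starting every upload at once may not be what you want. Whether the API enforces rate limits isn't covered here, so treat that concern as an assumption. A small helper that caps the number of requests in flight could be sketched like this; the name mapWithLimit is hypothetical, and the result entries are shaped like those of Promise.allSettled so that one failure doesn't hide the other results.

```javascript
// Runs task(item) for every item with at most `limit` tasks in flight.
// Returns results in input order, shaped like Promise.allSettled entries.
async function mapWithLimit(items, limit, task) {
  const results = new Array(items.length);
  let next = 0;

  async function worker() {
    while (next < items.length) {
      const index = next++; // no race: this runs synchronously between awaits
      try {
        results[index] = { status: 'fulfilled', value: await task(items[index]) };
      } catch (reason) {
        results[index] = { status: 'rejected', reason };
      }
    }
  }

  // Start at least one worker, and never more workers than items
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With an upload function of your own, a call such as mapWithLimit(files, 3, file => uploadOne(file)) would keep at most three uploads running, where uploadOne stands in for whatever function sends a single file.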

Tips for Better Results

Be specific with field names: invoice_date works better than date1 because the name gives context about what to look for
Match your schema to the document: if the document has a table, use an array in your schema
Use appropriate types: numbers for amounts ("type": "number") and date-formatted strings for dates ("format": "date")
Start simple: extract the most important fields first, then expand your schema

Try It Now

Upload a PDF in the Playground and see the extraction results in seconds. No setup required — just drag, drop, and define your schema.

For API integration details, see the documentation.
