
Convert PDF to CSV: Extract Tabular Data via API

February 21, 2026 · Smole Team


You have a stack of PDFs — invoices, reports, statements — and you need that data in a spreadsheet. Manually copying rows from PDFs into Excel is painful and error-prone. Here's how to automate it: extract structured JSON from PDFs via API, then convert to CSV.

The Pipeline: PDF → JSON → CSV

There's no direct "PDF to CSV" converter that works reliably, because a PDF has no internal concept of rows and columns: it stores positioned text, not tables. The reliable approach is:

  1. Extract structured JSON from the PDF using a schema
  2. Convert the JSON to CSV with a simple script

The schema-based extraction step is what makes this work — it understands the document and returns clean, typed data that maps directly to spreadsheet columns.

Step 1: Extract Data from the PDF

Define a schema for the data you want as CSV columns:

{
  "type": "object",
  "properties": {
    "rows": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "date": { "type": "string", "format": "date" },
          "description": { "type": "string" },
          "category": { "type": "string" },
          "amount": { "type": "number" }
        }
      }
    }
  }
}

Upload the PDF:

curl -X POST https://api.smole.tech/api/pipeline/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@expense-report.pdf" \
  -F "schemaId=YOUR_SCHEMA_ID"
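If you'd rather stay in Python, the same upload can be sketched with the `requests` library (assuming it's installed; the endpoint, header, and form fields mirror the curl call above):

```python
import requests

def extract_pdf(path: str, schema_id: str, api_key: str) -> dict:
    """Upload a PDF for schema-based extraction and return the parsed JSON."""
    with open(path, "rb") as f:
        resp = requests.post(
            "https://api.smole.tech/api/pipeline/file",
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            data={"schemaId": schema_id},
        )
    resp.raise_for_status()  # surface HTTP errors instead of silently continuing
    return resp.json()

# Usage: data = extract_pdf("expense-report.pdf", "YOUR_SCHEMA_ID", "YOUR_API_KEY")
```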

Get JSON back:

{
  "rows": [
    { "date": "2025-11-01", "description": "Office supplies", "category": "Operations", "amount": 234.50 },
    { "date": "2025-11-03", "description": "Client dinner", "category": "Entertainment", "amount": 187.00 },
    { "date": "2025-11-08", "description": "Software license", "category": "IT", "amount": 599.00 }
  ]
}

Step 2: Convert JSON to CSV

Python

import csv
import json

data = json.loads(api_response)  # api_response: the raw JSON string returned in Step 1
rows = data["rows"]

with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
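Note that the snippet above derives the header from `rows[0]`, so it raises an IndexError when extraction returns no rows. A small wrapper (a sketch; the function name is ours) tolerates the empty case:

```python
import csv

def write_csv(rows, path, fieldnames=None):
    """Write a list of dicts to CSV, tolerating an empty result set."""
    # Fall back to the first row's keys only when rows exist.
    fieldnames = fieldnames or (list(rows[0].keys()) if rows else [])
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# Usage: write_csv(data["rows"], "output.csv")
```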

JavaScript (Node.js)

const fs = require("fs");

const rows = data.rows; // `data` is the parsed API response from Step 1
const headers = Object.keys(rows[0]);
const csv = [
  headers.join(","),
  ...rows.map(row => headers.map(h => JSON.stringify(row[h] ?? "")).join(","))
].join("\n");

fs.writeFileSync("output.csv", csv);

Command Line (with jq)

echo "date,description,category,amount" > output.csv
echo "$JSON_RESPONSE" | jq -r '.rows[] | [.date, .description, .category, .amount] | @csv' >> output.csv

Real-World Example: Invoices to CSV

Extract invoice data and flatten to spreadsheet rows:

import csv
import json

def invoices_to_csv(extracted_invoices, output_path):
    """Convert extracted invoice JSON to CSV with one row per line item."""
    rows = []
    for inv in extracted_invoices:
        for item in inv.get("line_items", []):
            rows.append({
                "invoice_number": inv.get("invoice_number", ""),
                "vendor": inv.get("vendor", {}).get("name", ""),
                "date": inv.get("date", ""),
                "item": item.get("description", ""),
                "quantity": item.get("quantity", 0),
                "unit_price": item.get("unit_price", 0),
                "total": item.get("total", 0),
                "invoice_total": inv.get("total", 0),
            })

    with open(output_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

invoices_to_csv(all_invoices, "invoices.csv")

Output CSV:

invoice_number,vendor,date,item,quantity,unit_price,total,invoice_total
INV-2025-001,Acme Corp,2025-11-15,Widget Pro,5,24.99,124.95,148.69
INV-2025-001,Acme Corp,2025-11-15,Shipping,1,23.74,23.74,148.69
INV-2025-002,TechParts GmbH,2025-11-20,Server RAM,4,189.00,756.00,899.64
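Once the data is flat, downstream aggregation is straightforward. A stdlib-only sketch that totals spend per vendor from CSV text shaped like the output above (the sample data is from this post):

```python
import csv
import io
from collections import defaultdict

csv_text = """invoice_number,vendor,date,item,quantity,unit_price,total,invoice_total
INV-2025-001,Acme Corp,2025-11-15,Widget Pro,5,24.99,124.95,148.69
INV-2025-001,Acme Corp,2025-11-15,Shipping,1,23.74,23.74,148.69
INV-2025-002,TechParts GmbH,2025-11-20,Server RAM,4,189.00,756.00,899.64
"""

# Sum line-item totals per vendor; DictReader parses the header row for us.
totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(csv_text)):
    totals[row["vendor"]] += float(row["total"])

# totals per vendor: Acme Corp ≈ 148.69, TechParts GmbH ≈ 756.00
```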

Batch Processing: Many PDFs to One CSV

Process an entire folder of PDFs into a single CSV:

import csv
from pathlib import Path

all_rows = []
for pdf in Path("./invoices").glob("*.pdf"):
    result = extract(str(pdf), schema_id)  # Your extraction function
    for item in result.get("line_items", []):
        all_rows.append({
            "file": pdf.name,
            "vendor": result.get("vendor", {}).get("name", ""),
            "date": result.get("date", ""),
            "item": item.get("description", ""),
            "amount": item.get("total", 0),
        })

# Write combined CSV
with open("all_invoices.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=all_rows[0].keys())
    writer.writeheader()
    writer.writerows(all_rows)

Handling Different Document Types

The same PDF-to-CSV approach works for any document type — just adjust the schema:

  • Bank statements → Date, Description, Debit, Credit, Balance
  • Inventory lists → SKU, Name, Quantity, Location, Price
  • Employee records → Name, ID, Department, Start Date, Salary
  • Purchase orders → PO Number, Vendor, Item, Quantity, Total
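For example, a bank-statement schema follows the same shape as the expense schema in Step 1 (field names here are illustrative, matching the columns listed above):

```json
{
  "type": "object",
  "properties": {
    "rows": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "date": { "type": "string", "format": "date" },
          "description": { "type": "string" },
          "debit": { "type": "number" },
          "credit": { "type": "number" },
          "balance": { "type": "number" }
        }
      }
    }
  }
}
```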

Tips

  1. Design your schema for flat output — CSV works best with flat rows, so keep your schema's array items simple (no deep nesting)
  2. Use format: "date" for date columns — This ensures consistent ISO date formatting in your CSV
  3. Handle missing data — Some documents won't have every field. Default to empty strings or 0 in your CSV conversion script
  4. Test with one document first — Verify the schema produces the columns you want before running a batch
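Tip 3 can be sketched as a small normalization helper (the function name and defaults are ours): it guarantees every row carries exactly the expected columns, filling gaps before the CSV writer sees them.

```python
def normalize_row(row, fieldnames, defaults=None):
    """Return a dict with exactly `fieldnames`, filling missing keys from `defaults`."""
    defaults = defaults or {}
    return {f: row.get(f, defaults.get(f, "")) for f in fieldnames}

fields = ["date", "description", "category", "amount"]
partial = {"date": "2025-11-01", "amount": 42.0}  # extraction missed two fields
row = normalize_row(partial, fields, defaults={"amount": 0})
# row has empty strings for description/category, so DictWriter won't raise
```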

Try It Now

Upload a PDF in the Playground, define a schema matching the table columns, and see the structured JSON output. From there, converting to CSV is a few lines of code.

For API details, see the documentation.