
How to Extract Tables from PDFs into Structured Data

February 23, 2026 · Smole Team

Tables in PDFs are one of the most frustrating data sources to work with. The data looks perfectly structured on screen, but copy-pasting gives you a jumbled mess. Traditional PDF parsers struggle with merged cells, multi-line rows, and inconsistent column alignment.

Schema-based extraction solves this: you tell the API which fields each row should contain, and it returns clean, structured data.

Why PDF Tables Are Hard

PDFs don't actually have a concept of "tables." What you see as a table is just text positioned at specific coordinates on a page, sometimes with lines drawn between them, sometimes not. This means:

  • Copy-paste scrambles column order
  • PDF parsers can't reliably detect where one column ends and another begins
  • Headers might span multiple columns
  • Rows can wrap across multiple lines
  • Some tables have borders, others don't

Schema-based extraction sidesteps these problems by interpreting the content rather than trying to reconstruct the layout.
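
To make the layout problem concrete, here is a minimal sketch of what a layout-based approach has to work with. The fragment coordinates and the `naive_rows` helper are hypothetical, invented for illustration; they do not come from any particular PDF library.

```python
# Hypothetical text fragments as a PDF parser might report them: (x, y, text).
# y grows downward; the last product's name wraps onto a second line.
fragments = [
    (50, 100, "Product"), (200, 100, "Price"),
    (50, 120, "Widget Pro"), (200, 120, "24.99"),
    (50, 140, "Widget Enterprise"), (200, 140, "89.99"),
    (50, 152, "(annual license)"),  # continuation of the cell above it
]

def naive_rows(frags):
    """Group fragments by y coordinate, then order each group left to right."""
    rows = {}
    for x, y, text in frags:
        rows.setdefault(y, []).append((x, text))
    return [[text for _, text in sorted(cells)] for _, cells in sorted(rows.items())]

rows = naive_rows(fragments)
for row in rows:
    print(row)
```

The wrapped line comes back as a fourth "row" with a single cell, because nothing in the coordinates says it belongs to the row above.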

Extracting a Simple Table

Say you have a PDF with a product pricing table. Define what each row should look like:

{
  "type": "object",
  "properties": {
    "table_title": { "type": "string" },
    "rows": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "product_name": { "type": "string" },
          "sku": { "type": "string" },
          "unit_price": { "type": "number" },
          "quantity_available": { "type": "integer" }
        }
      }
    }
  }
}

Upload the PDF:

curl -X POST https://api.smole.tech/api/pipeline/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@pricing-sheet.pdf" \
  -F "schemaId=YOUR_SCHEMA_ID"
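
If you would rather make the same call from Python, the sketch below builds an equivalent multipart request using only the standard library. The endpoint, header, and form field names are taken from the curl command; `build_upload_request` is a hypothetical helper name, and the placeholder PDF bytes stand in for a real file.

```python
import urllib.request
import uuid

API_URL = "https://api.smole.tech/api/pipeline/file"  # endpoint from the curl example

def build_upload_request(pdf_bytes, filename, schema_id, api_key):
    """Build (but do not send) a multipart/form-data POST mirroring the curl call."""
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode()
    parts = [
        delimiter,
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/pdf",
        b"",
        pdf_bytes,
        delimiter,
        b'Content-Disposition: form-data; name="schemaId"',
        b"",
        schema_id.encode(),
        delimiter + b"--",  # closing delimiter
        b"",
    ]
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(
        API_URL, data=b"\r\n".join(parts), headers=headers, method="POST"
    )

# Placeholder bytes; in practice read them with open("pricing-sheet.pdf", "rb").
request = build_upload_request(
    b"%PDF-1.7 placeholder", "pricing-sheet.pdf", "YOUR_SCHEMA_ID", "YOUR_API_KEY"
)
# urllib.request.urlopen(request) would send it.
```

Nothing is sent until you pass the request to `urllib.request.urlopen`, so you can inspect the body and headers first.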

Get structured output:

{
  "table_title": "Q1 2026 Product Pricing",
  "rows": [
    { "product_name": "Widget Pro", "sku": "WP-100", "unit_price": 24.99, "quantity_available": 1500 },
    { "product_name": "Widget Lite", "sku": "WL-200", "unit_price": 12.49, "quantity_available": 3200 },
    { "product_name": "Widget Enterprise", "sku": "WE-300", "unit_price": 89.99, "quantity_available": 450 }
  ]
}

Every row is typed according to the schema: prices as numbers, quantities as integers, text as strings.
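
Because the values arrive typed, moving them into other tools takes little code. As a rough sketch, this parses the output shown above and writes the rows to CSV with the standard library; `rows_to_csv` is a hypothetical helper name.

```python
import csv
import io
import json

# The structured output shown above, as it might arrive in a response body.
response_text = """
{
  "table_title": "Q1 2026 Product Pricing",
  "rows": [
    { "product_name": "Widget Pro", "sku": "WP-100", "unit_price": 24.99, "quantity_available": 1500 },
    { "product_name": "Widget Lite", "sku": "WL-200", "unit_price": 12.49, "quantity_available": 3200 },
    { "product_name": "Widget Enterprise", "sku": "WE-300", "unit_price": 89.99, "quantity_available": 450 }
  ]
}
"""
data = json.loads(response_text)

def rows_to_csv(rows):
    """Serialize a list of row dicts to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(data["rows"])
print(csv_text)
```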

Handling Complex Tables

Multiple Tables in One Document

If a PDF contains several tables, define each in your schema:

{
  "type": "object",
  "properties": {
    "revenue_table": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "region": { "type": "string" },
          "q1": { "type": "number" },
          "q2": { "type": "number" },
          "q3": { "type": "number" },
          "q4": { "type": "number" }
        }
      }
    },
    "expense_table": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "category": { "type": "string" },
          "budgeted": { "type": "number" },
          "actual": { "type": "number" },
          "variance": { "type": "number" }
        }
      }
    }
  }
}

The extraction engine identifies which table is which based on context and column headers.
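
Multi-table schemas get repetitive when written by hand. One way to generate them is sketched below; `table_schema` is a hypothetical helper, and the result matches the JSON above.

```python
import json

def table_schema(columns):
    """Return an array-of-objects schema for one table, given {column_name: json_type}."""
    return {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {name: {"type": json_type} for name, json_type in columns.items()},
        },
    }

schema = {
    "type": "object",
    "properties": {
        "revenue_table": table_schema(
            {"region": "string", "q1": "number", "q2": "number", "q3": "number", "q4": "number"}
        ),
        "expense_table": table_schema(
            {"category": "string", "budgeted": "number", "actual": "number", "variance": "number"}
        ),
    },
}
print(json.dumps(schema, indent=2))
```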

Tables with Merged Cells

Financial statements and reports often merge cells for category headers. The extraction handles this by understanding the hierarchy:

{
  "type": "object",
  "properties": {
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "category": { "type": "string" },
          "subcategory": { "type": "string" },
          "amount": { "type": "number" }
        }
      }
    }
  }
}

A merged "Operating Expenses" header spanning multiple rows gets assigned as the category for each sub-row.
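
The sketch below illustrates the result you should expect, not how the extraction engine works internally. It takes hypothetical raw rows in which a merged header has no amount, and copies each header down onto the sub-rows that follow it.

```python
# Hypothetical raw rows: a merged header row has a label but no amount.
raw_rows = [
    {"label": "Operating Expenses", "amount": None},
    {"label": "Salaries", "amount": 120000.0},
    {"label": "Rent", "amount": 30000.0},
    {"label": "Other Income", "amount": None},
    {"label": "Interest", "amount": 1500.0},
]

def assign_categories(rows):
    """Copy each merged header down onto the sub-rows that follow it."""
    line_items = []
    current_category = None
    for row in rows:
        if row["amount"] is None:  # header row: remember it, emit nothing
            current_category = row["label"]
        else:
            line_items.append(
                {"category": current_category, "subcategory": row["label"], "amount": row["amount"]}
            )
    return line_items

line_items = assign_categories(raw_rows)
for item in line_items:
    print(item)
```

Each output object has the same shape as an item in the `line_items` array of the schema above.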

Tables Without Borders

Many PDFs use whitespace alignment instead of visible borders. This doesn't matter for schema-based extraction, which reads the content, not the lines.

Invoice Line Items

Invoice line items are the most common table extraction use case, and they almost always follow the same pattern:

{
  "type": "object",
  "properties": {
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" },
          "tax_rate": { "type": "number" },
          "total": { "type": "number" }
        }
      }
    },
    "subtotal": { "type": "number" },
    "tax_total": { "type": "number" },
    "grand_total": { "type": "number" }
  }
}

This works regardless of whether the invoice uses "Qty" or "Quantity", "Unit Price" or "Price/Unit", or any other header variation.
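
Extracting the summary fields alongside the line items lets you sanity-check the result. The sketch below is one possible check. It assumes each line `total` is quantity times unit price before tax, and that `grand_total` is `subtotal` plus `tax_total`; your invoices may define these differently. The sample invoice and the `check_invoice` helper are hypothetical.

```python
from decimal import Decimal

def check_invoice(invoice, tolerance=Decimal("0.01")):
    """Return a list of inconsistencies; an empty list means the totals add up."""
    def d(value):
        return Decimal(str(value))  # avoid binary float rounding in the sums

    problems = []
    for i, item in enumerate(invoice["line_items"]):
        expected = d(item["quantity"]) * d(item["unit_price"])
        if abs(expected - d(item["total"])) > tolerance:
            problems.append(f"line {i}: expected total {expected}, got {item['total']}")

    line_sum = sum((d(item["total"]) for item in invoice["line_items"]), Decimal("0"))
    if abs(line_sum - d(invoice["subtotal"])) > tolerance:
        problems.append(f"subtotal: lines sum to {line_sum}, got {invoice['subtotal']}")

    expected_grand = d(invoice["subtotal"]) + d(invoice["tax_total"])
    if abs(expected_grand - d(invoice["grand_total"])) > tolerance:
        problems.append(f"grand_total: expected {expected_grand}, got {invoice['grand_total']}")
    return problems

# Hypothetical extraction result for illustration.
invoice = {
    "line_items": [
        {"description": "Consulting", "quantity": 10, "unit_price": 150.0, "tax_rate": 0.2, "total": 1500.0},
        {"description": "Hosting", "quantity": 1, "unit_price": 99.5, "tax_rate": 0.2, "total": 99.5},
    ],
    "subtotal": 1599.5,
    "tax_total": 319.9,
    "grand_total": 1919.4,
}
print(check_invoice(invoice))
```

A non-empty list is a useful signal to route the document to manual review.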

Financial Statements

Extract balance sheets, income statements, and cash flow tables:

{
  "type": "object",
  "properties": {
    "statement_type": { "type": "string" },
    "period": { "type": "string" },
    "entries": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "line_item": { "type": "string" },
          "current_period": { "type": "number" },
          "prior_period": { "type": "number" }
        }
      }
    }
  }
}
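
Once both periods are numeric, derived columns are simple to compute. This sketch adds absolute and percentage change to each entry; the sample entries and the `with_change` helper are hypothetical.

```python
def with_change(entries):
    """Add absolute and percentage change between prior and current period."""
    result = []
    for entry in entries:
        current, prior = entry["current_period"], entry["prior_period"]
        change = current - prior
        # Percentage change is undefined when the prior period is zero.
        change_pct = None if prior == 0 else round(change / prior * 100, 1)
        result.append({**entry, "change": change, "change_pct": change_pct})
    return result

# Hypothetical entries shaped like the "entries" array in the schema above.
entries = [
    {"line_item": "Revenue", "current_period": 1200.0, "prior_period": 1000.0},
    {"line_item": "Cost of goods sold", "current_period": 450.0, "prior_period": 500.0},
    {"line_item": "Grant income", "current_period": 75.0, "prior_period": 0.0},
]
enriched = with_change(entries)
for entry in enriched:
    print(entry)
```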

Tips for Better Table Extraction

  1. Name your array item properties after column headers: if the table has a "Description" column, use "description" in your schema
  2. Use number for numeric columns (or integer for whole counts): this way amounts, quantities, and percentages come back as numbers, not strings
  3. Extract the table summary too: include fields for totals, subtotals, and other summary rows alongside the array
  4. Test with varying formats: the same schema should work across different table layouts
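
For the first tip, a small helper can derive property names from the headers you see in the document. This is a sketch of one possible naming convention, not a requirement of the API; `header_to_property` is a hypothetical name.

```python
import re

def header_to_property(header):
    """Turn a column header such as "Unit Price" into a snake_case property name."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", header.strip().lower())
    return name.strip("_")

headers = ["Description", "Unit Price", "Price/Unit", "Qty.", "Quantity Available"]
property_names = [header_to_property(h) for h in headers]
print(property_names)
```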

Try It Now

Upload a PDF with tables in the Playground and see structured data come back in seconds. Define your schema to match the table columns, and let the extraction handle the rest.

For API integration details, see the documentation.