Extract Data from Spreadsheets (XLSX) via API

Spreadsheets are everywhere in business: budgets, inventory lists, project trackers, sales reports, HR records. They're flexible, which is why people use them for everything. But that flexibility also makes them hard to process programmatically, because real files come with merged cells, multiple sheets, inconsistent formatting, formulas, and hidden columns.

Schema-based extraction cuts through the complexity. Define what data you want, upload the spreadsheet, and get clean JSON back.

Why Not Just Parse the XLSX Directly?

Libraries like openpyxl (Python) or xlsx (JavaScript) can read Excel files, but you still need to handle:

Inconsistent structure — Column A is the name in one file, column B in another
Multiple sheets — Which sheet has the data you need?
Merged cells — Headers spanning multiple columns
Empty rows and columns — Padding, spacing, notes scattered throughout
Formula results vs raw values — Some cells contain formulas, others contain static values
Formatting as data — Currency symbols, percentage signs, date formats

Schema-based extraction handles all of this. It converts the spreadsheet to a structured format, then extracts the data you defined — regardless of layout quirks.
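
To make the "formatting as data" problem concrete, here is a minimal sketch of the cleanup code you would otherwise write by hand when cells arrive as display strings. The helper name parse_cell_number is hypothetical, and the sketch assumes US-style number formatting (comma as thousands separator, dot as decimal point).

```python
import re
from decimal import Decimal


def parse_cell_number(text: str) -> Decimal:
    """Turn a display string such as '$1,234.50' or '-2.1%' into a Decimal.

    Assumes US-style formatting; '31.237,50' would be misread.
    """
    cleaned = text.strip()
    # Accounting notation: "(500)" means -500
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    # Keep only digits, the decimal point, and a minus sign
    cleaned = re.sub(r"[^0-9.\-]", "", cleaned)
    value = Decimal(cleaned)
    return -value if negative else value


print(parse_cell_number("$1,234.50"))  # 1234.50
print(parse_cell_number("-2.1%"))      # -2.1
print(parse_cell_number("(500)"))      # -500
```

Even this small helper breaks on European formats or cells that mix units with numbers, and every such rule has to be maintained per file layout.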

Extracting Data from an Excel File

Example: A Sales Report Spreadsheet

Define a schema for the fields you want:

{
  "type": "object",
  "properties": {
    "report_title": { "type": "string" },
    "reporting_period": { "type": "string" },
    "sales_data": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "product": { "type": "string" },
          "region": { "type": "string" },
          "units_sold": { "type": "integer" },
          "revenue": { "type": "number" },
          "growth_percent": { "type": "number" }
        }
      }
    },
    "total_revenue": { "type": "number" },
    "total_units": { "type": "integer" }
  }
}

Upload and extract:

curl -X POST https://api.smole.tech/api/pipeline/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@sales-report.xlsx" \
  -F "schemaId=YOUR_SCHEMA_ID"
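
The same upload can be sketched in Python using only the standard library. The endpoint, the Bearer header, and the form fields file and schemaId come from the curl call above. The function names are hypothetical, and the sketch assumes the response body is the extracted JSON, which the API documentation should confirm.

```python
import json
import os
import urllib.request
import uuid

API_URL = "https://api.smole.tech/api/pipeline/file"  # from the curl example


def build_extraction_request(filename: str, file_bytes: bytes,
                             schema_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST mirroring the curl call."""
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode()
    parts = [
        delimiter,
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        file_bytes,
        delimiter,
        b'Content-Disposition: form-data; name="schemaId"',
        b"",
        schema_id.encode(),
        delimiter + b"--",
        b"",
    ]
    return urllib.request.Request(
        API_URL,
        data=b"\r\n".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def extract(path: str, schema_id: str, api_key: str) -> dict:
    """Upload a spreadsheet and return the parsed JSON response (assumed shape)."""
    with open(path, "rb") as f:
        request = build_extraction_request(
            os.path.basename(path), f.read(), schema_id, api_key
        )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling extract("sales-report.xlsx", "YOUR_SCHEMA_ID", "YOUR_API_KEY") would send the request. Building the request separately from sending it makes the upload easy to inspect or test without network access.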

Result:

{
  "report_title": "Q4 2025 Sales Report",
  "reporting_period": "October - December 2025",
  "sales_data": [
    { "product": "Widget Pro", "region": "DACH", "units_sold": 1250, "revenue": 31237.50, "growth_percent": 12.5 },
    { "product": "Widget Pro", "region": "Nordics", "units_sold": 840, "revenue": 20997.60, "growth_percent": 8.3 },
    { "product": "Widget Lite", "region": "DACH", "units_sold": 3100, "revenue": 38750.00, "growth_percent": -2.1 },
    { "product": "Widget Lite", "region": "Nordics", "units_sold": 2200, "revenue": 27500.00, "growth_percent": 15.7 }
  ],
  "total_revenue": 118485.10,
  "total_units": 7390
}

Note how growth_percent comes back as a number (-2.1), not a string ("-2.1%"). The extraction handles format conversion automatically.
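
Because the values come back as numbers, the result can be checked for internal consistency right away. The sketch below sums the rows from the sample result and compares them with the extracted totals. The function name totals_match and the one-cent tolerance are assumptions for illustration.

```python
import math

# The sample result shown above
result = {
    "sales_data": [
        {"product": "Widget Pro", "region": "DACH", "units_sold": 1250, "revenue": 31237.50},
        {"product": "Widget Pro", "region": "Nordics", "units_sold": 840, "revenue": 20997.60},
        {"product": "Widget Lite", "region": "DACH", "units_sold": 3100, "revenue": 38750.00},
        {"product": "Widget Lite", "region": "Nordics", "units_sold": 2200, "revenue": 27500.00},
    ],
    "total_revenue": 118485.10,
    "total_units": 7390,
}


def totals_match(report: dict, tolerance: float = 0.01) -> bool:
    """Return True if the row sums agree with the extracted totals."""
    rows = report["sales_data"]
    revenue_sum = sum(row["revenue"] for row in rows)
    units_sum = sum(row["units_sold"] for row in rows)
    # Compare floats with a tolerance; unit counts are integers and must match exactly
    return (
        math.isclose(revenue_sum, report["total_revenue"], abs_tol=tolerance)
        and units_sum == report["total_units"]
    )


print(totals_match(result))  # True
```

A mismatch usually means a row was skipped or a subtotal row was picked up as data, so this check is a cheap guard before loading the result anywhere else.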

Common Spreadsheet Extraction Scenarios

Budget Tracking

{
  "type": "object",
  "properties": {
    "department": { "type": "string" },
    "fiscal_year": { "type": "string" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "category": { "type": "string" },
          "budgeted": { "type": "number" },
          "actual": { "type": "number" },
          "variance": { "type": "number" }
        }
      }
    },
    "total_budget": { "type": "number" },
    "total_actual": { "type": "number" }
  }
}
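
Once a budget sheet is extracted in this shape, downstream checks are short. The sketch below lists categories where spending exceeded the budget. The sample data is made up for illustration, and the variance field is left out because its sign convention depends on the spreadsheet.

```python
# Hypothetical extraction result matching the budget schema above
sample_budget = {
    "department": "Marketing",
    "fiscal_year": "2025",
    "line_items": [
        {"category": "Events", "budgeted": 20000.0, "actual": 23500.0},
        {"category": "Software", "budgeted": 12000.0, "actual": 11400.0},
        {"category": "Travel", "budgeted": 8000.0, "actual": 8000.0},
    ],
}


def over_budget(report: dict) -> list[str]:
    """Return the categories whose actual spend is higher than budgeted."""
    return [
        item["category"]
        for item in report["line_items"]
        if item["actual"] > item["budgeted"]
    ]


print(over_budget(sample_budget))  # ['Events']
```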

Inventory Lists

{
  "type": "object",
  "properties": {
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "sku": { "type": "string" },
          "name": { "type": "string" },
          "category": { "type": "string" },
          "quantity_on_hand": { "type": "integer" },
          "reorder_point": { "type": "integer" },
          "unit_cost": { "type": "number" },
          "location": { "type": "string" }
        }
      }
    }
  }
}
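
With integer quantities in the output, a reorder check becomes a simple comparison. This sketch assumes an item should be reordered when quantity_on_hand is at or below reorder_point. The sample rows are hypothetical.

```python
# Hypothetical extraction result matching the inventory schema above
sample_inventory = {
    "items": [
        {"sku": "WP-100", "name": "Widget Pro", "quantity_on_hand": 40, "reorder_point": 50},
        {"sku": "WL-200", "name": "Widget Lite", "quantity_on_hand": 300, "reorder_point": 120},
        {"sku": "WX-300", "name": "Widget Max", "quantity_on_hand": 25, "reorder_point": 25},
    ]
}


def needs_reorder(inventory: dict) -> list[str]:
    """Return SKUs whose stock is at or below the reorder point."""
    return [
        item["sku"]
        for item in inventory["items"]
        if item["quantity_on_hand"] <= item["reorder_point"]
    ]


print(needs_reorder(sample_inventory))  # ['WP-100', 'WX-300']
```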

Employee Directories

{
  "type": "object",
  "properties": {
    "employees": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "employee_id": { "type": "string" },
          "department": { "type": "string" },
          "title": { "type": "string" },
          "email": { "type": "string", "format": "email" },
          "start_date": { "type": "string", "format": "date" }
        }
      }
    }
  }
}
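
The "format": "date" hint in JSON Schema refers to ISO dates such as 2021-03-15. Whether the API always returns that exact form is an assumption here, so this sketch parses each start_date and collects the records that fail instead of trusting them. The function name and sample rows are hypothetical.

```python
from datetime import date


def split_by_valid_date(directory: dict) -> tuple[dict[str, date], list[str]]:
    """Parse start_date per employee; return (parsed dates, IDs with bad dates)."""
    parsed: dict[str, date] = {}
    invalid: list[str] = []
    for employee in directory["employees"]:
        try:
            parsed[employee["employee_id"]] = date.fromisoformat(employee["start_date"])
        except (ValueError, TypeError):
            # Not an ISO date string (or not a string at all)
            invalid.append(employee["employee_id"])
    return parsed, invalid


# Hypothetical extraction result matching the employee schema above
sample_directory = {
    "employees": [
        {"employee_id": "E001", "name": "A. Example", "start_date": "2021-03-15"},
        {"employee_id": "E002", "name": "B. Example", "start_date": "03/15/2021"},
    ]
}

parsed, invalid = split_by_valid_date(sample_directory)
print(parsed)   # {'E001': datetime.date(2021, 3, 15)}
print(invalid)  # ['E002']
```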

Supported Formats

Format	Extension	Description
Excel (modern)	.xlsx	Standard Excel format
Excel with macros	.xlsm	Excel with macros (macros are ignored, data is extracted)
CSV	.csv	Comma-separated values (also supported via pipeline)
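
A client can reject unsupported files before uploading. This sketch is based only on the extensions listed in the table above. The helper name is hypothetical, and the API may accept formats not listed here.

```python
from pathlib import Path

# Extensions taken from the table above
SUPPORTED_EXTENSIONS = {".xlsx", ".xlsm", ".csv"}


def is_supported(filename: str) -> bool:
    """Check the file extension case-insensitively against the supported set."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported("Sales-Report.XLSX"))  # True
print(is_supported("legacy.xls"))         # False
```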

Tips for Better Spreadsheet Extraction

Name your fields after column headers — If the spreadsheet has a "Product Name" column, use product_name in your schema
Use arrays for tabular data — Spreadsheet rows map naturally to JSON arrays
Include summary fields — Extract totals and subtotals alongside row data for validation
Use integer for counts — Quantities, headcounts, and other whole numbers should use integer type
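
The first tip can be automated when you have a list of column headers. The sketch below converts a header into a snake_case field name. The helper name to_field_name is hypothetical, and how closely field names need to match headers is something to confirm by testing with your own files.

```python
import re


def to_field_name(header: str) -> str:
    """Convert a column header such as 'Product Name' into 'product_name'."""
    # Replace every run of non-alphanumeric characters with one underscore
    name = re.sub(r"[^0-9a-zA-Z]+", "_", header.strip().lower())
    return name.strip("_")


print(to_field_name("Product Name"))  # product_name
print(to_field_name("Units Sold "))   # units_sold
print(to_field_name("Growth (%)"))    # growth
```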

Try It Now

Upload a spreadsheet in the Playground and see structured data in seconds. It works with .xlsx and .xlsm files, however messy the formatting.

For API details, see the documentation.
