Tags: xlsx, spreadsheet, excel, extraction

Extract Data from Spreadsheets (XLSX) via API

February 13, 2026 | Smole Team

Spreadsheets are everywhere in business — budgets, inventory lists, project trackers, sales reports, HR records. They're flexible, which is why people use them for everything. But that flexibility makes them hard to process programmatically: merged cells, multiple sheets, inconsistent formatting, formulas, and hidden columns.

Schema-based extraction cuts through the complexity. Define what data you want, upload the spreadsheet, and get clean JSON back.

Why Not Just Parse the XLSX Directly?

Libraries like openpyxl (Python) or xlsx (JavaScript) can read Excel files, but you still need to handle:

  • Inconsistent structure — Column A is the name in one file, column B in another
  • Multiple sheets — Which sheet has the data you need?
  • Merged cells — Headers spanning multiple columns
  • Empty rows and columns — Padding, spacing, notes scattered throughout
  • Formula results vs raw values — Some cells contain formulas, others contain static values
  • Formatting as data — Currency symbols, percentage signs, date formats
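
To make the list above concrete, here is a minimal sketch of the header detection and value cleanup you typically end up writing by hand. It works on rows that have already been read into plain lists (roughly what a library such as openpyxl gives you). The names find_header_row and parse_number are illustrative helpers, not part of any library, and the sample rows are made up.

```python
import re

def find_header_row(rows, wanted):
    """Return (row_index, {header: column_index}) for the first row that
    contains every wanted header, ignoring case and surrounding spaces."""
    wanted_norm = {w.strip().lower() for w in wanted}
    for i, row in enumerate(rows):
        positions = {}
        for j, cell in enumerate(row):
            if isinstance(cell, str):
                key = cell.strip().lower()
                if key in wanted_norm:
                    positions.setdefault(key, j)
        if wanted_norm.issubset(positions):
            return i, positions
    raise ValueError("no row contains all wanted headers")

def parse_number(text):
    """Turn display strings such as '$1,250.00' or '-2.1%' into a float.
    Assumes '.' is the decimal separator and ',' groups thousands."""
    cleaned = re.sub(r"[^\d.\-]", "", text)
    return float(cleaned)

# Hypothetical sheet: a title row, an empty row, then headers starting in column B.
rows = [
    ["Q4 2025 Sales Report", None, None],
    [None, None, None],
    [None, "Product", "Growth"],
    [None, "Widget Lite", "-2.1%"],
]
header_index, columns = find_header_row(rows, ["Product", "Growth"])
first = rows[header_index + 1]
record = {
    "product": first[columns["product"]],
    "growth_percent": parse_number(first[columns["growth"]]),
}
print(record)  # {'product': 'Widget Lite', 'growth_percent': -2.1}
```

Even this small version ignores merged cells, multiple sheets, locale-specific number formats, and accounting-style negatives such as (123.45), which is where hand-written parsers tend to grow.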

Schema-based extraction takes these cases off your hands. The spreadsheet is first converted to a structured format, and the fields you defined are then extracted from it, wherever they happen to sit in the layout.

Extracting Data from an Excel File

Example: A Sales Report Spreadsheet

Define a schema describing the fields you want:

{
  "type": "object",
  "properties": {
    "report_title": { "type": "string" },
    "reporting_period": { "type": "string" },
    "sales_data": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "product": { "type": "string" },
          "region": { "type": "string" },
          "units_sold": { "type": "integer" },
          "revenue": { "type": "number" },
          "growth_percent": { "type": "number" }
        }
      }
    },
    "total_revenue": { "type": "number" },
    "total_units": { "type": "integer" }
  }
}

Upload and extract:

curl -X POST https://api.smole.tech/api/pipeline/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@sales-report.xlsx" \
  -F "schemaId=YOUR_SCHEMA_ID"
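
The same request can be assembled from Python. The sketch below uses only the standard library and builds the multipart body by hand; the endpoint, the Bearer header, and the file and schemaId field names come from the curl command above, while the function name build_extraction_request and the application/octet-stream part type are assumptions made for illustration. The request is built but not sent.

```python
import urllib.request
import uuid

API_URL = "https://api.smole.tech/api/pipeline/file"  # endpoint from the curl example

def build_extraction_request(file_bytes, filename, schema_id, api_key):
    """Build (without sending) a multipart/form-data POST with the same
    two form fields as the curl command: "file" and "schemaId"."""
    boundary = uuid.uuid4().hex
    file_part_head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    schema_part = (
        f"\r\n--{boundary}\r\n"
        'Content-Disposition: form-data; name="schemaId"\r\n\r\n'
        f"{schema_id}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=file_part_head + file_bytes + schema_part,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )

# Placeholder bytes stand in for the contents of sales-report.xlsx.
request = build_extraction_request(
    b"placeholder bytes", "sales-report.xlsx", "YOUR_SCHEMA_ID", "YOUR_API_KEY"
)
print(request.get_method(), request.full_url)
# To send it, pass the request to urllib.request.urlopen and read the response body.
```

In real use you would read the file with open(path, "rb").read() and parse the response body as JSON; check the API documentation for the exact response envelope.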

Result:

{
  "report_title": "Q4 2025 Sales Report",
  "reporting_period": "October - December 2025",
  "sales_data": [
    { "product": "Widget Pro", "region": "DACH", "units_sold": 1250, "revenue": 31237.50, "growth_percent": 12.5 },
    { "product": "Widget Pro", "region": "Nordics", "units_sold": 840, "revenue": 20997.60, "growth_percent": 8.3 },
    { "product": "Widget Lite", "region": "DACH", "units_sold": 3100, "revenue": 38750.00, "growth_percent": -2.1 },
    { "product": "Widget Lite", "region": "Nordics", "units_sold": 2200, "revenue": 27500.00, "growth_percent": 15.7 }
  ],
  "total_revenue": 118485.10,
  "total_units": 7390
}

Note how growth_percent comes back as a number (-2.1), not a string ("-2.1%"). The extraction handles format conversion automatically.
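
If you want to confirm on your side that numeric fields really arrived as numbers, you can walk the result against the schema. The following is a simplified sketch that checks only the five type keywords used in this post (object, array, string, integer, number); type_errors is an illustrative helper, not a full JSON Schema validator.

```python
def type_errors(value, schema, path="$"):
    """Collect paths whose value does not match the schema's "type".
    Handles only object, array, string, integer, and number."""
    errors = []
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return [path]
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors += type_errors(value[key], sub, f"{path}.{key}")
    elif kind == "array":
        if not isinstance(value, list):
            return [path]
        for i, item in enumerate(value):
            errors += type_errors(item, schema.get("items", {}), f"{path}[{i}]")
    elif kind == "string":
        if not isinstance(value, str):
            errors.append(path)
    elif kind == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            errors.append(path)
    elif kind == "number":
        if not isinstance(value, (int, float)) or isinstance(value, bool):
            errors.append(path)
    return errors

row_schema = {
    "type": "object",
    "properties": {
        "product": {"type": "string"},
        "units_sold": {"type": "integer"},
        "growth_percent": {"type": "number"},
    },
}
good = {"product": "Widget Lite", "units_sold": 3100, "growth_percent": -2.1}
bad = {"product": "Widget Lite", "units_sold": 3100, "growth_percent": "-2.1%"}
print(type_errors(good, row_schema))  # []
print(type_errors(bad, row_schema))   # ['$.growth_percent']
```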

Common Spreadsheet Extraction Scenarios

Budget Tracking

{
  "type": "object",
  "properties": {
    "department": { "type": "string" },
    "fiscal_year": { "type": "string" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "category": { "type": "string" },
          "budgeted": { "type": "number" },
          "actual": { "type": "number" },
          "variance": { "type": "number" }
        }
      }
    },
    "total_budget": { "type": "number" },
    "total_actual": { "type": "number" }
  }
}

Inventory Lists

{
  "type": "object",
  "properties": {
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "sku": { "type": "string" },
          "name": { "type": "string" },
          "category": { "type": "string" },
          "quantity_on_hand": { "type": "integer" },
          "reorder_point": { "type": "integer" },
          "unit_cost": { "type": "number" },
          "location": { "type": "string" }
        }
      }
    }
  }
}
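
Once the inventory comes back in this shape, downstream logic stays short. As a hedged illustration, the sketch below flags items that need restocking; the sample data is invented, and treating "at or below the reorder point" as the trigger is an assumption about your own business rule.

```python
def items_to_reorder(inventory):
    """Return SKUs whose stock is at or below the reorder point."""
    return [
        item["sku"]
        for item in inventory["items"]
        if item["quantity_on_hand"] <= item["reorder_point"]
    ]

# Invented sample shaped like the inventory schema above (fields trimmed).
inventory = {
    "items": [
        {"sku": "A-100", "quantity_on_hand": 12, "reorder_point": 20},
        {"sku": "B-200", "quantity_on_hand": 75, "reorder_point": 30},
        {"sku": "C-300", "quantity_on_hand": 5, "reorder_point": 5},
    ]
}
print(items_to_reorder(inventory))  # ['A-100', 'C-300']
```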

Employee Directories

{
  "type": "object",
  "properties": {
    "employees": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "employee_id": { "type": "string" },
          "department": { "type": "string" },
          "title": { "type": "string" },
          "email": { "type": "string", "format": "email" },
          "start_date": { "type": "string", "format": "date" }
        }
      }
    }
  }
}
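
In JSON Schema, "format": "date" refers to the YYYY-MM-DD form. Whether every extracted value actually arrives in that form is worth verifying rather than assuming, so here is a small sketch that lists employees whose start_date does not parse. The helper name and the sample records are illustrative.

```python
from datetime import date

def invalid_start_dates(directory):
    """Return employee_ids whose start_date is missing or not an ISO date."""
    bad = []
    for employee in directory["employees"]:
        try:
            date.fromisoformat(employee["start_date"])
        except (KeyError, TypeError, ValueError):
            bad.append(employee.get("employee_id"))
    return bad

# Invented sample shaped like the employee schema above (fields trimmed).
directory = {
    "employees": [
        {"employee_id": "E-001", "start_date": "2024-03-01"},
        {"employee_id": "E-002", "start_date": "03/01/2024"},
        {"employee_id": "E-003", "start_date": None},
    ]
}
print(invalid_start_dates(directory))  # ['E-002', 'E-003']
```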

Supported Formats

Format             Extension  Description
Excel (modern)     .xlsx      Standard Excel format
Excel with macros  .xlsm      Excel with macros (macros are ignored, data is extracted)
CSV                .csv       Comma-separated values (also supported via pipeline)

Tips for Better Spreadsheet Extraction

  1. Name your fields after column headers — If the spreadsheet has a "Product Name" column, use product_name in your schema
  2. Use arrays for tabular data — Spreadsheet rows map naturally to JSON arrays
  3. Include summary fields — Extract totals and subtotals alongside row data for validation
  4. Use integer for counts — Quantities, headcounts, and other whole numbers should use integer type
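
Tip 3 can be turned into an automatic check. The sketch below recomputes the totals from the rows of the sales result shown earlier and compares them with the extracted summary fields; totals_match and the one-cent tolerance are assumptions chosen for illustration.

```python
import math

def totals_match(result, tolerance=0.01):
    """Compare row sums with the extracted total_revenue and total_units."""
    rows = result["sales_data"]
    revenue_ok = math.isclose(
        sum(row["revenue"] for row in rows),
        result["total_revenue"],
        abs_tol=tolerance,  # absorbs floating-point rounding
    )
    units_ok = sum(row["units_sold"] for row in rows) == result["total_units"]
    return revenue_ok and units_ok

# Values copied from the sales report result above (fields trimmed).
result = {
    "sales_data": [
        {"units_sold": 1250, "revenue": 31237.50},
        {"units_sold": 840, "revenue": 20997.60},
        {"units_sold": 3100, "revenue": 38750.00},
        {"units_sold": 2200, "revenue": 27500.00},
    ],
    "total_revenue": 118485.10,
    "total_units": 7390,
}
print(totals_match(result))  # True
```

A mismatch does not say which side is wrong, but it is a cheap signal that a row was missed or a total was read from the wrong cell.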

Try It Now

Upload a spreadsheet in the Playground and see structured data in seconds. It works with .xlsx and .xlsm files, however messy the formatting.

For API details, see the documentation.