Convert Word Documents (DOCX) to JSON via API

February 17, 2026 · Smole Team

Word documents are a staple of business operations. Contracts are drafted in Word. Reports are written in Word. Forms, proposals, and policy documents — all Word. But when you need to pull structured data out of them, you're stuck with manual copy-paste or fragile parsing scripts.

This guide shows how to convert DOCX files to structured JSON using a REST API.

Why DOCX to JSON?

Word documents store content as rich text with formatting — bold, italic, tables, headers. But your systems don't need formatting. They need data: the contract start date, the vendor name, the line items, the approval status.

Converting DOCX to JSON lets you:

  • Feed document data into databases without manual entry
  • Automate workflows that depend on document content
  • Build searchable indexes from document archives
  • Integrate document data with your existing tools and APIs

How DOCX Extraction Works

Step 1: Document Conversion

The DOCX file is converted to clean Markdown. This step preserves the document's structure — headings, paragraphs, tables, and lists — while stripping away formatting noise like fonts, colors, and page layout.
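
To make the conversion step concrete, here is a minimal Python sketch of how a DOCX file (a ZIP archive containing word/document.xml) can be reduced to rough Markdown. It is an illustration only, not the service's actual converter: the function name docx_to_markdown is hypothetical, and the sketch assumes headings use Word's default "Heading1" to "Heading6" style IDs.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_markdown(docx_bytes: bytes) -> str:
    """Turn the paragraphs of a DOCX file into rough Markdown.

    Headings (style IDs "Heading1".."Heading6") become "#" lines. Everything
    else, including text inside table cells, comes out as plain paragraphs.
    List numbering and images are ignored in this sketch.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for para in root.iter(W + "p"):
        # Join all text runs; fonts and colors live in w:rPr and are skipped.
        text = "".join(node.text or "" for node in para.iter(W + "t")).strip()
        if not text:
            continue
        style = para.find(f"{W}pPr/{W}pStyle")
        style_id = style.get(W + "val", "") if style is not None else ""
        if style_id.startswith("Heading") and style_id[7:].isdigit():
            text = "#" * int(style_id[7:]) + " " + text
        lines.append(text)
    return "\n\n".join(lines)
```

Bold runs, font choices, and page layout never reach the output, which is the "formatting noise" being stripped.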

Step 2: Schema-Based Extraction

The Markdown content is then analyzed against your JSON schema. The extraction engine reads each passage in context and maps the matching content to your schema fields.

Unlike regex-based parsing, schema-based extraction does not depend on exact positions or patterns in the text, so it is far more tolerant of layout changes. A new paragraph, a reordered section, or different formatting should not change the results.

Extracting Data from a Word Document

Example: A Proposal Document

Say you receive proposals in Word format and need to extract key details for your CRM.

Define a schema:

{
  "type": "object",
  "properties": {
    "company_name": { "type": "string" },
    "contact_person": { "type": "string" },
    "proposal_date": { "type": "string", "format": "date" },
    "project_name": { "type": "string" },
    "project_description": { "type": "string" },
    "deliverables": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "timeline": { "type": "string" },
          "cost": { "type": "number" }
        }
      }
    },
    "total_cost": { "type": "number" },
    "payment_terms": { "type": "string" },
    "validity_period": { "type": "string" }
  }
}

Upload the DOCX, passing the ID of your saved schema as schemaId:

curl -X POST https://api.smole.tech/api/pipeline/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@proposal.docx" \
  -F "schemaId=YOUR_SCHEMA_ID"
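
The same request can be sketched in Python using only the standard library. The endpoint, the Bearer header, and the file and schemaId form fields come from the curl command above. The helper names build_multipart and extract_docx are hypothetical, and the sketch assumes the response body is JSON.

```python
import json
import uuid
import urllib.request
from pathlib import Path

API_URL = "https://api.smole.tech/api/pipeline/file"  # endpoint from the curl example
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def build_multipart(fields: dict, filename: str, file_bytes: bytes):
    """Encode text fields plus one file part as multipart/form-data.

    Returns (body, content_type), where content_type carries the boundary.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            f"Content-Type: {DOCX_MIME}\r\n\r\n"
        ).encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def extract_docx(path: str, schema_id: str, api_key: str) -> dict:
    """POST a DOCX file and return the parsed response (assumed to be JSON)."""
    body, content_type = build_multipart(
        {"schemaId": schema_id}, Path(path).name, Path(path).read_bytes()
    )
    request = urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call would look like extract_docx("proposal.docx", "YOUR_SCHEMA_ID", "YOUR_API_KEY"). Error handling and retries are left out to keep the sketch short.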

Get structured JSON:

{
  "company_name": "Digital Solutions AG",
  "contact_person": "Sarah Hoffmann",
  "proposal_date": "2025-10-15",
  "project_name": "E-Commerce Platform Redesign",
  "project_description": "Complete redesign of the customer-facing e-commerce platform including mobile optimization and checkout flow improvements.",
  "deliverables": [
    { "name": "UX Research & Wireframes", "timeline": "4 weeks", "cost": 12000 },
    { "name": "UI Design", "timeline": "3 weeks", "cost": 9000 },
    { "name": "Frontend Development", "timeline": "8 weeks", "cost": 32000 },
    { "name": "Testing & Launch", "timeline": "2 weeks", "cost": 7000 }
  ],
  "total_cost": 60000,
  "payment_terms": "30% upfront, 40% at midpoint, 30% on delivery",
  "validity_period": "30 days"
}
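
Before pushing a result like this into a CRM, a quick consistency check can catch extraction mistakes. The sketch below verifies that the deliverable costs add up to total_cost, using the figures from the response above. The function name check_total and the tolerance parameter are illustrative assumptions, not part of the API.

```python
import json


def check_total(proposal: dict, tolerance: float = 0.01) -> bool:
    """Return True when the deliverable costs add up to total_cost."""
    line_sum = sum(item["cost"] for item in proposal.get("deliverables", []))
    return abs(line_sum - proposal["total_cost"]) <= tolerance


# Trimmed copy of the example response: only the fields the check needs.
result = json.loads("""
{
  "deliverables": [
    { "name": "UX Research & Wireframes", "timeline": "4 weeks", "cost": 12000 },
    { "name": "UI Design", "timeline": "3 weeks", "cost": 9000 },
    { "name": "Frontend Development", "timeline": "8 weeks", "cost": 32000 },
    { "name": "Testing & Launch", "timeline": "2 weeks", "cost": 7000 }
  ],
  "total_cost": 60000
}
""")

print(check_total(result))  # True: 12000 + 9000 + 32000 + 7000 == 60000
```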

Common DOCX Extraction Scenarios

Contracts and Agreements

Extract parties, dates, terms, obligations, and clauses from legal documents:

{
  "type": "object",
  "properties": {
    "agreement_type": { "type": "string" },
    "parties": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "role": { "type": "string" }
        }
      }
    },
    "effective_date": { "type": "string", "format": "date" },
    "termination_date": { "type": "string", "format": "date" },
    "key_terms": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}
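
Once the dates are extracted, downstream logic becomes simple. The following sketch classifies a contract as pending, active, or expired. It assumes the dates come back as ISO 8601 strings (YYYY-MM-DD), as the "format": "date" hint suggests. The function name contract_status and the sample values are hypothetical.

```python
from datetime import date


def contract_status(contract: dict, today: date) -> str:
    """Classify a contract relative to `today` using its extracted dates."""
    effective = date.fromisoformat(contract["effective_date"])
    termination = date.fromisoformat(contract["termination_date"])
    if termination < effective:
        # Likely an extraction error worth flagging for manual review.
        raise ValueError("termination_date precedes effective_date")
    if today < effective:
        return "pending"
    if today > termination:
        return "expired"
    return "active"


# Hypothetical extraction result, shaped like the schema above.
sample_contract = {
    "agreement_type": "Service Agreement",
    "effective_date": "2025-01-01",
    "termination_date": "2025-12-31",
}

print(contract_status(sample_contract, date(2025, 6, 1)))  # active
```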

Reports

Pull key metrics and findings from business reports:

{
  "type": "object",
  "properties": {
    "report_title": { "type": "string" },
    "reporting_period": { "type": "string" },
    "key_metrics": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "metric": { "type": "string" },
          "value": { "type": "string" },
          "change": { "type": "string" }
        }
      }
    },
    "summary": { "type": "string" },
    "recommendations": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}
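
To load extracted metrics into a database or spreadsheet, the nested key_metrics array usually has to be flattened into rows. This is one possible sketch using Python's csv module. The function name metrics_to_csv and the sample report are hypothetical.

```python
import csv
import io

COLUMNS = ["report_title", "reporting_period", "metric", "value", "change"]


def metrics_to_csv(report: dict) -> str:
    """Flatten key_metrics into CSV text, one row per metric."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in report.get("key_metrics", []):
        writer.writerow({
            # Report-level fields are repeated on every row.
            "report_title": report.get("report_title", ""),
            "reporting_period": report.get("reporting_period", ""),
            "metric": row.get("metric", ""),
            "value": row.get("value", ""),
            "change": row.get("change", ""),
        })
    return buffer.getvalue()


# Hypothetical extraction result, shaped like the schema above.
sample_report = {
    "report_title": "Q3 Review",
    "reporting_period": "Q3 2025",
    "key_metrics": [{"metric": "Revenue", "value": "1.2M", "change": "+8%"}],
}

print(metrics_to_csv(sample_report))
```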

HR Documents

Extract employee details from onboarding forms, reviews, or policy documents.

DOCX vs PDF: Extraction Differences

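| Aspect           | DOCX                               | PDF                                   |
|------------------|------------------------------------|---------------------------------------|
| Text quality     | Always clean text                  | May need OCR for scans                |
| Structure        | Headings, lists, tables preserved  | Layout-dependent                      |
| Tables           | Well-structured                    | Can be inconsistent                   |
| Processing speed | Faster                             | Slightly slower (especially with OCR) |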

DOCX files generally produce better extraction results because the text is already structured. PDFs, especially scanned ones, add an extra layer of complexity.

Handling Tables in Word Documents

Word documents often contain tables such as line items, comparison matrices, and schedules. Model each table as an array of objects, with one object per table row:

{
  "type": "object",
  "properties": {
    "table_data": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "column_a": { "type": "string" },
          "column_b": { "type": "number" },
          "column_c": { "type": "string" }
        }
      }
    }
  }
}

Name your array item properties to match the table's column headers for best results.
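
If you have many tables, the schema can be generated from the column headers instead of written by hand. The sketch below builds a schema of the shape shown above. Converting headers to snake_case is an assumption about what counts as a "match", and the helper names header_to_property and table_schema are hypothetical.

```python
import json
import re


def header_to_property(header: str) -> str:
    """Lowercase a column header and join its words with underscores."""
    return re.sub(r"[^a-z0-9]+", "_", header.lower()).strip("_")


def table_schema(columns: dict, array_name: str = "table_data") -> dict:
    """Build a table schema from a {header text: JSON Schema type} mapping."""
    return {
        "type": "object",
        "properties": {
            array_name: {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        header_to_property(header): {"type": json_type}
                        for header, json_type in columns.items()
                    },
                },
            }
        },
    }


schema = table_schema({"Item": "string", "Quantity": "number", "Unit Price (EUR)": "number"})
print(json.dumps(schema, indent=2))
```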

Try It Now

Upload a Word document in the Playground and define your schema. See structured JSON results in seconds — no code required.

For API integration, see the documentation.