Legacy PDF Digitization

Many organizations hold years of archived PDFs and scanned documents whose contents cannot be searched or queried. Converting them into structured data makes that content usable for analytics, compliance, and process automation.

PDF digitization: from scanned document to structured JSON

Input document
[Scanned Document — Employee Record]

PERSONNEL FILE
Document Date: March 14, 2019
Classification: CONFIDENTIAL

Employee Information:
  Full Name: Maria Schneider
  Employee ID: EMP-2019-0342
  Department: Engineering
  Position: Senior Software Engineer
  Start Date: April 1, 2019
  Reports To: Thomas Weber, VP Engineering

Contact Details:
  Email: m.schneider@company.de
  Phone: +49 30 1234 5678
  Address: Berliner Str. 45, 10715 Berlin

Compensation:
  Annual Salary: 85,000 EUR
  Pay Grade: E5
  Review Cycle: Annual (next: April 2020)

Certifications:
  - AWS Solutions Architect (2018)
  - Certified Scrum Master (2017)
  - ISO 27001 Lead Auditor (2019)

Emergency Contact:
  Name: Klaus Schneider
  Relationship: Spouse
  Phone: +49 30 9876 5432

Extracted JSON
{
  "document_date": "2019-03-14",
  "classification": "CONFIDENTIAL",
  "employee": {
    "full_name": "Maria Schneider",
    "employee_id": "EMP-2019-0342",
    "department": "Engineering",
    "position": "Senior Software Engineer",
    "start_date": "2019-04-01",
    "reports_to": {
      "name": "Thomas Weber",
      "title": "VP Engineering"
    }
  },
  "contact": {
    "email": "m.schneider@company.de",
    "phone": "+49 30 1234 5678",
    "address": "Berliner Str. 45, 10715 Berlin"
  },
  "compensation": {
    "annual_salary": 85000,
    "currency": "EUR",
    "pay_grade": "E5",
    "next_review": "2020-04-01"
  },
  "certifications": [
    { "name": "AWS Solutions Architect", "year": 2018 },
    { "name": "Certified Scrum Master", "year": 2017 },
    { "name": "ISO 27001 Lead Auditor", "year": 2019 }
  ],
  "emergency_contact": {
    "name": "Klaus Schneider",
    "relationship": "Spouse",
    "phone": "+49 30 9876 5432"
  }
}
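
The extracted JSON normalizes several values from the scan: "March 14, 2019" becomes "2019-03-14", "85,000 EUR" becomes a number plus a currency code, and each certification line is split into a name and a year. How Smole performs this internally is not described here. The Python sketch below only illustrates equivalent normalization rules; the helper names (normalize_date, split_amount, split_certification) are hypothetical and not part of any Smole API.

```python
import re
from datetime import datetime
from typing import Dict, Tuple, Union


def normalize_date(text: str) -> str:
    """Convert a date such as 'March 14, 2019' to ISO 8601 ('2019-03-14').

    Assumes English month names (the default C locale).
    """
    return datetime.strptime(text.strip(), "%B %d, %Y").date().isoformat()


def split_amount(text: str) -> Tuple[int, str]:
    """Split '85,000 EUR' into an integer amount and a 3-letter currency code."""
    match = re.fullmatch(r"\s*([\d,]+)\s+([A-Z]{3})\s*", text)
    if match is None:
        raise ValueError(f"unrecognized amount: {text!r}")
    return int(match.group(1).replace(",", "")), match.group(2)


def split_certification(text: str) -> Dict[str, Union[str, int]]:
    """Split 'AWS Solutions Architect (2018)' into a name and a year."""
    match = re.fullmatch(r"\s*(.+?)\s*\((\d{4})\)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized certification: {text!r}")
    return {"name": match.group(1), "year": int(match.group(2))}


if __name__ == "__main__":
    print(normalize_date("March 14, 2019"))
    print(split_amount("85,000 EUR"))
    print(split_certification("AWS Solutions Architect (2018)"))
```

Storing dates as ISO 8601 strings and amounts as plain numbers keeps the output sortable and directly usable in analytics tools, which is the point of digitizing the archive in the first place.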

Define your schema

Describe the fields you want Smole to extract as a JSON Schema. The schema below corresponds to the extracted JSON shown above.

{
  "type": "object",
  "properties": {
    "document_date": { "type": "string", "format": "date" },
    "classification": { "type": "string" },
    "employee": {
      "type": "object",
      "properties": {
        "full_name": { "type": "string" },
        "employee_id": { "type": "string" },
        "department": { "type": "string" },
        "position": { "type": "string" },
        "start_date": { "type": "string", "format": "date" },
        "reports_to": {
          "type": "object",
          "properties": {
            "name": { "type": "string" },
            "title": { "type": "string" }
          }
        }
      }
    },
    "contact": {
      "type": "object",
      "properties": {
        "email": { "type": "string", "format": "email" },
        "phone": { "type": "string" },
        "address": { "type": "string" }
      }
    },
    "compensation": {
      "type": "object",
      "properties": {
        "annual_salary": { "type": "number" },
        "currency": { "type": "string" },
        "pay_grade": { "type": "string" },
        "next_review": { "type": "string", "format": "date" }
      }
    },
    "certifications": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "year": { "type": "integer" }
        }
      }
    },
    "emergency_contact": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "relationship": { "type": "string" },
        "phone": { "type": "string" }
      }
    }
  }
}
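
Once a schema like this exists, it can also serve as a sanity check on extraction results before they enter downstream systems. The following is a minimal Python sketch using only the standard library. It checks just the "type", "properties", and "items" keywords used above, ignores "format", and is not a full JSON Schema validator. The function name check and the trimmed sample data are assumptions made for illustration.

```python
# Maps the JSON Schema "type" keywords used in this schema to Python types.
TYPE_MAP = {
    "object": dict,
    "array": list,
    "string": str,
    "integer": int,
    "number": (int, float),
}


def check(value, schema, path="$"):
    """Return a list of type mismatches found in value (empty if none)."""
    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(value, TYPE_MAP[expected])
        # bool is a subclass of int in Python, so reject it for numeric types.
        if isinstance(value, bool) and expected in ("integer", "number"):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}, got {type(value).__name__}"]

    errors = []
    if isinstance(value, dict):
        # Only properties present in the data are checked; none are required.
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                errors += check(value[key], sub_schema, f"{path}.{key}")
    elif isinstance(value, list) and "items" in schema:
        for index, item in enumerate(value):
            errors += check(item, schema["items"], f"{path}[{index}]")
    return errors


# A trimmed version of the schema above, kept short for the example.
schema = {
    "type": "object",
    "properties": {
        "compensation": {
            "type": "object",
            "properties": {"annual_salary": {"type": "number"}},
        },
        "certifications": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"year": {"type": "integer"}},
            },
        },
    },
}

good = {
    "compensation": {"annual_salary": 85000},
    "certifications": [{"name": "AWS Solutions Architect", "year": 2018}],
}
bad = {
    "compensation": {"annual_salary": "85,000 EUR"},
    "certifications": [{"name": "AWS Solutions Architect", "year": "2018"}],
}

if __name__ == "__main__":
    print(check(good, schema))
    print(check(bad, schema))
```

For production use, a complete validator library is the safer choice; this sketch only shows how the schema's type declarations translate into concrete checks on the output.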

Try with your own documents

Upload a document and define your schema. See results in seconds.