PDF Parsing

Extract structured content from PDF documents, with optimizations for financial documents.

Endpoint

POST /parser/pdf

Description

This endpoint parses PDFs with optimizations for financial documents. It extracts text while preserving layout, identifies tables, recognizes financial statements, and retains the document's structural hierarchy.

Parameters

Parameter        Type    Required  Description
file             binary  Yes*      PDF file to parse (max: 50MB)
url              string  Yes*      URL of PDF to parse (alternative to file upload)
parsing_options  object  No        Parsing configuration options

*Either file or url is required, not both.
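The either/or rule can be checked on the client before any request is sent. The following is a minimal sketch; the function name `check_source` and its arguments are hypothetical and not part of the API.

```python
from typing import Optional


def check_source(file_bytes: Optional[bytes] = None, url: Optional[str] = None) -> str:
    """Return "file" or "url" after confirming exactly one source was supplied."""
    # True when both are missing or both are present; either case breaks the rule.
    if (file_bytes is None) == (url is None):
        raise ValueError("provide exactly one of 'file' or 'url'")
    return "file" if file_bytes is not None else "url"


print(check_source(url="https://example.com/financial_report.pdf"))
```

Whether the service rejects a request that carries both fields or silently prefers one is not stated here, so failing early on the client side is the cautious choice.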

Parsing Options:

json

{
  "ocr_enabled": true,              // Enable OCR for scanned documents
  "table_extraction": true,         // Extract tables as structured data
  "preserve_layout": true,          // Maintain original layout structure
  "page_range": [1, 10],            // Specific pages to parse
  "financial_mode": true,           // Optimize for financial documents
  "extract_headers_footers": false, // Include headers/footers
  "detect_signatures": true,        // Identify signature blocks
  "language": "en"                  // Document language for OCR
}
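The `//` comments in the listing are annotations only. Strict JSON has no comment syntax, so a real payload must omit them. The sketch below builds a comment-free options string. The helper name `build_parsing_options` is hypothetical, the type checks mirror the example values above, and treating `page_range` as a 1-based `[first, last]` pair is an assumption.

```python
import json

# Option names and value types as they appear in the listing above.
KNOWN_OPTIONS = {
    "ocr_enabled": bool,
    "table_extraction": bool,
    "preserve_layout": bool,
    "page_range": list,
    "financial_mode": bool,
    "extract_headers_footers": bool,
    "detect_signatures": bool,
    "language": str,
}


def build_parsing_options(**options) -> str:
    """Validate option names and types, then serialize to compact JSON."""
    for name, value in options.items():
        expected = KNOWN_OPTIONS.get(name)
        if expected is None:
            raise KeyError(f"unknown parsing option: {name}")
        if not isinstance(value, expected):
            raise TypeError(f"{name} should be of type {expected.__name__}")
    page_range = options.get("page_range")
    if page_range is not None:
        # Assumption: [first, last], 1-based, with first <= last.
        if len(page_range) != 2 or not 1 <= page_range[0] <= page_range[1]:
            raise ValueError("page_range should be [first, last] with 1 <= first <= last")
    return json.dumps(options, separators=(",", ":"))


print(build_parsing_options(financial_mode=True, table_extraction=True))
# {"financial_mode":true,"table_extraction":true}
```

The resulting string can be passed as the `parsing_options` form field of a file upload, or parsed back into an object for a JSON request body.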

Example Request (File Upload)

bash

curl -X POST "https://api.orbitfin.ai/v1/parser/pdf" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -F "file=@annual_report.pdf" \
  -F 'parsing_options={"financial_mode":true,"table_extraction":true}'

Example Request (URL)

bash

curl -X POST "https://api.orbitfin.ai/v1/parser/pdf" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/financial_report.pdf",
    "parsing_options": {
      "financial_mode": true,
      "table_extraction": true,
      "page_range": [1, 50]
    }
  }'
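For callers who prefer Python over curl, the URL-based request can be assembled with the standard library alone. This is a sketch rather than an official client: the helper name `build_url_request` is hypothetical, and the endpoint URL and headers are copied from the curl example above. The request object is built but not sent.

```python
import json
import urllib.request

API_URL = "https://api.orbitfin.ai/v1/parser/pdf"  # taken from the curl examples


def build_url_request(pdf_url: str, token: str, parsing_options: dict) -> urllib.request.Request:
    """Build (but do not send) the JSON POST request for a PDF hosted at pdf_url."""
    body = json.dumps({"url": pdf_url, "parsing_options": parsing_options}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_url_request(
    "https://example.com/financial_report.pdf",
    "YOUR_TOKEN",
    {"financial_mode": True, "table_extraction": True, "page_range": [1, 50]},
)
print(request.get_method(), request.full_url)
```

To send it, pass `request` to `urllib.request.urlopen` and decode the response body with `json.load`.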

Response

json

{
  "status": "success",
  "credits_used": 4.5,
  "data": {
    "document_info": {
      "title": "Annual Report 2024",
      "author": "Apple Inc.",
      "subject": "Financial Statements",
      "pages": 45,
      "creation_date": "2024-02-15T10:30:00Z",
      "file_size": 3456789,
      "is_scanned": false,
      "has_forms": false
    },
    "content": {
      "pages": [
        {
          "page_number": 1,
          "width": 612,
          "height": 792,
          "blocks": [
            {
              "block_id": "blk_001",
              "type": "heading",
              "level": 1,
              "text": "Annual Report 2024",
              "bbox": [50, 700, 550, 750],
              "font": {
                "name": "Helvetica-Bold",
                "size": 24
              },
              "confidence": 0.99
            },
            {
              "block_id": "blk_002",
              "type": "paragraph",
              "text": "This annual report contains forward-looking statements...",
              "bbox": [50, 600, 550, 680],
              "font": {
                "name": "Helvetica",
                "size": 11
              },
              "confidence": 0.98
            },
            {
              "block_id": "blk_003",
              "type": "table",
              "title": "Consolidated Statement of Income",
              "bbox": [50, 200, 550, 500],
              "confidence": 0.97,
              "table_data": {
                "headers": [
                  ["", "2024", "2023", "2022"],
                  ["(In millions, except per share data)", "", "", ""]
                ],
                "rows": [
                  ["Net Revenue", "$394,328", "$383,285", "$365,817"],
                  ["Cost of Sales", "214,137", "209,786", "201,471"],
                  ["Gross Profit", "180,191", "173,499", "164,346"],
                  ["Operating Expenses:", "", "", ""],
                  ["  Research and Development", "29,915", "26,251", "23,456"],
                  ["  Selling, General and Administrative", "25,094", "24,932", "22,876"],
                  ["Total Operating Expenses", "55,009", "51,183", "46,332"],
                  ["Operating Income", "125,182", "122,316", "118,014"]
                ],
                "detected_type": "financial_statement",
                "currency": "USD",
                "period_type": "annual"
              }
            }
          ]
        }
      ]
    },
    "extracted_entities": {
      "monetary_values": [
        {
          "value": 394328000000,
          "currency": "USD",
          "context": "Net Revenue 2024",
          "page": 1
        }
      ],
      "dates": [
        {
          "date": "2024-12-31",
          "context": "Fiscal Year End",
          "page": 1
        }
      ],
      "percentages": [
        {
          "value": 45.7,
          "context": "Gross Margin",
          "page": 2
        }
      ]
    },
    "document_structure": {
      "sections": [
        {
          "title": "Financial Highlights",
          "start_page": 1,
          "end_page": 3,
          "subsections": []
        },
        {
          "title": "Management Discussion and Analysis",
          "start_page": 4,
          "end_page": 25,
          "subsections": [
            {
              "title": "Overview",
              "start_page": 4,
              "end_page": 6
            }
          ]
        }
      ],
      "table_of_contents": [
        {
          "title": "Financial Highlights",
          "page": 1
        },
        {
          "title": "Letter to Shareholders",
          "page": 3
        }
      ]
    },
    "quality_metrics": {
      "overall_confidence": 0.96,
      "ocr_required": false,
      "extraction_warnings": [
        "Page 45: Low quality scan detected"
      ]
    }
  }
}
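Table cells arrive as display strings (for example "$394,328"), so most consumers will want to convert them to numbers. The sketch below walks `data.content.pages[].blocks[]` in a trimmed copy of the response above. The helper names are hypothetical, and the parsing covers only the cell formats shown in the sample; negative amounts in parentheses, for instance, are not handled.

```python
from typing import Iterator, Optional


def iter_blocks(response: dict, block_type: Optional[str] = None) -> Iterator[dict]:
    """Yield every block in page order, optionally filtered by its "type"."""
    for page in response["data"]["content"]["pages"]:
        for block in page["blocks"]:
            if block_type is None or block["type"] == block_type:
                yield block


def parse_amount(cell: str) -> Optional[int]:
    """Convert "$394,328" to 394328; return None for blank or non-numeric cells."""
    cleaned = cell.replace("$", "").replace(",", "").strip()
    return int(cleaned) if cleaned.isdigit() else None


def table_to_dict(block: dict) -> dict:
    """Map each row label to its numeric cells, skipping rows without numbers."""
    result = {}
    for label, *cells in block["table_data"]["rows"]:
        values = [parse_amount(cell) for cell in cells]
        if any(value is not None for value in values):
            result[label.strip()] = values
    return result


# Trimmed copy of the sample response.
sample = {"data": {"content": {"pages": [{"page_number": 1, "blocks": [
    {"block_id": "blk_001", "type": "heading", "text": "Annual Report 2024"},
    {"block_id": "blk_003", "type": "table", "table_data": {"rows": [
        ["Net Revenue", "$394,328", "$383,285", "$365,817"],
        ["Cost of Sales", "214,137", "209,786", "201,471"],
        ["Operating Expenses:", "", "", ""],
    ]}},
]}]}}}

income = table_to_dict(next(iter_blocks(sample, "table")))
print(income["Net Revenue"])  # [394328, 383285, 365817]
```

The table header states that figures are in millions, which is consistent with the `monetary_values` entry in the sample: 394,328 × 1,000,000 = 394328000000.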

Block Types

Type         Description
heading      Title or section header with level (1-6)
paragraph    Standard text paragraph
table        Structured table with rows and columns
list         Bulleted or numbered list
image        Image or chart (base64 encoded)
footnote     Footnote or endnote text
header       Page header content
footer       Page footer content
page_number  Page numbering
signature    Signature block
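A consumer typically branches on the block `type`. The sketch below renders blocks as a plain-text outline and drops page furniture (header, footer, page_number). The function name `render_block` is hypothetical. Only the heading, paragraph, and table fields appear in the sample response, so the handling of every other type is an assumption.

```python
from typing import Optional

# Types that repeat on every page rather than carrying body content.
PAGE_FURNITURE = {"header", "footer", "page_number"}


def render_block(block: dict) -> Optional[str]:
    """Return one plain-text line for a block, or None when it should be skipped."""
    kind = block["type"]
    if kind in PAGE_FURNITURE:
        return None
    if kind == "heading":
        # Indent by heading level so the outline mirrors the document hierarchy.
        indent = "  " * (block.get("level", 1) - 1)
        return indent + block["text"].upper()
    if kind == "table":
        return "[table: " + block.get("title", "untitled") + "]"
    # Assumption: remaining types expose "text"; otherwise emit a placeholder.
    return block.get("text") or "[" + kind + "]"


blocks = [
    {"type": "header", "text": "Page header"},
    {"type": "heading", "level": 1, "text": "Annual Report 2024"},
    {"type": "paragraph", "text": "This annual report contains forward-looking statements..."},
    {"type": "table", "title": "Consolidated Statement of Income"},
    {"type": "image"},
]
lines = [line for line in map(render_block, blocks) if line is not None]
print("\n".join(lines))
```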

Financial Mode Features

When financial_mode is enabled, the parser:

  • Identifies standard financial statements (Income Statement, Balance Sheet, Cash Flow)

  • Recognizes XBRL-like structures

  • Extracts monetary values with currency detection

  • Preserves table relationships and calculations

  • Identifies fiscal periods and dates

  • Detects auditor signatures and opinions
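Because table relationships are preserved, a client can confirm that extracted subtotals reconcile with their components. The sketch below applies the usual income-statement identities to the figures from the sample response. The `reconcile` helper and its list of checks are hypothetical and not part of the API.

```python
# Figures in USD millions, copied from the sample response (2024, 2023, 2022).
statement = {
    "Net Revenue": [394328, 383285, 365817],
    "Cost of Sales": [214137, 209786, 201471],
    "Gross Profit": [180191, 173499, 164346],
    "Research and Development": [29915, 26251, 23456],
    "Selling, General and Administrative": [25094, 24932, 22876],
    "Total Operating Expenses": [55009, 51183, 46332],
    "Operating Income": [125182, 122316, 118014],
}

# Each check: (subtotal row, rows added, rows subtracted).
CHECKS = [
    ("Gross Profit", ["Net Revenue"], ["Cost of Sales"]),
    ("Total Operating Expenses",
     ["Research and Development", "Selling, General and Administrative"], []),
    ("Operating Income", ["Gross Profit"], ["Total Operating Expenses"]),
]


def reconcile(rows: dict) -> list:
    """Return a message for every subtotal that differs from its components."""
    problems = []
    for total, plus, minus in CHECKS:
        for i, reported in enumerate(rows[total]):
            computed = sum(rows[k][i] for k in plus) - sum(rows[k][i] for k in minus)
            if computed != reported:
                problems.append(f"{total}[{i}]: reported {reported}, computed {computed}")
    return problems


print(reconcile(statement))  # []
gross_margin_2024 = round(100 * statement["Gross Profit"][0] / statement["Net Revenue"][0], 1)
print(gross_margin_2024)  # 45.7
```

The computed 2024 gross margin matches the 45.7 reported under `percentages` in the sample response.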

Error Handling

Additional error codes specific to PDF parsing:

Code             Description
FILE_TOO_LARGE   PDF exceeds 50MB limit
INVALID_PDF      File is corrupted or not a valid PDF
PARSING_TIMEOUT  Document too complex, parsing timed out
OCR_FAILED       OCR processing failed for scanned document
ENCRYPTED_PDF    PDF is password protected
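Two of these errors can often be anticipated locally before uploading. The following is a rough sketch under stated assumptions: "50MB" is read as 50 × 1024 × 1024 bytes, the `%PDF-` marker is searched for leniently in the first 1024 bytes, and the `preflight` helper and the remediation hints are hypothetical rather than documented behavior.

```python
from typing import Optional

MAX_FILE_BYTES = 50 * 1024 * 1024  # assumption: "50MB" interpreted as 50 MiB

# Hypothetical client-side hints; the API does not document remediation steps.
HINTS = {
    "FILE_TOO_LARGE": "split or compress the PDF, or parse it in parts",
    "INVALID_PDF": "check that the file is a complete, uncorrupted PDF",
    "PARSING_TIMEOUT": "retry with a narrower page_range",
    "OCR_FAILED": "supply a higher-quality scan",
    "ENCRYPTED_PDF": "remove password protection before uploading",
}


def preflight(pdf_bytes: bytes, max_bytes: int = MAX_FILE_BYTES) -> Optional[str]:
    """Return the error code an upload would likely trigger, or None if it looks fine."""
    if len(pdf_bytes) > max_bytes:
        return "FILE_TOO_LARGE"
    # PDF files carry a "%PDF-" marker at or near the start of the file.
    if b"%PDF-" not in pdf_bytes[:1024]:
        return "INVALID_PDF"
    return None


code = preflight(b"not a pdf")
print(code, "->", HINTS[code])
```

A passing preflight does not guarantee success: encryption, OCR failures, and timeouts can only be reported by the service itself.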
