# PDF Parsing

Extract structured content from PDF documents, with optimizations for financial documents.

#### Endpoint

```
POST /parser/pdf
```

#### Description

This endpoint parses PDFs and is optimized for financial documents. It extracts text while preserving layout, identifies tables, recognizes financial statements, and maintains the document's structural hierarchy.

#### Parameters

| Parameter        | Type   | Required | Description                                      |
| ---------------- | ------ | -------- | ------------------------------------------------ |
| file             | binary | Yes\*    | PDF file to parse (max: 50MB)                    |
| url              | string | Yes\*    | URL of PDF to parse (alternative to file upload) |
| parsing\_options | object | No       | Parsing configuration options                    |

\*Exactly one of `file` or `url` is required; do not send both.
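
The either/or rule for `file` and `url` can be checked on the client before sending anything. The following is a minimal sketch; the helper name `choose_source` is hypothetical and not part of the API.

```python
from typing import Optional


def choose_source(file_path: Optional[str] = None, url: Optional[str] = None) -> str:
    """Return "file" or "url", enforcing that exactly one source is given.

    `choose_source` is an illustrative helper, not something the API provides.
    """
    # True when both are missing or both are present; either case is invalid.
    if (file_path is None) == (url is None):
        raise ValueError("Provide exactly one of `file` or `url`, not both")
    return "file" if file_path is not None else "url"
```

Calling `choose_source(url="https://example.com/financial_report.pdf")` returns `"url"`; passing both arguments, or neither, raises `ValueError`.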

**Parsing Options:**


```json
{
  "ocr_enabled": true,              // Enable OCR for scanned documents
  "table_extraction": true,         // Extract tables as structured data
  "preserve_layout": true,          // Maintain original layout structure
  "page_range": [1, 10],            // Specific pages to parse
  "financial_mode": true,           // Optimize for financial documents
  "extract_headers_footers": false, // Include headers/footers
  "detect_signatures": true,        // Identify signature blocks
  "language": "en"                  // Document language for OCR
}
```
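
The annotated object above uses comments for explanation, which strict JSON does not allow, so the options should be serialized without them. The sketch below builds the `parsing_options` string from keyword arguments and rejects unknown keys. It assumes `page_range` is a 1-based `[start, end]` pair, which the example suggests but this page does not state; `build_parsing_options` is a hypothetical helper name.

```python
import json

# Option names taken from the Parsing Options listing above.
KNOWN_OPTIONS = {
    "ocr_enabled", "table_extraction", "preserve_layout", "page_range",
    "financial_mode", "extract_headers_footers", "detect_signatures", "language",
}


def build_parsing_options(**options) -> str:
    """Serialize parsing options to comment-free JSON (illustrative helper)."""
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"Unknown parsing options: {sorted(unknown)}")

    page_range = options.get("page_range")
    if page_range is not None:
        # Assumption: page_range is [start, end], 1-based, with start <= end.
        start, end = page_range
        if start < 1 or end < start:
            raise ValueError("page_range must be [start, end] with 1 <= start <= end")
        options["page_range"] = [start, end]

    return json.dumps(options)
```

The returned string can be passed as the `parsing_options` form field in a file upload, or parsed back into an object for a JSON request body.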

#### Example Request (File Upload)


```bash
curl -X POST "https://api.orbitfin.ai/v1/parser/pdf" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -F "file=@annual_report.pdf" \
  -F 'parsing_options={"financial_mode":true,"table_extraction":true}'
```

#### Example Request (URL)


```bash
curl -X POST "https://api.orbitfin.ai/v1/parser/pdf" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/financial_report.pdf",
    "parsing_options": {
      "financial_mode": true,
      "table_extraction": true,
      "page_range": [1, 50]
    }
  }'
```
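
For callers working in Python, the same URL-based request can be assembled with the standard library. This sketch only builds the request object and mirrors the headers and body of the curl example; `build_url_request` is a hypothetical helper, and the timeout in the commented-out send step is an arbitrary choice rather than a documented value.

```python
import json
import urllib.request
from typing import Optional

API_URL = "https://api.orbitfin.ai/v1/parser/pdf"


def build_url_request(pdf_url: str, token: str,
                      parsing_options: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request for URL-based parsing."""
    payload = {"url": pdf_url}
    if parsing_options:
        payload["parsing_options"] = parsing_options
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sending is left to the caller, for example:
# request = build_url_request("https://example.com/financial_report.pdf", "YOUR_TOKEN")
# with urllib.request.urlopen(request, timeout=120) as resp:
#     result = json.load(resp)
```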

#### Response


```json
{
  "status": "success",
  "credits_used": 4.5,
  "data": {
    "document_info": {
      "title": "Annual Report 2024",
      "author": "Apple Inc.",
      "subject": "Financial Statements",
      "pages": 45,
      "creation_date": "2024-02-15T10:30:00Z",
      "file_size": 3456789,
      "is_scanned": false,
      "has_forms": false
    },
    "content": {
      "pages": [
        {
          "page_number": 1,
          "width": 612,
          "height": 792,
          "blocks": [
            {
              "block_id": "blk_001",
              "type": "heading",
              "level": 1,
              "text": "Annual Report 2024",
              "bbox": [50, 700, 550, 750],
              "font": {
                "name": "Helvetica-Bold",
                "size": 24
              },
              "confidence": 0.99
            },
            {
              "block_id": "blk_002",
              "type": "paragraph",
              "text": "This annual report contains forward-looking statements...",
              "bbox": [50, 600, 550, 680],
              "font": {
                "name": "Helvetica",
                "size": 11
              },
              "confidence": 0.98
            },
            {
              "block_id": "blk_003",
              "type": "table",
              "title": "Consolidated Statement of Income",
              "bbox": [50, 200, 550, 500],
              "confidence": 0.97,
              "table_data": {
                "headers": [
                  ["", "2024", "2023", "2022"],
                  ["(In millions, except per share data)", "", "", ""]
                ],
                "rows": [
                  ["Net Revenue", "$394,328", "$383,285", "$365,817"],
                  ["Cost of Sales", "214,137", "209,786", "201,471"],
                  ["Gross Profit", "180,191", "173,499", "164,346"],
                  ["Operating Expenses:", "", "", ""],
                  ["  Research and Development", "29,915", "26,251", "23,456"],
                  ["  Selling, General and Administrative", "25,094", "24,932", "22,876"],
                  ["Total Operating Expenses", "55,009", "51,183", "46,332"],
                  ["Operating Income", "125,182", "122,316", "118,014"]
                ],
                "detected_type": "financial_statement",
                "currency": "USD",
                "period_type": "annual"
              }
            }
          ]
        }
      ]
    },
    "extracted_entities": {
      "monetary_values": [
        {
          "value": 394328000000,
          "currency": "USD",
          "context": "Net Revenue 2024",
          "page": 1
        }
      ],
      "dates": [
        {
          "date": "2024-12-31",
          "context": "Fiscal Year End",
          "page": 1
        }
      ],
      "percentages": [
        {
          "value": 45.7,
          "context": "Gross Margin",
          "page": 2
        }
      ]
    },
    "document_structure": {
      "sections": [
        {
          "title": "Financial Highlights",
          "start_page": 1,
          "end_page": 3,
          "subsections": []
        },
        {
          "title": "Management Discussion and Analysis",
          "start_page": 4,
          "end_page": 25,
          "subsections": [
            {
              "title": "Overview",
              "start_page": 4,
              "end_page": 6
            }
          ]
        }
      ],
      "table_of_contents": [
        {
          "title": "Financial Highlights",
          "page": 1
        },
        {
          "title": "Letter to Shareholders",
          "page": 3
        }
      ]
    },
    "quality_metrics": {
      "overall_confidence": 0.96,
      "ocr_required": false,
      "extraction_warnings": [
        "Page 45: Low quality scan detected"
      ]
    }
  }
}
```
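
Table cells in the response arrive as display strings such as `"$394,328"`, so most consumers will want to convert them to numbers. The sketch below walks the `pages` and `blocks` arrays shown above and turns a table block into a dictionary keyed by row label and column header. The helper names are hypothetical, and the handling of parenthesized negatives is an assumption, since the sample response contains none.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(cell: str):
    """Convert a cell such as "$394,328" to Decimal; return None for blanks or text."""
    cleaned = cell.strip().replace("$", "").replace(",", "")
    if not cleaned:
        return None
    # Assumption: accounting-style negatives like "(1,234)" may appear.
    if cleaned.startswith("(") and cleaned.endswith(")"):
        cleaned = "-" + cleaned[1:-1]
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def iter_tables(response: dict):
    """Yield (page_number, block) for every block whose type is "table"."""
    for page in response["data"]["content"]["pages"]:
        for block in page["blocks"]:
            if block["type"] == "table":
                yield page["page_number"], block


def table_to_records(block: dict) -> dict:
    """Map each row label to {column header: value}, using the first header row."""
    table = block["table_data"]
    columns = table["headers"][0][1:]  # skip the empty label column
    records = {}
    for label, *cells in table["rows"]:
        records[label.strip()] = dict(zip(columns, (parse_amount(c) for c in cells)))
    return records
```

Values are in the table's stated unit (here, millions), so any scaling to absolute amounts is left to the caller.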

#### Block Types

| Type         | Description                              |
| ------------ | ---------------------------------------- |
| heading      | Title or section header with level (1-6) |
| paragraph    | Standard text paragraph                  |
| table        | Structured table with rows and columns   |
| list         | Bulleted or numbered list                |
| image        | Image or chart (base64 encoded)          |
| footnote     | Footnote or endnote text                 |
| header       | Page header content                      |
| footer       | Page footer content                      |
| page\_number | Page numbering                           |
| signature    | Signature block                          |

#### Financial Mode Features

When `financial_mode` is enabled, the parser:

* Identifies standard financial statements (Income Statement, Balance Sheet, Cash Flow)
* Recognizes XBRL-like structures
* Extracts monetary values with currency detection
* Preserves table relationships and calculations
* Identifies fiscal periods and dates
* Detects auditor signatures and opinions
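
Because table calculations are preserved, a client can sanity-check an extracted statement by recomputing its subtotals. The sketch below is one possible client-side check, not something the API performs: it verifies that one row equals the difference of two others, column by column, using the row layout from the sample response. Both helper names are hypothetical.

```python
def to_number(cell: str) -> int:
    """Parse a whole-number cell such as "$394,328" or "214,137"."""
    return int(cell.replace("$", "").replace(",", ""))


def check_difference(rows: list, minuend: str, subtrahend: str, result: str) -> list:
    """Return the column indexes where minuend - subtrahend != result."""
    by_label = {row[0].strip(): row[1:] for row in rows}
    mismatches = []
    columns = zip(by_label[minuend], by_label[subtrahend], by_label[result])
    for index, (a, b, r) in enumerate(columns):
        if to_number(a) - to_number(b) != to_number(r):
            mismatches.append(index)
    return mismatches
```

For the sample income statement, `check_difference(rows, "Net Revenue", "Cost of Sales", "Gross Profit")` returns an empty list, meaning all three fiscal years reconcile.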

#### Error Handling

Additional error codes specific to PDF parsing:

| Code             | Description                                |
| ---------------- | ------------------------------------------ |
| FILE\_TOO\_LARGE | PDF exceeds 50MB limit                     |
| INVALID\_PDF     | File is corrupted or not a valid PDF       |
| PARSING\_TIMEOUT | Document too complex, parsing timed out    |
| OCR\_FAILED      | OCR processing failed for scanned document |
| ENCRYPTED\_PDF   | PDF is password protected                  |

