PDF Parsing
Extract structured content from PDF documents with financial document optimization.
Endpoint
POST /parser/pdf
Description
This endpoint parses PDF documents and is optimized for financial documents. It extracts text while preserving layout, identifies tables, recognizes financial statements, and maintains the document's structural hierarchy.
Parameters
file (binary, required*): PDF file to parse (max: 50MB)
url (string, required*): URL of PDF to parse (alternative to file upload)
parsing_options (object, optional): Parsing configuration options
*Either file or url is required, not both.
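The either-or rule can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: `choose_source` is a hypothetical helper name, and the 50MB limit is read as 50 * 1024 * 1024 bytes, which is an assumption because the documentation does not say whether MB is decimal or binary.

```python
from typing import Optional

MAX_FILE_BYTES = 50 * 1024 * 1024  # assumption: "50MB" interpreted as 50 MiB


def choose_source(file_bytes: Optional[bytes] = None, url: Optional[str] = None) -> str:
    """Return "file" or "url", enforcing that exactly one source is supplied."""
    # Both missing or both present violates the "either file or url" rule.
    if (file_bytes is None) == (url is None):
        raise ValueError("provide exactly one of file or url")
    if file_bytes is not None:
        if len(file_bytes) > MAX_FILE_BYTES:
            raise ValueError("file exceeds the 50MB limit")
        return "file"
    return "url"
```

Rejecting an oversized file locally avoids an upload that would presumably fail with FILE_TOO_LARGE anyway.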
Parsing Options:
json
{
  "ocr_enabled": true,              // Enable OCR for scanned documents
  "table_extraction": true,         // Extract tables as structured data
  "preserve_layout": true,          // Maintain original layout structure
  "page_range": [1, 10],            // Specific pages to parse
  "financial_mode": true,           // Optimize for financial documents
  "extract_headers_footers": false, // Include headers/footers
  "detect_signatures": true,        // Identify signature blocks
  "language": "en"                  // Document language for OCR
}
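The `//` annotations in the listing are explanatory only; strict JSON has no comment syntax, so the payload that is actually sent should leave them out. The sketch below builds a comment-free options string. `build_parsing_options` is a hypothetical helper, and the reading of `page_range` as a 1-based, inclusive `[start, end]` pair is an assumption based on the examples.

```python
import json

# Option names and value types as they appear in the listing above.
KNOWN_OPTIONS = {
    "ocr_enabled": bool,
    "table_extraction": bool,
    "preserve_layout": bool,
    "page_range": list,
    "financial_mode": bool,
    "extract_headers_footers": bool,
    "detect_signatures": bool,
    "language": str,
}


def build_parsing_options(**options) -> str:
    """Validate option names and types, then return a compact JSON string."""
    for name, value in options.items():
        expected = KNOWN_OPTIONS.get(name)
        if expected is None:
            raise KeyError(f"unknown parsing option: {name}")
        if not isinstance(value, expected):
            raise TypeError(f"{name} must be {expected.__name__}")
    page_range = options.get("page_range")
    if page_range is not None:
        # Assumption: [first_page, last_page], 1-based and inclusive.
        if len(page_range) != 2 or not 1 <= page_range[0] <= page_range[1]:
            raise ValueError("page_range must be [start, end] with 1 <= start <= end")
    return json.dumps(options, separators=(",", ":"))


print(build_parsing_options(financial_mode=True, table_extraction=True))
# prints {"financial_mode":true,"table_extraction":true}
```

Only the options passed in are serialized, since the documentation does not state server-side defaults.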
Example Request (File Upload)
bash
curl -X POST "https://api.orbitfin.ai/v1/parser/pdf" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -F "file=@annual_report.pdf" \
  -F 'parsing_options={"financial_mode":true,"table_extraction":true}'
Example Request (URL)
bash
curl -X POST "https://api.orbitfin.ai/v1/parser/pdf" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/financial_report.pdf",
    "parsing_options": {
      "financial_mode": true,
      "table_extraction": true,
      "page_range": [1, 50]
    }
  }'
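For callers working in Python rather than the shell, the URL request above might be built with the standard library as sketched here. The function names `build_url_request` and `send` are hypothetical, and the 120-second timeout is an arbitrary choice, not a documented value.

```python
import json
import urllib.request

API_URL = "https://api.orbitfin.ai/v1/parser/pdf"  # endpoint from the curl examples


def build_url_request(token: str, pdf_url: str, parsing_options: dict) -> urllib.request.Request:
    """Build (but do not send) the JSON POST shown in the curl example."""
    body = json.dumps({"url": pdf_url, "parsing_options": parsing_options}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_url_request(
    "YOUR_TOKEN",
    "https://example.com/financial_report.pdf",
    {"financial_mode": True, "table_extraction": True, "page_range": [1, 50]},
)
```

Calling `send(request)` performs the network call; keeping it separate from request construction makes the payload easy to inspect or test offline.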
Response
json
{
  "status": "success",
  "credits_used": 4.5,
  "data": {
    "document_info": {
      "title": "Annual Report 2024",
      "author": "Apple Inc.",
      "subject": "Financial Statements",
      "pages": 45,
      "creation_date": "2024-02-15T10:30:00Z",
      "file_size": 3456789,
      "is_scanned": false,
      "has_forms": false
    },
    "content": {
      "pages": [
        {
          "page_number": 1,
          "width": 612,
          "height": 792,
          "blocks": [
            {
              "block_id": "blk_001",
              "type": "heading",
              "level": 1,
              "text": "Annual Report 2024",
              "bbox": [50, 700, 550, 750],
              "font": {
                "name": "Helvetica-Bold",
                "size": 24
              },
              "confidence": 0.99
            },
            {
              "block_id": "blk_002",
              "type": "paragraph",
              "text": "This annual report contains forward-looking statements...",
              "bbox": [50, 600, 550, 680],
              "font": {
                "name": "Helvetica",
                "size": 11
              },
              "confidence": 0.98
            },
            {
              "block_id": "blk_003",
              "type": "table",
              "title": "Consolidated Statement of Income",
              "bbox": [50, 200, 550, 500],
              "confidence": 0.97,
              "table_data": {
                "headers": [
                  ["", "2024", "2023", "2022"],
                  ["(In millions, except per share data)", "", "", ""]
                ],
                "rows": [
                  ["Net Revenue", "$394,328", "$383,285", "$365,817"],
                  ["Cost of Sales", "214,137", "209,786", "201,471"],
                  ["Gross Profit", "180,191", "173,499", "164,346"],
                  ["Operating Expenses:", "", "", ""],
                  [" Research and Development", "29,915", "26,251", "23,456"],
                  [" Selling, General and Administrative", "25,094", "24,932", "22,876"],
                  ["Total Operating Expenses", "55,009", "51,183", "46,332"],
                  ["Operating Income", "125,182", "122,316", "118,014"]
                ],
                "detected_type": "financial_statement",
                "currency": "USD",
                "period_type": "annual"
              }
            }
          ]
        }
      ]
    },
    "extracted_entities": {
      "monetary_values": [
        {
          "value": 394328000000,
          "currency": "USD",
          "context": "Net Revenue 2024",
          "page": 1
        }
      ],
      "dates": [
        {
          "date": "2024-12-31",
          "context": "Fiscal Year End",
          "page": 1
        }
      ],
      "percentages": [
        {
          "value": 45.7,
          "context": "Gross Margin",
          "page": 2
        }
      ]
    },
    "document_structure": {
      "sections": [
        {
          "title": "Financial Highlights",
          "start_page": 1,
          "end_page": 3,
          "subsections": []
        },
        {
          "title": "Management Discussion and Analysis",
          "start_page": 4,
          "end_page": 25,
          "subsections": [
            {
              "title": "Overview",
              "start_page": 4,
              "end_page": 6
            }
          ]
        }
      ],
      "table_of_contents": [
        {
          "title": "Financial Highlights",
          "page": 1
        },
        {
          "title": "Letter to Shareholders",
          "page": 3
        }
      ]
    },
    "quality_metrics": {
      "overall_confidence": 0.96,
      "ocr_required": false,
      "extraction_warnings": [
        "Page 45: Low quality scan detected"
      ]
    }
  }
}
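Table cells in the response arrive as display strings ("$394,328"), so most consumers will want to convert them to numbers. The sketch below shows one way to do that. `parse_amount` and `table_to_records` are hypothetical helpers; the code assumes the first header row holds the column labels and that cells use US-style thousands separators, neither of which the documentation guarantees.

```python
from decimal import Decimal
from typing import Optional


def parse_amount(cell: str) -> Optional[Decimal]:
    """Turn a cell such as "$394,328" into Decimal("394328"); None for empty cells."""
    text = cell.strip().replace("$", "").replace(",", "")
    if not text:
        return None
    return Decimal(text)


def table_to_records(block: dict) -> list:
    """Map each row of a "table" block to {"label": ..., <column>: amount}."""
    table = block["table_data"]
    columns = table["headers"][0][1:]  # assumption: first header row holds period labels
    records = []
    for row in table["rows"]:
        record = {"label": row[0].strip()}
        for column, cell in zip(columns, row[1:]):
            record[column] = parse_amount(cell)
        records.append(record)
    return records


# Abbreviated copy of the table block from the sample response.
sample_block = {
    "type": "table",
    "table_data": {
        "headers": [["", "2024", "2023", "2022"]],
        "rows": [
            ["Net Revenue", "$394,328", "$383,285", "$365,817"],
            ["Operating Expenses:", "", "", ""],
            [" Research and Development", "29,915", "26,251", "23,456"],
        ],
    },
}

records = table_to_records(sample_block)
```

`Decimal` is used instead of `float` so that per-share figures with fractional parts keep their exact printed value. Section rows such as "Operating Expenses:" come back with `None` in every column, and the unit note in the second header row ("In millions") still has to be applied by the caller.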
Block Types
heading: Title or section header with level (1-6)
paragraph: Standard text paragraph
table: Structured table with rows and columns
list: Bulleted or numbered list
image: Image or chart (base64 encoded)
footnote: Footnote or endnote text
header: Page header content
footer: Page footer content
page_number: Page numbering
signature: Signature block
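Because every block carries a `type` field, a response can be indexed by block type in a single pass. This is a minimal sketch with a hypothetical helper name, `blocks_by_type`; it relies only on the `data.content.pages[].blocks[]` layout shown in the sample response.

```python
from collections import defaultdict


def blocks_by_type(response: dict) -> dict:
    """Group blocks by their "type" field as (page_number, block) pairs."""
    grouped = defaultdict(list)
    for page in response["data"]["content"]["pages"]:
        for block in page["blocks"]:
            grouped[block["type"]].append((page["page_number"], block))
    return dict(grouped)


# Trimmed stand-in for a real response, with only the fields used here.
sample_response = {
    "data": {
        "content": {
            "pages": [
                {
                    "page_number": 1,
                    "blocks": [
                        {"block_id": "blk_001", "type": "heading", "level": 1},
                        {"block_id": "blk_002", "type": "paragraph"},
                        {"block_id": "blk_003", "type": "table"},
                    ],
                },
                {
                    "page_number": 2,
                    "blocks": [{"block_id": "blk_004", "type": "paragraph"}],
                },
            ]
        }
    }
}

grouped = blocks_by_type(sample_response)
print(sorted(grouped))  # prints ['heading', 'paragraph', 'table']
```

From here, `grouped["table"]` gives every table with its page number, and `grouped.get("signature", [])` safely handles types that do not occur in a given document.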
Financial Mode Features
When financial_mode is enabled, the parser:
- Identifies standard financial statements (Income Statement, Balance Sheet, Cash Flow)
- Recognizes XBRL-like structures
- Extracts monetary values with currency detection
- Preserves table relationships and calculations
- Identifies fiscal periods and dates
- Detects auditor signatures and opinions
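The documentation does not say how preserved table relationships are exposed, but a client can sanity-check extracted figures by recomputing subtotals. The sketch below does this for the 2024 column of the sample income statement. The rule table and the `check_subtotals` helper are illustrative assumptions, not parser behavior.

```python
# 2024 column of the sample "Consolidated Statement of Income", in USD millions.
figures = {
    "Net Revenue": 394_328,
    "Cost of Sales": 214_137,
    "Gross Profit": 180_191,
    "Research and Development": 29_915,
    "Selling, General and Administrative": 25_094,
    "Total Operating Expenses": 55_009,
    "Operating Income": 125_182,
}

# Each rule: reported line item == sum of (sign * component).
RULES = {
    "Gross Profit": [("Net Revenue", 1), ("Cost of Sales", -1)],
    "Total Operating Expenses": [
        ("Research and Development", 1),
        ("Selling, General and Administrative", 1),
    ],
    "Operating Income": [("Gross Profit", 1), ("Total Operating Expenses", -1)],
}


def check_subtotals(values: dict, rules: dict) -> dict:
    """Return {line item: reported - recomputed}; 0 means the subtotal reconciles."""
    return {
        item: values[item] - sum(sign * values[name] for name, sign in parts)
        for item, parts in rules.items()
    }


print(check_subtotals(figures, RULES))
# prints {'Gross Profit': 0, 'Total Operating Expenses': 0, 'Operating Income': 0}
```

A non-zero difference would point to a misread cell or a misaligned row. The same figures also reproduce the sample's gross margin: 180,191 / 394,328 is about 45.7%.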
Error Handling
Additional error codes specific to PDF parsing:
FILE_TOO_LARGE: PDF exceeds 50MB limit
INVALID_PDF: File is corrupted or not a valid PDF
PARSING_TIMEOUT: Document too complex, parsing timed out
OCR_FAILED: OCR processing failed for scanned document
ENCRYPTED_PDF: PDF is password protected
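A client might translate these codes into a next step for the user. In the sketch below, the descriptions come from the error codes above, while the suggested actions and the `explain_error` helper are assumptions. The shape of the error response is not shown in this section, so the function takes the code as a plain string.

```python
# code -> (documented description, suggested action). The actions are assumptions.
PDF_ERRORS = {
    "FILE_TOO_LARGE": ("PDF exceeds 50MB limit", "split or compress the PDF"),
    "INVALID_PDF": ("File is corrupted or not a valid PDF", "re-export the document"),
    "PARSING_TIMEOUT": (
        "Document too complex, parsing timed out",
        "retry with a smaller page_range",
    ),
    "OCR_FAILED": (
        "OCR processing failed for scanned document",
        "check the language option or supply a cleaner scan",
    ),
    "ENCRYPTED_PDF": ("PDF is password protected", "remove the password before uploading"),
}


def explain_error(code: str) -> str:
    """Return a one-line message for a PDF parsing error code."""
    entry = PDF_ERRORS.get(code)
    if entry is None:
        # Codes outside this section (general API errors) pass through unchanged.
        return f"{code}: not a PDF-specific error code"
    description, action = entry
    return f"{code}: {description}. Suggested action: {action}."


print(explain_error("PARSING_TIMEOUT"))
# prints PARSING_TIMEOUT: Document too complex, parsing timed out. Suggested action: retry with a smaller page_range.
```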