# Enterprise Infrastructure for Unstructured Data

## 1. Introduction to Orbit AI Studio

Orbit AI Studio is an enterprise-grade platform designed to solve the fundamental challenge of managing and extracting intelligence from unstructured financial documents at scale. While modern financial institutions have mature infrastructure for structured data (Snowflake, Databricks), unstructured documents, which contain 80%+ of critical financial intelligence, remain fragmented across systems with inconsistent processing and no unified architecture.

AI Studio provides the complete technology stack for acquiring, processing, storing, and analyzing unstructured financial data. From raw PDF documents to AI-powered insights, the platform handles the entire lifecycle with enterprise-grade reliability, security, and performance.

**Orbit AI Studio is to unstructured data what Snowflake is to structured data** — a comprehensive platform that enables organizations to focus on building differentiated applications and generating alpha, rather than building commodity data infrastructure.

### Core Platform Modules

#### Module 1: Knowledge Base Building Pipeline

Complete infrastructure for transforming raw documents into AI-ready structured data

* Data Ingestion (web scraping, APIs, uploads)
* Entity Master & Ontology Management
* Metadata Management & Classification
* Advanced PDF Parsing
* Storage (flat files, databases, search engines)

#### Module 2: Pre-Built Knowledge Bases

Production-ready datasets of processed financial documents

* Global Filings (50K+ companies)
  * Financial Reports
  * Earnings Transcripts
  * ESG & Sustainability Reports
  * Corporate Action Announcements
* Available as data feed or via MCP/API

#### Module 3: Extraction & Calculation Services

AI-powered data extraction and analytical computation at scale

* Bespoke Data Point Extraction
* Document Summarization
* Complex Analytical Workflows
* Extraction Logic Optimization
* Model Fine-tuning Services

### Platform Architecture Overview

```plaintext
Raw Documents → Ingestion & Parsing → Knowledge Base → Extraction & Calculation
   (PDFs)           Pipeline          (Structured)         (AI-Powered)        
```

***

## 2. Knowledge Base Building Pipeline

The Knowledge Base Building Pipeline is the foundation of AI Studio, responsible for transforming raw financial documents into clean, structured, AI-ready data. This pipeline handles millions of documents with enterprise reliability, processing complex financial filings that generic document processing tools cannot handle effectively.

### 2.1 Data Ingestion

The ingestion layer supports multiple acquisition methods to bring documents into the platform from any source, handling both public regulatory filings and proprietary internal documents.

#### Supported Ingestion Methods

**Web Scraping Framework:**

* Configurable crawlers with rate limiting, respectful crawling patterns, and automatic retry logic

**API Integration:**

* RESTful and GraphQL connectors for third-party data providers

**Direct Upload:**

* Bulk upload via UI, API, or S3-compatible object storage sync

**Email Integration:**

* Automated document intake from email attachments

**Enterprise Connectors:**

* SharePoint, Box, Google Drive, internal document management systems

#### Technical Capabilities

* Scheduled polling for batch ingestion
* Webhook support for push-based ingestion
* Deduplication logic to prevent reprocessing
* Document versioning and change tracking
* Error handling with automatic retry and alerting
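As a rough illustration of two of the capabilities listed above, the sketch below shows content-hash deduplication and retry with exponential backoff. It is not Orbit's implementation: `content_fingerprint`, `IngestionQueue`, and `fetch_with_retry` are hypothetical names, and hashing the raw bytes with SHA-256 is an assumed deduplication strategy.

```python
import hashlib
import time


def content_fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest used as the deduplication key (assumed strategy)."""
    return hashlib.sha256(data).hexdigest()


class IngestionQueue:
    """Hypothetical intake queue that skips documents whose bytes were already seen."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self.accepted: list[str] = []

    def submit(self, name: str, data: bytes) -> bool:
        key = content_fingerprint(data)
        if key in self._seen:
            return False  # duplicate content: do not reprocess
        self._seen.add(key)
        self.accepted.append(name)
        return True


def fetch_with_retry(fetch, max_attempts: int = 3, base_delay: float = 1.0):
    """Call fetch() until it succeeds, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the error so it can be alerted on
            time.sleep(base_delay * (2 ** attempt))
```

Hashing content rather than file names means the same filing fetched from two sources is processed once, while a revised filing with the same name still gets through.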

> **Note:** Custom connectors can be developed for proprietary sources. Development typically takes 2-4 weeks, depending on source complexity.

***

### 2.2 Entity Master & Ontology Management

The Entity Master provides unified entity resolution and relationship management across all documents, ensuring consistent identification of companies, people, and concepts regardless of naming variations or data source.

#### Key Features

**Entities covered:**

* Unified company profiles with multiple identifier support (ticker, ISIN, LEI, CIK, etc.)

**Name Variation Management:**

* Handles alternate names, legacy names, DBA names, international variations

**Automatic Entity Resolution:**

* ML-based matching to resolve entities across documents

**Change History:**

* Track name changes, mergers, acquisitions, spin-offs

**Custom Ontologies:**

* Support for client-specific taxonomies and classification schemes
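To make the resolution idea concrete, here is a minimal sketch that matches first on identifiers and then on normalized names. It uses simple string similarity from Python's standard library as a stand-in for the ML-based matching mentioned above. The `Entity` fields, the suffix list, and the 0.85 threshold are all assumptions for illustration.

```python
import re
from dataclasses import dataclass, field
from difflib import SequenceMatcher

# Legal suffixes dropped before comparing names (illustrative, not exhaustive)
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation",
                  "ltd", "limited", "plc", "co", "company"}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and remove legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


@dataclass
class Entity:
    entity_id: str
    canonical_name: str
    identifiers: dict[str, str] = field(default_factory=dict)  # e.g. {"isin": "..."}
    aliases: list[str] = field(default_factory=list)           # legacy and DBA names


def resolve(mention: str, identifiers: dict[str, str],
            master: list[Entity], threshold: float = 0.85):
    """Return the matching Entity, or None if nothing matches well enough."""
    # Step 1: an exact identifier match wins outright
    for entity in master:
        for scheme, value in identifiers.items():
            if entity.identifiers.get(scheme) == value:
                return entity
    # Step 2: fall back to the best fuzzy match over canonical name and aliases
    target = normalize_name(mention)
    best, best_score = None, 0.0
    for entity in master:
        for name in [entity.canonical_name, *entity.aliases]:
            score = SequenceMatcher(None, target, normalize_name(name)).ratio()
            if score > best_score:
                best, best_score = entity, score
    return best if best_score >= threshold else None
```

Checking identifiers before names keeps the fuzzy step for cases where a document carries no usable identifier.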

***

### 2.3 Metadata Management & Document Classification

Comprehensive metadata extraction and enrichment enables efficient document discovery, filtering, and compliance tracking.

#### Automatic Metadata Extraction

**Document Type Classification:**

* Annual reports, quarterly reports, earnings transcripts, presentations, sustainability reports (300+ document types)

**Temporal Information:**

* Filing dates, report dates, event dates

**Regulatory Metadata:**

* Exchange identifiers, filing authority

**Language Detection:**

* Automatic language identification with translation markers

#### Metadata Storage & Indexing

* Elasticsearch-based full-text search index
* PostgreSQL/MongoDB for structured metadata queries
* Support for custom metadata fields and tags
* Version control for metadata updates
* API access for metadata queries and bulk export

***

### 2.4 Advanced PDF Parsing Engine

The parsing engine is specifically optimized for complex financial documents, achieving 99%+ accuracy on tables and multi-column layouts.

#### Core Parsing Capabilities

**Financial Table Extraction:**

* Intelligent detection and extraction of financial statements, preserving row/column headers, cell merging, and numeric formatting

**Multi-Page Table Reconstruction:**

* Automatic stitching of tables split across pages with header repetition detection

**Multi-Column Layout Processing:**

* Correct reading order preservation in complex 2-3 column layouts

**Exhibit & Section Extraction:**

* Intelligent segmentation of documents into logical sections

**Chart & Image Extraction:**

* Identification and extraction of embedded graphics with OCR for chart text
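The multi-page table reconstruction described above can be pictured with a small sketch. It assumes the fragments have already been identified as parts of one logical table, and that a repeated header is an exact match after trimming and lowercasing. `TableFragment` and `stitch_fragments` are hypothetical names, not part of the parsing engine's interface.

```python
from dataclasses import dataclass


@dataclass
class TableFragment:
    page: int
    rows: list[list[str]]  # on the first page, rows[0] is the header


def _canonical(row: list[str]) -> list[str]:
    """Normalize cells so header comparison ignores case and padding."""
    return [cell.strip().lower() for cell in row]


def stitch_fragments(fragments: list[TableFragment]) -> list[list[str]]:
    """Merge fragments of one table in page order, dropping repeated headers."""
    ordered = sorted(fragments, key=lambda f: f.page)
    if not ordered or not ordered[0].rows:
        return []
    header = ordered[0].rows[0]
    merged = [header, *ordered[0].rows[1:]]
    for frag in ordered[1:]:
        rows = frag.rows
        # Header repetition detection: skip a first row that repeats the header
        if rows and _canonical(rows[0]) == _canonical(header):
            rows = rows[1:]
        merged.extend(rows)
    return merged
```

Real filings often need looser matching (for example, "(continued)" markers in the header), which this sketch does not attempt.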

#### Output Formats

* **Structured JSON:** Complete document hierarchy with text, tables, metadata

#### Performance Specifications

* Processing speed: 100-page document in under 60 seconds
* Accuracy: 99%+ on financial tables and structured content
* Maximum document size: 1,000 pages
* Supported formats: PDF, scanned PDF (OCR)
* Auto-scaling based on queue depth
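For capacity planning, the stated speed (a 100-page document in under 60 seconds) can be turned into a rough worker estimate. The sketch below treats that figure as a conservative rate of 100 pages per minute per worker and assumes throughput scales linearly with page count and worker count. Both are simplifying assumptions, and `workers_needed` is a hypothetical helper.

```python
import math


def workers_needed(docs_per_hour: int, avg_pages: int,
                   pages_per_minute_per_worker: float = 100.0) -> int:
    """Estimate parallel workers required to keep up with an hourly document load."""
    pages_per_hour = docs_per_hour * avg_pages
    capacity_per_worker = pages_per_minute_per_worker * 60  # pages per hour
    return max(1, math.ceil(pages_per_hour / capacity_per_worker))
```

Under these assumptions, 600 documents per hour averaging 100 pages each would need about 10 workers.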

***

### 2.5 Storage Architecture

A multi-tier storage system optimized for different access patterns and data types, from raw document preservation to real-time search and analytics.

#### Storage Layers

**Raw Document Store (S3/Blob):** Original PDF files preserved for audit and reprocessing

* Versioned storage with lifecycle policies
* Hot/warm/cold tiering for cost optimization
* Immutable storage options for compliance

**Structured Data Store (PostgreSQL):** Extracted text, tables, and metadata

* JSONB for flexible schema
* Full ACID compliance
* Optimized indexes for common query patterns
* Partitioning by company and date for performance

**Search Engine (Elasticsearch):** Document embeddings for semantic search

* Multiple embedding model support
* Sub-second similarity search
* Full-text search and faceted filtering
* Real-time indexing pipeline
* Advanced query DSL support
* Aggregations for analytics

**Flat File Export:** CSV, JSON Lines for bulk export

* Partitioned by company/date for selective loading
* S3-compatible bulk export
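A partitioned JSON Lines export might look like the sketch below. The `company=<id>/date=<YYYY-MM-DD>` directory layout, the `part.jsonl` file name, and the record fields `company_id` and `filing_date` are assumptions chosen for illustration. The actual export layout is not specified here.

```python
import json
from pathlib import Path


def export_partitioned(records: list[dict], root: Path) -> list[Path]:
    """Append each record as one JSON line under root/company=<id>/date=<date>/."""
    written: set[Path] = set()
    for rec in records:
        part_dir = root / f"company={rec['company_id']}" / f"date={rec['filing_date']}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "part.jsonl"
        # One JSON object per line, so consumers can stream the file
        with path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
        written.add(path)
    return sorted(written)
```

With this layout a consumer can load a single company or date range by reading only the matching directories.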

#### Data Management Features

* **Data Lineage:** Complete tracking from raw document through all processing stages
* **Version Control:** Document and data versioning with rollback capability
* **Reprocessing:** Ability to reprocess documents with updated parsing logic
* **Backup & Recovery:** Automated backups with point-in-time recovery
* **Data Retention Policies:** Configurable retention with automated archival
* **Access Logging:** Complete audit trail of data access for compliance

#### Knowledge Base Building Pipeline Flow

```plaintext
Ingestion → PDF Parsing → Entity Resolution → Metadata Enrichment → Multi-Tier Storage
  Layer                                                                  
```

End-to-end processing time: 2-5 minutes per document (depending on size and complexity)

***

## 3. Pre-Built Knowledge Bases

Orbit maintains production-ready knowledge bases of processed financial documents, eliminating the need to build data collection and processing infrastructure from scratch. These knowledge bases are continuously updated as new documents are filed and are available through multiple consumption models.

### 3.1 Global Filings Knowledge Base

Our flagship knowledge base covering regulatory filings from public companies globally. All documents are pre-processed through the complete pipeline described in Chapter 2, delivered in AI-ready structured format.

#### Coverage Summary

* **Companies:** 50,000+ globally listed companies
* **Documents:** 15M+ historical documents
* **Markets:** United States, United Kingdom, Europe, China, Japan, Australia, Canada, Singapore, India, and 50+ additional countries
* **Document Types:** Annual reports, quarterly filings, current reports, proxy statements, prospectuses, earnings transcripts, presentations
* **Historical Depth:** 10+ years
* **Update Frequency:** Daily

***

### 3.2 Access Methods

#### Data Feed License

**Bulk Data Delivery**

**Description:** Complete knowledge base delivered to your infrastructure for local hosting and integration.

**Delivery Options:**

* Initial bulk transfer (S3)
* Continuous incremental updates (daily)

**Best For:** Organizations wanting full data ownership, no API dependencies, or integration with existing data lakes.

***

#### API Access

**On-Demand Query**

**Description:** Query knowledge base via RESTful API or GraphQL without hosting data locally.

**API Capabilities:**

* Document retrieval by company, date, type
* Full-text and semantic search
* Table and section extraction
* Metadata queries and filtering
* Bulk export endpoints

**Best For:** Rapid prototyping, variable usage patterns, or supplementing existing data sources.

***

#### Model Context Protocol (MCP)

**LLM-Native Integration**

**Description:** Direct integration with Large Language Models (ChatGPT, Claude, etc.) via MCP standard.

**MCP Features:**

* Native tool definitions for LLMs
* Semantic search over documents
* Context-aware document retrieval
* Automatic citation and sourcing
* Optimized for RAG patterns
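For readers unfamiliar with MCP, a tool is described to the model by a name, a description, and a JSON Schema for its inputs. The sketch below shows what such a definition could look like for document search. The tool name `search_filings` and every parameter are hypothetical and are not Orbit's published tool definitions. The validator checks only required keys and basic types, not the full JSON Schema.

```python
# Hypothetical MCP-style tool definition (name, description, inputSchema)
SEARCH_FILINGS_TOOL = {
    "name": "search_filings",
    "description": "Semantic search over processed financial filings.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Natural-language query"},
            "company": {"type": "string", "description": "Ticker, ISIN, or similar"},
            "doc_type": {"type": "string"},
            "limit": {"type": "integer", "minimum": 1, "maximum": 50},
        },
        "required": ["query"],
    },
}

_JSON_TYPES = {"string": str, "integer": int}


def validate_arguments(tool: dict, arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the basic shape is valid."""
    schema = tool["inputSchema"]
    errors = [f"missing required argument: {key}"
              for key in schema.get("required", []) if key not in arguments]
    for key, value in arguments.items():
        spec = schema["properties"].get(key)
        if spec is None:
            errors.append(f"unknown argument: {key}")
        elif not isinstance(value, _JSON_TYPES[spec["type"]]):
            errors.append(f"wrong type for {key}: expected {spec['type']}")
    return errors
```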

**Best For:** AI chatbots, research assistants, or conversational interfaces requiring financial document access.

***

## 4. Extraction & Calculation Services

Beyond storing and serving documents, AI Studio provides powerful computation capabilities to extract structured data and perform complex analytical workflows. These services leverage Large Language Models and custom-trained extraction models to transform unstructured text into actionable intelligence.

### 4.1 Bespoke Data Point Extraction

Extract specific metrics, facts, or data points systematically across thousands of documents. Unlike generic prompting, our extraction service uses optimized prompts, validation logic, and quality assurance to ensure consistent, reliable output.

#### Extraction Capabilities

**Financial Metrics:**

* Revenue segments, margins by division, guidance ranges, capital expenditure plans

**Qualitative Factors:**

* Management sentiment, risk factor changes, competitive positioning statements

**Structured Events:**

* M\&A announcements, product launches, regulatory actions

**Custom Taxonomies:**

* Client-specific data points and classification schemes

**Multi-Source Validation:**

* Cross-reference data points across multiple documents for consistency

**Historical Tracking:**

* Time-series construction for trend analysis

#### Technical Approach

* Iterative prompt engineering with validation on sample sets
* Confidence scoring for each extracted data point
* Source citation with page/section references
* Batch processing for large-scale extraction (thousands of companies)
* Real-time extraction API for ad-hoc queries
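The combination of confidence scoring, source citation, and cross-document validation can be sketched as a small data model. Everything here is illustrative: the `DataPoint` fields, the 0.8 confidence cutoff, and the 1% relative tolerance are assumptions, not values used by the service.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPoint:
    company_id: str
    metric: str
    value: float
    confidence: float  # 0.0 to 1.0
    source_doc: str    # citation: which document
    page: int          # citation: which page


def accept(points: list[DataPoint], min_confidence: float = 0.8) -> list[DataPoint]:
    """Keep only data points at or above the confidence cutoff."""
    return [p for p in points if p.confidence >= min_confidence]


def find_conflicts(points: list[DataPoint], rel_tolerance: float = 0.01) -> dict:
    """Flag (company, metric) groups whose values disagree beyond the tolerance."""
    groups = defaultdict(list)
    for p in points:
        groups[(p.company_id, p.metric)].append(p)
    conflicts = {}
    for key, group in groups.items():
        values = [p.value for p in group]
        lo, hi = min(values), max(values)
        scale = max(abs(lo), abs(hi))
        if scale > 0 and (hi - lo) / scale > rel_tolerance:
            conflicts[key] = group  # citations are retained for manual review
    return conflicts
```

Keeping the citation on every data point means a flagged conflict can be traced back to the exact pages that disagree.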

**Typical Use Cases:** Systematic extraction of 50+ ESG metrics from sustainability reports; revenue breakdown by product line across 1,000 companies; quarterly tracking of supply chain mentions.

***

### 4.2 Document Summarization

Generate consistent, high-quality summaries of financial documents optimized for investment analysis workflows.

#### Summarization Types

* **Executive Summaries:** High-level overview for quick review (200-500 words)
* **Section Summaries:** Condensed version of specific sections (Risk Factors, MD\&A, etc.)
* **Change Summaries:** What's different vs. prior period
* **Comparative Summaries:** Company vs. peers on specific topics
* **Thematic Summaries:** Extraction of all content related to specific themes (AI, ESG, supply chain)

All summaries include source citations and can be customized for length, focus areas, and output format.

***

### 4.3 Complex Analytical Workflows

Multi-step analysis pipelines that combine extraction, calculation, and reasoning to produce sophisticated analytical outputs.

#### Example Workflows

**Competitive Positioning Analysis:** Extract competitive mentions → Identify key competitors → Compare product/service positioning → Assess relative strengths → Generate summary report

**Risk Factor Analysis:** Extract all risks → Categorize by type → Track changes vs. prior period → Score severity → Identify emerging risks → Peer comparison

**Management Quality Assessment:** Extract capital allocation decisions → Track promises vs. execution → Analyze compensation structure → Evaluate governance → Generate quality score

**Thematic Exposure Quantification:** Identify theme mentions → Extract revenue/margin impact → Classify as headwind/tailwind → Quantify exposure → Track over time

**Workflow Orchestration:** Visual workflow builder allows business users to define multi-step analysis logic without coding. Workflows can be scheduled to run automatically on new documents or triggered via API.
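A multi-step workflow of this kind can be thought of as a chain of steps that each read and extend a shared context. The sketch below models the first three stages of a risk factor analysis. Line splitting and keyword matching stand in for the LLM-backed steps, and all function names and keyword lists are hypothetical.

```python
from typing import Callable

Step = Callable[[dict], dict]

# Illustrative keyword map; a real categorizer would be model-based
CATEGORY_KEYWORDS = {
    "regulatory": ("regulation", "compliance"),
    "supply_chain": ("supplier", "supply chain"),
}


def run_workflow(steps: list[Step], context: dict) -> dict:
    """Run each step in order, passing the growing context along."""
    for step in steps:
        context = step(context)
    return context


def extract_risks(ctx: dict) -> dict:
    # Stand-in for an extraction step: one risk per non-empty line
    risks = [line.strip() for line in ctx["risk_section"].splitlines() if line.strip()]
    return {**ctx, "risks": risks}


def categorize_risks(ctx: dict) -> dict:
    categories = {}
    for risk in ctx["risks"]:
        text = risk.lower()
        categories[risk] = next(
            (cat for cat, words in CATEGORY_KEYWORDS.items()
             if any(word in text for word in words)),
            "other",
        )
    return {**ctx, "categories": categories}


def diff_vs_prior(ctx: dict) -> dict:
    # Risks absent from the prior period are treated as new
    prior = set(ctx.get("prior_risks", []))
    return {**ctx, "new_risks": [r for r in ctx["risks"] if r not in prior]}
```

Calling `run_workflow([extract_risks, categorize_risks, diff_vs_prior], context)` returns the context with `risks`, `categories`, and `new_risks` filled in.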

***

### 4.4 Extraction Optimization Services

For complex or high-volume extraction tasks, our optimization services ensure maximum accuracy and cost-efficiency through custom prompt engineering or model fine-tuning.

#### Extraction Logic Optimization

**Prompt Engineering**

**Service Description:** Expert prompt engineering to maximize extraction accuracy and minimize token usage.

**Deliverables:**

* Optimized prompts with validation rules
* Sample output validation (100+ documents)
* Accuracy benchmarking report
* Documentation and usage guidelines

**Timeline:** 2-4 weeks

***

#### Model Fine-Tuning

**Custom ML Models**

**Service Description:** Fine-tune extraction models on your specific use case for superior accuracy and lower cost.

**Deliverables:**

* Custom fine-tuned model weights
* Training and validation datasets
* Model deployment (API or on-premise)
* Performance benchmarking
* Ongoing retraining options

**Timeline:** 8-12 weeks

***

> **When to Optimize:** Consider optimization services when: (1) Running extraction on >10,000 documents, (2) Accuracy requirements >95%, (3) Cost per extraction needs to be minimized, or (4) Proprietary terminology requires domain adaptation.

***

## 5. Deployment Options & Commercial Models

AI Studio offers flexible deployment options to match your security requirements, technical infrastructure, and commercial preferences. Whether you prefer cloud SaaS, private cloud, or on-premise deployment, the platform capabilities remain consistent.

### 5.1 Platform Deployment Options

#### Orbit SaaS on AWS

**Fully Managed**

**Description:** Complete AI Studio platform hosted and managed by Orbit on AWS infrastructure.

**Characteristics:**

* Zero infrastructure management required
* Automatic scaling and updates
* 99.9% uptime SLA
* Multi-tenant architecture with data isolation
* Global availability (US, EU, Asia regions)

**Best For:** Fastest time-to-value, no infrastructure team, prefer OPEX over CAPEX.

***

#### Private Cloud (AWS/Azure)

**Dedicated Infrastructure**

**Description:** Dedicated AI Studio instance in your AWS or Azure account, managed by Orbit.

**Characteristics:**

* Data resides in your cloud account
* VPC isolation and custom networking
* Your encryption keys (BYOK)
* Orbit manages infrastructure and updates
* Custom SLA agreements available
* Hybrid connectivity to on-premise systems

**Best For:** Data residency requirements, need dedicated resources, want managed service without multi-tenancy.

***

#### On-Premise Installation

**Self-Hosted**

**Description:** Complete AI Studio stack deployed in your data center or private cloud, managed by your team.

**Characteristics:**

* Full control over infrastructure
* Docker/Kubernetes deployment
* Your team manages operations
* Annual license + maintenance model
* Orbit provides updates and support

**Best For:** Strict data governance, regulatory requirements, no cloud usage allowed, very high document volumes.

***

### 5.2 Module-Specific Commercial Models

#### PDF Parsing Services Pricing

| Offering               | Description                                        | Pricing Model                                            | Typical Use Case                                                                   |
| ---------------------- | -------------------------------------------------- | -------------------------------------------------------- | ---------------------------------------------------------------------------------- |
| **SaaS Pay-As-You-Go** | API access with usage-based billing                | 0.4 credits per 10 pages                                 | Variable volumes, development/testing, sporadic usage                              |
| **SaaS Fixed Project** | Pre-purchased parsing credits for defined project  | Fixed fee for X documents; 50% discount vs. pay-as-you-go | One-time historical document processing, migration projects                        |
| **On-Premise License** | Software license for unlimited self-hosted parsing | Annual license based on volume tier + one-time setup     | Very high volumes (100K+ docs/year), data security requirements, predictable costs |

> **Break-Even Analysis:** On-premise license becomes cost-effective at approximately 100K-150K documents per year compared to SaaS pay-as-you-go pricing.
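The arithmetic behind the pay-as-you-go rate and the break-even comparison can be sketched as follows. The rate of 0.4 credits per 10 pages comes from the table above. Whether usage is billed pro rata or rounded up to whole 10-page blocks is not stated, so pro-rata billing is an assumption here. `price_per_credit` and `annual_license_cost` are placeholders that must come from an actual quote.

```python
def parsing_credits(pages: int, credits_per_block: float = 0.4,
                    block_pages: int = 10) -> float:
    """Credits for one document at 0.4 credits per 10 pages (pro rata, assumed)."""
    return pages * credits_per_block / block_pages


def break_even_documents(annual_license_cost: float, avg_pages: int,
                         price_per_credit: float) -> float:
    """Documents per year at which pay-as-you-go spend equals the license cost.

    Both monetary inputs are hypothetical placeholders, not published prices.
    """
    cost_per_document = parsing_credits(avg_pages) * price_per_credit
    return annual_license_cost / cost_per_document
```

Under the pro-rata assumption, a 100-page document costs 4 credits. The break-even volume then depends entirely on the quoted credit price and license fee, and one-time setup cost is ignored in this sketch.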

***

#### Knowledge Base Access Pricing

| Offering                     | Description                                  | Pricing Model                                                          | What's Included                                                                               |
| ---------------------------- | -------------------------------------------- | ---------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
| **Data Feed License**        | Bulk data delivery for local hosting         | Contact sales team                                                     | Complete historical dataset, continuous updates, unlimited internal usage, technical support  |
| **API Access - Consumption** | Query knowledge base via API, pay per use    | [Per API pricing guide](https://docs.orbitfin.ai/orbit-api-reference/) | On-demand document retrieval, search, metadata queries. See detailed API pricing in Chapter 3 |
| **MCP Access - Consumption** | LLM-native access via Model Context Protocol | Consumption based                                                      | Semantic search, context-aware retrieval, optimized for RAG patterns, LLM tool definitions    |

***

#### Extraction & Calculation Services Pricing

| Offering                           | Description                                                | Pricing Model                        | Considerations                                                                                                |
| ---------------------------------- | ---------------------------------------------------------- | ------------------------------------ | ------------------------------------------------------------------------------------------------------------- |
| **Consumption-Based (Orbit SaaS)** | Pay per extraction/calculation run on Orbit infrastructure | Consumption based                    | Ideal for variable workloads, testing different extraction approaches, no infrastructure management           |
| **Private Deployment License**     | Run extraction services on your infrastructure             | Annual License + Maintenance         | Predictable costs for high volumes, full control over compute, data never leaves your environment             |
| **Extraction Optimization**        | Professional services to optimize accuracy and efficiency  | Prompt Engineering; Model Fine-Tuning | Recommended for >10K documents, accuracy >95% required, or proprietary terminology. See Chapter 4 for details |

***

### 5.3 Platform Licensing (Full Stack)

For organizations deploying the complete AI Studio platform (pipeline, knowledge bases, and extraction services), comprehensive licensing packages are available.

#### SaaS Platform

**What's Included:**

* Complete pipeline access
* Knowledge base licenses
* Extraction services (consumption-based)
* Managed infrastructure
* Standard support

**Pricing Structure:**

* Annual base platform fee
  * Knowledge base licenses
  * Extraction consumption

***

#### Private Cloud

**What's Included:**

* Dedicated infrastructure in your cloud
* All platform capabilities
* Knowledge base licenses
* Managed by Orbit
* Premium support

**Pricing Structure:**

* Setup cost
* Platform License
  * Knowledge base licenses
  * Monthly management fee

***

#### On-Premise

**What's Included:**

* Complete software stack
* Unlimited processing
* Knowledge base licenses
* Your team operates
* Orbit provides updates & support

**Pricing Structure:**

* Setup & Implementation
* Annual Platform License
  * Knowledge base licenses
  * Monthly management fee

***

> **Custom Packages:** Actual packages are customized based on document volumes, number of companies/industries covered, required knowledge bases, extraction complexity, deployment requirements, and support SLAs. Contact us for a detailed proposal based on your specific requirements.

***

### 5.4 Decision Framework: Which Deployment Option?

| Your Requirements                             | Recommended Approach                    | Rationale                                                                 |
| --------------------------------------------- | --------------------------------------- | ------------------------------------------------------------------------- |
| Building RAG system, need parsing only        | PDF Parsing SaaS (pay-as-you-go)        | Minimal commitment, test quality, scale as needed                         |
| Need historical data for platform development | Knowledge Base Data Feed License        | Complete data ownership, no API dependencies, unlimited queries           |
| Prototyping AI applications                   | API/MCP Access (consumption)            | Fast start, flexible, validate use case before larger investment          |
| Processing 200K+ documents/year               | On-Premise Parsing License              | Cost-effective at scale, predictable budgeting                            |
| Strict data residency requirements            | Private Cloud or On-Premise             | Data never leaves your infrastructure                                     |
| Building enterprise AI platform               | Full Platform License (SaaS or Private) | Complete stack, managed service, focus on applications not infrastructure |
| Regulated environment, air-gapped network     | On-Premise Platform License             | Only option for environments without internet connectivity                |

