# Document Fetching Guidance

### Environment Setup

**Specification**

Orbit will provide a folder in the **orbit-data-provider** AWS S3 bucket, along with an AWS S3 **Key Pair** (access key ID and secret access key) for accessing the data in that folder.

**In Your Case**

* The AWS S3 folder will be: s3://orbit-data-provider/clients/jpmorgan/
* The Key Pair will be provided in a separate file.

### How to Fetch Data

**Specification**

Because the raw report data is very large, Orbit only delivers a report index to the client folder. Once clients have the index, they can download the raw reports as needed.\
The data is updated in real time, and clients can use any SDK (such as boto3 in Python) to fetch the data they need.

* The SDK uses the **Key Pair** to obtain permission to access the data.
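
One way to avoid hardcoding the Key Pair in source files is to read it from environment variables. The sketch below assumes the two values from the key file have been exported as `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` (the standard names that boto3 also reads by default); the helper name `s3_client_kwargs` is hypothetical and not part of any Orbit tooling.

```python
import os


def s3_client_kwargs(env=None):
    """Build keyword arguments for boto3.client('s3', ...) from environment variables."""
    env = os.environ if env is None else env
    try:
        return {
            "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
            "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        }
    except KeyError as missing:
        # Fail early with the name of the variable that is not set
        raise RuntimeError(f"Missing credential variable: {missing.args[0]}") from None


# Usage (requires boto3):
#   import boto3
#   s3_client = boto3.client("s3", **s3_client_kwargs())
```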

**Get data**

The screenshot below shows sample data as delivered (in a real delivery, the data is tailored to each client's specific requirements).

* Each file represents one report.
* Each file is named with the current UTC time followed by the report ID, which makes it easy to filter files with the AWS SDK.

<figure><img src="https://ra97ksj7al.feishu.cn/space/api/box/stream/download/asynccode/?code=YWE0YWZhZjcxNDZhZThlYmQ2YWNkN2EyZjc3Y2ExYjRfZkpLSENWZWRpRzVTSGJZdjJ5UWlRc2Q2SlAzQXBkZWxfVG9rZW46UDJ5Q2IxeDQzb2JwOVJ4eGlmYWNBZmNGbnhoXzE3NTYzODE3MjM6MTc1NjM4NTMyM19WNA" alt=""><figcaption></figcaption></figure>

* Each report (each line) contains **presigned\_url** keys, such as `presigned_url_4_file`, for downloading the raw report.

```json
{
  "report_id": "f_7UrMt7SKXYWoDKIWjZpOCb",
  "reported_at": "2025-04-04",
  "report_title": "DEF 14A",
  "report_type_id_list": [
    "10178"
  ],
  "company_info": [
    {
      "orbit_id": "1-4295904557",
      "company_name": "MORGAN STANLEY",
      "isin": [
        "US61747S5047"
      ],
      "ticker": [
        "MS"
      ],
      "country": [
        "US"
      ]
    }
  ],
  "attachments": [
    {
      "s3_path": "s3://filing-reports/reports-data/stock_us/2025/04/04/edgar-data-895421-000114036125012302-ny20039620x1_def14a.htm.pdf",
      "presigned_url_4_file": "https://filing-reports.s3.amazonaws.com/reports-data/stock_us/2025/04/04/edgar-data-895421-000114036125012302-ny20039620x1_def14a.htm.pdf?AWSAccessKeyId=AKIAZ2SDT5DU46K54RGA&Signature=DBaDcO1qTdYPzcxwQa5TFd8tYGg%3D&Expires=1749017986",
      "presigned_url_4_pages": "https://filing-reports.s3.amazonaws.com/txt-vector/reports-data/stock_us/2025/04/04/edgar-data-895421-000114036125012302-ny20039620x1_def14a.htm.pdf/pages.txt?AWSAccessKeyId=AKIAZ2SDT5DU46K54RGA&Signature=e9JyxeMDEQyklR9LhQTnLc82gH8%3D&Expires=1749017986",
      "presigned_url_4_blocks": "https://filing-reports.s3.amazonaws.com/txt-vector/reports-data/stock_us/2025/04/04/edgar-data-895421-000114036125012302-ny20039620x1_def14a.htm.pdf/blocks.txt?AWSAccessKeyId=AKIAZ2SDT5DU46K54RGA&Signature=qGcmz4k6JPXSKWUoQyUOW7FWLmY%3D&Expires=1749017986",
      "presigned_url_4_pages_vector": "https://filing-reports.s3.amazonaws.com/txt-vector/reports-data/stock_us/2025/04/04/edgar-data-895421-000114036125012302-ny20039620x1_def14a.htm.pdf/pages.txt.vector?AWSAccessKeyId=AKIAZ2SDT5DU46K54RGA&Signature=%2FIauqGI%2BuEHA0ym5s1%2Bsy4pUDaA%3D&Expires=1749017986",
      "presigned_url_4_blocks_vector": "https://filing-reports.s3.amazonaws.com/txt-vector/reports-data/stock_us/2025/04/04/edgar-data-895421-000114036125012302-ny20039620x1_def14a.htm.pdf/blocks.txt.vector?AWSAccessKeyId=AKIAZ2SDT5DU46K54RGA&Signature=jcWGfstmaGhF0N0BU7uhrY9cdMQ%3D&Expires=1749017986"
    }
  ],
  "x_version": 1
}
```
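
As a rough illustration of how the index record above might be used, the sketch below collects every `presigned_url_*` field from each attachment and streams one of them to a local file. The helper names `collect_presigned_urls` and `download` are hypothetical, and only the Python standard library is used.

```python
import shutil
import urllib.request


def collect_presigned_urls(record):
    """Return one {field_name: url} dict per attachment in an index record."""
    result = []
    for attachment in record.get("attachments", []):
        urls = {k: v for k, v in attachment.items() if k.startswith("presigned_url_")}
        result.append(urls)
    return result


def download(url, dest_path):
    """Stream the content behind a URL to a local file and return its path."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


# Usage, assuming `record` is the parsed JSON of one index file:
#   for urls in collect_presigned_urls(record):
#       download(urls["presigned_url_4_file"], "report.pdf")
```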

Clients can read the data programmatically.

> The code below uses the boto3 SDK in Python.

```python
import json
import boto3

# Create an S3 client with the Key Pair provided by Orbit
s3_client = boto3.client('s3', aws_access_key_id="your key id", aws_secret_access_key="your secret key")

bucket_name = 'orbit-data-provider'
# Sample index file key; replace it with a key under your own data folder (e.g. clients/abc/)
key = "clients/marketscreener/streaming/20241108072151_f_gNEoQE9TllGsQhHjAIambp.json"

# Download the index file and decode its body as UTF-8 text
response = s3_client.get_object(Bucket=bucket_name, Key=key)
file_content = response['Body'].read().decode('utf-8')

print(json.loads(file_content))  # Parse the JSON content
```
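
Because each file name starts with a UTC timestamp, new index files can be picked up incrementally by listing keys that sort after a given time. The sketch below is one possible approach, assuming the name format `YYYYMMDDHHMMSS_<report id>.json` seen in the sample key and that `prefix` points at the folder that directly contains the index files (for example one ending in `streaming/`). The helper names are hypothetical; `s3_client` is a boto3 S3 client, and `since` must be a timezone-aware `datetime`.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # Assumed from the sample name 20241108072151_...


def parse_index_key(key):
    """Split an index file key into its UTC timestamp and report ID."""
    file_name = key.rsplit("/", 1)[-1]
    stem = file_name[:-len(".json")] if file_name.endswith(".json") else file_name
    timestamp, report_id = stem.split("_", 1)
    created_at = datetime.strptime(timestamp, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
    return created_at, report_id


def list_index_keys_since(s3_client, bucket, prefix, since):
    """Yield index file keys under `prefix` whose timestamp is at or after `since`."""
    # S3 lists keys in lexicographic order, so a timestamp-prefixed name
    # lets StartAfter skip everything older than `since`.
    start_after = prefix + since.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, StartAfter=start_after):
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

A caller could remember the timestamp of the last key it processed (via `parse_index_key`) and pass it as `since` on the next run.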

⚠️ Please note that each **presigned\_url** typically expires after 7 days. You can regenerate it at any time using the **s3\_path** key in the index file. Below is the sample code.

```python
import re

import boto3

s3_client = boto3.client('s3', aws_access_key_id="your key id", aws_secret_access_key="your secret key")

# Matches s3://<bucket>/<object key>
S3_PATH_RE = re.compile(r"s3://([a-zA-Z\-_0-9]+)/(.+)")


# Tool method
def s3_split_path(s3_path: str):
    """Split an s3:// path into its bucket name and object key."""
    match = S3_PATH_RE.fullmatch(s3_path)
    if match is None:
        raise ValueError("Invalid s3 path format.")
    bucket, store_path = match.groups()
    return {
        'bucket': bucket,
        'store_path': store_path,
    }


def gen_n_files_presign_url(s3_path):
    """Generate a presigned download URL for the object at s3_path."""
    s3_path_obj = s3_split_path(s3_path)

    presigned_url_pdf = s3_client.generate_presigned_url(
        'get_object',
        Params={
            'Bucket': s3_path_obj["bucket"],
            'Key': s3_path_obj["store_path"]
        },
        ExpiresIn=604800  # 7 days, in seconds
    )

    return presigned_url_pdf
```
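
To decide when a URL needs regenerating, one option is to read the expiry from the URL itself. The sketch below assumes the URL carries an `Expires` query parameter holding a Unix timestamp, as the sample URLs in the index do; URLs signed with AWS Signature Version 4 use `X-Amz-Date` and `X-Amz-Expires` instead, which this sketch does not handle and treats as needing a refresh. The helper names and the one-hour safety margin are hypothetical choices.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def presigned_url_expiry(url):
    """Return the expiry time encoded in the URL's `Expires` parameter, or None."""
    query = parse_qs(urlsplit(url).query)
    if "Expires" not in query:
        return None
    return datetime.fromtimestamp(int(query["Expires"][0]), tz=timezone.utc)


def needs_refresh(url, now=None, margin=timedelta(hours=1)):
    """True if the URL has expired, expires within `margin`, or has no readable expiry."""
    now = now or datetime.now(timezone.utc)
    expiry = presigned_url_expiry(url)
    return expiry is None or expiry - now <= margin
```

When `needs_refresh` returns `True`, a new URL can be generated from the attachment's **s3\_path** value.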

