PDF API Endpoints
Endpoints for PDF file processing with text extraction and OCR.
POST /api/pdf/process
Process a PDF file with text extraction and optional OCR.
Request
Content-Type: multipart/form-data
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| file | File | Yes | - | PDF file |
| extraction_method | String | No | native | Extraction method: native, tesseract_ocr, openai_vision, combined |
| template | String | No | "" | Optional template for text transformation |
| context | JSON | No | {} | Additional context for the template |
| useCache | Boolean | No | true | Whether to use the cache |
| includeImages | Boolean | No | false | Create a Base64-encoded ZIP archive of the generated images |
| page_start | Integer | No | - | Start page (1-indexed) |
| page_end | Integer | No | - | End page (1-indexed) |
Extraction Methods
- native: Extract text directly from PDF structure (fastest)
- tesseract_ocr: Use Tesseract OCR for text extraction
- openai_vision: Use OpenAI Vision API for OCR
- combined: Try multiple methods and combine results
Note: For Mistral OCR transformation with integrated images, use the dedicated endpoint POST /api/pdf/process-mistral-ocr instead.
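The parameters and extraction methods listed above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: `build_process_fields` is a hypothetical helper, and rejecting a reversed page range is an assumption that the documentation does not state.

```python
import json

# Extraction methods listed in the documentation above.
EXTRACTION_METHODS = {"native", "tesseract_ocr", "openai_vision", "combined"}


def build_process_fields(extraction_method="native", template="", context=None,
                         use_cache=True, include_images=False,
                         page_start=None, page_end=None):
    """Return the non-file form fields for POST /api/pdf/process as strings.

    Hypothetical helper for illustration; the PDF itself is sent separately
    in the multipart "file" part.
    """
    if extraction_method not in EXTRACTION_METHODS:
        raise ValueError(f"unknown extraction_method: {extraction_method!r}")
    for name, page in (("page_start", page_start), ("page_end", page_end)):
        if page is not None and page < 1:
            raise ValueError(f"{name} is 1-indexed and must be >= 1")
    # Assumption: a reversed range is treated as a client error here;
    # the server may handle it differently.
    if page_start is not None and page_end is not None and page_end < page_start:
        raise ValueError("page_end must not be smaller than page_start")

    fields = {
        "extraction_method": extraction_method,
        "template": template,
        "context": json.dumps(context or {}),
        # Booleans are sent as lowercase strings ("true" / "false").
        "useCache": "true" if use_cache else "false",
        "includeImages": "true" if include_images else "false",
    }
    if page_start is not None:
        fields["page_start"] = str(page_start)
    if page_end is not None:
        fields["page_end"] = str(page_end)
    return fields
```

The returned dictionary can be passed as the form data of any HTTP client that supports multipart/form-data uploads.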
Request Example
curl -X POST "http://localhost:5001/api/pdf/process" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "extraction_method=combined" \
  -F "template=MeetingMinutes" \
  -F "includeImages=true"
Response (Success)
{
  "status": "success",
  "data": {
    "extracted_text": "Full extracted text from PDF...",
    "metadata": {
      "page_count": 10,
      "author": "Author Name",
      "title": "Document Title",
      "text_contents": [
        {
          "page": 1,
          "text": "Page 1 text...",
          "method": "native"
        }
      ],
      "image_paths": [
        "/path/to/page_1.jpg",
        "/path/to/page_2.jpg"
      ]
    },
    "images_archive_filename": "document_images.zip",
    "images_archive_data": "base64_encoded_zip_data"
  }
}
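Each entry in metadata.text_contents records which method produced the text of a page. The sketch below groups page numbers by method, which may be useful for checking how a combined run was split. `pages_by_method` is a hypothetical helper, and it assumes that different pages of one response can carry different method values.

```python
from collections import defaultdict


def pages_by_method(response):
    """Map each extraction method to the sorted page numbers it produced."""
    if response.get("status") != "success":
        raise ValueError("response does not report success")
    contents = response["data"]["metadata"].get("text_contents", [])
    grouped = defaultdict(list)
    for entry in contents:
        grouped[entry["method"]].append(entry["page"])
    return {method: sorted(pages) for method, pages in grouped.items()}
```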
POST /api/pdf/process-mistral-ocr
Process a PDF file with Mistral OCR transformation and parallel page image extraction.
This endpoint is designed specifically for Mistral OCR transformation with integrated images. It runs two processes in parallel:
1. Mistral OCR Transformation: Converts the PDF to Markdown with embedded images (recognized by Mistral OCR)
2. Page Image Extraction: Extracts PDF pages as images and returns them as a ZIP archive
Request
Content-Type: multipart/form-data
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| file | File | Yes | - | PDF file |
| page_start | Integer | No | - | Start page (1-indexed) |
| page_end | Integer | No | - | End page (1-indexed, inclusive) |
| includeImages | Boolean | No | true | Request Mistral OCR images as Base64 in response. Images are available in data.mistral_ocr_raw.pages[*].images[*].image_base64. |
| includePageImages | Boolean | No | true | Extract PDF pages as images and return as ZIP archive. Runs in parallel to Mistral OCR transformation. |
| useCache | Boolean | No | true | Whether to use cache |
| callback_url | String | No | - | Absolute HTTPS URL for webhook callback |
| callback_token | String | No | - | Per-job secret for webhook callback |
| jobId | String | No | - | Unique job ID for callback |
| wait_ms | Integer | No | 0 | Optional: Wait time in milliseconds for completion (only without callback_url) |
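Two constraints in the table lend themselves to a client-side check: callback_url must be an absolute HTTPS URL, and wait_ms applies only when no callback_url is given. The sketch below uses a hypothetical helper, `check_async_options`. Treating a negative wait_ms as invalid is an assumption, not documented behaviour.

```python
from urllib.parse import urlsplit


def check_async_options(callback_url=None, wait_ms=0):
    """Return a list of problems with the async parameters (empty if none)."""
    problems = []
    if callback_url is not None:
        parts = urlsplit(callback_url)
        # An absolute HTTPS URL has the "https" scheme and a host part.
        if parts.scheme != "https" or not parts.netloc:
            problems.append("callback_url must be an absolute HTTPS URL")
        if wait_ms:
            problems.append("wait_ms only applies when callback_url is not set")
    # Assumption: negative wait times are rejected.
    if wait_ms < 0:
        problems.append("wait_ms must not be negative")
    return problems
```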
Request Example
curl -X POST "http://localhost:5001/api/pdf/process-mistral-ocr" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "includeImages=true" \
  -F "includePageImages=true" \
  -F "page_start=1" \
  -F "page_end=10"
Response (Success)
{
  "status": "success",
  "data": {
    "extracted_text": "--- Seite 1 ---\n\n\nText content...",
    "metadata": {
      "page_count": 10,
      "file_name": "document.pdf",
      "file_size": 753504,
      "extraction_method": "mistral_ocr_with_pages"
    },
    "mistral_ocr_raw": {
      "pages": [
        {
          "index": 0,
          "markdown": "\n\nText content...",
          "images": [
            {
              "id": "img-0.jpeg",
              "image_base64": "data:image/jpeg;base64,...",
              "top_left_x": 93,
              "top_left_y": 221,
              "bottom_right_x": 1577,
              "bottom_right_y": 508
            }
          ]
        }
      ],
      "model": "mistral-ocr-latest",
      "usage_info": {
        "pages_processed": 10,
        "doc_size_bytes": 753504
      }
    },
    "pages_archive_filename": "pages.zip",
    "pages_archive_data": "base64_encoded_zip_data",
    "images_archive_data": null,
    "images_archive_filename": null
  }
}
Response Structure
The response contains two types of images:
- Mistral OCR Images (data.mistral_ocr_raw.pages[*].images[*]):
  - Images recognized and extracted by Mistral OCR
  - Embedded in the Markdown text
  - Available as Base64-encoded data URLs
  - Include coordinates and annotations
- Page Images (data.pages_archive_data):
  - All PDF pages converted to images
  - Packaged as a Base64-encoded ZIP archive
  - Filename available in data.pages_archive_filename
  - Extracted in parallel to Mistral OCR processing
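Because the Mistral OCR images arrive as data URLs, the Base64 payload has to be separated from the `data:image/...;base64,` prefix before decoding. The sketch below uses a hypothetical helper, `iter_ocr_images`. Accepting a bare Base64 string without the prefix is an assumption added for robustness.

```python
import base64


def iter_ocr_images(response):
    """Yield (page_index, image_id, mime_type, raw_bytes) per Mistral OCR image."""
    raw = response["data"].get("mistral_ocr_raw") or {}
    for page in raw.get("pages", []):
        for image in page.get("images", []):
            data_url = image.get("image_base64")
            if not data_url:
                continue  # no Base64 payload, e.g. images were not requested
            if data_url.startswith("data:"):
                # Header looks like "data:image/jpeg;base64".
                header, _, payload = data_url.partition(",")
                mime_type = header[len("data:"):].split(";")[0]
            else:
                # Assumption: a bare Base64 string without a data-URL prefix.
                mime_type, payload = "", data_url
            yield page["index"], image["id"], mime_type, base64.b64decode(payload)
```

Each yielded tuple can be written to disk under the image id, which matches the image references embedded in the page Markdown.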
Differences from /api/pdf/process
- Dedicated endpoint: Simplified interface specifically for Mistral OCR workflows
- Parallel processing: Page image extraction runs in parallel to OCR transformation
- Two image types: Returns both Mistral OCR images and page images
- No template support: Focused on OCR transformation only
- Simplified parameters: Fewer options, clearer purpose
Use Cases
- Document digitization with full page images and OCR results
- Archival systems requiring both searchable text and page images
- Quality assurance workflows comparing OCR results with original pages
- Multi-format export (Markdown with embedded images + page images)
Downloading Page Images Archive
There are two ways to download the ZIP archive with PDF pages as images:
Option 1: Direct from Response (Base64)
The pages_archive_data field in the response contains the ZIP file as a Base64-encoded string. You can decode it directly:
// Extract from response
const response = await fetch('/api/pdf/process-mistral-ocr', {...});
const data = await response.json();
if (data.data.pages_archive_data) {
  // Decode Base64 to binary
  const binaryString = atob(data.data.pages_archive_data);
  const bytes = new Uint8Array(binaryString.length);
  for (let i = 0; i < binaryString.length; i++) {
    bytes[i] = binaryString.charCodeAt(i);
  }
  // Create blob and download
  const blob = new Blob([bytes], { type: 'application/zip' });
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = data.data.pages_archive_filename || 'pages.zip';
  a.click();
  URL.revokeObjectURL(url);
}
import base64

# Extract from response
response_data = {...}  # Your API response
if response_data.get('data', {}).get('pages_archive_data'):
    # Decode Base64
    archive_data = base64.b64decode(response_data['data']['pages_archive_data'])
    # Fall back to a default name if the filename is missing or null
    filename = response_data['data'].get('pages_archive_filename') or 'pages.zip'
    # Save to file
    with open(filename, 'wb') as f:
        f.write(archive_data)
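Before writing the archive to disk, it can be inspected in memory. The sketch below decodes the Base64 string, checks the ZIP for corrupt members, and lists its contents using the standard zipfile module. `list_archive_members` is a hypothetical helper, and the member names inside a real archive are not specified by this documentation.

```python
import base64
import io
import zipfile


def list_archive_members(archive_b64):
    """Decode a Base64 ZIP archive in memory and return (name, size) pairs."""
    raw = base64.b64decode(archive_b64)
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        # testzip() returns the name of the first corrupt member, or None.
        bad = archive.testzip()
        if bad is not None:
            raise ValueError(f"corrupt archive member: {bad}")
        return [(info.filename, info.file_size) for info in archive.infolist()]
```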
Option 2: Download Endpoint (for Async Jobs)
If you're using async job processing (with callback_url or wait_ms=0), you can download the archive via a dedicated endpoint:
Endpoint: GET /api/pdf/jobs/{job_id}/download-pages-archive
Example:
curl -X GET "http://localhost:5001/api/pdf/jobs/{job_id}/download-pages-archive" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -o pages.zip
Response: Binary ZIP file with Content-Type: application/zip and Content-Disposition: attachment
Status Codes:
- 200: Success - ZIP file returned
- 202: Processing - Job still running, try again later
- 400: No archive available (check if includePageImages=true was set)
- 404: Job not found
- 500: Server error
Note: The download endpoint only works for jobs that were processed with includePageImages=true. The archive is stored in the job results and can be downloaded even after the initial response.
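The status codes above suggest a simple polling loop: retry on 202, return the body on 200, and fail on anything else. The sketch below keeps the HTTP call abstract so that it works with any client. `download_pages_archive` and its `fetch` parameter are hypothetical, and the retry count and delay are arbitrary defaults rather than recommended values.

```python
import time


def download_pages_archive(fetch, job_id, attempts=5, delay_s=2.0, sleep=time.sleep):
    """Poll the download endpoint until the archive is ready.

    `fetch` is a caller-supplied callable (hypothetical) that performs the
    GET request for the given path and returns (status_code, body_bytes).
    """
    path = f"/api/pdf/jobs/{job_id}/download-pages-archive"
    for attempt in range(attempts):
        status, body = fetch(path)
        if status == 200:
            return body  # binary ZIP content
        if status == 202:
            # Job still running: wait before the next attempt, if any remain.
            if attempt < attempts - 1:
                sleep(delay_s)
            continue
        if status == 400:
            raise ValueError("no archive available; was includePageImages=true set?")
        if status == 404:
            raise LookupError(f"job not found: {job_id}")
        raise RuntimeError(f"unexpected status {status}")
    raise TimeoutError(f"job {job_id} still processing after {attempts} attempts")
```

Passing `sleep` as a parameter makes the loop easy to exercise in tests without real delays.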
POST /api/pdf/job
Process PDF asynchronously as a job.
Request
Content-Type: application/json
Body:
{
  "filename": "/path/to/file.pdf",
  "extraction_method": "combined",
  "template": "MeetingMinutes",
  "use_cache": true,
  "webhook": {
    "url": "https://example.com/webhook",
    "token": "webhook_token",
    "jobId": "client_job_id"
  }
}
Response (Success)
{
  "status": "success",
  "data": {
    "job_id": "job-id-123",
    "status": "pending"
  }
}
Job Status
Query job status via /api/jobs/{job_id}:
{
  "status": "success",
  "data": {
    "job_id": "job-id-123",
    "status": "completed",
    "progress": {
      "step": "completed",
      "percent": 100,
      "message": "Processing completed"
    },
    "results": {
      "structured_data": {
        "extracted_text": "...",
        "metadata": {...}
      }
    }
  }
}
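A client that polls the job status usually only needs to know whether the text is available yet. The sketch below reads the status response shown above through a hypothetical helper, `job_text_if_done`. Only the "pending" and "completed" states appear in this documentation, so every other state is treated as "not done" here, which is an assumption.

```python
def job_text_if_done(status_response):
    """Return the extracted text once the job is completed, else None."""
    job = status_response["data"]
    # Assumption: any state other than "completed" means no results yet.
    if job["status"] != "completed":
        return None
    return job["results"]["structured_data"]["extracted_text"]
```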