Zum Inhalt

PDF API Endpoints

Endpoints for PDF file processing with text extraction and OCR.

POST /api/pdf/process

Process a PDF file with text extraction and optional OCR.

Request

Content-Type: multipart/form-data

Parameters:

Parameter Type Required Default Description
file File Yes - PDF file
extraction_method String No native Extraction method: native, tesseract_ocr, openai_vision, combined
template String No "" Optional template for text transformation
context JSON No {} Additional context for template
useCache Boolean No true Whether to use cache
includeImages Boolean No false Base64-kodiertes ZIP-Archiv mit generierten Bildern erstellen
page_start Integer No - Start page (1-indexed)
page_end Integer No - End page (1-indexed)

Extraction Methods

  • native: Extract text directly from PDF structure (fastest)
  • tesseract_ocr: Use Tesseract OCR for text extraction
  • openai_vision: Use OpenAI Vision API for OCR
  • combined: Try multiple methods and combine results

Note: For Mistral OCR transformation with integrated images, use the dedicated endpoint POST /api/pdf/process-mistral-ocr instead.

Request Example

curl -X POST "http://localhost:5001/api/pdf/process" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "extraction_method=combined" \
  -F "template=MeetingMinutes" \
  -F "includeImages=true"

Response (Success)

{
  "status": "success",
  "data": {
    "extracted_text": "Full extracted text from PDF...",
    "metadata": {
      "page_count": 10,
      "author": "Author Name",
      "title": "Document Title",
      "text_contents": [
        {
          "page": 1,
          "text": "Page 1 text...",
          "method": "native"
        }
      ],
      "image_paths": [
        "/path/to/page_1.jpg",
        "/path/to/page_2.jpg"
      ]
    },
    "images_archive_filename": "document_images.zip",
    "images_archive_data": "base64_encoded_zip_data"
  }
}

POST /api/pdf/process-mistral-ocr

Process a PDF file with Mistral OCR transformation and parallel page image extraction.

This endpoint is specifically designed for Mistral OCR transformation with integrated images. It runs two processes in parallel: 1. Mistral OCR Transformation: Converts PDF to Markdown with embedded images (recognized by Mistral OCR) 2. Page Image Extraction: Extracts PDF pages as images and returns them as a ZIP archive

Request

Content-Type: multipart/form-data

Parameters:

Parameter Type Required Default Description
file File Yes - PDF file
page_start Integer No - Start page (1-indexed)
page_end Integer No - End page (1-indexed, inclusive)
includeImages Boolean No true Request Mistral OCR images as Base64 in response. Images are available in data.mistral_ocr_raw.pages[*].images[*].image_base64.
includePageImages Boolean No true Extract PDF pages as images and return as ZIP archive. Runs in parallel to Mistral OCR transformation.
useCache Boolean No true Whether to use cache
callback_url String No - Absolute HTTPS URL for webhook callback
callback_token String No - Per-job secret for webhook callback
jobId String No - Unique job ID for callback
wait_ms Integer No 0 Optional: Wait time in milliseconds for completion (only without callback_url)

Request Example

curl -X POST "http://localhost:5001/api/pdf/process-mistral-ocr" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "includeImages=true" \
  -F "includePageImages=true" \
  -F "page_start=1" \
  -F "page_end=10"

Response (Success)

{
  "status": "success",
  "data": {
    "extracted_text": "--- Seite 1 ---\n![img-0.jpeg](img-0.jpeg)\n\nText content...",
    "metadata": {
      "page_count": 10,
      "file_name": "document.pdf",
      "file_size": 753504,
      "extraction_method": "mistral_ocr_with_pages"
    },
    "mistral_ocr_raw": {
      "pages": [
        {
          "index": 0,
          "markdown": "![img-0.jpeg](img-0.jpeg)\n\nText content...",
          "images": [
            {
              "id": "img-0.jpeg",
              "image_base64": "data:image/jpeg;base64,...",
              "top_left_x": 93,
              "top_left_y": 221,
              "bottom_right_x": 1577,
              "bottom_right_y": 508
            }
          ]
        }
      ],
      "model": "mistral-ocr-latest",
      "usage_info": {
        "pages_processed": 10,
        "doc_size_bytes": 753504
      }
    },
    "pages_archive_filename": "pages.zip",
    "pages_archive_data": "base64_encoded_zip_data",
    "images_archive_data": null,
    "images_archive_filename": null
  }
}

Response Structure

The response contains two types of images:

  1. Mistral OCR Images (data.mistral_ocr_raw.pages[*].images[*]):
  2. Images recognized and extracted by Mistral OCR
  3. Embedded in the Markdown text
  4. Available as Base64-encoded data URLs
  5. Include coordinates and annotations

  6. Page Images (data.pages_archive_data):

  7. All PDF pages converted to images
  8. Packaged as a Base64-encoded ZIP archive
  9. Filename available in data.pages_archive_filename
  10. Extracted in parallel to Mistral OCR processing

Differences to /api/pdf/process

  • Dedicated endpoint: Simplified interface specifically for Mistral OCR workflows
  • Parallel processing: Page image extraction runs in parallel to OCR transformation
  • Two image types: Returns both Mistral OCR images and page images
  • No template support: Focused on OCR transformation only
  • Simplified parameters: Fewer options, clearer purpose

Use Cases

  • Document digitization with full page images and OCR results
  • Archival systems requiring both searchable text and page images
  • Quality assurance workflows comparing OCR results with original pages
  • Multi-format export (Markdown with embedded images + page images)

Downloading Page Images Archive

There are two ways to download the ZIP archive with PDF pages as images:

Option 1: Direct from Response (Base64)

The pages_archive_data field in the response contains the ZIP file as a Base64-encoded string. You can decode it directly:

// Extract from response
const response = await fetch('/api/pdf/process-mistral-ocr', {...});
const data = await response.json();

if (data.data.pages_archive_data) {
  // Decode Base64 to binary
  const binaryString = atob(data.data.pages_archive_data);
  const bytes = new Uint8Array(binaryString.length);
  for (let i = 0; i < binaryString.length; i++) {
    bytes[i] = binaryString.charCodeAt(i);
  }

  // Create blob and download
  const blob = new Blob([bytes], { type: 'application/zip' });
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = data.data.pages_archive_filename || 'pages.zip';
  a.click();
  URL.revokeObjectURL(url);
}
import base64

# Extract from response
response_data = {...}  # Your API response

if response_data.get('data', {}).get('pages_archive_data'):
    # Decode Base64
    archive_data = base64.b64decode(response_data['data']['pages_archive_data'])
    filename = response_data['data'].get('pages_archive_filename', 'pages.zip')

    # Save to file
    with open(filename, 'wb') as f:
        f.write(archive_data)

Option 2: Download Endpoint (for Async Jobs)

If you're using async job processing (with callback_url or wait_ms=0), you can download the archive via a dedicated endpoint:

Endpoint: GET /api/pdf/jobs/{job_id}/download-pages-archive

Example:

curl -X GET "http://localhost:5001/api/pdf/jobs/{job_id}/download-pages-archive" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -o pages.zip

Response: Binary ZIP file with Content-Type: application/zip and Content-Disposition: attachment

Status Codes: - 200: Success - ZIP file returned - 202: Processing - Job still running, try again later - 400: No archive available (check if includePageImages=true was set) - 404: Job not found - 500: Server error

Note: The download endpoint only works for jobs that were processed with includePageImages=true. The archive is stored in the job results and can be downloaded even after the initial response.

POST /api/pdf/job

Process PDF asynchronously as a job.

Request

Content-Type: application/json

Body:

{
  "filename": "/path/to/file.pdf",
  "extraction_method": "combined",
  "template": "MeetingMinutes",
  "use_cache": true,
  "webhook": {
    "url": "https://example.com/webhook",
    "token": "webhook_token",
    "jobId": "client_job_id"
  }
}

Response (Success)

{
  "status": "success",
  "data": {
    "job_id": "job-id-123",
    "status": "pending"
  }
}

Job Status

Query job status via /api/jobs/{job_id}:

{
  "status": "success",
  "data": {
    "job_id": "job-id-123",
    "status": "completed",
    "progress": {
      "step": "completed",
      "percent": 100,
      "message": "Processing completed"
    },
    "results": {
      "structured_data": {
        "extracted_text": "...",
        "metadata": {...}
      }
    }
  }
}