Audio API Endpoints

Endpoints for audio file processing with transcription and optional translation.

POST /api/audio/process

Transcribes an audio file and optionally applies a template-based transformation and a translation to the result.

Request

Content-Type: multipart/form-data

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `file` | File | Yes | - | Audio file (MP3, WAV, M4A, FLAC, OGG, etc.) |
| `source_language` | String | No | `de` | Source language (ISO 639-1 code, e.g., "en", "de") |
| `target_language` | String | No | `de` | Target language for translation (ISO 639-1 code) |
| `template` | String | No | `""` | Optional template name for text transformation |
| `useCache` | Boolean | No | `true` | Whether to use cached results; set to `false` to force reprocessing |
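
As a minimal sketch of how a client might assemble the non-file form fields, the helper below applies the documented defaults and serializes the boolean as the lowercase string used in form data. The function name `build_form_fields` is hypothetical, and the two-letter check is only a rough client-side sanity check, not the server's own validation.

```python
def build_form_fields(source_language="de", target_language="de",
                      template="", use_cache=True):
    """Return the non-file form fields for POST /api/audio/process.

    Defaults mirror the parameter table; the helper itself is illustrative.
    """
    for name, code in (("source_language", source_language),
                       ("target_language", target_language)):
        # ISO 639-1 codes are exactly two ASCII letters.
        if len(code) != 2 or not (code.isascii() and code.isalpha()):
            raise ValueError(
                f"{name} must be a two-letter ISO 639-1 code, got {code!r}"
            )
    return {
        "source_language": source_language.lower(),
        "target_language": target_language.lower(),
        "template": template,
        # Multipart form values are strings, so the boolean is spelled out.
        "useCache": "true" if use_cache else "false",
    }
```

The returned dictionary can be passed as the form data of any HTTP client alongside the `file` part.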

Supported Formats

FLAC, M4A, MP3, MP4, MPEG, MPGA, OGA, OGG, WAV, WEBM
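
A client can avoid an unnecessary upload by checking the file extension against this list first. The sketch below assumes the format is judged by extension alone; the server may inspect the file differently, and `is_supported_format` is a hypothetical helper name.

```python
from pathlib import Path

# Extensions taken from the list of supported formats, lowercased.
SUPPORTED_FORMATS = frozenset(
    {"flac", "m4a", "mp3", "mp4", "mpeg", "mpga", "oga", "ogg", "wav", "webm"}
)


def is_supported_format(filename: str) -> bool:
    """Return True if the file extension appears in the documented list."""
    extension = Path(filename).suffix.lstrip(".").lower()
    return extension in SUPPORTED_FORMATS
```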

Request Example

```bash
curl -X POST "http://localhost:5001/api/audio/process" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@audio.mp3" \
  -F "source_language=en" \
  -F "target_language=de" \
  -F "template=MeetingMinutes" \
  -F "useCache=true"
```
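
The same request can be built in Python with the standard library only. This is a sketch rather than an official client: `build_process_request` is a hypothetical name, the multipart body is encoded by hand, and the file part is labeled `application/octet-stream` as an assumption about what the server accepts.

```python
import urllib.request
import uuid


def build_process_request(base_url, api_key, filename, file_bytes, fields):
    """Build (but do not send) a multipart POST to /api/audio/process."""
    boundary = uuid.uuid4().hex
    parts = []

    # One part per plain form field (source_language, template, ...).
    for name, value in fields.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
        )
        parts.append(header.encode("utf-8") + str(value).encode("utf-8") + b"\r\n")

    # The audio file itself, sent under the documented field name "file".
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8") + file_bytes + b"\r\n")

    # Closing boundary terminates the multipart body.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        f"{base_url}/api/audio/process",
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` sends it; note that `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so the error body has to be read from the exception.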

Response (Success)

Status Code: 200 OK

```json
{
  "status": "success",
  "request": {
    "id": "process-id-123",
    "timestamp": "2024-01-01T00:00:00Z"
  },
  "process": {
    "duration_ms": 5000,
    "llm_info": {
      "total_tokens": 1500,
      "total_cost": 0.015,
      "requests": [
        {
          "model": "whisper-1",
          "purpose": "transcription",
          "tokens": 1500,
          "duration_ms": 4500
        }
      ]
    }
  },
  "data": {
    "duration": 120.5,
    "detected_language": "en",
    "output_text": "Transcribed and transformed text...",
    "original_text": "Original transcribed text...",
    "translated_text": "Translated text...",
    "llm_model": "whisper-1",
    "translation_model": "gpt-4",
    "token_count": 1500,
    "segments": [
      {
        "id": 0,
        "start": 0.0,
        "end": 10.5,
        "text": "First segment..."
      }
    ],
    "process_id": "process-id-123",
    "process_dir": "/path/to/process/dir",
    "from_cache": false
  }
}
```
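
To illustrate how a client might consume this payload, the sketch below pulls a few fields from `data` into a small dataclass. `AudioResult` and `parse_success` are hypothetical names, and the field selection simply follows the sample response above; it makes no claim about which fields are always present.

```python
import json
from dataclasses import dataclass


@dataclass
class AudioResult:
    process_id: str
    detected_language: str
    output_text: str
    duration: float      # unit as reported by the API; not stated in the sample
    from_cache: bool
    segment_count: int


def parse_success(payload: str) -> AudioResult:
    """Parse a success response body into an AudioResult."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"unexpected status: {body.get('status')!r}")
    data = body["data"]
    return AudioResult(
        process_id=data["process_id"],
        detected_language=data["detected_language"],
        output_text=data["output_text"],
        duration=data["duration"],
        from_cache=data["from_cache"],
        # Segments are optional in this sketch; default to an empty list.
        segment_count=len(data.get("segments", [])),
    )
```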

Response (Error)

Status Code: 400 Bad Request

```json
{
  "status": "error",
  "error": {
    "code": "INVALID_FORMAT",
    "message": "The format 'xyz' is not supported. Supported formats: flac, m4a, mp3...",
    "details": {
      "error_type": "INVALID_FORMAT",
      "supported_formats": ["flac", "m4a", "mp3", ...]
    }
  }
}
```
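
One way to surface such errors in client code is to convert the payload into an exception. The sketch below assumes every error response follows the `status` / `error` shape shown above; `AudioApiError` and `raise_for_error` are hypothetical names.

```python
import json


class AudioApiError(Exception):
    """Raised when the API returns a body with status "error"."""

    def __init__(self, code, message, details=None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.details = details or {}


def raise_for_error(payload: str) -> dict:
    """Return the parsed body, or raise AudioApiError for an error response."""
    body = json.loads(payload)
    if body.get("status") == "error":
        error = body.get("error", {})
        raise AudioApiError(
            error.get("code", "UNKNOWN"),   # fallback code is an assumption
            error.get("message", ""),
            error.get("details"),
        )
    return body
```

A caller can then branch on `exc.code`, for example to show `exc.details["supported_formats"]` when the code is `INVALID_FORMAT`.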

Processing Flow

1. The audio file is uploaded and validated.
2. The file is segmented into manageable chunks (if large).
3. Each segment is transcribed using the OpenAI Whisper API.
4. Optional: The text is transformed using the template (via TransformerProcessor).
5. Optional: The text is translated to the target language.
6. The results are aggregated and returned.
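
The ordering of these steps can be sketched as a small pipeline. This is not the service's implementation: `process_audio` and its callable parameters (`split`, `transcribe`, `transform`, `translate`) are hypothetical stand-ins for the real segmentation, Whisper, TransformerProcessor, and translation components, and joining segment texts with a space is an assumption.

```python
from typing import Callable, List, Optional


def process_audio(
    audio: bytes,
    split: Callable[[bytes], List[bytes]],
    transcribe: Callable[[bytes], str],
    transform: Optional[Callable[[str], str]] = None,
    translate: Optional[Callable[[str], str]] = None,
) -> dict:
    """Run the documented steps in order, using injected callables."""
    if not audio:
        # Stand-in for upload validation.
        raise ValueError("empty audio file")

    # Segment the file, then transcribe each chunk separately.
    chunks = split(audio)
    original_text = " ".join(transcribe(chunk) for chunk in chunks)

    # Optional steps are applied in the documented order.
    output_text = original_text
    if transform is not None:
        output_text = transform(output_text)
    if translate is not None:
        output_text = translate(output_text)

    # Aggregate the results into one structure.
    return {
        "original_text": original_text,
        "output_text": output_text,
        "segment_count": len(chunks),
    }
```

Injecting the steps as callables keeps the sketch runnable with simple stubs, which is also convenient for testing the ordering without calling any external API.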

LLM Tracking

The response includes detailed LLM usage information:

- Total tokens used
- Total cost
- Individual requests, each with model, purpose, tokens, and duration
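
As an illustration, the usage block can be condensed on the client side, for example by grouping tokens per purpose. The sketch assumes the `llm_info` shape from the sample success response; `summarize_llm_usage` is a hypothetical helper.

```python
from collections import defaultdict


def summarize_llm_usage(llm_info: dict) -> dict:
    """Condense the llm_info block of a response into per-purpose totals."""
    tokens_by_purpose = defaultdict(int)
    for request in llm_info.get("requests", []):
        tokens_by_purpose[request["purpose"]] += request["tokens"]
    return {
        "total_tokens": llm_info["total_tokens"],
        "total_cost": llm_info["total_cost"],
        "tokens_by_purpose": dict(tokens_by_purpose),
    }
```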

Caching

Results are cached based on:

- File hash
- Source language
- Target language
- Template name

Set `useCache=false` to bypass the cache and force reprocessing.
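
A cache key built from these four inputs could look like the sketch below. The hash algorithm (SHA-256), the `|` separator, and the name `cache_key` are all assumptions made for illustration; the service's actual key derivation is not specified here.

```python
import hashlib


def cache_key(file_bytes: bytes, source_language: str,
              target_language: str, template: str = "") -> str:
    """Derive a deterministic key from the four documented cache inputs."""
    # Hash the file content first so large files contribute a fixed-size value.
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    key_material = "|".join([file_hash, source_language, target_language, template])
    return hashlib.sha256(key_material.encode("utf-8")).hexdigest()
```

With a key of this kind, the same file requested with a different template or target language maps to a different cache entry, which matches the behavior described above.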