AssemblyAI Universal-3 Pro

assemblyai/universal-3-pro
AssemblyAI Speech-to-Text · Transcription · Multilingual

AssemblyAI's Universal 3 Pro speech recognition model for high-accuracy transcription.

Quick start

# Inspect the price: a plain request returns the 402 challenge.
curl -i https://api.glianalabs.com/v1/infer \
  -H "content-type: application/json" \
  -d '{
    "model": "assemblyai/universal-3-pro",
    "audio_url": "<string>",
    "prompt": "<string>"
  }'

# Pay and run in one step with the mppx CLI (create a wallet first: npx mppx account create).
npx mppx https://api.glianalabs.com/v1/infer \
  -J '{"model": "assemblyai/universal-3-pro", "audio_url": "<string>", "prompt": "<string>"}'
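For callers who prefer Python over curl, the unpaid request can be sketched with the standard library alone. This is a minimal sketch, not an official client: the helper names build_infer_request and fetch_challenge are hypothetical, the audio URL is a placeholder, and the payment step handled by mppx is not reproduced here. Because urllib raises on 4xx responses, the 402 challenge is read from the HTTPError.

```python
import json
import urllib.error
import urllib.request

INFER_URL = "https://api.glianalabs.com/v1/infer"  # endpoint from the curl example


def build_infer_request(payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST to the inference endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        INFER_URL,
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


def fetch_challenge(req: urllib.request.Request):
    """Send the request and return (status, headers), including for a 402."""
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, dict(resp.headers)
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, so the 402 challenge arrives here.
        return err.code, dict(err.headers)


req = build_infer_request({
    "model": "assemblyai/universal-3-pro",
    "audio_url": "https://example.com/audio.mp3",  # placeholder, not a real file
})
# status, headers = fetch_challenge(req)  # uncomment to contact the API
```

The network call is left commented out so the sketch can be inspected without sending a request.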

Parameters

Input
audio_end_at integer

Timestamp (in milliseconds) to end transcription at.

audio_start_from integer

Timestamp (in milliseconds) to start transcription from.
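Since audio_start_from and audio_end_at are both expressed in milliseconds, a small helper can convert a window given in seconds. This is a sketch: the name clip_window is hypothetical, and the check that the start precedes the end is an assumption about what the API accepts rather than something the parameter descriptions state.

```python
def clip_window(start_s: float, end_s: float) -> dict:
    """Convert a window in seconds to the millisecond request fields."""
    # Assumption: the API expects 0 <= start < end.
    if start_s < 0 or end_s <= start_s:
        raise ValueError("need 0 <= start < end")
    return {
        "audio_start_from": round(start_s * 1000),
        "audio_end_at": round(end_s * 1000),
    }


# Transcribe only the stretch from 1:30 to 2:30.5 of the recording.
window = clip_window(90, 150.5)
```

The returned dict can be merged into the request payload alongside model and audio_url.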

audio_url string

URL of the audio file to transcribe: either a publicly accessible URL or a data URI (data:audio/...;base64,...). Data URIs are uploaded to AssemblyAI automatically. Required for pre-recorded transcription (when websocket is false or not set).
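When the audio is only available locally, it can be passed as a data URI. The sketch below builds one from raw bytes; the helper name audio_data_uri is hypothetical, and the caller is assumed to know the correct audio MIME type. Any size limit on data URIs is not documented here, so large files may be better served from a public URL.

```python
import base64


def audio_data_uri(data: bytes, mime: str = "audio/wav") -> str:
    """Encode raw audio bytes in the data:audio/...;base64,... form."""
    if not mime.startswith("audio/"):
        raise ValueError(f"expected an audio MIME type, got {mime!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Four placeholder bytes stand in for a real file's contents.
uri = audio_data_uri(b"RIFF", "audio/wav")
payload = {"model": "assemblyai/universal-3-pro", "audio_url": uri}
```

In practice the bytes would come from reading the file in binary mode, for example with pathlib.Path(...).read_bytes().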

auto_chapters boolean

Enable automatic chapter detection.

auto_highlights boolean

Enable automatic extraction of key phrases and highlights.

boost_param string

How much to boost the words in word_boost.

content_safety boolean

Enable content safety detection for sensitive content.

custom_spelling array

Custom spelling rules to replace specific words or phrases in the transcription output.

disfluencies boolean

Include filler words like "um", "uh", etc. in the transcript.

domain string

Domain-specific transcription mode. "medical-v1" enables medical terminology optimization.

dual_channel boolean

Process audio as dual-channel (stereo) for better accuracy.

entity_detection boolean

Enable detection of entities like names, organizations, and locations.

filter_profanity boolean

Filter profanity from the transcription.

iab_categories boolean

Enable IAB (Interactive Advertising Bureau) content taxonomy classification.

keyterms_prompt array

An array of up to 1,000 words or phrases (max 6 words per phrase) to improve transcription accuracy. Cannot be used with the prompt parameter.
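The limits on keyterms_prompt can be checked on the client before a paid request is sent. This is a sketch under assumptions: check_keyterms is a hypothetical helper, and words are counted by splitting on whitespace, which may differ from how the service counts them.

```python
def check_keyterms(payload: dict) -> list:
    """Return a list of problems with keyterms_prompt; empty means none found."""
    problems = []
    terms = payload.get("keyterms_prompt") or []
    if terms and payload.get("prompt"):
        problems.append("keyterms_prompt cannot be combined with prompt")
    if len(terms) > 1000:
        problems.append(f"too many entries: {len(terms)} > 1000")
    for term in terms:
        # Assumption: a "word" is a whitespace-separated token.
        if len(term.split()) > 6:
            problems.append(f"phrase longer than 6 words: {term!r}")
    return problems


ok = check_keyterms({"keyterms_prompt": ["Universal-3 Pro", "speaker diarization"]})
bad = check_keyterms({
    "keyterms_prompt": ["one two three four five six seven"],
    "prompt": "Transcribe verbatim.",
})
```

Here ok is an empty list, while bad reports both the conflict with prompt and the seven-word phrase.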

language_code string

The language code for the audio file (e.g., "en", "es", "fr"). Defaults to automatic language detection.

language_detection boolean

Enable automatic language detection. When enabled with speech_models, the system will automatically select the best model for the detected language.

multichannel boolean

Process each audio channel separately for multi-channel audio files.

prompt string

A custom prompt to guide transcription style, formatting, and output characteristics. Maximum 1,500 words.

redact_pii boolean

Redact personally identifiable information.

redact_pii_audio boolean

Generate a redacted audio file with PII removed.

redact_pii_policies array

Specific PII policies to apply for redaction.

redact_pii_sub string

Strategy for substituting redacted PII.

sentiment_analysis boolean

Enable sentiment analysis for each sentence.

speaker_labels boolean

Enable speaker diarization to identify different speakers in the audio.

speakers_expected integer

Expected number of speakers for speaker diarization.

speech_threshold number

Confidence threshold for speech detection.

temperature number

Controls randomness in model output (0.0-1.0). Lower values make output more deterministic. Default is 0.0.

webhook_url string

URL to receive webhook notifications when transcription is complete.

websocket boolean

Enable real-time WebSocket streaming for live audio transcription. When true, a WebSocket connection is established instead of submitting a pre-recorded transcription job. Cannot be used with audio_url.
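The interaction between websocket and audio_url can be expressed as a small pre-flight check. This sketch reads the audio_url requirement as applying whenever websocket is false or unset, which is an interpretation of the two parameter descriptions; transcription_mode is a hypothetical helper, not part of any SDK.

```python
def transcription_mode(payload: dict) -> str:
    """Classify a payload as 'streaming' or 'pre-recorded', or raise."""
    streaming = bool(payload.get("websocket", False))
    has_audio = bool(payload.get("audio_url"))
    if streaming and has_audio:
        raise ValueError("websocket cannot be used with audio_url")
    if not streaming and not has_audio:
        raise ValueError("audio_url is required for pre-recorded transcription")
    return "streaming" if streaming else "pre-recorded"


live = transcription_mode({"model": "assemblyai/universal-3-pro", "websocket": True})
batch = transcription_mode({
    "model": "assemblyai/universal-3-pro",
    "audio_url": "https://example.com/audio.mp3",  # placeholder URL
})
```

Running the check before paying for a request avoids submitting a payload that mixes the two modes.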

word_boost array

Array of words whose recognition should be boosted (legacy; use keyterms_prompt instead).

Output
confidence: Overall confidence score for the transcription.
language_code: Detected or specified language code.
language_confidence: Confidence score for language detection.
text: The transcribed text.
utterances: Speaker-separated utterances (when speaker_labels is enabled).
words: Word-level timestamps and confidence scores.
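A short sketch of how these output fields might be consumed follows. The top-level keys come from the list above, but the inner shape of each utterances and words entry (speaker, text, start, end, confidence) is an assumption, and the sample values are invented purely for illustration. Both helper names are hypothetical.

```python
def low_confidence_words(result: dict, threshold: float = 0.5) -> list:
    """Collect the words whose confidence falls below the threshold."""
    return [w["text"] for w in result.get("words") or []
            if w["confidence"] < threshold]


def speaker_turns(result: dict) -> list:
    """Format utterances as 'speaker: text' lines (needs speaker_labels)."""
    return [f'{u["speaker"]}: {u["text"]}' for u in result.get("utterances") or []]


# Invented sample; the per-item field names are assumed, not documented above.
sample = {
    "text": "Hello there. Hi.",
    "confidence": 0.93,
    "language_code": "en",
    "language_confidence": 0.99,
    "utterances": [
        {"speaker": "A", "text": "Hello there."},
        {"speaker": "B", "text": "Hi."},
    ],
    "words": [
        {"text": "Hello", "start": 0, "end": 400, "confidence": 0.98},
        {"text": "there.", "start": 420, "end": 800, "confidence": 0.41},
        {"text": "Hi.", "start": 1200, "end": 1500, "confidence": 0.95},
    ],
}

flagged = low_confidence_words(sample)
turns = speaker_turns(sample)
```

Both helpers tolerate a missing or null utterances or words field, which is useful when speaker_labels was not requested.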