Grok STT

xai/grok-stt

xAI · Speech-to-Text · Transcription · Multilingual

xAI's Grok speech-to-text model. Transcribes audio files into text across 25 languages with word-level timestamps, multichannel transcription, speaker diarization, and key-term biasing.

Quick start

# Inspect the price — a plain request returns the 402 challenge:
curl -i https://api.glianalabs.com/v1/infer \
  -H "content-type: application/json" \
  -d '{
    "model": "xai/grok-stt",
    "url": "https://example.com/input"
  }'

# Pay + run in one step with the mppx CLI (create a wallet: npx mppx account create):
npx mppx https://api.glianalabs.com/v1/infer \
  -J '{"model": "xai/grok-stt", "url": "https://example.com/input"}'

Parameters

Input

audio_format string optional

Format hint for raw/headerless audio. Required for pcm, mulaw, alaw. Omit for container formats (mp3, wav, etc.) — xAI auto-detects them.

channels integer optional

Number of audio channels (2–8). Required only for multichannel raw audio; auto-detected for container formats.

diarize boolean optional

When true, enables speaker diarization. Each word in the response includes a `speaker` integer identifying the detected speaker.

file string optional

Audio file as a data URI (data:audio/...;base64,...) or an HTTPS URL the gateway fetches and uploads. Supported container formats: flac, mp3, mp4, m4a, mkv, ogg, opus, wav, aac. Raw formats (pcm, mulaw, alaw) are also accepted; supply audio_format and sample_rate. Gateway-side size limit: 25 MB. Mutually exclusive with `url`.
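A local recording has to be turned into a data URI before it can go in `file`. The sketch below shows one way to do that in a shell. The helper name `to_data_uri` and the `MAX_BYTES` variable are hypothetical, and it assumes the 25 MB limit means 25 × 1024 × 1024 bytes of the original file, which this page does not spell out.

```shell
# Sketch only: to_data_uri and MAX_BYTES are illustrative names, not part of the API.
# Assumption: the 25 MB gateway limit is counted on the raw file, as 25 * 1024 * 1024 bytes.
MAX_BYTES=$((25 * 1024 * 1024))

# to_data_uri PATH MIME_TYPE
# Prints data:<mime>;base64,<payload> for the file, or fails if it is too large.
to_data_uri() {
  path="$1"
  mime="$2"
  size=$(wc -c < "$path" | tr -d ' ')
  if [ "$size" -gt "$MAX_BYTES" ]; then
    echo "file is $size bytes, over the limit; pass it via url instead" >&2
    return 1
  fi
  # Strip the line wrapping some base64 implementations add.
  printf 'data:%s;base64,%s' "$mime" "$(base64 < "$path" | tr -d '\n')"
}
```

Anything beyond a short clip will produce a string too long for a command-line argument, so it is safer to write the JSON body to a file and send it with curl's `-d @body.json` than to inline it.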

filler_words boolean optional

When true, filler words (uh, um, er) are included in the transcript. Defaults to false — filler words are removed.

format boolean optional

When true, enables Inverse Text Normalization — spoken numbers and currencies are converted to written form (e.g. "one hundred dollars" → "$100"). Requires language to be set.

keyterm array optional

Key terms to bias transcription toward (e.g. product names, proper nouns). Each term up to 50 characters, max 100 terms. Sent as repeated form fields: keyterm=Term+One&keyterm=Term+Two.
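The limits above (50 characters per term, 100 terms) are easy to check before sending a request. The following is a minimal sketch with hypothetical helper names `check_keyterms` and `keyterm_json`. It assumes the gateway's JSON body takes `keyterm` as an array of strings and handles the repeated-form-field encoding itself; this page does not confirm that.

```shell
# Sketch only: helper names are illustrative.
# check_keyterms TERM...
# Succeeds only if there are at most 100 terms of at most 50 characters each.
# Note: ${#term} counts characters in bash but bytes in some other shells.
check_keyterms() {
  if [ "$#" -gt 100 ]; then
    echo "too many key terms: $# (max 100)" >&2
    return 1
  fi
  for term in "$@"; do
    if [ "${#term}" -gt 50 ]; then
      echo "key term longer than 50 characters: $term" >&2
      return 1
    fi
  done
  return 0
}

# keyterm_json TERM...
# Prints the terms as a JSON array of strings.
# Assumes terms contain no double quotes or backslashes (no escaping is done).
keyterm_json() {
  out=""
  for term in "$@"; do
    out="$out${out:+,}\"$term\""
  done
  printf '[%s]' "$out"
}
```

For example, `keyterm_json "Grok" "xAI"` prints `["Grok","xAI"]`, which can be placed under a `keyterm` key in the request body.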

language string optional

Language code (e.g. "en", "fr", "de"). Not needed for transcription itself: xAI transcribes the audio whether or not it is set. Supply it together with format=true to enable Inverse Text Normalization, which formats numbers and currencies in the transcript.
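Because `format` only takes effect when `language` is set, it can help to enforce that pairing when building the request. A minimal sketch follows, using a hypothetical `build_body` helper. It assumes these parameters sit at the top level of the JSON body next to `model` and `url`, as `url` does in the quick start.

```shell
# Sketch only: build_body is an illustrative helper, not part of the API.
# build_body URL [LANGUAGE] [FORMAT]
# Prints a JSON request body; refuses format=true without a language.
# Assumes the values contain no characters that need JSON escaping.
build_body() {
  url="$1"
  language="${2:-}"
  format="${3:-false}"
  if [ "$format" = "true" ] && [ -z "$language" ]; then
    echo "format=true requires language to be set" >&2
    return 1
  fi
  body="\"model\": \"xai/grok-stt\", \"url\": \"$url\""
  if [ -n "$language" ]; then
    body="$body, \"language\": \"$language\""
  fi
  if [ "$format" = "true" ]; then
    body="$body, \"format\": true"
  fi
  printf '{%s}' "$body"
}

# Possible use with the mppx CLI from the quick start:
#   npx mppx https://api.glianalabs.com/v1/infer \
#     -J "$(build_body https://example.com/input en true)"
```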

multichannel boolean optional

When true, each audio channel is transcribed independently. Results are returned in the `channels` array. Requires channels ≥ 2.

sample_rate integer optional

Sample rate in Hz. Required when audio_format is set.
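The raw-audio parameters depend on each other: `audio_format` applies only to pcm, mulaw, and alaw, it requires `sample_rate`, and `channels` must fall between 2 and 8 when given. The sketch below checks those rules and prints the matching JSON fields. `raw_audio_fields` is a hypothetical helper, and placing the fields at the top level of the request body is an assumption based on the quick start.

```shell
# Sketch only: raw_audio_fields is an illustrative helper, not part of the API.
# raw_audio_fields AUDIO_FORMAT SAMPLE_RATE [CHANNELS]
# Prints JSON fields (without surrounding braces) for raw/headerless audio.
raw_audio_fields() {
  audio_format="$1"
  sample_rate="${2:-}"
  channels="${3:-}"
  case "$audio_format" in
    pcm|mulaw|alaw) ;;
    *)
      echo "audio_format is only for raw audio: pcm, mulaw, alaw" >&2
      return 1
      ;;
  esac
  if [ -z "$sample_rate" ]; then
    echo "sample_rate is required when audio_format is set" >&2
    return 1
  fi
  fields="\"audio_format\": \"$audio_format\", \"sample_rate\": $sample_rate"
  if [ -n "$channels" ]; then
    if [ "$channels" -lt 2 ] || [ "$channels" -gt 8 ]; then
      echo "channels must be between 2 and 8" >&2
      return 1
    fi
    fields="$fields, \"channels\": $channels"
  fi
  printf '%s' "$fields"
}
```

For container formats such as mp3 or wav, none of these fields are needed, since the format and channel count are auto-detected.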

url string optional

HTTPS URL of an audio file for xAI to fetch server-side. Mutually exclusive with `file`. No gateway-side size limit applies.

Output

channels: Per-channel transcripts when multichannel=true.

duration: Audio duration in seconds, rounded to two decimal places.

language: Detected language name (e.g. "English", "French").

text: Full transcript text.

words: Word-level segments. Each entry has text, start, and end (times in seconds), plus a speaker integer when diarize=true.