Grok STT
xAI's Grok speech-to-text model. Transcribes audio files into text across 25 languages with word-level timestamps, multichannel transcription, speaker diarization, and key-term biasing.
Quick start
# Inspect the price — a plain request returns the 402 challenge:
curl -i https://api.glianalabs.com/v1/infer \
-H "content-type: application/json" \
-d '{
"model": "xai/grok-stt",
"file": "<string>",
"url": "<string>"
}'
# Pay + run in one step with the mppx CLI (create a wallet: npx mppx account create):
npx mppx https://api.glianalabs.com/v1/infer \
-J '{"model": "xai/grok-stt", "file": "<string>", "url": "<string>"}'
Parameters
`audio_format`: Format hint for raw/headerless audio. Required for pcm, mulaw, alaw. Omit for container formats (mp3, wav, etc.); xAI auto-detects them.
`channels`: Number of audio channels (2–8). Required only for multichannel raw audio; auto-detected for container formats.
When true, enables speaker diarization. Each word in the response includes a `speaker` integer identifying the detected speaker.
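The per-word `speaker` integer can be folded into speaker turns. This is a minimal sketch, not the documented response schema: it assumes the words arrive as a list of dicts, and the `text` key and the `speaker_turns` helper are hypothetical names; only `speaker` comes from the description above.

```python
from itertools import groupby


def speaker_turns(words: list[dict]) -> list[tuple[int, str]]:
    """Merge consecutive words that share a speaker id into one turn."""
    turns = []
    # groupby only merges adjacent items, which preserves the order of turns.
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        turns.append((speaker, " ".join(w["text"] for w in group)))
    return turns


# Hypothetical word list; "text" is an assumed field name.
sample = [
    {"text": "Hello", "speaker": 0},
    {"text": "there", "speaker": 0},
    {"text": "Hi", "speaker": 1},
]
print(speaker_turns(sample))
```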
`file`: Audio file as a data URI (data:audio/...;base64,...) or an HTTPS URL the gateway fetches and uploads. Supported container formats: flac, mp3, mp4, m4a, mkv, ogg, opus, wav, aac. Raw formats (pcm, mulaw, alaw) are also accepted; supply audio_format and sample_rate. Gateway-side size limit: 25 MB. Mutually exclusive with `url`.
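A data URI of the form described above can be built from the file's bytes with the standard library. This is a sketch under stated assumptions: `to_data_uri` and `MAX_GATEWAY_BYTES` are hypothetical names, and reading "25 MB" as 25 * 1024 * 1024 bytes measured on the unencoded file is a guess, since the page does not say how the limit is counted.

```python
import base64

# Assumption: the 25 MB gateway limit is 25 MiB of raw (unencoded) audio.
MAX_GATEWAY_BYTES = 25 * 1024 * 1024


def to_data_uri(audio_bytes: bytes, subtype: str = "wav") -> str:
    """Encode audio bytes as data:audio/<subtype>;base64,<payload>."""
    if len(audio_bytes) > MAX_GATEWAY_BYTES:
        raise ValueError("audio exceeds the gateway size limit")
    payload = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:audio/{subtype};base64,{payload}"


# The four bytes "RIFF" open every WAV file; a real call would pass the whole file.
print(to_data_uri(b"RIFF"))
```

In practice the bytes would come from something like `pathlib.Path("clip.wav").read_bytes()`, with `clip.wav` standing in for a local file.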
When true, filler words (uh, um, er) are included in the transcript. Defaults to false — filler words are removed.
`format`: When true, enables Inverse Text Normalization: spoken numbers and currencies are converted to written form (e.g. "one hundred dollars" → "$100"). Requires language to be set.
`keyterm`: Key terms to bias transcription toward (e.g. product names, proper nouns). Each term may be up to 50 characters, with at most 100 terms. Sent as repeated form fields: keyterm=Term+One&keyterm=Term+Two.
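The repeated-field encoding shown above can be reproduced with `urllib.parse.urlencode`, which accepts a list of pairs and therefore allows the same key more than once. This is only a sketch of the encoding and the stated limits; `encode_keyterms` is a hypothetical helper, and how key terms are carried in the gateway's JSON body is not stated on this page.

```python
from urllib.parse import urlencode

MAX_TERMS = 100       # "max 100 terms"
MAX_TERM_CHARS = 50   # "each term up to 50 characters"


def encode_keyterms(terms: list[str]) -> str:
    """Encode key terms as repeated keyterm=... form fields."""
    if len(terms) > MAX_TERMS:
        raise ValueError(f"at most {MAX_TERMS} key terms are allowed")
    for term in terms:
        if len(term) > MAX_TERM_CHARS:
            raise ValueError(f"key term longer than {MAX_TERM_CHARS} characters: {term!r}")
    # A list of (key, value) pairs keeps every repeated "keyterm" field.
    return urlencode([("keyterm", term) for term in terms])


print(encode_keyterms(["Term One", "Term Two"]))  # keyterm=Term+One&keyterm=Term+Two
```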
`language`: Language code (e.g. "en", "fr", "de"). Used with format=true to enable Inverse Text Normalization. xAI transcribes audio in any language whether or not this is set; supplying it only enables number/currency formatting in the transcript.
When true, each audio channel is transcribed independently. Results are returned in the `channels` array. Requires channels ≥ 2.
`sample_rate`: Sample rate in Hz. Required when audio_format is set.
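The rules for raw audio (a format hint, a mandatory sample rate, and 2–8 channels when a channel count is given) can be checked before a request is sent. The sketch below is a client-side convenience, not part of the API; `raw_audio_options` is a hypothetical helper, while the keys `audio_format`, `sample_rate`, and `channels` are the parameter names used on this page.

```python
from typing import Optional

RAW_FORMATS = {"pcm", "mulaw", "alaw"}


def raw_audio_options(audio_format: str, sample_rate: Optional[int],
                      channels: Optional[int] = None) -> dict:
    """Validate and collect the options that raw/headerless audio needs."""
    if audio_format not in RAW_FORMATS:
        raise ValueError("audio_format is only for raw audio: pcm, mulaw, alaw")
    if sample_rate is None or sample_rate <= 0:
        raise ValueError("sample_rate (Hz) is required when audio_format is set")
    options = {"audio_format": audio_format, "sample_rate": sample_rate}
    if channels is not None:
        # The documented range for the channel count is 2 to 8.
        if not 2 <= channels <= 8:
            raise ValueError("channels must be between 2 and 8")
        options["channels"] = channels
    return options


print(raw_audio_options("pcm", 16000, channels=2))
```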
`url`: HTTPS URL of an audio file for xAI to fetch server-side. Mutually exclusive with `file`. No gateway-side size limit applies.
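Because `file` and `url` are mutually exclusive, and format=true requires a language, a request body can be assembled with both constraints checked up front. This is a minimal sketch: `build_body` is a hypothetical helper, and placing the optional parameters as top-level JSON keys next to `model` is an assumption based on where `file` and `url` sit in the quick start.

```python
import json
from typing import Optional


def build_body(*, file: Optional[str] = None, url: Optional[str] = None,
               **options) -> str:
    """Return a JSON body for /v1/infer with exactly one audio source."""
    if (file is None) == (url is None):
        raise ValueError("supply exactly one of `file` or `url`")
    if options.get("format") and not options.get("language"):
        raise ValueError("format=true requires `language` to be set")
    body = {"model": "xai/grok-stt", **options}
    if file is not None:
        body["file"] = file
    else:
        body["url"] = url
    return json.dumps(body)


# example.com is a placeholder host, not a real audio source.
print(build_body(url="https://example.com/call.mp3", format=True, language="en"))
```

The resulting string could be passed to the `-d` flag of the curl command or the `-J` flag of the mppx command in the quick start.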