
Grok STT

xai/grok-stt
xAI Speech-to-Text · Transcription · Multilingual

xAI's Grok speech-to-text model. Transcribes audio files into text across 25 languages with word-level timestamps, multichannel transcription, speaker diarization, and key-term biasing.

Quick start

# Inspect the price: a plain request returns the 402 challenge.
# Supply either "file" or "url", not both (they are mutually exclusive).
curl -i https://api.glianalabs.com/v1/infer \
  -H "content-type: application/json" \
  -d '{
    "model": "xai/grok-stt",
    "file": "<string>"
  }'

# Pay + run in one step with the mppx CLI (create a wallet: npx mppx account create):
npx mppx https://api.glianalabs.com/v1/infer \
  -J '{"model": "xai/grok-stt", "file": "<string>"}'

Parameters

Input
audio_format string

Format hint for raw/headerless audio. Required for pcm, mulaw, alaw. Omit for container formats (mp3, wav, etc.) — xAI auto-detects them.

channels integer

Number of audio channels (2–8). Required only for multichannel raw audio; auto-detected for container formats.

diarize boolean

When true, enables speaker diarization. Each word in the response includes a `speaker` integer identifying the detected speaker.

file string

Audio file as a data URI (data:audio/...;base64,...) or an HTTPS URL the gateway fetches and uploads. Supported container formats: flac, mp3, mp4, m4a, mkv, ogg, opus, wav, aac. Raw formats (pcm, mulaw, alaw) are also accepted; supply audio_format and sample_rate. Gateway-side size limit: 25 MB. Mutually exclusive with `url`.
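As an illustration of the data URI form described above, here is a minimal Python sketch that encodes audio bytes for the `file` parameter. The helper name `to_data_uri` and the constant `MAX_FILE_BYTES` are hypothetical, and reading "25 MB" as 25 × 1024 × 1024 bytes of raw (pre-base64) data is an assumption; the source does not say how the limit is measured.

```python
import base64

# Assumption: the 25 MB gateway limit applies to the raw file bytes and is
# counted in binary megabytes. The source does not specify either detail.
MAX_FILE_BYTES = 25 * 1024 * 1024


def to_data_uri(audio: bytes, mime_type: str = "audio/wav") -> str:
    """Encode audio bytes as a data:<mime>;base64,<payload> URI."""
    if len(audio) > MAX_FILE_BYTES:
        raise ValueError("audio exceeds the assumed 25 MB limit; consider `url`")
    payload = base64.b64encode(audio).decode("ascii")
    return f"data:{mime_type};base64,{payload}"


# Usage with a local file (the path is a placeholder):
# with open("meeting.wav", "rb") as fh:
#     file_param = to_data_uri(fh.read(), "audio/wav")
```

For larger recordings, the `url` parameter may be the better fit, since no gateway-side size limit applies to it.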

filler_words boolean

When true, filler words (uh, um, er) are included in the transcript. Defaults to false — filler words are removed.

format boolean

When true, enables Inverse Text Normalization — spoken numbers and currencies are converted to written form (e.g. "one hundred dollars" → "$100"). Requires language to be set.

keyterm array

Key terms to bias transcription toward (e.g. product names, proper nouns). Each term up to 50 characters, max 100 terms. Sent as repeated form fields: keyterm=Term+One&keyterm=Term+Two.

language string

Language code (e.g. "en", "fr", "de"). xAI transcribes the audio whether or not this is set; supplying it together with format=true enables Inverse Text Normalization, which applies number and currency formatting to the transcript.

multichannel boolean

When true, each audio channel is transcribed independently. Results are returned in the `channels` array. Requires channels ≥ 2.

sample_rate integer

Sample rate in Hz. Required when audio_format is set.

url string

HTTPS URL of an audio file for xAI to fetch server-side. Mutually exclusive with `file`. No gateway-side size limit applies.
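Several of the input parameters above depend on one another, so it can help to check a request body locally before sending it. The following Python sketch is one way to do that. The names `build_request` and `RAW_FORMATS` are hypothetical, and treating exactly one of `file` or `url` as mandatory is an assumption (the source only states that they are mutually exclusive). The gateway's own validation may differ.

```python
RAW_FORMATS = {"pcm", "mulaw", "alaw"}


def build_request(**params) -> dict:
    """Return a JSON-ready body for xai/grok-stt, or raise ValueError."""
    # Assumption: exactly one audio source must be present.
    if ("file" in params) == ("url" in params):
        raise ValueError("supply exactly one of `file` or `url`")

    audio_format = params.get("audio_format")
    if audio_format is not None and "sample_rate" not in params:
        raise ValueError("`sample_rate` is required when `audio_format` is set")

    if "channels" in params and not 2 <= params["channels"] <= 8:
        raise ValueError("`channels` must be between 2 and 8")

    # Channel count is auto-detected for containers, so it is only
    # demanded here for multichannel raw audio.
    if (params.get("multichannel") and audio_format in RAW_FORMATS
            and "channels" not in params):
        raise ValueError("multichannel raw audio needs `channels`")

    if params.get("format") and not params.get("language"):
        raise ValueError("`format` requires `language` to be set")

    terms = params.get("keyterm", [])
    if len(terms) > 100 or any(len(t) > 50 for t in terms):
        raise ValueError("`keyterm`: max 100 terms, 50 characters each")

    return {"model": "xai/grok-stt", **params}


# Example: a diarized transcript with number formatting for English audio.
body = build_request(
    url="https://example.com/call.mp3",  # placeholder URL
    diarize=True,
    format=True,
    language="en",
)
```

The resulting dictionary can be serialized with `json.dumps` and passed as the request body shown in the quick start.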

Output
channels: Per-channel transcripts when multichannel=true.
duration: Audio duration in seconds, rounded to 2 decimal places.
language: Detected language name (e.g. "English", "French").
text: Full transcript text.
words: Word-level segments. Each entry has text, start, end (seconds). Includes speaker integer when diarize=true.
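
When diarize=true, the `words` array can be folded into speaker turns. The Python sketch below shows one way to do this. The sample response is made up for illustration and only uses the fields listed above; `speaker_turns` is a hypothetical helper, and joining word texts with a single space is an assumption about how `text` values are delimited.

```python
def speaker_turns(words: list) -> list:
    """Merge consecutive words from the same speaker into (speaker, text) turns."""
    turns = []
    for word in words:
        speaker = word["speaker"]
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous word: extend the current turn.
            turns[-1] = (speaker, turns[-1][1] + " " + word["text"])
        else:
            turns.append((speaker, word["text"]))
    return turns


# Hypothetical response fragment, shaped after the Output fields above.
sample = {
    "text": "Hi there Hello",
    "duration": 0.9,
    "language": "English",
    "words": [
        {"text": "Hi", "start": 0.0, "end": 0.2, "speaker": 0},
        {"text": "there", "start": 0.2, "end": 0.5, "speaker": 0},
        {"text": "Hello", "start": 0.6, "end": 0.9, "speaker": 1},
    ],
}

turns = speaker_turns(sample["words"])
# turns == [(0, "Hi there"), (1, "Hello")]
```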