Grok TTS

xai/grok-tts

xAI Text-to-Speech, Audio Generation, Multilingual

xAI's Grok text-to-speech model. Generates high-fidelity spoken audio in five expressive voices (eve, ara, rex, sal, leo) across 20+ supported languages, with inline speech tags for laughter, whispers, and pauses.

Quick start

# Inspect the price — a plain request returns the 402 challenge:
curl -i https://api.glianalabs.com/v1/infer \
  -H "content-type: application/json" \
  -d '{
    "model": "xai/grok-tts",
    "text": "<text to speak>"
  }'

# Pay + run in one step with the mppx CLI (create a wallet: npx mppx account create):
npx mppx https://api.glianalabs.com/v1/infer \
  -J '{"model": "xai/grok-tts", "text": "<text to speak>"}'

Parameters

Input

text string required

Text to convert to speech. Maximum 15,000 characters. Supports inline speech tags: [pause], [laugh], <whisper>…</whisper>, etc.
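A minimal sketch of a request body that uses the inline speech tags and checks the documented character limit before sending. The sample sentence and the variable names are illustrative assumptions, and the length check counts characters as the current shell locale defines them.

```shell
# Illustrative text using the inline tags listed above.
TEXT='Welcome back. [pause] <whisper>This part is a secret.</whisper> [laugh]'

# Check the documented 15,000-character limit before building the request.
if [ "${#TEXT}" -gt 15000 ]; then
  echo "text exceeds 15,000 characters" >&2
fi

# Build the JSON body. The sample text has no double quotes or backslashes,
# so plain printf substitution is safe here; arbitrary input needs JSON escaping.
PAYLOAD=$(printf '{"model": "xai/grok-tts", "text": "%s"}' "$TEXT")

echo "$PAYLOAD"
# To pay and run, pass the body to the CLI from the quick start:
#   npx mppx https://api.glianalabs.com/v1/infer -J "$PAYLOAD"
```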

language string optional default: "auto"

BCP-47 language code (e.g. "en", "zh", "pt-BR") or "auto" for automatic language detection. Defaults to "auto" when omitted here; the upstream xAI API itself requires a language and returns 400 without one. Supported codes: auto, en, ar-EG, ar-SA, ar-AE, bn, zh, fr, de, hi, id, it, ja, ko, pt-BR, pt-PT, ru, es-MX, es-ES, tr, vi.
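Because an unsupported code may only surface as an error from the upstream API, a client-side check against the list above can be useful. The sketch below is one way to do it; the function name is hypothetical, and it assumes codes must match the documented spelling exactly, which the page does not state.

```shell
# Return 0 if the code appears in the documented list, 1 otherwise.
# Assumption: matching is exact, including letter case.
is_supported_language() {
  case "$1" in
    auto|en|ar-EG|ar-SA|ar-AE|bn|zh|fr|de|hi|id|it|ja|ko|pt-BR|pt-PT|ru|es-MX|es-ES|tr|vi)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

is_supported_language "pt-BR" && echo "pt-BR: ok"
is_supported_language "pt" || echo "pt: not in the documented list"
```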

optimize_streaming_latency number optional

Latency optimization for streaming synthesis. 0 (default): no optimization, best audio quality. 1: reduced first-chunk size for lower time-to-first-audio with minor quality tradeoff.

output_format object optional

Output audio format. Defaults to MP3 at 24 kHz / 128 kbps when omitted.

speed number optional

Speech speed multiplier. 1.0 is normal speed. Range: 0.7 to 1.5. Defaults to 1.0. Only used in WebSocket mode.

text_normalization boolean optional

When true, normalizes written-form text into spoken-form before synthesis (e.g. "Dr." → "Doctor", "100" → "one hundred"). Defaults to false.

voice_id string optional

Voice for synthesis. Defaults to "eve". Built-in voices: eve (energetic), ara (warm), rex (confident), sal (balanced), leo (authoritative). Custom voice IDs from /v1/tts/voices are also accepted. Case-insensitive — "Eve", "EVE", and "eve" are equivalent.
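The optional inputs can be combined in a single request body. The sketch below assumes they sit at the top level of the JSON object next to "text", as in the quick start; the sample sentence and variable names are illustrative. It omits output_format (its fields are not listed here) and speed (WebSocket mode only).

```shell
# Sample values; any built-in or custom voice ID could be used instead.
TEXT='Dr. Silva has 100 new messages.'
VOICE_ID='rex'
LANGUAGE='en'

# text_normalization is enabled so "Dr." and "100" are expanded to spoken form.
# The sample values contain no double quotes or backslashes, so printf
# substitution is safe here; arbitrary input needs JSON escaping.
PAYLOAD=$(printf '{"model": "xai/grok-tts", "text": "%s", "voice_id": "%s", "language": "%s", "text_normalization": true}' \
  "$TEXT" "$VOICE_ID" "$LANGUAGE")

echo "$PAYLOAD"
# To pay and run:
#   npx mppx https://api.glianalabs.com/v1/infer -J "$PAYLOAD"
```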

Output

audio: Presigned R2 URL for the generated audio file. MIME type reflects the requested codec (audio/mpeg for mp3, audio/wav for wav, etc.).
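A rough sketch of pulling the audio URL out of a response and downloading the file. The response shape (a JSON object with a top-level "audio" string) and the sample URL are assumptions for illustration; a JSON-aware tool such as jq is more robust than sed if it is available.

```shell
# Hypothetical response body; a real presigned URL will look different.
RESPONSE='{"audio": "https://example.com/audio/sample.mp3"}'

# Extract the value of the "audio" field (assumes it is a plain JSON string).
AUDIO_URL=$(printf '%s\n' "$RESPONSE" |
  sed -n 's/.*"audio"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')

echo "$AUDIO_URL"
# Download the file while the presigned URL is still valid:
#   curl -fsSL -o speech.mp3 "$AUDIO_URL"
```

Pick the output file extension to match the requested codec, since the MIME type follows the codec.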