On April 18, 2026, xAI launched two standalone APIs that most observers had not anticipated: a Speech-to-Text (STT) API and a Text-to-Speech (TTS) API, available immediately via the same api.x.ai endpoint as the Grok LLM API. No waitlist, no staged rollout — both live on launch day.
The move signals something beyond a product release. xAI has operated one of the largest voice inference deployments in the world without ever selling access to it: the Grok Voice system embedded in Tesla vehicles and handling Starlink customer support at scale. The STT and TTS APIs are that production stack, externalized. For developers who care about infrastructure provenance — who want to build on something that has already processed billions of audio tokens in adversarial, real-world conditions — this is a meaningful credential.
Whether the pricing and accuracy hold up against entrenched competitors is a different question. That’s what this review examines.
What Launched
xAI released two separate products on the same day:
- Grok STT — Speech-to-Text transcription via REST (batch) and WebSocket (streaming)
- Grok TTS — Text-to-Speech synthesis with five voices and speech-tag controls
Both are available through the standard xAI API (api.x.ai/v1) using the same API keys developers already use for Grok LLM access. If you’re already in the xAI ecosystem, onboarding to the audio APIs requires no new credentials.
Grok STT: Specifications
Pricing
| Mode | Price |
|---|---|
| Batch (REST) | $0.10 / hour of audio |
| Streaming (WebSocket) | $0.20 / hour of audio |
xAI positions this as “market-low.” For comparison, OpenAI Whisper API runs $0.36/hour. Deepgram’s Nova-3 starts around $0.14–0.24/hour depending on tier. AssemblyAI’s standard tier is $0.37/hour. The batch price is genuinely competitive; the streaming price is roughly level with Deepgram’s real-time tier.
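To make the per-hour figures concrete, here is a minimal sketch that turns the prices quoted above into a monthly bill for a given audio volume. The prices are the ones cited in this review, not live rates; the 10,000-hour volume and the function name are illustrative assumptions.

```python
# Per-hour prices as quoted in this review (USD per hour of audio).
# Deepgram is quoted as a range, so both ends are kept as separate rows.
PRICES_PER_HOUR = {
    "Grok STT (batch)": 0.10,
    "Grok STT (streaming)": 0.20,
    "Deepgram Nova-3 (low end)": 0.14,
    "Deepgram Nova-3 (high end)": 0.24,
    "OpenAI Whisper API": 0.36,
    "AssemblyAI (standard)": 0.37,
}


def monthly_cost(hours_per_month: float) -> dict[str, float]:
    """Return the estimated monthly cost in USD for each provider."""
    return {
        name: round(price * hours_per_month, 2)
        for name, price in PRICES_PER_HOUR.items()
    }


if __name__ == "__main__":
    # Hypothetical workload: 10,000 hours of audio per month.
    for name, cost in sorted(monthly_cost(10_000).items(), key=lambda kv: kv[1]):
        print(f"{name:<28} ${cost:>9,.2f}")
```

At that hypothetical volume the batch tier comes to $1,000 per month against $3,600 for the Whisper API, which is the 3.6× ratio discussed later in this review.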
API Endpoints
- REST: POST https://api.x.ai/v1/stt (file upload or URL-based transcription)
- WebSocket: wss://api.x.ai/v1/stt (real-time streaming)
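As a rough illustration of what a URL-based batch request might look like, the sketch below builds (but does not send) an HTTP request against the REST endpoint listed above. Only the endpoint URL comes from the launch materials. The JSON body shape, the `url` field name, and the Bearer authorization header are assumptions made for illustration, so check the official API reference before relying on any of them.

```python
import json
import urllib.request

STT_REST_URL = "https://api.x.ai/v1/stt"  # endpoint as listed in this review


def build_stt_request(api_key: str, audio_url: str) -> urllib.request.Request:
    """Build a POST request for URL-based transcription without sending it."""
    # ASSUMPTION: the body shape {"url": ...} is hypothetical, not documented here.
    payload = json.dumps({"url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        STT_REST_URL,
        data=payload,
        method="POST",
        headers={
            # ASSUMPTION: Bearer-token auth, reusing an existing xAI API key.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


req = build_stt_request("YOUR_XAI_API_KEY", "https://example.com/call.mp3")
print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing `req` to `urllib.request.urlopen`; that step is left out because the response format is not described in the launch materials.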
Rate Limits
| Limit | Value |
|---|---|
| REST requests | 600 / minute |
| WebSocket connections | 10 / second |
| Max concurrent sessions | 100 / team |
| Max file size | 500 MB |
| Deployment region | us-east-1 |
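A client that batches many files will want to stay under the 600 requests per minute limit rather than discover it through rejected calls. The following is a minimal client-side sketch using a sliding window; it is a generic technique, not something xAI provides, and the class name and injectable `clock` parameter are illustrative choices.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` within any rolling window of `period` seconds."""

    def __init__(self, max_calls: int, period: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock          # injectable so the limiter can be tested
        self._stamps = deque()      # timestamps of calls still inside the window

    def try_acquire(self) -> bool:
        """Return True and record the call if it fits in the window."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            self._stamps.append(now)
            return True
        return False


# Matches the REST limit in the table above: 600 requests per 60 seconds.
rest_limiter = SlidingWindowLimiter(max_calls=600, period=60.0)
```

Before each upload, call `rest_limiter.try_acquire()` and sleep briefly when it returns `False`. The same class could guard the 10 connections per second WebSocket limit with `max_calls=10, period=1.0`.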
Key Features
Speaker diarization — Separates audio by individual speaker, answering “who said what” across multi-speaker recordings. Critical for meeting transcription, call centers, interview workflows, and legal documentation.
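To show what "who said what" looks like downstream, here is a small sketch that collapses per-word speaker labels into readable turns. The response shape (a list of words, each with a `speaker` index) is a hypothetical stand-in, since the launch materials do not document the actual output format.

```python
from itertools import groupby

# ASSUMPTION: hypothetical diarized output, one dict per word.
words = [
    {"word": "Hi,", "speaker": 0},
    {"word": "thanks", "speaker": 0},
    {"word": "for", "speaker": 0},
    {"word": "calling.", "speaker": 0},
    {"word": "My", "speaker": 1},
    {"word": "router", "speaker": 1},
    {"word": "is", "speaker": 1},
    {"word": "offline.", "speaker": 1},
    {"word": "Understood.", "speaker": 0},
]


def to_turns(words):
    """Merge consecutive words from the same speaker into (speaker, text) turns."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        turns.append((speaker, " ".join(w["word"] for w in group)))
    return turns


for speaker, text in to_turns(words):
    print(f"Speaker {speaker}: {text}")
```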
Word-level timestamps — Assigns precise start/end times to every word, enabling subtitle generation, searchable audio archives, audio-to-text alignment for video editors, and compliance-grade documentation.
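Subtitle generation is the most direct use of word-level timestamps. The sketch below groups words into fixed-size cues and renders them in SRT format. The input shape (`word`, `start`, `end` in seconds) is an assumed structure for illustration; the real field names may differ.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words: int = 7) -> str:
    """Group timestamped words into cues of up to `max_words` and render SRT."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start, end = chunk[0]["start"], chunk[-1]["end"]
        text = " ".join(w["word"] for w in chunk)
        cues.append(
            f"{len(cues) + 1}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(cues)


# ASSUMPTION: hypothetical word-level output with times in seconds.
sample = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
]
print(words_to_srt(sample))
```

A production subtitle pipeline would usually break cues on pauses or punctuation rather than a fixed word count; the fixed count keeps the sketch short.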
Inverse Text Normalization (ITN) — Converts spoken idiom into structured output. “One hundred sixty-seven thousand nine hundred eighty-three dollars and fifteen cents” becomes “$167,983.15.” This matters enormously in financial services, healthcare, and any domain where structured data must be extracted from voice.
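For readers unfamiliar with ITN, the toy sketch below reproduces the dollar-amount example from the paragraph above with a hand-written number parser. It is only an illustration of the transformation, not xAI's implementation: production ITN systems cover dates, phone numbers, ordinals, and many other patterns, typically with grammars or learned models.

```python
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
SCALES = {"thousand": 1_000, "million": 1_000_000}


def words_to_int(text: str) -> int:
    """Convert English number words (up to the millions) to an integer."""
    total = 0    # completed groups (thousands, millions)
    current = 0  # group being built (0 to 999)
    for token in text.lower().replace("-", " ").split():
        if token == "and":
            continue
        if token in UNITS:
            current += UNITS[token]
        elif token in TENS:
            current += TENS[token]
        elif token == "hundred":
            current *= 100
        elif token in SCALES:
            total += current * SCALES[token]
            current = 0
        else:
            raise ValueError(f"unrecognized number word: {token!r}")
    return total + current


def spoken_dollars(text: str) -> str:
    """Normalize '<number> dollars and <number> cents' to '$D,DDD.CC'."""
    dollars_part, _, rest = text.lower().partition(" dollars")
    cents_words = rest.replace(" cents", "").strip()
    dollars = words_to_int(dollars_part)
    cents = words_to_int(cents_words) if cents_words else 0
    return f"${dollars:,}.{cents:02d}"


print(spoken_dollars(
    "One hundred sixty-seven thousand nine hundred eighty-three dollars and fifteen cents"
))  # prints $167,983.15
```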
Multi-channel audio — Supports simultaneous recording from multiple microphones. Relevant for conference room setups, call center infrastructure with separate agent/customer channels, and multi-speaker live events.
25+ languages with automatic language switching — The API handles mid-conversation language transitions without requiring explicit language specification upfront.
Supported formats — Common audio and video containers including MP3, WAV, MP4, and M4A. The 500 MB file limit accommodates up to several hours of compressed audio per request.
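How many hours fit in 500 MB depends on the bitrate. The arithmetic below is a quick sketch, assuming the limit is counted in decimal megabytes (1 MB = 1,000,000 bytes) and ignoring container overhead; the bitrates are common encoder settings, not values from xAI's documentation.

```python
def max_hours(file_size_mb: float, bitrate_kbps: float) -> float:
    """Hours of audio that fit in a file of the given size at a constant bitrate."""
    bits = file_size_mb * 1_000_000 * 8       # decimal megabytes to bits
    seconds = bits / (bitrate_kbps * 1000)    # kbps to bits per second
    return seconds / 3600


for label, kbps in [
    ("MP3 at 64 kbps", 64),
    ("MP3 at 128 kbps", 128),
    ("WAV, 16 kHz mono 16-bit", 256),
    ("WAV, 44.1 kHz stereo 16-bit", 1411.2),
]:
    print(f"{label:<30} {max_hours(500, kbps):5.2f} h")
```

At 128 kbps the limit works out to roughly 8.7 hours per request, while uncompressed CD-quality WAV fits well under one hour, so long recordings are best compressed or split before upload.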
Benchmarks: The Claim
xAI’s benchmark claim is specific. On phone call entity recognition (extracting names, account numbers, and dates from call center audio), xAI reports a 5.0% word error rate for Grok STT, versus:
| Provider | Error Rate |
|---|---|
| Grok STT | 5.0% |
| ElevenLabs | 12.0% |
| Deepgram | 13.5% |
| AssemblyAI | 21.3% |
The important caveat: this benchmark was published by xAI on a dataset and evaluation methodology they control. It has not been independently reproduced. Phone call entity recognition is a specific, cherry-pickable slice of speech recognition performance — it does not represent general-purpose transcription quality across accents, noise environments, or non-English audio.
That said, the gap is large enough to be interesting. A 5.0% vs. 13.5% gap against Deepgram, even allowing for home-field advantage on the benchmark, suggests real architectural differentiation. The production provenance (Tesla, Starlink) is at least a plausible explanation for strong call-center performance: those systems have been optimized for exactly this kind of structured, high-accuracy audio extraction.
Independent third-party evaluation on standard benchmarks (CommonVoice, LibriSpeech, CORAAL for diverse English, MLS for multilingual) would tell a more complete story.
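Until such results exist, teams can measure error rates on their own recordings. The sketch below computes the standard word error rate (word-level edit distance divided by reference length) between a human reference transcript and a provider's output. It applies only lowercasing and whitespace tokenization; published benchmarks usually apply heavier text normalization, so absolute numbers will not be directly comparable with xAI's table.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate(
    "account number four two seven",
    "account number for two seven",
))  # prints 0.2
```

Running the same reference set through each candidate provider gives a like-for-like comparison on the audio that matters to your application.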
Grok TTS: Specifications
Pricing
$4.20 per million characters — flat rate, no tier differentiation at launch.
For comparison: OpenAI TTS at $15/M chars (HD) or $12/M (standard). ElevenLabs at $11/M chars on their Starter plan. Eleven’s Flash v2.5 (fastest/cheapest) prices at $5.50/M. Grok TTS at $4.20/M undercuts the entire incumbent field.
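Character-based pricing is easier to reason about with a concrete volume. This sketch applies the per-million-character prices quoted above to a hypothetical workload of 2.5 million characters per month; the workload size and names are illustrative assumptions, and the prices are those cited in this review rather than live rates.

```python
# USD per million characters, as quoted in this review.
TTS_PRICE_PER_M_CHARS = {
    "Grok TTS": 4.20,
    "ElevenLabs Flash v2.5": 5.50,
    "ElevenLabs (Starter)": 11.00,
    "OpenAI TTS (standard)": 12.00,
    "OpenAI TTS (HD)": 15.00,
}


def tts_cost(characters: int, price_per_million: float) -> float:
    """Cost in USD to synthesize the given number of characters."""
    return characters / 1_000_000 * price_per_million


monthly_chars = 2_500_000  # hypothetical workload
baseline = tts_cost(monthly_chars, TTS_PRICE_PER_M_CHARS["Grok TTS"])
for name, price in TTS_PRICE_PER_M_CHARS.items():
    cost = tts_cost(monthly_chars, price)
    print(f"{name:<24} ${cost:6.2f}  ({cost / baseline:.1f}x Grok TTS)")
```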
Voices
Five voices at launch:
| Voice | Default? |
|---|---|
| Ara | — |
| Eve | ✓ (default) |
| Leo | — |
| Rex | — |
| Sal | — |
xAI has not published demographic or stylistic descriptions of each voice in the launch materials. Developers will need to evaluate fit for their use case directly via API.
Languages
20 languages supported. The specific language list is not enumerated in the launch announcement. Given the STT list includes 25+ with automatic switching, TTS coverage is somewhat narrower — a consideration for internationalized applications.
Speech Tags
The API supports speech tags — control tokens embedded in input text that modify delivery: pacing, emphasis, prosody. This is the category where ElevenLabs has historically led the market (their SSML-like control has been a key differentiator). xAI’s approach appears similar in concept, though the depth of control relative to ElevenLabs has not been independently documented.
The Infrastructure Story
What separates Grok STT/TTS from a new entrant pitching equivalent specs is the deployment history. xAI’s audio stack has been running in production at scale in two demanding environments:
Tesla vehicles — In-car voice interfaces operate under compounded noise: road noise, music, HVAC, wind, multi-passenger conversation. Inference must be fast enough to feel responsive, accurate enough for navigation commands, and robust enough across accents. Tesla’s Grok integration has been active since 2025.
Starlink customer support — Call center voice AI at SpaceX scale means millions of support interactions with real users, not controlled test environments. The structured entity extraction case (account numbers, service addresses, ticket IDs) maps directly to the phone call entity recognition benchmark xAI published.
The argument is not that Tesla and Starlink deployments guarantee performance for your use case. It’s that a voice API backed by this kind of production history has a different error surface than a model released from research into API access. Production APIs get optimized for failure modes that benchmarks don’t surface.
Competitive Positioning
vs. OpenAI Whisper API
Whisper is the default choice for many developer stacks because of its open-source provenance (the weights are public) and OpenAI API familiarity. At $0.36/hr, it costs 3.6× as much as Grok STT batch ($0.10/hr). Whisper has no native speaker diarization; developers combine it with pyannote or similar. Grok STT’s diarization is built in.
For teams already using OpenAI’s API, switching STT is a low-cost experiment — same API key structure, comparable feature set, lower price.
vs. Deepgram
Deepgram’s Nova-3 is the strongest incumbent on accuracy across diverse English. It has speaker diarization, filler word detection, sentiment analysis, and deeper customization options. Deepgram also has a significantly longer track record of independent benchmark results across accent diversity.
Grok STT matches Deepgram’s real-time streaming price and undercuts their batch rate. Whether accuracy holds up across Deepgram’s evaluation framework is the open question.
vs. ElevenLabs
ElevenLabs’ primary moat is TTS quality, particularly voice cloning, fine-grained prosody control via SSML, and a large voice library. At $11/M chars vs. Grok TTS at $4.20/M, ElevenLabs costs about 2.6× as much for comparable output volume.
If your application requires voice cloning or custom voice training, ElevenLabs still leads. If you need five solid voices at aggressive pricing, Grok TTS is worth benchmarking.
vs. AssemblyAI
AssemblyAI at $0.37/hr and 21.3% phone call entity recognition error in xAI’s benchmark is the weakest comparison point in this competitive set. AssemblyAI has strengths in its LeMUR transcript analysis layer (running LLM queries over transcripts natively). That additional layer is not something Grok STT offers at launch.
What’s Missing
No voice cloning — The five fixed voices cover broad use cases but rule out any application requiring brand-specific voice identity. ElevenLabs leads here by a wide margin.
us-east-1 only — Single deployment region introduces latency for users in Europe, Asia-Pacific, or South America. For real-time streaming applications, round-trip latency from Tokyo or São Paulo will be measurably worse than from a regional provider.
No native LLM + audio pipeline — Deepgram’s Aura-2 and AssemblyAI’s LeMUR allow you to run language model queries over transcripts within the same API call. Grok STT is transcription only; chaining to Grok LLM for analysis requires two separate calls.
Self-reported benchmark — The 5.0% entity recognition figure needs third-party reproduction before it should anchor a procurement decision.
Who Should Evaluate Grok STT/TTS
Strong fit:
- Teams already using Grok API who want to add voice to their stack with zero new credential overhead
- Applications where structured entity extraction from audio is the primary accuracy requirement (finance, healthcare, call centers)
- High-volume transcription workflows where per-hour cost is the primary optimization target
- Multi-language applications requiring automatic language detection and switching
Proceed with caution:
- Applications requiring voice cloning or custom voice identity
- Teams with established Deepgram or AssemblyAI pipelines where switching cost needs clear justification
- Latency-sensitive streaming applications outside North America (us-east-1 only)
- Production use cases where the self-reported benchmark is the primary accuracy evidence and independent validation hasn’t been done
Rating
3.5 / 5: The pricing is genuinely aggressive and the production infrastructure story is credible. The benchmarks need independent verification, the TTS voice library is thin, and regional coverage is limited to us-east-1. For teams already in the xAI ecosystem, or running high-volume transcription where cost is the main constraint, this is worth a structured evaluation against your own data. For teams making a primary voice infrastructure bet, wait for independent benchmark reproduction before committing.
ChatForest researches and reviews AI tools based on published documentation, announcements, and third-party reporting. We do not have direct access to the Grok Speech APIs and have not conducted hands-on testing. Benchmark claims are sourced from xAI’s launch materials unless otherwise noted. This review was written in May 2026 based on information available at that time.