Streaming support for gemini-3.1-flash-tts-preview arrived June 17, 2026. The model itself launched in April — this changelog entry is specifically about streaming via streamGenerateContent becoming available. Before June 17, TTS on this model required a non-streaming call that returned the full audio blob at completion. Now you can stream audio chunks and begin playback before the full generation is done.

This guide covers: how streaming works, the audio tag system, voice selection, output format, pricing, and the one known bug you need to handle before shipping.


The Model

gemini-3.1-flash-tts-preview is Google’s expressive TTS model with inline style control. It is a preview model — not yet GA, not in Flex inference, and not in Priority inference. Input is text only; output is audio only.

Attribute | Value
Model ID | gemini-3.1-flash-tts-preview
Input | Text (max 8,192 tokens)
Output | Audio
Max output tokens | 16,384
Audio format | PCM, 24 kHz, 16-bit, mono
Voices | 30 prebuilt
Languages | 70+
Pricing | $20 / 1M output tokens
Batch API | Supported
Caching, function calling, structured outputs | Not supported

Note the rate of $20 per 1M output tokens, compared with $6 per 1M for Gemini 3.5 Flash TTS, which is both newer and cheaper. Choose 3.1 Flash TTS when you need the 200+ audio tag system for fine-grained delivery control; choose 3.5 Flash TTS as the cost-efficient option for straightforward narration.
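To make the rate difference concrete, the sketch below converts an output-token count into dollars using the two rates quoted above. The function name, the dictionary key for the 3.5 model, and the example token count are illustrative assumptions; how many tokens a given length of audio consumes is not covered here.

```python
# Per-million output-token rates quoted in this guide (USD).
RATE_PER_MILLION = {
    "gemini-3.1-flash-tts-preview": 20.0,
    "gemini-3.5-flash-tts": 6.0,  # hypothetical key; substitute the real model ID
}


def estimate_output_cost(model: str, output_tokens: int) -> float:
    """Return the estimated USD cost for a number of output tokens."""
    return RATE_PER_MILLION[model] * output_tokens / 1_000_000


if __name__ == "__main__":
    tokens = 250_000  # illustrative volume, not a measured figure
    for model in RATE_PER_MILLION:
        print(f"{model}: ${estimate_output_cost(model, tokens):.2f}")
```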


How Streaming Works

Use streamGenerateContent instead of generateContent. In the Interactions API, set stream: true in the request.

Audio arrives as step.delta events. When event.delta.type == "audio", the data field contains base64-encoded PCM. Decode and pipe directly to your audio output:

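import base64
import wave

from google import genai

# Reads the API key from the environment (e.g. GEMINI_API_KEY).
client = genai.Client()

audio_chunks = []

# Assumption: the Interactions API accepts speech settings under
# generation_config; check the SDK reference for your installed version.
stream = client.interactions.create(
    model="gemini-3.1-flash-tts-preview",
    input="[excited] We just shipped streaming TTS support.",
    generation_config={"speech_config": {"voice": "Puck"}},
    stream=True,
)

for event in stream:
    # Not every event carries a delta, so read the attribute defensively.
    delta = getattr(event, "delta", None)
    if delta is not None and delta.type == "audio":
        audio_chunks.append(base64.b64decode(delta.data))

# Write the PCM chunks to a WAV file
with wave.open("output.wav", "wb") as f:
    f.setnchannels(1)       # mono
    f.setsampwidth(2)       # 16-bit
    f.setframerate(24000)   # 24 kHz
    f.writeframes(b"".join(audio_chunks))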

The output is always 24000 Hz, 16-bit, mono PCM. You do not need to negotiate the format — it is fixed.
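Because the format is fixed, the playback length of any buffer follows directly from its byte count: 24,000 samples per second × 2 bytes per sample × 1 channel = 48,000 bytes per second. A minimal sketch (the function and constant names are illustrative):

```python
SAMPLE_RATE = 24_000   # Hz
SAMPLE_WIDTH = 2       # bytes per sample (16-bit)
CHANNELS = 1           # mono

BYTES_PER_SECOND = SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS


def pcm_duration_seconds(pcm: bytes) -> float:
    """Return the playback duration of a raw PCM buffer in seconds."""
    return len(pcm) / BYTES_PER_SECOND


if __name__ == "__main__":
    one_second_of_silence = b"\x00" * BYTES_PER_SECOND
    print(pcm_duration_seconds(one_second_of_silence))
```

This is handy for progress indicators or for checking that a stitched file has the length you expect.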


Audio Tags

The audio tag system lets you embed delivery instructions directly in the input text. Tags are inline, surrounded by square brackets:

[whispers] This part is spoken softly.
[excited] And this part with energy!

Tags can combine:

[sarcastically, one painfully slow word at a time] Oh. What. A. Great. Idea.

Categories of control include:

  • Delivery style: [whispers], [shouting], [breathlessly], [dramatically]
  • Emotion: [excitedly], [bored], [laughing], [sighs]
  • Pacing: [one word at a time], [quickly], [pausing]
  • Accent and register: vary by tag — the full list is in Google’s TTS documentation

Tags are processed inline — you can switch style mid-sentence or per paragraph. This is the primary differentiator from Gemini 3.5 Flash TTS, which handles narration but does not support the 200+ tag system.
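If the tagged text is assembled programmatically, a small helper keeps the bracket syntax consistent. This is only a sketch: the helper names are hypothetical, and the code formats strings without checking whether a tag is one the model recognizes.

```python
def tagged(text: str, *tags: str) -> str:
    """Prefix text with one bracketed tag; multiple tags are comma-joined."""
    cleaned = [t.strip() for t in tags if t.strip()]
    if not cleaned:
        return text
    return f"[{', '.join(cleaned)}] {text}"


def script(*segments: str) -> str:
    """Join tagged segments into one input string, one segment per line."""
    return "\n".join(segments)


if __name__ == "__main__":
    print(script(
        tagged("This part is spoken softly.", "whispers"),
        tagged("And this part with energy!", "excited"),
        tagged("Oh. What. A. Great. Idea.",
               "sarcastically", "one painfully slow word at a time"),
    ))
```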


Voice Selection

30 prebuilt voices are available. A subset of named voices:

  • Kore — Firm
  • Puck — Upbeat
  • Enceladus — Breathy

Set voice in speech_config.voice. Voice selection affects the base character; audio tags control delivery on top of the voice. You can run two independent speakers in the same request with separate voice and style configuration per speaker.


The Known 500 Error

Google’s own documentation warns: the model occasionally returns text tokens instead of audio tokens, causing the server to fail with a 500 error.

This is a known issue in preview, not a transient infrastructure problem. Handle it with automated retry:

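import time

MAX_RETRIES = 3


def tts_with_retry(stream_tts, text, voice="Kore"):
    """Call stream_tts, retrying when the server answers with a 500.

    stream_tts is a placeholder for your own function that runs the
    request and returns the collected audio.
    """
    for attempt in range(MAX_RETRIES):
        try:
            return stream_tts(text, voice=voice)
        except Exception as e:
            # Matching on the message is a heuristic; prefer the status-code
            # attribute of your SDK's exception type if it exposes one.
            if "500" in str(e) and attempt < MAX_RETRIES - 1:
                time.sleep(1)  # brief pause before the next attempt
                continue
            raise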

Do not surface the retry to end users — one or two retries transparently handle the majority of these failures.


Long Output Drift

For outputs longer than a few minutes, the documentation notes that “speech quality and consistency may begin to drift.” If you need long-form TTS (podcast narration, audiobook chapters), split your input at natural paragraph or section boundaries and stitch the output chunks rather than submitting the entire document in one call. This also reduces the impact of a mid-generation 500 error, since you only retry the failed chunk rather than the entire input.
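The splitting step might look like the sketch below, which packs whole paragraphs into chunks under a character budget. The default budget of 2,000 characters is an arbitrary placeholder rather than a documented limit, and synthesize in the usage note stands in for your own TTS call.

```python
def split_paragraphs(text: str, max_chars: int = 2000) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of <= max_chars.

    A single paragraph longer than max_chars is kept whole rather than
    being cut mid-sentence.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # +2 accounts for the blank line that rejoins paragraphs
        added = len(para) + (2 if current else 0)
        if current and size + added > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
            added = len(para)
        current.append(para)
        size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def stitch(pcm_parts: list[bytes]) -> bytes:
    """Concatenate raw PCM buffers; valid because the format is fixed."""
    return b"".join(pcm_parts)
```

Usage would be along the lines of stitch([synthesize(chunk) for chunk in split_paragraphs(document)]), with each synthesize call retried on its own if it fails.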


vs. Gemini 3.5 Flash TTS

Feature | 3.1 Flash TTS Preview | 3.5 Flash TTS
Audio tags | 200+ inline tags | Not supported
Pricing | $20/1M output tokens | $6/1M output tokens
Status | Preview | GA
Multi-speaker | Yes, 2 speakers | Check docs
Use case | Expressive, steerable narration | Cost-efficient narration

Use 3.1 Flash TTS when fine delivery control matters (character voices, expressive training data, styled narration). Use 3.5 Flash TTS when you need reliable, cheap, scalable narration without style complexity.


Streaming vs. Non-Streaming

Non-streaming (generateContent) is still available and remains the right choice when:

  • You do not need real-time playback (batch processing, file generation)
  • You want to avoid implementing chunk assembly
  • Latency to first byte does not matter

Streaming (streamGenerateContent) is the right choice when:

  • You are building a voice interface where response latency is user-visible
  • You want to begin audio playback before generation completes
  • You are integrating with a real-time audio output pipeline
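
One way to begin playback before generation completes is to hand each chunk to the audio device as it arrives instead of collecting the full list first. The sketch below is deliberately library-agnostic: chunks is any iterator of raw PCM byte strings (for example, decoded stream events) and play is whatever write function your audio backend exposes. Both names are assumptions for illustration.

```python
import time
from typing import Callable, Iterable, Optional


def play_while_streaming(
    chunks: Iterable[bytes],
    play: Callable[[bytes], None],
    clock: Callable[[], float] = time.monotonic,
) -> tuple[Optional[float], int]:
    """Forward chunks to play() as they arrive.

    Returns (seconds until the first non-empty chunk arrived, total bytes
    played). The first value is None if the stream produced no audio.
    """
    start = clock()
    first_chunk_latency = None
    total = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks
        if first_chunk_latency is None:
            first_chunk_latency = clock() - start
        play(chunk)
        total += len(chunk)
    return first_chunk_latency, total
```

The returned latency is the number to watch in a voice interface: it is what streaming improves, while the total generation time stays roughly the same.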

The 500 error bug affects both streaming and non-streaming. Build retry logic regardless of which surface you use.