Guide to prompting Gemini 3.1 Flash TTS (text-to-speech)

Today, Gemini 3.1 Flash TTS, our latest text-to-speech model, is available on Google AI Studio and Vertex AI. It delivers precise controllability and expressivity, empowering developers and enterprises to build advanced AI-speech applications.

The new TTS model introduces a high level of controllability by allowing you to steer the delivery using 200+ audio tags. We’ll share how to get strong results from the model, whether you are building accessible gaming soundtracks, banking systems, or audiobooks. Learn more about the model here.

What you will learn:

Model overview
Voice style instructions
The core prompting framework for audio tags
Directing expression and pacing
Use cases: accessibility and inclusive design, creative and entertainment, enterprise use cases

1. Model overview

Gemini 3.1 Flash TTS is available on Google AI Studio and Vertex AI in public preview.

The model delivers high-fidelity speech and precise control across 70+ languages. These core optimizations bring advanced control over style, accent, pacing, and expressivity to major markets.

Audio generated by Gemini 3.1 Flash TTS is watermarked with SynthID, a watermark woven directly into the audio output to help identify AI-generated content.

2. Voice style instructions

To begin, choose a baseline voice from the 30 available prebuilt voices and a target language from the 70+ supported options and regional variants. This selection serves as the foundation for your audio output.

Once you have your base voice and language, you can use natural language instructions to add stylization. Whether you need a specific regional accent, a professional narrator’s tone, or a more casual, conversational vibe, simply describe the style you want to achieve.

Finally, you can embed audio tags directly in your text prompt to control pacing and expressiveness at specific points in the generated audio.
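
As a rough illustration of how these three layers fit together, here is a minimal Python sketch. The SpeechRequest class, its field names, the "VOICE_NAME" placeholder, and the "style instruction, colon, transcript" layout are all assumptions made for this example, not part of any official SDK.

```python
from dataclasses import dataclass


@dataclass
class SpeechRequest:
    """Hypothetical container for the layers described above (not an SDK type)."""
    voice: str       # a prebuilt voice; "VOICE_NAME" below is a placeholder
    language: str    # target language or regional variant
    style: str       # natural-language style instruction
    transcript: str  # text to speak, optionally with inline [audio tags]

    def prompt(self) -> str:
        """Join the style instruction and the tagged transcript into one prompt."""
        style = self.style.strip().rstrip(":.")
        return f"{style}:\n{self.transcript.strip()}"


request = SpeechRequest(
    voice="VOICE_NAME",
    language="en-US",
    style="Read this as a calm, professional narrator",
    transcript="[slow] Welcome back. [short pause] Let's pick up where we left off.",
)
print(request.prompt())
```

In this sketch the voice and language would be passed to the API as configuration, while the style instruction and tags travel inside the prompt text itself.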

3. The core prompting framework for audio tags

With Gemini 3.1 Flash TTS, we are introducing audio tags as an intuitive way to guide vocal style, pace, and delivery. By embedding natural language commands directly into the text input, you can steer AI-speech output with high levels of granularity.

The formula: [pacing tag] + spoken text + [expressive tag] + spoken text + [pause tag] + spoken text

All inline tags must be enclosed in square brackets, such as [whispers] or [happy]. Insert these tags exactly where you want the transition to occur. Ensure tags are separated by text or punctuation to avoid system errors. Do not place two tags directly next to each other. Accents should be triggered by style prompts, not by the language setting.

Please note that the tags are in English only, but English-language tags can be combined with text in other languages.

French language example:

[cautious] L’ombre avança lentement dans la pièce silencieuse. [whispers] Le document secret devait être caché ici. [short pause] Mais où? [gasp] Soudain, un bruit sourd résonna dans le couloir. [panic] Il fallait sortir d’ici immédiatement.
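
The bracket and spacing rules in this section can be checked before text is sent to the model. The following is a minimal sketch of such a check. The check_tags function is a hypothetical helper written for this guide, and it treats tags separated only by whitespace as adjacent, which is our reading of the rule rather than documented model behavior.

```python
import re

# A well-formed tag: square brackets around at least one non-bracket character.
TAG = re.compile(r"\[([^\[\]]+)\]")


def check_tags(text: str) -> list[str]:
    """Return a list of problems found; an empty list means the text passes."""
    problems = []
    # Rule: tags must be separated by text or punctuation, not just whitespace.
    for match in re.finditer(r"\]\s*\[", text):
        problems.append(f"adjacent tags at position {match.start()}")
    # Rule: every bracket must belong to a well-formed [tag].
    leftover = TAG.sub("", text)
    if "[" in leftover or "]" in leftover:
        problems.append("unbalanced or empty square brackets")
    return problems


print(check_tags("[whispers] Le document secret. [short pause] Mais où?"))
print(check_tags("[slow] [whispers] Come closer."))
```

The first call reports no problems, and the second flags the two tags that have no spoken text between them.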

4. Directing expression and pacing

Gemini 3.1 Flash TTS supports 200+ audio tags to prompt expressive voices.

Most commonly used tags include: [determination], [enthusiasm], [adoration], [interest], [awe], [admiration], [nervousness], [frustration], [excitement], [curiosity], [hope], [annoyance], [amusement], [aggression], [tension], [agitation], [confusion], [anger], [positive], [neutral], [negative], [whispers], and [laughs].

Pacing and stylistic controls: you can use pacing tags like [slow] or [fast] to control the speed of the delivery. To pace out your information and let dramatic moments land, use tags like [short pause] or [long pause].

Non-verbal vocalizations: the model can produce realistic non-verbal audio. You can insert tags like [laughs] or [whispers] to add texture to the audio output.
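
When scripts get long, it helps to see which tags you are using and how often. Below is a small sketch that counts tags and lists any that are not named in this section. The tag_report function and the LISTED_TAGS set are illustrative helpers; because the model supports 200+ tags, an unlisted tag is something to double-check, not necessarily an error.

```python
import re
from collections import Counter

# Tags named in this section only; the full set supported by the model is larger.
LISTED_TAGS = {
    "determination", "enthusiasm", "adoration", "interest", "awe", "admiration",
    "nervousness", "frustration", "excitement", "curiosity", "hope", "annoyance",
    "amusement", "aggression", "tension", "agitation", "confusion", "anger",
    "positive", "neutral", "negative", "whispers", "laughs",
    "slow", "fast", "short pause", "long pause",
}


def tag_report(text: str) -> tuple[Counter, list[str]]:
    """Count each tag in the text and list tags not named in this section."""
    tags = re.findall(r"\[([^\[\]]+)\]", text)
    counts = Counter(tag.strip().lower() for tag in tags)
    unlisted = sorted(tag for tag in counts if tag not in LISTED_TAGS)
    return counts, unlisted


counts, unlisted = tag_report(
    "[neutral] Hi. [slow] One. [neutral] Two. [sparkly] Three."
)
print(counts.most_common())
print(unlisted)
```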

Use cases

1. Accessibility and inclusive design

Text-to-speech technology plays a vital role in making digital spaces accessible. Gemini 3.1 Flash TTS provides highly contextual, clear audio for individuals who rely on screen readers or augmentative and alternative communication devices.

Gaming soundtracks and descriptions

For players navigating game menus, audio descriptions need to be clear, inviting, and easy to follow.

[enthusiasm] You have selected the twilight forest level. [interest] This area features hidden artifacts and new challenges. It includes an expansive map, challenging puzzles, and a specialized survival kit.

Media and TV audio descriptions

When providing audio descriptions for television or film, the synthetic voice can match the energy of the scene, using tags like [whispers] or sound effects to enhance the experience without distracting from the primary audio.

[neutral] The scene fades in on a dimly lit diner. [whispers] A person in a trench coat sits alone in the corner booth, nervously checking their watch. [neutral] They look up sharply as the diner door swings open.

2. Creative and entertainment use cases

For audio publishers and content creators, the model applies context-aware pacing based on the natural flow of the text. You can sequence pacing and expressive tags through a passage, with spoken text between each tag, to build suspense and drama in a narrative format.

Audiobooks

A narrator setting the scene, followed by a character speaking with a distinct style.

[cautious] Step carefully around the glowing runes on the floor. [anxiety] One wrong move and the entire temple collapses. [relief] We finally found the crystal. [awe] It is more brilliant than the stories described. [alarm] Wait, the light inside is turning red. [panic] Run for the exit!

Tip: For long-form content, manual tagging can be inefficient. You can use Gemini 3.1 Flash-Lite to programmatically annotate your text before passing it to the Gemini 3.1 Flash TTS model. Try this demo here.
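
One way to structure that kind of pipeline is to split the manuscript into paragraph-sized chunks and pass each chunk through an annotation step. The sketch below shows only the chunking and plumbing. The function names, the 1,000-character limit, and the stand-in fake_annotate function are assumptions for illustration; a real annotator would call a text model instead.

```python
from typing import Callable, Iterator


def chunk_paragraphs(text: str, max_chars: int = 1000) -> Iterator[str]:
    """Yield groups of whole paragraphs, starting a new group once max_chars would be exceeded."""
    batch: list[str] = []
    size = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if batch and size + len(para) > max_chars:
            yield "\n\n".join(batch)
            batch, size = [], 0
        batch.append(para)
        size += len(para)
    if batch:
        yield "\n\n".join(batch)


def annotate_long_text(
    text: str, annotate: Callable[[str], str], max_chars: int = 1000
) -> list[str]:
    """Apply an annotator to each chunk and return the tagged chunks in order."""
    return [annotate(chunk) for chunk in chunk_paragraphs(text, max_chars)]


def fake_annotate(chunk: str) -> str:
    """Stand-in annotator for illustration; it only prepends a single tag."""
    return "[neutral] " + chunk


book = "A" * 600 + "\n\n" + "B" * 600 + "\n\n" + "C" * 100
tagged = annotate_long_text(book, fake_annotate)
print(len(tagged))
```

Splitting on paragraph boundaries keeps each chunk coherent, so the annotator sees enough context to choose sensible tags. A single paragraph longer than the limit is passed through whole in this sketch.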

3. Enterprise and business use cases

For enterprise applications, precision and tone are critical. Here is how to use pacing and expressive tags to handle day-to-day business operations.

Banking

When dealing with sensitive information like credit card fraud alerts, the model can transition from a serious, professional tone to a helpful, positive resolution.

[neutral] Hello. This is an automated fraud prevention alert from Horizon Bank. [seriousness] We detected unusual activity on your card ending in [slow] 4 3 2 1. [positive] If you recognize a charge of eighty-five dollars at City Electronics, please press one.
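
In production, alerts like this are usually filled in from a template. Here is a minimal sketch of one way to do that, spacing out the card digits so each is read individually. The spell_out and fraud_alert helpers and their parameters are hypothetical names for this example, and the amount is passed in already written out in words.

```python
def spell_out(code: str) -> str:
    """Space out characters so each is read on its own, e.g. '4321' becomes '4 3 2 1'."""
    return " ".join(ch for ch in code if not ch.isspace())


def fraud_alert(bank: str, last_four: str, amount_words: str, merchant: str) -> str:
    """Fill the tagged fraud-alert script with customer-specific values."""
    if not (last_four.isdigit() and len(last_four) == 4):
        raise ValueError("last_four must be exactly four digits")
    return (
        f"[neutral] Hello. This is an automated fraud prevention alert from {bank}. "
        f"[seriousness] We detected unusual activity on your card ending in "
        f"[slow] {spell_out(last_four)}. "
        f"[positive] If you recognize a charge of {amount_words} at {merchant}, "
        "please press one."
    )


print(fraud_alert("Horizon Bank", "4321", "eighty-five dollars", "City Electronics"))
```

The same spell_out helper works for other codes that should be read character by character, such as flight numbers.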

Urgent automated notifications

Use pacing tags to emphasize the urgency of an action required by the customer.

[neutral] Hello. This is an automated message from City Airways. [short pause] Your flight, [slow] C A 4 2 7, has been updated. [positive] It is now departing at 8:45 AM from Gate B 12. [fast] Please proceed to the gate immediately, as boarding will begin in five minutes.

How to get started today

You can start building with Gemini 3.1 Flash TTS in our core developer products:

Vertex AI: The model is available in preview on Vertex AI. Build Gemini 3.1 Flash TTS into your applications with the scale, security, and enterprise-readiness of Google Cloud.
Google AI Studio: For rapid prototyping and experimentation, the model is available in preview in AI Studio. Explore the new audio playground interface to test the expressive controls.
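
For a first experiment in code, the sketch below shows one possible shape for a request using the google-genai Python SDK. It follows the request pattern used by earlier Gemini TTS preview models; whether that pattern applies unchanged to Gemini 3.1 Flash TTS is an assumption, so check the current API reference. The model ID is deliberately left as a parameter because this guide does not state it, and the 24 kHz, 16-bit mono output format in save_wav is likewise an assumption.

```python
import wave


def save_wav(path: str, pcm: bytes, rate: int = 24000,
             channels: int = 1, sample_width: int = 2) -> None:
    """Wrap raw PCM bytes in a WAV container (default format values are assumptions)."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)


def synthesize(model_id: str, prompt: str, voice_name: str) -> bytes:
    """Request speech audio; needs the google-genai package and an API key in the environment."""
    # Imported lazily so save_wav can be used without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()
    response = client.models.generate_content(
        model=model_id,  # supply the TTS model ID from the official documentation
        contents=prompt,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name=voice_name
                    )
                )
            ),
        ),
    )
    # Assumes the audio arrives as inline data on the first part of the first candidate.
    return response.candidates[0].content.parts[0].inline_data.data
```

A typical call would pass the returned bytes from synthesize straight to save_wav, with the prompt containing your style instruction and tagged text.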

To dive deeper into the best practices, explore the following resources: