The Gemini Live API enables low-latency bidirectional text and voice interactions with Gemini. Using the Live API, you can provide end users with the experience of natural, human-like voice conversations, with the ability to interrupt the model's responses using text or voice commands. The model can process text and audio input (video coming soon!), and it can provide text and audio output.
You can prototype with prompts and the Live API in Vertex AI Studio.
The Live API is a stateful API that creates a WebSocket connection to establish a session between the client and the Gemini server. For details, see the Live API reference documentation.
Before you begin
If you haven't already, complete the getting started guide, which describes how to set up your Firebase project, connect your app to Firebase, add the SDK, initialize the Vertex AI service, and create a LiveModel instance.
Note that to use the Live API:

- Make sure that you're using at minimum these Firebase library versions:
  iOS+: not yet supported | Android: v16.3.0+ (BoM: v33.12.0+) | Web: not yet supported | Flutter: v1.5.0+ (BoM: v3.9.0+)
- Create a LiveModel instance (not a GenerativeModel instance).
Models that support this capability
The Live API is supported by gemini-2.0-flash-live-preview-04-09 only (not gemini-2.0-flash).
Use the standard features of the Live API
This section describes how to use the standard features of the Live API, specifically to stream various types of inputs and outputs:
- Send text and receive text
- Send audio and receive audio
- Send audio and receive text
- Send text and receive audio
Send text and receive text
You can send streamed text input and receive streamed text output. Make sure to create a LiveModel instance and set the response modality to Text.
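A minimal Kotlin sketch of this flow is shown below. The builder and session names used here (liveGenerationConfig, ResponseModality, connect(), send(), receive()) are assumptions based on the Kotlin SDK's conventions; confirm the exact names and signatures in the Live API reference.

```kotlin
import com.google.firebase.Firebase
import com.google.firebase.vertexai.vertexAI
import com.google.firebase.vertexai.type.ResponseModality
import com.google.firebase.vertexai.type.liveGenerationConfig

suspend fun sendAndReceiveText() {
    // Set the response modality to TEXT when creating the model.
    val model = Firebase.vertexAI.liveModel(
        modelName = "gemini-2.0-flash-live-preview-04-09",
        generationConfig = liveGenerationConfig {
            responseModality = ResponseModality.TEXT
        }
    )

    // Open a Live API session (a stateful WebSocket connection).
    val session = model.connect()

    // Send streamed text input; `send` is assumed here to accept a plain string.
    session.send("Tell me a short story about a friendly robot.")

    // Collect streamed text output as it arrives.
    session.receive().collect { response ->
        response.text?.let { chunk -> print(chunk) }
    }
}
```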
Learn how to choose a model and optionally a location appropriate for your use case and app.
Send audio and receive audio
You can send streamed audio input and receive streamed audio output. Make sure to create a LiveModel instance and set the response modality to Audio.
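A hedged Kotlin sketch follows, reusing the assumed names from the previous example; startAudioConversation() is assumed to be a convenience helper that captures microphone input and plays back the model's audio replies.

```kotlin
import com.google.firebase.Firebase
import com.google.firebase.vertexai.vertexAI
import com.google.firebase.vertexai.type.ResponseModality
import com.google.firebase.vertexai.type.liveGenerationConfig

suspend fun startVoiceConversation() {
    // Set the response modality to AUDIO when creating the model.
    val model = Firebase.vertexAI.liveModel(
        modelName = "gemini-2.0-flash-live-preview-04-09",
        generationConfig = liveGenerationConfig {
            responseModality = ResponseModality.AUDIO
        }
    )

    val session = model.connect()

    // Assumed helper: captures microphone input and plays the model's audio
    // replies. Requires the RECORD_AUDIO permission on Android.
    session.startAudioConversation()

    // ...later, when the conversation is over:
    // session.stopAudioConversation()
    // session.close()
}
```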
Learn how to configure and customize the response voice (later on this page).
Learn how to choose a model and optionally a location appropriate for your use case and app.
Send audio and receive text
You can send streamed audio input and receive streamed text output. Make sure to create a LiveModel instance and set the response modality to Text.
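The configuration is the same as in the earlier sketches apart from the input/output combination. This hedged Kotlin sketch uses the same assumed names as above and leaves the SDK-specific audio-input call as a comment.

```kotlin
import com.google.firebase.Firebase
import com.google.firebase.vertexai.vertexAI
import com.google.firebase.vertexai.type.ResponseModality
import com.google.firebase.vertexai.type.liveGenerationConfig

suspend fun audioInTextOut() {
    // Audio in, text out: only the response modality changes — it is TEXT here.
    val model = Firebase.vertexAI.liveModel(
        modelName = "gemini-2.0-flash-live-preview-04-09",
        generationConfig = liveGenerationConfig {
            responseModality = ResponseModality.TEXT
        }
    )
    val session = model.connect()

    // Stream microphone audio (raw 16-bit PCM at 16 kHz) into the session here,
    // using the audio-input call provided by your SDK version.

    // The model's replies arrive as streamed text chunks.
    session.receive().collect { response ->
        response.text?.let(::println)
    }
}
```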
Learn how to choose a model and optionally a location appropriate for your use case and app.
Send text and receive audio
You can send streamed text input and receive streamed audio output. Make sure to create a LiveModel instance and set the response modality to Audio.
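A short Kotlin sketch under the same assumptions as the earlier examples; how the returned audio bytes are exposed depends on the SDK version, so that part is left as a comment.

```kotlin
import com.google.firebase.Firebase
import com.google.firebase.vertexai.vertexAI
import com.google.firebase.vertexai.type.ResponseModality
import com.google.firebase.vertexai.type.liveGenerationConfig

suspend fun textInAudioOut() {
    // Text in, audio out: the response modality is AUDIO.
    val model = Firebase.vertexAI.liveModel(
        modelName = "gemini-2.0-flash-live-preview-04-09",
        generationConfig = liveGenerationConfig {
            responseModality = ResponseModality.AUDIO
        }
    )
    val session = model.connect()

    // Send streamed text input.
    session.send("Read me a one-paragraph bedtime story.")

    session.receive().collect { response ->
        // Each chunk carries raw 16-bit PCM audio at 24 kHz. Check the reference
        // docs for the exact accessor, then hand the bytes to your playback
        // pipeline (see the audio formats sketch near the end of this page).
    }
}
```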
Learn how to configure and customize the response voice (later on this page).
Learn how to choose a model and optionally a location appropriate for your use case and app.
Create more engaging and interactive experiences
This section describes how to create and manage more engaging or interactive features of the Live API.
Change the response voice
The Live API uses Chirp 3 to support synthesized speech responses. When using Vertex AI in Firebase, the model can respond with audio in 5 HD voices and in 31 languages.
If you don't specify a voice, the default is Puck. Alternatively, you can configure the model to respond in any of the following voices:

- Aoede (female)
- Charon (male)
- Fenrir (male)
- Kore (female)
- Puck (male)
For demos of what these voices sound like and for the full list of available languages, see Chirp 3: HD voices.
To specify a voice, set the voice name within the speechConfig object as part of the model configuration:
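For example (a sketch, assuming SpeechConfig and Voices are the Kotlin SDK's type names for this setting; confirm them in the reference documentation for your SDK version):

```kotlin
import com.google.firebase.Firebase
import com.google.firebase.vertexai.vertexAI
import com.google.firebase.vertexai.type.ResponseModality
import com.google.firebase.vertexai.type.SpeechConfig
import com.google.firebase.vertexai.type.Voices
import com.google.firebase.vertexai.type.liveGenerationConfig

// Sketch: pick one of the five HD voices via speechConfig in the model config.
// `SpeechConfig` and the `Voices` constants are assumed type names — verify them
// against the Kotlin SDK reference before relying on them.
val model = Firebase.vertexAI.liveModel(
    modelName = "gemini-2.0-flash-live-preview-04-09",
    generationConfig = liveGenerationConfig {
        responseModality = ResponseModality.AUDIO
        speechConfig = SpeechConfig(voice = Voices.FENRIR)
    }
)
```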
For the best results when prompting and requiring the model to respond in a non-English language, include the following as part of your system instructions (replacing LANGUAGE with the target language):
RESPOND IN LANGUAGE. YOU MUST RESPOND UNMISTAKABLY IN LANGUAGE.
Maintain context across sessions and requests
You can use a chat structure to maintain context across sessions and requests. Note that this only works for text input and text output.
This approach is best for short contexts; you can send turn-by-turn interactions to represent the exact sequence of events. For longer contexts, we recommend providing a single message summary to free up the context window for subsequent interactions.
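One way to do this is sketched below in Kotlin under the same assumed session API as the earlier examples. This is an app-side pattern, not a dedicated SDK feature: you keep your own transcript and replay it, or a summary of it, when starting a new session.

```kotlin
import com.google.firebase.Firebase
import com.google.firebase.vertexai.vertexAI
import com.google.firebase.vertexai.type.ResponseModality
import com.google.firebase.vertexai.type.liveGenerationConfig

// App-maintained transcript of text turns, e.g. "User: ...", "Model: ...".
val transcript = mutableListOf<String>()

suspend fun resumeSessionWithContext(summary: String?) {
    val model = Firebase.vertexAI.liveModel(
        modelName = "gemini-2.0-flash-live-preview-04-09",
        generationConfig = liveGenerationConfig { responseModality = ResponseModality.TEXT }
    )
    val session = model.connect()

    if (summary != null) {
        // Long context: a single summary message keeps the context window free.
        session.send("Summary of our conversation so far: $summary")
    } else {
        // Short context: replay the turns in their exact order.
        transcript.forEach { turn -> session.send(turn) }
    }
}
```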
Handle interruptions
Vertex AI in Firebase does not yet support handling interruptions. Check back soon!
Use function calling (tools)
You can define tools, like available functions, to use with the Live API just like you can with the standard content generation methods. This section describes some nuances when using the Live API with function calling. For a complete description and examples for function calling, see the function calling guide.
From a single prompt, the model can generate multiple function calls and the
code necessary to chain their outputs. This code executes in a sandbox
environment, generating subsequent
BidiGenerateContentToolCall
messages. The execution pauses until the results of each function call are
available, which ensures sequential processing.
Additionally, using the Live API with function calling is particularly powerful because the model can request follow-up or clarifying information from the user. For example, if the model doesn't have enough information to provide a parameter value to a function it wants to call, then the model can ask the user to provide more or clarifying information.
The client should respond with BidiGenerateContentToolResponse.
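The following Kotlin sketch shows the general shape: declare a function, make it available as a tool when creating the LiveModel, and plan to answer the model's tool calls. Tool.functionDeclarations, FunctionDeclaration, and Schema follow the function calling guide for the Kotlin SDK; whether liveModel accepts a tools parameter and exactly how tool calls surface in the session are assumptions to verify against the Live API reference.

```kotlin
import com.google.firebase.Firebase
import com.google.firebase.vertexai.vertexAI
import com.google.firebase.vertexai.type.FunctionDeclaration
import com.google.firebase.vertexai.type.ResponseModality
import com.google.firebase.vertexai.type.Schema
import com.google.firebase.vertexai.type.Tool
import com.google.firebase.vertexai.type.liveGenerationConfig

// Sketch: declare an app-side function the model may call. The function name
// `getCurrentWeather` is hypothetical and stands in for your own logic.
val weatherTool = Tool.functionDeclarations(
    listOf(
        FunctionDeclaration(
            name = "getCurrentWeather",
            description = "Get the current weather for a city",
            parameters = mapOf("city" to Schema.string("Name of the city"))
        )
    )
)

// Assumption: the tool is passed when creating the LiveModel; confirm the
// parameter name and placement in the Live API reference.
val model = Firebase.vertexAI.liveModel(
    modelName = "gemini-2.0-flash-live-preview-04-09",
    generationConfig = liveGenerationConfig { responseModality = ResponseModality.TEXT },
    tools = listOf(weatherTool)
)

// During a session, a BidiGenerateContentToolCall message surfaces the model's
// function call(s); run the matching app code, then send the result back as a
// BidiGenerateContentToolResponse so the model can continue.
```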
Limitations and requirements
Keep in mind the following limitations and requirements of the Live API.
Transcription
Vertex AI in Firebase does not yet support transcriptions. Check back soon!
Languages
- Input languages: See the full list of supported input languages for Gemini models
- Output languages: See the full list of available output languages in Chirp 3: HD voices
Audio formats
The Live API supports the following audio formats:
- Input audio format: Raw 16-bit PCM audio at 16 kHz, little-endian
- Output audio format: Raw 16-bit PCM audio at 24 kHz, little-endian
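For illustration, this Kotlin sketch configures standard Android AudioRecord and AudioTrack objects to match these formats (mono channels are an assumption here; the formats above only fix bit depth, sample rate, and endianness):

```kotlin
import android.media.AudioAttributes
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.AudioTrack
import android.media.MediaRecorder

const val INPUT_SAMPLE_RATE = 16_000   // Live API input: 16-bit PCM @ 16 kHz
const val OUTPUT_SAMPLE_RATE = 24_000  // Live API output: 16-bit PCM @ 24 kHz

// Capture microphone audio in the Live API's input format.
// Requires the RECORD_AUDIO runtime permission.
fun buildRecorder(): AudioRecord {
    val minBuf = AudioRecord.getMinBufferSize(
        INPUT_SAMPLE_RATE,
        AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_16BIT
    )
    return AudioRecord(
        MediaRecorder.AudioSource.VOICE_COMMUNICATION,
        INPUT_SAMPLE_RATE,
        AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_16BIT,
        minBuf
    )
}

// Play back the model's audio responses in the Live API's output format.
fun buildPlayer(): AudioTrack {
    val format = AudioFormat.Builder()
        .setSampleRate(OUTPUT_SAMPLE_RATE)
        .setEncoding(AudioFormat.ENCODING_PCM_16BIT)
        .setChannelMask(AudioFormat.CHANNEL_OUT_MONO)
        .build()
    return AudioTrack.Builder()
        .setAudioAttributes(
            AudioAttributes.Builder()
                .setUsage(AudioAttributes.USAGE_MEDIA)
                .setContentType(AudioAttributes.CONTENT_TYPE_SPEECH)
                .build()
        )
        .setAudioFormat(format)
        .setBufferSizeInBytes(
            AudioTrack.getMinBufferSize(
                OUTPUT_SAMPLE_RATE,
                AudioFormat.CHANNEL_OUT_MONO,
                AudioFormat.ENCODING_PCM_16BIT
            )
        )
        .setTransferMode(AudioTrack.MODE_STREAM)
        .build()
}
```

Android records and plays 16-bit PCM in native little-endian order, so no byte swapping is needed on that platform.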
Rate limits
The following rate limits apply:
- 10 concurrent sessions per Firebase project
- 4M tokens per minute
Session length
The default length for a session is 30 minutes. When the session duration exceeds the limit, the connection is terminated.
The model is also limited by the context size. Sending large chunks of input may result in earlier session termination.
Voice activity detection (VAD)
The model automatically performs voice activity detection (VAD) on a continuous audio input stream. VAD is enabled by default.
Token counting
You cannot use the CountTokens
API with the Live API.