Bidirectional streaming using the Gemini Live API



The Gemini Live API enables low-latency bidirectional text and voice interactions with Gemini. Using the Live API, you can provide end users with the experience of natural, human-like voice conversations, with the ability to interrupt the model's responses using text or voice commands. The model can process text and audio input (video coming soon!), and it can provide text and audio output.

You can prototype with prompts and the Live API in Vertex AI Studio.

The Live API is a stateful API that creates a WebSocket connection to establish a session between the client and the Gemini server. For details, see the Live API reference documentation.

Before you begin

If you haven't already, complete the getting started guide, which describes how to set up your Firebase project, connect your app to Firebase, add the SDK, initialize the Vertex AI service, and create a LiveModel instance.

Note that to use the Live API:

  • Make sure that you're using at minimum these Firebase library versions:
      • iOS+: not yet supported
      • Android: v16.3.0+ (BoM: v33.12.0+)
      • Web: not yet supported
      • Flutter: v1.5.0+ (BoM: v3.9.0+)

  • Create a LiveModel instance (not a GenerativeModel instance), as sketched below.
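
A minimal Kotlin sketch of that last step, assuming the Android SDK at the versions listed above exposes a `liveModel()` builder and a `liveGenerationConfig` DSL (exact names, parameter names, and import paths may differ by SDK version):

```kotlin
import com.google.firebase.Firebase
import com.google.firebase.vertexai.type.ResponseModality
import com.google.firebase.vertexai.type.liveGenerationConfig
import com.google.firebase.vertexai.vertexAI

// Assumed API shape: a LiveModel comes from liveModel(), not generativeModel().
val liveModel = Firebase.vertexAI.liveModel(
    modelName = "gemini-2.0-flash-live-preview-04-09",
    generationConfig = liveGenerationConfig {
        // Pick one response modality per session: TEXT or AUDIO.
        responseModality = ResponseModality.TEXT
    }
)
```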

Models that support this capability

The Live API is supported by gemini-2.0-flash-live-preview-04-09 only (not gemini-2.0-flash).

Use the standard features of the Live API

This section describes how to use the standard features of the Live API, specifically how to stream various types of input and output:

Send text and receive text

You can send streamed text input and receive streamed text output. Make sure to create a LiveModel instance and set the response modality to Text.
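
A hedged sketch of the text round trip, reusing the `liveModel` created in the setup above. The `connect()`, `send()`, and `receive()` names and the `text` accessor on each streamed response are assumptions about the session API; verify them against your SDK version.

```kotlin
import com.google.firebase.vertexai.type.content

suspend fun textConversation() {
    // Opens the stateful, WebSocket-backed session with the Gemini server.
    val session = liveModel.connect()

    // Stream a text turn to the model.
    session.send(
        content(role = "user") { text("Tell me a short story about a friendly robot.") }
    )

    // Collect streamed text output as the model produces it.
    session.receive().collect { response ->
        response.text?.let { chunk -> print(chunk) }
    }
}
```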

Learn how to choose a model and optionally a location appropriate for your use case and app.

Send audio and receive audio

You can send streamed audio input and receive streamed audio output. Make sure to create a LiveModel instance and set the response modality to Audio.
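
A hedged Kotlin sketch, assuming the Android SDK provides a `startAudioConversation()` helper on the live session that manages microphone capture and audio playback for you; verify the helper name, and note that the RECORD_AUDIO runtime permission must already be granted.

```kotlin
import com.google.firebase.Firebase
import com.google.firebase.vertexai.type.ResponseModality
import com.google.firebase.vertexai.type.liveGenerationConfig
import com.google.firebase.vertexai.vertexAI

suspend fun audioConversation() {
    val audioModel = Firebase.vertexAI.liveModel(
        modelName = "gemini-2.0-flash-live-preview-04-09",
        generationConfig = liveGenerationConfig {
            // Audio output instead of text output.
            responseModality = ResponseModality.AUDIO
        }
    )

    val session = audioModel.connect()

    // Assumed helper: streams mic input to the model and plays back its audio replies.
    session.startAudioConversation()
}
```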

Learn how to configure and customize the response voice (later on this page).

Learn how to choose a model and optionally a location appropriate for your use case and app.



Create more engaging and interactive experiences

This section describes how to create and manage more engaging or interactive features of the Live API.

Change the response voice

The Live API uses Chirp 3 to support synthesized speech responses. When using Vertex AI in Firebase, the model can respond using any of 5 HD voices in 31 languages.

If you don't specify a voice, the default is Puck. Alternatively, you can configure the model to respond in any of the following voices:

  • Aoede (female)
  • Charon (male)
  • Fenrir (male)
  • Kore (female)
  • Puck (male)

For demos of what these voices sound like and for the full list of available languages, see Chirp 3: HD voices.

To specify a voice, set the voice name within the speechConfig object as part of the model configuration:
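
For example (a sketch, assuming a `speechConfig` field on `liveGenerationConfig` and a `Voices` constant for each voice name; some SDK versions wrap the voice differently, such as a string-based Voice type):

```kotlin
import com.google.firebase.Firebase
import com.google.firebase.vertexai.type.ResponseModality
import com.google.firebase.vertexai.type.SpeechConfig
import com.google.firebase.vertexai.type.Voices
import com.google.firebase.vertexai.type.liveGenerationConfig
import com.google.firebase.vertexai.vertexAI

val voicedModel = Firebase.vertexAI.liveModel(
    modelName = "gemini-2.0-flash-live-preview-04-09",
    generationConfig = liveGenerationConfig {
        responseModality = ResponseModality.AUDIO
        // Respond with the "Fenrir" voice instead of the default "Puck".
        speechConfig = SpeechConfig(voice = Voices.FENRIR)
    }
)
```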

For the best results when prompting in and requiring the model to respond in a non-English language, include the following as part of your system instructions:

RESPOND IN LANGUAGE. YOU MUST RESPOND UNMISTAKABLY IN LANGUAGE.

Maintain context across sessions and requests

You can use a chat structure to maintain context across sessions and requests. Note that this only works for text input and text output.

This approach is best for short contexts; you can send turn-by-turn interactions to represent the exact sequence of events. For longer contexts, we recommend providing a single message summary to free up the context window for subsequent interactions.
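
For example, a new session could be seeded with a one-message summary before the next turn. This is a sketch: the `content {}` builder matches the standard SDK, while the `send()` overload on the live session is an assumption.

```kotlin
import com.google.firebase.vertexai.type.content

suspend fun resumeWithSummary(summary: String) {
    val session = liveModel.connect()

    // Send one summary turn instead of replaying the full history,
    // keeping the context window free for the new interaction.
    session.send(content(role = "user") {
        text("Summary of our previous conversation: $summary")
    })

    // Continue the conversation as usual from here.
    session.receive().collect { response ->
        response.text?.let { print(it) }
    }
}
```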

Handle interruptions

Vertex AI in Firebase does not yet support handling interruptions. Check back soon!

Use function calling (tools)

You can define tools, like available functions, to use with the Live API just like you can with the standard content generation methods. This section describes some nuances when using the Live API with function calling. For a complete description and examples for function calling, see the function calling guide.

From a single prompt, the model can generate multiple function calls and the code necessary to chain their outputs. This code executes in a sandbox environment, generating subsequent BidiGenerateContentToolCall messages. The execution pauses until the results of each function call are available, which ensures sequential processing.

Additionally, using the Live API with function calling is particularly powerful because the model can request follow-up or clarifying information from the user. For example, if the model doesn't have enough information to provide a parameter value to a function it wants to call, then the model can ask the user to provide more or clarifying information.

The client should respond with BidiGenerateContentToolResponse.
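
A hedged sketch of that loop, assuming the live session surfaces incoming tool calls on each response (here as a `functionCalls` accessor) and offers a helper for returning results (here `sendFunctionResponse`); both names are assumptions, so compare them with the function calling guide for your platform.

```kotlin
import com.google.firebase.vertexai.type.FunctionResponsePart
import com.google.firebase.vertexai.type.content
import kotlinx.serialization.json.buildJsonObject
import kotlinx.serialization.json.put

// Assumes `liveModel` was configured with a Tool and FunctionDeclaration(s),
// exactly as described in the function calling guide.
suspend fun handleToolCalls() {
    val session = liveModel.connect()
    session.send(content(role = "user") { text("Dim the living-room lights to 20 percent.") })

    session.receive().collect { response ->
        // Assumed accessor for incoming BidiGenerateContentToolCall messages.
        response.functionCalls?.forEach { call ->
            // Run your own implementation of the requested function...
            val result = buildJsonObject { put("ok", true) }
            // ...then return it to the model as a BidiGenerateContentToolResponse.
            session.sendFunctionResponse(listOf(FunctionResponsePart(call.name, result)))
        }
    }
}
```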



Limitations and requirements

Keep in mind the following limitations and requirements of the Live API.

Transcription

Vertex AI in Firebase does not yet support transcriptions. Check back soon!

Languages

For the list of languages supported for audio conversations, see Chirp 3: HD voices.

Audio formats

The Live API supports the following audio formats (see the sketch after this list):

  • Input audio format: Raw 16-bit PCM audio at 16 kHz, little-endian
  • Output audio format: Raw 16-bit PCM audio at 24 kHz, little-endian
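
On Android, these formats map onto AudioRecord and AudioTrack configured for 16-bit PCM at the sample rates above. The sketch below shows only the configuration; buffer handling, threading, and the RECORD_AUDIO permission are left to your app, and the SDK's own audio helpers may take care of this for you.

```kotlin
import android.media.AudioFormat
import android.media.AudioManager
import android.media.AudioRecord
import android.media.AudioTrack
import android.media.MediaRecorder

// Capture microphone input as raw 16-bit PCM at 16 kHz.
val recordBufferSize = AudioRecord.getMinBufferSize(
    16_000, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT
)
val recorder = AudioRecord(
    MediaRecorder.AudioSource.VOICE_COMMUNICATION,
    16_000, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT, recordBufferSize
)

// Play back the model's raw 16-bit PCM output at 24 kHz.
val playbackBufferSize = AudioTrack.getMinBufferSize(
    24_000, AudioFormat.CHANNEL_OUT_MONO, AudioFormat.ENCODING_PCM_16BIT
)
val player = AudioTrack(
    AudioManager.STREAM_MUSIC,
    24_000, AudioFormat.CHANNEL_OUT_MONO, AudioFormat.ENCODING_PCM_16BIT,
    playbackBufferSize, AudioTrack.MODE_STREAM
)
```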

Rate limits

The following rate limits apply:

  • 10 concurrent sessions per Firebase project
  • 4M tokens per minute

Session length

The default maximum length for a session is 30 minutes. When the session duration exceeds the limit, the connection is terminated.

The session is also limited by the model's context size. Sending large chunks of input may result in earlier session termination.

Voice activity detection (VAD)

The model automatically performs voice activity detection (VAD) on a continuous audio input stream. VAD is enabled by default.

Token counting

You cannot use the CountTokens API with the Live API.