Generate text from multimodal prompts using the Gemini API

When calling the Gemini API from your app using a Vertex AI in Firebase SDK, you can prompt the Gemini model to generate text based on a multimodal input. Multimodal prompts can include multiple modalities (or types of input), like text along with images, PDFs, plain-text files, video, and audio.

In each multimodal request, you must always provide the following:

The file's mimeType. Learn about each input file's supported MIME types.
The file. You can either provide the file as inline data (as shown on this page) or using its URL or URI.
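As a minimal sketch of these two required pieces, the TypeScript below pairs a file's base64-encoded bytes with its MIME type. The `inlineData` object shape is an assumption based on how the Firebase web SDK represents inline files; the helpers `toBase64` and `inlineDataPart` are hypothetical names, not SDK functions.

```typescript
// Assumed shape of an inline-data part: base64 payload plus its MIME type.
interface InlineDataPart {
  inlineData: { mimeType: string; data: string };
}

// Base64-encode raw bytes. btoa is a global in browsers and in Node.js 16+.
function toBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Hypothetical helper: refuses to build a part without a MIME type.
function inlineDataPart(bytes: Uint8Array, mimeType: string): InlineDataPart {
  if (mimeType.trim() === "") {
    throw new Error("A mimeType is required for every file in a request");
  }
  return { inlineData: { mimeType, data: toBase64(bytes) } };
}

// The first three bytes of a JPEG file stand in for a real image here.
const jpegPart = inlineDataPart(new Uint8Array([0xff, 0xd8, 0xff]), "image/jpeg");
console.log(jpegPart.inlineData.mimeType, jpegPart.inlineData.data);
```

Rejecting an empty MIME type up front keeps the failure close to its cause instead of surfacing later as a request error.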

For testing and iterating on multimodal prompts, we recommend using Vertex AI Studio.

Other options for working with the Gemini API

Optionally experiment with an alternative "Google AI" version of the Gemini API
Get free-of-charge access (within limits and where available) using Google AI Studio and Google AI client SDKs. These SDKs should be used for prototyping only in mobile and web apps.

After you're familiar with how the Gemini API works, migrate to our Vertex AI in Firebase SDKs (this documentation), which have many additional features important for mobile and web apps, like protecting the API from abuse using Firebase App Check and supporting large media files in requests.

Optionally call the Vertex AI Gemini API server-side (like with Python, Node.js, or Go)
Use the server-side Vertex AI SDKs, Genkit, or Firebase Extensions for the Gemini API.

Before you begin

If you haven't already, complete the getting started guide, which describes how to set up your Firebase project, connect your app to Firebase, add the SDK, initialize the Vertex AI service, and create a GenerativeModel instance.

Sample media files

If you don't already have media files, then you can use the following publicly available files. Since these files are stored in buckets that aren't in your Firebase project, you need to use the https://storage.googleapis.com/BUCKET_NAME/PATH/TO/FILE format for the URL.

Image: https://storage.googleapis.com/cloud-samples-data/generative-ai/image/scones.jpg with a MIME type of image/jpeg. View or download this image.
PDF: https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf with a MIME type of application/pdf. View or download this PDF.
Video: https://storage.googleapis.com/cloud-samples-data/video/animals.mp4 with a MIME type of video/mp4. View or download this video.
Audio: https://storage.googleapis.com/cloud-samples-data/generative-ai/audio/pixel.mp3 with a MIME type of audio/mp3. Listen to or download this audio.
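To illustrate the URL format described above, this sketch builds the public URL for one of the sample files and wraps it with its MIME type. The `fileData` and `fileUri` field names are an assumption about how the Firebase web SDK represents URL-based files; `publicStorageUrl` and `fileDataPart` are hypothetical helpers.

```typescript
// Assumed shape of a URL-based file part: a URL or URI plus its MIME type.
interface FileDataPart {
  fileData: { mimeType: string; fileUri: string };
}

// Build a URL in the https://storage.googleapis.com/BUCKET_NAME/PATH/TO/FILE format.
function publicStorageUrl(bucketName: string, pathToFile: string): string {
  // Drop leading slashes so the result never contains "//" after the bucket name.
  const path = pathToFile.replace(/^\/+/, "");
  return "https://storage.googleapis.com/" + bucketName + "/" + path;
}

// Hypothetical helper that pairs a file URL with its MIME type.
function fileDataPart(fileUri: string, mimeType: string): FileDataPart {
  return { fileData: { mimeType, fileUri } };
}

// The sample image listed above.
const sconesPart = fileDataPart(
  publicStorageUrl("cloud-samples-data", "generative-ai/image/scones.jpg"),
  "image/jpeg"
);
console.log(sconesPart.fileData.fileUri);
```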

Generate text from text and a single image

Make sure that you've completed the Before you begin section of this guide before trying this sample.

You can call the Gemini API with multimodal prompts that include both text and a single file (like an image, as shown in this example). For these calls, you need to use a model that supports media in prompts (like Gemini 2.0 Flash).

Make sure to review the requirements and recommendations for input files.

Choose whether you want to stream the response (generateContentStream) or wait for the response until the entire result is generated (generateContent).

Streaming

You can achieve faster interactions by using streaming to handle partial results instead of waiting for the model to generate the entire result.
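A streaming call with text and one image might look like the sketch below. To stay self-contained, it declares a small `StreamingModel` interface that mirrors only the part of a GenerativeModel instance used here; that interface, the `Part` type, and the function name `streamImageDescription` are assumptions for illustration.

```typescript
// Assumed request parts: a text prompt or an inline file.
type Part = string | { inlineData: { mimeType: string; data: string } };

// Local stand-in for the streaming method of a GenerativeModel instance.
interface StreamingModel {
  generateContentStream(
    request: Part[]
  ): Promise<{ stream: AsyncIterable<{ text(): string }> }>;
}

// Sends the prompt and image, passes each partial result to onChunk as it
// arrives, and resolves with the concatenated text once the stream ends.
async function streamImageDescription(
  model: StreamingModel,
  prompt: string,
  imageBase64: string,
  onChunk: (text: string) => void
): Promise<string> {
  const imagePart: Part = {
    inlineData: { mimeType: "image/jpeg", data: imageBase64 },
  };
  const result = await model.generateContentStream([prompt, imagePart]);

  let fullText = "";
  for await (const chunk of result.stream) {
    const text = chunk.text();
    onChunk(text); // for example, append to the UI right away
    fullText += text;
  }
  return fullText;
}
```

In an app, `model` would be the GenerativeModel instance created in the getting started guide, and `onChunk` would typically update the UI.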

Without streaming

Alternatively, you can wait for the entire result instead of streaming; the result is only returned after the model completes the entire generation process.
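The non-streaming variant can be sketched as follows. The `ContentModel` interface is a local stand-in that mirrors only the `generateContent` call of a GenerativeModel instance; it and the function name `describeImage` are assumptions for illustration.

```typescript
// Assumed request parts: a text prompt or an inline file.
type Part = string | { inlineData: { mimeType: string; data: string } };

// Local stand-in for the non-streaming method of a GenerativeModel instance.
interface ContentModel {
  generateContent(request: Part[]): Promise<{ response: { text(): string } }>;
}

// Sends the prompt and image, then resolves with the complete generated text.
async function describeImage(
  model: ContentModel,
  prompt: string,
  imageBase64: string
): Promise<string> {
  const imagePart: Part = {
    inlineData: { mimeType: "image/jpeg", data: imageBase64 },
  };
  // Nothing is returned until the model has finished generating.
  const result = await model.generateContent([prompt, imagePart]);
  return result.response.text();
}
```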

Learn how to choose a model and optionally a location appropriate for your use case and app.

Generate text from text and multiple images

Make sure that you've completed the Before you begin section of this guide before trying this sample.

You can call the Gemini API with multimodal prompts that include both text and multiple files (like images, as shown in this example). For these calls, you need to use a model that supports media in prompts (like Gemini 2.0 Flash).

Make sure to review the requirements and recommendations for input files.

Choose whether you want to stream the response (generateContentStream) or wait for the response until the entire result is generated (generateContent).

Streaming

You can achieve faster interactions by using streaming to handle partial results instead of waiting for the model to generate the entire result.
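With several images, the request is the text prompt followed by one part per image, each with its own MIME type. The sketch below assumes that layout; `StreamingModel` is a local stand-in for a GenerativeModel instance, and `ImageInput`, `buildImageRequest`, and `streamImageComparison` are hypothetical names.

```typescript
// Assumed request parts: a text prompt or an inline file.
type Part = string | { inlineData: { mimeType: string; data: string } };

// Hypothetical input type: one base64-encoded image and its MIME type.
interface ImageInput {
  mimeType: string;
  base64: string;
}

// Local stand-in for the streaming method of a GenerativeModel instance.
interface StreamingModel {
  generateContentStream(
    request: Part[]
  ): Promise<{ stream: AsyncIterable<{ text(): string }> }>;
}

// Prompt first, then the images in the order given.
function buildImageRequest(prompt: string, images: ImageInput[]): Part[] {
  const imageParts: Part[] = images.map((image) => ({
    inlineData: { mimeType: image.mimeType, data: image.base64 },
  }));
  return [prompt, ...imageParts];
}

// Streams partial results to onChunk while the model is still generating.
async function streamImageComparison(
  model: StreamingModel,
  prompt: string,
  images: ImageInput[],
  onChunk: (text: string) => void
): Promise<void> {
  const result = await model.generateContentStream(buildImageRequest(prompt, images));
  for await (const chunk of result.stream) {
    onChunk(chunk.text());
  }
}
```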

Without streaming

Alternatively, you can wait for the entire result instead of streaming; the result is only returned after the model completes the entire generation process.
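A non-streaming call with multiple images could be sketched like this. `ContentModel` is a local stand-in for the `generateContent` call of a GenerativeModel instance, and `ImageInput` and `compareImages` are hypothetical names; none of them come from the SDK itself.

```typescript
// Assumed request parts: a text prompt or an inline file.
type Part = string | { inlineData: { mimeType: string; data: string } };

// Hypothetical input type: one base64-encoded image and its MIME type.
interface ImageInput {
  mimeType: string;
  base64: string;
}

// Local stand-in for the non-streaming method of a GenerativeModel instance.
interface ContentModel {
  generateContent(request: Part[]): Promise<{ response: { text(): string } }>;
}

// Sends the prompt followed by every image, then resolves with the full text.
async function compareImages(
  model: ContentModel,
  prompt: string,
  images: ImageInput[]
): Promise<string> {
  const request: Part[] = [prompt];
  for (let i = 0; i < images.length; i++) {
    request.push({
      inlineData: { mimeType: images[i].mimeType, data: images[i].base64 },
    });
  }
  const result = await model.generateContent(request);
  return result.response.text();
}
```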

Learn how to choose a model and optionally a location appropriate for your use case and app.

Generate text from text and a video

Make sure that you've completed the Before you begin section of this guide before trying this sample.

You can call the Gemini API with multimodal prompts that include both text and video file(s) (as shown in this example). For these calls, you need to use a model that supports media in prompts (like Gemini 2.0 Flash).

Make sure to review the requirements and recommendations for input files.

Choose whether you want to stream the response (generateContentStream) or wait for the response until the entire result is generated (generateContent).

Streaming

You can achieve faster interactions by using streaming to handle partial results instead of waiting for the model to generate the entire result.
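For video, the request has the same shape as for images; only the MIME type changes. The sketch below assumes an MP4 file provided as inline base64 data. `StreamingModel` is a local stand-in for a GenerativeModel instance and `streamVideoSummary` is a hypothetical name.

```typescript
// Assumed request parts: a text prompt or an inline file.
type Part = string | { inlineData: { mimeType: string; data: string } };

// Local stand-in for the streaming method of a GenerativeModel instance.
interface StreamingModel {
  generateContentStream(
    request: Part[]
  ): Promise<{ stream: AsyncIterable<{ text(): string }> }>;
}

// Streams the model's text about a video, handing each partial result to onChunk.
async function streamVideoSummary(
  model: StreamingModel,
  prompt: string,
  videoBase64: string,
  onChunk: (text: string) => void
): Promise<void> {
  const videoPart: Part = {
    inlineData: { mimeType: "video/mp4", data: videoBase64 },
  };
  const result = await model.generateContentStream([prompt, videoPart]);
  for await (const chunk of result.stream) {
    onChunk(chunk.text());
  }
}
```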

Without streaming

Alternatively, you can wait for the entire result instead of streaming; the result is only returned after the model completes the entire generation process.
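The non-streaming video call can be sketched the same way. `ContentModel` is a local stand-in for the `generateContent` call of a GenerativeModel instance, and `summarizeVideo` is a hypothetical name.

```typescript
// Assumed request parts: a text prompt or an inline file.
type Part = string | { inlineData: { mimeType: string; data: string } };

// Local stand-in for the non-streaming method of a GenerativeModel instance.
interface ContentModel {
  generateContent(request: Part[]): Promise<{ response: { text(): string } }>;
}

// Sends the prompt and an MP4 video, then resolves with the complete text.
async function summarizeVideo(
  model: ContentModel,
  prompt: string,
  videoBase64: string
): Promise<string> {
  const videoPart: Part = {
    inlineData: { mimeType: "video/mp4", data: videoBase64 },
  };
  const result = await model.generateContent([prompt, videoPart]);
  return result.response.text();
}
```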

Learn how to choose a model and optionally a location appropriate for your use case and app.

Requirements and recommendations for input files

See Supported input files and requirements for the Vertex AI Gemini API to learn about the following:

Different options for providing a file in a request
Supported file types
Supported MIME types and how to specify them
Requirements and best practices for files and multimodal requests

What else can you do?

Learn how to count tokens before sending long prompts to the model.
Set up Cloud Storage for Firebase so that you can include large files in your multimodal requests and have a more managed solution for providing files in prompts. Files can include images, PDFs, video, and audio.
Start thinking about preparing for production, including setting up Firebase App Check to protect the Gemini API from abuse by unauthorized clients. Also, make sure to review the production checklist.

Try out other capabilities

Build multi-turn conversations (chat).
Generate text from text-only prompts.
Generate structured output (like JSON) from both text and multimodal prompts.
Generate images from text prompts.
Use function calling to connect generative models to external systems and information.

Learn how to control content generation

Understand prompt design, including best practices, strategies, and example prompts.
Configure model parameters like temperature and maximum output tokens (for Gemini) or aspect ratio and person generation (for Imagen).
Use safety settings to adjust the likelihood of getting responses that may be considered harmful.

You can also experiment with prompts and model configurations using Vertex AI Studio.

Learn more about the supported models

Learn about the models available for various use cases and their quotas and pricing.

