Supported input files and requirements for the Vertex AI Gemini API

When calling the Vertex AI Gemini API from your app using a Vertex AI in Firebase SDK, you can prompt the Gemini model to generate text based on a multimodal input. Multimodal prompts can include multiple modalities (or types of input), like text along with images, PDFs, video, and audio.

For the non-text parts of the input (like media files), you need to use supported file types, specify a supported MIME type, and make sure that your files and multimodal requests meet the requirements and follow best practices.

This page describes the following:

Options for providing files in your request.
Details about the supported MIME types, best practices, and limitations for the following file inputs:
Images | Video | Audio | Documents (like PDFs).

Options for providing files in multimodal requests

In each multimodal request, you must always provide the following:

The file's mimeType. See each input file's supported MIME types in the applicable section of this page.
The file. You can either provide the file using its URL / URI or provide the file as inline data.

The size and number of files that you can provide in a request depend on the input file type, how you provide the file, and the model used. For details, see each input file type's section on this page.

Option 1: Provide the file using a URL or URI

Here are the acceptable types of URLs or URIs:

Cloud Storage for Firebase bucket URL: The file's URL must be public, or the signed-in user or client must have sufficient access to the file. Learn more about Cloud Storage for Firebase benefits, URL requirements, and code samples.
Google Cloud Storage bucket URL: The file's URL must be public. Also, if the bucket is in a different project than the one you're using with Vertex AI in Firebase, then use the https://storage.googleapis.com/BUCKET_NAME/PATH/TO/FILE format for the URL.
Browser/HTTP URLs: The file URL must be publicly readable. Examples include URLs from media-hosting sites, URLs that show the media directly (not a web page hosting the media), or a published Google Drive or Google Workspace file.
YouTube video URL: The YouTube video must be public or unlisted.

Learn more about the requirements for URLs and URIs in the Google Cloud documentation.
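
The cross-project Google Cloud Storage URL format described above can be assembled with a small helper. This is a minimal sketch, not part of any SDK: the function name `gcs_http_url`, the example bucket, and the example path are hypothetical, and percent-encoding the object path is an assumption that the source doesn't state.

```python
from urllib.parse import quote

GCS_HTTP_BASE = "https://storage.googleapis.com"


def gcs_http_url(bucket_name: str, object_path: str) -> str:
    """Build a URL in the BUCKET_NAME/PATH/TO/FILE format described above."""
    # Assumption: percent-encode characters such as spaces, but keep "/"
    # as the path separator.
    encoded_path = quote(object_path.lstrip("/"), safe="/")
    return f"{GCS_HTTP_BASE}/{bucket_name}/{encoded_path}"


# Hypothetical bucket and object names, for illustration only.
print(gcs_http_url("my-bucket", "media/team photo.jpg"))
```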

Option 2: Provide the file as inline data

Note the following about files provided as inline data:

Only small files can be sent as inline data because the total request size limit is 20 MB.
The file is encoded to base64 in transit (which increases the file size).

For examples showing how to include files as inline data, see Generate text from multimodal prompts using the Gemini API.
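
To see why base64 encoding matters for the request size limit, the following sketch estimates the encoded size of a file before you send it. It assumes that "20 MB" means 20 * 1024 * 1024 bytes and that padded base64 is used; the helper names `base64_size` and `fits_inline` are hypothetical.

```python
import base64

# Assumption: "20 MB" is read here as 20 * 1024 * 1024 bytes.
MAX_REQUEST_BYTES = 20 * 1024 * 1024


def base64_size(raw_size: int) -> int:
    """Length in bytes of the padded base64 encoding of raw_size bytes."""
    # Every group of up to 3 raw bytes becomes 4 encoded characters.
    return 4 * ((raw_size + 2) // 3)


def fits_inline(raw_size: int, other_request_bytes: int = 0) -> bool:
    """True if the encoded file plus the rest of the request fits the limit."""
    return base64_size(raw_size) + other_request_bytes <= MAX_REQUEST_BYTES


# Sanity check of the formula against the standard library encoder.
payload = b"\x00" * 1000
assert base64_size(len(payload)) == len(base64.b64encode(payload))

print(fits_inline(15 * 1024 * 1024))       # file alone
print(fits_inline(15 * 1024 * 1024, 1))    # file plus one more byte of request
```

Under these assumptions, base64 adds roughly one third to the file size, so a 15 MiB file already fills the entire limit and leaves no room for the text prompt.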

Images: Requirements, best practices, and limitations

Images: Requirements

In this section, learn about the supported MIME types and limits per request for images.

Supported MIME types

Gemini multimodal models support the following image MIME types:

Image MIME type	Gemini 2.0 Flash	Gemini 2.0 Flash‑Lite
PNG - `image/png`
JPEG - `image/jpeg`
WebP - `image/webp`

Limits per request

There isn't a specific limit to the number of pixels in an image. However, larger images are scaled down and padded to fit a maximum resolution of 3072 x 3072 while preserving their original aspect ratio.
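
As a rough illustration of the scaling rule above, the following sketch computes the dimensions an image would have after being scaled down to fit 3072 x 3072 with its aspect ratio preserved. The rounding behavior is an assumption, the padding step is not modeled, and `scaled_size` is a hypothetical helper name.

```python
MAX_SIDE = 3072


def scaled_size(width: int, height: int, max_side: int = MAX_SIDE) -> tuple[int, int]:
    """Scale (width, height) down to fit max_side x max_side, keeping aspect ratio."""
    # Images that already fit are left unchanged (scale factor of 1).
    scale = min(1.0, max_side / max(width, height))
    # Assumption: round to the nearest pixel, never below 1.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(scaled_size(6144, 3072))
print(scaled_size(1000, 500))
```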

Here's the maximum number of image files allowed in a prompt request:

Gemini 2.0 Flash and Gemini 2.0 Flash‑Lite: 3,000 images

Images: Tokenization

Here's how tokens are calculated for images:

Gemini 2.0 Flash and Gemini 2.0 Flash‑Lite:
- If both dimensions of an image are less than or equal to 384 pixels, then 258 tokens are used.
- If either dimension of an image is greater than 384 pixels, then the image is cropped into tiles. Each tile size defaults to the smaller dimension (width or height) divided by 1.5. If necessary, the tile size is adjusted so that it's not smaller than 256 pixels and not greater than 768 pixels. Each tile is then resized to 768x768 and uses 258 tokens.
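
The tokenization rules above can be turned into a rough estimator. This is a sketch under one stated assumption: the text doesn't say how tiles are counted, so the code assumes the tile count is the rounded-up number of tiles along each dimension, multiplied together. The function name `image_tokens` is hypothetical.

```python
import math

TOKENS_PER_TILE = 258


def image_tokens(width: int, height: int) -> int:
    """Estimate the token count for one image from its pixel dimensions."""
    # Small images use a flat 258 tokens.
    if width <= 384 and height <= 384:
        return TOKENS_PER_TILE
    # Tile size: smaller dimension / 1.5, clamped to the range [256, 768].
    tile = min(width, height) / 1.5
    tile = max(256.0, min(768.0, tile))
    # Assumption: tiles are counted by rounding up along each dimension.
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return tiles * TOKENS_PER_TILE


print(image_tokens(384, 384))
print(image_tokens(960, 540))
```

For a 960 x 540 image, the tile size is 540 / 1.5 = 360, which gives 3 x 2 = 6 tiles under the rounding assumption.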

Images: Best practices

When using images, use the following best practices and information for the best results:

If you want to detect text in an image, use a prompt with a single image; it produces better results than a prompt with multiple images.
If your prompt contains a single image, place the image before the text prompt in your request.
If your prompt contains multiple images, and you want to refer to them later in your prompt or have the model refer to them in the model response, it can help to give each image an index before the image. Use `a`, `b`, `c` or `image 1`, `image 2`, `image 3` for your index. The following is an example of using indexed images in a prompt:
```
image 1 
image 2 
image 3 

Write a blogpost about my day using image 1 and image 2. Then, give me ideas
for tomorrow based on image 3.
```
Use images with higher resolution; they yield better results.
Include a few examples in the prompt.
Rotate images to their proper orientation before adding them to the prompt.
Avoid blurry images.
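
The indexing tip above amounts to placing a short text label before each image and the instructions last. The following sketch shows only that ordering: `indexed_parts` is a hypothetical helper, the images are opaque placeholders, and the returned list is a generic Python list, not an SDK request type.

```python
def indexed_parts(images: list, prompt: str) -> list:
    """Return [label, image, label, image, ..., prompt] in request order."""
    parts = []
    for number, image in enumerate(images, start=1):
        # The label comes immediately before the image it names.
        parts.append(f"image {number}")
        parts.append(image)
    # The text instructions go after all of the images.
    parts.append(prompt)
    return parts


# Placeholder strings stand in for real image data.
print(indexed_parts(["<photo A>", "<photo B>"], "Compare image 1 and image 2."))
```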

Images: Limitations

While Gemini multimodal models are powerful in many multimodal use cases, it's important to understand the limitations of the models:

Content moderation: The models refuse to provide answers on images that violate our safety policies.
Spatial reasoning: The models aren't precise at locating text or objects in images. They might only return the approximated counts of objects.
Medical uses: The models aren't suitable for interpreting medical images (for example, x-rays and CT scans) or providing medical advice.
People recognition: The models aren't meant to be used to identify people who aren't celebrities in images.
Accuracy: The models might hallucinate or make mistakes when interpreting low-quality, rotated, or extremely low-resolution images. The models might also hallucinate when interpreting handwritten text in images.

Video: Requirements, best practices, and limitations

Video: Requirements

In this section, learn about the supported MIME types and limits per request for video.

Supported MIME types

Gemini multimodal models support the following video MIME types:

Video MIME type	Gemini 2.0 Flash	Gemini 2.0 Flash‑Lite
FLV - `video/x-flv`
MOV - `video/quicktime`
MPEG - `video/mpeg`
MPEGPS - `video/mpegps`
MPG - `video/mpg`
MP4 - `video/mp4`
WEBM - `video/webm`
WMV - `video/wmv`
3GPP - `video/3gpp`

Limits per request

Here's the maximum number of video files allowed in a prompt request:

Gemini 2.0 Flash and Gemini 2.0 Flash‑Lite: 10 video files

Video: Tokenization

Here's how tokens are calculated for video:

Gemini 2.0 Flash and Gemini 2.0 Flash‑Lite: The audio track is encoded with video frames. The audio track is also broken down into 1-second chunks that each account for 32 tokens. The video frame and audio tokens are interleaved together with their timestamps. The timestamps are represented as 7 tokens.
All Gemini multimodal models: Videos are sampled at 1 frame per second (fps). Each video frame accounts for 258 tokens.
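
Combining the figures above gives a rough per-video estimate. This sketch assumes that each started second contributes one sampled frame and one 7-token timestamp, and that audio adds 32 tokens per second; the text doesn't spell out how often timestamps occur, so treat the result as an approximation. The function name `estimate_video_tokens` is hypothetical.

```python
import math

FRAME_TOKENS = 258             # per sampled frame (1 fps)
AUDIO_TOKENS_PER_SECOND = 32   # per 1-second audio chunk
TIMESTAMP_TOKENS = 7           # assumption: one timestamp per sampled second


def estimate_video_tokens(duration_seconds: float, has_audio: bool = True) -> int:
    """Approximate the token count for a video of the given duration."""
    # Assumption: a partial final second still produces one frame.
    seconds = math.ceil(duration_seconds)
    per_second = FRAME_TOKENS + TIMESTAMP_TOKENS
    if has_audio:
        per_second += AUDIO_TOKENS_PER_SECOND
    return seconds * per_second


print(estimate_video_tokens(60))
print(estimate_video_tokens(60, has_audio=False))
```

Under these assumptions, one minute of video with audio comes to 60 x (258 + 32 + 7) = 17,820 tokens.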

Video: Best practices

When using video, use the following best practices and information for the best results:

If your prompt contains a single video, place the video before the text prompt.
If you need timestamp localization in a video with audio, ask the model to generate timestamps in the MM:SS format where the first two digits represent minutes and the last two digits represent seconds. Use the same format for questions that ask about a timestamp.
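
If your app converts between seconds and the MM:SS format described above, two small helpers are enough. This is a minimal sketch with hypothetical function names; it assumes whole seconds and doesn't handle hours.

```python
def format_mmss(total_seconds: int) -> str:
    """Format a non-negative number of seconds as MM:SS."""
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes:02d}:{seconds:02d}"


def parse_mmss(timestamp: str) -> int:
    """Parse an MM:SS string into a number of seconds."""
    minutes_text, seconds_text = timestamp.strip().split(":")
    minutes, seconds = int(minutes_text), int(seconds_text)
    if minutes < 0 or not 0 <= seconds < 60:
        raise ValueError(f"not a valid MM:SS timestamp: {timestamp!r}")
    return minutes * 60 + seconds


print(format_mmss(125))
print(parse_mmss("02:05"))
```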

Video: Limitations

While Gemini multimodal models are powerful in many multimodal use cases, it's important to understand the limitations of the models:

Content moderation: The models refuse to provide answers on videos that violate our safety policies.
Non-speech sound recognition: The models that support audio might make mistakes recognizing sound that's not speech.
High-speed motion: The models might make mistakes understanding high-speed motion in video due to the fixed 1 frame per second (fps) sampling rate.

Audio: Requirements and limitations

Audio: Requirements

In this section, learn about the supported MIME types and limits per request for audio.

Supported MIME types

Gemini multimodal models support the following audio MIME types:

Audio MIME type	Gemini 2.0 Flash	Gemini 2.0 Flash‑Lite
AAC - `audio/aac`
FLAC - `audio/flac`
MP3 - `audio/mp3`
M4A - `audio/m4a`
MPEG - `audio/mpeg`
MPGA - `audio/mpga`
MP4 - `audio/mp4`
OPUS - `audio/opus`
PCM - `audio/pcm`
WAV - `audio/wav`
WEBM - `audio/webm`

Limits per request

You can include a maximum of 1 audio file in a prompt request.

Audio: Limitations

While Gemini multimodal models are powerful in many multimodal use cases, it's important to understand the limitations of the models:

Non-speech sound recognition: The models that support audio might make mistakes recognizing sound that's not speech.
Audio-only timestamps: To accurately generate timestamps for audio-only files, you must configure the audio_timestamp parameter in generation_config.

Documents (like PDFs): Requirements, best practices, and limitations

Documents: Requirements

In this section, learn about the supported MIME types and limits per request for documents (like PDFs).

Supported MIME types

Gemini multimodal models support the following document MIME types:

Document MIME type	Gemini 2.0 Flash	Gemini 2.0 Flash‑Lite
PDF - `application/pdf`
Text - `text/plain`
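
A client can check a MIME type against the lists on this page before building a request. The following sketch copies the MIME types from the image, video, audio, and document tables; the dictionary and the function `modality_for` are hypothetical, and the normalization (lowercasing, dropping parameters such as `; charset=utf-8`) is an assumption about what your app might receive.

```python
from typing import Optional

# MIME types copied from the tables on this page.
SUPPORTED_MIME_TYPES = {
    "image": {"image/png", "image/jpeg", "image/webp"},
    "video": {
        "video/x-flv", "video/quicktime", "video/mpeg", "video/mpegps",
        "video/mpg", "video/mp4", "video/webm", "video/wmv", "video/3gpp",
    },
    "audio": {
        "audio/aac", "audio/flac", "audio/mp3", "audio/m4a", "audio/mpeg",
        "audio/mpga", "audio/mp4", "audio/opus", "audio/pcm", "audio/wav",
        "audio/webm",
    },
    "document": {"application/pdf", "text/plain"},
}


def modality_for(mime_type: str) -> Optional[str]:
    """Return the input type for a MIME type, or None if it isn't listed."""
    # Assumption: ignore case and any parameters after ";".
    normalized = mime_type.split(";")[0].strip().lower()
    for modality, mime_types in SUPPORTED_MIME_TYPES.items():
        if normalized in mime_types:
            return modality
    return None


print(modality_for("video/mp4"))
print(modality_for("image/gif"))
```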

Limits per request

PDFs are treated as images, so a single page of a PDF is treated as one image. The number of pages allowed in a prompt is limited to the number of images the model can support:

Gemini 2.0 Flash and Gemini 2.0 Flash‑Lite:
- Maximum files per request: 3,000
- Maximum pages per file: 1,000
- Maximum size per file: 50 MB
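
The three limits above can be checked on the client before sending a request. This is a sketch only: `PdfInfo` and `pdf_limit_problems` are hypothetical names, the page count is assumed to come from whatever PDF library your app already uses, and "50 MB" is assumed to mean 50 * 1024 * 1024 bytes.

```python
from dataclasses import dataclass

MAX_FILES_PER_REQUEST = 3000
MAX_PAGES_PER_FILE = 1000
# Assumption: "50 MB" is read here as 50 * 1024 * 1024 bytes.
MAX_BYTES_PER_FILE = 50 * 1024 * 1024


@dataclass
class PdfInfo:
    name: str
    pages: int        # page count, obtained elsewhere (hypothetical input)
    size_bytes: int


def pdf_limit_problems(files: list[PdfInfo]) -> list[str]:
    """Return a description of each limit that the request would exceed."""
    problems = []
    if len(files) > MAX_FILES_PER_REQUEST:
        problems.append(f"too many files: {len(files)}")
    for pdf in files:
        if pdf.pages > MAX_PAGES_PER_FILE:
            problems.append(f"{pdf.name}: too many pages ({pdf.pages})")
        if pdf.size_bytes > MAX_BYTES_PER_FILE:
            problems.append(f"{pdf.name}: file too large ({pdf.size_bytes} bytes)")
    return problems


print(pdf_limit_problems([PdfInfo("report.pdf", pages=1200, size_bytes=4_000_000)]))
```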

Documents: Tokenization

PDF tokenization

PDFs are treated as images, so each page of a PDF is tokenized in the same way as an image.

Also, the cost for PDFs follows Gemini image pricing. For example, if you include a two-page PDF in a Gemini API call, you incur the input fee for processing two images.

Documents: Best practices

When using PDFs, use the following best practices and information for the best results:

If your prompt contains a single PDF, place the PDF before the text prompt in your request.
If you have a long document, consider splitting it into multiple PDFs to process it.
Use PDFs created with text rendered as text instead of using text in scanned images. This format ensures text is machine-readable so that it's easier for the model to edit, search, and manipulate compared to scanned image PDFs. This practice provides optimal results when working with text-heavy documents like contracts.
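
When you split a long document as suggested above, the page ranges for each part are simple to compute. This sketch only plans the ranges and leaves the actual splitting to a PDF library of your choice; `page_ranges` is a hypothetical helper, and the chunk size is yours to pick.

```python
def page_ranges(total_pages: int, pages_per_chunk: int) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (first_page, last_page) ranges for each part."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    return [
        (start, min(start + pages_per_chunk - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_chunk)
    ]


print(page_ranges(2500, 1000))
```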

Documents: Limitations

While Gemini multimodal models are powerful in many multimodal use cases, it's important to understand the limitations of the models:

Spatial reasoning: The models aren't precise at locating text or objects in PDFs. They might only return the approximated counts of objects.
Accuracy: The models might hallucinate when interpreting handwritten text in PDF documents.
