Supported input files and requirements for the Vertex AI Gemini API

When calling the Vertex AI Gemini API from your app using a Vertex AI in Firebase SDK, you can prompt the Gemini model to generate text based on a multimodal input. Multimodal prompts can include multiple modalities (or types of input), like text along with images, PDFs, video, and audio.

For the non-text parts of the input (like media files), you need to use supported file types, specify a supported MIME type, and make sure that your files and multimodal requests meet the requirements and follow best practices.

This page describes the supported MIME types, best practices, and limitations for the following input types: images, video, audio, and documents (like PDFs).

Requirements specific to the Vertex AI in Firebase SDKs

For Vertex AI in Firebase SDKs, the maximum total request size is 20 MB. You get an HTTP 413 error if a request is too large.
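Because an oversized request fails only after it's sent, it can help to validate the payload size client-side first. The following is a minimal, SDK-agnostic Python sketch (the function name and structure are illustrative, not part of any SDK):

```python
# Hypothetical pre-flight check: verify that the combined size of all
# request parts stays under the 20 MB total-request limit before sending.
MAX_REQUEST_BYTES = 20 * 1024 * 1024  # 20 MB

def check_request_size(text_prompt: str, media_parts: list[bytes]) -> int:
    """Return the total payload size in bytes; raise if it exceeds 20 MB."""
    total = len(text_prompt.encode("utf-8")) + sum(len(p) for p in media_parts)
    if total > MAX_REQUEST_BYTES:
        raise ValueError(
            f"Request is {total} bytes; exceeding 20 MB triggers an HTTP 413."
        )
    return total
```

Note that this checks the raw byte sizes of the parts; the actual wire size also includes request encoding overhead, so leave some headroom.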



Images: Requirements, best practices, and limitations

Images: Requirements

In this section, learn about the supported MIME types and limits per request for images.

Supported MIME types

Gemini multimodal models (Gemini 1.5 Flash, Gemini 1.5 Pro, and Gemini 1.0 Pro Vision) support the following image MIME types:

  • PNG - image/png
  • JPEG - image/jpeg
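When building a request, you usually supply the MIME type explicitly for each media part. This plain-Python sketch (no SDK calls; the function name is illustrative) maps a filename to one of the supported image MIME types and rejects anything else:

```python
import mimetypes

# Image MIME types supported by the Gemini models listed above.
SUPPORTED_IMAGE_MIME_TYPES = {"image/png", "image/jpeg"}

def image_mime_type(filename: str) -> str:
    """Guess the MIME type from the file extension and confirm it's supported."""
    guessed, _ = mimetypes.guess_type(filename)
    if guessed not in SUPPORTED_IMAGE_MIME_TYPES:
        raise ValueError(f"{filename}: unsupported image MIME type {guessed!r}")
    return guessed
```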

Limits per request

There isn't a specific limit to the number of pixels in an image. However, larger images are scaled down and padded to fit a maximum resolution of 3072 x 3072 while preserving their original aspect ratio.

Here's the maximum number of image files allowed in a prompt request:

  • Gemini 1.0 Pro Vision: 16 images
  • Gemini 1.5 Flash and Gemini 1.5 Pro: 3000 images

Images: Tokenization

Here's how tokens are calculated for images:

  • Gemini 1.0 Pro Vision: Each image accounts for 258 tokens.
  • Gemini 1.5 Flash and Gemini 1.5 Pro:
    • If both dimensions of an image are less than or equal to 384 pixels, then 258 tokens are used.
    • If one dimension of an image is greater than 384 pixels, then the image is cropped into tiles. Each tile size defaults to the smallest dimension (width or height) divided by 1.5. If necessary, each tile is adjusted so that it's not smaller than 256 pixels and not greater than 768 pixels. Each tile is then resized to 768x768 and uses 258 tokens.
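The tiling rule above can be turned into a rough token estimator. In this sketch, the clamping of the tile size follows the description exactly, but the tile-count formula (ceiling division along each axis) is an assumption, since the documentation doesn't spell it out:

```python
import math

def image_tokens(width: int, height: int) -> int:
    """Estimate the token count for one image under Gemini 1.5 Flash/Pro.

    Images with both dimensions <= 384 px cost a flat 258 tokens; larger
    images are tiled, and each tile costs 258 tokens. The tile-count
    formula here is assumed, not documented."""
    if width <= 384 and height <= 384:
        return 258
    tile = min(width, height) / 1.5       # default tile size
    tile = min(max(tile, 256), 768)       # clamp to [256, 768] pixels
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return tiles * 258                    # each tile is resized to 768x768
```

Treat the result as an estimate for budgeting, not an exact billing figure.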

Images: Best practices

When using images, use the following best practices and information for the best results:

  • If you want to detect text in an image, prompts with a single image produce better results than prompts with multiple images.
  • If your prompt contains a single image, place the image before the text prompt in your request.
  • If your prompt contains multiple images, and you want to refer to them later in your prompt or have the model refer to them in the model response, it can help to give each image an index before the image. Use a, b, c or image 1, image 2, image 3 as your index. The following is an example of using indexed images in a prompt:
    image 1 
    image 2 
    image 3 
    
    Write a blogpost about my day using image 1 and image 2. Then, give me ideas
    for tomorrow based on image 3.
  • Use images with higher resolution; they yield better results.
  • Include a few examples in the prompt.
  • Rotate images to their proper orientation before adding them to the prompt.
  • Avoid blurry images.

Images: Limitations

While Gemini multimodal models are powerful in many multimodal use cases, it's important to understand the limitations of the models:

  • Content moderation: The models refuse to provide answers on images that violate our safety policies.
  • Spatial reasoning: The models aren't precise at locating text or objects in images. They might only return approximate counts of objects.
  • Medical uses: The models aren't suitable for interpreting medical images (for example, x-rays and CT scans) or providing medical advice.
  • People recognition: The models aren't meant to be used to identify people who aren't celebrities in images.
  • Accuracy: The models might hallucinate or make mistakes when interpreting low-quality, rotated, or extremely low-resolution images. The models might also hallucinate when interpreting handwritten text in images.



Video: Requirements, best practices, and limitations

Video: Requirements

In this section, learn about the supported MIME types and limits per request for video.

Supported MIME types

Gemini multimodal models (Gemini 1.5 Flash, Gemini 1.5 Pro, and Gemini 1.0 Pro Vision) support the following video MIME types:

  • FLV - video/x-flv
  • MOV - video/mov
  • MPEG - video/mpeg
  • MPEGPS - video/mpegps
  • MPG - video/mpg
  • MP4 - video/mp4
  • WEBM - video/webm
  • WMV - video/wmv
  • 3GPP - video/3gpp

Limits per request

Here's the maximum number of video files allowed in a prompt request:

  • Gemini 1.0 Pro Vision: 1 video file
  • Gemini 1.5 Flash and Gemini 1.5 Pro: 10 video files

Video: Tokenization

Here's how tokens are calculated for video:

  • All Gemini multimodal models: Videos are sampled at 1 frame per second (fps). Each video frame accounts for 258 tokens.
  • Gemini 1.5 Flash and Gemini 1.5 Pro: The audio track is encoded with the video frames. The audio track is also broken into 1-second chunks that each account for 32 tokens. The video frame and audio tokens are interleaved with their timestamps, and each timestamp is represented as 7 tokens.
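The per-second figures above can be combined into a rough video token estimate. In this sketch, "one timestamp per second" is an assumption about how the interleaved timestamps accrue, so treat the audio-inclusive number as approximate:

```python
def video_tokens(duration_seconds: int, include_audio: bool = True) -> int:
    """Rough token estimate for a video under Gemini 1.5 Flash/Pro.

    Videos are sampled at 1 frame/second at 258 tokens per frame. With
    audio, each 1-second chunk adds 32 tokens, plus 7 tokens per
    timestamp (one per second is assumed here, not documented)."""
    frame_tokens = duration_seconds * 258
    if not include_audio:
        return frame_tokens
    audio_tokens = duration_seconds * 32
    timestamp_tokens = duration_seconds * 7   # one timestamp/second (assumed)
    return frame_tokens + audio_tokens + timestamp_tokens
```

For example, a 10-second clip without audio comes out to 2,580 tokens from frames alone.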

Video: Best practices

When using video, use the following best practices and information for the best results:

  • If your prompt contains a single video, place the video before the text prompt.
  • If you need timestamp localization in a video with audio, ask the model to generate timestamps in the MM:SS format where the first two digits represent minutes and the last two digits represent seconds. Use the same format for questions that ask about a timestamp.
  • Note the following if you're using Gemini 1.0 Pro Vision:

    • Use no more than one video per prompt.
    • The model only processes the information in the first two minutes of the video.
    • The model processes videos as non-contiguous image frames from the video. Audio isn't included. If you notice the model missing some content from the video, try making the video shorter so that the model captures a greater portion of the video content.
    • The model does not process any audio information or timestamp metadata. Because of this, the model might not perform well in use cases that require audio input, such as captioning audio, or time-related information, such as speed or rhythm.
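The MM:SS timestamp format recommended above for localization prompts is easy to generate client-side when composing questions about specific offsets. A minimal Python sketch (the function name is illustrative):

```python
def mmss(seconds: int) -> str:
    """Format an offset in seconds as MM:SS (two digits for minutes,
    two for seconds), the timestamp format to use in video prompts."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"
```

You might use this to build a prompt like "What happens at " + mmss(75) + "?" so the question matches the format the model is asked to emit.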

Video: Limitations

While Gemini multimodal models are powerful in many multimodal use cases, it's important to understand the limitations of the models:

  • Content moderation: The models refuse to provide answers on videos that violate our safety policies.
  • Non-speech sound recognition: The models that support audio might make mistakes recognizing sound that's not speech.
  • High-speed motion: The models might make mistakes understanding high-speed motion in video due to the fixed 1 frame per second (fps) sampling rate.
  • Transcription punctuation: (if using Gemini 1.5 Flash) The models might return transcriptions that don't include punctuation.



Audio: Requirements and limitations

Audio: Requirements

In this section, learn about the supported MIME types and limits per request for audio.

Supported MIME types

Gemini multimodal models (Gemini 1.5 Flash and Gemini 1.5 Pro) support the following audio MIME types:

  • AAC - audio/aac
  • FLAC - audio/flac
  • MP3 - audio/mp3
  • MPA - audio/m4a
  • MPEG - audio/mpeg
  • MPGA - audio/mpga
  • MP4 - audio/mp4
  • OPUS - audio/opus
  • PCM - audio/pcm
  • WAV - audio/wav
  • WEBM - audio/webm

Limits per request

You can include a maximum of 1 audio file in a prompt request.

Audio: Limitations

While Gemini multimodal models are powerful in many multimodal use cases, it's important to understand the limitations of the models:

  • Non-speech sound recognition: The models that support audio might make mistakes recognizing sound that's not speech.
  • Audio-only timestamps: The models that support audio can't accurately generate timestamps for requests with audio files. This includes segmentation and temporal localization timestamps. Timestamps can be generated accurately for input that includes a video that contains audio.
  • Transcription punctuation: (if using Gemini 1.5 Flash) The models might return transcriptions that don't include punctuation.



Documents (like PDFs): Requirements, best practices, and limitations

Documents: Requirements

In this section, learn about the supported MIME types and limits per request for documents (like PDFs).

Supported MIME types

Gemini multimodal models (Gemini 1.5 Flash, Gemini 1.5 Pro, and Gemini 1.0 Pro Vision) support the following document MIME type:

  • PDF - application/pdf

Limits per request

PDFs are treated as images, so a single page of a PDF is treated as one image. The number of pages allowed in a prompt is limited to the number of images the model can support:

  • Gemini 1.0 Pro Vision: 16 pages
  • Gemini 1.5 Pro and Gemini 1.5 Flash: 1000 pages

Documents: Tokenization

PDFs are treated as images, so each page of a PDF is tokenized in the same way as an image.

Also, the cost for PDFs follows Gemini image pricing. For example, if you include a two-page PDF in a Gemini API call, you're charged the input fee for processing two images.
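Since PDF pages count against the model's image limit, a simple client-side check can catch over-long documents before sending. This sketch uses illustrative model-name strings as dictionary keys, which are assumptions, not official identifiers:

```python
# Per-model PDF page limits (each page counts as one image).
# The model-name keys below are illustrative, not official identifiers.
PDF_PAGE_LIMITS = {
    "gemini-1.0-pro-vision": 16,
    "gemini-1.5-pro": 1000,
    "gemini-1.5-flash": 1000,
}

def check_pdf_pages(model: str, page_count: int) -> None:
    """Raise if a PDF exceeds the model's page (image) limit."""
    limit = PDF_PAGE_LIMITS[model]
    if page_count > limit:
        raise ValueError(
            f"{model} accepts at most {limit} PDF pages; got {page_count}"
        )
```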

Documents: Best practices

When using PDFs, use the following best practices and information for the best results:

  • If your prompt contains a single PDF, place the PDF before the text prompt in your request.
  • If you have a long document, consider splitting it into multiple PDFs to process it.
  • Use PDFs whose text is rendered as actual text rather than scanned images of text. Machine-readable text is easier for the model to search and interpret than text embedded in scanned images, and it yields the best results for text-heavy documents like contracts.

Documents: Limitations

While Gemini multimodal models are powerful in many multimodal use cases, it's important to understand the limitations of the models:

  • Spatial reasoning: The models aren't precise at locating text or objects in PDFs. They might only return approximate counts of objects.
  • Accuracy: The models might hallucinate when interpreting handwritten text in PDF documents.