Search with vector embeddings

This page shows you how to use Cloud Firestore to perform K-nearest neighbor (KNN) vector searches using these techniques:

  • Store vector values
  • Create and manage KNN vector indexes
  • Make a K-nearest-neighbor (KNN) query using one of the supported vector distance functions

Store vector embeddings

You can create vector values such as text embeddings from your Cloud Firestore data, and store them in Cloud Firestore documents.

Write operation with a vector embedding

The following example shows how to store a vector embedding in a Cloud Firestore document:

Python
from google.cloud import firestore
from google.cloud.firestore_v1.vector import Vector

# Create a client using the default project and credentials.
firestore_client = firestore.Client()
collection = firestore_client.collection("coffee-beans")
doc = {
  "name": "Kahawa coffee beans",
  "description": "Information about the Kahawa coffee beans.",
  "embedding_field": Vector([1.0, 2.0, 3.0]),
}

collection.add(doc)
    
Node.js
import {
  Firestore,
  FieldValue,
} from "@google-cloud/firestore";

const db = new Firestore();
const coll = db.collection('coffee-beans');
await coll.add({
  name: "Kahawa coffee beans",
  description: "Information about the Kahawa coffee beans.",
  embedding_field: FieldValue.vector([1.0, 2.0, 3.0])
});
    

Compute vector embeddings with a Cloud Function

To calculate and store vector embeddings whenever a document is updated or created, you can set up a Cloud Function:

Python
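import functions_framework
from google.events.cloud import firestore

# firestore_client, from_payload(), and calculate_embedding() are assumed
# to be defined elsewhere in your function's source.

@functions_framework.cloud_event
def store_embedding(cloud_event) -> None:
  """Triggered by a change to a Firestore document."""
  # ParseFromString() fills the message in place, so keep using
  # firestore_payload rather than the call's return value.
  firestore_payload = firestore.DocumentEventData()
  firestore_payload._pb.ParseFromString(cloud_event.data)

  collection_id, doc_id = from_payload(firestore_payload)
  # Call a function to calculate the embedding
  embedding = calculate_embedding(firestore_payload)
  # Update the document
  doc = firestore_client.collection(collection_id).document(doc_id)
  doc.set({"embedding_field": embedding}, merge=True)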
    
Node.js
/**
 * A vector embedding will be computed from the
 * value of the `content` field. The vector value
 * will be stored in the `embedding` field. The
 * field names `content` and `embedding` are arbitrary
 * field names chosen for this example.
 */
async function storeEmbedding(event: FirestoreEvent<any>): Promise<void> {
  // Get the previous value of the document's `content` field.
  const previousDocumentSnapshot = event.data.before as QueryDocumentSnapshot;
  const previousContent = previousDocumentSnapshot.get("content");

  // Get the current value of the document's `content` field.
  const currentDocumentSnapshot = event.data.after as QueryDocumentSnapshot;
  const currentContent = currentDocumentSnapshot.get("content");

  // Don't update the embedding if the content field did not change
  if (previousContent === currentContent) {
    return;
  }

  // Call a function to calculate the embedding for the value
  // of the `content` field.
  const embeddingVector = calculateEmbedding(currentContent);

  // Update the `embedding` field on the document.
  await currentDocumentSnapshot.ref.update({
    embedding: embeddingVector,
  });
}
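
The Node.js handler above skips the write when the `content` field is unchanged. This matters because writing the embedding back changes the document, which can trigger the function again. A minimal Python sketch of the same guard follows. It works on plain dictionaries of field values; the function name `needs_new_embedding` and the default field name `content` are assumptions made for this illustration, and extracting the before and after values from the event payload is left out.

```python
from typing import Any, Mapping, Optional


def needs_new_embedding(
    before: Optional[Mapping[str, Any]],
    after: Optional[Mapping[str, Any]],
    source_field: str = "content",
) -> bool:
    """Return True when the embedding should be recomputed.

    `before` and `after` hold the document's field values before and
    after the change; None means the document did not exist.
    """
    if after is None:
        # The document was deleted: nothing to embed.
        return False
    if source_field not in after:
        # No source text to embed.
        return False
    if before is None:
        # The document was just created.
        return True
    # Recompute only when the source text actually changed.
    return before.get(source_field) != after[source_field]


# A write that only touches the embedding does not ask for another update.
print(needs_new_embedding(
    {"content": "Kahawa"},
    {"content": "Kahawa", "embedding": [1.0, 2.0, 3.0]},
))  # prints False
```

Calling a check like this at the top of the handler, before computing the embedding, keeps the function from doing repeated work on its own writes.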
    

Create and manage vector indexes

Before you can perform a nearest neighbor search with your vector embeddings, you must create a corresponding index. The following examples demonstrate how to create and manage vector indexes.

Create a single-field vector index

To create a single-field vector index, use gcloud alpha firestore indexes composite create:

gcloud
gcloud alpha firestore indexes composite create \
--collection-group=collection-group \
--query-scope=COLLECTION \
--field-config field-path=vector-field,vector-config='vector-configuration' \
--database=database-id
    

where:

  • collection-group is the ID of the collection group.
  • vector-field is the name of the field that contains the vector embedding.
  • database-id is the ID of the database.
  • vector-configuration includes the vector dimension and index type. The dimension is an integer up to 2048. The index type must be flat. Format the index configuration as follows: {"dimension":"DIMENSION", "flat": "{}"}.
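
If you generate index commands from a script, it can help to build the vector-configuration string in code instead of typing it by hand. The following is a small sketch, not part of the gcloud CLI: the helper name `vector_config` is hypothetical, and the lower bound of 1 on the dimension is an assumption (the text above states only the upper bound of 2048).

```python
import json

MAX_DIMENSION = 2048  # upper bound stated in the text above


def vector_config(dimension: int) -> str:
    """Return the JSON string to pass as the vector-config value."""
    # Assumption: a dimension must be at least 1.
    if not 1 <= dimension <= MAX_DIMENSION:
        raise ValueError(f"dimension must be between 1 and {MAX_DIMENSION}")
    # The documented format quotes the dimension and passes "{}" as a
    # string for the flat index type.
    return json.dumps({"dimension": str(dimension), "flat": "{}"})


print(vector_config(1024))
# prints {"dimension": "1024", "flat": "{}"}
```

When you paste the result into a shell command, wrap it in single quotes as shown in the gcloud example above.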

Create a composite vector index

The following example creates a composite vector index for the color field and a vector embedding field.

gcloud
gcloud alpha firestore indexes composite create \
--collection-group=collection-group \
--query-scope=COLLECTION \
--field-config=order=ASCENDING,field-path="color" \
--field-config field-path=field,vector-config='{"dimension":"1024", "flat": "{}"}' \
--database=database-id
    

List all vector indexes

gcloud
gcloud alpha firestore indexes composite list --database=database-id

Replace database-id with the ID of the database.

Delete a vector index

gcloud
gcloud alpha firestore indexes composite delete index-id --database=database-id
    

where:

  • index-id is the ID of the index to delete. Use indexes composite list to retrieve the index ID.
  • database-id is the ID of the database.

Describe a vector index

gcloud
gcloud alpha firestore indexes composite describe index-id --database=database-id
    

where:

  • index-id is the ID of the index to describe. Use indexes composite list to retrieve the index ID.
  • database-id is the ID of the database.

Make a nearest-neighbor query

You can perform a similarity search to find the nearest neighbors of a vector embedding. Similarity searches require vector indexes. If an index doesn't exist, Cloud Firestore suggests an index to create using the gcloud CLI.
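
Conceptually, a nearest-neighbor query ranks stored vectors by their distance to the query vector and returns the closest ones, up to the limit. The following brute-force sketch illustrates that idea in memory with Euclidean distance. It is not how Cloud Firestore executes the query, and the function name `brute_force_nearest` and the sample data are made up for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def brute_force_nearest(
    docs: Dict[str, Sequence[float]],
    query_vector: Sequence[float],
    limit: int,
) -> List[Tuple[str, float]]:
    """Return up to `limit` (doc_id, distance) pairs, closest first."""
    # Score every stored vector against the query vector.
    scored = [
        (doc_id, math.dist(vector, query_vector))
        for doc_id, vector in docs.items()
    ]
    # Smaller Euclidean distance means a closer neighbor.
    scored.sort(key=lambda pair: pair[1])
    return scored[:limit]


# Hypothetical documents keyed by ID, each with a 3-dimensional embedding.
docs = {
    "kahawa": [1.0, 2.0, 3.0],
    "arabica": [3.0, 1.0, 2.5],
    "robusta": [9.0, 9.0, 9.0],
}

for doc_id, distance in brute_force_nearest(docs, [3.0, 1.0, 2.0], limit=2):
    print(doc_id, round(distance, 3))
# prints:
# arabica 0.5
# kahawa 2.449
```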

Python
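from google.cloud import firestore
from google.cloud.firestore_v1.base_vector_query import DistanceMeasure
from google.cloud.firestore_v1.vector import Vector

firestore_client = firestore.Client()
collection = firestore_client.collection("coffee-beans")

# Requires a vector index
vector_query = collection.find_nearest(
   vector_field="embedding_field",
   query_vector=Vector([3.0, 1.0, 2.0]),
   distance_measure=DistanceMeasure.EUCLIDEAN,
   limit=5)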
    
Node.js
import {
  Firestore,
  FieldValue,
  VectorQuery,
  VectorQuerySnapshot,
} from "@google-cloud/firestore";

// Requires single-field vector index
const vectorQuery: VectorQuery = coll.findNearest('embedding_field', FieldValue.vector([3.0, 1.0, 2.0]), {
  limit: 5,
  distanceMeasure: 'EUCLIDEAN'
});

const vectorQuerySnapshot: VectorQuerySnapshot = await vectorQuery.get();
    

Vector distances

Nearest-neighbor queries support the following options for vector distance:

  • EUCLIDEAN: Measures the Euclidean distance between the vectors. To learn more, see Euclidean.
  • COSINE: Compares vectors based on the angle between them, which lets you measure similarity that isn't based on the vectors' magnitude. Instead of COSINE distance, we recommend using DOT_PRODUCT with unit-normalized vectors, which is mathematically equivalent and performs better. To learn more, see Cosine similarity.
  • DOT_PRODUCT: Similar to COSINE but is affected by the magnitude of the vectors. To learn more, see Dot product.
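
To see why DOT_PRODUCT on unit-normalized vectors can stand in for COSINE, the following pure-Python sketch computes both quantities for a pair of vectors. The helper names are chosen for this illustration and are not part of any Cloud Firestore client library.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a: Sequence[float]) -> float:
    """Euclidean length (magnitude) of a vector."""
    return math.sqrt(dot(a, a))


def normalize(a: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    length = norm(a)
    return [x / length for x in a]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (norm(a) * norm(b))


query = [3.0, 1.0, 2.0]
candidate = [1.0, 2.0, 3.0]
scaled_query = [2.0 * x for x in query]

# Cosine similarity ignores magnitude: scaling the query changes nothing.
print(round(cosine_similarity(query, candidate), 4))         # prints 0.7857
print(round(cosine_similarity(scaled_query, candidate), 4))  # prints 0.7857

# The raw dot product is affected by magnitude.
print(dot(query, candidate))         # prints 11.0
print(dot(scaled_query, candidate))  # prints 22.0

# After unit normalization, the dot product equals the cosine similarity.
print(round(dot(normalize(scaled_query), normalize(candidate)), 4))  # prints 0.7857
```

In practice this means you can normalize embeddings once, before storing them, and then query with DOT_PRODUCT.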

Pre-filter data

To pre-filter data before finding the nearest neighbors, you can combine a similarity search with other filters except inequality filters. The and and or composite filters are supported. For field filters, the following filters are supported:

  • == equal to
  • in
  • array_contains
  • array_contains_any

Python
# Similarity search with pre-filter
# Requires composite vector index
collection.where("color", "==", "red").find_nearest(
   vector_field="embedding_field",
   query_vector=Vector([3.0, 1.0, 2.0]),
   distance_measure=DistanceMeasure.EUCLIDEAN,
   limit=5)
    
Node.js
// Similarity search with pre-filter
// Requires composite vector index
const preFilteredVectorQuery: VectorQuery = coll
  .where("color", "==", "red")
  .findNearest("embedding_field", FieldValue.vector([3.0, 1.0, 2.0]), {
    limit: 5,
    distanceMeasure: "EUCLIDEAN",
  });

const vectorQueryResults: VectorQuerySnapshot = await preFilteredVectorQuery.get();
    

Limitations

As you work with vector embeddings, note the following limitations:

  • The maximum supported embedding dimension is 2048. To store larger embeddings, use dimensionality reduction.
  • The maximum number of documents to return from a nearest-neighbor query is 1000.
  • Vector search does not support real-time snapshot listeners.
  • You cannot use inequality filters to pre-filter data.
  • Only the Python and Node.js client libraries support vector search.
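
You can catch the first two limits on the client before sending a query. The sketch below is a hypothetical helper, not part of the client libraries; the constants come from the list above, and the lower bounds of 1 are assumptions.

```python
from typing import Sequence

MAX_EMBEDDING_DIMENSION = 2048  # from the limitations listed above
MAX_QUERY_LIMIT = 1000          # from the limitations listed above


def check_vector_query(query_vector: Sequence[float], limit: int) -> None:
    """Raise ValueError if the query would exceed a documented limit."""
    dimension = len(query_vector)
    # Assumption: a vector needs at least one component.
    if not 1 <= dimension <= MAX_EMBEDDING_DIMENSION:
        raise ValueError(
            f"embedding dimension {dimension} is outside 1..{MAX_EMBEDDING_DIMENSION}"
        )
    # Assumption: a query must ask for at least one document.
    if not 1 <= limit <= MAX_QUERY_LIMIT:
        raise ValueError(f"limit {limit} is outside 1..{MAX_QUERY_LIMIT}")


check_vector_query([3.0, 1.0, 2.0], limit=5)  # passes silently
```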

What's next