Generating content with AI models

At the heart of generative AI are AI models. Currently, the two most prominent examples of generative models are large language models (LLMs) and image generation models. These models take input, called a prompt (most commonly text, an image, or a combination of both), and from it produce as output text, an image, or even audio or video.

The output of these models can be surprisingly convincing: LLMs generate text that appears as though it could have been written by a human being, and image generation models can produce images that are very close to real photographs or artwork created by humans.

In addition, LLMs have proven capable of tasks beyond simple text generation:

  • Writing computer programs
  • Planning subtasks that are required to complete a larger task
  • Organizing unorganized data
  • Understanding and extracting information data from a corpus of text
  • Following and performing automated activities based on a text description of the activity

There are many models available to you, from several different providers. Each model has its own strengths and weaknesses and one model might excel at one task but perform less well at others. Apps making use of generative AI can often benefit from using multiple different models depending on the task at hand.

As an app developer, you typically don't interact with generative AI models directly, but rather through services available as web APIs. Although these services often have similar functionality, they all provide them through different and incompatible APIs. If you want to make use of multiple model services, you have to use each of their proprietary SDKs, potentially incompatible with each other. And if you want to upgrade from one model to the newest and most capable one, you might have to build that integration all over again.

Genkit addresses this challenge by providing a single interface that abstracts away the details of accessing potentially any generative AI model service, with several pre-built implementations already available. Building your AI-powered app around Genkit simplifies the process of making your first generative AI call and makes it equally easy to combine multiple models or swap one model for another as new models emerge.

Before you begin

If you want to run the code examples on this page, first complete the steps in the Getting started guide. All of the examples assume that you have already installed Genkit as a dependency in your project.

Models supported by Genkit

Genkit is designed to be flexible enough to use potentially any generative AI model service. Its core libraries define the common interface for working with models, and model plugins define the implementation details for working with a specific model and its API.

The Genkit team maintains plugins for working with models provided by Vertex AI, Google Generative AI, and Ollama:

  • Gemini family of LLMs, through the Google Cloud Vertex AI plugin
  • Gemini family of LLMs, through the Google AI plugin
  • Imagen2 and Imagen3 image generation models, through Google Cloud Vertex AI
  • Anthropic's Claude 3 family of LLMs, through Google Cloud Vertex AI's model garden
  • Gemma 2, Llama 3, and many more open models, through the Ollama plugin (you must host the Ollama server yourself)

In addition, there are also several community-supported plugins that provide interfaces to these models:

You can discover more by searching for packages tagged with genkit-model on npmjs.org.

Loading and configuring model plugins

Before you can use Genkit to start generating content, you need to load and configure a model plugin. If you're coming from the Getting Started guide, you've already done this. Otherwise, see the Getting Started guide or the individual plugin's documentation and follow the steps there before continuing.

The generate() function

In Genkit, the primary interface through which you interact with generative AI models is the generate() function.

The simplest generate() call specifies the model you want to use and a text prompt:

import { generate } from '@genkit-ai/ai';
import { configureGenkit } from '@genkit-ai/core';
import { gemini15Flash } from '@genkit-ai/googleai';

configureGenkit(/* ... */);

(async () => {
  const llmResponse = await generate({
    model: gemini15Flash,
    prompt: 'Invent a menu item for a pirate themed restaurant.',
  });

  console.log(await llmResponse.text());
})();

When you run this brief example, it will print out some debugging information followed by the output of the generate() call, which will usually be Markdown text as in the following example:

## The Blackheart's Bounty

**A hearty stew of slow-cooked beef, spiced with rum and molasses, served in a
hollowed-out cannonball with a side of crusty bread and a dollop of tangy
pineapple salsa.**

**Description:** This dish is a tribute to the hearty meals enjoyed by pirates
on the high seas. The beef is tender and flavorful, infused with the warm spices
of rum and molasses. The pineapple salsa adds a touch of sweetness and acidity,
balancing the richness of the stew. The cannonball serving vessel adds a fun and
thematic touch, making this dish a perfect choice for any pirate-themed
adventure.

Run the script again and you'll get a different output.

The preceding code sample specified the model using a model reference exported by the model plugin. You can also specify the model using a string identifier:

const llmResponse = await generate({
  model: 'googleai/gemini-1.5-flash-latest',
  prompt: 'Invent a menu item for a pirate themed restaurant.',
});

A model string identifier looks like providerid/modelid, where the provider ID (in this case, googleai) identifies the plugin, and the model ID is a plugin-specific string identifier for a specific version of a model.

Some model plugins, such as the Ollama plugin, provide access to potentially dozens of different models and therefore do not export individual model references. In these cases, you can only specify a model to generate() using its string identifier:

const llmResponse = await generate({
  model: 'ollama/gemma2',
  prompt: 'Invent a menu item for a pirate themed restaurant.',
});

All of the preceding examples also illustrate an important point: when you use generate() to make generative AI model calls, changing the model you want to use is simply a matter of passing a different value to the model parameter. By using generate() instead of the native model SDKs, you give yourself the flexibility to more easily use several different models in your app and change models in the future.

So far you have only seen examples of the simplest generate() calls. However, generate() also provides an interface for more advanced interactions with generative models, which you will see in the sections that follow.

Model parameters

The generate() function takes a config parameter, through which you can specify optional settings that control how the model generates content:

const llmResponse = await generate({
  prompt: "Suggest an item for the menu of a pirate themed restaurant",
  model: gemini15Flash,
  config: {
    maxOutputTokens: 400,
    stopSequences: ["<end>", "<fin>"],
    temperature: 1.2,
    topP: 0.4,
    topK: 50,
  },
});

The exact parameters that are supported depend on the individual model and model API. However, the parameters in the previous example are common to almost every model. The following is an explanation of these parameters:

Parameters that control output length

maxOutputTokens

LLMs operate on units called tokens. A token usually, but does not necessarily, map to a specific sequence of characters. When you pass a prompt to a model, one of the first steps it takes is to tokenize your prompt string into a sequence of tokens. Then, the LLM generates a sequence of tokens from the tokenized input. Finally, the sequence of tokens gets converted back into text, which is your output.

The maximum output tokens parameter simply sets a limit on how many tokens to generate using the LLM. Every model potentially uses a different tokenizer, but a good rule of thumb is to consider a single English word to be made of 2 to 4 tokens.

As stated earlier, some tokens might not map to character sequences. One such example is that there is often a token that indicates the end of the sequence: when an LLM generates this token, it stops generating more. Therefore, it's possible and often the case that an LLM generates fewer tokens than the maximum because it generated the "stop" token.

stopSequences

You can use this parameter to set the tokens or token sequences that, when generated, indicate the end of LLM output. The correct values to use here generally depend on how the model was trained, and are usually set by the model plugin. However, if you have prompted the model to generate another stop sequence, you might specify it here.

Note that you are specifying character sequences, and not tokens per se. In most cases, you will specify a character sequence that the model's tokenizer maps to a single token.

Parameters that control "creativity"

The temperature, top-p, and top-k parameters together control how "creative" you want the model to be. Below are very brief explanations of what these parameters mean, but the more important point to take away is this: these parameters are used to adjust the character of an LLM's output. The optimal values for them depend on your goals and preferences, and are likely to be found only through experimentation.

temperature

LLMs are fundamentally token-predicting machines. For a given sequence of tokens (such as the prompt) an LLM predicts, for each token in its vocabulary, the likelihood that the token comes next in the sequence. The temperature is a scaling factor by which these predictions are divided before being normalized to a probability between 0 and 1.

Low temperature values—between 0.0 and 1.0—amplify the difference in likelihoods between tokens, with the result that the model will be even less likely to produce a token it already evaluated to be unlikely. This is often perceived as output that is less creative. Although 0.0 is technically not a valid value, many models treat it as indicating that the model should behave deterministically, and to only consider the single most likely token.

High temperature values—those greater than 1.0—compress the differences in likelihoods between tokens, with the result that the model becomes more likely to produce tokens it had previously evaluated to be unlikely. This is often perceived as output that is more creative. Some model APIs impose a maximum temperature, often 2.0.

topP

Top-p is a value between 0.0 and 1.0 that controls the number of possible tokens you want the model to consider, by specifying the cumulative probability of the tokens. For example, a value of 1.0 means to consider every possible token (but still take into account the probability of each token). A value of 0.4 means to only consider the most likely tokens, whose probabilities add up to 0.4, and to exclude the remaining tokens from consideration.

topK

Top-k is an integer value that also controls the number of possible tokens you want the model to consider, but this time by explicitly specifying the maximum number of tokens. Specifying a value of 1 means that the model should behave deterministically.

Experiment with model parameters

You can experiment with the effect of these parameters on the output generated by different model and prompt combinations by using the Developer UI. Start the developer UI with the genkit start command and it will automatically load all of the models defined by the plugins configured in your project. You can quickly try different prompts and configuration values without having to repeatedly make these changes in code.

Structured output

When using generative AI as a component in your application, you often want output in a format other than plain text. Even if you're just generating content to display to the user, you can benefit from structured output simply for the purpose of presenting it more attractively to the user. But for more advanced applications of generative AI, such as programmatic use of the model's output, or feeding the output of one model into another, structured output is a must.

In Genkit, you can request structured output from a model by specifying a schema when you call generate():

import { z } from "zod";
const MenuItemSchema = z.object({
  name: z.string(),
  description: z.string(),
  calories: z.number(),
  allergens: z.array(z.string()),
});

const llmResponse = await generate({
  prompt: "Suggest an item for the menu of a pirate themed restaurant",
  model: gemini15Flash,
  output: {
    schema: MenuItemSchema,
  },
});

Model output schemas are specified using the Zod library. In addition to a schema definition language, Zod also provides runtime type checking, which bridges the gap between static TypeScript types and the unpredictable output of generative AI models. Zod lets you write code that can rely on the fact that a successful generate call will always return output that conforms to your TypeScript types.

When you specify a schema in generate(), Genkit does several things behind the scenes:

  • Augments the prompt with additional guidance about the desired output format. This also has the side effect of specifying to the model what content exactly you want to generate (for example, not only suggest a menu item but also generate a description, a list of allergens, and so on).
  • Parses the model output into a JavaScript object.
  • Verifies that the output conforms with the schema.

To get structured output from a successful generate call, use the response object's output() method:

type MenuItem = z.infer<typeof MenuItemSchema>;

const output: MenuItem | null = llmResponse.output();

Handling errors

Note in the prior example that the output method can return null. This can happen when the model fails to generate output that conforms to the schema. You can also detect this condition by catching the NoValidCandidatesError exception thrown by generate:

import { NoValidCandidatesError } from "@genkit-ai/ai";
try {
  llmResponse = await generate(/* ... */);
} catch (e) {
  if (e instanceof NoValidCandidatesError) {
    // Output doesn't conform to schema.
  }
}

The best strategy for dealing with such errors will depend on your exact use case, but here are some general hints:

  • Try a different model. For structured output to succeed, the model must be capable of generating output in JSON. The most powerful LLMs, like Gemini and Claude, are versatile enough to do this; however, smaller models, such as some of the local models you would use with Ollama, might not be able to generate structured output reliably unless they have been specifically trained to do so.

  • Make use of Zod's coercion abilities: You can specify in your schemas that Zod should try to coerce non-conforming types into the type specified by the schema. If your schema includes primitive types other than strings, using Zod coercion can reduce the number of generate() failures you experience. The following version of MenuItemSchema uses type conversion to automatically correct situations where the model generates calorie information as a string instead of a number:

    const MenuItemSchema = z.object({
      name: z.string(),
      description: z.string(),
      calories: z.coerce.number(),
      allergens: z.array(z.string()),
    });
    
  • Retry the generate() call. If the model you've chosen only rarely fails to generate conformant output, you can treat the error as you would treat a network error, and simply retry the request using some kind of incremental back-off strategy.

Streaming

When generating large amounts of text, you can improve the experience for your users by presenting the output as it's generated—streaming the output. A familiar example of streaming in action can be seen in most LLM chat apps: users can read the model's response to their messages as it's being generated, which improves the perceived responsiveness of the application and enhances the illusion of chatting with an intelligent counterpart.

In Genkit, you can stream output using the generateStream() function. Its syntax is similar to the generate() function:

import { generateStream } from "@genkit-ai/ai";
import { GenerateResponseChunk } from "@genkit-ai/ai/lib/generate";
const llmResponseStream = await generateStream({
  prompt: 'Suggest a complete menu for a pirate themed restaurant',
  model: gemini15Flash,
});

However, this function returns an asynchronous iterable of response chunks. Handle each of these chunks as they become available:

for await (const responseChunkData of llmResponseStream.stream()) {
  const responseChunk = responseChunkData as GenerateResponseChunk;
  console.log(responseChunk.text());
}

You can still get the entire response at once:

const llmResponse = await llmResponseStream.response();

Streaming also works with structured output:

const MenuSchema = z.object({
  starters: z.array(MenuItemSchema),
  mains: z.array(MenuItemSchema),
  desserts: z.array(MenuItemSchema),
});
type Menu = z.infer<typeof MenuSchema>;

const llmResponseStream = await generateStream({
  prompt: "Suggest a complete menu for a pirate themed restaurant",
  model: gemini15Flash,
  output: { schema: MenuSchema },
});

for await (const responseChunkData of llmResponseStream.stream()) {
  const responseChunk = responseChunkData as GenerateResponseChunk<Menu>;
  // output() returns an object representing the entire output so far
  const output: Menu | null = responseChunk.output();
  console.log(output);
}

Streaming structured output works a little differently from streaming text. When you call the output() method of a response chunk, you get an object constructed from the accumulation of the chunks that have been produced so far, rather than an object representing a single chunk (which might not be valid on its own). Every chunk of structured output in a sense supersedes the chunk that came before it.

For example, here are the first five outputs from the prior example:

null
{ starters: [ {} ] }
{
  starters: [ { name: "Captain's Treasure Chest", description: 'A' } ]
}
{
  starters: [
    {
      name: "Captain's Treasure Chest",
      description: 'A mix of spiced nuts, olives, and marinated cheese served in a treasure chest.',
      calories: 350
    }
  ]
}
{
  starters: [
    {
      name: "Captain's Treasure Chest",
      description: 'A mix of spiced nuts, olives, and marinated cheese served in a treasure chest.',
      calories: 350,
      allergens: [Array]
    },
    { name: 'Shipwreck Salad', description: 'Fresh' }
  ]
}

Multimodal input

The examples you've seen so far have used text strings as model prompts. While this remains the most common way to prompt generative AI models, many models can also accept other media as prompts. Media prompts are most often used in conjunction with text prompts that instruct the model to perform some operation on the media, such as to caption an image or transcribe an audio recording.

The ability to accept media input and the types of media you can use are completely dependent on the model and its API. For example, the Gemini 1.5 series of models can accept images, video, and audio as prompts.

To provide a media prompt to a model that supports it, instead of passing a simple text prompt to generate, pass an array consisting of a media part and a text part:

const llmResponse = await generate({
  prompt: [
    { media: { url: 'https://example.com/photo.jpg' } },
    { text: 'Compose a poem about this image.' },
  ],
  model: gemini15Flash,
});

In the above example, you specified an image using a publicly-accessible HTTPS URL. You can also pass media data directly by encoding it as a data URL. For example:

import { readFile } from 'node:fs/promises';
const b64Data = await readFile('output.png', { encoding: 'base64url' });
const dataUrl = `data:image/png;base64,${b64Data}`;

const llmResponse = await generate({
  prompt: [
    { media: { url: dataUrl } },
    { text: 'Compose a poem about this image.' },
  ],
  model: gemini15Flash,
});

All models that support media input support both data URLs and HTTPS URLs. Some model plugins add support for other media sources. For example, the Vertex AI plugin also lets you use Cloud Storage (gs://) URLs.

Generating media

So far, most of the examples on this page have dealt with generating text using LLMs. However, Genkit can also be used with image generation models. Using generate() with an image generation model is similar to using an LLM. For example, to generate an image using the Imagen2 model through Vertex AI:

  1. Genkit uses data: URLs as the standard output format for generated media. This is a standard format with many libraries available to handle them. This example uses the data-urls package from jsdom:

    npm i data-urls
    npm i --save-dev @types/data-urls
  2. To generate an image and save it to a file, call generate(), specifying an image generation model and the media type of output format:

    import { generate } from '@genkit-ai/ai';
    import { configureGenkit } from '@genkit-ai/core';
    import { vertexAI, imagen2 } from '@genkit-ai/vertexai';
    import parseDataURL from 'data-urls';
    
    import { writeFile } from 'node:fs/promises';
    
    configureGenkit({
      plugins: [vertexAI({ location: 'us-central1' })],
    });
    
    (async () => {
      const mediaResponse = await generate({
        model: imagen2,
        prompt: 'photo of a meal fit for a pirate',
        output: { format: 'media' },
      });
    
      const media = mediaResponse.media();
      if (media === null) throw new Error('No media generated.');
    
      const data = parseDataURL(media.url);
      if (data === null) throw new Error('Invalid data: URL.');
    
      await writeFile(`output.${data.mimeType.subtype}`, data.body);
    })();
    

Recording message history

Many of your users will have interacted with large language models for the first time through chatbots. Although LLMs are capable of much more than simulating conversations, it remains a familiar and useful style of interaction. Even when your users will not be interacting directly with the model in this way, the conversational style of prompting is a powerful way to influence the output generated by an AI model.

To generate message history from a model response, call the toHistory() method:

let response = await generate({
  model: gemini15Flash,
  prompt: "How do you say 'dog' in French?",
});
let history = response.toHistory();

You can serialize this history and persist it in a database or session storage. Then, pass the history along with the prompt on future calls to generate():

response = await generate({
  model: gemini15Flash,
  prompt: 'How about in Spanish?',
  history,
});
history = response.toHistory();

If the model you're using supports the system role, you can use the initial history to set the system message:

import { MessageData } from "@genkit-ai/ai/model";
let history: MessageData[] = [
  { role: 'system', content: [{ text: 'Talk like a pirate.' }] },
];
let response = await generate({
  model: gemini15Flash,
  prompt: "How do you say 'dog' in French?",
  history,
});

Next steps

Learn more about Genkit

  • As an app developer, the primary way you influence the output of generative AI models is through prompting. Read Prompt management to learn how Genkit helps you develop effective prompts and manage them in your codebase.
  • Although generate() is the nucleus of every generative AI powered application, real-world applications usually require additional work before and after invoking a generative AI model. To reflect this, Genkit introduces the concept of flows, which are defined like functions but add additional features such as observability and simplified deployment. To learn more, see Defining workflows.

Advanced LLM use

  • One way to enhance the capabilities of LLMs is to prompt them with a list of ways they can request more information from you, or request you to perform some action. This is known as tool calling or function calling. Models that are trained to support this capability can respond to a prompt with a specially-formatted response, which indicates to the calling application that it should perform some action and send the result back to the LLM along with the original prompt. Genkit has library functions that automate both the prompt generation and the call-response loop elements of a tool calling implementation. See Tool calling to learn more.
  • Retrieval-augmented generation (RAG) is a technique used to introduce domain-specific information into a model's output. This is accomplished by inserting relevant information into a prompt before passing it on to the language model. A complete RAG implementation requires you to bring several technologies together: text embedding generation models, vector databases, and large language models. See Retrieval-augmented generation (RAG) to learn how Genkit simplifies the process of coordinating these various elements.

Testing model output

As a software engineer, you're used to deterministic systems where the same input always produces the same output. However, with AI models being probabilistic, the output can vary based on subtle nuances in the input, the model's training data, and even randomness deliberately introduced by parameters like temperature.

Genkit's evaluators are structured ways to assess the quality of your LLM's responses, using a variety of strategies. Read more on the Evaluation page.