Evaluation is a form of testing that helps you validate your LLM's responses and ensure they meet your quality bar.
Genkit supports third-party evaluation tools through plugins, paired with powerful observability features that provide insight into the runtime state of your LLM-powered applications. Genkit tooling helps you automatically extract data including inputs, outputs, and information from intermediate steps to evaluate the end-to-end quality of LLM responses as well as understand the performance of your system's building blocks.
Types of evaluation
Genkit supports two types of evaluation:
Inference-based evaluation: This type of evaluation runs against a collection of pre-determined inputs, assessing the corresponding outputs for quality.
This is the most common evaluation type, suitable for most use cases. This approach tests a system's actual output for each evaluation run.
You can perform the quality assessment manually, by visually inspecting the results. Alternatively, you can automate the assessment by using an evaluation metric.
Raw evaluation: This type of evaluation directly assesses the quality of inputs without any inference. It is typically used with automated evaluation using metrics. All fields required for evaluation (e.g., input, context, output, and reference) must be present in the input dataset. This is useful when you have data coming from an external source (e.g., collected from your production traces) and you want an objective measurement of the quality of the collected data. For more information, see the Advanced use section of this page.
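As a rough illustration, a single entry in a raw dataset bundles these fields together, with no inference step involved. The values below are hypothetical; the exact shape produced by the Genkit CLI is shown in the Advanced use section:

[
  {
    "input": "What is the capital of France?",
    "context": ["Paris is the capital of France"],
    "output": "Paris",
    "reference": "Paris"
  }
]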
This section explains how to perform inference-based evaluation using Genkit.
Quick start
Perform these steps to get started quickly with Genkit.
Setup
- Use an existing Genkit app or create a new one by following our Get started guide.
- Add the following code to define a simple RAG application to evaluate. For this guide, we use a dummy retriever that always returns the same documents.

package main

import (
  "context"
  "fmt"
  "log"

  "github.com/firebase/genkit/go/ai"
  "github.com/firebase/genkit/go/genkit"
  "github.com/firebase/genkit/go/plugins/googlegenai"
)

func main() {
  ctx := context.Background()

  // Initialize Genkit
  g, err := genkit.Init(ctx,
    genkit.WithPlugins(&googlegenai.GoogleAI{}),
    genkit.WithDefaultModel("googleai/gemini-2.0-flash"),
  )
  if err != nil {
    log.Fatalf("Genkit initialization error: %v", err)
  }

  // Dummy retriever that always returns the same facts
  dummyRetrieverFunc := func(ctx context.Context, req *ai.RetrieverRequest) (*ai.RetrieverResponse, error) {
    facts := []string{
      "Dog is man's best friend",
      "Dogs have evolved and were domesticated from wolves",
    }
    // Just return facts as documents.
    var docs []*ai.Document
    for _, fact := range facts {
      docs = append(docs, ai.DocumentFromText(fact, nil))
    }
    return &ai.RetrieverResponse{Documents: docs}, nil
  }
  factsRetriever := genkit.DefineRetriever(g, "local", "dogFacts", dummyRetrieverFunc)

  m := googlegenai.GoogleAIModel(g, "gemini-2.0-flash")
  if m == nil {
    log.Fatal("failed to find model")
  }

  // A simple question-answering flow
  genkit.DefineFlow(g, "qaFlow", func(ctx context.Context, query string) (string, error) {
    factDocs, err := ai.Retrieve(ctx, factsRetriever, ai.WithTextDocs(query))
    if err != nil {
      return "", fmt.Errorf("retrieval failed: %w", err)
    }
    llmResponse, err := genkit.Generate(ctx, g,
      ai.WithModelName("googleai/gemini-2.0-flash"),
      ai.WithPrompt("Answer this question with the given context: %s", query),
      ai.WithDocs(factDocs.Documents...),
    )
    if err != nil {
      return "", fmt.Errorf("generation failed: %w", err)
    }
    return llmResponse.Text(), nil
  })
}
- You can optionally add evaluation metrics to your application to use while evaluating. This guide uses the EvaluatorRegex metric from the evaluators package.

import (
  "github.com/firebase/genkit/go/plugins/evaluators"
)

func main() {
  // ...
  metrics := []evaluators.MetricConfig{
    {
      MetricType: evaluators.EvaluatorRegex,
    },
  }

  // Initialize Genkit
  g, err := genkit.Init(ctx,
    genkit.WithPlugins(
      &googlegenai.GoogleAI{},
      &evaluators.GenkitEval{Metrics: metrics}, // Add this plugin
    ),
    genkit.WithDefaultModel("googleai/gemini-2.0-flash"),
  )
}

Note: Ensure that the evaluators package is installed in your Go project:

go get github.com/firebase/genkit/go/plugins/evaluators
- Start your Genkit application.
genkit start -- go run main.go
Create a dataset
Create a dataset to define the examples we want to use for evaluating our flow.
Go to the Dev UI at http://localhost:4000 and click the Datasets button to open the Datasets page.

Click the Create Dataset button to open the create dataset dialog.

a. Provide a datasetId for your new dataset. This guide uses myFactsQaDataset.

b. Select the Flow dataset type.

c. Leave the validation target field empty and click Save.
Your new dataset page appears, showing an empty dataset. Add examples to it by following these steps:
a. Click the Add example button to open the example editor panel.
b. Only the Input field is required. Enter "Who is man's best friend?" in the Input field, and click Save to add the example to your dataset.

If you have configured the EvaluatorRegex metric and would like to try it out, you need to specify a Reference string that contains the pattern to match the output against. For the preceding input, set the Reference output text to "(?i)dog", which is a case-insensitive regular-expression pattern that matches the word "dog" in the flow output.

c. Repeat steps (a) and (b) a couple more times to add more examples. This guide adds the following example inputs to the dataset:

"Can I give milk to my cats?"
"From which animals did dogs evolve?"

If you are using the regular-expression evaluator, use the corresponding reference strings:

"(?i)don't know"
"(?i)wolf|wolves"

Note that this is a contrived example and the regular-expression evaluator may not be the right choice to evaluate the responses from qaFlow. However, this guide can be applied to any Genkit Go evaluator of your choice.

By the end of this step, your dataset should have 3 examples in it, with the values mentioned above.
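For reference, the same three examples could also be written as a JSON input file in the format accepted by the eval:flow command (see the Advanced use section). This is only an illustration of the dataset contents; the Dev UI does not require such a file:

[
  {
    "input": "Who is man's best friend?",
    "reference": "(?i)dog"
  },
  {
    "input": "Can I give milk to my cats?",
    "reference": "(?i)don't know"
  },
  {
    "input": "From which animals did dogs evolve?",
    "reference": "(?i)wolf|wolves"
  }
]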
Run evaluation and view results
To start evaluating the flow, click the Run new evaluation button on your dataset page. You can also start a new evaluation from the Evaluations tab.
Select the Flow radio button to evaluate a flow.

Select qaFlow as the target flow to evaluate.

Select myFactsQaDataset as the target dataset to use for evaluation.

If you have installed an evaluator metric using Genkit plugins, you can see these metrics on this page. Select the metrics that you want to use with this evaluation run. This is entirely optional: omitting this step will still return the results in the evaluation run, but without any associated metrics.

If you have not provided any reference values and are using the EvaluatorRegex metric, your evaluation will fail, since this metric needs a reference to be set.

Click Run evaluation to start evaluation. Depending on the flow you're testing, this may take a while. Once the evaluation is complete, a success message appears with a link to view the results. Click the link to go to the Evaluation details page.
You can see the details of your evaluation on this page, including the original input, extracted context, and metrics (if any).
Core concepts
Terminology
Knowing the following terms can help ensure that you correctly understand the information provided on this page:
Evaluation: An evaluation is a process that assesses system performance. In Genkit, such a system is usually a Genkit primitive, such as a flow or a model. An evaluation can be automated or manual (human evaluation).
Bulk inference: Inference is the act of running an input on a flow or model to get the corresponding output. Bulk inference involves performing inference on multiple inputs simultaneously.

Metric: An evaluation metric is a criterion on which an inference is scored. Examples include accuracy, faithfulness, maliciousness, whether the output is in English, etc.

Dataset: A dataset is a collection of examples to use for inference-based evaluation. A dataset typically consists of Input and optional Reference fields. The Reference field does not affect the inference step of evaluation, but it is passed verbatim to any evaluation metrics. In Genkit, you can create a dataset through the Dev UI. There are two types of datasets in Genkit: Flow datasets and Model datasets.
Supported evaluators
Genkit supports several evaluators, some built-in and others provided externally.
Genkit evaluators
Genkit includes a small number of built-in evaluators, ported from the JS evaluators plugin, to help you get started:
- EvaluatorDeepEqual -- Checks if the generated output is deep-equal to the reference output provided.
- EvaluatorRegex -- Checks if the generated output matches the regular expression provided in the reference field.
- EvaluatorJsonata -- Checks if the generated output matches the JSONata expression provided in the reference field.
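If you want to try all three, you can list them in the same MetricConfig slice used in the Quick start setup. The following sketch assumes the same imports and initialization as the Quick start; note that all three metrics compare the output against the reference field, so your dataset examples need reference values:

metrics := []evaluators.MetricConfig{
  {MetricType: evaluators.EvaluatorDeepEqual},
  {MetricType: evaluators.EvaluatorRegex},
  {MetricType: evaluators.EvaluatorJsonata},
}

// Register the built-in evaluators alongside your model plugin.
g, err := genkit.Init(ctx,
  genkit.WithPlugins(
    &googlegenai.GoogleAI{},
    &evaluators.GenkitEval{Metrics: metrics},
  ),
)
if err != nil {
  log.Fatalf("Genkit initialization error: %v", err)
}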
Advanced use
Along with its basic functionality, Genkit also provides advanced support for certain evaluation use cases.
Evaluation using the CLI
Genkit CLI provides a rich API for performing evaluation. This is especially useful in environments where the Dev UI is not available (e.g. in a CI/CD workflow).
The Genkit CLI provides three main evaluation commands: eval:flow, eval:extractData, and eval:run.
eval:flow command
The eval:flow
command runs inference-based evaluation on an input dataset.
This dataset may be provided either as a JSON file or by referencing an existing
dataset in your Genkit runtime.
# Referencing an existing dataset
genkit eval:flow qaFlow --input myFactsQaDataset

# or, using a dataset from a file
genkit eval:flow qaFlow --input testInputs.json
Note: Make sure that you start your Genkit app before running these CLI commands.
genkit start -- go run main.go
Here, testInputs.json
should be an array of objects containing an input
field and an optional reference
field, like below:
[
{
"input": "What is the French word for Cheese?"
},
{
"input": "What green vegetable looks like cauliflower?",
"reference": "Broccoli"
}
]
If your flow requires auth, you may specify it using the --context
argument:
genkit eval:flow qaFlow --input testInputs.json --context '{"auth": {"email_verified": true}}'
By default, the eval:flow
and eval:run
commands use all available metrics
for evaluation. To run on a subset of the configured evaluators, use the
--evaluators
flag and provide a comma-separated list of evaluators by name:
genkit eval:flow qaFlow --input testInputs.json --evaluators=genkitEval/regex,genkitEval/jsonata
You can view the results of your evaluation run in the Dev UI at
localhost:4000/evaluate.
eval:extractData and eval:run commands
To support raw evaluation, Genkit provides tools to extract data from traces and run evaluation metrics on extracted data. This is useful, for example, if you are using a different framework for evaluation or if you are collecting inferences from a different environment to test locally for output quality.
You can batch run your Genkit flow and extract an evaluation dataset from the resultant traces. A raw evaluation dataset is a collection of inputs for evaluation metrics, without running any prior inference.
Run your flow over your test inputs:
genkit flow:batchRun qaFlow testInputs.json
Extract the evaluation data:
genkit eval:extractData qaFlow --maxRows 2 --output factsEvalDataset.json
The exported data has a format different from the dataset format presented earlier. This is because this data is intended to be used with evaluation metrics directly, without any inference step. Here is the syntax of the extracted data.
Array<{
"testCaseId": string,
"input": any,
"output": any,
"context": any[],
"traceIds": string[],
}>;
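For example, a single extracted record for qaFlow might look roughly like the following. The values, including the test case and trace IDs, are made up for illustration, and the context entries are shown as plain text for readability:

[
  {
    "testCaseId": "example-test-case-1",
    "input": "Who is man's best friend?",
    "output": "The dog is commonly considered man's best friend.",
    "context": [
      "Dog is man's best friend",
      "Dogs have evolved and were domesticated from wolves"
    ],
    "traceIds": ["example-trace-id-1"]
  }
]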
The data extractor automatically locates retrievers and adds the produced docs
to the context array. You can run evaluation metrics on this extracted dataset
using the eval:run
command.
genkit eval:run factsEvalDataset.json
By default, eval:run runs against all configured evaluators, and as with eval:flow, results for eval:run appear on the evaluation page of the Dev UI, located at localhost:4000/evaluate.
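As with eval:flow, you can also limit eval:run to a subset of the configured evaluators with the --evaluators flag. The command below is a sketch using the built-in regex metric; adjust the list to whichever evaluators you have configured:

genkit eval:run factsEvalDataset.json --evaluators=genkitEval/regex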