Evals

Requires the experimental feature flag: cargo add rig -F experimental

rig::evals is an experimental framework for measuring the quality of LLM outputs, so you can tell whether your agents, prompts, and RAG systems produce correct and relevant responses. Inspired by OpenAI’s evals framework, it gives you:

A core Eval trait for defining custom evaluators
Built-in metrics: LLM-as-a-judge, LLM scoring, and semantic similarity
Structured outcomes: pass, fail, or invalid

Core trait: `Eval`

The Eval trait is the foundation of the framework. It is generic over the Output produced by the evaluation and takes a single String input (the text being evaluated):

pub trait Eval<Output>
where
    Output: for<'a> Deserialize<'a> + Serialize + Clone + Send + Sync,
    Self: Sized + Send + Sync + 'static,
{
    fn eval(&self, input: String) -> impl Future<Output = EvalOutcome<Output>> + Send;

    fn eval_batch(
        &self,
        input: Vec<String>,
        concurrency_limit: usize,
    ) -> impl Future<Output = Vec<EvalOutcome<Output>>> + Send;
}

Note that eval returns an EvalOutcome<Output> directly — not a Result — so there is no ? when awaiting it. External failures (such as an API error) are represented by the Invalid variant rather than a returned error. Use eval_batch to evaluate many inputs at once, passing a concurrency_limit to stay within provider rate limits.

Every evaluator returns an EvalOutcome:

pub enum EvalOutcome<Output> {
    /// Evaluation passed (carries the produced output/score)
    Pass(Output),
    /// Evaluation failed (carries the produced output/score)
    Fail(Output),
    /// The evaluation could not be completed (reason in the field)
    Invalid(String),
}

EvalOutcome provides two helpers:

is_pass(&self) -> bool — true only for the Pass variant.
score(&self) -> Option<&Output> — the carried output for Pass/Fail, or None for Invalid.
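To make the two helpers concrete, here is a minimal sketch of how they could behave. The enum is a local stand-in that mirrors the definition above so the snippet compiles without Rig; it is not Rig's source, and valid_scores is a hypothetical helper added only for illustration.

```rust
// Local stand-in mirroring the enum shown above; NOT Rig's actual source.
#[derive(Debug, Clone, PartialEq)]
pub enum EvalOutcome<Output> {
    Pass(Output),
    Fail(Output),
    Invalid(String),
}

impl<Output> EvalOutcome<Output> {
    /// True only for the `Pass` variant.
    pub fn is_pass(&self) -> bool {
        matches!(self, EvalOutcome::Pass(_))
    }

    /// The carried output for `Pass`/`Fail`, or `None` for `Invalid`.
    pub fn score(&self) -> Option<&Output> {
        match self {
            EvalOutcome::Pass(output) | EvalOutcome::Fail(output) => Some(output),
            EvalOutcome::Invalid(_) => None,
        }
    }
}

/// Hypothetical helper: keep the scores of every outcome that produced one,
/// silently skipping `Invalid` outcomes.
pub fn valid_scores(outcomes: &[EvalOutcome<f64>]) -> Vec<f64> {
    outcomes.iter().filter_map(|o| o.score().copied()).collect()
}
```

Because score() returns None for Invalid, it combines naturally with filter_map when you only care about evaluations that actually ran.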

A typical way to consume an outcome is to match on it:

use rig::evals::EvalOutcome;

match outcome {
    EvalOutcome::Pass(score) => println!("passed: {score:?}"),
    EvalOutcome::Fail(score) => println!("failed: {score:?}"),
    EvalOutcome::Invalid(reason) => eprintln!("could not evaluate: {reason}"),
}

Built-in metrics

The LLM-backed metrics are built on top of Rig’s extractors. You obtain an ExtractorBuilder from any provider client via client.extractor::<T>(model_name) and hand it to the metric builder.

LLM Judge (`LlmJudgeMetric`)

Uses an LLM to judge whether an output meets certain criteria. You provide a schema type that implements the Judgment trait, whose single passes method decides pass/fail:

use rig::evals::{Eval, EvalOutcome, Judgment, LlmJudgeBuilder};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

#[derive(Clone, Deserialize, Serialize, JsonSchema)]
struct FactualityJudgment {
    /// Whether the response is factually accurate
    is_factual: bool,
    /// Explanation for the judgment
    reasoning: String,
}

impl Judgment for FactualityJudgment {
    fn passes(&self) -> bool {
        self.is_factual
    }
}

// `client` is any Rig provider client (OpenAI, Anthropic, etc.).
let ext = client.extractor::<FactualityJudgment>("gpt-5.5");
let judge = LlmJudgeBuilder::new(ext).build();

let outcome = judge
    .eval("The capital of France is Paris.".to_string())
    .await;

assert!(outcome.is_pass());

The judge asks the LLM to fill in your schema, then calls passes() on the result to decide Pass or Fail. If the extraction itself fails, the outcome is Invalid.
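The decision flow described above can be sketched without any network calls. Everything below is a local stand-in written for illustration: the trait and enum mirror the shapes shown in this page, the extraction step is modeled as a plain Result<T, String>, and outcome_from_extraction is a hypothetical helper, not part of Rig.

```rust
// Local stand-ins for illustration only; these are NOT Rig's definitions.
#[derive(Debug, PartialEq)]
pub enum EvalOutcome<Output> {
    Pass(Output),
    Fail(Output),
    Invalid(String),
}

pub trait Judgment {
    fn passes(&self) -> bool;
}

#[derive(Debug, PartialEq)]
pub struct FactualityJudgment {
    pub is_factual: bool,
    pub reasoning: String,
}

impl Judgment for FactualityJudgment {
    fn passes(&self) -> bool {
        self.is_factual
    }
}

/// Hypothetical helper: turn the result of a structured extraction into an
/// outcome. Assumption: a failed extraction is reported as an error string.
pub fn outcome_from_extraction<T: Judgment>(extracted: Result<T, String>) -> EvalOutcome<T> {
    match extracted {
        // The schema was filled in and the judgment says "pass".
        Ok(judgment) if judgment.passes() => EvalOutcome::Pass(judgment),
        // The schema was filled in but the judgment says "fail".
        Ok(judgment) => EvalOutcome::Fail(judgment),
        // The extraction itself failed, so nothing could be judged.
        Err(reason) => EvalOutcome::Invalid(reason),
    }
}
```

Note that Fail still carries the full judgment, so the reasoning field remains available when you want to inspect why an output was rejected.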

LLM Judge with a Custom Function (`LlmJudgeMetricWithFn`)

Instead of implementing the Judgment trait, you can supply a closure that determines pass/fail from the extracted schema. Call with_fn on the builder, which yields an LlmJudgeMetricWithFn after build():

use rig::evals::LlmJudgeBuilder;

let ext = client.extractor::<MySchema>("gpt-5.5");
let judge = LlmJudgeBuilder::new(ext)
    .with_fn(|schema: &MySchema| schema.score > 0.5)
    .build();

With this variant the schema type does not need to implement Judgment.

LLM Score (`LlmScoreMetric`)

Uses an LLM to assign a numerical score to an input against a set of criteria. The LLM fills in an LlmScoreMetricScore:

pub struct LlmScoreMetricScore {
    /// A score between 0.0 and 1.0 inclusive.
    pub score: f64,
    /// Feedback on the input in relation to the criteria.
    pub feedback: String,
}

Build the metric from an extractor whose target type is LlmScoreMetricScore, add one or more criteria, and set a required threshold. build() returns a Result because the threshold must be provided:

use rig::evals::{Eval, EvalOutcome, LlmScoreMetricBuilder, LlmScoreMetricScore};

let ext = client.extractor::<LlmScoreMetricScore>("gpt-5.5");
let scorer = LlmScoreMetricBuilder::new(ext)
    .criteria("The response is clear and technically accurate")
    .threshold(0.7) // scores >= 0.7 pass
    .build()?;

let outcome = scorer
    .eval("Quantum entanglement is when two particles become linked...".to_string())
    .await;

Scores are expected to fall within the 0.0..=1.0 range; a score outside that range yields EvalOutcome::Invalid.
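The threshold and range rules can be expressed as a small pure function. This is a sketch under stated assumptions: the types are local stand-ins mirroring the ones shown on this page, classify is a hypothetical helper, and the wording of the Invalid message is invented here rather than taken from Rig.

```rust
// Local stand-ins so the sketch compiles without Rig; NOT Rig's source.
#[derive(Debug, Clone, PartialEq)]
pub struct LlmScoreMetricScore {
    pub score: f64,
    pub feedback: String,
}

#[derive(Debug, Clone, PartialEq)]
pub enum EvalOutcome<Output> {
    Pass(Output),
    Fail(Output),
    Invalid(String),
}

/// Hypothetical helper: apply the range check and the threshold described
/// above to a score the LLM has already produced.
pub fn classify(
    result: LlmScoreMetricScore,
    threshold: f64,
) -> EvalOutcome<LlmScoreMetricScore> {
    // Anything outside 0.0..=1.0 is Invalid. NaN also fails this check.
    if !(0.0..=1.0).contains(&result.score) {
        return EvalOutcome::Invalid(format!(
            "score {} is outside 0.0..=1.0",
            result.score
        ));
    }
    // The threshold is inclusive: a score equal to it passes.
    if result.score >= threshold {
        EvalOutcome::Pass(result)
    } else {
        EvalOutcome::Fail(result)
    }
}
```

With a threshold of 0.7, a score of exactly 0.7 passes, 0.69 fails, and 1.2 is reported as Invalid rather than as a very good result.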

Semantic Similarity (`SemanticSimilarityMetric`)

Measures cosine similarity between the embedding of the input and the embedding of a reference answer. This is a non-LLM metric — it uses an embedding model only. Both threshold and reference_answer are required, and build is async (it embeds the reference answer up front), returning a Result:

use rig::evals::{Eval, EvalOutcome, SemanticSimilarityMetric};

let metric = SemanticSimilarityMetric::builder(embedding_model)
    .reference_answer("The cat sat on the mat")
    .threshold(0.85) // cosine similarity >= 0.85 passes
    .build()
    .await?;

let outcome = metric
    .eval("A cat was sitting on a mat".to_string())
    .await;

The carried output is a SemanticSimilarityMetricScore, whose score field holds the computed cosine similarity:

pub struct SemanticSimilarityMetricScore {
    pub score: f64,
}
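For intuition about the number this metric compares against the threshold, here is a standalone sketch of cosine similarity over two embedding vectors. It is plain Rust with no dependency on Rig; cosine_similarity is a hypothetical function, and returning None for mismatched or zero-length vectors is a choice made for this example, not a description of Rig's behavior.

```rust
/// Cosine similarity: dot(a, b) / (|a| * |b|).
///
/// Returns `None` when the vectors are empty, differ in length, or when
/// either has zero magnitude (the similarity is undefined in those cases).
pub fn cosine_similarity(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }

    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|y| y * y).sum::<f64>().sqrt();

    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

The result lies between -1.0 and 1.0: vectors pointing in the same direction score 1.0, orthogonal vectors score 0.0, and opposite vectors score -1.0. A threshold such as 0.85 therefore demands that the input embedding point in nearly the same direction as the reference embedding.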

Writing custom evals

Implement Eval<Output> for any custom evaluation logic. The Output type must be Serialize + Deserialize + Clone + Send + Sync. eval takes the input as a String and returns an EvalOutcome<Output> directly (no Result):

use rig::evals::{Eval, EvalOutcome};

struct LengthCheck {
    min_length: usize,
    max_length: usize,
}

impl Eval<usize> for LengthCheck {
    async fn eval(&self, input: String) -> EvalOutcome<usize> {
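        // `String::len` counts bytes, not characters; for non-ASCII text
        // consider `input.chars().count()` instead.
        let len = input.len();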
        if len >= self.min_length && len <= self.max_length {
            EvalOutcome::Pass(len)
        } else {
            EvalOutcome::Fail(len)
        }
    }
}

Best practices

Combine Metrics: Use multiple eval metrics together. For example, combine an LLM judge for factuality with semantic similarity for relevance.
Determinism: LLM-based evals are inherently non-deterministic. Run them multiple times and look at aggregate results for reliable assessments.
Thresholds: Start with permissive thresholds and tighten them as you understand your system’s behavior.
Cost: LLM-as-a-judge evals incur additional API costs. Consider using cheaper models for judging when possible, and use non-LLM metrics (like semantic similarity) where appropriate.
Invalid Outcomes: Always handle EvalOutcome::Invalid — it indicates the eval itself could not run (e.g., the judge LLM returned unparseable output or an API call failed), not that the tested output was bad.
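The advice on aggregating repeated runs and treating Invalid separately can be sketched as a small tally. The enum is a local stand-in mirroring the one defined earlier, and Summary, summarize, and pass_rate are hypothetical names introduced for this example; none of them are part of Rig.

```rust
// Local stand-in mirroring the enum shown earlier; NOT Rig's source.
#[derive(Debug, Clone, PartialEq)]
pub enum EvalOutcome<Output> {
    Pass(Output),
    Fail(Output),
    Invalid(String),
}

/// Hypothetical tally of outcomes collected over repeated runs.
#[derive(Debug, Default, PartialEq)]
pub struct Summary {
    pub passed: usize,
    pub failed: usize,
    pub invalid: usize,
}

impl Summary {
    /// Pass rate over evaluations that actually ran. `Invalid` outcomes are
    /// excluded because they say nothing about the tested output.
    /// Returns `None` when no evaluation completed.
    pub fn pass_rate(&self) -> Option<f64> {
        let completed = self.passed + self.failed;
        if completed == 0 {
            None
        } else {
            Some(self.passed as f64 / completed as f64)
        }
    }
}

/// Count each kind of outcome in a batch of results.
pub fn summarize<T>(outcomes: &[EvalOutcome<T>]) -> Summary {
    let mut summary = Summary::default();
    for outcome in outcomes {
        match outcome {
            EvalOutcome::Pass(_) => summary.passed += 1,
            EvalOutcome::Fail(_) => summary.failed += 1,
            EvalOutcome::Invalid(_) => summary.invalid += 1,
        }
    }
    summary
}
```

Keeping the invalid count separate from the pass rate makes it easy to spot when a low rate reflects provider or parsing problems rather than poor outputs.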

Experimental status

The evals module is behind the experimental feature flag. The API may change in future versions as the framework matures. Feedback is welcome — see the contributing guide.

Next steps

Testing: Wire evals into your test suite and mock providers for deterministic runs.

Structured Output: Understand the extractors that power the LLM judge and score metrics.

Embeddings: Configure the embedding model behind the semantic similarity metric.

Agents: Build the agents whose outputs you'll evaluate.
