Evals

Requires the experimental feature flag: cargo add rig -F experimental

rig::evals is an experimental framework for measuring the quality of LLM outputs, so you can tell whether your agents, prompts, and RAG systems produce correct and relevant responses. Inspired by OpenAI’s evals framework, it gives you:

  • A core Eval trait for defining custom evaluators
  • Built-in metrics: LLM-as-a-judge, LLM scoring, and semantic similarity
  • Structured outcomes: pass, fail, or invalid

The Eval trait is the foundation of the framework. It is generic over the Output produced by the evaluation and takes a single String input (the text being evaluated):

pub trait Eval<Output>
where
    Output: for<'a> Deserialize<'a> + Serialize + Clone + Send + Sync,
    Self: Sized + Send + Sync + 'static,
{
    fn eval(&self, input: String) -> impl Future<Output = EvalOutcome<Output>> + Send;

    fn eval_batch(
        &self,
        input: Vec<String>,
        concurrency_limit: usize,
    ) -> impl Future<Output = Vec<EvalOutcome<Output>>> + Send;
}

Note that eval returns an EvalOutcome<Output> directly — not a Result — so there is no ? when awaiting it. External failures (such as an API error) are represented by the Invalid variant rather than a returned error. Use eval_batch to evaluate many inputs at once, passing a concurrency_limit to stay within provider rate limits.

Every evaluator returns an EvalOutcome:

pub enum EvalOutcome<Output> {
    /// Evaluation passed (carries the produced output/score)
    Pass(Output),
    /// Evaluation failed (carries the produced output/score)
    Fail(Output),
    /// The evaluation could not be completed (reason in the field)
    Invalid(String),
}

EvalOutcome provides two helpers:

  • is_pass(&self) -> bool: true only for the Pass variant.
  • score(&self) -> Option<&Output>: the carried output for Pass/Fail, or None for Invalid.
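
To make the behavior of the two helpers concrete, here is a minimal sketch of how they could be written. It uses a local copy of the enum so it compiles without the rig crate; the method bodies are an assumption based on the descriptions above, not the library's source.

```rust
// Local stand-in for rig::evals::EvalOutcome, copied from the definition above
// so this sketch compiles on its own. The helper bodies are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum EvalOutcome<Output> {
    Pass(Output),
    Fail(Output),
    Invalid(String),
}

impl<Output> EvalOutcome<Output> {
    /// True only for the `Pass` variant.
    pub fn is_pass(&self) -> bool {
        matches!(self, EvalOutcome::Pass(_))
    }

    /// The carried output for `Pass`/`Fail`, or `None` for `Invalid`.
    pub fn score(&self) -> Option<&Output> {
        match self {
            EvalOutcome::Pass(output) | EvalOutcome::Fail(output) => Some(output),
            EvalOutcome::Invalid(_) => None,
        }
    }
}
```

Because score() returns Some for both Pass and Fail, use is_pass() when you only need the verdict, and score() when you want the carried value regardless of the verdict.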

A typical way to consume an outcome is to match on it:

use rig::evals::EvalOutcome;

// `outcome` is the EvalOutcome returned by an evaluator; its output type must implement Debug.
match outcome {
    EvalOutcome::Pass(score) => println!("passed: {score:?}"),
    EvalOutcome::Fail(score) => println!("failed: {score:?}"),
    EvalOutcome::Invalid(reason) => eprintln!("could not evaluate: {reason}"),
}

The LLM-backed metrics are built on top of Rig’s extractors. You obtain an ExtractorBuilder from any provider client via client.extractor::<T>(model_name) and hand it to the metric builder.

The LLM judge (built with LlmJudgeBuilder) uses an LLM to decide whether an output meets certain criteria. You provide a schema type that implements the Judgment trait, whose single passes method decides pass/fail:

use rig::evals::{Eval, EvalOutcome, Judgment, LlmJudgeBuilder};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

#[derive(Clone, Deserialize, Serialize, JsonSchema)]
struct FactualityJudgment {
    /// Whether the response is factually accurate
    is_factual: bool,
    /// Explanation for the judgment
    reasoning: String,
}

impl Judgment for FactualityJudgment {
    fn passes(&self) -> bool {
        self.is_factual
    }
}

// `client` is any Rig provider client (OpenAI, Anthropic, etc.).
let ext = client.extractor::<FactualityJudgment>("gpt-5.5");
let judge = LlmJudgeBuilder::new(ext).build();

let outcome = judge
    .eval("The capital of France is Paris.".to_string())
    .await;
assert!(outcome.is_pass());

The judge asks the LLM to fill in your schema, then calls passes() on the result to decide Pass or Fail. If the extraction itself fails, the outcome is Invalid.
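
The mapping from extraction result to outcome can be sketched without any network call. The block below uses local stand-ins for EvalOutcome and Judgment, and outcome_from_extraction is a hypothetical helper invented for illustration; the error type String is also an assumption. It shows the rule described above, not the library's actual code.

```rust
// Local stand-ins so the sketch compiles without `rig`; the real types live in rig::evals.
#[derive(Debug, PartialEq)]
pub enum EvalOutcome<Output> {
    Pass(Output),
    Fail(Output),
    Invalid(String),
}

pub trait Judgment {
    fn passes(&self) -> bool;
}

#[derive(Debug, PartialEq)]
pub struct FactualityJudgment {
    pub is_factual: bool,
    pub reasoning: String,
}

impl Judgment for FactualityJudgment {
    fn passes(&self) -> bool {
        self.is_factual
    }
}

/// Hypothetical helper: a successful extraction becomes Pass or Fail depending
/// on `passes()`, and a failed extraction becomes Invalid with the error text.
pub fn outcome_from_extraction<T: Judgment>(extracted: Result<T, String>) -> EvalOutcome<T> {
    match extracted {
        Ok(judgment) if judgment.passes() => EvalOutcome::Pass(judgment),
        Ok(judgment) => EvalOutcome::Fail(judgment),
        Err(reason) => EvalOutcome::Invalid(reason),
    }
}
```

Keeping the extraction failure separate from Fail matters when you aggregate results: a Fail says something about the text under test, while an Invalid says something about the judge.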

LLM Judge with a Custom Function (LlmJudgeMetricWithFn)

Instead of implementing the Judgment trait, you can supply a closure that determines pass/fail from the extracted schema. Call with_fn on the builder, which yields an LlmJudgeMetricWithFn after build():

use rig::evals::LlmJudgeBuilder;

// `MySchema` is your own extraction type; here it is assumed to have a numeric `score` field.
let ext = client.extractor::<MySchema>("gpt-5.5");
let judge = LlmJudgeBuilder::new(ext)
    .with_fn(|schema: &MySchema| schema.score > 0.5)
    .build();

With this variant the schema type does not need to implement Judgment.

The LLM score metric (built with LlmScoreMetricBuilder) uses an LLM to assign a numerical score to an input against a set of criteria. The LLM fills in an LlmScoreMetricScore:

pub struct LlmScoreMetricScore {
    /// A score between 0.0 and 1.0 inclusive.
    pub score: f64,
    /// Feedback on the input in relation to the criteria.
    pub feedback: String,
}

Build the metric from an extractor whose target type is LlmScoreMetricScore, add one or more criteria, and set a required threshold. build() returns a Result because the threshold must be provided:

use rig::evals::{Eval, EvalOutcome, LlmScoreMetricBuilder, LlmScoreMetricScore};

let ext = client.extractor::<LlmScoreMetricScore>("gpt-5.5");
let scorer = LlmScoreMetricBuilder::new(ext)
    .criteria("The response is clear and technically accurate")
    .threshold(0.7) // scores >= 0.7 pass
    .build()?;

let outcome = scorer
    .eval("Quantum entanglement is when two particles become linked...".to_string())
    .await;

Scores are normalized to the 0.0..=1.0 range; a score outside that range yields EvalOutcome::Invalid.
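
The range check and threshold rule can be sketched as a pure function. The types below are local copies so the block compiles on its own, and classify is a hypothetical name; treating NaN as out of range is an assumption of this sketch, not documented library behavior.

```rust
// Local stand-ins mirroring the definitions above; not imported from `rig`.
#[derive(Debug, PartialEq)]
pub enum EvalOutcome<Output> {
    Pass(Output),
    Fail(Output),
    Invalid(String),
}

#[derive(Debug, Clone, PartialEq)]
pub struct LlmScoreMetricScore {
    pub score: f64,
    pub feedback: String,
}

/// Hypothetical helper: out-of-range scores are Invalid, in-range scores are
/// compared against the threshold (>= passes).
pub fn classify(result: LlmScoreMetricScore, threshold: f64) -> EvalOutcome<LlmScoreMetricScore> {
    // NaN also fails this range check, so it is reported as Invalid (assumption).
    if !(0.0..=1.0).contains(&result.score) {
        return EvalOutcome::Invalid(format!(
            "score {} is outside 0.0..=1.0",
            result.score
        ));
    }
    if result.score >= threshold {
        EvalOutcome::Pass(result)
    } else {
        EvalOutcome::Fail(result)
    }
}
```

With a threshold of 0.7, a score of exactly 0.7 passes, 0.69 fails, and 1.2 is reported as Invalid rather than as a pass.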

Semantic Similarity (SemanticSimilarityMetric)

Measures cosine similarity between the embedding of the input and the embedding of a reference answer. This is a non-LLM metric — it uses an embedding model only. Both threshold and reference_answer are required, and build is async (it embeds the reference answer up front), returning a Result:

use rig::evals::{Eval, EvalOutcome, SemanticSimilarityMetric};

// `embedding_model` is an embedding model obtained from a provider client.
let metric = SemanticSimilarityMetric::builder(embedding_model)
    .reference_answer("The cat sat on the mat")
    .threshold(0.85) // cosine similarity >= 0.85 passes
    .build()
    .await?;

let outcome = metric
    .eval("A cat was sitting on a mat".to_string())
    .await;

The carried output is a SemanticSimilarityMetricScore, whose score field holds the computed cosine similarity:

pub struct SemanticSimilarityMetricScore {
    pub score: f64,
}
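
For reference, cosine similarity itself is the dot product of two vectors divided by the product of their lengths. The function below is a standalone sketch of that formula over plain f64 slices; cosine_similarity is a name chosen for illustration, and returning None for mismatched, empty, or zero-length vectors is a design choice of this sketch rather than something the library is known to do.

```rust
/// Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|).
/// Returns None when the vectors differ in length, are empty, or either has
/// zero magnitude (the similarity is undefined in those cases).
pub fn cosine_similarity(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

Vectors pointing in the same direction score close to 1.0 regardless of magnitude, and orthogonal vectors score 0.0, which is why a threshold such as 0.85 works as a "close enough in meaning" cutoff.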

Implement Eval<Output> for any custom evaluation logic. The Output type must be Serialize + Deserialize + Clone + Send + Sync. eval takes the input as a String and returns an EvalOutcome<Output> directly (no Result):

use rig::evals::{Eval, EvalOutcome};

struct LengthCheck {
    min_length: usize,
    max_length: usize,
}

impl Eval<usize> for LengthCheck {
    async fn eval(&self, input: String) -> EvalOutcome<usize> {
        // `len()` counts bytes, not characters; use `input.chars().count()` for characters.
        let len = input.len();
        if len >= self.min_length && len <= self.max_length {
            EvalOutcome::Pass(len)
        } else {
            EvalOutcome::Fail(len)
        }
    }
}

  1. Combine Metrics: Use multiple eval metrics together. For example, combine an LLM judge for factuality with semantic similarity for relevance.

  2. Determinism: LLM-based evals are inherently non-deterministic. Run them multiple times and look at aggregate results for reliable assessments.

  3. Thresholds: Start with permissive thresholds and tighten them as you understand your system’s behavior.

  4. Cost: LLM-as-a-judge evals incur additional API costs. Consider using cheaper models for judging when possible, and use non-LLM metrics (like semantic similarity) where appropriate.

  5. Invalid Outcomes: Always handle EvalOutcome::Invalid — it indicates the eval itself could not run (e.g., the judge LLM returned unparseable output or an API call failed), not that the tested output was bad.
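
Points 2 and 5 can be combined when aggregating repeated runs: count Invalid outcomes separately so they do not drag down the pass rate. The sketch below uses a local copy of EvalOutcome; Summary, summarize, and pass_rate are hypothetical names for illustration and are not part of rig::evals.

```rust
// Local stand-in for rig::evals::EvalOutcome so the sketch compiles on its own.
#[derive(Debug, Clone, PartialEq)]
pub enum EvalOutcome<Output> {
    Pass(Output),
    Fail(Output),
    Invalid(String),
}

/// Hypothetical tally of outcomes from repeated or batched runs.
#[derive(Debug, PartialEq)]
pub struct Summary {
    pub passed: usize,
    pub failed: usize,
    pub invalid: usize,
}

impl Summary {
    /// Pass rate over the evals that actually ran (Invalid excluded).
    /// Returns None when nothing ran to completion.
    pub fn pass_rate(&self) -> Option<f64> {
        let ran = self.passed + self.failed;
        if ran == 0 {
            None
        } else {
            Some(self.passed as f64 / ran as f64)
        }
    }
}

/// Count each variant once per outcome.
pub fn summarize<T>(outcomes: &[EvalOutcome<T>]) -> Summary {
    let mut summary = Summary { passed: 0, failed: 0, invalid: 0 };
    for outcome in outcomes {
        match outcome {
            EvalOutcome::Pass(_) => summary.passed += 1,
            EvalOutcome::Fail(_) => summary.failed += 1,
            EvalOutcome::Invalid(_) => summary.invalid += 1,
        }
    }
    summary
}
```

A high invalid count is worth investigating on its own (rate limits, schema problems in the judge prompt) before drawing conclusions from the pass rate.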

The evals module is behind the experimental feature flag. The API may change in future versions as the framework matures. Feedback is welcome — see the contributing guide.

  • Extractors — Structured data extraction (used internally by the LLM judge and score metrics)
  • Embeddings — Embedding models (used by the semantic similarity metric)