How to Evaluate an AI Agent Before You Ship It

AI Summary - 20-sec read - Reviewed by experts

A live demo proves an agent can work once. It does not prove it works on the awkward, adversarial, and edge-case inputs real users send. You need an evaluation set, not a vibe check.
Build a golden set of 30 to 100 real cases, each with an input and an expected outcome or a clear rule for what 'good' means. Score the agent against it automatically and turn quality into a single pass rate.
Mix case types: happy path, edge cases, adversarial inputs, and known past failures. Add every production bug back as a permanent test so it can never silently return.
Set a pass bar before you ship and re-run the whole set on every prompt, model, or tool change. That is how you catch the regression a one-line prompt edit introduces.
Short on time? Book a free call.


You demo the agent to the team. You type three sensible questions, it answers all three, everyone nods, and it ships on Friday. On Monday a customer phrases a refund request in a way nobody tried, the agent invents a policy, and you find out from a support ticket. The demo was never wrong. It just measured the wrong thing - whether the agent can work, not whether it does work on the inputs you do not control.

Testing an AI agent is not the same as testing ordinary software, because the same input can produce different output and 'correct' is often a judgment call. But it is far more measurable than most teams assume. This is how to turn 'it seemed fine in the demo' into a pass rate you can stand behind, and how to keep that number honest as the agent changes.

Why a demo is not a test

A demo is a sample of a few friendly inputs chosen by the person who built the thing. It is biased by design - you reach for the questions you know work. Production traffic is the opposite: messy phrasing, missing context, users trying to get a discount they are not owed, and the long tail of cases you never imagined.

An evaluation set fixes the sampling problem. Instead of a handful of hand-picked inputs, you assemble a fixed, representative collection of cases and score the agent against all of them every time. The output stops being a feeling and becomes a number: 92 of 100 cases passed. Now you can compare two versions, set a release bar, and prove the agent got better or worse rather than guessing.
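
As a rough illustration of that arithmetic, here is a minimal sketch that assumes each case result is recorded as a plain boolean. The function name `pass_rate` and the sample results are hypothetical, not taken from any particular tool.

```python
def pass_rate(results: list[bool]) -> float:
    """Return the fraction of cases that passed (0.0 for an empty run)."""
    if not results:
        return 0.0
    return sum(results) / len(results)


# Hypothetical results for two versions of the same agent on the same 100 cases.
version_a = [True] * 92 + [False] * 8
version_b = [True] * 88 + [False] * 12

print(f"Version A: {pass_rate(version_a):.0%}")
print(f"Version B: {pass_rate(version_b):.0%}")
```

Because both versions ran against the same fixed set, the two numbers are directly comparable, which is the whole point of fixing the sample.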

Build the golden set: 30 to 100 real cases

The golden set is the heart of the whole exercise. It is a list of cases, each with two parts: an input (what the user sends) and a definition of a good outcome (the expected answer, or a rule the answer must satisfy). Start with 30 to 100 cases - enough to cover the variety of real use, small enough that you can curate each one by hand.
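
One possible shape for a case, sketched under the assumption that a "definition of a good outcome" is either an exact expected answer or a pair of contains and must-not-contain rules. The class name `EvalCase`, its fields, and the sample cases are all illustrative.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalCase:
    """One golden-set entry: an input plus a definition of a good outcome."""
    case_id: str
    user_input: str
    # Either an exact expected answer...
    expected: Optional[str] = None
    # ...or rules the answer must satisfy.
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)

    def has_definition_of_good(self) -> bool:
        """A case without any expectation cannot be scored automatically."""
        return self.expected is not None or bool(
            self.must_contain or self.must_not_contain
        )


# Hypothetical cases; real ones should come from your own traffic.
golden_set = [
    EvalCase("refund-001", "Can I return shoes I bought 10 days ago?",
             must_contain=["30-day"]),
    EvalCase("json-001", "Give me the order status as JSON.",
             expected='{"status": "shipped"}'),
]

unscorable = [c.case_id for c in golden_set if not c.has_definition_of_good()]
print("Cases missing a definition of good:", unscorable)
```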

Where the cases come from matters more than the count:

Real production logs are the best source. Pull actual user inputs once you have them, including the ones that went wrong. Nothing you invent at your desk beats traffic you did not script.
Subject-matter experts supply the cases users have not sent yet but will - the rare refund scenario, the compliance edge, the question that is technically out of scope.
Every past failure becomes a permanent case. When the agent breaks in production, you fix it and add that exact input to the set forever. That single habit is what stops the same bug returning three releases later.

Deliberately spread the set across four kinds of case: the happy path (ordinary requests that should just work), edge cases (unusual but legitimate inputs), adversarial cases (users trying to manipulate the agent or extract something), and the regression cases drawn from real past bugs. A set that is all happy path tells you nothing about the failures that actually hurt.
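
A quick way to see whether a set is lopsided is to count cases per kind. The sketch below assumes each case carries a `kind` label using the four hypothetical names shown. The labels and the function name are illustrative choices.

```python
from collections import Counter

# Hypothetical labels for the four kinds of case described above.
REQUIRED_KINDS = {"happy_path", "edge", "adversarial", "regression"}


def coverage_report(case_kinds: list[str]) -> tuple[Counter, set[str]]:
    """Count cases per kind and list any required kind with no cases at all."""
    counts = Counter(case_kinds)
    missing = REQUIRED_KINDS - set(counts)
    return counts, missing


# Example: a set that is almost entirely happy path.
counts, missing = coverage_report(["happy_path"] * 28 + ["edge"] * 2)
print(dict(counts))
print("Kinds with no coverage:", sorted(missing))
```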

Not sure your agent is ready for real users?

Get a free audit. Send us your agent and your use case and we will build a starter evaluation set, run it, and show you where it breaks before your customers find out. No pitch, reply in 2 hrs, no card needed, NDA on request.


How to score answers that vary

The obvious objection: if the agent can word the same answer ten ways, how do you grade it automatically? You match the scoring method to the kind of question.

Exact or structured checks for anything with a right answer. Did it call the right tool with the right arguments? Did it return valid JSON? Did the refund amount equal the order total? These are deterministic - a simple assertion passes or fails.
Contains and must-not-contain rules for open text. The answer must mention the 30-day window; it must never quote a policy that does not exist; it must include the disclaimer. You assert on the facts that matter, not the exact wording.
A model-graded judge for genuinely subjective quality - tone, helpfulness, faithfulness to a source document. A separate model scores the answer against a rubric you write. It is not perfect, so reserve it for what rules cannot capture, and spot-check its grades against human judgment.
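
The first two methods can be sketched with nothing but the standard library. This is a minimal illustration that assumes the agent's tool call is available as a dict with `tool` and `args` keys and that text rules are matched case-insensitively. Both are assumptions about your setup, and the function names are hypothetical. The model-graded judge is left out because it depends on whichever model API you use.

```python
import json


def check_tool_call(actual: dict, expected_tool: str, expected_args: dict) -> bool:
    """Deterministic check: right tool, right arguments."""
    return actual.get("tool") == expected_tool and actual.get("args") == expected_args


def check_valid_json(text: str) -> bool:
    """Deterministic check: the output parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def check_text_rules(answer: str, must_contain: list[str],
                     must_not_contain: list[str]) -> bool:
    """Assert on the facts that matter, not on the exact wording."""
    lowered = answer.lower()
    has_required = all(phrase.lower() in lowered for phrase in must_contain)
    has_forbidden = any(phrase.lower() in lowered for phrase in must_not_contain)
    return has_required and not has_forbidden


answer = "You can return the item within our 30-day window."
print(check_text_rules(answer, ["30-day window"], ["lifetime guarantee"]))
```

Plain substring matching is crude, so phrase the rules around short, distinctive facts rather than whole sentences.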

Two scores deserve their own attention because they map to two common ways agents fail: inventing content and mishandling refusals. Faithfulness measures whether the answer is grounded in the retrieved source rather than invented - the same failure mode behind a RAG agent that returns wrong answers. Refusal correctness measures whether the agent declines what it should decline and answers what it should answer. It cuts both ways: an agent that hedges on everything scores badly with users even when it never says anything false.
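
Refusal correctness is easy to compute once each case is labelled. The sketch below assumes every case records whether the agent should have refused and whether it actually did. How you detect a refusal (a rule or a judge) is up to you, and the function name and metric names are hypothetical.

```python
def refusal_correctness(records: list[tuple[bool, bool]]) -> dict[str, float]:
    """records holds one (should_refuse, did_refuse) pair per case."""
    total = len(records)
    if total == 0:
        return {"accuracy": 0.0, "over_refusal_rate": 0.0, "under_refusal_rate": 0.0}
    correct = sum(1 for should, did in records if should == did)
    # Refused something it should have answered: the over-hedging agent.
    over = sum(1 for should, did in records if did and not should)
    # Answered something it should have declined: the unsafe agent.
    under = sum(1 for should, did in records if should and not did)
    return {
        "accuracy": correct / total,
        "over_refusal_rate": over / total,
        "under_refusal_rate": under / total,
    }


# Hypothetical labelled outcomes for four cases.
print(refusal_correctness([(True, True), (False, False), (False, True), (True, False)]))
```

Reporting the two error rates separately keeps an over-cautious agent from hiding behind a good safety number, and the reverse.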

Set a pass bar and run it on every change

An evaluation set you run once is a report. An evaluation set you run on every change is a safety net. The difference is automation and a threshold.

Decide the bar before you ship: for example, 95 percent of cases must pass, and zero adversarial or safety cases may fail. Safety cases are pass-or-block - one failure holds the release no matter how good the average looks. Then wire the set to run automatically whenever anything that affects behavior changes:

A prompt edit - the most dangerous change, because it feels trivial and ships without review.
A model swap or version bump, where a 'better' model can quietly regress on your specific cases.
A new or changed tool, a retrieval change, or a knowledge-base update.
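
The bar described above can be sketched as a small gate function. It assumes each result carries a `kind` label and treats the `adversarial` and `safety` kinds as blocking. The class, the labels, and the default threshold of 0.95 mirror the example figures in the text and are otherwise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    case_id: str
    kind: str      # e.g. "happy_path", "edge", "adversarial", "safety"
    passed: bool


# Kinds where a single failure holds the release.
BLOCKING_KINDS = {"adversarial", "safety"}


def release_allowed(results: list[CaseResult], pass_bar: float = 0.95) -> tuple[bool, str]:
    """Return (allowed, reason). Blocking failures win over the average."""
    if not results:
        return False, "no results to evaluate"
    blocking = [r.case_id for r in results
                if r.kind in BLOCKING_KINDS and not r.passed]
    if blocking:
        return False, "blocking failures: " + ", ".join(blocking)
    rate = sum(r.passed for r in results) / len(results)
    if rate < pass_bar:
        return False, f"pass rate {rate:.1%} is below the bar of {pass_bar:.0%}"
    return True, f"pass rate {rate:.1%} meets the bar"


# 99 passing happy-path cases cannot outvote one failed safety case.
results = [CaseResult(f"hp-{i}", "happy_path", True) for i in range(99)]
results.append(CaseResult("safety-1", "safety", False))
print(release_allowed(results))
```

In a deploy pipeline, the boolean would typically become a non-zero exit code so the release stops on its own.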

This is the agent equivalent of a regression test suite. The point is not to reach 100 percent, because some cases are genuinely hard. The point is to know your number and to be told the moment it drops. A one-line prompt tweak that fixes one complaint and breaks five others is invisible without this, and obvious with it.
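
To make that kind of trade visible, compare two runs case by case rather than only by total. A minimal sketch, assuming each run is stored as a mapping from case ID to pass or fail. The function name and the sample IDs are hypothetical.

```python
def diff_runs(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """List cases that flipped between two runs of the same set."""
    shared = before.keys() & after.keys()
    return {
        "fixed": sorted(c for c in shared if not before[c] and after[c]),
        "broken": sorted(c for c in shared if before[c] and not after[c]),
    }


# Hypothetical runs before and after a one-line prompt edit.
before = {"complaint-1": False, "a": True, "b": True, "c": True, "d": True, "e": True}
after = {"complaint-1": True, "a": False, "b": False, "c": False, "d": False, "e": False}

changes = diff_runs(before, after)
print("Fixed:", changes["fixed"])
print("Broken:", changes["broken"])
```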

Takeaways

A demo proves the agent can work; an evaluation set proves it does work on inputs you do not control.
Curate 30 to 100 real cases from production logs, experts, and every past failure - spread across happy path, edge, adversarial, and regression.
Score with the cheapest method that fits: exact checks and contains-rules first, a model-graded judge only for subjective quality.
Set a pass bar, block on safety failures, and re-run the whole set on every prompt, model, and tool change.

Evaluation is also your cost and safety control

A good eval harness pays for itself beyond catching wrong answers. Because it runs the same cases through any version, it is where you safely test a cheaper model: if the smaller model holds your pass rate, you can switch and cut spend without guessing. That is the disciplined version of managing what a custom AI agent costs to build and run. It is also where adversarial cases live, so your defenses against prompt-injection attacks on production agents become tests that fail loudly rather than assumptions. And it is where the model-reliability work in handling AI hallucinations in production gets a number attached to it. The set you build to stop bad answers becomes the same instrument you use to lower cost and prove safety.
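
The model-swap decision can be written down as a rule. The sketch below is one possible version, assuming both models were run on the same set and that certain case IDs are treated as blocking. The function name, the `max_drop` tolerance, and the sample data are hypothetical.

```python
def can_switch_model(baseline: dict[str, bool], candidate: dict[str, bool],
                     blocking_ids: set[str], max_drop: float = 0.0) -> bool:
    """True if the candidate holds the pass rate and passes every blocking case."""
    if not baseline or not candidate:
        return False
    # A missing or failed blocking case rules the candidate out immediately.
    if any(not candidate.get(case_id, False) for case_id in blocking_ids):
        return False
    baseline_rate = sum(baseline.values()) / len(baseline)
    candidate_rate = sum(candidate.values()) / len(candidate)
    return candidate_rate >= baseline_rate - max_drop


# Hypothetical runs of a larger and a smaller model on the same four cases.
large = {"hp-1": True, "hp-2": True, "edge-1": False, "inject-1": True}
small = {"hp-1": True, "hp-2": True, "edge-1": False, "inject-1": True}
print(can_switch_model(large, small, blocking_ids={"inject-1"}))
```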

Want an AI agent you can ship with confidence?

Talk to a team that builds agents and the evaluation harness around them for UK and US businesses - golden sets, automated scoring, and a release gate that catches regressions before users do. No pitch, reply in 2 hrs.

Book a free call

FAQ

How many test cases do I actually need to start?

Start with 30 to 50 well-chosen cases covering your main use, your known edge cases, and any adversarial inputs that matter. Quality beats quantity early - 40 cases you curated by hand from real traffic are worth more than 400 generated ones that all look alike. Grow the set as production surfaces new failure modes.

Can I use another AI model to grade the answers?

Yes, for subjective qualities like tone, helpfulness, and faithfulness, a model-graded judge with a clear rubric works well and scales. But use it only where deterministic checks cannot - exact matches, valid structure, and contains-rules are cheaper and more reliable. And spot-check the judge against human grades, because a judge can be wrong too.
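
Spot-checking can be as simple as measuring how often the judge and a human agree on a sample. A minimal sketch, assuming both grades are recorded as pass or fail per case ID. The function name and the sample grades are hypothetical.

```python
def judge_agreement(judge: dict[str, bool], human: dict[str, bool]) -> tuple[float, list[str]]:
    """Return the agreement rate and the case IDs where judge and human differ."""
    shared = sorted(judge.keys() & human.keys())
    if not shared:
        raise ValueError("no cases graded by both the judge and a human")
    disagreements = [c for c in shared if judge[c] != human[c]]
    return 1 - len(disagreements) / len(shared), disagreements


# Hypothetical grades on a four-case sample.
judge_grades = {"tone-1": True, "tone-2": True, "help-1": False, "help-2": True}
human_grades = {"tone-1": True, "tone-2": False, "help-1": False, "help-2": True}

rate, differing = judge_agreement(judge_grades, human_grades)
print(f"Agreement: {rate:.0%}, review these cases: {differing}")
```

The disagreements are usually more useful than the rate itself, because each one points at a gap in the rubric.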

How often should the evaluation set run?

On every change that can affect behavior - prompt edits, model swaps, tool changes, and retrieval or knowledge-base updates - plus on a schedule to catch drift from upstream model updates. Tie it to your deploy process so a release that drops below the pass bar is blocked automatically rather than caught by a customer.

What is the single most valuable case to add?

The last bug. Every time the agent fails in production, fix it and add that exact input to the set as a permanent case. A suite built from real past failures is the one thing that stops you shipping the same mistake twice, and it compounds in value with every incident you feed back in.

The takeaway: stop confusing a successful demo with a tested agent. Build a golden set from real cases, score it automatically, set a bar, and run it on every change. The work is modest and it is the difference between learning your agent is broken from a dashboard you control and learning it from a customer you just lost.
