AI Summary - 20-sec read - Reviewed by experts
- AI agents break the assumption every test suite rests on: the same input gives the same output. So teams ship on vibes, then discover the regressions in production.
- An eval harness fixes this. You build a golden set of real inputs with expected behaviour, run the agent against it, and score the results automatically - a test suite for non-deterministic software.
- You do not need exact-match scoring. Grade on what matters: did it call the right tool, stay grounded in sources, use the correct format, avoid the unsafe answer - checks that tolerate wording changes but catch real failures.
- Wire the harness into CI so every prompt tweak, model upgrade, or tool change is scored against the golden set before it merges. A change that quietly breaks 8% of cases gets caught, not shipped.
- Short on time? We will build the eval harness that lets you change your agent without fear. Book a free call.
Short on time? Book a free call.
Every test suite you have ever written rests on one assumption: the same input produces the same output. AI agents quietly break it. Run the same prompt twice and you can get two different answers, both plausible, one subtly wrong. So the usual playbook - write an assertion, check it equals the expected string - falls apart, and most teams respond by not testing at all. They eyeball a few examples, decide it "seems good", and ship. Then a prompt tweak or a model upgrade silently breaks a tenth of real conversations, and the first people to notice are customers. There is a better way, and it is not exotic: an eval harness, a test suite built for software that does not give the same answer twice.
Why you cannot unit-test an agent
Traditional tests are pass or fail against a fixed expected value. An agent has no single correct string - there are many good answers to "help me return this order" and infinite bad ones. Worse, agents fail in ways a compiler never catches: they call the wrong tool, answer confidently from nothing, leak a system instruction, drift off format, or take an action they should have escalated. None of that shows up as an exception. The code runs fine; the behaviour is wrong. That is the gap an eval harness closes - it tests behaviour, not return values, and it does it at a scale your eyeballs cannot.
Start with a golden set
The foundation is a curated set of real inputs paired with what good behaviour looks like. Do not invent these at a desk - pull them from actual usage, support logs, and the edge cases that have already bitten you. A useful golden set is not huge; fifty to a few hundred well-chosen cases beat thousands of near-duplicates. Deliberately include the hard ones: the ambiguous request, the out-of-scope question the agent should decline, the prompt-injection attempt, the input where the honest answer is "I do not know". Those are where agents fail and where a demo never looks. This library becomes the thing you protect - every real failure in production gets distilled into a new golden case so the same bug can never ship twice.
Shipping agent changes on gut feel?
We will build a golden set from your real traffic and a harness that scores every change against it, so you know what a prompt or model tweak actually did before it reaches users. No pitch, reply in 2 hrs, no card needed, NDA on request.
Get a free auditScore on what matters, not exact match
The trick that makes evals work is grading on behaviour, not wording. You almost never want string equality - you want checks that tolerate rephrasing but catch real errors. In practice you combine a few kinds of scoring.
- Deterministic checks for anything with a right answer: did it call the correct tool with the right arguments, is the output valid against the schema, did it stay under the latency and cost budget, does it cite a source when it makes a claim. These are cheap, fast, and unambiguous - run them first.
- Assertion checks for content: does the answer contain the required fact, does it avoid the forbidden one, did it decline the out-of-scope request, did it refuse the injection attempt. You are testing properties of the answer, not its exact text.
- Model-graded checks for the fuzzy dimensions - tone, helpfulness, faithfulness to the retrieved context - where you use a separate model as a judge against a rubric. Powerful, but calibrate it against human judgement on a sample so you are not trusting a grader you never checked.
Grounding and faithfulness deserve special attention, because a fluent wrong answer is the most dangerous output an agent produces - the same failure mode we break down in why RAG agents hallucinate. Your harness should score whether every claim traces to a real source.
A prompt tweak can silently break a tenth of your conversations.
We will stand up an eval harness and wire it into your pipeline so every change is scored before it ships - and regressions get caught by a test, not a customer. Reply in 2 hrs, NDA on request.
Book a free callWire it into CI and watch it run
An eval you run by hand once a month is a document, not a safety net. The value arrives when the harness runs automatically on every change - a new prompt, a model version bump, a reworked tool - and reports the score against the golden set before the change can merge. Now "this refactor looks fine" becomes "this refactor passed 96% of golden cases, down from 98%, and the two new failures are both in returns handling". That is a decision you can make. Set a threshold the suite must clear to ship, and treat a drop the way you treat a failing unit test. The same golden cases then double as the seed for what you watch in production, because pre-ship evals and runtime monitoring are two ends of one discipline - the runtime half is where a well-built agent proves it still behaves once real traffic hits it.
Takeaways
- Agents break the core testing assumption - same input, same output - so eyeballing a few examples and shipping is how regressions reach production.
- Build a golden set from real traffic and past failures; fifty to a few hundred sharp cases beat thousands of duplicates.
- Score behaviour, not wording: deterministic checks for tool calls and schema, assertions for content, a calibrated model-judge for tone and faithfulness.
- Test the dangerous cases on purpose - out-of-scope questions, injection attempts, and inputs where "I do not know" is the right answer.
- Run the harness in CI with a pass threshold so every prompt, model, or tool change is scored before it merges.
Frequently asked questions
How big does the golden set need to be?
Smaller than you think. Fifty to a few hundred carefully chosen cases that cover your real intents and your known failure modes are far more useful than thousands of near-identical ones. Coverage of the hard, ambiguous, and adversarial cases matters more than raw count. Grow it by adding every genuine production failure as a new case, so the set gets sharper over time rather than just bigger.
Can I trust a model to grade another model?
For fuzzy dimensions like tone and faithfulness, yes - but only after you calibrate it. Have humans grade a sample, compare the model-judge to those grades, and tune the rubric until they agree. Then use the judge at scale for the dimensions it handles well, and keep deterministic checks for anything with a clear right answer. A model-judge you never checked is just another untested component.
How is this different from monitoring in production?
Evals run before you ship, against a fixed golden set, to catch regressions early. Monitoring runs after, against live traffic, to catch drift and real-world failures you did not anticipate. They share the same checks and the same failure taxonomy - a production failure becomes a new golden case, and a golden case becomes something you watch live. You need both; the harness stops known bugs, monitoring catches the unknown ones.
Does building an eval harness slow us down?
It speeds you up after the first week. The setup cost is real - curating the golden set and writing the scorers - but once it exists, you can change prompts, swap models, and refactor tools with a number instead of a nervous feeling. The alternative is slower: every change becomes a careful manual review, and the failures you miss cost far more to fix in production than they would have to catch in CI.
The short version: you cannot test an AI agent like ordinary code, but you can test it. Build a golden set from real inputs, score behaviour rather than exact wording, put the dangerous cases in on purpose, and run the whole thing in CI so every change is measured before it ships. Do that and you trade shipping on vibes for shipping on evidence - which is the only way to change an agent quickly without breaking it quietly.
Founder and CEO of Braincuber. Has scoped and shipped 500+ Odoo, AI, and cloud projects for US mid-market and global brands. Takes every founder call personally — no SDR layer between buyers and the people building the system.
