How to Use Promptfoo for LLM Evaluation: Complete Guide
By Braincuber Team
Published on April 22, 2026
Promptfoo is an open-source CLI that replaces manual prompt testing with structured, repeatable evaluations. You define what good output looks like, pick your models, and run every combination automatically. In this step-by-step guide, you will learn how to set up Promptfoo, build your first eval suite, and integrate it into CI/CD.
What You'll Learn:
- What LLM evaluation is and why it matters
- Installing Promptfoo and initializing a project
- Writing prompt templates and configuring providers
- Creating test cases with deterministic and model-assisted assertions
- Running evaluations and viewing results
- Comparing multiple models side by side
- Integrating Promptfoo into GitHub Actions CI/CD
LLM Evaluation in 60 Seconds
Before getting into the tool, you need to know how LLM testing works. It's different from testing regular code.
If you have tested a function before, you know the pattern: give it an input, check that the output matches what you expect. LLM outputs do not work that way. The same prompt can produce different text every time you run it, so you cannot check for an exact match.
Instead, you check for properties of the output: does it contain the right information? Does it hit the right tone? Did it respond fast enough?
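The difference can be sketched in plain Python. Instead of asserting equality, you assert properties that any acceptable output shares (the checks below are illustrative, not Promptfoo code):

```python
def check_exact(output: str) -> bool:
    # Traditional test: brittle for LLMs, since wording varies run to run.
    return output == "Your mockups are due Thursday."

def check_properties(output: str) -> bool:
    # LLM-style test: assert properties any acceptable output shares.
    has_key_fact = "thursday" in output.lower()
    reasonable_length = 20 <= len(output.split()) <= 200
    return has_key_fact and reasonable_length

sample = ("Hey team! Quick reminder that the mockups "
          "need to be finalized by Thursday. ") * 3
print(check_exact(sample))       # exact match fails on paraphrased output
print(check_properties(sample))  # property check passes
```

Every assertion type Promptfoo offers is a variation on the second function: a property check rather than an equality check.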
| Term | Definition |
|---|---|
| Provider | A model API being tested (e.g., GPT-5, Claude Sonnet 4.6) |
| Test Case | One input paired with expected behavior |
| Assertion | A rule that output must pass |
| Rubric | Grading instruction for subjective checks |
What Is Promptfoo?
Now that you know what an eval is, here is what Promptfoo does with it.
You give Promptfoo three things: your prompt templates, the models you want to test, and your test cases with assertions. It runs every prompt against every model for every test case and scores the results. One command, promptfoo eval, runs the whole thing.
Say you have one email-writing prompt, two models (GPT-5 and Claude Sonnet 4.6), and three test cases (casual, formal, urgent). Promptfoo runs all six combinations and tells you which ones passed and which ones failed.
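The run matrix is just the Cartesian product of prompts, providers, and test cases, which a few lines of Python can illustrate:

```python
from itertools import product

prompts = ["email_writer"]
providers = ["gpt-5", "claude-sonnet-4-6"]
test_cases = ["casual", "formal", "urgent"]

# Every prompt is run against every provider for every test case.
runs = list(product(prompts, providers, test_cases))
print(len(runs))  # 1 prompt x 2 providers x 3 test cases = 6 runs
for prompt, provider, case in runs:
    print(f"{provider} <- {prompt}({case})")
```

Add a third model and the matrix grows to nine runs; add a second prompt variant and it doubles. This multiplicative growth is why caching (covered below) matters.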
Open Source
MIT-licensed CLI with support for dozens of model providers. Acquired by OpenAI in March 2026.
350K+ Developers
Used by teams at over 25% of Fortune 500 companies for prompt testing.
Setting Up Your Promptfoo Environment
Install Promptfoo globally and initialize a new project:
```bash
npm install -g promptfoo
mkdir email-writer-eval
cd email-writer-eval
promptfoo init
```
The init command walks you through an interactive setup. It asks what you would like to do (choose "Not sure yet") and which model provider to use (choose "OpenAI GPT 5, GPT 4.1").
API Keys Required
You need at least one API key to run evals. Get an OpenAI key from the OpenAI console and an Anthropic key from the Anthropic console.
Set your API keys as environment variables:
```bash
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
```
If you only set ANTHROPIC_API_KEY and skip OpenAI, Promptfoo automatically uses Claude as the grading provider for model-assisted assertions like llm-rubric.
Building Your First Evaluation
Here is the task: given bullet points and a tone (casual, formal, or urgent), write an email. You will build an eval that tests this across two models.
Prompt and Providers
Delete the generated sample and create a new promptfooconfig.yaml. Start with the prompt and providers:
```yaml
description: "Email writer evaluation"

prompts:
  - |
    Draft an email based on these bullet points.
    Match the specified tone throughout the email.

    Bullet points:
    {{bullet_points}}

    Tone: {{tone}}

providers:
  - id: openai:chat:gpt-5
    label: "GPT-5"
  - id: anthropic:messages:claude-sonnet-4-6
    label: "Claude Sonnet 4.6"
```
The prompt template has two placeholders: {{bullet_points}} and {{tone}}. Each test case fills these with different values. The label field on each provider gives you readable column headers in the results view.
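Before each request, Promptfoo substitutes the test case's vars into the template (Promptfoo uses Nunjucks templating; the naive plain-Python substitution below only illustrates the idea):

```python
TEMPLATE = """Draft an email based on these bullet points.
Match the specified tone throughout the email.

Bullet points:
{{bullet_points}}

Tone: {{tone}}"""

def render(template: str, variables: dict) -> str:
    # Naive placeholder substitution -- real Nunjucks also supports
    # filters, conditionals, and loops.
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", value)
    return template

filled = render(TEMPLATE, {
    "bullet_points": "- Finalize mockups by Thursday",
    "tone": "casual",
})
print(filled)
```

The rendered string is what actually gets sent to each provider, so the same template produces three different requests across the three test cases.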
defaultTest Block
Add a defaultTest block. Assertions inside defaultTest apply to every test case automatically:
```yaml
defaultTest:
  assert:
    - type: latency
      threshold: 30000
```
This fails any response slower than 30 seconds. Frontier models like GPT-5 can take 10-20 seconds per request due to reasoning tokens.
Test Cases
Now add the test cases. Each one provides different inputs and its own assertions:
```yaml
tests:
  - vars:
      bullet_points: |
        - Recap of the design review decisions
        - Next steps: finalize mockups by Thursday
        - Ask if anyone has questions
      tone: "casual"
    assert:
      - type: icontains
        value: "mockups"
      - type: llm-rubric
        value: "The email uses a casual tone with contractions and short sentences"
  - vars:
      bullet_points: |
        - Q1 revenue exceeded targets by 12%
        - New enterprise client onboarded
        - Hiring plan for Q2 approved
      tone: "formal"
    assert:
      - type: icontains
        value: "Q1"
      - type: llm-rubric
        value: "The email maintains a formal, professional tone throughout"
  - vars:
      bullet_points: |
        - API migration deadline is Friday at 5pm
        - Three endpoints still need updating
        - Downtime window is Saturday 2-6am
      tone: "urgent"
    assert:
      - type: icontains
        value: "Friday"
      - type: llm-rubric
        value: "The email conveys urgency with direct language and clear action items"
```
Each test case pairs two types of assertions. icontains is a simple string check: did the output include "mockups", case-insensitive? It is fast, free, and does not call any API.
llm-rubric sends the output to another LLM and asks it to grade the response against your rubric. It costs tokens, but it catches things string matching cannot, like whether an email actually sounds casual.
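Conceptually, llm-rubric wraps the output and your rubric in a grading prompt, sends it to the grader model, and parses a structured verdict. The stubbed sketch below shows that loop; the grading-prompt wording, response format, and `call_grader` stub are illustrative, not Promptfoo's actual internals:

```python
import json

def build_grading_prompt(output: str, rubric: str) -> str:
    # Ask the grader model for a machine-parseable verdict.
    return (
        "Grade the following output against the rubric. "
        'Reply with JSON: {"pass": bool, "reason": str}.\n'
        f"Rubric: {rubric}\nOutput: {output}"
    )

def call_grader(prompt: str) -> str:
    # Stub standing in for a real API call to the grading model.
    return '{"pass": true, "reason": "Uses contractions and short sentences."}'

def grade(output: str, rubric: str) -> dict:
    return json.loads(call_grader(build_grading_prompt(output, rubric)))

result = grade("Hey team, we'll wrap this up Thursday!",
               "The email uses a casual tone")
print(result["pass"], "-", result["reason"])
```

Because the grader is itself an LLM, a precise rubric matters: it becomes part of the grading prompt verbatim.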
Running the Evaluation
Run the evaluation and view the results:
```bash
promptfoo eval
promptfoo view
```
The web UI shows providers as columns and test cases as rows. Each cell shows pass/fail for every assertion, and you can click into any cell to see the full output and grading details.
Caching
Promptfoo caches API responses to disk by default (14-day TTL), so re-running the same eval costs nothing. Use --no-cache when you want fresh responses.
Writing Assertions
The first eval used icontains and llm-rubric. Those are two of many assertion types Promptfoo supports.
| Type | What It Checks |
|---|---|
| contains / icontains | Output includes a substring (case-sensitive or not) |
| regex | Output matches a pattern |
| not-contains | Output excludes something |
| latency | Response arrived under N milliseconds |
| cost | Response costs less than $X |
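Deterministic assertions stack cleanly, so a single test case can combine several of them. A sketch (the pattern, substring, and thresholds below are arbitrary examples):

```yaml
assert:
  - type: icontains
    value: "Friday"
  - type: regex
    value: "\\b(5\\s?pm|17:00)\\b"
  - type: not-contains
    value: "TODO"
  - type: latency
    threshold: 30000
  - type: cost
    threshold: 0.01
```

All of these run locally against the captured response, so they add no token cost to the eval.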
Model-Assisted Assertions
These cost tokens, but can judge things that string matching cannot. You have already used llm-rubric. The rubric you write makes or breaks it.
A vague rubric like "The email sounds professional" tells the grader nothing useful. A specific one gives it something to measure:
```yaml
- type: llm-rubric
  value: >-
    The email uses a casual tone: contractions like 'we'll' and 'don't',
    sentences under 20 words, no corporate jargon like 'synergy' or
    'circle back', and opens with a greeting like 'Hey' or 'Hi team'
```
Custom Python Assertions
When built-in types do not cover your logic, write your own. Here is an inline assertion that passes if the email is between 50 and 200 words:
```yaml
- type: python
  value: "50 <= len(output.split()) <= 200"
```
For more complex checks, put the logic in a separate file:
```python
def get_assert(output, context):
    word_count = len(output.split())
    in_range = 50 <= word_count <= 200
    return {
        "pass": in_range,
        "score": 1.0 if in_range else 0.0,
        "reason": f"Word count: {word_count} (target: 50-200)",
    }
```
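Assuming the function lives in a file such as word_count.py (filename hypothetical), reference it from the config with a file:// path:

```yaml
assert:
  - type: python
    value: file://word_count.py
```

Promptfoo calls get_assert once per response and uses the returned pass/score/reason in the results view.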
Comparing Models Side by Side
The config already has two providers, so promptfoo eval tested GPT-5 and Claude Sonnet 4.6 in a single pass. Open promptfoo view to see them as separate columns in the results.
LLM outputs are not deterministic, though. The same prompt can produce different results on consecutive runs. The --repeat flag accounts for this:
```bash
promptfoo eval --repeat 3
```
This runs each test case three times per provider. If a model passes the tone assertion twice but fails on the third run, that is a reliability signal you would miss with a single pass.
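If you want to reason about repeat results yourself, the per-provider pass rate is just passes over runs. A toy calculation with hypothetical results:

```python
# Hypothetical per-run results for one test case across --repeat 3.
runs = {
    "gpt-5": [True, True, True],
    "claude-sonnet-4-6": [True, False, True],
}

rates = {provider: sum(results) / len(results)
         for provider, results in runs.items()}
for provider, rate in rates.items():
    print(f"{provider}: {rate:.0%} pass rate over 3 runs")
```

A provider that passes 2 of 3 runs is not equivalent to one that passes 3 of 3, even though both "pass" in a single-run eval.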
From Local Tests to CI/CD
Running evals locally works during development, but it depends on whoever changes the prompt remembering to run them. Promptfoo has an official GitHub Action that removes that dependency by running your eval suite on every pull request.
Create a workflow file:
```bash
mkdir -p .github/workflows
```
Create .github/workflows/prompt-eval.yml:
```yaml
name: 'Prompt Evaluation'

on:
  pull_request:
    paths:
      - 'prompts/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4

      - name: Set up promptfoo cache
        uses: actions/cache@v4
        with:
          path: |
            ~/.promptfoo/cache
            .promptfoo-cache
          key: ${{ runner.os }}-promptfoo-${{ hashFiles('prompts/**') }}-${{ github.sha }}

      - name: Run promptfoo evaluation
        uses: promptfoo/promptfoo-action@v1
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
          config: 'promptfooconfig.yaml'
          cache-path: '.promptfoo-cache'
```
API Keys
Add your OPENAI_API_KEY (and optionally ANTHROPIC_API_KEY) as repository secrets in your GitHub repo under Settings > Secrets and variables > Actions.
The workflow from here:

1. **Change a prompt.** Modify your prompt template in the YAML file.
2. **Open a PR.** Create a pull request with your changes.
3. **CI runs the eval.** GitHub Actions automatically runs your evaluation suite.
4. **Review results.** Results appear as a PR comment with a link to the web viewer.
5. **Fix if needed.** Address any failures and re-run.
6. **Merge when passing.** Prompt changes get the same test-before-merge treatment as code changes.
Next Steps
When you are ready to go further, here are areas worth exploring:
Red Teaming
promptfoo redteam run scans for prompt injection and jailbreaks across dozens of attack plugins.
Custom Providers
Wrap any internal model or fine-tuned endpoint with file://my_provider.py.
CSV Test Data
Scale your test suite with file://tests.csv when inline YAML gets unwieldy.
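For example, the three inline test cases above could become rows in a tests.csv (filename hypothetical), referenced from the config with `tests: file://tests.csv`. Each column maps to a template variable, and the special `__expected` column holds an assertion:

```csv
bullet_points,tone,__expected
"- Finalize mockups by Thursday",casual,icontains:mockups
"- Q1 revenue exceeded targets by 12%",formal,icontains:Q1
"- API migration deadline is Friday",urgent,icontains:Friday
```

This keeps the YAML config short while the test data grows to hundreds of rows.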
Frequently Asked Questions
What is Promptfoo, and what problem does it solve?
Promptfoo is an open-source CLI for testing LLM outputs before they reach users. Instead of manually trying a few inputs and eyeballing results, you define assertions for what good output looks like, pick your models, and run every combination automatically.
How do you set up and run your first Promptfoo evaluation?
Install Promptfoo with npm install -g promptfoo, run promptfoo init to scaffold a project, set your API keys, then write a promptfooconfig.yaml with your prompts, providers, and test cases. Run promptfoo eval to execute.
What assertion types does Promptfoo support?
Promptfoo has three tiers: deterministic assertions (contains, regex, latency, cost) are free and instant; model-assisted assertions (llm-rubric, answer-relevance) send output to another LLM for judgment; custom Python assertions let you write any check as inline code or a separate file.
How do you compare multiple models on the same test suite?
List multiple providers in your promptfooconfig.yaml and run promptfoo eval once. Promptfoo tests every model against every test case and shows them as separate columns. Use the --repeat flag to catch inconsistencies from non-deterministic outputs.
How does Promptfoo fit into a CI/CD pipeline?
Promptfoo has an official GitHub Action that runs your eval suite on every pull request touching prompt files. It posts results as a PR comment with a link to the web viewer. Add your API keys as repository secrets and configure a paths filter.
Need Help with LLM Evaluation?
Our experts can help you set up Promptfoo, build test suites, and integrate automated evaluations into your workflow.
