How to Use Promptfoo for LLM Evaluation: Complete Guide
By Braincuber Team
Published on April 22, 2026
Promptfoo is an open-source CLI that replaces manual prompt testing with structured, repeatable evaluations. You define what good output looks like, pick your models, and run every combination automatically. In this step-by-step guide, you will learn how to set up Promptfoo, build your first eval suite, and integrate it into CI/CD.
What You'll Learn:
- What LLM evaluation is and why it matters
- Installing Promptfoo and initializing a project
- Writing prompt templates and configuring providers
- Creating test cases with deterministic and model-assisted assertions
- Running evaluations and viewing results
- Comparing multiple models side by side
- Integrating Promptfoo into GitHub Actions CI/CD
LLM Evaluation in 60 Seconds
Before getting into the tool, you need to know how LLM testing works. It's different from testing regular code.
If you have tested a function before, you know the pattern: give it an input, check that the output matches what you expect. LLM outputs do not work that way. The same prompt can produce different text every time you run it, so you cannot check for an exact match.
Instead, you check for properties of the output: does it contain the right information? Does it hit the right tone? Did it respond fast enough?
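The difference can be sketched in plain Python. Instead of asserting equality, you assert properties that any acceptable output shares (the checks below are illustrative, not Promptfoo code):

```python
def check_exact(output: str) -> bool:
    # Traditional test: brittle for LLMs, since wording varies run to run.
    return output == "Your mockups are due Thursday."

def check_properties(output: str) -> bool:
    # LLM-style test: assert properties any acceptable output shares.
    has_key_fact = "thursday" in output.lower()
    reasonable_length = 20 <= len(output.split()) <= 200
    return has_key_fact and reasonable_length

sample = ("Hey team! Quick reminder that the mockups "
          "need to be finalized by Thursday. ") * 3
print(check_exact(sample))       # exact match fails on paraphrased output
print(check_properties(sample))  # property check passes
```

Every assertion type Promptfoo offers is a variation on the second function: a property check rather than an equality check.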
| Term | Definition |
|---|---|
| Provider | A model API being tested (e.g., GPT-5, Claude Sonnet 4.6) |
| Test Case | One input paired with expected behavior |
| Assertion | A rule that output must pass |
| Rubric | Grading instruction for subjective checks |
What Is Promptfoo?
Now that you know what an eval is, here is what Promptfoo does with it.
You give Promptfoo three things: your prompt templates, the models you want to test, and your test cases with assertions. It runs every prompt against every model for every test case and scores the results. One command, promptfoo eval, runs the whole thing.
Say you have one email-writing prompt, two models (GPT-5 and Claude Sonnet 4.6), and three test cases (casual, formal, urgent). Promptfoo runs all six combinations and tells you which ones passed and which ones failed.
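The run matrix is just the Cartesian product of prompts, providers, and test cases, which a few lines of Python can illustrate:

```python
from itertools import product

prompts = ["email_writer"]
providers = ["gpt-5", "claude-sonnet-4-6"]
test_cases = ["casual", "formal", "urgent"]

# Every prompt is run against every provider for every test case.
runs = list(product(prompts, providers, test_cases))
print(len(runs))  # 1 prompt x 2 providers x 3 test cases = 6 runs
for prompt, provider, case in runs:
    print(f"{provider} <- {prompt}({case})")
```

Add a third model and the matrix grows to nine runs; add a second prompt variant and it doubles. This multiplicative growth is why caching (covered below) matters.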
Open Source
MIT-licensed CLI with support for dozens of model providers. Acquired by OpenAI in March 2026.
350K+ Developers
Used by teams at over 25% of Fortune 500 companies for prompt testing.
Setting Up Your Promptfoo Environment
Install Promptfoo globally and initialize a new project:
```bash
npm install -g promptfoo
mkdir email-writer-eval
cd email-writer-eval
promptfoo init
```
The init command walks you through an interactive setup. It asks what you would like to do (choose "Not sure yet") and which model provider to use (choose "OpenAI GPT 5, GPT 4.1").
API Keys Required
You need at least one API key to run evals. Get an OpenAI key from the OpenAI console and an Anthropic key from the Anthropic console.
Set your API keys as environment variables:
```bash
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
```
If you only set ANTHROPIC_API_KEY and skip OpenAI, Promptfoo automatically uses Claude as the grading provider for model-assisted assertions like llm-rubric.
Building Your First Evaluation
Here is the task: given bullet points and a tone (casual, formal, or urgent), write an email. You will build an eval that tests this across two models.
Prompt and Providers
Delete the generated sample and create a new promptfooconfig.yaml. Start with the prompt and providers:
```yaml
description: "Email writer evaluation"

prompts:
  - |
    Draft an email based on these bullet points.
    Match the specified tone throughout the email.

    Bullet points:
    {{bullet_points}}

    Tone: {{tone}}

providers:
  - id: openai:chat:gpt-5
    label: "GPT-5"
  - id: anthropic:messages:claude-sonnet-4-6
    label: "Claude Sonnet 4.6"
```
The prompt template has two placeholders: {{bullet_points}} and {{tone}}. Each test case fills these with different values. The label field on each provider gives you readable column headers in the results view.
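Before each request, Promptfoo substitutes the test case's vars into the template (Promptfoo uses Nunjucks templating; the naive plain-Python substitution below only illustrates the idea):

```python
TEMPLATE = """Draft an email based on these bullet points.
Match the specified tone throughout the email.

Bullet points:
{{bullet_points}}

Tone: {{tone}}"""

def render(template: str, variables: dict) -> str:
    # Naive placeholder substitution -- real Nunjucks also supports
    # filters, conditionals, and loops.
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", value)
    return template

filled = render(TEMPLATE, {
    "bullet_points": "- Finalize mockups by Thursday",
    "tone": "casual",
})
print(filled)
```

The rendered string is what actually gets sent to each provider, so the same template produces three different requests across the three test cases.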
defaultTest Block
Add a defaultTest block. Assertions inside defaultTest apply to every test case automatically:
```yaml
defaultTest:
  assert:
    - type: latency
      threshold: 30000
```
This fails any response slower than 30 seconds. Frontier models like GPT-5 can take 10-20 seconds per request due to reasoning tokens.
Test Cases
Now add the test cases. Each one provides different inputs and its own assertions:
```yaml
tests:
  - vars:
      bullet_points: |
        - Recap of the design review decisions
        - Next steps: finalize mockups by Thursday
        - Ask if anyone has questions
      tone: "casual"
    assert:
      - type: icontains
        value: "mockups"
      - type: llm-rubric
        value: "The email uses a casual tone with contractions and short sentences"
  - vars:
      bullet_points: |
        - Q1 revenue exceeded targets by 12%
        - New enterprise client onboarded
        - Hiring plan for Q2 approved
      tone: "formal"
    assert:
      - type: icontains
        value: "Q1"
      - type: llm-rubric
        value: "The email maintains a formal, professional tone throughout"
  - vars:
      bullet_points: |
        - API migration deadline is Friday at 5pm
        - Three endpoints still need updating
        - Downtime window is Saturday 2-6am
      tone: "urgent"
    assert:
      - type: icontains
        value: "Friday"
      - type: llm-rubric
        value: "The email conveys urgency with direct language and clear action items"
```
Each test case pairs two types of assertions. icontains is a simple string check: did the output include "mockups", case-insensitive? It is fast, free, and does not call any API.
llm-rubric sends the output to another LLM and asks it to grade the response against your rubric. It costs tokens, but it catches things string matching cannot, like whether an email actually sounds casual.
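Conceptually, llm-rubric wraps the output and your rubric in a grading prompt, sends it to the grader model, and parses a structured verdict. The stubbed sketch below shows that loop; the grading-prompt wording, response format, and `call_grader` stub are illustrative, not Promptfoo's actual internals:

```python
import json

def build_grading_prompt(output: str, rubric: str) -> str:
    # Ask the grader model for a machine-parseable verdict.
    return (
        "Grade the following output against the rubric. "
        'Reply with JSON: {"pass": bool, "reason": str}.\n'
        f"Rubric: {rubric}\nOutput: {output}"
    )

def call_grader(prompt: str) -> str:
    # Stub standing in for a real API call to the grading model.
    return '{"pass": true, "reason": "Uses contractions and short sentences."}'

def grade(output: str, rubric: str) -> dict:
    return json.loads(call_grader(build_grading_prompt(output, rubric)))

result = grade("Hey team, we'll wrap this up Thursday!",
               "The email uses a casual tone")
print(result["pass"], "-", result["reason"])
```

Because the grader is itself an LLM, a precise rubric matters: it becomes part of the grading prompt verbatim.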
Running the Evaluation
Run the evaluation and view the results:
```bash
promptfoo eval
promptfoo view
```
The web UI shows providers as columns and test cases as rows. Each cell shows pass/fail for every assertion, and you can click into any cell to see the full output and grading details.
Caching
Promptfoo caches API responses to disk by default (14-day TTL), so re-running the same eval costs nothing. Use --no-cache when you want fresh responses.
Writing Assertions
The first eval used icontains and llm-rubric. Those are two of many assertion types Promptfoo supports.
| Type | What It Checks |
|---|---|
| contains / icontains | Output includes a substring (case-sensitive or not) |
| regex | Output matches a pattern |
| not-contains | Output excludes something |
| latency | Response arrived under N milliseconds |
| cost | Response costs less than $X |
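Deterministic assertions stack cleanly, so a single test case can combine several of them. A sketch (the pattern, substring, and thresholds below are arbitrary examples):

```yaml
assert:
  - type: icontains
    value: "Friday"
  - type: regex
    value: "\\b(5\\s?pm|17:00)\\b"
  - type: not-contains
    value: "TODO"
  - type: latency
    threshold: 30000
  - type: cost
    threshold: 0.01
```

All of these run locally against the captured response, so they add no token cost to the eval.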
Model-Assisted Assertions
These cost tokens, but can judge things that string matching cannot. You have already used llm-rubric. The rubric you write makes or breaks it.
A vague rubric like "The email sounds professional" tells the grader nothing useful. A specific one gives it something to measure:
```yaml
- type: llm-rubric
  value: >-
    The email uses a casual tone: contractions like 'we'll' and 'don't',
    sentences under 20 words, no corporate jargon like 'synergy' or
    'circle back', and opens with a greeting like 'Hey' or 'Hi team'
```
Custom Python Assertions
When built-in types do not cover your logic, write your own. Here is an inline assertion that passes if the email is between 50 and 200 words:
```yaml
- type: python
  value: "50 <= len(output.split()) <= 200"
```
For more complex checks, put the logic in a separate file:
```python
def get_assert(output, context):
    word_count = len(output.split())
    in_range = 50 <= word_count <= 200
    return {
        "pass": in_range,
        "score": 1.0 if in_range else 0.0,
        "reason": f"Word count: {word_count} (target: 50-200)",
    }
```
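Assuming the function lives in a file such as word_count.py (filename hypothetical), reference it from the config with a file:// path:

```yaml
assert:
  - type: python
    value: file://word_count.py
```

Promptfoo calls get_assert once per response and uses the returned pass/score/reason in the results view.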
Comparing Models Side by Side
The config already has two providers, so promptfoo eval tested GPT-5 and Claude Sonnet 4.6 in a single pass. Open promptfoo view to see them as separate columns in the results.
LLM outputs are not deterministic, though. The same prompt can produce different results on consecutive runs. The --repeat flag accounts for this:
```bash
promptfoo eval --repeat 3
```
This runs each test case three times per provider. If a model passes the tone assertion twice but fails on the third run, that is a reliability signal you would miss with a single pass.
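If you want to reason about repeat results yourself, the per-provider pass rate is just passes over runs. A toy calculation with hypothetical results:

```python
# Hypothetical per-run results for one test case across --repeat 3.
runs = {
    "gpt-5": [True, True, True],
    "claude-sonnet-4-6": [True, False, True],
}

rates = {provider: sum(results) / len(results)
         for provider, results in runs.items()}
for provider, rate in rates.items():
    print(f"{provider}: {rate:.0%} pass rate over 3 runs")
```

A provider that passes 2 of 3 runs is not equivalent to one that passes 3 of 3, even though both "pass" in a single-run eval.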
From Local Tests to CI/CD
Running evals locally works during development, but it depends on whoever changes the prompt remembering to run them. Promptfoo has an official GitHub Action that removes that dependency by running your eval suite on every pull request.
Create a workflow file:
```bash
mkdir -p .github/workflows
```
Create .github/workflows/prompt-eval.yml:
```yaml
name: 'Prompt Evaluation'

on:
  pull_request:
    paths:
      - 'prompts/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4

      - name: Set up promptfoo cache
        uses: actions/cache@v4
        with:
          path: |
            ~/.promptfoo/cache
            .promptfoo-cache
          key: ${{ runner.os }}-promptfoo-${{ hashFiles('prompts/**') }}-${{ github.sha }}

      - name: Run promptfoo evaluation
        uses: promptfoo/promptfoo-action@v1
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
          config: 'promptfooconfig.yaml'
          cache-path: '.promptfoo-cache'
```
API Keys
Add your OPENAI_API_KEY (and optionally ANTHROPIC_API_KEY) as repository secrets in your GitHub repo under Settings > Secrets and variables > Actions.
The workflow from here:

1. **Change a prompt.** Modify your prompt template in the YAML file.
2. **Open a PR.** Create a pull request with your changes.
3. **CI runs the eval.** GitHub Actions automatically runs your evaluation suite.
4. **Review results.** Results appear as a PR comment with a link to the web viewer.
5. **Fix if needed.** Address any failures and re-run.
6. **Merge when passing.** Prompt changes get the same test-before-merge treatment as code changes.
Next Steps
When you are ready to go further, here are areas worth exploring:
Red Teaming
promptfoo redteam run scans for prompt injection and jailbreaks across dozens of attack plugins.
Custom Providers
Wrap any internal model or fine-tuned endpoint with file://my_provider.py.
CSV Test Data
Scale your test suite with file://tests.csv when inline YAML gets unwieldy.
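For example, the three inline test cases above could become rows in a tests.csv (filename hypothetical), referenced from the config with `tests: file://tests.csv`. Each column maps to a template variable, and the special `__expected` column holds an assertion:

```csv
bullet_points,tone,__expected
"- Finalize mockups by Thursday",casual,icontains:mockups
"- Q1 revenue exceeded targets by 12%",formal,icontains:Q1
"- API migration deadline is Friday",urgent,icontains:Friday
```

This keeps the YAML config short while the test data grows to hundreds of rows.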
Frequently Asked Questions
What is Promptfoo, and what problem does it solve?
Promptfoo is an open-source CLI for testing LLM outputs before they reach users. Instead of manually trying a few inputs and eyeballing results, you define assertions for what good output looks like, pick your models, and run every combination automatically.
How do you set up and run your first Promptfoo evaluation?
Install Promptfoo with npm install -g promptfoo, run promptfoo init to scaffold a project, set your API keys, then write a promptfooconfig.yaml with your prompts, providers, and test cases. Run promptfoo eval to execute.
What assertion types does Promptfoo support?
Promptfoo has three tiers: deterministic assertions (contains, regex, latency, cost) are free and instant; model-assisted assertions (llm-rubric, answer-relevance) send output to another LLM for judgment; custom Python assertions let you write any check as inline code or a separate file.
How do you compare multiple models on the same test suite?
List multiple providers in your promptfooconfig.yaml and run promptfoo eval once. Promptfoo tests every model against every test case and shows them as separate columns. Use the --repeat flag to catch inconsistencies from non-deterministic outputs.
How does Promptfoo fit into a CI/CD pipeline?
Promptfoo has an official GitHub Action that runs your eval suite on every pull request touching prompt files. It posts results as a PR comment with a link to the web viewer. Add your API keys as repository secrets and configure a paths filter.
Need Help with LLM Evaluation?
Our experts can help you set up Promptfoo, build test suites, and integrate automated evaluations into your workflow.
