How to Use AutoResearch by Karpathy: Complete Guide
By Braincuber Team
Published on April 22, 2026
AutoResearch is an open-source tool by Andrej Karpathy that runs ML experiments in a loop, keeping only the changes that beat the current best result. You describe research directions in a markdown file, point an AI coding agent at the repo, and walk away. By morning, you have a git history of validated improvements.
What You'll Learn:
- What AutoResearch is and how it differs from AutoML
- The three-file architecture and ownership rules
- How the ratchet loop works step by step
- Getting started with your first overnight run
- Results achieved and limitations of the approach
- When to use AutoResearch in your workflow
What is AutoResearch?
AutoResearch is an open-source Python tool that lets an AI agent run ML experiments on a single GPU without human intervention. It loops through propose-train-evaluate cycles, keeping only changes that improve validation loss. The project ships under an MIT license.
This is not hyperparameter tuning. Tools like Optuna or Ray Tune search a predefined parameter space. AutoResearch gives the agent freedom to modify arbitrary code. The search space is whatever the LLM can think of.
| Tool | Type | Search Space |
|---|---|---|
| Optuna / Ray Tune | Hyperparameter tuning | Predefined parameters |
| AutoML / NAS | Architecture search | Fixed algorithms |
| AlphaEvolve | Evolutionary code search | LLM-proposed code edits (closed-source) |
| AutoResearch | AI Agent | Unlimited (LLM knowledge) |
From vibe coding to research advisor
Karpathy frames this as a natural progression in how engineers work with AI. The progression goes: vibe coding (human prompts, AI writes code, human reviews) to agentic engineering (human orchestrates agents in real time) to fully independent research (human sets direction, agent runs on its own).
Each step reduces the human role from writer to director to research advisor.
Key Insight
21,000+ GitHub stars in days. AutoResearch picked up massive interest after release on March 7, 2026.
AutoResearch Three-File Architecture
AutoResearch's design comes down to a contract between three files, each with strict rules about who can touch it.
prepare.py
IMMUTABLE - Handles data preparation and evaluation, and defines the validation metric (val_bpb). Neither the human nor the agent modifies it.
train.py
AGENT SANDBOX - Contains the GPT model, optimizer, and training loop. The agent can rewrite anything here, as long as the run produces a val_bpb score.
program.md
HUMAN WRITTEN - Research directions, constraints, and experiment rules. The only file the human author touches.
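The train.py side of this contract can be sketched minimally. The names and the placeholder score below are illustrative assumptions, not the repo's actual code; the one hard requirement is that a run ends by emitting a val_bpb score the loop can read.

```python
# Minimal sketch of the train.py contract (illustrative, not the repo's
# actual code): whatever the agent rewrites inside, a run must finish by
# reporting a val_bpb score for the ratchet decision.

def train(budget_seconds: int = 300) -> float:
    """Stand-in for the real training loop, which would train a small GPT
    for the fixed wall-clock budget and evaluate on the validation set."""
    val_bpb = 0.9979  # placeholder; real code computes this from eval data
    return val_bpb

if __name__ == "__main__":
    print(f"val_bpb: {train():.6f}")  # the line the outer loop parses
```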
What program.md controls
The file is more specific than most people expect. It hardcodes baseline metrics so the agent knows what to beat. It specifies exact commands for running experiments and extracting results. It tells the agent how to handle failures.
| Setting | Value |
|---|---|
| Baseline val_bpb | 0.997900 |
| Peak VRAM | 45 GB |
| Max experiment time | 10 minutes |
| Training budget | 5 minutes |
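Pulling the table's values together, a program.md along these lines would express that contract. This is a hypothetical sketch, not the repository's actual file:

```markdown
# program.md — hypothetical sketch

## Baseline
- val_bpb to beat: 0.997900

## Constraints
- Peak VRAM: 45 GB
- Max wall-clock per experiment: 10 minutes (training budget: 5 minutes)

## Rules
- Run experiments with `uv run train.py`; read val_bpb from the final line of stdout
- On crash: log the failure to results.tsv and revert the commit

## Directions
- Try attention variants, optimizer schedules, normalization placement
```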
AutoResearch Ratchet Loop
The core of AutoResearch is an experiment cycle that runs without human input. Here is how a single iteration works:
Read program.md
Agent reads program.md to understand current research priorities and constraints.
Examine current state
Examines train.py and recent results in results.tsv.
Propose hypothesis
Proposes an architecture change, optimizer adjustment, or training modification.
Modify train.py
Implements the proposed change in train.py.
Commit to git
Commits the change to a git branch.
Train for 5 minutes
Runs training for exactly 5 minutes - fixed wall-clock budget.
Evaluate result
If training crashes, logs failure and reverts. Otherwise evaluates val_bpb.
Ratchet decision
If val_bpb improved: commit stays. If not: git reset reverts.
Repeat
Starts again from step 1. At 5 min/experiment, runs ~12 experiments per hour.
The "ratchet" name comes from git history. Each successful experiment adds a commit; each failure gets reverted. The codebase can only move forward, never backward, accumulating validated improvements.
Getting Started With AutoResearch
Running AutoResearch requires a machine with an NVIDIA GPU (20+ GB VRAM recommended), Python 3.10+, the uv package manager, and a coding agent.
```shell
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
uv sync
uv run prepare.py
```
There is no orchestration script. The README says to "simply spin up your Claude/Codex or whatever you want in this repo." You open a coding agent in the project directory, prompt it to read program.md, and the agent runs the experiment loop.
Hardware Requirements
The default config targets modern GPUs with 20+ GB of VRAM. For smaller hardware, switch to the TinyStories dataset and reduce the vocabulary size to 256 and the depth to 4.
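The downscaled settings might look like the sketch below. The config field names are hypothetical; the real train.py may organize its settings differently.

```python
# Hypothetical config downscaling for small GPUs -- field names are
# illustrative assumptions, mirroring the reductions described above.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ModelConfig:
    dataset: str = "fineweb"
    vocab_size: int = 50304
    depth: int = 12

default = ModelConfig()
# Small-hardware variant: TinyStories, byte-level vocab, shallow network.
small = replace(default, dataset="tinystories", vocab_size=256, depth=4)
```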
AutoResearch Results
| Run | Experiments | Improvements | Result |
|---|---|---|---|
| Initial overnight | 83 | 15 | val_bpb: 1.000 → 0.975 |
| Extended 2-day | ~700 | ~20 | All additive |
| Community session | 126 | - | val_bpb: 0.9979 → 0.9697 |
| Shopify production | 37 | Multiple | 19% improvement (0.8B model) |
The Creativity Ceiling
GitHub Issue #22 captures the main limitation. The ratchet only accepts changes that immediately improve val_bpb, so the agent can never take a step backward to set up a larger gain. Human researchers reason "it will get worse before it gets better." The ratchet has no room for that.
Karpathy acknowledges the agent feels "cagey and scared" on open-ended problems, attributing this to RLHF training, which rewards safe, conservative outputs over bold experimentation.
Fixed 5-minute window
Changes that show value quickly get found; changes that would only prove out over longer training runs stay invisible.
Local search pattern
The agent cycles through minor variations, getting stuck in local optima rather than exploring novel architectures.
When to Use AutoResearch
The three-file contract transfers beyond LLM training to any domain with an automatic scoring function: search ranking, product categorization, fraud scoring, intent classification.
When experiments run 100x faster than a human can manage, the eval pipeline becomes the constraint. Teams need eval sets that evolve alongside production data.
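In another domain, the frozen evaluator (the prepare.py role) just needs to own one automatic metric. A sketch for search ranking, with an illustrative NDCG@k as the frozen score, under the assumption that relevance labels come from a held-out eval set:

```python
# Sketch of the immutable-evaluator role outside LLM training: a frozen
# scoring function the agent must beat but may never edit. The metric
# choice (NDCG@k) and data shape are illustrative assumptions.
import math

def ndcg_at_k(relevances: list[float], k: int = 3) -> float:
    """NDCG@k over the relevance labels of a ranked result list."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))
    ideal = sum(r / math.log2(i + 2)
                for i, r in enumerate(sorted(relevances, reverse=True)[:k]))
    return dcg / ideal if ideal else 0.0
```

A perfectly ordered list scores 1.0; the agent's rewrites to the ranker are kept or reverted against this number, exactly as train.py's changes are judged against val_bpb.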
Bottom Line
Writing a good program.md requires having done the research yourself. The agent handles execution, but the judgment behind the research agenda remains human.
Frequently Asked Questions
What is AutoResearch?
AutoResearch is an open-source Python tool by Andrej Karpathy that lets an AI coding agent run ML experiments on a single GPU without human intervention.
How does the AutoResearch ratchet loop work?
The agent reads program.md for direction, modifies train.py, commits, runs training for 5 minutes, and evaluates val_bpb. If improved, commit stays; if not, git reset reverts.
What is the three-file architecture?
prepare.py is immutable (evaluation), train.py is agent sandbox (modifiable), program.md is human-written (research directions). Each has strict ownership rules.
What are AutoResearch limitations?
The main limitation is the creativity ceiling. The ratchet cannot take a step backward for larger gains, tends to cycle through minor variations, and is constrained by the fixed 5-minute training window.
How does AutoResearch compare to AutoML?
AutoML and NAS search predefined parameter spaces. AutoResearch gives the LLM freedom to modify arbitrary code, betting on general knowledge rather than constrained search spaces.
Need Help with AutoResearch?
Our experts can help you set up automated ML experiments, configure your first overnight run, and integrate AI agents into your research workflow.
