How to Use AutoResearch by Karpathy: Complete Guide
By Braincuber Team
Published on April 22, 2026
AutoResearch is an open-source tool by Andrej Karpathy that runs ML experiments in a loop, keeping only the changes that beat the current best result. You describe research directions in a markdown file, point an AI coding agent at the repo, and walk away. By morning, you have a git history of validated improvements.
What You'll Learn:
- What AutoResearch is and how it differs from AutoML
- The three-file architecture and ownership rules
- How the ratchet loop works step by step
- Getting started with your first overnight run
- Results achieved and limitations of the approach
- When to use AutoResearch in your workflow
What is AutoResearch?
AutoResearch is an open-source Python tool that lets an AI agent run ML experiments on a single GPU without human intervention. It loops through propose-train-evaluate cycles, keeping only changes that improve validation loss. The project ships under an MIT license.
This is not hyperparameter tuning. Tools like Optuna or Ray Tune search a predefined parameter space. AutoResearch gives the agent freedom to modify arbitrary code. The search space is whatever the LLM can think of.
| Tool | Type | Search Space |
|---|---|---|
| Optuna / Ray Tune | Hyperparameter tuning | Predefined parameters |
| AutoML / NAS | Architecture search | Fixed algorithms |
| AlphaEvolve | Evolutionary code search | LLM-proposed code edits (closed-source) |
| AutoResearch | AI Agent | Unlimited (LLM knowledge) |
From vibe coding to research advisor
Karpathy frames this as a natural progression in how engineers work with AI. The progression goes: vibe coding (human prompts, AI writes code, human reviews) to agentic engineering (human orchestrates agents in real time) to fully independent research (human sets direction, agent runs on its own).
Each step reduces the human role from writer to director to research advisor.
Key Insight
21,000+ GitHub stars in days. AutoResearch picked up massive interest after release on March 7, 2026.
AutoResearch Three-File Architecture
AutoResearch's design comes down to a contract between three files, each with strict rules about who can touch it.
prepare.py
IMMUTABLE - Handles data preparation and evaluation, and defines the validation metric (val_bpb). Neither the human nor the agent modifies it.
train.py
AGENT SANDBOX - Contains the GPT model, optimizer, and training loop. The agent can rewrite anything here, as long as the run produces a val_bpb score.
program.md
HUMAN WRITTEN - Research directions, constraints, and experiment rules. The only file the human author touches.
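The train.py side of this contract can be sketched minimally. The names and the placeholder score below are illustrative assumptions, not the repo's actual code; the one hard requirement is that a run ends by emitting a val_bpb score the loop can read.

```python
# Minimal sketch of the train.py contract (illustrative, not the repo's
# actual code): whatever the agent rewrites inside, a run must finish by
# reporting a val_bpb score for the ratchet decision.

def train(budget_seconds: int = 300) -> float:
    """Stand-in for the real training loop, which would train a small GPT
    for the fixed wall-clock budget and evaluate on the validation set."""
    val_bpb = 0.9979  # placeholder; real code computes this from eval data
    return val_bpb

if __name__ == "__main__":
    print(f"val_bpb: {train():.6f}")  # the line the outer loop parses
```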
What program.md controls
The file is more specific than most people expect. It hardcodes baseline metrics so the agent knows what to beat. It specifies exact commands for running experiments and extracting results. It tells the agent how to handle failures.
| Setting | Value |
|---|---|
| Baseline val_bpb | 0.997900 |
| Peak VRAM | 45 GB |
| Max experiment time | 10 minutes |
| Training budget | 5 minutes |
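Pulling the table's values together, a program.md along these lines would express that contract. This is a hypothetical sketch, not the repository's actual file:

```markdown
# program.md — hypothetical sketch

## Baseline
- val_bpb to beat: 0.997900

## Constraints
- Peak VRAM: 45 GB
- Max wall-clock per experiment: 10 minutes (training budget: 5 minutes)

## Rules
- Run experiments with `uv run train.py`; read val_bpb from the final line of stdout
- On crash: log the failure to results.tsv and revert the commit

## Directions
- Try attention variants, optimizer schedules, normalization placement
```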
AutoResearch Ratchet Loop
The core of AutoResearch is an experiment cycle that runs without human input. Here is how a single iteration works:
Read program.md
Agent reads program.md to understand current research priorities and constraints.
Examine current state
Examines train.py and recent results in results.tsv.
Propose hypothesis
Proposes an architecture change, optimizer adjustment, or training modification.
Modify train.py
Implements the proposed change in train.py.
Commit to git
Commits the change to a git branch.
Train for 5 minutes
Runs training for exactly 5 minutes - fixed wall-clock budget.
Evaluate result
If training crashes, logs failure and reverts. Otherwise evaluates val_bpb.
Ratchet decision
If val_bpb improved: commit stays. If not: git reset reverts.
Repeat
Starts again from step 1. At 5 min/experiment, runs ~12 experiments per hour.
The "ratchet" name comes from git history. Each successful experiment adds a commit; each failure gets reverted. The codebase can only move forward, never backward, accumulating validated improvements.
Getting Started With AutoResearch
Running AutoResearch requires a machine with an NVIDIA GPU (20+ GB VRAM recommended), Python 3.10+, the uv package manager, and a coding agent.
```shell
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
uv sync
uv run prepare.py
```
There is no orchestration script. The README says to "simply spin up your Claude/Codex or whatever you want in this repo." You open a coding agent in the project directory, prompt it to read program.md, and the agent runs the experiment loop.
Hardware Requirements
The default config targets modern GPUs with 20+ GB of VRAM. For smaller hardware, switch to the TinyStories dataset and reduce the vocabulary size to 256 and the depth to 4.
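The downscaled settings might look like the sketch below. The config field names are hypothetical; the real train.py may organize its settings differently.

```python
# Hypothetical config downscaling for small GPUs -- field names are
# illustrative assumptions, mirroring the reductions described above.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ModelConfig:
    dataset: str = "fineweb"
    vocab_size: int = 50304
    depth: int = 12

default = ModelConfig()
# Small-hardware variant: TinyStories, byte-level vocab, shallow network.
small = replace(default, dataset="tinystories", vocab_size=256, depth=4)
```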
AutoResearch Results
| Run | Experiments | Improvements | Result |
|---|---|---|---|
| Initial overnight | 83 | 15 | val_bpb: 1.000 → 0.975 |
| Extended 2-day | ~700 | ~20 | All additive |
| Community session | 126 | - | val_bpb: 0.9979 → 0.9697 |
| Shopify production | 37 | Multiple | 19% improvement (0.8B model) |
The Creativity Ceiling
GitHub Issue #22 captures the main limitation. The ratchet only accepts changes that immediately improve val_bpb, so the agent can never take a step backward to set up a larger gain. Human researchers reason "it will get worse before it gets better." The ratchet has no room for that.
Karpathy acknowledges the agent feels "cagey and scared" on open-ended problems, attributing this to RLHF training, which rewards safe, conservative outputs over bold experimentation.
Fixed 5-minute window
Changes that show value quickly get found; changes that would only prove out over longer training runs stay invisible.
Local search pattern
The agent cycles through minor variations, getting stuck in local optima rather than exploring novel architectures.
When to Use AutoResearch
The three-file contract transfers beyond LLM training to any domain with an automatic scoring function: search ranking, product categorization, fraud scoring, intent classification.
When experiments run 100x faster than a human can manage, the eval pipeline becomes the constraint. Teams need eval sets that evolve alongside production data.
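In another domain, the frozen evaluator (the prepare.py role) just needs to own one automatic metric. A sketch for search ranking, with an illustrative NDCG@k as the frozen score, under the assumption that relevance labels come from a held-out eval set:

```python
# Sketch of the immutable-evaluator role outside LLM training: a frozen
# scoring function the agent must beat but may never edit. The metric
# choice (NDCG@k) and data shape are illustrative assumptions.
import math

def ndcg_at_k(relevances: list[float], k: int = 3) -> float:
    """NDCG@k over the relevance labels of a ranked result list."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))
    ideal = sum(r / math.log2(i + 2)
                for i, r in enumerate(sorted(relevances, reverse=True)[:k]))
    return dcg / ideal if ideal else 0.0
```

A perfectly ordered list scores 1.0; the agent's rewrites to the ranker are kept or reverted against this number, exactly as train.py's changes are judged against val_bpb.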
Bottom Line
Writing a good program.md requires having done the research yourself. The agent handles execution, but the judgment behind the research agenda remains human.
Frequently Asked Questions
What is AutoResearch?
AutoResearch is an open-source Python tool by Andrej Karpathy that lets an AI coding agent run ML experiments on a single GPU without human intervention.
How does the AutoResearch ratchet loop work?
The agent reads program.md for direction, modifies train.py, commits, runs training for 5 minutes, and evaluates val_bpb. If improved, commit stays; if not, git reset reverts.
What is the three-file architecture?
prepare.py is immutable (evaluation), train.py is agent sandbox (modifiable), program.md is human-written (research directions). Each has strict ownership rules.
What are AutoResearch limitations?
The main limitation is the creativity ceiling. The ratchet cannot take a step backward for larger gains, tends to cycle through minor variations, and is constrained by the fixed 5-minute training window.
How does AutoResearch compare to AutoML?
AutoML and NAS search predefined parameter spaces. AutoResearch gives the LLM freedom to modify arbitrary code, betting on general knowledge rather than constrained search spaces.
Need Help with AutoResearch?
Our experts can help you set up automated ML experiments, configure your first overnight run, and integrate AI agents into your research workflow.
