The Overnight Researcher: When AI Improves Itself While You Sleep
We gave an AI agent a tiny neural network and told it to make it better. An hour later, it had run 12 experiments and improved the model by 12.3%. All on a $300 GPU with zero human help.
The Setup
We pointed an AI coding agent (Claude Code) at Andrej Karpathy's autoresearch framework. The setup: an agent reads a research plan, modifies model code, trains, evaluates, and decides whether to keep or discard each change. The agent was given a small GPT-2 variant (7.3M parameters) learning to generate children's stories, a fixed 5-minute training budget per experiment, and a single instruction: make the validation score better. No human touched the code, chose the experiments, or made any decisions. One hour later, the agent had run 12 experiments, kept 4 improvements, and reduced the model's bits-per-byte from 0.603 to 0.529 (a 12.3% improvement). Total cost: roughly $0.02 in electricity (120W for one hour).
What Is the Model Learning?
The dataset is TinyStories, a collection of short stories generated by GPT-4, written at roughly a 3-year-old's vocabulary level. Think "Once upon a time, there was a little cat named Lily." Simple sentences, small vocabulary (4,096 tokens), short context (256 tokens). If you want to train a tiny language model, this is where you start.
The metric is bits-per-byte (BPB): how many bits the model needs to encode each byte of text. Think of it as a measure of surprise. The more surprised the model is by the next character, the higher the score. A completely random model scores around 2.5 BPB (maximum surprise). A model that has perfectly memorized the dataset would score 0.0 (zero surprise, which is impossible for natural language). Our goal was to push that number as low as possible in 5-minute training runs.
Lower is better. Our baseline started at 0.603; the best run hit 0.529, a 12.3% improvement.
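To make the metric concrete: BPB is just the model's cross-entropy loss converted from nats per token to bits, then divided by the number of bytes those tokens cover. A minimal sketch (the ~4.8 bytes-per-token figure is an illustrative assumption, not a measured property of TinyStories):

```python
import math

def bits_per_byte(ce_nats_per_token, tokens, nbytes):
    """Convert mean cross-entropy (nats/token) into bits per byte of text."""
    bits_per_token = ce_nats_per_token / math.log(2)  # nats -> bits
    return bits_per_token * tokens / nbytes

# A model guessing uniformly over a 4,096-token vocab pays log(4096) nats
# per token, i.e. exactly 12 bits/token. At an assumed ~4.8 bytes per token:
print(round(bits_per_byte(math.log(4096), tokens=1000, nbytes=4800), 6))  # → 2.5
```

That 2.5 is the "completely random" figure above: trained models pay far fewer bits because most next characters in a children's story are predictable.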
What BPB Sounds Like
Numbers are abstract. Here is what the difference actually looks like. We asked both the best model (BPB 0.529) and the worst experiment (BPB 2.777, where fp16 numerics blew up) to generate children's stories from the same starting point:
Best model (BPB 0.529):

“One day, a little girl named Mia went for a walk. She saw a big box in the woods. Mia thought, ‘I want to open the box and see what is inside!’ Mia opened the box and found a small, shiny ball. She said, ‘I want to play with the ball!’ Mia was so happy and played with the ball all day.”

Failed fp16 run (BPB 2.777):

“Lily smiled like they. her friends got like in they asked it had many friends it. The lot together with that she he for his pretty with friends can that!! Good house! Then you it had in for out his home are the dog had some best friend in the red small new pretty for and!”
Same architecture, same parameters, same training data. The only difference: the fp16 experiment broke the model's numerical precision. Half-precision floats can only represent values up to about 65,504, so large activations and gradients overflow, and the loss climbed instead of dropping. By the end of training, the model was worse than when it started. The best model tells coherent stories with characters and dialogue. The failed one produces word salad.
Both runs had identical 5-minute time budgets. The winner completed roughly 3x more training steps in the same wall-clock time, reaching step 1,867 while the baseline stopped at 620. The loss curves start at the same point but diverge sharply: the winner keeps learning through steps the baseline never reaches.
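The arithmetic behind that gap is simple: with a hard wall-clock cap, step count is just the budget divided by per-step time, so any per-step speedup converts directly into more optimizer steps. A sketch with illustrative step times (not measured values from these runs):

```python
BUDGET_S = 5 * 60  # the fixed 5-minute training window

def steps_in_budget(step_time_s, budget_s=BUDGET_S):
    """How many optimizer steps fit in a fixed wall-clock budget."""
    return round(budget_s / step_time_s)

print(steps_in_budget(0.48))  # slower config → 625
print(steps_in_budget(0.16))  # 3x faster per step → 1875
```

Halving the step time doubles the learning opportunities; making the model bigger does the opposite.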
The Experiment Loop
The agent follows Karpathy's program.md, a structured research plan that tells it how to run experiments. The loop is simple: form a hypothesis, edit the training code, commit to git (so you can roll back), train for exactly 5 minutes, check the score, and either keep the change or revert. Then repeat.
The git commit step is critical. Every experiment is a clean snapshot. If something explodes (and something did explode), the agent just reverts to the last known-good state. No manual cleanup. No "wait, which file did I change?" The whole thing is designed so the agent can recover from its own mistakes automatically.
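The control flow can be sketched in a few lines (a simplification of the program.md loop; the git and training steps are passed in as callables, so nothing here is the framework's actual API):

```python
def experiment_loop(hypotheses, apply_change, train_and_eval,
                    snapshot, rollback, baseline_bpb):
    """Keep-or-revert loop: lower BPB wins, everything else is rolled back.

    snapshot/rollback stand in for `git commit` and `git revert`;
    train_and_eval trains for the fixed 5-minute budget and returns a BPB.
    """
    best, kept = baseline_bpb, []
    for h in hypotheses:
        apply_change(h)          # edit the training code
        snapshot(h)              # clean git snapshot before training
        bpb = train_and_eval(h)
        if bpb < best:           # improvement: keep the change
            best = bpb
            kept.append(h)
        else:                    # regression: revert to last known-good state
            rollback()
    return best, kept
```

Replaying the scoreboard below through this loop (experiments 2-12 against the 0.603 baseline) keeps exactly four changes and ends at 0.529, matching the table.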
The Failures
Eight of twelve experiments were discarded. The failures are more interesting than the successes, because they reveal what the agent learned about the hardware constraints it was working within.
The Scoreboard
All 12 experiments, in order. The agent ran them sequentially, each building on the last kept result. A verdict of "keep" means the change stayed; "discard" means it was rolled back.
| # | Hypothesis | What Changed | BPB | vs Baseline | Steps | Verdict |
|---|---|---|---|---|---|---|
| 1 | Starting point: default settings adapted for GTX 1660 Ti | Default config | 0.603 | — | 620 | keep |
| 2 | Bigger model should be smarter | num_layers: 4 -> 6 | 0.687 | +13.9% | 260 | discard |
| 3 | More optimizer steps per 5 minutes | TOTAL_BATCH: 2^15 -> 2^14, DEVICE_BATCH: 32 -> 64 | 0.556 | -7.9% | 1,432 | keep |
| 4 | Even more steps, but noisier gradients | TOTAL_BATCH: 2^14 -> 2^13, DEVICE_BATCH: 32 | 0.564 | -6.6% | 2,292 | discard |
| 5 | Combine more capacity with more steps | num_layers: 6, big batch | 0.608 | +0.8% | 512 | discard |
| 6 | Learn faster per step | All LRs doubled | 0.544 | -9.9% | 1,432 | keep |
| 7 | Learn even faster | All LRs tripled | 0.543 | -10.0% | 1,432 | keep |
| 8 | Push LR further | All LRs quadrupled | 0.543 | -10.0% | 1,432 | discard |
| 9 | Use half-precision math | torch.autocast(dtype=float16) | 2.777 | +360.3% | 305 | discard |
| 10 | Different cooldown schedule | warmdown_frac: 0.0 -> 0.3 | 0.546 | -9.5% | 1,432 | discard |
| 11 | Smaller model, even more steps | num_layers: 4 -> 3 | 0.544 | -9.9% | 1,760 | discard |
| 12 | JIT compilation for 30% faster steps | Uncommented torch.compile(model) | 0.529 | -12.3% | 1,867 | WINNER |
Each experiment was a single run. The directional trends (speed improvements consistently beating capacity increases) are more meaningful than the exact BPB values. A difference of 0.001 could be noise; a difference of 0.07 (baseline to winner) is not.
The four kept improvements tell a clear story: a smaller total batch with a larger per-device batch (more optimizer steps per 5 minutes), two rounds of learning-rate increases (more learning per step), and torch.compile (faster steps through JIT compilation). Every single kept change was about speed, not size.
What the Agent Discovered
The pattern across all 12 experiments: when training time is fixed, speed beats size. Every improvement came from making the training loop faster, getting more steps into the same 5-minute window. Every failure came from making the model bigger, which slowed training down.
Think of it like a marathon runner with a fixed time limit. You can either give the runner heavier shoes (bigger model, more power per step but fewer steps) or lighter shoes (smaller model with faster training, more ground covered). On constrained hardware with a fixed clock, the lighter shoes win every time.
This is the same principle behind DeepMind's Chinchilla scaling laws: given a fixed compute budget, you should train a smaller model on more data rather than a larger model on less data. The agent rediscovered this independently, at a completely different scale, by trial and error on a consumer GPU. Chinchilla was about billion-parameter models on GPU clusters. This was about 7 million parameters on a card you can buy on eBay for $150. Same insight, different universe.
The Winning Changes
Look at how simple the actual code changes were:
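An illustrative reconstruction of those changes, based only on the scoreboard (variable names and values are guesses from the table, not the actual repo's identifiers):

```python
# Hypothetical config sketch -- names and values follow the article's table,
# not necessarily the actual autoresearch codebase.

TOTAL_BATCH_SIZE  = 2**14   # was 2**15 (experiment 3): more optimizer steps per run
DEVICE_BATCH_SIZE = 64      # was 32 (experiment 3): better GPU utilization per step
LR_MULTIPLIER     = 3.0     # was 1.0 (experiments 6-7): all learning rates tripled

# Experiment 12, the winner, was a one-line change: uncommenting
#     model = torch.compile(model)
# which JIT-compiles the training step for ~30% more steps per run.
```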
The biggest single improvement (a 2.5% gain from torch.compile) came from uncommenting one line. The line was already there in the codebase; the agent just turned it on, saw ~30% faster training (1,432 steps became 1,867), and kept it.
Why This Matters
Research that runs while you sleep
The entire experiment ran in about an hour with zero human intervention. The agent formed hypotheses, tested them, learned from failures, and built on successes. This is not "AutoML" in the grid-search sense. The agent's experiments show a logical progression, each attempt informed by the results of the last. Experiment 5 (combining depth + batch) was a response to experiments 2 and 3. The 4x LR attempt was a response to 2x and 3x working. There is a reasoning chain here, not just random search.
You do not need an H100
This ran on a 6-year-old gaming GPU that cost $280 new and uses 1.6 GB of VRAM at peak. The model has 7.3 million parameters. The dataset is tiny. None of this required expensive hardware, cloud compute, or any real spending. You could do this on a used gaming PC from 2019.
What we ran is a micro-scale instance of recursive self-improvement: an AI system autonomously making an AI system better. The agent modified model architecture, training hyperparameters, and compilation settings, then measured whether those modifications actually helped. It kept what worked and discarded what did not.
At this scale, the implications are modest. A 7M-parameter model generating children's stories is not going to trigger an intelligence explosion. But the pattern is worth paying attention to. The loop works the same regardless of what you point it at. The same framework that improved a TinyStories model could target larger models, different architectures, different training strategies. The constraint right now is compute and time, not capability.
We are not claiming this is AGI or anything close to it. We are pointing out what the data shows.
The experiment was small. The model was small. The GPU was small. But the loop itself has no inherent scale limit, and that is the part worth watching.
The tooling for AI-driven AI research now runs on consumer hardware, and that is a threshold worth noting.
Try It Yourself
The framework is open source. Clone Karpathy's autoresearch repo, point a coding agent at the program.md, and let it run. You need a CUDA-capable GPU (even a GTX 1060 would work with smaller batch sizes), Python, and PyTorch. The training script handles everything else. Our agent was Claude Code, but any coding agent that can edit files, run shell commands, and read output should work.
Difficulty: intermediate. You will need basic familiarity with Python, a command line, and an NVIDIA GPU with CUDA support. If that describes you, setup takes about 30 minutes. If it does not, this is a good excuse to learn. The repo is self-contained and well-documented.