The Overnight Researcher: When AI Improves Itself While You Sleep
We gave an AI agent a tiny neural network and told it to make it better. An hour later, it had run 12 experiments and improved the model by 12.3%. All on a $300 GPU with zero human help.
The Setup
We pointed an AI coding agent (Claude Code) at Andrej Karpathy's autoresearch framework. The setup: an agent reads a research plan, modifies model code, trains, evaluates, and decides whether to keep or discard each change. The agent was given a small GPT-2 variant (7.3M parameters) learning to generate children's stories, a fixed 5-minute training budget per experiment, and a single instruction: make the validation score better. No human touched the code, chose the experiments, or made any decisions. One hour later, the agent had run 12 experiments, kept 4 improvements, and reduced the model's bits-per-byte from 0.603 to 0.529 (a 12.3% improvement). Total cost: roughly $0.02 in electricity (120W for one hour).
What Is the Model Learning?
The dataset is TinyStories, a collection of short stories generated by GPT-4, written at roughly a 3-year-old's vocabulary level. Think "Once upon a time, there was a little cat named Lily." Simple sentences, small vocabulary (4,096 tokens), short context (256 tokens). If you want to train a tiny language model, this is where you start.
The metric is bits-per-byte (BPB): how many bits the model needs to encode each byte of text. Think of it as a measure of surprise. The more surprised the model is by the next character, the higher the score. A completely random model scores around 2.5 BPB (maximum surprise). A model that has perfectly memorized the dataset would score 0.0 (zero surprise, which is impossible for natural language). Our goal was to push that number as low as possible in 5-minute training runs.
Lower is better. Our baseline started at 0.603; the best run hit 0.529, a 12.3% improvement.
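To make the metric concrete: BPB is just the model's cross-entropy loss converted from nats per token to bits, then divided by the number of bytes those tokens cover. A minimal sketch (the ~4.8 bytes-per-token figure is an illustrative assumption, not a measured property of TinyStories):

```python
import math

def bits_per_byte(ce_nats_per_token, tokens, nbytes):
    """Convert mean cross-entropy (nats/token) into bits per byte of text."""
    bits_per_token = ce_nats_per_token / math.log(2)  # nats -> bits
    return bits_per_token * tokens / nbytes

# A model guessing uniformly over a 4,096-token vocab pays log(4096) nats
# per token, i.e. exactly 12 bits/token. At an assumed ~4.8 bytes per token:
print(round(bits_per_byte(math.log(4096), tokens=1000, nbytes=4800), 6))  # → 2.5
```

That 2.5 is the "completely random" figure above: trained models pay far fewer bits because most next characters in a children's story are predictable.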
What BPB Sounds Like
Numbers are abstract. Here is what the difference actually looks like. We asked both the best model (BPB 0.529) and the worst experiment (BPB 2.777, where fp16 numerics blew up) to generate children's stories from the same starting point:
Best model (BPB 0.529):

“One day, a little girl named Mia went for a walk. She saw a big box in the woods. Mia thought, ‘I want to open the box and see what is inside!’ Mia opened the box and found a small, shiny ball. She said, ‘I want to play with the ball!’ Mia was so happy and played with the ball all day.”

Failed fp16 run (BPB 2.777):

“Lily smiled like they. her friends got like in they asked it had many friends it. The lot together with that she he for his pretty with friends can that!! Good house! Then you it had in for out his home are the dog had some best friend in the red small new pretty for and!”
Same architecture, same parameters, same training data. The only difference: the fp16 experiment broke the model's numerical precision. Half-precision floats can only represent values up to about 65,504, so large activations and gradients overflow, and the loss climbed instead of dropping. By the end of training, the model was worse than when it started. The best model tells coherent stories with characters and dialogue. The failed one produces word salad.
Both runs had identical 5-minute time budgets. The winner completed roughly 3x more training steps in the same wall-clock time, reaching step 1,867 while the baseline stopped at 620. The loss curves start at the same point but diverge sharply: the winner keeps learning through steps the baseline never reaches.
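The arithmetic behind that gap is simple: with a hard wall-clock cap, step count is just the budget divided by per-step time, so any per-step speedup converts directly into more optimizer steps. A sketch with illustrative step times (not measured values from these runs):

```python
BUDGET_S = 5 * 60  # the fixed 5-minute training window

def steps_in_budget(step_time_s, budget_s=BUDGET_S):
    """How many optimizer steps fit in a fixed wall-clock budget."""
    return round(budget_s / step_time_s)

print(steps_in_budget(0.48))  # slower config → 625
print(steps_in_budget(0.16))  # 3x faster per step → 1875
```

Halving the step time doubles the learning opportunities; making the model bigger does the opposite.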
The Experiment Loop
The agent follows Karpathy's program.md, a structured research plan that tells it how to run experiments. The loop is simple: form a hypothesis, edit the training code, commit to git (so you can roll back), train for exactly 5 minutes, check the score, and either keep the change or revert. Then repeat.
The git commit step is critical. Every experiment is a clean snapshot. If something explodes (and something did explode), the agent just reverts to the last known-good state. No manual cleanup. No "wait, which file did I change?" The whole thing is designed so the agent can recover from its own mistakes automatically.
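The control flow can be sketched in a few lines (a simplification of the program.md loop; the git and training steps are passed in as callables, so nothing here is the framework's actual API):

```python
def experiment_loop(hypotheses, apply_change, train_and_eval,
                    snapshot, rollback, baseline_bpb):
    """Keep-or-revert loop: lower BPB wins, everything else is rolled back.

    snapshot/rollback stand in for `git commit` and `git revert`;
    train_and_eval trains for the fixed 5-minute budget and returns a BPB.
    """
    best, kept = baseline_bpb, []
    for h in hypotheses:
        apply_change(h)          # edit the training code
        snapshot(h)              # clean git snapshot before training
        bpb = train_and_eval(h)
        if bpb < best:           # improvement: keep the change
            best = bpb
            kept.append(h)
        else:                    # regression: revert to last known-good state
            rollback()
    return best, kept
```

Replaying the scoreboard below through this loop (experiments 2-12 against the 0.603 baseline) keeps exactly four changes and ends at 0.529, matching the table.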
The Failures
Eight of twelve experiments were discarded. The failures are more interesting than the successes, because they reveal what the agent learned about the hardware constraints it was working within.
The Scoreboard
All 12 experiments, in order. The agent ran them sequentially, each building on the last kept result. A verdict of "keep" means the change stayed; "discard" means it was rolled back.
| # | Hypothesis | What Changed | BPB | vs Baseline | Steps | Verdict |
|---|---|---|---|---|---|---|
| 1 | Starting point: default settings adapted for GTX 1660 Ti | Default config | 0.603 | — | 620 | keep |
| 2 | Bigger model should be smarter | num_layers: 4 -> 6 | 0.687 | +13.9% | 260 | discard |
| 3 | More optimizer steps per 5 minutes | TOTAL_BATCH: 2^15 -> 2^14, DEVICE_BATCH: 32 -> 64 | 0.556 | -7.9% | 1,432 | keep |
| 4 | Even more steps, but noisier gradients | TOTAL_BATCH: 2^14 -> 2^13, DEVICE_BATCH: 32 | 0.564 | -6.6% | 2,292 | discard |
| 5 | Combine more capacity with more steps | num_layers: 6, big batch | 0.608 | +0.8% | 512 | discard |
| 6 | Learn faster per step | All LRs doubled | 0.544 | -9.9% | 1,432 | keep |
| 7 | Learn even faster | All LRs tripled | 0.543 | -10.0% | 1,432 | keep |
| 8 | Push LR further | All LRs quadrupled | 0.543 | -10.0% | 1,432 | discard |
| 9 | Use half-precision math | torch.autocast(dtype=float16) | 2.777 | +360.3% | 305 | discard |
| 10 | Different cooldown schedule | warmdown_frac: 0.0 -> 0.3 | 0.546 | -9.5% | 1,432 | discard |
| 11 | Smaller model, even more steps | num_layers: 4 -> 3 | 0.544 | -9.9% | 1,760 | discard |
| 12 | JIT compilation for 30% faster steps | Uncommented torch.compile(model) | 0.529 | -12.3% | 1,867 | WINNER |
Each experiment was a single run. The directional trends (speed improvements consistently beating capacity increases) are more meaningful than the exact BPB values. A difference of 0.001 could be noise; a difference of 0.07 (baseline to winner) is not.
The four kept improvements tell a clear story: a smaller total batch with a larger per-device batch (more optimizer steps per 5 minutes), two rounds of learning-rate increases (more learning per step), and torch.compile (faster steps through JIT compilation). Every single kept change was about speed, not size.
What the Agent Discovered
The pattern across all 12 experiments: when training time is fixed, speed beats size. Every improvement came from making the training loop faster, getting more steps into the same 5-minute window. Every failure came from making the model bigger, which slowed training down.
Think of it like a marathon runner with a fixed time limit. You can either give the runner heavier shoes (bigger model, more power per step but fewer steps) or lighter shoes (smaller model with faster training, more ground covered). On constrained hardware with a fixed clock, the lighter shoes win every time.
This is the same principle behind DeepMind's Chinchilla scaling laws: given a fixed compute budget, you should train a smaller model on more data rather than a larger model on less data. The agent rediscovered this independently, at a completely different scale, by trial and error on a consumer GPU. Chinchilla was about billion-parameter models on GPU clusters. This was about 7 million parameters on a card you can buy on eBay for $150. Same insight, different universe.
The Winning Changes
Look at how simple the actual code changes were:
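An illustrative reconstruction of those changes, based only on the scoreboard (variable names and values are guesses from the table, not the actual repo's identifiers):

```python
# Hypothetical config sketch -- names and values follow the article's table,
# not necessarily the actual autoresearch codebase.

TOTAL_BATCH_SIZE  = 2**14   # was 2**15 (experiment 3): more optimizer steps per run
DEVICE_BATCH_SIZE = 64      # was 32 (experiment 3): better GPU utilization per step
LR_MULTIPLIER     = 3.0     # was 1.0 (experiments 6-7): all learning rates tripled

# Experiment 12, the winner, was a one-line change: uncommenting
#     model = torch.compile(model)
# which JIT-compiles the training step for ~30% more steps per run.
```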
The biggest single improvement (a 2.5% gain from torch.compile) came from uncommenting one line. The line was already there in the codebase; the agent just turned it on, saw ~30% faster training (1,432 steps became 1,867), and kept it.
Why This Matters
Research that runs while you sleep
The entire experiment ran in about an hour with zero human intervention. The agent formed hypotheses, tested them, learned from failures, and built on successes. This is not "AutoML" in the grid-search sense. The agent's experiments show a logical progression, each attempt informed by the results of the last. Experiment 5 (combining depth + batch) was a response to experiments 2 and 3. The 4x LR attempt was a response to 2x and 3x working. There is a reasoning chain here, not just random search.
You do not need an H100
This ran on a 6-year-old gaming GPU that cost $280 new and uses 1.6 GB of VRAM at peak. The model has 7.3 million parameters. The dataset is tiny. None of this required expensive hardware, cloud compute, or any real spending. You could do this on a used gaming PC from 2019.
What we ran is a micro-scale instance of recursive self-improvement: an AI system autonomously making an AI system better. The agent modified model architecture, training hyperparameters, and compilation settings, then measured whether those modifications actually helped. It kept what worked and discarded what did not.
At this scale, the implications are modest. A 7M-parameter model generating children's stories is not going to trigger an intelligence explosion. But the pattern is worth paying attention to. The loop works the same regardless of what you point it at. The same framework that improved a TinyStories model could target larger models, different architectures, different training strategies. The constraint right now is compute and time, not capability.
We are not claiming this is AGI or anything close to it. We are pointing out what the data shows.
The experiment was small. The model was small. The GPU was small. But the loop itself has no inherent scale limit, and that is the part worth watching.
The tooling for AI-driven AI research now runs on consumer hardware, and that is a threshold worth noting.
Try It Yourself
The framework is open source. Clone Karpathy's autoresearch repo, point a coding agent at the program.md, and let it run. You need a CUDA-capable GPU (even a GTX 1060 would work with smaller batch sizes), Python, and PyTorch. The training script handles everything else. Our agent was Claude Code, but any coding agent that can edit files, run shell commands, and read output should work.
Difficulty: intermediate. You will need basic familiarity with Python, a command line, and an NVIDIA GPU with CUDA support. If that describes you, setup takes about 30 minutes. If it does not, this is a good excuse to learn. The repo is self-contained and well-documented.