The claim
If you have ever fine-tuned a model and reported a benchmark improvement of more than ~6 points, there is a near-certain chance some of that gain is leakage. We have run this audit on 41 internal fine-tunes across the cohort. 34 of them had measurable leakage. The median leakage cost: 3.8 points of headline accuracy — not catastrophic, but enough to make the gain look real when it wasn't.
The four leaks
- Train/test contamination. The boring one. Your eval set ended up in your fine-tune corpus. Happens far more than people admit. (A minimal duplicate check is sketched after this list.)
- Prompt leakage. You used the eval harness to construct your prompt template, then fine-tuned with that exact template. The model learned the harness, not the task.
- Hyperparameter leakage. You picked the checkpoint that scored best on the held-out set. Now the held-out set isn't held out anymore. (This is the most common one we see and the one most people deny doing.)
- Annotator leakage. Same human labelled the train and the eval. Their idiosyncrasies leaked across.
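For the first leak, the check is almost embarrassingly simple: normalize both corpora and look for exact overlaps. Here is a minimal sketch in Python; the normalization rules and function names are illustrative assumptions, not the cohort's actual audit tooling.

```python
# Minimal train/test contamination check: normalized exact-match overlap.
# Illustrative only; real audits should also catch near-duplicates.
import re


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace so trivial edits don't hide a dupe."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def contamination_report(train_texts: list[str], eval_texts: list[str]) -> list[str]:
    train_set = {normalize(t) for t in train_texts}
    hits = [t for t in eval_texts if normalize(t) in train_set]
    rate = len(hits) / max(len(eval_texts), 1)
    print(f"{len(hits)}/{len(eval_texts)} eval examples ({rate:.1%}) appear verbatim in training data")
    for h in hits[:5]:  # surface a few offenders for manual review
        print("  DUPE:", h[:80])
    return hits
```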
The harness we ship
The eval harness now standard across the cohort has six rules. None are clever. They are just hard to violate by accident.
- Eval data is held in a separate repo, with read-only access for the fine-tuning team.
- Prompt templates are versioned, and by default the eval harness uses a different template than the training pipeline.
- Checkpoint selection happens on a third "model-selection" set, never on the eval set. The eval set is touched once, at the end.
- Annotator pools are partitioned. A given annotator labels for exactly one of train, model-selection, or eval. Never two. (A partition check is sketched after this list.)
- Every eval run reports a "leakage check": the cosine similarity of every eval example against its closest training example. If the histogram has a tail hugging 1.0, you have a problem. (Also sketched after this list.)
- The headline number is reported with a 95% bootstrap CI. If your improvement is inside the CI of the baseline, it is not an improvement.
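The annotator partition in rule 4 is easy to enforce mechanically. A minimal sketch, assuming each labelled record carries an `annotator_id` and a `split` field; both names are illustrative, not the cohort's schema.

```python
# Rule 4: no annotator may appear in more than one of train / model-selection / eval.
from collections import defaultdict


def check_annotator_partition(records: list[dict]) -> None:
    splits_by_annotator: dict[str, set[str]] = defaultdict(set)
    for r in records:
        splits_by_annotator[r["annotator_id"]].add(r["split"])
    offenders = {a: s for a, s in splits_by_annotator.items() if len(s) > 1}
    # Fail loudly: one annotator in two splits is a leak, not a warning.
    assert not offenders, f"annotators labelling multiple splits: {offenders}"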
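Rule 5's leakage check is just a nearest-neighbour search in embedding space. A sketch assuming sentence-transformers embeddings; the model name and the 0.9 flag threshold are illustrative choices, not the cohort's settings.

```python
# Rule 5: nearest-training-neighbour cosine similarity for every eval example.
import numpy as np
from sentence_transformers import SentenceTransformer


def leakage_check(train_texts: list[str], eval_texts: list[str], threshold: float = 0.9) -> np.ndarray:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    train_emb = model.encode(train_texts, normalize_embeddings=True)
    eval_emb = model.encode(eval_texts, normalize_embeddings=True)
    # With unit-norm embeddings, the dot product is the cosine similarity.
    max_sim = (eval_emb @ train_emb.T).max(axis=1)
    flagged = int((max_sim > threshold).sum())
    print(f"{flagged}/{len(eval_texts)} eval examples above {threshold} similarity to training data")
    # The distribution is what matters: a tail hugging 1.0 means near-duplicates.
    counts, edges = np.histogram(max_sim, bins=10, range=(0.0, 1.0))
    for lo, hi, c in zip(edges[:-1], edges[1:], counts):
        print(f"  [{lo:.1f}, {hi:.1f}): {c}")
    return max_sim
```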
The results
After the harness was rolled out, the median reported improvement on internal fine-tunes dropped from +8.4 to +3.6 points. The improvements that survived were real. Three projects we had been planning to ship were paused. Two of those got fixed and shipped a quarter later. One was abandoned.
This is the sense in which the harness paid for itself: the most expensive failure is not the one you ship knowing it's broken — it's the one you ship believing it works. We had been running on a steady drip of the second.
Three things to take
- If your eval is touched more than once, it isn't an eval.
- Keep the model-selection step off the held-out set. Always.
- Bootstrap your CIs. A 1.5-point improvement with a 4-point CI is not an improvement. (A paired-bootstrap sketch is below.)
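For the third takeaway, a paired bootstrap over eval examples is enough. A minimal sketch, assuming you have per-example 0/1 correctness arrays for the baseline and the fine-tune on the same eval set; the 10,000 replicates and percentile interval are illustrative defaults, not the harness's exact configuration.

```python
# Paired bootstrap CI on the accuracy delta, in percentage points.
import numpy as np


def bootstrap_delta_ci(baseline: np.ndarray, finetune: np.ndarray,
                       n_boot: int = 10_000, seed: int = 0) -> tuple[float, float, float]:
    rng = np.random.default_rng(seed)
    n = len(baseline)
    # Resample eval examples with replacement, keeping baseline/finetune paired.
    idx = rng.integers(0, n, size=(n_boot, n))
    deltas = 100 * (finetune[idx].mean(axis=1) - baseline[idx].mean(axis=1))
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return float(deltas.mean()), float(lo), float(hi)


# Usage: if the 95% CI of the delta includes zero, the "improvement" is noise.
# delta, lo, hi = bootstrap_delta_ci(baseline_correct, finetune_correct)
# print(f"delta = {delta:+.1f} pts, 95% CI [{lo:+.1f}, {hi:+.1f}]")
```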