The claim
If you have ever fine-tuned a model and reported a benchmark improvement of more than ~6 points, there is a near-certain chance some of that gain is leakage. We have run this audit on 41 internal fine-tunes across the cohort. 34 of them had measurable leakage. The median leakage cost: 3.8 points of headline accuracy — not catastrophic, but enough to make the gain look real when it wasn't.
The four leaks
- Train/test contamination. The boring one. Your eval set ended up in your fine-tune corpus. Happens far more than people admit. (A minimal duplicate check is sketched after this list.)
- Prompt leakage. You used the eval harness to construct your prompt template, then fine-tuned with that exact template. The model learned the harness, not the task.
- Hyperparameter leakage. You picked the checkpoint that scored best on the held-out set. Now the held-out set isn't held out anymore. (This is the most common one we see and the one most people deny doing.)
- Annotator leakage. Same human labelled the train and the eval. Their idiosyncrasies leaked across.
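For the first leak, the check is almost embarrassingly simple: normalize both corpora and look for exact overlaps. Here is a minimal sketch in Python; the normalization rules and function names are illustrative assumptions, not the cohort's actual audit tooling.

```python
# Minimal train/test contamination check: normalized exact-match overlap.
# Illustrative only; real audits should also catch near-duplicates.
import re


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace so trivial edits don't hide a dupe."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def contamination_report(train_texts: list[str], eval_texts: list[str]) -> list[str]:
    train_set = {normalize(t) for t in train_texts}
    hits = [t for t in eval_texts if normalize(t) in train_set]
    rate = len(hits) / max(len(eval_texts), 1)
    print(f"{len(hits)}/{len(eval_texts)} eval examples ({rate:.1%}) appear verbatim in training data")
    for h in hits[:5]:  # surface a few offenders for manual review
        print("  DUPE:", h[:80])
    return hits
```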
The harness we ship
The eval harness now standard across the cohort has six rules. None are clever. They are just hard to violate by accident.
- Eval data is held in a separate repo, with read-only access for the fine-tuning team.
- Prompt templates are versioned, and by default the eval harness uses a different template than the training pipeline.
- Checkpoint selection happens on a third "model-selection" set, never on the eval set. The eval set is touched once, at the end.
- Annotator pools are partitioned. A given annotator labels for exactly one of train, model-selection, or eval. Never two. (A partition check is sketched after this list.)
- Every eval run reports a "leakage check": the cosine similarity of every eval example against its closest training example. If the histogram has a tail hugging 1.0, you have a problem. (Also sketched after this list.)
- The headline number is reported with a 95% bootstrap CI. If your improvement is inside the CI of the baseline, it is not an improvement.
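The annotator partition in rule 4 is easy to enforce mechanically. A minimal sketch, assuming each labelled record carries an `annotator_id` and a `split` field; both names are illustrative, not the cohort's schema.

```python
# Rule 4: no annotator may appear in more than one of train / model-selection / eval.
from collections import defaultdict


def check_annotator_partition(records: list[dict]) -> None:
    splits_by_annotator: dict[str, set[str]] = defaultdict(set)
    for r in records:
        splits_by_annotator[r["annotator_id"]].add(r["split"])
    offenders = {a: s for a, s in splits_by_annotator.items() if len(s) > 1}
    # Fail loudly: one annotator in two splits is a leak, not a warning.
    assert not offenders, f"annotators labelling multiple splits: {offenders}"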
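Rule 5's leakage check is just a nearest-neighbour search in embedding space. A sketch assuming sentence-transformers embeddings; the model name and the 0.9 flag threshold are illustrative choices, not the cohort's settings.

```python
# Rule 5: nearest-training-neighbour cosine similarity for every eval example.
import numpy as np
from sentence_transformers import SentenceTransformer


def leakage_check(train_texts: list[str], eval_texts: list[str], threshold: float = 0.9) -> np.ndarray:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    train_emb = model.encode(train_texts, normalize_embeddings=True)
    eval_emb = model.encode(eval_texts, normalize_embeddings=True)
    # With unit-norm embeddings, the dot product is the cosine similarity.
    max_sim = (eval_emb @ train_emb.T).max(axis=1)
    flagged = int((max_sim > threshold).sum())
    print(f"{flagged}/{len(eval_texts)} eval examples above {threshold} similarity to training data")
    # The distribution is what matters: a tail hugging 1.0 means near-duplicates.
    counts, edges = np.histogram(max_sim, bins=10, range=(0.0, 1.0))
    for lo, hi, c in zip(edges[:-1], edges[1:], counts):
        print(f"  [{lo:.1f}, {hi:.1f}): {c}")
    return max_sim
```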
The results
After the harness was rolled out, the median reported improvement on internal fine-tunes dropped from +8.4 to +3.6 points. The improvements that survived were real. Three projects we had been planning to ship were paused. Two of those got fixed and shipped a quarter later. One was abandoned.
This is the sense in which the harness paid for itself: the most expensive failure is not the one you ship knowing it's broken — it's the one you ship believing it works. We had been running on a steady drip of the second.
Three things to take
- If your eval is touched more than once, it isn't an eval.
- Keep the model-selection step off the held-out set. Always.
- Bootstrap your CIs. A 1.5-point improvement with a 4-point CI is not an improvement. (A paired-bootstrap sketch is below.)
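For the third takeaway, a paired bootstrap over eval examples is enough. A minimal sketch, assuming you have per-example 0/1 correctness arrays for the baseline and the fine-tune on the same eval set; the 10,000 replicates and percentile interval are illustrative defaults, not the harness's exact configuration.

```python
# Paired bootstrap CI on the accuracy delta, in percentage points.
import numpy as np


def bootstrap_delta_ci(baseline: np.ndarray, finetune: np.ndarray,
                       n_boot: int = 10_000, seed: int = 0) -> tuple[float, float, float]:
    rng = np.random.default_rng(seed)
    n = len(baseline)
    # Resample eval examples with replacement, keeping baseline/finetune paired.
    idx = rng.integers(0, n, size=(n_boot, n))
    deltas = 100 * (finetune[idx].mean(axis=1) - baseline[idx].mean(axis=1))
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return float(deltas.mean()), float(lo), float(hi)


# Usage: if the 95% CI of the delta includes zero, the "improvement" is noise.
# delta, lo, hi = bootstrap_delta_ci(baseline_correct, finetune_correct)
# print(f"delta = {delta:+.1f} pts, 95% CI [{lo:+.1f}, {hi:+.1f}]")
```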