The bill that broke us
Across the cohort we hit an inflection in 2024. Inference for one of the ventures cost $4.20 per active user per month at peak. Active users were growing. The model was good. The unit economics were not.
We had three options: move to a smaller model (lossy), switch cloud provider (marginal), or stop renting GPUs by the hour. We picked the third.
The math
Once you have stable demand and a model architecture you trust for at least 12 months, the on-prem break-even on H100-class hardware lands at around 0.55 GPU-utilisation hours per dollar of cloud spend. We were running at 0.71, comfortably past break-even. That settled it.
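As a rough illustration of that comparison, here is a minimal Python sketch. The 0.55 and 0.71 figures are the ones quoted above; the function names, the monthly inputs, and the reading that a ratio above break-even favours on-prem are assumptions made for the example, not our actual model.

```python
# Sketch only: 0.55 (break-even) and 0.71 (observed) come from the text.
# Function names and the example inputs below are hypothetical.
BREAK_EVEN = 0.55  # GPU-utilisation hours per dollar of cloud spend


def utilisation_per_dollar(utilised_gpu_hours: float, cloud_spend_usd: float) -> float:
    """Return utilised GPU-hours obtained per dollar of cloud spend."""
    if cloud_spend_usd <= 0:
        raise ValueError("cloud_spend_usd must be positive")
    return utilised_gpu_hours / cloud_spend_usd


def favours_on_prem(ratio: float, break_even: float = BREAK_EVEN) -> bool:
    """Assumption: a ratio above break-even means owning hardware is cheaper."""
    return ratio > break_even


# Hypothetical month: 71,000 utilised GPU-hours against $100,000 of cloud spend.
ratio = utilisation_per_dollar(71_000, 100_000)
print(f"{ratio:.2f}", favours_on_prem(ratio))  # prints "0.71 True"
```

The useful part is less the arithmetic than the discipline: measure utilised hours, not provisioned hours, before comparing against the threshold.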
We commissioned a 2 MW inference pod (inference only, not a training site), co-located with a hydroelectric grid in northern Portugal. The contracted power cost is just under one-third of the equivalent in our nearest hyperscaler region.
What broke
Three things that the cloud had been silently solving for us, that we now had to solve ourselves:
- Hot-spare capacity. A hyperscaler hides the fact that not every GPU is available all the time. We learned this when a half-rack went into thermal throttle and our routing layer hadn't been written to handle it.
- Model swap orchestration. Pushing a new model to 1,000 cloud GPUs is a rolling deploy. Pushing it to 1,000 on-prem GPUs without deploy infrastructure is several days of phone calls.
- Failure-mode reporting. The cloud has telemetry. The pod did not. We built it.
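To make the hot-spare point concrete, here is a minimal sketch of a health-aware picker that skips degraded GPUs and falls back to a spare pool. It is not our routing layer: the `Gpu` class, the `pick_gpu` function, and the least-loaded policy are all hypothetical, chosen only to show the shape of the problem.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    name: str
    healthy: bool = True  # set to False when throttling or otherwise degraded
    in_flight: int = 0    # requests currently assigned to this GPU


def pick_gpu(active: list[Gpu], spares: list[Gpu]) -> Gpu:
    """Choose the least-loaded healthy GPU, trying active before hot spares."""
    for pool in (active, spares):
        healthy = [g for g in pool if g.healthy]
        if healthy:
            chosen = min(healthy, key=lambda g: g.in_flight)
            chosen.in_flight += 1
            return chosen
    raise RuntimeError("no healthy GPU available")


active = [Gpu("a0", in_flight=3), Gpu("a1", in_flight=1)]
spares = [Gpu("s0")]

first = pick_gpu(active, spares).name   # "a1": least-loaded active GPU
for gpu in active:
    gpu.healthy = False                 # simulate a half-rack throttling
second = pick_gpu(active, spares).name  # "s0": falls back to the spare
```

In a real pod the `healthy` flag would be driven by telemetry, which is why the routing and reporting gaps in the list above tend to get solved together.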
The cloud as debugger
We did not move everything. We kept a small reservation in the cloud for two purposes: spike-load overflow, and shadow-mode evaluation of new model candidates against live traffic. Both jobs are inherently bursty. The cloud is good at bursty. It is bad at steady-state.
The framing that worked, internally, was: treat the cloud as a debugger, not a runtime. Steady-state inference is on the pod. Investigations, replays, and shadow evals run on rented hardware. The bill came down by 74% at the same scale. We did not regret it.
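The split described above can be sketched in a few lines of Python. Everything here is illustrative: the `route` and `should_shadow` names, the capacity of 3, and the 10% mirror rate are assumptions for the example rather than values from our system.

```python
import random


def route(pod_in_flight: int, pod_capacity: int) -> str:
    """Serve from the pod until it is full, then overflow to the cloud."""
    return "pod" if pod_in_flight < pod_capacity else "cloud"


def should_shadow(sample_rate: float, rng: random.Random) -> bool:
    """Decide whether to mirror this request to a candidate model in the cloud."""
    return rng.random() < sample_rate


rng = random.Random(0)  # seeded so the sketch is repeatable

# Hypothetical pod with room for 3 concurrent requests.
targets = [route(n, pod_capacity=3) for n in range(5)]
print(targets)  # prints ['pod', 'pod', 'pod', 'cloud', 'cloud']

# Mirror roughly 10% of live requests for shadow evaluation.
mirrored = sum(should_shadow(0.1, rng) for _ in range(1000))
```

The shadow copy's response would be logged and compared, never returned to the user, so the candidate model can fail without anyone outside the team noticing.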
Three things to take away
- Cloud spend isn't fixed. If your model architecture is stable for a year, run the on-prem math. The break-even is closer than it looks.
- You will under-budget for the orchestration layer: routing, deploys, and telemetry. Double whatever you scoped.
- Keep a foot in the cloud. Use it for the workloads it's actually good at.