// essay · Dec 2025 · 11 min read

Edge inference at 2 MW.

When the GPU bill exceeded the salaries of the team running the workload, something had to change. Notes on moving inference to on-prem at scale — and why we now treat the cloud as a debugger, not a runtime.

— Marta Chen, head of research

The bill that broke us

Across the cohort we hit an inflection in 2024. Inference for one of the ventures cost $4.20 per active user per month at peak. Active users were growing. The model was good. The unit economics were not.

Three options: a smaller model (lossy), a different cloud provider (marginal), or stop renting GPUs by the hour. We picked the third.

The math

Once you have stable demand and a model architecture you trust for at least 12 months, the on-prem break-even on H100-class hardware lands at around 0.55 utilised GPU-hours per dollar of cloud spend; above that threshold, owning the hardware wins. We were running at 0.71. The decision was made.
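As a sketch, the check is just a ratio and a threshold. The 0.55 figure is the one above; the exact definition of the metric (utilised GPU-hours divided by monthly cloud spend) and the sample numbers below are illustrative assumptions, not our actual accounting.

```python
def utilisation_hours_per_dollar(used_gpu_hours: float, cloud_spend_usd: float) -> float:
    """Utilised GPU-hours bought per dollar of cloud spend."""
    return used_gpu_hours / cloud_spend_usd

BREAK_EVEN = 0.55  # H100-class figure from the text; recompute for your own contract

if __name__ == "__main__":
    # Hypothetical month, not our real numbers: 1,000 GPUs at 60% utilisation
    # over 730 hours, against a $615k cloud bill.
    used_hours = 1_000 * 0.60 * 730
    spend = 615_000
    ratio = utilisation_hours_per_dollar(used_hours, spend)
    verdict = "above" if ratio > BREAK_EVEN else "below"
    print(f"{ratio:.2f} utilised GPU-hours per dollar ({verdict} the {BREAK_EVEN} break-even)")
```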

We commissioned a 2 MW inference pod — not a training site, just inference — co-located with a hydroelectric grid in northern Portugal. Cost-of-power on the contract is just under one-third of the equivalent in our nearest hyperscaler region.
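For a sense of scale, a back-of-the-envelope check: with roughly 10 kW per 8-GPU H100-class node and a PUE around 1.3 (both assumed figures, not numbers from our contract), 2 MW works out to on the order of a thousand GPUs.

```python
# Back-of-the-envelope capacity of a 2 MW pod. The per-node power and PUE
# are assumptions for illustration, not figures from our contract.
POD_POWER_KW = 2_000       # 2 MW of contracted power
NODE_POWER_KW = 10.0       # assumed max draw of an 8-GPU H100-class node
PUE = 1.3                  # assumed facility overhead (cooling, conversion losses)
GPUS_PER_NODE = 8

nodes = POD_POWER_KW / (NODE_POWER_KW * PUE)
print(f"~{nodes:.0f} nodes, ~{nodes * GPUS_PER_NODE:.0f} GPUs")
# -> ~154 nodes, ~1231 GPUs: the same order of magnitude as the fleet below.
```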

What broke

Three things the cloud had been silently solving for us that we now had to solve ourselves:

  1. Hot-spare capacity. A hyperscaler hides the fact that not every GPU is available all the time. We learned this when a half-rack went into thermal throttle and our routing layer hadn't been written to handle it (a sketch of the health-aware routing we ended up with follows this list).
  2. Model swap orchestration. Pushing a new model to 1,000 cloud GPUs is a rolling deploy. Pushing it to 1,000 on-prem GPUs without deploy infrastructure is several days of phone calls.
  3. Failure-mode reporting. The cloud has telemetry. The pod did not. We built it.
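For the first item, the fix was a routing layer that treats worker health as an input rather than an assumption. The sketch below is illustrative, not our production router, and the thresholds and names (Worker, pick_worker, HOT_SPARE_FRACTION) are placeholders: workers heartbeat their clocks, throttled or silent workers drop out of rotation, and a slice of the fleet is held back as hot spares.

```python
import time
from dataclasses import dataclass

THROTTLE_CLOCK_MHZ = 1_200   # assumed: below this SM clock we treat the GPU as throttled
STALE_AFTER_S = 30           # assumed: no heartbeat for 30 s means unavailable
HOT_SPARE_FRACTION = 0.10    # hold ~10% of the fleet out of rotation as hot spares

@dataclass
class Worker:
    worker_id: str
    sm_clock_mhz: int
    last_heartbeat: float     # time.time() of the last health report
    in_flight: int = 0        # requests currently on this worker

    def healthy(self, now: float) -> bool:
        fresh = now - self.last_heartbeat < STALE_AFTER_S
        return fresh and self.sm_clock_mhz >= THROTTLE_CLOCK_MHZ

def pick_worker(fleet: list[Worker]) -> Worker | None:
    """Route to the least-loaded healthy worker, keeping a hot-spare margin."""
    now = time.time()
    healthy = [w for w in fleet if w.healthy(now)]
    # Hold back a fraction of the *total* fleet as spares; if too few workers
    # are healthy, everything healthy is fair game.
    target = max(1, int(len(fleet) * (1 - HOT_SPARE_FRACTION)))
    usable = healthy[:target]
    if not usable:
        return None           # nothing healthy: overflow to the cloud reservation
    return min(usable, key=lambda w: w.in_flight)
```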

The cloud, as debugger

We did not move everything. We kept a small reservation in the cloud for two purposes: spike-load overflow and shadow-mode evaluation of new model candidates against live traffic. Both jobs are inherently bursty. The cloud is good at bursty. It is bad at steady-state.
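Shadow-mode evaluation is the simpler of the two to sketch. The shape below is illustrative rather than our actual stack (the endpoints and the score() comparison are placeholders): the pod answers the user, a copy of the request goes to the candidate model on rented GPUs off the latency path, and only the comparison gets logged.

```python
import asyncio

async def call_model(endpoint: str, prompt: str) -> str:
    """Placeholder for the actual inference RPC (HTTP/gRPC) to `endpoint`."""
    await asyncio.sleep(0)
    return f"<completion from {endpoint}>"

def score(live: str, candidate: str) -> float:
    """Placeholder comparison: exact match here; a judge model in practice."""
    return float(live == candidate)

async def handle_request(prompt: str) -> str:
    # 1. The on-prem pod serves the user; this is the only call on the latency path.
    live = await call_model("pod.internal/v1/generate", prompt)

    # 2. Mirror the request to the candidate model on the cloud reservation.
    #    Shadow traffic is best-effort: failures here never reach the user.
    async def shadow() -> None:
        try:
            candidate = await call_model("cloud.reserved/v1/generate", prompt)
            print({"prompt_len": len(prompt), "agreement": score(live, candidate)})
        except Exception:
            pass

    asyncio.create_task(shadow())   # in a real server, keep a reference to the task
    return live

if __name__ == "__main__":
    async def demo() -> None:
        print(await handle_request("hello"))
        await asyncio.sleep(0.1)    # toy demo only: let the shadow task finish logging
    asyncio.run(demo())
```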

The framing that worked, internally, was: treat the cloud as a debugger, not a runtime. Steady-state inference is on the pod. Investigations, replays, and shadow evals run on rented hardware. The bill came down by 74% at the same scale. We did not regret it.

Three things to take away

  • Cloud spend isn't fixed. If your model architecture is stable for a year, run the on-prem math. The break-even is closer than it looks.
  • You will under-budget for the orchestration layer. Double whatever you scoped.
  • Keep a foot in the cloud. Use it for the workloads it's actually good at.
