How does AI inference power demand differ from training?

9 June 2026
ai · inference · training · siting

Inference and training are two different grid loads. Stop siting them like they're one.

The industry talks about AI power as one thing. It's at least two, and the deployments that grow fastest from here look almost nothing like the training clusters most operator diligence checklists were written for.

Training is a batch job. It runs for weeks on a fixed cluster, draws a roughly flat 30–80 MW per pod, and tolerates timing slips of hours without a product consequence. Inference is a service. It runs continuously, follows diurnal user demand by a factor of three to five, can't be deferred by more than a few hundred milliseconds, and pays a revenue penalty every minute it isn't sitting close to the user. Same NVIDIA H100s, same Blackwells, different load.

The site you'd pick for a 200 MW training cluster has almost nothing in common with the site you'd pick for a 200 MW inference fleet. The diligence checklists circulating in 2026 are mostly written for the first, while the deployments are increasingly the second.

Training first.

The job is parallel within each step, serial from one step to the next, and predictable in profile once the run plan is fixed. Anthropic, OpenAI, and Meta size their runs months in advance and book capacity accordingly. A 50,000-GPU Blackwell pod doing a [BRACKET: weeks-long frontier training run] looks like a slab of demand: 60–90 MW, drawn around the clock, with checkpoint pauses every few hours and the occasional cluster-wide synchronisation. The grid sees a constant load that is uncorrelated with everything else on the system. It doesn't follow weather, time of day, or local user demand. It follows the training run's wall-clock progress and nothing else.

That gives training a property inference doesn't have: it's time-flexible at the batch boundary. Pausing a run for three hours during an ERCOT heat alert costs maybe half a percent of the wall-clock time of a four-week training run. The model team will mutter about it. Nobody will lose a customer. The same pause on an inference fleet means the product is down for the duration, customers route around it, and the revenue is gone.
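The half-percent figure is plain arithmetic, and it's worth having in a form the energy team can rerun with their own numbers. A minimal sketch, assuming the pause costs exactly its own duration (a real run would add some checkpoint-restore overhead on top); the function name is illustrative, not anyone's API:

```python
def pause_cost_fraction(pause_hours: float, run_weeks: float) -> float:
    """Fraction of a run's wall-clock time consumed by one pause.

    Assumption: the pause costs only its own duration, with no
    checkpoint-restore or warm-up overhead.
    """
    run_hours = run_weeks * 7 * 24
    return pause_hours / run_hours


# The case from the text: a three-hour pause in a four-week run.
frac = pause_cost_fraction(pause_hours=3, run_weeks=4)
print(f"{frac:.2%}")  # 3 h out of 672 h -> 0.45%
```

Run the same function against an inference fleet and the answer is meaningless, because the cost there isn't wall-clock time, it's downtime.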

So training can be sited where power is cheap, abundant, and possibly intermittent. West Texas. Iowa. Quebec. Mid-Norway. Places where the grid connection takes years to secure but, once made, is large and inexpensive per MWh. Training operators have been quietly underwriting interruptible contracts that look insane to a traditional colo and quite rational once you've internalised that 96 hours of interruption per year costs almost nothing if it's predictable.

Inference is the other story.

The load profile follows the user. US daytime traffic peaks around 14:00–16:00 local. European peaks arrive roughly six hours ahead of the US East Coast's. Asian peaks sit roughly half a day out of phase. A global inference platform shapes its load curve into something that looks more like a streaming service than a compute facility: a three- to five-fold diurnal swing on top of a non-zero floor, with sharp peaks on product launch days and a long tail of overnight low utilisation. Google's TPU inference fleets show this clearly in the public papers they've let out. So do the inference clusters Microsoft runs for OpenAI in [BRACKET: US-East and Europe-West Azure regions].
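To make the shape concrete, here is a toy model of a single region's diurnal curve. Everything in it is an assumption for illustration: the raised-cosine shape, the 15:00 local peak, the four-fold swing (picked from the middle of the three-to-five range), and the fixed standard-time UTC offsets standing in for a US-East, a central European, and a Singapore region. It is not fitted to any operator's data.

```python
import math


def regional_load(utc_hour: float, utc_offset: float,
                  peak_local_hour: float = 15.0,
                  floor: float = 1.0, swing: float = 4.0) -> float:
    """Relative load for one region: a floor plus a raised-cosine daily peak.

    swing is the assumed peak-to-trough ratio.
    """
    local_hour = (utc_hour + utc_offset) % 24
    phase = 2 * math.pi * (local_hour - peak_local_hour) / 24
    shape = 0.5 * (1 + math.cos(phase))  # 1 at the peak hour, 0 twelve hours away
    return floor + (swing - 1.0) * floor * shape


def peak_utc_hour(utc_offset: float) -> int:
    """UTC hour (0-23) at which the region's modelled load is highest."""
    return max(range(24), key=lambda h: regional_load(h, utc_offset))


# Hypothetical regions with fixed standard-time offsets (no daylight saving).
offsets = {"us-east": -5, "europe-central": 1, "singapore": 8}
for name, offset in offsets.items():
    print(name, "peaks at", peak_utc_hour(offset), "UTC")

hourly = [regional_load(h, offsets["us-east"]) for h in range(24)]
print("peak-to-trough ratio:", max(hourly) / min(hourly))
```

The point of the sketch is that each regional cluster has to be sized for its own peak, at its own hour, because the latency constraint keeps the traffic pinned to the region.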

The product constraint is latency. A user querying an inference endpoint expects a sub-200 ms response. That means the inference cluster has to sit within a short network round trip of the user, which means it has to sit in or near the region whose users it serves, which means inference siting is constrained by population centres and submarine cable topology the same way CDN siting always was. Frankfurt. Ashburn. Singapore. São Paulo. Expensive power. Constrained grids. Long connection queues.
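A rough way to see how the latency budget turns into a siting radius. The sketch below assumes light in fibre covers about 200 km per millisecond (roughly two thirds of c), that some share of the 200 ms budget is consumed by compute, and that real fibre paths are longer than the straight-line distance by a stretch factor. The 150 ms compute share and the 1.5 stretch factor are placeholders I made up, not measurements.

```python
FIBRE_KM_PER_MS = 200.0  # approximate one-way distance light covers in glass per ms


def max_distance_km(total_budget_ms: float, compute_ms: float,
                    path_stretch: float = 1.5) -> float:
    """Upper bound on straight-line user-to-cluster distance.

    Assumptions: propagation delay only (no queuing, no TLS handshakes),
    and fibre routes are path_stretch times longer than the straight line.
    """
    network_ms = total_budget_ms - compute_ms
    if network_ms <= 0:
        return 0.0  # compute alone already blows the budget
    one_way_ms = network_ms / 2  # a round trip spends half the time each way
    return one_way_ms * FIBRE_KM_PER_MS / path_stretch


# Hypothetical split: 200 ms total, 150 ms of it spent on compute.
print(round(max_distance_km(total_budget_ms=200, compute_ms=150)), "km")
```

Squeeze the compute share upward and the radius collapses quickly, which is why the cluster ends up in the same metros the CDNs already picked.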

I know this is going to annoy the people who've spent the last two years writing decks about how AI workloads will fundamentally rebalance global data centre geography, and I'm partly playing devil's advocate, but the actual rebalancing is bifurcation, not migration. Training drifts to the cheap-power frontier. Inference stays exactly where the existing internet already lives, and the inference fleets are the part of the business growing fastest.

The flexibility implications follow.

A training cluster is one of the more attractive flexible loads on any grid in market terms. Predictable baseline, large nameplate, controllable through the orchestrator, capable of multi-hour interruptions on a day-ahead signal. ERCOT, AESO, the Nordic balancing markets, and increasingly the Dutch and German curtailment regimes will pay real money for that profile. A 100 MW training cluster running at 80% utilisation with the ability to shed to zero within 15 minutes is worth, depending on market and year, somewhere between [BRACKET: $2M and $8M] per year in capacity payments and energy arbitrage. The economics of participating in PJM Synchronized Reserve, ERCOT's ECRS, or continental aFRR are good enough that they should be in every training-site underwriting model. They mostly aren't.
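For an underwriting model it helps to restate that range per sheddable megawatt. The sketch below only rearranges the figures in the paragraph above (100 MW nameplate, 80% utilisation, and the bracketed, still-unverified $2M to $8M range); it introduces no market prices of its own, and the function name is illustrative.

```python
def implied_value_per_mw(annual_revenue_usd: float, nameplate_mw: float,
                         utilisation: float) -> float:
    """Annual flexibility revenue divided by the MW that can actually be shed.

    Assumption: sheddable load equals nameplate times utilisation, i.e. the
    cluster sheds everything it is drawing.
    """
    sheddable_mw = nameplate_mw * utilisation
    return annual_revenue_usd / sheddable_mw


low = implied_value_per_mw(2_000_000, nameplate_mw=100, utilisation=0.8)
high = implied_value_per_mw(8_000_000, nameplate_mw=100, utilisation=0.8)
print(f"${low:,.0f} to ${high:,.0f} per sheddable MW-year")
```

Compare that per-MW figure against whatever the local market actually clears at before putting it in a model; the range here inherits every uncertainty in the bracketed numbers.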

An inference fleet is the opposite kind of asset. It's already running near baseline. Its peak demand correlates with the grid's peak demand. It can't shed without dropping product. The only flexibility lever it has is geographic load shifting between regions — routing US-East traffic to US-West during a PJM emergency — and that lever is bounded by latency budgets, capacity headroom in the destination region, and the willingness of customers to accept a 40 ms hit when their local cluster is constrained. Inference flexibility is real but it's a network problem, not a load-shedding problem.
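The three bounds on that lever can be written down as a small check. This is a sketch under simplifying assumptions: one source region, one destination, a single latency slack figure for all customers, and hypothetical parameter names. A real traffic manager would evaluate this per customer and per route.

```python
def shiftable_mw(requested_mw: float, dest_headroom_mw: float,
                 added_latency_ms: float, latency_slack_ms: float) -> float:
    """MW of inference load that can move to another region.

    Returns 0 if the extra round-trip latency exceeds what customers will
    tolerate; otherwise the shift is capped by destination headroom.
    """
    if added_latency_ms > latency_slack_ms:
        return 0.0  # latency budget broken: nothing can move
    return max(0.0, min(requested_mw, dest_headroom_mw))


# Hypothetical emergency: the grid operator wants 30 MW off the source region.
print(shiftable_mw(requested_mw=30, dest_headroom_mw=12,
                   added_latency_ms=40, latency_slack_ms=50))  # 12
print(shiftable_mw(requested_mw=30, dest_headroom_mw=12,
                   added_latency_ms=40, latency_slack_ms=25))  # 0.0
```

Note what's missing: there is no term for shedding. The only output is how much load can be served somewhere else.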

Different load, different siting, different flexibility playbook, different underwriting model. Treating them as one workload is the mistake.

The pattern I keep seeing in 2026 is operators with mixed fleets — some training, some inference, often on the same site because that's where the GPUs landed — applying training-style flexibility offers to inference racks and getting burned the first time the dispatch call comes in. Or applying inference-style must-run framing to the training racks and leaving a six-figure annual revenue stream on the table because nobody on the energy team had the conversation with the model team about what a training pause actually costs.

If you operate one site and run both loads on it, you have two grid customers under one roof. They don't want the same things. They shouldn't be procured against the same contracts. The orchestration layer above the workload scheduler has to know which is which.
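What "knowing which is which" might look like at its most basic: a tagged inventory that the energy desk can query before it answers a dispatch call. The data model, rack IDs, and megawatt figures below are invented for illustration; a real orchestration layer would pull the tags live from the workload scheduler.

```python
from dataclasses import dataclass
from enum import Enum


class Workload(Enum):
    TRAINING = "training"
    INFERENCE = "inference"


@dataclass(frozen=True)
class Rack:
    rack_id: str
    workload: Workload
    draw_mw: float


def sheddable_mw(racks: list[Rack]) -> float:
    """Load that can be offered into a flexibility market: training only."""
    return sum(r.draw_mw for r in racks if r.workload is Workload.TRAINING)


def must_run_mw(racks: list[Rack]) -> float:
    """Load that has to stay up through a dispatch call: inference."""
    return sum(r.draw_mw for r in racks if r.workload is Workload.INFERENCE)


# Hypothetical snapshot of one mixed site for the current hour.
site = [
    Rack("a01", Workload.TRAINING, 0.5),
    Rack("a02", Workload.TRAINING, 0.5),
    Rack("b01", Workload.INFERENCE, 0.25),
]
print("sheddable:", sheddable_mw(site), "MW")
print("must-run:", must_run_mw(site), "MW")
```

The split has to be recomputed whenever the scheduler reassigns racks, since the same hardware can be a flexible load one hour and a must-run load the next.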

So: how many people on the energy side of your operation know which of your racks are training and which are inference this hour?