How do you forecast AI data center power demand?

8 July 2025 · ai · forecasting · operations

Why your load forecast breaks the moment training starts

Forecasting libraries built in 2021 weren't designed for square-wave loads. Most DC operators are running models that quietly broke.

A good load forecast for a traditional data centre looked like this. Pull the last 90 days of half-hourly data, fit a seasonal-trend decomposition, add a temperature regressor for cooling load, publish a 7-day forward curve with a MAPE of 3-4%. Operators built whole dispatch strategies around the assumption the curve was mostly right.

Then training runs started.

A GPT-scale training job on 4,000 GPUs doesn't look like load. It looks like a square wave. Ramp to full utilisation inside an hour, hold for days or weeks, drop to near-baseline between runs. No seasonality, no temperature correlation, no gentle diurnal curve. The forecasting library you picked in 2021 does what it was built to do: it averages the spike across the window, emits a smooth line down the middle, and hands you back a MAPE that suddenly reads 18%.
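To see the averaging effect in isolation, here is a minimal sketch. The load shape, the megawatt levels and the one-week trailing mean are all assumptions picked for illustration, and a trailing mean is a stand-in for a smoothing model, not what any particular library does internally.

```python
def trailing_mean_forecast(history, window):
    """Forecast each step as the mean of the previous `window` observations."""
    return [sum(history[t - window:t]) / window for t in range(window, len(history))]


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical hourly load: 3 days idle at 2 MW, then 4 days training at 10 MW, repeated.
BASE_MW, TRAIN_MW = 2.0, 10.0
load = ([BASE_MW] * 72 + [TRAIN_MW] * 96) * 4

window = 168  # one week of hourly samples
forecast = trailing_mean_forecast(load, window)
actual = load[window:]

print(f"forecast range: {min(forecast):.2f} to {max(forecast):.2f} MW")
print(f"MAPE: {mape(actual, forecast):.1f}%")
```

The forecast never touches either real operating level. It sits between them, which is the "smooth line down the middle". The exact error depends on the model and the site, so don't read the printed MAPE as a prediction for yours.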

Most DC operators I've talked to in the last six months are still running those models. A few know the forecasts got worse and blame the weather. The smarter ones have figured out the shape is the problem and are retrofitting.

The retrofit is unpleasant. You can't just bolt a better model on top. The inputs are different. Traditional forecasting takes historical power as the primary signal. AI load forecasting needs the compute scheduler as the primary signal: what jobs are queued, when they're expected to start, how long they'll run, what their power profile looks like. That's a completely different data pipeline, and it often lives in a different org: platform engineering instead of facilities, with different access patterns and security review cycles. The political work to build the pipeline is harder than the engineering work.
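As a rough sketch of what "scheduler as the primary signal" can mean in practice, the snippet below turns a job queue into a forward IT load curve. Everything here is hypothetical: the `QueuedJob` fields, the flat per-GPU wattage and the instant ramp are assumptions, not any real scheduler's schema.

```python
from dataclasses import dataclass


@dataclass
class QueuedJob:
    start_hour: int        # expected start, in hours from now
    duration_hours: int    # expected run length
    gpus: int
    watts_per_gpu: float   # assumed all-in draw per GPU at full utilisation


def it_load_curve_mw(jobs, horizon_hours, baseline_mw):
    """Hourly IT load: a flat baseline plus a square pulse for each queued job."""
    curve = [baseline_mw] * horizon_hours
    for job in jobs:
        job_mw = job.gpus * job.watts_per_gpu / 1e6
        first = max(job.start_hour, 0)
        last = min(job.start_hour + job.duration_hours, horizon_hours)
        for hour in range(first, last):
            curve[hour] += job_mw
    return curve


# Hypothetical queue: one 4,000-GPU job starting in 6 hours and running 3 days.
jobs = [QueuedJob(start_hour=6, duration_hours=72, gpus=4000, watts_per_gpu=1200.0)]
curve = it_load_curve_mw(jobs, horizon_hours=168, baseline_mw=1.0)
```

The model is almost embarrassingly simple. The value is in the input: the step up at hour 6 is in the forecast before it shows up in the power data, which no amount of history-fitting gives you.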

There's a second-order problem underneath. Cooling load tracks IT load, but the relationship isn't linear for AI workloads. Traditional DC racks ran at 5-10 kW. AI racks now routinely exceed 40 kW, with NVIDIA's GB200 NVL72 reference designs coming in north of 120 kW per rack and B200 GPUs rated at up to 1,200 W each in liquid-cooled configurations. At those densities, air cooling doesn't just get less efficient. It becomes physically impractical. A 100 kW rack needs around 15,700 CFM of airflow through server intakes measuring a few square inches. That's hurricane-force wind pressure, and most legacy air-cooled halls can't deliver it.
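The airflow figure can be sanity-checked with the standard sensible-heat rule of thumb for air. This sketch assumes a 20°F temperature rise across the servers and sea-level air density. Neither is stated above, so treat the result as a ballpark.

```python
BTU_PER_HR_PER_WATT = 3.412   # unit conversion
SENSIBLE_HEAT_FACTOR = 1.08   # BTU/hr per CFM per °F, standard sea-level air


def required_cfm(it_watts, delta_t_f):
    """Airflow needed to carry away `it_watts` of heat at a given air temperature rise."""
    return it_watts * BTU_PER_HR_PER_WATT / (SENSIBLE_HEAT_FACTOR * delta_t_f)


# Assumed 20°F rise for both racks; the rack sizes come from the text above.
legacy_rack_cfm = required_cfm(10_000, delta_t_f=20.0)
ai_rack_cfm = required_cfm(100_000, delta_t_f=20.0)

print(f"10 kW rack:  {legacy_rack_cfm:,.0f} CFM")
print(f"100 kW rack: {ai_rack_cfm:,.0f} CFM")
```

Under those assumptions the 100 kW rack lands within about 1% of the figure quoted above. A wider temperature rise lowers the requirement proportionally, but the order of magnitude doesn't change.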

The forecasting implication is that a model trained on air-cooled data will mispredict a liquid-cooled or hybrid site by a meaningful margin. Vertiv and NVIDIA published a joint study showing liquid cooling cuts total data centre power by around 10% versus air, with 15%+ improvements in TUE. A shift that size sits inside the error bars of a forecast that's already missing on IT load, which means a site that retrofits from air to liquid mid-year without updating its forecasting pipeline will see its model quietly drift for months before anyone notices.
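One cheap way to catch that kind of drift is to track signed error as well as absolute error. The sketch below is one possible check, not a standard method. The daily figures are made up, and the 10% step is borrowed from the study figure above purely for illustration.

```python
def rolling_bias_pct(actual, predicted, window):
    """Mean signed percentage error per trailing window (positive = over-forecast)."""
    errors = [100.0 * (p - a) / a for a, p in zip(actual, predicted)]
    return [sum(errors[i - window:i]) / window for i in range(window, len(errors) + 1)]


# Hypothetical daily site power in MW: 30 days air-cooled, then 30 days after a
# retrofit that cuts total power by 10%. The stale model keeps predicting 10 MW.
actual = [10.0] * 30 + [9.0] * 30
predicted = [10.0] * 60

bias = rolling_bias_pct(actual, predicted, window=7)
print(f"bias before retrofit: {bias[0]:.1f}%")
print(f"bias after retrofit:  {bias[-1]:.1f}%")
```

MAPE can't tell you whether the model is noisy or systematically high. A rolling bias that moves away from zero and stays there points to a regime change rather than bad luck.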

So the forecast miss compounds. You're wrong about IT load because the scheduler isn't wired in. You're then wrong about cooling load because the underlying thermal regime shifted under you. Your total-site forecast is off by enough that the dispatch decisions built on top of it (BESS charge windows, imbalance participation, constraint alarms) start making bad calls.

The operators who've fixed this have done one of two things. Either they've integrated their compute scheduler directly into the forecasting pipeline, treating queued jobs as first-class inputs. Or they've stopped trying to forecast the load and started forecasting dispatch decisions directly: what's the probability we dispatch the battery in the next 30 minutes, given current queue depth? Different framing, same end goal.
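The second framing can start very small. The sketch below estimates dispatch probability from historical counts bucketed by queue depth. The record format and bucket size are hypothetical, and a real system would more likely use a proper classifier with more features than queue depth alone.

```python
from collections import defaultdict


def dispatch_probability_by_queue_depth(records, bucket_size):
    """Empirical P(dispatch within 30 min) per queue-depth bucket.

    `records` is an iterable of (queue_depth, dispatched_within_30_min) pairs.
    Keys of the result are the lower edge of each bucket.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [dispatches, observations]
    for queue_depth, dispatched in records:
        bucket = queue_depth // bucket_size
        counts[bucket][0] += int(dispatched)
        counts[bucket][1] += 1
    return {b * bucket_size: d / n for b, (d, n) in sorted(counts.items())}


# Made-up history: (jobs in queue, did the battery dispatch in the next 30 minutes?)
history = [(0, False), (1, False), (2, False), (3, True),
           (5, True), (6, False), (7, True), (8, True)]
probabilities = dispatch_probability_by_queue_depth(history, bucket_size=4)
```

Because the output is a probability of an action rather than a megawatt figure, you can score it against what the battery actually did, without needing a load forecast in between.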

The first approach is cleaner if you own both sides. The second works for operators who don't control the workload.

What doesn't work is the thing most people are still doing: running Prophet against last quarter's data and hoping.

A benchmark worth running against your own forecast: take the last 30 days, split into training-run windows and idle windows, score your model separately on each. If your MAPE is under 5% on idle windows and over 12% on training windows, your model has quietly broken and you probably didn't notice because the aggregate number still looks acceptable.
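A minimal sketch of that benchmark follows. It assumes you already have a per-interval flag saying whether a training run was active (from scheduler logs, say). The sample numbers are invented, and the 5% and 12% thresholds are the ones from the paragraph above.

```python
def mape_pct(pairs):
    """Mean absolute percentage error over (actual, predicted) pairs, in percent."""
    return 100.0 * sum(abs(a - p) / a for a, p in pairs) / len(pairs)


def segmented_mape(actual, predicted, is_training):
    """Score the same forecast on training windows, idle windows and overall."""
    rows = list(zip(actual, predicted, is_training))
    training = [(a, p) for a, p, flag in rows if flag]
    idle = [(a, p) for a, p, flag in rows if not flag]
    return {
        "training": mape_pct(training),
        "idle": mape_pct(idle),
        "aggregate": mape_pct(training + idle),
    }


# Invented 30-day sample in MW: 24 idle days forecast 3% high,
# 6 training days forecast 15% low.
actual = [2.0] * 24 + [10.0] * 6
predicted = [2.06] * 24 + [8.5] * 6
is_training = [False] * 24 + [True] * 6

scores = segmented_mape(actual, predicted, is_training)
quietly_broken = scores["idle"] < 5.0 and scores["training"] > 12.0
print(scores, quietly_broken)
```

In this sample the aggregate comes out a little over 5%, which most dashboards would wave through, while the training-window score is the one that decides whether dispatch makes good calls.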

Nobody publishes numbers on how common this is. But of the European operators I've spoken to in 2025, I can count on one hand the ones who've actually checked.