AI inference will be unbundled
The future of inference is bifurcation: cheap edge hustle at one pole, sovereign-scale sanctuaries at the other
For the past few years, AI has been straightforward: everything flows into the data center cathedral. You buy more GPUs, build bigger data centers and pray to the Nvidia god. The assumption, repeated by analysts, journalists, and investors alike, is that all inference loads lead to a data center.
But physics, law, and money don’t obey theology. They’re forcing a split. Inference, the act of actually using AI models day-to-day, is bifurcating into two poles:
Edge hustlers, where inference happens close to the event: on phones, cars, AR glasses, IoT gateways, industrial sensors.
Central cathedrals, where inference lives in hardened bunkers with exotic hardware, specialized interconnects, and sovereign-level controls.
The middle, in which cloud data centers monopolize inference, gets squeezed from both sides.
Why the edge wins by physics
AMD’s CTO, Mark Papermaster, recently predicted that “by 2030, the majority of inference tasks will be run on-device.” That’s not hype; it’s physics. There are four thresholds that force workloads out of the cathedral and into the street:
Latency budget: If the closed-loop response budget is ≤ 30 ms, the WAN is out. No amount of marketing spin fixes the speed of light.
Secrecy/Sovereignty: If regulators ban data transit (PII, HIPAA, ITAR, etc.), inference must stay local.
Model footprint: If even a quantized model exceeds device VRAM or thermal headroom, the edge can’t host it. But if the model fits on the device, it’s cheaper and faster to run it locally.
Backhaul cost: If data egress per inference * forecast volume exceeds a set floor, edge filtering becomes mandatory. A simple model helps reify this. Take a camera gateway: 100 KB per request * 1 million requests/day = 100 GB/day. At $0.05/GB egress, that's $5/day. At 10,000 sites, it's $50,000/day (~$1.5M/month). Run a lightweight detector locally and cut 90% of traffic, and you cut 90% of that bill.
Physics and dollars conspire: if you can run locally, you will.
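To make the thresholds concrete, here is a minimal back-of-the-envelope sketch in Python. The function names and the $1/day egress floor are my own illustrative assumptions; the 30 ms budget, the $0.05/GB rate, and the camera-gateway volumes come from the examples above.

```python
# Back-of-the-envelope sketch of the four thresholds that push inference to the edge.
# All numbers are illustrative assumptions taken from the examples in the text.

def daily_egress_cost_usd(kb_per_request: float, requests_per_day: float,
                          egress_usd_per_gb: float) -> float:
    """Egress cost per site per day if every request is shipped to the cloud."""
    gb_per_day = kb_per_request * requests_per_day / 1_000_000  # KB -> GB (decimal)
    return gb_per_day * egress_usd_per_gb

def prefers_edge(latency_budget_ms: float,
                 data_must_stay_local: bool,
                 model_fits_on_device: bool,
                 daily_egress_usd: float,
                 egress_floor_usd: float = 1.0) -> bool:
    """Crude routing rule: any single tripped threshold is enough to go local."""
    if not model_fits_on_device:
        return False  # model footprint: the edge simply can't host it
    return (
        latency_budget_ms <= 30        # latency budget: WAN round-trips blow it
        or data_must_stay_local        # secrecy/sovereignty: transit is banned
        or daily_egress_usd > egress_floor_usd  # backhaul cost: filtering pays for itself
    )

# Camera-gateway example from the text: 100 KB/request, 1M requests/day, $0.05/GB.
per_site = daily_egress_cost_usd(kb_per_request=100, requests_per_day=1_000_000,
                                 egress_usd_per_gb=0.05)
fleet = per_site * 10_000  # 10,000 sites
print(f"${per_site:.2f}/day per site, ${fleet:,.0f}/day across the fleet")
print("Run at the edge?", prefers_edge(latency_budget_ms=30, data_must_stay_local=False,
                                       model_fits_on_device=True, daily_egress_usd=per_site))
```

The exact figures matter less than the structure of the rule: if the model fits on the device, any one tripped threshold is enough to pull the workload out of the cloud.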
Why centralization strengthens
At the opposite pole, some workloads are too big, too sensitive, or too valuable to push outward.
Scientific research: Protein folding, materials modeling, climate simulations.
National defense: Multi-theater sensor fusion, strategic decision support, dual-use intelligence.
Finance: Real-time stress testing, cross-market contagion simulations, sovereign risk assessment.
A recent arXiv review was blunt: “large language models, due to their scale and computational intensity, remain best suited for cloud-hosted deployment.” Translation: trillion-parameter models don’t fit in your pocket.
The FT has noted that data centers remain indispensable for "high-demand workloads like training, deep learning, and big data analytics." And inference itself, once seen as a sideshow compared to training, is becoming the dominant compute load in hyperscale facilities. Google, Amazon, Microsoft, and Meta are throwing billions at inference-specific infrastructure, aiming to loosen Nvidia's chokehold.
In the cathedrals, the economics invert: the value isn’t billions of cheap inferences; it’s a smaller number of high-value, high-margin inferences where accuracy and auditability matter more than latency.
The cloud gets unbundled
So what about the middle? This is where cloud inference, meaning the business model that AWS and Azure bet would dominate, faces pressure. TechRadar recently asked, “Is the cloud the wrong place for AI?” and pointed out that real-time response, massively parallel workloads, and regulatory strictures all push inference away from one-size-fits-all cloud deployments.
The better framing: the cloud is being unbundled.
Edge siphons off the latency-sensitive, privacy-bound, bandwidth-heavy tail.
Central fortresses absorb the high-complexity, sovereign-scale head.
The cloud remains the connective tissue, but it loses exclusivity, margin, and control.
The cathedral is still there. But the main street around it is filling with strip malls, pop-ups, and independent hustlers.
Tripwires that could accelerate or stall my predictions
What I’ve argued here isn’t a foregone conclusion. There are accelerants and brakes that could affect the timing and degree of this bifurcation.
Accelerants:
Better quantization/distillation toolchains.
Stronger data-sovereignty regimes (EU, India, U.S. states).
Rising egress charges.
Edge NPUs with improved memory bandwidth.
Brakes:
Ultra-cheap bandwidth (LEO constellations with sub-10 ms response times).
Standardized compliance enclaves in cloud providers.
Frontier models where context windows, not FLOPs, are the gating feature.
Where this leaves us
Inference isn’t moving to the edge. Nor is it staying locked in the cathedral. It’s splitting. Edge hustlers own the latency-sensitive, privacy-bound, high-volume tail. Central cathedrals own the high-complexity, sovereign-critical head. The cloud becomes the main street that connects them, but it no longer owns the town.
The heresy is becoming orthodoxy: if you're planning around a single future, whether all edge or all cloud, you're wrong. The only rational play is to embrace the bifurcation and anticipate both poles.
If you enjoy this newsletter, consider sharing it with a colleague.
I’m always happy to receive comments, questions, and pushback. If you want to connect with me directly, you can: