The next AI moat is a data flywheel, not GPUs
Building closed-loop synthetic + sensor systems to outrun diminishing returns
Everyone expected exponential improvements in GPT-5, but it feels more like a linear evolution. It’s not that the model isn’t an improvement, but it feels linear because, in this phase of AI development, it is. OpenAI hasn’t forgotten how to push the frontier. Rather, we’re running headlong into the two biggest walls in current AI scaling:
Diminishing returns from scaling laws. Each incremental performance gain requires exponentially more compute and training data.
Text data exhaustion. The pool of high-quality internet text for pretraining is nearly tapped out.
Those two forces are colliding. And they’re why even the most cutting-edge models can feel like steady, linear steps rather than exponential leaps. If the industry wants the next holy shit moment, it will need to change the kind of data the models feast on, not just the amount.
That’s where synthetic data and real-world sensor data come in. Alone, each has promise; together, they could be the foundation for the next step-function jump.
The Wall: Scaling Laws and Data Exhaustion
The underlying math of scaling laws has been well studied: model performance improves predictably with more compute, more parameters, and more training data. But the relationship isn’t linear. You have to feed the beast exponentially more to get smaller and smaller gains.
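One way to make the diminishing returns concrete is to plug numbers into a Chinchilla-style power law. The sketch below is illustrative only: the functional form follows Hoffmann et al. (2022), and the constants are approximate values in the range they report, not anything tuned to a current frontier model.

```python
# Chinchilla-style scaling law: L(N, D) = E + A / N**alpha + B / D**beta
# Constants are roughly those reported by Hoffmann et al. (2022);
# treat them as illustrative, not authoritative.

E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss for n_params parameters and n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Scale parameters 10x at each step, with ~20 tokens per parameter:
for n in (1e9, 1e10, 1e11, 1e12):
    print(f"{n:.0e} params -> predicted loss {loss(n, 20 * n):.2f}")
# Each 10x step buys a smaller absolute improvement than the one before.
```

Run it and the pattern is obvious: every tenfold increase in parameters and tokens shaves off less loss than the previous one. That curve is what "feeding the beast" looks like from the outside.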
Until now, the industry has been able to do just that: throw ever-larger compute clusters at the problem, train on nearly the entire internet, and ride that curve upward. But compute is scaling at three times the pace of Moore’s Law, and that is already becoming unsustainable in terms of both cost and power demand. Meanwhile, the internet’s supply of clean, high-quality, diverse text is finite, and we’re scraping the bottom of the barrel.
This leaves us with a paradox:
We know how to get better models: more and better data, plus more compute.
We’re out of easy data, and compute is starting to hit economic and physical limits.
So if we keep doing more of the same, each GPT release will feel less like a revolution and more like a predictable upgrade cycle.
Synthetic Data: Manufacturing Novelty at Scale
Synthetic data is the most obvious lever to pull when you run out of real data. This isn’t new. AlphaZero trained entirely on synthetic self-play games. But for LLMs, synthetic data has been a secondary supplement, not a primary driver.
That’s changing. The new approaches worth watching include:
Self-generated corpora. Use existing large models to produce new training data, filter it for quality, and fine-tune on it. Done well, this can create data distributions that simply don’t exist in the real world, which allows the model to practice on rare edge cases or multi-step reasoning chains.
Synthetic simulation environments. Think procedurally generated worlds where agents can explore, solve problems, and generate labeled experience. This is already the norm in robotics research, but it could become a major LLM training pipeline for reasoning and planning.
Iterative refinement loops. The model generates answers, critiques itself (or is critiqued by another model), then produces improved answers. This is bootstrapping quality in ways that mimic human learning.
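To make the refinement-loop idea concrete, here is a minimal sketch. The generate, critique, and score callables are hypothetical stand-ins for whatever model or API you actually use, and the quality threshold is an arbitrary placeholder, not a recommendation.

```python
# Minimal sketch of an iterative refinement loop. `generate`, `critique`, and
# `score` are hypothetical callables standing in for whatever model or API you
# use; the filtering threshold is likewise an assumption.

from typing import Callable

def refine(prompt: str,
           generate: Callable[[str], str],
           critique: Callable[[str, str], str],
           score: Callable[[str, str], float],
           rounds: int = 3,
           keep_threshold: float = 0.8) -> list[tuple[str, str]]:
    """Generate, critique, and regenerate; keep only (prompt, answer) pairs
    that clear a quality bar, so they can be used as fine-tuning data."""
    kept: list[tuple[str, str]] = []
    answer = generate(prompt)
    for _ in range(rounds):
        feedback = critique(prompt, answer)           # the model (or a second model) critiques the draft
        answer = generate(f"{prompt}\n\nRevise using this feedback:\n{feedback}")
        if score(prompt, answer) >= keep_threshold:   # filter before anything enters the corpus
            kept.append((prompt, answer))
    return kept
```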
The danger is synthetic drift, where errors and biases in synthetic data compound over time. That’s why synthetic data alone won’t solve the problem. You need something to keep it anchored to reality.
Sensor Data: The Infinite, Living Dataset
This is where real-world sensor data comes in. The physical world produces more novel, unstructured data per second than the internet ever has:
Cameras and LiDAR from autonomous vehicles and drones.
Wearables and biometrics from consumer health devices.
Industrial IoT in factories, power plants, and shipping.
AR/VR environments where people interact with both digital and physical spaces.
Scientific instrumentation from telescopes, particle accelerators, and wet labs.
Unlike static text scraped from the web, these data streams are continuous, high-dimensional, and grounded in physical reality.
Sensor data has two key advantages:
It’s bottomless. You can always collect more.
It evolves. As the world changes, so does the data distribution.
The problem? It’s messy, high-volume, and often domain-specific. You can’t just dump raw LiDAR scans or protein folding trajectories into an LLM and expect miracles.
Why Synthetic + Sensor Beats Either Alone
Synthetic data can be generated infinitely, but it risks drifting into a disconnected fantasy. Sensor data is grounded and endless, but often narrow and noisy. Combine them, and you get a virtuous cycle:
Sensor data as the ground-truth anchor. The synthetic generation process is constantly calibrated against fresh real-world measurements.
Synthetic expansion for combinatorial coverage. Real-world data points are used as seeds to generate vast variations: testing edge cases, unseen scenarios, and counterfactuals that the sensors will never capture naturally.
Feedback loops for skill acquisition. Models trained in simulated environments can be deployed into sensor-equipped systems (robots, AR/VR agents, lab automation) to collect new data, which is fed back into training.
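Put together, the cycle might look something like the sketch below. Every callable it takes (collect_sensor_batch, fit_generator, generate_variants, close_to_real, train_step, deploy) is a hypothetical placeholder for a domain-specific component, not a real API; only the shape of the loop is the point.

```python
# Sketch of the synthetic + sensor flywheel. All callables passed in are
# hypothetical placeholders for domain-specific components.

from typing import Any, Callable, Iterable

def data_flywheel(model: Any,
                  generator: Any,
                  collect_sensor_batch: Callable[[], list],
                  fit_generator: Callable[[Any, list], Any],
                  generate_variants: Callable[[Any, Any], Iterable],
                  close_to_real: Callable[[Any, list], bool],
                  train_step: Callable[[Any, list], Any],
                  deploy: Callable[[Any], None],
                  n_cycles: int = 10) -> Any:
    """Anchor, expand, filter, train, redeploy, repeat."""
    corpus: list = []
    for _ in range(n_cycles):
        real = collect_sensor_batch()                # grounded, noisy, narrow
        generator = fit_generator(generator, real)   # calibrate synthesis against fresh measurements
        synthetic = [v for seed in real
                     for v in generate_variants(generator, seed)]  # combinatorial expansion
        kept = [x for x in synthetic if close_to_real(x, real)]    # guard against synthetic drift
        corpus.extend(list(real) + kept)
        model = train_step(model, corpus)            # incremental update, not a one-off run
        deploy(model)                                # the deployed model produces the next sensor batch
    return model
```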
This isn’t hypothetical. It’s already happening in narrow domains:
In autonomous driving, simulation-generated hazards are blended with real dashcam footage to train vision systems.
In drug discovery, lab robots generate experimental data, which is then expanded with AI-driven molecular simulations.
Scaling this to general AI training could break the diminishing returns wall.
A New Paradigm: Data as an Evolving Asset
In the internet text paradigm, data was mostly static: you scraped it once, cleaned it, and trained your model. In a synthetic-sensor paradigm, data is dynamic:
You don’t scrape it. You grow it.
You can aim generation toward gaps in the model’s competence (see the sketch after this list).
Your dataset improves as your model improves.
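Aiming generation at gaps can be as simple as measuring per-skill error on a held-out evaluation set and splitting the synthetic-data budget in proportion to it. The sketch below is a toy allocation rule; the skill names and error rates are invented for illustration.

```python
# Sketch of "aiming generation at gaps": allocate the synthetic-data budget
# toward the skills where measured error is highest. Skill names and numbers
# are hypothetical.

def generation_budget(eval_error: dict[str, float], total_examples: int) -> dict[str, int]:
    """Split a synthetic-generation budget in proportion to measured error."""
    total_error = sum(eval_error.values()) or 1.0
    return {skill: round(total_examples * err / total_error)
            for skill, err in eval_error.items()}

# Example: the model is weakest at multi-step planning, so it gets most of the budget.
print(generation_budget(
    {"arithmetic": 0.05, "multi_step_planning": 0.40, "spatial_reasoning": 0.25},
    total_examples=100_000))
# -> {'arithmetic': 7143, 'multi_step_planning': 57143, 'spatial_reasoning': 35714}
```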
This has deep implications for AI company strategy:
The best models will be owned by those who own the most productive sensor-synthetic loops.
AI training becomes less about finding data and more about producing and curating it.
Moats become less about model weights and more about closed-loop data ecosystems.
Economic and Physical Sustainability
Synthetic-sensor loops can help here too:
Greater data efficiency. Better-targeted data means fewer wasted training tokens.
Continuous fine-tuning. Instead of massive one-off training runs, you can update models incrementally as new sensor + synthetic data flows in.
Edge processing. Some sensor data can be pre-processed locally, reducing the need for every raw frame or measurement to hit the data center.
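To make the last two points slightly more concrete, here is a sketch of an incremental update loop with edge preprocessing folded in. fetch_new_data, edge_preprocess, and finetune are hypothetical placeholders; a real pipeline would also gate each update on evaluation results and support rollback.

```python
# Sketch of continuous fine-tuning with edge preprocessing. The callables are
# hypothetical placeholders; a production loop would also gate each update on
# evaluation results before deploying it.

import time
from typing import Any, Callable

def continuous_update(model: Any,
                      fetch_new_data: Callable[[], list],
                      edge_preprocess: Callable[[list], list],
                      finetune: Callable[[Any, list], Any],
                      interval_seconds: float = 24 * 3600,
                      max_cycles: int = 7) -> Any:
    """Fold small batches of fresh sensor + synthetic data into the model on a schedule."""
    for _ in range(max_cycles):
        raw = fetch_new_data()            # whatever arrived since the last cycle
        batch = edge_preprocess(raw)      # filter/compress near the sensor, not in the data center
        if batch:                         # skip the cycle if nothing useful came in
            model = finetune(model, batch)
        time.sleep(interval_seconds)
    return model
```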
If we want AI progress to remain economically viable while bending the performance curve upward, we need to start treating data generation as every bit as strategic as model architecture.
What This Means for GPT-6, GPT-7, and Beyond
If GPT-5 feels linear, it’s because the scaling wall is real. The next breakout moment, where models jump from impressive to holy shit, will almost certainly come from a paradigm shift in data, not just raw scale.
Here’s what that shift might look like in practice:
Integrated multimodal training from the start: text, audio, video, spatial maps, biosignals, fed by live sensor streams.
Targeted synthetic expansion to cover rare or complex scenarios the sensors never naturally encounter.
Tight feedback loops between deployed AI systems and their training corpora. Every interaction becomes new training fuel.
Domain fusion: LLMs trained on physical science data, code, natural language, and sensor readings simultaneously.
That’s an AI that’s genuinely grounded in the real world and able to reason across it.