Are we running out of data on the road to AGI?
Data constraints are a real issue, but we may not run out of data before we get to AGI
In my recent analysis of Leopold Aschenbrenner’s argument for the imminent arrival of AGI, I highlighted a critical weakness in his position: the potential exhaustion of the data required to train large language models. While Aschenbrenner acknowledges this issue and attempts to refute it, the concern remains significant. If advanced LLMs require vast amounts of training data, and AI research labs can’t access enough of it, his argument falters.
Aschenbrenner addresses this objection directly:
There is a potentially important source of variance for all of this: we’re running out of internet data. That could mean that, very soon, the naive approach to pretraining larger language models on more scraped data could start hitting serious bottlenecks.
Frontier models are already trained on much of the internet. Llama 3, for example, was trained on over 15T tokens. Common Crawl, a dump of much of the internet used for LLM training, is >100T tokens raw, though much of that is spam and duplication (e.g., a relatively simple deduplication leads to 30T tokens, implying Llama 3 would already be using basically all the data). Moreover, for more specific domains like code, there are many fewer tokens still, e.g. public github repos are estimated to be in low trillions of tokens.
You can go somewhat further by repeating data, but academic work on this suggests that repetition only gets you so far, finding that after 16 epochs (a 16-fold repetition), returns diminish extremely fast to nil. At some point, even with more (effective) compute, making your models better can become much tougher because of the data constraint. This isn’t to be understated: we’ve been riding the scaling curves, riding the wave of the language-modeling-pretraining-paradigm, and without something new here, this paradigm will (at least naively) run out. Despite the massive investments, we’d plateau.

All of the labs are rumored to be making massive research bets on new algorithmic improvements or approaches to get around this. Researchers are purportedly trying many strategies, from synthetic data to self-play and RL approaches. Industry insiders seem to be very bullish: Dario Amodei (CEO of Anthropic) recently said on a podcast: “if you look at it very naively we’re not that far from running out of data […] My guess is that this will not be a blocker […] There’s just many different ways to do it.” Of course, any research results on this are proprietary and not being published these days.
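To make the repetition point concrete, here is a minimal Python sketch of a diminishing-returns model for reusing the same tokens across epochs. The saturating functional form, the r_star constant, and the effective_tokens helper are illustrative assumptions, loosely inspired by the data-constrained scaling literature Aschenbrenner alludes to; they are not figures from his text.

```python
import math

def effective_tokens(unique_tokens: float, epochs: int, r_star: float = 15.0) -> float:
    """Toy diminishing-returns model for repeating training data.

    Each extra pass over the same tokens is worth less than the last;
    beyond roughly r_star repetitions the marginal value decays to near
    zero, so effective data saturates. The saturating form and the
    r_star constant are illustrative assumptions, not quoted figures.
    """
    repeats = epochs - 1  # passes beyond the first pass over the unique data
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))

unique = 30e12  # ~30T deduplicated web tokens, Aschenbrenner's rough figure
for epochs in (1, 4, 16, 64):
    eff = effective_tokens(unique, epochs) / 1e12
    print(f"{epochs:>2} epochs -> ~{eff:.0f}T effective tokens")
```

Under these assumed parameters, going from 1 to 16 epochs multiplies effective data by roughly 10x, but quadrupling again to 64 epochs adds only about another 1.5x: the extra passes buy almost nothing, which is the plateau Aschenbrenner is describing.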
This concern was investigated by Epoch AI researchers, whose conclusions intriguingly align with Aschenbrenner’s AGI timeline of around 2027-28. They write:
Given our estimate of the data stock, we then forecast when this data would be fully utilized. We develop two models of dataset growth. One simply extrapolates the historical growth rate in dataset sizes, and the other accounts for our projection of training compute growth and derives the corresponding dataset size…. Our overall projection… comes from combining these two models. Our 80% confidence interval is that the data stock will be fully utilized at some point between 2026 and 2032.
If the Epoch AI researchers are correct that data exhaustion won’t occur until sometime in the 2026-32 window, data constraints may not hinder AGI progress within Aschenbrenner’s 2027-28 timeframe.
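As a rough back-of-the-envelope version of the first of Epoch’s two models (extrapolating dataset growth against a fixed stock of public text), here is a sketch. The stock size, the starting dataset size, and the annual growth factor are illustrative assumptions, not Epoch AI’s published estimates.

```python
import math

# Illustrative numbers, not Epoch AI's published estimates: a hypothetical
# stock of usable public text, a Llama-3-scale starting dataset, and an
# assumed annual growth factor for frontier training sets.
DATA_STOCK_TOKENS = 300e12     # assumed total stock of public human text
DATASET_2024_TOKENS = 15e12    # roughly Llama 3's reported training set
ANNUAL_GROWTH = 2.5            # assumed multiplier on dataset size per year

def exhaustion_year(stock: float, start: float, growth: float, start_year: int = 2024) -> float:
    """Year at which an exponentially growing training set equals the stock."""
    return start_year + math.log(stock / start) / math.log(growth)

year = exhaustion_year(DATA_STOCK_TOKENS, DATASET_2024_TOKENS, ANNUAL_GROWTH)
print(f"Data stock fully utilized around {year:.1f}")  # ~2027 under these assumptions
```

Even this crude extrapolation lands inside Epoch’s 2026-2032 confidence interval, which is why the exact choice of growth assumptions mostly shifts the crossover year within that window rather than outside it.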
Their accompanying paper elaborates:
We have projected the growth trends in both the training dataset sizes used for state-of-the-art language models and the total stock of available human-generated public text data. Our analysis suggests that, if rapid growth in dataset sizes continues, models will utilize the full supply of public human text data at some point between 2026 and 2032, or one or two years earlier if frontier models are overtrained. At this point, the availability of human text data may become a limiting factor in further scaling of language models.
Even if this bottleneck arrives sooner than anticipated, the researchers suggest that techniques like transfer learning and synthetic data generation could mitigate the impact. (Aschenbrenner notes this as well.) Furthermore, we can speculate that as AI systems approach AGI, they could themselves improve our ability to apply these techniques, compounding progress toward more capable AI.