Thoughts on Leopold Aschenbrenner's short AGI timeline
Can we solve the problem of not having enough internet data to train LLMs, in the timeline proposed?
My basic mental model for the possible future of AI is something along the lines of one of the following:
1. This is it. GPT-4o and its peers are as good as we’ll get.
2. Gradual progress towards AGI over the course of decades or longer, with an eventual artificial superintelligence (ASI) arriving some decades or centuries hence.
3. Rapid progress towards AGI, with ASI following soon thereafter (less than 20 years).
I suspect that the first scenario is wrong. I’m skeptical that the third one is correct.
Leopold Aschenbrenner, on the other hand, very much believes in the third scenario, and he recently published a 165-page collection of essays on this topic. He makes two broad claims in his essays:
the linear trend of AI improvements augurs AGI relatively soon, and
relatively few people are aware of this and its implications today; but once nation-states wake up, the United States and its allies will need to form a latter-day Manhattan Project, which he dubs the Project, to facilitate the safe creation of AGI and, eventually, ASI.
The linear trend towards AGI
He presents this chart:
Believing in the imminence of AGI, he notes, doesn’t require that we believe in sci-fi tropes. Rather, we need only believe in the power of straight lines.
His argument is rather straightforward: by looking at trends in (1) the amount of computational horsepower we can build with clusters of GPUs, (2) algorithmic improvements (“efficiencies” is the term he uses), and (3) “unhobbling,” by which he means “fixing obvious ways in which models are hobbled by default”, he concludes that the linear trend shown above will continue for the next several years, and that, by dint of this trend, AGI will be achieved rather rapidly. As he writes:[1]
While the inference is simple, the implication is striking. Another jump like that [from GPT-2 to GPT-4] very well could take us to AGI, to models as smart as PhDs or experts that can work beside us as coworkers. Perhaps most importantly, if these AI systems could automate AI research itself, that would set in motion intense feedback loops…
He further writes, expanding on the above:
We can decompose the progress in the four years from GPT-2 to GPT-4 into three categories of scaleups:
Compute: We’re using much bigger computers to train these models.
Algorithmic efficiencies: There’s a continuous trend of algorithmic progress. Many of these act as “compute multipliers,” and we can put them on a unified scale of growing effective compute.
“Unhobbling gains”: By default, models learn a lot of amazing raw capabilities, but they are hobbled in all sorts of dumb ways, limiting their practical value. With simple algorithmic improvements like reinforcement learning from human feedback (RLHF), chain-of-thought (CoT) tools, and scaffolding, we can unlock significant latent capabilities.
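To make the “unified scale of growing effective compute” concrete, here is a toy calculation of my own (the specific numbers are purely illustrative assumptions, not Aschenbrenner’s figures). The point is simply that physical compute scaleups and algorithmic “compute multipliers” compound, so their orders of magnitude (OOMs) add:

```python
import math

# Toy "effective compute" bookkeeping. All numbers are made up for illustration.
physical_compute_scaleup = 10_000    # e.g. 4 OOMs more training FLOPs than a baseline run
algorithmic_efficiency_gain = 100    # e.g. 2 OOMs' worth of "compute multipliers"

effective_compute_scaleup = physical_compute_scaleup * algorithmic_efficiency_gain

print(f"Physical compute OOMs:  {math.log10(physical_compute_scaleup):.1f}")
print(f"Algorithmic OOMs:       {math.log10(algorithmic_efficiency_gain):.1f}")
print(f"Effective compute OOMs: {math.log10(effective_compute_scaleup):.1f}")
# -> 4.0, 2.0, and 6.0: the OOMs simply add on the unified scale.
```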
The obvious counter to these arguments is this: we’re running out of data to train models on. If we can’t acquire more data to train more powerful LLMs on, then we can’t really progress much further than the state of the art. Here is Aschenbrenner’s attempt to refute this argument:
All of this is to say that data constraints seem to inject large error bars either way into forecasting the coming years of AI progress. There’s a very real chance things stall out (LLMs might still be as big of a deal as the internet, but we wouldn’t get to truly crazy AGI). But I think it’s reasonable to guess that the labs will crack it, and that doing so will not just keep the scaling curves going, but possibly enable huge gains in model capability.
All of this is true, and it means that the entirety of Aschenbrenner’s argument in favor of rapid progress towards AGI depends crucially on AI research labs being able to solve the data constraint problem. In principle this should be a solvable problem—it’s not something that violates the laws of physics. But I’ve no idea about the prospects of solving it in the timeline he suggests. Maybe it’s doable, maybe it’s not. If it’s not, then the rest of his argument, insofar as it claims rapid progress to AGI, collapses.
Concluding thoughts
The key issue here is that we are running out of internet data to train models on. If we can overcome this limitation, whether by learning how to use synthetic data to train models, by developing new techniques that squeeze more out of the data we already have, or by some combination of the two, then the scaling laws will continue to hold as they have over the past several years, and progress towards AGI will be rapid.
More explicitly: the internet contains a finite amount of useful, high-quality data. As AI models grow more sophisticated, they require increasingly vast datasets to continue improving. If this data is exhausted, AI progress will slow significantly.
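To see why the data runs out, consider a back-of-the-envelope calculation. Assuming the widely cited Chinchilla-style heuristic of roughly 20 training tokens per model parameter, and an assumed (illustrative) stock of usable high-quality web text, the required token counts quickly outgrow what the internet can supply:

```python
# Back-of-the-envelope data budget. The tokens-per-parameter ratio is the
# rough Chinchilla compute-optimal heuristic; the model sizes and the stock
# of usable web tokens are illustrative assumptions, not measurements.
TOKENS_PER_PARAM = 20
USABLE_WEB_TOKENS = 50e12  # assume ~50 trillion usable high-quality tokens

for params in [70e9, 400e9, 2e12, 10e12]:
    tokens_needed = TOKENS_PER_PARAM * params
    print(f"{params / 1e9:>8,.0f}B params -> {tokens_needed / 1e12:6.1f}T tokens "
          f"({tokens_needed / USABLE_WEB_TOKENS:.0%} of the assumed stock)")
```

Under these made-up numbers, a 2-trillion-parameter model would already want most of the assumed stock, and a 10-trillion-parameter model would need several times more data than exists.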
Some possible solutions to the data problem:
Synthetic data generation: Creating high-quality synthetic data to supplement real-world data may help (see the sketch after this list). However, this relies on the quality of the seed data and on being able to generate realistic and diverse examples.
Domain-specific data collection: Collecting specialized data in specific fields might provide the necessary depth and variety for continued AI training. This would require targeted efforts and investment in data collection and curation.
Data efficiency improvements: If AI researchers can figure out how to use the data they have more efficiently, they may be able to mitigate this issue by reducing the amount of data needed for training.
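As a very rough illustration of the synthetic-data idea, here is a minimal generate-and-filter sketch. The functions generate_candidates and quality_score are hypothetical placeholders standing in for a generator model and a learned quality filter; real pipelines are far more involved:

```python
import random

def generate_candidates(seed_examples, n):
    """Placeholder for prompting a generator model with seed examples.
    Here it just recombines seed text so the sketch stays self-contained."""
    return [random.choice(seed_examples) + " (variant)" for _ in range(n)]

def quality_score(text):
    """Placeholder for a learned quality/realism filter; here a crude length proxy."""
    return min(len(text) / 100, 1.0)

def build_synthetic_dataset(seed_examples, n_candidates=1000, threshold=0.3):
    """The basic loop behind many synthetic-data proposals:
    generate candidates, then keep only those that pass a quality filter."""
    candidates = generate_candidates(seed_examples, n_candidates)
    return [c for c in candidates if quality_score(c) >= threshold]

seeds = ["Explain why the sky is blue.", "Summarize the causes of World War I."]
synthetic = build_synthetic_dataset(seeds, n_candidates=10)
print(f"Kept {len(synthetic)} of 10 candidates")
```

The hard part, of course, is the quality filter: if the filter (or the generator) is no better than the models being trained, the loop risks amplifying its own errors rather than adding genuinely new information.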
Rapid progress towards AGI is contingent on overcoming the ‘running out of internet data’ problem. While Aschenbrenner’s national security concerns are valid, they are intertwined with the resolution of the data issue.
[1] Throughout this post, when I quote from Aschenbrenner, the source is the collection of essays to which I initially referred.
Hmm…your three possible solutions ignore the one that seems to me to be the most likely: that we develop better and more efficient ways of ingesting data that exists outside the Internet. The Google Books project of scanning volumes from libraries is one part of this, but there are tons of data and information that are not currently online. My guess is that this is where the near/mid-term data growth comes from.