Introduction
A metaphor frequently bruited about is that data is the new oil. We are, the claim goes, sitting on vast oceans of a valuable resource—data—which we need to extract and refine. Having been extracted and refined, the metaphor continues, these oceans of data can be put to valuable use. Most recently, of course, that valuable use has been training large language models (LLMs), the technology that underlies tools like OpenAI’s ChatGPT or Anthropic’s Claude.
Does the metaphor really make sense, though? Is data as fungible as oil? A barrel of oil is a barrel of oil, whether it is sourced from the Gulf of Mexico or the Persian Gulf. Sure, different oil fields may vary somewhat in the quality or amount of oil they produce, but once the raw material is refined into a saleable product, it’s all the same.
The Metaphor: Strengths and Weaknesses
There is nuance to be had here, and we’re going to explore it.
Strengths
Essential Resource: Just as oil was crucial for powering industries, data is fundamental for developing and operating modern AI tech. AI models rely heavily on vast amounts of data for training and improvement.
Extraction and Refining: The difficulty of extracting and refining data so that it can be used by large language models parallels the difficulty of finding, extracting, and refining new sources of oil. Not all data is readily usable; much of it requires significant processing, annotation, and cleaning to be valuable for AI training, akin to how oil must be refined before use.
Economic Incentives: Economic incentives push us toward more challenging and expensive sources of both oil and data. As easier sources are exhausted, the market encourages innovation and investment.
Limitations
Diverse Nature of Data: Unlike oil, data is not a single homogenous resource. It comes in various forms and qualities, and its value can vary significantly based on context and application.
Renewability and Growth: Data can grow exponentially, and the same data can be reused multiple times for different purposes. Oil, by contrast, is consumed on use: once you burn a barrel, it is gone. Nor does the supply of oil grow at exponential rates; it is a finite resource that depletes over time, while data grows continuously as more digital interactions and records are created.
Accessibility and Privacy: Unlike oil, which is primarily limited by physical extraction challenges, data accessibility is often governed by legal and ethical considerations.
If Data Is Accessible, Is It Usable?
Vast amounts of data are inaccessible to the organizations that train foundational models. This inaccessible data is often referred to as the deep web.1 Some people claim that more than 95% of the data on the web sits in the deep web. In other words, 95% or more of the world’s digital data exists beyond the reach of foundational models.
This sounds like a promising observation. Back in June I wrote that my principal objection to Leopold Aschenbrenner’s collection of essays, which argue for a short timeline to AGI, was that we are running out of accessible data for foundational models to be trained on. (To his credit, he acknowledges this argument, and attempts to refute it.) But if 95% of the world’s data is currently inaccessible, surely our data problems will be solved if we can just get access to that inaccessible data!
Alas, it’s likely not that simple. Just because inaccessible data can in theory be made accessible does not mean that the data is useful for training. Some points bear consideration.
Accessibility vs Usefulness of Data
Quality and Relevance:
While the deep web contains a significant portion of the world’s data, its usefulness for AI depends on the quality and relevance of this data. Much of it might not be structured, labeled, or even relevant for the specific tasks AI aims to accomplish.
Structured vs Unstructured Data: AI models benefit most from well-structured, labeled data2. Data buried in corporate databases, legal documents, or medical records often requires extensive preprocessing and annotation to be useful, as the sketch below illustrates.
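To make the preprocessing point concrete, here is a minimal sketch, in Python, of what turning one raw, unstructured document into a structured training record might look like. The field names and thresholds are illustrative assumptions, not anyone’s actual pipeline; real pipelines add language detection, quality scoring, deduplication, and human- or model-assisted annotation on top of this.

```python
import json
import re

def to_training_record(raw_text: str, source: str) -> dict | None:
    """Turn one raw, unstructured document into a structured training record.

    Illustrative only: field names and thresholds are made up for this sketch.
    """
    # Basic cleanup: strip leftover HTML tags and collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", raw_text)
    text = re.sub(r"\s+", " ", text).strip()

    # Drop fragments too short to carry any training signal (arbitrary cutoff).
    if len(text.split()) < 5:
        return None

    # The added structure and metadata are what make the raw text usable:
    # provenance matters for licensing audits, length for filtering and batching.
    return {
        "text": text,
        "source": source,
        "approx_words": len(text.split()),
    }

raw = "<p>Quarterly results were strong, with revenue up 12% year over year.</p>"
print(json.dumps(to_training_record(raw, source="corporate-filings"), indent=2))
```

The point of the sketch is that much of the value lies in the structure and provenance added around the text, not just in the text itself.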
Legal and Ethical Constraints:
Privacy and Regulations: Accessing and using deep web data often involves legal and ethical considerations, especially when dealing with personal, medical, or sensitive information. Privacy regulations like GDPR and HIPAA impose strict limitations on data usage, which can restrict the availability of valuable datasets.
Consent and Anonymization: Even if AI systems could access this data, ensuring proper consent and anonymization would be necessary, adding another layer of complexity to data usability; the sketch below hints at why anonymization is harder than it looks.
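As a rough illustration of why anonymization adds complexity, the sketch below redacts a few easily pattern-matched identifiers. The patterns are deliberately simplistic assumptions; real de-identification under regimes like GDPR or HIPAA relies on trained entity recognizers, expert review, and formal de-identification standards rather than a handful of regexes.

```python
import re

# Deliberately partial patterns, for illustration only.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace easily recognizable identifiers with placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact Jane Doe at jane.doe@example.com or 555-867-5309."))
# -> Contact Jane Doe at [EMAIL] or [PHONE].
```

Note that the personal name sails straight through: even this tiny example shows why pattern matching alone does not amount to anonymization.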
Economic and Technical Challenges
Cost of Data Extraction and Refinement
High Costs: Extracting and refining deep web data can be costly and resource-intensive. Just as with oil, where deeper and less accessible reserves require more advanced and expensive extraction techniques, the same applies to data. The cost-benefit analysis might not always justify the investment.
Specialized Skills: Refining raw data into a form usable by AI often requires specialized skills and technologies, including advanced natural language processing, computer vision, and other AI subfields.
Data Variety and Integration:
Heterogeneous Data Structure: Deep web data comes from a variety of sources, each with its own format, structure, and content. Integrating this heterogeneous data into a coherent dataset for AI training is challenging.
Data Cleaning and Preprocessing: Much deep web data is likely to be noisy, incomplete, or biased, necessitating extensive cleaning and preprocessing before it can be used effectively to train AI models; a minimal cleaning sketch follows this list.
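As a toy illustration of the cleaning step, here is a minimal sketch that normalizes text drawn from heterogeneous sources and drops junk fragments and exact duplicates. The thresholds are arbitrary assumptions; production pipelines for LLM training typically add approximate deduplication (e.g. MinHash), language identification, and learned quality filters.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical copies hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()

def clean_corpus(docs: list[str]) -> list[str]:
    """Drop junk fragments and exact duplicates from a raw corpus."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < 5:   # too short to carry signal (arbitrary cutoff)
            continue
        digest = hashlib.sha256(norm.encode()).hexdigest()
        if digest in seen:          # exact-duplicate removal
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

raw_docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick  brown fox jumps over the lazy dog.",  # whitespace-only variant
    "404 not found",                                  # junk fragment
]
print(len(clean_corpus(raw_docs)))  # -> 1
```

Even this toy version makes the economic point: every additional source format adds its own normalization and filtering burden.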
The claim that over 95% of the world’s data is not accessible to search engines is likely correct, but the usefulness of that data for AI is far from guaranteed. The practical challenges of extracting, refining, and ethically using deep web data make it a less straightforward resource than the metaphor “data is the new oil” might suggest. Additionally, advances in synthetic data generation and self-play techniques offer promising alternatives that might render the question of deep web data usability moot.
Conclusion
“Data is the new oil” provides a useful framework for understanding the value of, and the challenges associated with, data in the AI era. However, it is important to recognize the metaphor’s limitations and the characteristics of data that differentiate it from oil. The metaphor works well to illustrate certain economic and operational dynamics but falls short of capturing the full complexity of data’s role in AI development.
1. The deep web is conceptually distinct from the similar-sounding dark web, which, while also inaccessible to search engines, is mainly used by criminals attempting to evade detection by law enforcement.
2. Some people will quibble with this point, I think. LLMs do train on unstructured data, but that data has been pre-processed and annotated; training doesn’t use raw unstructured data so much as data to which pre-processing has imparted some structure.