A GPT4-powered data business idea
Here's an idea for a GPT4-powered data business. I've no interest in founding or running a company so I'm offering it here for free.
While I was driving around town a few days ago, I came up with an AI-related business idea, which I have not seen discussed elsewhere. This idea is somewhat far afield from typical AI-related entrepreneurial discourse. Nonetheless, it seems like a promising idea. And, as I am not interested in founding a company, I thought I’d share the idea here. If anyone sees this, and is interested in talking about it, feel free to reach out.
Here’s the idea in bullet form:
The US federal government has dozens of regulatory agencies, all of which compile large amounts of data relevant to the industry which they regulate.
The data that federal regulators compile are free to use and not copyrighted, so you can build a business selling subscriptions to the data1.
GPT4 and related technologies can be used to provide a natural language search interface on top of this data.
Customers would include: corporate development groups in large corporations, lobbyists, bankers, lawyers, and management consultants. This type of customer is used to paying for access to specialized data. Subscription prices can be high, and subscriptions are sticky. Further, because access to regulatory information is considered mission critical, this kind of expense isn’t usually cut in recessions.
Profit would be the spread between the subscription price and the cost of compiling the data and training it on GPT4. (For the sake of simplicity, I am including employee compensation and other ancillary expenses in the “cost of compiling data”.)
This is probably not VC-fundable, as the market is too small, but it can be a cash machine, aking to Bloomberg’s financial data business or LexisNexis. The business looks superficially like the SaaS businesses with which pattern-matching VCs are familiar, but I think a more accurate model is either Bloomberg’s data business or LexisNexis.
Given that bullet introduction, let’s take a look at some of the challenges that such a business would encounter:
You would need to assemble a somewhat unusual team. The CEO should be someone who has deep and extensive industry contacts in the industry which the regulatory agency regulates. If you decide to use the FDA’s data, for example, the CEO ought to have deep and extensive industry contacts in the industry which the FDA regulates.
Finding the initial customers for this business will be tough. Likely customers include: management consultants, bankers, lawyers, lobbyists, and corporate development groups in large corporations. Selling anything to any of these groups is hard. This is why the CEO ought to have deep and extensive relationships in the target industry. This likely implies that the CEO will be older than the average tech startup founder.
The CEO would need to complement her skillset and relationships with a technical person who knows how to corral a set of data and let GPT4 loose on it. And the type of person who has this technical knowledge generally is locked up by the FAANGs or cash-rich AI startups. Prying one or two loose from these companies is tricky.
The interface that customers use will have to be well-designed, and easy for a non-technical person to use. This requires a design sense not commonly found in either a CEO or technical person.
Factuality will need to be solved, or at least mitigated. I am not very knowledgeable about this topic, so I will provide some information that ChatGPT4 provided in response to my prompt to it about this topic:
Factuality in the context of GPTs refers to the extent to which the language model produces text that is accurate and consistent with factual information. As language models like GPT-3 are trained on large corpora of text data from diverse sources, they learn patterns in language, including factual information, common sense reasoning, and stylistic features. However, because the training data includes both accurate and inaccurate information, GPT models may occasionally generate text that contains inaccuracies or falsehoods.
Factuality is an important consideration when using GPT models for various applications, such as generating text for news articles, summarizing information, ansering questions, or providing explanations. Ensuring that the generated text is factually accurate and reliable is crucial for maintaining trust and credibility, especially in applications that involve providing information to users or making decisions based on the generated text.
Researchers and developers have explored various methods to improve the factuality of GPT models, including fine-tuning the models on carefully curated databases, adding constraints to the generation process, and using external knowledge sources. Despite these efforts, factuality remains a challenge in natural language generation, and users should critically evaluate the text generated by GPT models to ensure its accuracy and reliability.
Prompt injections may be a risk. Again, I am not very knowledgeable on this topic, so I will offer this link with more information.
This list of challenges is by no means exhaustive, but it should at least provide a start for thinking about potential pitfalls.
A common objection at this point is “if the data are free to use then why would anyone pay for it?” While this sounds like an good argument, it’s actually not. Bloomberg, LP, for example, compiles gigabytes of free data that the SEC provides to the public, and charges its institutional investors upwards of $20,000 per month per person. What matters is not the data itself, but the integration of the data into a useful service.