LLMs hallucinate. How can large enterprises overcome this?
Large enterprises won't use AI tech if the models incessantly hallucinate. Fortunately, there are some solutions.
Introduction
There are two main models for enterprise use of large language models (LLMs): (1) the enterprise builds its own tooling; (2) the enterprise buys a solution from an outside vendor. Nothing surprising there; this is the model for basically all enterprise computing. However, the probabilistic nature of LLMs makes this decision much more complicated for the enterprise. Traditional software is deterministic: every time you open a spreadsheet or a web browser or a database or a design document, the software does exactly what it has done thousands of times before. Enter these values into those spreadsheet cells, and this predictable result follows.
LLMs are inherently probabilistic, which results in outputs that can be unpredictable or unreliable. This poses unique challenges for enterprises that want to integrate the technology into their workflows. The conundrum is well expressed in this Twitter post, also mirrored here. The post is hard to quote from, but it is worth a read. The author notes that LLMs are “strangely-shaped tools”. This is an especially evocative and poetic description of LLMs-as-software, and it bears further consideration.
Most tools we use, whether a spreadsheet or a hammer, have a definite form with which we’re familiar. We understand what each can do and what it cannot. We don’t try to use a hammer to drive a screw. Nor do we mock up a web site in Excel. But LLMs can do all of this: they can build you a discounted cash flow analysis that you would otherwise build in Excel; they can mock up a web site, a task for which you would otherwise use Figma; they can be connected to a robot wielding a hammer.
So this is what an enterprise has to contend with when trying to figure out how AI fits into its workflows. AI seems useful, but it also seems unreliable. How can an enterprise reconcile this? There are at least seven techniques that can be used to reduce LLMs’ tendency to generate unreliable or inaccurate output.
Methods for making LLM output more reliable
Output filtering and validation. One straightforward approach is to implement layers of output filtering and validation to catch and correct hallucinations or inaccurate responses before they reach the end user. This can involve:
Rule-based checks to verify that outputs adhere to specific factual or logical constraints.
Cross-referencing with trusted databases or APIs to validate facts and figures mentioned in the LLM’s responses (a sketch of both kinds of check follows this list).
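To make this concrete, here is a minimal Python sketch of such a validation layer. The `TRUSTED_PRICES` table, the SKU pattern, and the no-future-dates rule are all hypothetical stand-ins for whatever vetted data sources and business constraints an enterprise actually maintains.

```python
import re
from datetime import date

# Hypothetical trusted reference data; in practice this would be a query
# against an internal database or a vetted API.
TRUSTED_PRICES = {"WIDGET-A": 19.99, "WIDGET-B": 42.50}

def check_no_future_dates(text: str) -> bool:
    """Rule-based check: reject answers that cite years in the future."""
    years = [int(y) for y in re.findall(r"\b(20\d{2})\b", text)]
    return all(y <= date.today().year for y in years)

def check_prices_against_db(text: str) -> bool:
    """Cross-reference any 'SKU costs $X' claims against the trusted table."""
    for sku, price in re.findall(r"(WIDGET-\w+) costs \$(\d+\.\d{2})", text):
        if sku in TRUSTED_PRICES and float(price) != TRUSTED_PRICES[sku]:
            return False
    return True

def validate_answer(text: str) -> bool:
    """Run every check; only answers that pass them all reach the end user."""
    checks = [check_no_future_dates, check_prices_against_db]
    return all(check(text) for check in checks)

if __name__ == "__main__":
    draft = "WIDGET-A costs $24.99 and ships in 2023."
    print(validate_answer(draft))  # False: the price disagrees with the trusted table
```

In a real deployment the checks would query internal systems of record rather than an in-memory dictionary, and a failed check would trigger a retry, a fallback answer, or human review.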
Human-in-the-loop (HITL). Incorporating human review can enhance the reliability of LLM outputs, especially for critical applications.
Manual review of outputs by domain experts can ensure accuracy and relevance, although this approach may not scale well for high-volume applications.
Semi-automated workflows where LLM outputs are used to draft responses or generate insights, but human oversight is applied selectively based on confidence scores or other metrics (a routing sketch follows this list).
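Here is a minimal sketch of such a semi-automated workflow, assuming a hypothetical confidence score attached to each draft; in practice that score might come from token log-probabilities, a verifier model, or agreement across samples. `send_to_reviewer` and `send_to_user` are placeholder functions.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.85  # hypothetical cut-off; tune per use case and risk level

@dataclass
class Draft:
    text: str
    confidence: float  # e.g. derived from log-probs or a separate verifier model

def route(draft: Draft) -> str:
    """Send low-confidence drafts to a human reviewer; pass the rest through."""
    if draft.confidence < REVIEW_THRESHOLD:
        return send_to_reviewer(draft)
    return send_to_user(draft)

def send_to_reviewer(draft: Draft) -> str:
    # Placeholder: a real system would create a review ticket or queue item.
    return f"QUEUED FOR REVIEW: {draft.text}"

def send_to_user(draft: Draft) -> str:
    return draft.text

if __name__ == "__main__":
    print(route(Draft("The contract renews on 1 March.", confidence=0.62)))
    print(route(Draft("Your invoice total is $1,200.", confidence=0.97)))
```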
Fine-tuning on domain-specific data. Customizing or fine-tuning the LLM on a specific domain’s data can improve its accuracy and reduce the likelihood of irrelevant or inaccurate outputs.
Domain-specific fine-tuning involves training the model further on a curated dataset from the specific field or industry, enhancing its performance on relevant tasks. I recently wrote a post about this, in the context of the legal field, here.
Prompt engineering and dynamic prompts, which adjust the input based on context or previous interactions, can also guide the model to produce more reliable outputs (a prompt-assembly sketch follows this list).
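As a rough illustration of dynamic prompting, the sketch below assembles the prompt from a fixed instruction, retrieved domain snippets, and the most recent conversation turns. The `DOMAIN_SNIPPETS` table and the naive keyword retrieval are hypothetical; a real system would use document search or a vector database here.

```python
# Hypothetical in-memory "knowledge base"; a real system would use a
# document store or vector database instead.
DOMAIN_SNIPPETS = {
    "refunds": "Refunds are issued within 14 days of a written request.",
    "warranty": "Hardware is covered by a 24-month limited warranty.",
}

SYSTEM_INSTRUCTION = (
    "Answer using ONLY the provided context. "
    "If the context does not contain the answer, say you do not know."
)

def retrieve_snippets(question: str) -> list[str]:
    """Naive keyword retrieval, standing in for real search or embeddings."""
    return [text for key, text in DOMAIN_SNIPPETS.items() if key in question.lower()]

def build_prompt(question: str, history: list[str]) -> str:
    """Assemble a dynamic prompt from instruction, context, and recent turns."""
    context = "\n".join(retrieve_snippets(question)) or "(no relevant context found)"
    recent = "\n".join(history[-3:])  # keep only the last few turns
    return (
        f"{SYSTEM_INSTRUCTION}\n\n"
        f"Context:\n{context}\n\n"
        f"Conversation so far:\n{recent}\n\n"
        f"Question: {question}"
    )

if __name__ == "__main__":
    print(build_prompt("What is the warranty period?", ["User asked about shipping."]))
```

Constraining the model to answer from supplied context, and refreshing that context on every turn, narrows the space in which it can hallucinate.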
Ensemble approaches. Using an ensemble of models and taking a consensus, or the most likely output, can also reduce errors (a voting sketch follows this list).
Combining multiple LLMs or different versions of a model to cross-verify outputs.
Hybrid models that integrate deterministic and probabilistic components to leverage the strengths of both approaches.
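A minimal sketch of consensus voting, assuming the individual models are already wrapped as simple question-to-answer functions; the `MODELS` list here is a placeholder for real calls to different LLMs or differently prompted versions of one model.

```python
from collections import Counter
from typing import Callable

# Placeholder "models": each is just a function from question to answer.
MODELS: list[Callable[[str], str]] = [
    lambda q: "Paris",
    lambda q: "Paris",
    lambda q: "Lyon",
]

def consensus_answer(question: str, min_agreement: float = 0.5) -> str | None:
    """Return the majority answer, or None if no answer is common enough."""
    answers = [model(question) for model in MODELS]
    best, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= min_agreement:
        return best
    return None  # disagreement: escalate to a human or a stronger model

if __name__ == "__main__":
    print(consensus_answer("What is the capital of France?"))  # "Paris"
```

Disagreement among the models is itself a useful signal: it can trigger escalation to a stronger model or to a human reviewer.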
Continuous monitoring and feedback loops. Establishing mechanisms for continuous monitoring and incorporating user feedback into the system can help identify and correct errors more efficiently.
Automated monitoring tools to track the performance and reliability of LLM outputs over time.
Feedback loops where user corrections and suggestions are used to continually fine-tune the model or refine its deployment strategy (a logging sketch follows this list).
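A minimal sketch of such a feedback loop, assuming a simple JSONL log: every answer is recorded, user corrections are attached to earlier interactions, and corrected examples are periodically exported as candidates for the next fine-tuning run. The file name and record fields are illustrative; a production system would use a proper data store.

```python
import json
from datetime import datetime, timezone

LOG_FILE = "llm_interactions.jsonl"  # illustrative path

def log_interaction(prompt: str, answer: str) -> None:
    """Append each model response to a log for later monitoring."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "answer": answer,
        "user_correction": None,
    }
    with open(LOG_FILE, "a") as f:
        f.write(json.dumps(record) + "\n")

def record_correction(index: int, correction: str) -> None:
    """Attach a user-supplied correction to an earlier interaction."""
    with open(LOG_FILE) as f:
        records = [json.loads(line) for line in f]
    records[index]["user_correction"] = correction
    with open(LOG_FILE, "w") as f:
        f.writelines(json.dumps(r) + "\n" for r in records)

def export_finetuning_candidates() -> list[dict]:
    """Corrected interactions become training examples for the next fine-tune."""
    with open(LOG_FILE) as f:
        records = [json.loads(line) for line in f]
    return [
        {"prompt": r["prompt"], "completion": r["user_correction"]}
        for r in records
        if r["user_correction"]
    ]

if __name__ == "__main__":
    log_interaction("When does the Q3 report close?", "October 15.")
    record_correction(-1, "September 30.")  # correct the most recent interaction
    print(export_finetuning_candidates())
```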
Limiting scope of use. Being selective about where and how LLMs are deployed in enterprise contexts can also mitigate risks.
Scope limitation to use cases where the probabilistic nature of LLMs is less likely to cause significant issues, or where there is a lower risk associated with potential inaccuracies (a gating sketch follows this list).
Clear communication about the limitations and expected reliability of LLM-powered features to users.
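Even scope limitation can be made mechanical rather than aspirational. The sketch below assumes a hypothetical policy table mapping use cases to made-up decisions; the important design choice is that unknown use cases default to being blocked.

```python
from typing import Callable

# Hypothetical policy table: which use cases the LLM may serve at all, which
# require a disclaimer, and which are blocked outright.
USE_CASE_POLICY = {
    "marketing_copy_draft": "allow",
    "internal_meeting_summary": "allow_with_disclaimer",
    "financial_advice": "block",
    "legal_opinion": "block",
}

DISCLAIMER = "Note: this draft was generated by an AI system and may contain errors."

def handle_request(use_case: str, generate: Callable[[], str]) -> str:
    """Gate LLM usage by use case before any text is generated."""
    policy = USE_CASE_POLICY.get(use_case, "block")  # default-deny unknown cases
    if policy == "block":
        return "This request must be handled by a person."
    draft = generate()
    if policy == "allow_with_disclaimer":
        return f"{draft}\n\n{DISCLAIMER}"
    return draft

if __name__ == "__main__":
    print(handle_request("legal_opinion", lambda: "draft text"))
    print(handle_request("internal_meeting_summary", lambda: "Action items: follow up with finance."))
```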
Advanced techniques. Emerging research and technologies offer advanced methods for improving the reliability of LLM outputs.
Causality models and reinforcement learning from human feedback (RLHF) to refine model outputs based on quality and relevance.
Uncertainty quantification techniques to assess and communicate the confidence level of model outputs (a self-consistency sketch follows this list).
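One simple form of uncertainty quantification is self-consistency sampling: ask the model the same question several times and treat the level of agreement as a rough confidence signal. In the sketch below, `fake_model` is a stand-in for a real stochastic model call (sampling at a temperature above zero).

```python
import random
from collections import Counter
from typing import Callable

def agreement_score(
    question: str, sample_model: Callable[[str], str], n: int = 5
) -> tuple[str, float]:
    """Sample the model n times; return the modal answer and the share that agrees."""
    answers = [sample_model(question) for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n

if __name__ == "__main__":
    # Placeholder for a non-deterministic LLM call.
    fake_model = lambda q: random.choice(["42", "42", "42", "41"])
    answer, confidence = agreement_score("What is 6 x 7?", fake_model)
    print(answer, confidence)  # e.g. "42 0.8"; low agreement flags answers for review
```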
These methods are all strategy-level decisions for the enterprise
The problem that we run into is this: the mental model that most corporate executives have of ‘artificial intelligence’ is that it is an IT problem. They think that it is similar to installing Microsoft Office on employees’ computers, or granting employees access to databases, etc. And, as we have reviewed above, it’s really not. A company must undertake a strategic review of the technology, its capabilities and limitations, and how these things map to the company’s needs. These are strategy-level considerations, not IT office considerations. Senior leadership needs to be intimately involved in these discussions, and provide the organizational authority and institutional backing for those who implement these technologies. Fobbing off these responsibilities to under-funded IT departments—which are often viewed as cost centers by traditional corporations, and so remain organizationally powerless—just won’t cut it.