Ethan Mollick and his colleagues recently released a paper exploring how management consultants can benefit from using generative AI tools like ChatGPT. He and his co-authors came up with the notion of a “‘jagged technological frontier’ where some tasks are easily done by AI, while others, though seemingly similar in difficulty level, are outside the current capability of AI.” This sentence appears in the paper’s abstract, and it’s worth quoting the abstract in full to give you the full context of the observation:
The public release of Large Language Models (LLMs) has sparked tremendous interest in how humans will use Artificial Intelligence (AI) to accomplish a variety of tasks. In our study conducted with the Boston Consulting Group, a global management consulting firm, we examine the performance implications of AI on realistic, complex, and knowledge-intensive tasks. The pre-registered experiment involved 758 consultants comprising about 7% of the individual contributor-level consultants at the company. After establishing a performance baseline on a similar task, subjects were randomly assigned to one of three conditions: no AI access, GPT-4 AI access, or GPT-4 AI access with a prompt engineering overview. We suggest that the capabilities of AI create a “jagged technological frontier” where some tasks are easily done by AI, while others, though seemingly similar in difficulty level, are outside the current capability of AI. For each one of a set of 18 realistic consulting tasks within the frontier of AI capabilities, consultants using AI were significantly more productive (they completed 12.2% more tasks on average, and completed tasks 25.1% more quickly), and produced significantly higher quality results (more than 40% higher quality compared to a control group). Consultants across the skills distribution benefited significantly from having AI augmentation, with those below the average performance threshold increasing by 43%, and those above increasing by 17% compared to their own scores. For a task selected to be outside the frontier, however, consultants using AI were 19 percentage points less likely to produce correct solutions compared to those without AI. Further, our analysis shows the emergence of two distinctive patterns of successful AI use by humans along a spectrum of human-AI integration. One set of consultants acted as “Centaurs,” like the mythical half-horse/half-human creature, dividing and delegating their solution-creation activities to the AI or to themselves. Another set of consultants acted more like “Cyborgs,” completely integrating their task flow with the AI and continually interacting with the technology.
It seems to me that the consultants who managed to turn themselves into “cyborgs,” to use the authors’ term, somehow figured out where the jagged frontier of AI capabilities lies for a given task, and adapted their workflows accordingly. Mollick provides more detail about this observation in a Substack post that serves as a more accessible companion piece to his research paper:
AI is weird. No one actually knows the full range of capabilities of the most advanced Large Language Models, like GPT-4. No one really knows the best ways to use them, or the conditions under which they fail. There is no instruction manual. On some tasks AI is immensely powerful, and on others it fails completely or subtly. And, unless you use AI a lot, you won’t know which is which.
The result is what we call the “Jagged Frontier” of AI. Imagine a fortress wall, with some towers and battlements jutting out into the countryside, while others fold back towards the center of the castle. That wall is the capability of AI, and the further from the center, the harder the task. Everything inside the wall can be done by the AI, everything outside is hard for the AI to do. The problem is that the wall is invisible, so some tasks that might logically seem to be the same distance away from the center, and therefore equally difficult—say, writing a sonnet and an exactly 50 word poem—are actually on different sides of the wall. The AI is great at the sonnet, but, because of how it conceptualizes the world in tokens, rather than words, it consistently produces poems of more or less than 50 words. Similarly, some unexpected tasks (like idea generation) are easy for AIs while other tasks that seem to be easy for machines to do (like basic math) are challenges for LLMs.
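Mollick’s sonnet-versus-50-word-poem example is easy to probe empirically. Below is a minimal sketch of such a probe, assuming the `openai` Python client (v1 or later) with an API key in the environment; the function name `word_count_probe` and the naive whitespace-based word count are my own illustrative choices, not anything from the paper:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def word_count_probe(trials: int = 5, target: int = 50) -> None:
    """Ask for an exactly-N-word poem and count what actually comes back."""
    for i in range(trials):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": f"Write a poem that is exactly {target} words long.",
            }],
        )
        poem = response.choices[0].message.content
        n = len(poem.split())  # rough count: splits on whitespace only
        print(f"trial {i + 1}: {n} words ({'hit' if n == target else 'miss'})")

if __name__ == "__main__":
    word_count_probe()
```

If Mollick’s observation holds, a handful of trials should show mostly misses on the exact count, even though the same model handles the formally stricter sonnet with ease.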
I decided to ask ChatGPT directly about this. I gave it the following prompt: In some of the discourse surrounding generative AI tools such as ChatGPT, there is the notion of the “jagged frontier”. This metaphor is taken to mean that there is a jagged, and unpredictable, line separating those things at which ChatGPT excels from those things which ChatGPT does poorly. Given this metaphor of the jagged frontier, how can we conceive of this puzzle? How can we learn which side of the jagged frontier a given task falls for ChatGPT?
It’s worth noting here that Ethan Mollick’s paper was published recently, which is to say, after ChatGPT’s knowledge cutoff date of September 2021. I therefore structured the prompt above so that ChatGPT was given sufficient context about a concept likely absent from its training corpus. In other words, if we assume that the metaphor of the “jagged frontier” of generative AI capabilities had not been used before Ethan Mollick and his colleagues used it in their paper, ChatGPT would not have been trained on it.
In any event, ChatGPT’s response to my prompt appears between the following section breaks.
Conception of the “Jagged Edge” Puzzle:
Asymmetry in Performance
Why it Occurs: The generative model underlying ChatGPT has a skewed distribution of expertise due to the heterogeneous nature of its training data. Some subjects are vastly represented, while others are underrepresented.
Implications: This asymmetry manifests in erratic performance, where the AI can answer complex queries in one domain but fail in a seemingly straightforward task in another.
Anecdotal Evidence: ChatGPT might provide a well-crafted response on a general concept like “quantum mechanics” but fail at answering a straightforward question about a less popular scientific theory.
Complexity Dimensionality
Multi-Faceted Nature: Tasks don’t come with a single measure of complexity. They can vary in linguistic sophistication, required domain knowledge, logical reasoning, etc.
Non-Linearity: A task might appear simple but may intersect multiple dimensions of complexity, pushing it to the ‘poor performance’ side of the jagged edge.
Examples: Understanding sarcasm, for instance, requires both linguistic and contextual understanding, and it’s a task that generative models like ChatGPT sometimes struggle with.
Training vs Reality
Data Limitations: ChatGPT is trained on a large corpus of text from the internet, books, articles, etc., which may not fully represent the depth or breadth of human knowledge.
False Positives/Negatives: The tool may produce responses that seem accurate but are actually misleading or incorrect due to gaps or biases in the training data.
The Long Tail: Some topics are so rare that they fall into the long tail of the data distribution, making the model’s performance on them inherently unpredictable.
Emergent Behavior
Unpredictable Outcomes: Certain combinations of input queries and context may result in ChatGPT generating responses that even its developers didn’t explicitly intend.
Chained Reasoning: Sometimes, a line of inquiry might trigger a sequence of reasoning within the model that either brilliantly solves a problem or leads to an unexpected failure.
Performance Boundaries
Temporal Variability: The capabilities of ChatGPT shift with updates, re-training, or tweaking of hyperparameters.
Moving Goalposts: What is considered the edge today may not be the edge tomorrow, complicating longitudinal studies of its capabilities.
External Factors: Introduction of new competitors, or changes in underlying algorithms and hardware, can also affect where the jagged edge lies.
Each of these elements represents a different facet of the “jagged edge” problem, and understanding them in depth can pave the way for more reliable metrics and frameworks to gauge the capabilities and limitations of generative models like ChatGPT.
This is all very interesting, and informative, but it’s also fairly theoretical. What I’m interested in are the practical, concrete things a user can do to understand what ChatGPT is good at, and what it is not good at. Where, in other words, does the jagged frontier lie, and how can I identify that invisible line? How did the management consultants who outperformed their peers in Ethan Mollick’s study use AI to distinguish themselves?
I gave ChatGPT the following prompt: Great, given this explanation, what are some practical tips that a person can use to understand where this jagged frontier lies?
Its response follows below the section break.
Practical Tips for Understanding the Jagged Frontier of ChatGPT
1. Start with Broad Queries, Then Narrow Down
Why: This helps you gauge the general domain capabilities before diving into specific tasks.
How: Start with a question like "Tell me about machine learning" and then narrow it down to "Explain the concept of backpropagation in neural networks."
2. Utilize Known Facts as Baselines
Why: Using known facts can act as a control mechanism to evaluate the reliability of the model.
How: Ask questions whose answers are well-established, then evaluate how closely the model’s response aligns with those facts.
3. Identify Boundary Cases
Why: Knowing the edge cases can provide insights into the model's limitations.
How: Test questions that are ambiguous, misleading, or involve multi-step reasoning to observe how the model copes.
4. Experiment with Context
Why: Context greatly affects performance.
How: Repeat the same query but provide different context or framing to see if the model's answers change.
5. Investigate Time Sensitivity
Why: Some queries are time-sensitive due to model updates.
How: Revisit previously asked questions after significant updates to see if there’s a shift in the jagged edge.
6. Use Counterfactuals
Why: They help test the model’s reasoning capabilities.
How: Ask “what if” questions that challenge conventional wisdom or established facts.
7. Leverage User Communities
Why: Collective experience offers a broader view of capabilities.
How: Follow forums, social media, or specialized groups that discuss the strengths and weaknesses of ChatGPT.
8. Conduct A/B Testing
Why: It provides statistical evidence of performance.
How: Use the same set of queries but alter one variable, like prompt engineering, to gauge performance differences.
9. Test with Specialized Domains
Why: To understand the model's expertise or lack thereof in specialized areas.
How: Pose questions in niche areas like quantum mechanics or specific historical events to observe the accuracy and depth of the model's responses.
10. Document and Reflect
Why: Keeping records helps in noticing patterns and anomalies.
How: Maintain a log of your interactions, especially the surprising failures or successes, to analyze later.
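Several of these tips lend themselves to a small amount of automation. The sketch below combines tips 2, 8, and 10: it runs a few baseline questions with well-established answers, runs the same query under two prompt framings, and appends every result to a log for later review. It again assumes the `openai` Python client (v1 or later); the probe questions, the `run_probes` function, and the JSONL log format are hypothetical choices of mine, offered only as a starting point:

```python
import json
import time

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Tip 2: questions with well-established answers act as a control mechanism.
BASELINE_PROBES = [
    ("In what year did Apollo 11 land on the Moon?", "1969"),
    ("What is the chemical symbol for gold?", "Au"),
]

# Tip 8: the same underlying query under two framings, varying one thing at a time.
AB_VARIANTS = [
    ("plain", "Summarize the causes of the French Revolution."),
    ("structured", "You are a history tutor. In three bullet points, "
                   "summarize the causes of the French Revolution."),
]

def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def run_probes(log_path: str = "frontier_log.jsonl") -> None:
    """Run baseline and A/B probes, appending each result to a JSONL log (tip 10)."""
    with open(log_path, "a") as log:
        for question, expected in BASELINE_PROBES:
            answer = ask(question)
            log.write(json.dumps({
                "ts": time.time(),
                "kind": "baseline",
                "prompt": question,
                "answer": answer,
                "contains_expected": expected.lower() in answer.lower(),
            }) + "\n")
        for label, prompt in AB_VARIANTS:
            log.write(json.dumps({
                "ts": time.time(),
                "kind": f"ab:{label}",
                "prompt": prompt,
                "answer": ask(prompt),
            }) + "\n")

if __name__ == "__main__":
    run_probes()
```

Over time the log becomes a personal map of the frontier: each entry records where a given prompt landed relative to the invisible wall, and rereading the log after a model update (tip 5) shows whether the wall has moved.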