Q* and the road to AGI
Speculation is rampant that OpenAI has figured out how to get to AGI, using something called Q*. What, exactly, is going on?
Speculation is rampant that OpenAI has discovered a path to artificial general intelligence. There is further speculation that this discovery precipitated this past weekend’s temporary ouster of Sam Altman.
There are a lot of links flying around, but I want to focus on two, both sourced from this Marginal Revolution post. The post offers three links, but one of them is to the Twitter account of someone I find frustratingly oblique, so I’m ignoring that one. The two links I am focusing on are:
I am not an AI researcher, but I am, as this Substack attests, interested in artificial intelligence, whether of the general kind or not. So I decided to run these links through ChatGPT and ask it for a plain-English explanation of their claims. What follows below the section break is ChatGPT’s unexpurgated output1.
The TL;DR is that Q* and Google’s Gemini project seem to be converging on strategies for developing artificial intelligence agents that exhibit the following qualities and skills:
Combining Diverse Learning Strategies
Better Planning and Decision-Making
Combining Specialized and General Knowledge
Flexibility and Order
OpenAI already lets users build their own agents via its GPTs feature; however, those agents can’t interact with each other, and they’re limited in functionality by the underlying technology (GPT-4) as well as by the GPT-4 API. Further, the most powerful ability of these GPTs, calling external APIs, is hobbled when users run ChatGPT on a mobile device, due to Apple’s and Google’s privacy policies. But to the extent that someone is concerned about the above four points leading to artificial general intelligence, one could see these baby GPTs as another step on the road to AGI. Thus, we may have identified the proximate cause of the OpenAI board’s initial decision to fire Sam Altman.
IMPORTANT: Note that ChatGPT uses the word “your” a few times in its response to me: it mistakenly attributes to me what I have copied from Twitter and LessWrong. As my intention here is to provide you with an unexpurgated version of what ChatGPT infers from these claims, I have left these errors in ChatGPT’s output.
For the Twitter thread, I manually copied the tweets into ChatGPT’s interface, since Elon doesn’t allow OpenAI to access Twitter.
With these comments, explanations, and clarifications out of the way, let’s proceed to ChatGPT’s analysis. Bracketed and bolded comments are mine; they note which link ChatGPT is commenting on, along with the occasional illustrative aside, so that you can keep track.
[First, the Twitter thread.]
The concept you’re describing involves a sophisticated approach to enhancing language model training using a population of agents, self-play, and reinforcement learning (RL) to achieve a more dynamic and robust text generation capability. Here’s a plain English breakdown of the key elements and their significance:
Population of Agents for Text Generation:
What: Training multiple agents (mini-models within a large model) to cover different aspects of a large text corpus.
Why: Similar to how a human brain uses different areas to process various types of information, this approach aims to create specialized agents for different text types or contexts, improving the overall model’s capability to handle diverse text.
Self-Play with Reinforcement Learning (RL) Reward Signal:
What: Using self-play (agents interacting with each other) with a reward system that encourages agents to return to their original text passage.
Why: This method trains agents to stay relevant to the original topic and context, simulating more realistic and coherent text generation, like a focused conversation among experts.
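[A quick aside from me: to make the “return to the original passage” reward concrete, here is a minimal Python sketch of how such a reward might be scored. This is my own illustration, not ChatGPT’s or OpenAI’s, and it assumes the original passage and the agent’s generated text have already been turned into vectors by some embedding step I don’t show.]

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def return_to_source_reward(original_vec: np.ndarray,
                            generated_vec: np.ndarray,
                            steps_off_topic: int = 0,
                            drift_penalty: float = 0.1) -> float:
    """Reward an agent for steering its text back toward the original passage:
    high similarity to the source is good, prolonged drift is penalized."""
    return cosine(original_vec, generated_vec) - drift_penalty * steps_off_topic
```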
Maximizing Inter-Agent Orthogonality and Density Coverage:
What: Ensuring agents are diverse (orthogonal) yet collectively cover a wide range of text types and styles (density coverage).
Why: To avoid redundancy among agents and ensure comprehensive understanding and generation of varied text, much like having a team of experts from different fields.
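[Another aside from me: “orthogonality” and “density coverage” can be read as two measurable quantities, roughly sketched below. Again, this is purely illustrative; the agent and corpus vectors are assumed to come from an embedding step not shown, and nothing here describes OpenAI’s actual training setup.]

```python
import numpy as np

def redundancy_and_coverage(agent_vecs: np.ndarray,
                            corpus_vecs: np.ndarray,
                            coverage_threshold: float = 0.7):
    """agent_vecs: (n_agents, d) unit vectors, one per specialist agent.
    corpus_vecs: (n_chunks, d) unit vectors, one per chunk of the corpus.
    Returns (mean pairwise agent similarity, fraction of corpus covered)."""
    n = len(agent_vecs)
    sims = agent_vecs @ agent_vecs.T                  # pairwise cosine similarities
    redundancy = sims[~np.eye(n, dtype=bool)].mean()  # low = more "orthogonal" agents
    # A corpus chunk counts as covered if at least one agent sits close to it.
    best_match = (corpus_vecs @ agent_vecs.T).max(axis=1)
    coverage = (best_match >= coverage_threshold).mean()
    return float(redundancy), float(coverage)
```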
Reducing A:B Correlation for Higher-Level Representation Learning:
What: Minimizing direct correlations between two points (A and B) in training data to encourage learning of more abstract concepts.
Why: It’s like teaching someone to understand a story’s theme rather than just memorizing the words. This leads to a deeper and more nuanced understanding of texts.
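[My aside: one loose way to read “reducing A:B correlation” is as a penalty on the statistical correlation between paired representations; the sketch below shows that interpretation and nothing more. How A and B are paired, and how the representations are produced, are my assumptions.]

```python
import numpy as np

def ab_correlation_penalty(a: np.ndarray, b: np.ndarray) -> float:
    """a, b: (batch, d) arrays of paired representations (the 'A' and 'B' points).
    Returns the mean absolute per-dimension Pearson correlation; adding this
    (scaled) to the training loss discourages features that are mere copies."""
    a_c = a - a.mean(axis=0)
    b_c = b - b.mean(axis=0)
    corr = (a_c * b_c).mean(axis=0) / (a_c.std(axis=0) * b_c.std(axis=0) + 1e-8)
    return float(np.abs(corr).mean())
```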
Incorporating Implicit Agent-Agent Communication and Cooperation:
What: Training agents to interact and cooperate implicitly, sharing information and strategies.
Why: This simulates more realistic scenarios where different information sources (like people) collaborate to form a complete picture or narrative.
Addressing the Limitations of Current Large Language Models (LLMs):
Problem: LLMs are not trained to recover from errors in text generation; they deviate from the original text distribution during inference.
Solution: By training agents to return to the original text distribution after deviations, the model learns to self-correct, much like a storyteller adjusting their narrative to stay on track.
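[My aside: one plausible mechanism for pulling generation back toward the original distribution is to penalize divergence from a frozen reference model, similar in spirit to the KL penalty used in RLHF-style fine-tuning. The sketch below is illustrative only and does not describe anything OpenAI has confirmed.]

```python
import numpy as np

def kl_to_reference(policy_probs: np.ndarray, reference_probs: np.ndarray) -> float:
    """KL(policy || reference) over a next-token distribution. Large values mean
    generation has drifted far from the original text distribution."""
    p = np.clip(policy_probs, 1e-12, 1.0)
    q = np.clip(reference_probs, 1e-12, 1.0)
    return float(np.sum(p * np.log(p / q)))

def shaped_reward(task_reward: float,
                  policy_probs: np.ndarray,
                  reference_probs: np.ndarray,
                  beta: float = 0.1) -> float:
    """Task reward minus a drift penalty, pulling the agent back on distribution."""
    return task_reward - beta * kl_to_reference(policy_probs, reference_probs)
```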
Creating a More Human-Like Interaction Model:
Goal: To have agents that mimic human conversation patterns, where information is limited and cooperation is necessary.
Outcome: This leads to a language model that better understands and generates human-like text, improving its applicability and user experience.
In summary, your approach aims to create a more dynamic, cooperative, and context-aware language model by training multiple specialized agents to work together, stay relevant to original texts, and learn from each other’s strengths and weaknesses. This method addresses current limitations in language models by focusing on more human-like interaction patterns and error recovery, potentially leading to more realistic and versatile text generation.
[I then asked ChatGPT to speculate about how this might be a way to achieve artificial general intelligence, as much of the speculation on Twitter and in the media has been that Sam Altman was booted from OpenAI in part due to fears that this research presaged an imminent AGI.]
The approach you’ve described, focusing on training a population of agents with inter-agent orthogonality and density coverage, and utilizing self-play with a reinforcement learning (RL) reward system, has intriguing implications for the pursuit of Artificial General Intelligence (AGI). Here’s a speculative exploration of how these concepts might contribute to achieving AGI:
Diverse and Specialized Knowledge Representation:
Concept: Training different agents to specialize in various aspects of a large text corpus mirrors the human brain’s division of labor among different regions for different cognitive tasks.
Towards AGI: This approach can lead to a more nuanced and comprehensive understanding of complex concepts, akin to having a team of specialized experts. This diversity in cognitive processing is a key characteristic of human intelligence that AGI endeavors to emulate.
Dynamic Interaction and Collaboration Among Agents:
Concept: Agents interacting and learning from each other in a self-play environment can develop a more complex understanding of interactions and dependencies in data.
Towards AGI: Such dynamic interactions could foster emergent behaviors and problem-solving strategies that are not directly programmed, a hallmark of general intelligence.
Advanced Contextual Understanding and Adaptation:
Concept: By incentivizing agents to return to the original context or text distribution, they learn to stay relevant and adapt to evolving scenarios.
Towards AGI: This ability to understand and adapt to context is crucial for AGI, enabling the system to apply knowledge flexibly across varied scenarios, much like human reasoning.
Orthogonality and Cooperative Learning:
Concept: Encouraging diversity (orthogonality) among agents while ensuring they work cooperatively can lead to a more robust and comprehensive learning system.
Towards AGI: This balance between diversity and cooperation is essential for AGI, mirroring how human cognition integrates disparate viewpoints and knowledge areas to form holistic understanding.
Error Recovery and Robustness:
Concept: Training agents to correct their course and return to the original text distribution addresses a critical flaw in current models, which is the inability to recover from errors autonomously.
Towards AGI: The capacity for self-correction and resilience in the face of errors is a key aspect of human intelligence and learning, which is essential for AGI.
Mimicking Human-Like Interaction:
Concept: Creating agents that simulate human conversation patterns, where cooperation is necessary and information is limited.
Towards AGI: This could lead to more human-like problem-solving and interaction abilities, a core aim of AGI.
Learning State Space Recovery:
Concept: Training agents to understand and navigate back to a desired state space (e.g., original text distribution) when perturbed.
Towards AGI: This is analogous to cognitive flexibility in humans, where we adjust our thinking and strategies based on new information or errors, a critical component of AGI.
In conclusion, your approach, which integrates diverse knowledge representation, dynamic interaction, contextual understanding, and error recovery, presents a promising avenue towards AGI. It aligns with many key attributes of human-like intelligence, such as flexibility, adaptability, and the ability to learn from interaction and experience. While there’s still a significant journey ahead to achieve true AGI, strategies like the one you’re exploring are important steps in that direction, pushing the boundaries of current AI capabilities towards more generalizable and sophisticated forms of intelligence.
[Next, I asked ChatGPT to synthesize its analysis with the LessWrong link I provide above.]
The information presented in the LessWrong article about OpenAI’s Q* project and DeepMind’s AlphaGo-type systems, when synthesized with the concepts of training a population of agents for text generation and using self-play with reinforcement learning, offers a compelling glimpse into the potential pathways towards Artificial General Intelligence (AGI).
Combining Diverse Learning Strategies: The Q* project, speculated to be a blend of Q-learning and A* algorithm, or an AlphaGo-style Monte Carlo Tree Search for token trajectories, represents an advanced approach to problem-solving and learning. Similarly, training multiple agents to specialize in different aspects of a large text corpus, and encouraging their interaction and cooperation, mirrors this multidisciplinary approach, which could be key in achieving AGI.
Improving Planning and Agentic Behavior: OpenAI’s Q* potentially represents a significant advance in planning and agentic behavior for AI models. By scaling this up to larger models, AI could start planning for more abstract goals, a fundamental aspect of AGI. This aligns with the approach of training agents to navigate complex text environments and recover from errors autonomously, enhancing the model’s ability to handle unforeseen scenarios and think more abstractly.
Synergy Between General and Narrow Intelligence: DeepMind’s Gemini project, combining AlphaZero-based Monte Carlo Tree Search with the language capabilities of large models, illustrates a fusion of general and narrow intelligence. This echoes the need for AI systems that not only have specialized knowledge (narrow intelligence) but can also apply this knowledge flexibly in a wide range of scenarios (general intelligence). Training agents with specialized knowledge that work together in a cooperative and context-aware manner moves in this direction.
Flexibility and Systematicity for Superintelligence: The pursuit of AGI requires a balance between flexibility and systematicity. The discussed approaches—whether it’s the speculated mechanics of Q* or the method of training a population of agents for diverse and dynamic text generation—aim to imbue AI systems with the ability to adapt to new information and systematically apply their learning across various contexts.
In summary, the integration of diverse learning strategies, improved planning and agentic behavior, and the synergy between general and narrow intelligence, all point toward a more holistic and sophisticated approach to AI development. This approach, which aligns with the principles discussed in the LessWrong article, could significantly advance the quest for AGI, moving beyond traditional models to systems that more closely mimic the complexity and adaptability of human intelligence.
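[One last aside from me before the plain-English restatement: here is a toy sketch of what “A*-style search over token trajectories” could mean in code, with the accumulated log-probability playing the role of the path cost and a learned value estimate playing the role of the heuristic. The functions next_token_logprobs and value_estimate are hypothetical stand-ins; none of this reflects OpenAI’s actual Q* work.]

```python
import heapq

def astar_decode(prompt_tokens, next_token_logprobs, value_estimate,
                 eos_token, max_len=64, expand_k=5):
    """Best-first search over token sequences: priority combines accumulated
    log-probability (path cost so far) and a learned value estimate (heuristic),
    the rough analogue of A*'s f(n) = g(n) + h(n)."""
    # heapq pops the smallest element, so scores are negated.
    frontier = [(-value_estimate(list(prompt_tokens)), list(prompt_tokens), 0.0)]
    while frontier:
        _, seq, logprob_so_far = heapq.heappop(frontier)
        if seq[-1] == eos_token or len(seq) >= max_len:
            return seq                                   # best completed trajectory
        # Expand only the top-k likely next tokens, like A* expanding neighbors.
        candidates = sorted(next_token_logprobs(seq).items(),
                            key=lambda kv: kv[1], reverse=True)[:expand_k]
        for token, logp in candidates:
            new_seq = seq + [token]
            new_logprob = logprob_so_far + logp
            priority = -(new_logprob + value_estimate(new_seq))
            heapq.heappush(frontier, (priority, new_seq, new_logprob))
    return None
```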
[I then asked it to re-state the above in plain English.]
Sure, I’ll simplify that:
Mixing Different Ways to Learn: Just like mixing different ingredients to bake a better cake, Q* and Gemini projects mix different AI learning methods to solve problems more smartly. This is like using various experts (agents) in AI to understand and talk about a wide range of topics.
Better Planning and Decision-Making: Q* might help small AI models think ahead and make better choices, like planning a road trip using the best route. If this works in bigger AI models, they could start thinking about really complex stuff, like a chess master planning several moves ahead.
Combining Specialized and General Knowledge: Gemini tries to mix specific skills (like playing a game really well) with the ability to understand and use language broadly. This is like having a specialist doctor who is also a great general practitioner.
Flexibility and Order: For AI to be really smart (like human-level smart), it needs to be flexible (able to handle new things) and orderly (able to use what it knows in a tidy way). The new approaches in Q* and Gemini are steps towards this, making AI more like a versatile and organized human thinker.