Silicon Valley claims that 2025 is the year of the AI agent. Reality, as ever, is more complicated. AI agents are brittle. They lose track of what they’re doing. They get stuck in loops. They’re not ready for the enterprise, and adoption will be much slower than the optimistic hypemen think. What follows is an outline of my research into the current state of AI agents and where the major platforms stand in developing robust agentic capabilities.
I. Why AI Agents Are Brittle
1. Static Memory and Context Windows
Most agents operate within a narrow context window (e.g., 128k tokens) and lose continuity across sessions unless explicitly designed with tooling for memory. This means they:
Forget prior knowledge unless redundantly prompted.
Cannot build persistent models of the world or themselves.
Fail to update internal state dynamically based on environment feedback.
Contrast this with humans, whose cognition is deeply recursive and memory-rich: we remember not just facts but meta-level experiences, goals, and priors.
2. Lack of Grounded Causality
Current agents:
Correlate rather than understand.
Predict likely continuations rather than reason through chains of cause and effect.
Do not build structured, manipulable world models. They only have latent statistical associations.
This makes them poor at generalizing across domains, handling novel edge cases, or doing anything off-distribution.
3. Weak Agency and Goal-Tracking
What we call agents today are usually orchestration layers wrapped around LLMs. Most:
Have no intrinsic or persistent goal representation.
Cannot revise plans dynamically based on failure modes.
Struggle with long-horizon planning or recursive delegation.
They're not agents in the cybernetic or control-theoretic sense. They’re more like workflow puppets.
4. Poor Tool Use and Environment Interaction
Agents break when:
they need to use APIs that change or error out
they are required to manipulate files, stateful databases, UIs, or hardware
they encounter non-determinism or asynchronous results
Their lack of environment awareness limits them to brittle, template-bound behavior.
5. Lack of Real Autonomy
Most agents are glorified decision trees or scripted pipelines:
They rely on human-scaffolded workflows.
They cannot learn from their own failures unless re-trained.
They don’t exhibit agentic learning: self-modification, meta-reasoning, or reflective feedback loops.
II. What Has to Change
1. Architectural Shift: From Predictive Models to Agentic Architectures
We need models that combine:
Long-term memory (retrieval + update, à la MemGPT-style memory-backed agents).
Active world modeling (e.g., causal graphs or symbolic hybrids).
Planning and reflection modules that can recursively refine strategies and actions.
Think less GPT-as-brain, more modular systems with memory, planner, critic, and actor.
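To make that concrete, here is a minimal Python sketch of a planner / critic / actor loop wired around a shared memory. The class names and control flow are illustrative assumptions, not a reference to any particular framework; in a real system the planner and critic would call an LLM rather than return placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Append-only log the other modules can read and write."""
    events: list = field(default_factory=list)

    def write(self, event: str) -> None:
        self.events.append(event)

    def recall(self, n: int = 5) -> list:
        return self.events[-n:]

class Planner:
    def propose(self, goal: str, memory: Memory) -> list:
        # In a real system this would call an LLM with the goal plus recalled context.
        return [f"step toward: {goal}"]

class Critic:
    def review(self, plan: list, memory: Memory) -> bool:
        # Placeholder check; a real critic would score the plan against constraints.
        return len(plan) > 0

class Actor:
    def execute(self, step: str) -> str:
        # Placeholder for tool calls, API requests, file edits, etc.
        return f"executed: {step}"

def run_agent(goal: str, max_iters: int = 3) -> Memory:
    memory, planner, critic, actor = Memory(), Planner(), Critic(), Actor()
    for _ in range(max_iters):
        plan = planner.propose(goal, memory)
        if not critic.review(plan, memory):
            memory.write("plan rejected; replanning")
            continue
        for step in plan:
            memory.write(actor.execute(step))
    return memory

memory = run_agent("summarize this week's new papers")
print(memory.recall())
```

The point of the separation is that the critic and memory can veto or correct the planner, instead of one monolithic model call doing everything.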
2. Persistent Memory + State Tracking
Agents need a structured and queryable memory that persists across sessions.
Must include episodic (what happened), semantic (world knowledge), and procedural (how to do) memories.
Memory must be updatable and context-aware, not just retrieval-augmented.
Without this, agents cannot develop continuity or situational awareness.
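A rough sketch of what separating those three memory types might look like; the schema, the update methods, and the trivial query logic are assumptions for illustration, not a description of any shipping memory system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MemoryStore:
    episodic: list = field(default_factory=list)    # what happened, time-stamped
    semantic: dict = field(default_factory=dict)    # world knowledge, keyed facts
    procedural: dict = field(default_factory=dict)  # how-to recipes, keyed by task

    def log_event(self, event: str) -> None:
        self.episodic.append((datetime.now(timezone.utc), event))

    def update_fact(self, key: str, value: str) -> None:
        # Update in place rather than append: facts can be revised, not just retrieved.
        self.semantic[key] = value

    def learn_procedure(self, task: str, steps: list) -> None:
        self.procedural[task] = steps

    def query(self, key: str):
        # A context-aware lookup would rank across all three stores; this is the trivial version.
        return self.semantic.get(key) or self.procedural.get(key)

mem = MemoryStore()
mem.log_event("booked flight UA112")
mem.update_fact("user_home_airport", "SFO")
mem.learn_procedure("book_flight", ["search fares", "pick seat", "confirm payment"])
print(mem.query("book_flight"))
```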
3. Tool-Use as a Core Primitive
Tool use must be:
Reliable (robust interface wrappers, fallback strategies).
Compositional (the agent learns how to combine tools in novel ways).
Reflective (can debug its tool use and improve over time).
LLMs should be the brain but not the hands: true tool-use requires architectural support for action, perception, and feedback.
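As a concrete illustration of "reliable" tool use, here is a minimal retry-plus-fallback wrapper with an attempt log the agent could later reflect on. The tool functions (flaky_search, cached_search) are hypothetical stand-ins for real API wrappers.

```python
import time

class ToolError(Exception):
    pass

def call_with_fallback(primary, fallback, payload, retries: int = 2, backoff_s: float = 0.5):
    """Try the primary tool, retry with backoff on failure, then fall back.
    Every attempt is logged so the agent can later debug its own tool use."""
    log = []
    for attempt in range(1, retries + 1):
        try:
            result = primary(payload)
            log.append(("primary", attempt, "ok"))
            return result, log
        except ToolError as exc:
            log.append(("primary", attempt, f"failed: {exc}"))
            time.sleep(backoff_s * attempt)
    result = fallback(payload)
    log.append(("fallback", 1, "ok"))
    return result, log

# Hypothetical tools standing in for real API wrappers.
def flaky_search(query: str) -> str:
    raise ToolError("rate limited")

def cached_search(query: str) -> str:
    return f"cached results for {query!r}"

result, trace = call_with_fallback(flaky_search, cached_search, "agent benchmarks")
print(result)
print(trace)
```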
4. Learning from Interaction, not Training Alone
Today’s agents don’t learn between tasks; their weights are frozen at deployment.
To become robust:
Agents must adapt post-deployment via reinforcement, feedback loops, or meta-learning.
This requires architectural support for online learning or self-correction, not just static inference.
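A toy sketch of post-deployment adaptation: the agent tracks per-strategy success rates and shifts toward what has worked, epsilon-greedy style. The strategy names and simulated feedback are assumptions; a production system would use the reinforcement or meta-learning approaches mentioned above.

```python
import random
from collections import defaultdict

class AdaptiveAgent:
    """Toy post-deployment learner: prefers strategies that have worked,
    while still exploring occasionally."""

    def __init__(self, strategies, epsilon: float = 0.2):
        self.strategies = strategies
        self.epsilon = epsilon
        self.successes = defaultdict(int)
        self.attempts = defaultdict(int)

    def choose(self) -> str:
        if random.random() < self.epsilon:
            return random.choice(self.strategies)  # explore
        # Exploit the best observed success rate so far.
        return max(self.strategies,
                   key=lambda s: self.successes[s] / self.attempts[s] if self.attempts[s] else 0.0)

    def record(self, strategy: str, succeeded: bool) -> None:
        self.attempts[strategy] += 1
        if succeeded:
            self.successes[strategy] += 1

agent = AdaptiveAgent(["retry_api", "ask_user", "use_cache"])
for _ in range(50):
    s = agent.choose()
    # Simulated environment feedback; a real agent would observe actual task outcomes.
    agent.record(s, succeeded=(s == "use_cache" and random.random() < 0.8))
print({s: (agent.successes[s], agent.attempts[s]) for s in agent.strategies})
```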
5. Multi-Agent and Collective Dynamics
One promising frontier: agents coordinating with other agents in structured environments (e.g., simulated companies, military swarms, open-ended games).
This allows for division of labor, robustness via redundancy, and emergent coordination.
Also forces agents to internalize theories of mind, negotiation, and strategic behavior.
But coordination is brittle unless agency is improved at the individual level.
6. Evaluation Infrastructure and Benchmarks
We need better stress tests:
Real-world sandbox environments (e.g., complex UIs, open web, persistent multi-day tasks).
Multi-step benchmarks where the agent has to plan, revise, delegate, and learn from outcomes.
Today’s evals (MT-Bench, AgentBench, etc.) are narrow, shallow, or synthetic.
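For contrast, here is what a longer-horizon harness could look like in outline: tasks are ordered subgoals with a final-state check, and the score is completion rather than single-turn answer quality. The task format and the dummy agent are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MultiStepTask:
    name: str
    steps: list                     # ordered subgoals the agent must complete
    check: Callable[[dict], bool]   # did the final state satisfy the goal?

def evaluate(agent_step: Callable[[str, dict], dict], tasks: list) -> dict:
    """Run each task step by step, carrying state forward, and score completion."""
    scores = {}
    for task in tasks:
        state: dict = {}
        for step in task.steps:
            state = agent_step(step, state)
        scores[task.name] = task.check(state)
    return scores

# Hypothetical agent stub; a real harness would drive an actual agent runtime.
def dummy_agent(step: str, state: dict) -> dict:
    state[step] = "done"
    return state

tasks = [MultiStepTask("plan_trip",
                       ["find flights", "book hotel", "build itinerary"],
                       check=lambda s: len(s) == 3)]
print(evaluate(dummy_agent, tasks))
```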
III. Future Directions
Cognitive Architectures: models that balance perception, action, and learning, inspired by ACT-R, SOAR, or even Friston's free energy principle.
Neuro-symbolic Hybrids: Combining deep models with structured reasoning modules (induction, abduction, deduction).
Recursive Self-Improvement: Agents that modify their own internal logic chains, choose new priors, and rewrite strategies dynamically.
Simulated Environment Pretraining: Analogous to how humans gain common sense from interacting with the world, not just language.
IV. Platform-by-Platform Trajectory Mapping
1. OpenAI (ChatGPT + GPT Agents + memory)
Strengths:
Best-in-class model performance (GPT-4o, o3, etc.), with strong instruction-following and tool use.
Memory system (2024+) being gradually deployed; structured but shallow so far.
Integrating voice, image, and persistent user profiles (partially embodied, partially agentic).
Weaknesses:
Planning is weak. Reflection is surface-level (“Thought: Let’s think step-by-step…”).
No true autonomy: Agents cannot act independently or persist outside chat threads.
Tool use is fragile: prone to hallucination, poor error recovery, shallow composition.
Trajectory Position: Mid-stage on memory integration; early-stage on planning, causal reasoning, and autonomy.
2. Anthropic (Claude)
Strengths:
Claude 3 Opus has impressive coherence and reliability across long documents.
Shows hints of meta-cognition in multi-step reasoning.
Weaknesses:
No persistent memory or long-horizon task management.
No agent infrastructure (no native tools, plugins, or API ecosystems).
Purely reactive LLM with no internal architecture for agency.
Trajectory Position: Early on all fronts except core inference quality. Claude is a brilliant assistant, not an agent.
3. Google DeepMind (Gemini + AlphaCode + Bard legacy)
Strengths:
Strong internal research in planning (e.g. Tree-of-Thoughts, AlphaCode, Gato).
Gemini is multi-modal and theoretically well-positioned to unify perception and action.
Weaknesses:
Very fragmented product surface. No coherent agentic runtime.
Weak open infrastructure; doesn’t expose enough to developers to build robust agents.
Trajectory Position: Mid-stage in research, early-stage in product. The intelligence is there, but it’s not yet scaffolded into persistent agents.
4. Adept
Strengths:
Squarely focused on tool-using agents that operate in real software environments (e.g. Excel, Salesforce).
Working on agents that see and act in GUI environments, not just text.
Weaknesses:
Still closed and early-stage; ACT-1 demos are brittle and carefully curated.
No published breakthroughs in memory, planning, or reflection.
Trajectory Position: Mid-stage on tool integration, early-stage elsewhere. Interesting dark horse with industrial workflow focus.
5. Inflection
Strengths:
Emotionally intelligent, highly conversational AI (Pi).
Pioneering in affective computing and user emotional state modeling.
Weaknesses:
Not an agent. No memory, no autonomy, no tools. It’s a companion chatbot.
Not aimed at general agency or execution.
Trajectory Position: Stalled or non-participating in the agentic race.
6. Reka, Cohere, Mistral, etc.
Mostly focused on LLM inference, not agent architecture. Some provide embedding search or RAG tools, but not agents per se.
Trajectory Position: Infrastructure layer only, not building agents.
V. Engineering & Investment Leverage Points
These are the high-leverage areas where investment or engineering will unlock disproportionately large improvements in agent robustness.
1. Architectural Modularity: Planner / Critic / Memory / Executor
Why: Brittle behavior stems from collapsing too many functions into one monolithic LLM.
Invest in agent runtimes that separate memory, planning, tool execution, and reflection.
Architectures like ReAct, AutoGPT, and OpenAgents point in this direction, but need deeper integration.
Engineering Strategy: Build structured agent frameworks where LLMs generate plans, but execution + evaluation + memory updates happen in separate modules.
Investment Opportunity: Back companies that are building agent operating systems.
2. Structured, Dynamic Memory
Why: Agents without memory are like humans with amnesia.
Move beyond retrieval-augmented generation to true episodic + semantic + procedural memory.
Memory must be writeable, queryable, reflective, and governed by policies (e.g. when to forget or revise).
Engineering Strategy: Use vector databases, graph stores, or local embeddings to build layered memory.
Investment Opportunity: Fund platforms or open source infra that offers memory-as-a-service (e.g., MemGPT-type offerings, open-weight memory agents).
3. Autonomy Loops (Persistent, Scheduled, Self-Driven Agents)
Why: Real agency requires acting without a human in the loop.
Agents must run asynchronous jobs, track state across failures, and retry intelligently.
Needs event-driven architectures, not request-response.
Engineering Strategy: Build runtimes that support agent-driven jobs, like:
“Check email and book flights every 4 hours.”
“Iteratively summarize every new research paper this week.”
Investment Opportunity: Back companies that move from chatbots to agent daemons: persistent digital workers with logs, identity, and goals.
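A minimal sketch of the daemon idea: jobs with schedules, retries, and no human in the loop. The AgentJob structure and the scheduler loop are illustrative assumptions; a real runtime would be event-driven and persist state across process restarts.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AgentJob:
    name: str
    action: Callable[[], bool]      # returns True on success
    interval_s: float
    max_retries: int = 3
    next_run: float = field(default_factory=time.monotonic)

def run_daemon(jobs: list, ticks: int = 3, tick_s: float = 1.0) -> None:
    """Fire each job on its schedule, retry on failure, and keep going
    without a human in the loop."""
    for _ in range(ticks):
        now = time.monotonic()
        for job in jobs:
            if now < job.next_run:
                continue
            for attempt in range(1, job.max_retries + 1):
                if job.action():
                    print(f"{job.name}: ok (attempt {attempt})")
                    break
                print(f"{job.name}: failed attempt {attempt}, retrying")
            job.next_run = now + job.interval_s
        time.sleep(tick_s)

# Hypothetical job standing in for "check email and book flights every 4 hours".
jobs = [AgentJob("check_email", action=lambda: True, interval_s=2.0)]
run_daemon(jobs)
```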
4. Robust Tool Ecosystems
Why: The next productivity revolution won’t be chat. It will be tool-wielding agents.
Need a reliable layer that binds tools (APIs, GUIs, scripts) into agents’ cognitive loop.
Tool use should be composable, discoverable, error-tolerant, and runtime-observable.
Engineering Strategy: Develop tool abstraction layers that let agents experiment, fail, and recover with tools safely.
Investment Opportunity: Support startups building toolchain orchestration layers (e.g., AI plugins, workflow SDKs, agent IDEs).
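One way such an abstraction layer could look: tools register with a description so agents can discover them, and every call is logged so tool use stays runtime-observable. The registry interface here is an assumption for illustration, not any specific SDK.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolSpec:
    name: str
    description: str            # what the tool does, so an agent can discover it
    run: Callable[..., object]

class ToolRegistry:
    """Minimal abstraction layer: registration, keyword discovery, observable calls."""

    def __init__(self):
        self._tools: dict[str, ToolSpec] = {}
        self.call_log: list = []

    def register(self, spec: ToolSpec) -> None:
        self._tools[spec.name] = spec

    def discover(self, keyword: str) -> list:
        return [t.name for t in self._tools.values() if keyword in t.description]

    def call(self, name: str, **kwargs):
        result = self._tools[name].run(**kwargs)
        self.call_log.append((name, kwargs, result))
        return result

registry = ToolRegistry()
registry.register(ToolSpec("send_email", "send an email to a recipient",
                           run=lambda to, body: f"sent to {to}"))
print(registry.discover("email"))
print(registry.call("send_email", to="alice@example.com", body="hi"))
print(registry.call_log)
```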
5. World Modeling and Simulated Environments
Why: Agents need environments to practice and learn. Today, there’s no sandbox.
Agents must be able to simulate, plan, and test hypotheses.
Environments can be code sandboxes, game-like worlds, or synthetic OSes.
Engineering Strategy: Create structured agent training worlds, analogous to game engines, for emergent behavior and curriculum learning.
Investment Opportunity: Fund simulation infrastructure companies or synthetic task generators (e.g., Arena, Voyager, Sweetpea-style testbeds).
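A toy example of such an environment, in the spirit of a gym-style reset/step interface over a synthetic filesystem task. The action format and reward scheme are assumptions; real training worlds would be far richer.

```python
import random

class SyntheticTaskEnv:
    """Toy sandbox: the agent must collect three files scattered across
    a tiny synthetic filesystem."""

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)
        self.reset()

    def reset(self) -> dict:
        self.files = {f"dir{self.rng.randint(0, 4)}/f{i}.txt" for i in range(3)}
        self.collected: set = set()
        return self.observe()

    def observe(self) -> dict:
        return {"remaining": len(self.files - self.collected)}

    def step(self, action: str):
        # Action format (an assumption for this sketch): "collect <path>"
        reward = 0.0
        if action.startswith("collect "):
            path = action.removeprefix("collect ")
            if path in self.files and path not in self.collected:
                self.collected.add(path)
                reward = 1.0
        done = self.collected == self.files
        return self.observe(), reward, done

env = SyntheticTaskEnv()
obs = env.reset()
for path in sorted(env.files):      # a scripted "agent" for demonstration
    obs, reward, done = env.step(f"collect {path}")
print(obs, done)
```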
6. Online Adaptation / Learning from Feedback
Why: Most agents today are frozen. Real agents adapt to feedback and evolve strategies.
Must learn from both success and failure, whether via reinforcement, active learning, or embedded retraining.
Engineering Strategy: Integrate memory + logging + reward models into the runtime so that agents can modify behavior over time.
Investment Opportunity: Support companies enabling online learning loops for agents (RLHF variants, meta-learners, open-ended training protocols).
VI. Final Synthesis
Coda
If you enjoy this newsletter, consider sharing it with a colleague.
Most posts are public. Some are paywalled.
I’m always happy to receive comments, questions, and pushback. If you want to connect with me directly, you can: