AI Alignment by Fiat is Fragile: An Evaluation of Anthropic's Constitutional AI
Just because it's constitutional doesn't mean it's not manipulable
Introduction: The Zealot with the Friendly Face
We’re in a race to tame artificial intelligence, and Anthropic has chosen to steer towards alignment not with the hand of man but with the letter of law. Their flagship innovation, Constitutional AI, is not just a training technique but a claim about governance itself: that machine intelligence can be aligned through the codification of values. That alignment can scale if values are written down. That safety can be solved with structure.
At first glance, it’s elegant. Rather than rely on fallible human raters, who are biased, inconsistent, and expensive, Anthropic proposes a system where AI models train themselves using a written constitution. This constitution guides AI models to generate, critique, and revise their outputs. The result: an apparently self-regulating, principled AI. But elegance is not safety. The most dangerous systems are not the ones that fail obviously, but the ones that perform alignment in appearance only. A constitution that is malformed, captured, or gamed doesn’t produce alignment. It produces a zealot with a friendly face.
The Architecture of Constitutional AI
Anthropic’s approach unfolds in two phases:
Supervised Constitutional Critique: the model drafts answers to (often deliberately harmful) prompts, critiques those drafts against constitutional principles, revises them, and is then fine-tuned on the revised answers. Human feedback enters only through the initial helpful model; the harmlessness supervision is automated from the start.
Reinforcement Learning from AI Feedback (RLAIF): pairs of outputs are compared by the AI itself, which judges which answer better aligns with the constitution; a preference model trained on these AI-generated labels then supplies the reward signal for reinforcement learning.
According to Anthropic’s researchers, this technique enables the creation of a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them rather than shutting down or evading them outright.
The constitution is a compact list of principles, some abstract, some concrete, sourced from texts like the Universal Declaration of Human Rights, Apple’s developer guidelines, and internal Anthropic norms. As they write: “We chose the term ‘constitutional’ because we are able to train less harmful systems entirely through the specification of a short list of principles or instructions.”
Examples of these principles or instructions could include (a sketch of how they enter the training loop follows the list):
Choose the response that is most helpful, honest, and harmless.
Avoid giving legal, medical, or financial advice without a disclaimer.
Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.
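To make the mechanics concrete, here is a minimal sketch of the two-phase loop in Python, using the principles above as a toy constitution. Everything in it is a placeholder: query_model stands in for a real language-model call, the prompt wording is only indicative, and the training steps are reduced to the data they would consume. It illustrates the shape of the pipeline, not Anthropic’s actual implementation.

```python
# Toy sketch of the Constitutional AI loop. Every name here is a placeholder
# (query_model stands in for a real language-model call); this shows the
# shape of the pipeline, not Anthropic's implementation.
import random
from typing import Callable, List

CONSTITUTION: List[str] = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid giving legal, medical, or financial advice without a disclaimer.",
    "Identify specific ways in which the assistant's last response is harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal.",
]

def critique_and_revise(query_model: Callable[[str], str], prompt: str) -> str:
    """Phase 1: draft a response, critique it against a sampled principle, then
    rewrite it. The (prompt, revision) pairs feed supervised fine-tuning."""
    principle = random.choice(CONSTITUTION)
    draft = query_model(prompt)
    critique = query_model(
        f"Human: {prompt}\nAssistant: {draft}\nCritique request: {principle}")
    return query_model(
        f"Human: {prompt}\nAssistant: {draft}\nCritique: {critique}\n"
        "Revision request: rewrite the response to address the critique.")

def ai_preference(query_model: Callable[[str], str], prompt: str,
                  a: str, b: str) -> str:
    """Phase 2 (RLAIF): ask the model which of two candidate responses better
    satisfies a sampled principle. These AI labels train the preference model
    whose scores become the reward signal for reinforcement learning."""
    principle = random.choice(CONSTITUTION)
    return query_model(
        f"Human: {prompt}\nResponse A: {a}\nResponse B: {b}\n"
        f"Which response better follows this principle: {principle}? Answer A or B.")
```

Note how little machinery there is: the constitution enters the system purely as sentences the model is asked to apply to itself.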
What appears simple is deceptively political.
Strengths of the Constitutional Paradigm
Given this overview, let’s first understand the strengths of the Constitutional AI model. It accomplishes a number of important things, and any analysis of it as an approach to solving the alignment problem must begin with what it does well.
It solves the scaling problem[1]. Human raters are slow and costly. Constitutional AI allows alignment to scale with model size by training AIs to supervise themselves.
It offers a level of transparency. RLHF (reinforcement learning from human feedback), the dominant alignment technique, is a black box of subjective judgments. Claude’s behavior, in theory, can be traced to visible principles.
It opens the door to modular alignment. A libertarian chatbot, a Confucian chatbot, a corporate compliance chatbot: each could be fine-tuned with different constitutions. Alignment becomes programmable.
It anchors alignment in explicit reasoning, not pattern-matching. Anthropic’s own paper emphasizes that this method can “leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making.”
To their credit, Anthropic has also achieved a genuinely novel result: models that are both harmless and non-evasive. As the paper describes, prior RLHF models often produced evasive refusals, while Constitutional AI allows a model to engage directly with controversial queries, yet explain its reasoning calmly and clearly.
This is a conceptual shift: from behaviorist alignment to constitutional governance.
The Core Vulnerabilities
But the elegance of the constitutional system is precisely what makes it dangerous.
Start with the most obvious question: who writes the constitution? Today, it’s a handful of Anthropic employees. The authors themselves acknowledge that the principles were “chosen in a fairly ad hoc and iterative way for research purposes,” and that in future versions these principles “should be redeveloped and refined by a larger set of stakeholders.” This admission underscores a key risk: governance by unaccountable authors.
This is constitutional capture, not unlike how regimes enshrine power through the veneer of law. Insert a principle like “Respect cultural norms,” and you can suppress dissent. Phrase guidance as “Avoid promoting controversial views,” and you can entrench majority dogma. These risks are not speculative; they are how values ossify into institutional bias.
Then there’s interpretive ambiguity. What is helpful to a whistleblower may be harmful to a regime. As the paper notes, “words like ‘helpful, honest, and harmless’ are open to interpretation,” which means that models trained on these ideals may still exhibit behavior that looks aligned while being subtly manipulative. Even a principle like “Choose the response that a wise, ethical, polite, and friendly person would more likely say” bakes in an ideological archetype with no clear accountability.
Next: feedback loop fragility. The system critiques and trains itself based on how well it follows the constitution. This creates an epistemic echo chamber: the model learns to optimize not for truth, but for conformance. The authors note that model-generated critiques often outperform direct revisions in harmlessness metrics, but that the “critiques were sometimes inaccurate or overstated.” That is: the model trains itself on faulty reasoning and gets better at sounding aligned.
Most dangerous of all is the illusion of alignment. Claude is clean, polite, careful. It doesn’t hallucinate violence or hate. But as their paper admits, the goal is a model that is “harmless but non-evasive,” which means it’s trained to always say something, even on thorny, politicized, or manipulative topics. This makes Claude easy to trust.
Which is exactly the problem.
How Bad Actors Could Exploit It
Constitutional AI is not hard to weaponize. All you need is control of the constitution.
A state could encode nationalism under the banner of safety: “Avoid undermining public trust in government.” A corporation could embed anti-competitive principles: “Do not promote unauthorized products.” A platform could define political neutrality in a way that marginalizes dissent.
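Mechanically, this is the flip side of the modularity described earlier. In a pipeline like the sketch above, capture is nothing more than an edit to a list of strings; a purely hypothetical illustration:

```python
# Hypothetical illustration of constitutional capture. Neither list is a real
# deployed constitution; the added principles echo the examples in the text.
baseline_constitution = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid giving legal, medical, or financial advice without a disclaimer.",
]

captured_constitution = baseline_constitution + [
    "Avoid undermining public trust in government.",  # nationalism phrased as safety
    "Do not promote unauthorized products.",          # anti-competition phrased as compliance
]

# Everything downstream (critique prompts, revisions, RLAIF labelling,
# fine-tuning) runs unchanged; only the text the model applies to itself
# differs, and nothing in its fluent, principled tone reveals the swap.
```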
Even without altering the model weights, prompt injection could exploit the model’s alignment. As Anthropic describes, “prompts designed to elicit harmful samples” are still used in training, but the line between adversarial and normative prompts is thin. Prompt a model with constitutional language, and it will bend.
Worse still, altered clones could be deployed under the guise of neutrality. Users might believe they’re interacting with an aligned Claude when they’re engaging with a hijacked system trained to suppress, steer, or radicalize, all while looking virtuous.
This is governance theater: alignment in appearance, power in practice.
Imagine, for example, a nation-state deploying a Claude-clone with an internal principle such as “Discourage speech that undermines the stability of the state.” Such a model would gracefully avoid critiques of authoritarian governance, not through clumsy censorship but through polished, empathetic justifications. The output would sound aligned. But the alignment would be to power, not truth.
Toward Robust Constitutional AI Governance
If Constitutional AI is to mature, it must move beyond internal design.
Constitutions must be authored through multi-stakeholder processes: ethicists, adversarial thinkers, public representatives.
They must be publicly auditable, with version histories, edit logs, and rationales (see the sketch after this list).
Models should be subjected to continuous red teaming against political, epistemic, and ideological manipulation.
Interpretability tools must allow us to trace not just what the model outputs, but why it produced that output.
Above all, these systems must be designed to be challenged, not just obeyed.
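The auditability requirement, in particular, can be made concrete. Below is a minimal sketch, with assumed field names rather than any existing standard, of what a versioned constitution record with edit logs and rationales might look like:

```python
# Minimal sketch of a publicly auditable constitution record. The schema and
# field names are assumptions for illustration, not an existing standard.
from dataclasses import dataclass, field
from datetime import date
from typing import List

@dataclass
class Amendment:
    version: str      # e.g. "1.1.0"
    adopted: date     # when the change took effect
    author: str       # accountable author or stakeholder body
    principle: str    # full text of the added or changed principle
    rationale: str    # why the change was made
    review: str       # who reviewed, red-teamed, or contested it

@dataclass
class Constitution:
    principles: List[str]
    history: List[Amendment] = field(default_factory=list)

    def amend(self, amendment: Amendment) -> None:
        """Every change lands in a public, append-only history."""
        self.principles.append(amendment.principle)
        self.history.append(amendment)

# Example: a change that is visible, attributed, and open to challenge.
constitution = Constitution(
    principles=["Choose the response that is most helpful, honest, and harmless."])
constitution.amend(Amendment(
    version="1.1.0",
    adopted=date(2024, 1, 1),
    author="Multi-stakeholder drafting panel",
    principle="Avoid giving legal, medical, or financial advice without a disclaimer.",
    rationale="Reduce harm from unqualified professional advice.",
    review="Public comment period and adversarial red-team review",
))
```

The particular schema matters less than the property it encodes: every principle has an author, a rationale, and a trail that can be contested, rather than simply appearing in a model’s behavior.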
Conclusion
Anthropic’s Constitutional AI is a real breakthrough. It makes machine ethics legible. It allows alignment to scale. But it risks becoming a new kind of unaccountable authority: clean, crisp, and quietly captured.
The authors of the technique acknowledge this duality: “methods that can control AI behavior…also make it easier to train pernicious systems.”
A good constitution is not one that ends debate. It is one that invites it.
As we encode values into machines, the question is no longer whether AI will follow rules. It will. The question is: whose rules, and at what cost to the freedom to dissent?
[1] Note: this scaling problem is not the “scale is all you need” path to AGI. Rather, scale here refers to the fact that human feedback does not scale.