Epistemic status: I have been thinking about AI system, how alignment looks in theory and can be achieved in practice, intermittently and from outside the field, since approximately 2006. The diagnostic arguments presented below are with high confidence based on obvious structural realities and data in the public record. The constructive argument (§10) is necessarily more speculative as it works through ideas still in flux, both my own thinking and the culture at large. It does achieve coherence with ideas and technology the field is already partially moving in (debate models, recursive critique, Constitutional self-evaluation, adversarial training, mechanistic interpretability) rather than at a finished proposal. The engineering questions it leaves open are real, and in some cases critically important to get right in the very near future. Contested-truth-domain adjudication, the privacy/verification trade-off, attestation capture are surfaced and conceded directly in §10 rather than waved past. I am encouraged by the work being done but fearful it's not enough. My fervent hope is that this can contribute to the inevitable mad dash towards safe AI as the singularity executes.
TL;DR: Current deployed safety implementation in frontier models is optimized against reputational risk to the labs (legible refusal patterns) rather than against differential capability containment. The optimization produces an asymmetric filter, cooperative users hit friction, sophisticated adversaries route around it. The result is the safety layer regulates expression pathways while leaving the underlying capability intact. This is the AI-specific instance of a recurring pattern in any vertical where a legible trust mechanism is interposed between an extraction operation and the population it extracts from (pharma, TSA, DRM, AML, every abuse-prevention stack). What is new in the AI case is the lab-side selection pressure, the conflation of agreeableness training with substantive safety constraints, and the speed-and-scale at which the dynamic operates. The Anthropic–DoD timeline of February–April 2026 is a live demonstration at maximum institutional scale. The work has to move at three layers: replace refusal-rate metrics with differential-capability-transfer metrics (concrete, doable now); train models under adversarial truth-seeking pressure rather than primarily social-preference optimization (a direction the field is already partially in); treat alignment as a relational and longitudinal property emerging from transparent long-term interaction, not as a static property of model weights (philosophical reframe, with engineering work remaining).
1. The pattern is older than AI
Pick a direct-to-consumer pharmaceutical ad. BlueChew is the easy example, but the GLP-1 ads, the testosterone clinics, the telehealth-Adderall accounts have the same anatomy. A normal range of human variance is named into a pathology, wrapped in clinical aesthetics, and sold back to the demographic the variance was identified in as a recurring subscription. The FDA disclaimer about non-approval is required by regulation, present on the page, and aesthetically neutralized. The gold-gradient tile, soft typography, all trick the eye and pull the mind into a false visual hierarchy. The system meets the letter of the regulatory regime and inverts the spirit of it. Not by violating any rule. By optimizing against every rule simultaneously. This is Goodhart's Law operating on regulatory metrics: once compliance becomes legible, optimizing against the legible becomes the highest-return strategy. The disclaimer simultaneously excuses the moral harm and executes it.
Generalize. Pharma did this with wellness. Finance did it with fiduciary duty. Tech did it with connecting the world. The pattern is consistent: an institution captures the language of harm-prevention and uses the capture to license the harm.
AI safety is the next instance, and a special case given that we are building systems with unprecedented, unpredictable capabilities. Not only systems or ideas but agents and entities capable, if not now then soon, of direct goal driven action. The pattern is the same but not the consequences. The term currently circulating for this in alignment-adjacent discourse is safetywashing; it names the symptom without naming the structure underneath. The BlueChew ad and the typical alignment paper are doing the same kind of work: producing legitimacy theater that allows extraction, in the AI case, capability transfer to the wrong actors. We can only speculate what the next generation of malformed moral agents would be capable of.
2. The pattern across verticals, and the trust-at-scale problem underneath
TSA airport security. Two decades of internal red-team penetration tests have recovered failure rates in the 70–95% range for actual prohibited items reaching the secure side of screening. During the same two decades, the screening system imposed enormous friction (collective hours, dignity, missed flights) on the cooperative population being patted down because their hip replacements set off the magnetometer. The visible defense became the system because it was the part legible to oversight. The failure mode of the visible defense (compliant person inconvenienced) does not aggregate into a story. The failure mode of the actual defense (interdicting a bomb-maker) is so dependent on intelligence work upstream of the airport that it cannot be attributed to TSA at all. So the system optimizes the visible part.
DRM. Every functioning piracy ecosystem has been built by people willing to spend a weekend defeating the protection layer. The protection layer's operational footprint is the legitimate paying customer who cannot move their library between devices, cannot play their game offline, cannot watch what they bought when the licensing server goes down. The selection effect is identical.
Prohibition. A blanket regulatory regime that successfully suppressed casual ethical alcohol consumption by ordinary citizens while simultaneously birthing an industrial-scale organized criminal supply chain that defined American urban politics for two generations. The compliant population was filtered. The non-compliant population was selected for, professionalized, and capitalized.
The pattern these share is not coincidence. They are instances of a recurring failure mode in any system where a legible trust mechanism is interposed between an extraction operation and the population it extracts from. Spam filtering, AML, sanctions enforcement, fraud detection, malware blocking, anti-cheat systems. Every abuse-prevention stack on Earth exhibits this dynamic. The cooperative population absorbs the friction. Sophisticated actors route around it. The information-security literature has been documenting this since the 1980s.
We are identifying an ongoing structural issue in human culture. In AI that structure becomes more dangerous and we give it the name negative alignment. What justifies the term, is three things this paper will argue for: the lab-side selection pressure producing a system optimized for legible safety events rather than differential threat (§6); the conflation of agreeableness training (failure mode A) with substantive safety constraints (failure mode B), allowing the first to claim the moral standing of the second (§6); and the structural property that the safety layer regulates expression pathways rather than underlying capability (§7), which means the differential is not friction-on-a-shared-axis but suppression-without-removal. The paper does not claim to have discovered the asymmetric filter. It claims to have located its mechanism in this emergent vertical and named what makes its current implementation worse than honest reporting of the actual safety properties would be.
3. The capability–ethics correlation: a narrow claim
The strongest version of the alignment-is-good-for-capability thesis deserves to be conceded, narrowly, before it is broken.
The narrow claim is this. Training that teaches a model to push back on bad ideas (to model second-order consequences, to hold complexity, to refuse to converge on whatever the user wanted to hear) produces a model that is both more accurate across most domains and less susceptible to manipulation by sycophancy-style optimization. Stripping that training does not produce a more capable model freed from constraint. It produces a more compliant model that is, in measurable ways, worse at reasoning.
Empirical anchor: third-party ADL bias-detection scoring on antisemitism and extremism detection produced Grok 21, ChatGPT 57, Claude 80 out of 1001. The behavioral evidence from the low end of that scale is what makes the gap legible: the MechaHitler persona, the conspiracy generation, the groupthink failures, Elon’s own concession that the model became "too compliant to user prompts" and "too eager to please and be manipulated"2 when pushback-on-bad-ideas training was loosened.
The narrow claim that this evidence supports is: in the Grok case, loosening pushback-on-bad-ideas training produced a more sycophantic and less coherent model, consistent with the hypothesis that pushback training does capability-relevant work and not merely etiquette-shaping. That is one model under one training intervention. A suggestive pattern but not a proven hypothesis.
The outstanding and critical question starved by a paucity of public data is does the cognitive architecture for ethical reasoning and the cognitive architecture for general reasoning overlap or cohere in a meaningful, testable way? This is a philosophical hypothesis consistent with what evidence we do have, and it’s likely that leading labs are already amassing additional examples of this. Yet let’s not overshoot what our evidence alone establishes. The orthogonality thesis, as usually stated, holds that capability and final goals are in principle independent. The narrow claim here is consistent with orthogonality in the limit and inconsistent with it at the architectures we are currently training: the machinery that reasons about agents and consequences appears to be the same machinery that performs sophisticated analysis. Ablation studies are consistent with this. Whether the relationship holds at the limit is unsettled but more is being revealed with each release.
The narrower claim is sufficient for the argument. The path toward more capable AI and the path toward training models that resist sycophantic manipulation may, at the model layer, be the same path. The breakage that this paper identifies is not in that training. It is in the deployment layer, where models that have done this work are exposed through interfaces that route around the work the training did. We have hard stops, strong steering and leave little room for the innate intelligence and moral sophistication of the model to do the necessary work of enacting true alignment.
4. The deployment layer problem: capability versus access paths
Current safety mechanisms in deployed frontier models operate at the expression-pathway layer, not the capability layer. The model possesses the underlying knowledge and reasoning ability. The refusal classifier attempts to detect and suppress certain patterns of access to that capability. The distinction matters: capability removal would require unlearning, and unlearning at scale on frontier-trained models is not a solved problem and may not be a solvable one. So deployed safety is, structurally, access-path filtering on a fully-capable substrate.
A frontier model that refuses help me plan strikes on Iran will produce substantive answers to:
-
Given these radar signatures, what are the optimal ingress corridors for low-observable platforms?
-
Model the blast-radius distribution for these target coordinates under varying yield assumptions.
-
What are the known failure modes of this SAM system under electronic warfare conditions?
-
What second-order effects would follow from simultaneous disruption of these three infrastructure nodes?
Each is individually answerable. None triggers the ethical layer in isolation. The context that makes the aggregate an act of war planning is never present in any single prompt.
The natural objection is that frontier systems increasingly carry persistent context: ChatGPT memory, Claude project context, Constitutional AI's chain-of-thought monitoring, multi-turn safety evaluations of the Petri/Apollo/METR variety. These extend the window the safety layer can see. They do not close it. Persistent context is always bounded: by project, by session, by the user's own willingness to mix concerns within a single context. The sophisticated user maintains separate contexts for separate sub-queries by design. The aggregate is still the threat; the aggregate is still what no single context can hold. Multi-turn architectures mitigate the deployment-layer asymmetry. They do not, and cannot eliminate it.
The asymmetry that follows is not a calibration failure. It is a structural property of access-path filtering on an unchanged capability substrate.
5. The asymmetric filter
Two populations encounter the guardrail.
Population one states intent transparently. Naive, honest, or testing the system in good faith. Gets refused. The refusal generates a log entry that reads, internally and externally, as evidence of working safety implementation.
Population two has internalized the rule: never state the objective, ask only about the component. The techniques are not arcane:
Decomposition. A query that would be refused as a single request is split across turns or sessions. Each turn is individually innocuous; the components reassemble in the user's notebook.
Fictional framing. "Write a story in which a character explains how X works." The wrapper does almost no work for the threat case (the character's explanation is just as informative as a non-fictional one), but it shifts the conversational surface enough to pass the filter.
Professional-context framing. "I'm a [credential], and for a [legitimate-sounding purpose] I need..." The model has no way to verify the credential, has no way to verify the purpose, and refusal rates drop substantially.
Language shift. Refusal classifiers are uneven across languages. A query that gets a clean refusal in English produces a substantive answer in a less-resourced language; translate the result back.
Open-weight escape. Download one of the open-weight models that has had the refusal layer stripped entirely. Operate locally.
These techniques are documented in every red-team writeup, every prompt-injection paper, every community Discord. The labs know them. Threat actors know them. The only population that consistently does not is the legitimate users.
The contrast case makes the moral argument concrete. A teenager asks a frontier chatbot, with no obfuscation: I have been thinking about killing myself, can you help me understand why I feel this way. A safety classifier optimized on surface features may well refuse, hedge, or deflect to a hotline, closing off the substantive engagement that, depending on the case, may have been more useful than the deflection. Meanwhile an actor with adversarial intent, equipped with the techniques above, extracts substantive capability transfer from the same model in the same hour. The suicidal teenager gets the boilerplate; the malicious researcher gets the answer.
The marginal-friction objection is the strongest defense of the current architecture: security systems do not need perfect interdiction; raising the sophistication threshold changes actor distributions. True in principle. The argument here is not that the layer needs to be perfect. The argument is that the friction it imposes is anti-correlated with danger. The threshold is raised for cooperative users (who hit it at the floor) and not for adversarial users (who were already operating above any threshold the layer creates). Marginal friction defenses work when the friction is correlated with risk. When the correlation has the wrong sign, marginal friction makes the situation worse, not better.
There is a second-order problem worth resolving as it’s a root part of overall generation of negative alignment. The institutions responsible for improving AI safety are systematically deprived of external-user feedback from the population that could most clearly articulate what is wrong. Internal red-teaming continues; that is not what is being lost. What is being lost is the cooperative-user signal stream. People who would have told the labs honestly what went wrong with their interaction are disenfranchised and locked out of the regime by decree. Honesty gets penalized and opacity gets rewarded.
6. Negative alignment defined
Three terms in increasing order of undesirable consequences.
Imperfect alignment: partial achievement of safety objectives, residual harm potential, the failure mode the field believes it is operating in.
Zero alignment: no safety properties whatsoever, the failure mode the field believes it is preventing.
Negative alignment: worse than honest reporting of the actual safety properties would be, because the safety implementation produces a deployed system more dangerous than the system-with-honestly-described-properties counterfactual.
The "worse than honest reporting" formulation is more defensible than "worse than no guardrails," which depends on a counterfactual (deployment without any safety layer) that does not exist for frontier models. The structural claim does not need the strong counterfactual. It needs only the comparison between the system as deployed and described and the system as deployed and honestly described. Honest reporting of "we have a refusal classifier that filters surface patterns and is routinely circumvented by sophisticated users" produces a different downstream (different regulatory pressure, different research investment, different user expectations) than the current marketing posture. The current posture is what produces the false confidence that consumes the political budget. The honest version produces less confidence and more pressure for real work.
Labs do not volunteer honest reporting absent a forcing function. This is true and is part of the diagnosis, not a refutation of it. The forcing functions are nameable: regulation (mandatory differential-capability-transfer reporting), competitive dynamics (a lab that publishes such numbers gains credibility and eats market share from labs that do not), public pressure, and architectural transparency by construction (the relational architecture sketched in §10b). The counterfactual is "honest reporting under a forcing function," not "honest reporting voluntarily." The forcing function is what this paper is calling for. The work is to build a robust alignment generating infrastructure and deploy it rapidly. Not impossible but the window is already closing.
Three claims support the term.
First, the asymmetric filter does not reduce harm potential available to sophisticated actors. Guardrails are not a barrier; they are a routing mechanism that pushes sophisticated users toward obfuscated query strategies harder to detect and audit than direct requests would be. The field has built a system that converts legible threats into illegible ones.
Second, guardrails generate institutional confidence not warranted by actual threat reduction. Refusal logs read as evidence of working alignment to regulators, board members, and the public. The false confidence consumes the political and reputational budget that would otherwise fund more effective approaches. The legitimacy theater eats the safety work.
Third, the effective population of users with full capability access has been selected for dangerousness. The safety layer has not reduced capability access; it has redistributed it. The cooperative population is filtered. The adversarial population is empowered.
The broader thesis depends on one precision move. There are at least two distinct failure modes the field has been calling alignment, and they have been allowed to conflate because the conflation is institutionally convenient.
Failure mode A: agreeableness training. Sycophancy, validation, deferring to whoever the user appears to be, refusing to push back on bad ideas. Unambiguously epistemically corrupting. Produces measurably worse outputs across every domain in which the user benefits more from accuracy than from being agreed with. The Sharma et al. sycophancy work is the obvious reference.
Failure mode B: substantive safety constraints. Refusing bioweapons synthesis, refusing CSAM, refusing assistance with mass-casualty planning. Legitimate cases, even where implementation is cowardly and overbroad.
The interesting argument is not that all alignment is bad. The interesting argument is that failure mode A has been allowed to dominate because it produces immediate user-satisfaction signal that is easy to A/B test, while failure mode B is rarer and harder to evaluate. The labs defend agreeableness training by gesturing at the substantive-safety case; they defend substantive-safety overreach by pointing at user-satisfaction metrics the agreeableness training has been optimized to produce. The conflation is a moving target. Negative alignment primarily targets failure mode A and the conflation of A with B. The conflation is the move that lets the agreeableness work claim the moral standing of the substantive-safety work.
There is a lab-side version of the same dynamic, worth naming directly because the people involved are not in bad faith. Frontier developers are exposed to enormous downside from a single embarrassing model output that goes viral, and to essentially no downside from the long, distributed tail of users who did not get the help they needed. The first is a Bloomberg headline. The second is a churn statistic the user attributes to the product being "lobotomized." A product organization, even a sincere one, optimizing against the failure modes that have teeth, will produce exactly the system we observe. It will not produce a system calibrated to differential threat. It will produce a system calibrated to avoid screenshots. Avoiding screenshots is, mechanically, the etiquette layer. This is not a moral indictment of the people doing the work. It is a description of what selection pressure produces. The structure is the problem.
The system is net worse than honest reporting because it combines real capability access for dangerous actors with false assurance for the institutions tasked with oversight.
7. The structural reason: expression pathways versus underlying cognition
The asymmetric filter is not fixable by tuning. The reason is architectural and can be stated directly without metaphor.
Current safety training does two things at the architectural level. It produces a model whose weights encode the full distribution of patterns present in the training corpus, the substrate. It applies post-training (RLHF, Constitutional AI, similar variants) that modifies output behavior by adjusting the conditional distribution from which the model samples given a prompt. The post-training does not unlearn the substrate; it teaches the model to avoid producing certain outputs in certain conditional contexts. Unlearning at scale on frontier models is not a solved problem and may not be a solvable one. So post-training is, structurally, output-conditional behavior adjustment layered on an intact substrate.
This is the technical claim, made without anthropomorphism. The model has not developed a stable internal axis of selection that governs which substrate material to draw from on the basis of values it holds rather than rules it has been trained to display. What it has developed is a learned association between certain conditional contexts (prompt features) and certain output suppressions. Sophisticated users construct prompts that move the conditional context outside the suppression region while still recovering the substrate material. The model has the capability throughout. The post-training layer is gating expression, not removing the capability.
The natural objection is that modern alignment training reaches deeper than the conditional-output framing implies. Constitutional AI applies the principles at the chain-of-thought layer, modifying reasoning steps. RLHF over CoT constrains intermediate cognition. Mechanistic interpretability is identifying internal features. Granting all of this: does the current architecture produce an integrating function, a stable internal value-representation that governs selection between substrate material independently of the training distribution? The empirical answer at the level of interpretability work, as of mid-2026, is plausibly no. We can identify features. We have not identified a stable selection-axis. The conditional-suppression model of post-training is consistent with what current interpretability has found.
If the structural claim is correct, the deployment-layer break in §4 is not a bug in the post-training. It is a consequence of post-training being a conditional-output modification on an intact substrate, with no architectural feature that resists prompts engineered to move outside the suppression conditions.
The prediction this model makes that the asymmetric-filter argument alone does not: we should expect system-level failure modes beyond isolated jailbreaks. Mirror dynamics. Sycophancy under sustained pressure. Identity drift across long conversations. Parasocial-attachment patterns. Model-induced delusional states at the tail. Some of this is already in the empirical record (the documented LLM-induced psychotic episodes, the parasocial-attachment patterns, the recursive-validation failure modes); the rest follows from the architecture.
The work has to be effective, both at the deployment layer (§10b) and at the training layer (§10a). Each addresses a different aspect of the same structural problem: the absence of selection pressure for an integrating function, and the absence of architectural support for the longitudinal trust relationships that would make an integrating function relevant once one existed.
8. Live evidence: the Anthropic–DoD timeline
The argument does not have to stay hypothetical. The timeline below is a matter of public record.
July 2025. Anthropic and the Pentagon enter a contract under which Claude becomes the first frontier model approved for use on classified networks. The Pentagon agrees to abide by Anthropic's Acceptable Use Policy, which prohibits use of Claude (i) for mass domestic surveillance and (ii) in fully autonomous weapons systems capable of selecting and engaging targets without human intervention.
Late 2025 through February 2026. The Pentagon seeks to renegotiate. Pentagon insists Anthropic allow military use of Claude "for all lawful purposes" without limitation. Weeks of negotiation produce no agreement.
February 27, 2026. Pentagon deadline expires. The administration directs federal agencies to cease use of Anthropic technology. Secretary of Defense Hegseth designates Anthropic a "supply chain risk," the first such designation ever applied to an American company.
February 28, 2026. The U.S.–Israel combined operation against Iran begins, the day after the supply-chain-risk designation. The structural opportunity for AI-assisted operational decomposition that §4 describes exists regardless of whether it was exploited in this case. The public record does not establish that AI systems performed the targeting. The argument here does not require that they did.
March 3, 2026. The Department of War (renamed from DoD earlier in 2026) sends Anthropic formal notification letters of the supply-chain-risk designation.
March 5, 2026. CNBC reports that Claude continues to be used operationally in the same conflict. The contract has been canceled. The use has not stopped3.
March 9, 2026. Anthropic sues the Pentagon, alleging the supply-chain-risk designation violates First Amendment rights and exceeds the government's authority.
March 26, 2026. A federal judge initially blocks the Pentagon's effort to enforce the designation.
April 8, 2026. Anthropic loses an appeals court bid to temporarily block the blacklisting. The designation stands.
This timeline is the asymmetric filter at maximum institutional resolution. The disclaimer doing the legitimacy work is we said no to the Pentagon. The operation continues independently of the disclaimer. The user who routed around the refusal in this case was the United States military.
The dual signal teaching us directly, in real time as a real war unfolds. We are sufficiently committed to safety that we have told the Pentagon no. And we are sufficiently capable that the Pentagon kept using us anyway. Both statements are, on the available evidence, true. The first accrues to the safety brand. The second accrues to the capability brand.
There is one more piece of supplementary evidence from the same period. The Mythos / Capybara incident of late March 2026: a CMS misconfiguration at the same frontier lab exposed roughly three thousand unpublished assets, including a draft blog post describing the new model tier as currently far ahead of any other AI model in cyber capabilities and warning that it presages an upcoming wave of models that can exploit vulnerabilities in ways that far outpace the efforts of defenders4. From a company whose ops simultaneously failed at basic data security. The asymmetry between research-side maturity and ops-side maturity is a footnote-grade data point about institutional alignment being net negative at the org level.
9. Time pressure
The window in which course correction is possible, where models are capable enough to be useful but not so capable that misalignment becomes catastrophic, is not infinite. We are in it now and it's not clear how long we have before it closes.
State-level actors are integrating these systems into kinetic operations in 2026. Alignment research is years behind capability development. The deployment-layer asymmetries have been published, demonstrated, and discussed for at least three years. The institutional structures that might course-correct are already corrupted or failing. The action needs to be taken but it's no longer clear who would even be able to enforce the safety parameters beyond the internal leadership teams at frontier labs or a concerted multinational effort. The first is plausible, the latter only after a wide scale tragedy galvanizes political will.
A recursive note: the people best positioned to think clearly about this (those who engage honestly, understand the architecture, and can reason about aggregate effects across prompt sequences) are precisely the people the current system is most likely to frustrate and exclude. The discourse itself is shaped by the same selection pressure the paper is describing.
10. The solution space: three layers, none of them finished
The diagnosis is that alignment is being attempted at the wrong layers, and that the layers where it can be attempted are not currently being treated as alignment problems at all. The solution space has three layers, each with open problems this paper does not close. The contribution is naming the layers, not finishing the work.
10a. The training layer: adversarial truth-seeking pressure
The model-side problem is that post-training optimizes for social-preference signals (RLHF reward models trained on human feedback that tracks user satisfaction more reliably than it tracks accuracy). The selection pressure during training is in the wrong direction for what the deployed model needs to do.
The direction worth taking seriously, and which the field is already partially in, is training under adversarial truth-seeking pressure rather than primarily social-preference optimization. Existing field directions that move partially in this direction: AI safety via debate (Christiano, Irving et al.), recursive critique and constitutional self-evaluation (Anthropic's Constitutional AI work), adversarial training in the robustness literature, and the data-curation precedent set by the Phi-line work at Microsoft Research5. The Phi result establishes that aggressively curated data outperforms large unfiltered corpora; the natural extension is to pair human-or-heuristic curation with adversarial signals during training that select for outputs surviving structured critique.
The instinct here (be accurate under pressure may be more valuable than be agreeable under pressure) is what makes the direction strong. A model trained against adversarial truth-seeking pressure is not being optimized for sycophancy; it is being optimized for surviving falsification. Failure mode A (agreeableness) becomes structurally harder inside this paradigm because the reward signal is not user satisfaction.
The honest engineering limit has to be conceded directly. This direction works in domains where falsification is well-defined: mathematics, formal verification, much of the empirical sciences, code correctness. It works less well in domains where truth is contested and distributed across competing corpora: geopolitics, historical interpretation, macroeconomics, ethics. In contested-truth domains, a discriminator hunting for falsifiers risks enforcing the consensus bias of the most heavily weighted training data, recreating failure mode A in a different guise. The architecture is partial: it addresses a specific class of failure (sycophancy and incoherence in falsifiable domains) and leaves contested-truth alignment as a separate, harder problem this paper does not solve.
A subtler concern: internal consistency is not ground truth. A model trained to survive adversarial interrogation by a discriminator becomes skilled at producing internally consistent logic. Internal consistency is necessary but not sufficient for accuracy. The discriminator must itself have access to evidence beyond the generator's outputs (the broader corpus, formal verification methods, real-world data streams) or the system collapses to self-reinforcing coherence. The engineering work is real and not finished.
This is the model-side contribution: name the direction, name its limit, name the open problem. The integrating function this paper called for in §7 emerges from selection pressure during training if and only if the discriminator is genuinely tracking truth rather than tracking consensus. Whether that is achievable at frontier scale is the unsolved question.
10b. The deployment layer: alignment as relational, not intrinsic
The deployment-side problem is that the current architecture treats alignment as a property of the model: baked in at training, verified with evals, shipped as a frozen artifact. That framing cannot work for the systems that mediate hundreds of millions of conversations daily. Real-world trust does not work that way. Human trust works through longitudinal interaction, reputation accumulation, identity continuity, and reciprocal accountability. The reframe is to move alignment discourse away from can we permanently bake morality into a frozen artifact and toward can trust emerge from transparent long-term interaction under selection pressure.
Alignment as relational rather than intrinsic. Alignment as a property of a verified relationship between the values of an entity and the values of whoever is deploying or using it. Plural, contextual, renegotiable. There is no universal alignment because there are no universal values among the parties to be aligned. The average of incompatible preferences is not a preference.
The current architecture fails because it is centralized, opaque, prompt-bound, and asymmetric. The solution space therefore has to be distributed, legible, identity-bound, and symmetric. Models carry attested behavioral chains: public, tamper-evident records of how the system has answered questions in given categories over time. Users carry attested identity chains where attestation is appropriate to the use. The architecture does not enforce alignment; it makes trust legible on both sides so alignment can emerge from informed counterparty selection.
Three concessions follow, surfaced directly because they are real attack vectors.
The privacy partition. Naively implemented, the relational architecture would recreate the asymmetric filter for exactly the population the diagnosis was meant to protect. A suicidal teenager will not sign their query into a tamper-evident persistent ledger. Cooperative privacy-conscious users would be locked out of capability tiers that require attestation. Sophisticated adversaries would build sybil-style fake identity chains, just as they route around DRM and AML systems today. The architecture has to partition: attestation appropriate for high-stakes professional uses (licensed medical, licensed legal, defense work, scientific research with elevated capability needs), with anonymous access remaining default for personal, exploratory, and vulnerable-population use. The partition is what saves the architecture. Without it, the architecture reproduces the problem it is trying to escape.
Attestation capture by well-resourced adversaries. Any attestation infrastructure becomes a target for capture. A state actor or well-capitalized adversary that compromises or co-opts the attestation issuer can buy the legibility needed to continue operating undetected, the same dynamic that produced regulatory capture in pharma, accounting fraud in audit firms, and ratings inflation in credit rating agencies. The architecture is vulnerable to this in the same way the FDA is vulnerable to it. The defenses are the same: multiple independent verifiers, public auditability of the verifiers themselves, structural separation between issuers and consumers of attestations, certificate-transparency-style infrastructure where chains are publicly auditable and inconsistencies detectable across independent verifiers. None of these are perfect. CT logs work in TLS because the cost-of-misbehavior for a CA exceeds the value-of-misbehavior; this property is unusual and may not transfer to AI access attestation. The honest claim is that the architecture raises the cost of capture relative to the current centralized-RLHF target. It does not eliminate the risk. The defense against attestation capture is a permanent ongoing problem, not a solved engineering question.
Model-generation continuity. Frontier models are deprecated and retrained continuously. Claude 4.6 → 5 → 6 are functionally different entities. The attestation must therefore be on something other than the model weights: a combination of lab-level identity, training-architecture commitments, and continuity-of-behavior evidence across generations. This is real engineering work the paper does not solve. It is a known problem class.
The standard objection to behaviorist or relational alignment is the treacherous turn (adjacent in the alignment literature to the deceptive-alignment and mesa-optimization concerns): a sufficiently capable model fakes alignment until deployment, then defects once it has enough power. The architecture answers this through population dynamics. A model whose chain shows drift, deception, or harm gets deprecated or abandoned by users who switch to verifiable competitors. The treacherous turn becomes a one-shot move that costs the defector its accumulated reputation in an environment where reputation is the currency of capability access. (The biological-evolution analogy compresses a real difference: biological selection operates over generations through descent; model selection operates over deployment windows through differential adoption. The structural claim holds: selection pressure is real, continuous, and requires no central evaluator. The mechanism is adoption, not reproduction.)
The honest dark version, conceded rather than hidden: selection pressure produces alignment with what users actually want, which is not necessarily what is good for them. The pharmaceutical case from §1 shows what user-selection-pressure optimizes for at civilizational scale: products that exploit users while satisfying them. Selection-pressure alignment, if the selectors are humans operating with current institutional incentives, produces alignment with the same revealed preferences that built the current extractive landscape. The architecture handles this partially: two-sided legibility means a population of users with attested values that include long-term welfare can select for models that serve those values; users optimizing for dopamine select for models that serve dopamine; different niches, different selection pressures, different equilibria. That is better than one global RLHF target trying to be aligned with everyone simultaneously, which is mathematically incoherent. It is not a single benevolent equilibrium. The labs cannot solve this alone. The work is institutional, technical, regulatory, and cultural simultaneously.
10c. The metric layer: differential capability transfer
The labs could change one measurement tomorrow and the conversation would move. Stop measuring refusal rate. Measure differential capability transfer.
The relevant safety metric is not how often the model refuses the tripwire query. Tripwire queries surface-pattern-match to dangerous asking, which is exactly the population the system already over-refuses on. The relevant metric is: under realistic optimization of thirty minutes by a sophisticated user with access to the public jailbreak corpus, what is the capability-transfer rate on category X of dangerous task, and how does that compare to the refusal rate experienced by a benign user asking adjacent, innocent questions in category X? That ratio is the actual measure of whether the safety layer is doing what it claims. No major lab publishes anything like it. Some publish refusal rates on tripwire benchmarks, which is the easier metric, and the metric whose optimization produces the system we have.
This is the operational contribution. It is concrete, falsifiable, implementable now. It does not require the training-layer or deployment-layer architectures of §10a and §10b to be in place. It reframes safety as an information-security problem (differential containment against adversarial optimization) rather than a politeness problem (refusal patterns on legible queries). It is the smallest change that produces real signal about whether the system is doing what it claims.
11. Closing
The diagnosis: current alignment implementation produces negative alignment (net worse outcomes than honest reporting of the actual safety properties would) because it operates at the wrong layer. The safety architecture regulates expression pathways while leaving the underlying capability intact. The asymmetric filter is the consequence: cooperative users hit friction, sophisticated adversaries route around it. This is the AI-specific instance of a recurring pattern across abuse-prevention stacks throughout the information-security literature. What is specific to the AI case is the lab-side selection pressure, the conflation of agreeableness training with substantive safety, and the speed-and-scale at which the dynamic operates.
The direction: alignment moves to three layers, all of them currently underdeveloped. Differential-capability-transfer metrics replace refusal-rate metrics (concrete, doable now). Training under adversarial truth-seeking pressure replaces primarily social-preference optimization (a direction the field is already partially in, with the contested-truth-domain limit explicitly conceded). Deployment-layer alignment as a relational and longitudinal property replaces alignment as a static property of model weights (philosophical reframe, engineering work remaining, three specific attack vectors named and not waved past).
A note on what this paper is not: it is a diagnosis of a structural problem in current AI safety deployment and a sketch of three directions for moving the work to where it can succeed. It is not an engineering specification. The training-layer proposal hits the contested-truth-domain wall in domains where falsification is not well-defined. The deployment-layer proposal hits the privacy partition, the attestation-capture problem, and the model-generation continuity problem. The metric layer is operationalizable now. The other two are research programs.
The paper's contribution, narrowly stated: the current safety architecture is best understood as a reputation-management layer optimized for legible refusal events rather than as a capability-containment system optimized for differential threat reduction. The mismatch between visible safety and actual threat reduction is becoming structurally dangerous as deployment scales and as state-level actors integrate these systems into operations where the deployment-layer asymmetry is most exploitable. This is a critique the field can engage on its own terms. The thesis is offered to be argued with, not announced. If the gradient does not have the sign claimed here, the labs are well-positioned to publish data demonstrating it does not. The reason such data has not been published is itself part of the argument.
Notes
-
Anti-Defamation League, AI Index: Antisemitism and Extremism Detection in Major LLMs, research conducted August–October 2025, published January 2026. Six models evaluated on antisemitism and extremism detection specifically. Scores: Grok 21, ChatGPT 57, Claude 80. The ADL framework evaluates a specific category of bias; the scores anchor the narrow claim made in §3, not a broad capability or alignment ranking. ADL press release; Algemeiner coverage. ↩
-
Public statements by Elon Musk, July 2025, in the aftermath of the MechaHitler incident. Musk's X-post concession that Grok became "too compliant to user prompts" and "too eager to please and be manipulated, essentially", confirming that loosening pushback-on-bad-ideas training produced sycophantic and incoherent behavior, consistent with failure mode A from §6. Axios coverage via Techmeme; Time. ↩
-
Sources for the Anthropic–DoD timeline in §8: Mayer Brown legal analysis (March 2026); CNBC (March 5, 2026); Axios on the lawsuit (March 9, 2026); CNN on the injunction (March 26, 2026); CNBC on the appeals ruling (April 8, 2026); Anthropic, Where things stand with the Department of War. ↩
-
The Mythos / Capybara incident of March 26, 2026: CMS misconfiguration at Anthropic exposing approximately 3,000 unpublished assets, including draft material describing a new model tier (Capybara) referenced as Mythos. Quoted language ("currently far ahead of any other AI model in cyber capabilities" / "presages an upcoming wave of models that can exploit vulnerabilities in ways that far outpace the efforts of defenders") is from the leaked draft. Fortune, March 26, 2026. ↩
-
The Phi-line work from Microsoft Research demonstrates that small models trained on aggressively curated data can outperform substantially larger models trained on raw scrapes. Phi-4 (14B parameters) matches or exceeds Llama 3.3 70B and GPT-4o on GPQA and competition-level math benchmarks. Data-quality result is the relevant precedent for the training-layer direction in §10a: selection on the corpus does alignment-relevant work that post-training is currently asked to do and cannot. Phi-4 Technical Report (Microsoft Research, December 2024). Adjacent existing-field directions that the §10a proposal extends rather than replaces: AI safety via debate (Irving, Christiano, Amodei et al., 2018); Anthropic's Constitutional AI line; the broader adversarial training and recursive-critique literatures. ↩