Why it matters for agentic AI
Access control is designed to stop unauthorised actions. Reversibility is designed to contain authorised-but-wrong ones: the class of failures that access control cannot see, because the action was permitted. It is the counterpart to Safety (which enforces gates before irreversible actions) and to Resilience and Recovery (which depends on reversible state to have a rollback target at all). For agents, this class is large. Hallucinations produce confident wrong outputs that are acted upon. Injection-induced mistakes look identical to legitimate actions from the access-control layer’s perspective. A multi-step plan that was correct at step one may be wrong by step six because the context has shifted. None of these are permission violations; all of them benefit from the ability to undo.
The practical approach is to classify every action on a reversibility spectrum (fully reversible, reversible with cost, hard to reverse, or irreversible) and escalate controls as the classification worsens. A fully reversible action (appending to a log, creating a draft) can proceed with minimal overhead. A hard-to-reverse action (mass database update, external email to thousands of recipients) warrants a dry-run that surfaces the projected state delta for human review before execution. An irreversible action (permanent deletion, financial disbursement) warrants a hold period, a compensating-transaction plan, and ideally a second-party approval. The key insight is that humans can meaningfully approve a projected outcome (“here is what will change”) far more effectively than they can approve an abstract plan (“here is what the agent intends”). Show the delta, not the intention.
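The escalation described above can be sketched as a lookup from action to tier to required controls. This is a minimal illustration, not a prescribed implementation: the action names, the control names, and the fail-closed default for unclassified actions are all assumptions introduced here.

```python
from enum import IntEnum
from typing import Dict, List


class Tier(IntEnum):
    """Reversibility tiers from the text, ordered from safest to riskiest."""
    FULLY_REVERSIBLE = 0
    REVERSIBLE_WITH_COST = 1
    HARD_TO_REVERSE = 2
    IRREVERSIBLE = 3


# Hypothetical control names. The text names no control for the
# reversible-with-cost tier, so "log_undo_cost" is an assumption.
CONTROLS: Dict[Tier, List[str]] = {
    Tier.FULLY_REVERSIBLE: [],
    Tier.REVERSIBLE_WITH_COST: ["log_undo_cost"],
    Tier.HARD_TO_REVERSE: ["dry_run_delta_review"],
    Tier.IRREVERSIBLE: [
        "dry_run_delta_review",
        "hold_period",
        "compensation_plan",
        "second_party_approval",
    ],
}

# Hypothetical action registry built from the examples in the paragraph.
ACTION_TIERS: Dict[str, Tier] = {
    "append_log": Tier.FULLY_REVERSIBLE,
    "create_draft": Tier.FULLY_REVERSIBLE,
    "mass_db_update": Tier.HARD_TO_REVERSE,
    "send_bulk_email": Tier.HARD_TO_REVERSE,
    "permanent_delete": Tier.IRREVERSIBLE,
    "disburse_funds": Tier.IRREVERSIBLE,
}


def required_controls(action: str) -> List[str]:
    """Return the controls an action must pass before it may execute."""
    # Assumption: an unclassified action is treated as irreversible.
    tier = ACTION_TIERS.get(action, Tier.IRREVERSIBLE)
    return CONTROLS[tier]
```

Treating unknown actions as irreversible means a newly added tool cannot slip through ungated simply because nobody classified it.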
For multi-agent pipelines this principle gains urgency. When several agents each take a step in a shared workflow, a wrong intermediate state propagates to subsequent agents before any validation checkpoint fires. By the time the error is detected, it may be encoded in three systems simultaneously. Staged rollout with explicit checkpoints between agents, where the system’s state is verified before the next agent begins, contains errors to the stage where they occurred.
Scenario: the cascading pricing change
An agent-driven pricing update calculates a new price schedule and writes it simultaneously to the billing system, the reporting database, and the customer-facing catalogue. A hallucination in the calculation means all three systems are now inconsistent with the correct price. By the time monitoring detects the anomaly (perhaps hours later, when customer complaints arrive) the bad state is deep in three systems with no single compensating transaction that covers all three. Staged rollout writes to the billing system first, checkpoints with a validation query, then proceeds to reporting, then to the catalogue. The hallucination is caught at the first checkpoint, and only one system requires a rollback.
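A staged rollout of this kind might look like the sketch below. It assumes each system can be modelled as an in-memory dictionary and that a validation query can be expressed as a function returning a boolean; `staged_rollout` and its parameters are hypothetical names, not part of any real billing or catalogue API.

```python
from typing import Callable, Dict, List, Tuple

Store = Dict[str, float]
Validator = Callable[[Store], bool]


def staged_rollout(
    prices: Store, stages: List[Tuple[str, Store, Validator]]
) -> List[str]:
    """Write to one system at a time and checkpoint after each write.

    Returns the names of the stages that were committed. On the first
    failed checkpoint, only that stage is rolled back and later stages
    are never written.
    """
    committed: List[str] = []
    for name, store, validate in stages:
        snapshot = dict(store)      # restore point for this stage
        store.update(prices)
        if not validate(store):
            store.clear()
            store.update(snapshot)  # roll back only the failing stage
            return committed
        committed.append(name)
    return committed
```

In the pricing scenario, a validator that rejects non-positive prices on the billing stage would stop the rollout before the reporting database or the catalogue is touched.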
Scenario: the new capability at full autonomy
A team deploys a new agent capability (automated customer-account closure) at the same autonomy tier as its existing account-management capabilities, reasoning that the new capability is “similar.” Within its first week the agent, misreading an ambiguous request, permanently closes an active account. The action is irreversible; the account data is gone. A quantitative graduation gate that requires the agent to process a set number of dry-run closures, reviewed by humans, before gaining live-write permission to the irreversible action would have caught the misinterpretation before it caused harm. Capabilities should earn their autonomy tier; they should not inherit it.
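One way to express such a gate is a counter over human-reviewed dry runs. The sketch below is illustrative only: the class name, the threshold of 50 reviews, and the zero-tolerance error rate are assumptions, since the text does not specify numbers.

```python
from dataclasses import dataclass


@dataclass
class GraduationGate:
    """Tracks reviewed dry runs for one capability before live-write access."""
    required_reviews: int = 50    # hypothetical threshold
    max_error_rate: float = 0.0   # hypothetical: any rejection blocks promotion
    reviewed: int = 0
    rejected: int = 0

    def record_review(self, approved: bool) -> None:
        """Record a human reviewer's verdict on one dry-run result."""
        self.reviewed += 1
        if not approved:
            self.rejected += 1

    def may_write_live(self) -> bool:
        """True only once enough dry runs were reviewed with few enough rejections."""
        if self.reviewed < self.required_reviews:
            return False
        return self.rejected / self.reviewed <= self.max_error_rate
```

Each capability would get its own gate instance, so account closure starts at zero reviews regardless of how mature the other account-management capabilities are.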
How it fails
- Writes overwrite in place with no version history, meaning a wrong write is also the current truth and there is no restore point.
- There is no dry-run environment; the agent’s plan is reviewed in abstract rather than as a concrete state delta, giving the human reviewer no way to spot unexpected side-effects.
- New capabilities are shipped at full autonomy on day one, removing the graduated observation window in which hallucinations and edge-case misreadings would surface in a safe context.
- Multi-agent pipelines have no inter-stage checkpoints, so an early error propagates through all subsequent agents before any validation fires.
Why the mapped controls work
Versioning and soft-delete with retention mean every write creates a restorable history. A wrong action is recoverable rather than terminal. The shadow dry-run environment evaluates the agent’s plan against a copy of the live state and surfaces the delta. The reviewer sees what will change, not a natural-language summary of intent, which is the only basis on which meaningful approval is possible. The saga pattern with compensating transactions treats multi-step plans as a sequence of reversible phases: if any step fails or is rolled back, the compensating transaction restores the prior state, making partial execution safe to recover from. Configurable hold periods on high-value irreversible actions insert a time window in which a human can cancel without needing to predict in advance that something will go wrong. Quantitative graduation gates before tier promotion mean a new capability must prove correct behaviour at read-only or dry-run level before it earns live-write authority, catching reasoning errors in a safe context where reversibility is total.
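The saga pattern mentioned above can be illustrated with a small runner that pairs each step with its compensating transaction. This is a simplified sketch under stated assumptions: steps are plain callables, a step that raises is assumed to have left no partial effect of its own, and `run_saga` is a hypothetical name.

```python
from typing import Callable, List, Tuple

# Each step is a pair: (action, compensating transaction).
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse.

    Returns True if every step succeeded, False if the saga was rolled back.
    """
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Compensate only the steps that finished, newest first.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True
```

A production version would also persist saga progress and handle compensations that themselves fail; both are omitted here for brevity.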
First steps
- Classify every action your agent can take on the reversibility spectrum today. Create a simple table with columns (action, reversibility tier: fully-reversible / reversible-with-cost / hard-to-reverse / irreversible) and use it to identify which actions need additional gates before any further deployment.
- Build a shadow dry-run environment that mirrors your live state (a read replica for databases, a sandbox namespace for APIs) and route every planned hard-to-reverse or irreversible action through it first. The agent presents the dry-run state delta for human review before the live write is authorised.
- Enable soft-delete with a configurable retention period (minimum 30 days) on every data store your agent writes to. Replace any hard-delete call paths with a deleted_at timestamp and a scheduled purge job, so that an erroneous deletion is recoverable within the retention window without needing a full backup restore.
Threats it governs
When this principle is absent, these threats become reachable.
- T4 Resource Overload: Agents autonomously schedule, queue, and execute work. Exhaustion fans out.
- T6 Intent Breaking and Goal Manipulation: Adversaries manipulate planning, reasoning, or self-evaluation to override goals.
Controls that advance it
Catalogue mitigations that strengthen this principle, grouped by the defence-in-depth stage they sit in.
- OOB verify: An agent that can propose payments, update banking details, or modify production configuration is, by construction, a manipulation surface. If the only thing standing between a proposed change and its execution is the agent's own UI, a successful prompt injection or RAG poisoning attack requires no additional steps. Out-of-band verification breaks that dependency by routing a one-use confirmation code through a channel that is structurally separate from the agent's primary interaction channel, so an attacker who controls the agent's context cannot complete the approval without also compromising the user's registered secondary device.
- Plan check: A plan-then-execute agent produces a sequence of steps before acting. If the planner is manipulated, it will emit steps that serve the attacker's goal rather than the user's. Plan-vs-goal validation addresses this by placing an independent validator between the planner and the execution loop: it evaluates each proposed step against the originally declared goal before the agent is permitted to act on it.
- Model registry: An agent loads whichever model weights are available at startup unless the runtime is told exactly which artifact to load. If a poisoned or regressed weight is published to the model store, the agent picks it up silently on the next restart. A model registry prevents that: every artifact is registered with a cryptographic checksum and an approval stage, the agent runtime loads by explicit version pin, and new versions must pass a canary evaluation before promotion to production.
- Dual control: An AI agent operating with broad authority can propose actions that are irreversible: deleting records, modifying IAM policies, moving funds. A single human reviewer at the approval gate is a single point of failure: one compromised account, one fatigued reviewer, or one successful social-engineering attempt is enough to commit the action. Human dual-control addresses that by requiring two distinct, independent humans to approve before the action commits.
- Blockchain tx guard: A blockchain transaction, once committed, cannot be undone. An agent that signs and broadcasts transactions with no enforcement layer in front of it can exceed its authorised value, call a contract it was never provisioned to reach, or drain a wallet in a runaway loop, and by then the funds are gone. A transaction guard intercepts each proposed transaction before signing, checks it against value bounds, a contract allowlist, a gas or compute-unit limit, and a replay-protection nonce, and refuses to sign anything that falls outside declared policy.
- Workflow state consistency: When multiple agents read and write shared workflow state concurrently, a network partition, a delayed message, or an adversarially timed race condition can produce divergent views. An agent acting on stale or conflicting state may authorise an action it would reject given correct current state. Hash-chained state snapshots, merge-point conflict detection, and optimistic concurrency control close that window.
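As one concrete illustration of the controls listed above, the four checks named for the blockchain transaction guard can be sketched as a pre-signing policy check. Everything here is hypothetical: the `Tx` fields, the `TxGuard` class, and the reason strings are assumptions, and no real chain client or signing library is involved.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass(frozen=True)
class Tx:
    """A proposed transaction, reduced to the fields the guard inspects."""
    to: str
    value: int
    gas_limit: int
    nonce: int


@dataclass
class TxGuard:
    """Declared policy that every transaction must satisfy before signing."""
    max_value: int
    max_gas: int
    allowed_contracts: Set[str]
    seen_nonces: Set[int] = field(default_factory=set)

    def check(self, tx: Tx) -> str:
        """Return "ok", or the name of the first policy the transaction violates."""
        if tx.to not in self.allowed_contracts:
            return "contract_not_allowlisted"
        if tx.value > self.max_value:
            return "value_exceeds_bound"
        if tx.gas_limit > self.max_gas:
            return "gas_exceeds_limit"
        if tx.nonce in self.seen_nonces:
            return "nonce_replayed"
        # Only an accepted transaction consumes its nonce.
        self.seen_nonces.add(tx.nonce)
        return "ok"
```

The signer would call `check` first and refuse to sign on any result other than "ok", so the policy sits outside the agent's own reasoning loop.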
No catalogued control.
In Helmwart
The Q1 irreversible-authority signal raises a threat’s exposure rank; reversibility itself isn’t audited.