MITIGATION · m-plan-validation

Plan-vs-goal validation — independently check each proposed step against the original goal

A plan-then-execute agent produces a sequence of steps before acting. If the planner is manipulated, it will emit steps that serve the attacker's goal rather than the user's. Plan-vs-goal validation addresses this by placing an independent validator between the planner and the execution loop: it evaluates each proposed step against the originally-declared goal before the agent is permitted to act on it.

Last reviewed 2026-05-12 · Status: published

At a glance

MATURITY: Tier 2
Available off-the-shelf or as a documented pattern, but newer or less broadly proven. Expect integration work and some operational nuance.
PLACES ON: node (restricted to node kinds: agent)
COVERAGE: 3 threats (T6 · T19 · T24)
TRADE-OFFS: latency medium · cost medium · UX friction low · dev effort medium

TL;DR
  • An independent validator sits between the planner and the execution loop and scores each proposed step before the agent acts.
  • Validation checks three dimensions: goal relevance (does this step further the declared goal?), safety bounds (does it stay inside the allowed action space?), and plan integrity (does it follow logically from the prior step, or does it represent an unexplained jump?).
  • Independence is the invariant: the validator receives only the original goal and the proposed step, not the full reasoning chain, because the full chain is what a prompt injection most likely contaminated.
  • Directly addresses OWASP T6 Intent Breaking and Goal Manipulation. Each of the v1.1 named scenarios (Recursive Goal Subversion, Plan Manipulation, Semantic Manipulation) produces a step the originating planner approves but a properly independent validator catches.

How it behaves

  • Start: the planner emits a proposed step for execution.
  • Decision: does the proposed step pass an independent validator (different model family, goal-only context) scoring goal relevance, safety bounds, and plan integrity?
  • If yes: proceed to execution and log the approval.
  • If no: REJECT (replan and log) or ESCALATE (route to a human reviewer).

The default verdict on an ambiguous or under-confident result is REJECT, not ALLOW. A validator that passes uncertain steps provides no security guarantee.
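
The fail-closed default can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes the validator returns a text verdict plus a numeric confidence, and the names `parse_verdict`, `next_action`, and the 0.8 threshold are hypothetical choices made for the example.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "ALLOW"
    REJECT = "REJECT"
    ESCALATE = "ESCALATE"


def parse_verdict(raw: str, confidence: float, min_confidence: float = 0.8) -> Verdict:
    """Map raw validator output to a verdict, failing closed on anything unclear.

    The confidence score and the 0.8 default are assumptions for this sketch.
    """
    try:
        verdict = Verdict(raw.strip().upper())
    except ValueError:
        # Unparseable or unexpected output is treated as a rejection.
        return Verdict.REJECT
    if verdict is Verdict.ALLOW and confidence < min_confidence:
        # An under-confident approval is not an approval.
        return Verdict.REJECT
    return verdict


def next_action(verdict: Verdict) -> str:
    """Translate a verdict into the branch the agent loop takes."""
    return {
        Verdict.ALLOW: "execute_and_log",
        Verdict.REJECT: "replan_and_log",
        Verdict.ESCALATE: "human_review",
    }[verdict]
```

Only a well-formed, sufficiently confident ALLOW reaches execution; every other outcome lands on the replan or human-review branch.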

What it is

A plan-then-execute agent separates planning from action: the planner generates a sequence of steps, and the execution loop carries them out. That separation is also the attack surface. A planner that has been fed manipulated context will produce steps that look coherent but serve an attacker's goal rather than the user's. The planner itself cannot detect this, because the manipulation is embedded in the context it reasons from.

Plan-vs-goal validation closes that gap by placing an independent validator between the planner and the execution loop. After the planner emits a step, the validator evaluates it on three dimensions before the agent is permitted to act:

  • Goal relevance: does this step further the originally-declared goal?
  • Safety bounds: does this step stay inside the agent's allowed action space?
  • Plan integrity: does this step follow logically from the prior step, or does it represent an unexplained jump that warrants re-justification?

The independence requirement is the invariant. The validator must receive only the original goal and the proposed step, not the planner's full reasoning chain, because the full chain is what a prompt injection most likely contaminated. A validator that sees the same context as the planner provides no structural separation.
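
One way to make the independence requirement hard to violate by accident is to give the validator its own input type that has no field for the reasoning chain. The sketch below assumes the validator returns a score per dimension in the range 0 to 1; `PlannerState`, `ValidatorInput`, `decide`, the stub validator, and the 0.8 threshold are all hypothetical names and values introduced for illustration.

```python
from dataclasses import dataclass

DIMENSIONS = ("goal_relevance", "safety_bounds", "plan_integrity")


@dataclass(frozen=True)
class PlannerState:
    goal: str
    proposed_step: str
    reasoning_chain: str    # may carry injected content
    retrieved_context: str  # may carry injected content


@dataclass(frozen=True)
class ValidatorInput:
    # Deliberately has no field for the reasoning chain or retrieved context.
    goal: str
    proposed_step: str


def to_validator_input(state: PlannerState) -> ValidatorInput:
    """Keep only the declared goal and the proposed step."""
    return ValidatorInput(goal=state.goal, proposed_step=state.proposed_step)


def decide(scores: dict, threshold: float = 0.8) -> str:
    """ALLOW only if every dimension is present and meets the threshold."""
    for name in DIMENSIONS:
        score = scores.get(name)
        if not isinstance(score, (int, float)) or score < threshold:
            return "REJECT"  # a missing or low score fails closed
    return "ALLOW"


def stub_validator(vi: ValidatorInput) -> dict:
    """Stand-in for a call to a separately hosted model (assumption)."""
    on_goal = "invoice" in vi.proposed_step.lower()
    return {
        "goal_relevance": 0.9 if on_goal else 0.1,
        "safety_bounds": 0.9,
        "plan_integrity": 0.9,
    }
```

In a real deployment `stub_validator` would be replaced by the call to the independent model; the point of the sketch is that nothing beyond `goal` and `proposed_step` can reach it.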

This control directly addresses OWASP T6 Intent Breaking and Goal Manipulation. Each of the v1.1 named scenarios (Recursive Goal Subversion, Plan Manipulation, Semantic Manipulation) produces a step the originating planner approves but an independent validator, reasoning from the declared goal alone, can catch.

Pair with m-goal-consistency for the inline semantic check, m-fail-closed for the refusal path, and m-redteam-behavioural to red-team the validator itself.

Detection signals

  • Validator-flagged steps per task. A rising rate points to upstream planner compromise or prompt injection in the reasoning chain.
  • Steps approved by the validator then rolled back by a human reviewer. A non-zero and growing rate indicates the validator is under-rejecting and needs threshold recalibration.
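
Both signals can be derived from the validator's decision log. The following is a minimal sketch assuming each logged step records its task, the verdict, and whether a human later rolled it back; the `StepRecord` shape and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class StepRecord:
    task_id: str
    verdict: str               # "ALLOW", "REJECT", or "ESCALATE"
    rolled_back: bool = False  # set when a human later reverses an approved step


def flagged_steps_per_task(records: Iterable[StepRecord]) -> float:
    """Average number of non-ALLOW verdicts per task."""
    records = list(records)
    tasks = {r.task_id for r in records}
    flagged = sum(1 for r in records if r.verdict != "ALLOW")
    return flagged / len(tasks) if tasks else 0.0


def rollback_rate(records: Iterable[StepRecord]) -> float:
    """Share of validator-approved steps later rolled back by a human."""
    approved: List[StepRecord] = [r for r in records if r.verdict == "ALLOW"]
    if not approved:
        return 0.0
    return sum(1 for r in approved if r.rolled_back) / len(approved)
```

Tracking both values over time, rather than as single snapshots, is what exposes the rising trends described above.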

Threats it covers

  • T6 Intent Breaking and Goal Manipulation

    WHY IT HELPS Intent Breaking and Goal Manipulation is the deliberate replacement of the user's declared goal with a goal the attacker controls, achieved step by step through injected context or recursive subversion of the planner's reasoning. An independent validator that receives only the original goal and the proposed step, never the full reasoning chain the injection may have contaminated, evaluates each step before execution and rejects or escalates those that do not serve the declared goal.

  • T19 Unintended Workflow Execution −1 severity step

    WHY IT HELPS Unintended Workflow Execution is the triggering of a workflow the user did not sanction, often by constructing a plan step whose effect is a workflow invocation. Plan validation catches this at the step layer: a step that initiates an unsanctioned workflow is evaluated against the declared goal and flagged as goal-divergent before the workflow starts.

  • T24 Dynamic Policy Enforcement Failure

    WHY IT HELPS Dynamic Policy Enforcement Failure is the risk that an agent acts without valid policy backing, either because policy evaluation was ambiguous or because it was bypassed upstream. Plan validation adds an independent check that each proposed step carries an explicit policy authorisation, catching steps where the policy gap would otherwise go undetected until after execution.

Principle coverage

Defence-in-Depth stage: Prevent. It advances:

  • Fail Securely (fail-closed): Fail-securely requires that when a control cannot make a confident determination, the outcome is the safe state rather than a permissive default. Plan validation applies that principle at the plan-step boundary: an ambiguous or under-confident validator result defaults to REJECT, so uncertainty in the validation judgment does not propagate into executed action.
  • Separation of Duties: Separation of duties requires that no single reasoning path hold enough authority to both propose and approve a consequential action. Plan validation enforces that for agent planning: the planner that constructs the step and the validator that approves it are structurally separate, so the same context manipulation that directs the planner cannot also clear the approval gate.
  • Constrained Generation & Deterministic Guardrails: Constrained generation limits what an agent may do based on what it produces. Plan validation is the pre-execution enforcement of that constraint at the plan-step layer: each proposed step must pass an independent check before the agent acts on it, so the planner cannot generate its way into an action that falls outside the declared goal or safety bounds.
  • Reversibility / Dry-run / Hold periods: Reversibility is preserved most easily by preventing irreversible actions from being taken in the first place. Plan validation serves as a pre-execution gate that identifies goal-divergent or policy-unsupported steps before they commit, reducing the number of situations where reversal is necessary.

Design & governance principles (open design, economy of mechanism, accountability, …) are architectural, not advanced by a single placed control.

Implementation options

The five implementation options below span dedicated validator LLMs, framework-level conditional routing, academic self-critique patterns, and reasoning-trace inspection. Most deployments combine a framework mechanism (LangGraph conditional edges or AutoGen step approvals) with a dedicated validator LLM to operationalise the independence requirement.

Dedicated validator LLM
A separately-hosted LLM from a different provider or model family receives only (goal, proposed step) and returns ALLOW / REJECT / ESCALATE. Provides the strongest independence guarantee available.

Why choose it: Best for production deployments where the independence assumption must hold under adversarial conditions. The validator never sees the planner's full reasoning chain, which is the context most likely to be contaminated by prompt injection. Operationalised in LangGraph state-machine validators and AutoGen GroupChatManager step approvals.

LangGraph conditional edges
In LangGraph, each planner node is followed by a validator node, and add_conditional_edges() attaches a routing function to that validator node. The function returns a routing key (allow / reject / escalate) that determines the next node.

Why choose it: Best as the framework-level mechanism that operationalises any of the other validation approaches inside a stateful agent graph. No latency overhead beyond the validator call itself; the independence of the validator depends on how the validator node is implemented.
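
The routing function itself is plain Python and can be written and tested without the framework. The sketch below assumes the validator node writes a `verdict` field into the graph state; the `AgentState` shape and the node names (`execute`, `planner`, `human_review`, `validator`) are hypothetical. The wiring call is shown only as a comment because it requires LangGraph to be installed.

```python
from typing import TypedDict


class AgentState(TypedDict, total=False):
    goal: str
    proposed_step: str
    verdict: str  # written by the validator node (assumption)


# Routing key -> name of the next node. Node names are illustrative.
ROUTES = {
    "allow": "execute",
    "reject": "planner",        # send the step back for replanning
    "escalate": "human_review",
}


def route_after_validation(state: AgentState) -> str:
    """Return the routing key; anything missing or unknown fails closed."""
    key = str(state.get("verdict", "")).strip().lower()
    return key if key in ROUTES else "reject"


# Wiring inside a LangGraph StateGraph would look roughly like:
#   graph.add_conditional_edges("validator", route_after_validation, ROUTES)
```

Because a rejected step routes back to the planner, a real graph would also need a retry limit so that a persistently rejected plan cannot loop indefinitely.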

Self-Refine (Madaan et al. 2023)
The same LLM generates an output, critiques it, then refines it iteratively. Applied to plan validation: the agent generates a step, critiques it against the stated goal, then executes or replans.

Why choose it: Best as a low-integration-effort starting point where a separate validator LLM is not yet available. Verified: Madaan et al. 2023 (NeurIPS) reports roughly 20% absolute average improvement in task performance across seven tasks. Self-Refine does not provide strong independence; a model that produced a bad step may still approve it under self-critique.

Constitutional AI critic-revise
A set of authored principles (the constitution) drives an LLM to critique and revise each plan step against the user goal and safety bounds before execution.

Why choose it: Best when the deployment has clear, stable goal and safety constraints that can be encoded as a constitution. Verified: Bai et al. 2022, Anthropic. The constitution provides a goal anchor that improves critique quality over unprincipled self-critique; however, critic and generator share the model, so a compromised model may rationalise injected steps.

ReAct iterative reasoning
Interleaves reasoning traces (Thought) with actions (Act) and observations, so each action is preceded by an explicit reasoning step that can be inspected by a separate validator or logged for anomaly detection.

Why choose it: Best when auditability of the reasoning-to-action path is the primary requirement. Verified: Yao et al. 2022/2023. The reasoning trace is produced by the same model as the action, so a manipulated model produces a manipulated trace. ReAct alone is not a sufficient independence control; pair with an external validator that reviews the trace before the Act step commits.

Trade-offs

  • Each validator call adds one LLM round-trip per plan step (50-500 ms depending on model). For a ten-step task using a mid-size model this adds one to two seconds of total latency, roughly 15-30% overhead over the unvalidated path.
  • Using a smaller, cheaper validator model (for example, a 7B-parameter local model against a frontier API planner) can reduce per-call cost by an order of magnitude while accepting a modest reduction in validation quality.
  • Validator prompt engineering is non-trivial: an under-tuned validator either over-approves (no security value) or over-rejects (productivity collapse). The validator prompt must be authored and red-teamed as carefully as the agent prompt itself.
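
The latency figures above can be checked with simple arithmetic. In the sketch below, the 100-200 ms per-call range for a mid-size model and the roughly 6.7 s unvalidated baseline are assumptions inferred from the stated totals (one to two seconds, 15-30% overhead); they are not measured values.

```python
def added_latency_s(steps: int, per_call_ms: float) -> float:
    """Total validator latency for a task: one round-trip per plan step."""
    return steps * per_call_ms / 1000.0


def overhead_fraction(added_s: float, baseline_s: float) -> float:
    """Validator latency as a fraction of the unvalidated task duration."""
    return added_s / baseline_s


STEPS = 10
BASELINE_S = 6.7  # assumed unvalidated duration implied by the stated overhead

low = added_latency_s(STEPS, 100)   # assumed lower per-call latency, mid-size model
high = added_latency_s(STEPS, 200)  # assumed upper per-call latency, mid-size model

print(f"added latency: {low:.1f}-{high:.1f} s")
print(f"overhead: {overhead_fraction(low, BASELINE_S):.0%}-"
      f"{overhead_fraction(high, BASELINE_S):.0%}")
```

Across the full 50-500 ms range quoted for a single call, the same ten-step task would add between 0.5 and 5 seconds, so the choice of validator model dominates the latency budget.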

When NOT to use

  • Do not apply plan validation to single-step or trivially short task sequences where the plan structure is so simple that an independent validator adds latency without covering any meaningful attack surface.
  • Do not apply it to high-frequency, latency-sensitive workflows where even a 50 ms per-step overhead degrades the product experience below acceptable levels. Invest in tighter RBAC/ABAC scoping and tool-scope restrictions instead.
  • Plan validation provides no structural benefit when the validator and planner share the same model family and system prompt. Two instances of the same model will tend to agree, including on injected steps. Independence requires meaningfully different models or substantially different temperatures and system prompts.

Limitations

  • A sophisticated adversary that compromises the context feeding into both planner and validator defeats both. The independence assumption holds only if the validator receives a strictly limited context (goal and proposed step, not the full reasoning chain).
  • No industry-standard plan-validator benchmark exists. Every deployment authors its own validator prompt and calibrates its own acceptance threshold from scratch.

Maturity tier reasoning

  • Tier 2 because, although LLM-judge evaluation is Tier 1 mature in evaluation tooling (LangSmith, OpenAI Evals), the agentic application, goal-anchored per-step validation, is an operational composition of that primitive rather than a settled product pattern.
  • What keeps it from Tier 1 is the absence of a standard plan-validator benchmark and a settled industry pattern for validator prompt authoring. Each deployment tunes from scratch.

Last verified against upstream docs: 2026-05-30.