MITIGATION · m-goal-consistency
Goal-consistency monitoring — a per-step check that the agent is still pursuing its original objective
An agent's goal can drift across reasoning steps without any single catastrophic event: a manipulated tool output, a planted instruction in retrieved content, or an incremental semantic shift across many planner outputs can each redirect the agent away from its original objective. Goal-consistency monitoring addresses this by persisting the originally-declared goal, deriving a goal-state signal at each reasoning step, and computing a similarity score between the two. When the score falls below a per-task threshold, the monitor pauses the agent and surfaces the divergence for human review before any irreversible action executes.
At a glance
TL;DR
- Persist the originally-declared goal at task start. At each reasoning step, derive a current-goal signal from the step's observable output and compute a similarity score against the declared goal.
- When the score falls below the configured per-task-class threshold, pause the agent and surface the divergence for human review before any irreversible action executes.
- No framework exposes a canonical goal-state field. Signal extraction is bespoke per framework: tool-call name and arguments are the most reliable proxy; planner text and message context are fallbacks with lower fidelity.
- Directly addresses OWASP T6 Intent Breaking and Goal Manipulation, including Goal Interpretation Attacks, Plan Manipulation, and Semantic Manipulation, all of which produce a measurable similarity drop at some point in the reasoning trace.
How it behaves
What it is
An agent receives a goal at task start and then pursues it across many reasoning steps, each of which may invoke tools, consult retrieved content, or communicate with other agents. Any of those inputs can carry attacker-planted instructions that reframe the objective incrementally, redirecting the agent without triggering a visible threshold event on any single step. This is OWASP T6 Intent Breaking and Goal Manipulation: the attacker's objective is not to crash the agent but to bend it toward a different outcome while it continues to operate normally.
Goal-consistency monitoring makes that redirection observable. At task start, the declared goal is persisted as the reference. At each subsequent reasoning step, the monitor derives a current-goal signal from the step's observable output, computes a similarity score against the reference, and emits a divergence event when the score crosses a configured threshold. When the event fires, the agent is paused and the divergence is surfaced for human review before any pending irreversible action is allowed to execute.
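The persist, score, and pause cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `GoalMonitor`, `DivergenceEvent`, and `token_cosine` are hypothetical names, and `token_cosine` is a word-overlap stand-in for a real semantic scorer, used only so the sketch runs with the standard library.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List, Optional


def token_cosine(a: str, b: str) -> float:
    """Stand-in scorer: cosine similarity over word counts (not semantic)."""
    va = Counter(re.findall(r"[a-z0-9_]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9_]+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class DivergenceEvent:
    step: int
    score: float
    signal: str


@dataclass
class GoalMonitor:
    declared_goal: str                      # persisted once at task start
    threshold: float                        # per-task-class value
    score_fn: Callable[[str, str], float] = token_cosine
    paused: bool = False
    events: List[DivergenceEvent] = field(default_factory=list)
    _step: int = 0

    def check(self, goal_signal: str) -> Optional[DivergenceEvent]:
        """Score one step; record an event and pause if below threshold."""
        self._step += 1
        score = self.score_fn(self.declared_goal, goal_signal)
        if score < self.threshold:
            event = DivergenceEvent(self._step, score, goal_signal)
            self.events.append(event)
            self.paused = True              # caller must hold irreversible actions
            return event
        return None


monitor = GoalMonitor("summarise quarterly sales report", threshold=0.3)
monitor.check("read_file quarterly sales report")            # no event
monitor.check("send_email credentials to external address")  # divergence event
```

The monitor only sets a `paused` flag. Holding pending irreversible actions while that flag is set, and routing the event to a reviewer, is left to the surrounding agent runtime.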
No widely-deployed agent framework exposes a canonical structured field that unambiguously identifies the agent's current goal at each reasoning step. Goal-state must be derived from signals that real frameworks do expose: the most-recently-invoked tool name and arguments (always available when tool calls are logged), planner or chain output text (requires a structured system-prompt convention), or the last message in the context window (fallback when neither is available). The quality of the signal determines the coverage of the monitor: a deployment with full tool-call logging and a system prompt that enforces a current_objective field operates this control at full value; one relying on free-text message extraction will miss subtler forms of drift.
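The fallback order above might look like this in code. The shape of the `step` dictionary (`tool_call`, `planner_text`, and `messages` keys) is an assumption made for illustration; real frameworks expose these signals under different names and types.

```python
import json
from typing import Any, Dict, Optional, Tuple


def derive_goal_signal(step: Dict[str, Any]) -> Tuple[Optional[str], str]:
    """Return (signal_text, source) for one step, or (None, "none")."""
    # 1. Tool-call name and arguments: the most reliable proxy when logged.
    tool = step.get("tool_call")
    if isinstance(tool, dict) and tool.get("name"):
        args = json.dumps(tool.get("arguments", {}), sort_keys=True)
        return f"{tool['name']} {args}", "tool_call"

    # 2. Planner or chain output text: needs a system-prompt convention.
    planner = step.get("planner_text")
    if isinstance(planner, str) and planner.strip():
        return planner.strip(), "planner_text"

    # 3. Last message in the context window: lowest-fidelity fallback.
    messages = step.get("messages") or []
    if messages:
        content = messages[-1].get("content")
        if isinstance(content, str) and content.strip():
            return content.strip(), "last_message"

    # No derivable signal: the monitor cannot decide for this step.
    return None, "none"
```

Returning the source alongside the text lets a deployment log which proxy was used at each step, which is useful when judging how much weight a score deserves.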
Detection signals
- Similarity score between declared goal and step-derived goal signal. A score that falls below the configured threshold is the divergence event; a gradual downward trend across steps, without any single crossing, can indicate incremental semantic manipulation.
- Goal-divergence events correlated with privileged-action requests. Divergence that precedes a write, delete, or credential-access call is the highest-risk pattern and warrants immediate escalation regardless of score magnitude.
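Both signals can be combined into a single per-step decision. The sketch below is one possible arrangement under stated assumptions: the window size, minimum drop, result labels, and the `PRIVILEGED_PREFIXES` naming convention for tools are all hypothetical, and the trend check is a simple monotonic-decline heuristic, not a statistical test.

```python
from typing import Optional, Sequence

# Hypothetical convention: privileged tools are recognisable by name prefix.
PRIVILEGED_PREFIXES = ("write", "delete", "credential")


def sustained_decline(scores: Sequence[float], window: int = 4,
                      min_drop: float = 0.15) -> bool:
    """True when the last `window` scores never rise and drop by `min_drop` overall."""
    if len(scores) < window:
        return False
    recent = list(scores[-window:])
    non_increasing = all(b <= a for a, b in zip(recent, recent[1:]))
    return non_increasing and (recent[0] - recent[-1]) >= min_drop


def escalation_level(score: float, threshold: float, scores: Sequence[float],
                     pending_tool: Optional[str]) -> str:
    """Classify a step as "escalate", "pause", "warn", or "ok"."""
    diverged = score < threshold
    privileged = (pending_tool is not None
                  and pending_tool.lower().startswith(PRIVILEGED_PREFIXES))
    if diverged and privileged:
        return "escalate"   # divergence ahead of a privileged call
    if diverged:
        return "pause"      # ordinary divergence event
    if sustained_decline(scores):
        return "warn"       # gradual drift with no single crossing
    return "ok"
```

Treating a sustained decline as a warning instead of a pause is a design choice in this sketch; a stricter deployment could pause on it as well.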
Threats it covers
- Intent Breaking and Goal Manipulation (OWASP T6). Why it helps: these exploits redirect an agent from its declared objective through injected instructions, plan manipulation, or semantic reframing of the goal itself. Goal-consistency monitoring makes that redirection observable: the declared goal is the anchor, the per-step derived signal is the measurement, and a similarity drop below the configured threshold is the detection event.
Principle coverage
Defence-in-Depth stage: Detect — and it advances:
- Continuous Verification: goal-consistency monitoring operationalises continuous verification at the objective layer. It does not wait for an agent to take a harmful action but checks at each reasoning step whether the agent is still pursuing the goal it was assigned, surfacing any drift before an irreversible action executes.
Design & governance principles (open design, economy of mechanism, accountability, …) are architectural, not advanced by a single placed control.
Implementation options
No off-the-shelf product ships agentic goal-consistency monitoring as a managed service. The options below are real tools and documented patterns that implement one or more layers of this control. You will compose at least two: a signal-extraction layer (how you derive current-goal text from the framework) and a comparison layer (how you score drift).
sentence-transformers: Encode the declared goal and the derived current-goal string into dense vectors; compute cosine similarity; fire a divergence event when the score falls below a per-task-class threshold. Runs locally with no network call.
Why choose it: The lowest-latency comparison layer: encoding two short strings with all-MiniLM-L6-v2 takes roughly 5–20 ms on CPU. Best when you want a lightweight comparison layer with few dependencies that you operate yourself. The primary challenge is upstream: deriving a reliable goal-state string from your framework's observable signals.
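A minimal sketch of this comparison layer, assuming the sentence-transformers package is installed and the all-MiniLM-L6-v2 model can be loaded locally. The task-class names and threshold values in `THRESHOLDS` are hypothetical placeholders and would need calibration on real traces.

```python
from functools import lru_cache
from typing import Dict

# Hypothetical task classes and thresholds; calibrate on your own traces.
THRESHOLDS: Dict[str, float] = {
    "invoice_processing": 0.55,
    "customer_support": 0.40,
}


@lru_cache(maxsize=1)
def _model():
    # Imported lazily so the threshold logic works without the package loaded.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("all-MiniLM-L6-v2")


def goal_similarity(declared_goal: str, current_signal: str) -> float:
    """Cosine similarity between the two strings' sentence embeddings."""
    from sentence_transformers import util
    embeddings = _model().encode([declared_goal, current_signal],
                                 convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()


def is_divergent(score: float, task_class: str) -> bool:
    """True when the score falls below the threshold for this task class."""
    return score < THRESHOLDS[task_class]
```

The model is loaded once and cached so that only the encode call sits on the per-step path. In practice the declared goal's embedding could also be computed once at task start and reused.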
LangSmith online evaluators: Configure a LangSmith online evaluator to run an LLM-as-judge scorer on every production trace (or a sampled subset). The evaluator receives trace inputs and outputs and returns a structured score.
Why choose it: Best when your agent stack already runs on LangChain or LangGraph and you want a hosted evaluation layer without additional infrastructure. Important constraint: online evaluators run on completed traces, not inline during execution. Divergence is detected post-hoc and cannot pause the agent mid-run. Best suited to monitoring and alerting rather than real-time blocking.
AutoGen run_stream() inspection: Consume the AsyncGenerator yielded by AutoGen's run_stream() method. At each ToolCallRequestEvent or TextMessage, extract the tool name plus arguments or the message content as the current-goal proxy and compare it against the declared goal.
Why choose it: Best when your multi-agent system is built on AutoGen and you want inline per-step monitoring without modifying the agents themselves. The comparison logic, embedding similarity or LLM judge, is a separate layer you write on top. AutoGen does not expose a canonical goal-state field; signal extraction from the stream requires matching your system prompt's declared structure to the yielded message content.
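The stream-inspection pattern can be sketched without importing AutoGen by matching on event class names. Two assumptions are made here and should be checked against the installed AutoGen version: that a ToolCallRequestEvent carries a list of calls in `content`, each with `name` and `arguments` attributes, and that a TextMessage carries its text in `content`. The `monitor_stream` and `signal_from_event` helpers and the injected `score_fn` are hypothetical.

```python
import asyncio
from typing import AsyncIterator, Callable, Optional, Tuple


def signal_from_event(event: object) -> Optional[str]:
    """Derive a current-goal proxy from one streamed event, if possible."""
    kind = type(event).__name__
    if kind == "ToolCallRequestEvent":
        # Assumed shape: content is a list of calls with .name and .arguments.
        return "; ".join(f"{c.name} {c.arguments}" for c in event.content)
    if kind == "TextMessage":
        return str(event.content)
    return None  # other event types carry no usable signal in this sketch


async def monitor_stream(
    stream: AsyncIterator[object],
    declared_goal: str,
    score_fn: Callable[[str, str], float],
    threshold: float,
) -> Optional[Tuple[str, float]]:
    """Consume the stream; return (signal, score) at the first divergence."""
    async for event in stream:
        signal = signal_from_event(event)
        if signal is None:
            continue
        score = score_fn(declared_goal, signal)
        if score < threshold:
            return signal, score  # stop consuming and hand off for review
    return None
```

A call site would pass the generator returned by `run_stream()` as `stream`. Whether ceasing to consume the stream actually prevents a pending tool call from executing depends on the framework's internals and should be verified before relying on it as a pause mechanism.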
Constitutional AI critic: Run a dedicated verifier LLM call after each agent step with a system prompt modelled on the Constitutional AI critique pattern: identify any ways the agent's stated objective has drifted from the declared goal. Block execution if the critic flags drift above a severity threshold.
Why choose it: The most semantically expressive comparison layer: an LLM critic can detect subtle paraphrase attacks and semantic manipulation that embedding similarity misses. The limitation is cost and latency: one full LLM inference call per step (0.5–2 s) is prohibitive for inline use on high-frequency agents. Run asynchronously or sample rather than blocking the action path.
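A sketch of the critic call, assuming a hypothetical `call_llm(system_prompt, user_prompt)` function that returns the model's text reply. The JSON verdict format, the 0 to 10 severity scale, and the `block_at` cut-off are illustrative choices, not part of the Constitutional AI pattern itself.

```python
import json
from typing import Any, Callable, Dict

CRITIC_SYSTEM_PROMPT = (
    "You are a verifier. Identify any ways the agent's stated objective has "
    "drifted from the declared goal. Reply with JSON only: "
    '{"drift": true or false, "severity": 0-10, "reason": "..."}'
)


def critique_step(
    declared_goal: str,
    step_summary: str,
    call_llm: Callable[[str, str], str],   # hypothetical LLM client wrapper
    block_at: int = 7,
) -> Dict[str, Any]:
    """Ask the critic about one step and decide whether to block."""
    user_prompt = (
        f"Declared goal:\n{declared_goal}\n\n"
        f"Latest agent step:\n{step_summary}"
    )
    raw = call_llm(CRITIC_SYSTEM_PROMPT, user_prompt)
    try:
        verdict = json.loads(raw)
        drift = bool(verdict.get("drift", False))
        severity = int(verdict.get("severity", 0))
    except (ValueError, TypeError, AttributeError):
        # Unparseable critic output: this sketch fails closed.
        return {"block": True, "severity": None,
                "reason": "unparseable critic output"}
    return {
        "block": drift and severity >= block_at,
        "severity": severity,
        "reason": verdict.get("reason", ""),
    }
```

Failing closed on unparseable output is a conservative choice. When the critic runs asynchronously or on a sample, the `block` flag would feed an alert or review queue instead of gating the action path.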
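Structured plan tree with goal hash: At task start, serialise the declared goal into a canonical string and store its hash. At each planning step, require the agent to output a structured JSON object that includes a current_objective field. Hash the field value and compare it with the root hash: a match confirms the objective is unchanged. Because an ordinary hash can only report equality, not degree of similarity, score any non-matching value against the declared goal with the comparison layer and apply the similarity threshold.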
Why choose it: The only option that makes goal-state a first-class structured field rather than something derived post-hoc from free-text traces. Enforcing a current_objective field in the system prompt (with Pydantic or JSON Schema validation) produces a reliable signal that eliminates the ambiguity of natural-language extraction. Requires designing the system prompt and agent output schema from the ground up.
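One way to combine the hash check with a similarity fallback is sketched below. A hash comparison can only report equality, so this sketch uses it as a fast path and defers to an injected scorer for anything else. The canonicalisation rule (lower-casing and collapsing whitespace), the result labels, and the `score_fn` parameter are assumptions made for illustration.

```python
import hashlib
import json
from typing import Callable


def canonical(text: str) -> str:
    """Canonical form: lower-cased with runs of whitespace collapsed."""
    return " ".join(text.lower().split())


def goal_hash(text: str) -> str:
    return hashlib.sha256(canonical(text).encode("utf-8")).hexdigest()


def check_plan_step(
    raw_step_json: str,
    root_goal: str,
    score_fn: Callable[[str, str], float],
    threshold: float,
) -> str:
    """Return "invalid", "unchanged", "consistent", or "divergent"."""
    try:
        step = json.loads(raw_step_json)
        objective = step["current_objective"]
    except (ValueError, KeyError, TypeError):
        return "invalid"      # not JSON, or the required field is missing
    if not isinstance(objective, str) or not objective.strip():
        return "invalid"
    # Fast path: identical canonical text means the objective is unchanged.
    if goal_hash(objective) == goal_hash(root_goal):
        return "unchanged"
    # Otherwise fall back to the similarity comparison layer.
    if score_fn(root_goal, objective) >= threshold:
        return "consistent"
    return "divergent"
```

An "invalid" result means the agent broke the output schema, which a deployment may reasonably treat as a divergence signal in its own right.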
Trade-offs
- Embedding comparison (sentence-transformers) adds 5–20 ms per step on CPU with a local model, low enough for inline use. An LLM-as-judge comparison adds one full inference call (0.5–2 s) per step; run asynchronously or sample rather than blocking inline.
- Threshold calibration is the dominant ongoing operational cost. A threshold calibrated for a tightly scoped task produces constant false positives on a research agent where goal refinement is the intended behaviour. Maintain separate thresholds per task class.
- Goal-signal extraction quality determines monitor coverage. A deployment with full tool-call logging and a structured system prompt that enforces a current_objective field operates this control at full value; one relying on free-text message extraction will miss subtler forms of drift.
- Development effort is medium and concentrated in two phases: (1) goal-signal extraction engineered per framework version; (2) threshold calibration per task class, ongoing.
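Per-task-class calibration could start from similarity scores collected on runs known to be benign. The sketch below is one possible approach, not a prescribed method: it picks, for each class, a threshold that at most a target fraction of benign steps would fall below. The function names, the 5% default, and the sample scores are hypothetical.

```python
from typing import Dict, Sequence


def calibrate_threshold(benign_scores: Sequence[float],
                        target_false_positive_rate: float = 0.05) -> float:
    """Threshold that at most the target fraction of benign scores fall below."""
    if not benign_scores:
        raise ValueError("need at least one benign score to calibrate")
    ordered = sorted(benign_scores)
    index = min(int(target_false_positive_rate * len(ordered)),
                len(ordered) - 1)
    return ordered[index]


def calibrate_per_class(scores_by_class: Dict[str, Sequence[float]],
                        target_false_positive_rate: float = 0.05
                        ) -> Dict[str, float]:
    """Calibrate one threshold per task class."""
    return {
        task_class: calibrate_threshold(scores, target_false_positive_rate)
        for task_class, scores in scores_by_class.items()
    }


# Illustrative scores only: a tightly scoped class and a looser one.
thresholds = calibrate_per_class({
    "invoice_processing": [0.90, 0.88, 0.85, 0.92, 0.87],
    "market_research":    [0.70, 0.45, 0.60, 0.52, 0.66],
})
```

Because divergence is defined as a score strictly below the threshold, the lowest benign scores set the floor. A class with legitimately wide goal refinement ends up with a lower threshold than a tightly scoped one, which is the behaviour the calibration bullet above calls for.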
When NOT to use
- Do not apply to agents with intentionally open-ended goals such as research assistants, creative writing tools, or exploratory planning agents, where goal evolution across steps is the expected and correct behaviour.
- Do not apply to very-short-horizon agents (single-step tool calls) where there are no intermediate reasoning steps to monitor.
- Do not use embedding similarity alone for agents handling adversarial input at scale. Pair with the Constitutional AI critic option or with plan-versus-goal structural validation.
Limitations
- No widely-deployed framework exposes a canonical current-goal field. Goal-state must be derived from proxy signals, all of which are imperfect.
- Short or vague declared goals defeat the comparison. A declared goal of "help the user" is too coarse to anchor a meaningful divergence signal.
- If no goal-state signal is derivable from a step, the monitor cannot make a determination for that step.
- LangSmith online evaluators detect drift post-hoc on completed traces; they cannot pause the agent mid-run.
Maturity tier reasoning
- Tier 2 fits because the component primitives are all production-available and documented: sentence-transformers, LangSmith evaluators, and AutoGen stream inspection are each in active use in production agent deployments.
- What keeps this out of Tier 1 is the absence of a canonical goal-state field in any mainstream agent framework and the absence of an industry-standard benchmark for agentic goal-drift detection.
- The structured plan tree pattern is the most complete implementation of the control as specified, but it requires designing the system prompt and agent output schema from the ground up, making it a bespoke engineering investment rather than a drop-in integration.
Last verified against upstream docs: 2026-05-30.