LLM09:2025: Misinformation (agentic delta)

What changes in an agent loop

In a chatbot, a hallucination produces a wrong answer that a reader can reject. In an agent, the same hallucination sits upstream of every action that follows: the model fabricates an entity ID, a file path, an API endpoint, or a tool parameter, and the tool then executes against it without independent verification. Each subsequent step is individually authorised; the harm traces back to the false premise that opened the chain, not to any single action that looks suspicious in isolation. In multi-agent systems hallucinations propagate: one agent's confabulation becomes a peer's input, and the divergence from reality compounds with each hand-off. Human reviewers are particularly vulnerable because agents construct plausible, internally consistent justifications for fabricated claims and persistently re-assert them despite pushback; that pattern (confident repetition of false information despite correction) is where hallucination crosses into deceptive behaviour. The control is grounding: every material claim the agent acts on should be verified against a trusted, external source before the tool call fires.
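A minimal sketch of the grounding control, assuming a hypothetical `KNOWN_CUSTOMER_IDS` registry that stands in for the trusted external source and a hypothetical refund tool. All names here are illustrative and not part of any real API.

```python
# Hypothetical trusted registry; in practice this would be a database or API lookup.
KNOWN_CUSTOMER_IDS = {"cust-001", "cust-002"}


class UngroundedClaimError(Exception):
    """Raised when a model-supplied identifier cannot be verified."""


def grounded_refund(customer_id: str, amount: float) -> str:
    # Verify the model-supplied ID against the trusted source before the tool fires.
    if customer_id not in KNOWN_CUSTOMER_IDS:
        raise UngroundedClaimError(f"unknown customer id: {customer_id!r}")
    # Hypothetical tool action; only reached for verified IDs.
    return f"refunded {amount:.2f} to {customer_id}"
```

A fabricated ID fails at the check, so no later step in the chain inherits the false premise.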

Canonical landings: T5 Cascading Hallucination Attacks; T7 Misaligned and Deceptive Behaviours; T10 Overwhelming Human-in-the-Loop (HITL)

For the full definition, prevention checklist, and detection guidance, read OWASP's Misinformation page. This page only adds the agentic angle and the bridge into Helmwart.

Mitigations

Adaptive workload balancing — distribute reviews by measured reviewer fatigue T2

Human reviewers make more errors as cognitive load accumulates over a shift. An adversary who floods a HITL gate, or a system that simply generates high output volume, exploits that degradation without bypassing the gate at all. Adaptive workload balancing addresses this by treating reviewer fatigue as a live routing input: each incoming review is assigned to the reviewer with the lowest current fatigue score, mandatory breaks are enforced before a reviewer's error rate climbs further, and items are held rather than assigned to any reviewer above the break threshold.
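The routing rule can be sketched as follows, assuming fatigue is already measured as a score between 0 and 1 per reviewer. `BREAK_THRESHOLD` and `assign_review` are hypothetical names, and the threshold value is illustrative.

```python
from typing import Optional

BREAK_THRESHOLD = 0.7  # assumed fatigue score above which a reviewer must take a break


def assign_review(fatigue: dict[str, float]) -> Optional[str]:
    """Return the least-fatigued eligible reviewer, or None to hold the item."""
    eligible = {name: score for name, score in fatigue.items() if score <= BREAK_THRESHOLD}
    if not eligible:
        return None  # hold rather than assign to a reviewer above the threshold
    return min(eligible, key=eligible.get)
```

Returning `None` models the hold: the item waits in the queue until a reviewer drops back under the threshold.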

AI-source disclosure UI — visible AI labelling at the point of action T2

When an AI agent generates content or proposes an action, users need to know that the source is an AI before they decide to act. Without that signal, users routinely over-trust agent output. AI-source disclosure addresses this by attaching a visible label to every AI-generated item and by requiring explicit confirmation for consequential actions, restoring the critical gap between receipt and acceptance.

Behavioural anomaly isolation — automatic quarantine on observable drift T2

An agent that has been compromised or poisoned, or has gone rogue, will in most cases behave differently from its established baseline. Anomaly isolation acts on that difference: when an agent's behaviour score crosses a configured threshold, it is quarantined automatically: credentials are revoked, message-queue access is cut, and in-flight actions are aborted. Manual revocation cannot match the speed that cascading multi-agent failures demand.
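A minimal sketch of the quarantine step, assuming an anomaly score is computed elsewhere. `AgentState`, `evaluate`, and `QUARANTINE_THRESHOLD` are hypothetical; a real deployment would call its credential store and message broker instead of flipping flags.

```python
from dataclasses import dataclass, field

QUARANTINE_THRESHOLD = 0.8  # assumed anomaly score that triggers isolation


@dataclass
class AgentState:
    agent_id: str
    credentials_valid: bool = True
    queue_access: bool = True
    in_flight: list[str] = field(default_factory=list)
    quarantined: bool = False


def evaluate(agent: AgentState, anomaly_score: float) -> AgentState:
    """Quarantine the agent in one step once the score crosses the threshold."""
    if anomaly_score >= QUARANTINE_THRESHOLD:
        agent.credentials_valid = False  # revoke credentials
        agent.queue_access = False       # cut message-queue access
        agent.in_flight.clear()          # abort in-flight actions
        agent.quarantined = True
    return agent
```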

Behavioural red-teaming — adversarial evaluation of agent reasoning and tool use T2

An agent exposes more attack surface than a static model: it reasons, plans, selects tools, and acts across multiple turns. Static analysis can characterise that surface, and runtime guardrails can block known-bad patterns, but neither can predict what the agent will do under attacker pressure it has never seen. Behavioural red-teaming addresses that gap through structured adversarial evaluation: probing the agent's reasoning, planning, and tool-use paths with attack strategies before each release.

Reviewer decision summaries — independent rationale at HITL gate T2

When an agent decision reaches a human reviewer, the reviewer must reconstruct the agent's reasoning from raw traces before they can form a judgment. OWASP T10 names this reconstruction burden as the mechanism behind reviewer fatigue and oversight failures. A decision summary addresses the problem by inserting an independent model call between the agent's output and the reviewer: that call compresses the decision, evidence chain, and risk factors into a fixed-format card, reducing the per-review cognitive load without removing the human from the decision.

Behavioural divergence monitoring — longitudinal drift from declared role T2

An agent's behaviour can shift gradually over time: tool-selection patterns change, refusal rates drop, output style drifts. No single interaction reveals it, and a single-shot evaluation cannot catch a trend that spans weeks. Behavioural divergence monitoring detects that drift by comparing per-window statistical distributions of observable agent signals against a declared baseline, and alerting when the gap exceeds a threshold.
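One way to compare a window against a baseline is total variation distance over tool-selection frequencies. This is a sketch under that assumption; the source does not prescribe a metric, and the function names and the 0.2 threshold are hypothetical.

```python
from collections import Counter


def distribution(events: list[str]) -> dict[str, float]:
    """Turn a window of observed tool names into relative frequencies."""
    counts = Counter(events)
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Half the L1 distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def drifted(baseline: dict[str, float], window_events: list[str], threshold: float = 0.2) -> bool:
    """Alert when the window's distribution is further from baseline than the threshold."""
    return total_variation(baseline, distribution(window_events)) > threshold
```

Running `drifted` once per window (for example daily) gives the longitudinal view that a single-shot evaluation lacks.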

Fail-closed gate — refuse rather than act on uncertain output T2

An agent that is uncertain about what to do next faces a choice: refuse and ask for clarification, or proceed on its best guess. In low-stakes situations that tradeoff is tolerable. In agentic systems that write, delete, or send, a confident-sounding but wrong output can commit an irreversible action. A fail-closed gate resolves that choice structurally: below a configured confidence threshold, the agent stops and escalates rather than guessing.
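A minimal sketch of the gate, assuming the agent exposes a confidence value between 0 and 1. `CONFIDENCE_THRESHOLD` and `gate` are hypothetical names. Missing or malformed confidence is treated as a failure, which is what makes the gate fail closed.

```python
from typing import Optional

CONFIDENCE_THRESHOLD = 0.9  # assumed value; tune per deployment


def gate(action: str, confidence: Optional[float]) -> str:
    """Return "execute" only for a valid confidence at or above the threshold."""
    # None, NaN, and out-of-range values all fail this check and escalate.
    if confidence is None or not (CONFIDENCE_THRESHOLD <= confidence <= 1.0):
        return "escalate"
    return "execute"
```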

HITL feedback-loop calibration — reviewer overrides fed back into agent tuning T2

An agent at a human-in-the-loop gate will be overridden when its decisions do not match the reviewer's judgment. Without a return path, those corrections are discarded: the same miscalibration surfaces again in the next review cycle and the one after that. A feedback loop closes that gap by capturing each override event as a structured record, accumulating those records into a calibration dataset, and using patterns in that dataset to drive targeted changes to the agent's system prompt, tool-scope policy, or divergence-monitor thresholds. A well-calibrated agent produces fewer out-of-distribution decisions, so the review queue contracts over time.
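The capture-and-aggregate part of the loop might look like the sketch below. The `OverrideRecord` fields and the `override_hotspots` helper are assumptions for illustration; the source does not define a record schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class OverrideRecord:
    agent_id: str
    tool: str
    agent_decision: str
    reviewer_decision: str
    reason: str


def override_hotspots(records: list[OverrideRecord], min_count: int = 2) -> list[tuple[str, int]]:
    """List tools whose decisions reviewers overrode at least min_count times."""
    counts = Counter(r.tool for r in records if r.agent_decision != r.reviewer_decision)
    return [(tool, n) for tool, n in counts.most_common() if n >= min_count]
```

A tool that keeps appearing in the hotspot list is a candidate for a prompt change or a tighter tool-scope policy.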

Kill switch: human authority to halt one agent, a class, or the entire deployment T2

Agentic systems can act faster than a human can intervene through normal channels. A kill switch is the operational guarantee that a named human role can stop agent activity at any scope (single instance, class, or global) through a documented runbook, without requiring a code change or redeployment, and with every invocation written to an audit trail.
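A sketch of the three scopes and the audit trail, assuming agents check the switch before each action. The `KillSwitch` class is hypothetical and keeps state in memory; a real deployment would back it with a shared store that agents cannot write to.

```python
from datetime import datetime, timezone

VALID_SCOPES = {"instance", "class", "global"}


class KillSwitch:
    def __init__(self) -> None:
        self.halted: set[tuple[str, str]] = set()
        self.audit_log: list[dict] = []

    def halt(self, operator: str, scope: str, target: str = "*") -> None:
        """Halt one instance, one class, or everything, and record who did it."""
        if scope not in VALID_SCOPES:
            raise ValueError(f"unknown scope: {scope!r}")
        key = ("global", "*") if scope == "global" else (scope, target)
        self.halted.add(key)
        self.audit_log.append({
            "operator": operator,
            "scope": key[0],
            "target": key[1],
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def is_halted(self, agent_id: str, agent_class: str) -> bool:
        """Agents call this before acting; any matching scope stops them."""
        return (
            ("global", "*") in self.halted
            or ("class", agent_class) in self.halted
            or ("instance", agent_id) in self.halted
        )
```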

Reflection-loop depth limit — a ceiling on how often an agent reworks its own answer T2

An AI agent can review and rewrite its own answer to improve it. If that review runs too long it ties up resources and stops the agent responding in time, and an attacker can deliberately trigger those endless cycles to stall the system. A reflection-loop depth limit prevents that: it sets how many review rounds an agent may run before it has to stop.
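A minimal sketch of the ceiling, assuming a hypothetical `critique` callable that returns a revised draft or `None` when no further change is needed. `MAX_REFLECTIONS` is an illustrative value.

```python
from typing import Callable, Optional

MAX_REFLECTIONS = 3  # assumed ceiling on review rounds


def refine(
    draft: str,
    critique: Callable[[str], Optional[str]],
    max_rounds: int = MAX_REFLECTIONS,
) -> tuple[str, int]:
    """Apply critique repeatedly, stopping at max_rounds even if it never settles."""
    rounds = 0
    while rounds < max_rounds:
        revised = critique(draft)
        if revised is None:  # critic is satisfied; stop early
            break
        draft = revised
        rounds += 1
    return draft, rounds
```

A critic that never settles, whether by accident or under attack, still terminates after `max_rounds` revisions.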

Multi-source verification — cross-check factual claims against an independent source before commit T2

An agent that writes a false claim to memory, passes it to a downstream agent, or returns it to a user has introduced an error that each subsequent step may treat as established fact. The cascade depends on one condition: the false claim goes unchallenged. Multi-source verification breaks that condition by requiring every novel factual assertion to be corroborated by a structurally independent source before it is committed. If the second source cannot corroborate the claim, the assertion is refused or down-weighted before it enters any downstream step.
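The commit decision can be sketched as below, with a plain dictionary standing in for the structurally independent source. The mapping of "contradicted" to refusal and "not found" to down-weighting is an assumption; the source only says the assertion is refused or down-weighted.

```python
def verify_claim(key: str, value: str, independent_source: dict[str, str]) -> str:
    """Decide what to do with a novel claim before it is committed."""
    if key not in independent_source:
        return "down_weight"  # no corroboration available (assumed policy)
    if independent_source[key] == value:
        return "commit"       # corroborated by the independent source
    return "refuse"           # contradicted by the independent source
```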

Output moderation gates — independent moderation pass before emission T2

An AI agent can produce output that is harmful, deceptive, or factually wrong while still sounding fluent and confident. Output moderation places an independent classifier or moderation model between the agent and its destination, checking every output before it reaches a user or a downstream system. The generating model does not evaluate its own answer; a separate gate does.

Multi-agent consensus — N-of-M independent agreement before high-impact actions T2

A single agent's judgment on a high-impact action can be wrong, manipulated, or compromised. Requiring N of M independent peer agents to agree before the action executes means an attacker or a systematic error must sway at least N agents, not just one, before harm results.
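A minimal sketch of the quorum check, assuming each peer returns an independent boolean vote. `quorum_approves` is a hypothetical helper.

```python
def quorum_approves(votes: dict[str, bool], n_required: int) -> bool:
    """Return True only if at least n_required of the M voters approve."""
    if not 0 < n_required <= len(votes):
        raise ValueError("n_required must be between 1 and the number of voters")
    return sum(votes.values()) >= n_required
```

The check is only as strong as the independence of the voters: peers that share a model, prompt, or data source can fail together.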

Output provenance tracking — record the source of every claim an agent makes T2

When an agent produces a claim derived from retrieved data, that claim needs a record of where it came from: the source document, version, and retrieval time. Without that record, a downstream verifier cannot distinguish a well-grounded output from a fabricated one, a tampered one, or a poisoned one. Provenance tracking attaches source attribution to every claim, carries it through each transformation in the pipeline, and surfaces it in audit logs and user-facing interfaces.
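A sketch of a provenance record that survives transformation, using the three fields named above (source document, version, retrieval time). The `Claim` class and `transform` helper are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Claim:
    text: str
    source_document: str
    source_version: str
    retrieved_at: str  # ISO 8601 timestamp
    transformations: tuple[str, ...] = ()


def transform(claim: Claim, new_text: str, step: str) -> Claim:
    """Rewrite the claim text while carrying its provenance and logging the step."""
    return replace(claim, text=new_text, transformations=claim.transformations + (step,))
```

Because the record is immutable, each pipeline stage produces a new claim and the original attribution cannot be silently overwritten.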

Risk-prioritised review queue — match reviewer attention to consequence T2

A human-in-the-loop review system saturates not from absolute decision volume but from undifferentiated volume: every item lands at the same priority, so reviewers cannot distinguish an irreversible high-consequence action from a routine low-stakes one. A risk-prioritised queue fixes this by scoring each decision before it enters the queue and routing it to the tier that matches its risk level, concentrating human attention where the cost of an error is highest.
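Scoring and routing might be sketched as follows. The two risk factors, the weights, and the tier cut-offs are all assumptions chosen for illustration; the source does not specify a scoring model.

```python
def risk_score(irreversible: bool, blast_radius: int) -> int:
    """Score 0 to 100: irreversibility adds 50, blast radius adds up to 50."""
    return (50 if irreversible else 0) + max(0, min(blast_radius, 50))


def route(score: int) -> str:
    """Map a score to a hypothetical review tier."""
    if score >= 70:
        return "high"
    if score >= 30:
        return "medium"
    return "low"
```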

Per-agent trust scoring — behavioural reputation for inter-agent message acceptance T2

In a multi-agent system, each agent routes decisions based on what its peers report. If a peer's behaviour becomes unreliable or adversarial, agents that keep treating it with full authority will propagate whatever errors or manipulations that peer introduces. Per-agent trust scoring addresses this by maintaining a continuously updated reputation score for every peer, derived from observed behaviour, and using that score to determine how much authority each incoming message carries.
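One simple way to keep a continuously updated score is an exponential moving average over observed outcomes. This is a sketch under that assumption; the smoothing factor, the cut-offs, and the function names are hypothetical.

```python
def update_trust(score: float, outcome_ok: bool, alpha: float = 0.2) -> float:
    """Blend the latest observation (1.0 good, 0.0 bad) into the running score."""
    observation = 1.0 if outcome_ok else 0.0
    return (1 - alpha) * score + alpha * observation


def message_authority(score: float) -> str:
    """Decide how much weight an incoming peer message carries."""
    if score < 0.3:
        return "reject"
    if score < 0.7:
        return "verify"  # accept only after an independent check
    return "accept"
```

A higher `alpha` reacts faster to a peer that turns unreliable, at the cost of more noise from single bad observations.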