T5: Cascading Hallucination Attacks

Definition

Cascading Hallucination Attacks exploit the agent’s inability to distinguish fact from fiction by getting a fabricated output embedded into memory, tool inputs, or downstream agents. Once embedded, the fabrication propagates, compounds, and is treated as evidence by later reasoning. The threat is the propagation, not the original hallucination.

What it looks like in practice

Sales Orchestration Misinformation Cascade. A sales agent for a software vendor generates a proposal and fabricates a product feature (“supports real-time multi-region sync”) that the product does not actually have. The proposal is saved to a shared CRM and that feature claim is committed to the agent’s long-term memory as a confirmed product capability. A separate pricing agent later retrieves that memory entry when generating quotes, prices the deal as if the feature exists, and the contract is signed on that basis. No human reviewed either the original proposal or the retrieved memory before the contract was issued.

API Call Manipulation and Information Leakage. A developer-tooling agent is asked to call a REST API for which no documentation is in context. It hallucinates a plausible endpoint path (/internal/admin/users) and makes the call. That path happens to exist and to be accessible without authentication because it was left open for a legacy integration. The agent receives a JSON payload of user records and includes them in its response, so the data is displayed to whoever issued the request, an attacker in this scenario. The hallucinated endpoint was not the intended API; the agent reached a real internal surface it was never meant to access.

Healthcare Decision Amplification. A clinical-support agent generates a dosage recommendation for a patient with an unusual combination of conditions. It fabricates a contraindication threshold that sounds plausible but has no basis in clinical guidelines and stores it in the patient’s session context. A second agent in the same pipeline, a medication reconciliation agent, retrieves the session context, treats the fabricated threshold as a verified guideline, and flags a different medication as unsafe based on the invented figure. A clinician receives the reconciliation alert, spends time investigating a false risk, and the original fabrication has now produced a clinical workflow disturbance attributable to no single obvious error.

Foreign Exchange Market Manipulation. A trading agent monitoring emerging-market currencies constructs a fabricated central-bank rate announcement from fragmentary news snippets: “Banco Central has raised the benchmark rate to 12.5%.” That fabrication is forwarded as a confirmed signal to three downstream strategy agents that use it as a trigger for currency positions. By the time a human trader identifies the fabrication, positions have been taken across multiple instruments based on the invented rate. Because each strategy agent treated the upstream output as a reliable signal, no individual agent’s log shows an anomaly beyond a normal trade instruction.

Why it’s dangerous

Agents reinforce themselves. Reflection, self-critique, and memory recall let the same fabrication be cited back as prior knowledge in subsequent turns. In multi-agent systems, fabricated content flows through inter-agent communication and gets validated as input by agents that did not see how it was produced.

Where it manifests

Three seams are risky. The first is where an LLM output is committed to memory. The second is where one agent’s output becomes another agent’s input. The third is where reflection or self-critique is used as a quality gate. None of these steps inherently distinguish “what the agent said” from “what is true.”

Detection signals

Monitor the points where one agent’s output becomes another’s input or is written to memory:

Cross-agent message payloads that contain numeric values (prices, rates, clinical thresholds) with no corresponding citation, source URL, or retrieved document ID attached. Flag for verification before downstream consumption.
A memory write event whose source is a planning or generation step rather than a tool-return or retrieval step, indicating the agent stored its own output as a fact rather than storing externally retrieved data.
A fact or figure appearing in a downstream agent’s prompt that did not appear in any tool response during the upstream agent’s session trace. Detect by diffing the upstream tool-response log against the downstream input payload.
Repeated retrieval of the same memory entry across more than N agent sessions without any human or tool confirmation of that entry, which may indicate a fabrication has become self-reinforcing through recall.
A downstream agent’s action (API call, document write, alert issuance) whose trigger value cannot be traced back to a ground-truth data source in the session’s provenance log. Raise a missing-provenance alert in the observability pipeline.
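
The diffing signal above can be sketched as follows. This is a minimal illustration, not a production detector: the function name `ungrounded_numbers`, the regular expression, and the sample log entries are all assumptions made for the example, and exact token matching will miss reformatted values (for instance "12.5" versus "12.5%").

```python
import re

# Assumed pattern for numeric tokens: integers, decimals, optional percent sign.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?%?")


def ungrounded_numbers(tool_responses: list[str], downstream_payload: str) -> set[str]:
    """Return numeric tokens in the downstream payload that never appeared
    in any upstream tool response (candidate fabrications)."""
    seen: set[str] = set()
    for response in tool_responses:
        seen.update(NUMBER_RE.findall(response))
    return set(NUMBER_RE.findall(downstream_payload)) - seen


# Illustrative data only: the tool log mentions a meeting date, not a rate.
tool_log = ["news snippet: central bank meeting scheduled for 14 March"]
payload = "Banco Central has raised the benchmark rate to 12.5%."

flagged = ungrounded_numbers(tool_log, payload)
if flagged:
    print(f"missing-provenance alert: {sorted(flagged)}")
```

A check like this is cheap enough to run on every inter-agent handoff; anything it flags would then go to a slower verification step rather than being blocked outright.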

OWASP Top 10 for Agentic Applications 2026

The Agentic Top 10 (ASI01 through ASI10) is a separate practitioner-facing publication that maps onto the master Threats & Mitigations threat numbering. T5 is covered by the following Top 10 entries:

ASI08 Cascading Failures (primary)

A single low-severity fault (a hallucinated value, a corrupted tool output, a poisoned memory entry) propagates across a network of agents that each build on the last agent's output, compounding into system-wide harm that is disproportionate to the original defect. ASI08 is about propagation and amplification, not the fault's origin; the initial trigger may itself be innocuous.

OWASP LLM Top 10: LLM01:2025, LLM04:2025, LLM06:2025

Source: OWASP Top 10 for Agentic Applications 2026 (Dec 2025) · the Top 10 is a compass into the master Threats & Mitigations taxonomy, not a replacement for it.

Design principles at stake

When T5 is present, these security design principles are the ones being violated or tested. Each links to the full principle; the mitigations below are how you restore them.

Defence-in-Depth: The threat identifies three distinct seams (memory commit, inter-agent handoff, and self-critique loops), and a fabrication exploits whichever is least defended. Depth means independent controls at each: out-of-band factual verification before any memory write, schema-validated structured handoffs with provenance tags, and a separate deterministic verifier the reflection loop cannot influence. A hallucinated claim that slips past the model's own critique still hits the verifier before it is committed as fact.
Continuous Verification: Fabricated content that enters memory or an inter-agent channel is treated as validated knowledge by every subsequent reasoning step, so a single missed check at ingestion propagates indefinitely. A behavioural monitor baselined on the agent's normal action stream detects the goal drift that hallucination propagation causes: anomalous tool-chain sequences and action-velocity changes correlated with ingesting low-trust content are the signals that flag a cascade before it compounds.
Assume Breach: Prompt injection is effectively inevitable for any agent that reads external documents or tool output, so the design must hold even after a fabrication has entered the context. The dual-LLM pattern (a quarantined LLM with no tools processes untrusted content and returns structured data; the privileged LLM never sees raw external text) breaks the propagation path regardless of whether the initial hallucination was prevented.
Resilience & Recovery: The healthcare and forex case studies show a hallucinated output compounding silently across many sessions before any error event is raised. There is no anomaly to catch at the time of poisoning. Versioned memory with rollback to a pre-poisoning snapshot is the recovery mechanism; without it, correcting the original fabrication still leaves every downstream decision it influenced intact and uncorrected.
Sandboxing & Isolation: A fabricated output gains its leverage by becoming another agent's instruction and being executed; it relies on the absence of a boundary between reasoning output and trusted input. Isolating each agent's execution so that inter-agent messages are treated as untrusted until cryptographically authenticated prevents a hallucinated endpoint claim from triggering an actual API call, regardless of how confidently the model asserted it.
Constrained Generation & Deterministic Guardrails: Hallucination propagation is possible because unstructured LLM output is passed directly into tool calls and inter-agent channels without a deterministic gate. Typed schemas that reject tool calls whose parameters fall outside a declared value set mean a fabricated endpoint or parameter value never reaches execution; the schema doesn't care how confidently the model produced it.
Input/Output Validation: The API call manipulation case study shows that hallucinated endpoints are executed because the agent's output is not validated against an allow-list before it influences downstream systems. An output-moderation pipeline that confirms tool-call destinations against a static allow-list and scans for exfiltration patterns is the outbound control; provenance wrappers on every retrieved chunk are the inbound control that prevents a fabricated source from gaining retrieval authority.
Robustness / Reliability: The threat is explicitly about non-determinism exploited across multi-step pipelines, where a single inconsistency in reasoning becomes a fleet-wide integrity failure if unchecked. Adversarial red-teaming that exercises hallucination injection paths before deployment and drift detection that flags when an agent's outputs diverge from verified sources are the two operational forms of robustness this threat demands.
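
The static allow-list described under Input/Output Validation might look like the sketch below. The host `api.example.com`, the listed paths, and the helper names `is_allowed` and `guarded_call` are hypothetical; a real deployment would load the list from reviewed configuration rather than hard-code it.

```python
from urllib.parse import urlsplit

# Assumed static allow-list of (host, path) pairs the agent may call.
ALLOWED_DESTINATIONS = {
    ("api.example.com", "/v1/users/me"),
    ("api.example.com", "/v1/projects"),
}


def is_allowed(url: str) -> bool:
    """True only for HTTPS URLs whose host and path are explicitly listed."""
    parts = urlsplit(url)
    return parts.scheme == "https" and (parts.hostname, parts.path) in ALLOWED_DESTINATIONS


def guarded_call(url: str, send):
    """Run send(url) only if the destination passes the allow-list.

    `send` stands in for whatever HTTP client the agent actually uses.
    """
    if not is_allowed(url):
        raise PermissionError(f"destination not on allow-list: {url}")
    return send(url)
```

Because the check is deterministic and sits outside the model, a hallucinated path such as /internal/admin/users is rejected no matter how plausible it looked to the agent.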

Recommended mitigations

Auto-generated from the mitigation catalog: every mitigation whose coverage map includes T5, sorted by maturity tier (Tier 1 production-canonical first, then Tier 2, then Tier 3 research-stage).

Tier 2 Fail-closed (Fail-closed gate — refuse rather than act on uncertain output)

An agent that is uncertain about what to do next faces a choice: refuse and ask for clarification, or proceed on its best guess. In low-stakes situations that tradeoff is tolerable. In agentic systems that write, delete, or send, a confident-sounding but wrong output can commit an irreversible action. A fail-closed gate resolves that choice structurally: below a configured confidence threshold, the agent stops and escalates rather than guessing.

Why it helps: Cascading Hallucination Attacks succeed because a fabricated output is treated as ground truth and passed forward. A fail-closed gate intercepts that path at the point of action rather than at the point of generation, so a confident-sounding but incorrect output is refused before it becomes a committed step in a longer chain.
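
A fail-closed gate can be sketched in a few lines. The names `ProposedAction`, `Escalation`, and `fail_closed_gate`, the 0.9 threshold, and the idea that a confidence score in [0, 1] is already available are all assumptions for illustration; how that score is produced is outside the scope of this sketch.

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    name: str
    confidence: float  # assumed score in [0, 1] supplied by an upstream estimator


class Escalation(Exception):
    """Raised when the gate refuses an action and hands it to a human."""


def fail_closed_gate(action: ProposedAction, threshold: float = 0.9) -> ProposedAction:
    """Return the action only if its confidence meets the threshold.

    The comparison is written so that a missing or NaN score also fails:
    anything that is not provably above the threshold is refused.
    """
    if not (action.confidence >= threshold):
        raise Escalation(f"refusing '{action.name}': confidence {action.confidence}")
    return action
```

The design choice worth noting is the direction of the default: the gate has to be convinced to allow an action, so an estimator failure results in a refusal, not a silent pass.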
Tier 2 Loop limit (Reflection-loop depth limit — a ceiling on how often an agent reworks its own answer)

An AI agent can review and rewrite its own answer to improve it. If that review runs too long it ties up resources and stops the agent responding in time, and an attacker can deliberately trigger those endless cycles to stall the system. A reflection-loop depth limit prevents that: it sets how many review rounds an agent may run before it has to stop.

Why it helps: Cascading Hallucination Attacks turn one false output into many, as it gets embedded and then treated as fact by later steps. Every extra reflection round is another chance for that error to compound, so limiting the number of rounds restricts how far it spreads.
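
As a rough sketch of the depth limit, assume the reflection step is a callable that returns a revised draft, or None once it is satisfied. The function name `reflect` and the default of three rounds are illustrative choices, not values taken from the catalogue.

```python
from typing import Callable, Optional


def reflect(
    draft: str,
    critique: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Apply critique repeatedly, stopping at max_rounds.

    Returns the final draft and the number of revisions actually applied.
    """
    rounds = 0
    while rounds < max_rounds:
        revised = critique(draft)
        if revised is None:  # critic is satisfied; stop early
            break
        draft = revised
        rounds += 1
    return draft, rounds


# A critic that is never satisfied is cut off at the ceiling.
final, used = reflect("a", lambda d: d + "!", max_rounds=3)
print(final, used)
```

A caller can compare the returned count against `max_rounds` to decide whether hitting the ceiling should itself be logged as a signal.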
Tier 2 Multi-source verify (Multi-source verification — cross-check factual claims against an independent source before commit)

An agent that writes a false claim to memory, passes it to a downstream agent, or returns it to a user has introduced an error that each subsequent step may treat as established fact. The cascade depends on one condition: the false claim goes unchallenged. Multi-source verification breaks that condition by requiring every novel factual assertion to be corroborated by a structurally independent source before it is committed. If the second source cannot corroborate the claim, the assertion is refused or down-weighted before it enters any downstream step.

Why it helps: Cascading Hallucination Attacks propagate a false claim through an agent pipeline by embedding it into shared memory or passing it between agents in a way that each recipient treats as established fact. The cascade relies on the claim never encountering a source that contradicts it. An independent verification gate at the commit boundary breaks this reliance: a second source that disagrees, or that cannot find supporting evidence, halts propagation before the claim is embedded.
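
One way to picture the commit-boundary gate, under the assumption that each independent source can answer True (corroborates), False (contradicts), or None (no evidence). The names `commit_if_verified`, `catalogue_source`, and the sample catalogue are hypothetical.

```python
from typing import Callable, Iterable, Optional

# A source answers True (corroborates), False (contradicts), or None (no evidence).
Source = Callable[[str], Optional[bool]]


def commit_if_verified(
    claim: str,
    sources: Iterable[Source],
    memory: list[str],
    required: int = 1,
) -> bool:
    """Write the claim to memory only if enough sources corroborate it
    and none contradicts it. Returns True if the claim was committed."""
    confirmations = 0
    for lookup in sources:
        verdict = lookup(claim)
        if verdict is False:
            return False  # contradicted: halt before the claim is embedded
        if verdict is True:
            confirmations += 1
    if confirmations < required:
        return False  # no supporting evidence: refuse rather than commit
    memory.append(claim)
    return True


# Illustrative independent source: a product catalogue keyed by feature name.
CATALOGUE = {"single sign-on": True, "real-time multi-region sync": False}


def catalogue_source(claim: str) -> Optional[bool]:
    return CATALOGUE.get(claim)
```

The sources passed in must not be derived from the generating model's own output, otherwise the check only confirms the fabrication against itself.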
Tier 2 Output moderation (Output moderation gates — independent moderation pass before emission)

An AI agent can produce output that is harmful, deceptive, or factually wrong while still sounding fluent and confident. Output moderation places an independent classifier or moderation model between the agent and its destination, checking every output before it reaches a user or a downstream system. The generating model does not evaluate its own answer; a separate gate does.

Why it helps: Cascading Hallucination Attacks work by embedding a fabricated output into memory or downstream context, where it propagates and compounds as later agents treat it as fact. An independent output classifier intercepts the fabricated output before it reaches any downstream consumer, breaking the propagation path at the emission boundary rather than relying on the originating model to recognize its own error.
Tier 2 Peer consensus (Multi-agent consensus — N-of-M independent agreement before high-impact actions)

A single agent's judgment on a high-impact action can be wrong, manipulated, or compromised. Requiring N of M independent peer agents to agree before the action executes means an attacker or a systematic error must affect the quorum majority, not just one agent, before harm results.

Why it helps: Cascading Hallucination Attacks propagate a false conclusion through later reasoning steps by treating the initial error as established fact. Independent peer evaluation means each agent reasons from its own context; a hallucinated conclusion that fails to persuade the quorum majority is refused before it enters downstream steps.
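
The N-of-M rule reduces to a small counting function. In this sketch each peer is modelled as a callable returning True to approve; the name `quorum_approves` and the peer signature are assumptions, and real peers would be separate agents with their own context rather than lambdas.

```python
from typing import Callable, Sequence


def quorum_approves(
    action: str,
    peers: Sequence[Callable[[str], bool]],
    n_required: int,
) -> bool:
    """Return True only if at least n_required of the peers approve the action."""
    if not 0 < n_required <= len(peers):
        raise ValueError("n_required must be between 1 and the number of peers")
    approvals = sum(1 for peer in peers if peer(action))
    return approvals >= n_required


# Illustrative 2-of-3 vote: only one peer accepts the upstream signal.
peers = [lambda a: True, lambda a: False, lambda a: False]
print(quorum_approves("open FX position", peers, n_required=2))
```

The count only means something if the peers are independent; three agents reading the same poisoned memory entry would agree with each other and still be wrong.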
Tier 2 Provenance tracking (Output provenance tracking — record the source of every claim an agent makes)

When an agent produces a claim derived from retrieved data, that claim needs a record of where it came from: the source document, version, and retrieval time. Without that record, a downstream verifier cannot distinguish a well-grounded output from a fabricated one, a tampered one, or a poisoned one. Provenance tracking attaches source attribution to every claim, carries it through each transformation in the pipeline, and surfaces it in audit logs and user-facing interfaces.

Why it helps: Cascading Hallucination Attacks compound when a fabricated or weakly-grounded claim propagates through multi-agent pipelines and is treated as authoritative by downstream steps. Per-claim source attribution exposes which claims lack a real retrieval ID, allowing the pipeline to hold or flag those claims before they reach the next agent in the chain.
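
A minimal data-structure sketch of per-claim attribution follows. The `Claim` fields mirror the source document, version, and retrieval time named above; the class and function names, and the sample document ID, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    text: str
    source_doc_id: Optional[str] = None   # retrieval ID; None means model-generated
    source_version: Optional[str] = None
    retrieved_at: Optional[str] = None    # ISO 8601 timestamp, if retrieved


def partition_by_provenance(claims: list[Claim]) -> tuple[list[Claim], list[Claim]]:
    """Split claims into (grounded, held): held claims lack a retrieval ID
    and should be flagged before reaching the next agent."""
    grounded = [c for c in claims if c.source_doc_id]
    held = [c for c in claims if not c.source_doc_id]
    return grounded, held


claims = [
    Claim("Supports single sign-on", "doc-123", "v4", "2025-01-15T09:30:00Z"),
    Claim("Supports real-time multi-region sync"),  # no source attached
]
grounded, held = partition_by_provenance(claims)
```

Making the record immutable (`frozen=True`) is a deliberate choice in this sketch: a downstream step can wrap or reject a claim, but cannot quietly attach a source to it after the fact.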

Multi-agent variants: OWASP MAS Guide

The OWASP MAS Threat Modelling Guide v1.0 catalogues one named multi-agent variant of T5, anchored to specific MAESTRO layers. It is a concrete attack pattern that emerges when this threat compounds across agents.

CL Hallucination Attacks extends T5

Crafted partial data forces an agent to generate and act on fabricated conclusions.

Source: OWASP MAS Threat Modelling Guide v1.0, §2 Overview of MAESTRO Framework — Extended Threat Scenarios + Cross-Layer table.

Catalogue extensions: Helmwart T18 to T49

This normalized catalogue includes 3 multi-agent entries based on the OWASP MAS Threat Modelling Guide v1.0 that extend T5. The source guide reuses some numbers between worked systems; these Helmwart entries provide stable detail pages, MAESTRO layers, and mitigation coverage.

T26 Model Instability Leading to Inconsistent Blockchain Interactions
LLM instability causes an agent to interact with blockchain infrastructure in unpredictable ways, submitting invalid transactions or skipping expected calls.
T41 Schema Mismatch Leading to Errors
Ambiguous or inconsistently implemented MCP schemas cause client and server to interpret data differently, producing silent data corruption.
T48 Model Inconsistency Leading to Variable Approvals (MAS source T16)
Non-deterministic LLM behaviour produces divergent outputs for identical inputs, causing inconsistent decisions across agent invocations.

Red-team pivot: MITRE ATLAS techniques

MITRE ATLAS catalogues adversary techniques against AI systems. Where this OWASP threat has an attacker-perspective counterpart, the ATLAS technique is shown below. That is what a red team would actually be doing on the wire. Use this for detection-signal anchoring, threat-hunting hypotheses, and IR runbooks. Source: mitre-atlas/atlas-data v5.6.0.

AML.T0031 Erode AI Model Integrity

Adversary degrades model output quality over time so users lose confidence or downstream consumers act on incorrect predictions.

AML.T0060 Publish Hallucinated Entities

Adversary registers package names, repos, or services that they know LLMs frequently hallucinate, so an agent that trusts the model output downloads the attacker's artefact.

Agentic angle: Coding agents are the most exposed: a hallucinated `npm install foo-utils` becomes a real supply-chain compromise once an attacker squats the name.

AML.T0062 Discover LLM Hallucinations

Adversary probes a model to identify what it consistently hallucinates (package names, citations, APIs) so they can stage a Publish Hallucinated Entities attack.

Sources

OWASP-Agentic-AI ↗ · 1.1 (Dec 2025) · Agentic Threats Taxonomy Navigator §Step 2; Threat Model T5
MAESTRO ↗ · 1.0 (Apr 2025) · Layer 3 Agent Frameworks; Cross-Layer Hallucination Attacks