EVIDENCE TRAIL
Input sanitisation for LLM context
Verbatim excerpts from the upstream sources cited on the mitigation page, with what each source does and does not prove. "Input sanitisation for LLM context" is Helmwart's normalised label — the upstream phrase that comes closest verbatim is OWASP Top 10 Agentic 2026 §ASI01: "Treat all natural-language inputs … as untrusted. Route them through the same input-validation and prompt-injection safeguards … before they can influence goal selection, planning, or tool calls."
Last cross-checked against upstream sources: · 8 sources
References
Each entry shows what the source supports and what it does not prove.
OWASP Top 10 for Agentic Applications 2026
§ASI01 Agent Goal Hijack — Prevention and Mitigation Guidelines, item 1
"Treat all natural-language inputs (e.g., user-provided text, uploaded documents, retrieved content) as untrusted. Route them through the same input-validation and prompt-injection safeguards defined in LLM01:2025 before they can influence goal selection, planning, or tool calls."
Supports: Verbatim statement of the core principle this control implements: all natural-language inputs are untrusted and must be routed through input-validation and prompt-injection safeguards before reaching planning or tool-call logic.
Does not prove: Does not specify the mechanism (tagged segmentation, pattern filtering, or re-encoding). Delegates implementation details to LLM01:2025.
OWASP Top 10 for Agentic Applications 2026
§ASI01 Agent Goal Hijack — Prevention and Mitigation Guidelines, item 6
"Sanitize and validate any connected data source — including RAG inputs, emails, calendar invites, uploaded files, external APIs, browsing output, and peer-agent messages — using CDR, prompt-carrier detection, and content filtering before the data can influence agent goals or actions."
Supports: Explicitly extends the sanitisation perimeter beyond the user-input field to every inbound channel in an agentic system — the widened surface area described in this control's "What it is" section.
Does not prove: Addresses goal-hijack specifically; does not prove sanitisation prevents all classes of prompt injection or that it is sufficient as a standalone control.
OWASP Top 10 for Agentic Applications 2026
§ASI02 Tool Misuse and Exploitation — Prevention and Mitigation Guidelines, item 4
"Policy Enforcement Middleware ("Intent Gate"). Treat LLM or planner outputs as untrusted. A pre-execution Policy Enforcement Point (PEP/PDP) validates intent and arguments, enforces schemas and rate limits, issues short-lived credentials, and revokes or audits on drift."
Supports: Frames LLM/planner outputs — not just user inputs — as untrusted content requiring validation, supporting the control's guidance that sanitisation must also cover tool outputs and inter-agent messages.
Does not prove: The PEP/PDP pattern is an architectural gate, not the content-level sanitisation pass. Complementary, not equivalent.
OWASP Agentic AI — Threats & Mitigations v1.1
§T6 Intent Breaking & Goal Manipulation — Description
"Intent Breaking and Goal Manipulation occurs when attackers exploit the lack of separation between data and instructions in AI agents, using prompt injections, compromised data sources, or malicious tools to alter the agent's planning, reasoning, and self-evaluation."
Supports: Names "lack of separation between data and instructions" as the root-cause vulnerability that this control addresses. Directly motivates the tagged-segmentation and channel-separation approach.
Does not prove: Describes the threat, not the countermeasure. Does not specify how to implement the data/instruction separation — that implementation detail is in the control itself.
OWASP Agentic AI — Threats & Mitigations v1.1
§T1 Memory Poisoning — Mitigation summary (threat overview table)
"Implement memory content validation, session isolation, robust authentication mechanisms for memory access, anomaly detection systems, and regular memory sanitization routines."
Supports: Names "memory sanitization routines" as a mitigating action for T1 memory poisoning, which is one of the two threats (T1 and T6) this control's MDX maps to.
Does not prove: The table row summary is brief; it does not specify normalisation passes, delimiter-based segmentation, or pattern filtering. Granular implementation guidance is not provided here.
Microsoft Prompt Shields (Azure AI Content Safety)
Introduction paragraph — "Prompt Shields in Azure AI Content Safety"
"Prompt Shields is a unified API in Azure AI Content Safety that detects and blocks adversarial user input attacks on large language models (LLMs). It helps prevent harmful, unsafe, or policy-violating AI outputs by analyzing prompts and documents before content is generated."
Supports: Describes a production-grade managed classifier that implements the "pattern filtering" variant of this control. The document also classifies attack types (encoding attacks, conversation mockup, role-play, document attacks) that sanitisation aims to catch.
Does not prove: Vendor documentation; does not publish independent accuracy figures on the page itself. The 89% PINT figure cited in the MDX comes from the Microsoft Build 2024 announcement blog (separate source, not independently verifiable via this URL at time of cross-check). The PINT leaderboard (github.com/lakeraai/pint-benchmark) shows Azure AI Prompt Shield at 89.12% — consistent with the MDX claim, but note Lakera Guard scores 95.22% and AWS Bedrock Guardrails 89.24% on the same benchmark.
Lakera PINT Benchmark (Prompt Injection Testing)
"Important" notice in benchmark README — independence guarantee for evaluated solutions
"All evaluated solutions (including Lakera Guard) are not directly trained on any of the inputs in this dataset."
Supports: Provides the independent, multi-vendor benchmark that the MDX cites. Confirms Azure AI Prompt Shield at ~89% and shows the comparative landscape (Lakera 95.22%, AWS 89.24%, Google Model Armor 70.07%). Demonstrates that no managed classifier is a silver bullet.
Does not prove: The benchmark tests classifier-based detection; it does not cover the tagged-segmentation or re-encoding sub-techniques of this control, which operate at a different layer and are not measurable on PINT.
NIST AI 600-1 — Generative AI Profile (NIST AI RMF)
§2.9 Information Security — description of prompt injection risk
"Prompt injection involves modifying what input is provided to a GAI system so that it behaves in unintended ways. In direct prompt injections, attackers might craft malicious prompts and input them directly to a GAI system, with a variety of downstream negative consequences to interconnected systems. Indirect prompt injection attacks occur when adversaries remotely (i.e., without a direct interface) exploit LLM-integrated applications by injecting prompts into data likely to be retrieved."
Supports: Authoritative NIST framing of both direct and indirect prompt injection as a GAI information-security risk. Grounds this control in the NIST AI RMF risk taxonomy.
Does not prove: NIST AI 600-1 names prompt injection as a risk but does not prescribe input sanitisation as a specific countermeasure in a numbered measure action. MS-2.7-007 mandates red-teaming against "prompt injection" attacks but does not name content filtering or tagged segmentation as the mitigating technique.