← Atlas · Mitigations Tier 2 · Real-composable

MITIGATION · m-prompt-injection-defences-plus

Advanced prompt-injection defences — spotlighting, delimiter gate, dual-LLM

Prompt injection succeeds when untrusted content entering an agent's prompt is indistinguishable from trusted instruction. Three layered techniques address that: spotlighting tags untrusted content with a machine-readable origin mark before it reaches the model; delimiter defence rejects input carrying reserved framework tokens before the model is called; and dual-LLM extraction routes attacker-influenceable content through a quarantined model that holds no tool access, so injected instructions cannot reach the model that can act on them.

Last reviewed 2026-05-12 · Status: published · Evidence →

At a glance

MATURITY
Tier 2
Available off-the-shelf or as a documented pattern, but newer or less broadly proven. Expect integration work and some operational nuance.
PLACES ON
node
Restricted to node kinds: agent, shared-memory
COVERAGE
2 threats
T1 · T6
TRADE-OFFS
LAT
medium
COST
medium
UX
low
DEV
medium
Latency · cost · UX friction · dev effort.
TL;DR
  • Spotlighting (Hines et al. 2024, arxiv:2403.14720) marks untrusted content before it enters the prompt (via datamarking, base64 encoding, or delimiter wrapping) so the model can verify origin from the mark, not from position; datamarking and encoding each reduce attack success rates from above 50% to below 2% on GPT-3/4.
  • Delimiter defence reserves framework tokens (e.g. <<<system>>>) that the pipeline rejects on detection in untrusted input; the boundary is enforced by the framework before the model is called, making it a deterministic first gate independent of model compliance.
  • Dual-LLM extraction (Simon Willison Apr 2023) routes attacker-influenceable content through a quarantined LLM with no tool access that emits only schema-constrained structured data; the privileged LLM, which holds tool access, never reads the raw attacker-controlled text.
  • All three are compositional: delimiter gate first (cheapest), spotlighting at prompt-assembly time, dual-LLM at the tool-invocation seam. Spotlighting and delimiter defence are Tier 2 (production-shipped); dual-LLM is Tier 3 (research-demonstrated, no turnkey framework default as of catalog version).

How it behaves

Untrusted content (RAG result, tool output, user message) enters the input-handling pipeline.
Does the content carry reserved delimiter tokens, or does it already contain spotlight markers that the pipeline has not yet applied?
Apply spotlight mark; pass to model prompt, or route through quarantined LLM if dual-LLM path is active.
Reject at ingestion; log a delimiter-validation failure or spotlighting-mimicry event; fail closed and do not forward to the model.
The framework enforces the outer gate (delimiter scan); the model enforces the inner gate (spotlight rule). Dual-LLM ensures that even a model that ignores its spotlight rule cannot reach the privileged tool-invocation path.

What it is

Prompt injection is the class of attacks where untrusted content entering an agent's context is crafted to look like, or override, a trusted instruction. The root cause is that LLMs have no native mechanism to distinguish between a directive in the system prompt and an instruction embedded in a retrieved document or user message; position and formatting are the only signals, and both can be spoofed.

Context isolation and input sanitisation address the problem at the seam level: they separate untrusted input from the trusted prompt and strip known attack patterns at ingestion. This page covers three techniques that operate inside the model's context window, for injection that survives those outer layers.

Spotlighting (Tier 2) transforms untrusted content before it enters the prompt so its origin is marked in a way the model can verify. Hines et al. 2024 (arxiv:2403.14720) introduce the technique in three variants: datamarking replaces whitespace throughout the untrusted span with a reserved character such as ^, so the mark is distributed across the text and survives concatenation; base64 encoding wraps the untrusted span so the model must decode before reading and any injected plain-text instruction is encoded along with it; and delimiting wraps the span in reserved boundary tokens. In the paper's evaluation, datamarking and base64 encoding each reduce attack success rates from above 50 percent to below 2 percent on GPT-3 and GPT-4. The system prompt instructs the model to treat any marked span as data, not instruction. The mark survives concatenation, so the model can verify content origin from the mark, not from position alone.

Delimiter defence (Tier 2) reserves a set of framework-specific tokens, such as <<<system>>>, that are structurally impossible in legitimate untrusted input. The pipeline scans untrusted content for those tokens before prompt assembly; any input containing them is rejected and logged rather than forwarded to the model. The system prompt names the convention so the model also enforces the boundary from the inside. The security property differs from spotlighting: rejection is deterministic and happens before the model is called, so the model's compliance with its own instructions is not on the critical path. Hines et al. classify this as the weakest standalone spotlighting variant (attack success rate reduction of roughly 50 percent versus 98 percent for datamarking), which is why it functions as the outermost gate rather than a complete defence.

Dual-LLM extraction (Tier 3) addresses the scenario where a model ignores or is tricked out of its spotlight rule. The pattern splits the agent into two models: a quarantined LLM reads all attacker-influenceable content (retrieved documents, emails, tool output) and returns only structured data conforming to a declared schema, with no tool access. A separate privileged LLM receives that structured output and holds full tool access, but never reads the raw attacker-controlled text. Simon Willison's April 2023 post (simonwillison.net/2023/Apr/25/dual-llm-pattern/) is the canonical statement of the pattern. An injection in the quarantined LLM's input cannot propagate to the privileged LLM's prompt; the attacker's viable surface narrows to compromising the schema itself, which is a much harder target than the free-text prompt. No turnkey production agent framework ships this as a default as of catalog version; deployments compose it from agent-framework primitives.

The three techniques are compositional. The delimiter gate is the cheapest first gate (a string scan, no model call). Spotlighting runs next, at prompt-assembly time. Dual-LLM applies at the tool-invocation seam where the cost of a second model call is justified by the consequence of a successful injection.

Detection signals

  • Delimiter-validation failure rate. Any non-zero value indicates content carrying reserved framework tokens at a seam that should never see them; investigate immediately.
  • Spotlighting-tag mimicry rate. Incoming content that already carries spotlight tokens before the pipeline applies them signals an attacker who has read the system prompt and is pre-encoding to evade the marker.

Threats it covers

  • T1 Memory Poisoning −1 severity step

    WHY IT HELPS Memory poisoning via indirect prompt injection plants instructions in retrieved content so they steer the agent as if they were system directives. Spotlighting makes the provenance of every content span legible to the model, so poisoned RAG content cannot impersonate a trusted instruction. Dual-LLM extraction ensures the quarantined model that reads poisoned content cannot invoke any tool, so the injection cannot propagate to the execution path regardless of whether the model honours its spotlight rule.

  • WHY IT HELPS Direct prompt injection rewrites the agent's goal by embedding instructions in user-controlled input. Delimiter defence rejects input containing reserved framework tokens before the model receives it, making framework-boundary injection fail closed. Dual-LLM ensures the privileged model, which holds tool access, never reads raw attacker-controlled text.

Principle coverage

Defence-in-Depth stage: Prevent — and it advances:

  • Defence-in-Depth Spotlighting, delimiter defence, and dual-LLM extraction each operate at a different layer of the input-handling pipeline, so an injection that bypasses the outer delimiter gate still faces a model-level spotlight rule, and one that defeats the model-level rule still cannot reach the privileged execution path if the dual-LLM split is in place.
  • Provenance & Trust-tagging Spotlighting embeds a machine-readable origin mark directly in untrusted content before it enters the prompt, so the model can verify whether a span is data or instruction from the mark itself rather than from its position in the context window.
  • Input/Output Validation Delimiter defence and spotlighting together constitute the input-validation layer for the prompt-assembly seam: the delimiter gate rejects content carrying reserved tokens before the model call, and spotlighting marks every untrusted span so the model processes it as data rather than as instruction.

Design & governance principles (open design, economy of mechanism, accountability, …) are architectural, not advanced by a single placed control.

Implementation options

Five verified implementation options spanning two spotlighting variants, a framework-enforced delimiter gate, the dual-LLM architectural split, and a managed cloud service. Most deployments layer at least two: a cheap deterministic gate plus a model-level spotlighting rule.

Spotlighting: datamarking Replace whitespace throughout untrusted content with a reserved character (e.g. ^) before inserting it into the prompt. The system prompt instructs the model to treat any span containing the marker as data, not instruction. Attack success rate drops from above 50% to below 3% on GPT-3.5-Turbo and to 0% on GPT-3 in the paper's evaluation.

Why choose it: Best when you need a low-overhead, prompt-only defence that works with any model that can follow a system-prompt rule. The transformation is a string pre-processing step; no framework changes are required. Use dynamic marker tokens (randomised per request) to prevent an attacker who has read the system prompt from pre-encoding their injection to evade the marker. Pair with a delimiter gate to catch injections that contain no spaces, the one case where pure datamarking fails.

More details:

Spotlighting: base64 encoding Base64-encode untrusted content before inserting it into the prompt. The system prompt instructs the model to decode and process the content as data only. Achieves 0% attack success rate in summarisation tasks and 1.8% in Q&A on GPT-3.5-Turbo in the paper's evaluation; lowest attack success rate of the three spotlighting variants.

Why choose it: Best when you are using a high-capacity model (GPT-4 or equivalent) and can accept a small task-performance overhead from the decoding step. Encoding is transparent to the model but opaque to naive injection attempts, which are encoded along with the surrounding content and cannot issue plain-text instructions. Not suitable for GPT-3.5-Turbo class models where decoding errors and hallucinations significantly impair task performance; validate per model before deploying.

More details:

Framework delimiter gate Reserve a set of tokens (e.g. <<<system>>>, [INST], or a custom UUID-prefixed token) in the prompt-assembly layer. Before inserting any untrusted content into the prompt, scan it for reserved tokens. If found, reject the input and log a delimiter-validation failure rather than forwarding to the model. The system prompt names the reserved tokens and instructs the model to treat them as instruction boundaries when they appear in trusted positions.

Why choose it: Best as the outermost, cheapest gate: a deterministic string scan that runs before the model is called. The security property is stronger than a model-level spotlight rule because rejection happens before the model sees the content. The limitation is encoding bypass: an attacker who knows the delimiter can URL-encode, hex-encode, or apply Unicode normalisation to evade the scan. Normalise common encodings before scanning; accept that novel encodings will surface. Hines et al. classify this as the weakest standalone spotlighting variant (attack success rate reduction of roughly 50% versus 98% for datamarking); its value is as a deterministic first gate, not a complete defence.

More details:

Dual-LLM extraction Route all attacker-influenceable content through a quarantined LLM configured with no tool access and a schema-constrained output format (JSON schema or structured type). A controller passes only the validated structured output to a privileged LLM that holds tool access. The quarantined LLM is expected to be compromisable; the privileged LLM never reads compromised text. The controller is regular software, not an LLM, and mediates via structured objects.

Why choose it: Best for tool-invoking agent paths where indirect prompt injection in retrieved content could trigger real side-effects (email send, API write, file modification). The attack surface narrows from the full free-text prompt to compromising the structured schema. The cost is architectural: two model calls per tool-invoking path, schema design and maintenance, and the discipline of never forwarding raw quarantined text to the privileged model. Reserve for the tool-invocation seam where it matters; applying it uniformly roughly doubles model cost and latency across all paths.

More details:

Azure Prompt Shields Azure AI Content Safety Prompt Shields is a managed API that classifies both user prompt attacks (jailbreak attempts) and document attacks (indirect injection in grounded documents, emails, or retrieved content) before content reaches the model. It detects encoding attacks, role-play persona hijacking, conversation-mockup injection, and system-rule modification attempts. The API returns an attack classification and confidence score per input span; no model-side prompt engineering is required.

Why choose it: Best when you need a production-ready, maintained classification layer without implementing spotlighting transformations yourself. Prompt Shields ships Microsoft's internal spotlighting research as a managed service. Evaluate the false-positive rate on your domain before enabling hard-rejection; the documentation explicitly notes that the shield may flag legitimate prompts. Combine with a model-level spotlight rule and delimiter gate for defence in depth: Prompt Shields catches known attack patterns; the model-level rules catch novel variants that evade classification.

More details:

Trade-offs

  • Spotlighting (datamarking and encoding) adds negligible latency: a string transformation before prompt assembly. The delimiter gate is a string scan, also negligible. Neither increases model call count.
  • Dual-LLM extraction doubles model calls on every tool-invoking path. Using a smaller, cheaper model for the quarantined role (e.g. a Haiku-class model versus a Sonnet-class one) can reduce cost to roughly 1.2x rather than 2x, but the architectural overhead and schema maintenance are non-trivial regardless.
  • Encoding-based spotlighting impairs task performance on smaller models due to decoding errors and hallucinations. Validate against your specific model before deploying in production.
  • The delimiter gate is vulnerable to encoding bypass (URL-encoding, hex-encoding, Unicode normalisation of the reserved token). Normalise common encodings before scanning; accept that novel encodings will surface and the gate will need updating.
  • Dual-LLM schema discipline is the control: if the quarantined model can emit arbitrary prose rather than schema-constrained output, the split fails. Schema design and adversarial testing of the quarantined output are ongoing operational costs.

When NOT to use

  • Do not apply dual-LLM extraction to read-only, no-tool-access agents. The quarantined model cannot reach any execution path regardless, so the isolation overhead adds cost without adding security.
  • Do not apply spotlighting or delimiter defence to system-prompt-only agents where no untrusted content ever enters the prompt (e.g. template-driven agents with typed form inputs and strict API-layer schema validation). There is no untrusted span to mark.
  • Avoid dual-LLM on latency-critical streaming paths where doubling model round-trips is architecturally incompatible with product requirements. Invest in spotlighting and context isolation first and defer the Tier 3 pattern.
  • Do not treat Prompt Shields (or any classifier) as the sole gate. Classifiers have false-negative rates and are updated reactively to novel attack patterns. Layer with model-level spotlight rules for defence in depth.

Limitations

  • Spotlighting works only if the model honours the spotlight rule. A model fine-tuned to ignore the rule, or attacker prompting that causes the model to disregard it, defeats spotlighting. Pair with output moderation on the egress side.
  • Delimiter defence is bypassed by encoding the reserved token. Normalisation before scanning helps; novel encodings will continue to surface as the attack landscape evolves.
  • Dual-LLM assumes a narrow structured schema between the two stages. If the quarantined model can emit unstructured text that the privileged model reads as instruction, the split fails. The schema must be adversarially tested and maintained.
  • No defence in this family is permanent. Active adversarial research consistently finds new bypass techniques. Treat spotlighting and delimiter defence as layers in a defence-in-depth stack, not as complete solutions.
  • Greshake et al. 2023 (arxiv:2302.12173) characterise indirect prompt injection as a class of attacks for which effective mitigations are currently lacking. These techniques reduce but do not eliminate the attack surface.

Maturity tier reasoning

  • Tier 2 at the page level because spotlighting and delimiter defence are production-shipped: Hines et al. 2024 is peer-reviewed, Microsoft Prompt Shields is a production service, and datamarking and encoding are deployable as prompt-engineering patterns today.
  • Dual-LLM extraction is Tier 3 internally: the principle is peer-reviewed and demonstrated, but no turnkey production agent framework ships it as a default as of catalog version. Deployments compose it from framework primitives.
  • Expect Tier 2 upgrade for dual-LLM as agent frameworks (LangChain, LlamaIndex, AutoGen) standardise the privileged/quarantined split as a first-class pattern.

Last verified against upstream docs: 2026-05-30.