MITIGATION · m-mem-validation

Memory content validation — a write-boundary gate on what enters the agent's memory store

An agent's memory store is a persistent surface: anything written to it can be retrieved by any agent, in any session, for the lifetime of the corpus. Memory poisoning exploits that persistence by writing adversarial content that steers the agent's reasoning long after the attacker has gone. Write-boundary validation prevents this by running every candidate memory write through schema, policy, and provenance checks before it is committed. Content that fails any gate is rejected and never reaches the store.

Last reviewed 2026-05-12 · Status: published

At a glance

MATURITY: Tier 2. Available off-the-shelf or as a documented pattern, but newer or less broadly proven. Expect integration work and some operational nuance.
PLACES ON: edge · node (restricted to node kinds: shared-memory)
COVERAGE: 4 threats (T1 · T27 · T28 · T49)
TRADE-OFFS: latency low · cost low · UX friction low · dev effort medium

TL;DR
  • Validate at the write boundary, not the read boundary. Once poisoned content lands in memory it can be retrieved by any agent, in any session, for the lifetime of the corpus.
  • Three synchronous gates run before every commit: schema validation (Zod / Pydantic), policy enforcement (OPA: tenant boundary, content classifier, source-provenance permission), and provenance verification (signed source tag that chains to a trusted root).
  • For RAG agents, a fourth inline check compares the candidate embedding's cosine distance from the per-topic cluster centroid against a calibrated baseline. Anomalously distant embeddings are quarantined for review, not committed.
  • Rejection at any gate produces a structured ValidationError and an audit log entry; the rejected payload is never serialised to the store.

How it behaves

  • Trigger: an agent or pipeline requests a write to short- or long-term memory (vector store, KV store, conversation history, inter-agent message bus).
  • Checks, in order: parse against the declared schema; evaluate OPA policy (tenant boundary + content classifier + provenance permission); verify the signed source-tag attribution chain; (RAG only) run the embedding-distance outlier check against the per-topic centroid.
  • On pass: the payload is committed to the store with signed provenance metadata.
  • On rejection: audit.log('memory-write-rejected', { gate, reason, source, userId }); the calling agent receives a structured ValidationError.

Fail-closed at the schema gate: a malformed payload is a harder rejection than a policy violation. Never serialise a rejected payload to the store regardless of which gate fired.
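
The flow above can be sketched as a short gate chain. This is a minimal illustration under stated assumptions, not a reference implementation: the field names, the TRUSTED_SOURCES allow-list, and the in-memory store and audit_log lists are all hypothetical, and the provenance gate is an allow-list lookup standing in for real signature verification.

```python
from typing import Callable, Optional

class ValidationError(Exception):
    """Structured rejection handed back to the calling agent."""
    def __init__(self, gate: str, reason: str):
        super().__init__(f"{gate}: {reason}")
        self.gate = gate
        self.reason = reason

# Each gate returns None to pass, or a rejection reason.
Gate = Callable[[dict], Optional[str]]

TRUSTED_SOURCES = {"policy-service", "crm-sync"}  # hypothetical allow-list

def schema_gate(payload) -> Optional[str]:
    if not isinstance(payload, dict):
        return "payload must be an object"
    for field in ("text", "source", "tenant", "target_tenant"):
        if not isinstance(payload.get(field), str) or not payload[field]:
            return f"{field} must be a non-empty string"
    return None

def policy_gate(payload: dict) -> Optional[str]:
    if payload["tenant"] != payload["target_tenant"]:
        return "cross-tenant write"
    return None

def provenance_gate(payload: dict) -> Optional[str]:
    # Stand-in only: a real gate would verify a signed source tag.
    if payload["source"] not in TRUSTED_SOURCES:
        return "source does not chain to a trusted root"
    return None

GATES = [("schema", schema_gate), ("policy", policy_gate),
         ("provenance", provenance_gate)]

def commit_write(payload, store: list, audit_log: list) -> None:
    meta = payload if isinstance(payload, dict) else {}
    for name, gate in GATES:
        try:
            reason = gate(payload)
        except Exception as exc:  # fail closed: a crashing gate rejects
            reason = f"gate error: {exc}"
        if reason is not None:
            audit_log.append({
                "event": "memory-write-rejected",
                "gate": name,
                "reason": reason,
                "source": meta.get("source"),
                "userId": meta.get("userId"),
            })
            raise ValidationError(name, reason)
    store.append(payload)  # reached only when every gate passed
```

Running the schema gate first means later gates can index fields directly; a gate that raises is treated as a rejection rather than a pass.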

What it is

Agent memory is a persistent data store: conversation history, long-term vector embeddings, structured fact caches, and inter-agent message buses that survive beyond a single session. Any agent that reads from that store trusts its contents. Memory poisoning exploits that trust by writing adversarial content, such as a false pricing rule, a fragmented privilege-escalation instruction, or a subtly corrupted policy document, so that future retrievals steer the agent toward attacker-chosen outcomes. The attack is effective precisely because the poisoned content looks like legitimate memory; the agent has no mechanism to distinguish it from honest writes.

Write-boundary validation is the structural fix. It places a validation pipeline at every memory write seam, running each candidate payload through a sequence of synchronous checks before the write is committed. The checks cover three independent dimensions: schema (is the payload structurally valid?), policy (does this write comply with tenant boundaries, content classification, and source-provenance permissions?), and provenance (does the declared source trace to a trusted root?). For vector-store agents, a fourth inline check compares the candidate embedding's cosine distance from the per-topic cluster centroid against a calibrated baseline, quarantining writes that are anomalously distant rather than committing them. Rejection at any gate produces a structured error and an audit log entry; the payload is never serialised to the store.
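
The embedding-distance check described above might look like the following sketch. The calibration rule (mean distance plus k standard deviations over a trusted sample) is an assumption chosen for illustration; the source only says the baseline is "calibrated". The function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def cosine_distance(a, b):
    """1 - cosine similarity; raises on zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero-norm embedding")
    return 1.0 - dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of the trusted embeddings for one topic."""
    return [mean(column) for column in zip(*vectors)]

def calibrate(trusted_vectors, k=3.0):
    """Return (centroid, threshold) for one topic cluster.

    Assumed rule: threshold = mean distance + k * standard deviation.
    """
    c = centroid(trusted_vectors)
    distances = [cosine_distance(v, c) for v in trusted_vectors]
    return c, mean(distances) + k * pstdev(distances)

def route(candidate, c, threshold):
    """Quarantine anomalously distant embeddings instead of committing."""
    return "quarantine" if cosine_distance(candidate, c) > threshold else "commit"
```

For example, calibrating on a handful of near-parallel vectors and then routing an orthogonal candidate yields "quarantine", while a candidate close to the cluster yields "commit".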

The principle that follows from this structure is that the write boundary is the highest-leverage poisoning surface. Once adversarial content lands in memory it can be retrieved by any agent, in any session, for the lifetime of the corpus. The read boundary, where output filters and retrieval ACLs operate, is a second line of defence but not a substitute: by the time a retrieval check fires, the poisoned record already exists and may already have influenced other writes or been cited in outputs the agent produced before detection.

Detection signals

  • Validation rejection rate per source. A sustained rise from a single source is the signature of an active injection campaign.
  • Embedding-distance outlier rate. A rising proportion of writes flagged as anomalous indicates systematic corpus manipulation attempts.
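
A per-source rejection-rate signal could be tracked with a sliding window, as in this sketch. The window size, alert rate, and minimum sample count are hypothetical defaults, and a fixed window is only one way to approximate a "sustained rise".

```python
from collections import defaultdict, deque

class RejectionRateMonitor:
    """Tracks the recent rejection rate of memory writes per source."""

    def __init__(self, window=100, alert_rate=0.2, min_samples=20):
        self.alert_rate = alert_rate
        self.min_samples = min_samples
        # One bounded window of True/False outcomes per source.
        self._outcomes = defaultdict(lambda: deque(maxlen=window))

    def record(self, source: str, rejected: bool) -> bool:
        """Record one write outcome; return True if the source should alert."""
        outcomes = self._outcomes[source]
        outcomes.append(rejected)
        if len(outcomes) < self.min_samples:
            return False  # too few samples to judge
        return sum(outcomes) / len(outcomes) > self.alert_rate

    def rate(self, source: str) -> float:
        outcomes = self._outcomes.get(source)
        return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

The same structure works for the embedding-distance outlier rate by recording "quarantined" instead of "rejected".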

Threats it covers

  • T1 Memory Poisoning −2 severity steps

    WHY IT HELPS Memory Poisoning is the injection of adversarial content into an agent's short- or long-term memory so that future retrievals steer the agent toward attacker-chosen outcomes. Write-boundary validation removes the attacker's ability to commit that content: a candidate write that fails schema, policy, or provenance validation is rejected before serialisation, so it cannot influence any subsequent retrieval.

  • T27

    WHY IT HELPS Vector store poisoning with malicious financial-instrument data requires the attacker to commit adversarial embeddings to the shared corpus. Schema checks reject structurally malformed payloads; embedding-distance outlier detection flags vectors that are anomalously distant from the trusted cluster centroid; provenance verification rejects writes that cannot be traced to a trusted source. A poisoned vector lands only if it evades all three gates.

  • T28 RAG Data Exfiltration −1 severity step

    WHY IT HELPS RAG data exfiltration is partially addressed at the write boundary by requiring every document written to memory to carry a verifiable provenance tag. Those tags are auditable at retrieval time and support forensic investigation of what was accessed and by whom. Write-boundary validation does not prevent retrieval-side exfiltration; pair with m-vector-acl for the retrieval boundary.

  • T49 Semantic Drift in Embeddings −1 severity step

    WHY IT HELPS Semantic drift is the gradual corruption of a corpus through many small writes, each individually plausible. Embedding-distance outlier checks slow this process by flagging candidate writes whose embeddings are anomalously distant from the trusted cluster centroid, routing them for human review rather than committing them. The control does not eliminate slow-drift poisoning that stays within statistical thresholds; it raises the cost and slows the rate.
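
The provenance tags that several of the entries above rely on can be illustrated with a small signing sketch. This is a deliberately simplified stand-in: it uses symmetric HMAC keys derived from a root secret so that it runs on the standard library alone, whereas a production chain to a trusted root would more likely use asymmetric signatures. All names here are hypothetical.

```python
import hashlib
import hmac

def derive_source_key(root_key: bytes, source_id: str) -> bytes:
    """Derive a per-source key from the root (stand-in for a cert chain)."""
    return hmac.new(root_key, source_id.encode(), hashlib.sha256).digest()

def sign_write(source_key: bytes, content: str) -> str:
    """Produce the provenance tag a source attaches to its write."""
    return hmac.new(source_key, content.encode(), hashlib.sha256).hexdigest()

def verify_provenance(root_key: bytes, trusted_sources: set,
                      source_id: str, content: str, tag: str) -> bool:
    """Accept only writes whose tag traces back to the root for a trusted source."""
    if source_id not in trusted_sources:
        return False
    expected = sign_write(derive_source_key(root_key, source_id), content)
    # Constant-time comparison avoids leaking how much of the tag matched.
    return hmac.compare_digest(expected.encode(), tag.encode())
```

Because the tag covers the content, a record altered after signing fails verification, which is what makes the tag useful for later forensic review.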

Principle coverage

Defence-in-Depth stage: Prevent — and it advances:

  • Memory & RAG Integrity: Write-boundary validation is the primary enforcement mechanism for memory integrity: every candidate write is checked against schema, policy, and provenance before it is committed, so adversarial content that fails any gate never enters the corpus and cannot be retrieved by any agent.

Design & governance principles (open design, economy of mechanism, accountability, …) are architectural, not advanced by a single placed control.

Implementation options

Five verified implementation options covering the three gate layers (schema, policy, provenance) plus a managed content classifier and a semantic plausibility gate. Every deployment should implement at minimum the schema and policy gates together; they cover structural and permission checks with no vendor dependency. Add the content classifier for ingestion pipelines that accept external documents. Add the semantic gate only for writes claiming high-trust provenance.

Zod / Pydantic

Parse every candidate write against a declared schema before any other check. Zod exposes a non-throwing parse path (safeParse) that returns structured errors; Pydantic's model_validate raises a ValidationError that carries equivalent structured detail, so catch it at the gate. Sub-millisecond; in-process; no vendor dependency.

Why choose it: Use as the structural floor for every deployment. It is the cheapest gate and a prerequisite for the policy gate. Zod v3 ships with static type inference so the schema doubles as a TypeScript type. Pydantic v2 is widely used across Python LLM frameworks. A parse failure short-circuits the chain before the payload reaches the store layer.
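
To show the shape of a schema gate without depending on either library, here is a standard-library stand-in that returns a (value, errors) pair in the spirit of a non-throwing parse. It is not Zod or Pydantic code; the MemoryWrite fields and the MAX_TEXT_CHARS limit are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryWrite:
    tenant_id: str
    source: str
    text: str

FIELDS = {"tenant_id": str, "source": str, "text": str}
MAX_TEXT_CHARS = 4000  # hypothetical limit

def safe_parse(raw):
    """Return (MemoryWrite, []) on success or (None, errors) on failure."""
    if not isinstance(raw, dict):
        return None, [{"field": "$", "error": "payload must be an object"}]
    errors = []
    for name, expected in FIELDS.items():
        if name not in raw:
            errors.append({"field": name, "error": "missing"})
        elif not isinstance(raw[name], expected):
            errors.append({"field": name, "error": f"expected {expected.__name__}"})
    # Reject unknown fields so nothing undeclared is smuggled into the store.
    for name in sorted(set(raw) - set(FIELDS)):
        errors.append({"field": name, "error": "unexpected field"})
    text = raw.get("text")
    if isinstance(text, str) and len(text) > MAX_TEXT_CHARS:
        errors.append({"field": "text", "error": "too long"})
    if errors:
        return None, errors
    return MemoryWrite(**raw), []
```

A caller that receives a non-empty error list stops there, so the policy and provenance gates only ever see payloads with the declared shape.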

Open Policy Agent

Decouple policy decisions from enforcement with Rego policies for tenant-boundary, content-classifier, and source-provenance-permission rules. Accepts arbitrary JSON; returns structured decisions in under 1 ms with the Wasm build. Policies are versioned code.

Why choose it: Best when policy complexity warrants a dedicated engine. Tenant boundaries, content classifiers, and provenance permissions are the three rules that matter for memory-write validation; OPA allows each to be authored, reviewed, and versioned independently of the agent code. The Wasm build eliminates the network round-trip and supports hot-reload without redeploying the agent.
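
On the agent side, the enforcement point mostly builds an input document and interprets the decision. The sketch below assumes the decision arrives in the `{"result": ...}` envelope used by OPA's Data API and treats a missing result as a denial. The shape inside `result` (`allow`, `deny_reasons`) is a hypothetical convention that the policy author would define, as are the input field names.

```python
def build_opa_input(write: dict, caller_tenant: str, classifier_label: str) -> dict:
    """Assemble the JSON input for the three memory-write rules."""
    return {
        "input": {
            "caller_tenant": caller_tenant,          # tenant-boundary rule
            "target_tenant": write.get("tenant_id"),
            "classifier_label": classifier_label,    # content-classifier rule
            "source": write.get("source"),           # provenance-permission rule
        }
    }

def interpret_decision(response: dict):
    """Return (allowed, reasons), failing closed on anything unexpected."""
    result = response.get("result")
    if not isinstance(result, dict):
        # An undefined decision carries no "result" key: deny.
        return False, ["policy decision undefined"]
    reasons = [str(r) for r in result.get("deny_reasons", [])]
    allowed = result.get("allow") is True and not reasons
    if not allowed and not reasons:
        reasons = ["denied by policy"]
    return allowed, reasons
```

Requiring `allow` to be exactly `True` means a misspelt rule name or a partially loaded bundle results in a rejected write rather than a silent pass.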

Azure AI Content Safety

Prompt Shields scans text for jailbreak and indirect-injection patterns. Callable as an async gate at the write boundary; reject writes that the classifier flags. Maximum 10,000 characters per call; up to five documents per request.

Why choose it: Best as an additional classifier on top of schema and policy validation for ingestion pipelines where documents arrive from external or untrusted sources. The API is purpose-built for indirect-injection detection (the document-attack class), which schema and policy checks do not catch. Adds an HTTP round-trip (50-150 ms); apply to lower-throughput ingestion paths, not to high-frequency inter-agent message buses.
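
The request limits quoted above mean an ingestion pipeline has to split and group documents before calling the classifier. The sketch below does only that batching step and makes no API call. It assumes, conservatively, that the character limit applies to the combined text in one request; the constants mirror the figures stated in this section and should be checked against the current service documentation.

```python
MAX_CHARS_PER_CALL = 10_000  # figure quoted in this section
MAX_DOCS_PER_CALL = 5        # figure quoted in this section

def chunk(text: str, size: int = MAX_CHARS_PER_CALL):
    """Split one document into pieces no longer than the per-call limit."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def batch_documents(documents):
    """Group document pieces into request-sized batches."""
    batches, current, used = [], [], 0
    for doc in documents:
        for piece in chunk(doc):
            full = len(current) == MAX_DOCS_PER_CALL
            too_long = used + len(piece) > MAX_CHARS_PER_CALL
            if current and (full or too_long):
                batches.append(current)
                current, used = [], 0
            current.append(piece)
            used += len(piece)
    if current:
        batches.append(current)
    return batches
```

A write should be committed only after every batch derived from it comes back clean; splitting at fixed offsets can cut an injected instruction in half, so overlapping chunks may be worth considering.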

MCP write-side handler

Implement a memory-write MCP tool whose handler runs schema, policy, and provenance gates before calling the underlying store. The inputSchema (JSON Schema) handles structural validation; the handler body enforces policy and provenance. Rejected writes return a structured isError response.

Why choose it: Best for deployments that already expose memory operations via MCP. The MCP specification (Security Considerations) requires servers to validate all tool inputs, so this pattern fulfils a specification obligation while implementing the three-gate stack. No additional infrastructure required if memory is already accessed via MCP tools.
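
A handler of this kind might be sketched as below, using plain dictionaries rather than any particular MCP SDK. The tool name, the schema fields, and the handler signature are hypothetical, and the sketch assumes the arguments have already been validated against inputSchema before the handler runs; if the hosting framework does not do that, the handler must.

```python
import json

MEMORY_WRITE_TOOL = {
    "name": "memory_write",  # hypothetical tool name
    "description": "Write a validated record to shared memory.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "tenant_id": {"type": "string"},
            "source": {"type": "string"},
            "text": {"type": "string", "maxLength": 4000},
        },
        "required": ["tenant_id", "source", "text"],
        "additionalProperties": False,
    },
}

def tool_result(payload: dict, is_error: bool) -> dict:
    """Build a tool result with a text content item and the isError flag."""
    return {
        "content": [{"type": "text", "text": json.dumps(payload)}],
        "isError": is_error,
    }

def handle_memory_write(arguments: dict, caller_tenant: str,
                        trusted_sources: set, store: list) -> dict:
    # Policy gate: writes may only target the caller's own tenant.
    if arguments["tenant_id"] != caller_tenant:
        return tool_result({"gate": "policy", "reason": "cross-tenant write"}, True)
    # Provenance gate (allow-list stand-in for signature verification).
    if arguments["source"] not in trusted_sources:
        return tool_result({"gate": "provenance", "reason": "untrusted source"}, True)
    store.append(arguments)  # reached only when both gates passed
    return tool_result({"status": "committed"}, False)
```

Returning the gate name in the error body lets the calling agent distinguish a permissions problem from a provenance problem without seeing internal policy detail.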

Dedicated validator agent

A lightweight critic agent receives every candidate memory write, applies a Zod schema check and an LLM-based semantic plausibility check (Constitutional AI critic prompt), and either approves or rejects with a reason. Slower (100-500 ms per write) but catches semantically adversarial content that passes structural validation.

Why choose it: Best as a second-pass gate for writes claiming high-trust provenance, not for all writes. Schema and policy gates catch structural and permission violations; the semantic layer catches correctly-formatted but adversarial content (a policy document that asserts the wrong rule). Pair with the schema and policy options for structural gating; add this only where the cost of a missed semantic attack justifies the latency.
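
Restricting the semantic pass to high-trust writes is mostly a routing decision. In this sketch the critic is an injected callable standing in for the LLM call, and the provenance classes in HIGH_TRUST are hypothetical; the only behaviour shown is the routing and the fail-closed handling when the critic is unavailable.

```python
from typing import Callable, Tuple

HIGH_TRUST = {"authoritative-policy", "pricing"}  # hypothetical classes

def review_write(write: dict,
                 critic: Callable[[str], Tuple[bool, str]]) -> dict:
    """Run the semantic critic only for writes claiming high-trust provenance."""
    if write.get("provenance_class") not in HIGH_TRUST:
        return {"approved": True, "reason": "semantic check not required"}
    try:
        approved, reason = critic(write["text"])
    except Exception as exc:
        # Fail closed: an unreachable critic must not wave writes through.
        return {"approved": False, "reason": f"critic unavailable: {exc}"}
    return {"approved": bool(approved), "reason": reason}
```

Low-trust writes skip the critic entirely here, which keeps the added latency off the common path while the structural gates still apply to every write.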

Trade-offs

  • Schema validation (Zod / Pydantic) is sub-millisecond in-process; OPA policy decisions are under 1 ms with the Wasm build; embedding-distance checks add 5-20 ms. The managed classifier (Prompt Shields) adds 50-150 ms per call. Apply it only to lower-throughput ingestion, not to high-frequency agent buses.
  • The principal adoption cost is schema and policy definition for the full set of memory surfaces. A production agent typically has several distinct memory surfaces; missing one creates a usable poisoning vector.
  • A sustained high rejection rate from a single source is the signature of an active injection campaign. Monitor rejection rate per source as an operational metric and escalate when it rises without a corresponding legitimate ingestion event.

When NOT to use

  • Not appropriate as the sole trust mechanism when the write source is itself an agent with broad authority. Validating that content is well-formed does not validate that it is correct. In high-assurance contexts, combine validation with provenance attestation and human review for writes that claim to update authoritative policy or pricing.
  • Do not apply embedding-distance outlier detection to deliberately broad corpora (a general-knowledge RAG store). The cluster-centroid model assumes topic coherence; a general corpus produces constant false positives that desensitise the team. Scope this gate to narrowly-defined knowledge bases.

Limitations

  • The schema, policy, and provenance gates reject malformed and policy-violating writes but cannot detect semantically valid adversarial content. A correctly-formatted policy document that asserts the wrong rule passes all structural checks. OWASP MAS Guide T18 (RAG Input Manipulation Leading to Policy Bypass) is the canonical example.
  • A compromised validator provides false assurance that is worse than no validator. Teams stop looking for anomalies because they expect the gate to catch them. Treat validation policies as code: peer review, version control, signed releases.
  • Pair with cryptographic logging (Sigstore) for forensic rollback when a validator bypass is discovered, and with m-mem-anomaly for the reactive detection layer.

Maturity tier reasoning

  • Tier 2 because every component is individually mature while the composed pattern is not yet standardised: Pydantic v2 and Zod v3 are production schema-validation libraries; OPA is a CNCF Graduated policy engine; Sigstore is production-grade for signing.
  • Not Tier 1 because no industry-standard schema for memory-write metadata exists. Every vector store and framework defines its own attribution fields, and there is no published benchmark against which validation pipelines can be scored.
  • Not Tier 3 because the control composes production-ready components and requires no novel engineering. The gap is standardisation, not technology maturity.

Last verified against upstream docs: 2026-05-30.