MITIGATION · m-mem-anomaly
Memory anomaly detection — runtime detection of poisoning that slipped past validation
An agent's memory store can receive adversarial content that passes schema and policy validation because the content is structurally valid but statistically unusual. Memory anomaly detection addresses this by monitoring write rates, embedding distances, provenance tags, and retrieval patterns at runtime, and quarantining writes whose statistical signatures diverge from the established baseline.
At a glance
TL;DR
- Reactive complement to write-time validation: catches poisoning that passed schema and policy checks by detecting statistical anomalies in write rates, embedding distances, provenance tags, and retrieval patterns at runtime.
- Fires after a write commits. The first response is automated quarantine rather than deletion: suspect writes are moved to an isolated partition so the full blast radius can be determined before any rollback begins.
- Four signal classes: write-rate spikes per source against a rolling baseline, embedding-distance outliers from per-topic cluster centroids, provenance mismatches between claimed source and actual ingestion path, and retrieval-distribution shifts on previously stable queries.
- Cold-start limitation: no established baseline means elevated false-positive rates for approximately the first 30 days. Enable only after sufficient normal write-pattern data has accumulated.
How it behaves
What it is
An agent's memory store accumulates content over time, and that content shapes every future retrieval. A write-time validation gate checks whether incoming content conforms to a declared schema and passes policy rules. What it cannot reliably catch is content that is structurally valid but adversarially crafted: a correctly formatted document that asserts a false fact, or a sequence of legitimate-looking writes that collectively shift retrieval results toward attacker-controlled content. Memory anomaly detection is the reactive layer that addresses this gap.
Rather than inspecting the content of individual writes, anomaly detection monitors the statistical behaviour of the write stream and the retrieval patterns it produces. Four classes of signal matter:
- Write-rate anomalies. A poisoning campaign must typically write repeatedly to move retrieval results; that volume produces a detectable spike in write rate from a single source against a rolling baseline window.
- Embedding-distance anomalies. Content that does not belong to the established corpus tends to produce embeddings that fall far from the per-topic cluster centroid. Distance from the centroid, measured as a standard-deviation multiple of the baseline distribution, is a proxy for content that is anomalous relative to the existing knowledge domain.
- Provenance anomalies. A write that claims a particular source identity but arrives through a different ingestion path produces a mismatch between the claimed origin and the actual delivery channel. That mismatch is itself a signal independent of content.
- Recall-pattern anomalies. A sudden change in which vectors are retrieved for previously stable queries indicates that newly committed content is displacing established results, which is the downstream effect of a successful poisoning write.
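The provenance signal in the list above can be sketched as a lookup of the claimed source against the ingestion paths it is expected to use. This is a minimal illustration, not a reference implementation: `EXPECTED_PATHS`, `MemoryWrite`, and the source and path names are all hypothetical, and a real deployment would derive the actual ingestion path from authenticated transport metadata rather than from a field on the write.

```python
from dataclasses import dataclass

# Hypothetical mapping of source identities to the ingestion paths they are
# expected to use; a real deployment would load this from configuration.
EXPECTED_PATHS = {
    "crm-sync": {"internal-api"},
    "docs-crawler": {"batch-ingest"},
}


@dataclass(frozen=True)
class MemoryWrite:
    write_id: str
    claimed_source: str   # identity asserted in the write's provenance tag
    ingestion_path: str   # channel the write actually arrived through


def provenance_mismatch(write: MemoryWrite) -> bool:
    """Return True when the claimed source is unknown or used an unexpected path."""
    allowed = EXPECTED_PATHS.get(write.claimed_source)
    return allowed is None or write.ingestion_path not in allowed
```

Treating an unknown source as a mismatch is a design choice made here for illustration; it fails closed at the cost of flagging sources that have not yet been registered.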
Detection is reactive: by the time an alert fires, the suspect content has already been committed. The correct first response is quarantine, not deletion. Moving the affected writes to an isolated partition preserves them for forensic analysis and allows the full set of affected sessions to be identified before any rollback attempt begins. Pair this control with a write-time validation gate that intercepts what is detectable before commit, and with provenance tracking that makes the incident response chain legible.
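The quarantine-first response described above might look like the following sketch. It assumes a toy in-memory store with an active partition, a quarantine partition, and a retrieval log of (session, write) pairs; `MemoryStore` and its methods are hypothetical names, and a real vector store would implement the move with its own namespace or collection primitives.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Toy store with an active partition and an isolated quarantine partition."""
    active: dict = field(default_factory=dict)          # write_id -> record
    quarantine: dict = field(default_factory=dict)      # write_id -> record
    retrieval_log: list = field(default_factory=list)   # (session_id, write_id)

    def quarantine_writes(self, suspect_ids):
        """Move suspect writes out of the active partition without deleting them."""
        moved = []
        for write_id in suspect_ids:
            record = self.active.pop(write_id, None)
            if record is not None:
                self.quarantine[write_id] = record  # preserved for forensics
                moved.append(write_id)
        return moved

    def affected_sessions(self, suspect_ids):
        """Sessions that retrieved any suspect write: the set to review before rollback."""
        suspects = set(suspect_ids)
        return sorted({sid for sid, wid in self.retrieval_log if wid in suspects})
```

Calling `quarantine_writes` first stops further retrievals of the suspect content, and `affected_sessions` then scopes the rollback using the retrieval history.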
Detection signals
- Write rate per source versus rolling baseline. A sustained spike from a single source is the characteristic signature of a poisoning campaign, which typically requires repeated writes to shift retrieval behaviour.
- Embedding distance from per-topic cluster centroid. A new write whose embedding falls far outside the established cluster for a topic indicates content that does not belong to the normal corpus, whether from drift or adversarial injection.
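The write-rate signal can be approximated with a per-source z-score against a rolling window, as in the sketch below. The window length, minimum sample count, z-threshold, and the floor on the standard deviation are all illustrative assumptions, and `WriteRateMonitor` is a hypothetical name rather than part of any library.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class WriteRateMonitor:
    """Flags per-source write counts that spike above a rolling baseline."""

    def __init__(self, window=288, min_samples=30, z_threshold=3.0):
        self.min_samples = min_samples    # no alerts until a baseline exists
        self.z_threshold = z_threshold
        # One bounded history of per-interval write counts per source.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, source, count):
        """Record one interval's write count; return True if it is a spike."""
        past = self.history[source]
        is_spike = False
        if len(past) >= self.min_samples:
            mu = mean(past)
            # Floor sigma so a perfectly flat baseline cannot divide by zero.
            sigma = max(pstdev(past), 1.0)
            is_spike = (count - mu) / sigma > self.z_threshold
        if not is_spike:
            # Spikes are kept out of the baseline so an attacker's own
            # traffic does not raise the threshold.
            past.append(count)
        return is_spike
```

The `min_samples` guard reflects the cold-start concern: a source with no history never alerts, which trades missed early detections for fewer false positives.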
Threats it covers
- Why it helps: Memory Poisoning is the injection of adversarial content into an agent's memory store so that it influences future retrievals and, through them, the agent's reasoning and output. This control detects poisoning attempts by monitoring for the statistical signatures they produce: abnormal write rates from a single source, embeddings that fall far outside the established cluster for a topic, provenance tags that do not match the actual ingestion path, and retrieval distributions that shift away from previously stable results.
Principle coverage
Defence-in-Depth stage: Detect. It advances:
- Memory & RAG Integrity. Memory integrity requires that content retrieved from the store be what legitimate principals wrote. Memory anomaly detection advances that principle at the reactive layer: it monitors the statistical behaviour of the write stream for the signatures that adversarial campaigns produce, and quarantines suspect content to limit its further influence on agent reasoning, serving as the detective counterpart to the write-time validation gate.
Design & governance principles (open design, economy of mechanism, accountability, …) are architectural, not advanced by a single placed control.
Implementation options
Five verified implementation options, numbered in the order listed below, cover embedding-space anomaly detection (two self-build approaches with no vendor dependency), write-rate monitoring via vector-store-native metrics, and application-layer recall-quality canaries. Options 1 and 2 are low-cost, free of vendor dependencies, and address the embedding-distance signal class directly. Options 3 and 4 reuse the observability stack already in place. Option 5 adds the recall-quality signal class. All five are composable.
scikit-learn IsolationForest. Train an IsolationForest on historical embeddings from known-good writes. Score new writes at inference time; predict returns +1 (inlier) or -1 (outlier). Suitable for high-dimensional embedding spaces without distribution assumptions.
Why choose it: Best for embedding-distance detection when the topic space is broad or cluster structure is not well-defined. IsolationForest makes no normality assumption and handles high-dimensional spaces well. Zero additional dependencies beyond scikit-learn. API: clf.fit(X_train) -> clf.predict(X_new) -> clf.score_samples(X_new).
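A minimal sketch of the fit, predict, and score_samples flow follows. The training data is synthetic Gaussian noise standing in for known-good embeddings, and the 64-dimension size, `n_estimators`, and `random_state` values are assumptions chosen only to make the example reproducible.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for embeddings of known-good writes (assumed 64 dims).
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 64))

clf = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
clf.fit(X_train)

X_new = np.vstack([
    rng.normal(0.0, 1.0, size=(1, 64)),   # drawn from the baseline distribution
    rng.normal(8.0, 1.0, size=(1, 64)),   # shifted far from the baseline
])

labels = clf.predict(X_new)         # +1 for inliers, -1 for outliers
scores = clf.score_samples(X_new)   # lower scores are more anomalous
```

`score_samples` is often more useful than `predict` for alerting, because a continuous score lets the threshold be tuned per corpus instead of relying on the model's built-in cut-off.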
numpy centroid-distance threshold. Compute per-topic cluster centroids nightly from known-good embeddings. On each write, compute distance = np.linalg.norm(embedding - centroid) and alert when distance exceeds mean + 3 standard deviations of the baseline distribution.
Why choose it: Best when the memory corpus is topic-structured with coherent clusters. Zero additional dependencies; runs inline on the write path at approximately 5 ms per vector. The 3-sigma threshold is the canonical anomaly-detection boundary from NIST AI 600-1 MEASURE-2.7; adjust per corpus based on observed baseline variance. Pairs naturally with option 1 for broader detection.
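The centroid-distance check can be sketched in a few lines of numpy. The function names `fit_baseline` and `is_outlier` are hypothetical, and the multiplier `k=3.0` mirrors the 3-standard-deviation threshold above as a starting point rather than a recommended value.

```python
import numpy as np


def fit_baseline(embeddings):
    """Return (centroid, mean_dist, std_dist) for one topic's known-good embeddings."""
    centroid = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    return centroid, dists.mean(), dists.std()


def is_outlier(embedding, centroid, mean_dist, std_dist, k=3.0):
    """True when the embedding lies more than k standard deviations beyond the mean distance."""
    distance = np.linalg.norm(embedding - centroid)
    return bool(distance > mean_dist + k * std_dist)
```

In this sketch `fit_baseline` would run in the nightly job and `is_outlier` inline on each write, with one baseline tuple stored per topic.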
Weaviate Prometheus metrics. Set PROMETHEUS_MONITORING_ENABLED=true; Weaviate serves metrics at :2112/metrics. Exposes vector index insert/delete counts, batch write durations, and LSM bucket write operations with per-shard granularity. Alert on rate(weaviate_vector_index_operations_total[5m]) spikes per source tag.
Why choose it: Best for write-rate monitoring in Weaviate deployments, reusing Prometheus/Grafana alerting infrastructure already in place. Anomaly detection logic lives in alerting rules, not in Weaviate itself. Does not provide embedding-distance detection; compose with options 1 or 2 for that signal class.
Pinecone + Datadog. Pinecone exposes audit logs tracking user and API actions, and a Datadog integration for shipping vector search and RAG metrics to an external monitoring platform. Write-rate anomaly alerting is implemented in Datadog using standard metric monitors.
Why choose it: Best for Pinecone deployments already shipping metrics to Datadog. The audit log captures who wrote what and when; Datadog's anomaly monitor type applies statistical baseline detection to the write-rate time series without requiring a separate alerting system. Pinecone does not perform anomaly scoring itself; the detection logic lives in Datadog.
LangSmith online evaluation. LangSmith online evaluators run LLM-as-a-judge scoring on production traces at configurable sampling rates. The Feedback Score alert type fires when the average score for a project drops below a threshold, usable as a recall-quality proxy for poisoning-driven displacement of retrieval results.
Why choose it: Best for the recall-pattern signal class in LangChain-instrumented deployments. A sustained drop in retrieval relevance for a previously stable query set indicates that newly written content is displacing established results. Does not address write-rate or embedding-distance detection; treat as a canary layer on top of options 1 through 4.
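Independent of any particular evaluation platform, the displacement effect described above can also be approximated by comparing the current top-k results for a fixed set of canary queries against a stored baseline. The sketch below uses Jaccard overlap; the function names, the `min_overlap` value of 0.6, and the dictionary layout are assumptions for illustration and are not part of the LangSmith API.

```python
def jaccard(a, b):
    """Jaccard similarity of two ID collections; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def displaced_queries(baseline, current, min_overlap=0.6):
    """Return (query, overlap) for canary queries whose results drifted from baseline.

    baseline and current map each query string to its list of retrieved document IDs.
    """
    flagged = []
    for query, baseline_ids in baseline.items():
        overlap = jaccard(baseline_ids, current.get(query, []))
        if overlap < min_overlap:
            flagged.append((query, round(overlap, 2)))
    return flagged
```

A flagged query is only a prompt for investigation: legitimate corpus growth also changes top-k results, so the overlap threshold needs the same per-deployment tuning as the other signals.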
Trade-offs
- Statistics computation is asynchronous to the write path; embedding-distance checks add 5 to 20 ms inline if used as a write-time gate, or run out-of-band as alert-only with no write-path latency impact.
- The principal operational cost is baseline tuning, not infrastructure spend. A threshold set too sensitively produces alert fatigue that desensitises the team; a threshold set too loosely misses gradual poisoning. NIST AI 600-1 MEASURE-2.7 explicitly notes that anomaly thresholds require continuous calibration.
- Cold-start systems with no established baseline produce elevated false-positive rates for approximately the first 30 days. Enable write-rate and embedding-distance alerting only after a sufficient baseline window is available.
When NOT to use
- Counter-productive as a write-time gate on cold-start systems: false-positive rates will be high enough to block legitimate writes. Delay enabling the control until at least 30 days of normal write-pattern data are available.
- Wrong tool for catching a single well-crafted adversarial document. The control is tuned for statistical volume signals, not per-document semantic analysis. Do not use as the only defence against T1 (Memory Poisoning); deploy m-mem-validation as the pre-commit gate first.
Limitations
- Anomaly detection assumes a stable baseline. Cold-start systems and rapidly growing corpora produce false positives for legitimate growth spikes that resemble write-rate attacks.
- A slow-drift poisoning campaign that spreads writes over weeks stays within rolling-window thresholds. A patient attacker who maintains write rates below the alert threshold defeats the control. Supplement with periodic full-corpus integrity audits.
- Detective phase only: by the time an alert fires, the poisoned content has been committed. Combine with rollback procedures and the pre-commit write-boundary validation provided by m-mem-validation.
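For the slow-drift limitation above, one possible component of a periodic full-corpus audit is to compare the current corpus against a frozen reference snapshot instead of a rolling window, so that small weekly shifts accumulate into a visible total. The sketch below is an assumption-laden illustration: `centroid_drift` is a hypothetical helper, and cosine distance between centroids is only one of several reasonable drift measures.

```python
import numpy as np


def centroid_drift(reference, current):
    """Cosine distance between the centroids of two embedding matrices.

    reference: embeddings from a frozen, trusted snapshot of the corpus.
    current:   embeddings from the corpus as it exists at audit time.
    """
    ref_c = reference.mean(axis=0)
    cur_c = current.mean(axis=0)
    cos = np.dot(ref_c, cur_c) / (np.linalg.norm(ref_c) * np.linalg.norm(cur_c))
    return float(1.0 - cos)
```

Comparing each week only to the previous week can keep every step under an alert threshold, while the same function applied against the frozen snapshot reports the cumulative shift.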
Maturity tier reasoning
- Tier 2 because the underlying observability infrastructure (Prometheus, CloudWatch, Datadog) is Tier 1 and embedding-distance anomaly detection is documented in peer-reviewed literature and deployed in production research-grade systems.
- What holds the composed pattern at Tier 2 is the baseline-tuning gap: no industry-standard default threshold library exists for agent-memory anomaly detection, so every deployment calibrates from scratch. This is expected to improve as vector store vendors ship default anomaly profiles.
Last verified against upstream docs: 2026-05-30.