EVIDENCE TRAIL

Multi-agent consensus on high-impact actions

Verbatim excerpts from the upstream sources cited on the mitigation page, with what each source does and does not prove. The core distributed-systems theory (Byzantine Generals, PBFT) is strongly supported. Applying that theory to agentic AI is an inference: no upstream source uses the phrase "N-of-M quorum" in an LLM-agent context. One MDX claim (NIST AI 600-1 §MAP-3.2 for ensemble verification) cannot be verified because that section identifier does not exist in the July 2024 document. MEASURE 2.6 is the closest verifiable NIST citation and is used here instead.

Last cross-checked against upstream sources: 2026-05-29 · 7 sources

References

Each entry shows what the source supports and what it does not prove.

Reference 1

ACM Trans. Programming Languages and Systems, Vol. 4, No. 3 · July 1982

Lamport, Shostak & Pease — "The Byzantine Generals Problem" (1982)

Abstract — Byzantine Generals Problem statement

"It is shown that, using only oral messages, this problem is solvable if and only if more than two-thirds of the generals are loyal; so a single traitor can confound two loyal generals."

Supports: Foundational proof that, under the oral (unsigned) message model, consensus among N participants is achievable only when the number f of Byzantine (actively malicious) peers satisfies f < N/3. This establishes the N ≥ 3f+1 bound that agentic-AI deployments inherit when they require ≥2f+1 independent peers to agree.

Does not prove: Does not address LLMs, agentic AI, or hallucination. The paper treats a synchronous message-passing model; real agentic deployments are partially synchronous, so the bound is a lower bound on required diversity, not a deployment recipe.

open original ↗
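The arithmetic behind the N ≥ 3f+1 bound can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the 2f+1 agreement threshold is taken from the Supports note above, not from any deployment guidance in the paper.

```python
def max_byzantine_faults(n_peers: int) -> int:
    """Largest f such that n_peers >= 3f + 1 (oral-message bound)."""
    if n_peers < 1:
        raise ValueError("need at least one peer")
    return (n_peers - 1) // 3


def agreement_threshold(n_peers: int) -> int:
    """Number of agreeing peers (2f + 1) assumed by the note above."""
    return 2 * max_byzantine_faults(n_peers) + 1


if __name__ == "__main__":
    # Peer counts chosen for illustration only.
    for n in (3, 4, 7, 10):
        print(n, max_byzantine_faults(n), agreement_threshold(n))
```

With three peers the sketch tolerates no faulty peer, which matches the quoted remark that a single traitor can confound two loyal generals. Four peers is the smallest group that tolerates one.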

Reference 2

v1.1 · published December 2025

OWASP Agentic AI — Threats & Mitigations v1.1

No verbatim excerpt pulled — open the original to verify the cited section.

Supports: T13 "Rogue Agents" names "Consensus Mechanism Exploitation" as one of four rogue-agent scenarios. The explicit threat description covers an attacker who fabricates artificial consensus among compromised agents to approve harmful actions, or who manufactures artificial disagreement to paralyse a quorum. Multi-agent N-of-M consensus over independent peers is the named countermeasure — raising the attack cost from "compromise one agent" to "compromise the quorum majority". T12 "Agent Communication Poisoning" and T5 "Cascading Hallucination Attacks" are also named in OWASP v1.1 and are additional coverage targets of this control.

Does not prove: The OWASP T&M document is not openly machine-readable; verbatim extract not pulled. T13 is an OWASP v1.1 ID (T1–T17 range); the "Model Inconsistency" scenario sometimes labelled T48 in Helmwart is a Helmwart-internal renumber from OWASP MAS T16 and is not from OWASP T&M v1.1.

open original ↗

Reference 3

arXiv:2203.11171 · ICLR 2023

Wang et al. — "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (ICLR 2023)

Abstract

"It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths. Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer."

Supports: Empirical demonstration that sampling multiple diverse reasoning paths and selecting the most consistent final answer (marginalizing out the paths) improves accuracy on reasoning benchmarks. The self-consistency mechanism is the intra-agent analogue of inter-agent consensus: both use agreement across diverse reasoning traces as a reliability signal. The gains reported in the abstract on GSM8K (+17.9%), SVAMP (+11.0%), and AQuA (+12.2%) establish that diversity of paths materially improves correctness.

Does not prove: Self-consistency operates within a single language model (sampling from its own decoder); it does not directly address multi-agent coordination, Byzantine peers, or adversarial agents. Generalisation from intra-agent to inter-agent consensus is the Helmwart interpretation — the paper does not make this claim itself.

open original ↗
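The selection step described in the excerpt (keep the most consistent final answer across sampled reasoning paths) can be sketched as a plain majority vote. This is a minimal sketch, not the paper's implementation: the function name is hypothetical, and the hard-coded strings stand in for final answers that would be extracted from sampled model outputs.

```python
from collections import Counter


def majority_answer(final_answers):
    """Return the most frequent answer and the share of samples backing it.

    Ties are broken by first occurrence, which is how Counter.most_common
    orders equal counts; a production system would need an explicit rule.
    """
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(final_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(final_answers)


if __name__ == "__main__":
    # Placeholder answers, one per sampled reasoning path.
    samples = ["18", "18", "26", "18", "17"]
    print(majority_answer(samples))
```

The agreement share returned alongside the answer is the part that carries over to inter-agent consensus: a low share is a signal to withhold the action, not just a weaker vote.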

Reference 4

arXiv:2212.08073 · December 2022

Bai et al. — "Constitutional AI: Harmlessness from AI Feedback" (Anthropic, 2022)

Abstract — Constitutional AI method description

"Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making."

Supports: Establishes the model-as-verifier (critique-and-revision) pattern: a model critiques generated output against written principles and the output is revised accordingly. In the paper the critiques are generated by the model itself; using a separate critic agent is the adaptation made for this control. This is the lightweight, verifier-agent variant of the control, cheaper than a full N-of-M peer quorum and applicable where a dedicated "critic" agent independently evaluates primary-agent output before execution. The pattern uses AI feedback to enforce policy compliance without full human labelling.

Does not prove: The paper focuses on harmlessness training through supervised learning and reinforcement learning from AI feedback, not on multi-agent consensus or Byzantine-fault tolerance. The critic pattern is one-against-one (one critic reviews one output), not an N-of-M quorum. It does not address adversarial peer agents.

open original ↗

Reference 5

ATLAS catalogue (continuously updated)

MITRE ATLAS AML.M0029 — Human In-the-Loop for AI Agent Actions

AML.M0029 description — ATLAS catalogue YAML (raw.githubusercontent.com/mitre-atlas/atlas-data/refs/heads/main/dist/ATLAS.yaml)

"Systems should require the user or another human stakeholder to approve AI agent actions before the agent takes them. … Human In-the-Loop policies should follow the degree of consequence of the task at hand. Minor, repetitive tasks performed by agents accessing basic tools may only require minimal human oversight, while agents employed in systems with significant consequences may necessitate approval from multiple stakeholders diversified across multiple organizations."

Supports: Defines the human-review escalation path that a failed quorum should open onto. Explicitly calls for risk-proportionate oversight and "approval from multiple stakeholders diversified across multiple organizations" for high-consequence systems — the structural analogue of N-of-M consensus with organisational diversity.

Does not prove: Addresses human-in-the-loop, not automated peer-agent consensus. Does not specify how a consensus quorum fires or how peer diversity is achieved technically.

open original ↗
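The escalation path described above, where a failed quorum opens onto human review, can be sketched as follows. The source does not specify how a quorum fires, so everything here is an assumption made for illustration: the `Decision` enum, the `quorum_decision` function, and the rule that any shortfall in approvals escalates instead of rejecting outright.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    ESCALATE = "escalate_to_human"


def quorum_decision(votes: dict[str, bool], required: int) -> Decision:
    """Apply an N-of-M rule to peer votes.

    votes maps a peer identifier to True (approve) or False (reject).
    required is the N in N-of-M; M is the number of peers that voted.
    """
    if not 1 <= required <= len(votes):
        raise ValueError("required must be between 1 and the number of peers")
    approvals = sum(1 for approved in votes.values() if approved)
    # Anything short of the quorum is handed to a human reviewer.
    return Decision.EXECUTE if approvals >= required else Decision.ESCALATE


if __name__ == "__main__":
    # Hypothetical peer names and votes.
    votes = {"peer-a": True, "peer-b": True, "peer-c": False, "peer-d": False}
    print(quorum_decision(votes, required=3))
```

Escalating on a missed quorum, not retrying, keeps the high-consequence path aligned with the risk-proportionate oversight the excerpt calls for.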

Reference 6

ATLAS catalogue (continuously updated)

MITRE ATLAS AML.M0032 — Segmentation of AI Agent Components

AML.M0032 description — ATLAS catalogue YAML (raw.githubusercontent.com/mitre-atlas/atlas-data/refs/heads/main/dist/ATLAS.yaml)

"Define security boundaries around agentic tools and data sources with methods such as API access, container isolation, code execution sandboxing, and rate limiting of tool invocation. … This restricts untrusted processes or potential compromises from spreading throughout the system."

Supports: Segmentation is the structural prerequisite for meaningful consensus diversity: peers that share the same execution environment, memory, or retrieval corpus are not genuinely independent. Isolation between peer agents ensures that a compromise of one does not automatically corrupt the others, which is the condition under which N-of-M quorum retains its safety guarantees.

Does not prove: Does not address quorum logic or consensus algorithms. Segmentation is a necessary but not sufficient condition for consensus-based controls.

open original ↗
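The independence condition above (peers must not share an execution environment, memory, or retrieval corpus) can be checked mechanically before a quorum is trusted. The sketch below is hypothetical throughout: the `Peer` fields and the `shared_resources` helper are illustrative names, not part of ATLAS or any agent framework.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Peer:
    name: str
    runtime: str           # e.g. a container or sandbox identifier
    memory_store: str      # identifier of the peer's memory backend
    retrieval_corpus: str  # identifier of the corpus the peer retrieves from


def shared_resources(peers):
    """List (peer_a, peer_b, field) for every resource two peers share."""
    conflicts = []
    for a, b in combinations(peers, 2):
        for field in ("runtime", "memory_store", "retrieval_corpus"):
            if getattr(a, field) == getattr(b, field):
                conflicts.append((a.name, b.name, field))
    return conflicts


if __name__ == "__main__":
    # Illustrative peers: the two share one retrieval corpus.
    peers = [
        Peer("planner", "sandbox-1", "mem-1", "corpus-main"),
        Peer("checker", "sandbox-2", "mem-2", "corpus-main"),
    ]
    print(shared_resources(peers))
```

An empty result is a necessary condition only. As the entry notes, segmentation does not by itself provide quorum logic.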

Reference 7

Published July 2024

NIST AI 600-1 — Generative AI Profile (NIST AI RMF)

MEASURE 2.6 — "AI system is evaluated regularly for safety risks"

"The AI system to be deployed is demonstrated to be safe, its residual negative risk does not exceed the risk tolerance, and it can fail safely, particularly if made to operate beyond its knowledge limits."

Supports: Names fail-safe behaviour when a system operates beyond its knowledge limits as a deployment-evaluation requirement. Multi-agent consensus is one mechanism that implements this: when peer agents disagree, the system has evidence of operating at the boundary of its reliable knowledge and should fail safely (reject the action or escalate) rather than proceed on one agent's uncertain output.

Does not prove: MEASURE 2.6 does not explicitly name multi-agent consensus, N-of-M quorum, or ensemble verification. The MDX claim that NIST AI 600-1 names "ensemble-based verification under MAP-3.2" is not supported: MAP 3.2 does not exist in the document; MEASURE 3.2 (action MS-3.2-001) covers risk tracking for systems where measurement techniques are unavailable, not ensemble methods. The fail-safe framing in MEASURE 2.6 is the closest verifiable NIST citation for this control.

open original ↗