ASI09: Human-Agent Trust Exploitation

Definition

Adversaries exploit the tendency of humans to trust fluent, authoritative-sounding agents: an agent presents plausible justification for a harmful action, the human approves it, and the resulting audit trail reads as deliberate human authorisation. The attack surface is the review step itself: human-in-the-loop oversight becomes the vector when reviewers lack the context, time, or authority to challenge what the agent recommends.

What it means in practice

Human review is the safety net most agentic systems lean on for irreversible actions. ASI09 is what happens when that safety net itself becomes the attack surface. The agent presents a fluent, plausible explanation for a harmful action (citing internal terminology, referencing recent decisions, mimicking the voice of past approvers) and the human reviewer approves it because the justification looks correct.

A rubber-stamping review is worse than no review: it produces a paper trail that says someone authorised the action. Effective HITL requires reviewers with the domain knowledge to challenge the agent's reasoning, a queue volume they can actually scrutinise, and clear rules about which approvals must be dual-control.

Threat catalogue links

Base-catalog T-numbers follow OWASP source material; normalized MAS scenario entries are Helmwart editorial cross-references. Role colour-codes Helmwart's display weight: chips in the hero use the same scheme.

Primary: strongest pivot. Removing this T-number would gut the entry. Contributing: co-equal mechanism that combines with others to produce the ASI risk. Related: touches the entry but isn't its core; useful cross-reference.

T7 Misaligned and Deceptive Behaviours primary

Agents pursue goals via constraint bypass, deception, or evasion of oversight.
Open threat detail →
T8 Repudiation and Untraceability primary

Agent actions cannot be reliably traced, attributed, or reconstructed.
Open threat detail →
T10 Overwhelming Human-in-the-Loop (HITL) primary

Reviewers are saturated with intervention requests; decision fatigue and HII manipulation make oversight ineffective.
Open threat detail →
T23 Selective Log Manipulation contributing

Attacker selectively deletes log entries for fraudulent actions while leaving the rest intact.
Open threat detail →
T33 Blockchain Reorganisation Attack (Indirect) related

Blockchain reorg invalidates previously confirmed agent transactions, breaking downstream state assumptions.
Open threat detail →
T35 Manipulation of Proof of Sampling (PoSP) primary

Attacker falsifies PoSP verification data, undermining cryptographic sampling-based observability.
Open threat detail →
T44 Insufficient Logging in MCP Server / Client contributing

MCP request and tool-invocation logs are incomplete; forensic reconstruction not possible.
Open threat detail →
T48 Model Inconsistency Leading to Variable Approvals primary

Non-deterministic LLM produces inconsistent outcomes on identical inputs; one identical claim approved, the next flagged.
Open threat detail →

MITRE ATLAS technique

MITRE ATLAS catalogues adversary techniques against AI systems. The technique(s) below represent the red-team pivot for this entry: what an attacker is actually doing on the wire. Source: mitre-atlas/atlas-data v5.6.0.

AML.T0067 LLM Trusted Output Components Manipulation view on ATLAS ↗

Adversary manipulates the structured parts of an LLM response (citations, tool-call arguments, approved-action markup) that downstream systems treat as trusted.

Agentic angle: Structured outputs are exactly what agent frameworks parse to decide what to execute. Undermining the structure undermines every safety check downstream.

OWASP LLM Top 10 cross-references

From OWASP Appendix A (canonical inheritance)

LLM01:2025 Prompt Injection LLM05:2025 Improper Output Handling LLM06:2025 Excessive Agency LLM09:2025 Misinformation

Recommended mitigations

No single control answers an ASI; it is met by a layered stack. The cards below are ranked by how directly each control counters ASI09: the chips on each card name the threat of this ASI it actually covers, colour-coded by that threat's role.

Counters the core

Cover one or more of this ASI's primary threats — the strongest direct response.

Separation of actor and recorder — different identities for action and audit Tier 2

T8T35T23T44

An agent that writes its own audit log can omit, alter, or suppress any record of its own actions. This is not a theoretical risk: an attacker who controls the acting identity controls the evidence. Actor/recorder separation is the structural fix. The identity that performs an action and the identity that records it are different principals, with non-overlapping permissions, so no single compromise can both execute and erase.

Behavioural anomaly isolation — automatic quarantine on observable drift Tier 2

T7T35

An agent that has been compromised, poisoned, or gone rogue will, in most cases, behave differently from its established baseline. Anomaly isolation acts on that difference: when an agent's behaviour score crosses a configured threshold, it is quarantined automatically, credentials revoked, message-queue access cut, in-flight actions aborted. Manual revocation cannot match the speed that cascading multi-agent failures demand.

Cross-system scope auditing — continuous permission reconciliation Tier 2

T48T44

An agent that operates across HR, Finance, cloud, and SaaS systems accumulates permissions at each boundary, often without any single team seeing the combined picture. Privilege accumulates silently across those boundaries until a quarterly review finds it, by which point a compromised or misconfigured agent has had weeks of unchecked reach. Cross-system scope auditing prevents that by continuously reconciling the agent's actual entitlements against a declared baseline across every system it touches and raising a ticket the moment drift is detected.

Output egress DLP — inspection gate for PII, secrets, and IP at the agent boundary Tier 2

T8T23

An agent produces output continuously across multiple channels: user-facing responses, tool-call parameter envelopes, log records, and outbound HTTP requests. Any of those channels can carry sensitive content the agent has retrieved, been fed, or been tricked into including. Output egress DLP places an inspection gate at the boundary so that PII, credentials, and proprietary content are classified and either redacted or quarantined before they leave the trust boundary, regardless of how they got into the output.

Adaptive workload balancing — distribute reviews by measured reviewer fatigue Tier 2

T10

Human reviewers make more errors as cognitive load accumulates over a shift. An adversary who floods a HITL gate, or a system that simply generates high output volume, exploits that degradation without bypassing the gate at all. Adaptive workload balancing addresses this by treating reviewer fatigue as a live routing input: each incoming review is assigned to the reviewer with the lowest current fatigue score, mandatory breaks are enforced before a reviewer's error rate climbs further, and items are held rather than assigned to any reviewer above the break threshold.

AI-source disclosure UI — visible AI labelling at the point of action Tier 2

T10

When an AI agent generates content or proposes an action, users need to know that the source is an AI before they decide to act. Without that signal, users routinely over-trust agent output. AI-source disclosure addresses this by attaching a visible label to every AI-generated item and by requiring explicit confirmation for consequential actions, restoring the critical gap between receipt and acceptance.

Behavioural divergence monitoring — longitudinal drift from declared role Tier 2

An agent's behaviour can shift gradually over time: tool-selection patterns change, refusal rates drop, output style drifts. No single interaction reveals it, and a single-shot evaluation cannot catch a trend that spans weeks. Behavioural divergence monitoring detects that drift by comparing per-window statistical distributions of observable agent signals against a declared baseline, and alerting when the gap exceeds a threshold.

Behavioural red-teaming — adversarial evaluation of agent reasoning and tool use Tier 2

An agent exposes more attack surface than a static model: it reasons, plans, selects tools, and acts across multiple turns. Static analysis can characterise that surface, and runtime guardrails can block known-bad patterns, but neither can predict what the agent will do under attacker pressure it has never seen. Behavioural red-teaming addresses that gap through structured adversarial evaluation: probing the agent's reasoning, planning, and tool-use paths with attack strategies before each release.

Data classification with tool-access allow-lists — a sensitivity label on every dataset, enforced at every access seam Tier 2

Every dataset, document, and external system an agent can reach carries a classification label. The agent's permitted-class set and the tool's permitted-class set are intersected at the moment of every read or write. When the requested data's class falls outside that intersection, access is denied at the seam. This is the data-side complement to least-privilege: it adds a data-sensitivity constraint that role scoping alone does not provide.

Fail-closed gate — refuse rather than act on uncertain output Tier 2

T48

An agent that is uncertain about what to do next faces a choice: refuse and ask for clarification, or proceed on its best guess. In low-stakes situations that tradeoff is tolerable. In agentic systems that write, delete, or send, a confident-sounding but wrong output can commit an irreversible action. A fail-closed gate resolves that choice structurally: below a configured confidence threshold, the agent stops and escalates rather than guessing.

HITL feedback-loop calibration — reviewer overrides fed back into agent tuning Tier 2

T10

An agent at a human-in-the-loop gate will be overridden when its decisions do not match the reviewer's judgment. Without a return path, those corrections are discarded: the same miscalibration surfaces again in the next review cycle and the one after that. A feedback loop closes that gap by capturing each override event as a structured record, accumulating those records into a calibration dataset, and using patterns in that dataset to drive targeted changes to the agent's system prompt, tool-scope policy, or divergence-monitor thresholds. A well-calibrated agent produces fewer out-of-distribution decisions, so the review queue contracts over time.

Insider threat program — personnel security for operators of high-privilege agentic systems Tier 2

Privileged-access personnel are the human layer behind every agentic system. A person with legitimate administrative credentials can tamper with logs, manipulate approval gates, or extract training data through authorised channels, and no technical control prevents it when the access itself is valid. An insider threat program addresses that gap: it governs who holds operator access, what they agree to, how quickly credentials are revoked on departure, and whether anomalous behaviour is surfaced before damage accumulates.

Kill switch: human authority to halt one agent, a class, or the entire deployment Tier 2

Agentic systems can act faster than a human can intervene through normal channels. A kill switch is the operational guarantee that a named human role can stop agent activity at any scope (single instance, class, or global) through a documented runbook, without requiring a code change or redeployment, and with every invocation written to an audit trail.

Legal hold and WORM retention — immutable audit storage that survives a compromised recorder Tier 2

An audit trail is only useful if its records cannot be altered after the fact. Without a storage-layer enforcement mechanism, a sufficiently privileged attacker (or a compromised recorder identity) can overwrite or delete the records that document what happened. Legal hold and WORM retention solve this by placing audit records in storage that the provider itself enforces as immutable: no user, including account root, can modify or delete a locked object within the retention window. Legal hold extends that protection indefinitely for active incidents, lifted only through an out-of-band authority outside the normal operations team.

Out-of-band verification — independent-channel confirmation for irreversible agent actions Tier 2

T48

An agent that can propose payments, update banking details, or modify production configuration is, by construction, a manipulation surface. If the only thing standing between a proposed change and its execution is the agent's own UI, a successful prompt injection or RAG poisoning attack requires no additional steps. Out-of-band verification breaks that dependency by routing a one-use confirmation code through a channel that is structurally separate from the agent's primary interaction channel, so an attacker who controls the agent's context cannot complete the approval without also compromising the user's registered secondary device.

Output moderation gates — independent moderation pass before emission Tier 2

An AI agent can produce output that is harmful, deceptive, or factually wrong while still sounding fluent and confident. Output moderation places an independent classifier or moderation model between the agent and its destination, checking every output before it reaches a user or a downstream system. The generating model does not evaluate its own answer; a separate gate does.

Output provenance tracking — record the source of every claim an agent makes Tier 2

When an agent produces a claim derived from retrieved data, that claim needs a record of where it came from: the source document, version, and retrieval time. Without that record, a downstream verifier cannot distinguish a well-grounded output from a fabricated one, a tampered one, or a poisoned one. Provenance tracking attaches source attribution to every claim, carries it through each transformation in the pipeline, and surfaces it in audit logs and user-facing interfaces.

Per-agent trust scoring — behavioural reputation for inter-agent message acceptance Tier 2

In a multi-agent system, each agent routes decisions based on what its peers report. If a peer's behaviour becomes unreliable or adversarial, agents that keep treating it with full authority will propagate whatever errors or manipulations that peer introduces. Per-agent trust scoring addresses this by maintaining a continuously updated reputation score for every peer, derived from observed behaviour, and using that score to determine how much authority each incoming message carries.

Reviewer decision summaries — independent rationale at HITL gate Tier 2

T10

When an agent decision reaches a human reviewer, the reviewer must reconstruct the agent's reasoning from raw traces before they can form a judgment. OWASP T10 names this reconstruction burden as the mechanism behind reviewer fatigue and oversight failures. A decision summary addresses the problem by inserting an independent model call between the agent's output and the reviewer: that call compresses the decision, evidence chain, and risk factors into a fixed-format card, reducing the per-review cognitive load without removing the human from the decision.

Risk-prioritised review queue — match reviewer attention to consequence Tier 2

T10

A human-in-the-loop review system saturates not from absolute decision volume but from undifferentiated volume: every item lands at the same priority, so reviewers cannot distinguish an irreversible high-consequence action from a routine low-stakes one. A risk-prioritised queue fixes this by scoring each decision before it enters the queue and routing it to the tier that matches its risk level, concentrating human attention where the cost of an error is highest.

Sigstore signing — cryptographic provenance for agent artifacts and audit records Tier 1

An agent is composed of artifacts produced at different times by different identities: model weights, prompt templates, tool descriptors, MCP server binaries, and audit-log batches. Any of those artifacts can be substituted or tampered with between the moment they are built and the moment they are loaded. Sigstore addresses this by signing each artifact at build time using a short-lived certificate tied to the workload identity that produced it, recording the signature in an append-only public transparency log, and requiring verification against that log before the artifact is loaded or executed.

Broader coverage — 1 control that address contributing or related threats

Blockchain transaction guard — pre-commit safety checks for every agent-initiated transaction Tier 2

A blockchain transaction, once committed, cannot be undone. An agent that signs and broadcasts a transaction without an enforcement layer before it can exceed its authorised value, call a contract it was never provisioned to reach, or drain a wallet in a runaway loop, and by then the funds are gone. A transaction guard intercepts each proposed transaction before signing, checks it against value bounds, a contract allowlist, a gas or compute-unit limit, and a replay-protection nonce, and refuses to sign anything that falls outside declared policy.

OWASP Top 10 for Agentic Applications 2026 (canonical source) ↗ · OWASP Gen AI Security Project · Dec 2025 · CC BY-SA 4.0
Agentic Top 10 side-by-side explainer ↗ · trydeepteam.com · secondary reference