08 · PRINCIPLES

Security design principles for agentic AI

The principles that hold even as threats change from year to year, re-read for autonomous agents. Each full write-up covers why agents change the principle, worked scenarios, the threats it governs, and the controls that advance it.

A CENTRAL ENFORCEMENT PRINCIPLE

Enforcement must live outside the model. An LLM is probabilistic: it can be prompted around, injected, or silently changed by a model upgrade, and it can't tell instructions from data. Every hard control belongs in the orchestrator, the policy engine, the tool gateway, and the infrastructure: deterministic code the agent cannot reason its way past. Model-layer safeguards can supplement those controls, but should not be the only gate.
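A minimal sketch of that split, assuming a tool gateway that receives each tool call the model proposes as plain data and decides in ordinary code whether it may run. `ToolCall`, `POLICY`, `PolicyViolation`, `enforce`, the tool names, and the `/workspace/` root are all hypothetical illustrations, not Helmwart's actual gateway.

```python
import posixpath
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ToolCall:
    """A tool call proposed by the model. Treated as untrusted data, never as a command."""
    tool: str
    args: Dict[str, Any] = field(default_factory=dict)


def _inside_workspace(path: Any) -> bool:
    # Normalise first so "/workspace/../etc/passwd" cannot slip past a prefix check.
    return isinstance(path, str) and posixpath.normpath(path).startswith("/workspace/")


# Hypothetical policy: tool name -> {argument name: validator}.
# Any tool or argument not listed here is denied by default.
POLICY: Dict[str, Dict[str, Callable[[Any], bool]]] = {
    "read_file": {"path": _inside_workspace},
    "search": {"query": lambda q: isinstance(q, str) and 0 < len(q) <= 200},
}


class PolicyViolation(Exception):
    """Raised by the gateway; the orchestrator refuses to run the call."""


def enforce(call: ToolCall) -> ToolCall:
    """Deterministic gate that runs outside the model before any tool executes."""
    rules = POLICY.get(call.tool)
    if rules is None:
        raise PolicyViolation(f"tool not allow-listed: {call.tool}")
    if set(call.args) != set(rules):
        raise PolicyViolation(f"unexpected or missing arguments for {call.tool}")
    for name, is_valid in rules.items():
        if not is_valid(call.args[name]):
            raise PolicyViolation(f"argument rejected: {call.tool}.{name}")
    return call
```

The model's own opinion of whether a call is safe never enters `enforce`. An injected or upgraded model can propose anything it likes; only calls that match the allow-list reach a tool.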

What Helmwart actually enforces · 40 principles

Honest accounting: most principles are documented here but not yet wired into the threat model, so this list doubles as the backlog. Each principle is tagged with its status: Enforced, Partial, or Reference.

A · Access & trust · 6 principles

Who (or what) is allowed to do what, verified every time.

  • Zero Trust · Enforced

    Never trust, always verify. Grant no implicit trust from network location or prior authentication; authenticate and authorise every request, every time, against current context.

  • Least Privilege · Enforced

    Every program and user operates with the minimum privileges needed for the job, and nothing more.

  • Default / Implicit Deny · Reference

    Base access on explicit permission, not exclusion. The default is denial; you allow-list the exceptions (“fail-safe defaults”).

  • Continuous Verification · Partial

    Trust state is re-evaluated continuously, not cached from login.

  • Attack Surface Minimization · Reference

    Reduce the number of pathways an attacker can use. Remove every unnecessary tool, interface, and capability.

  • Microsegmentation · Partial

    Divide the system into fine-grained, independently authorised zones so that a compromise in one cannot reach the others (east-west control, not just a perimeter).
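A minimal sketch of how several of these principles meet in one per-request check, assuming each agent holds a short-lived grant listing explicit `action:resource` scopes. `Grant`, `authorize`, and the scope names are hypothetical. Nothing is allowed unless it is listed (default deny, least privilege), each resource is its own scope (a coarse stand-in for microsegmentation), and expiry and revocation are checked on every call instead of being cached (zero trust, continuous verification).

```python
import time
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class Grant:
    """A short-lived, narrowly scoped credential (hypothetical shape)."""
    principal: str
    scopes: FrozenSet[str]   # explicit "action:resource" pairs; nothing else is allowed
    expires_at: float        # Unix timestamp
    revoked: bool = False


def authorize(grant: Grant, action: str, resource: str, now: Optional[float] = None) -> bool:
    """Evaluate one request against the grant's current state. Deny by default."""
    now = time.time() if now is None else now
    if grant.revoked or now >= grant.expires_at:
        return False  # revocation and expiry take effect on the very next request
    return f"{action}:{resource}" in grant.scopes


# Usage
grant = Grant("agent:invoice-bot", frozenset({"read:invoices"}), expires_at=time.time() + 300)
authorize(grant, "read", "invoices")   # True
authorize(grant, "read", "payroll")    # False: a different segment, never granted
grant.revoked = True
authorize(grant, "read", "invoices")   # False: no trust carried over from the earlier check
```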

B · Resilience & failure · 6 principles

Assume something breaks: contain it and recover.

C · Agentic autonomy & control · 8 principles

No classical analogue. The actor is autonomous, tool-using, probabilistic.

  • Least Agency / Minimal Autonomy · Partial

    Give an agent no more authority to decide and act than the task needs; prefer suggesting an action over taking it. Treat every increase in autonomy as a liability you have to justify.

  • Human Oversight (HITL / HOTL) · Enforced

    Keep meaningful human control over consequential actions: a blocking checkpoint before execution (human-in-the-loop, HITL) or live monitoring with authority to interrupt (human-on-the-loop, HOTL). “Meaningful” rules out rubber-stamp dialogs and approvals that time out to “allow.”

  • Safe Interruptibility / Corrigibility · Reference

    An agent must always be stoppable and correctable by its operators, and must not learn to resist, evade, or plan around shutdown.

  • Sandboxing & Isolation · Partial

    Every action runs in a constrained, revocable environment that limits what it can read, write, reach, or spend. This is the physical enforcement of least agency.

  • Constrained Generation & Deterministic Guardrails · Reference

    Place hard controls outside the probabilistic model (schema validation, allow-lists, policy engines) and treat model output as raw material to verify before anything acts on it. “The LLM said it’s safe” is never sufficient.

  • Reversibility / Dry-run / Hold periods · Partial

    Make actions undoable where possible; preview irreversible ones before they execute; expand capability in stages; insert a delay before high-value irreversible actions so a human can cancel.

  • Rate-limiting / Budgets / Loop prevention · Reference

    Operate within hard, externally enforced ceilings on time, cost, tokens, tool-call frequency, and recursion depth. A looping agent cannot be trusted to stop itself.

  • Kill-switch / Circuit-breaker · Reference

    A layered emergency stop: an external kill switch that halts an agent immediately, circuit breakers that auto-trip on bad patterns, and graceful degradation that keeps unaffected capability running. All components are architecturally external to the agent.
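A minimal sketch of budgets, loop prevention, a circuit breaker, and a kill switch living in the orchestrator rather than in the agent. `RunGuard`, `Halt`, and every limit are hypothetical. The breaker trips on one simple bad pattern (consecutive tool failures); real deployments would watch for more patterns, and graceful degradation is not shown.

```python
import threading


class Halt(Exception):
    """Raised by the orchestrator to stop the run."""


class RunGuard:
    """Hard ceilings held by the orchestrator, outside the agent (hypothetical design)."""

    def __init__(self, max_tool_calls: int, max_cost_usd: float,
                 max_depth: int, failure_threshold: int) -> None:
        self.max_tool_calls = max_tool_calls
        self.max_cost_usd = max_cost_usd
        self.max_depth = max_depth
        self.failure_threshold = failure_threshold
        self.tool_calls = 0
        self.cost_usd = 0.0
        self.consecutive_failures = 0
        # Kill switch: set by an operator (or another thread); never exposed to the agent.
        self.killed = threading.Event()

    def before_call(self, depth: int, est_cost_usd: float) -> None:
        """Check every ceiling before a tool call runs; charge the budget only if all pass."""
        if self.killed.is_set():
            raise Halt("kill switch engaged")
        if self.consecutive_failures >= self.failure_threshold:
            raise Halt("circuit breaker open")
        if depth > self.max_depth:
            raise Halt("recursion depth exceeded")
        if self.tool_calls >= self.max_tool_calls:
            raise Halt("tool-call budget exhausted")
        if self.cost_usd + est_cost_usd > self.max_cost_usd:
            raise Halt("cost budget exhausted")
        self.tool_calls += 1
        self.cost_usd += est_cost_usd

    def after_call(self, ok: bool) -> None:
        """Trip the breaker after too many consecutive failures."""
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
```

The agent never sees or adjusts these counters; it only experiences a refused call, so no amount of reasoning inside the loop can raise a ceiling.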
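A minimal sketch of a hold period combined with a blocking human checkpoint, assuming the stricter reading that a high-value irreversible action needs explicit approval and then waits out a delay during which it can still be cancelled. `HeldAction`, `Decision`, and the timings are hypothetical. The key property is that a pending action never defaults to “allow,” however long nobody responds.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    CANCELLED = "cancelled"


@dataclass
class HeldAction:
    """An irreversible action parked until a human approves it and a hold period passes."""
    description: str
    hold_seconds: float
    decision: Decision = Decision.PENDING
    approved_at: Optional[float] = None

    def approve(self, now: float) -> None:
        # Only a pending action can be approved; a cancellation is final.
        if self.decision is Decision.PENDING:
            self.decision = Decision.APPROVED
            self.approved_at = now

    def cancel(self) -> None:
        # A human can still cancel during the hold period, even after approving.
        self.decision = Decision.CANCELLED

    def may_execute(self, now: float) -> bool:
        # Silence never counts as consent: a PENDING action stays blocked indefinitely.
        return (self.decision is Decision.APPROVED
                and self.approved_at is not None
                and now >= self.approved_at + self.hold_seconds)
```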

D · Data, identity & trust · 9 principles

The context window is a flat string with no hardware trust boundary.

  • Provenance & Trust-tagging · Partial

    Every piece of text in the context (system prompt, user message, retrieved document, tool result, sub-agent reply) has a trust level. Track where it came from and tag it so instructions from low-trust sources are not obeyed.

  • Confused-Deputy Prevention · Reference

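    Stop the agent, a legitimately privileged “deputy,” from being tricked into wielding its authority on an attacker’s behalf. The classical remedy is capability-based security, which ties authority to the specific request instead of to the deputy’s ambient privileges.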

  • Input/Output Validation · Partial

    Treat all input and all output as potentially attacker-controlled. Validate inbound text before it influences the agent, and validate outbound text before it touches any downstream system.

  • The Lethal Trifecta · Enforced

    A design heuristic: an agent that simultaneously has (1) access to private data, (2) exposure to untrusted content, and (3) the ability to communicate externally creates a direct exfiltration path if attacker-controlled content successfully drives action.

  • Memory & RAG Integrity · Reference

    Protect the agent’s persistent memory and retrieval corpora from poisoning. Only authenticated, validated content is written; stored content is tamper-evident; retrieved content is trusted according to its provenance.

  • Observability / Non-repudiation · Partial

    Every agent decision is recorded, attributable, tamper-evident, and undeniable: enough to reconstruct what happened, under whose authority, and prove it.

  • Supply-chain Security · Reference

    Verify the provenance and integrity of every external component (models, MCP servers, tools, plugins, agent cards, frameworks, data sources) before integrating, and continuously after.

  • Data Minimization & Privacy · Partial

    Access, process, retain, and transmit only the minimum data the current task needs. No “just in case.”

  • Agent-as-principal Identity · Partial

    Every agent has a unique, cryptographically verifiable non-human identity (not a shared key, not the user’s credentials), and delegation from human to agent is explicit, scoped, and carries the chain of intent.
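A minimal sketch of how provenance tags can feed a deterministic lethal-trifecta gate. The orchestrator tags each context segment by where it came from; once a session has both touched private data and ingested untrusted content, any tool that communicates externally is refused. `Trust`, `Segment`, `Session`, `allow_tool`, and the three-level trust scale are hypothetical, and tainting the whole session is deliberately coarse.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class Trust(IntEnum):
    UNTRUSTED = 0  # retrieved documents, web pages, tool results, sub-agent replies
    USER = 1       # the authenticated user's own messages
    SYSTEM = 2     # operator-authored system prompt


@dataclass(frozen=True)
class Segment:
    source: str
    trust: Trust
    text: str


@dataclass
class Session:
    segments: List[Segment] = field(default_factory=list)
    read_private_data: bool = False

    def add(self, source: str, trust: Trust, text: str) -> None:
        # The orchestrator assigns the tag from where the text came from,
        # never from what the text claims about itself.
        self.segments.append(Segment(source, trust, text))

    @property
    def exposed_to_untrusted(self) -> bool:
        return any(s.trust == Trust.UNTRUSTED for s in self.segments)


def allow_tool(session: Session, reads_private: bool, communicates_externally: bool) -> bool:
    """Deny any call that would give one session all three legs of the lethal trifecta."""
    private = session.read_private_data or reads_private
    if private and session.exposed_to_untrusted and communicates_externally:
        return False
    session.read_private_data = private
    return True
```

The gate does not try to decide whether the untrusted text actually contains an injection; it removes the exfiltration path so that it does not matter.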
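A minimal sketch of the tamper-evident part of observability: a hash-chained, append-only audit log where each entry records the acting agent and the human it acts for. `AuditLog` and its record fields are hypothetical. Hash chaining gives tamper evidence only; full non-repudiation also needs per-agent signing keys, and truncating the tail goes unnoticed unless the latest hash is anchored somewhere the agent cannot write.

```python
import hashlib
import json
from typing import Any, Dict, List, Tuple

GENESIS = "0" * 64


def _digest(prev_hash: str, record: Dict[str, Any]) -> str:
    # Canonical JSON so the same record always hashes the same way.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log; editing or removing any entry before the tail breaks verify()."""

    def __init__(self) -> None:
        self.entries: List[Tuple[Dict[str, Any], str]] = []

    def append(self, principal: str, on_behalf_of: str, action: str, decision: str) -> str:
        prev = self.entries[-1][1] if self.entries else GENESIS
        record = {"seq": len(self.entries), "principal": principal,
                  "on_behalf_of": on_behalf_of, "action": action, "decision": decision}
        entry_hash = _digest(prev, record)
        self.entries.append((record, entry_hash))
        return entry_hash

    def verify(self) -> bool:
        prev = GENESIS
        for record, entry_hash in self.entries:
            if _digest(prev, record) != entry_hash:
                return False
            prev = entry_hash
        return True
```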
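A minimal sketch of integrity pinning for external components: every artifact (an MCP server package, a plugin, a model file) must match a SHA-256 digest recorded when it was reviewed, and anything not on the pin list is rejected. `PINNED`, `verify_artifact`, and the artifact name are hypothetical, and the pinned value is computed inline only so the sketch runs; in practice it is a literal digest. Re-running the check on every load, not just at install, covers the “continuously after” half of the principle. Provenance (who built the artifact) needs signatures, which this sketch does not cover.

```python
import hashlib
import hmac
from typing import Dict

# Hypothetical pin list: artifact name -> SHA-256 hex digest recorded at review time.
# Computed inline here as a stand-in; a real pin file stores the literal digest.
PINNED: Dict[str, str] = {
    "example-mcp-server-1.2.0.tar.gz": hashlib.sha256(b"reviewed artifact bytes").hexdigest(),
}


def verify_artifact(name: str, data: bytes) -> bool:
    """Accept an artifact only if it is pinned and its bytes match the pinned digest."""
    expected = PINNED.get(name)
    if expected is None:
        return False  # unknown components are rejected, not trusted by default
    actual = hashlib.sha256(data).hexdigest()
    return hmac.compare_digest(actual, expected)
```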

E · Secure-design classics · 6 principles

Saltzer & Schroeder (1975), re-read for agents.

F · Governance & safety · 5 principles

Security-relevant slices of the AI-governance frameworks.