MITIGATION · m-graceful-degrade

Graceful degradation — fail closed where it matters, fail open where it's safe

An agent that encounters a quota trip, a dependency failure, or a timeout faces a choice: continue at reduced quality, or refuse. Getting that choice wrong is the core operational failure. Graceful degradation requires the answer to be declared before the incident, not improvised during it: write-authority paths fail closed and return a refusal; read-only paths fail open and disclose the degraded state explicitly.

Last reviewed 2026-05-12 · Status: published

At a glance

MATURITY

Tier 2

Available off-the-shelf or as a documented pattern, but newer or less broadly proven. Expect integration work and some operational nuance.

PLACES ON

node

Restricted to node kinds: agent

COVERAGE

2 threats

T4 · T25

TRADE-OFFS

LAT

low

COST

low

UX

medium

DEV

medium

Latency · cost · UX friction · dev effort.

TL;DR

Declare the failure mode for every action class before deployment: write-authority paths fail closed and return a structured refusal; read-only paths fail open with an explicit disclosure that the result is degraded.
A quota trip or circuit-open event is a decision point, not a silent continuation. The agent must refuse or disclose; continuing without acknowledgment is the canonical operational mistake.
The central work is the per-action-class taxonomy. If the team cannot agree whether an action is write-authority or advisory, default to fail-closed until the classification is settled.
Graceful degradation is only as reliable as the failure signals that trigger it. Pair with rate quotas and longitudinal monitoring to detect sub-threshold resource exhaustion that the circuit breaker cannot see.

How it behaves

Agent action fails or quota trips (timeout, dependency error, rate limit exceeded)

Look up the pre-declared failure mode for this action class

Read-only or advisory path: fail open, return a partial or degraded result with explicit disclosure

Write-authority path: fail closed, refuse the action, log the denial, do not guess

The failure mode must be declared at design time, not decided at runtime. An undeclared action class defaults to fail-closed.
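The lookup step can be sketched in a few lines of Python. The action-class names and the `FAILURE_MODES` table are hypothetical, chosen only for illustration; the point being shown is that an action class missing from the table resolves to fail-closed.

```python
from enum import Enum


class FailureMode(Enum):
    FAIL_CLOSED = "fail_closed"  # refuse, log the denial, do not guess
    FAIL_OPEN = "fail_open"      # return a degraded result with disclosure


# Hypothetical design-time declarations, one entry per action class.
FAILURE_MODES = {
    "search_docs": FailureMode.FAIL_OPEN,
    "send_email": FailureMode.FAIL_CLOSED,
    "update_record": FailureMode.FAIL_CLOSED,
}


def declared_mode(action_class: str) -> FailureMode:
    """Return the pre-declared mode; undeclared classes fail closed."""
    return FAILURE_MODES.get(action_class, FailureMode.FAIL_CLOSED)
```

Because the default lives in the lookup itself, a tool added without a declaration is refused on failure rather than silently inheriting fail-open behavior.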

What it is

Graceful degradation is the practice of declaring, at design time, exactly what an agent does when a failure condition is reached: a quota is exhausted, a dependency returns an error, a circuit breaker opens, or a timeout elapses. The declaration is not a runtime decision; it is a pre-committed policy, indexed by action class, that determines whether the agent fails open or fails closed.

Failing open means the agent continues and returns a partial or reduced result, with an explicit disclosure that the result is degraded. Failing closed means the agent refuses the action entirely, logs the denial, and returns a structured error to the caller. The choice between the two is not arbitrary: write-authority paths, those that modify state, send external communication, or commit irreversible actions, must fail closed. Read-only or advisory paths may fail open, provided the degraded state is disclosed rather than concealed.
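One way to make the classification rule mechanical is to derive the required mode from a few declared traits of each action class. The `ActionTraits` type and its field names below are assumptions made for this sketch; a trait left as `None` stands for "the team has not agreed yet" and is treated as fail-closed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActionTraits:
    """Hypothetical description of an action class; None means 'not yet agreed'."""
    modifies_state: Optional[bool]
    sends_external_communication: Optional[bool]
    irreversible: Optional[bool]


def required_failure_mode(traits: ActionTraits) -> str:
    flags = (
        traits.modifies_state,
        traits.sends_external_communication,
        traits.irreversible,
    )
    # Any write-authority trait, or any unsettled trait, forces fail-closed.
    if any(flag is None or flag for flag in flags):
        return "fail_closed"
    # Purely read-only or advisory: may fail open, but must disclose.
    return "fail_open_with_disclosure"
```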

The reason the declaration must precede the incident is that an agent making a runtime judgment about its own failure mode is itself a failure mode. An attacker who can force a resource-exhaustion condition can also force the agent into whatever ad-hoc behavior its undeclared failure path produces. A pre-declared, externally enforced policy removes that surface.

Pair with m-rate-quota for the rate-limit primitives that generate the failure signals this control responds to, with m-fail-closed for the confidence-gate variant, and with m-divergence-monitor so that degradation events are tracked longitudinally.

Detection signals

Suspension events per agent per hour. A rising rate means failure conditions are being hit repeatedly, which points to a quota misconfiguration, an upstream dependency under stress, or a deliberate resource-exhaustion attempt.
Fail-open versus fail-closed event ratio. A ratio that drifts toward fail-open on action classes that should be fail-closed means the policy matrix is incomplete or a new tool was added without a declared failure mode.
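Both signals can be computed from a stream of degradation events. The sketch below assumes a hypothetical `DegradationEvent` log record and counts every recorded event as a suspension; a real deployment would adapt the fields to whatever its logging pipeline emits.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DegradationEvent:
    """Hypothetical log record emitted each time a failure mode fires."""
    agent_id: str
    action_class: str
    outcome: str  # "fail_open" or "fail_closed"
    timestamp: datetime


def events_per_agent_hour(events):
    """Count events per (agent, hour bucket)."""
    counts = Counter()
    for e in events:
        hour = e.timestamp.replace(minute=0, second=0, microsecond=0)
        counts[(e.agent_id, hour)] += 1
    return counts


def fail_open_ratio(events):
    """Fraction of events per action class that failed open."""
    total, opened = Counter(), Counter()
    for e in events:
        total[e.action_class] += 1
        if e.outcome == "fail_open":
            opened[e.action_class] += 1
    return {cls: opened[cls] / total[cls] for cls in total}
```

Any non-zero ratio on an action class that is declared fail-closed is worth an alert, since it suggests the enforcement path and the declared policy have diverged.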

Threats it covers

T4 Resource Overload −1 severity step

WHY IT HELPS Resource Overload is the deliberate exhaustion of an agent's compute, memory, or budget, forcing it into a degraded operating state. Whether that overload becomes a security incident depends entirely on how the agent responds when its resources are exhausted: fail-open silence allows it to continue acting on write-authority paths without the capacity to do so safely, while a pre-declared fail-closed response stops the action and logs the denial.
T25 Workflow Disruption via Dependency Exploitation −1 severity step

WHY IT HELPS Workflow Disruption via Dependency Exploitation occurs when an attacker degrades or disables a dependency that a pipeline agent relies on, causing downstream agents to block, retry indefinitely, or propagate corrupt state. A pre-declared fail-closed response with a circuit breaker prevents the degraded dependency from stalling the entire pipeline: the affected agent halts and returns a structured refusal rather than blocking indefinitely.
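The "refuse rather than block indefinitely" behavior can be illustrated with a deadline around the dependency call. This is a minimal sketch using only the Python standard library; `call_with_deadline`, `slow_dependency`, and the response fields are hypothetical names. Note that the worker thread is abandoned on timeout, not cancelled, so this bounds the caller's wait but not the dependency's work.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_deadline(fn, timeout_s: float) -> dict:
    """Run fn with a hard deadline; return a structured refusal on timeout or error."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return {"status": "ok", "result": future.result(timeout=timeout_s)}
    except FutureTimeout:
        return {"status": "refused", "reason": "dependency_timeout"}
    except Exception as exc:  # dependency raised instead of hanging
        return {"status": "refused", "reason": f"dependency_error: {type(exc).__name__}"}
    finally:
        # Do not wait for a hung worker; the caller gets its answer now.
        pool.shutdown(wait=False)


def slow_dependency():
    """Stand-in for a degraded upstream call (hypothetical)."""
    time.sleep(0.3)
    return "late answer"
```

Calling `call_with_deadline(slow_dependency, 0.05)` returns a refusal with reason `dependency_timeout` instead of stalling the pipeline for the full duration of the slow call.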

Principle coverage

Defence-in-Depth stage: Respond — and it advances:

Defence-in-Depth Graceful degradation is the failure-path layer of defence-in-depth: when rate limits, isolation, and monitoring have all been reached or bypassed, a pre-declared fail-closed response stops the action at the last enforcement point rather than allowing it to proceed under degraded conditions.
Assume Breach Assume Breach requires planning for an agent that is already operating under resource exhaustion or dependency compromise. Graceful degradation is the operational response to that scenario: a pre-declared failure mode ensures the agent refuses write-authority actions and discloses degraded state on read-only paths, regardless of whether the exhaustion is accidental or deliberate.
Fail Securely (fail-closed) Fail-securely requires that when a control cannot operate normally, the outcome is the safe state rather than an uncontrolled permissive default. A pre-declared fail-closed policy for write-authority paths is the direct implementation of that principle: the absence of capacity to act safely produces a structured refusal, not a partial action.
Resilience & Recovery Graceful degradation is the immediate containment step in a resilience and recovery sequence: a pre-declared failure mode stops the agent from accumulating further damage during a resource exhaustion or dependency failure event, creating a stable halted state from which recovery and forensics can begin.
Safe Interruptibility / Corrigibility Safe Interruptibility requires that an agent can be stopped reliably without producing harmful side effects. A pre-declared fail-closed response on write-authority paths implements that requirement at the failure boundary: when a resource or dependency condition is reached, the agent stops the action cleanly rather than attempting a partial write.
Rate-limiting / Budgets / Loop prevention Graceful degradation is the response layer that rate limiting triggers: when a rate limit or quota is reached, the pre-declared failure mode determines whether the agent refuses the action or discloses a degraded result, turning the rate-limit signal into a defined, auditable outcome rather than an uncontrolled one.
Kill-switch / Circuit-breaker Graceful degradation extends the kill-switch principle to automated failure conditions: where a kill switch requires a human to initiate a halt, a pre-declared fail-closed policy stops write-authority actions automatically when a resource or dependency threshold is crossed, reducing the window between a failure condition and a safe halted state.
Robustness / Reliability Graceful degradation is the structural basis for robustness under adversarial resource pressure: by declaring failure modes at design time rather than improvising them at runtime, the agent behaves consistently and predictably when its operating conditions degrade, which is precisely when robustness is tested.

Design & governance principles (open design, economy of mechanism, accountability, …) are architectural, not advanced by a single placed control.

Implementation options

Five implementation options at different layers of the stack. Resilience4j is the default for JVM-based agent runtimes. Envoy circuit breaking and outlier detection suit service-mesh deployments. Istio DestinationRule composes both into a single declarative policy. AWS FIS closes the loop by verifying that the degradation paths declared in the policy matrix actually fire under realistic fault conditions. The per-action-class policy matrix is the application-layer source of truth that the other options enforce.

Resilience4j CircuitBreaker Lightweight Java library that wraps callable actions in a circuit breaker with CLOSED, OPEN, and HALF_OPEN state transitions. When the failure rate or slow-call rate exceeds a configured threshold, the circuit opens and subsequent calls are rejected immediately.

Why choose it: Best for JVM-based agent runtimes (Spring Boot, Micronaut, Kotlin) where failure policy is managed in application code. Configure one CircuitBreaker instance per action class, set failureRateThreshold to match your SLA, and wire the fallback method to your refusal or degraded-response path. Per-circuit metrics map directly onto the fail-open versus fail-closed event ratio detection signal.

More details:

Resilience4j CircuitBreaker docs ↗
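The CLOSED, OPEN, and HALF_OPEN transitions can be illustrated with a small Python model. This is a teaching sketch, not Resilience4j's API: the class, its parameter names, the count-based window, and the single probe call in HALF_OPEN are simplifying assumptions, and a JVM deployment would use the library itself.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_rate_threshold=0.5, window_size=4,
                 open_seconds=30.0, clock=time.monotonic):
        self.threshold = failure_rate_threshold
        self.window = deque(maxlen=window_size)  # True marks a failed call
        self.open_seconds = open_seconds
        self.clock = clock
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("circuit open; call rejected")
            self.state = "HALF_OPEN"  # wait elapsed: allow one probe call
        try:
            result = fn()
        except Exception:
            self._record(failed=True)
            raise
        self._record(failed=False)
        return result

    def _record(self, failed):
        if self.state == "HALF_OPEN":
            # A single probe decides: success closes, failure re-opens.
            if failed:
                self._open()
            else:
                self.state = "CLOSED"
                self.window.clear()
            return
        self.window.append(failed)
        window_full = len(self.window) == self.window.maxlen
        if window_full and sum(self.window) / len(self.window) >= self.threshold:
            self._open()

    def _open(self):
        self.state = "OPEN"
        self.opened_at = self.clock()
        self.window.clear()
```

Using one breaker instance per action class keeps a failing read-only dependency from opening the circuit for an unrelated write path, and the caller can map `CircuitOpenError` to the declared refusal or degraded response.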

Envoy circuit breaking Envoy enforces per-upstream-cluster resource limits (maximum connections, pending requests, and retries) at the proxy layer. Outlier detection passively tracks per-host error rates and ejects misbehaving hosts from the load-balancing set without application code changes.

Why choose it: Best for service-mesh or sidecar-proxy deployments where the agent calls downstream services through Envoy. Enforcement at the proxy layer requires no application changes. Outlier detection adds a host-granular layer on top of the connection-pool limits.

Istio DestinationRule Configures connection-pool limits and outlier detection as a single declarative policy applied to all traffic destined for a service, without touching application code.

Why choose it: Best when the agent runtime is inside an Istio service mesh and both circuit-breaking and host-ejection policy are needed in one place. For write-authority dependencies where any silent degradation is unacceptable, set maxEjectionPercent to 100 so that every failing host can be ejected: once no healthy host remains, requests fail immediately with HTTP 503, and the agent's fail-closed path fires.

More details:

Istio circuit breaking task ↗

AWS Fault Injection Service Managed chaos-engineering service that injects real fault conditions (CPU stress, network latency, API throttling, dependency failures) into AWS workloads and rolls back automatically if a CloudWatch stop condition is breached.

Why choose it: Best as a verification step after any of the other options are deployed. Declaring failure modes does not confirm they fire correctly. AWS FIS lets you inject a dependency failure, confirm the circuit opens within the expected window, verify the fail-closed path returns a structured refusal rather than an unhandled exception, and confirm recovery after the fault clears.

Per-action-class policy matrix An explicit map from every tool or action class to its declared failure mode (fail-closed or fail-open-degraded), wired into a central action-execution wrapper that catches any error, looks up the declared mode, and returns either a refusal or a degraded response.

Why choose it: The only option when no infrastructure-layer circuit breaker covers the action surface: agents that call internal non-HTTP APIs, manipulate file systems, or execute code in sandboxes. This matrix is also the ground-truth document that infrastructure-layer options enforce. Even deployments using Resilience4j or Istio require it.
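A central action-execution wrapper of this kind might look like the following. The matrix contents, the `execute` signature, and the response fields are all hypothetical; the sketch only shows the shape of "catch any error, look up the declared mode, return a refusal or a disclosed degraded result".

```python
import logging
from typing import Any, Callable, Optional

log = logging.getLogger("agent.degrade")

# Hypothetical matrix: action class -> declared failure mode.
POLICY_MATRIX = {
    "read_ticket": "fail_open",
    "issue_refund": "fail_closed",
}


def execute(action_class: str, action: Callable[[], Any],
            degraded_fallback: Optional[Callable[[], Any]] = None) -> dict:
    """Run an action; on any error apply the declared failure mode."""
    try:
        return {"status": "ok", "result": action()}
    except Exception as exc:
        mode = POLICY_MATRIX.get(action_class, "fail_closed")  # undeclared -> closed
        reason = type(exc).__name__
        if mode == "fail_open" and degraded_fallback is not None:
            try:
                return {
                    "status": "degraded",
                    "disclosure": f"partial result, primary path failed: {reason}",
                    "result": degraded_fallback(),
                }
            except Exception:
                pass  # fallback failed too; fall through to a refusal
        log.warning("denied %s: %s", action_class, reason)
        return {"status": "refused", "reason": reason, "result": None}
```

The fallback is ignored for fail-closed classes even when one is supplied, so a caller cannot turn a write-authority action into a best-effort one by passing a fallback.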

Trade-offs

The policy matrix and Resilience4j are application-layer changes. They require a failure-mode declaration step for every new tool or action class, which adds ongoing authoring overhead as the agent's action surface grows.
Envoy and Istio circuit-breaking add no application code changes but require the agent to be behind a compatible proxy. Bare-metal or serverless runtimes require a different enforcement layer.
AWS FIS experiments run against real resources. Run in a pre-production environment with CloudWatch stop conditions in place before testing in production.
Fail-open paths for read-only dependencies must emit an explicit degraded-mode disclosure. A silent partial result that appears complete to the caller is a worse outcome than a refusal.

When NOT to use

Do not author a policy matrix for prototype agents where the action taxonomy changes frequently. The maintenance lag means declared failure modes will not match the current action surface.
Do not use Istio or Envoy circuit-breaking as the sole graceful-degradation control for agents whose critical actions call non-HTTP dependencies. The proxy layer cannot intercept those calls.
Do not rely on AWS FIS as a runtime control. It is a testing and verification tool, not an enforcement mechanism.

Limitations

Graceful degradation cannot fire if the system cannot detect the failure condition. Slow resource exhaustion that stays below every configured threshold consumes capacity without tripping any circuit.
Circuit breakers at the infrastructure layer operate on connection and error-rate signals, not semantic content. A downstream dependency that returns HTTP 200 with corrupt or adversarially crafted data does not open the circuit.
The HALF_OPEN state in Resilience4j and the ejection-recovery cycle in Envoy outlier detection both allow limited traffic through to probe recovery. If the downstream failure is adversarial rather than accidental, that probe traffic is also adversarial exposure.
Per-action-class policy matrices can become stale. A tool that was read-only when the matrix was authored and later gained write capability is mis-declared without a re-review.
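Matrix staleness can be partly caught by a periodic audit that compares the declared modes against the current tool registry. The sketch below assumes a hypothetical `ToolSpec` registry entry with an accurate `can_write` flag; if that flag is itself out of date, the audit inherits the same blind spot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical registry entry describing a tool's current capability."""
    name: str
    can_write: bool


def audit_policy_matrix(matrix: dict, registry: list) -> list:
    """Return human-readable findings; an empty list means no drift was found."""
    findings = []
    registered = {tool.name for tool in registry}
    for tool in registry:
        mode = matrix.get(tool.name)
        if mode is None:
            findings.append(f"{tool.name}: no declared failure mode")
        elif tool.can_write and mode != "fail_closed":
            findings.append(f"{tool.name}: write-capable but declared {mode}")
    # Declarations that outlived their tool are also drift worth reviewing.
    for name in sorted(set(matrix) - registered):
        findings.append(f"{name}: declared but no longer registered")
    return findings
```

Running such a check in CI on every change to the tool registry is one way to force the re-review that the matrix otherwise depends on people remembering.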

Maturity tier reasoning

Tier 2 fits because all five options are production-available with stable APIs and maintainer documentation.
What keeps this out of Tier 1 is the per-action-class policy matrix: the agentic addition that makes circuit-breaking relevant to this threat model. No industry-standard schema or tooling exists for authoring, validating, or machine-reading an agent action-class failure-mode taxonomy.
The combination of infrastructure-layer circuit breaking with application-layer action-class policy authoring is what the maturity tier reflects.

Last verified against upstream docs: 2026-05-30.

PLACEMENT

On the canvas, this control can be placed on:

node

Valid node kinds: agent

MAESTRO LAYERS

L3 L5

ATLAS TECHNIQUES

AML.T0029 Denial of AI Service
Adversary exhausts compute, memory, or rate-limit budgets so the AI system stops responding or stops processing legitimate requests.
AML.T0034 Cost Harvesting
Adversary intentionally inflates the victim's inference bill (long prompts, expensive tools, repeated calls) to cause financial harm rather than service disruption.

ATLAS MITIGATIONS

AML.M0024 AI Telemetry Logging
Log inputs, outputs, and reasoning steps of deployed AI models so anomalous behaviour can be detected and incidents reconstructed.

TRADE-OFFS

latency low
cost low
ux friction medium
dev effort medium

PLAYBOOKS

OWASP v1.1 playbook that recommends this control:

P3 Securing AI Tool Execution & Preventing Unauthorised Actions Across Supply Chains