MITIGATION · m-graceful-degrade
Graceful degradation — fail closed where it matters, fail open where it's safe
An agent that encounters a quota trip, a dependency failure, or a timeout faces a choice: continue at reduced quality, or refuse. Getting that choice wrong is the core operational failure. Graceful degradation requires the answer to be declared before the incident, not improvised during it: write-authority paths fail closed and return a refusal; read-only paths fail open and disclose the degraded state explicitly.
At a glance
TL;DR
- Declare the failure mode for every action class before deployment: write-authority paths fail closed and return a structured refusal; read-only paths fail open with an explicit disclosure that the result is degraded.
- A quota trip or circuit-open event is a decision point, not a silent continuation. The agent must refuse or disclose; continuing without acknowledgment is the canonical operational mistake.
- The central work is the per-action-class taxonomy. If the team cannot agree whether an action is write-authority or advisory, default to fail-closed until the classification is settled.
- Graceful degradation is only as reliable as the failure signals that trigger it. Pair with rate quotas and longitudinal monitoring to detect sub-threshold resource exhaustion that the circuit breaker cannot see.
How it behaves
What it is
Graceful degradation is the practice of declaring, at design time, exactly what an agent does when a failure condition is reached: a quota is exhausted, a dependency returns an error, a circuit breaker opens, or a timeout elapses. The declaration is not a runtime decision; it is a pre-committed policy, indexed by action class, that determines whether the agent fails open or fails closed.
Failing open means the agent continues and returns a partial or reduced result, with an explicit disclosure that the result is degraded. Failing closed means the agent refuses the action entirely, logs the denial, and returns a structured error to the caller. The choice between the two is not arbitrary: write-authority paths, those that modify state, send external communication, or commit irreversible actions, must fail closed. Read-only or advisory paths may fail open, provided the degraded state is disclosed rather than concealed.
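The classification rule above can be sketched as a small function. This is a minimal illustration, not a standard schema: the `ActionClass` fields and the `failure_mode` helper are hypothetical names, and treating an unsettled property (`None`) as fail-closed is an assumption that follows the default-to-fail-closed guidance in the TL;DR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActionClass:
    """Hypothetical description of one action class."""
    name: str
    # None means the team has not yet settled the answer.
    modifies_state: Optional[bool]
    sends_external: Optional[bool]
    irreversible: Optional[bool]


def failure_mode(ac: ActionClass) -> str:
    """Return the declared failure mode for an action class."""
    flags = (ac.modifies_state, ac.sends_external, ac.irreversible)
    # Any write-authority property, or any unsettled one, fails closed.
    if any(flag is None or flag for flag in flags):
        return "fail_closed"
    # Only actions known to be read-only or advisory may fail open.
    return "fail_open_degraded"
```

Under this sketch, an action is allowed to fail open only when every write-authority property is known to be false.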
The reason the declaration must precede the incident is that an agent making a runtime judgment about its own failure mode is itself a failure mode. An attacker who can force a resource-exhaustion condition can also force the agent into whatever ad-hoc behavior its undeclared failure path produces. A pre-declared, externally enforced policy removes that surface.
Pair with m-rate-quota for the rate-limit primitives that generate the failure signals this control responds to, with m-fail-closed for the confidence-gate variant, and with m-divergence-monitor so that degradation events are tracked longitudinally.
Detection signals
- Suspension events per agent per hour. A rising rate means failure conditions are being hit repeatedly, which points to a quota misconfiguration, an upstream dependency under stress, or a deliberate resource-exhaustion attempt.
- Fail-open versus fail-closed event ratio. A ratio that drifts toward fail-open on action classes that should be fail-closed means the policy matrix is incomplete or a new tool was added without a declared failure mode.
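Both signals can be derived from a log of failure-path events. The sketch below assumes a hypothetical `Event` record and treats every recorded failure-path event as a suspension event. The field names and the `declared` mapping are illustrative, not taken from any particular monitoring product.

```python
from collections import Counter
from typing import Dict, Iterable, List, NamedTuple


class Event(NamedTuple):
    agent: str
    hour: int          # hour bucket, e.g. epoch seconds // 3600
    action_class: str
    outcome: str       # "fail_open" or "fail_closed"


def suspensions_per_agent_hour(events: Iterable[Event]) -> Counter:
    """Count failure-path events per (agent, hour) bucket."""
    return Counter((e.agent, e.hour) for e in events)


def fail_open_ratio(events: Iterable[Event]) -> float:
    """Fraction of events that failed open; 0.0 when there are no events."""
    items = list(events)
    if not items:
        return 0.0
    return sum(e.outcome == "fail_open" for e in items) / len(items)


def policy_violations(events: Iterable[Event],
                      declared: Dict[str, str]) -> List[Event]:
    """Fail-open events on classes declared fail-closed or not declared at all."""
    return [
        e for e in events
        if e.outcome == "fail_open"
        and declared.get(e.action_class, "fail_closed") == "fail_closed"
    ]
```

A non-empty `policy_violations` result is the concrete form of the ratio drift described above: an action class failed open although its declaration, or the absence of one, says it should not.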
Threats it covers
- WHY IT HELPS: Resource Overload is the deliberate exhaustion of an agent's compute, memory, or budget, forcing it into a degraded operating state. Whether that overload becomes a security incident depends entirely on how the agent responds when its resources are exhausted: fail-open silence allows it to continue acting on write-authority paths without the capacity to do so safely, while a pre-declared fail-closed response stops the action and logs the denial.
- WHY IT HELPS: Workflow Disruption via Dependency Exploitation occurs when an attacker degrades or disables a dependency that a pipeline agent relies on, causing upstream agents to block, retry indefinitely, or propagate corrupt state. A pre-declared fail-closed response with a circuit breaker prevents the degraded dependency from stalling the entire pipeline: the affected agent halts and returns a structured refusal rather than blocking indefinitely.
Principle coverage
Defence-in-Depth stage: Respond. It advances:
- Defence-in-Depth: Graceful degradation is the failure-path layer of defence-in-depth: when rate limits, isolation, and monitoring have all been reached or bypassed, a pre-declared fail-closed response stops the action at the last enforcement point rather than allowing it to proceed under degraded conditions.
- Assume Breach: Assume Breach requires planning for an agent that is already operating under resource exhaustion or dependency compromise. Graceful degradation is the operational response to that scenario: a pre-declared failure mode ensures the agent refuses write-authority actions and discloses degraded state on read-only paths, regardless of whether the exhaustion is accidental or deliberate.
- Fail Securely (fail-closed): Fail-securely requires that when a control cannot operate normally, the outcome is the safe state rather than an uncontrolled permissive default. A pre-declared fail-closed policy for write-authority paths is the direct implementation of that principle: the absence of capacity to act safely produces a structured refusal, not a partial action.
- Resilience & Recovery: Graceful degradation is the immediate containment step in a resilience and recovery sequence: a pre-declared failure mode stops the agent from accumulating further damage during a resource exhaustion or dependency failure event, creating a stable halted state from which recovery and forensics can begin.
- Safe Interruptibility / Corrigibility: Safe Interruptibility requires that an agent can be stopped reliably without producing harmful side effects. A pre-declared fail-closed response on write-authority paths implements that requirement at the failure boundary: when a resource or dependency condition is reached, the agent stops the action cleanly rather than attempting a partial write.
- Rate-limiting / Budgets / Loop prevention: Graceful degradation is the response layer that rate limiting triggers: when a rate limit or quota is reached, the pre-declared failure mode determines whether the agent refuses the action or discloses a degraded result, turning the rate-limit signal into a defined, auditable outcome rather than an uncontrolled one.
- Kill-switch / Circuit-breaker: Graceful degradation extends the kill-switch principle to automated failure conditions: where a kill switch requires a human to initiate a halt, a pre-declared fail-closed policy stops write-authority actions automatically when a resource or dependency threshold is crossed, reducing the window between a failure condition and a safe halted state.
- Robustness / Reliability: Graceful degradation is the structural basis for robustness under adversarial resource pressure. Because failure modes are declared at design time rather than improvised at runtime, the agent behaves consistently and predictably when its operating conditions degrade, which is exactly when robustness is tested.
Design & governance principles (open design, economy of mechanism, accountability, …) are architectural, not advanced by a single placed control.
Implementation options
Five implementation options, each at a different layer of the stack. Resilience4j is the default for JVM-based agent runtimes. Envoy circuit breaking and outlier detection suit service-mesh deployments. Istio DestinationRule composes both into a single declarative policy. AWS FIS closes the loop by verifying that the degradation paths declared in the policy matrix actually fire under realistic fault conditions. The per-action-class policy matrix is the ground-truth document that the infrastructure-layer options enforce.
Resilience4j CircuitBreaker: Lightweight Java library that wraps callable actions in a circuit breaker with CLOSED, OPEN, and HALF_OPEN state transitions. When the failure rate or slow-call rate exceeds a configured threshold, the circuit opens and subsequent calls are rejected immediately.
Why choose it: Best for JVM-based agent runtimes (Spring Boot, Micronaut, Kotlin) where failure policy is managed in application code. Configure one CircuitBreaker instance per action class, set failureRateThreshold to match your SLA, and wire the fallback method to your refusal or degraded-response path. Per-circuit metrics map directly onto the fail-open versus fail-closed event ratio detection signal.
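The CLOSED, OPEN, and HALF_OPEN transitions can be illustrated with a small state machine. This is a language-neutral sketch in Python, not the Resilience4j API. The class, its parameters, and `CircuitOpenError` are hypothetical, and it counts consecutive failures rather than computing a failure rate over a sliding window as Resilience4j does.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, open_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock          # injectable so tests can control time
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.open_seconds:
                # Reject immediately without touching the dependency.
                raise CircuitOpenError("circuit open; call rejected")
            self.state = "HALF_OPEN"  # wait elapsed: allow one probe call
        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and resets the failure count.
        self.state = "CLOSED"
        self.failures = 0
        return result

    def _record_failure(self) -> None:
        self.failures += 1
        # A failed probe, or too many consecutive failures, opens the circuit.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

With one breaker instance per action class, the caller can map `CircuitOpenError` to a refusal on write-authority paths and to a disclosed degraded response on read-only paths.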
Envoy circuit breaking: Envoy enforces per-upstream-cluster resource limits (maximum connections, pending requests, and retries) at the proxy layer. Outlier detection passively tracks per-host error rates and ejects misbehaving hosts from the load-balancing set without application code changes.
Why choose it: Best for service-mesh or sidecar-proxy deployments where the agent calls downstream services through Envoy. Enforcement at the proxy layer requires no application changes. Outlier detection adds a host-granular layer on top of the connection-pool limits.
Istio DestinationRule: Configures connection-pool limits and outlier detection as a single declarative policy applied to all traffic destined for a service, without touching application code.
Why choose it: Best when the agent runtime is inside an Istio service mesh and both circuit-breaking and host-ejection policy are needed in one place. For write-authority dependencies where any silent degradation is unacceptable, set maxEjectionPercent to 100 so that every failing host can be ejected: once no healthy host remains, requests return HTTP 503 immediately, and the agent's fail-closed path fires.
AWS Fault Injection Service: Managed chaos-engineering service that injects real fault conditions (CPU stress, network latency, API throttling, dependency failures) into AWS workloads and rolls back automatically if a CloudWatch stop condition is breached.
Why choose it: Best as a verification step after any of the other options are deployed. Declaring failure modes does not confirm they fire correctly. AWS FIS lets you inject a dependency failure, confirm the circuit opens within the expected window, verify the fail-closed path returns a structured refusal rather than an unhandled exception, and confirm recovery after the fault clears.
Per-action-class policy matrix: An explicit map from every tool or action class to its declared failure mode (fail-closed or fail-open-degraded), wired into a central action-execution wrapper that catches any error, looks up the declared mode, and returns either a refusal or a degraded response.
Why choose it: The only option when no infrastructure-layer circuit breaker covers the action surface: agents that call internal non-HTTP APIs, manipulate file systems, or execute code in sandboxes. This matrix is also the ground-truth document that infrastructure-layer options enforce. Even deployments using Resilience4j or Istio require it.
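A minimal sketch of the matrix and its execution wrapper follows. All names here (`FailureMode`, `POLICY_MATRIX`, `execute`, and the example action classes) are hypothetical. The response dictionaries are one possible shape for a structured refusal and a degraded disclosure, not a standard format, and the `fallback` callable is assumed not to raise.

```python
from enum import Enum
from typing import Any, Callable, Dict


class FailureMode(Enum):
    FAIL_CLOSED = "fail_closed"
    FAIL_OPEN_DEGRADED = "fail_open_degraded"


# Hypothetical matrix: action class -> declared failure mode.
POLICY_MATRIX: Dict[str, FailureMode] = {
    "send_email": FailureMode.FAIL_CLOSED,
    "update_record": FailureMode.FAIL_CLOSED,
    "search_docs": FailureMode.FAIL_OPEN_DEGRADED,
}


def execute(action_class: str, action: Callable[[], Any],
            fallback: Callable[[], Any] = lambda: None) -> Dict[str, Any]:
    """Run an action; on any error, apply the declared failure mode."""
    try:
        return {"status": "ok", "result": action()}
    except Exception as exc:
        # Undeclared action classes default to fail-closed.
        mode = POLICY_MATRIX.get(action_class, FailureMode.FAIL_CLOSED)
        cause = type(exc).__name__
        if mode is FailureMode.FAIL_OPEN_DEGRADED:
            # Continue with a reduced result, and say so explicitly.
            return {"status": "degraded", "result": fallback(),
                    "disclosure": f"{action_class} degraded: {cause}"}
        # Refuse outright; no partial write is attempted.
        return {"status": "refused", "result": None,
                "reason": f"{action_class} refused: {cause}"}
```

Routing every tool call through a single wrapper of this kind means a newly added tool without a declaration is refused on failure rather than silently continuing.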
Trade-offs
- The policy matrix and Resilience4j are application-layer changes. They require a failure-mode declaration step for every new tool or action class, which adds ongoing authoring overhead as the agent's action surface grows.
- Envoy and Istio circuit-breaking add no application code changes but require the agent to be behind a compatible proxy. Bare-metal or serverless runtimes require a different enforcement layer.
- AWS FIS experiments run against real resources. Run in a pre-production environment with CloudWatch stop conditions in place before testing in production.
- Fail-open paths for read-only dependencies must emit an explicit degraded-mode disclosure. A silent partial result that appears complete to the caller is a worse outcome than a refusal.
When NOT to use
- Do not author a policy matrix for prototype agents where the action taxonomy changes frequently. The maintenance lag means declared failure modes will not match the current action surface.
- Do not use Istio or Envoy circuit-breaking as the sole graceful-degradation control for agents whose critical actions call non-HTTP dependencies. The proxy layer cannot intercept those calls.
- Do not rely on AWS FIS as a runtime control. It is a testing and verification tool, not an enforcement mechanism.
Limitations
- Graceful degradation cannot fire if the system cannot detect the failure condition. Slow resource exhaustion that stays below every configured threshold consumes capacity without tripping any circuit.
- Circuit breakers at the infrastructure layer operate on connection and error-rate signals, not semantic content. A downstream dependency that returns HTTP 200 with corrupt or adversarially crafted data does not open the circuit.
- The HALF_OPEN state in Resilience4j and the ejection-recovery cycle in Envoy outlier detection both allow limited traffic through to probe recovery. If the downstream failure is adversarial rather than accidental, each probe request is itself exposed to the adversary-controlled dependency.
- Per-action-class policy matrices can become stale. A tool that was read-only when the matrix was authored and later gained write capability is mis-declared without a re-review.
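The staleness limitation above can be partly reduced by checking the matrix against the current tool registry, for example in CI. The sketch below assumes a hypothetical registry that reports, per tool, whether it currently has write capability. How that flag is derived is outside the scope of this example.

```python
from typing import Dict, List


def stale_entries(matrix: Dict[str, str], tools: Dict[str, bool]) -> List[str]:
    """Compare declared failure modes against the current tool registry.

    matrix: action class -> declared mode ("fail_closed" or "fail_open_degraded")
    tools:  action class -> True if the tool currently has write capability
    """
    problems: List[str] = []
    for name, can_write in sorted(tools.items()):
        mode = matrix.get(name)
        if mode is None:
            problems.append(f"{name}: no declared failure mode")
        elif can_write and mode != "fail_closed":
            problems.append(f"{name}: has write capability but is declared {mode}")
    # Declarations left behind after a tool was removed.
    for name in sorted(set(matrix) - set(tools)):
        problems.append(f"{name}: declared but no longer registered")
    return problems
```

An empty list means every registered tool has a declaration and no write-capable tool is declared fail-open. It does not prove that the write-capability flags themselves are accurate.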
Maturity tier reasoning
- Tier 2 fits because all five options are production-available with stable APIs and maintainer documentation.
- What keeps this out of Tier 1 is the per-action-class policy matrix: the agentic addition that makes circuit-breaking relevant to this threat model. No industry-standard schema or tooling exists for authoring, validating, or machine-reading an agent action-class failure-mode taxonomy.
- The combination of infrastructure-layer circuit breaking with application-layer action-class policy authoring is what the maturity tier reflects.
Last verified against upstream docs: 2026-05-30.