MITIGATION · m-rate-quota

Per-agent rate limits and quotas — bound compute, tokens, and external-API spend

An agent operates without direct human oversight, autonomously scheduling tool calls, external API requests, and reflection loops. Without a budget, a single triggering event can fan out into hundreds of downstream calls. Per-agent rate limits and quotas assign each agent identity its own ceiling on call rate, token consumption, and cost spend, so a misbehaving or compromised agent cannot exhaust shared resources and its overconsumption becomes a visible, actionable signal.

Last reviewed 2026-05-12 · Status: published

At a glance

MATURITY: Tier 2. Available off-the-shelf or as a documented pattern, but newer or less broadly proven. Expect integration work and some operational nuance.
PLACES ON: node · edge (restricted to node kinds: agent, tool-bus)
COVERAGE: 4 threats (T4 · T13 · T38 · T39)
TRADE-OFFS: latency low · cost low · UX friction medium · dev effort low
TL;DR
  • Each agent identity gets its own budget for call rate, token consumption, and spend. One agent reaching its limit does not affect the budget of any other agent running for the same user.
  • The budget is evaluated before each tool call and each downstream API request. Consumption is refused at the seam before the call is made, not reconciled after the damage is done.
  • Per-task budget resets are the agentic addition that standard API gateway rate limits do not provide. A reflection loop that runs hard in task A cannot carry its consumption forward and block task B in the same session.
  • When the budget is exhausted, the call is refused with a structured QuotaExceeded response and an observability event is emitted. The error is not silently swallowed.

How it behaves

  • Agent requests a tool call or external API call.
  • Check call rate, token spend, and cost against the per-agent and per-task budget.
  • Within budget: allow the call and record consumption.
  • Over budget: reject with QuotaExceeded, then halt or pause the task.
  • On rejection, emit a quota-exceeded event for observability. Do not silently swallow the error.
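
The flow above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names Budget, guarded_call, and the events list are hypothetical, and the in-memory list stands in for whatever observability pipeline a deployment actually uses.

```python
from dataclasses import dataclass


class QuotaExceeded(Exception):
    """Structured refusal raised before the call is made."""

    def __init__(self, agent_id: str, dimension: str, limit: float, attempted: float):
        super().__init__(f"{agent_id} exceeded {dimension}: {attempted} > {limit}")
        self.agent_id = agent_id
        self.dimension = dimension
        self.limit = limit
        self.attempted = attempted


@dataclass
class Budget:
    # Ceilings (set by the operator) and running consumption for one agent.
    max_calls: int
    max_tokens: int
    max_cost_usd: float
    calls: int = 0
    tokens: int = 0
    cost_usd: float = 0.0


# Hypothetical stand-in for an observability pipeline.
events: list[dict] = []


def guarded_call(agent_id, budget, tool, est_tokens, est_cost_usd):
    """Check the budget first; only invoke `tool` if every dimension fits."""
    projected = {
        "calls": (budget.calls + 1, budget.max_calls),
        "tokens": (budget.tokens + est_tokens, budget.max_tokens),
        "cost_usd": (budget.cost_usd + est_cost_usd, budget.max_cost_usd),
    }
    for dimension, (attempted, limit) in projected.items():
        if attempted > limit:
            # Emit the event before raising so the refusal is never silent.
            events.append({
                "event": "quota_exceeded",
                "agent_id": agent_id,
                "dimension": dimension,
                "limit": limit,
                "attempted": attempted,
            })
            raise QuotaExceeded(agent_id, dimension, limit, attempted)
    # Record consumption before the call so a failing tool still counts.
    budget.calls += 1
    budget.tokens += est_tokens
    budget.cost_usd += est_cost_usd
    return tool()
```

The caller decides whether a QuotaExceeded pauses or terminates the task. The sketch only guarantees that the refusal happens before the call and leaves an event behind.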

What it is

A rate limit controls how fast a caller can make requests; a quota controls how much it can consume in total over a period. For agentic systems, both matter because an agent autonomously schedules tool calls, external API requests, and internal reflection cycles without a human pacing it. A single triggering event, such as a malformed tool response or an injected instruction, can cause a planning loop to fan out into hundreds of downstream requests before any human is aware.

Standard API gateway rate limiting keys on IP address or API key. That is not sufficient for agents: the same user might have several agents acting on their behalf simultaneously, and each agent's consumption should be budgeted independently. Per-agent rate limiting changes the quota key to the agent's identity, such as its SPIFFE ID, service-account token, or platform-assigned agent ID. Each identity gets its own token bucket or sliding window, so one agent exhausting its budget does not affect another agent operating for the same user.
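
As a rough sketch of what keying the bucket on agent identity means, the following Python keeps one token bucket per agent ID. The class names, the rate and burst parameters, and the injectable clock are illustrative assumptions. A production deployment would more likely hold this state in a gateway or a shared store than in process memory.

```python
import time
from typing import Callable


class TokenBucket:
    """Classic token bucket: refills at rate_per_s, holds at most `burst`."""

    def __init__(self, rate_per_s: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.clock = clock
        self.tokens = burst          # start full
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, capped at the burst size.
        elapsed = now - self.updated
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_s)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerAgentLimiter:
    """One independent bucket per agent identity (e.g. a SPIFFE ID)."""

    def __init__(self, rate_per_s: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.clock = clock
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, agent_id: str) -> bool:
        if agent_id not in self.buckets:
            self.buckets[agent_id] = TokenBucket(
                self.rate_per_s, self.burst, self.clock)
        return self.buckets[agent_id].try_acquire()
```

Because the dictionary key is the agent identity rather than the user or API key, two agents acting for the same user draw from separate buckets.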

The second agentic-specific addition is the per-task budget. A long-running agent session may span many tasks in sequence. A per-task budget resets at task boundaries, so a reflection loop that runs hard in one task cannot carry that consumption forward and block subsequent tasks in the same session.

When the budget is reached, the call is refused with a QuotaExceeded error and a structured event is emitted for observability. The task is paused or terminated cleanly; the error is not silently swallowed. That structured refusal is what makes the budget a detection signal, not merely a governor.

Detection signals

  • Quota-exceeded events per agent identity. A rising rate for a specific identity points to a misbehaving reflection loop, prompt injection, or coordinated flood originating from that agent.
  • Fan-out ratio: downstream requests divided by agent-initiated requests. A ratio substantially above baseline indicates self-amplifying behaviour that call-count quotas alone will not surface.
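
The fan-out ratio can be computed from request logs along these lines. The record shape (agent ID plus a kind of "agent" or "downstream"), the single shared baseline, and the multiplier of 3 used for "substantially above baseline" are all assumptions made for illustration. The right threshold depends on the workload.

```python
from collections import Counter
from typing import Iterable


def fan_out_ratios(records: Iterable[tuple[str, str]]) -> dict[str, float]:
    """Return downstream requests / agent-initiated requests per agent.

    Each record is (agent_id, kind) where kind is "agent" for a request the
    agent initiated and "downstream" for a request made as a consequence.
    Agents with no agent-initiated requests are skipped.
    """
    initiated: Counter = Counter()
    downstream: Counter = Counter()
    for agent_id, kind in records:
        if kind == "agent":
            initiated[agent_id] += 1
        elif kind == "downstream":
            downstream[agent_id] += 1
    return {a: downstream[a] / n for a, n in initiated.items() if n > 0}


def flag_anomalies(ratios: dict[str, float], baseline: float,
                   factor: float = 3.0) -> list[str]:
    """List agents whose ratio exceeds factor * baseline (factor is assumed)."""
    return sorted(a for a, r in ratios.items() if r > factor * baseline)
```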

Threats it covers

  • T4 Resource Overload −2 severity steps

    WHY IT HELPS Resource Overload is the deliberate or accidental exhaustion of compute, memory, token, or cost budgets shared by the system. An agent whose call rate, token spend, and cost are each bounded by a per-identity quota cannot exhaust those resources beyond its ceiling, regardless of how many tool calls its planning or reflection loop schedules.

  • WHY IT HELPS Coordinated Agent Flooding is the scenario where a set of agents, acting in concert or under a single attacker's direction, submits a volume of requests that overwhelms downstream systems. Per-agent rate limits bound the maximum contribution of any one identity, reducing the aggregate flood a coordinated set can sustain.

  • WHY IT HELPS Emergent collusion amplifies impact by having multiple agents submit coordinated bursts toward a shared target. Per-agent quota enforcement limits each participating identity's rate, which slows the burst and opens a detection window before the cumulative effect reaches its target.

  • WHY IT HELPS Unintended MCP resource consumption occurs when an agent or a manipulated tool call exhausts compute, API quota, or cost budgets that were never intended to be reachable in normal operation. Per-agent, per-tool, and per-session quota enforcement places a hard ceiling on that consumption at each layer.

Principle coverage

Defence-in-Depth stage: Prevent. It also advances:

  • Microsegmentation: Rate limits and quotas enforce microsegmentation at the resource layer. Each agent identity is confined to its own budget for calls, tokens, and spend, so a misbehaving agent in one segment cannot deplete the shared resource pool available to adjacent identities.
  • Least Agency / Minimal Autonomy: A per-agent budget caps the agent's autonomous scheduling and fan-out at a ceiling set by the operator rather than by the agent's own judgment, so the agent can no longer consume resources without bound.
  • Rate-limiting / Budgets / Loop prevention: This control is the direct implementation of rate-limiting for agentic systems. It applies token-bucket and sliding-window algorithms at the agent-identity and per-task dimensions that standard API gateways do not supply.

Design & governance principles (open design, economy of mechanism, accountability, …) are architectural, not advanced by a single placed control.

Implementation options

Four deployment layers, each covering a different part of the budget surface. Deploy them together for defence-in-depth. The gateway layer alone does not catch in-process reflection loops, and the runtime layer alone cannot limit downstream LLM spend.

Envoy / Kong Enforce per-identity call-rate limits at the network layer using a token-bucket or sliding-window algorithm before requests reach any backend.

Why choose it: Best when you already run a service mesh or reverse proxy. Envoy's local rate limit filter and the Envoy Ratelimit sidecar (Go/gRPC, Redis-backed) support per-descriptor quota keys so you can express agent-id + tool-name as the budget key. Kong's rate-limiting plugin adds per-consumer enforcement when an auth plugin identifies the caller, making it a natural fit for multi-tenant agentic platforms.

AWS API Gateway / Cloudflare Use a hosted API gateway that enforces per-API-key throttle and monthly quota with no rate-limit infrastructure to operate.

Why choose it: Best when your agent calls external REST APIs and you want the quota enforced close to the provider boundary with no operational overhead. AWS API Gateway usage plans set both a per-second throttle (token-bucket) and a daily, weekly, or monthly request quota per API key, applied on a best-effort basis rather than as a hard limit. Map one API key to one agent identity. Cloudflare Rate Limiting adds edge enforcement; Enterprise Advanced Rate Limiting can key on custom request headers carrying an agent-ID claim.

OpenAI / Anthropic / Bedrock spend limits Set hard monthly spend and per-minute token quotas at the LLM provider so an agent cannot exhaust your billing ceiling regardless of what happens in your own infrastructure.

Why choose it: Best as a backstop when the agent consumes an LLM API that your own gateway does not front. OpenAI enforces RPM, TPM, RPD, and monthly spend ceilings per project. Anthropic enforces per-workspace RPM, ITPM, and OTPM limits with workspace-level overrides programmable via the Rate Limits API. AWS Bedrock enforces per-account per-model invocation quotas tracked in tokens. These limits operate at project or workspace granularity, not per-agent-identity, so they are a backstop rather than a replacement for gateway-layer per-agent budgets.

In-process runtime budget (self-build) Enforce per-agent and per-task call-count, token-estimate, and tool-depth budgets inside the agent runtime, before any network call is made.

Why choose it: This is the only layer that can limit per-task reflection loops and fan-out before they become network traffic. No off-the-shelf product implements per-task budget semantics for agentic runtimes today; the budget is custom application code keyed on (agent-id, task-id). Pair with gateway-layer enforcement: the runtime budget catches runaway loops early, and the gateway budget is the hard backstop if the runtime budget is bypassed or misconfigured.
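
One possible shape for that custom code is sketched below. It is an assumption-laden illustration rather than an established pattern: RuntimeBudget, TaskLimits, and the start_task / charge / end_task methods are hypothetical names, and the token figure is whatever estimate the runtime can produce before the call.

```python
from dataclasses import dataclass


class QuotaExceeded(Exception):
    def __init__(self, agent_id: str, task_id: str, dimension: str):
        super().__init__(f"{agent_id}/{task_id} exceeded {dimension}")
        self.agent_id = agent_id
        self.task_id = task_id
        self.dimension = dimension


@dataclass(frozen=True)
class TaskLimits:
    max_calls: int        # tool calls allowed per task
    max_est_tokens: int   # estimated tokens allowed per task
    max_depth: int        # deepest allowed tool-call nesting


class RuntimeBudget:
    """In-process budget keyed on (agent-id, task-id)."""

    def __init__(self, limits: TaskLimits):
        self.limits = limits
        self.usage: dict[tuple[str, str], dict[str, int]] = {}

    def start_task(self, agent_id: str, task_id: str) -> None:
        # Fresh counters: consumption from earlier tasks does not carry over.
        self.usage[(agent_id, task_id)] = {"calls": 0, "est_tokens": 0}

    def end_task(self, agent_id: str, task_id: str) -> None:
        self.usage.pop((agent_id, task_id), None)

    def charge(self, agent_id: str, task_id: str,
               est_tokens: int, depth: int) -> None:
        """Call before every tool call; raises instead of letting it proceed."""
        used = self.usage.setdefault(
            (agent_id, task_id), {"calls": 0, "est_tokens": 0})
        if depth > self.limits.max_depth:
            raise QuotaExceeded(agent_id, task_id, "tool_depth")
        if used["calls"] + 1 > self.limits.max_calls:
            raise QuotaExceeded(agent_id, task_id, "calls")
        if used["est_tokens"] + est_tokens > self.limits.max_est_tokens:
            raise QuotaExceeded(agent_id, task_id, "est_tokens")
        used["calls"] += 1
        used["est_tokens"] += est_tokens
```

Keying on the pair rather than on the agent alone is what gives the per-task reset: a new task ID starts from zero even though the agent identity is unchanged.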

Trade-offs

  • Gateway-layer token-bucket decisions are sub-millisecond and add no perceptible latency. In-process runtime checks add a few microseconds of bookkeeping per call.
  • Setting the budget too tight is the dominant failure mode. A per-task tool-call limit of 10 blocks a legitimate research agent on any moderately complex query. Measure call counts and token spend per task class over at least two weeks before setting a production limit.
  • LLM provider spend limits operate at project or workspace granularity, not per-agent-identity. A single misbehaving agent can consume the entire shared project budget, and when the provider-level limit finally fires it blocks every agent in that project, not only the offender.
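
One way to turn the measurement advice above into a starting limit is to take a high percentile of the observed per-task counts and add headroom. The 99th percentile and the 1.5x headroom below are illustrative assumptions, not recommendations from this entry. The function name is hypothetical.

```python
import math
import statistics


def suggest_limit(per_task_counts: list[int], percentile: int = 99,
                  headroom: float = 1.5) -> int:
    """Suggest a per-task limit from measured counts for one task class.

    Takes the given percentile of the observations (inclusive method) and
    multiplies by a headroom factor. Both defaults are assumptions.
    """
    if len(per_task_counts) < 2:
        raise ValueError("need at least two observations")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(per_task_counts, n=100, method="inclusive")
    return math.ceil(cuts[percentile - 1] * headroom)
```

Run it per task class rather than across all tasks, since a research agent and a lookup agent will have very different distributions. The same function works for token spend if the inputs are token counts.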

When NOT to use

  • Do not apply per-call rate limits to batch or scheduled agents that legitimately burst in a maintenance window. Use time-windowed budgets with a separate scheduled-job policy instead.
  • Do not use quota budgets as a substitute for cost alerting. A quota that blocks real users is not the right answer when legitimate workload exceeds forecast. Use budget alerts and autoscaling for that.
  • Do not treat a passing quota check as proof the agent is not misbehaving. An agent can stay within its budget and still take harmful actions. Pair with output moderation and plan validation.

Limitations

  • A coordinated multi-agent flood across many identities can exhaust shared downstream resources even if no single agent hits its individual limit. Gateway-level system-wide rate limits and graceful degradation are required as a complement.
  • Token budgets must be charged from an estimate made before the call completes. Reconciliation against actual usage is asynchronous, so the runtime budget lags by at least one request and can slightly overshoot.
  • The per-agent-identity and per-task dimensions have no industry-standard attribution scheme for multi-agent workflows. Which budget applies when two agents collaborate on a shared task is a deployment-specific decision with no settled best practice.
  • AWS API Gateway usage plans are documented as best-effort, not hard limits. The documentation warns explicitly that clients may exceed quotas and that plans should not be relied on to block access or control costs.
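
The estimate-then-reconcile behaviour described in the token-budget limitation can be illustrated with a small ledger. This is a sketch under stated assumptions: TokenLedger and its methods are hypothetical, and real usage figures would come from the provider's response after the call returns.

```python
class TokenLedger:
    """Reserve an estimate before the call, settle the actual count after."""

    def __init__(self, limit: int):
        self.limit = limit
        self.settled = 0                      # actual tokens, completed calls
        self.reserved: dict[str, int] = {}    # call_id -> estimate, in flight

    def reserve(self, call_id: str, estimate: int) -> bool:
        """Admit the call only if settled + in-flight + estimate fits."""
        in_flight = sum(self.reserved.values())
        if self.settled + in_flight + estimate > self.limit:
            return False
        self.reserved[call_id] = estimate
        return True

    def reconcile(self, call_id: str, actual: int) -> None:
        """Replace the estimate with the actual usage once it is known."""
        self.reserved.pop(call_id)
        self.settled += actual

    @property
    def overshoot(self) -> int:
        """Tokens consumed beyond the limit because an estimate was too low."""
        return max(0, self.settled - self.limit)
```

If a call admitted on an estimate of 600 tokens actually uses 1,100 against a limit of 1,000, the ledger ends 100 tokens over. That is the overshoot the limitation describes, and it is bounded by the error on the calls that were in flight.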

Maturity tier reasoning

  • Tier 2 fits because three of the four layers (gateway, managed cloud, LLM provider limits) are production-available today with well-documented APIs and SLAs.
  • The in-process agent-runtime budget layer has no off-the-shelf implementation. It requires custom code, which is what keeps the agentic application of this control out of Tier 1.
  • The absence of an industry-standard per-agent-identity attribution scheme for multi-agent workflows, specifically which budget applies when agents collaborate on a shared task, is the remaining open operational problem that Tier 1 would require to be resolved.

Last verified against upstream docs: 2026-05-30.