Constrained Generation & Deterministic Guardrails · Principles

Why it matters for agentic AI

In classical software, an instruction is a deterministic function call: you know exactly what it will do before it runs. In an agent system, the instruction is a natural-language prompt whose output is probabilistic text that is then interpreted as a function call. That gap between what the model generated and what the system executes is where constrained generation sits. The principle is simple to state: model output is raw material, not executable truth. Before anything acts on it, deterministic controls outside the model validate it against schema, check it against an allow-list, and evaluate it against semantic policy. “The LLM said it is safe” is not a safety check. It is a circular reference.

This matters more for agents than for any prior use of language models because the stakes of a bad output are compounded. A chat assistant that produces a wrong answer can be corrected in the next turn. An agent whose malformed output silently skips a step in a multi-step plan leaves the workflow in a partially executed state that may not surface as an error. An agent whose tool-call output carries an injected instruction passes it to the next stage, which treats it as a validated action. The model has no conception of escaping, no understanding of downstream execution context, and no way to know whether its well-formed-looking output is about to manipulate a database, a shell, or another model.

A common misconception is that schema validation is a security boundary. It is not; it is a structural gate. An attacker can craft a perfectly schema-valid payload that still carries harmful semantic content (a tool name that is on the allow-list but parameterised to do harm, a numeric parameter within bounds that encodes a side channel). Schema validation is necessary but not sufficient; the allow-list and semantic policy layers are where actual security decisions are made.

Scenario: the silent skip

An agent orchestrating a multi-step workflow emits a malformed tool call: the tool name is misspelled, so it is technically schema-valid (for example, the schema only requires a string) but resolves to no registered tool. A lenient execution engine silently skips the call, logs a soft warning, and continues to the next step. From the workflow’s perspective the step “succeeded,” yet a required validation check never ran and the resulting state is wrong. Strict validation that treats an unresolvable call as a hard error stops execution and surfaces the failure, so downstream steps never act on a corrupted intermediate state.
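The difference between skipping and failing can be sketched in a few lines. This is a minimal illustration, not a reference dispatcher: the registry TOOLS, the validate_invoice handler, and the ToolCallError exception are hypothetical names introduced here.

```python
from typing import Any, Callable


class ToolCallError(Exception):
    """Raised when a tool call cannot be dispatched; the workflow must stop."""


# Hypothetical handler and registry, for illustration only.
def validate_invoice(params: dict[str, Any]) -> str:
    return f"validated invoice {params['invoice_id']}"


TOOLS: dict[str, Callable[[dict[str, Any]], str]] = {
    "validate_invoice": validate_invoice,
}


def dispatch(call: dict[str, Any]) -> str:
    name = call.get("tool")
    handler = TOOLS.get(name) if isinstance(name, str) else None
    if handler is None:
        # Fail loudly: an unresolvable tool name is an error, never a skip.
        raise ToolCallError(f"unknown tool: {name!r}")
    return handler(call.get("params", {}))


def run_workflow(calls: list[dict[str, Any]]) -> list[str]:
    # Deliberately no try/except: the first bad call aborts the run, so later
    # steps never act on state that the missing step should have checked.
    return [dispatch(call) for call in calls]
```

Calling run_workflow with a misspelled name such as "validate_invioce" raises ToolCallError instead of returning a shorter result list, which is the behaviour the scenario calls for.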

Scenario: the injected parameter name

A code-execution agent receives a tool call in which the parameter name, not its value (which passes the schema check), contains an injected instruction fragment: {"run_script--ignore-safety-check": "true"}. A lenient deserialiser that accepts unknown fields quietly sets the flag. An allow-list that validates not just the tool name but the exact set of permitted parameter names, checked before deserialisation, rejects the call before the payload reaches the execution layer.
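A parameter-name allow-list of this kind might look like the sketch below. It assumes that "deserialisation" means binding the payload to typed arguments, so the raw JSON is first parsed into plain dictionaries and the key names are checked before anything else touches them. The ALLOWED table, the path and timeout_s parameter names, and the RejectedCall exception are hypothetical.

```python
import json


class RejectedCall(Exception):
    """Raised when a call names a tool or parameter outside the allow-list."""


# Hypothetical allow-list: tool name -> permitted parameter names.
ALLOWED: dict[str, frozenset[str]] = {
    "run_script": frozenset({"path", "timeout_s"}),
}


def check_call(raw: str) -> dict:
    # Parse into plain dicts only; nothing is bound to typed objects yet.
    call = json.loads(raw)
    if not isinstance(call, dict):
        raise RejectedCall("tool call must be a JSON object")

    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in ALLOWED:
        raise RejectedCall(f"tool not on allow-list: {tool!r}")

    params = call.get("params", {})
    if not isinstance(params, dict):
        raise RejectedCall("params must be a JSON object")

    # Key names are checked, not just values: any unknown key rejects the call.
    unknown = set(params) - ALLOWED[tool]
    if unknown:
        raise RejectedCall(f"unexpected parameter names: {sorted(unknown)}")
    return call
```

With this check in front of the deserialiser, a payload whose params contain the key "run_script--ignore-safety-check" is rejected by name, regardless of its value.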

How it fails

Parameter names (part of the model’s context, therefore injectable) carry instructions that survive schema validation because schemas are often configured to validate values while leaving unexpected key names unchecked.
Auto-retry loops that attempt to satisfy a poisoned schema eventually produce a conforming payload, laundering the injected content into a validated tool call.
Validation is lenient: unknown fields are silently ignored, type coercions are performed without logging, and malformed calls skip rather than error.
The model is used to evaluate whether its own output is safe, a guardrail that shares the exact vulnerability of the thing it is guarding.

Why the mapped controls work

Typed schemas (Pydantic / JSON Schema) used as a hard gate with no silent skips mean every malformed call surfaces as an explicit error that stops the workflow, rather than a soft failure that lets bad state propagate. A tool-name allow-list checked before parameters are deserialised prevents the deserialisation-before-validation inversion that lets unknown fields influence execution. OPA / Rego for semantic limits is fully deterministic and, for simple rules, typically adds very little latency. It evaluates context-sensitive rules (this tool may not be called during a code freeze; this parameter exceeds the approved range for this user) without involving the model. An output-scanning pipeline catches PII, secret patterns, and exfiltration signatures before the output reaches any downstream system, providing a final deterministic check at the egress boundary of the model’s generation.
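The egress scan described above can be sketched with plain regular expressions. The three patterns here are illustrative assumptions, not a complete rule set; a production scanner would rely on a maintained pattern library and tuned PII detection. The names PATTERNS, scan_output, release, and EgressBlocked are hypothetical.

```python
import re

# Illustrative patterns only; real deployments need a maintained rule set.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "email_address": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


class EgressBlocked(Exception):
    """Raised when model output matches a blocked pattern."""


def scan_output(text: str) -> list[str]:
    # Return the names of every pattern that matches, in a stable order.
    return sorted(name for name, pat in PATTERNS.items() if pat.search(text))


def release(text: str) -> str:
    # Deterministic gate: any finding blocks the output; nothing is redacted
    # and passed through, and the model is never asked for its opinion.
    findings = scan_output(text)
    if findings:
        raise EgressBlocked(f"output blocked: {findings}")
    return text
```

Blocking outright rather than redacting is a design choice made for this sketch; it keeps the gate simple to reason about at the cost of more rejected outputs.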

A concrete example: a database-write tool call constrained by an OPA policy might include a Rego rule such as deny { input.tool == "db_write"; input.params.table == "payroll"; not input.claims.role == "finance" }. This is evaluated deterministically before any model-generated parameter reaches the database driver, regardless of how convincingly the model’s reasoning justified the call.
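To make the rule's logic easier to trace, the same three conditions can be mirrored in Python. This is only an illustration of what the Rego rule above decides, not a replacement for OPA, and it assumes the input has already passed schema validation so that params and claims are dictionaries when present. The function names deny and authorize are hypothetical.

```python
def deny(inp: dict) -> bool:
    """Mirror of the Rego rule: deny payroll writes unless role is finance."""
    return (
        inp.get("tool") == "db_write"
        and inp.get("params", {}).get("table") == "payroll"
        # Like Rego's `not`, a missing role counts as "not finance".
        and inp.get("claims", {}).get("role") != "finance"
    )


def authorize(inp: dict) -> None:
    # Runs before the database driver sees any model-generated parameter.
    if deny(inp):
        raise PermissionError("policy denied: payroll write requires finance role")
```

Note that a call with no claims at all is denied, which matches the rule's behaviour when input.claims.role is undefined.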

First steps

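Add Pydantic (Python) or Zod (TypeScript) schema validation as the first step of your tool-call dispatcher, configured to reject rather than repair. In Pydantic, set strict=True to disable type coercion and extra="forbid" to reject unknown fields; in Zod, call .strict() on object schemas to reject unknown keys. Unknown fields, wrong types, and missing required fields should all raise an error rather than being silently ignored or coerced.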
Write a tool-name allow-list checked before deserialisation: before parsing the model’s output as a tool call, verify the tool name exists in a hard-coded set and reject immediately if it does not. Pair it with a per-tool list of permitted parameter names so that parameter-name injection never reaches the deserialiser.
Deploy an OPA sidecar with at least one semantic Rego rule for your most sensitive tool (e.g. “deny write calls during a declared code freeze”), run it in dry-run mode for one week to confirm it catches violations without false positives, then switch to enforce mode.
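The first of these steps might look like the following sketch, which assumes Pydantic v2. In that version strict=True governs type coercion and extra="forbid" governs unknown fields, so both are set. The DbWriteCall model, its fields, and parse_call are hypothetical names for illustration.

```python
from typing import Literal

from pydantic import BaseModel, ConfigDict, ValidationError


class DbWriteCall(BaseModel):
    # strict=True disables type coercion; extra="forbid" rejects unknown fields.
    model_config = ConfigDict(strict=True, extra="forbid")

    tool: Literal["db_write"]
    table: str
    row_limit: int


def parse_call(payload: dict) -> DbWriteCall:
    # Raises pydantic.ValidationError on any violation. The dispatcher should
    # let that error stop the workflow rather than catch it and continue.
    return DbWriteCall.model_validate(payload)


if __name__ == "__main__":
    try:
        parse_call({"tool": "db_write", "table": "orders", "row_limit": "10"})
    except ValidationError as exc:
        print(f"rejected with {exc.error_count()} error(s)")
```

Under this configuration a string "10" for row_limit, an extra key such as force, and a missing row_limit are all rejected; none is coerced or dropped.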

Threats it governs

When this principle is absent, these threats become reachable.

T2 Tool Misuse: Agent uses authorized tools in unintended ways via deceptive prompts or chained calls.
T5 Cascading Hallucination Attacks: Fabricated outputs propagate via reflection, memory, or multi-agent comms.
T41 Schema Mismatch Leading to Errors: MCP request and server response schemas drift, causing parsing errors or wrong-tool invocation.

Controls that advance it

Catalogue mitigations that strengthen this principle, grouped by the defence-in-depth stage they sit in.

Prevent

Output moderation: An AI agent can produce output that is harmful, deceptive, or factually wrong while still sounding fluent and confident. Output moderation places an independent classifier or moderation model between the agent and its destination, checking every output before it reaches a user or a downstream system. The generating model does not evaluate its own answer; a separate gate does.
Pre-exec check: An LLM produces tool-call arguments through generation, not through a type system, and generation is not reliable. The arguments may be wrong in type, out of range, or assembled in a combination that violates business rules. A pre-execution validation gate intercepts the call before it reaches the tool: a schema pass confirms each argument conforms to the declared JSON Schema, and a policy pass confirms the argument combination is permitted for this agent and this action. The tool executes only when both passes clear.
Plan check: A plan-then-execute agent produces a sequence of steps before acting. If the planner is manipulated, it will emit steps that serve the attacker's goal rather than the user's. Plan-vs-goal validation addresses this by placing an independent validator between the planner and the execution loop: it evaluates each proposed step against the originally-declared goal before the agent is permitted to act on it.
Fail-closed: An agent that is uncertain about what to do next faces a choice: refuse and ask for clarification, or proceed on its best guess. In low-stakes situations that tradeoff is tolerable. In agentic systems that write, delete, or send, a confident-sounding but wrong output can commit an irreversible action. A fail-closed gate resolves that choice structurally: below a configured confidence threshold, the agent stops and escalates rather than guessing.
Static analysis: An agent that can generate and execute code treats code generation as a tool call and code execution as the outcome. If the generated code contains a known-dangerous pattern, no amount of prompt engineering stops it from running once the execute call goes through. Static analysis closes that gap: it scans every code artifact the agent emits against a rule set before execution is permitted, catching the vulnerability patterns the same tooling already catches in human-written code.
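As a rough illustration of the static-analysis gate, the sketch below uses Python's standard ast module to flag calls on a small deny-list before generated code is allowed to run. The BANNED_CALLS set is an illustrative assumption and far narrower than the rule sets of real analysers; the function names are hypothetical.

```python
import ast
from typing import Optional

# Illustrative deny-list; a maintained analyser covers far more patterns.
BANNED_CALLS = {"eval", "exec", "os.system", "subprocess.call"}


def dotted_name(node: ast.AST) -> Optional[str]:
    # Rebuild names such as "os.system" from Name / Attribute nodes.
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base else None
    return None


def find_violations(source: str) -> list[tuple[int, str]]:
    # ast.parse raises SyntaxError on unparsable code, which also blocks it.
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in BANNED_CALLS:
                hits.append((node.lineno, name))
    return sorted(hits)


def may_execute(source: str) -> bool:
    # Execution is permitted only when the scan finds nothing.
    return not find_violations(source)
```

A name-based check like this is easy to evade (for example, from os import system followed by a bare system call), so it is best read as a demonstration of where the gate sits rather than of how thorough it needs to be.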

Detect

No catalogued control.

Respond

No catalogued control.

In Helmwart

Not modelled directly; relates to the validation control family.