← Atlas · Mitigations Tier 2 · Real-composable

MITIGATION · m-hitl-feedback-loop

HITL feedback-loop calibration — reviewer overrides fed back into agent tuning

An agent at a human-in-the-loop gate will be overridden when its decisions do not match the reviewer's judgment. Without a return path, those corrections are discarded: the same miscalibration surfaces again in the next review cycle and the one after that. A feedback loop closes that gap by capturing each override event as a structured record, accumulating those records into a calibration dataset, and using patterns in that dataset to drive targeted changes to the agent's system prompt, tool-scope policy, or divergence-monitor thresholds. A well-calibrated agent produces fewer out-of-distribution decisions, so the review queue contracts over time.

Last reviewed 2026-05-12 · Status: published · Evidence →

At a glance

MATURITY
Tier 2
Available off-the-shelf or as a documented pattern, but newer or less broadly proven. Expect integration work and some operational nuance.
PLACES ON
node · edge
Restricted to node kinds: agent, hitl-gate
COVERAGE
2 threats
T6 · T10
TRADE-OFFS
LAT
low
COST
medium
UX
low
DEV
high
Latency · cost · UX friction · dev effort.
TL;DR
  • Every reviewer override at the HITL gate is recorded as a structured event: agent decision, reviewer decision, rationale, and context.
  • Override events accumulate into a calibration dataset; batch analysis identifies systematic over-permissiveness, over-restriction, and confidence-mismatch clusters.
  • Patterns drive targeted changes to agent system prompts, tool-scope policies, and divergence-monitor thresholds. Each change requires a human sign-off on the calibration report before deployment.
  • A well-calibrated agent produces fewer out-of-distribution decisions, so the review queue contracts over time.

How it behaves

Reviewer overrides an agent decision at the HITL gate
Override event captured; does the accumulated pattern exceed the minimum event threshold for calibration?
Generate calibration report; human signs off; deploy prompt / policy / threshold changes
Log event; continue accumulating; hold calibration until threshold is reached
Calibration deployment is always gated by human approval. The agent cannot close its own feedback loop.

What it is

A human-in-the-loop gate surfaces agent decisions for reviewer approval, but the gate alone does not improve the agent. Each override tells you the agent was wrong for that decision class; without a return path, that information is discarded. The same miscalibration that produced today's override will produce tomorrow's, and the review queue does not shrink.

A feedback loop adds that return path. It has three components:

  1. Override capture. When a reviewer overrides an agent decision (approves what the agent refused, refuses what the agent recommended, or selects a narrower action than the agent proposed), the event is recorded with the agent's original decision, the reviewer's decision, the reviewer's stated rationale, and the context at the time of the decision. This record is produced by decision summary cards and logged by actor-recorder separation.

  2. Pattern analysis. Override events accumulate into a calibration dataset. Patterns are identified: systematic over-permissiveness (the agent approves what reviewers consistently refuse), systematic over-restriction (the agent refuses what reviewers consistently approve), and context clusters where the agent's confidence signal does not predict reviewer agreement.

  3. Calibration deployment. Patterns drive targeted changes: prompt updates that address the over-permissive or over-restrictive behaviour, tool-scope policy updates that narrow or widen the default tool grants that produced the pattern, and behavioural divergence monitoring threshold adjustments tuned to match the decision surface reviewers are actually drawing. Each change requires a human sign-off on the calibration report before deployment.

The human sign-off requirement is the structural guard against the feedback loop becoming an attack surface. An agent cannot manipulate its own calibration by generating override events that push its thresholds in a permissive direction, because the calibration report must be reviewed and approved by a human before any change deploys.

Detection signals

  • Reviewer override rate per agent. A sustained increase means the agent's behaviour has drifted from its calibrated baseline, or a new task class has entered the distribution that training data did not cover.
  • Post-calibration override rate delta. A rate that is stable or rising after a calibration deployment means the change did not address the root cause.

Threats it covers

  • WHY IT HELPS Intent Breaking and Goal Manipulation is the class of attack in which an agent's behaviour is shifted away from its intended goal without the operator detecting or correcting the drift. Override capture makes each reviewer correction a structured, queryable event; pattern analysis over those events identifies systematic drift before it becomes entrenched; calibration deployment restores the intended behaviour. The loop directly shortens the window during which a goal-manipulation campaign can operate undetected.

  • WHY IT HELPS Overwhelming the HITL gate is the threat that a sustained volume of agent decisions requiring human intervention exhausts reviewer capacity and degrades oversight quality. A well-calibrated agent produces fewer out-of-distribution decisions, so the review queue contracts over time. The feedback loop is the mechanism by which calibration data generated by queue activity is converted into prompt and policy changes that reduce that activity.

Principle coverage

Defence-in-Depth stage: Detect — and it advances:

  • Human Oversight (HITL / HOTL) The feedback loop makes human oversight self-improving: each reviewer correction is captured as structured data, and accumulated patterns drive calibration changes that reduce the volume of decisions requiring review, so oversight capacity is concentrated where the agent remains genuinely uncertain rather than diluted across decisions it has already learned to handle.

Design & governance principles (open design, economy of mechanism, accountability, …) are architectural, not advanced by a single placed control.

Implementation options

There is no single product that ships override-driven agent calibration end-to-end. The options below cover the two calibration techniques that supply the underlying mathematics, the two annotation platforms that provide the override-capture and dataset layer, and a managed fine-tuning API for teams that want to go beyond prompt updates to full model fine-tuning. Deploy them in combination: capture layer first, calibration technique second.

RLHF Train a reward model on pairwise comparisons between agent decisions and reviewer overrides, then use that reward model to fine-tune the agent via proximal policy optimisation (PPO).

Why choose it: Best when override volume is high (thousands of events per calibration cycle) and the miscalibration is deep enough that prompt updates alone cannot correct it. The foundational technique for feedback-driven agent alignment, established by Christiano et al. 2017 (arxiv:1706.03741). Requires substantial infrastructure: a reward model training pipeline, a PPO training loop, and compute for both. Override events map directly to the preference pairs the reward model trains on: the reviewer's decision is the preferred output; the agent's overridden decision is the rejected output.

More details:

DPO Fine-tune the agent model directly on preference pairs: (prompt, agent decision, reviewer override) triples, using a classification loss rather than a learned reward model and PPO training loop.

Why choose it: Best when you want the accuracy gains of RLHF with substantially lower compute and infrastructure overhead. Rafailov et al. 2023 (arxiv:2305.18290) show DPO matches or exceeds PPO-based RLHF on summarisation and dialogue tasks. Each override event maps cleanly to a DPO training example: the prompt is the task context, the chosen completion is the reviewer's preferred decision, the rejected completion is the agent's overridden decision. Can be run via the OpenAI fine-tuning API (which supports DPO natively with JSONL preference pairs) or via self-hosted pipelines (trl library, Hugging Face).

More details:

LangSmith Push each HITL override event as a run into a LangSmith annotation queue; capture reviewer feedback via the annotation SDK; add corrected runs to a versioned dataset for downstream fine-tuning.

Why choose it: Best as the capture and dataset layer when your agent pipeline already uses LangSmith for tracing. Annotation queues present one run at a time and accept structured feedback configs (continuous scores, categorical choices, freeform text). The feedback API lets you retrieve submitted reviewer scores for export. Dataset versioning is native: runs from queues can be added to LangSmith datasets and exported as JSONL for DPO or SFT fine-tuning. Does not provide the calibration compute layer; pair with DPO or prompt-versioning downstream.

More details:

Argilla Self-hosted open-source annotation platform; attach the agent's original decision as a model suggestion on each record; reviewers accept, correct, or override it; export the resulting dataset in Hugging Face Datasets format for DPO or SFT fine-tuning.

Why choose it: Best when data sovereignty requires that override data never leave your infrastructure, or when you need a fully open-source pipeline with no vendor dependency. The suggestion system (rg.Suggestion with confidence score) lets you pre-attach the agent's decision as a suggestion; the reviewer's correction becomes the accepted annotation, which is exactly the (rejected, chosen) pair DPO requires. Export is via the Python SDK to Hugging Face Datasets format; from there, the trl library or any DPO trainer can consume the data directly. Requires self-hosted deployment (Docker or Kubernetes).

More details:

OpenAI fine-tuning API Upload reviewer override events as JSONL preference pairs to the OpenAI fine-tuning API; select the DPO method; fine-tune a GPT-4o or GPT-4.1 variant on the override dataset without managing GPU infrastructure.

Why choose it: Best when the agent is built on an OpenAI model and you want managed DPO compute without running your own training cluster. The API supports DPO natively: each training example is a (prompt, chosen, rejected) triple where chosen is the reviewer's preferred decision and rejected is the agent's overridden decision, which is exactly what an override event produces. Also supports SFT if the team prefers to fine-tune on reviewer decisions alone rather than preference pairs. Cost is per-token of training data. Human sign-off on the override dataset before upload is still required; the API does not enforce an approval gate on training data.

More details:

Trade-offs

  • Override capture adds negligible latency to the reviewer path; feedback is async. The calibration cycle is the real cost: prompt updates can deploy in hours, but DPO or RLHF fine-tuning typically takes days to weeks of engineering and compute. During that window the miscalibration persists in production and continues to generate review burden.
  • Cost splits across two layers: capture and dataset tooling (LangSmith or Argilla) is low; fine-tuning compute (OpenAI fine-tuning API or self-hosted GPU) is medium to high and scales with dataset size and model size. Prompt-only calibration has near-zero compute cost but limited scope.
  • If reviewers are themselves inconsistent, fatigued, or biased, the feedback loop encodes reviewer error into agent policy. Override-pattern audits by a senior reviewer are the structural check. Pair with m-adaptive-workload to reduce fatigue-driven inconsistency in the source data.
  • A calibration dataset below roughly 50 override events is too thin for statistically reliable pattern detection; below roughly 200 events, DPO training produces unstable results. Set a minimum event threshold per calibration window and do not trigger fine-tuning runs on sparse data.

When NOT to use

  • Do not use for agents with fewer than 50 override events per calibration window. The pattern-analysis signal is too thin to distinguish systematic miscalibration from reviewer variance; run a manual baseline review instead.
  • Do not use when the review population is fewer than three active reviewers. The feedback loop risks encoding a single reviewer's idiosyncrasies into agent policy. Fix reviewer coverage before enabling calibration.
  • Do not use in domains where the correct answer shifts on a timescale shorter than your calibration cycle (for example, rapidly-changing regulations or live market conditions). Calibration data collected two weeks ago may already be stale; use explicit, manually authored policy updates for high-drift domains instead.
  • Do not allow the agent to trigger its own calibration runs. The human sign-off gate on the calibration report is the structural guard against prompt-injection attacks that attempt to manipulate the agent's own tuning by generating synthetic override events.

Limitations

  • Loop latency, the time from override capture to deployed calibration, can be weeks when fine-tuning is involved. During that window the miscalibration persists. Prompt updates are faster (hours) but limited in the scope of behaviour they can correct.
  • Prompt-level calibration does not propagate evenly across model temperatures and sampling strategies. A prompt change that corrects behaviour at temperature 0.0 may not produce the same correction at temperature 0.7.
  • DPO and RLHF fine-tuning correct behaviours present in the override dataset but do not generalise reliably to task classes not represented in the training distribution. Monitor the override rate on held-out task classes after each calibration deployment.
  • The override dataset is a lagging signal. It captures miscalibrations that have already generated reviewer burden; it cannot prevent novel miscalibrations on task classes the agent has not yet encountered. Pair with m-goal-consistency and m-divergence-monitor for earlier-signal controls.

Maturity tier reasoning

  • Tier 2 fits because every component primitive is production-available: RLHF and DPO are established techniques (arxiv:1706.03741, arxiv:2305.18290); LangSmith annotation queues and Argilla are actively maintained annotation platforms; the OpenAI fine-tuning API ships DPO as a documented method. The closed-loop assembly, override schema, pattern analysis, and calibration deployment with human sign-off, is bespoke per deployment with no industry-standard product that implements it end-to-end for agentic AI.
  • The override-to-fine-tune pipeline has no standardised schema for agentic AI override events. Every deployment designs its own (agentDecision, reviewerDecision, overrideType, contextSnapshot) structure; there is no published interoperability standard between annotation platforms and agentic-AI calibration pipelines.
  • Fine-tuning cadence norms for agentic AI, including how frequently to retrain, what minimum override volume to require, and how to evaluate calibration success, are emerging practice, not settled operations. Teams calibrating for the first time should expect to iterate on these parameters for three to six months before the loop is operationally stable.

Last verified against upstream docs: 2026-05-30.