L5 · MAESTRO

Evaluation and Observability

Last reviewed 2026-05-08 · Status: published · Order 5 of 7
WHERE L5 LIVES ON THE AGENTIC REFERENCE ARCHITECTURE
[Figure: agentic reference architecture with L5 Evaluation and Observability highlighted. Labelled components: application, user, input, output, AI agents, planning, tool calling, action, short-term memory, model (LLM), function calling, peer agent / MCP, content, code, data, HITL, device, service, long-term memory, vector datastore, audit log.]

L5 governs the output return path and audit/log emissions: the surfaces that make the system observable after the fact.

The Evaluation and Observability layer covers the monitoring, logging, tracing, alerting, and human-review surfaces that let operators understand what an agent has done, detect anomalies in what it is doing, and intervene when something goes wrong. This layer is not just about post-hoc debugging. It is the operational foundation for every detective and corrective control in the system. In a multi-agent system (MAS), this layer has heightened importance: individual agents may appear normal in isolation while the system as a whole drifts, making distributed and cross-agent observability a distinct engineering requirement.

What lives here

  • Distributed tracing pipelines that correlate spans across agent calls, tool invocations, and peer-agent interactions (OpenTelemetry, Jaeger, Zipkin)
  • Structured logging of agent inputs, outputs, tool arguments, and decision rationales
  • Metrics collection: latency, token consumption, tool call frequency, error rates, semantic drift scores
  • Evaluation harnesses that run against agent outputs offline or in canary: LLM-as-judge, embedding-distance drift, human spot-check queues
  • Human-in-the-loop (HITL) interfaces: approval queues, audit review dashboards, escalation paths
  • Alert rules and anomaly detection that fire when observable behaviour departs from baseline
  • Immutable or tamper-evident audit logs: append-only stores, WORM buckets, hash-chained records
  • Continuous evaluation pipelines (CI-eval) that run regression suites against deployed agents
  • Post-incident forensics tooling: replay of agent traces, attribution of actions to identities
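The hash-chained records in the list above can be illustrated with a minimal sketch. It assumes an in-memory list stands in for the append-only store; the names `append_record`, `verify_chain`, and `GENESIS`, and the sample payloads, are hypothetical and not taken from any particular product.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical sentinel used as prev_hash for the first record


def _digest(prev_hash: str, payload: dict) -> str:
    # Canonical JSON (sorted keys) so the same payload always hashes identically.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(log: list, payload: dict) -> dict:
    """Append a record whose hash covers its payload and the previous record's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    record = {
        "prev_hash": prev_hash,
        "payload": payload,
        "hash": _digest(prev_hash, payload),
    }
    log.append(record)
    return record


def verify_chain(log: list) -> int:
    """Return the index of the first broken record, or -1 if the chain is intact."""
    prev_hash = GENESIS
    for i, record in enumerate(log):
        if record["prev_hash"] != prev_hash:
            return i
        if record["hash"] != _digest(prev_hash, record["payload"]):
            return i
        prev_hash = record["hash"]
    return -1


# Illustrative audit entries (made-up agents, tools, and document names).
audit_log = []
append_record(audit_log, {"agent": "analyst", "action": "tool_call", "tool": "fetch_prices"})
append_record(audit_log, {"agent": "report-writer", "action": "publish", "doc": "q3-summary"})
append_record(audit_log, {"agent": "analyst", "action": "tool_call", "tool": "fetch_filings"})
```

Editing any stored payload changes the digest that `verify_chain` recomputes, so the check reports the index of the altered record. A chain like this is only tamper-evident if the latest hash is also anchored somewhere the writer cannot rewrite, such as a WORM bucket or an external transparency log.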

The MAESTRO guide (Cloud Security Alliance, Ken Huang, 2025) identifies a MAS-specific threat at this layer: individual agents may appear to perform normally while collectively exhibiting degradation that only becomes visible in aggregate metrics. This makes cross-agent correlation a first-class L5 requirement, not an optional enhancement.
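One way to see why aggregate metrics matter is a toy calculation, sketched here under stated assumptions: spans are plain dictionaries with hypothetical `trace_id`, `agent`, and `ok` fields, and the 0.85 threshold is invented for illustration. Each agent clears the threshold on its own, while the end-to-end success rate per trace does not.

```python
from collections import defaultdict

THRESHOLD = 0.85  # hypothetical minimum acceptable success rate


def per_agent_ok_rate(spans):
    """Success rate computed separately for each agent."""
    totals, oks = defaultdict(int), defaultdict(int)
    for span in spans:
        totals[span["agent"]] += 1
        oks[span["agent"]] += 1 if span["ok"] else 0
    return {agent: oks[agent] / totals[agent] for agent in totals}


def end_to_end_ok_rate(spans):
    """Fraction of traces in which every span succeeded."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span["ok"])
    return sum(all(results) for results in by_trace.values()) / len(by_trace)


# Ten traces, three agents per trace; each agent fails on a different trace.
spans = [
    {"trace_id": t, "agent": agent, "ok": t != fails_on}
    for agent, fails_on in (("data-fetcher", 0), ("analyst", 1), ("report-writer", 2))
    for t in range(10)
]

agent_rates = per_agent_ok_rate(spans)
pipeline_rate = end_to_end_ok_rate(spans)

agents_below_threshold = [a for a, rate in agent_rates.items() if rate < THRESHOLD]
pipeline_below_threshold = pipeline_rate < THRESHOLD
```

In this toy data no individual agent is flagged, but the pipeline is, because failures in different agents land on different traces. Grouping by a shared trace identifier is what makes that visible.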

Concrete example: A financial-analysis platform runs three Semantic Kernel agents (a data-fetcher, an analyst, and a report-writer), all instrumented with OpenTelemetry. Without cross-agent span correlation, a gradual increase in hallucinated figure citations by the analyst agent is invisible in per-agent logs, because each response passes its own plausibility check. Only when an L5 evaluation harness computes embedding-distance drift across the full pipeline does the operator see that the analyst’s outputs have drifted to a cosine distance of 0.4 from the established baseline over 72 hours. That measurement triggers an alert before the report-writer publishes.
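The drift measurement in this example could be computed roughly as follows. This is a sketch only: it assumes drift is measured as the cosine distance between the centroid of baseline embeddings and the centroid of a recent window, which is one of several reasonable definitions. The two-dimensional vectors and the 0.3 alert threshold are made up for illustration; real embeddings would come from an embedding model.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def drift_alert(baseline, window, threshold):
    """Return (distance, should_alert) for a recent window against the baseline."""
    distance = cosine_distance(centroid(baseline), centroid(window))
    return distance, distance >= threshold


# Toy embeddings: the recent window has rotated away from the baseline direction.
baseline_embeddings = [[1.0, 0.0], [1.0, 0.0]]
recent_embeddings = [[0.6, 0.8], [0.6, 0.8]]

distance, should_alert = drift_alert(baseline_embeddings, recent_embeddings, threshold=0.3)
```

With these toy vectors the distance comes out at approximately 0.4, so the alert fires. Comparing centroids smooths over single outlier responses, at the cost of missing drift that affects only a small fraction of outputs.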

Threats that target this layer

  • T8 Repudiation and Untraceability: because an agent may deny or obscure its actions, the observability layer must capture a complete, tamper-resistant record. Gaps in logging, mutable audit records, or missing action attribution directly enable T8. Every OWASP v1.1 T8 mitigation is primarily an L5 control.
  • T10 Overwhelming Human-in-the-Loop: if the human-in-the-loop interface is the primary safety control, it becomes an attack surface: adversarial workloads can generate approval queues large enough that human reviewers approve without adequate scrutiny. Effective HITL design at L5 includes workload management, fatigue-aware routing, and escalation policies.
  • T5 Cascading Hallucination Attacks: observability tooling that measures semantic quality (embedding drift, factual consistency scores, downstream citation accuracy) provides the only reliable signal that hallucination rates are elevated above baseline. Without this, a cascade can persist through many agent turns before an operator notices.
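The workload management and escalation ideas named under T10 can be sketched as a simple router. Everything here is an assumption made for illustration: the `Reviewer` class, the per-reviewer hourly cap, and the choice to escalate rather than auto-approve when every reviewer is at capacity.

```python
from dataclasses import dataclass


@dataclass
class Reviewer:
    name: str
    max_per_hour: int  # hypothetical cap on approvals a reviewer handles per hour
    assigned: int = 0  # requests already routed to this reviewer in the current hour


def route(reviewers):
    """Route one approval request to the least-loaded reviewer, or escalate.

    Returns ("review", reviewer_name) or ("escalate", None). The request is
    never auto-approved: when everyone is at capacity it fails closed.
    """
    available = [r for r in reviewers if r.assigned < r.max_per_hour]
    if not available:
        return ("escalate", None)
    # Relative load, so a reviewer with a small cap is not overloaded first.
    chosen = min(available, key=lambda r: r.assigned / r.max_per_hour)
    chosen.assigned += 1
    return ("review", chosen.name)


reviewers = [Reviewer("alice", max_per_hour=2), Reviewer("bob", max_per_hour=1)]
decisions = [route(reviewers) for _ in range(4)]
```

The design point is the fail-closed branch: a flood of adversarial requests exhausts the caps and lands in an escalation path instead of pressuring reviewers to approve faster.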

Mitigations anchored here

  • behavioural divergence monitoring: continuously measure agent output against a declared semantic baseline. Flag statistically significant departures from expected output distribution before they accumulate into visible harm. The primary L5 control for catching T5 and T7 drift early.
  • goal-consistency monitoring: at each agent turn, evaluate whether the agent’s declared intent is consistent with its previous turns and its stated objective. Inconsistency is an early signal of goal substitution (T6) or memory poisoning (T1) before the effect is observable in tool calls.
  • multi-source verification: for claims the agent makes that will be acted on, corroborate against at least two independent retrieval sources before propagating the claim. Applies both at evaluation time (offline) and in a live canary posture.
  • human dual-control: route high-consequence actions through two independent reviewers. Provides a structural check on the human-approval surface (T10) that is independent of whether any single reviewer was fatigued or deceived.
  • plan-vs-goal validation: validate agent plans before execution and record the validation decision in the audit log. The audit record is an L5 artifact; the execution guard is an L3 control. Both are required for full coverage.
  • legal-hold / WORM retention: when an agent is involved in a regulated action or an incident begins, activate a legal-hold policy that prevents log deletion, rotation, or modification. Preserves the audit record that T8 attacks attempt to erase.
  • Sigstore signing: sign pipeline artifacts and evaluation results with Sigstore/Rekor to produce a tamper-evident record of what evaluation ran, when, and what it found. Prevents post-hoc alteration of evaluation results.
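As an illustration of the human dual-control item above, the following sketch decides whether a high-consequence action may proceed. The function name, the tuple format for approvals, and the rules that any rejection blocks the action and that the requester cannot approve their own request are assumptions, not requirements stated by MAESTRO.

```python
def dual_control_decision(approvals, requester, required=2):
    """Decide on a high-consequence action from a list of reviewer votes.

    approvals: list of (reviewer_id, approved) tuples in arrival order.
    Returns "rejected", "approved", or "pending".
    """
    # Assumed policy: a single rejection blocks the action outright.
    if any(not approved for _, approved in approvals):
        return "rejected"
    # Count distinct approvers only, and never count the requester.
    distinct_approvers = {
        reviewer_id
        for reviewer_id, approved in approvals
        if approved and reviewer_id != requester
    }
    if len(distinct_approvers) >= required:
        return "approved"
    return "pending"


# Illustrative votes (made-up reviewer identities).
duplicate_votes = dual_control_decision([("alice", True), ("alice", True)], requester="carol")
two_reviewers = dual_control_decision([("alice", True), ("bob", True)], requester="carol")
one_rejection = dual_control_decision([("alice", True), ("bob", False)], requester="carol")
self_approval = dual_control_decision([("carol", True), ("alice", True)], requester="carol")
```

Counting distinct reviewer identities is what makes the control structural: one fatigued or deceived reviewer clicking approve twice still leaves the action pending.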

How L5 relates to its neighbours

L5 sits directly above L4 Deployment Infrastructure, which provides the substrate (log forwarding agents, metrics exporters, storage backends) that L5 depends on. If L4 is compromised in a way that silences telemetry, L5 loses its visibility. Hardening log infrastructure (immutable storage, network-isolated log collectors) is an L4 concern that serves L5's function.

Above L5 is L6 Security and Compliance, the vertical band. L6 policies determine what must be logged, how long records must be retained, and who may access audit data. L5 provides the mechanism; L6 provides the mandate and the governance accountability structure.


Observability is not a security afterthought in agentic systems. It is a primary control. An agent that cannot be traced, evaluated, or interrupted provides no meaningful safety guarantee regardless of how carefully its model, data, and framework layers were hardened. L5 is where that accountability is operationalised.

All threats tagged to this layer

Every threat whose maestroLayers list includes L5. The prose above may discuss a subset; this list is the complete index.

Upstream sources