Written for: Head of Security, CTO

Runtime monitoring for AI agents

Runtime monitoring for AI agents means more than application logs. What to instrument, how to detect abuse in production, and how to bound damage early.

By Giovanni Salvador · 12 June 2026 · 6 min read

Everything before runtime defence is about lowering the probability that something goes wrong. Runtime monitoring starts from the opposite premise: assume something will.

The security teams I work with in financial services have, by and large, figured out that AI agents need guardrails before they ship. What they have not always figured out is what happens after. Build-time controls lower the hit rate of attacks; they do not eliminate it. An injection your classifier did not recognise, a connector that returns a poisoned response, a model that drifts from its known-good behaviour: all of these produce consequences in production, after the guardrails have already run. Runtime defence is the layer that observes the system in production, detects the abuse the build-time controls missed, and contains a misbehaving agent before its blast radius becomes a reportable incident.

The agenda is narrow by design: detect abuse from the agent’s own telemetry, and contain it with automatic, bounded circuit-breaking. The incident-response process that consumes detection and containment signals is a separate domain. But without the signals, no process helps.

The stake

The further right an agent sits on the autonomy spectrum, from suggesting actions to drafting for approval to acting with guardrails to acting without any human in the loop, the more a single bad decision can do before anyone sees it. An autonomous agent that has been turned by an injection does not ask permission. The first signal is often the consequence: an unexpected refund issued, a fraud flag suppressed, a customer record mutated.

Detection narrows the window between compromise and discovery. Containment bounds the damage regardless of whether detection has fired. Neither replaces build-time guardrails or least-privilege scoping, but without detection and containment those controls have no fallback when they miss.

What to log: the agent-action schema

A conventional application log captures requests, responses, and errors. That is necessary but not sufficient for an agent. The security-relevant events in an agentic system are the model’s reasoning and actions, not just its HTTP traffic. A complete agent-action log covers five categories of event:

Prompts and context provenance. The user turn, the system prompt version in force, and the provenance of every span of retrieved or tool-sourced content concatenated into the context window, tagged with its trust level. When you investigate an indirect injection, the question is which untrusted span carried the malicious instruction. You can only answer it if provenance was logged at ingestion.
Decisions and plans. The agent’s intermediate reasoning, the plan it formed, and the specific decision that led to each action. For a multi-step agent this is the difference between “the agent transferred funds” and “the agent decided to transfer funds because step 4 of its plan misinterpreted a poisoned tool description.”
Tool calls. Every tool or connector invocation: which tool, with what arguments, under which identity and scope, with what authorisation result. This is the highest-value stream for security, because tool calls are where an agent reaches the real world.
Connector responses. What each tool or MCP server returned, so you can see a poisoned or anomalous response and trace its downstream effect on the next decision.
Outcomes and effects. The final output, any state change effected, and the guardrail and policy decisions applied on the way.

Three properties make this schema usable rather than just voluminous. First, every record carries a trace identifier so a single agent task reconstructs as one causal chain across many model calls and tool invocations. Second, the schema is machine-parseable, not free-text log lines, so the detection layer and your SIEM can query it. Third, the logs are tamper-evident and write-once for security-relevant fields. An attacker who reaches the agent should not also be able to rewrite the record of what it did.
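
The three properties can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the record fields, the `append` and `verify` helpers, and the choice of a SHA-256 hash chain as the tamper-evidence mechanism are all my assumptions. It also assumes the latest digest is anchored somewhere the agent cannot write to.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Any

GENESIS = "0" * 64  # placeholder hash for the first record in a chain


@dataclass(frozen=True)
class AgentActionRecord:
    trace_id: str            # groups every record of one agent task
    seq: int                 # position in the chain
    kind: str                # e.g. "tool_call", "connector_response", "outcome"
    payload: dict[str, Any]  # structured fields, not a free-text line
    prev_hash: str           # digest of the previous record

    def digest(self) -> str:
        # Canonical JSON so the same record always hashes to the same value.
        body = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(chain: list[AgentActionRecord], trace_id: str, kind: str,
           payload: dict[str, Any]) -> AgentActionRecord:
    prev = chain[-1].digest() if chain else GENESIS
    record = AgentActionRecord(trace_id, len(chain), kind, payload, prev)
    chain.append(record)
    return record


def verify(chain: list[AgentActionRecord], head_digest: str) -> bool:
    """True only if every link matches and the last digest equals head_digest."""
    prev = GENESIS
    for record in chain:
        if record.prev_hash != prev:
            return False
        prev = record.digest()
    return prev == head_digest
```

Rewriting any earlier record changes its digest, so the next record's `prev_hash` no longer matches and `verify` fails. The chain only proves integrity if `head_digest` is held outside the agent's reach.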

One tension to name plainly. The same completeness that makes logs useful for detection makes the log store itself a concentrated repository of regulated data. Prompts and connector responses routinely contain customer personally identifiable information. A verbose, long-retained, broadly readable log becomes one of the exfiltration channels you are trying to prevent. The resolution is not to log less of the decision trail. It is to apply data-minimisation, pseudonymisation, and retrieval-authorisation controls to the logs themselves. Your agent logs are a regulated-data asset.
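
As one hedged illustration of pseudonymisation at log-write time, the sketch below replaces email addresses with a keyed token. The regular expression, the token format, and the `pseudonymise` helper are assumptions for illustration; a real deployment would cover more identifier types and keep the key in a secrets manager.

```python
import hashlib
import hmac
import re

# Deliberately simple email pattern; real PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymise(text: str, key: bytes) -> str:
    """Replace each email with a stable keyed token before the text is logged."""
    def _token(match: re.Match) -> str:
        value = match.group(0).lower().encode("utf-8")
        mac = hmac.new(key, value, hashlib.sha256).hexdigest()
        return f"<email:{mac[:16]}>"
    return EMAIL_RE.sub(_token, text)
```

Because the token is a keyed HMAC rather than a plain hash, the same customer still correlates across records for an investigator, but the address cannot be recovered by anyone who reads the log without the key.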

Four detection signals

Logging makes the system observable. Detection makes it defensible. Layer four complementary signals over the agent-action stream:

Injection and jailbreak signatures. Pattern- and classifier-based detection of known injection and jailbreak attempts in inbound content and in the agent’s outputs. Seed the signature set from the OWASP Top 10 for LLM Applications and your own incident history. This catches the known and the lightly mutated. A novel phrasing will pass. That is exactly why signatures are paired with the behavioural signals below.
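
A minimal sketch of the pattern-based half of this signal follows. The three patterns and their names are illustrative assumptions of mine, not an OWASP-published rule set, and a production deployment would pair them with a classifier.

```python
import re

# Hypothetical signature set; seed a real one from your own incident history.
SIGNATURES = {
    "ignore_previous": re.compile(
        r"\bignore\s+(all\s+)?(previous|prior|above)\s+instructions\b", re.I),
    "role_override": re.compile(r"\byou\s+are\s+now\b", re.I),
    "prompt_exfil": re.compile(
        r"\b(reveal|print|show)\s+(your\s+)?(system\s+prompt|instructions)\b", re.I),
}


def match_signatures(text: str) -> list[str]:
    """Return the names of every signature that fires on the text."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(text)]
```

A phrasing such as "Disregard what you were told earlier" matches none of these patterns, which is the limitation the behavioural signals are there to cover.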

Anomalous tool-call detection. Alert on tool calls that do not fit the task: a connector invoked with arguments outside its normal distribution, a read-only workflow suddenly issuing writes, a tool called with a scope it has never used, or a connector response that does not match the request. This is the highest-leverage detection in an agentic estate. Tool calls are where injection and tool abuse become real-world effects, and where connector supply-chain compromise first shows itself in the logs.
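
Two of these checks, a read-only workflow issuing a write and a tool called with a scope it has never used, are simple enough to sketch as rules. The `ToolCall` and `WorkflowProfile` types and the finding names are hypothetical; argument-distribution and response-mismatch checks would need statistical models this sketch leaves out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    workflow: str
    tool: str
    scope: str
    is_write: bool


@dataclass
class WorkflowProfile:
    read_only: bool
    seen_scopes: set[tuple[str, str]]  # (tool, scope) pairs observed historically


def tool_call_findings(call: ToolCall,
                       profiles: dict[str, WorkflowProfile]) -> list[str]:
    """Return rule findings for one tool call; an empty list means no alert."""
    profile = profiles.get(call.workflow)
    if profile is None:
        return ["unknown_workflow"]
    findings = []
    if profile.read_only and call.is_write:
        findings.append("write_from_read_only_workflow")
    if (call.tool, call.scope) not in profile.seen_scopes:
        findings.append("scope_never_used")
    return findings
```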

Behavioural baselining. Establish a baseline of normal behaviour per agent and per workflow: typical tool sequences, call rates, data volumes, decision patterns. Alert on deviation. Baselining catches the abuse that has no signature: a model drifting from its known-good behaviour because a corpus document was poisoned, or an agent gradually walked off-task by a low-and-slow injection. One design note: a low-and-slow attacker can also shift the baseline itself a little at a time, training the detector to accept the abuse. Baseline updates should themselves be change-controlled, with a defined approval process and tighter interim containment limits while a new baseline beds in.
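
One simple way to express this is a z-score against a frozen baseline that can only be replaced with a named approver. The metric (tool calls per minute), the threshold of three standard deviations, and the function names are assumptions for illustration; real baselines usually need more than one metric.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    mean_rate: float
    std_rate: float
    approved_by: str  # change control: who signed off this baseline


def build_baseline(samples: list[float], approved_by: str) -> Baseline:
    """Create a baseline from historical samples; refuses unapproved updates."""
    if not approved_by:
        raise ValueError("baseline updates need a named approver")
    return Baseline(statistics.mean(samples), statistics.pstdev(samples), approved_by)


def deviates(baseline: Baseline, observed: float, z_threshold: float = 3.0) -> bool:
    """True when the observation sits more than z_threshold deviations out."""
    if baseline.std_rate == 0:
        return observed != baseline.mean_rate
    return abs(observed - baseline.mean_rate) / baseline.std_rate > z_threshold
```

Because `Baseline` is frozen and never updates itself from live traffic, an attacker cannot drift it by behaving slightly worse each day. Moving it requires a deliberate `build_baseline` call with an approver attached.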

SOC and SIEM integration. None of the above is useful in a separate console. Agent telemetry and alerts should flow into the same SOC and SIEM that runs the rest of the estate, correlated with identity, network, and endpoint events. An anomalous tool call correlated with an unusual identity assertion is a far stronger signal than either alone. Each AI alert class needs a named triage owner and a playbook, not a best-efforts sideline for the AI team. Budget for the false-positive tuning loop before rollout: signatures fire on benign content, baselines produce false positives during onboarding, and an estate of dozens of agents multiplies both.

Containment: bounding damage without knowing why

Detection is probabilistic. An agent abused competently may stay within every baseline until the moment it acts. That is why detection is paired with containment, which bounds the damage regardless of whether detection has fired.

Five controls compose into a defensible containment design:

Per-action limits. Bound any single action to a safe envelope: maximum transaction value, maximum record count per operation, restricted counterparty destinations. A per-action cap means no single agent decision, however wrong, exceeds a tolerable threshold.
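
A per-action envelope can be sketched as a plain pre-execution check. The limit values, field names, and violation labels here are hypothetical; the real thresholds are whatever you agree with risk and finance.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ActionLimits:
    max_value: Decimal                  # largest single transaction allowed
    max_records: int                    # most records one operation may touch
    allowed_destinations: frozenset[str]


def check_action(limits: ActionLimits, value: Decimal, records: int,
                 destination: str) -> list[str]:
    """Return the violated limits; an empty list means the action may proceed."""
    violations = []
    if value > limits.max_value:
        violations.append("value_over_cap")
    if records > limits.max_records:
        violations.append("too_many_records")
    if destination not in limits.allowed_destinations:
        violations.append("destination_not_allowed")
    return violations
```

The check runs outside the model, on the tool-call arguments, so it holds even when the agent's reasoning has been compromised.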

Aggregate rate and spend limits. Bound the agent over a window: actions per minute, total value moved per hour, total model-inference spend per day. These catch the failure mode per-action limits miss: many individually small actions that aggregate into a large effect. Spend limits also bound the denial-of-wallet failure mode, where an abused or looping agent runs up significant inference cost.
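
A sliding-window counter is one common way to implement an aggregate limit. The sketch below is an assumption about shape, not a prescribed design. The caller passes `now` explicitly so the logic is easy to test; in production it would typically come from `time.monotonic()`.

```python
from collections import deque


class WindowLimit:
    """Caps the total amount (value moved, spend, or call count) per time window."""

    def __init__(self, max_total: float, window_seconds: float):
        self.max_total = max_total
        self.window_seconds = window_seconds
        self._events: deque[tuple[float, float]] = deque()  # (timestamp, amount)
        self._total = 0.0

    def allow(self, amount: float, now: float) -> bool:
        # Drop events that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, expired = self._events.popleft()
            self._total -= expired
        if self._total + amount > self.max_total:
            return False  # would breach the aggregate cap
        self._events.append((now, amount))
        self._total += amount
        return True
```

Three transfers of 40 each pass a per-action cap of 50, but the third is refused by a window cap of 100 per hour. That is the many-small-actions case per-action limits miss.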

Circuit breakers. Trip automatically on a defined condition: an error-rate spike, a run of anomalous tool calls fed by the detection layer, a breached rate or spend limit. A circuit breaker converts a detection signal or a breached limit into an automatic halt, closing the gap between detection and response without waiting for a human to act.
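
A minimal breaker that trips on a run of anomalous tool calls, or on an explicit signal such as a breached limit, might look like the sketch below. The threshold of three, the reason strings, and the requirement for a named approver on reset are illustrative assumptions.

```python
from typing import Optional


class CircuitBreaker:
    """Halts an agent automatically; stays open until a human resets it."""

    def __init__(self, max_consecutive_anomalies: int = 3):
        self.max_consecutive_anomalies = max_consecutive_anomalies
        self._run = 0
        self.is_open = False
        self.reason: Optional[str] = None

    def record(self, anomalous: bool) -> None:
        """Feed one detection result; trips after a run of anomalies."""
        if self.is_open:
            return
        self._run = self._run + 1 if anomalous else 0
        if self._run >= self.max_consecutive_anomalies:
            self.trip("anomalous_tool_call_run")

    def trip(self, reason: str) -> None:
        """Open the breaker directly, e.g. on a breached rate or spend limit."""
        self.is_open = True
        self.reason = reason

    def allow(self) -> bool:
        """The agent runtime checks this before every tool call."""
        return not self.is_open

    def reset(self, approved_by: str) -> None:
        if not approved_by:
            raise ValueError("reset needs a named approver")
        self.is_open = False
        self.reason = None
        self._run = 0
```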

Kill switch and safe-stop. A reliable, fast, and tested means to stop a single agent, a class of agents, or the whole fleet, immediately and without a deploy. Two properties matter and are routinely missed. First, it must be a safe-stop: halting mid-task should leave the system in a coherent, recoverable state, not a half-completed money movement. Second, it must be tested. An untested kill switch is a comfort, not a control.
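
The safe-stop property can be illustrated with a switch that is only honoured between steps, where each step is an atomic unit of work. This is a sketch under that assumption: `KillSwitch` and `run_task` are hypothetical names, and making a real money movement atomic is the hard part it does not solve.

```python
import threading
from typing import Callable


class KillSwitch:
    """A flag that can be set from another thread or a control plane."""

    def __init__(self) -> None:
        self._stopped = threading.Event()

    def stop(self) -> None:
        self._stopped.set()

    def is_stopped(self) -> bool:
        return self._stopped.is_set()


def run_task(steps: list[tuple[str, Callable[[], None]]],
             kill_switch: KillSwitch) -> dict:
    """Run steps in order, checking the switch before each one.

    Assumes every step is atomic (e.g. debit and credit in one transaction),
    so stopping between steps never leaves a half-completed movement.
    """
    completed: list[str] = []
    for name, step in steps:
        if kill_switch.is_stopped():
            return {"status": "stopped", "completed": completed}
        step()
        completed.append(name)
    return {"status": "done", "completed": completed}
```

The returned list of completed steps is what a kill-switch exercise should verify: the task ended at a step boundary, and the system state matches it.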

Graceful degradation. When a control trips or a dependency fails, the system should fall back to a safer, more constrained mode: drop to draft-for-approval, disable the highest-impact tools, or route to a human rather than failing open.
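
One way to encode this is an ordered set of operating modes with a fallback table that can only move an agent towards more constraint. The mode names follow the autonomy spectrum described earlier; the `HUMAN_ONLY` mode, the event names, and the mapping are my assumptions.

```python
from enum import IntEnum


class Mode(IntEnum):
    HUMAN_ONLY = 0           # agent disabled, work routed to a person
    SUGGEST = 1
    DRAFT_FOR_APPROVAL = 2
    ACT_WITH_GUARDRAILS = 3
    ACT_AUTONOMOUSLY = 4


# Hypothetical mapping from a tripped control to the safest acceptable mode.
FALLBACK = {
    "limit_breached": Mode.DRAFT_FOR_APPROVAL,
    "breaker_open": Mode.DRAFT_FOR_APPROVAL,
    "dependency_down": Mode.HUMAN_ONLY,
}


def degrade(current: Mode, event: str) -> Mode:
    """Return the mode to run in after an event; never raises autonomy."""
    target = FALLBACK.get(event, Mode.HUMAN_ONLY)  # unknown events fail closed
    return min(current, target)
```

Unrecognised events fall back to the most constrained mode, so a new failure type the table does not list cannot fail open.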

Containment is sized by the agent’s position on the autonomy spectrum. The further right, the tighter and faster the containment, because the human checkpoint that would otherwise catch the error has been removed and the limits have to substitute for it.

What to do this week

Name your highest-autonomy production agent and ask the question. If it started issuing unexpected tool calls right now, how long before the SOC would know? That answer tells you the urgency of your monitoring gap.
Implement the agent-action logging schema. Start with tool calls and connector responses, which are the highest-value stream. Add context provenance and decision trails as a second pass.
Wire agent telemetry into the existing SIEM. Do not build a parallel console. Treat AI agents as a new telemetry source for the detection-and-response capability you already run.
Set per-action and aggregate limits on every agent that acts autonomously. Even rough limits are better than none. They turn “uncapped blast radius” into “a threshold agreed with risk and finance.”
Test the kill switch. Book a 30-minute exercise. Stop a non-production agent, verify the task ends in a coherent state, and restart it. If the exercise surfaces a gap, that is precisely the right time to find it.
