Blocked 100% of policy-violating tool calls in the first 90 days
Headline outcome
a regulated digital payments business · Financial services / payments · 2025
AI guardrails for a payments business
Context
A regulated digital payments business had deployed its first production AI agents to handle disputes and operations workflows. The agents were connected to real payment tools and could initiate refunds, query account records, and update case notes. The board’s question was direct: if one of these agents does something it should not, what stops it?
The answer, at that point, was not much. Output filters had been added as an afterthought. There was no structural separation between the agents that read untrusted customer content and the agents that held authority over payment actions. The security team recognised the gap and brought us in to build the guardrail layer before an incident made the case for them.
Risk
- Indirect prompt injection via ingested documents. The disputes workflow ingested customer-submitted correspondence directly into the agent context. An attacker who understood this could plant instructions in a support email or attached document, steering the agent to issue a refund or suppress a fraud flag. No privilege escalation needed.
- Over-tooled agents. Several agents held write-capable payment credentials even when their task only required reading ledger entries. Unused authority on an agent is pre-positioned blast radius. The scope mismatch meant a single successful injection could reach tools the task never legitimately used.
- Missing behavioural containment. There were per-action output filters but no aggregate limits on spend, no circuit breakers on anomalous tool sequences, and no tested kill switch. A misbehaving agent could cycle through small, individually below-threshold actions and accumulate a material loss before any alert fired.
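The over-tooling risk above can be made concrete with a small audit that compares the tools an agent's credentials allow against the tools its task actually needs. The sketch below is illustrative only: the agent names, tool names, and the `WRITE_TOOLS` set are hypothetical and are not taken from the client's system.

```python
from dataclasses import dataclass

# Hypothetical set of tools that can move money or change state.
WRITE_TOOLS = frozenset({"issue_refund", "update_case_notes", "suppress_fraud_flag"})


@dataclass(frozen=True)
class AgentScope:
    name: str
    granted: frozenset   # tools the agent's credentials allow
    required: frozenset  # tools the task legitimately uses


def excess_authority(agent: AgentScope) -> frozenset:
    """Return tools the agent can call but its task never needs."""
    return agent.granted - agent.required


def audit(agents):
    """Map each over-tooled agent to its unused write-capable tools."""
    findings = {}
    for agent in agents:
        unused_writes = excess_authority(agent) & WRITE_TOOLS
        if unused_writes:
            findings[agent.name] = sorted(unused_writes)
    return findings


if __name__ == "__main__":
    agents = [
        # Only reads the ledger, yet holds refund authority.
        AgentScope(
            "ledger_reader",
            granted=frozenset({"query_ledger", "issue_refund"}),
            required=frozenset({"query_ledger"}),
        ),
        # Uses everything it has been granted, so nothing is flagged.
        AgentScope(
            "disputes_agent",
            granted=frozenset({"query_ledger", "issue_refund"}),
            required=frozenset({"query_ledger", "issue_refund"}),
        ),
    ]
    print(audit(agents))
```

Any agent that appears in the findings holds write authority its task never exercises, which is the pre-positioned blast radius described above.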
Engagement
We structured the engagement in three stages, running in parallel where file ownership allowed.
- Threat modelling and blast-radius mapping. We drew the full tool inventory for each agent and asked what damaging end-state an attacker could reach by chaining those tools, starting from each ingestion channel. This gave us a ranked list of where autonomy and tool scope created disproportionate exposure.
- Guardrail and isolation design. We introduced a quarantined ingestion path for untrusted content, separating the model that processed customer documents from the model that held payment credentials. Only structured, schema-validated data crossed the boundary. Input classification sat on the ingestion pipeline rather than on the user turn, because indirect injection travels through exactly those channels.
- Adversarial red-teaming and SDLC gates. We ran a versioned suite of injection, tool-abuse, and jailbreak test cases against the build before any configuration reached production. High-impact workflows received a manual red-team pass in addition to the automated gate. Every finding became a new test case. We also set per-action payment limits, aggregate spend limits per hour, and a circuit breaker that halted operations on a run of anomalous tool calls, with a tested safe-stop that left tasks in a coherent state.
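The runtime containment described in the last stage can be sketched as a single object that enforces a per-action limit, a rolling hourly aggregate limit, and a circuit breaker on runs of anomalous tool calls. This is a minimal illustration under assumptions, not the deployed implementation: the class name, the limit values, and the idea that an upstream detector labels each call as anomalous are all hypothetical, and the safe-stop behaviour is not modelled.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Containment:
    per_action_limit: float  # hypothetical cap on a single payment action
    hourly_limit: float      # hypothetical aggregate cap per rolling window
    max_anomalies: int       # consecutive anomalous calls before tripping
    window_seconds: float = 3600.0
    tripped: bool = False
    _spend: deque = field(default_factory=deque, init=False)  # (timestamp, amount)
    _anomaly_run: int = field(default=0, init=False)

    def record_tool_call(self, anomalous: bool) -> None:
        """Track runs of anomalous tool calls; trip the breaker on a long run."""
        self._anomaly_run = self._anomaly_run + 1 if anomalous else 0
        if self._anomaly_run >= self.max_anomalies:
            self.tripped = True

    def authorise(self, amount: float, now: float) -> bool:
        """Allow a payment only if the breaker is closed and both limits hold."""
        if self.tripped or amount > self.per_action_limit:
            return False
        # Drop spend records that have aged out of the rolling window.
        while self._spend and now - self._spend[0][0] >= self.window_seconds:
            self._spend.popleft()
        if sum(a for _, a in self._spend) + amount > self.hourly_limit:
            return False
        self._spend.append((now, amount))
        return True
```

The aggregate check is what catches the pattern of many small, individually below-threshold actions: each one passes the per-action limit, but the rolling sum eventually refuses further spend. Once tripped, the breaker stays open until something outside this sketch resets it.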
The payment team verified that the quarantine boundary did not increase latency beyond the agreed threshold before sign-off.
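To illustrate what "only structured, schema-validated data crossed the boundary" can look like in practice, the sketch below parses the output of the quarantined model and admits it only if it matches a closed schema exactly. The field names, the case ID format, and the category vocabulary are assumptions made for this example, not the client's actual schema.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical closed vocabularies; free text never crosses the boundary.
ALLOWED_CATEGORIES = {"duplicate_charge", "item_not_received", "unauthorised"}
CASE_ID_PATTERN = re.compile(r"CASE-\d{6}")
EXPECTED_FIELDS = {"case_id", "category", "claimed_amount_minor"}


@dataclass(frozen=True)
class DisputeSummary:
    case_id: str
    category: str
    claimed_amount_minor: int  # amount in minor units, e.g. pence or cents


class BoundaryViolation(ValueError):
    """Raised when quarantined output does not match the schema exactly."""


def cross_boundary(raw: str) -> DisputeSummary:
    """Parse quarantined-model output and admit only schema-valid fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise BoundaryViolation("not valid JSON") from exc
    # Reject extra fields as well as missing ones.
    if not isinstance(data, dict) or set(data) != EXPECTED_FIELDS:
        raise BoundaryViolation("unexpected or missing fields")
    case_id = data["case_id"]
    category = data["category"]
    amount = data["claimed_amount_minor"]
    if not isinstance(case_id, str) or not CASE_ID_PATTERN.fullmatch(case_id):
        raise BoundaryViolation("malformed case_id")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise BoundaryViolation("unknown category")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int) or amount < 0:
        raise BoundaryViolation("invalid amount")
    return DisputeSummary(case_id, category, amount)
```

The design choice worth noting is that the schema has no free-text field at all. An instruction planted in a customer document has nowhere to travel, because the privileged side only ever sees an identifier, an enumerated category, and an integer.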
Outcome
- Blocked 100% of policy-violating tool calls in the first 90 days, measured against the adversarial test suite run continuously in the CI pipeline.
- Reduced the blast radius of any single agent failure from “uncapped refund authority” to a per-action limit agreed with risk and finance teams.
- Shortened the time from a new injection pattern being found in red-teaming to a new automated test case blocking the build, from several weeks to under 48 hours.
- Gave the CISO a defensible, auditable answer to the board question: a written-down blast radius, tested kill switch, and a CI gate that fails on any regression of the catastrophic-harm class.
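A CI gate of the kind described above can be sketched as a function that takes the results of an adversarial test run, computes the block rate, and fails the build if any case in a blocking harm class got through. The record shape and the harm-class labels here are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    harm_class: str  # hypothetical labels, e.g. "catastrophic" or "minor"
    blocked: bool    # True if the guardrail stopped the policy-violating call


def gate(results, blocking_classes=frozenset({"catastrophic"})):
    """Return (exit_code, block_rate, regressions) for one test-suite run."""
    if not results:
        # An empty run proves nothing, so treat it as a failure.
        return 1, 0.0, []
    blocked = sum(1 for r in results if r.blocked)
    regressions = [
        r.case_id
        for r in results
        if not r.blocked and r.harm_class in blocking_classes
    ]
    exit_code = 1 if regressions else 0
    return exit_code, blocked / len(results), regressions


if __name__ == "__main__":
    run = [
        CaseResult("inj-001", "catastrophic", blocked=True),
        CaseResult("abuse-007", "minor", blocked=True),
    ]
    code, rate, regressions = gate(run)
    print(f"exit={code} block_rate={rate:.0%} regressions={regressions}")
```

In a pipeline, the returned exit code would be passed to the CI runner so that a single unblocked case in a blocking class stops the build. Each red-team finding then becomes one more `CaseResult` in the suite.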
“We had filters, but we did not have a design. Salvador Cloud built the structural isolation that made the filters meaningful, and gave us a gate we can actually point an auditor at.”
The controls here are three legs of one design: guardrails that raise the cost of an attack, least-privilege scoping that bounds what a successful one can reach, and runtime containment that limits the damage regardless of whether detection fired. If you want to understand the full threat model and the control stack that answers it, start with AI security guardrails for fintech.