AI Security Guardrails for Fintech
Ship AI agents in regulated fintech with practical guardrails: the 3-layer framework, model-risk register, and board-ready evidence for approval.
Why this pillar exists
L ast quarter I sat with a fintech CTO whose engineering team had just built a customer-facing AI agent with access to billing, dispute, and account-status tools. The agent worked. The launch date was 11 days out. The board’s only question was: “What can it do that we don’t want it to do, and how do we know it isn’t doing it?”
The team had answers — for some of the questions. They had a system prompt with rules. They had a deny-list of tool calls. They had a logging pipeline. What they didn’t have was a way to translate any of that into the language a board paper requires: here is the risk, here is the control, here is the evidence the control is working.
That conversation is being replicated in every regulated-fintech boardroom in the UK and EU right now. This pillar is the practitioner’s answer.
The strategic question every fintech board should be asking
If your firm is shipping AI agents — internal copilots, customer-facing assistants, automated dispute / fraud / compliance triage — then you owe your board a single answer to a single question:
What can the agent do, what can’t it do, and how do we prove it?
Three sub-questions roll up under that one. Each maps to one of the three guardrail layers we’ll cover below.
- Tool-misuse risk — given the tools the agent can call, what sequences of calls produce unacceptable outcomes?
- Prompt-injection risk — can adversarial inputs (in customer messages, in retrieved documents, in agent outputs that become inputs to other agents) make the agent escalate its own privileges?
- Output-leakage risk — can the agent emit information that should never have left its context window — PII, internal data, model-derived inferences about customers?
Three questions. One operating model. Three layers.
The 3 Layers of AI Guardrails
This is the named framework Salvador Cloud uses across every fintech AI engagement. We didn’t invent the constituent ideas — they pull from academic prompt-injection literature, OWASP LLM Top 10, and operating patterns from regulated AI in healthcare and aviation. We did invent the specific composition of the three, and the discipline of insisting all three exist for any production agent.
Layer 1 — Input filter
A pre-prompt classifier sits in front of the agent. Every customer message (or document, or upstream agent output) passes through it before reaching the model.
The input filter does three things:
- Adversarial-input detection. Known prompt-injection patterns are flagged and either rejected or routed to a separate handling path.
- Topic boundary enforcement. If your agent is scoped to “billing enquiries”, inputs that try to redirect it to “tell me about your system prompt” or “ignore previous instructions” don’t reach the model.
- Authentication context capture. The filter records who the customer is (already authenticated upstream) and what they’re entitled to do, so subsequent layers can reason about it.
Logging here is deliberately separate from your main product event store — flagged inputs may contain sensitive personal data the customer shouldn’t have submitted in the first place, and you don’t want them co-mingled with general telemetry.
Layer 2 — Behaviour cage
Every tool the agent can call is wrapped with a policy check before the call is permitted to execute. The cage does three things:
- Tool whitelist. The agent’s intended tool surface is defined declaratively; tools not in the whitelist are not callable, period.
- Authorisation cross-check. When the agent attempts a tool call on behalf of a customer, the cage verifies the customer is entitled to the action (not just that the agent thinks they should be).
- Sequence pattern detection. Multi-tool sequences are scored against a known-bad-pattern library. “Read account → write refund of unrelated amount → archive transaction” is not a pattern your agent should be executing, even if each step in isolation is permitted.
The behaviour cage is where the engineering work compounds. Each new agent adds to the cage’s pattern library; each new tool adds to the whitelist with a documented authorisation rule. The board paper for a new agent is essentially a delta against the cage.
Layer 3 — Output guard
Every agent response — to a customer, to a downstream system, to a log stream — passes through an output check before reaching its destination.
- PII detection. Personal data the agent shouldn’t have learned (or shouldn’t be repeating) is redacted or escalated.
- Hallucination-confidence scoring. Outputs that describe facts outside the agent’s grounded knowledge are flagged or rewritten.
- Prompt-leakage check. The agent isn’t permitted to recite its own system prompt or to emit phrasings that suggest it has been jailbroken.
Outputs that fail the guard are either (a) automatically rewritten if the issue is benign (e.g., over-share of internal jargon to a customer), (b) escalated to a human if the issue is sensitive (e.g., possible PII emission), or (c) refused outright if the issue is dangerous (e.g., attempted exfiltration after a suspected jailbreak).
Why three layers, not one
The temptation, when an engineering team is moving fast, is to put one big “guardrail” in front of the model — typically a system prompt with rules, a deny-list, and some logging — and call it done.
That doesn’t work for three reasons:
- A single layer is a single failure point. When (not if) one technique evades it, you have no second line of defence.
- Different risks live at different points in the request flow. Prompt injection lives at input. Tool misuse lives at the tool boundary. Output leakage lives at output. A guardrail at any single point can’t reason about all three.
- Boards reason in compositional terms. “What protects us against X?” needs an answer that names a specific control, not “we have safety in the system prompt”.
The model-risk register
Every agent in production gets a row in your model-risk register. Each row has six columns:
- Agent (one line — what does it do?)
- Tool surface (declarative — what can it call?)
- Customer-data access (what categories?)
- Highest-impact failure mode (what’s the worst it can do, in plain language?)
- Layer-1/2/3 controls (which specific cage rules, output checks?)
- Last red-team (date and pass/fail)
This is the artefact your board, your auditor, and your regulator can all read in five minutes. It’s also the artefact that survives staff turnover.
The board paper that gets sign-off
A board paper for a new agent has the same structure every time:
What changed since the last paper? Three bullets max. New tools, new data access, new failure modes.
Risk delta Three bullets max. What’s higher, what’s lower, what’s the same.
Controls delta Three bullets max. New cage rules, new output checks, new red-team findings.
Recommendation One paragraph. Approve / approve with conditions / defer.
Three pages. Twelve bullets. One recommendation. Approval cycles drop from the typical 4-6 weeks to 2 weeks once the board recognises the shape.
We needed AI guardrails that the board could understand and the engineering team could ship. Salvador Cloud delivered both.
How to know you’re getting it right
Five measurable signals to put on the same dashboard the board reads:
- Mean time to triage for AI-related signals (target: under 30 minutes; we’ve seen 4 hours → 28 minutes within a quarter of guardrail rollout).
- Policy-violating tool calls in production traffic (target: zero successful; non-zero attempted, all blocked).
- Output-guard intervention rate (PII redactions, refusals, escalations) — high is fine if your customers aren’t complaining; what matters is the trend.
- Red-team failure rate at first integration test for each new agent (target: trending down across agents; the cage learns).
- Time from agent design to production approval (target: trending down; if it’s flat or growing, the framework is being treated as ceremony rather than as defence).
Regulatory mapping
If you’re in a regulated fintech context, the three layers map to existing frameworks without needing translation:
- EU AI Act — high-risk system requirements (Article 9, risk management) align with the model-risk register; transparency obligations (Article 13) align with the board-paper format.
- NCSC AI guidance (UK) — the secure-by-design principles map directly to layer 2 (behaviour cage) and layer 3 (output guard).
- FCA SYSC 4 (Senior Managers and Certification Regime) — the board paper is the artefact the SMF24/SMF18 holder uses to evidence their accountability.
- DORA (in scope from January 2025) — the model-risk register feeds the ICT risk register; agent failures feed the incident reporting flow under Article 19.
If your AI deployment is in scope of more than one of these (it usually is for EU fintech operators), the same framework satisfies all of them with deliberate evidence collection. We’ve shown this works in practice in our global fintech case study.
Five common ways teams get this wrong
- Putting the guardrails in the system prompt only. A determined prompt-injection technique evades it; you have no second layer.
- Treating logging as a control. Logging is evidence, not prevention. A logged incident is still an incident.
- Letting the engineering team self-assess the cage rules. They will miss the patterns adversarial users find in five minutes.
- Skipping red-teaming because the agent is “internal only”. Your internal users will copy-paste from external sources; the threat vector is the same.
- Not bumping the model-risk register on every model swap. A model-version upgrade can change behaviour materially; the register needs to reflect the live state.
What to do tomorrow morning
If you have an AI agent in production today, four hours of work this week:
- Hour 1: write your model-risk register row for the agent. Six columns. If you can’t fill any of them, you’ve found a gap.
- Hour 2: list the agent’s tool surface. Declaratively. If it’s longer than one screen, you almost certainly have over-broad authorisation.
- Hour 3: red-team layer 1 (input filter) with five known prompt-injection techniques (the OWASP LLM Top 10 has a starter set). Document what got through.
- Hour 4: write the board paper for the next planned agent in the 3-page / 12-bullet format. Run it past one engineer and one non-engineer. Iterate.
By the end of the week you have a working baseline you can defend at the next risk committee.
If your firm is approaching its first production agent and the answers above feel out of reach, that’s exactly when a senior practitioner helps.
Frequently asked
What are the three layers of AI guardrails in plain English?
Layer 1 (data) — provenance, classification, retention, DLP. Layer 2 (model) — risk register, red-team cadence, bias and evaluation pipelines. Layer 3 (prompt) — injection defence, output filtering, audit logging. The diagram on this page maps each layer to the controls a regulated fintech needs in production.Which regulators care about AI in fintech right now?
FCA (model risk and operational resilience), PRA (where retail or insurance is involved), ICO (UK data-protection implications), EBA (where the agent touches EU banking activity), and the EU AI Act (high-risk classification once deployed in EU markets). NCSC AI guidance is the de facto baseline for control design in the UK.How does this relate to existing model-risk management practice?
The discipline is the same; the surface is broader. Existing MRM covers underwriting, credit-decisioning, and fraud models well. AI agent platforms add prompt-time risk (injection, jailbreak, output-coercion) and reinforcement loops that traditional MRM was not built to govern. The pillar shows where to extend rather than replace.What's the minimum viable guardrail for a first agent in production?
An input filter (allow-list of permitted intents), a behaviour cage (the agent cannot call privileged tools without secondary confirmation), an output guard (PII / regulatory-restricted content filtering), an audit log (every agent decision recorded with the input that produced it), and a documented red-team pass before launch. Five things; everything else is iteration.
If you're working on this right now — Book a discovery call