Zero PII leakage incidents in the first 6 months post-launch
Headline outcome
a UK e-commerce platform · E-commerce / retail · 2024
Securing a service copilot for retail
Context
A UK e-commerce platform had built a natural-language service copilot to handle order queries, returns, and account questions across a large customer base. The copilot sat in the customer-facing channel and had access to order records, account details, and a set of action tools covering refund initiation and address updates. The business case was clear: reduce the volume reaching human agents. The security question was equally clear and arrived later: what happens when a customer, deliberately or otherwise, feeds the copilot instructions instead of queries?
When the security team reviewed the architecture, they found the copilot was ingesting customer messages directly into the model context with no quarantine layer. Every piece of text a customer typed became part of the reasoning space the copilot used to decide what to call next. The CISO asked for a full guardrail programme before the system went to the full customer base.
Risk
- Direct and indirect prompt injection via customer input. Customers typed messages the copilot treated as instructions. A crafted message asking the copilot to disclose another customer’s order history, or to initiate a refund it was not authorised to give, had a realistic path to success. The context window held no privilege boundary between the customer’s text and the copilot’s operating instructions.
- PII exposure across session boundaries. The copilot retrieved customer records into the context window to answer queries. Where session isolation was incomplete, retrieved data from one query could persist into the next. An attacker who understood this could time queries to surface another customer’s details through the copilot’s own retrieval.
- Over-tooled action set. The copilot held credentials for refund initiation and address changes across all orders, not only the session customer’s. The task never required cross-customer authority, yet the permission set did not enforce that boundary. Every unused credential was reachable blast radius if injection succeeded.
Engagement
We ran the programme in three structured stages.
- Threat modelling and channel mapping. We mapped every channel that could place content into the copilot’s context: user turns, retrieved order records, the action-tool response payloads, and any cached context. For each we asked what an attacker with full control of that channel could steer the copilot to do, given its current tool set. That gave us a ranked blast-radius list rather than a generic risk assessment.
- Quarantine and isolation design. We introduced a classification layer on the ingestion path that separated customer text from structured retrieval data before it entered the model context. Customer-supplied content was labelled untrusted and processed through a sandboxed context that could not directly invoke action tools. Only schema-validated, structured signals crossed into the action-capable context. We also scoped tool credentials to the session customer’s records, removing the cross-customer reach that the original permission set carried without it being needed.
- Red-teaming and evidence gating. We built a versioned adversarial test suite covering direct injection, indirect injection via crafted order descriptions, cross-session PII extraction attempts, and refund-manipulation paths. Every test case ran automatically in the CI pipeline so that a regression in the guardrail layer blocked the build before reaching production. We ran a manual red-team pass against the PII-extraction and refund paths specifically, given their direct cost to customers.
The customer experience team confirmed that the quarantine boundary did not increase response latency beyond the agreed threshold. The product sat within the service-level target throughout testing.
Outcome
- Zero PII leakage incidents in the first 6 months after the guardrail layer went live, validated against the continuous adversarial test suite. - Reduced the copilot’s tool authority to session-scoped credentials, cutting the worst-case blast radius from cross-customer refund capability to a per-session ceiling agreed with the risk team. - All injection categories in the red-team suite blocked on the automated gate, with new patterns moved from discovery to test coverage in under 48 hours. - Delivered a written-down threat model and blast-radius map the CISO could take into board reporting, replacing an informal assurance that “the filters are on.”
We thought we had covered injection with output filters. What we did not have was a design that made injection expensive in the first place. The quarantine layer changed the structural question from “did the filter catch it” to “did it even reach the tools.”
The lesson from this engagement is one that applies beyond retail. Injection is not a bug you patch out of one prompt template. It is a structural property of how models read their context, and the defence is architectural: separate untrusted content from trusted authority before the model reasons over both. If you want to understand the full threat model and the controls that answer it, read AI security guardrails for fintech.
Related case studies
Next step
Working on something similar?
We'll diagnose the shape of your problem in a 30-minute call. No proposals, no pitching.