Closed 23 ICT third-party resilience gaps before the audit window
Headline outcome
a regulated UK utility · Energy / utilities · 2024
Resilience for a regulated UK utility
Context
A regulated UK utility was six months from a scheduled operational-resilience audit when its CISO brought us in. The organisation had introduced AI-assisted scheduling and fault-triage tools across two critical services. Neither was covered by the existing incident-response runbooks. The board’s question was blunt: if an AI component in our operational infrastructure fails or behaves unexpectedly, do we know what to do, and can we prove it to our auditors?
The gap was real. Third-party ICT dependencies had accumulated without a formal register, resilience testing had not been extended to cover the AI components, and the incident-management process assumed deterministic systems. The organisation needed to close those gaps before the audit window and then demonstrate, in writing, that the controls were designed and operating.
Risk
- Non-deterministic failure modes not captured in the existing IR runbooks. The fault-triage model was agentic: it could initiate work orders without a human in the loop at the drafting stage. Existing runbooks had no procedure for containing an agent that took erroneous action, because no one had mapped the kill-switch decision path before an incident.
- Third-party ICT dependency register incomplete. The AI scheduling tool relied on two external model APIs and a managed connector. None appeared in the ICT third-party register. Without a register entry, there was no exit plan, no concentration-risk assessment, and no contractual basis for demanding incident notification from the providers.
- Resilience testing limited to infrastructure, not AI components. Severance tests existed for network and compute. No test had ever disabled the AI scheduler or the model APIs to verify that the operational service degraded safely, stayed within its impact tolerance, and recovered cleanly.
Engagement
We ran a four-week assessment followed by an eight-week remediation sprint. The work was structured around the four ICT resilience pillars most relevant to the audit scope: risk management framework, incident management, resilience testing, and third-party risk.
- AI incident taxonomy and kill-switch playbook. We extended the existing IR runbook with a four-category AI incident taxonomy covering injection-driven action, data leakage, rogue agent action, and poisoned model or connector. For each category we wrote a pre-agreed decision tree mapping the incident type and confirmed blast radius to the right containment action, from session-level circuit-break through to full service safe-stop.
- Third-party register remediation. We identified and documented all ICT third-party service arrangements, including the AI model APIs and the managed connector. Each entry covered classification, dependency mapping, concentration risk, exit conditions, and contractual notice requirements. The register was brought into the format required for audit evidence.
- Resilience testing extension. We designed and ran severance tests against both AI components. Each test disabled the dependency, verified the service degraded to its pre-agreed fallback mode, and then asserted that behaviour as well as availability returned within tolerance when the dependency was restored.
- Tabletop exercise. We facilitated an AI-specific tabletop with the CISO, SOC lead, model owners, legal, and communications. Scenarios covered a rogue agent action in the fault-triage path and a poisoned connector confirmed upstream. The exercise surfaced two unresolved decisions: who is authorised to invoke the kill switch, and what counts as “actively leaking.” The group agreed answers to both in writing, and those answers were committed to the runbook.
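To make the shape of the kill-switch decision tree concrete, here is a minimal Python sketch. It is illustrative only, not the playbook delivered to the client. The four incident categories and the two end points of the containment range (session-level circuit-break and full service safe-stop) come from the engagement described above. The blast-radius levels, the intermediate `DISABLE_COMPONENT` step, and the one-level escalation rule for poisoned models or connectors are assumptions made for this example.

```python
from enum import Enum, IntEnum


class Incident(Enum):
    """The four AI incident categories in the taxonomy."""
    INJECTION_DRIVEN_ACTION = "injection-driven action"
    DATA_LEAKAGE = "data leakage"
    ROGUE_AGENT_ACTION = "rogue agent action"
    POISONED_MODEL_OR_CONNECTOR = "poisoned model or connector"


class BlastRadius(IntEnum):
    """Confirmed scope of the incident (hypothetical levels)."""
    SINGLE_SESSION = 1
    SINGLE_COMPONENT = 2
    WHOLE_SERVICE = 3


class Containment(IntEnum):
    """Containment actions, ordered from least to most disruptive."""
    SESSION_CIRCUIT_BREAK = 1
    DISABLE_COMPONENT = 2  # hypothetical intermediate step
    SERVICE_SAFE_STOP = 3


# Assumption for this sketch: a poisoned model or connector affects every
# session that uses it, so containment starts one level above the radius.
ESCALATE_ONE_LEVEL = {Incident.POISONED_MODEL_OR_CONNECTOR}


def containment_for(incident: Incident, radius: BlastRadius) -> Containment:
    """Map an incident type and confirmed blast radius to a containment action."""
    level = int(radius)
    if incident in ESCALATE_ONE_LEVEL:
        # Cap at the most disruptive action available.
        level = min(level + 1, int(Containment.SERVICE_SAFE_STOP))
    return Containment(level)
```

Writing the mapping down as a pure function means the decision is agreed before an incident rather than argued during one, and the same table can be walked through in a tabletop exercise.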
Outcome
- Closed 23 documented ICT third-party resilience gaps across the register, exit planning, and contractual notice requirements.
- AI incident playbook and kill-switch decision tree signed off by legal, CISO, and the board risk committee before the audit window.
- Severance tests passed for both AI-dependent services, with behavioural assertions confirmed, not just availability.
- Zero audit findings against the ICT third-party and incident-management sections.
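The difference between an availability check and a behavioural assertion in a severance test can be sketched as follows. This is a toy Python example under stated assumptions, not the test harness used in the engagement. `SchedulingService`, its first-come-first-served fallback, and the priority-sorted "AI" ordering are all hypothetical stand-ins. Timing against the impact tolerance is deliberately left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class SchedulingService:
    """Toy stand-in for an AI-dependent service (hypothetical)."""
    ai_available: bool = True

    def plan(self, jobs):
        if self.ai_available:
            # Stand-in for the model's output: highest priority first.
            ordered = sorted(jobs, key=lambda j: j["priority"], reverse=True)
            return "ai", [j["id"] for j in ordered]
        # Pre-agreed fallback mode: first come, first served.
        return "fallback", [j["id"] for j in jobs]


def severance_test(service, jobs, expected_ai_order):
    """Disable the dependency, check the fallback, restore, check behaviour."""
    results = {}

    # 1. Sever the dependency and verify safe degradation.
    service.ai_available = False
    mode, order = service.plan(jobs)
    results["degraded_to_fallback"] = mode == "fallback"
    results["fallback_served_all_jobs"] = sorted(order) == sorted(j["id"] for j in jobs)

    # 2. Restore the dependency and verify recovery.
    service.ai_available = True
    mode, order = service.plan(jobs)
    results["availability_restored"] = mode == "ai"           # it answers again
    results["behaviour_restored"] = order == expected_ai_order  # and answers correctly
    return results


JOBS = [
    {"id": "A", "priority": 1},
    {"id": "B", "priority": 3},
    {"id": "C", "priority": 2},
]

results = severance_test(SchedulingService(), JOBS, expected_ai_order=["B", "C", "A"])
```

A service can pass `availability_restored` while failing `behaviour_restored`, for example if it comes back up but returns a different ordering than the known-good baseline. That case is what the behavioural assertion is there to catch.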
We knew the audit was coming and we knew the AI components were not covered. Working through the incident taxonomy and the kill-switch playbook gave us something we could actually rehearse, not just a document that sat on a shelf. The auditors asked three questions about it and we had a written answer for all three.
For the regulatory context behind this work, see our guide to DORA readiness for fintech, which covers ICT third-party risk, incident reporting, and resilience testing obligations in full.
Next step
Working on something similar?
We'll diagnose the shape of your problem in a 30-minute call. No proposals, no pitching.