Hollow Pentesting
Confidently using AI in your pentests.
Using AI assistance to get faster security assurance.
LLMs are useful during pentesting. They parse configs, spot misconfigurations, generate PoC code, and help with reporting. None of that is in dispute.
What should be in dispute is where client data goes when you paste it into a cloud-hosted AI. Depending on the provider, the tier, and how the account is set up, that data may be retained for 7 days to 5 years, read by provider employees, or fed into a training dataset. No major certification body — CREST, OffSec, CHECK, TIGER — has published guidance on this.
Hollow Pentesting is the practice of stripping all client-identifiable information from data before it touches an AI system, while preserving the technical structure that makes the analysis work. This paper covers the methodology, a provider data risk matrix, and a tiered decision framework.
Provider Data Risk Matrix
Provider policies differ by tier and have changed significantly during 2025–2026. Every cell below is sourced; references follow the table.
| Risk Factor | OpenAI (ChatGPT Free/Plus/Pro) | OpenAI (API / Enterprise) | Anthropic (Claude Free/Pro/Max) | Anthropic (API / Commercial) | Google (Gemini Consumer) | Google (Gemini API Paid / Workspace) |
|---|---|---|---|---|---|---|
| Training on input by default | No (user-configurable) | No | Yes, unless opted out (since Sept 2025) | No | Yes on free tier | No |
| Opt-out available | Yes | N/A — off by default | Yes, via Privacy Settings | N/A — excluded | Yes, but kills chat history | N/A — excluded |
| Default retention | 30 days | 30 days; ZDR on approval | 30 days; 5 years if training opted in | 7 days (reduced Sept 2025) | Up to 18 months; 3 years if human-reviewed | Admin-configurable |
| ZDR available | Enterprise only | Yes, eligible endpoints | No (consumer) | Yes, via addendum | No | Admin-configurable |
| Human review possible | Yes | Excluded under ZDR | Yes (consumer) | Excluded | Yes; reviewed copies kept up to 3 years | Only with org consent |
| Risk of data entering training | Low if configured | Very low with ZDR | HIGH if not opted out | Very low | HIGH on free tier | Very low |
| Suitable for raw client data | No | Conditional (ZDR + DPA) | No | Conditional (ZDR + DPA) | No | Conditional (DPA + admin controls) |
Sources (all verified March 2026):
- OpenAI data controls — developers.openai.com/api/docs/guides/your-data
- OpenAI / NYT litigation retention — openai.com/index/response-to-nyt-data-demands
- Anthropic consumer terms (Aug 2025) — anthropic.com/news/updates-to-our-consumer-terms
- Anthropic API retention — char.com/blog/anthropic-data-retention-policy
- Anthropic privacy centre — privacy.claude.com/en/articles/10023548-how-long-do-you-store-my-data
- Google Gemini API abuse policy — ai.google.dev/gemini-api/docs/usage-policies
- Google Gemini API terms — ai.google.dev/gemini-api/terms
- Google Gemini consumer retention — char.com/blog/google-gemini-data-retention-policy
- Google Cloud Gemini governance — docs.cloud.google.com/gemini/docs/discover/data-governance
Consumer tiers are unsuitable for client data under any configuration. API and enterprise tiers are conditionally acceptable with ZDR and a DPA. Provider terms are not stable — Anthropic flipped from no-training to opt-out in a single update in August 2025, and OpenAI had its policies overridden by a court order for months. Treat any assessment as point-in-time.
The Methodology
Hollow Pentesting strips client identity from prompts while keeping the technical structure intact. The AI gets the skeleton of the problem — the misconfiguration pattern, the topology, the vulnerability condition — but nothing that maps back to the client.
Four principles:
Data minimisation — only include what the AI needs for the specific question. Everything else stays out.
Identity elimination — replace or remove hostnames, domains, IPs, subnets, usernames, service accounts, OU names, group names, policy names, geographic references, ticket numbers, and human-readable comments.
Structural preservation — replacements must maintain the relationships in the original data. If two subnets route to each other and that matters, the synthetic versions must too.
Reversible mapping — keep a local encrypted mapping table that lets you translate AI output back to real assets for the report.
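The mapping idea can be sketched in a few lines of Python. The names and values below are illustrative, and encryption of the table at rest is left to whatever your tooling already provides (an encrypted volume, for instance):

```python
class HollowMap:
    """Bidirectional map between real identifiers and synthetic stand-ins."""

    def __init__(self):
        self.forward = {}   # real -> synthetic
        self.reverse = {}   # synthetic -> real

    def add(self, real, synthetic):
        self.forward[real] = synthetic
        self.reverse[synthetic] = real

    def hollow(self, text):
        # Longest identifiers first, so an FQDN is rewritten before its bare domain
        for real in sorted(self.forward, key=len, reverse=True):
            text = text.replace(real, self.forward[real])
        return text

    def restore(self, text):
        # Translate AI output back to real asset names for the report
        for synth in sorted(self.reverse, key=len, reverse=True):
            text = text.replace(synth, self.reverse[synth])
        return text


m = HollowMap()
m.add("acmefinancial.local", "lab.corp")
m.add("svc_sqlprod", "svc_app01")

prompt = m.hollow("Kerberoastable account svc_sqlprod in acmefinancial.local")
report = m.restore("svc_app01 in lab.corp has an SPN and no pre-auth hardening")
```

`hollow()` produces what you paste into the AI; `restore()` runs over the AI's output before it goes anywhere near the report template.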
Example
Original — not suitable for AI submission:
Source zone: DMZ-CLIENT-PAYMENT-PROCESSING
Destination: 10.45.12.0/24 (internal DB subnet)
Hostname: PA-FW-01-LONDON-DC2
Comments: "Added by J.Smith, REF: INC-2024-4471"
Hollowed — suitable for AI submission:
Source zone: DMZ-ZONE-A
Destination: 10.0.1.0/24
Hostname: FW-PRIMARY
Comments: [removed]
Rule permissiveness, zone policy, ordering — intact. Client identity — gone.
Synthetic Configs
When anonymised data is still structurally distinctive enough to fingerprint the environment, go further: build an entirely synthetic config that reproduces only the vulnerability condition. Feed the synthetic version to the AI. Apply the analysis to the real environment locally.
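IP space is the easiest place to break structure. A sketch using Python's standard `ipaddress` module that renumbers addresses into synthetic subnets while preserving host offsets, so relationships between hosts survive hollowing (the subnet pairs are illustrative):

```python
import ipaddress

# Real -> synthetic subnets; same prefix length, so host offsets carry over
SUBNET_MAP = {
    ipaddress.ip_network("10.45.12.0/24"): ipaddress.ip_network("10.0.1.0/24"),
    ipaddress.ip_network("10.45.13.0/24"): ipaddress.ip_network("10.0.2.0/24"),
}

def hollow_ip(addr_str):
    addr = ipaddress.ip_address(addr_str)
    for real, synth in SUBNET_MAP.items():
        if addr in real:
            # Keep the host's offset within its subnet: .7 stays .7,
            # so "same host" and "adjacent host" remain visible to the AI
            offset = int(addr) - int(real.network_address)
            return str(ipaddress.ip_address(int(synth.network_address) + offset))
    # Refuse rather than leak an unmapped client address
    raise ValueError(f"no synthetic mapping for {addr_str}")
```

Raising on unmapped addresses is deliberate: a silent pass-through is exactly how a real IP ends up in a cloud prompt.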
AD Mapping Table
| Data Element | Real Value | Synthetic Replacement |
|---|---|---|
| Domain | acmefinancial.local | lab.corp |
| Domain Controller | DC01-LON.acmefinancial.local | DC01.lab.corp |
| Service Account | svc_sqlprod | svc_app01 |
| Security Group | UK-Finance-DBA-Admins | GROUP-A-ADMINS |
| GPO | YOURCO-Workstation-BitLocker-Policy | GPO-ENDPOINT-001 |
| OU | OU=London,OU=Finance,DC=acmefinancial,DC=local | OU=Site-A,OU=Dept-1,DC=lab,DC=corp |
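Applying a table like this is mostly about ordering and verification. A sketch: replace longest identifiers first (so the DC FQDN is rewritten before the bare domain it contains), then run a leak check that refuses to emit anything still containing a real value:

```python
# Mapping table from the paper
AD_MAP = {
    "OU=London,OU=Finance,DC=acmefinancial,DC=local": "OU=Site-A,OU=Dept-1,DC=lab,DC=corp",
    "YOURCO-Workstation-BitLocker-Policy": "GPO-ENDPOINT-001",
    "DC01-LON.acmefinancial.local": "DC01.lab.corp",
    "UK-Finance-DBA-Admins": "GROUP-A-ADMINS",
    "acmefinancial.local": "lab.corp",
    "svc_sqlprod": "svc_app01",
}

def hollow_ad(text):
    # Longest-first: replacing "acmefinancial.local" before the DC FQDN
    # would leave the mangled fragment "DC01-LON.lab.corp" pointing at DC2
    for real in sorted(AD_MAP, key=len, reverse=True):
        text = text.replace(real, AD_MAP[real])
    # Final leak check: never submit text that still contains a real value
    leaked = [real for real in AD_MAP if real in text]
    if leaked:
        raise ValueError(f"client identifiers survived hollowing: {leaked}")
    return text
```

In practice you would extend the leak check beyond the table itself (regexes for the client's domain pattern, ticket-number formats, staff initials), since the table only knows about identifiers you remembered to map.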
Local Inference
Running LLMs on your own hardware means nothing leaves your network. Ollama, llama.cpp, and vLLM can serve models such as Qwen 2.5, DeepSeek-R1, Llama 3, and Mistral on GPUs with 24 GB+ of VRAM; dedicated inference hardware handles larger models with longer context.
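A minimal sketch of local-only querying against Ollama's default HTTP API, using only the standard library. The model tag is an assumption; substitute whatever you have pulled locally:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_request(prompt, model="qwen2.5:14b"):
    # Only the hollowed prompt goes over the wire (to localhost);
    # the mapping table never enters the request
    return {"model": model, "prompt": prompt, "stream": False}

def ask_local(prompt, model="qwen2.5:14b"):
    body = json.dumps(build_request(prompt, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Call it as `ask_local(hollowed_prompt)`; combined with the hollowing step, the only network hop is loopback.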
Tradeoffs exist. Local models are less capable on complex reasoning. Quesma's October 2025 research showed smaller models comply with prompt injection at rates up to 95% — relevant when your LLM processes content from untrusted sources during an engagement. Validate outputs carefully.
Local inference combined with hollowed data is the highest-assurance option available.
Recommendations
For pentesters: document a data classification policy for AI usage. Maintain per-engagement mapping tables, encrypted and local. Verify provider terms before each engagement. Disclose AI usage to clients at scoping. Put it in the engagement letter. Prefer local inference where you have the hardware.
For organisations commissioning tests: ask about AI tool usage in vendor assessment. Ask which providers, which tiers, which controls. Consider contractual provisions for AI-assisted analysis. Check compatibility with PCI DSS, UK GDPR, NIS2 obligations.
For the industry: no major certification body has guidance on this. That gap is being filled by practitioners making it up as they go. It should not stay that way.