We were three weeks into building an internal support tool when someone on the team asked the question nobody had thought about.
"Wait, are we sending customer emails to OpenAI?"
We were. Every support ticket went through GPT-4 for summarisation and suggested replies. It worked brilliantly. Customers were getting faster responses, the team was handling twice the volume, and everyone was happy.
Except we were piping names, email addresses, phone numbers, and occasionally home addresses straight into a third-party API. Every single ticket.
If you're building anything with LLMs right now, there's a decent chance you're doing the same thing.
The Problem Nobody Talks About
Here's what a typical support ticket looks like before it hits your LLM:
Hi, my name is Sarah Mitchell and I placed an order last week but
haven't received it. My order number is #48291 and my email is
sarah.mitchell@gmail.com. I live at 14 Beechwood Avenue, Manchester,
M20 3FJ. Can someone help?
You send that to ChatGPT. OpenAI now has Sarah's full name, email, home address, and buying history. Your privacy policy probably says you don't share data with third parties. Your GDPR basis for processing may not explicitly cover sending personal data to a third-party AI provider for summarisation. At minimum, you'd need a DPA with the provider and a clear legal basis for the transfer.
Nobody means to do this. It just happens because the LLM integration is the exciting part and the PII stripping is the boring part that gets pushed to "we'll handle that later."
Later never comes.
What Actually Needs to Happen
Before any customer data hits an LLM API, you need to:
- Detect all personally identifiable information in the text
- Replace it with something safe
- Send the cleaned version to the LLM
- Get your response back
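The four steps above can be sketched as a thin wrapper around the LLM call. Everything here is illustrative, not production code: `detect` uses two toy regex patterns as a stand-in for real PII detection, and `llm_call` is whatever function talks to your provider.

```python
import re

# Placeholder detector: two toy patterns standing in for real PII detection.
# Real detection needs far more than this, but the pipeline shape is the point.
PATTERNS = {
    "[EMAIL]": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "[PHONE]": re.compile(r"\b(?:\+44\s?7|07)\d{3}\s?\d{6}\b"),
}

def redact(text: str) -> str:
    """Steps 1 and 2: detect PII and replace it with a safe label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(label, text)
    return text

def summarise(ticket: str, llm_call) -> str:
    """Steps 3 and 4: send the cleaned text to the LLM, return its reply."""
    return llm_call(redact(ticket))
```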
Simple enough in theory. In practice, most teams try one of these approaches:
The Regex Approach (What Everyone Tries First)
import re

text = re.sub(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', '[EMAIL]', text)
text = re.sub(r'\b[A-Z]{1,2}[0-9][A-Z0-9]?\s*[0-9][A-Z]{2}\b', '[POSTCODE]', text)
# ... 47 more patterns you'll write over the next month
This works for emails. It sort of works for phone numbers. It completely falls apart for names, addresses, and anything that doesn't have a predictable format.
You'll spend two weeks writing regex patterns, feel good about it, and then a customer named "Will May" will break everything because your name detector thinks every common English word might be a name.
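You can see the failure mode in two lines. A naive "two adjacent capitalised words" pattern, a common first attempt at name detection (shown here purely as an illustration), catches real names and ordinary sentence openings alike:

```python
import re

# Naive name detector: any two adjacent capitalised words.
NAME = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

# A real name is caught...
print(NAME.sub("[NAME]", "Please refund Sarah Mitchell today"))
# → Please refund [NAME] today

# ...but so is the start of a perfectly ordinary sentence.
print(NAME.sub("[NAME]", "Will May be able to call me back?"))
# → [NAME] be able to call me back?
```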
The "Send It to Another AI" Approach
Some teams use a cloud NLP service to detect PII first. Azure AI, Google Cloud DLP, AWS Comprehend.
The irony: you're sending PII to a cloud service to detect PII before sending it to a different cloud service. You've doubled your third-party data exposure to halve it.
For some organisations this is fine. For others (particularly those in healthcare, finance, or handling EU citizen data) it defeats the entire purpose.
The "Build Nothing and Hope" Approach
Also known as the "we'll add a disclaimer to the privacy policy" strategy. Legal teams love this one until the ICO comes knocking.
What We Actually Did
We needed something that could:
- Catch PII in unstructured customer messages (names, emails, phones, addresses)
- Run without sending data to yet another external service
- Slot into our existing pipeline with minimal code changes
- Handle JSON payloads (because our ticket data comes through as JSON, not plain text)
We ended up using ComplyTech's API. One POST request before the OpenAI call:
import requests
from openai import OpenAI

def strip_pii(text):
    response = requests.post(
        "https://api.comply-tech.co.uk/api/v1/anonymise",
        headers={
            "X-Api-Key": "your-api-key",
            "Content-Type": "application/json",
        },
        json={
            "content": text,
            "contentType": "text",
            "strategy": "Redact",
            "frameworks": ["GDPR"],
        },
    )
    response.raise_for_status()  # fail loudly rather than passing raw text on
    return response.json()["anonymisedContent"]

# Strip PII before anything reaches OpenAI
client = OpenAI()
clean_ticket = strip_pii(raw_ticket)
summary = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"Summarise this support ticket:\n\n{clean_ticket}"}],
).choices[0].message.content
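One design choice worth making explicit: if the anonymisation call fails, you want to fail closed, meaning the raw ticket never reaches the LLM as a fallback. A small wrapper makes that hard to get wrong (the names here are ours, not from any SDK; `strip_fn` and `llm_fn` are whatever functions you already have):

```python
class AnonymisationError(RuntimeError):
    """Raised when PII stripping fails; the ticket is never sent on."""

def summarise_safely(raw_ticket, strip_fn, llm_fn):
    try:
        clean = strip_fn(raw_ticket)
    except Exception as exc:
        # Fail closed: do NOT fall back to llm_fn(raw_ticket) here.
        # That silently reintroduces the exact leak this pipeline prevents.
        raise AnonymisationError("PII stripping failed; ticket not sent") from exc
    return llm_fn(clean)
```

The temptation in a hurry is to catch the exception and send the original text so the customer still gets a reply. Resist it: a delayed reply is recoverable, a leaked address is not.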
That support ticket from earlier becomes:
Hi, my name is [NAME REDACTED] and I placed an order last week but
haven't received it. My order number is #48291 and my email is
[EMAIL REDACTED]. I live at [ADDRESS REDACTED]. Can someone help?
GPT-4 can still summarise it perfectly. It can still suggest a reply. It just can't see who the customer actually is.
Things We Learned Along the Way
The LLM doesn't need PII to be useful. This surprised us. We assumed stripping names and emails would degrade the quality of the summaries and suggested replies. It didn't. The LLM cares about the problem described, not who described it.
JSON support matters more than you'd think. Our ticket data comes through as structured JSON with fields like customerName, email, messageBody. Being able to send the whole JSON blob and get it back with PII stripped from every field saved us from writing a bunch of pre-processing code.
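For comparison, here is roughly the pre-processing we would otherwise have had to write: walk the nested structure and apply a per-string stripper to every string value. A sketch, with a stand-in stripper in place of a real one:

```python
def strip_json(value, strip_fn):
    """Recursively apply strip_fn to every string in a JSON-like structure."""
    if isinstance(value, str):
        return strip_fn(value)
    if isinstance(value, dict):
        return {k: strip_json(v, strip_fn) for k, v in value.items()}
    if isinstance(value, list):
        return [strip_json(v, strip_fn) for v in value]
    return value  # numbers, booleans, None pass through unchanged

ticket = {
    "customerName": "Sarah Mitchell",
    "email": "sarah.mitchell@gmail.com",
    "messageBody": "My order has not arrived",
    "orderTotal": 42.50,
}

# Stand-in for a real per-string PII stripper
clean = strip_json(ticket, lambda s: "[REDACTED]")
```

It works, but now you own the traversal logic, the edge cases (nested lists of objects, null fields), and the tests for all of it. Sending the whole blob to one endpoint was less code to maintain.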
You want redaction, not pseudonymisation, for LLM inputs. We initially tried replacing names with fake names (pseudonymisation), thinking it would give the LLM more natural text to work with. It didn't matter, and it introduced confusion when the LLM referred to "Jessica Thompson" in its response when the actual customer was someone else entirely. Simple [NAME REDACTED] labels are clearer for everyone.
Allow-lists prevent annoying false positives. Our company domain kept getting flagged as PII (because it's technically an email domain). One allow-list entry for @ourcompany.com fixed it permanently.
The Compliance Reality
If you're processing EU citizen data (you probably are), GDPR Article 28 requires a data processing agreement with any third party that handles personal data. OpenAI, Anthropic, and Google all offer standard DPAs for their API products, so you may already be covered. But a DPA doesn't eliminate the data minimisation question: if the LLM doesn't need the PII to do its job, sending it anyway is harder to justify under Article 5.
The principle of data minimisation says you should only share the minimum data necessary for the purpose. The purpose is "summarise this ticket." The customer's home address is not necessary for that purpose.
This isn't about being paranoid. It's about not being the company in the case study that gets fined because they were lazy about a solvable problem.
Try It With Your Own Data
Paste this into your terminal:
curl -X POST https://api.comply-tech.co.uk/api/v1/anonymise \
  -H "X-Api-Key: demo-key-complytech" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Contact Sarah Mitchell at sarah.m@gmail.com or 07700 900123",
    "contentType": "text",
    "strategy": "Redact",
    "frameworks": ["GDPR"]
  }'
Takes about 30 seconds to see whether it catches what you need it to catch. The demo key gives you 100 fields per month, enough to test with real data before committing to anything.
We built ComplyTech because we had this exact problem and couldn't find a tool that solved it without creating a new one. If you're building with LLMs and handling customer data, you probably have this problem too. You just might not have had the uncomfortable conversation about it yet.
Strip PII from your pipeline today
Try the demo key or get your own API key in minutes.