Your Staging Database Is Full of Real Customer Data (And Everyone Knows It)
Nobody talks about this, but almost every engineering team has done it at some point: copied the production database into staging so the test environment has "realistic" data.
It makes sense. You need data that looks like production to catch real bugs. Generated test data is always too clean: perfectly formatted emails, sequential IDs, no edge cases. Production data has the messy reality that actually breaks things.
The problem is that your staging environment now contains every customer's real name, email address, phone number, home address, and whatever else you store. And staging doesn't have the same security controls as production. It probably has shared credentials. It might be accessible from a wider network. The logs definitely aren't monitored the same way.
You know this is bad. Your team knows it. Nobody has fixed it because the alternative (writing and maintaining a custom anonymisation pipeline) is a project nobody wants to own.
The Compliance Problem (It's Worse Than You Think)
GDPR Article 25 requires "data protection by design and by default." Using real personal data in test environments is pretty much the opposite of that.
But beyond GDPR, think about who has access to your staging environment:
- Every developer on the team
- Contractors and freelancers (who might rotate every few months)
- CI/CD pipelines that log data to third-party services
- QA teams who might screenshot bugs with real customer data visible
- New starters on day one, before they've even completed security training
A production database with 50,000 customers' data, copied to staging, is 50,000 people's personal information accessible to everyone with a staging password. And that password is probably staging123, or sitting in a shared 1Password vault with 20 people in it.
HIPAA is even stricter. If you're in healthcare and your test environment has real patient data, you're in violation whether or not anything bad happens. The mere presence of PHI in an insufficiently secured environment is a finding.
Why Synthetic Data Doesn't Work
"Just use Faker to generate test data."
If you've actually tried this, you know why it doesn't work for real testing:
The data is too perfect. Faker generates john.smith@example.com. Your real customers have xX_darkl0rd_Xx@hotmail.co.uk. The edge cases that break your code only exist in production data.
Relationships are wrong. Production data has customers with 47 orders, customers with 0 orders, customers who changed their email three times, customers with duplicate phone numbers. Synthetic data generators create uniform distributions that don't reflect reality.
Volume patterns don't match. Your production database has 200,000 customers added over 5 years with realistic growth curves, seasonal patterns, and churn. A synthetic dataset is either too small to test performance or too uniform to catch real bugs.
Nobody maintains the generator. Someone writes a seed script, it works for the current schema, and then nobody updates it when the schema changes. Six months later, seeding fails and someone copies production again because they have a deadline.
The right answer isn't synthetic data. It's real data with the personal bits replaced.
Pseudonymisation: The Best of Both Worlds
Here's what you actually want:
- Take your production database dump
- Replace every name with a different realistic name
- Replace every email with a different realistic email
- Replace every phone number, address, and national ID
- Keep everything else exactly as it is: order history, timestamps, amounts, relationships, edge cases
This is pseudonymisation. And the critical feature is determinism: the same input should always produce the same output, so relationships stay consistent. If sarah.mitchell@gmail.com becomes karen.taylor@outlook.com, that mapping holds across every table in the database. Foreign key relationships survive. Your test scenarios still make sense.
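To make the determinism property concrete, here is a hand-rolled sketch of the idea (illustrative only, not how ComplyTech works internally: the SECRET key and the name pools are invented for this example). A keyed hash of the input selects the replacement, so the same email always maps to the same pseudonym across every table:

```python
import hashlib
import hmac

# Illustrative only: SECRET and the name pools are made up for this sketch.
SECRET = b"rotate-this-key-per-environment"
FIRST_NAMES = ["karen", "david", "emily", "sarah", "mark", "james"]
SURNAMES = ["taylor", "thompson", "watson", "williams", "jones"]

def pseudonymise_email(email: str) -> str:
    # Keyed hash of the (normalised) input drives the replacement choice,
    # so identical inputs always get identical pseudonyms.
    digest = hmac.new(SECRET, email.lower().encode(), hashlib.sha256).digest()
    first = FIRST_NAMES[digest[0] % len(FIRST_NAMES)]
    last = SURNAMES[digest[1] % len(SURNAMES)]
    return f"{first}.{last}@example.com"
```

Run it twice on the same address and you get the same pseudonym back, which is exactly what keeps foreign keys and cross-table references intact.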
How We Set This Up
We use ComplyTech's API with the Pseudonymise strategy. Here's what the pipeline looks like:
Step 1: Export from Production
Standard pg_dump or whatever your database uses. Nothing special here.
Step 2: Process Each Table
For text fields (names, notes, descriptions):
```python
import requests

def anonymise_text(text):
    response = requests.post(
        "https://api.comply-tech.co.uk/api/v1/anonymise",
        headers={
            "X-Api-Key": "your-api-key",  # keep this in an env var, not source
            "Content-Type": "application/json",
        },
        json={
            "content": text,
            "contentType": "text",
            "strategy": "Pseudonymise",
            "frameworks": ["GDPR"],
        },
    )
    response.raise_for_status()  # fail loudly rather than loading bad data
    return response.json()["anonymisedContent"]
```
For structured data (CSV exports of tables):
```python
def anonymise_csv(csv_content):
    response = requests.post(
        "https://api.comply-tech.co.uk/api/v1/anonymise",
        headers={
            "X-Api-Key": "your-api-key",
            "Content-Type": "application/json",
        },
        json={
            "content": csv_content,
            "contentType": "csv",
            "strategy": "Pseudonymise",
            "frameworks": ["GDPR"],
        },
    )
    response.raise_for_status()
    return response.json()["anonymisedContent"]
```
For JSON API responses or document stores:
```python
def anonymise_json(json_string):
    response = requests.post(
        "https://api.comply-tech.co.uk/api/v1/anonymise",
        headers={
            "X-Api-Key": "your-api-key",
            "Content-Type": "application/json",
        },
        json={
            "content": json_string,
            "contentType": "json",
            "strategy": "Pseudonymise",
            "frameworks": ["GDPR"],
        },
    )
    response.raise_for_status()
    return response.json()["anonymisedContent"]
```
Step 3: Load into Staging
Import the anonymised dump into your staging database. Done.
Step 4: Automate It
Add this to your CI pipeline or a weekly cron job. Fresh production-like data in staging every Monday morning, with zero personal information.
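The four steps reduce to a small glue script. Here's a minimal sketch, with each step injected as a callable so it can stand in for pg_dump, the API call, and the staging import respectively (the function names are placeholders for your own tooling, not part of any library):

```python
def refresh_staging(export_fn, anonymise_fn, load_fn):
    """Run one export -> anonymise -> load cycle.

    export_fn():        returns the raw dump as a string (e.g. wraps pg_dump)
    anonymise_fn(dump): returns the pseudonymised dump (e.g. wraps the API)
    load_fn(dump):      imports the result into staging (e.g. wraps psql)
    """
    raw_dump = export_fn()
    clean_dump = anonymise_fn(raw_dump)
    load_fn(clean_dump)
    return clean_dump
```

Schedule the script with cron (e.g. 0 6 * * 1 for Monday mornings) or a scheduled CI workflow.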
What the Output Actually Looks Like
Before:
```csv
id,name,email,phone,notes
1,James O'Brien,james.obrien@gmail.com,07700 900456,"Called about late delivery to 42 Victoria Road, Leeds"
2,Priya Sharma,priya.s@outlook.com,07700 900789,"VIP customer, handles large orders for Birmingham office"
3,Michael Chen,m.chen@yahoo.co.uk,+44 7700 900321,"Requested invoice sent to michael.chen@workmail.com"
```
After (pseudonymised):
```csv
id,name,email,phone,notes
1,David Thompson,sarah.jones@outlook.com,07123 456789,"Called about late delivery to [ADDRESS REDACTED]"
2,Emily Watson,mark.w@gmail.com,07234 567890,"VIP customer, handles large orders for Birmingham office"
3,Sarah Williams,j.taylor@yahoo.co.uk,+44 7345 678901,"Requested invoice sent to [EMAIL REDACTED]"
```
Notice what happened:
- Names replaced with different realistic names
- Emails replaced with different realistic emails
- Phone numbers replaced with different realistic phone numbers
- The address in the notes field was caught and redacted
- The embedded email in the notes field was caught and redacted
- Non-PII content ("VIP customer, handles large orders") is untouched
- IDs, relationships, and business data are preserved
The data is still messy and realistic. It still has the edge cases your tests need. It just doesn't contain real people's information anymore.
The Batch API for Large Datasets
If you're processing a full database dump, you don't want to make 50,000 individual API calls. ComplyTech has a batch endpoint for this:
```python
import requests

batch_items = []
for chunk in csv_chunks:  # split your CSV into manageable chunks first
    batch_items.append({
        "type": "anonymise",
        "request": {
            "content": chunk,
            "contentType": "csv",
            "strategy": "Pseudonymise",
            "frameworks": ["GDPR"],
        },
    })

response = requests.post(
    "https://api.comply-tech.co.uk/api/v1/batch",
    headers={
        "X-Api-Key": "your-api-key",
        "Content-Type": "application/json",
    },
    json={
        "items": batch_items,
        "webhookUrl": "https://your-server.com/webhooks/anonymisation-complete",
    },
)
response.raise_for_status()
job_id = response.json()["jobId"]
# Poll the job status or wait for the webhook callback
```
Submit up to 100 items per batch. Get a webhook when it's done. No babysitting.
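Splitting the dump while keeping every chunk a valid CSV (header included) is the only fiddly part. A standard-library sketch; anonymise_fn is a placeholder for whatever makes the API call on each chunk:

```python
import csv
import io

def anonymise_csv_in_chunks(csv_text, anonymise_fn, chunk_rows=500):
    """Split a CSV dump into header-preserving chunks, run each through
    anonymise_fn, and stitch the results back into one CSV string."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    out_rows = []
    for i in range(0, len(body), chunk_rows):
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(header)                 # every chunk carries the header
        writer.writerows(body[i:i + chunk_rows])
        cleaned = list(csv.reader(io.StringIO(anonymise_fn(buf.getvalue()))))
        out_rows.extend(cleaned[1:])            # drop the repeated header
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(out_rows)
    return buf.getvalue()
```

Keeping the header on every chunk matters: without it, the detection has no column names to work from and the reassembled file would be ambiguous.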
What This Costs vs What It Saves
The cost:
- Pro plan: £99/mo for 150,000 fields
- A 10,000-row table with 8 columns = 80,000 fields
- So the Pro quota covers just under two full refreshes of a database that size per month
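Whether a plan fits your database is simple arithmetic (the numbers below come from the pricing above; swap in your own row and column counts):

```python
# Back-of-envelope quota check for the Pro plan.
rows, columns = 10_000, 8
fields_per_refresh = rows * columns           # 80,000 fields per full refresh
monthly_quota = 150_000                       # Pro plan allowance
refreshes_per_month = monthly_quota / fields_per_refresh
print(refreshes_per_month)                    # 1.875, i.e. just under two
```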
What it replaces:
- 2–5 days of developer time building a custom anonymisation script
- Ongoing maintenance of that script every time the schema changes
- The risk of a staging data breach (potentially hundreds of thousands in fines)
- The compliance gap that keeps your DPO awake at night
What it doesn't replace:
- This won't anonymise your database in place. It's an API: you export, process, and re-import.
- It won't handle binary data (images, PDFs). Text, JSON, and CSV only.
- If you need field-level encryption rather than pseudonymisation, you need a different tool.
Getting Started Takes About 10 Minutes
- Grab the demo key: demo-key-complytech
- Export a small sample from your staging database (10–20 rows of a table with PII)
- Send it through the API with strategy: "Pseudonymise"
- Look at the output. Does it catch the PII you care about? Is the non-PII data preserved?
```shell
curl -X POST https://api.comply-tech.co.uk/api/v1/anonymise \
  -H "X-Api-Key: demo-key-complytech" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "id,name,email\n1,Sarah Jones,sarah@gmail.com\n2,Tom Wilson,tom@company.co.uk",
    "contentType": "csv",
    "strategy": "Pseudonymise",
    "frameworks": ["GDPR"]
  }'
```
If the output looks right, you can have a working staging pipeline by end of day. If it doesn't catch something you need, we want to know about it. That's how the detection gets better.
The Uncomfortable Truth
Your staging database probably has real customer data in it right now. You've known about it for a while. It hasn't been fixed because it's not urgent until it is. By then it's an incident, not a task.
This is a fixable problem with a straightforward solution. The tools exist. The API is live. The demo key is free. The only question is whether you do it now or after someone asks about it in an audit.
We built ComplyTech because we were the team with production data in staging, and we got tired of pretending that was fine. If you're in the same position, you already know what you should do. Here's a way to actually do it.
Clean up your staging data today
Try the demo key or get your own API key in minutes.