Your RAG Pipeline Is Leaking Customer Data Into Vector Embeddings
If you're building a RAG (Retrieval Augmented Generation) system on internal documents such as customer support history, knowledge base articles, or internal comms, there's a data protection problem hiding in your vector store.
When you embed a document chunk that contains "Sarah Mitchell called on 15 March about her order to 14 Beechwood Avenue, Manchester," the embedding captures the semantic meaning of that entire passage. Including the name. Including the address.
Your vector database now contains a mathematical representation of a customer's personal data. There's a reasonable argument that the embeddings themselves constitute personal data under GDPR: they were derived from personal data and are stored alongside metadata containing the original text. The legal position on embeddings specifically hasn't been tested, but the text chunks that get injected into the LLM prompt are unambiguously personal data.
That's the more immediate problem: when someone queries the RAG system and the retriever pulls that chunk, the full text (including the PII) gets injected into the LLM prompt. Now the LLM is processing customer data that the user may have no business seeing.
The Specific Risks
Cross-user data leakage. Support agent A asks the RAG system "how do we handle delivery delays?" The retriever pulls chunks from agent B's resolved tickets because they're semantically similar. Those chunks contain agent B's customers' personal details. Agent A now sees them in the LLM response.
Data subject rights. If a customer exercises their GDPR right to erasure (Article 17), you need to delete their data. But their data is now fragmented across thousands of vector embeddings. Identifying and removing the specific embeddings that encode their personal data is technically complex and potentially impossible to do completely.
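One practical mitigation is to tag every vector with the data subjects it was derived from at ingestion time, so an erasure request becomes a metadata-filter delete rather than a forensic search. A minimal sketch, using a plain dict as a stand-in for a real vector store (the store and helper names here are illustrative; most hosted vector databases expose an equivalent delete-by-metadata-filter operation):

```python
# Sketch: tag each chunk with the data-subject IDs found in its source record,
# so GDPR Article 17 erasure maps to a filter, not a search.
vector_store = {}  # id -> {"vector": [...], "metadata": {...}}; stand-in for a real DB

def upsert_chunk(chunk_id, vector, text, subject_ids):
    """Store a chunk with the IDs of every data subject it mentions."""
    vector_store[chunk_id] = {
        "vector": vector,
        "metadata": {"text": text, "subject_ids": subject_ids},
    }

def erase_subject(subject_id):
    """Delete every chunk derived from this data subject's records."""
    doomed = [cid for cid, rec in vector_store.items()
              if subject_id in rec["metadata"]["subject_ids"]]
    for cid in doomed:
        del vector_store[cid]
    return len(doomed)

upsert_chunk("t1", [0.1, 0.2], "delivery complaint", ["cust-4829"])
upsert_chunk("t2", [0.3, 0.4], "refund issued", ["cust-4829", "cust-5001"])
upsert_chunk("t3", [0.5, 0.6], "KB article, no PII", [])

erase_subject("cust-4829")  # removes t1 and t2; t3 survives
```

This only works if the subject tagging happens at ingestion; retrofitting it onto an existing untagged store is exactly the hard problem described above.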
Vendor exposure. If your vector database is hosted (Pinecone, Weaviate Cloud, Qdrant Cloud), customer PII is now sitting in yet another third party's infrastructure.
The Fix: Sanitise Before Embedding
Strip PII from document chunks before you generate embeddings:
import requests

def sanitise_chunk(text):
    response = requests.post(
        "https://api.comply-tech.co.uk/api/v1/anonymise",
        headers={"X-Api-Key": "your-api-key", "Content-Type": "application/json"},
        json={
            "content": text,
            "contentType": "text",
            "strategy": "Redact",
            "frameworks": ["GDPR"]
        },
        timeout=10
    )
    # Fail loudly rather than silently embedding unsanitised text
    response.raise_for_status()
    return response.json()["anonymisedContent"]
# In your ingestion pipeline
for chunk in document_chunks:
    clean_chunk = sanitise_chunk(chunk.text)
    embedding = embed_model.encode(clean_chunk)
    vector_db.upsert(id=chunk.id, vector=embedding, metadata={"text": clean_chunk})
Now the embedded text reads: "[NAME REDACTED] called on 15 March about their order to [ADDRESS REDACTED]." The semantic meaning (a customer calling about a delivery issue) is preserved. The personal identifiers are gone.
What About Search Quality?
We were worried about this. Would stripping names and addresses from support ticket chunks degrade retrieval quality?
In practice: no. RAG retrieval works on semantic similarity. The query "how do we handle delivery delays?" matches on "order," "delivery," and the surrounding context, not on the customer's name. Redacting PII removes noise, not signal.
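A toy way to see this (word overlap, not a real embedding model, so treat it as an illustration only): the terms the query actually matches on were never the PII tokens, so redaction leaves the overlap untouched.

```python
# Toy illustration: the query/chunk term overlap is identical before and
# after redaction, because the matching terms are not the personal data.
def terms(text):
    return {w.strip(".,?").lower() for w in text.split()}

query = "how do we handle delivery delays?"
original = "Sarah Mitchell called on 15 March about delayed delivery to 14 Beechwood Avenue"
redacted = "[NAME REDACTED] called on 15 March about delayed delivery to [ADDRESS REDACTED]"

print(terms(query) & terms(original))  # {'delivery'}
print(terms(query) & terms(redacted))  # {'delivery'}
```

Real embedding models match on far more than exact words, but the same intuition holds: the signal lives in the subject matter, not the identifiers.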
The one exception: if your RAG system is used to look up specific customers by name, redaction obviously breaks that. For that use case, you'd need a separate lookup mechanism with appropriate access controls, not a vector search over PII.
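A minimal sketch of what such a lookup mechanism might look like (the record IDs, roles, and function names are invented for illustration): an exact-key store kept entirely outside the vector database, gated by an explicit role check.

```python
# Hypothetical: PII lives in a keyed store with access control,
# never in the vector index. Roles and records are invented.
CUSTOMER_RECORDS = {
    "cust-4829": {"name": "Sarah Mitchell", "city": "Manchester"},
}

ALLOWED_ROLES = {"support_lead", "dpo"}  # roles permitted to view customer PII

def lookup_customer(customer_id, requester_role):
    """Exact-key lookup, refused unless the requester's role is allowed."""
    if requester_role not in ALLOWED_ROLES:
        raise PermissionError(f"role {requester_role!r} may not view customer PII")
    return CUSTOMER_RECORDS[customer_id]

lookup_customer("cust-4829", "support_lead")   # returns the record
# lookup_customer("cust-4829", "support_agent")  # raises PermissionError
```

The point is separation of concerns: semantic search over redacted text for "how do we handle X?", and an audited, access-controlled lookup for "show me this customer".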
Try It
curl -X POST https://api.comply-tech.co.uk/api/v1/anonymise \
  -H "X-Api-Key: demo-key-complytech" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Customer Sarah Mitchell (sarah.m@gmail.com) called on 15 March about delayed delivery to 14 Beechwood Ave, Manchester M20 3FJ. Order #4829, value £234.50. Resolved with full refund.",
    "contentType": "text",
    "strategy": "Redact",
    "frameworks": ["GDPR"]
  }'
Sanitise before you embed
Keep PII out of your vector store from day one.