10.3 Data Privacy & Compliance

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: On-premise vs cloud · PII handling · Data residency (UAE/GDPR)

Official Docs: OpenAI Data Privacy · Azure OpenAI Data Residency


Key Regulations to Know

Regulation | Region | Key requirement
GDPR | EU/EEA | Personal data may leave the EEA only under an adequacy decision or equivalent safeguards; right to erasure
UAE PDPL | UAE | Personal data of UAE residents protected; consent required
HIPAA | USA | Health data must be encrypted and audit-logged
PDPA | Thailand/Singapore | Explicit consent for personal data processing

Decision: Cloud API vs On-Premise

Does your data include PII or regulated data?
  No  -> Cloud API
  Yes -> Does your cloud provider offer data residency?
           No  -> On-premise (Ollama/vLLM)
           Yes -> Cloud with data residency (Azure OpenAI EU region)
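
The decision tree above can be sketched as a small function. The function name and the returned labels are illustrative assumptions, not part of any library.

```python
def choose_deployment(has_regulated_data: bool, provider_offers_residency: bool) -> str:
    """Return a deployment option following the decision tree above."""
    if not has_regulated_data:
        return "cloud_api"
    if provider_offers_residency:
        return "cloud_with_data_residency"  # e.g. Azure OpenAI in an EU region
    return "on_premise"  # e.g. Ollama or vLLM


print(choose_deployment(has_regulated_data=True, provider_offers_residency=False))
```

Keeping the rule in one function makes it easy to review with a compliance team and to test alongside the rest of the pipeline.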

PII Detection and Redaction

Before sending data to any cloud LLM, detect and redact PII:

import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Simple rule-based PII redaction
def redact_pii(text: str) -> tuple[str, dict]:
    """Redact PII from text. Returns redacted text and a mapping for restoration."""
    redacted = text
    mapping = {}
    counter = {"email": 0, "phone": 0, "id": 0}

    # Email addresses (dict.fromkeys drops repeats so each value gets one placeholder)
    emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', redacted)
    for email in dict.fromkeys(emails):
        placeholder = f"[EMAIL_{counter['email']}]"
        mapping[placeholder] = email
        redacted = redacted.replace(email, placeholder)
        counter['email'] += 1

    # Phone numbers (international format), searched in the already-redacted text
    phones = re.findall(r'\+?[1-9]\d{7,14}', redacted)
    for phone in dict.fromkeys(phones):
        placeholder = f"[PHONE_{counter['phone']}]"
        mapping[placeholder] = phone
        redacted = redacted.replace(phone, placeholder)
        counter['phone'] += 1

    # UAE Emirates IDs (784-XXXX-XXXXXXX-X)
    uae_ids = re.findall(r'784-\d{4}-\d{7}-\d', redacted)
    for id_ in dict.fromkeys(uae_ids):
        placeholder = f"[UAE_ID_{counter['id']}]"
        mapping[placeholder] = id_
        redacted = redacted.replace(id_, placeholder)
        counter['id'] += 1

    return redacted, mapping

def restore_pii(text: str, mapping: dict) -> str:
    """Restore redacted PII after LLM processing."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

# Example (the example.com address is a placeholder)
user_message = "My email is sara@example.com and my phone is +971501234567. Please analyse my contract."
redacted, mapping = redact_pii(user_message)
print(f"Redacted: {redacted}")

# Send only the redacted text to the cloud API (regex redaction is best-effort)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": redacted}],
)
llm_response = resp.choices[0].message.content

# Restore if needed
final = restore_pii(llm_response, mapping)

Audit Logging

Regulated industries typically require an audit trail, so log every LLM call:

import json
from datetime import datetime, timezone
from hashlib import sha256

def audit_log(user_id: str, prompt_hash: str, response_hash: str, tokens: int, model: str):
    """Log LLM calls without storing actual content (supports GDPR data minimisation)."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # timezone-aware UTC
        "user_id": user_id,
        "prompt_sha256": prompt_hash,      # Hash, not raw text
        "response_sha256": response_hash,  # Hash, not raw text
        "tokens_used": tokens,
        "model": model,
    }
    # Append one JSON object per line (JSON Lines)
    with open("audit.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

# Usage
audit_log(
    user_id="user_abc123",
    prompt_hash=sha256("user message".encode()).hexdigest(),
    response_hash=sha256("llm response".encode()).hexdigest(),
    tokens=150,
    model="gpt-4o-mini",
)
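
A plain SHA-256 of a short or predictable prompt can be guessed by hashing candidate inputs, so one hedged option is a keyed hash (HMAC). The sketch below assumes a secret key held outside the log; `AUDIT_HMAC_KEY`, `keyed_hash`, and `matches` are hypothetical names, not part of the code above.

```python
import hashlib
import hmac

# Assumption: in practice the key comes from a secrets manager, not source code.
AUDIT_HMAC_KEY = b"replace-with-a-secret-key"


def keyed_hash(text: str, key: bytes = AUDIT_HMAC_KEY) -> str:
    """Return a hex HMAC-SHA256 digest of text."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()


def matches(text: str, logged_digest: str, key: bytes = AUDIT_HMAC_KEY) -> bool:
    """Check whether text is the content behind a logged digest."""
    return hmac.compare_digest(keyed_hash(text, key), logged_digest)


digest = keyed_hash("user message")
print(matches("user message", digest))      # True
print(matches("tampered message", digest))  # False
```

The digest can be passed to an audit function in place of a plain hash; verification then requires both the original text and the key.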

Common Mistakes

  1. Sending raw medical/legal data to cloud APIs: without a Business Associate Agreement (BAA) for HIPAA data, or a data processing agreement (DPA) for GDPR data, this may violate those regulations.
  2. Rule-based PII detection only: simple regex misses names, addresses, and custom ID formats. Consider a dedicated PII detection tool (e.g., spaCy NER, Microsoft Presidio).
  3. Logging full prompts: storing raw prompts containing user data creates a compliance liability. Log hashes or metadata only.
  4. Forgetting about RAG data: if your vector database contains personal data, retrieval can expose that data to the LLM and downstream systems.
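
For the RAG case in item 4, one possible mitigation is to redact retrieved chunks before they are placed in the prompt. This is a minimal sketch that only handles email addresses; `EMAIL_RE`, `redact_chunk`, and `build_context` are hypothetical names, and the sample chunks stand in for whatever a vector database returns.

```python
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def redact_chunk(chunk: str) -> str:
    """Replace every email address in a retrieved chunk with a fixed tag."""
    return EMAIL_RE.sub("[EMAIL]", chunk)


def build_context(chunks: list[str]) -> str:
    """Join redacted chunks into the context string sent to the LLM."""
    return "\n---\n".join(redact_chunk(c) for c in chunks)


# Placeholder data standing in for vector-database results
retrieved = [
    "Contract owner: sara@example.com, renewal due in March.",
    "Payment terms are net 30 days.",
]
print(build_context(retrieved))
```

Redacting at retrieval time covers documents that were indexed before any redaction policy existed; redacting at ingestion time is the stricter alternative.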

Quick Quiz

Test Your Understanding

Q1. What does data residency mean?
A1. Ensuring that data is stored and processed only within a specified geographic region, e.g., EU data stays on servers in the EU to support GDPR compliance.

Q2. Why should you log prompt hashes rather than raw prompts?
A2. Raw prompts may contain personal data, creating a GDPR liability. Hashes allow integrity verification without storing sensitive content.

Q3. What is Microsoft Presidio?
A3. An open-source PII detection and anonymisation framework that supports 20+ entity types and multiple languages, more comprehensive than regex.


Student Exercise

Exercise 11.3 — PII redaction pipeline
Extend the redact_pii function to also detect names (using spaCy NER), IBAN numbers, and passport numbers. Build a test set of 10 examples and measure the false positive and false negative rates.
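
One way to approach the measurement step is to compare the set of values a detector returns against hand-labelled values for each example. The sketch below is an assumption about how to define the rates (misses over labelled items, spurious hits over detected items); `error_rates` is a hypothetical helper and the sample data is made up for illustration.

```python
def error_rates(expected: list[set[str]], detected: list[set[str]]) -> tuple[float, float]:
    """Return (false_negative_rate, false_discovery_rate) over all examples."""
    missed = sum(len(e - d) for e, d in zip(expected, detected))
    spurious = sum(len(d - e) for e, d in zip(expected, detected))
    total_expected = sum(len(e) for e in expected)
    total_detected = sum(len(d) for d in detected)
    fn_rate = missed / total_expected if total_expected else 0.0
    fd_rate = spurious / total_detected if total_detected else 0.0
    return fn_rate, fd_rate


# Hand-labelled PII per example vs. what a detector reported (illustrative data)
expected = [{"sara@example.com"}, {"+971501234567"}, set()]
detected = [{"sara@example.com"}, set(), {"12345678"}]
print(error_rates(expected, detected))  # (0.5, 0.5)
```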

