10.3 Data Privacy & Compliance
Key Concepts: On-premise vs cloud · PII handling · Data residency (UAE/GDPR)
Official Docs: OpenAI Data Privacy · Azure OpenAI Data Residency
Key Regulations to Know
| Regulation | Region | Key requirement |
|---|---|---|
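| GDPR | EU/EEA | Transfers outside the EEA need an adequacy decision or other safeguards (e.g., standard contractual clauses); right to erasure |
| UAE PDPL | UAE | Protects personal data of individuals in the UAE; consent generally required, with limited exceptions |
| HIPAA | USA | Protected health information must be safeguarded (e.g., encryption, access controls) and access audit-logged |
| PDPA | Thailand / Singapore | Two separate laws sharing one acronym; both are built around consent for collecting and using personal data |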
Decision: Cloud API vs On-Premise
Does your data include PII or regulated data?
                     │
              No     │     Yes
           ══════════╫══════════
           ↓                   ↓
       Cloud API     Does your cloud provider offer data residency?
                                       │
                                No     │     Yes
                             ══════════╫══════════
                             ↓                   ↓
                        On-premise     Cloud with data residency
                       (Ollama/vLLM)   (Azure OpenAI EU region)
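The same decision can be written down as a small function, which is handy when the choice has to be enforced in configuration rather than remembered. This is a minimal sketch: the names `Deployment` and `choose_deployment` are hypothetical, and the two boolean inputs are assumed to come from your own data classification and provider review.

```python
from enum import Enum


class Deployment(Enum):
    CLOUD_API = "cloud API"
    CLOUD_WITH_RESIDENCY = "cloud with data residency"
    ON_PREMISE = "on-premise"


def choose_deployment(has_regulated_data: bool, provider_offers_residency: bool) -> Deployment:
    """Answer the two questions of the decision tree in order."""
    if not has_regulated_data:
        # No PII or regulated data: a plain cloud API is acceptable.
        return Deployment.CLOUD_API
    if provider_offers_residency:
        # Regulated data, but the provider can pin it to a region.
        return Deployment.CLOUD_WITH_RESIDENCY
    # Regulated data and no residency guarantee: keep it in-house.
    return Deployment.ON_PREMISE


print(choose_deployment(True, False).value)  # prints "on-premise"
```

A real decision usually involves more inputs (contract terms, sector rules, legal advice); the function only mirrors the two branches drawn above.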
PII Detection and Redaction
Before sending data to any cloud LLM, detect and redact PII:
import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Simple rule-based PII patterns, applied in this order: (label, regex)
PII_PATTERNS = [
    # Email addresses
    ("EMAIL", r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'),
    # UAE Emirates IDs (784-XXXX-XXXXXXX-X)
    ("UAE_ID", r'784-\d{4}-\d{7}-\d'),
    # Phone numbers (international format, digits only)
    ("PHONE", r'\+?[1-9]\d{7,14}'),
]


def redact_pii(text: str) -> tuple[str, dict]:
    """Redact PII from text. Returns redacted text and a mapping for restoration."""
    redacted = text
    mapping = {}  # placeholder -> original value
    for label, pattern in PII_PATTERNS:
        seen = {}  # original value -> placeholder, so repeats share one placeholder

        def substitute(match: re.Match) -> str:
            value = match.group(0)
            if value not in seen:
                seen[value] = f"[{label}_{len(seen)}]"
                mapping[seen[value]] = value
            return seen[value]

        # Match against the already-redacted text so earlier placeholders are kept
        redacted = re.sub(pattern, substitute, redacted)
    return redacted, mapping


def restore_pii(text: str, mapping: dict) -> str:
    """Restore redacted PII after LLM processing."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


# Example
user_message = "My email is user@example.com and my phone is +971501234567. Please analyse my contract."
redacted, mapping = redact_pii(user_message)
print(f"Redacted: {redacted}")

# Redacted text is safer to send to a cloud API (regex can still miss PII)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": redacted}],
)
llm_response = resp.choices[0].message.content

# Restore if needed
final = restore_pii(llm_response, mapping)
Audit Logging
In regulated industries, you will typically need an audit trail of every LLM call:
import json
from datetime import datetime, timezone
from hashlib import sha256


def audit_log(user_id: str, prompt_hash: str, response_hash: str, tokens: int, model: str):
    """Log LLM call metadata without storing the prompt or response text (data minimisation)."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,  # still personal data if it identifies a person
        "prompt_sha256": prompt_hash,      # Hash, not raw text
        "response_sha256": response_hash,  # Hash, not raw text
        "tokens_used": tokens,
        "model": model,
    }
    # Append one JSON object per line (JSON Lines format)
    with open("audit.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


# Usage
audit_log(
    user_id="user_abc123",
    prompt_hash=sha256("user message".encode()).hexdigest(),
    response_hash=sha256("llm response".encode()).hexdigest(),
    tokens=150,
    model="gpt-4o-mini",
)
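One caveat with plain SHA-256: a short or predictable prompt can be recovered by hashing candidate strings and comparing the results. A keyed hash (HMAC) makes that guessing impractical without the key, while still letting you confirm later that a given text is the one a log entry refers to. The sketch below uses only the standard library; `keyed_hash`, `matches_logged_hash` and `AUDIT_KEY` are hypothetical names, and the key shown is a placeholder that would normally come from a secret store.

```python
import hashlib
import hmac


def keyed_hash(text: str, key: bytes) -> str:
    """Return the HMAC-SHA256 of text as a hex string."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()


def matches_logged_hash(text: str, logged_hash: str, key: bytes) -> bool:
    """Check whether text is the content that a logged hash refers to."""
    # compare_digest avoids leaking information through comparison timing
    return hmac.compare_digest(keyed_hash(text, key), logged_hash)


# Placeholder key for illustration only; load a real one from a secret store.
AUDIT_KEY = b"replace-with-a-secret-from-your-vault"

logged = keyed_hash("user message", AUDIT_KEY)
print(matches_logged_hash("user message", logged, AUDIT_KEY))   # prints True
print(matches_logged_hash("other message", logged, AUDIT_KEY))  # prints False
```

The hex digest can be passed to `audit_log` in place of the plain SHA-256 value; whoever holds the key can still verify entries, so access to the key needs the same care as access to the data.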
Common Mistakes
- Sending raw medical/legal data to cloud APIs: without the right contract with the provider (a Business Associate Agreement under HIPAA, a data processing agreement under GDPR), this may violate those regulations.
- Relying on rule-based PII detection only: simple regex misses names, addresses, and custom ID formats. Consider a dedicated PII detection tool (e.g., spaCy NER, Microsoft Presidio).
- Logging full prompts: storing raw prompts containing user data creates a compliance liability. Log hashes or metadata only.
- Forgetting about RAG data: if your vector database contains personal data, retrieval can expose that data to the LLM and downstream systems.
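The last point, about RAG data, can be illustrated with a small filter that scrubs retrieved chunks before they are placed in a prompt. This is a sketch under simple assumptions: `scrub_chunks` and `build_prompt` are hypothetical helpers, the chunks are plain strings already returned by your retriever, and only email addresses are masked.

```python
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def scrub_chunks(chunks: list[str]) -> list[str]:
    """Mask email addresses in retrieved chunks before they enter a prompt."""
    return [EMAIL_RE.sub("[EMAIL]", chunk) for chunk in chunks]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a RAG prompt from scrubbed context only."""
    context = "\n".join(f"- {chunk}" for chunk in scrub_chunks(chunks))
    return f"Context:\n{context}\n\nQuestion: {question}"


retrieved = [
    "Contact Sara at sara@example.com for renewals.",
    "Contract term: 24 months.",
]
print(build_prompt("How long is the contract term?", retrieved))
```

The name "Sara" still reaches the prompt, which is exactly the regex limitation noted in the list. Scrubbing documents before they are indexed is usually the stronger option, since it also keeps personal data out of the vector store itself.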
Quick Quiz
Q1. What does data residency mean?
A1. Ensuring that data is stored and processed only within a specified geographic region. For example, keeping EU personal data on EU servers avoids GDPR's restrictions on transfers outside the EEA.
Q2. Why should you log prompt hashes rather than raw prompts?
A2. Raw prompts may contain personal data, creating a GDPR liability. Hashes allow integrity verification without storing sensitive content.
Q3. What is Microsoft Presidio?
A3. An open-source PII detection and anonymisation framework that supports 20+ entity types and multiple languages, more comprehensive than regex.
Student Exercise
Exercise 10.3: PII redaction pipeline
Extend the redact_pii function to also detect names (using spaCy NER), IBAN numbers, and passport numbers. Build a test set of 10 labelled examples, then measure the false positive and false negative rates.
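For the measurement step, a small harness like the one below may help. It is only a sketch: `error_rates` and `toy_detect` are hypothetical names, the detector is assumed to return the set of PII strings it found, and the rates are defined at span level as the share of flagged spans that are wrong (1 - precision) and the share of true spans that were missed (1 - recall).

```python
import re
from typing import Callable


def error_rates(
    detect: Callable[[str], set[str]],
    examples: list[tuple[str, set[str]]],
) -> tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) over labelled examples."""
    flagged = wrong = missed = expected_total = 0
    for text, expected in examples:
        found = detect(text)
        flagged += len(found)
        wrong += len(found - expected)    # flagged, but not real PII
        missed += len(expected - found)   # real PII that was not flagged
        expected_total += len(expected)
    fp_rate = wrong / flagged if flagged else 0.0
    fn_rate = missed / expected_total if expected_total else 0.0
    return fp_rate, fn_rate


def toy_detect(text: str) -> set[str]:
    """Stand-in detector that only knows about email addresses."""
    return set(re.findall(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", text))


EXAMPLES = [
    ("Mail a@example.com", {"a@example.com"}),
    ("Call +971501234567", {"+971501234567"}),
    ("No PII here", set()),
]

print(error_rates(toy_detect, EXAMPLES))  # prints (0.0, 0.5)
```

Swap `toy_detect` for a wrapper around your extended detector and replace `EXAMPLES` with your own 10 labelled cases.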