📄️ 10.1 Self-Hosting
Ollama, vLLM, TGI, hardware sizing, and GPU vs. CPU inference for self-hosted LLMs.
📄️ 10.2 API Gateway
FastAPI wrapper, request queuing, and token budgets for LLM API gateways.
📄️ 10.3 Privacy & Compliance
On-premises vs. cloud hosting, PII handling, and data residency (UAE/GDPR) for LLM deployments.
📄️ 10.4 Monitoring
LangSmith, logging prompts/responses, and drift detection for production LLM systems.
📄️ 10.5 Cost Optimization
Caching, prompt compression, and model routing (small vs. large models) for LLM cost control.