10.1 Self-Hosting LLMs
Serving stacks for self-hosted LLMs: Ollama for local development, vLLM and TGI for high-throughput production serving, plus hardware sizing and the GPU-versus-CPU inference trade-off. A worked example follows below.
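To make the serving layer concrete, here is a minimal sketch that queries a locally hosted model through Ollama's REST API. It assumes `ollama serve` is running on the default port 11434 and that a model tagged `llama3` has already been pulled with `ollama pull llama3`; the model tag and prompt are illustrative. As a rough sizing rule, a 7B-parameter model quantized to 4 bits occupies on the order of 4 to 5 GB of memory, which is why quantized 7B-class models are the usual fit for a single consumer GPU or for CPU-only inference.

```python
# Minimal sketch: query a locally running Ollama server over its REST API.
# Assumes `ollama serve` is running on the default port (11434) and that a
# model tagged "llama3" has already been pulled (`ollama pull llama3`).
import requests

def generate(prompt: str, model: str = "llama3") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    # With stream=False, Ollama returns one JSON object with the full text.
    return resp.json()["response"]

if __name__ == "__main__":
    print(generate("Explain GPU vs CPU inference in one sentence."))
```

The same loop works against vLLM or TGI by swapping the endpoint, since both expose HTTP APIs; only the request and response shapes differ.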
An LLM API gateway: a FastAPI wrapper in front of the model server, request queuing to smooth bursts, and per-client token budgets, as sketched below.
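The sketch below shows one way such a gateway can hang together, assuming FastAPI: an asyncio semaphore caps the number of in-flight requests (queuing the rest), and a naive in-memory dictionary tracks a token budget per API key. The backend call is stubbed out; `call_backend`, `MAX_CONCURRENCY`, and the `demo-key` budget are illustrative names, not an established API.

```python
# Sketch of an LLM API gateway: FastAPI front end with a concurrency cap
# (request queuing via a semaphore) and a naive in-memory token budget per
# API key. The backend call is a stub; swap in your model server's client.
import asyncio
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()
MAX_CONCURRENCY = 4                                  # requests in flight at once
semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
budgets: dict[str, int] = {"demo-key": 100_000}      # tokens remaining per key

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

async def call_backend(prompt: str, max_tokens: int) -> str:
    # Placeholder for a call to vLLM/TGI/Ollama; simulate latency here.
    await asyncio.sleep(0.1)
    return f"echo: {prompt[:40]}"

@app.post("/v1/completions")
async def completions(req: CompletionRequest, x_api_key: str = Header(...)):
    remaining = budgets.get(x_api_key)
    if remaining is None:
        raise HTTPException(401, "unknown API key")
    if remaining < req.max_tokens:
        raise HTTPException(429, "token budget exhausted")
    async with semaphore:             # queues here if MAX_CONCURRENCY reached
        text = await call_backend(req.prompt, req.max_tokens)
    # Charge the reserved maximum; a real gateway charges actual usage.
    budgets[x_api_key] = remaining - req.max_tokens
    return {"text": text, "tokens_remaining": budgets[x_api_key]}
```

A production gateway would charge the actual token count reported by the model server rather than the reserved maximum, and would keep budgets in a shared store such as Redis so that multiple gateway replicas agree.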
Privacy and compliance for LLM deployments: on-premise versus cloud hosting, PII handling, and data residency requirements (for example, UAE data protection rules and the EU's GDPR).
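One common pattern when prompts must not leave a residency boundary is to redact PII before any text reaches an externally hosted model. A minimal sketch, assuming simple regex-based detection: the patterns below are deliberately simplistic examples, and production deployments typically rely on dedicated PII detection tooling rather than hand-rolled regexes.

```python
# Illustrative pre-filter: redact obvious PII before a prompt leaves an
# on-premise boundary. These regexes are simplistic examples only.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def redact(text: str) -> str:
    # Replace each detected span with a bracketed label, e.g. [EMAIL].
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact Ali at ali@example.ae or +971 50 123 4567."))
```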
Observability for production LLM systems: tracing with LangSmith, logging prompts and responses, and detecting behavioral drift over time, as in the sketch below.
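Absent a hosted tracer such as LangSmith, the core of this can be approximated in a few lines: append every prompt/response pair to a JSON-lines log and watch a cheap statistic for drift. The sketch below, with an illustrative log path and window size, uses the rolling mean of response length as a crude drift proxy; real drift detection would track richer signals such as refusal rates, eval scores, or embedding distributions.

```python
# Minimal sketch of prompt/response logging with a crude drift signal:
# append each call as a JSON line, and compare a rolling mean of response
# length against a baseline. Path and window size are illustrative.
import json
import time
from collections import deque

LOG_PATH = "llm_calls.jsonl"
window: deque[int] = deque(maxlen=100)   # lengths of the last 100 responses

def log_call(prompt: str, response: str) -> None:
    record = {"ts": time.time(), "prompt": prompt, "response": response}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
    window.append(len(response))

def drift_alert(baseline_mean: float, tolerance: float = 0.3) -> bool:
    # Fires when the rolling mean response length deviates from the
    # baseline by more than `tolerance` (30% by default).
    if not window or baseline_mean <= 0:
        return False
    current = sum(window) / len(window)
    return abs(current - baseline_mean) / baseline_mean > tolerance
```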
Cost control for LLM serving: response caching, prompt compression, and routing each request to a small or large model depending on its difficulty (see the sketch after this line).
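Here is a sketch of two of these levers together, with placeholder model names and an arbitrary length threshold: an exact-match response cache in front of the model call, and a router that sends short prompts to a small model. Real routers usually classify the request by task type or difficulty rather than raw length, and production caches are semantic (embedding-keyed) with eviction, but the control flow is the same.

```python
# Sketch of two cost levers: an exact-match response cache and a naive
# router that sends short prompts to a small model. Model names and the
# 200-character threshold are placeholders.
import hashlib
from typing import Callable

cache: dict[str, str] = {}
SMALL_MODEL, LARGE_MODEL = "llama3:8b", "llama3:70b"   # illustrative tags

def route(prompt: str) -> str:
    # Cheap heuristic: short prompts go to the small (cheaper) model.
    return SMALL_MODEL if len(prompt) < 200 else LARGE_MODEL

def complete(prompt: str, backend: Callable[[str, str], str]) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in cache:                       # cache hit: zero inference cost
        return cache[key]
    response = backend(route(prompt), prompt)
    cache[key] = response
    return response
```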