
The Complete Guide to Local LLM Deployment in 2026

Take control of your AI infrastructure. Learn how to self-host high-performance open-weight models with vLLM, TGI, and Ollama on your own servers or private cloud.

[Figure: Local LLM deployment architecture, showing high-performance inference on your own infrastructure]

Key Takeaways

  • Self-hosting LLMs offers superior privacy, lower latency, and predictable costs at scale
  • vLLM is the gold standard for production inference with PagedAttention and continuous batching
  • Quantisation (AWQ/GPTQ) enables running massive 70B+ models on consumer/enterprise GPUs
  • Ollama is the best tool for local development and rapid prototyping
  • Kubernetes with KServe or Ray Serve provides the most scalable deployment architecture

Why Self-Host in 2026?

While proprietary models like GPT-5 and Claude 4.5 push the frontiers of reasoning, open-weight models (Llama 3, Mistral, Qwen) have become "good enough" for 90% of enterprise use cases, and they are often faster and cheaper to run.

Self-hosting allows you to:

  • Own Your Data: Ensure sensitive PII or IP never leaves your VPC.
  • Control Latency: Eliminate internet round-trips. On-premises inference keeps Time to First Token (TTFT) in the tens of milliseconds rather than hundreds.
  • Fix Costs: Stop paying per token. Pay for GPU hours, regardless of volume.
  • Ensure Reliability: No more "Service Unavailable" errors during provider outages.

Hardware Sizing Guide

The most common question: "How much VRAM do I need?"

The rule of thumb for FP16 (16-bit) precision is 2GB of VRAM per 1 billion parameters (2 bytes per weight). For 4-bit quantisation, the weights alone take roughly 0.5-0.6GB per billion parameters; budget 0.7-0.8GB per billion in practice once the KV cache and runtime overhead are added (a quick estimator is sketched after the table below).

Model Size     | FP16 VRAM | 4-bit (AWQ) VRAM | Recommended GPU
---------------|-----------|------------------|----------------------------
7B / 8B        | ~16 GB    | ~6-8 GB          | RTX 3060 / 4060 Ti / T4
13B / 14B      | ~28 GB    | ~10-12 GB        | RTX 3090 / 4090 / L4
70B / 72B      | ~140 GB   | ~40-48 GB        | 1x A6000 / 2x RTX 4090
Mixtral 8x7B   | ~96 GB    | ~26-32 GB        | 1x A100 40GB / 2x RTX 3090
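
For a quick back-of-envelope check, the same arithmetic can be scripted. The helper below is purely illustrative and estimates the weights alone; KV cache and runtime overhead come on top:

# Rough VRAM estimate for the model weights alone (illustrative helper).
# KV cache, activations, and CUDA context add several GB on top of this.
estimate_vram() {
  local billions=$1  # model size in billions of parameters
  awk -v b="$billions" 'BEGIN {
    printf "FP16 : ~%.0f GB\n", b * 2     # 2 bytes per parameter
    printf "4-bit: ~%.0f GB\n", b * 0.6   # ~0.5 bytes per parameter plus format overhead
  }'
}

estimate_vram 70   # prints roughly 140 GB (FP16) and 42 GB (4-bit)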

The Inference Software Landscape

[Figure: The high-performance inference stack, from user request down to the GPU]

Choosing the right serving engine is as critical as the hardware.

vLLM (The Production Standard)

An open-source, high-throughput serving engine. Famous for introducing PagedAttention, which manages KV-cache memory the way an operating system pages RAM, maximising batch sizes. Exposes an OpenAI-compatible API.

Ollama (The Developer Standard)

Focuses on simplicity. Wraps llama.cpp in a Go binary. Just ollama run llama3. Perfect for local dev, laptops, and simple deployments.

TGI (Text Generation Inference)

Built by Hugging Face. Highly optimised, with a Rust-based serving layer. Features tensor parallelism, continuous batching, and native integration with the Hugging Face Hub.
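
For comparison with the vLLM command in the next section, a minimal TGI launch might look like the sketch below. It assumes the standard ghcr.io container image and the same gated Llama 3 model, so a Hugging Face token is required:

# Run TGI with Docker (sketch; image tag and model are illustrative)
docker run --gpus all --shm-size 1g \
    -v ~/.cache/huggingface:/data \
    -e HF_TOKEN=$HF_TOKEN \
    -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Meta-Llama-3-8B-Instruct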

Production Inference with vLLM

For enterprise deployment, vLLM is the recommended choice in 2026 due to its throughput capabilities and broad model support.

# Run vLLM with Docker
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dtype auto \
    --api-key your-secret-key

This exposes an OpenAI-compatible API at http://localhost:8000/v1. You can drop this URL into any app built for the OpenAI API.
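
A quick smoke test with curl might look like this (the model name and API key must match whatever the server was started with):

# Query the OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer your-secret-key" \
    -d '{
      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
      "messages": [{"role": "user", "content": "Say hello in five words."}]
    }'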

Local Development with Ollama

Ollama shines for local testing and running on Mac/Windows.

# Installation (Linux/Mac)
curl -fsSL https://ollama.com/install.sh | sh

# Run a model
ollama run llama3

# Serve API (runs on port 11434 by default)
ollama serve
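
With the server running, the REST API can be exercised directly. A minimal sketch against Ollama's native generate endpoint:

# Query the Ollama REST API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'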

Understanding Quantisation

Quantisation is the magic that fits large models onto smaller GPUs.

  • GGUF: The format used by llama.cpp and Ollama. Designed for CPU inference with optional GPU offload, and runs well on Apple Silicon via Metal.
  • AWQ (Activation-aware Weight Quantisation): Best for GPU inference. Preserves accuracy by protecting important weights.
  • GPTQ: Older but widely supported GPU quantisation.

Recommendation: Use AWQ for vLLM/TGI production deployments. Use GGUF for local/Mac deployments.
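
As a sketch, serving a pre-quantised AWQ checkpoint with vLLM mostly comes down to pointing at the quantised weights and passing the quantization flag; the model name below is a placeholder for any AWQ checkpoint from the Hub:

# Serve an AWQ-quantised model with vLLM (model name is a placeholder)
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 --ipc=host \
    vllm/vllm-openai:latest \
    --model <org>/Meta-Llama-3-70B-Instruct-AWQ \
    --quantization awq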

Deployment Patterns

1. Single Node Docker

The simplest option: run a Docker container on a GPU-equipped VM. Good for internal tools.

2. Kubernetes with KServe

Autoscaling for production. KServe manages scale-to-zero and canary rollouts. Complex to set up but essential for high-traffic apps.
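
A minimal InferenceService wrapping the vLLM container might look like the sketch below; names, ports, and resource requests are illustrative and assume KServe's v1beta1 API on a cluster with GPU nodes:

# Sketch only: a KServe InferenceService wrapping the vLLM container
# (assumes KServe is installed and the cluster has GPU nodes; HF token handling omitted)
kubectl apply -f - <<'EOF'
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-8b
spec:
  predictor:
    minReplicas: 0                  # allow scale-to-zero when idle
    containers:
      - name: kserve-container
        image: vllm/vllm-openai:latest
        args: ["--model", "meta-llama/Meta-Llama-3-8B-Instruct"]
        ports:
          - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
EOF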

3. SkyPilot

Cloud abstraction. Run your inference task on any cloud (AWS, GCP, Azure, Lambda Labs) where GPUs are cheapest.
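
A hedged sketch of a SkyPilot task that provisions the cheapest matching GPU across your configured clouds and starts a vLLM server on it (file name and accelerator choice are illustrative):

# Sketch: a SkyPilot task (serve.yaml) that grabs the cheapest matching GPU
# and starts a vLLM server on it
cat > serve.yaml <<'EOF'
resources:
  accelerators: L4:1      # any supported accelerator spec works here
  ports: 8000
setup: |
  pip install vllm
run: |
  vllm serve meta-llama/Meta-Llama-3-8B-Instruct --port 8000
EOF

sky launch -c llm-serve serve.yaml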

Practical: Docker Compose Setup

Here is a complete docker-compose.yml to run vLLM alongside OpenWebUI (a ChatGPT-like interface).

docker-compose.yml
version: '3.8'

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model meta-llama/Meta-Llama-3-8B-Instruct
      --dtype auto
      --gpu-memory-utilization 0.95
      --max-model-len 8192

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OPENAI_API_BASE_URL=http://vllm:8000/v1
      - OPENAI_API_KEY=sk-no-key-required
    depends_on:
      - vllm
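
Assuming HF_TOKEN is set (inline below as a placeholder, or via an .env file), bringing the stack up and sanity-checking it takes two commands:

# Start the stack and verify the vLLM endpoint is serving the model
HF_TOKEN=hf_xxx docker compose up -d
curl http://localhost:8000/v1/models
# OpenWebUI is then available at http://localhost:3000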

Conclusion

Self-hosting LLMs in 2026 is no longer a research project; it's a viable engineering decision. With tools like vLLM and quantisation techniques like AWQ, you can run state-of-the-art models on affordable hardware with total privacy and control.

Start with Ollama for development. Move to vLLM for production. And always size your hardware for the quantised version of the model to save 50%+ on infrastructure costs.

Frequently Asked Questions

Why self-host LLMs instead of calling a hosted API?
Key reasons include: 1) Data Privacy (data never leaves your VPC), 2) Cost at Scale (tokens become cheaper than API calls for high-volume use), 3) Latency (no internet round-trip), 4) Reliability (no rate limits or provider outages), and 5) Customisation (fine-tuning or using specific open-weight models).

How much VRAM do I need to run Llama 3 70B?
For Llama 3 70B in FP16 (full precision), you need ~140GB VRAM (2x A100 80GB). However, using 4-bit quantisation (AWQ/GPTQ), you can fit it into ~40GB VRAM, which runs comfortably on a single A6000 Ada or 2x RTX 3090s/4090s.

What is the difference between vLLM and Ollama?
vLLM is a high-performance serving engine designed for production throughput, supporting multiple concurrent requests and PagedAttention. Ollama is a user-friendly wrapper designed for local development and simplicity, making it easy to download and run models with zero configuration. Use Ollama for dev, vLLM for prod.

What is quantisation, and does it reduce quality?
Quantisation reduces the precision of model weights (e.g., from 16-bit to 4-bit) to save memory. Modern techniques like AWQ and GPTQ preserve 95-99% of the model's reasoning capability while reducing memory usage by up to 75%. For most enterprise use cases, 4-bit or 8-bit quantisation is indistinguishable from full precision.

Can I run LLMs without a GPU?
Yes, using llama.cpp (GGUF format), you can run models on CPU. However, it is significantly slower (tokens/sec) compared to GPU inference. CPU inference is viable for batch processing or single-user local assistants, but rarely for production user-facing applications.
