Deployment Cookbook

Battle-tested guides for running LLMs on real hardware

Beginner · Server / VPS
8 min read

Run Llama 3.1 8B on a €20/month VPS

A complete guide to running a private LLM API on a budget Linux VPS using llama.cpp server mode.

llama.cpp, VPS, Linux, GGUF
Beginner · Mac / Apple
6 min read

Mac M3 Max: The Ultimate Local LLM Setup

Maximise your Apple Silicon with Ollama. Run multiple models, set up an OpenAI-compatible API, and tune Metal GPU layers.

Ollama, Mac, Apple Silicon, Metal
Intermediate · Server / VPS
12 min read

Multi-Model API Server on RTX 4090 with vLLM

Serve multiple AWQ-quantized models with vLLM's continuous batching for production-grade throughput.

vLLM, AWQ, NVIDIA, RTX 4090
Beginner · Docker
5 min read

Docker Compose LLM Stack: Ollama + Open WebUI

A production-ready Docker Compose stack that gives you a local ChatGPT experience with one command.

Docker, Ollama, Open WebUI, Compose
Beginner · Edge / Local
7 min read

What Can You Run on an RTX 4060 Ti 16GB?

A practical guide to picking the right model and quant level for NVIDIA's best budget 16GB card.

RTX 4060 Ti, GGUF, EXL2, VRAM
Intermediate · Edge / Local
9 min read

DeepSeek-R1 Distill 14B: EXL2 vs GGUF

Head-to-head on RTX 4090 — when to pick turboderp EXL2 over bartowski GGUF.

DeepSeek-R1, EXL2, GGUF, ExLlamaV2
Intermediate · Edge / Local
10 min read

ExLlamaV2 on RTX 4090: Full Setup Guide

Install ExLlamaV2, load an EXL2 quant, and serve an OpenAI-compatible API in under 10 minutes.

ExLlamaV2, EXL2, RTX 4090, API
Advanced · Edge / Local
11 min read

Running 70B on Dual RTX 3090 with llama.cpp

Tensor-split across two 24GB cards to run Llama 3.1 70B or Qwen2.5 72B at Q4.

70B, Multi-GPU, llama.cpp, tensor-split
Intermediate · Edge / Local
8 min read

Qwen2.5-Coder 32B on a Single RTX 4090

The best open coding model that fits in 24GB — quant selection and tuning tips.

Qwen2.5-Coder, 32B, RTX 4090, GGUF
Beginner · Mac / Apple
6 min read

Mac M3 Pro: Realistic Model Limits

What actually fits in 18GB or 36GB unified memory with Ollama and llama.cpp.

Mac, M3 Pro, Ollama, Unified Memory
Beginner · Edge / Local
9 min read

llama.cpp on Windows with CUDA

Build llama.cpp with NVIDIA GPU support on Windows 11 — the path of least resistance for PC gamers.

Windows, llama.cpp, CUDA, GGUF
Intermediate · Server / VPS
8 min read

TabbyAPI: ExLlamaV2 with a Web UI

Wrap ExLlamaV2 in TabbyAPI for a polished OpenAI-compatible server with streaming and model hot-swap.

TabbyAPI, ExLlamaV2, API, EXL2
Advanced · Edge / Local
12 min read

Quantize Your Own Model to GGUF

Use llama.cpp's conversion and quantize tools to turn a supported Hugging Face model into a GGUF Q4_K_M file for local inference.

GGUF, llama.cpp, quantize, custom
Advanced · Server / VPS
10 min read

vLLM + AWQ in Production: Tuning Guide

gpu-memory-utilization, max-model-len, and batching knobs for stable API serving.

vLLM, AWQ, production, API
Intermediate · Server / VPS
7 min read

CPU Inference: OpenBLAS Tuning for llama.cpp

Maximise tokens/sec on a CPU-only VPS with thread count and BLAS backend tuning.

CPU, llama.cpp, OpenBLAS, VPS
Beginner · Edge / Local
8 min read

8GB GPU Starter Guide: 3060 / 4060 / 3070

The most common local LLM hardware tier — which models, quants, and context lengths actually fit in 8GB VRAM.

8GB VRAM, RTX 3060, RTX 4060, GGUF
Beginner · Mac / Apple
7 min read

M1 / M2 Mac 8GB: Realistic Ollama Limits

Unified memory is shared with macOS — here is what actually works on base MacBooks without swapping.

M1, M2, 8GB RAM, Ollama
Intermediate · Edge / Local
10 min read

WSL2 + Ollama GPU Passthrough on Windows

Run Ollama with NVIDIA GPU acceleration inside WSL2 — the most reliable Windows path for local LLMs.

WSL2, Windows, Ollama, NVIDIA
Intermediate · Docker
9 min read

Docker: Ollama with NVIDIA GPU Passthrough

Containerised Ollama with GPU access — isolate models, pin versions, and run alongside other services.

Docker, Ollama, NVIDIA, GPU
Intermediate · Server / VPS
11 min read

Nginx Reverse Proxy for Local LLM APIs

Put Ollama or llama.cpp behind Nginx with TLS, rate limiting, and a stable /v1 endpoint for your apps.

Nginx, API, TLS, Ollama
Advanced · Edge / Local
12 min read

AMD GPU + llama.cpp via ROCm (Quick Start)

Run GGUF models on Radeon RX 7900 / 6800 series with llama.cpp HIP backend — what works and what does not.

AMD, ROCm, llama.cpp, HIP
Beginner · Edge / Local
6 min read

Ollama on Windows (Native, No WSL)

Install the Windows Ollama app for the simplest path — GPU works on NVIDIA; AMD is CPU-only for now.

Windows, Ollama, NVIDIA, Desktop