IntermediateEdge / Local 9 min read

DeepSeek-R1 Distill 14B: EXL2 vs GGUF

Head-to-head on RTX 4090 — when to pick turboderp EXL2 over bartowski GGUF.

DeepSeek-R1EXL2GGUFExLlamaV2

Speed vs simplicity

EXL2 via ExLlamaV2 delivers ~35% faster inference than GGUF via llama.cpp on the same 14B model. GGUF wins on setup simplicity and Ollama compatibility.

text

EXL2 4.65bpw → ~128 tok/s (ExLlamaV2, RTX 4090)
GGUF Q4_K_M  → ~95 tok/s (llama.cpp, RTX 4090)

Download EXL2

Grab the turboderp EXL2 quant from Hugging Face. Use TabbyAPI or ExLlamaV2 server for OpenAI-compatible API.

bash

huggingface-cli download turboderp/DeepSeek-R1-Distill-Qwen-14B-exl2 \
  --include "*4.65bpw*" --local-dir ./models/r1-14b-exl2

Deployment guides are educational. Each model is subject to its own license — read the official Hugging Face model card before downloading or deploying.