Back to Cookbook
IntermediateEdge / Local 9 min read
DeepSeek-R1 Distill 14B: EXL2 vs GGUF
Head-to-head on RTX 4090 — when to pick turboderp EXL2 over bartowski GGUF.
DeepSeek-R1EXL2GGUFExLlamaV2
Speed vs simplicity
EXL2 via ExLlamaV2 delivers ~35% faster inference than GGUF via llama.cpp on the same 14B model. GGUF wins on setup simplicity and Ollama compatibility.
text
EXL2 4.65bpw → ~128 tok/s (ExLlamaV2, RTX 4090)
GGUF Q4_K_M → ~95 tok/s (llama.cpp, RTX 4090)Download EXL2
Grab the turboderp EXL2 quant from Hugging Face. Use TabbyAPI or ExLlamaV2 server for OpenAI-compatible API.
bash
huggingface-cli download turboderp/DeepSeek-R1-Distill-Qwen-14B-exl2 \
--include "*4.65bpw*" --local-dir ./models/r1-14b-exl2Deployment guides are educational. Each model is subject to its own license — read the official Hugging Face model card before downloading or deploying.