Running 70B on Dual RTX 3090 with llama.cpp
Tensor-split across two 24GB cards to run Llama 3.1 70B or Qwen2.5 72B at Q4.
Tensor split
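Use --tensor-split to distribute layers across GPUs. The Q4_K_M quant of a 70B model needs about 44GB for the weights alone. That is tight but workable on 48GB total: whatever is left on each card must also hold the KV cache and compute buffers, so keep the context size modest.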
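As a rough check on the "tight but workable" claim, the bash sketch below does the per-card arithmetic. It rests on several assumptions. The split is even. The ~44GB figure is treated as 44 GiB. Each RTX 3090 exposes 24576 MiB. The KV cache is f16, and the model has the Llama 3.1 70B attention shape: 80 layers, 8 KV heads, head dimension 128. Qwen2.5 72B shares these numbers. The function names are illustrative only. The "headroom" it reports is what remains for the CUDA context and compute buffers.

```bash
#!/usr/bin/env bash
# Rough per-GPU VRAM budget for a 70B Q4_K_M model split evenly across two 24 GB cards.
# Assumptions (not measured): ~44 GiB of weights, f16 KV cache,
# Llama 3.1 70B shape (80 layers, 8 KV heads, head dim 128).

# KV cache in MiB: 2 tensors (K and V) x layers x KV heads x head dim x bytes per element x tokens
kv_cache_mib() {
  local ctx=$1 layers=${2:-80} kv_heads=${3:-8} head_dim=${4:-128} bytes=${5:-2}
  echo $(( 2 * layers * kv_heads * head_dim * bytes * ctx / 1024 / 1024 ))
}

# MiB left on each card after its share of the weights and of the KV cache
per_gpu_headroom_mib() {
  local ctx=$1 gpus=${2:-2} vram_mib=${3:-24576} weights_mib=${4:-45056}
  local kv
  kv=$(kv_cache_mib "$ctx")
  echo $(( vram_mib - weights_mib / gpus - kv / gpus ))
}

echo "KV cache at -c 4096: $(kv_cache_mib 4096) MiB"
echo "Headroom per GPU:    $(per_gpu_headroom_mib 4096) MiB"
```

Under these assumptions, -c 4096 needs a 1280 MiB KV cache and leaves about 1408 MiB per card. Raising it to -c 8192 cuts the headroom to 768 MiB. That suggests why a modest context is a sensible default for this setup.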
```bash
# -ngl 99 offloads every layer (Llama 3.1 70B has 80); --tensor-split 24,24 divides them evenly between the two cards
./build/bin/llama-server \
  -m ./models/Llama-3.1-70B-Q4_K_M.gguf \
  --tensor-split 24,24 \
  -c 4096 -ngl 99 \
  --host 0.0.0.0 --port 8080
```

Deployment guides are educational. Each model is subject to its own license; read the official Hugging Face model card before downloading or deploying.
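The values passed to --tensor-split are proportions, so 24,24 is simply an even split (1,1 would behave the same). If one card also drives a display and has less free memory, you can bias the split toward the other card. The sketch below builds the list from the free memory that nvidia-smi reports for each GPU. The helper name tensor_split_from_free is hypothetical. Treat its output as a starting point, because free memory is only a snapshot taken at launch.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read per-GPU free memory (MiB, one value per line on stdin)
# and print it as a comma-separated --tensor-split value.
# llama.cpp treats the list as proportions, so raw MiB figures need no normalising.
tensor_split_from_free() {
  local values=() mib
  # "|| [[ -n $mib ]]" keeps a final line that lacks a trailing newline
  while read -r mib || [[ -n $mib ]]; do
    mib=${mib//[[:space:]]/}   # drop stray whitespace or carriage returns
    [[ -n $mib ]] && values+=("$mib")
  done
  local IFS=,                  # join array elements with commas
  echo "${values[*]}"
}

# Example (needs NVIDIA drivers and two visible GPUs):
# split=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | tensor_split_from_free)
# ./build/bin/llama-server -m ./models/Llama-3.1-70B-Q4_K_M.gguf --tensor-split "$split" -c 4096 -ngl 99
```

For example, free values of 24000 and 20500 MiB produce 24000,20500. That places slightly more layers on the first card.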