
Running 70B on Dual RTX 3090 with llama.cpp

Split the model across two 24 GB cards with llama.cpp's --tensor-split to run Llama 3.1 70B or Qwen2.5 72B at Q4 quantization.

Tags: 70B, Multi-GPU, llama.cpp, tensor-split

Tensor split

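Use --tensor-split to set the proportion of the model offloaded to each GPU. The values are ratios, not gigabytes, so 24,24 gives an even split across two identical cards; with the default layer split mode, whole layers are assigned to each GPU in that proportion. A Q4_K_M 70B model needs about 44 GB for weights, which is tight but workable on 48 GB total, because the KV cache and compute buffers must fit in whatever remains.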

bash

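# -ngl 99 offloads every layer to GPU; --tensor-split 24,24 divides them evenly across both cards.
# --host 0.0.0.0 listens on all interfaces; use 127.0.0.1 to keep the server local.
./build/bin/llama-server \
  -m ./models/Llama-3.1-70B-Q4_K_M.gguf \
  --tensor-split 24,24 \
  -c 4096 -ngl 99 \
  --host 0.0.0.0 --port 8080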
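
To see why 48 GB is tight, it helps to estimate the KV cache that sits on top of the roughly 44 GB of weights. The sketch below is a rough estimate only. It assumes Llama 3.1 70B's published architecture (80 layers, 8 KV heads, head dimension 128) and an f16 cache, none of which is stated in this guide. The helper name kv_cache_mib is hypothetical, and compute buffers are not counted.

```shell
#!/usr/bin/env bash
# Rough KV-cache size estimate for a given context length.
# Assumptions (not from this guide): Llama 3.1 70B with 80 layers,
# 8 KV heads (grouped-query attention), head dimension 128,
# and an f16 cache at 2 bytes per element.

kv_cache_mib() {
  local ctx=$1
  local n_layers=80 n_kv_heads=8 head_dim=128 bytes_per_elem=2
  # The leading factor of 2 covers the separate K and V tensors in every layer.
  local bytes=$(( 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx ))
  echo $(( bytes / 1024 / 1024 ))
}

echo "KV cache at -c 4096: $(kv_cache_mib 4096) MiB"
echo "KV cache at -c 8192: $(kv_cache_mib 8192) MiB"
```

Under these assumptions the cache grows linearly with context, so doubling -c doubles the estimate. Compare the result against the free memory that nvidia-smi reports on each card before raising -c.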

Deployment guides are educational. Each model is subject to its own license; read the official Hugging Face model card before downloading or deploying.