
Llama 3 70B Instruct

2026-03-15

8.5 / 10

Parameters: 70B
Quantization: GPTQ 4-bit
VRAM Required: 38GB
GPU Setup: 2x RTX A6000 48GB
Tensor Parallel: TP=2
Context Length: 8,192 tokens
Tokens/sec: 42.5
TTFT: 850ms

pros

  • Strong reasoning and instruction following
  • Excellent code generation
  • Good at multi-step problems

cons

  • High VRAM requirement even quantized
  • Tends to be verbose — needs explicit brevity prompting

Llama 3 70B is the model I reach for when I need something that actually reasons rather than pattern-matches. It's noticeably better than Llama 3.1 70B on complex tasks.

Setup

Running GPTQ 4-bit via vLLM with tensor parallelism across 2x A6000 48GB. The quantization loses maybe 0.2 points on MMLU vs BF16 — totally acceptable for the VRAM savings.
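That 38GB figure is close to what a back-of-envelope estimate predicts. A minimal sketch, assuming the published Llama 3 70B architecture (80 layers, 8 KV heads via GQA, head dim 128) and an FP16 KV cache:

```python
# Rough VRAM estimate for GPTQ 4-bit Llama 3 70B at 8k context.
# Architecture constants assumed from the published model config.
PARAMS = 70e9
N_LAYERS, N_KV_HEADS, HEAD_DIM = 80, 8, 128
CONTEXT = 8192

weights_gb = PARAMS * 0.5 / 1e9                 # 4-bit = 0.5 bytes per param
kv_bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * 2   # K+V, fp16
kv_gb = kv_bytes_per_token * CONTEXT / 1e9      # full 8k context

print(f"weights ~{weights_gb:.0f} GB, KV cache ~{kv_gb:.1f} GB")
```

Weights come out around 35 GB and a full 8k KV cache adds roughly 2.7 GB, which lands right around the 38GB in the spec table (quantization group metadata and activations eat a bit more on top).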

vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --quantization gptq \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92
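vLLM's server speaks the OpenAI-compatible API, so clients just POST to /v1/chat/completions (localhost:8000 by default). A sketch that builds the request body; the helper name is mine, and nothing is actually sent here:

```python
import json

# Build a JSON body for vLLM's OpenAI-compatible /v1/chat/completions
# endpoint. Hypothetical helper; POST it to http://localhost:8000 in use.
def build_chat_request(user_msg, max_tokens=512, temperature=0.2):
    return {
        "model": "meta-llama/Meta-Llama-3-70B-Instruct",
        "messages": [{"role": "user", "content": user_msg}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

body = build_chat_request("Review this diff for race conditions.")
print(json.dumps(body, indent=2))
```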

Performance

42.5 tok/s at batch=1 is genuinely usable for interactive work. TTFT at 850ms is acceptable but not great; most of it is prompt prefill, plus the tensor-parallel all-reduces that cross PCIe on every layer.
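With streaming, those two numbers give a simple latency model for a full response. A back-of-envelope sketch using the measurements above:

```python
# Estimated end-to-end response time from the measured numbers above:
# 850 ms TTFT, then tokens stream at 42.5 tok/s.
TTFT_S = 0.85
DECODE_TOK_S = 42.5

def response_time_s(n_tokens):
    # first token lands at TTFT; the remaining n-1 stream at the decode rate
    return TTFT_S + (n_tokens - 1) / DECODE_TOK_S

for n in (50, 200, 500):
    print(f"{n} tokens: {response_time_s(n):.1f} s")
```

A typical 200-token answer comes in around five and a half seconds, which matches the "usable for interactive work" verdict.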

What it's good at

  • Long multi-step math proofs — doesn't lose track of variables
  • Code review with actionable feedback, not just "looks good"
  • Summarization of dense technical papers without hallucinating citations

What it struggles with

Verbose by default. Without "be concise, no hedging" in the system prompt, it wraps every answer in 2 paragraphs of throat-clearing. This is a training artifact, not a capability issue.
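One way to apply that workaround consistently is to inject a terse system prompt on every request. A minimal sketch; the prompt wording here is my own, not from the review:

```python
# Force a brevity system prompt ahead of every chat request, replacing
# any system message already present. Hypothetical helper.
BREVITY_SYSTEM = "Be concise. Answer directly. No preamble, no hedging."

def with_brevity(messages):
    # drop any existing system message, then prepend the brevity prompt
    rest = [m for m in messages if m["role"] != "system"]
    return [{"role": "system", "content": BREVITY_SYSTEM}] + rest

msgs = with_brevity([{"role": "user", "content": "Summarize this paper."}])
```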

Verdict

If you have the VRAM, run it. Best open model at this size class for technical work. Mistral 7B is 5x faster but this is just smarter.