Llama 3 70B Instruct
2026-03-15

Parameters: 70B
Quantization: GPTQ 4-bit
VRAM Required: 38GB
GPU Setup: 2x RTX A6000 48GB
Tensor Parallel: TP=2
Context Length: 8,192
Tokens/sec: 42.5
TTFT: 850ms
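The 38GB figure checks out with a back-of-envelope estimate (a sketch; the 0.5 bytes/param for 4-bit weights and the ~8% overhead allowance for GPTQ scales and runtime buffers are assumptions, not measured numbers):

```python
# Rough VRAM estimate for a 70B model quantized to 4 bits.
# Assumption: 4-bit weights ≈ 0.5 bytes/param; overhead factor is a guess
# covering GPTQ scales/zero-points and framework buffers.
params = 70e9
weight_gb = params * 0.5 / 1e9      # ≈ 35 GB of quantized weights
overhead_gb = weight_gb * 0.08      # allowance for scales + buffers
total_gb = weight_gb + overhead_gb  # ≈ 37.8 GB, close to the 38GB above
```

KV cache at 8K context comes on top of this, which is why two 48GB cards are comfortable rather than overkill.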
Pros
- Strong reasoning and instruction following
- Excellent code generation
- Good at multi-step problems

Cons
- High VRAM requirement even when quantized
- Tends to be verbose; needs explicit brevity prompting
Llama 3 70B is the model I reach for when I need something that actually reasons rather than pattern-matches. It's noticeably better than Llama 3.1 on complex tasks.
Setup
Running GPTQ 4-bit via vLLM with tensor parallelism across 2x A6000 48GB. The quantization loses maybe 0.2 points on MMLU vs BF16 — totally acceptable for the VRAM savings.
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
--quantization gptq \
--tensor-parallel-size 2 \
--max-model-len 8192 \
    --gpu-memory-utilization 0.92

Performance
42.5 tok/s at batch=1 is genuinely usable for interactive work. TTFT of 850ms is acceptable but not great; most of it is prompt prefill, plus the cross-GPU communication that TP=2 adds over PCIe.
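The measured numbers translate directly into end-to-end latency (a quick estimate; the 500-token answer length is an assumption for illustration):

```python
# Wall-clock estimate for one batch=1 completion from the measured
# 42.5 tok/s decode rate and 850 ms TTFT.
def response_time_s(output_tokens: int, tok_per_s: float = 42.5,
                    ttft_s: float = 0.85) -> float:
    """Time to first token, plus decode time for the remaining tokens."""
    return ttft_s + output_tokens / tok_per_s

t = response_time_s(500)  # ≈ 12.6 s for a 500-token answer
```

That ~12-13 seconds for a typical long answer is the practical ceiling for interactive use, and it's why the verbosity issue below matters beyond style.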
What it's good at
- Long multi-step math proofs — doesn't lose track of variables
- Code review with actionable feedback, not just "looks good"
- Summarization of dense technical papers without hallucinating citations
What it struggles with
Verbose by default. Without "be concise, no hedging" in the system prompt, it wraps every answer in 2 paragraphs of throat-clearing. This is a training artifact, not a capability issue.
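A minimal sketch of the brevity workaround, using the OpenAI-compatible chat format that vLLM serves (the exact system-prompt wording here is illustrative, not the one from this post):

```python
# Hypothetical brevity system prompt; tune the wording to taste.
SYSTEM_PROMPT = "Be concise, no hedging. Answer directly, no preamble."

def build_messages(user_text: str) -> list[dict]:
    """Wrap a user query with the brevity system prompt in the
    OpenAI-compatible chat format vLLM's server accepts."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

msgs = build_messages("Explain GPTQ in one sentence.")
```

Pass this as the `messages` field of a chat-completions request and the throat-clearing mostly disappears.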
Verdict
If you have the VRAM, run it. Best open model in this size class for technical work. Mistral 7B is roughly 5x faster, but this is just smarter.