Llama 3 70B Instruct
2026-03-15

Parameters: 70B
Quantization: GPTQ 4-bit
VRAM Required: 38GB
GPU Setup: 2x RTX A6000 48GB
Tensor Parallel: TP=2
Context Length: 8,192
Tokens/sec: 42.5
TTFT: 850ms
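The 38GB figure checks out with a back-of-envelope estimate (a sketch; the 0.5 bytes/param for 4-bit weights and the ~8% overhead allowance for GPTQ scales and runtime buffers are assumptions, not measured numbers):

```python
# Rough VRAM estimate for a 70B model quantized to 4 bits.
# Assumption: 4-bit weights ≈ 0.5 bytes/param; overhead factor is a guess
# covering GPTQ scales/zero-points and framework buffers.
params = 70e9
weight_gb = params * 0.5 / 1e9      # ≈ 35 GB of quantized weights
overhead_gb = weight_gb * 0.08      # allowance for scales + buffers
total_gb = weight_gb + overhead_gb  # ≈ 37.8 GB, close to the 38GB above
```

KV cache at 8K context comes on top of this, which is why two 48GB cards are comfortable rather than overkill.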
Pros
- Strong reasoning and instruction following
- Excellent code generation
- Good at multi-step problems

Cons
- High VRAM requirement even when quantized
- Tends to be verbose; needs explicit brevity prompting
Llama 3 70B is the model I reach for when I need something that actually reasons rather than pattern-matches. It's noticeably better than Llama 3.1 on complex tasks.
Setup
Running GPTQ 4-bit via vLLM with tensor parallelism across 2x A6000 48GB. The quantization loses maybe 0.2 points on MMLU vs BF16 — totally acceptable for the VRAM savings.
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
--quantization gptq \
--tensor-parallel-size 2 \
--max-model-len 8192 \
    --gpu-memory-utilization 0.92

Performance
42.5 tok/s at batch=1 is genuinely usable for interactive work. TTFT of 850ms is acceptable but not great; most of it is prompt prefill, plus the cross-GPU communication that TP=2 adds over PCIe.
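The measured numbers translate directly into end-to-end latency (a quick estimate; the 500-token answer length is an assumption for illustration):

```python
# Wall-clock estimate for one batch=1 completion from the measured
# 42.5 tok/s decode rate and 850 ms TTFT.
def response_time_s(output_tokens: int, tok_per_s: float = 42.5,
                    ttft_s: float = 0.85) -> float:
    """Time to first token, plus decode time for the remaining tokens."""
    return ttft_s + output_tokens / tok_per_s

t = response_time_s(500)  # ≈ 12.6 s for a 500-token answer
```

That ~12-13 seconds for a typical long answer is the practical ceiling for interactive use, and it's why the verbosity issue below matters beyond style.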
What it's good at
- Long multi-step math proofs — doesn't lose track of variables
- Code review with actionable feedback, not just "looks good"
- Summarization of dense technical papers without hallucinating citations
What it struggles with
Verbose by default. Without "be concise, no hedging" in the system prompt, it wraps every answer in 2 paragraphs of throat-clearing. This is a training artifact, not a capability issue.
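A minimal sketch of the brevity workaround, using the OpenAI-compatible chat format that vLLM serves (the exact system-prompt wording here is illustrative, not the one from this post):

```python
# Hypothetical brevity system prompt; tune the wording to taste.
SYSTEM_PROMPT = "Be concise, no hedging. Answer directly, no preamble."

def build_messages(user_text: str) -> list[dict]:
    """Wrap a user query with the brevity system prompt in the
    OpenAI-compatible chat format vLLM's server accepts."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

msgs = build_messages("Explain GPTQ in one sentence.")
```

Pass this as the `messages` field of a chat-completions request and the throat-clearing mostly disappears.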
Verdict
If you have the VRAM, run it. Best open model in this size class for technical work. Mistral 7B is roughly 5x faster, but this is just smarter.