nick.dev

DeepSeek R1 671B

2026-03-01

9.5/10

Parameters: 671B
Quantization: FP8
VRAM Required: 320GB
GPU Setup: 8x H100 80GB
Tensor Parallel: TP=8
Context Length: 65,536
Tokens/sec: 28
TTFT: 2,400ms

pros

  • Best reasoning I've seen from any open model
  • Matches o1 on math/coding benchmarks
  • Chain-of-thought is genuinely useful, not fluff
  • FP8 makes it feasible on 8x H100

cons

  • Requires 8x H100 — not for mortals
  • Slow — thinking takes time
  • Chain-of-thought tokens inflate cost

The first open model that made me reconsider my assumption that frontier reasoning required proprietary weights.

Setup

FP8 on 8x H100 80GB via tensor parallelism. The FP8 quantization at this model size is impressively clean — minimal quality regression vs BF16.

vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.95
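Once the server above is running, vLLM exposes an OpenAI-compatible API (port 8000 by default). A minimal stdlib client sketch; the endpoint URL, sampling temperature, and token budget here are my assumptions, not from this post:

```python
import json
import urllib.request

# Default vLLM endpoint; adjust host/port if you serve elsewhere.
URL = "http://localhost:8000/v1/chat/completions"

def build_request(prompt: str, max_tokens: int = 4096) -> dict:
    # R1 emits its chain-of-thought inside <think> tags before the
    # final answer, so leave generous room in max_tokens for it.
    return {
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.6,  # within DeepSeek's suggested range for R1
    }

def ask(prompt: str) -> str:
    req = urllib.request.Request(
        URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Note the returned content includes the `<think>` block; strip it if you only want the final answer.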

Performance

28 tok/s sounds slow, but remember this is a 671B MoE model. The effective parameter count during inference is much lower (only ~37B active per token), which is how it achieves this throughput at all.
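To see why the ~37B active parameters matter, here's a back-of-envelope decode ceiling, assuming single-stream decode is bound by reading the active weights from HBM. The bandwidth figure is an assumed spec number (not from this review), and the sketch ignores KV-cache reads and TP communication entirely:

```python
# Back-of-envelope decode ceiling for a bandwidth-bound MoE.
# Assumed figures: ~37e9 active params per token at FP8 (1 byte each),
# ~3.35 TB/s HBM3 bandwidth per H100, 8 GPUs.
ACTIVE_PARAMS = 37e9
BYTES_PER_PARAM = 1.0   # FP8
HBM_BW = 3.35e12        # bytes/s per GPU, assumed
NUM_GPUS = 8

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM
aggregate_bw = HBM_BW * NUM_GPUS
ceiling = aggregate_bw / bytes_per_token  # tok/s, weights-only, single stream
print(f"~{ceiling:.0f} tok/s weights-only ceiling")
```

The weights-only ceiling lands in the hundreds of tok/s, so the observed 28 tok/s is dominated by everything the sketch ignores: all-reduce traffic across the TP=8 group, KV-cache reads, and kernel overhead.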

TTFT at 2.4s is the real bottleneck. Prefilling a long prompt, with attention computed and the KV cache built shard by shard across 8 GPUs, adds up.
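As a sanity check on the 2.4s figure, a compute-bound prefill estimate, again with assumed figures (~2 FLOPs per active parameter per token, ~2 PFLOPS dense FP8 per H100), neither of which comes from the review:

```python
# Rough ideal prefill time at full context.
# Assumed figures: ~2 FLOPs per active param per token during prefill,
# ~1.98e15 dense FP8 FLOPS per H100, 8 GPUs.
CONTEXT = 65_536
ACTIVE_PARAMS = 37e9
PEAK_FLOPS = 1.98e15 * 8  # aggregate, assumed spec figure

prefill_flops = 2 * ACTIVE_PARAMS * CONTEXT
ideal_s = prefill_flops / PEAK_FLOPS
print(f"ideal prefill: {ideal_s:.2f} s")
```

Even at perfect utilization the estimate is around 0.3s; landing at 2.4s would imply fairly low prefill efficiency, which is plausible given per-layer all-reduces at TP=8 plus scheduling overhead.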

The reasoning capability

This is the part that surprised me. On AMC/AIME problems it doesn't just get the right answer — it finds cleaner solutions than I would. The chain-of-thought isn't just padding, it's actual search over solution strategies.

On a graduate-level algorithms problem I gave it, it tried three approaches, identified the flaw in two of them, and arrived at the correct amortized analysis. Unprompted.

Verdict

If you have access to 8x H100 (cluster access, not your garage), this is the model to run for hard technical problems. It's not fast, it's not cheap, but it's the best open reasoning model available.