DeepSeek R1 671B
2026-03-01
Parameters: 671B
Quantization: FP8
VRAM Required: 320 GB
GPU Setup: 8x H100 80GB
Tensor Parallel: TP=8
Context Length: 65,536
Tokens/sec: 28
TTFT: 2,400 ms
Pros
- Best reasoning I've seen from any open model
- Matches o1 on math/coding benchmarks
- Chain-of-thought is genuinely useful, not fluff
- FP8 makes it feasible on 8x H100

Cons
- Requires 8x H100, so not for mortals
- Slow: thinking takes time
- Chain-of-thought tokens inflate cost
The first open model that made me reconsider my assumption that frontier reasoning required proprietary weights.
Setup
FP8 on 8x H100 80GB via tensor parallelism. The FP8 quantization at this model size is impressively clean — minimal quality regression vs BF16.
```shell
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.95
```

Performance
28 tok/s sounds slow, but remember this is a 671B MoE model. The effective parameter count during inference is much lower (only ~37B active per token), which is how it achieves this throughput at all.
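That intuition can be made concrete with back-of-the-envelope decode arithmetic. At batch size 1, each generated token has to stream the active weights from HBM, so memory bandwidth, not FLOPs, sets the ceiling. A rough sketch (the ~37B active figure is DeepSeek's published number; the H100 bandwidth is the SXM spec-sheet value, and everything else here is a simplifying assumption):

```python
# Decode-throughput ceiling sketch for a MoE model (illustrative numbers).
active_params = 37e9      # active params per token (DeepSeek's reported figure)
bytes_per_param = 1.0     # FP8 weights
hbm_bw_per_gpu = 3.35e12  # H100 SXM HBM3 bandwidth, bytes/s (spec sheet)
n_gpus = 8

# Each decoded token streams the active weights once, spread across the
# tensor-parallel group (ignores KV reads, activations, and overlap losses).
weights_bytes = active_params * bytes_per_param
ceiling_tok_s = (hbm_bw_per_gpu * n_gpus) / weights_bytes  # about 724 tok/s

print(f"bandwidth ceiling ~ {ceiling_tok_s:.0f} tok/s")
```

The observed 28 tok/s sits far below that ceiling; tensor-parallel communication, expert routing, and scheduling overhead eat most of the headroom, but the MoE math is why a 671B model decodes at all on this hardware.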
TTFT at 2.4 s is the real bottleneck. Prefilling a long prompt and populating the KV cache across 8 GPUs adds up.
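Worth noting: the cache itself is tiny here, because R1 uses Multi-head Latent Attention (MLA) and stores a compressed latent per token rather than full K/V heads. A sizing sketch, using hyperparameters from the published DeepSeek-V3 config (treat them as assumptions if your revision differs):

```python
# KV-cache sizing sketch for MLA-style attention as used by DeepSeek-R1.
# Hyperparameters below are from the published DeepSeek-V3 config; they are
# assumptions here, not measurements from this deployment.
layers = 61
kv_lora_rank = 512       # compressed KV latent dim cached per token per layer
qk_rope_head_dim = 64    # decoupled RoPE key dim, also cached
bytes_per_elem = 1       # FP8 cache
context = 65_536

per_token = layers * (kv_lora_rank + qk_rope_head_dim) * bytes_per_elem
total = per_token * context
# per_token is ~34.3 KiB; total is ~2.14 GiB at full context.
print(f"{per_token / 1024:.1f} KiB/token, {total / 2**30:.2f} GiB at full context")
```

A couple of GiB of cache spread over 8 GPUs is negligible, which is why the 2.4 s TTFT is dominated by prefill compute over the prompt, not by cache capacity.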
The reasoning capability
This is the part that surprised me. On AMC/AIME problems it doesn't just get the right answer; it finds cleaner solutions than I would. The chain-of-thought isn't just padding, it's actual search over solution strategies.
On a graduate-level algorithms problem I gave it, it tried three approaches, identified the flaw in two of them, and arrived at the correct amortized analysis. Unprompted.
Verdict
If you have access to 8x H100 (cluster access, not your garage), this is the model to run for hard technical problems. It's not fast, it's not cheap, but it's the best open reasoning model available.
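If you do spin it up, the vLLM server above exposes an OpenAI-compatible API. A minimal stdlib-only client sketch; the endpoint URL, token budget, and temperature are my assumptions (DeepSeek recommends temperatures around 0.5 to 0.7 for R1, and the long chain-of-thought needs generous max_tokens headroom):

```python
import json
import urllib.request

def build_request(prompt: str, max_tokens: int = 8192) -> dict:
    # R1 emits its chain of thought before the final answer, so leave
    # plenty of max_tokens headroom (8192 is an assumption, not a rule).
    return {
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,  # within DeepSeek's recommended range for R1
        "max_tokens": max_tokens,
    }

def ask(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    # POST to the vLLM OpenAI-compatible chat completions endpoint.
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (against a running server):
# print(ask("Prove that the sum of the first n odd numbers is n^2."))
```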