DeepSeek R1 671B
2026-03-01
Parameters: 671B
Quantization: FP8
VRAM Required: 320 GB
GPU Setup: 8x H100 80GB
Tensor Parallel: TP=8
Context Length: 65,536
Tokens/sec: 28
TTFT: 2,400 ms
Pros
- Best reasoning I've seen from any open model
- Matches o1 on math/coding benchmarks
- Chain-of-thought is genuinely useful, not fluff
- FP8 makes it feasible on 8x H100

Cons
- Requires 8x H100, so not for mortals
- Slow: thinking takes time
- Chain-of-thought tokens inflate cost
The first open model that made me reconsider my assumption that frontier reasoning required proprietary weights.
Setup
FP8 on 8x H100 80GB via tensor parallelism. The FP8 quantization at this model size is impressively clean — minimal quality regression vs BF16.
```shell
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.95
```

Performance
28 tok/s sounds slow, but remember this is a 671B MoE model. The effective parameter count during inference is much lower (only ~37B active per token), which is how it achieves this throughput at all.
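That intuition can be made concrete with back-of-the-envelope decode arithmetic. At batch size 1, each generated token has to stream the active weights from HBM, so memory bandwidth, not FLOPs, sets the ceiling. A rough sketch (the ~37B active figure is DeepSeek's published number; the H100 bandwidth is the SXM spec-sheet value, and everything else here is a simplifying assumption):

```python
# Decode-throughput ceiling sketch for a MoE model (illustrative numbers).
active_params = 37e9      # active params per token (DeepSeek's reported figure)
bytes_per_param = 1.0     # FP8 weights
hbm_bw_per_gpu = 3.35e12  # H100 SXM HBM3 bandwidth, bytes/s (spec sheet)
n_gpus = 8

# Each decoded token streams the active weights once, spread across the
# tensor-parallel group (ignores KV reads, activations, and overlap losses).
weights_bytes = active_params * bytes_per_param
ceiling_tok_s = (hbm_bw_per_gpu * n_gpus) / weights_bytes  # about 724 tok/s

print(f"bandwidth ceiling ~ {ceiling_tok_s:.0f} tok/s")
```

The observed 28 tok/s sits far below that ceiling; tensor-parallel communication, expert routing, and scheduling overhead eat most of the headroom, but the MoE math is why a 671B model decodes at all on this hardware.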
TTFT at 2.4 s is the real bottleneck. Prefilling a long prompt and populating the KV cache across 8 GPUs adds up.
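Worth noting: the cache itself is tiny here, because R1 uses Multi-head Latent Attention (MLA) and stores a compressed latent per token rather than full K/V heads. A sizing sketch, using hyperparameters from the published DeepSeek-V3 config (treat them as assumptions if your revision differs):

```python
# KV-cache sizing sketch for MLA-style attention as used by DeepSeek-R1.
# Hyperparameters below are from the published DeepSeek-V3 config; they are
# assumptions here, not measurements from this deployment.
layers = 61
kv_lora_rank = 512       # compressed KV latent dim cached per token per layer
qk_rope_head_dim = 64    # decoupled RoPE key dim, also cached
bytes_per_elem = 1       # FP8 cache
context = 65_536

per_token = layers * (kv_lora_rank + qk_rope_head_dim) * bytes_per_elem
total = per_token * context
# per_token is ~34.3 KiB; total is ~2.14 GiB at full context.
print(f"{per_token / 1024:.1f} KiB/token, {total / 2**30:.2f} GiB at full context")
```

A couple of GiB of cache spread over 8 GPUs is negligible, which is why the 2.4 s TTFT is dominated by prefill compute over the prompt, not by cache capacity.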
The reasoning capability
This is the part that surprised me. On AMC/AIME problems it doesn't just get the right answer; it finds cleaner solutions than I would. The chain-of-thought isn't just padding, it's actual search over solution strategies.
On a graduate-level algorithms problem I gave it, it tried three approaches, identified the flaw in two of them, and arrived at the correct amortized analysis. Unprompted.
Verdict
If you have access to 8x H100 (cluster access, not your garage), this is the model to run for hard technical problems. It's not fast, it's not cheap, but it's the best open reasoning model available.
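If you do spin it up, the vLLM server above exposes an OpenAI-compatible API. A minimal stdlib-only client sketch; the endpoint URL, token budget, and temperature are my assumptions (DeepSeek recommends temperatures around 0.5 to 0.7 for R1, and the long chain-of-thought needs generous max_tokens headroom):

```python
import json
import urllib.request

def build_request(prompt: str, max_tokens: int = 8192) -> dict:
    # R1 emits its chain of thought before the final answer, so leave
    # plenty of max_tokens headroom (8192 is an assumption, not a rule).
    return {
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,  # within DeepSeek's recommended range for R1
        "max_tokens": max_tokens,
    }

def ask(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    # POST to the vLLM OpenAI-compatible chat completions endpoint.
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (against a running server):
# print(ask("Prove that the sum of the first n odd numbers is n^2."))
```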