nick.dev

Mistral 7B Instruct v0.3

2026-02-20

7.5
/10

Parameters

7B

Quantization

AWQ 4-bit

VRAM Required

5GB

GPU Setup

1x RTX 3090 24GB

Tensor Parallel

TP=1

Context Length

32,768

Tokens/sec

187

TTFT

90ms

pros

  • Blazing fast — great for high-throughput inference
  • Fits on a single consumer GPU
  • Long context window
  • Sliding window attention is clever

cons

  • Falls apart on complex multi-step reasoning
  • Hallucinations on technical details

Mistral 7B is what I use when I need answers fast and the task doesn't require deep reasoning. It's the workhorse for my local dev loop.

Setup

AWQ 4-bit on a single 3090. 5GB VRAM leaves plenty of headroom for a large KV cache.

vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --quantization awq \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85
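Once it's up, the server speaks vLLM's OpenAI-compatible chat API on port 8000 by default. A minimal client sketch, using only the stdlib — the endpoint path and payload shape are the standard OpenAI-compatible interface; the prompt and temperature are just illustrative:

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build a chat-completions payload for the local vLLM server."""
    return {
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,  # low temperature suits lookup-style queries
    }

def ask(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST to the /v1/chat/completions endpoint vLLM exposes."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Any OpenAI-compatible client library works against the same endpoint; the raw-request version just makes the payload explicit.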

Performance

187 tok/s is remarkable for a local model. TTFT at 90ms feels instant. This is what good inference looks like.
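Those two numbers measure different phases: TTFT is dominated by prefill, tok/s by decode. A sketch of how both fall out of streamed token arrival times — the timestamps below are synthetic, chosen to match the figures above:

```python
def stream_metrics(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Return (ttft_seconds, decode_tokens_per_sec) from a streamed response.

    token_times are absolute arrival times of each generated token.
    """
    ttft = token_times[0] - request_start
    decode_window = token_times[-1] - token_times[0]
    # Throughput counts tokens after the first, over the decode window only,
    # so prefill time doesn't dilute the decode rate.
    tok_per_sec = (len(token_times) - 1) / decode_window
    return ttft, tok_per_sec

# Synthetic run: first token at 90ms, then one token every 1/187 s.
start = 0.0
times = [0.090 + i * (1 / 187) for i in range(200)]
ttft, tps = stream_metrics(start, times)  # ttft ≈ 0.090 s, tps ≈ 187
```

Mixing the two phases into one average is the usual way benchmarks overstate or understate a model, which is why I report them separately.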

What it's good at

Fast drafting, quick lookups, simple code completion. If I'm doing rapid iteration and need a model that responds in under 2 seconds, Mistral is the answer.
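The "under 2 seconds" figure is just arithmetic on the numbers above. A quick sanity check, assuming a typical ~300-token answer (the output length is my assumption, not a measurement):

```python
def response_latency(ttft_s: float, tok_per_sec: float, output_tokens: int) -> float:
    """End-to-end latency: prefill (TTFT) plus decode time for the output."""
    return ttft_s + output_tokens / tok_per_sec

# At 90ms TTFT and 187 tok/s, a ~300-token answer lands well under 2s.
latency = response_latency(0.090, 187.0, 300)  # ≈ 1.69 s
```

Longer answers scale linearly with decode rate, so the 2-second budget holds up to roughly 350 output tokens on this setup.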

What it struggles with

Give it a problem that requires holding 5+ pieces of information simultaneously and it starts dropping context. Not a 70B replacement — don't pretend it is.

Verdict

Best performance-per-watt of anything I've run. The go-to for latency-sensitive applications and local development where you don't want to wait.