# Mistral 7B Instruct v0.3

2026-02-20
| Spec | Value |
|---|---|
| Parameters | 7B |
| Quantization | AWQ 4-bit |
| VRAM required | 5 GB |
| GPU setup | 1× RTX 3090 24GB |
| Tensor parallel | TP=1 |
| Context length | 32,768 |
| Tokens/sec | 187 |
| TTFT | 90 ms |
## Pros

- Blazing fast: great for high-throughput inference
- Fits on a single consumer GPU
- Long 32k context window
- Sliding window attention is a clever design
## Cons

- Falls apart on complex multi-step reasoning
- Hallucinates on technical details
Mistral 7B is what I use when I need answers fast and the task doesn't require deep reasoning. It's the workhorse for my local dev loop.
## Setup
AWQ 4-bit on a single 3090. 5GB VRAM leaves plenty of headroom for a large KV cache.
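That headroom claim is easy to sanity-check. A minimal sketch, assuming the standard Mistral 7B config (32 transformer layers, 8 KV heads under grouped-query attention, head dimension 128) and an fp16 KV cache:

```python
# KV-cache footprint estimate for Mistral 7B.
# Architecture numbers are assumptions taken from the published config:
# 32 layers, 8 KV heads (GQA), head_dim = 128, 2 bytes/element for fp16.
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2

# Keys and values are both cached, hence the leading factor of 2.
per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(per_token_bytes // 1024, "KiB per token")        # 128 KiB per token

full_context_gib = per_token_bytes * 32768 / 2**30
print(full_context_gib, "GiB for a full 32k context")  # 4.0 GiB
```

So even a single full-length 32k sequence costs about 4 GiB of cache, which sits comfortably next to the ~5 GB of weights on a 24 GB card.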
```shell
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --quantization awq \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85
```

## Performance
187 tok/s is remarkable for a local model. TTFT at 90ms feels instant. This is what good inference looks like.
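Those two numbers combine into a back-of-the-envelope estimate for end-to-end response time. A sketch, assuming decode throughput stays flat at 187 tok/s:

```python
def latency_s(n_tokens: int, ttft_ms: float = 90, tok_per_s: float = 187) -> float:
    """Estimated wall-clock time: time-to-first-token plus steady-state decode."""
    return ttft_ms / 1000 + n_tokens / tok_per_s

# A ~190-token answer lands in just over a second.
print(round(latency_s(187), 2))  # 1.09
```

With streaming on, the perceived latency is just the 90 ms TTFT; the formula matters mainly for batch jobs that wait on full completions.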
## What it's good at
Fast drafting, quick lookups, simple code completion. If I'm doing rapid iteration and need a model that responds in under 2 seconds, Mistral is the answer.
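For that rapid-iteration loop, I talk to the vLLM server over its OpenAI-compatible chat endpoint with streaming enabled, so the first tokens show up at TTFT instead of after the full completion. A stdlib-only sketch (the URL assumes vLLM's default port 8000; adjust if you changed it):

```python
import json
import urllib.request

def build_request(prompt: str,
                  url: str = "http://localhost:8000/v1/chat/completions"):
    """Build a streaming chat-completions request for the local vLLM server."""
    payload = {
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "stream": True,  # stream tokens so output starts at TTFT (~90 ms)
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# With the server running:
#   with urllib.request.urlopen(build_request("Explain this stack trace")) as resp:
#       for line in resp:  # server-sent events, one chunk per line
#           print(line.decode(), end="")
```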
## What it struggles with
Give it a problem that requires holding 5+ pieces of information simultaneously and it starts dropping context. Not a 70B replacement — don't pretend it is.
## Verdict
Best performance-per-watt of anything I've run. The go-to for latency-sensitive applications and local development where you don't want to wait.