distributed GPU LLM training
Tensor Parallelism: Megatron-style Column/Row Splitting
Tensor parallelism (TP) lets you run a model that doesn't fit on one GPU without the pipeline-bubble problem of naive model parallelism.
The key insight
For a linear layer Y = XA, you can split A column-wise across GPUs: each GPU computes its slice of Y independently, so no communication is needed until the sharded outputs have to be recombined. For attention, split heads across GPUs — each GPU owns a subset of heads.
GPU 0: heads 0..h/2
GPU 1: heads h/2..h
All-reduce after the attention output projection. One sync per transformer layer.
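A minimal sketch of the column split, simulating two GPUs with numpy (shapes are invented for illustration). Each "GPU" multiplies by its own column slice of A; concatenating the slices reproduces the unsharded result with no reduction:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # activations: (batch, d_model), replicated on both GPUs
A = rng.standard_normal((8, 16))  # weight: (d_model, d_ff)

# Column parallelism: each "GPU" owns half of A's columns.
A0, A1 = A[:, :8], A[:, 8:]
Y0, Y1 = X @ A0, X @ A1           # computed independently, no sync

# Recombining is a concatenation, not a sum — the shards are disjoint columns of Y.
Y = np.concatenate([Y0, Y1], axis=1)

assert np.allclose(Y, X @ A)      # matches the unsharded result
```

In real TP the concatenation never happens: the next (row-parallel) layer consumes each shard in place.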
Column vs Row parallelism
Megatron alternates:
- Column parallel for the first linear (inputs are replicated, and the elementwise nonlinearity between the two linears applies to each column shard independently, so no sync needed)
- Row parallel for the second linear (each GPU produces a partial sum; one all-reduce combines them)
This means one all-reduce per MLP block in the forward pass (plus a matching one in backward). With NVLink at 600 GB/s per GPU, at TP=8 this costs roughly ~50μs per layer at modest batch sizes — negligible vs the compute time.
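The full column-then-row MLP, again simulated with numpy for two GPUs (shapes and the tanh-GELU are illustrative). Note that the GELU is elementwise, so applying it to each column shard gives the same values as applying it to the full matrix — that's what lets the two linears run back-to-back with only one sync at the end:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # activations, replicated
A = rng.standard_normal((8, 32))  # first linear: column-parallel
B = rng.standard_normal((32, 8))  # second linear: row-parallel

def gelu(x):
    # tanh approximation of GELU, elementwise
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Column parallel: each "GPU" owns half of A's columns; GELU applies per shard.
H0 = gelu(X @ A[:, :16])
H1 = gelu(X @ A[:, 16:])

# Row parallel: each "GPU" owns the matching rows of B, producing a partial sum.
Z0 = H0 @ B[:16, :]
Z1 = H1 @ B[16:, :]

# The one sync point: an all-reduce (here just a sum) over the partial outputs.
Z = Z0 + Z1

assert np.allclose(Z, gelu(X @ A) @ B)  # matches the unsharded MLP
```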
When TP stops scaling
Communication overhead dominates when:
- GPUs aren't NVLink connected (PCIe TP is painful)
- TP degree exceeds (or doesn't evenly divide) the number of attention heads
- Batch size is tiny (compute time shrinks, communication doesn't)
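A back-of-envelope way to see when communication starts to matter. This sketch assumes a bandwidth-bound ring all-reduce (each GPU moves ~2(p-1)/p of the payload over the wire) and ignores latency; the function name and parameters are invented for illustration:

```python
def allreduce_time_us(batch, seq, d_model, tp, bw_gbps, dtype_bytes=2):
    """Rough ring all-reduce time for one activation tensor, in microseconds.

    Assumes bandwidth-bound transfer at bw_gbps GB/s per GPU (latency ignored),
    bf16 activations by default.
    """
    payload = batch * seq * d_model * dtype_bytes  # bytes reduced per all-reduce
    wire = 2 * (tp - 1) / tp * payload             # ring all-reduce traffic per GPU
    return wire / (bw_gbps * 1e9) * 1e6

# e.g. batch=1, seq=512, d_model=8192, TP=8 over 600 GB/s NVLink
print(allreduce_time_us(1, 512, 8192, 8, 600))
```

The payload scales with batch × seq × d_model while the matmul FLOPs scale with an extra factor of d_model — so shrinking the batch shrinks compute and communication together, but the fixed latency and the falling arithmetic intensity are what make tiny batches communication-bound.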
Rule of thumb: TP within a node (NVLink), pipeline parallel across nodes.