
Tensor Parallelism: Megatron-style Column/Row Splitting

Tensor parallelism (TP) is how you run a model that doesn't fit on one GPU without the pipeline-bubble problem of naive model parallelism.

The key insight

For a linear layer Y = XA, split A column-wise across GPUs: each GPU computes its own slice of Y in parallel, with no communication. Chain that into a row-wise split of the next weight matrix and the partial products sum, so a single all-reduce at the end recovers the full output. For attention, split heads across GPUs: each GPU owns a subset of heads.

GPU 0: heads 0..h/2
GPU 1: heads h/2..h

All-reduce after the attention output projection: one sync for the whole attention block.
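The column-split identity is easy to sanity-check on CPU. A minimal numpy sketch with two simulated GPUs and toy shapes (all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # input activations (batch x d_model), replicated on both GPUs
A = rng.standard_normal((8, 6))   # full weight matrix

# Column-parallel: each simulated "GPU" holds half of A's columns.
A0, A1 = A[:, :3], A[:, 3:]
Y0 = X @ A0   # GPU 0 computes its slice of Y, no communication
Y1 = X @ A1   # GPU 1 computes its slice of Y, no communication

# The slices concatenate to the full output: no all-reduce needed yet.
assert np.allclose(np.concatenate([Y0, Y1], axis=1), X @ A)
```

Splitting attention heads is the same trick: each head's Q/K/V projections are a column block of the fused QKV weight, so a head shard is just a column shard.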

Column vs Row parallelism

Megatron alternates:

  1. Column parallel for the first linear (no sync needed; every GPU already has the full input X)
  2. Row parallel for the second linear (all-reduce to combine the partial sums)

The GeLU between them is elementwise, so each GPU applies it to its own shard without communicating.

This means one all-reduce per MLP block in the forward pass (plus one in the backward), i.e. two forward all-reduces per transformer layer once you count attention. With NVLink at 600GB/s between GPUs, at TP=8 this costs ~50μs per layer, negligible next to the compute time.
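Here's a minimal numpy simulation of the column/row alternation (tp=2 simulated GPUs, tanh-approximation GeLU; shapes are toy values), showing why the GeLU needs no sync and a single summing all-reduce reproduces the unsharded MLP:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff, tp = 8, 32, 2
X = rng.standard_normal((4, d))        # input, replicated on every GPU
A = rng.standard_normal((d, d_ff))     # first linear: split by columns
B = rng.standard_normal((d_ff, d))     # second linear: split by rows

def gelu(x):
    # tanh approximation of GeLU (elementwise, so it acts per-shard)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

partials = []
for i in range(tp):
    Ai = A[:, i * d_ff // tp:(i + 1) * d_ff // tp]  # column shard of A
    Bi = B[i * d_ff // tp:(i + 1) * d_ff // tp, :]  # matching row shard of B
    partials.append(gelu(X @ Ai) @ Bi)              # local compute, zero comms

Y = sum(partials)                       # the one all-reduce: sum of partial outputs
assert np.allclose(Y, gelu(X @ A) @ B)  # matches the unsharded MLP
```

The pairing matters: shard i's columns of A produce exactly the hidden activations that shard i's rows of B consume, which is why nothing needs to be gathered in between.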

When TP stops scaling

Communication overhead dominates when:

  • GPUs aren't NVLink connected (PCIe TP is painful)
  • TP degree exceeds (or doesn't evenly divide) the number of attention heads
  • Batch size is tiny (compute time shrinks, communication doesn't)
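The constant in that last bullet is easy to estimate. A back-of-envelope for the per-layer all-reduce cost, using a ring all-reduce model and assumed shapes (batch 1, seq 1024, hidden 8192, bf16; all of these are illustrative, not measured):

```python
# Ring all-reduce: each GPU sends ~2*(n-1)/n times the payload over the wire.
bytes_payload = 1 * 1024 * 8192 * 2               # [batch, seq, hidden] in bf16 (2 bytes)
tp = 8
ring_traffic = 2 * (tp - 1) / tp * bytes_payload  # bytes per GPU per all-reduce
bw = 600e9                                        # NVLink bandwidth, bytes/s
t = ring_traffic / bw
print(f"{t * 1e6:.0f} us per all-reduce")         # tens of microseconds
```

Note that the payload scales with batch and sequence length but the latency floor doesn't, which is why tiny batches make the communication share balloon.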

Rule of thumb: TP within a node (NVLink), pipeline parallel across nodes.