cuda, gpu, performance
CUDA Kernel Fusion Basics
How kernel fusion reduces memory bandwidth bottlenecks in GPU workloads
Things I picked up — CUDA, LLMs, distributed systems, whatever.
How kernel fusion reduces memory bandwidth bottlenecks in GPU workloads
Quantizing the KV cache to FP8 to fit longer contexts without OOM
How Megatron-LM splits attention heads and MLP layers across GPUs without extra communication