$ ls learn-log/

Learn Log

Things I picked up — CUDA, LLMs, distributed systems, whatever.

How kernel fusion reduces memory bandwidth bottlenecks in GPU workloads

Quantizing the KV cache to FP8 to fit longer contexts without OOM

How Megatron-LM splits attention heads and MLP layers across GPUs with minimal communication