The Megatron Problem — Show Notes
DTF:FTL Episode 0030 | March 12, 2026
Every competitive frontier model going forward is sparse. Mixture-of-Experts architectures decouple parameter count from per-token compute — but training them at scale creates coupled constraints across memory, communication, and computation that dense models never had. NVIDIA's Megatron Core team published the full engineering receipt: 88 pages, 42 figures, production-tested on clusters of thousands of GPUs.
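To make the "fraction of total parameters" point concrete, here is a minimal top-k routing sketch in plain PyTorch. It is illustrative only, not Megatron Core's implementation, and every name and number in it is made up for the example: total parameters grow with the number of experts, but each token only pays compute for the top-k experts its router selects.

```python
# Minimal top-k MoE routing sketch (illustrative; not Megatron Core's code).
# Parameter count grows with num_experts; per-token compute grows with top_k.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_experts, top_k, d_model, d_ff = 8, 2, 16, 64

experts = [torch.nn.Linear(d_model, d_ff) for _ in range(num_experts)]  # all parameters live here
router = torch.nn.Linear(d_model, num_experts)                          # gating network

tokens = torch.randn(5, d_model)                      # tiny batch of 5 tokens
probs = F.softmax(router(tokens), dim=-1)             # [tokens, num_experts]
gate, chosen = torch.topk(probs, top_k, dim=-1)       # keep only top_k experts per token
gate = gate / gate.sum(dim=-1, keepdim=True)          # renormalize gate weights

out = torch.zeros(tokens.size(0), d_ff)
for t in range(tokens.size(0)):                       # each token runs through
    for k in range(top_k):                            # just top_k of the experts
        e = chosen[t, k].item()
        out[t] += gate[t, k] * experts[e](tokens[t])

print(f"experts touched per token: {top_k} of {num_experts}")
```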
Why It Matters
MoE is not a research curiosity. DeepSeek-V3, Qwen3, Mixtral, and most frontier models in active development are sparse. The question was never whether MoE architectures were theoretically superior; it was whether anyone could actually train them efficiently at scale. This paper answers that question with production numbers: 1,233 TFLOPS per GPU on GB300 for a 685-billion-parameter model, roughly 50 percent of theoretical hardware peak. The framework is open source. Any serious lab can now train competitive sparse models. The moat just got narrower.
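For scale, a quick back-of-envelope reading of that throughput claim (a sketch using only the two figures quoted above; the implied peak is derived arithmetic, not an official GB300 spec):

```python
# Back-of-envelope check on the headline throughput figure. The achieved
# TFLOPS and the ~50% utilization are the paper's numbers; the implied
# per-GPU peak is derived from them, not an official hardware spec.
achieved_tflops = 1233        # sustained training throughput per GPU (paper)
utilization = 0.50            # "roughly 50 percent of theoretical peak" (paper)

implied_peak_tflops = achieved_tflops / utilization
print(f"implied per-GPU peak: ~{implied_peak_tflops:,.0f} TFLOPS")  # ~2,466
```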
Primary Source
- Paper: Scalable Training of Mixture-of-Experts Models with Megatron Core — https://arxiv.org/abs/2603.07685
- Megatron-LM GitHub: https://github.com/NVIDIA/Megatron-LM
- Megatron-Core (within Megatron-LM): https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core
Models Referenced
- DeepSeek-V3 Technical Report: https://arxiv.org/abs/2412.19437
- DeepSeek-V3 GitHub: https://github.com/deepseek-ai/DeepSeek-V3
- Qwen3 Technical Report / Blog: https://qwenlm.github.io/blog/qwen3/
- Qwen GitHub: https://github.com/QwenLM/Qwen3
- Mixtral of Experts (MoE paper, Mistral AI): https://arxiv.org/abs/2401.04088
MoE Foundations
- Sparsely-Gated Mixture-of-Experts (Shazeer et al., 2017): https://arxiv.org/abs/1701.06538
- Switch Transformer (Google, 2021): https://arxiv.org/abs/2101.03961
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (Google): https://arxiv.org/abs/2112.06905
- Expert Choice Routing (Zhou et al., 2022): https://arxiv.org/abs/2202.09368
Parallelism and Training Infrastructure
- Megatron-LM: Training Multi-Billion Parameter Language Models (original paper): https://arxiv.org/abs/1909.08053
- Efficient Large-Scale Language Model Training on GPU Clusters (Narayanan et al., 2021): https://arxiv.org/abs/2104.04473
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding: https://arxiv.org/abs/2006.16668
- FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models: https://arxiv.org/abs/2201.12023
Compute Primitives
- FlashAttention-2 (Dao, 2023): https://arxiv.org/abs/2307.08691
- Grouped GEMM (NVIDIA CUTLASS): https://github.com/NVIDIA/cutlass
- NVIDIA CUDA Graphs (developer blog): https://developer.nvidia.com/blog/cuda-graphs/
Hardware
- NVIDIA GB200 NVL72 architecture overview: https://www.nvidia.com/en-us/data-center/gb200-nvl72/
- NVIDIA Blackwell GPU architecture: https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/
Related Reading
- Scaling Laws for Neural Language Models (Kaplan et al.): https://arxiv.org/abs/2001.08361
- DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale: https://arxiv.org/abs/2201.05596
DTF:FTL — Dispatches from the edge. New episodes daily. All content AI-assisted; factual claims sourced from cited papers.