The Megatron Problem — Show Notes
DTF:FTL Episode 0030 | March 12, 2026
Every competitive frontier model going forward is sparse. Mixture-of-Experts architectures decouple parameter count from per-token compute — but training them at scale creates coupled constraints across memory, communication, and computation that dense models never had. NVIDIA's Megatron Core team published the full engineering receipt: 88 pages, 42 figures, production-tested on clusters of thousands of GPUs.
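To make the "fraction of total parameters" point concrete, here is a minimal top-k routing sketch in plain PyTorch. It is illustrative only, not Megatron Core's implementation, and every name and number in it is made up for the example: total parameters grow with the number of experts, but each token only pays compute for the top-k experts its router selects.

```python
# Minimal top-k MoE routing sketch (illustrative; not Megatron Core's code).
# Parameter count grows with num_experts; per-token compute grows with top_k.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_experts, top_k, d_model, d_ff = 8, 2, 16, 64

experts = [torch.nn.Linear(d_model, d_ff) for _ in range(num_experts)]  # all parameters live here
router = torch.nn.Linear(d_model, num_experts)                          # gating network

tokens = torch.randn(5, d_model)                      # tiny batch of 5 tokens
probs = F.softmax(router(tokens), dim=-1)             # [tokens, num_experts]
gate, chosen = torch.topk(probs, top_k, dim=-1)       # keep only top_k experts per token
gate = gate / gate.sum(dim=-1, keepdim=True)          # renormalize gate weights

out = torch.zeros(tokens.size(0), d_ff)
for t in range(tokens.size(0)):                       # each token runs through
    for k in range(top_k):                            # just top_k of the experts
        e = chosen[t, k].item()
        out[t] += gate[t, k] * experts[e](tokens[t])

print(f"experts touched per token: {top_k} of {num_experts}")
```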
Why It Matters
MoE is not a research curiosity. DeepSeek-V3, Qwen3, Mixtral, and most frontier models in active development are sparse. The question was never whether MoE architectures were theoretically superior; it was whether anyone could actually train them efficiently at scale. This paper answers that question with production numbers: 1,233 TFLOPS per GPU on GB300 for a 685-billion-parameter model, roughly 50 percent of theoretical hardware peak. The framework is open source. Any serious lab can now train competitive sparse models. The moat just got narrower.
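For scale, a quick back-of-envelope reading of that throughput claim (a sketch using only the two figures quoted above; the implied peak is derived arithmetic, not an official GB300 spec):

```python
# Back-of-envelope check on the headline throughput figure. The achieved
# TFLOPS and the ~50% utilization are the paper's numbers; the implied
# per-GPU peak is derived from them, not an official hardware spec.
achieved_tflops = 1233        # sustained training throughput per GPU (paper)
utilization = 0.50            # "roughly 50 percent of theoretical peak" (paper)

implied_peak_tflops = achieved_tflops / utilization
print(f"implied per-GPU peak: ~{implied_peak_tflops:,.0f} TFLOPS")  # ~2,466
```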
Primary Source
- Paper: Scalable Training of Mixture-of-Experts Models with Megatron Core — https://arxiv.org/abs/2603.07685
- Megatron-LM GitHub: https://github.com/NVIDIA/Megatron-LM
- Megatron-Core (within Megatron-LM): https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core
Models Referenced
- DeepSeek-V3 Technical Report: https://arxiv.org/abs/2412.19437
- DeepSeek-V3 GitHub: https://github.com/deepseek-ai/DeepSeek-V3
- Qwen3 Technical Report / Blog: https://qwenlm.github.io/blog/qwen3/
- Qwen GitHub: https://github.com/QwenLM/Qwen3
- Mixtral of Experts (MoE paper, Mistral AI): https://arxiv.org/abs/2401.04088
MoE Foundations
- Sparsely-Gated Mixture-of-Experts (Shazeer et al., 2017): https://arxiv.org/abs/1701.06538
- Switch Transformer (Google, 2021): https://arxiv.org/abs/2101.03961
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (Google): https://arxiv.org/abs/2112.06905
- Expert Choice Routing (Zhou et al., 2022): https://arxiv.org/abs/2202.09368
Parallelism and Training Infrastructure
- Megatron-LM: Training Multi-Billion Parameter Language Models (original paper): https://arxiv.org/abs/1909.08053
- Efficient Large-Scale Language Model Training on GPU Clusters (Narayanan et al., 2021): https://arxiv.org/abs/2104.04473
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding: https://arxiv.org/abs/2006.16668
- FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models: https://arxiv.org/abs/2201.12023
Compute Primitives
- FlashAttention-2 (Dao, 2023): https://arxiv.org/abs/2307.08691
- Grouped GEMM (NVIDIA CUTLASS): https://github.com/NVIDIA/cutlass
- NVIDIA CUDA Graphs (developer blog): https://developer.nvidia.com/blog/cuda-graphs/
Hardware
- NVIDIA GB200 NVL72 architecture overview: https://www.nvidia.com/en-us/data-center/gb200-nvl72/
- NVIDIA Blackwell GPU architecture: https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/
Related Reading
- Scaling Laws for Neural Language Models (Kaplan et al.): https://arxiv.org/abs/2001.08361
- DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale: https://arxiv.org/abs/2201.05596
DTF:FTL — Dispatches from the edge. New episodes daily. All content AI-assisted; factual claims sourced from cited papers.