Overview
Dan Fu, a kernel engineering lead at Together AI, wrote this response to Tim Dettmers' post on AGI hardware bottlenecks. He argues there's at least 10x more compute available through better software-hardware codesign and new chip generations.
Takeaways
- Today's AI Systems Are Underutilized: Training runs like DeepSeek-V3 achieve only ~20% MFU in FP8, compared to 50% MFU in older runs. Inference is even worse: optimized implementations hit <5% FLOP utilization because they're bottlenecked on memory bandwidth, not compute. (A back-of-the-envelope MFU check follows this list.)
  - Hardware-aware architecture codesign, FP4 training, and diffusion language models offer concrete paths to better utilization. (A block-scaled FP4 sketch follows this list.)
- Models Lag Hardware Buildout: DeepSeek-V3 trained on just 2,048 H800 GPUs, while clusters of 40K-100K+ GPUs are being built. Most current models use last-gen Hopper chips; Blackwell offers 2.2x FP8 throughput and native FP4 with 4.5x more FLOPs.
  - New hardware features like GB200's rack-scale NVLink domain require model redesigns to exploit.
- Useful AGI May Already Be Close: Current LLMs already produce the majority of code for kernel engineers and handle complex GPU programming with human-in-the-loop guidance. Better post-training formulas, sample complexity improvements, and domain expertise can extend these gains across fields.
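
A back-of-the-envelope check on the ~20% FP8 MFU figure, using the standard ~6·N·D FLOPs-per-token approximation together with DeepSeek-V3's publicly reported run statistics and an approximate dense FP8 peak for the H800; every number below is an illustrative assumption rather than exact accounting:

```python
# Rough training-MFU estimate for a DeepSeek-V3-scale run. Uses the common
# ~6 * N_active * D approximation for training FLOPs per token; all inputs
# are publicly reported or vendor-spec ballpark figures, not exact values.
active_params  = 37e9       # DeepSeek-V3 activated parameters per token (MoE)
tokens         = 14.8e12    # reported pre-training tokens
gpu_hours      = 2.664e6    # reported H800 GPU-hours for pre-training
peak_fp8_flops = 1.98e15    # approximate dense FP8 peak per H800, FLOP/s

model_flops   = 6 * active_params * tokens   # useful FLOPs the run had to do
gpu_seconds   = gpu_hours * 3600
achieved_rate = model_flops / gpu_seconds    # FLOP/s each GPU actually delivered
mfu = achieved_rate / peak_fp8_flops

print(f"~{achieved_rate / 1e12:.0f} TFLOP/s per GPU, MFU ~= {mfu:.1%}")
# ~343 TFLOP/s per GPU, i.e. high-teens MFU from this crude formula; adding the
# attention FLOPs that 6*N*D ignores lands in the ~20% range quoted above.
```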
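To make the FP4 direction concrete, here is a minimal NumPy sketch of block-scaled E2M1 (FP4) fake quantization with round-to-nearest. It only illustrates the numeric format; the block size of 32 and the absmax scaling rule are assumptions for the example, not DeepSeek's or NVIDIA's actual training recipe.

```python
import numpy as np

# The 8 non-negative magnitudes representable in E2M1 (FP4):
# 1 sign bit, 2 exponent bits, 1 mantissa bit.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_quantize_dequantize(x, block_size=32):
    """Fake-quantize a 1-D tensor to block-scaled FP4 and back (round-to-nearest)."""
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    # Per-block scale maps the block's absmax onto the largest FP4 magnitude (6.0).
    scales = np.abs(blocks).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales[scales == 0] = 1.0
    scaled = blocks / scales
    # Snap each scaled magnitude to the nearest FP4 grid point, keeping the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    deq = np.sign(scaled) * FP4_GRID[idx] * scales
    return deq.reshape(-1)[:len(x)]

w = np.random.randn(1024).astype(np.float32)
w_fp4 = fp4_quantize_dequantize(w)
print("mean abs error:", np.abs(w - w_fp4).mean())
```

Production FP4 training recipes typically add more machinery on top of this (for example higher-precision accumulation and keeping sensitive layers in higher precision), which is the kind of software-hardware codesign the post argues for.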
The models are already insanely useful!