Study Note
Memory Bandwidth and AI Workloads
Many AI workloads are limited less by peak compute and more by the speed, placement, and reuse of data.
Bandwidth As A Constraint
Accelerators expose high floating-point throughput, but tensors must still move through the memory hierarchy. When a kernel performs little computation per byte loaded, memory bandwidth, not compute, becomes the binding constraint.
Arithmetic intensity, the number of floating-point operations performed per byte moved to or from memory, is a useful first estimate: higher reuse of loaded data gives compute units more useful work before the next memory transaction.
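A minimal sketch of that estimate, in the style of a roofline bound. The peak throughput and bandwidth figures are hypothetical placeholders rather than numbers for any real device, and the matmul case assumes each matrix crosses the memory boundary exactly once, which is optimistic.

```python
# Roofline-style estimate. All hardware numbers are hypothetical placeholders.

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def attainable_flops(intensity: float, peak_flops: float, peak_bandwidth: float) -> float:
    """Upper bound on throughput: the lower of the compute and bandwidth ceilings."""
    return min(peak_flops, intensity * peak_bandwidth)


def elementwise_add_intensity(n: int, bytes_per_elem: int = 4) -> float:
    # c[i] = a[i] + b[i]: one FLOP per element, two loads and one store.
    return arithmetic_intensity(n, 3 * n * bytes_per_elem)


def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 4) -> float:
    # Idealized assumption: A, B, and C each move through memory exactly once.
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return arithmetic_intensity(flops, bytes_moved)


PEAK_FLOPS = 100e12      # hypothetical: 100 TFLOP/s
PEAK_BANDWIDTH = 1e12    # hypothetical: 1 TB/s

if __name__ == "__main__":
    cases = [
        ("elementwise add", elementwise_add_intensity(1 << 20)),
        ("matmul 4096^3", matmul_intensity(4096, 4096, 4096)),
    ]
    for name, ai in cases:
        bound = attainable_flops(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
        kind = "bandwidth-bound" if bound < PEAK_FLOPS else "compute-bound"
        print(f"{name}: {ai:.3f} FLOP/byte, at most {bound:.3e} FLOP/s ({kind})")
```

Under these assumed numbers the elementwise add sits at about 0.083 FLOP/byte and is capped far below peak compute, while the large matmul sits at about 683 FLOP/byte and is capped by compute instead.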
Locality And Reuse
Good locality reduces repeated movement across expensive memory boundaries. Tiling, shared-memory reuse, cache-friendly layout, and fusion all try to keep values close to the compute that consumes them.
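To make the effect of tiling concrete, the sketch below counts input traffic for a square matrix multiply under an idealized model. The function name and the model itself are illustrative assumptions: each output tile is assumed to load its strip of A and its strip of B from slow memory once and reuse them from fast memory, and cache effects and output writes are ignored.

```python
def matmul_input_traffic(n: int, tile: int, bytes_per_elem: int = 4) -> int:
    """Bytes of A and B loaded from slow memory for an n x n x n matmul
    computed one (tile x tile) output block at a time.

    Idealized model: each output block reads a (tile x n) strip of A and an
    (n x tile) strip of B once. tile=1 means no reuse across output elements.
    """
    if tile <= 0 or n % tile != 0:
        raise ValueError("this sketch assumes tile is positive and divides n")
    output_blocks = (n // tile) ** 2
    elems_per_block = 2 * n * tile  # one strip of A plus one strip of B
    return output_blocks * elems_per_block * bytes_per_elem


if __name__ == "__main__":
    n = 1024
    for tile in (1, 8, 32):
        print(f"tile={tile:>2}: {matmul_input_traffic(n, tile):,} bytes loaded")
```

In this model the input traffic falls in proportion to the tile size: a 32 x 32 tile loads 32 times fewer bytes than the no-reuse case. Real kernels are limited by how much fast memory a tile can occupy, so the gain does not grow without bound.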
For AI models, bandwidth pressure often appears in embedding access, normalization, attention variants, and any stage that streams large tensors with modest computation.
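For a streaming stage of this kind, a rough traffic count shows why fusion helps. The sketch assumes a chain of elementwise operations over one tensor, that intermediates written by one kernel are not still cached when the next kernel reads them, and that per-channel parameters are negligible in size; the function name is hypothetical.

```python
def chain_traffic_bytes(n_elems: int, n_ops: int,
                        bytes_per_elem: int = 4, fused: bool = False) -> int:
    """Bytes read plus bytes written for a chain of elementwise operations.

    Unfused: every operation reads the full tensor and writes it back.
    Fused: the tensor is read once and written once, with intermediates
    held in registers.
    """
    passes = 1 if fused else n_ops
    return passes * 2 * n_elems * bytes_per_elem


if __name__ == "__main__":
    n = 1 << 24  # elements in the tensor
    unfused = chain_traffic_bytes(n, n_ops=3)              # e.g. scale, shift, activation
    fused = chain_traffic_bytes(n, n_ops=3, fused=True)
    print(f"unfused: {unfused:,} bytes, fused: {fused:,} bytes")
```

With three operations the fused version moves one third of the bytes while performing the same arithmetic, so its arithmetic intensity is three times higher.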
Takeaways
- Peak FLOPS does not predict performance without data-movement analysis.
- Memory hierarchy choices, such as tile sizes, data layout, and what stays in shared memory or cache, can change the shape of an AI model implementation.
- Kernel fusion trades modularity for fewer memory round trips.
- Bandwidth-bound workloads require measurement at the operator and system levels.
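
As a starting point for the measurement mentioned in the last takeaway, the sketch below times a plain host-memory copy using only the Python standard library. It is an assumption-laden illustration: it measures CPU memory rather than accelerator memory, the buffer size and repeat count are arbitrary, and interpreter overhead and caching can distort the result. Operator-level measurement on an accelerator needs the vendor's profiling tools.

```python
import time


def measure_copy_bandwidth(n_bytes: int = 32 * 1024 * 1024, repeats: int = 5) -> float:
    """Return the best observed copy bandwidth in bytes per second."""
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # one read of src plus one write of dst
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Count both the bytes read and the bytes written.
    return 2 * n_bytes / best


if __name__ == "__main__":
    print(f"effective copy bandwidth: {measure_copy_bandwidth() / 1e9:.2f} GB/s")
```

Taking the best of several repeats reduces noise from other processes; comparing the result against the platform's rated bandwidth gives a first sense of how much headroom a streaming operator has.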