Study Note

Memory Bandwidth and AI Workloads

Memory Architecture

Many AI workloads are limited less by peak compute and more by the speed, placement, and reuse of data.

Bandwidth As A Constraint

Accelerators expose high floating-point throughput, but tensors must still move through the memory hierarchy. When a kernel performs little computation per byte loaded, memory bandwidth becomes the binding constraint.

Arithmetic intensity, the ratio of floating-point operations performed to bytes moved to or from memory, is a useful first estimate: higher reuse of loaded data gives compute units more useful work before the next memory transaction.
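A roofline-style estimate makes this concrete. The sketch below is a minimal illustration, not a performance model: the peak figures describe a hypothetical accelerator, the function names are invented for this note, and the matrix-multiply byte count assumes ideal reuse (each matrix moved exactly once), which real kernels only approach.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def attainable_flops(intensity: float, peak_flops: float, peak_bandwidth: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * peak_bandwidth)


def is_bandwidth_bound(intensity: float, peak_flops: float, peak_bandwidth: float) -> bool:
    """True when intensity falls below the ridge point peak_flops / peak_bandwidth."""
    return intensity < peak_flops / peak_bandwidth


# Hypothetical accelerator, chosen only to make the arithmetic easy to follow.
PEAK_FLOPS = 100e12      # FLOP/s
PEAK_BANDWIDTH = 1e12    # bytes/s, so the ridge point is 100 FLOP/byte

n = 4096
BYTES_PER_ELEMENT = 4    # float32

# Elementwise add of two length-n vectors: n FLOPs, two loads and one store per element.
add_intensity = arithmetic_intensity(n, 3 * n * BYTES_PER_ELEMENT)

# n x n matrix multiply with ideal reuse: 2*n^3 FLOPs, each of the three matrices moved once.
matmul_intensity = arithmetic_intensity(2 * n**3, 3 * n * n * BYTES_PER_ELEMENT)

for name, ai in [("add", add_intensity), ("matmul", matmul_intensity)]:
    bound = "bandwidth" if is_bandwidth_bound(ai, PEAK_FLOPS, PEAK_BANDWIDTH) else "compute"
    limit = attainable_flops(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
    print(f"{name}: {ai:.3f} FLOP/byte, {bound}-bound, at most {limit:.3e} FLOP/s")
```

Under these assumed numbers the elementwise add sits far below the ridge point and can reach only a small fraction of peak compute, while the large matrix multiply sits above it.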

Locality And Reuse

Good locality reduces repeated movement across expensive memory boundaries. Tiling, shared-memory reuse, cache-friendly layout, and fusion all try to keep values close to the compute that consumes them.
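The effect of tiling and fusion on memory traffic can be sketched with a simple counting model. The code below is an illustrative assumption rather than a description of any particular kernel: it ignores caches, counts only reads for the matrix multiply, and assumes a fast memory large enough to hold the tiles being reused. Both function names are hypothetical.

```python
def matmul_elements_loaded(n: int, tile: int = 1) -> int:
    """Elements read from slow memory for an n x n matmul computed in
    tile x tile output blocks. tile=1 models the case with no reuse."""
    if n % tile != 0:
        raise ValueError("tile must divide n in this simplified model")
    output_blocks = (n // tile) ** 2
    # Each output block consumes n/tile pairs of tile x tile input blocks.
    loads_per_block = (n // tile) * 2 * tile * tile
    return output_blocks * loads_per_block


def elementwise_chain_bytes(n: int, num_ops: int,
                            bytes_per_element: int = 4,
                            fused: bool = False) -> int:
    """Bytes moved by a chain of unary elementwise ops over n elements.

    Unfused: every op reads its input from memory and writes its output back.
    Fused: the input is read once and only the final result is written.
    """
    passes = 1 if fused else num_ops
    return passes * 2 * n * bytes_per_element


n = 1024
untiled = matmul_elements_loaded(n, tile=1)
tiled = matmul_elements_loaded(n, tile=32)
print(f"matmul loads: untiled={untiled}, tiled={tiled}, reduction={untiled // tiled}x")

unfused = elementwise_chain_bytes(n * n, num_ops=3)
fused = elementwise_chain_bytes(n * n, num_ops=3, fused=True)
print(f"3-op chain bytes: unfused={unfused}, fused={fused}, reduction={unfused // fused}x")
```

In this model the loads for the matrix multiply fall in proportion to the tile size, and fusing a chain of k elementwise operations cuts traffic by a factor of k. Real gains depend on how much fast memory is available and on what the hardware caches already capture.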

For AI models, bandwidth pressure often appears in embedding access, normalization, attention variants, and any stage that streams large tensors with modest computation.

Takeaways

Peak FLOPS does not predict performance without data-movement analysis.
Memory hierarchy choices, such as tile sizes, data layout, and fusion boundaries, can change the shape of an AI model implementation.
Kernel fusion trades modularity for fewer memory round trips.
Bandwidth-bound workloads require measurement at the operator and system levels.
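
As a minimal sketch of the measurement idea, effective bandwidth can be estimated as bytes moved divided by elapsed time. The example below times a host-memory buffer copy using only the Python standard library; it is an assumption-laden stand-in for real operator profiling, which would normally use the accelerator vendor's profiling tools. The function name and buffer size are illustrative.

```python
import time


def measure_copy_bandwidth(num_bytes: int = 64 * 1024 * 1024, repeats: int = 5) -> float:
    """Best observed bytes/s for copying a buffer in host memory.

    Counts one read and one write per byte, so traffic is 2 * num_bytes.
    """
    src = bytearray(num_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)  # forces a full copy of the buffer
        elapsed = time.perf_counter() - start
        assert len(dst) == num_bytes
        best = min(best, elapsed)
    best = max(best, 1e-9)  # guard against a zero reading from a coarse timer
    return 2 * num_bytes / best


print(f"{measure_copy_bandwidth() / 1e9:.1f} GB/s effective copy bandwidth")
```

Comparing a measured figure like this with the bytes an operator must move gives a lower bound on its runtime. A large gap between that bound and the observed time suggests the operator is limited by something other than bandwidth.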
