Study Note
GPU Execution Model: Threads, Blocks, and Warps
CUDA exposes a thread hierarchy that looks simple in source code but maps onto hardware with strict constraints on how threads are scheduled and executed.
Execution Hierarchy
A kernel launch creates a grid. The grid contains thread blocks, and each block contains threads. The hardware partitions each block into warps (32 threads on current NVIDIA GPUs), and the warp, not the individual thread, is the unit that is scheduled and advances through instructions together.
A program written in terms of individual threads is therefore still bound by warp-level behavior: how warps are scheduled, how each warp's memory accesses combine, and whether its threads diverge at branches.
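To make the thread-to-warp mapping concrete, here is a minimal host-side C++ sketch (not device code). It assumes 32-thread warps and the usual CUDA ordering in which the x index varies fastest within a block. The names kWarpSize, WarpPosition, linear_thread_index, and warp_position are hypothetical and chosen only for illustration.

```cpp
#include <cassert>

// Assumption: 32 threads per warp, as on current NVIDIA GPUs.
constexpr int kWarpSize = 32;

// Hypothetical helper type: where a thread sits inside its block.
struct WarpPosition {
    int warp;  // index of the warp within the block
    int lane;  // index of the thread within that warp (0..31)
};

// Linearize a 3D thread index within a block: x varies fastest,
// then y, then z.
int linear_thread_index(int tx, int ty, int tz, int dim_x, int dim_y) {
    assert(dim_x > 0 && dim_y > 0);
    return tx + ty * dim_x + tz * dim_x * dim_y;
}

// Consecutive linear indices are packed into the same warp.
WarpPosition warp_position(int linear_index) {
    assert(linear_index >= 0);
    return {linear_index / kWarpSize, linear_index % kWarpSize};
}
```

For example, in a 16x16 block the thread at (x=0, y=2) has linear index 32, so it is lane 0 of warp 1. Two rows of that block fill exactly one warp.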
Why Warps Matter
Threads in a warp share an instruction stream. When threads in the same warp take different sides of a branch, the warp executes each taken path in turn, with the lanes that did not take that path masked off. Utilization drops even though the code appears fully parallel.
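The cost of divergence can be estimated with a deliberately simplified host-side C++ model. It assumes that a warp runs every distinct branch path once and that all paths cost the same, which real hardware and compilers do not guarantee. The integer path labels and the function names serialized_passes and lane_utilization are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Input: one entry per lane of a single warp, holding a hypothetical
// integer label for the branch path that lane takes.
// Simplified model: the warp executes each distinct path once, with
// the lanes on other paths masked off during that pass.
int serialized_passes(const std::vector<int>& path_per_lane) {
    std::set<int> distinct(path_per_lane.begin(), path_per_lane.end());
    return static_cast<int>(distinct.size());
}

// Fraction of lane-slots doing useful work, assuming every path has
// the same cost: each lane is active in one pass out of all passes.
double lane_utilization(const std::vector<int>& path_per_lane) {
    assert(!path_per_lane.empty());
    return 1.0 / serialized_passes(path_per_lane);
}
```

Under this model, a branch on lane parity (lane % 2) gives two passes and 50% utilization. A branch that is uniform across the warp gives one pass and no loss.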
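Memory behavior also matters at warp scale. When the threads of a warp access adjacent addresses, the hardware can coalesce them into a small number of wide transactions and use memory bandwidth effectively. Scattered accesses are split into more transactions, which wastes bandwidth and lengthens the time the warp waits on memory.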
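A rough way to see the effect of the access pattern is to count how many aligned memory segments one warp touches. The host-side C++ sketch below assumes 32-thread warps and a 128-byte segment as an illustrative transaction granularity; the actual granularity depends on the GPU architecture and cache configuration. The function name segments_touched and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

constexpr int kWarpSize = 32;               // assumption: 32-thread warps
constexpr std::size_t kSegmentBytes = 128;  // assumption: illustrative granularity

// Counts the distinct aligned segments touched when lane i accesses
// element_bytes bytes at byte offset
//   base_bytes + i * stride_elements * element_bytes.
std::size_t segments_touched(std::size_t base_bytes,
                             std::size_t element_bytes,
                             std::size_t stride_elements) {
    assert(element_bytes > 0 && element_bytes <= kSegmentBytes);
    std::set<std::size_t> segments;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        std::size_t first = base_bytes +
            static_cast<std::size_t>(lane) * stride_elements * element_bytes;
        std::size_t last = first + element_bytes - 1;
        // An element may straddle a segment boundary, so record both ends.
        segments.insert(first / kSegmentBytes);
        segments.insert(last / kSegmentBytes);
    }
    return segments.size();
}
```

With 4-byte elements, an aligned unit-stride access touches 1 segment, the same access starting at byte offset 64 touches 2, and a stride of 32 elements touches 32, one per lane.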
Takeaways
- Thread count alone is not a performance model.
- Block size shapes occupancy, shared-memory use, and scheduling flexibility.
- Warp divergence and non-coalesced memory access can dominate runtime.
- Useful kernel analysis starts by mapping code structure to warp behavior.
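One simple consequence of the block-size takeaway can be sketched in host-side C++: a block is divided into whole warps, so a block size that is not a multiple of the warp size leaves lanes idle in its last warp. The sketch assumes 32-thread warps, and the names warps_per_block and lane_fill are hypothetical. It does not model occupancy, which also depends on per-device limits on registers, shared memory, and resident warps.

```cpp
#include <cassert>

// Assumption: 32 threads per warp, as on current NVIDIA GPUs.
constexpr int kWarpSize = 32;

// A block is split into whole warps, so the count rounds up.
int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Fraction of allocated warp lanes that hold a real thread.
double lane_fill(int threads_per_block) {
    return static_cast<double>(threads_per_block) /
           (warps_per_block(threads_per_block) * kWarpSize);
}
```

A 256-thread block fills 8 warps completely, while a 48-thread block occupies 2 warps at 75% fill, and a 33-thread block pays for 2 warps to run one extra thread.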