Hacker News

by imtringued 9 hours ago | 0 comments

>The second problem is: it used to be true that all threads in a warp would execute in lockstep, and strictly have on/off masks for thread divergence, but this is strictly no longer true for modern GPUs, the above is just wrong. On a modern GPU, each thread has its own program counter and callstack, and can independently make forward progress. Divergent threads can have a better throughput than you'd expect on a modern GPU, as they get more capable at handling this. Divergence isn't bad, its just something you have to manage - and hardware architectures are rapidly improving here

I haven't found any evidence that the individual program counter thing matters beyond one niche application: running a mutex at the granularity of a single vector lane. That is not a performance optimization at all. In fact, you are serializing the lanes in the worst way possible.
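To put a rough number on that serialization, here is a toy Python cost model. It is only a sketch: the 32-lane warp width, the function names, and the "one issue slot per instruction" accounting are my assumptions, and it ignores the cost of acquiring and releasing the lock entirely.

```python
WARP_SIZE = 32  # assumed warp width; 64 or 128 only makes the ratio worse


def lockstep_issue_slots(section_len):
    """Issue slots when every lane runs the section together, in lockstep."""
    return section_len


def per_lane_mutex_issue_slots(section_len, lanes=WARP_SIZE):
    """Issue slots when a lane-level mutex guards the section.

    Only the lane holding the lock can make progress, so the section is
    issued once per lane, with a single lane active each time.
    """
    return section_len * lanes


def lane_utilization(section_len, lanes=WARP_SIZE):
    """Fraction of lane-slots doing useful work under the per-lane mutex."""
    useful = section_len * lanes
    issued = per_lane_mutex_issue_slots(section_len, lanes) * lanes
    return useful / issued


if __name__ == "__main__":
    print(lockstep_issue_slots(10))        # 10
    print(per_lane_mutex_issue_slots(10))  # 320
    print(lane_utilization(10))            # 0.03125
```

Under these assumptions a 10-instruction critical section costs 320 issue slots instead of 10, and each issue keeps 1 of 32 lanes busy.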

From a hardware design perspective, it is completely impractical to implement independent instruction pointers, other than maybe as a performance counter. Each instruction pointer requires its own read port on the instruction memory, and adding 32, 64 or 128 read ports to an SRAM is prohibitively expensive. Even if you had those ports, divergence would still lead to some lanes finishing earlier than others.

What you're probably referring to is a scheduler trick Nvidia has implemented, where a streaming processor thread with divergence is split into two masked streaming processor threads without divergence. This doesn't fundamentally change anything about divergence being bad: you will still get worse performance than if you had figured out a way to avoid divergence. The read port limitations still apply.
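A toy Python model of why the split doesn't buy back the lost throughput. This is a sketch under my own assumptions, not a description of any real scheduler: both helper functions are hypothetical, a path costs one issue slot per instruction whenever at least one lane needs it, and the two masked halves share a single instruction fetch path. It also ignores any latency hiding gained from interleaving the halves.

```python
def masked_issue_slots(takes_then, then_len, else_len):
    """Issue slots for an if/else on one warp with an on/off lane mask.

    takes_then: one bool per lane, True if the lane takes the 'then' path.
    """
    any_then = any(takes_then)
    any_else = not all(takes_then)
    # A path is issued in full as soon as a single lane needs it.
    return (then_len if any_then else 0) + (else_len if any_else else 0)


def split_issue_slots(takes_then, then_len, else_len):
    """Same branch, but the warp is split into two masked halves.

    Each half runs its own path without divergence, yet both halves are
    fed from the same fetch path, so their issue slots add up.
    """
    then_half = [lane for lane in takes_then if lane]
    else_half = [lane for lane in takes_then if not lane]
    slots = 0
    if then_half:
        slots += then_len
    if else_half:
        slots += else_len
    return slots


if __name__ == "__main__":
    uniform = [True] * 32
    divergent = [i % 2 == 0 for i in range(32)]
    print(masked_issue_slots(uniform, 20, 30))    # 20
    print(masked_issue_slots(divergent, 20, 30))  # 50
    print(split_issue_slots(divergent, 20, 30))   # 50
```

In this model the split version costs exactly as many issue slots as plain masking (50 versus 20 for the uniform branch), so the only way to get back to 20 is to avoid the divergence in the first place.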