I've seen this objection pop up every single time and I still don't get it.

GPUs run 32, 64 or even 128 vector lanes at once. If you have a block of Rust threads that are properly programmed to take advantage of the vector processing by avoiding divergence, etc., how is it supposed to be slower?
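
To make the divergence point concrete, here is a rough cost model in plain Rust (it runs on the CPU and is not GPU code). It assumes lanes execute in lockstep, so a branch where lanes disagree runs both sides back to back with the inactive lanes masked off. The function name `branch_cycles` and the cycle counts are made up for illustration.

```rust
/// Simplified cost of one if/else executed by a group of lockstep lanes.
/// Assumption: if at least one lane takes a side, the whole group pays
/// for that side; lanes that did not take it are simply masked off.
fn branch_cycles(takes_branch: &[bool], then_cycles: u32, else_cycles: u32) -> u32 {
    let any_then = takes_branch.iter().any(|&t| t);
    let any_else = takes_branch.iter().any(|&t| !t);

    let mut cycles = 0;
    if any_then {
        cycles += then_cycles; // some lane needs the "then" side
    }
    if any_else {
        cycles += else_cycles; // some lane needs the "else" side
    }
    cycles
}
```

Under this model a group whose lanes all agree pays for one side only, while a single disagreeing lane makes the whole group pay for both. That is the whole reason "avoiding divergence" matters.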

Consider the following:

You have a hyperoptimized matrix multiplication kernel and you also have your inference engine code that previously ran on the CPU. You now port the critical inference engine code to run directly on the GPU, implementing paged attention and prefix caching there and avoiding data transfers, context switches, etc. You still call into your optimized GPU kernels.

Where is the magical slowdown supposed to come from? The mega kernel researchers are moving more and more code to the GPU and they got more performance out of it.

Is it really that hard to understand that the CUDA-style programming model is inherently inflexible and limiting? I think the fundamental problem here is that Nvidia marketing gave an incredibly misleading perception of how the hardware actually works. GPUs don't have thousands of cores like the CUDA core marketing suggests. They have on the order of a hundred "barrel CPU"-like cores.

The RTX 5090 is advertised to have 21760 CUDA cores. This is a meaningless number in practice since the "CUDA cores" are purely a software concept that doesn't exist in hardware. The vector processing units are not cores. The RTX 5090 actually has 170 streaming multiprocessors (170 × 128 lanes = 21760), each with their own instruction pointer that you can target independently, just like a CPU. The key restriction here is that if you want maximum performance, you need to take advantage of all 128 lanes, and you also need enough thread copies that differ only in the subset of data they process, so that the GPU can switch between them while it is working on multi-cycle instructions (memory loads and the like). That's it.
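
The arithmetic behind that paragraph, as a small Rust sketch (plain CPU code). `cuda_cores` just multiplies multiprocessors by lanes. `groups_to_hide_latency` is a back-of-the-envelope barrel-processor model that I am assuming here, not something taken from Nvidia documentation: each thread group issues for `busy_cycles` and then waits `stall_cycles`, and the cycle counts you feed it are hypothetical.

```rust
/// Advertised "CUDA core" count: multiprocessors times vector lanes each.
fn cuda_cores(sm_count: u32, lanes_per_sm: u32) -> u32 {
    sm_count * lanes_per_sm
}

/// Rough number of resident thread groups one multiprocessor needs so that
/// some group is always ready to issue.
/// Assumption: every group alternates `busy_cycles` of issuing with
/// `stall_cycles` of waiting (for example on a memory load).
fn groups_to_hide_latency(busy_cycles: u32, stall_cycles: u32) -> u32 {
    assert!(busy_cycles > 0, "a group must issue for at least one cycle");
    // One group is busy; ceil(stall / busy) more are needed to cover its stall.
    1 + (stall_cycles + busy_cycles - 1) / busy_cycles
}
```

`cuda_cores(170, 128)` gives the advertised 21760. With made-up numbers such as 4 busy cycles per 400 stalled cycles, the second function asks for 101 resident groups, which is why you need many copies of the same thread rather than one.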

Here is what you can do: You can take a bunch of streaming multiprocessors, let's say 8, and use them to run your management code on the GPU side without having to transfer data back to the CPU. When you want to do heavy lifting you are in luck, because you still have 162 streaming multiprocessors left to do whatever you want. You proceed to call into cuDNN and get great performance.
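
A minimal sketch of that split in plain Rust. `Role`, `role_for` and `compute_sms` are hypothetical names for illustration. How work actually gets pinned to a particular multiprocessor is outside this sketch, and treating a simple index as "the multiprocessor" is an assumption.

```rust
/// What a given multiprocessor is used for in this hypothetical split.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Role {
    Management, // scheduling, paging, cache bookkeeping
    Compute,    // heavy kernels such as matrix multiplication
}

/// Assumption: the first `management_sms` indices are reserved for
/// management code and everything after that is free for compute.
fn role_for(sm_index: u32, management_sms: u32) -> Role {
    if sm_index < management_sms {
        Role::Management
    } else {
        Role::Compute
    }
}

/// Multiprocessors left over for heavy lifting.
fn compute_sms(total_sms: u32, management_sms: u32) -> u32 {
    total_sms.saturating_sub(management_sms)
}
```

With 170 multiprocessors and 8 reserved, `compute_sms(170, 8)` leaves 162 for compute, matching the numbers above.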

kevmo314 5 hours ago | on: 47762981
> a block of Rust threads that are properly programmed to take advantage of the vector processing by avoiding divergence

Sure, if you have that then of course it would be fast. But that’s not what this library is proposing.

Bimos 9 hours ago
> a block of Rust threads that are properly programmed to take advantage of the vector processing by avoiding divergence

But the library is using a warp as a single thread.

monideas 9 hours ago
I really appreciate the way you've explained this. Are there any resources you recommend to reach your level of understanding?