I don’t understand why this is a useful effort. It seems like a solution in search of a problem. It’s going to be incredibly easy to end up with hopelessly inefficient programs that need a full redesign in a normal GPU programming model to be useful.
LegNeato | 17 hours ago | on: 47760827
Founder here.

1. Programming GPUs is a problem. The ratio of CPUs to CPU programmers versus GPUs to GPU programmers is massively out of whack. Not because GPU programming is less valuable or lucrative, but because GPUs are weird and the tools are weird.

2. We are more interested in leveraging existing libraries than running existing binaries wholesale (mostly within a warp). But running GPU-unaware code leaves a lot of space for the compiler to move stuff around and optimize things.

3. The compiler changes are not our product, the GPU apps we are building with them are. So it is in our interest to make the apps very fast.

Anyway, skepticism is understandable and we are well aware code wins arguments.

electronsoup | 4 hours ago | on: 47760997
> the GPU apps we are building with them are

I can't help but get the feeling you have a use-case end goal in mind that's opaque to many of us who are GPU-ignorant.

It could be helpful if there were an example of the type of application that would be nicer to express through your abstractions.

(I think what you've shown so far is super cool btw)

ghighi7878 | 13 hours ago | on: 47760997
Good point about GPU threads being equivalent to warps.
esperent | 9 hours ago | on: 47762359
Groups of GPU threads are called warps.

https://modal.com/gpu-glossary/device-software/warp

jzombie | 16 hours ago | on: 47760997
Do you foresee this being faster than SIMD for things like cosine similarity? Apologies if I missed that context somewhere.
LegNeato | 16 hours ago | on: 47761331
It depends. We at VectorWare are a bit of an extreme case in that we are inverting the relationship: making the GPU run the main loop and call out to the CPU sparingly. So in that model, yes. If your code runs in a more traditional model (the CPU driving and using the GPU as a coprocessor), probably not; going across the bus dominates most workloads. That being said, the traditional wisdom is becoming less relevant as integrated memory pops up everywhere and tech like GPUDirect exists with the right datacenter hardware.

These are the details we intend to insulate people from so they can just write code and have it run fast. There is a reason why abstractions were invented on the CPU and we think we are at that point for the GPU.

(for the datacenter folks I know hardware topology has a HUGE impact that software cannot overcome on its own in many situations)
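To make the workload in question concrete, here is a minimal cosine-similarity kernel in plain Rust (a hypothetical illustration, not VectorWare code). Scalar loop code like this is exactly what a CPU auto-vectorizer (SIMD) or a GPU compiler can both target; nothing in it is GPU-specific:

```rust
// Cosine similarity over two equal-length slices.
// The straight-line reduction loop is amenable to SIMD auto-vectorization
// on CPU, or to being mapped across lanes by a GPU compiler.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut norm_a, mut norm_b) = (0.0f32, 0.0f32, 0.0f32);
    for (x, y) in a.iter().zip(b.iter()) {
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    dot / (norm_a.sqrt() * norm_b.sqrt())
}

fn main() {
    let a = [1.0f32, 0.0, 1.0];
    let b = [1.0f32, 1.0, 0.0];
    // dot = 1, |a| = |b| = sqrt(2), so similarity = 0.5
    println!("{:.3}", cosine_similarity(&a, &b));
}
```

Whether the GPU version wins then comes down to the transfer-cost question discussed above, not the arithmetic itself.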

shmerl | 16 hours ago | on: 47760997
> because GPUs are weird and the tools are weird.

Why is it also that the terminology is so all over the place? Subgroups, wavefronts, warps, etc. all refer to the same concept. That doesn't help.

adrian_b | 14 hours ago | on: 47761212
This is the fault of NVIDIA, who, instead of using the terms that had been used for decades in computer science before them for things like vector lanes, processor threads, processor cores etc., have invented a new jargon by replacing each old word with a new word, in order to obfuscate how their GPUs really work.

Unfortunately, ATI/AMD has slavishly imitated many things initiated by NVIDIA, so soon after that they created their own jargon by replacing every word used by NVIDIA with a different word, also different from the traditional word, which only enhanced the confusion. The worst part is that the NVIDIA jargon and the AMD jargon sometimes reuse traditional terms while giving them different meanings, e.g. an NVIDIA thread is not what a "thread" normally means.

Later standards, like OpenCL, have attempted to make a compromise between the GPU vendor jargons, instead of going back to a more traditional terminology, so they have only increased the number of possible confusions.

So to be able to understand GPUs, you must create a dictionary of word equivalences: traditional => NVIDIA => ATI/AMD (e.g. IBM 1964 task = Vyssotsky 1966 thread => NVIDIA warp => AMD wavefront).

MindSpunk | 16 hours ago | on: 47761212
All the names for waves come from different hardware and software vendors adopting names for the same or similar concept.

- Wavefront: AMD, comes from their hardware naming

- Warp: Nvidia, comes from their hardware naming for largely the same concept

Both of these were implementation details until Microsoft and Khronos enshrined them in the shader programming model, independent of the hardware implementation, so you get

- Subgroup: Khronos' name for the abstract model that maps to the hardware

- Wave: Microsoft's name for the same

They all describe mostly the same thing so they all get used and you get the naming mess. Doesn't help that you'll have the API spec use wave/subgroup, but the vendor profilers will use warp/wavefront in the names of their hardware counters.

raphlinus | 10 hours ago | on: 47761333
You can add to this the Apple terminology, which is simdgroup. This reinforces your point: vendors have a tendency to invent their own terminology rather than use something standard.
amelius | 10 hours ago | on: 47763756
Rule #1 in not getting involved in any patent lawsuit: don't use the same terminology as your competitors.
coffeeaddict | 18 hours ago | on: 47763756
I have to give it to Apple though in this case. Waves or warps are ridiculously uninformative, while simdgroups at least convey some useful information.
zozbot234 | 13 hours ago | on: 47760827
It looks like they're trying to map the entire "normal GPU programming model" to Rust code, including things like GPU "threads" (mapped to SIMD lanes plus masked/predicated execution to account for divergence) and the execution model where a single GPU shader is launched in multiple instances with varying x, y, and z indexes. In this context it makes sense to map the GPU "warp" to a Rust thread, since GPU lanes, even with partially independent program counters, still execute in lockstep, much like CPU SIMD/SPMD or vector code.
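To illustrate the "lanes as predicated SIMD" idea, here is a toy sketch in plain Rust (an assumed model for illustration, not VectorWare's actual lowering): one "warp" of four lanes evaluates both sides of a branch, and a per-lane mask selects which result each lane commits, mimicking lockstep execution under divergence:

```rust
const LANES: usize = 4;

// Toy model of divergence handling via predication: every lane computes
// both branch arms, and the mask decides which value lands. Real hardware
// does something similar with execution masks instead of actual branching.
fn predicated_step(x: [i32; LANES]) -> [i32; LANES] {
    // Predicate mask for the source-level branch "if x > 0".
    let mask = x.map(|v| v > 0);
    let mut out = [0i32; LANES];
    for lane in 0..LANES {
        let then_val = x[lane] * 2; // arm taken when x > 0
        let else_val = -x[lane];    // arm taken otherwise
        // Both arms were "executed"; the mask picks the committed result.
        out[lane] = if mask[lane] { then_val } else { else_val };
    }
    out
}

fn main() {
    println!("{:?}", predicated_step([1, -2, 3, -4])); // [2, 2, 6, 4]
}
```

The divergence cost is visible here: lanes that fail the predicate still "pay" for the other arm, which is why warp-level mapping (one Rust thread per warp) can be a reasonable unit of abstraction.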
rl3 | 17 hours ago | on: 47760827
I think they've taken the integration difficulty into account.

Besides, a full redesign isn't so expensive these days (depending).

> It seems like a solution in search of a problem.

Agreed, but it'll be interesting to see how it plays out.