I know AMD has a whole bunch of (related?) projects for GPU compute, but man - if they could just provide an interop layer that Just Works they'd get immediate access to so much more market share.
“Just works” in this context means executing the compiled CUDA or the PTX bytecode without recompiling. Nobody is ever going to utilize ROCm if it requires distributing as source and recompiling.
To make it even more insulting, simply installing ROCm is itself a massive burden, even on ostensibly-supported hardware (as geohot discovered). And even "it works out of the box if you distribute source and compile it locally" ignores that whole massive "draw the rest of the owl" stage of getting ROCm installed and building properly in your environment.
> “Just works” in this context means executing the compiled CUDA or the PTX bytecode without recompiling. Nobody is ever going to utilize ROCm if it requires distributing as source and recompiling.
Even a source-compatible layer that let you just recompile CUDA code for an AMD GPU would be a huge improvement. That alone would eliminate the CUDA lock-in.
Don't forget AMD doesn't seem to even care about ROCm themselves. Six months in and RDNA3 cards still don't support it. Can you imagine if Nvidia launched RTX 40-series cards with no DLSS even though 30-series cards already had it, and then six months later started boasting about how DLSS support was "coming this fall"?
The hardware that is officially supported is a subset of the hardware that works. You are correct that the RX 7900 XT is not officially supported, but I must point out that you are linking to a fork of the documentation from 2019. This is the official ROCm documentation: https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h...
TL;DR: If you provide even more functions through the overloaded headers, including "hidden" ones such as `__cudaPushCallConfiguration`, you can use LLVM/Clang as a CUDA compiler and target AMD GPUs, the host, and soon the GPUs of two other manufacturers.
Yes, though with caveats. The driver and parts of the extended API we used to lower CUDA calls are in upstream LLVM. The wrapper headers are not.
We will continue the process of getting it all to work in upstream/vanilla LLVM soon though. Help is always appreciated.
FWIW, we have some alternative ideas on how to get out of the vendor trap, as well as some existing prototypes to deal with things like CUBLAS and Thrust.
Feel free to reach out, or just keep an eye out.