CPUs do minimize latency by:
- Register renaming
- Out-of-order execution
- Branch prediction
- Speculative execution
They should not be oversubscribed, as they have to context switch by storing / loading registers, and the cache coherence protocols scale badly with more threads.
GPUs, on the other hand, maximize throughput by:
- Much higher memory bandwidth
- Smaller and slower cores, but more of them
- Ultra-threading (the massively oversubscribed hyper-threading the video mentions)
- Context switching between wavefronts (basically the equivalent of a CPU thread) by just shifting the offset into the huge register file (no store / load)
The one area in which CPUs are getting closer to GPUs is SIMD / SIMT. CPUs used to only be able to apply one instruction to a whole vector of elements, with no masking (SIMD). With ARM SVE and x86 AVX-512 they can now, like GPUs, mask out individual lanes (SIMT-style) for both ALU operations and memory operations (gather loads / scatter stores).
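To make that masking point concrete, here is a minimal AVX-512 sketch (assuming a CPU with AVX-512F and made-up array contents): each of the 16 float lanes can be individually switched off by a mask, much like inactive lanes in a GPU wavefront.

```cpp
// Minimal AVX-512 masking sketch (requires AVX-512F, compile with e.g. -mavx512f).
// Array contents are made up purely for illustration.
#include <immintrin.h>
#include <cstdio>

int main() {
    float a[16], b[16], out[16];
    for (int i = 0; i < 16; i++) { a[i] = (float)i; b[i] = 8.0f; out[i] = -1.0f; }

    __m512 va = _mm512_loadu_ps(a);
    __m512 vb = _mm512_loadu_ps(b);

    // Build a per-lane mask, like a GPU divergence mask: only lanes where a < b are "active".
    __mmask16 active = _mm512_cmp_ps_mask(va, vb, _CMP_LT_OS);

    // Masked ALU op: inactive lanes keep their previous value instead of the sum.
    __m512 vout = _mm512_loadu_ps(out);
    vout = _mm512_mask_add_ps(vout, active, va, vb);

    _mm512_storeu_ps(out, vout);
    for (int i = 0; i < 16; i++) std::printf("%g ", out[i]);
    std::printf("\n");
    return 0;
}
```

ARM SVE expresses the same idea through predicate registers, with the extra twist that the vector length is not fixed at compile time.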
> They should not be oversubscribed, as they have to context switch by storing / loading registers, and the cache coherence protocols scale badly with more threads.
CPUs can be oversubscribed if so designed.
POWER9 had SMT4 and SMT8 variants (4 threads per core and 8 threads per core, respectively). SMT8 is basically GPU-level throughput / threading. SMT4 is probably a better medium between x86 (2 threads per core) and the craziness that is GPUs.
I'd describe modern CPUs as pipelined (instruction processing broken into stages to help parallelize fetch / decode / execute), superscalar (multiple execution pipelines in parallel), out-of-order (register renaming / retirement units with Tomasulo's algorithm), branch-predicted, speculative processors with virtual memory and MESI-like protocols to provide cache coherence across cores.
-------
I basically agree with your post. Just clarifying a few points. My mental model of CPUs / GPUs seems to match yours.
The difference is much more nuanced than this. A modern GPU can (and probably does) do most of what you've listed for a CPU. Speculative execution and branch prediction are a bit less likely to be invested in (because GPUs don't need them as much, thanks to oversubscription), but that's increasingly true of high-efficiency CPU cores as well. The difference (at a category-vs-category level, not a specific microarch) is mostly a matter of tuning for particular workloads. I'm increasingly souring on SIMD/SIMT being a useful distinction now that bleeding-edge CPUs are widening in the microarch and bleeding-edge GPUs are getting better at handling thread divergence in the microarch. There is a difference, certainly, but it's difficult to describe in a few bullet points.
GPUs are more likely to have more exotic features than you'll see on a CPU to deal with things like thread coordination and cache coherence, but there's nothing fundamentally stopping CPUs from adding that (or wanting that) as well.
> GPUs are getting better at handling thread divergence in the microarch
That is an interesting point; how does that work (especially with the dynamics of ray tracing)? Do they recombine underutilized wavefronts or something?
I'm not aware of anything that specifically improves thread divergence handling. NVidia's most recent GPUs have superscalar execution, which is a trick from CPU-land (multiple pipelines operating on 2 or more instructions per clock tick). NVidia has an integer pipeline and a floating-point pipeline, and both can operate simultaneously (ex: in for(int i=0; i<100; i++) x *= blah; the "i++" is integer work, while the "x *= blah" is floating point, so the two can execute at the same time).
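A runnable version of that toy loop, just to spell out the instruction mix being described (the constant blah is a made-up stand-in):

```cpp
// Toy version of the loop above: the loop bookkeeping (i++, the compare,
// the branch) is integer work, while x *= blah is floating-point work.
// The two have no data dependency within an iteration, so a dual-issue
// core can execute them side by side.
#include <cstdio>

int main() {
    float x = 1.0f;
    const float blah = 1.0001f;   // "blah" is just a stand-in constant
    for (int i = 0; i < 100; i++) {
        x *= blah;
    }
    std::printf("%f\n", x);
    return 0;
}
```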
CPUs have extremely flexible pipelines: Intel's pipelines 0 and 1 can basically do anything; pipeline 5 can do most things but is missing division IIRC (and a few other things). Loads / stores are done on some other pipelines, etc. etc.
Apple's and AMD's CPU pipelines are more symmetrical and uniform.
NVidia GPUs are the only superscalar ones I can think of, aside from AMD GPUs' scalar-vs-vector split (which isn't really the "superscalar" operation I'm trying to describe).
Starting with Volta, Nvidia GPUs have a forward progress guarantee, preventing lockups when there's thread divergence.
That doesn't improve the performance of a well-behaved, well-written compute shader. But avoiding hard hangs IMO deserves the label “improved thread divergence handling.”
Aren't warps still 32 threads, even though the number of threads is skyrocketing, effectively making them proportionately finer-grained? Are things different in AMD land?
Slightly, the older tech is 64 threads/lanes per warp/wavefront. Newer ones are 32 by default but 64 if desired.
Bigger differences are the per-thread program counter since Volta on Nvidia (which I think is a terrible feature) and the fact that forward-progress guarantees are stronger on Nvidia (those are _really_ helpful but expensive).
> Slightly, the older tech is 64 threads/lanes per warp/wavefront. Newer ones are 32 by default but 64 if desired.
AMD GCN was 64 threads/wavefront. NVidia has always been 32 threads/warp.
AMD's newest consumer cards, RDNA and RDNA2, are 32 threads/wavefront. However, GCN lives on in CDNA (the MI200 supercomputer chips), with a 64 threads/wavefront architecture.
Intel has had such a huge advantage in compute capabilities on their CPUs for the last 7 years, and they have effectively wasted it... by locking it behind boutique Xeon CPUs.
AMD is coming out with AVX512 support in Zen 4, and that's that. No more advantage to Intel. Unless Intel re-enables AVX512, they'll fall behind AMD on important tasks / benchmarks (Matlab, Blender, etc.).
> (Almost) Nobody (really) cares about flops ...because we should really be caring about memory bandwidth
In university, I was shocked to learn in a database class that CPU costs are dwarfed by the I/O costs in the memory hierarchy. This was after spending a whole year on data structures and algorithms, where we obsessed over runtime complexity and # of operations.
It seems that the low-hanging fruit of optimization is all gone. New innovations for performance will have to happen in transporting data.
At the risk of being flippant, I hope you learned about space complexity and the lessons about how the algorithms and data structures you use impact performance via the cache.
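A small illustration of that cache point (sizes and timing are arbitrary, just a sketch): both loops below do the same number of additions, but the row-major walk streams through contiguous memory while the column-major walk strides across it, so the second is usually several times slower once the matrix outgrows the caches.

```cpp
// Same O(n^2) work, very different memory behavior: the algorithmic
// cost model is identical, but the cache sees two different programs.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 4096;                        // arbitrary size for illustration
    std::vector<float> m((size_t)n * n, 1.0f); // 64 MB matrix, larger than typical caches

    auto time_it = [&](bool row_major) {
        auto t0 = std::chrono::steady_clock::now();
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                sum += row_major ? m[(size_t)i * n + j]   // contiguous: cache lines fully used
                                 : m[(size_t)j * n + i];  // strided: one float per cache line
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%s: sum=%g, %lld ms\n", row_major ? "row-major" : "col-major", sum, ms);
    };

    time_it(true);
    time_it(false);
    return 0;
}
```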
I don't really understand your comment. Databases are generally I/O-bound workloads, almost by definition of what they do. Regardless, data structures and algorithms are equally important in databases: B-trees, linked lists, buffers, LSM trees, Bloom filters, caching strategies, etc. are all fundamental to databases. At any rate, for a long time now the low-hanging fruit of optimization has been throwing money and hardware at problems - NAND flash, more cores, larger caches, tons of memory, edge networks, etc. Those options are all still there.
There is an entire field of parallel algorithms which makes use of sequential algorithms to overcome some of these issues. So no, it's not wasted. You would apply that knowledge to build parallel algorithms.
There are projects at my school where we implemented a combination of CUDA and OpenMP in some and MPI+OpenMP in a few. I think the bottleneck is always gonna be there; it's just a question of how much, and how you deal with it in hardware from the software front.
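For reference, the OpenMP half of that combination can be as small as one pragma; a minimal sketch (array sizes made up) is below. The loop parallelizes trivially across cores, but on big arrays it tends to hit the memory-bandwidth wall discussed above long before it runs out of threads.

```cpp
// Minimal OpenMP sketch: compile with -fopenmp. The dot product is
// trivially parallel, but on large arrays it is usually memory-bandwidth
// bound, so adding cores stops helping well before core count runs out.
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1 << 24;                 // arbitrary size for illustration
    std::vector<double> a(n, 1.0), b(n, 2.0);
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long long i = 0; i < (long long)n; i++)
        sum += a[i] * b[i];                   // each thread works on a chunk of the range

    std::printf("dot = %f\n", sum);
    return 0;
}
```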
GPUs became more general purpose. Old vector processors from the 1980s served as inspiration. Even in 90s commercials it's obvious that the GPU / SIMD-compute similarities were all over the 3dfx cards.
In the 00s, GPUs became flexible enough to execute arbitrary code for vertex effects and pixel effects. Truly general purpose, arbitrary code, albeit in the SIMD methodology.
--------
Today, your Direct2D windowing code in Windows is largely running on the GPU, and has been migrated away from the CPU. In fact, your video decoders (Youtube) and video games (shaders) are all GPU code.
GPUs have proven themselves to be general purpose processors, albeit with a strange SIMD model of compute rather than the traditional von Neumann design.
We're in a cycle where more and more code is moving away from CPUs onto GPUs, and permanently staying in GPU space. This is the opposite of the cycle of reincarnation (CPUs may have gotten faster, but GPUs have not only gotten faster at a higher rate, they have also become more general purpose, allowing more general code to run on them).
Code successfully ported over (ex: Tensorflow) may never return to the CPU side. SIMD compute is just a superior underlying model for a large set of applications.
> In fact, your video decoders (Youtube) [...] are all GPU code.
I believe this is false; video decoders like H.264/AVC have a significant ASIC component that cannot be expressed as general-purpose SIMT code. I think this is because the entropy coding portion (arithmetic coding, Huffman coding, etc.) needs to be decoded serially. Some stuff like macroblock-to-macroblock prediction in I-frames is serial as well. But IDCT is indeed parallelizable.
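A toy sketch of that serial constraint (this is not a real entropy decoder, just an illustration): each output symbol depends on decoder state produced by the previous symbol, so the loop below carries a dependency from iteration to iteration and cannot be split across SIMT lanes the way independent per-block IDCTs can.

```cpp
// Not a real video decoder: a toy state machine standing in for entropy
// decoding. The loop-carried dependency on `state` is the point -- symbol
// i cannot be decoded until symbol i-1 has updated the state.
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<uint8_t> bitstream = {0x3c, 0xa7, 0x19, 0xe2, 0x55};  // made-up input
    uint32_t state = 1;                       // decoder context / range, grossly simplified
    std::vector<uint8_t> symbols;

    for (uint8_t byte : bitstream) {
        // Each "decoded symbol" is a function of the current state *and* the input...
        uint8_t sym = (uint8_t)((state ^ byte) & 0xff);
        symbols.push_back(sym);
        // ...and the state needed for the *next* symbol depends on this one.
        state = state * 33 + sym;
    }

    for (uint8_t s : symbols) std::printf("%02x ", s);
    std::printf("\n");
    return 0;
}
```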