CPUs do minimize latency by:
- Register renaming
- Out-of-order execution
- Branch prediction
- Speculative execution
They should not be oversubscribed, as they have to context switch by storing / loading registers, and the cache coherence protocols scale badly with more threads.
GPUs, on the other hand, maximize throughput by:
- Much higher memory bandwidth
- Smaller and slower cores, but more of them
- Ultra-threading (the massively oversubscribed hyper-threading the video mentions)
- Context switching between wavefronts (basically the equivalent of a CPU thread) by just shifting the offset into the huge register file (no store / load)
The one area in which CPUs are getting closer to GPUs is SIMD / SIMT. CPUs used to only be able to apply one instruction to a whole vector of elements, with no masking (SIMD). With ARM SVE and x86 AVX-512 they can now, like GPUs, mask out individual lanes (SIMT-style) for both ALU operations and memory operations (gather loads / scatter stores).
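To make that masking point concrete, here is a minimal AVX-512 sketch (assuming a CPU with AVX-512F and made-up array contents): each of the 16 float lanes can be individually switched off by a mask, much like inactive lanes in a GPU wavefront.

```cpp
// Minimal AVX-512 masking sketch (requires AVX-512F, compile with e.g. -mavx512f).
// Array contents are made up purely for illustration.
#include <immintrin.h>
#include <cstdio>

int main() {
    float a[16], b[16], out[16];
    for (int i = 0; i < 16; i++) { a[i] = (float)i; b[i] = 8.0f; out[i] = -1.0f; }

    __m512 va = _mm512_loadu_ps(a);
    __m512 vb = _mm512_loadu_ps(b);

    // Build a per-lane mask, like a GPU divergence mask: only lanes where a < b are "active".
    __mmask16 active = _mm512_cmp_ps_mask(va, vb, _CMP_LT_OS);

    // Masked ALU op: inactive lanes keep their previous value instead of the sum.
    __m512 vout = _mm512_loadu_ps(out);
    vout = _mm512_mask_add_ps(vout, active, va, vb);

    _mm512_storeu_ps(out, vout);
    for (int i = 0; i < 16; i++) std::printf("%g ", out[i]);
    std::printf("\n");
    return 0;
}
```

ARM SVE expresses the same idea through predicate registers, with the extra twist that the vector length is not fixed at compile time.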
> They should not be oversubscribed, as they have to context switch by storing / loading registers, and the cache coherence protocols scale badly with more threads.
CPUs can be oversubscribed if so designed.
POWER9 had SMT4 and SMT8 variants (4 threads per core and 8 threads per core, respectively). SMT8 is basically GPU-level throughput / threading. SMT4 is probably a better medium between x86 (2 threads per core) and the craziness that is GPUs.
I'd describe modern CPUs as pipelined (instruction processing broken into stages to help parallelize fetch / decode / execute), superscalar (multiple execution pipelines in parallel), out-of-order (register renaming / retirement units with Tomasulo's algorithm), branch-predicted, speculative processors with virtual memory and MESI-like protocols to provide cache coherence across cores.
-------
I basically agree with your post. Just clarifying a few points. My mental model of CPUs / GPUs seems to match yours.
The difference is much more nuanced than this. A modern GPU can (and probably does) do most of what you've listed for a CPU. Speculative execution and branch prediction are a bit less likely to be invested in (because GPUs don't need them as much, thanks to oversubscription), but that's increasingly true of high-efficiency CPU cores as well. The difference (at a category-vs-category level, not a specific microarch) is mostly a matter of tuning for particular workloads. I'm increasingly souring on SIMD/SIMT being a useful distinction now that bleeding-edge CPUs are widening in the microarch and bleeding-edge GPUs are getting better at handling thread divergence in the microarch. There is a difference, certainly, but it's difficult to describe in a few bullet points.
GPUs are more likely to have more exotic features than you'll see on a CPU to deal with things like thread coordination and cache coherence, but there's nothing fundamentally stopping CPUs from adding that (or wanting that) as well.
> GPUs are getting better at handling thread divergence in the microarch
That is an interesting point; how does that work (especially with the dynamics of ray tracing)? Do they recombine underutilized wavefronts or something?
I'm not aware of anything that specifically improves thread divergence handling. NVidia's most recent GPUs have superscalar execution, which is a trick from CPU-land (multiple pipelines operating on 2 or more instructions per clock tick). NVidia has an integer pipeline and a floating-point pipeline, and both can operate simultaneously (ex: in for(int i=0; i<100; i++) x *= blah; the "i++" is integer work, while the "x *= blah" is floating point, so the two can execute at the same time).
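A runnable version of that toy loop, just to spell out the instruction mix being described (the constant blah is a made-up stand-in):

```cpp
// Toy version of the loop above: the loop bookkeeping (i++, the compare,
// the branch) is integer work, while x *= blah is floating-point work.
// The two have no data dependency within an iteration, so a dual-issue
// core can execute them side by side.
#include <cstdio>

int main() {
    float x = 1.0f;
    const float blah = 1.0001f;   // "blah" is just a stand-in constant
    for (int i = 0; i < 100; i++) {
        x *= blah;
    }
    std::printf("%f\n", x);
    return 0;
}
```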
CPUs have extremely flexible pipelines: Intel's pipelines 0 and 1 can basically do anything; pipeline 5 can do most things but is missing division IIRC (and a few other things). Loads / stores are done on some other pipelines, etc. etc.
Apple's and AMD's CPU pipelines are more symmetrical and uniform.
NVidia GPUs are the only superscalar ones I can think of, aside from AMD GPUs' scalar-vs-vector split (which isn't really the "superscalar" operation I'm trying to describe).
Starting with Volta, Nvidia GPUs have a forward progress guarantee, preventing lockups when there's thread divergence.
That doesn't improve the performance of a well-behaved, well-written compute shader. But avoiding hard hangs IMO deserves the label “improved thread divergence handling.”
Aren't warps still 32 threads, even though the number of threads is skyrocketing, effectively making them proportionately finer-grained? Are things different in AMD land?
Slightly, the older tech is 64 threads/lanes per warp/wavefront. Newer ones are 32 by default but 64 if desired.
Bigger differences are the per-thread program counter since Volta on Nvidia (which I think is a terrible feature) and the fact that forward-progress guarantees are stronger on Nvidia (those are _really_ helpful but expensive).
> Slightly, the older tech is 64 threads/lanes per warp/wavefront. Newer ones are 32 by default but 64 if desired.
AMD GCN was 64 threads/wavefront. NVidia has always been 32 threads/warp.
AMD's newest consumer cards, RDNA and RDNA2, are 32 threads/wavefront. However, GCN lives on in CDNA (the MI200 supercomputer chips), with a 64 threads/wavefront architecture.
Intel has had such a huge advantage in compute capabilities on their CPUs for the last 7 years, and they have effectively wasted it... by locking it behind boutique Xeon CPUs.
AMD is coming out with AVX512 support in Zen 4, and that's that. No more advantage to Intel. Unless Intel re-enables AVX512, they'll fall behind AMD on important tasks / benchmarks (Matlab, Blender, etc.).
> (Almost) Nobody (really) cares about flops ...because we should really be caring about memory bandwidth
In university, I was shocked to learn in a database class that CPU costs are dwarfed by the I/O costs in the memory hierarchy. This was after spending a whole year on data structures and algorithms, where we obsessed over runtime complexity and # of operations.
It seems that the low-hanging fruit of optimization is all gone. New innovations for performance will have to happen in transporting data.
At the risk of being flippant, I hope you learned about space complexity and the lessons about how the algorithms and data structures you use impact performance via the cache.
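A small illustration of that cache point (sizes and timing are arbitrary, just a sketch): both loops below do the same number of additions, but the row-major walk streams through contiguous memory while the column-major walk strides across it, so the second is usually several times slower once the matrix outgrows the caches.

```cpp
// Same O(n^2) work, very different memory behavior: the algorithmic
// cost model is identical, but the cache sees two different programs.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 4096;                        // arbitrary size for illustration
    std::vector<float> m((size_t)n * n, 1.0f); // 64 MB matrix, larger than typical caches

    auto time_it = [&](bool row_major) {
        auto t0 = std::chrono::steady_clock::now();
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                sum += row_major ? m[(size_t)i * n + j]   // contiguous: cache lines fully used
                                 : m[(size_t)j * n + i];  // strided: one float per cache line
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%s: sum=%g, %lld ms\n", row_major ? "row-major" : "col-major", sum, ms);
    };

    time_it(true);
    time_it(false);
    return 0;
}
```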
I don't really understand your comment. Databases are generally I/O-bound workloads, almost by definition of what they do. Regardless, data structures and algorithms are equally important in databases: B-trees, linked lists, buffers, LSM trees, Bloom filters, caching strategies, etc. are all fundamental to databases. At any rate, for a long time now the low-hanging fruit of optimization has been throwing money and hardware at problems - NAND flash, more cores, larger caches, tons of memory, edge networks, etc. Those options are all still there.
There is an entire field of parallel algorithms which makes use of sequential algorithms to overcome some of these issues. So no, it's not wasted. You would apply that knowledge to build parallel algorithms.
There are projects at my school where we implemented a combination of CUDA and OpenMP in some and MPI+OpenMP in a few. I think the bottleneck is always gonna be there; it's just a question of how much, and how you deal with it in hardware from the software front.
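For reference, the OpenMP half of that combination can be as small as one pragma; a minimal sketch (array sizes made up) is below. The loop parallelizes trivially across cores, but on big arrays it tends to hit the memory-bandwidth wall discussed above long before it runs out of threads.

```cpp
// Minimal OpenMP sketch: compile with -fopenmp. The dot product is
// trivially parallel, but on large arrays it is usually memory-bandwidth
// bound, so adding cores stops helping well before core count runs out.
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1 << 24;                 // arbitrary size for illustration
    std::vector<double> a(n, 1.0), b(n, 2.0);
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long long i = 0; i < (long long)n; i++)
        sum += a[i] * b[i];                   // each thread works on a chunk of the range

    std::printf("dot = %f\n", sum);
    return 0;
}
```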
GPUs became more general purpose. Old vector processors from the 1980s served as inspiration. Even in 90s commercials it's obvious that the GPU / SIMD-compute similarities were all over the 3dfx cards.
In the 00s, GPUs became flexible enough to execute arbitrary code for vertex effects and pixel effects. Truly general purpose, arbitrary code, albeit in the SIMD methodology.
--------
Today, your Direct2D windowing code in Windows is largely running on the GPU, and has been migrated away from the CPU. In fact, your video decoders (Youtube) and video games (shaders) are all GPU code.
GPUs have proven themselves to be general purpose processors, albeit with a strange SIMD model of compute rather than the traditional von Neumann design.
We're in a cycle where more and more code is moving away from CPUs onto GPUs, and permanently staying in GPU space. This is the opposite of the cycle of reincarnation (CPUs may have gotten faster, but GPUs have not only gotten faster at a higher rate, they have also become more general purpose, allowing more general code to run on them).
Code successfully ported over (ex: Tensorflow) may never return to the CPU side. SIMD compute is just a superior underlying model for a large set of applications.
> In fact, your video decoders (Youtube) [...] are all GPU code.
I believe this is false; video decoders like H.264/AVC have a significant ASIC component that cannot be expressed as general-purpose SIMT code. I think this is because the entropy coding portion (arithmetic coding, Huffman coding, etc.) needs to be decoded serially. Some stuff like macroblock-to-macroblock prediction in I-frames is serial as well. But IDCT is indeed parallelizable.
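A toy sketch of that serial constraint (this is not a real entropy decoder, just an illustration): each output symbol depends on decoder state produced by the previous symbol, so the loop below carries a dependency from iteration to iteration and cannot be split across SIMT lanes the way independent per-block IDCTs can.

```cpp
// Not a real video decoder: a toy state machine standing in for entropy
// decoding. The loop-carried dependency on `state` is the point -- symbol
// i cannot be decoded until symbol i-1 has updated the state.
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<uint8_t> bitstream = {0x3c, 0xa7, 0x19, 0xe2, 0x55};  // made-up input
    uint32_t state = 1;                       // decoder context / range, grossly simplified
    std::vector<uint8_t> symbols;

    for (uint8_t byte : bitstream) {
        // Each "decoded symbol" is a function of the current state *and* the input...
        uint8_t sym = (uint8_t)((state ^ byte) & 0xff);
        symbols.push_back(sym);
        // ...and the state needed for the *next* symbol depends on this one.
        state = state * 33 + sym;
    }

    for (uint8_t s : symbols) std::printf("%02x ", s);
    std::printf("\n");
    return 0;
}
```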