Hacker News

My point with vectorization was that the one case where indexed loads/stores are most defensibly unnecessary is also the case where you shouldn't want scalar memory ops in the first place. That means many scalar memory ops end up outside of tight loops, and outside of tight loops is also where unrolling, strength reduction, and LICM, i.e. the transformations that reduce the need for indexed loads, are least applicable.

Just ran a quick benchmark: Haswell seems to handle "mov rbx, QWORD PTR [rbx+imm]" with 4c latency if there are no chained instructions (5c latency in all other cases, including an indexed load without chained instructions, and "mov rbx, QWORD PTR [rbx+rcx*8+0x12345678]" always). So even though there exist cases where the indexed load pushes the latency over to the next cycle, there are cases where the indexed load is free too.



And outside of tight loops is where a cycle here or there is irrelevant to the overall speed of the program. All the more so if you're going to have cache or TLB misses on those loads.


I quite heavily disagree. That might apply to programs which spend ~90% of their time in a couple of tight loops, but there's tons of software that isn't that simple (especially web.. well, everything, but also compilers, video game logic, whatever bits of kernel logic happen in syscalls, etc.), instead spending a ton of time whizzing around a massive mess. And you want that mess to run as fast as possible, regardless of how much the mess being a mess makes low-level devs cry. If there's headroom in the AGU for a 64-bit adder, I'd imagine it's an essentially free couple-percent boost; though the cost of the extra register port(s) (or the logic to share some with an ALU) might be annoying.

And indexed loads aren't a "here or there", they're a pretty damn common thing; like, a ton more common than most instructions in Zbb/Zbc/Zbs.


This is not a discussion that can be resolved in the abstract. It requires actual experimentation and data: pointing at actual physical CPUs differing only in this respect and comparing silicon area, energy use, MHz achieved, and cycles per program.


It's certainly not a thing to be resolved in the abstract, but it's also far from a thing to be dismissed as irrelevant in the abstract.

But I have a hard time imagining that my general point, namely "if there's headroom for a full 64-bit adder in the AGU, adding one is very cheap and can provide a couple-percent boost in applicable programs", is far from the truth. Though the register-file port requirement might make it less trivial than I'd like it to be.



