Hacker News | new | past | comments | ask | show | jobs | submit | login

IMO you should just stick some programs in Ghidra/Godbolt and see what they emit, especially for small individual snippets, whenever you think "I want to do X, what's the best way of doing it?". There really isn't much difference between a "baby's first assembly" program, where you just have movs and like five other common instructions, and the kind of assembly an optimizing compiler emits: it's a matter of recognizing that some operations can be merged into a more specialized one or folded into the addressing mode of another, or that you can use a setcc with a result flag from something you already computed, or what have you. The good code that LLVM and JITs emit mostly isn't due to much better instruction selection but due to much better optimization passes, which learning more about assembly doesn't help with: those are about transforming code in general at a high level, at the compiler IR step, before touching assembly at all.


Everyone should know Godbolt. (I don't know Ghidra; I probably should.)

First, Godbolt is a useful tool for understanding what a particular compiler will do with your code under different compiler switches.

Second, it's a useful tool for communicating, because you can enter source, set some switches, and see the output. You can (and we do) get a persistent shortened URL to submit on StackExchange, HN, … so that people concretely know what you're complaining about.

Godbolt should get the ACM System Software Award.


> much better optimization passes, which learning more about assembly doesn't help with

I think you have to include an understanding of the underlying processor architecture as part of "learning more about assembly". If you're writing assembler without instruction scheduling in mind, you would be better off not writing assembler at all, and letting the LLVM optimizers do instruction scheduling for you.


How can one learn instruction scheduling for Intel CPUs? Are the algorithms for branch prediction and out-of-order execution revealed in the manuals? Thanks.


I don't think it's humanly possible to do it perfectly. Heuristically, the Architecture manual gets you 95% of the way there.

For really detailed insight into code optimization, the Intel Profiler ($$) gives you a lot of tools for precise instruction scheduling (e.g. an indication of which instructions are stalling during execution of your code, useful analysis of cache miss rates, and which instructions caused those cache misses). ARM also provides a profiler that may do the same for ARM chips, but it is insanely expensive.

You can make do with Linux stochastic profilers, but it may be helpful to have some utility code that dumps the relevant profiling registers for your CPU (e.g. L1, L2, L3 cache-miss counts, missed-branch counts, processor-stall counts, &c.). I'm not sure what x86 processors provide; but writing code to dump ARM profiling registers proved incredibly useful in a recent profiling-and-optimization misadventure.

Fwiw, unless you're using instructions that don't map well onto high-level languages, it's pretty difficult to beat well-tweaked GCC-generated code by more than a few percent. I imagine LLVM is the same. Unless you're writing code whose welfare depends on whether it's 3% faster than a competitor, it's probably not worth it to drop into assembler.

With a bit of tweaking you can even get all the major C/C++ compilers to generate consistently, annoyingly good SIMD code from non-SIMD C/C++, by coaxing them into auto-vectorizing.

The other way to learn is to do. Profile EVERYTHING with a stochastic profiler. Tweak based on your necessarily limited understanding of the architecture. Profile again to confirm that your optimization actually is valid. Repeat until done.


It's proprietary but somewhat amenable to exploring through experiment.

The heuristics go something like:

1. Find out what execution ports your processor has. E.g. it can probably do two 256-bit loads from L1 cache each cycle and probably can't do two stores. It can do arithmetic at the same time. Beware collisions between your arithmetic and address calculations.

2. Look for some indication of what the register files are - you don't want to read from a register immediately and probably don't want to wait too long either, and there's a load of latency hiding renaming going on in the background. This one seems especially poorly documented.

3. The aim is to order instructions so that the dynamic scheduler has an easier time keeping the ports occupied, and so that stalls on register access are unlikely.

4. Choosing different instructions may make that work better, in a gnarly NP-over-NP sort of fashion.

5. Moving redundant or reversible calculations across branches can be a good idea

The DSP chips are much more fun to schedule in the compiler as branches are usually cheaper and there's probably no reordering happening at runtime.


Some of the information you want is published by Intel. Some of it is better found on https://uops.info/ or in Peter Cordes's answers on StackOverflow or in the comments section of Agner Fog's web site. Some of it you must determine experimentally. Empirically, modern compilers do not fully exploit much of this knowledge, though e.g. LLVM-MCA proves that they contain a good bit of it.


For HPC code that aims to run on Intel CPUs, do you recommend a compiler like the Intel oneAPI C compiler over LLVM or GCC, before one starts to profile / invest time reading the manual for a specific compiler?


Anecdotally, GCC seems to do a startlingly good job of instruction scheduling, on x86 and ARM. I've always wondered what sort of architecture models the big compilers have, and have been meaning to browse the source code to find out for some time. Does anyone know?


> or you can use a setcc with a results flag from something you already computed

Funny enough, on large CPUs this can be slower than recomputing something, because they don't like long dependency chains and sometimes even have penalties for reading a register that hasn't been written for hundreds of instructions.



