Thorough work. Going back to the premise, I want to offer an alternative viewpoint: in the case of binary trees implemented as arrays, is “integer division by 2” of the array index necessarily the best interpretation of what you are trying to do here?
Instead, you can also see the array index as a bit string, where every bit tells you which path to go down, left or right. In that case, “shifting right by one bit” moves you up to the parent. “Shifting left” moves you down to the left child. Flipping one bit flips you over to the other child. Bitwise operations indeed seem more natural with that interpretation.
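That interpretation maps directly onto C for the usual 1-based array layout of a complete binary tree (the helper names here are mine, just for illustration):

```c
#include <stdint.h>

/* 1-based array layout: the root is index 1, and each index, read as a
 * bit string after the leading 1, spells the path from the root
 * (0 = left, 1 = right). */
static inline uint32_t parent(uint32_t i)      { return i >> 1; }       /* drop the last path bit */
static inline uint32_t left_child(uint32_t i)  { return i << 1; }       /* append a 0: go left */
static inline uint32_t right_child(uint32_t i) { return (i << 1) | 1; } /* append a 1: go right */
static inline uint32_t sibling(uint32_t i)     { return i ^ 1; }        /* flip the last path bit */
```

So node 6 (binary 110) is reached by going right then left from the root, its parent is 3 (11), and its sibling is 7 (111).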
A lot of “power of 2” multiplication/division has similar interpretations. For example, when walking page tables, you could see walking down the levels as “dividing by the size of the granule”, or simply as “shifting right to select the index on that level”.
No contest on anything where the power of 2 is coincidence, i.e. for non “computery” things where there is no such underlying structure.
I had a slightly similar experience at work. We deal a lot with angles, and binary angles are often the most efficient representation. Many colleagues find it annoying because they insist on converting every binary angle to radians or degrees, but if you actually interpret the bits as successive divisions of a circumference, I find the binary representation way more intuitive than a floating point number.
That method is exactly equivalent to using revolutions as the unit with a fixed-point numeric representation. Maybe your skeptical colleagues would find that perspective more palatable?
That's good advice. Unfortunately, most of the people I work with do not have a computer science background (they are materials scientists and mechanical engineers), so they are not familiar with fixed-point arithmetic either.
> they insist on converting every binary angle to [] degrees
Bit of hypocrisy there, since degrees are literally the same thing, just with 360 sectors rather than (say) 256. Though the extra factors of 3 and 5 are sometimes useful.
The point is to not think of the bits as the representation of a decimal number. For example, if I need to know in which quadrant an angle is, I just need to look at the first two bits, independently of the precision (which we can change just by shifting and truncating). But I do not mind whether 90 degrees equals 64 or 1073741824 "sectors".
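A small C sketch of reading the quadrant straight off the top two bits (the 16-bit width is an arbitrary choice for the example):

```c
#include <stdint.h>

/* 16-bit binary angle: 0x0000..0xFFFF spans one full turn.
 * The top two bits directly give the quadrant, no matter how many
 * low-order precision bits follow. */
static inline unsigned quadrant(uint16_t bangle) {
    return bangle >> 14;               /* keep only the two most significant bits */
}

/* Changing precision is just a shift: truncate a 16-bit angle to 8 bits. */
static inline uint8_t to_8bit(uint16_t bangle) {
    return (uint8_t)(bangle >> 8);
}
```

Here 0x4000 is 90 degrees regardless of width: `quadrant(0x4000)` is 1, and after truncation the same angle is 0x40 out of 256 "sectors".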
I think this is another good example where a binary representation (and a shift) can be a more natural way to approach the problem.
Not just more natural, it also lends itself to better accuracy and/or efficiency in some cases too.
As a quick example, if you're writing a numerical approximation or even just a LUT for sin(x), you can get away with only approximating/storing values for a reduced range by using some bit twiddling. The top bit becomes the sign of the output if you XOR it out first, the next bit lets you take advantage of sin(pi/2-x) = sin(x+pi/2) to halve the input range again, eke out symmetry for the next bit, and you can nearly approximate sin(x) as a linear function for some use cases.
I appreciate the thorough exploration and resulting detail in this article. Still, I find it a little sad that either the state of our toolchains or the perception thereof prevent it from being self-evident that basic strength reduction will always happen, and developers need not worry about the cost of expressing simple arithmetic operations in the clearest way.
I would go further: many programmers don't understand the difference between logical and arithmetic operations and why they exist.
I have had to puzzle FAR too often over code doing adds/subtracts/multiplies/divides instead of and/or/xor/shift.
I blame Java not having an unsigned type. There are apparently some weird tricks you can do with arithmetic in Java that operate on things like an unsigned type without having to go up to the next higher integer width.
Blame it on developers that think they know unsigned math.
> One of the little experiments I tried was asking people about the rules for unsigned arithmetic in C. It turns out nobody understands how unsigned arithmetic in C works. There are a few obvious things that people understand, but many people don't understand it.
Question from a perf noob: it seems like this in principle only shows that there's no difference for a Pixel 3 because other Android machines could have processors that have/lack an optimization for shift or divide. Couldn't a different Android phone have different performance characteristics?
It is not that the processor contains an optimization per se. The thing to understand is that a shift is fundamentally a simpler operation than a multiply: a shift can be implemented with a few transistors per bit and done in a single cycle trivially. A multiply unit takes tons of transistors and often takes multiple cycles (this is a trade-off you make when designing a multiply unit: you can save space by making it work on smaller integers and reusing it over several cycles to multiply larger integers, just like you iteratively multiply one digit at a time when doing it on paper by hand). Even on processors that have single-cycle multipliers, a multiply takes a lot more power than a shift because of all the extra hardware you need to engage.
Since shifts are fundamentally simpler than multiplies, it always makes sense to do this transform. This is one of a number of transforms generally called "strength reductions" <https://en.wikipedia.org/wiki/Strength_reduction>: converting a more general, expensive operation into a more constrained, cheaper one. In this case it is equivalent to knowing that if you want to multiply a number by 10, you can just append a 0 instead of doing all the multiplication work by hand.
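The binary analogue of "append a 0" is a left shift, and a strength-reducing compiler extends the idea to other constants. Here is a hand-written sketch of what a compiler might emit for multiplication by 10 (write `x * 10` in real code and let the compiler do this):

```c
#include <stdint.h>

/* 10 = 8 + 2, so x * 10 == (x << 3) + (x << 1): two shifts and an add
 * instead of a general multiply. Illustrative only; compilers apply this
 * transform automatically. */
static inline uint32_t times10(uint32_t x) {
    return (x << 3) + (x << 1);
}
```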
The only reason not to do this transform would be if you had a CPU that literally does not have a shift operation, but I cannot think of any such part. Even if you did have such a part, the odds are you could emulate a shift using other instructions and still outperform the multiply.
This has been a standard optimization for half a century. The original C compiler for the PDP-11 did these transforms even when you turned off optimizations <http://c-faq.com/misc/shifts.html>.
Barrel shifters use n log n transistors, which isn't quite as bad as a multiplier but is still very significant. The Pentium 4 was notable for excluding it making shifts very slow.
> The thing to understand is that shift is fundamentally a simpler operation than multiply
... well... if you have a hardware multiply that is guaranteed to take one clock cycle and a shift that will take the same, does the fundamental complexity of how that happens even matter?
> This has been a standard optimization for half a century. The original C compiler for the PDP-11 did these transforms even when you turned off optimizations
Consider this: a common, easily applied optimization that compilers have been doing for half a century may have made its way into modern CPUs.
Transistors aren't nearly as power hungry as you paint them, and CPUs aren't nearly as bad at optimization. There is no reason to switch a multiply or divide for a shift. The ONLY reason to make that switch is if you are dealing with the simplest of processors (such as a microwave's processor). If you are using anything developed in the last 10 years that consumes more than 1W of power, chances are really high that you aren't saving any power by using shifts instead of multiplies. It is the sort of micro-optimization that fundamentally misunderstands how modern CPUs actually work and overestimates how much power or space transistors actually need.
If the ALU contained an early out or fast path for simpler multiplies, the latency would read 1-3. You can verify this by looking at div, which does early out and has a latency of 35-88.
Any compiler that doesn't swap a multiply to a shift when it can is negligent.
Valid points, but in this case (and many others where you encounter power-of-2 mult/div), I'd consider that a shift might actually semantically be the more natural operation in the first place, instead of an "optimization" of the mult/div operation. (With their equivalence being obvious to any reader, it might not matter.)
I was not arguing the people writing code should perform strength reductions manually, I was explaining what they were and then stating that even ancient compilers do them automatically. While I did not explicitly state it, the logical follow on is that programmers should almost never explicitly strength reduce in their code, they should write the semantically clear version and let the compiler handle it for them.
You are correct, that on modern CPUs there are often specifically recognized idioms where the processor can implicitly perform an instruction transform such as a strength reduction from a multiply to a shift.
Having said that, it still makes sense for a compiler to perform strength reductions rather than depending on the CPU frontend, at least if your compiler has a relatively decent scheduling model for the CPU. I don't know of any modern production-quality compiler that would omit a simple strength reduction like this and leave it to the CPU.
Given that performance is so close to the same, my only question now is whether there's a measurable difference in power usage between shift and multiply.
Most processors will decode an instruction into micro-ops and guess what that decode phase can do? It can say "Is this a power of 2? Great, engage the shifting operator".
Only on super simple processors (think embedded systems) would this ever actually make a difference. Anything else, this is an optimization that your processor is going to do for you automatically.
Presumably the GP was talking about multiplying by an immediate value embedded in the instruction. That would be possible, but I doubt there are any current processors that do so.
You can do it in decode if the relevant operand is encoded as an immediate. I'm not aware of any processors that do any of the strength reductions discussed here in that way.
Yes: https://community.arm.com/developer/ip-products/processors/b.... I ran into this several years ago when one of my devices was surprisingly slow with code like "x = (x + 1) % limit". It turned out it didn't do integer division in hardware (the mod operator requires a divide) and was using a runtime library. Replacing the code with "x++; if (x == limit) x = 0;" made it much faster.
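The rewrite described above, sketched in C (the function wrappers are mine, for illustration):

```c
/* Advancing a ring-buffer index. The % operator hides an integer division,
 * which is a library call on cores without hardware divide. */
static inline unsigned advance_mod(unsigned x, unsigned limit) {
    return (x + 1) % limit;
}

/* Equivalent compare-and-reset version: no divide needed. */
static inline unsigned advance_branch(unsigned x, unsigned limit) {
    x++;
    if (x == limit)
        x = 0;
    return x;
}
```

Both produce the same values for `x < limit`; the second just trades the division for an increment and a compare.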
Even within the same model, units from different regions can have different performance characteristics, or even functionality. While I was working at Unity Technologies (the game engine company) I saw many horror stories when it comes to Android. Everything looked the same from the software side, including the names and versions of the hardware components, but components made in different countries behaved slightly differently.
It's more of a language/compiler thing than a hardware one. Hardware support for bit shifting exists on every processor I've ever used. The only real question from a performance standpoint is whether the language you are using will automatically generate optimal code for the x*2 and x/2 cases. Most of the time, it's not worth worrying about: you do it in the clearest way possible (i.e. as a math expression, not a bit shift) and only look at forcing the optimization if you have a performance problem.
In theory, yes. In practice, shifting is usually a very simple operation, while general multiplication (and especially division) is not. There have been (and there might still be, on the lowest end) many processors that have shift, but no multiplication or division.
A lot of microcontrollers fall into that category. The Cortex-M0 ARM core doesn't have integer multiply or divide, for instance, and it's almost always absent on 8-bit cores like AVR and PIC.
For integers, x86 has had mul and div all the way back to at least the 8086. If you are looking for a cpu with no mul or div then look at the classic 6502/6510 as known from c64, nes, and many more contemporary machines.
I am really struggling to understand the benchmark output quoted.
Can anyone explain what benchmark = 3/4 ns means, and what the count is? Is the set-up part of the benchmark (the test structure suggests not, but just to make sure)?
The only way I can read it is that 4000 divisions take 4ns and 4000 shift-rights take 3ns, but that only has one digit of precision, which makes it unusable for comparison; even then, it suggests a 25%/33% difference, which is not insignificant.
Also, the VM seems to optimise multiply out, so it must be doing it for a reason.
It's definitely not 4000 divisions per 4ns — that'd imply a terahertz computer. I think it's saying that the amortized cost of 4000 divisions is 4ns per division. Small integer division is an "easy win" for a dedicated HW path, so it doesn't surprise me that it's only a little slower than a shift-right. Variable length right shifts aren't that fast.
It's not small integer division that is being benchmarked, the JIT compiler has reduced it to an addition, a conditional move and a right shift. This sequence is then benchmarked against the right shift.
Everyone is talking about this micro-optimization of the access math, but all I can think about is how the data structure you are building with it is probably memory-bandwidth limited. Powers-of-two storage is almost certain to cause some kind of cache-line aliasing, so maybe you should try something non-obvious like "division by 3" (after doing three key comparisons instead of one) and see if that makes your algorithm much, much faster than messing around with a division by 2; there was even some good analysis of this effect a while back I can reference.
Am I mis-reading this? The article keeps claiming there is no difference but the way I read it the compiler(s) are transforming mul/div to shifts. i.e. it very likely is faster on the hardware but it won't matter for this particular toolchain because of the conversion.
Yes, the point of the article is to convince you to stop micro-optimizing code for imagined performance benefits, and instead code for readability (the compiler will worry about optimizing).
> Am I mis-reading this? The article keeps claiming there is no difference but the way I read it the compiler(s) are transforming mul/div to shifts. i.e. it very likely is faster on the hardware but it won't matter for this particular toolchain because of the conversion.
There is no difference... because of the conversion.
These types of transformations are simple to detect, well known, and applied by ~every compiler.
Unless you have evidence otherwise (ASM differences or benchmarks), there is no use in manually transforming your arithmetic into something more complex but faster. The compiler will do it for you.
I think that's a point which is bordering on religious - many people would debate trusting the compiler, especially over time and more complex situations.
I'd certainly tend to agree with you in general but more for the reason that the compiler can abstract over hardware changes across time. I'd take that benefit over the risk of the optimisation not being applied for most code I write - non-optimisations would be considered bugs and probably/eventually fixed.
I'd strongly disagree the code is more complex (in this case).
It especially seems religious to me because it's saying that somehow "/ 2" is simpler than ">> 1" because it has one less character for the symbol, and because division is a more commonly known operator to most people than bitwise shifting.
It seems to me that they are equally simple: if we assume that programmers dealing with low-level or performance-intensive code know what a bitwise shift is, and we ignore the extra character, then they are literally equivalently complicated expressions, each with one operator and one value applied to it.
Code does not happen in a vacuum. Which is more understandable/simple depends on the domain of the code in question. Usually that's going to be the multiply or the divide.
Right, but my statement was that the domain would be low level or performance intensive code -- do you disagree that in that domain they are equally simple?
This very much depends upon the CPU upon which the code is running.
The 'shift' vs 'multiply' (or add to self) for doubling came about because in the past, it was very common that shifts were faster than multiplies (if your CPU even had a multiply instruction) and often faster than adds as well.
Example, from the 8086 (yes, very long time ago, but this is the environment where the differences often massively mattered):
Add reg->reg: 3 clock cycles
Mul: 70-133 depending on 8 vs. 16 bit size
Shift: Reg with shift of 1 (which is a *2): 2 clock cycles.
Now, for divide by 2 the issue is even larger (as you can't 'subtract from itself' to achieve divide by 2):
Idiv: 101-184 clocks, depending on 8 vs. 16 bit size
Shift: 2 clock cycles.
So, on the 8086, for times 2, a shift was 33% faster than an add to self (and so much faster than a Mul that no one should use Mul for times 2).
And for divide by 2, a shift was massively faster than an Idiv (2 cycles vs minimum of 101 cycles).
Now, these relative values change as one moves up the x86 CPU line to newer CPUs. Intel built faster adders, faster multipliers, and faster dividers, so one really has to check the specific CPU to see which instruction is faster. But one thing remains fairly constant: presuming that a shift is generally close to the 'fastest' method for power-of-two multiplies or divides is a ball-park estimate that is more often right than wrong.
In the article, there's a whole path from source to JVM bytecode to ARM assembly. With auxiliary goals like bytecode size. This makes it interesting to include add-double because it has a simple bytecode encoding ('DUP ADD' compared to 'PUSH 2 MUL') and it is interesting to see what the several compilers do with it along the way in terms of optimization.
Interesting info on 8086. Another approach that doesn't apply to OP's article, but does to x86 assembly is (ab)using LEA for small multiplications. At 2 clock cycles it looks competitive with shift for doubling, but can also be used for multiples like 3 and 5.
SHL/SAL/SHR/SAR (the shifts) affect the condition codes, so one could check for overflow (or use it as part of multibyte arithmetic), while LEA does not affect the condition codes (so overflow goes undetected). It's something else to keep in mind.
Isn't that the wrong direction for the optimization? I would assume you would want to compile adding two numbers into shifting by one, not the other way around.
(I know nothing about hardware, it just intuitively seems like moving a bunch of bits over by 1 should be faster than dealing with xor and carries)
In hardware terms, adders are simpler than shifters. You can usually do both in a single cycle, but it's going to be lower power to do the add instead of the shift.
To put this in more concrete terms: an N-bit adder involves N 1-bit stages to add each bit, and then a 1-bit carry network on top of that, which has N stages in it. So overall, it's O(N) in terms of hardware. An N-bit shift unit is going to use lg N N-bit muxes--or O(N lg N) in terms of hardware. Total gate delay in both cases is O(lg N), but adders have O(N) hardware (and thus energy consumption) while shifters have O(N lg N).
A secondary consequence of being larger area is that a superscalar architecture may choose to have one execution unit that has an adder and a shifter and a second that only has the adder. So an addition may schedule better than a shift, since there are more things it can execute on.
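That mux-per-stage structure can be modeled in software, here for 32-bit left shifts (purely illustrative; a real barrel shifter is combinational hardware doing all stages in one pass):

```c
#include <stdint.h>

/* Model of a mux-based barrel shifter: lg 32 = 5 stages, each an N-bit
 * 2:1 mux that either passes the value through or shifts it by a power
 * of two, selected by one bit of the shift amount. */
static uint32_t barrel_shl(uint32_t x, unsigned amount) {
    for (unsigned stage = 0; stage < 5; stage++)
        if (amount & (1u << stage))
            x <<= (1u << stage);      /* stage k shifts by 2^k */
    return x;
}
```

Each stage is an N-bit-wide mux, and there are lg N stages, which is where the O(N lg N) hardware count above comes from.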
> To put this in more concrete terms: an N-bit adder involves N 1-bit stages to add each bit, and then a 1-bit carry network on top of that, which has N stages in it. So overall, it's O(N) in terms of hardware.
O(N) adders cannot meet the latency demands of modern high-frequency CPUs. The actual complexity of adders in real CPUs is usually O(N²).
There's some hardware that, surprisingly, doesn't have a real shift but instead has rotate operations. These take the bits that get dropped off and put them back on the other side; in those cases, an addition can be a much better choice than doing a bitmask and then a rotate. This kind of hardware is usually found in embedded devices that also have expensive multiplication instructions, so unrolling to a small number of fixed additions can actually perform better sometimes.
> it just intuitively seems like moving a bunch of bits over by 1 should be faster than dealing with xor and carries
Yes, a fixed shift-by-one unit would be much simpler than an adder. But many (most?) CPUs that support shifting have generic shift units, where the number of bits to shift varies, and that makes them much more complex.
Realistically when you get into real world questions about performance the only way to be sure is to measure it.
In this case I imagine you're right. Although it's also worth pointing out that, because modern CPUs are basically frontends that generate uOps, the CPU could actually perform the optimization by itself anyway. Time to break out PAPI (a very cool tool for anyone unaware: you can get instruction-level profiling in your program with basically 4 function calls and a header file).
He did this in Java? In one case, running on an emulator? That's removed too far from the hardware for this kind of benchmarking. Try in a hard-compiled language.
Using shifts for constant divide has been a compiler code generator optimization for decades. This is not something programmers have needed to worry about in source code for a long time, unless targeting some small microcontroller that lacks fast divide hardware.
Java is a hard-compiled language on Android since version 5.0.
ART, which replaced Dalvik on 5.0 (available as experimental on 4.4), was AOT only up to version 7.0.
As it was proven that Android users lack the patience of a C++ developer when updating their apps, Google adopted another approach with version 7.0.
A multi-tier compiler infrastructure, composed of a very fast interpreter hand-written in Assembly for fast startup; a JIT compiler for the first optimization level, which gathers PGO data; and an AOT compiler that runs in the background when the device is idle, takes that PGO data, and, just like a C++ compiler with PGO data, outputs a clean AOT-compiled binary for the usual user workflow.
In case of an update or changes in the workflow that trigger the execution of code that wasn't AOT compiled, the process restarts.
As a means to reduce this kind of de-optimization, since Android 10 those PGO files are uploaded to the Play Store, and when a user installs an application that already has PGO data available, it is downloaded alongside the APK so the AOT compiler can do its job right from the start.
In any case, he used dex2oat, which is the AOT compiler on Android.
Microsoft has gone through a similar process with .NET for UWP, with the main difference that the AOT compiler lives on the Microsoft Store and what gets downloaded is already straight binary code.
Apparently mixing language capabilities with toolchains keeps being an issue.
How come the division is almost the same as the shifting? Is the CPU pipelining the operations between iterations of the loop or something? There is a direct data-dependency in those operations but not between iterations so perhaps that's it?
AFAIK there's no fused add-shift op that could be used.
Given the number of execution loops, the profiler applied an optimization. Don't expect this optimization during initial executions, in seldom-used code, or in cases where properties of the method do not allow it to be optimized, be it on Android VMs or JVMs.
Apart from the fact that compilers are really good and generally will choose the best option for you, it seems like it boils down to what processor is used. On Android aren't ARM processors the most common?
No. An arithmetic shift right does a division that rounds down; an integer division operation instead truncates (rounds to zero). The easiest example is -1 / 2: -1 / 2 is 0, but -1 ashr 1 is -1.
To replace a signed division with ashr, you have to know that for all negative inputs, the value of the bits shifted out are all 0.
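The rounding difference, and the fixup compilers typically emit to preserve `/` semantics, sketched in C (this assumes 32-bit `int` and the common arithmetic `>>` on signed values, which is implementation-defined in C):

```c
/* Plain arithmetic shift: rounds toward negative infinity. */
static inline int ashr1(int x) { return x >> 1; }

/* What compilers typically emit for signed x / 2: add the sign bit as a
 * rounding correction before shifting, so the result truncates toward
 * zero the way / requires. */
static inline int sdiv2(int x) {
    return (x + (int)((unsigned)x >> 31)) >> 1;
}
```

For -5, `ashr1` gives -3 while `sdiv2` gives -2, matching `-5 / 2`; for non-negative inputs the correction term is zero and the two agree.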
Ah, I haven't kept up. Anyway, a right bit shift should absolutely be quicker than floating-point or even integer division. If it isn't, that is an implementation problem.
ART, as of Android 10, combines a hand-written interpreter in Assembly, a JIT compiler that also generates PGO data, and an AOT PGO-based optimizing compiler capable of bounds-check elision, de-virtualization, auto-vectorization, escape analysis, and a couple of other traditional optimizations.
The PGO metadata files also get shared across devices via the Play Store as a means to steer the AOT compiler to the optimal level of optimization across all users of the application.
I assume that at the current level of ongoing ART optimizations, the team would consider that a compiler bug.
Which pretty much means Assembly, given that Java and Kotlin go through JVM and DEX bytecodes to machine code, and C and C++ on Android go through LLVM bitcode to machine code.
Yeah, but you should also understand what will yield real results: writing >>1 instead of /2 manually became useless a long time ago, whereas e.g. contiguous memory is more important than ever. You will not optimize a lot by attempting micro-techniques from 30 years ago.
Another random example: in some edge cases, replacing integer division with a multiplication can still be relevant today (it depends on whether the divisor is a constant, on the compiler, and, if nothing optimizes it away, on the exact processor, since the latest models already have ultra-fast real integer divide instructions), but I suspect that in 15 years (maybe even 10) this will be completely irrelevant, at least for high-perf targets.
Had you read the article, you would have known that the question answered here was whether, and where in the process, strength reduction was actually applied.
1. Looking at the Java bytecode is practically meaningless, you would have to look at the machine code the JIT is creating.
2. A division by 2 is identical to a shift right by 1 only if the integer is unsigned. Java integers are signed. Try this program in C to see for yourself:
int foo(int a) { return a/2; }
int bar(int a) { return a>>1; }
Run gcc -S -O2 to get assembly output in text form.
Basically, the problem is this:
5/2 -> 2 (ok, rounds down)
5>>1 -> 2 (ok, same)
-5/2 -> -2 (ok, rounds down)
-5>>1 -> -3 (oops!)
3. The question is really about the JIT backend for the target platform, which means CPU platform, not OS platform. So "on Android" does not make much sense here, as Android exists for x86 and ARM and those JIT backends might behave differently.
You can't just divide by 2 and expect that to be equivalent to shifting right by 1. That only works if the integer is unsigned. For signed integers, that won't yield the right result. This is trivial to verify.
In defense of the author re: 2, since the application domain seems to be manipulating indices for a binary tree represented in an array, the integer will always be non-negative.
Honestly, it's hard to read this article and not question the author's mental state or intelligence.
He literally presents x86 and ARM assembly dumps where shift right generates one instruction, and divide generates that same instruction plus several others.
Then, he feels the need to run an unnecessary benchmark (most likely screwing it up somehow) and concludes there is no difference!
But how can there possibly be no performance difference, in general, between the CPU running an ALU instruction and running that same instruction plus several other ALU instructions?!?
It's almost unbelievable.
As to how he screwed up the benchmark, my guesses are that either he failed to inline the function (and the CPU is really bad), or failed to prevent the optimizer from optimizing the whole loop, or didn't run enough iterations, or perhaps he ran the benchmark on a different VM than what produced the assembly (or maybe somehow the CPU can extract instruction level parallelism in this microbenchmark, but obviously that doesn't generalize to arbitrary code).
While I think that there is merit to your argument, I feel like it could have been presented without questioning the author's intelligence or mental state.
The "asrs" mnemonics are the same, but the opcodes use different addressing modes. The 4th instruction in the longer sequence is a register-to-register move, which might be a simple register rename. It's not obvious to me that the shorter sequence is necessarily faster than the longer one.
OK, the "adds" is probably slower, and introduces a register dependency. But the whole exercise here is to test and see, not trust that some simple operation performs like how it looks.
Indeed, if you discount the "bx lr" (that's just the return, right?), the 1-instruction version takes 3ns, and the 4-instruction version takes 4ns. Clearly, 4 times as many instructions is not taking anywhere near 4 times as long to execute.