Thorough work. Going back to the premise, I want to offer an alternative viewpoint: in the case of binary trees implemented as arrays, is “integer division by 2” of the array index necessarily the best interpretation of what you are trying to do here?
Instead, you can also see the array index as a bit string, where every bit tells you which path to go down, left or right. In that case, “shifting right by one bit” moves you up to the parent. “Shifting left” moves you down to the left child. Flipping one bit flips you over to the other child. Bitwise operations indeed seem more natural with that interpretation.
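That interpretation maps directly onto C for the usual 1-based array layout of a complete binary tree (the helper names here are mine, just for illustration):

```c
#include <stdint.h>

/* 1-based array layout: the root is index 1, and each index, read as a
 * bit string after the leading 1, spells the path from the root
 * (0 = left, 1 = right). */
static inline uint32_t parent(uint32_t i)      { return i >> 1; }       /* drop the last path bit */
static inline uint32_t left_child(uint32_t i)  { return i << 1; }       /* append a 0: go left */
static inline uint32_t right_child(uint32_t i) { return (i << 1) | 1; } /* append a 1: go right */
static inline uint32_t sibling(uint32_t i)     { return i ^ 1; }        /* flip the last path bit */
```

So node 6 (binary 110) is reached by going right then left from the root, its parent is 3 (11), and its sibling is 7 (111).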
A lot of “power of 2” multiplication/division has similar interpretations. For example, when walking page tables, you could see walking down the levels as “dividing by the size of the granule”, or simply as “shifting right to select the index on that level”.
No contest on anything where the power of 2 is coincidence, i.e. for non “computery” things where there is no such underlying structure.
I had a slightly similar experience at work. We deal a lot with angles, and binary angles are often the most efficient representation. Many colleagues find it annoying because they insist on converting every binary angle to radians or degrees, but if you actually interpret the bits as successive divisions of a circumference, I find the binary representation way more intuitive than a floating point number.
That method is exactly equivalent to using revolutions as the unit with a fixed-point numeric representation. Maybe your skeptical colleagues would find that perspective more palatable?
That's good advice. Unfortunately, most of the people I work with do not have a computer science background (they are materials scientists and mechanical engineers), so they are not familiar with fixed-point arithmetic either.
> they insist on converting every binary angle to [] degrees
Bit of hypocrisy there, since degrees are literally the same thing, just with 360 sectors rather than (say) 256. Though the extra factors of 3 and 5 are sometimes useful.
The point is to not think of the bits as the representation of a decimal number. For example, if I need to know in which quadrant an angle is, I just need to look at the first two bits, independently of the precision (which we can change just by shifting and truncating). But I do not mind whether 90 degrees equals 64 or 1073741824 "sectors".
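A small C sketch of reading the quadrant straight off the top two bits (the 16-bit width is an arbitrary choice for the example):

```c
#include <stdint.h>

/* 16-bit binary angle: 0x0000..0xFFFF spans one full turn.
 * The top two bits directly give the quadrant, no matter how many
 * low-order precision bits follow. */
static inline unsigned quadrant(uint16_t bangle) {
    return bangle >> 14;               /* keep only the two most significant bits */
}

/* Changing precision is just a shift: truncate a 16-bit angle to 8 bits. */
static inline uint8_t to_8bit(uint16_t bangle) {
    return (uint8_t)(bangle >> 8);
}
```

Here 0x4000 is 90 degrees regardless of width: `quadrant(0x4000)` is 1, and after truncation the same angle is 0x40 out of 256 "sectors".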
I think this is another good example where a binary representation (and a shift) can be a more natural way to approach the problem.
Not just more natural, it also lends itself to better accuracy and/or efficiency in some cases too.
As a quick example, if you're writing a numerical approximation or even just a LUT for sin(x), you can get away with only approximating/storing values for a reduced range by using some bit twiddling. The top bit becomes the sign of the output if you XOR it out first, the next bit lets you take advantage of sin(pi/2-x) = sin(x+pi/2) to halve the input range again, eke out symmetry for the next bit, and you can nearly approximate sin(x) as a linear function for some use cases.
I appreciate the thorough exploration and resulting detail in this article. Still, I find it a little sad that either the state of our toolchains or the perception thereof prevent it from being self-evident that basic strength reduction will always happen, and developers need not worry about the cost of expressing simple arithmetic operations in the clearest way.
I would go further: many programmers don't understand the difference between logical and arithmetic operations and why they exist.
I have had to puzzle FAR too often over code doing adds/subtracts/multiplies/divides instead of and/or/xor/shift.
I blame Java not having an unsigned type. There are apparently some weird tricks you can do with arithmetic in Java that operate on things like an unsigned type without having to go up to the next higher integer width.
Blame it on developers that think they know unsigned math.
> One of the little experiments I tried was asking people about the rules for unsigned arithmetic in C. It turns out nobody understands how unsigned arithmetic in C works. There are a few obvious things that people understand, but many people don't understand it.
Question from a perf noob: it seems like this in principle only shows that there's no difference for a Pixel 3 because other Android machines could have processors that have/lack an optimization for shift or divide. Couldn't a different Android phone have different performance characteristics?
It is not that the processor contains an optimization per se. The thing to understand is that a shift is fundamentally a simpler operation than a multiply: a shift can be implemented with a few transistors per bit and done in a single cycle trivially. A multiply unit takes tons of transistors and often takes multiple cycles (this is a trade-off you make when designing a multiply unit: you can save space by making it work on smaller integers and reusing it over several cycles to multiply larger integers, just like you iteratively multiply one digit at a time when doing it on paper by hand). Even on processors that have single-cycle multipliers, a multiply takes a lot more power than a shift because of all the extra hardware you need to engage.
Since shifts are fundamentally simpler than multiplies, it always makes sense to do this transform. This is one of a number of transforms generally called "strength reductions" <https://en.wikipedia.org/wiki/Strength_reduction>: converting a more general, expensive operation into a more constrained, cheaper one. In this case it is equivalent to knowing that if you want to multiply a number by 10, you can just append a 0 instead of doing all the multiplication work by hand.
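The binary analogue of "append a 0" is a left shift, and a strength-reducing compiler extends the idea to other constants. Here is a hand-written sketch of what a compiler might emit for multiplication by 10 (write `x * 10` in real code and let the compiler do this):

```c
#include <stdint.h>

/* 10 = 8 + 2, so x * 10 == (x << 3) + (x << 1): two shifts and an add
 * instead of a general multiply. Illustrative only; compilers apply this
 * transform automatically. */
static inline uint32_t times10(uint32_t x) {
    return (x << 3) + (x << 1);
}
```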
The only reason not to do this transform would be if you had a CPU that literally does not have a shift operation, but I cannot think of any such part. Even if you did have such a part, the odds are you could emulate a shift using other instructions and still outperform the multiply.
This has been a standard optimization for half a century. The original C compiler for the PDP-11 did these transforms even when you turned off optimizations <http://c-faq.com/misc/shifts.html>.
Barrel shifters use n log n transistors, which isn't quite as bad as a multiplier but is still very significant. The Pentium 4 was notable for excluding it making shifts very slow.
> The thing to understand is that shift is fundamentally a simpler operation than multiply
... well... if you have a hardware multiply that is guaranteed to take one clock cycle and a shift that will take the same, does the fundamental complexity of how that happens even matter?
> This has been a standard optimization for half a century. The original C compiler for the PDP-11 did these transforms even when you turned off optimizations
Consider this: a common, easily applied optimization that compilers have been doing for half a century may have made its way into modern CPUs.
Transistors aren't nearly as power hungry as you paint them, and CPUs aren't nearly as bad at optimization. There is no reason to switch a multiply or divide for a shift. The ONLY reason to make that switch is if you are dealing with the simplest of processors (such as a microwave's processor). If you are using anything developed in the last 10 years that consumes more than 1W of power, chances are really high that you aren't saving any power by using shifts instead of multiplies. It is the sort of micro-optimization that fundamentally misunderstands how modern CPUs actually work and overestimates how much power or space transistors actually need.
If the ALU contained an early out or fast path for simpler multiplies, the latency would read 1-3. You can verify this by looking at div, which does early out and has a latency of 35-88.
Any compiler that doesn't swap a multiply to a shift when it can is negligent.
Valid points, but in this case (and many others where you encounter power-of-2 mult/div), I'd consider that a shift might actually semantically be the more natural operation in the first place, instead of an "optimization" of the mult/div operation. (With their equivalence being obvious to any reader, it might not matter.)
I was not arguing the people writing code should perform strength reductions manually, I was explaining what they were and then stating that even ancient compilers do them automatically. While I did not explicitly state it, the logical follow on is that programmers should almost never explicitly strength reduce in their code, they should write the semantically clear version and let the compiler handle it for them.
You are correct, that on modern CPUs there are often specifically recognized idioms where the processor can implicitly perform an instruction transform such as a strength reduction from a multiply to a shift.
Having said that, it still makes sense for a compiler to perform strength reductions rather than depending on the CPU frontend, at least if your compiler has a relatively decent scheduling model for the CPU. I don't know of any modern production-quality compiler that would omit a simple strength reduction like this and leave it to the CPU.
Given that performance is so close to the same, my only question now is whether there's a measurable difference in power usage between shift and multiply.
Most processors will decode an instruction into micro-ops and guess what that decode phase can do? It can say "Is this a power of 2? Great, engage the shifting operator".
Only on super simple processors (think embedded systems) would this ever actually make a difference. Anything else, this is an optimization that your processor is going to do for you automatically.
Presumably the GP was talking about multiplying by an immediate value embedded in the instruction. That would be possible, but I doubt there are any current processors that do so.
You can do it in decode if the relevant operand is encoded as an immediate. I'm not aware of any processors that do any of the strength reductions discussed here in that way.
Yes: https://community.arm.com/developer/ip-products/processors/b.... I ran into this several years ago when one of my devices was surprisingly slow with code like "x = (x + 1) % limit". It turned out it didn't do integer division in hardware (the mod operator requires a divide) and was using a runtime library. Replacing the code with "x++; if (x == limit) x = 0;" made it much faster.
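The rewrite described above, sketched in C (the function wrappers are mine, for illustration):

```c
/* Advancing a ring-buffer index. The % operator hides an integer division,
 * which is a library call on cores without hardware divide. */
static inline unsigned advance_mod(unsigned x, unsigned limit) {
    return (x + 1) % limit;
}

/* Equivalent compare-and-reset version: no divide needed. */
static inline unsigned advance_branch(unsigned x, unsigned limit) {
    x++;
    if (x == limit)
        x = 0;
    return x;
}
```

Both produce the same values for `x < limit`; the second just trades the division for an increment and a compare.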
Even within the same model, units from different regions can have different performance characteristics, or even functionality. While I was working at Unity Technologies (the game engine company) I saw many horror stories when it comes to Android. Everything looked the same from the software side, including the names and versions of the hardware components, but components made in different countries behaved slightly differently.
It's more of a language/compiler thing than a hardware one. Hardware support for bit shifting exists on every processor I've ever used. The only real question from a performance standpoint is whether the language you are using will automatically generate optimal code for the x*2 and x/2 cases. Most of the time, it's not worth worrying about: you do it in the clearest way possible (i.e. as a math expression, not a bit shift) and only look at forcing the optimization if you have a performance problem.
In theory, yes. In practice, shifting is usually a very simple operation, while general multiplication (and especially division) is not. There have been (and there might still be, on the lowest end) many processors that have shift, but no multiplication or division.
A lot of microcontrollers fall into that category. The Cortex-M0 ARM core doesn't have integer multiply or divide, for instance, and it's almost always absent on 8-bit cores like AVR and PIC.
For integers, x86 has had mul and div all the way back to at least the 8086. If you are looking for a cpu with no mul or div then look at the classic 6502/6510 as known from c64, nes, and many more contemporary machines.
I am really struggling to understand the benchmark output quoted.
Can anyone explain what benchmark = 3/4 ns means, and what the count is? Is the set-up part of the benchmark (the test structure suggests not, but just to make sure)?
The only way I can read it is that 4000 divisions take 4ns and 4000 shift-rights take 3ns, but that only has one digit of precision, which makes it unusable for comparison; even then, it suggests a 25%/33% difference, which is not insignificant.
Also, the VM seems to optimise multiply out, so it must be doing it for a reason.
It's definitely not 4000 divisions per 4ns — that'd imply a terahertz computer. I think it's saying that the amortized cost of 4000 divisions is 4ns per division. Small integer division is an "easy win" for a dedicated HW path, so it doesn't surprise me that it's only a little slower than a shift-right. Variable length right shifts aren't that fast.
It's not small integer division that is being benchmarked, the JIT compiler has reduced it to an addition, a conditional move and a right shift. This sequence is then benchmarked against the right shift.
Everyone is talking about this micro-optimization of the access math, but all I can think about is how the data structure you are building with it is probably memory-bandwidth limited. Powers-of-two storage is almost certain to cause some kind of cache-line aliasing, so maybe you should try something non-obvious like "division by 3" (after doing three key comparisons instead of one) and see if that makes your algorithm much, much faster than messing around with a division by 2; there was even some good analysis of this effect a while back I can reference.
Am I mis-reading this? The article keeps claiming there is no difference but the way I read it the compiler(s) are transforming mul/div to shifts. i.e. it very likely is faster on the hardware but it won't matter for this particular toolchain because of the conversion.
Yes, the point of the article is to convince you to stop micro-optimizing code for imagined performance benefits, and instead code for readability (the compiler will worry about optimizing).
> Am I mis-reading this? The article keeps claiming there is no difference but the way I read it the compiler(s) are transforming mul/div to shifts. i.e. it very likely is faster on the hardware but it won't matter for this particular toolchain because of the conversion.
There is no difference... because of the conversion.
These types of transformations are simple to detect, well known, and applied by ~every compiler.
Unless you have evidence otherwise (ASM differences or benchmarks), there is no use in manually transforming your arithmetic into something more complex but faster. The compiler will do it for you.
I think that's a point which is bordering on religious - many people would debate trusting the compiler, especially over time and more complex situations.
I'd certainly tend to agree with you in general but more for the reason that the compiler can abstract over hardware changes across time. I'd take that benefit over the risk of the optimisation not being applied for most code I write - non-optimisations would be considered bugs and probably/eventually fixed.
I'd strongly disagree the code is more complex (in this case).
It especially seems religious to me because it's saying that somehow "/ 2" is simpler than ">> 1" because it has one less character for the symbol, and because division is a more commonly known operator to most people than bitwise shifting.
It seems to me that they are equally simple: if we assume that programmers dealing with low-level or performance-intensive code know what a bitwise shift is, and we ignore the extra character, then they are literally equivalently complicated expressions, each with one operator and one value applied to it.
Code does not happen in a vacuum. Which is more understandable/simple depends on the domain of the code in question. Usually that's going to be the multiply or the divide.
Right, but my statement was that the domain would be low level or performance intensive code -- do you disagree that in that domain they are equally simple?
This very much depends upon the CPU upon which the code is running.
The 'shift' vs 'multiply' (or add to self) for doubling came about because in the past, it was very common that shifts were faster than multiplies (if your CPU even had a multiply instruction) and often faster than adds as well.
Example, from the 8086 (yes, very long time ago, but this is the environment where the differences often massively mattered):
Add reg->reg: 3 clock cycles
Mul: 70-133 depending on 8 vs. 16 bit size
Shift: Reg with shift of 1 (which is a *2): 2 clock cycles.
Now, for divide by 2 the issue is even larger (as you can't 'subtract from itself' to achieve divide by 2):
Idiv: 101-184 clocks, depending on 8 vs. 16 bit size
Shift: 2 clock cycles.
So, on the 8086, for times 2, a shift was 33% faster than an add to self (and so much faster than a Mul that no one should use Mul for times 2).
And for divide by 2, a shift was massively faster than an Idiv (2 cycles vs minimum of 101 cycles).
Now, these relative values change as one moves up the x86 CPU line to newer CPUs. Intel built faster adders, faster multipliers, and faster dividers, so one really has to check the specific CPU to see which instruction is faster. But one thing remains fairly constant: presuming that a shift is generally close to the 'fastest' method for power-of-two multiplies or divides is a ball-park estimate that is more often right than wrong.
In the article, there's a whole path from source to JVM bytecode to ARM assembly. With auxiliary goals like bytecode size. This makes it interesting to include add-double because it has a simple bytecode encoding ('DUP ADD' compared to 'PUSH 2 MUL') and it is interesting to see what the several compilers do with it along the way in terms of optimization.
Interesting info on 8086. Another approach that doesn't apply to OP's article, but does to x86 assembly is (ab)using LEA for small multiplications. At 2 clock cycles it looks competitive with shift for doubling, but can also be used for multiples like 3 and 5.
SHL/SAL/SHR/SAR (the shifts) affect the condition codes, so one could check for overflow (or use it as part of multibyte arithmetic), while LEA does not affect the condition codes (so overflow goes undetected). It's something else to keep in mind.
Isn't that the wrong direction for the optimization? I would assume you would want to compile adding two numbers into shifting by one, not the other way around.
(I know nothing about hardware, it just intuitively seems like moving a bunch of bits over by 1 should be faster than dealing with xor and carries)
In hardware terms, adders are simpler than shifters. You can usually do both in a single cycle, but it's going to be lower power to do the add instead of the shift.
To put this in more concrete terms: an N-bit adder involves N 1-bit stages to add each bit, and then a 1-bit carry network on top of that, which has N stages in it. So overall, it's O(N) in terms of hardware. An N-bit shift unit is going to use lg N N-bit muxes--or O(N lg N) in terms of hardware. Total gate delay in both cases is O(lg N), but adders have O(N) hardware (and thus energy consumption) while shifters have O(N lg N).
A secondary consequence of being larger area is that a superscalar architecture may choose to have one execution unit that has an adder and a shifter and a second that only has the adder. So an addition may schedule better than a shift, since there are more things it can execute on.
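That mux-per-stage structure can be modeled in software, here for 32-bit left shifts (purely illustrative; a real barrel shifter is combinational hardware doing all stages in one pass):

```c
#include <stdint.h>

/* Model of a mux-based barrel shifter: lg 32 = 5 stages, each an N-bit
 * 2:1 mux that either passes the value through or shifts it by a power
 * of two, selected by one bit of the shift amount. */
static uint32_t barrel_shl(uint32_t x, unsigned amount) {
    for (unsigned stage = 0; stage < 5; stage++)
        if (amount & (1u << stage))
            x <<= (1u << stage);      /* stage k shifts by 2^k */
    return x;
}
```

Each stage is an N-bit-wide mux, and there are lg N stages, which is where the O(N lg N) hardware count above comes from.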
> To put this in more concrete terms: an N-bit adder involves N 1-bit stages to add each bit, and then a 1-bit carry network on top of that, which has N stages in it. So overall, it's O(N) in terms of hardware.
O(N) adders cannot meet the latency demands of modern high-frequency CPUs. The actual complexity of adders in real CPUs is usually O(N²).
There's some hardware that, surprisingly, doesn't have a real shift but instead has rotate operations. These take the bits that get dropped off and put them back on the other side; in those cases, an addition can be a much better choice than doing a bitmask and then a rotate. This kind of hardware is usually found in embedded devices that also have expensive multiplication instructions, so unrolling to a small number of fixed additions can actually perform better sometimes.
> it just intuitively seems like moving a bunch of bits over by 1 should be faster than dealing with xor and carries
Yes, a fixed shift-by-one unit would be much simpler than an adder. But many (most?) CPUs that support shifting have generic shift units, where the number of bits to shift varies, and that makes them much more complex.
Realistically when you get into real world questions about performance the only way to be sure is to measure it.
In this case I imagine you're right. Although it's also worth pointing out that, because modern CPUs are basically frontends that generate uOps, the CPU could actually perform the optimization by itself anyway. Time to break out PAPI (a very cool tool for anyone unaware: you can get instruction-level profiling in your program with basically 4 function calls and a header file).
He did this in Java? In one case, running on an emulator? That's removed too far from the hardware for this kind of benchmarking. Try in a hard-compiled language.
Using shifts for constant divide has been a compiler code generator optimization for decades. This is not something programmers have needed to worry about in source code for a long time, unless targeting some small microcontroller that lacks fast divide hardware.
Java is a hard-compiled language on Android since version 5.0.
ART, which replaced Dalvik on 5.0 (available as experimental on 4.4), was AOT only up to version 7.0.
As it was proven that Android users lack the patience of a C++ developer when updating their apps, Google adopted another approach with version 7.0.
A multi-tier compiler infrastructure, composed of a very fast interpreter hand-written in Assembly for fast startup; a JIT compiler for the first optimization level, which gathers PGO data; and an AOT compiler that runs in the background when the device is idle, takes that PGO data, and, just like a C++ compiler with PGO data, outputs a clean AOT-compiled binary for the usual user workflow.
In case of an update or changes in the workflow that trigger the execution of code that wasn't AOT compiled, the process restarts.
As a means to reduce this kind of de-optimization, since Android 10 those PGO files are uploaded to the Play Store, and when a user installs an application that already has PGO data available, it is downloaded alongside the APK so the AOT compiler can do its job right from the start.
In any case, he used dex2oat, which is the AOT compiler on Android.
Microsoft has gone through a similar process with .NET for UWP, with the main difference that the AOT compiler lives on the Microsoft Store and what gets downloaded is already straight binary code.
Apparently mixing language capabilities with toolchains keeps being an issue.
How come the division is almost the same as the shifting? Is the CPU pipelining the operations between iterations of the loop or something? There is a direct data-dependency in those operations but not between iterations so perhaps that's it?
AFAIK there's no fused add-shift op that could be used.
Given the number of execution loops, the profiler applied an optimization. Don't expect this optimization during initial executions, in seldom-used code, or in cases where properties of the method do not allow it to be optimized, be it on Android VMs or JVMs.
Apart from the fact that compilers are really good and generally will choose the best option for you, it seems like it boils down to what processor is used. On Android aren't ARM processors the most common?
No. An arithmetic shift right does a division that rounds down; an integer division operation instead truncates (rounds to zero). The easiest example is -1 / 2: -1 / 2 is 0, but -1 ashr 1 is -1.
To replace a signed division with ashr, you have to know that for all negative inputs, the value of the bits shifted out are all 0.
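The rounding difference, and the fixup compilers typically emit to preserve `/` semantics, sketched in C (this assumes 32-bit `int` and the common arithmetic `>>` on signed values, which is implementation-defined in C):

```c
/* Plain arithmetic shift: rounds toward negative infinity. */
static inline int ashr1(int x) { return x >> 1; }

/* What compilers typically emit for signed x / 2: add the sign bit as a
 * rounding correction before shifting, so the result truncates toward
 * zero the way / requires. */
static inline int sdiv2(int x) {
    return (x + (int)((unsigned)x >> 31)) >> 1;
}
```

For -5, `ashr1` gives -3 while `sdiv2` gives -2, matching `-5 / 2`; for non-negative inputs the correction term is zero and the two agree.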
Ah, I haven't kept up. Anyway, a right bit shift should absolutely be quicker than floating-point or even integer division. If it isn't, that is an implementation problem.
ART, as of Android 10, combines a hand-written interpreter in Assembly, a JIT compiler that also generates PGO data, and an AOT PGO-based optimizing compiler capable of bounds-check elision, de-virtualization, auto-vectorization, escape analysis, and a couple of other traditional optimizations.
The PGO metadata files also get shared across devices via the Play Store as a means to steer the AOT compiler to the optimal level of optimization across all users of the application.
I assume that at the current level of ongoing ART optimizations, the team would consider that a compiler bug.
Which pretty much means Assembly, given that Java and Kotlin go through JVM and DEX bytecodes to machine code, and C and C++ on Android go through LLVM bitcode to machine code.
Yeah, but you should also understand what will yield real results: writing >>1 instead of /2 manually became useless a long time ago, whereas e.g. contiguous memory is more important than ever. You will not optimize a lot by attempting micro-techniques from 30 years ago.
Another random example: in some edge cases, replacing integer division with a multiplication can still be relevant today (it depends on whether the divisor is a constant, on the compiler, and, if nothing optimizes it away, on the exact processor, since the latest models already have ultra-fast real integer divide instructions), but I suspect that in 15 years (maybe even 10) this will be completely irrelevant, at least for high-perf targets.
Had you read the article, you would have known that the question answered here was whether, and where in the process, strength reduction was actually applied.
1. Looking at the Java bytecode is practically meaningless, you would have to look at the machine code the JIT is creating.
2. A division by 2 is identical to a shift right by 1 only if the integer is unsigned. Java integers are signed. Try this program in C to see for yourself:
int foo(int a) { return a/2; }
int bar(int a) { return a>>1; }
Run gcc -S -O2 to get assembly output in text form.
Basically, the problem is this:
5/2 -> 2 (ok, rounds down)
5>>1 -> 2 (ok, same)
-5/2 -> -2 (ok, rounds down)
-5>>1 -> -3 (oops!)
3. The question is really about the JIT backend for the target platform, which means CPU platform, not OS platform. So "on Android" does not make much sense here, as Android exists for x86 and ARM and those JIT backends might behave differently.
You can't just divide by 2 and expect that to be equivalent to shifting right by 1. That only works if the integer is unsigned. For signed integers, that won't yield the right result. This is trivial to verify.
In defense of the author re: 2, since the application domain seems to be manipulating indices for a binary tree represented in an array, the integer will always be non-negative.
Honestly, it's hard to read this article and not question the author's mental state or intelligence.
He literally presents x86 and ARM assembly dumps where shift right generates one instruction, and divide generates that same instruction plus several others.
Then, he feels the need to run an unnecessary benchmark (most likely screwing it up somehow) and concludes there is no difference!
But how can there possibly be no performance difference, in general, between the CPU running an ALU instruction and running that same instruction plus several other ALU instructions?!?
It's almost unbelievable.
As to how he screwed up the benchmark, my guesses are that either he failed to inline the function (and the CPU is really bad), or failed to prevent the optimizer from optimizing the whole loop, or didn't run enough iterations, or perhaps he ran the benchmark on a different VM than what produced the assembly (or maybe somehow the CPU can extract instruction level parallelism in this microbenchmark, but obviously that doesn't generalize to arbitrary code).
While I think that there is merit to your argument, I feel like it could have been presented without questioning the author's intelligence or mental state.
The "asrs" mnemonics are the same, but the opcodes use different addressing modes. The 4th instruction in the longer sequence is a register-to-register move, which might be a simple register rename. It's not obvious to me that the shorter sequence is necessarily faster than the longer one.
OK, the "adds" is probably slower, and introduces a register dependency. But the whole exercise here is to test and see, not trust that some simple operation performs like how it looks.
Indeed, if you discount the "bx lr" (that's just the return, right?), the 1-instruction version takes 3ns, and the 4-instruction version takes 4ns. Clearly, 4 times as many instructions is not taking anywhere near 4 times as long to execute.