Why does the division take almost the same time as the shift? Is the CPU pipelining the operations across iterations of the loop or something? There is a direct data dependency between the operations within an iteration, but not between iterations, so perhaps that's it?
AFAIK there's no fused add-shift op that could be used.
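To make concrete what I mean by a dependency inside an iteration but not between iterations, here is a minimal sketch. It is not the benchmarked code: the function names, the `(a + b)` operand shape, and the constants are all made up for illustration, and I'm assuming unsigned 32-bit values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical loop shape (an assumption, not the original benchmark).
   Within one iteration the shift needs the result of the add, but no
   iteration needs anything from the previous one. */
void average_pairs_shift(const uint32_t *a, const uint32_t *b,
                         uint32_t *out, size_t n)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; i++)
        out[i] = (a[i] + b[i]) >> 1;
}

/* Same loop with a division in place of the shift. */
void average_pairs_div(const uint32_t *a, const uint32_t *b,
                       uint32_t *out, size_t n)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; i++)
        out[i] = (a[i] + b[i]) / 2u;
}

/* For contrast: a loop-carried dependency. Each iteration must wait
   for the previous result, so iterations cannot overlap. */
uint32_t chained_div(uint32_t seed, size_t n)
{
    uint32_t acc = seed;
    for (size_t i = 0; i < n; i++)
        acc = (acc + 7u) / 2u;
    return acc;
}
```

If overlapping independent iterations is what hides the cost, I'd expect the first two to time about the same and a chained version like the third to expose any real latency difference, but that is a guess I haven't measured.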