Most architectures seem to have an instruction for finding the leftmost set bit in a word [1] and most major C/C++ compilers/libraries have a function for it [2].
I wonder how using that would compare?
You'd find the leftmost bit, invoke the func_true for that, and clear the leftmost bit.
Then loop finding the leftmost bit, invoking the func_false for the bits between the current leftmost bit and the previous leftmost bit, func_true for the leftmost bit, and then clear that bit.
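A rough sketch of that scheme, using GCC/Clang's `__builtin_clzll` as the find-leftmost-bit primitive (func_true/func_false stand in for the article's handlers; this is my guess at the shape, not tested against the article's code):

```c
#include <stdint.h>

/* Sketch: visit the leftmost set bit with func_true, then repeatedly
   find the next leftmost set bit, calling func_false for the zero
   bits in between. Stops after the least-significant 1 bit. */
static void visit(uint64_t word,
                  void (*func_true)(int), void (*func_false)(int))
{
    if (word == 0)
        return;
    int i = 63 - __builtin_clzll(word);   /* leftmost set bit */
    func_true(i);
    word &= ~(1ULL << i);                 /* clear it */
    while (word) {
        int prev = i;
        i = 63 - __builtin_clzll(word);   /* next leftmost set bit */
        for (int j = prev - 1; j > i; j--)
            func_false(j);                /* zero bits in between */
        func_true(i);
        word &= ~(1ULL << i);
    }
}
```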
I wouldn't expect it to be dramatically different, but it trades a test-and-clear of the off bits, plus a conditional branch on the result, for a loop calling func_false a predetermined number of times. One of those options might be a little faster than the other.
Author here. I cannot use BSR, because I have to execute code for all bits, regardless of their value. The range is: from the last bit down to the first bit with value 1. It's a non-obvious iteration schema. :)
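A hypothetical reconstruction of that schema (the handler names func_true/func_false follow the thread; the article's actual code may differ): walk from bit 63 down, calling a handler for every bit, and stop once the shifted mask runs out of 1 bits.

```c
#include <stdint.h>

/* Sketch: call a handler for every bit from the top bit down to the
   least-significant 1 bit, ignoring the trailing zeros below it.
   The shift happens on an unsigned value; only the sign test uses
   the signed conversion. */
static void iterate(uint64_t word,
                    void (*func_true)(int), void (*func_false)(int))
{
    uint64_t mask = word;
    for (int i = 63; mask != 0; i--, mask <<= 1) {
        if ((int64_t)mask < 0)   /* top bit of the shifted mask set? */
            func_true(i);
        else
            func_false(i);
    }
}
```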
Back in the 8-bit days you could roll or shift left and the MSB would move to the carry bit or flag. Then branching was built into instructions like BCS/BCC (branch on carry set or clear).
This is a very artificial sort of task. You have to visit all bits down to the least significant 1 bit, and ignore the rest? In designing the system, since the assignment of bits to meaning is probably arbitrary, it is smarter to assign them in the way that is more convenient and less error-prone to code against, arranging so the 1 bits may be visited least-significant first.
But, OK.
(It is probably important, here, to call more attention to the fact that shifting a signed value left until its sign changes would be Undefined Behavior, making your optimizer likely to do something crazy. That is why the code sample must cast a shifted uint64_t to an int64_t before comparing to zero; that conversion is well-defined. Reader be advised!)
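A tiny illustration of that rule (hypothetical helper name): the left shift stays on the unsigned type, and only the conversion to int64_t, which is defined behavior (unlike shifting a 1 into a signed sign bit), feeds the sign test.

```c
#include <stdint.h>

/* Sketch: test the top bit of an unsigned word via a defined
   uint64_t -> int64_t conversion, never via a signed left shift. */
static int top_bit_set(uint64_t mask)
{
    return (int64_t)mask < 0;
}
```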
In the final, indexing example, the compiler already generates a right shift of the word by 63 places, rather than a sign check, to form the array index. You might as well write that shift in the source code, instead of coding a sign-bit test that the compiler will convert to what you really meant anyway.
The final trick of indexing an array to get the right function to call is a good one, and could be a lot faster if the bit patterns are close to random. Branch predictors are very good at teasing out patterns, but the penalty for guessing wrong is very high. A possibly better way, particularly on clang, would look more like
(mask >> 63 ? func_true : func_false)(i);
Modern chips have a "cmov" instruction that is likely to be better than indexing into an array, with its three-cycle trip out to L1 cache: on a modern chip, that is enough time to execute 18 other instructions!
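For comparison, here are the two dispatch styles side by side as a sketch (function and parameter names are stand-ins, not the article's code); compilers tend to turn the second into a cmov/csel rather than a table load:

```c
#include <stdint.h>

typedef void (*handler)(int);

/* Table-indexed dispatch: loads a function pointer from memory. */
static void dispatch_indexed(uint64_t mask, int i,
                             handler func_false, handler func_true)
{
    handler funcs[2] = { func_false, func_true };
    funcs[mask >> 63](i);
}

/* Conditional-select dispatch: often compiles to cmov (x86) or
   csel (ARM), avoiding the L1 load for the table entry. */
static void dispatch_select(uint64_t mask, int i,
                            handler func_false, handler func_true)
{
    (mask >> 63 ? func_true : func_false)(i);
}
```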
In the loop, there is no point in comparing i to zero, because you are going to break out when the mask hits zero anyway. So, that is just wasted code; deleting it would eliminate the useless "dec %ebp; js 108" instructions. That branch is reliably predicted, so mostly just costs cache footprint, not cycles. But the micro-operations cache is very small, and if the extra instructions make you blow your micro-op cache, that could be pretty expensive.
Finally, and most important: the number of instructions is an extremely poor way to estimate performance. Often, many instructions take less time to run than a few, different instructions. So, to make a smart choice about performance, or to present actually meaningful results in your blog post, you need to set up a micro-benchmark and actually see what really happens, and not just guess by counting lines. There are extremely good micro-benchmark tools available free online. Google's is as good as any. Or you could paste the code into Godbolt.org and benchmark it there.
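Even without a framework, a minimal timing harness gets you further than instruction counting. This is only a sketch (no warm-up, no variance handling, which real tools like Google Benchmark provide); the volatile sink keeps the optimizer from deleting the measured loop:

```c
#include <stdint.h>
#include <time.h>

static volatile uint64_t sink;   /* defeats dead-code elimination */

/* Sketch: time `fn(word)` over `iters` calls, return ns per call. */
static double ns_per_call(uint64_t (*fn)(uint64_t),
                          uint64_t word, int iters)
{
    clock_t t0 = clock();
    uint64_t acc = 0;
    for (int i = 0; i < iters; i++)
        acc += fn(word);
    sink = acc;
    return (double)(clock() - t0) / CLOCKS_PER_SEC * 1e9 / iters;
}
```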
But again, you need to measure (which I didn't do either). Results might be very, very different between x86 and ARM. Of course the results will depend on what happens in func_true and in func_false, too. And, you might find this is altogether the wrong place to spend your optimizing attention.
A quick detour to Godbolt shows all three loop_v4 versions result in exactly identical code, on all the compilers I tried. And, all used the condition code matching the sign bit, directly, so did not do a right shift. All used cmov to select a function to call. (On ARM they used csel, equivalently.)
The "a" versions are without the redundant check for (i >= 0). They are mostly the same, here, except v4a. The func_false and func_true functions are linked in from bf.o.
[1] https://en.wikipedia.org/wiki/Find_first_set#Hardware_suppor...
[2] https://en.wikipedia.org/wiki/Find_first_set#Tool_and_librar...