Bit Test and Reset vs. Compilers (0x80.pl)
36 points by g0xA52A2A on Dec 23, 2021 | hide | past | favorite | 13 comments


Most architectures seem to have an instruction for finding the leftmost set bit in a word [1] and most major C/C++ compilers/libraries have a function for it [2].

I wonder how using that would compare?

You'd find the leftmost set bit, invoke func_true for it, and clear that bit.

Then loop: find the new leftmost bit, invoke func_false for each bit between it and the previous leftmost bit, invoke func_true for the new leftmost bit, and clear it.

I wouldn't expect it to be dramatically different, but it trades a test-and-clear on each zero bit plus a conditional branch on the result for a loop that calls func_false a predetermined number of times. One of those options might be a little faster than the other.

[1] https://en.wikipedia.org/wiki/Find_first_set#Hardware_suppor...

[2] https://en.wikipedia.org/wiki/Find_first_set#Tool_and_librar...


Author here. I cannot use BSR, because I have to execute code for all bits, regardless of their value. The range runs from the last (most significant) bit down to the first bit with value 1. It's a non-obvious iteration scheme. :)


I took that into account, although it may not have been clear because I tried to describe it in words rather than code.

What I was suggesting was something like this (in pseudocode):

  int previous_leftmost = 64;
  while (mask != 0) {
    int leftmost = find_leftmost(mask);
    mask ^= 1ull << leftmost;  // 1ull, not 1u: the mask is 64-bit
    while (--previous_leftmost > leftmost)
      func_false(previous_leftmost);
    func_true(leftmost);
  }
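For concreteness, find_leftmost can be sketched with a compiler builtin (assuming GCC/Clang; __builtin_clzll is undefined for a zero argument, but the enclosing `while (mask != 0)` guard already rules that out):

```cpp
#include <cstdint>

// Index (0..63) of the highest set bit. Caller guarantees mask != 0,
// since __builtin_clzll(0) is undefined.
static int find_leftmost(uint64_t mask) {
    return 63 - __builtin_clzll(mask);
}
```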


Thanks, that's a really elegant approach.


Back in the 8-bit days you could roll or shift left and the MSB would move to the carry bit or flag. Then branching was built into instructions like BCS/BCC (branch on carry set or clear).
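In C terms, that idiom amounts to peeling one bit off the top per shift (a sketch; a true rotate-through-carry like RCL would also feed the old carry back into bit 0, which this omits):

```cpp
#include <cstdint>

// Shift left once, returning the bit that "falls off" the top --
// the bit an 8-bit CPU would leave in the carry flag for BCS/BCC.
static bool shift_out_msb(uint8_t& v) {
    bool carry = (v & 0x80) != 0;
    v = (uint8_t)(v << 1);
    return carry;
}
```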


Back in the 8-bit days? Isn't that what RCL still does on modern processors?


It probably does for all I know, I just don't do a lot of assembly anymore.


Ummm, no. I ran the routine through clang 13 with '-O' and didn't get an explicit compare.


Why not use x86's BSF or BSR instructions? They've been around since 386.


Those give undefined results if the operand is zero, so maybe having to do that check defeats the point?


The destination register is undefined, but it does set the zero flag. That's why the intrinsics return a value to be used for that condition.
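A sketch of that pattern in portable code (hypothetical helper name, using the GCC/Clang builtin rather than raw BSR): report "input was zero" separately, the way the hardware reports it via ZF, instead of consuming the undefined destination.

```cpp
#include <cstdint>

// Returns false for a zero input (where BSR's destination would be
// undefined); otherwise writes the index of the highest set bit.
static bool bsr64(uint64_t x, unsigned* index) {
    if (x == 0)
        return false;
    *index = 63u - (unsigned)__builtin_clzll(x);
    return true;
}
```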


This is a very artificial sort of task. You have to visit all bits down to the least significant 1 bit, and ignore the rest? In designing the system, since the assignment of bits to meaning is probably arbitrary, it is smarter to assign them in the way that is more convenient and less error-prone to code against, arranging so the 1 bits may be visited least-significant first.

But, OK.

(It is probably important, here, to call more attention to the fact that shifting a signed value left until its sign changes would be Undefined Behavior, making your optimizer likely to do something crazy. That is why the code sample must cast a shifted uint64_t to an int64_t before comparing to zero; that conversion is well-defined. Reader be advised!)
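A minimal illustration of the safe shape: do all the shifting on the unsigned type (where bits falling off the top are well-defined), and convert to signed only for the sign test.

```cpp
#include <cstdint>

// int64_t s = ...; s <<= 1;  // UB once the value overflows the sign bit!
// Instead, keep the value as uint64_t while shifting, and convert only
// to compare against zero -- that conversion is the well-defined part.
static bool top_bit_set(uint64_t mask) {
    return (int64_t)mask < 0;
}
```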

In the final, indexing, example, the compiler generates a shift of the word right by 63 places, instead of a sign check, to form the array index. You might as well do that in the source code, instead of coding a check of the sign bit that the compiler will convert to what you really meant anyway.

The final trick of indexing an array to get the right function to call is a good one, and could be a lot faster if the bit patterns are close to random. Branch predictors are very good at teasing out patterns, but the penalty for guessing wrong is very high. A possibly better way, particularly on clang, would look more like

  (mask >> 63 ? func_true : func_false)(i);
Modern chips have a "cmov" instruction that is likely to be better than indexing into an array, with its three-cycle trip out to L1 cache: on a modern chip, that is enough time to execute 18 other instructions!

In the loop, there is no point in comparing i to zero, because you are going to break out when the mask hits zero anyway. So, that is just wasted code; deleting it would eliminate the useless "dec %ebp; js 108" instructions. That branch is reliably predicted, so mostly just costs cache footprint, not cycles. But the micro-operations cache is very small, and if the extra instructions make you blow your micro-op cache, that could be pretty expensive.

Finally, and most important: the number of instructions is an extremely poor way to estimate performance. Often, many instructions take less time to run than a few, different instructions. So, to make a smart choice about performance, or to present actually meaningful results in your blog post, you need to set up a micro-benchmark and actually see what really happens, and not just guess by counting lines. There are extremely good micro-benchmark tools available free online. Google's is as good as any. Or you could paste the code into Godbolt.org and benchmark it there.
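If setting up Google Benchmark is too much, even a crude std::chrono harness beats counting instructions. A sketch with hypothetical stub callbacks (inlined stubs won't behave like the external functions in bf.o, so treat any numbers as illustrative only):

```cpp
#include <chrono>
#include <cstdint>

// Stand-ins for the real callbacks; counting calls keeps the loop from
// being optimized away and lets us sanity-check the iteration scheme.
static long true_calls = 0, false_calls = 0;
static void func_true(int)  { ++true_calls; }
static void func_false(int) { ++false_calls; }

static void loop_v4(uint64_t mask) {
    for (int i = 63; mask; mask <<= 1, --i)
        (mask >> 63 ? func_true : func_false)(i);
}

// Average nanoseconds per loop_v4 call over `reps` repetitions.
static double ns_per_call(uint64_t mask, int reps) {
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        loop_v4(mask);
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    return (double)ns / reps;
}
```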

So, in the end, this might be best:

  void loop_v4(uint64_t mask) {
    for (int i=63; mask; mask <<= 1, --i)
      (mask >> 63 ? func_true : func_false)(i);
  }
or possibly

     (mask & ~(-1ull>>1) ? func_true : func_false)(i);
or

     ((int64_t)mask < 0 ? func_true : func_false)(i);
But again, you need to measure (which I didn't do either). Results might be very, very different between x86 and ARM. Of course the results will depend on what happens in func_true and in func_false, too. And, you might find this is altogether the wrong place to spend your optimizing attention.
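The three spellings are the same predicate, so any of them should compile identically; a quick sanity check (assuming a 64-bit uint64_t and two's-complement signed conversion):

```cpp
#include <cstdint>

// All three test the sign bit of a 64-bit word.
static bool sign_shift(uint64_t m) { return m >> 63; }
static bool sign_mask(uint64_t m)  { return m & ~(-1ull >> 1); }  // & 0x8000...0
static bool sign_cast(uint64_t m)  { return (int64_t)m < 0; }
```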


A quick detour to Godbolt shows all three loop_v4 versions result in exactly identical code, on all the compilers I tried. And, all used the condition code matching the sign bit, directly, so did not do a right shift. All used cmov to select a function to call. (On ARM they used csel, equivalently.)

Running under Google Benchmark, I get:

  $ g++ -O2 b.cc -L benchmark/build/src -lbenchmark \
  > -I benchmark/include -lpthread -o b -march=skylake bf.o
  $ ./b
  ...
  -----------------------------------------------------
  Benchmark Time    CPU     Iterations
  -----------------------------------------------------
  v1        244 ns  244 ns  2640782
  v1a       245 ns  245 ns  2861829
  v2        251 ns  251 ns  2761882
  v2a       250 ns  250 ns  2775107
  v3        264 ns  264 ns  2634478
  v3a       263 ns  263 ns  2646271
  v4        246 ns  246 ns  2812437
  v4a       231 ns  231 ns  3042318
The "a" versions are without the redundant check for (i >= 0). They are mostly the same, here, except v4a. The func_false and func_true functions are linked in from bf.o.

  $ clang++-11 -O2 b.cc -L benchmark/build/src -lbenchmark \
  > -I benchmark/include -lpthread -o b -march=skylake bf.o
  $ ./b
  ...
  v1        282 ns  282 ns  2314149
  v1a       282 ns  282 ns  2496325
  v2        239 ns  239 ns  2939233
  v2a       253 ns  253 ns  2767869
  v3        277 ns  277 ns  2525156
  v3a       242 ns  242 ns  2874514
  v4        254 ns  254 ns  2753859
  v4a       222 ns  222 ns  3157533
Interesting here is that Clang ends up with the best time, after starting off way behind. (The v2/v2a thing is some kind of testing artifact.)

Benchmarking almost always produces surprising results. It could be just good luck that mine came out fastest.



