# The selection problem

Given a machine word, we are interested in locating the `k`-th bit set to one, starting
from the least significant bit. This problem is important as it leads to excellent practical implementation
of selection data structures
on bit arrays of any size.

One of the results of the research around the Sux project is the design of a very fast algorithm
to perform selection in a word. The algorithm uses *broadword* (a.k.a. SWAR—“SIMD in A Register”) techniques
to perform selection in time *O*(log log `w`) instead of *O*(log `w`) or
*O*(`w`), where `w` is the word size (albeit we use three multiplications, which however
in modern processors are extremely fast). Already with words of 64 bits, the broadword
algorithm is significantly faster than the *O*(log `w`)
counterparts.

For instance, allegedly efficient selection algorithms reported in this page of bit hacks or in this book are significantly slower than the code shown here (you don't have to believe me—do your own microbenchmarks).

The broadword macro `EASY_LEQ_STEP_8`

compares two words byte-by-byte and leaves the highest bit in each byte set if the corresponding
bytes in the two words where in order. Note that the code does not perform tests, contains no
branch instruction and works unmodified up to word size 128.

The details of the first part of the
algorithm are explained in my paper “Broadword Implementation of Rank/Select
Queries” (basically, we compute the cumulative byte-by-byte bit counting function, and identify
using a broadword comparison the byte containing the bit of interest). However, after the computation of `byte_rank`

the code
below resorts to a simple table lookup, as it happens in Simon Gog's SDSL library.
The original algorithm (described further on) avoided at all lookup tables, which in 2007 was probably
a good strategy.

The idea of computing selection in a byte *via* lookup tables, of course, is not
new—it was used, for instance, by Daisuke Okanohara and Kunihiko Sadakane in the code
associated with their paper “Practical entropy-compressed rank/select dictionary”,
*Proc. of the Workshop on Algorithm Engineering and Experiments, ALENEX 2007*, SIAM, 2007.

The main disadvantage of this approach is that `k`

must be smaller than the
number of ones in the first argument—that is, a bit of rank `k`

must
exists in `x`

. Otherwise, the behaviour of this function is unpredictable,
and may include illegal memory access. In practice, this is not a significant limitation,
as usually when we resort to selection in a word we know in advance that the bit of
interest is there.

#include <inttypes.h> #define ONES_STEP_4 ( 0x1111111111111111ULL ) #define ONES_STEP_8 ( 0x0101010101010101ULL ) #define EASY_LEQ_STEP_8(x,y) ( ( ( ( ( (y) | MSBS_STEP_8 ) - ( x ) ) ) & MSBS_STEP_8 ) >> 7 ) int select_in_word( const uint64_t x, const int k ) { /* k < number of ones in x. */ // Phase 1: sums by byte uint64_t byte_sums = x - ( x >> 1 & 0x5ULL * ONES_STEP_4 ); byte_sums = ( byte_sums & 3 * ONES_STEP_4 ) + ( ( byte_sums >> 2 ) & 3 * ONES_STEP_4 ); byte_sums = ( byte_sums + ( byte_sums >> 4 ) ) & 0x0f * ONES_STEP_8; byte_sums *= ONES_STEP_8; // Phase 2: compare each byte sum with k const uint64_t k_step_8 = k * ONES_STEP_8; const int place = ( EASY_LEQ_STEP_8( byte_sums, k_step_8 ) * ONES_STEP_8 >> 53 ) & ~0x7; // Phase 3: Locate the relevant byte and look up the result in select_in_byte return place + select_in_byte[ x >> place & 0xFF | k - ( ( byte_sums << 8 ) >> place & 0xFF ) << 8 ]; }

## The fastest code

Gog and Petri, in their paper “Optimized succinct data
structures for massive data”, *Software: Practice and
Experience*, 2014, propose a further improvement (used by SDSL) that replaces the
`EASY_LEQ_STEP_8`

broadword macro with an extended LSB
instruction and a lookup in a very small table. Currently this is the
fastest select code—the version reported here, which reduces
slightly the number of operations, is used by Sux. It is about 7% faster than the code above in C. In Java, however,
the high cost for accessing an array makes this code about 5% *slower* than the code above. Benchmarks
were performed on an Intel® Core™ i7-4770 CPU @3.40GHz (Haswell).

#include <inttypes.h> #define ONES_STEP_4 ( 0x1111111111111111ULL ) #define ONES_STEP_8 ( 0x0101010101010101ULL ) int select_in_word( const uint64_t x, const int k ) { // Phase 1: sums by byte uint64_t byte_sums = x - ( x >> 1 & 0x5ULL * ONES_STEP_4 ); byte_sums = ( byte_sums & 3ULL * ONES_STEP_4 ) + ( ( byte_sums >> 2 ) & 3ULL * ONES_STEP_4 ); byte_sums = ( byte_sums + ( byte_sums >> 4 ) ) & 0x0fULL * ONES_STEP_8; byte_sums *= ONES_STEP_8; // Phase 2: find the right byte shift const int place = ( __builtin_ctzll( byte_sums + overflow[ k ] & 0x80ULL * ONES_STEP_8 ) >> 3 ) << 3; // Phase 3: Locate the relevant byte and look up the result in select_in_byte return place + select_in_byte[ x >> place & 0xFFULL | ( k - ( ( byte_sums << 8 ) >> place & 0xFF ) ) << 8 ]; }

## The original code

The original broadword algorithm contained in my paper “Broadword Implementation of Rank/Select
Queries” is described below in C. Again, the two relevant broadword macros are
`LEQ_STEP_8`

, which compares two words byte-by-byte and leaves the highest bit in
each byte set if the corresponding bytes in the two words where in order, and
`ZCOMPARE_STEP_8`

, which finds nonzero bytes. Note that the code does not perform
tests, contains no branch instruction, uses no precomputed table and works unmodified up to word size 256 (but the
`LEQ_STEP_8`

macro must be replaced with a slightly more complex unsigned comparison
at `w`=256); three more instructions will make the code work up to word size 65536
[sic].

The algorithm has the property of returning
a sensible value for values of `k`

larger than the number of ones of the first
argument, but, alas, nowadays we have larger and faster caches, so
resolving selection in a byte is more quickly done via a table lookup, and this algorithm
ends up being about twice as slow as the previous one.

#include <inttypes.h> #define ONES_STEP_4 ( 0x1111111111111111ULL ) #define ONES_STEP_8 ( 0x0101010101010101ULL ) #define ONES_STEP_16 ( 1ULL << 0 | 1ULL << 16 | 1ULL << 32 | 1ULL << 48 ) #define MSBS_STEP_4 ( 0x8ULL * ONES_STEP_4 ) #define MSBS_STEP_8 ( 0x80ULL * ONES_STEP_8 ) #define MSBS_STEP_16 ( 0x8000ULL * ONES_STEP_16 ) #define INCR_STEP_8 ( 0x80ULL << 56 | 0x40ULL << 48 | 0x20ULL << 40 | 0x10ULL << 32 | 0x8ULL << 24 | 0x4ULL << 16 | 0x2ULL << 8 | 0x1 ) #define LEQ_STEP_8(x,y) ( ( ( ( ( (y) | MSBS_STEP_8 ) - ( (x) & ~MSBS_STEP_8 ) ) ^ (x) ^ (y) ) & MSBS_STEP_8 ) >> 7 ) #define ZCOMPARE_STEP_8(x) ( ( ( x | ( ( x | MSBS_STEP_8 ) - ONES_STEP_8 ) ) & MSBS_STEP_8 ) >> 7 ) int select( const uint64_t x, const int k ) { /* k < 128; returns 72 if there are less than k ones in x. */ // Phase 1: sums by byte register uint64_t byte_sums = x - ( ( x & 0xa * ONES_STEP_4 ) >> 1 ); byte_sums = ( byte_sums & 3 * ONES_STEP_4 ) + ( ( byte_sums >> 2 ) & 3 * ONES_STEP_4 ); byte_sums = ( byte_sums + ( byte_sums >> 4 ) ) & 0x0f * ONES_STEP_8; byte_sums *= ONES_STEP_8; // Phase 2: compare each byte sum with k const uint64_t k_step_8 = k * ONES_STEP_8; const uint64_t place = ( LEQ_STEP_8( byte_sums, k_step_8 ) * ONES_STEP_8 >> 53 ) & ~0x7; // Phase 3: Locate the relevant byte and make 8 copies with incrental masks const int byte_rank = k - ( ( ( byte_sums << 8 ) >> place ) & 0xFF ); const uint64_t spread_bits = ( x >> place & 0xFF ) * ONES_STEP_8 & INCR_STEP_8; const uint64_t bit_sums = ZCOMPARE_STEP_8( spread_bits ) * ONES_STEP_8; // Compute the inside-byte location and return the sum const uint64_t byte_rank_step_8 = byte_rank * ONES_STEP_8; return place + ( LEQ_STEP_8( bit_sums, byte_rank_step_8 ) * ONES_STEP_8 >> 56 ); }