The selection problem
Given a machine word, we are interested in locating the k-th bit set to one, starting from the least significant bit. This problem is important as it leads to excellent practical implementation of selection data structures on bit arrays of any size.
One of the results of the research around the Sux project is the design of a very fast algorithm to perform selection in a word. The algorithm uses broadword (a.k.a. SWAR—“SIMD in A Register”) techniques to perform selection in time O(log log w) instead of O(log w) or O(w), where w is the word size (albeit we use two multiplications, which however in modern processors are extremely fast). Already with words of 64 bits, the broadword algorithm is significantly faster than the O(log w) counterparts.
For instance, allegedly efficient selection algorithms reported in this page of bit hacks or in this book are significantly slower than the code shown here (you don't have to believe me—do your own microbenchmarks). The code is currently used by Facebook in their implementation of the Elias–Fano encoding, but look at their code for a replacement using just three extended instructions on post-Haswell Intel cores.
The broadword macro EASY_LEQ_STEP_8_MSBS
compares two words byte-by-byte and leaves the highest bit in each byte set if the corresponding
bytes in the two words where in order. Note that the code does not perform tests, contains no
branch instruction and works unmodified up to word size 128.
The details of the first part of the
algorithm are explained in my paper “Broadword Implementation of Rank/Select
Queries” (basically, we compute the cumulative byte-by-byte bit counting function, and identify
using a broadword comparison the byte containing the bit of interest). However, after the computation of byte_rank
the code
below resorts to a simple table lookup, as it happens in Simon Gog's SDSL library.
The original algorithm (described further on) avoided at all lookup tables, which in 2007 was probably
a good strategy.
The idea of computing selection in a byte via lookup tables, of course, is not new—it was used, for instance, by Daisuke Okanohara and Kunihiko Sadakane in the code associated with their paper “Practical entropy-compressed rank/select dictionary”, Proc. of the Workshop on Algorithm Engineering and Experiments, ALENEX 2007, SIAM, 2007.
The main disadvantage of this approach is that k
must be smaller than the
number of ones in the first argument—that is, a bit of rank k
must
exists in x
. Otherwise, the behaviour of this function is unpredictable,
and may include illegal memory access. In practice, this is not a significant limitation,
as usually when we resort to selection in a word we know in advance that the bit of
interest is there.
#include <inttypes.h> #define ONES_STEP_4 ( 0x1111111111111111ULL ) #define ONES_STEP_8 ( 0x0101010101010101ULL ) #define MSBS_STEP_8 ( 0x80ULL * ONES_STEP_8 ) #define EASY_LEQ_STEP_8_MSBS(x,y) ( ( ( ( (y) | MSBS_STEP_8 ) - ( x ) ) ) & MSBS_STEP_8 ) int select_in_word( const uint64_t x, const int k ) { /* k < number of ones in x. */ // Phase 1: sums by byte uint64_t byte_sums = x - ( x >> 1 & 0x5ULL * ONES_STEP_4 ); byte_sums = ( byte_sums & 3 * ONES_STEP_4 ) + ( ( byte_sums >> 2 ) & 3 * ONES_STEP_4 ); byte_sums = ( byte_sums + ( byte_sums >> 4 ) ) & 0x0f * ONES_STEP_8; byte_sums *= ONES_STEP_8; // Phase 2: compare each byte sum with k const uint64_t k_step_8 = k * ONES_STEP_8; // Giuseppe Ottaviano's improvement const int place = __builtin_popcountll( EASY_LEQ_STEP_8_MSBS( byte_sums, k_step_8 ) ) * 8; // Phase 3: Locate the relevant byte and look up the result in select_in_byte return place + select_in_byte[ x >> place & 0xFF | k - ( ( byte_sums << 8 ) >> place & 0xFF ) << 8 ]; }
This is the fastest code I know for selection. Previosly, Gog and Petri, in their paper “Optimized succinct data
structures for massive data”, Software: Practice and
Experience, 2014, had replaced the
EASY_LEQ_STEP_8_MSBS
broadword macro with an extended LSB
instruction and a lookup in a very small table, getting a very fast select, but after Giuseppe Ottaviano's improvement
the code above is the fastest—it actually breaks the 2 ns barrier on an Intel® Core™ i7-4770 CPU @3.40GHz (Haswell).
This is the code appearing in the DSI utilities (Java)
and Sux (C/C++).
The original code
The original broadword algorithm contained in my paper “Broadword Implementation of Rank/Select
Queries” is described below in C. Again, the two relevant broadword macros are
LEQ_STEP_8
, which compares two words byte-by-byte and leaves the highest bit in
each byte set if the corresponding bytes in the two words where in order, and
ZCOMPARE_STEP_8
, which finds nonzero bytes. Note that the code does not perform
tests, contains no branch instruction, uses no precomputed table and works unmodified up to word size 256 (but the
LEQ_STEP_8
macro must be replaced with a slightly more complex unsigned comparison
at w=256); three more instructions will make the code work up to word size 65536
[sic].
The algorithm has the property of returning
a sensible value for values of k
larger than the number of ones of the first
argument, but, alas, nowadays we have larger and faster caches, so
resolving selection in a byte is more quickly done via a table lookup, and this algorithm
ends up being about twice as slow as the previous one.
#include <inttypes.h> #define ONES_STEP_4 ( 0x1111111111111111ULL ) #define ONES_STEP_8 ( 0x0101010101010101ULL ) #define ONES_STEP_16 ( 1ULL << 0 | 1ULL << 16 | 1ULL << 32 | 1ULL << 48 ) #define MSBS_STEP_4 ( 0x8ULL * ONES_STEP_4 ) #define MSBS_STEP_8 ( 0x80ULL * ONES_STEP_8 ) #define MSBS_STEP_16 ( 0x8000ULL * ONES_STEP_16 ) #define INCR_STEP_8 ( 0x80ULL << 56 | 0x40ULL << 48 | 0x20ULL << 40 | 0x10ULL << 32 | 0x8ULL << 24 | 0x4ULL << 16 | 0x2ULL << 8 | 0x1 ) #define LEQ_STEP_8(x,y) ( ( ( ( ( (y) | MSBS_STEP_8 ) - ( (x) & ~MSBS_STEP_8 ) ) ^ (x) ^ (y) ) & MSBS_STEP_8 ) >> 7 ) #define ZCOMPARE_STEP_8(x) ( ( ( x | ( ( x | MSBS_STEP_8 ) - ONES_STEP_8 ) ) & MSBS_STEP_8 ) >> 7 ) int select( const uint64_t x, const int k ) { /* k < 128; returns 72 if there are less than k ones in x. */ // Phase 1: sums by byte register uint64_t byte_sums = x - ( ( x & 0xa * ONES_STEP_4 ) >> 1 ); byte_sums = ( byte_sums & 3 * ONES_STEP_4 ) + ( ( byte_sums >> 2 ) & 3 * ONES_STEP_4 ); byte_sums = ( byte_sums + ( byte_sums >> 4 ) ) & 0x0f * ONES_STEP_8; byte_sums *= ONES_STEP_8; // Phase 2: compare each byte sum with k const uint64_t k_step_8 = k * ONES_STEP_8; const uint64_t place = ( LEQ_STEP_8( byte_sums, k_step_8 ) * ONES_STEP_8 >> 53 ) & ~0x7; // Phase 3: Locate the relevant byte and make 8 copies with incrental masks const int byte_rank = k - ( ( ( byte_sums << 8 ) >> place ) & 0xFF ); const uint64_t spread_bits = ( x >> place & 0xFF ) * ONES_STEP_8 & INCR_STEP_8; const uint64_t bit_sums = ZCOMPARE_STEP_8( spread_bits ) * ONES_STEP_8; // Compute the inside-byte location and return the sum const uint64_t byte_rank_step_8 = byte_rank * ONES_STEP_8; return place + ( LEQ_STEP_8( bit_sums, byte_rank_step_8 ) * ONES_STEP_8 >> 56 ); }