The selection problem

Given a machine word, we are interested in locating the k-th bit set to one, starting from the least significant bit. This problem is important as it leads to excellent practical implementation of selection data structures on bit arrays of any size.

One of the results of the research around the Sux project is the design of a very fast algorithm to perform selection in a word. The algorithm uses broadword (a.k.a. SWAR—“SIMD in A Register”) techniques to perform selection in time O(log log w) instead of O(log w) or O(w), where w is the word size (albeit we use two multiplications, which however in modern processors are extremely fast). Already with words of 64 bits, the broadword algorithm is significantly faster than the O(log w) counterparts.

For instance, allegedly efficient selection algorithms reported in this page of bit hacks or in this book are significantly slower than the code shown here (you don't have to believe me—do your own microbenchmarks). The code is currently used by Facebook in their implementation of the Elias–Fano encoding, but look at their code for a replacement using just three extended instructions on post-Haswell Intel cores.

The broadword macro EASY_LEQ_STEP_8_MSBS compares two words byte-by-byte and leaves the highest bit in each byte set if the corresponding bytes in the two words where in order. Note that the code does not perform tests, contains no branch instruction and works unmodified up to word size 128.

The details of the first part of the algorithm are explained in my paper “Broadword Implementation of Rank/Select Queries” (basically, we compute the cumulative byte-by-byte bit counting function, and identify using a broadword comparison the byte containing the bit of interest). However, after the computation of byte_rank the code below resorts to a simple table lookup, as it happens in Simon Gog's SDSL library. The original algorithm (described further on) avoided at all lookup tables, which in 2007 was probably a good strategy.

The idea of computing selection in a byte via lookup tables, of course, is not new—it was used, for instance, by Daisuke Okanohara and Kunihiko Sadakane in the code associated with their paper “Practical entropy-compressed rank/select dictionary”, Proc. of the Workshop on Algorithm Engineering and Experiments, ALENEX 2007, SIAM, 2007.

The main disadvantage of this approach is that k must be smaller than the number of ones in the first argument—that is, a bit of rank k must exists in x. Otherwise, the behaviour of this function is unpredictable, and may include illegal memory access. In practice, this is not a significant limitation, as usually when we resort to selection in a word we know in advance that the bit of interest is there.

#include <inttypes.h>

#define ONES_STEP_4 ( 0x1111111111111111ULL )
#define ONES_STEP_8 ( 0x0101010101010101ULL )
#define MSBS_STEP_8 ( 0x80ULL * ONES_STEP_8 )
#define EASY_LEQ_STEP_8_MSBS(x,y) ( ( ( ( (y) | MSBS_STEP_8 ) - ( x ) ) ) & MSBS_STEP_8 )

int select_in_word( const uint64_t x, const int k ) {  /* k < number of ones in x. */

   // Phase 1: sums by byte
   uint64_t byte_sums = x - ( x >> 1 & 0x5ULL * ONES_STEP_4 );
   byte_sums = ( byte_sums & 3 * ONES_STEP_4 ) + ( ( byte_sums >> 2 ) & 3 * ONES_STEP_4 );
   byte_sums = ( byte_sums + ( byte_sums >> 4 ) ) & 0x0f * ONES_STEP_8;
   byte_sums *= ONES_STEP_8;

   // Phase 2: compare each byte sum with k
   const uint64_t k_step_8 = k * ONES_STEP_8;
   // Giuseppe Ottaviano's improvement
   const int place = __builtin_popcountll( EASY_LEQ_STEP_8_MSBS( byte_sums, k_step_8 ) ) * 8;

   // Phase 3: Locate the relevant byte and look up the result in select_in_byte
   return place + select_in_byte[ x >> place & 0xFF | k - ( ( byte_sums << 8 ) >> place & 0xFF ) << 8 ];
}
	

This is the fastest code I know for selection. Previosly, Gog and Petri, in their paper “Optimized succinct data structures for massive data”, Software: Practice and Experience, 2014, had replaced the EASY_LEQ_STEP_8_MSBS broadword macro with an extended LSB instruction and a lookup in a very small table, getting a very fast select, but after Giuseppe Ottaviano's improvement the code above is the fastest—it actually breaks the 2 ns barrier on an Intel® Core™ i7-4770 CPU @3.40GHz (Haswell). This is the code appearing in the DSI utilities (Java) and Sux (C/C++).

The original code

The original broadword algorithm contained in my paper “Broadword Implementation of Rank/Select Queries” is described below in C. Again, the two relevant broadword macros are LEQ_STEP_8, which compares two words byte-by-byte and leaves the highest bit in each byte set if the corresponding bytes in the two words where in order, and ZCOMPARE_STEP_8, which finds nonzero bytes. Note that the code does not perform tests, contains no branch instruction, uses no precomputed table and works unmodified up to word size 256 (but the LEQ_STEP_8 macro must be replaced with a slightly more complex unsigned comparison at w=256); three more instructions will make the code work up to word size 65536 [sic].

The algorithm has the property of returning a sensible value for values of k larger than the number of ones of the first argument, but, alas, nowadays we have larger and faster caches, so resolving selection in a byte is more quickly done via a table lookup, and this algorithm ends up being about twice as slow as the previous one.

#include <inttypes.h>

#define ONES_STEP_4 ( 0x1111111111111111ULL )
#define ONES_STEP_8 ( 0x0101010101010101ULL )
#define ONES_STEP_16 ( 1ULL << 0 | 1ULL << 16 | 1ULL << 32 | 1ULL << 48 )
#define MSBS_STEP_4 ( 0x8ULL * ONES_STEP_4 )
#define MSBS_STEP_8 ( 0x80ULL * ONES_STEP_8 )
#define MSBS_STEP_16 ( 0x8000ULL * ONES_STEP_16 )
#define INCR_STEP_8 ( 0x80ULL << 56 | 0x40ULL << 48 | 0x20ULL << 40 | 0x10ULL << 32 | 0x8ULL << 24 | 0x4ULL << 16 | 0x2ULL << 8 | 0x1 )

#define LEQ_STEP_8(x,y) ( ( ( ( ( (y) | MSBS_STEP_8 ) - ( (x) & ~MSBS_STEP_8 ) ) ^ (x) ^ (y) ) & MSBS_STEP_8 ) >> 7 )

#define ZCOMPARE_STEP_8(x) ( ( ( x | ( ( x | MSBS_STEP_8 ) - ONES_STEP_8 ) ) & MSBS_STEP_8 ) >> 7 )


int select( const uint64_t x, const int k ) { /* k < 128; returns 72 if there are less than k ones in x. */

   // Phase 1: sums by byte
   register uint64_t byte_sums = x - ( ( x & 0xa * ONES_STEP_4 ) >> 1 );
   byte_sums = ( byte_sums & 3 * ONES_STEP_4 ) + ( ( byte_sums >> 2 ) & 3 * ONES_STEP_4 );
   byte_sums = ( byte_sums + ( byte_sums >> 4 ) ) & 0x0f * ONES_STEP_8;
   byte_sums *= ONES_STEP_8;

   // Phase 2: compare each byte sum with k
   const uint64_t k_step_8 = k * ONES_STEP_8;
   const uint64_t place = ( LEQ_STEP_8( byte_sums, k_step_8 ) * ONES_STEP_8 >> 53 ) & ~0x7;

   // Phase 3: Locate the relevant byte and make 8 copies with incrental masks
   const int byte_rank = k - ( ( ( byte_sums << 8 ) >> place ) & 0xFF );

   const uint64_t spread_bits = ( x >> place & 0xFF ) * ONES_STEP_8 & INCR_STEP_8;
   const uint64_t bit_sums = ZCOMPARE_STEP_8( spread_bits ) * ONES_STEP_8;

   // Compute the inside-byte location and return the sum
   const uint64_t byte_rank_step_8 = byte_rank * ONES_STEP_8;

   return place + ( LEQ_STEP_8( bit_sums, byte_rank_step_8 ) * ONES_STEP_8 >> 56 );
}