Package it.unimi.dsi.sux4j.mph
Static [[monotone] minimal perfect hash] functions.
Static functions
This package provides a number of stateoftheart implementations of static functions, that is, immutable structures that map a set of objects to a set of values. In particular, we provide minimal perfect hash functions (hence the package name), which map n objects to the first n natural numbers bijectively, and monotone minimal perfect hash functions, which map n objects to their lexicographical rank (i.e., the bijective mapping respects lexicographical order). All structures can compute their result using a space storage that is way below the informationtheoretical lower bound for a dictionary because they have an unpredictable result when queried with keys outside of the original set. You can sometimes control the behaviour outside of the original set, however, using signatures (read below). In particular, some of the maps can actually be used to store an approximate dictionary in a space very close to the optimum.
Most of the functions have constant lookup time, except for the computation of a hash function on the key.
Usage
Most functions in this package provide a builder, rather than a constructor,
and feature an easytouse main method to build serialized functions from the command line.
Moreover, they implement the Object2LongFunction
interface,
but the underlying machinery manipulates bit vectors only. To bring you own data
into the bitvector world, each function builder
requires to specify a transformation strategy
that maps your objects into bit vectors. Many readymade strategies can be found in TransformationStrategies
:
for example,
rawUtf16()
, utf16()
and
prefixFreeUtf16()
can be used with character sequences,
whereas
rawIso()
, iso()
and
prefixFreeIso()
can be used with character sequences
that are known to be within the range of the ISO88591 charset.
Note that if you plan to use monotone hashing, you must provide objects in an order such that the corresponding bit vectors
are lexicographically ordered. For instance, utf16()
obtain this
results by concatenating the reversed 16bit representation of each character. Sometimes, you need also
the resulting bit vectors to be prefixfree.
If order is not an issue,
raw transformation strategies do not reverse bit order, so
they are usually faster.
For maps whose size depends on the length of the bit vectors being hashed,
the HuTuckerTransformationStrategy
provides lexicographical optimal compression, at the price of a slower lookup.
Signing functions
All functions in this package will return a value in their range for most of the keys that are not in their domain. In the few cases in which it is possible to detect a key out of the original set, you will get the default return value.
If you are interested in getting a more precise behaviour, you can sign a function, that is, you can record a signature for each key and use it to filter keys out of the original set. For that to happen, the function must be a bijection of the original set with the first natural numbers: this happens with (monotone) minimal perfect hash functions and with suitable static functions. Several classes provide builtin signing (look at the methods of their builder).
An interesting application of the signing technique makes it possible to store a set of objects with a small
probability of error (false positives): one simply stores a function mapping each object to its signature. At lookup
time, the signature of the object is compared with the output of the function. For instance, using a GOV4Function
one can store a dictionary with probability of a false positive 2^{−w} using just 1.03nw
bits, that is, just 3% more than the theoretical lower bound. Look for the dictionary()
method in builders.
There are also external signing classes for character sequences provided
by the DSI utilities which actually implement the interface
StringMap
; in particular, LiterallySignedStringMap
will provide a full StringMap
implementation.
Available data structures
The classes can be gathered in three broad groups: General functions. They can be used to associate arbitrary values to a set of objects,
so, in particular, they can be used to implement orderpreserving minimal perfect hashing (elements are mapped
to their order in which they were provided, independently of their lexicographical order). They are also essential
building blocks for all other classes. Out of this class we suggest as a workhorse
GOV3Function
,GOV4Function
(slightly slower, smaller space) andTwoStepsGOV3Function
(slower, but uses less space if the distribution of the output values is skewed, i.e., some values are very frequent).  Compressed functions. As above, but static functions of this type
are targeted at functions whose output is not evenly
distributed (i.e., some values are much more frequent than others).
The space usage per key is proportional to the empirical 0th order entropy of the output values,
rather than to the number of bits that are necessary to express the larger value.
Out of this class we suggest as a workhorse
GV3CompressedFunction
.  Minimal perfect hash function; they map a set
of n object bijectively to the set n = { 0, 1,… n − 1 }.
Out of this class we suggest as a workhorse
GOVMinimalPerfectHashFunction
, which uses ≈2.24 bits per key and has very fast lookups.  Monotone minimal perfect hash functions; these
functions requires keys to be prefixfree and provided in lexicographical order. They will map back each key to its position using
a very small number of bits per element, providing different space/time tradeoffs (in what follows,
ℓ is the maximum string length):
LcpMonotoneMinimalPerfectHashFunction
is very fast, as it has just to evaluate threeGOV3Function
s (so if the length of the strings is a constant multiplied by the machine word, it is actually constant time); however it uses ≈2 + log log n + log ℓ bits per element.TwoStepsLcpMonotoneMinimalPerfectHashFunction
gains a few bits by performing some additional compression, but it is usually slightly slower (albeit always constant time).ZFastTrieDistributorMonotoneMinimalPerfectHashFunction
uses very little space: in fact, about one byte per key independently of the key set size (and basically of the bit vector length). It is an order of magnitude slower than anLcpMonotoneMinimalPerfectHashFunction
.
LcpMonotoneMinimalPerfectHashFunction
and ZFastTrieDistributorMonotoneMinimalPerfectHashFunction
were introduced by Djamal Belazzougui, Paolo Boldi, Rasmus Pagh and Sebastiano Vigna
in “Monotone Minimal Perfect Hashing: Searching a Sorted Table with O(1) Accesses”,
Proc. of the 20th Annual ACM–SIAM Symposium On Discrete Mathematics (SODA), ACM Press, 2009.
TwoStepsLcpMonotoneMinimalPerfectHashFunction
was introduced
by the same authors in “Theory and practice of monotone minimal perfect hashing”, ACM Journal of Experimental Algorithmics,
16(3):132−144, 2011.
Additional data structures
 General functions;
MWHCFunction
andTwoStepsMWHCFunction
are deprecated: they are still around for historical interest and for backward compatibility, asGOV3Function
,GOV4Function
andTwoStepsGOV3Function
beat them under every respect (except for a slightly greater construction time).  Minimal perfect hash function;
MinimalPerfectHashFunction
is deprecated: it is still around for historical interest and for backward compatibility, asGOVMinimalPerfectHashFunction
beats it under every respect (except for a slightly greater construction time).CHDMinimalPerfectHashFunction
is theoretically interesting as it can reach almost 2 bits per keys, with a 10% space gain with respect toGOVMinimalPerfectHashFunction
, but at the cost of slower lookups and a construction time that is an order of magnitude slower.  Monotone minimal perfect hash functions;
PaCoTrieDistributorMonotoneMinimalPerfectHashFunction
is very slow, as it uses a partial compacted trie (which requires linear time to be accessed) to distribute keys between buckets; theoretically it uses ≈2 + log(ℓ  log n) bits per element, but the partial compacted trie is every efficiency in exploiting data redundancy, so the actual occupancy is in general half with respect to the previous function.HollowTrieMonotoneMinimalPerfectHashFunction
is rather slow as it has to traverse a succinct trie on the whole key set; it uses just 4 + log ℓ + log log ℓ bits per element, and in practice it is the monotone minimal perfect hash function that uses less space.HollowTrieDistributorMonotoneMinimalPerfectHashFunction
is very slow, as it uses a enriched hollow trie as a distributor, but it has the (quite surprising) property of using 3.05 + 1.03 log log ℓ bits per element (note the double log). In practice, it will use less than a byte per element for strings of length up to a billion bits. Variablelength versions (e.g.,
VLLcpMonotoneMinimalPerfectHashFunction
andVLPaCoTrieDistributorMonotoneMinimalPerfectHashFunction
) are structures whose size depends on the average, rather than on the maximum string length, but they are mainly of theoretical interest.
PaCoTrieDistributorMonotoneMinimalPerfectHashFunction
,
HollowTrieMonotoneMinimalPerfectHashFunction
and HollowTrieDistributorMonotoneMinimalPerfectHashFunction
were introduced
by Djamal Belazzougui, Paolo Boldi, Rasmus Pagh and Sebastiano Vigna in “Theory and practice of monotone minimal perfect hashing”, ACM Journal of Experimental Algorithmics,
16(3):132−144, 2011.

Class Summary Class Description AbstractHashFunction<K> A very minimal abstract hash implementation.CHDMinimalPerfectHashFunction<T> A minimal perfect hash function implemented using the “hash, displace and compress” technique.CHDMinimalPerfectHashFunction.Builder<T> A builder class forCHDMinimalPerfectHashFunction
.GOV3Function<T> An immutable function stored quasisuccinctly using the GenuzioOttavianoVigna method to solve F_{2}linear systems.GOV3Function.Builder<T> A builder class forGOV3Function
.GOV4Function<T> An immutable function stored quasisuccinctly using the GenuzioOttavianoVigna method to solve F_{2}linear systems.GOV4Function.Builder<T> A builder class forGOV4Function
.GOVMinimalPerfectHashFunction<T> A minimal perfect hash function stored using the GenuzioOttavianoVigna 3regular F_{3}linear system technique.GOVMinimalPerfectHashFunction.Builder<T> A builder class forGOVMinimalPerfectHashFunction
.GV3CompressedFunction<T> An immutable function stored in a compressed form.GV3CompressedFunction.Builder<T> GV4CompressedFunction<T> An immutable function stored in a compressed form.GV4CompressedFunction.Builder<T> Hashes Basic hash functions.HollowTrieDistributor<T> A distributor based on a hollow trie.HollowTrieDistributorMonotoneMinimalPerfectHashFunction<T> A monotone minimal perfect hash implementation based on fixedsize bucketing that uses a hollow trie as a distributor.HollowTrieMonotoneMinimalPerfectHashFunction<T> A hollow trie, that is, a compacted trie recording just the length of the paths associated to the internal nodes.HypergraphSorter<T> A class implementing the 3hypergraph edge sorting procedure that is necessary for the MajewskiWormaldHavasCzech technique.LcpMonotoneMinimalPerfectHashFunction<T> A monotone minimal perfect hash implementation based on fixedsize bucketing that uses longest common prefixes as distributors.LcpMonotoneMinimalPerfectHashFunction.Builder<T> A builder class forLcpMonotoneMinimalPerfectHashFunction
.MinimalPerfectHashFunction<T> Deprecated. MinimalPerfectHashFunction.Builder<T> A builder class forMinimalPerfectHashFunction
.MWHCFunction<T> Deprecated. Please aGOV3Function
or aGOV4Function
.MWHCFunction.Builder<T> A builder class forMWHCFunction
.PaCoTrieDistributor<T> A succinct implementation of a binary partial compacted trie based on a recursive bitstream.PaCoTrieDistributorMonotoneMinimalPerfectHashFunction<T> A monotone minimal perfect hash implementation based on fixedsize bucketing that uses a partial compacted binary trie (PaCo trie) as distributor.TwoStepsGOV3Function<T> A function stored using two GOV3Functions—one for frequent values, and one for infrequent values.TwoStepsGOV3Function.Builder<T> A builder class forTwoStepsGOV3Function
.TwoStepsLcpMonotoneMinimalPerfectHashFunction<T> A monotone minimal perfect hash implementation based on fixedsize bucketing that uses longest common prefixes as distributors, and store their lengths using aTwoStepsGOV3Function
.TwoStepsLcpMonotoneMinimalPerfectHashFunction.Builder<T> A builder class forTwoStepsLcpMonotoneMinimalPerfectHashFunction
.TwoStepsMWHCFunction<T> Deprecated. Please useTwoStepsGOV3Function
.TwoStepsMWHCFunction.Builder<T> A builder class forTwoStepsMWHCFunction
.VLLcpMonotoneMinimalPerfectHashFunction<T> A monotone minimal perfect hash implementation based on fixedsize bucketing that uses longest common prefixes as distributors, and store their lengths using aGOVMinimalPerfectHashFunction
indexing anEliasFanoLongBigList
.VLPaCoTrieDistributor<T> A version of aPaCoTrieDistributor
whose space usage depends on the average string length, rather than on the maximum string length; mainly of theoretical interest.VLPaCoTrieDistributorMonotoneMinimalPerfectHashFunction<T> A version of aPaCoTrieDistributorMonotoneMinimalPerfectHashFunction
whose space usage depends on the average string length, rather than on the maximum string length; mainly of theoretical interest.ZFastTrieDistributor<T> A distributor based on a zfast trie.ZFastTrieDistributorMonotoneMinimalPerfectHashFunction<T> A monotone minimal perfect hash implementation based on fixedsize bucketing that uses a zfast trie as a distributor.ZFastTrieDistributorMonotoneMinimalPerfectHashFunction.Builder<T> A builder class forZFastTrieDistributorMonotoneMinimalPerfectHashFunction
.