- All Implemented Interfaces:
public class GOV4Function<T> extends AbstractObject2LongFunction<T> implements Serializable, Size64
GOV3Function, instances of this class have slightly slower lookups and are slightly slower to build, but use less space.
Instances of this class store a function from keys to values. Keys are provided by an
iterable object (whose iterators must return elements in a consistent
order), whereas values are provided by a
LongIterable. If you do not specify values, each
key will be assigned its rank (e.g., its position in iteration order starting from zero).
For convenience, this class provides a main method that reads from standard input a (possibly
gzip'd) sequence of newline-separated strings, and writes a serialised function
mapping each element of the list to its position, or to a given list of values.
Optionally, it is possible to sign a
GOV4Function. Signing is possible if no list of values
has been specified (otherwise, there is no way to associate a key with its signature). A
w-bit signature will be associated with each key, so that
will return a default return value (by default, -1) on strings
that are not in the original key set. As usual, false positives are possible with probability
If you're not interested in the rank of a key, but just to know whether the key was in the original set, you can turn the function into an approximate dictionary. In this case, the value associated by the function with a key is exactly its signature, which means that the only space used by the function is that occupied by signatures: this is one of the fastest and most compact way of storing a static approximate dictionary. In this case, the only returned value is one, and the default return value is set to zero.
Building a function
This class provides a great amount of flexibility when creating a new function; such flexibility is exposed through the builder. To exploit the various possibilities, you must understand some details of the construction.
In a first phase, we build a
BucketedHashStore containing hashes of the keys. By default,
the store will associate each hash with the rank of the key. If you
specify values, the store will associate with each
hash the corresponding value.
However, if you further require an indirect construction the
store will associate again each hash with the rank of the corresponding key, and access randomly
the values (which must be either a
LongList or a
construction is useful only in complex, multi-layer hashes (such as an
LcpMonotoneMinimalPerfectHashFunction) in which we want to reuse a checked
BucketedHashStore. Storing values in the
BucketedHashStore is extremely scalable
because the values must just be a
LongIterable that will be scanned sequentially during
the store construction. On the other hand, if you have already a store that associates ordinal
positions, and you want to build a new function for which a
LongBigList of values needs little space (e.g., because it is described implicitly), you
can opt for an indirect construction using the already built
Note that if you specify a store it will be used before building a new one (possibly because of a
obvious benefits in terms of performance. If the store is not checked, and a
DuplicateException is thrown,
the constructor will try to rebuild the store, but this requires, of course, that the keys, and
possibly the values, are available. Note that it is your responsibility to pass a correct store.
This implementation is multithreaded: each bucket returned by the
processed independently. By default, this class uses
parallel threads, but by default no more than 4. If you wish to set a specific number of threads,
you can do so through the system property "it.unimi.dsi.sux4j.mph.threads".
The detail of the data structure can be found in “Fast Scalable Construction of (Minimal
Perfect Hash) Functions”, by Marco Genuzio, Giuseppe Ottaviano and Sebastiano Vigna,
15th International Symposium on Experimental Algorithms — SEA 2016, Lecture Notes in
Computer Science, Springer, 2016. We generate a random 4-regular linear system on
F2, where the known term of the k-th equation is the output value
for the k-th key. Then, we solve it and store the
solution. Since the system must have ≈3% more variables than equations to be solvable, an
GOV4Function on n keys requires 1.03rn bits.
Nested Class Summary
Nested Classes Modifier and Type Class Description
static classA builder class for
Fields Modifier and Type Field Description
static intThe expected bucket size.
static doubleThe ratio between variables and equations.
protected LongBigListThe final magick—the list of values that define the output of the function.
protected longThe seed used to generate the initial signature.
protected longThe number of variables.
protected longThe number of keys.
static StringThe system property used to set the number of parallel threads.
protected longA long containing the start offset of each bucket in the lower 56 bits, and the local seed of each bucket in the upper 8 bits.
protected longThe mask to compare signatures, or zero for no signatures.
protected LongBigListThe signatures.
protected TransformationStrategy<? super T>The transformation strategy to turn objects of type
Tinto bit vectors.
protected intThe data width.
Constructors Modifier Constructor Description
Iterable<? extends T> keys, TransformationStrategy<? super T> transform, int signatureWidth, LongIterable values, int dataWidth, File tempDir, BucketedHashStore<T> bucketedHashStore, boolean indirect)(Creates a new function for the given keys and values.
Modifier and Type Method Description
(long signature)Low-level access to the output of this function.
()Returns the number of bits used by this structure.
()Returns the number of keys in the function domain.
Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface it.unimi.dsi.fastutil.objects.Object2LongFunction
andThen, andThenByte, andThenChar, andThenDouble, andThenFloat, andThenInt, andThenLong, andThenObject, andThenReference, andThenShort, applyAsLong, composeByte, composeChar, composeDouble, composeFloat, composeInt, composeLong, composeObject, composeReference, composeShort, get, getOrDefault, getOrDefault, put, put, remove, removeLong
Cpublic static double CThe ratio between variables and equations.
NUMBER_OF_THREADS_PROPERTYpublic static final String NUMBER_OF_THREADS_PROPERTYThe system property used to set the number of parallel threads.
- See Also:
- Constant Field Values
BUCKET_SIZEpublic static final int BUCKET_SIZEThe expected bucket size.
- See Also:
- Constant Field Values
nprotected final long nThe number of keys.
mprotected final long mThe number of variables.
widthprotected final int widthThe data width.
globalSeedprotected final long globalSeedThe seed used to generate the initial signature.
offsetAndSeedprotected final long offsetAndSeedA long containing the start offset of each bucket in the lower 56 bits, and the local seed of each bucket in the upper 8 bits.
dataprotected final LongBigList dataThe final magick—the list of values that define the output of the function.
transformThe transformation strategy to turn objects of type
Tinto bit vectors.
signatureMaskprotected final long signatureMaskThe mask to compare signatures, or zero for no signatures.
signaturesprotected final LongBigList signaturesThe signatures.
GOV4Functionprotected GOV4Function(Iterable<? extends T> keys, TransformationStrategy<? super T> transform, int signatureWidth, LongIterable values, int dataWidth, File tempDir, BucketedHashStore<T> bucketedHashStore, boolean indirect) throws IOExceptionCreates a new function for the given keys and values.
keys- the keys in the domain of the function, or
transform- a transformation strategy for the keys.
signatureWidth- a positive number for a signature width, 0 for no signature, a negative value for a self-signed function; if nonzero,
widthmust be -1.
values- values to be assigned to each element, in the same order of the iterator returned by
null, the assigned value will the ordinal number of each element.
dataWidth- the bit width of the
values, or -1 if
tempDir- a temporary directory for the store files, or
nullfor the standard temporary directory.
bucketedHashStore- a bucketed hash store containing the keys associated with their ranks (if there are no values, or
indirectis true) or values, or
null; the store can be unchecked, but in this case
transformmust be non-
indirect- if true,
bucketedHashStorecontains ordinal positions, and
LongIterablethat must be accessed to retrieve the actual values.
getLongpublic long getLong(Object o)
getLongBySignaturepublic long getLongBySignature(long signature)Low-level access to the output of this function.
This method makes it possible to build several kind of functions on the same
BucketedHashStoreand then retrieve the resulting values by generating a single signature. The method
TwoStepsGOV3Function.getLong(Object)is a good example of this technique.
signature- a signature generated as documented in
- the output of the function.
size64public long size64()Returns the number of keys in the function domain.
size@Deprecated public int size()Deprecated.
numBitspublic long numBits()Returns the number of bits used by this structure.
- the number of bits used by this structure.
containsKeypublic boolean containsKey(Object o)