Class HypergraphSorter<T>

java.lang.Object
it.unimi.dsi.sux4j.mph.HypergraphSorter<T>

public class HypergraphSorter<T> extends Object
A class implementing the 3-hypergraph edge sorting procedure that is necessary for the Majewski-Wormald-Havas-Czech technique.

Bohdan S. Majewski, Nicholas C. Wormald, George Havas, and Zbigniew J. Czech have described in “A family of perfect hashing methods”, Comput. J., 39(6):547−554, 1996, a 3-hypergraph based technique to store functions (actually, the paper uses the technique just to store a permutation of the key set, but it is clear it can be used to store any function). More generally, the procedure first generates a random 3-partite 3-hypergraph whose edges correspond to elements of the function domain. Then, it sorts the edges of the random 3-hypergraph so that for each edge at least one vertex, the hinge, never appeared before in the sorted edge list (this happens with high probability if the number of vertices is at least γ times the number of edges).

Instances of this class contain the data necessary to generate the random hypergraph and apply the sorting procedure. At construction time, you provide just the desired number of edges; then, each call to generateAndSort() will generate a new 3-hypergraph using a 64-bit seed, an iterator returning the key set, and a corresponding TransformationStrategy. If the method returns true, the sorting was successful and in the public field stack you can retrieve hinges in the opposite of the desired order (so enumerating hinges starting from the last in stack you are guaranteed to find each time a vertex that never appeared in some 3-hyperedge associated with previous hinges). For each hinge, the corresponding values in vertex1 and vertex2 complete the 3-hyperedge associated with the hinge, and the corresponding value in edge contains the index of the 3-hyperedge (i.e., the index of the key that generated the hyperedge). The computation of the last value can be disabled if you do not need it.

The public fields numEdges and numVertices expose information about the generated 3-hypergraph. For m edges, the number of vertices will be ⌈ γm ⌉ + 1, rounded up to the nearest multiple of 3, unless m is zero, in which case the number of vertices will be zero, too. Note that index of the hash function that generated a particular vertex of a 3-hyperedge can be recovered dividing by partSize, which is exactly numVertices/3.

To guarantee consistent results when reading a Majewski-Wormald-Havas-Czech-like structure, the method bitVectorToEdge() can be used to retrieve, starting from a bit vector, the corresponding edge. While having a function returning the edge starting from a key would be more object-oriented and avoid hidden dependencies, it would also require storing the transformation provided at construction time, which would make this class non-thread-safe. Just be careful to transform the keys into bit vectors using the same TransformationStrategy used to generate the random 3-hypergraph.

Support for preprocessed keys

This class provides two special access points for classes that have pre-digested their keys. The methods generateAndSort(Iterator, long) and tripleToEdge(long[], long, int, int, int[]) use fixed-length 192-bit keys under the form of triples of longs. The intended usage is that of turning the keys into such a triple using SpookyHash and then operating directly on the hash codes. This is particularly useful in chunked constructions, where the keys are replaced by their 192-bit hashes in the first place. Note that the hashes are actually rehashed using Hashes.spooky4(long[], long, long[])—this is necessary to vary the associated edges whenever the generated 3-hypergraph is not acyclic.

Warning: you cannot mix the bitvector-based and the triple-based constructors and static methods. It is your responsibility to pair them correctly.

Implementation details

We use Jenkins's SpookyHash to compute three 64-bit hash values.

The XOR trick

Since the list of edges incident to a vertex is accessed during the peeling process only when the vertex has degree one, we can actually store in a single integer the XOR of the indices of all edges incident to the vertex. This approach significantly simplifies the code and reduces memory usage. It is described in detail in “Cache-oblivious peeling of random hypergraphs”, by Djamal Belazzougui, Paolo Boldi, Giuseppe Ottaviano, Rossano Venturini, and Sebastiano Vigna, Proc. Data Compression Conference 2014, 2014.

We push further this idea by observing that since one of the vertices of an edge incident to x is exactly x, we can even avoid storing the edges at all and just store for each vertex two additional values that contain a XOR of the other two vertices of each edge incident on the vertex. This approach further simplifies the code as every 3-hyperedge is presented to us as a distinguished vertex (the hinge) plus two additional vertices.

Rounds and Logging

Building and sorting a large 3-hypergraph may take hours. As it happens with all probabilistic algorithms, one can just give estimates of the expected time.

There are two probabilistic sources of problems: duplicate edges and non-acyclic hypergraphs. However, the probability of duplicate edges is vanishing when n approaches infinity, and once the hypergraph has been generated, the stripping procedure succeeds in an expected number of trials that tends to 1 as n approaches infinity.

To help diagnosing problem with the generation process, this class will log at debug level what's happening.

Author:
Sebastiano Vigna
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    final int[]
    For each vertex, the XOR of the indices of incident 3-hyperedges.
    static final double
    The mythical threshold (or better, a very closed upper bound of): random 3-hypergraphs are acyclic with high probability if the ratio vertices/edges exceeds this constant.
    final int
    The number of edges in the hypergraph.
    final int
    The number of vertices in the hypergraph (⌈ GAMMA * numEdges ⌉ + 1, rounded up to the nearest multiple of 3).
    final int
    final int[]
    The hinge stack.
    final int[]
    For each vertex, the XOR of the values of the smallest other vertex in each incident 3-hyperedge.
    final int[]
    For each vertex, the XOR of the values of the largest other vertex in each incident 3-hyperedge.
  • Constructor Summary

    Constructors
    Constructor
    Description
    HypergraphSorter(int numEdges)
    Creates a hypergraph sorter for a given number of edges.
    HypergraphSorter(int numEdges, boolean computeEdges)
    Creates a hypergraph sorter for a given number of edges.
  • Method Summary

    Modifier and Type
    Method
    Description
    static void
    bitVectorToEdge(BitVector bv, long seed, int numVertices, int[] e)
    Turns a bit vector into a 3-hyperedge.
    static void
    bitVectorToEdge(BitVector bv, long seed, int numVertices, int partSize, int[] e)
    Turns a bit vector into a 3-hyperedge.
    boolean
    generateAndSort(Iterator<? extends T> iterator, TransformationStrategy<? super T> transform, long seed)
    Generates a random 3-hypergraph and tries to sort its edges.
    boolean
    generateAndSort(Iterator<long[]> iterator, long seed)
    Generates a random 3-hypergraph and tries to sort its edges.
    static void
    tripleToEdge(long[] triple, long seed, int numVertices, int[] e)
    Turns a triple of longs into a 3-hyperedge.
    static void
    tripleToEdge(long[] triple, long seed, int numVertices, int partSize, int[] e)
    Turns a triple of longs into a 3-hyperedge.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • GAMMA

      public static final double GAMMA
      The mythical threshold (or better, a very closed upper bound of): random 3-hypergraphs are acyclic with high probability if the ratio vertices/edges exceeds this constant.
      See Also:
    • numVertices

      public final int numVertices
      The number of vertices in the hypergraph (⌈ GAMMA * numEdges ⌉ + 1, rounded up to the nearest multiple of 3).
    • partSize

      public final int partSize
    • numEdges

      public final int numEdges
      The number of edges in the hypergraph.
    • vertex1

      public final int[] vertex1
      For each vertex, the XOR of the values of the smallest other vertex in each incident 3-hyperedge.
    • vertex2

      public final int[] vertex2
      For each vertex, the XOR of the values of the largest other vertex in each incident 3-hyperedge.
    • edge

      public final int[] edge
      For each vertex, the XOR of the indices of incident 3-hyperedges.
    • stack

      public final int[] stack
      The hinge stack. At the end of a successful sorting phase, it contains the hinges in reverse order.
  • Constructor Details

    • HypergraphSorter

      public HypergraphSorter(int numEdges, boolean computeEdges)
      Creates a hypergraph sorter for a given number of edges.
      Parameters:
      numEdges - the number of edges of this hypergraph sorter.
      computeEdges - if false, the index of the edge associated with each hinge will not be computed.
    • HypergraphSorter

      public HypergraphSorter(int numEdges)
      Creates a hypergraph sorter for a given number of edges.
      Parameters:
      numEdges - the number of edges of this hypergraph sorter.
  • Method Details

    • bitVectorToEdge

      public static void bitVectorToEdge(BitVector bv, long seed, int numVertices, int partSize, int[] e)
      Turns a bit vector into a 3-hyperedge.

      The returned edge satisfies the property that the i-th vertex is in the interval [i·partSize..i+1·partSize). However, if there are no edges the vector e will be filled with -1.

      Parameters:
      bv - a bit vector.
      seed - the seed for the hash function.
      numVertices - the number of vertices in the underlying hypergraph.
      partSize - numVertices/3 (to avoid a division).
      e - an array to store the resulting edge.
    • bitVectorToEdge

      public static void bitVectorToEdge(BitVector bv, long seed, int numVertices, int[] e)
      Turns a bit vector into a 3-hyperedge.
      Parameters:
      bv - a bit vector.
      seed - the seed for the hash function.
      numVertices - the number of vertices in the underlying hypergraph.
      e - an array to store the resulting edge.
      See Also:
    • tripleToEdge

      public static void tripleToEdge(long[] triple, long seed, int numVertices, int partSize, int[] e)
      Turns a triple of longs into a 3-hyperedge.
      Parameters:
      triple - a triple of intermediate hashes.
      seed - the seed for the hash function.
      numVertices - the number of vertices in the underlying hypergraph.
      partSize - numVertices/3 (to avoid a division).
      e - an array to store the resulting edge.
      See Also:
    • tripleToEdge

      public static void tripleToEdge(long[] triple, long seed, int numVertices, int[] e)
      Turns a triple of longs into a 3-hyperedge.
      Parameters:
      triple - a triple of intermediate hashes.
      seed - the seed for the hash function.
      numVertices - the number of vertices in the underlying hypergraph.
      e - an array to store the resulting edge.
      See Also:
    • generateAndSort

      public boolean generateAndSort(Iterator<? extends T> iterator, TransformationStrategy<? super T> transform, long seed)
      Generates a random 3-hypergraph and tries to sort its edges.
      Parameters:
      iterator - an iterator returning numEdges keys.
      transform - a transformation from keys to bit vectors.
      seed - a 64-bit random seed.
      Returns:
      true if the sorting procedure succeeded.
    • generateAndSort

      public boolean generateAndSort(Iterator<long[]> iterator, long seed)
      Generates a random 3-hypergraph and tries to sort its edges.
      Parameters:
      iterator - an iterator returning numEdges triples of longs.
      seed - a 64-bit random seed.
      Returns:
      true if the sorting procedure succeeded.