HyperLogLog

HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset. Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, at the cost of obtaining only an approximation of the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of > 10⁹ with a typical accuracy of 2%, using 1.5 kB of memory. HyperLogLog is an extension of the earlier LogLog algorithm, itself deriving from the 1984 Flajolet–Martin algorithm.

Terminology

In the original paper by Flajolet et al. and in related literature on the count-distinct problem, the term "cardinality" is used to mean the number of distinct elements in a data stream with repeated elements. However in the theory of multisets the term refers to the sum of multiplicities of each member of a multiset. This article chooses to use Flajolet's definition for consistency with the sources.

Algorithm

The basis of the HyperLogLog algorithm is the observation that the cardinality of a multiset of uniformly distributed random numbers can be estimated by calculating the maximum number of leading zeros in the binary representation of each number in the set. If the maximum number of leading zeros observed is n, an estimate for the number of distinct elements in the set is 2ⁿ.
In the HyperLogLog algorithm, a hash function is applied to each element in the original multiset to obtain a multiset of uniformly distributed random numbers with the same cardinality as the original multiset. The cardinality of this randomly distributed set can then be estimated using the algorithm above.
The simple estimate of cardinality obtained using the algorithm above has the disadvantage of a large variance. In the HyperLogLog algorithm, the variance is minimised by splitting the multiset into numerous subsets, calculating the maximum number of leading zeros in the numbers in each of these subsets, and using a harmonic mean to combine these estimates for each subset into an estimate of the cardinality of the whole set.

Operations

The HyperLogLog has three main operations: add to add a new element to the set, count to obtain the cardinality of the set and merge to obtain the union of two sets. Some derived operations can be computed using the Inclusion–exclusion principle like the cardinality of the intersection or the cardinality of the difference between two HyperLogLogs combining the merge and count operations.
The data of the HyperLogLog is stored in an array of counters called registers with size that are set to 0 in their initial state.

Add

The add operation consists of computing the hash of the input data with a hash function, getting the first bits, and adding 1 to them to obtain the address of the register to modify. With the remaining bits compute which returns the position of the leftmost 1. The new value of the register will be the maximum between the current value of the register and.

Count

The count algorithm consists in computing the harmonic mean of the registers, and using a constant to derive an estimate of the count:
The intuition is that being the unknown cardinality of, each subset will have elements. Then
should be close to. The harmonic mean of 2 to these quantities is which should be near. Thus, should be approximately.
Finally, the constant is introduced to correct a systematic multiplicative bias present in due to hash collisions.

Practical Considerations

The constant is not simple to calculate, and can be approximated with the formula
The HyperLogLog technique, though, is biased for small cardinalities below a threshold of. The original paper proposes using a different algorithm for small cardinalities known as Linear Counting. In the case where the estimate provided above is less than the threshold, the alternative calculation can be used:

Let be the count of registers equal to 0.
If, use the standard HyperLogLog estimator above.
Otherwise, use Linear Counting:

Additionally, for very large cardinalities approaching the limit of the size of the registers, the cardinality can be estimated with:
With the above corrections for lower and upper bounds, the error can be estimated as.

Merge

The merge operation for two HLLs consists in obtaining the maximum for each pair of registers

Complexity

To analyze the complexity, the data streaming model is used, which analyzes the space necessary to get a approximation with a fixed success probability. The relative error of HLL is and it needs space, where is the set cardinality and is the number of registers.
The add operation depends on the size of the output of the hash function. As this size is fixed, we can consider the running time for the add operation to be.
The count and merge operations depend on the number of registers and have a theoretical cost of. In some implementations the number of registers is fixed and the cost is considered to be in the documentation.

HLL++

The HyperLogLog++ algorithm proposes several improvements in the HyperLogLog algorithm to reduce memory requirements and increase accuracy in some ranges of cardinalities:

64-bit hash function is used instead of the 32 bits used in the original paper. This reduces the hash collisions for large cardinalities allowing to remove the large range correction.
Some bias is found for small cardinalities when switching from linear counting to the HLL counting. An empirical bias correction is proposed to mitigate the problem.
A sparse representation of the registers is proposed to reduce memory requirements for small cardinalities, which can be later transformed to a dense representation if the cardinality grows.
Streaming HLL

When the data arrives in a single stream, the Historic Inverse Probability or martingale estimator
significantly improves the accuracy of the HLL sketch and uses 36% less memory to achieve a given error level. This estimator is provably optimal for any duplicate insensitive approximate distinct counting sketch on a single stream.
The single stream scenario also leads to variants in the HLL sketch construction.
HLL-TailCut+ uses 45% less memory than the original HLL sketch but at the cost of being dependent on the data insertion order and not being able to merge sketches.

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...