Universal hashing

In mathematics and computing, universal hashing refers to selecting a hash function at random from a family of hash functions with a certain mathematical property. This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known, and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography.

Introduction

Assume we want to map keys from some universe into bins. The algorithm will have to handle some data set of keys, which is not known in advance. Usually, the goal of hashing is to obtain a low number of collisions. A deterministic hash function cannot offer any guarantee in an adversarial setting if the size of is greater than, since the adversary may choose to be precisely the preimage of a bin. This means that all data keys land in the same bin, making hashing useless. Furthermore, a deterministic hash function does not allow for rehashing: sometimes the input data turns out to be bad for the hash function, so one would like to change the hash function.
The solution to these problems is to pick a function randomly from a family of hash functions. A family of functions is called a universal family if,.
In other words, any two keys of the universe collide with probability at most when the hash function is drawn randomly from. This is exactly the probability of collision we would expect if the hash function assigned truly random hash codes to every key. Sometimes, the definition is relaxed to allow collision probability. This concept was introduced by Carter and Wegman in 1977, and has found numerous applications in computer science. If we have an upper bound of on the collision probability, we say that we have -almost universality.
Many, but not all, universal families have the following stronger uniform difference property:
Note that the definition of universality is only concerned with whether, which counts collisions. The uniform difference property is stronger.
An even stronger condition is pairwise independence: we have this property when we have the probability that will hash to any pair of hash values is as if they were perfectly random:. Pairwise independence is sometimes called strong universality.
Another property is uniformity. We say that a family is uniform if all hash values are equally likely: for any hash value. Universality does not imply uniformity. However, strong universality does imply uniformity.
Given a family with the uniform distance property, one can produce a pairwise independent or strongly universal hash family by adding a uniformly distributed random constant with values in to the hash functions. Since a shift by a constant is sometimes irrelevant in applications, a careful distinction between the uniform distance property and pairwise independent is sometimes not made.
For some applications, it is important for the least significant bits of the hash values to be also universal. When a family is strongly universal, this is guaranteed: if is a strongly universal family with, then the family made of the functions for all is also strongly universal for. Unfortunately, the same is not true of universal families. For example, the family made of the identity function is clearly universal, but the family made of the function fails to be universal.
UMAC and Poly1305-AES and several other message authentication code algorithms are based on universal hashing.
In such applications, the software chooses a new hash function for every message, based on a unique nonce for that message.
Several hash table implementations are based on universal hashing.
In such applications, typically the software chooses a new hash function only after it notices that "too many" keys have collided; until then, the same hash function continues to be used over and over.
. A survey of fastest known universal and strongly universal hash functions for integers, vectors, and
strings is found in.

Mathematical guarantees

For any fixed set of keys, using a universal family guarantees the following properties.

For any fixed in, the expected number of keys in the bin is. When implementing hash tables by chaining, this number is proportional to the expected running time of an operation involving the key .
The expected number of pairs of keys in with that collide is bounded above by, which is of order. When the number of bins, is chosen linear in , the expected number of collisions is. When hashing into bins, there are no collisions at all with probability at least a half.
The expected number of keys in bins with at least keys in them is bounded above by. Thus, if the capacity of each bin is capped to three times the average size, the total number of keys in overflowing bins is at most. This only holds with a hash family whose collision probability is bounded above by. If a weaker definition is used, bounding it by, this result is no longer true.

As the above guarantees hold for any fixed set, they hold if the data set is chosen by an adversary. However, the adversary has to make this choice before the algorithm's random choice of a hash function. If the adversary can observe the random choice of the algorithm, randomness serves no purpose, and the situation is the same as deterministic hashing.
The second and third guarantee are typically used in conjunction with rehashing. For instance, a randomized algorithm may be prepared to handle some number of collisions. If it observes too many collisions, it chooses another random from the family and repeats. Universality guarantees that the number of repetitions is a geometric random variable.

Constructions

Since any computer data can be represented as one or more machine words, one generally needs hash functions for three types of domains: machine words ; fixed-length vectors of machine words; and variable-length vectors.

Hashing integers

This section refers to the case of hashing integers that fit in machines words; thus, operations like multiplication, addition, division, etc. are cheap machine-level instructions. Let the universe to be hashed be.
The original proposal of Carter and Wegman was to pick a prime and define
where are randomly chosen integers modulo with.
To see that is a universal family, note that only holds when
for some integer between and. If, their difference, is nonzero and has an inverse modulo. Solving for yields
There are possible choices for and, varying in the allowed range, possible non-zero values for the right hand side. Thus the collision probability is

Another way to see is a universal family is via the notion of statistical distance. Write the difference as
Since is nonzero and is uniformly distributed in, it follows that modulo is also uniformly distributed in. The distribution of is thus almost uniform, up to a difference in probability of between the samples. As a result, the statistical distance to a uniform family is, which becomes negligible when.
The family of simpler hash functions
is only approximately universal: for all. Moreover, this analysis is nearly tight; Carter and Wegman show that whenever.

Avoiding modular arithmetic

The state of the art for hashing integers is the multiply-shift scheme described by Dietzfelbinger et al. in 1997. By avoiding modular arithmetic, this method is much easier to implement and also runs significantly faster in practice. The scheme assumes the number of bins is a power of two,. Let be the number of bits in a machine word. Then the hash functions are parametrised over odd positive integers . To evaluate, multiply by modulo and then keep the high order bits as the hash code. In mathematical notation, this is
and it can be implemented in C-like programming languages by
This scheme does not satisfy the uniform difference property and is only -almost-universal; for any,.
To understand the behavior of the hash function,
notice that, if and have the same highest-order 'M' bits, then has either all 1's or all 0's as its highest order M bits.
Assume that the least significant set bit of appears on position. Since is a random odd integer and odd integers have inverses in the ring, it follows that will be uniformly distributed among -bit integers with the least significant set bit on position. The probability that these bits are all 0's or all 1's is therefore at most.
On the other hand, if, then higher-order M bits of
contain both 0's and 1's, so
it is certain that. Finally, if then bit of
is 1 and if and only if bits are also 1, which happens with probability.
This analysis is tight, as can be shown with the example and. To obtain a truly 'universal' hash function, one can use the multiply-add-shift scheme
which can be implemented in C-like programming languages by
where is a random odd positive integer with and is a random non-negative integer with. With these choices of and, for all. This differs slightly but importantly from the mistranslation in the English paper.

Hashing vectors

This section is concerned with hashing a fixed-length vector of machine words. Interpret the input as a vector of machine words. If is a universal family with the uniform difference property, the following family also has the uniform difference property :
If is a power of two, one may replace summation by exclusive or.
In practice, if double-precision arithmetic is available, this is instantiated with the multiply-shift hash family of. Initialize the hash function with a vector of random odd integers on bits each. Then if the number of bins is for :
It is possible to halve the number of multiplications, which roughly translates to a two-fold speed-up in practice. Initialize the hash function with a vector of random odd integers on bits each. The following hash family is universal:
If double-precision operations are not available, one can interpret the input as a vector of half-words. The algorithm will then use multiplications, where was the number of half-words in the vector. Thus, the algorithm runs at a "rate" of one multiplication per word of input.
The same scheme can also be used for hashing integers, by interpreting their bits as vectors of bytes. In this variant, the vector technique is known as tabulation hashing and it provides a practical alternative to multiplication-based universal hashing schemes.
Strong universality at high speed is also possible. Initialize the hash function with a vector of random integers on bits. Compute
The result is strongly universal on bits. Experimentally, it was found to run at 0.2 CPU cycle per byte on recent Intel processors for.

Hashing strings

This refers to hashing a variable-sized vector of machine words. If the length of the string can be bounded by a small number, it is best to use the vector solution from above. The space required is the maximal length of the string, but the time to evaluate is just the length of. As long as zeroes are forbidden in the string, the zero-padding can be ignored when evaluating the hash function without affecting universality. Note that if zeroes are allowed in the string, then it might be best to append a fictitious non-zero character to all strings prior to padding: this will ensure that universality is not affected.
Now assume we want to hash, where a good bound on is not known a priori. A universal family proposed by
treats the string as the coefficients of a polynomial modulo a large prime. If, let be a prime and define:
Using properties of modular arithmetic, above can be computed without producing large numbers for large strings as follows:

uint hash
uint h = INITIAL_VALUE
for
h = mod p
return h

This Rabin-Karp rolling hash is based on a linear congruential generator.
Above algorithm is also known as Multiplicative hash function. In practice, the mod operator and the parameter p can be avoided altogether by simply allowing integer to overflow because it is equivalent to mod in many programming languages. Below table shows values chosen to initialize h and a for some of the popular implementations.

Implementation	INITIAL_VALUE	a
Bernstein's hash function djb2	5381	33
STLPort 4.6.2	0	5
Kernighan and Ritchie's hash function	0	31
`java.lang.String.hashCode`	0	31

Consider two strings and let be length of the longer one; for the analysis, the shorter string is conceptually padded with zeros up to length. A collision before applying implies that is a root of the polynomial with coefficients. This polynomial has at most roots modulo, so the collision probability is at most. The probability of collision through the random brings the total collision probability to. Thus, if the prime is sufficiently large compared to the length of strings hashed, the family is very close to universal.
Other universal families of hash functions used to hash unknown-length strings to fixed-length hash values include the Rabin fingerprint and the Buzhash.

Avoiding modular arithmetic

To mitigate the computational penalty of modular arithmetic, three tricks are used in practice:

One chooses the prime to be close to a power of two, such as a Mersenne prime. This allows arithmetic modulo to be implemented without division. For instance, on modern architectures one can work with, while 's are 32-bit values.
One can apply vector hashing to blocks. For instance, one applies vector hashing to each 16-word block of the string, and applies string hashing to the results. Since the slower string hashing is applied on a substantially smaller vector, this will essentially be as fast as vector hashing.
One chooses a power-of-two as the divisor, allowing arithmetic modulo to be implemented without division. The NH hash-function family takes this approach.

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...