Kullback–Leibler divergence


In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy) is a measure of how one probability distribution differs from a second, reference probability distribution. Applications include characterizing the relative entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference. In contrast to variation of information, it is a distribution-wise asymmetric measure and thus does not qualify as a statistical metric of spread; it also does not satisfy the triangle inequality. In the simple case, a Kullback–Leibler divergence of 0 indicates that the two distributions in question are identical. In simplified terms, it is a measure of surprise, with applications in fields as diverse as applied statistics, fluid mechanics, neuroscience and machine learning.

Etymology

The Kullback–Leibler divergence was introduced by Solomon Kullback and Richard Leibler in 1951 as the directed divergence between two distributions; Kullback preferred the term discrimination information. The divergence is discussed in Kullback's 1959 book, Information Theory and Statistics.

Definition

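For discrete probability distributions $P$ and $Q$ defined on the same probability space, $\mathcal{X}$, the Kullback–Leibler divergence from $Q$ to $P$ is defined to be
$$D_{\text{KL}}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right),$$
which is equivalent to
$$D_{\text{KL}}(P \parallel Q) = -\sum_{x \in \mathcal{X}} P(x) \log\left(\frac{Q(x)}{P(x)}\right).$$
In other words, it is the expectation of the logarithmic difference between the probabilities $P$ and $Q$, where the expectation is taken using the probabilities $P$. The Kullback–Leibler divergence is defined only if, for all $x$, $Q(x) = 0$ implies $P(x) = 0$ (absolute continuity). Whenever $P(x)$ is zero the contribution of the corresponding term is interpreted as zero because
$$\lim_{x \to 0^{+}} x \log(x) = 0.$$
For distributions $P$ and $Q$ of a continuous random variable, the Kullback–Leibler divergence is defined to be the integral:
$$D_{\text{KL}}(P \parallel Q) = \int_{-\infty}^{\infty} p(x) \log\left(\frac{p(x)}{q(x)}\right) dx,$$
where $p$ and $q$ denote the probability densities of $P$ and $Q$.
More generally, if $P$ and $Q$ are probability measures over a set $\mathcal{X}$, and $P$ is absolutely continuous with respect to $Q$, then the Kullback–Leibler divergence from $Q$ to $P$ is defined as
$$D_{\text{KL}}(P \parallel Q) = \int_{\mathcal{X}} \log\left(\frac{dP}{dQ}\right) dP,$$
where $\frac{dP}{dQ}$ is the Radon–Nikodym derivative of $P$ with respect to $Q$, and provided the expression on the right-hand side exists. Equivalently, this can be written as
$$D_{\text{KL}}(P \parallel Q) = \int_{\mathcal{X}} \log\left(\frac{dP}{dQ}\right) \frac{dP}{dQ} \, dQ,$$
which is the entropy of $P$ relative to $Q$. Continuing in this case, if $\mu$ is any measure on $\mathcal{X}$ for which $p = \frac{dP}{d\mu}$ and $q = \frac{dQ}{d\mu}$ exist, then the Kullback–Leibler divergence from $Q$ to $P$ is given as
$$D_{\text{KL}}(P \parallel Q) = \int_{\mathcal{X}} p \log\left(\frac{p}{q}\right) d\mu.$$
The logarithms in these formulae are taken to base 2 if information is measured in units of bits, or to base $e$ if information is measured in nats. Most formulas involving the Kullback–Leibler divergence hold regardless of the base of the logarithm.
Various conventions exist for referring to $D_{\text{KL}}(P \parallel Q)$ in words. Often it is referred to as the divergence between $P$ and $Q$, but this fails to convey the fundamental asymmetry in the relation. Sometimes, as in this article, it may be described as the divergence of $P$ from $Q$ or as the divergence from $Q$ to $P$. This reflects the asymmetry in Bayesian inference, which starts from a prior $Q$ and updates to the posterior $P$. Another common way to refer to $D_{\text{KL}}(P \parallel Q)$ is as the relative entropy of $P$ with respect to $Q$.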
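The discrete definition and its zero-probability conventions can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `kl_divergence`, the `base` parameter, and the choice to return infinity when absolute continuity fails are assumptions made here for the example.

```python
import math

def kl_divergence(p, q, base=math.e):
    """Discrete D_KL(P || Q) for two equal-length lists of probabilities."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # convention: a term with P(x) = 0 contributes zero
        if qx == 0.0:
            # P(x) > 0 while Q(x) = 0: absolute continuity fails.
            # Returning infinity is an assumed convention for this sketch.
            return math.inf
        total += px * math.log(px / qx)
    return total / math.log(base)  # base e gives nats, base 2 gives bits

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
print(kl_divergence(p, q, base=2))  # 1.0 (bits)
print(kl_divergence(q, p, base=2))  # inf
```

The two calls also show the asymmetry: swapping the arguments changes the result.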

Basic example

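Kullback gives the following example. Let $P$ and $Q$ be the distributions shown in the table and figure. $P$ is the distribution on the left side of the figure, a binomial distribution with $N = 2$ and $p = 0.4$. $Q$ is the distribution on the right side of the figure, a discrete uniform distribution with the three possible outcomes $x = 0$, $1$, or $2$, each with probability $p = 1/3$.

x                 0       1       2
Distribution P    0.36    0.48    0.16
Distribution Q    0.333   0.333   0.333

The KL divergences $D_{\text{KL}}(P \parallel Q)$ and $D_{\text{KL}}(Q \parallel P)$ are calculated as follows. This example uses the natural log with base $e$, designated $\ln$, to get results in nats.
$$D_{\text{KL}}(P \parallel Q) = \sum_{x} P(x) \ln\left(\frac{P(x)}{Q(x)}\right) = 0.36 \ln\left(\frac{0.36}{1/3}\right) + 0.48 \ln\left(\frac{0.48}{1/3}\right) + 0.16 \ln\left(\frac{0.16}{1/3}\right) \approx 0.0853$$
$$D_{\text{KL}}(Q \parallel P) = \sum_{x} Q(x) \ln\left(\frac{Q(x)}{P(x)}\right) = \frac{1}{3} \ln\left(\frac{1/3}{0.36}\right) + \frac{1}{3} \ln\left(\frac{1/3}{0.48}\right) + \frac{1}{3} \ln\left(\frac{1/3}{0.16}\right) \approx 0.0975$$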
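The calculation for this example can be checked numerically. The sketch below uses the table values, with $Q$ taken as exactly 1/3 per outcome; the helper name `kl_nats` is an assumption for illustration.

```python
import math

P = [0.36, 0.48, 0.16]      # binomial with N = 2, p = 0.4
Q = [1 / 3, 1 / 3, 1 / 3]   # discrete uniform on {0, 1, 2}

def kl_nats(a, b):
    """D_KL(a || b) in nats; assumes every entry is strictly positive."""
    return sum(ax * math.log(ax / bx) for ax, bx in zip(a, b))

d_pq = kl_nats(P, Q)
d_qp = kl_nats(Q, P)
print(round(d_pq, 4), round(d_qp, 4))  # 0.0853 0.0975
```

The two directions give different values, which is the asymmetry discussed in the definition.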

Interpretations

The Kullback–Leibler divergence from $Q$ to $P$ is often denoted $D_{\text{KL}}(P \parallel Q)$.
In the context of machine learning, $D_{\text{KL}}(P \parallel Q)$ is often called the information gain achieved if $P$ is used instead of $Q$. By analogy with information theory, it is also called the relative entropy of $P$ with respect to $Q$. In the context of coding theory, $D_{\text{KL}}(P \parallel Q)$ can be constructed by measuring the expected number of extra bits required to code samples from $P$ using a code optimized for $Q$ rather than the code optimized for $P$.
Expressed in the language of Bayesian inference, $D_{\text{KL}}(P \parallel Q)$ is a measure of the information gained by revising one's beliefs from the prior probability distribution $Q$ to the posterior probability distribution $P$. In other words, it is the amount of information lost when $Q$ is used to approximate $P$. In applications, $P$ typically represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution, while $Q$ typically represents a theory, model, description, or approximation of $P$. In order to find a distribution $Q$ that is closest to $P$, we can minimize the KL divergence and compute an information projection.
The Kullback–Leibler divergence is a special case of a broader class of statistical divergences called f-divergences as well as the class of Bregman divergences. It is the only such divergence over probabilities that is a member of both classes. Although it is often intuited as a way of measuring the distance between probability distributions, the Kullback–Leibler divergence is not a true metric. It does not obey the triangle inequality, and in general $D_{\text{KL}}(P \parallel Q)$ does not equal $D_{\text{KL}}(Q \parallel P)$. However, its infinitesimal form, specifically its Hessian, gives a metric tensor known as the Fisher information metric.
Arthur Hobson proved that the Kullback–Leibler divergence is the only measure of difference between probability distributions that satisfies some desired properties, which are the canonical extension to those appearing in a commonly used characterization of entropy. Consequently, mutual information is the only measure of mutual dependence that obeys certain related conditions, since it can be defined in terms of Kullback–Leibler divergence.

Motivation

In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value $x_i$ out of a set of possibilities $X$ can be seen as representing an implicit probability distribution $q(x_i) = 2^{-\ell_i}$ over $X$, where $\ell_i$ is the length of the code for $x_i$ in bits. Therefore, the Kullback–Leibler divergence can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution $Q$ is used, compared to using a code based on the true distribution $P$:
$$D_{\text{KL}}(P \parallel Q) = -\sum_{x \in \mathcal{X}} p(x) \log q(x) + \sum_{x \in \mathcal{X}} p(x) \log p(x) = \mathrm{H}(P, Q) - \mathrm{H}(P),$$
where $\mathrm{H}(P, Q)$ is the cross entropy of $P$ and $Q$, and $\mathrm{H}(P)$ is the entropy of $P$.
The KL divergence $D_{\text{KL}}(P \parallel Q)$ can be thought of as something like a measurement of how far the distribution $Q$ is from the distribution $P$. The cross-entropy $\mathrm{H}(P, Q)$ is itself such a measurement, but it has the defect that $\mathrm{H}(P, P) = \mathrm{H}(P)$ isn't zero, so we subtract $\mathrm{H}(P)$ to make $D_{\text{KL}}(P \parallel Q)$ agree more closely with our notion of distance.
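The coding interpretation can be illustrated with a small numerical sketch. The distribution `p`, the code lengths, and all function names below are hypothetical choices for the example; the code lengths are picked so that the implied distribution is dyadic and every quantity is exact.

```python
import math

def implicit_distribution(code_lengths):
    """Kraft-McMillan view: code lengths l_i imply q_i = 2 ** (-l_i)."""
    return [2.0 ** -length for length in code_lengths]

def expected_length(p, code_lengths):
    """Average number of bits per symbol when symbols are drawn from p."""
    return sum(px * length for px, length in zip(p, code_lengths))

def entropy_bits(p):
    return -sum(px * math.log2(px) for px in p if px > 0)

def kl_bits(p, q):
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]        # "true" distribution (illustrative)
lengths_for_q = [2, 2, 1]    # a code that is optimal for q = (1/4, 1/4, 1/2)
q = implicit_distribution(lengths_for_q)

cross_entropy = expected_length(p, lengths_for_q)  # H(P, Q) = 1.75 bits
extra_bits = cross_entropy - entropy_bits(p)       # H(P, Q) - H(P)
print(extra_bits, kl_bits(p, q))                   # 0.25 0.25
```

The expected overhead of using the mismatched code, 0.25 bits per symbol, coincides with the divergence of $P$ from the implied $Q$.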
There is a relation between the Kullback–Leibler divergence and the "rate function" in the theory of large deviations.

Properties

Multivariate normal distributions

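Suppose that we have two multivariate normal distributions, with means $\mu_0, \mu_1$ and with (non-singular) covariance matrices $\Sigma_0, \Sigma_1$. If the two distributions have the same dimension, $k$, then the Kullback–Leibler divergence between the distributions is as follows:
$$D_{\text{KL}}(\mathcal{N}_0 \parallel \mathcal{N}_1) = \frac{1}{2}\left( \operatorname{tr}\left(\Sigma_1^{-1}\Sigma_0\right) + (\mu_1 - \mu_0)^{\mathsf{T}} \Sigma_1^{-1} (\mu_1 - \mu_0) - k + \ln\left(\frac{\det \Sigma_1}{\det \Sigma_0}\right) \right).$$
The logarithm in the last term must be taken to base $e$ since all terms apart from the last are base-$e$ logarithms of expressions that are either factors of the density function or otherwise arise naturally. The equation therefore gives a result measured in nats. Dividing the entire expression above by $\ln(2)$ yields the divergence in bits.
A special case, and a common quantity in variational inference, is the KL divergence between a diagonal multivariate normal and a standard normal distribution (zero mean and identity covariance):
$$D_{\text{KL}}\left( \mathcal{N}\left((\mu_1, \ldots, \mu_k)^{\mathsf{T}}, \operatorname{diag}(\sigma_1^2, \ldots, \sigma_k^2)\right) \parallel \mathcal{N}(\mathbf{0}, \mathbf{I}) \right) = \frac{1}{2} \sum_{i=1}^{k} \left( \sigma_i^2 + \mu_i^2 - 1 - \ln\left(\sigma_i^2\right) \right).$$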
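As a rough check of the Gaussian formulas, the sketch below restricts the general expression to diagonal covariance matrices (so the trace, quadratic form, and determinant reduce to sums over coordinates) and compares it with the standard-normal special case. The function names and the test values are assumptions for illustration only.

```python
import math

def kl_diag_gaussians(mu0, var0, mu1, var1):
    """D_KL(N0 || N1) in nats when both covariances are diagonal.

    var0 and var1 are lists of per-coordinate variances.
    """
    k = len(mu0)
    trace_term = sum(v0 / v1 for v0, v1 in zip(var0, var1))
    quad_term = sum((m1 - m0) ** 2 / v1 for m0, m1, v1 in zip(mu0, mu1, var1))
    logdet_term = sum(math.log(v1 / v0) for v0, v1 in zip(var0, var1))
    return 0.5 * (trace_term + quad_term - k + logdet_term)

def kl_to_standard_normal(mu, var):
    """Special case: D_KL(N(mu, diag(var)) || N(0, I)) in nats."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mu, var))

mu, var = [0.5, -1.0], [2.0, 0.25]
general = kl_diag_gaussians(mu, var, [0.0, 0.0], [1.0, 1.0])
special = kl_to_standard_normal(mu, var)
print(general, special)            # the two values agree
print(general / math.log(2))       # the same divergence expressed in bits
```

A full implementation for non-diagonal covariances would need a matrix inverse and determinant, which are deliberately left out of this sketch.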

Relation to metrics

One might be tempted to call the Kullback–Leibler divergence a "distance metric" on the space of probability distributions, but this would not be correct as it is not symmetric (that is, $D_{\text{KL}}(P \parallel Q) \neq D_{\text{KL}}(Q \parallel P)$ in general), nor does it satisfy the triangle inequality. Even so, being a premetric, it generates a topology on the space of probability distributions. More concretely, if $\{P_1, P_2, \ldots\}$ is a sequence of distributions such that
$$\lim_{n \to \infty} D_{\text{KL}}(P_n \parallel Q) = 0,$$
then it is said that
$$P_n \xrightarrow{D} Q.$$
Pinsker's inequality entails that
$$P_n \xrightarrow{D} P \Rightarrow P_n \xrightarrow{TV} P,$$
where the latter stands for the usual convergence in total variation.
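Pinsker's inequality, in the form $\delta(P, Q) \le \sqrt{\tfrac{1}{2} D_{\text{KL}}(P \parallel Q)}$ with the divergence in nats, can be observed numerically. The sketch below uses a hypothetical sequence $P_n$ approaching a uniform $Q$; the sequence and all names are assumptions for illustration.

```python
import math

def kl_nats(p, q):
    """D_KL(p || q) in nats, skipping terms where p is zero."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(px - qx) for px, qx in zip(p, q))

q = [1 / 3, 1 / 3, 1 / 3]
results = []
for n in (1, 10, 100, 1000):
    eps = 0.3 / n  # hypothetical perturbation that shrinks as n grows
    p_n = [1 / 3 + eps, 1 / 3, 1 / 3 - eps]
    tv = total_variation(p_n, q)
    bound = math.sqrt(0.5 * kl_nats(p_n, q))  # Pinsker: tv <= bound
    results.append((n, tv, bound))
    print(n, tv, bound, tv <= bound)
```

As the divergence tends to zero, the bound forces the total variation distance to zero as well, which is the implication stated above.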

Fisher information metric

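The Kullback–Leibler divergence is directly related to the Fisher information metric. This can be made explicit as follows. Assume that the probability distributions $P$ and $Q$ are both parameterized by some (possibly multi-dimensional) parameter $\theta$. Consider then two close-by values $P = P(\theta)$ and $Q = P(\theta_0)$, so that the parameter $\theta$ differs by only a small amount from the parameter value $\theta_0$. Specifically, up to first order one has (using the Einstein summation convention)
$$P(\theta) = P(\theta_0) + \Delta\theta_j P_j(\theta_0) + \cdots$$
with $\Delta\theta_j = (\theta - \theta_0)_j$ a small change of $\theta$ in the $j$ direction, and $P_j(\theta_0) = \frac{\partial P}{\partial \theta_j}(\theta_0)$ the corresponding rate of change in the probability distribution. Since the Kullback–Leibler divergence has an absolute minimum 0 for $P = Q$, i.e. $\theta = \theta_0$, it changes only to second order in the small parameters $\Delta\theta_j$. More formally, as for any minimum, the first derivatives of the divergence vanish
$$\left.\frac{\partial}{\partial \theta_j}\right|_{\theta = \theta_0} D_{\text{KL}}(P(\theta) \parallel P(\theta_0)) = 0,$$
and by the Taylor expansion one has up to second order
$$D_{\text{KL}}(P(\theta) \parallel P(\theta_0)) = \frac{1}{2} \Delta\theta_j \Delta\theta_k \, g_{jk}(\theta_0) + \cdots$$
where the Hessian matrix of the divergence
$$g_{jk}(\theta_0) = \left.\frac{\partial^2}{\partial \theta_j \, \partial \theta_k}\right|_{\theta = \theta_0} D_{\text{KL}}(P(\theta) \parallel P(\theta_0))$$
must be positive semidefinite. Letting $\theta_0$ vary (and dropping the subindex 0), the Hessian $g_{jk}(\theta)$ defines a (possibly degenerate) Riemannian metric on the $\theta$ parameter space, called the Fisher information metric.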
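The second-order behaviour can be checked on a one-parameter family. The sketch below uses a Bernoulli distribution, chosen here only as a convenient example, whose Fisher information is $1/(\theta_0(1-\theta_0))$; the function names are assumptions for illustration.

```python
import math

def kl_bernoulli(theta, theta0):
    """D_KL(Bernoulli(theta) || Bernoulli(theta0)) in nats."""
    return (theta * math.log(theta / theta0)
            + (1 - theta) * math.log((1 - theta) / (1 - theta0)))

def fisher_information_bernoulli(theta0):
    """Fisher information of the Bernoulli family at theta0."""
    return 1.0 / (theta0 * (1.0 - theta0))

theta0 = 0.3
g = fisher_information_bernoulli(theta0)
ratios = []
for delta in (1e-1, 1e-2, 1e-3):
    exact = kl_bernoulli(theta0 + delta, theta0)
    quadratic = 0.5 * g * delta ** 2   # second-order Taylor term
    ratios.append(exact / quadratic)
    print(delta, exact, quadratic, exact / quadratic)  # ratio approaches 1
```

The ratio of the exact divergence to the quadratic approximation tends to 1 as the parameter change shrinks, consistent with the Hessian being the Fisher information.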

Fisher information metric theorem

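When the density $p(x, \rho)$ satisfies suitable regularity conditions (the derivatives of $\log p$ with respect to $\rho$ up to third order exist and are dominated by integrable functions, with a bound $\xi$ that is independent of $\rho$), then:
$$D(p(x, 0) \parallel p(x, \rho)) = \frac{c \rho^2}{2} + \mathcal{O}\left(\rho^3\right) \text{ as } \rho \to 0,$$
where $c$ is the Fisher information at $\rho = 0$.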

Variation of information

Another information-theoretic metric is variation of information, which is roughly a symmetrization of conditional entropy. It is a metric on the set of partitions of a discrete probability space.

Relation to other quantities of information theory

Many of the other quantities of information theory can be interpreted as applications of the Kullback–Leibler divergence to specific cases.

Self-information

The self-information, also known as the information content of a signal, random variable, or event is defined as the negative logarithm of the probability of the given outcome occurring.
When applied to a discrete random variable, the self-information can be represented as
$$\operatorname{I}(m) = D_{\text{KL}}\left(\delta_{im} \parallel \{p_i\}\right),$$
which is the Kullback–Leibler divergence of the probability distribution $P(i)$ from a Kronecker delta representing certainty that $i = m$; that is, the number of extra bits that must be transmitted to identify $i$ if only the probability distribution $P(i)$ is available to the receiver, not the fact that $i = m$.

Mutual information

The mutual information,
$$\operatorname{I}(X; Y) = D_{\text{KL}}(P(X, Y) \parallel P(X)P(Y)) = \operatorname{E}_X\{D_{\text{KL}}(P(Y \mid X) \parallel P(Y))\} = \operatorname{E}_Y\{D_{\text{KL}}(P(X \mid Y) \parallel P(X))\},$$
is the Kullback–Leibler divergence of the product $P(X)P(Y)$ of the two marginal probability distributions from the joint probability distribution $P(X, Y)$; that is, the expected number of extra bits that must be transmitted to identify $X$ and $Y$ if they are coded using only their marginal distributions instead of the joint distribution. Equivalently, if the joint probability $P(X, Y)$ is known, it is the expected number of extra bits that must on average be sent to identify $Y$ if the value of $X$ is not already known to the receiver.
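Mutual information as a divergence between the joint distribution and the product of its marginals can be computed directly. The joint table below is a hypothetical example over two binary variables, and the variable names are assumptions for illustration.

```python
import math

# Hypothetical joint distribution P(X, Y) over two binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

p_x = [sum(row) for row in joint]          # marginal of X
p_y = [sum(col) for col in zip(*joint)]    # marginal of Y

# I(X;Y) = D_KL(P(X,Y) || P(X)P(Y)), in bits.
mutual_information = sum(
    joint[i][j] * math.log2(joint[i][j] / (p_x[i] * p_y[j]))
    for i in range(2) for j in range(2) if joint[i][j] > 0
)

# Equivalent form: the expectation over X of D_KL(P(Y|X) || P(Y)).
expected_conditional_kl = sum(
    p_x[i] * sum(
        (joint[i][j] / p_x[i]) * math.log2((joint[i][j] / p_x[i]) / p_y[j])
        for j in range(2) if joint[i][j] > 0
    )
    for i in range(2)
)
print(mutual_information, expected_conditional_kl)  # the two forms agree
```

If the joint table were replaced by the outer product of its marginals, both quantities would be zero, reflecting independence.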

Shannon entropy

The Shannon entropy,
$$\mathrm{H}(X) = \operatorname{E}[\operatorname{I}_X(x)] = \log(N) - D_{\text{KL}}\left(p_X(x) \parallel P_U(X)\right),$$
is the number of bits which would have to be transmitted to identify $X$ from $N$ equally likely possibilities, less the Kullback–Leibler divergence of the uniform distribution on the random variates of $X$, $P_U(X)$, from the true distribution $P(X)$; that is, less the expected number of bits saved, which would have had to be sent if the value of $X$ were coded according to the uniform distribution $P_U(X)$ rather than the true distribution $P(X)$.
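The identity between entropy and the divergence from the uniform distribution is easy to verify numerically. The distribution below is a hypothetical dyadic example, so that all values are exact in floating point; the helper names are assumptions.

```python
import math

def entropy_bits(p):
    return -sum(px * math.log2(px) for px in p if px > 0)

def kl_bits(p, q):
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.125, 0.125]   # hypothetical distribution of X
n = len(p)
uniform = [1.0 / n] * n         # P_U(X)

lhs = entropy_bits(p)                       # H(X)
rhs = math.log2(n) - kl_bits(p, uniform)    # log N - D_KL(P || P_U)
print(lhs, rhs)                             # 1.75 1.75
```

Coding with the uniform distribution costs $\log_2 4 = 2$ bits per symbol; the divergence of 0.25 bits is the saving obtained by coding with the true distribution.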

Conditional entropy

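The conditional entropy,
$$\mathrm{H}(X \mid Y) = \log(N) - D_{\text{KL}}(P(X, Y) \parallel P_U(X)P(Y)) = \log(N) - D_{\text{KL}}(P(X, Y) \parallel P(X)P(Y)) - D_{\text{KL}}(P(X) \parallel P_U(X)) = \mathrm{H}(X) - \operatorname{I}(X; Y) = \log(N) - \operatorname{E}_Y\left[D_{\text{KL}}\left(P(X \mid Y) \parallel P_U(X)\right)\right],$$
is the number of bits which would have to be transmitted to identify $X$ from $N$ equally likely possibilities, less the Kullback–Leibler divergence of the product distribution $P_U(X)P(Y)$ from the true joint distribution $P(X, Y)$; that is, less the expected number of bits saved which would have had to be sent if the value of $X$ were coded according to the uniform distribution $P_U(X)$ rather than the conditional distribution $P(X \mid Y)$ of $X$ given $Y$.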

Cross entropy

When we have a set of possible events, coming from the distribution $p$, we can encode them (with lossless data compression) using entropy encoding. This compresses the data by replacing each fixed-length input symbol with a corresponding unique, variable-length, prefix-free code (e.g. the events (A, B, C) with probabilities $p = (1/2, 1/4, 1/4)$ can be encoded as the bits (0, 10, 11)). If we know the distribution $p$ in advance, we can devise an encoding that is optimal, meaning that the messages we encode will have the shortest length on average (assuming the encoded events are sampled from $p$); this length is equal to the Shannon entropy of $p$, denoted $\mathrm{H}(p)$. However, if we use a different probability distribution $q$ when creating the entropy encoding scheme, then a larger number of bits will be needed, on average, to identify an event from a set of possibilities. This new, larger number is measured by the cross entropy between $p$ and $q$.
The cross entropy between two probability distributions $p$ and $q$ measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution $q$, rather than the "true" distribution $p$. The cross entropy for two distributions $p$ and $q$ over the same probability space is thus defined as follows:
$$\mathrm{H}(p, q) = \operatorname{E}_p[-\log(q)] = \mathrm{H}(p) + D_{\text{KL}}(p \parallel q).$$
Under this scenario, the KL divergence can be interpreted as the extra number of bits, on average, that are needed (beyond $\mathrm{H}(p)$) for encoding the events because of using $q$ for constructing the encoding scheme instead of $p$.

Bayesian updating

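In Bayesian statistics the Kullback–Leibler divergence can be used as a measure of the information gain in moving from a prior distribution to a posterior distribution: $p(x) \to p(x \mid I)$. If some new fact $Y_1 = y_1$ is discovered, it can be used to update the posterior distribution for $X$ from $p(x \mid I)$ to a new posterior distribution $p(x \mid y_1, I)$ using Bayes' theorem:
$$p(x \mid y_1, I) = \frac{p(y_1 \mid x, I) \, p(x \mid I)}{p(y_1 \mid I)}.$$
This distribution has a new entropy:
$$\mathrm{H}\big(p(x \mid y_1, I)\big) = -\sum_x p(x \mid y_1, I) \log p(x \mid y_1, I),$$
which may be less than or greater than the original entropy $\mathrm{H}(p(x \mid I))$. However, from the standpoint of the new probability distribution one can estimate that to have used the original code based on $p(x \mid I)$ instead of a new code based on $p(x \mid y_1, I)$ would have added an expected number of bits:
$$D_{\text{KL}}\big(p(x \mid y_1, I) \parallel p(x \mid I)\big) = \sum_x p(x \mid y_1, I) \log\left(\frac{p(x \mid y_1, I)}{p(x \mid I)}\right)$$
to the message length. This therefore represents the amount of useful information, or information gain, about $X$, that we can estimate has been learned by discovering $Y_1 = y_1$.
If a further piece of data, $Y_2 = y_2$, subsequently comes in, the probability distribution for $x$ can be updated further, to give a new best guess $p(x \mid y_1, y_2, I)$. If one reinvestigates the information gain for using $p(x \mid y_1, I)$ rather than $p(x \mid I)$, it turns out that it may be either greater or less than previously estimated:
$$\sum_x p(x \mid y_1, y_2, I) \log\left(\frac{p(x \mid y_1, I)}{p(x \mid I)}\right) \text{ may be } \leq \text{ or } > \text{ than } \sum_x p(x \mid y_1, I) \log\left(\frac{p(x \mid y_1, I)}{p(x \mid I)}\right),$$
and so the combined information gain does not obey the triangle inequality:
$$D_{\text{KL}}\big(p(x \mid y_1, y_2, I) \parallel p(x \mid I)\big) \text{ may be } <, = \text{ or } > \text{ than } D_{\text{KL}}\big(p(x \mid y_1, y_2, I) \parallel p(x \mid y_1, I)\big) + D_{\text{KL}}\big(p(x \mid y_1, I) \parallel p(x \mid I)\big).$$
All one can say is that on average, averaging using $p(y_2 \mid y_1, x, I)$, the two sides will average out.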
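A small numerical sketch of Bayesian updating follows. The setup is hypothetical: the unknown $x$ is a coin's heads-probability, restricted to three candidate values with a uniform prior, and two heads are observed in sequence. All names are assumptions for illustration.

```python
import math

def bayes_update(prior, likelihood):
    """Posterior from a prior p(x) and the likelihood p(y|x) of the observed y."""
    unnormalized = [pr * lk for pr, lk in zip(prior, likelihood)]
    evidence = sum(unnormalized)  # p(y | I)
    return [u / evidence for u in unnormalized]

def kl_bits(p, q):
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

heads_prob = [0.2, 0.5, 0.8]     # candidate values of x (hypothetical)
prior = [1 / 3, 1 / 3, 1 / 3]    # p(x | I)

after_one = bayes_update(prior, heads_prob)       # after observing y1 = heads
after_two = bayes_update(after_one, heads_prob)   # after observing y2 = heads

gain_1 = kl_bits(after_one, prior)       # information gain from y1
gain_2 = kl_bits(after_two, after_one)   # further gain from y2
gain_total = kl_bits(after_two, prior)   # gain from both, relative to the prior
print(gain_1, gain_2, gain_total)
print(gain_total - (gain_1 + gain_2))    # nonzero: step-wise gains do not simply add
```

In this particular example the total gain exceeds the sum of the step-wise gains; other data can make the comparison go the other way, in line with the statement that only the averages balance.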

Bayesian experimental design

A common goal in Bayesian experimental design is to maximise the expected Kullback–Leibler divergence between the prior and the posterior. When posteriors are approximated to be Gaussian distributions, a design maximising the expected Kullback–Leibler divergence is called Bayes d-optimal.

Discrimination information

The Kullback–Leibler divergence $D_{\text{KL}}(p(x \mid H_1) \parallel p(x \mid H_0))$ can also be interpreted as the expected discrimination information for $H_1$ over $H_0$: the mean information per sample for discriminating in favor of a hypothesis $H_1$ against a hypothesis $H_0$, when hypothesis $H_1$ is true. Another name for this quantity, given to it by I. J. Good, is the expected weight of evidence for $H_1$ over $H_0$ to be expected from each sample.
The expected weight of evidence for $H_1$ over $H_0$ is not the same as the information gain expected per sample about the probability distribution $p(H)$ of the hypotheses:
$$D_{\text{KL}}(p(x \mid H_1) \parallel p(x \mid H_0)) \neq IG = D_{\text{KL}}(p(H \mid x) \parallel p(H \mid I)).$$
Either of the two quantities can be used as a utility function in Bayesian experimental design, to choose an optimal next question to investigate: but they will in general lead to rather different experimental strategies.
On the entropy scale of information gain there is very little difference between near certainty and absolute certainty—coding according to a near certainty requires hardly any more bits than coding according to an absolute certainty. On the other hand, on the logit scale implied by weight of evidence, the difference between the two is enormous – infinite perhaps; this might reflect the difference between being almost sure that, say, the Riemann hypothesis is correct, compared to being certain that it is correct because one has a mathematical proof. These two different scales of loss function for uncertainty are both useful, according to how well each reflects the particular circumstances of the problem in question.

Principle of minimum discrimination information

The idea of Kullback–Leibler divergence as discrimination information led Kullback to propose the Principle of Minimum Discrimination Information (MDI): given new facts, a new distribution $f$ should be chosen which is as hard to discriminate from the original distribution $f_0$ as possible, so that the new data produces as small an information gain $D_{\text{KL}}(f \parallel f_0)$ as possible.
For example, if one had a prior distribution $p(x, a)$ over $x$ and $a$, and subsequently learnt the true distribution of $a$ was $u(a)$, then the Kullback–Leibler divergence between the new joint distribution for $x$ and $a$, $q(x \mid a)u(a)$, and the earlier prior distribution would be:
$$D_{\text{KL}}(q(x \mid a)u(a) \parallel p(x, a)) = \operatorname{E}_{u(a)}\left\{D_{\text{KL}}(q(x \mid a) \parallel p(x \mid a))\right\} + D_{\text{KL}}(u(a) \parallel p(a)),$$
i.e. the sum of the Kullback–Leibler divergence of $p(a)$, the prior distribution for $a$, from the updated distribution $u(a)$, plus the expected value (using the probability distribution $u(a)$) of the Kullback–Leibler divergence of the prior conditional distribution $p(x \mid a)$ from the new conditional distribution $q(x \mid a)$. This is minimized if $q(x \mid a) = p(x \mid a)$ over the whole support of $u(a)$; and we note that this result incorporates Bayes' theorem, if the new distribution $u(a)$ is in fact a δ function representing certainty that $a$ has one particular value.
MDI can be seen as an extension of Laplace's Principle of Insufficient Reason, and the Principle of Maximum Entropy of E.T. Jaynes. In particular, it is the natural extension of the principle of maximum entropy from discrete to continuous distributions, for which Shannon entropy ceases to be so useful, but the Kullback–Leibler divergence continues to be just as relevant.
In the engineering literature, MDI is sometimes called the Principle of Minimum Cross-Entropy (MCE) or Minxent for short. Minimising the Kullback–Leibler divergence from $m$ to $p$ with respect to $m$ is equivalent to minimizing the cross-entropy of $p$ and $m$, since
$$\mathrm{H}(p, m) = \mathrm{H}(p) + D_{\text{KL}}(p \parallel m),$$
which is appropriate if one is trying to choose an adequate approximation to $p$. However, this is just as often not the task one is trying to achieve. Instead, just as often it is $m$ that is some fixed prior reference measure, and $p$ that one is attempting to optimise by minimising $D_{\text{KL}}(p \parallel m)$ subject to some constraint. This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be $D_{\text{KL}}(p \parallel m)$, rather than $\mathrm{H}(p, m)$.

Relationship to available work

Surprisals add where probabilities multiply. The surprisal for an event of probability $p$ is defined as $s = k \ln(1/p)$. If $k$ is $\{1, 1/\ln 2, 1.38 \times 10^{-23}\}$ then surprisal is in $\{$nats, bits, or J/K$\}$ so that, for instance, there are $N$ bits of surprisal for landing all "heads" on a toss of $N$ coins.
Best-guess states are inferred by maximizing the average surprisal $S$ (entropy) for a given set of control parameters. This constrained entropy maximization, both classically and quantum mechanically, minimizes Gibbs availability in entropy units $A \equiv -k \ln Z$, where $Z$ is a constrained multiplicity or partition function.
When temperature $T$ is fixed, free energy ($T \times A$) is also minimized. Thus if $T$, $V$ and number of molecules $N$ are constant, the Helmholtz free energy $F \equiv U - TS$ (where $U$ is energy) is minimized as a system "equilibrates." If $T$ and $P$ are held constant, the Gibbs free energy $G = U + PV - TS$ is minimized instead. The change in free energy under these conditions is a measure of available work that might be done in the process. Thus available work for an ideal gas at constant temperature $T_o$ and pressure $P_o$ is $W = \Delta G = NkT_o \Theta(V/V_o)$, where $V_o = NkT_o/P_o$ and $\Theta(x) = x - 1 - \ln x \geq 0$.
More generally the work available relative to some ambient is obtained by multiplying ambient temperature $T_o$ by Kullback–Leibler divergence or net surprisal $\Delta I \geq 0$, defined as the average value of $k \ln(p/p_o)$, where $p_o$ is the probability of a given state under ambient conditions. For instance, the work available in equilibrating a monatomic ideal gas to ambient values of $V_o$ and $T_o$ is thus $W = T_o \Delta I$, where Kullback–Leibler divergence
$$\Delta I = Nk\left[\Theta\left(\frac{V}{V_o}\right) + \frac{3}{2}\Theta\left(\frac{T}{T_o}\right)\right].$$
The resulting contours of constant Kullback–Leibler divergence, shown at right for a mole of argon at standard temperature and pressure, put limits, for example, on the conversion of hot to cold as in flame-powered air-conditioning or in an unpowered device to convert boiling water to ice water. Thus Kullback–Leibler divergence measures thermodynamic availability in bits.

Quantum information theory

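For density matrices $P$ and $Q$ on a Hilbert space, the K–L divergence (often called the quantum relative entropy in this case) from $Q$ to $P$ is defined to be
$$D_{\text{KL}}(P \parallel Q) = \operatorname{Tr}(P(\log(P) - \log(Q))).$$
In quantum information science the minimum of $D_{\text{KL}}(P \parallel Q)$ over all separable states $Q$ can also be used as a measure of entanglement in the state $P$.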

Relationship between models and reality

Just as Kullback–Leibler divergence of "actual from ambient" measures thermodynamic availability, Kullback–Leibler divergence of "reality from a model" is also useful even if the only clues we have about reality are some experimental measurements. In the former case Kullback–Leibler divergence describes distance to equilibrium or the amount of available work, while in the latter case it tells you about surprises that reality has up its sleeve or, in other words, how much the model has yet to learn.
Although this tool for evaluating models against systems that are accessible experimentally may be applied in any field, its application to selecting a statistical model via the Akaike information criterion is particularly well described in papers and a book by Burnham and Anderson. In a nutshell, the Kullback–Leibler divergence of reality from a model may be estimated, to within a constant additive term, by a function of the deviations observed between data and the model's predictions. Estimates of such divergence for models that share the same additive term can in turn be used to select among models.
When trying to fit parametrized models to data there are various estimators which attempt to minimize Kullback–Leibler divergence, such as maximum likelihood and maximum spacing estimators.

Symmetrised divergence

Kullback and Leibler themselves actually defined the divergence as:
$$D_{\text{KL}}(P \parallel Q) + D_{\text{KL}}(Q \parallel P),$$
which is symmetric and nonnegative. This quantity has sometimes been used for feature selection in classification problems, where $P$ and $Q$ are the conditional pdfs of a feature under two different classes. In the banking and finance industries, this quantity is referred to as the Population Stability Index, and is used to assess distributional shifts in model features through time.
An alternative is given via the $\lambda$ divergence,
$$D_\lambda(P \parallel Q) = \lambda D_{\text{KL}}(P \parallel \lambda P + (1 - \lambda)Q) + (1 - \lambda) D_{\text{KL}}(Q \parallel \lambda P + (1 - \lambda)Q),$$
which can be interpreted as the expected information gain about $X$ from discovering which probability distribution $X$ is drawn from, $P$ or $Q$, if they currently have probabilities $\lambda$ and $1 - \lambda$ respectively.
The value $\lambda = 0.5$ gives the Jensen–Shannon divergence, defined by
$$D_{\text{JS}} = \frac{1}{2} D_{\text{KL}}(P \parallel M) + \frac{1}{2} D_{\text{KL}}(Q \parallel M),$$
where $M$ is the average of the two distributions,
$$M = \frac{1}{2}(P + Q).$$
$D_{\text{JS}}$ can also be interpreted as the capacity of a noisy information channel with two inputs giving the output distributions $P$ and $Q$. The Jensen–Shannon divergence, like all f-divergences, is locally proportional to the Fisher information metric. It is similar to the Hellinger metric.
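The symmetrised quantities can be compared on a small example. This sketch reuses the binomial and uniform distributions from the basic example as test inputs; the function names are assumptions for illustration, and the $\lambda$ divergence follows the mixture form $\lambda P + (1 - \lambda) Q$.

```python
import math

def kl_nats(p, q):
    """D_KL(p || q) in nats, skipping terms where p is zero."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def symmetrised_kl(p, q):
    """D_KL(P || Q) + D_KL(Q || P); needs P and Q to share their support."""
    return kl_nats(p, q) + kl_nats(q, p)

def lambda_divergence(p, q, lam):
    mix = [lam * px + (1 - lam) * qx for px, qx in zip(p, q)]
    return lam * kl_nats(p, mix) + (1 - lam) * kl_nats(q, mix)

def jensen_shannon(p, q):
    return lambda_divergence(p, q, 0.5)

p = [0.36, 0.48, 0.16]
q = [1 / 3, 1 / 3, 1 / 3]
print(symmetrised_kl(p, q), symmetrised_kl(q, p))   # identical: symmetric
print(jensen_shannon(p, q), jensen_shannon(q, p))   # identical: symmetric
# Distributions with disjoint supports: the Jensen-Shannon divergence
# stays finite and equals ln 2 in nats.
print(jensen_shannon([1.0, 0.0], [0.0, 1.0]), math.log(2))
```

Unlike the symmetrised sum, the Jensen–Shannon divergence remains finite when the supports differ, because each distribution is compared with the mixture rather than with the other distribution.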

Relationship to other probability-distance measures

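There are many other important measures of probability distance. Some of these are particularly connected with the Kullback–Leibler divergence. For example, the total variation distance $\delta(P, Q)$ is connected to the divergence through Pinsker's inequality, $\delta(P, Q) \leq \sqrt{\tfrac{1}{2} D_{\text{KL}}(P \parallel Q)}$ (with the divergence measured in nats), and the family of Rényi divergences generalizes the Kullback–Leibler divergence.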
Other notable measures of distance include the Hellinger distance, histogram intersection, Chi-squared statistic, quadratic form distance, match distance, Kolmogorov–Smirnov distance, and earth mover's distance.

Data differencing

Just as absolute entropy serves as theoretical background for data compression, relative entropy serves as theoretical background for data differencing – the absolute entropy of a set of data in this sense being the data required to reconstruct it, while the relative entropy of a target set of data, given a source set of data, is the data required to reconstruct the target given the source.