Diffusion map
Diffusion maps is a dimensionality reduction or feature extraction algorithm introduced by Coifman and Lafon which computes a family of embeddings of a data set into Euclidean space whose coordinates can be computed from the eigenvectors and eigenvalues of a diffusion operator on the data. The Euclidean distance between points in the embedded space is equal to the "diffusion distance" between probability distributions centered at those points. Different from linear dimensionality reduction methods such as principal component analysis and multi-dimensional scaling, diffusion maps is part of the family of nonlinear dimensionality reduction methods which focus on discovering the underlying manifold that the data has been sampled from. By integrating local similarities at different scales, diffusion maps give a global description of the data-set. Compared with other methods, the diffusion map algorithm is robust to noise perturbation and computationally inexpensive.
Definition of diffusion maps
Following and, diffusion maps can be defined in four steps.Connectivity
Diffusion maps exploit the relationship between heat diffusion and random walk Markov chain. The basic observation is that if we take a random walk on the data, walking to a nearby data-point is more likely than walking to another that is far away. Let be a measure space, where is the data set and represents the distribution of the points on.Based on this, the connectivity between two data points, and, can be defined as the probability of walking from to in one step of the random walk. Usually, this probability is specified in terms of a kernel function of the two points:. For example, the popular Gaussian kernel:
More generally, the kernel function has the following properties
.
The kernel constitutes the prior definition of the local geometry of the data-set. Since a given kernel will capture a specific feature of the data set, its choice should be guided by the application that one has in mind. This is a major difference with methods such as principal component analysis, where correlations between all data points are taken into account at once.
Given, we can then construct a reversible Markov chain on :
and define:
Although the new normalized kernel does not inherit the symmetric property, it does inherit the positivity-preserving property and gains a conservation property:
Diffusion process
From we can construct a transition matrix of a Markov chain on. In other words, represents the one-step transition probability from to, and gives the t-step transition matrix.We define the diffusion matrix
We then define the new kernel
or equivalently,
where D is a diagonal matrix and
We apply the graph Laplacian normalization to this new kernel:
where is a diagonal matrix and
One of the main ideas of diffusion framework is that running the chain forward in time reveals the geometric structure of at larger and larger scales. Specifically, the notion of a cluster in the data set is quantified as a region in which the probability of escaping this region is low. Therefore, t not only serves as a time parameter, but also has the dual role of scale parameter.
The eigendecomposition of the matrix yields
where is the sequence of eigenvalues of and and are the biorthogonal right and left eigenvectors respectively.
Due to the spectrum decay of the eigenvalues, only a few terms are necessary to achieve a given relative accuracy in this sum.
Parameter and the Diffusion Operator
The reason to introduce the normalization step involving is to tune the influence of the data point density on the infinitesimal transition of the diffusion. In some applications, the sampling of the data is generally not related to the geometry of the manifold we are interested in describing. In this case, we can set and the diffusion operator approximates the Laplace–Beltrami operator. We then recover the Riemannian geometry of the data set regardless of the distribution of the points. To describe the long-term behavior of the point distribution of a system of stochastic differential equations, we can use and the resulting Markov chain approximates the Fokker–Planck diffusion. With, it reduces to the classical graph Laplacian normalization.Diffusion distance
The diffusion distance at time between two points can be measured as the similarity of two points in the observation space with the connectivity between them. It is given bywhere is the stationary distribution of the Markov chain, given by the first left eigenvector of. Explicitly:
Intuitively, is small if there is a large number of short paths connecting and. There are several interesting features associated with the diffusion distance, based on our previous discussion that also serves as a scale parameter:
- Points are closer at a given scale if they are highly connected in the graph, therefore emphasizing the concept of a cluster.
- This distance is robust to noise, since the distance between two points depends on all possible paths of length between the points.
- From a machine learning point of view, the distance takes into account all evidences linking to, allowing us to conclude that this distance is appropriate for the design of inference algorithms based on the majority of preponderance.
Diffusion process and low-dimensional embedding
So the eigenvectors can be used as a new set of coordinates for the data. The diffusion map is defined as:
Because of the spectrum decay, it is sufficient to use only the first k eigenvectors and eigenvalues.
Thus we get the diffusion map from the original data to a k-dimensional space which is embedded in the original space.
In it is proved that
so the Euclidean distance in the diffusion coordinates approximates the diffusion distance.
Algorithm
The basic algorithm framework of diffusion map is as:Step 1. Given the similarity matrix L.
Step 2. Normalize the matrix according to parameter :.
Step 3. Form the normalized matrix.
Step 4. Compute the k largest eigenvalues of and the corresponding eigenvectors.
Step 5. Use diffusion map to get the embedding.
Application
In the paper, Nadler et. al. showed how to design a kernel that reproduces the diffusion induced by a Fokker–Planck equation. Also, they explained that, when the data approximate a manifold, one can recover the geometry of this manifold by computing an approximation of the Laplace–Beltrami operator. This computation is completely insensitiveto the distribution of the points and therefore provides a separation of the statistics and the geometry of the
data. Since diffusion maps gives a global description of the data-set, it can measure the distances between pair of sample points in the manifold in which the data is embedded. Applications based on diffusion maps include face recognition, spectral clustering, low dimensional representation of images, image segmentation, 3D model segmentation, speaker verification and identification, sampling on manifolds, anomaly detection, image inpainting, and so on.