Spectral clustering

In multivariate statistics and the clustering of data, spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the dataset.
In application to image segmentation, spectral clustering is known as segmentation-based object categorization.

Definitions

Given an enumerated set of data points, the similarity matrix may be defined as a symmetric matrix $A$, where $A_{ij} \geq 0$ represents a measure of the similarity between data points with indices $i$ and $j$. The general approach to spectral clustering is to use a standard clustering method on relevant eigenvectors of a Laplacian matrix of $A$. There are many different ways to define a Laplacian, each with a different mathematical interpretation, and so the clustering will also have different interpretations. The relevant eigenvectors are those corresponding to the several smallest eigenvalues of the Laplacian, excluding the smallest eigenvalue, which is 0. For computational efficiency, these eigenvectors are often computed as the eigenvectors corresponding to the several largest eigenvalues of a function of the Laplacian.
Spectral clustering is well known to relate to partitioning of a mass-spring system, where each mass is associated with a data point and each spring stiffness corresponds to the weight of an edge describing the similarity of the two related data points. Specifically, the classical reference explains that the eigenvalue problem describing transversal vibration modes of a mass-spring system is exactly the same as the eigenvalue problem for the graph Laplacian matrix defined as
$$L := D - A,$$
where $A$ is the similarity matrix and $D$ is the diagonal matrix with entries
$$D_{ii} = \sum_j A_{ij}.$$
The masses that are tightly connected by the springs in the mass-spring system move together from the equilibrium position in low-frequency vibration modes, so the components of the eigenvectors corresponding to the smallest eigenvalues of the graph Laplacian can be used for meaningful clustering of the masses.
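As a minimal sketch of the construction above, the following NumPy snippet builds the graph Laplacian for a small similarity matrix and reads the grouping off the eigenvector of the second-smallest eigenvalue. The 4-point similarity matrix and the sign-based split are illustrative assumptions, not part of any reference implementation.

```python
import numpy as np

# Hypothetical similarity matrix for 4 points: points 0-1 and 2-3 are
# strongly similar, with only weak similarity across the two pairs.
A = np.array([
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0, 0.0],
])

D = np.diag(A.sum(axis=1))   # degree matrix, D_ii = sum_j A_ij
L = D - A                    # unnormalized graph Laplacian

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(L)

# The smallest eigenvalue is 0 (constant eigenvector); the next eigenvector
# changes sign between the two tightly connected groups.
second_vector = eigenvectors[:, 1]
labels = (second_vector > 0).astype(int)
print(eigenvalues)
print(labels)
```

Which group receives label 0 or 1 depends on the arbitrary sign of the eigenvector returned by the solver, so only the grouping itself is meaningful.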
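A popular related spectral clustering technique is the normalized cuts algorithm, or Shi–Malik algorithm, introduced by Jianbo Shi and Jitendra Malik and commonly used for image segmentation. It partitions points into two sets $(B_1, B_2)$ based on the eigenvector $v$ corresponding to the second-smallest eigenvalue of the symmetric normalized Laplacian defined as
$$L^{\text{norm}} := I - D^{-1/2} A D^{-1/2},$$
where $A$ is the similarity matrix and $D$ is its diagonal degree matrix.
A mathematically equivalent algorithm takes the eigenvector corresponding to the second-largest eigenvalue of the random walk normalized adjacency matrix $P = D^{-1} A$; the largest eigenvalue of $P$ is 1 and belongs to the constant vector, which carries no cluster information.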
Knowing the eigenvectors, partitioning may be done in various ways, such as by computing the median $m$ of the components of the eigenvector $v$ of the second-smallest eigenvalue, and placing all points whose component in $v$ is greater than $m$ in one set $B_1$, and the rest in the other set $B_2$. The algorithm can be used for hierarchical clustering by repeatedly partitioning the subsets in this fashion.
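The median split described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the Shi–Malik reference implementation: the function name `normalized_cut_bipartition` and the 6-node test graph (two triangles joined by one weak edge) are made up for the example, and a dense eigensolver is used for simplicity.

```python
import numpy as np

def normalized_cut_bipartition(A):
    """Split points into two sets using the symmetric normalized Laplacian."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)             # assumes every degree is > 0
    # L_norm = I - D^{-1/2} A D^{-1/2}
    L_norm = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_norm)          # eigenvalues in ascending order
    v = vecs[:, 1]                            # eigenvector of the second-smallest eigenvalue
    m = np.median(v)
    return v > m                              # True: first set, False: second set

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by a weak edge.
A = np.zeros((6, 6))
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
         (2, 3, 0.01)]
for i, j, w in edges:
    A[i, j] = A[j, i] = w

in_first_set = normalized_cut_bipartition(A)
print(in_first_set)
```

Applying the same function recursively to the submatrix of each resulting set gives the hierarchical variant mentioned in the text.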

Algorithms

If the similarity matrix has not already been explicitly constructed, the efficiency of spectral clustering may be improved by solving the corresponding eigenvalue problem in a matrix-free fashion, as in the Lanczos algorithm.
For large graphs, the second eigenvalue of the graph Laplacian matrix is often ill-conditioned, leading to slow convergence of iterative eigenvalue solvers. Preconditioning is a key technique for accelerating convergence, e.g., in the matrix-free LOBPCG method. Spectral clustering has been successfully applied to large graphs by first identifying their community structure and then clustering the communities.
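To illustrate what "matrix-free" means in practice, the sketch below computes the eigenvector of the second-smallest Laplacian eigenvalue without ever storing the similarity matrix. Note the substitutions: it uses plain power iteration on a shifted Laplacian instead of Lanczos or LOBPCG, applies no preconditioning, and assumes an RBF similarity on one-dimensional points. The function names, the shift, and the data are all assumptions made for the example.

```python
import numpy as np

def laplacian_matvec(x, v, sigma=1.0):
    """Return (D - A) @ v for an RBF similarity on 1-D points x.

    One row of A is generated at a time, so the n-by-n matrix is never stored.
    """
    out = np.empty_like(v)
    for i in range(len(x)):
        row = np.exp(-(x[i] - x) ** 2 / (2.0 * sigma ** 2))   # A[i, :]
        out[i] = row @ (v[i] - v)                             # sum_j A_ij (v_i - v_j)
    return out

def second_eigenvector_power_iteration(x, n_iter=300, seed=0):
    n = len(x)
    # RBF entries are at most 1, so degrees are below n and the largest
    # eigenvalue of L is below 2n; shift*I - L is then positive semidefinite.
    shift = 2.0 * n
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    for _ in range(n_iter):
        v -= v.mean()                             # remove the constant eigenvector (eigenvalue 0)
        v = shift * v - laplacian_matvec(x, v)    # apply (shift*I - L)
        v /= np.linalg.norm(v)
    return v

# Hypothetical data: two well-separated groups on the real line.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
v = second_eigenvector_power_iteration(x)
groups = v > 0
print(groups)
```

Power iteration converges slowly when the second and third eigenvalues are close, which is the ill-conditioning the text refers to and the reason preconditioned solvers are preferred on large graphs.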
Spectral clustering is closely related to nonlinear dimensionality reduction, and dimension reduction techniques such as locally-linear embedding can be used to reduce errors from noise or outliers.
Free software implementing spectral clustering is available in large open source projects such as scikit-learn (using LOBPCG with multigrid preconditioning, or ARPACK), MLlib (pseudo-eigenvector clustering using the power iteration method), and R.

Relationship with k-means

The kernel k-means problem is an extension of the k-means problem where the input data points are mapped non-linearly into a higher-dimensional feature space via a kernel function $k(x_i, x_j) = \varphi^T(x_i)\varphi(x_j)$. The weighted kernel k-means problem further extends this problem by defining a weight $w_r$ for each cluster as the reciprocal of the number of elements in the cluster,
$$\max_{\{C_s\}} \sum_{r=1}^{k} w_r \sum_{x_i, x_j \in C_r} k(x_i, x_j).$$
Suppose $F$ is a matrix of the normalizing coefficients for each point for each cluster, with $F_{ij} = w_r$ if $i, j \in C_r$ and zero otherwise. Suppose $K$ is the kernel matrix for all points. The weighted kernel k-means problem with $n$ points and $k$ clusters is given as
$$\max_F \operatorname{trace}(KF)$$
such that
$$F = G_{n \times k} G_{n \times k}^T, \qquad G^T G = I,$$
and such that $\operatorname{rank}(G) = k$. In addition, there are identity constraints on $F$ given by
$$F \mathbf{1} = \mathbf{1}, \qquad F^T \mathbf{1} = \mathbf{1},$$
where $\mathbf{1}$ represents a vector of ones.
Because $\operatorname{trace}(KF) = \operatorname{trace}(K G G^T) = \operatorname{trace}(G^T K G)$, this problem can be recast as
$$\max_G \operatorname{trace}(G^T K G).$$
This problem is equivalent to the spectral clustering problem when the identity constraints on $F$ are relaxed. In particular, the weighted kernel k-means problem can be reformulated as a spectral clustering problem and vice versa. The outputs of the algorithms are eigenvectors, which do not satisfy the identity requirements for indicator variables defined by $F$. Hence, post-processing of the eigenvectors is required for the equivalence between the problems.
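The constraints and the trace identity in this formulation can be checked numerically on a toy example. In the sketch below, the cluster assignment, the linear kernel on random points, and the explicit scaled indicator matrix G (entry 1/sqrt(|C_r|) when point i belongs to cluster r, zero otherwise) are all assumptions chosen for illustration; F is then G times its transpose, with entries equal to the reciprocal cluster size.

```python
import numpy as np

# Hypothetical assignment of n = 5 points to k = 2 clusters.
assignment = np.array([0, 0, 1, 1, 1])
n, k = len(assignment), 2

# Scaled indicator matrix: G[i, r] = 1/sqrt(|C_r|) if point i is in cluster r.
sizes = np.bincount(assignment, minlength=k)
G = np.zeros((n, k))
G[np.arange(n), assignment] = 1.0 / np.sqrt(sizes[assignment])

F = G @ G.T   # F[i, j] = 1/|C_r| when i and j share cluster r, else 0

# Any symmetric positive semidefinite matrix can play the role of K here;
# this one is a linear kernel on random 2-D points.
rng = np.random.default_rng(0)
X = rng.standard_normal((n, 2))
K = X @ X.T

objective_F = np.trace(K @ F)          # trace(K F)
objective_G = np.trace(G.T @ K @ G)    # trace(G^T K G)
print(objective_F, objective_G)
```

Relaxing G from scaled indicators to any matrix with orthonormal columns turns the maximization into an eigenvalue problem, solved by the top k eigenvectors of K, which is why the output then needs post-processing to recover discrete cluster labels.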
Transforming the spectral clustering problem into a weighted kernel k-means problem greatly reduces the computational burden.

Relationship to DBSCAN

Spectral clustering is also related to DBSCAN clustering, which finds density-connected components. Connected components correspond to optimal spectral clusters, since no edges are cut, and DBSCAN uses an asymmetric neighbor graph with edges removed when source points are not dense. Thus, DBSCAN is a special case of spectral clustering, but one which allows more efficient algorithms.
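The link between connected components and spectral clusters can be seen directly in the Laplacian spectrum: the number of zero eigenvalues equals the number of connected components, and the corresponding eigenvectors span the component indicator vectors. The sketch below shows only this connected-components part on a made-up 5-node graph; it does not implement DBSCAN's density criterion or its asymmetric neighbor graph.

```python
import numpy as np

# Hypothetical graph with two connected components: {0, 1, 2} and {3, 4}.
n = 5
A = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (3, 4)]:
    A[i, j] = A[j, i] = 1.0

L = np.diag(A.sum(axis=1)) - A
eigenvalues, eigenvectors = np.linalg.eigh(L)

# Each zero eigenvalue corresponds to one connected component.
n_components = int(np.sum(eigenvalues < 1e-9))

# The projector onto the null space has entry 1/|C| for two nodes in the
# same component C and 0 otherwise, whatever basis the solver returns.
U = eigenvectors[:, :n_components]
same_component = (U @ U.T) > 0.5 / n
print(n_components)
print(same_component.astype(int))
```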

Measures to compare clusterings

Ravi Kannan, Santosh Vempala and Adrian Vetta proposed a bicriteria measure to define the quality of a given clustering. They defined a clustering to be an (α, ε)-clustering if the conductance of each cluster is at least α and the weight of the inter-cluster edges is at most an ε fraction of the total weight of all the edges in the graph. They also analyze two approximation algorithms in the same paper.
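A brute-force check of the bicriteria condition might look like the sketch below. It rests on assumptions that should be compared against the original paper before reuse: the conductance of a cluster is taken as the minimum, over all ways of cutting the cluster in two, of the cut weight divided by the smaller of the two sides' total degree in the whole graph; the similarity matrix has a zero diagonal and positive degrees; and all function names are hypothetical. Enumerating cuts is exponential in cluster size, so this only suits tiny examples.

```python
import numpy as np
from itertools import combinations

def cluster_conductance(A, cluster):
    """Minimum conductance over all two-way cuts of one cluster (brute force)."""
    cluster = list(cluster)
    degree = A.sum(axis=1)
    best = np.inf
    for size in range(1, len(cluster) // 2 + 1):
        for S in combinations(cluster, size):
            T = [v for v in cluster if v not in S]
            cut = A[np.ix_(S, T)].sum()
            vol = min(degree[list(S)].sum(), degree[T].sum())
            best = min(best, cut / vol)
    return best

def inter_cluster_fraction(A, labels):
    """Weight of edges between different clusters over total edge weight."""
    labels = np.asarray(labels)
    different = labels[:, None] != labels[None, :]
    return A[different].sum() / A.sum()

def is_alpha_eps_clustering(A, labels, alpha, eps):
    labels = np.asarray(labels)
    clusters = [np.flatnonzero(labels == c) for c in np.unique(labels)]
    min_conductance = min(cluster_conductance(A, c) for c in clusters)
    return bool(min_conductance >= alpha and inter_cluster_fraction(A, labels) <= eps)

# Hypothetical graph: two unit-weight triangles joined by one edge of weight 0.1.
A = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]:
    A[i, j] = A[j, i] = w
labels = [0, 0, 0, 1, 1, 1]

print(cluster_conductance(A, [0, 1, 2]), inter_cluster_fraction(A, labels))
print(is_alpha_eps_clustering(A, labels, alpha=0.9, eps=0.02))
```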

Approximate solutions

Spectral clustering is computationally expensive unless the graph is sparse and the similarity matrix can be efficiently constructed. If the similarity matrix is an RBF kernel matrix, which is dense, spectral clustering is expensive. Approximate algorithms make spectral clustering more efficient, for example the power method and the Nyström method. However, recent research has pointed out problems with spectral clustering using the Nyström method; in particular, the similarity matrix under the Nyström approximation is not elementwise positive, which can be problematic.
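For orientation, a basic Nyström approximation of an RBF kernel matrix can be sketched as follows. The landmark selection (a fixed list of indices), the helper names, and the random data are assumptions for illustration; practical implementations choose landmarks more carefully and avoid forming the full approximate matrix.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Dense RBF kernel between the rows of X and the rows of Y."""
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def nystrom_approximation(X, landmark_idx, sigma=1.0):
    """Approximate the n-by-n kernel matrix from m landmark columns."""
    C = rbf_kernel(X, X[landmark_idx], sigma)   # n x m block of kernel columns
    W = C[landmark_idx]                         # m x m kernel among the landmarks
    return C @ np.linalg.pinv(W) @ C.T          # approximation of rank at most m

# Hypothetical data: 20 random 2-D points, 4 of them used as landmarks.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))
landmark_idx = [0, 5, 10, 15]

K_exact = rbf_kernel(X, X)
K_approx = nystrom_approximation(X, landmark_idx)

# Exact RBF entries are strictly positive; the approximation gives no such guarantee.
print(K_exact.min(), K_approx.min())
```

Only the n-by-m block C and the m-by-m block W need kernel evaluations, which is where the savings come from; the approximation reproduces the exact kernel on the landmark block but can deviate elsewhere, including in sign.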

History and related literature

Spectral clustering has a long history. Spectral clustering as a machine learning method was popularized by Shi & Malik and Ng, Jordan, & Weiss.
Ideas and network measures related to spectral clustering also play an important role in a number of applications apparently different from clustering problems. For instance, networks with stronger spectral partitions take longer to converge in opinion-updating models used in sociology and economics.
