K q-flats

In data mining and machine learning, -flats algorithm
is an iterative method which aims to partition observations into clusters where each cluster is close to a -flat, where is a given integer.
It is a generalization of the -means algorithm. In -means algorithm,
clusters are formed in the way that each cluster is close to one point, which is a -flat.
-flats algorithm gives better clustering result than -means algorithm
for some data set.

Description

Problem formulation

Given a set of observations where each observation is an n-dimensional real vector, -flats algorithm aims to partition observation points by generating -flats that minimize the sum of the squares of distances of each observation to a nearest q-flat.
A -flat is a subset of that is congruent to. For example, a
-flat is a point; a -flat is a line; a -flat is a plane; a -flat is a hyperplane.
-flat can be characterized by the solution set of a linear system of equations:, where,.
Denote a partition of as.
The problem can be formulated as
where is the projection of onto.
Note that is the distance from to.

Algorithm

The algorithm is similar to the k-means algorithm in that it alternates between
cluster assignment and cluster update. In specific, the algorithm starts with an initial set of -flats, and proceeds by alternating between the following two steps:
Stop whenever the assignments no longer change.
The cluster assignment step uses the following fact: given a q-flat and a vector, where, the distance from to the q-flat
is
The key part of this algorithm is how to update the cluster, i.e. given points, how to find a -flat that minimizes the sum of squares of distances of each point to the -flat. Mathematically, this problem is: given solve the quadratic optimization problem
subject to
where is given, and.
The problem can be solved using Lagrangian multiplier method and the solution is as given in the cluster update step.
It can be shown that the algorithm will terminate in a finite number of iterations. In addition, the algorithm will terminate at a point that the overall objective cannot be decreased either by a different assignment or by defining new cluster planes for these clusters.
This convergence result is a consequence of the fact that problem can be solved exactly.
The same convergence result holds for -means algorithm because the cluster update problem can be solved exactly.

Relation to other machine learning methods

$k$ -means algorithm

-flats algorithm is a generalization of -means algorithm. In fact, -means algorithm is 0-flats algorithm since a point is a 0-flat. Despite their connection, they should be used in different scenarios. -flats algorithm for the case that data lie in a few low-dimensional spaces.-means algorithm is desirable for the case the clusters are of the ambient dimension,.
For example, if all observations lie in two lines, -flats algorithm with
may be used; if the observations are two Gaussian clouds, -means algorithm may be used.

Sparse Dictionary Learning

Natural signals lie in a high-dimensional space. For example, the dimension of a 1024 by 1024 image is about 10⁶, which is far too high for most signal processing algorithms. One way to get rid of the high dimensionality is to find a set of basis functions, such that the high-dimensional signal can be represented by only a few basis functions. In other words, the coefficients of the signal representation lies in a low-dimensional space, which is easier to apply signal processing algorithms. In the literature, wavelet transform is usually used in image processing, and fourier transform is usually used in audio processing. The set of basis functions is usually called a dictionary.
However, it is not clear what is the best dictionary to use once given a signal data set. One popular approach is to find a dictionary when given a data set using the idea of Sparse Dictionary Learning. It aims to find a dictionary, such that the signal can be sparsely represented by the dictionary. The optimization problem can be written as follows.
subject to
where

X is a d by N matrix. Each columns of X represent a signal, and there are total N signals.
B is a d by l matrix. Each columns of B represent a basis function, and there are total l basis functions in the dictionary.
R is a l by N matrix. represent the coefficients when we use the dictionary B to represent the i^th columns of X.
denotes the zero-norm of the vector v.
denotes the Frobenious norm of matrix V.

The idea of flats algorithm is similar to sparse dictionary learning in nature. If we restrict the q-flat to q-dimensional subspace, then the flats algorithm is simply finding the closed q-dimensional subspace to a given signal. Sparse dictionary learning is also doing the same thing, except for an additional constraints on the sparsity of the representation. Mathematically, it is possible to show that flats algorithm is of the form of sparse dictionary learning with an additional block structure on R.
Let be a matrix, where columns of are basis of the flat. Then the projection of the signal x to the flat is, where is a q-dimensional coefficient. Let denote concatenation of basis of K flats, it is easy to show that the k -q-flat algorithm is the same as the following.
subject to and R has a block structure.
The block structure of R refers the fact that each signal is labeled to only one flat. Comparing the two formulations, k q-flat is the same as sparse dictionary modeling when and with an additional block structure on R. Users may refer to Szlam's paper for more discussion about the relationship between the two concept.

Applications and variations

Classification

is a procedure that classifies an input signal into different classes. One example is to classify an email into spam or non-spam classes. Classification algorithms usually require a supervised learning stage. In the supervised learning stage, training data for each class is used for the algorithm to learn the characteristics of the class. In the classification stage, a new observation is classified into a class by using the characteristics that were already trained.
k q-flat algorithm can be used for classification. Suppose there are total of m classes. For each class, k flats are trained a priori via training data set. When a new data comes, find the flat that is closest to the new data. Then the new data is associate to class of the closest flat.
However, the classification performance can be further improved if we impose some structure on the flats. One possible choice is to require different flats from different class be sufficiently far apart. Some researchers use this idea and develop a discriminative k q-flat algorithm.

K-metrics

In -flats algorithm, is used to measure the representation error. denotes the projection of x to the flat F. If data lies in a q-dimension flat, then a single q-flat can represent the data very well. On the contrary, if data lies in a very high dimension space but near a common center, then k-means algorithm is a better way than k q-flat algorithm to represent the data. It is because -means algorithm use to measure the error, where denotes the center. K-metrics is a generalization that use both the idea of flat and mean. In k-metrics, error is measured by the following Mahalanobis metric.
where A is a positive semi-definite matrix.
If A is the identity matrix, then the Mahalanobis metric is exactly the same as the error measure used in k-means. If A is not the identity matrix, then will favor certain directions as the k q-flat error measure.

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...