Medoid


Medoids are representative objects of a data set, or of a cluster within a data set, whose average dissimilarity to all the objects in the set or cluster is minimal. Medoids are similar in concept to means or centroids, but medoids are always restricted to be members of the data set. Medoids are most commonly used on data for which a mean or centroid cannot be defined, such as graphs. They are also used where the centroid is not representative of the data set, as with images, 3-D trajectories, and gene expression. They are also of interest when a representative is sought under a distance other than squared Euclidean distance.
For some data sets there may be more than one medoid, as with medians.
A common application of the medoid is the k-medoids clustering algorithm, which is similar to the k-means algorithm but works when a mean or centroid is not definable. The algorithm works roughly as follows. First, a set of medoids is chosen at random. Second, the distances to the other points are computed. Third, each data point is assigned to the cluster of the medoid it is most similar to. Fourth, the medoid set is optimized via an iterative process.
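The four steps above can be sketched in a few lines of Python. This is a minimal illustration of one simple alternating variant (assign points, then recompute each cluster's medoid), not a reference implementation of any particular published k-medoids method; the names `k_medoids`, `assign`, `max_iter`, and `seed` are chosen here for illustration, and the points are assumed to be distinct.

```python
import random


def assign(points, medoids, dist):
    """Label each point with the index of its nearest medoid."""
    return [min(medoids, key=lambda m: dist(p, points[m])) for p in points]


def k_medoids(points, k, dist, max_iter=100, seed=0):
    """Alternating k-medoids sketch; returns (medoid indices, labels)."""
    rng = random.Random(seed)
    # Step 1: choose k initial medoids at random (as indices into points).
    medoids = rng.sample(range(len(points)), k)
    for _ in range(max_iter):
        # Steps 2 and 3: compute distances and assign points to medoids.
        labels = assign(points, medoids, dist)
        # Step 4: replace each medoid by the member of its cluster with
        # the smallest total distance to the other members.
        new_medoids = []
        for m in medoids:
            members = [i for i, lab in enumerate(labels) if lab == m]
            if not members:  # only possible with duplicate points
                new_medoids.append(m)
                continue
            new_medoids.append(min(
                members,
                key=lambda c: sum(dist(points[c], points[j]) for j in members),
            ))
        if new_medoids == medoids:  # no change: converged
            break
        medoids = new_medoids
    return medoids, assign(points, medoids, dist)


# Two well-separated groups on the real line.
points = [0, 1, 2, 100, 101, 102]
medoids, labels = k_medoids(points, k=2, dist=lambda a, b: abs(a - b))
```

Because the function only ever calls `dist`, the same sketch applies to any objects with a dissimilarity function, including ones for which no mean exists.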
Note that a medoid is not equivalent to a median, a geometric median, or a centroid. A median is only defined on 1-dimensional data, and it only minimizes dissimilarity to other points for metrics induced by a norm. A geometric median is defined in any dimension, but is not necessarily a point from within the original data set.

Definition

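Let $\mathcal{X} := \{x_1, x_2, \dots, x_n\}$ be a set of $n$ points in a space with a distance function $d$. The medoid is defined as

$$x_{\text{medoid}} = \arg\min_{y \in \mathcal{X}} \sum_{i=1}^{n} d(y, x_i).$$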

Algorithms to compute medoids

From the definition above, the medoid can be computed once all pairwise distances between the points in the ensemble are known. For $n$ points this takes $O(n^2)$ distance evaluations. In the worst case, the medoid cannot be computed with fewer distance evaluations. However, several approaches compute medoids either exactly or approximately in sub-quadratic time under different statistical models.
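The quadratic baseline can be written directly from the definition. The sketch below is illustrative only; the function name `medoid_index` and the sample points are assumptions made for this example.

```python
import math


def medoid_index(points, dist):
    """Return the index of a medoid using all pairwise distances (O(n^2))."""
    best_index, best_total = None, math.inf
    for i, p in enumerate(points):
        # Total distance from p to every point (the self-distance is 0).
        total = sum(dist(p, q) for q in points)
        if total < best_total:  # strict '<' keeps the first of any ties
            best_index, best_total = i, total
    return best_index


points = [(0, 0), (2, 0), (1, 1), (10, 10)]
exact = medoid_index(points, math.dist)  # index 2, the point (1, 1)
```

In this example the centroid is (3.25, 2.75), which is not one of the data points, whereas the medoid is.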
If the points lie on the real line, computing the medoid reduces to computing the median, which can be done in $O(n)$ time for $n$ points with Hoare's Quickselect algorithm. In higher-dimensional real spaces, however, no linear-time algorithm is known. RAND is an algorithm that estimates the average distance of each point to all the other points by sampling a random subset of other points. It takes a total of $O\left(\frac{n \log n}{\epsilon^2}\right)$ distance computations to approximate the medoid within a factor of $(1 + \epsilon\Delta)$ with high probability, where $\Delta$ is the maximum distance between two points in the ensemble. Note that RAND is an approximation algorithm, and moreover $\Delta$ may not be known a priori.
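The sampling idea behind RAND can be illustrated as follows. This is a simplified sketch, not the published algorithm: it assumes a single random set of reference points shared by all candidates, and the names `rand_medoid` and `num_refs` are chosen here for illustration.

```python
import math
import random


def rand_medoid(points, dist, num_refs, seed=0):
    """Approximate the medoid from distances to a random reference subset."""
    rng = random.Random(seed)
    n = len(points)
    # Sample reference points once (assumption: shared by all candidates).
    refs = rng.sample(range(n), num_refs)

    def estimate(i):
        # Estimated average distance from point i to the whole ensemble.
        return sum(dist(points[i], points[j]) for j in refs) / num_refs

    # n * num_refs distance evaluations instead of n * n.
    return min(range(n), key=estimate)


points = [(0, 0), (2, 0), (1, 1), (10, 10)]
approx = rand_medoid(points, math.dist, num_refs=2)
```

With `num_refs` equal to the number of points the estimates become exact averages, so the result coincides with the brute-force medoid; smaller values trade accuracy for fewer distance evaluations.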
RAND was leveraged by TOPRANK, which uses the estimates obtained by RAND to focus on a small subset of candidate points, evaluates the average distance of these points exactly, and picks the minimum of those. TOPRANK needs $O(n^{5/3} \log^{4/3} n)$ distance computations to find the exact medoid with high probability under a distributional assumption on the average distances.
trimed finds the medoid with $O(n^{3/2} 2^{\Theta(d)})$ distance evaluations under a distributional assumption on the points, where $d$ here denotes the dimension of the points. The algorithm uses the triangle inequality to cut down the search space.
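The triangle-inequality pruning can be sketched as follows. This is an illustrative simplification written for this article, not the authors' implementation. It relies on the bound $E(j) \ge |E(i) - d(x_i, x_j)|$, where $E(i)$ is the mean distance from $x_i$ to all points (self included), which follows from the reverse triangle inequality for any metric. The name `trimed_medoid` and the random visiting order are assumptions of this sketch.

```python
import math
import random


def trimed_medoid(points, dist, seed=0):
    """Exact medoid with pruning; returns (index, number of points evaluated)."""
    n = len(points)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # visit candidates in random order
    lower = [0.0] * n                   # lower bounds on each mean distance
    best_index, best_energy = None, math.inf
    evaluated = 0
    for i in order:
        if lower[i] >= best_energy:
            continue  # point i cannot beat the current best: skip it
        d_i = [dist(points[i], q) for q in points]
        energy = sum(d_i) / n
        evaluated += 1
        if energy < best_energy:
            best_index, best_energy = i, energy
        # Tighten the bounds of all points using the triangle inequality.
        for j in range(n):
            lower[j] = max(lower[j], abs(energy - d_i[j]))
    return best_index, evaluated


rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(200)]
index, evaluated = trimed_medoid(points, math.dist)
```

Each evaluated point costs `n` distance computations, so the saving comes from the points that are skipped; `evaluated` reports how many were not.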
Meddit leverages a connection between medoid computation and multi-armed bandits, and uses an upper-confidence-bound type of algorithm to obtain a method that takes $O(n \log n)$ distance evaluations under statistical assumptions on the points.
Correlated Sequential Halving also leverages multi-armed bandit techniques, improving upon Meddit. By exploiting the correlation structure in the problem, the algorithm provably yields large improvements in both the number of distance computations needed and wall-clock time.
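A rough sketch of the idea follows, under simplifying assumptions: in each round all surviving candidates are scored against the same randomly drawn reference points (which is what correlates their estimates), and the worse half is discarded. The budget split, the name `correlated_sequential_halving`, and the parameter `budget` are illustrative choices for this sketch and are not taken from the published algorithm.

```python
import math
import random


def correlated_sequential_halving(points, dist, budget, seed=0):
    """Halve the candidate set each round using shared reference points."""
    rng = random.Random(seed)
    n = len(points)
    survivors = list(range(n))
    rounds = max(1, math.ceil(math.log2(n)))
    for _ in range(rounds):
        if len(survivors) == 1:
            break
        # Split the budget evenly over rounds and survivors, capped at n.
        num_refs = min(n, max(1, budget // (len(survivors) * rounds)))
        # The same references are used for every survivor in this round.
        refs = rng.sample(range(n), num_refs)
        estimates = {
            i: sum(dist(points[i], points[j]) for j in refs) / num_refs
            for i in survivors
        }
        survivors.sort(key=estimates.get)
        if num_refs == n:
            return survivors[0]  # estimates are exact: stop early
        survivors = survivors[: math.ceil(len(survivors) / 2)]
    return survivors[0]


rng = random.Random(2)
points = [(rng.random(), rng.random()) for _ in range(64)]
guess = correlated_sequential_halving(points, math.dist, budget=2000)
```

With a small `budget` the result is only a guess at the medoid; with a budget large enough that every candidate is compared against all points, the sketch degenerates to the exact quadratic computation.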

Implementations

Public implementations of RAND, TOPRANK, and trimed are available, as are implementations of Meddit and of Correlated Sequential Halving.