Multi-label classification


In machine learning, multi-label classification and the strongly related problem of multi-output classification are variants of the classification problem where multiple labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem there is no constraint on how many of the classes the instance can be assigned to.
Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y.
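For instance, with a hypothetical label set {news, sports, politics}, an instance tagged as both news and politics maps to the indicator vector (1, 0, 1). A minimal sketch of this encoding (the label names are assumptions for illustration):

```python
# Representing multi-label targets as binary indicator vectors.
# The label set below is a hypothetical example.
labels = ["news", "sports", "politics"]

def to_indicator(assigned, label_set):
    """Encode a set of assigned labels as a 0/1 vector over label_set."""
    return [1 if label in assigned else 0 for label in label_set]

y = to_indicator({"news", "politics"}, labels)
print(y)  # [1, 0, 1]
```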

Problem transformation methods

Several problem transformation methods exist for multi-label classification; they can be roughly broken down into:
- transformation into binary classification problems: the baseline binary relevance method trains one independent binary classifier per label, while classifier chains additionally feed the predictions of earlier classifiers in the chain as extra features to later ones;
- transformation into a multi-class classification problem: the label powerset method treats every distinct combination of labels observed in the training data as a single class;
- ensemble methods, such as random k-labelsets (RAkEL), which train classifiers on random subsets of the labels.
Some classification algorithms/models have been adapted to the multi-label task without requiring problem transformations. Examples include adapted versions of boosting (AdaBoost.MH), k-nearest neighbors (ML-kNN), decision trees and neural networks for multi-label data.
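The baseline binary relevance transformation trains one independent binary classifier per label. A minimal sketch using scikit-learn's OneVsRestClassifier, which applies exactly this scheme when the targets form a binary indicator matrix (the synthetic data and hyperparameters are illustrative assumptions):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data: each row of Y is a binary indicator vector.
X, Y = make_multilabel_classification(n_samples=200, n_classes=5, random_state=0)

# Binary relevance: one independent binary classifier per label.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

pred = clf.predict(X[:3])
print(pred.shape)  # (3, 5): one 0/1 prediction per label, per sample
```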
Based on learning paradigm, existing multi-label classification techniques can be classified into batch learning and online machine learning. Batch learning algorithms require all the data samples to be available beforehand: the model is trained on the entire training set and then used to predict the test samples. Online learning algorithms, on the other hand, incrementally build their models in sequential iterations. In iteration t, an online algorithm receives a sample x_t and predicts its label ŷ_t using the current model; the algorithm then receives y_t, the true label of x_t, and updates its model based on the sample-label pair (x_t, y_t).

Multi-label stream classification

Data streams are possibly infinite sequences of data that continuously and rapidly grow over time. Multi-label stream classification (MLSC) is the version of the multi-label classification task that takes place in data streams; it is sometimes also called online multi-label classification. The difficulties of multi-label classification are combined with those of data streams.
Many MLSC methods resort to ensemble methods in order to increase their predictive performance and to deal with concept drift. Among the most widely used ensemble schemes in the literature are online bagging (OzaBagging)-based methods and their drift-aware variants such as ADWIN bagging, applied on top of multi-label base learners.
Considering Y_i to be the set of labels of the i-th data sample (with L the set of all labels and N the number of samples), the extent to which a dataset is multi-label can be captured in two statistics: label cardinality, the average number of labels per sample, (1/N) Σ_i |Y_i|; and label density, the cardinality normalized by the size of the label set, (1/N) Σ_i |Y_i| / |L|.
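Label cardinality (the average number of labels per sample) and label density (cardinality divided by the total number of labels) can be computed directly from the label sets; a toy sketch with assumed example data:

```python
# Toy dataset: per-sample label sets (assumed for illustration).
label_sets = [{"a", "b"}, {"a"}, {"b", "c", "d"}]
all_labels = {"a", "b", "c", "d"}

n = len(label_sets)
cardinality = sum(len(Y_i) for Y_i in label_sets) / n  # mean labels per sample
density = cardinality / len(all_labels)                # normalized by |L|
print(cardinality, density)  # 2.0 0.5
```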
Evaluation metrics for multi-label classification performance are inherently different from those used in multi-class classification, due to the differences between the two problems. If Y denotes the true set of labels for a given sample and Z the predicted set, then the following metrics can be defined on that sample: the Hamming loss, the fraction of wrongly predicted labels, proportional to the symmetric difference |Y Δ Z|; precision, |Y ∩ Z| / |Z|; recall, |Y ∩ Z| / |Y|; the Jaccard index (sometimes called accuracy in this setting), |Y ∩ Z| / |Y ∪ Z|; and exact match (subset accuracy), the strictest metric, which requires Y = Z.
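With Y the true label set and Z the predicted label set of a single sample, these metrics reduce to a few set operations; a toy sketch (the label sets are assumed for illustration):

```python
# Per-sample metrics from true label set Y and predicted set Z (toy sets).
Y = {"news", "sports"}    # true labels
Z = {"news", "politics"}  # predicted labels

hamming = len(Y ^ Z)               # symmetric difference: mispredicted labels
precision = len(Y & Z) / len(Z)    # fraction of predicted labels that are true
recall = len(Y & Z) / len(Y)       # fraction of true labels that were predicted
jaccard = len(Y & Z) / len(Y | Z)  # intersection over union
exact_match = Y == Z               # strictest criterion: all labels correct
print(hamming, precision, recall, exact_match)  # 2 0.5 0.5 False
```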
Cross-validation in multi-label settings is complicated by the fact that the ordinary way of stratified sampling will not work; alternative ways of approximate stratified sampling have been suggested.

Implementations and datasets

Java implementations of multi-label algorithms are available in the Mulan and Meka software packages, both based on Weka.
The scikit-learn Python package implements some multi-label algorithms and metrics.
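For example, scikit-learn's ClassifierChain implements the classifier-chain transformation, in which each binary model also sees the previously predicted labels as extra features; a minimal sketch on synthetic data (hyperparameters are illustrative assumptions):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Synthetic multi-label data with 4 labels.
X, Y = make_multilabel_classification(n_samples=150, n_classes=4, random_state=0)

# Classifier chain: each binary model also receives the earlier labels in the
# chain as inputs, so label correlations can be exploited.
chain = ClassifierChain(LogisticRegression(max_iter=1000),
                        order="random", random_state=0)
chain.fit(X, Y)
print(chain.predict(X[:2]).shape)  # (2, 4)
```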
The binary relevance method, classifier chains and other multi-label algorithms with many different base learners are implemented in the R package mlr.
A list of commonly used multi-label data sets is available at the Mulan website.