Precision and recall


In pattern recognition, information retrieval and classification, precision is the fraction of relevant instances among the retrieved instances, while recall is the fraction of relevant instances that were retrieved. Both precision and recall are therefore based on an understanding and measure of relevance.
Suppose a computer program for recognizing dogs in photographs identifies 8 dogs in a picture containing 12 dogs and some cats. Of the 8 identified as dogs, 5 actually are dogs, while the rest are cats. The program's precision is 5/8 while its recall is 5/12. When a search engine returns 30 pages, only 20 of which are relevant, while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3 while its recall is 20/60 = 1/3. So, in this case, precision is "how useful the search results are", and recall is "how complete the results are".
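The two worked examples above can be checked with a short calculation. The following is a minimal sketch in Python (the function names are chosen here for illustration and are not from any particular library):

```python
# Precision and recall from true-positive, false-positive and false-negative counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Dog-recognition example: 5 dogs correctly identified, 3 cats mistaken for dogs, 7 dogs missed.
print(precision(tp=5, fp=3))    # 0.625  -> 5/8
print(recall(tp=5, fn=7))       # 0.4167 -> 5/12

# Search-engine example: 20 relevant pages returned, 10 irrelevant returned, 40 relevant missed.
print(precision(tp=20, fp=10))  # 0.6667 -> 2/3
print(recall(tp=20, fn=40))     # 0.3333 -> 1/3
```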
In statistics, if the null hypothesis is that all items are irrelevant, absence of type I and type II errors corresponds respectively to perfect precision and perfect recall. The above pattern recognition example contained 8 − 5 = 3 type I errors and 12 − 5 = 7 type II errors. Precision can be seen as a measure of exactness or quality, whereas recall is a measure of completeness or quantity.
In simple terms, high precision means that an algorithm returns substantially more relevant results than irrelevant ones, while high recall means that an algorithm returns most of the relevant results.
The relationship between sensitivity and specificity to precision depends on the proportion of positive cases in the population, also known as prevalence; with fixed sensitivity and specificity, precision rises with increasing prevalence.
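Concretely, when precision is viewed as the positive predictive value, this dependence on prevalence follows from Bayes' theorem; a standard identity, given here for concreteness, is:

$$\text{precision} = \frac{\text{sensitivity} \cdot \text{prevalence}}{\text{sensitivity} \cdot \text{prevalence} + (1 - \text{specificity}) \cdot (1 - \text{prevalence})}$$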

Introduction

In an information retrieval scenario, the instances are documents and the task is to return a set of relevant documents given a search term; or equivalently, to assign each document to one of two categories, "relevant" and "not relevant". In this case, the "relevant" documents are simply those that belong to the "relevant" category. Recall is defined as the number of relevant documents retrieved by a search divided by the total number of existing relevant documents, while precision is defined as the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search.
In a classification task, the precision for a class is the number of true positives divided by the total number of elements labeled as belonging to the positive class. Recall in this context is defined as the number of true positives divided by the total number of elements that actually belong to the positive class.
In information retrieval, a perfect precision score of 1.0 means that every result retrieved by a search was relevant whereas a perfect recall score of 1.0 means that all relevant documents were retrieved by the search.
In a classification task, a precision score of 1.0 for a class C means that every item labeled as belonging to class C does indeed belong to class C whereas a recall of 1.0 means that every item from class C was labeled as belonging to class C.
Often, there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other. Brain surgery provides an illustrative example of the tradeoff. Consider a brain surgeon removing a cancerous tumor from a patient’s brain. The surgeon needs to remove all of the tumor cells, since any remaining cancer cells will regenerate the tumor. Conversely, the surgeon must not remove healthy brain cells, since that would leave the patient with impaired brain function. The surgeon may be more liberal in the area of brain tissue removed, to ensure that all the cancer cells are extracted; this decision increases recall but reduces precision. Alternatively, the surgeon may be more conservative in the tissue removed, to ensure that only cancer cells are extracted; this decision increases precision but reduces recall. That is to say, greater recall increases the chances of removing healthy cells (a negative outcome) and increases the chances of removing all cancer cells (a positive outcome). Greater precision decreases the chances of removing healthy cells (a positive outcome) but also decreases the chances of removing all cancer cells (a negative outcome).
Usually, precision and recall scores are not discussed in isolation. Instead, either values for one measure are compared for a fixed level of the other measure, or both are combined into a single measure. Examples of measures that combine precision and recall are the F-measure and the Matthews correlation coefficient, which is a geometric mean of the chance-corrected variants: the regression coefficients Informedness and Markedness. Accuracy is a weighted arithmetic mean of Precision and Inverse Precision, as well as a weighted arithmetic mean of Recall and Inverse Recall. Inverse Precision and Inverse Recall are simply the Precision and Recall of the inverse problem, where positive and negative labels are exchanged. Recall (the true positive rate) and the complement of Inverse Recall (the false positive rate) are frequently plotted against each other as ROC curves and provide a principled mechanism to explore operating point tradeoffs.
Outside of information retrieval, the application of Recall, Precision and F-measure is argued to be flawed, as they ignore the true negative cell of the contingency table and are easily manipulated by biasing the predictions. The first problem is 'solved' by using Accuracy, and the second problem is 'solved' by discounting the chance component and renormalizing to Cohen's kappa, but this no longer affords the opportunity to explore tradeoffs graphically. However, Informedness and Markedness are Kappa-like renormalizations of Recall and Precision, and their geometric mean, the Matthews correlation coefficient, thus acts like a debiased F-measure.
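For concreteness, these chance-corrected variants can be written in terms of the rates defined in the classification section below (Informedness is also known as Youden's J statistic); the following is a summary using standard definitions rather than a derivation:

$$\text{Informedness} = \text{TPR} + \text{TNR} - 1, \qquad \text{Markedness} = \text{PPV} + \text{NPV} - 1, \qquad \text{MCC} = \pm\sqrt{\text{Informedness} \cdot \text{Markedness}},$$

where the MCC takes the sign of $TP \cdot TN - FP \cdot FN$.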

Definition (information retrieval context)

In information retrieval contexts, precision and recall are defined in terms of a set of retrieved documents and a set of relevant documents; cf. relevance. The measures were first defined in the early literature on evaluating information retrieval systems.

Precision

In the field of information retrieval, precision is the fraction of retrieved documents that are relevant to the query:
$$\text{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}$$
For example, for a text search on a set of documents, precision is the number of correct results divided by the number of all returned results.
Precision takes all retrieved documents into account, but it can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system. This measure is called precision at n or P@n.
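A minimal sketch of such a cut-off evaluation in Python (the function name and arguments are illustrative, not from any particular library):

```python
def precision_at_n(retrieved, relevant, n):
    """Fraction of the top-n retrieved documents that are relevant.

    retrieved: list of document ids in ranked order.
    relevant:  set of document ids judged relevant for the query.
    """
    top_n = retrieved[:n]
    if not top_n:
        return 0.0
    return sum(1 for doc in top_n if doc in relevant) / len(top_n)

# Example: 3 of the top 5 ranked results are relevant, so P@5 = 0.6.
print(precision_at_n(["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d5", "d9"}, n=5))
```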
Precision is used with recall, the fraction of all relevant documents that is returned by the search. The two measures are sometimes used together in the F1 score to provide a single measurement for a system.
Note that the meaning and usage of "precision" in the field of information retrieval differs from the definition of accuracy and precision within other branches of science and technology.

Recall

In information retrieval, recall is the fraction of the relevant documents that are successfully retrieved:
$$\text{recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}$$
For example, for a text search on a set of documents, recall is the number of correct results divided by the number of results that should have been returned.
In binary classification, recall is called sensitivity. It can be viewed as the probability that a relevant document is retrieved by the query.
It is trivial to achieve a recall of 100% by returning all documents in response to any query. Therefore, recall alone is not enough; one also needs to measure the number of non-relevant documents returned, for example by computing the precision.

Definition (classification context)

For classification tasks, the terms true positives, true negatives, false positives, and false negatives compare the results of the classifier under test with trusted external judgments. The terms positive and negative refer to the classifier's prediction, and the terms true and false refer to whether that prediction corresponds to the external judgment.
Let us define an experiment from P positive instances and N negative instances for some condition. The four outcomes can be formulated in a 2×2 contingency table or confusion matrix, as follows:

                         Condition positive        Condition negative
  Predicted positive     True positive (TP)        False positive (FP)
  Predicted negative     False negative (FN)       True negative (TN)
Precision and recall are then defined as:
$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$
Recall in this context is also referred to as the true positive rate or sensitivity, and precision is also referred to as positive predictive value (PPV); other related measures used in classification include true negative rate and accuracy. True negative rate is also called specificity.
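As an illustration, a short sketch that computes these quantities from the four cells of the confusion matrix (the variable and function names are chosen here for illustration):

```python
def classification_metrics(tp, fp, fn, tn):
    """Common metrics from the cells of a 2x2 confusion matrix."""
    precision   = tp / (tp + fp)                   # positive predictive value
    recall      = tp / (tp + fn)                   # true positive rate / sensitivity
    specificity = tn / (tn + fp)                   # true negative rate
    accuracy    = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, specificity, accuracy

# Hypothetical classifier evaluated on 100 items: 45 TP, 5 FP, 15 FN, 35 TN.
print(classification_metrics(tp=45, fp=5, fn=15, tn=35))
# -> (0.9, 0.75, 0.875, 0.8)
```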

Imbalanced data

Accuracy can be a misleading metric for imbalanced data sets. Consider a sample with 95 negative and 5 positive values. Classifying all values as negative in this case gives an accuracy score of 0.95. There are many metrics that don't suffer from this problem. For example, balanced accuracy normalizes true positive and true negative predictions by the number of positive and negative samples, respectively, and divides their sum by two:
$$\text{Balanced accuracy} = \frac{TPR + TNR}{2} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$

For the previous example, classifying all values as negative gives a balanced accuracy score of 0.5, which is equivalent to the expected value of a random guess in a balanced data set. Balanced accuracy can serve as an overall performance metric for a model, whether or not the true labels are imbalanced in the data, assuming the cost of a false negative is the same as that of a false positive.
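For the 95-negative, 5-positive sample above, the gap between accuracy and balanced accuracy for an "always negative" classifier can be verified directly (a minimal sketch):

```python
# "Always negative" classifier on 95 negatives and 5 positives.
tp, fp, fn, tn = 0, 0, 5, 95

accuracy = (tp + tn) / (tp + fp + fn + tn)   # 0.95
tpr = tp / (tp + fn) if (tp + fn) else 0.0   # 0.0 (no positives recovered)
tnr = tn / (tn + fp) if (tn + fp) else 0.0   # 1.0 (all negatives correct)
balanced_accuracy = (tpr + tnr) / 2          # 0.5

print(accuracy, balanced_accuracy)
```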
Another metric is the predicted positive condition rate (PPCR), which identifies the percentage of the total population that is flagged. For example, for a search engine that returns 30 results out of 1,000,000 documents, the PPCR is 0.003%.
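In terms of the contingency table counts, the predicted positive condition rate can be written as (given here for concreteness):

$$\text{PPCR} = \frac{TP + FP}{TP + FP + TN + FN}$$

i.e. 30 flagged documents out of 1,000,000 give 30/1,000,000 = 0.003%.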


According to Saito and Rehmsmeier, precision-recall plots are more informative than ROC plots when evaluating binary classifiers on imbalanced data. In such scenarios, ROC plots may be visually deceptive with respect to conclusions about the reliability of classification performance.

Probabilistic interpretation

One can also interpret precision and recall not as ratios but as estimations of probabilities: precision is the estimated probability that a document randomly selected from the pool of retrieved documents is relevant, and recall is the estimated probability that a document randomly selected from the pool of relevant documents is retrieved.
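In conditional probability notation, these two interpretations read:

$$\text{precision} = P(\text{relevant} \mid \text{retrieved}), \qquad \text{recall} = P(\text{retrieved} \mid \text{relevant})$$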
Note that "randomly selected" means selecting a document at random from an appropriate pool of documents; i.e., selecting a document from the set of retrieved documents at random. The random selection should be such that all documents in the set are equally likely to be selected.
Note that, in a typical classification system, the probability that a retrieved document is relevant depends on the document. The above interpretation extends to that scenario also.
Another interpretation for precision and recall is as follows. Precision is the average probability of relevant retrieval. Recall is the average probability of complete retrieval. Here we average over multiple retrieval queries.

F-measure

A measure that combines precision and recall is the harmonic mean of precision and recall, the traditional F-measure or balanced F-score:
$$F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
This measure is approximately the average of the two when they are close, and is more generally the harmonic mean, which, for the case of two numbers, coincides with the square of the geometric mean divided by the arithmetic mean. The F-score has been criticized in particular circumstances due to its bias as an evaluation metric. This is also known as the $F_1$ measure, because recall and precision are evenly weighted.
It is a special case of the general $F_\beta$ measure (for non-negative real values of $\beta$):
$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$
Two other commonly used F measures are the $F_2$ measure, which weights recall higher than precision, and the $F_{0.5}$ measure, which puts more emphasis on precision than recall.
The F-measure was derived by van Rijsbergen so that $F_\beta$ "measures the effectiveness of retrieval with respect to a user who attaches $\beta$ times as much importance to recall as precision". It is based on van Rijsbergen's effectiveness measure $E = 1 - \left(\frac{\alpha}{P} + \frac{1 - \alpha}{R}\right)^{-1}$ (with $P$ denoting precision and $R$ recall), the second term being the weighted harmonic mean of precision and recall with weights $(\alpha, 1 - \alpha)$. Their relationship is $F_\beta = 1 - E$ where $\alpha = \frac{1}{1 + \beta^2}$.
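A brief sketch of the balanced and weighted forms in Python (the function name is illustrative):

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favours recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.5, 0.5))            # F1 = 0.5
print(f_beta(0.9, 0.3))            # F1 = 0.45
print(f_beta(0.9, 0.3, beta=2.0))  # F2 ~= 0.346, recall weighted more heavily
```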

Limitations as goals

There are other parameters and strategies for measuring the performance of an information retrieval system, such as the area under the ROC curve (AUC).
For web document retrieval, if the user's objectives are not clear, precision and recall cannot be optimized. As summarized by Lopresti,