Coreference


In linguistics, coreference, sometimes written co-reference, occurs when two or more expressions in a text refer to the same person or thing, that is, when they have the same referent. For example, in "Bill said he would come", the proper noun "Bill" and the pronoun "he" refer to the same person, namely Bill. Coreference is the main concept underlying binding phenomena in the field of syntax. The theory of binding explores the syntactic relationship that exists between coreferential expressions in sentences and texts. When two expressions are coreferential, one is usually a full form and the other an abbreviated form. Linguists use indices to show coreference, as with the index i in the example "Bill_i said he_i would come". Two expressions with the same reference are coindexed; in this example "Bill" and "he" are coindexed, indicating that they should be interpreted as coreferential.

Types

Several distinctions can be made when exploring coreference, including anaphora, cataphora, split antecedents, and coreferring noun phrases. When dealing with proforms, one distinguishes between anaphora and cataphora. When the proform follows the expression to which it refers, anaphora is present, and when it precedes the expression to which it refers, cataphora is present. These notions are illustrated as follows:

Versus bound variables

Semanticists and logicians sometimes draw a distinction between coreference and what is known as a bound variable. An instance of a bound variable can look like coreference, but from a technical standpoint, one can argue that it actually is not. Bound variables occur when the antecedent to the proform is an indefinite quantified expression, e.g.
Quantified expressions such as "every student" and "no student" are, from a technical standpoint, not referential. The subjects "every student" and "no student" are grammatically singular, but they do not pick out single referents in the discourse world. Thus, since the antecedent of the possessive adjective "his" is not referential, one cannot say that "his" is referential either. Instead, "his" is a variable that is bound by its antecedent. Its reference varies depending on which of the students in the discourse world is thought of. If Jack, John, and Jerry are the three students in the discourse world, then the meaning of "his" varies depending on whether Jack, John, or Jerry is the focus of the mind's eye. The existence of bound variables is perhaps more apparent with the following example:
This sentence is ambiguous. It can mean that Jack likes his grade, but everyone else dislikes Jack's grade, or more likely, it means that Jack likes his grade, but John dislikes his grade, and Jerry dislikes his grade. The second, more natural reading is the bound variable reading. While the distinction between coreference and bound variables may be real, coindexation can be construed as accommodating both. That is, when two or more expressions are coindexed, it indicates that one is dealing with coreference or a bound variable.
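The two readings described above can be written out in predicate logic. This is only a sketch: the constant j (for Jack), the function grade(x) ("x's grade"), and the predicate likes are notation assumed here for illustration, and "dislikes" is simplified to "does not like".

```latex
% Coreferential reading: "his" stays fixed to Jack throughout.
\mathrm{likes}(j, \mathrm{grade}(j)) \;\land\; \forall x\,\bigl(x \neq j \rightarrow \lnot\,\mathrm{likes}(x, \mathrm{grade}(j))\bigr)

% Bound variable reading: "his" co-varies with the quantified variable x.
\mathrm{likes}(j, \mathrm{grade}(j)) \;\land\; \forall x\,\bigl(x \neq j \rightarrow \lnot\,\mathrm{likes}(x, \mathrm{grade}(x))\bigr)
```

The formulas differ only in the argument of grade inside the quantified part: the constant j in the coreferential reading, the variable x in the bound variable reading.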

Coreference resolution

In computational linguistics, coreference resolution is a well-studied problem in discourse. To derive the correct interpretation of a text, or even to estimate the relative importance of the subjects mentioned, pronouns and other referring expressions must be connected to the right individuals. Algorithms intended to resolve coreferences commonly look first for the nearest preceding individual that is compatible with the referring expression. For example, "she" might attach to a preceding expression such as "the woman" or "Anne", but not to "Bill". Pronouns such as "himself" have much stricter constraints. As with many linguistic tasks, there is a tradeoff between precision and recall, and the way these are calculated can vary because no single algorithm exists to measure the quality of coreference chains. Cluster quality metrics commonly used to evaluate coreference resolution algorithms include the Rand index, the adjusted Rand index, and various mutual information-based methods.
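The nearest-compatible-antecedent heuristic and the Rand index mentioned above can be sketched in a few lines of Python. This is a toy illustration, not a description of any particular system: the FEATURES lexicon, the PRONOUNS set, and the function names resolve_nearest and rand_index are all assumptions made for the example, and compatibility is reduced to exact agreement in gender and number.

```python
from itertools import combinations

# Hypothetical toy lexicon of agreement features (gender, number);
# it is made up for this example and not taken from any real resource.
FEATURES = {
    "Bill": ("m", "sg"),
    "Anne": ("f", "sg"),
    "the woman": ("f", "sg"),
    "he": ("m", "sg"),
    "she": ("f", "sg"),
}
PRONOUNS = {"he", "she"}


def resolve_nearest(mentions):
    """Return one cluster id per mention, in textual order.

    A pronoun joins the cluster of the nearest preceding mention with
    matching features; every other mention starts a new cluster.
    """
    clusters = []
    for pos, mention in enumerate(mentions):
        cluster_id = pos  # default: start a new entity
        if mention in PRONOUNS:
            # Scan backwards from the closest preceding mention.
            for prev in range(pos - 1, -1, -1):
                if FEATURES[mentions[prev]] == FEATURES[mention]:
                    cluster_id = clusters[prev]
                    break
        clusters.append(cluster_id)
    return clusters


def rand_index(predicted, gold):
    """Fraction of mention pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(predicted)), 2))
    if not pairs:
        return 1.0  # fewer than two mentions: nothing to disagree on
    agreements = sum(
        (predicted[i] == predicted[j]) == (gold[i] == gold[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


predicted = resolve_nearest(["Bill", "Anne", "she", "he"])
print(predicted)                            # [0, 1, 1, 0]
print(rand_index(predicted, [0, 1, 1, 0]))  # 1.0
```

The heuristic has obvious limits. For the mentions "Anne", "the woman", "she", where all three are assumed to denote one person, it returns the clusters [0, 1, 1] because it never links two non-pronominal mentions, and the Rand index against the gold clustering [0, 0, 0] drops to 1/3.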
A classic problem for coreference resolution in English is the pronoun "it", which has many uses. "It" can refer much like "he" and "she", except that it generally refers to inanimate objects. It can also refer to abstractions rather than beings, e.g. "He was paid minimum wage, but didn't seem to mind it." Finally, "it" also has pleonastic uses, which do not refer to anything specific:
Pleonastic uses are not considered referential, and so are not part of coreference.
Approaches to coreference resolution can broadly be separated into mention-pair, mention-ranking, and entity-based algorithms. Mention-pair algorithms decide whether a given pair of mentions belongs to the same entity. Entity-wide constraints like gender are not considered, which leads to error propagation. For example, the pronouns "he" and "she" can each have a high probability of coreference with "the teacher", but cannot be coreferent with each other. Mention-ranking algorithms expand on this idea but stipulate that one mention can be coreferent with only one other mention. As a result, each previous mention must be assigned a score, and the highest-scoring mention is linked. Finally, in entity-based methods mentions are linked based on information about the whole coreference chain instead of individual mentions. The representation of a variable-width chain is more complex and computationally expensive than in mention-based methods, which has led to these algorithms being mostly based on neural network architectures.
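The three families can be contrasted on the "teacher" example with a small Python sketch. Everything in it is assumed for illustration: the pairwise scores in PAIR_SCORES stand in for the output of some trained classifier, the 0.5 threshold is arbitrary, and the entity-based function uses a hard gender check over the whole cluster as a simple stand-in for the learned chain representations used in practice.

```python
# Hypothetical mentions in textual order, with a gender feature (None = unknown).
MENTIONS = [("the teacher", None), ("he", "m"), ("she", "f")]

# Hypothetical pairwise scores keyed by (antecedent index, mention index).
PAIR_SCORES = {(0, 1): 0.8, (0, 2): 0.7, (1, 2): 0.1}


def mention_pair(n, scores, threshold=0.5):
    """Link every pair above the threshold, then merge transitively."""
    cluster = list(range(n))  # cluster[i] is the cluster id of mention i
    for (i, j), score in sorted(scores.items()):
        if score > threshold:
            old, new = cluster[j], cluster[i]
            cluster = [new if c == old else c for c in cluster]
    return cluster


def mention_ranking(n, scores, threshold=0.5):
    """Link each mention to its single best-scoring antecedent, if any."""
    cluster = list(range(n))
    for j in range(1, n):
        best = max(range(j), key=lambda i: scores[(i, j)])
        if scores[(best, j)] > threshold:
            cluster[j] = cluster[best]
    return cluster


def entity_based(mentions, scores, threshold=0.5):
    """Attach each mention to the best earlier cluster with no gender clash."""
    clusters = []  # each cluster is a list of mention indices
    for j, (_, gender) in enumerate(mentions):
        best, best_score = None, threshold
        for members in clusters:
            known = {mentions[i][1] for i in members} - {None}
            if gender is not None and known and gender not in known:
                continue  # clashes with a gender already in the chain
            score = max(scores[(i, j)] for i in members)
            if score > best_score:
                best, best_score = members, score
        if best is None:
            clusters.append([j])
        else:
            best.append(j)
    return clusters


print(mention_pair(3, PAIR_SCORES))         # [0, 0, 0]
print(mention_ranking(3, PAIR_SCORES))      # [0, 0, 0]
print(entity_based(MENTIONS, PAIR_SCORES))  # [[0, 1], [2]]
```

With these made-up scores, both mention-level variants place "he" and "she" in the same cluster as "the teacher", since each decision looks at one pair at a time. Only the entity-level check, which sees that the chain already contains "he", keeps "she" apart.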