Named-entity recognition
Named-entity recognition is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
Most research on NER systems has been structured as taking an unannotated block of text, such as this one:
Jim bought 300 shares of Acme Corp. in 2006.
And producing an annotated block of text that highlights the names of entities:
[Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.
In this example, a person name consisting of one token, a two-token company name and a temporal expression have been detected and classified.
State-of-the-art NER systems for English produce near-human performance. For example, the best system entering MUC-7 achieved an F-measure of 93.39%, while human annotators scored 97.60% and 96.95%.
Named-entity recognition platforms
Notable NER platforms include:
- GATE supports NER across many languages and domains out of the box, usable via a graphical interface and a Java API.
- OpenNLP includes rule-based and statistical named-entity recognition.
- SpaCy features fast statistical NER as well as an open-source named-entity visualizer.
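As an illustration of how such a platform is typically used, the following minimal sketch applies spaCy's pretrained English pipeline to a single sentence. It assumes the small model "en_core_web_sm" has been installed separately; the sentence is the illustrative example from above, not output guaranteed by the library.

```python
import spacy

# Load a small pretrained English pipeline (assumes the model has been
# downloaded beforehand, e.g. via `python -m spacy download en_core_web_sm`).
nlp = spacy.load("en_core_web_sm")

doc = nlp("Jim bought 300 shares of Acme Corp. in 2006.")

# Each detected entity carries its surface text, character offsets,
# and a predicted type label (e.g. PERSON, ORG, DATE).
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```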
Problem definition
Full named-entity recognition is often broken down, conceptually and possibly also in implementations, into two distinct problems: detection of names, and classification of the names by the type of entity they refer to.
The first phase is typically simplified to a segmentation problem: names are defined to be contiguous spans of tokens, with no nesting, so that "Bank of America" is a single name, disregarding the fact that inside this name, the substring "America" is itself a name. This segmentation problem is formally similar to chunking. The second phase requires choosing an ontology by which to organize categories of things.
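This segmentation view is commonly realized by assigning each token a BIO (begin/inside/outside) tag, so that a multi-token name such as "Bank of America" forms one contiguous, non-nested span. The sketch below decodes such tags back into spans; the tag scheme is standard, but the helper function and toy sentence are illustrative assumptions rather than part of any particular toolkit.

```python
# Tokens of a sentence with BIO tags: B- marks the first token of a name,
# I- marks continuation tokens, O marks tokens outside any name.
tokens = ["Bank", "of", "America", "announced", "a", "merger", "."]
tags   = ["B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O"]

def bio_to_spans(tokens, tags):
    """Collect contiguous B-/I- runs into (start, end, type) token spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                spans.append((start, i, etype))
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue
        else:
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans

print(bio_to_spans(tokens, tags))  # [(0, 3, 'ORG')] -> "Bank of America"
```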
Temporal expressions and some numerical expressions may also be considered named entities in the context of the NER task. While some instances of these types are good examples of rigid designators, many others are not: the year 2001, for instance, refers to the 2001st year of the Gregorian calendar, whereas the month June may refer to the month of an undefined year. It is arguable that the definition of named entity is loosened in such cases for practical reasons. The definition of the term named entity is therefore not strict and often has to be explained in the context in which it is used.
Several hierarchies of named-entity types have been proposed in the literature. The BBN categories, proposed in 2002, are used for question answering and consist of 29 types and 64 subtypes. Sekine's extended hierarchy, also proposed in 2002, is made up of 200 subtypes. More recently, in 2011, Ritter used a hierarchy based on common Freebase entity types in ground-breaking experiments on NER over social media text.
Formal evaluation
To evaluate the quality of a NER system's output, several measures have been defined. The usual measures are called precision, recall, and F1 score. However, several issues remain in just how to calculate those values.
These statistical measures work reasonably well for the obvious cases: finding or missing a real entity exactly, and finding a non-entity. However, NER can fail in many other ways, many of which are arguably "partially correct" and should not be counted as complete successes or failures. For example, a system may identify a real entity, but:
- with fewer tokens than desired
- with more tokens than desired
- partitioning adjacent entities differently
- assigning it a completely wrong type
- assigning it a related but inexact type
- correctly identifying an entity, when what the user wanted was a smaller- or larger-scope entity.
One overly simple way to score a system is token-level accuracy: counting the fraction of tokens that are correctly labelled as part (or not part) of an entity name. This suffers from at least two problems: first, the vast majority of tokens in real-world text are not part of entity names, so the baseline accuracy (always predicting "not an entity") is extravagantly high, typically above 90%; and second, mispredicting the full span of an entity name is not properly penalized.
A commonly used variant of the F1 score is therefore based on exact span matching, defined as follows (a sketch of this computation appears after the list):
- Precision is the fraction of predicted entity name spans that line up exactly with spans in the gold-standard evaluation data; when a predicted span does not exactly match a required span, precision for that predicted name is zero. Precision is then averaged over all predicted entity names.
- Recall is, similarly, the fraction of names in the gold standard that appear at exactly the same location in the predictions.
- The F1 score is the harmonic mean of these two.
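As a sketch of the exact-match convention just described, the function below computes precision, recall, and F1 from collections of spans. It assumes spans are represented as (start, end, type) tuples; this representation and the toy data are illustrative assumptions, not a reference implementation of any particular shared-task scorer.

```python
def exact_match_prf(predicted, gold):
    """Precision, recall, and F1 over exactly matching (start, end, type) spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Predicting two adjacent one-token names where the gold standard has a
# single two-token name yields zero credit under exact matching.
pred = [(0, 1, "ORG"), (1, 2, "ORG")]
gold = [(0, 2, "ORG")]
print(exact_match_prf(pred, gold))  # (0.0, 0.0, 0.0)
```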
Evaluation models based on token-by-token matching have also been proposed. Such models may give partial credit for overlapping matches (for example, using the Intersection over Union criterion); they allow a finer-grained evaluation and comparison of extraction systems.
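The sketch below illustrates one way such partial credit could be granted: each predicted span is scored against its best-overlapping gold span of the same type using Intersection over Union over token positions. The 0.5 threshold and the matching strategy are illustrative assumptions, not a prescribed evaluation protocol.

```python
def span_iou(a, b):
    """Intersection-over-Union of two token spans (start, end), end exclusive."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

def lenient_precision(predicted, gold, threshold=0.5):
    """Fraction of predictions whose best same-type IoU reaches the threshold."""
    if not predicted:
        return 0.0
    hits = 0
    for p_start, p_end, p_type in predicted:
        best = max((span_iou((p_start, p_end), (g_start, g_end))
                    for g_start, g_end, g_type in gold if g_type == p_type),
                   default=0.0)
        hits += best >= threshold
    return hits / len(predicted)

# A prediction covering 2 of the 3 tokens of "Bank of America" has
# IoU 2/3 and counts as a hit at the 0.5 threshold.
print(lenient_precision([(0, 2, "ORG")], [(0, 3, "ORG")]))  # 1.0
```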
Approaches
NER systems have been created that use linguistic grammar-based techniques as well as statistical models such as machine learning. Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists. Statistical NER systems typically require a large amount of manually annotated training data; semi-supervised approaches have been suggested to reduce the annotation effort. Many different classifier types have been used to perform machine-learned NER, with conditional random fields being a typical choice.
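As a sketch of the statistical approach, the following trains a conditional random field on BIO-tagged sentences using the third-party sklearn-crfsuite package. The feature template, hyperparameters, and the tiny training set are illustrative assumptions; a real system would use a large annotated corpus and far richer features.

```python
import sklearn_crfsuite  # third-party CRF wrapper, assumed to be installed

def token_features(tokens, i):
    """A deliberately tiny feature template for the token at position i."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

# Toy training data: one BIO-tagged sentence (a real corpus has thousands).
sentences = [["Jim", "works", "at", "Acme", "Corp.", "in", "Boston", "."]]
labels = [["B-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
y = labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)

# Predict BIO tags for a new sentence with the same feature template.
test = ["Mary", "joined", "Acme", "Corp.", "."]
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))
```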
Problem domains
In 2001, research indicated that even state-of-the-art NER systems were brittle, meaning that NER systems developed for one domain did not typically perform well on other domains. Considerable effort is involved in tuning NER systems to perform well in a new domain; this is true for both rule-based and trainable statistical systems.
Early work in NER systems in the 1990s was aimed primarily at extraction from journalistic articles. Attention then turned to the processing of military dispatches and reports. Later stages of the automatic content extraction (ACE) evaluation also included several types of informal text styles, such as weblogs and text transcripts from conversational telephone speech. Since about 1998, there has been a great deal of interest in entity identification in the molecular biology, bioinformatics, and medical natural language processing communities. The most common entities of interest in that domain have been names of genes and gene products. There has also been considerable interest in the recognition of chemical entities and drugs in the context of the CHEMDNER competition, with 27 teams participating in this task.
Current challenges and research
Despite the high F1 numbers reported on the MUC-7 dataset, the problem of named-entity recognition is far from being solved. The main efforts are directed at reducing annotation labor through semi-supervised learning, achieving robust performance across domains, and scaling up to fine-grained entity types. In recent years, many projects have turned to crowdsourcing, which is a promising way to obtain high-quality aggregate human judgments for supervised and semi-supervised machine learning approaches to NER. Another challenging task is devising models to deal with linguistically complex contexts such as Twitter and search queries. Several studies have compared the performance of NER systems built with different statistical models, such as HMM, maximum entropy, and CRF, and with different feature sets; graph-based semi-supervised learning models have also been proposed for language-specific NER tasks.
A recently emerging task of identifying "important expressions" in text and cross-linking them to Wikipedia, often called Wikification, can be seen as an instance of extremely fine-grained named-entity recognition, in which the types are the actual Wikipedia pages describing the concepts.
Another field that has seen progress but remains challenging is the application of NER to Twitter and other microblogs.