BLEU

BLEU is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" - this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.
Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations. Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality. Intelligibility or grammatical correctness are not taken into account.
BLEU's output is always a number between 0 and 1. This value indicates how similar the candidate text is to the reference texts, with values closer to 1 representing more similar texts. Few human translations will attain a score of 1, since this would indicate that the candidate is identical to one of the reference translations. For this reason, it is not necessary to attain a score of 1. Because there are more opportunities to match, adding additional reference translations will increase the BLEU score.

Algorithm

BLEU uses a modified form of precision to compare a candidate translation against multiple reference translations. The metric modifies simple precision since machine translation systems have been known to generate more words than are in a reference text. This is illustrated in the following example from Papineni et al.,

Candidate	the	the	the	the	the	the	the
Reference 1	the	cat	is	on	the	mat
Reference 2	there	is	a	cat	on	the	mat

Of the seven words in the candidate translation, all of them appear in the reference translations. Thus the candidate text is given a unigram precision of,
where is number of words from the candidate that are found in the reference, and is the total number of words in the candidate. This is a perfect score, despite the fact that the candidate translation above retains little of the content of either of the references.
The modification that BLEU makes is fairly straightforward. For each word in the candidate translation, the algorithm takes its maximum total count,, in any of the reference translations. In the example above, the word "the" appears twice in reference 1, and once in reference 2. Thus.
For the candidate translation, the count of each word is clipped to a maximum of for that word. In this case, "the" has and, thus is clipped to 2. These clipped counts are then summed over all distinct words in the candidate.
This sum is then divided by the total number of unigrams in the candidate translation. In the above example, the modified unigram precision score would be:
In practice, however, using individual words as the unit of comparison is not optimal. Instead, BLEU computes the same modified precision metric using n-grams. The length which has the "highest correlation with monolingual human judgements" was found to be four. The unigram scores are found to account for the adequacy of the translation, how much information is retained. The longer -gram scores account for the fluency of the translation, or to what extent it reads like "good English".

Model	Set of grams	Score
Unigram	"the", "the", "cat"
Grouped Unigram	"the"2, "cat"1
Bigram	"the the", "the cat"

One problem with BLEU scores is that they tend to favor short translations, which can produce very high precision scores, even using modified precision. An example of a candidate translation for the same references as above might be:
In this example, the modified unigram precision would be,
as the word 'the' and the word 'cat' appear once each in the candidate, and the total number of words is two. The modified bigram precision would be as the bigram, "the cat" appears once in the candidate. It has been pointed out that precision is usually twinned with recall to overcome this problem, as the unigram recall of this example would be or. The problem being that as there are multiple reference translations, a bad translation could easily have an inflated recall, such as a translation which consisted of all the words in each of the references.
To produce a score for the whole corpus the modified precision scores for the segments are combined using the geometric mean multiplied by a brevity penalty to prevent very short candidates from receiving too high a score. Let be the total length of the reference corpus, and the total length of the translation corpus. If, the brevity penalty applies, defined to be.
iBLEU is an interactive version of BLEU that allows a user to visually examine the BLEU scores obtained by the candidate translations. It also allows comparing two different systems in a visual and interactive manner which is useful for system development.

Performance

BLEU has frequently been reported as correlating well with human judgement, and remains a benchmark for the assessment of any new evaluation metric. There are however a number of criticisms that have been voiced. It has been noted that, although in principle capable of evaluating translations of any language, BLEU cannot, in its present form, deal with languages lacking word boundaries.
It has been argued that although BLEU has significant advantages, there is no guarantee that an increase in BLEU score is an indicator of improved translation quality.
There is an inherent, systemic problem with any metric based on comparing with one or a few reference translations: in real life, sentences can be translated in many different ways, sometimes with no overlap. Therefore, the approach of comparing by how much a computer translation differs from just a few human translations is flawed. HyTER is another automated MT metric that compares to very many translations in a reference grammar defined by human translators; the drawback is then that the human effort involved in correctly defining the combinatorially many ways to render the meaning of the translation in practice means HyTER also is only an approximation.

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...