N-gram


In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.
Using Latin numerical prefixes, an n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram"; size 3 is a "trigram". English cardinal numbers are sometimes used, e.g., "four-gram", "five-gram", and so on. In computational biology, a polymer or oligomer of a known size is called a k-mer instead of an n-gram, with specific names using Greek numerical prefixes such as "monomer", "dimer", "trimer", "tetramer", "pentamer", etc., or English cardinal numbers, "one-mer", "two-mer", "three-mer", etc.
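As an illustrative sketch (the function name `ngrams` is our own, not from any particular library), contiguous n-grams of any sequence can be extracted in a few lines of Python; a string yields character n-grams and a list of words yields word n-grams:

```python
def ngrams(items, n):
    """Return the list of contiguous n-grams of a sequence.

    Works for any sequence type: a string yields character n-grams,
    a list of words yields word n-grams.
    """
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Unigrams, bigrams and trigrams of a short word sequence.
words = ["to", "be", "or", "not", "to", "be"]
unigrams = ngrams(words, 1)   # [('to',), ('be',), ...]
bigrams = ngrams(words, 2)    # [('to', 'be'), ('be', 'or'), ...]
trigrams = ngrams(words, 3)   # [('to', 'be', 'or'), ...]
```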

Applications

An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence in the form of an (n − 1)-order Markov model. n-gram models are now widely used in probability, communication theory, computational linguistics, computational biology, and data compression. Two benefits of n-gram models are simplicity and scalability – with larger n, a model can store more context with a well-understood space–time tradeoff, enabling small experiments to scale up efficiently.

Examples

Figure 1 shows several example sequences and the corresponding 1-gram, 2-gram and 3-gram sequences.
Here are further examples; these are word-level 3-grams and 4-grams from the Google n-gram corpus.
An n-gram model models sequences, notably natural languages, using the statistical properties of n-grams.
This idea can be traced to Claude Shannon's work in information theory. Shannon posed the question: given a sequence of letters, what is the likelihood of the next letter? From training data, one can derive a probability distribution for the next letter given a history of size n: a = 0.4, b = 0.00001, c = 0, ...; where the probabilities of all possible "next-letters" sum to 1.0.
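Shannon's next-letter question can be answered empirically by counting. The following sketch (the function name and the toy training string are our own) estimates P(next letter | history) from a training text:

```python
from collections import Counter

def next_letter_distribution(text, history):
    """Estimate P(next letter | history) by counting every occurrence
    of `history` in the training text and tallying what follows it."""
    h = len(history)
    counts = Counter(
        text[i + h]
        for i in range(len(text) - h)
        if text[i:i + h] == history
    )
    total = sum(counts.values())
    return {letter: c / total for letter, c in counts.items()}

dist = next_letter_distribution("abracadabra", "ab")
# "ab" is followed by "r" both times it occurs, so P(r | ab) = 1.0
```

By construction the returned probabilities sum to 1.0, matching the description above.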
More concisely, an n-gram model predicts x_i based on x_{i-(n-1)}, ..., x_{i-1}. In probability terms, this is P(x_i | x_{i-(n-1)}, ..., x_{i-1}). When used for language modeling, independence assumptions are made so that each word depends only on the last n − 1 words. This Markov model is used as an approximation of the true underlying language. This assumption is important because it massively simplifies the problem of estimating the language model from data. In addition, because of the open nature of language, it is common to group words unknown to the language model together.
Note that in a simple n-gram language model, the probability of a word, conditioned on some number of previous words, can be described as following a categorical distribution.
In practice, the probability distributions are smoothed by assigning non-zero probabilities to unseen words or n-grams; see smoothing techniques.
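A minimal unsmoothed example of the estimation step, for the bigram case (n = 2): maximum-likelihood probabilities are relative frequencies, P(w | prev) = c(prev, w) / c(prev). The function name and toy corpus here are our own illustration:

```python
from collections import Counter

def train_bigram_model(tokens):
    """Maximum-likelihood bigram model: P(w | prev) = c(prev, w) / c(prev)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_counts = Counter(tokens[:-1])
    return {
        (prev, w): c / prev_counts[prev]
        for (prev, w), c in pair_counts.items()
    }

tokens = "the dog saw the cat and the dog barked".split()
model = train_bigram_model(tokens)
# "the" is followed by "dog" twice and "cat" once:
# P(dog | the) = 2/3, P(cat | the) = 1/3
```

Any bigram absent from the training data gets probability zero under this estimator, which is exactly the problem smoothing addresses.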

Applications and considerations

n-gram models are widely used in statistical natural language processing. In speech recognition, phonemes and sequences of phonemes are modeled using an n-gram distribution. For parsing, words are modeled such that each n-gram is composed of n words. For language identification, sequences of characters/graphemes are modeled for different languages. For sequences of characters, the 3-grams that can be generated from "good morning" are "goo", "ood", "od ", "d m", " mo", "mor" and so forth, counting the space character as a gram. For sequences of words, the trigrams that can be generated from "the dog smelled like a skunk" are "# the dog", "the dog smelled", "dog smelled like", "smelled like a", "like a skunk" and "a skunk #".
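Both examples above can be reproduced with a short sketch (the helper names are our own; "#" is the boundary marker used in the word-trigram example):

```python
def char_ngrams(text, n):
    """Character n-grams, counting the space character as a gram."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_trigrams(sentence):
    """Word trigrams with '#' padding at the sentence boundaries."""
    words = ["#"] + sentence.split() + ["#"]
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

char_ngrams("good morning", 3)
# ['goo', 'ood', 'od ', 'd m', ' mo', 'mor', 'orn', 'rni', 'nin', 'ing']
word_trigrams("the dog smelled like a skunk")
# ['# the dog', 'the dog smelled', 'dog smelled like',
#  'smelled like a', 'like a skunk', 'a skunk #']
```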
Practitioners more interested in multiple word terms might preprocess strings to remove spaces. Many simply collapse whitespace to a single space while preserving paragraph marks, because the whitespace is frequently either an element of writing style or introduces layout or presentation not required by the prediction and deduction methodology. Punctuation is also commonly reduced or removed by preprocessing and is frequently used to trigger functionality.
n-grams can also be used for sequences of words or almost any type of data. For example, they have been used for extracting features for clustering large sets of satellite earth images and for determining what part of the Earth a particular image came from. They have also been very successful as the first pass in genetic sequence search and in the identification of the species from which short sequences of DNA originated.
n-gram models are often criticized because they lack any explicit representation of long range dependency. This is because the only explicit dependency range is (n − 1) tokens for an n-gram model, and since natural languages incorporate many cases of unbounded dependencies, this means that an n-gram model cannot in principle distinguish unbounded dependencies from noise. For this reason, n-gram models have not made much impact on linguistic theory, where part of the explicit goal is to model such dependencies.
Another criticism that has been made is that Markov models of language, including n-gram models, do not explicitly capture the performance/competence distinction. This is because n-gram models are not designed to model linguistic knowledge as such, and make no claims to being complete models of linguistic knowledge; instead, they are used in practical applications.
In practice, n-gram models have been shown to be extremely effective in modeling language data, which is a core component in modern statistical language applications.
Most modern applications that rely on n-gram based models, such as machine translation applications, do not rely exclusively on such models; instead, they typically also incorporate Bayesian inference. Modern statistical models are typically made up of two parts, a prior distribution describing the inherent likelihood of a possible result and a likelihood function used to assess the compatibility of a possible result with observed data. When a language model is used, it is used as part of the prior distribution, and even then it is often not the only component in this distribution.
Handcrafted features of various sorts are also used, for example variables that represent the position of a word in a sentence or the general topic of discourse. In addition, features based on the structure of the potential result, such as syntactic considerations, are often used. Such features are also used as part of the likelihood function, which makes use of the observed data. Conventional linguistic theory can be incorporated in these features.

Out-of-vocabulary words

An issue when using n-gram language models is out-of-vocabulary (OOV) words. They are encountered in computational linguistics and natural language processing when the input includes words which were not present in a system's dictionary or database during its preparation. By default, when a language model is estimated, the entire observed vocabulary is used. In some cases, it may be necessary to estimate the language model with a specific fixed vocabulary. In such a scenario, the n-grams in the corpus that contain an out-of-vocabulary word are ignored. The n-gram probabilities are smoothed over all the words in the vocabulary even if they were not observed.
Nonetheless, it is essential in some cases to explicitly model the probability of out-of-vocabulary words by introducing a special token into the vocabulary. Out-of-vocabulary words in the corpus are effectively replaced with this special token before n-grams counts are cumulated. With this option, it is possible to estimate the transition probabilities of n-grams involving out-of-vocabulary words.
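The replacement step described above can be sketched as follows; the token spelling `<unk>` and the helper name are our own choices for illustration, not a fixed standard:

```python
UNK = "<unk>"

def replace_oov(tokens, vocabulary):
    """Map every token outside a fixed vocabulary to the special <unk>
    token before n-gram counts are accumulated."""
    return [t if t in vocabulary else UNK for t in tokens]

vocab = {"the", "cat", "sat"}
tokens = "the cat sat on the mat".split()
replace_oov(tokens, vocab)
# ['the', 'cat', 'sat', '<unk>', 'the', '<unk>']
```

Counting n-grams over the rewritten sequence then yields transition probabilities involving the `<unk>` token.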

''n''-grams for approximate matching

n-grams can also be used for efficient approximate matching. By converting a sequence of items to a set of n-grams, it can be embedded in a vector space, thus allowing the sequence to be compared to other sequences in an efficient manner. For example, if we convert strings with only letters in the English alphabet into single character 3-grams, we get a 26^3-dimensional space. Using this representation, we lose information about the string. For example, both the strings "abc" and "bca" give rise to exactly the same 2-gram "bc". However, we know empirically that if two strings of real text have a similar vector representation then they are likely to be similar. Other metrics have also been applied to vectors of n-grams with varying, sometimes better, results. For example, z-scores have been used to compare documents by examining how many standard deviations each n-gram differs from its mean occurrence in a large collection, or text corpus, of documents. In the event of small counts, the g-score may give better results for comparing alternative models.
It is also possible to take a more principled approach to the statistics of n-grams, modeling similarity as the likelihood that two strings came from the same source directly in terms of a problem in Bayesian inference.
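A minimal sketch of approximate matching via n-gram vectors, using cosine similarity as the comparison metric (the function names and example strings are our own; sparse count vectors stand in for the full 26^3-dimensional space):

```python
import math
from collections import Counter

def profile(text, n=3):
    """Sparse vector of character n-gram counts for a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(c * q[g] for g, c in p.items())
    norm = math.sqrt(sum(c * c for c in p.values())) * \
           math.sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0

a = profile("approximate matching")
b = profile("aproximate matchng")    # misspelled variant
c = profile("completely different")
cosine(a, b) > cosine(a, c)          # the near-duplicate scores higher
```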
n-gram-based searching can also be used for plagiarism detection.

Other applications

n-grams find use in several areas of computer science, computational linguistics, and applied mathematics.
They have been used to:
  • design kernels that allow machine learning algorithms such as support vector machines to learn from string data
  • find likely candidates for the correct spelling of a misspelled word
  • improve compression in compression algorithms where a small area of data requires n-grams of greater length
  • assess the probability of a given word sequence appearing in text of a language of interest in pattern recognition systems, speech recognition, OCR, Intelligent Character Recognition, machine translation and similar applications
  • improve retrieval in information retrieval systems when it is hoped to find similar "documents" given a single query document and a database of reference documents
  • improve retrieval performance in genetic sequence analysis as in the BLAST family of programs
  • identify the language a text is in or the species a small sequence of DNA was taken from
  • predict letters or words at random in order to create text, as in the dissociated press algorithm.
  • assist in cryptanalysis
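The dissociated press idea from the list above can be sketched with a bigram model: each output word is drawn at random from the words observed to follow the previous one. The function name, seed handling, and toy corpus are our own illustration:

```python
import random
from collections import defaultdict

def dissociated_press(tokens, length, seed=0):
    """Generate pseudo-random text from bigram statistics, in the spirit
    of the dissociated press algorithm: each word is drawn from the
    words observed to follow the previous word in the training text."""
    followers = defaultdict(list)
    for prev, w in zip(tokens, tokens[1:]):
        followers[prev].append(w)
    rng = random.Random(seed)
    word = rng.choice(tokens)
    out = [word]
    for _ in range(length - 1):
        choices = followers.get(word) or tokens  # restart on a dead end
        word = rng.choice(choices)
        out.append(word)
    return " ".join(out)

sample = "the dog saw the cat and the cat saw the dog".split()
print(dissociated_press(sample, 8))
```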

Space required for an ''n''-gram

Consider an n-gram where the units are characters and a text with t characters drawn from an alphabet of size σ. The space a table of all possible n-grams requires is exponential in n: there are σ^n possible n-grams, although the text itself contains at most t − n + 1 distinct ones. The space actually used for small values of n can be measured empirically; a parabola can then be fitted through the discrete data points by taking three (n, space) pairs and solving a linear system with three unknowns, which yields a general interpolation formula.
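The two bounds just described, the exponential σ^n and the linear t − n + 1, can be checked on a small text (the helper name and example sentence are our own):

```python
def distinct_ngrams(text, n):
    """Number of distinct character n-grams occurring in a text."""
    return len({text[i:i + n] for i in range(len(text) - n + 1)})

text = "the quick brown fox jumps over the lazy dog"
sigma = len(set(text))           # size of the alphabet actually used
for n in range(1, 5):
    possible = sigma ** n        # exponential upper bound
    cap = len(text) - n + 1      # linear bound from the text length
    observed = distinct_ngrams(text, n)
    assert observed <= min(possible, cap)
```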

Bias-versus-variance trade-off

To choose a value for n in an n-gram model, it is necessary to find the right trade-off between the stability of the estimate and its appropriateness to the data: larger n captures more context but makes the counts sparser. In practice, a trigram model is a common choice with large training corpora, whereas a bigram model is often used with smaller ones.

Smoothing techniques

There is a problem of balancing weight between infrequent grams and frequent grams. Also, items not seen in the training data will be given a probability of 0.0 without smoothing. For unseen but plausible data from a sample, one can introduce pseudocounts. Pseudocounts are generally motivated on Bayesian grounds.
In practice it is necessary to smooth the probability distributions by also assigning non-zero probabilities to unseen words or n-grams. The reason is that models derived directly from the n-gram frequency counts have severe problems when confronted with any n-grams that have not explicitly been seen before – the zero-frequency problem. Various smoothing methods are used, from simple "add-one" smoothing to more sophisticated models, such as Good–Turing discounting or back-off models. Some of these methods are equivalent to assigning a prior distribution to the probabilities of the n-grams and using Bayesian inference to compute the resulting posterior n-gram probabilities. However, the more sophisticated smoothing models were typically not derived in this fashion, but instead through independent considerations.
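The simplest of these methods, add-one (Laplace) smoothing, adds a pseudocount of 1 to every bigram: P(w | prev) = (c(prev, w) + 1) / (c(prev) + |V|). A sketch with our own function name and toy corpus:

```python
from collections import Counter

def laplace_bigram_prob(tokens, vocab_size):
    """Add-one (Laplace) smoothed bigram probabilities:
    P(w | prev) = (c(prev, w) + 1) / (c(prev) + |V|),
    so unseen bigrams receive a small non-zero probability."""
    pair = Counter(zip(tokens, tokens[1:]))
    prev = Counter(tokens[:-1])
    def p(prev_word, word):
        return (pair[(prev_word, word)] + 1) / (prev[prev_word] + vocab_size)
    return p

tokens = "the cat sat on the mat".split()
p = laplace_bigram_prob(tokens, vocab_size=len(set(tokens)))
p("the", "cat")   # seen bigram: (1 + 1) / (2 + 5)
p("the", "sat")   # unseen bigram: (0 + 1) / (2 + 5), non-zero
```

Note that for each history the smoothed probabilities still sum to 1 over the vocabulary, so this remains a proper distribution.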
In the field of computational linguistics, in particular language modeling, skip-grams are a generalization of n-grams in which the components need not be consecutive in the text under consideration, but may leave gaps that are skipped over. They provide one way of overcoming the data sparsity problem found with conventional n-gram analysis.
Formally, an n-gram is a consecutive subsequence of length n of some sequence of tokens. A k-skip-n-gram is a length-n subsequence whose components occur at distance at most k from each other.
For example, in the input text "the rain in Spain falls mainly on the plain", the set of 1-skip-2-grams includes all the bigrams, and in addition the subsequences "the in", "rain Spain", "in falls", "Spain mainly", "falls on", "mainly the", and "on plain".
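Skip-gram extraction can be sketched by enumerating index combinations and keeping those whose successive positions skip at most k intervening tokens (the function name is our own; k = 0 recovers ordinary n-grams):

```python
from itertools import combinations

def skipgrams(tokens, n, k):
    """All k-skip-n-grams: length-n subsequences of the token list whose
    successive components skip at most k intervening tokens."""
    result = []
    for combo in combinations(range(len(tokens)), n):
        if all(b - a <= k + 1 for a, b in zip(combo, combo[1:])):
            result.append(tuple(tokens[i] for i in combo))
    return result

tokens = "the rain in Spain falls mainly on the plain".split()
sg = skipgrams(tokens, 2, 1)
# the 8 ordinary bigrams plus 7 skip-one pairs such as ('the', 'in')
```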

Syntactic ''n''-grams

Syntactic n-grams are n-grams defined by paths in syntactic dependency or constituent trees rather than the linear structure of the text. For example, the sentence "economic news has little effect on financial markets" can be transformed to syntactic n-grams following the tree structure of its dependency relations: news-economic, effect-little, effect-on-markets-financial.
Syntactic n-grams are intended to reflect syntactic structure more faithfully than linear n-grams, and have many of the same applications, especially as features in a Vector Space Model. For certain tasks, such as authorship attribution, syntactic n-grams give better results than standard n-grams.
Another type of syntactic n-grams are part-of-speech n-grams, defined as fixed-length contiguous overlapping subsequences that are extracted from part-of-speech sequences of text. Part-of-speech n-grams have several applications, most commonly in information retrieval.