Substitution model


In biology, a substitution model describes the process from which a sequence of symbols changes into another set of traits. For example, in cladistics, each position in the sequence might correspond to a property of a species which can either be present or absent. The alphabet could then consist of "0" for absence and "1" for presence. Then the sequence 00110 could mean, for example, that a species does not have feathers or lay eggs, does have fur, is warm-blooded, and cannot breathe underwater. Another sequence 11010 would mean that a species has feathers, lays eggs, does not have fur, is warm-blooded, and cannot breathe underwater. In phylogenetics, sequences are often obtained by firstly obtaining a nucleotide or protein sequence alignment, and then taking the bases or amino acids at corresponding positions in the alignment as the characters. Sequences achieved by this might look like AGCGGAGCTTA and GCCGTAGACGC.
Substitution models are used, for example, for constructing evolutionary trees in phylogenetics or cladistics, and simulating sequences to test other methods and algorithms.

Neutral, independent, finite sites models

Most substitution models used to date are neutral, independent, finite sites models. Neutral sites mean selection does not operate on the substitutions, and so they are unconstrained. Independent sites mean changes in one site do not affect the probability of changes in another site. Finite sites are finitely many sites, and so over evolution, a single site can be changed multiple times. This means that, for example, if a character has value 0 at time 0 and at time t, it could be that no changes occurred, or that it changed to a 1 and back to a 0, or that it changed to a 1 and back to a 0 and then to a 1 and then back to a 0, and so on.

The molecular clock and the units of time

Typically, a branch length of a phylogenetic tree is expressed as the expected number of substitutions per site; if the evolutionary model indicates that each site within an ancestral sequence will typically experience x substitutions by the time it evolves to a particular descendant's sequence then the ancestor and descendant are considered to be separated by branch length x.
Sometimes a branch length is measured in terms of geological years. For example, a fossil record may make it possible to determine the number of years between an ancestral species and a descendant species. Because some species evolve at faster rates than others, these two measures of branch length are not always in direct proportion. The expected number of substitutions per site per year is often indicated with the Greek letter mu.
A model is said to have a strict molecular clock if the expected number of substitutions per year μ is constant regardless of which species' evolution is being examined. An important implication of a strict molecular clock is that the number of expected substitutions between an ancestral species and any of its present-day descendants must be independent of which descendant species is examined.
Note that the assumption of a strict molecular clock is often unrealistic, especially across long periods of evolution. For example, even though rodents are genetically very similar to primates, they have undergone a much higher number of substitutions in the estimated time since divergence in some regions of the genome. This could be due to their shorter generation time, higher metabolic rate, increased population structuring, increased rate of speciation, or smaller body size. When studying ancient events like the Cambrian explosion under a molecular clock assumption, poor concurrence between cladistic and phylogenetic data is often observed. There has been some work on models allowing variable rate of evolution.
Models that can take into account variability of the rate of the molecular clock between different evolutionary lineages in the phylogeny are called “relaxed” in opposition to “strict”. In such models the rate can be assumed to be correlated or not between ancestors and descendants and rate variation among lineages can be drawn from many distributions but usually exponential and lognormal distributions are applied. There is a special case, called “local molecular clock” when a phylogeny is divided into at least two partitions and a strict molecular clock is applied in each, but with different rates.

Time-reversible and stationary models

Many useful substitution models are time-reversible; in terms of the mathematics, the model does not care which sequence is the ancestor and which is the descendant so long as all other parameters are held constant.
When an analysis of real biological data is performed, there is generally no access to the sequences of ancestral species, only to the present-day species. However, when a model is time-reversible, which species was the ancestral species is irrelevant. Instead, the phylogenetic tree can be rooted using any of the species, re-rooted later based on new knowledge, or left unrooted. This is because there is no 'special' species, all species will eventually derive from one another with the same probability.
A model is time reversible if and only if it satisfies the property
or, equivalently, the detailed balance property,
for every i, j, and t.
Time-reversibility should not be confused with stationarity. A model is stationary if Q does not change with time. The analysis below assumes a stationary model.

The mathematics of substitution models

Stationary, neutral, independent, finite sites models have two parameters, π, an equilibrium vector of base frequencies and a rate matrix, Q, which describes the rate at which bases of one type change into bases of another type; element for ij is the rate at which base i goes to base j. The diagonals of the Q matrix are chosen so that the rows sum to zero:
The equilibrium row vector π must be annihilated by the rate matrix Q:
The transition matrix function is a function from the branch lengths, to a matrix of conditional probabilities. It is denoted. The entry in the ith column and the jth row,, is the probability, after time t, that there is a base j at a given position, conditional on there being a base i in that position at time 0. When the model is time reversible, this can be performed between any two sequences, even if one is not the ancestor of the other, if you know the total branch length between them.
The asymptotic properties of Pij are such that Pij = δij, where δij is the Kronecker delta function. That is, there is no change in base composition between a sequence and itself. At the other extreme, or, in other words, as time goes to infinity the probability of finding base j at a position given there was a base i at that position originally goes to the equilibrium probability that there is base j at that position, regardless of the original base. Furthermore, it follows that for all t.
The transition matrix can be computed from the rate matrix via matrix exponentiation:
where Qn is the matrix Q multiplied by itself enough times to give its nth power.
If Q is diagonalizable, the matrix exponential can be computed directly: let Q = U−1 Λ U be a diagonalization of Q, with
where Λ is a diagonal matrix and where are the eigenvalues of Q, each repeated according to its multiplicity. Then
where the diagonal matrix eΛt is given by

Generalised time reversible

Generalised time reversible is the most general neutral, independent, finite-sites, time-reversible model possible. It was first described in a general form by Simon Tavaré in 1986. The GTR model is often called the general time reversible model in publications; it has also been called the REV model.
The GTR parameters for nucleotides consist of an equilibrium base frequency vector,, giving the frequency at which each base occurs at each site, and the rate matrix
Because the model must be time reversible and must approach the equilibrium nucleotide frequencies at long times, each rate below the diagonal equals the reciprocal rate above the diagonal multiplied by the equilibrium ratio of the two bases. As such, the nucleotide GTR requires 6 substitution rate parameters and 4 equilibrium base frequency parameters. Since the 4 frequency parameters must sum to 1, there are only 3 free frequency parameters. The total of 9 free parameters is often further reduced to 8 parameters plus, the overall number of substitutions per unit time. When measuring time in substitutions only 8 free parameters remain.
In general, to compute the number of parameters, you count the number of entries above the diagonal in the matrix, i.e. for n trait values per site, and then add n-1 for the equilibrium frequencies, and subtract 1 because is fixed. You get
For example, for an amino acid sequence, you would find there are 208 parameters. However, when studying coding regions of the genome, it is more common to work with a codon substitution model. There are codons, resulting in 2078 free parameters. However, the rates for transitions between codons which differ by more than one base are often assumed to be zero, reducing the number of free parameters to only parameters. Another common practice is to reduce the number of codons by forbidding the stop codons. This is a biologically reasonable assumption because including the stop codons would mean that one is calculating the probability of finding sense codon ' after time ' given that the ancestral codon is ' would involve the possibility of passing through a state with a premature stop codon.
An alternative way to write the instantaneous rate matrix for the nucleotide GTR model is:
The
' matrix is normalized so.
This notation is easier to understand than the notation originally used by Tavaré, because all model parameters correspond either to "exchangeability" parameters or to equilibrium nucleotide frequencies. Note that the nucleotides in the ' matrix have been written in alphabetical order. In other words, the transition probability matrix for the ' matrix above would be:
Some publications write the nucleotides in a different order. These difference in notation make it important to be clear regarding the order of the states when writing the ' matrix.
The value of this notation is that instantaneous rate of change from nucleotide
' to nucleotide ' can always be written as ', where ' is the exchangeability of nucleotides ' and ' and ' is the equilibrium frequency of the ' nucleotide. The matrix shown above uses the letters ' through ' for the exchangeability parameters in the interest of readability, but those parameters could also be to written in a systematic manner using the ' notation.
Note that the ordering of the nucleotide subscripts for exchangeability parameters is irrelevant but the transition probability matrix values are not.
An arbitrarily chosen exchangeability parameters is typically set to a value of 1 to increase the readability of the exchangeability parameter estimates. The practice of expressing the exchangeability parameters in relative terms is not problematic because the ' matrix is normalized. Normalization allows ' in the matrix exponentiation to be expressed in units of expected substitutions per site and reducing the number of free parameters to eight. Specifically, there are five free exchangeability parameters and three equilibrium base frequency parameters.
The alternative notation also makes it easier to understand the sub-models of the GTR model, which simply correspond to cases where exchangeability and/or equilibrium base frequency parameters are constrained to take on equal values. A number of specific sub-models have been named, largely based on their original publications:
ModelExchangeability parametersBase frequency parametersReference
JC69 Jukes and Cantor
F81all values freeFelsenstein
K2P , Kimura
HKY85, all values freeHasegawa et al.
K3ST , , Kimura
TN93, , all values freeTamura and Nei
SYMall exchangeability parameters freeZharkikh
GTR all exchangeability parameters freeall values freeTavaré

There are 203 possible ways that the exchangeability parameters can be restricted to form sub-models of GTR, ranging from the JC69 and F81 models to the SYM model and the full GTR model all values are constrained to be equal all values are treated as free parameters. Although the equilibrium base frequencies can be constrained in other ways most constraints the link some but not all values are unrealistic from a biological standpoint. The possible exception is enforcing strand symmetry.
The alternative notation also makes it straightforward to see how the GTR model can be applied to biological alphabets with a larger state-space. It is possible to write a set of equilibrium state frequencies as,,... and a set of exchangeability parameters for any alphabet of character states. These values can the be used to populate the matrix by setting the off-diagonal elements as shown above, setting the diagonal elements to the negative sum of the off-diagonal elements on the same row, and normalizing. Obviously, for amino acids and for codons. However, the generality of this notation is beneficial because one can use reduced alphabets for amino acids. For example, one can use and encode amino acids by recoding the amino acids using the six categories proposed by Margaret Dayhoff. Reduced amino acid alphabets are viewed as a way to reduce the impact of compositional variation and saturation.

Mechanistic vs. empirical models

A main difference in evolutionary models is how many parameters are estimated every time for the data set under consideration and how many of them are estimated once on a large data set. Mechanistic models describe all substitutions as a function of a number of parameters which are estimated for every data set analyzed, preferably using maximum likelihood. This has the advantage that the model can be adjusted to the particularities of a specific data set. Problems can arise when too many parameters are used, particularly if they can compensate for each other. Then it is often the case that the data set is too small to yield enough information to estimate all parameters accurately.
Empirical models are created by estimating many parameters from a large data set. These parameters are then fixed and will be reused for every data set. This has the advantage that those parameters can be estimated more accurately. Normally, it is not possible to estimate all entries of the substitution matrix from the current data set only. On the downside, the parameters estimated from the training data might be too generic and therefore have a poor fit to any particular dataset. A potential solution for that problem is to estimate some parameters from the data using maximum likelihood. In studies of protein evolution the equilibrium amino acid frequencies are often estimated from the data while keeping the exchangeability matrix fixed. Beyond the common practice of estimating amino acid frequencies from the data, methods to estimate exchangeability parameters or adjust the matrix for protein evolution in other ways have been proposed.
With the large-scale genome sequencing still producing very large amounts of DNA and protein sequences, there is enough data available to create empirical models with any number of parameters, including empirical codon models. Because of the problems mentioned above, the two approaches are often combined, by estimating most of the parameters once on large-scale data, while a few remaining parameters are then adjusted to the data set under consideration. The following sections give an overview of the different approaches taken for DNA, protein or codon-based models.

DNA substitution models

Models of DNA evolution were first proposed in 1969 by Jukes and Cantor. The Jukes-Cantor model assumes equal transition rates as well as equal equilibrium frequencies for all bases and it is the simplest sub-model of the GTR model. In 1980, Motoo Kimura introduced a model with two parameters : one for the transition and one for the transversion rate. A year later, Kimura introduced a second model with three substitution types: one for the transition rate, one for the rate of transversions that conserve the strong/weak properties of nucleotides, and one for rate of transversions that conserve the amino/keto properties of nucleotides. An 1981, Joseph Felsenstein proposed a four-parameter model in which the substitution rate corresponds to the equilibrium frequency of the target nucleotide. Hasegawa, Kishino, and Yano unified the two last models to a five-parameter model. After these pioneering efforts, many additional sub-models of the GTR model were introduced into the literature in the 1990s. Other models that move beyond the GTR model in specific ways were also developed and refined by several researchers.
Almost DNA substitution models are mechanistic models. The small number of parameters that one needs to estimate for these models makes it feasible to estimate those parameters from the data. It is also necessary because the patterns of DNA sequence evolution often differ among organisms and among genes within organisms. The later may reflect optimization by the action of selection for specific purposes or it might reflect neutral variation in the patterns of substitution. Thus, depending on the organism and the type of gene, it is likely necessary to adjust the model to these circumstances.

Amino acid substitution models

For many analyses, particularly for longer evolutionary distances, the evolution is modeled on the amino acid level. Since not all DNA substitution also alter the encoded amino acid, information is lost when looking at amino acids instead of nucleotide bases. However, several advantages speak in favor of using the amino acid information: DNA is much more inclined to show compositional bias than amino acids, not all positions in the DNA evolve at the same speed, but probably most important, because of those fast evolving positions and the limited alphabet size, the DNA suffers from more back substitutions, making it difficult to accurately estimate evolutionary longer distances.
Unlike the DNA models, amino acid models traditionally are empirical models. They were pioneered in the 1960s and 1970s by Dayhoff and co-workers by estimating replacement rates from protein alignments with at least 85% identity. This minimized the chances of observing multiple substitutions at a site. From the estimated rate matrix, a series of replacement probability matrices were derived, known under names such as PAM250. Log-odds matrices based on the Dayhoff PAM model were commonly used to assess the significance of homology search results, although the BLOSUM matrices have superseded the PAM log-odds matrices in this context because the BLOSUM matrices appear to be more sensitive across a variety of evolutionary distances, unlike the PAM log-odds matrices.
The Dayhoff PAM matrix was the source of the exchangeability parameters used in one of the first maximum-likelihood analyses of phylogeny that used protein data and the PAM model continues to be used in phylogenetics. However, the limited number of alignments used to generate the PAM model almost certainly inflated the variance of some rate matrix parameters. Regardless, it is clear that the PAM model seldom has as good of a fit to most datasets as more modern empirical models.
Starting in the 1990s, the rapid expansion of sequence databases due to improved sequencing technologies led to the estimation of many new empirical matrices. The earliest efforts used methods similar to those used by Dayhoff, using large-scale matching of the protein database to generate a new log-odds matrix and the JTT model. The rapid increases in compute power during this time made it feasible to estimate parameters for empirical models using maximum likelihood and other methods.

The no common mechanism (NCM) model and maximum parsimony

In 1997, Tuffley and Steel described a model that they named the no common mechanism model. The topology of the maximum likelihood tree for a specific dataset given the NCM model is identical to the topology of the optimal tree for the same data given the maximum parsimony criterion. The NCM model assumes all of the data are related by a common phylogenetic tree. Then ' parameters are introduced for each homologous character, where ' is the number of sequences. This can be viewed as estimating a separate rate parameter for every character × branch pair in the dataset. Thus, the number of free parameters in the NCM model always exceeds the number of homologous characters in the data matrix.