Direct coupling analysis

Direct coupling analysis or DCA is an umbrella term comprising several methods for analyzing sequence data in computational biology. The common idea of these methods is to use statistical modeling to quantify the strength of the direct relationship between two positions of a biological sequence, excluding effects from other positions. This contrasts with the usual measures of correlation, which can be large even if there is no direct relationship between the positions. Such a direct relationship can, for example, be the evolutionary pressure for two positions to maintain mutual compatibility in the biomolecular structure of the sequence, leading to molecular coevolution between the two positions.
DCA has been used in the inference of protein residue contacts, RNA structure prediction, the inference of protein-protein interaction networks, and the modeling of fitness landscapes.

Mathematical Model and Inference

Mathematical Model

The basis of DCA is a statistical model for the variability within a set of phylogenetically related biological sequences. When fitted to a multiple sequence alignment (MSA) of sequences of length $N$, the model defines a probability for all possible sequences of the same length. This probability can be interpreted as the probability that the sequence in question belongs to the same class of sequences as the ones in the MSA, for example the class of all protein sequences belonging to a specific protein family.
We denote a sequence by $\mathbf{a} = (a_1, a_2, \dots, a_N)$, with the $a_i$ being categorical variables representing the monomers of the sequence. The probability of a sequence within a model is then defined as
$$P(\mathbf{a}) = \frac{1}{Z} \exp\left( \sum_{i=1}^{N} h_i(a_i) + \sum_{i=1}^{N} \sum_{j=i+1}^{N} J_{ij}(a_i, a_j) \right),$$
where $h_i(a)$ and $J_{ij}(a,b)$ are real numbers representing the parameters of the model, and $Z$ is a normalization constant ensuring $\sum_{\mathbf{a}} P(\mathbf{a}) = 1$.
The parameters $h_i(a)$ depend on one position $i$ and the symbol $a$ at this position. They are usually called fields and represent the propensity of symbol $a$ to be found at position $i$. The parameters $J_{ij}(a,b)$ depend on pairs of positions $i, j$ and the symbols $a, b$ at these positions. They are usually called couplings and represent an interaction, i.e. a term quantifying how compatible the symbols at both positions are with each other. The model is fully connected, so there are interactions between all pairs of positions. The model can be seen as a generalization of the Ising model, with spins not only taking two values, but any value from a given finite alphabet. In fact, when the size of the alphabet is 2, the model reduces to the Ising model. Because it also resembles the Potts model of statistical physics, it is often called the Potts model.
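As an illustration of the definition above, the following minimal sketch evaluates the model by brute-force enumeration. The function names, the array layout (`h[i, a]` for fields, `J[i, j, a, b]` for couplings, with only `i < j` used) and the random parameters are assumptions made for this example; enumeration is only feasible for very small sequence lengths and alphabets.

```python
import itertools
import numpy as np

def potts_log_weight(seq, h, J):
    """Unnormalized log-probability: sum_i h_i(a_i) + sum_{i<j} J_ij(a_i, a_j)."""
    N = len(seq)
    total = sum(h[i, seq[i]] for i in range(N))
    for i in range(N):
        for j in range(i + 1, N):
            total += J[i, j, seq[i], seq[j]]
    return total

def potts_distribution(h, J):
    """Enumerate all q**N sequences and normalize their weights by Z."""
    N, q = h.shape
    seqs = list(itertools.product(range(q), repeat=N))
    log_w = np.array([potts_log_weight(s, h, J) for s in seqs])
    weights = np.exp(log_w)
    Z = weights.sum()  # normalization constant
    return seqs, weights / Z

# Hypothetical toy parameters: 3 positions, alphabet of 4 symbols.
rng = np.random.default_rng(0)
N, q = 3, 4
h = rng.normal(size=(N, q))
J = rng.normal(size=(N, N, q, q))
seqs, probs = potts_distribution(h, J)
```

Here `probs` holds one probability per enumerated sequence, in the same order as `seqs`.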
Even knowing the probabilities of all sequences does not determine the parameters uniquely. For example, a simple transformation of the parameters
$$h_i(a) \rightarrow h_i(a) + \theta_i$$
for any set of real numbers $\theta_i$ leaves the probabilities the same, because the added constants are absorbed into the normalization constant. The likelihood function is invariant under such transformations as well, so the data cannot be used to fix these degrees of freedom.
A convention often found in the literature is to fix these degrees of freedom such that the Frobenius norm of the coupling matrix
$$F_{ij} = \sqrt{\sum_{a,b} J_{ij}(a,b)^2}$$
is minimized, independently for every pair of positions $i$ and $j$.
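The invariance described above can be checked numerically. The sketch below assumes the array layout `h[i, a]` and `J[i, j, a, b]` (only `i < j` used) and moves every coupling block into the so-called zero-sum gauge, in which all row and column sums of the block vanish; this double-centering is the orthogonal projection that minimizes the Frobenius norm of the block. The removed row and column means are shifted into the fields, so the sequence probabilities stay the same. All function names are hypothetical.

```python
import itertools
import numpy as np

def to_zero_sum_gauge(h, J):
    """Double-center each coupling block and move the removed terms into the fields."""
    N, q = h.shape
    h_new = h.copy()
    J_new = np.zeros_like(J)
    for i in range(N):
        for j in range(i + 1, N):
            block = J[i, j]
            row_mean = block.mean(axis=1, keepdims=True)  # depends on symbol a only
            col_mean = block.mean(axis=0, keepdims=True)  # depends on symbol b only
            J_new[i, j] = block - row_mean - col_mean + block.mean()
            h_new[i] += row_mean[:, 0]
            h_new[j] += col_mean[0, :]
            # The leftover constant block.mean() is absorbed by the normalization.
    return h_new, J_new

def probabilities(h, J):
    """Exact sequence probabilities by enumeration (tiny systems only)."""
    N, q = h.shape
    log_w = []
    for s in itertools.product(range(q), repeat=N):
        e = sum(h[i, s[i]] for i in range(N))
        e += sum(J[i, j, s[i], s[j]] for i in range(N) for j in range(i + 1, N))
        log_w.append(e)
    w = np.exp(np.array(log_w))
    return w / w.sum()

rng = np.random.default_rng(1)
N, q = 3, 3
h = rng.normal(size=(N, q))
J = rng.normal(size=(N, N, q, q))
h_zs, J_zs = to_zero_sum_gauge(h, J)

same_probabilities = np.allclose(probabilities(h, J), probabilities(h_zs, J_zs))
norm_before = np.linalg.norm(J[0, 1])   # Frobenius norm of one coupling block
norm_after = np.linalg.norm(J_zs[0, 1])
```

In this toy setting `same_probabilities` should be true and `norm_after` should not exceed `norm_before`.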

Maximum Entropy Derivation

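To justify the Potts model, it is often noted that it can be derived following a maximum entropy principle: For a given set of sample covariances and frequencies, the Potts model represents the distribution with the maximal Shannon entropy of all distributions reproducing those covariances and frequencies. For a multiple sequence alignment, the sample covariances are defined as
$$C_{ij}(a,b) = f_{ij}(a,b) - f_i(a) f_j(b),$$
where $f_{ij}(a,b)$ is the frequency of finding symbols $a$ and $b$ at positions $i$ and $j$ in the same sequence in the MSA, and $f_i(a)$ the frequency of finding symbol $a$ at position $i$. The Potts model is then the unique distribution $P$ that maximizes the functional
$$\mathcal{F}[P] = -\sum_{\mathbf{a}} P(\mathbf{a}) \log P(\mathbf{a}) + \sum_{i<j} \sum_{a,b} \lambda_{ij}(a,b) \left( P_{ij}(a,b) - f_{ij}(a,b) \right) + \sum_{i} \sum_{a} \lambda_i(a) \left( P_i(a) - f_i(a) \right) + \Omega \left( 1 - \sum_{\mathbf{a}} P(\mathbf{a}) \right).$$
The first term in the functional is the Shannon entropy of the distribution. The $\lambda_{ij}(a,b)$ and $\lambda_i(a)$ are Lagrange multipliers to ensure $P_{ij}(a,b) = f_{ij}(a,b)$ and $P_i(a) = f_i(a)$, with $P_{ij}(a,b)$ being the marginal probability to find symbols $a, b$ at positions $i, j$ and $P_i(a)$ the marginal probability to find symbol $a$ at position $i$. The Lagrange multiplier $\Omega$ ensures normalization.
Maximizing this functional and identifying
$$\lambda_{ij}(a,b) \equiv J_{ij}(a,b), \qquad \lambda_i(a) \equiv h_i(a), \qquad e^{1+\Omega} \equiv Z$$
leads to the Potts model above. This procedure only gives the functional form of the Potts model, while the numerical values of the Lagrange multipliers still have to be determined by fitting the model to the data.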
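The sample frequencies and covariances that enter this derivation can be computed directly from an alignment. The sketch below assumes the MSA is already encoded as an integer array of shape (number of sequences, sequence length) with symbols numbered from 0; the function name and the toy alignment are made up for illustration. It deliberately omits sequence reweighting and pseudocounts, which practical pipelines often add.

```python
import numpy as np

def msa_statistics(msa, q):
    """Return single-site frequencies, pair frequencies and covariances of an MSA."""
    M, N = msa.shape
    onehot = np.eye(q)[msa]                                  # shape (M, N, q)
    f_i = onehot.mean(axis=0)                                # f_i(a), shape (N, q)
    f_ij = np.einsum("mia,mjb->ijab", onehot, onehot) / M    # f_ij(a, b)
    C = f_ij - np.einsum("ia,jb->ijab", f_i, f_i)            # C_ij(a, b)
    return f_i, f_ij, C

# Toy alignment with 4 sequences, 3 positions and a binary alphabet.
# Positions 0 and 1 always differ; position 2 varies independently of position 0.
msa = np.array([[0, 1, 0],
                [0, 1, 1],
                [1, 0, 0],
                [1, 0, 1]])
f_i, f_ij, C = msa_statistics(msa, q=2)
```

In this toy alignment the covariance between positions 0 and 1 is nonzero, while the covariance between positions 0 and 2 vanishes.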

Direct Couplings and Indirect Correlation

The central point of DCA is to interpret the parameters $J_{ij}$ as direct couplings. If two positions are under joint evolutionary pressure, one might expect these couplings to be large because only sequences with fitting pairs of symbols should have a significant probability. On the other hand, a large correlation between two positions does not necessarily mean that the couplings are large, since large couplings between e.g. positions $i, j$ and $j, k$ might lead to large correlations between positions $i$ and $k$, mediated by position $j$. In fact, such indirect correlations have been implicated in the high false positive rate when inferring protein residue contacts using correlation measures like mutual information.
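The effect of an indirect correlation can be reproduced in a toy model. The sketch below assumes a chain of three positions with a binary alphabet, in which only the pairs (0, 1) and (1, 2) are coupled; the coupling strength of 1.5 and the helper name `covariance` are arbitrary choices for this example. Although the coupling between positions 0 and 2 is exactly zero, their covariance under the exact model distribution is not.

```python
import itertools
import numpy as np

q, N = 2, 3
J = np.zeros((N, N, q, q))
# Direct couplings favouring equal symbols on pairs (0, 1) and (1, 2) only.
J[0, 1] = 1.5 * np.eye(q)
J[1, 2] = 1.5 * np.eye(q)
# J[0, 2] stays zero: there is no direct coupling between positions 0 and 2.

seqs = np.array(list(itertools.product(range(q), repeat=N)))
log_w = np.array([
    sum(J[i, j, s[i], s[j]] for i in range(N) for j in range(i + 1, N))
    for s in seqs
])
p = np.exp(log_w)
p /= p.sum()  # exact probabilities of all 8 sequences

def covariance(i, j, a, b):
    """Model covariance P_ij(a, b) - P_i(a) * P_j(b)."""
    p_ij = p[(seqs[:, i] == a) & (seqs[:, j] == b)].sum()
    p_i = p[seqs[:, i] == a].sum()
    p_j = p[seqs[:, j] == b].sum()
    return p_ij - p_i * p_j

direct = covariance(0, 1, 0, 0)     # pair with a direct coupling
indirect = covariance(0, 2, 0, 0)   # pair without a direct coupling
```

Both `direct` and `indirect` are positive here, with the indirect covariance the smaller of the two, which is the situation a purely correlation-based contact predictor cannot distinguish from a weak direct interaction.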

Inference

The inference of the Potts model on a multiple sequence alignment using maximum likelihood estimation is usually computationally intractable, because one needs to calculate the normalization constant $Z$, which is, for sequence length $N$ and $q$ possible symbols, a sum of $q^N$ terms. Therefore, numerous approximations and alternatives have been developed:

mpDCA
mfDCA
gaussDCA
plmDCA
Adaptive Cluster Expansion

All of these methods lead to some form of estimate for the set of parameters maximizing the likelihood of the MSA. Many of them include regularization or prior terms to ensure a well-posed problem or promote a sparse solution.
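To give a flavour of one of these approximations, the following is a minimal sketch in the spirit of the mean-field approach (mfDCA), in which the couplings are estimated as the negative inverse of the sample covariance matrix. It is a simplified illustration and not a reference implementation: the function name, the pseudocount value of 0.5, the choice to drop the last symbol at each position (a gauge choice that makes the matrix invertible) and the synthetic data are all assumptions, and sequence reweighting is omitted.

```python
import numpy as np

def mean_field_couplings(msa, q, pseudocount=0.5):
    """Estimate couplings as J = -C^{-1} from an integer-encoded MSA of shape (M, N)."""
    M, N = msa.shape
    onehot = np.eye(q)[msa]                                   # shape (M, N, q)
    # Mix empirical frequencies with a uniform distribution (pseudocount).
    f_i = (1 - pseudocount) * onehot.mean(axis=0) + pseudocount / q
    f_ij = ((1 - pseudocount) * np.einsum("mia,mjb->ijab", onehot, onehot) / M
            + pseudocount / q**2)
    for i in range(N):
        f_ij[i, i] = np.diag(f_i[i])      # same-site pair frequencies are diagonal
    C = f_ij - np.einsum("ia,jb->ijab", f_i, f_i)
    # Drop the last symbol at every position so that C becomes invertible.
    C = C[:, :, :q - 1, :q - 1]
    C_mat = C.transpose(0, 2, 1, 3).reshape(N * (q - 1), N * (q - 1))
    J_mat = -np.linalg.inv(C_mat)
    J = np.zeros((N, N, q, q))
    J[:, :, :q - 1, :q - 1] = J_mat.reshape(N, q - 1, N, q - 1).transpose(0, 2, 1, 3)
    for i in range(N):
        J[i, i] = 0.0                     # same-site blocks are not couplings
    return J

# Synthetic data: position 1 mostly copies position 0; positions 2 and 3 are random.
rng = np.random.default_rng(0)
M, N, q = 500, 4, 3
msa = rng.integers(0, q, size=(M, N))
copy_mask = rng.random(M) < 0.9
msa[copy_mask, 1] = msa[copy_mask, 0]

J = mean_field_couplings(msa, q)
strength_coupled = np.linalg.norm(J[0, 1])
strength_uncoupled = np.linalg.norm(J[0, 2])
```

On this synthetic alignment the coupling block of the dependent pair (0, 1) should have a clearly larger norm than that of the independent pair (0, 2).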

Applications

Protein Residue Contact Prediction

A possible interpretation of large values of couplings in a model fitted to an MSA of a protein family is the existence of conserved contacts between positions in the family. Such a contact can lead to molecular coevolution, since a mutation in one of the two residues, without a compensating mutation in the other residue, is likely to disrupt protein structure and negatively affect the fitness of the protein. Residue pairs for which there is a strong selective pressure to maintain mutual compatibility are therefore expected to mutate together or not at all. This idea has been used to predict protein contact maps, for example by analyzing the mutual information between protein residues.
Within the framework of DCA, a score for the strength of the direct interaction between a pair of residues is often defined using the Frobenius norm $F_{ij}$ of the corresponding coupling matrix and applying an average product correction (APC):
$$F_{ij}^{\mathrm{APC}} = F_{ij} - \frac{F_i F_j}{F},$$
where $F_{ij} = \sqrt{\sum_{a,b} J_{ij}(a,b)^2}$ has been defined above and
$$F_i = \frac{1}{N} \sum_{k} F_{ik}, \qquad F = \frac{1}{N^2} \sum_{k,l} F_{kl}.$$
This correction term was first introduced for mutual information and is used to remove biases of specific positions to produce large $F_{ij}$. Scores that are invariant under parameter transformations that do not affect the probabilities have also been used.
Sorting all residue pairs by this score results in a list in which the top of the list is strongly enriched in residue contacts when compared to the protein contact map of a homologous protein. High-quality predictions of residue contacts are valuable as prior information in protein structure prediction.
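The scoring and ranking procedure can be sketched as follows. The code assumes a symmetric coupling array `J[i, j, a, b]` (with `J[j, i]` the transpose of `J[i, j]`), computes the Frobenius norm of each block, applies the average product correction with averages taken as sums divided by the sequence length, and sorts the pairs by the corrected score. The function names, the `min_separation` parameter and the synthetic couplings are assumptions made for this example.

```python
import numpy as np

def apc_scores(J):
    """Frobenius norm of each coupling block with average product correction."""
    N = J.shape[0]
    F = np.sqrt((J ** 2).sum(axis=(2, 3)))   # F_ij for every pair of positions
    np.fill_diagonal(F, 0.0)                 # same-site entries carry no score
    F_i = F.sum(axis=1) / N                  # position-wise average
    F_mean = F.sum() / N**2                  # overall average
    F_apc = F - np.outer(F_i, F_i) / F_mean
    np.fill_diagonal(F_apc, 0.0)
    return F_apc

def ranked_pairs(F_apc, min_separation=1):
    """Pairs (i, j) with j - i >= min_separation, sorted by decreasing score."""
    N = F_apc.shape[0]
    pairs = [(i, j) for i in range(N) for j in range(i + min_separation, N)]
    return sorted(pairs, key=lambda p: F_apc[p], reverse=True)

# Synthetic couplings: weak noise everywhere, one strong block for the pair (1, 3).
rng = np.random.default_rng(0)
N, q = 4, 3
J = 0.01 * rng.normal(size=(N, N, q, q))
J[1, 3] += 2.0 * np.eye(q)
for i in range(N):
    for j in range(i + 1, N):
        J[j, i] = J[i, j].T                  # enforce symmetry

scores = apc_scores(J)
ranking = ranked_pairs(scores)
```

In this synthetic case the pair (1, 3) ends up at the top of `ranking`. When predicting contacts, pairs that are close along the sequence are often excluded by choosing a larger `min_separation`.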

Inference of protein-protein interaction

DCA can be used for detecting conserved interaction between protein families and for predicting which residue pairs form contacts in a protein complex. Such predictions can be used when generating structural models for these complexes, or when inferring protein-protein interaction networks made from more than two proteins.

Modeling of fitness landscapes

DCA can be used to model fitness landscapes and to predict the effect of a mutation in the amino acid sequence of a protein on its fitness.
