Position weight matrix

A position weight matrix , also known as a position-specific weight matrix or position-specific scoring matrix , is a commonly used representation of motifs in biological sequences.
PWMs are often derived from a set of aligned sequences that are thought to be functionally related and have become an important part of many software tools for computational motif discovery.

Background

The position weight matrix was introduced by American geneticist Gary Stormo and colleagues in 1982 as an alternative to consensus sequences. Consensus sequences had previously been used to represent patterns in biological sequences, but had difficulties in the prediction of new occurrences of these patterns. The first use of PWMs was in the discovery of RNA sites that function as translation initiation sites. The perceptron algorithm was suggested by Polish American mathematician Andrzej Ehrenfeucht in order to create a matrix of weights which could distinguish true binding sites from other non-functional sites with similar sequences. Training the perceptron on both sets of sites resulted in a matrix and a threshold to distinguish between the two sets. Using the matrix to scan new sequences not included in the training set showed that this method was both more sensitive and precise than the best consensus sequence.
The advantages of PWMs over consensus sequences have made PWMs a popular method for representing patterns in biological sequences and an essential component in modern algorithms for motif discovery.

Creation

Conversion of sequence to position probability matrix

A PWM has one row for each symbol of the alphabet and one column for each position in the pattern. In the first step in constructing a PWM, a basic position frequency matrix is created by counting the occurrences of each nucleotide at each position. From the PFM, a position probability matrix can now be created by dividing that former nucleotide count at each position by the number of sequences, thereby normalising the values. Formally, given a set X of N aligned sequences of length l, the elements of the PPM M are calculated:

where i , j , k is the set of symbols in the alphabet and I is an indicator function where I is 1 if a=k and 0 otherwise.
For example, given the following DNA sequences:

GAGGTAAAC

TCCGTAAGT

CAGGTTGGA

ACAGTCAGT

TAGGTCATT

TAGGTACTG

ATGGTAACT

CAGGTATAC

TGTGTGAGT

AAGGTAAGT

The corresponding PFM is:

Therefore, the resulting PPM is:

Both PPMs and PWMs assume statistical independence between positions in the pattern, as the probabilities for each position are calculated independently of other positions. From the definition above, it follows that the sum of values for a particular position is 1. Each column can therefore be regarded as an independent multinomial distribution. This makes it easy to calculate the probability of a sequence given a PPM, by multiplying the relevant probabilities at each position. For example, the probability of the sequence S = GAGGTAAAC given the above PPM M can be calculated:

Pseudocounts are often applied when calculating PPMs if based on a small dataset, in order to avoid matrix entries having a value of 0. This is equivalent to multiplying each column of the PPM by a Dirichlet distribution and allows the probability to be calculated for new sequences. In the example above, without pseudocounts, any sequence which did not have a G in the 4th position or a T in the 5th position would have a probability of 0, regardless of the other positions.

Conversion of position probability matrix to position weight matrix

Most often the elements in PWMs are calculated as log likelihoods. That is, the elements of a PPM are transformed using a background model so that:
describes how an element in the PWM ,, can be calculated.
The simplest background model assumes that each letter appears equally frequently in the dataset. That is, the value of for all symbols in the alphabet. Applying this transformation to the PPM M from above gives:
The entries in the matrix make clear the advantage of adding pseudocounts, especially when using small datasets to construct M. The background model need not have equal values for each symbol: for example, when studying organisms with a high GC-content, the values for C and G may be increased with a corresponding decrease for the A and T values.
When the PWM elements are calculated using log likelihoods, the score of a sequence can be calculated by adding the relevant values at each position in the PWM. The sequence score gives an indication of how different the sequence is from a random sequence. The score is 0 if the sequence has the same probability of being a functional site and of being a random site. The score is greater than 0 if it is more likely to be a functional site than a random site, and less than 0 if it is more likely to be a random site than a functional site. The sequence score can also be interpreted in a physical framework as the binding energy for that sequence.

Information content

The information content of a PWM is sometimes of interest, as it says something about how different a given PWM is from a uniform distribution.
The self-information of observing a particular symbol at a particular position of the motif is:
The expected self-information of a particular element in the PWM is then:
Finally, the IC of the PWM is then the sum of the expected self-information of every element:
Often, it is more useful to calculate the information content with the background letter frequencies of the sequences you are studying rather than assuming equal probabilities of each letter. The equation for information content thus becomes
where is the background frequency for letter. This corresponds to the Kullback–Leibler divergence or relative entropy. However, it has been shown that when using PSSM to search genomic sequences this uniform correction can lead to overestimation of the importance of the different bases in a motif, due to the uneven distribution of n-mers in real genomes, leading to a significantly larger number of false positives.

Uses

There are various algorithms to scan for hits of PWMs in sequences. One example is the MATCH algorithm which has been implemented in the ModuleMaster. More sophisticated algorithms for fast database searching with nucleotide as well as amino acid PWMs/PSSMs are implemented in the possumsearch software.

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...