Digital transcriptome subtraction

Digital transcriptome subtraction (DTS) is a bioinformatics method for detecting novel pathogen transcripts by computationally removing host sequences. DTS is the direct in silico analogue of the wet-lab approach representational difference analysis, and is made possible by unbiased high-throughput sequencing and the availability of a high-quality, annotated reference genome of the host. The method is used specifically to search for the etiological agents of infectious diseases and is best known for the discovery of Merkel cell polyomavirus, the suspected causative agent of Merkel cell carcinoma.

History

Using computational subtraction to discover novel pathogens was first proposed in 2002 by Meyerson et al., using human expressed sequence tag datasets. In a proof-of-principle experiment, Meyerson et al. demonstrated that the approach was feasible using Epstein-Barr virus-infected lymphocytes in post-transplant lymphoproliferative disorder.
In 2007, the term "Digital Transcriptome Subtraction" was coined by the Chang-Moore group, and the method was used to discover Merkel cell polyomavirus (MCV) in Merkel cell carcinoma.
Concurrently with the MCV discovery, this approach was used to implicate a novel arenavirus as the cause of death in a case where three patients died of similar illnesses shortly after organ transplantation from a single donor.

Method

Construction of cDNA library

Total RNA is extracted from primary infected tissue, with DNase I treatment to eliminate human genomic DNA. Messenger RNA is then purified using an oligo-dT column that binds the poly-A tail, a signal found specifically on transcribed genes. Using random hexamer priming, reverse transcriptase converts all mRNA into cDNA, which is cloned into bacterial vectors. Bacteria, usually E. coli, are then transformed with the cDNA vectors and selected using a marker; the collection of transformed clones is the cDNA library. This generates a snapshot of tissue mRNA that is stable and can be sequenced at a later stage.

Sequencing and quality control

The cDNA library must be sequenced to great depth in order to detect a potentially rare pathogen sequence, especially if the foreign sequence is novel. The Chang-Moore group recommends a sequencing depth of 200,000 transcripts or greater, using multiple sequencing platforms.

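Probability of detecting at least one viral transcript, by viral abundance (% of all transcripts) and number of clones sequenced:
% Viral	5,000 clones	10,000 clones	20,000 clones	50,000 clones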
0.001%	4.9%	9.5%	18.1%	39.3%
0.01%	39.3%	63.2%	86.5%	99.3%
0.02%	63.2%	86.5%	98.2%	>99.995%
0.03%	77.7%	95.5%	99.8%	>99.995%
0.04%	86.5%	98.2%	>99.995%	>99.995%
0.1%	99.3%	>99.995%	>99.995%	>99.995%
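The tabulated values appear broadly consistent with a simple model in which each sequenced clone is independently viral with probability p, so that the chance of seeing at least one viral transcript among n clones is roughly 1 - exp(-n·p). The sketch below is an illustration under that assumption, not the calculation used by the original authors; the function name detection_probability is hypothetical.

```python
import math


def detection_probability(viral_fraction, clones_sequenced):
    """Chance of observing at least one viral transcript.

    Assumes clones are sampled independently and uses the Poisson
    approximation 1 - exp(-n * p); this model is an assumption made
    for illustration.
    """
    expected_viral_clones = viral_fraction * clones_sequenced
    return 1.0 - math.exp(-expected_viral_clones)


if __name__ == "__main__":
    depths = (5_000, 10_000, 20_000, 50_000)
    for percent_viral in (0.001, 0.01, 0.02):
        row = [detection_probability(percent_viral / 100, n) for n in depths]
        print(f"{percent_viral}%", "  ".join(f"{p:.1%}" for p in row))
```

Under this model, sequencing depth matters most when the viral fraction is very low: at 0.001% viral transcripts, even 50,000 clones leave a substantial chance of missing the pathogen entirely.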

Stringent quality controls are then applied to the raw sequences to minimize false-positive results. The initial quality screen uses several general parameters to exclude ambiguous sequences, leaving a dataset of high-fidelity reads.

Phred score - a cutoff is used to remove low-quality end sequences. Typically, a Phred score cutoff of 20 or 30 is used, corresponding to 99% or 99.9% accuracy for each base call.
Vector and adaptor removal.
Low complexity - the complexity score of a sequence reflects the number of identical bases in a series, such as poly-dT or poly-dA.
Human repetitive DNA.
Length - this parameter depends on the optimized read length specific to the sequencing technology used.
BLAST and exclude E. coli genome sequences.
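Three of the screens above (Phred-based end trimming, low complexity, and length) can be sketched in a few lines. This is a minimal illustration, not the pipeline used by the original authors: the function names, the homopolymer-run measure of complexity, and the thresholds min_length=30 and max_run_fraction=0.5 are all assumptions chosen for the example. Vector, repeat, and E. coli screening require reference sequences and are not shown.

```python
def phred_to_error_probability(q):
    """Phred score Q corresponds to a base-call error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


def trim_low_quality_ends(seq, quals, cutoff=20):
    """Remove bases from both ends until a base with Phred >= cutoff is reached."""
    start, end = 0, len(seq)
    while start < end and quals[start] < cutoff:
        start += 1
    while end > start and quals[end - 1] < cutoff:
        end -= 1
    return seq[start:end], quals[start:end]


def longest_run_fraction(seq):
    """Fraction of the read covered by its longest run of one identical base."""
    if not seq:
        return 1.0
    longest = run = 1
    for previous, current in zip(seq, seq[1:]):
        run = run + 1 if current == previous else 1
        longest = max(longest, run)
    return longest / len(seq)


def passes_quality_screen(seq, quals, cutoff=20, min_length=30,
                          max_run_fraction=0.5):
    """Apply trimming, then the length and low-complexity screens.

    The default thresholds are illustrative assumptions only.
    """
    trimmed, _ = trim_low_quality_ends(seq, quals, cutoff)
    if len(trimmed) < min_length:
        return False
    return longest_run_fraction(trimmed) <= max_run_fraction
```

A read such as a 40-base poly-dA stretch fails the low-complexity screen even when every base call is of high quality, which is the behaviour the list above describes.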

BLAST to host genome

Using MEGABLAST, the high-fidelity reads are matched to sequences in annotated databases, and any positive matches are subtracted from the dataset. The minimum hit length for a positive match to human sequence is typically 30 consecutive identical bases, which equates to a BLAST score of 60. The remaining sequences are generally searched again with BLAST using less stringent parameters to allow for slight mismatches. The vast majority of sequences should be removed from the dataset at this stage.
Subtracted sequences typically include:

Reference human transcriptome - eliminates any known human transcripts from expression library sets.
Reference human genome - eliminates genes that have been missed by the annotation process and any genomic sequences that contaminated the cDNA library during construction.
Mitochondrial DNA - mitochondrial sequences are highly abundant and polymorphic due to a rapid mutation rate.
Immunoglobulin region - the immunoglobulin loci are highly polymorphic and would otherwise yield false positives due to poor alignment to the reference genome.
Other vertebrate sequences
Unannotated sequences
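The subtraction criterion of 30 consecutive identical bases can be illustrated with a toy exact-match filter. The sketch below is a stand-in for the idea only: it does not call MEGABLAST or BLAST, does not handle mismatches, and would not scale to a real genome without a proper index. The function names and the in-memory set of host k-mers are assumptions made for the example.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmers(seq, k):
    """Set of all substrings of length k (empty if the sequence is shorter)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_host_index(host_sequences, k=30):
    """Collect every k-mer from both strands of the host sequences."""
    index = set()
    for host_seq in host_sequences:
        index |= kmers(host_seq, k)
        index |= kmers(reverse_complement(host_seq), k)
    return index


def subtract_host_reads(reads, host_index, k=30):
    """Keep only reads that share no exact k-mer with the host.

    Reads shorter than k have no k-mers and are therefore kept; a real
    pipeline would already have removed them in the length screen.
    """
    return [read for read in reads if kmers(read, k).isdisjoint(host_index)]


if __name__ == "__main__":
    host = "ACGTTGCAAGGCTTAACCGGATATCGCGTTAGCATGCAAT"  # made-up host sequence
    reads = [host[5:38], reverse_complement(host[:32]), "GATTACA" * 5]
    index = build_host_index([host])
    print(subtract_host_reads(reads, index))
```

In the usage example, the two reads derived from the host (one from each strand) are subtracted and only the unrelated read remains as a "non-host" candidate.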

Analysis of "non-host" candidates

Alignment to pathogen databases

After stringent rounds of subtraction, the remaining sequences are clustered into non-redundant contigs and aligned to known pathogen sequences using low-stringency parameters. Because pathogen genomes mutate quickly, nucleotide-nucleotide alignment (blastn) is usually uninformative: codon degeneracy allows bases to change without altering the encoded amino acid residue. Matching the in silico translated protein sequences from all six reading frames to the amino acid sequences of annotated proteins (blastx) is the preferred alignment method, as it increases the likelihood of identifying a novel pathogen by matching it to a related strain or species. Experimental extension of candidate sequences might also be used at this stage to maximize the chance of a positive match.
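The six-frame translation that precedes a protein-level search can be sketched as follows. This is a simplified illustration using the standard genetic code, not the implementation inside blastx; the function names are hypothetical, and codons containing ambiguous bases are simply translated as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; unknown codons give 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Translations of the three forward and three reverse-strand frames."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    # Two made-up sequences that differ at four of twelve bases
    # but encode the same peptide because of codon degeneracy.
    print(translate("ATGGCTCGTAAA"), translate("ATGGCCAGGAAG"))
    print(six_frame_translations("ATGGCTCGTAAA"))
```

The first line of the usage example shows why protein-level comparison tolerates divergence that would weaken a nucleotide alignment: the two sequences differ at a third of their bases yet translate to the same peptide.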

''De novo'' assembly

In cases where alignment to known pathogens is uninformative or ambiguous, contigs of candidate sequence can be used as templates for primer walking in primary infected tissue to generate the complete pathogen genome sequence. Because viral transcripts are exceedingly rare relative to tissue mRNA, coverage is low, and a pathogen transcriptome is unlikely to be assembled from the original candidate sequences alone.

Validation of pathogen

Once a putative pathogen has been identified in the high-throughput sequencing data, it is imperative to validate the presence of the pathogen in infected patients using more sensitive techniques, such as:

RT-PCR and derivative methods, including 3'- and 5'-RACE, to confirm the existence of pathogen mRNA.
Immunohistochemistry using antibodies to a related pathogen to determine whether the pathogen is present in tissues.
Serological tests to measure pathogen-specific antibody titer.
Bacterial culture/viral culture, which is considered the gold standard in laboratory diagnosis.

Applications

The primary application of DTS lies in the identification of pathogenic viruses in cancer. It can also be used to identify viral pathogens in non-cancer-related disease. Future clinical applications could include the routine use of DTS in individuals.
DTS could also be applied to agriculture to identify pathogens that affect output. Computational subtraction has already been used in a metagenomics study that associated infection by Israeli acute paralysis virus (IAPV) with colony collapse disorder in honey bees.

Advantages

Requires no prior knowledge about pathogen sequence.
Can identify previously unassociated, potentially treatable pathogens.
Uses already available molecular methods and resources.

Disadvantages

Identifies the presence of a pathogen but does not establish a causal link to disease. See Koch's postulates and the Bradford Hill criteria.
Requires a highly reliable, complete reference transcriptome for the organism being studied.
Failure to identify a foreign sequence cannot entirely exclude a pathogenic foreign body.
