List of RNA-Seq bioinformatics tools


RNA-Seq is a technique that allows transcriptome studies based on next-generation sequencing technologies. The technique depends heavily on bioinformatics tools developed to support the different steps of the process. Listed here are some of the principal tools commonly employed, along with links to some important web resources.

Design

Design is a fundamental step of any RNA-Seq experiment. Important questions, such as the sequencing depth/coverage and the number of biological or technical replicates, must be carefully considered.
Quality assessment of raw data is the first step of the RNA-Seq bioinformatics pipeline. It is often necessary to filter the data by removing low-quality sequences or bases, adapters, contamination and overrepresented sequences, or by correcting errors, to ensure a coherent final result.
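As an illustration of the kind of filtering described above, the following minimal sketch trims low-quality bases from the 3' end of a read and then applies a mean-quality filter. It assumes Phred+33 encoded FASTQ quality strings; the function names and the thresholds (`min_q`, `min_len`, `min_mean_q`) are hypothetical choices for the example and do not reproduce the behaviour of any particular tool.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq, quality, min_q=20):
    """Remove trailing bases whose Phred score is below min_q."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


def passes_filter(seq, quality, min_len=5, min_mean_q=20):
    """Keep a read only if it is long enough and its mean quality is adequate."""
    if not seq or len(seq) < min_len:
        return False
    scores = phred_scores(quality)
    return sum(scores) / len(scores) >= min_mean_q


# Toy read: eight high-quality bases ('I' = Q40) followed by two poor ones ('#' = Q2).
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGTAC", "IIIIIIII##")
keep = passes_filter(trimmed_seq, trimmed_qual)
```

Real trimming tools typically use more elaborate strategies, such as sliding windows and adapter matching; this sketch only shows the basic idea of quality-based trimming and filtering.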

Quality control

Improving RNA-Seq quality by correcting bias is a complex subject. Each RNA-Seq protocol introduces a specific type of bias, and each step of the process can generate some sort of noise or error. Furthermore, even the species under investigation and the biological context of the samples can influence the results and introduce bias.
Many sources of bias have already been reported: GC content and PCR enrichment, rRNA depletion, errors produced during sequencing, and priming of reverse transcription by random hexamers.
Different tools have been developed to address each of the detected errors.
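GC content, one of the bias sources mentioned above, is usually inspected as a per-read fraction. The sketch below shows one simple way to compute it; the function name is hypothetical, and ignoring ambiguous bases such as N is an assumption of this example rather than a rule taken from any specific tool.

```python
def gc_fraction(seq):
    """Return the fraction of G and C among the unambiguous bases of a read."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")  # ignore N and other codes
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


# Per-read GC fractions for a few toy reads.
reads = ["ATGC", "GGCC", "ATATNN"]
fractions = [gc_fraction(read) for read in reads]
```

Comparing the distribution of these fractions against the expected distribution for the organism is one common way to spot GC-related bias or contamination.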

Trimming and adapters removal

Recent sequencing technologies normally require DNA samples to be amplified via polymerase chain reaction. Amplification often generates chimeric elements: sequences formed from two or more original sequences joined together.
Tools in this group characterize high-throughput sequencing errors and, where possible, correct them.
Further tasks are performed before alignment, namely the merging of paired reads.
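The idea behind merging paired reads can be sketched as follows: the second read is reverse-complemented, and the longest exact overlap between the end of the first read and the start of that reverse complement is used to join the two. This is a simplified illustration; the function names and the `min_overlap` parameter are hypothetical, and exact matching is an assumption of the example (real mergers generally tolerate mismatches and take base qualities into account).

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=4):
    """Merge a read pair on the longest exact overlap, or return None."""
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    # Try the longest possible overlap first, then shorter ones.
    for k in range(longest, min_overlap - 1, -1):
        if read1[-k:] == mate[:k]:
            return read1 + mate[k:]
    return None


# Toy fragment sequenced from both ends with 9-base reads.
fragment = "ACGTTGCAAGGCTA"
read1 = fragment[:9]
read2 = reverse_complement(fragment[-9:])
merged = merge_pair(read1, read2)
```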
After quality control, the first step of RNA-Seq analysis is the alignment of the sequenced reads to a reference genome or to a transcriptome database. See also List of sequence alignment software.

Short (unspliced) aligners

Short aligners align continuous (unspliced) reads. There are two main types: 1) those based on the Burrows-Wheeler transform method, such as Bowtie and BWA, and 2) those based on Seed-extend methods, Needleman-Wunsch or Smith-Waterman algorithms. The first group is many times faster; however, some tools of the second group tend to be more sensitive, producing more correctly aligned reads.
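To give an idea of how the Burrows-Wheeler transform supports read matching, the following toy sketch builds the transform of a short text from its sorted suffixes and counts exact occurrences of a pattern by backward search. It is a didactic illustration only and is not how Bowtie or BWA are implemented: the function names are hypothetical, the text is assumed not to contain the sentinel `$`, and the naive suffix sorting and counting used here would be far too slow for a real genome.

```python
def suffix_array(text):
    """Naive suffix array: start positions of the suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text):
    """Burrows-Wheeler transform of text, with '$' appended as sentinel."""
    text = text + "$"
    return "".join(text[i - 1] for i in suffix_array(text))


def count_occurrences(text, pattern):
    """Count exact matches of pattern in text by backward search on the BWT."""
    last = bwt(text)
    first = sorted(last)
    # c_table[ch]: index of the first row starting with ch in the sorted column.
    c_table = {}
    for idx, ch in enumerate(first):
        c_table.setdefault(ch, idx)

    def occ(ch, i):
        # Occurrences of ch in last[:i]; real indexes precompute this.
        return last[:i].count(ch)

    lo, hi = 0, len(last)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ(ch, lo)
        hi = c_table[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("banana")
hits = count_occurrences("banana", "ana")
```

The backward search processes the pattern from its last character to its first, narrowing a range of rows at each step; the cost of each step does not depend on the length of the text once the counts are precomputed, which is what makes this family of aligners fast.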
Many reads span exon-exon junctions and cannot be aligned directly by short aligners, so specific aligners, known as spliced aligners, were necessary. Some spliced aligners first use short aligners to align the unspliced/continuous reads, and then follow a different strategy to align the remaining reads that contain spliced regions: normally these reads are split into smaller segments that are mapped independently.
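The segment-based strategy can be illustrated with a toy example in which a read spanning a junction fails to map as a whole, while its two halves map to positions separated by the intron. The sequences, the function name `map_segments` and the use of exact string search are all assumptions made for this sketch; real spliced aligners use indexed, mismatch-tolerant searches and refine the junction position afterwards.

```python
def map_segments(genome, read, segment_len):
    """Map consecutive read segments independently by exact search.

    Returns (offset_in_read, position_in_genome) pairs; position is -1
    when a segment is not found.
    """
    hits = []
    for offset in range(0, len(read), segment_len):
        segment = read[offset:offset + segment_len]
        hits.append((offset, genome.find(segment)))
    return hits


# Toy gene: two exons separated by an intron.
exon1 = "ATGCCGTA"
intron = "GTAAGTTTTTTCAG"
exon2 = "CGATTGCA"
genome = exon1 + intron + exon2

# A read from the mature transcript that spans the exon-exon junction.
read = exon1[-4:] + exon2[:4]

whole_read_hit = genome.find(read)                     # -1: no contiguous match
segment_hits = map_segments(genome, read, segment_len=4)
```

In this example the two segments map 18 bases apart in the genome although they are adjacent in the read, which is the signal a spliced aligner interprets as a candidate junction.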

Aligners based on known splice junctions (annotation-guided aligners)

In this case, the detection of splice junctions is based on data about known junctions available in databases. This type of tool cannot identify new splice junctions. Some of this data comes from other expression methods, such as expressed sequence tags.
De novo splice aligners allow the detection of new splice junctions without the need for previously annotated information.

General tools

These tools perform normalization and calculate the abundance of each gene expressed in a sample. RPKM, FPKM and TPM are some of the units employed to quantify expression.
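The following sketch shows how RPKM and TPM are commonly computed from raw counts and gene lengths. FPKM follows the same formula as RPKM, with fragments (read pairs) counted instead of individual reads. The gene names, counts and lengths are invented for the example, and the use of the full gene length rather than an effective length is a simplifying assumption.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no mapped reads")
    return {g: counts[g] * 1e9 / (total * lengths_bp[g]) for g in counts}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = {g: counts[g] / lengths_bp[g] for g in counts}
    denom = sum(rates.values())
    if denom == 0:
        raise ValueError("no mapped reads")
    return {g: rates[g] * 1e6 / denom for g in counts}


# Hypothetical counts and gene lengths (in base pairs).
counts = {"geneA": 100, "geneB": 300}
lengths_bp = {"geneA": 1000, "geneB": 3000}

rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
```

In this example geneB has three times as many reads as geneA but is also three times longer, so both genes receive the same RPKM and the same TPM. TPM values always sum to one million within a sample, which makes them easier to compare across samples than RPKM or FPKM.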
Some software is also designed to study the variability of gene expression between samples. Quantitative and differential studies are largely determined by the quality of read alignment and the accuracy of isoform reconstruction. Several studies comparing differential expression methods are available.

Commercial solutions

General tools

Genome rearrangements resulting from diseases like cancer can produce aberrant genetic modifications such as fusions or translocations. Identification of these modifications plays an important role in carcinogenesis studies.
The traditional RNA-Seq methodology is commonly known as "bulk RNA-Seq": RNA is extracted from a group of cells or tissues, not from an individual cell as in single-cell methods. Some tools available for bulk RNA-Seq are also applied to single-cell analysis; however, new algorithms were developed to address the specific characteristics of this technique.
These simulators generate in silico reads and are useful for comparing and testing the efficiency of algorithms developed to handle RNA-Seq data. Moreover, some of them make it possible to analyse and model RNA-Seq protocols.
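A very reduced sketch of what such a simulator does is shown below: reads are sampled uniformly from a transcript and substitution errors are introduced at a fixed rate. The function name, its parameters and the uniform sampling and error models are assumptions of this example; actual simulators model positional bias, quality scores, expression levels and other protocol-specific effects.

```python
import random


def simulate_reads(transcript, n_reads, read_len, error_rate=0.0, seed=0):
    """Sample reads uniformly from a transcript, with substitution errors.

    Returns (start_position, read_sequence) pairs so that the true origin
    of every read is known when evaluating an aligner.
    """
    if read_len > len(transcript):
        raise ValueError("read_len exceeds transcript length")
    rng = random.Random(seed)  # fixed seed for reproducible output
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(transcript) - read_len + 1)
        bases = list(transcript[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitute with a different base chosen at random.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))
    return reads


transcript = "ACGTACGTTAGC"
perfect_reads = simulate_reads(transcript, n_reads=5, read_len=4)
noisy_reads = simulate_reads(transcript, n_reads=5, read_len=4, error_rate=0.1)
```

Because the true start position of each read is recorded, the output of an aligner or quantifier run on the simulated reads can be compared directly against the known answer.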
The transcriptome is the total population of RNAs expressed in one cell or group of cells, including non-coding and protein-coding RNAs.
There are two types of approaches to assembling transcriptomes. Genome-guided methods use a reference genome as a template to align and assemble reads into transcripts. Genome-independent methods do not require a reference genome and are normally used when a genome is not available. In this case, reads are assembled directly into transcripts.

Genome-guided assemblers