Shapiro–Senapathy algorithm


The Shapiro–Senapathy algorithm (S&S) is an algorithm for predicting splice sites, exons and genes in animals and plants. It can identify disease-causing mutations in splice junctions in cancers and non-cancerous diseases, and it is used in major research institutions around the world.
The S&S algorithm has been cited in ~3,000 publications in clinical genomics on finding splicing mutations in thousands of diseases, including many forms of cancer and non-cancer diseases. It is the basis of many leading software tools, such as Human Splicing Finder, Splice-site Analyzer Tool, dbass, Alamut and SROOGLE, which together account for approximately 1,500 additional citations. The S&S algorithm has thus significantly impacted the field of medicine, and is increasingly applied in disease research, pharmacogenomics, and precision medicine, as up to 50% of all diseases and adverse drug reactions (ADRs) are now thought to be caused by RNA splicing mutations.
Using the S&S algorithm, scientists have identified mutations and genes that cause numerous cancers, inherited disorders, immune deficiency diseases and neurological disorders. In addition, mutations have been identified in various drug-metabolizing genes that cause ADRs to drugs used to treat different diseases, including cancer chemotherapeutic drugs. S&S is also used to detect “cryptic” splice sites, which are not the authentic sites used in the normal splicing of gene transcripts, and in which mutations cause numerous diseases. The details are provided in the following sections.

The algorithm

The S&S algorithm is described in a 1987 paper. It slides a window of eight nucleotides along the sequence and, for each window, outputs a consensus-based percentage indicating how likely that window is to be a splice site. The 1990 publication is based on the same overall method.
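The sliding-window scoring described above can be sketched in Python. This is a minimal illustration, not the published implementation: the frequency matrix `TOY_MATRIX` is invented for the example (it is not the matrix from the 1987 paper), the function names are hypothetical, and the normalization to a 0-100 percentage is one common way of turning per-position frequencies into a consensus score. The sketch assumes the input contains only the letters A, C, G and T.

```python
# Illustrative per-position base frequencies (in percent) for an 8-nt window.
# ASSUMPTION: these numbers are made up for the example and are NOT the
# published S&S matrix.
TOY_MATRIX = [
    {"A": 30, "C": 40, "G": 20, "T": 10},
    {"A": 60, "C": 10, "G": 15, "T": 15},
    {"A": 10, "C": 5, "G": 80, "T": 5},
    {"A": 0, "C": 0, "G": 100, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 100},
    {"A": 60, "C": 5, "G": 30, "T": 5},
    {"A": 70, "C": 10, "G": 10, "T": 10},
    {"A": 10, "C": 5, "G": 80, "T": 5},
]


def window_score(window, matrix=TOY_MATRIX):
    """Score one window as a percentage between the worst and best possible sums."""
    total = sum(column[base] for column, base in zip(matrix, window))
    best = sum(max(column.values()) for column in matrix)
    worst = sum(min(column.values()) for column in matrix)
    return 100.0 * (total - worst) / (best - worst)


def scan(sequence, matrix=TOY_MATRIX):
    """Slide the window along the sequence and return (start, score) pairs."""
    width = len(matrix)
    sequence = sequence.upper()
    return [
        (start, window_score(sequence[start:start + width], matrix))
        for start in range(len(sequence) - width + 1)
    ]


if __name__ == "__main__":
    for start, score in scan("TTCAGGTAAGTT"):
        print(start, round(score, 1))
```

With this toy matrix, the window that matches the most frequent base at every position scores 100, and a window that matches the least frequent base at every position scores 0.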

Cancer gene discovery using S&S

Using the S&S algorithm, mutations and genes that cause many different forms of cancer have been discovered. For example, genes causing commonly occurring cancers including breast cancer, ovarian cancer, colorectal cancer, leukemia, head and neck cancers, prostate cancer, retinoblastoma, squamous cell carcinoma, gastrointestinal cancer, melanoma, liver cancer, Lynch syndrome, skin cancer, and neurofibromatosis have been found. In addition, splicing mutations have been identified in genes causing less commonly known cancers including gastric cancer, gangliogliomas, Li-Fraumeni syndrome, Loeys–Dietz syndrome, osteochondromas, nevoid basal cell carcinoma syndrome, and pheochromocytomas.
Specific mutations in different splice sites in various genes causing breast cancer, ovarian cancer, colon cancer, colorectal cancer, skin cancer, and Fanconi anemia have been uncovered. The mutations in the donor and acceptor splice sites in different genes causing a variety of cancers that have been identified by S&S are shown in Table 1.

Table 1. Mutations in the donor and acceptor splice sites in different genes

Discovery of genes causing inherited disorders using S&S

Specific mutations in different splice sites in various genes that cause inherited disorders, including, for example, type 1 diabetes, hypertension, Marfan syndrome, cardiac diseases, and eye disorders, have been uncovered. A few example mutations in the donor and acceptor splice sites in different genes causing a variety of inherited disorders, identified using S&S, are shown in Table 2.
| Disease type | Gene symbol | Mutation location | Original sequence | Mutated sequence | Splicing aberration |
| --- | --- | --- | --- | --- | --- |
| Diabetes | PTPN22 | Exon 18 | AAGGTAAAG | AACGTAAAG | Skipping of exon 18 |
| Diabetes | TCF1 | Intron 4 | TTTGTGCCCCTCAGG | TTTGTGCCCCTCGGG | Skipping of exon 5 |
| Hypertension | LDL | Intron 10 | TGGGTGCGT | TGGGTGCAT | Normolipidemic to classical heterozygous FH |
| Hypertension | LDLR | Intron 2 | GCTGTGAGT | GCTGTGTGT | May cause splicing abnormalities through an in-silico analysis |
| Hypertension | LPL | Intron 2 | ACGGTAAGG | ACGATAAGG | Cryptic splice sites are activated in vivo at the sites |
| Marfan syndrome | FBN1 | Intron 46 | CAAGTAAGA | CAAGTAAAA | Exon skipping/cryptic splice site |
| Marfan syndrome | TGFBR2 | Intron 1 | ATCCTGTTTTACAGA | ATCCTGTTTTACGGA | Abnormal splicing |
| Marfan syndrome | FBN2 | Intron 45 | TGGGTAAGT | TGGGGAAGT | Splice site alterations leading to frameshift mutations, causing a truncated protein |
| Cardiac disease | COL1A2 | Intron 46 | GCTGTAAGT | GCTGCAAGT | Permitted almost exclusive use of a cryptic donor site 17 nt upstream in the exon |
| Cardiac disease | MYBPC3 | Intron 5 | CTCCATGCACACAGG | CTCCATGCACACCGG | Abnormal mRNA transcript with a premature stop codon will produce a truncated protein lacking the binding sites for myosin and titin |
| Cardiac disease | ACTC1 | Intron 1 | TTTTCTTCTCATAGG | TTTTCTTCTTATAGG | No effect |
| Eye disorder | ABCR | Intron 30 | CAGGTACCT | CAGTTACCT | Autosomal recessive RP and CRD |
| Eye disorder | VSX1 | Intron 5 | TTTTTTTTTACAAGG | TATTTTTTTACAAGG | Aberrant splicing |

Table 2. Mutations in the donor and acceptor splice sites in different genes causing inherited disorders

Genes causing immune system disorders

More than 100 immune system disorders affect humans, including inflammatory bowel diseases, multiple sclerosis, systemic lupus erythematosus, Bloom syndrome, familial cold autoinflammatory syndrome, and dyskeratosis congenita. The Shapiro–Senapathy algorithm has been used to discover genes and mutations involved in many immune disorders, including ataxia telangiectasia, B-cell defects, epidermolysis bullosa, and X-linked agammaglobulinemia.
Xeroderma pigmentosum, an autosomal recessive disorder, is caused by faulty proteins that result in defective nucleotide excision repair; the faulty proteins arise from a new preferred splice donor site, which was identified using the S&S algorithm.
Type I Bartter syndrome is caused by mutations in the gene SLC12A1. The S&S algorithm helped reveal two novel heterozygous mutations, c.724+4A>G in intron 5 and c.2095delG in intron 16, leading to complete skipping of exon 5.
Mutations in the MYH gene, which is responsible for removing oxidatively damaged DNA lesions, make individuals susceptible to cancer. The IVS1+5C variant plays a causative role in the activation of a cryptic splice donor site and in the alternative splicing of intron 1. The S&S algorithm shows that guanine at position IVS+5 is well conserved among primates. This also supports the finding that the G/C SNP in the conserved splice junction of the MYH gene causes the alternative splicing of intron 1 of the β type transcript.
Splice site scores calculated according to S&S have been used in studying EBV infection in X-linked lymphoproliferative disease. Familial tumoral calcinosis, an autosomal recessive disorder characterized by ectopic calcifications and elevated serum phosphate levels, has also been attributed to aberrant splicing.

Application of S&S in hospitals for clinical practice and research

Applying the S&S technology platform in modern clinical genomics research has advanced the diagnosis and treatment of human diseases.
In the modern era of Next Generation Sequencing technology, S&S is applied extensively in clinical practice. Clinicians and molecular diagnostic laboratories apply S&S using various computational tools including HSF, SSF, and Alamut. It aids in the discovery of genes and mutations in patients whose diseases are stratified, or whose disease remains unknown after clinical investigation.
In this context, S&S has been applied on cohorts of patients in different ethnic groups with various cancers and inherited disorders. A few examples are given below.

Cancers

Inherited disorders

S&S - the first algorithm for identifying splice sites, exons and split genes

Dr. Senapathy's original objective in developing a method for identifying splice sites was to find complete genes in raw, uncharacterized genomic sequence, for use in the human genome project. In the landmark paper with this objective, he described for the first time the basic method for identifying splice sites within a given sequence, based on the Position Weight Matrix (PWM) of the splicing sequences in different eukaryotic organism groups. He also created the first exon detection method by defining an exon as a sequence bounded by an acceptor and a donor splice site with S&S scores above a threshold, and contained within a mandatory open reading frame (ORF). Dr. Senapathy also described, for the first time, an algorithm for finding complete genes based on the identified exons.
Dr. Senapathy demonstrated that only deleterious mutations in the donor or acceptor splice sites, those that would make the protein drastically defective, reduce the splice site score, whereas non-deleterious variations do not. The S&S method was adapted for researching cryptic splice sites caused by mutations leading to diseases. This method for detecting deleterious splicing mutations in eukaryotic genes has been used extensively in disease research in humans, animals and plants over the past three decades, as described above.
The basic method for splice site identification, and for defining exons and genes, was subsequently used by researchers to find splice sites, exons and eukaryotic genes in a variety of organisms. These methods also formed the basis of subsequent tool development for discovering genes in uncharacterized genomic sequences. They were also used in different computational approaches, including machine learning and neural networks, and in alternative splicing research.
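The exon definition described in this section (an acceptor and a donor site above a score threshold, enclosing an open reading frame) can be sketched as follows. This is a simplified illustration under stated assumptions, not the published method: the splice site scores are taken as precomputed inputs, the function names and the default threshold of 70 are hypothetical, and an "open" segment is taken to mean that at least one of the three reading frames contains no stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_open_frame(segment):
    """Return True if at least one of the three reading frames has no stop codon."""
    for frame in range(3):
        codons = (segment[i:i + 3] for i in range(frame, len(segment) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            return True
    return False


def candidate_exons(sequence, acceptors, donors, threshold=70.0):
    """List (start, end) pairs that satisfy the simplified exon definition.

    acceptors: dict mapping the index of the first exon base to an acceptor score.
    donors:    dict mapping the index just past the last exon base to a donor score.
    threshold: hypothetical cut-off; real tools choose their own.
    """
    sequence = sequence.upper()
    exons = []
    for start, acceptor_score in sorted(acceptors.items()):
        if acceptor_score < threshold:
            continue
        for end, donor_score in sorted(donors.items()):
            if donor_score < threshold or end <= start:
                continue
            if has_open_frame(sequence[start:end]):
                exons.append((start, end))
    return exons


if __name__ == "__main__":
    toy_sequence = "TTTCAGATGGCCAAGGTAAGT"  # made-up sequence
    print(candidate_exons(toy_sequence, {6: 90.0}, {15: 85.0, 12: 40.0}))
```

Keeping the site scoring separate from the exon assembly step means any splice site scorer can feed this function; the pairing logic itself only needs positions and scores.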

Discovering the mechanisms of aberrant splicing in diseases

The Shapiro–Senapathy algorithm has been used to determine the various aberrant splicing mechanisms that arise from deleterious mutations in splice sites, which cause numerous diseases. Deleterious splice site mutations impair the normal splicing of gene transcripts and thereby make the encoded protein defective. A mutant splice site can become “weak” compared to the original site, so that the mutated splice junction is no longer recognized by the spliceosomal machinery. This can lead to the skipping of the exon in the splicing reaction, resulting in the loss of that exon from the spliced mRNA. Alternatively, a partial or complete intron can be included in the mRNA when a splice site mutation makes the site unrecognizable. Partial exon skipping or intron inclusion can lead to premature termination of the protein translated from the mRNA, and the resulting defective protein leads to disease. The S&S algorithm has thus paved the way to determine the mechanisms by which a deleterious mutation can lead to a defective protein, resulting in different diseases depending on which gene is affected.
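The effect of exon skipping on the reading frame, as described above, can be illustrated with a short Python sketch. The exon sequences below are invented toy data (they are not taken from any real gene), and the function names are hypothetical. The sketch joins the exons with and without the middle exon and reports where the first in-frame stop codon falls.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice(exons, skip=None):
    """Join exon sequences in order, optionally leaving one out (exon skipping)."""
    return "".join(exon for index, exon in enumerate(exons) if index != skip)


def first_stop_codon_index(mrna):
    """Return the 0-based codon index of the first in-frame stop codon, or None."""
    for i in range(0, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


# Toy exons (made up). The middle exon is 5 nt long, so removing it
# shifts the reading frame of everything downstream.
TOY_EXONS = ["ATGGCC", "GTTAA", "CATTGACGGATAA"]

if __name__ == "__main__":
    normal = splice(TOY_EXONS)
    skipped = splice(TOY_EXONS, skip=1)
    print("normal stop at codon", first_stop_codon_index(normal))
    print("after skipping exon 2, stop at codon", first_stop_codon_index(skipped))
```

In this toy case the normal transcript reads through to the stop codon at its end, while the transcript lacking the middle exon hits a stop codon earlier, because the skipped exon's length is not a multiple of three.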

Examples of splicing aberrations

An example of a splicing aberration caused by a mutation in the donor splice site of exon 8 of the MLH1 gene, which led to colorectal cancer, is given below. This example shows that a mutation in a splice site within a gene can have a profound effect on the sequence and structure of the mRNA, and on the sequence, structure and function of the encoded protein, leading to disease.
The generation of an mRNA from a split gene involves the transcription of the gene into the primary RNA transcript, the precise removal of the introns, and the joining of the exons from the primary RNA transcript. A deleterious mutation within the splicing signals can affect the recognition of the correct splice junction and lead to an aberration in the joining of the authentic exons. Depending on whether the mutation occurs within the donor or the acceptor site, and on the particular base that is mutated within the splice sequence, the aberration can lead to the skipping of a complete or partial exon, or to the inclusion of a partial intron or a cryptic exon in the mRNA produced by the splicing process. Any of these situations will usually lead to a premature stop codon in the mRNA and result in a completely defective protein. The S&S algorithm aids in determining which splice site and exon in a gene are mutated, and the S&S score of the mutated splice site aids in determining the type of splicing aberration and the resulting mRNA structure and sequence. The example gene MLH1 affected in colorectal cancer is shown in the figure. It was found using the S&S algorithm that a mutation in the donor splice site of exon 8 led to the skipping of exon 8. The mRNA thus lacks the sequence corresponding to exon 8. This causes a frameshift in the mRNA coding sequence at amino acid position 226, leading to premature protein truncation at amino acid position 233. This mutated protein is completely defective, which led to colorectal cancer in the patient.

S&S in cryptic splice sites research and medical applications

The identification of splice sites has to be highly precise, as the consensus splice sequences are very short and gene sequences contain many other sequences similar to the authentic splice sites, which are known as cryptic, non-canonical, or pseudo splice sites. When an authentic splice site is mutated, a cryptic splice site close to the original site may be erroneously used as the authentic site, resulting in an aberrant mRNA. The erroneous mRNA may include a partial sequence from the neighboring intron or lose part of an exon, which may result in a premature stop codon. The result may be a truncated protein that has lost its function completely.
The Shapiro–Senapathy algorithm can identify cryptic splice sites in addition to authentic splice sites. Cryptic sites can often be stronger than the authentic sites, with a higher S&S score. However, because it lacks an accompanying complementary donor or acceptor site, such a cryptic site will not be active or used in a splicing reaction. When a neighboring real site is mutated to become weaker than the cryptic site, the cryptic site may be used instead of the real site, resulting in a cryptic exon and an aberrant transcript.
Numerous diseases are caused by cryptic splice site mutations or by the usage of cryptic splice sites due to mutations in authentic splice sites.
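The comparison between a mutated authentic site and nearby cryptic sites can be sketched as a simple filter over precomputed scores. This is a deliberately crude illustration, not how any particular tool decides which site is used: the function name, the `max_distance` default of 50 nucleotides, and the example positions and scores are all hypothetical, and the sketch ignores the requirement for a complementary partner site mentioned above.

```python
def candidate_replacement_sites(authentic_pos, mutated_score, cryptic_sites, max_distance=50):
    """Return nearby cryptic sites that now outscore the mutated authentic site.

    authentic_pos: position of the authentic splice site.
    mutated_score: score of the authentic site after the mutation.
    cryptic_sites: list of (position, score) pairs for cryptic sites.
    max_distance:  hypothetical limit on how far away a cryptic site may lie.
    The result is sorted with the strongest candidate first.
    """
    nearby = [
        (pos, score)
        for pos, score in cryptic_sites
        if abs(pos - authentic_pos) <= max_distance and score > mutated_score
    ]
    return sorted(nearby, key=lambda site: site[1], reverse=True)


if __name__ == "__main__":
    # Made-up positions and scores for illustration only.
    cryptic = [(83, 72.5), (140, 60.0), (400, 95.0), (110, 50.0)]
    print(candidate_replacement_sites(100, 55.0, cryptic))
```

A strong site far from the authentic one (position 400 in the example) is excluded by the distance limit, which mirrors the observation that cryptic sites close to the original site are the ones that tend to be used.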

S&S in animal and plant genomics research

S&S has also been used in RNA splicing research in many animals and plants.
mRNA splicing plays a fundamental role in the regulation of gene function. It has been shown that A to G conversions at splice sites can lead to mRNA mis-splicing in Arabidopsis. In the molecular characterization and evolution of carnivorous sundew class V β-1,3-glucanase, the splicing and exon–intron junction prediction coincided with the GT/AG rule. Unspliced and spliced transcripts of NAD+ dependent sorbitol dehydrogenase of strawberry were investigated under phytohormonal treatments.
Ambra1 is a positive regulator of autophagy, a lysosome-mediated degradative process involved in both physiological and pathological conditions. To date, this function of Ambra1 has been characterized only in mammals and zebrafish. Diminution of rbm24a or rbm24b gene products by morpholino knockdown resulted in significant disruption of somite formation in mouse and zebrafish. The S&S algorithm has been used extensively to study the intron-exon organization of fut8 genes. The intron-exon boundaries of Sf9 fut8 were found, using S&S, to be in agreement with the consensus sequence for the splicing donor and acceptor sites.

The split-gene theory, introns and splice junctions

The motivation for Dr. Senapathy to develop a method for the detection of splice junctions came from his split-gene theory. If primordial DNA sequences had a random nucleotide organization, the random distribution of stop codons would allow only very short open reading frames (ORFs), as three stop codons out of 64 codons would result in an average ORF of ~60 bases. When Senapathy tested this in random DNA sequences, not only was this confirmed, but the longest ORFs even in very long DNA sequences were found to be ~600 bases, above which no ORFs existed. If so, a long coding sequence of even 1,200 bases, let alone longer coding sequences of 6,000 bases, would not occur in a primordial random sequence. Thus, genes had to occur in pieces, in a split form, with short coding sequences that became exons interrupted by very long random sequences that became introns. When eukaryotic DNA was tested for ORF length distribution, it exactly matched that from random DNA, with very short ORFs that matched the lengths of exons, and very long introns as predicted, supporting the split gene theory.
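The ~60 base figure above follows from treating each codon of a random sequence as a stop codon with probability 3/64. The Python sketch below works through that arithmetic and checks it with a small simulation. It assumes equal base frequencies and independent positions, which is a simplification; the function names, the sequence length and the random seed are arbitrary choices for the example.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}
P_STOP = 3 / 64  # chance that a random codon is one of the three stop codons


def expected_orf_bases():
    """Mean number of bases read before the first stop codon (geometric model)."""
    return 3 * (1 - P_STOP) / P_STOP


def prob_orf_at_least(n_bases):
    """Probability that a random reading frame stays open for n_bases."""
    return (1 - P_STOP) ** (n_bases // 3)


def orf_lengths(sequence):
    """Lengths in bases of the stop-free runs between stop codons, reading frame 0."""
    lengths, run = [], 0
    for i in range(0, len(sequence) - 2, 3):
        if sequence[i:i + 3] in STOP_CODONS:
            lengths.append(run * 3)
            run = 0
        else:
            run += 1
    return lengths


def simulate_mean_orf(n_bases=300_000, seed=0):
    """Average stop-free run length in a random sequence of n_bases."""
    rng = random.Random(seed)
    sequence = "".join(rng.choices("ACGT", k=n_bases))
    lengths = orf_lengths(sequence)
    return sum(lengths) / len(lengths)


if __name__ == "__main__":
    print("expected ORF length (bases):", expected_orf_bases())
    print("P(ORF >= 600 bases):", prob_orf_at_least(600))
    print("simulated mean (bases):", round(simulate_mean_orf(), 1))
```

Under this model the expected stop-free run is 61 bases, and the chance that a given reading frame stays open for 600 bases is below one in ten thousand, which shows how quickly long ORFs become rare in random sequence.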
If this split gene theory was true, then the ends of these ORFs, which by nature had a stop codon, would have become the ends of exons, would occur within introns, and would define the splice junctions. When this hypothesis was tested, almost all splice junctions in eukaryotic genes were found to contain stop codons exactly at the ends of introns, bordering the exons. In fact, these stop codons were found to form the “canonical” AG:GT splicing sequence, with the three stop codons occurring as part of the strong consensus signals. The Nobel Laureate Dr. Marshall Nirenberg, who deciphered the codons, stated that these findings strongly showed that the split gene theory for the origin of introns and the split structure of genes must be valid, and communicated the paper to PNAS. New Scientist covered this publication in “A long explanation for introns”.
This basic split gene theory led to the hypothesis that the splice junctions originated from the stop codons. Besides the codon CAG, only TAG, which is a stop codon, was found at the ends of introns. Surprisingly, all three stop codons were found after one base at the start of introns. These stop codons are shown in the consensus canonical donor splice junction as AG:GTGGT, wherein the TAA and TGA are the stop codons, and the additional TAG is also present at this position. The canonical acceptor splice junction is shown as AG:GT, in which TAG is the stop codon. These consensus sequences clearly show the presence of the stop codons at the ends of introns bordering the exons in all eukaryotic genes. Dr. Marshall Nirenberg, who was the referee for this paper, again stated that these observations fully supported the split gene theory for the origin of splice junction sequences from stop codons. New Scientist covered this publication in “Exons, Introns and Evolution”.
Dr. Senapathy wanted to detect the splice junctions in random DNA based on the consensus splice signal sequences, as he found that there were many sequences resembling splice sites that were not the real splice sites within genes. This Position Weight Matrix (PWM) method turned out to be a highly accurate algorithm for detecting the real splice sites and the cryptic sites in genes. He also formulated the first exon detection method, based on the requirement for splice junctions at the ends of exons and the requirement for an open reading frame that would contain the exon. This exon detection method also turned out to be highly accurate, detecting most of the exons with few false positives and false negatives. He extended this approach to define a complete split gene in a eukaryotic genomic sequence. Thus, the PWM-based algorithm turned out to be sensitive enough not only to detect the real splice sites and cryptic sites, but also to distinguish deleterious splice site mutations from non-deleterious ones.
The stop codons within splice junctions turned out to be the strongest bases in the splice junctions of eukaryotic genes when tested using the PWMs of the consensus sequences. In fact, it was shown that mutations in these bases caused disease more often than mutations in other bases, as three of the four bases of the canonical AG:GT are part of the stop codons. Senapathy showed that, when these canonical bases were mutated, the splice site score became weak, causing aberrations in the splicing process and in the translation of the mRNA. Although the value of the splice site detection method in discovering genes with disease-causing splicing mutations has been recognized over the years, its importance in clinical medicine has been increasingly recognized in the Next Generation Sequencing era over the past five years, with its incorporation in several tools based on the S&S algorithm.
Dr. Senapathy is currently the President and CSO of Genome International Corporation, a genomics R&D company based in Madison, WI. His team has developed several databases and tools for the analysis of splice junctions, including EuSplice, AspAlt, ExDom and RoBust. AspAlt was commended by Biotechniques, which stated that it solved a difficult problem for scientists in the comparative analysis and visualization of alternative splicing across different genomes. GIC has most recently developed the clinical genomics analysis platform Genome Explorer®.
