BLAT (bioinformatics)


BLAT is a pairwise sequence alignment algorithm that was developed by Jim Kent at the University of California, Santa Cruz in the early 2000s to assist in the assembly and annotation of the human genome. It was designed primarily to decrease the time needed to align millions of mouse genomic reads and expressed sequence tags against the human genome sequence. The alignment tools of the time could not perform these operations quickly enough to allow regular updates of the human genome assembly. Compared to pre-existing tools, BLAT was about 500 times faster at performing mRNA/DNA alignments and about 50 times faster at protein/protein alignments.

Overview

BLAT is one of multiple algorithms developed for the analysis and comparison of biological sequences such as DNA, RNA and proteins, with a primary goal of inferring homology in order to discover biological function of genomic sequences. It is not guaranteed to find the mathematically optimal alignment between two sequences like the classic Needleman-Wunsch and Smith-Waterman dynamic programming algorithms do; rather, it first attempts to rapidly detect short sequences which are more likely to be homologous, and then it aligns and further extends the homologous regions. It is similar to the heuristic BLAST family of algorithms, but each tool has tried to deal with the problem of aligning biological sequences in a timely and efficient manner by attempting different algorithmic techniques.

Uses of BLAT

BLAT can be used to align DNA sequences as well as protein and translated nucleotide sequences. It is designed to work best on highly similar sequences. The DNA search is most effective for primates and the protein search is effective for land vertebrates. In addition, protein or translated sequence queries are more effective than DNA sequence queries for identifying distant matches and for cross-species analysis. Typical uses of BLAT include the following:
BLAT is designed to find matches between sequences of length at least 40 bases that share ≥95% nucleotide identity or ≥80% translated protein identity.

Process

BLAT is used to find regions in a target genomic database which are similar to a query sequence under examination. The general algorithmic process followed by BLAT is similar to BLAST's in that it first searches for short segments in the database and query sequences which have a certain number of matching elements. These alignment seeds are then extended in both directions of the sequences in order to form high-scoring pairs. However, BLAT uses a different indexing approach from BLAST, which allows it to rapidly scan very large genomic and protein databases for similarities to a query sequence. It does this by keeping an indexed list of the target database in memory, which significantly reduces the time required for the comparison of the query sequences with the target database. This index is built by taking the coordinates of all the non-overlapping k-mers in the target database, except for highly repeated k-mers. BLAT then builds a list of all overlapping k-mers from the query sequence and searches for these in the target database, building up a list of hits where there are matches between the sequences.
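The indexing and lookup steps described above can be sketched in a few lines of Python. This is a minimal illustration, not BLAT's actual implementation: the function names `build_index` and `find_hits` and the `max_occurrences` cutoff for highly repeated k-mers are assumptions made for this example.

```python
from collections import defaultdict


def build_index(target, k, max_occurrences=None):
    """Map each non-overlapping k-mer of the target to its start positions.

    max_occurrences is a hypothetical cutoff: k-mers seen more often than
    this are dropped, mimicking the exclusion of highly repeated k-mers.
    """
    index = defaultdict(list)
    # Step by k so that the indexed k-mers do not overlap.
    for pos in range(0, len(target) - k + 1, k):
        index[target[pos:pos + k]].append(pos)
    if max_occurrences is not None:
        return {kmer: positions for kmer, positions in index.items()
                if len(positions) <= max_occurrences}
    return dict(index)


def find_hits(query, index, k):
    """Look up every overlapping k-mer of the query in the target index.

    Returns a list of (query_position, target_position) pairs.
    """
    hits = []
    # Step by 1 so that the query k-mers overlap.
    for qpos in range(len(query) - k + 1):
        for tpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, tpos))
    return hits


index = build_index("ACGTACGTTTGA", k=4)
print(find_hits("GTACGTT", index, k=4))
```

Indexing only the non-overlapping k-mers keeps the index roughly k times smaller than an index of every position, while scanning all overlapping k-mers of the query ensures that a matching region is still found whatever its offset relative to the indexed k-mers.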

Search stage

There are three different strategies used in order to search for candidate homologous regions:
  1. The first method requires single perfect matches between the query and database sequences, i.e. the two k-mer words must be exactly the same. This approach is not considered the most practical, because a small k-mer size is necessary to achieve high sensitivity, but a small k-mer size increases the number of false positive hits and therefore the time spent in the alignment stage of the algorithm.
  2. The second method allows one mismatch between the two k-mer words. This decreases the number of false positives, allowing larger k-mer sizes which are less computationally expensive to handle than those produced by the previous method. This method is very effective in identifying small homologous regions.
  3. The third method requires multiple perfect matches which are in close proximity to each other. As Kent shows, this is a very effective technique capable of taking into consideration small insertions and deletions within the homologous regions.
When aligning nucleotides, BLAT uses the third method requiring two perfect word matches of size 11. When aligning proteins, the BLAT version determines the search methodology used: when the client/server version is used, BLAT searches for three perfect 4-mer matches; when the stand-alone version is used, BLAT searches for a single perfect 5-mer between the query and database sequences.
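The multiple-perfect-match criterion can be illustrated with a simplified sketch. It takes a list of (query position, target position) k-mer hits and keeps pairs of hits that lie close together in the target and on nearly the same diagonal (target position minus query position), so that a small insertion or deletion between the two hits is tolerated. The function name `paired_hits` and the parameters `window` and `max_diag_shift`, including their default values, are assumptions for illustration and are not taken from BLAT's source.

```python
def paired_hits(hits, window=100, max_diag_shift=2):
    """Return pairs of perfect k-mer hits that support each other.

    hits: list of (query_pos, target_pos) tuples.
    window: hypothetical maximum distance between the two hits in the target.
    max_diag_shift: hypothetical maximum difference between the diagonals
        of the two hits; a non-zero value tolerates small indels.
    """
    ordered = sorted(hits, key=lambda hit: hit[1])  # sort by target position
    pairs = []
    for i, (q1, t1) in enumerate(ordered):
        for q2, t2 in ordered[i + 1:]:
            if t2 - t1 > window:
                break  # later hits are even further away in the target
            diag_shift = abs((t2 - q2) - (t1 - q1))
            if q2 > q1 and t2 > t1 and diag_shift <= max_diag_shift:
                pairs.append(((q1, t1), (q2, t2)))
    return pairs


# Two 11-mer hits on the same diagonal, plus one isolated hit far away.
hits = [(3, 110), (14, 121), (40, 550)]
print(paired_hits(hits))
```

In this example the isolated hit at target position 550 is discarded because no second hit lies nearby, which is how requiring two matches suppresses chance single-word hits.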

BLAT vs. BLAST

Some of the differences between BLAT and BLAST are outlined below:
BLAT can be used either as a web-based server-client program or as a stand-alone program.

Server-client

The web-based application of BLAT can be accessed from the UCSC Genome Bioinformatics Site. Building the index is a relatively slow procedure. Therefore, each genome assembly used by the web-based BLAT is associated with a BLAT server, in order to have a pre-computed index available for alignments. These web-based BLAT servers keep the index in memory for users to input their query sequences.
Once the query sequence is uploaded/pasted into the search field, the user can select various parameters such as which species' genome to target and the assembly version of that genome, the query type and output settings. The user can then run the search by either submitting the query or using the BLAT "I'm feeling lucky" search.
Bhagwat et al. provide step-by-step protocols for how to use BLAT to:
BLAT can handle long database sequences; however, it is more effective with short query sequences than with long ones. Kent recommends a maximum query length of 200,000 bases. The UCSC browser limits query sequences to less than 25,000 letters for DNA searches and less than 10,000 letters for protein and translated sequence searches.
The BLAT Search Genome available on the UCSC website accepts query sequences as text or uploaded as text files. The BLAT Search Genome can accept multiple sequences of the same type at once, up to a maximum of 25. For multiple sequences, the total number of nucleotides must not exceed 50,000 for DNA searches or 25,000 letters for protein or translated sequence searches.
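Before submitting a batch to the web form, the limits quoted above can be checked locally. The following is a small sketch that uses only the figures stated in this section; the names `LIMITS`, `MAX_SEQUENCES` and `check_batch` are hypothetical, translated queries are assumed to share the protein limits, and the limits enforced by the live server may differ from those listed here.

```python
# Limits as quoted in the text above (assumed to be current).
LIMITS = {
    "dna": {"per_sequence": 25_000, "total": 50_000},
    # Protein and translated sequence searches share the same limits here.
    "protein": {"per_sequence": 10_000, "total": 25_000},
}
MAX_SEQUENCES = 25


def check_batch(sequences, query_type="dna"):
    """Return a list of human-readable problems; empty if the batch fits."""
    limits = LIMITS[query_type]
    problems = []
    if len(sequences) > MAX_SEQUENCES:
        problems.append(f"{len(sequences)} sequences exceeds the maximum of {MAX_SEQUENCES}")
    for number, seq in enumerate(sequences, start=1):
        # Each sequence must be shorter than the per-sequence limit.
        if len(seq) >= limits["per_sequence"]:
            problems.append(f"sequence {number} has {len(seq)} letters "
                            f"(must be less than {limits['per_sequence']})")
    total = sum(len(seq) for seq in sequences)
    if total > limits["total"]:
        problems.append(f"total of {total} letters exceeds {limits['total']}")
    return problems


print(check_batch(["ACGT" * 10, "TTGACA" * 5]))  # an empty list: batch is acceptable
```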
An example of searching a target database with a DNA query sequence is shown in Figure 2.

Output

A BLAT search returns a list of results sorted in decreasing order of score. The following information is returned: the score of the alignment, the region of the query sequence that matches the database sequence, the size of the query sequence, the level of identity as a percentage of the alignment, and the chromosome and position that the query sequence maps to. Bhagwat et al. describe how the BLAT "Score" and "Identity" measures are calculated.
For each search result, the user is provided with a link to the UCSC Genome Browser so they can visualise the alignment on the chromosome. This is a major benefit of the web-based BLAT over the stand-alone BLAT. The user is able to obtain biological information associated with the alignment, such as information about the gene to which the query may match.
The user is also provided with a link to view the alignment of the query sequence with the genome assembly. The matches between the query and genome assembly are blue and the boundaries of the alignments are lighter in colour. These exon boundaries indicate splice sites.
The "I'm feeling lucky" search result returns the highest scoring alignment for the first query sequence based on the output sort option selected by the user.

Stand-alone

Stand-alone BLAT is more suitable for batch runs, and more efficient than the web-based BLAT. It is more efficient because it is able to store the genome in memory, unlike the web-based application which only stores the index in memory.

License

Both the source and precompiled binaries of BLAT are freely available for academic and personal use. Commercial license of stand-alone BLAT is distributed by