FASTA format

In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences. The format originates from the FASTA software package, but has now become a near universal standard in the field of bioinformatics.
The simplicity of FASTA format makes it easy to manipulate and parse sequences using text-processing tools and scripting languages like the R programming language, Python, Ruby, and Perl.

Original format & overview

The original FASTA/Pearson format is described in the documentation for the FASTA suite of programs. It can be downloaded with any free distribution of FASTA.
In the original format, a sequence was represented as a series of lines, each of which was no longer than 120 characters and usually did not exceed 80 characters. This was probably to allow for preallocation of fixed line sizes in software: at the time most users relied on Digital Equipment Corporation VT220 terminals, which could display 80 or 132 characters per line. Most people preferred the bigger font of the 80-character mode, so it became recommended practice to use 80 characters or fewer per FASTA line. Also, the width of a standard printed page is 70 to 80 characters. Hence, 80 characters became the norm.
The first line in a FASTA file started either with a ">" (greater-than) symbol or, less frequently, with a ";" (semicolon), and was taken as a comment. Subsequent lines starting with a semicolon would be ignored by software. Since the only comment used was the first, it quickly came to hold a summary description of the sequence, often starting with a unique library accession number. With time it has become commonplace to always use ">" for the first line and not to use ";" comments.
Following the initial line was the actual sequence itself, written as a string of standard one-letter codes. Anything other than a valid character would be ignored. It was also common to end the sequence with an "*" (asterisk) character and, for the same reason, to leave a blank line between the description and the sequence. Below are a few sample sequences:

;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*
>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*
>gi|5524211|gb|AAD44166.1| cytochrome b
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

A multiple sequence FASTA format would be obtained by concatenating several single sequence FASTA files in a common file. This does not imply a contradiction with the format as only the first line in a FASTA file may start with a ";" or ">", hence forcing all subsequent sequences to start with a ">" in order to be taken as different ones. Thus, the examples above may as well be taken as a multisequence file if taken together.
Modern bioinformatics programs that rely on the FASTA format expect sequence headers to be preceded by ">". The sequence itself is generally "interleaved", i.e. split over multiple lines as in the example above, but may also be "sequential", with the full sequence on a single line. Users often need to convert between sequential and interleaved FASTA to run different bioinformatics programs.
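The conversion between the two layouts can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `to_sequential` and `to_interleaved` and the default width of 60 are assumptions made for this example, and the sketch assumes every header starts with ">" (no ";" comment lines).

```python
from typing import Iterable, Iterator


def to_sequential(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged and each record's sequence as one line."""
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record's sequence, if any.
            if chunks:
                yield "".join(chunks)
                chunks = []
            yield line
        elif line.strip():
            # Blank lines are skipped; other lines are sequence fragments.
            chunks.append(line.strip())
    if chunks:
        yield "".join(chunks)


def to_interleaved(lines: Iterable[str], width: int = 60) -> Iterator[str]:
    """Yield header lines unchanged and sequences wrapped at `width` characters."""
    for line in to_sequential(lines):
        if line.startswith(">"):
            yield line
        else:
            for start in range(0, len(line), width):
                yield line[start:start + width]
```

Both functions are generators, so a large file can be streamed line by line, for example by passing an open file object as `lines`.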

Description line

The description line or header/identifier line, which begins with '>', gives a name and/or a unique identifier for the sequence, and may also contain additional information. In a deprecated practice, the header line sometimes contained more than one header, separated by a ^A (Control-A) character. In the original Pearson FASTA format, one or more comments, distinguished by a semicolon at the beginning of the line, may occur after the header. Some databases and bioinformatics applications do not recognize these comments and follow the NCBI FASTA specification. An example of a multiple sequence FASTA file follows:

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
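A file like the one above can be read with a small parser. The following Python sketch is one possible approach under stated assumptions: the names `FastaRecord` and `parse_fasta` are invented for this example, the identifier is taken to be the first whitespace-delimited word of the header (a common convention, not a rule of the format), a leading ";" line is treated as the header as in the original Pearson format, and later ";" lines are ignored.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastaRecord(NamedTuple):
    identifier: str   # first word of the header line
    description: str  # remainder of the header line, possibly empty
    sequence: str     # sequence lines joined, trailing "*" removed


def _make_record(header: str, chunks: List[str]) -> FastaRecord:
    parts = header.strip().split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    # The original format allowed a terminating "*"; drop it here.
    sequence = "".join(chunks).rstrip("*")
    return FastaRecord(identifier, description, sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[FastaRecord]:
    """Yield one FastaRecord per sequence found in `lines`."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif line.startswith(";"):
            # Only a ";" line that opens the file acts as a header;
            # every other ";" line is a comment and is skipped.
            if header is None:
                header = line[1:]
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)
```

Typical use would be `for record in parse_fasta(open(path)):`, where `path` is a hypothetical file name supplied by the caller.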

NCBI identifiers

The NCBI defined a standard for the unique identifier used for the sequence in the header line. This allows a sequence that was obtained from a database to be labelled with a reference to its database record. The database identifier format is understood by the NCBI tools like makeblastdb and table2asn. The following list describes the NCBI FASTA defined format for sequence identifiers.

Type	Format	Example
local	`lcl|integer` `lcl|string`	`lcl|123` `lcl|hmm271`
GenInfo backbone seqid	`bbs|integer`	`bbs|123`
GenInfo backbone moltype	`bbm|integer`	`bbm|123`
GenInfo import ID	`gim|integer`	`gim|123`
GenBank	`gb|accession|locus`	`gb|M73307|AGMA13GT`
EMBL	`emb|accession|locus`	`emb|CAM43271.1|`
PIR	`pir|accession|name`	`pir||G36364`
SWISS-PROT	`sp|accession|name`	`sp|P01013|OVAX_CHICK`
patent	`pat|country|patent|sequence-number`	`pat|US|RE33188|1`
pre-grant patent	`pgp|country|application-number|sequence-number`	`pgp|EP|0238993|7`
RefSeq	`ref|accession|name`	`ref|NM_010450.1|`
general database reference	`gnl|database|integer` `gnl|database|string`	`gnl|taxon|9606` `gnl|PID|e1632`
GenInfo integrated database	`gi|integer`	`gi|21434723`
DDBJ	`dbj|accession|locus`	`dbj|BAC85684.1|`
PRF	`prf|accession|name`	`prf||0806162C`
PDB	`pdb|entry|chain`	`pdb|1I4L|D`
third-party GenBank	`tpg|accession|name`	`tpg|BK003456|`
third-party EMBL	`tpe|accession|name`	`tpe|BN000123|`
third-party DDBJ	`tpd|accession|name`	`tpd|FAA00017|`
TrEMBL	`tr|accession|name`	`tr|Q90RT2|Q90RT2_9HIV1`

The vertical bars in the above list are not separators in the sense of the Backus–Naur form, but are part of the format. Multiple identifiers can be concatenated, also separated by vertical bars.
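Because each identifier type has a fixed number of fields, a concatenated identifier string can be split back into its parts. The Python sketch below illustrates the idea; the names `FIELD_COUNTS` and `parse_ncbi_ids` are assumptions made for this example, the field counts are read off the list above, and the function is not how the NCBI tools themselves parse identifiers.

```python
from typing import List, Tuple

# Number of "|"-separated fields that follow each type tag, per the list above.
FIELD_COUNTS = {
    "lcl": 1, "bbs": 1, "bbm": 1, "gim": 1, "gi": 1,
    "gb": 2, "emb": 2, "dbj": 2, "pir": 2, "sp": 2, "ref": 2,
    "prf": 2, "pdb": 2, "tpg": 2, "tpe": 2, "tpd": 2, "tr": 2, "gnl": 2,
    "pat": 3, "pgp": 3,
}


def parse_ncbi_ids(token: str) -> List[Tuple[str, Tuple[str, ...]]]:
    """Split e.g. 'gi|5524211|gb|AAD44166.1|' into (tag, fields) pairs."""
    parts = token.split("|")
    ids = []
    i = 0
    while i < len(parts):
        tag = parts[i]
        if tag == "":
            i += 1  # stray empty piece, e.g. after a trailing bar
            continue
        if tag not in FIELD_COUNTS:
            raise ValueError(f"unknown identifier type: {tag!r}")
        count = FIELD_COUNTS[tag]
        fields = parts[i + 1:i + 1 + count]
        # Pad with empty strings when optional trailing fields are omitted.
        fields += [""] * (count - len(fields))
        ids.append((tag, tuple(fields)))
        i += 1 + count
    return ids
```

The token passed in is the first word of the header line, without the leading ">".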

Sequence representation

Following the header line, the actual sequence is represented. Sequences may be protein sequences or nucleic acid sequences, and they can contain gaps or alignment characters. Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap character; and in amino acid sequences, U and * are acceptable letters. Numerical digits are not allowed but are used in some databases to indicate the position in the sequence. The nucleic acid codes supported are:

Nucleic Acid Code	Meaning	Mnemonic
A	A	Adenine
C	C	Cytosine
G	G	Guanine
T	T	Thymine
U	U	Uracil
I	I	Inosine
R	A or G	puRine
Y	C, T or U	pYrimidines
K	G, T or U	bases which are Ketones
M	A or C	bases with aMino groups
S	C or G	Strong interaction
W	A, T or U	Weak interaction
B	not A	B comes after A
D	not C	D comes after C
H	not G	H comes after G
V	neither T nor U	V comes after U
N	A C G T U	Nucleic acid
-	gap of indeterminate length
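The ambiguity codes in this table can be expressed as a lookup from each code to the set of bases it stands for. The following Python sketch is illustrative only: the names `IUPAC_DNA`, `code_matches` and `expand` are invented here, U is treated as equivalent to T, and inosine and the gap character are deliberately left out.

```python
from itertools import product
from typing import List

# Each code maps to the concrete bases it may represent (U is folded into T).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def _normalise(symbol: str) -> str:
    """Upper-case a symbol and treat uracil as thymine."""
    return symbol.upper().replace("U", "T")


def code_matches(code: str, base: str) -> bool:
    """Return True if concrete `base` is one of the bases `code` allows."""
    return _normalise(base) in IUPAC_DNA.get(_normalise(code), "")


def expand(sequence: str) -> List[str]:
    """List every concrete sequence an ambiguous sequence can stand for."""
    try:
        choices = [IUPAC_DNA[_normalise(ch)] for ch in sequence]
    except KeyError as err:
        raise ValueError(f"unsupported code: {err.args[0]!r}") from None
    return ["".join(combo) for combo in product(*choices)]
```

Note that `expand` grows exponentially with the number of ambiguous positions, so it is only practical for short sequences such as primers.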

The amino acid codes supported are:

Amino Acid Code	Meaning
A	Alanine
B	Aspartic acid or Asparagine
C	Cysteine
D	Aspartic acid
E	Glutamic acid
F	Phenylalanine
G	Glycine
H	Histidine
I	Isoleucine
J	Leucine or Isoleucine
K	Lysine
L	Leucine
M	Methionine/Start codon
N	Asparagine
O	Pyrrolysine
P	Proline
Q	Glutamine
R	Arginine
S	Serine
T	Threonine
U	Selenocysteine
V	Valine
W	Tryptophan
Y	Tyrosine
Z	Glutamic acid or Glutamine
X	any
*	translation stop
-	gap of indeterminate length
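A simple validity check follows directly from the two tables: upper-case each character and test it against the permitted set. This Python sketch makes a few assumptions of its own: the names `NUCLEIC_CODES`, `AMINO_CODES` and `invalid_characters` are invented for the example, and positions are reported zero-based.

```python
from typing import List, Set, Tuple

# Codes from the nucleic acid table, including inosine (I) and the gap.
NUCLEIC_CODES = set("ACGTUIRYKMSWBDHVN-")
# The amino acid table uses all 26 letters plus "*" (stop) and "-" (gap).
AMINO_CODES = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ*-")


def invalid_characters(sequence: str, alphabet: Set[str]) -> List[Tuple[int, str]]:
    """Return (position, character) pairs for symbols outside `alphabet`.

    Lower-case letters are accepted and compared in upper case.
    """
    return [
        (index, char)
        for index, char in enumerate(sequence)
        if char.upper() not in alphabet
    ]
```

An empty result means every character is permitted. Because the amino acid alphabet covers every letter, this check cannot tell a protein sequence from a nucleic acid one; it only flags digits, whitespace and other stray symbols.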

FASTA file

Filename extension

There is no standard filename extension for a text file containing FASTA formatted sequences. The table below lists commonly used extensions and their respective meanings.

Extension	Meaning	Notes
fasta, fa	generic FASTA	Any generic fasta file. See below for other common FASTA file extensions
fna	FASTA nucleic acid	Used generically to specify nucleic acids.
ffn	FASTA nucleotide of gene regions	Contains coding regions for a genome.
faa	FASTA amino acid	Contains amino acid sequences. A multiple protein fasta file can have the more specific extension mpfa.
frn	FASTA non-coding RNA	Contains non-coding RNA regions for a genome, in DNA alphabet e.g. tRNA, rRNA
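Since the extension is only a convention, software usually treats it as a hint. The lookup below is a minimal Python sketch of that idea; the names `EXTENSION_MEANING` and `describe_extension` are assumptions for this example, and the mapping simply restates the table above.

```python
from pathlib import Path
from typing import Optional

# Extension (lower case, with leading dot) to meaning, as in the table above.
EXTENSION_MEANING = {
    ".fasta": "generic FASTA",
    ".fa": "generic FASTA",
    ".fna": "FASTA nucleic acid",
    ".ffn": "FASTA nucleotide of gene regions",
    ".faa": "FASTA amino acid",
    ".mpfa": "FASTA amino acid (multiple proteins)",
    ".frn": "FASTA non-coding RNA",
}


def describe_extension(filename: str) -> Optional[str]:
    """Return the conventional meaning of a file's extension, or None."""
    return EXTENSION_MEANING.get(Path(filename).suffix.lower())
```

A result of `None` only means the extension is not in the table; the file may still contain valid FASTA data.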

Compression

Compressing FASTA files well requires a compressor that handles both channels of information: identifiers and sequence. For improved compression, these are usually split into two streams that are compressed independently. For example, the algorithm MFCompress performs lossless compression of these files using context modelling and arithmetic encoding. For a benchmark of FASTA compression algorithms, see Hosseini et al., 2016.
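The two-stream idea can be illustrated with a general-purpose compressor. The Python sketch below uses `zlib` from the standard library purely as a stand-in; it is not MFCompress and does not use context modelling or arithmetic encoding. The function names are invented for this example, headers are assumed to be non-empty and to start with ">", and the original line wrapping is not preserved (each sequence is restored on a single line).

```python
import zlib
from typing import Tuple


def compress_two_streams(fasta_text: str) -> Tuple[bytes, bytes]:
    """Compress headers and sequences as two independent streams."""
    headers, sequences = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            headers.append(line[1:])
            sequences.append("")  # start a new, empty sequence
        elif sequences:
            sequences[-1] += line  # append to the current record
    header_blob = zlib.compress("\n".join(headers).encode("utf-8"), 9)
    sequence_blob = zlib.compress("\n".join(sequences).encode("ascii"), 9)
    return header_blob, sequence_blob


def decompress_two_streams(header_blob: bytes, sequence_blob: bytes) -> str:
    """Rebuild a sequential-layout FASTA text from the two streams."""
    header_text = zlib.decompress(header_blob).decode("utf-8")
    sequence_text = zlib.decompress(sequence_blob).decode("ascii")
    if not header_text:
        return ""  # no records were stored
    headers = header_text.split("\n")
    sequences = sequence_text.split("\n")
    return "".join(f">{h}\n{s}\n" for h, s in zip(headers, sequences))
```

Separating the streams lets the compressor see long runs of a small alphabet in one stream and free-form text in the other, which is the motivation described above; specialised tools go further by modelling each stream differently.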

Encryption

The encryption of FASTA files has mostly been addressed with a dedicated tool, Cryfa. Cryfa uses AES encryption and also compacts the data in addition to encrypting it. It can also handle FASTQ files.

Extensions

FASTQ format is a form of FASTA format extended to indicate information related to sequencing. It was created by the Sanger Centre in Cambridge.
A2M/A3M are a family of FASTA-derived formats used for sequence alignments. In A2M/A3M sequences, lowercase characters are taken to mean insertions, which are then indicated in the other sequences by the dot character. The dots can be discarded for compactness without loss of information. As with typical FASTA used in alignments, the gap is taken to mean exactly one position. A3M is similar to A2M, with the added rule that gaps aligned to insertions can also be discarded.
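As a small illustration of the A2M conventions just described, the Python sketch below reduces an A2M row to its aligned (match) columns by dropping lowercase insertions and the dots that pad them. The function name `match_columns` is an assumption for this example, and the sketch does not attempt full A2M/A3M conversion.

```python
def match_columns(row: str) -> str:
    """Keep only upper-case residues and "-" gaps from an A2M row.

    Lower-case letters (insertions) and "." (padding opposite an
    insertion) are removed, so all rows of an alignment end up with
    the same length.
    """
    return "".join(ch for ch in row if ch != "." and not ch.islower())
```

For instance, the rows `ACdeG` and `AC..G` both reduce to the three match columns `ACG`.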

Working with FASTA files

Many user-friendly scripts are available from the community to perform FASTA file manipulations. Online toolboxes are also available, such as FaBox or the FASTX-Toolkit within Galaxy servers. For instance, these can be used to segregate sequence headers/identifiers, rename them, shorten them, or extract sequences of interest from large FASTA files based on a list of wanted identifiers. A tree-based approach to sorting multi-FASTA files also exists, based on the coloring and/or annotation of sequences of interest in the FigTree viewer. Additionally, Bioconductor.org's Biostrings package can be used to read and manipulate FASTA files in R.
Several online format converters exist to rapidly reformat multi-FASTA files to different formats for use with different phylogenetic programs (e.g. the converter available on phylogeny.fr).
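Extracting sequences of interest from a list of wanted identifiers, one of the tasks mentioned above, takes only a few lines in a scripting language. The Python sketch below is a hypothetical example rather than the method used by any of the named tools: the function name `extract_records` is invented, and the identifier is assumed to be the first whitespace-delimited word after ">".

```python
from typing import Iterable, Iterator


def extract_records(lines: Iterable[str], wanted_ids: Iterable[str]) -> Iterator[str]:
    """Yield the lines of every record whose identifier is in `wanted_ids`."""
    wanted = set(wanted_ids)
    keep = False
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Decide once per record, when its header line is seen.
            fields = line[1:].split(None, 1)
            keep = bool(fields) and fields[0] in wanted
        if keep:
            yield line
```

Because the function streams its input, it can be applied to large files without loading them into memory.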
