Stockholm format

Stockholm format is a multiple sequence alignment format used by Pfam and Rfam to disseminate protein and RNA sequence alignments. Several alignment editors support Stockholm format, as do the probabilistic database search tools Infernal and HMMER and the phylogenetic analysis tool Xrate. Stockholm format files often have the filename extension .sto or .stk.

Syntax

A well-formed Stockholm file always begins with a header stating the format and version identifier, currently "# STOCKHOLM 1.0". The header is followed by multiple lines, a mix of markup and sequences. Finally, the "//" line indicates the end of the alignment.
A minimal example looks like:


# STOCKHOLM 1.0
#=GF ID    EXAMPLE
<seqname> <aligned sequence>
<seqname> <aligned sequence>
<seqname> <aligned sequence>
//

Sequences are written one per line. The sequence name comes first, followed by any amount of whitespace and then the aligned sequence. Sequence names are typically in the form "name/start-end" or just "name". Sequence letters may include any characters except whitespace. Gaps may be indicated by "." or "-".
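
The sequence-line layout described above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the function names split_seqname and parse_sequence_line are hypothetical, and the sketch assumes that a trailing "/start-end" with integer coordinates is the only coordinate form that needs handling.

```python
import re

# Hypothetical helper pattern: matches an optional trailing "/start-end" suffix.
_COORDS = re.compile(r"^(?P<name>.+)/(?P<start>\d+)-(?P<end>\d+)$")


def split_seqname(seqname):
    """Return (name, start, end); start and end are None for a bare name."""
    m = _COORDS.match(seqname)
    if m is None:
        return seqname, None, None
    return m.group("name"), int(m.group("start")), int(m.group("end"))


def parse_sequence_line(line):
    """Split a sequence line into (seqname, aligned_sequence).

    The name and the sequence are separated by any amount of whitespace.
    """
    seqname, aligned = line.split(None, 1)
    return seqname, aligned.strip()
```

For instance, split_seqname("O83071/192-246") yields the name "O83071" with coordinates 192 and 246, while a bare name is returned unchanged with both coordinates set to None.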
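Mark-up lines start with "#". Their "parameters" are separated by whitespace, so within the one-character-per-column markups an underscore should be used instead of a space. The defined mark-up types are:


#=GF <feature> <Generic per-File annotation, free text>
#=GC <feature> <Generic per-Column annotation, exactly 1 char per column>
#=GS <seqname> <feature> <Generic per-Sequence annotation, free text>
#=GR <seqname> <feature> <Generic per-Residue annotation, exactly 1 char per residue>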
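
One way to read the four mark-up types alongside the sequences is sketched below in Python. The function name parse_stockholm and the layout of the returned dictionary are hypothetical choices made for illustration. The sketch handles a single alignment, and it concatenates repeated sequence names and repeated per-column or per-residue markup, which is an assumption made to tolerate alignments split into blocks.

```python
def parse_stockholm(text):
    """Parse one alignment into sequences plus GF, GC, GS and GR markup.

    Hypothetical sketch: the returned dict layout is an illustrative choice.
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith("# STOCKHOLM"):
        raise ValueError("missing '# STOCKHOLM' header")
    aln = {"seqs": {}, "GF": {}, "GC": {}, "GS": {}, "GR": {}}
    for line in lines[1:]:
        line = line.rstrip()
        if not line:
            continue
        if line == "//":  # end of the alignment
            break
        if line.startswith("#=GF"):
            # Per-file annotation: free text, a feature may repeat.
            parts = line.split(None, 2)
            value = parts[2] if len(parts) > 2 else ""
            aln["GF"].setdefault(parts[1], []).append(value)
        elif line.startswith("#=GC"):
            # Per-column annotation: one character per alignment column.
            _, feature, markup = line.split(None, 2)
            aln["GC"][feature] = aln["GC"].get(feature, "") + markup
        elif line.startswith("#=GS"):
            # Per-sequence annotation: free text.
            _, seqname, feature, value = line.split(None, 3)
            aln["GS"].setdefault(seqname, {}).setdefault(feature, []).append(value)
        elif line.startswith("#=GR"):
            # Per-residue annotation: one character per residue.
            _, seqname, feature, markup = line.split(None, 3)
            per_seq = aln["GR"].setdefault(seqname, {})
            per_seq[feature] = per_seq.get(feature, "") + markup
        elif line.startswith("#"):
            continue  # any other comment line is ignored
        else:
            seqname, residues = line.split(None, 1)
            aln["seqs"][seqname] = aln["seqs"].get(seqname, "") + residues
    return aln
```

Free-text values (GF and GS) are kept as lists because a feature such as CC or RT may span several lines, whereas column-wise markup is kept as one string so that it can be indexed by alignment column.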

Recommended features

These feature names are used by Pfam and Rfam for specific types of annotation.

#=GF

Pfam and Rfam may use the following tags:

Compulsory fields:
------------------
AC   Accession number:      Accession number in form PFxxxxx or RFxxxxx.
ID   Identification:        One word name for family.
DE   Definition:            Short description of family.
AU   Author:                Authors of the entry.
SE   Source of seed:        The source suggesting the seed members belong to one family.
SS   Source of structure:   The source of the consensus RNA secondary structure used by Rfam.
BM   Build method:          Command line used to generate the model.
SM   Search method:         Command line used to perform the search.
GA   Gathering threshold:   Search threshold to build the full alignment.
TC   Trusted Cutoff:        Lowest sequence score of match in the full alignment.
NC   Noise Cutoff:          Highest sequence score of match not in full alignment.
TP   Type:                  Type of family -- presently Family, Domain, Motif or Repeat for Pfam.
                                           -- a tree with roots Gene, Intron or Cis-reg for Rfam.
SQ   Sequence:              Number of sequences in alignment.

Optional fields:
----------------
DC   Database Comment:      Comment about database reference.
DR   Database Reference:    Reference to external database.
RC   Reference Comment:     Comment about literature reference.
RN   Reference Number:      Reference Number.
RM   Reference Medline:     Eight digit medline UI number.
RT   Reference Title:       Reference Title.
RA   Reference Author:      Reference Author.
RL   Reference Location:    Journal location.
PI   Previous identifier:   Record of all previous ID lines.
KW   Keywords:              Keywords.
CC   Comment:               Comments.
NE   Pfam accession:        Indicates a nested domain.
NL   Location:              Location of nested domains - sequence ID, start and end of insert.
WK   Wikipedia link:        Wikipedia page.
CL   Clan:                  Clan accession.
MB   Membership:            Used for listing Clan membership.

For embedding trees:
--------------------
NH   New Hampshire:         A tree in New Hampshire eXtended format.
TN   Tree ID:               A unique identifier for the next tree.

Other:
------
FR   False discovery Rate:  A method used to set the bit score threshold based on the ratio of
                            expected false positives to true positives. Floating point number between 0 and 1.
CB   Calibration method:    Command line used to calibrate the model.

Notes: A tree may be stored on multiple #=GF NH lines.
If multiple trees are stored in the same file, each tree must be preceded by a #=GF TN line with a unique tree identifier. If only one tree is included, the #=GF TN line may be omitted.
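
The tree rules in the notes above can be sketched as follows. This is an illustrative Python fragment, not part of any Pfam or Rfam tool: the function name collect_trees is hypothetical, and it assumes the #=GF lines have already been reduced to (feature, value) pairs in file order. A single tree with no #=GF TN line is stored under the key None.

```python
def collect_trees(gf_lines):
    """Group '#=GF NH' values into trees keyed by the preceding '#=GF TN' id.

    gf_lines: list of (feature, value) pairs in file order (assumed input shape).
    A tree that has no TN line is stored under the key None.
    """
    trees = {}
    current_id = None
    for feature, value in gf_lines:
        if feature == "TN":
            current_id = value.strip()
            if current_id in trees:
                # Tree identifiers must be unique within a file.
                raise ValueError(f"duplicate tree identifier: {current_id}")
            trees[current_id] = ""
        elif feature == "NH":
            # A tree may be spread over several NH lines; join them in order.
            trees[current_id] = trees.get(current_id, "") + value.strip()
    return trees
```

Other #=GF features are skipped, so the full list of per-file annotations can be passed in unchanged.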

#=GS

Rfam and Pfam may use these features:


Feature                Description
---------------------  -----------
AC <accession>         ACcession number
DE <freetext>          DEscription
DR <db>; <accession>;  Database Reference
OS <organism>          Organism
OC <clade>             Organism Classification
LO <look>              Look

#=GR


Feature  Description            Markup letters
-------  -----------            --------------
SS       Secondary Structure    For RNA [.,;<>(){}[]AaBb.-_] --supports pseudoknot and further structure markup
                                For protein [HGIEBTSCX]
SA       Surface Accessibility  [0-9X]
                                (0=0%-10%; ...; 9=90%-100%)
TM       TransMembrane          [Mio]
PP       Posterior Probability  [0-9*]
                                (0=0.00-0.05; 1=0.05-0.15; *=0.95-1.00)
LI       LIgand binding         [*]
AS       Active Site            [*]
pAS      AS - Pfam predicted    [*]
sAS      AS - from SwissProt    [*]
IN       INtron                 [0-2]

For RNA tertiary interactions:
------------------------------
tWW      WC/WC in trans         For basepairs: [<>AaBb...Zz]  For unpaired: [.]
cWH      WC/Hoogsteen in cis
cWS      WC/SugarEdge in cis
tWS      WC/SugarEdge in trans

notes: (1) {c,t}{W,H,S}{W,H,S} for general format.
       (2) cWW is equivalent to SS.

#=GC

The list of valid features includes those shown below, as well as the same features as for #=GR with "_cons" appended, meaning "consensus". Example: "SS_cons".


Feature  Description           Notes
-------  -----------           -----
RF       ReFerence annotation  Often the consensus RNA or protein sequence is used as a reference.
                               Any non-gap character can indicate consensus/conserved/match columns.
                               .'s or -'s indicate insert columns.
                               ~'s indicate unaligned insertions.
                               Upper and lower case can be used to discriminate strongly and weakly
                               conserved residues respectively.
MM       Model Mask            Indicates which columns in an alignment should be masked, such
                               that the emission probabilities for match states corresponding to
                               those columns will be the background distribution.
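
The RF convention described above can be illustrated with a short Python sketch. The function names match_columns and strip_inserts are hypothetical. The sketch assumes the only insert markers are ".", "-" and "~", as listed in the table, and that every other character marks a match column.

```python
# Characters that mark insert columns in an RF line (from the table above).
INSERT_CHARS = {".", "-", "~"}


def match_columns(rf):
    """Return the 0-based indices of consensus (match) columns in an RF line."""
    return [i for i, ch in enumerate(rf) if ch not in INSERT_CHARS]


def strip_inserts(aligned, rf):
    """Keep only the residues of an aligned sequence that fall in match columns."""
    if len(aligned) != len(rf):
        # Per-column markup must have exactly one character per column.
        raise ValueError("RF line and sequence differ in length")
    return "".join(aligned[i] for i in match_columns(rf))
```

Because #=GC markup carries exactly one character per alignment column, the same column indices apply to every sequence and to every #=GR line in the alignment.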

Recommended placements

#=GF Above the alignment
#=GC Below the alignment
#=GS Above the alignment or just below the corresponding sequence
#=GR Just below the corresponding sequence
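
A writer that follows these placements might look like the Python sketch below. The function name write_stockholm and its argument shapes are hypothetical and chosen only for illustration. Fields are separated by a single space, which is valid because any amount of whitespace may separate them, and the sketch places all #=GS lines above the alignment rather than below each sequence.

```python
def write_stockholm(seqs, gf=(), gs=(), gr=(), gc=()):
    """Serialise one alignment using the recommended markup placement.

    Assumed (hypothetical) argument shapes:
      seqs: list of (seqname, aligned_sequence)
      gf:   list of (feature, text)
      gs:   list of (seqname, feature, text)
      gr:   list of (seqname, feature, markup)
      gc:   list of (feature, markup)
    """
    out = ["# STOCKHOLM 1.0"]
    # #=GF and #=GS go above the alignment.
    out += [f"#=GF {feat} {val}" for feat, val in gf]
    out += [f"#=GS {name} {feat} {val}" for name, feat, val in gs]
    for name, residues in seqs:
        out.append(f"{name} {residues}")
        # #=GR goes just below the corresponding sequence.
        out += [f"#=GR {n} {feat} {markup}" for n, feat, markup in gr if n == name]
    # #=GC goes below the alignment.
    out += [f"#=GC {feat} {markup}" for feat, markup in gc]
    out.append("//")
    return "\n".join(out) + "\n"
```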

Size limits

There are no explicit size limits on any field. However, a simple parser that uses fixed field sizes should work safely on Pfam and Rfam alignments with these limits:

Line length: 10000.
<seqname>: 255.
<feature>: 255.
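
A parser that relies on these limits may want to check a file against them first. The Python sketch below is one hedged way to do so: the function name check_limits and the constant names are hypothetical, and it assumes the limits are inclusive, so that a line of exactly 10000 characters passes.

```python
MAX_LINE = 10000    # maximum line length
MAX_NAME = 255      # maximum sequence name length
MAX_FEATURE = 255   # maximum feature name length


def check_limits(text):
    """Return a list of (line_number, message) for lines exceeding the limits."""
    problems = []
    for n, line in enumerate(text.splitlines(), start=1):
        if len(line) > MAX_LINE:
            problems.append((n, "line longer than 10000 characters"))
        fields = line.split()
        if not fields or line == "//":
            continue
        if fields[0] in ("#=GS", "#=GR"):
            # Layout: tag, sequence name, feature, value.
            seqname, feature = fields[1:2], fields[2:3]
        elif fields[0] in ("#=GF", "#=GC"):
            # Layout: tag, feature, value.
            seqname, feature = [], fields[1:2]
        elif fields[0].startswith("#"):
            continue  # header or other comment line
        else:
            # Sequence line: name, aligned sequence.
            seqname, feature = fields[0:1], []
        if seqname and len(seqname[0]) > MAX_NAME:
            problems.append((n, "sequence name longer than 255 characters"))
        if feature and len(feature[0]) > MAX_FEATURE:
            problems.append((n, "feature name longer than 255 characters"))
    return problems
```

An empty result means that fixed-size buffers of these dimensions would be sufficient for the file.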

Examples

A simple example of an Rfam alignment with a pseudoknot in Stockholm format is shown below:


# STOCKHOLM 1.0
#=GF ID    UPSK
#=GF SE    Predicted; Infernal
#=GF SS    Published; PMID 9223489
#=GF RN    [1]
#=GF RM    9223489
#=GF RT    The role of the pseudoknot at the 3' end of turnip yellow mosaic
#=GF RT    virus RNA in minus-strand synthesis by the viral RNA-dependent RNA
#=GF RT    polymerase.
#=GF RA    Deiman BA, Kortlever RM, Pleij CW;
#=GF RL    J Virol 1997;71:5990-5996.
AF035635.1/619-641      UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104         UGAGUUCUCUAUCUCUAAAAUCG
J04373.1/6212-6234      UAAGUUCUCGAUCUUUAAAAUCG
M24803.1/1-23           UAAGUUCUCGAUCUCUAAAAUCG
#=GC SS_cons            .AAA....<<<<aaa....>>>>
//

Here is a slightly more complex example showing the Pfam CBS domain:


# STOCKHOLM 1.0
#=GF ID CBS
#=GF AC PF00571
#=GF DE CBS domain
#=GF AU Bateman A
#=GF CC CBS domains are small intracellular modules mostly found
#=GF CC in 2 or four copies within a protein.
#=GF SQ 5
#=GS O31698/18-71 AC O31698
#=GS O83071/192-246 AC O83071
#=GS O83071/259-312 AC O83071
#=GS O31698/88-139 AC O31698
#=GS O31698/88-139 OS Bacillus subtilis
O83071/192-246          MTCRAQLIAVPRASSLAEAIACAQKMRVSRVPVYERS
#=GR O83071/192-246 SA  9998877564535242525515252536463774777
O83071/259-312          MQHVSAPVFVFECTRLAYVQHKLRAHSRAVAIVLDEY
#=GR O83071/259-312 SS  CCCCCHHHHHHHHHHHHHEEEEEEEEEEEEEEEEEEE
O31698/18-71            MIEADKVAHVQVGNNLEHALLVLTKTGYTAIPVLDPS
#=GR O31698/18-71 SS    CCCHHHHHHHHHHHHHHHEEEEEEEEEEEEEEEEHHH
O31698/88-139           EVMLTDIPRLHINDPIMKGFGMVINN..GFVCVENDE
#=GR O31698/88-139 SS   CCCCCCCHHHHHHHHHHHHEEEEEEEEEEEEEEEEEH
#=GC SS_cons            CCCCCHHHHHHHHHHHHHEEEEEEEEEEEEEEEEEEH
O31699/88-139           EVMLTDIPRLHINDPIMKGFGMVINN..GFVCVENDE
#=GR O31699/88-139 AS   ________________*____________________
#=GR O31699/88-139 IN   ____________1____________2______0____
//
