Kozak consensus sequence


The Kozak consensus sequence is a nucleic acid motif that functions as the protein translation initiation site in most eukaryotic mRNA transcripts. Regarded as the optimum sequence for initiating translation in eukaryotes, it is an integral aspect of protein regulation and overall cellular health and has implications in human disease. It ensures that a protein is correctly translated from the genetic message by mediating ribosome assembly and translation initiation; initiation at a wrong start site can result in non-functional proteins. As the sequence has been studied further, extensions of the nucleotide sequence, bases of particular importance, and notable exceptions have been described. The sequence is named after the scientist who discovered it, Marilyn Kozak, who identified it through a detailed analysis of genomic DNA sequences.
The Kozak sequence is not to be confused with the ribosomal binding site, which is either the 5′ cap of a messenger RNA or an internal ribosome entry site.

Sequence

The Kozak sequence was determined by sequencing 699 vertebrate mRNAs and verified by site-directed mutagenesis. While initially limited to a subset of vertebrates, subsequent studies confirmed its conservation in higher eukaryotes generally. The sequence was defined as 5′-(gcc)gccRccAUGG-3′, where:
  1. The 'AUG' is the translation start codon, coding for methionine.
  2. Upper-case letters indicate highly conserved bases, i.e. the 'AUGG' sequence is constant or rarely, if ever, changes.
  3. 'R' indicates that a purine (adenine or guanine) is always observed at this position.
  4. A lower-case letter denotes the most common base at a position where the base can nevertheless vary.
  5. The sequence in parentheses, (gcc), is of uncertain significance.
The AUG is the initiation codon, encoding a methionine at the N-terminus of the protein. Variation within the Kozak sequence alters its "strength", that is, the favorability of initiation, which affects how much protein is synthesized from a given mRNA. The A nucleotide of the "AUG" is designated +1 in mRNA sequences, and the preceding base is labeled −1; there is no position 0. For a 'strong' consensus, the nucleotides at positions +4 (G in the consensus) and −3 (R, a purine) relative to the +1 nucleotide must both match the consensus. An 'adequate' consensus matches at only one of these sites, while a 'weak' consensus matches at neither. The cc at −1 and −2 are not as conserved, but contribute to the overall strength. There is also evidence that a G in the −6 position is important in the initiation of translation.
While the +4 and −3 positions have the greatest relative importance in establishing a favorable initiation context, a CC or AA motif at −2 and −1 was found to be important in the initiation of translation in tobacco and maize plants. Protein synthesis in yeast was found to be highly affected by the composition of the Kozak sequence, with adenine enrichment resulting in higher levels of gene expression. A suboptimal Kozak sequence can allow the pre-initiation complex (PIC) to scan past the first AUG site and initiate at a downstream AUG codon.
Figure: A sequence logo showing the most conserved bases around the initiation codon from over 10 000 human mRNAs. Larger letters indicate a higher frequency of incorporation. Note the larger size of A and G at position 8 (Kozak position −3) and of G at position 14 (Kozak position +4).
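The strong, adequate and weak categories described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `kozak_strength` and the 0-based `aug_index` argument are assumptions made for the example, and only the −3 and +4 positions are checked, ignoring the weaker contributions of the other positions.

```python
def kozak_strength(mrna: str, aug_index: int) -> str:
    """Classify the context of the AUG that starts at aug_index (0-based).

    Hypothetical helper: only positions -3 and +4 are inspected.
    """
    seq = mrna.upper().replace("T", "U")  # accept DNA or RNA spelling
    if seq[aug_index:aug_index + 3] != "AUG":
        raise ValueError("no AUG at the given index")
    # The A of AUG is +1 and the base before it is -1 (no position 0),
    # so position -3 lies three bases upstream of the A.
    minus3 = seq[aug_index - 3] if aug_index >= 3 else ""
    # Position +4 is the base immediately after the AUG.
    plus4 = seq[aug_index + 3] if aug_index + 3 < len(seq) else ""
    matches = (minus3 in ("A", "G")) + (plus4 == "G")
    return ("weak", "adequate", "strong")[matches]
```

For example, `kozak_strength("GCCACCAUGG", 6)` classifies the full consensus as strong, while replacing the purine at −3 with U leaves only the +4 match and gives an adequate context.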

Ribosome assembly

The ribosome assembles on the start codon, located within the Kozak sequence. Prior to translation initiation, the mRNA is scanned by the pre-initiation complex (PIC). The PIC consists of the 40S small ribosomal subunit bound to the ternary complex, eIF2-GTP-initiator Met-tRNA, to form the 43S complex. Assisted by several other initiation factors, it is recruited to the 5′ end of the mRNA. Eukaryotic mRNA is capped with a 7-methylguanosine (m7G) nucleotide, which helps recruit the PIC to the mRNA and initiate scanning. This recruitment to the m7G 5′ cap is supported by the inability of eukaryotic ribosomes to translate circular mRNA, which has no 5′ end. Once the PIC binds to the mRNA, it scans until it reaches the first AUG codon in a Kozak sequence. This process is referred to as the scanning mechanism of initiation.
The scanning mechanism of initiation starts when the PIC binds the 5′ end of the mRNA. Scanning is stimulated by Dhx29, Ddx3/Ded1 and eIF4 proteins. Dhx29 and Ddx3/Ded1 are DExH-box and DEAD-box RNA helicases, respectively, that help to unwind any mRNA secondary structure which could hinder scanning. Scanning continues until the first AUG codon on the mRNA is reached; this is known as the "First AUG Rule". While exceptions to the "First AUG Rule" exist, most take place at a second AUG codon that is located 3 to 5 nucleotides downstream from the first AUG, or within 10 nucleotides from the 5′ end of the mRNA. At the AUG codon, the anticodon of the initiator methionine tRNA recognizes the mRNA codon. Upon base pairing with the start codon, the eIF5 in the PIC helps to hydrolyze the guanosine triphosphate (GTP) bound to eIF2. This leads to a structural rearrangement that commits the PIC to binding the large (60S) ribosomal subunit and forming the 80S ribosome complex. Once the 80S ribosome complex is formed, the elongation phase of translation starts.
The start codon closest to the 5′ end of the transcript is not always recognized if it is not contained in a Kozak-like sequence, and some genes have only a weak Kozak consensus sequence. For initiation of translation from such a site, other features are required in the mRNA sequence in order for the ribosome to recognize the initiation codon. When the scanning complex instead bypasses such an AUG and initiates at a codon further downstream, the exception to the first AUG rule is called leaky scanning, and it could be a potential way to control translation at the level of initiation.
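As a rough illustration of leaky scanning, the sketch below walks along an mRNA from the 5′ end and returns the first AUG that the scanning complex would use. It is a deliberately simplified, deterministic toy model: the function name `first_used_start`, the `skip_weak` flag and the rule "an AUG is bypassed only when neither −3 nor +4 matches the consensus" are assumptions for the example, whereas in a cell the bypass is partial rather than all-or-nothing.

```python
from typing import Optional


def first_used_start(mrna: str, skip_weak: bool = True) -> Optional[int]:
    """Return the 0-based index of the first AUG used for initiation.

    Toy model (assumption): with skip_weak=True, an AUG whose -3 and +4
    positions both miss the consensus is always scanned past.
    """
    seq = mrna.upper().replace("T", "U")
    pos = seq.find("AUG")
    while pos != -1:
        minus3_ok = pos >= 3 and seq[pos - 3] in "AG"            # purine at -3
        plus4_ok = pos + 3 < len(seq) and seq[pos + 3] == "G"    # G at +4
        if not skip_weak or minus3_ok or plus4_ok:
            return pos
        # Weak context: keep scanning towards the 3' end.
        pos = seq.find("AUG", pos + 1)
    return None
```

With `skip_weak=False` the function reduces to the plain first AUG rule, which makes it easy to compare the two behaviours on the same transcript.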
It is believed that the PIC is stalled at the Kozak sequence by interactions between eIF2 and the −3 and +4 nucleotides of the Kozak sequence. This stalling gives the start codon and the corresponding anticodon time to form the correct hydrogen bonds. The Kozak consensus sequence is so common that the similarity of the sequence around an AUG codon to the Kozak sequence is used as a criterion for finding start codons in eukaryotes.
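The idea of using similarity to the consensus as a criterion for finding start codons can be sketched as a simple match count. The scoring scheme here is an assumption made for illustration (every position of gccRccAUGG counts equally, and the names `consensus_score` and `rank_start_codons` are hypothetical); practical gene finders typically weight positions by their observed frequencies instead.

```python
CONSENSUS = "GCCRCCAUGG"  # gccRccAUGG, upper-cased for comparison
AUG_OFFSET = 6            # index of the A of AUG within CONSENSUS


def consensus_score(mrna: str, aug_index: int) -> int:
    """Count positions around the AUG at aug_index that match the consensus."""
    seq = mrna.upper().replace("T", "U")
    score = 0
    for i, expected in enumerate(CONSENSUS):
        j = aug_index - AUG_OFFSET + i
        if 0 <= j < len(seq):
            base = seq[j]
            # R stands for a purine (A or G); other letters must match exactly.
            if base == expected or (expected == "R" and base in "AG"):
                score += 1
    return score


def rank_start_codons(mrna: str) -> list:
    """Return the indices of all AUGs, best-matching context first."""
    seq = mrna.upper().replace("T", "U")
    hits = [i for i in range(len(seq) - 2) if seq[i:i + 3] == "AUG"]
    return sorted(hits, key=lambda i: consensus_score(seq, i), reverse=True)
```

An AUG embedded in the complete consensus reaches the maximum score of 10, while an AUG in a poor context is ranked lower even if it lies closer to the 5′ end.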

Differences between eukaryotic and prokaryotic initiation

The scanning mechanism of initiation, which utilizes the Kozak sequence, is found only in eukaryotic organisms and differs significantly from the way bacteria and archaea initiate translation. The biggest difference is the existence of the Shine-Dalgarno (SD) sequence in prokaryotic mRNA. The SD sequence is located near the start codon, in contrast to the Kozak sequence, which actually contains the start codon. The Shine-Dalgarno sequence base-pairs with the 16S rRNA of the small ribosomal subunit, allowing the subunit to bind at the AUG start codon immediately, with no need for scanning along the mRNA. The eukaryotic scanning mechanism, by contrast, results in a more rigorous selection process for the AUG codon than in prokaryotes. An example of prokaryotic start codon promiscuity can be seen in the use of the alternate start codons UUG and GUG for some genes.

Mutations and disease

Marilyn Kozak demonstrated, through systematic study of point mutations, that any mutation of a strong consensus sequence at the −3 position or at the +4 position resulted in highly impaired translation initiation both in vitro and in vivo.
Research has shown that a G→C mutation at the −6 position of the β-globin gene disrupted its haematological and biosynthetic phenotype. This was the first mutation found in a Kozak sequence and showed a 30% decrease in translational efficiency. It was found in a family from Southeast Italy who suffered from thalassaemia intermedia.
Similar observations were made regarding mutations at position −5 from the start codon, AUG. Cytosine at this position, as opposed to thymine, resulted in more efficient translation and increased expression of the platelet adhesion receptor glycoprotein Ibα in humans.
Mutations to the Kozak sequence can also have drastic effects upon human health, in particular heart disease involving the GATA4 gene. The GATA4 gene regulates gene expression in a wide variety of tissues, including the heart. When the guanosine at the −6 position in the Kozak sequence of GATA4 is mutated to a cytosine, GATA4 protein levels are reduced, which leads to an atrial septal defect in the heart.
The ability of the Kozak sequence to start translation can result in novel initiation codons in the typically untranslated region at the 5′ end of the mRNA transcript. When a G to A mutation was observed in this region, it created an out-of-frame initiation codon and thus a mutated protein. This mutated protein results in campomelic dysplasia, a developmental disorder that results in skeletal malformations.
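Whether a novel upstream AUG is in frame with the normal start codon is a matter of simple modular arithmetic, as the sketch below shows. The helper name `upstream_augs` is hypothetical, and the sequences in the usage note are invented for illustration; they are not taken from any real gene.

```python
def upstream_augs(mrna: str, main_start: int) -> list:
    """List (index, in_frame) for every AUG upstream of the main start codon.

    main_start is the 0-based index of the A of the annotated AUG.
    An upstream AUG is in frame when its distance to the main start
    is a multiple of three.
    """
    seq = mrna.upper().replace("T", "U")
    found = []
    pos = seq.find("AUG")
    while pos != -1 and pos < main_start:
        in_frame = (main_start - pos) % 3 == 0
        found.append((pos, in_frame))
        pos = seq.find("AUG", pos + 1)
    return found
```

For the made-up transcript `CCGUGCCCCACCAUGGCU` (main start at index 12) the function finds no upstream AUG, whereas the single G to A change giving `CCAUGCCCCACCAUGGCU` creates one at index 2 that is out of frame with the main start codon.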

Variations in the consensus sequence

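The Kozak consensus has been variously described as follows (the first row gives the position of each base relative to the start codon: the digits 6 to 2 stand for −6 to −2, '-' for −1, '+' for +1, and the following digits for +2 to +4; the sequences are aligned on the AUG):
  65432-+234
  gccRccAUGG
    AGNNAUGN
     ANNAUGG
     ACCAUGG
  GACACCAUGG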
Biota               Phylum        Consensus sequences
Vertebrate                        gccRccATGG
Fruit fly           Arthropoda    atMAAMATGamc
Budding yeast       Ascomycota    aAaAaAATGTCt
Slime mold          Amoebozoa     aaaAAAATGRna
Ciliate             Ciliophora    nTaAAAATGRct
Malarial protozoa   Apicomplexa   taaAAAATGAan
Toxoplasma          Apicomplexa   gncAaaATGg
Trypanosomatidae    Euglenozoa    nnnAnnATGnC
Terrestrial plants                acAACAATGGC
Microalga           Chlorophyta   gccaagATGgcg
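The consensus strings listed above use IUPAC ambiguity codes (R for a purine, M for A or C, N for any base). A minimal sketch of how such a string could be turned into a regular expression for searching sequences is given below. The helper names `consensus_to_regex` and `matches` and the `conserved_only` option, which treats lower-case (variable) positions as wildcards, are assumptions for the example rather than an established tool.

```python
import re

# IUPAC nucleotide codes mapped to DNA character classes (U is treated as T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "N": "[ACGT]",
}


def consensus_to_regex(consensus: str, conserved_only: bool = False):
    """Compile a consensus such as 'gccRccAUGG' into a regular expression.

    With conserved_only=True, lower-case positions match any base.
    """
    parts = []
    for ch in consensus:
        if conserved_only and ch.islower():
            parts.append("[ACGT]")
        else:
            parts.append(IUPAC[ch.upper()])
    return re.compile("".join(parts))


def matches(consensus: str, seq: str, conserved_only: bool = False) -> bool:
    """True if seq (DNA or RNA spelling) contains the consensus anywhere."""
    dna = seq.upper().replace("U", "T")
    return consensus_to_regex(consensus, conserved_only).search(dna) is not None
```

For instance, `matches("gccRccAUGG", "AAAGAAAUGG", conserved_only=True)` accepts a sequence that shares only the highly conserved purine at −3 and the AUGG core with the vertebrate consensus.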