Gene nomenclature
Gene nomenclature is the scientific naming of genes, the units of heredity in living organisms. An international committee published recommendations for genetic symbols and nomenclature in 1957. The need to develop formal guidelines for human gene names and symbols was recognized in the 1960s and full guidelines were issued in 1979. Several other genus-specific research communities have adopted nomenclature standards, as well, and have published them on the relevant model organism websites and in scientific journals, including the Trends in Genetics Genetic Nomenclature Guide. Scientists familiar with a particular gene family may work together to revise the nomenclature for the entire set of genes when new information becomes available. For many genes and their corresponding proteins, an assortment of alternate names is in use across the scientific literature and public biological databases, posing a challenge to effective organization and exchange of biological information. Standardization of nomenclature thus tries to achieve the benefits of vocabulary control and bibliographic control, although adherence is voluntary. The advent of the information age has brought gene ontology, which in some ways is a next step of gene nomenclature, because it aims to unify the representation of gene and gene product attributes across all species.
Gene nomenclature and protein nomenclature are not separate endeavors; they are aspects of the same whole. Any name or symbol used for a protein can potentially also be used for the gene that encodes it, and vice versa. But owing to the nature of how science has developed, proteins and their corresponding genes have not always been discovered simultaneously, which is the largest reason why protein and gene names do not always match, or why scientists tend to favor one symbol or name for the protein and another for the gene. Another reason is that many of the mechanisms of life are the same or very similar across species, genera, orders, and phyla, so that a given protein may be produced in many kinds of organisms; and thus scientists naturally often use the same symbol and name for a given protein in one species as in another species. Regarding the first duality, the context usually makes the sense clear to scientific readers, and the nomenclatural systems also provide for some specificity by using italic for a symbol when the gene is meant and plain for when the protein is meant. Regarding the second duality, the nomenclatural systems also provide for at least human-versus-nonhuman specificity by using different capitalization, although scientists often ignore this distinction, given that it is often biologically irrelevant.
Also owing to the nature of how scientific knowledge has unfolded, proteins and their corresponding genes often have several names and symbols that are synonymous. Some of the earlier ones may be deprecated in favor of newer ones, although such deprecation is voluntary. Some older names and symbols live on simply because they have been widely used in the scientific literature and are well established among users. For example, mentions of HER2 and ERBB2 are synonymous.
Lastly, the correlation between genes and proteins is not always one-to-one; in some cases it is several-to-one or one-to-several, and the names and symbols may then be gene-specific or protein-specific to some degree, or overlapping in usage:
- Some proteins and protein complexes are built from the products of several genes, which means that the protein or complex will not have the same name or symbol as any one gene. For example, a particular protein called "example" may have 2 chains, which are encoded by 2 genes named "example alpha chain" and "example beta chain".
- Some genes encode multiple proteins, because post-translational modification and alternative splicing provide several paths for expression. For example, glucagon and similar polypeptides all come from proglucagon, which comes from preproglucagon, which is the polypeptide that the GCG gene encodes. When one speaks of the various polypeptide products, the names and symbols refer to different things, but when one speaks of the gene, all of those names and symbols are aliases for the same gene. Another example is that the various μ-opioid receptor proteins are all splice variants encoded by one gene, OPRM1; this is how one can speak of MORs in the plural even though there is only one MOR gene, which may be called OPRM1, MOR1, or MOR—all of those aliases validly refer to it, although one of them is preferred nomenclature.
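The many-aliases-to-one-gene situation described above can be sketched as a simple lookup table. This is only an illustration: the table is hand-written rather than drawn from any database, the function name `preferred_symbol` is hypothetical, and the choice of ERBB2 and OPRM1 as the preferred symbols is an assumption based on their status as the HGNC-approved symbols.

```python
from typing import Dict

# Illustrative alias table (hand-written, not exhaustive).
# Keys are aliases; values are the symbols assumed to be preferred.
ALIASES: Dict[str, str] = {
    "HER2": "ERBB2",
    "MOR": "OPRM1",
    "MOR1": "OPRM1",
}


def preferred_symbol(symbol: str) -> str:
    """Return the preferred symbol for an alias, or the input unchanged."""
    return ALIASES.get(symbol, symbol)


for mention in ("HER2", "ERBB2", "MOR1", "MOR"):
    print(mention, "->", preferred_symbol(mention))
```

Normalizing every mention to one symbol before indexing or searching is one way to obtain the vocabulary control that nomenclature standards aim for.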
Species-specific guidelines
Bacterial genetic nomenclature
There are generally accepted rules and conventions used for naming genes in bacteria. Standards were proposed in 1966 by Demerec et al.
General rules
Each bacterial gene is denoted by a mnemonic of three lowercase letters that indicate the pathway or process in which the gene product is involved, followed by a capital letter signifying the actual gene. In some cases, the gene letter may be followed by an allele number. All letters and numbers are underlined or italicised. For example, leuA is one of the genes of the leucine biosynthetic pathway, and leuA273 is a particular allele of this gene.
Where the protein encoded by the gene is known, it may form the basis of the mnemonic:
- rpoA encodes the α-subunit of RNA polymerase
- rpoB encodes the β-subunit of RNA polymerase
- polA encodes DNA polymerase I
- polC encodes DNA polymerase III
- rpsL encodes ribosomal protein, small S12
- dna is involved in DNA replication
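The general rule (three lowercase letters, one capital gene letter, optional allele number) can be expressed as a regular expression. The following is a minimal sketch, not an official validator; the names `BacterialGene` and `parse_bacterial_gene` are hypothetical, and the pattern ignores italics, superscripts, and any symbols that do not follow the basic rule.

```python
import re
from typing import NamedTuple, Optional


class BacterialGene(NamedTuple):
    mnemonic: str           # three lowercase letters, e.g. "leu"
    gene_letter: str        # one capital letter, e.g. "A"
    allele: Optional[int]   # allele number if present, e.g. 273


# mnemonic (3 lowercase) + gene letter (1 uppercase) + optional allele digits
_SYMBOL = re.compile(r"([a-z]{3})([A-Z])(\d+)?")


def parse_bacterial_gene(symbol: str) -> BacterialGene:
    """Split a symbol such as 'leuA273' into its three parts."""
    match = _SYMBOL.fullmatch(symbol)
    if match is None:
        raise ValueError(f"not a basic bacterial gene symbol: {symbol!r}")
    mnemonic, letter, allele = match.groups()
    return BacterialGene(mnemonic, letter, int(allele) if allele else None)


print(parse_bacterial_gene("leuA273"))
print(parse_bacterial_gene("rpoB"))
```

A bare mnemonic such as dna has no gene letter, so the sketch rejects it with a `ValueError`.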
Common mnemonics
Biosynthetic genes
Loss of gene activity leads to a nutritional requirement (auxotrophy) not exhibited by the wild type.
Amino acids:
- ala = alanine
- arg = arginine
- asn = asparagine
- ilv = isoleucine and valine
Nucleobases and nucleotides:
- gua = guanine
- pur = purines
- pyr = pyrimidine
- thy = thymine
Vitamins and cofactors:
- bio = biotin
- nad = NAD
- pan = pantothenic acid
Catabolic genes
- ara = arabinose
- gal = galactose
- lac = lactose
- mal = maltose
- man = mannose
- mel = melibiose
- rha = rhamnose
- xyl = xylose
Drug and bacteriophage resistance genes
- amp = ampicillin resistance
- azi = azide resistance
- bla = beta-lactam resistance
- cat = chloramphenicol resistance
- kan = kanamycin resistance
- rif = rifampicin resistance
- tonA = phage T1 resistance
Nonsense suppressor mutations
- sup = suppressor
Mutant nomenclature
- leuA+ = wild-type allele (superscript plus sign)
- leuA− = mutant allele (superscript minus sign)
There are additional superscripts and subscripts which provide more information about the mutation:
- ts = temperature sensitive
- cs = cold sensitive
- am = amber mutation
- um = umber mutation
- oc = ochre mutation
- R = resistant
- Δ = deletion
- - = fusion
- : = fusion
- :: = insertion
- Ω = a genetic construct introduced by a two-point crossover
- Δdeleted gene::replacing gene = deletion with replacement
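The Δ and :: markers in the list above combine in a regular way, which a short function can make explicit. This is a sketch under simplifying assumptions: it handles only a leading Δ and a single ::, it ignores superscripts and the other markers, and the function name `describe_genotype` and the example strings (lacZ, kan, Tn10) are chosen for illustration only.

```python
def describe_genotype(term: str) -> str:
    """Describe a genotype term that uses the Δ and :: markers.

    Only a leading 'Δ' (deletion) and one '::' (insertion or, together
    with Δ, replacement) are interpreted; everything else is passed through.
    """
    deleted = term.startswith("Δ")
    body = term[1:] if deleted else term

    if "::" in body:
        target, inserted = body.split("::", 1)
        if deleted:
            return f"deletion of {target} with replacement by {inserted}"
        return f"insertion of {inserted} into {target}"
    if deleted:
        return f"deletion of {body}"
    return f"{body} (no deletion or insertion marker)"


for example in ("ΔlacZ", "lacZ::Tn10", "ΔlacZ::kan", "leuA"):
    print(example, "=", describe_genotype(example))
```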
Phenotype nomenclature
Bacterial protein name nomenclature
Protein names are the same as the gene names, but the protein names are not italicised, and the first letter is uppercase. For example, the β-subunit of RNA polymerase is named RpoB and is encoded by the rpoB gene.
Vertebrate gene and protein symbol conventions
The research communities of vertebrate model organisms have adopted guidelines whereby genes in these species are given, whenever possible, the same names as their human orthologs. The use of prefixes on gene symbols to indicate species is discouraged. The recommended formatting of printed gene and protein symbols varies between species.
Symbol and name
Vertebrate genes and proteins have both names and symbols; the symbols are short identifiers. For example, the gene cytotoxic T-lymphocyte-associated protein 4 has the HGNC symbol CTLA4. These symbols are usually, but not always, coined by contraction or acronymic abbreviation of the name. They are pseudo-acronyms, however, in the sense that they are complete identifiers by themselves: short names, essentially. They are synonymous with the gene/protein name, regardless of whether the initial letters "match". For example, the symbol for the gene v-akt murine thymoma viral oncogene homolog 1, which is AKT1, cannot be said to be an acronym for the name, and neither can any of its various synonyms, which include AKT, PKB, PRKBA, and RAC. Thus, the relationship of a gene symbol to the gene name is functionally the relationship of a nickname to a formal name; it is not the relationship of an acronym to its expansion. In this sense they are similar to the symbols for units of measurement in the SI system, in that they can be viewed as true logograms rather than just abbreviations. Sometimes the distinction is academic, but not always. Although it is not wrong to say that "VEGFA" is an acronym standing for "vascular endothelial growth factor A", just as it is not wrong that "km" is an abbreviation for "kilometre", there is more to the formality of symbols than those statements capture.
The root portion of the symbols for a gene family is called a root symbol.
Human
The HUGO Gene Nomenclature Committee (HGNC) is responsible for providing human gene naming guidelines and approving new, unique human gene names and symbols. All human gene names and symbols can be searched online at the HGNC website, and the guidelines for their formation are available there. The guidelines for humans fit logically into the larger scope of vertebrates in general, and the HGNC's remit has recently expanded to assigning symbols to all vertebrate species without an existing nomenclature committee, to ensure that vertebrate genes are named in line with their human orthologs/paralogs. Human gene symbols generally are italicised, with all letters in uppercase. Italics are not necessary in gene catalogs. Protein designations are the same as the gene symbol except that they are not italicised. Like the gene symbol, they are in all caps because they designate human proteins. mRNAs and cDNAs use the same formatting conventions as the gene symbol. For naming families of genes, the HGNC recommends using a "root symbol" as the root for the various gene symbols. For example, for the peroxiredoxin family, PRDX is the root symbol, and the family members are PRDX1, PRDX2, PRDX3, PRDX4, PRDX5, and PRDX6.
Mouse and rat
Gene symbols generally are italicised, with only the first letter in uppercase and the remaining letters in lowercase. Italics are not required on web pages. Protein designations are the same as the gene symbol, but are not italicised, and all letters are in uppercase.
Chicken (Gallus sp.)
Nomenclature generally follows the conventions of human nomenclature. Gene symbols generally are italicised, with all letters in uppercase. Protein designations are the same as the gene symbol, but are not italicised; all letters are in uppercase. mRNAs and cDNAs use the same formatting conventions as the gene symbol.
Anole lizard (Anolis sp.)
Gene symbols are italicised and all letters are in lowercase. Protein designations are different from their gene symbol; they are not italicised, and all letters are in uppercase.
Frog (Xenopus sp.)
Gene symbols are italicised and all letters are in lowercase. Protein designations are the same as the gene symbol, but are not italicised; the first letter is in uppercase and the remaining letters are in lowercase.
Zebrafish
Gene symbols are italicised, with all letters in lowercase. Protein designations are the same as the gene symbol, but are not italicised; the first letter is in uppercase and the remaining letters are in lowercase.
Gene and protein symbol and description in copyediting
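For copyediting purposes, the species-specific capitalisation and italicisation conventions listed above for vertebrates can be collected into a small lookup table. The sketch below is illustrative only: the species keys, the `CONVENTIONS` table, and the `format_symbol` function are hypothetical names, italics are reported as a flag because plain text cannot show them, and real symbols may have exceptions that a simple case conversion does not capture.

```python
from typing import Callable, Dict, Tuple

CaseRule = Callable[[str], str]

# species -> (case rule for gene symbols, case rule for protein symbols)
CONVENTIONS: Dict[str, Tuple[CaseRule, CaseRule]] = {
    "human": (str.upper, str.upper),
    "mouse": (str.capitalize, str.upper),
    "rat": (str.capitalize, str.upper),
    "chicken": (str.upper, str.upper),
    "anole": (str.lower, str.upper),
    "xenopus": (str.lower, str.capitalize),
    "zebrafish": (str.lower, str.capitalize),
}


def format_symbol(symbol: str, species: str, kind: str = "gene") -> Tuple[str, bool]:
    """Return (formatted symbol, italicise?) for a gene or protein symbol."""
    gene_rule, protein_rule = CONVENTIONS[species]
    if kind == "gene":
        return gene_rule(symbol), True       # gene symbols are italicised
    if kind == "protein":
        return protein_rule(symbol), False   # protein symbols are not
    raise ValueError(f"kind must be 'gene' or 'protein', got {kind!r}")


for species in CONVENTIONS:
    print(species, format_symbol("CTLA4", species), format_symbol("CTLA4", species, "protein"))
```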
"Expansion" (glossing)
A nearly universal rule in copyediting of articles for medical journals and other health science publications is that abbreviations and acronyms must be expanded at first use, to provide a glossing type of explanation. Typically no exceptions are permitted except for small lists of especially well known terms. Although readers with high subject-matter expertise do not need most of these expansions, those with intermediate or low expertise are appropriately served by them.
One complication that gene and protein symbols bring to this general rule is that they are not, accurately speaking, abbreviations or acronyms, despite the fact that many were originally coined via abbreviating or acronymic etymology. They are pseudoacronyms because they do not "stand for" any expansion. Rather, the relationship of a gene symbol to the gene name is functionally the relationship of a nickname to a formal name; it is not the relationship of an acronym to its expansion. In fact, many official gene symbol–gene name pairs do not even share their initial-letter sequences. Nevertheless, gene and protein symbols "look just like" abbreviations and acronyms, which presents the problem that "failing" to "expand" them creates the appearance of violating the spell-out-all-acronyms rule.
One common way of reconciling these two opposing forces is simply to exempt all gene and protein symbols from the glossing rule. This is certainly fast and easy to do, and in highly specialized journals, it is also justified because the entire target readership has high subject-matter expertise. But for journals with broader and more general target readerships, this action leaves the readers without any explanatory annotation and can leave them wondering what the apparent abbreviation stands for and why it was not explained. Therefore, a good alternative solution is simply to put either the official gene name or a suitable short description in parentheses after the first use of the official gene/protein symbol. This meets both the formal requirement and the functional requirement. The same guideline applies to shorthand names for sequence variations; AMA says, "In general medical publications, textual explanations should accompany the shorthand terms at first mention." Thus "188del11" is glossed as "an 11-bp deletion at nucleotide 188." This corollary rule often also follows the "abbreviation-leading" style of expansion that has become more prevalent in recent years. Traditionally, the abbreviation always followed the fully expanded form in parentheses at first use. This is still the general rule. But for certain classes of abbreviations or acronyms, this pattern may be reversed, because the short form is more widely used and the expansion is merely parenthetical to the discussion at hand. The same is true of gene/protein symbols.
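The symbol-first glossing style described above (symbol at first mention, followed by the name or a short description in parentheses) is mechanical enough to sketch in code. The function below is a hypothetical helper, not part of any style guide or editing tool; it assumes the caller supplies the symbol-to-gloss mapping and does no checking of whether a gloss already follows the symbol.

```python
import re
from typing import Dict


def gloss_first_mention(text: str, glosses: Dict[str, str]) -> str:
    """Append '(gloss)' after the first whole-word mention of each symbol.

    Later mentions are left untouched. Symbols are processed in mapping
    order, so a symbol that also appears inside an earlier gloss could be
    matched there; this sketch does not guard against that.
    """
    for symbol, gloss in glosses.items():
        pattern = re.compile(r"\b" + re.escape(symbol) + r"\b")
        # A function replacement avoids backslash handling in the gloss text.
        text = pattern.sub(lambda m: f"{m.group(0)} ({gloss})", text, count=1)
    return text


sentence = "CTLA4 is expressed on T cells. Blocking CTLA4 alters VEGFA levels."
names = {
    "CTLA4": "cytotoxic T-lymphocyte-associated protein 4",
    "VEGFA": "vascular endothelial growth factor A",
}
print(gloss_first_mention(sentence, names))
```

The example sentence is invented purely to exercise the function and makes no biological claim.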