Linguistic categories
Linguistic categories include:
- Lexical category, a part of speech such as noun, preposition, etc.
- Syntactic category, a similar concept which can also include phrasal categories
- Grammatical category, a grammatical feature such as tense, gender, etc.
Linguistic category inventories
To facilitate interoperability between lexical resources, linguistic annotations and annotation tools, and to handle linguistic categories systematically across different theoretical frameworks, a number of inventories of linguistic categories have been developed and are in use; examples are given below. The practical objective of such inventories is to support quantitative evaluation, the training of NLP tools, and cross-linguistic evaluation, querying or annotation of language data. At a theoretical level, the existence of universal categories in human language has been postulated, e.g., in Universal grammar, but also heavily criticized.
Part-of-Speech tagsets
Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, there are clearly many more categories and sub-categories. For nouns, the plural, possessive, and singular forms can be distinguished. In many languages, words are also marked for their "case", grammatical gender, and so on, while verbs are marked for tense, aspect, and other things. In some tagging systems, different inflections of the same root word get different parts of speech, resulting in a large number of tags: for example, NN for singular common nouns, NNS for plural common nouns, and NP for singular proper nouns. Other tagging systems use a smaller number of tags and either ignore fine differences or model them as features somewhat independent of part of speech.
In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English. POS tagging work has been done in a variety of languages, and the set of POS tags used varies greatly with language. Tags are usually designed to include overt morphological distinctions, although this leads to inconsistencies such as case-marking for pronouns but not nouns in English, and to much larger cross-language differences. The tag sets for heavily inflected languages such as Greek and Latin can be very large; tagging words in agglutinative languages such as Inuit languages may be virtually impossible. Work on stochastic methods for tagging Koine Greek has used over 1,000 parts of speech and found that about as many words were ambiguous in that language as in English. For morphologically rich languages, a morphosyntactic descriptor is commonly expressed using very short mnemonics, such as Ncmsan for Category = Noun, Type = common, Gender = masculine, Number = singular, Case = accusative, Animate = no.
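The positional reading of a descriptor such as Ncmsan can be illustrated with a minimal Python sketch. The position table, the code letters other than those in the example, the hyphen convention and the function name decode_msd are all assumptions made for illustration; actual tag sets define their own positions and values.

```python
# Hypothetical positional layout for noun descriptors, modeled on the
# "Ncmsan" example. Real tag sets define their own positions and codes.
NOUN_POSITIONS = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive",
              "d": "dative", "a": "accusative"}),
    ("Animate", {"n": "no", "y": "yes"}),
]

# First character selects the category and its position table.
CATEGORIES = {"N": ("Noun", NOUN_POSITIONS)}


def decode_msd(tag: str) -> dict[str, str]:
    """Expand a positional descriptor into attribute-value pairs."""
    if not tag or tag[0] not in CATEGORIES:
        raise ValueError(f"unsupported descriptor: {tag!r}")
    name, positions = CATEGORIES[tag[0]]
    features = {"Category": name}
    for (attribute, codes), code in zip(positions, tag[1:]):
        if code == "-":  # assumption: a hyphen marks an unspecified position
            continue
        features[attribute] = codes.get(code, f"unknown({code})")
    return features


print(decode_msd("Ncmsan"))
```

Because each position carries exactly one attribute, the same letter can mean different things in different slots (here, n is "neuter" under Gender, "nominative" under Case and "no" under Animate), which is why the decoder looks codes up per position rather than globally.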
The most popular "tag set" for POS tagging for American English is probably the Penn tag set, developed in the Penn Treebank project.
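The difference between fine-grained and coarse tag sets can be sketched as a simple lookup table. The fine-grained tags NN, NNS and NP are taken from the text above; the remaining tags, the coarse labels, the fallback label X and the function name coarsen are illustrative assumptions rather than any official mapping.

```python
# Fine-grained tags mapped to coarse categories. The mapping is an
# illustrative assumption, not an official conversion table.
FINE_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN", "NP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "IN": "ADP", "DT": "DET",
}


def coarsen(tagged, fallback="X"):
    """Replace each fine-grained tag with its coarse category.

    Tags missing from the table are mapped to `fallback`.
    """
    return [(word, FINE_TO_COARSE.get(tag, fallback)) for word, tag in tagged]


sentence = [("the", "DT"), ("dogs", "NNS"), ("barked", "VBD")]
print(coarsen(sentence))
```

The mapping is many-to-one: number and tense distinctions are lost after conversion, so a coarse tag cannot be turned back into the original fine-grained tag.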
Multilingual annotation schemes
For Western European languages, cross-linguistically applicable annotation schemes for parts of speech, morphosyntax and syntax have been developed with the Eagles Guidelines. The Eagles Guidelines have also inspired subsequent work on other regions, e.g., Eastern Europe.
Petrov et al. have proposed a "universal", but highly reductionist, tag set with 12 categories. Subsequently, this was complemented with cross-lingual specifications for dependency syntax and morphosyntax in the context of Universal Dependencies (UD), an international cooperative project to create treebanks of the world's languages with cross-linguistically applicable annotations for parts of speech, dependency syntax, and morphosyntactic features. Core applications are automated text processing in the field of natural language processing and research into natural language syntax and grammar, especially within linguistic typology. The annotation scheme has its roots in three related projects. The UD annotation scheme uses a representation in the form of dependency trees, as opposed to phrase structure trees. As of February 2019, there were just over 100 treebanks of more than 70 languages available in the UD inventory. The project's primary aim is to achieve cross-linguistic consistency of annotation. However, language-specific extensions are permitted for morphological features. In a more restricted form, dependency relations can be extended with a secondary label that accompanies the UD label, e.g., aux:pass for an auxiliary used to mark passive voice.
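A label such as aux:pass can be separated into its universal part and its secondary label with a few lines of Python. This is a minimal sketch: the names Relation, parse_relation and same_universal are hypothetical, and the only assumption about the label format is the colon separator shown in the example above.

```python
from typing import NamedTuple, Optional


class Relation(NamedTuple):
    universal: str          # cross-linguistically shared label, e.g. "aux"
    subtype: Optional[str]  # secondary label, e.g. "pass"; None if absent


def parse_relation(label: str) -> Relation:
    """Split a dependency label at the first colon."""
    universal, _, subtype = label.partition(":")
    return Relation(universal, subtype or None)


def same_universal(a: str, b: str) -> bool:
    """Compare two labels while ignoring any secondary label."""
    return parse_relation(a).universal == parse_relation(b).universal


print(parse_relation("aux:pass"))
print(same_universal("aux:pass", "aux"))
```

Comparing only the universal part is one way to keep evaluations or queries consistent across treebanks that differ in how many secondary labels they use.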
The Universal Dependencies have inspired similar efforts for the areas of inflectional morphology, frame semantics and coreference. For phrase structure syntax, a comparable effort does not seem to exist, but the specifications of the Penn Treebank have been applied to a broad range of languages, e.g., Icelandic, Old English, Middle English, Middle Low German, Early Modern High German, Yiddish, Portuguese, Japanese, Arabic and Chinese.
Conventions for interlinear glosses
In linguistics, an interlinear gloss is a gloss placed between lines, such as between a line of original text and its translation into another language. When glossed, each line of the original text acquires one or more lines of transcription known as an interlinear text or interlinear glossed text, or interlinear for short. Such glosses help the reader follow the relationship between the source text and its translation, and the structure of the original language. There is no standard inventory for glosses, but common labels are collected in the Leipzig Glossing Rules. Wikipedia also provides a List of glossing abbreviations that draws on this and other sources.
General Ontology for Linguistic Description (GOLD)
GOLD is an ontology for descriptive linguistics. It gives a formalized account of the most basic categories and relations used in the scientific description of human language, e.g., as a formalization of interlinear glosses. GOLD was first introduced by Farrar and Langendoen. Originally, it was envisioned as a solution to the problem of resolving disparate markup schemes for linguistic data, in particular data from endangered languages. However, GOLD is much more general and can be applied to all languages. In this function, GOLD overlaps with the ISO 12620 Data Category Registry; it is, however, more stringently structured.
GOLD was maintained by the LINGUIST List and others from 2007 to 2010. The project created a mirror of the 2010 edition of GOLD as a Data Category Selection within ISOcat. As of 2018, GOLD data remained an important terminology hub in the context of the Linguistic Linked Open Data cloud, but as it is no longer actively maintained, its function is increasingly replaced by OLiA and .
ISO 12620 (ISO TC37 Data Category Registry, ISOcat)
ISO 12620 is a standard from ISO/TC 37 that defines a registry for linguistic terms used in various fields of translation, computational linguistics and natural language processing, and that defines mappings both between different terms and between the same terms as used in different systems. An earlier edition of this system, ISOcat, provides persistent identifiers and URIs for linguistic categories, including the inventory of the GOLD ontology. Since 2014, ISOcat has no longer been actively developed. As of May 2020, the successor systems, CLARIN Concept Registry and DatCatInfo, were only emerging.
For linguistic categories relevant to lexical resources, the lexinfo vocabulary represents an established community standard, in particular in connection with the OntoLex vocabulary and machine-readable dictionaries in the context of Linguistic Linked Open Data technologies. Just as the OntoLex vocabulary builds on the Lexical Markup Framework, lexinfo builds on ISOcat. Unlike ISOcat, however, lexinfo is actively maintained and is currently being extended in a community effort.
Ontologies of Linguistic Annotation (OLiA)
Similar in spirit to GOLD, the Ontologies of Linguistic Annotation provide a reference inventory of linguistic categories for syntactic, morphological and semantic phenomena relevant for linguistic annotation and linguistic corpora, in the form of an ontology. In addition, they provide machine-readable annotation schemes for more than 100 languages, linked with the OLiA Reference Model. The OLiA ontologies represent a major hub of annotation terminology in the Linked Open Data cloud, with applications for search, retrieval and machine learning over heterogeneously annotated language resources.
In addition to annotation schemes, the OLiA Reference Model is also linked with the Eagles Guidelines, GOLD, ISOcat, CLARIN Concept Registry, Universal Dependencies, lexinfo, etc., and the OLiA ontologies thus enable interoperability between these vocabularies. OLiA is being developed as a community project on GitHub.