Treebank

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data. The exploitation of treebank data has been important ever since the first large-scale treebank was published. Although treebanks originated in computational linguistics, their value is becoming more widely appreciated in linguistics research as a whole. For example, annotated treebank data has been crucial in syntactic research to test linguistic theories of sentence structure against large quantities of naturally occurring examples.

Etymology

The term treebank was coined by linguist Geoffrey Leech in the 1980s, by analogy to other repositories such as a seed bank or blood bank. The "tree" in the name reflects the fact that both syntactic and semantic structure are commonly represented compositionally as a tree structure. The term parsed corpus is often used interchangeably with the term treebank, with the emphasis on the primacy of sentences rather than trees.

Construction

Treebanks are often created on top of a corpus that has already been annotated with part-of-speech tags. In turn, treebanks are sometimes enhanced with semantic or other linguistic information. Treebanks can be created completely manually, where linguists annotate each sentence with syntactic structure, or semi-automatically, where a parser assigns some syntactic structure which linguists then check and, if necessary, correct. In practice, fully checking and completing the parsing of natural language corpora is a labour-intensive project that can take teams of graduate linguists several years. The level of annotation detail and the breadth of the linguistic sample determine the difficulty of the task and the length of time required to build a treebank.
Some treebanks follow a specific linguistic theory in their syntactic annotation but most try to be less theory-specific. However, two main groups can be distinguished: treebanks that annotate phrase structure and those that annotate dependency structure.
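The two groups can be contrasted on a single sentence. The following is a minimal sketch, not the format of any particular treebank: the phrase labels (S, NP, VP, NNP, VBZ), the relation names (nsubj, obj, root) and the helper dependents_of are assumptions chosen for illustration.

```python
# Phrase structure: nested constituents, written as (label, children) tuples.
# The category labels are illustrative assumptions.
phrase_structure = (
    "S",
    [
        ("NP", [("NNP", ["John"])]),
        ("VP", [("VBZ", ["loves"]), ("NP", [("NNP", ["Mary"])])]),
    ],
)

# Dependency structure: one row per word, giving the index of its head
# (0 marks the root) and a relation label. Relation names are assumptions.
dependency_structure = [
    # (index, word, head, relation)
    (1, "John", 2, "nsubj"),
    (2, "loves", 0, "root"),
    (3, "Mary", 2, "obj"),
]


def dependents_of(rows, head_index):
    """Return the words whose head is the word at head_index."""
    return [word for _, word, head, _ in rows if head == head_index]


# Words attached directly to the verb "loves" (index 2).
verb_dependents = dependents_of(dependency_structure, 2)
print(verb_dependents)
```

The phrase structure version groups words into nested constituents, while the dependency version records only word-to-word links, so it has no NP or VP nodes at all.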
It is important to clarify the distinction between the formal representation and the file format used to store the annotated data. Treebanks are necessarily constructed according to a particular grammar. The same grammar may be implemented by different file formats. For example, the syntactic analysis for John loves Mary, shown in the figure on the right, may be represented by simple labelled brackets in a text file, like this:

(S (NP (NNP John))
   (VP (VBZ loves)
       (NP (NNP Mary))))
This type of representation is popular because it is light on resources, and the tree structure is relatively easy to read without software tools. However, as corpora become increasingly complex, other file formats may be preferred. Alternatives include treebank-specific XML schemes, numbered indentation and various types of standoff notation.
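Part of the appeal of labelled brackets is that they can be read with very little code. The sketch below is one possible reader, not the loader of any specific treebank: the function names parse_brackets and leaves, the tuple layout, and the category labels in the example string are all assumptions made for illustration.

```python
def parse_brackets(text):
    """Parse a labelled-bracket string into nested (label, children) tuples."""
    # Pad brackets with spaces so that a plain split() yields clean tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1                          # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                          # consume ")"
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def leaves(tree):
    """Return the words of the tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Illustrative bracketing of "John loves Mary"; the labels are assumptions.
EXAMPLE = "(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))"
tree = parse_brackets(EXAMPLE)
print(leaves(tree))
```

Because whitespace is ignored, the same reader accepts the bracketing whether it is written on one line or indented across several.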

Applications

From a computational linguistics perspective, treebanks have been used to engineer state-of-the-art natural language processing systems such as part-of-speech taggers, parsers, semantic analyzers and machine translation systems. Most computational systems utilize gold-standard treebank data. However, an automatically parsed corpus that is not corrected by human linguists can still be useful. It can provide evidence of rule frequency for a parser: a parser may be improved by applying it to large amounts of text and gathering rule frequencies. Rules that are absent from the parser's knowledge base, however, can only be identified by correcting and completing a corpus by hand. Frequencies drawn from a hand-corrected corpus are also likely to be more accurate.
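Gathering rule frequencies from parsed trees can be sketched as follows. This is an illustration under stated assumptions rather than the method of any particular parser: the two-sentence toy corpus is hand-written, the category labels are assumed, and count_rules is a hypothetical helper.

```python
from collections import Counter

# Toy corpus: two hand-written trees as (label, children) tuples.
# Sentences and labels are invented purely for illustration.
TOY_TREEBANK = [
    ("S", [("NP", [("NNP", ["John"])]),
           ("VP", [("VBZ", ["loves"]), ("NP", [("NNP", ["Mary"])])])]),
    ("S", [("NP", [("NNP", ["Mary"])]),
           ("VP", [("VBZ", ["sleeps"])])]),
]


def count_rules(tree, counts):
    """Add the phrasal rules used in `tree` to `counts`."""
    label, children = tree
    subtrees = [child for child in children if isinstance(child, tuple)]
    if subtrees:  # skip lexical rules such as NNP -> John
        rhs = " ".join(child[0] for child in subtrees)
        counts[f"{label} -> {rhs}"] += 1
        for child in subtrees:
            count_rules(child, counts)
    return counts


rule_counts = Counter()
for tree in TOY_TREEBANK:
    count_rules(tree, rule_counts)

# Relative frequency of each rule among rules with the same left-hand side.
lhs_totals = Counter()
for rule, n in rule_counts.items():
    lhs_totals[rule.split(" -> ")[0]] += n

rule_probs = {
    rule: n / lhs_totals[rule.split(" -> ")[0]]
    for rule, n in rule_counts.items()
}

for rule, n in sorted(rule_counts.items()):
    print(f"{rule}: count={n}, relative frequency={rule_probs[rule]:.2f}")
```

If the trees come from an uncorrected parser, the counts can only contain rules the parser already knows, which is why hand correction is needed to discover missing rules.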
In corpus linguistics, treebanks are used to study syntactic phenomena. Once parsed, a corpus will contain frequency evidence showing how common different grammatical structures are in use. Treebanks also provide evidence of coverage and support the discovery of new, unanticipated, grammatical phenomena.
Another use of treebanks in theoretical linguistics and psycholinguistics is interaction evidence. A completed treebank can help linguists carry out experiments as to how the decision to use one grammatical construction tends to influence the decision to form others, and to try to understand how speakers and writers make decisions as they form sentences. Interaction research is particularly fruitful as further layers of annotation, e.g. semantic, pragmatic, are added to a corpus. It is then possible to evaluate the impact of non-syntactic phenomena on grammatical choices.

Semantic treebanks

A semantic treebank is a collection of natural language sentences annotated with a meaning representation. These resources use a formal representation of each sentence's semantic structure. Semantic treebanks vary in the depth of their semantic representation. A notable example of deep semantic annotation is a treebank developed at the University of Groningen and annotated using Discourse Representation Theory. An example of a shallow semantic treebank is PropBank, which provides annotation of verbal propositions and their arguments, without attempting to represent every word in the corpus in logical form.

Language	Treebank	Semantic Formalism	Distribution / License
English	Abstract Meaning Representation Bank	Deep semantics
English		Deep semantics
English		Deep semantics
English		Deep semantics
English		Deep semantics
English		Deep semantics
English		Deep semantics
English	Universal Conceptual Cognitive Annotation	Deep semantics
English	FrameNet	Shallow semantics
English	PropBank	Shallow semantics

Deep syntax treebanks

A deep syntax treebank is a treebank lying at the interface between syntax and semantics, where the representation structure can be interpreted as a graph that captures phenomena such as the subjects of infinitival phrases, extraction, it-cleft constructions and shared subject ellipsis.

Syntactic treebanks

Many syntactic treebanks have been developed for a wide variety of languages:

Language	Treebank	Syntactic Formalism	Distribution / License
Arabic		Phrase structure
Arabic		Dependency
Arabic		Dependency
Arabic		Dependency
Classical Armenian		Dependency
Bulgarian		HPSG
Catalan		Phrase structure
Chinese		Phrase structure
Chinese		Case grammar
Chinese		Dependency
Old Church Slavonic		Dependency
Croatian		Dependency
Czech		Dependency
Danish		Dependency
Danish		Phrase structure
Dutch		Phrase structure
Dutch		Dependency
Dutch		Dependency
English		Phrase structure
English		Combinatory categorial grammar
English		Dependency
English		Dependency
English		Phrase structure
English		Phrase structure
English		Phrase structure
English		Phrase structure
English		Phrase structure
English		Phrase structure
English		Phrase structure
English		HPSG
English		HPSG
English		Phrase structure
English		Dependency
English		Dependency
English		Phrase structure
English		Phrase structure
English		Dependency
English		Phrase structure
English		Phrase structure
Estonian		Dependency
Estonian		Phrase structure
Finnish		Dependency
French		Phrase structure
French		Phrase structure and dependency
French		Dependency and macrosyntactic annotation
French		Phrase structure
French		Phrase structure
French		Phrase structure
German		Dependency
German		Phrase structure
German		Phrase structure
German		Phrase structure
German		Phrase structure
German		Phrase structure
German		Phrase structure
Gothic		Dependency
Greek		Dependency
Greek		Dependency
Greek		Dependency
Hebrew		Dependency
Hindi		Dependency
Hungarian		Phrase structure
Icelandic		Phrase structure
Italian		Dependency
Italian		Phrase structure and dependency
Italian		Phrase structure and dependency
Italian
Italian		Dependency
Italian		Dependency
Italian		Dependency
Japanese		Dependency
Japanese
Japanese		Phrase structure
Japanese		Phrase structure
Korean		Phrase structure
Latin		Dependency
Latin		Dependency
Latin		Dependency
Norwegian		LFG
Persian		HPSG
Persian		Dependency
Polish		HPSG
Polish		Phrase structure and dependency
Portuguese		Phrase structure and dependency
Portuguese		Phrase structure
Romanian		Dependency
Russian	SynTagRus Dependency Treebank	Dependency
Old Russian		Dependency
Slovene		Dependency
Spanish		Phrase structure and dependency
Spanish		Phrase structure
Swedish		Phrase structure and dependency
Swedish		Phrase structure
Swedish		Phrase structure
Thai		Dependency
Turkish		Dependency
Ukrainian		Dependency
Urdu		Phrase structure
Urdu		Phrase and Hyper Dependency Structure
Vietnamese		Phrase structure
Vietnamese		Dependency

To facilitate research on multilingual tasks, some researchers have proposed universal annotation schemes that apply across languages. Such schemes aim to combine the advantages of different treebank corpora. Examples include a universal annotation approach for dependency treebanks and a universal annotation approach for phrase structure treebanks.

Search tools

One of the key ways to extract evidence from a treebank is through search tools. Search tools for parsed corpora typically depend on the annotation scheme that was applied to the corpus. User interfaces range in sophistication from expression-based query systems aimed at computer programmers to full exploration environments aimed at general linguists. Wallis discusses the principles of searching treebanks in detail and reviews the state of the art.
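An expression-based query can be as simple as matching a node label together with the label of one of its children. The sketch below does not reproduce the query language of any existing search tool: the functions subtrees and find, the tuple layout and the category labels are assumptions for illustration.

```python
def subtrees(tree):
    """Yield the tree and every subtree below it, top-down."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def find(tree, label, child_label=None):
    """Return subtrees labelled `label`, optionally requiring an
    immediate child labelled `child_label`."""
    hits = []
    for node in subtrees(tree):
        if node[0] != label:
            continue
        child_labels = [c[0] for c in node[1] if isinstance(c, tuple)]
        if child_label is None or child_label in child_labels:
            hits.append(node)
    return hits


# Illustrative tree for "John loves Mary"; the labels are assumptions.
TREE = (
    "S",
    [
        ("NP", [("NNP", ["John"])]),
        ("VP", [("VBZ", ["loves"]), ("NP", [("NNP", ["Mary"])])]),
    ],
)

noun_phrases = find(TREE, "NP")
vps_with_object = find(TREE, "VP", child_label="NP")
print(len(noun_phrases), len(vps_with_object))
```

A query written this way is tied to the labels of one annotation scheme, which illustrates why search tools typically depend on the scheme applied to the corpus.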

Search tools can be grouped by the grammar formalism they target: phrase structure grammar, dependency grammar, either of the two, or other schemes.
Phrase structure grammar
* Linguistic DataBase
* VIQTORYA
Others
* Tatoeba
