Tatoeba

Tatoeba is a free collaborative online database of example sentences geared towards foreign language learners. Its name comes from the Japanese term "tatoeba", meaning "for example". Unlike other online dictionaries, which focus on words, Tatoeba focuses on translation of complete sentences. In addition, the structure of the database and interface emphasize one-to-many relationships. Not only can a sentence have multiple translations within a single language, but its translations into all languages are readily visible, as are indirect translations that involve a chain of stepwise links from one language to another.

The aim of the project

The aim of the Tatoeba Project is to create a database of sentences and translations that can be used by anyone developing a language learning application. The idea is that the project creates the data, so programmers can just focus on coding the application.
The data collected by the project is freely available under a Creative Commons Attribution license.

Content

As of June 2019, the Tatoeba Corpus has over 7,500,000 sentences in 337 languages. The top 10 languages make up 73% of the corpus. Ninety-eight of these languages have over 1,000 sentences. The top 14 languages have over 100,000 sentences each.
Tatoeba is also the current home of the Tanaka Corpus, a public-domain collection of about 150,000 English–Japanese sentence pairs compiled by Hyogo University professor Yasuhito Tanaka and first released in 2001. The corpus continues to be revised on Tatoeba.
Statistics for all languages are available on the Tatoeba website.

History

Tatoeba was founded by Trang Ho in 2006. She originally hosted the project on Sourceforge under the project name "multilangdict".

Interface

Users, even those who are not registered, can search for words in any language to retrieve sentences that use them. Each sentence in the Tatoeba database is displayed next to its likely translations in other languages; direct and indirect translations are differentiated. Sentences are tagged for content such as subject matter, dialect, or vulgarity; each sentence also has its own comment thread for feedback, corrections, and cultural notes from other users. As of early 2016, more than 200,000 sentences in 19 languages had audio readings of varying quality. Sentences can also be browsed by language, tag, or audio.
Registered users can add new sentences or translate or proofread existing ones, even if their target language is not their native tongue. However, it is preferred that users translate into their native or "strongest" language and add sentences from their native language rather than translating into or adding from their target language.
As a result, the corpus is far from error-free: any user can translate a sentence even with no knowledge of the language involved, and the sheer number of sentences makes it impossible to check every one for correctness. Furthermore, as of late 2019, even the website's terms of use had not been translated.
Translations are linked to the original sentence automatically. Users can freely edit their sentences, "adopt" and correct sentences without an owner, and comment on others' sentences. Advanced contributors, a rank above ordinary contributors, can tag, link, and unlink sentences. Corpus maintainers, a rank above advanced contributors, can untag and delete sentences. They can also modify owned sentences, though they typically do so only if the owner fails to respond to a request to make the change.

Database structure

Tatoeba's basic data structure is a series of nodes and links. Each sentence is a node; each link bridges two sentences with the same meaning.
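The node-and-link structure can be illustrated with a minimal Python sketch. This is not Tatoeba's actual implementation: the class name SentenceGraph, its methods, and the sample sentence IDs are hypothetical, and "indirect" is assumed here to mean sentences exactly two links away.

```python
from collections import defaultdict


class SentenceGraph:
    """Hypothetical model: sentences are nodes, translation links are undirected edges."""

    def __init__(self):
        self.sentences = {}            # sentence id -> (language code, text)
        self.links = defaultdict(set)  # sentence id -> ids of linked sentences

    def add_sentence(self, sid, lang, text):
        self.sentences[sid] = (lang, text)

    def link(self, a, b):
        # A link bridges two sentences with the same meaning, so store it both ways.
        self.links[a].add(b)
        self.links[b].add(a)

    def direct_translations(self, sid):
        return sorted(self.links.get(sid, set()))

    def indirect_translations(self, sid):
        # Assumption: indirect translations are reached through one intermediate sentence.
        direct = self.links.get(sid, set())
        two_steps = set()
        for neighbour in direct:
            two_steps |= self.links.get(neighbour, set())
        return sorted(two_steps - direct - {sid})


graph = SentenceGraph()
graph.add_sentence(1, "eng", "For example.")
graph.add_sentence(2, "jpn", "たとえば。")
graph.add_sentence(3, "fra", "Par exemple.")
graph.link(1, 2)  # English <-> Japanese
graph.link(2, 3)  # Japanese <-> French

print(graph.direct_translations(1))    # sentences linked directly to sentence 1
print(graph.indirect_translations(1))  # sentences reached through a chain of two links
```

In this sketch the French sentence is only an indirect translation of the English one, because the two are connected through the Japanese sentence rather than by a link of their own.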

License

The entire Tatoeba database is published under a Creative Commons Attribution 2.0 license, freeing it for academic and other use.

Grants

Tatoeba received a grant from Mozilla Drumbeat in December 2010.
Some work on the Tatoeba infrastructure was sponsored by Google Summer of Code, 2014 edition.
In May 2018 the project received a $25,000 Mozilla Open Source Support program grant.
In August 2019 the project received a $15,000 Mozilla Open Source Support program grant.

Usage

Parallel text corpora such as Tatoeba are used for a variety of natural language processing tasks such as machine translation. The Tatoeba data has been used for treebanking Japanese and for statistical machine translation, as well as in the WWWJDIC Japanese–English dictionary and on www.ManyThings.org.

Offline edition

Selected content from Tatoeba – 83,932 phrases in Esperanto along with all their translations into other languages – appeared in the third edition of the multilingual DVD Esperanto Elektronike, published by E@I in July 2011 in a run of 6,000 copies.
Tab-delimited data ready for import into Anki and similar software can be downloaded directly at the Tatoeba Website.
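Reading such a tab-delimited file might look like the following Python sketch. The column layout (sentence ID, language code, text) is an assumption made for illustration, as are the function name read_sentences and the inline sample data; the layout of the actual downloads should be checked before relying on it.

```python
import csv
import io


def read_sentences(stream):
    """Parse tab-delimited rows of (id, language code, text).

    The three-column layout is an assumption, not a documented format.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        if len(row) != 3:
            continue  # skip lines that do not match the assumed layout
        sid, lang, text = row
        rows.append((int(sid), lang, text))
    return rows


# Inline sample standing in for a downloaded file (hypothetical data).
SAMPLE = "1\teng\tFor example.\n2\tjpn\tたとえば。\n"
sentences = read_sentences(io.StringIO(SAMPLE))

for sid, lang, text in sentences:
    print(sid, lang, text)
```

For a real file, the same function could be given a handle opened with open(path, encoding="utf-8", newline="") in place of the in-memory sample.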
