Apertium is a shallow-transfer machine translation system that uses finite-state transducers for all of its lexical transformations and hidden Markov models for part-of-speech tagging (word category disambiguation). Constraint Grammar taggers are also used for some language pairs. Most existing machine translation systems are commercial or use proprietary technologies, which makes them very hard to adapt to new uses; furthermore, they use different technologies across language pairs, which makes it very difficult, for instance, to integrate them into a single multilingual content management system. Apertium uses a language-independent specification, which makes it easier to contribute to the project, allows more efficient development, and supports the project's overall growth. At present, Apertium has released 40 stable language pairs, delivering fast translation with reasonably intelligible results. As an open-source project, Apertium provides tools for potential developers to build their own language pairs and contribute to the project.
This is an overall, step-by-step view of how Apertium works. The diagram displays the steps that Apertium takes to translate a source-language text into a target-language text.
Source language text is passed into Apertium for translation.
The deformatter sets aside formatting markup, which should be kept in place but not translated.
The morphological analyser segments the text and looks up the segments in the language dictionaries, returning the base form and tags for all matches. In pairs that involve agglutinative morphology, including a number of Turkic languages, a Helsinki Finite-State Transducer (HFST) is used. Otherwise, an Apertium-specific technology called lttoolbox is used.
The morphological disambiguator resolves ambiguous segments by choosing one match. Apertium is working on adding Constraint Grammar frameworks to more of its language pairs, which allows more fine-grained constraints to be imposed than would otherwise be possible. For this, Apertium uses the Visual Interactive Syntax Learning (VISL) Constraint Grammar parser.
Lexical transfer looks up the disambiguated source-language base forms to find their target-language equivalents. For lexical transfer, Apertium uses an XML-based dictionary format called bidix.
Lexical selection chooses between alternative translations when a source-language word has more than one meaning. Apertium uses a specific XML-based technology, apertium-lex-tools, to perform lexical selection.
Structural transfer can consist of a one-step transfer module or a three-step transfer module. It flags grammatical differences between the source language and the target language by creating a sequence of chunks containing markers for these differences. It then reorders or modifies the chunks to produce a grammatical translation in the target language. This is also done using lttoolbox.
The morphological generator uses the tags to deliver the correct target-language surface form. Like the morphological analyser, the morphological generator is a morphological transducer, which can both analyse and generate forms.
The post-generator makes any orthographic changes required when words come into contact, such as contractions or elision.
The reformatter restores the formatting markup that was removed by the deformatter in the first step.
Apertium delivers the target-language translation.
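The steps above can be sketched as a chain of small functions. The following Python code is a minimal toy illustration, not Apertium's actual implementation: the dictionaries, the tag names, all function names, and the "take the first analysis" disambiguation rule are assumptions made for this example. Real Apertium relies on compiled finite-state transducers and on statistical or rule-based taggers, and it also runs the lexical selection, structural transfer and post-generation stages, which are omitted here. The to_stream helper prints analyses in the style of Apertium's text stream format, where each lexical unit is written as ^surface/lemma<tags>$.

```python
import re

# Hypothetical toy data (English -> Spanish); not taken from any real Apertium pair.
ANALYSES = {
    "cats": [("cat", ("n", "pl"))],
    # An ambiguous form: verb reading first, noun reading second.
    "sleep": [("sleep", ("vblex", "pres", "p3", "pl")), ("sleep", ("n", "sg"))],
}
BIDIX = {("cat", "n"): "gato", ("sleep", "vblex"): "dormir"}
GENERATION = {
    ("gato", ("n", "pl")): "gatos",
    ("dormir", ("vblex", "pres", "p3", "pl")): "duermen",
}

MARKUP_OR_SPACE = r"<[^>]+>|\s+"


def deformat(text):
    """Split text into ('blank', s) pieces (markup, whitespace) and ('word', s) pieces."""
    pieces = [p for p in re.split(f"({MARKUP_OR_SPACE})", text) if p]
    return [("blank" if re.fullmatch(MARKUP_OR_SPACE, p) else "word", p) for p in pieces]


def analyse(word):
    """Morphological analysis: return every (lemma, tags) reading for a surface form."""
    return ANALYSES.get(word.lower(), [])


def to_stream(surface, analyses):
    """Render the readings of one word in the style of the Apertium stream format."""
    readings = [lemma + "".join(f"<{t}>" for t in tags) for lemma, tags in analyses]
    return "^" + "/".join([surface] + readings) + "$"


def disambiguate(analyses):
    """Toy disambiguator: keep the first reading (a real tagger uses context)."""
    return analyses[0]


def transfer(lemma, tags):
    """Lexical transfer: map the source lemma to a target lemma, keeping the tags."""
    return BIDIX[(lemma, tags[0])], tags


def generate(lemma, tags):
    """Morphological generation: produce the target surface form from lemma and tags."""
    return GENERATION[(lemma, tags)]


def translate(text):
    """Run the toy pipeline; blanks pass through untouched, so reformatting is a join."""
    out = []
    for kind, value in deformat(text):
        if kind == "blank":
            out.append(value)
            continue
        analyses = analyse(value)
        if not analyses:
            out.append("*" + value)  # unknown word, marked with an asterisk in this sketch
            continue
        out.append(generate(*transfer(*disambiguate(analyses))))
    return "".join(out)


if __name__ == "__main__":
    print(to_stream("sleep", analyse("sleep")))
    print(translate("<b>cats</b> sleep"))
```

In this sketch the markup and whitespace never enter the translation stages, which mirrors the role of the deformatter and reformatter: only the word pieces are analysed, transferred and regenerated.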
Language pairs
List of currently stable language pairs.