Vietnamese language and computers


The Vietnamese language is written with a Latin-based alphabet that requires various accommodations in computing. Currently, software-based keyboard layouts are the most popular way of writing Vietnamese on a computer, for example by using the VNI input method with UniKey, an input method editor. Historically, Vietnamese was also written in a logographic script, chữ Nôm, whose use is now limited to ceremonial and traditional purposes.

Fonts and character encodings

Vietnamese alphabet

There are as many as 46 character encodings for representing the Vietnamese alphabet. Unicode has become the most popular encoding for Vietnamese, as for many of the world's writing systems, due to its broad compatibility and software support. Diacritics may be encoded either as combining characters or as precomposed characters, which are scattered among the Latin Extended-A, Latin Extended-B, and Latin Extended Additional blocks. The Vietnamese đồng symbol (₫) is encoded in the Currency Symbols block. Historically, the Vietnamese language used other characters beyond the modern alphabet. The Middle Vietnamese letter B with flourish is included in the Latin Extended-D block. The apex is not included in Unicode, although a visually similar combining mark may serve as a rough approximation.
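Early versions of Unicode assigned the characters U+0340 COMBINING GRAVE TONE MARK and U+0341 COMBINING ACUTE TONE MARK for the purpose of placing these marks beside a circumflex, as is common in Vietnamese typography. These two characters have been deprecated; U+0300 COMBINING GRAVE ACCENT and U+0301 COMBINING ACUTE ACCENT are now used regardless of any present circumflex.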
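As a small illustration of the two representations, the following Python sketch uses the standard `unicodedata` module to convert the letter ế between its precomposed and combining forms. It also shows that the deprecated U+0341 normalizes to the same precomposed letter as U+0301. The helper names `to_precomposed` and `to_combining` are illustrative only.

```python
import unicodedata

# Precomposed form: a single code point, U+1EBF (e with circumflex and acute).
precomposed = "\u1ebf"

# Combining form: base letter e + U+0302 (circumflex) + U+0301 (acute).
combining = "e\u0302\u0301"

# Same letter written with the deprecated U+0341 COMBINING ACUTE TONE MARK.
deprecated = "e\u0302\u0341"


def to_precomposed(text):
    """Compose base letters and combining marks where possible (NFC)."""
    return unicodedata.normalize("NFC", text)


def to_combining(text):
    """Decompose precomposed letters into base letter plus marks (NFD)."""
    return unicodedata.normalize("NFD", text)


print(len(precomposed), len(combining))
print(to_precomposed(combining) == precomposed)
print(to_combining(precomposed) == combining)
print(to_precomposed(deprecated) == precomposed)
```

Software that compares or searches Vietnamese text typically normalizes both sides to one form first, since the two spellings look identical on screen but differ code point by code point.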
For systems that lack support for Unicode, dozens of 8-bit Vietnamese code pages have been designed. The most commonly used of them were VISCII, VSCII, VNI, VPS and Windows-1258. Where ASCII is required, such as when ensuring readability in plain-text e-mail, Vietnamese letters are often encoded according to Vietnamese Quoted-Readable (VIQR) or VSCII Mnemonic, though usage of either variable-width scheme has declined dramatically following the adoption of Unicode on the World Wide Web. For instance, support for all of the above-mentioned 8-bit encodings, with the exception of Windows-1258, was dropped from Mozilla software in 2014.
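To make the constraints of an 8-bit code page concrete, here is a hedged Python sketch that round-trips a phrase through Windows-1258 using Python's built-in `cp1258` codec. Windows-1258 stores letters such as ê as single bytes but represents tone marks as separate combining characters, so fully precomposed text has to be rearranged before encoding. The helper `split_tone_marks` is a hypothetical name introduced for this example, not part of any library.

```python
import unicodedata

# The five combining tone marks: grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}


def split_tone_marks(text):
    """Keep letters such as ê precomposed, but emit tone marks separately."""
    pieces = []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in decomposed if c not in TONE_MARKS)
        tones = "".join(c for c in decomposed if c in TONE_MARKS)
        pieces.append(unicodedata.normalize("NFC", base) + tones)
    return "".join(pieces)


text = "Ti\u1ebfng Vi\u1ec7t"  # "Tiếng Việt" in precomposed form

# Fully precomposed letters such as U+1EBF have no byte in Windows-1258.
try:
    text.encode("cp1258")
    directly_encodable = True
except UnicodeEncodeError:
    directly_encodable = False

encoded = split_tone_marks(text).encode("cp1258")
decoded = unicodedata.normalize("NFC", encoded.decode("cp1258"))

print(directly_encodable)
print(len(text), len(encoded))
print(decoded == text)
```

The encoded form is longer than the character count of the original because each toned vowel occupies two bytes: one for the vowel and one for the tone mark.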
Many Vietnamese fonts intended for desktop publishing are encoded in VNI or TCVN3. Such fonts are known as "ABC fonts". Popular web browsers lack support for specialty Vietnamese encodings, so any webpage that uses these fonts appears as unintelligible mojibake on systems without them installed.
Vietnamese frequently stacks diacritics, so typeface designers must take care to prevent stacked diacritics from colliding with adjacent letters or lines. When a tone mark is used together with another diacritic, offsetting the tone mark to the right preserves consistency and avoids slowing down saccades. In advertising signage and in cursive handwriting, diacritics often take forms unfamiliar to other Latin alphabets. For example, the lowercase letter I retains its tittle in ì, ỉ, ĩ, and í. These nuances are rarely accounted for in computing environments.

Approaches

Vietnamese writing requires 134 additional letters besides the 52 already present in ASCII. This exceeds the 128 additional characters available in a conventional extended ASCII encoding. Although this can be solved by using a variable-width encoding, other encodings have used a number of approaches to support Vietnamese without doing so.
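One way to arrive at the figure of 134 is to enumerate the letters: 12 vowels (a, ă, â, e, ê, i, o, ô, ơ, u, ư, y) in 6 tonal variants each, plus đ, in both cases, minus the 12 plain vowels that ASCII already contains. The Python sketch below performs that count with the standard `unicodedata` module. The enumeration is an assumption about how the figure is derived, and the names used are illustrative.

```python
import unicodedata

# a, ă, â, e, ê, i, o, ô, ơ, u, ư, y (escapes keep the source normalization-proof)
BASE_VOWELS = "a\u0103\u00e2e\u00eaio\u00f4\u01a1u\u01b0y"

# No tone, grave, hook above, tilde, acute, dot below.
TONE_MARKS = ["", "\u0300", "\u0309", "\u0303", "\u0301", "\u0323"]


def vietnamese_letters():
    """Return every vowel-and-tone combination plus đ, in both cases."""
    lower = {
        unicodedata.normalize("NFC", vowel + tone)
        for vowel in BASE_VOWELS
        for tone in TONE_MARKS
    }
    lower.add("\u0111")  # đ
    return lower | {ch.upper() for ch in lower}


letters = vietnamese_letters()
non_ascii = {ch for ch in letters if ord(ch) > 127}

print(len(letters))     # all enumerated letters, including plain ASCII vowels
print(len(non_ascii))   # letters that ASCII cannot represent
print(len(non_ascii) > 128)
```

Because the count exceeds 128, a single-byte encoding that leaves ASCII untouched cannot give every letter its own byte, which is why the legacy encodings resort to workarounds.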

Text input

A purely physical Vietnamese keyboard would be impractical, due to the sheer number of letter-diacritic-diacritic combinations in the alphabet. Instead, Vietnamese input relies on software-based keyboard layouts, virtual keyboards, or input methods.

Keyboard layouts

Vietnamese keyboard layouts rely on dead keys to compose letters with diacritics. Most desktop operating systems include a Vietnamese keyboard layout based on a Vietnamese national standard. Previously, typewriters used an AZERTY-based Vietnamese layout.

Input methods

The three most common Vietnamese input methods are Telex, VNI, and VIQR. Telex indicates diacritics using letters that are unlikely to appear at the end of a word, while VNI repurposes the number keys or function keys and VIQR repurposes various punctuation marks. The Telex and VIQR conventions originated in an earlier era of telex machines and typewriters, respectively.
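The following Python sketch illustrates the idea behind VNI-style input in a deliberately simplified form. It assumes the commonly documented VNI key assignments (1 to 5 for the tone marks, 6 for the circumflex, 7 for the horn, 8 for the breve, and 9 for đ), and it applies each digit to the letter immediately before it. Real input method editors are more flexible and choose the target vowel themselves. The function name `vni_to_unicode` is hypothetical.

```python
import unicodedata

# Assumed VNI digit assignments, mapped to Unicode combining characters.
VNI_MARKS = {
    "1": "\u0301",  # acute
    "2": "\u0300",  # grave
    "3": "\u0309",  # hook above
    "4": "\u0303",  # tilde
    "5": "\u0323",  # dot below
    "6": "\u0302",  # circumflex (â, ê, ô)
    "7": "\u031b",  # horn (ơ, ư)
    "8": "\u0306",  # breve (ă)
}


def vni_to_unicode(keys):
    """Convert VNI keystrokes, attaching each digit to the preceding letter."""
    clusters = []
    for key in keys:
        prev = clusters[-1] if clusters else ""
        if key in VNI_MARKS and prev and prev[0].isalpha():
            # Attach the mark and recompose into a single code point if possible.
            clusters[-1] = unicodedata.normalize("NFC", prev + VNI_MARKS[key])
        elif key == "9" and prev in ("d", "D"):
            clusters[-1] = "\u0111" if prev == "d" else "\u0110"  # đ / Đ
        else:
            clusters.append(key)
    return "".join(clusters)


print(vni_to_unicode("Tie61ng Vie65t"))
print(vni_to_unicode("d9o62ng"))
```

A digit that does not follow a letter is passed through unchanged, so a number such as 2024 survives conversion. A production IME would also need an escape mechanism for typing a digit directly after a letter.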
Support for these input methods is provided by input method editors (IMEs), which are known in Vietnamese as bộ gõ, literally "pecker". IMEs may be provided by the operating system, installed as a third-party application, installed as a browser extension, or provided by an individual website in the form of a script. Common third-party applications include GoTiengViet, UniKey, VietKey, VPSKeys, WinVNKey, and xvnkb. On Unix-like operating systems, the IBus and SCIM frameworks both support Vietnamese. IME scripts such as AVIM, Mudim, and VietTyping can be found on most Vietnamese message boards, the Vietnamese Wikipedia, and other text-intensive websites. The Vietnamese Web browser Cốc Cốc comes with a built-in input method.
Input methods allow words to be composed in a more flexible order than keyboard layouts allow. An IME may also offer a candidate list of suggestions, and to populate that list it may need to communicate with a Web service. Some IMEs also use candidate lists to allow the user to convert text from the Vietnamese alphabet to chữ Nôm, because there is no one-to-one correspondence between alphabetic words and nôm characters.

Other considerations

Typical Vietnamese text contains a high proportion of compound words. Compound words are never hyphenated in contemporary usage, so spell checkers are limited to checking individual syllables unless a statistical language model is consulted.
Vietnamese has rigid spelling rules and few exceptions, so text-to-speech engines may avoid dictionary lookups except when encountering a foreign loan word. TTS engines must account for tones, which are essential to the meaning of any Vietnamese word.