The extension of a code is the mapping of finite-length source sequences to finite-length bit strings that is obtained by concatenating, for each symbol of the source sequence, the corresponding codeword produced by the original code. Using terms from formal language theory, the precise mathematical definition is as follows: Let $S$ and $T$ be two finite sets, called the source and target alphabets, respectively. A code $C: S \to T^*$ is a total function mapping each symbol from $S$ to a sequence of symbols over $T$. The extension of $C$ is the homomorphism of $S^*$ into $T^*$ that naturally maps each sequence of source symbols to the sequence of target symbols formed by concatenating their codewords.
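The definition above can be illustrated with a minimal sketch. The function name `extend` and the three-symbol code are hypothetical choices made for illustration only; a code is represented as a Python dictionary from source symbols to bit strings.

```python
def extend(code, source):
    """Extension of `code`: concatenate the codeword of each source symbol."""
    return "".join(code[symbol] for symbol in source)


# Hypothetical code over the source alphabet {a, b, c} and target alphabet {0, 1}.
code = {"a": "0", "b": "10", "c": "11"}

encoded = extend(code, "abca")  # "0" + "10" + "11" + "0"

# Homomorphism property: extending a concatenation of two source sequences
# gives the concatenation of their extensions.
assert extend(code, "ab" + "ca") == extend(code, "ab") + extend(code, "ca")
```

The empty source sequence maps to the empty bit string, as expected of a homomorphism.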
Classes of variable-length codes
Variable-length codes can be strictly nested in order of decreasing generality as non-singular codes, uniquely decodable codes and prefix codes. Prefix codes are always uniquely decodable, and these in turn are always non-singular:
Non-singular codes
A code is non-singular if each source symbol is mapped to a different non-empty bit string, i.e. the mapping from source symbols to bit strings is injective.
For example, the mapping $\{a \mapsto 0, b \mapsto 0, c \mapsto 1\}$ is not non-singular because both "a" and "b" map to the same bit string "0"; any extension of this mapping will generate a lossy coding. Such singular coding may still be useful when some loss of information is acceptable.
However, the mapping $\{a \mapsto 1, b \mapsto 011, c \mapsto 01110, d \mapsto 1110, e \mapsto 10011\}$ is non-singular; it assigns a distinct codeword to every source symbol, which is a prerequisite for the lossless coding needed in general data transmission. Note that it is not necessary for a non-singular code to be more compact than the source.
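The injectivity condition can be checked mechanically. The following is a small sketch; the helper name `is_non_singular` and both example dictionaries are assumptions made for illustration.

```python
def is_non_singular(code):
    """True if every codeword is non-empty and no two symbols share a codeword."""
    codewords = list(code.values())
    all_non_empty = all(codewords)  # an empty string is falsy
    all_distinct = len(set(codewords)) == len(codewords)
    return all_non_empty and all_distinct


# Illustrative codes (assumed for this sketch).
singular = {"a": "0", "b": "0", "c": "1"}  # "a" and "b" collide
non_singular = {"a": "1", "b": "011", "c": "01110", "d": "1110", "e": "10011"}

print(is_non_singular(singular), is_non_singular(non_singular))
```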
Uniquely decodable codes
A code is uniquely decodable if its extension is non-singular. Whether a given code is uniquely decodable can be decided with the Sardinas–Patterson algorithm.
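One way to sketch the Sardinas–Patterson test is shown below. It repeatedly computes sets of "dangling suffixes" and reports failure as soon as one of them is itself a codeword. The function names and the example codeword sets are assumptions for illustration, and this is a compact sketch rather than an optimized implementation.

```python
def left_quotient(A, B):
    """Non-empty suffixes w such that a + w is in B for some a in A."""
    return {b[len(a):] for a in A for b in B
            if b.startswith(a) and len(b) > len(a)}


def is_uniquely_decodable(codewords):
    """Sardinas-Patterson test on a collection of codewords (bit strings)."""
    words = list(codewords)
    C = set(words)
    if len(C) != len(words) or "" in C:
        return False  # a singular code is never uniquely decodable
    current = left_quotient(C, C)  # dangling suffixes between distinct codewords
    seen = set()
    while current:
        if current & C:
            return False  # a dangling suffix is a codeword: two parses exist
        key = frozenset(current)
        if key in seen:
            return True  # the suffix sets cycle without ever hitting a codeword
        seen.add(key)
        current = left_quotient(C, current) | left_quotient(current, C)
    return True  # no dangling suffixes remain


print(is_uniquely_decodable(["0", "01", "011"]))
print(is_uniquely_decodable(["1", "011", "01110", "1110", "10011"]))
```

The loop terminates because every dangling suffix is a suffix of some codeword, so only finitely many distinct suffix sets can occur.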
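The mapping $\{a \mapsto 0, b \mapsto 01, c \mapsto 011\}$ is uniquely decodable: every codeword begins with a 0 and contains no other 0, so each 0 in the encoded string unambiguously marks the start of a new codeword.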
Consider again the non-singular code from the previous section, in which a, b, c, d and e are encoded as 1, 011, 01110, 1110 and 10011, respectively. This code is not uniquely decodable, since the string 011101110011 can be interpreted as the sequence of codewords 01110 – 1110 – 011, but also as the sequence of codewords 011 – 1 – 011 – 10011. Two possible decodings of this encoded string are thus cdb and babe. However, such a code is useful when the set of all possible source symbols is completely known and finite, or when there are restrictions that determine whether source elements of this extension are acceptable. Such restrictions permit the decoding of the original message by checking which of the possible source symbols mapped to the same symbol are valid under those restrictions.
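The ambiguity described above can be made concrete by enumerating every way of splitting the bit string into codewords. This is a brute-force sketch; the function name `all_decodings` is hypothetical, and the dictionary contains only the five codewords that appear in the two parses quoted above. It assumes all codewords are non-empty, otherwise the recursion would not terminate.

```python
def all_decodings(code, bits):
    """Return every source sequence whose extension under `code` equals `bits`."""
    if not bits:
        return [""]  # exactly one way to decode the empty string
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            # Consume one codeword, then decode the remainder recursively.
            for rest in all_decodings(code, bits[len(word):]):
                results.append(symbol + rest)
    return results


code = {"a": "1", "b": "011", "c": "01110", "d": "1110", "e": "10011"}
decodings = all_decodings(code, "011101110011")
print(decodings)  # includes both "cdb" and "babe"
```

A uniquely decodable code would yield at most one entry in this list for any bit string.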
Prefix codes
A code is a prefix code if no target bit string in the mapping is a prefix of the target bit string of a different source symbol in the same mapping. This means that symbols can be decoded instantaneously after their entire codeword is received. Other commonly used names for this concept are prefix-free code, instantaneous code, or context-free code.
The uniquely decodable example mapping from the previous section is not a prefix code because, after reading the bit string "0", we do not know whether it encodes an "a" source symbol or is the prefix of the encoding of the "b" or "c" symbol.
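The prefix condition and the instantaneous decoding it enables can be sketched as follows. The function names and both example codes are assumptions for illustration; the decoder is only meaningful when the code passes the prefix check.

```python
from itertools import permutations


def is_prefix_code(code):
    """True if no codeword is a prefix of another symbol's codeword."""
    return not any(v.startswith(u) for u, v in permutations(code.values(), 2))


def decode_prefix(code, bits):
    """Decode bit by bit, emitting a symbol as soon as a codeword is complete."""
    inverse = {word: symbol for symbol, word in code.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:  # no longer codeword can start with this one
            decoded.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a complete codeword")
    return "".join(decoded)


# Illustrative codes (assumed for this sketch).
prefix_code = {"a": "0", "b": "10", "c": "110", "d": "111"}
not_prefix = {"a": "0", "b": "01", "c": "011"}

print(is_prefix_code(prefix_code), is_prefix_code(not_prefix))
print(decode_prefix(prefix_code, "010110111"))
```

The decoder never needs to look ahead: once the buffer matches a codeword, the prefix property guarantees that the match is final.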
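The advantage of a variable-length code is that unlikely source symbols can be assigned longer codewords and likely source symbols can be assigned shorter codewords, thus giving a low expected codeword length. For example, for the prefix code $\{a \mapsto 0, b \mapsto 10, c \mapsto 110, d \mapsto 111\}$, if the probabilities of $(a, b, c, d)$ were $\left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8}\right)$, the expected number of bits used to represent a source symbol would be $1 \times \tfrac{1}{2} + 2 \times \tfrac{1}{4} + 3 \times \tfrac{1}{8} + 3 \times \tfrac{1}{8} = \tfrac{7}{4} = 1.75$. As the entropy of this source is 1.7500 bits per symbol, the expected codeword length equals the entropy, so this code compresses the source as much as possible while still allowing it to be recovered with zero error.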
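The comparison between expected codeword length and entropy can be checked numerically. In this sketch the prefix code and the probabilities (1/2, 1/4, 1/8, 1/8) are assumptions chosen to be consistent with the entropy of 1.75 bits per symbol quoted above; the function names are hypothetical.

```python
from fractions import Fraction
from math import log2


def expected_length(code, probs):
    """Average number of bits per source symbol under the distribution `probs`."""
    return sum(probs[symbol] * len(code[symbol]) for symbol in code)


def entropy(probs):
    """Shannon entropy in bits per symbol (zero-probability symbols are skipped)."""
    return -sum(float(p) * log2(float(p)) for p in probs.values() if p > 0)


# Assumed example: a prefix code with dyadic symbol probabilities.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
probs = {"a": Fraction(1, 2), "b": Fraction(1, 4),
         "c": Fraction(1, 8), "d": Fraction(1, 8)}

print(expected_length(code, probs))  # exact rational value
print(entropy(probs))
```

Because each probability here is a power of 1/2 and each codeword has length equal to the negative base-2 logarithm of its symbol's probability, the two quantities coincide; for general distributions the expected length of a uniquely decodable code is at least the entropy.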