Chomsky normal form

In formal language theory, a context-free grammar, G, is said to be in Chomsky normal form if all of its production rules are of the form:
    A → BC, or
    A → a, or
    S → ε,
where A, B, and C are nonterminal symbols, the letter a is a terminal symbol, S is the start symbol, and ε denotes the empty string. Also, neither B nor C may be the start symbol, and the third production rule can only appear if ε is in L(G), the language produced by the context-free grammar G.
Every grammar in Chomsky normal form is context-free, and conversely, every context-free grammar can be transformed into an equivalent one which is in Chomsky normal form and has a size no larger than the square of the original grammar's size.
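
The definition can be checked mechanically. The following is a minimal sketch, not a reference implementation: it assumes a grammar is given as a collection of (left-hand side, right-hand side) pairs with the right-hand side stored as a tuple of symbols, and it tests each rule against the forms A → BC, A → a, and S → ε. The function name is_cnf and this representation are assumptions made for illustration.

```python
def is_cnf(rules, nonterminals, start):
    """Return True if every rule has the form A -> B C, A -> a, or S -> epsilon.

    rules        : iterable of (lhs, rhs) pairs, rhs being a tuple of symbols
    nonterminals : set of nonterminal symbols; any other symbol is a terminal
    start        : the start symbol S
    (This representation is an assumption made for the sketch.)
    """
    for lhs, rhs in rules:
        if len(rhs) == 0:
            # The empty right-hand side is only allowed for the start symbol.
            if lhs != start:
                return False
        elif len(rhs) == 1:
            # A single symbol must be a terminal (unit rules are not allowed).
            if rhs[0] in nonterminals:
                return False
        elif len(rhs) == 2:
            # Two symbols must both be nonterminals other than the start symbol.
            if any(x not in nonterminals or x == start for x in rhs):
                return False
        else:
            # Right-hand sides longer than two symbols are never allowed.
            return False
    return True
```

For instance, with nonterminals {S, A, B}, the rules S → AB, A → a, B → b, S → ε pass the check, while a grammar containing the unit rule S → A does not.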

Converting a grammar to Chomsky normal form

To convert a grammar to Chomsky normal form, a sequence of simple transformations is applied in a certain order; this is described in most textbooks on automata theory.
The presentation here follows Hopcroft and Ullman, but is adapted to use the transformation names from Lange and Leiß. Each of the following transformations establishes one of the properties required for Chomsky normal form.

START: Eliminate the start symbol from right-hand sides

Introduce a new start symbol S₀, and a new rule
    S₀ → S,
where S is the previous start symbol.
This does not change the grammar's produced language, and S₀ will not occur on any rule's right-hand side.
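
As a minimal sketch of the START step, assuming rules are stored as a set of (left-hand side, right-hand side tuple) pairs: the function name start_step and the default name "S0" for the new symbol are illustrative choices, and the sketch refuses to proceed if that name is already in use.

```python
def start_step(rules, start, new_start="S0"):
    """Add the rule new_start -> start and return (new_rules, new_start).

    rules is a set of (lhs, rhs) pairs with rhs a tuple of symbols
    (an assumed representation). new_start must be a fresh symbol.
    """
    symbols = {lhs for lhs, _ in rules} | {x for _, rhs in rules for x in rhs}
    if new_start in symbols:
        raise ValueError(f"{new_start!r} already occurs in the grammar")
    # The old start symbol may still occur on right-hand sides; the new one cannot.
    return set(rules) | {(new_start, (start,))}, new_start
```

Applied to the rules S → aSb and S → ε, this adds S₀ → S and leaves the other rules untouched.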

TERM: Eliminate rules with nonsolitary terminals

To eliminate each rule
    A → X₁ ... a ... X_n
in which a terminal symbol a is not the only symbol on the right-hand side, introduce, for every such terminal, a new nonterminal symbol N_a, and a new rule
    N_a → a.
Change every rule
    A → X₁ ... a ... X_n
to
    A → X₁ ... N_a ... X_n.
If several terminal symbols occur on the right-hand side, simultaneously replace each of them by its associated nonterminal symbol.
This does not change the grammar's produced language.
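
The TERM step might be sketched as follows, under the assumption that rules are (left-hand side, right-hand side tuple) pairs and that names of the form "N_" followed by the terminal are not already used in the grammar. The function name term_step is hypothetical.

```python
def term_step(rules, nonterminals):
    """Replace terminals in right-hand sides of length >= 2 by fresh nonterminals.

    Returns (new_rules, new_nonterminals). The naming scheme "N_" + terminal
    is an assumption of this sketch and must not clash with existing symbols.
    """
    new_rules = set()
    new_nonterminals = set(nonterminals)
    for lhs, rhs in rules:
        if len(rhs) >= 2:
            replaced = []
            for x in rhs:
                if x in nonterminals:
                    replaced.append(x)
                else:
                    n_x = "N_" + x
                    new_nonterminals.add(n_x)
                    new_rules.add((n_x, (x,)))  # the new rule N_a -> a
                    replaced.append(n_x)
            rhs = tuple(replaced)
        # Rules with a solitary terminal (or an empty right-hand side) are kept as-is.
        new_rules.add((lhs, rhs))
    return new_rules, new_nonterminals
```

All terminals of a right-hand side are replaced in one pass, which corresponds to the simultaneous replacement described above.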

BIN: Eliminate right-hand sides with more than 2 nonterminals

Replace each rule
    A → X₁ X₂ ... X_n
with more than 2 nonterminals X₁,...,X_n by rules
    A → X₁ A₁,
    A₁ → X₂ A₂,
    ...,
    A_{n-2} → X_{n-1} X_n,
where A_i are new nonterminal symbols.
Again, this does not change the grammar's produced language.
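
A minimal sketch of the BIN step, again assuming rules are (left-hand side, right-hand side tuple) pairs: the fresh nonterminals are named A_1, A_2, ... with a single counter shared across all rules, which is an illustrative naming choice and assumes these names are unused.

```python
def bin_step(rules):
    """Split every right-hand side with more than two symbols into a chain
    of binary rules, using fresh nonterminals A_1, A_2, ... (assumed unused)."""
    new_rules = set()
    counter = 0
    for lhs, rhs in sorted(rules):  # sorted only to make the numbering deterministic
        while len(rhs) > 2:
            counter += 1
            fresh = f"A_{counter}"
            # A -> X1 X2 ... Xn becomes A -> X1 fresh, then fresh -> X2 ... Xn.
            new_rules.add((lhs, (rhs[0], fresh)))
            lhs, rhs = fresh, rhs[1:]
        new_rules.add((lhs, rhs))
    return new_rules
```

For a rule A → X₁X₂X₃X₄ this yields A → X₁A_1, A_1 → X₂A_2, and A_2 → X₃X₄.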

DEL: Eliminate ε-rules

An ε-rule is a rule of the form
    A → ε,
where A is not S₀, the grammar's start symbol.
To eliminate all rules of this form, first determine the set of all nonterminals that derive ε.
Hopcroft and Ullman call such nonterminals nullable, and compute them as follows:

If a rule A → ε exists, then A is nullable.
If a rule A → X₁... X_n exists, and every single X_i is nullable, then A is nullable, too.
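
The two conditions above amount to a fixed-point computation. A minimal sketch, assuming rules are (left-hand side, right-hand side tuple) pairs with the empty tuple standing for ε; the function name nullable_set is illustrative:

```python
def nullable_set(rules):
    """Return the set of nonterminals that derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            # all() is True for an empty right-hand side, which covers A -> epsilon;
            # otherwise every symbol of the right-hand side must already be nullable.
            if lhs not in nullable and all(x in nullable for x in rhs):
                nullable.add(lhs)
                changed = True
    return nullable
```

The loop repeats until a full pass adds no new nonterminal, so nonterminals that become nullable only through other nullable nonterminals are found as well.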

Obtain an intermediate grammar by replacing each rule
    A → X₁ ... X_n
by all versions with some nullable X_i omitted.
The transformed grammar is then obtained by deleting every ε-rule from this intermediate grammar, unless its left-hand side is the start symbol.
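For example, in the following grammar, with start symbol S₀,
    S₀ → AbB | C
    B → AA | AC
    C → b | c
    A → a | ε
the nonterminal A, and hence also B, is nullable, while neither C nor S₀ is.
Hence the following intermediate grammar is obtained:
    S₀ → AbB | Ab | bB | b | C
    B → AA | A | ε | AC | C
    C → b | c
    A → a | ε
In this grammar, all ε-rules have been "inlined at the call site".
In the next step, they can hence be deleted, yielding the grammar:
    S₀ → AbB | Ab | bB | b | C
    B → AA | A | AC | C
    C → b | c
    A → a
This grammar produces the same language as the original example grammar, viz. {ab, aba, abaa, abab, abac, abb, abc, b, ba, baa, bab, bac, bb, bc, c}, but has no ε-rules.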
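
The DEL step as a whole might be sketched as follows. This is an illustration under assumptions, not a definitive implementation: rules are (left-hand side, right-hand side tuple) pairs, the empty tuple stands for ε, START is assumed to have been applied already, and the name del_step is hypothetical.

```python
from itertools import product


def del_step(rules, start):
    """Remove epsilon-rules, keeping one only for the start symbol."""
    # Step 1: compute the nullable nonterminals by fixed-point iteration.
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in nullable and all(x in nullable for x in rhs):
                nullable.add(lhs)
                changed = True

    # Step 2: for every rule, emit all versions with some nullable symbols omitted.
    new_rules = set()
    for lhs, rhs in rules:
        # Each symbol is either kept, or (if nullable) optionally dropped.
        options = [((x,), ()) if x in nullable else ((x,),) for x in rhs]
        for choice in product(*options):
            new_rhs = tuple(x for part in choice for x in part)
            # Step 3: discard epsilon-rules unless the left-hand side is the start symbol.
            if new_rhs or lhs == start:
                new_rules.add((lhs, new_rhs))
    return new_rules
```

On a grammar in which A and B are nullable, a rule such as S₀ → AbB expands to the four versions AbB, Ab, bB, and b. Because the result is a set, versions that coincide are stored only once.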

UNIT: Eliminate unit rules

A unit rule is a rule of the form
    A → B,
where A, B are nonterminal symbols.
To remove it, for each rule
    B → X₁ ... X_n,
where X₁ ... X_n is a string of nonterminals and terminals, add rule
    A → X₁ ... X_n
unless this is a unit rule which has already been removed.
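
One way to implement UNIT is sketched below. Note that it is a closure-based variant rather than the rule-by-rule removal described above: it first computes, for every nonterminal A, the nonterminals reachable from A through unit rules only, and then copies their non-unit rules to A. The representation of rules as (left-hand side, right-hand side tuple) pairs and the name unit_step are assumptions of the sketch.

```python
def unit_step(rules, nonterminals):
    """Eliminate unit rules A -> B by copying B's non-unit rules to A."""

    def is_unit(rhs):
        return len(rhs) == 1 and rhs[0] in nonterminals

    # reach[A] = nonterminals B such that A derives B using unit rules only.
    reach = {a: {a} for a in nonterminals}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if is_unit(rhs):
                for a in nonterminals:
                    if lhs in reach[a] and rhs[0] not in reach[a]:
                        reach[a].add(rhs[0])
                        changed = True

    # Every non-unit rule B -> alpha is copied to each A that reaches B.
    new_rules = set()
    for a in nonterminals:
        for lhs, rhs in rules:
            if lhs in reach[a] and not is_unit(rhs):
                new_rules.add((a, rhs))
    return new_rules
```

Working with the closure avoids re-introducing unit rules that were already removed, including cyclic ones such as A → B together with B → A.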

Order of transformations

When choosing the order in which the above transformations are to be applied, it has to be considered that some transformations may destroy the result achieved by other ones. For example, START will re-introduce a unit rule if it is applied after UNIT. The table shows which orderings are admitted.
Moreover, the worst-case bloat in grammar size depends on the transformation order. Using |G| to denote the size of the original grammar G, the size blow-up in the worst case may range from |G|² to 2^{2 |G|}, depending on the transformation algorithm used. The blow-up in grammar size depends on the order between DEL and BIN. It may be exponential when DEL is done first, but is linear otherwise. UNIT can incur a quadratic blow-up in the size of the grammar. The orderings START,TERM,BIN,DEL,UNIT and START,BIN,DEL,UNIT,TERM lead to the least blow-up.
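
The effect of the order between DEL and BIN can be illustrated on a single rule A → X₁ ... X_n in which every X_i is nullable. The sketch below is an illustration only, not a proof of the bounds stated above: it counts resulting rules rather than grammar size, and it assumes that the X_i are pairwise distinct. The function names are hypothetical.

```python
from itertools import product


def versions(rhs, nullable):
    """All non-empty versions of rhs with some nullable symbols omitted."""
    options = [((x,), ()) if x in nullable else ((x,),) for x in rhs]
    out = set()
    for choice in product(*options):
        new_rhs = tuple(x for part in choice for x in part)
        if new_rhs:
            out.add(new_rhs)
    return out


def count_del_first(n):
    """Rules obtained from A -> X1 ... Xn when DEL is applied to the long rule."""
    rhs = tuple(f"X{i}" for i in range(1, n + 1))
    return len(versions(rhs, set(rhs)))


def count_bin_first(n):
    """Rules obtained when BIN first splits the rule into n - 1 binary rules."""
    xs = [f"X{i}" for i in range(1, n + 1)]
    # The chain nonterminals A1 ... A_{n-2} are nullable because all Xi are.
    nullable = set(xs) | {f"A{i}" for i in range(1, n - 1)}
    binary = [(xs[i - 1], f"A{i}") for i in range(1, n - 1)]
    binary.append((xs[n - 2], xs[n - 1]))
    return sum(len(versions(rhs, nullable)) for rhs in binary)
```

For n = 10, applying DEL to the long rule yields 2¹⁰ − 1 = 1023 rules, whereas splitting with BIN first yields 3 · 9 = 27.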

Example

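The following grammar, with start symbol Expr, describes a simplified version of the set of all syntactically valid arithmetic expressions in programming languages like C or Algol60. Both number and variable are considered terminal symbols here for simplicity, since in a compiler front-end their internal structure is usually not considered by the parser. The terminal symbol "^" denoted exponentiation in Algol60.
    Expr → Term | Expr AddOp Term | AddOp Term
    Term → Factor | Term MulOp Factor
    Factor → Primary | Factor ^ Primary
    Primary → number | variable | ( Expr )
    AddOp → + | -
    MulOp → * | /
In step "START" of the above conversion algorithm, just a rule S₀→Expr is added to the grammar.
After step "TERM", the grammar looks like this:
    S₀ → Expr
    Expr → Term | Expr AddOp Term | AddOp Term
    Term → Factor | Term MulOp Factor
    Factor → Primary | Factor PowOp Primary
    Primary → number | variable | Open Expr Close
    AddOp → + | -
    MulOp → * | /
    PowOp → ^
    Open → (
    Close → )
After step "BIN", the following grammar is obtained:
    S₀ → Expr
    Expr → Term | Expr AddOp_Term | AddOp Term
    Term → Factor | Term MulOp_Factor
    Factor → Primary | Factor PowOp_Primary
    Primary → number | variable | Open Expr_Close
    AddOp → + | -
    MulOp → * | /
    PowOp → ^
    Open → (
    Close → )
    AddOp_Term → AddOp Term
    MulOp_Factor → MulOp Factor
    PowOp_Primary → PowOp Primary
    Expr_Close → Expr Close
Since there are no ε-rules, step "DEL" does not change the grammar.
After step "UNIT", the following grammar is obtained, which is in Chomsky normal form:
    S₀ → number | variable | Open Expr_Close | Factor PowOp_Primary | Term MulOp_Factor | Expr AddOp_Term | AddOp Term
    Expr → number | variable | Open Expr_Close | Factor PowOp_Primary | Term MulOp_Factor | Expr AddOp_Term | AddOp Term
    Term → number | variable | Open Expr_Close | Factor PowOp_Primary | Term MulOp_Factor
    Factor → number | variable | Open Expr_Close | Factor PowOp_Primary
    Primary → number | variable | Open Expr_Close
    AddOp → + | -
    MulOp → * | /
    PowOp → ^
    Open → (
    Close → )
    AddOp_Term → AddOp Term
    MulOp_Factor → MulOp Factor
    PowOp_Primary → PowOp Primary
    Expr_Close → Expr Close
The N_a introduced in step "TERM" are PowOp, Open, and Close.
The A_i introduced in step "BIN" are AddOp_Term, MulOp_Factor, PowOp_Primary, and Expr_Close.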

Alternative definition

Chomsky reduced form

Another way to define the Chomsky normal form is:
A formal grammar is in Chomsky reduced form if all of its production rules are of the form:
    A → BC, or
    A → a,
where A, B, and C are nonterminal symbols, and a is a terminal symbol. When using this definition, B or C may be the start symbol. Only those context-free grammars which do not generate the empty string can be transformed into Chomsky reduced form.

Floyd normal form

In a letter where he proposed the term Backus–Naur form (BNF), Donald E. Knuth implied a BNF "syntax in which all definitions have such a form may be said to be in 'Floyd Normal Form'",
    ⟨A⟩ ::= ⟨B⟩ | ⟨C⟩, or
    ⟨A⟩ ::= ⟨B⟩⟨C⟩, or
    ⟨A⟩ ::= a,
where ⟨A⟩, ⟨B⟩, and ⟨C⟩ are nonterminal symbols, and a is a terminal symbol,
because Robert W. Floyd found in 1961 that any BNF syntax can be converted to the above one. However, Knuth withdrew this term, "since doubtless many people have independently used this simple fact in their own work, and the point is only incidental to the main considerations of Floyd's note." While Floyd's note cites Chomsky's original 1959 article, Knuth's letter does not.

Application

Besides its theoretical significance, CNF conversion is used as a preprocessing step in some algorithms, e.g., the CYK algorithm, a bottom-up parsing algorithm for context-free grammars, and its variant, probabilistic CKY.
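
To illustrate why the normal form is convenient for parsing, the following is a minimal sketch of a CYK-style recognizer. It assumes the grammar is already in Chomsky normal form, that rules are given as a set of (left-hand side, right-hand side tuple) pairs, and that every terminal is a single character of the input string. The function name cyk and these conventions are assumptions of the sketch.

```python
def cyk(word, rules, start):
    """Return True if `word` is derivable from `start` (grammar assumed in CNF)."""
    n = len(word)
    if n == 0:
        # The empty word is accepted only through the rule S -> epsilon.
        return (start, ()) in rules

    # table[i][l] = set of nonterminals deriving the substring word[i:i+l].
    table = [[set() for _ in range(n + 1)] for _ in range(n)]

    # Substrings of length 1 come from rules A -> a.
    for i, a in enumerate(word):
        for lhs, rhs in rules:
            if rhs == (a,):
                table[i][1].add(lhs)

    # Longer substrings come from rules A -> B C, trying every split point.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for lhs, rhs in rules:
                    if (len(rhs) == 2
                            and rhs[0] in table[i][split]
                            and rhs[1] in table[i + split][length - split]):
                        table[i][length].add(lhs)

    return start in table[0][n]
```

Because every rule has at most two symbols on its right-hand side, each substring only needs to be split into two parts, which is what keeps the table-filling loop simple.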
