Backus–Naur form


In computer science, Backus–Naur form or Backus normal form is a notation technique for context-free grammars, often used to describe the syntax of languages used in computing, such as computer programming languages, document formats, instruction sets and communication protocols. It is applied wherever exact descriptions of languages are needed: for instance, in official language specifications, in manuals, and in textbooks on programming language theory.
Many extensions and variants of the original Backus–Naur notation are used; some are exactly defined, including extended Backus–Naur form and augmented Backus–Naur form.

History

The idea of describing the structure of language using rewriting rules can be traced back to at least the work of Pāṇini, an ancient Indian Sanskrit grammarian and a revered scholar in Hinduism who lived sometime between the 6th and 4th century BCE. His notation for describing Sanskrit word structure is equivalent in power to that of Backus and has many similar properties.
In Western society, grammar was long regarded as a subject for teaching, rather than scientific study; descriptions were informal and targeted at practical usage. In the first half of the 20th century, linguists such as Leonard Bloomfield and Zellig Harris started attempts to formalize the description of language, including phrase structure.
Meanwhile, string rewriting rules as formal logical systems were introduced and studied by mathematicians such as Axel Thue, Emil Post and Alan Turing. Noam Chomsky, teaching linguistics to students of information theory at MIT, combined linguistics and mathematics by taking what is essentially Thue's formalism as the basis for the description of the syntax of natural language. He also introduced a clear distinction between generative rules and transformation rules.
John Backus, a programming language designer at IBM, proposed a metalanguage of "metalinguistic formulas" to describe the syntax of the new programming language IAL, known today as ALGOL 58. His notation was first used in the ALGOL 60 report.
BNF is a notation for Chomsky's context-free grammars. Apparently, Backus was familiar with Chomsky's work.
As proposed by Backus, the formula defined "classes" whose names are enclosed in angle brackets, for example <ab>. Each of these names denotes a class of basic symbols.
Further development of ALGOL led to ALGOL 60. In the committee's 1963 report, Peter Naur called Backus's notation Backus normal form. Donald Knuth argued that BNF should rather be read as Backus–Naur form, as it is "not a normal form in the conventional sense", unlike, for instance, Chomsky normal form. The name Pāṇini Backus form was also once suggested, on the grounds that the expansion Backus normal form may not be accurate and that Pāṇini had independently developed a similar notation earlier.
BNF is described by Peter Naur in the ALGOL 60 report as a metalinguistic formula:
Another example from the ALGOL 60 report illustrates a major difference between the BNF metalanguage and a Chomsky context-free grammar. Metalinguistic variables do not require a rule defining their formation. Their formation may simply be described in natural language within the <> brackets. The comment specification in section 2.3 of the ALGOL 60 report exemplifies how this works:

For the purpose of including text among the symbols of a program the following "comment" conventions hold:

The sequence of basic symbols:                                is equivalent to
; comment <any sequence not containing ';'>;                  ;
begin comment <any sequence not containing ';'>;              begin
end <any sequence not containing 'end' or ';' or 'else'>      end

By equivalence is here meant that any of the three structures shown in the left column may be replaced, in any occurrence outside of strings, by the symbol shown in the same line in the right column without any effect on the action of the program.

Naur changed two of Backus's symbols to commonly available characters. The "::=" symbol was originally a ":≡". The "|" symbol was originally the word "or" (with a bar over it).
Working for IBM, Backus would have had a non-disclosure agreement and could not have talked about his source if it came from an IBM proprietary project.
BNF is very similar to canonical-form Boolean algebra equations that are, and were at the time, used in logic-circuit design. Backus was a mathematician and the designer of the FORTRAN programming language, and the study of Boolean algebra is commonly part of a mathematics curriculum. Neither Backus nor Naur described the names enclosed in < > as non-terminals; Chomsky's terminology was not originally used in describing BNF. In the ALGOL 60 report they were called metalinguistic variables, and Naur later described them as classes in ALGOL course materials. Anything other than the metasymbols ::= and |, and class names enclosed in < >, is a symbol of the language being defined. The metasymbol ::= is to be interpreted as "is defined as". The metasymbol | separates alternative definitions and is interpreted as "or". The metasymbols < and > are delimiters enclosing a class name. BNF is described as a metalanguage for talking about ALGOL by Peter Naur and Saul Rosen.
In 1947 Saul Rosen became involved in the activities of the fledgling Association for Computing Machinery, first on the languages committee that became the IAL group and eventually led to ALGOL. He was the first managing editor of the Communications of the ACM. What we do know is that BNF was first used as a metalanguage to talk about the ALGOL language in the ALGOL 60 report. That is how it is explained in ALGOL programming course material developed by Peter Naur in 1962. Early ALGOL manuals by IBM, Honeywell, Burroughs and Digital Equipment Corporation followed the ALGOL 60 report using it as a metalanguage. Saul Rosen in his book describes BNF as a metalanguage for talking about ALGOL. An example of its use as a metalanguage would be in defining an arithmetic expression:

<expr> ::= <term> | <expr> <addop> <term>

The first symbol of an alternative may be the class being defined, the repetition, as explained by Naur, having the function of specifying that the alternative sequence can recursively begin with a previous alternative and can be repeated any number of times. For example, above <expr> is defined as a <term> followed by any number of <addop> <term>.
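The recursion described here can be sketched in Python, assuming the rule <expr> ::= <term> | <expr> <addop> <term>. The token classes chosen for <term> and <addop> and the function names are illustrative assumptions, not part of the ALGOL 60 report. Because the second alternative begins with <expr> itself, the sketch unrolls the recursion into a loop: one <term>, then any number of <addop> <term> pairs.

```python
# Hypothetical token classes, chosen only for illustration.
ADDOPS = {"+", "-"}

def is_term(token):
    """Treat any alphanumeric token as a <term> (an assumption)."""
    return token.isalnum()

def is_expr(tokens):
    """Recognize <expr> ::= <term> | <expr> <addop> <term>.

    The left recursion is unrolled into a loop: one <term>, then any
    number of <addop> <term> pairs.
    """
    if not tokens or not is_term(tokens[0]):
        return False
    i = 1
    while i < len(tokens):
        # Each repetition must be exactly <addop> followed by <term>.
        if tokens[i] not in ADDOPS:
            return False
        if i + 1 >= len(tokens) or not is_term(tokens[i + 1]):
            return False
        i += 2
    return True
```

Under these assumptions, is_expr(["a", "+", "b", "-", "3"]) accepts the input, while is_expr(["a", "+"]) rejects it because the final <addop> has no <term> after it.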
In some later metalanguages, such as Schorre's META II, the BNF recursive repeat construct is replaced by a sequence operator and target language symbols defined using quoted strings. The < and > brackets were removed. Parentheses for mathematical grouping were added. The <expr> rule would appear in META II as
These changes enabled META II and its derivative programming languages to define and extend their own metalanguage, at the cost of the ability to describe a language construct in natural language inside a metalinguistic variable. Many spin-off metalanguages were inspired by BNF. See META II, TREE-META, and Metacompiler.
A BNF class describes a language construct formation, with formation defined as a pattern or the action of forming the pattern. The class name expr is described in a natural language as a <term> followed by a sequence <addop> <term>. A class is an abstraction; we can talk about it independent of its formation. We can talk about term, independent of its definition, as being added or subtracted in expr. We can talk about a term being a specific data type and how an expr is to be evaluated having specific combinations of data types. We can even talk about reordering an expression to group data types and about evaluating results of mixed types. The natural-language supplement provided specific details of the language class semantics to be used by a compiler implementation and by a programmer writing an ALGOL program. Natural-language description further supplemented the syntax as well. The integer rule is a good example of natural language and metalanguage used together to describe syntax:

<integer> ::= <digit> | <integer> <digit>

There are no specifics on white space in the above. As far as the rule states, we could have space between the digits. In the natural language we complement the BNF metalanguage by explaining that the digit sequence can have no white space between the digits. English is only one of the possible natural languages. Translations of the ALGOL reports were available in many natural languages.
The origin of BNF is not as important as its impact on programming language development. During the period immediately following the publication of the ALGOL 60 report, BNF was the basis of many compiler-compiler systems.
Some, like "A Syntax Directed Compiler for ALGOL 60" developed by Edgar T. Irons and "A Compiler Building System" developed by Brooker and Morris, directly used BNF. Others, like the Schorre Metacompilers, made it into a programming language with only a few changes. <class name> became symbol identifiers, dropping the enclosing < and > and using quoted strings for symbols of the target language. Arithmetic-like grouping provided a simplification that removed the need for classes whose only purpose was grouping. The META II arithmetic expression rule shows grouping use. Output expressions placed in a META II rule are used to output code and labels in an assembly language. Rules in META II are equivalent to class definitions in BNF. The Unix utility yacc is based on BNF with code production similar to META II. yacc is most commonly used as a parser generator, and its roots are clearly in BNF.
BNF today is one of the oldest computer-related languages still in use.

Introduction

A BNF specification is a set of derivation rules, written as

<symbol> ::= __expression__

where <symbol> is a nonterminal, and the __expression__ consists of one or more sequences of symbols; multiple sequences are separated by the vertical bar "|", indicating a choice, the whole being a possible substitution for the symbol on the left. Symbols that never appear on a left side are terminals. On the other hand, symbols that appear on a left side are non-terminals and are always enclosed between the pair <>.
The "::=" means that the symbol on the left must be replaced with the expression on the right.
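As a rough illustration of how derivation rules drive substitution, the following Python sketch stores a grammar as a dictionary and performs a leftmost derivation. The toy grammar, the name GRAMMAR and the function derive are all assumptions made for this example.

```python
# A toy grammar (an assumption for illustration):
#   <greeting> ::= "hello" <name> | "hi" <name>
#   <name>     ::= "world" | "BNF"
# Nonterminals are the dictionary keys; every other symbol is a terminal.
GRAMMAR = {
    "<greeting>": [["hello", "<name>"], ["hi", "<name>"]],
    "<name>": [["world"], ["BNF"]],
}

def derive(start, choices):
    """Leftmost derivation: repeatedly replace the first nonterminal.

    `choices` lists which alternative (0-based) to pick at each step.
    Returns every intermediate form, from the start symbol to the
    final sequence of terminals.
    """
    form = [start]
    steps = [list(form)]
    picks = iter(choices)
    while any(sym in GRAMMAR for sym in form):
        # Position of the leftmost nonterminal.
        pos = next(i for i, sym in enumerate(form) if sym in GRAMMAR)
        alternative = GRAMMAR[form[pos]][next(picks)]
        # Substitute the chosen right-hand side for the nonterminal.
        form = form[:pos] + alternative + form[pos + 1:]
        steps.append(list(form))
    return steps
```

For instance, derive("<greeting>", [1, 0]) first picks the second alternative of <greeting> and then the first alternative of <name>, so the derivation ends in the terminal sequence hi world.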

Example

As an example, consider this possible BNF for a U.S. postal address:

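<postal-address> ::= <name-part> <street-address> <zip-part>
<name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL> | <personal-part> <name-part>
<personal-part> ::= <initial> "." | <first-name>
<street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>
<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>
<opt-suffix-part> ::= "Sr." | "Jr." | <roman-numeral> | ""
<opt-apt-num> ::= <apt-num> | ""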

This translates into English as:
  • A postal address consists of a name-part, followed by a street-address part, followed by a zip-code part.
  • A name-part consists of either: a personal-part followed by a last name followed by an optional suffix and end-of-line, or a personal part followed by a name part.
  • A personal-part consists of either a first name or an initial followed by a dot.
  • A street address consists of a house number, followed by a street name, followed by an optional apartment specifier, followed by an end-of-line.
  • A zip-part consists of a town-name, followed by a comma, followed by a state code, followed by a ZIP-code followed by an end-of-line.
  • An opt-suffix-part consists of a suffix, such as "Sr.", "Jr." or a roman-numeral, or an empty string.
  • An opt-apt-num consists of an apartment number or an empty string.
Note that many things are left unspecified here. If necessary, they may be described using additional BNF rules.
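One way to see which parts are left unspecified is to list the names that appear on a right side but are never defined on a left side. The Python sketch below does this for the address grammar; the spelling of each rule name is assumed from the English description, and POSTAL_BNF and undefined_nonterminals are hypothetical names used only here.

```python
import re

# The address grammar, with rule and symbol names assumed from the
# English description (illustrative spellings, not authoritative).
POSTAL_BNF = '''
<postal-address> ::= <name-part> <street-address> <zip-part>
<name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL> | <personal-part> <name-part>
<personal-part> ::= <initial> "." | <first-name>
<street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>
<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>
<opt-suffix-part> ::= "Sr." | "Jr." | <roman-numeral> | ""
<opt-apt-num> ::= <apt-num> | ""
'''

def undefined_nonterminals(bnf_text):
    """Return names used on a right side but never defined on a left side."""
    defined, used = set(), set()
    for line in bnf_text.strip().splitlines():
        left, right = line.split("::=", 1)
        defined.add(left.strip())
        # Drop quoted literals first so a "<" inside quotes is not miscounted.
        right = re.sub(r'"[^"]*"', " ", right)
        used.update(re.findall(r"<[^<>]+>", right))
    return sorted(used - defined)
```

Calling undefined_nonterminals(POSTAL_BNF) reports names such as <EOL>, <last-name> and <ZIP-code>, each of which would need an additional rule or a natural-language description before the grammar is complete.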

Further examples

BNF's syntax itself may be represented with a BNF like the following:

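<syntax> ::= <rule> | <rule> <syntax>
<rule> ::= <opt-whitespace> "<" <rule-name> ">" <opt-whitespace> "::=" <opt-whitespace> <expression> <line-end>
<opt-whitespace> ::= " " <opt-whitespace> | ""
<expression> ::= <list> | <list> <opt-whitespace> "|" <opt-whitespace> <expression>
<line-end> ::= <opt-whitespace> <EOL> | <line-end> <line-end>
<list> ::= <term> | <term> <opt-whitespace> <list>
<term> ::= <literal> | "<" <rule-name> ">"
<literal> ::= '"' <text1> '"' | "'" <text2> "'"
<text1> ::= "" | <character1> <text1>
<text2> ::= '' | <character2> <text2>
<character> ::= <letter> | <digit> | <symbol>
<letter> ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<symbol> ::= "|" | " " | "!" | "#" | "$" | "%" | "&" | "(" | ")" | "*" | "+" | "," | "-" | "." | "/" | ":" | ";" | ">" | "=" | "<" | "?" | "@" | "[" | "\" | "]" | "^" | "_" | "`" | "{" | "}" | "~"
<character1> ::= <character> | "'"
<character2> ::= <character> | '"'
<rule-name> ::= <letter> | <rule-name> <rule-char>
<rule-char> ::= <letter> | <digit> | "-"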

Note that "" is the empty string.
The original BNF did not use quotes as shown in the <literal> rule. This assumes that no whitespace is necessary for proper interpretation of the rule.
<EOL> represents the appropriate line-end specifier. <rule-name> and <text> are to be substituted with a declared rule's name/label or literal text, respectively.
In the U.S. postal address example above, the entire block-quote is a syntax. Each line or unbroken grouping of lines is a rule; for example, one rule begins with <name-part> ::=. The other part of that rule is an expression, which consists of two lists separated by a pipe |. These two lists consist of some terms. Each term in this particular rule is a rule-name.
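The terminology above (rule, expression, list, term) can be made concrete with a small Python sketch that splits one rule into its rule name and its lists of terms. It is a simplification under stated assumptions: the rule fits on one line, only double-quoted literals are handled, and the names TERM and parse_rule are hypothetical.

```python
import re

# One term is either a <rule-name> or a double-quoted literal.
# (Single-quoted literals are omitted to keep the sketch short.)
TERM = re.compile(r'<[A-Za-z][A-Za-z0-9-]*>|"[^"]*"')

def parse_rule(line):
    """Split one BNF rule into (rule_name, lists); each list is a sequence of terms."""
    left, right = line.split("::=", 1)
    rule_name = left.strip()
    right = right.strip()
    lists, current, pos = [], [], 0
    while pos < len(right):
        if right[pos].isspace():
            pos += 1
        elif right[pos] == "|":
            # A pipe outside a literal closes the current list.
            lists.append(current)
            current = []
            pos += 1
        else:
            match = TERM.match(right, pos)
            if match is None:
                raise ValueError(f"unexpected text at column {pos}: {right[pos:]!r}")
            current.append(match.group())
            pos = match.end()
    lists.append(current)
    return rule_name, lists
```

Because literals are matched as whole terms before the pipe check can see their contents, a quoted "|" is kept as a literal rather than being treated as a separator between lists.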

Variants

There are many variants and extensions of BNF, generally either for the sake of simplicity and succinctness, or to adapt it to a specific application. One common feature of many variants is the use of regular expression repetition operators such as * and +. The extended Backus–Naur form is a common one.
Another common extension is the use of square brackets around optional items. Although not present in the original ALGOL 60 report, the notation is now universally recognised.
Augmented Backus–Naur form and Routing Backus–Naur form are extensions commonly used to describe Internet Engineering Task Force protocols.
Parsing expression grammars build on the BNF and regular expression notations to form an alternative class of formal grammar, which is essentially analytic rather than generative in character.
Many BNF specifications found online today are intended to be human-readable and are non-formal. These often include many of the following syntax rules and extensions:
  • Optional items enclosed in square brackets: [<item-x>].
  • Items existing 0 or more times are enclosed in curly brackets or suffixed with an asterisk, such as <word> ::= <letter> {<letter>} or <word> ::= <letter> <letter>* respectively.
  • Items existing 1 or more times are suffixed with an addition symbol, +.
  • Terminals may appear in bold rather than italics, and non-terminals in plain text rather than angle brackets.
  • Where items are grouped, they are enclosed in simple parentheses.
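These repetition and option operators add convenience rather than expressive power: each can be rewritten as plain BNF by introducing a recursive helper rule. The Python sketch below shows one such rewriting for the suffix forms *, + and ?. The function name desugar and the generated helper rule names are illustrative assumptions, not part of any standard.

```python
def desugar(rule_name, items):
    """Rewrite suffix operators (*, +, ?) on items into plain BNF rules.

    `items` is one sequence such as ["<letter>", "<letter>*"].
    Returns a list of rule strings; helper rule names are generated
    here and are purely illustrative.
    """
    rules, body = [], []
    for item in items:
        if item[-1] in "*+?":
            base, op = item[:-1], item[-1]
        else:
            base, op = item, ""
        if not op:
            body.append(base)
            continue
        stem = base.strip("<>")
        helper = {"*": f"<{stem}-star>", "+": f"<{stem}-plus>", "?": f"<opt-{stem}>"}[op]
        if op == "*":    # zero or more: empty, or one item followed by more
            rules.append(f'{helper} ::= "" | {base} {helper}')
        elif op == "+":  # one or more: a single item, or one item followed by more
            rules.append(f"{helper} ::= {base} | {base} {helper}")
        else:            # optional: the item or the empty string
            rules.append(f'{helper} ::= {base} | ""')
        body.append(helper)
    return [f"{rule_name} ::= {' '.join(body)}"] + rules
```

Applied to the <word> example, desugar("<word>", ["<letter>", "<letter>*"]) produces a <word> rule that refers to a helper, plus a recursive helper rule whose first alternative is the empty string.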

    Software using BNF

  • ANTLR, a parser generator written in Java
  • Qlik Sense, a BI tool, uses a variant of BNF for scripting
  • BNF Converter, operating on a variant called "labeled Backus–Naur form". In this variant, each production for a given non-terminal is given a label, which can be used as a constructor of an algebraic data type representing that nonterminal. The converter is capable of producing types and parsers for abstract syntax in several languages, including Haskell and Java.
  • Coco/R, compiler generator accepting an attributed grammar in EBNF
  • DMS Software Reengineering Toolkit, program analysis and transformation system for arbitrary languages
  • GOLD BNF parser
  • GNU bison, GNU version of yacc
  • RPA BNF parser. Online demo parsing: JavaScript, XML
  • XACT X4MR System, a rule-based expert system for programming language translation
  • XPL Analyzer, a tool which accepts simplified BNF for a language and produces a parser for that language in XPL; it may be integrated into the supplied SKELETON program, with which the language may be debugged
  • Yacc, parser generator
  • bnfparser2, a universal syntax verification utility
  • bnf2xml, Markup input with XML tags using advanced BNF matching.
  • JavaCC (Java Compiler Compiler), a parser generator for Java
