Re2c


re2c is a free and open-source lexer generator for C, C++ and Go. It compiles declarative regular expression specifications to deterministic finite automata. Originally written by Peter Bumbulis and described in his paper, re2c was released into the public domain and has since been maintained by volunteers. It is the lexer generator adopted by projects such as PHP, SpamAssassin, the Ninja build system and others. Together with the Lemon parser generator, re2c is used in BRL-CAD. This combination is also used with STEPcode, an implementation of the ISO 10303 standard.

Philosophy

The main goal of re2c is generating fast lexers:
at least as fast as reasonably optimized C lexers coded by hand.
Instead of using the traditional table-driven approach, re2c
encodes the generated finite state machine directly in the form of conditional jumps and comparisons.
The resulting program is faster than its table-driven counterpart
and much easier to debug and understand.
Moreover, this approach often results in smaller lexers,
as re2c applies a number of optimizations such as DFA minimization and the construction of a tunnel automaton.
Another distinctive feature of re2c is its flexible interface:
instead of assuming a fixed program template,
re2c lets the programmer write most of the interface code and adapt the generated lexer to any particular environment.
The main idea is that re2c should be a zero-cost abstraction for the programmer:
using it should never result in a slower program than the corresponding hand-coded implementation.

Features

A re2c program can contain any number of /*!re2c ... */ blocks.
Each block consists of a sequence of rules, definitions and configurations.
Rules have the form REGEXP { CODE } or REGEXP := CODE; where REGEXP is a regular expression and CODE is a block of C code. When REGEXP matches the input string, control flow is transferred to the associated CODE. There is one special rule: the default rule, written with * instead of REGEXP, which is triggered if no other rule matches. re2c has greedy matching semantics: if multiple rules match, the rule that matches the longer prefix is preferred; if the conflicting rules match the same prefix, the earlier rule has priority.
Definitions have the form NAME = REGEXP;.
Configurations have the form re2c:CONFIG = VALUE; where CONFIG is the name of the particular configuration and VALUE is a number or a string.
For more advanced usage see the official re2c manual.

Regular expressions

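re2c uses the following syntax for regular expressions:
"foo" : case-sensitive string literal
'foo' : case-insensitive string literal
[a-xyz], [^a-xyz] : character class (possibly negated)
. : any character except newline
R \ S : difference of character classes
R* : zero or more occurrences of R
R+ : one or more occurrences of R
R? : zero or one occurrence of R
R{n}, R{n,}, R{n,m} : repetition of R exactly n times, at least n times, or from n to m times
(R) : just R; parentheses are used to override precedence or for POSIX-style submatch
R S : concatenation, R followed by S
R | S : alternative, R or S
R / S : lookahead, R followed by S, where S is not consumed
name : the regular expression defined as name
@stag : an s-tag, which saves the last input position at which it matches
#mtag : an m-tag, which saves all input positions at which it matches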
Character classes and string literals may contain the following escape sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and hexadecimal escapes \xhh, \uhhhh and \Uhhhhhhhh.

Example

Here is a very simple program in re2c.
It checks that all input arguments are hexadecimal numbers.
The re2c code is enclosed in /*!re2c ... */ comments; everything else is plain C code.
See the official re2c website for more complex examples.

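#include <stdio.h>

static int lex(const char *YYCURSOR)
{
    const char *YYMARKER;
    /*!re2c
    re2c:define:YYCTYPE = char;
    re2c:yyfill:enable = 0;

    end = "\x00";
    hex = "0x" [0-9a-fA-F]+;

    *       { printf("err\n"); return 1; }
    hex end { printf("hex\n"); return 0; }
    */
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; ++i) {
        lex(argv[i]);
    }
    return 0;
}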

Given this input, the command re2c -is -o example.c example.re generates the code below.
The contents of the /*!re2c ... */ comment are replaced with a deterministic finite automaton
encoded in the form of conditional jumps and comparisons; the rest of the program is copied verbatim into the output file.
There are several code generation options; normally re2c uses switch statements,
but it can use nested if statements,
or generate bitmaps and jump tables.
Which option is better depends on the C compiler;
re2c users are encouraged to experiment.

/* Generated by re2c 1.2.1 on Fri Aug 23 21:59:00 2019 */
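#include <stdio.h>

static int lex(const char *YYCURSOR)
{
    const char *YYMARKER;

{
    char yych;
    yych = *YYCURSOR;
    if (yych == '0') goto yy4;
    ++YYCURSOR;
yy3:
    { printf("err\n"); return 1; }
yy4:
    yych = *(YYMARKER = ++YYCURSOR);
    if (yych != 'x') goto yy3;
    yych = *++YYCURSOR;
    if (yych >= 0x01) goto yy8;
yy6:
    YYCURSOR = YYMARKER;
    goto yy3;
yy7:
    yych = *++YYCURSOR;
yy8:
    if (yych <= '@') {
        if (yych <= 0x00) goto yy9;
        if (yych <= '/') goto yy6;
        if (yych <= '9') goto yy7;
        goto yy6;
    } else {
        if (yych <= 'F') goto yy7;
        if (yych <= '`') goto yy6;
        if (yych <= 'f') goto yy7;
        goto yy6;
    }
yy9:
    ++YYCURSOR;
    { printf("hex\n"); return 0; }
}

}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; ++i) {
        lex(argv[i]);
    }
    return 0;
}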