Chomski


Chomski, the pattern parsing virtual machine, and pep refer to both a command-line computer language and a utility that can be used to parse and transform text patterns and languages. The utility reads input files character by character, applies the operations specified on the command line or in a pep script, and then writes the result to output. Development began in 2006 in the C language. Pep derives a number of ideas and syntax elements from sed, a command-line text stream editor.

Features

The pattern-parser language uses many ideas taken from sed, the Unix stream editor. For example, sed includes two virtual variables or data buffers, known as the "pattern space" and the "hold space". These two variables constitute an extremely simple virtual machine. In the pep language this virtual machine has been augmented with several new buffers or registers along with a number of commands to manipulate these buffers.
The parsing virtual machine includes a tape data structure as well as a stack, along with a "workspace" (the equivalent of the sed "pattern space") and a number of other buffers of lesser importance. This virtual machine is designed specifically for parsing formal languages. The parsing process traditionally involves two phases: the lexical analysis phase and the formal grammar phase. During the lexical analysis phase a series of tokens is generated. These tokens are then used as the input for a set of formal grammar rules. The Chomski virtual machine uses the stack to hold these tokens and uses the tape structure to hold the attributes of these parse tokens. In a pep script, these two phases, lexing and parsing, are combined in one script file. A series of command words is used to manipulate the different data structures of the virtual machine.

Purpose and motivation

The purpose of the pep tool is to parse and transform text patterns. The patterns are those described by the rules of a formal language, and they include many context-free languages. Whereas traditional Unix tools process text one line at a time and use regular expressions to search or transform it, the pep tool processes text one character at a time and can use context-free grammars to transform it. In keeping with the Unix philosophy, however, the pep tool operates on plain text streams, encoded according to the locale of the local computer, and produces another plain text stream as output, allowing it to be used as part of a standard pipeline.
The motivation for the creation of the pep tool and the virtual machine was to allow the writing of parsing scripts, rather than having to resort to traditional parsing tools such as Lex and Yacc or their many variants and improvements, such as ANTLR.
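
The difference between line-oriented regular expressions and character-by-character processing with a stack can be shown with a small Python sketch. It is not pep code, and the function name brackets_balanced is an assumption made for this illustration. Checking arbitrarily nested brackets is a classic context-free task that classical regular expressions cannot express, whereas a stack handles it in a single pass over the characters.

```python
def brackets_balanced(text):
    """Return True if (), [] and {} in text are properly nested."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in text:                   # one character at a time
        if ch in "([{":
            stack.append(ch)          # remember the opening bracket
        elif ch in pairs:
            # a closing bracket must match the most recent opener
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack                  # leftover openers mean imbalance
```

For example, brackets_balanced("f(a[1], {b: (c)})") is true, while brackets_balanced("(]") is false.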

Usage

The following example shows a typical use of the pep pattern parser, where the -e option indicates that a pattern-parsing expression follows:

$ pep -e 'read; "/" print; clear;' input.c > output.c

In the above script, C multiline comments are deleted from the input stream.
The pattern parser tool was designed to be used as a filter in a pipeline: for example,

$ generate.data | pep -e '"x"print;clear;'

That is, generate the data and then make the small change of replacing x with y. However, this functionality is not currently available: because the pep tool also includes a comprehensive script viewer and debugger, it cannot read from piped standard input.
Several commands can be put together in a file called, for example, substitute.pss, and then applied using the -f option to read the commands from the file:

$ pep -f substitute.pss file > output

Besides substitution, other forms of simple processing are possible. For example, the following uses the accumulator-increment command, along with other commands, to count the number of lines in a file:

$ pep -e '"\n" clear; ' textfile
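
The counting idea behind this example can be sketched in Python. This is an illustration only, not pep code: the function name count_lines and the variable name accumulator are assumptions. The sketch examines the input one character at a time and increments a counter each time a newline is seen.

```python
def count_lines(text):
    """Count newline characters by scanning text one character at a time."""
    accumulator = 0                   # plays the role of a counter register
    for ch in text:
        if ch == "\n":
            accumulator += 1          # one more line has ended
    return accumulator
```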

Complex pep constructs are possible, allowing it to serve as a simple but highly specialised programming language. Pep has two flow-control commands, which jump back to a label.

History

The idea for the pep machine and language arose from the limitations of regular expression engines and of sed, which works line by line, and in particular from the difficulty of parsing nested text patterns with regular expressions. Pep evolved as a natural progression from the grep and sed commands. Development began in approximately 2006 and continues.

Limitations

The pattern parsing script language is not a general-purpose programming language. Like sed, it is designed for a limited type of usage. The interpreter executable does not currently support Unicode strings, since the implementation uses standard C character arrays. However, scripts can also be translated into other languages that do support Unicode text. Since the virtual machine behind the pattern parser language is considerably more complex than that of sed, it is necessary to be able to debug scripts. This facility is currently provided within the 'pep' executable.