pijnu

text parsing & processing tool
— a parser generator, and more

pijnu is mainly:

  • written in and for python
  • PEG-based
  • intended to be clear, easy, practical

pijnu is also:

  • fully & pleasantly usable while still a work in progress
  • rather original, I guess
  • what happens when a parsing library meets a parser generator

overall pointers

documents

presentation: this page
user tutorial guide
reference manual: TODO

release


download & release history page

useful literature

What is Parsing Expression Grammar? See also the section on "grammar & parser" below for a first overview.

wikipedia article on PEG
PEG & packrat parser
thesis on PEG & packrat parsing
slides by Bryan Ford
disadvantages of PEG
university course on PEG

examples

Sample uses of pijnu for real or complex text parsing & processing applications.

  • Booleano: flexible, natural language, logic language

grammar & parser

pijnu's syntax is a custom, extended version of Parsing Expression Grammar (PEG — see links above), which is itself a kind of mix of BNF and regular expressions.
The major difference is that PEG expresses string recognition patterns, while BNF and regexes express string generation. As a consequence, PEG is better suited for parsing tasks. A PEG grammar clearly encodes the algorithm for parsing a source string, and simply needs to be rewritten as a parser in a programming language.
This is precisely pijnu's approach: the user writes a grammar using a (hopefully clear, simple & practical) PEG-like language, and pijnu provides a ready-to-use parser in python.
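As a taste of how PEG recognition works, here is a minimal sketch of its ordered (prioritized) choice in plain Python; the helper names are invented for illustration and are not pijnu's API:

```python
# Sketch of PEG ordered choice; helper names are invented, not pijnu's API.
# A parser is a function: (text, pos) -> (matched, new_pos) or None.

def literal(s):
    def parse(text, pos):
        if text.startswith(s, pos):
            return s, pos + len(s)
        return None
    return parse

def choice(*alternatives):
    def parse(text, pos):
        for alt in alternatives:        # try each alternative in order...
            result = alt(text, pos)
            if result is not None:      # ...and commit to the first match
                return result
        return None
    return parse

# Unlike a CFG's unordered '|', PEG's '/' is deterministic:
p = choice(literal("ab"), literal("a"))
print(p("abc", 0))   # ('ab', 2)
```

Because the first matching alternative wins, PEG grammars are unambiguous by construction, which is what makes them directly executable as parsers.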

pijnu-generated parsers use a library to parse source texts. This library mainly holds:

  • pattern classes
  • a Node type for the resulting parse tree
  • custom exception types providing usable feedback
  • various & efficient testing methods, for in-progress grammars/parsers or parts of them
  • builtin transformations (see below)
  • builtin source text pre-process functions

The parser is produced from the grammar by a generator which itself "meta-uses" the library to first parse the user's grammar. [For the record: all but the first generator were themselves "meta-produced" by previous versions of the generator; pijnu is bootstrapped.]

### simple binary arithmetic with '+' and '*' only
formula
<definition>
# tokens
    ADD        : '+'
    MULT       : '*'
    LPAREN     : '('
    RPAREN     : ')'
    digit      : [0..9.]
    number     : digit+
# operations
    mult       : number MULT number
    add        : (mult/number) ADD (add/mult/number)
    formula    : add / mult / number

post-process & transformations

A parsing phase produces a parse tree in which every node was yielded by a pattern. Simple leaf nodes hold the matched string snippet while branch nodes contain a sequence of child nodes. A major issue in text processing applications is that a raw parse tree is far from having a form well suited for further processing.

Using pijnu, one can do far more than get a parse tree. The grammar allows assigning transformation funcs to patterns, which are then applied to every node they generate. Numerous built-in transformations are provided to easily restructure the resulting parse tree and/or modify specific nodes.
Moreover, a user can write custom functions right inside the grammar, which are then applied to directly perform further processing. This method is both very simple and highly powerful. In most cases, one gets final results "like magic".

For instance, to compute the actual result from the above formula grammar, one needs only 2 single-line funcs, one for each operation. The result of the parsing/processing is then the result of the expressed formula.
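A sketch of what those two funcs could look like, using a dict as an assumed stand-in for pijnu's Node type (the node layout here is hypothetical, for illustration only):

```python
# Hypothetical transform funcs for the formula grammar; each node is an
# assumed dict stand-in for pijnu's Node type, for illustration only.

def doAdd(node):
    node['value'] = node['value'][0] + node['value'][1]

def doMult(node):
    node['value'] = node['value'][0] * node['value'][1]

# "2*3+4" parses as add(mult(2 3) 4); transforms apply bottom-up:
mult_node = {'value': [2, 3]}
doMult(mult_node)                                  # mult node now holds 6
add_node = {'value': [mult_node['value'], 4]}
doAdd(add_node)
print(add_node['value'])                           # 10
```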

Another example that generates XHTML from wiki-text styled lines (possibly nested), using a single 3-line function:

### parse wiki-text styled lines and rewrite them into XHTML
wikInline
<toolset>
def styledSpan(node):
    klass = node.typ
    text = node.value
    node.value = '<span class="%s">%s</span>' %(klass,text)

<definition>
# codes
    ESCAPE         : '~'
    DISTINCT       : "//"                                : drop
    IMPORTANT      : "**"                                : drop
    styleCode      : (DISTINCT / IMPORTANT)
# character expression
    escChar        : ESCAPE ('*' / '!' / '/' / ESCAPE)   : join
    validChar      : [\x20..\xff  !!/!*~]
    rawText        : (escChar / (!styleCode validChar))+ : join
# text kinds
    distinctText   : DISTINCT inlineText DISTINCT        : liftValue
    importantText  : IMPORTANT inlineText IMPORTANT      : liftValue
    styledText     : distinctText / importantText        : styledSpan
    inlineText     : (styledText / rawText)+             : @ join

The right-hand column assigns transformations to patterns. drop, join, and liftValue are builtin. styledSpan is a custom transformation. '@' denotes a recursive pattern.
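To see styledSpan at work in isolation, here is a minimal mock of the node it receives (MockNode is not pijnu's Node class; only the typ and value attributes used above are modeled):

```python
class MockNode:
    # Minimal stand-in for a pijnu parse node -- only the two
    # attributes styledSpan actually touches are modeled.
    def __init__(self, typ, value):
        self.typ = typ
        self.value = value

def styledSpan(node):
    klass = node.typ
    text = node.value
    node.value = '<span class="%s">%s</span>' % (klass, text)

n = MockNode('importantText', 'really?')
styledSpan(n)
print(n.value)   # <span class="importantText">really?</span>
```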

practical use

See user tutorial guide for details.

As a tool, pijnu is hopefully clear and efficient for the user.
It provides highly informative feedback — maybe even too much ;-) — about patterns, results and exceptions.
Custom extensions to PEG help in defining legible grammars — there may be more in the future. There are also pre-processing functions and configuration parameters that may be worthwhile in practical cases, but they still need to be fully integrated (see the section on "evolution" below).

Typically, a user defines the grammar, imports the generator and lets it write a corresponding parser. This parser comes in the form of a python module from which a parser object can be imported. This parser object, as well as any of its patterns, can be used to match a source text partially or completely, find the first or all occurrences of matches, or replace found matches. In most cases, standard or custom transformations will restructure and further process the resulting parse tree.

from pijnu import generator
generator.writeParser(myGrammar)
from myGrammarParser import myGrammarParser
myGrammarParser.match(source)

New: It is now possible to produce a parser directly from the command line using the gen.py module (which may later be renamed to pijnu.py):

python gen.py myGrammar.pijnu myParser.py

evolution

Here is a sketch of pijnu's possible future evolution, as a list of todos, questions & issues.

licence — DONE

for freedom of use & personal protection against stupid lawsuits
Add a SUPER CLEAR & VISIBLE disclaimer to all source files and docs. Release either as public domain, or under a free licence like the GPL.

done: GPL v3 — copyright in all files except for docs and samples

machine run time — LEFT OUT

should pijnu be optimized for speed?
pijnu has been designed as a clear and practical tool. Every trade-off was made according to these principles, which should allow for fast use, in human time. Conversely, machine run time speed has never been considered as a relevant goal. As a consequence, pijnu seems to be very slow.
I personally have no skill for this, and no motivation as of now. Comments, critical use cases, & contributions welcome.

package & import — DONE

a clear packaging and importing system
There is no packaging using __init__.py file(s) as of now. While I still haven't figured out how they are supposed to help the developer or the user (everything works fine anyway), it may be a good idea at least to comply with standards.
The whole app should be split into library, generator, samples, and doc folders. Imports should be clarified to allow importing either the generator to create parsers, or the library alone for exploration (usually the lib is not needed, as parsers import it transparently), or both.
Also, there should be a way to let the user import all names from library modules directly through import *; which means that a lib module, maybe __init__.py, should import all of them.

done: the result matches the description above

command line — DONE

add simple & fast command line mode

   python pijnu.py grammarFileName

should generate a parser in grammarTitleParser.py
   python pijnu.py grammarFileName parserFileName

should generate a parser in parserFileName
To be added to pijnu.py, or maybe __init__.py, as the top-level source file. (actually done in gen.py because of import mess)
— 5 minutes' work ;-) (actually a bit more, because of the import mess)

test — HELP ME?

a consistent testing framework
The whole package is full of ad hoc test suites. A main issue is that standard testing tools (at least the ones I know of) are totally impractical for an app like this, where both input and output are typically big, complicated texts.
At the very minimum, the tests should be cleaned up a bit, reformatted when needed to ensure a common interface, and integrated into one or more main test control funcs, probably one for the library and one for the generator.
Maybe I don't have enough experience to do this the best way: contributions welcome!

samples — HELP ME!

a collection of sample applications
Contributions welcome! ;-)

line continuation — TODO

a continuation code to allow patterns to be expressed on 2 or more lines
This is even more pertinent for pijnu, due to the 3rd column for transformations. Also, not everybody prefers (as I do) splitting complex expressions into sub-patterns, even if this has obvious advantages in terms of simplicity, clarity, and naming.
The probable syntax would use '...', possibly repeated at the start of the continuation line:

pat   : (a long pattern format) / (even longer here) / ...
        ... (continued on next line)                 

-- or --

pat   : (a long pattern format) / (even longer here) / ...
        (continued on next line)

better Word & Klass format — TODO

use double characters instead of \ escape
Currently, "unsafe" characters, such as a quote inside a Word or a right bracket inside a Klass definition, are escaped using a backslash:

quotedFoobar    : "\"foobar\""
delimiters      : [()[\]{}]

I wish to change this to a repetition of the character, doubling it so as to get:

quotedFoobar    : """foobar"""
delimiters      : [()[]]{}]

This has from my point of view the following advantages:

  • It's the same rule as for special characters: the visual separator " " and the exclusion code "!!".
  • It's natural, safe, and even obvious, for there is no point in repeating characters in classes!
  • It avoids confusion and tricky issues with decimal/hex character encoding, as well as with python's own escaping format (and regex's, by the way).
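Decoding the doubling rule is also trivial; a sketch (undouble is a hypothetical helper, not part of pijnu):

```python
def undouble(text, char):
    # Decode the proposed doubling rule: each doubled occurrence of
    # char stands for a single literal one (hypothetical helper).
    return text.replace(char * 2, char)

# Body of the proposed Word """foobar""" is ""foobar"":
print(undouble('""foobar""', '"'))   # "foobar"
```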

list transform — DONE

builtin transformation for separated lists
Should yield a simple sequence of items, without the need for separate drop, liftValue, and extract transformations. Should the separators be dropped beforehand (they usually are dropped anyway, since the same ones are often used in numerous patterns)? Or not at all (in which case they often need to be copied without drop)? Or should the transform be flexible & adapt to both cases?

Done: intoList extracts matched items into a flat sequence, meaning it yields a simple branch node. The minimum number of items can be 1, 2, or more. Moreover, intoList is able to cope with cases where separators were dropped earlier, or not, producing in both cases a flat list with items only.

The issue is that the parse tree is messy:
    itemList    : item (SEP item)*
    ==>
    itemList:[item  [[SEP  item]  [SEP  item]...]]
The only solution is a combination of drop (on SEP), liftValue (on each additional item), and extract (on the sequence of additional items).

We can now get directly what we want using 'intoList':
    itemList    : item (SEP item)*            : intoList
    ==>
    itemList:[item  item  item...]

'intoList' will also deal with patterns for
operations that allow more than 2 operands:
    operation    : oper OP oper (OP oper)*    : intoList
    ==>
    operation:[oper  oper  oper  oper...]
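What intoList does can be sketched on plain Python lists, using an assumed tree shape for illustration (not pijnu's actual Node type):

```python
SEP = ','   # stand-in separator token

def into_list(raw):
    # raw models the messy tree: [first, [[SEP, item], [SEP, item], ...]]
    # or, when SEP was dropped earlier: [first, [item, item, ...]].
    # (Assumed shapes for illustration, not pijnu's actual Node type.)
    items = [raw[0]]
    for group in raw[1]:
        if isinstance(group, list):    # [SEP, item] pair: keep the item
            items.append(group[-1])
        else:                          # separator already dropped
            items.append(group)
    return items

print(into_list(['a', [[SEP, 'b'], [SEP, 'c']]]))   # ['a', 'b', 'c']
print(into_list(['a', ['b', 'c']]))                 # ['a', 'b', 'c']
```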

AnyChar — TODO?

integrate AnyChar pattern for '.' in grammar
It was left out earlier and I forgot to put it back — mainly because I never use it. See ValidChar below to see why.
pijnu probably still needs it if only for people used to other variants of PEG.

ValidChar — TODO

a variant of AnyChar to allow only valid characters
The issue is that in most cases not every char is a valid char.

The pattern class is ready. Just find a nice way to integrate it in the grammar.
The klass of valid characters is specified through a config parameter, actually the attribute ValidChar.CONFIG. So ValidChar should be "officially" introduced together with configuration, below.

A nice solution may be to combine it with AnyChar: when no character class is specified, the pattern acts like AnyChar, else it acts like ValidChar. Then, we don't need any special sign in pijnu language for ValidChar.

"STOP repetition" — LEFT OUT

add an "until" STOP condition to repetitions
(see this topic in the guide, section practical use -> more on patterns -> LookAhead)
The purpose is to avoid ugly things like:

pat : (!"HALT" [a..z A..Z])+ "HALT"
pat : (!"X"{3} ("X" / "Y" / "Z"))* "X"{3}

And have the following instead, using the sign '>':
pat : [a..z A..Z]+>"HALT"
pat : ("X" / "Y" / "Z")*>"X"{3}

But in practice the need for it seems to be rather rare.
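For comparison, the same "repeat until STOP" effect can be sketched with Python's re module, where the negative lookahead (?!...) plays the role of pijnu's ! prefix:

```python
import re

# pat : (!"HALT" [a..z A..Z])+ "HALT"  -- sketched as a regex, where
# (?!HALT) is the negative-lookahead counterpart of pijnu's !"HALT":
pat = re.compile(r'(?:(?!HALT)[a-zA-Z])+HALT')

print(pat.match('stopatHALT').group())   # stopatHALT
print(pat.match('HALT'))                 # None: needs a letter before HALT
```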

config — TODO

config parameters inside the grammar itself
There are some configuration parameters to allow a certain (sensible?) level of customization. Typically, they are attributes of the type concerned.
First write a reference of all existing ones:

  • pattern, node, and exception output
  • pre-process
  • special patterns such as AnyChar/ValidChar ('.').

Need more?
Then have a syntax to specify them inside the grammar, possibly anywhere and repeatedly, since some may change (e.g. ValidChar's PATTERN param). This would mean (& look like) setting an attribute on the proper type:

   Node.SHOW_TREE = True

Some may define a pattern, so they must themselves be parsed:
   ValidChar.PATTERN = [\x20..\x7e  \n\t]

pre-process — LEFT OUT

allow specifying pre-process functions in grammar
To be applied to the source before parsing, in order to help keep the grammar clear and/or avoid context-dependent issues. Typical cases:

  • normalize separators to a single form (e.g. all whitespace to a single space)
  • transform indented structure to wrapped structure (inside block -open & -close tokens)
  • join or split lines according to continuation or separator tokens

Some built-in pre-process funcs already exist in the package, actually in the preprocess.py module: they can be called manually on the source before it is passed to the parser.
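As a sketch of the first case, separator normalization, assuming a hypothetical helper name (the actual funcs in preprocess.py may differ):

```python
import re

def normalize_separators(source):
    # Collapse every run of spaces/tabs into a single space; a sketch
    # with a hypothetical name -- the real preprocess.py funcs may differ.
    return re.sub(r'[ \t]+', ' ', source)

print(normalize_separators('a  \t b\tc'))   # a b c
```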

Pre-processes would form another section of the grammar, coming between <toolset> and <definition>. As with transformations, some pre-process funcs would be built in, while custom ones would be defined in the <toolset> section.

Problem: the calls to pre-processes need to be collected into the parser object in order to be properly applied before parsing. The issue is that this cannot work when an individual pattern is accessed & used for matching instead of the parser itself. This may lead to confusion due to seemingly inconsistent results or even 'random' parse failures.
Thus, it may be better to forget all this, simply document the built-in pre-processes, and let the user call these or custom functions 'manually'.

recursion — LEFT OUT

a func to detect recursive patterns

To avoid the need for the user to add a special code '@' denoting such patterns. Not a major issue — in fact, this can well be seen as a help for the user to better understand the grammar… (I actually think so).
There may also be a func to detect 'wrong' recursion: when a recursive pattern referenced at the left edge of its own definition leads to infinite recursion.
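Such a detector could be sketched as a cycle search over each rule's leftmost references, using a simplified grammar model (a plain dict, not pijnu's internal pattern objects):

```python
def find_left_recursion(first_refs):
    # first_refs: rule name -> names reachable in leftmost position.
    # (A simplified grammar model, not pijnu's internal pattern objects.)
    def reaches(start, current, seen):
        for ref in first_refs.get(current, ()):
            if ref == start:
                return True
            if ref not in seen:
                seen.add(ref)
                if reaches(start, ref, seen):
                    return True
        return False
    return sorted(r for r in first_refs if reaches(r, r, set()))

# 'expr' references itself in leftmost position: infinite recursion.
grammar = {'expr': {'expr'}, 'term': {'number'}, 'number': set()}
print(find_left_recursion(grammar))   # ['expr']
```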


Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License