text parsing & processing tool
— a parser generator, and more
pijnu is mainly:
- written in and for python
- PEG-based
- intended to be clear, easy, practicle
pijnu is also:
- fully & pleasantly usable while still a work in progress
- rather original, I guess
- what happens when a parsing library meets a parser generator
|
Table of Contents
|
overall pointers
documents
presentation: this page
user tutorial guide
reference manual: TODO
release
download & release history page
useful literature
What is Parsing Expression Grammar? See also below section on "grammar & parser" for a first overview.
wikipedia article on PEG
PEG & packrat parser
thesis on PEG & packrat parsing
slides by Bryan Ford
disadvantages of PEG
university course on PEG
examples
Sample use of pijnu for real or complex test parsing & processing applications.
- Booleano: flexible, natural language, logic language
grammar & parser
pijnu syntax is a custom, extended version of Parsing Expression Grammar (PEG — see links above); which itself is a kind of mix of BNF and regular expressions.
The major difference is that PEG is a grammar to express string recognition patterns, while BNF or regex express string generation. As a consequence, PEG is better suited for parsing tasks. A PEG grammar clearly encodes the algorithm to parse a source string, that simply needs to be rewritten into a parser coded in a programming language.
This is precisely pijnu's approach: the user writes a grammar using a (hopefully clear, simple & practicle) PEG-like language, then pijnu provides for ready-to-use parser in python.
pijnu generated parsers use a library to parse source texts. This library mainly holds
- pattern classes
- a Node type for the resulting parse tree
- custom exception types providing usable feedback
- various & efficient testing methods, for in-progress grammars/parsers or parts of them
- builtin transformations (see below)
- builtin source text pre-process functions
The parser is produced from the grammar by a generator which, indeed, itself "meta-uses" the library to first parse the user grammar. [For the story, all but the first generator were themselves "meta-produced" by previous versions of the generator: pijnu is bootstrapped].
### simple binary arithmetics with '+' and '*' only
formula
<definition>
# tokens
ADD : '+'
MULT : '*'
LPAREN : '('
RPAREN : ')'
digit : [0..9.]
number : digit+
# operations
mult : number MULT number
add : (mult/number) ADD (add/mult/number)
formula : add / mult / number
post-process & transformations
A parsing phase produces a parse tree in which every node was yielded by a pattern. Simple leaf nodes hold the matched string snippet while branch nodes contain a sequence of child nodes. A major issue in text processing applications is that a raw parse tree is far from having a form well suited for further processing.
Using pijnu, one can do far more that getting a parse tree. The grammar allows assigning transformation funcs to patterns, that will then be applied to every generated node. Numerous in-built transformations are provided in order to easily restructure the resulting parse tree and/or modify specific nodes.
Moreover, a user can write custom functions right inside the grammar that will then be applied to directly perform further processing. This is a both very simple and highly powerful method. In most cases, one can get final results "like magic".
For instance, to compute the actual result from the above formula grammar, one needs only 2 single-line funcs: one for each operation indeed. Then, the result of the parsing/processing process is the result of the expressed formula.
Another example that will generate XHTML from wiki-text styled lines (possibly nested), using a single 3-lines function:
### parse wiki-text styled lines and rewrite them into XHTML
wikInline
<toolset>
def styledSpan(node):
klass = node.typ
text = node.value
node.value = '<span class="%s">%s</span>' %(klass,text)
<definition>
# codes
ESCAPE : '~'
DISTINCT : "//" : drop
IMPORTANT : "**" : drop
styleCode : (DISTINCT / IMPORTANT)
# character expression
escChar : ESCAPE ('*' / '!' / '/' / ESCAPE) : join
validChar : [\\x20..\\xff !!/!*~]
rawText : (escChar / (!styleCode validChar))+ : join
# text kinds
distinctText : DISTINCT inlineText DISTINCT : liftValue
importantText : IMPORTANT inlineText IMPORTANT : liftValue
styledText : distinctText / importantText : styledSpan
inlineText : (styledText / rawText)+ : @ join
The column on right side assigns transformations to patterns. drop, join, and liftValue are builtin. styledSpan is a custom transformation. '@' denotes a recursive pattern.
practical use
See user tutorial guide for details.
As a tool, pijnu is hopefully clear and efficient for the user.
It provides highly informative feedback — maybe too much ;-) even — about patterns, results and exceptions.
Custom extensions from PEG help defining legible grammars — there may be more in the future. There are also pre-processing functions and configuration parameters that may be worthful in practicle cases, but still need be fully integrated (see below section on "evolution").
Typically, a user will define the grammar, import the generator and let it write a corresponding parser. This parser comes in the form of a python module from which a parser object can be imported. The said parser object as well as any of its patterns can be used to match a source text partially or completely, find first or all ocurrences of matches, or replace found matches. In most cases, standard or custom transformations will restructure and further process the resulting parse tree.
from pijnu import generator
generator.writeParser(myGrammar)
from myGrammarParser import myGrammarParser
myGrammarParser.match(source)
New: It is now possible to directly produce a parser from the command line using the gen.py module (later may be renamed to pijnu.py):
python gen.py myGrammar.pijnu myParser.py
evolution
Here is a sketch of possible future evolution of pijnu as a list of todoes, questions & issues.
licence — DONE
for freedom of use & personal protection against stupid lawsuits
Add SUPER CLEAR & VISIBLE disclaimer to all source files and docs. Release either as public, or with free licence like GPL.
done: GPL v3 — copyright in all files except for docs and samples
machine run time — LEFT OUT
should pijnu be optimized for speed?
pijnu has been designed as a clear and practical tool. Every trade-off was made according to these principles, which should allow for fast use, in human time. Conversely, machine run time speed has never been considered as a relevant goal. As a consequence, pijnu seems to be very slow.
I personly have no skill for this; and also no motivaton as of now. Comments, critical use cases, & contributions welcome.
package & import — DONE
a clear packaging and importing system
There is no packaging using __init.py__ file(s) as of now. While I still haven't figured out how they are supposed to help the developper or the user (all works fine anyway), it may be a good idea at least to comply with standards.
The whole app should be split into library, generator, samples, and doc folders. Imports should be clarified to allow importing either the generator to create parsers, or else the library rather for exploration (usually the lib is not needed, for the parsers import it transparently), or even both.
Also, there should be a way to allow the user import all names from library modules directly through import * ; which means that a lib module, maybe __init__.py, should import all of them.
done: the result matches the description above
command line — DONE
add simple & fast command line mode
python pijnu.py grammarFileName
should generate a parser in grammarTitleParser.py
python pijnu.py grammarFileName parserFileName
should generate a parser in parserFileName
To be added in pijnu.py, or maybe {{__init__.py}}, top-level source file. (actually in gen.py because of import mess)
— 5 minute work ;-) (actually was a bit more because of import mess)
test — HELP ME?
a consistent testing framework
The whole package is full of ad hoc test suites. A main issue is that standard testing tools (at least the ones I know of) are totally unpracticle with such an app for which both input and output typically are big and complicated texts.
At the very minimum, the tests may be cleaned a bit; reformatted when needed to ensure a common interface; and integrated into one or more main test control funcs, probably one for the library and one for the generator.
Maybe I have not enough experience to do this the best way: contributions welcome!
samples — HELP ME!
a collection of sample applications
Contributions welcome! ;-)
line continuation — TODO
a continuation code to allow patterns be expressed on 2 or more lines
This is even more pertinent for pijnu, due to the 3rd column for transformations. Also, not everybody preferes (as I do) spliting complex expressions into sub-patterns; even if this has obvious advantages in terms of simplicity, clarity, and naming.
Probable syntax would use '...', possibly repeted at start of continuation line:
pat : (a long pattern format) / (even longer here) / ...
... (continued on next line)
-- or --
pat : (a long pattern format) / (even longer here) / ...
(continued on next line)
better Word & Klass format — TODO
use double characters instead of \ escape
Currently, "unsafe" characters, such a quote inside a Word or a right-bracket inside a Klass definition, are escaped using a backslash:
quotedFoobar : "\"foobar\""
delimiters : [()[\]{}]
I wish to change this to a repetition of the character, doubling it so as to get:
quotedFoobar : """foobar"""
delimiters : [()[]]{}]
This has from my point of view the following advantages:
- It's the same rule as for special characters: the visual separator " " and the exclusion code "!!".
- It's natural, safe, and even obvious, for there is no point in repeting characters in classes!
- It avoids confusion and tricky issues with decimal/hex character encoding, as well as with python's own escaping format (and regex's, by the way).
list transform — DONE
builtin transformation for separated lists
Should yield a simple sequence of items, without the need of separate drop, liftValue, and extract transformations. Should the separators be dropped before (they usually are dropped anyway for the same ones are often used in numerous patterns)? Or not at all (then often need be copied witout drop)? Or the transform should be flexible & adapt to both cases?
Done: intoList extracts matched items into a flat sequence, meaning it yields a simple branch node. The minimum number of items can be 1, 2, or more. Moreover, intoList is able to cope with cases where separators are dropped earlier, or not, producing in both cases a flat list wth items only.
The issue is that the parse tree is messy:
itemList : item (SEP item)*
==>
itemList:[item [[SEP item] [SEP item]...]]
The only solution is a combination of drop (on SEP), liftValue
(on each additional item), extract (on the sequence of additional items).
We can now get directly what we want using 'intoList':
itemList : item (SEP item)* : intoList
==>
itemList:[item item item...]
'intoList' will also deal with patterns for
operations that allow more than 2 operands:
operation : oper OP oper (OP oper)* : intoList
==>
operation:[oper oper oper oper...]
AnyChar — TODO?
integrate AnyChar pattern for '.' in grammar
Has been left out earlier and forgot to put it back — mainly for I never use it. See ValidChar below to know why.
pijnu probably still needs it if only for people used to other variants of PEG.
ValidChar — TODO
a variant of AnyChar to allow only valid characters
The issue is that in most cases not any char is a valid char.
The pattern class is ready. Just find a nice way to integrate it in the grammar.
The klass of valid characters is specified through a config parameter: actually the attribute ValidChar.CONFIG. So ValidChar shoud be "officially" introduced together with configuration below.
A nice solution may be to combine it with AnyChar: when no character class is specified, the pattern acts like AnyChar, else it acts like ValidChar. Then, we don't need any special sign in pijnu language for ValidChar.
"STOP repetition" — LEFT OUT
add an "until" STOP condition to repetitions
(see this topic in the guide, section practicle use -> more on patterns -> LookAhead)
The purpose is to avoid ugly things like:
pat : (!"HALT" [a..z A..Z])+ "HALT"
pat : (!"X"{3} ("X" / "Y" / "Z"))* "X"{3}
And have the following instead, using the sign '>' :
pat : [a..z A..Z]+>"HALT"
pat : ("X" / "Y" / "Z")*>"X"{3}
But in practice the need for it seems to be rather rare.
config — TODO
config parameters inside the grammar itself
There are some configuration parameters to allow a certain (sensible?) level of customization. Typically, they are attributes of the concerned type.
First write a reference of all existing ones:
- pattern, node, and exception output
- pre-process
- special patterns such as AnyChar/ValidChar ('.').
Need more?
Then have a syntax to specify them inside the grammar; possibly anywhere and repetitively, for some may change (e.g. ValidChar's PATTERN param). This would mean (& look like) setting an attribute on the proper type:
Node.SHOW_TREE = True
Some may define a pattern, so they must be parsed, indeed:
ValidChar.PATTERN = [\x20..\x7e \n\t]
pre-process — LEFT OUT
allow specifying pre-process functions in grammar
To be applied on source before parsing, in order to help having a clear grammar and/or avoid context-dependant issues. Typical cases:
- normalize separators to a single form (eg all whitespace to a single space)
- transform indented structure to wrapped structure (inside block -open & -close tokens)
- join or split lines according to continuation or separator tokens
Some in-built pre-process funcs exist already in the package, actually in the preprocess.py module: they can be manually called to be applied on the sorce bedore it is passed to the parser.
Pre-processes would build another section of the grammar, coming between <toolset> and <definition>. Like for the case of transformations, some pre-process funcs would be in-built, while the definition of custom ones would come in the <toolset> section.
Problem: the calls to pre-processes need to be collected into the parser object in order to be properly applied before parsing. The issue is that this cannot work when an individual pattern is accessed & used for matching, instead of the parser itself. This may lead to confusion due to seemingly unconsistent results or even 'random' parse failures.
Thus, it may be better to forget this all, simply document the builtin pre-processes and let the user call these builtin functions or custom ones 'manually'.
recursion — LEFT OUT
a func to detect recursive patterns
To avoid the need for the user to add a special code '@' denoting such patterns. Not a major issue — in fact, this can well be seen as a help for the user to better understand the grammar… (I actually think it).
There may also be a func to detect 'wrong' recursion: when a recursive pattern written on left side of a format leads to infinite recursion.
.
.
.





