This bizarre little item writes parser programs in Perl.


I wrote it because I wanted a parser for a simple boolean search language. Writing parsers is a nuisance and a pain. I grew up in a UNIX environment, so I was used to having parsers written for me by a program called yacc. yacc reads a description of the language you want to parse, and emits a program, written in C, that parses the language you asked for. You can associate an `action' with every grammatical construct. If you're writing a compiler, the `action' will be to build some sort of internal structure that represents the locution you've parsed, and then later you'll emit machine instructions that implement the meaning of the locution. For my simple search application, a locution like name = Fred would locate all the records in the database whose name field contained Fred. X & Y would mean to compute lists of records for X and for Y, and to find all the records in both lists, and so on.


Unfortunately, yacc's output is in C, and I wanted the output to be in Perl. Now there's a version of yacc, called byacc, that can emit Perl code, but it either didn't exist yet, or I didn't know about it.

One option would have been to give up. I'm pretty stubborn, so I didn't do that.

Another option would have been to write a yacc replacement whose output would have been in Perl. That would have been a lot of work. I'm pretty lazy, so I didn't do that either.

Instead, I chose an option that I think not many people would have thought of.

The parsers produced by yacc are table-driven. yacc analyzes the grammar and constructs a state machine that reads input tokens one at a time. Depending on the state and on the next token, there are essentially two things it can do: It can put the token on a token stack for later, or it can take the top few tokens on the stack and reduce them to a single token, performing some appropriate action at the same time. For example, X & Y might reduce to expr; similarly expr | Z might also reduce to expr. The parsers that come out of yacc have two parts. One part is a totally canned driver, which reads tokens, consults the state tables, and manages the stack. This is the same in every parser. The other part is the state tables; they say when to push tokens on the stack, when to perform reductions, and what actions to invoke when reducing.

Replacing yacc would have required writing a canned driver in Perl, and writing a program that could examine a grammar and figure out the right state tables. But if I could figure out how to steal someone else's state tables and translate to Perl, I'd only have to write the canned driver and the state table stealer. That's what I did do.

I had bison, the FSF GNU project version of yacc. If you supply the -v option to bison, it produces a voluminous debugging output, which includes, among other things, details about exactly what the states of the state machine look like; from this you can infer all the tables you need to drive the driver. I wrote a program, py, which does exactly this. It's not big. It's about 120 lines long.

Then I wrote the canned driver. It's about 70 lines long, and about one-third of that is for producing output in debugging mode.

Using bison like this meant that my parser programs could take advantage of all the cool complicated features that Bison supports. It also means that people familiar with yacc can use it right away with hardly any study.

To use it, you write a grammar suitable for input to bison, or maybe you already have one sitting around because you're porting an application from C. You run bison -v on the grammar file; this produces a diagnostic output file. You run py on the diagnostic output. This produces a file with the state tables. Then you write your Perl program. Your Perl program should require the file with the state tables and the file with the canned driver. In your program, you call the function yyparse to parse the input; it returns true if the input is gramatically correct according to your grammar, false otherwise. yyaparse is the main function of the canned driver. Along the way, yyparse invokes your action subroutines at the right times. You have to write those yourself. When it needs a token, it calls a function yylex; you have to write that yourself too. In short, the programmer's interface is a lot like yacc's.

The other thing that this might be good for is that if you wanted to see how a table-driven parser works, it's a lot easier to look at the Perl version than the C version. I learned about parsing by running the C debugger on yacc parsers. It would have been easier if they had been written in Perl.

See It

Ok, enough yabbling.


Well, it's an awfully silly idea, but it certainly did work, and I've been happy with it. If I got to hacker heaven and they asked me what hacks I'd done that would justify letting me in, I think I'd mention py. It was clever, unusual, interesting, and satisfied the required constraints. It was a silly project, but a very successful one.

Return to: Universe of Discourse main page | What's new page | Perl Paraphernalia