Message-ID: <20040315171004.2252.qmail@plover.com>
To: Timothy Fletcher <tim@xxx.xx>
Subject: Re: py info 
Date: Mon, 15 Mar 2004 12:10:04 -0500
From: Mark Jason Dominus <mjd@plover.com>

> i started by 
> using bison -v to see what output i would be getting from it, and i 
> couldn't understand a thing :(.

It's easy to understand if you understand the parsing algorithm.
Otherwise, it's difficult.  I learned the parsing algorithm from
running the C debugger on yacc parsers, but I wouldn't recommend that.

I'll try to summarize what's happening.

The parser has two main parts.  It has a state machine, and it has a
stack.  Items on the stack have three parts:

        1. a grammar symbol
        2. the value of the symbol
        3. the state that the parser was in when it pushed the symbol
           onto the stack

The parser repeats the following loop over and over:

        1. look ahead at the next token on the input stream
        2. Consult a table to decide what to do next.

The table is indexed by the identity of the upcoming token, and by the
current state of the state machine.

There are essentially four different things that the parser can do at
each step.  The simplest thing to do is to halt successfully.  The
next simplest is to signal a parse error.

Of the interesting behaviors, the simpler is to do a "shift", which
just means that the next symbol is removed from the input stream, and
pushed onto the stack.  The current state is pushed onto the stack at
the same time.

The other thing the parser can do is 'reduce', which corresponds to
replacing the right-hand side of a grammar production with the
left-hand side.  In this case, the symbols on the stack that
correspond to the RHS of the production are popped, and a value for
the LHS is computed.  The parser examines the state of the new top
symbol, and uses it to index another table, which tells it what new
state to go into.  Then it pushes the LHS symbol and its value and the
new state onto the stack.

Here's the simplest example grammar I can think of:

        %token NUMBER

        %%
        expr : NUMBER '+' expr { $$ = $1 + $3 } 
             | NUMBER          { $$ = $1 }
             ;

I'll now explain the bison .output file:

        Grammar
        rule 1    expr -> NUMBER '+' expr
        rule 2    expr -> NUMBER

This is for debugging; it says what the gammar rules are.  Also, later
on it'll refer to the rules by number.

        Terminals, with rules where they appear

        $ (-1)
        '+' (43) 1
        error (256)
        NUMBER (257) 1 2

'terminals' are 'terminal symbols' which are produced by the lexer.
'error' is a special fake symbol that the parser sometimes inserts
when there is an error.  It's an advanced feature, so ignore it. '$'
represents the end-of-input condition.  Inside of bison, tokens are
represented by integers.  The lexer function will return one of these
integers to indicate what sort of token it has found; it will also set
a global variable (I think named 'yylval') to contain the value of the
token, if it has one.  In this example grammar, NUMBER has a value
which is a number. 

        Nonterminals, with rules where they appear

        expr (5)
            on left: 1 2, on right: 1

The grammar has only one nonterminal symbol, which is 'expr'.

The rest of the .output file describes the state machine and what it
will do on every possible input from every possible state.  Initially,
the state machine is in state 0.

        state 0

            NUMBER	shift, and go to state 1

            expr	go to state 4

Each state has two or three sections.  Here the first section is
omitted for state 0; we will see it in the explanation of the next
state.  The second section says what to do if the upcoming token has a
certain type.  In this case, if the machine is in state 0 and the
upcoming token is a NUMBER, the token is shifted (pushed onto the
stack) and the state machine goes into state 1.  The description
doesn't say so explicitly, but if the upcoming token is anything else,
the parser will signal an error.

This is what we expect, since valid inputs to this parser must always
begin with a NUMBER.

The third section says what to do next if the parser finds itself back
here after reducing an 'expr':  it should go into state 4 and continue.

        state 1

            expr  ->  NUMBER . '+' expr   (rule 1)
            expr  ->  NUMBER .   (rule 2)

            '+' 	shift, and go to state 2

            $default	reduce using rule 2 (expr)

Here we see all three sections.  The first section describes the place
in the input that the parser expects to be while in this state.  The
'.' indicates the current read position in the input stream.   The
notations here say that when the parser is in state 1, it is either
because it is in the middle of processing rule 1 of the grammar, and
has already read in the first NUMBER, and is expecting to see a '+'
sign followed by an 'expr', or else it has finished processing rule 2,
and has already read in the NUMBER.

If the next input token is a '+', the parser concludes that it's
actually in the 'rule 1' case rather than the 'rule 2' case.  It
pushes the '+' onto the stack and continues in state 2.   If it sees
anything else coming up (such as '$', the end-of-input token) it
"reduces using rule 2", which, as you will recall, is

        rule 2    expr -> NUMBER

This says that an 'expr' can be a plain NUMBER.  The parser has seen
the NUMBER and has it on the stack.  It pops the number off and pushes
an 'expr' symbol on the stack in its place.  At this point it runs the
associated semantic action to determine the value of the 'expr'.  The
semantic action is:

        { $$ = $1 }

$1 here refers to the value of the NUMBER; $$ is the resulting value.
We said that the value of this expression is the same as the value of
the number it comprises.  

        state 2

            expr  ->  NUMBER '+' . expr   (rule 1)

            NUMBER	shift, and go to state 1

            expr	go to state 3

The parser will be in state 2 after having seen the NUMBER and '+'
tokens of an expression of the form 'NUMBER + expr'.  If it sees
another number, it pushes the number onto the stack and returns to
state 1.  If it sees anything else coming up, , it signals an error.

Let's pause for a moment and look at an example input.  Say the input is

        NUMBER 3
        '+'
        NUMBER 4
        $

The parser starts in state 0.  The next token is 'NUMBER', so the
parser shifts the NUMBER onto the stack and goes into state 1.

In state 1, the next token is '+', so the parser shifts the '+' onto
the stack and goes into state 2.

In state 2, the next token is 'NUMBER', so the parser shifts the
NUMBER onto the stack, and goes back into state 1.

At this point, the stack has

        top-> NUMBER (4) (state 2)
              '+'        (state 1)
              NUMBER (3) (state 0)

Now we're in state 1, and the next token is '$', which isn't
explicitly listed among the shiftable tokens.  So the parser takes the
'$default' action, which is "reduce using rule 2".

Rule 2 has:

        rule 2    expr -> NUMBER

The parser pops the stack and get the value of 4 from the popped
symbol.  It runs the associated semantic action, { $$ = $1 }.  $1 is
4, so $$ is also 4.  It pushes the new 'expr' symbol onto the stack
with an associated semantic value of 4.  The stack has:

        top-> expr (4)   (state 2)
              '+'        (state 1)
              NUMBER (3) (state 0)

Finally, the parser looks at the state of the last-popped symbol,
which was 2, and checks the state 2 tables to see what transition is
appropriate when it has just reduced an 'expr'.  State 2 says:

            expr	go to state 3

so the parser does.  It is now in state 3.

        state 3

            expr  ->  NUMBER '+' expr .   (rule 1)

            $default	reduce using rule 1 (expr)

The upcoming token is '$', so the parser chooses the default action,
which is to reduce using rule 1.  Here's rule 1:

        rule 1    expr -> NUMBER '+' expr

There are three symbols on the RHS, so the parser pops three items off
the stack; they are indeed a number, a '+', and an 'expr'.  $1 becomes
the value of the NUMBER (3).  $2 becomes the value of the '+'
(garbage).  $3 becomes the value of the expr (4).  The semantic action
associated with rule 1 is executed.  It is 

        { $$ = $1 + $3 } 

so the value of the new symbol is 7.  A new symbol of type 'expr' (the
LHS of the rule) and value 7 is pushed on the stack:

        top-> expr (7)   (state 3)

Moreover, as in the last reduction, the parser looks at the state of
the last-popped symbol, which was 0, and checks the state 0 tables to
see what transition is appropriate when it has just reduced an
'expr'.  State 0 says:

            expr	go to state 4

so the parser does.

In state 4, the upcoming token is still '$', the end-of-input token,
and state 4 says:

        state 4

            $   	go to state 5

The parser is now in state 5.  From there, it goes to state 6.

        state 5

            $   	go to state 6


In state 6, the parser 'accepts', which means that it signals success:

        state 6

            $default	accept

The value of the entire expression is the value at the top of the
stack, in this case 7.

There are a couple of other points of interest.  Sometimes one writes
a grammar which is ambiguous.  For example, consider:

        expr = expr '+' expr 
             | NUMBER
             ;

If the input is (NUMBER 3) + (NUMBER 4) + (NUMBER 5), then there is an
ambiguity in parsing.  Here's what the parser will do.  First it will
shift the NUMBER (3) and immediately reduce it to an 'expr'.  Then it
will shift the '+', and then it will shift the NUMBER (4) and
immeditely reduce that to an 'expr'.  The stack now has:

        top -> expr (4)
               '+'
               expr (3)
       
The upcoming token is another '+'.  The parser has two possible
strategies.  It could reduce the tokens on the stack immediately, or
it could shift the upcoming '+' immediately and reduce later.

The 'reduce first' strategy goes through the following stacks:

        (reduce by rule   expr -> expr '+' expr)

        top -> expr (7)

        top -> '+'
               expr (7)

        top -> NUMBER (5)
               '+'
               expr (7)

        (reduce by rule   expr -> NUMBER)

        top -> expr (5)
               '+'
               expr (7)

        (reduce by rule   expr -> expr '+' expr)

        top -> expr (12)

The 'shift first' strategy foes through the following stacks:

        top -> '+'
               expr (4)
               '+'
               expr (3)

        top -> NUMBER (5)
               '+'
               expr (4)
               '+'
               expr (3)

        (reduce by rule   expr -> NUMBER)

        top -> expr (5)
               '+'
               expr (4)
               '+'
               expr (3)

        (reduce by rule   expr -> expr '+' expr)

        top -> expr (9)
               '+'
               expr (3)

        (reduce by rule   expr -> expr '+' expr)

        top -> expr (12)


In this case the results are the same.  But if the operator had been
subtraction instead of addition, the results would have been
different.  The "reduce first" strategy would have said that "3 - 4 -
5" was -6, and the "shift first" strategy would have said that "3 - 4
- 5" was 4.  In essence, the first strategy interprets the ambiguous
expressions "3 + 4 + 5" as "(3 + 4) + 5" and the second as "3 + (4 +
5)".  This ambiguity is called a "shift/reduce" conflict.  By
defeault, yacc resolves it in favor of the shift.  Yacc has directives

        %left '+'

which says that '+' should be left-associative, which juts means that
this shift/reduce conflict is resolved in favor of reduction instead.  

        %right '+'

says that '+' should be right-associative, so this shift/reduce conflict
is resolved in favor of 'shift'.  (This is the default, but the %right
directive also suppresses the warning that you would otherwise get.)
Finally, there's a

        %nonassoc '+'

which says that the shift/reduce conflict should be resolved by
generating a parse error.

I hope this clears up enough of the details that you can trace through
some more complicated bison .output files and understand what it going
on.  Equipped with this description, yo umay also be able to step
through the py parser's runs using the Perl debugger and follow what
is happening with the stack and the state machine.

The 'py' approach was very successful, and it would probably be easy
to do in PHP.  I recommend it. 

Let me know if you have any questions.