ocamlyacc: Bug ? Meaning of $end ?

Mailing list for all users of the OCaml language and system.
 help / color / mirror / Atom feed

From: Markus Leypold <leypold@informatik.uni-tuebingen.de>
To: caml-list@inria.fr
Subject: ocamlyacc: Bug ? Meaning of $end ?
Date: Mon, 25 Sep 2000 10:23:31 GMT	[thread overview]
Message-ID: <200009251023.KAA21436@neuromancer.informatik.uni-tuebingen.de> (raw)

Dear Friends,

I have found some points in the implementation of ocamlyacc,
which I'm not quite sure, wether I understand right, or wether
this behaviour might even be called a bug.

Let me show my example first: There are 3 modules: The Lexer Cword, 
the parser Cword2imf and the test driver tcw. I have stripped the
sample code down to the bare minimum which seems to be necessary to
demonstrate the effect which puzzles me.

The Lexer divides the stream of input characters into convenient
words, runs or white space and newline and end-of-file-marks.

cword.mll:
-----------------------------------
{
open Cwords2imf
}

let  newline    = '\n'
let  normalchar = [^ '\n' '/' '*' ' ' '\t']
let  space      = [' ' '\t']
let  word       = normalchar+

rule nextword = parse
       word      {WORD(Lexing.lexeme lexbuf)}
     | space     {SPACE(Lexing.lexeme lexbuf)}
     | newline   {NEWLINE}
     | eof       {END}
     | _         {WORD(Lexing.lexeme lexbuf)}

----------------------------------

The parser is supposed to collect the tokens into lines and newline marks.

cwords2imf.mly:
-----------------------------------

%token          NEWLINE END
%token <string> WORD 
%token <string> SPACE
%start          main
%type  <string> main 
%%

main
    : END       {"X\n"}
    | NEWLINE   {"N\n"}
    | linefrag  {"L "^$1^"\n"};

linefrag
    : SPACE linefrag {$1^$2}
    | WORD           {$1}
    |                {"**"};

--------------------------------------

The Testdriver invokes the parser on stdin just once (reading one semantic
value).

tcw.ml:
--------------------------------------
open Cwords2imf;;

let lexbuf = (Lexing.from_channel stdin);;
print_string (Cwords2imf.main 
               Cwords.nextword
               lexbuf);;
--------------------------------------

Now I call tcw and feed the some input like the following to it

bla bla foobar[newline]
blubb hello[newline]
[eof]

where the text in [] is to be interpreted (not to be taken as
verbatim).

So far this works and I can even call the parser+lexer repeatedly on
the input streams, thus reading a 'sentence' from the input every time
and obtaining it's semantic value.  But if I replace the rule about
linefrag with

linefrag
    : SPACE linefrag {$1^$2}
    | WORD  linefrag {$1^$2}
    |                {"**"};

everything goes down the drain, and I get a parse error, even if I call
parser+lexer only once. 

So I started to look into the generated statemachine in the .output file,
but things became even more confusing to me:

In both variants of the grammar, the only place, where the parser action
is 'accept', is the following:

state 2
        $accept : %entry% . $end  (0)

        $end  accept

It's a bit irritating, that ist doesn't say '. error' also, but never mind.
Ah, then, I thought, that is the problem: 'accept' is only done, if the
matched phrase is followed by the end of the input. $end, I thought,
is a pseudotoken, which signifies the end of input from the lexer
(the original yacc behaves like this, I think). 

-> But than i cannot explain, why in the first case, I get a match,
   which ist not obtained in the second case. Is the generation
   of parse tables erronous ? 

-> What is the meaning of $end ? It cannot be the end of input,
   since from the example in the manual I understand, that a
   parser/lexer-combination might be called on an input stream 
   repeatedly. 

   On the other side, to match just the longest prefix sentence 
   in the input, I'd more expect a state with '. accept'  after
   entries, which check all the tokens, wether it is possible to 
   continue parsing ie 'WORD shift 17'.

-> My original idea was to collect the tokens to larger tokens, and to
   pass them to another parser, ie letting the parser act as part of a
   (handwritten) lexer to another parser.  Now I read something about
   'the parsers not being reentrant' ? Is that true ? Does that mean
   the calling sequence

   parser1 -> function1 (as lexer to parser1)
           -> parser2
           -> lexer2

   will not work ?

Can anyone comment on this, possibly showing me THE LIGHT :-) ?

Regards -- Markus

                 reply	other threads:[~2000-09-25 17:21 UTC|newest]

Thread overview: [no followups] expand[flat|nested]  mbox.gz  Atom feed

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200009251023.KAA21436@neuromancer.informatik.uni-tuebingen.de \
    --to=leypold@informatik.uni-tuebingen.de \
    --cc=caml-list@inria.fr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox