From: Markus Leypold <leypold@informatik.uni-tuebingen.de>
To: caml-list@inria.fr
Date: Mon, 25 Sep 2000 10:23:31 GMT
Subject: ocamlyacc: Bug ? Meaning of $end ?

Dear Friends,

I have found some points in the implementation of ocamlyacc which I am not quite sure I understand correctly, or whether this behaviour might even be called a bug. Let me show my example first.

There are three modules: the lexer Cword, the parser Cwords2imf, and the test driver tcw. I have stripped the sample code down to the bare minimum that seems necessary to demonstrate the effect that puzzles me.

The lexer divides the stream of input characters into words, runs of white space, newlines, and an end-of-file mark.

cword.mll:
-----------------------------------
{ open Cwords2imf }

let newline    = '\n'
let normalchar = [^ '\n' '/' '*' ' ' '\t']
let space      = [' ' '\t']
let word       = normalchar+

rule nextword = parse
    word    { WORD(Lexing.lexeme lexbuf) }
  | space   { SPACE(Lexing.lexeme lexbuf) }
  | newline { NEWLINE }
  | eof     { END }
  | _       { WORD(Lexing.lexeme lexbuf) }
----------------------------------

The parser is supposed to collect the tokens into lines and newline marks.

cwords2imf.mly:
-----------------------------------
%token NEWLINE END
%token <string> WORD
%token <string> SPACE
%start main
%type <string> main
%%
main :
    END      { "X\n" }
  | NEWLINE  { "N\n" }
  | linefrag { "L " ^ $1 ^ "\n" };

linefrag :
    SPACE linefrag { $1 ^ $2 }
  | WORD           { $1 }
  |                { "**" };
--------------------------------------

The test driver invokes the parser on stdin just once (reading one semantic value).

tcw.ml:
--------------------------------------
open Cwords2imf;;
let lexbuf = Lexing.from_channel stdin;;
print_string (Cwords2imf.main Cword.nextword lexbuf);;
--------------------------------------

Now I call tcw and feed it some input like the following:

bla bla foobar[newline]
blubb hello[newline]
[eof]

where the text in [] is to be interpreted, not taken verbatim.

So far this works, and I can even call parser+lexer repeatedly on the input stream, thus reading one 'sentence' from the input every time and obtaining its semantic value. But if I replace the rules for linefrag with

linefrag :
    SPACE linefrag { $1 ^ $2 }
  | WORD  linefrag { $1 ^ $2 }
  |                { "**" };

everything goes down the drain, and I get a parse error, even if I call parser+lexer only once.
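(By "calling parser+lexer repeatedly" I mean nothing more than a loop like the following, using the convention that main returns "X\n" once END has been read. This is a sketch of my driver, not the stripped-down tcw.ml above.)

--------------------------------------
(* Read one 'sentence' per call until main reports END. *)
let () =
  let lexbuf = Lexing.from_channel stdin in
  let rec loop () =
    let s = Cwords2imf.main Cword.nextword lexbuf in
    print_string s;
    if s <> "X\n" then loop ()   (* stop after the END sentence *)
  in
  loop ()
--------------------------------------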
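One workaround that suggests itself is to let main consume the NEWLINE itself, so that linefrag need never be followed directly by the end of input. This is an untested sketch: I would expect ocamlyacc to report a shift/reduce conflict on NEWLINE in the start state (empty linefrag vs. the bare NEWLINE rule), resolved as a shift, which is what I want for empty lines. Note also that it would turn a last line that is not terminated by a newline before [eof] into a parse error:

--------------------------------------
main :
    END              { "X\n" }
  | NEWLINE          { "N\n" }
  | linefrag NEWLINE { "L " ^ $1 ^ "\n" };

linefrag :
    SPACE linefrag { $1 ^ $2 }
  | WORD  linefrag { $1 ^ $2 }
  |                { "**" };
--------------------------------------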
So I started to look into the generated state machine in the .output file, but things became even more confusing to me. In both variants of the grammar, the only place where the parser action is 'accept' is the following:

state 2
        $accept : %entry% . $end  (0)

        $end  accept

It is a bit irritating that it doesn't also say '. error', but never mind. Ah, I then thought, that is the problem: 'accept' is only done if the matched phrase is followed by the end of the input. $end, I thought, is a pseudo-token which signifies the end of input from the lexer (the original yacc behaves like this, I think).

-> But then I cannot explain why I get a match in the first case that is not obtained in the second case. Is the generation of the parse tables erroneous?

-> What is the meaning of $end? It cannot be the end of input, since from the example in the manual I understand that a parser/lexer combination may be called repeatedly on an input stream. On the other hand, to match just the longest prefix sentence of the input, I would rather expect a state with '. accept' after entries that check, for every token, whether it is possible to continue parsing, i.e. 'WORD shift 17'.

-> My original idea was to collect the tokens into larger tokens and pass those on to another parser, i.e. to let the first parser act as part of a (handwritten) lexer for a second parser. Now I have read something about 'the parsers not being reentrant'? Is that true? Does that mean the calling sequence

   parser1 -> function1 (as lexer to parser1) -> parser2 -> lexer2

will not work?

Can anyone comment on this, possibly showing me THE LIGHT :-) ?

Regards -- Markus
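P.S. To make the intended calling sequence concrete, here is the kind of composition I have in mind. Parser2, its entry point top, and its tokens EOF, EMPTY and LINE are hypothetical names, and the whole thing is an untested sketch (it also assumes top returns a string):

--------------------------------------
(* Drive a second ocamlyacc parser with a "lexer" that itself runs
   the line parser above on the real lexbuf.  The string patterns
   match the semantic values produced by Cwords2imf.main. *)
let () =
  let lexbuf = Lexing.from_channel stdin in
  let line_token _lb =
    (* parser2's engine passes the lexbuf back in; we use the
       captured one, which is the same buffer anyway *)
    match Cwords2imf.main Cword.nextword lexbuf with
    | "X\n" -> Parser2.EOF        (* END was read     *)
    | "N\n" -> Parser2.EMPTY      (* a bare newline   *)
    | s     -> Parser2.LINE s     (* a collected line *)
  in
  print_string (Parser2.top line_token lexbuf)
--------------------------------------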