From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.1.3 Received: from discorde.inria.fr (discorde.inria.fr [192.93.2.38]) by yquem.inria.fr (Postfix) with ESMTP id D2D12BC69; Wed, 2 May 2007 10:41:55 +0200 (CEST) Received: from ipmail01.adl2.internode.on.net (ipmail01.adl2.internode.on.net [203.16.214.140]) by discorde.inria.fr (8.13.6/8.13.6) with ESMTP id l428fq0X013682; Wed, 2 May 2007 10:41:53 +0200 X-IronPort-AV: E=Sophos;i="4.14,477,1170595800"; d="scan'208";a="122325721" Received: from ppp8-148.lns1.syd7.internode.on.net (HELO [192.168.1.201]) ([59.167.8.148]) by ipmail01.adl2.internode.on.net with ESMTP; 02 May 2007 18:11:49 +0930 Subject: Re: [Caml-list] menhir From: skaller To: Francois.Pottier@inria.fr Cc: caml-list@inria.fr In-Reply-To: <20070502053815.GA726@yquem.inria.fr> References: <1177756336.11923.18.camel@rosella.wigram> <20070428165058.GA31584@yquem.inria.fr> <1177821783.25394.37.camel@rosella.wigram> <20070501155705.GA29617@yquem.inria.fr> <1178039464.8967.10.camel@rosella.wigram> <20070501173409.GB7308@yquem.inria.fr> <1178062950.8967.39.camel@rosella.wigram> <20070502053815.GA726@yquem.inria.fr> Content-Type: text/plain Date: Wed, 02 May 2007 18:41:44 +1000 Message-Id: <1178095304.13660.71.camel@rosella.wigram> Mime-Version: 1.0 X-Mailer: Evolution 2.10.1 Content-Transfer-Encoding: 7bit X-Miltered: at discorde with ID 46384ED0.001 by Joe's j-chkmail (http://j-chkmail . ensmp . fr)! X-Spam: no; 0.00; 0200,:01 tokens:01 tokens:01 ocamlyacc:01 ocamlyacc:01 arises:01 lexer:01 compilation:01 endmarker:01 non-terminal:01 endmarker:01 compilation:01 parser:01 parser:01 syntax:01 On Wed, 2007-05-02 at 07:38 +0200, Francois Pottier wrote: > Hello, > > > That is, instead of treating the token set precisely as the > > set of user tokens + eof, menhir is treating eof specially > > and not consistently with other tokens. > > As far as Menhir is concerned, there is no special eof token. > The set of tokens is exactly the set of user-defined tokens. > This is true also of ocamlyacc. > > Internally, the construction of the automaton uses a pseudo-token, > written #, which stands for the end of the token stream. This token > can appear in conflict explanation messages. Exactly. In Ocamlyacc it is named 'eof', and you can use that token in your productions. > > I'm not sure i fully understand the 'do we need lookahead' > > issue: I would have thought: you need a fetch if your action is > > shift, and not if it is a reduce. > > What you mean is, you need to *consume* one token if your action > is shift, and none if your action is reduce. However, in order to > make a decision (in order to choose between shift and reduce, and > between multiple reduce actions), you sometimes need to *consult* > one lookahead token, without consuming it. > > This is necessary only *sometimes*, because, in some states, only > one reduce action is possible, and can be taken without consulting > a lookahead token. Right. > An end-of-stream conflict arises when a state has multiple > actions, some of which are on real tokens, and some of which are > on the pseudo-token #. In that case, one would like to consult > the next token in order to make a decision; however, there is a > possibility that we are at the end of the sentence that we are > trying to recognize, so asking the lexer for one more token might > be a mistake (in your terminology, it might be an overshoot). > > Does this clarify things? Yes. Your code or my grammar is bugged :) My grammar is compilation_unit: | statement_aster ENDMARKER { $1 } where no non-terminal on which statements_aster depends has a production containing ENDMARKER or # (eof). Therefore, there is no conflict. When compilation_unit is reduced the parser returns, the next token, whether it is # or any other, is irrelevant. The parser is therefore required to match the head of the input stream and NOT the whole stream. It seems Menhir wants to match the whole input, and is confused because it doesn't know what to do after it sees ENDMARKER. That means it doesn't detect that compilation_unit is a start symbol .. and the reduction is final. If you are augmenting the grammar to be: compilation_unit: | statement_aster ENDMARKER eof { $1 } that would explain Menhir's confusion .. however the generated parser actually does work (provided the user defined syntax feature isn't used). IMHO: the end-of-stream thing is a serious bug in the design of Ocamlyacc which unfortunately Menhir duplicates. In particular there should be no possibility of the parser consulting the lexbuf, the signature of parsers is very wrong: parser: (lexbuf -> token) -> lexbuf -> ast is a disaster. Because of this you're duplicating the disaster and probing the lexbuf for end of stream, which is just wrong. There should be no implicit end of stream: a stream is an INFINITE sequence, a parser should process the head of the stream, and the signature should be: parser: (state -> token) -> state -> ast where state is a USER defined type. If the user provides no state argument, then it defaults to unit. This denies the parser any access to a hidden 'end of stream' token. It's up to the user to decide what terminates the parse. The point again is that the token input to the parser is infinite: it can't ever be an error to read a next token. The problem is the usual one for streams. Probably the interface should be changed to something like: peek state n (* look ahead n symbols *) read state (* destructive read 1 symbol *) loc state (* return location for diagnostics *) This would support LR(infinity) parsing. A better interface would support recursive descent too, in that case you'd have to make the 'streaming' persistent to support backtracking, that is you'd have to use: state -> (token * state) as the lexer interface -- that gets rid of the push back problem. (This is trivial to implement on top of a destructive stream, by buffering with lists: Extlib streams do just that). Dypgen actually does a clever trick here, since GLR also requires actions which are purely functional .. from the outside (i.e. persistent): it allow mutations and they propagate locally in the current branch (L-attributes yes?). It also allows global mutations, but they only become permanent when all the branches agree on them. Anyhow the problem seems to reduce to this: I think a parser should read the head of a stream, and Menhir thinks it should read the whole stream, and it uses the bugged lexbuf interface to test for end of stream. So I could fix my grammar by adding 'end-of-stream' to my start symbols AND hack the otherwise unused lexbuf to show end-of-stream after ENDMARKER is read. Ocamlyacc doesn't complain though, suggesting it processes only the head of a stream .. and that there's a fundamental design distinction between the two parser generators. -- John Skaller Felix, successor to C++: http://felix.sf.net