* menhir
@ 2007-04-28 10:32 skaller
  2007-04-28 16:50 ` [Caml-list] menhir Francois Pottier
  0 siblings, 1 reply; 20+ messages in thread
From: skaller @ 2007-04-28 10:32 UTC (permalink / raw)
  To: caml-list
Just a note I just built the Felix parser with Menhir.
First, it detected some duplicate definitions Ocamlyacc didn't! Good!
Second, I got a "rather a lot" of states have end-of-stream conflicts.
What's that about?
Third the generated ml file was 4.5 Meg. 
Ocamlopt on amd64 hung for so long I almost posted a bug report
for Ocamlopt, but finally it finished. This was a 95% CPU, 25% memory
job, so no paging. I'd guess it took 100x times longer than
compilation of the ocamlyacc file (which is just a bunch of numbers :)
I didn't measure it .. no biggie for me now I know, but my
box is a LOT faster than some of the boxes my product gets built on.
After that, Felix built ok, and the parser worked for
'pure' code. However it failed when Felix preprocessor
syntax extensions were used (which is 90% of all programs).
Now, most of the system data transport for this is properly
built so it can't cause any problems. The one thing which 
is hacked is the pushback detection.
Basically: when Ocamlyacc reduces a production, it sometimes
ends on the last token, and sometimes it overshoots by 1.
My grammar uses a system like:
exprx:
  expr expr_terminator { $1,$2 }
statementsx:
  | statement_aster statements_terminator { $1, $2 }
This is saying: a 'special expression' exprx is an expression
PLUS one of the tokens which will solidly terminate an 
expression AND NOT ITSELF BE OVERSHOT by the parser.
In other words when exprx is parsed the reduction must
leave the next token unread.
The semantics used are: when the exprx is processed the
action arranges to push the terminator token back
into the token stream.
Perhaps because Menhir is LR(1) not LALR(1), this technique
is failing. Or Menhir may simply be looking ahead further
than required in the token stream.
Whichever way, I am depending on this particular implementation
detail of Ocamlyacc, and Mehir is using a different implementation.
Any suggestions how to 'fix' this?
-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net
^ permalink raw reply	[flat|nested] 20+ messages in thread* Re: [Caml-list] menhir 2007-04-28 10:32 menhir skaller @ 2007-04-28 16:50 ` Francois Pottier 2007-04-28 19:47 ` Markus Mottl 2007-04-29 4:43 ` skaller 0 siblings, 2 replies; 20+ messages in thread From: Francois Pottier @ 2007-04-28 16:50 UTC (permalink / raw) To: skaller; +Cc: caml-list Hello, On Sat, Apr 28, 2007 at 08:32:16PM +1000, skaller wrote: > Second, I got a "rather a lot" of states have end-of-stream conflicts. > What's that about? That's precisely about "overshooting" the end of the token stream (reading one too many tokens). The issue is described in the manual. > Third the generated ml file was 4.5 Meg. Ocamlopt on amd64 hung for so long > I almost posted a bug report for Ocamlopt, but finally it finished. Yup, I am somewhat disappointed that ocamlopt does not seem to have linear time complexity, but I shouldn't complain too loud, my boss may be listening :) An option to generate tables like ocamlyacc would clearly be useful. > Basically: when Ocamlyacc reduces a production, it sometimes > ends on the last token, and sometimes it overshoots by 1. I don't know the details of your grammar, but our (perhaps naive) view is that you should modify your grammar to avoid end-of-stream conflicts (and Menhir's conflict reports will help you do that). Then, Menhir will not overshoot. > The semantics used are: when the exprx is processed the > action arranges to push the terminator token back > into the token stream. How is this done? It is possible that this hack is not compatible with Menhir, because Menhir keeps a local cache of the next token, which is not properly updated when you push your old token back. -- François Pottier Francois.Pottier@inria.fr http://cristal.inria.fr/~fpottier/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] menhir 2007-04-28 16:50 ` [Caml-list] menhir Francois Pottier @ 2007-04-28 19:47 ` Markus Mottl 2007-04-28 21:15 ` Jon Harrop 2007-04-29 4:43 ` skaller 1 sibling, 1 reply; 20+ messages in thread From: Markus Mottl @ 2007-04-28 19:47 UTC (permalink / raw) To: Francois.Pottier; +Cc: skaller, caml-list On 4/28/07, Francois Pottier <Francois.Pottier@inria.fr> wrote: > Yup, I am somewhat disappointed that ocamlopt does not seem to have linear > time complexity, but I shouldn't complain too loud, my boss may be listening > :) This is unfortunate indeed. I have also run into problems with the compilers exhibiting quadratic behavior on very large sources. It seems that this mostly stems from the use of data structures in the compiler that scale badly on certain operations (e.g. searching through lists rather than in balanced trees). I think it wouldn't be difficult to update the compiler to use more efficient data structures, but it would certainly be a hell lot of work... Regards, Markus -- Markus Mottl http://www.ocaml.info markus.mottl@gmail.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] menhir 2007-04-28 19:47 ` Markus Mottl @ 2007-04-28 21:15 ` Jon Harrop 0 siblings, 0 replies; 20+ messages in thread From: Jon Harrop @ 2007-04-28 21:15 UTC (permalink / raw) To: caml-list On Saturday 28 April 2007 20:47, Markus Mottl wrote: > I think it wouldn't be difficult to update the compiler to use more > efficient data structures, but it would certainly be a hell lot of > work... I've had a quick look at this before. Some of our compiler compilers were causing ocamlopt to stack overflow and the problem turned out to be a pair of mutually recursive functions in different modules of the type checkers that were not tail recursive. I couldn't see an easy solution without rewriting the whole thing in CPS. Perhaps a program to convert OCaml code into CPS style would be useful? -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. The F#.NET Journal http://www.ffconsultancy.com/products/fsharp_journal/?e ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] menhir 2007-04-28 16:50 ` [Caml-list] menhir Francois Pottier 2007-04-28 19:47 ` Markus Mottl @ 2007-04-29 4:43 ` skaller 2007-04-29 7:27 ` Christophe Raffalli 2007-05-01 15:57 ` Francois Pottier 1 sibling, 2 replies; 20+ messages in thread From: skaller @ 2007-04-29 4:43 UTC (permalink / raw) To: Francois.Pottier; +Cc: caml-list On Sat, 2007-04-28 at 18:50 +0200, Francois Pottier wrote: > > Third the generated ml file was 4.5 Meg. Ocamlopt on amd64 hung for so long > > I almost posted a bug report for Ocamlopt, but finally it finished. > > Yup, I am somewhat disappointed that ocamlopt does not seem to have linear > time complexity, but I shouldn't complain too loud, my boss may be listening > :) > > An option to generate tables like ocamlyacc would clearly be useful. Although that defeats one of the future advantages of Menhir: embedding. To understand what I mean you might look at the Felix parser generator. In Felix you can write parsers in any scope, and the user actions bind to the containing scope. This means you can think about mixing programmatic recursive descent with the parser automaton. An executable recursive descent parser is easily extended dynamically (but has woeful error handling ;( For Menhir, embedding would be nice. However even without it being able to use recursion and actually pass arguments -- which ocamlyacc does not support -- should allow decoupling of some of your grammar into smaller subgrammars, and this in turn could reduce compile time. It is possible (but I'm no expert!!) that the non-linearity in compiling Menhir generated code is due to a very large let rec/and/... being generated. Some extra analysis might fix that, i.e. using recursion only when necessary, and let/and/in otherwise (though I have no idea how to do that). For embedding the 'one token overshoot' is a bit nasty. OTOH with GLR parser, extra functionality is gained, and user actions must have no side effects, so the idea to 'push back' the overshoot token cannot be used (since it is a side-effect), at least without some thought.. hmmm .. An extensible GLR+ parser would be interesting. This is an extensible list of GLR parsers that run in 'parallel', using the GLR idea, but the list being dynamically extensible. Exactly how you'd manage it i don't know. The theoretical underpinnings should be the same as for open-recursion idea: make a grammar with a parameter representing extra productions, then you can fix it to close the grammar. This should at least be possible statically (at compile time). > > Basically: when Ocamlyacc reduces a production, it sometimes > > ends on the last token, and sometimes it overshoots by 1. > > I don't know the details of your grammar, but our (perhaps naive) view is that > you should modify your grammar to avoid end-of-stream conflicts (and Menhir's > conflict reports will help you do that). Then, Menhir will not overshoot. Actually in my grammar there is a definite user defined ENDMARKER for the main grammar, and 'end-of-expression' and 'end-of-statements' set of tokens for the extensible parts. The input is not a file, it's a list of tokens built by other means. So the conflicts are spurious: end of stream can't be parsed (but of course Menhir doesn't know that). This is good, because Mehir CANNOT detect end of stream, since my lexbuf is a dummy and is not used at all. > > The semantics used are: when the exprx is processed the > > action arranges to push the terminator token back > > into the token stream. > > How is this done? The lexer function is bound to a class object which contains a mutable list of tokens. The actual lexbuf is ignored, its a dummy. So pushing token onto the list is just ts <- last_token :: ts :) > It is possible that this hack is not compatible with Menhir, > because Menhir keeps a local cache of the next token, which is not properly > updated when you push your old token back. That's what I would have guessed. In order to use an LR parser recursively, which I do, it must be possible to control the lookahead token somehow. My code just uses an implementation detail of Ocamlyacc discovered by accident, and the detail differs for Menhir. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] menhir 2007-04-29 4:43 ` skaller @ 2007-04-29 7:27 ` Christophe Raffalli 2007-05-01 15:57 ` Francois Pottier 1 sibling, 0 replies; 20+ messages in thread From: Christophe Raffalli @ 2007-04-29 7:27 UTC (permalink / raw) To: caml-list Dear list members, > > OTOH with GLR parser, extra functionality is gained, > and user actions must have no side effects, so > the idea to 'push back' the overshoot token cannot be used > (since it is a side-effect), at least without some thought.. > hmmm .. > dypgen uses GLR and allows side effect, because it keeps a state associated to each stack. Actually two states, one global on one "top down": typically the top down state can be used to carry information about bound variables down in the parse tree. > An extensible GLR+ parser would be interesting. This is > an extensible list of GLR parsers that run in 'parallel', > using the GLR idea, but the list being dynamically extensible. > Exactly how you'd manage it i don't know. > dypgen allows the action to extend the grammar (in a precise scope) About table/code size, dypgen can use (the user choose) LR0, LALR or LR1 automaton to drive the parser. Moreover, priorities can be handled (this is the default) outside the automaton, giving very small tables. Christophe Raffalli ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] menhir 2007-04-29 4:43 ` skaller 2007-04-29 7:27 ` Christophe Raffalli @ 2007-05-01 15:57 ` Francois Pottier 2007-05-01 17:11 ` skaller 2007-05-01 17:15 ` skaller 1 sibling, 2 replies; 20+ messages in thread From: Francois Pottier @ 2007-05-01 15:57 UTC (permalink / raw) To: skaller; +Cc: caml-list Hello, On Sun, Apr 29, 2007 at 02:43:03PM +1000, skaller wrote: > > An option to generate tables like ocamlyacc would clearly be useful. > > Although that defeats one of the future advantages of Menhir: > embedding. To understand what I mean you might look at the Felix > parser generator. In Felix you can write parsers in any scope, > and the user actions bind to the containing scope. I don't see why there is a relationship between the style of code generation (code versus tables) and embedding. Embedding in Menhir is supported indirectly via %parameter directives. > It is possible (but I'm no expert!!) that the non-linearity > in compiling Menhir generated code is due to a very > large let rec/and/... being generated. Some extra analysis > might fix that, i.e. using recursion only when necessary, > and let/and/in otherwise (though I have no idea how to > do that). I have tried that, but it does not help. Most (say, 95%) of the functions generated by Menhir are truly mutually recursive. > So the conflicts are spurious: end of stream can't be parsed > (but of course Menhir doesn't know that). This is good, > because Mehir CANNOT detect end of stream, since my lexbuf > is a dummy and is not used at all. Perhaps the terminology "end-of-stream conflict" is badly chosen. These conflicts are not just about detecting the end of the stream: they are about recognizing a non-terminal symbol without overshooting, that is, without unnecessarily requesting a lookahead token (a request which, if the end of stream has been reached, could be unsatisfiable, or could be blocking). These conflicts are not "spurious": just like shift/reduce or reduce/reduce conflicts, they will cause trouble. -- François Pottier Francois.Pottier@inria.fr http://cristal.inria.fr/~fpottier/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] menhir 2007-05-01 15:57 ` Francois Pottier @ 2007-05-01 17:11 ` skaller 2007-05-01 17:34 ` Francois Pottier 2007-05-01 17:15 ` skaller 1 sibling, 1 reply; 20+ messages in thread From: skaller @ 2007-05-01 17:11 UTC (permalink / raw) To: Francois.Pottier; +Cc: caml-list On Tue, 2007-05-01 at 17:57 +0200, Francois Pottier wrote: > > So the conflicts are spurious: end of stream can't be parsed > > (but of course Menhir doesn't know that). This is good, > > because Mehir CANNOT detect end of stream, since my lexbuf > > is a dummy and is not used at all. > > Perhaps the terminology "end-of-stream conflict" is badly chosen. These > conflicts are not just about detecting the end of the stream: they are about > recognizing a non-terminal symbol without overshooting, that is, without > unnecessarily requesting a lookahead token (a request which, if the end of > stream has been reached, could be unsatisfiable, or could be blocking). > > These conflicts are not "spurious": just like shift/reduce or reduce/reduce > conflicts, they will cause trouble. They're not technically spurious, but for my grammar they're spurious because the input language isn't arbitrary, it always has an ENDMARKER at the end of the token stream, and the parser never processes any symbols past that. It can't hit 'end of stream' because that's a state of the lexbuf .. and the lexbuf is is a dummy which is never updated or modifed -- so it can't ever be in the end-of-stream state. So basically, menhir is saying "what happens if you hit end of stream here? What should be done?" and the answer is "it can't happen". BTW: Ocamlyacc doesn't report any conflicts, so this is a minor incompatibility. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] menhir 2007-05-01 17:11 ` skaller @ 2007-05-01 17:34 ` Francois Pottier 2007-05-01 23:42 ` skaller 0 siblings, 1 reply; 20+ messages in thread From: Francois Pottier @ 2007-05-01 17:34 UTC (permalink / raw) To: skaller; +Cc: caml-list > So basically, menhir is saying "what happens if you hit > end of stream here? What should be done?" and the answer > is "it can't happen". What I am trying to say is, Menhir doesn't ask this question. It asks: "what happens if I reach that state? should I request a lookahead token from the lexer, or shouldn't I?" -- François Pottier Francois.Pottier@inria.fr http://cristal.inria.fr/~fpottier/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] menhir 2007-05-01 17:34 ` Francois Pottier @ 2007-05-01 23:42 ` skaller 2007-05-02 5:38 ` Francois Pottier 0 siblings, 1 reply; 20+ messages in thread From: skaller @ 2007-05-01 23:42 UTC (permalink / raw) To: Francois.Pottier; +Cc: caml-list On Tue, 2007-05-01 at 19:34 +0200, Francois Pottier wrote: > > So basically, menhir is saying "what happens if you hit > > end of stream here? What should be done?" and the answer > > is "it can't happen". > > What I am trying to say is, Menhir doesn't ask this question. It asks: > "what happens if I reach that state? should I request a lookahead token > from the lexer, or shouldn't I?" Perhaps I do not fully understand but: I do not use 'eof' (end-of-stream pseudo token). I get this for an unused token: "Warning: the token WHENCE is unused." and I should get the same for eof IMHO (but possibly suppressed message since it's not a user defined token). What i actually get is 145 end-of-stream conflicts .. but i get no conflicts on WHENCE. That is, instead of treating the token set precisely as the set of user tokens + eof, menhir is treating eof specially and not consistently with other tokens. There's no conflict: in every one of these states if 'eof' turns up its a syntax error, exactly the same as if WHENCE turns up. I'm not sure i fully understand the 'do we need lookahead' issue: I would have thought: you need a fetch if your action is shift, and not if it is a reduce. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] menhir 2007-05-01 23:42 ` skaller @ 2007-05-02 5:38 ` Francois Pottier 2007-05-02 5:50 ` Francois Pottier 2007-05-02 8:41 ` skaller 0 siblings, 2 replies; 20+ messages in thread From: Francois Pottier @ 2007-05-02 5:38 UTC (permalink / raw) To: skaller; +Cc: caml-list Hello, > That is, instead of treating the token set precisely as the > set of user tokens + eof, menhir is treating eof specially > and not consistently with other tokens. As far as Menhir is concerned, there is no special eof token. The set of tokens is exactly the set of user-defined tokens. This is true also of ocamlyacc. Internally, the construction of the automaton uses a pseudo-token, written #, which stands for the end of the token stream. This token can appear in conflict explanation messages. > I'm not sure i fully understand the 'do we need lookahead' > issue: I would have thought: you need a fetch if your action is > shift, and not if it is a reduce. What you mean is, you need to *consume* one token if your action is shift, and none if your action is reduce. However, in order to make a decision (in order to choose between shift and reduce, and between multiple reduce actions), you sometimes need to *consult* one lookahead token, without consuming it. This is necessary only *sometimes*, because, in some states, only one reduce action is possible, and can be taken without consulting a lookahead token. An end-of-stream conflict arises when a state has multiple actions, some of which are on real tokens, and some of which are on the pseudo-token #. In that case, one would like to consult the next token in order to make a decision; however, there is a possibility that we are at the end of the sentence that we are trying to recognize, so asking the lexer for one more token might be a mistake (in your terminology, it might be an overshoot). Does this clarify things? -- François Pottier Francois.Pottier@inria.fr http://cristal.inria.fr/~fpottier/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] menhir 2007-05-02 5:38 ` Francois Pottier @ 2007-05-02 5:50 ` Francois Pottier 2007-05-02 8:41 ` skaller 1 sibling, 0 replies; 20+ messages in thread From: Francois Pottier @ 2007-05-02 5:50 UTC (permalink / raw) To: skaller; +Cc: caml-list On Wed, May 02, 2007 at 07:38:15AM +0200, I wrote: > Internally, the construction of the automaton uses a pseudo-token, > written #, which stands for the end of the token stream. This token > can appear in conflict explanation messages. Actually, I should say: # stands for the end of a sentence that we are trying to recognize. That is, if S is a start symbol, then Menhir builds a new start symbol S' with the production S' -> S #. As a result, when we reach a state where a reduce action ("reduce production p") has lookahead symbol #, this means: "in this state, perhaps we have consumed a sentence derived from S; in that case, we should reduce production p and not read anything more". It does not mean that we have reached the end of the physical token stream, only that we have reached the end of what we were supposed to read. -- François Pottier Francois.Pottier@inria.fr http://cristal.inria.fr/~fpottier/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] menhir 2007-05-02 5:38 ` Francois Pottier 2007-05-02 5:50 ` Francois Pottier @ 2007-05-02 8:41 ` skaller 2007-05-02 12:30 ` Francois Pottier 1 sibling, 1 reply; 20+ messages in thread From: skaller @ 2007-05-02 8:41 UTC (permalink / raw) To: Francois.Pottier; +Cc: caml-list On Wed, 2007-05-02 at 07:38 +0200, Francois Pottier wrote: > Hello, > > > That is, instead of treating the token set precisely as the > > set of user tokens + eof, menhir is treating eof specially > > and not consistently with other tokens. > > As far as Menhir is concerned, there is no special eof token. > The set of tokens is exactly the set of user-defined tokens. > This is true also of ocamlyacc. > > Internally, the construction of the automaton uses a pseudo-token, > written #, which stands for the end of the token stream. This token > can appear in conflict explanation messages. Exactly. In Ocamlyacc it is named 'eof', and you can use that token in your productions. > > I'm not sure i fully understand the 'do we need lookahead' > > issue: I would have thought: you need a fetch if your action is > > shift, and not if it is a reduce. > > What you mean is, you need to *consume* one token if your action > is shift, and none if your action is reduce. However, in order to > make a decision (in order to choose between shift and reduce, and > between multiple reduce actions), you sometimes need to *consult* > one lookahead token, without consuming it. > > This is necessary only *sometimes*, because, in some states, only > one reduce action is possible, and can be taken without consulting > a lookahead token. Right. > An end-of-stream conflict arises when a state has multiple > actions, some of which are on real tokens, and some of which are > on the pseudo-token #. In that case, one would like to consult > the next token in order to make a decision; however, there is a > possibility that we are at the end of the sentence that we are > trying to recognize, so asking the lexer for one more token might > be a mistake (in your terminology, it might be an overshoot). > > Does this clarify things? Yes. Your code or my grammar is bugged :) My grammar is compilation_unit: | statement_aster ENDMARKER { $1 } where no non-terminal on which statements_aster depends has a production containing ENDMARKER or # (eof). Therefore, there is no conflict. When compilation_unit is reduced the parser returns, the next token, whether it is # or any other, is irrelevant. The parser is therefore required to match the head of the input stream and NOT the whole stream. It seems Menhir wants to match the whole input, and is confused because it doesn't know what to do after it sees ENDMARKER. That means it doesn't detect that compilation_unit is a start symbol .. and the reduction is final. If you are augmenting the grammar to be: compilation_unit: | statement_aster ENDMARKER eof { $1 } that would explain Menhir's confusion .. however the generated parser actually does work (provided the user defined syntax feature isn't used). IMHO: the end-of-stream thing is a serious bug in the design of Ocamlyacc which unfortunately Menhir duplicates. In particular there should be no possibility of the parser consulting the lexbuf, the signature of parsers is very wrong: parser: (lexbuf -> token) -> lexbuf -> ast is a disaster. Because of this you're duplicating the disaster and probing the lexbuf for end of stream, which is just wrong. There should be no implicit end of stream: a stream is an INFINITE sequence, a parser should process the head of the stream, and the signature should be: parser: (state -> token) -> state -> ast where state is a USER defined type. If the user provides no state argument, then it defaults to unit. This denies the parser any access to a hidden 'end of stream' token. It's up to the user to decide what terminates the parse. The point again is that the token input to the parser is infinite: it can't ever be an error to read a next token. The problem is the usual one for streams. Probably the interface should be changed to something like: peek state n (* look ahead n symbols *) read state (* destructive read 1 symbol *) loc state (* return location for diagnostics *) This would support LR(infinity) parsing. A better interface would support recursive descent too, in that case you'd have to make the 'streaming' persistent to support backtracking, that is you'd have to use: state -> (token * state) as the lexer interface -- that gets rid of the push back problem. (This is trivial to implement on top of a destructive stream, by buffering with lists: Extlib streams do just that). Dypgen actually does a clever trick here, since GLR also requires actions which are purely functional .. from the outside (i.e. persistent): it allow mutations and they propagate locally in the current branch (L-attributes yes?). It also allows global mutations, but they only become permanent when all the branches agree on them. Anyhow the problem seems to reduce to this: I think a parser should read the head of a stream, and Menhir thinks it should read the whole stream, and it uses the bugged lexbuf interface to test for end of stream. So I could fix my grammar by adding 'end-of-stream' to my start symbols AND hack the otherwise unused lexbuf to show end-of-stream after ENDMARKER is read. Ocamlyacc doesn't complain though, suggesting it processes only the head of a stream .. and that there's a fundamental design distinction between the two parser generators. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] menhir 2007-05-02 8:41 ` skaller @ 2007-05-02 12:30 ` Francois Pottier 2007-05-02 16:29 ` skaller 0 siblings, 1 reply; 20+ messages in thread From: Francois Pottier @ 2007-05-02 12:30 UTC (permalink / raw) To: skaller; +Cc: caml-list [-- Attachment #1: Type: text/plain, Size: 4788 bytes --] Hello, On Wed, May 02, 2007 at 06:41:44PM +1000, skaller wrote: > Exactly. In Ocamlyacc it is named 'eof', and you can use that > token in your productions. As far as I know, this is incorrect. ocamlyacc does not have a predefined eof token. Perhaps you are thinking of ocamllex, which has an eof pattern. > compilation_unit: > | statement_aster ENDMARKER { $1 } > > where no non-terminal on which statements_aster depends > has a production containing ENDMARKER or # (eof). > > Therefore, there is no conflict. When compilation_unit > is reduced the parser returns, the next token, whether > it is # or any other, is irrelevant. Good. I seem to agree with you. Menhir should not report an end-of-stream conflict here. So, what does it report? > If you are augmenting the grammar to be: > > compilation_unit: > | statement_aster ENDMARKER eof { $1 } No, actually, I am augmenting the grammar with the production compilation_unit' -> compilation_unit, and using compilation_unit' -> compilation_unit { # } as my initial LR(1) item in the construction of the automaton. > IMHO: the end-of-stream thing is a serious bug in the > design of Ocamlyacc which unfortunately Menhir duplicates. I believe ocamlyacc's behavior is reasonable but subtle and undocumented... Menhir attempts to reproduce that behavior, while providing more guidance in the form of end-of-stream conflict reports... That said, I could be missing a point. What is the bug? > In particular there should be no possibility of the parser > consulting the lexbuf, the signature of parsers is very > wrong: > > parser: (lexbuf -> token) -> lexbuf -> ast > > is a disaster. Because of this you're duplicating the > disaster and probing the lexbuf for end of stream, > which is just wrong. > > There should be no implicit end of stream: a stream > is an INFINITE sequence, a parser should process the > head of the stream, and the signature should be: > > parser: (state -> token) -> state -> ast > > where state is a USER defined type. If the user provides > no state argument, then it defaults to unit. I believe this is a separate issue. You are right in saying that the historic signature, which involves lexbuf, is dubious. Following your suggestion, we could just as well use parser: (state -> token * state) -> state -> ast * state if we wish to promote a purely functional style (where values of type state are immutable), or just parser: (unit -> token) -> ast if we are willing to accept mutable state. (I am sweeping the issue of locations under the rug; we should use token * location instead of just token.) That said, the historic signature parser: (lexbuf -> token) -> lexbuf -> ast is really equivalent to the previous one, in the sense that I can write functions that convert between the two styles (see attached file). > The point again is that the token input to the parser is infinite: it can't > ever be an error to read a next token. I beg to disagree. First, the input stream does not have to be infinite: if I am reading from a file, clearly it is finite. Second, regardless of whether the stream is finite or infinite, it *is* an error to read more tokens than you were supposed to. If the grammar's start symbol is S, then the parser should read a sequence of tokens that derives from S, and nothing more; it should not overshoot and consume the first token that follows. > Anyhow the problem seems to reduce to this: I think > a parser should read the head of a stream, and Menhir > thinks it should read the whole stream, and it uses > the bugged lexbuf interface to test for end of stream. As I just said in the previous paragraph, we really are in agreement: a parser should read only a prefix of the stream. As I said earlier, a Menhir-generated parser does not test for the physical end of stream. It relies on the lexbuf interface only to (i) read the next token and (ii) access the location of the current token. > So I could fix my grammar by adding 'end-of-stream' > to my start symbols AND hack the otherwise unused > lexbuf to show end-of-stream after ENDMARKER is read. Hacking the dynamic behavior of your lexbuf won't make any difference, since end-of-stream conflicts are reported at compile time! Ponder that. The only way of avoiding these conflicts is to change your grammar somehow. But I still haven't understood what causes these conflicts in your grammar. Perhaps it would be time to show it? > Ocamlyacc doesn't complain though, suggesting it processes > only the head of a stream .. and that there's a fundamental > design distinction between the two parser generators. ocamlyacc never complains. It just trusts you to know what you are doing. -- François Pottier Francois.Pottier@inria.fr http://cristal.inria.fr/~fpottier/ [-- Attachment #2: convert.ml --] [-- Type: text/plain, Size: 580 bytes --] open Lexing type ('token, 'ast) historic_parser = (lexbuf -> 'token) -> lexbuf -> 'ast type ('token, 'ast) revised_parser = (unit -> 'token) -> 'ast let upgrade (p : ('token, 'ast) historic_parser) : ('token, 'ast) revised_parser = fun (next : unit -> 'token) -> let (* dummy *) lexbuf = Lexing.from_string "" and lexer lexbuf = next() in p lexer lexbuf let downgrade (p : ('token, 'ast) revised_parser) : ('token, 'ast) historic_parser = fun (lexer : lexbuf -> 'token) lexbuf -> let next () = lexer lexbuf in p next ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] menhir 2007-05-02 12:30 ` Francois Pottier @ 2007-05-02 16:29 ` skaller 2007-05-02 18:35 ` Francois Pottier 0 siblings, 1 reply; 20+ messages in thread From: skaller @ 2007-05-02 16:29 UTC (permalink / raw) To: Francois.Pottier; +Cc: caml-list On Wed, 2007-05-02 at 14:30 +0200, Francois Pottier wrote: > Hello, > > On Wed, May 02, 2007 at 06:41:44PM +1000, skaller wrote: > > Exactly. In Ocamlyacc it is named 'eof', and you can use that > > token in your productions. > > As far as I know, this is incorrect. ocamlyacc does not have a predefined > eof token. Perhaps you are thinking of ocamllex, which has an eof pattern. I believe you're right, I apologise for confusion. > > compilation_unit: > > | statement_aster ENDMARKER { $1 } > > > > where no non-terminal on which statements_aster depends > > has a production containing ENDMARKER or # (eof). > > > > Therefore, there is no conflict. When compilation_unit > > is reduced the parser returns, the next token, whether > > it is # or any other, is irrelevant. > > Good. I seem to agree with you. Menhir should not report an end-of-stream > conflict here. So, what does it report? Built an LR(0) automaton with 1416 states. Built an LR(1) automaton with 2009 states. Warning: 145 states have an end-of-stream conflict. Can I send you the file? [signature of parser] > I believe this is a separate issue. Yes, I agree. > You are right in saying that the historic > signature, which involves lexbuf, is dubious. Following your suggestion, we > could just as well use > > parser: (state -> token * state) -> state -> ast * state > > if we wish to promote a purely functional style (where values of type > state are immutable), or just > > parser: (unit -> token) -> ast > > if we are willing to accept mutable state. (I am sweeping the issue of > locations under the rug; we should use token * location instead of just > token.) Or forget it, which is the approach taken by Felix: every token contains its location: the user can organise this. This has the advantage of not specifying a particular location format. > That said, the historic signature > > parser: (lexbuf -> token) -> lexbuf -> ast > > is really equivalent to the previous one, in the sense that I can write > functions that convert between the two styles (see attached file). Yes, but you cannot write functions that take a state argument because lexbuf is a fixed data type and there's no where to add in any user state data. > > The point again is that the token input to the parser is infinite: it can't > > ever be an error to read a next token. > > I beg to disagree. First, the input stream does not have to be infinite: if > I am reading from a file, clearly it is finite. EOF is returned an infinite number of times in C. > Second, regardless of whether > the stream is finite or infinite, it *is* an error to read more tokens than > you were supposed to. If the grammar's start symbol is S, then the parser > should read a sequence of tokens that derives from S, and nothing more; it > should not overshoot and consume the first token that follows. This requires the definition: parse the *shortest* head of the input stream. > The only way of avoiding these conflicts is to change your grammar somehow. > But I still haven't understood what causes these conflicts in your grammar. > Perhaps it would be time to show it? > ocamlyacc never complains. It just trusts you to know what you are doing. I generate an .output file, grep for the word 'conflict', and terminate my build if there is one found. I do not permit any conflicts in my grammar: it's strictly unambiguous LALR(1). It's also pure in the sense that it doesn't use crud like %left, %prec etc to resolve conflicts. [The way dypgen does this is vastly superior!] -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] menhir 2007-05-02 16:29 ` skaller @ 2007-05-02 18:35 ` Francois Pottier 2007-05-03 1:30 ` skaller 0 siblings, 1 reply; 20+ messages in thread From: Francois Pottier @ 2007-05-02 18:35 UTC (permalink / raw) To: skaller; +Cc: caml-list On Thu, May 03, 2007 at 02:29:17AM +1000, skaller wrote: > > Built an LR(0) automaton with 1416 states. > Built an LR(1) automaton with 2009 states. > Warning: 145 states have an end-of-stream conflict. > > Can I send you the file? Sure, please do (perhaps off the list). > Yes, but you cannot write functions that take a state argument > because lexbuf is a fixed data type and there's no where to > add in any user state data. But my point is that you never need to pass a state argument to a parser function. Instead, you can just have the function capture the address of the (mutable) state in its closure. > EOF is returned an infinite number of times in C. Point taken. > This requires the definition: parse the *shortest* head of the > input stream. In fact, if there are no end-of-stream conflicts, then the head of the input stream is *unique*, so you don't need to specify whether you are interested in a shortest or longest prefix. > > ocamlyacc never complains. It just trusts you to know what you are doing. > > I generate an .output file, grep for the word 'conflict', > and terminate my build if there is one found. I do not permit > any conflicts in my grammar: it's strictly unambiguous LALR(1). Sure, but ocamlyacc does not report end-of-stream conflicts, which I believe are real issues... > It's also pure in the sense that it doesn't use crud > like %left, %prec etc to resolve conflicts. Good, good! > [The way dypgen does this is vastly superior!] How does it work? -- François Pottier Francois.Pottier@inria.fr http://cristal.inria.fr/~fpottier/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] menhir 2007-05-02 18:35 ` Francois Pottier @ 2007-05-03 1:30 ` skaller 2007-05-03 8:43 ` Joel Reymont 0 siblings, 1 reply; 20+ messages in thread From: skaller @ 2007-05-03 1:30 UTC (permalink / raw) To: Francois.Pottier; +Cc: caml-list On Wed, 2007-05-02 at 20:35 +0200, Francois Pottier wrote: > On Thu, May 03, 2007 at 02:29:17AM +1000, skaller wrote: > > Yes, but you cannot write functions that take a state argument > > because lexbuf is a fixed data type and there's no where to > > add in any user state data. > > But my point is that you never need to pass a state argument to a parser > function. Instead, you can just have the function capture the address of the > (mutable) state in its closure. And mine is that you can't do that because the function is generated with a fixed signature, whose only state data is the lexbuf. Joel pointed out this isn't so, if you use a functor and construct it locally, then a 'global' variable of the functor is actually localised: I admit I didn't think of local dynamic instantiation. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] menhir 2007-05-03 1:30 ` skaller @ 2007-05-03 8:43 ` Joel Reymont 0 siblings, 0 replies; 20+ messages in thread From: Joel Reymont @ 2007-05-03 8:43 UTC (permalink / raw) To: skaller; +Cc: Francois.Pottier, caml-list On May 3, 2007, at 2:30 AM, skaller wrote: > Joel pointed out this isn't so, if you use a functor and > construct it locally, then a 'global' variable of the > functor is actually localised: I admit I didn't think of > local dynamic instantiation. Just to give credit where it's due, it wasn't my idea. I butted my head against this issue for a few days until Francois solved it for me by offering the local module example. Thanks, Joel -- http://wagerlabs.com/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] menhir 2007-05-01 15:57 ` Francois Pottier 2007-05-01 17:11 ` skaller @ 2007-05-01 17:15 ` skaller 2007-05-01 17:31 ` Francois Pottier 1 sibling, 1 reply; 20+ messages in thread From: skaller @ 2007-05-01 17:15 UTC (permalink / raw) To: Francois.Pottier; +Cc: caml-list On Tue, 2007-05-01 at 17:57 +0200, Francois Pottier wrote: > I have tried that, but it does not help. Most (say, 95%) of the functions > generated by Menhir are truly mutually recursive. Yes, but you can always use indirection to decouple them can't you? Eg just use a 'ref' to a function? Slower at run time, but should fix the compile time problem if the cause is huge let/recs. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] menhir 2007-05-01 17:15 ` skaller @ 2007-05-01 17:31 ` Francois Pottier 0 siblings, 0 replies; 20+ messages in thread From: Francois Pottier @ 2007-05-01 17:31 UTC (permalink / raw) To: skaller; +Cc: caml-list > Yes, but you can always use indirection to decouple them > can't you? Eg just use a 'ref' to a function? I would need a ref to a huge record of functions, or a lot of individual refs. I have tried some of the possibilities, but they yield ugly code and none was actually faster. -- François Pottier Francois.Pottier@inria.fr http://cristal.inria.fr/~fpottier/ ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2007-05-03 8:44 UTC | newest] Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2007-04-28 10:32 menhir skaller 2007-04-28 16:50 ` [Caml-list] menhir Francois Pottier 2007-04-28 19:47 ` Markus Mottl 2007-04-28 21:15 ` Jon Harrop 2007-04-29 4:43 ` skaller 2007-04-29 7:27 ` Christophe Raffalli 2007-05-01 15:57 ` Francois Pottier 2007-05-01 17:11 ` skaller 2007-05-01 17:34 ` Francois Pottier 2007-05-01 23:42 ` skaller 2007-05-02 5:38 ` Francois Pottier 2007-05-02 5:50 ` Francois Pottier 2007-05-02 8:41 ` skaller 2007-05-02 12:30 ` Francois Pottier 2007-05-02 16:29 ` skaller 2007-05-02 18:35 ` Francois Pottier 2007-05-03 1:30 ` skaller 2007-05-03 8:43 ` Joel Reymont 2007-05-01 17:15 ` skaller 2007-05-01 17:31 ` Francois Pottier
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox