* [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
@ 2005-12-12 17:58 Francois Pottier
  2005-12-12 19:51 ` [Caml-list] " "Márk S. Zoltán"
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Francois Pottier @ 2005-12-12 17:58 UTC (permalink / raw)
  To: Caml Mailing List
Dear Caml users,
We are proud to announce the first release of Menhir. Menhir compiles LR(1)
grammar specifications to OCaml code.
Menhir is 90% compatible with ocamlyacc. That is, existing ocamlyacc grammar
specifications are accepted and compiled by Menhir; the resulting parsers run
and produce correct parse trees, except they produce incorrect position
information, because none of the functionality of module Parsing is
supported. Porting a grammar specification from ocamlyacc to Menhir requires
replacing all calls to the Parsing module with new keywords.
Why switch from ocamlyacc to Menhir? In short,
 * Menhir offers parameterized nonterminal symbols as well as a library of
   standard definitions, including options, sequences, and lists. It also
   offers limited support for EBNF syntax.
 * ocamlyacc accepts LALR(1) grammars; Menhir accepts LR(1) grammars, thus
   avoiding certain artificial conflicts.
 * Menhir explains conflicts in terms of the grammar, not (only) in terms of
   the automaton.
 * Menhir allows grammar specifications to be split over multiple files. It
   also allows several grammars to share a single set of tokens.
 * Menhir produces reentrant parsers.
 * Menhir is able to produce parsers that are parameterized by Ocaml modules.
 * ocamlyacc requires semantic values to be referred to via keywords: $1, $2,
   and so on. Menhir allows semantic values to be explicitly named.
 * Menhir's error and warning messages are usually more numerous and better
   than ocamlyacc's.
A more detailed comparison between ocamlyacc and Menhir appears in Menhir's
documentation.
This is an ALPHA-quality release, so there certainly remain a lot of bugs
to iron out. Nevertheless, we encourage intrepid testers to have a look
and send suggestions and bug reports our way. Thanks for your attention!
Menhir requires ocaml 3.09. The source distribution and the documentation can
be found at
  http://pauillac.inria.fr/~fpottier/menhir/menhir-20051212.tar.gz
  http://pauillac.inria.fr/~fpottier/menhir/manual.pdf
-- 
François Pottier and Yann Régis-Gianas
{Francois.Pottier,Yann.Regis-Gianas}@inria.fr
http://pauillac.inria.fr/~fpottier/
http://pauillac.inria.fr/~regisgia/
^ permalink raw reply	[flat|nested] 18+ messages in thread* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml 2005-12-12 17:58 [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml Francois Pottier @ 2005-12-12 19:51 ` "Márk S. Zoltán" 2005-12-13 21:07 ` Nathaniel Gray 2005-12-15 6:35 ` Stefan Monnier 2 siblings, 0 replies; 18+ messages in thread From: "Márk S. Zoltán" @ 2005-12-12 19:51 UTC (permalink / raw) To: Francois.Pottier; +Cc: Caml Mailing List I just tried to build menhir and failed at it. Maybe the bug is on my side, because I do not exactly do everything by the book. My installation is mingw over Win2k, ocaml-3.09-0, however, it is a custom build over native msys, no cygwin whatsoever involved. This was not an issue for the last few years - I am maintaining an ocaml line for this platform just for fun ever since I learned about msys. I would understand if you did not consider this issue to be too urgent, appearing on an unsupported platform etc. but I thought it may be something more generic, so here you go: The output is the following: $ ./configure checking for ocamlc.opt... no checking for ocamlc... ocamlc checking the version of the ocaml bytecode compiler.... 3.09.0 checking for ocamlopt.opt... no checking for ocamlopt... ocamlopt checking the version of the ocaml native compiler.... 3.09.0 checking for ocamldep.opt... no checking for ocamldep... ocamldep checking for ocamlyacc... ocamlyacc checking for ocamllex... ocamllex checking for a BSD-compatible install... /bin/install -c configure: creating ./config.status config.status: creating Makefile $ make install make - --unix -s PGEN="ocamlyacc -v" PARSER=parser EXECUTABLE=stage-one executab les 77 states, 5444 transitions, table size 22238 bytes 18222 additional bytes used for bindings 243 states, 3079 transitions, table size 13774 bytes 10161 additional bytes used for bindings 5 states, 258 transitions, table size 1062 bytes 23 states, 1876 transitions, table size 7642 bytes 1209 additional bytes used for bindings make - --unix -s PGEN="./stage-one -v -lg 1 -la 1 -lc 1 --comment --infer --erro r-recovery --stdlib ." PARSER=fancy-parser EXECUTABLE=menhir executables I/O error: Invalid argument Fatal error: exception Sys_error("Invalid argument") make[1]: *** [parser.mli] Error 1 make: *** [bootstrap] Error 2 $ Regards M- S- Z- ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml 2005-12-12 17:58 [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml Francois Pottier 2005-12-12 19:51 ` [Caml-list] " "Márk S. Zoltán" @ 2005-12-13 21:07 ` Nathaniel Gray 2005-12-14 6:08 ` skaller 2005-12-15 6:35 ` Stefan Monnier 2 siblings, 1 reply; 18+ messages in thread From: Nathaniel Gray @ 2005-12-13 21:07 UTC (permalink / raw) To: Francois.Pottier; +Cc: Caml Mailing List [-- Attachment #1: Type: text/plain, Size: 3195 bytes --] This is pretty nice! Every time I use ocamlyacc I think "somebody should write something better." Now it looks like somebody has! I can't tell you how many times I've wanted parameterized rules and simple "library" rules for parsing delimiter-separated lists and such... Cheers, -n8 On 12/12/05, Francois Pottier <Francois.Pottier@inria.fr> wrote: > > > Dear Caml users, > > We are proud to announce the first release of Menhir. Menhir compiles > LR(1) > grammar specifications to OCaml code. > > Menhir is 90% compatible with ocamlyacc. That is, existing ocamlyacc > grammar > specifications are accepted and compiled by Menhir; the resulting parsers > run > and produce correct parse trees, except they produce incorrect position > information, because none of the functionality of module Parsing is > supported. Porting a grammar specification from ocamlyacc to Menhir > requires > replacing all calls to the Parsing module with new keywords. > > Why switch from ocamlyacc to Menhir? In short, > > * Menhir offers parameterized nonterminal symbols as well as a library of > standard definitions, including options, sequences, and lists. It also > offers limited support for EBNF syntax. > > * ocamlyacc accepts LALR(1) grammars; Menhir accepts LR(1) grammars, thus > avoiding certain artificial conflicts. > > * Menhir explains conflicts in terms of the grammar, not (only) in terms > of > the automaton. > > * Menhir allows grammar specifications to be split over multiple files. It > also allows several grammars to share a single set of tokens. > > * Menhir produces reentrant parsers. > > * Menhir is able to produce parsers that are parameterized by Ocaml > modules. > > * ocamlyacc requires semantic values to be referred to via keywords: $1, > $2, > and so on. Menhir allows semantic values to be explicitly named. > > * Menhir's error and warning messages are usually more numerous and better > than ocamlyacc's. > > A more detailed comparison between ocamlyacc and Menhir appears in > Menhir's > documentation. > > This is an ALPHA-quality release, so there certainly remain a lot of bugs > to iron out. Nevertheless, we encourage intrepid testers to have a look > and send suggestions and bug reports our way. Thanks for your attention! > > Menhir requires ocaml 3.09. The source distribution and the documentation > can > be found at > > http://pauillac.inria.fr/~fpottier/menhir/menhir-20051212.tar.gz > http://pauillac.inria.fr/~fpottier/menhir/manual.pdf > > -- > François Pottier and Yann Régis-Gianas > {Francois.Pottier,Yann.Regis-Gianas}@inria.fr > http://pauillac.inria.fr/~fpottier/ > http://pauillac.inria.fr/~regisgia/ > > _______________________________________________ > Caml-list mailing list. Subscription management: > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list > Archives: http://caml.inria.fr > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs > -- >>>-- Nathaniel Gray -- Caltech Computer Science ------> >>>-- Mojave Project -- http://mojave.cs.caltech.edu --> [-- Attachment #2: Type: text/html, Size: 4200 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml 2005-12-13 21:07 ` Nathaniel Gray @ 2005-12-14 6:08 ` skaller 2005-12-14 9:04 ` Francois Pottier 0 siblings, 1 reply; 18+ messages in thread From: skaller @ 2005-12-14 6:08 UTC (permalink / raw) To: Nathaniel Gray; +Cc: Francois.Pottier, Caml Mailing List On Tue, 2005-12-13 at 13:07 -0800, Nathaniel Gray wrote: > This is pretty nice! Every time I use ocamlyacc I think "somebody > should write something better." Now it looks like somebody has! I > can't tell you how many times I've wanted parameterized rules and > simple "library" rules for parsing delimiter-separated lists and > such... Yes, it is pretty nice! However it still appears to have some problems. Any comments appreciated. 0. The licence. Q public licence for the generator???? Please NO NO NO!! Not unless it is distributed as part of the official distro. Is there any chance of that? If not even GPL would be better ;( 1. Generating a functor is cute, but it doesn't seem to allow arguments to parser functions. Perhaps I missed something? Is there a way to use the functorisation with closures to add an argument? In particular, can the parser be generated *inside* an environment such a function or let binding? [Felix allows that, which means an extra argument is not required, a variable in the environment can be used instead] 2. The signature of parsers is still wrong? Ocamlyacc usesthe typing val parser: (lexbuf->token) -> lexbuf -> 'a which is just bad. A better signature is val parser: ( unit -> token ) -> 'a There is no need to provide location information: the correct solution is to throw an exception, which is caught in a context which can determine the location. It would be nice to be able to generate this signature with a command line switch, pragma, or some other mechanism, even if the default is chosen for ocamlyacc compatibility. 3. I have doubts about the claim that parsers can 'share' token types. I do not see how this is possible. It is contradicted by the compilation model description, which explains how it is necessary to join separate files making up a grammar specification. In this case, the joined system is going to generate a single token type, and any type generated by another joining is certain to generate a distinct type because (a) the type is defined in a distinct ocaml module (mli file) (b) the typing of normal variants is nominal This problem would go away if polymorphic variants were used instead, because the typenames are then simply abbreviations, since pm-variants are structurally, not nominally, typed. Perhaps a command line switch, pragma, or whatever, to use polymorphic variants instead of ordinary ones? Actually, I personally find the 'yacc' technique of generating tokens to be rather lame. Felix does this much better -- the parser simply expects a token type which is a variant, the type can be defined wherever you like. In particular, the lexer and parser can share that definition. As far as I can see Menhir COULD do this, except of course one would use %token as a special way of generating the variant. All that would be required I think is the syntax %import_tokens "filename" which refers to the token definition file -- as an alternative to inlining these token definitions. (if pm-variants are used you could probably support both, though I'm not sure). A token definition file then generates two files, an ordinary mli file with the token variant type, and, a special information file for the parser generator (with the same information, but in a more useful form). In Felix none of this is necessary because parsing is built in, so the compiler can find the information required for the parser generator directly from the token variant type. 4. Just curious, but how practical is LR(1) in terms of generated code sizes? Felix is using Elkhound as its parser which is a GLR parser with an LALR(1) core. In theory there is an option for choosing the core automaton, which also allows LR(1) however I recall Scott McPeak commenting it wasn't worth supporting because it generated tables which were far too big. I'm curious how one would be able to predict the size of the generated code since I don't really understand the additional constraints LALR(1) introduces .. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml 2005-12-14 6:08 ` skaller @ 2005-12-14 9:04 ` Francois Pottier 2005-12-14 10:27 ` Alessandro Baretta 2005-12-14 20:51 ` skaller 0 siblings, 2 replies; 18+ messages in thread From: Francois Pottier @ 2005-12-14 9:04 UTC (permalink / raw) To: skaller; +Cc: Caml Mailing List Hello, On Wed, Dec 14, 2005 at 05:08:15PM +1100, skaller wrote: > 0. The licence. Q public licence for the generator???? > Please NO NO NO!! Not unless it is distributed > as part of the official distro. Is there any chance of that? Our plan (not yet approved by the OCaml higher authorities) is to make it part of the OCaml distribution. > In particular, can the parser be generated *inside* > an environment such a function or let binding? No. The only way of passing arguments to the parser is through functor arguments. (Or through global state, of course.) Why is that a problem? > Ocamlyacc usesthe typing > > val parser: (lexbuf->token) -> lexbuf -> 'a > > which is just bad. A better signature is > > val parser: ( unit -> token ) -> 'a > > There is no need to provide location information: the correct > solution is to throw an exception, which is caught in a > context which can determine the location. It can be nice to have the parser automatically extract position information for you; doing it manually and explicitly within semantic actions would be quite tedious. On the other hand, the signature that you suggest would allow using lexers that do not necessarily conform to the lexbuf interface; is that your point? If so, we will consider adding a command line switch, as you suggest. It would disallow the use of the position keywords ($startpos, $endpos, etc.) > 3. I have doubts about the claim that parsers can 'share' > token types. I do not see how this is possible. It certainly is possible: have a look at demos/calc-two in the distribution. The example is artificial but illustrates the possibility of sharing a token type. > Actually, I personally find the 'yacc' technique of > generating tokens to be rather lame. Felix does this > much better -- the parser simply expects a token type > which is a variant, the type can be defined wherever > you like. With Menhir, the token type can also be hand-written or generated via other means. It doesn't have to be generated by Menhir. It has to be an algebraic data type, though, not a polymorphic variant. > %import_tokens "filename" This is called --external-tokens Filename. It's a command line switch. > 4. Just curious, but how practical is LR(1) in terms of > generated code sizes? Felix is using Elkhound as its > parser which is a GLR parser with an LALR(1) core. In theory > there is an option for choosing the core automaton, which > also allows LR(1) however I recall Scott McPeak commenting > it wasn't worth supporting because it generated tables > which were far too big. There are two questions: how big is the automaton compared to an LALR automaton? and how big is the code that implements the automaton? The answer to the first question is, in most cases, the grammar is LALR(1) anyway, and the automaton generated by Menhir is identical to the one that would be produced by ocamlyacc or by some other LALR(1) parser generator. When the grammar is not LALR(1), the LR(1) automaton has more states than an LALR(1) automaton for the same grammar. The number of extra states is usually small (often just one, or a handful). So there's no concern here. Considering the answer to the first question, the second question would not arise at all if Menhir produced tables, like ocamlyacc. However, Menhir does not produce tables; it compiles the automaton down to a bunch of mutually recursive functions. We have not yet scientifically assessed the difference in size between tables and code, but a few quick experiments indicate that it is reasonable (the code is larger than the tables by a factor of two or three). -- François Pottier Francois.Pottier@inria.fr http://pauillac.inria.fr/~fpottier/ ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml 2005-12-14 9:04 ` Francois Pottier @ 2005-12-14 10:27 ` Alessandro Baretta 2005-12-14 21:04 ` skaller 2005-12-14 20:51 ` skaller 1 sibling, 1 reply; 18+ messages in thread From: Alessandro Baretta @ 2005-12-14 10:27 UTC (permalink / raw) To: Francois.Pottier; +Cc: Caml Mailing List Francois Pottier wrote: > Considering the answer to the first question, the second > question would not arise at all if Menhir produced tables, like > ocamlyacc. However, Menhir does not produce tables; it > compiles the automaton down to a bunch of mutually recursive > functions. We have not yet scientifically assessed the > difference in size between tables and code, but a few > quick experiments indicate that it is reasonable (the > code is larger than the tables by a factor of two or > three). > In general, I like the approach, as it can easily scale to an extensible parser generator by late-binding the recursion through a record/array/table of functions. Have you thought about this at all? Alex -- ********************************************************************* http://www.barettadeit.com/ Baretta DE&IT A division of Baretta SRL tel. +39 02 370 111 55 fax. +39 02 370 111 54 Our technology: The Application System/Xcaml (AS/Xcaml) <http://www.asxcaml.org/> The FreerP Project <http://www.freerp.org/> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml 2005-12-14 10:27 ` Alessandro Baretta @ 2005-12-14 21:04 ` skaller 2005-12-15 8:46 ` Francois Pottier 0 siblings, 1 reply; 18+ messages in thread From: skaller @ 2005-12-14 21:04 UTC (permalink / raw) To: Alessandro Baretta; +Cc: Francois.Pottier, Caml Mailing List On Wed, 2005-12-14 at 11:27 +0100, Alessandro Baretta wrote: > Francois Pottier wrote: > > > However, Menhir does not produce tables; it > > compiles the automaton down to a bunch of mutually recursive > > functions. We have not yet scientifically assessed the > > difference in size between tables and code, but a few > > quick experiments indicate that it is reasonable (the > > code is larger than the tables by a factor of two or > > three). > > > > In general, I like the approach, as it can easily scale to an extensible parser > generator by late-binding the recursion through a record/array/table of > functions. Have you thought about this at all? Actually the Felix parser does this right now -- and that is why I'm asking for the parser functions to allow a parameter: to pass in the data structure. I actually have simplistic recursive descent interpreter which is called by ocamlyacc, and which calls back into ocamlyacc: at present this allows user defined statements to be added to the Felix language. It is very crude, since there is no way to properly type user non-terminals other than statements, and it only allows extension of statements. Interestingly the 'end of stream' issue discussed at length in the Menhir manual then applies also to the points of recursion. For a statement in Felix, there is no lookahead at the end (they end with a semicolon). For expressions, I have to read one token too many, keep a list of all possible tokens that can terminate an expression, introduce an 'augmented expression' which is an expression followed by any of these tokens, and then in the action code 'put back' the token into the token stream. It would be really nice if I could write the production as: exprx: expr ~expr where ~expr means 'any token which forces a reduction of the whole expression' if you know what I mean. I have to assemble this list by hand at the moment. I did it by starting with a list of all the tokens, and removing the ones which seemed to cause a conflict. I will certainly forget to upgrade this list when I add new tokens -- hence a desire for something like ~expr which tells the parser to do it :) -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml 2005-12-14 21:04 ` skaller @ 2005-12-15 8:46 ` Francois Pottier 2005-12-15 11:03 ` skaller 0 siblings, 1 reply; 18+ messages in thread From: Francois Pottier @ 2005-12-15 8:46 UTC (permalink / raw) To: skaller; +Cc: Alessandro Baretta, Caml Mailing List On Thu, Dec 15, 2005 at 08:04:54AM +1100, skaller wrote: > Interestingly the 'end of stream' issue discussed at length > in the Menhir manual then applies also to the points > of recursion. That's a good point. Actually, we are planning to implement in Menhir a mechanism that helps stop at some point and invoke an external lexer or parser (this can be any piece of code that consumes characters off the input stream). Doing this properly requires *not* reading the lookahead token. As you point out, the problems that arise are similar to the way the end of stream is handled. > It would be really nice if I could write the > production as: > > exprx: expr ~expr > > where ~expr means 'any token which forces a reduction > of the whole expression' if you know what I mean. No, I don't know exactly what you mean :) I need to think more to understand if this construction makes sense. It makes the grammar self-referential, in a way, so it at least has to be monotonic in order to make sense. Our plan regarding calls to external lexers or parsers was to be more rigid, that is, to require the current statement (or expression, or whatever) to be unambiguously terminated (for instance, by a semicolon) at the point where the external call is made. So what you do with statements would be accepted, but what you do with expressions might be impossible. -- François Pottier Francois.Pottier@inria.fr http://pauillac.inria.fr/~fpottier/ ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml 2005-12-15 8:46 ` Francois Pottier @ 2005-12-15 11:03 ` skaller 0 siblings, 0 replies; 18+ messages in thread From: skaller @ 2005-12-15 11:03 UTC (permalink / raw) To: Francois.Pottier; +Cc: Alessandro Baretta, Caml Mailing List On Thu, 2005-12-15 at 09:46 +0100, Francois Pottier wrote: > Doing this properly requires *not* reading the lookahead token. > As you point > out, the problems that arise are similar to the way the end of stream is > handled. > > > It would be really nice if I could write the > > production as: > > > > exprx: expr ~expr > > > > where ~expr means 'any token which forces a reduction > > of the whole expression' if you know what I mean. > > No, I don't know exactly what you mean :) I need to think more to understand > if this construction makes sense. > It makes the grammar self-referential, in a > way, so it at least has to be monotonic in order to make sense. > > Our plan regarding calls to external lexers or parsers was to be more rigid, > that is, to require the current statement (or expression, or whatever) to be > unambiguously terminated (for instance, by a semicolon) at the point where the > external call is made. So what you do with statements would be accepted, but > what you do with expressions might be impossible. So, the idea is this: for an 1 token look ahead parser, all productions terminate either with 0 lookahead or 1 lookahead. Now, you're saying only to support recursion for 0 lookahead productions. So what if you have a 1 lookahead and you still want to recurse?? One answer is: make a new augmented production with is 0 lookahead: prod': prod ~prod { $1, $2 } where ~prod is the nonterminal ~prod: | t1 { t1 $1 } | t2 { t2 $1 } | t3 { t3 $1 } .... where t1, t2, t3, .. are tokens that would not be shifted by the parser, and therefore cause prod to be reduced. The action then returns the result of prod, PLUS the lookahead token. Before recursing, we can simply push the lookahead token back into the head of the token stream. Felix does precisely this -- it convents a 1 lookahead production into a 0 lookahead production and therefore: > what you do with expressions might be impossible. seems possible by the above mechanism. It is equivalent to adding a set of terminating tokens. The *problem* is choosing terminating tokens that are viable first tokens of any sequence of symbols that can follow an expression elsewhere in the grammar. For example consider: while expr do .. for i in expr to expr do This grammar would be ambiguous, if, whilst parsing an expression, a 'do' or 'to' token did not cause expr to be reduced. So 'do' and 'to' are 'expression termination tokens', that is, 'do' and 'to' are members of the set '~expr'. I forget the correct terminology about viable prefixes and handles.. but I hope the idea is clear. I think you did raise an important point though: the meaning of ~x is dependent on the start symbol. So the notation should probably be something like: start~x meaning 'all the tokens which can terminate an x whilst parsing start'. This algorithm 'works' for finding 'start~x': "try every token. If it leads to a conflict it is not in start~x, otherwise it is" I am doing that by hand at the moment. Here are the relevant parts of Felix: %type <Flx_ast.expr_t * token> exprx %start exprx exprx: expr expr_terminator { $1,$2 } expr_terminator: | SEMI { SEMI $1 } | USER_KEYWORD { USER_KEYWORD $1 } | VBAR { VBAR $1 } | ENDMARKER { ENDMARKER } @for n,t in flx_expr_terminator_keywords: tangle(" | " + t + " { " + t + " $1 }",inhibit_sref=1) The @ stuff is a Python script that uses a Python table to generate code, elsewhere i have: flx_expr_terminator_keywords = [ ("all", "ALL"), ("assert", "ASSERT"), ("body", "BODY"), ("call", "CALL"), ("case", "CASE"), .... ("with", "WITH"), ("until", "UNTIL"), ("_", "UNDERSCORE"), ("_gc_pointer", "GC_POINTER"), ("_svc", "SVC"), ] flx_other_keywords = [ ("and", "AND"), ("as", "AS"), .... ("regmatch", "REGMATCH"), ("the", "THE"), ("typematch", "TYPEMATCH"), ] flx_keywords = flx_expr_terminator_keywords + flx_other_keywords I generated the partition by trial and error. When I modify the grammar I could miss some cases. In fact, I have no idea if the set of terminators is maximal. The parser KNOWS .. if I add an wrong keyword into the terminator set I get a conflict. The technique I am using would be impossible to maintain if there were 20-30 such productions at 'recursion points'. [At the moment in Felix there are just two: statements and expression] I hope this makes sense: I think what I'm asking for is quite general, since it enables any LR1 parser to be used with an LL parser, provided one can 'push back' the offending lookahead token. (that is, of course, always possible to implement by buffering for mutable streams, and persistence for functional ones). -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml 2005-12-14 9:04 ` Francois Pottier 2005-12-14 10:27 ` Alessandro Baretta @ 2005-12-14 20:51 ` skaller 2005-12-14 22:15 ` Joaquin Cuenca Abela 1 sibling, 1 reply; 18+ messages in thread From: skaller @ 2005-12-14 20:51 UTC (permalink / raw) To: Francois.Pottier; +Cc: Caml Mailing List On Wed, 2005-12-14 at 10:04 +0100, Francois Pottier wrote: > Hello, > > On Wed, Dec 14, 2005 at 05:08:15PM +1100, skaller wrote: > > > 0. The licence. Q public licence for the generator???? > > Please NO NO NO!! Not unless it is distributed > > as part of the official distro. Is there any chance of that? > > Our plan (not yet approved by the OCaml higher authorities) > is to make it part of the OCaml distribution. Good!!! > > In particular, can the parser be generated *inside* > > an environment such a function or let binding? > > No. The only way of passing arguments to the parser is through > functor arguments. (Or through global state, of course.) Which is the same thing... > Why is that a problem? Because it destroys re-entrancy. My parser is recursive, it actually calls a recursive descent parser to handle user defined extensions, that parser in turn can call other parsers, including the original ocamlyacc parser. Passing an argument to the parser is the natural way to do this. What I'm actually doing now is storing the data in a token, since this is the only way to get the required data to the parser. however it is a trick. It 'happens' to work at the moment, but may break at any time (it works because the information is constructed by the lexer which runs in an earlier phase -- that is good enough for the current extensions I allow, but this may change tomorrow -- I'd love to scope these extensions properly, but that really does require passing an argument) It's just the usual reason for functional programming and persistent data structures I guess. > > Ocamlyacc usesthe typing > > > > val parser: (lexbuf->token) -> lexbuf -> 'a > > > > which is just bad. A better signature is > > > > val parser: ( unit -> token ) -> 'a > > > > There is no need to provide location information: the correct > > solution is to throw an exception, which is caught in a > > context which can determine the location. > > It can be nice to have the parser automatically extract position information > for you; doing it manually and explicitly within semantic actions would be > quite tedious. Yes, it can be nice for a toy parser. For a complex production quality parser there is no way to avoid working hard to manage source location data. > On the other hand, the signature that you suggest would allow using lexers > that do not necessarily conform to the lexbuf interface; is that your point? Indeed. (My parsers run off lists of tokens .. they're lexed using ocamllex .. but in an earlier phase. And the lists are processed in between, for example, whitespace tokens are elided because the parser can't handle them) > If so, we will consider adding a command line switch, as you suggest. It > would disallow the use of the position keywords ($startpos, $endpos, etc.) Yes. That makes sense. There are other alternatives, such as requiring the tokeniser to return a pair, consisting of a token and the location. In Ocamlyacc this would be a bad idea, since it would require a fixed notion of location. In Menhir, however, this is not so, since it can generate a functor which can be parameterised by the location type. > > 3. I have doubts about the claim that parsers can 'share' > > token types. I do not see how this is possible. > > It certainly is possible: have a look at demos/calc-two in the > distribution. The example is artificial but illustrates the > possibility of sharing a token type. Hmm .. ok, I will look. > > Actually, I personally find the 'yacc' technique of > > generating tokens to be rather lame. Felix does this > > much better -- the parser simply expects a token type > > which is a variant, the type can be defined wherever > > you like. > > With Menhir, the token type can also be hand-written or generated via other > means. It doesn't have to be generated by Menhir. It has to be an algebraic > data type, though, not a polymorphic variant. Ah, I see, the same as Felix in that respect then. This solves a lot of problems, IMHO. > Considering the answer to the first question, the second > question would not arise at all if Menhir produced tables, like > ocamlyacc. However, Menhir does not produce tables; it > compiles the automaton down to a bunch of mutually recursive > functions. We have not yet scientifically assessed the > difference in size between tables and code, but a few > quick experiments indicate that it is reasonable (the > code is larger than the tables by a factor of two or > three). I'm also curious as to the performance of code vs table driven interpreter. One would think the code is faster .. but size matters a lot on current CPUs due to caching. Still, the O() performance will be much the same -- 2-3 times larger code is reasonable considering that it will probably make it easier to enhance the generator (as has already been done by be able to generate functors .. :) -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml 2005-12-14 20:51 ` skaller @ 2005-12-14 22:15 ` Joaquin Cuenca Abela 2005-12-15 8:40 ` Francois Pottier 0 siblings, 1 reply; 18+ messages in thread From: Joaquin Cuenca Abela @ 2005-12-14 22:15 UTC (permalink / raw) To: Francois.Pottier; +Cc: Caml Mailing List Hi, If you are still accepting suggestions for Menhir, mine is to allow guards on rules to prevent the reduction of a rule when the guard is false. Some languages are easy to parse, with some exceptions that are hard to express with a grammar, as "these two tokens should be on the same line" (hard to do if everywhere else new lines are ignored...) Cheers, Joaquin Cuenca Abela e98cuenc at yahoo dot com __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml 2005-12-14 22:15 ` Joaquin Cuenca Abela @ 2005-12-15 8:40 ` Francois Pottier 0 siblings, 0 replies; 18+ messages in thread From: Francois Pottier @ 2005-12-15 8:40 UTC (permalink / raw) To: Joaquin Cuenca Abela; +Cc: Caml Mailing List Hello, On Wed, Dec 14, 2005 at 02:15:06PM -0800, Joaquin Cuenca Abela wrote: > If you are still accepting suggestions for Menhir, > mine is to allow guards on rules to prevent the > reduction of a rule when the guard is false. I don't see how to implement this in an LR parser... can you explain? -- François Pottier Francois.Pottier@inria.fr http://pauillac.inria.fr/~fpottier/ ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml 2005-12-12 17:58 [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml Francois Pottier 2005-12-12 19:51 ` [Caml-list] " "Márk S. Zoltán" 2005-12-13 21:07 ` Nathaniel Gray @ 2005-12-15 6:35 ` Stefan Monnier 2005-12-15 8:47 ` [Caml-list] " Francois Pottier 2 siblings, 1 reply; 18+ messages in thread From: Stefan Monnier @ 2005-12-15 6:35 UTC (permalink / raw) To: caml-list > We are proud to announce the first release of Menhir. Menhir compiles LR(1) > grammar specifications to OCaml code. Nice. I wish it had something like ml-yacc's automatic handling of syntax-errors (which tries to insert/delete tokens in the input stream until the parse succeeds). Stefan ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] Re: [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml 2005-12-15 6:35 ` Stefan Monnier @ 2005-12-15 8:47 ` Francois Pottier 2005-12-15 16:41 ` Stefan Monnier 0 siblings, 1 reply; 18+ messages in thread From: Francois Pottier @ 2005-12-15 8:47 UTC (permalink / raw) To: Stefan Monnier; +Cc: caml-list Hello, On Thu, Dec 15, 2005 at 01:35:40AM -0500, Stefan Monnier wrote: > I wish it had something like ml-yacc's automatic handling of syntax-errors > (which tries to insert/delete tokens in the input stream until the parse > succeeds). Do you mean instead of or in addition to the current mechanism? (I assume instead of. I don't know if the two can be made to coexist.) I am not particularly happy with Menhir's current error mechanism, which attempts to follow yacc's. If you could explain to us why ML-Yacc's approach is superior, that would be great. -- François Pottier Francois.Pottier@inria.fr http://pauillac.inria.fr/~fpottier/ ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] Re: [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml 2005-12-15 8:47 ` [Caml-list] " Francois Pottier @ 2005-12-15 16:41 ` Stefan Monnier 2005-12-15 16:50 ` Francois Pottier 0 siblings, 1 reply; 18+ messages in thread From: Stefan Monnier @ 2005-12-15 16:41 UTC (permalink / raw) To: Francois.Pottier; +Cc: caml-list >> I wish it had something like ml-yacc's automatic handling of syntax-errors >> (which tries to insert/delete tokens in the input stream until the parse >> succeeds). > Do you mean instead of or in addition to the current mechanism? (I assume > instead of. I don't know if the two can be made to coexist.) I don't know enough of the details to tell you, but I'd guess it'd have to be "instead of", indeed. > I am not particularly happy with Menhir's current error mechanism, which > attempts to follow yacc's. If you could explain to us why ML-Yacc's > approach is superior, that would be great. ML-Yacc's error recovery if fully automatic: you write your grammar without worrying about error handling and the generated parser automatically signals syntax errors (note the "s" at the end: it doesn't just stop at the first syntax error) by mentioning that it had to insert a ";" at line L1 and remove an ID at line L2. For toy projects it's amazing: you get good error messages with 0 effort. The main downside of ML-Yacc IIRC has to do with the fact that it requires a fairly strong decoupling between the lexer and parser (you can't do tricks like change the lexer state from a parser semantic action IIRC), but it may be just a limitation of ML-Yacc rather than of the underlying technique. See http://www.smlnj.org/doc/ML-Yacc/mlyacc001.html#toc3 Stefan ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] Re: [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml 2005-12-15 16:41 ` Stefan Monnier @ 2005-12-15 16:50 ` Francois Pottier 2005-12-15 18:56 ` Stefan Monnier 2005-12-30 21:57 ` Florian Weimer 0 siblings, 2 replies; 18+ messages in thread From: Francois Pottier @ 2005-12-15 16:50 UTC (permalink / raw) To: Stefan Monnier; +Cc: caml-list Hi Stefan, Thanks for your answer. > For toy projects it's amazing: you get good error messages with 0 effort. What about non-toy projects? Are there times where the automatic error recovery mechanism gets in the way? -- François Pottier Francois.Pottier@inria.fr http://pauillac.inria.fr/~fpottier/ ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] Re: [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml 2005-12-15 16:50 ` Francois Pottier @ 2005-12-15 18:56 ` Stefan Monnier 2005-12-30 21:57 ` Florian Weimer 1 sibling, 0 replies; 18+ messages in thread From: Stefan Monnier @ 2005-12-15 18:56 UTC (permalink / raw) To: Francois.Pottier; +Cc: caml-list >> For toy projects it's amazing: you get good error messages with 0 effort. > What about non-toy projects? Are there times where the automatic error > recovery mechanism gets in the way? I wouldn't know, really. SML/NJ uses it and I find its syntax error reporting to be generally good. E.g. significantly better than OCaml's ;-) But if it's not good enough, there are hooks to influence its behavior (e.g. telling it what to try and add/remove first, thus telling it basically what are tokens commonly forgotten and spuriously added), but I don't know how it compares in practice to a well tuned yacc error recovery. Stefan ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] Re: [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml 2005-12-15 16:50 ` Francois Pottier 2005-12-15 18:56 ` Stefan Monnier @ 2005-12-30 21:57 ` Florian Weimer 1 sibling, 0 replies; 18+ messages in thread From: Florian Weimer @ 2005-12-30 21:57 UTC (permalink / raw) To: Francois.Pottier; +Cc: Stefan Monnier, caml-list * Francois Pottier: > What about non-toy projects? Are there times where the automatic error > recovery mechanism gets in the way? AFAIK, MLton uses it, and the inserted symbols are often not those that are actually missing. This leads to error messages which are rather confusing. With manually written error handling, you can surely do much better, but the effort pays off in special cases. (Most of the time, the careful error handling code is never written, of course.) ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2005-12-30 21:58 UTC | newest] Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2005-12-12 17:58 [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml Francois Pottier 2005-12-12 19:51 ` [Caml-list] " "Márk S. Zoltán" 2005-12-13 21:07 ` Nathaniel Gray 2005-12-14 6:08 ` skaller 2005-12-14 9:04 ` Francois Pottier 2005-12-14 10:27 ` Alessandro Baretta 2005-12-14 21:04 ` skaller 2005-12-15 8:46 ` Francois Pottier 2005-12-15 11:03 ` skaller 2005-12-14 20:51 ` skaller 2005-12-14 22:15 ` Joaquin Cuenca Abela 2005-12-15 8:40 ` Francois Pottier 2005-12-15 6:35 ` Stefan Monnier 2005-12-15 8:47 ` [Caml-list] " Francois Pottier 2005-12-15 16:41 ` Stefan Monnier 2005-12-15 16:50 ` Francois Pottier 2005-12-15 18:56 ` Stefan Monnier 2005-12-30 21:57 ` Florian Weimer
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox