[ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml

Mailing list for all users of the OCaml language and system.
 help / color / mirror / Atom feed

* [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
@ 2005-12-12 17:58 Francois Pottier
  2005-12-12 19:51 ` [Caml-list] " "Márk S. Zoltán"
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Francois Pottier @ 2005-12-12 17:58 UTC (permalink / raw)
  To: Caml Mailing List

Dear Caml users,

We are proud to announce the first release of Menhir. Menhir compiles LR(1)
grammar specifications to OCaml code.

Menhir is 90% compatible with ocamlyacc. That is, existing ocamlyacc grammar
specifications are accepted and compiled by Menhir; the resulting parsers run
and produce correct parse trees, except they produce incorrect position
information, because none of the functionality of module Parsing is
supported. Porting a grammar specification from ocamlyacc to Menhir requires
replacing all calls to the Parsing module with new keywords.

Why switch from ocamlyacc to Menhir? In short,

 * Menhir offers parameterized nonterminal symbols as well as a library of
   standard definitions, including options, sequences, and lists. It also
   offers limited support for EBNF syntax.

 * ocamlyacc accepts LALR(1) grammars; Menhir accepts LR(1) grammars, thus
   avoiding certain artificial conflicts.

 * Menhir explains conflicts in terms of the grammar, not (only) in terms of
   the automaton.

 * Menhir allows grammar specifications to be split over multiple files. It
   also allows several grammars to share a single set of tokens.

 * Menhir produces reentrant parsers.

 * Menhir is able to produce parsers that are parameterized by Ocaml modules.

 * ocamlyacc requires semantic values to be referred to via keywords: $1, $2,
   and so on. Menhir allows semantic values to be explicitly named.

 * Menhir's error and warning messages are usually more numerous and better
   than ocamlyacc's.

A more detailed comparison between ocamlyacc and Menhir appears in Menhir's
documentation.

This is an ALPHA-quality release, so there certainly remain a lot of bugs
to iron out. Nevertheless, we encourage intrepid testers to have a look
and send suggestions and bug reports our way. Thanks for your attention!

Menhir requires ocaml 3.09. The source distribution and the documentation can
be found at

  http://pauillac.inria.fr/~fpottier/menhir/menhir-20051212.tar.gz
  http://pauillac.inria.fr/~fpottier/menhir/manual.pdf

-- 
François Pottier and Yann Régis-Gianas
{Francois.Pottier,Yann.Regis-Gianas}@inria.fr
http://pauillac.inria.fr/~fpottier/
http://pauillac.inria.fr/~regisgia/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
  2005-12-12 17:58 [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml Francois Pottier
@ 2005-12-12 19:51 ` "Márk S. Zoltán"
  2005-12-13 21:07 ` Nathaniel Gray
  2005-12-15  6:35 ` Stefan Monnier
  2 siblings, 0 replies; 18+ messages in thread
From: "Márk S. Zoltán" @ 2005-12-12 19:51 UTC (permalink / raw)
  To: Francois.Pottier; +Cc: Caml Mailing List

I just tried to build menhir and failed at it. Maybe the bug is on my 
side, because I do not exactly do everything by the book. My 
installation is mingw over Win2k, ocaml-3.09-0, however, it is a custom 
build over native msys, no cygwin whatsoever involved. This was not an 
issue for the last few years - I am maintaining an ocaml line for this 
platform just for fun ever since I learned about msys. I would 
understand if you did not consider this issue to be too urgent, 
appearing on an unsupported platform etc. but I thought it may be 
something more generic, so here you go:

The output is the following:

$ ./configure
checking for ocamlc.opt... no
checking for ocamlc... ocamlc
checking the version of the ocaml bytecode compiler.... 3.09.0
checking for ocamlopt.opt... no
checking for ocamlopt... ocamlopt
checking the version of the ocaml native compiler.... 3.09.0
checking for ocamldep.opt... no
checking for ocamldep... ocamldep
checking for ocamlyacc... ocamlyacc
checking for ocamllex... ocamllex
checking for a BSD-compatible install... /bin/install -c
configure: creating ./config.status
config.status: creating Makefile

$ make install
make - --unix -s PGEN="ocamlyacc -v" PARSER=parser EXECUTABLE=stage-one 
executab
les
77 states, 5444 transitions, table size 22238 bytes
18222 additional bytes used for bindings
243 states, 3079 transitions, table size 13774 bytes
10161 additional bytes used for bindings
5 states, 258 transitions, table size 1062 bytes
23 states, 1876 transitions, table size 7642 bytes
1209 additional bytes used for bindings
make - --unix -s PGEN="./stage-one -v -lg 1 -la 1 -lc 1 --comment 
--infer --erro
r-recovery --stdlib ." PARSER=fancy-parser EXECUTABLE=menhir executables
I/O error: Invalid argument
Fatal error: exception Sys_error("Invalid argument")
make[1]: *** [parser.mli] Error 1
make: *** [bootstrap] Error 2

$

Regards

    M- S- Z-

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
  2005-12-12 17:58 [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml Francois Pottier
  2005-12-12 19:51 ` [Caml-list] " "Márk S. Zoltán"
@ 2005-12-13 21:07 ` Nathaniel Gray
  2005-12-14  6:08   ` skaller
  2005-12-15  6:35 ` Stefan Monnier
  2 siblings, 1 reply; 18+ messages in thread
From: Nathaniel Gray @ 2005-12-13 21:07 UTC (permalink / raw)
  To: Francois.Pottier; +Cc: Caml Mailing List

[-- Attachment #1: Type: text/plain, Size: 3195 bytes --]

This is pretty nice!  Every time I use ocamlyacc I think "somebody should
write something better."  Now it looks like somebody has!  I can't tell you
how many times I've wanted parameterized rules and simple "library" rules
for parsing delimiter-separated lists and such...

Cheers,
-n8

On 12/12/05, Francois Pottier <Francois.Pottier@inria.fr> wrote:
>
>
> Dear Caml users,
>
> We are proud to announce the first release of Menhir. Menhir compiles
> LR(1)
> grammar specifications to OCaml code.
>
> Menhir is 90% compatible with ocamlyacc. That is, existing ocamlyacc
> grammar
> specifications are accepted and compiled by Menhir; the resulting parsers
> run
> and produce correct parse trees, except they produce incorrect position
> information, because none of the functionality of module Parsing is
> supported. Porting a grammar specification from ocamlyacc to Menhir
> requires
> replacing all calls to the Parsing module with new keywords.
>
> Why switch from ocamlyacc to Menhir? In short,
>
> * Menhir offers parameterized nonterminal symbols as well as a library of
>    standard definitions, including options, sequences, and lists. It also
>    offers limited support for EBNF syntax.
>
> * ocamlyacc accepts LALR(1) grammars; Menhir accepts LR(1) grammars, thus
>    avoiding certain artificial conflicts.
>
> * Menhir explains conflicts in terms of the grammar, not (only) in terms
> of
>    the automaton.
>
> * Menhir allows grammar specifications to be split over multiple files. It
>    also allows several grammars to share a single set of tokens.
>
> * Menhir produces reentrant parsers.
>
> * Menhir is able to produce parsers that are parameterized by Ocaml
> modules.
>
> * ocamlyacc requires semantic values to be referred to via keywords: $1,
> $2,
>    and so on. Menhir allows semantic values to be explicitly named.
>
> * Menhir's error and warning messages are usually more numerous and better
>    than ocamlyacc's.
>
> A more detailed comparison between ocamlyacc and Menhir appears in
> Menhir's
> documentation.
>
> This is an ALPHA-quality release, so there certainly remain a lot of bugs
> to iron out. Nevertheless, we encourage intrepid testers to have a look
> and send suggestions and bug reports our way. Thanks for your attention!
>
> Menhir requires ocaml 3.09. The source distribution and the documentation
> can
> be found at
>
>   http://pauillac.inria.fr/~fpottier/menhir/menhir-20051212.tar.gz
>   http://pauillac.inria.fr/~fpottier/menhir/manual.pdf
>
> --
> François Pottier and Yann Régis-Gianas
> {Francois.Pottier,Yann.Regis-Gianas}@inria.fr
> http://pauillac.inria.fr/~fpottier/
> http://pauillac.inria.fr/~regisgia/
>
> _______________________________________________
> Caml-list mailing list. Subscription management:
> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives: http://caml.inria.fr
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>



--
>>>-- Nathaniel Gray -- Caltech Computer Science ------>
>>>-- Mojave Project -- http://mojave.cs.caltech.edu -->

[-- Attachment #2: Type: text/html, Size: 4200 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
  2005-12-13 21:07 ` Nathaniel Gray
@ 2005-12-14  6:08   ` skaller
  2005-12-14  9:04     ` Francois Pottier
  0 siblings, 1 reply; 18+ messages in thread
From: skaller @ 2005-12-14  6:08 UTC (permalink / raw)
  To: Nathaniel Gray; +Cc: Francois.Pottier, Caml Mailing List

On Tue, 2005-12-13 at 13:07 -0800, Nathaniel Gray wrote:
> This is pretty nice!  Every time I use ocamlyacc I think "somebody
> should write something better."  Now it looks like somebody has!  I
> can't tell you how many times I've wanted parameterized rules and
> simple "library" rules for parsing delimiter-separated lists and
> such... 

Yes, it is pretty nice! However it still appears to have some
problems. Any comments appreciated.

0. The licence. Q public licence for the generator????
Please NO NO NO!! Not unless it is distributed
as part of the official distro. Is there any chance of that?
If not even GPL would be better ;(

1. Generating a functor is cute, but it doesn't seem to
allow arguments to parser functions. Perhaps I missed something?
Is there a way to use the functorisation with closures to
add an argument?

In particular, can the parser be generated *inside*
an environment such a function or let binding?
[Felix allows that, which means an extra argument is
not required, a variable in the environment can be used
instead]

2. The signature of parsers is still wrong? 
Ocamlyacc usesthe typing

	val parser: (lexbuf->token) -> lexbuf -> 'a

which is just bad. A better signature is

	val parser: ( unit -> token ) -> 'a

There is no need to provide location information: the correct
solution is to throw an exception, which is caught in a 
context which can determine the location.

It would be nice to be able to generate this signature 
with a command line switch, pragma, or some other mechanism,
even if the default is chosen for ocamlyacc compatibility.

3. I have doubts about the claim that parsers can 'share'
token types. I do not see how this is possible. It is
contradicted by the compilation model description, which
explains how it is necessary to join separate files making
up a grammar specification. In this case, the joined system
is going to generate a single token type, and any type
generated by another joining is certain to generate
a distinct type because

(a) the type is defined in a distinct ocaml module (mli file)
(b) the typing of normal variants is nominal

This problem would go away if polymorphic variants
were used instead, because the typenames are then simply
abbreviations, since pm-variants are structurally, not
nominally, typed.

Perhaps a command line switch, pragma, or whatever, to use
polymorphic variants instead of ordinary ones?

Actually, I personally find the 'yacc' technique of
generating tokens to be rather lame. Felix does this
much better -- the parser simply expects a token type
which is a variant, the type can be defined wherever
you like. In particular, the lexer and parser can
share that definition.

As far as I can see Menhir COULD do this, except of
course one would use %token as a special way
of generating the variant. All that would be required
I think is the syntax

%import_tokens "filename"

which refers to the token definition file -- as an
alternative to inlining these token definitions.
(if pm-variants are used you could probably support both,
though I'm not sure).

A token definition file then generates two files,
an ordinary mli file with the token variant type,
and, a special information file for the parser generator
(with the same information, but in a more useful form).

In Felix none of this is necessary because parsing is
built in, so the compiler can find the information required
for the parser generator directly from the token variant type.

4. Just curious, but how practical is LR(1) in terms of
generated code sizes? Felix is using Elkhound as its 
parser which is a GLR parser with an LALR(1) core. In theory
there is an option for choosing the core automaton, which
also allows LR(1) however I recall Scott McPeak commenting
it wasn't worth supporting because it generated tables
which were far too big. 

I'm curious how one would be able to predict the size of the 
generated code since I don't  really understand the 
additional constraints LALR(1) introduces .. 

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
  2005-12-14  6:08   ` skaller
@ 2005-12-14  9:04     ` Francois Pottier
  2005-12-14 10:27       ` Alessandro Baretta
  2005-12-14 20:51       ` skaller
  0 siblings, 2 replies; 18+ messages in thread
From: Francois Pottier @ 2005-12-14  9:04 UTC (permalink / raw)
  To: skaller; +Cc: Caml Mailing List

Hello,

On Wed, Dec 14, 2005 at 05:08:15PM +1100, skaller wrote:

> 0. The licence. Q public licence for the generator????
> Please NO NO NO!! Not unless it is distributed
> as part of the official distro. Is there any chance of that?

Our plan (not yet approved by the OCaml higher authorities)
is to make it part of the OCaml distribution.

> In particular, can the parser be generated *inside*
> an environment such a function or let binding?

No. The only way of passing arguments to the parser is through
functor arguments. (Or through global state, of course.) Why is
that a problem?

> Ocamlyacc usesthe typing
> 
> 	val parser: (lexbuf->token) -> lexbuf -> 'a
> 
> which is just bad. A better signature is
> 
> 	val parser: ( unit -> token ) -> 'a
> 
> There is no need to provide location information: the correct
> solution is to throw an exception, which is caught in a 
> context which can determine the location.

It can be nice to have the parser automatically extract position information
for you; doing it manually and explicitly within semantic actions would be
quite tedious.

On the other hand, the signature that you suggest would allow using lexers
that do not necessarily conform to the lexbuf interface; is that your point?
If so, we will consider adding a command line switch, as you suggest. It
would disallow the use of the position keywords ($startpos, $endpos, etc.)

> 3. I have doubts about the claim that parsers can 'share'
> token types. I do not see how this is possible. 

It certainly is possible: have a look at demos/calc-two in the
distribution. The example is artificial but illustrates the
possibility of sharing a token type.

> Actually, I personally find the 'yacc' technique of
> generating tokens to be rather lame. Felix does this
> much better -- the parser simply expects a token type
> which is a variant, the type can be defined wherever
> you like.

With Menhir, the token type can also be hand-written or generated via other
means. It doesn't have to be generated by Menhir. It has to be an algebraic
data type, though, not a polymorphic variant.

> %import_tokens "filename"

This is called --external-tokens Filename. It's a command line switch.

> 4. Just curious, but how practical is LR(1) in terms of
> generated code sizes? Felix is using Elkhound as its 
> parser which is a GLR parser with an LALR(1) core. In theory
> there is an option for choosing the core automaton, which
> also allows LR(1) however I recall Scott McPeak commenting
> it wasn't worth supporting because it generated tables
> which were far too big. 

There are two questions: how big is the automaton compared to
an LALR automaton? and how big is the code that implements the
automaton?

The answer to the first question is, in most cases, the grammar
is LALR(1) anyway, and the automaton generated by Menhir is
identical to the one that would be produced by ocamlyacc or
by some other LALR(1) parser generator. When the grammar is
not LALR(1), the LR(1) automaton has more states than an
LALR(1) automaton for the same grammar. The number of extra
states is usually small (often just one, or a handful). So
there's no concern here.

Considering the answer to the first question, the second
question would not arise at all if Menhir produced tables, like
ocamlyacc. However, Menhir does not produce tables; it
compiles the automaton down to a bunch of mutually recursive
functions. We have not yet scientifically assessed the
difference in size between tables and code, but a few
quick experiments indicate that it is reasonable (the
code is larger than the tables by a factor of two or
three).

-- 
François Pottier
Francois.Pottier@inria.fr
http://pauillac.inria.fr/~fpottier/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
  2005-12-14  9:04     ` Francois Pottier
@ 2005-12-14 10:27       ` Alessandro Baretta
  2005-12-14 21:04         ` skaller
  2005-12-14 20:51       ` skaller
  1 sibling, 1 reply; 18+ messages in thread
From: Alessandro Baretta @ 2005-12-14 10:27 UTC (permalink / raw)
  To: Francois.Pottier; +Cc: Caml Mailing List

Francois Pottier wrote:

> Considering the answer to the first question, the second
> question would not arise at all if Menhir produced tables, like
> ocamlyacc. However, Menhir does not produce tables; it
> compiles the automaton down to a bunch of mutually recursive
> functions. We have not yet scientifically assessed the
> difference in size between tables and code, but a few
> quick experiments indicate that it is reasonable (the
> code is larger than the tables by a factor of two or
> three).
> 

In general, I like the approach, as it can easily scale to an extensible parser 
generator by late-binding the recursion through a record/array/table of 
functions. Have you thought about this at all?

Alex

-- 
*********************************************************************
http://www.barettadeit.com/
Baretta DE&IT
A division of Baretta SRL

tel. +39 02 370 111 55
fax. +39 02 370 111 54

Our technology:

The Application System/Xcaml (AS/Xcaml)
<http://www.asxcaml.org/>

The FreerP Project
<http://www.freerp.org/>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
  2005-12-14 10:27       ` Alessandro Baretta
@ 2005-12-14 21:04         ` skaller
  2005-12-15  8:46           ` Francois Pottier
  0 siblings, 1 reply; 18+ messages in thread
From: skaller @ 2005-12-14 21:04 UTC (permalink / raw)
  To: Alessandro Baretta; +Cc: Francois.Pottier, Caml Mailing List

On Wed, 2005-12-14 at 11:27 +0100, Alessandro Baretta wrote:
> Francois Pottier wrote:
> 
> > However, Menhir does not produce tables; it
> > compiles the automaton down to a bunch of mutually recursive
> > functions. We have not yet scientifically assessed the
> > difference in size between tables and code, but a few
> > quick experiments indicate that it is reasonable (the
> > code is larger than the tables by a factor of two or
> > three).
> > 
> 
> In general, I like the approach, as it can easily scale to an extensible parser 
> generator by late-binding the recursion through a record/array/table of 
> functions. Have you thought about this at all?

Actually the Felix parser does this right now -- and that is
why I'm asking for the parser functions to allow a parameter:
to pass in the data structure. I actually have simplistic
recursive descent interpreter which is called by ocamlyacc,
and which calls back into ocamlyacc: at present this
allows user defined statements to be added to the Felix
language. It is very crude, since there is no way
to properly type user non-terminals other than statements,
and it only allows extension of statements.

Interestingly the 'end of stream' issue discussed at length
in the Menhir manual then applies also to the points
of recursion. For a statement in Felix, there is no lookahead
at the end (they end with a semicolon). For expressions,
I have to read one token too many, keep a list of all
possible tokens that can terminate an expression,
introduce an 'augmented expression' which is an expression
followed by any of these tokens, and then in the action
code 'put back' the token into the token stream.

It would be really nice if I could write the
production as:

exprx: expr ~expr

where ~expr means 'any token which forces a reduction
of the whole expression' if you know what I mean.
I have to assemble this list by hand at the moment.
I did it by starting with a list of all the tokens,
and removing the ones which seemed to cause a conflict.
I will certainly forget to upgrade this list when I add
new tokens -- hence a desire for something like ~expr
which tells the parser to do it :)

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
  2005-12-14 21:04         ` skaller
@ 2005-12-15  8:46           ` Francois Pottier
  2005-12-15 11:03             ` skaller
  0 siblings, 1 reply; 18+ messages in thread
From: Francois Pottier @ 2005-12-15  8:46 UTC (permalink / raw)
  To: skaller; +Cc: Alessandro Baretta, Caml Mailing List

On Thu, Dec 15, 2005 at 08:04:54AM +1100, skaller wrote:

> Interestingly the 'end of stream' issue discussed at length
> in the Menhir manual then applies also to the points
> of recursion.

That's a good point. Actually, we are planning to implement in Menhir a
mechanism that helps stop at some point and invoke an external lexer or parser
(this can be any piece of code that consumes characters off the input stream).
Doing this properly requires *not* reading the lookahead token. As you point
out, the problems that arise are similar to the way the end of stream is
handled.

> It would be really nice if I could write the
> production as:
> 
> exprx: expr ~expr
> 
> where ~expr means 'any token which forces a reduction
> of the whole expression' if you know what I mean.

No, I don't know exactly what you mean :) I need to think more to understand
if this construction makes sense. It makes the grammar self-referential, in a
way, so it at least has to be monotonic in order to make sense.

Our plan regarding calls to external lexers or parsers was to be more rigid,
that is, to require the current statement (or expression, or whatever) to be
unambiguously terminated (for instance, by a semicolon) at the point where the
external call is made. So what you do with statements would be accepted, but
what you do with expressions might be impossible.

-- 
François Pottier
Francois.Pottier@inria.fr
http://pauillac.inria.fr/~fpottier/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
  2005-12-15  8:46           ` Francois Pottier
@ 2005-12-15 11:03             ` skaller
  0 siblings, 0 replies; 18+ messages in thread
From: skaller @ 2005-12-15 11:03 UTC (permalink / raw)
  To: Francois.Pottier; +Cc: Alessandro Baretta, Caml Mailing List

On Thu, 2005-12-15 at 09:46 +0100, Francois Pottier wrote:

> Doing this properly requires *not* reading the lookahead token. 

> As you point
> out, the problems that arise are similar to the way the end of stream is
> handled.
> 
> > It would be really nice if I could write the
> > production as:
> > 
> > exprx: expr ~expr
> > 
> > where ~expr means 'any token which forces a reduction
> > of the whole expression' if you know what I mean.
> 
> No, I don't know exactly what you mean :) I need to think more to understand
> if this construction makes sense. 

> It makes the grammar self-referential, in a
> way, so it at least has to be monotonic in order to make sense.
> 
> Our plan regarding calls to external lexers or parsers was to be more rigid,
> that is, to require the current statement (or expression, or whatever) to be
> unambiguously terminated (for instance, by a semicolon) at the point where the
> external call is made. So what you do with statements would be accepted, but
> what you do with expressions might be impossible.

So, the idea is this: for an 1 token look ahead parser,
all productions terminate either with 0 lookahead or 1 lookahead.

Now, you're saying only to support recursion for 0 lookahead
productions. 

So what if you have a 1 lookahead and you still want
to recurse??

One answer is: make a new augmented production with is 0 lookahead:

	prod': prod ~prod { $1, $2 }

where ~prod is the nonterminal

	~prod: 
		| t1 { t1 $1 } 
		| t2 { t2 $1 }
		| t3 { t3 $1 }
		....

where t1, t2, t3, .. are tokens that would not be shifted
by the parser, and therefore cause prod to be reduced.

The action then returns the result of prod, PLUS the 
lookahead token. Before recursing, we can simply
push the lookahead token back into the head of the
token stream.

Felix does precisely this -- it convents a 1 lookahead
production into a 0 lookahead production and therefore:

> what you do with expressions might be impossible.

seems possible by the above mechanism.
It is equivalent to adding a set of terminating tokens.

The *problem* is choosing terminating tokens that
are viable first tokens of any sequence of symbols
that can follow an expression elsewhere in the grammar.

For example consider:

	while expr do ..
	for i in expr to expr do

This grammar would be ambiguous, if, whilst parsing
an expression, a 'do' or 'to' token did not cause
expr to be reduced. So 'do' and 'to' are 'expression
termination tokens', that is, 'do' and 'to' are
members of the set '~expr'.

I forget the correct terminology about viable
prefixes and handles.. but I hope the idea is clear.

I think you did raise an important point though:
the meaning of ~x is dependent on the start symbol.
So the notation should probably be something like:

	start~x

meaning 'all the tokens which can terminate an x whilst
parsing start'.

This algorithm 'works' for finding 'start~x':

"try every token. If it leads to a conflict it is not
in start~x, otherwise it is"

I am doing that by hand at the moment. Here are the relevant 
parts of Felix:

%type <Flx_ast.expr_t * token> exprx
%start exprx

exprx:
  expr expr_terminator { $1,$2 }

expr_terminator:
  | SEMI { SEMI $1 }
  | USER_KEYWORD { USER_KEYWORD $1 }
  | VBAR { VBAR $1 }
  | ENDMARKER { ENDMARKER }

@for n,t in flx_expr_terminator_keywords:
  tangle("  | "  + t + " { " + t + " $1 }",inhibit_sref=1)

The @ stuff is a Python script that uses a Python table
to generate code, elsewhere i have:

flx_expr_terminator_keywords = [
    ("all", "ALL"),
    ("assert", "ASSERT"),
    ("body", "BODY"),
    ("call", "CALL"),
    ("case", "CASE"),

....

    ("with", "WITH"),
    ("until", "UNTIL"),
    ("_", "UNDERSCORE"),
    ("_gc_pointer", "GC_POINTER"),
    ("_svc", "SVC"),
  ]

flx_other_keywords = [
    ("and", "AND"),
    ("as", "AS"),
....
    ("regmatch", "REGMATCH"),
    ("the", "THE"),
    ("typematch", "TYPEMATCH"),
  ]

flx_keywords = flx_expr_terminator_keywords + flx_other_keywords

I generated the partition by trial and error. When I modify
the grammar I could miss some cases. In fact,
I have no idea if the set of terminators is maximal.
The parser KNOWS .. if I add an wrong keyword
into the terminator set I get a conflict.

The technique I am using would be impossible to maintain
if there were 20-30 such productions at 'recursion points'.
[At the moment in Felix there are just two: statements
and expression]

I hope this makes sense: I think what I'm asking for
is quite general, since it enables any LR1 parser
to be used with an LL parser, provided one can
'push back' the offending lookahead token.
(that is, of course, always possible to implement
by buffering for mutable streams, and persistence
for functional ones).

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
  2005-12-14  9:04     ` Francois Pottier
  2005-12-14 10:27       ` Alessandro Baretta
@ 2005-12-14 20:51       ` skaller
  2005-12-14 22:15         ` Joaquin Cuenca Abela
  1 sibling, 1 reply; 18+ messages in thread
From: skaller @ 2005-12-14 20:51 UTC (permalink / raw)
  To: Francois.Pottier; +Cc: Caml Mailing List

On Wed, 2005-12-14 at 10:04 +0100, Francois Pottier wrote:
> Hello,
> 
> On Wed, Dec 14, 2005 at 05:08:15PM +1100, skaller wrote:
> 
> > 0. The licence. Q public licence for the generator????
> > Please NO NO NO!! Not unless it is distributed
> > as part of the official distro. Is there any chance of that?
> 
> Our plan (not yet approved by the OCaml higher authorities)
> is to make it part of the OCaml distribution.

Good!!!

> > In particular, can the parser be generated *inside*
> > an environment such a function or let binding?
> 
> No. The only way of passing arguments to the parser is through
> functor arguments. (Or through global state, of course.) 

Which is the same thing...

> Why is that a problem?

Because it destroys re-entrancy. My parser is recursive,
it actually calls a recursive descent parser to handle
user defined extensions, that parser in turn can call
other parsers, including the original ocamlyacc parser.

Passing an argument to the parser is the natural way
to do this. What I'm actually doing now is storing
the data in a token, since this is the only way
to get the required data to the parser.

however it is a trick. It 'happens' to work at the
moment, but may break at any time (it works because
the information is constructed by the lexer which 
runs in an earlier phase -- that is good enough
for the current extensions I allow, but this
may change tomorrow -- I'd love to scope these
extensions properly, but that really does require
passing an argument)

It's just the usual reason for functional programming
and persistent data structures I guess.

> > Ocamlyacc usesthe typing
> > 
> > 	val parser: (lexbuf->token) -> lexbuf -> 'a
> > 
> > which is just bad. A better signature is
> > 
> > 	val parser: ( unit -> token ) -> 'a
> > 
> > There is no need to provide location information: the correct
> > solution is to throw an exception, which is caught in a 
> > context which can determine the location.
> 
> It can be nice to have the parser automatically extract position information
> for you; doing it manually and explicitly within semantic actions would be
> quite tedious.

Yes, it can be nice for a toy parser. For a complex production
quality parser there is no way to avoid working hard to manage
source location data.

> On the other hand, the signature that you suggest would allow using lexers
> that do not necessarily conform to the lexbuf interface; is that your point?

Indeed. (My parsers run off lists of tokens .. they're lexed using
ocamllex .. but in an earlier phase. And the lists are processed
in between, for example, whitespace tokens are elided because
the parser can't handle them)

> If so, we will consider adding a command line switch, as you suggest. It
> would disallow the use of the position keywords ($startpos, $endpos, etc.)

Yes. That makes sense. There are other alternatives, such as
requiring the tokeniser to return a pair, consisting of a
token and the location. In Ocamlyacc this would be a bad
idea, since it would require a fixed notion of location.

In Menhir, however, this is not so, since it can generate
a functor which can be parameterised by the location type.

> > 3. I have doubts about the claim that parsers can 'share'
> > token types. I do not see how this is possible. 
> 
> It certainly is possible: have a look at demos/calc-two in the
> distribution. The example is artificial but illustrates the
> possibility of sharing a token type.

Hmm .. ok, I will look.

> > Actually, I personally find the 'yacc' technique of
> > generating tokens to be rather lame. Felix does this
> > much better -- the parser simply expects a token type
> > which is a variant, the type can be defined wherever
> > you like.
> 
> With Menhir, the token type can also be hand-written or generated via other
> means. It doesn't have to be generated by Menhir. It has to be an algebraic
> data type, though, not a polymorphic variant.

Ah, I see, the same as Felix in that respect then.
This solves a lot of problems, IMHO.

> Considering the answer to the first question, the second
> question would not arise at all if Menhir produced tables, like
> ocamlyacc. However, Menhir does not produce tables; it
> compiles the automaton down to a bunch of mutually recursive
> functions. We have not yet scientifically assessed the
> difference in size between tables and code, but a few
> quick experiments indicate that it is reasonable (the
> code is larger than the tables by a factor of two or
> three).

I'm also curious as to the performance of code vs
table driven interpreter. One would think the code
is faster .. but size matters a lot on current CPUs
due to caching. Still, the O() performance will
be much the same -- 2-3 times larger code is reasonable
considering that it will probably make it easier
to enhance the generator (as has already been done
by be able to generate functors .. :)

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
  2005-12-14 20:51       ` skaller
@ 2005-12-14 22:15         ` Joaquin Cuenca Abela
  2005-12-15  8:40           ` Francois Pottier
  0 siblings, 1 reply; 18+ messages in thread
From: Joaquin Cuenca Abela @ 2005-12-14 22:15 UTC (permalink / raw)
  To: Francois.Pottier; +Cc: Caml Mailing List

Hi,

If you are still accepting suggestions for Menhir,
mine is to allow guards on rules to prevent the
reduction of a rule when the guard is false.

Some languages are easy to parse, with some exceptions
that are hard to express with a grammar, as "these two
tokens should be on the same line" (hard to do if
everywhere else new lines are ignored...)

Cheers,



Joaquin Cuenca Abela
e98cuenc at yahoo dot com

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
  2005-12-14 22:15         ` Joaquin Cuenca Abela
@ 2005-12-15  8:40           ` Francois Pottier
  0 siblings, 0 replies; 18+ messages in thread
From: Francois Pottier @ 2005-12-15  8:40 UTC (permalink / raw)
  To: Joaquin Cuenca Abela; +Cc: Caml Mailing List


Hello,

On Wed, Dec 14, 2005 at 02:15:06PM -0800, Joaquin Cuenca Abela wrote:

> If you are still accepting suggestions for Menhir,
> mine is to allow guards on rules to prevent the
> reduction of a rule when the guard is false.

I don't see how to implement this in an LR parser... can you explain?

-- 
François Pottier
Francois.Pottier@inria.fr
http://pauillac.inria.fr/~fpottier/


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
  2005-12-12 17:58 [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml Francois Pottier
  2005-12-12 19:51 ` [Caml-list] " "Márk S. Zoltán"
  2005-12-13 21:07 ` Nathaniel Gray
@ 2005-12-15  6:35 ` Stefan Monnier
  2005-12-15  8:47   ` [Caml-list] " Francois Pottier
  2 siblings, 1 reply; 18+ messages in thread
From: Stefan Monnier @ 2005-12-15  6:35 UTC (permalink / raw)
  To: caml-list

> We are proud to announce the first release of Menhir. Menhir compiles LR(1)
> grammar specifications to OCaml code.

Nice.
I wish it had something like ml-yacc's automatic handling of syntax-errors
(which tries to insert/delete tokens in the input stream until the parse
succeeds).


        Stefan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Caml-list] Re: [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
  2005-12-15  6:35 ` Stefan Monnier
@ 2005-12-15  8:47   ` Francois Pottier
  2005-12-15 16:41     ` Stefan Monnier
  0 siblings, 1 reply; 18+ messages in thread
From: Francois Pottier @ 2005-12-15  8:47 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: caml-list


Hello,

On Thu, Dec 15, 2005 at 01:35:40AM -0500, Stefan Monnier wrote:

> I wish it had something like ml-yacc's automatic handling of syntax-errors
> (which tries to insert/delete tokens in the input stream until the parse
> succeeds).

Do you mean instead of or in addition to the current mechanism? (I assume
instead of. I don't know if the two can be made to coexist.)

I am not particularly happy with Menhir's current error mechanism, which
attempts to follow yacc's. If you could explain to us why ML-Yacc's approach
is superior, that would be great.

-- 
François Pottier
Francois.Pottier@inria.fr
http://pauillac.inria.fr/~fpottier/


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Caml-list] Re: [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
  2005-12-15  8:47   ` [Caml-list] " Francois Pottier
@ 2005-12-15 16:41     ` Stefan Monnier
  2005-12-15 16:50       ` Francois Pottier
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Monnier @ 2005-12-15 16:41 UTC (permalink / raw)
  To: Francois.Pottier; +Cc: caml-list

>> I wish it had something like ml-yacc's automatic handling of syntax-errors
>> (which tries to insert/delete tokens in the input stream until the parse
>> succeeds).

> Do you mean instead of or in addition to the current mechanism? (I assume
> instead of.  I don't know if the two can be made to coexist.)

I don't know enough of the details to tell you, but I'd guess it'd have to
be "instead of", indeed.

> I am not particularly happy with Menhir's current error mechanism, which
> attempts to follow yacc's.  If you could explain to us why ML-Yacc's
> approach is superior, that would be great.

ML-Yacc's error recovery if fully automatic: you write your grammar without
worrying about error handling and the generated parser automatically signals
syntax errors (note the "s" at the end: it doesn't just stop at the first
syntax error) by mentioning that it had to insert a ";" at line L1 and
remove an ID at line L2.

For toy projects it's amazing: you get good error messages with 0 effort.

The main downside of ML-Yacc IIRC has to do with the fact that it requires
a fairly strong decoupling between the lexer and parser (you can't do tricks
like change the lexer state from a parser semantic action IIRC), but it may
be just a limitation of ML-Yacc rather than of the underlying technique.

See http://www.smlnj.org/doc/ML-Yacc/mlyacc001.html#toc3

        Stefan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Caml-list] Re: [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
  2005-12-15 16:41     ` Stefan Monnier
@ 2005-12-15 16:50       ` Francois Pottier
  2005-12-15 18:56         ` Stefan Monnier
  2005-12-30 21:57         ` Florian Weimer
  0 siblings, 2 replies; 18+ messages in thread
From: Francois Pottier @ 2005-12-15 16:50 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: caml-list


Hi Stefan,

Thanks for your answer.

> For toy projects it's amazing: you get good error messages with 0 effort.

What about non-toy projects? Are there times where the automatic error
recovery mechanism gets in the way?

-- 
François Pottier
Francois.Pottier@inria.fr
http://pauillac.inria.fr/~fpottier/


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Caml-list] Re: [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
  2005-12-15 16:50       ` Francois Pottier
@ 2005-12-15 18:56         ` Stefan Monnier
  2005-12-30 21:57         ` Florian Weimer
  1 sibling, 0 replies; 18+ messages in thread
From: Stefan Monnier @ 2005-12-15 18:56 UTC (permalink / raw)
  To: Francois.Pottier; +Cc: caml-list

>> For toy projects it's amazing: you get good error messages with 0 effort.
> What about non-toy projects? Are there times where the automatic error
> recovery mechanism gets in the way?

I wouldn't know, really.
SML/NJ uses it and I find its syntax error reporting to be generally good.
E.g. significantly better than OCaml's ;-)

But if it's not good enough, there are hooks to influence its behavior
(e.g. telling it what to try and add/remove first, thus telling it basically
what are tokens commonly forgotten and spuriously added), but I don't know
how it compares in practice to a well tuned yacc error recovery.

        Stefan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Caml-list] Re: [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml
  2005-12-15 16:50       ` Francois Pottier
  2005-12-15 18:56         ` Stefan Monnier
@ 2005-12-30 21:57         ` Florian Weimer
  1 sibling, 0 replies; 18+ messages in thread
From: Florian Weimer @ 2005-12-30 21:57 UTC (permalink / raw)
  To: Francois.Pottier; +Cc: Stefan Monnier, caml-list

* Francois Pottier:

> What about non-toy projects? Are there times where the automatic error
> recovery mechanism gets in the way?

AFAIK, MLton uses it, and the inserted symbols are often not those
that are actually missing.  This leads to error messages which are
rather confusing.  With manually written error handling, you can
surely do much better, but the effort pays off in special cases.
(Most of the time, the careful error handling code is never written,
of course.)

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2005-12-30 21:58 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-12-12 17:58 [ANNOUNCE] Alpha release of Menhir, an LR(1) parser generator for ocaml Francois Pottier
2005-12-12 19:51 ` [Caml-list] " "Márk S. Zoltán"
2005-12-13 21:07 ` Nathaniel Gray
2005-12-14  6:08   ` skaller
2005-12-14  9:04     ` Francois Pottier
2005-12-14 10:27       ` Alessandro Baretta
2005-12-14 21:04         ` skaller
2005-12-15  8:46           ` Francois Pottier
2005-12-15 11:03             ` skaller
2005-12-14 20:51       ` skaller
2005-12-14 22:15         ` Joaquin Cuenca Abela
2005-12-15  8:40           ` Francois Pottier
2005-12-15  6:35 ` Stefan Monnier
2005-12-15  8:47   ` [Caml-list] " Francois Pottier
2005-12-15 16:41     ` Stefan Monnier
2005-12-15 16:50       ` Francois Pottier
2005-12-15 18:56         ` Stefan Monnier
2005-12-30 21:57         ` Florian Weimer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox