From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <skaller@users.sourceforge.net>
Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78])
	by yquem.inria.fr (Postfix) with ESMTP id BAEBEBB81;
	Wed, 14 Dec 2005 21:51:14 +0100 (CET)
Received: from smtp3.adl2.internode.on.net (smtp3.adl2.internode.on.net [203.16.214.203])
	by nez-perce.inria.fr (8.13.0/8.13.0) with ESMTP id jBEKpCDP005836;
	Wed, 14 Dec 2005 21:51:13 +0100
Received: from rosella (ppp33-4.lns1.syd2.internode.on.net [59.167.33.4] (may be forged))
	by smtp3.adl2.internode.on.net (8.12.9/8.12.6) with ESMTP id jBEKp3Zu016722;
	Thu, 15 Dec 2005 07:21:03 +1030 (CST)
	(envelope-from skaller@users.sourceforge.net)
Subject: Re: [Caml-list] [ANNOUNCE] Alpha release of Menhir, an LR(1)
	parser generator for ocaml
From: skaller <skaller@users.sourceforge.net>
To: Francois.Pottier@inria.fr
Cc: Caml Mailing List <caml-list@yquem.inria.fr>
In-Reply-To: <20051214090427.GB1330@yquem.inria.fr>
References: <20051212175838.GA8502@yquem.inria.fr>
	 <aee06c9e0512131307k3fc494a5k3591d549d552f1b@mail.gmail.com>
	 <1134540495.8980.63.camel@rosella>  <20051214090427.GB1330@yquem.inria.fr>
Content-Type: text/plain
Date: Thu, 15 Dec 2005 07:51:02 +1100
Message-Id: <1134593462.8942.67.camel@rosella>
Mime-Version: 1.0
X-Mailer: Evolution 2.4.1 
Content-Transfer-Encoding: 7bit
X-Miltered: at nez-perce with ID 43A085C0.001 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Spam: no; 0.00; caml-list:01 parser:01 ocaml:01 distro:01 ocaml:01 parser:01 functor:01 recursive:01 recursive:01 parsers:01 ocamlyacc:01 token:01 lexer:01 ocamlyacc:01 val:01 
X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled 
	version=3.0.3

On Wed, 2005-12-14 at 10:04 +0100, Francois Pottier wrote:
> Hello,
> 
> On Wed, Dec 14, 2005 at 05:08:15PM +1100, skaller wrote:
> 
> > 0. The licence. Q public licence for the generator????
> > Please NO NO NO!! Not unless it is distributed
> > as part of the official distro. Is there any chance of that?
> 
> Our plan (not yet approved by the OCaml higher authorities)
> is to make it part of the OCaml distribution.

Good!!!

> > In particular, can the parser be generated *inside*
> > an environment such a function or let binding?
> 
> No. The only way of passing arguments to the parser is through
> functor arguments. (Or through global state, of course.) 

Which is the same thing...

> Why is that a problem?

Because it destroys re-entrancy. My parser is recursive,
it actually calls a recursive descent parser to handle
user defined extensions, that parser in turn can call
other parsers, including the original ocamlyacc parser.

Passing an argument to the parser is the natural way
to do this. What I'm actually doing now is storing
the data in a token, since this is the only way
to get the required data to the parser.

however it is a trick. It 'happens' to work at the
moment, but may break at any time (it works because
the information is constructed by the lexer which 
runs in an earlier phase -- that is good enough
for the current extensions I allow, but this
may change tomorrow -- I'd love to scope these
extensions properly, but that really does require
passing an argument)

It's just the usual reason for functional programming
and persistent data structures I guess.

> > Ocamlyacc usesthe typing
> > 
> > 	val parser: (lexbuf->token) -> lexbuf -> 'a
> > 
> > which is just bad. A better signature is
> > 
> > 	val parser: ( unit -> token ) -> 'a
> > 
> > There is no need to provide location information: the correct
> > solution is to throw an exception, which is caught in a 
> > context which can determine the location.
> 
> It can be nice to have the parser automatically extract position information
> for you; doing it manually and explicitly within semantic actions would be
> quite tedious.

Yes, it can be nice for a toy parser. For a complex production
quality parser there is no way to avoid working hard to manage
source location data.

> On the other hand, the signature that you suggest would allow using lexers
> that do not necessarily conform to the lexbuf interface; is that your point?

Indeed. (My parsers run off lists of tokens .. they're lexed using
ocamllex .. but in an earlier phase. And the lists are processed
in between, for example, whitespace tokens are elided because
the parser can't handle them)

> If so, we will consider adding a command line switch, as you suggest. It
> would disallow the use of the position keywords ($startpos, $endpos, etc.)

Yes. That makes sense. There are other alternatives, such as
requiring the tokeniser to return a pair, consisting of a
token and the location. In Ocamlyacc this would be a bad
idea, since it would require a fixed notion of location.

In Menhir, however, this is not so, since it can generate
a functor which can be parameterised by the location type.

> > 3. I have doubts about the claim that parsers can 'share'
> > token types. I do not see how this is possible. 
> 
> It certainly is possible: have a look at demos/calc-two in the
> distribution. The example is artificial but illustrates the
> possibility of sharing a token type.

Hmm .. ok, I will look.

> > Actually, I personally find the 'yacc' technique of
> > generating tokens to be rather lame. Felix does this
> > much better -- the parser simply expects a token type
> > which is a variant, the type can be defined wherever
> > you like.
> 
> With Menhir, the token type can also be hand-written or generated via other
> means. It doesn't have to be generated by Menhir. It has to be an algebraic
> data type, though, not a polymorphic variant.

Ah, I see, the same as Felix in that respect then.
This solves a lot of problems, IMHO.

> Considering the answer to the first question, the second
> question would not arise at all if Menhir produced tables, like
> ocamlyacc. However, Menhir does not produce tables; it
> compiles the automaton down to a bunch of mutually recursive
> functions. We have not yet scientifically assessed the
> difference in size between tables and code, but a few
> quick experiments indicate that it is reasonable (the
> code is larger than the tables by a factor of two or
> three).

I'm also curious as to the performance of code vs
table driven interpreter. One would think the code
is faster .. but size matters a lot on current CPUs
due to caching. Still, the O() performance will
be much the same -- 2-3 times larger code is reasonable
considering that it will probably make it easier
to enhance the generator (as has already been done
by be able to generate functors .. :)

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net