* menhir
From: skaller @ 2007-04-28 10:32 UTC
To: caml-list

Just a note: I just built the Felix parser with Menhir.

First, it detected some duplicate definitions Ocamlyacc didn't! Good!

Second, I got "rather a lot" of states with end-of-stream conflicts.
What's that about?

Third, the generated ml file was 4.5 Meg. Ocamlopt on amd64 hung for so
long I almost posted a bug report for Ocamlopt, but finally it finished.
This was a 95% CPU, 25% memory job, so no paging. I'd guess it took 100x
longer than compilation of the ocamlyacc file (which is just a bunch of
numbers :). I didn't measure it .. no biggie for me now that I know, but
my box is a LOT faster than some of the boxes my product gets built on.

After that, Felix built OK, and the parser worked for 'pure' code.
However, it failed when Felix preprocessor syntax extensions were used
(which is 90% of all programs).

Now, most of the system data transport for this is properly built, so it
can't cause any problems. The one thing which is hacked is the pushback
detection. Basically: when Ocamlyacc reduces a production, it sometimes
ends on the last token, and sometimes it overshoots by 1.

My grammar uses a system like:

  exprx:
    expr expr_terminator { $1, $2 }

  statementsx:
    | statement_aster statements_terminator { $1, $2 }

This is saying: a 'special expression' exprx is an expression PLUS one of
the tokens which will solidly terminate an expression AND NOT ITSELF BE
OVERSHOT by the parser. In other words, when exprx is parsed, the
reduction must leave the next token unread.

The semantics used are: when the exprx is processed, the action arranges
to push the terminator token back into the token stream.

Perhaps because Menhir is LR(1), not LALR(1), this technique is failing.
Or Menhir may simply be looking ahead further than required in the token
stream. Whichever way, I am depending on this particular implementation
detail of Ocamlyacc, and Menhir is using a different implementation.

Any suggestions how to 'fix' this?

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net
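[Editorial sketch: a minimal, generic illustration of the pushback trick
described above, not Felix's actual code. The token source wraps whatever
the real lexer provides; an action can push the overshot terminator back
so it is served again on the next read.]

  (* A generic token source with one-token pushback. *)
  let make_reader (get_token : unit -> 'tok) =
    let pushed = ref [] in                  (* pushed-back tokens, most recent first *)
    let next () =
      match !pushed with
      | t :: rest -> pushed := rest; t      (* serve pushed-back tokens first *)
      | [] -> get_token ()                  (* otherwise read from the lexer *)
    and push t =
      pushed := t :: !pushed                (* called by an action on the overshot terminator *)
    in
    next, push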
* Re: [Caml-list] menhir
From: Francois Pottier @ 2007-04-28 16:50 UTC
To: skaller; +Cc: caml-list

Hello,

On Sat, Apr 28, 2007 at 08:32:16PM +1000, skaller wrote:
> Second, I got "rather a lot" of states with end-of-stream conflicts.
> What's that about?

That's precisely about "overshooting" the end of the token stream
(reading one token too many). The issue is described in the manual.

> Third, the generated ml file was 4.5 Meg. Ocamlopt on amd64 hung for so
> long I almost posted a bug report for Ocamlopt, but finally it finished.

Yup, I am somewhat disappointed that ocamlopt does not seem to have
linear time complexity, but I shouldn't complain too loudly, my boss may
be listening :)

An option to generate tables like ocamlyacc would clearly be useful.

> Basically: when Ocamlyacc reduces a production, it sometimes
> ends on the last token, and sometimes it overshoots by 1.

I don't know the details of your grammar, but our (perhaps naive) view is
that you should modify your grammar to avoid end-of-stream conflicts (and
Menhir's conflict reports will help you do that). Then, Menhir will not
overshoot.

> The semantics used are: when the exprx is processed, the action
> arranges to push the terminator token back into the token stream.

How is this done? It is possible that this hack is not compatible with
Menhir, because Menhir keeps a local cache of the next token, which is
not properly updated when you push your old token back.

--
François Pottier
Francois.Pottier@inria.fr
http://cristal.inria.fr/~fpottier/
* Re: [Caml-list] menhir
From: Markus Mottl @ 2007-04-28 19:47 UTC
To: Francois.Pottier; +Cc: skaller, caml-list

On 4/28/07, Francois Pottier <Francois.Pottier@inria.fr> wrote:
> Yup, I am somewhat disappointed that ocamlopt does not seem to have
> linear time complexity, but I shouldn't complain too loudly, my boss
> may be listening :)

This is unfortunate indeed. I have also run into problems with the
compilers exhibiting quadratic behavior on very large sources. It seems
that this mostly stems from the use of data structures in the compiler
that scale badly on certain operations (e.g. searching through lists
rather than through balanced trees). I think it wouldn't be difficult to
update the compiler to use more efficient data structures, but it would
certainly be a hell of a lot of work...

Regards,
Markus

--
Markus Mottl    http://www.ocaml.info    markus.mottl@gmail.com
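[Editorial toy example, not compiler code: the kind of data-structure
swap being suggested, a linear association-list lookup versus a
balanced-tree map lookup over the same bindings.]

  module StringMap = Map.Make (String)

  let slow_lookup env name = List.assoc name env        (* O(n) linear scan *)

  let fast_lookup env name = StringMap.find name env    (* O(log n) balanced tree *)

  let () =
    let bindings = [ "x", 1; "y", 2 ] in
    let env =
      List.fold_left (fun m (k, v) -> StringMap.add k v m)
        StringMap.empty bindings
    in
    assert (slow_lookup bindings "y" = fast_lookup env "y")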
* Re: [Caml-list] menhir
From: Jon Harrop @ 2007-04-28 21:15 UTC
To: caml-list

On Saturday 28 April 2007 20:47, Markus Mottl wrote:
> I think it wouldn't be difficult to update the compiler to use more
> efficient data structures, but it would certainly be a hell of a lot
> of work...

I've had a quick look at this before. Some of our compiler compilers were
causing ocamlopt to stack overflow, and the problem turned out to be a
pair of mutually recursive functions in different modules of the type
checker that were not tail recursive. I couldn't see an easy solution
without rewriting the whole thing in CPS.

Perhaps a program to convert OCaml code into CPS style would be useful?

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
The F#.NET Journal
http://www.ffconsultancy.com/products/fsharp_journal/?e
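[Editorial toy illustration of the CPS idea mentioned above, not compiler
code: a non-tail-recursive function rewritten so every call is a tail
call, trading stack frames for heap-allocated continuations.]

  let rec length = function
    | [] -> 0
    | _ :: rest -> 1 + length rest          (* not tail recursive: deep input overflows the stack *)

  let length_cps l =
    let rec go l k =
      match l with
      | [] -> k 0
      | _ :: rest -> go rest (fun n -> k (n + 1))   (* every call is a tail call *)
    in
    go l (fun n -> n)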
* Re: [Caml-list] menhir
From: skaller @ 2007-04-29 4:43 UTC
To: Francois.Pottier; +Cc: caml-list

On Sat, 2007-04-28 at 18:50 +0200, Francois Pottier wrote:
> > Third, the generated ml file was 4.5 Meg. Ocamlopt on amd64 hung for
> > so long I almost posted a bug report for Ocamlopt, but finally it
> > finished.
>
> Yup, I am somewhat disappointed that ocamlopt does not seem to have
> linear time complexity, but I shouldn't complain too loudly, my boss
> may be listening :)
>
> An option to generate tables like ocamlyacc would clearly be useful.

Although that defeats one of the future advantages of Menhir: embedding.
To understand what I mean, you might look at the Felix parser generator.
In Felix you can write parsers in any scope, and the user actions bind to
the containing scope. This means you can think about mixing programmatic
recursive descent with the parser automaton. An executable recursive
descent parser is easily extended dynamically (but has woeful error
handling ;( ).

For Menhir, embedding would be nice. However, even without it, being able
to use recursion and actually pass arguments -- which ocamlyacc does not
support -- should allow decoupling some of your grammar into smaller
subgrammars, and this in turn could reduce compile time.

It is possible (but I'm no expert!!) that the non-linearity in compiling
Menhir-generated code is due to a very large let rec/and/... being
generated. Some extra analysis might fix that, i.e. using recursion only
when necessary, and let/and/in otherwise (though I have no idea how to
do that).

For embedding, the 'one token overshoot' is a bit nasty. OTOH with a GLR
parser, extra functionality is gained, and user actions must have no side
effects, so the idea to 'push back' the overshoot token cannot be used
(since it is a side effect), at least not without some thought ..
hmmm ..

An extensible GLR+ parser would be interesting. This is an extensible
list of GLR parsers that run in 'parallel', using the GLR idea, but with
the list being dynamically extensible. Exactly how you'd manage it I
don't know. The theoretical underpinnings should be the same as for the
open-recursion idea: make a grammar with a parameter representing extra
productions, then you can fix it to close the grammar. This should at
least be possible statically (at compile time).

> > Basically: when Ocamlyacc reduces a production, it sometimes
> > ends on the last token, and sometimes it overshoots by 1.
>
> I don't know the details of your grammar, but our (perhaps naive) view
> is that you should modify your grammar to avoid end-of-stream conflicts
> (and Menhir's conflict reports will help you do that). Then, Menhir
> will not overshoot.

Actually, in my grammar there is a definite user-defined ENDMARKER for
the main grammar, and 'end-of-expression' and 'end-of-statements' sets of
tokens for the extensible parts. The input is not a file, it's a list of
tokens built by other means.

So the conflicts are spurious: end of stream can't be parsed (but of
course Menhir doesn't know that). This is good, because Menhir CANNOT
detect end of stream, since my lexbuf is a dummy and is not used at all.

> > The semantics used are: when the exprx is processed, the action
> > arranges to push the terminator token back into the token stream.
>
> How is this done?
The lexer function is bound to a class object which contains a mutable
list of tokens. The actual lexbuf is ignored; it's a dummy. So pushing a
token onto the list is just

  ts <- last_token :: ts

:)

> It is possible that this hack is not compatible with Menhir, because
> Menhir keeps a local cache of the next token, which is not properly
> updated when you push your old token back.

That's what I would have guessed. In order to use an LR parser
recursively, which I do, it must be possible to control the lookahead
token somehow. My code just uses an implementation detail of Ocamlyacc
discovered by accident, and the detail differs for Menhir.

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net
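[Editorial sketch of the kind of object described above; this is not
Felix's actual code, and the token type, constructor names and parser
entry point in the trailing comment are hypothetical.]

  type token = ENDMARKER | SEMI | IDENT of string

  class token_source (initial : token list) =
    object
      val mutable ts = initial
      method next (_ : Lexing.lexbuf) : token =
        match ts with
        | t :: rest -> ts <- rest; t
        | [] -> ENDMARKER                   (* the stream never really "ends" *)
      method push (t : token) : unit =
        ts <- t :: ts                       (* the pushback used by actions *)
    end

  (* Hypothetical use with a generated entry point:
       let src = new token_source tokens in
       Parser.compilation_unit src#next (Lexing.from_string "" (* dummy *))  *)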
* Re: [Caml-list] menhir
From: Christophe Raffalli @ 2007-04-29 7:27 UTC
To: caml-list

Dear list members,

> OTOH with a GLR parser, extra functionality is gained, and user actions
> must have no side effects, so the idea to 'push back' the overshoot
> token cannot be used (since it is a side effect), at least not without
> some thought .. hmmm ..

dypgen uses GLR and allows side effects, because it keeps a state
associated with each stack. Actually two states: one global and one
"top down"; typically the top-down state can be used to carry information
about bound variables down the parse tree.

> An extensible GLR+ parser would be interesting. This is an extensible
> list of GLR parsers that run in 'parallel', using the GLR idea, but
> with the list being dynamically extensible. Exactly how you'd manage it
> I don't know.

dypgen allows the action to extend the grammar (in a precise scope).

About table/code size: dypgen can use (the user chooses) an LR(0), LALR
or LR(1) automaton to drive the parser. Moreover, priorities can be
handled (this is the default) outside the automaton, giving very small
tables.

Christophe Raffalli
* Re: [Caml-list] menhir
From: Francois Pottier @ 2007-05-01 15:57 UTC
To: skaller; +Cc: caml-list

Hello,

On Sun, Apr 29, 2007 at 02:43:03PM +1000, skaller wrote:
> > An option to generate tables like ocamlyacc would clearly be useful.
>
> Although that defeats one of the future advantages of Menhir:
> embedding. To understand what I mean, you might look at the Felix
> parser generator. In Felix you can write parsers in any scope, and the
> user actions bind to the containing scope.

I don't see why there is a relationship between the style of code
generation (code versus tables) and embedding. Embedding in Menhir is
supported indirectly via %parameter directives.

> It is possible (but I'm no expert!!) that the non-linearity in
> compiling Menhir-generated code is due to a very large
> let rec/and/... being generated. Some extra analysis might fix that,
> i.e. using recursion only when necessary, and let/and/in otherwise
> (though I have no idea how to do that).

I have tried that, but it does not help. Most (say, 95%) of the functions
generated by Menhir are truly mutually recursive.

> So the conflicts are spurious: end of stream can't be parsed (but of
> course Menhir doesn't know that). This is good, because Menhir CANNOT
> detect end of stream, since my lexbuf is a dummy and is not used at
> all.

Perhaps the terminology "end-of-stream conflict" is badly chosen. These
conflicts are not just about detecting the end of the stream: they are
about recognizing a non-terminal symbol without overshooting, that is,
without unnecessarily requesting a lookahead token (a request which, if
the end of stream has been reached, could be unsatisfiable, or could be
blocking).

These conflicts are not "spurious": just like shift/reduce or
reduce/reduce conflicts, they will cause trouble.

--
François Pottier
Francois.Pottier@inria.fr
http://cristal.inria.fr/~fpottier/
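[Editorial sketch of what a %parameter declaration might look like, to
illustrate the directive mentioned above. The module name, signature and
grammar are made up, and the exact syntax should be checked against the
Menhir manual; with %parameter the generated parser becomes a functor
over the given module.]

  %parameter <Env : sig type scope val bind : string -> scope -> scope end>

  %token <string> IDENT
  %token SEMI
  %start <Env.scope -> Env.scope> binding
  %%

  binding:
    | id = IDENT SEMI { fun s -> Env.bind id s }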
* Re: [Caml-list] menhir
From: skaller @ 2007-05-01 17:11 UTC
To: Francois.Pottier; +Cc: caml-list

On Tue, 2007-05-01 at 17:57 +0200, Francois Pottier wrote:
> > So the conflicts are spurious: end of stream can't be parsed (but of
> > course Menhir doesn't know that). This is good, because Menhir CANNOT
> > detect end of stream, since my lexbuf is a dummy and is not used at
> > all.
>
> Perhaps the terminology "end-of-stream conflict" is badly chosen. These
> conflicts are not just about detecting the end of the stream: they are
> about recognizing a non-terminal symbol without overshooting, that is,
> without unnecessarily requesting a lookahead token (a request which, if
> the end of stream has been reached, could be unsatisfiable, or could be
> blocking).
>
> These conflicts are not "spurious": just like shift/reduce or
> reduce/reduce conflicts, they will cause trouble.

They're not technically spurious, but for my grammar they're spurious
because the input language isn't arbitrary: it always has an ENDMARKER at
the end of the token stream, and the parser never processes any symbols
past that. It can't hit 'end of stream' because that's a state of the
lexbuf .. and the lexbuf is a dummy which is never updated or modified,
so it can't ever be in the end-of-stream state.

So basically, Menhir is saying "what happens if you hit end of stream
here? What should be done?" and the answer is "it can't happen".

BTW: Ocamlyacc doesn't report any conflicts, so this is a minor
incompatibility.

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net
* Re: [Caml-list] menhir
From: Francois Pottier @ 2007-05-01 17:34 UTC
To: skaller; +Cc: caml-list

> So basically, Menhir is saying "what happens if you hit end of stream
> here? What should be done?" and the answer is "it can't happen".

What I am trying to say is, Menhir doesn't ask this question. It asks:
"what happens if I reach that state? Should I request a lookahead token
from the lexer, or shouldn't I?"

--
François Pottier
Francois.Pottier@inria.fr
http://cristal.inria.fr/~fpottier/
* Re: [Caml-list] menhir
From: skaller @ 2007-05-01 23:42 UTC
To: Francois.Pottier; +Cc: caml-list

On Tue, 2007-05-01 at 19:34 +0200, Francois Pottier wrote:
> > So basically, Menhir is saying "what happens if you hit end of stream
> > here? What should be done?" and the answer is "it can't happen".
>
> What I am trying to say is, Menhir doesn't ask this question. It asks:
> "what happens if I reach that state? Should I request a lookahead token
> from the lexer, or shouldn't I?"

Perhaps I do not fully understand, but: I do not use 'eof' (the
end-of-stream pseudo-token). I get this for an unused token:

  Warning: the token WHENCE is unused.

and I should get the same for eof IMHO (though possibly as a suppressed
message, since it's not a user-defined token). What I actually get is 145
end-of-stream conflicts .. but I get no conflicts on WHENCE.

That is, instead of treating the token set precisely as the set of user
tokens + eof, Menhir is treating eof specially and not consistently with
other tokens. There's no conflict: in every one of these states, if 'eof'
turns up it's a syntax error, exactly the same as if WHENCE turns up.

I'm not sure I fully understand the 'do we need lookahead' issue: I would
have thought you need a fetch if your action is a shift, and not if it is
a reduce.

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net
* Re: [Caml-list] menhir
From: Francois Pottier @ 2007-05-02 5:38 UTC
To: skaller; +Cc: caml-list

Hello,

> That is, instead of treating the token set precisely as the set of user
> tokens + eof, Menhir is treating eof specially and not consistently
> with other tokens.

As far as Menhir is concerned, there is no special eof token. The set of
tokens is exactly the set of user-defined tokens. This is true also of
ocamlyacc.

Internally, the construction of the automaton uses a pseudo-token,
written #, which stands for the end of the token stream. This token can
appear in conflict explanation messages.

> I'm not sure I fully understand the 'do we need lookahead' issue: I
> would have thought you need a fetch if your action is a shift, and not
> if it is a reduce.

What you mean is, you need to *consume* one token if your action is a
shift, and none if your action is a reduce. However, in order to make a
decision (in order to choose between shift and reduce, and between
multiple reduce actions), you sometimes need to *consult* one lookahead
token, without consuming it.

This is necessary only *sometimes*, because, in some states, only one
reduce action is possible, and it can be taken without consulting a
lookahead token.

An end-of-stream conflict arises when a state has multiple actions, some
of which are on real tokens, and some of which are on the pseudo-token #.
In that case, one would like to consult the next token in order to make a
decision; however, there is a possibility that we are at the end of the
sentence that we are trying to recognize, so asking the lexer for one
more token might be a mistake (in your terminology, it might be an
overshoot).

Does this clarify things?

--
François Pottier
Francois.Pottier@inria.fr
http://cristal.inria.fr/~fpottier/
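[Editorial illustration: a tiny grammar that, I believe, exhibits exactly
this situation. After an expr has been read, the automaton has a shift
action on PLUS and a reduce/accept action on #, so it must ask for one
more token to decide, which may not be available: an end-of-stream
conflict. Giving the start symbol an explicit end token (e.g. a rule
main: expr EOF) makes the decision unnecessary and removes the conflict.]

  %token <int> INT
  %token PLUS
  %start <unit> expr
  %%

  expr:
    | INT            { () }
    | expr PLUS INT  { () }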
* Re: [Caml-list] menhir
From: Francois Pottier @ 2007-05-02 5:50 UTC
To: skaller; +Cc: caml-list

On Wed, May 02, 2007 at 07:38:15AM +0200, I wrote:
> Internally, the construction of the automaton uses a pseudo-token,
> written #, which stands for the end of the token stream. This token can
> appear in conflict explanation messages.

Actually, I should say: # stands for the end of a sentence that we are
trying to recognize. That is, if S is a start symbol, then Menhir builds
a new start symbol S' with the production S' -> S #.

As a result, when we reach a state where a reduce action ("reduce
production p") has lookahead symbol #, this means: "in this state,
perhaps we have consumed a sentence derived from S; in that case, we
should reduce production p and not read anything more". It does not mean
that we have reached the end of the physical token stream, only that we
have reached the end of what we were supposed to read.

--
François Pottier
Francois.Pottier@inria.fr
http://cristal.inria.fr/~fpottier/
* Re: [Caml-list] menhir
From: skaller @ 2007-05-02 8:41 UTC
To: Francois.Pottier; +Cc: caml-list

On Wed, 2007-05-02 at 07:38 +0200, Francois Pottier wrote:
> > That is, instead of treating the token set precisely as the set of
> > user tokens + eof, Menhir is treating eof specially and not
> > consistently with other tokens.
>
> As far as Menhir is concerned, there is no special eof token. The set
> of tokens is exactly the set of user-defined tokens. This is true also
> of ocamlyacc.
>
> Internally, the construction of the automaton uses a pseudo-token,
> written #, which stands for the end of the token stream. This token can
> appear in conflict explanation messages.

Exactly. In Ocamlyacc it is named 'eof', and you can use that token in
your productions.

> What you mean is, you need to *consume* one token if your action is a
> shift, and none if your action is a reduce. However, in order to make a
> decision (in order to choose between shift and reduce, and between
> multiple reduce actions), you sometimes need to *consult* one lookahead
> token, without consuming it.
>
> This is necessary only *sometimes*, because, in some states, only one
> reduce action is possible, and it can be taken without consulting a
> lookahead token.

Right.

> An end-of-stream conflict arises when a state has multiple actions,
> some of which are on real tokens, and some of which are on the
> pseudo-token #. In that case, one would like to consult the next token
> in order to make a decision; however, there is a possibility that we
> are at the end of the sentence that we are trying to recognize, so
> asking the lexer for one more token might be a mistake (in your
> terminology, it might be an overshoot).
>
> Does this clarify things?

Yes. Your code or my grammar is bugged :)

My grammar is

  compilation_unit:
    | statement_aster ENDMARKER { $1 }

where no non-terminal on which statement_aster depends has a production
containing ENDMARKER or # (eof).

Therefore, there is no conflict. When compilation_unit is reduced the
parser returns; the next token, whether it is # or any other, is
irrelevant. The parser is therefore required to match the head of the
input stream and NOT the whole stream.

It seems Menhir wants to match the whole input, and is confused because
it doesn't know what to do after it sees ENDMARKER. That means it doesn't
detect that compilation_unit is a start symbol .. and the reduction is
final.

If you are augmenting the grammar to be:

  compilation_unit:
    | statement_aster ENDMARKER eof { $1 }

that would explain Menhir's confusion .. however the generated parser
actually does work (provided the user-defined syntax feature isn't used).

IMHO the end-of-stream thing is a serious bug in the design of Ocamlyacc
which unfortunately Menhir duplicates. In particular, there should be no
possibility of the parser consulting the lexbuf; the signature of parsers
is very wrong:

  parser: (lexbuf -> token) -> lexbuf -> ast

is a disaster. Because of this you're duplicating the disaster and
probing the lexbuf for end of stream, which is just wrong.
There should be no implicit end of stream: a stream is an INFINITE
sequence. A parser should process the head of the stream, and the
signature should be:

  parser: (state -> token) -> state -> ast

where state is a USER-defined type. If the user provides no state
argument, then it defaults to unit. This denies the parser any access to
a hidden 'end of stream' token. It's up to the user to decide what
terminates the parse.

The point again is that the token input to the parser is infinite: it
can't ever be an error to read a next token. The problem is the usual one
for streams. Probably the interface should be changed to something like:

  peek state n   (* look ahead n symbols *)
  read state     (* destructive read of 1 symbol *)
  loc state      (* return location for diagnostics *)

This would support LR(infinity) parsing. A better interface would support
recursive descent too; in that case you'd have to make the 'streaming'
persistent to support backtracking, that is, you'd have to use

  state -> (token * state)

as the lexer interface -- that gets rid of the pushback problem. (This is
trivial to implement on top of a destructive stream, by buffering with
lists: Extlib streams do just that.)

Dypgen actually does a clever trick here, since GLR also requires actions
which are purely functional .. from the outside (i.e. persistent): it
allows mutations, and they propagate locally in the current branch
(L-attributes, yes?). It also allows global mutations, but they only
become permanent when all the branches agree on them.

Anyhow, the problem seems to reduce to this: I think a parser should read
the head of a stream, and Menhir thinks it should read the whole stream,
and it uses the bugged lexbuf interface to test for end of stream.

So I could fix my grammar by adding 'end-of-stream' to my start symbols
AND hack the otherwise unused lexbuf to show end-of-stream after
ENDMARKER is read.

Ocamlyacc doesn't complain though, suggesting it processes only the head
of a stream .. and that there's a fundamental design distinction between
the two parser generators.

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net
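[Editorial sketch of the kind of token-stream interface being proposed
above; the module types and names are invented, not an existing API. The
persistent variant returns the rest of the stream instead of mutating,
which is what makes backtracking (and pushback) trivial.]

  module type TOKEN_STREAM = sig
    type state
    type token
    val peek : state -> int -> token        (* look ahead n symbols *)
    val read : state -> token               (* destructive read of one symbol *)
    val loc  : state -> Lexing.position     (* location for diagnostics *)
  end

  module type PERSISTENT_TOKEN_STREAM = sig
    type state
    type token
    val read : state -> token * state       (* reading returns the rest of the stream *)
  end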
* Re: [Caml-list] menhir
From: Francois Pottier @ 2007-05-02 12:30 UTC
To: skaller; +Cc: caml-list

Hello,

On Wed, May 02, 2007 at 06:41:44PM +1000, skaller wrote:
> Exactly. In Ocamlyacc it is named 'eof', and you can use that token in
> your productions.

As far as I know, this is incorrect. ocamlyacc does not have a predefined
eof token. Perhaps you are thinking of ocamllex, which has an eof
pattern.

> My grammar is
>
>   compilation_unit:
>     | statement_aster ENDMARKER { $1 }
>
> where no non-terminal on which statement_aster depends has a production
> containing ENDMARKER or # (eof).
>
> Therefore, there is no conflict. When compilation_unit is reduced the
> parser returns; the next token, whether it is # or any other, is
> irrelevant.

Good. I seem to agree with you. Menhir should not report an end-of-stream
conflict here. So, what does it report?

> If you are augmenting the grammar to be:
>
>   compilation_unit:
>     | statement_aster ENDMARKER eof { $1 }

No, actually, I am augmenting the grammar with the production
compilation_unit' -> compilation_unit, and using

  compilation_unit' -> compilation_unit { # }

as my initial LR(1) item in the construction of the automaton.

> IMHO the end-of-stream thing is a serious bug in the design of
> Ocamlyacc which unfortunately Menhir duplicates.

I believe ocamlyacc's behavior is reasonable but subtle and
undocumented... Menhir attempts to reproduce that behavior, while
providing more guidance in the form of end-of-stream conflict reports...
That said, I could be missing a point. What is the bug?

> In particular, there should be no possibility of the parser consulting
> the lexbuf; the signature of parsers is very wrong:
>
>   parser: (lexbuf -> token) -> lexbuf -> ast
>
> is a disaster.
>
> There should be no implicit end of stream: a stream is an INFINITE
> sequence. A parser should process the head of the stream, and the
> signature should be:
>
>   parser: (state -> token) -> state -> ast
>
> where state is a USER-defined type. If the user provides no state
> argument, then it defaults to unit.

I believe this is a separate issue. You are right in saying that the
historic signature, which involves lexbuf, is dubious. Following your
suggestion, we could just as well use

  parser: (state -> token * state) -> state -> ast * state

if we wish to promote a purely functional style (where values of type
state are immutable), or just

  parser: (unit -> token) -> ast

if we are willing to accept mutable state. (I am sweeping the issue of
locations under the rug; we should use token * location instead of just
token.)

That said, the historic signature

  parser: (lexbuf -> token) -> lexbuf -> ast

is really equivalent to the previous one, in the sense that I can write
functions that convert between the two styles (see attached file).

> The point again is that the token input to the parser is infinite: it
> can't ever be an error to read a next token.

I beg to disagree. First, the input stream does not have to be infinite:
if I am reading from a file, clearly it is finite. Second, regardless of
whether the stream is finite or infinite, it *is* an error to read more
tokens than you were supposed to.
If the grammar's start symbol is S, then the parser should read a
sequence of tokens that derives from S, and nothing more; it should not
overshoot and consume the first token that follows.

> Anyhow, the problem seems to reduce to this: I think a parser should
> read the head of a stream, and Menhir thinks it should read the whole
> stream, and it uses the bugged lexbuf interface to test for end of
> stream.

As I just said in the previous paragraph, we really are in agreement: a
parser should read only a prefix of the stream. As I said earlier, a
Menhir-generated parser does not test for the physical end of stream. It
relies on the lexbuf interface only to (i) read the next token and (ii)
access the location of the current token.

> So I could fix my grammar by adding 'end-of-stream' to my start symbols
> AND hack the otherwise unused lexbuf to show end-of-stream after
> ENDMARKER is read.

Hacking the dynamic behavior of your lexbuf won't make any difference,
since end-of-stream conflicts are reported at compile time! Ponder that.

The only way of avoiding these conflicts is to change your grammar
somehow. But I still haven't understood what causes these conflicts in
your grammar. Perhaps it would be time to show it?

> Ocamlyacc doesn't complain though, suggesting it processes only the
> head of a stream .. and that there's a fundamental design distinction
> between the two parser generators.

ocamlyacc never complains. It just trusts you to know what you are doing.

--
François Pottier
Francois.Pottier@inria.fr
http://cristal.inria.fr/~fpottier/

[Attachment: convert.ml]

  open Lexing

  type ('token, 'ast) historic_parser =
    (lexbuf -> 'token) -> lexbuf -> 'ast

  type ('token, 'ast) revised_parser =
    (unit -> 'token) -> 'ast

  let upgrade (p : ('token, 'ast) historic_parser) : ('token, 'ast) revised_parser =
    fun (next : unit -> 'token) ->
      let (* dummy *) lexbuf = Lexing.from_string ""
      and lexer lexbuf = next()
      in
      p lexer lexbuf

  let downgrade (p : ('token, 'ast) revised_parser) : ('token, 'ast) historic_parser =
    fun (lexer : lexbuf -> 'token) lexbuf ->
      let next () = lexer lexbuf in
      p next
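[Editorial note: a hypothetical use of the upgrade function from the
attachment, driving an ocamlyacc-style entry point from a plain token
list; MyParser and its compilation_unit entry point are made-up names.]

  let parse_tokens (tokens : MyParser.token list) =
    let remaining = ref tokens in
    let next () =
      match !remaining with
      | t :: rest -> remaining := rest; t
      | [] -> failwith "token stream exhausted"
    in
    upgrade MyParser.compilation_unit next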
* Re: [Caml-list] menhir
From: skaller @ 2007-05-02 16:29 UTC
To: Francois.Pottier; +Cc: caml-list

On Wed, 2007-05-02 at 14:30 +0200, Francois Pottier wrote:
> On Wed, May 02, 2007 at 06:41:44PM +1000, skaller wrote:
> > Exactly. In Ocamlyacc it is named 'eof', and you can use that token
> > in your productions.
>
> As far as I know, this is incorrect. ocamlyacc does not have a
> predefined eof token. Perhaps you are thinking of ocamllex, which has
> an eof pattern.

I believe you're right; I apologise for the confusion.

> Good. I seem to agree with you. Menhir should not report an
> end-of-stream conflict here. So, what does it report?

  Built an LR(0) automaton with 1416 states.
  Built an LR(1) automaton with 2009 states.
  Warning: 145 states have an end-of-stream conflict.

Can I send you the file?

[signature of parser]

> I believe this is a separate issue.

Yes, I agree.

> You are right in saying that the historic signature, which involves
> lexbuf, is dubious. Following your suggestion, we could just as well
> use
>
>   parser: (state -> token * state) -> state -> ast * state
>
> if we wish to promote a purely functional style (where values of type
> state are immutable), or just
>
>   parser: (unit -> token) -> ast
>
> if we are willing to accept mutable state. (I am sweeping the issue of
> locations under the rug; we should use token * location instead of
> just token.)

Or forget it, which is the approach taken by Felix: every token contains
its location, and the user can organise this. This has the advantage of
not specifying a particular location format.

> That said, the historic signature
>
>   parser: (lexbuf -> token) -> lexbuf -> ast
>
> is really equivalent to the previous one, in the sense that I can write
> functions that convert between the two styles (see attached file).

Yes, but you cannot write functions that take a state argument, because
lexbuf is a fixed data type and there's nowhere to add in any user state
data.

> > The point again is that the token input to the parser is infinite: it
> > can't ever be an error to read a next token.
>
> I beg to disagree. First, the input stream does not have to be
> infinite: if I am reading from a file, clearly it is finite.

EOF is returned an infinite number of times in C.

> Second, regardless of whether the stream is finite or infinite, it *is*
> an error to read more tokens than you were supposed to. If the
> grammar's start symbol is S, then the parser should read a sequence of
> tokens that derives from S, and nothing more; it should not overshoot
> and consume the first token that follows.

This requires the definition: parse the *shortest* head of the input
stream.

> The only way of avoiding these conflicts is to change your grammar
> somehow. But I still haven't understood what causes these conflicts in
> your grammar. Perhaps it would be time to show it?
>
> ocamlyacc never complains. It just trusts you to know what you are
> doing.

I generate an .output file, grep for the word 'conflict', and terminate
my build if one is found.
I do not permit any conflicts in my grammar: it's strictly unambiguous
LALR(1). It's also pure in the sense that it doesn't use crud like %left,
%prec etc. to resolve conflicts. [The way dypgen does this is vastly
superior!]

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net
* Re: [Caml-list] menhir
From: Francois Pottier @ 2007-05-02 18:35 UTC
To: skaller; +Cc: caml-list

On Thu, May 03, 2007 at 02:29:17AM +1000, skaller wrote:
>
>   Built an LR(0) automaton with 1416 states.
>   Built an LR(1) automaton with 2009 states.
>   Warning: 145 states have an end-of-stream conflict.
>
> Can I send you the file?

Sure, please do (perhaps off the list).

> Yes, but you cannot write functions that take a state argument, because
> lexbuf is a fixed data type and there's nowhere to add in any user
> state data.

But my point is that you never need to pass a state argument to a parser
function. Instead, you can just have the function capture the address of
the (mutable) state in its closure.

> EOF is returned an infinite number of times in C.

Point taken.

> This requires the definition: parse the *shortest* head of the input
> stream.

In fact, if there are no end-of-stream conflicts, then the head of the
input stream is *unique*, so you don't need to specify whether you are
interested in a shortest or longest prefix.

> > ocamlyacc never complains. It just trusts you to know what you are
> > doing.
>
> I generate an .output file, grep for the word 'conflict', and terminate
> my build if one is found. I do not permit any conflicts in my grammar:
> it's strictly unambiguous LALR(1).

Sure, but ocamlyacc does not report end-of-stream conflicts, which I
believe are real issues...

> It's also pure in the sense that it doesn't use crud like %left, %prec
> etc. to resolve conflicts.

Good, good!

> [The way dypgen does this is vastly superior!]

How does it work?

--
François Pottier
Francois.Pottier@inria.fr
http://cristal.inria.fr/~fpottier/
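[Editorial sketch of the closure idea above; the Parser module and its
compilation_unit entry point are hypothetical names. The per-parse state
is a mutable reference captured by the lexer function, so the parser's
signature never needs a state parameter.]

  let parse (tokens : Parser.token list) =
    let remaining = ref tokens in           (* per-call mutable state *)
    let lexer _lexbuf =
      match !remaining with
      | t :: rest -> remaining := rest; t
      | [] -> failwith "no more tokens"
    in
    Parser.compilation_unit lexer (Lexing.from_string "" (* dummy *))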
* Re: [Caml-list] menhir
From: skaller @ 2007-05-03 1:30 UTC
To: Francois.Pottier; +Cc: caml-list

On Wed, 2007-05-02 at 20:35 +0200, Francois Pottier wrote:
> On Thu, May 03, 2007 at 02:29:17AM +1000, skaller wrote:
> > Yes, but you cannot write functions that take a state argument,
> > because lexbuf is a fixed data type and there's nowhere to add in any
> > user state data.
>
> But my point is that you never need to pass a state argument to a
> parser function. Instead, you can just have the function capture the
> address of the (mutable) state in its closure.

And mine is that you can't do that, because the function is generated
with a fixed signature, whose only state data is the lexbuf.

Joel pointed out this isn't so: if you use a functor and construct it
locally, then a 'global' variable of the functor is actually localised.
I admit I didn't think of local dynamic instantiation.

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net
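[Editorial sketch only, all names invented: the shape of "local dynamic
instantiation". Instantiating a functor inside a function makes its
'global' mutable state per-call, which is how a generated parser with a
fixed signature can still get private state.]

  type token = ENDMARKER | IDENT of string

  module type TOKENS = sig val stream : token list ref end

  module MakeParser (T : TOKENS) = struct
    let next () =
      match !T.stream with
      | t :: rest -> T.stream := rest; t
      | [] -> ENDMARKER
    (* ... the generated parsing functions would call next () ... *)
  end

  let parse tokens =
    let module P = MakeParser (struct let stream = ref tokens end) in
    P.next ()      (* each call to parse gets its own private stream *)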
* Re: [Caml-list] menhir
From: Joel Reymont @ 2007-05-03 8:43 UTC
To: skaller; +Cc: Francois.Pottier, caml-list

On May 3, 2007, at 2:30 AM, skaller wrote:
> Joel pointed out this isn't so: if you use a functor and construct it
> locally, then a 'global' variable of the functor is actually localised.
> I admit I didn't think of local dynamic instantiation.

Just to give credit where it's due, it wasn't my idea. I butted my head
against this issue for a few days until Francois solved it for me by
offering the local module example.

Thanks, Joel

--
http://wagerlabs.com/
* Re: [Caml-list] menhir
From: skaller @ 2007-05-01 17:15 UTC
To: Francois.Pottier; +Cc: caml-list

On Tue, 2007-05-01 at 17:57 +0200, Francois Pottier wrote:
> I have tried that, but it does not help. Most (say, 95%) of the
> functions generated by Menhir are truly mutually recursive.

Yes, but you can always use indirection to decouple them, can't you?
E.g. just use a 'ref' to a function. Slower at run time, but it should
fix the compile-time problem if the cause is huge let recs.

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net
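[Editorial toy example of the indirection being suggested: breaking a
mutually recursive pair with a forward function reference instead of a
single let rec ... and ... group.]

  let even_ref = ref (fun (_ : int) -> true)   (* forward declaration *)

  let odd n = if n = 0 then false else !even_ref (n - 1)

  let () = even_ref := (fun n -> if n = 0 then true else odd (n - 1))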
* Re: [Caml-list] menhir
From: Francois Pottier @ 2007-05-01 17:31 UTC
To: skaller; +Cc: caml-list

> Yes, but you can always use indirection to decouple them, can't you?
> E.g. just use a 'ref' to a function.

I would need a ref to a huge record of functions, or a lot of individual
refs. I have tried some of the possibilities, but they yield ugly code
and none was actually faster.

--
François Pottier
Francois.Pottier@inria.fr
http://cristal.inria.fr/~fpottier/