[Caml-list] New release of Menhir (20141215)

Mailing list for all users of the OCaml language and system.
 help / color / mirror / Atom feed

* [Caml-list] New release of Menhir (20141215)
@ 2014-12-17 20:14 Francois Pottier
  2014-12-18 12:45 ` Gerd Stolpmann
  2014-12-22 18:40 ` Dario Teixeira
  0 siblings, 2 replies; 14+ messages in thread
From: Francois Pottier @ 2014-12-17 20:14 UTC (permalink / raw)
  To: caml-list

Dear OCaml users,

I have recently started making a series of changes in Menhir, inspired by
the work of Frédéric Bour on Merlin (a smart emacs-mode for OCaml, which
uses a modified version of Menhir). Thanks to Frédéric for his stimulating
ideas, and thanks to Gabriel Scherer for not letting me sleep until I
promised I would do something about them! :-)

I have made a first release a couple days ago. The relevant chunk of the
CHANGES file is appended below. In summary, a new incremental API is
available; Menhir now requires ocaml 4.02; and a couple of obscure features
(--error-recovery and $previouserror) have been removed in the interest of
speed and simplicity.

The incremental API means that you can take a snapshot of the parser's state,
essentially at no cost, at any time. It also means that the parser no longer
drives the lexer; you drive the lexer, and you provide tokens to the parser
when it requests them. This can be convenient if the lexer is in a monad (the
Lwt monad, for instance).

More changes are planned. The type "env" exposed by the new incremental API is
currently opaque. We are planning to offer a range of functions that allow
inspecting and building values of type "env". This should allow the user to
implement new error handling and error recovery strategies outside of Menhir.

The new release is available now as a .tar.gz archive:

  http://gallium.inria.fr/~fpottier/menhir/menhir-20141215.tar.gz

It is also available via opam ("opam install menhir").

Cheers,

-- 
François Pottier
Francois.Pottier@inria.fr
http://gallium.inria.fr/~fpottier/

2014/12/15:
New incremental API (in --table mode only), inspired by Frédéric Bour.

2014/12/11:
Menhir now reports an error if one of the start symbols produces
either the empty language or the singleton language {epsilon}.

Although some people out there actually define a start symbol that recognizes
{epsilon} (and use it as a way of initializing or re-initializing some global
state), this is considered bad style. Furthermore, by ruling out this case, we
are able to simplify the table back-end a little bit.

2014/12/12:
A speed improvement in the code back-end.

2014/12/08:
Menhir now requires OCaml 4.02 (instead of 3.09).

2014/12/02:
Removed support for the $previouserror keyword.
Removed support for --error-recovery mode.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Caml-list] New release of Menhir (20141215)
  2014-12-17 20:14 [Caml-list] New release of Menhir (20141215) Francois Pottier
@ 2014-12-18 12:45 ` Gerd Stolpmann
  2014-12-18 14:19   ` Nicolas Ojeda Bar
  2014-12-22 18:40 ` Dario Teixeira
  1 sibling, 1 reply; 14+ messages in thread
From: Gerd Stolpmann @ 2014-12-18 12:45 UTC (permalink / raw)
  To: Francois.Pottier; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 3327 bytes --]

Thanks for this, it is great news. A couple of months ago I tried to
develop a monadic parser for the IMAP protocol, which has a weird
grammar and needs some strange interactions between lexing and parsing.
Lacking a parser generator I started to write the parser manually, but
it never made good progress. It looks like Menhir can now be used for
this case, and I'll try it.

Gerd

Am Mittwoch, den 17.12.2014, 21:14 +0100 schrieb Francois Pottier:
> Dear OCaml users,
> 
> I have recently started making a series of changes in Menhir, inspired by
> the work of Frédéric Bour on Merlin (a smart emacs-mode for OCaml, which
> uses a modified version of Menhir). Thanks to Frédéric for his stimulating
> ideas, and thanks to Gabriel Scherer for not letting me sleep until I
> promised I would do something about them! :-)
> 
> I have made a first release a couple days ago. The relevant chunk of the
> CHANGES file is appended below. In summary, a new incremental API is
> available; Menhir now requires ocaml 4.02; and a couple of obscure features
> (--error-recovery and $previouserror) have been removed in the interest of
> speed and simplicity.
> 
> The incremental API means that you can take a snapshot of the parser's state,
> essentially at no cost, at any time. It also means that the parser no longer
> drives the lexer; you drive the lexer, and you provide tokens to the parser
> when it requests them. This can be convenient if the lexer is in a monad (the
> Lwt monad, for instance).
> 
> More changes are planned. The type "env" exposed by the new incremental API is
> currently opaque. We are planning to offer a range of functions that allow
> inspecting and building values of type "env". This should allow the user to
> implement new error handling and error recovery strategies outside of Menhir.
> 
> The new release is available now as a .tar.gz archive:
> 
>   http://gallium.inria.fr/~fpottier/menhir/menhir-20141215.tar.gz
> 
> It is also available via opam ("opam install menhir").
> 
> Cheers,
> 
> -- 
> François Pottier
> Francois.Pottier@inria.fr
> http://gallium.inria.fr/~fpottier/
> 
> 2014/12/15:
> New incremental API (in --table mode only), inspired by Frédéric Bour.
> 
> 2014/12/11:
> Menhir now reports an error if one of the start symbols produces
> either the empty language or the singleton language {epsilon}.
> 
> Although some people out there actually define a start symbol that recognizes
> {epsilon} (and use it as a way of initializing or re-initializing some global
> state), this is considered bad style. Furthermore, by ruling out this case, we
> are able to simplify the table back-end a little bit.
> 
> 2014/12/12:
> A speed improvement in the code back-end.
> 
> 2014/12/08:
> Menhir now requires OCaml 4.02 (instead of 3.09).
> 
> 2014/12/02:
> Removed support for the $previouserror keyword.
> Removed support for --error-recovery mode.
> 

-- 
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
My OCaml site:          http://www.camlcity.org
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Caml-list] New release of Menhir (20141215)
  2014-12-18 12:45 ` Gerd Stolpmann
@ 2014-12-18 14:19   ` Nicolas Ojeda Bar
  2014-12-18 15:20     ` Daniel Bünzli
                       ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Nicolas Ojeda Bar @ 2014-12-18 14:19 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: Francois.Pottier, caml-list

Hi Gerd,

I wrote an IMAP client library in pure OCaml
(https://github.com/nojb/ocaml-imap).  It can handle the full RFC 3501
grammar.  This library itself is not yet ready for prime-time, but I
spent quite a bit of time thinking about the question of parsing since
it is such a big part of the protocol.

While the incremental parsing in the new Menhir is great I do not
think it will help to parse IMAP because the token types of IMAP
grammar are highly context dependent.  By this I mean that there are
many overlapping token definitions that are valid in different parts
of the grammar.  While this could be hacked around (by changing the
lexer used in dummy non-terminals) the complexity of doing so quickly
gets out of hand.  In my view this almost requires using a scannerless
parser, which rules out ocamlyacc/menhir-type parsers.

A parser generator that could handle the IMAP grammar nicely is Dypgen
(which generates GLR parsers).  You can almost copy the IMAP grammar
verbatim from RFC 3501.  However it can not be made incremental.  This
means that you need to have all the input available before starting
the parse.  Since the amount of data is potentially very big, this
rules using Dypgen for anything else than toy applications.

In my library I use a monadic combinator parser
(https://github.com/nojb/ocaml-imap/blob/master/imap/imapParser.ml).
When programming monadically you are reifying the continuation at
every `bind` point so you get the incremental bit for free (I think
this goes by the fancy name of `iteratees` nowadays).  On the other
hand it suffers from poor time/space performance common to this type
of parser.  Also, while combinator parsers can handle arbitrary
backtracking (and you end up paying for this), IMAP itself requires
very little, so it would seem that the full flexibility of combinators
are not needed.

It seems that the "scannerless, incremental" space in parser
technology is not very developed (in OCaml; maybe other languages is
different ?).  One intriguing possibility that I am exploring (but
this is a much longer term project) is to use PEG parsers for this.
It would be easy to write an IMAP parser with PEG. The main problem is
that the usual implementation of PEG parsers uses memoization to keep
parsing time from blowing up exponentially on the length of input (
(in OCaml the Aurochs parser generator is of this type).  This
requires storage proportional to input length, which is a problem when
parsing large amounts of data.  But for grammars that don't need a lot
of backtracking (like IMAP) the memoization is not actually required,
so I think there is an interesting space to explore between all these
constraints.

In any case, I would love to hear about your experience if you do try
to parse IMAP using Menhir.

Best wishes,
Nicolas

On Thu, Dec 18, 2014 at 9:45 AM, Gerd Stolpmann <info@gerd-stolpmann.de> wrote:
> Thanks for this, it is great news. A couple of months ago I tried to
> develop a monadic parser for the IMAP protocol, which has a weird
> grammar and needs some strange interactions between lexing and parsing.
> Lacking a parser generator I started to write the parser manually, but
> it never made good progress. It looks like Menhir can now be used for
> this case, and I'll try it.
>
> Gerd
>
> Am Mittwoch, den 17.12.2014, 21:14 +0100 schrieb Francois Pottier:
>> Dear OCaml users,
>>
>> I have recently started making a series of changes in Menhir, inspired by
>> the work of Frédéric Bour on Merlin (a smart emacs-mode for OCaml, which
>> uses a modified version of Menhir). Thanks to Frédéric for his stimulating
>> ideas, and thanks to Gabriel Scherer for not letting me sleep until I
>> promised I would do something about them! :-)
>>
>> I have made a first release a couple days ago. The relevant chunk of the
>> CHANGES file is appended below. In summary, a new incremental API is
>> available; Menhir now requires ocaml 4.02; and a couple of obscure features
>> (--error-recovery and $previouserror) have been removed in the interest of
>> speed and simplicity.
>>
>> The incremental API means that you can take a snapshot of the parser's state,
>> essentially at no cost, at any time. It also means that the parser no longer
>> drives the lexer; you drive the lexer, and you provide tokens to the parser
>> when it requests them. This can be convenient if the lexer is in a monad (the
>> Lwt monad, for instance).
>>
>> More changes are planned. The type "env" exposed by the new incremental API is
>> currently opaque. We are planning to offer a range of functions that allow
>> inspecting and building values of type "env". This should allow the user to
>> implement new error handling and error recovery strategies outside of Menhir.
>>
>> The new release is available now as a .tar.gz archive:
>>
>>   http://gallium.inria.fr/~fpottier/menhir/menhir-20141215.tar.gz
>>
>> It is also available via opam ("opam install menhir").
>>
>> Cheers,
>>
>> --
>> François Pottier
>> Francois.Pottier@inria.fr
>> http://gallium.inria.fr/~fpottier/
>>
>> 2014/12/15:
>> New incremental API (in --table mode only), inspired by Frédéric Bour.
>>
>> 2014/12/11:
>> Menhir now reports an error if one of the start symbols produces
>> either the empty language or the singleton language {epsilon}.
>>
>> Although some people out there actually define a start symbol that recognizes
>> {epsilon} (and use it as a way of initializing or re-initializing some global
>> state), this is considered bad style. Furthermore, by ruling out this case, we
>> are able to simplify the table back-end a little bit.
>>
>> 2014/12/12:
>> A speed improvement in the code back-end.
>>
>> 2014/12/08:
>> Menhir now requires OCaml 4.02 (instead of 3.09).
>>
>> 2014/12/02:
>> Removed support for the $previouserror keyword.
>> Removed support for --error-recovery mode.
>>
>
> --
> ------------------------------------------------------------
> Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
> My OCaml site:          http://www.camlcity.org
> Contact details:        http://www.camlcity.org/contact.html
> Company homepage:       http://www.gerd-stolpmann.de
> ------------------------------------------------------------
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Caml-list] New release of Menhir (20141215)
  2014-12-18 14:19   ` Nicolas Ojeda Bar
@ 2014-12-18 15:20     ` Daniel Bünzli
  2014-12-18 15:34       ` Simon Cruanes
  2014-12-18 15:25     ` Gerd Stolpmann
  2014-12-22 11:13     ` oleg
  2 siblings, 1 reply; 14+ messages in thread
From: Daniel Bünzli @ 2014-12-18 15:20 UTC (permalink / raw)
  To: Nicolas Ojeda Bar; +Cc: Gerd Stolpmann, Francois.Pottier, caml-list

Le jeudi, 18 décembre 2014 à 15:19, Nicolas Ojeda Bar a écrit :
> When programming monadically you are reifying the continuation at
> every `bind` point so you get the incremental bit for free (I think
> this goes by the fancy name of `iteratees` nowadays). On the other
> hand it suffers from poor time/space performance common to this type
> of parser. Also, while combinator parsers can handle arbitrary
> backtracking (and you end up paying for this), IMAP itself requires
> very little, so it would seem that the full flexibility of combinators
> are not needed.

For a long time I played with combinator parsers but I never got to the point of being satisfied with the result. I also played a little bit with iteratees but couldn't get the performance I wanted with the full model in all its compositional glory.   

Nowadays I simply write my streaming lexers/parsers manually in CPS and you can drive them in non-blocking mode with a single, fixed size, buffer. See here [1] for an interface and an implementation on a toy example.  

If you get the lowest continuation decoding bits correctly (e.g. don't bind on each byte) you can actually get good time and space performance. Once you have handled that (mainly the painful bits about read overlapping two input buffers) you get very nice parsing flexibility, e.g. for things like text encoding discovery where you need to patch the continuation with the appropriate character decoder. If you care for your users you can also get very good error reporting and error recovery capabilities by applying knowledge specific to the decoded protocol. That's the way Uutf [2], Jsonm [3] (on top of Uutf) and Dicomm [4] are programmed.  

Best,

Daniel

[1] https://github.com/dbuenzli/nbcodec/blob/master/RATIONALE.md
[2] http://erratique.ch/software/uutf
[3] http://erratique.ch/software/jsonm
[4] http://erratique.ch/software/dicomm

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Caml-list] New release of Menhir (20141215)
  2014-12-18 15:20     ` Daniel Bünzli
@ 2014-12-18 15:34       ` Simon Cruanes
  2014-12-18 16:02         ` Nicolas Ojeda Bar
  0 siblings, 1 reply; 14+ messages in thread
From: Simon Cruanes @ 2014-12-18 15:34 UTC (permalink / raw)
  To: Daniel Bünzli
  Cc: Nicolas Ojeda Bar, Gerd Stolpmann, Francois.Pottier, caml-list

[-- Attachment #1: Type: text/plain, Size: 2358 bytes --]

I saw this on mirage's wiki:
https://github.com/mirage/mirage-www/wiki/Pioneer-Projects#bigarray-parser-generator

Apart from using bigarrays for performance, the idea of using a ppx to
make the combinators more efficient looks interesting.

I also plan to try to mix recursive descent with a monadic "refill"
function for simple parsers (S-expressions, Bencode, etc.) It will be
interesting to compare that with Daniel's manual CPS.

Cheers!

Le Thu, 18 Dec 2014, Daniel Bünzli a écrit :
> Le jeudi, 18 décembre 2014 à 15:19, Nicolas Ojeda Bar a écrit :
> > When programming monadically you are reifying the continuation at
> > every `bind` point so you get the incremental bit for free (I think
> > this goes by the fancy name of `iteratees` nowadays). On the other
> > hand it suffers from poor time/space performance common to this type
> > of parser. Also, while combinator parsers can handle arbitrary
> > backtracking (and you end up paying for this), IMAP itself requires
> > very little, so it would seem that the full flexibility of combinators
> > are not needed.
> 
> For a long time I played with combinator parsers but I never got to the point of being satisfied with the result. I also played a little bit with iteratees but couldn't get the performance I wanted with the full model in all its compositional glory.   
> 
> Nowadays I simply write my streaming lexers/parsers manually in CPS and you can drive them in non-blocking mode with a single, fixed size, buffer. See here [1] for an interface and an implementation on a toy example.  
> 
> If you get the lowest continuation decoding bits correctly (e.g. don't bind on each byte) you can actually get good time and space performance. Once you have handled that (mainly the painful bits about read overlapping two input buffers) you get very nice parsing flexibility, e.g. for things like text encoding discovery where you need to patch the continuation with the appropriate character decoder. If you care for your users you can also get very good error reporting and error recovery capabilities by applying knowledge specific to the decoded protocol. That's the way Uutf [2], Jsonm [3] (on top of Uutf) and Dicomm [4] are programmed.  

-- 
Simon

http://weusepgp.info/
key 49AA62B6, fingerprint 949F EB87 8F06 59C6 D7D3  7D8D 4AC0 1D08 49AA 62B6

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Caml-list] New release of Menhir (20141215)
  2014-12-18 15:34       ` Simon Cruanes
@ 2014-12-18 16:02         ` Nicolas Ojeda Bar
  0 siblings, 0 replies; 14+ messages in thread
From: Nicolas Ojeda Bar @ 2014-12-18 16:02 UTC (permalink / raw)
  To: Simon Cruanes
  Cc: Daniel Bünzli, Gerd Stolpmann, Francois Pottier, caml-list

Hi Simon,

On Thu, Dec 18, 2014 at 12:34 PM, Simon Cruanes
<simon.cruanes.2007@m4x.org> wrote:
> I saw this on mirage's wiki:
> https://github.com/mirage/mirage-www/wiki/Pioneer-Projects#bigarray-parser-generator
>
> Apart from using bigarrays for performance, the idea of using a ppx to
> make the combinators more efficient looks interesting.

If I understand correctly the optimizations done by the FastParser
parsers are essentially a form of aggresive inlining.  This goes
hand-in-hand with encoding your parser as a set of mutually recursive
functions.  This type of encoding does not lend itself easily to
incremental parsing (where the parser can be stopped and resumed
as/when needed).

That apart, for other use-cases it looks pretty nice.

Best wishes,
Nicolas

> I also plan to try to mix recursive descent with a monadic "refill"
> function for simple parsers (S-expressions, Bencode, etc.) It will be
> interesting to compare that with Daniel's manual CPS.
>
> Cheers!
>
> Le Thu, 18 Dec 2014, Daniel Bünzli a écrit :
>> Le jeudi, 18 décembre 2014 à 15:19, Nicolas Ojeda Bar a écrit :
>> > When programming monadically you are reifying the continuation at
>> > every `bind` point so you get the incremental bit for free (I think
>> > this goes by the fancy name of `iteratees` nowadays). On the other
>> > hand it suffers from poor time/space performance common to this type
>> > of parser. Also, while combinator parsers can handle arbitrary
>> > backtracking (and you end up paying for this), IMAP itself requires
>> > very little, so it would seem that the full flexibility of combinators
>> > are not needed.
>>
>> For a long time I played with combinator parsers but I never got to the point of being satisfied with the result. I also played a little bit with iteratees but couldn't get the performance I wanted with the full model in all its compositional glory.
>>
>> Nowadays I simply write my streaming lexers/parsers manually in CPS and you can drive them in non-blocking mode with a single, fixed size, buffer. See here [1] for an interface and an implementation on a toy example.
>>
>> If you get the lowest continuation decoding bits correctly (e.g. don't bind on each byte) you can actually get good time and space performance. Once you have handled that (mainly the painful bits about read overlapping two input buffers) you get very nice parsing flexibility, e.g. for things like text encoding discovery where you need to patch the continuation with the appropriate character decoder. If you care for your users you can also get very good error reporting and error recovery capabilities by applying knowledge specific to the decoded protocol. That's the way Uutf [2], Jsonm [3] (on top of Uutf) and Dicomm [4] are programmed.
>
> --
> Simon
>
> http://weusepgp.info/
> key 49AA62B6, fingerprint 949F EB87 8F06 59C6 D7D3  7D8D 4AC0 1D08 49AA 62B6

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Caml-list] New release of Menhir (20141215)
  2014-12-18 14:19   ` Nicolas Ojeda Bar
  2014-12-18 15:20     ` Daniel Bünzli
@ 2014-12-18 15:25     ` Gerd Stolpmann
  2014-12-18 17:25       ` Francois Pottier
  2014-12-22 11:13     ` oleg
  2 siblings, 1 reply; 14+ messages in thread
From: Gerd Stolpmann @ 2014-12-18 15:25 UTC (permalink / raw)
  To: Nicolas Ojeda Bar; +Cc: Francois.Pottier, caml-list

[-- Attachment #1: Type: text/plain, Size: 9050 bytes --]

Hi,

thanks for pointing me to your project. Maybe I'm just going to use it
(well, I'd need it to make it compatible with Ocamlnet, and I also need
TLS and SASL). Regarding the question of whether Menhir is usable: 

Part of the difficulty are the BLOBs that can occur anywhere. Here is an
example from the RFC (C = sent by client, S = sent by server, i.e. what
we need to parse):

C:    a004 fetch 12 body[header]
S:    * 12 FETCH (BODY[HEADER] {342}
S:    Date: Wed, 17 Jul 1996 02:23:25 -0700 (PDT)
S:    From: Terry Gray <gray@cac.washington.edu>
S:    Subject: IMAP4rev1 WG mtg summary and minutes
S:    To: imap@cac.washington.edu
S:    cc: minutes@CNRI.Reston.VA.US, John Klensin <KLENSIN@MIT.EDU>
S:    Message-Id: <B27397-0100000@cac.washington.edu>
S:    MIME-Version: 1.0
S:    Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
S:
S:    )
S:    a004 OK FETCH completed

The {342} means that a BLOB with 342 bytes follows. This could be
completely harmless if we could recognize such tokens in the lexer, and
switch the lexer to a BLOB lexer before the tokens reach the parser.
However, this is actually not as easy, as the grammar also allows such
strings at other places where they do not prefix BLOBs.

My idea was that the lexer marks such tokens as "maybe-prefixes", and we
initially only parse to that point. These maybe-prefixes are always
followed by a line ending, so either the response ends and the parser
finishes (meaning that the maybe-prefix isn't a BLOB prefix), or the
parser needs more tokens, in which case it is either a BLOB prefix
followed by a BLOB or completely invalid. I haven't checked yet: could
we inspect Menhir's state and see whether the only possible
interpretation is a BLOB prefix? I.e. whether it is in the middle of a
production

blob := blob-prefix blob-data

and there is no other way of interpreting the tokens? This would make
the decision somewhat safer how such context-dependent tokens are
handled.

Gerd

Am Donnerstag, den 18.12.2014, 11:19 -0300 schrieb Nicolas Ojeda Bar:
> Hi Gerd,
> 
> I wrote an IMAP client library in pure OCaml
> (https://github.com/nojb/ocaml-imap).  It can handle the full RFC 3501
> grammar.  This library itself is not yet ready for prime-time, but I
> spent quite a bit of time thinking about the question of parsing since
> it is such a big part of the protocol.
> 
> While the incremental parsing in the new Menhir is great I do not
> think it will help to parse IMAP because the token types of IMAP
> grammar are highly context dependent.  By this I mean that there are
> many overlapping token definitions that are valid in different parts
> of the grammar.  While this could be hacked around (by changing the
> lexer used in dummy non-terminals) the complexity of doing so quickly
> gets out of hand.  In my view this almost requires using a scannerless
> parser, which rules out ocamlyacc/menhir-type parsers.
> 
> A parser generator that could handle the IMAP grammar nicely is Dypgen
> (which generates GLR parsers).  You can almost copy the IMAP grammar
> verbatim from RFC 3501.  However it can not be made incremental.  This
> means that you need to have all the input available before starting
> the parse.  Since the amount of data is potentially very big, this
> rules using Dypgen for anything else than toy applications.
> 
> In my library I use a monadic combinator parser
> (https://github.com/nojb/ocaml-imap/blob/master/imap/imapParser.ml).
> When programming monadically you are reifying the continuation at
> every `bind` point so you get the incremental bit for free (I think
> this goes by the fancy name of `iteratees` nowadays).  On the other
> hand it suffers from poor time/space performance common to this type
> of parser.  Also, while combinator parsers can handle arbitrary
> backtracking (and you end up paying for this), IMAP itself requires
> very little, so it would seem that the full flexibility of combinators
> are not needed.
> 
> It seems that the "scannerless, incremental" space in parser
> technology is not very developed (in OCaml; maybe other languages is
> different ?).  One intriguing possibility that I am exploring (but
> this is a much longer term project) is to use PEG parsers for this.
> It would be easy to write an IMAP parser with PEG. The main problem is
> that the usual implementation of PEG parsers uses memoization to keep
> parsing time from blowing up exponentially on the length of input (
> (in OCaml the Aurochs parser generator is of this type).  This
> requires storage proportional to input length, which is a problem when
> parsing large amounts of data.  But for grammars that don't need a lot
> of backtracking (like IMAP) the memoization is not actually required,
> so I think there is an interesting space to explore between all these
> constraints.
> 
> In any case, I would love to hear about your experience if you do try
> to parse IMAP using Menhir.
> 
> Best wishes,
> Nicolas
> 
> On Thu, Dec 18, 2014 at 9:45 AM, Gerd Stolpmann <info@gerd-stolpmann.de> wrote:
> > Thanks for this, it is great news. A couple of months ago I tried to
> > develop a monadic parser for the IMAP protocol, which has a weird
> > grammar and needs some strange interactions between lexing and parsing.
> > Lacking a parser generator I started to write the parser manually, but
> > it never made good progress. It looks like Menhir can now be used for
> > this case, and I'll try it.
> >
> > Gerd
> >
> > Am Mittwoch, den 17.12.2014, 21:14 +0100 schrieb Francois Pottier:
> >> Dear OCaml users,
> >>
> >> I have recently started making a series of changes in Menhir, inspired by
> >> the work of Frédéric Bour on Merlin (a smart emacs-mode for OCaml, which
> >> uses a modified version of Menhir). Thanks to Frédéric for his stimulating
> >> ideas, and thanks to Gabriel Scherer for not letting me sleep until I
> >> promised I would do something about them! :-)
> >>
> >> I have made a first release a couple days ago. The relevant chunk of the
> >> CHANGES file is appended below. In summary, a new incremental API is
> >> available; Menhir now requires ocaml 4.02; and a couple of obscure features
> >> (--error-recovery and $previouserror) have been removed in the interest of
> >> speed and simplicity.
> >>
> >> The incremental API means that you can take a snapshot of the parser's state,
> >> essentially at no cost, at any time. It also means that the parser no longer
> >> drives the lexer; you drive the lexer, and you provide tokens to the parser
> >> when it requests them. This can be convenient if the lexer is in a monad (the
> >> Lwt monad, for instance).
> >>
> >> More changes are planned. The type "env" exposed by the new incremental API is
> >> currently opaque. We are planning to offer a range of functions that allow
> >> inspecting and building values of type "env". This should allow the user to
> >> implement new error handling and error recovery strategies outside of Menhir.
> >>
> >> The new release is available now as a .tar.gz archive:
> >>
> >>   http://gallium.inria.fr/~fpottier/menhir/menhir-20141215.tar.gz
> >>
> >> It is also available via opam ("opam install menhir").
> >>
> >> Cheers,
> >>
> >> --
> >> François Pottier
> >> Francois.Pottier@inria.fr
> >> http://gallium.inria.fr/~fpottier/
> >>
> >> 2014/12/15:
> >> New incremental API (in --table mode only), inspired by Frédéric Bour.
> >>
> >> 2014/12/11:
> >> Menhir now reports an error if one of the start symbols produces
> >> either the empty language or the singleton language {epsilon}.
> >>
> >> Although some people out there actually define a start symbol that recognizes
> >> {epsilon} (and use it as a way of initializing or re-initializing some global
> >> state), this is considered bad style. Furthermore, by ruling out this case, we
> >> are able to simplify the table back-end a little bit.
> >>
> >> 2014/12/12:
> >> A speed improvement in the code back-end.
> >>
> >> 2014/12/08:
> >> Menhir now requires OCaml 4.02 (instead of 3.09).
> >>
> >> 2014/12/02:
> >> Removed support for the $previouserror keyword.
> >> Removed support for --error-recovery mode.
> >>
> >
> > --
> > ------------------------------------------------------------
> > Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
> > My OCaml site:          http://www.camlcity.org
> > Contact details:        http://www.camlcity.org/contact.html
> > Company homepage:       http://www.gerd-stolpmann.de
> > ------------------------------------------------------------
> >

-- 
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
My OCaml site:          http://www.camlcity.org
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Caml-list] New release of Menhir (20141215)
  2014-12-18 15:25     ` Gerd Stolpmann
@ 2014-12-18 17:25       ` Francois Pottier
  0 siblings, 0 replies; 14+ messages in thread
From: Francois Pottier @ 2014-12-18 17:25 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: Nicolas Ojeda Bar, caml-list

On Thu, Dec 18, 2014 at 04:25:29PM +0100, Gerd Stolpmann wrote:
> followed by a BLOB or completely invalid. I haven't checked yet: could
> we inspect Menhir's state and see whether the only possible
> interpretation is a BLOB prefix? I.e. whether it is in the middle of a
> production

At this moment, it is not possible to inspect Menhir's state. In the future,
we are planning to allow it. However, I don't know if it would help with this
issue.

An idea that comes to mind is: Menhir's new incremental interface actually
allows you to fork the parser. When you find {342} in the input stream, you
could try two alternatives (viewing it as a blob token, and not viewing it as
a non-blob token) in parallel. If you can be certain that one of the two
alternatives will quickly die, then this approach might be tenable. It is
actually a form of poor man's GLR.

-- 
François Pottier
Francois.Pottier@inria.fr
http://gallium.inria.fr/~fpottier/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Caml-list] New release of Menhir (20141215)
  2014-12-18 14:19   ` Nicolas Ojeda Bar
  2014-12-18 15:20     ` Daniel Bünzli
  2014-12-18 15:25     ` Gerd Stolpmann
@ 2014-12-22 11:13     ` oleg
  2 siblings, 0 replies; 14+ messages in thread
From: oleg @ 2014-12-22 11:13 UTC (permalink / raw)
  To: no263; +Cc: Francois.Pottier, caml-list, daniel.buenzli

Regarding incremental parsing of protocols like IMAP: I have
successfully (as in successfully deployed in production and being used
around the clock, since about 2010 or so) used iteratees for
incremental parsing of full XML, including CDATA, parsed entities and
namespaces. The full XML is actually quite difficult to parse: for
example, parsed entity references like &amp; are not recognized within
CDATA blocks; the content of attributes has its own whitespace
handling rules. The parser is used for handling sometimes quite large
XML documents. The parser is incremental and so can work in constant
memory.

        http://okmij.org/ftp/Streams.html#xml

I have also used iteratees to parse HTTP Log files, also
incrementally. The log files have an (unintended, I hope) complication:
the user-agent string (quoted in the log) may, according to RFC,
itself contain quotes. Since the embedded quotes are not escaped
(again, according to RFC), we may end up with quoted strings
containing unescaped quote characters. Parsing will require unbounded
look-ahead then. Iteratees can handle that -- and report errors
precisely and recover.

        http://okmij.org/ftp/Streams.html#good-error
        http://okmij.org/ftp/Streams.html#fork

Incidentally, there are quite many iteratee libraries. Some, like
pipes, emphasize apparent simplicity and do no input buffering. The
performance is indeed pretty bad then. 

        I should also mention that a parser with a call-back interface
and the absence of visible side-effects can _automatically_ be made
incremental. The following web page describes incrementalization of
stdlib's Genlex lexer.
        http://okmij.org/ftp/continuations/differentiating-parsers.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Caml-list] New release of Menhir (20141215)
  2014-12-17 20:14 [Caml-list] New release of Menhir (20141215) Francois Pottier
  2014-12-18 12:45 ` Gerd Stolpmann
@ 2014-12-22 18:40 ` Dario Teixeira
  2014-12-24 23:30   ` Francois Pottier
  1 sibling, 1 reply; 14+ messages in thread
From: Dario Teixeira @ 2014-12-22 18:40 UTC (permalink / raw)
  To: Francois.Pottier, caml-list

Hi,
> More changes are planned. The type "env" exposed by the new incremental API is

> currently opaque. We are planning to offer a range of functions that allow
> inspecting and building values of type "env". This should allow the  user to
> implement new error handling and error recovery strategies outside of Menhir.

I have one very concrete parsing problem which can potentially be solved elegantly
by this new API.  In the Lambtex parser [1], I need to switch between lexers
mid-parsing (think handling verbatim-like blocks).  The current solution is very
hackish and complex, to the point that I've seriously considered switching to some
parser combinator approach.

Anyway, hopefully this particular use case will be solvable by the API once it
allows inspection of values of type "env"!

Best regards,
Dario Teixeira

[1] https://github.com/darioteixeira/lambdoc/tree/master/src/lib/lambdoc_read_lambtex_impl

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Caml-list] New release of Menhir (20141215)
  2014-12-22 18:40 ` Dario Teixeira
@ 2014-12-24 23:30   ` Francois Pottier
  2014-12-26 11:13     ` Dario Teixeira
  0 siblings, 1 reply; 14+ messages in thread
From: Francois Pottier @ 2014-12-24 23:30 UTC (permalink / raw)
  To: Dario Teixeira; +Cc: caml-list

On Mon, Dec 22, 2014 at 06:40:40PM +0000, Dario Teixeira wrote:
> Anyway, hopefully this particular use case will be solvable by the API once
> it allows inspection of values of type "env"!

Hmm, maybe. The new API will probably allow you to inspect the stack (which is
basically a list of pairs of a state and a semantic value) and to inspect a
state (which can be viewed as a set of LR(1) items). I don't know whether that
would offer you a simple way of deciding when to switch from one lexer to
another...

-- 
François Pottier
Francois.Pottier@inria.fr
http://gallium.inria.fr/~fpottier/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Caml-list] New release of Menhir (20141215)
  2014-12-24 23:30   ` Francois Pottier
@ 2014-12-26 11:13     ` Dario Teixeira
  2014-12-26 11:31       ` Frédéric Bour
  0 siblings, 1 reply; 14+ messages in thread
From: Dario Teixeira @ 2014-12-26 11:13 UTC (permalink / raw)
  To: Francois.Pottier; +Cc: caml-list

Hi,

> Hmm, maybe. The new API will probably allow you to inspect the stack (which is
> basically a list of pairs of a state and a semantic value) and to inspect a
> state (which can be viewed as a set of LR(1) items). I don't know whether that
> would offer you a simple way of deciding when to switch from one lexer to
> another...

I suspect it will *at least* be an improvement over the current situation.
Consider the following rule:

inline:
| ...
| LINK OPEN RAW END OPEN inline* END
| ...

Suppose that by default I'm using a 'general' lexer.  However, upon encountering
that first OPEN token, I must switch to a 'raw' lexer and then switch back to the
'general' lexer upon encountering the first END token.  This lexer dance won't
happen with the second OPEN token, though.

Anyway, as long as I know which state Menhir is in, choosing the right lexer
should be an easy task.  It may require a large lookup table on my part to map
state to lexer, but at least that's a lot less hairy than the current approach.

Best regards,
Dario Teixeira

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Caml-list] New release of Menhir (20141215)
  2014-12-26 11:13     ` Dario Teixeira
@ 2014-12-26 11:31       ` Frédéric Bour
  2014-12-26 12:16         ` Dario Teixeira
  0 siblings, 1 reply; 14+ messages in thread
From: Frédéric Bour @ 2014-12-26 11:31 UTC (permalink / raw)
  To: caml-list

Hi,

I tried various methods with Merlin, with good results.
That's quite close to what you suggest: we add an empty non-terminal and 
change the behavior when it is on stack.
Something like:

lexer_switch:
| (* empty *) { () }

inline:
| …
| LINK lexer_switch OPEN RAW END OPEN inline* END
| …

Lexing loop:

if has_lexer_switch parser then
   feed parser (Lexer.raw_token buf)
else
   feed parser (Lexer.token buf)

Of course this require introspection (Merlin's internal version exposes 
a lot more informations, but that's for debugging and experimentation 
purposes, we hope to clean that).
You can still use side-effects to emulate the trick:

lexer_switch:
| (* empty *) { in_raw_lexer := true }

lexer_leave:
| (* empty *) { in_raw_lexer := false }

… but be careful :).

Cheers,
Fred

On 26/12/2014 12:13, Dario Teixeira wrote:
> Hi,
>
>> Hmm, maybe. The new API will probably allow you to inspect the stack (which is
>> basically a list of pairs of a state and a semantic value) and to inspect a
>> state (which can be viewed as a set of LR(1) items). I don't know whether that
>> would offer you a simple way of deciding when to switch from one lexer to
>> another...
> I suspect it will *at least* be an improvement over the current situation.
> Consider the following rule:
>
> inline:
> | ...
> | LINK OPEN RAW END OPEN inline* END
> | ...
>
> Suppose that by default I'm using a 'general' lexer.  However, upon encountering
> that first OPEN token, I must switch to a 'raw' lexer and then switch back to the
> 'general' lexer upon encountering the first END token.  This lexer dance won't
> happen with the second OPEN token, though.
>
> Anyway, as long as I know which state Menhir is in, choosing the right lexer
> should be an easy task.  It may require a large lookup table on my part to map
> state to lexer, but at least that's a lot less hairy than the current approach.
>
> Best regards,
> Dario Teixeira
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Caml-list] New release of Menhir (20141215)
  2014-12-26 11:31       ` Frédéric Bour
@ 2014-12-26 12:16         ` Dario Teixeira
  0 siblings, 0 replies; 14+ messages in thread
From: Dario Teixeira @ 2014-12-26 12:16 UTC (permalink / raw)
  To: Frédéric Bour, caml-list

Hi,

> I tried various methods with Merlin, with good results.

> That's quite close to what you suggest: we add an empty non-terminal and 
> change the behavior when it is on stack.

> Something like:

Yes, something like that should suffice for my needs.  My current solution
uses all manner of evil side-effects to achieve the same goal [1], which
is why I'm eager for an alternative...

Best regards,
Dario Teixeira


[1] https://github.com/darioteixeira/lambdoc/tree/master/src/lib/lambdoc_read_lambtex_impl

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2014-12-26 12:16 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-17 20:14 [Caml-list] New release of Menhir (20141215) Francois Pottier
2014-12-18 12:45 ` Gerd Stolpmann
2014-12-18 14:19   ` Nicolas Ojeda Bar
2014-12-18 15:20     ` Daniel Bünzli
2014-12-18 15:34       ` Simon Cruanes
2014-12-18 16:02         ` Nicolas Ojeda Bar
2014-12-18 15:25     ` Gerd Stolpmann
2014-12-18 17:25       ` Francois Pottier
2014-12-22 11:13     ` oleg
2014-12-22 18:40 ` Dario Teixeira
2014-12-24 23:30   ` Francois Pottier
2014-12-26 11:13     ` Dario Teixeira
2014-12-26 11:31       ` Frédéric Bour
2014-12-26 12:16         ` Dario Teixeira

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox