From: oleg@okmij.org
To: no263@dpmms.cam.ac.uk
Cc: Francois.Pottier@inria.fr, caml-list@inria.fr,
daniel.buenzli@erratique.ch
Subject: Re: [Caml-list] New release of Menhir (20141215)
Date: Mon, 22 Dec 2014 06:13:46 -0500 (EST)
Message-ID: <20141222111346.68AD4C3829@www1.g3.pair.com>
In-Reply-To: <CAPunWhD=DgnPRXJo60ppx_sGbGeVbzYSXDCqffk2dMKJ8K=Vdw@mail.gmail.com>
Regarding incremental parsing of protocols like IMAP: I have
successfully (as in successfully deployed in production and being used
around the clock, since about 2010 or so) used iteratees for
incremental parsing of full XML, including CDATA, parsed entities and
namespaces. The full XML is actually quite difficult to parse: for
example, parsed entity references like &amp; are not recognized within
CDATA blocks; the content of attributes has its own whitespace
handling rules. The parser is used for handling sometimes quite large
XML documents. The parser is incremental and so can work in constant
memory.
http://okmij.org/ftp/Streams.html#xml
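To make the idea of chunk-by-chunk parsing in constant memory concrete, here is a minimal sketch of an iteratee in OCaml. It is not the code behind the link above: the types `chunk` and `iteratee` and the functions `count_cdata_ends` and `run` are hypothetical names chosen for illustration. The example counts occurrences of the CDATA terminator `]]>` and keeps only two integers of state, so the terminator is found even when it is split across chunk boundaries.

```ocaml
(* One piece of the input stream: some data, or end of input. *)
type chunk = Chunk of string | Eof

(* An iteratee either has its result or asks for one more chunk. *)
type 'a iteratee =
  | Done of 'a
  | Cont of (chunk -> 'a iteratee)

(* Count occurrences of "]]>". [seen] is the number of consecutive ']'
   characters just read, capped at 2. The state carried between chunks
   is just [count] and [seen], regardless of input size. *)
let count_cdata_ends : int iteratee =
  let rec step count seen = function
    | Eof -> Done count
    | Chunk s ->
      let count = ref count and seen = ref seen in
      String.iter
        (fun c ->
           match c with
           | ']' -> seen := min 2 (!seen + 1)
           | '>' -> if !seen = 2 then incr count; seen := 0
           | _ -> seen := 0)
        s;
      Cont (step !count !seen)
  in
  Cont (step 0 0)

(* A tiny enumerator: feed a list of chunks, then signal end of input. *)
let run (it : 'a iteratee) (chunks : string list) : 'a =
  let feed it c =
    match it with
    | Cont k -> k (Chunk c)
    | Done _ -> it
  in
  match List.fold_left feed it chunks with
  | Done v -> v
  | Cont k ->
    (match k Eof with
     | Done v -> v
     | Cont _ -> failwith "iteratee did not finish at end of input")
```

For instance, `run count_cdata_ends ["a]"; "]>b]]"; "]>"]` sees two terminators, each of them straddling a chunk boundary.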
I have also used iteratees to parse HTTP log files, also
incrementally. The log files have an (unintended, I hope) complication:
the user-agent string (quoted in the log) may, according to RFC,
itself contain quotes. Since the embedded quotes are not escaped
(again, according to RFC), we may end up with quoted strings
containing unescaped quote characters. Parsing will require unbounded
look-ahead then. Iteratees can handle that -- and report errors
precisely and recover.
http://okmij.org/ftp/Streams.html#good-error
http://okmij.org/ftp/Streams.html#fork
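The look-ahead problem can be illustrated with a small OCaml sketch. It assumes a deliberately simplified, hypothetical line format, `<host> "<user agent>"`, in which the quoted user-agent is the last field on the line; the function name `parse_line` is likewise an assumption. Under that assumption the closing quote is the last quote on the line, so deciding whether any given quote is the closing one means scanning to the end of the line, however long it is. This is not the iteratee code behind the links above, and it works on a whole line rather than incrementally.

```ocaml
(* Hypothetical simplified format:  <host> "<user agent>"
   The user-agent may itself contain unescaped quote characters. *)
let parse_line (line : string) : (string * string, string) result =
  match String.index_opt line '"', String.rindex_opt line '"' with
  | Some i, Some j when i < j ->
    let host = String.trim (String.sub line 0 i) in
    (* Everything between the first and the last quote is the field,
       embedded quotes included. *)
    let agent = String.sub line (i + 1) (j - i - 1) in
    let trailing = String.sub line (j + 1) (String.length line - j - 1) in
    if String.trim trailing <> "" then
      Error
        (Printf.sprintf "unexpected text after closing quote at column %d"
           (j + 1))
    else Ok (host, agent)
  | Some i, _ ->
    (* Only one quote on the line: report where the string started. *)
    Error
      (Printf.sprintf "unterminated quoted string starting at column %d" i)
  | None, _ -> Error "no quoted field found"
```

Returning a `result` with the column number is a small nod to precise error reporting: the caller can log the bad line and move on to the next one.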
Incidentally, there are quite a few iteratee libraries. Some, like
pipes, emphasize apparent simplicity and do no input buffering. The
performance is indeed pretty bad then.
I should also mention that a parser with a call-back interface
and no visible side-effects can _automatically_ be made
incremental. The following web page describes incrementalization of
stdlib's Genlex lexer.
http://okmij.org/ftp/continuations/differentiating-parsers.html
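As a rough illustration of turning a pull-style parser into a push-style one without rewriting it, here is a sketch that uses OCaml 5 effect handlers. This is a swapped-in technique: effect handlers postdate this message, and the page linked above may well do it differently. The names `sum_ints`, `Need_input`, `step` and `incrementalize` are hypothetical. The parser `sum_ints` only knows a callback `unit -> char option`; `incrementalize` suspends it whenever the buffer runs dry and hands back a function that accepts the next chunk.

```ocaml
(* Requires OCaml 5 or later (the Effect module of the standard library). *)
open Effect
open Effect.Deep

(* A callback-style parser, unaware of chunks: it sums space-separated
   non-negative integers, pulling characters through [next]. *)
let sum_ints (next : unit -> char option) : int =
  let rec go total cur =
    match next () with
    | None -> total + cur
    | Some ' ' -> go (total + cur) 0
    | Some c -> go total (cur * 10 + Char.code c - Char.code '0')
  in
  go 0 0

(* Performed when the buffer is empty; answered with the next chunk,
   or None at end of input. *)
type _ Effect.t += Need_input : string option Effect.t

type 'a step =
  | Finished of 'a
  | Awaiting of (string option -> 'a step)

let incrementalize (parser : (unit -> char option) -> 'a) : 'a step =
  let buf = ref "" and pos = ref 0 and eof = ref false in
  let rec next () =
    if !pos < String.length !buf then begin
      let c = (!buf).[!pos] in
      incr pos;
      Some c
    end else if !eof then None
    else
      match perform Need_input with
      | None -> eof := true; None
      | Some s -> buf := s; pos := 0; next ()
  in
  match_with parser next
    { retc = (fun v -> Finished v);
      exnc = raise;
      effc = (fun (type b) (eff : b Effect.t) ->
        match eff with
        | Need_input ->
          (* Capture the suspended parser; resume it when a chunk arrives. *)
          Some (fun (k : (b, _) continuation) ->
            Awaiting (fun chunk -> continue k chunk))
        | _ -> None) }
```

Continuations here are one-shot, so each `Awaiting` function may be called only once; feeding `Some "12 3"`, then `Some "4 5"`, then `None` makes the parser see the characters of `12 34 5`.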