From: Anton Bachin <antonbachin@yahoo.com>
To: caml users <caml-list@inria.fr>
Subject: [Caml-list] [ANN] Lambda Soup 0.6 + Markup.ml 0.7 – Improved HTML5 processing
Date: Thu, 11 Feb 2016 13:04:18 -0600
Message-ID: <330D84C8-2A00-4127-A03B-5287E003F6B7@yahoo.com>
Hello,
I would like to announce releases 0.6 of Lambda Soup, the CSS-selector-based
HTML scraper and rewriter, and 0.7 of Markup.ml, the streaming HTML and XML
parser.
https://github.com/aantron/lambda-soup
https://github.com/aantron/markup.ml
The main change in Lambda Soup is that it is now based on Markup.ml instead of
Ocamlnet. As a result,
- parsing now conforms closely to the HTML5 specification, including error
recovery;
- HTML entity references are translated;
- encodings are detected automatically, Lambda Soup is no longer limited to
ASCII-compatible input, and all strings emitted by the API are in UTF-8; and
- empty attributes are handled correctly.
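As a rough illustration of the entity and empty-attribute handling listed
above, here is a minimal sketch. It assumes the lambdasoup package is
installed; the input markup and the helper name `describe` are made up for
this example and are not part of either library.

```ocaml
(* Hypothetical helper: parse a snippet, select its first <p> element, and
   return that element's text together with its "class" attribute. *)
let describe html =
  let doc = Soup.parse html in
  let p = Soup.($) doc "p" in          (* raises if no <p> is found *)
  (Soup.leaf_text p, Soup.attribute "class" p)

let () =
  (* Made-up input: an empty "class" attribute and an entity reference. *)
  let text, cls = describe "<p class>Fish &amp; chips</p>" in
  let show = function Some s -> Printf.sprintf "%S" s | None -> "(none)" in
  Printf.printf "text: %s\nclass: %s\n" (show text) (show cls)
```

Both results are options, so a missing attribute or a non-leaf element shows
up as `None` rather than as an exception.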
Lambda Soup can now accept and emit Markup.ml parsing signal streams, so it can
be used for filters, without having to parse directly from or serialize all the
way to strings. It can also be used safely with XML. Parsing is, however, much
slower; speeding it up depends on Markup.ml being optimized in the future.
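To make the signal-stream interface more concrete, the sketch below feeds a
transformed Markup.ml stream into Lambda Soup and then writes the tree back
out through Markup.ml. It assumes the lambdasoup and markup packages are
installed and OCaml 4.03 or later (for `String.uppercase_ascii`); the names
`uppercase_text` and `shout` are invented for the example.

```ocaml
(* Hypothetical filter: upper-case every text signal, pass the rest through. *)
let uppercase_text signals =
  Markup.map
    (function
      | `Text strings -> `Text (List.map String.uppercase_ascii strings)
      | other -> other)
    signals

(* Hypothetical round trip: string -> signals -> Lambda Soup tree -> signals
   -> string. The intermediate tree could be queried or rewritten with the
   usual Lambda Soup functions before it is serialized again. *)
let shout html =
  let tree =
    Markup.string html
    |> Markup.parse_html
    |> Markup.signals
    |> uppercase_text
    |> Soup.from_signals
  in
  tree |> Soup.signals |> Markup.write_html |> Markup.to_string

let () = print_endline (shout "<p>Hello, <em>world</em>!</p>")
```

The filter works on signals, so only the final `Markup.to_string` call builds
an output string.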
The HTML parser in Markup.ml, in turn, now implements the adoption agency
algorithm, an error recovery algorithm from the HTML5 specification that is
ill-suited for streaming parsers. It is also more thoroughly tested, and has
received many bug fixes.
I must thank Jerome Vouillon and Leo Wzukw for bug reports. They are greatly
appreciated.
Regards,
Anton