From: Anton Bachin <antonbachin@yahoo.com>
To: caml-list@inria.fr
Subject: [Caml-list] [ANN] Markup.ml - HTML5 and XML parsers with error recovery
Date: Fri, 15 Jan 2016 10:51:47 -0600
Message-ID: <57F3748F-BEF9-444D-96F0-71752CB4F4A2@yahoo.com>
Hello,
I would like to announce the release of Markup.ml, a pair of streaming,
error-recovering parsers for HTML and XML. Usage is simple, like this:
    (* Pretty-print HTML, with error correction. *)
    open Markup

    let () =
      channel stdin
      |> parse_html
      |> signals
      |> pretty_print
      |> write_html
      |> to_channel stdout
and
    (* Show up to 10 XML errors to the user and abort early. *)
    let report =
      let count = ref 0 in
      fun location error ->
        error |> Error.to_string ~location |> prerr_endline;
        count := !count + 1;
        if !count >= 10 then raise_notrace Exit

    let () =
      string "some xml" |> parse_xml ~report |> signals |> drain
While still providing an easy basic interface, the parsers are
non-blocking and can be readily used with threading libraries such as
Lwt. For example, if "s" is a char Lwt_stream.t:
    (* Assemble HTML into a tree asynchronously. *)
    type html = Text of string | Element of string * html list

    Markup_lwt.lwt_stream s
    |> parse_html
    |> signals
    |> Markup_lwt.tree
      ~text:(fun ss -> Text (String.concat "" ss))
      ~element:(fun (_, name) _ children -> Element (name, children))
    >>= (fun tree -> ...)
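To make the role of the ~text and ~element callbacks concrete, here is a minimal, synchronous sketch of folding a signal stream into a tree. It uses only the standard library and a deliberately simplified, hypothetical signal type (no namespaces, attributes, comments, or doctypes), and its "tree" function is a stand-in written for illustration. It is not Markup.ml's implementation.

```ocaml
(* Simplified, hypothetical signal type for illustration only. *)
type signal =
  [ `Start_element of string
  | `End_element
  | `Text of string list ]

type html = Text of string | Element of string * html list

(* Fold a well-nested signal list into the first complete top-level node.
   Returns None if the input ends before a node is complete. *)
let tree ~text ~element (signals : signal list) =
  (* [stack] holds the open elements, innermost first. Each entry pairs an
     element name with its children collected so far, in reverse order. *)
  let rec go stack signals =
    match signals, stack with
    | [], _ -> None
    | `Start_element name :: rest, _ -> go ((name, []) :: stack) rest
    | `Text ss :: _, [] -> Some (text ss)
    | `Text ss :: rest, (name, children) :: outer ->
      go ((name, text ss :: children) :: outer) rest
    | `End_element :: _, [] -> None
    | `End_element :: rest, (name, children) :: outer ->
      let node = element name (List.rev children) in
      (match outer with
       | [] -> Some node
       | (parent, siblings) :: outer' ->
         go ((parent, node :: siblings) :: outer') rest)
  in
  go [] signals

(* Signals corresponding to <p>Hello, world<em>!</em></p>. *)
let doc : signal list =
  [ `Start_element "p"; `Text ["Hello, "; "world"];
    `Start_element "em"; `Text ["!"]; `End_element;
    `End_element ]

let result =
  tree doc
    ~text:(fun ss -> Text (String.concat "" ss))
    ~element:(fun name children -> Element (name, children))
```

The text callback receives a list of strings because adjacent text may arrive in several pieces. The element callback is only called once all of an element's children have been built, so the caller never sees a partial node.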
The parsers detect input encodings automatically. Everything is
converted to UTF-8.
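As an illustration of one ingredient that encoding detection commonly uses, the sketch below checks for a leading byte-order mark. This is an assumption about the general technique, not a description of what Markup.ml does internally; the names "bom" and "sniff_bom" are made up for this example, and only UTF-8 and UTF-16 marks are handled.

```ocaml
(* Hypothetical helper: guess an encoding from a leading byte-order mark. *)
type bom = Utf_8 | Utf_16be | Utf_16le

let sniff_bom (s : string) : bom option =
  (* True if [s] begins with the bytes of [prefix]. *)
  let starts_with prefix =
    String.length s >= String.length prefix
    && String.sub s 0 (String.length prefix) = prefix
  in
  if starts_with "\xEF\xBB\xBF" then Some Utf_8
  else if starts_with "\xFE\xFF" then Some Utf_16be
  else if starts_with "\xFF\xFE" then Some Utf_16le
  else None
```

Input without a byte-order mark yields None, at which point a real detector would have to fall back on other evidence, such as an encoding declared inside the document.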
Markup.ml aims for standards conformance. See the conformance status [1].
Modulo any bugs, Markup.ml should already be highly conformant, the
only significant missing pieces being the two error recovery algorithms
listed for HTML (Markup.ml already performs the rest of HTML error
recovery).
The library can be found here:
https://github.com/aantron/markup.ml
To install:
    opam install markup
Documentation is at:
http://aantron.github.io/markup.ml
Apart from ordinary improvements to the library, there are several
possible avenues of future work:
- An HTML5/XHTML polyglot serializer.
- Parsing of XML doctype declarations for a validation library built on
  top of Markup.ml.
- An Async interface (mainly just applying a functor, but I am not
  experienced with Async at the moment).
- Factoring out the stream and I/O portions of Markup.ml into their own
  library or libraries.
Bug reports and contributions are greatly appreciated.
This work was prompted by Lambda Soup. That library could use a good,
modern HTML parser, and several people also commented on the need.
Markup.ml depends on the excellent Uutf by Daniel Buenzli. I'd also
like to thank Daniel for giving useful early feedback on the library
in the last couple of days.
Regards,
Anton
[1]: http://aantron.github.io/markup.ml/#2_Conformancestatus