From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by sympa.inria.fr (Postfix) with ESMTPS id 6B5007FCE8 for ; Fri, 15 Jan 2016 17:51:52 +0100 (CET) IronPort-PHdr: 9a23:QrBvthSqTTclByOwmpZCBIgUltpsv+yvbD5Q0YIujvd0So/mwa64ZxyN2/xhgRfzUJnB7Loc0qyN4/6mCDVLsMzJmUtBWaIPfidNsd8RkQ0kDZzNImzAB9muURYHGt9fXkRu5XCxPBsdMs//Y1rPvi/6tmZKSV3BPAZ4bt74BpTVx5zukbvipduCOk4Z3nKUWvBbElaflU3prM4YgI9veO4a6yDihT92QdlQ3n5iPlmJnhzxtY+a9Z9n9DlM6bp6r5YTGfayQ6NtRrVdCHEiMnspzMztrxjKCwWVtVUGVWBDuxxUBA6NxhjxXpb3+n/zsPZ63iOTNs33S5glUDSl6OFgTxq+23RPDCIw7GyC0p84t6lcuh/0/xE= Authentication-Results: mail2-smtp-roc.national.inria.fr; spf=None smtp.pra=antonbachin@yahoo.com; spf=Pass smtp.mailfrom=antonbachin@yahoo.com; spf=None smtp.helo=postmaster@nm17-vm6.bullet.mail.ne1.yahoo.com Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of antonbachin@yahoo.com) identity=pra; client-ip=98.138.91.110; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="antonbachin@yahoo.com"; x-sender="antonbachin@yahoo.com"; x-conformance=sidf_compatible Received-SPF: Pass (mail2-smtp-roc.national.inria.fr: domain of antonbachin@yahoo.com designates 98.138.91.110 as permitted sender) identity=mailfrom; client-ip=98.138.91.110; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="antonbachin@yahoo.com"; x-sender="antonbachin@yahoo.com"; x-conformance=sidf_compatible; x-record-type="v=spf1" Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of postmaster@nm17-vm6.bullet.mail.ne1.yahoo.com) identity=helo; client-ip=98.138.91.110; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="antonbachin@yahoo.com"; x-sender="postmaster@nm17-vm6.bullet.mail.ne1.yahoo.com"; x-conformance=sidf_compatible X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A0AfAgAQI5lWlW5bimJegi6BXm0QiEahTQQDgU2FO4xBIocnOxEBAQEBAQEBARABAQEBBw0JCR8wgi2CNRMGATkVgRCIMAEDEg6gW5FZgnmHVAEjJwODOAEBCAEBAQEBAQEBARQGDAIBhkaCD4gTgn+BGwWNQneIYIVHiBcXghGHAiCFNQKOXTiCUoIDUwGEZiWBJAEBAQ X-IPAS-Result: A0AfAgAQI5lWlW5bimJegi6BXm0QiEahTQQDgU2FO4xBIocnOxEBAQEBAQEBARABAQEBBw0JCR8wgi2CNRMGATkVgRCIMAEDEg6gW5FZgnmHVAEjJwODOAEBCAEBAQEBAQEBARQGDAIBhkaCD4gTgn+BGwWNQneIYIVHiBcXghGHAiCFNQKOXTiCUoIDUwGEZiWBJAEBAQ X-IronPort-AV: E=Sophos;i="5.22,300,1449529200"; d="scan'208";a="197590105" Received: from nm17-vm6.bullet.mail.ne1.yahoo.com ([98.138.91.110]) by mail2-smtp-roc.national.inria.fr with ESMTP/TLS/RC4-SHA; 15 Jan 2016 17:51:51 +0100 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.com; s=s2048; t=1452876709; bh=eMCiyjP2YQeZ2c6rPCOVD7obfM1MPcTnuHM/g4rKj/E=; h=From:Subject:Date:To:From:Subject; b=J8qR5J66cnR1EzqzyIWFJgUrrBhFugN/khpXwHhMEt08pZCk3kKJach9l729UwOFSwbYBaDpUUuqfhUuepb/LpOyNn+9TI7vhNBZh3TW4kuyuLJUKDRnxkXykvRFjkqXI/bZG/5RZALa8MWppOyuOUVlyKGiWqYrvspkHNrGaHSJ7Th+WM2jACK4pilY7deGQ/6a8TLo+rZ4/ngf+s9rW69Czi59/ED+pBTu6TpU0PQsoN/c/WcCTFoOmAtMJuWempvIZ/MOf8niSfcmmu2RzS7PjzrszGaY0aYE5BmvwEJSQbtYGG9+EgjhWcIdnFxpIUHnRSMcVAclV+CFD25jag== Received: from [98.138.226.177] by nm17.bullet.mail.ne1.yahoo.com with NNFMP; 15 Jan 2016 16:51:49 -0000 Received: from [98.138.226.61] by tm12.bullet.mail.ne1.yahoo.com with NNFMP; 15 Jan 2016 16:51:49 -0000 Received: from [127.0.0.1] by smtp212.mail.ne1.yahoo.com with NNFMP; 15 Jan 2016 16:51:49 -0000 X-Yahoo-Newman-Id: 185775.89015.bm@smtp212.mail.ne1.yahoo.com X-Yahoo-Newman-Property: ymail-3 X-YMail-OSG: 351v7ZEVM1k_p5my3NUWmvvj5zlumgaTQ9bPGy.S35nTz9h 8E.eh.M3TrRdM9_iG29jmqAqsAq5GO7_897latWoYs6ULsGXYXFAicY4JTY5 asbOJyEFngfnELe.1Q4MtIK4oa9xZoJNIr92eJxQjFe__Kq5Sm58TeAhV0Z_ RZfawiNENH8cBLGyOdzn9hzPy62MXphWfdmxn3WcbCBSnITFonpdpioPF_06 Jz9JMQ0G6MMuEeWxysRWLfIcZ5vDLpaw_TjwvQlksSuqbzHCHcLPRjpejXAM UC1X1EsfHRZz1VznQH0TEWsp5wi4rCcEtWiQBHwZ.qldWLSbp5QlXzxOuW2U ABBe.NsVLT8rEsQ07eiMeeg0THtJGcq1KCcPvU7bDiFA7jjvT4SfIGTMPhi_ 9sxYf8eoVPiJHZO0T4gkRehco6zoN9rdhZzGXZIR84XoBSGSLIM.d2vk6K0t oYZK.dUyXJ5Pa1k..Vhvkp8GZiVw4bqIa2BHD0juqK2xsstbhZ.PhyZB2d18 bhn5T4J_nQo6NQIm.HVlt2QmQi_cJ5s.KWeZbvG_922wbeE5IRDJmJ.nPFla bJTjuKe8- X-Yahoo-SMTP: ddtAESaswBBsjSthz_dVP91gr2NDfymF From: Anton Bachin Content-Type: text/plain; charset=us-ascii X-Mao-Original-Outgoing-Id: 474569507.761242-ab09c58e728f94b13b68e0e96996f657 Content-Transfer-Encoding: 7bit Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2070.6\)) Message-Id: <57F3748F-BEF9-444D-96F0-71752CB4F4A2@yahoo.com> Date: Fri, 15 Jan 2016 10:51:47 -0600 To: caml-list@inria.fr X-Mailer: Apple Mail (2.2070.6) Subject: [Caml-list] [ANN] Markup.ml - HTML5 and XML parsers with error recovery Good time of day, I would like to announce the release of Markup.ml, a pair of streaming, error-recovering parsers for HTML and XML. Usage is simple, like this: (* Pretty-print HTML, with error correction. *) open Markup channel stdin |> parse_html |> signals |> pretty_print |> write_html |> to_channel stdout and (* Show up to 10 XML errors to the user and abort early. *) let report = let count = ref 0 in fun location error -> error |> Error.to_string ~location |> prerr_endline; count := !count + 1; if !count >= 10 then raise_notrace Exit string "some xml" |> parse_xml ~report |> signals |> drain While still providing an easy basic interface, the parsers are non-blocking and can be readily used with threading libraries such as Lwt. For example, if "s" is a char Lwt_stream.t: (* Assemble HTML into a tree asynchronously. *) type html = Text of string | Element of string * html list Markup_lwt.lwt_stream s |> parse_html |> signals |> Markup_lwt.tree ~text:(fun ss -> Text (String.concat "" ss)) ~element:(fun (_, name) _ children -> Element (name, children)) >>= (fun tree -> ...) The parsers detect input encodings automatically. Everything is converted to UTF-8. Markup.ml aims at standard conformance. See the conformance status [1]. Modulo any bugs, Markup.ml should already be highly conformant, the only significant missing pieces being the two error recovery algorithms listed for HTML (Markup.ml already performs the rest of HTML error recovery). The library can be found here: https://github.com/aantron/markup.ml To install: opam install markup Documentation is at: http://aantron.github.io/markup.ml Apart from ordinary improvements to the library, there are several possible avenues of future work: - An HTML5/XHTML polyglot serializer. - Parsing of XML doctype declarations for a validation library built on top of Markup.ml. - An Async interface (mainly just applying a functor, but I am not experienced with Async at the moment). - Factoring out the stream and I/O portions of Markup.ml into their own library or libraries. Bug reports and contributions are greatly appreciated. This work was prompted by Lambda Soup. That library could use a good, modern HTML parser, and several people also commented on the need. Markup.ml depends on the excellent Uutf by Daniel Buenzli. I'd also like to thank Daniel for giving useful early feedback on the library in the last couple of days. Regards, Anton [1]: http://aantron.github.io/markup.ml/#2_Conformancestatus