Date: Mon, 21 Jun 2004 17:03:28 +0100
To: caml-list@inria.fr
Subject: [Caml-list] Parse crazy HTML, output XML
Message-ID: <20040621160328.GA28952@redhat.com>
From: Richard Jones

I have a bunch of HTML documents from an external source which I do not control. They aren't valid XML, by any means. I need to read them in, do a "best effort" to build a DOM, do various manipulations over the DOM (such as removing