From: Florent Monnier <monnier.florent@gmail.com>
To: Gerd Stolpmann <info@gerd-stolpmann.de>
Cc: "José Romildo Malaquias" <j.romildo@gmail.com>, caml-list@inria.fr
Subject: Re: [Caml-list] Extracting information from HTML documents
Date: Sat, 23 Feb 2013 13:40:28 +0100
Message-ID: <CAE1DttBdrpW1-gq7GVTN6eVSY73bma95Lt2VKTyYDsSXRis1zg@mail.gmail.com>
In-Reply-To: <1361522580.4875.1@samsung>
2013/2/22, Gerd Stolpmann <info@gerd-stolpmann.de> :
> On 23.01.2013 21:52:29, José Romildo Malaquias wrote:
>> Hello.
>>
>> tagsoup[1][2] is a Haskell library for parsing and extracting
>> information from (possibly malformed) HTML/XML documents.
>>
>> tagsoup provides a basic data type for a list of unstructured tags, a
>> parser to convert HTML into this tag type, and useful functions and
>> combinators for finding and extracting information.
>>
>> Is there a similar library for OCaml?
>>
>> I want to write an application which will need to extract some
>> information from HTML documents from the web. tagsoup helps a lot in
>> the Haskell version of my program. Which OCaml libraries can help me
>> with that when porting the application to OCaml?
>>
>> [1] http://community.haskell.org/~ndm/tagsoup/
>> [2] http://hackage.haskell.org/package/tagsoup
>>
>>
>> Romildo
>
> Well, not really identical, but there is at least a robust HTML parser
> in OCamlnet:
>
> http://projects.camlcity.org/projects/dl/ocamlnet-3.6.3/doc/html-main/Nethtml.html
>
> Homepage: http://projects.camlcity.org/projects/ocamlnet.html
>
> This parser was once used for Mylife's profile extractor (grabbing data
> from profile pages of social networks), and is proven to handle
> absolutely bad HTML well. XML should also be no problem.
>
> Gerd
There's also xmlerr:
http://www.linux-nantes.org/%7Efmonnier/ocaml/xmlerr/
but xmlerr is an alpha-stage, experimental, hobbyist project, not a professional one.
Some parts of its code (not all) are very quick-and-dirty.
I wrote it for my own use, to read HTML web pages,
and I have been using it quite often for several years now.
99.9% of the time it does what I expect of it.
It cannot read XML files of several GB,
because it first loads the whole content into a string and then parses that string.
That was a very poor choice, but in the beginning I was only using it
to load HTML web pages.
Don't expect something of the quality of Nethtml, xmlm or xml-light!
I've never used Nethtml so I cannot say much about it, but
from what I can see in its interface, the type is:
  type document =
    | Element of (string * (string * string) list * document list)
    | Data of string
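
For illustration, here is a minimal sketch of walking such a tree to
collect its text. The type is a local copy of the one quoted above, not
the real Nethtml module, and text_of and tree are names made up for the
example:

```ocaml
(* Local copy of the tree type quoted above (not the real Nethtml module). *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* Concatenate all character data found in the tree, in document order. *)
let rec text_of = function
  | Data s -> s
  | Element (_, _, children) ->
    String.concat "" (List.map text_of children)

(* Hand-built tree; a tree can only express properly nested elements. *)
let tree = Element ("b", [], [Element ("i", [], [Data "text"])])

let () = print_endline (text_of tree)
```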
XmlErr's type is:
  type attr = string * string
  type t =
    | Tag of string * attr list  (** opening tag *)
    | ETag of string  (** closing tag *)
    | Data of string  (** PCData *)
    | Comm of string  (** Comments *)
  type html = t list
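
To give an idea of how a flat list like this can be used, here is a
small sketch that collects the href of every link. The type is copied
locally from the definition above rather than taken from the library;
hrefs and sample are hypothetical names, and I assume tag and attribute
names arrive in lower case:

```ocaml
(* Local copy of the token type quoted above (not the real Xmlerr module). *)
type attr = string * string
type t =
  | Tag of string * attr list  (* opening tag *)
  | ETag of string             (* closing tag *)
  | Data of string             (* PCData *)
  | Comm of string             (* comments *)
type html = t list

(* Collect the href attribute of every opening <a> tag, in document order.
   Assumption: tag and attribute names are lower case. *)
let hrefs (doc : html) : string list =
  List.filter_map
    (function
      | Tag ("a", attrs) -> List.assoc_opt "href" attrs
      | _ -> None)
    doc

(* Hand-written token list standing in for a parsed page. *)
let sample : html = [
  Tag ("p", []);
  Data "See ";
  Tag ("a", [("href", "http://example.org/")]);
  Data "this page";
  ETag "a";
  ETag "p";
]

let () = List.iter print_endline (hrefs sample)
```

No tree has to be rebuilt for this kind of query; a single pass over
the list is enough.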
As a result, xmlerr is able to return a plain, flat representation of
mis-nested markup such as:
<bold><i>text</bold></i>
It seems that Nethtml will return something corrected.
Xmlerr doesn't; it only returns what it sees.
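
As a sketch of what this means, the fragment above could be held as the
token list below. The list is written by hand, not actual xmlerr
output, and the type is a local copy of the one shown earlier.
well_nested is a hypothetical helper that shows the nesting error is
still visible in the data:

```ocaml
(* Local copy of the token type (not the real Xmlerr module). *)
type attr = string * string
type t =
  | Tag of string * attr list
  | ETag of string
  | Data of string
  | Comm of string

(* <bold><i>text</bold></i> kept exactly as written, mis-nesting included. *)
let fragment = [
  Tag ("bold", []); Tag ("i", []); Data "text"; ETag "bold"; ETag "i";
]

(* True when every closing tag matches the most recently opened tag and
   nothing is left open at the end. Simplification: void elements such
   as <br> are not treated specially. *)
let well_nested tokens =
  let rec go stack = function
    | [] -> stack = []
    | Tag (name, _) :: rest -> go (name :: stack) rest
    | ETag name :: rest ->
      (match stack with
       | top :: stack' when top = name -> go stack' rest
       | _ -> false)
    | (Data _ | Comm _) :: rest -> go stack rest
  in
  go [] tokens

let () = Printf.printf "well nested: %b\n" (well_nested fragment)
```

A tree type cannot hold this fragment without first deciding how to
repair it, which is why a tree-building parser has to correct the input.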
Xmlerr also parses comments, because sometimes what I want to get is in there.
Xmlerr only returns junk for the very XML-specific constructs such as <?xml
and <! declarations.
As a result, it's not possible to use xmlerr to read, correct and print
back corrected HTML when these kinds of elements are present.
The latest release also provides a command-line utility, "htmlxtr".
It doesn't require any OCaml programming; it's a basic
command-line tool.
What htmlxtr does is "untemplate" the templated parts of a web page
(in a very basic way) and print the extracted parts on stdout
(read man ./htmlxtr.1 for more information).
I'm interested in suggestions to improve it.
I use xmlerr to write quick scripts. For example,
Xmlerr.print_code prints HTML content as OCaml code of type Xmlerr.t,
so that I can quickly copy-paste a piece of it into a
pattern match and extract something from that piece in less than a
minute.
When the template of a website changes, I can usually fix my script in
less than 3 minutes.
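
Here is a rough sketch of that kind of script. The type is a local copy
of the one shown earlier, the token list is written by hand instead of
coming from Xmlerr.print_code, and the function prices together with
the span/class="price" template are made up for the example:

```ocaml
(* Local copy of the token type (not the real Xmlerr module). *)
type attr = string * string
type t =
  | Tag of string * attr list
  | ETag of string
  | Data of string
  | Comm of string

(* Scan the token list for the pasted template piece
   <span class="price">TEXT</span> and return every TEXT found.
   The exact attribute list makes the pattern brittle on purpose:
   it only matches the template it was copied from. *)
let rec prices = function
  | Tag ("span", [("class", "price")]) :: Data txt :: ETag "span" :: rest ->
    txt :: prices rest
  | _ :: rest -> prices rest
  | [] -> []

(* Hand-written stand-in for a parsed page. *)
let page = [
  Tag ("div", []);
  Tag ("span", [("class", "price")]); Data "12.50"; ETag "span";
  Comm " separator ";
  Tag ("span", [("class", "name")]); Data "widget"; ETag "span";
  Tag ("span", [("class", "price")]); Data "3.00"; ETag "span";
  ETag "div";
]

let () = List.iter print_endline (prices page)
```

When the site's template changes, only the first pattern of the match
has to be replaced with a freshly pasted piece.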
I know that some other programming languages provide utilities and
libraries for these kinds of tasks, and that some use tricks and
concepts to make extracting things from web pages as easy as possible,
but I don't know them. If you do and have some time, please tell me
about them.
Anyway even if xmlerr is very amateurish,
I would be interested to get any kind of suggestions about how to improve it.
--
Cheers
Florent