Re: [Caml-list] Extracting information from HTML documents

From: Florent Monnier <monnier.florent@gmail.com>
To: Gerd Stolpmann <info@gerd-stolpmann.de>
Cc: "José Romildo Malaquias" <j.romildo@gmail.com>, caml-list@inria.fr
Subject: Re: [Caml-list] Extracting information from HTML documents
Date: Sat, 23 Feb 2013 13:40:28 +0100
Message-ID: <CAE1DttBdrpW1-gq7GVTN6eVSY73bma95Lt2VKTyYDsSXRis1zg@mail.gmail.com>
In-Reply-To: <1361522580.4875.1@samsung>

2013/2/22, Gerd Stolpmann <info@gerd-stolpmann.de> :
> On 23.01.2013 21:52:29, José Romildo Malaquias wrote:
>> Hello.
>>
>> tagsoup[1][2] is a Haskell library for parsing and extracting
>> information from (possibly malformed) HTML/XML documents.
>>
>> tagsoup provides a basic data type for a list of unstructured tags, a
>> parser to convert HTML into this tag type, and useful functions and
>> combinators for finding and extracting information.
>>
>> Is there a similar library for OCaml?
>>
>> I want to write an application which will need to extract some
>> information from HTML documents from the web. tagsoup helps a lot in
>> the Haskell version of my program. Which OCaml libraries can help me
>> with that when porting the application to OCaml?
>>
>> [1] http://community.haskell.org/~ndm/tagsoup/
>> [2] http://hackage.haskell.org/package/tagsoup
>>
>>
>> Romildo
>
> Well, not really identical, but there is at least a robust HTML parser
> in OCamlnet:
>
> http://projects.camlcity.org/projects/dl/ocamlnet-3.6.3/doc/html-main/Nethtml.html
>
> Homepage: http://projects.camlcity.org/projects/ocamlnet.html
>
> This parser was once used for Mylife's profile extractor (grabbing data
> from profile pages of social networks), and is proven to handle
> absolutely bad HTML well. XML should also be no problem.
>
> Gerd

There's also xmlerr:
http://www.linux-nantes.org/%7Efmonnier/ocaml/xmlerr/

but xmlerr is an alpha-quality, experimental, hobbyist project, not a
professional one. Some parts of its code, though not all, are very
quick-and-dirty.
I wrote it for my own use, to read HTML web pages, and I have been
using it quite often for several years now.
99.9% of the time it does what I expect of it.
It cannot read XML files of several GB, because it first loads the
whole content into a string and then parses that string. That was a
very poor choice, but at the beginning I was only using it to load
HTML web pages.

Don't expect something of the quality of Nethtml, xmlm or xml-light!

I've never used Nethtml, so I cannot say much about it, but
from what I can see in its interface, the type is:

type document =
  | Element of (string * (string * string) list * document list)
  | Data of string
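
A tree type like this is usually consumed with a recursive function.
The following is only a minimal sketch: the type is a local copy of the
one quoted above (nothing here links against OCamlnet), and `text_of`
and `example` are hypothetical names made up for illustration, with the
tree built by hand rather than produced by a parser.

```ocaml
(* Local copy of the type quoted above; this sketch does not use OCamlnet. *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* Hypothetical helper: concatenate all text under a node,
   in document order. *)
let rec text_of (doc : document) : string =
  match doc with
  | Data s -> s
  | Element (_, _, children) ->
    String.concat "" (List.map text_of children)

(* Hand-built tree for <p>Hello <b>world</b></p>; not parser output. *)
let example : document =
  Element ("p", [], [ Data "Hello "; Element ("b", [], [ Data "world" ]) ])

let () = print_endline (text_of example)
```
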

xmlerr's type is:

type attr = string * string
type t =
  | Tag of string * attr list  (** opening tag *)
  | ETag of string  (** closing tag *)
  | Data of string  (** PCData *)
  | Comm of string  (** Comments *)

type html = t list

As a result, xmlerr can return a plain representation of mis-nested
markup such as:
<bold><i>text</bold></i>

Since Nethtml's type is a tree, it seems that Nethtml will return
something corrected.
Xmlerr doesn't; it only returns what it sees.
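
To make the difference concrete, here is a minimal sketch. The type is
copied from the xmlerr interface quoted above rather than loaded from
the library. The value `tokens` is a hand-written guess at what a flat,
uncorrected representation of the mis-nested fragment looks like, not
actual parser output, and `well_nested` is a hypothetical helper that
is not part of xmlerr.

```ocaml
(* Types copied from the message above; not loaded from xmlerr. *)
type attr = string * string
type t =
  | Tag of string * attr list  (* opening tag *)
  | ETag of string             (* closing tag *)
  | Data of string             (* PCData *)
  | Comm of string             (* comments *)

type html = t list

(* Hand-written flat form of <bold><i>text</bold></i>.
   Assumption: a non-correcting tokenizer would yield something like this. *)
let tokens : html =
  [ Tag ("bold", []); Tag ("i", []); Data "text"; ETag "bold"; ETag "i" ]

(* Hypothetical helper: true when every closing tag matches the most
   recently opened tag. Void tags such as <br> are not special-cased. *)
let well_nested (doc : html) : bool =
  let rec go stack = function
    | [] -> stack = []
    | Tag (name, _) :: rest -> go (name :: stack) rest
    | ETag name :: rest ->
      (match stack with
       | top :: stack' when top = name -> go stack' rest
       | _ -> false)
    | (Data _ | Comm _) :: rest -> go stack rest
  in
  go [] doc
```

A flat list can hold the fragment exactly as written, which a tree type
cannot do without first deciding how to repair the nesting.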

Xmlerr also parses comments, because sometimes what I want to get is
there.
Xmlerr only returns junk for the very XML-specific things like <?xml
and <! constructs.
As a result, it's not possible to use xmlerr to read, correct and print
back corrected HTML when these kinds of elements are present.
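
With comments kept as ordinary tokens, pulling them out is a one-line
filter. This is only a sketch: the type is copied from the interface
quoted earlier, `comments` and `page` are hypothetical names, and the
token list is written by hand. `List.filter_map` needs OCaml 4.08 or
later.

```ocaml
(* Types copied from the message above; not loaded from xmlerr. *)
type attr = string * string
type t =
  | Tag of string * attr list  (* opening tag *)
  | ETag of string             (* closing tag *)
  | Data of string             (* PCData *)
  | Comm of string             (* comments *)

(* Hypothetical helper: keep only the text of comment tokens. *)
let comments (doc : t list) : string list =
  List.filter_map (function Comm c -> Some c | _ -> None) doc

(* Hand-written token list; not parser output. *)
let page : t list =
  [ Tag ("div", []); Comm " price: 42 "; Data "Widget"; ETag "div" ]

let () = List.iter print_endline (comments page)
```
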

The latest release also provides a command-line utility, "htmlxtr".
It doesn't require any OCaml programming; it's a basic command-line
tool.
What htmlxtr does is "untemplate" the templated parts of a web page
(in a very basic way) and print the extracted pieces on stdout
(read man ./htmlxtr.1 for more information).
I'm interested in suggestions to improve it.

I'm using xmlerr for quickly written scripts. For example,
Xmlerr.print_code prints HTML content as OCaml code of type Xmlerr.t,
so that I can quickly copy-paste a piece of it into a pattern match
and get something out of that piece in less than one minute.
When the template of a website changes, I can usually fix my script in
less than 3 minutes.
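
The copy-paste-into-a-pattern workflow can be sketched as follows. The
type is copied from the interface quoted earlier; `links` and `page`
are hypothetical, the token list is hand-written, and it is not meant
to reproduce the exact output of Xmlerr.print_code. `List.assoc_opt`
needs OCaml 4.05 or later.

```ocaml
(* Types copied from the message above; not loaded from xmlerr. *)
type attr = string * string
type t =
  | Tag of string * attr list  (* opening tag *)
  | ETag of string             (* closing tag *)
  | Data of string             (* PCData *)
  | Comm of string             (* comments *)

(* Hypothetical extractor: collect (href, text) for every token run
   shaped like <a href="...">text</a>. The first pattern is the kind of
   piece one would paste from printed tokens and then generalise. *)
let rec links : t list -> (string * string) list = function
  | Tag ("a", attrs) :: Data text :: ETag "a" :: rest ->
    (match List.assoc_opt "href" attrs with
     | Some href -> (href, text) :: links rest
     | None -> links rest)
  | _ :: rest -> links rest
  | [] -> []

(* Hand-written token list standing in for a parsed page. *)
let page : t list =
  [ Tag ("ul", []);
    Tag ("li", []); Tag ("a", [ ("href", "/one") ]); Data "One"; ETag "a";
    ETag "li";
    Tag ("li", []); Tag ("a", [ ("class", "x") ]); Data "No href"; ETag "a";
    ETag "li";
    ETag "ul" ]

let () =
  List.iter (fun (href, text) -> Printf.printf "%s -> %s\n" text href)
    (links page)
```

When a site's template changes, only the first pattern in `links`
should need adjusting.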

I know that some other programming languages provide utilities and
libraries for these kinds of tasks, and that some of them use tricks
and concepts to make extracting things from web pages as easy as
possible, but I don't know them. If you do and have some time, please
tell me about them.

Anyway, even though xmlerr is very amateurish,
I would be interested in any kind of suggestions on how to improve it.

-- 
Cheers
Florent

