[Caml-list] Extracting information from HTML documents

Mailing list for all users of the OCaml language and system.

* [Caml-list] Extracting information from HTML documents
@ 2013-01-23 20:52 José Romildo Malaquias
  2013-02-22  8:43 ` AW: " Gerd Stolpmann
  0 siblings, 1 reply; 4+ messages in thread
From: José Romildo Malaquias @ 2013-01-23 20:52 UTC (permalink / raw)
  To: caml-list

Hello.

tagsoup[1][2] is a Haskell library for parsing and extracting
information from (possibly malformed) HTML/XML documents.

tagsoup provides a basic data type for a list of unstructured tags, a
parser to convert HTML into this tag type, and useful functions and
combinators for finding and extracting information.

Is there a similar library for OCaml?

I want to write an application which will need to extract some
information from HTML documents from the web. tagsoup helps a lot in the
Haskell version of my program. Which OCaml libraries can help me with
that when porting the application to OCaml?

[1] http://community.haskell.org/~ndm/tagsoup/
[2] http://hackage.haskell.org/package/tagsoup

Romildo


* AW: [Caml-list] Extracting information from HTML documents
  2013-01-23 20:52 [Caml-list] Extracting information from HTML documents José Romildo Malaquias
@ 2013-02-22  8:43 ` Gerd Stolpmann
  2013-02-23 12:40   ` Florent Monnier
  0 siblings, 1 reply; 4+ messages in thread
From: Gerd Stolpmann @ 2013-02-22  8:43 UTC (permalink / raw)
  To: José Romildo Malaquias; +Cc: caml-list

Well, not really identical, but there is at least a robust HTML parser  
in OCamlnet:

http://projects.camlcity.org/projects/dl/ocamlnet-3.6.3/doc/html-main/Nethtml.html

Homepage: http://projects.camlcity.org/projects/ocamlnet.html

This parser was once used for Mylife's profile extractor (grabbing data
from profile pages of social networks), and has proven to handle
really bad HTML well. XML should also be no problem.

Gerd





-- 
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
Creator of GODI and camlcity.org.
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------


* Re: [Caml-list] Extracting information from HTML documents
  2013-02-22  8:43 ` AW: " Gerd Stolpmann
@ 2013-02-23 12:40   ` Florent Monnier
  2013-02-23 13:23     ` AW: " Gerd Stolpmann
  0 siblings, 1 reply; 4+ messages in thread
From: Florent Monnier @ 2013-02-23 12:40 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: José Romildo Malaquias, caml-list

2013/2/22, Gerd Stolpmann <info@gerd-stolpmann.de> :
> Well, not really identical, but there is at least a robust HTML parser
> in OCamlnet:
>
> http://projects.camlcity.org/projects/dl/ocamlnet-3.6.3/doc/html-main/Nethtml.html
>
> Homepage: http://projects.camlcity.org/projects/ocamlnet.html
>
> This parser was once used for Mylife's profile extractor (grabbing data
> from profile pages of social networks), and is proven to handle
> absolutely bad HTML well. XML should also be no problem.
>
> Gerd

There's also xmlerr:
http://www.linux-nantes.org/%7Efmonnier/ocaml/xmlerr/

but xmlerr is an alpha-quality, experimental, hobbyist project, not a
professional one.
Some parts of its code, though not all, are very quick-and-dirty.
I wrote it for my own use, to read HTML web pages,
and I have been using it quite often for several years now.
99.9% of the time it does what I expect.
It cannot read XML files of several GB,
because it first loads the whole content into a string and then parses that,
which was a very poor choice, but at the beginning I was only using it
to load HTML web pages.

Don't expect something of the quality of Nethtml, xmlm and xml-light!

I've never used Nethtml, so I cannot say much about it, but
from its interface I can see that the type is:

type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

XmlErr's type is:

type attr = string * string
type t =
  | Tag of string * attr list  (** opening tag *)
  | ETag of string  (** closing tag *)
  | Data of string  (** PCData *)
  | Comm of string  (** Comments *)

type html = t list

As a result xmlerr will be able to return a plain representation of:
<bold><i>text</bold></i>

So it seems that Nethtml will return something corrected.
Xmlerr doesn't; it only returns what it sees.
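
To make the difference concrete, here is a minimal self-contained sketch.
The two types quoted above are copied into local modules (the module names
Tree and Flat, and the helper to_string, are made up for this illustration;
neither Nethtml nor xmlerr is actually called). The values are written by
hand, not produced by a parser, and the tree shown is only one plausible
corrected reading.

```ocaml
(* Local copies of the two types quoted above. Tree and Flat are
   illustrative module names, not part of Nethtml or xmlerr. *)
module Tree = struct
  type document =
    | Element of (string * (string * string) list * document list)
    | Data of string
end

module Flat = struct
  type attr = string * string
  type t =
    | Tag of string * attr list  (* opening tag *)
    | ETag of string             (* closing tag *)
    | Data of string             (* PCData *)
    | Comm of string             (* comments *)
  type html = t list
end

(* <bold><i>text</bold></i> as a flat token list: the mis-nested
   closing tags are simply recorded in the order they appear. *)
let flat : Flat.html =
  Flat.[ Tag ("bold", []); Tag ("i", []); Data "text";
         ETag "bold"; ETag "i" ]

(* A tree can only express proper nesting, so a tree-building parser
   has to settle on some corrected reading, for instance this one. *)
let tree : Tree.document list =
  Tree.[ Element ("bold", [], [ Element ("i", [], [ Data "text" ]) ]) ]

(* Print a flat token list back as markup, exactly as recorded. *)
let to_string (toks : Flat.html) : string =
  let attr (k, v) = Printf.sprintf " %s=\"%s\"" k v in
  let tok = function
    | Flat.Tag (name, attrs) ->
        "<" ^ name ^ String.concat "" (List.map attr attrs) ^ ">"
    | Flat.ETag name -> "</" ^ name ^ ">"
    | Flat.Data s -> s
    | Flat.Comm s -> "<!--" ^ s ^ "-->"
  in
  String.concat "" (List.map tok toks)
```

Printing the flat list with to_string gives back the original, still
mis-nested markup, which is the "plain representation" meant above.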

Xmlerr also parses comments, because sometimes what I want to get is in there.
Xmlerr only returns junk for the very XML-specific constructs such as <?xml
and <! declarations,
so it is not possible to use xmlerr to read, correct and print
back corrected HTML when these kinds of constructs are present.

The latest release also provides a command-line utility, "htmlxtr".
It doesn't require any OCaml programming; it's a basic
command-line tool.
What htmlxtr does is "untemplate" the templated parts of a web page
(in a very basic way) and print the extracted pieces on stdout
(read man ./htmlxtr.1 for more information).
I'm interested in suggestions to improve it.

I use xmlerr for quickly written scripts. For example,
Xmlerr.print_code prints HTML content as OCaml code of type Xmlerr.t,
so that I can quickly copy-paste a piece of it into a
pattern match and extract something from that piece in less than one
minute.
When the template of a website changes, I can usually fix my script in
less than 3 minutes.
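
As an illustration of that workflow, here is a minimal sketch of such a
pattern match. It is self-contained: the token type is a local copy of the
one quoted earlier, the function name links and the sample page are
hypothetical, and no xmlerr function is called.

```ocaml
(* Local copy of the token type quoted earlier in this message. *)
type attr = string * string
type t =
  | Tag of string * attr list  (* opening tag *)
  | ETag of string             (* closing tag *)
  | Data of string             (* PCData *)
  | Comm of string             (* comments *)

(* Collect (href, text) for every <a href="...">text</a> sequence.
   Anything that does not match the template is skipped. *)
let rec links = function
  | Tag ("a", attrs) :: Data text :: ETag "a" :: rest ->
      (match List.assoc_opt "href" attrs with
       | Some href -> (href, text) :: links rest
       | None -> links rest)
  | _ :: rest -> links rest
  | [] -> []

(* Hand-written sample input standing in for a parsed page. *)
let page =
  [ Tag ("ul", []);
    Tag ("li", []); Tag ("a", [ ("href", "/one") ]); Data "One";
    ETag "a"; ETag "li";
    Comm " ad slot ";
    Tag ("li", []); Tag ("a", [ ("href", "/two") ]); Data "Two";
    ETag "a"; ETag "li";
    ETag "ul" ]

let () =
  List.iter
    (fun (href, text) -> Printf.printf "%s -> %s\n" href text)
    (links page)
```

When the site's template changes, only the first match case needs to be
adapted, which is why such scripts are quick to fix.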

I know that some other programming languages provide utilities and
libraries for this kind of task, and that some use tricks and
concepts to make extracting things from web pages as easy as possible,
but I don't know them. If you do and have some time, please tell me
about them.

Anyway, even if xmlerr is very amateurish,
I would be interested in any suggestions about how to improve it.

-- 
Cheers
Florent


* AW: [Caml-list] Extracting information from HTML documents
  2013-02-23 12:40   ` Florent Monnier
@ 2013-02-23 13:23     ` Gerd Stolpmann
  0 siblings, 0 replies; 4+ messages in thread
From: Gerd Stolpmann @ 2013-02-23 13:23 UTC (permalink / raw)
  To: Florent Monnier; +Cc: José Romildo Malaquias, caml-list

Am 23.02.2013 13:40:28 schrieb(en) Florent Monnier:
> > Well, not really identical, but there is at least a robust HTML parser
> > in OCamlnet:
> >
> > http://projects.camlcity.org/projects/dl/ocamlnet-3.6.3/doc/html-main/Nethtml.html
> >
> > Homepage: http://projects.camlcity.org/projects/ocamlnet.html
> >
> > This parser was once used for Mylife's profile extractor (grabbing data
> > from profile pages of social networks), and is proven to handle
> > absolutely bad HTML well. XML should also be no problem.
> >
> > Gerd
> 
> There's also xmlerr:
> http://www.linux-nantes.org/%7Efmonnier/ocaml/xmlerr/
> 
> but xmlerr is an alpha, experimental, hobbyist, not professional thing.
> ...
> I've never used Nethtml so I cannot say anything about it, but
> from what I can see from the interface is that the type is:
> 
> type document =
>   | Element of (string * (string * string) list * document list)
>   | Data of string
> 
> XmlErr's type is:
> 
> type attr = string * string
> type t =
>   | Tag of string * attr list  (** opening tag *)
>   | ETag of string  (** closing tag *)
>   | Data of string  (** PCData *)
>   | Comm of string  (** Comments *)
> 
> type html = t list
> 
> As a result xmlerr will be able to return a plain representation of:
> <bold><i>text</bold></i>

Right, in quirks mode browsers understand this, although it has always
been against the specs. Note that even this is possible in quirks mode:
<b>bold <i>bold+italics </b>only italics </i>normal text

Nethtml cannot interpret this in the obviously intended way. In
practice, though, this was never a problem (fortunately, 99% of the
code on the web is cleaner than this).

> So it seems that Nethtml will return something corrected.
> Xmlerr doesn't, it only returns what it seems.

Nethtml returns the logical view, i.e. it doesn't return tags but
elements. (NB: tags are the lexical delimiters of elements.) This is
actually what you normally want to see, because HTML is specified in
terms of elements (unless you are writing something like an HTML editor,
where knowing the tags as such is also important). Nethtml also handles
omitted tags, e.g. for <a><b>text</a> it will implicitly close the "b"
element when closing "a". Or even this: <p>para1 <p>para2 - here, Nethtml
closes the first "p" when it sees the second (because it knows that "p"
elements cannot contain other "p" elements). Note that this was always
the tricky part of HTML parsing, and we had most problems in this area.
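
The two rules just described can be sketched as a small tree builder over
a token stream. This is only a toy illustration under stated assumptions,
not Nethtml's actual algorithm: the token type, the no_self_nesting table
and all function names are hypothetical, attributes are left out, and the
only built-in knowledge is that "p" cannot contain "p".

```ocaml
(* Hypothetical token type; attributes are omitted for brevity. *)
type token =
  | Open of string
  | Close of string
  | Text of string

(* Same shape as the document type quoted above. *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* Assumed rule table: elements that may not contain themselves. *)
let no_self_nesting = [ "p" ]

(* The stack holds the open elements, innermost first, each with the
   children collected so far in reverse order. The last frame, named
   "", is the root and is never closed. *)
let close_innermost = function
  | (name, kids) :: (parent, siblings) :: rest ->
      (parent, Element (name, [], List.rev kids) :: siblings) :: rest
  | stack -> stack

(* Close open elements until the one called [name] has been closed. *)
let rec close_until name stack =
  match stack with
  | (n, _) :: _ :: _ when n = name -> close_innermost stack
  | _ :: _ :: _ -> close_until name (close_innermost stack)
  | _ -> stack

let is_open name stack = List.exists (fun (n, _) -> n = name) stack

let build (tokens : token list) : document list =
  let step stack = function
    | Text s ->
        (match stack with
         | (n, kids) :: rest -> (n, Data s :: kids) :: rest
         | [] -> stack)
    | Open name ->
        (* e.g. a second <p> implicitly closes the first one *)
        let stack =
          if List.mem name no_self_nesting && is_open name stack
          then close_until name stack
          else stack
        in
        (name, []) :: stack
    | Close name ->
        (* closing "a" also closes a still-open "b" inside it;
           a closing tag with no open element is ignored *)
        if is_open name stack then close_until name stack else stack
  in
  let rec finish = function
    | [ (_, kids) ] -> List.rev kids
    | [] -> []
    | stack -> finish (close_innermost stack)
  in
  finish (List.fold_left step [ ("", []) ] tokens)
```

For example, build [Open "p"; Text "para1 "; Open "p"; Text "para2"]
yields two sibling "p" elements rather than one nested inside the other.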

> Also Xmlerr parses comments because sometimes what I want to get is  
> there.

This is also possible with Nethtml, but optional. Nethtml can also parse
processing instructions, but these are rarely used even in XML files.

> Xmlerr only returns junk for the very XML specific things like <?xml
> and <! things,
> as a result it's not possible to use xmlerr to read, correct and print
> back corrected HTML when there are these kind of elements.

But anyway, an XML token reader like Xmlerr is certainly something  
useful.

Gerd





