* [Caml-list] Extracting information from HTML documents
From: José Romildo Malaquias @ 2013-01-23 20:52 UTC
To: caml-list

Hello.

tagsoup[1][2] is a Haskell library for parsing and extracting
information from (possibly malformed) HTML/XML documents.

tagsoup provides a basic data type for a list of unstructured tags, a
parser to convert HTML into this tag type, and useful functions and
combinators for finding and extracting information.

Is there a similar library for OCaml?

I want to write an application which will need to extract some
information from HTML documents from the web. tagsoup helps a lot in
the Haskell version of my program. Which OCaml libraries can help me
with that when porting the application to OCaml?

[1] http://community.haskell.org/~ndm/tagsoup/
[2] http://hackage.haskell.org/package/tagsoup

Romildo

* AW: [Caml-list] Extracting information from HTML documents
From: Gerd Stolpmann @ 2013-02-22 8:43 UTC
To: José Romildo Malaquias; +Cc: caml-list

Well, not really identical, but there is at least a robust HTML parser
in OCamlnet:

http://projects.camlcity.org/projects/dl/ocamlnet-3.6.3/doc/html-main/Nethtml.html

Homepage: http://projects.camlcity.org/projects/ocamlnet.html

This parser was once used for Mylife's profile extractor (grabbing data
from profile pages of social networks), and is proven to handle
absolutely bad HTML well. XML should also be no problem.

Gerd

-- 
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
Creator of GODI and camlcity.org.
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------

* Re: [Caml-list] Extracting information from HTML documents
From: Florent Monnier @ 2013-02-23 12:40 UTC
To: Gerd Stolpmann; +Cc: José Romildo Malaquias, caml-list

2013/2/22, Gerd Stolpmann <info@gerd-stolpmann.de>:
> Well, not really identical, but there is at least a robust HTML parser
> in OCamlnet:
>
> http://projects.camlcity.org/projects/dl/ocamlnet-3.6.3/doc/html-main/Nethtml.html
>
> Homepage: http://projects.camlcity.org/projects/ocamlnet.html
>
> This parser was once used for Mylife's profile extractor (grabbing data
> from profile pages of social networks), and is proven to handle
> absolutely bad HTML well. XML should also be no problem.
>
> Gerd

There's also xmlerr:
http://www.linux-nantes.org/%7Efmonnier/ocaml/xmlerr/

but xmlerr is an alpha, experimental, hobbyist, not professional
thing. Some parts of its code, though not all, are very quick-n-dirty.

I wrote it for my own use to read HTML web pages, and I have been
using it quite often for several years now. 99.9% of the time it does
what I expect from it.

It cannot read XML files that are several GB in size, because it
first loads the whole content into a string and then parses from that
string. That was a very poor choice, but at the beginning I was only
using it to load HTML web pages.

Don't expect something of the quality of Nethtml, xmlm and xml-light!

I've never used Nethtml so I cannot say anything about it, but from
what I can see in its interface, the type is:

  type document =
    | Element of (string * (string * string) list * document list)
    | Data of string

XmlErr's type is:

  type attr = string * string
  type t =
    | Tag of string * attr list  (** opening tag *)
    | ETag of string             (** closing tag *)
    | Data of string             (** PCData *)
    | Comm of string             (** Comments *)

  type html = t list

As a result xmlerr is able to return a plain representation of:

  <bold><i>text</bold></i>

So it seems that Nethtml will return something corrected.
Xmlerr doesn't; it only returns what it sees.

Also, Xmlerr parses comments, because sometimes what I want to get is
there.

Xmlerr only returns junk for the very XML-specific things like <?xml
and <! things. As a result it's not possible to use xmlerr to read,
correct and print back corrected HTML when there are these kinds of
elements.

The last release also provides a command line utility, "htmlxtr".
This "thing" doesn't require any OCaml programming; it's a basic
command line tool. What htmlxtr does is "untemplate" templated parts
of a web page (but in a very basic way) and print the extracted
things on stdout (read man ./htmlxtr.1 for more information).
I'm interested in suggestions to improve it.

I'm using xmlerr to make quickly written scripts. For example,
Xmlerr.print_code prints HTML content as OCaml code with the Xmlerr.t
type, so that I can quickly copy-paste a piece of it into a pattern
match and get something from this piece in less than one minute.
When the template of a website changes, I can usually fix my script in
less than 3 minutes.

I know that some other programming languages provide utilities and
libraries for these kinds of tasks, and that some use tricks and
concepts to extract things from web pages as easily as possible,
but I don't know them. If you do and have some time, please tell me
about them.

Anyway, even if xmlerr is very amateurish, I would be interested in
any kind of suggestions about how to improve it.

-- 
Cheers
Florent

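The pattern-matching style described in the message above can be sketched roughly as follows. This is a minimal, self-contained illustration under several assumptions: the type is a local copy of the Xmlerr type quoted above (the real Xmlerr module is not used), the tokens are written by hand instead of coming from a parser, tag names are assumed to be lowercase, and the helper names `links` and `comments` as well as the example.org URL are hypothetical.

```ocaml
(* Local copy of the token type quoted in the message above;
   the real Xmlerr module is NOT used here. *)
type attr = string * string

type t =
  | Tag of string * attr list   (* opening tag *)
  | ETag of string              (* closing tag *)
  | Data of string              (* PCData *)
  | Comm of string              (* comments *)

type html = t list

(* Hand-written tokens for: <bold><i>text</bold></i>
   The flat list keeps the mis-nested closing tags exactly as written. *)
let misnested : html =
  [ Tag ("bold", []); Tag ("i", []); Data "text"; ETag "bold"; ETag "i" ]

(* Hypothetical helper: collect (href, link text) pairs for the simple
   token pattern <a href=...>text</a>. *)
let rec links (tokens : html) : (string * string) list =
  match tokens with
  | Tag ("a", attrs) :: Data text :: ETag "a" :: rest ->
      (match List.assoc_opt "href" attrs with
       | Some href -> (href, text) :: links rest
       | None -> links rest)
  | _ :: rest -> links rest
  | [] -> []

(* Hypothetical helper: collect the text of all comments. *)
let comments (tokens : html) : string list =
  List.filter_map (function Comm c -> Some c | _ -> None) tokens

(* Hand-written sample page; the URL is made up for illustration. *)
let page : html =
  [ Tag ("p", []); Data "See ";
    Tag ("a", [ ("href", "http://example.org/") ]); Data "example"; ETag "a";
    Comm " generated by template v2 ";
    ETag "p" ]

let () =
  List.iter
    (fun (href, text) -> Printf.printf "%s -> %s\n" href text)
    (links page)
```

Because the token list is flat, adapting such a script to a changed page template usually means editing one match pattern, which fits the quick-fix workflow described above.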
* AW: [Caml-list] Extracting information from HTML documents
From: Gerd Stolpmann @ 2013-02-23 13:23 UTC
To: Florent Monnier; +Cc: José Romildo Malaquias, caml-list

On 23.02.2013 13:40:28, Florent Monnier wrote:
> There's also xmlerr:
> http://www.linux-nantes.org/%7Efmonnier/ocaml/xmlerr/
>
> but xmlerr is an alpha, experimental, hobbyist, not professional
> thing.
> ...
> I've never used Nethtml so I cannot say anything about it, but
> from what I can see in its interface, the type is:
>
>   type document =
>     | Element of (string * (string * string) list * document list)
>     | Data of string
>
> XmlErr's type is:
>
>   type attr = string * string
>   type t =
>     | Tag of string * attr list  (** opening tag *)
>     | ETag of string             (** closing tag *)
>     | Data of string             (** PCData *)
>     | Comm of string             (** Comments *)
>
>   type html = t list
>
> As a result xmlerr is able to return a plain representation of:
>   <bold><i>text</bold></i>

Right, in quirks mode browsers understand this, although this has
always been against the specs. Note that even this is possible in
quirks mode:

  <b>bold <i>bold+italics </b>only italics </i>normal text

Nethtml cannot interpret this in the obviously intended way. In
practice, this was never a problem, though (fortunately, 99% of the
code on the web is cleaner than this).

> So it seems that Nethtml will return something corrected.
> Xmlerr doesn't; it only returns what it sees.

Nethtml returns the logical view, i.e. it doesn't return tags but
elements. (NB: tags are the lexical delimiters of elements.) This is
actually what you normally want to see, because HTML is specified in
terms of elements (unless you are writing something like an HTML
editor, where knowing the tags as such is also important).

Nethtml also processes omitted tags, e.g. for

  <a><b>text</a>

it will implicitly close the "b" element when closing "a". Or even
this:

  <p>para1 <p>para2

Here, Nethtml closes the first "p" when it sees the second (because it
knows that "p" elements cannot contain other "p" elements). Note that
this was always the tricky part of HTML parsing, and we had the most
problems in this area.

> Also Xmlerr parses comments because sometimes what I want to get is
> there.

This is also possible with Nethtml, but optional. Nethtml can also
parse processing instructions, but these are rarely used even in XML
files.

> Xmlerr only returns junk for the very XML specific things like <?xml
> and <! things,
> as a result it's not possible to use xmlerr to read, correct and print
> back corrected HTML when there are these kinds of elements.

But anyway, an XML token reader like Xmlerr is certainly something
useful.

Gerd

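The implicit closing described in the message above (closing "b" when "a" is closed, and closing an open "p" when a new "p" starts) can be illustrated with a small tree builder over a token list. This is only a sketch under stated assumptions and is not Nethtml's implementation: the `token` and `frame` types and the names `add`, `pop`, `close_upto`, `is_open` and `build` are hypothetical, the `document` type is a local copy of the shape quoted in the thread, and the only nesting rule hard-coded here is that a "p" element cannot contain another "p".

```ocaml
(* Hypothetical token type: what a lexer would hand to the tree builder. *)
type token =
  | Open of string * (string * string) list
  | Close of string
  | Text of string

(* Local copy of the tree shape quoted in the thread; Nethtml is NOT used. *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* An open element: name, attributes, children so far (in reverse order). *)
type frame = string * (string * string) list * document list

(* Add a finished node to the innermost open element, or to the top level. *)
let add node (stack : frame list) (top : document list) =
  match stack with
  | (n, a, kids) :: rest -> ((n, a, node :: kids) :: rest, top)
  | [] -> ([], node :: top)

(* Close the innermost open element and attach it to its parent. *)
let pop (stack : frame list) top =
  match stack with
  | (n, a, kids) :: rest -> add (Element (n, a, List.rev kids)) rest top
  | [] -> ([], top)

(* Close open elements until the one called [name] has been closed. *)
let rec close_upto name stack top =
  match stack with
  | [] -> ([], top)
  | (n, _, _) :: _ ->
      let stack', top' = pop stack top in
      if n = name then (stack', top') else close_upto name stack' top'

let is_open name stack = List.exists (fun (n, _, _) -> n = name) stack

let build (tokens : token list) : document list =
  let step (stack, top) tok =
    match tok with
    | Text s -> add (Data s) stack top
    | Open (name, attrs) ->
        (* Simplified rule: a new <p> closes a <p> that is still open. *)
        let stack, top =
          if name = "p" && is_open "p" stack then close_upto "p" stack top
          else (stack, top)
        in
        ((name, attrs, []) :: stack, top)
    | Close name ->
        (* A closing tag also closes everything opened inside it;
           a closing tag that matches no open element is ignored. *)
        if is_open name stack then close_upto name stack top
        else (stack, top)
  in
  let stack, top = List.fold_left step ([], []) tokens in
  (* Close whatever is still open at the end of the input. *)
  let rec finish stack top =
    match stack with
    | [] -> top
    | _ -> let s, t = pop stack top in finish s t
  in
  List.rev (finish stack top)
```

With this sketch, the tokens for `<a><b>text</a>` yield one "a" element containing a "b" element, and `<p>para1 <p>para2` yields two sibling "p" elements. For the overlapping `<b>...<i>...</b>...</i>` example, the stray `</i>` is ignored and the text after `</b>` ends up outside any element, which shows why a plain element tree cannot express the intended overlap.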