Mailing list for all users of the OCaml language and system.
From: Anton Bachin <antonbachin@yahoo.com>
To: caml-list@inria.fr
Subject: [Caml-list] [ANN] Lambda Soup - HTML scraping and rewriting with CSS selectors
Date: Mon, 16 Nov 2015 15:01:15 -0600
Message-ID: <4824377F-4045-4D47-9BAB-E06B0C939988@yahoo.com>

Greetings,

I would like to announce the release of Lambda Soup, a library for
manipulating HTML documents with CSS selector support. In brief, it
allows expressions such as

    (* Print all links. *)

    read_file "index.html" |> parse
    $$ "a[href]"
    |> iter (fun a -> a |> R.attribute "href" |> print_endline)

and

    (* Add ids to all <h2> tags. *)

    read_channel stdin |> parse
    $$ "h2"
    |> iter (fun h2 -> h2 |> set_attribute "id" (R.leaf_text h2))
    |> write_channel stdout

The library is based on a set of lazy node traversals (to parents,
children, siblings, etc.). The CSS syntax maps onto these. Types are
used to distinguish HTML node classes (such as text, element, and
document) and reduce the need for error-checking.
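
To make the last paragraph concrete, here is a small standalone sketch of
the two ideas: a lazy traversal, and a phantom type that separates
"definitely an element" from "any node". It is a toy model using only the
standard library and is not Lambda Soup's implementation. The module Toy
and every name inside it are assumptions made up for this illustration.

```ocaml
(* Toy model for illustration only; not Lambda Soup's implementation. *)
module Toy : sig
  type element   (* phantom tag: node is known to be an element *)
  type general   (* phantom tag: node of any kind *)
  type 'a node

  val element :
    string -> (string * string) list -> general node list -> general node
  val text : string -> general node

  (* Lazy pre-order traversal: nothing is visited until the Seq is forced. *)
  val descendants : 'a node -> general node Seq.t

  (* Keep only elements; the result type records that fact. *)
  val elements : general node Seq.t -> element node Seq.t

  (* These accept only [element node], so the caller needs no run-time
     "is this a text node?" check. *)
  val name : element node -> string
  val attribute : string -> element node -> string option
end = struct
  type element
  type general
  type raw =
    | Text of string
    | Element of string * (string * string) list * raw list
  type 'a node = raw

  let element tag attrs kids = Element (tag, attrs, kids)
  let text s = Text s

  let children = function
    | Element (_, _, kids) -> List.to_seq kids
    | Text _ -> Seq.empty

  let rec descendants node =
    children node
    |> Seq.flat_map (fun child () -> Seq.Cons (child, descendants child))

  let elements nodes =
    Seq.filter (function Element _ -> true | Text _ -> false) nodes

  (* The [Text _] cases are unreachable through the signature above. *)
  let name = function Element (tag, _, _) -> tag | Text _ -> ""
  let attribute key = function
    | Element (_, attrs, _) -> List.assoc_opt key attrs
    | Text _ -> None
end

let doc =
  Toy.element "body" [] [
    Toy.element "h2" ["id", "intro"] [Toy.text "Intro"];
    Toy.element "p" [] [
      Toy.text "See ";
      Toy.element "a" ["href", "https://example.org"] [Toy.text "this"];
    ];
  ]

(* Roughly what a selector like "a[href]" means in terms of traversals:
   walk the descendants, keep elements named "a" that carry an href. *)
let hrefs =
  Toy.descendants doc
  |> Toy.elements
  |> Seq.filter (fun e -> Toy.name e = "a")
  |> Seq.filter_map (Toy.attribute "href")
  |> List.of_seq
```

In this sketch, passing a node straight from Toy.descendants to
Toy.attribute is a compile-time type error, because it is only a general
node until it has gone through Toy.elements.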

The library can be found here:

    https://github.com/aantron/lambda-soup

and the associated documentation is at

    http://aantron.github.io/lambda-soup

OCaml, as an impure functional language with terse syntax, seems very
well-suited to this kind of work. I currently have Lambda Soup
postprocessing its own ocamldoc documentation, and I found this
postprocessor more pleasant to write and maintain than the equivalent
program using Python's Beautiful Soup would have been.

There is some discussion of implementing a new lax HTML(5) parser. This
may be the next thing I will do. Any comments on this, and on Lambda
Soup, are welcome.

Lambda Soup is in OPAM as package "lambdasoup".

Best,
Anton
