From: Xavier Leroy <xavier.leroy@inria.fr>
To: caml-announce@inria.fr
Subject: [Caml-list] OCamlAgrep 1.0 (string searching with errors)
Date: Tue, 5 Feb 2002 10:50:03 +0100
Message-ID: <20020205105003.A30938@pauillac.inria.fr>

It is my pleasure to release the OCamlAgrep library:

     ftp://ftp.inria.fr/lang/caml-light/bazar-ocaml/ocamlagrep-1.0.tar.gz

This library implements the Wu-Manber algorithm for string searching
with errors, popularized by the "agrep" Unix command and the "glimpse"
file indexing tool.  It was developed as part of a search engine for a
largish MP3 collection; the "with error" searching comes in handy for those
who can't spell Liszt or Shostakovitch.

Given a search pattern and a string, this algorithm determines whether
the string contains a substring that matches the pattern up to a
parameterizable number N of "errors".  An "error" is either a
substitution (replace a character of the string with another
character), a deletion (remove a character) or an insertion (add a
character to the string).  In more scientific terms, the number of
errors is the Levenshtein edit distance between the pattern and the
matched substring.
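
To make the definition above concrete, here is a minimal sketch of the
textbook dynamic-programming way to compute the smallest number of errors
between a pattern and any substring of a string. It is not the code of the
library: the names min_errors and matches are made up for this
illustration, and only plain characters are handled (no wildcards or
character classes).

```ocaml
(* Illustrative sketch only, not the OCamlAgrep implementation.
   Smallest edit distance between [pattern] and any substring of [text]. *)
let min_errors pattern text =
  let m = String.length pattern in
  (* col.(i) = fewest errors needed to match the first i characters of
     the pattern against some substring ending at the current position.
     Before reading any text, i characters cost i errors. *)
  let col = Array.init (m + 1) (fun i -> i) in
  let best = ref col.(m) in
  String.iter
    (fun c ->
      (* diag holds the previous column's value one row up.
         col.(0) stays 0 because a match may start anywhere. *)
      let diag = ref col.(0) in
      for i = 1 to m do
        let old = col.(i) in
        let cost = if pattern.[i - 1] = c then 0 else 1 in
        col.(i) <- min (min (old + 1) (col.(i - 1) + 1)) (!diag + cost);
        diag := old
      done;
      if col.(m) < !best then best := col.(m))
    text;
  !best

(* True when [text] contains a substring within [errors] of [pattern]. *)
let matches ~errors pattern text = min_errors pattern text <= errors
```

For example, matches ~errors:1 "Liszt" "Franz List" holds, since "List"
is one error away from "Liszt", while the same call with ~errors:0 does
not.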

The search patterns are roughly those of the Unix shell, including
one-character wildcard (?), character classes ([0-9]) and multi-character
wildcard (*).  In addition, conjunction (&) and alternative (|) are
supported.  General regular expressions are not supported, however.

Performance is quite good: for short patterns (less than 31 characters)
and no errors, this library is about 8 times faster than OCaml's "Str"
regular expression library.  Speed decreases with the number of errors
allowed, but even with 3 errors we are still faster than "Str".

The algorithm is described in S. Wu and U. Manber, "Fast Text
Searching With Errors", tech. rep. TR 91-11, University of Arizona, 1991.
It's a nice exercise in dynamic programming and bit-parallel implementation.
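
For readers curious about the bit-parallel part, the following is a rough
sketch of the usual textbook formulation of this kind of search, keeping
one machine word of state per allowed error. It is an illustration written
for this note, not an excerpt from the library: the name
search_with_errors is hypothetical, only plain characters are handled, and
the pattern must fit in a single OCaml int.

```ocaml
(* Illustrative sketch only, not the OCamlAgrep implementation.
   Does [text] contain a substring within [errors] edits of [pattern]? *)
let search_with_errors ~errors pattern text =
  let m = String.length pattern in
  if m = 0 then true
  else begin
    if m >= Sys.int_size then
      invalid_arg "search_with_errors: pattern too long";
    let k = max 0 (min errors m) in
    (* masks.(c) has bit i set when pattern.[i] is the character c. *)
    let masks = Array.make 256 0 in
    String.iteri
      (fun i c ->
        let code = Char.code c in
        masks.(code) <- masks.(code) lor (1 lsl i))
      pattern;
    (* r.(d) has bit i set when pattern[0..i] matches a suffix of the
       text read so far with at most d errors.  Before any text is read,
       the first d pattern characters can be accounted for by d errors. *)
    let r = Array.init (k + 1) (fun d -> (1 lsl d) - 1) in
    let accept = 1 lsl (m - 1) in
    let found = ref (r.(k) land accept <> 0) in
    String.iter
      (fun c ->
        let mask = masks.(Char.code c) in
        let old_prev = ref r.(0) in (* value of r.(d-1) before this step *)
        r.(0) <- ((r.(0) lsl 1) lor 1) land mask;
        for d = 1 to k do
          let old = r.(d) in
          r.(d) <-
            (* the text character matches the next pattern character *)
            (((old lsl 1) lor 1) land mask)
            (* the text has an extra character *)
            lor !old_prev
            (* the text character is wrong (old r.(d-1)), or a pattern
               character is missing from the text (new r.(d-1)) *)
            lor (((!old_prev lor r.(d - 1)) lsl 1) lor 1);
          old_prev := old
        done;
        if r.(k) land accept <> 0 then found := true)
      text;
    !found
  end
```

Each text character costs a handful of word operations per allowed error,
which is consistent with the speed decreasing as the number of errors
grows.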

Enjoy,

- Xavier Leroy

