Date: Tue, 5 Feb 2002 10:50:03 +0100
From: Xavier Leroy
To: caml-announce@inria.fr
Subject: [Caml-list] OCamlAgrep 1.0 (string searching with errors)
Message-ID: <20020205105003.A30938@pauillac.inria.fr>

It is my pleasure to release the OCamlAgrep library:

  ftp://ftp.inria.fr/lang/caml-light/bazar-ocaml/ocamlagrep-1.0.tar.gz

This library implements the Wu-Manber algorithm for string searching with errors, popularized by the "agrep" Unix command and the "glimpse" file indexing tool. It was developed as part of a search engine for a largish MP3 collection; searching "with errors" comes in handy for those who can't spell Liszt or Shostakovitch.

Given a search pattern and a string, this algorithm determines whether the string contains a substring that matches the pattern up to a parameterizable number N of "errors". An "error" is either a substitution (replace a character of the string with another character), a deletion (remove a character) or an insertion (add a character to the string).
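
To make the definition above concrete, here is a minimal sketch in OCaml that answers the same yes/no question with plain dynamic programming (the classic column-by-column edit-distance recurrence). It is not the library's code and not the bit-parallel Wu-Manber algorithm; it handles literal patterns only (no wildcards), and the name matches_with_errors is made up for this illustration.

```ocaml
(* Does [text] contain a substring that is within [max_errors] edits
   (substitutions, deletions, insertions) of [pattern]?
   Illustrative sketch only: literal patterns, O(m * n) time. *)
let matches_with_errors ~max_errors pattern text =
  let m = String.length pattern in
  (* col.(i) = fewest edits needed to turn the first i characters of the
     pattern into some substring of the text ending at the current
     position. Before any text is read, only deletions are possible. *)
  let col = Array.init (m + 1) (fun i -> i) in
  let found = ref (col.(m) <= max_errors) in
  String.iter
    (fun c ->
      (* diag remembers the previous column's entry one row up.
         col.(0) stays 0: a match may start anywhere in the text. *)
      let diag = ref col.(0) in
      for i = 1 to m do
        let above_left = !diag in
        diag := col.(i);
        (* Pair pattern.[i-1] with c: free if equal, else a substitution. *)
        let subst = above_left + (if pattern.[i - 1] = c then 0 else 1) in
        (* c is an extra character in the text. *)
        let skip_text_char = col.(i) + 1 in
        (* pattern.[i-1] has no counterpart in the text. *)
        let skip_pattern_char = col.(i - 1) + 1 in
        col.(i) <- min subst (min skip_text_char skip_pattern_char)
      done;
      if col.(m) <= max_errors then found := true)
    text;
  !found
```

For example, matches_with_errors ~max_errors:1 "Liszt" "Franz List" is true (one character of the pattern is missing from the text), while the same call with ~max_errors:0 is false.
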
In more scientific terms, the number of errors is the Levenshtein edit distance between the pattern and the matched substring.

The search patterns are roughly those of the Unix shell, including the one-character wildcard (?), character classes ([0-9]) and the multi-character wildcard (*). In addition, conjunction (&) and alternative (|) are supported. General regular expressions are not supported, however.

Performance is quite good: for short patterns (fewer than 31 characters) and no errors, this library is about 8 times faster than OCaml's "Str" regular expression library. Speed decreases with the number of errors allowed, but even with 3 errors we are still faster than "Str".

The algorithm is described in S. Wu and U. Manber, "Fast Text Searching With Errors", tech. rep. TR 91-11, University of Arizona, 1991. It's a nice exercise in dynamic programming and bit-parallel implementation.

Enjoy,

- Xavier Leroy