From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=disabled version=3.1.3 Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by yquem.inria.fr (Postfix) with ESMTP id 90B7ABC6B for ; Sat, 28 Apr 2007 16:18:17 +0200 (CEST) Received: from einhorn.in-berlin.de (einhorn.in-berlin.de [192.109.42.8]) by concorde.inria.fr (8.13.6/8.13.6) with ESMTP id l3SEIG7n016365 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=FAIL) for ; Sat, 28 Apr 2007 16:18:17 +0200 X-Envelope-From: oliver@first.in-berlin.de X-Envelope-To: Received: from first (dslb-088-073-111-021.pools.arcor-ip.net [88.73.111.21]) (authenticated bits=0) by einhorn.in-berlin.de (8.13.6/8.13.6/Debian-1) with ESMTP id l3SEIFqw017337 for ; Sat, 28 Apr 2007 16:18:16 +0200 Received: by first (Postfix, from userid 501) id 38AA03AE67C; Sat, 28 Apr 2007 16:18:12 +0200 (CEST) Date: Sat, 28 Apr 2007 16:18:12 +0200 From: Oliver Bandel To: caml-list@yquem.inria.fr Subject: Re: [Caml-list] mboxlib reloaded ;-) Message-ID: <20070428141812.GD2396@first.in-berlin.de> References: <20070427135425.GA1161@first.in-berlin.de> <20070427162911.GA10099@furbychan.cocan.org> <20070427231220.GA1507@first.in-berlin.de> <1177721646.16582.8.camel@rosella.wigram> <20070428104746.GA363@first.in-berlin.de> <20070428125453.1fef4ae4@kerneis.info> <20070428114450.GI363@first.in-berlin.de> <1177768191.11923.24.camel@rosella.wigram> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <1177768191.11923.24.camel@rosella.wigram> User-Agent: Mutt/1.5.6i X-Scanned-By: MIMEDefang_at_IN-Berlin_e.V. on 192.109.42.8 X-Miltered: at concorde with ID 463357A8.001 by Joe's j-chkmail (http://j-chkmail . ensmp . fr)! X-Spam: no; 0.00; bandel:01 in-berlin:01 reloaded:01 0200,:01 bandel:01 0200,:01 in-berlin:01 lexer:01 ocamllex:01 mll:01 transitions:01 afaik:01 ocaml:01 buffer:01 ocaml:01 On Sat, Apr 28, 2007 at 11:49:51PM +1000, skaller wrote: > On Sat, 2007-04-28 at 13:44 +0200, Oliver Bandel wrote: > > On Sat, Apr 28, 2007 at 12:54:53PM +0200, Gabriel Kerneis wrote: > > > Le Sat, 28 Apr 2007 12:47:47 +0200, Oliver Bandel > > > a écrit : > > > > > You should check the size (number of states) of the generated > > > > > lexer. > > > > > > > > How? > > > > > > It's printed out by ocamllex when you run it on you .mll file. > > > Regards, > > > > Ah, ok. :) > > > > > > 18 states, 261 transitions, table size 1152 bytes. > > > > Does not loooks very huge ;-) > > Lol, no it is tiny. You are probably right, too many calls, > and too much copying data around. AFAIK Ocaml channels also > add an extra buffer layer (is that right?) so there's even > more copying. > > Still, although Ocaml may generate more code than C, > if your code is reasonably tight it should be cached > and be fast: function calls are actually quite cheap. > > Here's an idea: you said: > > "For the about 100MB mbox there are 2.5 * 10^6 calls to > to Buffer.add_string for the header and 1.6 * 10^6 calls > to Buffer.add_string for the body, 2.6*10^6 calls to the > function lexing.engine, ..." > > How about NOT storing the body text. Instead, just store > the integer file offset of the first byte and the length? [...] This is what I have foreseen for different kinds of application. In the parts I have implemented already, I use completely reading of the data, because this would be the most general way of handling. I will be able to do fulltext-research and possibly very complex analysis of the mailtext, and this is the reason why I read all things into memory. To use file-oiffsets would make sense, when the main functionality would be that of a mailreader, where you might look into the one or the other mail. Then reading from disk is not a problem. But when you want to be able to do a lot of different things, then always again and again reading from disk might be much slower than once reading data into memory and then working on it. But maybe I provide such a functionality also, and only read to memory, when I need it. The main intend of the lib (ys, a library, not a certain application) was to have a tool at hand, that makes complex things achievable easy. Example: ========================================================= open Mbox let filename = "./MB1" let _ = print_endline "OK, let's start!"; let sm = read filename in let mymatch = match_header_regexp ".*richard" NoCase in let result = filter_rename mymatch sm "./MB1.ready" in write_force result; print_endline "OK, ready! :)" ========================================================= Opens the mbox-file "MB1" and writes all mails with the "richard"-string in the header to the mbox-file "MB1.ready". It could also be used for spam-detection, word-counts, textanalysis, automatic rearranging of mbox-files, throwing away multiple mails, or saving the same mail in multiple folders, if they belong to more than one theme. Or doing complex data analysis on the mails. E.g. for statistical reasons of text-analysis, maybe similarity-calculations of texts, ... ocamlc: 31.4 seconds ocamopt: 7.7 seconds Doing an exit after reading-stage: ocamlc: 7.0 seconds ocamlopt: 2.5 seconds So, the pattern-matching (using Str-module) takes much more time than the reading-/scanning. It's about 100 MB mbox with mostly short mails. (I could provide more statistical infos, using this lib :)) Doing a $ time cat MB1 > MB2 needs 0.2 seconds $ time grep -i richard MB1 > MB2 needs 0.148 seconds. OK, such flat-file applications are always faster, as they do not read the data to memory and they do no checks on the validity of the files (no exception on broken mbox-files). But maybe it could be done better. Parsing-stage: Str-module not seems to be the fastest, and ocamllex creats ml-files that first have to be compiled.... Is there no certain library in OCaml, which can be used? Or do it have to be developed? I think on such a kind of thing: http://swtch.com/~rsc/regexp/regexp1.html A thing like a runtime-ocamllex, that creates datastructures instead of ocaml-code, would make sense, IMHO. IS No such thing available already? Ciao, Oliver