From: Oliver Bandel <oliver@first.in-berlin.de>
To: caml-list@inria.fr
Subject: Re: [Caml-list] mboxlib reloaded ;-)
Date: Sat, 28 Apr 2007 12:47:47 +0200
Message-ID: <20070428104746.GA363@first.in-berlin.de>
In-Reply-To: <1177721646.16582.8.camel@rosella.wigram>

On Sat, Apr 28, 2007 at 10:54:06AM +1000, skaller wrote:
> On Sat, 2007-04-28 at 01:12 +0200, Oliver Bandel wrote:
>
> > So, I then checked my mboxlib and saw that it is quite slow,
> > compared to what I expected (expected! I did not try it
> > on my development machine because I have no mutt installed there)
> > and even if native code is much faster, it's nevertheless slow...
> > ...so I thought I have to redesign my scanner stage.
> > (I use the Str module and ocamllex mixed together; maybe
> > using a plain self-written OCaml scanner might be better here).
>
> Ocamllex generates a very fast scanner: it is using
> a very high-tech tagged deterministic finite state automaton
> with a driver written in C (so no boxing etc processing
> text buffers). I doubt you can hand code anything as
> fast as Ocamllex in C, let alone in Ocaml.

I know that ocamllex is fast.
But I call ocamllex many, many times from my
own functions, and this could maybe be done
more elegantly, with fewer calls to ocamllex.
Or maybe I should not lex directly from the channel,
but rather read a bigger chunk of data
into memory and then lex on that.
Or maybe I should first scan the whole header and
then the body of each mail, and only afterwards
split the header into separate lines,
once it is already in RAM.
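
A minimal sketch of the "read a bigger chunk, then lex on that" idea, assuming a reasonably recent OCaml (4.04 or later). The helper names read_file, lexbuf_of_file and header_lines are made up for illustration and are not taken from mboxlib; folded (continued) header lines are not handled here.

```ocaml
(* Sketch only: read the whole file into one string, then lex from memory.
   [read_file] is a hypothetical helper, not part of mboxlib. *)
let read_file (path : string) : string =
  let ic = open_in_bin path in
  match really_input_string ic (in_channel_length ic) with
  | s -> close_in ic; s
  | exception e -> close_in_noerr ic; raise e

(* A lexbuf over the in-memory string instead of over the channel.
   An ocamllex-generated rule can be applied to it as usual. *)
let lexbuf_of_file (path : string) : Lexing.lexbuf =
  Lexing.from_string (read_file path)

(* Split a header block that is already in memory into its lines. *)
let header_lines (header : string) : string list =
  String.split_on_char '\n' header
```

Whether this is actually faster than lexing from the channel would have to be measured; for a 100 MB mbox it also means holding the whole file in memory at once.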
>
> You should check the size (number of states) of the generated
> lexer.
How?
> It will run faster with a small number of states where
> the matrix fits easily in the cache.

I think that there are not so many states, but very many calls.
And maybe building a list of header entries is faster than
building strings with the Buffer module, because I always call
Buffer.add_string and so on and so on, instead of just putting
the line onto a list.
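
The two alternatives can be sketched side by side as follows. This is only an illustration: the source type and the function names are hypothetical, and which variant is faster in practice would need profiling.

```ocaml
(* Hypothetical line source: returns [Some line] until the input is exhausted. *)
type source = unit -> string option

(* Variant A: append every line to a Buffer as it arrives. *)
let collect_with_buffer (next : source) : string =
  let buf = Buffer.create 4096 in
  let rec loop () =
    match next () with
    | None -> Buffer.contents buf
    | Some line ->
        Buffer.add_string buf line;
        Buffer.add_char buf '\n';
        loop ()
  in
  loop ()

(* Variant B: cons every line onto a list, reverse once at the end. *)
let collect_with_list (next : source) : string list =
  let rec loop acc =
    match next () with
    | None -> List.rev acc
    | Some line -> loop (line :: acc)
  in
  loop []

(* Helper for trying it out: turn a list of lines into a source. *)
let source_of_list (lines : string list) : source =
  let rest = ref lines in
  fun () ->
    match !rest with
    | [] -> None
    | x :: tl -> rest := tl; Some x
```

Variant B does no copying per line; the cost of joining is only paid if a single string is needed later, for example with String.concat.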
For the roughly 100 MB mbox there are 2.5 * 10^6 calls
to Buffer.add_string for the header, 1.6 * 10^6 calls
to Buffer.add_string for the body, 2.6 * 10^6 calls to the
function Lexing.engine, ...
It seems I had better not read line by line.
And maybe there are other reasons why it is slow.
I let the lexer read line by line and count the line numbers.
That is because I throw an exception when I detect a
broken mbox file (when an mbox file ends in the middle
of a header).
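
The line counting and the broken-mbox check described above could look roughly like this. It is a sketch under assumptions, not the actual mboxlib code: the exception Broken_mbox and the function read_header are invented names, and the header is taken to end at the first empty line.

```ocaml
(* Hypothetical exception: carries the number of lines read
   when the input ended inside a header. *)
exception Broken_mbox of int

(* Read header lines up to the first empty line, counting lines as we go.
   [lines_so_far] is the number of lines consumed before the header.
   Returns the header lines and the updated line count (the empty
   separator line is counted as well).
   Raises [Broken_mbox n] if the input ends before the header is closed. *)
let read_header (next : unit -> string option) (lines_so_far : int)
    : string list * int =
  let rec loop acc line_no =
    match next () with
    | None -> raise (Broken_mbox line_no)
    | Some "" -> (List.rev acc, line_no + 1)
    | Some line -> loop (line :: acc) (line_no + 1)
  in
  loop [] lines_so_far
```

The counter is just an integer threaded through the loop, so the counting itself is cheap; the per-line call into the line source is the expensive part.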
So maybe I do too much, and too often.
I think there are too many calls, not too many
states in the lexer.
(But you could argue that both things are closely related.)

Ciao,
Oliver