* mboxlib reloaded ;-) @ 2007-04-27 13:54 Oliver Bandel 2007-04-27 16:29 ` [Caml-list] " Richard Jones 0 siblings, 1 reply; 22+ messages in thread From: Oliver Bandel @ 2007-04-27 13:54 UTC (permalink / raw) To: caml-list Hello, after two years of doing nothing on it, I today found my mboxlib, I started to write in 2005. I have put the mli-file on the web and maybe the library itself will follow during the next time. Any feedback, questions and suggestions are welcome. http://me.in-berlin.de/~first/software/libraries/mboxlib/ TIA, Oliver Bandel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] mboxlib reloaded ;-) 2007-04-27 13:54 mboxlib reloaded ;-) Oliver Bandel @ 2007-04-27 16:29 ` Richard Jones 2007-04-27 23:12 ` Oliver Bandel 0 siblings, 1 reply; 22+ messages in thread From: Richard Jones @ 2007-04-27 16:29 UTC (permalink / raw) To: Oliver Bandel; +Cc: caml-list On Fri, Apr 27, 2007 at 03:54:25PM +0200, Oliver Bandel wrote: > Hello, > > after two years of doing nothing on it, > I today found my mboxlib, I started to > write in 2005. > > I have put the mli-file on the web and > maybe the library itself will follow > during the next time. > > Any feedback, questions and suggestions are welcome. > > http://me.in-berlin.de/~first/software/libraries/mboxlib/ The source for COCANWIKI[1] contains extensive support for threading of mail messages, based on JWZ's algorithm: http://www.jwz.org/doc/threading.html You are of course welcome to copy this. If there are any license issues let me know & I can fix them. I'd also like to point you to another useful JWZ doc: http://www.jwz.org/doc/mailsum.html Rich. [1] http://sandbox.merjis.com/ -- Richard Jones Red Hat ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] mboxlib reloaded ;-) 2007-04-27 16:29 ` [Caml-list] " Richard Jones @ 2007-04-27 23:12 ` Oliver Bandel 2007-04-28 0:54 ` skaller ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Oliver Bandel @ 2007-04-27 23:12 UTC (permalink / raw) To: caml-list Hi, only a short note, because I tonight will not explore it in detail... On Fri, Apr 27, 2007 at 05:29:11PM +0100, Richard Jones wrote: > On Fri, Apr 27, 2007 at 03:54:25PM +0200, Oliver Bandel wrote: > > Hello, > > > > after two years of doing nothing on it, > > I today found my mboxlib, I started to > > write in 2005. > > > > I have put the mli-file on the web and > > maybe the library itself will follow > > during the next time. > > > > Any feedback, questions and suggestions are welcome. > > > > http://me.in-berlin.de/~first/software/libraries/mboxlib/ > > The source for COCANWIKI[1] contains extensive support for threading > of mail messages, based on JWZ's algorithm: > > http://www.jwz.org/doc/threading.html Nice... you speak of an optimized algorithm for threading. I didn't explored your solution nor did I explored your paper in detail (tomorrow I think I have the time to do it), but IMHO the best thing for handling message-threads is to use tries-datastructure with messgae-id's as identifers (instead of char's, as they are used normally). So: did you reimplemented the tries-datastructure as abstraction on message ID's, or did you made it different? > > You are of course welcome to copy this. If there are any license > issues let me know & I can fix them. > > I'd also like to point you to another useful JWZ doc: > > http://www.jwz.org/doc/mailsum.html Well, the same here: tomorrow I can look at itin more detail; but the problem of fast mbox-usage I today also found out as a problem, as I first time used a test-mbox of about 100 MB. Normally I would use some MB's of size, because I think ths is the normal size; but I had some dscussions on the berlin Linux user group, and some people were anbnoyed that mutt needs some seconds to read in mbox-files of about 80 MB's. So, I then checked my mboxlib and saw that it is quite slow, compared to what I expected ( expect! I did not tried it on my development machine because I have nomutt installed there) and even if native-code smuch faster, it's nevertheless slow... ...so I thought I have to redesign my scanner-stage. (I use Str-module and ocamnllex mixed together; maybe using a plain selfwritten OCaml-scanner might be better here). Ciao, Oliver P.S.: 12 seconds for 100 MB seems tobe quite slow... I very often call the lexer, and that might be done smarter. Maybe your pages will show some useful attempts. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] mboxlib reloaded ;-) 2007-04-27 23:12 ` Oliver Bandel @ 2007-04-28 0:54 ` skaller 2007-04-28 10:47 ` Oliver Bandel 2007-04-28 7:56 ` Richard Jones 2007-09-24 18:22 ` ocamllex speed [was Re: [Caml-list] mboxlib reloaded ;-)] Bruno De Fraine 2 siblings, 1 reply; 22+ messages in thread From: skaller @ 2007-04-28 0:54 UTC (permalink / raw) To: Oliver Bandel; +Cc: caml-list On Sat, 2007-04-28 at 01:12 +0200, Oliver Bandel wrote: > So, I then checked my mboxlib and saw that it is quite slow, > compared to what I expected ( expect! I did not tried it > on my development machine because I have nomutt installed there) > and even if native-code smuch faster, it's nevertheless slow... > ...so I thought I have to redesign my scanner-stage. > (I use Str-module and ocamnllex mixed together; maybe > using a plain selfwritten OCaml-scanner might be better here). Ocamllex generates very fast scanner: it is using a very high-tech tagged deterministic finite state automaton with a driver written in C (so no boxing etc processing text buffers). I doubt you can hand code anything as fast as Ocamllex in C, let alone in Ocaml. You should check the size (number of states) of the generated lexer. It will run faster with small number of states where the matrix fits easily in the cache. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] mboxlib reloaded ;-) 2007-04-28 0:54 ` skaller @ 2007-04-28 10:47 ` Oliver Bandel 2007-04-28 10:54 ` Gabriel Kerneis 0 siblings, 1 reply; 22+ messages in thread From: Oliver Bandel @ 2007-04-28 10:47 UTC (permalink / raw) To: caml-list On Sat, Apr 28, 2007 at 10:54:06AM +1000, skaller wrote: > On Sat, 2007-04-28 at 01:12 +0200, Oliver Bandel wrote: > > > So, I then checked my mboxlib and saw that it is quite slow, > > compared to what I expected ( expect! I did not tried it > > on my development machine because I have nomutt installed there) > > and even if native-code smuch faster, it's nevertheless slow... > > ...so I thought I have to redesign my scanner-stage. > > (I use Str-module and ocamnllex mixed together; maybe > > using a plain selfwritten OCaml-scanner might be better here). > > Ocamllex generates very fast scanner: it is using > a very high-tech tagged deterministic finite state automaton > with a driver written in C (so no boxing etc processing > text buffers). I doubt you can hand code anything as > fast as Ocamllex in C, let alone in Ocaml. I know that ocamllexis fast. But I call ocamllex many many times from my own functions, and this maybe could be done more elegant / with less calls toocamllex, or maybe I should not lex directly from the channel and better read in a bigger chunk of data into memory and then lex on that. Or maybe I should first scan the whole header and then the body for each mail, and only afterwards scan again the header into seperated lines, when it is already in the RAM. > > You should check the size (number of states) of the generated > lexer. How? > It will run faster with small number of states where > the matrix fits easily in the cache. I think that tehere are not so much states, but so many calls. And maybe creating a list of header-entreies is faster than creating strings with buffer module, because I always call Buffer.add_string and so on and so on, instead of puttng the line onto alist. For the about 100MB mbox there are 2.5 * 10^6 calls to to Buffer.add_string for the header and 1.6 * 10^6 calls to Buffer.add_string for the body, 2.6*10^6 calls to the function lexing.engine, ... I better should not read linewise, it seems. And there are maybe other problems, why it might be slow. I let the lexer read in linewise and count the line-number. That is, because I throw an exception, when I detect a broken mbox file (when a mbox-file ends in the middle of a header). So maybe I do too much and to often. I think there are tooo many calls, not too much states of the lexer. (But you could argue that both things are closely related). Ciao, Oliver ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] mboxlib reloaded ;-) 2007-04-28 10:47 ` Oliver Bandel @ 2007-04-28 10:54 ` Gabriel Kerneis 2007-04-28 11:44 ` Oliver Bandel 0 siblings, 1 reply; 22+ messages in thread From: Gabriel Kerneis @ 2007-04-28 10:54 UTC (permalink / raw) To: caml-list [-- Attachment #1: Type: text/plain, Size: 276 bytes --] Le Sat, 28 Apr 2007 12:47:47 +0200, Oliver Bandel <oliver@first.in-berlin.de> a écrit : > > You should check the size (number of states) of the generated > > lexer. > > How? It's printed out by ocamllex when you run it on you .mll file. Regards, -- Gabriel [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] mboxlib reloaded ;-) 2007-04-28 10:54 ` Gabriel Kerneis @ 2007-04-28 11:44 ` Oliver Bandel 2007-04-28 13:49 ` skaller 0 siblings, 1 reply; 22+ messages in thread From: Oliver Bandel @ 2007-04-28 11:44 UTC (permalink / raw) To: caml-list On Sat, Apr 28, 2007 at 12:54:53PM +0200, Gabriel Kerneis wrote: > Le Sat, 28 Apr 2007 12:47:47 +0200, Oliver Bandel > <oliver@first.in-berlin.de> a écrit : > > > You should check the size (number of states) of the generated > > > lexer. > > > > How? > > It's printed out by ocamllex when you run it on you .mll file. > Regards, Ah, ok. :) 18 states, 261 transitions, table size 1152 bytes. Does not loooks very huge ;-) Ciao, Oliver ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] mboxlib reloaded ;-) 2007-04-28 11:44 ` Oliver Bandel @ 2007-04-28 13:49 ` skaller 2007-04-28 14:18 ` Oliver Bandel 0 siblings, 1 reply; 22+ messages in thread From: skaller @ 2007-04-28 13:49 UTC (permalink / raw) To: Oliver Bandel; +Cc: caml-list On Sat, 2007-04-28 at 13:44 +0200, Oliver Bandel wrote: > On Sat, Apr 28, 2007 at 12:54:53PM +0200, Gabriel Kerneis wrote: > > Le Sat, 28 Apr 2007 12:47:47 +0200, Oliver Bandel > > <oliver@first.in-berlin.de> a écrit : > > > > You should check the size (number of states) of the generated > > > > lexer. > > > > > > How? > > > > It's printed out by ocamllex when you run it on you .mll file. > > Regards, > > Ah, ok. :) > > > 18 states, 261 transitions, table size 1152 bytes. > > Does not loooks very huge ;-) Lol, no it is tiny. You are probably right, too many calls, and too much copying data around. AFAIK Ocaml channels also add an extra buffer layer (is that right?) so there's even more copying. Still, although Ocaml may generate more code than C, if your code is reasonably tight it should be cached and be fast: function calls are actually quite cheap. Here's an idea: you said: "For the about 100MB mbox there are 2.5 * 10^6 calls to to Buffer.add_string for the header and 1.6 * 10^6 calls to Buffer.add_string for the body, 2.6*10^6 calls to the function lexing.engine, ..." How about NOT storing the body text. Instead, just store the integer file offset of the first byte and the length? Not sure what you application is doing; perhaps that would work for you? -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] mboxlib reloaded ;-) 2007-04-28 13:49 ` skaller @ 2007-04-28 14:18 ` Oliver Bandel 2007-04-29 10:45 ` Richard Jones 0 siblings, 1 reply; 22+ messages in thread From: Oliver Bandel @ 2007-04-28 14:18 UTC (permalink / raw) To: caml-list On Sat, Apr 28, 2007 at 11:49:51PM +1000, skaller wrote: > On Sat, 2007-04-28 at 13:44 +0200, Oliver Bandel wrote: > > On Sat, Apr 28, 2007 at 12:54:53PM +0200, Gabriel Kerneis wrote: > > > Le Sat, 28 Apr 2007 12:47:47 +0200, Oliver Bandel > > > <oliver@first.in-berlin.de> a écrit : > > > > > You should check the size (number of states) of the generated > > > > > lexer. > > > > > > > > How? > > > > > > It's printed out by ocamllex when you run it on you .mll file. > > > Regards, > > > > Ah, ok. :) > > > > > > 18 states, 261 transitions, table size 1152 bytes. > > > > Does not loooks very huge ;-) > > Lol, no it is tiny. You are probably right, too many calls, > and too much copying data around. AFAIK Ocaml channels also > add an extra buffer layer (is that right?) so there's even > more copying. > > Still, although Ocaml may generate more code than C, > if your code is reasonably tight it should be cached > and be fast: function calls are actually quite cheap. > > Here's an idea: you said: > > "For the about 100MB mbox there are 2.5 * 10^6 calls to > to Buffer.add_string for the header and 1.6 * 10^6 calls > to Buffer.add_string for the body, 2.6*10^6 calls to the > function lexing.engine, ..." > > How about NOT storing the body text. Instead, just store > the integer file offset of the first byte and the length? [...] This is what I have foreseen for different kinds of application. In the parts I have implemented already, I use completely reading of the data, because this would be the most general way of handling. I will be able to do fulltext-research and possibly very complex analysis of the mailtext, and this is the reason why I read all things into memory. To use file-oiffsets would make sense, when the main functionality would be that of a mailreader, where you might look into the one or the other mail. Then reading from disk is not a problem. But when you want to be able to do a lot of different things, then always again and again reading from disk might be much slower than once reading data into memory and then working on it. But maybe I provide such a functionality also, and only read to memory, when I need it. The main intend of the lib (ys, a library, not a certain application) was to have a tool at hand, that makes complex things achievable easy. Example: ========================================================= open Mbox let filename = "./MB1" let _ = print_endline "OK, let's start!"; let sm = read filename in let mymatch = match_header_regexp ".*richard" NoCase in let result = filter_rename mymatch sm "./MB1.ready" in write_force result; print_endline "OK, ready! :)" ========================================================= Opens the mbox-file "MB1" and writes all mails with the "richard"-string in the header to the mbox-file "MB1.ready". It could also be used for spam-detection, word-counts, textanalysis, automatic rearranging of mbox-files, throwing away multiple mails, or saving the same mail in multiple folders, if they belong to more than one theme. Or doing complex data analysis on the mails. E.g. for statistical reasons of text-analysis, maybe similarity-calculations of texts, ... ocamlc: 31.4 seconds ocamopt: 7.7 seconds Doing an exit after reading-stage: ocamlc: 7.0 seconds ocamlopt: 2.5 seconds So, the pattern-matching (using Str-module) takes much more time than the reading-/scanning. It's about 100 MB mbox with mostly short mails. (I could provide more statistical infos, using this lib :)) Doing a $ time cat MB1 > MB2 needs 0.2 seconds $ time grep -i richard MB1 > MB2 needs 0.148 seconds. OK, such flat-file applications are always faster, as they do not read the data to memory and they do no checks on the validity of the files (no exception on broken mbox-files). But maybe it could be done better. Parsing-stage: Str-module not seems to be the fastest, and ocamllex creats ml-files that first have to be compiled.... Is there no certain library in OCaml, which can be used? Or do it have to be developed? I think on such a kind of thing: http://swtch.com/~rsc/regexp/regexp1.html A thing like a runtime-ocamllex, that creates datastructures instead of ocaml-code, would make sense, IMHO. IS No such thing available already? Ciao, Oliver ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] mboxlib reloaded ;-) 2007-04-28 14:18 ` Oliver Bandel @ 2007-04-29 10:45 ` Richard Jones 2007-04-29 15:41 ` Oliver Bandel 2007-05-01 10:56 ` [Caml-list] mboxlib reloaded ;-) Oliver Bandel 0 siblings, 2 replies; 22+ messages in thread From: Richard Jones @ 2007-04-29 10:45 UTC (permalink / raw) To: Oliver Bandel; +Cc: caml-list On Sat, Apr 28, 2007 at 04:18:12PM +0200, Oliver Bandel wrote: > The main intend of the lib (ys, a library, not a certain > application) was to have a tool at hand, that makes > complex things achievable easy. [...] > let mymatch = match_header_regexp ".*richard" NoCase [...] > Opens the mbox-file "MB1" and writes > all mails with the "richard"-string in the header > to the mbox-file "MB1.ready". There's a whole lots of issues that this doesn't cover. For example, what about internationalised headers using the (rather crazy) syntax defined in RFC 2047? >From the RFC: From: =?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu> To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk> CC: =?ISO-8859-1?Q?Andr=E9?= Pirard <PIRARD@vm1.ulg.ac.be> Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?= =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?= Rich. -- Richard Jones Red Hat ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] mboxlib reloaded ;-) 2007-04-29 10:45 ` Richard Jones @ 2007-04-29 15:41 ` Oliver Bandel 2007-04-29 18:51 ` Robert Roessler 2007-05-01 10:56 ` [Caml-list] mboxlib reloaded ;-) Oliver Bandel 1 sibling, 1 reply; 22+ messages in thread From: Oliver Bandel @ 2007-04-29 15:41 UTC (permalink / raw) To: caml-list On Sun, Apr 29, 2007 at 11:45:11AM +0100, Richard Jones wrote: > On Sat, Apr 28, 2007 at 04:18:12PM +0200, Oliver Bandel wrote: > > The main intend of the lib (ys, a library, not a certain > > application) was to have a tool at hand, that makes > > complex things achievable easy. > [...] > > let mymatch = match_header_regexp ".*richard" NoCase > [...] > > Opens the mbox-file "MB1" and writes > > all mails with the "richard"-string in the header > > to the mbox-file "MB1.ready". > > There's a whole lots of issues that this doesn't cover. For example, > what about internationalised headers using the (rather crazy) syntax > defined in RFC 2047? Aha, ok, you say there must be UTF-8 compatibility.... ...well I have not thought about this... well... ...good hint! ...any bindings for UTF-8-libs for OCaml avialable somewhere? Or other necessary things? Maybe using already available C-libs and only add an OCaml-binding might make sense? Any idea on that? Ciao, Oliver ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] mboxlib reloaded ;-) 2007-04-29 15:41 ` Oliver Bandel @ 2007-04-29 18:51 ` Robert Roessler 2007-05-01 11:00 ` camomile-problem (Re: [Caml-list] mboxlib reloaded ;-)) Oliver Bandel 0 siblings, 1 reply; 22+ messages in thread From: Robert Roessler @ 2007-04-29 18:51 UTC (permalink / raw) To: Caml-list Oliver Bandel wrote: > ... > Aha, ok, you say there must be UTF-8 compatibility.... > ...well I have not thought about this... well... > ...good hint! > > > ...any bindings for UTF-8-libs for OCaml avialable somewhere? > Or other necessary things? There is always the [LGPL] http://camomile.sourceforge.net/ "Camomile is a comprehensive Unicode library for objective caml (a. k. a. OCaml or O'Caml) language. Camomile provides Unicode character type, UTF-8, UTF-16, UTF-32 strings, conversion to/from about 200 encodings, collation and locale-sensitive case mappings, and more." Robert Roessler roessler@rftp.com http://www.rftp.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* camomile-problem (Re: [Caml-list] mboxlib reloaded ;-)) 2007-04-29 18:51 ` Robert Roessler @ 2007-05-01 11:00 ` Oliver Bandel 0 siblings, 0 replies; 22+ messages in thread From: Oliver Bandel @ 2007-05-01 11:00 UTC (permalink / raw) To: Caml-list On Sun, Apr 29, 2007 at 11:51:00AM -0700, Robert Roessler wrote: > Oliver Bandel wrote: > >... > >Aha, ok, you say there must be UTF-8 compatibility.... > >...well I have not thought about this... well... > >...good hint! > > > > > >...any bindings for UTF-8-libs for OCaml avialable somewhere? > >Or other necessary things? > > There is always the [LGPL] > > http://camomile.sourceforge.net/ > > "Camomile is a comprehensive Unicode library for objective caml (a. k. > a. OCaml or O'Caml) language. Camomile provides Unicode character > type, UTF-8, UTF-16, UTF-32 strings, conversion to/from about 200 > encodings, collation and locale-sensitive case mappings, and more." Ah, fine. :) But not so fine is: ============================================================ [...] public/uText.mli:50:38: missing terminating ' character public/uText.mli:51:35: missing terminating ' character public/uText.mli:67:29: missing terminating ' character In file included from public/main.mlp:46: public/xString.mli:33:29: missing terminating ' character make: *** [public/main.ml] Error 1 rm internal/uReStrLexer.ml internal/uReStrParser.ml first:~/Desktop/Camino-DOWNLOADS/camomile-0.7.1 oliver$ ============================================================ I use ocaml 3.09.3. Ciao, Oliver ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] mboxlib reloaded ;-) 2007-04-29 10:45 ` Richard Jones 2007-04-29 15:41 ` Oliver Bandel @ 2007-05-01 10:56 ` Oliver Bandel 1 sibling, 0 replies; 22+ messages in thread From: Oliver Bandel @ 2007-05-01 10:56 UTC (permalink / raw) To: caml-list On Sun, Apr 29, 2007 at 11:45:11AM +0100, Richard Jones wrote: > On Sat, Apr 28, 2007 at 04:18:12PM +0200, Oliver Bandel wrote: > > The main intend of the lib (ys, a library, not a certain > > application) was to have a tool at hand, that makes > > complex things achievable easy. > [...] > > let mymatch = match_header_regexp ".*richard" NoCase > [...] > > Opens the mbox-file "MB1" and writes > > all mails with the "richard"-string in the header > > to the mbox-file "MB1.ready". > > There's a whole lots of issues that this doesn't cover. For example, I want to provide an nterface, that the above things will be possible. > what about internationalised headers using the (rather crazy) syntax > defined in RFC 2047? This should be done transparently. So, the above should work, but I also had to add the things you mentioned (nternationalized headers). Ciao, Oliver ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] mboxlib reloaded ;-) 2007-04-27 23:12 ` Oliver Bandel 2007-04-28 0:54 ` skaller @ 2007-04-28 7:56 ` Richard Jones 2007-04-28 10:58 ` Oliver Bandel 2007-09-24 18:22 ` ocamllex speed [was Re: [Caml-list] mboxlib reloaded ;-)] Bruno De Fraine 2 siblings, 1 reply; 22+ messages in thread From: Richard Jones @ 2007-04-28 7:56 UTC (permalink / raw) To: Oliver Bandel; +Cc: caml-list On Sat, Apr 28, 2007 at 01:12:20AM +0200, Oliver Bandel wrote: > On Fri, Apr 27, 2007 at 05:29:11PM +0100, Richard Jones wrote: > > The source for COCANWIKI[1] contains extensive support for threading > > of mail messages, based on JWZ's algorithm: > > > > http://www.jwz.org/doc/threading.html > > Nice... you speak of an optimized algorithm for threading. > I didn't explored your solution nor did I explored your > paper in detail (tomorrow I think I have the time to do it), I should point out that the algorithm is due to esteemed hacker[1] Jamie Zawinski (http://jwz.org) who used it in Netscape versions 1 through 3. They got C++/OO group-think disease from Netscape 4 and above (the rot continues to this day). Rich. [1] Perhaps not as esteemed as Xavier L though. -- Richard Jones Red Hat ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Caml-list] mboxlib reloaded ;-) 2007-04-28 7:56 ` Richard Jones @ 2007-04-28 10:58 ` Oliver Bandel [not found] ` <20070429103911.GA30510@furbychan.cocan.org> 0 siblings, 1 reply; 22+ messages in thread From: Oliver Bandel @ 2007-04-28 10:58 UTC (permalink / raw) To: caml-list On Sat, Apr 28, 2007 at 08:56:28AM +0100, Richard Jones wrote: > On Sat, Apr 28, 2007 at 01:12:20AM +0200, Oliver Bandel wrote: > > On Fri, Apr 27, 2007 at 05:29:11PM +0100, Richard Jones wrote: > > > The source for COCANWIKI[1] contains extensive support for threading > > > of mail messages, based on JWZ's algorithm: > > > > > > http://www.jwz.org/doc/threading.html > > > > Nice... you speak of an optimized algorithm for threading. > > I didn't explored your solution nor did I explored your > > paper in detail (tomorrow I think I have the time to do it), > > I should point out that the algorithm is due to esteemed hacker[1] > Jamie Zawinski (http://jwz.org) who used it in Netscape versions 1 > through 3. They got C++/OO group-think disease from Netscape 4 and > above (the rot continues to this day). When I think about netscape, I think about the terrible way, how it handles newsgroups.... ...when doing a resort of the newsgroup-entries it's extremely slow. Maybe it again checks data on the newsserver again and again. And when I think about Netscape, I se that it reloads data via network, when you want to save it, even if it' already on the screen. So, theese things are the reason, why I don't see why this algorithm is something special, until I have seen that it really is fast. ;-) I have thought a while about the threading, and thought I have some good ideas. A while later, I saw, there already is a datastructure that is useful for threading, and it is the Tries-datastructure: http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Trie/ If you use the message ID's instead of the char's of a word, then this datastructure is well suited for news-/mail-threading, I think. So, other people already had done, what can be used here. But maybe thze algorithm you show on your pages is really good. But I would trust your algorithms more than algorithms of netscape-programmers ;-) But maybe it's really good and only the overall-design of Netscape was bad. ;-) Ciao, Oliver ^ permalink raw reply [flat|nested] 22+ messages in thread
[parent not found: <20070429103911.GA30510@furbychan.cocan.org>]
* Re: [Caml-list] mboxlib reloaded ;-) [not found] ` <20070429103911.GA30510@furbychan.cocan.org> @ 2007-04-29 15:43 ` Oliver Bandel 0 siblings, 0 replies; 22+ messages in thread From: Oliver Bandel @ 2007-04-29 15:43 UTC (permalink / raw) To: caml-list On Sun, Apr 29, 2007 at 11:39:11AM +0100, Richard Jones wrote: > On Sat, Apr 28, 2007 at 12:58:16PM +0200, Oliver Bandel wrote: > > On Sat, Apr 28, 2007 at 08:56:28AM +0100, Richard Jones wrote: > > > On Sat, Apr 28, 2007 at 01:12:20AM +0200, Oliver Bandel wrote: > > > > On Fri, Apr 27, 2007 at 05:29:11PM +0100, Richard Jones wrote: > > > > > The source for COCANWIKI[1] contains extensive support for threading > > > > > of mail messages, based on JWZ's algorithm: > > > > > > > > > > http://www.jwz.org/doc/threading.html > > > > > > > > Nice... you speak of an optimized algorithm for threading. > > > > I didn't explored your solution nor did I explored your > > > > paper in detail (tomorrow I think I have the time to do it), > > > > > > I should point out that the algorithm is due to esteemed hacker[1] > > > Jamie Zawinski (http://jwz.org) who used it in Netscape versions 1 > > > through 3. They got C++/OO group-think disease from Netscape 4 and > > > above (the rot continues to this day). > > > > When I think about netscape, I think about the terrible way, > > how it handles newsgroups.... > > ...when doing a resort of the newsgroup-entries it's extremely slow. > > That's my point - Netscape <= 3 was very fast. > Ah, oh,OK, I see.... you wrote versions 1 through 3 were fast, and these versions used the algorithm you talked about... right? Ciao, Oliver ^ permalink raw reply [flat|nested] 22+ messages in thread
* ocamllex speed [was Re: [Caml-list] mboxlib reloaded ;-)] 2007-04-27 23:12 ` Oliver Bandel 2007-04-28 0:54 ` skaller 2007-04-28 7:56 ` Richard Jones @ 2007-09-24 18:22 ` Bruno De Fraine 2007-09-24 19:54 ` Alain Frisch ` (2 more replies) 2 siblings, 3 replies; 22+ messages in thread From: Bruno De Fraine @ 2007-09-24 18:22 UTC (permalink / raw) To: Caml-list ml; +Cc: Oliver Bandel Hello, On 28 Apr 2007, at 01:12, Oliver Bandel wrote: > So, I then checked my mboxlib and saw that it is quite slow, > compared to what I expected ( expect! I did not tried it > on my development machine because I have nomutt installed there) > and even if native-code smuch faster, it's nevertheless slow... > ...so I thought I have to redesign my scanner-stage. > (I use Str-module and ocamnllex mixed together; maybe > using a plain selfwritten OCaml-scanner might be better here). I don't know if Oliver ever got to the bottom of this speed problem, but, I also noticed ocamllex can be quite slow for simple scanning. For example, I used this ocamllex source: { } rule translate = parse | "current_directory" { print_endline (Sys.getcwd ()); translate lexbuf } | _ { translate lexbuf } | eof { () } { for i = 1 to (Array.length Sys.argv - 1); do translate (Lexing.from_channel (open_in Sys.argv.(i))) done ;; } And compared it against this version using the Str module: let re = Str.regexp_string "current_directory" ;; for i = 1 to (Array.length Sys.argv - 1); do let ch = open_in Sys.argv.(i) in try while true; do let line = input_line ch in try let _ = Str.search_forward re line 0 in print_endline (Sys.getcwd ()) with Not_found -> () done with End_of_file -> close_in ch done ;; Neither version does anything useful, except print the current directory when it encounters the string "current_directory". I tested this on a 57M text file (that has only a few "current_directory" occurrences). The ocamllex-version takes about 3.5s, while the Str- version takes only 0.35s. What causes this difference? Perhaps there is a high overhead in calling the translate function for every input character in such big input files, but I don't know how this can be avoided. Thanks, Bruno ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: ocamllex speed [was Re: [Caml-list] mboxlib reloaded ;-)] 2007-09-24 18:22 ` ocamllex speed [was Re: [Caml-list] mboxlib reloaded ;-)] Bruno De Fraine @ 2007-09-24 19:54 ` Alain Frisch 2007-09-25 8:53 ` Bruno De Fraine 2007-09-24 22:06 ` skaller 2007-09-27 5:26 ` Chris King 2 siblings, 1 reply; 22+ messages in thread From: Alain Frisch @ 2007-09-24 19:54 UTC (permalink / raw) To: Bruno De Fraine; +Cc: Caml-list ml Bruno De Fraine wrote: > Neither version does anything useful, except print the current directory > when it encounters the string "current_directory". I tested this on a > 57M text file (that has only a few "current_directory" occurrences). The > ocamllex-version takes about 3.5s, while the Str-version takes only > 0.35s. What causes this difference? Perhaps there is a high overhead in > calling the translate function for every input character in such big > input files, but I don't know how this can be avoided. Did you try to compile the same programs with the native compiler? -- Alain ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: ocamllex speed [was Re: [Caml-list] mboxlib reloaded ;-)] 2007-09-24 19:54 ` Alain Frisch @ 2007-09-25 8:53 ` Bruno De Fraine 0 siblings, 0 replies; 22+ messages in thread From: Bruno De Fraine @ 2007-09-25 8:53 UTC (permalink / raw) To: Alain Frisch; +Cc: Caml-list ml On 24 Sep 2007, at 21:54, Alain Frisch wrote: > Bruno De Fraine wrote: >> Neither version does anything useful, except print the current >> directory when it encounters the string "current_directory". I >> tested this on a 57M text file (that has only a few >> "current_directory" occurrences). The ocamllex-version takes about >> 3.5s, while the Str-version takes only 0.35s. What causes this >> difference? Perhaps there is a high overhead in calling the >> translate function for every input character in such big input >> files, but I don't know how this can be avoided. > > Did you try to compile the same programs with the native compiler? Yes, the timings were for the ocamlopt compiler. Following skaller's suggestion, I also tried: rule translate = parse | [^ ' ' '\n' '\t']+ as word { if word = "current_directory" then print_endline (Sys.getcwd ()); translate lexbuf } | _ { translate lexbuf } | eof { () } This brings it down to 1.87s. Of course, only occurrences of "current_directory" that are white-space separated are picked up. Regards, Bruno ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: ocamllex speed [was Re: [Caml-list] mboxlib reloaded ;-)] 2007-09-24 18:22 ` ocamllex speed [was Re: [Caml-list] mboxlib reloaded ;-)] Bruno De Fraine 2007-09-24 19:54 ` Alain Frisch @ 2007-09-24 22:06 ` skaller 2007-09-27 5:26 ` Chris King 2 siblings, 0 replies; 22+ messages in thread From: skaller @ 2007-09-24 22:06 UTC (permalink / raw) To: Bruno De Fraine; +Cc: Caml-list ml, Oliver Bandel On Mon, 2007-09-24 at 20:22 +0200, Bruno De Fraine wrote: > { } > rule translate = parse > | "current_directory" { print_endline (Sys.getcwd ()); translate > lexbuf } > | _ { translate lexbuf } > | eof { () } > { > for i = 1 to (Array.length Sys.argv - 1); do > translate (Lexing.from_channel (open_in Sys.argv.(i))) > done ;; > } > > What causes this difference? Perhaps there > is a high overhead in calling the translate function for every input > character in such big input files, but I don't know how this can be > avoided. Easily. Rewrite the lexer to | [a-zA-z_]+ { if lexeme = "current_directory" ... } | _ { translate lexbuf } Then it will eat whole words instead of characters, which should make it about 5 times faster. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: ocamllex speed [was Re: [Caml-list] mboxlib reloaded ;-)] 2007-09-24 18:22 ` ocamllex speed [was Re: [Caml-list] mboxlib reloaded ;-)] Bruno De Fraine 2007-09-24 19:54 ` Alain Frisch 2007-09-24 22:06 ` skaller @ 2007-09-27 5:26 ` Chris King 2 siblings, 0 replies; 22+ messages in thread From: Chris King @ 2007-09-27 5:26 UTC (permalink / raw) To: Bruno De Fraine; +Cc: Caml-list ml, Oliver Bandel On 9/24/07, Bruno De Fraine <Bruno.De.Fraine@vub.ac.be> wrote: > I don't know if Oliver ever got to the bottom of this speed problem, > but, I also noticed ocamllex can be quite slow for simple scanning. FWIW, regexps in the Str module are interpreted in C. - Chris ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2007-09-27 5:26 UTC | newest] Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2007-04-27 13:54 mboxlib reloaded ;-) Oliver Bandel 2007-04-27 16:29 ` [Caml-list] " Richard Jones 2007-04-27 23:12 ` Oliver Bandel 2007-04-28 0:54 ` skaller 2007-04-28 10:47 ` Oliver Bandel 2007-04-28 10:54 ` Gabriel Kerneis 2007-04-28 11:44 ` Oliver Bandel 2007-04-28 13:49 ` skaller 2007-04-28 14:18 ` Oliver Bandel 2007-04-29 10:45 ` Richard Jones 2007-04-29 15:41 ` Oliver Bandel 2007-04-29 18:51 ` Robert Roessler 2007-05-01 11:00 ` camomile-problem (Re: [Caml-list] mboxlib reloaded ;-)) Oliver Bandel 2007-05-01 10:56 ` [Caml-list] mboxlib reloaded ;-) Oliver Bandel 2007-04-28 7:56 ` Richard Jones 2007-04-28 10:58 ` Oliver Bandel [not found] ` <20070429103911.GA30510@furbychan.cocan.org> 2007-04-29 15:43 ` Oliver Bandel 2007-09-24 18:22 ` ocamllex speed [was Re: [Caml-list] mboxlib reloaded ;-)] Bruno De Fraine 2007-09-24 19:54 ` Alain Frisch 2007-09-25 8:53 ` Bruno De Fraine 2007-09-24 22:06 ` skaller 2007-09-27 5:26 ` Chris King
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox