From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82])
	by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id q2GKBUcq013467
	for <caml-list@sympa-roc.inria.fr>; Fri, 16 Mar 2012 21:11:30 +0100
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AlUDAA6dY0/AbSoIe2dsb2JhbABChT6wfCIBARYmBCOCCQEBBSNWEAsJDwICJgICFBgxiB0Ep02SLBOBHI44M2MEjgCHZZMP
X-IronPort-AV: E=Sophos;i="4.73,598,1325458800"; 
   d="scan'208";a="149850837"
Received: from einhorn.in-berlin.de ([192.109.42.8])
  by mail1-smtp-roc.national.inria.fr with ESMTP/TLS/DHE-RSA-AES256-SHA; 16 Mar 2012 21:11:25 +0100
X-Envelope-From: oliver@first.in-berlin.de
Received: from first (e178043055.adsl.alicedsl.de [85.178.43.55])
	(authenticated bits=0)
	by einhorn.in-berlin.de (8.13.6/8.13.6/Debian-1) with ESMTP id q2GKBNVW029582
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT);
	Fri, 16 Mar 2012 21:11:24 +0100
Received: by first (Postfix, from userid 1000)
	id 78D341540144; Fri, 16 Mar 2012 21:11:23 +0100 (CET)
Date: Fri, 16 Mar 2012 21:11:23 +0100
From: oliver <oliver@first.in-berlin.de>
To: Philippe Veber <philippe.veber@gmail.com>
Cc: caml users <caml-list@inria.fr>
Message-ID: <20120316201123.GC21643@siouxsie>
References: <CAOOOohQV+nYWyTXFU-sKdLkJSNAAkZH+=UCkwKEpnxjAiE=ORg@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <CAOOOohQV+nYWyTXFU-sKdLkJSNAAkZH+=UCkwKEpnxjAiE=ORg@mail.gmail.com>
User-Agent: Mutt/1.5.20 (2009-06-14)
X-Scanned-By: MIMEDefang_at_IN-Berlin_e.V. on 192.109.42.8
Subject: Re: [Caml-list] Efficient scanning of large strings from files

On Fri, Mar 16, 2012 at 02:03:38PM +0100, Philippe Veber wrote:
> Dear camlers,
> 
> Say that you'd like to search a regexp on a file with lines so long that
> you'd rather not load them entirely at once. If you can bound the size of a
> match by k << length of a line, then you know that you can only keep a
> small portion of the line in memory to search the regexp. Typically you'd
> like to access substrings of size k from left to right. I guess such a
> thing should involve buffered inputs and avoid copying strings as much as
> possible. My question is as follows: has anybody written a library to
> access these substrings gracefully and with decent performance?
> Cheers,
>   Philippe.
[...]


I think, the regexp itself also has an impact on
how fast and/or how easy this can be achieved.

The more complex the Regexp, the more ressources
you will need.

If you can make your regexp becoming boult down to something
easy parseable, the length of lines might be of no importance.


Ciao,
   Oliver