Date: Fri, 16 Mar 2012 21:11:23 +0100
From: oliver
To: Philippe Veber
Cc: caml users
Message-ID: <20120316201123.GC21643@siouxsie>
Subject: Re: [Caml-list] Efficient scanning of large strings from files

On Fri, Mar 16, 2012 at 02:03:38PM +0100, Philippe Veber wrote:
> Dear camlers,
>
> Say that you'd like to search a regexp on a file with lines so long that
> you'd rather not load them entirely at once. If you can bound the size of a
> match by k << length of a line, then you know that you can only keep a
> small portion of the line in memory to search the regexp. Typically you'd
> like to access substrings of size k from left to right.
> I guess such a
> thing should involve buffered inputs and avoid copying strings as much as
> possible. My question is as follows: has anybody written a library to
> access these substrings gracefully and with decent performance?
> Cheers,
> Philippe.
[...]

I think the regexp itself also has an impact on how fast and/or how easily
this can be achieved. The more complex the regexp, the more resources you
will need. If you can boil your regexp down to something easily parseable,
the length of the lines might be of no importance.

Ciao,
   Oliver
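
For illustration, here is a minimal OCaml sketch of the bounded-window idea
from the quoted question, restricted to the simple case where the regexp boils
down to a literal string of length k. It reads fixed-size chunks and carries
the last k-1 bytes over to the next chunk, so a match that straddles a chunk
boundary is still found and memory use stays independent of the line length.
The names scan_stream, reader_of_string and the ~chunk parameter are
hypothetical, not taken from any existing library, and the naive comparison
loop stands in for whatever matcher would really be used.

```ocaml
(* Sketch only: report the stream offsets of every occurrence of the literal
   [pat]. [read buf pos len] must behave like [input ic buf pos len]: it
   stores at most [len] bytes at [pos] and returns 0 only at end of input. *)
let scan_stream ?(chunk = 65536) (read : Bytes.t -> int -> int -> int)
    (pat : string) : int list =
  let k = String.length pat in
  if k = 0 then invalid_arg "scan_stream: empty pattern";
  let keep = k - 1 in                      (* bytes carried between chunks *)
  let buf = Bytes.create (keep + chunk) in
  (* Naive comparison of [pat] against the buffer at position [i]. *)
  let matches_at i =
    let rec go j = j = k || (Bytes.get buf (i + j) = pat.[j] && go (j + 1)) in
    go 0
  in
  (* [filled]: valid bytes in [buf]; [base]: stream offset of byte 0 of [buf]. *)
  let rec loop filled base acc =
    let n = read buf filled (Bytes.length buf - filled) in
    if n = 0 then List.rev acc
    else begin
      let filled = filled + n in
      let acc = ref acc in
      (* Only start positions with k bytes available are tested here; the
         remaining ones are tested after the next read. *)
      for i = 0 to filled - k do
        if matches_at i then acc := (base + i) :: !acc
      done;
      (* Move the untested tail to the front of the buffer. *)
      let tail = min keep filled in
      Bytes.blit buf (filled - tail) buf 0 tail;
      loop tail (base + filled - tail) !acc
    end
  in
  loop 0 0 []

(* Hypothetical helper for trying the function without a file: serves the
   contents of a string through the same interface as [input]. *)
let reader_of_string (s : string) : Bytes.t -> int -> int -> int =
  let pos = ref 0 in
  fun buf off len ->
    let n = min len (String.length s - !pos) in
    Bytes.blit_string s !pos buf off n;
    pos := !pos + n;
    n
```

With a deliberately tiny chunk, scan_stream ~chunk:4 (reader_of_string
"xxabcxxabcabc") "abc" returns [2; 7; 10]; for a real file one would pass
(input ic) as the reader. A general regexp with matches bounded by k could
reuse the same carry-over scheme, but the matcher and the handling of matches
found in the overlap would have to be adapted.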