From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105]) by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id q2GH2MQD009802 for ; Fri, 16 Mar 2012 18:02:22 +0100 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AisCAMFxY09KfVM2kGdsb2JhbABChT6wcAgiAQEBAQkJDQcUBCOCCQEBAQMBEgIPHQEbHQEDAQsGBQQHNwICIQEBEQEFARwGExsHh2MFnBEKi0ROgnGFJj+IdAEFC4lIXoU2gRYEki+DN4sugxo9hCSBQA X-IronPort-AV: E=Sophos;i="4.73,598,1325458800"; d="scan'208";a="136455394" Received: from mail-ee0-f54.google.com ([74.125.83.54]) by mail4-smtp-sop.national.inria.fr with ESMTP/TLS/RC4-SHA; 16 Mar 2012 18:02:17 +0100 Received: by eekd17 with SMTP id d17so3185264eek.27 for ; Fri, 16 Mar 2012 10:02:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=g0kaxVu5EUwwj2KL2CyWQF9LQRuTlQzi66WzFRimcz0=; b=O35QGtfo1msLJfumuv382dvPWyWO53vpAAeyDzzTE1+pgBNVAy5pvc4vLurOonG+ui VVNNBHavinvbalSidrjDMZW2dmxe1cymr/Mo8kjG0ui4cHijkTbPxI+BW4H4rwxjfLOW kHgss8XAlPK42m6Vr30wc+Bmf+1mqdZLFLDAODNmnZcO38nvlP5V7gvEBbYJPOMtQ/cA atEFtkGjfV68Wx28bER7pxQ8JAfmxDJ6z5Xvn2nLODyXZqh9GIeAEWWIwEJUw9V1I2PS 4pWr5UGmVBQQoftFe4x7AScBUvCeh3EzhGzHwJ/QlFpImVDiBTRjCQL7EUYG38hSFYOc sKFg== MIME-Version: 1.0 Received: by 10.213.105.207 with SMTP id u15mr207936ebo.159.1331917336749; Fri, 16 Mar 2012 10:02:16 -0700 (PDT) Received: by 10.213.26.19 with HTTP; Fri, 16 Mar 2012 10:02:16 -0700 (PDT) In-Reply-To: References: <4F634AD1.20402@gmail.com> Date: Fri, 16 Mar 2012 13:02:16 -0400 Message-ID: From: Edgar Friendly To: Philippe Veber Cc: "caml-list@inria.fr" Content-Type: multipart/alternative; boundary=0015174c366eba27b704bb5f2b7b Subject: Re: [Caml-list] Efficient scanning of large strings from files --0015174c366eba27b704bb5f2b7b Content-Type: text/plain; charset=UTF-8 So given a large file and a line number, you want to: 1) extract that line from the file 2) produce an enum of all k-length slices of that line? 3) match each slice against your regexp set to produce a list/enum of substrings that match the regexps? Without reading the whole line into memory at once. I'm with Dimino on the right solution - just use a matcher that that works incrementally, feed it one byte at a time, and have it return a list of match offsets. Then work backwards from these endpoints to figure out which substrings you want. There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it should suffice to use (0,k-1) and (k,2k-1) with an incremental matching routine. E. On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber wrote: > Thank you Edgar for your answer (and also Christophe). It seems my > question was a bit misleading: actually I target a subset of regexps whose > matching is really trivial, so this is no worry for me. I was more > interested in how accessing a large line in a file by chunks of fixed > length k. For instance how to build a [Substring.t Enum.t] from some line > in a file, without building the whole line in memory. This enum would yield > the substrings (0,k-1), (1,k), (2,k+1), etc ... without doing too many > string copy/concat operations. I think I can do it myself but I'm not too > confident regarding good practices on buffered reads of files. Maybe there > are some good examples in Batteries? > > Thanks again, > ph. > > > --0015174c366eba27b704bb5f2b7b Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable So given a large file and a line number, you want to:
1) extract that li= ne from the file
2) produce an enum of all k-length slices of that line?=
3) match each slice against your regexp set to produce a list/enum of s= ubstrings that match the regexps?
Without reading the whole line into memory at once.=C2=A0

I'm w= ith Dimino on the right solution - just use a matcher that that works incre= mentally, feed it one byte at a time, and have it return a list of match of= fsets.=C2=A0 Then work backwards from these endpoints to figure out which s= ubstrings you want.

There shouldn't be a reason to use substrings (0,k-1) and (1,k) - i= t should suffice to use (0,k-1) and (k,2k-1) with an incremental matching r= outine.

E.


On Fri, Mar 16, 201= 2 at 10:48 AM, Philippe Veber <philippe.veber@gmail.com> wrote:
Thank you Edgar for your answer (and also Ch= ristophe). It seems my question was a bit misleading: actually I target a s= ubset of regexps whose matching is really trivial, so this is no worry for = me. I was more interested in how accessing a large line in a file by chunks= of fixed length k. For instance how to build a [Substring.t Enum.t] from s= ome line in a file, without building the whole line in memory. This enum wo= uld yield the substrings (0,k-1), (1,k), (2,k+1), etc ... without doing too= many string copy/concat operations. I think I can do it myself but I'm= not too confident regarding good practices on buffered reads of files. May= be there are some good examples in Batteries?

Thanks again,
=C2=A0 ph.



--0015174c366eba27b704bb5f2b7b--