From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105])
	by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id q2GH2MQD009802
	for <caml-list@sympa-roc.inria.fr>; Fri, 16 Mar 2012 18:02:22 +0100
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AisCAMFxY09KfVM2kGdsb2JhbABChT6wcAgiAQEBAQkJDQcUBCOCCQEBAQMBEgIPHQEbHQEDAQsGBQQHNwICIQEBEQEFARwGExsHh2MFnBEKi0ROgnGFJj+IdAEFC4lIXoU2gRYEki+DN4sugxo9hCSBQA
X-IronPort-AV: E=Sophos;i="4.73,598,1325458800"; 
   d="scan'208";a="136455394"
Received: from mail-ee0-f54.google.com ([74.125.83.54])
  by mail4-smtp-sop.national.inria.fr with ESMTP/TLS/RC4-SHA; 16 Mar 2012 18:02:17 +0100
Received: by eekd17 with SMTP id d17so3185264eek.27
        for <caml-list@inria.fr>; Fri, 16 Mar 2012 10:02:16 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20120113;
        h=mime-version:in-reply-to:references:date:message-id:subject:from:to
         :cc:content-type;
        bh=g0kaxVu5EUwwj2KL2CyWQF9LQRuTlQzi66WzFRimcz0=;
        b=O35QGtfo1msLJfumuv382dvPWyWO53vpAAeyDzzTE1+pgBNVAy5pvc4vLurOonG+ui
         VVNNBHavinvbalSidrjDMZW2dmxe1cymr/Mo8kjG0ui4cHijkTbPxI+BW4H4rwxjfLOW
         kHgss8XAlPK42m6Vr30wc+Bmf+1mqdZLFLDAODNmnZcO38nvlP5V7gvEBbYJPOMtQ/cA
         atEFtkGjfV68Wx28bER7pxQ8JAfmxDJ6z5Xvn2nLODyXZqh9GIeAEWWIwEJUw9V1I2PS
         4pWr5UGmVBQQoftFe4x7AScBUvCeh3EzhGzHwJ/QlFpImVDiBTRjCQL7EUYG38hSFYOc
         sKFg==
MIME-Version: 1.0
Received: by 10.213.105.207 with SMTP id u15mr207936ebo.159.1331917336749;
 Fri, 16 Mar 2012 10:02:16 -0700 (PDT)
Received: by 10.213.26.19 with HTTP; Fri, 16 Mar 2012 10:02:16 -0700 (PDT)
In-Reply-To: <CAOOOohRVtAkiKGC=9GQNUEt=7d=z6kSOU=uDEc2ALZgFepWABg@mail.gmail.com>
References: <CAOOOohQV+nYWyTXFU-sKdLkJSNAAkZH+=UCkwKEpnxjAiE=ORg@mail.gmail.com>
	<4F634AD1.20402@gmail.com>
	<CAOOOohRVtAkiKGC=9GQNUEt=7d=z6kSOU=uDEc2ALZgFepWABg@mail.gmail.com>
Date: Fri, 16 Mar 2012 13:02:16 -0400
Message-ID: <CAL-jcAm5dfS0CEB=rCry=auP9FeyVM5Cy3PkNhoT4rX2AXX5rg@mail.gmail.com>
From: Edgar Friendly <thelema314@gmail.com>
To: Philippe Veber <philippe.veber@gmail.com>
Cc: "caml-list@inria.fr" <caml-list@inria.fr>
Content-Type: multipart/alternative; boundary=0015174c366eba27b704bb5f2b7b
Subject: Re: [Caml-list] Efficient scanning of large strings from files


--0015174c366eba27b704bb5f2b7b
Content-Type: text/plain; charset=UTF-8

So given a large file and a line number, you want to:
1) extract that line from the file
2) produce an enum of all k-length slices of that line?
3) match each slice against your regexp set to produce a list/enum of
substrings that match the regexps?
Without reading the whole line into memory at once.

I'm with Dimino on the right solution - just use a matcher that that works
incrementally, feed it one byte at a time, and have it return a list of
match offsets.  Then work backwards from these endpoints to figure out
which substrings you want.

There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it should
suffice to use (0,k-1) and (k,2k-1) with an incremental matching routine.

E.


On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber
<philippe.veber@gmail.com>wrote:

> Thank you Edgar for your answer (and also Christophe). It seems my
> question was a bit misleading: actually I target a subset of regexps whose
> matching is really trivial, so this is no worry for me. I was more
> interested in how accessing a large line in a file by chunks of fixed
> length k. For instance how to build a [Substring.t Enum.t] from some line
> in a file, without building the whole line in memory. This enum would yield
> the substrings (0,k-1), (1,k), (2,k+1), etc ... without doing too many
> string copy/concat operations. I think I can do it myself but I'm not too
> confident regarding good practices on buffered reads of files. Maybe there
> are some good examples in Batteries?
>
> Thanks again,
>   ph.
>
>
>

--0015174c366eba27b704bb5f2b7b
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

So given a large file and a line number, you want to:<br>1) extract that li=
ne from the file<br>2) produce an enum of all k-length slices of that line?=
<br>3) match each slice against your regexp set to produce a list/enum of s=
ubstrings that match the regexps?<br>
Without reading the whole line into memory at once.=C2=A0 <br><br>I&#39;m w=
ith Dimino on the right solution - just use a matcher that that works incre=
mentally, feed it one byte at a time, and have it return a list of match of=
fsets.=C2=A0 Then work backwards from these endpoints to figure out which s=
ubstrings you want.<br>
<br>There shouldn&#39;t be a reason to use substrings (0,k-1) and (1,k) - i=
t should suffice to use (0,k-1) and (k,2k-1) with an incremental matching r=
outine.<br><br>E.<br><br><br><div class=3D"gmail_quote">On Fri, Mar 16, 201=
2 at 10:48 AM, Philippe Veber <span dir=3D"ltr">&lt;<a href=3D"mailto:phili=
ppe.veber@gmail.com">philippe.veber@gmail.com</a>&gt;</span> wrote:<br>
<blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p=
x #ccc solid;padding-left:1ex">Thank you Edgar for your answer (and also Ch=
ristophe). It seems my question was a bit misleading: actually I target a s=
ubset of regexps whose matching is really trivial, so this is no worry for =
me. I was more interested in how accessing a large line in a file by chunks=
 of fixed length k. For instance how to build a [Substring.t Enum.t] from s=
ome line in a file, without building the whole line in memory. This enum wo=
uld yield the substrings (0,k-1), (1,k), (2,k+1), etc ... without doing too=
 many string copy/concat operations. I think I can do it myself but I&#39;m=
 not too confident regarding good practices on buffered reads of files. May=
be there are some good examples in Batteries?<br>


<br>Thanks again,<br>=C2=A0 ph.<br><br><br>
</blockquote></div><br>

--0015174c366eba27b704bb5f2b7b--