From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105])
	by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id q2JDiOQ1030160
	for <caml-list@sympa-roc.inria.fr>; Mon, 19 Mar 2012 14:44:24 +0100
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Ar8GALg3Z0/RVdU2kGdsb2JhbABCgw2yKgKBCQgiAQEBAQkJDQcUBCOCCQEBAQMBDAYCHgEFCAEbHAEBAwELBgULDQkWCAcJAwIBAgEPAhEBBQELEQYNAQUCAQEXB4djBZxmCowSgnGEYj+IdAEFC4lKX4FmhGIEkjGDN4VthUKDGj2EJIFA
X-IronPort-AV: E=Sophos;i="4.73,611,1325458800"; 
   d="scan'208";a="136754248"
Received: from mail-yw0-f54.google.com ([209.85.213.54])
  by mail4-smtp-sop.national.inria.fr with ESMTP/TLS/RC4-SHA; 19 Mar 2012 14:44:19 +0100
Received: by mail-yw0-f54.google.com with SMTP id m50so8100890yhg.27
        for <caml-list@inria.fr>; Mon, 19 Mar 2012 06:44:19 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20120113;
        h=message-id:date:from:user-agent:mime-version:to:cc:subject
         :references:in-reply-to:content-type:content-transfer-encoding;
        bh=Q2j7MQyNwfNFRebEIiK/Cn9AKTsyg5CmBxAfWi4zg3E=;
        b=CXrHF/ZX55bB/l4ZDKW2HrZaDtKiNBXbgMMYSI3e9tLgEucEt8nZ34mvVVjEM+Z8lu
         lcZ3u5JIP8K7mmHuMtnftedS23mbFjqN1RcrD5cxd5m7n/T8BuRWlt16zi8BQlSdDhJW
         4HjlPfiYpBguWpIdz6blHqul6GHRZ0m11k0sCzCc4Ev57FyRxV2HNaIqZSFoGjxdsMHM
         DSMoZazQT7t6u6ZmlfMW4DCT7zlji4+d20Mi3lVl3433KThRD2q2s33RG4AwW7XHj76J
         3y4v+61D0ihP6HrzpCKU41dZkxfRb4KxyO9qNB9EjdN21DPxyCTRFHZ/5buvlVtOwMnM
         OAaA==
Received: by 10.60.22.138 with SMTP id d10mr5618401oef.69.1332164659418;
        Mon, 19 Mar 2012 06:44:19 -0700 (PDT)
Received: from [192.168.1.65] (99-121-78-10.lightspeed.lnngmi.sbcglobal.net. [99.121.78.10])
        by mx.google.com with ESMTPS id h7sm10040141oeh.9.2012.03.19.06.44.17
        (version=SSLv3 cipher=OTHER);
        Mon, 19 Mar 2012 06:44:18 -0700 (PDT)
Message-ID: <4F673830.4010907@gmail.com>
Date: Mon, 19 Mar 2012 09:44:16 -0400
From: Edgar Friendly <thelema314@gmail.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.2) Gecko/20120216 Thunderbird/10.0.2
MIME-Version: 1.0
To: Philippe Veber <philippe.veber@gmail.com>
CC: "caml-list@inria.fr" <caml-list@inria.fr>
References: <CAOOOohQV+nYWyTXFU-sKdLkJSNAAkZH+=UCkwKEpnxjAiE=ORg@mail.gmail.com> <4F634AD1.20402@gmail.com> <CAOOOohRVtAkiKGC=9GQNUEt=7d=z6kSOU=uDEc2ALZgFepWABg@mail.gmail.com> <CAL-jcAm5dfS0CEB=rCry=auP9FeyVM5Cy3PkNhoT4rX2AXX5rg@mail.gmail.com> <CAOOOohQ3e3LXk_qMmm__0XeWADjZAGqWmpS7UFGNyt-6bFeY3A@mail.gmail.com>
In-Reply-To: <CAOOOohQ3e3LXk_qMmm__0XeWADjZAGqWmpS7UFGNyt-6bFeY3A@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Subject: Re: [Caml-list] Efficient scanning of large strings from files

On 03/19/2012 05:08 AM, Philippe Veber wrote:
> Thanks Edgar and Jérémie, this indeed seems to be the right track. I
> just hope that a repeated use of input_char is not 10-100X slower than
> input_line :o).
> ph.
>
Quite true - instead of giving the matcher just a single byte at a time, 
it is more efficient to give it blocks of data, as long as it can keep 
its state from one block to the next.  But its matching internally will 
be on one byte at a time, normally.  I guess with DNA, because of the 
reduced character set, it'd be possible to get each symbol down to 2 
bits (if you're really just using ACGT), in which case, the matcher 
could run 4 basepairs at a time, but there's a lot of corner issues 
doing things that way.  A lot depends on how much time and effort you're 
willing to spend engineering something.

E.

> 2012/3/16 Edgar Friendly <thelema314@gmail.com
> <mailto:thelema314@gmail.com>>
>
>     So given a large file and a line number, you want to:
>     1) extract that line from the file
>     2) produce an enum of all k-length slices of that line?
>     3) match each slice against your regexp set to produce a list/enum
>     of substrings that match the regexps?
>     Without reading the whole line into memory at once.
>
>     I'm with Dimino on the right solution - just use a matcher that that
>     works incrementally, feed it one byte at a time, and have it return
>     a list of match offsets.  Then work backwards from these endpoints
>     to figure out which substrings you want.
>
>     There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it
>     should suffice to use (0,k-1) and (k,2k-1) with an incremental
>     matching routine.
>
>     E.
>
>
>
>     On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber
>     <philippe.veber@gmail.com <mailto:philippe.veber@gmail.com>> wrote:
>
>         Thank you Edgar for your answer (and also Christophe). It seems
>         my question was a bit misleading: actually I target a subset of
>         regexps whose matching is really trivial, so this is no worry
>         for me. I was more interested in how accessing a large line in a
>         file by chunks of fixed length k. For instance how to build a
>         [Substring.t Enum.t] from some line in a file, without building
>         the whole line in memory. This enum would yield the substrings
>         (0,k-1), (1,k), (2,k+1), etc ... without doing too many string
>         copy/concat operations. I think I can do it myself but I'm not
>         too confident regarding good practices on buffered reads of
>         files. Maybe there are some good examples in Batteries?
>
>         Thanks again,
>            ph.
>
>
>
>