From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82]) by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id q2GEF6eT001510 for ; Fri, 16 Mar 2012 15:15:07 +0100 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: ArUCAIVKY09KfVM2kGdsb2JhbABCgw2zIQgiAQEBAQkJDQcUBCOCCQEBAQQSAiQIARseAwwGBQsNCSUPAhIRAQUBHAYBDAgBAR6HaJwrCowSgnGFIT+IdAEFC5ByBJVmhWyIXD2EJA X-IronPort-AV: E=Sophos;i="4.73,598,1325458800"; d="scan'208";a="149790384" Received: from mail-ee0-f54.google.com ([74.125.83.54]) by mail1-smtp-roc.national.inria.fr with ESMTP/TLS/RC4-SHA; 16 Mar 2012 15:14:46 +0100 Received: by eekd17 with SMTP id d17so2989328eek.27 for ; Fri, 16 Mar 2012 07:14:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=message-id:date:from:user-agent:mime-version:to:subject:references :in-reply-to:content-type:content-transfer-encoding; bh=6u4o152dwqurEG7GA5VtPkG+QM4grsNfmLvKo/iF9zo=; b=oOUw76cVX6m2iqKTglcr1W22OaBofKeoMd/IvLP46FASiFGj+wM4QZ8OcJTynfc1iC 6yE3elHfjOPyHBHP6z5oY3ZXeztanwQrAW8QXUKAXnHRia3Pd5qc/Tm1NjQzYUXtEa2t j/iI5p9jLEKy29uLg767AzMHS1ym7fuqpCPKAR9nre/Uo0nlzmYW9sfj81ProfFjoFgz 87zeIqCqOwBJSh6nJj7BJa5Mmg3Qtc7Litf1yNCHrrLEQWIRsX177tFo1UHp7bcUifYn 0OzuC/lbyetdCQsXAHfAdmg7trIZdgi3MgeaXeg9CtY2iBk4x5CSwLetD/U/MAzTS/jX qboA== Received: by 10.50.135.74 with SMTP id pq10mr19589609igb.62.1331907285250; Fri, 16 Mar 2012 07:14:45 -0700 (PDT) Received: from [192.168.1.65] (99-121-78-10.lightspeed.lnngmi.sbcglobal.net. [99.121.78.10]) by mx.google.com with ESMTPS id al5sm5145940igc.5.2012.03.16.07.14.43 (version=SSLv3 cipher=OTHER); Fri, 16 Mar 2012 07:14:44 -0700 (PDT) Message-ID: <4F634AD1.20402@gmail.com> Date: Fri, 16 Mar 2012 10:14:41 -0400 From: Edgar Friendly User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.2) Gecko/20120216 Thunderbird/10.0.2 MIME-Version: 1.0 To: Philippe Veber , "caml-list@inria.fr" References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Caml-list] Efficient scanning of large strings from files On 03/16/2012 09:03 AM, Philippe Veber wrote: > Dear camlers, > > Say that you'd like to search a regexp on a file with lines so long that > you'd rather not load them entirely at once. If you can bound the size > of a match by k << length of a line, then you know that you can only > keep a small portion of the line in memory to search the regexp. > Typically you'd like to access substrings of size k from left to right. > I guess such a thing should involve buffered inputs and avoid copying > strings as much as possible. My question is as follows: has anybody > written a library to access these substrings gracefully and with decent > performance? > Cheers, > Philippe. > This is tricky to do, as incremental matching implies DFA-based matching, but returning matching substrings implies backtrack-based matching. The re2 library returns matching substrings by matching forward to find the end of patterns, and then matching backwards on the reversed regex from that point to find their beginning. I don't know if even it supports this on incremental input. Subject to the assumption that starting to match implies either finishing or aborting a match (i.e. once you've started to match your regex, no other match will start before either this match attempt finishes successful or not), this is not unreasonable to do. Without this assumption, it requires tracking many match start locations and somehow knowing which of them match or fail to match. I'm not sure this has been done before. E.