From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105]) by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id q2J99BDZ019298 for ; Mon, 19 Mar 2012 10:09:11 +0100 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ap4DANX2Zk/RVdy2mGdsb2JhbABDhT6fQIhYAYhqCCIBAQEBAQgJDQcUJ4IJAQEBAwESAg8dARsSCwEDAQsGAwILDQ0dAgIhAQERAQUBChIGEwgKEIdjBQucRgqLRE6CcYQ2P4h0AQULiUqGEYEWBJVoiy+DGj2ECQ X-IronPort-AV: E=Sophos;i="4.73,611,1325458800"; d="scan'208";a="136706540" Received: from mail-vx0-f182.google.com ([209.85.220.182]) by mail4-smtp-sop.national.inria.fr with ESMTP/TLS/RC4-SHA; 19 Mar 2012 10:09:05 +0100 Received: by vcmm1 with SMTP id m1so12000287vcm.27 for ; Mon, 19 Mar 2012 02:09:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; bh=3p67Igld0YVobTwQSU0KpqAMCc+g9c3odBRkswlLrgE=; b=ERJeR2TI31wc53eaUuE55fLu/fbv24J5V9SBH9A2Pt9Af566WcTOxzhRux8h/HXoCz NwO5OQdhlcRkwewSdx07Dzjv5EXB2uggN8EekCBJv/TVt2Jb8OBoUbLTVgpTbEsarOiA NtPPUFVsqWRUXvjZ3/mxHKgQ/kjRuxCBSDyyEC3i/jo13VZfSsjbVgKJGJpIPmhAV5Gq jWt0zb4dAUOB00HzKeN8nFKdre9amhE6cG4A+QO+hisbNZfl72fX1FD1Ad4TgwVnZkHH ICtt2KbE96wM8RN+Xy+UlCvYGMG2QhRjHq56v+w6ZAKsP2NJx5d7WJ6fd4Qs3hjaoe2N 3W4A== Received: by 10.52.70.2 with SMTP id i2mr5001497vdu.54.1332148144368; Mon, 19 Mar 2012 02:09:04 -0700 (PDT) MIME-Version: 1.0 Received: by 10.52.31.136 with HTTP; Mon, 19 Mar 2012 02:08:44 -0700 (PDT) In-Reply-To: <1331918628011613.Berenger@riken.jp> References: <4F634AD1.20402@gmail.com> <1331918628011613.Berenger@riken.jp> From: Philippe Veber Date: Mon, 19 Mar 2012 10:08:44 +0100 Message-ID: To: "Francois????Charles Matthieu????Berenger" Cc: Edgar Friendly , "caml-list@inria.fr" Content-Type: multipart/alternative; boundary=20cf307f3990eeeaa404bb94e8f7 X-Validation-by: philippe.veber@gmail.com Subject: Re: Re: [Caml-list] Efficient scanning of large strings from files --20cf307f3990eeeaa404bb94e8f7 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Yes indeed! 2012/3/16 Francois????Charles Matthieu????Berenger > =EF=BD=88=EF=BD=89 philippe, > > i am curious, is your string a dna sequenc=EF=BD=85=EF=BD=93=EF=BD=8F =EF= =BD=94=EF=BD=88=EF=BD=81=EF=BD=94 =EF=BD=93 =EF=BD=97=EF=BD=88=EF=BD=99 =EF= =BD=89=EF=BD=94 =EF=BD=89=EF=BD=93 =EF=BD=93=EF=BD=8F =EF=BD=8C=EF=BD=8F=EF= =BD=8E=EF=BD=87=EF=BC=9F > > =EF=BD=92=EF=BD=85=EF=BD=87=EF=BD=81=EF=BD=92=EF=BD=84=EF=BD=93=EF=BC=8C= =EF=BD=86 > > > On Fri, 16 Mar 2012 10:14:41 -0400 > Edgar Friendly wrote: > > > On 03/16/2012 09:03 AM, Philippe Veber wrote: > > > Dear camlers, > > > > > > Say that you'd like to search a regexp on a file with lines so long > that > > > you'd rather not load them entirely at once. If you can bound the size > > > of a match by k << length of a line, then you know that you can only > > > keep a small portion of the line in memory to search the regexp. > > > Typically you'd like to access substrings of size k from left to righ= t. > > > I guess such a thing should involve buffered inputs and avoid copying > > > strings as much as possible. My question is as follows: has anybody > > > written a library to access these substrings gracefully and with dece= nt > > > performance? > > > Cheers, > > > Philippe. > > > > > This is tricky to do, as incremental matching implies DFA-based > > matching, but returning matching substrings implies backtrack-based > > matching. The re2 library returns matching substrings by matching > > forward to find the end of patterns, and then matching backwards on the > > reversed regex from that point to find their beginning. I don't know if > > even it supports this on incremental input. > > > > Subject to the assumption that starting to match implies either > > finishing or aborting a match (i.e. once you've started to match your > > regex, no other match will start before either this match attempt > > finishes successful or not), this is not unreasonable to do. Without > > this assumption, it requires tracking many match start locations and > > somehow knowing which of them match or fail to match. I'm not sure this > > has been done before. > > > > E. > > > > -- > > Caml-list mailing list. Subscription management and archives: > > https://sympa-roc.inria.fr/wws/info/caml-list > > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > > Bug reports: http://caml.inria.fr/bin/caml-bugs > > > > > > --20cf307f3990eeeaa404bb94e8f7 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Yes indeed!

2012/3/16 Francois????Charles= Matthieu????Berenger <Berenger@riken.jp>
=EF=BD=88=EF=BD=89=E3=80=80philippe,

i am curious, is your string a dna sequenc=EF=BD=85=EF=BD=93=EF=BD=8F=E3=80= =80=EF=BD=94=EF=BD=88=EF=BD=81=EF=BD=94=E3=80=80=EF=BD=93=E3=80=80=EF=BD=97= =EF=BD=88=EF=BD=99=E3=80=80=EF=BD=89=EF=BD=94=E3=80=80=EF=BD=89=EF=BD=93=E3= =80=80=EF=BD=93=EF=BD=8F=E3=80=80=EF=BD=8C=EF=BD=8F=EF=BD=8E=EF=BD=87=EF=BC= =9F

=EF=BD=92=EF=BD=85=EF=BD=87=EF=BD=81=EF=BD=92=EF=BD=84=EF=BD=93=EF=BC=8C=EF= =BD=86


On Fri, 16 Mar 2012 10:14:41 -0400
Edgar Friendly <thelema314@gmail= .com> wrote:

> On 03/16/2012 09:03 AM, Philippe Veber wrote:
> > Dear camlers,
> >
> > Say that you'd like to search a regexp on a file with lines s= o long that
> > you'd rather not load them entirely at once. If you can bound= the size
> > of a match by k << length of a line, then you know that you= can only
> > keep a small portion of the line in memory to search the regexp.<= br> > > Typically you'd like to access substrings of size k from left= to right.
> > I guess such a thing should involve buffered inputs and avoid cop= ying
> > strings as much as possible. My question is as follows: has anybo= dy
> > written a library to access these substrings gracefully and with = decent
> > performance?
> > Cheers,
> > =C2=A0 =C2=A0Philippe.
> >
> This is tricky to do, as incremental matching implies DFA-based
> matching, but returning matching substrings implies backtrack-based
> matching. =C2=A0The re2 library returns matching substrings by matchin= g
> forward to find the end of patterns, and then matching backwards on th= e
> reversed regex from that point to find their beginning. =C2=A0I don= 9;t know if
> even it supports this on incremental input.
>
> Subject to the assumption that starting to match implies either
> finishing or aborting a match (i.e. once you've started to match y= our
> regex, no other match will start before either this match attempt
> finishes successful or not), this is not unreasonable to do. =C2=A0Wit= hout
> this assumption, it requires tracking many match start locations and > somehow knowing which of them match or fail to match. =C2=A0I'm no= t sure this
> has been done before.
>
> E.
>
> --
> Caml-list mailing list. =C2=A0Subscription management and archives:
> https://sympa-roc.inria.fr/wws/info/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports:
http://caml.inria.fr/bin/caml-bugs
>




--20cf307f3990eeeaa404bb94e8f7--