* [Caml-list] Efficient scanning of large strings from files @ 2012-03-16 13:03 Philippe Veber 2012-03-16 14:14 ` Edgar Friendly ` (3 more replies) 0 siblings, 4 replies; 15+ messages in thread From: Philippe Veber @ 2012-03-16 13:03 UTC (permalink / raw) To: caml users [-- Attachment #1: Type: text/plain, Size: 608 bytes --] Dear camlers, Say that you'd like to search a regexp on a file with lines so long that you'd rather not load them entirely at once. If you can bound the size of a match by k << length of a line, then you know that you can only keep a small portion of the line in memory to search the regexp. Typically you'd like to access substrings of size k from left to right. I guess such a thing should involve buffered inputs and avoid copying strings as much as possible. My question is as follows: has anybody written a library to access these substrings gracefully and with decent performance? Cheers, Philippe. [-- Attachment #2: Type: text/html, Size: 651 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Efficient scanning of large strings from files 2012-03-16 13:03 [Caml-list] Efficient scanning of large strings from files Philippe Veber @ 2012-03-16 14:14 ` Edgar Friendly 2012-03-16 14:48 ` Philippe Veber 2012-03-16 17:23 ` Francois????Charles Matthieu????Berenger 2012-03-16 14:49 ` Jérémie Dimino ` (2 subsequent siblings) 3 siblings, 2 replies; 15+ messages in thread From: Edgar Friendly @ 2012-03-16 14:14 UTC (permalink / raw) To: Philippe Veber, caml-list On 03/16/2012 09:03 AM, Philippe Veber wrote: > Dear camlers, > > Say that you'd like to search a regexp on a file with lines so long that > you'd rather not load them entirely at once. If you can bound the size > of a match by k << length of a line, then you know that you can only > keep a small portion of the line in memory to search the regexp. > Typically you'd like to access substrings of size k from left to right. > I guess such a thing should involve buffered inputs and avoid copying > strings as much as possible. My question is as follows: has anybody > written a library to access these substrings gracefully and with decent > performance? > Cheers, > Philippe. > This is tricky to do, as incremental matching implies DFA-based matching, but returning matching substrings implies backtrack-based matching. The re2 library returns matching substrings by matching forward to find the end of patterns, and then matching backwards on the reversed regex from that point to find their beginning. I don't know if even it supports this on incremental input. Subject to the assumption that starting to match implies either finishing or aborting a match (i.e. once you've started to match your regex, no other match will start before either this match attempt finishes successful or not), this is not unreasonable to do. Without this assumption, it requires tracking many match start locations and somehow knowing which of them match or fail to match. I'm not sure this has been done before. E. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Efficient scanning of large strings from files 2012-03-16 14:14 ` Edgar Friendly @ 2012-03-16 14:48 ` Philippe Veber 2012-03-16 17:02 ` Edgar Friendly 2012-03-16 17:23 ` Francois????Charles Matthieu????Berenger 1 sibling, 1 reply; 15+ messages in thread From: Philippe Veber @ 2012-03-16 14:48 UTC (permalink / raw) To: Edgar Friendly; +Cc: caml-list [-- Attachment #1: Type: text/plain, Size: 2314 bytes --] 2012/3/16 Edgar Friendly <thelema314@gmail.com> > On 03/16/2012 09:03 AM, Philippe Veber wrote: > >> Dear camlers, >> >> Say that you'd like to search a regexp on a file with lines so long that >> you'd rather not load them entirely at once. If you can bound the size >> of a match by k << length of a line, then you know that you can only >> keep a small portion of the line in memory to search the regexp. >> Typically you'd like to access substrings of size k from left to right. >> I guess such a thing should involve buffered inputs and avoid copying >> strings as much as possible. My question is as follows: has anybody >> written a library to access these substrings gracefully and with decent >> performance? >> Cheers, >> Philippe. >> >> This is tricky to do, as incremental matching implies DFA-based > matching, but returning matching substrings implies backtrack-based > matching. The re2 library returns matching substrings by matching forward > to find the end of patterns, and then matching backwards on the reversed > regex from that point to find their beginning. I don't know if even it > supports this on incremental input. > > Subject to the assumption that starting to match implies either finishing > or aborting a match (i.e. once you've started to match your regex, no other > match will start before either this match attempt finishes successful or > not), this is not unreasonable to do. Without this assumption, it requires > tracking many match start locations and somehow knowing which of them match > or fail to match. I'm not sure this has been done before. > > E. > Thank you Edgar for your answer (and also Christophe). It seems my question was a bit misleading: actually I target a subset of regexps whose matching is really trivial, so this is no worry for me. I was more interested in how accessing a large line in a file by chunks of fixed length k. For instance how to build a [Substring.t Enum.t] from some line in a file, without building the whole line in memory. This enum would yield the substrings (0,k-1), (1,k), (2,k+1), etc ... without doing too many string copy/concat operations. I think I can do it myself but I'm not too confident regarding good practices on buffered reads of files. Maybe there are some good examples in Batteries? Thanks again, ph. [-- Attachment #2: Type: text/html, Size: 2862 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
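A minimal sketch of the enumerator Philippe describes, using only the standard library: the name iter_windows, the stop-at-'\n' convention and the test file name are invented here, and it still copies k-1 bytes per step (a genuinely zero-copy version would use a circular buffer), but it never holds more than k characters of the line in memory.

  let iter_windows ~k ic f =
    let buf = Buffer.create k in
    let next () = try Some (input_char ic) with End_of_file -> None in
    let rec fill () =
      if Buffer.length buf < k then
        match next () with
        | Some c when c <> '\n' -> Buffer.add_char buf c; fill ()
        | _ -> ()                       (* end of line or end of file *)
    in
    let rec slide () =
      fill ();
      if Buffer.length buf = k then begin
        f (Buffer.contents buf);        (* window (i, i+k-1) as a string *)
        let tail = Buffer.sub buf 1 (k - 1) in   (* drop the leftmost char *)
        Buffer.clear buf;
        Buffer.add_string buf tail;
        slide ()
      end
    in
    slide ()

  (* e.g. count the windows of size 10 on the current line of a (made-up) file *)
  let () =
    let ic = open_in "chr1.fa" in
    let n = ref 0 in
    iter_windows ~k:10 ic (fun _ -> incr n);
    Printf.printf "%d windows\n" !n;
    close_in ic

To scan a particular line of a multi-line file, skip the preceding lines with input_line before calling it; input_char reads from the channel's buffer, so only the channel's fixed-size buffer and the k-character window are ever resident.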
* Re: [Caml-list] Efficient scanning of large strings from files 2012-03-16 14:48 ` Philippe Veber @ 2012-03-16 17:02 ` Edgar Friendly 2012-03-19 9:08 ` Philippe Veber 0 siblings, 1 reply; 15+ messages in thread From: Edgar Friendly @ 2012-03-16 17:02 UTC (permalink / raw) To: Philippe Veber; +Cc: caml-list [-- Attachment #1: Type: text/plain, Size: 1526 bytes --] So given a large file and a line number, you want to: 1) extract that line from the file 2) produce an enum of all k-length slices of that line? 3) match each slice against your regexp set to produce a list/enum of substrings that match the regexps? Without reading the whole line into memory at once. I'm with Dimino on the right solution - just use a matcher that that works incrementally, feed it one byte at a time, and have it return a list of match offsets. Then work backwards from these endpoints to figure out which substrings you want. There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it should suffice to use (0,k-1) and (k,2k-1) with an incremental matching routine. E. On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber <philippe.veber@gmail.com>wrote: > Thank you Edgar for your answer (and also Christophe). It seems my > question was a bit misleading: actually I target a subset of regexps whose > matching is really trivial, so this is no worry for me. I was more > interested in how accessing a large line in a file by chunks of fixed > length k. For instance how to build a [Substring.t Enum.t] from some line > in a file, without building the whole line in memory. This enum would yield > the substrings (0,k-1), (1,k), (2,k+1), etc ... without doing too many > string copy/concat operations. I think I can do it myself but I'm not too > confident regarding good practices on buffered reads of files. Maybe there > are some good examples in Batteries? > > Thanks again, > ph. > > > [-- Attachment #2: Type: text/html, Size: 1815 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
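For a single literal pattern (the "really trivial" subset Philippe mentions), an incremental matcher of the kind Edgar describes can be sketched with KMP. This is an illustrative sketch, not code from re2 or any library in the thread; it is fed one character at a time and reports the start offset of each occurrence, overlapping matches included.

  type matcher = {
    pat : string;                 (* non-empty literal pattern *)
    fail : int array;             (* KMP failure function *)
    mutable state : int;          (* how many chars of pat are currently matched *)
    mutable pos : int;            (* index of the next input character *)
  }

  let matcher_make pat =
    let m = String.length pat in
    let fail = Array.make m 0 in
    let k = ref 0 in
    for i = 1 to m - 1 do
      while !k > 0 && pat.[!k] <> pat.[i] do k := fail.(!k - 1) done;
      if pat.[!k] = pat.[i] then incr k;
      fail.(i) <- !k
    done;
    { pat; fail; state = 0; pos = 0 }

  let matcher_feed m report c =
    let n = String.length m.pat in
    while m.state > 0 && m.pat.[m.state] <> c do m.state <- m.fail.(m.state - 1) done;
    if m.pat.[m.state] = c then m.state <- m.state + 1;
    if m.state = n then begin
      report (m.pos + 1 - n);     (* start offset of this occurrence *)
      m.state <- m.fail.(n - 1)   (* keep going: matches may overlap *)
    end;
    m.pos <- m.pos + 1

Because the state is a single integer, the same matcher can be fed from input_char or from blocks without ever re-reading input, and the reported offsets are exactly the match endpoints to work backwards from.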
* Re: [Caml-list] Efficient scanning of large strings from files 2012-03-16 17:02 ` Edgar Friendly @ 2012-03-19 9:08 ` Philippe Veber 2012-03-19 13:44 ` Edgar Friendly 0 siblings, 1 reply; 15+ messages in thread From: Philippe Veber @ 2012-03-19 9:08 UTC (permalink / raw) To: Edgar Friendly; +Cc: caml-list [-- Attachment #1: Type: text/plain, Size: 1843 bytes --] Thanks Edgar and Jérémie, this indeed seems to be the right track. I just hope that a repeated use of input_char is not 10-100X slower than input_line :o). ph. 2012/3/16 Edgar Friendly <thelema314@gmail.com> > So given a large file and a line number, you want to: > 1) extract that line from the file > 2) produce an enum of all k-length slices of that line? > 3) match each slice against your regexp set to produce a list/enum of > substrings that match the regexps? > Without reading the whole line into memory at once. > > I'm with Dimino on the right solution - just use a matcher that that works > incrementally, feed it one byte at a time, and have it return a list of > match offsets. Then work backwards from these endpoints to figure out > which substrings you want. > > There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it > should suffice to use (0,k-1) and (k,2k-1) with an incremental matching > routine. > > E. > > > > On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber <philippe.veber@gmail.com > > wrote: > >> Thank you Edgar for your answer (and also Christophe). It seems my >> question was a bit misleading: actually I target a subset of regexps whose >> matching is really trivial, so this is no worry for me. I was more >> interested in how accessing a large line in a file by chunks of fixed >> length k. For instance how to build a [Substring.t Enum.t] from some line >> in a file, without building the whole line in memory. This enum would yield >> the substrings (0,k-1), (1,k), (2,k+1), etc ... without doing too many >> string copy/concat operations. I think I can do it myself but I'm not too >> confident regarding good practices on buffered reads of files. Maybe there >> are some good examples in Batteries? >> >> Thanks again, >> ph. >> >> >> > [-- Attachment #2: Type: text/html, Size: 2401 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Efficient scanning of large strings from files 2012-03-19 9:08 ` Philippe Veber @ 2012-03-19 13:44 ` Edgar Friendly 2012-03-21 7:21 ` Philippe Veber 0 siblings, 1 reply; 15+ messages in thread From: Edgar Friendly @ 2012-03-19 13:44 UTC (permalink / raw) To: Philippe Veber; +Cc: caml-list On 03/19/2012 05:08 AM, Philippe Veber wrote: > Thanks Edgar and Jérémie, this indeed seems to be the right track. I > just hope that a repeated use of input_char is not 10-100X slower than > input_line :o). > ph. > Quite true - instead of giving the matcher just a single byte at a time, it is more efficient to give it blocks of data, as long as it can keep its state from one block to the next. But its matching internally will be on one byte at a time, normally. I guess with DNA, because of the reduced character set, it'd be possible to get each symbol down to 2 bits (if you're really just using ACGT), in which case, the matcher could run 4 basepairs at a time, but there's a lot of corner issues doing things that way. A lot depends on how much time and effort you're willing to spend engineering something. E. > 2012/3/16 Edgar Friendly <thelema314@gmail.com > <mailto:thelema314@gmail.com>> > > So given a large file and a line number, you want to: > 1) extract that line from the file > 2) produce an enum of all k-length slices of that line? > 3) match each slice against your regexp set to produce a list/enum > of substrings that match the regexps? > Without reading the whole line into memory at once. > > I'm with Dimino on the right solution - just use a matcher that that > works incrementally, feed it one byte at a time, and have it return > a list of match offsets. Then work backwards from these endpoints > to figure out which substrings you want. > > There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it > should suffice to use (0,k-1) and (k,2k-1) with an incremental > matching routine. > > E. > > > > On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber > <philippe.veber@gmail.com <mailto:philippe.veber@gmail.com>> wrote: > > Thank you Edgar for your answer (and also Christophe). It seems > my question was a bit misleading: actually I target a subset of > regexps whose matching is really trivial, so this is no worry > for me. I was more interested in how accessing a large line in a > file by chunks of fixed length k. For instance how to build a > [Substring.t Enum.t] from some line in a file, without building > the whole line in memory. This enum would yield the substrings > (0,k-1), (1,k), (2,k+1), etc ... without doing too many string > copy/concat operations. I think I can do it myself but I'm not > too confident regarding good practices on buffered reads of > files. Maybe there are some good examples in Batteries? > > Thanks again, > ph. > > > > ^ permalink raw reply [flat|nested] 15+ messages in thread
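Concretely, the block-at-a-time variant only changes the reading loop; the matcher's state carries over between chunks, so nothing is ever re-read. The sketch below reuses the hypothetical matcher_make/matcher_feed from the earlier KMP sketch, assumes an OCaml with the Bytes module (4.02 or later), and the file name is made up.

  let scan_file ?(chunk_size = 65536) path pat report =
    let m = matcher_make pat in
    let ic = open_in_bin path in
    let buf = Bytes.create chunk_size in
    let rec loop () =
      let n = input ic buf 0 chunk_size in      (* n = 0 means end of file *)
      if n > 0 then begin
        for i = 0 to n - 1 do
          matcher_feed m report (Bytes.get buf i)   (* byte-at-a-time matching *)
        done;
        loop ()
      end
    in
    loop ();
    close_in ic

  let () = scan_file "chr1.fa" "GATTACA" (Printf.printf "match at byte %d\n")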
* Re: [Caml-list] Efficient scanning of large strings from files 2012-03-19 13:44 ` Edgar Friendly @ 2012-03-21 7:21 ` Philippe Veber 0 siblings, 0 replies; 15+ messages in thread From: Philippe Veber @ 2012-03-21 7:21 UTC (permalink / raw) To: Edgar Friendly; +Cc: caml-list [-- Attachment #1: Type: text/plain, Size: 3163 bytes --] 2012/3/19 Edgar Friendly <thelema314@gmail.com> > On 03/19/2012 05:08 AM, Philippe Veber wrote: > >> Thanks Edgar and Jérémie, this indeed seems to be the right track. I >> just hope that a repeated use of input_char is not 10-100X slower than >> input_line :o). >> ph. >> >> Quite true - instead of giving the matcher just a single byte at a time, > it is more efficient to give it blocks of data, as long as it can keep its > state from one block to the next. But its matching internally will be on > one byte at a time, normally. Thanks for the confirmation, I now see more clearly what to do. > I guess with DNA, because of the reduced character set, it'd be possible > to get each symbol down to 2 bits (if you're really just using ACGT), in > which case, the matcher could run 4 basepairs at a time, but there's a lot > of corner issues doing things that way. A lot depends on how much time and > effort you're willing to spend engineering something. > Maybe not that far yet, but this is something we've mentionned for biocaml. I guess I could take some inspiration from the bitset module in Batteries. Anyway thanks everybody for your help! ph. > > E. > > 2012/3/16 Edgar Friendly <thelema314@gmail.com >> <mailto:thelema314@gmail.com>> >> >> >> So given a large file and a line number, you want to: >> 1) extract that line from the file >> 2) produce an enum of all k-length slices of that line? >> 3) match each slice against your regexp set to produce a list/enum >> of substrings that match the regexps? >> Without reading the whole line into memory at once. >> >> I'm with Dimino on the right solution - just use a matcher that that >> works incrementally, feed it one byte at a time, and have it return >> a list of match offsets. Then work backwards from these endpoints >> to figure out which substrings you want. >> >> There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it >> should suffice to use (0,k-1) and (k,2k-1) with an incremental >> matching routine. >> >> E. >> >> >> >> On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber >> <philippe.veber@gmail.com <mailto:philippe.veber@gmail.**com<philippe.veber@gmail.com>>> >> wrote: >> >> Thank you Edgar for your answer (and also Christophe). It seems >> my question was a bit misleading: actually I target a subset of >> regexps whose matching is really trivial, so this is no worry >> for me. I was more interested in how accessing a large line in a >> file by chunks of fixed length k. For instance how to build a >> [Substring.t Enum.t] from some line in a file, without building >> the whole line in memory. This enum would yield the substrings >> (0,k-1), (1,k), (2,k+1), etc ... without doing too many string >> copy/concat operations. I think I can do it myself but I'm not >> too confident regarding good practices on buffered reads of >> files. Maybe there are some good examples in Batteries? >> >> Thanks again, >> ph. >> >> >> >> >> > [-- Attachment #2: Type: text/html, Size: 4346 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
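For reference, the 2-bit packing idea is small enough to sketch; the encoding (A=00, C=01, G=10, T=11), the names and the byte layout below are arbitrary illustrative choices, not Biocaml's or Batteries' representation.

  let code = function
    | 'A' | 'a' -> 0 | 'C' | 'c' -> 1
    | 'G' | 'g' -> 2 | 'T' | 't' -> 3
    | c -> invalid_arg (Printf.sprintf "not a nucleotide: %c" c)

  (* pack 4 bases per byte, least significant bits first *)
  let pack s =
    let n = String.length s in
    let out = Bytes.make ((n + 3) / 4) '\000' in
    String.iteri
      (fun i c ->
         let byte = i / 4 and shift = 2 * (i mod 4) in
         let old = Char.code (Bytes.get out byte) in
         Bytes.set out byte (Char.chr (old lor (code c lsl shift))))
      s;
    out

  let () = assert (Bytes.length (pack "ACGTACGTAC") = 3)

Matching four basepairs at a time over such a packed buffer is where the corner cases Edgar mentions appear: matches that straddle byte boundaries, ambiguity codes such as 'N', and sequences whose length is not a multiple of 4.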
* Re: Re: [Caml-list] Efficient scanning of large strings from files 2012-03-16 14:14 ` Edgar Friendly 2012-03-16 14:48 ` Philippe Veber @ 2012-03-16 17:23 ` Francois Charles Matthieu Berenger 2012-03-17 16:53 ` oliver 2012-03-19 9:08 ` Philippe Veber 1 sibling, 2 replies; 15+ messages in thread From: Francois Charles Matthieu Berenger @ 2012-03-16 17:23 UTC (permalink / raw) To: Edgar Friendly; +Cc: Philippe Veber, caml-list hi philippe, i am curious, is your string a dna sequence, so that's why it is so long? regards, f On Fri, 16 Mar 2012 10:14:41 -0400 Edgar Friendly <thelema314@gmail.com> wrote: > On 03/16/2012 09:03 AM, Philippe Veber wrote: > > Dear camlers, > > > > Say that you'd like to search a regexp on a file with lines so long that > > you'd rather not load them entirely at once. If you can bound the size > > of a match by k << length of a line, then you know that you can only > > keep a small portion of the line in memory to search the regexp. > > Typically you'd like to access substrings of size k from left to right. > > I guess such a thing should involve buffered inputs and avoid copying > > strings as much as possible. My question is as follows: has anybody > > written a library to access these substrings gracefully and with decent > > performance? > > Cheers, > > Philippe. > > > This is tricky to do, as incremental matching implies DFA-based > matching, but returning matching substrings implies backtrack-based > matching. The re2 library returns matching substrings by matching > forward to find the end of patterns, and then matching backwards on the > reversed regex from that point to find their beginning. I don't know if > even it supports this on incremental input. > > Subject to the assumption that starting to match implies either > finishing or aborting a match (i.e. once you've started to match your > regex, no other match will start before either this match attempt > finishes successful or not), this is not unreasonable to do. Without > this assumption, it requires tracking many match start locations and > somehow knowing which of them match or fail to match. I'm not sure this > has been done before. > > E. > > -- > Caml-list mailing list. Subscription management and archives: > https://sympa-roc.inria.fr/wws/info/caml-list > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Re: [Caml-list] Efficient scanning of large strings from files 2012-03-16 17:23 ` Francois????Charles Matthieu????Berenger @ 2012-03-17 16:53 ` oliver 2012-03-19 9:08 ` Philippe Veber 1 sibling, 0 replies; 15+ messages in thread From: oliver @ 2012-03-17 16:53 UTC (permalink / raw) To: Francois????Charles Matthieu????Berenger Cc: Edgar Friendly, Philippe Veber, caml-list DNA: C, G, A, T, \n On Sat, Mar 17, 2012 at 02:23:48AM +0900, Francois????Charles Matthieu????Berenger wrote: > hi philippe, > > i am curious, is your string a dna sequenceso that s why it is so long? > > regards,f > > > On Fri, 16 Mar 2012 10:14:41 -0400 > Edgar Friendly <thelema314@gmail.com> wrote: > > > On 03/16/2012 09:03 AM, Philippe Veber wrote: > > > Dear camlers, > > > > > > Say that you'd like to search a regexp on a file with lines so long that > > > you'd rather not load them entirely at once. If you can bound the size > > > of a match by k << length of a line, then you know that you can only > > > keep a small portion of the line in memory to search the regexp. > > > Typically you'd like to access substrings of size k from left to right. > > > I guess such a thing should involve buffered inputs and avoid copying > > > strings as much as possible. My question is as follows: has anybody > > > written a library to access these substrings gracefully and with decent > > > performance? > > > Cheers, > > > Philippe. > > > > > This is tricky to do, as incremental matching implies DFA-based > > matching, but returning matching substrings implies backtrack-based > > matching. The re2 library returns matching substrings by matching > > forward to find the end of patterns, and then matching backwards on the > > reversed regex from that point to find their beginning. I don't know if > > even it supports this on incremental input. > > > > Subject to the assumption that starting to match implies either > > finishing or aborting a match (i.e. once you've started to match your > > regex, no other match will start before either this match attempt > > finishes successful or not), this is not unreasonable to do. Without > > this assumption, it requires tracking many match start locations and > > somehow knowing which of them match or fail to match. I'm not sure this > > has been done before. > > > > E. > > > > -- > > Caml-list mailing list. Subscription management and archives: > > https://sympa-roc.inria.fr/wws/info/caml-list > > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > > Bug reports: http://caml.inria.fr/bin/caml-bugs > > > > > > > -- > Caml-list mailing list. Subscription management and archives: > https://sympa-roc.inria.fr/wws/info/caml-list > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Re: [Caml-list] Efficient scanning of large strings from files 2012-03-16 17:23 ` Francois????Charles Matthieu????Berenger 2012-03-17 16:53 ` oliver @ 2012-03-19 9:08 ` Philippe Veber 1 sibling, 0 replies; 15+ messages in thread From: Philippe Veber @ 2012-03-19 9:08 UTC (permalink / raw) To: Francois????Charles Matthieu????Berenger; +Cc: Edgar Friendly, caml-list [-- Attachment #1: Type: text/plain, Size: 2295 bytes --] Yes indeed! 2012/3/16 Francois????Charles Matthieu????Berenger <Berenger@riken.jp> > hi philippe, > > i am curious, is your string a dna sequenceso that s why it is so long? > > regards,f > > > On Fri, 16 Mar 2012 10:14:41 -0400 > Edgar Friendly <thelema314@gmail.com> wrote: > > > On 03/16/2012 09:03 AM, Philippe Veber wrote: > > > Dear camlers, > > > > > > Say that you'd like to search a regexp on a file with lines so long > that > > > you'd rather not load them entirely at once. If you can bound the size > > > of a match by k << length of a line, then you know that you can only > > > keep a small portion of the line in memory to search the regexp. > > > Typically you'd like to access substrings of size k from left to right. > > > I guess such a thing should involve buffered inputs and avoid copying > > > strings as much as possible. My question is as follows: has anybody > > > written a library to access these substrings gracefully and with decent > > > performance? > > > Cheers, > > > Philippe. > > > > > This is tricky to do, as incremental matching implies DFA-based > > matching, but returning matching substrings implies backtrack-based > > matching. The re2 library returns matching substrings by matching > > forward to find the end of patterns, and then matching backwards on the > > reversed regex from that point to find their beginning. I don't know if > > even it supports this on incremental input. > > > > Subject to the assumption that starting to match implies either > > finishing or aborting a match (i.e. once you've started to match your > > regex, no other match will start before either this match attempt > > finishes successful or not), this is not unreasonable to do. Without > > this assumption, it requires tracking many match start locations and > > somehow knowing which of them match or fail to match. I'm not sure this > > has been done before. > > > > E. > > > > -- > > Caml-list mailing list. Subscription management and archives: > > https://sympa-roc.inria.fr/wws/info/caml-list > > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > > Bug reports: http://caml.inria.fr/bin/caml-bugs > > > > > > [-- Attachment #2: Type: text/html, Size: 3210 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Efficient scanning of large strings from files 2012-03-16 13:03 [Caml-list] Efficient scanning of large strings from files Philippe Veber 2012-03-16 14:14 ` Edgar Friendly @ 2012-03-16 14:49 ` Jérémie Dimino 2012-03-18 21:11 ` Török Edwin 2012-03-16 20:11 ` oliver 2012-03-18 23:56 ` oliver 3 siblings, 1 reply; 15+ messages in thread From: Jérémie Dimino @ 2012-03-16 14:49 UTC (permalink / raw) To: Philippe Veber; +Cc: caml users Le Fri, 16 Mar 2012 14:03:38 +0100, Philippe Veber <philippe.veber@gmail.com> a écrit : > Say that you'd like to search a regexp on a file with lines so long > that you'd rather not load them entirely at once. If you can bound > the size of a match by k << length of a line, then you know that you > can only keep a small portion of the line in memory to search the > regexp. Typically you'd like to access substrings of size k from left > to right. I guess such a thing should involve buffered inputs and > avoid copying strings as much as possible. My question is as follows: > has anybody written a library to access these substrings gracefully > and with decent performance? Cheers, You can use a non-backtracking regexp library to find offsets of the substrings, then seek in the file to extract them. You can use for example the libre library from Jérôme Vouillon [1]. It only accept strings as input but it would be really easy to make it work on input channels (just replace "s.[pos]" by "input_char ic"). [1] http://sourceforge.net/projects/libre/ https://github.com/avsm/ocaml-re.git Cheers, -- Jérémie ^ permalink raw reply [flat|nested] 15+ messages in thread
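The "seek back and extract" half of this suggestion needs nothing beyond the standard library. A sketch, assuming the matcher produced absolute (start, stop) byte offsets and an OCaml with really_input_string (4.02 or later); the offsets in the example are invented.

  let extract_matches path offsets =
    let ic = open_in_bin path in
    let read (start, stop) =
      seek_in ic start;
      really_input_string ic (stop - start)
    in
    let subs = List.map read offsets in   (* reads happen before the channel is closed *)
    close_in ic;
    subs

  let () =
    extract_matches "chr1.fa" [ (12, 19); (1030, 1037) ]
    |> List.iter print_endline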
* Re: [Caml-list] Efficient scanning of large strings from files 2012-03-16 14:49 ` Jérémie Dimino @ 2012-03-18 21:11 ` Török Edwin 2012-03-19 9:11 ` Philippe Veber 0 siblings, 1 reply; 15+ messages in thread From: Török Edwin @ 2012-03-18 21:11 UTC (permalink / raw) To: caml-list On 03/16/2012 04:49 PM, Jérémie Dimino wrote: > Le Fri, 16 Mar 2012 14:03:38 +0100, > Philippe Veber <philippe.veber@gmail.com> a écrit : > >> Say that you'd like to search a regexp on a file with lines so long >> that you'd rather not load them entirely at once. If you can bound >> the size of a match by k << length of a line, then you know that you >> can only keep a small portion of the line in memory to search the >> regexp. Typically you'd like to access substrings of size k from left >> to right. I guess such a thing should involve buffered inputs and >> avoid copying strings as much as possible. My question is as follows: >> has anybody written a library to access these substrings gracefully >> and with decent performance? Cheers, > > You can use a non-backtracking regexp library to find offsets of the > substrings, then seek in the file to extract them. You can use for > example the libre library from Jérôme Vouillon [1]. It only accept > strings as input but it would be really easy to make it work on input > channels (just replace "s.[pos]" by "input_char ic"). > > [1] http://sourceforge.net/projects/libre/ > https://github.com/avsm/ocaml-re.git > A nice library for regular expression matching is LibTRE (BSD licensed), and it has a way to parse arbitrary data with callbacks: http://laurikari.net/tre/documentation/reguexec/ According to the paper it is also good at finding substring matches with its tagged NFA: http://laurikari.net/ville/regex-submatch.pdf If you don't use back-references (!tre_have_backrefs) then it guarantees linear-time matching. I couldn't find an OCaml wrapper for it, but should be simple enough to write one. Best regards, --Edwin ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Efficient scanning of large strings from files 2012-03-18 21:11 ` Török Edwin @ 2012-03-19 9:11 ` Philippe Veber 0 siblings, 0 replies; 15+ messages in thread From: Philippe Veber @ 2012-03-19 9:11 UTC (permalink / raw) To: Török Edwin; +Cc: caml-list [-- Attachment #1: Type: text/plain, Size: 2262 bytes --] Thanks Edwin and Oliver, I wasn't aware of these libraries. I'll definitely have a look. Thanks again everybody for your help on this! ph. 2012/3/18 Török Edwin <edwin+ml-ocaml@etorok.net> > On 03/16/2012 04:49 PM, Jérémie Dimino wrote: > > Le Fri, 16 Mar 2012 14:03:38 +0100, > > Philippe Veber <philippe.veber@gmail.com> a écrit : > > > >> Say that you'd like to search a regexp on a file with lines so long > >> that you'd rather not load them entirely at once. If you can bound > >> the size of a match by k << length of a line, then you know that you > >> can only keep a small portion of the line in memory to search the > >> regexp. Typically you'd like to access substrings of size k from left > >> to right. I guess such a thing should involve buffered inputs and > >> avoid copying strings as much as possible. My question is as follows: > >> has anybody written a library to access these substrings gracefully > >> and with decent performance? Cheers, > > > > You can use a non-backtracking regexp library to find offsets of the > > substrings, then seek in the file to extract them. You can use for > > example the libre library from Jérôme Vouillon [1]. It only accept > > strings as input but it would be really easy to make it work on input > > channels (just replace "s.[pos]" by "input_char ic"). > > > > [1] http://sourceforge.net/projects/libre/ > > https://github.com/avsm/ocaml-re.git > > > > A nice library for regular expression matching is LibTRE (BSD licensed), > and it has a way to parse arbitrary data with callbacks: > http://laurikari.net/tre/documentation/reguexec/ > > According to the paper it is also good at finding substring matches > with its tagged NFA: > http://laurikari.net/ville/regex-submatch.pdf > > If you don't use back-references (!tre_have_backrefs) then it guarantees > linear-time matching. > > I couldn't find an OCaml wrapper for it, but should be simple enough to > write one. > > Best regards, > --Edwin > > -- > Caml-list mailing list. Subscription management and archives: > https://sympa-roc.inria.fr/wws/info/caml-list > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs > > [-- Attachment #2: Type: text/html, Size: 3398 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Efficient scanning of large strings from files 2012-03-16 13:03 [Caml-list] Efficient scanning of large strings from files Philippe Veber 2012-03-16 14:14 ` Edgar Friendly 2012-03-16 14:49 ` Jérémie Dimino @ 2012-03-16 20:11 ` oliver 2012-03-18 23:56 ` oliver 3 siblings, 0 replies; 15+ messages in thread From: oliver @ 2012-03-16 20:11 UTC (permalink / raw) To: Philippe Veber; +Cc: caml users On Fri, Mar 16, 2012 at 02:03:38PM +0100, Philippe Veber wrote: > Dear camlers, > > Say that you'd like to search a regexp on a file with lines so long that > you'd rather not load them entirely at once. If you can bound the size of a > match by k << length of a line, then you know that you can only keep a > small portion of the line in memory to search the regexp. Typically you'd > like to access substrings of size k from left to right. I guess such a > thing should involve buffered inputs and avoid copying strings as much as > possible. My question is as follows: has anybody written a library to > access these substrings gracefully and with decent performance? > Cheers, > Philippe. [...] I think, the regexp itself also has an impact on how fast and/or how easy this can be achieved. The more complex the Regexp, the more ressources you will need. If you can make your regexp becoming boult down to something easy parseable, the length of lines might be of no importance. Ciao, Oliver ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Efficient scanning of large strings from files 2012-03-16 13:03 [Caml-list] Efficient scanning of large strings from files Philippe Veber ` (2 preceding siblings ...) 2012-03-16 20:11 ` oliver @ 2012-03-18 23:56 ` oliver 3 siblings, 0 replies; 15+ messages in thread From: oliver @ 2012-03-18 23:56 UTC (permalink / raw) To: Philippe Veber; +Cc: caml users On Fri, Mar 16, 2012 at 02:03:38PM +0100, Philippe Veber wrote: > Dear camlers, > > Say that you'd like to search a regexp on a file with lines so long that > you'd rather not load them entirely at once. If you can bound the size of a > match by k << length of a line, then you know that you can only keep a > small portion of the line in memory to search the regexp. Typically you'd > like to access substrings of size k from left to right. I guess such a > thing should involve buffered inputs and avoid copying strings as much as > possible. My question is as follows: has anybody written a library to > access these substrings gracefully and with decent performance? > Cheers, > Philippe. To your question of such a library: I don't know such a lib. I wonder if your lines would fill some GB or RAM...?! Not sure if it matches your question, but if there is no such lib, you maybe want to implement the Regexp-serach by yourself...?! ==> http://swtch.com/~rsc/regexp/regexp1.html For fast input the Buffe Module is really a performance boost, compared to normal string-appending operations. ==> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Buffer.html Ciao, Oliver ^ permalink raw reply [flat|nested] 15+ messages in thread
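oliver's Buffer point is easy to check with a toy comparison; the sketch below contrasts linear Buffer accumulation with quadratic (^) concatenation (timings are machine-dependent, not figures from the thread).

  let with_buffer n =
    let b = Buffer.create n in
    for _i = 1 to n do Buffer.add_char b 'x' done;
    Buffer.contents b

  let with_concat n =
    let s = ref "" in
    for _i = 1 to n do s := !s ^ "x" done;   (* copies the whole string each time *)
    !s

  let () =
    let time f =
      let t0 = Sys.time () in
      ignore (f 50_000);
      Sys.time () -. t0
    in
    Printf.printf "Buffer: %.3fs   (^): %.3fs\n" (time with_buffer) (time with_concat)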
end of thread, other threads:[~2012-03-21 7:22 UTC | newest] Thread overview: 15+ messages -- 2012-03-16 13:03 [Caml-list] Efficient scanning of large strings from files Philippe Veber 2012-03-16 14:14 ` Edgar Friendly 2012-03-16 14:48 ` Philippe Veber 2012-03-16 17:02 ` Edgar Friendly 2012-03-19 9:08 ` Philippe Veber 2012-03-19 13:44 ` Edgar Friendly 2012-03-21 7:21 ` Philippe Veber 2012-03-16 17:23 ` Francois Charles Matthieu Berenger 2012-03-17 16:53 ` oliver 2012-03-19 9:08 ` Philippe Veber 2012-03-16 14:49 ` Jérémie Dimino 2012-03-18 21:11 ` Török Edwin 2012-03-19 9:11 ` Philippe Veber 2012-03-16 20:11 ` oliver 2012-03-18 23:56 ` oliver