2012/3/16 Edgar Friendly > On 03/16/2012 09:03 AM, Philippe Veber wrote: > >> Dear camlers, >> >> Say that you'd like to search a regexp on a file with lines so long that >> you'd rather not load them entirely at once. If you can bound the size >> of a match by k << length of a line, then you know that you can only >> keep a small portion of the line in memory to search the regexp. >> Typically you'd like to access substrings of size k from left to right. >> I guess such a thing should involve buffered inputs and avoid copying >> strings as much as possible. My question is as follows: has anybody >> written a library to access these substrings gracefully and with decent >> performance? >> Cheers, >> Philippe. >> >> This is tricky to do, as incremental matching implies DFA-based > matching, but returning matching substrings implies backtrack-based > matching. The re2 library returns matching substrings by matching forward > to find the end of patterns, and then matching backwards on the reversed > regex from that point to find their beginning. I don't know if even it > supports this on incremental input. > > Subject to the assumption that starting to match implies either finishing > or aborting a match (i.e. once you've started to match your regex, no other > match will start before either this match attempt finishes successful or > not), this is not unreasonable to do. Without this assumption, it requires > tracking many match start locations and somehow knowing which of them match > or fail to match. I'm not sure this has been done before. > > E. > Thank you Edgar for your answer (and also Christophe). It seems my question was a bit misleading: actually I target a subset of regexps whose matching is really trivial, so this is no worry for me. I was more interested in how accessing a large line in a file by chunks of fixed length k. For instance how to build a [Substring.t Enum.t] from some line in a file, without building the whole line in memory. This enum would yield the substrings (0,k-1), (1,k), (2,k+1), etc ... without doing too many string copy/concat operations. I think I can do it myself but I'm not too confident regarding good practices on buffered reads of files. Maybe there are some good examples in Batteries? Thanks again, ph.