2012/3/19 Edgar Friendly <thelema314@gmail.com>
On 03/19/2012 05:08 AM, Philippe Veber wrote:
Thanks Edgar and Jérémie, this indeed seems to be the right track. I
just hope that a repeated use of input_char is not 10-100X slower than
input_line :o).
ph.

Quite true - instead of giving the matcher just a single byte at a time, it is more efficient to give it blocks of data, as long as it can keep its state from one block to the next.  But internally it will normally still match one byte at a time.
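
Something like this, roughly (untested; [feed] and its state type are just placeholders for whatever incremental matcher you end up using):

  (* Read a channel in fixed-size blocks and hand each block to an
     incremental matcher, threading its state from one block to the
     next.  [feed] is a placeholder: it is assumed to scan the chunk
     and return the updated matcher state. *)
  let scan_channel ~feed ~init ic =
    let buf = Bytes.create 65536 in
    let rec loop state =
      let n = input ic buf 0 (Bytes.length buf) in
      if n = 0 then state                              (* end of input *)
      else loop (feed state (Bytes.sub_string buf 0 n))
    in
    loop init
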
Thanks for the confirmation, I now see more clearly what to do.
 
I guess with DNA, because of the reduced character set, it'd be possible to get each symbol down to 2 bits (if you're really just using ACGT), in which case the matcher could run 4 basepairs at a time, but there are a lot of corner cases in doing things that way.  A lot depends on how much time and effort you're willing to spend engineering something.
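
The packing itself is easy enough - untested sketch, with an arbitrary A=0, C=1, G=2, T=3 encoding:

  (* Pack a string over the alphabet ACGT into 2 bits per base,
     four bases per byte; the last byte is zero-padded. *)
  let code = function
    | 'A' | 'a' -> 0 | 'C' | 'c' -> 1
    | 'G' | 'g' -> 2 | 'T' | 't' -> 3
    | c -> invalid_arg (Printf.sprintf "not a base: %c" c)

  let pack dna =
    let n = String.length dna in
    let packed = Bytes.make ((n + 3) / 4) '\000' in
    String.iteri
      (fun i c ->
         let byte = i / 4 and shift = 2 * (i mod 4) in
         let old = Char.code (Bytes.get packed byte) in
         Bytes.set packed byte (Char.chr (old lor (code c lsl shift))))
      dna;
    packed
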
Maybe not that far yet, but this is something we've mentioned for biocaml. I guess I could take some inspiration from the bitset module in Batteries.
Anyway thanks everybody for your help!
ph.
 

E.

2012/3/16 Edgar Friendly <thelema314@gmail.com>


   So given a large file and a line number, you want to:
   1) extract that line from the file
   2) produce an enum of all k-length slices of that line?
   3) match each slice against your regexp set to produce a list/enum
   of substrings that match the regexps?
   Without reading the whole line into memory at once.

   I'm with Dimino on the right solution - just use a matcher that
   works incrementally, feed it one byte at a time, and have it return
   a list of match offsets.  Then work backwards from these endpoints
   to figure out which substrings you want.
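
   Untested sketch of the interface I have in mind, for a single
   literal pattern (a real matcher would compile the whole pattern set
   to a DFA, but the shape of [step] is the point):

   (* [step pat partials pos c] consumes the character [c] found at
      absolute offset [pos].  [partials] is the list of prefix lengths
      of [pat] matched so far.  It returns the surviving partial
      matches together with the end offsets (exclusive) of any matches
      completed by this character; the caller recovers the start of a
      match as end offset minus String.length pat. *)
   let step pat partials pos c =
     let candidates = 0 :: partials in     (* a match may start at [pos] *)
     List.fold_left
       (fun (active, ends) len ->
          if pat.[len] = c then
            if len + 1 = String.length pat
            then (active, (pos + 1) :: ends)    (* complete match *)
            else (len + 1 :: active, ends)      (* still in progress *)
          else (active, ends))                  (* candidate dies *)
       ([], []) candidates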

   There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it
   should suffice to use (0,k-1) and (k,2k-1) with an incremental
   matching routine.

   E.



   On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber
   <philippe.veber@gmail.com> wrote:

       Thank you Edgar for your answer (and also Christophe). It seems
       my question was a bit misleading: actually I target a subset of
       regexps whose matching is really trivial, so this is no worry
        for me. I was more interested in how to access a large line in a
        file in chunks of fixed length k. For instance, how to build a
       [Substring.t Enum.t] from some line in a file, without building
       the whole line in memory. This enum would yield the substrings
       (0,k-1), (1,k), (2,k+1), etc ... without doing too many string
       copy/concat operations. I think I can do it myself but I'm not
       too confident regarding good practices on buffered reads of
       files. Maybe there are some good examples in Batteries?
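
        Something like the following is what I have in mind: an
        untested, stdlib-only sketch that hands each window of length k
        to a callback instead of building an Enum (it recopies the
        window at every step, so it is certainly not optimal):

        (* Slide a window of length [k] along the current line of [ic],
           applying [f] to each window.  input_char does the buffered
           reading; we stop at the end of the line or of the file. *)
        let iter_windows k f ic =
          let next () = try Some (input_char ic) with End_of_file -> None in
          let rec fill acc n =
            if n = 0 then Some acc
            else match next () with
              | Some c when c <> '\n' -> fill (acc ^ String.make 1 c) (n - 1)
              | _ -> None
          in
          let rec slide win =
            f win;
            match next () with
            | Some c when c <> '\n' ->
              slide (String.sub win 1 (k - 1) ^ String.make 1 c)
            | _ -> ()
          in
          match fill "" k with
          | Some win -> slide win
          | None -> ()

        I suppose an Enum version would just wrap the same loop with
        Enum.from, but I may be missing a simpler way to do it with
        Batteries.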

       Thanks again,
          ph.