From: Jon Harrop <jon@ffconsultancy.com>
To: caml-list <caml-list@inria.fr>
Subject: Ropes: progressively crazy ideas
Date: Mon, 13 Aug 2007 22:23:42 +0100 [thread overview]
Message-ID: <200708132223.43128.jon@ffconsultancy.com> (raw)
I decided to have another go at curing cancer over the weekend. Downloaded the
human genome sequence from UCSC and started reading papers on suffix trees.
The data is primarily a sequence of A, C, G and T, of course. However, there
are also sparse entries of other characters, primarily N. You really want to
store the bulk of the data as a dense array of bit pairs for the four most
common characters and then store the exceptional cases in a sparse data
structure.
Perhaps it would be useful if your Vect implementation could be abstracted
over different concrete leaf implementations, such as bit vectors? Presumably
leaf size should be measured in bytes rather than elements?
Also, I've noticed a correlation between the purely functional implementations
of the linear-time suffix tree generation algorithms, Huet's zippers and my
work that I mentioned before on lazy metadata in nodes.
I'm toying with the idea of a sequence type based on Vect that lazily computes
its own suffix tree. You would need some way to compose suffix trees when
sequences are combined (e.g. concatenation of sequences) but the result could
be very useful.
For example, Mathematica's pattern matcher allows you to search for
subsequences:
f[{front___, x, y, z, back___}] := {front, back}
This is currently implemented completely naively by recursive linear searches
of an array. If the sequence type could generate and reuse suffix trees then
subsequence matches should be vastly more efficient. The ability to search
for subsequences is the main reason why Mathematica's pattern matches are not
optimized as ML's are.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
OCaml for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists/?e
reply other threads:[~2007-08-13 21:32 UTC|newest]
Thread overview: [no followups] expand[flat|nested] mbox.gz Atom feed
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=200708132223.43128.jon@ffconsultancy.com \
--to=jon@ffconsultancy.com \
--cc=caml-list@inria.fr \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox