From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id QAA17782; Wed, 9 Apr 2003 16:04:18 +0200 (MET DST) X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id QAA17746 for ; Wed, 9 Apr 2003 16:04:16 +0200 (MET DST) Received: from omta01.mta.everyone.net (sitemail3.everyone.net [216.200.145.37]) by concorde.inria.fr (8.11.1/8.11.1) with ESMTP id h39E4EX28451 for ; Wed, 9 Apr 2003 16:04:15 +0200 (MET DST) Received: from sitemail.everyone.net (dsnat [216.200.145.62]) by omta01.mta.everyone.net (Postfix) with ESMTP id CEA831C444D for ; Wed, 9 Apr 2003 07:04:13 -0700 (PDT) Received: by sitemail.everyone.net (Postfix, from userid 99) id 61092400E; Wed, 9 Apr 2003 07:04:13 -0700 (PDT) Content-Type: text/plain Content-Disposition: inline Content-Transfer-Encoding: 7bit Mime-Version: 1.0 X-Mailer: MIME-tools 5.41 (Entity 5.404) Date: Wed, 9 Apr 2003 07:04:12 -0700 (PDT) From: "Dr.Dr.Ruediger M.Flaig" To: caml-list@inria.fr Subject: Re: Re: Re: [Caml-list] newbie questions Reply-To: flaig@hallucinogene.sanctacaris.net X-Originating-Ip: [129.206.124.135] Message-Id: <20030409140413.61092400E@sitemail.everyone.net> X-Spam: no; 0.00; caml-list:01 newbie:01 annotated:01 annotations:01 shorter:01 seq:01 cta:99 acg:99 temperature:99 recursively:01 monniaux:01 biblio:99 lncs:01 marcus:01 heidelberg:99 Sender: owner-caml-list@pauillac.inria.fr Precedence: bulk > Please define "fast processing of large amounts of data". This can mean > widely different things. I am working with DNA, so my idea of large amounts of data is a huge annotated sequence file. Well, the annotations are not a problem, they are small by comparison and may be kept separate... so what remains is a simple trail of up to several millions of base pairs -- indexed 2-bit elements, strictly speaking, but usually dealt with as bytes. Imagine I want to do the following: In order to plan an experiment, I have to find to which positions of a long DNA sequence a shorter one may bind under certain circumstances... there are approximations for that, but lab experience shows that they just don't work properly for real life. So the most reliable thing to do is: for all possible subsets of both sequences, calculate their affinity: Seq 1 = ggatcggctaag -> Subsets: ggatcggctaa, ggatcggcta, ggatcggct, ggatcggc, ..., gg, gatcggctaag, gatcggctaa, gatcggcta, gatcggc, ..., ga, ..., ctaa, cta, ct, taa, ta . Seq 2 = aacgtaa -> Subsets: aacgta, aacgt, aacg, ..., aa, acgta, acgt, acg, ac, ..., taa, ta, aa . Match ggatcggctaa with aacgta, aacgt, aacg, ..., aa, acgta, acgt, acg, ac, ..., taa, ta, aa ; match ggatcggcta with aacgta, aacgt, aacg, ..., aa, acgta, acgt, acg, ac, ..., taa, ta, aa ; ......... ; match ta with aa. where "match" means: calculate the maximal temperature at which these sequences may bind to each other. Okay, getting this done either recursively or iteratively is freshman level programming, and you can add lots of "cutoffs" to reduce the work load, but getting it done FAST is quite a different matter when one of the sequences is in the megabyte range... > And figure out how you can minimize the rate of new object > creation. No new object creation is needed at all, if all this is done by indexing... (I have followed the thread about GC efficiency) > If you are dealing with matrices (numerical analysis...), yes, probably > you want Array's or Bigarray's. > > Otherwise, even for structures mapping an integer range to values, arrays > may not be the best choice. I have in mind a particular example where we > used a balanced binary map from integers to values, because this allowed > implementing certain optimizations (see section 6.2 of > http://www.di.ens.fr/~monniaux/biblio/Static_analyzer_LNCS2566.pdf ). Yup, that sounds very interesting. I'll have a look. Yours, Ruediger Dr. Dr. Ruediger Marcus Flaig Institute for Immunology University of Heidelberg Im Neuenheimer Feld 305 D-69120 Heidelberg Tel. +49-172-7652946 Fax +49-4075110-17171 _____________________________________________________________ Free eMail .... the way it should be.... http://www.hablas.com _____________________________________________________________ Select your own custom email address for FREE! Get you@yourchoice.com w/No Ads, 6MB, POP & more! http://www.everyone.net/selectmail?campaign=tag ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners