* Package with multidimensional AND sparse matrices? @ 2006-07-03 22:55 Andreas Biegert 2000-01-02 4:56 ` [Caml-list] " Jon Harrop 2006-07-03 23:59 ` Markus Mottl 0 siblings, 2 replies; 7+ messages in thread From: Andreas Biegert @ 2006-07-03 22:55 UTC (permalink / raw) To: caml-list Hi, I am a bioinformatics student from Germany and OCaml is my language of choice for my diploma thesis program. So far, everything worked out great, thanks to OCaml allowing for short development cycles. However, no I am at a point that involves computations of very sparse but relatively big multidimensional matrices. The worst case scenario is a 4-dim 2000*2000*2000*2000 float matrix. Here is what I found so far: 1. Bigarray module: The Bigarray module fits perfectly, BUT it does not implement sparse matrix storage. Therefore the worst case as stated above is not feasable memorywise. 2. ocamlfloat package: The ocamlfloat package provides a sparse matrix implmentation for 2-dimensional matrices, but I would need at least 4 dimensions. Does anybody know a package that implements multidimensional AND sparse matrices? In case there is no such package, would it be very hard to write one by myself? Thanks in advance for helping, Andreas ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Caml-list] Package with multidimensional AND sparse matrices? 2006-07-03 22:55 Package with multidimensional AND sparse matrices? Andreas Biegert @ 2000-01-02 4:56 ` Jon Harrop [not found] ` <fd0847170607040038u117d2594r38adcbdf1f00ec04@mail.gmail.com> 2006-07-03 23:59 ` Markus Mottl 1 sibling, 1 reply; 7+ messages in thread From: Jon Harrop @ 2000-01-02 4:56 UTC (permalink / raw) To: caml-list On Monday 03 July 2006 23:55, Andreas Biegert wrote: > The ocamlfloat package provides a sparse matrix implmentation for > 2-dimensional matrices, but I would need at least 4 dimensions. You probably mean "tensor" rather than "matrix". > In case there is no such package, would it be very hard to write one by > myself? Provided performance and memory usage are not too important, it is quite simple to write your own implementation. First, you must decide what requirements you have, i.e. what do you want to do with these tensors? Then, you must choose a data structure to represent your tensors, e.g. a hash table that maps int * int * int * int to float. Finally, you must implement the functions that you need. You could start with something like this: module SparseArray : sig type t val make : int list -> t val get : t -> int list -> float val set : t -> int list -> float -> unit end = struct type t = int list * (int list, float) Hashtbl.t let make n = n, Hashtbl.create 0 let check s n i = List.iter2 (fun n i -> if i<0 || i>=n then invalid_arg s) n i let get (n, m) i = check "SparseArray.get" n i; try Hashtbl.find m i with Not_found -> 0. let set (n, m) i x = check "SparseArray.set" n i; if x<>0. then Hashtbl.replace m i x else try Hashtbl.remove m i with Not_found -> () end;; That should be fine for getting and setting but you probably want to loop over dimensions and perform arithmetic operations, in which case you need a way to jump to the non-zero elements. How many non-zero elements do you have and how long do you expect your calculations to take? -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. Objective CAML for Scientists http://www.ffconsultancy.com/products/ocaml_for_scientists ^ permalink raw reply [flat|nested] 7+ messages in thread
[parent not found: <fd0847170607040038u117d2594r38adcbdf1f00ec04@mail.gmail.com>]
* Re: [Caml-list] Package with multidimensional AND sparse matrices? [not found] ` <fd0847170607040038u117d2594r38adcbdf1f00ec04@mail.gmail.com> @ 2006-07-04 7:46 ` Andreas Biegert 2006-07-04 9:06 ` Martin Jambon 2006-07-04 10:03 ` Jon Harrop 0 siblings, 2 replies; 7+ messages in thread From: Andreas Biegert @ 2006-07-04 7:46 UTC (permalink / raw) To: caml-list Well, in short: the algorithm computes the optimal multiple alignment of up to n protein sequences with maximum length L. The straightforward and theoretically optimal approach to solve this problem is by dynamic programming, therefore the time and space complexity is O(L^n). I managed to come up with a pretty good heuristic that lets me boil down the number of cells that need to be computed giving me a complexitiy of O(L^2) instead of O(L^n), in fact I can even manually adjust the number of nonzero entries in the matrix. In the 2000*2000*2000*2000 matrix only at most 2000 cells will be non-zero. Furthermore for all non-zero cells M(i,j,k,l) holds i<j<k<l. The bad news is that performance is important since the program has to run under 5min. An estimation of the time requirements given the number of sequences n and masimum lengths L is: 24 * L^2 * T_x * 2^(n+1) - 1 with T_x being about 100ns For n=4 and L=2000 I get just about 5min. Would it be a bad idea to use Hashtables all the way instead of matrices? Accessing elements in a hashtable has O(1), right? Andreas 2006/7/4, Andreas Biegert <andreas.biegert@googlemail.com>: > Well, in short: the algorithm computes the optimal multiple alignment > of up to n protein sequences with maximum length L. The > straightforward and theoretically optimal approach to solve this > problem is by dynamic programming, therefore the time and space > complexity is O(L^n). I managed to come up with a pretty good > heuristic that lets me boil down the number of cells that need to be > computed giving me a complexitiy of O(L^2) instead of O(L^n), in fact > I can even manually adjust the number of nonzero entries in the > matrix. In the 2000*2000*2000*2000 matrix only at most 2000 cells will > be non-zero. Furthermore for all non-zero cells M(i,j,k,l) holds > i<j<k<l. The bad news is that performance is important since the > program has to run under 5min. An estimation of the time requirements > given the number of sequences n and masimum lengths L is: > > 24 * L^2 * T_x * 2^(n+1) - 1 > with T_x being about 100ns > > For n=4 and L=2000 I get just about 5min. Would it be a bad idea to > use Hashtables all the way instead of matrices? Accessing elements in > a hashtable has O(1), right? > > Andreas > > > > 2000/1/2, Jon Harrop <jon@ffconsultancy.com>: > > On Monday 03 July 2006 23:55, Andreas Biegert wrote: > > > The ocamlfloat package provides a sparse matrix implmentation for > > > 2-dimensional matrices, but I would need at least 4 dimensions. > > > > You probably mean "tensor" rather than "matrix". > > > > > In case there is no such package, would it be very hard to write one by > > > myself? > > > > Provided performance and memory usage are not too important, it is quite > > simple to write your own implementation. First, you must decide what > > requirements you have, i.e. what do you want to do with these tensors? Then, > > you must choose a data structure to represent your tensors, e.g. a hash table > > that maps int * int * int * int to float. Finally, you must implement the > > functions that you need. > > > > You could start with something like this: > > > > module SparseArray : sig > > type t > > > > val make : int list -> t > > val get : t -> int list -> float > > val set : t -> int list -> float -> unit > > end = struct > > type t = int list * (int list, float) Hashtbl.t > > > > let make n = n, Hashtbl.create 0 > > > > let check s n i = > > List.iter2 (fun n i -> if i<0 || i>=n then invalid_arg s) n i > > > > let get (n, m) i = > > check "SparseArray.get" n i; > > try Hashtbl.find m i with Not_found -> 0. > > > > let set (n, m) i x = > > check "SparseArray.set" n i; > > if x<>0. then Hashtbl.replace m i x else > > try Hashtbl.remove m i with Not_found -> () > > end;; > > > > That should be fine for getting and setting but you probably want to loop over > > dimensions and perform arithmetic operations, in which case you need a way to > > jump to the non-zero elements. > > > > How many non-zero elements do you have and how long do you expect your > > calculations to take? > > > > -- > > Dr Jon D Harrop, Flying Frog Consultancy Ltd. > > Objective CAML for Scientists > > http://www.ffconsultancy.com/products/ocaml_for_scientists > > > > _______________________________________________ > > Caml-list mailing list. Subscription management: > > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list > > Archives: http://caml.inria.fr > > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > > Bug reports: http://caml.inria.fr/bin/caml-bugs > > > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Caml-list] Package with multidimensional AND sparse matrices? 2006-07-04 7:46 ` Andreas Biegert @ 2006-07-04 9:06 ` Martin Jambon 2006-07-04 10:03 ` Jon Harrop 1 sibling, 0 replies; 7+ messages in thread From: Martin Jambon @ 2006-07-04 9:06 UTC (permalink / raw) To: Andreas Biegert; +Cc: caml-list On Tue, 4 Jul 2006, Andreas Biegert wrote: > Well, in short: the algorithm computes the optimal multiple alignment > of up to n protein sequences with maximum length L. The > straightforward and theoretically optimal approach to solve this > problem is by dynamic programming, therefore the time and space > complexity is O(L^n). I managed to come up with a pretty good > heuristic that lets me boil down the number of cells that need to be > computed giving me a complexitiy of O(L^2) instead of O(L^n), in fact > I can even manually adjust the number of nonzero entries in the > matrix. In the 2000*2000*2000*2000 matrix only at most 2000 cells will > be non-zero. Furthermore for all non-zero cells M(i,j,k,l) holds > i<j<k<l. The bad news is that performance is important since the > program has to run under 5min. An estimation of the time requirements > given the number of sequences n and masimum lengths L is: > > 24 * L^2 * T_x * 2^(n+1) - 1 > with T_x being about 100ns > > For n=4 and L=2000 I get just about 5min. Would it be a bad idea to > use Hashtables all the way instead of matrices? Accessing elements in > a hashtable has O(1), right? Yes, if you create it big enough so that it doesn't get resized. The best is that you look at hashtbl.ml in the standard library to see how it is implemented. But I wouldn't worry too much for now: try Hashtbl and you will quickly see if it's fast enough or not. Martin -- Martin Jambon, PhD - http://martin.jambon.free.fr Visit http://wikiomics.org, the Bioinformatics Howto Wiki ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Caml-list] Package with multidimensional AND sparse matrices? 2006-07-04 7:46 ` Andreas Biegert 2006-07-04 9:06 ` Martin Jambon @ 2006-07-04 10:03 ` Jon Harrop 2006-07-04 16:55 ` Andreas Biegert 1 sibling, 1 reply; 7+ messages in thread From: Jon Harrop @ 2006-07-04 10:03 UTC (permalink / raw) To: caml-list On Tuesday 04 July 2006 08:46, Andreas Biegert wrote: > Well, in short: the algorithm computes the optimal multiple alignment > of up to n protein sequences with maximum length L. The > straightforward and theoretically optimal approach to solve this > problem is by dynamic programming, therefore the time and space > complexity is O(L^n). I managed to come up with a pretty good > heuristic that lets me boil down the number of cells that need to be > computed giving me a complexitiy of O(L^2) instead of O(L^n), in fact > I can even manually adjust the number of nonzero entries in the > matrix. In the 2000*2000*2000*2000 matrix only at most 2000 cells will > be non-zero. Furthermore for all non-zero cells M(i,j,k,l) holds > i<j<k<l. The bad news is that performance is important since the > program has to run under 5min. An estimation of the time requirements > given the number of sequences n and masimum lengths L is: > > 24 * L^2 * T_x * 2^(n+1) - 1 > with T_x being about 100ns Ok. What patterns of access will there be? > For n=4 and L=2000 I get just about 5min. Would it be a bad idea to > use Hashtables all the way instead of matrices? No, I don't think so. If you need more performance then you can always change things later. I'd concentrate on getting a working algorithm first. > Accessing elements in > a hashtable has O(1), right? Roughly, yes. :-) -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. Objective CAML for Scientists http://www.ffconsultancy.com/products/ocaml_for_scientists ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Caml-list] Package with multidimensional AND sparse matrices? 2006-07-04 10:03 ` Jon Harrop @ 2006-07-04 16:55 ` Andreas Biegert 0 siblings, 0 replies; 7+ messages in thread From: Andreas Biegert @ 2006-07-04 16:55 UTC (permalink / raw) To: caml-list Since the algorithm computes alignemnts in most casses the pattern of access will be adjacent tuples, for example (i,j,k,l) (i,j,k-1,l-1) etc. But it does not have to be that way. The only contstraint that holds for sure is i<j<k<l. 2006/7/4, Jon Harrop <jon@ffconsultancy.com>: > On Tuesday 04 July 2006 08:46, Andreas Biegert wrote: > > Well, in short: the algorithm computes the optimal multiple alignment > > of up to n protein sequences with maximum length L. The > > straightforward and theoretically optimal approach to solve this > > problem is by dynamic programming, therefore the time and space > > complexity is O(L^n). I managed to come up with a pretty good > > heuristic that lets me boil down the number of cells that need to be > > computed giving me a complexitiy of O(L^2) instead of O(L^n), in fact > > I can even manually adjust the number of nonzero entries in the > > matrix. In the 2000*2000*2000*2000 matrix only at most 2000 cells will > > be non-zero. Furthermore for all non-zero cells M(i,j,k,l) holds > > i<j<k<l. The bad news is that performance is important since the > > program has to run under 5min. An estimation of the time requirements > > given the number of sequences n and masimum lengths L is: > > > > 24 * L^2 * T_x * 2^(n+1) - 1 > > with T_x being about 100ns > > Ok. What patterns of access will there be? > > > For n=4 and L=2000 I get just about 5min. Would it be a bad idea to > > use Hashtables all the way instead of matrices? > > No, I don't think so. If you need more performance then you can always change > things later. I'd concentrate on getting a working algorithm first. > > > Accessing elements in > > a hashtable has O(1), right? > > Roughly, yes. :-) > > -- > Dr Jon D Harrop, Flying Frog Consultancy Ltd. > Objective CAML for Scientists > http://www.ffconsultancy.com/products/ocaml_for_scientists > > _______________________________________________ > Caml-list mailing list. Subscription management: > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list > Archives: http://caml.inria.fr > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Caml-list] Package with multidimensional AND sparse matrices? 2006-07-03 22:55 Package with multidimensional AND sparse matrices? Andreas Biegert 2000-01-02 4:56 ` [Caml-list] " Jon Harrop @ 2006-07-03 23:59 ` Markus Mottl 1 sibling, 0 replies; 7+ messages in thread From: Markus Mottl @ 2006-07-03 23:59 UTC (permalink / raw) To: Andreas Biegert; +Cc: caml-list On 7/3/06, Andreas Biegert <andreas.biegert@googlemail.com> wrote: > Does anybody know a package that implements multidimensional AND > sparse matrices? In case there is no such package, would it be very > hard to write one by myself? I don't know of any OCaml-libraries implementing datastructures and algorithms for sparse tensors. You haven't mentioned anything about the kind of algorithms you want to implement. Maybe the problem can be represented differently so that you can use matrices without too many conceptual hassles. Otherwise I'd first look around for some Fortran libraries that could be interfaced, because implementing stable and efficient numerical algorithms, especially given this kind of data, is pretty hard. Regards, Markus -- Markus Mottl http://www.ocaml.info markus.mottl@gmail.com ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2006-07-04 17:58 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2006-07-03 22:55 Package with multidimensional AND sparse matrices? Andreas Biegert 2000-01-02 4:56 ` [Caml-list] " Jon Harrop [not found] ` <fd0847170607040038u117d2594r38adcbdf1f00ec04@mail.gmail.com> 2006-07-04 7:46 ` Andreas Biegert 2006-07-04 9:06 ` Martin Jambon 2006-07-04 10:03 ` Jon Harrop 2006-07-04 16:55 ` Andreas Biegert 2006-07-03 23:59 ` Markus Mottl
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox