* [Caml-list] Explaining bit sets [not found] <3D70203F.1000106@ozemail.com.au> @ 2002-09-02 9:40 ` Diego Olivier Fernandez Pons 2002-09-02 13:10 ` [Caml-list] Holding a set of random integers of very wide range? john.chu 2002-09-02 23:46 ` [Caml-list] Explaining bit sets Yamagata Yoriyuki 0 siblings, 2 replies; 6+ messages in thread From: Diego Olivier Fernandez Pons @ 2002-09-02 9:40 UTC (permalink / raw) To: John Max Skaller; +Cc: Caml-list Bonjour, I will try to explain how does the bitSet module work. A bit set is just à n-ary tree with n = 30 and an integer data in each node : instead of just having type tree = Empty | Node of int * tree * tree you have type tree = Empty | Node of int * tree array (I really use an other representation which is mosty equivalent, but saves some space and makes algorithms easier : tree = N of int * int array) When you call (insert [1,15,2,3,17] my_set) the only thing you are doing is set to 1 the 17h bit of the integer located in the first tree first level, 15th tree second level, 2nd tree third level, 3rd tree 4th level. Now, all the game is to find a convenient decomposition for each value you want to save, the actual module provide two classical function - big-endian (base 30) example : path 15123 -> [16,24,3] path 15124 -> [16,24,4] (same mask) path 15153 -> [16,25,3] (next node) that means that integer masks will represent numbers as follows [ 0, 1, ..., 30 ] [ 31, 32, ... , 60] etc. - endian-endian (base 30) example : path 15123 -> [3,24,16] path 16023 -> [3,24,17] (same mask) path 15124 -> [4,24,16] (next tree from the root) that means that integer masks will represent numbers as follows [ 0, 30, 60, ..., 900] [ 1, 31, 61, ..., 901] etc. I am working on a third function allowing other modulo representations e.g. [ 0, 5, 10, 15, ... , 150] etc. > Hmmm. I need a lexer that supports unicode, > i.e. 2^24 characters. To make that work, > I need to classify characters (array of 2^24 size is a bit expensive), > but also the lexer needs an efficient representation > of subsets of unicode characters, typically > containining a few ranges (eg 'letter' is a code > in one of about 30 different ranges). You could use big-endian representation of course, but you could even use your own 30-ary representation. Why ? let say you insert 10 ints from 221 456 to 221 466 that is [8,6,1,26] to [8,6,2,6], You will have a degenerated tree with 2 useless nodes (level 1 and 2) since the rest of the tree is empty | | / \ int int Of course, two nodes is not much but modulo 30 operations have a cost. The same data structure with modulo 2 operations for optimal speed acces would look have 5 times more useless nodes (because 30 ~= 32 = 2^5) To have an optimal space occupation you just need to write down your own decomposition function which should have the following properties - injectivity in the domain you are working with - neighbourg conservation - most accessed integers in the top (id est shortest path) There are other solutions based on using common prefixes : - patricia trees - a trie of packed arrays You can find patricia trees in Jean Christophe Fillâtre's home page I have written a small packed arrays module which will soon be in my data structures library (an old buggy beta version is in my home page but will be removed) If you give me some more informations (on typical ranges that you can find in unicode subsets) I will be able to give more precise answers Diego Olivier ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 6+ messages in thread
* [Caml-list] Holding a set of random integers of very wide range? 2002-09-02 9:40 ` [Caml-list] Explaining bit sets Diego Olivier Fernandez Pons @ 2002-09-02 13:10 ` john.chu 2002-09-02 13:25 ` Noel Welsh ` (2 more replies) 2002-09-02 23:46 ` [Caml-list] Explaining bit sets Yamagata Yoriyuki 1 sibling, 3 replies; 6+ messages in thread From: john.chu @ 2002-09-02 13:10 UTC (permalink / raw) To: Caml-list Diego Olivier Fernandez Pons writes: > If you give me some more informations (on typical ranges that you can > find in unicode subsets) I will be able to give more precise answers This brings up a possibly related question that I have. For reasons I tried to explain in a previous draft of this e-mail and has been cut due to the amount of space it took, I need to generate multiple permuted lists of integers ranging from 0 to approximately 2^100 (or more, it's a bit open-ended unfortunately). Since I only need one value at a time, I can use a lazy list for this. i.e., the head holds a value and the tail is a suspension of the state I need to generate the next value. No big deal there. The big deal, for me anyways, is that the state I need is tracking the integers that I've already used (so that I don't generate the same one twice) given that the range of possible values is so large. I'm currently using the very low-tech solution of maintaining a list of ranges of used values. (i.e., when I use a value that would link 2 different ranges, I go coalesce the list. The worst case would be if I picked all the odd integers or all the even integers.) Given the recent talk of Patricia trees and bit sets, I was wondering if some variant of either one of those, or some other representation would be better for what I'm doing? (I figure at the very least I ought to be using a tree of ranges of used values. I'm currently using the Big_int module to represent my integers.) BTW, it's extremely unlikely I will ever need every single value in the lazy list. I will probably have queried the garbage collector and declared defeat long before then. I'm hoping that, at worst, I will need only several thousand values from each permuted list. But I don't want to precompute them since I may not need several thousand values from some lists and others might need more. So I want to generate only the numbers that are necessary. (Sorry, this bit only makes sense if you knew what I'm using this for. The short description would be backtracking to find a set of variable values which satisfy a bunch of contraints where there can be multiple possible solutions of which I want a random one where variable values can be mapped onto a potentially very large range of integers. I'm applying as many constraints before backtracking as possible to reduce the state space to search.) Thanks. john ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Caml-list] Holding a set of random integers of very wide range? 2002-09-02 13:10 ` [Caml-list] Holding a set of random integers of very wide range? john.chu @ 2002-09-02 13:25 ` Noel Welsh 2002-09-02 17:18 ` Tim Freeman 2002-09-02 21:30 ` Berke Durak 2 siblings, 0 replies; 6+ messages in thread From: Noel Welsh @ 2002-09-02 13:25 UTC (permalink / raw) To: john.chu, Caml-list --- john.chu@east.sun.com wrote: > For reasons I tried to explain in a previous draft > of this e-mail and > has been cut due to the amount of space it took, I > need to generate > multiple permuted lists of integers ranging from 0 > to approximately > 2^100 (or more, it's a bit open-ended > unfortunately). How about using a linear-congruential random number generators. These simple algorithms will generate numbers that don't repeat for a long period of time (typically they use 32bit integers but I'm sure you'll be able to find algorithms that generate numbers in a larger range). You can generate permutations of by setting the seeds to different values. HTH, Noel __________________________________________________ Do You Yahoo!? Yahoo! Finance - Get real-time stock quotes http://finance.yahoo.com ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Caml-list] Holding a set of random integers of very wide range? 2002-09-02 13:10 ` [Caml-list] Holding a set of random integers of very wide range? john.chu 2002-09-02 13:25 ` Noel Welsh @ 2002-09-02 17:18 ` Tim Freeman 2002-09-02 21:30 ` Berke Durak 2 siblings, 0 replies; 6+ messages in thread From: Tim Freeman @ 2002-09-02 17:18 UTC (permalink / raw) To: john.chu; +Cc: Caml-list >For reasons I tried to explain in a previous draft of this e-mail and >has been cut due to the amount of space it took, I need to generate >multiple permuted lists of integers ranging from 0 to approximately >2^100 (or more, it's a bit open-ended unfortunately). > >Since I only need one value at a time, I can use a lazy list for >this. i.e., the head holds a value and the tail is a suspension of the >state I need to generate the next value. No big deal there. > >The big deal, for me anyways, is that the state I need is tracking the >integers that I've already used (so that I don't generate the same one >twice) given that the range of possible values is so large. If your integers have enough bits, then a list of random numbers is indistinguishable from a random permutation. This assumes you only have time to look at a reasonable number of values from the permutation. Do the math to figure out what "big enough" is. For example, in the past year I designed a scheme that assumed that two files were identical if and only if their 64-bit checksums were identical, and there were only a million or so files, so I did the math and concluded that the chances of a collision were small enough. It worked fine. If you're paranoid then use the MD5 checksums in the Digest module to generate random numbers; otherwise you can use a linear congruential pseudo-random number generator as mentioned by another poster and hope that there's no interesting interaction between the number generation scheme and the other details of your algorithm. >I'm currently using the very low-tech solution of maintaining a list >of ranges of used values. (i.e., when I use a value that would link 2 >different ranges, I go coalesce the list. The worst case would be if >I picked all the odd integers or all the even integers.) > >Given the recent talk of Patricia trees and bit sets, I was wondering >if some variant of either one of those, or some other representation >would be better for what I'm doing? (I figure at the very least I >ought to be using a tree of ranges of used values. I'm currently >using the Big_int module to represent my integers.) If your integers are large enough, then you can ignore the possiblity of a collision and no data structure is needed. Otherwise, is there a reason not to use a hash table? -- Tim Freeman tim@fungible.com GPG public key fingerprint ECDF 46F8 3B80 BB9E 575D 7180 76DF FE00 34B1 5C78 ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Caml-list] Holding a set of random integers of very wide range? 2002-09-02 13:10 ` [Caml-list] Holding a set of random integers of very wide range? john.chu 2002-09-02 13:25 ` Noel Welsh 2002-09-02 17:18 ` Tim Freeman @ 2002-09-02 21:30 ` Berke Durak 2 siblings, 0 replies; 6+ messages in thread From: Berke Durak @ 2002-09-02 21:30 UTC (permalink / raw) To: john.chu; +Cc: caml-list On Mon, Sep 02, 2002 at 09:10:18AM -0400, john.chu@east.sun.com wrote: > For reasons I tried to explain in a previous draft of this e-mail and > has been cut due to the amount of space it took, I need to generate > multiple permuted lists of integers ranging from 0 to approximately > 2^100 (or more, it's a bit open-ended unfortunately). You can solve that problem in constant space by encrypting a counter using a convenient block cipher (such as RC6 or idea). Since encryption is bijective, you are assured of not hitting the same integer twice. You have to somehow encode and decode integers, but in runs in constant space and time. -- Berke ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Caml-list] Explaining bit sets 2002-09-02 9:40 ` [Caml-list] Explaining bit sets Diego Olivier Fernandez Pons 2002-09-02 13:10 ` [Caml-list] Holding a set of random integers of very wide range? john.chu @ 2002-09-02 23:46 ` Yamagata Yoriyuki 1 sibling, 0 replies; 6+ messages in thread From: Yamagata Yoriyuki @ 2002-09-02 23:46 UTC (permalink / raw) To: skaller; +Cc: Caml-list From: Diego Olivier Fernandez Pons <Diego-Olivier.FERNANDEZ-PONS@cicrp.jussieu.fr> Subject: [Caml-list] Explaining bit sets Date: Mon, 2 Sep 2002 11:40:42 +0200 (DST) > > Hmmm. I need a lexer that supports unicode, > > i.e. 2^24 characters. To make that work, > > I need to classify characters (array of 2^24 size is a bit expensive), > > but also the lexer needs an efficient representation > > of subsets of unicode characters, typically > > containining a few ranges (eg 'letter' is a code > > in one of about 30 different ranges). I wrote something for the same purpose (Unicode -> foo tables), which might be useful. The relevant modules are tbl.ml[i], bitsvect.ml[i], bytesvect.ml[i] in http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/camomile/camomile/internal There are 6 table types, 'a Tbl32.rw, 'a Tbl32.t, which are tables from int to 'a, Tbl32.bits_rw, Tbl32.bits which are from int to int (< 256), Tbl32.bytes_rw, Tbl32.bytes which are from int to int. Tbl32.*rw are internally tries whose nodes have 256 branches. Terminal nodes are arrays (Tbl32.rw) or bits vectors (Tbl32.bits_rw) or bytes sequences (Tbl32.bytes_rw). Inplace-update is possible for these types. 'a Tbl32.t (and Tbl32.bits, Tbl32.bytes) are read-only. They are internally 4-dimensional arrays, so access should be fast. Tbl32.rw_to_ro creates Tbl32.t from Tbl32.rw. (There are similar functions for Tbl32.bits and Tbl32.bytes.) It scans the whole trie and make identical nodes shared, so that table size is small. For example, the size of Unicode character category table (Classification of Unicode characters into about 30 categories) becomes 16K. (btw, Unicode is 21-bits, not 24-bits.) -- Yamagata Yoriyuki http://www.mars.sphere.ne.jp/yoriyuki/ PGP fingerprint = 0374 5290 7445 4C06 D79E AA86 1A91 48CB 2B4E 34CF ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2002-09-02 23:42 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <3D70203F.1000106@ozemail.com.au> 2002-09-02 9:40 ` [Caml-list] Explaining bit sets Diego Olivier Fernandez Pons 2002-09-02 13:10 ` [Caml-list] Holding a set of random integers of very wide range? john.chu 2002-09-02 13:25 ` Noel Welsh 2002-09-02 17:18 ` Tim Freeman 2002-09-02 21:30 ` Berke Durak 2002-09-02 23:46 ` [Caml-list] Explaining bit sets Yamagata Yoriyuki
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox