* [Caml-list] Specify the default hash function for a type @ 2016-05-13 7:52 Soegtrop, Michael 2016-05-13 8:13 ` Ben Millwood 2016-05-13 9:06 ` Alain Frisch 0 siblings, 2 replies; 16+ messages in thread From: Soegtrop, Michael @ 2016-05-13 7:52 UTC (permalink / raw) To: caml-list [-- Attachment #1: Type: text/plain, Size: 918 bytes --] Dear Ocaml Users, I wonder if there is a way to specify a default hash function for a type, which is then used by the automated hash functions e.g. created for records containing this type. I have e.g. a record with about 10 fields and I am happy with the default hash functions for most fields and for the record, but for just one field I would like to use a custom hash function, but I don't want to write a hash function for the record. I am currently using the standard library, but if there is a solution with Batteries or Core, I would also be interested. Best regards, Michael Intel Deutschland GmbH Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany Tel: +49 89 99 8853-0, www.intel.de Managing Directors: Christin Eisenschmid, Christian Lamprechter Chairperson of the Supervisory Board: Nicole Lau Registered Office: Munich Commercial Register: Amtsgericht Muenchen HRB 186928 [-- Attachment #2: Type: text/html, Size: 2891 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Caml-list] Specify the default hash function for a type 2016-05-13 7:52 [Caml-list] Specify the default hash function for a type Soegtrop, Michael @ 2016-05-13 8:13 ` Ben Millwood 2016-05-13 8:40 ` Soegtrop, Michael 2016-05-13 9:06 ` Alain Frisch 1 sibling, 1 reply; 16+ messages in thread From: Ben Millwood @ 2016-05-13 8:13 UTC (permalink / raw) To: Soegtrop, Michael; +Cc: caml-list [-- Attachment #1: Type: text/plain, Size: 1499 bytes --] Polymorphic hash operates without type information, so it's not able to tell that it's hashing your record type vs. some other type with the same structure. So I would guess that what you want is not possible. The Core team is working on a ppx extension analogous to ppx_sexp_conv and ppx_compare which generates hash functions automatically, and would make this kind of thing possible, but it's not done yet. On 13 May 2016 at 15:52, Soegtrop, Michael <michael.soegtrop@intel.com> wrote: > Dear Ocaml Users, > > > > I wonder if there is a way to specify a default hash function for a type, > which is then used by the automated hash functions e.g. created for records > containing this type. I have e.g. a record with about 10 fields and I am > happy with the default hash functions for most fields and for the record, > but for just one field I would like to use a custom hash function, but I > don’t want to write a hash function for the record. > > > > I am currently using the standard library, but if there is a solution with > Batteries or Core, I would also be interested. > > > > Best regards, > > > > Michael > > Intel Deutschland GmbH > Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany > Tel: +49 89 99 8853-0, www.intel.de > Managing Directors: Christin Eisenschmid, Christian Lamprechter > Chairperson of the Supervisory Board: Nicole Lau > Registered Office: Munich > Commercial Register: Amtsgericht Muenchen HRB 186928 > [-- Attachment #2: Type: text/html, Size: 2316 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: [Caml-list] Specify the default hash function for a type 2016-05-13 8:13 ` Ben Millwood @ 2016-05-13 8:40 ` Soegtrop, Michael 0 siblings, 0 replies; 16+ messages in thread From: Soegtrop, Michael @ 2016-05-13 8:40 UTC (permalink / raw) To: Ben Millwood; +Cc: caml-list [-- Attachment #1: Type: text/plain, Size: 1093 bytes --] Dear Ben, thanks a lot for the quick answer! I am looking forward to the PPX. Best regards, Michael From: caml-list-request@inria.fr [mailto:caml-list-request@inria.fr] On Behalf Of Ben Millwood Sent: Friday, May 13, 2016 10:14 AM To: Soegtrop, Michael Cc: caml-list@inria.fr Subject: Re: [Caml-list] Specify the default hash function for a type Polymorphic hash operates without type information, so it's not able to tell that it's hashing your record type vs. some other type with the same structure. So I would guess that what you want is not possible. The Core team is working on a ppx extension analogous to ppx_sexp_conv and ppx_compare which generates hash functions automatically, and would make this kind of thing possible, but it's not done yet. Intel Deutschland GmbH Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany Tel: +49 89 99 8853-0, www.intel.de Managing Directors: Christin Eisenschmid, Christian Lamprechter Chairperson of the Supervisory Board: Nicole Lau Registered Office: Munich Commercial Register: Amtsgericht Muenchen HRB 186928 [-- Attachment #2: Type: text/html, Size: 4807 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Caml-list] Specify the default hash function for a type 2016-05-13 7:52 [Caml-list] Specify the default hash function for a type Soegtrop, Michael 2016-05-13 8:13 ` Ben Millwood @ 2016-05-13 9:06 ` Alain Frisch 2016-05-13 9:26 ` Soegtrop, Michael 2016-05-13 13:50 ` Pierre Chambart 1 sibling, 2 replies; 16+ messages in thread From: Alain Frisch @ 2016-05-13 9:06 UTC (permalink / raw) To: Soegtrop, Michael, caml-list Dear Michael, Don't tell anybody (ahem), but the following might work depending on your exact use case... If you define a type such as: type t = { foo: ....; id: int; } and if you set the tag of values for this type to Obj.record_tag (= 248) with Obj.set_tag, then polymorphic hash *and* comparison functions will use the "id" field and ignore the "foo" field (and any other extra fields). It is important to keep "id" as the second field to mimic the layout of real object values (and exception slots). In particular, this lets you easily "forget" sub-parts of a value (by setting "id" to a fixed constant), or implement "references compared with physical equality" (use a counter to assign unique "id" to each such reference). Best regards, Alain On 13/05/2016 09:52, Soegtrop, Michael wrote: > Dear Ocaml Users, > > > > I wonder if there is a way to specify a default hash function for a > type, which is then used by the automated hash functions e.g. created > for records containing this type. I have e.g. a record with about 10 > fields and I am happy with the default hash functions for most fields > and for the record, but for just one field I would like to use a custom > hash function, but I don’t want to write a hash function for the record. > > > > I am currently using the standard library, but if there is a solution > with Batteries or Core, I would also be interested. > > > > Best regards, > > > > Michael > > Intel Deutschland GmbH > Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany > Tel: +49 89 99 8853-0, www.intel.de > Managing Directors: Christin Eisenschmid, Christian Lamprechter > Chairperson of the Supervisory Board: Nicole Lau > Registered Office: Munich > Commercial Register: Amtsgericht Muenchen HRB 186928 > ^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: [Caml-list] Specify the default hash function for a type 2016-05-13 9:06 ` Alain Frisch @ 2016-05-13 9:26 ` Soegtrop, Michael 2016-05-13 12:01 ` Gabriel Scherer 2016-05-13 13:50 ` Pierre Chambart 1 sibling, 1 reply; 16+ messages in thread From: Soegtrop, Michael @ 2016-05-13 9:26 UTC (permalink / raw) To: Alain Frisch, caml-list Dear Alain, this are temptations on Friday morning ... I guess as a beginner I should resist, not be lazy, and write the hash function by hand. But I put it into appendix D of my OCaml notes ;-) Best regards, Michael > -----Original Message----- > From: Alain Frisch [mailto:alain.frisch@lexifi.com] > Sent: Friday, May 13, 2016 11:06 AM > To: Soegtrop, Michael; caml-list@inria.fr > Subject: Re: [Caml-list] Specify the default hash function for a type > > Dear Michael, > > Don't tell anybody (ahem), but the following might work depending on your > exact use case... > > If you define a type such as: > > type t = { > foo: ....; > id: int; > } > > and if you set the tag of values for this type to Obj.record_tag (= 248) with > Obj.set_tag, then polymorphic hash *and* comparison functions will use the > "id" field and ignore the "foo" field (and any other extra fields). It is > important to keep "id" as the second field to mimic the layout of real object > values (and exception slots). > > In particular, this lets you easily "forget" sub-parts of a value (by setting "id" > to a fixed constant), or implement "references compared with physical > equality" (use a counter to assign unique "id" to each such reference). > > > Best regards, > > Alain Intel Deutschland GmbH Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany Tel: +49 89 99 8853-0, www.intel.de Managing Directors: Christin Eisenschmid, Christian Lamprechter Chairperson of the Supervisory Board: Nicole Lau Registered Office: Munich Commercial Register: Amtsgericht Muenchen HRB 186928 ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Caml-list] Specify the default hash function for a type 2016-05-13 9:26 ` Soegtrop, Michael @ 2016-05-13 12:01 ` Gabriel Scherer 2016-05-13 12:23 ` Alain Frisch 2016-05-13 12:32 ` Soegtrop, Michael 0 siblings, 2 replies; 16+ messages in thread From: Gabriel Scherer @ 2016-05-13 12:01 UTC (permalink / raw) To: Soegtrop, Michael; +Cc: Alain Frisch, caml-list From the C side it is possible to create "custom" values that come with user-specified comparison and hashing functions. However, for now the runtime does not allow creating them from OCaml -- this is indeed a desirable feature, but it's not trivial to implement. I don't really whether it's actually "really hard", or just that it requires some effort that nobody has invested yet -- the latter would be my guess. If you don't mind going through a C stubs to do this, you could thus create the custom value in C land, and export that in OCaml land. See the manual: Interfacing C with OCaml > custom blocks http://caml.inria.fr/pub/docs/manual-ocaml-4.03/intfc.html#sec442 On Fri, May 13, 2016 at 5:26 AM, Soegtrop, Michael <michael.soegtrop@intel.com> wrote: > Dear Alain, > > this are temptations on Friday morning ... I guess as a beginner I should resist, not be lazy, and write the hash function by hand. But I put it into appendix D of my OCaml notes ;-) > > Best regards, > > Michael > >> -----Original Message----- >> From: Alain Frisch [mailto:alain.frisch@lexifi.com] >> Sent: Friday, May 13, 2016 11:06 AM >> To: Soegtrop, Michael; caml-list@inria.fr >> Subject: Re: [Caml-list] Specify the default hash function for a type >> >> Dear Michael, >> >> Don't tell anybody (ahem), but the following might work depending on your >> exact use case... >> >> If you define a type such as: >> >> type t = { >> foo: ....; >> id: int; >> } >> >> and if you set the tag of values for this type to Obj.record_tag (= 248) with >> Obj.set_tag, then polymorphic hash *and* comparison functions will use the >> "id" field and ignore the "foo" field (and any other extra fields). It is >> important to keep "id" as the second field to mimic the layout of real object >> values (and exception slots). >> >> In particular, this lets you easily "forget" sub-parts of a value (by setting "id" >> to a fixed constant), or implement "references compared with physical >> equality" (use a counter to assign unique "id" to each such reference). >> >> >> Best regards, >> >> Alain > Intel Deutschland GmbH > Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany > Tel: +49 89 99 8853-0, www.intel.de > Managing Directors: Christin Eisenschmid, Christian Lamprechter > Chairperson of the Supervisory Board: Nicole Lau > Registered Office: Munich > Commercial Register: Amtsgericht Muenchen HRB 186928 > > > -- > Caml-list mailing list. Subscription management and archives: > https://sympa.inria.fr/sympa/arc/caml-list > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Caml-list] Specify the default hash function for a type 2016-05-13 12:01 ` Gabriel Scherer @ 2016-05-13 12:23 ` Alain Frisch 2016-05-13 12:32 ` Soegtrop, Michael 1 sibling, 0 replies; 16+ messages in thread From: Alain Frisch @ 2016-05-13 12:23 UTC (permalink / raw) To: Gabriel Scherer, Soegtrop, Michael; +Cc: caml-list On 13/05/2016 14:01, Gabriel Scherer wrote: >>From the C side it is possible to create "custom" values that come > with user-specified comparison and hashing functions. However, for now > the runtime does not allow creating them from OCaml -- this is indeed > a desirable feature, but it's not trivial to implement. I don't really > whether it's actually "really hard", or just that it requires some > effort that nobody has invested yet -- the latter would be my guess. The problem is that the custom comparison function could move objects in the heap if it triggers a GC, and the polymorphic functions (compare/hash) would need to take this into account, i.e. register all "pending" nodes as GC roots. This could slow them down quite a bit. Alain ^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: [Caml-list] Specify the default hash function for a type 2016-05-13 12:01 ` Gabriel Scherer 2016-05-13 12:23 ` Alain Frisch @ 2016-05-13 12:32 ` Soegtrop, Michael 1 sibling, 0 replies; 16+ messages in thread From: Soegtrop, Michael @ 2016-05-13 12:32 UTC (permalink / raw) To: Gabriel Scherer; +Cc: Alain Frisch, caml-list Dear Gabriel, Alain, thanks for the hints! After I applied my program to a larger data set, I found that, contrary to my initial assessment, the polymorphic hash function for the complete record is not good in my case. The hash stats show that the largest bucket has more items in it than I have single item buckets - not a good sign. So I anyway have to write a complete custom hash function. Is there some documentation on what the polymorphic hash function does? Best regards, Michael Intel Deutschland GmbH Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany Tel: +49 89 99 8853-0, www.intel.de Managing Directors: Christin Eisenschmid, Christian Lamprechter Chairperson of the Supervisory Board: Nicole Lau Registered Office: Munich Commercial Register: Amtsgericht Muenchen HRB 186928 ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Caml-list] Specify the default hash function for a type 2016-05-13 9:06 ` Alain Frisch 2016-05-13 9:26 ` Soegtrop, Michael @ 2016-05-13 13:50 ` Pierre Chambart 2016-05-13 13:56 ` Alain Frisch 1 sibling, 1 reply; 16+ messages in thread From: Pierre Chambart @ 2016-05-13 13:50 UTC (permalink / raw) To: Alain Frisch, Soegtrop, Michael, caml-list For those who don't see that as a joke, please forget everything about that. Obj.set_tag is an extremely dangerous function as the compiler assumes that value's tag never change, even for completely unknown values. Hence, it might be legal for the compiler to do something that you might not expect on: let f x = let x' = Obj.repr x in let tag = Obj.tag x' in Obj.set_tag x' 0 let ... = match x with ... in Obj.set_tag x' tag; match x with ... The access to the tag can be shared for instance. And lots of 'surprising' stuff might happen. This is of course a quite brutal example but this could apply to a lot of less obvious situation ! So Obj.magic and Obj.set_field : bad Obj.set_tag: really really bad ! -- Pierre On 13/05/2016 11:06, Alain Frisch wrote: > Dear Michael, > > Don't tell anybody (ahem), but the following might work depending on > your exact use case... > > If you define a type such as: > > type t = { > foo: ....; > id: int; > } > > and if you set the tag of values for this type to Obj.record_tag (= > 248) with Obj.set_tag, then polymorphic hash *and* comparison > functions will use the "id" field and ignore the "foo" field (and any > other extra fields). It is important to keep "id" as the second field > to mimic the layout of real object values (and exception slots). > > In particular, this lets you easily "forget" sub-parts of a value (by > setting "id" to a fixed constant), or implement "references compared > with physical equality" (use a counter to assign unique "id" to each > such reference). > > > Best regards, > > Alain > > > > On 13/05/2016 09:52, Soegtrop, Michael wrote: >> Dear Ocaml Users, >> >> >> >> I wonder if there is a way to specify a default hash function for a >> type, which is then used by the automated hash functions e.g. created >> for records containing this type. I have e.g. a record with about 10 >> fields and I am happy with the default hash functions for most fields >> and for the record, but for just one field I would like to use a custom >> hash function, but I don’t want to write a hash function for the record. >> >> >> >> I am currently using the standard library, but if there is a solution >> with Batteries or Core, I would also be interested. >> >> >> >> Best regards, >> >> >> >> Michael >> >> Intel Deutschland GmbH >> Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany >> Tel: +49 89 99 8853-0, www.intel.de >> Managing Directors: Christin Eisenschmid, Christian Lamprechter >> Chairperson of the Supervisory Board: Nicole Lau >> Registered Office: Munich >> Commercial Register: Amtsgericht Muenchen HRB 186928 >> > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Caml-list] Specify the default hash function for a type 2016-05-13 13:50 ` Pierre Chambart @ 2016-05-13 13:56 ` Alain Frisch 2016-05-13 16:17 ` Soegtrop, Michael 0 siblings, 1 reply; 16+ messages in thread From: Alain Frisch @ 2016-05-13 13:56 UTC (permalink / raw) To: Pierre Chambart, Soegtrop, Michael, caml-list FWIW, I was not seriously suggesting this. That said, ..., the case should be relatively safe, since the code generated by OCaml never reads the tag of record values (it is assumed to be 0). The tag should only be read by polymorphic primitives, which don't receive any information from the OCaml optimizer. So I don't see how the optimizer could break the (not-really-)suggested hack, except on purpose. Alain On 13/05/2016 15:50, Pierre Chambart wrote: > For those who don't see that as a joke, please forget everything about that. > Obj.set_tag is an extremely dangerous function as the compiler assumes that > value's tag never change, even for completely unknown values. Hence, it > might > be legal for the compiler to do something that you might not expect on: > > let f x = > let x' = Obj.repr x in > let tag = Obj.tag x' in > Obj.set_tag x' 0 > let ... = match x with ... in > Obj.set_tag x' tag; > match x with ... > > The access to the tag can be shared for instance. And lots of > 'surprising' stuff might happen. > This is of course a quite brutal example but this could apply to a lot > of less obvious situation ! > > So Obj.magic and Obj.set_field : bad > Obj.set_tag: really really bad ! > ^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: [Caml-list] Specify the default hash function for a type 2016-05-13 13:56 ` Alain Frisch @ 2016-05-13 16:17 ` Soegtrop, Michael 2016-05-13 18:57 ` Thomas Braibant 0 siblings, 1 reply; 16+ messages in thread From: Soegtrop, Michael @ 2016-05-13 16:17 UTC (permalink / raw) To: Alain Frisch, Pierre Chambart, caml-list Dear Ocaml Users, a possibly interesting data point in the discussion on polymorphic hashes: I replaced the polymorphic hash function for a record, which consists of smallish integers, bools, lists of integers and short strings, with my own hash function which looks like: ( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4 where vi are values and pi are distinct prime numbers. For lists I use 4 primes alternatingly and 4 different primes for empty cases (implemented as 4 mutual recursive functions). The primes are random primes between 2^28 and 2^29, avoiding those primes where larger powers of 2 are close to 0 modulo the prime. All arithmetic is done with plain ocaml ints and corresponding overflows. For strings I used the polymorphic hash function. The hash histogram for this hash function looks like this: num_bindings=1803454 num_buckets=16777216 max_bucket_length=24 (the histogram is "number of items in bucket=count of buckets with this number of items") 0=15356268 (empty buckets) 1=1185329 (non-colliding buckets) 2=135231 3=80194 4=11344 5=2367 6=2836 7=2630 8=244 9=49 10=20 11=25 12=64 13=39 14=100 15=75 16=33 17=5 18=149 19=15 20=3 21=180 22=14 24=2 This is roughly in line with random numbers: fill rate = 1803454/16777216 = 10.7% 2 collisions = 0.8% ~ fill rate ^2 3 collisions = 0.48% ~ 4 * fill rate ^3 4 collisions = 0.068% ~ 5 * fill rate ^4 The hash histogram for the same data and the polymorphic hash function looks like this: num_bindings=1803454 num_buckets=16777216 max_bucket_length=3343 0=16730926 1=5520 2=3201 3=2779 4=1633 5=1079 6=3701 7=672 8=828 9=1584 10=600 11=384 12=2774 13=404 14=525 15=500 16=435 17=358 18=1406 19=244 20=504 21=427 22=316 23=369 24=837 25=153 26=250 27=417 28=191 29=222 30=501 31=76 32=196 33=530 34=142 35=153 36=859 37=88 38=178 39=310 40=147 41=173 42=313 43=102 44=235 45=123 46=149 47=152 48=244 49=107 50=173 : : 1001=1 1002=1 1005=1 1006=3 1009=1 1010=1 1013=1 1014=2 1019=1 1020=1 1029=1 1031=1 1034=1 1035=1 1036=1 1064=1 1082=1 1184=1 1191=1 1196=1 1197=1 1200=1 1206=1 1207=1 1219=1 1225=1 1228=1 1232=1 1566=1 1581=1 1636=1 1698=1 1720=2 1737=2 1740=1 1744=1 1752=1 1762=1 1789=1 2255=1 2271=1 2287=1 2319=1 2332=1 2345=1 3296=1 3325=1 3343=1 So there are many buckets with 1000s of collisions and only a few 1000 collision free buckets. It might make sense to come up with a better polymorphic hashing function. Best regards, Michael Intel Deutschland GmbH Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany Tel: +49 89 99 8853-0, www.intel.de Managing Directors: Christin Eisenschmid, Christian Lamprechter Chairperson of the Supervisory Board: Nicole Lau Registered Office: Munich Commercial Register: Amtsgericht Muenchen HRB 186928 ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Caml-list] Specify the default hash function for a type 2016-05-13 16:17 ` Soegtrop, Michael @ 2016-05-13 18:57 ` Thomas Braibant 2016-05-13 22:45 ` Soegtrop, Michael 0 siblings, 1 reply; 16+ messages in thread From: Thomas Braibant @ 2016-05-13 18:57 UTC (permalink / raw) To: Soegtrop, Michael; +Cc: Alain Frisch, Pierre Chambart, caml-list The OCaml hash function is not bad in itself (it's a slightly modified version of Murmur 3, with a 32 bit state). There are a few difficult design decisions that are made in the main hash loop which can impact users though. I wonder if the one that you are hitting is the fact that by default the polymorphic hash function only consider a (small) amount of meaningful value, in a breadth first manner. The following program is interesting ``` let rec make n acc = if n = 0 then acc else make (n - 1) (n :: acc);; let r = ref 0 in while Hashtbl.hash (make !r []) <> Hashtbl.hash (make (!r + 1) []) do incr r; done; !r;; - : int = 10 ``` The current default choice for the number of meaningful values is described here: http://caml.inria.fr/pub/docs/manual-ocaml/libref/Hashtbl.html and it is indeed 10. You could try to increase the number of meaningful nodes that you want to consider in your code, to the max of 256. There is really a balance to strike here between speed of hashing and collisions. Tom On Fri, May 13, 2016 at 6:17 PM, Soegtrop, Michael <michael.soegtrop@intel.com> wrote: > Dear Ocaml Users, > > a possibly interesting data point in the discussion on polymorphic hashes: > > I replaced the polymorphic hash function for a record, which consists of smallish integers, bools, lists of integers and short strings, with my own hash function which looks like: > > ( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4 > > where vi are values and pi are distinct prime numbers. For lists I use 4 primes alternatingly and 4 different primes for empty cases (implemented as 4 mutual recursive functions). The primes are random primes between 2^28 and 2^29, avoiding those primes where larger powers of 2 are close to 0 modulo the prime. All arithmetic is done with plain ocaml ints and corresponding overflows. For strings I used the polymorphic hash function. > > The hash histogram for this hash function looks like this: > > num_bindings=1803454 > num_buckets=16777216 > max_bucket_length=24 > (the histogram is "number of items in bucket=count of buckets with this number of items") > 0=15356268 (empty buckets) > 1=1185329 (non-colliding buckets) > 2=135231 > 3=80194 > 4=11344 > 5=2367 > 6=2836 > 7=2630 > 8=244 > 9=49 > 10=20 > 11=25 > 12=64 > 13=39 > 14=100 > 15=75 > 16=33 > 17=5 > 18=149 > 19=15 > 20=3 > 21=180 > 22=14 > 24=2 > > This is roughly in line with random numbers: > > fill rate = 1803454/16777216 = 10.7% > 2 collisions = 0.8% ~ fill rate ^2 > 3 collisions = 0.48% ~ 4 * fill rate ^3 > 4 collisions = 0.068% ~ 5 * fill rate ^4 > > The hash histogram for the same data and the polymorphic hash function looks like this: > > num_bindings=1803454 > num_buckets=16777216 > max_bucket_length=3343 > 0=16730926 > 1=5520 > 2=3201 > 3=2779 > 4=1633 > 5=1079 > 6=3701 > 7=672 > 8=828 > 9=1584 > 10=600 > 11=384 > 12=2774 > 13=404 > 14=525 > 15=500 > 16=435 > 17=358 > 18=1406 > 19=244 > 20=504 > 21=427 > 22=316 > 23=369 > 24=837 > 25=153 > 26=250 > 27=417 > 28=191 > 29=222 > 30=501 > 31=76 > 32=196 > 33=530 > 34=142 > 35=153 > 36=859 > 37=88 > 38=178 > 39=310 > 40=147 > 41=173 > 42=313 > 43=102 > 44=235 > 45=123 > 46=149 > 47=152 > 48=244 > 49=107 > 50=173 > : > : > 1001=1 > 1002=1 > 1005=1 > 1006=3 > 1009=1 > 1010=1 > 1013=1 > 1014=2 > 1019=1 > 1020=1 > 1029=1 > 1031=1 > 1034=1 > 1035=1 > 1036=1 > 1064=1 > 1082=1 > 1184=1 > 1191=1 > 1196=1 > 1197=1 > 1200=1 > 1206=1 > 1207=1 > 1219=1 > 1225=1 > 1228=1 > 1232=1 > 1566=1 > 1581=1 > 1636=1 > 1698=1 > 1720=2 > 1737=2 > 1740=1 > 1744=1 > 1752=1 > 1762=1 > 1789=1 > 2255=1 > 2271=1 > 2287=1 > 2319=1 > 2332=1 > 2345=1 > 3296=1 > 3325=1 > 3343=1 > > So there are many buckets with 1000s of collisions and only a few 1000 collision free buckets. > > It might make sense to come up with a better polymorphic hashing function. > > Best regards, > > Michael > Intel Deutschland GmbH > Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany > Tel: +49 89 99 8853-0, www.intel.de > Managing Directors: Christin Eisenschmid, Christian Lamprechter > Chairperson of the Supervisory Board: Nicole Lau > Registered Office: Munich > Commercial Register: Amtsgericht Muenchen HRB 186928 > > > -- > Caml-list mailing list. Subscription management and archives: > https://sympa.inria.fr/sympa/arc/caml-list > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs ^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: [Caml-list] Specify the default hash function for a type 2016-05-13 18:57 ` Thomas Braibant @ 2016-05-13 22:45 ` Soegtrop, Michael 2016-05-14 8:41 ` Thomas Braibant 0 siblings, 1 reply; 16+ messages in thread From: Soegtrop, Michael @ 2016-05-13 22:45 UTC (permalink / raw) To: Thomas Braibant; +Cc: Alain Frisch, Pierre Chambart, caml-list Dear Tom, thanks for the pointer to the manual. I googled, but couldn't find this info - should have read the manual first. Yes, 10 is for sure not enough in my case. I will try the hash function with a value adjusted to my case and compare with mine. Best regards, Michael > -----Original Message----- > From: Thomas Braibant [mailto:thomas.braibant@gmail.com] > Sent: Friday, May 13, 2016 8:57 PM > To: Soegtrop, Michael > Cc: Alain Frisch; Pierre Chambart; caml-list@inria.fr > Subject: Re: [Caml-list] Specify the default hash function for a type > > The OCaml hash function is not bad in itself (it's a slightly modified version > of Murmur 3, with a 32 bit state). There are a few difficult design decisions > that are made in the main hash loop which can impact users though. > > I wonder if the one that you are hitting is the fact that by default the > polymorphic hash function only consider a (small) amount of meaningful > value, in a breadth first manner. The following program is interesting > > ``` > let rec make n acc = if n = 0 then acc else make (n - 1) (n :: acc);; let r = ref 0 in > while Hashtbl.hash (make !r []) <> Hashtbl.hash (make (!r + 1) []) do incr r; > done; !r;; > - : int = 10 > ``` > > The current default choice for the number of meaningful values is described > here: > http://caml.inria.fr/pub/docs/manual-ocaml/libref/Hashtbl.html > and it is indeed 10. You could try to increase the number of meaningful > nodes that you want to consider in your code, to the max of 256. > > There is really a balance to strike here between speed of hashing and > collisions. > > Tom > > > > > > On Fri, May 13, 2016 at 6:17 PM, Soegtrop, Michael > <michael.soegtrop@intel.com> wrote: > > Dear Ocaml Users, > > > > a possibly interesting data point in the discussion on polymorphic hashes: > > > > I replaced the polymorphic hash function for a record, which consists of > smallish integers, bools, lists of integers and short strings, with my own > hash function which looks like: > > > > ( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4 > > > > where vi are values and pi are distinct prime numbers. For lists I use 4 > primes alternatingly and 4 different primes for empty cases (implemented > as 4 mutual recursive functions). The primes are random primes between > 2^28 and 2^29, avoiding those primes where larger powers of 2 are close to 0 > modulo the prime. All arithmetic is done with plain ocaml ints and > corresponding overflows. For strings I used the polymorphic hash function. > > > > The hash histogram for this hash function looks like this: > > > > num_bindings=1803454 > > num_buckets=16777216 > > max_bucket_length=24 > > (the histogram is "number of items in bucket=count of buckets with > > this number of items") > > 0=15356268 (empty buckets) > > 1=1185329 (non-colliding buckets) > > 2=135231 > > 3=80194 > > 4=11344 > > 5=2367 > > 6=2836 > > 7=2630 > > 8=244 > > 9=49 > > 10=20 > > 11=25 > > 12=64 > > 13=39 > > 14=100 > > 15=75 > > 16=33 > > 17=5 > > 18=149 > > 19=15 > > 20=3 > > 21=180 > > 22=14 > > 24=2 > > > > This is roughly in line with random numbers: > > > > fill rate = 1803454/16777216 = 10.7% > > 2 collisions = 0.8% ~ fill rate ^2 > > 3 collisions = 0.48% ~ 4 * fill rate ^3 > > 4 collisions = 0.068% ~ 5 * fill rate ^4 > > > > The hash histogram for the same data and the polymorphic hash function > looks like this: > > > > num_bindings=1803454 > > num_buckets=16777216 > > max_bucket_length=3343 > > 0=16730926 > > 1=5520 > > 2=3201 > > 3=2779 > > 4=1633 > > 5=1079 > > 6=3701 > > 7=672 > > 8=828 > > 9=1584 > > 10=600 > > 11=384 > > 12=2774 > > 13=404 > > 14=525 > > 15=500 > > 16=435 > > 17=358 > > 18=1406 > > 19=244 > > 20=504 > > 21=427 > > 22=316 > > 23=369 > > 24=837 > > 25=153 > > 26=250 > > 27=417 > > 28=191 > > 29=222 > > 30=501 > > 31=76 > > 32=196 > > 33=530 > > 34=142 > > 35=153 > > 36=859 > > 37=88 > > 38=178 > > 39=310 > > 40=147 > > 41=173 > > 42=313 > > 43=102 > > 44=235 > > 45=123 > > 46=149 > > 47=152 > > 48=244 > > 49=107 > > 50=173 > > : > > : > > 1001=1 > > 1002=1 > > 1005=1 > > 1006=3 > > 1009=1 > > 1010=1 > > 1013=1 > > 1014=2 > > 1019=1 > > 1020=1 > > 1029=1 > > 1031=1 > > 1034=1 > > 1035=1 > > 1036=1 > > 1064=1 > > 1082=1 > > 1184=1 > > 1191=1 > > 1196=1 > > 1197=1 > > 1200=1 > > 1206=1 > > 1207=1 > > 1219=1 > > 1225=1 > > 1228=1 > > 1232=1 > > 1566=1 > > 1581=1 > > 1636=1 > > 1698=1 > > 1720=2 > > 1737=2 > > 1740=1 > > 1744=1 > > 1752=1 > > 1762=1 > > 1789=1 > > 2255=1 > > 2271=1 > > 2287=1 > > 2319=1 > > 2332=1 > > 2345=1 > > 3296=1 > > 3325=1 > > 3343=1 > > > > So there are many buckets with 1000s of collisions and only a few 1000 > collision free buckets. > > > > It might make sense to come up with a better polymorphic hashing > function. > > > > Best regards, > > > > Michael > > Intel Deutschland GmbH > > Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany > > Tel: +49 89 99 8853-0, www.intel.de > > Managing Directors: Christin Eisenschmid, Christian Lamprechter > > Chairperson of the Supervisory Board: Nicole Lau Registered Office: > > Munich Commercial Register: Amtsgericht Muenchen HRB 186928 > > > > > > -- > > Caml-list mailing list. Subscription management and archives: > > https://sympa.inria.fr/sympa/arc/caml-list > > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > > Bug reports: http://caml.inria.fr/bin/caml-bugs Intel Deutschland GmbH Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany Tel: +49 89 99 8853-0, www.intel.de Managing Directors: Christin Eisenschmid, Christian Lamprechter Chairperson of the Supervisory Board: Nicole Lau Registered Office: Munich Commercial Register: Amtsgericht Muenchen HRB 186928 ^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: [Caml-list] Specify the default hash function for a type 2016-05-13 22:45 ` Soegtrop, Michael @ 2016-05-14 8:41 ` Thomas Braibant 2016-05-14 9:06 ` Soegtrop, Michael 0 siblings, 1 reply; 16+ messages in thread From: Thomas Braibant @ 2016-05-14 8:41 UTC (permalink / raw) To: Soegtrop, Michael; +Cc: Alain Frisch, OCaML Mailing List, Pierre Chambart [-- Attachment #1: Type: text/plain, Size: 6714 bytes --] Dear Michael, 256 might be too small for your use case if you have big lists in your records. A way to workaround this limitation is to use the seeded_hash function to fold an hash accumulator through your record. That is, if your record has fields f1, ..., fn you might want to write let hash t = let x = seeded_hash (hash t.f1) (hash t.f2) in let x = seeded_hash x (hash t.f3) in ... seeded_hash x (hash t.fn) On 14 May 2016 00:46, "Soegtrop, Michael" <michael.soegtrop@intel.com> wrote: > Dear Tom, > > thanks for the pointer to the manual. I googled, but couldn't find this > info - should have read the manual first. Yes, 10 is for sure not enough in > my case. I will try the hash function with a value adjusted to my case and > compare with mine. > > Best regards, > > Michael > > > -----Original Message----- > > From: Thomas Braibant [mailto:thomas.braibant@gmail.com] > > Sent: Friday, May 13, 2016 8:57 PM > > To: Soegtrop, Michael > > Cc: Alain Frisch; Pierre Chambart; caml-list@inria.fr > > Subject: Re: [Caml-list] Specify the default hash function for a type > > > > The OCaml hash function is not bad in itself (it's a slightly modified > version > > of Murmur 3, with a 32 bit state). There are a few difficult design > decisions > > that are made in the main hash loop which can impact users though. > > > > I wonder if the one that you are hitting is the fact that by default the > > polymorphic hash function only consider a (small) amount of meaningful > > value, in a breadth first manner. The following program is interesting > > > > ``` > > let rec make n acc = if n = 0 then acc else make (n - 1) (n :: acc);; > let r = ref 0 in > > while Hashtbl.hash (make !r []) <> Hashtbl.hash (make (!r + 1) []) do > incr r; > > done; !r;; > > - : int = 10 > > ``` > > > > The current default choice for the number of meaningful values is > described > > here: > > http://caml.inria.fr/pub/docs/manual-ocaml/libref/Hashtbl.html > > and it is indeed 10. You could try to increase the number of meaningful > > nodes that you want to consider in your code, to the max of 256. > > > > There is really a balance to strike here between speed of hashing and > > collisions. > > > > Tom > > > > > > > > > > > > On Fri, May 13, 2016 at 6:17 PM, Soegtrop, Michael > > <michael.soegtrop@intel.com> wrote: > > > Dear Ocaml Users, > > > > > > a possibly interesting data point in the discussion on polymorphic > hashes: > > > > > > I replaced the polymorphic hash function for a record, which consists > of > > smallish integers, bools, lists of integers and short strings, with my > own > > hash function which looks like: > > > > > > ( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4 > > > > > > where vi are values and pi are distinct prime numbers. For lists I use > 4 > > primes alternatingly and 4 different primes for empty cases (implemented > > as 4 mutual recursive functions). The primes are random primes between > > 2^28 and 2^29, avoiding those primes where larger powers of 2 are close > to 0 > > modulo the prime. All arithmetic is done with plain ocaml ints and > > corresponding overflows. For strings I used the polymorphic hash > function. > > > > > > The hash histogram for this hash function looks like this: > > > > > > num_bindings=1803454 > > > num_buckets=16777216 > > > max_bucket_length=24 > > > (the histogram is "number of items in bucket=count of buckets with > > > this number of items") > > > 0=15356268 (empty buckets) > > > 1=1185329 (non-colliding buckets) > > > 2=135231 > > > 3=80194 > > > 4=11344 > > > 5=2367 > > > 6=2836 > > > 7=2630 > > > 8=244 > > > 9=49 > > > 10=20 > > > 11=25 > > > 12=64 > > > 13=39 > > > 14=100 > > > 15=75 > > > 16=33 > > > 17=5 > > > 18=149 > > > 19=15 > > > 20=3 > > > 21=180 > > > 22=14 > > > 24=2 > > > > > > This is roughly in line with random numbers: > > > > > > fill rate = 1803454/16777216 = 10.7% > > > 2 collisions = 0.8% ~ fill rate ^2 > > > 3 collisions = 0.48% ~ 4 * fill rate ^3 > > > 4 collisions = 0.068% ~ 5 * fill rate ^4 > > > > > > The hash histogram for the same data and the polymorphic hash function > > looks like this: > > > > > > num_bindings=1803454 > > > num_buckets=16777216 > > > max_bucket_length=3343 > > > 0=16730926 > > > 1=5520 > > > 2=3201 > > > 3=2779 > > > 4=1633 > > > 5=1079 > > > 6=3701 > > > 7=672 > > > 8=828 > > > 9=1584 > > > 10=600 > > > 11=384 > > > 12=2774 > > > 13=404 > > > 14=525 > > > 15=500 > > > 16=435 > > > 17=358 > > > 18=1406 > > > 19=244 > > > 20=504 > > > 21=427 > > > 22=316 > > > 23=369 > > > 24=837 > > > 25=153 > > > 26=250 > > > 27=417 > > > 28=191 > > > 29=222 > > > 30=501 > > > 31=76 > > > 32=196 > > > 33=530 > > > 34=142 > > > 35=153 > > > 36=859 > > > 37=88 > > > 38=178 > > > 39=310 > > > 40=147 > > > 41=173 > > > 42=313 > > > 43=102 > > > 44=235 > > > 45=123 > > > 46=149 > > > 47=152 > > > 48=244 > > > 49=107 > > > 50=173 > > > : > > > : > > > 1001=1 > > > 1002=1 > > > 1005=1 > > > 1006=3 > > > 1009=1 > > > 1010=1 > > > 1013=1 > > > 1014=2 > > > 1019=1 > > > 1020=1 > > > 1029=1 > > > 1031=1 > > > 1034=1 > > > 1035=1 > > > 1036=1 > > > 1064=1 > > > 1082=1 > > > 1184=1 > > > 1191=1 > > > 1196=1 > > > 1197=1 > > > 1200=1 > > > 1206=1 > > > 1207=1 > > > 1219=1 > > > 1225=1 > > > 1228=1 > > > 1232=1 > > > 1566=1 > > > 1581=1 > > > 1636=1 > > > 1698=1 > > > 1720=2 > > > 1737=2 > > > 1740=1 > > > 1744=1 > > > 1752=1 > > > 1762=1 > > > 1789=1 > > > 2255=1 > > > 2271=1 > > > 2287=1 > > > 2319=1 > > > 2332=1 > > > 2345=1 > > > 3296=1 > > > 3325=1 > > > 3343=1 > > > > > > So there are many buckets with 1000s of collisions and only a few 1000 > > collision free buckets. > > > > > > It might make sense to come up with a better polymorphic hashing > > function. > > > > > > Best regards, > > > > > > Michael > > > Intel Deutschland GmbH > > > Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany > > > Tel: +49 89 99 8853-0, www.intel.de > > > Managing Directors: Christin Eisenschmid, Christian Lamprechter > > > Chairperson of the Supervisory Board: Nicole Lau Registered Office: > > > Munich Commercial Register: Amtsgericht Muenchen HRB 186928 > > > > > > > > > -- > > > Caml-list mailing list. Subscription management and archives: > > > https://sympa.inria.fr/sympa/arc/caml-list > > > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > > > Bug reports: http://caml.inria.fr/bin/caml-bugs > Intel Deutschland GmbH > Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany > Tel: +49 89 99 8853-0, www.intel.de > Managing Directors: Christin Eisenschmid, Christian Lamprechter > Chairperson of the Supervisory Board: Nicole Lau > Registered Office: Munich > Commercial Register: Amtsgericht Muenchen HRB 186928 > [-- Attachment #2: Type: text/html, Size: 9735 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: [Caml-list] Specify the default hash function for a type 2016-05-14 8:41 ` Thomas Braibant @ 2016-05-14 9:06 ` Soegtrop, Michael 2016-05-17 13:03 ` Soegtrop, Michael 0 siblings, 1 reply; 16+ messages in thread From: Soegtrop, Michael @ 2016-05-14 9:06 UTC (permalink / raw) To: Thomas Braibant; +Cc: Alain Frisch, OCaML Mailing List, Pierre Chambart [-- Attachment #1: Type: text/plain, Size: 7944 bytes --] Dear Tom, my lists are typically 3..6 elements. I think in total 60 or so might be fine. I will give it a try. Although, after I have written the hash and equality functions for the Hashtbl module interfce, I make use of the fact that I can easily deliberately ignore certain things – the starting point of this thread. Maybe for larger structures it is just the right thing to do to write your own hash function. I wonder what would be an elegant way to do this? Collect the default-hash function values of the pieces in a list and compute the hash of this list? Best regards, Michael From: Thomas Braibant [mailto:thomas.braibant@gmail.com] Sent: Saturday, May 14, 2016 10:41 AM To: Soegtrop, Michael <michael.soegtrop@intel.com> Cc: Alain Frisch <alain.frisch@lexifi.com>; OCaML Mailing List <caml-list@inria.fr>; Pierre Chambart <pierre.chambart@ocamlpro.com> Subject: RE: [Caml-list] Specify the default hash function for a type Dear Michael, 256 might be too small for your use case if you have big lists in your records. A way to workaround this limitation is to use the seeded_hash function to fold an hash accumulator through your record. That is, if your record has fields f1, ..., fn you might want to write let hash t = let x = seeded_hash (hash t.f1) (hash t.f2) in let x = seeded_hash x (hash t.f3) in ... seeded_hash x (hash t.fn) On 14 May 2016 00:46, "Soegtrop, Michael" <michael.soegtrop@intel.com<mailto:michael.soegtrop@intel.com>> wrote: Dear Tom, thanks for the pointer to the manual. I googled, but couldn't find this info - should have read the manual first. Yes, 10 is for sure not enough in my case. I will try the hash function with a value adjusted to my case and compare with mine. Best regards, Michael > -----Original Message----- > From: Thomas Braibant [mailto:thomas.braibant@gmail.com<mailto:thomas.braibant@gmail.com>] > Sent: Friday, May 13, 2016 8:57 PM > To: Soegtrop, Michael > Cc: Alain Frisch; Pierre Chambart; caml-list@inria.fr<mailto:caml-list@inria.fr> > Subject: Re: [Caml-list] Specify the default hash function for a type > > The OCaml hash function is not bad in itself (it's a slightly modified version > of Murmur 3, with a 32 bit state). There are a few difficult design decisions > that are made in the main hash loop which can impact users though. > > I wonder if the one that you are hitting is the fact that by default the > polymorphic hash function only consider a (small) amount of meaningful > value, in a breadth first manner. The following program is interesting > > ``` > let rec make n acc = if n = 0 then acc else make (n - 1) (n :: acc);; let r = ref 0 in > while Hashtbl.hash (make !r []) <> Hashtbl.hash (make (!r + 1) []) do incr r; > done; !r;; > - : int = 10 > ``` > > The current default choice for the number of meaningful values is described > here: > http://caml.inria.fr/pub/docs/manual-ocaml/libref/Hashtbl.html > and it is indeed 10. You could try to increase the number of meaningful > nodes that you want to consider in your code, to the max of 256. > > There is really a balance to strike here between speed of hashing and > collisions. > > Tom > > > > > > On Fri, May 13, 2016 at 6:17 PM, Soegtrop, Michael > <michael.soegtrop@intel.com<mailto:michael.soegtrop@intel.com>> wrote: > > Dear Ocaml Users, > > > > a possibly interesting data point in the discussion on polymorphic hashes: > > > > I replaced the polymorphic hash function for a record, which consists of > smallish integers, bools, lists of integers and short strings, with my own > hash function which looks like: > > > > ( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4 > > > > where vi are values and pi are distinct prime numbers. For lists I use 4 > primes alternatingly and 4 different primes for empty cases (implemented > as 4 mutual recursive functions). The primes are random primes between > 2^28 and 2^29, avoiding those primes where larger powers of 2 are close to 0 > modulo the prime. All arithmetic is done with plain ocaml ints and > corresponding overflows. For strings I used the polymorphic hash function. > > > > The hash histogram for this hash function looks like this: > > > > num_bindings=1803454 > > num_buckets=16777216 > > max_bucket_length=24 > > (the histogram is "number of items in bucket=count of buckets with > > this number of items") > > 0=15356268 (empty buckets) > > 1=1185329 (non-colliding buckets) > > 2=135231 > > 3=80194 > > 4=11344 > > 5=2367 > > 6=2836 > > 7=2630 > > 8=244 > > 9=49 > > 10=20 > > 11=25 > > 12=64 > > 13=39 > > 14=100 > > 15=75 > > 16=33 > > 17=5 > > 18=149 > > 19=15 > > 20=3 > > 21=180 > > 22=14 > > 24=2 > > > > This is roughly in line with random numbers: > > > > fill rate = 1803454/16777216 = 10.7% > > 2 collisions = 0.8% ~ fill rate ^2 > > 3 collisions = 0.48% ~ 4 * fill rate ^3 > > 4 collisions = 0.068% ~ 5 * fill rate ^4 > > > > The hash histogram for the same data and the polymorphic hash function > looks like this: > > > > num_bindings=1803454 > > num_buckets=16777216 > > max_bucket_length=3343 > > 0=16730926 > > 1=5520 > > 2=3201 > > 3=2779 > > 4=1633 > > 5=1079 > > 6=3701 > > 7=672 > > 8=828 > > 9=1584 > > 10=600 > > 11=384 > > 12=2774 > > 13=404 > > 14=525 > > 15=500 > > 16=435 > > 17=358 > > 18=1406 > > 19=244 > > 20=504 > > 21=427 > > 22=316 > > 23=369 > > 24=837 > > 25=153 > > 26=250 > > 27=417 > > 28=191 > > 29=222 > > 30=501 > > 31=76 > > 32=196 > > 33=530 > > 34=142 > > 35=153 > > 36=859 > > 37=88 > > 38=178 > > 39=310 > > 40=147 > > 41=173 > > 42=313 > > 43=102 > > 44=235 > > 45=123 > > 46=149 > > 47=152 > > 48=244 > > 49=107 > > 50=173 > > : > > : > > 1001=1 > > 1002=1 > > 1005=1 > > 1006=3 > > 1009=1 > > 1010=1 > > 1013=1 > > 1014=2 > > 1019=1 > > 1020=1 > > 1029=1 > > 1031=1 > > 1034=1 > > 1035=1 > > 1036=1 > > 1064=1 > > 1082=1 > > 1184=1 > > 1191=1 > > 1196=1 > > 1197=1 > > 1200=1 > > 1206=1 > > 1207=1 > > 1219=1 > > 1225=1 > > 1228=1 > > 1232=1 > > 1566=1 > > 1581=1 > > 1636=1 > > 1698=1 > > 1720=2 > > 1737=2 > > 1740=1 > > 1744=1 > > 1752=1 > > 1762=1 > > 1789=1 > > 2255=1 > > 2271=1 > > 2287=1 > > 2319=1 > > 2332=1 > > 2345=1 > > 3296=1 > > 3325=1 > > 3343=1 > > > > So there are many buckets with 1000s of collisions and only a few 1000 > collision free buckets. > > > > It might make sense to come up with a better polymorphic hashing > function. > > > > Best regards, > > > > Michael > > Intel Deutschland GmbH > > Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany > > Tel: +49 89 99 8853-0, www.intel.de<http://www.intel.de> > > Managing Directors: Christin Eisenschmid, Christian Lamprechter > > Chairperson of the Supervisory Board: Nicole Lau Registered Office: > > Munich Commercial Register: Amtsgericht Muenchen HRB 186928 > > > > > > -- > > Caml-list mailing list. Subscription management and archives: > > https://sympa.inria.fr/sympa/arc/caml-list > > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > > Bug reports: http://caml.inria.fr/bin/caml-bugs Intel Deutschland GmbH Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany Tel: +49 89 99 8853-0<tel:%2B49%2089%2099%208853-0>, www.intel.de<http://www.intel.de> Managing Directors: Christin Eisenschmid, Christian Lamprechter Chairperson of the Supervisory Board: Nicole Lau Registered Office: Munich Commercial Register: Amtsgericht Muenchen HRB 186928 Intel Deutschland GmbH Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany Tel: +49 89 99 8853-0, www.intel.de Managing Directors: Christin Eisenschmid, Christian Lamprechter Chairperson of the Supervisory Board: Nicole Lau Registered Office: Munich Commercial Register: Amtsgericht Muenchen HRB 186928 [-- Attachment #2: Type: text/html, Size: 14866 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: [Caml-list] Specify the default hash function for a type 2016-05-14 9:06 ` Soegtrop, Michael @ 2016-05-17 13:03 ` Soegtrop, Michael 0 siblings, 0 replies; 16+ messages in thread From: Soegtrop, Michael @ 2016-05-17 13:03 UTC (permalink / raw) To: OCaML Mailing List [-- Attachment #1: Type: text/plain, Size: 2951 bytes --] Dear OCaml users, to close this topic, here are some statistics for my use case with different hash functions. All hash methods I tried lead to identical statistics as random numbers. Simply using Hashtbl.hash on lists of hashes of substructures (Method 2 below) worked very well, is easy to write down and offers sufficient flexibility. NItems = 3277598 NBuckets = 16777216 | Theoretical | Method1 | Method2 | Method3 --------+-------------+----------+----------+--------- Empty | 13799905.0 | 13800062 | 13799864 | 13801452 Single | 2695950.5 | 2695830 | 2696131 | 2693130 2 items | 263340.5 | 263101 | 263042 | 264318 3 items | 17148.7 | 17350 | 17368 | 17463 4 items | 837.5 | 849 | 779 | 822 5 items | 32.7 | 24 | 29 | 31 6 items | 1.1 | 0 | 3 | 3 --------+-------------+----------+----------+--------- Runtime | | 119s | 122s | 110s Note 1: the expected random deviation of the counts is sqrt(count). Note 2: The speed differences between Method 1 and 2 are random – Method 2 should be faster. Note 3: using Hashtbl.hash on the complete structure doesn't work for me for two reasons: - parts of the structure should be ignored - I have maps as substructures and require that logically equal maps result in the same hash Theoretical: collision numbers for a random process with same parameters (used exact Bernoulli statistics, not Poisson approximation) Method1: Hashtbl.hash used on structures recursively (create list of hash values for substructures and use Hashtbl.hash on list). Maps (<10 elements) are converted to lists and the list is hashed with Hashtbl.hash. Multiply each hash list entry with a distinct prime. Use prime values for variants and prime multipliers for variant arguments. Method2: Hashtbl.hash used on structures recursively (create list of hash values for substructures and use Hashtbl.hash on list). Maps (<10 elements) are converted to lists and the list is hashed with Hashtbl.hash. Use prime values for variants and prime multipliers for variant arguments. Method3: ( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4 where vi are values and pi are distinct prime numbers. For lists I use 4 primes alternatingly and 4 different primes for empty cases Btw.: I added a function “worst_bucketlist” to Hashtbl, which returns the keys of the largest bucket. This is useful for optimizing (or bug fixing) hash and equality functions – try to make a feature request for it. Best regards, Michael Intel Deutschland GmbH Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany Tel: +49 89 99 8853-0, www.intel.de Managing Directors: Christin Eisenschmid, Christian Lamprechter Chairperson of the Supervisory Board: Nicole Lau Registered Office: Munich Commercial Register: Amtsgericht Muenchen HRB 186928 [-- Attachment #2: Type: text/html, Size: 13039 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2016-05-17 13:04 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-05-13 7:52 [Caml-list] Specify the default hash function for a type Soegtrop, Michael 2016-05-13 8:13 ` Ben Millwood 2016-05-13 8:40 ` Soegtrop, Michael 2016-05-13 9:06 ` Alain Frisch 2016-05-13 9:26 ` Soegtrop, Michael 2016-05-13 12:01 ` Gabriel Scherer 2016-05-13 12:23 ` Alain Frisch 2016-05-13 12:32 ` Soegtrop, Michael 2016-05-13 13:50 ` Pierre Chambart 2016-05-13 13:56 ` Alain Frisch 2016-05-13 16:17 ` Soegtrop, Michael 2016-05-13 18:57 ` Thomas Braibant 2016-05-13 22:45 ` Soegtrop, Michael 2016-05-14 8:41 ` Thomas Braibant 2016-05-14 9:06 ` Soegtrop, Michael 2016-05-17 13:03 ` Soegtrop, Michael
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox