From: "Soegtrop, Michael" <michael.soegtrop@intel.com>
To: Alain Frisch <alain.frisch@lexifi.com>,
Pierre Chambart <pierre.chambart@ocamlpro.com>,
"caml-list@inria.fr" <caml-list@inria.fr>
Subject: RE: [Caml-list] Specify the default hash function for a type
Date: Fri, 13 May 2016 16:17:47 +0000 [thread overview]
Message-ID: <0F7D3B1B3C4B894D824F5B822E3E5A172CEF2D2F@IRSMSX102.ger.corp.intel.com> (raw)
In-Reply-To: <44df5d5e-8ad9-1a86-7fda-bfc203bfb479@lexifi.com>
Dear Ocaml Users,
a possibly interesting data point in the discussion on polymorphic hashes:
I replaced the polymorphic hash function for a record, which consists of smallish integers, bools, lists of integers and short strings, with my own hash function which looks like:
( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4
where vi are values and pi are distinct prime numbers. For lists I use 4 primes alternatingly and 4 different primes for empty cases (implemented as 4 mutual recursive functions). The primes are random primes between 2^28 and 2^29, avoiding those primes where larger powers of 2 are close to 0 modulo the prime. All arithmetic is done with plain ocaml ints and corresponding overflows. For strings I used the polymorphic hash function.
The hash histogram for this hash function looks like this:
num_bindings=1803454
num_buckets=16777216
max_bucket_length=24
(the histogram is "number of items in bucket=count of buckets with this number of items")
0=15356268 (empty buckets)
1=1185329 (non-colliding buckets)
2=135231
3=80194
4=11344
5=2367
6=2836
7=2630
8=244
9=49
10=20
11=25
12=64
13=39
14=100
15=75
16=33
17=5
18=149
19=15
20=3
21=180
22=14
24=2
This is roughly in line with random numbers:
fill rate = 1803454/16777216 = 10.7%
2 collisions = 0.8% ~ fill rate ^2
3 collisions = 0.48% ~ 4 * fill rate ^3
4 collisions = 0.068% ~ 5 * fill rate ^4
The hash histogram for the same data and the polymorphic hash function looks like this:
num_bindings=1803454
num_buckets=16777216
max_bucket_length=3343
0=16730926
1=5520
2=3201
3=2779
4=1633
5=1079
6=3701
7=672
8=828
9=1584
10=600
11=384
12=2774
13=404
14=525
15=500
16=435
17=358
18=1406
19=244
20=504
21=427
22=316
23=369
24=837
25=153
26=250
27=417
28=191
29=222
30=501
31=76
32=196
33=530
34=142
35=153
36=859
37=88
38=178
39=310
40=147
41=173
42=313
43=102
44=235
45=123
46=149
47=152
48=244
49=107
50=173
:
:
1001=1
1002=1
1005=1
1006=3
1009=1
1010=1
1013=1
1014=2
1019=1
1020=1
1029=1
1031=1
1034=1
1035=1
1036=1
1064=1
1082=1
1184=1
1191=1
1196=1
1197=1
1200=1
1206=1
1207=1
1219=1
1225=1
1228=1
1232=1
1566=1
1581=1
1636=1
1698=1
1720=2
1737=2
1740=1
1744=1
1752=1
1762=1
1789=1
2255=1
2271=1
2287=1
2319=1
2332=1
2345=1
3296=1
3325=1
3343=1
So there are many buckets with 1000s of collisions and only a few 1000 collision free buckets.
It might make sense to come up with a better polymorphic hashing function.
Best regards,
Michael
Intel Deutschland GmbH
Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de
Managing Directors: Christin Eisenschmid, Christian Lamprechter
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928
next prev parent reply other threads:[~2016-05-13 16:17 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-05-13 7:52 Soegtrop, Michael
2016-05-13 8:13 ` Ben Millwood
2016-05-13 8:40 ` Soegtrop, Michael
2016-05-13 9:06 ` Alain Frisch
2016-05-13 9:26 ` Soegtrop, Michael
2016-05-13 12:01 ` Gabriel Scherer
2016-05-13 12:23 ` Alain Frisch
2016-05-13 12:32 ` Soegtrop, Michael
2016-05-13 13:50 ` Pierre Chambart
2016-05-13 13:56 ` Alain Frisch
2016-05-13 16:17 ` Soegtrop, Michael [this message]
2016-05-13 18:57 ` Thomas Braibant
2016-05-13 22:45 ` Soegtrop, Michael
2016-05-14 8:41 ` Thomas Braibant
2016-05-14 9:06 ` Soegtrop, Michael
2016-05-17 13:03 ` Soegtrop, Michael
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=0F7D3B1B3C4B894D824F5B822E3E5A172CEF2D2F@IRSMSX102.ger.corp.intel.com \
--to=michael.soegtrop@intel.com \
--cc=alain.frisch@lexifi.com \
--cc=caml-list@inria.fr \
--cc=pierre.chambart@ocamlpro.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox