From: Brian Hurt <bhurt@spnz.org>
To: Wheeler Ruml <ruml@parc.com>
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Hashtbls with physical equality?
Date: Mon, 15 Nov 2004 14:37:36 -0600 (CST) [thread overview]
Message-ID: <Pine.LNX.4.44.0411151400080.19815-100000@localhost.localdomain> (raw)
In-Reply-To: <16791.56417.334890.765954@katsura.parc.xerox.com>
On Sun, 14 Nov 2004, Wheeler Ruml wrote:
> Is it possible in OCaml to have a hash table that can insert and retrieve
> values without walking over their structure? I tried to hack something
> together using Obj.magic (working from Hashtbl.ml and the size.ml example
> by Filliatre) but it doesn't work for me and I'm concerned that the garbage
> collector might be making the magic values obsolete.
There is no magic value associated with objects, unlike Java. Personally,
I think that was one of the biggest screwups with Java. Starting with the
fact that it returns a 32-bit *int* (and what happens when you have more
than 4 billion objects? Possible in a 64-bit system!).
> I have a hash table
> of strings and I'd like to avoid traversing their length on every lookup.
> Do I have to explicitly use integers instead?
Note that Ocaml doesn't intern strings (using more Java speak)
automatically. The result of "foo" == "foo" is false. Or consider "foo"
== (let s = "goo" in s.(0) <- 'f'; s). Note that I had a friend who was
burned by this recently- it seems that more recent versions of Java have
replaced the old mediocre-but-workable string hashcode implementation with
the default "magic number" implementation. Well, my friend managed to
defeat the interning the compiler tried to do (not delibertly), and was
surprised when the hashtable lookup suddently failed.
If the strings you are looking up in the hashtabl are known ahead of time
(for example, they're keywords in a programming language), you might look
at perfect hashes (see gperf for more info). If you're not, I think what
you're asking for is impossible.
Consider the following impelementation: you have a resizeable array
(there's one in ExtLib, or roll your own). When you create a new object,
you insert it into the array. You can now use the array index as the
magic number for the object. You can use the array index as a key into a
hash table, but why bother? Why not just look the structure up in the
array? To look the structure up in either the array or the hash table,
you need the array index (the key for the hash table)- how do you have the
key? You have two choices: you can seach the array (O(log N) if you're
lucky), or the array index/hash table key has to come from the data
itself.
If you simply store the array index in with the rest of the data, doing
something like:
type my_type = {
array_index : int;
...
};;
then by the time you get the array index, you already have the data. The
other possibility is to compute the key from some or all of the data.
Welcome to the wonderfull world of hash functions.
Any magic number mapping you can come up with is equivelent to the index
into the arrays problem I outlined above- except sometimes you don't have
the array. The "address of" idea that Java's implementation was obviously
originally designed for is exactly this sort of implementation- memory
itself becomes the array. And the question of how you get the magic
number of an object you don't already have is also a good question (if
you're passing around the magic number, I comment that passing around the
reference to the object itself is no more data).
As a side note, be wary of hashtables. They works real slick- right up to
the point where they don't. If the hash functions fail to give you
sufficient dispersion over the field of interesting values, the nice O(1)
best-case performance of hashtables suddenly becomes O(N). The nice thing
about balanced trees is that they're O(log N) worst case. Hashtables are
*NOT* magic data structures.
The other comment I'd make is that it sound like you're doing premature
optimization. Stop it. Get the code working first, and then measure to
see if it's a performance problem. What is your average length string
going to be? Walking down a 4- or even 16-character string isn't that
expensive. How much dispersion is your hash function giving you? Or is
some other part of the code the problem child? Or is the code simply fast
enough, and doesn't need any performance tuning what so ever?
Getting it working first, then worry about performance.
Brian
next prev parent reply other threads:[~2004-11-15 20:36 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2004-11-14 22:29 Wheeler Ruml
2004-11-15 0:33 ` [Caml-list] " Christian Szegedy
[not found] ` <20041115012212.GA6561@artisan.com>
2004-11-15 2:20 ` Wheeler Ruml
2004-11-15 10:24 ` Alex Baretta
2004-11-15 16:45 ` Wheeler Ruml
2004-11-15 20:34 ` Marcin 'Qrczak' Kowalczyk
2004-11-15 20:37 ` Brian Hurt [this message]
2004-11-15 21:03 ` Wheeler Ruml
2004-11-15 21:30 ` Marcin 'Qrczak' Kowalczyk
2004-11-15 23:42 ` Stefan Monnier
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Pine.LNX.4.44.0411151400080.19815-100000@localhost.localdomain \
--to=bhurt@spnz.org \
--cc=caml-list@yquem.inria.fr \
--cc=ruml@parc.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox