Mailing list for all users of the OCaml language and system.
 help / color / mirror / Atom feed
* Hashtbls with physical equality?
@ 2004-11-14 22:29 Wheeler Ruml
  2004-11-15  0:33 ` [Caml-list] " Christian Szegedy
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Wheeler Ruml @ 2004-11-14 22:29 UTC (permalink / raw)
  To: caml-list

Is it possible in OCaml to have a hash table that can insert and retrieve
values without walking over their structure?  I tried to hack something
together using Obj.magic (working from Hashtbl.ml and the size.ml example
by Filliatre) but it doesn't work for me and I'm concerned that the garbage
collector might be making the magic values obsolete.  I have a hash table
of strings and I'd like to avoid traversing their length on every lookup.
Do I have to explicitly use integers instead?

And hints or advice would be warmly welcome!  Please respond directly to me
as well as to the list, as I can't keep up with the volume and am not a
regular subscriber.  Thanks very much,

Wheeler
-- 
Wheeler Ruml, Palo Alto Research Center, Rm 1522, 650-812-4329 voice
ruml@parc.com, http://www.parc.com/ruml           650-812-4334 fax
------------ excerpt of my attempt --------------------

let hash x =
  (* want both integers and pointers (presumably word-aligned) to have
     well-spread lower bits *)
  let y = ((Obj.magic x) : int) in
    y lxor (y lsr 16)

    
let add h key info =
  let i = (hash key) mod (Array.length h.data) in
  let bucket = Cons(key, info, h.data.(i)) in
  h.data.(i) <- bucket;
  h.size <- succ h.size;
  if h.size > Array.length h.data lsl 1 then resize h

    
let rec find_rec key = function
    Empty ->
      raise Not_found
  | Cons(k, d, rest) ->
      if key == k then d else find_rec key rest

	
let find h key =
  match h.data.((hash key) mod (Array.length h.data)) with
    Empty -> raise Not_found
  | Cons(k1, d1, rest1) ->
      if key == k1 then d1 else find_rec key rest1

------------------------------------------------------------


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Hashtbls with physical equality?
  2004-11-14 22:29 Hashtbls with physical equality? Wheeler Ruml
@ 2004-11-15  0:33 ` Christian Szegedy
       [not found] ` <20041115012212.GA6561@artisan.com>
  2004-11-15 20:37 ` Brian Hurt
  2 siblings, 0 replies; 10+ messages in thread
From: Christian Szegedy @ 2004-11-15  0:33 UTC (permalink / raw)
  To: caml-list

> Is it possible in OCaml to have a hash table that can insert and retrieve
> values without walking over their structure?  I tried to hack something
> together using Obj.magic (working from Hashtbl.ml and the size.ml example
> by Filliatre) but it doesn't work for me and I'm concerned that the garbage
> collector might be making the magic values obsolete.  

You are right: copying of values by the garbage collection makes this 
approach infeasible.

> I have a hash table
> a of strings and I'd like to avoid traversing their length on every lookup.
> Do I have to explicitly use integers instead?

This is certainly a good idea.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Hashtbls with physical equality?
       [not found] ` <20041115012212.GA6561@artisan.com>
@ 2004-11-15  2:20   ` Wheeler Ruml
  2004-11-15 10:24     ` Alex Baretta
  0 siblings, 1 reply; 10+ messages in thread
From: Wheeler Ruml @ 2004-11-15  2:20 UTC (permalink / raw)
  To: John Malecki; +Cc: caml-list

> > Is it possible in OCaml to have a hash table that can insert and retrieve
> > values without walking over their structure?
> 
> How about something like this?
> 
> module PhysrefHashtbl =
>   Hashtbl.Make (struct type t = string
>     let equal = (==)
>     let hash = Hashtbl.hash
>   end)

My only concern about this solution is that Hashtbl.hash will walk over the
key's structure in order to compute its hash value.  I know that its
computation is bounded, but I'd still love to find something faster and
more more direct if possible.

Thanks,

Wheeler
-- 
Wheeler Ruml, Palo Alto Research Center, Rm 1522, 650-812-4329 voice
ruml@parc.com, http://www.parc.com/ruml           650-812-4334 fax


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Hashtbls with physical equality?
  2004-11-15  2:20   ` Wheeler Ruml
@ 2004-11-15 10:24     ` Alex Baretta
  2004-11-15 16:45       ` Wheeler Ruml
  0 siblings, 1 reply; 10+ messages in thread
From: Alex Baretta @ 2004-11-15 10:24 UTC (permalink / raw)
  To: Wheeler Ruml; +Cc: caml-list

Wheeler Ruml wrote:
>>>Is it possible in OCaml to have a hash table that can insert and retrieve
>>>values without walking over their structure?
>>
>>How about something like this?
>>
>>module PhysrefHashtbl =
>>  Hashtbl.Make (struct type t = string
>>    let equal = (==)
>>    let hash = Hashtbl.hash
>>  end)
> 
> 
> My only concern about this solution is that Hashtbl.hash will walk over the
> key's structure in order to compute its hash value.  I know that its
> computation is bounded, but I'd still love to find something faster and
> more more direct if possible.
> 
> Thanks,
> 
> Wheeler

Keep in mind that the hash function in the Hashtbl.Make functor can be 
any function whose signature is t -> int. This allows you to tweak and 
twine your hashing algorithm as you choose. If you don't want to right a 
hashing algorithm from scratch, than parametrize Xavier's to your 
heart's content:

val hash_param : int -> int -> 'a -> int

Alex

-- 
*********************************************************************
http://www.barettadeit.com/
Baretta DE&IT
A division of Baretta SRL

tel. 02 370 111 55
fax. 02 370 111 54

Our technology:

The Application System/Xcaml (AS/Xcaml)
<http://www.asxcaml.org/>

The FreerP Project
<http://www.freerp.org/>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Hashtbls with physical equality?
  2004-11-15 10:24     ` Alex Baretta
@ 2004-11-15 16:45       ` Wheeler Ruml
  2004-11-15 20:34         ` Marcin 'Qrczak' Kowalczyk
  0 siblings, 1 reply; 10+ messages in thread
From: Wheeler Ruml @ 2004-11-15 16:45 UTC (permalink / raw)
  To: Alex Baretta; +Cc: caml-list

> >>module PhysrefHashtbl =
> >>  Hashtbl.Make (struct type t = string
> >>    let equal = (==)
> >>    let hash = Hashtbl.hash
> >>  end)
> > 
> > My only concern about this solution is that Hashtbl.hash will walk over the
> > key's structure in order to compute its hash value.
> 
> Keep in mind that the hash function in the Hashtbl.Make functor can be 
> any function whose signature is t -> int. This allows you to tweak and 
> twine your hashing algorithm as you choose. If you don't want to right a 
> hashing algorithm from scratch, than parametrize Xavier's to your 
> heart's content:

OK, fair enough.  I guess there isn't any static "address-like" identity to
objects in OCaml, so I really do have to examine at least a few bits of the
object to compute a hash value.

Thanks very much for your help!

Wheeler
-- 
Wheeler Ruml, Palo Alto Research Center, Rm 1522, 650-812-4329 voice
ruml@parc.com, http://www.parc.com/ruml           650-812-4334 fax


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Hashtbls with physical equality?
  2004-11-15 16:45       ` Wheeler Ruml
@ 2004-11-15 20:34         ` Marcin 'Qrczak' Kowalczyk
  0 siblings, 0 replies; 10+ messages in thread
From: Marcin 'Qrczak' Kowalczyk @ 2004-11-15 20:34 UTC (permalink / raw)
  To: caml-list

Wheeler Ruml <ruml@parc.com> writes:

> OK, fair enough.  I guess there isn't any static "address-like" identity to
> objects in OCaml, so I really do have to examine at least a few bits of the
> object to compute a hash value.

Indeed.

Serialization uses an internal hash table hashed by addresses. It can
do this because it's implemented in C and performs no GC during
serialization, so address don't change. It would be impossible in pure
OCaml.

For my language Kogut I borrowed the idea of stable names from Glasgow
Haskell, strengthening its guarantees a bit (GHC does not guarantee
uniqueness of stable names, because Haskell is lazy and does not have
a well-defined concept of object identity which does not change during
evaluation).

Here they are called object ids. An object id is a hashable value
representing the identity of an arbitrary object, such that a given
object each time yields the same object id.

An object id doesn't keep the original object alive, can't be used
to access it, and is not kept alive by the object either. If an
object id is garbage collected, the new object id of the same object
will not necessarily have the same hash value.

This means that an object id must be kept alive during the time when
the corresponding object is used in a hash table, and thus it cannot
be used to obtain a universal "object -> hash" function (which does
not use extra memory for the duration of the life of the object,
otherwise it's possible with the help of weak references).

Fortunately the design of my hash tables makes such universal hashing
function unnecessary. If you want to use a different equality
predicate, you don't specify the predicate with a corresponding hash
function. Instead you specify a function which transforms keys into
something which is comparable and hashable using standard generic
functions for equality and hashing. Transformed keys are stored inside
the hash table and used for lookup.

So you can make a case-insensitive dictionary of strings by supplying
a transformation function which folds case, or a dictionary based on
object identity by supplying the ObjectId function.

The implementation of object id uses an internal hash table based on
physical addresses, which is rehashed on each GC (actually two tables,
one per generation).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Hashtbls with physical equality?
  2004-11-14 22:29 Hashtbls with physical equality? Wheeler Ruml
  2004-11-15  0:33 ` [Caml-list] " Christian Szegedy
       [not found] ` <20041115012212.GA6561@artisan.com>
@ 2004-11-15 20:37 ` Brian Hurt
  2004-11-15 21:03   ` Wheeler Ruml
                     ` (2 more replies)
  2 siblings, 3 replies; 10+ messages in thread
From: Brian Hurt @ 2004-11-15 20:37 UTC (permalink / raw)
  To: Wheeler Ruml; +Cc: caml-list

On Sun, 14 Nov 2004, Wheeler Ruml wrote:

> Is it possible in OCaml to have a hash table that can insert and retrieve
> values without walking over their structure?  I tried to hack something
> together using Obj.magic (working from Hashtbl.ml and the size.ml example
> by Filliatre) but it doesn't work for me and I'm concerned that the garbage
> collector might be making the magic values obsolete.  

There is no magic value associated with objects, unlike Java.  Personally, 
I think that was one of the biggest screwups with Java.  Starting with the 
fact that it returns a 32-bit *int* (and what happens when you have more 
than 4 billion objects?  Possible in a 64-bit system!).

> I have a hash table
> of strings and I'd like to avoid traversing their length on every lookup.
> Do I have to explicitly use integers instead?

Note that Ocaml doesn't intern strings (using more Java speak) 
automatically.  The result of "foo" == "foo" is false.  Or consider "foo" 
== (let s = "goo" in s.(0) <- 'f'; s).  Note that I had a friend who was 
burned by this recently- it seems that more recent versions of Java have 
replaced the old mediocre-but-workable string hashcode implementation with 
the default "magic number" implementation.  Well, my friend managed to 
defeat the interning the compiler tried to do (not delibertly), and was 
surprised when the hashtable lookup suddently failed.

If the strings you are looking up in the hashtabl are known ahead of time
(for example, they're keywords in a programming language), you might look
at perfect hashes (see gperf for more info).  If you're not, I think what
you're asking for is impossible.

Consider the following impelementation: you have a resizeable array
(there's one in ExtLib, or roll your own).  When you create a new object,
you insert it into the array.  You can now use the array index as the
magic number for the object.  You can use the array index as a key into a
hash table, but why bother?  Why not just look the structure up in the
array?  To look the structure up in either the array or the hash table,
you need the array index (the key for the hash table)- how do you have the
key?  You have two choices: you can seach the array (O(log N) if you're
lucky), or the array index/hash table key has to come from the data
itself.

If you simply store the array index in with the rest of the data, doing
something like:

type my_type = {
    array_index : int;
    ...
};;

then by the time you get the array index, you already have the data.  The 
other possibility is to compute the key from some or all of the data.  
Welcome to the wonderfull world of hash functions.

Any magic number mapping you can come up with is equivelent to the index 
into the arrays problem I outlined above- except sometimes you don't have 
the array.  The "address of" idea that Java's implementation was obviously 
originally designed for is exactly this sort of implementation- memory 
itself becomes the array.  And the question of how you get the magic 
number of an object you don't already have is also a good question (if 
you're passing around the magic number, I comment that passing around the 
reference to the object itself is no more data).

As a side note, be wary of hashtables.  They works real slick- right up to 
the point where they don't.  If the hash functions fail to give you 
sufficient dispersion over the field of interesting values, the nice O(1) 
best-case performance of hashtables suddenly becomes O(N).  The nice thing 
about balanced trees is that they're O(log N) worst case.  Hashtables are 
*NOT* magic data structures.

The other comment I'd make is that it sound like you're doing premature 
optimization.  Stop it.  Get the code working first, and then measure to 
see if it's a performance problem.  What is your average length string 
going to be?  Walking down a 4- or even 16-character string isn't that 
expensive.  How much dispersion is your hash function giving you?  Or is 
some other part of the code the problem child?  Or is the code simply fast 
enough, and doesn't need any performance tuning what so ever?

Getting it working first, then worry about performance.

Brian



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Hashtbls with physical equality?
  2004-11-15 20:37 ` Brian Hurt
@ 2004-11-15 21:03   ` Wheeler Ruml
  2004-11-15 21:30   ` Marcin 'Qrczak' Kowalczyk
  2004-11-15 23:42   ` Stefan Monnier
  2 siblings, 0 replies; 10+ messages in thread
From: Wheeler Ruml @ 2004-11-15 21:03 UTC (permalink / raw)
  To: Brian Hurt; +Cc: caml-list

Hi Brian,

Thanks for your reply.

> > Is it possible in OCaml to have a hash table that can insert and retrieve
> > values without walking over their structure?  I tried to hack something
> > together using Obj.magic
> 
> There is no magic value associated with objects, unlike Java.

OK - thanks.

> > I have a hash table
> > of strings and I'd like to avoid traversing their length on every lookup.
> > Do I have to explicitly use integers instead?
>
> [...] I think what you're asking for is impossible.

Yeah, seems that way.  I think I'll explicitly associate an integer with
each string, along the lines of what you suggest.

> The nice thing about balanced trees is that they're O(log N) worst case.
> [...]  Getting it working first, then worry about performance.

Thanks for the tips - good advice.  Happily, it already works, and a factor
of two in performance actually does matter in my application.

Best,

Wheeler
-- 
Wheeler Ruml, Palo Alto Research Center, Rm 1522, 650-812-4329 voice
ruml@parc.com, http://www.parc.com/ruml           650-812-4334 fax


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Caml-list] Hashtbls with physical equality?
  2004-11-15 20:37 ` Brian Hurt
  2004-11-15 21:03   ` Wheeler Ruml
@ 2004-11-15 21:30   ` Marcin 'Qrczak' Kowalczyk
  2004-11-15 23:42   ` Stefan Monnier
  2 siblings, 0 replies; 10+ messages in thread
From: Marcin 'Qrczak' Kowalczyk @ 2004-11-15 21:30 UTC (permalink / raw)
  To: caml-list

Brian Hurt <bhurt@spnz.org> writes:

> There is no magic value associated with objects, unlike Java.  Personally, 
> I think that was one of the biggest screwups with Java.

Where is it in Java? As far as I know Java does not provide any way
to obtain a unique int corresponding to object identity.

It does provide a default hashCode() which is compatible with the
default equals(), and the default equals() compares pointers, but
hashCode() does not have to be unique. It's still hard to implement
the default hashCode() with a copying GC (with a usable distribution
and without overheads).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Hashtbls with physical equality?
  2004-11-15 20:37 ` Brian Hurt
  2004-11-15 21:03   ` Wheeler Ruml
  2004-11-15 21:30   ` Marcin 'Qrczak' Kowalczyk
@ 2004-11-15 23:42   ` Stefan Monnier
  2 siblings, 0 replies; 10+ messages in thread
From: Stefan Monnier @ 2004-11-15 23:42 UTC (permalink / raw)
  To: caml-list

> Starting with the  fact that it returns a 32-bit *int* (and what happens
> when you have more  than 4 billion objects?  Possible in a 64-bit
> system!).

It's not an object-id, it's a hash code.  I.e. it does not guarantee that

     a != b  ==>  a.hashCode() != b.hashCode()

so the 32bit restriction is only a serious problem once the size of your
hashtable gets close to a billion of buckets.
Sure that can happen, but not nearly as soon as a 4GB Java process or even
a Java process with 4 billion objects.


        Stefan


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2004-11-15 23:42 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-11-14 22:29 Hashtbls with physical equality? Wheeler Ruml
2004-11-15  0:33 ` [Caml-list] " Christian Szegedy
     [not found] ` <20041115012212.GA6561@artisan.com>
2004-11-15  2:20   ` Wheeler Ruml
2004-11-15 10:24     ` Alex Baretta
2004-11-15 16:45       ` Wheeler Ruml
2004-11-15 20:34         ` Marcin 'Qrczak' Kowalczyk
2004-11-15 20:37 ` Brian Hurt
2004-11-15 21:03   ` Wheeler Ruml
2004-11-15 21:30   ` Marcin 'Qrczak' Kowalczyk
2004-11-15 23:42   ` Stefan Monnier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox