Mailing list for all users of the OCaml language and system.
 help / color / mirror / Atom feed
* RE: Alternative generic hash function
@ 2000-05-25 17:08 Manuel Fahndrich
  0 siblings, 0 replies; 12+ messages in thread
From: Manuel Fahndrich @ 2000-05-25 17:08 UTC (permalink / raw)
  To: 'dsl@tepkom.ru', caml-list

Thanks for all the answers. Yes, one clearly would need information
about mutable data fields at runtime in order to implement hash_pure.

Dmitri's suggestion on using block length and tag for computing the hash is
essentially the
0th approximation of hash_pure, since we know that the length and tag of a
block are immutable.

Noone has made any comments about the managing of explicit id's in order to
get around the problem I sketched. 
I've written a lot of code with such id generation and it is not an ideal
solution. First, there needs to be an id generator, which introduces a
global variable into your program. Second, what happens if you wrap around
in the id space (e.g. ints). I can speak from experience that wrap around
happens, namely when generating structures with id's repeatedly, where most
of them are garbage collected away. At any point, there are only few
structures live, but whenever we create a new one, we still need a fresh id.
In general, one would have to be able to reuse id's, and thus connect with
the garbage collector through weak pointers etc... It gets pretty
complicated.

My point is that the memory address of an object is an ideal id for equality
comparisons. The space of addresses is already managed by the language, such
that reusable id's can be generated cheaply. With copying garbage
collection, id's change over time, which is not a problem for equality
testing, but breaks hashing on them. The only reliable part of a
datastructure to use in computing a hash is the immutable part.

-Manuel

BTW, the same problem arises with the polymorphic comparison functions:

        Objective Caml version 3.00

# let x = ref 1;;
val x : int ref = {contents=1}
# let y = ref 5;;
val y : int ref = {contents=5}
# x < y;;
- : bool = true
# x := 10;;
- : unit = ()
# x < y;;
- : bool = false

Of course, using an analogous pure comparison operator would result in a
preorder, not a total order.




-----Original Message-----
From: Dmitri Lomov [mailto:dsl@tepkom.ru]
Sent: Thursday, May 25, 2000 12:38 AM
To: caml-list@pauillac.inria.fr
Cc: Manuel Fahndrich
Subject: Re: Alternative generic hash function


AFA I understand, Hastable.hash_pure will be hard to implement in run-time,
as now at run-time it is not known whether given data structure, or 
given data structure field is mutable or not.
Such information is not propagated to run-time by ocaml compiler.

Possibly your option here is to divide your data structure into mutable
and immutable parts, and than use Hashtable.hash on immutable part.
Then all will work well.

This, however, does not cover the example you present. 
But actually the nature of hashing assumes the hashed object has
some immutable part. For example, when hashing strings you do not assume 
to get the same hash value for mutable string if you change some character
of it.
In your example everything in an object changes - it does not have any 
immutable properties/elements/fields to "hook" hash to.

By the way, a rough hashing function for all O'Caml objects can be based
on block length and/or tag (they are accesible via Obj module).
Such hash function will tend to be very biased, due to biased
distribution of lengths and tags, but in very bad situations it 
may help.

I use such function in my (forthcoming :-)) dynamic code generation library 
for O'Caml to keep external references from code block. 

Regards,
Dmitri

Manuel Fahndrich wrote:
> 
> The generic hash function Hashtbl.hash traverses into mutable data
> structures. So I get different hash values depending on updates of the
data
> structure.
> 
> # let x = ref None;;
> val x : '_a option ref = {contents=None}
> # Hashtbl.hash x;;
> - : int = 0
> # x := Some 5;;
> - : unit = ()
> # Hashtbl.hash x;;
> - : int = 5
> 
> As was pointed out times before on this list, Hashtbl.hash is thus
> unsuitable for hashing on data structures that contain mutable data.
> 
> I'm wondering what the design rational is for this choice. I'd rather have
a
> generic hash function (Hashtbl.hash_pure) that never looks into mutable
> structures (skips refs and mutable fields of records and classes).
> The reason being that if I want to hash on data that contains references,
> I'm more likely to want to find the structure independent of the contents
of
> any mutable subparts.
> 
> So yes, a hashtable containing int refs would pointless, because it maps
> them all to the same bucket. But I frequently use hashing with equality
> being ==. Whenever I do this, I need to actually add some unique
identifier
> to the data structure just for the hashing operation. Having an alternate
> hash function (Hashtbl.hash_pure) would make such id's (and their
managment)
> unnecessary.
> 
> I see how the current working of Hashtbl.hash is useful for data
structures
> that contain mutable parts in the case where such structures don't change
> between the time they are stored and later looked up in the hash table.
But
> I'm wondering how often this actually arises.
> 
> Furthermore, using Hashtbl.hash_pure in place of Hashtbl.hash would not
> break any old code (except for performance), since all it does is make the
> equivalence classes bigger. On the other hand, using Hashtbl.hash with the
> misconception that it acts as Hashtbl.hash_pure breaks lots of code.
> 
> I rest my case.
> 
>         Manuel Fahndrich
 
_________________________________________________________________
Dmitri S. Lomov
mailto:dsl@tepkom.ru    ICQ#: 20524819 (Rusty)
http://users.tepkom.ru/dsl, http://young.tepkom.ru
ftp://young.tepkom.ru
+7 (812) 428-46-57 (b)   +7 (812) 295-94-15 (h)




^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Alternative generic hash function
@ 2000-05-29 15:00 Damien Doligez
  2000-05-30 18:04 ` Daniel Ortmann
  0 siblings, 1 reply; 12+ messages in thread
From: Damien Doligez @ 2000-05-29 15:00 UTC (permalink / raw)
  To: caml-list

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 457 bytes --]

>From: Dan Grossman <danieljg@cs.cornell.edu>

>[Veuillez me pardonner; je ne donne pas une traduction en français parce

No need to apologize, really.


>* Recording this information would only require 1 header bit.  (I
>personally could live with 21 size bits.)  Getting this bit into the
>header word should not be hard.

Some people are already complaining about the maximum size of arrays.
And it's true that 16 megabytes is quite small.

-- Damien




^ permalink raw reply	[flat|nested] 12+ messages in thread
* Alternative generic hash function
@ 2000-05-24 22:04 Manuel Fahndrich
  2000-05-25  7:00 ` Pierre Weis
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Manuel Fahndrich @ 2000-05-24 22:04 UTC (permalink / raw)
  To: caml-list


The generic hash function Hashtbl.hash traverses into mutable data
structures. So I get different hash values depending on updates of the data
structure. 

# let x = ref None;;
val x : '_a option ref = {contents=None}
# Hashtbl.hash x;;
- : int = 0
# x := Some 5;;
- : unit = ()
# Hashtbl.hash x;;
- : int = 5

As was pointed out times before on this list, Hashtbl.hash is thus
unsuitable for hashing on data structures that contain mutable data.

I'm wondering what the design rational is for this choice. I'd rather have a
generic hash function (Hashtbl.hash_pure) that never looks into mutable
structures (skips refs and mutable fields of records and classes).
The reason being that if I want to hash on data that contains references,
I'm more likely to want to find the structure independent of the contents of
any mutable subparts.

So yes, a hashtable containing int refs would pointless, because it maps
them all to the same bucket. But I frequently use hashing with equality
being ==. Whenever I do this, I need to actually add some unique identifier
to the data structure just for the hashing operation. Having an alternate
hash function (Hashtbl.hash_pure) would make such id's (and their managment)
unnecessary.

I see how the current working of Hashtbl.hash is useful for data structures
that contain mutable parts in the case where such structures don't change
between the time they are stored and later looked up in the hash table. But
I'm wondering how often this actually arises.

Furthermore, using Hashtbl.hash_pure in place of Hashtbl.hash would not
break any old code (except for performance), since all it does is make the
equivalence classes bigger. On the other hand, using Hashtbl.hash with the
misconception that it acts as Hashtbl.hash_pure breaks lots of code.

I rest my case.

	Manuel Fahndrich





^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2000-05-30 20:35 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2000-05-25 17:08 Alternative generic hash function Manuel Fahndrich
  -- strict thread matches above, loose matches on Subject: below --
2000-05-29 15:00 Damien Doligez
2000-05-30 18:04 ` Daniel Ortmann
2000-05-24 22:04 Manuel Fahndrich
2000-05-25  7:00 ` Pierre Weis
2000-05-25  7:14 ` STARYNKEVITCH Basile
2000-05-25  7:38 ` Dmitri Lomov
2000-05-25  8:24 ` Francois Pottier
2000-05-26 19:35   ` Dan Grossman
2000-05-29 13:13     ` Francois Pottier
2000-05-29 23:45       ` Ken Wakita
2000-05-30 17:25       ` Dan Grossman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox