From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <garrigue@math.nagoya-u.ac.jp>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.4 required=5.0 tests=AWL,DNS_FROM_RFC_ABUSE 
	autolearn=disabled version=3.1.3
Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104])
	by yquem.inria.fr (Postfix) with ESMTP id 2AA61BC69
	for <caml-list@yquem.inria.fr>; Tue, 15 Jan 2008 04:36:40 +0100 (CET)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Ao8CAHC5i0eCNhAB/2dsb2JhbACRWJga
X-IronPort-AV: E=Sophos;i="4.24,284,1196636400"; 
   d="scan'208";a="7801360"
Received: from kurims.kurims.kyoto-u.ac.jp ([130.54.16.1])
  by mail3-smtp-sop.national.inria.fr with ESMTP; 15 Jan 2008 04:36:38 +0100
Received: from localhost (orion [130.54.16.5])
	by kurims.kurims.kyoto-u.ac.jp (8.13.8/8.13.8) with ESMTP id m0F3aTQw008231;
	Tue, 15 Jan 2008 12:36:29 +0900 (JST)
Date: Tue, 15 Jan 2008 12:36:21 +0900 (JST)
Message-Id: <20080115.123621.184910428.garrigue@math.nagoya-u.ac.jp>
To: ober.14@osu.edu
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Re: Hash clash in polymorphic variants
From: Jacques Garrigue <garrigue@math.nagoya-u.ac.jp>
In-Reply-To: <200801140956.25449.ober.14@osu.edu>
References: <200801140720.02094.ober.14@osu.edu>
	<jwvk5mceabc.fsf-monnier+gmane.comp.lang.caml.inria@gnu.org>
	<200801140956.25449.ober.14@osu.edu>
X-Mailer: Mew version 4.2 on Emacs 22.1 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Spam: no; 0.00; hash:01 variants:01 hash:01 hashing:01 inputs:01 inputs:01 marshalling:01 variants:01 logarithmic:01 ocaml:01 hashes:01 hashes:01 implements:01 lablgl:01 polymorphic:01 

From: Kuba Ober <ober.14@osu.edu>
> On Monday 14 January 2008, Stefan Monnier wrote:
> > > What I meant was simply that instead of using some fixed hash function,
> > > one could use a perfect hashing function which is optimal for its known
> > > set of inputs, and won't ever generate a collision.
> >
> > The problem is that the set of inputs is not know at compile time, only
> > at link time.
> 
> As I've said in the cited post, the perfect hash generator would have to be 
> invoked at link time, which shouldn't be a big deal.

Unfortunately, this would make marshalling between different programs
much more complicated...

Another advantage of knowing the hash function at compile time is
that you can generate efficient code for pattern matching. Since you
already know the ordering of tags, it is easy to generate a decision
tree. I didn't check very recently about efficiency for polymorphic
variants, but the depth of the decision tree is logarithmic in the
number of tags involved in the pattern matching, and if you can keep
it below 3 or 4 (about 10 tags) you can be actually faster than a
jump table.
Another comparison is with the old implementation for method calls.
Originally ocaml used your idea for methods: method hashes were
generated at initialization time. The scheme for dispatch was a two
level array, compressed by reusing buckets so that you don't use too
much memory. This meant actually 3 array accesses for a method call.
The current scheme reuses variant hashes, and implements a simple
dichotomic search, together with an index cache for each call site.
This doesn't look very efficient, but on small method tables, the
search is almost as fast as the old approach, and if the cache hits
this is much faster...

Now concerning the risks of name conflicts. The main point of
polymorphic variants is that there is only a conflict if the two tags
appear in the same type. And logically the type should stay small.
If you want to put all GLenum's inside the same type, then you may
well end up with conflicts. But what LablGL shows is that in practice
only a small number of tags are used together. So if you can partition
your set of tags so that each type has at most 64 tags, then you get
a probability conflict less than 1 per million for each type. This
seems safe enough. But if you have one type with 2000 tags, then the
probability is 1 per thousand. Not that much, but it can happen.
(p(n) is n*n / 2**32) 

Jacques Garrigue