From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.4 required=5.0 tests=AWL,DNS_FROM_RFC_ABUSE autolearn=disabled version=3.1.3 Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by yquem.inria.fr (Postfix) with ESMTP id 2AA61BC69 for ; Tue, 15 Jan 2008 04:36:40 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ao8CAHC5i0eCNhAB/2dsb2JhbACRWJga X-IronPort-AV: E=Sophos;i="4.24,284,1196636400"; d="scan'208";a="7801360" Received: from kurims.kurims.kyoto-u.ac.jp ([130.54.16.1]) by mail3-smtp-sop.national.inria.fr with ESMTP; 15 Jan 2008 04:36:38 +0100 Received: from localhost (orion [130.54.16.5]) by kurims.kurims.kyoto-u.ac.jp (8.13.8/8.13.8) with ESMTP id m0F3aTQw008231; Tue, 15 Jan 2008 12:36:29 +0900 (JST) Date: Tue, 15 Jan 2008 12:36:21 +0900 (JST) Message-Id: <20080115.123621.184910428.garrigue@math.nagoya-u.ac.jp> To: ober.14@osu.edu Cc: caml-list@yquem.inria.fr Subject: Re: [Caml-list] Re: Hash clash in polymorphic variants From: Jacques Garrigue In-Reply-To: <200801140956.25449.ober.14@osu.edu> References: <200801140720.02094.ober.14@osu.edu> <200801140956.25449.ober.14@osu.edu> X-Mailer: Mew version 4.2 on Emacs 22.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam: no; 0.00; hash:01 variants:01 hash:01 hashing:01 inputs:01 inputs:01 marshalling:01 variants:01 logarithmic:01 ocaml:01 hashes:01 hashes:01 implements:01 lablgl:01 polymorphic:01 From: Kuba Ober > On Monday 14 January 2008, Stefan Monnier wrote: > > > What I meant was simply that instead of using some fixed hash function, > > > one could use a perfect hashing function which is optimal for its known > > > set of inputs, and won't ever generate a collision. > > > > The problem is that the set of inputs is not know at compile time, only > > at link time. > > As I've said in the cited post, the perfect hash generator would have to be > invoked at link time, which shouldn't be a big deal. Unfortunately, this would make marshalling between different programs much more complicated... Another advantage of knowing the hash function at compile time is that you can generate efficient code for pattern matching. Since you already know the ordering of tags, it is easy to generate a decision tree. I didn't check very recently about efficiency for polymorphic variants, but the depth of the decision tree is logarithmic in the number of tags involved in the pattern matching, and if you can keep it below 3 or 4 (about 10 tags) you can be actually faster than a jump table. Another comparison is with the old implementation for method calls. Originally ocaml used your idea for methods: method hashes were generated at initialization time. The scheme for dispatch was a two level array, compressed by reusing buckets so that you don't use too much memory. This meant actually 3 array accesses for a method call. The current scheme reuses variant hashes, and implements a simple dichotomic search, together with an index cache for each call site. This doesn't look very efficient, but on small method tables, the search is almost as fast as the old approach, and if the cache hits this is much faster... Now concerning the risks of name conflicts. The main point of polymorphic variants is that there is only a conflict if the two tags appear in the same type. And logically the type should stay small. If you want to put all GLenum's inside the same type, then you may well end up with conflicts. But what LablGL shows is that in practice only a small number of tags are used together. So if you can partition your set of tags so that each type has at most 64 tags, then you get a probability conflict less than 1 per million for each type. This seems safe enough. But if you have one type with 2000 tags, then the probability is 1 per thousand. Not that much, but it can happen. (p(n) is n*n / 2**32) Jacques Garrigue