From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id QAA02080; Mon, 19 Nov 2001 16:34:59 +0100 (MET)
X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id QAA08631 for <caml-list@pauillac.inria.fr>; Mon, 19 Nov 2001 16:34:58 +0100 (MET)
Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35])
	by concorde.inria.fr (8.11.1/8.10.0) with ESMTP id fAJFYdb27333;
	Mon, 19 Nov 2001 16:34:39 +0100 (MET)
Received: (from xleroy@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id QAA08106; Mon, 19 Nov 2001 16:34:39 +0100 (MET)
Date: Mon, 19 Nov 2001 16:34:39 +0100
From: Xavier Leroy <xavier.leroy@inria.fr>
To: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] Hashtbl and max_len
Message-ID: <20011119163439.C7102@pauillac.inria.fr>
References: <slrn9vfiul.rkm.qrczak@qrnik.zagroda>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0i
In-Reply-To: <slrn9vfiul.rkm.qrczak@qrnik.zagroda>; from qrczak@knm.org.pl on Sun, Nov 18, 2001 at 02:55:17PM +0000
Sender: owner-caml-list@pauillac.inria.fr
Precedence: bulk

> The initial_size argument of Hashtbl.create has much bigger impact on
> efficiency than I thought.
> [...]
> IMHO max_len shouldn't grow that much. Why does it grow at all when
> the current size is not near Sys.max_array_length?

The original strategy was indeed to bound the length of buckets by a
constant, and resize when a bucket exceeds this length.  However, this
results in excessive resizing when the hashing causes collisions, or
(worse!) the same key is repeatedly inserted in the hash table.
Hence, as a stopgap, we decided to increase the maximal bucket length
at each resizing.

There are two problems with this strategy:
- as you noticed, this causes the initial size to impact performance
  significantly;
- checking for long buckets becomes linear in the size of the table,
  resulting in quadratic behavior for long sequences of insertions.

The latter problem was the last nail in the coffin of this
implementation.  In the working sources, we now keep track of the size
(total number of elements inserted in the table) and base resizing
decisions on this.  In case the hashing is good (all buckets have
approximately the same length), this results in constant length for
buckets.  If the hashing is not good, the performances degrade more
gracefully than with the old scheme.

- Xavier Leroy

PS.  Thanks for pointing out the typos in the working implementation.
The dangers of cut-and-paste programming...
-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr