From: Richard Jones <rich@annexia.org>
To: John Caml <camljohn42@gmail.com>
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] large hash tables
Date: Wed, 20 Feb 2008 12:48:41 +0000 [thread overview]
Message-ID: <20080220124841.GA25817@annexia.org> (raw)
In-Reply-To: <33d2b3f70802191501v39346a56x4b1852b84d4067b4@mail.gmail.com>
On Tue, Feb 19, 2008 at 03:01:46PM -0800, John Caml wrote:
> I need to load roughly 100 million items into a memory based hash table,
> where each key is an integer and each value is an integer and a float. Under
> C++ stdlib's map library, this requires 800 MB of memory.
You can do as well as C++, but you need to understand how values are
stored in the OCaml runtime.
In C++ you're using an int and a (single-precision) float, and
presumably you're storing them packed into an array. Something like
this?
struct item { int i; float f; };
struct item array[100000000];
The storage requirements for this are 100 * 10^6 * 8 bytes = 800 MB,
as you say.
In OCaml things are a bit different. Firstly you are using
double-precision floats ("float" in OCaml is the same as double in C),
so those are 8 bytes. Secondly you're using OCaml ints, and since you
must be on a 64 bit machine, those are 64 bits wide (although in OCaml
you can only use 63 of those bits). So the minimum for storing an int
and a float in OCaml is 16 bytes.
It gets worse though because you're using a (int * float) tuple.
These are stored particularly inefficiently in OCaml. There is an 8
byte header, 8 bytes for the int, then the float is a boxed pointer
[stored in a separate allocation] which means an 8 byte pointer,
another 8 byte header plus 8 bytes for the double-precision float. So
if I've got the math right, the cost of storing each tuple would be:
8 + 8 + 8 + 8 + 8
header of tuple int pointer to float header of float float
= 40 bytes.
Storing those in a simple array means another pointer per tuple
(tuples aren't packed into the array, but pointed to), so that's a
total of around:
100 * 10^6 * (8 + 40) = 4.8 GB.
Luckily there are much better ways to store this data in OCaml. If
you want to keep it simple, use two arrays, a 'float array' to store
double-precision floats and an 'int array' to store 63 bit ints.
These are both unboxed in OCaml, so this works out as:
float array: 100 * 10^6 * 8
int array: + 100 * 10^6 * 8 = 1.6 GB
You can do better (the same as C++) by using Bigarray. The basic
saving of Bigarray is that you can store single-precision float and 32
bit integers directly in them. So either go for two Bigarrays, or use
one Bigarray (of int32_elt type and length 200 million elements) and a
module wrapping the details of inserting and getting the pairs (also
the functions Int32.bits_of_float and Int32.float_of_bits will help
you here).
Rich.
--
Richard Jones
Red Hat
next prev parent reply other threads:[~2008-02-20 12:48 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-02-19 23:01 John Caml
2008-02-19 23:34 ` [Caml-list] " Gabriel Kerneis
2008-02-19 23:36 ` Gerd Stolpmann
2008-02-19 23:51 ` Francois Rouaix
2008-02-20 9:37 ` Berke Durak
2008-02-20 9:56 ` Berke Durak
2008-02-20 12:48 ` Richard Jones [this message]
2008-02-20 15:54 ` Oliver Bandel
2008-02-21 22:45 ` John Caml
2008-02-22 0:33 ` Richard Jones
2008-02-24 5:39 ` John Caml
2008-02-22 14:19 ` Brian Hurt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20080220124841.GA25817@annexia.org \
--to=rich@annexia.org \
--cc=caml-list@yquem.inria.fr \
--cc=camljohn42@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox