From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <francois.rouaix@gmail.com>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: *
X-Spam-Status: No, score=1.2 required=5.0 tests=AWL,HTML_MESSAGE,
	MAILTO_TO_SPAM_ADDR,SPF_NEUTRAL autolearn=disabled version=3.1.3
Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105])
	by yquem.inria.fr (Postfix) with ESMTP id BEB45BBC1
	for <caml-list@yquem.inria.fr>; Wed, 20 Feb 2008 07:11:16 +0100 (CET)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AgAAAAZTu0dIDszje2dsb2JhbACCPTSNYQEBCQQFCAoRl3GHPg
X-IronPort-AV: E=Sophos;i="4.25,380,1199660400"; 
   d="scan'208";a="22814649"
Received: from qb-out-0506.google.com ([72.14.204.227])
  by mail4-smtp-sop.national.inria.fr with ESMTP; 20 Feb 2008 07:11:15 +0100
Received: by qb-out-0506.google.com with SMTP id e11so2476867qbe.15
        for <caml-list@yquem.inria.fr>; Tue, 19 Feb 2008 22:11:14 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=gamma;
        h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:references;
        bh=f6WBOdJ6SPdaJR3LT4Ftbx6s/HIPSSgx5+M/4UYqIQg=;
        b=aVYsFGgJF33ah6rXG3VT2S8qnEmxVChWryXyVy6xjxz/lOlvsFcRDVStw6lN850sxXD4T4feAvKcddsD5SufywAp0RgGWRHG/Il6NQPkx0oqIkoCsxhZ/+SGp72C6gEOEWHWC9/psCUMVFC/TURaA5KB7RBeWu6ETM4QKBkOuOU=
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:references;
        b=eV9B0kf1rNuZicLHIEa1F0IDeulI72uGcZtI/R9ghmwa385dU4n6ImUIVZdEmu5DEUOuPs13zHvrJX8ZJhs09z+Oq6ZxX1fu3pswpgfn8oNWIKTrtjW061SnLhOcQsB3grNavC8PSK7oWFUHBNOfh1Q0im9zLbyN0rMJgwks5DY=
Received: by 10.115.60.1 with SMTP id n1mr4973911wak.37.1203487873520;
        Tue, 19 Feb 2008 22:11:13 -0800 (PST)
Received: by 10.114.166.15 with HTTP; Tue, 19 Feb 2008 22:11:13 -0800 (PST)
Message-ID: <f2d6cc680802192211o1b445573uec00abf46654887b@mail.gmail.com>
Date: Tue, 19 Feb 2008 22:11:13 -0800
From: "Francois Rouaix" <francois.rouaix@gmail.com>
To: "John Caml" <camljohn42@gmail.com>
Subject: Re: [Caml-list] Re: large hash tables
Cc: caml-list@yquem.inria.fr
In-Reply-To: <33d2b3f70802192118p4d887212mcf76e34447c54e52@mail.gmail.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; 
	boundary="----=_Part_23260_19221201.1203487873524"
References: <fa.XXbywsQknpl7bhlesWN8vFLM58c@ifi.uio.no>
	 <55e81f00-5ef7-4946-9272-05595299e114@41g2000hsc.googlegroups.com>
	 <33d2b3f70802192118p4d887212mcf76e34447c54e52@mail.gmail.com>
X-Spam: no; 0.00; hash:01 arrays:01 pointer:01 arrays:01 structs:01 ocaml:01 --f:01 hashtbl:01 integers:01 ocaml:01 long-time:01 infile:01 infile:01 pcre:01 printf:01 

------=_Part_23260_19221201.1203487873524
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Well, you could use resizeable arrays instead of lists for each bucket.If
you have 100M items, each "cons" becomes fairly expensive. A pointer per
item is 400MB...
I'm a bit surprised that the C++ program only required 800MB - that would be
8 bytes exactly per item; if each item is an int (4 bytes) and a double (8
bytes), it doesn't add up. Or are you using single precision floats and
arrays everywhere (no objects, structs of any kind) ?

The most memory efficient representation in OCaml would probably be a couple
of arrays, ints and floats. For an item indexed by j, the int value is
ints.(j) and the float value is in floats.(j).
ints.(j) == [| i0; ... |]
floats.(j) == [ f0; ... |]

--f


On Feb 19, 2008 9:18 PM, John Caml <camljohn42@gmail.com> wrote:

> Thank you all for the assistance.
>
> I've resolved the Stack_overflow problem by using an Array instead of
> a Hashtbl; my keys were just consecutive integers, so this later
> approach is clearly preferable.
>
> However, the memory usage is still pretty bad...it takes nearly an
> order of magnitude more memory than the equivalent C++ program. While
> the C++ program required 800 MB, my ocaml program requires roughly 6
> GB. Am I doing something very inefficiently? My revised code appears
> below.
>
> Also, if you have any other coding suggestions I'd appreciate hearing
> them. I'm a long-time coder but new to Ocaml and eager to learn.
>
>
> --------------
>
> exception SplitError
>
>
> let loadWholeFile filename =
>    let infile = open_in filename
>    and movieMajor = Array.make 17770 [] in
>
>    let rec loadLines count =
>        let line = input_line infile in
>        let murList = Pcre.split line in
>
>        match murList with
>            | m::u::r::[] ->
>                let rFloat = float_of_string r
>                and mInt = int_of_string m
>                and uInt = int_of_string u in
>
>                let newElement = (uInt, rFloat)
>                and oldList = movieMajor.(mInt) in
>                let newList = List.rev_append [newElement] oldList in
>                Array.set movieMajor mInt newList;
>
>                if (count mod 1000000) == 0 then begin
>                    Printf.printf "count: %d\n" count;
>                    flush stdout;
>                    end;
>
>                    loadLines (count + 1)
>
>            | _ -> raise SplitError
>  in
>
>    try
>        loadLines 0
>    with
>        End_of_file -> close_in infile;
>        movieMajor
> ;;
>
>
> let filename = Sys.argv.(1);;
> let str = loadWholeFile filename;;
>
> _______________________________________________
> Caml-list mailing list. Subscription management:
> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives: http://caml.inria.fr
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>

------=_Part_23260_19221201.1203487873524
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Well, you could use resizeable arrays instead of lists for each bucket.<div>If you have 100M items, each &quot;cons&quot; becomes fairly expensive. A pointer per item is 400MB...</div><div>I&#39;m a bit surprised that the C++ program only required 800MB - that would be 8 bytes exactly per item; if each item is an int (4 bytes) and a double (8 bytes), it doesn&#39;t add up. Or are you using single precision floats and arrays everywhere (no objects, structs of any kind) ?</div>
<div><br class="webkit-block-placeholder"></div><div>The most memory efficient representation in OCaml would probably be a couple of arrays, ints and floats. For an item indexed by j, the int value is ints.(j) and the float value is in floats.(j).&nbsp;</div>
<div>ints.(j) == [| i0; ... |]</div><div>floats.(j) == [ f0; ... |]</div><div><br class="webkit-block-placeholder"></div><div>--f<br></div><div><br class="webkit-block-placeholder"></div><div><div><br><div class="gmail_quote">
On Feb 19, 2008 9:18 PM, John Caml &lt;<a href="mailto:camljohn42@gmail.com">camljohn42@gmail.com</a>&gt; wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">Thank you all for the assistance.<br>
<br>I&#39;ve resolved the Stack_overflow problem by using an Array instead of<br>a Hashtbl; my keys were just consecutive integers, so this later<br>approach is clearly preferable.<br><br>However, the memory usage is still pretty bad...it takes nearly an<br>
order of magnitude more memory than the equivalent C++ program. While<br>the C++ program required 800 MB, my ocaml program requires roughly 6<br>GB. Am I doing something very inefficiently? My revised code appears<br>below.<br>
<br>Also, if you have any other coding suggestions I&#39;d appreciate hearing<br>them. I&#39;m a long-time coder but new to Ocaml and eager to learn.<br><br><br>--------------<br><br>exception SplitError<br><br><br>let loadWholeFile filename =<br>
 &nbsp; &nbsp;let infile = open_in filename<br> &nbsp; &nbsp;and movieMajor = Array.make 17770 [] in<br><br> &nbsp; &nbsp;let rec loadLines count =<br> &nbsp; &nbsp; &nbsp; &nbsp;let line = input_line infile in<br><div class="Ih2E3d"> &nbsp; &nbsp; &nbsp; &nbsp;let murList = Pcre.split line in<br>
<br> &nbsp; &nbsp; &nbsp; &nbsp;match murList with<br> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;| m::u::r::[] -&gt;<br> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;let rFloat = float_of_string r<br></div> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;and mInt = int_of_string m<br> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;and uInt = int_of_string u in<br><br>
 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;let newElement = (uInt, rFloat)<br> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;and oldList = movieMajor.(mInt) in<br> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;let newList = List.rev_append [newElement] oldList in<br> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Array.set movieMajor mInt newList;<br>
<br> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;if (count mod 1000000) == 0 then begin<br> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Printf.printf &quot;count: %d\n&quot; count;<br> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;flush stdout;<br> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;end;<br><div class="Ih2E3d"><br> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;loadLines (count + 1)<br>
<br> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;| _ -&gt; raise SplitError<br> &nbsp;in<br><br> &nbsp; &nbsp;try<br> &nbsp; &nbsp; &nbsp; &nbsp;loadLines 0<br> &nbsp; &nbsp;with<br></div> &nbsp; &nbsp; &nbsp; &nbsp;End_of_file -&gt; close_in infile;<br> &nbsp; &nbsp; &nbsp; &nbsp;movieMajor<br><div class="Ih2E3d">;;<br><br><br>let filename = Sys.argv.(1);;<br>
</div>let str = loadWholeFile filename;;<br><div><div></div><div class="Wj3C7c"><br>_______________________________________________<br>Caml-list mailing list. Subscription management:<br><a href="http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list" target="_blank">http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list</a><br>
Archives: <a href="http://caml.inria.fr" target="_blank">http://caml.inria.fr</a><br>Beginner&#39;s list: <a href="http://groups.yahoo.com/group/ocaml_beginners" target="_blank">http://groups.yahoo.com/group/ocaml_beginners</a><br>
Bug reports: <a href="http://caml.inria.fr/bin/caml-bugs" target="_blank">http://caml.inria.fr/bin/caml-bugs</a><br></div></div></blockquote></div><br></div></div>

------=_Part_23260_19221201.1203487873524--