From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: * X-Spam-Status: No, score=1.2 required=5.0 tests=AWL,HTML_MESSAGE, MAILTO_TO_SPAM_ADDR,SPF_NEUTRAL autolearn=disabled version=3.1.3 Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105]) by yquem.inria.fr (Postfix) with ESMTP id BEB45BBC1 for ; Wed, 20 Feb 2008 07:11:16 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AgAAAAZTu0dIDszje2dsb2JhbACCPTSNYQEBCQQFCAoRl3GHPg X-IronPort-AV: E=Sophos;i="4.25,380,1199660400"; d="scan'208";a="22814649" Received: from qb-out-0506.google.com ([72.14.204.227]) by mail4-smtp-sop.national.inria.fr with ESMTP; 20 Feb 2008 07:11:15 +0100 Received: by qb-out-0506.google.com with SMTP id e11so2476867qbe.15 for ; Tue, 19 Feb 2008 22:11:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:references; bh=f6WBOdJ6SPdaJR3LT4Ftbx6s/HIPSSgx5+M/4UYqIQg=; b=aVYsFGgJF33ah6rXG3VT2S8qnEmxVChWryXyVy6xjxz/lOlvsFcRDVStw6lN850sxXD4T4feAvKcddsD5SufywAp0RgGWRHG/Il6NQPkx0oqIkoCsxhZ/+SGp72C6gEOEWHWC9/psCUMVFC/TURaA5KB7RBeWu6ETM4QKBkOuOU= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:references; b=eV9B0kf1rNuZicLHIEa1F0IDeulI72uGcZtI/R9ghmwa385dU4n6ImUIVZdEmu5DEUOuPs13zHvrJX8ZJhs09z+Oq6ZxX1fu3pswpgfn8oNWIKTrtjW061SnLhOcQsB3grNavC8PSK7oWFUHBNOfh1Q0im9zLbyN0rMJgwks5DY= Received: by 10.115.60.1 with SMTP id n1mr4973911wak.37.1203487873520; Tue, 19 Feb 2008 22:11:13 -0800 (PST) Received: by 10.114.166.15 with HTTP; Tue, 19 Feb 2008 22:11:13 -0800 (PST) Message-ID: Date: Tue, 19 Feb 2008 22:11:13 -0800 From: "Francois Rouaix" To: "John Caml" Subject: Re: [Caml-list] Re: large hash tables Cc: caml-list@yquem.inria.fr In-Reply-To: <33d2b3f70802192118p4d887212mcf76e34447c54e52@mail.gmail.com> MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_Part_23260_19221201.1203487873524" References: <55e81f00-5ef7-4946-9272-05595299e114@41g2000hsc.googlegroups.com> <33d2b3f70802192118p4d887212mcf76e34447c54e52@mail.gmail.com> X-Spam: no; 0.00; hash:01 arrays:01 pointer:01 arrays:01 structs:01 ocaml:01 --f:01 hashtbl:01 integers:01 ocaml:01 long-time:01 infile:01 infile:01 pcre:01 printf:01 ------=_Part_23260_19221201.1203487873524 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline Well, you could use resizeable arrays instead of lists for each bucket.If you have 100M items, each "cons" becomes fairly expensive. A pointer per item is 400MB... I'm a bit surprised that the C++ program only required 800MB - that would be 8 bytes exactly per item; if each item is an int (4 bytes) and a double (8 bytes), it doesn't add up. Or are you using single precision floats and arrays everywhere (no objects, structs of any kind) ? The most memory efficient representation in OCaml would probably be a couple of arrays, ints and floats. For an item indexed by j, the int value is ints.(j) and the float value is in floats.(j). ints.(j) == [| i0; ... |] floats.(j) == [ f0; ... |] --f On Feb 19, 2008 9:18 PM, John Caml wrote: > Thank you all for the assistance. > > I've resolved the Stack_overflow problem by using an Array instead of > a Hashtbl; my keys were just consecutive integers, so this later > approach is clearly preferable. > > However, the memory usage is still pretty bad...it takes nearly an > order of magnitude more memory than the equivalent C++ program. While > the C++ program required 800 MB, my ocaml program requires roughly 6 > GB. Am I doing something very inefficiently? My revised code appears > below. > > Also, if you have any other coding suggestions I'd appreciate hearing > them. I'm a long-time coder but new to Ocaml and eager to learn. > > > -------------- > > exception SplitError > > > let loadWholeFile filename = > let infile = open_in filename > and movieMajor = Array.make 17770 [] in > > let rec loadLines count = > let line = input_line infile in > let murList = Pcre.split line in > > match murList with > | m::u::r::[] -> > let rFloat = float_of_string r > and mInt = int_of_string m > and uInt = int_of_string u in > > let newElement = (uInt, rFloat) > and oldList = movieMajor.(mInt) in > let newList = List.rev_append [newElement] oldList in > Array.set movieMajor mInt newList; > > if (count mod 1000000) == 0 then begin > Printf.printf "count: %d\n" count; > flush stdout; > end; > > loadLines (count + 1) > > | _ -> raise SplitError > in > > try > loadLines 0 > with > End_of_file -> close_in infile; > movieMajor > ;; > > > let filename = Sys.argv.(1);; > let str = loadWholeFile filename;; > > _______________________________________________ > Caml-list mailing list. Subscription management: > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list > Archives: http://caml.inria.fr > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs > ------=_Part_23260_19221201.1203487873524 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline Well, you could use resizeable arrays instead of lists for each bucket.
If you have 100M items, each "cons" becomes fairly expensive. A pointer per item is 400MB...
I'm a bit surprised that the C++ program only required 800MB - that would be 8 bytes exactly per item; if each item is an int (4 bytes) and a double (8 bytes), it doesn't add up. Or are you using single precision floats and arrays everywhere (no objects, structs of any kind) ?

The most memory efficient representation in OCaml would probably be a couple of arrays, ints and floats. For an item indexed by j, the int value is ints.(j) and the float value is in floats.(j). 
ints.(j) == [| i0; ... |]
floats.(j) == [ f0; ... |]

--f


On Feb 19, 2008 9:18 PM, John Caml <camljohn42@gmail.com> wrote:
Thank you all for the assistance.

I've resolved the Stack_overflow problem by using an Array instead of
a Hashtbl; my keys were just consecutive integers, so this later
approach is clearly preferable.

However, the memory usage is still pretty bad...it takes nearly an
order of magnitude more memory than the equivalent C++ program. While
the C++ program required 800 MB, my ocaml program requires roughly 6
GB. Am I doing something very inefficiently? My revised code appears
below.

Also, if you have any other coding suggestions I'd appreciate hearing
them. I'm a long-time coder but new to Ocaml and eager to learn.


--------------

exception SplitError


let loadWholeFile filename =
   let infile = open_in filename
   and movieMajor = Array.make 17770 [] in

   let rec loadLines count =
       let line = input_line infile in
       let murList = Pcre.split line in

       match murList with
           | m::u::r::[] ->
               let rFloat = float_of_string r
               and mInt = int_of_string m
               and uInt = int_of_string u in

               let newElement = (uInt, rFloat)
               and oldList = movieMajor.(mInt) in
               let newList = List.rev_append [newElement] oldList in
               Array.set movieMajor mInt newList;

               if (count mod 1000000) == 0 then begin
                   Printf.printf "count: %d\n" count;
                   flush stdout;
                   end;

                   loadLines (count + 1)

           | _ -> raise SplitError
 in

   try
       loadLines 0
   with
       End_of_file -> close_in infile;
       movieMajor
;;


let filename = Sys.argv.(1);;
let str = loadWholeFile filename;;

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

------=_Part_23260_19221201.1203487873524--