From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.1.3 Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105]) by yquem.inria.fr (Postfix) with ESMTP id CE577BBCB for ; Wed, 20 Feb 2008 00:36:15 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AgAAAHj2ukfU436xnmdsb2JhbACBWY54AQEBAQEGBAYHChiBFJx2AQ X-IronPort-AV: E=Sophos;i="4.25,378,1199660400"; d="scan'208";a="22808976" Received: from moutng.kundenserver.de ([212.227.126.177]) by mail4-smtp-sop.national.inria.fr with ESMTP; 20 Feb 2008 00:36:15 +0100 Received: from gate.lan.gerd-stolpmann.de (dslb-088-068-209-062.pools.arcor-ip.net [88.68.209.62]) by mrelayeu.kundenserver.de (node=mrelayeu0) with ESMTP (Nemesis) id 0MKwh2-1JRc0R35wB-0002RZ; Wed, 20 Feb 2008 00:36:11 +0100 Received: from [192.168.1.226] (dsl092-191-028.sfo1.dsl.speakeasy.net [66.92.191.28]) by gate.lan.gerd-stolpmann.de (Postfix) with ESMTP id E9CEAC113; Wed, 20 Feb 2008 00:36:10 +0100 (CET) Subject: Re: [Caml-list] large hash tables From: Gerd Stolpmann To: John Caml Cc: caml-list@yquem.inria.fr In-Reply-To: <33d2b3f70802191501v39346a56x4b1852b84d4067b4@mail.gmail.com> References: <33d2b3f70802191501v39346a56x4b1852b84d4067b4@mail.gmail.com> Content-Type: text/plain Date: Wed, 20 Feb 2008 00:36:29 +0100 Message-Id: <1203464189.15877.7.camel@flake.lan.gerd-stolpmann.de> Mime-Version: 1.0 X-Mailer: Evolution 2.12.1 Content-Transfer-Encoding: 7bit X-Provags-ID: V01U2FsdGVkX1++Vg6qWwxMyxvFXVxcGvo0EoVUM4uQ7atgSnX 5noMhILIRAe7hrptgH1jPz0YK5CawHEHA2QIg4b/MzpjKtiDR5 Ynr5pwQUcedgPnxwAQMWadQLvfTNzk1 X-Spam: no; 0.00; hash:01 gerd:01 stolpmann:01 hash:01 ocaml:01 ocamlopt:01 ocamlc:01 ocaml:01 gerd:01 --john:01 hashtbl:01 pcre:01 hashtbl:01 printf:01 printf:01 Am Dienstag, den 19.02.2008, 15:01 -0800 schrieb John Caml: > I need to load roughly 100 million items into a memory based hash > table, where each key is an integer and each value is an integer and a > float. Under C++ stdlib's map library, this requires 800 MB of memory. > Under my naive porting of my C++ program to Ocaml, only 12 million > rows get loaded before I get the program crashes with the message > "Fatal error: exception Stack_overflow". At the time of the crash, > nearly 2 GB of memory are in use. (My machine -- AMD64 architecture -- > has 8 GB of memory, so insufficient memory is not the issue.) > > Two questions: > > 1. How can I overcome the Stack_overflow error? (I'm already compiling > with ocamlopt rather than ocamlc, which seems to have increased the > stack size from 500 MB to 2 GB.) I'd like to be able to use the full 8 > GB on my machine. > 2. How can I implment a more efficient hash table? My Ocaml program > seems to require 10x more memory than under C++. > > My code appears below. Thank you very much. The stack overflow is very strange. I would expect it only for one data case: that there is a single key that occurs very often (several 10000 times). In this case, there can be a stack overflow when the hash table is resized. It is possible to work around by handling this case explicitly. Gerd > > --John > > > -------------- > exception SplitError > > let read_whole_chan chan = > let movieMajor = Hashtbl.create 777777 in > > let rec loadLines count = > let line = input_line chan in > let murList = Pcre.split line in > match murList with > | m::u::r::[] -> > let rFloat = float_of_string r in > Hashtbl.add movieMajor m (u, rFloat); > if (count mod 10000) == 0 then Printf.printf "count: % > d, m: %s, u: %s, r: %f \n" count m u rFloat; > loadLines (count + 1) > | _ -> raise SplitError > in > > try > loadLines 0 > with > End_of_file -> () > ;; > > let read_whole_file filename = > let chan = open_in filename in > read_whole_chan chan > ;; > > let filename = Sys.argv.(1);; > > let str = read_whole_file filename;; > > _______________________________________________ > Caml-list mailing list. Subscription management: > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list > Archives: http://caml.inria.fr > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs -- ------------------------------------------------------------ Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de Phone: +49-6151-153855 Fax: +49-6151-997714 ------------------------------------------------------------