From: "Yaron M. Minsky" <yminsky@CS.Cornell.EDU>
To: Tibor Simko <tibor.simko@cern.ch>
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] intersecting huge integer sets
Date: 27 Aug 2002 08:06:54 -0400 [thread overview]
Message-ID: <1030450016.24909.45.camel@dragonfly.localdomain> (raw)
In-Reply-To: <87n0r835w9.fsf@pcdh91.cern.ch>
If the sets are really as dense as you say they are (on the order of 50%
of all possible elements), then the fastest thing is probably just to
use a bool array. Such an approach would require 3 passes: one to set
all the elements of the array to false. One to mark the elements
corresponding to set 1 to true, and then one pass over set 2 to see
which of its elements are shared by 1.
y
On Tue, 2002-08-27 at 07:20, Tibor Simko wrote:
> Hello
>
> I wonder what is the OCaml canonical way to get fast intersection of
> huge sets of integers: say, to intersect two sets of 500000 items
> containing distinct integer values from the (0,1000000) interval.
>
> I quickly tested the Hashtbl way, along the lines:
>
> +----
> | let dsize d =
> | (* return length of dictionary *)
> | let n = ref 0 in Hashtbl.iter (fun _ _ -> incr n) d; !n;;
> |
> | let dcreate nitems maxval =
> | (* create dictionary of nitems with integer keys from (0,maxval) interval *)
> | let d = Hashtbl.create 0 and
> | size = ref 0 and
> | rnd = ref 0 in
> | while !size < nitems && !size < maxval do
> | rnd := Random.int maxval;
> | if Hashtbl.mem d !rnd
> | then ()
> | else (incr size;
> | Hashtbl.add d !rnd 1)
> | done;
> | d;;
> |
> | let dintersect d1 d2 =
> | (* intersect two dictionaries: retain elements in d1 that are also in d2 *)
> | (* note: assuming that d1 is smaller in size than d2. Otherwise not efficient. *)
> | Hashtbl.iter (fun key _ ->
> | if Hashtbl.mem d2 key then ()
> | else Hashtbl.remove d1 key) d1;;
> |
> | (* main *)
> | let nitems = if Array.length Sys.argv > 1 then int_of_string Sys.argv.(1) else 300000 and
> | maxval = if Array.length Sys.argv > 2 then int_of_string Sys.argv.(2) else 500000;;
> | Printf.printf "intersecting sets of %d items from (0,%d) ... " nitems maxval;
> | flush stdout;
> | let d1 = dcreate nitems maxval and
> | d2 = dcreate nitems maxval in
> | let t1 = Sys.time () in
> | dintersect d1 d2;
> | Printf.printf "%d common items retained in %.2f seconds.\n" (dsize d1) (Sys.time () -. t1);;
> +----
>
> I found that the dintersect function appears to be about 3-4 times
> faster than its Python equivalent, and about 10% faster than Java's
> native HashSet retainAll() method.
>
> I wonder how to improve OCaml performance in this special case. Is
> there a better suited OCaml data structure to simulate huge hashed
> sets of integers rather than by using OCaml's native Hashtbl?
>
> Tibor
> -------------------
> To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
> Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
--
|--------/ Yaron M. Minsky \--------|
|--------\ http://www.cs.cornell.edu/home/yminsky/ /--------|
Open PGP --- KeyID B1FFD916 (new key as of Dec 4th)
Fingerprint: 5BF6 83E1 0CE3 1043 95D8 F8D5 9F12 B3A9 B1FF D916
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
next prev parent reply other threads:[~2002-08-27 12:07 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2002-08-27 11:20 Tibor Simko
2002-08-27 11:52 ` Markus Mottl
2002-08-27 12:09 ` Diego Olivier Fernandez Pons
2002-08-27 12:06 ` Yaron M. Minsky [this message]
2002-08-27 12:12 ` Markus Mottl
2002-08-28 12:29 ` Tibor Simko
2002-08-29 10:13 ` Diego Olivier Fernandez Pons
2002-08-29 11:33 ` [Caml-list] barre dans le filtrage de motifs Diego Olivier Fernandez Pons
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1030450016.24909.45.camel@dragonfly.localdomain \
--to=yminsky@cs.cornell.edu \
--cc=caml-list@inria.fr \
--cc=tibor.simko@cern.ch \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox