More efficient implementation of intersection of sets?

Mailing list for all users of the OCaml language and system.
 help / color / mirror / Atom feed

* More efficient implementation of intersection of sets?
@ 2008-04-01 15:55 sasha mal
  2008-04-01 23:42 ` [Caml-list] " Jon Harrop
  2008-04-02 15:34 ` Mike Furr
  0 siblings, 2 replies; 7+ messages in thread
From: sasha mal @ 2008-04-01 15:55 UTC (permalink / raw)
  To: caml-list

Dear OCaml users!

Currently,

Set.inter x y

splits y into two trees, one containing elements that are bigger and the other containing elements that are smaller than the top of x, then applies the procedure recursively. What is the exact runtime of the algorithm? Is there a better one for the intersection for OCaml sets?

Regards

Sasha

 --- On Fri 03/14, Alain Frisch < alain@frisch.fr > wrote:

From: Alain Frisch [mailto: alain@frisch.fr]

To: sasha.mal@excite.com

     Cc: caml-list@yquem.inria.fr

Date: Fri, 14 Mar 2008 08:38:41 +0100

Subject: Re: [Caml-list] BDDs in ocaml

sasha mal wrote:> I wonder whether anyone has a BDD (binary decision diagram)> implementation in ocaml. Ocaml interfaces to external BDD> implementations in other languages (like Cudd) are of no use to me.I've seen many implementation of BDDs in OCaml, but none of them implements automatic reordering of variables (which is by far the most complex part of serious BDD packages). For some applications, this is really a must. Why is it impossible for you to use to an external BDD implementation?-- Alain

_______________________________________________
Join Excite! - http://www.excite.com
The most personalized portal on the Web!

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Caml-list] More efficient implementation of intersection of sets?
  2008-04-01 15:55 More efficient implementation of intersection of sets? sasha mal
@ 2008-04-01 23:42 ` Jon Harrop
  2008-04-02 14:05   ` Frédéric Gava
  2008-04-02 15:34 ` Mike Furr
  1 sibling, 1 reply; 7+ messages in thread
From: Jon Harrop @ 2008-04-01 23:42 UTC (permalink / raw)
  To: caml-list

On Tuesday 01 April 2008 16:55:24 sasha mal wrote:
> Dear OCaml users!
>
> Currently,
>
> Set.inter x y
>
> splits y into two trees, one containing elements that are bigger and the
> other containing elements that are smaller than the top of x, then applies
> the procedure recursively. What is the exact runtime of the algorithm?

We discussed this before on this list and the result was inconclusive. Suffice 
to say, it is very fast!

> Is there a better one for the intersection for OCaml sets?

Not likely. OCaml's implementation is already vastly more efficient than any 
other language I have ever seen (e.g. C++). Your next best bet is probably to 
parallelize the algorithm to improve the performance but that is extremely 
difficult to do without a concurrent GC. Frederic Gava did some work on this 
in OCaml. I am working on the same problem in F#.

Failing that, you might want to apply some of the stock optimizations to the 
Set module, such as a Node1 type constructor for nodes with a value but no 
child nodes. That can improve performance by 30%.

Alternatively, you may prefer to ditch immutable structures and opt for a 
hashset, which can be many times faster but is much more difficult to use 
because it is mutable.

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?e

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Caml-list] More efficient implementation of intersection of sets?
  2008-04-01 23:42 ` [Caml-list] " Jon Harrop
@ 2008-04-02 14:05   ` Frédéric Gava
  0 siblings, 0 replies; 7+ messages in thread
From: Frédéric Gava @ 2008-04-02 14:05 UTC (permalink / raw)
  Cc: caml-list

Dear John, Sasha and Caml-list

> Not likely. OCaml's implementation is already vastly more efficient than any 
> other language I have ever seen (e.g. C++). Your next best bet is probably to 
> parallelize the algorithm to improve the performance but that is extremely 
> difficult to do without a concurrent GC. Frederic Gava did some work on this 
> in OCaml. I am working on the same problem in F#.

You can have parallel sets without a concurrent GC : each processor has 
a subset of your initial set and you can distribute the elements using a 
hash function from element to the number of processor "p" (there is so 
"p" ocaml programs that runs and thus "p" GC). A random function can be 
used in general and generate a quick good load balancing.

You can have more information here :
http://lacl.univ-paris12.fr//gava/papers/gava_ppl_2008.pdf

Note, that I used our "under development library" but this work can be 
done using OCaml-MPI


Frédéric G.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Caml-list] More efficient implementation of intersection of sets?
  2008-04-01 15:55 More efficient implementation of intersection of sets? sasha mal
  2008-04-01 23:42 ` [Caml-list] " Jon Harrop
@ 2008-04-02 15:34 ` Mike Furr
  1 sibling, 0 replies; 7+ messages in thread
From: Mike Furr @ 2008-04-02 15:34 UTC (permalink / raw)
  To: caml-list

sasha mal wrote:
> Currently,
> 
> Set.inter x y
> 
> splits y into two trees, one containing elements that are bigger and
> the other containing elements that are smaller than the top of x,
> then applies the procedure recursively. What is the exact runtime of
> the algorithm? Is there a better one for the intersection for OCaml
> sets?

The general algorithm is linear in the size of both sets in the worst 
case, however it can be quite a bit faster in many circumstances.  See 
the paper "Implementing Sets Efficiently in a Functional Language" by 
Stephen Adams[1] for a more detailed discussion.  I haven't looked at 
INRIA's exact implementation very closely, so they may do some extra 
tricks not discussed there, but it certainly appears to use this general 
technique.

-Mike

[1] - http://citeseer.ist.psu.edu/162336.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Caml-list] More efficient implementation of intersection of sets?
@ 2008-04-02 13:30 sasha mal
  0 siblings, 0 replies; 7+ messages in thread
From: sasha mal @ 2008-04-02 13:30 UTC (permalink / raw)
  To: caml-list; +Cc: jon

> On Tuesday 01 April 2008 16:55:24 sasha mal wrote:

> > Dear OCaml users!

> >

> > Currently,

> >

> > Set.inter x y

> >

> > splits y into two trees, one containing elements that are bigger and the

> > other containing elements that are smaller than the top of x, then applies

> > the procedure recursively. What is the exact runtime of the algorithm?

> 

> We discussed this before on this list and the result was inconclusive. Suffice 

> to say, it is very fast!

If could give a pointer to the discussion, that would be great!

> > Is there a better one for the intersection for OCaml sets?

> Not likely.

I wouldn't be so sure.

I'm thinking of an alternative approach which inserts keys one at a time. Basically, we go through both trees simultaneously, starting with smallest nodes of both trees. If two nodes with the same key are seen, the are inserted into a new tree that will hold the intersection; then the pair of nodes with next larger keys is taken. Otherwise, if, say, key1<key2, then we search for the smallest key k>=key2 inside the first tree. If k=key2, we add k into the new tree and proceed to the next pair. If k>key2, we proceed with the pair (k, key2) instead of (key1, key2). For key1>key2, prcoeed analogously, i.e. search for the smallest key k>=key2 in the second tree. If at some point of time no "larger" key is found in one of the two original trees, the new tree where we insert elements to contains exactly the intersection.

The search in the trees is done by using "pointers" which are lists of trees contained in one another. E.g. [a,b,c] means that c is the root, b is one of the children of c, a is one of the children of b.

The advantage compared to the current implementation is that portions of one tree which are irrelevant for the intersection might be skipped. In the current implementation, they are rebalanced.

The used time is at most O((t1+t2)log(t1 intersection t2)) if constructors need O(1) time. A better estimation is welcome.

Could anyone compare this approach to the current implementation?

_______________________________________________
Join Excite! - http://www.excite.com
The most personalized portal on the Web!

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Caml-list] More efficient implementation of intersection of sets?
@ 2008-04-02 14:01 sasha mal
  2008-04-04 17:27 ` Brian Hurt
  0 siblings, 1 reply; 7+ messages in thread
From: sasha mal @ 2008-04-02 14:01 UTC (permalink / raw)
  To: caml-list; +Cc: jon


> I'm thinking of an alternative approach which inserts keys one at a time.

> Basically, we go through both trees simultaneously, starting with smallest

> nodes of both trees. If two nodes with the same key are seen, the are inserted

> into a new tree that will contain the intersection; then the pair of nodes with

> next larger keys is taken. Otherwise, if, say, key1<key2, then we search for

> the smallest key k>=key2 inside the first tree. If k=key2, we add k into the

> new tree and proceed to the next pair. If k>key2, we proceed with the pair 

> (k, key2) instead of (key1, key2). For key1>key2, proceed analogously, i.e.

> search for the smallest key k>=key1 in the second tree. If at some point of

> time no "larger" key is found in one of the two original trees, the new tree

> where we insert elements to contains exactly the intersection.



[Small corrections.]



By the way, the current implemented procedure always splits the second tree, regardless of the sizes of the trees. Imagine that we have two trees of very different sizes (e.g. with 5 and 10^5 elements). What tree should be split then?

_______________________________________________
Join Excite! - http://www.excite.com
The most personalized portal on the Web!



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Caml-list] More efficient implementation of intersection of sets?
  2008-04-02 14:01 sasha mal
@ 2008-04-04 17:27 ` Brian Hurt
  0 siblings, 0 replies; 7+ messages in thread
From: Brian Hurt @ 2008-04-04 17:27 UTC (permalink / raw)
  To: sasha.mal; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 2065 bytes --]

sasha mal wrote:

>>I'm thinking of an alternative approach which inserts keys one at a time.
>>    
>>
>
>  
>
>>Basically, we go through both trees simultaneously, starting with smallest
>>    
>>
>
>  
>
>>nodes of both trees. If two nodes with the same key are seen, the are inserted
>>    
>>
>
>  
>
>>into a new tree that will contain the intersection; then the pair of nodes with
>>    
>>
>
>  
>
>>next larger keys is taken. Otherwise, if, say, key1<key2, then we search for
>>    
>>
>
>  
>
>>the smallest key k>=key2 inside the first tree. If k=key2, we add k into the
>>    
>>
>
>  
>
>>new tree and proceed to the next pair. If k>key2, we proceed with the pair 
>>    
>>
>
>  
>
>>(k, key2) instead of (key1, key2). For key1>key2, proceed analogously, i.e.
>>    
>>
>
>  
>
>>search for the smallest key k>=key1 in the second tree. If at some point of
>>    
>>
>
>  
>
>>time no "larger" key is found in one of the two original trees, the new tree
>>    
>>
>
>  
>
>>where we insert elements to contains exactly the intersection.
>>    
>>
>
>
>
>[Small corrections.]
>
>
>
>By the way, the current implemented procedure always splits the second tree, regardless of the sizes of the trees. Imagine that we have two trees of very different sizes (e.g. with 5 and 10^5 elements). What tree should be split then?
>
>  
>
Sorry for the late response:

The length-5 tree should be the one which is split, for two reasons.  
Reason number one, it makes it the most likely that one half or the 
other will be empty.  This means that it's much more likely that I'll be 
able to skip large subtrees of the million-element set without 
inspection.  The second reason is that split (the function) is O((log 
N)^2) in cost, this is obviously cheaper if N=5 than if N=1 million.

Looking at this algorithm over the last couple of days, I've come to 
conclusion that it's O(N) normal-case, but O(N (log N)^2) worst case, 
and O(log N) best-cast, where N = the size of the larger set.  But I'm 
not sure of it.  It's certainly an interesting approach.

Brian


[-- Attachment #2: Type: text/html, Size: 3366 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2008-04-04 17:27 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-04-01 15:55 More efficient implementation of intersection of sets? sasha mal
2008-04-01 23:42 ` [Caml-list] " Jon Harrop
2008-04-02 14:05   ` Frédéric Gava
2008-04-02 15:34 ` Mike Furr
2008-04-02 13:30 sasha mal
2008-04-02 14:01 sasha mal
2008-04-04 17:27 ` Brian Hurt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox