From: Jon Harrop
Organization: Flying Frog Consultancy Ltd.
To: David Teller
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Re: Why OCaml rocks
Date: Sat, 10 May 2008 20:59:54 +0100

On Friday 09 May 2008 23:25:49 David Teller wrote:
> On Fri, 2008-05-09 at 19:10 +0100, Jon Harrop wrote:
> > Parallelism is easy in F#.
>
> Now, that's a cliffhanger. Could you elaborate?

Sure. Review the concerns cited regarding parallel programming in OCaml:

1. When do we fork?
   Earlier, to amortize the overhead, or later, to enable sharing of data
   structures?

2. How do we avoid excessive copying? If each parallel thread only requires
   part of a data structure, should we dissect the data structure to
   alleviate the message passing?

3. How do we pass values like heap-allocated custom blocks?

4. Do we use the Ancient module and manage our memory manually so that we
   can mutate shared data?

These all make parallel programming much harder in OCaml than it needs to
be. All of these concerns simply disappear when you have a concurrent GC.
F# already does; hopefully OCaml will soon. Addressing each point in turn:

1. Use the high-performance thread pool provided by the Task Parallel
   Library.

2. There is no copying.

3. You can pass any value to another thread because everything is uniformly
   represented.

4. Memory management is fully automated and shared mutable data is easy.

Consider the embarrassingly parallel task of matrix-matrix multiplication.
Note that this is an absolute best case for OCaml because the overheads are
as small as possible thanks to the complete absence of fine-grained
parallelism. Most real applications require much more fine-grained
parallelism, and OCaml's performance is much worse as a consequence.

In F#, this is really easy to write:

  let c = Array2.zero_create am bn
  Parallel.For(0, n, fun i ->
    for j = 0 to n - 1 do
      let mutable r = 0.0
      for k = 0 to n - 1 do
        r <- r + a.[i,k] * b.[k,j]
      c.[i,j] <- r)

There are several things to notice about this:

. The input matrices are immediately accessible from parallel threads
  without us having to do anything. There is no copying, so memory
  consumption is lower and there is no overhead to worry about.

. The output matrix is filled directly using mutation. Again, there is no
  copying, no subsequent single-threaded collation of results and no
  overhead.

. No manual memory management is required.

There is no easy solution in OCaml, but you might:
. Fork a process for each CPU at the start of the multiply in order to
  avoid copying the input matrices, use message passing to return the
  result and collate it serially.

. Prefork a process for each CPU and use message passing to queue work
  items: pass the input matrices using message passing, compute results,
  pass the results back using message passing and collate them serially.

So we might have an O(1) hit from forking, an O(n^2) hit from message
passing and collation, and an O(n^3) actual algorithm. Like Ulf said,
message passing scales well. However, its raw performance is awful compared
to leveraging shared memory.

Here is an OCaml implementation of the above F#:

  let invoke (f : 'a -> 'b) x : unit -> 'b =
    let input, output = Unix.pipe() in
    match Unix.fork() with
    | -1 -> (let v = f x in fun () -> v)
    | 0 ->
        Unix.close input;
        let output = Unix.out_channel_of_descr output in
        Marshal.to_channel output (try `Res(f x) with e -> `Exn e) [];
        close_out output;
        exit 0
    | pid ->
        Unix.close output;
        let input = Unix.in_channel_of_descr input in
        fun () ->
          let v = Marshal.from_channel input in
          ignore (Unix.waitpid [] pid);
          close_in input;
          match v with
          | `Res x -> x
          | `Exn e -> raise e

  let parmul_aux i0 i1 n a b =
    let c = Array.make_matrix (i1 - i0) n 0. in
    for i = i0 to i1 - 1 do
      let ai = a.(i) in
      for j = 0 to n - 1 do
        let r = ref 0.0 in
        for k = 0 to n - 1 do
          r := !r +. Array.unsafe_get ai k *.
                     Array.unsafe_get (Array.unsafe_get b k) j
        done;
        c.(i - i0).(j) <- !r
      done
    done;
    c

  let parmul n a b =
    let c1 = invoke (fun () -> parmul_aux 0 (n/2) n a b) () in
    let c2 = parmul_aux (n/2) n n a b in
    Array.concat [c1(); c2]

Note that we must manually insert unsafe array accesses and hoist loop
invariants in order to obtain comparable performance. This code is clearly
much more complicated than the F#, and that complexity is completely
unnecessary.

> > > I think that the cost of copying data is totally overrated.
> > > We are doing this often, and even over the network, and hey, we are
> > > breaking every speed limit.
> >
> > You cannot afford to pay that price for parallel implementations of
> > most numerical algorithms.
>
> Er... Not being a specialist, I may be wrong, but I seem to remember
> that you can afford that, as long as you're also doing something else
> during that copy.

Here are some actual performance results for this task on my machine:

    n        OCaml                  F#
         Serial    Parallel    Serial    Parallel
   16    0.00005   0.00022     0.00002   0.000045
   32    0.00040   0.00067     0.00015   0.00015
   64    0.0033    0.0021      0.00090   0.00068
  512    4.0       2.4         2.5       1.4

As you can see, F# is much faster in every case. Larger "n" approaches the
limit where F# is only 1.6x faster than OCaml. For smaller "n" the gap
quickly widens, with F# being 5x faster for n=16.

> > On the contrary, that is not a theoretical statement at all: it
> > already happened. F# already makes it much easier to write high
> > performance parallel algorithms and its concurrent GC is the crux of
> > that capability.
>
> Examples ? Pretty please ?

The above examples are a good start. The latest F#.NET Journal article
covers parallelism using Microsoft's TPL. A future OCaml Journal article
will cover the use of fork-based parallelism. I think you can see what I
mean from the results I've presented in this post, though.

To put this another way: multicore computing provides hardware-accelerated
shared memory, and a concurrent GC makes it easy to exploit that. Message
passing is fine for concurrent applications that are not CPU bound, or for
distributed computing, but it is not competitive on today's multicore
machines and will not become competitive in the next decade.

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?e
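P.S. For anyone who wants to experiment with the fork-based "invoke"
pattern discussed above, here is a minimal, self-contained sketch applying
it to a toy task (summing squares over a range) instead of matrix multiply.
The driver and the "sum_squares" function are illustrative additions of
mine, not part of the code above; it assumes a Unix platform with the unix
library linked.

```ocaml
(* Sketch of the fork/Marshal parallelism pattern from the post.
   Build with: ocamlfind ocamlopt -package unix -linkpkg sketch.ml *)

let invoke (f : 'a -> 'b) (x : 'a) : unit -> 'b =
  let input, output = Unix.pipe () in
  match Unix.fork () with
  | -1 ->
      (* Fork failed: fall back to computing in-process. *)
      let v = f x in
      (fun () -> v)
  | 0 ->
      (* Child: compute, marshal the result (or exception) back, exit. *)
      Unix.close input;
      let output = Unix.out_channel_of_descr output in
      Marshal.to_channel output (try `Res (f x) with e -> `Exn e) [];
      close_out output;
      exit 0
  | pid ->
      (* Parent: return a thunk that blocks on the child's result. *)
      Unix.close output;
      let input = Unix.in_channel_of_descr input in
      fun () ->
        let v = Marshal.from_channel input in
        ignore (Unix.waitpid [] pid);
        close_in input;
        match v with
        | `Res x -> x
        | `Exn e -> raise e

(* Toy payload: sum of squares of i0 .. i1 - 1. *)
let sum_squares (i0, i1) =
  let s = ref 0 in
  for i = i0 to i1 - 1 do s := !s + i * i done;
  !s

let () =
  let n = 1000 in
  (* First half runs in a forked child; second half runs here. *)
  let first = invoke sum_squares (0, n / 2) in
  let second = sum_squares (n / 2, n) in
  Printf.printf "%d\n" (first () + second)  (* prints 332833500 *)
```

The same shape generalizes directly to parmul above: the thunk returned by
invoke lets the parent overlap its own half of the work with the child's
before collating the two results.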