From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=disabled version=3.1.3 Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by yquem.inria.fr (Postfix) with ESMTP id F2727BC6B for ; Mon, 7 Jan 2008 15:49:57 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ah4FAG7KgUfUnw7UdWdsb2JhbACCNo1eAQoCCA8TB5cN X-IronPort-AV: E=Sophos;i="4.24,253,1196636400"; d="scan'208";a="5795985" Received: from ptb-relay01.plus.net ([212.159.14.212]) by mail2-smtp-roc.national.inria.fr with ESMTP; 07 Jan 2008 15:49:57 +0100 Received: from [80.229.56.224] (helo=beast.local) by ptb-relay01.plus.net with esmtp (Exim) id 1JBtIa-0006dF-GJ for caml-list@yquem.inria.fr; Mon, 07 Jan 2008 14:49:56 +0000 From: Jon Harrop Organization: Flying Frog Consultancy Ltd. To: caml-list@yquem.inria.fr Subject: Re: [Caml-list] Performance questions, -inline, ... Date: Mon, 7 Jan 2008 14:41:47 +0000 User-Agent: KMail/1.9.7 References: <200801031128.30183.ober.14@osu.edu> <200801051936.23521.jon@ffconsultancy.com> <200801070848.40809.ober.14@osu.edu> In-Reply-To: <200801070848.40809.ober.14@osu.edu> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200801071441.48212.jon@ffconsultancy.com> X-Spam: no; 0.00; -inline:01 ocaml:01 hofs:01 polymorphism:01 avoided:01 ocaml:01 high-level:01 rewrites:01 gcc:01 run-time:01 polymorphism:01 higher-order:01 ocamlopt:01 -inline:01 -unsafe:01 On Monday 07 January 2008 13:48:40 Kuba Ober wrote: > On Saturday 05 January 2008, Jon Harrop wrote: > > Optimizing numerical code is discussed in detail in my book OCaml for > > Scientists. You may also be interested in a very similar thread where I > > optimized someone's almost identical code on the beginners list recently. > > There is also this relevant blog post of mine: > > > > http://ocamlnews.blogspot.com/2007/09/spectral-norm.html > > > > Essentially, your benchmark has rediscovered the fact that abstractions > > (HOFs, polymorphism etc.) are prohibitively slow for high-performance > > numerics and must be avoided. > > In the case of the particular OCaml implementation - yes. In general - no. > I hope we agree here. You mean it might be possible to recover the performance of C from numerical code with high-level abstractions? Yes. Indeed, I would like to see this done. However, I've never heard of an implementation of any language that can do this. > > On Thursday 03 January 2008 16:28:30 Kuba Ober wrote: > > > I haven't looked at assembly output yet, but I've run into some > > > unexpected behavior in my benchmarks. > > > > Your benchmarks aren't sufficiently well defined to convey information > > about anything specific, so you're highly likely to misinterpret what you > > see. > > They are straight rewrites from C code and are used to compare how gcc and > OCaml stack up. I assume you didn't mimic run-time polymorphism, currying, closures and higher-order functions in the C though? > > > This was compiled by ocamlopt -inline 100 -unsafe, > > > > You should use Array.unsafe_get and _set rather than the command-line > > option -unsafe because the latter breaks correct code. > > This option is not affecting the execution speed in my case and thus can be > dropped. I find that surprising. > > > What I wonder is why vector-to-vector add is so much faster than > > > (constant) scalar to vector add. Vectors are preinitialized each time > > > with a 1.0000, 1.0001, ... sequence. > > > > This is also highly likely to be platform and architecture dependent. > > The benchmark was run on the same machine and on the same day as the C code > it was rewritten from :) OCaml is only likely to compete with C in terms of numerical performance on modern CPUs, e.g. AMD64. If you're using an old CPU or running in 32-bit then OCaml can be up to 4x slower. > > > (* generic scalar operation *) > > > let op1 op const nloop = > > > let accum = ref start in > > > for i = 1 to nloop do > > > accum := op !accum const > > > done > > > > You probably meant to return "!accum". > > The return value is ignored anyway. It's a benchmark, noone cares what > the result is. If you do not care what the result of calls to this function are then you must also not care how quickly it runs. I cannot overemphasize how important this distinction is. > Or is it for performance reasons?? In this case, I believe that will not affect performance because "accum" will always be boxed because it is polymorphic. > > > (* generic vector operation *) > > > let op2 op const a b (nloop : int) = > > > let len = Array.length a in > > > for j = 0 to len-1 do > > > for i = 0 to len-1 do > > > b.(i) <- op a.(i) b.(i) > > > done; > > > done > > > > > > (** addition **) > > > let add1 nloop = > > > let accum = ref start in > > > for i = 1 to nloop do > > > accum := !accum +. addconst > > > done > > > > Again, should probably return "!accum". However, you can encourage OCaml > > to unbox the float by returning "!accum + 0.0" instead. > > OK, I don't quite get it. Are you talking about what the function should > return? Yes. Your "add1" function is entirely dead code and, therefore, is trivially reducible. This is not true of real programs and, therefore, your results will not be representative of real programs. Moreover, there is no sense in trying to optimize a program that does nothing anyway. > If so, are you implying that the function body will be compiled > differently (better?) if a different type is returned? The change of type isn't the reason for the performance improvement. If you return "!accum" here, so "add1" returns "float" then the performance will probably not change. However, if you then change the return expression to "!accum + 0.0" then you're likely to see a significant improvement in performance because the OCaml compiler will unbox the floating point value "!accum" throughout this function, avoiding costly allocations. You can't appreciate any of that without fixing your function first though. > > > let add2 = op1 ( +. ) addconst > > > > This should be slower than "add1". > > But shouldn't. The way I write the code in this case should not affect the > assembly the least bit. The loop with explicit operand, functional > recursion and generic function-as-an-argument approach should all generate > same assembly. Obviously enough they don't :) The simplest route to recovering C performance here is: . Inline "( +. )". . Inline "op1". . Type-specialize "op1". . Hoist bounds checks. > > > let add3 a b nloop = > > > let len = Array.length a in > > > for j = 0 to len-1 do > > > for i = 0 to len-1 do > > > b.(i) <- a.(i) +. addconst > > > done; > > > done > > > > The loop over "j" can be removed. > > Well, the goal was to iterate Array.length^2 times :) Again, that makes your benchmark trivially reducible and, consequently, your results cannot be related to real code. > > > let add4 = op2 ( +. ) addconst > > > > This will be slow because "op2" is polymorphic and "+." will not be > > inlined. > > Yeah, I see that, but that shouldn't be the case if OCaml were to be > serious in that department :) AFAIK, there are no languages with implementations that are "serious in that department" by this definition, although I agree that it would be very useful in practice. There are some trade-offs here though. Stalin-compiled Scheme, MLton-compiled SML and GHC-compiled Haskell do a lot more than OCaml in theory but, in practice, they are much less useful for high-performance numerics because their performance is so unpredictable. > > > let add5 a b nloop = > > > let len = Array.length a in > > > for j = 0 to len-1 do > > > for i = 0 to len-1 do > > > b.(i) <- a.(i) +. b.(i) > > > done; > > > done > > > > This increments "b" by "a" many times. Replace the repeated adds with a > > single multiply for each element. > > It's benchmark code. It's supposed to check performance of > increment-vector-by-a-vector operation. I was hoping it would be obvious > enough :( Good benchmark code should not be trivially reducible in ways that real programs aren't, i.e. unless you're specifically testing the compiler's dead-code elimination. As you've written these functions, there is a lot of dead code. Some compilers (particularly C/C++) might eliminate the dead code and look better on this benchmark as a consequence but that is of little practical relevance. I've already spent a long time meticulously contructing benchmarks in various different languages along these lines and the most important optimizations missing from OCaml are: . Hoisting bounds checks. . Common subexpression elimination. . Type specialization. . More aggressive inlining. . Static optimization of div and mod. The nice thing about OCaml is that it has a superb 64-bit code generator so, once you've done those optimizations on the hot paths manually, OCaml lets you run code as quickly as C but with all of the high-level benefits everywhere else. A perfect language implementation would certainly do these optimizations for you and I wish OCaml did but I still haven't found anything better. -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/products/?e