From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <jon@ffconsultancy.com>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=disabled 
	version=3.1.3
Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83])
	by yquem.inria.fr (Postfix) with ESMTP id F2727BC6B
	for <caml-list@yquem.inria.fr>; Mon,  7 Jan 2008 15:49:57 +0100 (CET)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Ah4FAG7KgUfUnw7UdWdsb2JhbACCNo1eAQoCCA8TB5cN
X-IronPort-AV: E=Sophos;i="4.24,253,1196636400"; 
   d="scan'208";a="5795985"
Received: from ptb-relay01.plus.net ([212.159.14.212])
  by mail2-smtp-roc.national.inria.fr with ESMTP; 07 Jan 2008 15:49:57 +0100
Received: from [80.229.56.224] (helo=beast.local)
	 by ptb-relay01.plus.net with esmtp (Exim) id 1JBtIa-0006dF-GJ
	for caml-list@yquem.inria.fr; Mon, 07 Jan 2008 14:49:56 +0000
From: Jon Harrop <jon@ffconsultancy.com>
Organization: Flying Frog Consultancy Ltd.
To: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Performance questions, -inline, ...
Date: Mon, 7 Jan 2008 14:41:47 +0000
User-Agent: KMail/1.9.7
References: <200801031128.30183.ober.14@osu.edu> <200801051936.23521.jon@ffconsultancy.com> <200801070848.40809.ober.14@osu.edu>
In-Reply-To: <200801070848.40809.ober.14@osu.edu>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200801071441.48212.jon@ffconsultancy.com>
X-Spam: no; 0.00; -inline:01 ocaml:01 hofs:01 polymorphism:01 avoided:01 ocaml:01 high-level:01 rewrites:01 gcc:01 run-time:01 polymorphism:01 higher-order:01 ocamlopt:01 -inline:01 -unsafe:01 

On Monday 07 January 2008 13:48:40 Kuba Ober wrote:
> On Saturday 05 January 2008, Jon Harrop wrote:
> > Optimizing numerical code is discussed in detail in my book OCaml for
> > Scientists. You may also be interested in a very similar thread where I
> > optimized someone's almost identical code on the beginners list recently.
> > There is also this relevant blog post of mine:
> >
> >   http://ocamlnews.blogspot.com/2007/09/spectral-norm.html
> >
> > Essentially, your benchmark has rediscovered the fact that abstractions
> > (HOFs, polymorphism etc.) are prohibitively slow for high-performance
> > numerics and must be avoided.
>
> In the case of the particular OCaml implementation - yes. In general - no.
> I hope we agree here.

You mean it might be possible to recover the performance of C from numerical 
code with high-level abstractions? Yes. Indeed, I would like to see this 
done. However, I've never heard of an implementation of any language that can 
do this.

> > On Thursday 03 January 2008 16:28:30 Kuba Ober wrote:
> > > I haven't looked at assembly output yet, but I've run into some
> > > unexpected behavior in my benchmarks.
> >
> > Your benchmarks aren't sufficiently well defined to convey information
> > about anything specific, so you're highly likely to misinterpret what you
> > see.
>
> They are straight rewrites from C code and are used to compare how gcc and
> OCaml stack up.

I assume you didn't mimic run-time polymorphism, currying, closures and 
higher-order functions in the C though?

> > > This was compiled by ocamlopt -inline 100 -unsafe,
> >
> > You should use Array.unsafe_get and _set rather than the command-line
> > option -unsafe because the latter breaks correct code.
>
> This option is not affecting the execution speed in my case and thus can be
> dropped.

I find that surprising.

> > > What I wonder is why vector-to-vector add is so much faster than
> > > (constant) scalar to vector add. Vectors are preinitialized each time
> > > with a 1.0000, 1.0001, ... sequence.
> >
> > This is also highly likely to be platform and architecture dependent.
>
> The benchmark was run on the same machine and on the same day as the C code
> it was rewritten from :)

OCaml is only likely to compete with C in terms of numerical performance on 
modern CPUs, e.g. AMD64. If you're using an old CPU or running in 32-bit then 
OCaml can be up to 4x slower.

> > > (* generic scalar operation *)
> > > let op1 op const nloop =
> > > 	let accum = ref start in
> > > 	for i = 1 to nloop do
> > > 		accum := op !accum const
> > > 	done
> >
> > You probably meant to return "!accum".
>
> The return value is ignored anyway. It's a benchmark, noone cares what
> the result is.

If you do not care what the result of calls to this function are then you must 
also not care how quickly it runs. I cannot overemphasize how important this 
distinction is.

> Or is it for performance reasons?? 

In this case, I believe that will not affect performance because "accum" will 
always be boxed because it is polymorphic.

> > > (* generic vector operation *)
> > > let op2 op const a b (nloop : int) =
> > > 	let len = Array.length a in
> > > 	for j = 0 to len-1 do
> > > 		for i = 0 to len-1 do
> > > 			b.(i) <- op a.(i) b.(i)
> > > 		done;
> > > 	done
> > >
> > > (** addition **)
> > > let add1 nloop =
> > > 	let accum = ref start in
> > > 	for i = 1 to nloop do
> > > 		accum := !accum +. addconst
> > > 	done
> >
> > Again, should probably return "!accum". However, you can encourage OCaml
> > to unbox the float by returning "!accum + 0.0" instead.
>
> OK, I don't quite get it. Are you talking about what the function should
> return?

Yes. Your "add1" function is entirely dead code and, therefore, is trivially 
reducible. This is not true of real programs and, therefore, your results 
will not be representative of real programs. Moreover, there is no sense in 
trying to optimize a program that does nothing anyway.

> If so, are you implying that the function body will be compiled 
> differently (better?) if a different type is returned?

The change of type isn't the reason for the performance improvement. If you 
return "!accum" here, so "add1" returns "float" then the performance will 
probably not change. However, if you then change the return expression 
to "!accum + 0.0" then you're likely to see a significant improvement in 
performance because the OCaml compiler will unbox the floating point 
value "!accum" throughout this function, avoiding costly allocations.

You can't appreciate any of that without fixing your function first though.

> > > let add2 = op1 ( +. ) addconst
> >
> > This should be slower than "add1".
>
> But shouldn't. The way I write the code in this case should not affect the
> assembly the least bit. The loop with explicit operand, functional
> recursion and generic function-as-an-argument approach should all generate
> same assembly. Obviously enough they don't :)

The simplest route to recovering C performance here is:

. Inline "( +. )".
. Inline "op1".
. Type-specialize "op1".
. Hoist bounds checks.

> > > let add3 a b nloop =
> > > 	let len = Array.length a in
> > > 	for j = 0 to len-1 do
> > > 		for i = 0 to len-1 do
> > > 			b.(i) <- a.(i) +. addconst
> > > 		done;
> > > 	done
> >
> > The loop over "j" can be removed.
>
> Well, the goal was to iterate Array.length^2 times :)

Again, that makes your benchmark trivially reducible and, consequently, your 
results cannot be related to real code.

> > > let add4 = op2 ( +. ) addconst
> >
> > This will be slow because "op2" is polymorphic and "+." will not be
> > inlined.
>
> Yeah, I see that, but that shouldn't be the case if OCaml were to be
> serious in that department :)

AFAIK, there are no languages with implementations that are "serious in that 
department" by this definition, although I agree that it would be very useful 
in practice.

There are some trade-offs here though. Stalin-compiled Scheme, MLton-compiled 
SML and GHC-compiled Haskell do a lot more than OCaml in theory but, in 
practice, they are much less useful for high-performance numerics because 
their performance is so unpredictable.

> > > let add5 a b nloop =
> > > 	let len = Array.length a in
> > > 	for j = 0 to len-1 do
> > > 		for i = 0 to len-1 do
> > > 			b.(i) <- a.(i) +. b.(i)
> > > 		done;
> > > 	done
> >
> > This increments "b" by "a" many times. Replace the repeated adds with a
> > single multiply for each element.
>
> It's benchmark code. It's supposed to check performance of
> increment-vector-by-a-vector operation. I was hoping it would be obvious
> enough :(

Good benchmark code should not be trivially reducible in ways that real 
programs aren't, i.e. unless you're specifically testing the compiler's 
dead-code elimination. As you've written these functions, there is a lot of 
dead code. Some compilers (particularly C/C++) might eliminate the dead code 
and look better on this benchmark as a consequence but that is of little 
practical relevance.

I've already spent a long time meticulously contructing benchmarks in various 
different languages along these lines and the most important optimizations 
missing from OCaml are:

. Hoisting bounds checks.
. Common subexpression elimination.
. Type specialization.
. More aggressive inlining.
. Static optimization of div and mod.

The nice thing about OCaml is that it has a superb 64-bit code generator so, 
once you've done those optimizations on the hot paths manually, OCaml lets 
you run code as quickly as C but with all of the high-level benefits 
everywhere else.

A perfect language implementation would certainly do these optimizations for 
you and I wish OCaml did but I still haven't found anything better.

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?e