From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ober.14@osu.edu>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.4 required=5.0 tests=AWL,DNS_FROM_RFC_ABUSE 
	autolearn=disabled version=3.1.3
Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104])
	by yquem.inria.fr (Postfix) with ESMTP id EEB2EBC6B
	for <caml-list@yquem.inria.fr>; Mon,  7 Jan 2008 16:22:43 +0100 (CET)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Aq4HAKTSgUdDWxLC/2dsb2JhbACBWKZA
X-IronPort-AV: E=Sophos;i="4.24,254,1196636400"; 
   d="scan'208";a="7530407"
Received: from ip67-91-18-194.z18-91-67.customer.algx.net (HELO server1.bertec.net) ([67.91.18.194])
  by mail3-smtp-sop.national.inria.fr with ESMTP; 07 Jan 2008 16:22:42 +0100
Received: from kuba.bertec.net (kuba.bertec.net [192.168.2.16])
	by server1.bertec.net (Postfix) with ESMTP id B9F22105830
	for <caml-list@yquem.inria.fr>; Mon,  7 Jan 2008 10:22:41 -0500 (EST)
From: Kuba Ober <ober.14@osu.edu>
To: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Performance questions, -inline, ...
Date: Mon, 7 Jan 2008 10:22:40 -0500
User-Agent: KMail/1.9.6 (enterprise 0.20071123.740460)
References: <200801031128.30183.ober.14@osu.edu> <200801070848.40809.ober.14@osu.edu> <200801071441.48212.jon@ffconsultancy.com>
In-Reply-To: <200801071441.48212.jon@ffconsultancy.com>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200801071022.41044.ober.14@osu.edu>
X-Spam: no; 0.00; -inline:01 ocaml:01 hofs:01 polymorphism:01 avoided:01 ocaml:01 high-level:01 hacked:01 runtime:01 inference:01 syntax:01 rewrites:01 gcc:01 run-time:01 polymorphism:01 

On Monday 07 January 2008, Jon Harrop wrote:
> On Monday 07 January 2008 13:48:40 Kuba Ober wrote:
> > On Saturday 05 January 2008, Jon Harrop wrote:
> > > Optimizing numerical code is discussed in detail in my book OCaml for
> > > Scientists. You may also be interested in a very similar thread where I
> > > optimized someone's almost identical code on the beginners list
> > > recently. There is also this relevant blog post of mine:
> > >
> > >   http://ocamlnews.blogspot.com/2007/09/spectral-norm.html
> > >
> > > Essentially, your benchmark has rediscovered the fact that abstractions
> > > (HOFs, polymorphism etc.) are prohibitively slow for high-performance
> > > numerics and must be avoided.
> >
> > In the case of the particular OCaml implementation - yes. In general -
> > no. I hope we agree here.
>
> You mean it might be possible to recover the performance of C from
> numerical code with high-level abstractions? Yes. Indeed, I would like to
> see this done. However, I've never heard of an implementation of any
> language that can do this.

g++ does that reasonably well. Heck, I have a severely hacked in-house LISP 
system which had grown out of an assembler-with-LISP-macros, and it does it 
all the time. It's a LISP with static types at runtime, but general LISPiness 
at compile time (macros just run in gcl). In fact I'm looking to port it to 
OCaml just because type inference and commonplace LISP syntax for type 
declarations don't mix too well - the code starts looking really ugly. Maybe 
I stuck too much with LISP's usual declaim-this-and-that approach, but that's 
what you get when you reuse most of underlying LISP's implementation.

> > They are straight rewrites from C code and are used to compare how gcc
> > and OCaml stack up.
>
> I assume you didn't mimic run-time polymorphism, currying, closures and
> higher-order functions in the C though?

No. But while those are very useful features to have, the code is messy if 
you've got two languages to work with (hi-perf stuff in C, everything else in 
OCaml).

> > > > (* generic scalar operation *)
> > > > let op1 op const nloop =
> > > > 	let accum = ref start in
> > > > 	for i = 1 to nloop do
> > > > 		accum := op !accum const
> > > > 	done
> > >
> > > You probably meant to return "!accum".
> >
> > The return value is ignored anyway. It's a benchmark, noone cares what
> > the result is.
>
> If you do not care what the result of calls to this function are then you
> must also not care how quickly it runs. I cannot overemphasize how
> important this distinction is.

I assumed (correctly) that OCaml's optimizer isn't all that clever. This is 
something you definitely have to watch for when using gcc/g++.

> > > > (** addition **)
> > > > let add1 nloop =
> > > > 	let accum = ref start in
> > > > 	for i = 1 to nloop do
> > > > 		accum := !accum +. addconst
> > > > 	done
> > >
> > > Again, should probably return "!accum". However, you can encourage
> > > OCaml to unbox the float by returning "!accum + 0.0" instead.
> >
> > OK, I don't quite get it. Are you talking about what the function should
> > return?
>
> Yes. Your "add1" function is entirely dead code and, therefore, is
> trivially reducible. This is not true of real programs and, therefore, your
> results will not be representative of real programs. Moreover, there is no
> sense in trying to optimize a program that does nothing anyway.

If you code such a thing in gcc, it correctly becomes a no-op. OCaml isn't all 
that clever ;)

You're 100% academically-correct of course.

> > If so, are you implying that the function body will be compiled
> > differently (better?) if a different type is returned?
>
> The change of type isn't the reason for the performance improvement. If you
> return "!accum" here, so "add1" returns "float" then the performance will
> probably not change. However, if you then change the return expression
> to "!accum + 0.0" then you're likely to see a significant improvement in
> performance because the OCaml compiler will unbox the floating point
> value "!accum" throughout this function, avoiding costly allocations.

Ah-ha, I have to try that!

> > > > let add2 = op1 ( +. ) addconst
> > >
> > > This should be slower than "add1".
> >
> > But shouldn't. The way I write the code in this case should not affect
> > the assembly the least bit. The loop with explicit operand, functional
> > recursion and generic function-as-an-argument approach should all
> > generate same assembly. Obviously enough they don't :)
>
> The simplest route to recovering C performance here is:
>
> . Inline "( +. )".
> . Inline "op1".
> . Type-specialize "op1".
> . Hoist bounds checks.

You mean "the programmer has to do this"? OCaml compiler is *really* bad, 
then.

> > > > let add4 = op2 ( +. ) addconst
> > >
> > > This will be slow because "op2" is polymorphic and "+." will not be
> > > inlined.
> >
> > Yeah, I see that, but that shouldn't be the case if OCaml were to be
> > serious in that department :)
>
> AFAIK, there are no languages with implementations that are "serious in
> that department" by this definition, although I agree that it would be very
> useful in practice.

I have a very broken inlining statically-typed LISP implementation which does 
that all the time. For code which runs on 8-bit hardware and is better than 
stock C compiler that came with it (Zilog's Encore! platform). It's really 
not all that hard to do. It's hardly a full-fledged implementation, and it's 
buggy, but it's only goal is to compile some very specific codebase. A hack, 
but it works, and I write everything functional-style, with functions passed 
left and right, etc. In fact, the compiler guarantees that a stateless 
function, if passed as an argument, is equivalent to coding it in-line. It's 
all DSP code.

> There are some trade-offs here though. Stalin-compiled Scheme,
> MLton-compiled SML and GHC-compiled Haskell do a lot more than OCaml in
> theory but, in practice, they are much less useful for high-performance
> numerics because their performance is so unpredictable.

Good to know.

> I've already spent a long time meticulously contructing benchmarks in
> various different languages along these lines and the most important
> optimizations missing from OCaml are:
>
> . Hoisting bounds checks.
> . Common subexpression elimination.
> . Type specialization.
> . More aggressive inlining.
> . Static optimization of div and mod.
>
> The nice thing about OCaml is that it has a superb 64-bit code generator
> so, once you've done those optimizations on the hot paths manually, OCaml
> lets you run code as quickly as C but with all of the high-level benefits
> everywhere else.
>
> A perfect language implementation would certainly do these optimizations
> for you and I wish OCaml did but I still haven't found anything better.

After I gain some more experience I will hopefully add an OCaml-esque front 
end to my hackish compiler. It's completely unportable, only works on 
12-bit-PIC-like architecture (SX from Parallax) and on Z8 Encore!, but hey, 
maybe it can be a good starting point to something bigger and better. The 
backend still has to be ported from Lisp, and I'm too time- and 
knowledge-constrained to do it just yet.

The code is 70% Lisp macros, reuses every bit of underlying LISP platform 
it can, and as of yet I don't know how clean the ported OCaml would look. 
Needless to say, I'd never get this done without having the macros available, 
as in "this would take me forever in C/C++". The way it currenly runs is as 
follows: the source code is run via a LISP program which creates another LISP 
program which then generates the linked binary to be downloaded to the 
target.

I was originally hoping to reuse most of OCaml's compiler/optimizer and just 
write a platform-specific backend. It now seems that it's easier for me to 
add a hacked OCaml front-end to my existing system. Probably it will just 
translate OCaml to LISP and then process it as it originally was. I really 
like OCaml syntax better, *although* some of my sources also use LISP macros. 
They could probably stay as-is for now if the OCaml parts will just get 
translated to LISP.

Cheers, Kuba