From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <xleroy@yquem.inria.fr>
Received: by yquem.inria.fr (Postfix, from userid 18180)
	id 9F907BB91; Sat, 15 Jan 2005 12:55:19 +0100 (CET)
Date: Sat, 15 Jan 2005 12:55:19 +0100
From: Xavier Leroy <Xavier.Leroy@inria.fr>
To: "Will M. Farr" <farr@MIT.EDU>
Cc: shootout-list@lists.alioth.debian.org, caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Ocaml sums the harmonic series -- four ways, four benchmarks: floating point performance
Message-ID: <20050115115519.GA11037@yquem.inria.fr>
References: <3D3A6BF5-657B-11D9-A551-000393A34E82@mit.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <3D3A6BF5-657B-11D9-A551-000393A34E82@mit.edu>
User-Agent: Mutt/1.3.28i
X-Spam: no; 0.00; caml-list:01 ocaml:01 compiler:01 ocamlopt:01 unboxed:01 int-:01 int-:01 powerpc:01 model:01 slowest:01 surprising:01 unbalanced:01 hastily:01 prevost:01 powerpc:01 
X-Spam-Checker-Version: SpamAssassin 3.0.0 (2004-09-13) on yquem.inria.fr
X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED 
	autolearn=disabled version=3.0.0
X-Spam-Level: 

> Here's an implementation (really 4) along with timing  
> information of the "harmonic" benchmark (essentially summing the  
> harmonic series) [...]
> Here's the code for the fastest implementation:

The following slight modification of your code generates asm code that
is closest to what a C compiler would produce:

let sum_harmonic5 n =
  let sum = ref 1.0 in
  let ifloat = ref 2.0 in
    for i = 2 to n do
      sum := !sum +. 1.0/.(!ifloat);
      ifloat := !ifloat +. 1.0
    done;
    !sum +. 0.0;;

The + 0.0 at the end is ugly but convinces ocamlopt that !sum is best
kept unboxed during the loop.

> (note that I have replaced an int->float conversion for  
> each term with a single floating point operation: ifloat := ifloat +.  
> 1.0).  Since int->float conversions are slow on my machine (PowerBook  
> G4)

Right, the PowerPC does not have an int -> float instruction and that
conversion must be performed with a rather expensive sequence of
instructions (for the gory details, see e.g.
 http://the.wall.riscom.net/books/proc/ppc/cwg/code3.html#303610).

64-bit PPCs have a dedicated instruction to do this conversion,
showing that the IBM/Motorola people learn from their past mistakes...

For Intel processors, it's the reverse conversion (float -> int) that
is slow.  Clearly, the SPEC benchmark doesn't contain much conversions
between floats and ints, otherwise hardware designers would pay more
attention :-)

> this is a big win (about a factor of 2 in time for the C program).  

As others have mentioned, this strongly depends on the processor
instruction set and even on the processor model.  My own benchmarks
(with your Caml code) give the following results:

PPC G4 (Cube)   1 < 2 < 3 < 4 < 5   speed ratio = 1.5
Xeon 2.8        3 < 4 < 1 = 2 < 5   speed ratio = 1.02
Pentium 4 2.0   3 < 1 < 2 < 4 < 5   speed ratio = 1.2
Athlon XP 1.4   4 < 5 < 3 < 1 < 2   speed ratio = 2.2

where 1, 2, 3, 4, 5 refer to the 5 different functions,
1 < 2 means "1 is slower than 2",
and "speed ratio" is the speed difference between fastest and slowest.

The Xeon case is what I was expecting: the running time is dominated by
the time it takes to do the float divisions, everything else is done in
parallel or takes negligible time, so it doesn't matter much how you
write the code.

The Athlon figures are *very* surprising.  It could be the case that
this benchmark falls into a quirk of that (otherwise excellent :-)
processor.  

Actually, this often happens with micro-benchmarks: they are so small
and their mix of operations is so unbalanced that they can easily run
into weird processor behaviors.  So, don't draw conclusions hastily.

John Prevost asks:

> Is the PowerPC ocamlopt back-end less optimized than the x86?

No, not really.  The x86 back-end works harder to work around oddities
in the x86 instruction set (e.g. the lack of floating-point
registers), but that is hardly an optimization, just compensating for
brain damage in the instruction set.  Conversely, the PPC back-end
performs basic-block instruction scheduling while the x86 back-end doesn't.
Instruction scheduling helped with early PPC chips (601, 603) but is
largely irrelevant with modern out-of-order PPC implementations.

> Are you sure that there isn't just a built-in 
> instruction on the x86 that adds an int to a float?

I think there exists one such instruction, but ocamlopt doesn't use
it, and the Intel optimization manuals recommend to do int->float
conversion followed by float addition instead.

- Xavier Leroy