From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: by yquem.inria.fr (Postfix, from userid 18180) id 9F907BB91; Sat, 15 Jan 2005 12:55:19 +0100 (CET) Date: Sat, 15 Jan 2005 12:55:19 +0100 From: Xavier Leroy To: "Will M. Farr" Cc: shootout-list@lists.alioth.debian.org, caml-list@yquem.inria.fr Subject: Re: [Caml-list] Ocaml sums the harmonic series -- four ways, four benchmarks: floating point performance Message-ID: <20050115115519.GA11037@yquem.inria.fr> References: <3D3A6BF5-657B-11D9-A551-000393A34E82@mit.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3D3A6BF5-657B-11D9-A551-000393A34E82@mit.edu> User-Agent: Mutt/1.3.28i X-Spam: no; 0.00; caml-list:01 ocaml:01 compiler:01 ocamlopt:01 unboxed:01 int-:01 int-:01 powerpc:01 model:01 slowest:01 surprising:01 unbalanced:01 hastily:01 prevost:01 powerpc:01 X-Spam-Checker-Version: SpamAssassin 3.0.0 (2004-09-13) on yquem.inria.fr X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED autolearn=disabled version=3.0.0 X-Spam-Level: > Here's an implementation (really 4) along with timing > information of the "harmonic" benchmark (essentially summing the > harmonic series) [...] > Here's the code for the fastest implementation: The following slight modification of your code generates asm code that is closest to what a C compiler would produce: let sum_harmonic5 n = let sum = ref 1.0 in let ifloat = ref 2.0 in for i = 2 to n do sum := !sum +. 1.0/.(!ifloat); ifloat := !ifloat +. 1.0 done; !sum +. 0.0;; The + 0.0 at the end is ugly but convinces ocamlopt that !sum is best kept unboxed during the loop. > (note that I have replaced an int->float conversion for > each term with a single floating point operation: ifloat := ifloat +. > 1.0). Since int->float conversions are slow on my machine (PowerBook > G4) Right, the PowerPC does not have an int -> float instruction and that conversion must be performed with a rather expensive sequence of instructions (for the gory details, see e.g. http://the.wall.riscom.net/books/proc/ppc/cwg/code3.html#303610). 64-bit PPCs have a dedicated instruction to do this conversion, showing that the IBM/Motorola people learn from their past mistakes... For Intel processors, it's the reverse conversion (float -> int) that is slow. Clearly, the SPEC benchmark doesn't contain much conversions between floats and ints, otherwise hardware designers would pay more attention :-) > this is a big win (about a factor of 2 in time for the C program). As others have mentioned, this strongly depends on the processor instruction set and even on the processor model. My own benchmarks (with your Caml code) give the following results: PPC G4 (Cube) 1 < 2 < 3 < 4 < 5 speed ratio = 1.5 Xeon 2.8 3 < 4 < 1 = 2 < 5 speed ratio = 1.02 Pentium 4 2.0 3 < 1 < 2 < 4 < 5 speed ratio = 1.2 Athlon XP 1.4 4 < 5 < 3 < 1 < 2 speed ratio = 2.2 where 1, 2, 3, 4, 5 refer to the 5 different functions, 1 < 2 means "1 is slower than 2", and "speed ratio" is the speed difference between fastest and slowest. The Xeon case is what I was expecting: the running time is dominated by the time it takes to do the float divisions, everything else is done in parallel or takes negligible time, so it doesn't matter much how you write the code. The Athlon figures are *very* surprising. It could be the case that this benchmark falls into a quirk of that (otherwise excellent :-) processor. Actually, this often happens with micro-benchmarks: they are so small and their mix of operations is so unbalanced that they can easily run into weird processor behaviors. So, don't draw conclusions hastily. John Prevost asks: > Is the PowerPC ocamlopt back-end less optimized than the x86? No, not really. The x86 back-end works harder to work around oddities in the x86 instruction set (e.g. the lack of floating-point registers), but that is hardly an optimization, just compensating for brain damage in the instruction set. Conversely, the PPC back-end performs basic-block instruction scheduling while the x86 back-end doesn't. Instruction scheduling helped with early PPC chips (601, 603) but is largely irrelevant with modern out-of-order PPC implementations. > Are you sure that there isn't just a built-in > instruction on the x86 that adds an int to a float? I think there exists one such instruction, but ocamlopt doesn't use it, and the Intel optimization manuals recommend to do int->float conversion followed by float addition instead. - Xavier Leroy