* Slow allocations with 64bit code?
From: Markus Mottl @ 2007-04-20 20:31 UTC
To: ocaml; +Cc: yaron jane

Hi,

I wonder whether others have already noticed that allocations may surprisingly be slower on 64bit platforms than on 32bit ones. I compiled the following code using an OCaml-compiler that generates 32bit code:

-------------------------
let () =
  for i = 1 to 100000000 do
    ignore (Int32.add 42l 24l)
  done
-------------------------

I ran it on a 64bit platform (Intel(R) Pentium(R) D CPU 2.80GHz), and it took 0.65 seconds to finish. Then I recompiled it on this same platform using an OCaml-compiler that generates 64bit code. Surprisingly, the resulting executable took 0.72 seconds to run!

This is only a difference of about 10%, but I have seen more complex cases where there are timing differences in excess of 50%, which is already pretty substantial.

Looking at the assembly, there is really no difference in the loop other than the use of the quad word instructions, which should not take longer on the exact same platform (i.e. same CPU-frequency). But there is a suspicious call to "caml_alloc2", which might cause these differences. Can it be that there are alignment problems or similar in the run time?

In the considerably more complex code I'm currently working on it also seemed to me that it's allocations (the run time) that cause the performance difference.

Regards,
Markus

--
Markus Mottl http://www.ocaml.info markus.mottl@gmail.com

* Re: [Caml-list] Slow allocations with 64bit code?
From: Jon Harrop @ 2007-04-20 20:42 UTC
To: caml-list

On Friday 20 April 2007 21:31, Markus Mottl wrote:
> In the considerably more complex code I'm currently working on it also
> seemed to me that it's allocations (the run time) that cause the
> performance difference.

Are you sure it isn't just eating the minor heap 2x faster?

I did quite a few benchmarks when I first got my AMD64 and found 64-bit to be faster on all but tree-based algorithms. I put that down to 64-bit pointers consuming 2x more memory (although the performance difference was much less than 2x).

Doing the benchmark again (nth.opt 50 1 cfg-10k-aSi) I get:

  7.438s  32-bit  metaocamlopt 3.09.1
  5.289s  64-bit  ocamlopt 3.10.0+beta

What version of OCaml are you using?

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
The F#.NET Journal
http://www.ffconsultancy.com/products/fsharp_journal/?e

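The point about 64-bit values consuming twice the memory can be made concrete with a little arithmetic. The sketch below assumes that each boxed Int32 occupies three words (a header plus the two fields implied by the "caml_alloc2" call mentioned in the original post). That figure and the helper name bytes_allocated are illustrative assumptions, not measurements.

```ocaml
(* Rough estimate of the bytes allocated by the benchmark loop.
   Assumption: one boxed Int32 = 3 words (header + 2 fields). *)
let words_per_boxed_int32 = 3

(* word_size is in bits (32 or 64); the result is in bytes, as a float
   so that the estimate cannot overflow a 31-bit OCaml int. *)
let bytes_allocated ~word_size ~iterations =
  let bytes_per_word = word_size / 8 in
  float_of_int iterations
  *. float_of_int (words_per_boxed_int32 * bytes_per_word)

let () =
  let iterations = 100_000_000 in
  List.iter
    (fun word_size ->
       Printf.printf "%d-bit: %.1f GB allocated\n" word_size
         (bytes_allocated ~word_size ~iterations /. 1e9))
    [32; 64]
```

Under these assumptions the loop allocates about 1.2 GB with 32-bit words and about 2.4 GB with 64-bit words, which is the doubling discussed in this thread.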
* Re: [Caml-list] Slow allocations with 64bit code?
From: skaller @ 2007-04-21 2:57 UTC
To: Markus Mottl; +Cc: ocaml, yaron jane

On Fri, 2007-04-20 at 16:31 -0400, Markus Mottl wrote:
> This is only a difference of about 10%, but I have seen more complex
> cases where there are timing differences in excess of 50%, which is
> already pretty substantial.

I am surprised! The 64 bit code is so fast!

You are using 64 bit pointers. They're twice as big as 32 bit pointers. So every 'box' and all heap slots are double the size. On a memory intensive operation, you'd expect the 64 bit model to run at half the speed.

In your example:

let () =
  for i = 1 to 100000000 do
    ignore (Int32.add 42l 24l)
  done

it would appear 'ignore' enforces an allocation which is subsequently garbage collected. So you have both allocation and collection, which hit twice as much memory as on a 32 bit model.

It seems likely the reason the code is only 10% slower here is that the minor heap compactor is successfully preventing this code hitting much memory, possibly keeping the whole thing in cache. This will be a gc tuning detail.

Try folding over a large list. The 64 bit version should take twice as long because the memory accesses are the only part of the operation that takes significant time. [Everything else should fit in the cache except reading the list: boxing and unboxing the accumulator and invoking the argument closure should all be effectively zero cost]

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

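The suggested experiment of folding over a large list could be sketched roughly as follows. The helper names (time, sum_list, run) and the list length are illustrative choices, not taken from the thread, and List.init requires OCaml 4.06 or later. Compiling the same file with a 32-bit and a 64-bit ocamlopt and comparing the printed times would be one way to try the comparison.

```ocaml
(* Time a thunk with Sys.time (processor time, in seconds). *)
let time f =
  let t0 = Sys.time () in
  let result = f () in
  (result, Sys.time () -. t0)

(* The fold under test: every cons cell of the list is read once. *)
let sum_list l = List.fold_left (+) 0 l

(* Build a list of n integers, fold over it, and report the time. *)
let run n =
  let l = List.init n (fun i -> i) in
  let (sum, seconds) = time (fun () -> sum_list l) in
  Printf.printf "word size %d: sum = %d, %.3f s\n"
    Sys.word_size sum seconds;
  sum

let () = ignore (run 1_000_000)
```

A single run like this is easily dominated by noise, so the numbers it prints should be read with the caution about timings voiced later in the thread.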
* Re: [Caml-list] Slow allocations with 64bit code?
From: Xavier Leroy @ 2007-04-22 10:23 UTC
To: Markus Mottl; +Cc: ocaml, yaron jane

> I wonder whether others have already noticed that allocations may
> surprisingly be slower on 64bit platforms than on 32bit ones.

As already mentioned, on 64-bit platforms almost all Caml data representations are twice as large as on 32-bit platforms (exceptions: strings, float arrays), so the processor has twice as much data to move through its memory subsystem.

However, you certainly don't get a slowdown by a factor of 2, for two reasons: 1- the processor doesn't spend all its time doing memory accesses, there are some computations here and there; 2- cache lines are much bigger than 32 bits, meaning that accessing 64 bits at a given address is much cheaper than accessing two 32-bit quantities at two random addresses (spatial locality).

Moreover, x86 in 64-bit mode is much more compiler-friendly than in 32-bit mode: twice as many registers, a sensible floating-point model at last. So, OCaml in 64-bit mode generates better code than in 32-bit mode.

All in all, your 10% slowdown seems reasonable and in line with what others reported using C benchmarks.

> This is only a difference of about 10%, but I have seen more complex
> cases where there are timing differences in excess of 50%, which is
> already pretty substantial.

Be careful with timings: I've seen simple changes in code placement (e.g. introducing or removing dead code) cause performance differences in excess of 20%. It's an unfortunate fact of today's processors that their performance is very hard to predict.

> Looking at the assembly, there is really no difference in the loop
> other than the use of the quad word instructions, which should not
> take longer on the exact same platform (i.e. same CPU-frequency). But
> there is a suspicious call to "caml_alloc2", which might cause these
> differences. Can it be that there are alignment problems or similar
> in the run time?

ocamlopt compiles module initialization code in the so-called "compact" model, where code size is reduced by not open-coding some operations such as heap allocation, but instead going through auxiliary functions like "caml_alloc2". This makes sense since initialization code is usually large but not performance-critical. I recommend you put performance-critical code in functions, not in the initialization code.

- Xavier Leroy

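The recommendation to keep performance-critical code out of the initialization code might look like the sketch below, which moves the loop from the original post into a function. The names bench and timed_bench are illustrative. Sys.opaque_identity (available since OCaml 4.03) is an addition that was not in the original benchmark; it is there on the assumption that a newer compiler might otherwise optimize the unused result away.

```ocaml
(* The loop from the original post, moved into a function so that it is
   compiled as ordinary function code rather than as module
   initialization code. *)
let bench iterations =
  for _i = 1 to iterations do
    (* Sys.opaque_identity is an assumption-driven addition: it keeps
       the optimizer from discarding the otherwise unused result. *)
    ignore (Sys.opaque_identity (Int32.add 42l 24l))
  done

(* Run the benchmark and return the processor time used, in seconds. *)
let timed_bench iterations =
  let t0 = Sys.time () in
  bench iterations;
  Sys.time () -. t0

(* The initialization code now only calls the function. *)
let () =
  Printf.printf "%.2f s\n" (timed_bench 100_000_000)
```

Only the call and the printing remain at the top level; the allocation-heavy loop itself lives in bench.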
* Re: [Caml-list] Slow allocations with 64bit code?
From: Markus Mottl @ 2007-04-22 16:12 UTC
To: Xavier Leroy; +Cc: ocaml, yaron jane

On 4/22/07, Xavier Leroy <Xavier.Leroy@inria.fr> wrote:
> > I wonder whether others have already noticed that allocations may
> > surprisingly be slower on 64bit platforms than on 32bit ones.
>
> As already mentioned, on 64-bit platforms almost all Caml data
> representations are twice as large as on 32-bit platforms (exceptions:
> strings, float arrays), so the processor has twice as much data to
> move through its memory subsystem.

Interesting, I was obviously under the wrong assumption that a 64bit machine would scale appropriately when accessing 64bit words in memory. Of course, I'm aware that cache effects also play a role, but the minor heap should easily fit into the cache of any modern machine in any case, and it's not like this experiment is eating memory.

> However, you certainly don't get a slowdown by a factor of 2, for two
> reasons: 1- the processor doesn't spend all its time doing memory
> accesses, there are some computations here and there; 2- cache lines
> are much bigger than 32 bits, meaning that accessing 64 bits at a
> given address is much cheaper than accessing two 32-bit
> quantities at two random addresses (spatial locality).
>
> Moreover, x86 in 64-bit mode is much more compiler-friendly than in
> 32-bit mode: twice as many registers, a sensible floating-point model
> at last. So, OCaml in 64-bit mode generates better code than in
> 32-bit mode.
>
> All in all, your 10% slowdown seems reasonable and in line with what
> others reported using C benchmarks.

This seems reasonable.
It just seemed surprising to me that in some of my tests a 64bit machine could be slower handling even "large" Int64-values than in 32bit-mode, in which it always has to perform two memory accesses and possibly some additional computation steps.

> Be careful with timings: I've seen simple changes in code placement
> (e.g. introducing or removing dead code) cause performance differences
> in excess of 20%. It's an unfortunate fact of today's processors that
> their performance is very hard to predict.

This surely also requires some caution when interpreting mini-benchmarks.

> ocamlopt compiles module initialization code in the so-called
> "compact" model, where code size is reduced by not open-coding some
> operations such as heap allocation, but instead going through
> auxiliary functions like "caml_alloc2". This makes sense since
> initialization code is usually large but not performance-critical.
> I recommend you put performance-critical code in functions, not in the
> initialization code.

Thanks, this is a very important bit of information that I wasn't aware of! I used to run mini-benchmarks from initialization code in most cases, which is obviously a bad idea...

Regards,
Markus

--
Markus Mottl http://www.ocaml.info markus.mottl@gmail.com

* Re: [Caml-list] Slow allocations with 64bit code?
From: Markus Mottl @ 2007-04-23 20:13 UTC
To: ocaml; +Cc: yaron jane

On 4/22/07, Markus Mottl <markus.mottl@gmail.com> wrote:
> On 4/22/07, Xavier Leroy <Xavier.Leroy@inria.fr> wrote:
> > Be careful with timings: I've seen simple changes in code placement
> > (e.g. introducing or removing dead code) cause performance differences
> > in excess of 20%. It's an unfortunate fact of today's processors that
> > their performance is very hard to predict.

After performing many extensive tests between 32bit/64bit platforms, it seems that indeed code placement is a major cause of many if not most timing differences I have seen, especially if the difference is unusually big.

Other developers who want to make their code run fast independent of platform should therefore be cautioned that a program compiled for different architectures may be slower/faster for very random reasons that have nothing to do with not having optimized well enough for the special case. This is especially true for low-level code, where such effects do not cancel each other out easily.

Regards,
Markus

--
Markus Mottl http://www.ocaml.info markus.mottl@gmail.com

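One modest precaution when reading mini-benchmark numbers is to repeat each measurement and look at the spread, as in the sketch below. The helper names time and min_max_time are hypothetical. Repetition only quantifies run-to-run noise; it cannot remove the code-placement effects described in this thread, since those are fixed for a given executable.

```ocaml
(* Processor time, in seconds, taken by one call of f. *)
let time f =
  let t0 = Sys.time () in
  f ();
  Sys.time () -. t0

(* Run f [runs] times (at least once) and return the fastest and
   slowest observed times as a pair (lo, hi). *)
let min_max_time ~runs f =
  let first = time f in
  let rec go k lo hi =
    if k >= runs then (lo, hi)
    else
      let t = time f in
      go (k + 1) (min lo t) (max hi t)
  in
  go 1 first first

let () =
  (* The workload is the allocation loop from the original post,
     with a smaller iteration count so five runs stay quick. *)
  let work () =
    for _i = 1 to 10_000_000 do
      ignore (Sys.opaque_identity (Int32.add 42l 24l))
    done
  in
  let (lo, hi) = min_max_time ~runs:5 work in
  Printf.printf "fastest %.3f s, slowest %.3f s\n" lo hi
```

If the gap between the fastest and slowest run is of the same order as the 32bit/64bit difference being studied, the comparison is probably not telling much.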