* [Caml-list] DFT in OCaml vs. C
@ 2003-03-27 7:33 Issac Trotts
2003-03-27 10:58 ` Fabrice Le Fessant
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Issac Trotts @ 2003-03-27 7:33 UTC (permalink / raw)
To: OCaml List
Here's a numerical mini-benchmark comparing C to OCaml
on a simple implementation of the Discrete Fourier Transform:
http://redwood.ucdavis.edu/~issac/dft_compare.tar.gz
The results on my 1 GHZ Pentium III Linux box:
C:
real 0m21.273s
user 0m21.200s
sys 0m0.040s
OCaml:
real 1m51.602s
user 1m47.020s
sys 0m0.260s
So the C version was about five times as fast. This is after looking
for ideas
in the "Writing Efficient Numerical code in Objective Caml" page [1]
and the Great Language Shootout statistical moment page for OCaml [2].
The OCaml code was easier to read and debug, and would be
easier to modify.
I'd be interested if anyone on this list knows of a way
to make it perform as well as the C version (without using the FFT.)
[1]
http://216.239.53.100/search?q=cache:5YnsSStlWiAC:caml.inria.fr/ocaml/numerical.html+efficient+numerical+ocaml&hl=en&ie=UTF-8
[2]
http://www.bagley.org/~doug/shootout/bench/moments/moments.ocaml
Issac Trotts
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Caml-list] DFT in OCaml vs. C
2003-03-27 7:33 [Caml-list] DFT in OCaml vs. C Issac Trotts
@ 2003-03-27 10:58 ` Fabrice Le Fessant
2003-03-27 19:40 ` Issac Trotts
2003-03-27 14:21 ` Markus Mottl
2003-03-27 14:32 ` Xavier Leroy
2 siblings, 1 reply; 10+ messages in thread
From: Fabrice Le Fessant @ 2003-03-27 10:58 UTC (permalink / raw)
To: Issac Trotts; +Cc: OCaml List
> Here's a numerical mini-benchmark comparing C to OCaml
> on a simple implementation of the Discrete Fourier Transform:
>
> http://redwood.ucdavis.edu/~issac/dft_compare.tar.gz
>
> The results on my 1 GHZ Pentium III Linux box:
> I'd be interested if anyone on this list knows of a way
> to make it perform as well as the C version (without using the FFT.)
If you really want to benchmark the numerical code, then, write a
program where there is only numerical code. Given the size of the
matrices you use (8), one can wonder if the program spends more time
to compute the FFT or to test and print the results.
- Fabrice
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Caml-list] DFT in OCaml vs. C
2003-03-27 7:33 [Caml-list] DFT in OCaml vs. C Issac Trotts
2003-03-27 10:58 ` Fabrice Le Fessant
@ 2003-03-27 14:21 ` Markus Mottl
2003-03-27 20:47 ` Issac Trotts
2003-03-27 14:32 ` Xavier Leroy
2 siblings, 1 reply; 10+ messages in thread
From: Markus Mottl @ 2003-03-27 14:21 UTC (permalink / raw)
To: Issac Trotts; +Cc: OCaml List
On Wed, 26 Mar 2003, Issac Trotts wrote:
> The results on my 1 GHZ Pentium III Linux box:
>
> C:
> real 0m21.273s
> user 0m21.200s
> sys 0m0.040s
>
> OCaml:
> real 1m51.602s
> user 1m47.020s
> sys 0m0.260s
Well, another insignificant benchmark... ;-)
My timings on a 2.4 GHZ Pentium IV using GCC 2.95.4 and the latest
CVS-checkout of OCaml:
C:
real 0m24.920s
user 0m24.900s
sys 0m0.020s
OCaml:
real 0m30.397s
user 0m30.390s
sys 0m0.000s
The difference you have observed on your machine is most likely due to
cache effects and possibly also due to compiler differences (GCC version).
Regards,
Markus Mottl
--
Markus Mottl markus@oefai.at
Austrian Research Institute
for Artificial Intelligence http://www.oefai.at/~markus
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Caml-list] DFT in OCaml vs. C
2003-03-27 7:33 [Caml-list] DFT in OCaml vs. C Issac Trotts
2003-03-27 10:58 ` Fabrice Le Fessant
2003-03-27 14:21 ` Markus Mottl
@ 2003-03-27 14:32 ` Xavier Leroy
2003-03-27 14:55 ` Falk Hueffner
` (2 more replies)
2 siblings, 3 replies; 10+ messages in thread
From: Xavier Leroy @ 2003-03-27 14:32 UTC (permalink / raw)
To: Issac Trotts; +Cc: OCaml List
> Here's a numerical mini-benchmark comparing C to OCaml
> on a simple implementation of the Discrete Fourier Transform:
> [...]
> So the C version was about five times as fast.
> I'd be interested if anyone on this list knows of a way
> to make it perform as well as the C version (without using the FFT.)
It can be done, but not on a Pentium 3. Here are my timings:
Pentium 4 Pentium 4 SSE2 Alpha 21264
(2 GHz) (2 GHz) (500 MHz)
C 20 20 36
OCaml (your code) 113 40 52
OCaml (variant 1) 90 26 40
OCaml (variant 2) 72 38 100
Variants 1 and 2 differ on the complex multiply step:
Your code:
let a2=c*. !a -. s*. !b
and b2=c*. !b +. s*. !a in
a := a2;
b := b2;
Variant 1:
let x = s *. !a in
a := c*. !a -. s*. !b;
b := c*. !b +. x
Variant 2:
let olda = !a and oldb = !b in
a := c *. olda -. s *. oldb;
b := c *. oldb +. s *. olda
The "Pentium 4 SSE2" column is an experimental code generator for the
Pentium 4 that uses SSE2 instructions and registers for floating-point
computations. (Before you ask: no, it's not publically available,
but will be the basis for the x86_64 code generator as soon as the
hardware becomes available.)
As you can see above, variant 1 achieves almost the performance of C
on platforms that have a regular register-based FP arithmetic unit.
However, the x86 floating-point stack (what OCaml uses for
compatibility with Pentium 3 and earlier processors) is notoriously
cranky and hard to generate efficient code for. gcc manages to
exploit instruction-level parallelism between the "re" and "im"
computations via amazing feats (fxch instructions, etc), but the
ocamlopt x86 code generator just generates very sequential code...
So, unless you have an Alpha at hand, you'd better consider FFT.
There's an FFT implementation that I use as a benchmark here:
http://camlcvs.inria.fr/cgi-bin/cvsweb.cgi/ocaml/test/fft.ml
and it delivers about 2/3 of the performances of C, even on the Pentium.
- Xavier Leroy
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Caml-list] DFT in OCaml vs. C
2003-03-27 14:32 ` Xavier Leroy
@ 2003-03-27 14:55 ` Falk Hueffner
2003-03-27 16:06 ` OCaml performance (was: Re: [Caml-list] DFT in OCaml vs. C) David Monniaux
2003-03-27 20:54 ` [Caml-list] DFT in OCaml vs. C Issac Trotts
2 siblings, 0 replies; 10+ messages in thread
From: Falk Hueffner @ 2003-03-27 14:55 UTC (permalink / raw)
To: OCaml List
Xavier Leroy <xavier.leroy@inria.fr> writes:
> It can be done, but not on a Pentium 3. Here are my timings:
>
> Pentium 4 Pentium 4 SSE2 Alpha 21264
> (2 GHz) (2 GHz) (500 MHz)
>
> C 20 20 36
> OCaml (your code) 113 40 52
> OCaml (variant 1) 90 26 40
> OCaml (variant 2) 72 38 100
Hmm, on an Alpha with 800 MHz, I measured 25 seconds C versus 69
seconds Ocaml, i. e. a factor of 2.8 instead of 1.4... does your
version contain optimizations for non-i386, too? I'm using 3.06.
--
Falk
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
^ permalink raw reply [flat|nested] 10+ messages in thread
* OCaml performance (was: Re: [Caml-list] DFT in OCaml vs. C)
2003-03-27 14:32 ` Xavier Leroy
2003-03-27 14:55 ` Falk Hueffner
@ 2003-03-27 16:06 ` David Monniaux
2003-03-27 21:27 ` Issac Trotts
2003-03-27 20:54 ` [Caml-list] DFT in OCaml vs. C Issac Trotts
2 siblings, 1 reply; 10+ messages in thread
From: David Monniaux @ 2003-03-27 16:06 UTC (permalink / raw)
To: Liste CAML; +Cc: Antoine Mine
> The "Pentium 4 SSE2" column is an experimental code generator for the
> Pentium 4 that uses SSE2 instructions and registers for floating-point
> computations. (Before you ask: no, it's not publically available,
In this case, to get meaningful comparison results, you should use
gcc -march=pentium4 -msse2 or icc -march=pentium4
> and it delivers about 2/3 of the performances of C, even on the Pentium.
Let me tell you about our experience here. We are developing a large
program consisting of
- a large part of Caml code handling complex data structures
- a smaller C library handling certain numerical matrix computations that
are triggered by the Caml code
- some C (+ assembler) libraries dealing with system-dependent issues.
I profiled the code using OProfile (http://oprofile.sourceforge.net), for
expenses in clock cycles and cache faults. Earlier attempts were made with
gprof.
It turned out that we spent a significant amount of time in:
- The Caml polymorphic compare function (15% time + some cache faults)
Part of the problem seems to lie with the fact that the same function is
called when comparing strings, int64's and other types, thus the
processor has to do lots of tests and jumps just to get at the correct
comparison function.
Wouldn't it be reasonable to define String.compare and Int64.compare to
call monomorphic functions?
- The garbage collector (15% time + lots of cache faults)
There's little we can do about it. Changing the size of the minor heap,
adjusting it to optimize the use of L2 cache seems to gain 2.30% of the
total running time.
Curiously, using the compactor seems to slow things slightly.
Would it be possible to optimize the GC cache-wise? For instance, have
it ask the processor to "prefetch" data.
- 17% in a particular matrix function written in C. There's little we can
do except trying to optimize it carefully and compiling it with the best
C compiler around.
- The rest of the time is spent within the Caml code.
Now this was a bit surprising to us, because we thought we spent far more
time in the numerical computations.
Now back to the original question about DFTs. In your real-life
application, will DFT computations make a major part of the clock cycles
spent by the program?
David Monniaux http://www.di.ens.fr/~monniaux
Laboratoire d'informatique de l'École Normale Supérieure,
Paris, France
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Caml-list] DFT in OCaml vs. C
2003-03-27 10:58 ` Fabrice Le Fessant
@ 2003-03-27 19:40 ` Issac Trotts
0 siblings, 0 replies; 10+ messages in thread
From: Issac Trotts @ 2003-03-27 19:40 UTC (permalink / raw)
To: OCaml List
Fabrice Le Fessant wrote:
>> Here's a numerical mini-benchmark comparing C to OCaml
>> on a simple implementation of the Discrete Fourier Transform:
>>
>> http://redwood.ucdavis.edu/~issac/dft_compare.tar.gz
>>
>> The results on my 1 GHZ Pentium III Linux box:
>>
>>
>
>
>
>> I'd be interested if anyone on this list knows of a way
>> to make it perform as well as the C version (without using the FFT.)
>>
>>
>
>If you really want to benchmark the numerical code, then, write a
>program where there is only numerical code. Given the size of the
>matrices you use (8), one can wonder if the program spends more time
>to compute the FFT or to test and print the results.
>
Okay, I should have made the code clearer. The bulk of the time is spent
in the test2 function, which works on signals much longer than eight
samples.
After removing the printf calls and test1, the time doesn't improve much:
real 1m47.500s
user 1m37.820s
sys 0m0.250s
Thanks for the suggestions.
Issac
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Caml-list] DFT in OCaml vs. C
2003-03-27 14:21 ` Markus Mottl
@ 2003-03-27 20:47 ` Issac Trotts
0 siblings, 0 replies; 10+ messages in thread
From: Issac Trotts @ 2003-03-27 20:47 UTC (permalink / raw)
To: OCaml List
Markus Mottl wrote:
>On Wed, 26 Mar 2003, Issac Trotts wrote:
>
>
>>The results on my 1 GHZ Pentium III Linux box:
>>
>>C:
>>real 0m21.273s
>>user 0m21.200s
>>sys 0m0.040s
>>
>>OCaml:
>>real 1m51.602s
>>user 1m47.020s
>>sys 0m0.260s
>>
>>
>
>Well, another insignificant benchmark... ;-)
>
>My timings on a 2.4 GHZ Pentium IV using GCC 2.95.4 and the latest
>CVS-checkout of OCaml:
>
>C:
>real 0m24.920s
>user 0m24.900s
>sys 0m0.020s
>
>OCaml:
>real 0m30.397s
>user 0m30.390s
>sys 0m0.000s
>
>The difference you have observed on your machine is most likely due to
>cache effects and possibly also due to compiler differences (GCC version).
>
Probably, but the C version doesn't seem as sensitive to these things.
Issac
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Caml-list] DFT in OCaml vs. C
2003-03-27 14:32 ` Xavier Leroy
2003-03-27 14:55 ` Falk Hueffner
2003-03-27 16:06 ` OCaml performance (was: Re: [Caml-list] DFT in OCaml vs. C) David Monniaux
@ 2003-03-27 20:54 ` Issac Trotts
2 siblings, 0 replies; 10+ messages in thread
From: Issac Trotts @ 2003-03-27 20:54 UTC (permalink / raw)
To: OCaml List
Xavier Leroy wrote:
>>Here's a numerical mini-benchmark comparing C to OCaml
>>on a simple implementation of the Discrete Fourier Transform:
>>[...]
>>So the C version was about five times as fast.
>>I'd be interested if anyone on this list knows of a way
>>to make it perform as well as the C version (without using the FFT.)
>>
>>
>
>It can be done, but not on a Pentium 3. Here are my timings:
>
> Pentium 4 Pentium 4 SSE2 Alpha 21264
> (2 GHz) (2 GHz) (500 MHz)
>
>C 20 20 36
>OCaml (your code) 113 40 52
>OCaml (variant 1) 90 26 40
>OCaml (variant 2) 72 38 100
>
>Variants 1 and 2 differ on the complex multiply step:
>
>Your code:
> let a2=c*. !a -. s*. !b
> and b2=c*. !b +. s*. !a in
> a := a2;
> b := b2;
>Variant 1:
> let x = s *. !a in
> a := c*. !a -. s*. !b;
> b := c*. !b +. x
>Variant 2:
> let olda = !a and oldb = !b in
> a := c *. olda -. s *. oldb;
> b := c *. oldb +. s *. olda
>
>
>The "Pentium 4 SSE2" column is an experimental code generator for the
>Pentium 4 that uses SSE2 instructions and registers for floating-point
>computations. (Before you ask: no, it's not publically available,
>but will be the basis for the x86_64 code generator as soon as the
>hardware becomes available.)
>
>As you can see above, variant 1 achieves almost the performance of C
>on platforms that have a regular register-based FP arithmetic unit.
>
>However, the x86 floating-point stack (what OCaml uses for
>compatibility with Pentium 3 and earlier processors) is notoriously
>cranky and hard to generate efficient code for. gcc manages to
>exploit instruction-level parallelism between the "re" and "im"
>computations via amazing feats (fxch instructions, etc), but the
>ocamlopt x86 code generator just generates very sequential code...
>
>So, unless you have an Alpha at hand, you'd better consider FFT.
>There's an FFT implementation that I use as a benchmark here:
>
> http://camlcvs.inria.fr/cgi-bin/cvsweb.cgi/ocaml/test/fft.ml
>
>and it delivers about 2/3 of the performances of C, even on the Pentium.
>
>- Xavier Leroy
>
Thanks for a very informative and helpful message.
Issac
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: OCaml performance (was: Re: [Caml-list] DFT in OCaml vs. C)
2003-03-27 16:06 ` OCaml performance (was: Re: [Caml-list] DFT in OCaml vs. C) David Monniaux
@ 2003-03-27 21:27 ` Issac Trotts
0 siblings, 0 replies; 10+ messages in thread
From: Issac Trotts @ 2003-03-27 21:27 UTC (permalink / raw)
To: OCaml List
David Monniaux wrote:
>>The "Pentium 4 SSE2" column is an experimental code generator for the
>>Pentium 4 that uses SSE2 instructions and registers for floating-point
>>computations. (Before you ask: no, it's not publically available,
>>
>>
>
>In this case, to get meaningful comparison results, you should use
>gcc -march=pentium4 -msse2 or icc -march=pentium4
>
>
>
>>and it delivers about 2/3 of the performances of C, even on the Pentium.
>>
>>
>
>Let me tell you about our experience here. We are developing a large
>program consisting of
>- a large part of Caml code handling complex data structures
>- a smaller C library handling certain numerical matrix computations that
> are triggered by the Caml code
>- some C (+ assembler) libraries dealing with system-dependent issues.
>
>I profiled the code using OProfile (http://oprofile.sourceforge.net), for
>expenses in clock cycles and cache faults. Earlier attempts were made with
>gprof.
>
>It turned out that we spent a significant amount of time in:
>
>- The Caml polymorphic compare function (15% time + some cache faults)
>
> Part of the problem seems to lie with the fact that the same function is
> called when comparing strings, int64's and other types, thus the
> processor has to do lots of tests and jumps just to get at the correct
> comparison function.
>
> Wouldn't it be reasonable to define String.compare and Int64.compare to
> call monomorphic functions?
>
>- The garbage collector (15% time + lots of cache faults)
>
> There's little we can do about it. Changing the size of the minor heap,
> adjusting it to optimize the use of L2 cache seems to gain 2.30% of the
> total running time.
>
> Curiously, using the compactor seems to slow things slightly.
>
> Would it be possible to optimize the GC cache-wise? For instance, have
> it ask the processor to "prefetch" data.
>
>- 17% in a particular matrix function written in C. There's little we can
> do except trying to optimize it carefully and compiling it with the best
> C compiler around.
>
>- The rest of the time is spent within the Caml code.
>
>Now this was a bit surprising to us, because we thought we spent far more
>time in the numerical computations.
>
>
>Now back to the original question about DFTs. In your real-life
>application, will DFT computations make a major part of the clock cycles
>spent by the program?
>
There's a small image processing experiment I want to do that will compute
lots of DFTs on small sub-images and will probably spend most of its
clock cycles
doing the transforms.
- Issac
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2003-03-27 21:21 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-03-27 7:33 [Caml-list] DFT in OCaml vs. C Issac Trotts
2003-03-27 10:58 ` Fabrice Le Fessant
2003-03-27 19:40 ` Issac Trotts
2003-03-27 14:21 ` Markus Mottl
2003-03-27 20:47 ` Issac Trotts
2003-03-27 14:32 ` Xavier Leroy
2003-03-27 14:55 ` Falk Hueffner
2003-03-27 16:06 ` OCaml performance (was: Re: [Caml-list] DFT in OCaml vs. C) David Monniaux
2003-03-27 21:27 ` Issac Trotts
2003-03-27 20:54 ` [Caml-list] DFT in OCaml vs. C Issac Trotts
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox