* [Caml-list] DFT in OCaml vs. C
From: Issac Trotts @ 2003-03-27 7:33 UTC
To: OCaml List

Here's a numerical mini-benchmark comparing C to OCaml
on a simple implementation of the Discrete Fourier Transform:

  http://redwood.ucdavis.edu/~issac/dft_compare.tar.gz

The results on my 1 GHz Pentium III Linux box:

C:
real    0m21.273s
user    0m21.200s
sys     0m0.040s

OCaml:
real    1m51.602s
user    1m47.020s
sys     0m0.260s

So the C version was about five times as fast. This is after looking
for ideas in the "Writing Efficient Numerical code in Objective Caml"
page [1] and the Great Language Shootout statistical moment page for
OCaml [2].

The OCaml code was easier to read and debug, and would be easier to
modify. I'd be interested if anyone on this list knows of a way
to make it perform as well as the C version (without using the FFT.)

[1] http://216.239.53.100/search?q=cache:5YnsSStlWiAC:caml.inria.fr/ocaml/numerical.html+efficient+numerical+ocaml&hl=en&ie=UTF-8
[2] http://www.bagley.org/~doug/shootout/bench/moments/moments.ocaml

Issac Trotts

-------------------
To unsubscribe, mail caml-list-request@inria.fr
Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs
FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners

* Re: [Caml-list] DFT in OCaml vs. C
From: Fabrice Le Fessant @ 2003-03-27 10:58 UTC
To: Issac Trotts; +Cc: OCaml List

> Here's a numerical mini-benchmark comparing C to OCaml
> on a simple implementation of the Discrete Fourier Transform:
>
>   http://redwood.ucdavis.edu/~issac/dft_compare.tar.gz
>
> The results on my 1 GHZ Pentium III Linux box:

> I'd be interested if anyone on this list knows of a way
> to make it perform as well as the C version (without using the FFT.)

If you really want to benchmark the numerical code, then write a
program that contains only numerical code. Given the size of the
matrices you use (8), one can wonder whether the program spends more
time computing the FFT or testing and printing the results.

- Fabrice

* Re: [Caml-list] DFT in OCaml vs. C
From: Issac Trotts @ 2003-03-27 19:40 UTC
To: OCaml List

Fabrice Le Fessant wrote:

>> Here's a numerical mini-benchmark comparing C to OCaml
>> on a simple implementation of the Discrete Fourier Transform:
>>
>>   http://redwood.ucdavis.edu/~issac/dft_compare.tar.gz
>>
>> The results on my 1 GHZ Pentium III Linux box:
>
>> I'd be interested if anyone on this list knows of a way
>> to make it perform as well as the C version (without using the FFT.)
>
> If you really want to benchmark the numerical code, then, write a
> program where there is only numerical code. Given the size of the
> matrices you use (8), one can wonder if the program spends more time
> to compute the FFT or to test and print the results.

Okay, I should have made the code clearer. The bulk of the time is
spent in the test2 function, which works on signals much longer than
eight samples. After removing the printf calls and test1, the time
doesn't improve much:

real    1m47.500s
user    1m37.820s
sys     0m0.250s

Thanks for the suggestions.

Issac

* Re: [Caml-list] DFT in OCaml vs. C
From: Markus Mottl @ 2003-03-27 14:21 UTC
To: Issac Trotts; +Cc: OCaml List

On Wed, 26 Mar 2003, Issac Trotts wrote:
> The results on my 1 GHZ Pentium III Linux box:
>
> C:
> real 0m21.273s
> user 0m21.200s
> sys 0m0.040s
>
> OCaml:
> real 1m51.602s
> user 1m47.020s
> sys 0m0.260s

Well, another insignificant benchmark... ;-)

My timings on a 2.4 GHz Pentium IV using GCC 2.95.4 and the latest
CVS checkout of OCaml:

C:
real    0m24.920s
user    0m24.900s
sys     0m0.020s

OCaml:
real    0m30.397s
user    0m30.390s
sys     0m0.000s

The difference you have observed on your machine is most likely due to
cache effects and possibly also due to compiler differences (GCC version).

Regards,
Markus Mottl

--
Markus Mottl                                             markus@oefai.at
Austrian Research Institute
for Artificial Intelligence                  http://www.oefai.at/~markus

* Re: [Caml-list] DFT in OCaml vs. C
From: Issac Trotts @ 2003-03-27 20:47 UTC
To: OCaml List

Markus Mottl wrote:

>On Wed, 26 Mar 2003, Issac Trotts wrote:
>
>>The results on my 1 GHZ Pentium III Linux box:
>>
>>C:
>>real 0m21.273s
>>user 0m21.200s
>>sys 0m0.040s
>>
>>OCaml:
>>real 1m51.602s
>>user 1m47.020s
>>sys 0m0.260s
>
>Well, another insignificant benchmark... ;-)
>
>My timings on a 2.4 GHZ Pentium IV using GCC 2.95.4 and the latest
>CVS-checkout of OCaml:
>
>C:
>real 0m24.920s
>user 0m24.900s
>sys 0m0.020s
>
>OCaml:
>real 0m30.397s
>user 0m30.390s
>sys 0m0.000s
>
>The difference you have observed on your machine is most likely due to
>cache effects and possibly also due to compiler differences (GCC version).

Probably, but the C version doesn't seem as sensitive to these things.

Issac

* Re: [Caml-list] DFT in OCaml vs. C
From: Xavier Leroy @ 2003-03-27 14:32 UTC
To: Issac Trotts; +Cc: OCaml List

> Here's a numerical mini-benchmark comparing C to OCaml
> on a simple implementation of the Discrete Fourier Transform:
> [...]
> So the C version was about five times as fast.
> I'd be interested if anyone on this list knows of a way
> to make it perform as well as the C version (without using the FFT.)

It can be done, but not on a Pentium 3. Here are my timings:

                     Pentium 4    Pentium 4 SSE2    Alpha 21264
                     (2 GHz)      (2 GHz)           (500 MHz)

C                       20            20                36
OCaml (your code)      113            40                52
OCaml (variant 1)       90            26                40
OCaml (variant 2)       72            38               100

Variants 1 and 2 differ on the complex multiply step:

Your code:
        let a2=c*. !a -. s*. !b
        and b2=c*. !b +. s*. !a in
        a := a2;
        b := b2;
Variant 1:
        let x = s *. !a in
        a := c*. !a -. s*. !b;
        b := c*. !b +. x
Variant 2:
        let olda = !a and oldb = !b in
        a := c *. olda -. s *. oldb;
        b := c *. oldb +. s *. olda

The "Pentium 4 SSE2" column is an experimental code generator for the
Pentium 4 that uses SSE2 instructions and registers for floating-point
computations. (Before you ask: no, it's not publicly available,
but will be the basis for the x86_64 code generator as soon as the
hardware becomes available.)

As you can see above, variant 1 achieves almost the performance of C
on platforms that have a regular register-based FP arithmetic unit.

However, the x86 floating-point stack (what OCaml uses for
compatibility with Pentium 3 and earlier processors) is notoriously
cranky and hard to generate efficient code for. gcc manages to
exploit instruction-level parallelism between the "re" and "im"
computations via amazing feats (fxch instructions, etc), but the
ocamlopt x86 code generator just generates very sequential code...

So, unless you have an Alpha at hand, you'd better consider FFT.
There's an FFT implementation that I use as a benchmark here:

  http://camlcvs.inria.fr/cgi-bin/cvsweb.cgi/ocaml/test/fft.ml

and it delivers about 2/3 of the performance of C, even on the Pentium.

- Xavier Leroy
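
For readers who want to see where the complex multiply step sits, here is a minimal sketch of a naive O(n^2) DFT built around the "variant 1" update quoted above. It is not the code from dft_compare.tar.gz, which is not reproduced in this thread: the function name `dft`, the split real/imaginary array layout, and the sign convention of the exponent are all assumptions made for illustration. Only the three-line rotation of `a` and `b` is taken from the message.

```ocaml
(* Naive O(n^2) DFT sketch. The signal is given as two float arrays
   (real and imaginary parts); this layout is an assumption, not
   necessarily what the original benchmark used. *)
let dft (re : float array) (im : float array) : float array * float array =
  let n = Array.length re in
  if Array.length im <> n then invalid_arg "dft: length mismatch";
  let out_re = Array.make n 0.0 and out_im = Array.make n 0.0 in
  let pi = 4.0 *. atan 1.0 in
  for k = 0 to n - 1 do
    (* Per-sample rotation angle for output bin k: e^(-2*pi*i*k/n). *)
    let theta = (-2.0) *. pi *. float_of_int k /. float_of_int n in
    let c = cos theta and s = sin theta in
    (* (a, b) holds the current twiddle factor a + i*b, starting at 1. *)
    let a = ref 1.0 and b = ref 0.0 in
    let sum_re = ref 0.0 and sum_im = ref 0.0 in
    for j = 0 to n - 1 do
      (* Accumulate x_j * (a + i*b). *)
      sum_re := !sum_re +. re.(j) *. !a -. im.(j) *. !b;
      sum_im := !sum_im +. re.(j) *. !b +. im.(j) *. !a;
      (* Complex multiply step, "variant 1": save s * a before a is
         overwritten, then rotate (a, b) by (c, s). *)
      let x = s *. !a in
      a := c *. !a -. s *. !b;
      b := c *. !b +. x
    done;
    out_re.(k) <- !sum_re;
    out_im.(k) <- !sum_im
  done;
  (out_re, out_im)
```

Swapping the three lines of the rotation for the "your code" or "variant 2" forms computes the same values; according to the timings above, the differences are in the code ocamlopt generates, not in the result. Updating the twiddle factor incrementally accumulates rounding error over long signals, so this sketch is meant to show the structure of the inner loop, not to serve as a reference implementation.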

* Re: [Caml-list] DFT in OCaml vs. C
From: Falk Hueffner @ 2003-03-27 14:55 UTC
To: OCaml List

Xavier Leroy <xavier.leroy@inria.fr> writes:

> It can be done, but not on a Pentium 3. Here are my timings:
>
>                      Pentium 4    Pentium 4 SSE2    Alpha 21264
>                      (2 GHz)      (2 GHz)           (500 MHz)
>
> C                       20            20                36
> OCaml (your code)      113            40                52
> OCaml (variant 1)       90            26                40
> OCaml (variant 2)       72            38               100

Hmm, on an 800 MHz Alpha, I measured 25 seconds for C versus 69
seconds for OCaml, i.e. a factor of 2.8 instead of 1.4... does your
version contain optimizations for non-i386, too? I'm using 3.06.

--
Falk

* OCaml performance (was: Re: [Caml-list] DFT in OCaml vs. C)
From: David Monniaux @ 2003-03-27 16:06 UTC
To: Liste CAML; +Cc: Antoine Mine

> The "Pentium 4 SSE2" column is an experimental code generator for the
> Pentium 4 that uses SSE2 instructions and registers for floating-point
> computations. (Before you ask: no, it's not publically available,

In this case, to get meaningful comparison results, you should use
gcc -march=pentium4 -msse2 or icc -march=pentium4

> and it delivers about 2/3 of the performances of C, even on the Pentium.

Let me tell you about our experience here. We are developing a large
program consisting of
- a large part of Caml code handling complex data structures
- a smaller C library handling certain numerical matrix computations that
  are triggered by the Caml code
- some C (+ assembler) libraries dealing with system-dependent issues.

I profiled the code using OProfile (http://oprofile.sourceforge.net),
measuring clock cycles and cache faults. Earlier attempts were made with
gprof.

It turned out that we spent a significant amount of time in:

- The Caml polymorphic compare function (15% time + some cache faults)

  Part of the problem seems to lie with the fact that the same function is
  called when comparing strings, int64's and other types, thus the
  processor has to do lots of tests and jumps just to get at the correct
  comparison function.

  Wouldn't it be reasonable to define String.compare and Int64.compare to
  call monomorphic functions?

- The garbage collector (15% time + lots of cache faults)

  There's little we can do about it. Changing the size of the minor heap,
  adjusting it to optimize the use of the L2 cache, seems to gain 2.30% of
  the total running time.

  Curiously, using the compactor seems to slow things down slightly.

  Would it be possible to optimize the GC cache-wise? For instance, have
  it ask the processor to "prefetch" data.

- 17% in a particular matrix function written in C. There's little we can
  do except trying to optimize it carefully and compiling it with the best
  C compiler around.

- The rest of the time is spent within the Caml code.

Now this was a bit surprising to us, because we thought we spent far more
time in the numerical computations.

Now back to the original question about DFTs. In your real-life
application, will DFT computations make up a major part of the clock
cycles spent by the program?

David Monniaux            http://www.di.ens.fr/~monniaux
Laboratoire d'informatique de l'École Normale Supérieure,
Paris, France
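
The point about polymorphic comparison can be illustrated with a small sketch. It assumes a current OCaml standard library, in which `String.compare` and `Int64.compare` exist as type-specific functions; whether they are actually faster than the generic `compare` depends on the compiler version and was not the case for the release discussed in this message. The helper names (`sort_generic`, `compare_int`, `sort_ints`, `sort_strings`, `sort_int64s`) are hypothetical and chosen only for this example.

```ocaml
(* Generic version: 'compare' is used at an unknown type, so the call
   may go through the runtime's polymorphic comparison, which has to
   inspect the values to decide how to compare them. *)
let sort_generic xs = List.sort compare xs

(* Type-constrained wrapper: the annotation fixes the argument type,
   which gives the compiler the chance to use an integer comparison
   instead of the generic one (an assumption about the optimizer, not
   a guarantee). *)
let compare_int (a : int) (b : int) : int = compare a b

let sort_ints (xs : int list) = List.sort compare_int xs

(* Type-specific comparison functions from the standard library. *)
let sort_strings (xs : string list) = List.sort String.compare xs
let sort_int64s (xs : int64 list) = List.sort Int64.compare xs
```

All four functions return the same ordering as the generic version on their respective types, so switching between them is a performance choice only. As with the profile described above, the gain is worth measuring before and after on the real program.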

* Re: OCaml performance (was: Re: [Caml-list] DFT in OCaml vs. C)
From: Issac Trotts @ 2003-03-27 21:27 UTC
To: OCaml List

David Monniaux wrote:

>>The "Pentium 4 SSE2" column is an experimental code generator for the
>>Pentium 4 that uses SSE2 instructions and registers for floating-point
>>computations. (Before you ask: no, it's not publically available,
>
>In this case, to get meaningful comparison results, you should use
>gcc -march=pentium4 -msse2 or icc -march=pentium4
>
>>and it delivers about 2/3 of the performances of C, even on the Pentium.
>
>Let me tell you about our experience here. We are developing a large
>program consisting of
>- a large part of Caml code handling complex data structures
>- a smaller C library handling certain numerical matrix computations that
>  are triggered by the Caml code
>- some C (+ assembler) libraries dealing with system-dependent issues.
>
>I profiled the code using OProfile (http://oprofile.sourceforge.net), for
>expenses in clock cycles and cache faults. Earlier attempts were made with
>gprof.
>
>It turned out that we spent a significant amount of time in:
>
>- The Caml polymorphic compare function (15% time + some cache faults)
>
>  Part of the problem seems to lie with the fact that the same function is
>  called when comparing strings, int64's and other types, thus the
>  processor has to do lots of tests and jumps just to get at the correct
>  comparison function.
>
>  Wouldn't it be reasonable to define String.compare and Int64.compare to
>  call monomorphic functions?
>
>- The garbage collector (15% time + lots of cache faults)
>
>  There's little we can do about it. Changing the size of the minor heap,
>  adjusting it to optimize the use of L2 cache seems to gain 2.30% of the
>  total running time.
>
>  Curiously, using the compactor seems to slow things slightly.
>
>  Would it be possible to optimize the GC cache-wise? For instance, have
>  it ask the processor to "prefetch" data.
>
>- 17% in a particular matrix function written in C. There's little we can
>  do except trying to optimize it carefully and compiling it with the best
>  C compiler around.
>
>- The rest of the time is spent within the Caml code.
>
>Now this was a bit surprising to us, because we thought we spent far more
>time in the numerical computations.
>
>Now back to the original question about DFTs. In your real-life
>application, will DFT computations make a major part of the clock cycles
>spent by the program?

There's a small image processing experiment I want to do that will
compute lots of DFTs on small sub-images and will probably spend most of
its clock cycles doing the transforms.

- Issac

* Re: [Caml-list] DFT in OCaml vs. C
From: Issac Trotts @ 2003-03-27 20:54 UTC
To: OCaml List

Xavier Leroy wrote:

>>Here's a numerical mini-benchmark comparing C to OCaml
>>on a simple implementation of the Discrete Fourier Transform:
>>[...]
>>So the C version was about five times as fast.
>>I'd be interested if anyone on this list knows of a way
>>to make it perform as well as the C version (without using the FFT.)
>
>It can be done, but not on a Pentium 3. Here are my timings:
>
>                     Pentium 4    Pentium 4 SSE2    Alpha 21264
>                     (2 GHz)      (2 GHz)           (500 MHz)
>
>C                       20            20                36
>OCaml (your code)      113            40                52
>OCaml (variant 1)       90            26                40
>OCaml (variant 2)       72            38               100
>
>Variants 1 and 2 differ on the complex multiply step:
>
>Your code:
>        let a2=c*. !a -. s*. !b
>        and b2=c*. !b +. s*. !a in
>        a := a2;
>        b := b2;
>Variant 1:
>        let x = s *. !a in
>        a := c*. !a -. s*. !b;
>        b := c*. !b +. x
>Variant 2:
>        let olda = !a and oldb = !b in
>        a := c *. olda -. s *. oldb;
>        b := c *. oldb +. s *. olda
>
>The "Pentium 4 SSE2" column is an experimental code generator for the
>Pentium 4 that uses SSE2 instructions and registers for floating-point
>computations. (Before you ask: no, it's not publically available,
>but will be the basis for the x86_64 code generator as soon as the
>hardware becomes available.)
>
>As you can see above, variant 1 achieves almost the performance of C
>on platforms that have a regular register-based FP arithmetic unit.
>
>However, the x86 floating-point stack (what OCaml uses for
>compatibility with Pentium 3 and earlier processors) is notoriously
>cranky and hard to generate efficient code for. gcc manages to
>exploit instruction-level parallelism between the "re" and "im"
>computations via amazing feats (fxch instructions, etc), but the
>ocamlopt x86 code generator just generates very sequential code...
>
>So, unless you have an Alpha at hand, you'd better consider FFT.
>There's an FFT implementation that I use as a benchmark here:
>
>  http://camlcvs.inria.fr/cgi-bin/cvsweb.cgi/ocaml/test/fft.ml
>
>and it delivers about 2/3 of the performances of C, even on the Pentium.
>
>- Xavier Leroy

Thanks for a very informative and helpful message.

Issac

end of thread [~2003-03-27 21:21 UTC]

Thread overview: 10+ messages
2003-03-27  7:33 [Caml-list] DFT in OCaml vs. C Issac Trotts
2003-03-27 10:58 ` Fabrice Le Fessant
2003-03-27 19:40   ` Issac Trotts
2003-03-27 14:21 ` Markus Mottl
2003-03-27 20:47   ` Issac Trotts
2003-03-27 14:32 ` Xavier Leroy
2003-03-27 14:55   ` Falk Hueffner
2003-03-27 16:06   ` OCaml performance (was: Re: [Caml-list] DFT in OCaml vs. C) David Monniaux
2003-03-27 21:27     ` Issac Trotts
2003-03-27 20:54   ` [Caml-list] DFT in OCaml vs. C Issac Trotts