* [Caml-list] novice puzzled by speed tests @ 2004-01-03 12:57 dolfi 2004-01-04 16:49 ` Xavier Leroy 0 siblings, 1 reply; 4+ messages in thread From: dolfi @ 2004-01-03 12:57 UTC (permalink / raw) To: caml-list Hi everybody, I'm new to the list, and deeply impressed by the stability and speed of OCaml, as well as its functional nature. My best congratulations to the authors! Toying around with 3.07, I found that ocamlopt.opt -unsafe (on Mandrake 9.1, Pentium 4, 2.4 GHz) actually produces slower code than ocamlopt.opt. The program in question is a prime generator: let count = 400000 let ptab = let ptab = Array.create count 0 in ( ptab.(0)<- 2; ptab.(1) <- 3; ptab.(2) <- 5; ptab );; let rec loop nr toggle psize = ( let max = truncate (sqrt (float nr)) in let rec floop i = if ptab.(i)>max then ( ptab.(psize) <- nr; loop (nr+toggle) (6-toggle) (succ psize) ) else if nr mod ptab.(i) = 0 then loop (nr+toggle) (6-toggle) psize else floop (succ i) in if psize<count then floop 2 else () ) in loop 5 2 3; Printf.printf "prime %d: %d\n" count ptab.(pred count);; On my box, the corresponding C program (gcc -O3) is slightly slower than the ocamlopt.opt compiled O'Caml program, but about 25-30% faster than the -unsafe one: #include <stdio.h> #include <math.h> #define how_many 400000 int main() { unsigned int nr = 5, toggle = 2, max, primes_size = 3, i; unsigned int primes[how_many]; primes[0] = 2; primes[1] = 3; primes[2] = 5; loop: nr += toggle; toggle = 6 - toggle; max = sqrt(nr); for (i = 2; primes[i] <= max; ++i) if (!(nr % primes[i])) goto loop; primes[primes_size++] = nr; if (primes_size < how_many) goto loop; printf("%i\n", primes[how_many - 1]); return 0; } Of course it's good that range checking increases the speed of programs, but, being a long-time C user, I'm a little bit puzzled by miracles like this. I suspected that the sense of the -unsafe flag was inverted, but it isn't: the -unsafe program dies with SEGV when I deliberately introduce a range overflow, the safe one gets an exception. Till soon, Dolfi ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Caml-list] novice puzzled by speed tests 2004-01-03 12:57 [Caml-list] novice puzzled by speed tests dolfi @ 2004-01-04 16:49 ` Xavier Leroy 2004-01-04 20:49 ` Brian Hurt 2004-01-05 19:50 ` [Caml-list] camlp4 Ker Lutyn 0 siblings, 2 replies; 4+ messages in thread From: Xavier Leroy @ 2004-01-04 16:49 UTC (permalink / raw) To: dolfi; +Cc: caml-list > Toying around with 3.07, I found that ocamlopt.opt -unsafe (on Mandrake > 9.1, Pentium 4, 2.4 GHz) actually produces slower code than ocamlopt.opt. > On my box, the corresponding C program (gcc -O3) is slightly slower than > the ocamlopt.opt compiled O'Caml program, but about 25-30% faster than the > -unsafe one: > Of course it's good that range checking increases the speed of programs, > but, being a long-time C user, I'm a little bit puzzled by miracles like > this. I suspected that the sense of the -unsafe flag was inverted, but it > isn't: the -unsafe program dies with SEGV when I deliberately introduce a > range overflow, the safe one gets an exception. Welcome to the wonderful world of modern processors. It's not uncommon to observe "absurd" speed differences of up to 20%. By "absurd" I mean for instance adding or removing dead code (never executed) and observing a slowdown, or executing more code and observing a speedup. As far as I can guess, this is due to two processor features: - Lots of instruction-level parallelism is available. Thus, if your main code doesn't use all of the computational units, adding extra code (such as array bound checks) that can execute in parallel doesn't reduce execution speed. - Performance is very sensitive to code placement. Things like code cache conflicts, (mis-) alignment of branch targets, and oddities in the instruction decoding logic can cause insertion *or deletion* of a few instructions to have significant impact on execution speed. These are just wild guesses. The bottom line is that processors have become so complex that explaining observed performances (let alone predicting performances) has become nearly impossible, at least for small performance variations (say, less than a factor of 1.5). (This makes compiler writers very unhappy, because they used to make a living by cranking out 5% speed improvements, which are now lost in the overall noise :-) If you have access to a good performance analysis tool, such as Intel's VTune, you could run it on your three executables and see if some rational explanation comes out of VTune's figures. But I wouldn't bet on it. - Xavier Leroy ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Caml-list] novice puzzled by speed tests 2004-01-04 16:49 ` Xavier Leroy @ 2004-01-04 20:49 ` Brian Hurt 2004-01-05 19:50 ` [Caml-list] camlp4 Ker Lutyn 1 sibling, 0 replies; 4+ messages in thread From: Brian Hurt @ 2004-01-04 20:49 UTC (permalink / raw) To: Xavier Leroy; +Cc: dolfi, caml-list On Sun, 4 Jan 2004, Xavier Leroy wrote: > > Toying around with 3.07, I found that ocamlopt.opt -unsafe (on Mandrake > > 9.1, Pentium 4, 2.4 GHz) actually produces slower code than ocamlopt.opt. > > On my box, the corresponding C program (gcc -O3) is slightly slower than > > the ocamlopt.opt compiled O'Caml program, but about 25-30% faster than the > > -unsafe one: > > Of course it's good that range checking increases the speed of programs, > > but, being a long-time C user, I'm a little bit puzzled by miracles like > > this. I suspected that the sense of the -unsafe flag was inverted, but it > > isn't: the -unsafe program dies with SEGV when I deliberately introduce a > > range overflow, the safe one gets an exception. > > Welcome to the wonderful world of modern processors. It's not > uncommon to observe "absurd" speed differences of up to 20%. By > "absurd" I mean for instance adding or removing dead code (never > executed) and observing a slowdown, or executing more code and > observing a speedup. > > As far as I can guess, this is due to two processor features: > > - Lots of instruction-level parallelism is available. Thus, if your > main code doesn't use all of the computational units, adding extra code > (such as array bound checks) that can execute in parallel doesn't > reduce execution speed. Modern processors also do fairly aggressive branch prediction. Bounds checking doesn't *normally* fail, so the processor very quickly starts predicting the branch with a very high degree of accuracy. A correctly predicted branch is not exactly no cost, but very near to it. A mispredicted branch is expensive- dozens of clock cycles. Hmm. Actually, as I understand it, all array accesses call the same code in Ocaml- that code then does the bounds checking. This means that every ocaml program has only one branch instruction which is the bounds checking branch. Thinking about it, in modern processors this may be better than hoisting the bounds checking- even in situations where the most bounds checking can be eliminated in hoisting. This is because processors only keep/"cache" information for branch prediction for so many branches- generally about 4K branches IIRC. So, program A doesn't hoist it's branch checking, and branch checking uses the exact same branch. This means that only one entry in the branch history cache is used by bounds checking. But it executes that branch a lot. Program B does hoist it's branch checking- now there are a thousand different branch instructions in the program to do branch checking. Now 25% of the branch history cache is for bounds checking branches- all of which should be trivially predictable, but are pushing other branches out of the branch history. Boom, the code runs slower. This isn't what this programmer is hitting. But it demonstrates that performance optimization is trickier than it might seem. > > - Performance is very sensitive to code placement. Things like code > cache conflicts, (mis-) alignment of branch targets, and oddities in the > instruction decoding logic can cause insertion *or deletion* of a few > instructions to have significant impact on execution speed. Cache conflicts especially. Modern processors have 100-350 or more clock cycle latencies to go to main memory. Very expensive- to the point where I'd argue that the run time of a program is mainly determined by the number of cache misses the program has, and not the number of instructions executed. If I had to guess, he's hitting some weird cache behavior. Generally, instruction decode weirdness, branch target alignment, etc. only cost dozens of clock cycles and don't make that big of a change on the performance of code. It's cache effects that can make surprisingly big swings. You can have poor cache behavior even with working sets much smaller than cache, due to the way cache works. Say you have a 2-way set associative cache of size C. This means if you're heavily accessing locations X and X+C, you're okay- but if you're heavily accessing locations X, X+C, and X+2C, you're screwed. Your cache can only hold two of those in memory at once- accessing the third one pushes one of the original two out of cache. You make a "minor" change, and suddenly you're accessing locations X, X+C, and X+2C+L (where L is the cache line size- generally 16-64 bytes), and boom- all three address lines fit into cache now. And suddenly your code is 100 times faster. This can happen with both data and code. I've seen this happen in real code. Here's an interesting paper for two reasons- first, it shows just how expensive minor changes can get due to cache effects, and second of all it has a solution (skew associative caches) that I'd like to see implemented, but it has a NIH problem: http://citeseer.nj.nec.com/bodin94cache.html Thinking about it, I'd love to see Ocaml do, as it's final linking stage, just a quick depth-first traversal sorting of the known-call graph of the functions. I'd be this would do good things to Ocaml's performance. > > These are just wild guesses. The bottom line is that processors have > become so complex that explaining observed performances (let alone > predicting performances) has become nearly impossible, at least for > small performance variations (say, less than a factor of 1.5). Easily. I've seen performance swings of 100x due to simple changes. -- "Usenet is like a herd of performing elephants with diarrhea -- massive, difficult to redirect, awe-inspiring, entertaining, and a source of mind-boggling amounts of excrement when you least expect it." - Gene Spafford Brian ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 4+ messages in thread
* [Caml-list] camlp4 2004-01-04 16:49 ` Xavier Leroy 2004-01-04 20:49 ` Brian Hurt @ 2004-01-05 19:50 ` Ker Lutyn 1 sibling, 0 replies; 4+ messages in thread From: Ker Lutyn @ 2004-01-05 19:50 UTC (permalink / raw) To: caml-list Happy New Year! I just tried to get camlp4 working from cvs: - configure looks for exact equality between the ocaml version and the version in camlp4/pcaml.ml - camlp4/argl.ml does not handle Arg.Tuple and Arg.Bool Since I wanted camlp4 for a quick one-off and I am under deadline, I can't look into it more deeply. Let me just add my voice to those others who think it is too bad that camlp4 has fallen off the map. Just FYI, here's a cool camlp4 link: http://www.venge.net/graydon/talks/mkc/html/index.html __________________________________ Do you Yahoo!? Find out what made the Top Yahoo! Searches of 2003 http://search.yahoo.com/top2003 ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2004-01-05 19:50 UTC | newest] Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2004-01-03 12:57 [Caml-list] novice puzzled by speed tests dolfi 2004-01-04 16:49 ` Xavier Leroy 2004-01-04 20:49 ` Brian Hurt 2004-01-05 19:50 ` [Caml-list] camlp4 Ker Lutyn
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox