From: Brian Hurt <bhurt@spnz.org>
To: Xavier Leroy <xavier.leroy@inria.fr>
Cc: dolfi@zkm.de, <caml-list@inria.fr>
Subject: Re: [Caml-list] novice puzzled by speed tests
Date: Sun, 4 Jan 2004 14:49:02 -0600 (CST) [thread overview]
Message-ID: <Pine.LNX.4.44.0401041403150.4373-100000@localhost.localdomain> (raw)
In-Reply-To: <20040104174920.A20953@pauillac.inria.fr>
On Sun, 4 Jan 2004, Xavier Leroy wrote:
> > Toying around with 3.07, I found that ocamlopt.opt -unsafe (on Mandrake
> > 9.1, Pentium 4, 2.4 GHz) actually produces slower code than ocamlopt.opt.
> > On my box, the corresponding C program (gcc -O3) is slightly slower than
> > the ocamlopt.opt compiled O'Caml program, but about 25-30% faster than the
> > -unsafe one:
> > Of course it's good that range checking increases the speed of programs,
> > but, being a long-time C user, I'm a little bit puzzled by miracles like
> > this. I suspected that the sense of the -unsafe flag was inverted, but it
> > isn't: the -unsafe program dies with SEGV when I deliberately introduce a
> > range overflow, the safe one gets an exception.
>
> Welcome to the wonderful world of modern processors. It's not
> uncommon to observe "absurd" speed differences of up to 20%. By
> "absurd" I mean for instance adding or removing dead code (never
> executed) and observing a slowdown, or executing more code and
> observing a speedup.
>
> As far as I can guess, this is due to two processor features:
>
> - Lots of instruction-level parallelism is available. Thus, if your
> main code doesn't use all of the computational units, adding extra code
> (such as array bound checks) that can execute in parallel doesn't
> reduce execution speed.
Modern processors also do fairly aggressive branch prediction. A bounds
check doesn't *normally* fail, so the processor very quickly learns to
predict that branch with a very high degree of accuracy. A correctly
predicted branch isn't quite free, but it's very close to it. A
mispredicted branch is expensive- dozens of clock cycles.
Hmm. Actually, as I understand it, all array accesses in Ocaml go
through the same code- and that code does the bounds checking. This
means every Ocaml program has only one bounds-checking branch
instruction. Thinking about it, on modern processors this may be better
than hoisting the bounds checks into each access site- even in
situations where hoisting would let most of the checks be eliminated.
This is because processors only keep/"cache" branch-prediction
information for so many branches- generally about 4K branches, IIRC.
So program A doesn't hoist its bounds checking, and every check goes
through the exact same branch. Only one entry in the branch history
table is used for bounds checking- it just executes that branch a lot.
Program B does hoist its bounds checks- now there are, say, a thousand
different branch instructions in the program doing bounds checking.
Now 25% of the branch history table holds bounds-checking branches- all
of which should be trivially predictable, but which push other branches
out of the history. Boom, the code runs slower.
This isn't what this programmer is hitting, but it demonstrates that
performance optimization is trickier than it might seem.
>
> - Performance is very sensitive to code placement. Things like code
> cache conflicts, (mis-) alignment of branch targets, and oddities in the
> instruction decoding logic can cause insertion *or deletion* of a few
> instructions to have significant impact on execution speed.
Cache conflicts especially. Modern processors have latencies of 100-350
or more clock cycles to main memory. Very expensive- to the point where
I'd argue that the run time of a program is mainly determined by the
number of cache misses it takes, not the number of instructions it
executes. If I had to guess, he's hitting some weird cache behavior.
Instruction-decode weirdness, branch-target alignment, etc. generally
cost only dozens of clock cycles and don't make that big a change in
the performance of code. It's cache effects that can make surprisingly
big swings.
You can have poor cache behavior even with working sets much smaller
than the cache, due to the way caches work. Say you have a 2-way
set-associative cache of size C. If you're heavily accessing locations
X and X+C, you're okay- but if you're heavily accessing locations X,
X+C, and X+2C, you're screwed. All three map to the same set, and the
cache can only hold two of them at once- accessing the third pushes one
of the other two out of cache. Then you make a "minor" change, and
suddenly you're accessing locations X, X+C, and X+2C+L (where L is the
cache line size- generally 16-64 bytes), and boom- all three lines fit
in the cache, and suddenly your code is 100 times faster. This can
happen with both data and code. I've seen it happen in real code.
Here's an interesting paper, for two reasons: first, it shows just how
expensive minor changes can get due to cache effects, and second, it
proposes a solution (skewed-associative caches) that I'd like to see
implemented, but which has a NIH problem:
http://citeseer.nj.nec.com/bodin94cache.html
Thinking about it, I'd love to see Ocaml do, as its final linking
stage, just a quick depth-first-traversal sort of the known-call graph
of the functions. I'd bet this would do good things for Ocaml's
performance.
>
> These are just wild guesses. The bottom line is that processors have
> become so complex that explaining observed performances (let alone
> predicting performances) has become nearly impossible, at least for
> small performance variations (say, less than a factor of 1.5).
Easily. I've seen performance swings of 100x due to simple changes.
--
"Usenet is like a herd of performing elephants with diarrhea -- massive,
difficult to redirect, awe-inspiring, entertaining, and a source of
mind-boggling amounts of excrement when you least expect it."
- Gene Spafford
Brian
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners