From: Paul Stodghill <stodghil@CS.Cornell.EDU>
To: "Xavier Leroy" <xavier.leroy@inria.fr>
Cc: <caml-list@inria.fr>
Subject: Re: [Caml-list] matrix-matrix multiply - O'Caml is 6 times slower than C
Date: Sun, 20 Oct 2002 14:06:34 -0400
Message-ID: <r99of9pgf2t.fsf@quimby-xp.cs.cornell.edu>
In-Reply-To: <20021020114251.A7642@pauillac.inria.fr> ("Xavier Leroy"'s message of "Sun, 20 Oct 2002 05:42:51 -0400")
>>>>> "Xavier" == Xavier Leroy <xavier.leroy@inria.fr> writes:
Xavier> No, it doesn't. It doesn't hoist loop-invariant computations
Xavier> either. (See http://caml.inria.fr/ocaml/numerical.html)
I'll take a look. Thanks.
Xavier> Multiplications are pretty expensive on modern processors,
Xavier> so your code would run much faster if you hoist the
Xavier> computation i*120 out of the j and k loops, and
Xavier> strength-reduce k*120 in the innermost loop.
I don't see how to do strength reduction without introducing references,
and the cost of the references appears to outweigh the gain.
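Just to make concrete what I was attempting, here is a sketch of the i-j-k
loop with i*120 hoisted and k*120 strength-reduced through a ref. This is
not my benchmark code verbatim; the 120x120 size and the flat row-major
layout are the ones from my test.

let mmm_ijk_sr a b c =
  let n = 120 in
  for i = 0 to n - 1 do
    let i_n = i * n in                  (* hoisted out of the j and k loops *)
    for j = 0 to n - 1 do
      let sum = ref 0.0 in
      let k_n = ref 0 in                (* strength-reduced k*n *)
      for k = 0 to n - 1 do
        sum := !sum +. a.(i_n + k) *. b.(!k_n + j);
        k_n := !k_n + n
      done;
      c.(i_n + j) <- c.(i_n + j) +. !sum
    done
  done

The refs for sum and k_n are exactly the ones that seem to cost more than
the multiply they save.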
Xavier> Another option worth considering is to swap the j and k
Xavier> loops, thus making k*120 and a.(i*120+k) two invariants of
Xavier> the inner loop. In this case, you might not even have to
Xavier> strength-reduce k*120.
I tried this, and indeed the performance of the ML code gets up to 95
Mflops. However, the performance of the C code goes down, as the ikj
loop ordering requires an additional store of C[i,j] in every iteration.
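For reference, the i-k-j version I measured looks roughly like this (a
sketch, with the same assumptions about size and layout as above):

let mmm_ikj a b c =
  let n = 120 in
  for i = 0 to n - 1 do
    let i_n = i * n in
    for k = 0 to n - 1 do
      let aik = a.(i_n + k) in          (* invariant of the inner j loop *)
      let k_n = k * n in                (* likewise invariant *)
      for j = 0 to n - 1 do
        c.(i_n + j) <- c.(i_n + j) +. aik *. b.(k_n + j)
      done
    done
  done

The extra load and store of c.(i_n + j) in every inner iteration is what
hurts the C version here.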
Xavier> Other loop optimizations such as loop unrolling would (I
Xavier> guess) make much less of a difference -- modern processors
Xavier> are very good at branch prediction, making loop overhead
Xavier> low.
Loop unrolling can also increase instruction-level parallelism (ILP), and
it can make a big difference in performance. It can be thought of as a way
of getting some of the effect of software pipelining without the
complicated scheduling.
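For example, unrolling the inner j loop of the i-k-j version by 4 exposes
four independent multiply-adds per trip. This is only a sketch; it assumes
n is a multiple of 4 (120 is), so no cleanup loop is shown.

let mmm_ikj_unroll4 a b c =
  let n = 120 in
  for i = 0 to n - 1 do
    let i_n = i * n in
    for k = 0 to n - 1 do
      let aik = a.(i_n + k) in
      let k_n = k * n in
      for jj = 0 to n / 4 - 1 do
        let j = jj * 4 in
        (* four independent multiply-adds: more ILP, less loop overhead *)
        c.(i_n + j)     <- c.(i_n + j)     +. aik *. b.(k_n + j);
        c.(i_n + j + 1) <- c.(i_n + j + 1) +. aik *. b.(k_n + j + 1);
        c.(i_n + j + 2) <- c.(i_n + j + 2) +. aik *. b.(k_n + j + 2);
        c.(i_n + j + 3) <- c.(i_n + j + 3) +. aik *. b.(k_n + j + 3)
      done
    done
  done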
Xavier> Some more thoughts:
Xavier> Bagley's code works on a "proper" matrix (an array of
Xavier> arrays), while yours flattens the matrix into one single
Xavier> array. There are pros and cons to either approach, but
Xavier> notice that your approach (without strength reduction)
Xavier> entails fewer loads but more multiplications, which can be
Xavier> more expensive...
Right. So it is a win when it is done in combination with strength reduction.
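Schematically, the two access patterns being compared are (these helpers
are only illustrative, not code from either benchmark):

(* array of arrays: two dependent loads (row, then element), no multiply *)
let get_nested (a : float array array) i k = a.(i).(k)

(* flat array: one load, but an i*n multiply unless it is hoisted or
   strength-reduced as above *)
let get_flat (a : float array) n i k = a.(i * n + k)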
Xavier> Tiling loops is good for really large matrices, but yours
Xavier> (in this example) occupies "only" 115 Kb, thus fits
Xavier> comfortably in L2 cache. Perhaps tiling isn't beneficial in
Xavier> this case.
True. This was a miscalculation on my part. However, the blocksize that
I chose makes the blocks fit in the L1 cache, so I was still getting a
measurable benefit.
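The blocked version I was timing is along these lines. This is a sketch
only: the block size below is a placeholder rather than the one I actually
used, and it assumes n is a multiple of the block size.

let mmm_tiled a b c =
  let n = 120 and bs = 40 in            (* bs is a placeholder block size *)
  for ii = 0 to n / bs - 1 do
    for kk = 0 to n / bs - 1 do
      for jj = 0 to n / bs - 1 do
        (* i-k-j multiply restricted to one bs*bs block of each operand *)
        for i = ii * bs to ii * bs + bs - 1 do
          let i_n = i * n in
          for k = kk * bs to kk * bs + bs - 1 do
            let aik = a.(i_n + k) in
            let k_n = k * n in
            for j = jj * bs to jj * bs + bs - 1 do
              c.(i_n + j) <- c.(i_n + j) +. aik *. b.(k_n + j)
            done
          done
        done
      done
    done
  done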
Here is what I was trying to accomplish: I am involved with a project
that is trying to automate and generalize some of the tricks that ATLAS
(http://math-atlas.sourceforge.net/) uses for getting maximal
performance from matrix-matrix multiply (MMM). I knew about Bagley's
conclusion that O'Caml was competitive with C for MMM, so I was curious
how close O'Caml came to GCC on blocked MMM. I would like to be able to
use O'Caml as a target language instead of C or Fortran.
My conclusion, at this point, is that the current native O'Caml compiler
is not going to be competitive with a native C or Fortran compiler,
because it does not optimize the innermost loop as aggressively.
Thanks for your help.