From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id UAA18973; Sun, 20 Oct 2002 20:07:02 +0200 (MET DST) X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id UAA17495 for ; Sun, 20 Oct 2002 20:07:00 +0200 (MET DST) Received: from sundown.cs.cornell.edu (sundown.cs.cornell.edu [128.84.96.20]) by nez-perce.inria.fr (8.11.1/8.11.1) with ESMTP id g9KI6wD15447; Sun, 20 Oct 2002 20:06:58 +0200 (MET DST) Received: from quimby-xp.cs.cornell.edu.cs.cornell.edu (dhcp99-102.cs.cornell.edu [128.84.99.102]) by sundown.cs.cornell.edu (8.11.3/8.11.3/R-3.7) with ESMTP id g9KI6o106738; Sun, 20 Oct 2002 14:06:55 -0400 (EDT) To: "Xavier Leroy" Cc: Subject: Re: [Caml-list] matrix-matrix multiply - O'Caml is 6 times slower than C References: <20021020114251.A7642@pauillac.inria.fr> From: Paul Stodghill In-Reply-To: <20021020114251.A7642@pauillac.inria.fr> ("Xavier Leroy"'s message of "Sun, 20 Oct 2002 05:42:51 -0400") Date: Sun, 20 Oct 2002 14:06:34 -0400 Message-ID: User-Agent: Gnus/5.090007 (Oort Gnus v0.07) XEmacs/21.4 (Military Intelligence (RC2 Windows), i686-pc-cygwin) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-caml-list@pauillac.inria.fr Precedence: bulk >>>>> "Xavier" == Xavier Leroy writes: Xavier> No, it doesn't. It doesn't hoist loop-invariant computations Xavier> either. (See http://caml.inria.fr/ocaml/numerical.html) I'll take a look. Thanks. Xavier> Multiplications are pretty expensive on modern processors, Xavier> so your code would run much faster if you hoist the Xavier> computation i*120 out of the j and k loops, and Xavier> strength-reduce k*120 in the innermost loop. I don't see how to do strength-reduction without introducing references, which appears to reduce performance more than is gained. Xavier> Another option worth considering is to swap the j and k Xavier> loops, thus making k*120 and a.(i*120+k) two invariants of Xavier> the inner loop. In this case, you might not even have to Xavier> strength-reduce k*120. I tried this, and indeed the performance of the ML code get up to 95 Mflops. However, the performance of the C code goes down, as the ikj loop ordering requires an additional store of C[i,j] in every iteration. Xavier> Other loop optimizations such as loop unrolling would (I Xavier> guess) make much less of a difference -- modern processors Xavier> are very good at branch prediction, making loop overhead Xavier> low. Loop unrolling can also increase ILP. It can make a big difference in performance. Loop unrolling can be thought of as a way of getting some of the effect of software pipelining without the complicated scheduling. Xavier> Some more thoughts: Xavier> Bagley's code works on a "proper" matrix (an array of Xavier> arrays), while yours flatten the matrix into one single Xavier> array. There are pros and cons to either approach, but Xavier> notice that your approach (without strength reduction) Xavier> entails fewer loads but more multiplications, which can be Xavier> more expensive... Right. So it is a win when it is done in combination with strength reduction. Xavier> Tiling loops is good for really large matrices, but yours Xavier> (in this example) occupies "only" 115 Kb, thus fits Xavier> comfortably in L2 cache. Perhaps tiling isn't beneficial in Xavier> this case. True. This was a miscalculation on my part. However, the blocksize that I chose makes the blocks fit in the L1 cache, so I was still getting a measurable benefit. Xavier> - Xavier Leroy Here is what I was trying to accomplish - I am involved with a project that is trying to automate/generalize some of the tricks that ATLAS (http://math-atlas.sourceforge.net/) uses for getting maximal performance from matrix-matrix multiply (MMM). I knew about Bagley's conclusion that O'Caml was competitive with C for MMM, so I was curious how close O'Caml came to GCC on Blocked MMM. I would like to use O'Caml as a target language over C or Fortran. My conclusion, at this point, is that the current native O'Caml compiler is not going to competitive with a native C or Fortran compiler because the it does not optimize the innermost loop as aggressively. Thanks for your help. ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners