From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id VAA02687; Fri, 13 Jun 2003 21:07:56 +0200 (MET DST) X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id VAA03101 for ; Fri, 13 Jun 2003 21:07:54 +0200 (MET DST) Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35]) by concorde.inria.fr (8.11.1/8.11.1) with ESMTP id h5DJ7pH22380; Fri, 13 Jun 2003 21:07:51 +0200 (MET DST) Received: (from xleroy@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id VAA02791; Fri, 13 Jun 2003 21:07:51 +0200 (MET DST) Date: Fri, 13 Jun 2003 21:07:51 +0200 From: Xavier Leroy To: David McClain Cc: caml-list@inria.fr Subject: Re: [Caml-list] FP's and HyperThreading Processors Message-ID: <20030613210751.A2264@pauillac.inria.fr> References: <003601c33177$324ecc40$0201a8c0@dylan> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: <003601c33177$324ecc40$0201a8c0@dylan>; from dmcclain1@mindspring.com on Thu, Jun 12, 2003 at 11:44:09PM -0700 X-Spam: no; 0.00; caml-list:01 banks:99 misses:01 shifts:01 slower:01 smalltalk:01 alu:99 latencies:01 ocaml:01 caml:01 suffice:01 mid:98 lisp:01 distinguish:01 mainstream:01 Sender: owner-caml-list@pauillac.inria.fr Precedence: bulk > On an old P2 single processor machine at 350 MHz, I am seeing almost 95%+ > CPU utilization. But on a new 3 GHz P4 with HyperThreading enabled (dual > register banks for fast context switching and minimum cache coherence > overhead), this same program provides much less than 5% CPU utilization. How do you measure "CPU utilization"? Different tools have different notions of CPU utilization. For instance, OS-level tools such as "top", "ps" or "uptime" treat the time spent by the CPU while stalled by cache misses and the like as user CPU time, not as idle time. Only processor-level performance monitor counters can distinguish between active cycles and stalled cycles. > The net result is that this program runs only twice as fast on the > new 3 GHz P4 as it runs on the old 350 MHz P2. Some operations take more cycles on the P4 than on the PPro/P2/P3, e.g. shifts, integer multiplication and division. I saw some programs run slightly slower on a 2 Ghz P4 than on a 1 Ghz P3. But that doesn't suffice to explain the factor of 5 that you observed. > I suspect, but have yet to prove, that the low utilization is due to a low > CPU to memory bandwidth and to the failure of the L1 and L2 caches to supply > needed operations and data to the CPU. This, I would hypothesize, is going > to be demonstrated by any language that prefers fresh memory allocation for > results, e.g., OCaml, ML, Lisp, Smalltalk, etc. > > If I am correct, then it implies that our hardware friends are moving > rapidly in the opposite direction to our advanced software systems. I > mention this in order to tickle the imaginations of both camps. This was a concern in the early to mid 90s, when processor speed increased much more than the performances of the memory subsystems. (I remember Caml benchmarks on an early Alpha system where the processor was stalled 9 cycles out of 10.) But since then the hardware manufacturers have done a very good job at maintaining a decent balance between the memory subsystem and the ALU. In particular, out-of-order execution is quite effective at hiding cache miss latencies. My general feeling (although I have no proof of that) is that the memory usage patterns of Caml programs aren't radically different from those of more mainstream programs. - Xavier Leroy ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners