From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.4 required=5.0 tests=AWL,DNS_FROM_RFC_ABUSE autolearn=disabled version=3.1.3 Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82]) by yquem.inria.fr (Postfix) with ESMTP id DE70CBC6B for ; Thu, 10 Jan 2008 17:36:53 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ah4FAIjYhUdDWxLC/2dsb2JhbACBVqgj X-IronPort-AV: E=Sophos;i="4.24,267,1196636400"; d="scan'208";a="6504327" Received: from ip67-91-18-194.z18-91-67.customer.algx.net (HELO server1.bertec.net) ([67.91.18.194]) by mail1-smtp-roc.national.inria.fr with ESMTP; 10 Jan 2008 17:36:52 +0100 Received: from kuba.bertec.net (kuba.bertec.net [192.168.2.16]) by server1.bertec.net (Postfix) with ESMTP id E5C12105830 for ; Thu, 10 Jan 2008 11:36:48 -0500 (EST) From: Kuba Ober To: caml-list@yquem.inria.fr Subject: Re: [Caml-list] ocaml doesn't need to optimize on amd64?? Date: Thu, 10 Jan 2008 11:36:47 -0500 User-Agent: KMail/1.9.6 (enterprise 0.20071123.740460) References: <200801090922.00231.ober.14@osu.edu> <200801100856.16411.ober.14@osu.edu> <47862F53.1050605@janestcapital.com> In-Reply-To: <47862F53.1050605@janestcapital.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200801101136.48170.ober.14@osu.edu> X-Spam: no; 0.00; ocaml:01 ocaml:01 gcc:01 intel's:01 caching:01 recompiled:01 opcode:01 cheers:01 wrote:01 wrote:01 stack:01 caml-list:01 short:01 emulate:01 data:02 On Thursday 10 January 2008, you wrote: > Kuba Ober wrote: > >On Wednesday 09 January 2008, you wrote: > >>On Wed, Jan 09, 2008 at 09:22:00AM -0500, Kuba Ober wrote: > >>>Jon & al, > >>> > >>>why do you think that OCaml doesn't need to do certain > >>>optimizations on amd64? Or does it apply only to 64 bit mode? > >>>I run my benchmarks on amd64 (in 32 bit mode) and OCaml is worse > >>>off than gcc. > >> > >>Register pressure. The extra eight registers in AMD64 make a huge > >>difference to a lot of code generators. Of course, to access them > >>you need to run in 64 bit mode. > > > >Don't current x32 processors "emulate" extra registers anyway? I don't > > know what's the preferred way of telling the on-chip code dissector that > > you intend the data to say in virtual registers, but it must be something > > simple, like common, fixed memory locations or stack locations accessed > > in a certain way. I'm sure if you google on Intel's site you'll find it. > > In any event, the x32 chips have a notion of "many" virtual registers, > > it's just the old x32 opcodes that don't have it. Isn't it so? > > Short answer: no. > > The problem is that the code generated code needs to spill and fill > registers on 32-bit x86 a lot more. The extra registers are usefull in > reordering instructions, and executing things in parallel (at the > instruction level), but they can't avoid the memory accesses. Now, > there are a lot of tricks the CPUs play in order to reduce the costs of > these memory accesses, but they can't eliminate the memory accesses. > And that's the problem: you've now increases the memory pressure of the > CPUs. If a thread keeps those "virtual registers" in an isolated memory page that's accessed by nothing else, the CPU's caching will barely need to access the memory, as there won't be pressure from other cores or CPUs in the system to actually use that memory for anything. It will only get evicted to RAM if the cache pressure is high. The "spill and fill" problem is really only appropriate in the context of the x86 registers actually existing somewhere. As far as I know, they are a virtual concept, there's no piece of silicon in modern x86 CPUs that can be called "the AX register". The actual opcodes referencing ax, bx, ... registers is recompiled on the fly by the CPU into a RISC-like code running on many more registers. Sure, the code rewriter has to work harder when there's more opcodes to move stuff around, but I think that it's implemented to actually recognize certain opcode patterns as "virtual register accesses". I don't have any hard references on hand, but that's my impression so far. I can try and dig up relevant Intel references. Cheers, Kuba