From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ober.14@osu.edu>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.4 required=5.0 tests=AWL,DNS_FROM_RFC_ABUSE 
	autolearn=disabled version=3.1.3
Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82])
	by yquem.inria.fr (Postfix) with ESMTP id DE70CBC6B
	for <caml-list@yquem.inria.fr>; Thu, 10 Jan 2008 17:36:53 +0100 (CET)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Ah4FAIjYhUdDWxLC/2dsb2JhbACBVqgj
X-IronPort-AV: E=Sophos;i="4.24,267,1196636400"; 
   d="scan'208";a="6504327"
Received: from ip67-91-18-194.z18-91-67.customer.algx.net (HELO server1.bertec.net) ([67.91.18.194])
  by mail1-smtp-roc.national.inria.fr with ESMTP; 10 Jan 2008 17:36:52 +0100
Received: from kuba.bertec.net (kuba.bertec.net [192.168.2.16])
	by server1.bertec.net (Postfix) with ESMTP id E5C12105830
	for <caml-list@yquem.inria.fr>; Thu, 10 Jan 2008 11:36:48 -0500 (EST)
From: Kuba Ober <ober.14@osu.edu>
To: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] ocaml doesn't need to optimize on amd64??
Date: Thu, 10 Jan 2008 11:36:47 -0500
User-Agent: KMail/1.9.6 (enterprise 0.20071123.740460)
References: <200801090922.00231.ober.14@osu.edu> <200801100856.16411.ober.14@osu.edu> <47862F53.1050605@janestcapital.com>
In-Reply-To: <47862F53.1050605@janestcapital.com>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="utf-8"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200801101136.48170.ober.14@osu.edu>
X-Spam: no; 0.00; ocaml:01 ocaml:01 gcc:01 intel's:01 caching:01 recompiled:01 opcode:01 cheers:01 wrote:01 wrote:01 stack:01 caml-list:01 short:01 emulate:01 data:02 

On Thursday 10 January 2008, you wrote:
> Kuba Ober wrote:
> >On Wednesday 09 January 2008, you wrote:
> >>On Wed, Jan 09, 2008 at 09:22:00AM -0500, Kuba Ober wrote:
> >>>Jon & al,
> >>>
> >>>why do you think that OCaml doesn't need to do certain
> >>>optimizations on amd64? Or does it apply only to 64 bit mode?
> >>>I run my benchmarks on amd64 (in 32 bit mode) and OCaml is worse
> >>>off than gcc.
> >>
> >>Register pressure.  The extra eight registers in AMD64 make a huge
> >>difference to a lot of code generators.  Of course, to access them
> >>you need to run in 64 bit mode.
> >
> >Don't current x32 processors "emulate" extra registers anyway? I don't
> > know what's the preferred way of telling the on-chip code dissector that
> > you intend the data to say in virtual registers, but it must be something
> > simple, like common, fixed memory locations or stack locations accessed
> > in a certain way. I'm sure if you google on Intel's site you'll find it.
> > In any event, the x32 chips have a notion of "many" virtual registers,
> > it's just the old x32 opcodes that don't have it. Isn't it so?
>
> Short answer: no.
>
> The problem is that the generated code needs to spill and fill
> registers on 32-bit x86 a lot more.  The extra registers are useful in
> reordering instructions, and executing things in parallel (at the
> instruction level), but they can't avoid the memory accesses.  Now,
> there are a lot of tricks the CPUs play in order to reduce the costs of
> these memory accesses, but they can't eliminate the memory accesses.
> And that's the problem: you've now increased the memory pressure of the
> CPUs.

If a thread keeps those "virtual registers" in an isolated memory page that 
nothing else accesses, the CPU's cache will rarely need to go out to main 
memory, since no other core or CPU in the system will contend for those 
cache lines. The page will only get evicted to RAM if the cache pressure is 
high.

The "spill and fill" problem really only applies if the x86 registers 
actually exist somewhere. As far as I know, they are a virtual concept: 
there's no piece of silicon in modern x86 CPUs that can be called "the AX 
register". The opcodes referencing the ax, bx, ... registers are recompiled 
on the fly by the CPU into RISC-like code running on many more registers. 
Sure, the code rewriter has to work harder when there are more opcodes 
moving stuff around, but I think it's implemented to actually recognize 
certain opcode patterns as "virtual register accesses".
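
To make the "spill and fill" traffic under discussion concrete, here is a toy 
sketch in Python. It is not a model of any real CPU or of OCaml's code 
generator: it assumes a register file with least-recently-used replacement, 
and the names count_memory_ops and make_trace, as well as the register counts 
6 and 14, are made up for illustration. It only counts how many loads (fills) 
and stores (spills) the architectural code has to issue; it says nothing 
about how cheaply the hardware can service them.

```python
from collections import OrderedDict


def make_trace(num_vars, iterations):
    """Build a trace where each variable is read, then written, in a loop."""
    trace = []
    for _ in range(iterations):
        for var in range(num_vars):
            trace.append((var, False))  # read the variable
            trace.append((var, True))   # write the updated value back
    return trace


def count_memory_ops(trace, num_regs):
    """Return (fills, spills) for a toy register file with LRU replacement.

    trace is a list of (variable, is_write) pairs. A fill is a load from
    memory on a read miss; a spill is a store of a modified value that gets
    pushed out of the register file.
    """
    regs = OrderedDict()  # variable -> dirty flag, least recently used first
    fills = 0
    spills = 0
    for var, is_write in trace:
        if var in regs:
            # Hit: refresh the LRU position, remember any modification.
            regs.move_to_end(var)
            regs[var] = regs[var] or is_write
            continue
        if len(regs) == num_regs:
            # Register file is full: evict the least recently used value.
            _, dirty = regs.popitem(last=False)
            if dirty:
                spills += 1
        if not is_write:
            fills += 1  # the value has to come from memory
        regs[var] = is_write
    return fills, spills


if __name__ == "__main__":
    trace = make_trace(num_vars=12, iterations=100)
    for num_regs in (6, 14):  # illustrative register counts only
        fills, spills = count_memory_ops(trace, num_regs)
        print(num_regs, "registers:", fills, "fills,", spills, "spills")
```

With twelve live values cycling through a six-entry register file, every 
read misses in this model, while fourteen entries hold the whole working set 
after the first pass. Whether those extra loads and stores are actually 
expensive on a given chip is exactly the open question in this thread.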

I don't have any hard references on hand, but that's my impression so far. I 
can try and dig up relevant Intel references.

Cheers, Kuba