From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=AWL,HTML_MESSAGE autolearn=disabled version=3.1.3 Received: from discorde.inria.fr (discorde.inria.fr [192.93.2.38]) by yquem.inria.fr (Postfix) with ESMTP id 8FAE3BC6B for ; Thu, 6 Sep 2007 17:18:06 +0200 (CEST) Received: from smtp.janestcapital.com (www.janestcapital.com [66.155.124.107]) by discorde.inria.fr (8.13.6/8.13.6) with ESMTP id l86FI1m9007473 for ; Thu, 6 Sep 2007 17:18:05 +0200 Received: from [172.25.129.161] [38.96.172.125] by janestcapital.com with ESMTP (SMTPD-9.10) id AA440248; Thu, 06 Sep 2007 11:18:28 -0400 Message-ID: <46E01A27.1070207@janestcapital.com> Date: Thu, 06 Sep 2007 11:17:59 -0400 From: Brian Hurt User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.2) Gecko/20040804 Netscape/7.2 (ax) X-Accept-Language: en-us, en MIME-Version: 1.0 To: Chris King Cc: Tom , Caml-list List Subject: Re: [Caml-list] More registers in modern day CPUs References: <875c7e070709060755r1d0d099ds30a25ea78d0fd85a@mail.gmail.com> In-Reply-To: <875c7e070709060755r1d0d099ds30a25ea78d0fd85a@mail.gmail.com> Content-Type: multipart/alternative; boundary="------------040103010700080008060209" X-Miltered: at discorde with ID 46E01A29.000 by Joe's j-chkmail (http://j-chkmail . ensmp . fr)! X-Spam: no; 0.00; itanium:01 compiler:01 itanium:01 compiler:01 lowers:98 lowers:98 wrote:01 wrote:01 compilers:01 compilers:01 caml-list:01 emulate:01 emulate:01 slower:02 slower:02 This is a multi-part message in MIME format. --------------040103010700080008060209 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Chris King wrote: >On 9/6/07, Tom wrote: > > >>However, would it be possible to "emulate" cpu registers using software? By >>keeping registers in the main memory, but accessing them often enough to >>keep them in primary cache? That would be quite fast I believe... >> >> > >This makes me wonder... why have registers to begin with? I wonder >how feasible a chip with a, say, 256-byte "register-level" cache would >be. > > Such chips exist. The Itanium is one example. The problem is gate delays. The purpose of registers is to be faster than L1 cache (which typically has a 2-3 clock delay associated with it). But the more registers you have, the more gate delays you need to read or write registers- the naive implementation takes O(log N) gate delays to access O(N) registers- reality is more complicated than this. But the rule more registers = more gate delays holds true. And these gate delays translate into a slower chip (one way or another- either you have to lower your clock rate or add more pipeline stages or both to deal with the larger register cache). Of course, more registers make compilers happy, and lowers pressure on the cache bandwidth (as the compiler doesn't need to spill/refill registers quite so often). This is why the 64-bit x86 is generally faster than the 32-bit x86- going from 8 (6 in practice) to 16 (14 in practice) registers was a big step up. The Itanium has a large enough register set that it's performance is probably getting hurt by it, but it's hard to tell with the everything else going on. The sweet spot for register sets seems to be in the 16-64 range- less than that, and you're being hurt by the increased memory pressure, more than that and you're probably being hurt by the slower register addressing. Brian --------------040103010700080008060209 Content-Type: text/html; charset=us-ascii Content-Transfer-Encoding: 7bit Chris King wrote:
On 9/6/07, Tom <tom.primozic@gmail.com> wrote:
  
However, would it be possible to "emulate" cpu registers using software? By
keeping registers in the main memory, but accessing them often enough to
keep them in primary cache? That would be quite fast I believe...
    

This makes me wonder... why have registers to begin with?  I wonder
how feasible a chip with a, say, 256-byte "register-level" cache would
be.
  
Such chips exist.  The Itanium is one example.

The problem is gate delays.  The purpose of registers is to be faster than L1 cache (which typically has a 2-3 clock delay associated with it).  But the more registers you have, the more gate delays you need to read or write registers- the naive implementation takes O(log N) gate delays to access O(N) registers- reality is more complicated than this.  But the rule more registers = more gate delays holds true.  And these gate delays translate into a slower chip (one way or another- either you have to lower your clock rate or add more pipeline stages or both to deal with the larger register cache).  Of course, more registers make compilers happy, and lowers pressure on the cache bandwidth (as the compiler doesn't need to spill/refill registers quite so often).  This is why the 64-bit x86 is generally faster than the 32-bit x86- going from 8 (6 in practice) to 16 (14 in practice) registers was a big step up.  The Itanium has a large enough register set that it's performance is probably getting hurt by it, but it's hard to tell with the everything else going on.

The sweet spot for register sets seems to be in the 16-64 range- less than that, and you're being hurt by the increased memory pressure, more than that and you're probably being hurt by the slower register addressing.

Brian
--------------040103010700080008060209--