From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <skaller@users.sourceforge.net>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=disabled 
	version=3.1.3
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39])
	by yquem.inria.fr (Postfix) with ESMTP id 1BB35BC0A
	for <caml-list@yquem.inria.fr>; Mon, 26 Feb 2007 03:22:53 +0100 (CET)
Received: from ipmail01.adl2.internode.on.net (ipmail01.adl2.internode.on.net [203.16.214.140])
	by concorde.inria.fr (8.13.6/8.13.6) with ESMTP id l1Q2MmmV012035
	for <caml-list@yquem.inria.fr>; Mon, 26 Feb 2007 03:22:51 +0100
Received: from ppp41-119.lns2.syd6.internode.on.net (HELO rosella) ([59.167.41.119])
  by ipmail01.adl2.internode.on.net with ESMTP; 26 Feb 2007 12:52:40 +1030
X-IronPort-AV: i="4.14,217,1170595800"; 
   d="scan'208"; a="93557126:sNHT5556568430"
Subject: amd64 code gen
From: skaller <skaller@users.sourceforge.net>
To: caml-list@yquem.inria.fr
Cc: felix-impl <felix-impl@lists.sourceforge.net>
Content-Type: text/plain
Date: Mon, 26 Feb 2007 13:22:36 +1100
Message-Id: <1172456556.18471.36.camel@rosella.wigram>
Mime-Version: 1.0
X-Mailer: Evolution 2.6.1 
Content-Transfer-Encoding: 7bit
X-Miltered: at concorde with ID 45E24478.001 by Joe's j-chkmail (http://j-chkmail . ensmp . fr)!
X-Spam: no; 0.00; ocaml:01 gcc:01 gcc:01 ocamlopt:01 flags:01 ocamlopt:01 exes:01 exes:01 ocaml:01 elided:01 decl:01 986:01 986:01 ifnot:01 ifnot:01 

Has anything changed in ocaml amd64 code generator?
I'm using this:  

Objective Caml version 3.09.3+rc1
gcc (GCC) 4.0.3 (Ubuntu 4.0.3-1ubuntu5)
GNAT 4.0.3 (Ubuntu 4.0.3-1ubuntu5)
Felix 1.1.3_rc1

gcc uses: gcc -O3 -fomit-frame-pointer -funroll-loops
Felix uses: -O3 -fomit-frame-pointer --inline 
Ocamlopt isn't given any special flags.


My latest performance test shows:

Rankings for ack on rosella
    felix            15 162.46 [N=  1 SD= 0%]
    felix            14  31.25 [N=  4 SD= 0%]
    gccopt           14  45.42 [N=  7 SD= 0%]
    ocamlopt         14  74.03 [N=  6 SD= 0%]
    felix            13   5.31 [N=  9 SD= 0%]
    gnat             13   9.68 [N= 11 SD= 0%]
    gccopt           13   9.78 [N= 10 SD= 0%]
    ocamlopt         13  13.73 [N= 11 SD= 0%]
    felix            12   1.22 [N= 21 SD= 0%]
    gccopt           12   2.37 [N= 22 SD= 0%]


The second column is the y in ack(x,y), the third
the time in seconds on my box, N is the number of samples,
and SD (is supposed to be) the standard deviation.

Felix always outperformed other languages on this test,
possibly due to some lucky combination. A quick check
that something in my test harness isn't going astray:

skaller@rosella:/work/felix/svn/felix/felix/trunk$ time
speed/exes/felix/ack 13
Ack(3,13): 65533

real    0m5.256s
user    0m5.244s
sys     0m0.012s
skaller@rosella:/work/felix/svn/felix/felix/trunk$ time
speed/exes/ocamlopt/ack 13
Ack(3,13): 65533

real    0m13.664s
user    0m13.625s
sys     0m0.040s

Felix is more that twice as fast as Ocamlopt, and trashes gcc
as well. The thing is, I haven't done any significant work on the Felix 
optimiser.

But the last time I checked, whilst Felix did beat both gcc
and Ocaml, it was only by a reasonable margin: 2-10% or something.
I may have pointed out the graphs (but those old graphs are gone
with the latest upload). 

The actual C++ code generated by Felix is this
(with comments elided for clarity).

int FLX_REGPARM ack(FLX_APAR_DECL  int x, int y){
  _us2 ack_mv_986;
  _us2 ack_mv_996;
    start_5803:;
      ack_mv_986 = x == 0 ;
      ifnot(ack_mv_986==1) goto _5798;
      return y + 1 ;
    _5798:;
      ack_mv_996 = y == 0 ;
      ifnot(ack_mv_996==1) goto _5799;
      y = 1;
      x = x - 1 ;
      goto start_5803;
    _5799:;
      y = ack(FLX_FPAR_PASS x, y - 1 );
      x = x - 1 ;
      goto start_5803;
}

C is using:

int Ack(int M, int N) {
  if (M==0) return N +1;
  else if(N==0) return Ack(M-1,1);
  else return Ack(M-1, Ack(M,N-1));
}

and Ocaml is using:

let rec ack m n = match m,n with
  | 0,n -> n + 1
  | m,0 -> ack (m - 1) 1
  | m,n -> ack (m - 1) (ack m (n - 1))
;;


What REALLY worries me is that Felix has this:

int FLX_REGPARM ack(FLX_APAR_DECL  int x, int y){

Well that FLX_APAR_DECL is passing a pointer to the
global object .. which is not in fact used .. that's
a BUG in the optimiser (it did not previously do this!)
which WOULD significantly slow the Felix code down
if it were actually passed -- it actually looks like gcc
is optimising this argument away.

Performance on this test, given reasonable optimisations
such as self-tail rec calls (and in Felix case proper
inlining of typeclass virtual functions .. which I mention
because for the Takfp test it is NOT being done ..)
is dependent entirely on the number of words of machine
stack used to implement the non-tail function call.

The minimal number of words is 3 given a dumb optimiser.
(It can actually be reduced to 1 in this case through a couple
of clever optimisations).

Loop unrolling by 3 reduces the number of return addresses to 1/3
per call giving 7/9 performance improvement.

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net