From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=disabled version=3.1.3 Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by yquem.inria.fr (Postfix) with ESMTP id 1BB35BC0A for ; Mon, 26 Feb 2007 03:22:53 +0100 (CET) Received: from ipmail01.adl2.internode.on.net (ipmail01.adl2.internode.on.net [203.16.214.140]) by concorde.inria.fr (8.13.6/8.13.6) with ESMTP id l1Q2MmmV012035 for ; Mon, 26 Feb 2007 03:22:51 +0100 Received: from ppp41-119.lns2.syd6.internode.on.net (HELO rosella) ([59.167.41.119]) by ipmail01.adl2.internode.on.net with ESMTP; 26 Feb 2007 12:52:40 +1030 X-IronPort-AV: i="4.14,217,1170595800"; d="scan'208"; a="93557126:sNHT5556568430" Subject: amd64 code gen From: skaller To: caml-list@yquem.inria.fr Cc: felix-impl Content-Type: text/plain Date: Mon, 26 Feb 2007 13:22:36 +1100 Message-Id: <1172456556.18471.36.camel@rosella.wigram> Mime-Version: 1.0 X-Mailer: Evolution 2.6.1 Content-Transfer-Encoding: 7bit X-Miltered: at concorde with ID 45E24478.001 by Joe's j-chkmail (http://j-chkmail . ensmp . fr)! X-Spam: no; 0.00; ocaml:01 gcc:01 gcc:01 ocamlopt:01 flags:01 ocamlopt:01 exes:01 exes:01 ocaml:01 elided:01 decl:01 986:01 986:01 ifnot:01 ifnot:01 Has anything changed in ocaml amd64 code generator? I'm using this: Objective Caml version 3.09.3+rc1 gcc (GCC) 4.0.3 (Ubuntu 4.0.3-1ubuntu5) GNAT 4.0.3 (Ubuntu 4.0.3-1ubuntu5) Felix 1.1.3_rc1 gcc uses: gcc -O3 -fomit-frame-pointer -funroll-loops Felix uses: -O3 -fomit-frame-pointer --inline Ocamlopt isn't given any special flags. My latest performance test shows: Rankings for ack on rosella felix 15 162.46 [N= 1 SD= 0%] felix 14 31.25 [N= 4 SD= 0%] gccopt 14 45.42 [N= 7 SD= 0%] ocamlopt 14 74.03 [N= 6 SD= 0%] felix 13 5.31 [N= 9 SD= 0%] gnat 13 9.68 [N= 11 SD= 0%] gccopt 13 9.78 [N= 10 SD= 0%] ocamlopt 13 13.73 [N= 11 SD= 0%] felix 12 1.22 [N= 21 SD= 0%] gccopt 12 2.37 [N= 22 SD= 0%] The second column is the y in ack(x,y), the third the time in seconds on my box, N is the number of samples, and SD (is supposed to be) the standard deviation. Felix always outperformed other languages on this test, possibly due to some lucky combination. A quick check that something in my test harness isn't going astray: skaller@rosella:/work/felix/svn/felix/felix/trunk$ time speed/exes/felix/ack 13 Ack(3,13): 65533 real 0m5.256s user 0m5.244s sys 0m0.012s skaller@rosella:/work/felix/svn/felix/felix/trunk$ time speed/exes/ocamlopt/ack 13 Ack(3,13): 65533 real 0m13.664s user 0m13.625s sys 0m0.040s Felix is more that twice as fast as Ocamlopt, and trashes gcc as well. The thing is, I haven't done any significant work on the Felix optimiser. But the last time I checked, whilst Felix did beat both gcc and Ocaml, it was only by a reasonable margin: 2-10% or something. I may have pointed out the graphs (but those old graphs are gone with the latest upload). The actual C++ code generated by Felix is this (with comments elided for clarity). int FLX_REGPARM ack(FLX_APAR_DECL int x, int y){ _us2 ack_mv_986; _us2 ack_mv_996; start_5803:; ack_mv_986 = x == 0 ; ifnot(ack_mv_986==1) goto _5798; return y + 1 ; _5798:; ack_mv_996 = y == 0 ; ifnot(ack_mv_996==1) goto _5799; y = 1; x = x - 1 ; goto start_5803; _5799:; y = ack(FLX_FPAR_PASS x, y - 1 ); x = x - 1 ; goto start_5803; } C is using: int Ack(int M, int N) { if (M==0) return N +1; else if(N==0) return Ack(M-1,1); else return Ack(M-1, Ack(M,N-1)); } and Ocaml is using: let rec ack m n = match m,n with | 0,n -> n + 1 | m,0 -> ack (m - 1) 1 | m,n -> ack (m - 1) (ack m (n - 1)) ;; What REALLY worries me is that Felix has this: int FLX_REGPARM ack(FLX_APAR_DECL int x, int y){ Well that FLX_APAR_DECL is passing a pointer to the global object .. which is not in fact used .. that's a BUG in the optimiser (it did not previously do this!) which WOULD significantly slow the Felix code down if it were actually passed -- it actually looks like gcc is optimising this argument away. Performance on this test, given reasonable optimisations such as self-tail rec calls (and in Felix case proper inlining of typeclass virtual functions .. which I mention because for the Takfp test it is NOT being done ..) is dependent entirely on the number of words of machine stack used to implement the non-tail function call. The minimal number of words is 3 given a dumb optimiser. (It can actually be reduced to 1 in this case through a couple of clever optimisations). Loop unrolling by 3 reduces the number of return addresses to 1/3 per call giving 7/9 performance improvement. -- John Skaller Felix, successor to C++: http://felix.sf.net