From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ober.14@osu.edu>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.4 required=5.0 tests=AWL,DNS_FROM_RFC_ABUSE 
	autolearn=disabled version=3.1.3
Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82])
	by yquem.inria.fr (Postfix) with ESMTP id D1C06BC6B
	for <caml-list@yquem.inria.fr>; Fri, 11 Jan 2008 14:14:37 +0100 (CET)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Aq4HACr6hkdDWxLC/2dsb2JhbACBVqhn
X-IronPort-AV: E=Sophos;i="4.24,271,1196636400"; 
   d="scan'208";a="6536987"
Received: from ip67-91-18-194.z18-91-67.customer.algx.net (HELO server1.bertec.net) ([67.91.18.194])
  by mail1-smtp-roc.national.inria.fr with ESMTP; 11 Jan 2008 14:14:37 +0100
Received: from kuba.bertec.net (kuba.bertec.net [192.168.2.16])
	by server1.bertec.net (Postfix) with ESMTP id EE96D105830
	for <caml-list@yquem.inria.fr>; Fri, 11 Jan 2008 08:14:35 -0500 (EST)
From: Kuba Ober <ober.14@osu.edu>
To: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] ocaml doesn't need to optimize on amd64??
Date: Fri, 11 Jan 2008 08:14:34 -0500
User-Agent: KMail/1.9.6 (enterprise 0.20071123.740460)
References: <200801090922.00231.ober.14@osu.edu> <200801091714.52294.jon@ffconsultancy.com> <200801102256.42193.jon@ffconsultancy.com>
In-Reply-To: <200801102256.42193.jon@ffconsultancy.com>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200801110814.35168.ober.14@osu.edu>
X-Spam: no; 0.00; ocaml:01 unboxing:01 arrays:01 ocaml:01 struct:01 arrays:01 buffer:01 lambda:01 abstraction:01 allocations:01 run-time:01 variants:01 c-like:01 compiler:01 high-level:01 

> Even in standalone code, unboxing arrays of complex numbers (which OCaml
> does not do) makes FFTs 5x faster. In the context of FFIs, the performance
> improvement can be much bigger because you can completely avoid the cost of
> copying huge quantities of data (e.g. color/texcoord/normal/position struct
> arrays in vertex buffer objects for OpenGL). In contrast, the overhead of
> lambda abstraction in numerical code is "only" a factor of 2.
>
> PS: Kuba, your C code will most likely run a significantly faster in 64-bit
> as well.

Yeah, but my area of interest is really embedded realtime stuff, running 
typically on architectures which are quite resource constrained. On some of 
those your typical GC wouldn't even fit in the code memory. And I'm not even 
(most of the time) using dynamic memory allocations. None of my code really 
calls for any sort of boxing -- there's no need for it. All I need is C that 
is more expressive and easier to optimize. No run-time variants, really, all 
types are known and fixed, and data is at fixed locations in memory, or on 
the stack, or occasionally on the heap which is manually managed (C-like).

Of course that pertains to the code that gets generated, because I should be 
able to use abstract concepts while writing the code. If I pass a function to 
a function, it doesn't necessarily mean that the compiler must emit the code 
for the former, and that the latter should actually call (as call machine 
instruction) the former.

And I know that my code does immensely benefit from certain high-level 
optimizations - I have implemented them only because vendor compilers lacked 
them, and I had to resort to assembly, and unfortunately most assemblers suck 
big time :( Once you get used to an assembler with LISP macros, it's hard to 
go back... Self inflicted bait-and-switch.

There's a whole slew of stuff that's impossible to do with current compiler 
architectures as long as you have your typical compile-assemble-link process. 
There have been attempts at optimizing stuff in the linkers, but to me it's 
just so much effort and code duplication - I couldn't imagine implementing 
things that way.

Most linkers can't even cope with fairly basic *assembly* - as soon as you 
have complex operations on relocatable (or out-of-module) symbols, the 
assembler has to essentially ship an AST of the expression so that the linker 
will be able to compute the needed value. I don't know how many linkers will 
actually dig ASTs shipped in object files -- none of the embedded ones I 
tried do. Many vendor assemblers won't even report warnings if you try to do 
that, and when you link you will get an incorrect binary. It's a mess, and 
makes the linker need more and more of compiler's functionality. Which begs 
the question: why have a separate linker?

Early on I have decided to do whole-program compilation and linking is just 
the last stage of code generation. Of course you can pregenerate syntax trees 
from source files, which can save a bunch of time if you were compiling a big 
project, but I didn't even have to use that. My code compiles usually in less 
than a minute, and that's mostly because I haven't really taken time to 
optimize my optimizer: what for? It'd take me weeks sometimes to hand-write 
the assembly and debug it, waiting a minute for a compilation of a small 
10kloc code base isn't a big deal. And it lets me have a very nice and 
understandable 10kloc code base, and a hopefully not much worse compiler. Not 
having function inlining, tail call removal and polymorphism (among others) 
at the source code level is a big deal - that's why I abhor C in general, and 
vendor C implementations even more so.

Cheers, Kuba