From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.4 required=5.0 tests=AWL,DNS_FROM_RFC_ABUSE autolearn=disabled version=3.1.3 Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82]) by yquem.inria.fr (Postfix) with ESMTP id D1C06BC6B for ; Fri, 11 Jan 2008 14:14:37 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Aq4HACr6hkdDWxLC/2dsb2JhbACBVqhn X-IronPort-AV: E=Sophos;i="4.24,271,1196636400"; d="scan'208";a="6536987" Received: from ip67-91-18-194.z18-91-67.customer.algx.net (HELO server1.bertec.net) ([67.91.18.194]) by mail1-smtp-roc.national.inria.fr with ESMTP; 11 Jan 2008 14:14:37 +0100 Received: from kuba.bertec.net (kuba.bertec.net [192.168.2.16]) by server1.bertec.net (Postfix) with ESMTP id EE96D105830 for ; Fri, 11 Jan 2008 08:14:35 -0500 (EST) From: Kuba Ober To: caml-list@yquem.inria.fr Subject: Re: [Caml-list] ocaml doesn't need to optimize on amd64?? Date: Fri, 11 Jan 2008 08:14:34 -0500 User-Agent: KMail/1.9.6 (enterprise 0.20071123.740460) References: <200801090922.00231.ober.14@osu.edu> <200801091714.52294.jon@ffconsultancy.com> <200801102256.42193.jon@ffconsultancy.com> In-Reply-To: <200801102256.42193.jon@ffconsultancy.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200801110814.35168.ober.14@osu.edu> X-Spam: no; 0.00; ocaml:01 unboxing:01 arrays:01 ocaml:01 struct:01 arrays:01 buffer:01 lambda:01 abstraction:01 allocations:01 run-time:01 variants:01 c-like:01 compiler:01 high-level:01 > Even in standalone code, unboxing arrays of complex numbers (which OCaml > does not do) makes FFTs 5x faster. In the context of FFIs, the performance > improvement can be much bigger because you can completely avoid the cost of > copying huge quantities of data (e.g. color/texcoord/normal/position struct > arrays in vertex buffer objects for OpenGL). In contrast, the overhead of > lambda abstraction in numerical code is "only" a factor of 2. > > PS: Kuba, your C code will most likely run a significantly faster in 64-bit > as well. Yeah, but my area of interest is really embedded realtime stuff, running typically on architectures which are quite resource constrained. On some of those your typical GC wouldn't even fit in the code memory. And I'm not even (most of the time) using dynamic memory allocations. None of my code really calls for any sort of boxing -- there's no need for it. All I need is C that is more expressive and easier to optimize. No run-time variants, really, all types are known and fixed, and data is at fixed locations in memory, or on the stack, or occasionally on the heap which is manually managed (C-like). Of course that pertains to the code that gets generated, because I should be able to use abstract concepts while writing the code. If I pass a function to a function, it doesn't necessarily mean that the compiler must emit the code for the former, and that the latter should actually call (as call machine instruction) the former. And I know that my code does immensely benefit from certain high-level optimizations - I have implemented them only because vendor compilers lacked them, and I had to resort to assembly, and unfortunately most assemblers suck big time :( Once you get used to an assembler with LISP macros, it's hard to go back... Self inflicted bait-and-switch. There's a whole slew of stuff that's impossible to do with current compiler architectures as long as you have your typical compile-assemble-link process. There have been attempts at optimizing stuff in the linkers, but to me it's just so much effort and code duplication - I couldn't imagine implementing things that way. Most linkers can't even cope with fairly basic *assembly* - as soon as you have complex operations on relocatable (or out-of-module) symbols, the assembler has to essentially ship an AST of the expression so that the linker will be able to compute the needed value. I don't know how many linkers will actually dig ASTs shipped in object files -- none of the embedded ones I tried do. Many vendor assemblers won't even report warnings if you try to do that, and when you link you will get an incorrect binary. It's a mess, and makes the linker need more and more of compiler's functionality. Which begs the question: why have a separate linker? Early on I have decided to do whole-program compilation and linking is just the last stage of code generation. Of course you can pregenerate syntax trees from source files, which can save a bunch of time if you were compiling a big project, but I didn't even have to use that. My code compiles usually in less than a minute, and that's mostly because I haven't really taken time to optimize my optimizer: what for? It'd take me weeks sometimes to hand-write the assembly and debug it, waiting a minute for a compilation of a small 10kloc code base isn't a big deal. And it lets me have a very nice and understandable 10kloc code base, and a hopefully not much worse compiler. Not having function inlining, tail call removal and polymorphism (among others) at the source code level is a big deal - that's why I abhor C in general, and vendor C implementations even more so. Cheers, Kuba