From: Kuba Ober
To: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Re: Why OCaml sucks
Date: Tue, 13 May 2008 09:33:13 -0400

On Monday 12 May 2008, Jon Harrop wrote:
> On Monday 12 May 2008 13:54:45 Kuba Ober wrote:
> > > 5. Strings: pushing unicode throughout a general purpose language
> > > is a mistake, IMHO. This is why languages like Java and C# are so
> > > slow.
> >
> > Unicode by itself, when wider-than-byte encodings are used, adds
> > "zero" runtime overhead; the only overhead is storage (2 or 4 bytes
> > per character).
>
> You cannot degrade memory consumption without also degrading
> performance. Moreover, there are hidden costs such as the added
> complexity in a lexer which potentially has 256x larger dispatch
> tables or an extra indirection for every byte read.

In a typical programming language, which accepts ASCII characters only
outside of string constants, the dispatch table will be short anyway
(it covers the ASCII subset only), and there will be an extra
comparison or two, active only when lexing strings. So no biggie.
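To make that concrete, here is a minimal OCaml sketch (mine, not taken
from any real lexer; the token classes and the table contents are made
up for illustration). The table covers the ASCII range only, and the
non-ASCII case costs a single extra comparison, taken only while
lexing strings:

  type cls = Letter | Digit | Space | Quote | Other

  (* First-level dispatch: one entry per ASCII code point. *)
  let ascii_table : cls array =
    Array.init 128 (fun i ->
      let c = Char.chr i in
      if (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') then Letter
      else if c >= '0' && c <= '9' then Digit
      else if c = ' ' || c = '\t' || c = '\n' then Space
      else if c = '"' then Quote
      else Other)

  (* The "extra comparison or two": only the string-lexing path takes
     the non-ASCII branch, and it is one range check per byte. *)
  let classify ~in_string (b : char) : cls option =
    let i = Char.code b in
    if i < 128 then Some ascii_table.(i)
    else if in_string then Some Other  (* opaque byte inside a string *)
    else None                          (* lexical error outside strings *)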
> > Given that storage is cheap, I'd much rather have Unicode support
> > than lack of it.
>
> Sure. I don't mind unicode being available. I just don't want to have
> to use it myself because it is of no benefit to me (or many other
> people) but is a significant cost.

Let's look at a relatively widely deployed example: the Qt toolkit. Qt
uses a 16-bit Unicode representation, and I really doubt that there are
any runtime-measurable costs associated with it. By "runtime-measurable"
I mean that, say, application startup would take longer. A typical Qt
application does quite a bit of string manipulation on startup (even
file names are stored in Unicode and converted to/from the OS's code
page), and yet the Qt developers slashed startup time in half for
"major" applications between Qt 3 and Qt 4 by doing algorithmic
optimizations unrelated to strings (reducing the number of mallocs, for
one).

So, unless you can show that one of your applications actually runs
faster with non-Unicode strings than with well-implemented Unicode
ones, I will not really consider Unicode to be a problem.

I do agree that many tools, like lexer generators, may not be
Unicode-aware, or may have poorly implemented Unicode support. But the
256x lexer table blowup shouldn't happen even if you were implementing
APL with a fully Unicode-aware lexer. The first-level lexer table
should be split into two pieces (the ASCII and APL ranges), and
everything else is either an error or goes opaquely into string
constants (see the sketch in the P.S. below).

A lexer jump table only makes sense when it actually saves time
compared to a bunch of compare-and-jumps. On modern architectures some
jump lookup tables may actually be slower than compare-and-jumps,
because some hardware optimizations done by the CPU (branch prediction,
say) may simply ignore indirect branches through lookup tables, or only
handle the table patterns commonly generated by compilers...

Cheers, Kuba
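P.S. A rough OCaml sketch of the two-piece first-level table, in case
the idea isn't clear. The APL code-point range below is illustrative
only (I am lumping the arrows and Miscellaneous Technical blocks
together); the point is that two range checks replace a table spanning
the whole Unicode range:

  type cls = Ascii_token | Apl_symbol | String_data | Lex_error

  let apl_lo = 0x2190  (* illustrative lower bound: arrows block *)
  let apl_hi = 0x23ff  (* illustrative upper bound: Misc. Technical *)

  (* Dispatch on a decoded code point.  The if-chain below is exactly
     the "bunch of compare-and-jumps"; whether the ASCII piece becomes
     a real jump table is up to the compiler. *)
  let dispatch ~in_string (cp : int) : cls =
    if cp < 0x80 then Ascii_token      (* piece 1: 128-entry ASCII table *)
    else if cp >= apl_lo && cp <= apl_hi then Apl_symbol  (* piece 2 *)
    else if in_string then String_data (* opaque inside string constants *)
    else Lex_error                     (* everything else is an error *)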