From: Berke Durak
Date: Tue, 13 May 2008 17:18:00 +0200
To: Robert Fischer
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Re: Why OCaml sucks
Message-ID: <4829B128.2020708@exalead.com>
In-Reply-To: <4829A207.2030601@fischerventure.com>

Robert Fischer wrote:
> Getting back to the original question, though -- is there any evidence that Java/C# are slow because
> of unicode support, and not because of other aspects of the languages?
> Because that assertion seems flat-out bogus to me.

I do not think the JVM is especially slow in practice. However, in some
particular cases, one potential source of slowness is the conversion between
the internal short-array-based string representation and UTF8 when calling
native code. Similarly, because Java strings are immutable, native code cannot
modify them in place, so many bindings to C libraries end up duplicating
strings a lot (see e.g. PCRE). This is why the NIO API, which exposes mutable
and/or externally allocated buffers, was introduced in the JVM, but it remains
hard to use.

However, it is true that regexes on UTF8 can be quite slow. Compare (on Linux):

  /udir/durak> dd if=/dev/urandom bs=10M count=1 of=/dev/shm/z
  /udir/durak> time LANG=en_US.UTF-8 grep -c "^[a-z]*$" /dev/shm/z
  2.31s user 0.01s system 99% cpu 2.320 total
  /udir/durak> time LANG=C grep -c "^[a-z]*$" /dev/shm/z
  0.04s user 0.01s system 98% cpu 0.048 total

Lesson 1: when lexing, do not read unicode chars one at a time. Instead,
pre-process your regular expression according to your input encoding.

That being said, I think strings should be represented as they are today, and
the core OCaml libraries do not have much business dealing with UTF8. We seldom
need letter-indexed random access to strings.

However, the time is ripe for throwing out old 8-bit charsets such as
ISO-8859-x (a.k.a. Latin-y) and whatnot. This considerably simplifies lesson 1:
the input is either ASCII or UTF8 Unicode.

I think the OCaml lexer should simply treat any byte with its high bit set as a
lowercase letter.

--
Berke DURAK
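
P.S. To make the last suggestion concrete, here is a minimal OCaml sketch of a
byte-level identifier scanner that treats any byte with its high bit set as a
lowercase letter, so UTF8 identifiers are accepted without decoding a single
code point. The names is_lower_byte, is_ident_byte and scan_lident are
hypothetical and are not taken from the OCaml compiler sources; counting '_'
as lowercase and the exact set of continuation characters are assumptions made
for the example.

```ocaml
(* Hypothetical helpers for illustration only; not the actual OCaml lexer. *)

(* A byte is "lowercase" if it is ASCII a-z, an underscore (assumption),
   or has its high bit set, i.e. belongs to a multi-byte UTF8 sequence. *)
let is_lower_byte (c : char) : bool =
  (c >= 'a' && c <= 'z') || c = '_' || Char.code c >= 0x80

(* Bytes allowed after the first one: lowercase bytes as defined above,
   ASCII uppercase letters, digits, and the quote character. *)
let is_ident_byte (c : char) : bool =
  is_lower_byte c
  || (c >= 'A' && c <= 'Z')
  || (c >= '0' && c <= '9')
  || c = '\''

(* Length in bytes of the lowercase identifier starting at [pos] in [s],
   or 0 if no such identifier starts there. Works one byte at a time and
   never decodes Unicode code points. *)
let scan_lident (s : string) (pos : int) : int =
  let n = String.length s in
  if pos < 0 || pos >= n || not (is_lower_byte s.[pos]) then 0
  else begin
    let i = ref (pos + 1) in
    while !i < n && is_ident_byte s.[!i] do
      incr i
    done;
    !i - pos
  end
```

With this rule the scanner never needs to know where one UTF8 character ends
and the next begins. The trade-off is that it does not validate the encoding
and cannot tell letters apart from other non-ASCII characters.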