From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <berke.durak@exalead.com>
Message-ID: <4829B128.2020708@exalead.com>
Date: Tue, 13 May 2008 17:18:00 +0200
From: Berke Durak <berke.durak@exalead.com>
User-Agent: Thunderbird 1.5.0.10 (X11/20070221)
MIME-Version: 1.0
To: Robert Fischer <robert@fischerventure.com>
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Re: Why OCaml sucks
References: <200805090139.54870.jon@ffconsultancy.com>	<200805120854.45499.ober.14@osu.edu>	<200805121516.13983.jon@ffconsultancy.com>	<200805130933.13952.ober.14@osu.edu>	<48299C67.1010905@fischerventure.com>	<48299F46.5030906@janestcapital.com> <4829A207.2030601@fischerventure.com>
In-Reply-To: <4829A207.2030601@fischerventure.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Robert Fischer wrote:

> Getting back to the original question, though -- is there any evidence that Java/C# are slow because
> of unicode support, and not because of other aspects of the languages?  Because that assertion seems
> flat-out bogus to me.

I do not think the JVM is especially slow in practice.  However, one potential source of
slowness, in some particular cases, is converting between the internal string representation
(an array of 16-bit code units) and UTF-8 when calling native code.  Similarly, since Java strings
are immutable, native code cannot modify them in place, so many bindings to C libraries
end up copying strings a lot (see e.g. PCRE).

This is why the NIO API, which exposes mutable and/or externally allocated buffers, was introduced
in the JVM, but it remains hard to use.

However, it is true that regexes on UTF-8 can be quite slow.  Compare (on Linux):

/udir/durak> dd if=/dev/urandom bs=10M count=1 of=/dev/shm/z

/udir/durak> time LANG=en_US.UTF-8 grep -c "^[a-z]*$" /dev/shm/z
2.31s user 0.01s system 99% cpu 2.320 total
/udir/durak> time LANG=C grep -c "^[a-z]*$" /dev/shm/z
0.04s user 0.01s system 98% cpu 0.048 total

Lesson 1: when lexing, do not read Unicode characters one at a time.  Instead, pre-process your
regular expression according to your input encoding, so that it can be matched directly on bytes.
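
As an illustration of lesson 1, here is a minimal sketch in OCaml, assuming the input is UTF-8
and the pattern is a simple ^[class]*$ test.  The names utf8_encode, compile_class, occurs_at,
matches_all and latin_lower are made up for this example and do not come from any library.  The
character class is translated into byte strings once, up front, and the input is then scanned as
raw bytes without ever being decoded.

```ocaml
(* Encode one Unicode code point (0 .. 0x10FFFF) as a UTF-8 byte string. *)
let utf8_encode cp =
  let b = Buffer.create 4 in
  let add n = Buffer.add_char b (Char.chr n) in
  if cp < 0x80 then add cp
  else if cp < 0x800 then begin
    add (0xC0 lor (cp lsr 6));
    add (0x80 lor (cp land 0x3F))
  end else if cp < 0x10000 then begin
    add (0xE0 lor (cp lsr 12));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end else begin
    add (0xF0 lor (cp lsr 18));
    add (0x80 lor ((cp lsr 12) land 0x3F));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end;
  Buffer.contents b

(* "Compile" a character class, given as a list of code points, into the
   byte strings that encode its members.  This is done once per pattern. *)
let compile_class cps = List.map utf8_encode cps

(* Does the byte string [p] occur in [s] at byte offset [i]? *)
let occurs_at s i p =
  let n = String.length p in
  i + n <= String.length s && String.sub s i n = p

(* Equivalent of ^[class]*$ on raw bytes: the input is never decoded. *)
let matches_all cls s =
  let len = String.length s in
  let rec go i =
    if i = len then true
    else
      match List.find_opt (occurs_at s i) cls with
      | Some p -> go (i + String.length p)
      | None -> false
  in
  go 0

(* Hypothetical class for the example: a-z plus e-acute (U+00E9). *)
let latin_lower =
  compile_class (0xE9 :: List.init 26 (fun i -> Char.code 'a' + i))
```

With these definitions, matches_all latin_lower "caf\xc3\xa9" accepts the UTF-8 spelling of
"café", while the Latin-1 spelling "caf\xe9" is rejected.  A real lexer generator would compile
the class into a byte-level automaton instead of searching a list, but the principle is the same.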

That being said, I think strings should be represented as they are today, and that the core
OCaml libraries do not have much business dealing with UTF-8.  We seldom need character-indexed
random access to strings.

However, the time is ripe for throwing out old 8-bit charsets such as ISO-8859-x (a.k.a. Latin-y)
and whatnot.  This considerably simplifies lesson 1:  it's either ASCII or UTF-8 Unicode.  I think
the OCaml lexer should simply treat any byte with its high bit set as a lowercase letter.
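
A minimal sketch of what that rule could look like, assuming UTF-8 source text.  This is not the
actual OCaml lexer; the type cls and the functions classify and ident_length are hypothetical
names chosen for the example.  Every byte in the range 128..255 is classified as a lowercase
letter, so multi-byte UTF-8 sequences pass through identifiers without being decoded.

```ocaml
(* Byte classes used by this toy identifier scanner. *)
type cls = Lower | Upper | Digit | Other

(* Classify a single byte.  Any byte with its high bit set is treated as a
   lowercase letter, whatever character it belongs to. *)
let classify c =
  match c with
  | 'a' .. 'z' | '_' -> Lower
  | 'A' .. 'Z' -> Upper
  | '0' .. '9' -> Digit
  | '\128' .. '\255' -> Lower
  | _ -> Other

(* Length in bytes of the lowercase identifier starting at offset [i] of [s],
   or 0 if no such identifier starts there. *)
let ident_length s i =
  let len = String.length s in
  let is_start c = classify c = Lower in
  let is_cont c =
    match classify c with
    | Lower | Upper | Digit -> true
    | Other -> c = '\''
  in
  if i >= len || not (is_start s.[i]) then 0
  else begin
    let j = ref (i + 1) in
    while !j < len && is_cont s.[!j] do incr j done;
    !j - i
  end
```

For example, ident_length "\xc3\xa9t\xc3\xa9 = 1" 0 returns 5, the byte length of "été" in
UTF-8, while ident_length "Foo" 0 returns 0 because the first byte is an uppercase letter.  The
scanner never checks that the high-bit bytes form valid UTF-8; that is the price of the simple rule.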
-- 
Berke DURAK