From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <robert@fischerventure.com>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled 
	version=3.1.3
Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104])
	by yquem.inria.fr (Postfix) with ESMTP id 65A7BBBCA
	for <caml-list@yquem.inria.fr>; Tue, 13 May 2008 16:13:30 +0200 (CEST)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AiwCAA0/KUjQccgFf2dsb2JhbACSEAEBCwUCBgcRmlQ
X-IronPort-AV: E=Sophos;i="4.27,479,1204498800"; 
   d="scan'208";a="12537841"
Received: from lax-green-bigip-5.dreamhost.com (HELO looneymail-a6.g.dreamhost.com) ([208.113.200.5])
  by mail3-smtp-sop.national.inria.fr with ESMTP; 13 May 2008 16:13:29 +0200
Received: from Robert-Fischers-Smokejumping-MacBook-Pro.local (unknown [64.241.37.140])
	by looneymail-a6.g.dreamhost.com (Postfix) with ESMTP id 7F9D09ED56
	for <caml-list@yquem.inria.fr>; Tue, 13 May 2008 07:13:29 -0700 (PDT)
Message-ID: <4829A207.2030601@fischerventure.com>
Date: Tue, 13 May 2008 09:13:27 -0500
From: Robert Fischer <robert@fischerventure.com>
User-Agent: Thunderbird 2.0.0.14 (Macintosh/20080421)
MIME-Version: 1.0
To: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Re: Why OCaml sucks
References: <200805090139.54870.jon@ffconsultancy.com>	<200805120854.45499.ober.14@osu.edu>	<200805121516.13983.jon@ffconsultancy.com>	<200805130933.13952.ober.14@osu.edu> <48299C67.1010905@fischerventure.com> <48299F46.5030906@janestcapital.com>
In-Reply-To: <48299F46.5030906@janestcapital.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Spam: no; 0.00; ocaml:01 parsers:01 moderately:01 wikipedia:01 wikipedia:01 wiki:01 parsers:01 assertion:01 lexical:01 caml-list:01 writting:01 data:02 expression:02 finite:02 finite:02 

> The problem, as I understand it, is in writting parsers.  Your standard
> finite automata based regular expression library or lexical analyzer is
> based, at it's heart, on a table lookup- you have a 2D array, whose size
> is the number of input characters times the number of states.  For ASCII
> input, the number of possible input characters is small- 256 at most. 
> 256 input characters times hundreds of states isn't that big of a table-
> we're looking at sizes in 10's of K- easily handlable even in the bad
> old days of 64K segments.  Even going to UTF-16 ups the number of input
> characters from 256 to 65,536- and now a moderately large state machine
> (hundreds of states) weighs in at tens of megabytes of table space. 
> And, of course, if you try to handle the entire 31-bit full unicode
> point space, welcome to really large tables :-).
> 
> The solution, I think, is to change the implementation of your finite
> automata to use some data structure smarter than a flat 2D array, but
> that's me.
> 
Yes.  It is certainly possible to write slow code to solve this problem.

A slightly more involved analysis is probably in order, so let's ask Wikipedia for some more light.

http://en.wikipedia.org/wiki/UTF-8#Rationale_behind_UTF-8.27s_design

As Kuba pointed out, the high bit is 0 on any ASCII characters, and the significant bits of a
multi-byte sequence determine the length of the sequence.  There are also a few large classes of bit
sequences which are simply not allowed.  Now, I'm not an expert in writing parsers, but these
qualities certainly sound like nice optimization hooks.

Getting back to the original question, though -- is there any evidence that Java/C# are slow because
of unicode support, and not because of other aspects of the languages?  Because that assertion seems
flat-out bogus to me.

~~ Robert.