From: Brian Hurt
Date: Tue, 13 May 2008 10:01:42 -0400
To: Robert Fischer
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Re: Why OCaml sucks

Robert Fischer wrote:
>>>>> 5. Strings: pushing unicode throughout a general purpose language is a
>>>>> mistake, IMHO. This is why languages like Java and C# are so slow.
>>>>
>>>> Unicode by itself, when wider-than-byte encodings are used, adds "zero"
>>>> runtime overhead; the only overhead is storage (2 or 4 bytes per
>>>> character).
>>>
>>> You cannot degrade memory consumption without also degrading performance.
>>> Moreover, there are hidden costs such as the added complexity in a lexer
>>> which potentially has 256x larger dispatch tables or an extra indirection
>>> for every byte read.

> Okay, I was going to let this slide, but it kept resurfacing and annoying me.
>
> Is there any empirical support for the assertion that Java and C# are slow
> because of *unicode*? Of all things, *unicode*? The fact that they're
> bytecode languages isn't a bigger hit? At least with the JVM, the
> hypercomplicated GC should probably take some of the blame, too -- I've seen
> 2x speed increases by *reducing* the space available to the GC, and 10x
> speed increases by boosting the space available to ridiculous levels so that
> the full GC barely ever has to fire. The nigh-universal optimization-ruining
> mutable data and virtual function (i.e., method) dispatch I'm sure doesn't
> help either. And this is to say nothing of user-space problems like the
> explosion of nontrivial types associated with the object-driven style. With
> all that going on, you're blaming their *Unicode support* for why they're
> slow? "This is why languages like Java and C# are so slow." Really? Got
> evidence for that?
>
> ~~ Robert.

The problem, as I understand it, is in writing parsers.  Your standard finite-automaton-based regular expression library or lexical analyzer is based, at its heart, on a table lookup: you have a 2D array whose size is the number of input characters times the number of states.  For ASCII input, the number of possible input characters is small: at most 256 (one byte).  256 input characters times hundreds of states isn't that big a table; we're looking at sizes in the tens of kilobytes, easily handled even in the bad old days of 64K segments.  Even going to UTF-16 ups the number of input characters from 256 to 65,536, and now a moderately large state machine (hundreds of states) weighs in at tens of megabytes of table space.  And, of course, if you try to handle the entire 31-bit Unicode code point space, welcome to really large tables :-).
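
For concreteness, here's a minimal sketch of the flat-table automaton I mean; every name in it (dfa, trans, step, accepts) is made up for illustration, not from any real library.  The inner loop is a single array index per input symbol, which is exactly why the flat layout is attractive, and why its footprint is states times alphabet:

    (* Flat table-driven DFA: one array read per input symbol. *)
    type dfa = {
      alphabet_size : int;        (* 256 for bytes; 65_536 for UTF-16 code units *)
      trans         : int array;  (* num_states * alphabet_size entries; -1 = reject *)
      accepting     : bool array; (* indexed by state *)
    }

    (* The whole per-character cost: one multiply, one add, one load. *)
    let step d state sym =
      d.trans.((state * d.alphabet_size) + sym)

    (* Run the automaton from state 0 over a list of input symbols. *)
    let accepts d syms =
      let final =
        List.fold_left
          (fun st sym -> if st < 0 then st else step d st sym)
          0 syms
      in
      final >= 0 && d.accepting.(final)

The arithmetic behind the sizes above, assuming entries packed into 16 bits as a typical lexer generator would emit: a 100-state byte lexer needs 100 * 256 * 2 bytes, about 50 KB, while a 300-state machine over UTF-16 code units needs 300 * 65,536 * 2 bytes, about 39 MB.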

The solution, I think, is to change the implementation of your finite automaton to use some data structure smarter than a flat 2D array, but that's me.
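
Here's a sketch of one such structure, equivalence-class compression; again, every name is hypothetical.  A typical lexer only distinguishes a few dozen character categories, so you first map each code point to a small class id through a two-level page table, and the transition table shrinks from states * 65,536 down to states * num_classes:

    (* DFA with the alphabet compressed into equivalence classes. *)
    type compressed_dfa = {
      num_classes : int;
      class_pages : int array array;  (* 256 pages of 256 class ids;
                                         identical pages can share one array *)
      trans       : int array;        (* num_states * num_classes entries *)
      accepting   : bool array;
    }

    (* Two array reads map a 16-bit code unit to its character class. *)
    let class_of d cp = d.class_pages.(cp lsr 8).(cp land 0xff)

    (* Same single-lookup step as before, just over a tiny alphabet. *)
    let step_compressed d state cp =
      d.trans.((state * d.num_classes) + class_of d cp)

The price is exactly the "extra indirection for every byte read" mentioned up-thread, but the space win is dramatic: 300 states * 40 classes * 2 bytes is about 24 KB of transitions, plus at most 64 KB of class map (less with page sharing), versus roughly 39 MB for the flat UTF-16 table.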

Brian
