Robert Fischer wrote:
>>>>> 5. Strings: pushing unicode throughout a general purpose language is a
>>>>> mistake, IMHO. This is why languages like Java and C# are so slow.
>>>>
>>>> Unicode by itself, when wider-than-byte encodings are used, adds "zero"
>>>> runtime overhead; the only overhead is storage (2 or 4 bytes per
>>>> character).
>>>
>>> You cannot degrade memory consumption without also degrading performance.
>>> Moreover, there are hidden costs such as the added complexity in a lexer
>>> which potentially has 256x larger dispatch tables or an extra indirection
>>> for every byte read.
>
> Okay, I was going to let this slide, but it kept resurfacing and annoying me.
>
> Is there any empirical support for the assertion that Java and C# are slow
> because of *unicode*? Of all things, *unicode*? The fact that they're
> bytecode languages isn't a bigger hit? At least with the JVM, the
> hypercomplicated GC should probably take some of the blame, too -- I've seen
> 2x speed increases by *reducing* the space available to the GC, and 10x
> speed increases by boosting the space available to ridiculous levels so that
> the full GC barely ever has to fire. The nigh-universal, optimization-ruining
> mutable data and virtual function (i.e. method) dispatch surely don't help
> either. And this is to say nothing of user-space problems like the explosion
> of nontrivial types associated with the object-driven style. With all that
> going on, you're blaming their *Unicode support* for why they're slow?
> "This is why languages like Java and C# are so slow." Really? Got evidence
> for that?
>
> ~~ Robert.

The problem, as I understand it, is in writing parsers. Your standard
finite-automaton-based regular expression library or lexical analyzer is, at
its heart, a table lookup: you have a 2D array whose size is the number of
possible input characters times the number of states.

For ASCII (or any byte-oriented) input, the number of possible input
characters is small -- 256 at most. 256 input characters times hundreds of
states isn't that big of a table; we're looking at sizes in the tens of
kilobytes, easily handled even in the bad old days of 64K segments. But going
to UTF-16 ups the number of input characters from 256 to 65,536, and now a
moderately large state machine (hundreds of states) weighs in at tens of
megabytes of table space. And, of course, if you try to handle the entire
Unicode code-point space (more than a million code points), welcome to really
large tables :-).

The solution, I think, is to change the implementation of your finite
automata to use some data structure smarter than a flat 2D array, but
that's me.

Brian
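To make the "smarter data structure" suggestion concrete, here is a minimal
OCaml sketch of the character-equivalence-class trick many lexer generators
use: code points are first mapped to a small number of classes (here by
binary search over sorted range boundaries), and the transition table is
indexed by class rather than by raw code point, so its width grows with the
number of distinct character classes in the regexes, not with the size of the
code-point space. All the names below (dfa, class_bounds, class_of, accepts,
letters_dfa) are illustrative, not taken from any real library.

(* A DFA whose transition table is indexed by character equivalence
   class, so its size is (#states) x (#classes) rather than
   (#states) x 0x110000. *)
type dfa = {
  (* Sorted lower bounds of code-point ranges; the range starting at
     class_bounds.(i) maps to class i. *)
  class_bounds : int array;
  (* trans.(state).(cls) = next state, or -1 for "no transition". *)
  trans : int array array;
  accepting : bool array;
  start : int;
}

(* Map a code point to its class by binary search: O(log #classes)
   per character instead of a million-entry lookup row. *)
let class_of dfa cp =
  let bounds = dfa.class_bounds in
  let lo = ref 0 and hi = ref (Array.length bounds - 1) in
  while !lo < !hi do
    let mid = (!lo + !hi + 1) / 2 in
    if bounds.(mid) <= cp then lo := mid else hi := mid - 1
  done;
  !lo

(* Run the DFA over a list of code points; -1 is the dead state. *)
let accepts dfa cps =
  let state =
    List.fold_left
      (fun st cp -> if st < 0 then st else dfa.trans.(st).(class_of dfa cp))
      dfa.start cps
  in
  state >= 0 && dfa.accepting.(state)

(* Toy example: one or more ASCII letters.  Five classes cover the
   whole code-point space: [0,'A'), ['A','Z'], ('Z','a'), ['a','z'],
   and everything above 'z'.  The table is 2 states x 5 classes. *)
let letters_dfa = {
  class_bounds = [| 0; 0x41; 0x5B; 0x61; 0x7B |];
  trans = [| [| -1; 1; -1; 1; -1 |];      (* start state        *)
             [| -1; 1; -1; 1; -1 |] |];   (* seen >= 1 letter    *)
  accepting = [| false; true |];
  start = 0;
}

let () =
  assert (accepts letters_dfa [0x48; 0x69]);          (* "Hi"             *)
  assert (not (accepts letters_dfa [0x48; 0x2603]))   (* "H" then U+2603  *)

Real lexer generators typically go further (row compression, default
transitions, and so on), but the basic point is the one Brian makes: with a
representation like this, supporting the full Unicode range costs a handful of
extra ranges per rule, not a 256x-wider table.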