5. Strings: pushing Unicode throughout a general-purpose language is a
mistake, IMHO. This is why languages like Java and C# are so slow.
Unicode by itself, when wider-than-byte encodings are used, adds "zero"
runtime overhead; the only overhead is storage (2 or 4 bytes per
character).
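(For concreteness, here is roughly what that storage claim cashes out to in Java; the
sample string and class name below are mine, purely for illustration:)

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class EncodingCost {
        public static void main(String[] args) {
            String s = "héllo, wörld";  // 12 characters, two of them non-ASCII
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 14 bytes
            System.out.println(s.getBytes(StandardCharsets.UTF_16LE).length); // 24 bytes: 2 per char
            System.out.println(s.getBytes(Charset.forName("UTF-32")).length); // 48 bytes: 4 per char
        }
    }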
You cannot increase memory consumption without also degrading performance.
Moreover, there are hidden costs such as the added complexity in a lexer
which potentially has 256x larger dispatch tables or an extra indirection
for every byte read.
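(Again for concreteness, a minimal sketch of the dispatch-table point: a byte lexer can
classify input with one flat 256-entry array lookup, while a full code-point classifier
pays an extra lookup step per character. The class and names are invented for illustration.)

    public class LexerDispatch {
        enum Cls { LETTER, DIGIT, SPACE, OTHER }

        // Byte-oriented lexer: one flat 256-entry table, one index per byte read.
        static final Cls[] BYTE_TABLE = new Cls[256];
        static {
            java.util.Arrays.fill(BYTE_TABLE, Cls.OTHER);
            for (int c = 'a'; c <= 'z'; c++) BYTE_TABLE[c] = Cls.LETTER;
            for (int c = 'A'; c <= 'Z'; c++) BYTE_TABLE[c] = Cls.LETTER;
            for (int c = '0'; c <= '9'; c++) BYTE_TABLE[c] = Cls.DIGIT;
            BYTE_TABLE[' '] = BYTE_TABLE['\t'] = BYTE_TABLE['\n'] = Cls.SPACE;
        }

        static Cls classifyByte(byte b) {
            return BYTE_TABLE[b & 0xFF];  // single direct lookup
        }

        // Code-point-oriented lexer: a flat table would need 0x110000 entries, so in
        // practice each character costs an extra step (here, a library call that
        // itself walks multi-level tables internally).
        static Cls classifyCodePoint(int cp) {
            if (Character.isLetter(cp))     return Cls.LETTER;
            if (Character.isDigit(cp))      return Cls.DIGIT;
            if (Character.isWhitespace(cp)) return Cls.SPACE;
            return Cls.OTHER;
        }

        public static void main(String[] args) {
            System.out.println(classifyByte((byte) 'x'));   // LETTER
            System.out.println(classifyCodePoint(0x03B1));  // Greek alpha -> LETTER
        }
    }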
Okay, I was going to let this slide, but it kept resurfacing and annoying me.
Is there any empirical support for the assertion that Java and C# are slow because of *Unicode*? Of
all things, *Unicode*? The fact that they're bytecode languages isn't a bigger hit? At least with
the JVM, the hypercomplicated GC should probably take some of the blame, too -- I've seen 2x speed
increases by *reducing* the space available to the GC, and 10x speed increases by boosting the space
available to ridiculous levels so that the full GC barely ever has to fire. Then the nigh-universal,
optimization-ruining combination of mutable data and virtual function (i.e., method) dispatch I'm
sure doesn't help either. And this is to say nothing of user-space problems like the explosion of
nontrivial types
associated with the object-driven style. With all that going on, you're blaming their *Unicode
support* for why they're slow? "This is why languages like Java and C# are so slow." Really? Got
evidence for that?
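To be clear about what I mean by "space available to the GC": nothing fancier than heap sizing.
A tiny sanity-check program (the flag values in the comments are illustrative, not from any
particular benchmark of mine):

    public class HeapCheck {
        public static void main(String[] args) {
            // Reports the -Xmx ceiling the JVM was actually launched with.
            long max = Runtime.getRuntime().maxMemory();
            System.out.printf("max heap: %d MiB%n", max / (1024 * 1024));
            // e.g.  java -Xmx64m HeapCheck        -> tiny heap, collections fire constantly
            //       java -Xms8g -Xmx8g HeapCheck  -> huge heap, the full GC barely ever runs
        }
    }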
~~ Robert.