Robert Fischer wrote:
>>>>> 5. Strings: pushing unicode throughout a general purpose language is a
>>>>> mistake, IMHO. This is why languages like Java and C# are so slow.
>>>>
>>>> Unicode by itself, when wider-than-byte encodings are used, adds "zero"
>>>> runtime overhead; the only overhead is storage (2 or 4 bytes per
>>>> character).
>>>
>>> You cannot degrade memory consumption without also degrading performance.
>>> Moreover, there are hidden costs such as the added complexity in a lexer
>>> which potentially has 256x larger dispatch tables or an extra indirection
>>> for every byte read.
>
> Okay, I was going to let this slide, but it kept resurfacing and annoying me.
>
> Is there any empirical support for the assertion that Java and C# are slow
> because of *unicode*? Of all things, *unicode*? The fact that they're
> bytecode languages isn't a bigger hit? At least with the JVM, the
> hypercomplicated GC should probably take some of the blame, too -- I've seen
> 2x speed increases by *reducing* the space available to the GC, and 10x
> speed increases by boosting the space available to ridiculous levels so that
> the full GC barely ever has to fire. The nigh-universal, optimization-ruining
> mutable data and virtual function (i.e. method) dispatch surely don't help
> either. And this is to say nothing of user-space problems like the explosion
> of nontrivial types associated with the object-driven style. With all that
> going on, you're blaming their *Unicode support* for why they're slow?
> "This is why languages like Java and C# are so slow." Really? Got evidence
> for that?
>
> ~~ Robert.

The problem, as I understand it, is in writing parsers. Your standard
finite-automaton-based regular expression library or lexical analyzer is, at
its heart, a table lookup: you have a 2D array whose size is the number of
possible input characters times the number of states.

For ASCII (or any byte-oriented) input, the number of possible input
characters is small -- 256 at most. 256 input characters times hundreds of
states isn't that big of a table; we're looking at sizes in the tens of
kilobytes, easily handled even in the bad old days of 64K segments. But going
to UTF-16 ups the number of input characters from 256 to 65,536, and now a
moderately large state machine (hundreds of states) weighs in at tens of
megabytes of table space. And, of course, if you try to handle the entire
Unicode code-point space (more than a million code points), welcome to really
large tables :-).

The solution, I think, is to change the implementation of your finite
automata to use some data structure smarter than a flat 2D array, but
that's me.

Brian
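To make the "smarter data structure" suggestion concrete, here is a minimal
OCaml sketch of the character-equivalence-class trick many lexer generators
use: code points are first mapped to a small number of classes (here by
binary search over sorted range boundaries), and the transition table is
indexed by class rather than by raw code point, so its width grows with the
number of distinct character classes in the regexes, not with the size of the
code-point space. All the names below (dfa, class_bounds, class_of, accepts,
letters_dfa) are illustrative, not taken from any real library.

(* A DFA whose transition table is indexed by character equivalence
   class, so its size is (#states) x (#classes) rather than
   (#states) x 0x110000. *)
type dfa = {
  (* Sorted lower bounds of code-point ranges; the range starting at
     class_bounds.(i) maps to class i. *)
  class_bounds : int array;
  (* trans.(state).(cls) = next state, or -1 for "no transition". *)
  trans : int array array;
  accepting : bool array;
  start : int;
}

(* Map a code point to its class by binary search: O(log #classes)
   per character instead of a million-entry lookup row. *)
let class_of dfa cp =
  let bounds = dfa.class_bounds in
  let lo = ref 0 and hi = ref (Array.length bounds - 1) in
  while !lo < !hi do
    let mid = (!lo + !hi + 1) / 2 in
    if bounds.(mid) <= cp then lo := mid else hi := mid - 1
  done;
  !lo

(* Run the DFA over a list of code points; -1 is the dead state. *)
let accepts dfa cps =
  let state =
    List.fold_left
      (fun st cp -> if st < 0 then st else dfa.trans.(st).(class_of dfa cp))
      dfa.start cps
  in
  state >= 0 && dfa.accepting.(state)

(* Toy example: one or more ASCII letters.  Five classes cover the
   whole code-point space: [0,'A'), ['A','Z'], ('Z','a'), ['a','z'],
   and everything above 'z'.  The table is 2 states x 5 classes. *)
let letters_dfa = {
  class_bounds = [| 0; 0x41; 0x5B; 0x61; 0x7B |];
  trans = [| [| -1; 1; -1; 1; -1 |];      (* start state        *)
             [| -1; 1; -1; 1; -1 |] |];   (* seen >= 1 letter    *)
  accepting = [| false; true |];
  start = 0;
}

let () =
  assert (accepts letters_dfa [0x48; 0x69]);          (* "Hi"             *)
  assert (not (accepts letters_dfa [0x48; 0x2603]))   (* "H" then U+2603  *)

Real lexer generators typically go further (row compression, default
transitions, and so on), but the basic point is the one Brian makes: with a
representation like this, supporting the full Unicode range costs a handful of
extra ranges per rule, not a 256x-wider table.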