From: "Daniel Bünzli" <daniel.buenzli@erratique.ch>
To: caml-list@inria.fr
Subject: Re: [Caml-list] Immutable strings
Date: Wed, 9 Jul 2014 15:15:33 +0100
Message-ID: <C8E64BE53B6D4027B43B29260AC28C5D@erratique.ch>
In-Reply-To: <sympa.1404842907.21063.651@inria.fr>
On Tuesday, 8 July 2014 at 19:15, mattiasw@gmail.com wrote:
> ocaml will be that last language that doesn't have standardize unicode
> support. Even old languages like Erlang has gone the UTF-8 way, and that
> includes program code.
For fun, I just had a look at what Python does.
So in Python they basically have a Unicode string type which is a sequence of Unicode *code points*. Fail, end of discussion. It should have been a sequence of *scalar values*, i.e. code points excluding the surrogate range U+D800..U+DFFF (for those who don't see why, I suggest reading my minimal Unicode introduction [1]).
(This holds for both 2 and 3. Apparently 2 used to be messier, for reasons I didn't bother to understand; they seem to be highly confused.)
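To make the distinction concrete, here is a minimal sketch, assuming Python 3. The helper names are hypothetical and not part of any library; the numeric ranges are those of the Unicode standard.

```python
# Hypothetical helpers illustrating code points vs. scalar values.

def is_surrogate(cp: int) -> bool:
    """True if cp lies in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_scalar_value(cp: int) -> bool:
    """A scalar value is any code point that is not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not is_surrogate(cp)


def is_unicode_text(s: str) -> bool:
    """True if every element of a Python 3 str is a scalar value."""
    return all(is_scalar_value(ord(c)) for c in s)
```

A string type built on scalar values would make a check like is_unicode_text unnecessary, since the invalid values could not be constructed in the first place.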
Sample code. U+D800 is the first surrogate, i.e. something you should never see in concrete Unicode textual processing, only in UTF-16 encoded bytes and paired with an appropriate low surrogate.
Python2:
>>> u'\uD800'.encode('utf-8')
'\xed\xa0\x80'
Congratulations, you just produced an invalid UTF-8 sequence (serialized a surrogate).
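As a cross-check, here is a small sketch assuming a Python 3 interpreter, whose strict UTF-8 decoder should refuse those three bytes. The function name is_valid_utf8 is made up for illustration.

```python
# The three bytes Python 2 produced above for U+D800.
data = b"\xed\xa0\x80"


def is_valid_utf8(raw: bytes) -> bool:
    """Report whether raw decodes as UTF-8 under strict error handling."""
    try:
        raw.decode("utf-8")  # "strict" is the default error handler
    except UnicodeDecodeError:
        return False
    return True
```

Here is_valid_utf8(data) is expected to be False, while the encoding of an ordinary scalar value such as U+00E9 passes.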
Python 3 is a *little* better when encoding to *UTF-8* (but wait…):
>>> "\uD800".encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
So now let's try UTF-16:
>>> "\uD800".encode("utf-16")
b'\xff\xfe\x00\xd8'
Congratulations, you just produced an invalid UTF-16 sequence: a high surrogate without a corresponding low surrogate (the two together would define a Unicode scalar value).
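For reference, the way a well-formed pair defines a scalar value can be sketched as follows. This is an illustrative helper (combine_surrogates is a hypothetical name) using the standard UTF-16 arithmetic, not code from any particular library.

```python
def combine_surrogates(hi: int, lo: int) -> int:
    """Map a (high, low) surrogate pair to the scalar value it encodes.

    Raises ValueError when the pair is ill-formed, e.g. a high
    surrogate that is not followed by a low surrogate.
    """
    if not 0xD800 <= hi <= 0xDBFF:
        raise ValueError("not a high surrogate: U+%04X" % hi)
    if not 0xDC00 <= lo <= 0xDFFF:
        raise ValueError("not a low surrogate: U+%04X" % lo)
    # Each surrogate carries 10 bits; the result starts at U+10000.
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
```

For example, combine_surrogates(0xD83D, 0xDE00) gives 0x1F600, whereas a lone 0xD800 followed by anything other than a low surrogate has no scalar value to map to.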
Why on earth do they allow surrogates to be represented *at all* in their Unicode text data structure? Basically, they don't understand Unicode.
The old camel should not be ashamed of its *outstanding* (absolutely) Unicode support. This is not to say that nothing can be improved (I do have some proposals in the works), but the situation is not bad either.
Best,
Daniel
P.S. Skimming through these articles about Python Unicode strings, I gather why people find Unicode hard: there seems to be a high level of both technical and conceptual confusion. Again, have a read of [1] if you'd like to clear (I hope) your mind about these things.
[1] http://erratique.ch/software/uucp/doc/Uucp.html#uminimal