[Caml-list] GSoC: better UTF-8 support

Mailing list for all users of the OCaml language and system.

* [Caml-list] GSoC: better UTF-8 support
@ 2011-02-28  8:35 Christophe TROESTLER
  2011-02-28  8:58 ` Daniel Bünzli
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Christophe TROESTLER @ 2011-02-28  8:35 UTC (permalink / raw)
  To: OCaml Mailing List

Hi,

Starting from an idea on the Ocsigen mailing list, it was suggested
that better support for UTF-8 in the tools would be of interest to
several people.  In particular, the following points were identified:

- A flag (-utf8?) should be added to the compilers so that error
  locations are correct in the presence of UTF-8 strings [the programmer
  restricting himself to ASCII identifiers].

- ocamldoc: while a UTF-8-aware doc-generator is very easy to write,
  it would be nice to be able to parametrize any of them with the
  correct charset (again using the -utf8 flag?)

- UTF8.Char and UTF8.String modules should be written with the same
  interface as Char and String.  [Camomile should be adapted
  consequently.]

- Printf/Scanf: %U or %cu for UTF8.Char.t

- Graphics: UTF-8 text printing

- Str: (character ranges)

The questions are: would such changes be beneficial to you?  Are there
other issues to address?  Is this enough for a GSoC proposal (seems a
little light to me)?  If it is done, is there a chance to have this
work included in the standard distribution?

Best,
C.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Caml-list] GSoC: better UTF-8 support
  2011-02-28  8:35 [Caml-list] GSoC: better UTF-8 support Christophe TROESTLER
@ 2011-02-28  8:58 ` Daniel Bünzli
  2011-02-28 10:07   ` David Allsopp
                     ` (2 more replies)
  2011-02-28 10:07 ` David Allsopp
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 20+ messages in thread
From: Daniel Bünzli @ 2011-02-28  8:58 UTC (permalink / raw)
  To: Christophe TROESTLER; +Cc: OCaml Mailing List

> - A flag (-utf8 ?) to the compilers should be added so that errors
>  locations are correct in presence of UTF-8 strings [the programmer
>  restricting himself to ASCII identifiers].

Alain mentioned that the patch would only be a few lines long.

> - ocamldoc: while an UTF-8 aware doc-generator is very easy to write,
>  it would be nice to be able to parametrize any of them with the
>  correct charset (using again the -utf8 flag ?)

http://caml.inria.fr/mantis/bug_view_page.php?bug_id=5066

> - UTF8.Char and UTF8.String modules should be written with the same
>  interface as Char and String.  [Camomile should be adapted
>  consequently.]

Is it a good idea to replicate the poor interface that the Char and
String modules offer for manipulating strings?

> - Graphics: UTF-8 text printing

Are there really a lot of people using the Graphics module ?

> - Str: (character ranges)

This would be the only interesting thing to me.

Daniel


* RE: [Caml-list] GSoC: better UTF-8 support
  2011-02-28  8:58 ` Daniel Bünzli
@ 2011-02-28 10:07   ` David Allsopp
  2011-02-28 11:21     ` Daniel Bünzli
  2011-02-28 10:59   ` Sylvain Le Gall
  2011-02-28 14:39   ` [Caml-list] " David Rajchenbach-Teller
  2 siblings, 1 reply; 20+ messages in thread
From: David Allsopp @ 2011-02-28 10:07 UTC (permalink / raw)
  To: 'Daniel Bünzli', 'Christophe TROESTLER'
  Cc: 'OCaml Mailing List'

Daniel Bünzli wrote:
> > - UTF8.Char and UTF8.String modules should be written with the same
> >  interface as Char and String.  [Camomile should be adapted
> >  consequently.]
> 
> Is it a good idea to replicate the poor interface that the module Char
> and String represent to manipulate strings ?

If it's to go into the standard library then yes, it should exactly replicate the interface of the Char and String modules, that way within the standard library UTF8 can be used as a drop-in replacement (via a module or open statement at the top of some code). Anything else would be a very bad idea - two different interfaces over two representations of the same abstract concept (i.e. strings) would be far worse than one slightly inferior interface used on both. Out of interest, what are your complaints against the String and Char modules - missing functions or something deeper?

Anything else sounds like too wild a change to have a cat in hell's chance of getting into the standard library as a patch.

> > - Graphics: UTF-8 text printing
> 
> Are there really a lot of people using the Graphics module ?

Again, if the idea is to get UTF-8 support properly into the *standard* library then all aspects of string processing within the whole of the standard library should support it properly.

> > - Str: (character ranges)
> 
> This would be the only interesting thing to me.

It is mildly amusing that you criticise the String and Char modules, yet have interest in this module given that Pcre is so often recommended as the practical/sensible one to use (I know that Str is a lot faster than it used to be and indeed use it myself when I don't want the external dependency) :o)

David

* Re: [Caml-list] GSoC: better UTF-8 support
  2011-02-28 10:07   ` David Allsopp
@ 2011-02-28 11:21     ` Daniel Bünzli
  2011-02-28 11:46       ` David Allsopp
  0 siblings, 1 reply; 20+ messages in thread
From: Daniel Bünzli @ 2011-02-28 11:21 UTC (permalink / raw)
  To: David Allsopp; +Cc: Christophe TROESTLER, OCaml Mailing List

> If it's to go into the standard library then yes, it should exactly replicate the interface of the Char and String modules, that way within the standard library UTF8 can be used as a drop-in replacement
[...]

I'm not sure many programs would actually benefit from that. At a
certain point, if you really want to process Unicode at the character
level, you'll need a proper library. Using these ASCII/Latin-1 oriented
interfaces to process Unicode at the character level would be
debilitating and frustrating for your final users (e.g. no treatment
of normal forms; you do realize that in Unicode there's more than one
way of representing the character 'é').
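
To make the 'é' point concrete, here is a minimal sketch using only the
standard String module. The names e_precomposed and e_decomposed are
illustrative, not taken from any library. Both values display as the
same character, yet they differ byte for byte:

```ocaml
(* U+00E9 LATIN SMALL LETTER E WITH ACUTE, encoded as UTF-8. *)
let e_precomposed = "\xC3\xA9"

(* U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT. *)
let e_decomposed = "e\xCC\x81"

(* Byte-level comparison knows nothing about normal forms,
   so the two spellings of 'é' are reported as different. *)
let same_bytes = String.compare e_precomposed e_decomposed = 0

let () =
  Printf.printf "lengths: %d and %d, equal: %b\n"
    (String.length e_precomposed)
    (String.length e_decomposed)
    same_bytes
```

Making the two compare equal requires normalization, which is the kind
of service a dedicated Unicode library provides.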

The current status quo already allows you to handle UTF-8 encoded
strings as long as you don't look into them at the character level,
which is fine for many programs.

> Out of interest, what are your complaints against the String and Char modules - missing functions or something deeper?

Every time I have to explode a string at a given separator and want to
use only the standard library I complain.
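
For illustration, a minimal sketch of the helper one ends up writing by
hand. The name split_on is hypothetical; it relies only on
String.index_from and String.sub. (More recent OCaml releases ship
String.split_on_char, which covers this case.)

```ocaml
(* Split [s] at every occurrence of [sep]. Empty fields are kept,
   so "a,,b" yields ["a"; ""; "b"]. *)
let split_on (sep : char) (s : string) : string list =
  let len = String.length s in
  let rec go acc start =
    (* Position of the next separator, or [len] if there is none. *)
    let stop = try String.index_from s start sep with Not_found -> len in
    let acc = String.sub s start (stop - start) :: acc in
    if stop >= len then List.rev acc else go acc (stop + 1)
  in
  go [] 0

let () =
  List.iter print_endline (split_on '/' "usr/local/lib")
```

Because it works on bytes, this is also safe on UTF-8 encoded input as
long as the separator is an ASCII character.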

> It is mildly amusing that you criticise the String and Char modules, yet have interest in this module given
[...]

The thing is that this support could be included without changing the
interfaces at all. Only the regexp language needs to be extended (and
I guess the underlying implementation wouldn't have to be changed).

Best,

Daniel

* RE: [Caml-list] GSoC: better UTF-8 support
  2011-02-28 11:21     ` Daniel Bünzli
@ 2011-02-28 11:46       ` David Allsopp
  2011-02-28 12:32         ` Daniel Bünzli
  0 siblings, 1 reply; 20+ messages in thread
From: David Allsopp @ 2011-02-28 11:46 UTC (permalink / raw)
  To: 'Daniel Bünzli'
  Cc: 'Christophe TROESTLER', 'OCaml Mailing List'

Daniel Bünzli wrote:
> > If it's to go into the standard library then yes, it should exactly
> > replicate the interface of the Char and String modules, that way
> > within the standard library UTF8 can be used as a drop-in replacement
> [...]
> 
> I'm not sure many programs would actually benefit from that. At a certain
> point if you really want to process unicode at the character level you'll
> need a proper library.

Yes, and at that point you'd use one. Not all programs need to analyse strings at a character level. For example, it would be nice for this to work within just the standard library:

D:\>md "Paweł Łukaszewski"
D:\>cd "Paweł Łukaszewski"
D:\Paweł Łukaszewski>ocaml
        Objective Caml version 3.11.2

# Sys.getcwd();;
- : string = "D:\\Pawel Lukaszewski"

The reason for this is that an internal conversion is done to hack around Unicode filenames. Having any representation of Unicode in the standard library would mean that these and other functions would no longer have to return incorrect answers, as in this case; the standard library would have a default mechanism for dealing with Unicode. At the moment, there's nothing the standard library can do with this Unicode filename because there's no standardised (within the library) representation available for it.

Consider it as being similar to something like the Digest module. It's only really there because digests are used in .cmi files within the compiler. If you're serious about using digests in an application then you have to use an external library because MD5 is not usually enough and you'll need other algorithms. But the module is still potentially useful so it's better that it's publicly visible in the standard library and not just an internal module of the compiler. The same would apply to this simple native support for UTF-8: it would allow other parts of the standard library to work properly, and simple applications would be able to handle Unicode strings.

> Using these ASCII/Latin-1 oriented interfaces to
> process Unicode at the character level would be debilitating and
> frustrating for your final users (e.g. no treatment of normal forms; you
> do realize that in Unicode there's more than one way of representing the
> character 'é').

Fully aware - but just because you need to work with strings does *not* imply you ever need even to compare them. Granted, the documentation may note that to perform a canonical comparison you'll need a third party library but I still maintain that having basic support is better and more usable than having none. The above issue, for example, means I can't even manipulate a directory tree in OCaml on my system which shouldn't even be related to string/text processing!

David

* Re: [Caml-list] GSoC: better UTF-8 support
  2011-02-28 11:46       ` David Allsopp
@ 2011-02-28 12:32         ` Daniel Bünzli
  2011-02-28 12:59           ` [Caml-list] " Sylvain Le Gall
  0 siblings, 1 reply; 20+ messages in thread
From: Daniel Bünzli @ 2011-02-28 12:32 UTC (permalink / raw)
  To: David Allsopp; +Cc: Christophe TROESTLER, OCaml Mailing List

> D:\>md "Paweł Łukaszewski"
> D:\>cd "Paweł Łukaszewski"
> D:\Paweł Łukaszewski>ocaml
>        Objective Caml version 3.11.2
>
> # Sys.getcwd();;
> - : string = "D:\\Pawel Lukaszewski"

1) That's a very different problem from defining new Char- and
String-like modules for UTF-8 encoded strings.
2) Is that a Windows problem? Here on OS X:

> mkdir  Łukaszewski
> cd Łukaszewski
> rlwrap ocaml
        Objective Caml version 3.12.0

# Sys.getcwd ();;
- : string = "/private/tmp/?\129ukaszewski"
# Char.code (Sys.getcwd ()).[13];;
- : int = 197
# Char.code (Sys.getcwd ()).[14];;
- : int = 129

so we have 0xC5 0x81 for Ł which is the right UTF-8 representation for it.
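
The two byte values can be checked by hand. Below is a minimal sketch
of a UTF-8 encoder for a single code point, using only Buffer and Char.
The name utf8_encode is a hypothetical helper, not a standard library
function:

```ocaml
(* Encode one Unicode code point as a UTF-8 byte string.
   Surrogates and values above U+10FFFF are rejected. *)
let utf8_encode (cp : int) : string =
  let b = Buffer.create 4 in
  let add n = Buffer.add_char b (Char.chr n) in
  if cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) then
    invalid_arg "utf8_encode"
  else if cp < 0x80 then add cp
  else if cp < 0x800 then begin
    (* Two bytes: 110xxxxx 10xxxxxx *)
    add (0xC0 lor (cp lsr 6));
    add (0x80 lor (cp land 0x3F))
  end else if cp < 0x10000 then begin
    (* Three bytes: 1110xxxx 10xxxxxx 10xxxxxx *)
    add (0xE0 lor (cp lsr 12));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end else begin
    (* Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx *)
    add (0xF0 lor (cp lsr 18));
    add (0x80 lor ((cp lsr 12) land 0x3F));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end;
  Buffer.contents b

let () =
  (* U+0141 is LATIN CAPITAL LETTER L WITH STROKE. *)
  String.iter (fun c -> Printf.printf "%d " (Char.code c)) (utf8_encode 0x141);
  print_newline ()
```

For U+0141 this yields the bytes 197 and 129, i.e. the values returned
by Char.code in the session above.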

I'm currently not up to date on the problem of Unicode-encoded
filenames in OCaml, but isn't that something that should be handled by
the underlying libc?

Note, maybe a nice addition for the GSoC project would be to add an
option to ocaml so that it doesn't escape the bytes 127 to 159 when it
prints strings, allowing your UTF-8 aware tty to display UTF-8 encoded
strings correctly, not as above.

> Fully aware - but just because you need to work with strings does *not* imply you ever need even to compare them. Granted, the documentation may note that to perform a canonical comparison you'll need a third party library but I still maintain that having basic support is better and more usable than having none
[...]

But then, if you only need byte-level comparison String.compare is
fine. So I really don't see the benefits of these UTF-8 Char and
String like modules.

Best,

Daniel

* [Caml-list] Re: GSoC: better UTF-8 support
  2011-02-28 12:32         ` Daniel Bünzli
@ 2011-02-28 12:59           ` Sylvain Le Gall
  0 siblings, 0 replies; 20+ messages in thread
From: Sylvain Le Gall @ 2011-02-28 12:59 UTC (permalink / raw)
  To: caml-list

On 28-02-2011, Daniel Bünzli <daniel.buenzli@erratique.ch> wrote:
>> D:\>md "Paweł Łukaszewski"
>> D:\>cd "Paweł Łukaszewski"
>> D:\Paweł Łukaszewski>ocaml
>>        Objective Caml version 3.11.2
>>
>> # Sys.getcwd();;
>> - : string = "D:\\Pawel Lukaszewski"
>
> 1) That's a very different problem from defining new Char- and
> String-like modules for UTF-8 encoded strings.
> 2) Is that a Windows problem? Here on OS X:
>

I think it is a Windows issue, because on Windows you use either ASCII
or UTF-16 (I think this is the encoding of wide chars on Windows, though
I am not sure).

So you have two sets of functions: xxxA and xxxW.

E.g.  CreateDirectoryA and CreateDirectoryW

Cheers,
Sylvain Le Gall
-- 
My company: http://www.ocamlcore.com
Linkedin:   http://fr.linkedin.com/in/sylvainlegall
Start an OCaml project here: http://forge.ocamlcore.org
OCaml blogs:                 http://planet.ocamlcore.org



* [Caml-list] Re: GSoC: better UTF-8 support
  2011-02-28  8:58 ` Daniel Bünzli
  2011-02-28 10:07   ` David Allsopp
@ 2011-02-28 10:59   ` Sylvain Le Gall
  2011-02-28 14:39   ` [Caml-list] " David Rajchenbach-Teller
  2 siblings, 0 replies; 20+ messages in thread
From: Sylvain Le Gall @ 2011-02-28 10:59 UTC (permalink / raw)
  To: caml-list

Hello,

On 28-02-2011, Daniel Bünzli <daniel.buenzli@erratique.ch> wrote:
>> - A flag (-utf8 ?) to the compilers should be added so that errors
>>  locations are correct in presence of UTF-8 strings [the programmer
>>  restricting himself to ASCII identifiers].
>
> Alain mentioned that the patch would only be a few lines long.
>

Alain Frisch is not the kind of student we will have for GSoC... Let's
say that it can take a while for an average student to reach 1% of
Alain's level with respect to OCaml. So these few lines can take a
while to be produced.

I think the whole task makes sense for a GSoC and will be enough for a
full GSoC for a normal student.

Cheers,
Sylvain Le Gall

* Re: [Caml-list] GSoC: better UTF-8 support
  2011-02-28  8:58 ` Daniel Bünzli
  2011-02-28 10:07   ` David Allsopp
  2011-02-28 10:59   ` Sylvain Le Gall
@ 2011-02-28 14:39   ` David Rajchenbach-Teller
  2 siblings, 0 replies; 20+ messages in thread
From: David Rajchenbach-Teller @ 2011-02-28 14:39 UTC (permalink / raw)
  To: Daniel Bünzli; +Cc: Christophe TROESTLER, OCaml Mailing List

Don't forget to check OCaml Batteries Included. Some of the work is already done (including an extended printf that handles UTF-8 and is further user-extensible).

Cheers,
 David

On Feb 28, 2011, at 9:58 AM, Daniel Bünzli wrote:

>> - A flag (-utf8 ?) to the compilers should be added so that errors
>>  locations are correct in presence of UTF-8 strings [the programmer
>>  restricting himself to ASCII identifiers].
> 
> Alain mentioned that the patch would only be a few lines long.
> 
>> - ocamldoc: while an UTF-8 aware doc-generator is very easy to write,
>>  it would be nice to be able to parametrize any of them with the
>>  correct charset (using again the -utf8 flag ?)
> 
> http://caml.inria.fr/mantis/bug_view_page.php?bug_id=5066
> 
>> - UTF8.Char and UTF8.String modules should be written with the same
>>  interface as Char and String.  [Camomile should be adapted
>>  consequently.]
> 
> Is it a good idea to replicate the poor interface that the module Char
> and String represent to manipulate strings ?
> 
>> - Graphics: UTF-8 text printing
> 
> Are there really a lot of people using the Graphics module ?
> 
>> - Str: (character ranges)
> 
> This would be the only interesting thing to me.
> 
> Daniel

* RE: [Caml-list] GSoC: better UTF-8 support
  2011-02-28  8:35 [Caml-list] GSoC: better UTF-8 support Christophe TROESTLER
  2011-02-28  8:58 ` Daniel Bünzli
@ 2011-02-28 10:07 ` David Allsopp
       [not found]   ` <20110228.143157.1265982603697554449.Christophe.Troestler+ocaml@umons.ac.be>
  2011-02-28 14:13 ` Gerd Stolpmann
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: David Allsopp @ 2011-02-28 10:07 UTC (permalink / raw)
  To: 'Christophe TROESTLER', 'OCaml Mailing List'

Christophe TROESTLER wrote:
> - UTF8.Char and UTF8.String modules should be written with the same
>   interface as Char and String.  [Camomile should be adapted
>   consequently.]

Thinking of conventions like Unix/Pervasives.LargeFile, Bigarray.Genarray, Bigarray.Array1, etc. wouldn't it be better for these to be Char.UTF8 and String.UTF8? 

> - Printf/Scanf: %U of %cu for UTF8.Char.t
> 
> - Graphics: UTF-8 text printing
> 
> - Str: (character ranges)

If UTF-8 support is added to the standard library then it should be added everywhere strings are manipulated or used - which raises the potentially ugly prospect of the Unix module?

> The questions are: would such changes be beneficial to you?

Personally, yes - it's an annoying limitation that you have to pull in a 3rd party library when all you want to do is handle a couple of accented characters accurately (my point being that not every application which needs UTF-8 needs it as a priority feature, nor is it necessarily manipulating terabytes of data and so requiring completely optimised processing).

IMO it'd be better to have a standard library only supporting one particular Unicode encoding with a perhaps imperfect interface over a non-optimal storage representation than to have no support whatsoever, especially given that there are very good 3rd party libraries which provide the optimal (and with it, slightly more complex) implementations.

> Are there other issues to address?

I found this very old archive thread but it still poses some potentially relevant points: http://caml.inria.fr/pub/old_caml_site/caml-list/1224.html

> Is this enough for a GSoc proposal (seems a little light to me)?

I would posit that if this included the Unix module then it's a very big proposal!

> If it is done, is there a chance to have this work included in the standard distribution?

If the patches themselves are as potentially small as suggested (so maintenance issues aren't vastly increased) and the interfaces remain compatible (so nothing breaks) then it seems reasonable to hope, doesn't it?

David

[parent not found: <20110228.143157.1265982603697554449.Christophe.Troestler+ocaml@umons.ac.be>]

* Re: [Caml-list] GSoC: better UTF-8 support
       [not found]   ` <20110228.143157.1265982603697554449.Christophe.Troestler+ocaml@umons.ac.be>
@ 2011-02-28 14:11     ` Daniel Bünzli
  2011-02-28 14:57       ` Dario Teixeira
  0 siblings, 1 reply; 20+ messages in thread
From: Daniel Bünzli @ 2011-02-28 14:11 UTC (permalink / raw)
  To: Christophe TROESTLER; +Cc: OCaml Mailing List

> Thinking more about this, one could introduce a new type (say “utf8”
> or “ustring”) for these UTF-8 strings.  It should be compatible with
> the way UTF-8 strings are handled on the C side for interoperability
> but “optimized” — e.g. should they contain their length (number of
> unicode chars)?
>
> Another thing: it could be a nice way to transition to *immutable*
> unicode strings.  This is not possible for (standard) strings because,
> as you all know, they are both used as strings and as buffers.  The
> introduction of unicode strings may be the right opportunity to
> distinguish both [1].

Frankly, I see no benefit in introducing this half-baked UTF-8 support
into the standard library (which is what this proposal is about).

This will just bring more noise into the interfaces. Even worse,
developers will think they handle Unicode properly when in fact they
do not, adding confusion to an already confusing topic (I'm always
surprised how little programmers know about Unicode). Again,
pretending to support Unicode character-level processing by replacing
Latin-1 character-level processing the way you suggest is just plain
wrong.

For me, either you:

1) Provide full Unicode support in the standard library, with at least
normal form and collation support, in a new API separate from the
existing String and Char modules.

2) Leave full Unicode support to a third-party library and keep the
current state, with some improvements for coping with UTF-8 encoded
string literals and for interacting with file systems correctly.

Given the signals sent in the past by the OCaml dev team, 2) seems more
likely to be accepted.

Best,

Daniel

* Re: [Caml-list] GSoC: better UTF-8 support
  2011-02-28 14:11     ` Daniel Bünzli
@ 2011-02-28 14:57       ` Dario Teixeira
  0 siblings, 0 replies; 20+ messages in thread
From: Dario Teixeira @ 2011-02-28 14:57 UTC (permalink / raw)
  To: caml-list

Hi,

> Frankly I see no benefit of introducing this half-baked UTF-8 support
> into the standard library (which is what this proposal is about).

I tend to agree with Daniel.  In my mind I already rename the stdlib
types char/string into byte/blob.  For many common operations (like
concatenation) it is perfectly safe to use these stdlib blob-oriented
functions with UTF8 encoded strings.  For others -- like accessing
char at position N -- the blob-oriented functions will fail.  However,
if your app needs UTF8-aware functions, it is very likely that sooner
or later you will also need support for some of the more complex
aspects of Unicode, at which point a half-arsed implementation will
not suffice and you will need to link an external library anyway.
Therefore, either the stdlib should remain strictly blob-oriented
(which is fine by me), or it should get serious about Unicode support.
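
As a small illustration of the byte/blob view, here is a sketch that
contrasts String.length (bytes) with a count of code points. The name
utf8_length is hypothetical, and the function assumes its input is
well-formed UTF-8; it does no validation:

```ocaml
(* Count code points in a UTF-8 string by skipping continuation
   bytes, i.e. bytes of the form 10xxxxxx. Assumes valid input. *)
let utf8_length (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
  !n

let () =
  (* "Paweł" with ł (U+0142) encoded as the two bytes 0xC5 0x82. *)
  let name = "Pawe\xC5\x82" in
  Printf.printf "bytes: %d, code points: %d\n"
    (String.length name) (utf8_length name)
```

Concatenation needs no such care, which is why the blob-oriented
functions remain safe for it. Note that a count of code points is still
not a count of user-perceived characters once combining marks are
involved.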

Cheers,
Dario Teixeira

* Re: [Caml-list] GSoC: better UTF-8 support
  2011-02-28  8:35 [Caml-list] GSoC: better UTF-8 support Christophe TROESTLER
  2011-02-28  8:58 ` Daniel Bünzli
  2011-02-28 10:07 ` David Allsopp
@ 2011-02-28 14:13 ` Gerd Stolpmann
  2011-02-28 14:31   ` [Caml-list] " Sylvain Le Gall
                     ` (2 more replies)
  2011-02-28 14:21 ` [Caml-list] " Michael Ekstrand
  2011-03-03 15:37 ` Damien Doligez
  4 siblings, 3 replies; 20+ messages in thread
From: Gerd Stolpmann @ 2011-02-28 14:13 UTC (permalink / raw)
  To: Christophe TROESTLER; +Cc: OCaml Mailing List

Am Montag, den 28.02.2011, 09:35 +0100 schrieb Christophe TROESTLER:
> Hi,
> 
> Starting from an idea on the Ocsigen mailing list, it was suggested
> that better support for UTF-8 in the tools would be of interest to
> several people.  In particular, the following points were identified:
> 
> - A flag (-utf8 ?) to the compilers should be added so that errors
>   locations are correct in presence of UTF-8 strings [the programmer
>   restricting himself to ASCII identifiers].
> 
> - ocamldoc: while an UTF-8 aware doc-generator is very easy to write,
>   it would be nice to be able to parametrize any of them with the
>   correct charset (using again the -utf8 flag ?)
> 
> - UTF8.Char and UTF8.String modules should be written with the same
>   interface as Char and String.  [Camomile should be adapted
>   consequently.]

Well, UTF-8 is the wrong term here. What you need on this level are
Unicode modules, where a uni_char can contain all Unicode code points,
and a uni_string is an array of such uni_char's.

UTF-8 is a variable-length encoding of Unicode for I/O. It is not well
suited for string manipulation, at least if you want efficient support
for index-based access, because the length of the char representation
is not constant.

Probably you would choose uni_char=int as representation for characters,
but for strings there are several possibilities:

- 16 bits/char: This path is taken by other languages, but only a 
  subset of Unicode chars can be represented directly
- 24 bits/char: All Unicode chars can be represented (range is
  0 to 0x10ffff), but you need to multiply by 3 to access by index. 
  This multiplication is relatively cheap (one bit shift plus 
  one addition).
- 32 bits/char: A slight waste of RAM but very efficient access by
  index
- int/char: same as 32 bits/char for 32-bit platforms but
  64 bits/char for 64-bit. Probably no good choice.
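
The 24 bits/char layout can be sketched as follows. This is only an
illustration under assumptions: the module name Ustring24 and its
functions are hypothetical, and it uses the Bytes module from later
OCaml releases. Each code point is stored big-endian in three
consecutive bytes, and the byte offset 3*i is computed as
(i lsl 1) + i, i.e. one shift plus one addition:

```ocaml
module Ustring24 = struct
  (* A string of [n] code points occupies [3 * n] bytes. *)
  type t = Bytes.t

  let create (n : int) : t = Bytes.make (3 * n) '\000'

  let length (u : t) : int = Bytes.length u / 3

  (* Byte offset of code point [i]: 3 * i as a shift and an addition. *)
  let offset (i : int) : int = (i lsl 1) + i

  let get (u : t) (i : int) : int =
    let o = offset i in
    (Char.code (Bytes.get u o) lsl 16)
    lor (Char.code (Bytes.get u (o + 1)) lsl 8)
    lor Char.code (Bytes.get u (o + 2))

  let set (u : t) (i : int) (cp : int) : unit =
    if cp < 0 || cp > 0x10FFFF then invalid_arg "Ustring24.set";
    let o = offset i in
    Bytes.set u o (Char.chr (cp lsr 16));
    Bytes.set u (o + 1) (Char.chr ((cp lsr 8) land 0xFF));
    Bytes.set u (o + 2) (Char.chr (cp land 0xFF))
end

let () =
  let u = Ustring24.create 2 in
  Ustring24.set u 0 0x141;
  Ustring24.set u 1 0x10FFFF;
  Printf.printf "%d code points, first is U+%04X\n"
    (Ustring24.length u) (Ustring24.get u 0)
```

Out-of-range indices are caught by the bounds checks in Bytes.get and
Bytes.set, which raise Invalid_argument.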

Of course, there should also be conversions from/to normal
chars/strings, and this is the place where UTF-8 comes into play.

Another comment: for supporting lowercase/uppercase conversion one needs
lookup tables. Not really big tables, because only a small fraction of
the Unicode chars has this variation.

One should also think about whether other properties of the Unicode
character database should be made available. E.g. character classes.
This could also live in add-on libraries, but it is worth discussing.

> - Printf/Scanf: %U of %cu for UTF8.Char.t

You need also string conversions for uni_string.

> 
> - Graphics: UTF-8 text printing
> 
> - Str: (character ranges)

Before talking about character ranges, Str needs to support single
Unicode characters properly. If it runs over a uni_string this is
trivial. If it runs directly over a UTF-8 encoded string this is
possible but the algorithm needs to be adapted to cope with multi-byte
representations.

For full internationalization you need more than just character ranges.
There are also character classes (e.g. "all letters", "all digits"), and
a few other phenomena.

> The questions are: would such changes be beneficial to you?

I'd like it very much.

>   Are there
> other issues to address? 

You probably would also need a Unicode-version of Buffer.

Which syntax to choose for Unicode string accesses? Maybe

s.[[k]] ?

As far as I can see, normal strings would still be used for I/O, only
that it is now easy to decode them as UTF-8. There is one difficulty,
though. Imagine you read a file block by block, where a block has a
fixed length in bytes, and the file contains UTF-8-encoded data. It can
now happen that the end of a block is not at the end of a character.
For that reason you need special decoding functions that can deal with
that.
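
A minimal sketch of such a function, assuming the data is otherwise
well-formed UTF-8. The names incomplete_tail and split_block are
hypothetical. The first reports how many bytes at the end of a block
belong to a character that continues in the next block; the second
splits the block so the caller can carry those bytes over:

```ocaml
(* Number of trailing bytes of [block] that begin a UTF-8 sequence
   which the block does not complete (0 if the block ends cleanly). *)
let incomplete_tail (block : string) : int =
  (* Total sequence length announced by a lead byte, 0 otherwise. *)
  let expected c =
    if c < 0x80 then 1
    else if c land 0xE0 = 0xC0 then 2
    else if c land 0xF0 = 0xE0 then 3
    else if c land 0xF8 = 0xF0 then 4
    else 0
  in
  (* Walk back over at most three continuation bytes (10xxxxxx)
     to reach the last lead byte; [back] counts the bytes skipped. *)
  let rec find i back =
    if i < 0 || back > 3 then 0
    else
      let c = Char.code block.[i] in
      if c land 0xC0 = 0x80 then find (i - 1) (back + 1)
      else if expected c > back + 1 then back + 1
      else 0
  in
  find (String.length block - 1) 0

(* Split a block into the part that can be decoded now and the
   bytes to prepend to the next block. *)
let split_block (block : string) : string * string =
  let k = incomplete_tail block in
  let n = String.length block - k in
  (String.sub block 0 n, String.sub block n k)

let () =
  (* The block ends in the middle of the two-byte sequence for U+0141. *)
  let ready, carry = split_block "ab\xC5" in
  Printf.printf "decodable bytes: %d, carried over: %d\n"
    (String.length ready) (String.length carry)
```

Malformed input is deliberately not diagnosed here; a real decoder
would also have to decide how to report invalid bytes.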

There are probably other functions where you would like to have Unicode
support directly, e.g. int_of_uni_string.

> Is this enough for a GSoc proposal (seems a
> little light to me)? 

Honestly, this is not the type of work that is well-suited for GSoC. You
only need to change code, not develop something entirely new. Also, you
need to dig into very different parts of the code base - and you need
broad knowledge of how everything interacts.

This might be better done as a community project with some moderation
from INRIA. There could be a master plan, with several people taking
over tasks.

Gerd

>  If it is done, is there a chance to have this
> work included in the standard distribution?
> 
> Best,
> C.
> 


-- 
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------


* [Caml-list] Re: GSoC: better UTF-8 support
  2011-02-28 14:13 ` Gerd Stolpmann
@ 2011-02-28 14:31   ` Sylvain Le Gall
  2011-02-28 15:09   ` [Caml-list] " Dario Teixeira
  2011-02-28 15:50   ` David Allsopp
  2 siblings, 0 replies; 20+ messages in thread
From: Sylvain Le Gall @ 2011-02-28 14:31 UTC (permalink / raw)
  To: caml-list

On 28-02-2011, Gerd Stolpmann <info@gerd-stolpmann.de> wrote:
> Am Montag, den 28.02.2011, 09:35 +0100 schrieb Christophe TROESTLER:
>
>> Is this enough for a GSoc proposal (seems a
>> little light to me)? 
>
> Honestly, this is not the type of work that is well-suited for GSoc. You
> need to only change code, not develop something entirely new. Also, you
> need to dig into very different parts of the code base - and you need
> broad knowledge how everything interacts.
>

I am not a GSoC insider, but projects for GSoC can perfectly well be
about extending already existing projects. We even decided to focus on
ideas that extend existing projects for OCaml's GSoC. The point is that
a new project started during GSoC is more likely to stop being
developed right after GSoC, whereas an already existing project has a
chance to live further...

Cheers,
Sylvain Le Gall

* Re: [Caml-list] GSoC: better UTF-8 support
  2011-02-28 14:13 ` Gerd Stolpmann
  2011-02-28 14:31   ` [Caml-list] " Sylvain Le Gall
@ 2011-02-28 15:09   ` Dario Teixeira
  2011-02-28 15:50   ` David Allsopp
  2 siblings, 0 replies; 20+ messages in thread
From: Dario Teixeira @ 2011-02-28 15:09 UTC (permalink / raw)
  To: Christophe TROESTLER, Gerd Stolpmann; +Cc: OCaml Mailing List

Hi,

> Probably you would choose uni_char=int as representation for characters,
> but for strings there are several possibilities:
> 
> - 16 bits/char: This path is taken by other languages, but only a 
>   subset of Unicode chars can be represented directly

I think this particular representation should be discarded upfront.
It gives the illusion of proper Unicode support, when in fact it is
fundamentally broken.

Cheers,
Dario Teixeira



      


* RE: [Caml-list] GSoC: better UTF-8 support
  2011-02-28 14:13 ` Gerd Stolpmann
  2011-02-28 14:31   ` [Caml-list] " Sylvain Le Gall
  2011-02-28 15:09   ` [Caml-list] " Dario Teixeira
@ 2011-02-28 15:50   ` David Allsopp
  2011-03-01  5:49     ` [Caml-list] " Yoriyuki Yamagata
  2 siblings, 1 reply; 20+ messages in thread
From: David Allsopp @ 2011-02-28 15:50 UTC (permalink / raw)
  To: 'Gerd Stolpmann', 'Christophe TROESTLER'
  Cc: 'OCaml Mailing List'

Gerd Stolpmann wrote:
> Am Montag, den 28.02.2011, 09:35 +0100 schrieb Christophe TROESTLER:
> > Hi,
> >
> > Starting from an idea on the Ocsigen mailing list, it was suggested
> > that better support for UTF-8 in the tools would be of interest to
> > several people.  In particular, the following points were identified:
> >
> > - A flag (-utf8 ?) to the compilers should be added so that errors
> >   locations are correct in presence of UTF-8 strings [the programmer
> >   restricting himself to ASCII identifiers].
> >
> > - ocamldoc: while an UTF-8 aware doc-generator is very easy to write,
> >   it would be nice to be able to parametrize any of them with the
> >   correct charset (using again the -utf8 flag ?)
> >
> > - UTF8.Char and UTF8.String modules should be written with the same
> >   interface as Char and String.  [Camomile should be adapted
> >   consequently.]
> 
> Well, UTF-8 is the wrong term here. What you need on this level are
> Unicode modules, where a uni_char can contain all Unicode code points,
> and a uni_string is an array of such uni_char's.
> 
> UTF-8 is a variable-length encoding of Unicode for I/O. It is not well suited
> for string manipulation, at least if you want efficient support for
> index-based access, because the length of the char representation is not
> constant.
> 
> Probably you would choose uni_char=int as representation for characters,
> but for strings there are several possibilities:
> 
> - 16 bits/char: This path is taken by other languages, but only a
>   subset of Unicode chars can be represented directly
> - 24 bits/char: All Unicode chars can be represented (range is
>   0 to 0x10ffff), but you need to multiply by 3 to access by index.
>   This multiplication is relatively cheap (one bit shift plus
>   one addition).
> - 32 bits/char: A slight waste of RAM but very efficient access by
>   index
> - int/char: same as 32 bits/char for 32-bit platforms but
>   64 bits/char for 64-bit. Probably no good choice.
> 
> Of course, there should also be conversions from/to normal chars/strings,
> and this is the place where UTF-8 comes into play.
> 
> Another comment: for supporting lowercase/uppercase conversion one needs
> lookup tables. Not really big tables, because only a small fraction of
> the Unicode chars has this variation.

Although you could reasonably exclude case conversion functions if you wanted (of course, if you're trying to be totally compatible with Char/String then they'd have to be implemented, since those modules have them). Not providing a function doesn't make a library half-baked, as long as there's the capability to implement it on top of the functions you do provide.
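
Coming back to the representations in Gerd's list quoted above, the 24 bits/char option can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Bytes (OCaml 4.02 or later, so not available when this thread was written), stores each code point big-endian, and the names ustring24, create, length, set and get are invented here.

```ocaml
(* Hypothetical packed representation: 3 bytes per code point, big-endian. *)
type ustring24 = Bytes.t

(* A string of [n] code points, all initially U+0000. *)
let create (n : int) : ustring24 = Bytes.make (3 * n) '\000'

let length (s : ustring24) : int = Bytes.length s / 3

let set (s : ustring24) (i : int) (cp : int) : unit =
  if cp < 0 || cp > 0x10FFFF then invalid_arg "set: code point out of range";
  let o = (i lsl 1) + i in   (* 3 * i as one shift plus one addition *)
  Bytes.set s o (Char.chr (cp lsr 16));
  Bytes.set s (o + 1) (Char.chr ((cp lsr 8) land 0xFF));
  Bytes.set s (o + 2) (Char.chr (cp land 0xFF))

let get (s : ustring24) (i : int) : int =
  let o = (i lsl 1) + i in
  (Char.code (Bytes.get s o) lsl 16)
  lor (Char.code (Bytes.get s (o + 1)) lsl 8)
  lor Char.code (Bytes.get s (o + 2))
```

Index access stays constant-time, at the cost of three byte reads per character; out-of-range indices are rejected by the bounds checks in Bytes.get and Bytes.set.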

> One should also think about whether other properties of the Unicode
> character database should be made available. E.g. character classes.
> This could also live in add-on libraries, but it is worth discussing.

Personally, I'd say they could safely live in other libraries. The advantage of having basic Unicode string handling (length, character retrieval, simple operations over Unicode-character offsets, etc.) in the standard library is that the standard library can use the representation itself and other libraries can be updated to work with it (that's what we have modules and functors for, after all).

The fact that OCaml's I/O functions can interface with Unicode-based file systems means, to me, that OCaml absolutely must support Unicode (at least as a future target). The status quo of being unable, for example, to query accurately the number of characters in a filename returned by Unix.readdir is not in any way desirable. And that's to say nothing of the fact that if the Windows ports used the wide versions of the Win32 API instead, you'd have Unix.readdir returning UTF-8 strings on *nix and 16-bit wchar strings on Windows, so you'd have lost platform independence as well!

Fixing that does not require a fully featured Unicode library, and wishing for one seems a bit silly as a) it's exceedingly unlikely to happen and b) OCaml already has several very good libraries for full-blown Unicode.
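
As an illustration of the filename example, here is a minimal sketch of the kind of basic operation meant here. It assumes the string holds well-formed UTF-8 (as a *nix file system would typically return) and counts code points, which is not the same as user-perceived characters when combining marks are involved. The name utf8_length is hypothetical.

```ocaml
(* Count code points in a string assumed to hold well-formed UTF-8:
   every byte that is not a continuation byte (10xxxxxx) starts one. *)
let utf8_length (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
  !n

(* "café.txt" written with explicit UTF-8 bytes for the e-acute. *)
let name = "caf\xc3\xa9.txt"

let () =
  Printf.printf "bytes: %d, code points: %d\n"
    (String.length name) (utf8_length name)
```

String.length reports 9 for this name because it counts bytes, while the code point count is 8.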

David

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Caml-list] Re: GSoC: better UTF-8 support
  2011-02-28 15:50   ` David Allsopp
@ 2011-03-01  5:49     ` Yoriyuki Yamagata
  0 siblings, 0 replies; 20+ messages in thread
From: Yoriyuki Yamagata @ 2011-03-01  5:49 UTC (permalink / raw)
  To: Sylvain Le Gall, Caml List

Sorry, I didn't notice this thread since caml-list did not reach me
for some time.  So allow me to jump into the discussion.

I think the entire discussion went a bit astray.  It seems to me
that the argument is heading toward specifying the project details as
much as possible.  In my experience as a GSoC mentor (yes, I was one),
students often come up with better ideas than their mentors.  Therefore,
instead of specifying the details, we'd better specify a general
direction and let the students decide.

As the general direction, I think we need
1) lightweight stdlib replacement: data types for Unicode chars and
strings, extensible character encoding, and simple IO.  Interfaces
should be purely functional as far as possible.  For example, strings
will be immutable, IO is monadic, etc.

2) minimal language extension: Unicode character and string literals,
Unicode-aware toplevel (pretty printing), etc.

This is by no means complete support for Unicode, but with this in
stdlib we can add more features of the Unicode standard (through, for
example, Camomile) or modify third-party libraries to use Unicode.
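
For illustration only, point 1) might look something like the sketch below: an abstract, immutable string of code points with a conversion to UTF-8. The module name Ustring and all of its functions are invented for this example and are not taken from Camomile or any proposal in this thread; it assumes OCaml 4.06 or later for Buffer.add_utf_8_uchar, and leaves out the IO side entirely.

```ocaml
(* Hypothetical interface: the abstract type keeps values immutable,
   since no function exposed here can modify a string in place. *)
module Ustring : sig
  type t                          (* immutable sequence of code points *)
  val of_list : int list -> t     (* raises Invalid_argument on bad input *)
  val length : t -> int           (* number of code points *)
  val get : t -> int -> int       (* code point at an index *)
  val to_utf8 : t -> string       (* encode for I/O *)
end = struct
  type t = int array

  (* Unicode scalar values: 0..0x10FFFF minus the surrogate range. *)
  let valid cp =
    cp >= 0 && cp <= 0x10FFFF && not (cp >= 0xD800 && cp <= 0xDFFF)

  let of_list l =
    if List.for_all valid l then Array.of_list l
    else invalid_arg "Ustring.of_list: invalid code point"

  let length = Array.length
  let get = Array.get

  let to_utf8 s =
    let b = Buffer.create (Array.length s) in
    Array.iter (fun cp -> Buffer.add_utf_8_uchar b (Uchar.of_int cp)) s;
    Buffer.contents b
end
```

Other encodings could be added as further to_/of_ functions without changing the representation, which is one way to read "extensible character encoding".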

Best,

-- 
Yoriyuki Yamagata
yoriyuki.y@gmail.com
http://sites.google.com/site/yoriyukiy/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Caml-list] GSoC: better UTF-8 support
  2011-02-28  8:35 [Caml-list] GSoC: better UTF-8 support Christophe TROESTLER
                   ` (2 preceding siblings ...)
  2011-02-28 14:13 ` Gerd Stolpmann
@ 2011-02-28 14:21 ` Michael Ekstrand
  2011-03-03 15:37 ` Damien Doligez
  4 siblings, 0 replies; 20+ messages in thread
From: Michael Ekstrand @ 2011-02-28 14:21 UTC (permalink / raw)
  To: caml-list

On 02/28/2011 02:35 AM, Christophe TROESTLER wrote:
> - UTF8.Char and UTF8.String modules should be written with the same
>   interface as Char and String.  [Camomile should be adapted
>   consequently.]

If this project is undertaken, then IMO the prospective student should
also consult the Batteries and Extlib UTF8 modules, mostly based on
Camomile's UTF8, so that new UTF8-specific functions are not needlessly
incompatible with code written against Batteries, Extlib, or Camomile. 
This shouldn't be very difficult - Extlib and Batteries basically
simplify and extend Camomile for UTF-8 handling - but should still be
considered.

- Michael

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Caml-list] GSoC: better UTF-8 support
  2011-02-28  8:35 [Caml-list] GSoC: better UTF-8 support Christophe TROESTLER
                   ` (3 preceding siblings ...)
  2011-02-28 14:21 ` [Caml-list] " Michael Ekstrand
@ 2011-03-03 15:37 ` Damien Doligez
  2011-03-03 16:42   ` Dario Teixeira
  4 siblings, 1 reply; 20+ messages in thread
From: Damien Doligez @ 2011-03-03 15:37 UTC (permalink / raw)
  To: OCaml Mailing List


On 2011-02-28, at 09:35, Christophe TROESTLER wrote:

> - Printf/Scanf: %U or %cu for UTF8.Char.t

It cannot be %cu because that would break the following code:

    Printf.printf "Ct%cul%cu fhtagn\n" 'h' 'h';;
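
A small sketch of the point, using sprintf so the result can be inspected: "%cu" already means "%c followed by a literal u". The second half shows one possible way to print a code point with the existing %a conversion instead of a new directive; utf8_of_uchar is a hypothetical helper, and Buffer.add_utf_8_uchar assumes OCaml 4.06 or later.

```ocaml
(* Today "%cu" is parsed as the %c conversion followed by the letter u. *)
let greeting = Printf.sprintf "Ct%cul%cu fhtagn" 'h' 'h'

(* Hypothetical helper: UTF-8 encode a single Unicode character. *)
let utf8_of_uchar (u : Uchar.t) : string =
  let b = Buffer.create 4 in
  Buffer.add_utf_8_uchar b u;
  Buffer.contents b

(* With sprintf, %a takes a printer of type unit -> 'a -> string. *)
let e_acute =
  Printf.sprintf "%a" (fun () u -> utf8_of_uchar u) (Uchar.of_int 0xE9)

let () = print_endline greeting
```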

-- Damien


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Caml-list] GSoC: better UTF-8 support
  2011-03-03 15:37 ` Damien Doligez
@ 2011-03-03 16:42   ` Dario Teixeira
  0 siblings, 0 replies; 20+ messages in thread
From: Dario Teixeira @ 2011-03-03 16:42 UTC (permalink / raw)
  To: OCaml Mailing List, Damien Doligez

Hi,

> It cannot be %cu because that would break the following
> code:
> 
>     Printf.printf "Ct%cul%cu fhtagn\n" 'h' 'h';;

And anything that breaks Cthulhu's sleep would have such tremendous
side-effects that it would upset even us "impure" ML guys...

/off-topic

Cheers,
Dario

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2011-03-03 16:42 UTC | newest]

Thread overview: 20+ messages
2011-02-28  8:35 [Caml-list] GSoC: better UTF-8 support Christophe TROESTLER
2011-02-28  8:58 ` Daniel Bünzli
2011-02-28 10:07   ` David Allsopp
2011-02-28 11:21     ` Daniel Bünzli
2011-02-28 11:46       ` David Allsopp
2011-02-28 12:32         ` Daniel Bünzli
2011-02-28 12:59           ` [Caml-list] " Sylvain Le Gall
2011-02-28 10:59   ` Sylvain Le Gall
2011-02-28 14:39   ` [Caml-list] " David Rajchenbach-Teller
2011-02-28 10:07 ` David Allsopp
     [not found]   ` <20110228.143157.1265982603697554449.Christophe.Troestler+ocaml@umons.ac.be>
2011-02-28 14:11     ` Daniel Bünzli
2011-02-28 14:57       ` Dario Teixeira
2011-02-28 14:13 ` Gerd Stolpmann
2011-02-28 14:31   ` [Caml-list] " Sylvain Le Gall
2011-02-28 15:09   ` [Caml-list] " Dario Teixeira
2011-02-28 15:50   ` David Allsopp
2011-03-01  5:49     ` [Caml-list] " Yoriyuki Yamagata
2011-02-28 14:21 ` [Caml-list] " Michael Ekstrand
2011-03-03 15:37 ` Damien Doligez
2011-03-03 16:42   ` Dario Teixeira
