* [Caml-list] GSoC: better UTF-8 support
From: Christophe TROESTLER @ 2011-02-28 8:35 UTC
To: OCaml Mailing List

Hi,

Starting from an idea on the Ocsigen mailing list, it was suggested
that better support for UTF-8 in the tools would be of interest to
several people. In particular, the following points were identified:

- A flag (-utf8 ?) to the compilers should be added so that error
  locations are correct in the presence of UTF-8 strings [the programmer
  restricting himself to ASCII identifiers].

- ocamldoc: while a UTF-8 aware doc-generator is very easy to write,
  it would be nice to be able to parametrize any of them with the
  correct charset (using again the -utf8 flag ?)

- UTF8.Char and UTF8.String modules should be written with the same
  interface as Char and String. [Camomile should be adapted
  consequently.]

- Printf/Scanf: %U or %cu for UTF8.Char.t

- Graphics: UTF-8 text printing

- Str: (character ranges)

The questions are: would such changes be beneficial to you? Are there
other issues to address? Is this enough for a GSoC proposal (seems a
little light to me)? If it is done, is there a chance to have this
work included in the standard distribution?

Best,
C.
* Re: [Caml-list] GSoC: better UTF-8 support
From: Daniel Bünzli @ 2011-02-28 8:58 UTC
To: Christophe TROESTLER; +Cc: OCaml Mailing List

> - A flag (-utf8 ?) to the compilers should be added so that error
>   locations are correct in the presence of UTF-8 strings [the programmer
>   restricting himself to ASCII identifiers].

Alain mentioned that the patch would only be a few lines long.

> - ocamldoc: while a UTF-8 aware doc-generator is very easy to write,
>   it would be nice to be able to parametrize any of them with the
>   correct charset (using again the -utf8 flag ?)

http://caml.inria.fr/mantis/bug_view_page.php?bug_id=5066

> - UTF8.Char and UTF8.String modules should be written with the same
>   interface as Char and String. [Camomile should be adapted
>   consequently.]

Is it a good idea to replicate the poor interface that the Char and
String modules offer for manipulating strings?

> - Graphics: UTF-8 text printing

Are there really a lot of people using the Graphics module?

> - Str: (character ranges)

This would be the only interesting thing to me.

Daniel
* RE: [Caml-list] GSoC: better UTF-8 support
From: David Allsopp @ 2011-02-28 10:07 UTC
To: 'Daniel Bünzli', 'Christophe TROESTLER'
Cc: 'OCaml Mailing List'

Daniel Bünzli wrote:
> > - UTF8.Char and UTF8.String modules should be written with the same
> >   interface as Char and String. [Camomile should be adapted
> >   consequently.]
>
> Is it a good idea to replicate the poor interface that the Char and
> String modules offer for manipulating strings?

If it's to go into the standard library then yes, it should exactly
replicate the interface of the Char and String modules. That way,
within the standard library, UTF8 can be used as a drop-in replacement
(via a module or open statement at the top of some code). Anything
else would be a very bad idea - two different interfaces over two
representations of the same abstract concept (i.e. strings) would be
far worse than one slightly inferior interface used on both. Out of
interest, what are your complaints against the String and Char
modules - missing functions or something deeper? Anything else sounds
like too wild a change to have a cat in hell's chance of getting into
the standard library as a patch.

> > - Graphics: UTF-8 text printing
>
> Are there really a lot of people using the Graphics module?

Again, if the idea is to get UTF-8 support properly into the
*standard* library then all aspects of string processing within the
whole of the standard library should support it properly.

> > - Str: (character ranges)
>
> This would be the only interesting thing to me.

It is mildly amusing that you criticise the String and Char modules,
yet are interested in this module, given that Pcre is so often
recommended as the practical/sensible one to use (I know that Str is a
lot faster than it used to be, and indeed I use it myself when I don't
want the external dependency) :o)

David
* Re: [Caml-list] GSoC: better UTF-8 support
From: Daniel Bünzli @ 2011-02-28 11:21 UTC
To: David Allsopp; +Cc: Christophe TROESTLER, OCaml Mailing List

> If it's to go into the standard library then yes, it should exactly
> replicate the interface of the Char and String modules, that way
> within the standard library UTF8 can be used as a drop-in
> replacement [...]

I'm not sure many programs would actually benefit from that. At a
certain point, if you really want to process Unicode at the character
level, you'll need a proper library. Using these ASCII/Latin-1 oriented
interfaces to process Unicode at the character level would be
debilitating and frustrating for your final users (e.g. no treatment
of normal forms; you do realize that in Unicode there's more than one
way of representing the character 'é'?).

The current status quo already allows you to handle UTF-8 encoded
strings as long as you don't try to look into them at the character
level, which is fine for many programs.

> Out of interest, what are your complaints against the String and
> Char modules - missing functions or something deeper?

Every time I have to explode a string at a given separator and want to
use only the standard library, I complain.

> It is mildly amusing that you criticise the String and Char modules,
> yet have interest in this module given [...]

The thing is that this support could be included without changing the
interfaces at all. Only the regexp language needs to be extended (and
I guess the underlying implementation wouldn't have to be changed).

Best,
Daniel
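To make the two points in the message above concrete, here is a minimal OCaml sketch. It is an illustration only, not code from the thread: the names e_precomposed, e_decomposed and split_on are hypothetical, and the `match ... with exception` form assumes a compiler much newer than the 3.12 release discussed here. It shows that the precomposed and decomposed encodings of 'é' are different byte strings, and one way to split a string at a separator with only String.index_from and String.sub.

```ocaml
(* Two UTF-8 byte sequences that both render as 'é'. *)
let e_precomposed = "\xc3\xa9"   (* U+00E9 *)
let e_decomposed = "e\xcc\x81"   (* U+0065 followed by U+0301 (combining acute) *)

(* Hypothetical helper: split [s] at every occurrence of the byte [sep],
   using only String.index_from and String.sub. *)
let split_on (sep : char) (s : string) : string list =
  let rec go start acc =
    match String.index_from s start sep with
    | i -> go (i + 1) (String.sub s start (i - start) :: acc)
    | exception Not_found ->
        List.rev (String.sub s start (String.length s - start) :: acc)
  in
  go 0 []

let () =
  (* Byte-level equality does not see that both strings mean 'é'. *)
  Printf.printf "byte-equal: %b\n" (e_precomposed = e_decomposed);
  List.iter print_endline (split_on ',' "caf\xc3\xa9,th\xc3\xa9")
```

Splitting on an ASCII separator is safe on UTF-8 data because bytes below 0x80 never occur inside a multi-byte sequence. Comparing the two forms of 'é' as equal would need normalization, which this sketch does not attempt.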
* RE: [Caml-list] GSoC: better UTF-8 support
From: David Allsopp @ 2011-02-28 11:46 UTC
To: 'Daniel Bünzli'
Cc: 'Christophe TROESTLER', 'OCaml Mailing List'

Daniel Bünzli wrote:
> > If it's to go into the standard library then yes, it should exactly
> > replicate the interface of the Char and String modules, that way
> > within the standard library UTF8 can be used as a drop-in
> > replacement [...]
>
> I'm not sure many programs would actually benefit from that. At a
> certain point if you really want to process unicode at the character
> level you'll need a proper library.

Yes, and at that point you'd use one. Not all programs need to analyse
strings at a character level. For example, it would be nice for this
to work within just the standard library:

    D:\>md "Paweł Łukaszewski"
    D:\>cd "Paweł Łukaszewski"
    D:\Paweł Łukaszewski>ocaml
            Objective Caml version 3.11.2

    # Sys.getcwd();;
    - : string = "D:\\Pawel Lukaszewski"

This happens because an internal conversion is done to hack around
Unicode filenames. Having any representation of Unicode in the
standard library would mean that these and other functions would no
longer have to return incorrect answers, as in this case; instead, the
standard library would have a default mechanism for dealing with
Unicode. At the moment, there's nothing the standard library can do
with this Unicode filename because there's no standardised (within the
library) representation available for it.

Consider it as being similar to something like the Digest module. It's
only really there because digests are used in .cmi files within the
compiler. If you're serious about using digests in an application then
you have to use an external library, because MD5 is not usually enough
and you'll need other algorithms. But the module is still potentially
useful, so it's better that it's publicly visible in the standard
library and not just an internal module of the compiler. The same
would apply to this simple native support for UTF-8: it would allow
other parts of the standard library to work properly and, for simple
applications, would allow Unicode strings to be handled.

> Using these ascii/latin1 oriented interfaces to process unicode at
> the character level would be debilitating and frustrating for your
> final users (e.g. no treatment of normal forms, you do realize that
> in unicode there's more than one way of representing the
> character 'é').

Fully aware - but just because you need to work with strings does
*not* imply you ever need even to compare them. Granted, the
documentation may note that to perform a canonical comparison you'll
need a third party library, but I still maintain that having basic
support is better and more usable than having none. The above issue,
for example, means I can't even manipulate a directory tree in OCaml
on my system, which shouldn't even be related to string/text
processing!

David
* Re: [Caml-list] GSoC: better UTF-8 support
From: Daniel Bünzli @ 2011-02-28 12:32 UTC
To: David Allsopp; +Cc: Christophe TROESTLER, OCaml Mailing List

> D:\>md "Paweł Łukaszewski"
> D:\>cd "Paweł Łukaszewski"
> D:\Paweł Łukaszewski>ocaml
>         Objective Caml version 3.11.2
>
> # Sys.getcwd();;
> - : string = "D:\\Pawel Lukaszewski"

1) That's a very different problem from defining new Char and String
   like modules for UTF-8 encoded strings.
2) Is that a Windows problem? Here on OS X:

    > mkdir Łukaszewski
    > cd Łukaszewski
    > rlwrap ocaml
            Objective Caml version 3.12.0

    # Sys.getcwd ();;
    - : string = "/private/tmp/?\129ukaszewski"
    # Char.code (Sys.getcwd ()).[13];;
    - : int = 197
    # Char.code (Sys.getcwd ()).[14];;
    - : int = 129

So we have 0xC5 0x81 for Ł, which is the right UTF-8 representation
for it. I'm currently not up to date on the problem of Unicode encoded
filenames in OCaml, but isn't that something that should be handled by
the underlying libc?

Note, maybe a nice addition for the GSoC project would be to add an
option to ocaml so that it doesn't escape the bytes 127 to 159 when it
prints strings, allowing your UTF-8 aware tty to display UTF-8 encoded
strings correctly, not as above.

> Fully aware - but just because you need to work with strings does
> *not* imply you ever need even to compare them. Granted, the
> documentation may note that to perform a canonical comparison you'll
> need a third party library but I still maintain that having basic
> support is better and more usable than having none [...]

But then, if you only need byte-level comparison, String.compare is
fine. So I really don't see the benefits of these UTF-8 Char and
String like modules.

Best,
Daniel
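The byte values in the session above (197 and 129, i.e. 0xC5 0x81) can be checked against the code point of Ł with a few lines of OCaml. This is a sketch under stated assumptions: decode_2byte is a hypothetical helper that only handles well-formed two-byte sequences and performs no validation.

```ocaml
(* "Ł" (U+0141) as a UTF-8 encoded OCaml string. *)
let l_stroke = "\xc5\x81"

(* Hypothetical helper: decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx)
   starting at the beginning of [s]. No validation is performed. *)
let decode_2byte (s : string) : int =
  let b0 = Char.code s.[0] and b1 = Char.code s.[1] in
  ((b0 land 0x1F) lsl 6) lor (b1 land 0x3F)

let () =
  Printf.printf "bytes: %d %d\n" (Char.code l_stroke.[0]) (Char.code l_stroke.[1]);
  Printf.printf "code point: U+%04X\n" (decode_2byte l_stroke)
```

The second line printed is U+0141, the code point of LATIN CAPITAL LETTER L WITH STROKE, which is consistent with the conclusion in the message that the bytes returned by Sys.getcwd are correct UTF-8.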
* [Caml-list] Re: GSoC: better UTF-8 support
From: Sylvain Le Gall @ 2011-02-28 12:59 UTC
To: caml-list

On 28-02-2011, Daniel Bünzli <daniel.buenzli@erratique.ch> wrote:
>> D:\>md "Paweł Łukaszewski"
>> D:\>cd "Paweł Łukaszewski"
>> D:\Paweł Łukaszewski>ocaml
>>         Objective Caml version 3.11.2
>>
>> # Sys.getcwd();;
>> - : string = "D:\\Pawel Lukaszewski"
>
> 1) That's very different problem from defining new Char and String
> like modules for UTF-8 encoded strings.
> 2) Is that a windows problems ? Here on osx :
>

I think it is a Windows issue, because on Windows you use either an
8-bit ANSI code page or UTF-16 (I think this is the encoding of wide
chars on Windows, though I am not sure). So you have two sets of
functions: xxxA and xxxW.

E.g. CreateDirectoryA and CreateDirectoryW

Cheers,
Sylvain Le Gall
--
My company: http://www.ocamlcore.com
Linkedin: http://fr.linkedin.com/in/sylvainlegall
Start an OCaml project here: http://forge.ocamlcore.org
OCaml blogs: http://planet.ocamlcore.org
* [Caml-list] Re: GSoC: better UTF-8 support
From: Sylvain Le Gall @ 2011-02-28 10:59 UTC
To: caml-list

Hello,

On 28-02-2011, Daniel Bünzli <daniel.buenzli@erratique.ch> wrote:
>> - A flag (-utf8 ?) to the compilers should be added so that error
>>   locations are correct in the presence of UTF-8 strings [the programmer
>>   restricting himself to ASCII identifiers].
>
> Alain mentioned that the patch would only be a few lines long.
>

Alain Frisch is not the kind of student we will have for GSoC... Let's
say that it can take a while for an average student to reach 1% of
Alain's level with respect to OCaml. So these few lines can take a
while to be produced.

I think the whole task makes sense for a GSoC and will be enough for a
full GSoC for a normal student.

Cheers,
Sylvain Le Gall
* Re: [Caml-list] GSoC: better UTF-8 support
From: David Rajchenbach-Teller @ 2011-02-28 14:39 UTC
To: Daniel Bünzli; +Cc: Christophe TROESTLER, OCaml Mailing List

Don't forget to check OCaml Batteries Included. Some of the work is
already done (including an extended printf that handles UTF-8 and is
further user-extensible).

Cheers,
David
* RE: [Caml-list] GSoC: better UTF-8 support
From: David Allsopp @ 2011-02-28 10:07 UTC
To: 'Christophe TROESTLER', 'OCaml Mailing List'

Christophe TROESTLER wrote:
> - UTF8.Char and UTF8.String modules should be written with the same
>   interface as Char and String. [Camomile should be adapted
>   consequently.]

Thinking of conventions like Unix/Pervasives.LargeFile,
Bigarray.Genarray, Bigarray.Array1, etc., wouldn't it be better for
these to be Char.UTF8 and String.UTF8?

> - Printf/Scanf: %U or %cu for UTF8.Char.t
>
> - Graphics: UTF-8 text printing
>
> - Str: (character ranges)

If UTF-8 support is added to the standard library then it should be
added everywhere strings are manipulated or used - which raises the
potentially ugly prospect of the Unix module?

> The questions are: would such changes be beneficial to you?

Personally, yes - it's an annoying limitation that you have to pull in
a 3rd party library when all you want to do is handle a couple of
accented characters accurately (my point being that not every
application which needs UTF-8 needs it as a priority feature, nor is
every such application manipulating terabytes of data and so in need
of completely optimised processing). IMO it'd be better to have a
standard library supporting only one particular Unicode encoding, with
a perhaps imperfect interface over a non-optimal storage
representation, than to have no support whatsoever, especially given
that there are very good 3rd party libraries which provide the optimal
(and with it, slightly more complex) implementations.

> Are there other issues to address?

I found this very old archive thread but it still raises some
potentially relevant points:
http://caml.inria.fr/pub/old_caml_site/caml-list/1224.html

> Is this enough for a GSoC proposal (seems a little light to me)?

I would posit that if this included the Unix module then it's a very
big proposal!

> If it is done, is there a chance to have this work included in the
> standard distribution?

If the patches themselves are as potentially small as suggested (so
maintenance issues aren't vastly increased) and the interfaces remain
compatible (so nothing breaks) then it seems reasonable to hope,
doesn't it?

David
[parent not found: <20110228.143157.1265982603697554449.Christophe.Troestler+ocaml@umons.ac.be>]
* Re: [Caml-list] GSoC: better UTF-8 support
From: Daniel Bünzli @ 2011-02-28 14:11 UTC
To: Christophe TROESTLER; +Cc: OCaml Mailing List

> Thinking more about this, one could introduce a new type (say “utf8”
> or “ustring”) for these UTF-8 strings. It should be compatible with
> the way UTF-8 strings are handled on the C side for interoperability
> but “optimized”, e.g. should they contain their length (number of
> unicode chars)?
>
> Another thing: it could be a nice way to transition to *immutable*
> unicode strings. This is not possible for (standard) strings because,
> as you all know, they are both used as strings and as buffers. The
> introduction of unicode strings may be the right opportunity to
> distinguish both [1].

Frankly I see no benefit in introducing this half-baked UTF-8 support
into the standard library (which is what this proposal is about). This
will just bring more noise into the interfaces. Even worse, developers
will think they handle Unicode properly while in fact they do not,
bringing more confusion to an already confusing topic (I'm always
surprised how little programmers know about Unicode).

Again, pretending to support Unicode character-level processing by
replacing Latin-1 character-level processing the way you suggest is
just plain wrong. For me, either you:

1) Provide full Unicode support in the standard library, with at least
   normal form and collation support, in a new API separate from the
   current, existing String and Char modules.

2) Leave full Unicode support to a third party library and keep the
   current state, with some improvements for coping with UTF-8 encoded
   string literals and for interacting with file systems correctly.

Given the signals sent in the past by the OCaml dev team, 2) seems
more likely to be accepted.

Best,
Daniel
* Re: [Caml-list] GSoC: better UTF-8 support
From: Dario Teixeira @ 2011-02-28 14:57 UTC
To: caml-list

Hi,

> Frankly I see no benefit of introducing this half-baked UTF-8 support
> into the standard library (which is what this proposal is about).

I tend to agree with Daniel. In my mind I already rename the stdlib
types char/string into byte/blob. For many common operations (like
concatenation) it is perfectly safe to use these stdlib blob-oriented
functions with UTF-8 encoded strings. For others -- like accessing the
char at position N -- the blob-oriented functions will fail.

However, if your app needs UTF-8-aware functions, it is very likely
that sooner or later you will also need support for some of the more
complex aspects of Unicode, at which point a half-arsed implementation
will not suffice and you will need to link an external library anyway.

Therefore, either the stdlib should remain strictly blob-oriented
(which is fine by me), or it should get serious about Unicode support.

Cheers,
Dario Teixeira
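The byte/blob view described above can be seen directly with the stdlib string type. The following is a small sketch with made-up example values (cafe, greeting) and one hypothetical helper, is_continuation. Concatenating two valid UTF-8 strings gives a valid UTF-8 string, while indexing returns a byte that may be only part of a character.

```ocaml
(* "café" encoded in UTF-8: four characters, five bytes. *)
let cafe = "caf\xc3\xa9"

(* Concatenation works on bytes, so valid UTF-8 stays valid UTF-8. *)
let greeting = cafe ^ " au lait"

(* Hypothetical helper: true when [c] is a UTF-8 continuation byte (10xxxxxx). *)
let is_continuation (c : char) : bool = Char.code c land 0xC0 = 0x80

let () =
  Printf.printf "length in bytes: %d\n" (String.length cafe);
  (* cafe.[3] is the first byte of 'é', cafe.[4] the second: neither one
     is a character on its own. *)
  Printf.printf "byte 3 = 0x%02X, byte 4 is a continuation byte: %b\n"
    (Char.code cafe.[3]) (is_continuation cafe.[4])
```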
* Re: [Caml-list] GSoC: better UTF-8 support
From: Gerd Stolpmann @ 2011-02-28 14:13 UTC
To: Christophe TROESTLER; +Cc: OCaml Mailing List

On Monday, 28.02.2011, at 09:35 +0100, Christophe TROESTLER wrote:
> Hi,
>
> Starting from an idea on the Ocsigen mailing list, it was suggested
> that better support for UTF-8 in the tools would be of interest to
> several people. In particular, the following points were identified:
>
> - A flag (-utf8 ?) to the compilers should be added so that error
>   locations are correct in the presence of UTF-8 strings [the programmer
>   restricting himself to ASCII identifiers].
>
> - ocamldoc: while a UTF-8 aware doc-generator is very easy to write,
>   it would be nice to be able to parametrize any of them with the
>   correct charset (using again the -utf8 flag ?)
>
> - UTF8.Char and UTF8.String modules should be written with the same
>   interface as Char and String. [Camomile should be adapted
>   consequently.]

Well, UTF-8 is the wrong term here. What you need on this level are
Unicode modules, where a uni_char can contain all Unicode code points,
and a uni_string is an array of such uni_chars.

UTF-8 is a variable-length encoding of Unicode for I/O. It is not well
suited for string manipulation, at least if you want efficient support
for index-based access, because the length of the char representation
is not constant.

Probably you would choose uni_char=int as the representation for
characters, but for strings there are several possibilities:

- 16 bits/char: This path is taken by other languages, but only a
  subset of Unicode chars can be represented directly
- 24 bits/char: All Unicode chars can be represented (range is
  0 to 0x10ffff), but you need to multiply by 3 to access by index.
  This multiplication is relatively cheap (one bit shift plus
  one addition).
- 32 bits/char: A slight waste of RAM but very efficient access by
  index
- int/char: same as 32 bits/char for 32-bit platforms but
  64 bits/char for 64-bit. Probably not a good choice.

Of course, there should also be conversions from/to normal
chars/strings, and this is the place where UTF-8 comes into play.

Another comment: for supporting lowercase/uppercase conversion one
needs lookup tables. Not really big tables, because only a small
fraction of the Unicode chars has this variation.

One should also think about whether other properties of the Unicode
character database should be made available, e.g. character classes.
This could also live in add-on libraries, but it is worth discussing.

> - Printf/Scanf: %U or %cu for UTF8.Char.t

You also need string conversions for uni_string.

> - Graphics: UTF-8 text printing
>
> - Str: (character ranges)

Before talking about character ranges, Str needs to support single
Unicode characters properly. If it runs over a uni_string this is
trivial. If it runs directly over a UTF-8 encoded string this is
possible, but the algorithm needs to be adapted to cope with
multi-byte representations.

For full internationalization you need more than just character
ranges. There are also character classes (e.g. "all letters", "all
digits"), and a few other phenomena.

> The questions are: would such changes be beneficial to you?

I'd like it very much.

> Are there other issues to address?

You would probably also need a Unicode version of Buffer.

Which syntax to choose for Unicode string accesses? Maybe s.[[k]] ?

As far as I can see, normal strings would still be used for I/O, only
that it is now easy to decode them as UTF-8. There is one difficulty,
though. Imagine you read a file block by block, where a block has a
fixed length in bytes, and the file contains UTF-8-encoded data. It
can now happen that the end of a block is not at the end of a
character. For that reason you need special decoding functions that
can deal with that.

There are probably other functions where you would like to have
Unicode support directly, e.g. int_of_uni_string.

> Is this enough for a GSoC proposal (seems a little light to me)?

Honestly, this is not the type of work that is well-suited for GSoC.
You only need to change code, not develop something entirely new.
Also, you need to dig into very different parts of the code base - and
you need broad knowledge of how everything interacts.

This might be better done as a community project with some moderation
from INRIA. There could be a master plan, and several people could
take over tasks.

Gerd

> If it is done, is there a chance to have this work included in the
> standard distribution?

--
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt, Germany
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------
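The block-boundary difficulty described above can be sketched as follows. This is a simplified illustration rather than a proposed stdlib API: seq_len and complete_prefix_len are hypothetical names, and the input is assumed to be valid UTF-8 apart from a possibly truncated final character.

```ocaml
(* Expected length of a UTF-8 sequence given its first byte. Bytes that
   cannot start a sequence are treated as length 1 (a simplification). *)
let seq_len (b : int) : int =
  if b land 0x80 = 0 then 1
  else if b land 0xE0 = 0xC0 then 2
  else if b land 0xF0 = 0xE0 then 3
  else if b land 0xF8 = 0xF0 then 4
  else 1

(* Number of leading bytes of [block] that form only complete characters.
   The remaining bytes (at most 3) belong to a character that continues
   in the next block. *)
let complete_prefix_len (block : string) : int =
  let len = String.length block in
  (* Walk back over at most 3 trailing continuation bytes (10xxxxxx). *)
  let rec find_lead i steps =
    if i < 0 || steps > 3 then None
    else if Char.code block.[i] land 0xC0 <> 0x80 then Some i
    else find_lead (i - 1) (steps + 1)
  in
  match find_lead (len - 1) 0 with
  | None -> len
  | Some i -> if i + seq_len (Char.code block.[i]) > len then i else len

let () =
  (* A block that ends after the first byte of "Ł" (0xC5 0x81). *)
  let block = "ab\xc5" in
  let n = complete_prefix_len block in
  Printf.printf "decode %d bytes now, carry over %d\n" n (String.length block - n)
```

A reader loop would decode the first complete_prefix_len bytes of each block and prepend the leftover bytes to the next block before decoding it.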
* [Caml-list] Re: GSoC: better UTF-8 support
From: Sylvain Le Gall @ 2011-02-28 14:31 UTC
To: caml-list

On 28-02-2011, Gerd Stolpmann <info@gerd-stolpmann.de> wrote:
> On Monday, 28.02.2011, at 09:35 +0100, Christophe TROESTLER wrote:
>
>> Is this enough for a GSoC proposal (seems a little light to me)?
>
> Honestly, this is not the type of work that is well-suited for GSoC.
> You need to only change code, not develop something entirely new.
> Also, you need to dig into very different parts of the code base -
> and you need broad knowledge how everything interacts.
>

I am not a GSoC insider, but projects for GSoC can perfectly well be
about extending an already existing project. We even decided to focus
on ideas that extend existing projects for OCaml's GSoC.

The point is that a new project started during GSoC is more likely to
stop being developed right after GSoC, whereas an already existing
project has a chance to live on...

Cheers,
Sylvain Le Gall
* Re: [Caml-list] GSoC: better UTF-8 support
From: Dario Teixeira @ 2011-02-28 15:09 UTC
To: Christophe TROESTLER, Gerd Stolpmann; +Cc: OCaml Mailing List

Hi,

> Probably you would choose uni_char=int as representation for characters,
> but for strings there are several possibilities:
>
> - 16 bits/char: This path is taken by other languages, but only a
>   subset of Unicode chars can be represented directly

I think this particular representation should be discarded upfront.
It gives the illusion of proper Unicode support, when in fact it is
fundamentally broken.

Cheers,
Dario Teixeira
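One way to see why a fixed 16 bits/char array cannot hold every code point is to compute the UTF-16 code units for a character outside the first 65536 code points. The sketch below is an illustration only: utf16_units is a hypothetical helper and it does not check that its argument is a valid code point.

```ocaml
(* Hypothetical helper: the 16-bit code units UTF-16 needs for [cp].
   Code points from 0x10000 upwards need two units (a surrogate pair). *)
let utf16_units (cp : int) : int list =
  if cp < 0x10000 then [cp]
  else
    let v = cp - 0x10000 in
    [0xD800 lor (v lsr 10); 0xDC00 lor (v land 0x3FF)]

let () =
  let show cp =
    Printf.printf "U+%04X ->" cp;
    List.iter (Printf.printf " %04X") (utf16_units cp);
    print_newline ()
  in
  show 0x0141;   (* Ł: one unit *)
  show 0x1F600   (* outside the 16-bit range: two units *)
```

With two units for some characters, "character at index k" is no longer a constant-time array access, which is the same issue the thread raises for UTF-8.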
* RE: [Caml-list] GSoC: better UTF-8 support 2011-02-28 14:13 ` Gerd Stolpmann 2011-02-28 14:31 ` [Caml-list] " Sylvain Le Gall 2011-02-28 15:09 ` [Caml-list] " Dario Teixeira @ 2011-02-28 15:50 ` David Allsopp 2011-03-01 5:49 ` [Caml-list] " Yoriyuki Yamagata 2 siblings, 1 reply; 20+ messages in thread From: David Allsopp @ 2011-02-28 15:50 UTC (permalink / raw) To: 'Gerd Stolpmann', 'Christophe TROESTLER' Cc: 'OCaml Mailing List' Gerd Stolpmann wrote: > Am Montag, den 28.02.2011, 09:35 +0100 schrieb Christophe TROESTLER: > > Hi, > > > > Starting from an idea on the Ocsigen mailing list, it was suggested > > that better support for UTF-8 in the tools would be of interest to > > several people. In particular, the following points were identified: > > > > - A flag (-utf8 ?) to the compilers should be added so that errors > > locations are correct in presence of UTF-8 strings [the programmer > > restricting himself to ASCII identifiers]. > > > > - ocamldoc: while an UTF-8 aware doc-generator is very easy to write, > > it would be nice to be able to parametrize any of them with the > > correct charset (using again the -utf8 flag ?) > > > > - UTF8.Char and UTF8.String modules should be written with the same > > interface as Char and String. [Camomile should be adapted > > consequently.] > > Well, UTF-8 is the wrong term here. What you need on this level are > Unicode modules, where a uni_char can contain all Unicode code points, > and a uni_string is an array of such uni_char's. > > UTF-8 is a run-length encoding of Unicode for I/O. It is not well suited > for string manipulation, at least if you want efficient support for > index-based access, because the length of the char representation is not > constant. 
> > Probably you would choose uni_char=int as representation for characters, > but for strings there are several possibilities: > > - 16 bits/char: This path is taken by other languages, but only a > subset of Unicode chars can be represented directly > - 24 bits/char: All Unicode chars can be represented (range is > 0 to 0x10ffff), but you need to multiply by 3 to access by index. > This multiplication is relatively cheap (one bit shift plus > one addition). > - 32 bits/char: A slight waste of RAM but very efficient access by > index > - int/char: same as 32 bits/char for 32-bit platforms but > 64 bits/char for 64-bit. Probably no good choice. > > Of course, there should also be conversions from/to normal chars/strings, > and this is the place where UTF-8 comes into play. > > Another comment: for supporting lowercase/uppercase conversion one needs > lookup tables. Not really big tables, because only a small fraction of > the Unicode chars has this variation. Although you could reasonably exclude case conversion functions if you wanted (of course, if you're trying to be totally compatible with Char/String then they'd have to be implemented as it has them). Not providing a function doesn't imply half-baked as long as there's the capability to implement it on top of the functions you do provide. > One should also think about whether other properties of the Unicode > character database should be made available. E.g. character classes. > This could also live in add-on libraries, but it is worth discussing. Personally, I'd say they could safely live in other libraries - the advantage of having basic Unicode string handling (length, character retrieval, simple operations over Unicode-character offsets, etc.) would be that the standard library can use the representation itself and other libraries can be updated to work with it (that's what we have modules and functors for, after all). 
The fact that OCaml's I/O functions can interface with Unicode-based file systems means, to me, that it absolutely must support Unicode (at least as a future target). The status quo of being unable, for example, to query accurately the number of characters in a filename returned by Unix.readdir () is not in any way desirable (and that's to say nothing of the fact that if the Windows ports used the wide versions of the Win32 API instead then you'd have Unix.readdir () returning UTF-8 strings on *nix and 16-bit wchar strings on Windows, so you'd have lost platform independence as well!). Fixing that does not require a fully featured Unicode library, and wishing for one seems a bit silly as a) it's exceedingly unlikely to happen and b) OCaml already has several very good libraries for full-blown Unicode. David ^ permalink raw reply [flat|nested] 20+ messages in thread
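The filename-length problem David mentions is easy to reproduce. The sketch below is only an illustration: `utf8_length` is a hypothetical helper, not a standard library function, and it assumes its input is already valid UTF-8. It counts code points by skipping continuation bytes (those of the form 0b10xxxxxx), whereas `String.length` counts bytes.

```ocaml
(* Hypothetical helper: number of Unicode code points in a string that is
   assumed to be valid UTF-8. Every code point contributes exactly one
   byte that is not a continuation byte (0b10xxxxxx), so counting those
   bytes counts code points. *)
let utf8_length (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
  !n

(* "café.ml" written with explicit UTF-8 bytes for U+00E9. *)
let name = "caf\xc3\xa9.ml"

let () =
  Printf.printf "bytes: %d, code points: %d\n"
    (String.length name) (utf8_length name)
```

Note that this counts code points, not user-perceived characters: a base letter followed by a combining accent would still count as two.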
* [Caml-list] Re: GSoC: better UTF-8 support 2011-02-28 15:50 ` David Allsopp @ 2011-03-01 5:49 ` Yoriyuki Yamagata 0 siblings, 0 replies; 20+ messages in thread From: Yoriyuki Yamagata @ 2011-03-01 5:49 UTC (permalink / raw) To: Sylvain Le Gall, Caml List Sorry, I didn't notice this thread because caml-list sometimes does not reach me. So allow me to jump into the discussion. I think the entire discussion went a bit astray. It seems to me that the argument is heading toward specifying the project details as much as possible. In my experience as a GSoC mentor (yes, I was one), students often come up with better ideas than the mentor. Therefore, instead of specifying the details, we'd better specify a general direction and let the students decide. As the general direction, I think we need 1) a lightweight stdlib replacement: data types for Unicode chars and strings, extensible character encoding, and simple IO. Interfaces should be purely functional as far as possible. For example, strings will be immutable, IO is monadic, etc. 2) a minimal language extension: Unicode character and string literals, a Unicode-aware toplevel (pretty printing), etc. This is by no means complete support for Unicode, but with this in the stdlib we can add more features of the Unicode standard (through, for example, Camomile) or modify third-party libraries to use Unicode. Best, -- Yoriyuki Yamagata yoriyuki.y@gmail.com http://sites.google.com/site/yoriyukiy/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] GSoC: better UTF-8 support 2011-02-28 8:35 [Caml-list] GSoC: better UTF-8 support Christophe TROESTLER ` (2 preceding siblings ...) 2011-02-28 14:13 ` Gerd Stolpmann @ 2011-02-28 14:21 ` Michael Ekstrand 2011-03-03 15:37 ` Damien Doligez 4 siblings, 0 replies; 20+ messages in thread From: Michael Ekstrand @ 2011-02-28 14:21 UTC (permalink / raw) To: caml-list On 02/28/2011 02:35 AM, Christophe TROESTLER wrote: > - UTF8.Char and UTF8.String modules should be written with the same > interface as Char and String. [Camomile should be adapted > consequently.] If this project is undertaken, then IMO the prospective student should also consult the Batteries and Extlib UTF8 modules, mostly based on Camomile's UTF8, so that new UTF8-specific functions are not needlessly incompatible with code written against Batteries, Extlib, or Camomile. This shouldn't be very difficult - Extlib and Batteries basically simplify and extend Camomile for UTF-8 handling - but should still be considered. - Michael ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Caml-list] GSoC: better UTF-8 support 2011-02-28 8:35 [Caml-list] GSoC: better UTF-8 support Christophe TROESTLER ` (3 preceding siblings ...) 2011-02-28 14:21 ` [Caml-list] " Michael Ekstrand @ 2011-03-03 15:37 ` Damien Doligez 2011-03-03 16:42 ` Dario Teixeira 4 siblings, 1 reply; 20+ messages in thread From: Damien Doligez @ 2011-03-03 15:37 UTC (permalink / raw) To: OCaml Mailing List On 2011-02-28, at 09:35, Christophe TROESTLER wrote: > - Printf/Scanf: %U of %cu for UTF8.Char.t It cannot be %cu because that would break the following code: Printf.printf "Ct%cul%cu fhtagn\n" 'h' 'h';; -- Damien ^ permalink raw reply [flat|nested] 20+ messages in thread
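Damien's objection can be checked directly. In the existing format language, `%c` consumes one `char` argument and the `u` that follows is ordinary literal text, so the format string below is already valid and meaningful; giving `%cu` a new meaning would change what it prints. The binding name `greeting` is just for illustration.

```ocaml
(* With today's Printf, each "%c" takes a char and the following "u" is
   literal text, so the two 'h' arguments fill in the missing letters. *)
let greeting = Printf.sprintf "Ct%cul%cu fhtagn" 'h' 'h'

let () = print_endline greeting
(* prints "Cthulhu fhtagn" *)
```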
* Re: [Caml-list] GSoC: better UTF-8 support 2011-03-03 15:37 ` Damien Doligez @ 2011-03-03 16:42 ` Dario Teixeira 0 siblings, 0 replies; 20+ messages in thread From: Dario Teixeira @ 2011-03-03 16:42 UTC (permalink / raw) To: OCaml Mailing List, Damien Doligez Hi, > It cannot be %cu because that would break the following > code: > > Printf.printf "Ct%cul%cu fhtagn\n" 'h' 'h';; And anything that breaks Cthulhu's sleep would have such tremendous side-effects that it would upset even us "impure" ML guys... /off-topic Cheers, Dario ^ permalink raw reply [flat|nested] 20+ messages in thread