* [Caml-list] Immutable strings @ 2014-07-04 19:18 Gerd Stolpmann 2014-07-04 20:31 ` Anthony Tavener ` (4 more replies) 0 siblings, 5 replies; 29+ messages in thread From: Gerd Stolpmann @ 2014-07-04 19:18 UTC (permalink / raw) To: caml-list Hi list, I've just posted a blog article where I criticize the new concept of immutable strings that will be available in OCaml 4.02 (as an option): http://blog.camlcity.org/blog/bytes1.html In short, my point is that the new concept is not far-reaching enough, and will even have a negative impact on code quality if it is not improved. I also present three ideas for how to improve it. Gerd -- ------------------------------------------------------------ Gerd Stolpmann, Darmstadt, Germany gerd@gerd-stolpmann.de My OCaml site: http://www.camlcity.org Contact details: http://www.camlcity.org/contact.html Company homepage: http://www.gerd-stolpmann.de ------------------------------------------------------------ ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] Immutable strings 2014-07-04 19:18 [Caml-list] Immutable strings Gerd Stolpmann @ 2014-07-04 20:31 ` Anthony Tavener 2014-07-04 20:38 ` Malcolm Matalka ` (2 more replies) 2014-07-04 21:01 ` Markus Mottl ` (3 subsequent siblings) 4 siblings, 3 replies; 29+ messages in thread From: Anthony Tavener @ 2014-07-04 20:31 UTC (permalink / raw) To: Gerd Stolpmann; +Cc: caml-list [-- Attachment #1: Type: text/plain, Size: 2238 bytes --] I'm rather welcoming of the immutable change (hehe) of strings, but I haven't considered these details -- perhaps because I only use strings as immutable (currently with no such guarantee!), and use bigarray for a block of mutable bytes... which is your idea #3. It seems the "bytes" type would be most useful in cases where mutable and immutable strings are used in a mixed manner... but given these practical issues you raise, it could be less pleasant than it first appears. Your "stringlike" solution seems reasonable, but I don't have a good use-case in mind for mixed mutable/immutable to help me imagine the result. What are some scenarios where this mix of types is desired? I think even Rust doesn't support mutable strings -- which seems bold for its target audience, yet they're fine with it? When I consider possible scenarios of utf8 encoded strings, and mutating that in-place... ugh. Even "back in the day", doing string operations in C on ASCII, I'd favor building a new string rather than flirting with saving ops by overwriting values in the current string. Oh! Upper/lower-case! Maybe that's the one good use-case. 
;) ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] Immutable strings 2014-07-04 20:31 ` Anthony Tavener @ 2014-07-04 20:38 ` Malcolm Matalka 2014-07-04 23:44 ` Daniel Bünzli 2014-07-05 11:04 ` Gerd Stolpmann 2 siblings, 0 replies; 29+ messages in thread From: Malcolm Matalka @ 2014-07-04 20:38 UTC (permalink / raw) To: Anthony Tavener; +Cc: Gerd Stolpmann, caml-list I haven't really been following this but I'm curious why a new type, rstring, was not introduced? But, as for the actual impact on the community. This seems like a question the OPAM team can answer now, right? They can compile every package with immutable strings turned on and see how many fail? That would give an idea of the impact and possibly suggest a migration path or an alternative approach. /M ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] Immutable strings 2014-07-04 20:31 ` Anthony Tavener 2014-07-04 20:38 ` Malcolm Matalka @ 2014-07-04 23:44 ` Daniel Bünzli 2014-07-05 11:04 ` Gerd Stolpmann 2 siblings, 0 replies; 29+ messages in thread From: Daniel Bünzli @ 2014-07-04 23:44 UTC (permalink / raw) To: Anthony Tavener; +Cc: Gerd Stolpmann, caml-list

On Friday, 4 July 2014 at 21:31, Anthony Tavener wrote:
> When I consider possible scenarios of utf8 encoded strings, and mutating that
> in-place... ugh. Even "back in the day", doing string operations in C on
> ASCII, I'd favor building a new string rather than flirting with saving ops by
> overwriting values in the current string. Oh! Upper/lower-case! Maybe that's
> the one good use-case. ;)

Not even… that is if you care about Unicode, e.g.:

    # Uucp.Case.Map.to_upper 0xFB01;;
    - : [ `Self | `Uchars of Uucp.uchar list ] = `Uchars [U+0046; U+0049]

Best, Daniel ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] Immutable strings 2014-07-04 20:31 ` Anthony Tavener 2014-07-04 20:38 ` Malcolm Matalka 2014-07-04 23:44 ` Daniel Bünzli @ 2014-07-05 11:04 ` Gerd Stolpmann 2014-07-16 11:38 ` Damien Doligez 2 siblings, 1 reply; 29+ messages in thread From: Gerd Stolpmann @ 2014-07-05 11:04 UTC (permalink / raw) To: Anthony Tavener; +Cc: caml-list

On Friday, 04.07.2014, at 14:31 -0600, Anthony Tavener wrote:
> It seems the "bytes" type would be most useful in cases where mutable and
> immutable strings are used in a mixed manner... but given these practical
> issues you raise, it could be less pleasant than it first appears. Your
> "stringlike" solution seems reasonable, but I don't have a good use-case in
> mind for mixed mutable/immutable to help me imagine the result. What are some
> scenarios where this mix of types is desired? I think even Rust doesn't
> support mutable strings -- which seems bold for its target audience, yet
> they're fine with it?

I mostly have buffers in mind, as you need them for block-by-block I/O. Actually, I started thinking about this issue when looking again at OCamlnet, and how I could use "bytes" there. It's a hard case: lots of buffers of different types, and you really run into the problems I sketched in the article, as it is a common operation to copy the contents of one buffer into another. That's also why I'm suggesting using bigarrays. For interfacing with C these are much easier to use as buffers, as bigarrays are just malloc'ed memory and cannot be moved around by the GC. (And the C interface is needed for I/O.) So my scenario is quite low-level: I/O, and C interfaces.

> When I consider possible scenarios of utf8 encoded strings, and mutating that
> in-place... ugh.

No, that's a no-go, of course. When it comes to real text, mutability doesn't give you much.

Gerd ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] Immutable strings 2014-07-05 11:04 ` Gerd Stolpmann @ 2014-07-16 11:38 ` Damien Doligez 0 siblings, 0 replies; 29+ messages in thread From: Damien Doligez @ 2014-07-16 11:38 UTC (permalink / raw) To: Gerd Stolpmann; +Cc: caml-list Hi Gerd and OCaml users, First note that we are not breaking backward compatibility: you can always use the -unsafe-string flag to compile your dusty code. On 2014-07-05, at 13:04, Gerd Stolpmann wrote: > So my scenario is quite low-level: I/O, and C interfaces. As you said, bigarrays are the best suited for that kind of code. But that's not a good reason to make all strings as heavy as bigarrays. If you need bigarrays, by all means use bigarrays in your code, not String or Bytes. On 2014-07-05, at 13:24, Gerd Stolpmann wrote: > Well, the complexity can be reduced a bit by using phantom types: > > type string = [`String] stringlike > type bytes = [`Bytes] stringlike > > and then just define function-by-function what is permitted: This is almost the same as our first version, which we discarded as too complex and not compatible enough (as you noted, because of unresolved type variables). But it might make a come-back. On 2014-07-08, at 14:24, Gerd Stolpmann wrote: > It will create confusion even with actively maintained code bases. What > could help here is very clear communication when the change will be the > standard behavior, and how the migration will take place. Currently, it > feels like a big experiment - hey, let's users tentatively enable it, > and watch out for problems. OK, we need to be clearer on the "how" (in a nutshell, the default will switch from -unsafe-string to -safe-string at some point in the future when we feel that enough of the existing code has been updated). As for the "when", we can't tell because that depends a lot on how fast the community updates its code. Hopefully no more than three years. Possibly as soon as 4.03.0.
> There could also be a section in > the manual explaining the new behavior, and how to convert code. That's a good idea. > Right, that's the good side of it. (Although the danger is quite > theoretical, as most programmers seem to intuitively follow the rule > "don't mutate strings you did not create". I've never seen this kind of > bug in practice.) What about programmers who deliberately trigger the bug (aka "attackers", in a security setting)? It's not just about how unlikely a bug is, but also whether it can be exploited. > For instance, there is one module in OCamlnet where a regexp is directly > run on an I/O buffer (generally, you need to do some primitive parsing > on I/O buffers before you can extract strings, and that's where > stringlike would be nice to have). Without stringlike, I would have to > replace that regexp somehow. If stringlike is polymorphic, you will need a new regexp library that operates on stringlike. We cannot update the current regexp library to use stringlike because that would introduce polymorphism and unresolved type variables, and that might break some of the code that used to run on 1.03... On 2014-07-14, at 19:45, Richard W.M. Jones wrote: > That would imply removing incorrect functions like String.uppercase > and String.lowercase. First, we mark them deprecated. Then we wait a very long time before we actually remove them (if ever). -- Damien ^ permalink raw reply [flat|nested] 29+ messages in thread
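As a rough illustration of what converting code for -safe-string can look like: the sketch below replaces an in-place edit of a string with an explicit round trip through bytes. The helper name capitalize_first is hypothetical, and Char.uppercase_ascii is assumed to be available (it arrived in a release later than the 4.02 discussed here).

```ocaml
(* Sketch only: with -safe-string, [string] is immutable, so any in-place
   edit has to go through a [bytes] value. [capitalize_first] is a
   hypothetical helper, not part of the standard library. *)
let capitalize_first (s : string) : string =
  if s = "" then s
  else begin
    (* Bytes.of_string returns a fresh mutable copy of [s]. *)
    let b = Bytes.of_string s in
    Bytes.set b 0 (Char.uppercase_ascii (Bytes.get b 0));
    (* Bytes.to_string copies again, so the result cannot alias [b]. *)
    Bytes.to_string b
  end

let () =
  let original = "ocaml" in
  let result = capitalize_first original in
  (* The input is untouched; only the copy was mutated. *)
  assert (original = "ocaml");
  assert (result = "Ocaml")
```

The two copies are the price of the safe conversion functions; code that owns its buffer from the start can work on bytes throughout and convert once at the end.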
* Re: [Caml-list] Immutable strings 2014-07-04 19:18 [Caml-list] Immutable strings Gerd Stolpmann 2014-07-04 20:31 ` Anthony Tavener @ 2014-07-04 21:01 ` Markus Mottl 2014-07-05 11:24 ` Gerd Stolpmann 2014-07-28 11:14 ` Goswin von Brederlow 2014-07-07 12:42 ` Alain Frisch ` (2 subsequent siblings) 4 siblings, 2 replies; 29+ messages in thread From: Markus Mottl @ 2014-07-04 21:01 UTC (permalink / raw) To: Gerd Stolpmann; +Cc: caml-list I agree that the new concept has some noteworthy downsides as demonstrated in the Lexing-example. Your proposed solution 2 (stringlike) would probably solve these issues from a safety point of view. The downside is that the complexity of string-handling would increase even more, because then we would have three types to deal with. I personally prefer safety over convenience, but other people's (especially beginner's) mileage may vary. The Bigarray-approach doesn't seem appealing to me. Strings are much more lightweight, since they can be allocated cheaply on the OCaml-heap. E.g. String.create is about 10x-100x faster than Bigarray.create. That seems too big to ignore. Regards, Markus -- Markus Mottl http://www.ocaml.info markus.mottl@gmail.com ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] Immutable strings 2014-07-04 21:01 ` Markus Mottl @ 2014-07-05 11:24 ` Gerd Stolpmann 2014-07-08 13:23 ` Jacques Garrigue 2014-07-28 11:14 ` Goswin von Brederlow 1 sibling, 1 reply; 29+ messages in thread From: Gerd Stolpmann @ 2014-07-05 11:24 UTC (permalink / raw) To: Markus Mottl; +Cc: caml-list

On Friday, 04.07.2014, at 17:01 -0400, Markus Mottl wrote:
> I agree that the new concept has some noteworthy downsides as
> demonstrated in the Lexing-example. Your proposed solution 2
> (stringlike) would probably solve these issues from a safety point of
> view. The downside is that the complexity of string-handling would
> increase even more, because then we would have three types to deal
> with. I personally prefer safety over convenience, but other people's
> (especially beginner's) mileage may vary.

Well, the complexity can be reduced a bit by using phantom types:

    type string = [`String] stringlike
    type bytes = [`Bytes] stringlike

and then just define function-by-function what is permitted:

    val get : 'a stringlike -> int -> char
    val set : [`Bytes] stringlike -> int -> char -> unit
    val sub : 'a stringlike -> int -> int -> [`String] stringlike
    val sub_bytes : 'a stringlike -> int -> int -> [`Bytes] stringlike

etc., and the modules String and Bytes would just contain aliases of these functions with monomorphic types.

I don't know, though, whether we can be sure never to see the polymorphic typing when just using string and bytes. It would be a bit surprising for beginners to see that, and you would sometimes have to deal with unresolved type variables.

> The Bigarray-approach doesn't seem appealing to me. Strings are much
> more lightweight, since they can be allocated cheaply on the
> OCaml-heap. E.g. String.create is about 10x-100x faster than
> Bigarray.create. That seems too big to ignore.
Oh, we already ignore that Unix.read and Unix.write copy all data through an additional buffer, because we cannot pass an OCaml string directly to the OS while another thread could relocate this string. With bigarrays that copy would be eliminated. So I'd guess you are normally even faster with bigarrays, at least when you only look at the use as I/O buffers. But there might be other uses where this is different. Gerd ^ permalink raw reply [flat|nested] 29+ messages in thread
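The phantom-type interface quoted in this message can be tried out in user code. Below is a minimal sketch, assuming a hypothetical module Stringlike whose values are always backed by a private bytes copy (so it is not how the standard library works); the signature follows the val declarations in the message, plus conversion functions added for illustration.

```ocaml
(* Sketch of the phantom-type idea. The module name, [of_string],
   [of_bytes] and [to_string] are assumptions added for illustration. *)
module Stringlike : sig
  type 'a t
  val of_string : string -> [`String] t
  val of_bytes : bytes -> [`Bytes] t
  val get : 'a t -> int -> char
  val set : [`Bytes] t -> int -> char -> unit
  val sub : 'a t -> int -> int -> [`String] t
  val sub_bytes : 'a t -> int -> int -> [`Bytes] t
  val to_string : 'a t -> string
end = struct
  (* The phantom parameter carries no data; every value is a private copy. *)
  type 'a t = bytes
  let of_string s = Bytes.of_string s
  let of_bytes b = Bytes.copy b
  let get = Bytes.get
  let set = Bytes.set
  let sub = Bytes.sub
  let sub_bytes = Bytes.sub
  let to_string = Bytes.to_string
end

let () =
  let s = Stringlike.of_string "hello" in
  (* Reading works on either flavour. *)
  assert (Stringlike.get s 0 = 'h');
  (* Writing needs a [`Bytes] value, here obtained by copying. *)
  let b = Stringlike.sub_bytes s 0 5 in
  Stringlike.set b 0 'j';
  assert (Stringlike.to_string b = "jello");
  (* The immutable original is unaffected. *)
  assert (Stringlike.to_string s = "hello")
```

A call such as Stringlike.set s 0 'j' is rejected by the type checker, because s has type [`String] t. That rejection is the safety property the scheme is after.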
* Re: [Caml-list] Immutable strings 2014-07-05 11:24 ` Gerd Stolpmann @ 2014-07-08 13:23 ` Jacques Garrigue 2014-07-08 13:37 ` Alain Frisch 0 siblings, 1 reply; 29+ messages in thread From: Jacques Garrigue @ 2014-07-08 13:23 UTC (permalink / raw) To: Gerd Stolpmann; +Cc: OCaML List Mailing On 2014/07/05 20:24, Gerd Stolpmann wrote: > > Am Freitag, den 04.07.2014, 17:01 -0400 schrieb Markus Mottl: >> I agree that the new concept has some noteworthy downsides as >> demonstrated in the Lexing-example. Your proposed solution 2 >> (stringlike) would probably solve these issues from a safety point of >> view. The downside is that the complexity of string-handling would >> increase even more, because then we would have three types to deal >> with. I personally prefer safety over convenience, but other people's >> (especially beginner's) mileage may vary. > > Well, the complexity can be reduced a bit by using phantom types: > > type string = [`String] stringlike > type bytes = [`Bytes] stringlike > > and then just define function-by-function what is permitted: > > val get : 'a stringlike -> int -> char > val set : [`Bytes] stringlike -> int -> char -> unit > val sub : 'a stringlike -> int -> int -> [`String] stringlike > val sub_bytes : 'a stringlike -> int -> int -> [`Bytes] stringlike > > etc., and the modules String and Bytes would just contain aliases of > these functions with monomorphed typing. > > I don't know, though, whether we can be safe to never see the > polymorphic typing when just using string and bytes. It would be a bit > surprising for beginners to see that, and you sometimes would have to > deal with unresolved type variables. Indeed. Originally the plan was to use the above scheme for strings, and use polymorphism to allow more flexibility. However, this is not 100% compatible, even if we allow to ignore the parameters, because of these unresolved type variables. This also becomes complicated when you want to take functions as parameters. 
The stringlike type itself is a good idea. In the standard library, it could be implemented as:

    type string = private stringlike
    type bytes = private stringlike

However, it is only about allowing string and bytes arguments to be passed to functions in a homogeneous way. For the return case, the situation is less clear, because returning a stringlike is actually weaker than returning either bytes or string.

Alain’s idea of using an extra type-only parameter ('a is_a_type) works, and it doesn’t really need to be a GADT. But it is a bit strange to use an extra parameter where a phantom type on string itself would solve the problem. I.e., using your above approach one can be safe just writing:

    val copy : 'a stringlike -> 'b stringlike
    val sub : 'a stringlike -> int -> int -> 'b stringlike

(assuming that we are always copying in sub too)

One could try to mix the two approaches: i.e. have a type 'a stringlike, with explicit coercions to and from bytes and string. Note that you can do that yourself: create your own Stringlike module, with the coercions

    type 'a stringlike
    external from_string : string -> [> `String] stringlike = "%identity"
    external to_string : [`String] stringlike -> string = "%identity"
    …

Note that you should not write "type +'a stringlike", since you want to exploit the fact that any stringlike must be monomorphic.

This could of course be added to the standard library, but for compatibility reasons I think that string itself has to stay an abstract (or private) type with no parameter. And the above kind of coercion is compiled away, so if your goal is performance this should not be a problem.

Jacques ^ permalink raw reply [flat|nested] 29+ messages in thread
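The user-level Stringlike module suggested above might be sketched as follows. The two string coercions are the ones given in the message; from_bytes, to_bytes, length and get are assumptions added to make the example usable, and the internal view coercion is only ever used for reading.

```ocaml
(* Sketch only. Everything beyond from_string/to_string is an assumption
   added for illustration. *)
module Stringlike : sig
  type 'a stringlike
  external from_string : string -> [> `String] stringlike = "%identity"
  external to_string : [`String] stringlike -> string = "%identity"
  external from_bytes : bytes -> [> `Bytes] stringlike = "%identity"
  external to_bytes : [`Bytes] stringlike -> bytes = "%identity"
  val length : 'a stringlike -> int
  val get : 'a stringlike -> int -> char
end = struct
  type 'a stringlike
  external from_string : string -> [> `String] stringlike = "%identity"
  external to_string : [`String] stringlike -> string = "%identity"
  external from_bytes : bytes -> [> `Bytes] stringlike = "%identity"
  external to_bytes : [`Bytes] stringlike -> bytes = "%identity"
  (* Internal read-only view; never used to mutate. *)
  external view : 'a stringlike -> bytes = "%identity"
  let length s = Bytes.length (view s)
  let get s i = Bytes.get (view s) i
end

let () =
  (* A string passes through without being copied. *)
  let s = Stringlike.from_string "abc" in
  assert (Stringlike.length s = 3);
  assert (Stringlike.get s 1 = 'b');
  assert (Stringlike.to_string s = "abc");
  (* A bytes value is shared too, so later writes stay visible. *)
  let b = Bytes.of_string "xyz" in
  let v = Stringlike.from_bytes b in
  Bytes.set b 0 'X';
  assert (Stringlike.get v 0 = 'X')
```

The bindings are kept local on purpose. At the top level of a compilation unit, a value of type [> `String] stringlike that is never coerced back would leave a type variable that cannot be generalized, which is the "unresolved type variables" problem raised earlier in the thread.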
* Re: [Caml-list] Immutable strings 2014-07-08 13:23 ` Jacques Garrigue @ 2014-07-08 13:37 ` Alain Frisch 2014-07-08 14:04 ` Jacques Garrigue 0 siblings, 1 reply; 29+ messages in thread From: Alain Frisch @ 2014-07-08 13:37 UTC (permalink / raw) To: Jacques Garrigue, Gerd Stolpmann; +Cc: OCaML List Mailing On 07/08/2014 03:23 PM, Jacques Garrigue wrote: > Alain’s idea of using an extra type-only parameter (‘a is_a_type) works, > and it doesn’t really need to be a GADT. > But this is a bit strange to use an extra parameter where a phantom type > on string itself would solve the problem. I mentioned that some functions could behave differently according to the requested result type. For instance, a function val of_bool: 'a is_a_string -> bool -> 'a would return string literals when 'a = String and it would copy them when 'a = Bytes. Similarly, a function could memoize some strings it produces in order to return them later again, but only when 'a = String, not 'a = Bytes. Even for functions such as "copy" or "sub", it makes sense to avoid a copy in some cases (when both the input and the output are immutable, and for sub, when the range covers the entire input). So I don't think that "'a is_a_string" can really be only a phantom type. -- Alain ^ permalink raw reply [flat|nested] 29+ messages in thread
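A small sketch of the behaviour described in this message, where the function inspects a witness of the requested result type. It is written here with a GADT, and the constructor names As_string and As_bytes are hypothetical; the thread does not fix the concrete definition of is_a_string.

```ocaml
(* Sketch only: a run-time witness of the requested result type. *)
type _ is_a_string =
  | As_string : string is_a_string
  | As_bytes : bytes is_a_string

let of_bool : type a. a is_a_string -> bool -> a = fun witness b ->
  let literal = if b then "true" else "false" in
  match witness with
  | As_string -> literal                  (* immutable: return the literal itself *)
  | As_bytes -> Bytes.of_string literal   (* mutable: hand out a fresh copy *)

let () =
  assert (of_bool As_string true = "true");
  let b = of_bool As_bytes false in
  (* Mutating the result must not affect later calls. *)
  Bytes.set b 0 'F';
  assert (Bytes.to_string (of_bool As_bytes false) = "false")
```

Because the witness is an ordinary value, the function can branch on it at run time, which a pure phantom parameter on the string type would not allow.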
* Re: [Caml-list] Immutable strings 2014-07-08 13:37 ` Alain Frisch @ 2014-07-08 14:04 ` Jacques Garrigue 0 siblings, 0 replies; 29+ messages in thread From: Jacques Garrigue @ 2014-07-08 14:04 UTC (permalink / raw) To: Alain Frisch; +Cc: Mailing List OCaml, Gerd Stolpmann [-- Attachment #1: Type: text/plain, Size: 1491 bytes --] 2014/07/08 22:38 "Alain Frisch" <alain@frisch.fr>: > > On 07/08/2014 03:23 PM, Jacques Garrigue wrote: >> >> Alain’s idea of using an extra type-only parameter (‘a is_a_type) works, >> and it doesn’t really need to be a GADT. >> But this is a bit strange to use an extra parameter where a phantom type >> on string itself would solve the problem. > > > I mentioned that some functions could behave differently according to the requested result type. For instance, a function > > val of_bool: 'a is_a_string -> bool -> 'a > > would return string literals when 'a = String and it would copy them when 'a = Bytes. Similarly, a function could memoize some strings it produces in order to return them later again, but only when 'a = String, not 'a = Bytes. I see. But in that case we could also have different functions, since the semantics change (at least for physical equality) > Even for functions such as "copy" or "sub", it makes sense to avoid a copy in some cases (when both the input and the output are immutable, and for sub, when the range covers the entire input). Ok, but in that case you will need a flag for both input and output strings, since there is no way to recover this information from the string itself. > So I don't think that "'a is_a_string" can really be only a phantom type. I see. I think that both approaches have interesting applications. But from a type system point of view they are clearly advanced. Jacques [-- Attachment #2: Type: text/html, Size: 1850 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] Immutable strings 2014-07-04 21:01 ` Markus Mottl 2014-07-05 11:24 ` Gerd Stolpmann @ 2014-07-28 11:14 ` Goswin von Brederlow 2014-07-28 15:51 ` Markus Mottl 1 sibling, 1 reply; 29+ messages in thread From: Goswin von Brederlow @ 2014-07-28 11:14 UTC (permalink / raw) To: caml-list On Fri, Jul 04, 2014 at 05:01:18PM -0400, Markus Mottl wrote: > I agree that the new concept has some noteworthy downsides as > demonstrated in the Lexing-example. Your proposed solution 2 > (stringlike) would probably solve these issues from a safety point of > view. The downside is that the complexity of string-handling would > increase even more, because then we would have three types to deal > with. I personally prefer safety over convenience, but other people's > (especially beginner's) mileage may vary. > > The Bigarray-approach doesn't seem appealing to me. Strings are much > more lightweight, since they can be allocated cheaply on the > OCaml-heap. E.g. String.create is about 10x-100x faster than > Bigarray.create. That seems too big to ignore. > > Regards, > Markus Why is that? A bigarray allocates a small block on the OCaml heap and the buffer outside the OCaml heap. Is that normal malloc() call just so much slower? Or are there other factors involved? On the other hand, if your app is I/O heavy then you should allocate a few buffers and reuse them. In that case the allocation overhead is constant and the time saved by not copying in the I/O will more than make up for it. Or read/mmap the file into a huge bigarray and then slice it into smaller chunks.
> On Fri, Jul 4, 2014 at 3:18 PM, Gerd Stolpmann <info@gerd-stolpmann.de> wrote:
> > Hi list,
> >
> > I've just posted a blog article where I criticize the new concept of
> > immutable strings that will be available in OCaml 4.02 (as option):
> >
> > http://blog.camlcity.org/blog/bytes1.html
> >
> > In short my point is that it the new concept is not far reaching enough,
> > and will even have negative impact on the code quality when it is not
> > improved. I also present three ideas how to improve it.
> >
> > Gerd

You have a few more points:

1) there are 3 kinds of strings:

- string literal / constant strings [which never change ever]
- read-only strings [which YOU are not allowed to change but which might change]
- mutable strings [which you are allowed to change]

There is one other thing you didn't mention here. While it is nice to pass a mutable string to the lexer (or similar), one has to realize that this is not thread-safe. Another thread might be mutating the string while it is being used. So I would suggest there is a 4th kind of string:

- frozen strings [which are mutable but won't be changed anymore]

That is basically like read-only strings but with the added promise that they won't be changed. Nothing in the type system guarantees that; it is just a promise from the programmer.

2) there are lots of functions that just need any kind of string and should accept all 3

This kind of asks for type classes. There should be a read-from-string type class that all 3 string types would fit. Then one could have one function accepting a read-from-string type class, and all 3 string types could be passed. But unfortunately OCaml doesn't have type classes. The next best thing would be enumerations (not in stdlib). Make enumerations accept all 3 string types and then have everything else accept enumerations. This would also mean you could pass a char list or rope or any other type that gives you an enumeration of chars.
3) I/O code

That the stdlib uses strings for I/O and needs to copy the data around
all the time has been nagging me for years. There certainly should be
read/write functions dealing with bigarrays.

There also should be a function to create a bigarray with special
alignment (e.g. PAGESIZE) to get the best I/O performance (or, in the
case of Linux async I/O, to make it work at all).

As for mutable/immutable strings, there should be a read function
returning an immutable string, which it creates internally. The string
can't be passed as an argument, so creating a fresh one is the only way.


Here is a completely new point:

4) What is good for strings is also good for bigarrays

The same arguments concerning strings apply to bigarrays. Say you
pass a bigarray to the lexer. Can it just use it as is for its lexbuf,
or does it need to copy it because it might be mutated? An immutable
bigarray could safely be used as is.


And this doesn't really stop at bigarrays. Even references could be
read-only, in the sense of "this might change but YOU aren't allowed
to change it". And I think the only way to solve the
const/immutable/mutable/frozen sub-types in a way that is applicable to
more than just strings is to use phantom types.

Kind regards,
Goswin

^ permalink raw reply [flat|nested] 29+ messages in thread
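The phantom-type idea at the end of the message above can be illustrated with a small sketch. It is not taken from the thread: the module name `Pstring`, its functions, and the permission tags `` `Read `` and `` `Write `` are all hypothetical, and the buffer is simply a `Bytes.t` underneath. A value typed ``[ `Read ] t`` is "read-only" in the sense used above: the holder cannot change it, but someone holding a writable view of the same buffer still can.

```ocaml
(* Hypothetical module, not part of the stdlib: a byte buffer whose
   type parameter records which operations the holder may perform. *)
module Pstring : sig
  type 'perm t
  val create : int -> [ `Read | `Write ] t
  val of_string : string -> [ `Read ] t          (* copies its argument *)
  val read_only : [> `Read ] t -> [ `Read ] t    (* shares the buffer *)
  val length : 'perm t -> int
  val get : [> `Read ] t -> int -> char
  val set : [> `Write ] t -> int -> char -> unit
end = struct
  (* The parameter is a phantom: it never appears in the representation. *)
  type 'perm t = Bytes.t
  let create n = Bytes.make n ' '
  let of_string = Bytes.of_string
  let read_only b = b
  let length = Bytes.length
  let get = Bytes.get
  let set = Bytes.set
end

let () =
  let buf = Pstring.create 3 in
  Pstring.set buf 0 'a';
  let ro = Pstring.read_only buf in
  assert (Pstring.get ro 0 = 'a')
  (* [Pstring.set ro 0 'b'] would be rejected by the type checker,
     because [ `Read ] does not include the `Write tag. *)
```

A "frozen" or "const" variant would presumably need an additional tag and a constructor that copies, since sharing the buffer cannot by itself promise that nobody else mutates it.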
* Re: [Caml-list] Immutable strings 2014-07-28 11:14 ` Goswin von Brederlow @ 2014-07-28 15:51 ` Markus Mottl 2014-07-29 2:54 ` Yaron Minsky 0 siblings, 1 reply; 29+ messages in thread From: Markus Mottl @ 2014-07-28 15:51 UTC (permalink / raw) To: Goswin von Brederlow; +Cc: caml-list On Mon, Jul 28, 2014 at 7:14 AM, Goswin von Brederlow <goswin-v-b@web.de> wrote: > Why is that? A bigarray allocates a small block on the ocaml heap and > the buffer outside the ocaml heap. Is that normal malloc() call just > so much slower? Or are there other factors involved? If you look at the runtime code, you'll see that there is quite a lot going on to create a bigarray value. Allocating small OCaml-strings on the minor heap only costs a handful of cheap instructions, which is obviously way more efficient. There is some threshold at which malloc will perform more expensive system calls to obtain memory whereas OCaml may still be able to get some larger chunks from the major heap. Unless Bigarrays become really large, standard OCaml strings can be obtained much more cheaply. > On the other hand if your app is IO heavy then you should allocate a > few buffers and reuse them. In that case the allocation overhead is > constant and the time saved for not copying in the I/O will more than > make up for it. Exactly. Bigarrays are my buffer of choice for I/O. > Or read/mmap the file into a huge bigarray and the slice it into > smaller chunks. This can improve performance for certain operations, but beware of page faults when accessing ranges that only reside on disk. Unless this access is done outside of the OCaml-lock, your application could freeze longer than allowed for realtime applications. 
Regards, Markus -- Markus Mottl http://www.ocaml.info markus.mottl@gmail.com ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] Immutable strings 2014-07-28 15:51 ` Markus Mottl @ 2014-07-29 2:54 ` Yaron Minsky 2014-07-29 9:46 ` Goswin von Brederlow 2014-07-29 11:48 ` John F. Carr 0 siblings, 2 replies; 29+ messages in thread From: Yaron Minsky @ 2014-07-29 2:54 UTC (permalink / raw) To: Markus Mottl; +Cc: Goswin von Brederlow, caml-list

This isn't my idea, but it seems worth repeating: perhaps it would
make sense to have an unmovable byte-array type that had the same
memory representation as Bytes.t, with the extra guarantee that the
collector wouldn't move it.

You could imagine representing this as a private type:

  module Immovable_bytes : sig
    type t = private Bytes.t
    val create : int -> t
  end

with a special creation function for creating these immovable strings.
This would avoid some of the current need to write what is effectively
the same code twice, once for bigarrays and once for Bytes.t's. In
particular, you could modify an Immovable_bytes by first up-casting it
to a Bytes.t. But you could only actually create one by going through
the special creation function.

I'm not sure if the runtime details could be made to work out, but if
they could, I think it would be a bit nicer than the current world.

y

On Mon, Jul 28, 2014 at 11:51 AM, Markus Mottl <markus.mottl@gmail.com> wrote:
> On Mon, Jul 28, 2014 at 7:14 AM, Goswin von Brederlow <goswin-v-b@web.de> wrote:
>> Why is that? A bigarray allocates a small block on the ocaml heap and
>> the buffer outside the ocaml heap. Is that normal malloc() call just
>> so much slower? Or are there other factors involved?
>
> If you look at the runtime code, you'll see that there is quite a lot
> going on to create a bigarray value. Allocating small OCaml-strings
> on the minor heap only costs a handful of cheap instructions, which is
> obviously way more efficient.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] Immutable strings 2014-07-29 2:54 ` Yaron Minsky @ 2014-07-29 9:46 ` Goswin von Brederlow 2014-07-29 11:48 ` John F. Carr 1 sibling, 0 replies; 29+ messages in thread From: Goswin von Brederlow @ 2014-07-29 9:46 UTC (permalink / raw) To: caml-list On Mon, Jul 28, 2014 at 10:54:36PM -0400, Yaron Minsky wrote: > This isn't my idea, but it seems worth repeating: perhaps it would > make sense to have an unmovable byte-array type that had the same > memory representation as Bytes.t, with the extra guarantee that the > collector wouldn't move it. > > You could imagine representing this as a private type: > > module Immovable_bytes : sig > type t = private Bytes.t > val create : int -> t > end > > with a special creation function for creating these immovable strings. > This would avoid some of the current need to write what is effectively > the same code twice, once for bigarrays and once for Bytes.t's. In > particular, you could modify an Immovable_bytes by first up-casting it > to a Bytes.t. But you could only actually create one by going through > the special creation function. Include Bytes and then redefine all the functions creating new Bytes to use the special create. That way you can use all the Bytes functions without upcasting. > I'm not sure if the runtime details could be made to work out, but if > they could, I think it would be a bit nicer than the current world. > > y I think that is simple enough to do. Simply have your create call malloc and set the right header for the block and return the address + header_size. Also add a Gc.finalise to free the memory when the block becomes unreachable. Or can you only finalise blocks inside the ocaml heap? A slight problem though is that sometimes alignment is important (e.g linux AIO needs block aligned data). You would have to allocate the memory to be 4/8 bytes shy of block alignment, which means allocating a chunk one block bigger and aligning the data within that larger block. 
It also means you need to store the real address of the block before or after the data and complicates your free function. For page aligned data of page size that means you need to allocate (2 * PAGE_SIZE + 16) bytes of data whereas a bigarray only needs PAGE_SIZE + the ocaml block on the major heap. In general it could be nice to include `Immovable as phantom type next to `Const, `Read and `Write. Then you could have immovable strings, bytes, ... MfG Goswin ^ permalink raw reply [flat|nested] 29+ messages in thread
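To make the private-type proposal discussed above concrete, here is a minimal sketch of how client code would look. It only illustrates the typing side. The allocation is a plain stand-in, since actually guaranteeing that the collector never moves the block would need the runtime support debated in this thread, and `make` is a hypothetical addition that is not in the quoted signature.

```ocaml
module Immovable_bytes : sig
  type t = private Bytes.t
  val create : int -> t
  val make : int -> char -> t   (* hypothetical extra constructor *)
end = struct
  type t = Bytes.t
  (* Stand-in only: these allocate ordinary, movable bytes. A real
     version would have to allocate where the GC will not move the block. *)
  let create n = Bytes.create n
  let make n c = Bytes.make n c
end

let () =
  let b = Immovable_bytes.make 4 'x' in
  (* Every Bytes function is available after an explicit upcast ... *)
  Bytes.set (b :> Bytes.t) 0 'y';
  assert (Bytes.to_string (b :> Bytes.t) = "yxxx")
  (* ... but an arbitrary Bytes.t cannot be passed off as an
     Immovable_bytes.t: only [create] and [make] produce one. *)
```

The upcast costs nothing at run time; the private abbreviation only restricts who may construct values of the type.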
* Re: [Caml-list] Immutable strings 2014-07-29 2:54 ` Yaron Minsky 2014-07-29 9:46 ` Goswin von Brederlow @ 2014-07-29 11:48 ` John F. Carr 1 sibling, 0 replies; 29+ messages in thread From: John F. Carr @ 2014-07-29 11:48 UTC (permalink / raw) To: Yaron Minsky; +Cc: caml-list > This isn't my idea, but it seems worth repeating: perhaps it would > make sense to have an unmovable byte-array type that had the same > memory representation as Bytes.t, with the extra guarantee that the > collector wouldn't move it. I don't think this can be implemented without a significant runtime change. If your program is special enough to need pinned strings, try turning off compaction with Gc.set. The collector will never move objects larger than 256 words. It is easy to allocate a string from outside the heap, so the collector can not move it, but then you have to do manual memory management. Manual memory management means explicit free or a finalizer on an object you know will outlive any reference to the string. ^ permalink raw reply [flat|nested] 29+ messages in thread
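For reference, turning off compaction with Gc.set, as suggested in the message above, might look like the sketch below. It relies on the documented behaviour of the 4.x runtime that a `max_overhead` of 1000000 or more means compaction is never triggered; the helper name `disable_compaction` is made up for this example. It does not stop the minor collector from promoting (and thereby moving) small, freshly allocated values.

```ocaml
(* Hypothetical helper: ask the GC never to run heap compaction.
   Assumes the OCaml 4.x meaning of [max_overhead]. *)
let disable_compaction () =
  Gc.set { (Gc.get ()) with Gc.max_overhead = 1_000_000 }

let () = disable_compaction ()
```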
* Re: [Caml-list] Immutable strings 2014-07-04 19:18 [Caml-list] Immutable strings Gerd Stolpmann 2014-07-04 20:31 ` Anthony Tavener 2014-07-04 21:01 ` Markus Mottl @ 2014-07-07 12:42 ` Alain Frisch 2014-07-08 12:24 ` Gerd Stolpmann 2014-07-08 18:15 ` mattiasw 2014-07-21 15:06 ` Alain Frisch 4 siblings, 1 reply; 29+ messages in thread From: Alain Frisch @ 2014-07-07 12:42 UTC (permalink / raw) To: Gerd Stolpmann, caml-list Hi Gerd, Thanks for your interesting post. Your general point about not breaking backward compatibility at the source level, as long as only "basic" features are used, is important. Caml is now more than 30 years old (20 years for OCaml), and it would be very constraining to prevent ourselves from fixing bugs in the language design, including when they are about core features. Some care needs to be taken to provide a nice story to long-term users and a smooth migration path, using a combination of social means (interaction with the community) and technical ones (backward compatibility mode, "deprecated" warnings, sometimes tools to automate the transition). Even if we look only at industrial adoption, OCaml competes with more recently designed languages, and if we cannot revisit existing choices, the risk is real for OCaml to appear "frozen", less modern, and a less compelling choice for new projects. This needs to be balanced against the risk of putting off owners of "passive" code bases (on which no dedicated development team works on a regular basis, but which need to be marginally modified and re-compiled once in a while). Concerning immutable strings, the migration path seems quite good to me: a warning tells you about direct uses of string-mutation features (such as String.set), and the default behavior does not break existing code. FWIW, it was a matter of hours to go through LexiFi's entire code base to enable the new safe mode, and as always in such operations, it was a good opportunity to factorize similar code.
And Jane Street does not seem overly worried by the task (see https://blogs.janestreet.com/ocaml-4-02-everything-else/). As one of the problems with the current solution, you mention that conversion of strings to bytes and vice versa requires a copy, which incurs some performance penalty. This is true, but the new system can also avoid a lot of string copying (in safe mode). With mutable strings, a library which expects strings from the client code and depends on those strings remaining the same needs to copy them, and similarly, it cannot directly return a string from its internal data structures, since the client could modify it and thus break internal invariants. (Many libraries don't make such copies and, in the good cases, mention in their documentation that the strings should be treated as immutable ones by the caller. This is clearly a source of possibly tricky bugs.) The biggest problem you mention is related to the fact that in many contexts, both mutable and immutable strings could be relevant. Your first idea to address this problem is to consider bytes (mutable strings) as a subtype of (immutable) strings. This only addresses part of the problem: a library might still need to copy most strings on its boundaries to ensure a proper semantics; not only strings returned by the library as you mention, but also some strings passed from the client code to the library functions. Your second idea is to create a common supertype of both string and bytes, to be used in contexts which can consume either type. A minor variation would be to introduce a third abstract type, with "identity" injections from bytes and string to it, and to expose the "read-only" part of the string API on it. This can entirely be implemented in user-land (although it could benefit from being in the stdlib, so that e.g. Lexing could expose a from_stringlike). Another variant of it is to see "stringlike" as a type class, implemented with explicit dictionaries.
This could be done with records:

  type 'a stringlike = {
    get: 'a -> int -> char;
    length: 'a -> int;
    sub_string: 'a -> int -> int -> string;
    output: out_channel -> 'a -> unit;
    ...
  }

There would be two constant records "string stringlike" and "bytes stringlike", and functions accepting either kind of string would take an extra stringlike argument. (Alternatively, one could use first class modules instead of records.) There is some overhead related to the dynamic dispatch, but I'm not convinced this would be unacceptable. Your third idea (using char bigarrays) would then fit nicely in this approach.

Another direction would be to also support the case of functions which can return either a bytes or a string. A typical case is Bytes.sub / Bytes.sub_string. One could also want Bytes.cat_to_string: bytes -> bytes -> string in addition to Bytes.cat: bytes -> bytes -> bytes. For those cases, one could introduce a GADT such as:

  type _ is_a_string =
    | String: string is_a_string
    | Bytes: bytes is_a_string
    (* potentially more cases *)

You could then pass the desired constructor to the functions, e.g.: Bytes.sub: bytes -> int -> int -> 'a is_a_string -> 'a. The cost of dispatching on the constructor is tiny, and the stdlib could bypass the test altogether using unsafe features internally. Higher-level functions which can return either a string or a bytes are likely to produce the result by passing the is_a_string value down to lower-level functions. But one could also imagine that some functions behave differently according to the actual type of the result. For instance, a function which is likely to often produce the same strings could decide to keep a (weak) table of already returned strings, or to have a hard-coded list of common strings; this works only for immutable results, and so the function needs to check the is_a_string constructor to enable/disable these optimizations.
The "stringlike" idea could also be replaced by this is_a_string GADT, so that there could be a single function: val sub: 'a is_a_string -> 'a -> int -> int -> 'b is_a_string -> 'b All that said, I think the current situation is already a net improvement over the previous one, and that further layers can be built on top of it, if needed (and not necessarily in stdlib). Alain On 07/04/2014 09:18 PM, Gerd Stolpmann wrote: > Hi list, > > I've just posted a blog article where I criticize the new concept of > immutable strings that will be available in OCaml 4.02 (as option): > > http://blog.camlcity.org/blog/bytes1.html > > In short my point is that it the new concept is not far reaching enough, > and will even have negative impact on the code quality when it is not > improved. I also present three ideas how to improve it. > > Gerd > ^ permalink raw reply [flat|nested] 29+ messages in thread
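The is_a_string GADT described in the message above can be tried out in user code. The following is a minimal sketch: the free-standing function `sub` is a hypothetical stand-in for the proposed Bytes.sub variant, written with the existing Bytes.sub and Bytes.sub_string rather than any stdlib-internal shortcut.

```ocaml
type _ is_a_string =
  | String : string is_a_string
  | Bytes : bytes is_a_string

(* Hypothetical user-land version of the proposed function: the last
   argument selects, at the type level, what kind of result is built. *)
let sub : type a. bytes -> int -> int -> a is_a_string -> a =
  fun b ofs len kind ->
    match kind with
    | String -> Bytes.sub_string b ofs len
    | Bytes -> Bytes.sub b ofs len

let () =
  let buf = Bytes.of_string "immutable" in
  let s : string = sub buf 0 2 String in
  let b : bytes = sub buf 2 3 Bytes in
  assert (s = "im");
  assert (Bytes.to_string b = "mut")
```

Each branch performs exactly one copy, so the caller who wants a string does not pay for an intermediate bytes value.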
* Re: [Caml-list] Immutable strings 2014-07-07 12:42 ` Alain Frisch @ 2014-07-08 12:24 ` Gerd Stolpmann 2014-07-09 13:54 ` Alain Frisch 0 siblings, 1 reply; 29+ messages in thread From: Gerd Stolpmann @ 2014-07-08 12:24 UTC (permalink / raw) To: Alain Frisch; +Cc: caml-list On Monday, 07.07.2014, at 14:42 +0200, Alain Frisch wrote: > Hi Gerd, > > Thanks for your interesting post. Your general point about not breaking > backward compatibility at the source level, as long as only "basic" > features are used, is important. ... Even if we look only at > industrial adoption, OCaml compete with languages more recently designed > and if we cannot touch revisit existing choices, the risk is real for > OCaml to appear "frozen", less modern, and a less compelling choice for > new projects. This needs to be balanced against the risk of putting off > owners of "passive" code bases (on which no dedicated development team > work on a regular basis, but which need to be marginally modified and > re-compiled once in a while). It will create confusion even with actively maintained code bases. What could help here is very clear communication about when the change will become the standard behavior, and how the migration will take place. Currently, it feels like a big experiment: hey, let users tentatively enable it, and watch out for problems. That's quite naive. In particular, users hitting problems will probably not try out the switch (or will immediately revert), because leaving the code base in a non-buildable state for a longer time is not an option. (And ignoring these users would not be good, because it's exactly these users, who are really doing string mutation, who could profit most from the change.) > Concerning immutable strings, the migration path seems quite good to me: > a warning tells you about direct uses of string-mutation features (such > as String.set), and the default behavior does not break existing code.
That's good for now, but I'm more expecting something like: next ocaml version it is experimental (interfaces may still evolve). The following version it is recommended standard and we'll emit a warning when -safe-strings is not on. The version after that we'll make -safe-strings the default, etc. Something like that. There could also be a section in the manual explaining the new behavior, and how to convert code. > FWIW, it was a matter of hours to go through the entire LexiFi's code > base to enable the new safe mode, and as always in such operations, it > was a good opportunity to factorize similar code. And Jane Street does > not seem overly worried by the task ( see > https://blogs.janestreet.com/ocaml-4-02-everything-else/ ). With my current customer, I don't see any bigger problems either, because string mutation doesn't play a big role there (it's a compiler project). I see a big problem with OCamlnet, though, as it is focused on I/O, and the issue how to deal with buffers is quite central. > As one of the problems with the current solution, you mention that > conversion of strings to bytes and vice versa requires a copy, which > incurs some performance penalty. This is true, but the new system can > also avoid a lot of string copying (in safe mode). ... > (Many libraries don't do such copy, and, in the good cases, > mention in their documentation that the strings should be treated as > immutable ones by the caller. This is clearly a source of possibly > tricky bugs.) Right, that's the good side of it. (Although the danger is quite theoretical, as most programmers seem to intuitively follow the rule "don't mutate strings you did not create". I've never seen this kind of bug in practice.) > Your second idea is to create a common supertype of both string and > bytes, to be used in contexts which can consume either type. 
A minor > variantiation would be to introduce a third abstract type, with > "identity" injection from byte and string to it, and to expose the > "read-only" part of the string API on it. This can entirely be > implemented in user-land (although it could benefit from being in the > stdlib, so that e.g. Lexing could expose a from_stringlike). I think it would be quite important to have that in the stdlib: - This sets a standard for interoperability between libraries - The stdlib can exploit the details of the representation - It would be possible to use stringlike directly in C interfaces For instance, there is one module in OCamlnet where a regexp is directly run on an I/O buffer (generally, you need to do some primitive parsing on I/O buffers before you can extract strings, and that's where stringlike would be nice to have). Without stringlike, I would have to replace that regexp somehow. > Another > variant of it is to see "stringlike" as a type class, implemented with > explicit dictionaries. This could be done with records: > > type 'a stringlike = { > get: 'a -> int -> char; > length: 'a -> int; > sub_string: 'a -> int -> int -> string; > output: out_channel -> 'a -> unit; > ... > } > > There would be two constant records "string stringlike" and "bytes > stringlike", and functions accepting either kind of string would take an > extra stringlike argument. (Alternatively, one could use first class > modules instead of records.) There is some overhead related to the > dynamic dispatch, but I'm not convinced this would be unacceptable. The overhead is quite low. If you need to call e.g. "get" several times, you could factor out the dictionary lookup: let get = stringlike.get in ... The (only) price is that the access cannot be inlined anymore. > Your third idea (using char bigarrays) would then fit nicely in this > approach. Right, and it would even be possible to use that for other buffer representations (e.g. 
I have a non-contiguous buffer type called Netpagebuffer in OCamlnet that could also be compatible with stringlike; also think of ring buffers). It's a really nice idea. The downside is that I cannot imagine any easy way to support this in C interfaces. Well, you could have low_level_buffer : 'a -> (Obj.t * int * int) that gets you a base address, an offset, and a length, but that could be too optimistic. Maybe C interfaces should simply dynamically check whether 'a is a string or bigarray, and fail otherwise. These dynamic checks are at least possible (maybe there could be a caml_stringlike_val function that does its very best). Another downside of this approach is that it introduces a lot of type variables. > Another direction would be to support also the case of functions which > can return either a bytes or a string. A typical case is Bytes.sub / > Bytes.sub_string. One could also want Bytes.cat_to_string: bytes -> > bytes -> string in addition to Bytes.cat: bytes -> bytes -> bytes. For > those cases, one could introduce a GADT such as: > > type _ is_a_string = > | String: string is_a_string > | Bytes: bytes is_a_string > (* potentially more cases *) > > You could then pass the desired constructor to the functions, e.g.: > Bytes.sub: bytes -> int -> int -> 'a is_a_string -> 'a. The cost of > dispatching on the constructor is tiny, and the stdlib could bypass the > test altogether using unsafe features internally. Higher-level > functions which can return either a string or a bytes are likely to > produce the result by passing the is_a_string value down to lower-level > functions. That's also a nice idea, and it will definitely save a few string copies here and there. > But one could also imagine that some function behave > differently according to the actual type of result. 
For instance, a > function which is likely to produce often the same strings could decide > to keep a (weak) table of already returned strings, or to have a > hard-coded list of common strings; this works only for immutable > results, and so the function needs to check the is_a_string constructor > to enable/disable these optimizations. The "stringlike" idea could also > be replaced by this is_a_string GADT, so that there could be a single > function: > > val sub: 'a is_a_string -> 'a -> int -> int -> 'b is_a_string -> 'b > > > All that said, I think the current situation is already a net > improvement over the previous one, Well, I wouldn't say so because I'm missing good migration paths for some important cases. > and that further layers can be built > on top of it, if needed (and not necessarily in stdlib). Well, as pointed out, I'd really like to see one such layer in stdlib, because we'll otherwise have five different solutions in the library scene which are all incompatible to each other. (Your type class suggestion looks easy and will already solve most of the issues; why not just include it into the stdlib, it wouldn't need much: a new module Stringlike defining it, the records for String and Bytes and maybe char Bigarrays, and some extensions here and there where it is used, e.g. in Lexing.) IMHO, it is important to really provide practical solutions, and not only to theoretically have one. Gerd > > Alain > > > > On 07/04/2014 09:18 PM, Gerd Stolpmann wrote: > > Hi list, > > > > I've just posted a blog article where I criticize the new concept of > > immutable strings that will be available in OCaml 4.02 (as option): > > > > http://blog.camlcity.org/blog/bytes1.html > > > > In short my point is that it the new concept is not far reaching enough, > > and will even have negative impact on the code quality when it is not > > improved. I also present three ideas how to improve it. 
> > > > Gerd > > > > -- ------------------------------------------------------------ Gerd Stolpmann, Darmstadt, Germany gerd@gerd-stolpmann.de My OCaml site: http://www.camlcity.org Contact details: http://www.camlcity.org/contact.html Company homepage: http://www.gerd-stolpmann.de ------------------------------------------------------------ [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 473 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
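The is_a_string GADT quoted in the message above can be sketched roughly as follows. This is only an illustration under stated assumptions: the functions length and sub are hypothetical and not part of the stdlib, and the constructors String and Bytes live in a different namespace from the stdlib modules of the same name, so both can be used side by side.

```ocaml
(* Witness type from the quoted proposal: the type parameter records
   whether a value is an immutable string or a mutable bytes. *)
type _ is_a_string =
  | String : string is_a_string
  | Bytes : bytes is_a_string

(* Hypothetical helper: length of either representation. *)
let length : type a. a is_a_string -> a -> int =
 fun kind s ->
  match kind with
  | String -> String.length s
  | Bytes -> Bytes.length s

(* Hypothetical single [sub] in the shape of the quoted signature:
   the first witness describes the input, the last one the result. *)
let sub : type a b. a is_a_string -> a -> int -> int -> b is_a_string -> b =
 fun src s ofs len dst ->
  match src, dst with
  | String, String -> String.sub s ofs len
  | String, Bytes -> Bytes.of_string (String.sub s ofs len)
  | Bytes, String -> Bytes.sub_string s ofs len
  | Bytes, Bytes -> Bytes.sub s ofs len
```

A caller picks the result type by passing a constructor, e.g. sub Bytes buf 0 4 String yields a string. The dispatch is an ordinary match on two constant constructors, which is the "tiny cost" mentioned in the quoted text.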
* Re: [Caml-list] Immutable strings 2014-07-08 12:24 ` Gerd Stolpmann @ 2014-07-09 13:54 ` Alain Frisch 2014-07-09 18:04 ` Gerd Stolpmann 2014-07-14 17:40 ` Richard W.M. Jones 0 siblings, 2 replies; 29+ messages in thread From: Alain Frisch @ 2014-07-09 13:54 UTC (permalink / raw) To: Gerd Stolpmann; +Cc: caml-list On 07/08/2014 02:24 PM, Gerd Stolpmann wrote: > It will create confusion even with actively maintained code bases. What > could help here is very clear communication when the change will be the > standard behavior, and how the migration will take place. It's a very different kind of criticism from your initial point about the decision to go in the current direction. Point taken: the development team will need to communicate about the expected timeline and migration path. But note that 4.02 is not even out, and since the default behavior is the previous one, there is no hurry, and it's fine if people wait a few months before trying the new mode. It doesn't seem crazy to wait for some early user feedback and synchronize with them before deciding on a more precise plan for the wider community. For instance, your feedback about porting ocamlnet is quite useful and the current discussion shows that several solutions compete and need further thought. Without the new compiler switch, this discussion would not have taken place. > Right, that's the good side of it. (Although the danger is quite > theoretical, as most programmers seem to intuitively follow the rule > "don't mutate strings you did not create". I've never seen this kind of > bug in practice.) Still, library functions such as string_of_bool, or string_of_format (in the previous version) had to be written carefully, with extra copies, to avoid public humiliation (or not). 
> I think it would be quite important to have that in the stdlib: > > - This sets a standard for interoperability between libraries > - The stdlib can exploit the details of the representation > - It would be possible to use stringlike directly in C interfaces Note that if it goes to stdlib, one cannot refer to bigarrays. (One might want to have bigarrays in stdlib, but we are not there yet.) -- Alain ^ permalink raw reply [flat|nested] 29+ messages in thread
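The hazard behind the "extra copies" remark can be sketched with bytes standing in for the old mutable strings. The function names below are hypothetical and this is not how the stdlib is written; the point is only that handing out a shared mutable constant lets one caller corrupt what later callers see, which a defensive copy prevents.

```ocaml
(* Shared mutable constants, standing in for what string literals
   effectively were when strings were mutable. *)
let shared_true = Bytes.of_string "true"
let shared_false = Bytes.of_string "false"

(* Hypothetical careless version: hands out the shared buffer itself. *)
let careless_of_bool b = if b then shared_true else shared_false

(* Hypothetical careful version: every caller gets a fresh copy. *)
let careful_of_bool b = Bytes.copy (careless_of_bool b)

let () =
  let mine = careful_of_bool true in
  Bytes.set mine 0 'T';
  (* Only the private copy changed. *)
  assert (Bytes.to_string (careless_of_bool true) = "true");
  let alias = careless_of_bool true in
  Bytes.set alias 0 'T';
  (* The shared constant itself was mutated. *)
  assert (Bytes.to_string (careless_of_bool true) = "True")
```

With an immutable string type the careless version becomes safe, and the copy is no longer needed.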
* Re: [Caml-list] Immutable strings 2014-07-09 13:54 ` Alain Frisch @ 2014-07-09 18:04 ` Gerd Stolpmann 2014-07-10 6:41 ` Nicolas Boulay 2014-07-14 17:40 ` Richard W.M. Jones 1 sibling, 1 reply; 29+ messages in thread From: Gerd Stolpmann @ 2014-07-09 18:04 UTC (permalink / raw) To: Alain Frisch; +Cc: caml-list [-- Attachment #1: Type: text/plain, Size: 3082 bytes --] On Wednesday, 09.07.2014, at 15:54 +0200, Alain Frisch wrote: > On 07/08/2014 02:24 PM, Gerd Stolpmann wrote: > > It will create confusion even with actively maintained code bases. What > > could help here is very clear communication when the change will be the > > standard behavior, and how the migration will take place. > > It's a very different kind of criticism from your initial point about > the decision to go in the current direction. Right, but the question of what the user process will look like is just the next one. The design of the change so far is minimalistic, and it is obvious that some abstraction is missing, and my only explanation is that there wasn't a consensus in the OCaml team (but that's just a wild guess). I don't want to say that the OCaml team is ignoring any problems, but it looks like the question of the missing abstraction is somehow offloaded to the users, namely whether it is needed at all in the stdlib (maybe nobody is complaining), or which style is preferred. (I just want to say that there is IMHO a connection between the minimalistic design and the social embedding.) > Point taken: the > development team will need to communicate about the expected timeline > and migration path. But note that 4.02 is not even out, and since the > default behavior is the previous one, there is no hurry, and it's fine > if people wait a few months before trying the new mode. My thinking here is that 95% of the users will have no problems at all when they convert their programs. It's the other 5% for which the current design is not really sufficient. 
Let's just hope these users aren't immediately discouraged when they find that out. > It doesn't > seem crazy to wait for some early user feedback and synchronize with > them before deciding on a more precise plan for the wider community. For > instance, your feedback about porting ocamlnet is quite useful and the > current discussion shows that several solutions compete and need further > thought. Without the new compiler switch, this discussion would not > have taken place. Fully agreed. > > I think it would be quite important to have that in the stdlib: > > > > - This sets a standard for interoperability between libraries > > - The stdlib can exploit the details of the representation > > - It would be possible to use stringlike directly in C interfaces > > Note that if it goes to stdlib, one cannot refer to bigarrays. (One > might want to have bigarrays in stdlib, but we are not there yet.) Right, but this isn't a big deal. (Bigarray also uses Unix.file_descr, but this dep is easy to work around by anchoring file_descr in Pervasives.) Gerd -- ------------------------------------------------------------ Gerd Stolpmann, Darmstadt, Germany gerd@gerd-stolpmann.de My OCaml site: http://www.camlcity.org Contact details: http://www.camlcity.org/contact.html Company homepage: http://www.gerd-stolpmann.de ------------------------------------------------------------ [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 473 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] Immutable strings 2014-07-09 18:04 ` Gerd Stolpmann @ 2014-07-10 6:41 ` Nicolas Boulay 0 siblings, 0 replies; 29+ messages in thread From: Nicolas Boulay @ 2014-07-10 6:41 UTC (permalink / raw) Cc: caml-list In one of my programs, I parse a lot of small files. Most of the time was consumed by the GC, because a lot of strings were created. Files are read into strings and then converted to data structures, but those string buffers are not easily reusable for the next file, so the GC has a hard job. Maybe OCaml needs a way to enable the reuse of strings to reduce the pressure on the GC, and reduce the need for mutable strings. Regards, Nicolas ^ permalink raw reply [flat|nested] 29+ messages in thread
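One way to get the kind of reuse Nicolas asks for is to keep a single mutable buffer alive across files. The following is a minimal sketch, assuming the per-file work can be done directly on a Buffer.t; count_words and process_all are made-up names, and in-memory strings stand in for file contents.

```ocaml
(* Count space-separated words held in a buffer without building an
   intermediate string. *)
let count_words (buf : Buffer.t) : int =
  let n = ref 0 and in_word = ref false in
  for i = 0 to Buffer.length buf - 1 do
    let is_space = Buffer.nth buf i = ' ' in
    if (not is_space) && not !in_word then incr n;
    in_word := not is_space
  done;
  !n

(* Process many inputs with one buffer: Buffer.clear resets the length
   but keeps the underlying storage, so it is not reallocated per file. *)
let process_all (inputs : string list) : int list =
  let buf = Buffer.create 4096 in
  List.map
    (fun contents ->
      Buffer.clear buf;
      (* Stands in for reading the next file into the buffer. *)
      Buffer.add_string buf contents;
      count_words buf)
    inputs
```

With real files, Buffer.add_channel could fill the buffer from an in_channel instead of Buffer.add_string. Whether this helps in practice depends on how much of the allocation comes from the raw file contents rather than from the parsed data structures.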
* Re: [Caml-list] Immutable strings 2014-07-09 13:54 ` Alain Frisch 2014-07-09 18:04 ` Gerd Stolpmann @ 2014-07-14 17:40 ` Richard W.M. Jones 1 sibling, 0 replies; 29+ messages in thread From: Richard W.M. Jones @ 2014-07-14 17:40 UTC (permalink / raw) To: Alain Frisch; +Cc: Gerd Stolpmann, caml-list On Wed, Jul 09, 2014 at 03:54:57PM +0200, Alain Frisch wrote: > On 07/08/2014 02:24 PM, Gerd Stolpmann wrote: > >It will create confusion even with actively maintained code bases. What > >could help here is very clear communication when the change will be the > >standard behavior, and how the migration will take place. > > It's a very different kind of criticism from your initial point > about the decision to go in the current direction. Point > taken: the development team will need to communicate about the > expected timeline and migration path. But note that 4.02 is not even > out, and since the default behavior is the previous one, there is no > hurry, and it's fine if people wait a few months before trying the > new mode. It doesn't seem crazy to wait for some early user > feedback and synchronize with them before deciding on a more precise > plan for the wider community. For instance, your feedback about > porting ocamlnet is quite useful and the current discussion shows > that several solutions compete and need further thought. Without > the new compiler switch, this discussion would not have taken place. The problem we may* have is that we have to support OCaml back to ~ 3.10 from the same code base. Rich. * I say `may' in that sentence because I've just ignored the warnings so far -- having much bigger problems with armv7hl & aarch64 support in 4.02 right now. -- Richard Jones Red Hat ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] Immutable strings 2014-07-04 19:18 [Caml-list] Immutable strings Gerd Stolpmann ` (2 preceding siblings ...) 2014-07-07 12:42 ` Alain Frisch @ 2014-07-08 18:15 ` mattiasw 2014-07-08 19:24 ` Daniel Bünzli ` (2 more replies) 2014-07-21 15:06 ` Alain Frisch 4 siblings, 3 replies; 29+ messages in thread From: mattiasw @ 2014-07-08 18:15 UTC (permalink / raw) To: caml-list My two cents: To me it seems very strange to introduce a new string type and not make it UTF-8 from the start. OCaml will be the last language that doesn't have standardized Unicode support. Even old languages like Erlang have gone the UTF-8 way, and that includes program code. Bytes and strings have nothing in common, but str.[4] is still relevant for UTF-8 strings. The algorithm is slightly more complicated. I converted a big OCaml program to F# and the immutable strings were the smallest problem, since they are detected by the compiler. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] Immutable strings 2014-07-08 18:15 ` mattiasw @ 2014-07-08 19:24 ` Daniel Bünzli 2014-07-08 19:27 ` Raoul Duke 2014-07-09 14:15 ` Daniel Bünzli 2014-07-14 17:45 ` Richard W.M. Jones 2 siblings, 1 reply; 29+ messages in thread From: Daniel Bünzli @ 2014-07-08 19:24 UTC (permalink / raw) To: mattiasw; +Cc: caml-list On Tuesday, 8 July 2014 at 19:15, mattiasw@gmail.com wrote: > My two cents: > > To me it seems very strange to introduce a new string type and not make it > UTF-8 from the start. No new string type was introduced. A bytes type was introduced. > OCaml will be the last language that doesn't have standardized Unicode > support. What do you mean by standardized Unicode support in the language *exactly*? I'd be genuinely interested in knowing the actual level of support for Unicode in these languages, beyond saying our string is a UTF-X encoded sequence of scalar values. For example, do these other languages perform Unicode normalisation on string literals/patterns (and identifiers if they choose that craze)? This, for example, would be absolutely necessary for performing any kind of real-world processing on Unicode strings, but then there's not only a single normalisation form and the one you want depends on the context. Do they have a notation to indicate in which form they want the literal/pattern to be? > Even old languages like Erlang have gone the UTF-8 way, and that > includes program code. For a very very very very long time it has been possible to write UTF-8 encoded literals, unnormalized or normalized according to the normal form of your editor, in your OCaml sources; you just had to drop the idea of using latin1 identifiers, which are anyway deprecated since 4.01. As for being able to write Unicode *identifiers* in the language, I'm actually quite glad OCaml doesn't have that; there are both too many arrow characters to use in Unicode and too many unreasonable programmers out there. 
> Bytes and strings have nothing in common, but str.[4] is still relevant for > UTF-8 strings. Direct indexing is rarely relevant in Unicode, as usually you want those indexes to correspond to user-perceived characters (e.g. to align things in text formatting) and user-perceived characters may be written as a sequence of Unicode scalar values… or not (even in normal forms, since an arbitrary number of combining characters can be applied to a base character). The Unicode segmentation algorithm allows you to find these boundaries; simple indexing doesn't and is mostly worthless in Unicode processing. Best, Daniel ^ permalink raw reply [flat|nested] 29+ messages in thread
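To make the indexing point concrete: in OCaml, s.[i] addresses bytes, which correspond neither to scalar values nor to user-perceived characters. A small sketch follows; the literals use byte escapes so they do not depend on the source file's encoding, and count_scalar_values is a made-up helper that assumes well-formed UTF-8 and performs no validation.

```ocaml
(* "é" as the single scalar value U+00E9 (two bytes), followed by "a". *)
let precomposed = "\xc3\xa9a"

(* "é" as "e" followed by the combining acute accent U+0301. *)
let decomposed = "e\xcc\x81"

(* Counts scalar values by skipping UTF-8 continuation bytes (10xxxxxx).
   Assumes [str] is valid UTF-8. *)
let count_scalar_values (str : string) : int =
  let n = ref 0 in
  String.iter
    (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n)
    str;
  !n

let () =
  (* Three bytes, two scalar values. *)
  assert (String.length precomposed = 3);
  assert (count_scalar_values precomposed = 2);
  (* Index 1 lands on a continuation byte, not on 'a'. *)
  assert (precomposed.[1] = '\xa9');
  (* Two scalar values, but a reader perceives one character. *)
  assert (count_scalar_values decomposed = 2)
```

Finding user-perceived character boundaries in the second case needs the Unicode segmentation rules mentioned above; counting bytes or scalar values is not enough.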
* Re: [Caml-list] Immutable strings 2014-07-08 19:24 ` Daniel Bünzli @ 2014-07-08 19:27 ` Raoul Duke 0 siblings, 0 replies; 29+ messages in thread From: Raoul Duke @ 2014-07-08 19:27 UTC (permalink / raw) To: OCaml ja wohl, n'est pas, это жизнь, based on my experiences with strings and stuff over the years, i resonate with what Daniel posted. :-) things like UTF-whatever are baseline requirements, but beyond that (a) nobody has it right (b) unicode sucks. :-) ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] Immutable strings 2014-07-08 18:15 ` mattiasw 2014-07-08 19:24 ` Daniel Bünzli @ 2014-07-09 14:15 ` Daniel Bünzli 2014-07-14 17:45 ` Richard W.M. Jones 2 siblings, 0 replies; 29+ messages in thread From: Daniel Bünzli @ 2014-07-09 14:15 UTC (permalink / raw) To: caml-list On Tuesday, 8 July 2014 at 19:15, mattiasw@gmail.com wrote: > OCaml will be the last language that doesn't have standardized Unicode > support. Even old languages like Erlang have gone the UTF-8 way, and that > includes program code. For fun I just had a look at what Python does. So in Python basically they have a Unicode string which is a string made of Unicode *code points*. Fail, end of discussion. Should have been: *scalar values* (for those who don't understand why, I suggest reading my minimal Unicode introduction [1]). (Both in 2 and 3; apparently 2 used to be messier for reasons I didn't bother to understand, they seem to be highly confused.) Sample code. U+D800 is the first surrogate, i.e. something you should never see in concrete Unicode textual processing, only in UTF-16 encoded bytes and paired with an appropriate low surrogate. Python2:

    >>> u'\uD800'.encode('utf-8')
    '\xed\xa0\x80'

Congratulations, you just produced an invalid UTF-8 sequence (serialized a surrogate). Python3 is a *little* better with *UTF-8* (but wait…) encoding stuff:

    >>> "\uD800".encode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

So now let's try UTF-16:

    >>> "\uD800".encode("utf-16")
    b'\xff\xfe\x00\xd8'

Congratulations, you just produced an invalid UTF-16 sequence: a high surrogate without a corresponding low surrogate (which together would define a Unicode scalar value). Why on earth do they allow surrogates to be represented *at all* in their Unicode text data structure? Basically they don't understand Unicode. 
The old camel should not be ashamed of its *outstanding* (absolutely) Unicode support (this is not to say that nothing can be improved, I do have some proposals in the works) but the situation is not bad either. Best, Daniel P.S. Skimming through these articles about Python Unicode strings I gather why people find Unicode hard; there seems to be a high level of both technical and conceptual confusion. Again, have a read of [1] if you'd like to clear (I hope) your mind about these things. [1] http://erratique.ch/software/uucp/doc/Uucp.html#uminimal ^ permalink raw reply [flat|nested] 29+ messages in thread
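For comparison, the Uchar module that was added to the OCaml standard library after this thread (in 4.03) takes the scalar-value route described above: a Uchar.t cannot hold a surrogate. A minimal sketch, where is_scalar_value and try_make are illustrative wrappers rather than stdlib functions:

```ocaml
(* Uchar.is_valid tells whether an integer is a Unicode scalar value,
   i.e. in 0x0000..0xD7FF or 0xE000..0x10FFFF. *)
let is_scalar_value (cp : int) : bool = Uchar.is_valid cp

(* Uchar.of_int raises Invalid_argument on anything else, so a
   surrogate such as U+D800 can never be constructed. *)
let try_make (cp : int) : Uchar.t option =
  match Uchar.of_int cp with
  | u -> Some u
  | exception Invalid_argument _ -> None

let () =
  assert (is_scalar_value 0x41);
  assert (not (is_scalar_value 0xD800));
  assert (try_make 0xD800 = None)
```

This is the opposite design choice from the Python sessions quoted above, where the lone surrogate is representable in the text type and only rejected (or not) at encoding time.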
* Re: [Caml-list] Immutable strings 2014-07-08 18:15 ` mattiasw 2014-07-08 19:24 ` Daniel Bünzli 2014-07-09 14:15 ` Daniel Bünzli @ 2014-07-14 17:45 ` Richard W.M. Jones 2 siblings, 0 replies; 29+ messages in thread From: Richard W.M. Jones @ 2014-07-14 17:45 UTC (permalink / raw) To: caml-list On Tue, Jul 08, 2014 at 08:15:39PM +0200, mattiasw@gmail.com wrote: > My two cents: > > To me it seems very strange to introduce a new string type and not make it > UTF-8 from the start. I would far prefer that OCaml did *not* specify an encoding for strings, and just left them as effectively arrays of bytes, as now. This leaves the business of encoding UTF-8 up to higher layers, either camomile, iconv or the database. That would imply removing incorrect functions like String.uppercase and String.lowercase. There are a couple of reasons for this: (1) It's easy to get Unicode wrong, and baking incorrect Unicode into the language could be worse than not having it at all. See also: Java, Python 2, Ruby, everything using the Win32 API. (2) Doing it right is incredibly complex. See also: Perl 5. Rich. -- Richard Jones Red Hat ^ permalink raw reply [flat|nested] 29+ messages in thread
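For context, later versions of the standard library went in a similar direction: String.uppercase and String.lowercase were deprecated in favour of explicitly ASCII-only variants. A small sketch using String.uppercase_ascii on UTF-8 bytes, written as escapes so the example does not depend on the source encoding:

```ocaml
(* "été" in UTF-8: each "é" is the byte pair C3 A9. *)
let word = "\xc3\xa9t\xc3\xa9"

(* Only the ASCII letter 't' is changed; all non-ASCII bytes pass
   through untouched, so the UTF-8 encoding is never corrupted. *)
let upper = String.uppercase_ascii word

let () = assert (upper = "\xc3\xa9T\xc3\xa9")
```

The result is not a correct Unicode uppercasing of the word, but it is also not a silently wrong Latin-1 one, which leaves real case mapping to a higher-level library as Richard suggests.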
* Re: [Caml-list] Immutable strings 2014-07-04 19:18 [Caml-list] Immutable strings Gerd Stolpmann ` (3 preceding siblings ...) 2014-07-08 18:15 ` mattiasw @ 2014-07-21 15:06 ` Alain Frisch [not found] ` <20140722.235104.405798419265248505.Christophe.Troestler@umons.ac.be> 4 siblings, 1 reply; 29+ messages in thread From: Alain Frisch @ 2014-07-21 15:06 UTC (permalink / raw) To: Gerd Stolpmann, caml-list On 07/04/2014 09:18 PM, Gerd Stolpmann wrote: > http://blog.camlcity.org/blog/bytes1.html Coming back to the motivating example of this post. Lexing provides:

    val from_channel : in_channel -> lexbuf
    val from_string : string -> lexbuf
    val from_function : (bytes -> int -> int) -> lexbuf

In particular, from_function expects you to write to a buffer, so it's pretty clear that its callback must accept a "bytes", not a "string". There is no place for a (string -> int -> int) -> lexbuf function. Concerning from_string: this function copies the string to an internal buffer. This is purely implemented on the OCaml side without any unsafe features. We could avoid this copy because we know that the generated lexers won't actually modify the buffer in that case, but it would be very difficult to do this without using an unsafe feature, even if we had some sort of generalization of bytes and string. We would instead need a completely different implementation (which would not use "stringable") to make the "source" (string or "stream") explicit in the lexbuf data structure. We could also provide an extra from_bytes function, but it can currently be implemented by composing Bytes.to_string and Lexing.from_string. Are you concerned only about the performance overhead of this approach (two copies)? If so, the same argument would apply to the current implementation of from_string, and we would need to switch to a different approach, for which it's not clear that "stringable" would be a big help (see above). 
Before doing anything like that, it would be interesting to evaluate the exact overhead. It could very well be negligible/acceptable for most cases compared to the cost of actual lexing. -- Alain ^ permalink raw reply [flat|nested] 29+ messages in thread
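The composition Alain describes, and the reason the from_function callback takes a bytes, can be sketched as follows. Both function names are chosen here for illustration; only Lexing.from_string, Lexing.from_function, Bytes.to_string and Bytes.blit_string are actual stdlib functions.

```ocaml
(* from_bytes by composition: one copy in Bytes.to_string, and a second
   one when Lexing.from_string fills its internal buffer. *)
let from_bytes (b : bytes) : Lexing.lexbuf =
  Lexing.from_string (Bytes.to_string b)

(* from_function passes the callback a buffer to fill, which is why
   the callback's first argument has to be a mutable bytes. Here the
   data happens to come from a string. *)
let from_string_via_function (s : string) : Lexing.lexbuf =
  let pos = ref 0 in
  Lexing.from_function (fun buf n ->
      let len = min n (String.length s - !pos) in
      Bytes.blit_string s !pos buf 0 len;
      pos := !pos + len;
      (* Returning 0 signals end of input. *)
      len)
```

Because of the copies, mutating the original bytes after calling from_bytes does not affect the lexbuf. That is the safety the two copies buy, and also the overhead whose size Alain proposes to measure.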
[parent not found: <20140722.235104.405798419265248505.Christophe.Troestler@umons.ac.be>]
* Re: [Caml-list] Immutable strings [not found] ` <20140722.235104.405798419265248505.Christophe.Troestler@umons.ac.be> @ 2014-08-29 16:30 ` Damien Doligez 0 siblings, 0 replies; 29+ messages in thread From: Damien Doligez @ 2014-08-29 16:30 UTC (permalink / raw) To: OCaml Mailing List On 2014-07-22, at 23:51, Christophe Troestler wrote: > What about having a phantom variable on bytes indicating access? A > string could become a "ro bytes" without copying. Technically, that would work. In the latest developers meeting, we decided against the phantom type approach because its main advantage is also its main drawback: it takes advantage of the common representation of string and bytes. By keeping the two types separate, we get the freedom of representing them differently. While we have no short-term plan to do that in the normal OCaml runtime, we expect this to be a big win for the likes of ocamljava and js_of_ocaml. Also, as far as we can tell (and we need user feedback at this point) strings and byte buffers are quite distinct in normal OCaml source, so we wouldn't win much by being able to mix them. We are also open to feedback and suggestions on convenience functions that could be added to string.ml to help build strings in common cases without going through a bytes value (http://caml.inria.fr/mantis/view.php?id=6500 ) -- Damien ^ permalink raw reply [flat|nested] 29+ messages in thread
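The phantom-type alternative that Christophe suggested, and that Damien says was decided against, can be sketched like this. The module and its function names are hypothetical and exist only to show the trade-off being discussed.

```ocaml
module Access_bytes : sig
  (* 'perm is a phantom parameter: it never appears in the
     representation, it only restricts which operations type-check. *)
  type 'perm t

  val of_bytes : bytes -> [ `Rw ] t
  val freeze : [ `Rw ] t -> [ `Ro ] t
  val get : 'perm t -> int -> char
  val set : [ `Rw ] t -> int -> char -> unit
  val length : 'perm t -> int
end = struct
  (* Both views share one representation, so no copy is ever made.
     This is the advantage, and also what rules out representing
     read-only and read-write values differently. *)
  type 'perm t = bytes

  let of_bytes b = b
  let freeze b = b
  let get = Bytes.get
  let set = Bytes.set
  let length = Bytes.length
end
```

Note that freeze only produces a read-only view: whoever still holds the read-write handle can keep mutating the same memory. Separate string and bytes types avoid that aliasing and, as Damien points out, leave backends such as js_of_ocaml free to choose a different representation for each.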