From: Alain Frisch <alain@frisch.fr>
To: Gerd Stolpmann <info@gerd-stolpmann.de>, caml-list <caml-list@inria.fr>
Subject: Re: [Caml-list] Immutable strings
Date: Mon, 07 Jul 2014 14:42:20 +0200
Message-ID: <53BA95AC.3050602@frisch.fr>
In-Reply-To: <1404501528.4384.4.camel@e130>
Hi Gerd,
Thanks for your interesting post. Your general point about not breaking
backward compatibility at the source level, as long as only "basic"
features are used, is important. Caml is now more than 30 years old (20
years for OCaml), and it would be very constraining to prevent
ourselves from fixing bugs in the language design, including when they
are about core features. Some care needs to be taken to provide a nice
story to long-term users and a smooth migration path, using a
combination of social means (interaction with the community) and
technical ones (backward compatibility mode, "deprecated" warnings,
sometimes tools to automate the transition). Even if we look only at
industrial adoption, OCaml competes with more recently designed languages,
and if we cannot revisit existing choices, the risk is real for
OCaml to appear "frozen", less modern, and a less compelling choice for
new projects. This needs to be balanced against the risk of putting off
owners of "passive" code bases (on which no dedicated development team
works on a regular basis, but which need to be marginally modified and
re-compiled once in a while).
Concerning immutable strings, the migration path seems quite good to me:
a warning tells you about direct uses of string-mutation features (such
as String.set), and the default behavior does not break existing code.
FWIW, it was a matter of hours to go through LexiFi's entire code
base to enable the new safe mode, and as always in such operations, it
was a good opportunity to factorize similar code. And Jane Street does
not seem overly worried by the task (see
https://blogs.janestreet.com/ocaml-4-02-everything-else/).
As one of the problems with the current solution, you mention that
conversion of strings to bytes and vice versa requires a copy, which
incurs some performance penalty. This is true, but the new system can
also avoid a lot of string copying (in safe mode). With mutable
strings, a library which expects strings from the client code and
depends on those strings to remain the same needs to copy them, and
similarly, it cannot directly return a string from its internal data
structures, since the client could modify it and thus break internal
invariants. (Many libraries don't make such copies, and, in the good cases,
mention in their documentation that the strings should be treated as
immutable ones by the caller. This is clearly a source of possibly
tricky bugs.)
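To make the copying argument concrete, here is a minimal sketch. The two
modules and their names are hypothetical, not taken from any existing
library; they only contrast what a library has to do at its boundaries
when it stores mutable bytes versus immutable strings (assuming safe mode,
where string cannot be mutated).

```ocaml
(* Mutable flavour: a defensive copy is needed at both boundaries. *)
module Mutable_name : sig
  type t
  val make : bytes -> t
  val name : t -> bytes
end = struct
  type t = { name : bytes }
  (* The caller may mutate [b] after the call, so copy on the way in. *)
  let make b = { name = Bytes.copy b }
  (* The caller may mutate the result, so copy on the way out. *)
  let name t = Bytes.copy t.name
end

(* Immutable flavour: sharing is safe, so no copy at either boundary. *)
module Immutable_name : sig
  type t
  val make : string -> t
  val name : t -> string
end = struct
  type t = { name : string }
  let make s = { name = s }
  let name t = t.name
end
```

In the immutable version the value handed back by name is physically the
same string that was passed to make, which is exactly the copy that safe
mode lets a library skip.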
The biggest problem you mention is related to the fact that in many
contexts, both mutable and immutable strings could be relevant. Your
first idea to address this problem is to consider bytes (mutable
strings) as a subtype of (immutable) strings. This only addresses part
of the problem: a library might still need to copy most strings on its
boundaries to ensure proper semantics; not only strings returned by
the library as you mention, but also some strings passed from the client
code to the library functions.
Your second idea is to create a common supertype of both string and
bytes, to be used in contexts which can consume either type. A minor
variation would be to introduce a third abstract type, with
"identity" injections from bytes and string to it, and to expose the
"read-only" part of the string API on it. This can entirely be
implemented in user-land (although it could benefit from being in the
stdlib, so that e.g. Lexing could expose a from_stringlike). Another
variant of it is to see "stringlike" as a type class, implemented with
explicit dictionaries. This could be done with records:
  type 'a stringlike = {
    get: 'a -> int -> char;
    length: 'a -> int;
    sub_string: 'a -> int -> int -> string;
    output: out_channel -> 'a -> unit;
    ...
  }
There would be two constant records "string stringlike" and "bytes
stringlike", and functions accepting either kind of string would take an
extra stringlike argument. (Alternatively, one could use first class
modules instead of records.) There is some overhead related to the
dynamic dispatch, but I'm not convinced this would be unacceptable.
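As a sketch of how the two constant records and a client function could
look: the record below keeps only the four fields spelled out above, and
the names string_like, bytes_like and count_char are made up for this
illustration.

```ocaml
(* Explicit dictionary of read-only operations over some string-like type. *)
type 'a stringlike = {
  get : 'a -> int -> char;
  length : 'a -> int;
  sub_string : 'a -> int -> int -> string;
  output : out_channel -> 'a -> unit;
}

(* Dictionary for immutable strings. *)
let string_like : string stringlike = {
  get = String.get;
  length = String.length;
  sub_string = String.sub;
  output = output_string;
}

(* Dictionary for mutable byte sequences. *)
let bytes_like : bytes stringlike = {
  get = Bytes.get;
  length = Bytes.length;
  sub_string = Bytes.sub_string;
  output = output_bytes;
}

(* A function that accepts either kind of string takes the dictionary
   as an extra argument and only goes through its fields. *)
let count_char (d : 'a stringlike) (s : 'a) (c : char) : int =
  let n = ref 0 in
  for i = 0 to d.length s - 1 do
    if d.get s i = c then incr n
  done;
  !n
```

A caller would write count_char string_like "banana" 'a' or
count_char bytes_like some_bytes 'a'; the dynamic dispatch mentioned above
is the indirect call through d.get and d.length.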
Your third idea (using char bigarrays) would then fit nicely in this
approach.
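For instance, a char bigarray could get its own dictionary. This is only
a sketch: the record is a trimmed copy of the one above (output is left
out), the names bigstring, bigstring_like and bigstring_of_string are
hypothetical, and it assumes an OCaml version where Bigarray is available
without extra linking.

```ocaml
(* Trimmed copy of the dictionary type, without the output field. *)
type 'a stringlike = {
  get : 'a -> int -> char;
  length : 'a -> int;
  sub_string : 'a -> int -> int -> string;
}

(* One-dimensional bigarray of chars, C layout. *)
type bigstring =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

let bigstring_like : bigstring stringlike = {
  get = Bigarray.Array1.get;
  length = Bigarray.Array1.dim;
  sub_string =
    (fun b pos len ->
      (* Check the range up front, then copy the bytes into a new string. *)
      if pos < 0 || len < 0 || pos > Bigarray.Array1.dim b - len then
        invalid_arg "bigstring_like.sub_string";
      String.init len (fun i -> Bigarray.Array1.get b (pos + i)));
}

(* Helper used only to build example values. *)
let bigstring_of_string (s : string) : bigstring =
  let b =
    Bigarray.Array1.create Bigarray.char Bigarray.c_layout (String.length s)
  in
  String.iteri (fun i c -> Bigarray.Array1.set b i c) s;
  b
```

Any function written against the dictionary then works unchanged on
bigarray-backed data.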
Another direction would be to also support the case of functions which
can return either a bytes or a string. A typical case is Bytes.sub /
Bytes.sub_string. One could also want Bytes.cat_to_string: bytes ->
bytes -> string in addition to Bytes.cat: bytes -> bytes -> bytes. For
those cases, one could introduce a GADT such as:
  type _ is_a_string =
    | String: string is_a_string
    | Bytes: bytes is_a_string
    (* potentially more cases *)
You could then pass the desired constructor to the functions, e.g.:
Bytes.sub: bytes -> int -> int -> 'a is_a_string -> 'a. The cost of
dispatching on the constructor is tiny, and the stdlib could bypass the
test altogether using unsafe features internally. Higher-level
functions which can return either a string or a bytes are likely to
produce the result by passing the is_a_string value down to lower-level
functions. But one could also imagine that some functions behave
differently according to the actual type of result. For instance, a
function which is likely to often produce the same strings could decide
to keep a (weak) table of already returned strings, or to have a
hard-coded list of common strings; this works only for immutable
results, and so the function needs to check the is_a_string constructor
to enable/disable these optimizations. The "stringlike" idea could also
be replaced by this is_a_string GADT, so that there could be a single
function:
  val sub: 'a is_a_string -> 'a -> int -> int -> 'b is_a_string -> 'b
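A possible rendering of these signatures, as a sketch only: the function
names sub, sub_any and digit_to are hypothetical and nothing like them
exists in the stdlib. The implementation uses plain pattern matching
rather than the unsafe shortcut mentioned above.

```ocaml
type _ is_a_string =
  | String : string is_a_string
  | Bytes : bytes is_a_string

(* The caller picks the result type by passing a constructor. *)
let sub : type a. bytes -> int -> int -> a is_a_string -> a =
  fun b pos len kind ->
    match kind with
    | String -> Bytes.sub_string b pos len
    | Bytes -> Bytes.sub b pos len

(* Single function covering both input kinds and both output kinds. *)
let sub_any : type a b. a is_a_string -> a -> int -> int -> b is_a_string -> b =
  fun src s pos len dst ->
    match src, dst with
    | String, String -> String.sub s pos len
    | String, Bytes -> Bytes.of_string (String.sub s pos len)
    | Bytes, String -> Bytes.sub_string s pos len
    | Bytes, Bytes -> Bytes.sub s pos len

(* A function that behaves differently depending on the result type:
   immutable results come from a shared table, mutable ones are fresh. *)
let digits = [| "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9" |]

let digit_to : type a. int -> a is_a_string -> a =
  fun n kind ->
    if n < 0 || n > 9 then invalid_arg "digit_to";
    match kind with
    | String -> digits.(n)                  (* shared, safe because immutable *)
    | Bytes -> Bytes.of_string digits.(n)   (* fresh copy, caller may mutate *)
```

With this, sub b 1 3 String has type string and sub b 1 3 Bytes has type
bytes, and digit_to returns the very same string on repeated calls with
String but a new buffer each time with Bytes.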
All that said, I think the current situation is already a net
improvement over the previous one, and that further layers can be built
on top of it, if needed (and not necessarily in stdlib).
Alain
On 07/04/2014 09:18 PM, Gerd Stolpmann wrote:
> Hi list,
>
> I've just posted a blog article where I criticize the new concept of
> immutable strings that will be available in OCaml 4.02 (as option):
>
> http://blog.camlcity.org/blog/bytes1.html
>
> In short my point is that it the new concept is not far reaching enough,
> and will even have negative impact on the code quality when it is not
> improved. I also present three ideas how to improve it.
>
> Gerd
>