From: Andrew Kennedy
To: "'Gerard Huet'"
Cc: OCAML, "'comp-lang-ml@cs.cmu.edu'"
Subject: Formal semantics of programming languages; was "convincing management to switch to Ocaml"
Date: Thu, 2 Sep 1999 02:45:41 -0700

[cross-posted to comp.lang.ml from the O'Caml mailing list as the subject is
SML-related]

Gerard Huet wrote:
>
> Yes indeed. It is still a major challenge to write a formal semantics of a
> non-trivial programming language, with two requirements:
> 1. It should be reasonably close to an implementation of the language used
>    for real applications
> 2. It should be machine-manipulable to the extent that at least it brings
>    some confidence about being able to use it to prove some program
>    property or as a basis to a software engineering tool such as a
>    debugger or a static analyser
...snip...
> - so today the formal semantics of Standard ML is to my knowledge the sole
> published complete semantics of a real programming language. I do not
> believe it answers requirement 2, and probably only a handful of
> specialists can explain how close it comes to answering requirement 1.

As perhaps I can claim to be one of those "handful of specialists", I would
like to make the following comments.

The static semantics of the core language as specified by The Definition of
Standard ML is a reasonable guide to implementing a type checker, even
though it (rightly) specifies only the valid typing judgments and doesn't
explain how to implement *inference*. There are many, many papers that
describe inference algorithms, so this isn't a big deal. However, the
remainder of the Definition is not so tractable for the compiler writer. The
static semantics of Modules is (even in the SML'97 revision) awfully
complicated, introducing many concepts (realisations, enrichment, support,
etc.) not immediately familiar from type theory. Also, I don't know of any
literature on implementing the SML module system. The fact that most bugs in
SML compilers seem connected in some way to modules suggests that there's a
lot of scope for improvement in this area.

The dynamic semantics is specified quite directly as a big-step evaluation
relation. It would be a reasonable guide to implementing an interpreter for
the language. However, compilers don't work this way, instead working by
*translation*, so the recent "type-theoretic" semantics of Harper and Stone
is probably a better guide for compiler writers.

Whilst the Definition specifies the observable behaviour of SML programs, it
omits any discussion or formalisation of intensional behaviour, notably
space efficiency.
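To give a concrete (and entirely home-made) illustration of the kind of
intensional property I mean: the two functions below compute the same value,
so the Definition assigns them the same meaning, yet their space behaviour
can differ dramatically.

    (* My own illustrative example, not taken from the Definition. *)
    fun sum1 (0, acc) = acc
      | sum1 (n, acc) = sum1 (n - 1, acc + n)   (* recursive call in tail position *)

    fun sum2 0 = 0
      | sum2 n = n + sum2 (n - 1)               (* not a tail call: stack grows with n *)

If tail calls are compiled as jumps, sum1 runs in constant stack space,
whereas sum2 always needs stack proportional to n; nothing in the Definition
tells the programmer which behaviour may be relied upon.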
In contrast to the Scheme standard, there is no requirement for tail calls
to be implemented efficiently, whereas many SML programmers would assume
that tail-call optimisation is obligatory. Also, there's no discussion of
sharing, but programmers unthinkingly assume that
"let val x = (2,3) in f (x,x) end" does not make two copies of the pair
(2,3) when applying f.

Returning to the original point (1), I don't actually believe that the
formal semantics should be driven by implementations. The most important
properties of a semantics, in my view, are clarity and correctness. The
"reference semantics" of a language should specify as clearly as possible,
and with as little room for unsoundness as possible, the observable
*meaning* of programs. In addition to the reference semantics, further
explanation or formalisation should be provided for the benefit of
implementers, for example presenting typing rules closer in spirit to an
inference algorithm, and specifying the dynamic semantics by translation
into a smaller language whose dynamic semantics is specified separately.
Furthermore, certain intensional properties should be *prescribed* formally.

I believe that such a reference semantics should suffice for (2), that is,
for machine manipulation in theorem provers and static analysis tools. I
also have a (perhaps naive) belief that the reference semantics *should* be
comprehensible to an educated programmer, in the same way that a formal
grammar for the syntax of a programming language can be understood by a
programmer with a degree in computer science. Of course, part of the problem
here is in the computer science curriculum -- every course on compilers
describes BNF, but how many discuss SOS-style semantics (or even mention the
word semantics at all)?

The Definition of Standard ML is still a model for formal semantics of
practical programming languages -- the recent formalisations of (subsets of)
Java follow its style -- but it's ten or so years since its creation and I
think we could do better another time around. Here are a few things:

* The static semantics should be more type-theoretic, with no reliance on
  "fresh type names". One of the holes in the original 1990 definition was
  related to this issue; much better is to present typing rules whose
  context contains explicit environments for type variables (= type names,
  really), as one does when presenting System F, say.

* The typing rules should be defined on the *abstract* syntax; quite
  incredibly, the Definition has special rules for stripping off
  parentheses, and still more rules for injecting one concrete syntactic
  category into another (e.g. atomic expressions into expressions).

* The type of an expression or the signature of a module should be
  expressible in the syntax of the language. Features of SML combine to
  preclude this, complicating things for the programmer *and* compiler
  writer (and, presumably, theorem provers and other tools).

* The use of "evaluation contexts" could simplify the rules.

* I prefer substitution of argument values for formal parameters in function
  application rather than the use of special closure values. It's much nicer
  for values to be a subset of expressions, so that when I write
  "E evaluates to V", V is just a normal form. (The abstract syntax of
  expressions does need to be extended with store location names, but that's
  a minor tweak.) A small sketch of what I mean follows this list.

* Let the semantics guide the design to an even greater extent.
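To make the substitution point concrete, here is a rough sketch -- entirely
my own, and deliberately simplified (it ignores variable capture and the
store) -- of the style I have in mind, written as SML code rather than as
inference rules:

    (* A purely illustrative sketch: substitution-based evaluation for a
       tiny lambda fragment. Values (Num, Lam) are just expressions. *)
    datatype exp =
        Num of int
      | Var of string
      | Lam of string * exp
      | App of exp * exp

    (* subst v x e  replaces free occurrences of x in e by v
       (capture-avoidance omitted for brevity) *)
    fun subst v x (Var y)      = if x = y then v else Var y
      | subst v x (Lam (y, b)) = if x = y then Lam (y, b)
                                 else Lam (y, subst v x b)
      | subst v x (App (f, a)) = App (subst v x f, subst v x a)
      | subst _ _ e            = e

    (* call-by-value evaluation: the argument is evaluated and then
       substituted into the function body *)
    fun eval (App (f, a)) =
          (case eval f of
               Lam (x, body) => eval (subst (eval a) x body)
             | _             => raise Fail "applying a non-function")
      | eval e = e

Here eval returns an ordinary expression in normal form; there is no
separate class of closure values whose meaning must be explained.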
To return to the last item in the list above: most people now agree that
equality polymorphism should be thrown out for software engineering reasons
(or else generalised to type classes a la Haskell); it also complicates the
semantics out of all proportion to its utility. Similarly, the idea of
"identifier status" is a complication that can be removed by adopting
Haskell-style lexical distinctions. Complicated semantics = confused
programmers!

Every now and then we hear something about ML2000. I'd be particularly
interested to hear the views of those associated with that project.

- Andrew Kennedy (implementer of MLj; see http://www.dcs.ed.ac.uk/~mlj/)