From: Andrew Kennedy
To: "'Gerard Huet'"
Cc: OCAML, "'comp-lang-ml@cs.cmu.edu'"
Subject: Formal semantics of programming languages; was "convincing management to switch to Ocaml"
Date: Thu, 2 Sep 1999 02:45:41 -0700

[cross-posted to comp.lang.ml from the O'Caml mailing list as the subject is
SML-related]

Gerard Huet wrote:
>
> Yes indeed. It is still a major challenge to write a formal semantics of a
> non-trivial programming language, with two requirements:
> 1. It should be reasonably close to an implementation of the language used
>    for real applications
> 2. It should be machine-manipulable to the extent that at least it brings
>    some confidence about being able to use it to prove some program
>    property or as a basis to a software engineering tool such as a
>    debugger or a static analyser
...snip...
> - so today the formal semantics of Standard ML is to my knowledge the sole
> published complete semantics of a real programming language. I do not
> believe it answers requirement 2, and probably only a handful of
> specialists can explain how close it comes to answering requirement 1.

As perhaps I can claim to be one of those "handful of specialists", I would
like to make the following comments.

The static semantics of the core language as specified by The Definition of
Standard ML is a reasonable guide to implementing a type checker, even
though it (rightly) specifies only the valid typing judgments and doesn't
explain how to implement *inference*. There are many, many papers that
describe inference algorithms, so this isn't a big deal. However, the
remainder of the Definition is not so tractable for the compiler writer. The
static semantics of Modules is (even in the SML'97 revision) awfully
complicated, introducing many concepts (realisations, enrichment, support,
etc.) not immediately familiar from type theory. Also, I don't know of any
literature on implementing the SML module system. The fact that most bugs in
SML compilers seem connected in some way to modules suggests that there's a
lot of scope for improvement in this area.

The dynamic semantics is specified quite directly as a big-step evaluation
relation. It would be a reasonable guide to implementing an interpreter for
the language. However, compilers don't work this way, instead working by
*translation*, so the recent "type-theoretic" semantics of Harper and Stone
is probably a better guide for compiler writers.

Whilst the Definition specifies the observable behaviour of SML programs, it
omits any discussion or formalisation of intensional behaviour, notably
space efficiency.
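To give a concrete (and entirely home-made) illustration of the kind of
intensional property I mean: the two functions below compute the same value,
so the Definition assigns them the same meaning, yet their space behaviour
can differ dramatically.

    (* My own illustrative example, not taken from the Definition. *)
    fun sum1 (0, acc) = acc
      | sum1 (n, acc) = sum1 (n - 1, acc + n)   (* recursive call in tail position *)

    fun sum2 0 = 0
      | sum2 n = n + sum2 (n - 1)               (* not a tail call: stack grows with n *)

If tail calls are compiled as jumps, sum1 runs in constant stack space,
whereas sum2 always needs stack proportional to n; nothing in the Definition
tells the programmer which behaviour may be relied upon.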
In contrast to the Scheme standard, there is no requirement for tail calls
to be implemented efficiently, whereas many SML programmers would assume
that tail-call optimisation is obligatory. Also, there's no discussion of
sharing, but programmers unthinkingly assume that
"let val x = (2,3) in f (x,x) end" does not make two copies of the pair
(2,3) when applying f.

Returning to the original point (1), I don't actually believe that the
formal semantics should be driven by implementations. The most important
properties of a semantics, in my view, are clarity and correctness. The
"reference semantics" of a language should specify as clearly as possible,
and with as little room for unsoundness as possible, the observable
*meaning* of programs. In addition to the reference semantics, further
explanation or formalisation should be provided for the benefit of
implementers, for example presenting typing rules closer in spirit to an
inference algorithm, and specifying the dynamic semantics by translation
into a smaller language whose dynamic semantics is specified separately.
Furthermore, certain intensional properties should be *prescribed* formally.

I believe that such a reference semantics should suffice for (2), that is,
for machine manipulation in theorem provers and static analysis tools. I
also have a (perhaps naive) belief that the reference semantics *should* be
comprehensible to an educated programmer, in the same way that a formal
grammar for the syntax of a programming language can be understood by a
programmer with a degree in computer science. Of course, part of the problem
here is in the computer science curriculum -- every course on compilers
describes BNF, but how many discuss SOS-style semantics (or even mention the
word semantics at all)?

The Definition of Standard ML is still a model for formal semantics of
practical programming languages -- the recent formalisations of (subsets of)
Java follow its style -- but it's ten or so years since its creation and I
think we could do better another time around. Here are a few things:

* The static semantics should be more type-theoretic, with no reliance on
  "fresh type names". One of the holes in the original 1990 definition was
  related to this issue; much better is to present typing rules whose
  context contains explicit environments for type variables (= type names,
  really), as one does when presenting System F, say.

* The typing rules should be defined on the *abstract* syntax; quite
  incredibly, the Definition has special rules for stripping off
  parentheses, and still more rules for injecting one concrete syntactic
  category into another (e.g. atomic expressions into expressions).

* The type of an expression or the signature of a module should be
  expressible in the syntax of the language. Features of SML combine to
  preclude this, complicating things for the programmer *and* compiler
  writer (and, presumably, theorem provers and other tools).

* The use of "evaluation contexts" could simplify the rules.

* I prefer substitution of argument values for formal parameters in function
  application rather than the use of special closure values. It's much nicer
  for values to be a subset of expressions, so that when I write
  "E evaluates to V", V is just a normal form. (The abstract syntax of
  expressions does need to be extended with store location names, but that's
  a minor tweak.) A small sketch of what I mean follows this list.

* Let the semantics guide the design to an even greater extent.
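To make the substitution point concrete, here is a rough sketch -- entirely
my own, and deliberately simplified (it ignores variable capture and the
store) -- of the style I have in mind, written as SML code rather than as
inference rules:

    (* A purely illustrative sketch: substitution-based evaluation for a
       tiny lambda fragment. Values (Num, Lam) are just expressions. *)
    datatype exp =
        Num of int
      | Var of string
      | Lam of string * exp
      | App of exp * exp

    (* subst v x e  replaces free occurrences of x in e by v
       (capture-avoidance omitted for brevity) *)
    fun subst v x (Var y)      = if x = y then v else Var y
      | subst v x (Lam (y, b)) = if x = y then Lam (y, b)
                                 else Lam (y, subst v x b)
      | subst v x (App (f, a)) = App (subst v x f, subst v x a)
      | subst _ _ e            = e

    (* call-by-value evaluation: the argument is evaluated and then
       substituted into the function body *)
    fun eval (App (f, a)) =
          (case eval f of
               Lam (x, body) => eval (subst (eval a) x body)
             | _             => raise Fail "applying a non-function")
      | eval e = e

Here eval returns an ordinary expression in normal form; there is no
separate class of closure values whose meaning must be explained.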
To return to the last item in the list above: most people now agree that
equality polymorphism should be thrown out for software engineering reasons
(or else generalised to type classes a la Haskell); it also complicates the
semantics out of all proportion to its utility. Similarly, the idea of
"identifier status" is a complication that can be removed by adopting
Haskell-style lexical distinctions. Complicated semantics = confused
programmers!

Every now and then we hear something about ML2000. I'd be particularly
interested to hear the views of those associated with that project.

- Andrew Kennedy (implementer of MLj; see http://www.dcs.ed.ac.uk/~mlj/)