From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dra-news@metastack.com>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled 
	version=3.1.3
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39])
	by yquem.inria.fr (Postfix) with ESMTP id 26F9BBC0A
	for <caml-list@yquem.inria.fr>; Wed, 23 May 2007 17:10:02 +0200 (CEST)
Received: from orion.metastack.com (no-dns-yet.demon.co.uk [80.177.38.218] (may be forged))
	by concorde.inria.fr (8.13.6/8.13.6) with ESMTP id l4NFA0P5020368
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=FAIL)
	for <caml-list@yquem.inria.fr>; Wed, 23 May 2007 17:10:01 +0200
Received: from treble (cpc2-cmbg6-0-0-cust535.cmbg.cable.ntl.com [81.107.34.24])
	(authenticated bits=0)
	by orion.metastack.com (8.13.4/8.13.3) with ESMTP id l4NF7klX008505
	(version=TLSv1/SSLv3 cipher=RC4-MD5 bits=128 verify=NO)
	for <caml-list@yquem.inria.fr>; Wed, 23 May 2007 16:07:46 +0100
From: "David Allsopp" <dra-news@metastack.com>
To: "OCaml List" <caml-list@yquem.inria.fr>
Subject: Yet another yacc question
Date: Wed, 23 May 2007 16:09:58 +0100
Organization: MetaStack Solutions Ltd.
Message-ID: <00cf01c79d4c$6e33c3b0$6a7ba8c0@treble>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: 7bit
X-Mailer: Microsoft Office Outlook 11
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028
Thread-Index: AcedTG1w7w7hNu8cTGuI2ZUtwwYZdA==
X-Miltered: at concorde with ID 46545948.001 by Joe's j-chkmail (http://j-chkmail . ensmp . fr)!
X-Spam: no; 0.00; context-free:01 ocamlyacc:01 lexer:01 lexer:01 tokens:01 lexing:01 parsing:01 ocamlyacc:01 look-ahead:01 non-terminal:01 look-ahead:01 buffer:01 parser:01 tokens:01 terminating:01 

Suppose I have the context-free grammar

P -> A | A B | A B C | A C

In ocamlyacc, I code this up as:

%token A B C
%type <unit> parse
%start parse
%%

parse:
  A     {()}
| A B   {()}
| A B C {()}
| A C   {()}
;

And set-up a lexer (called lexer) so that the characters 'A', 'B' and 'C'
produce the tokens A, B and C. Then I write the following function:

let f s = parse lexer (Lexing.from_string s)

And use it a few times...

f "ABCZ"   ...   gives ()
f "ACZ"    ...   gives ()
f "AA"     ...   raises Parsing.Parse_error

The third case fails because "AA" is not in the grammar. However, the first
two work even though "ABCZ" and "ACZ" are also not in the grammar (and Z
isn't even a token!). They work because ocamlyacc doesn't need look-ahead
after the "C" in each case to determine that it can reduce to the entry
non-terminal and so return (). In the third case, look-ahead is required -
it looks ahead, sees an A and so fails.

I would quite like the third to match as well and ignore the second A
(ignore and leave on the buffer ready for a future parse... so "peek-ahead"
rather than "look-ahead", I guess). I think I'm probably right in assuming
that ocamlyacc can't do this. I'm not willing to alter my parser to return a
list of tokens which as far as I can see is the only way to make ocamlyacc
do this correctly - i.e.

parse:
  token parse {$1::$2}
| EOF {[]}
;

token:
  /as for parse in the previous grammar/ 

(Incidentally, lest anyone have it confirmed that I'm mad, I'm trying to
parse batches of SQL statements so have no obvious terminating token for a
clause - the parser needs to do a longest possible match ignoring anything
else following that would appear to be a syntax error)

So my question: can menhir, dypgen or any of the other parser generators out
there do this - i.e. return one () on the first call and then another () on
the second with the string "AA"? It would finally be a reason for abandoning
ocamlyacc :o)

Thanks! (in hope that I haven't missed something blindingly obvious...)


David