From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.1.3 Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by yquem.inria.fr (Postfix) with ESMTP id 26F9BBC0A for ; Wed, 23 May 2007 17:10:02 +0200 (CEST) Received: from orion.metastack.com (no-dns-yet.demon.co.uk [80.177.38.218] (may be forged)) by concorde.inria.fr (8.13.6/8.13.6) with ESMTP id l4NFA0P5020368 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=FAIL) for ; Wed, 23 May 2007 17:10:01 +0200 Received: from treble (cpc2-cmbg6-0-0-cust535.cmbg.cable.ntl.com [81.107.34.24]) (authenticated bits=0) by orion.metastack.com (8.13.4/8.13.3) with ESMTP id l4NF7klX008505 (version=TLSv1/SSLv3 cipher=RC4-MD5 bits=128 verify=NO) for ; Wed, 23 May 2007 16:07:46 +0100 From: "David Allsopp" To: "OCaml List" Subject: Yet another yacc question Date: Wed, 23 May 2007 16:09:58 +0100 Organization: MetaStack Solutions Ltd. Message-ID: <00cf01c79d4c$6e33c3b0$6a7ba8c0@treble> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028 Thread-Index: AcedTG1w7w7hNu8cTGuI2ZUtwwYZdA== X-Miltered: at concorde with ID 46545948.001 by Joe's j-chkmail (http://j-chkmail . ensmp . fr)! X-Spam: no; 0.00; context-free:01 ocamlyacc:01 lexer:01 lexer:01 tokens:01 lexing:01 parsing:01 ocamlyacc:01 look-ahead:01 non-terminal:01 look-ahead:01 buffer:01 parser:01 tokens:01 terminating:01 Suppose I have the context-free grammar P -> A | A B | A B C | A C In ocamlyacc, I code this up as: %token A B C %type parse %start parse %% parse: A {()} | A B {()} | A B C {()} | A C {()} ; And set-up a lexer (called lexer) so that the characters 'A', 'B' and 'C' produce the tokens A, B and C. Then I write the following function: let f s = parse lexer (Lexing.from_string s) And use it a few times... f "ABCZ" ... gives () f "ACZ" ... gives () f "AA" ... raises Parsing.Parse_error The third case fails because "AA" is not in the grammar. However, the first two work even though "ABCZ" and "ACZ" are also not in the grammar (and Z isn't even a token!). They work because ocamlyacc doesn't need look-ahead after the "C" in each case to determine that it can reduce to the entry non-terminal and so return (). In the third case, look-ahead is required - it looks ahead, sees an A and so fails. I would quite like the third to match as well and ignore the second A (ignore and leave on the buffer ready for a future parse... so "peek-ahead" rather than "look-ahead", I guess). I think I'm probably right in assuming that ocamlyacc can't do this. I'm not willing to alter my parser to return a list of tokens which as far as I can see is the only way to make ocamlyacc do this correctly - i.e. parse: token parse {$1::$2} | EOF {[]} ; token: /as for parse in the previous grammar/ (Incidentally, lest anyone have it confirmed that I'm mad, I'm trying to parse batches of SQL statements so have no obvious terminating token for a clause - the parser needs to do a longest possible match ignoring anything else following that would appear to be a syntax error) So my question: can menhir, dypgen or any of the other parser generators out there do this - i.e. return one () on the first call and then another () on the second with the string "AA"? It would finally be a reason for abandoning ocamlyacc :o) Thanks! (in hope that I haven't missed something blindingly obvious...) David