From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pierre.weis@inria.fr>
Subject: Re: [Caml-list] Thread safe Str
From: skaller <skaller@users.sourceforge.net>
Reply-To: skaller@users.sourceforge.net
To: Chris King <colanderman@gmail.com>
Cc: "O'Caml Mailing List" <caml-list@inria.fr>
In-Reply-To: <875c7e0705011023033fa114ef@mail.gmail.com>
References: <Pine.LNX.4.44.0501101103500.1591-100000@localhost>
	 <1105415669.3534.55.camel@pelican.wigram>
	 <875c7e0705011023033fa114ef@mail.gmail.com>
Content-Type: text/plain
Message-Id: <1105430813.3534.116.camel@pelican.wigram>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.2.2 (1.2.2-4) 
Date: 11 Jan 2005 19:06:54 +1100
Content-Transfer-Encoding: 7bit

On Tue, 2005-01-11 at 18:03, Chris King wrote:
> On 11 Jan 2005 14:54:30 +1100, skaller <skaller@users.sourceforge.net> wrote:
> > If you want captures use the proper tool, namely a parser,
> > [...]
> > If some technology is to be integrated, please use the right technology
> > and integrate Ocamllex.
> 
> Event-based parsers are not the "proper tool" for many applications. 

What are 'event-based' parsers??

> Why write a lexer and all its necessary event handlers when one can
> just write "s/foo(.*)bar/bar\1foo/g"?  

Just compare a real example from the Alioth Shootout:

Specification:
---------------------------------------------------
The telephone number pattern: 

      * there may be zero or one telephone numbers per line of input
      * a telephone number may start at the beginning of the line or be
        preceded by a non-digit (which may be preceded by anything)
      * it begins with a 3-digit area code that looks like this (DDD) or
        DDD (where D is [0-9])
      * the area code is followed by one space
      * which is followed by the 3 digits of the exchange: DDD
      * the exchange is followed by a space or hyphen [ -]
      * which is followed by the last 4 digits: DDDD
      * which can be followed by end of line or a non-digit (which may
        be followed by anything).

------ FELIX SOLUTION---------------------

regexp digit = ["0123456789"];
regexp digits3 = digit digit digit;
regexp digits4 =  digits3 digit;

regexp area_code = digits3 | "(" digits3 ")";
regexp exchange = digits3;

regexp phone = area_code " " exchange (" " | "-") digits4;

fun lexit (start:iterator, finish:iterator): iterator * string =>
  reglex start to finish with
  | phone => check_context (lexeme_start, lexeme_end)
  | _ => ""
  endmatch
;


--------------- PCRE SOLUTION -------------------------

"(?: ^ | [^\\d\\(])           # must be preceeded by non-digit
(\\(\\d\\d\\d\\)|\\d\\d\\d)  # match 1: area code
[ ]                          # area code followed by one space
\\d\\d\\d                    # prefix of 3 digits
[ -]                         # separator is either space or dash
 \\d\\d\\d\\d                 # last 4 digits
(?: \\D|$)                   # must be followed by a non-digit (or EOL)
"

-------------------------------------------------------

If you think the PCRE solution is in any way better -- well,
I'd like to point out that it is in fact WRONG. The Felix solution
is obviously correct. As an exercise, find the bug in the
idiotic PCRE solution. Oh yes, it is extremely stupid.
In fact there is a much better solution, one that does not use
the irregular back-match crap which tried to be tricky
by matching either empty or "(" with either empty or ")" --
and failed utterly. So, as a second exercise, find this superior
solution.

The bottom line is: a properly supported syntax for
regular matching is superior even for small regexps.
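
To make the point about named regular definitions concrete outside Felix,
here is a minimal sketch in plain Ocaml (standard library only). The type
`re`, the function `ends`, and the helper `context_ok` are names made up for
this illustration; in particular `context_ok` is only my assumption of what
a check like `check_context` has to do, namely reject a candidate that is
preceded or followed by a digit.

```ocaml
(* A tiny regular-expression combinator type: matching only, no captures. *)
type re =
  | Set of (char -> bool)   (* one character satisfying a predicate *)
  | Seq of re * re          (* r1 followed by r2 *)
  | Alt of re * re          (* r1 or r2 *)

(* All positions where a match of [r] starting at index [i] of [s] can end. *)
let rec ends r s i =
  match r with
  | Set p -> if i < String.length s && p s.[i] then [i + 1] else []
  | Seq (a, b) -> List.concat (List.map (fun j -> ends b s j) (ends a s i))
  | Alt (a, b) -> ends a s i @ ends b s i

let chr c = Set (fun x -> x = c)
let is_digit c = c >= '0' && c <= '9'

(* Named regular definitions, mirroring the Felix ones above. *)
let digit = Set is_digit
let digits3 = Seq (digit, Seq (digit, digit))
let digits4 = Seq (digits3, digit)
let area_code = Alt (digits3, Seq (chr '(', Seq (digits3, chr ')')))
let exchange = digits3
let phone =
  Seq (area_code,
       Seq (chr ' ', Seq (exchange, Seq (Alt (chr ' ', chr '-'), digits4))))

(* Hypothetical context check: the match [i, j) must not touch a digit. *)
let context_ok s i j =
  (i = 0 || not (is_digit s.[i - 1]))
  && (j = String.length s || not (is_digit s.[j]))

(* First phone number on a line, if any. *)
let find_phone s =
  let n = String.length s in
  let rec go i =
    if i >= n then None
    else
      match List.filter (context_ok s i) (ends phone s i) with
      | j :: _ -> Some (String.sub s i (j - i))
      | [] -> go (i + 1)
  in
  go 0
```

Each definition is an ordinary value, so `area_code` or `digits4` can be
reused and tested on its own, which is exactly what the one-string PCRE
pattern does not allow.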

> Regular expressions were
> designed for pattern matching and substitution,

So-called regular expressions are NOT regular expressions in the
mathematical sense, and they were certainly NOT designed by
people who knew what they were doing from a theoretical viewpoint.

Regular expressions have a precise mathematical foundation,
and they do NOT include captures.
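
As a rough illustration of that foundation, here is a sketch of regular
expressions as a plain Ocaml datatype, with a matcher based on Brzozowski
derivatives. This is one textbook construction among several, not how any
particular library works, and the names `re`, `nullable`, `deriv` and
`matches` are chosen for this example only. Note that the only question it
can answer is whether a string is in the language; there is nowhere for a
capture to live.

```ocaml
(* Regular expressions in the mathematical sense: no captures, no backrefs. *)
type re =
  | Empty               (* matches nothing *)
  | Eps                 (* matches only the empty string *)
  | Chr of char         (* matches exactly one given character *)
  | Alt of re * re      (* union *)
  | Seq of re * re      (* concatenation *)
  | Star of re          (* Kleene closure *)

(* Does [r] accept the empty string? *)
let rec nullable = function
  | Empty | Chr _ -> false
  | Eps | Star _ -> true
  | Alt (a, b) -> nullable a || nullable b
  | Seq (a, b) -> nullable a && nullable b

(* Brzozowski derivative: what remains of [r] after consuming [c]. *)
let rec deriv c = function
  | Empty | Eps -> Empty
  | Chr d -> if c = d then Eps else Empty
  | Alt (a, b) -> Alt (deriv c a, deriv c b)
  | Seq (a, b) ->
      let left = Seq (deriv c a, b) in
      if nullable a then Alt (left, deriv c b) else left
  | Star a as r -> Seq (deriv c a, r)

(* [s] is accepted iff the derivative by every character of [s] is nullable. *)
let matches r s =
  let n = String.length s in
  let rec go r i = if i = n then nullable r else go (deriv s.[i] r) (i + 1) in
  go r 0
```

For example, `Seq (Star (Seq (Chr 'a', Chr 'b')), Chr 'c')` stands for
(ab)*c, and `matches` returns a plain boolean for it.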

Attempts to add captures to the theory have been made, 
and none are satisfactory IMHO. Certainly none actually agree 
with any implementations.

PCRE has some wording about longest matches
and leftmost capture, but these semantics turn out to
be inconsistent, and are not what PCRE actually implements.

The problem with parsing to obtain captures is that parsers
have not traditionally been well integrated into any programming
language (except possibly LISP .. :)

The fact is that 'regexps' are actually lame parsers, 
so it is better to get rid of the irregular crud and provide 
a proper regular expression facility and a proper parser. IMHO.

That way we have solid mathematical foundations for both subsystems,
and the parsing support is actually vastly more capable than
mere regexps (since you can build parse trees :)

Expressions like:

"s/foo(.*)bar/bar\1foo/g"

may well make more sense in a language like Perl, which

(a) is dynamically typed,

(b) has strong, convenient string handling, which
allows the woeful lack of regular definitions to
be made up for by other features such as interpolation, and

(c) was never expected to scale up
to programming in the large.

Ocaml, on the other hand, is statically typed, doesn't
have strong convenient string handling, and is expected
to scale up to programming in the large.
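
For comparison, here is roughly what the Perl one-liner quoted above looks
like with the Str library that ships with Ocaml. This is a sketch, assuming
the program is linked against Str (str.cma or str.cmxa); the function name
`swap_foo_bar` and the sample input are made up for the example.

```ocaml
(* Needs the Str library from the Ocaml distribution at link time. *)

(* Rough equivalent of Perl's s/foo(.*)bar/bar\1foo/g.
   In Str syntax a group is written \( ... \), and every backslash has to
   be doubled inside an Ocaml string literal. *)
let swap_foo_bar (s : string) : string =
  Str.global_replace (Str.regexp "foo\\(.*\\)bar") "bar\\1foo" s

let () = print_endline (swap_foo_bar "a foo-x-bar b")
```

The pattern and the template are both just strings, so a typo in either one
is only found when the code runs, not when it is compiled.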


-- 
John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Check out the Felix programming language http://felix.sf.net