From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by yquem.inria.fr (Postfix) with ESMTP id 7DD10BC8D for ; Tue, 11 Jan 2005 21:30:34 +0100 (CET) Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35]) by nez-perce.inria.fr (8.13.0/8.13.0) with ESMTP id j0BKUXrZ020120 for ; Tue, 11 Jan 2005 21:30:34 +0100 Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id VAA22730 for ; Tue, 11 Jan 2005 21:30:33 +0100 (MET) Received: from moutng.kundenserver.de (moutng.kundenserver.de [212.227.126.184]) by concorde.inria.fr (8.13.0/8.13.0) with ESMTP id j0BKUXC9010261 for ; Tue, 11 Jan 2005 21:30:33 +0100 Received: from [212.227.126.160] (helo=mrelayng.kundenserver.de) by moutng.kundenserver.de with esmtp (Exim 3.35 #1) id 1CoSet-0001wg-00; Tue, 11 Jan 2005 21:30:31 +0100 Received: from [80.129.115.55] (helo=gate.gerd-stolpmann.de) by mrelayng.kundenserver.de with asmtp (Exim 3.35 #1) id 1CoSet-0002KB-00; Tue, 11 Jan 2005 21:30:31 +0100 Received: from ice.gerd-stolpmann.de (ice [192.168.0.13]) by gate.gerd-stolpmann.de (Postfix) with ESMTP id DE80AC04C; Tue, 11 Jan 2005 21:30:30 +0100 (CET) Received: by ice.gerd-stolpmann.de (Postfix, from userid 1000) id E2EC2B031; Tue, 11 Jan 2005 21:30:20 +0100 (CET) Subject: Re: [Caml-list] Thread safe Str From: Gerd Stolpmann To: skaller@users.sourceforge.net Cc: Chris King , "O'Caml Mailing List" In-Reply-To: <1105466124.2574.48.camel@pelican.wigram> References: <1105415669.3534.55.camel@pelican.wigram> <875c7e0705011023033fa114ef@mail.gmail.com> <1105430813.3534.116.camel@pelican.wigram> <1105445337.15581.82.camel@ice.gerd-stolpmann.de> <1105466124.2574.48.camel@pelican.wigram> Content-Type: text/plain Content-Transfer-Encoding: 7bit Message-Id: <1105475416.15581.105.camel@ice.gerd-stolpmann.de> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.4.6 Date: Tue, 11 Jan 2005 21:30:17 +0100 X-Provags-ID: kundenserver.de abuse@kundenserver.de auth:a6865a839c0178d9aa0ce41878507ea2 X-Miltered: at nez-perce with ID 41E43769.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Miltered: at concorde with ID 41E43769.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Spam: no; 0.00; caml-list:01 gerd:01 stolpmann:01 wrote:01 gerd:01 stolpmann:01 wrote:01 backtracking:01 backtracking:01 regexps:01 frisch:01 cardelli:01 frisch:01 dfa:01 cduce:01 X-Spam-Checker-Version: SpamAssassin 3.0.0 (2004-09-13) on yquem.inria.fr X-Spam-Status: No, score=0.1 required=5.0 tests=FORGED_RCVD_HELO autolearn=disabled version=3.0.0 X-Spam-Level: On Die, 2005-01-11 at 18:55, skaller wrote: > On Tue, 2005-01-11 at 23:08, Gerd Stolpmann wrote: > > > > Capturing just for the purpose of string extraction is not problematic. > > Unfortunately you're wrong. I had in fact hoped this was not > the case, but having read a couple of paper on it, > I now know that, unfortunately, they are. (See below). It's a matter of taste. Granted, you cannot define capturing directly for an automaton, only by a possibly poor backtracking algorithm. (This means the approach is "definition by algorithm", not really the best way to do it.) Which leads to a more serious problem: One must know the algorithm of the implementation to avoid running into a hard backtracking case (which can be exponential). There are a lot of tutorials explaining good and bad regexps. The lex-type scanners do not have this problem (but you may run into space problems instead because the automaton becomes large). > Alain Frisch is one of the leaders in this area, perhaps > he can explain it better. You can get Cardelli and Frisch > paper here: > > http://felix.sf.net/papers/greedy.pdf When I understand the direction correctly, this paper explains how to avoid backtracking by using a two-pass algorithm (automaton plus expression-based postprocessing to get the captured strings). > another by Ville Laurikari here: > > http://felix.sf.net/papers/spire2000-tnfa.ps > > and his Master thesis on the topic here: > > http://felix.sf.net/papers/regex-submatch.ps > > [and if you can figure out how to actually build > a tagged DFA after reading any of that please let > me know .. ] > > Frisch's algorithm used in CDuce works with a forward > pass that ignores captures, but records the automaton states, > and then a backwards pass the extracts the information > (CDuce actually builds trees). > > Unfortunately this has a problem: it requires a > bidirectional iterator, whereas a DFA only requires > an input iterator. (NFA's require forward iterators) For string regexps this is not a big problem, the string is in memory. For XML, this is more a problem, especially when the XML document is only available event by event (processing large documents). Gerd -- ------------------------------------------------------------ Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de ------------------------------------------------------------