* Thread safe Str @ 2005-01-09 19:30 Christophe TROESTLER 2005-01-09 20:57 ` [Caml-list] " Gerd Stolpmann ` (2 more replies) 0 siblings, 3 replies; 19+ messages in thread From: Christophe TROESTLER @ 2005-01-09 19:30 UTC (permalink / raw) To: O'Caml Mailing List Hi, The new Str module not only made it LGPL but also very fast (100% more than Pcre in some cases). However it is still not thread safe -- and it is not possible given its interface. However, from a quick look, it seems to me that what is under the hood is almost thread safe. So why not to write a new module (say Regexp) which would be thread safe and on top of which Str would be build? This would not be too much work, would it? Cheers, ChriS ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Caml-list] Thread safe Str 2005-01-09 19:30 Thread safe Str Christophe TROESTLER @ 2005-01-09 20:57 ` Gerd Stolpmann 2005-01-10 9:57 ` Alex Baretta 2005-01-10 15:49 ` Xavier Leroy 2 siblings, 0 replies; 19+ messages in thread From: Gerd Stolpmann @ 2005-01-09 20:57 UTC (permalink / raw) To: Christophe TROESTLER; +Cc: O'Caml Mailing List On Son, 2005-01-09 at 20:30, Christophe TROESTLER wrote: > Hi, > > The new Str module not only made it LGPL but also very fast (100% more > than Pcre in some cases). However it is still not thread safe -- and > it is not possible given its interface. However, from a quick look, > it seems to me that what is under the hood is almost thread safe. So > why not to write a new module (say Regexp) which would be thread safe > and on top of which Str would be build? This would not be too much > work, would it? The Netstring_str module (part of ocamlnet) does it the other way round: The thread-safe interface is built on top of a potentially unsafe implementation. Although the current version of Netstring_str uses Pcre as its implementation, there is a historic version on top of Str. This other way of layering can be made working because there is the global master lock protecting the C parts of the regexp engine. The interface of Netstring_str: http://ocamlnet.sourceforge.net/manual/refman/Netstring_str.html Of course, this is not the optimal solution, I would prefer it the way you suggest. Gerd -- ------------------------------------------------------------ Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de ------------------------------------------------------------ ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Caml-list] Thread safe Str 2005-01-09 19:30 Thread safe Str Christophe TROESTLER 2005-01-09 20:57 ` [Caml-list] " Gerd Stolpmann @ 2005-01-10 9:57 ` Alex Baretta 2005-01-10 15:49 ` Xavier Leroy 2 siblings, 0 replies; 19+ messages in thread From: Alex Baretta @ 2005-01-10 9:57 UTC (permalink / raw) To: Christophe TROESTLER, Ocaml Christophe TROESTLER wrote: > Hi, > > The new Str module not only made it LGPL but also very fast (100% more > than Pcre in some cases). However it is still not thread safe -- and > it is not possible given its interface. However, from a quick look, > it seems to me that what is under the hood is almost thread safe. So > why not to write a new module (say Regexp) which would be thread safe > and on top of which Str would be build? This would not be too much > work, would it? > > Cheers, > ChriS I think I remember a thread where Xavier explained that Str is fully thread safe because all state data structures are allocated on a per-thread basis. Alex -- ********************************************************************* http://www.barettadeit.com/ Baretta DE&IT A division of Baretta SRL tel. +39 02 370 111 55 fax. +39 02 370 111 54 Our technology: The Application System/Xcaml (AS/Xcaml) <http://www.asxcaml.org/> The FreerP Project <http://www.freerp.org/> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Caml-list] Thread safe Str 2005-01-09 19:30 Thread safe Str Christophe TROESTLER 2005-01-09 20:57 ` [Caml-list] " Gerd Stolpmann 2005-01-10 9:57 ` Alex Baretta @ 2005-01-10 15:49 ` Xavier Leroy 2005-01-10 16:39 ` Richard Jones ` (3 more replies) 2 siblings, 4 replies; 19+ messages in thread From: Xavier Leroy @ 2005-01-10 15:49 UTC (permalink / raw) To: Christophe TROESTLER; +Cc: O'Caml Mailing List > The new Str module not only made it LGPL but also very fast (100% more > than Pcre in some cases). However it is still not thread safe -- and > it is not possible given its interface. However, from a quick look, > it seems to me that what is under the hood is almost thread safe. You are correct: the regexp compiler (written in Caml) and the execution engine (written in C) are thread-safe, it's only the API (in Caml) that uses global state. > So why not to write a new module (say Regexp) which would be thread > safe and on top of which Str would be build? That's a very good idea. My initial plan was to have two APIs, the old Str for compatibility and a newer, better designed one for new programs. Besides being thread-safe, the new API could also expose the abstract syntax tree for regexps, allowing easier construction of complex regexps by programs than can be done by working at the level of the string representation of regexps. It's just that I never got to design that new API :-( > This would not be too much work, would it? The implementation work should be quite small indeed, but don't underestimate the work needed to design the API. API design is harder than it seems... But if a few persons on this list want to team up to design an API, that would be wonderful indeed. (A small group is better than individual designers in that several pairs of eyes spot inconsistencies better.) - Xavier Leroy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Caml-list] Thread safe Str 2005-01-10 15:49 ` Xavier Leroy @ 2005-01-10 16:39 ` Richard Jones 2005-01-10 18:21 ` Eric C. Cooper ` (2 subsequent siblings) 3 siblings, 0 replies; 19+ messages in thread From: Richard Jones @ 2005-01-10 16:39 UTC (permalink / raw) Cc: O'Caml Mailing List On Mon, Jan 10, 2005 at 04:49:17PM +0100, Xavier Leroy wrote: > The implementation work should be quite small indeed, but don't > underestimate the work needed to design the API. API design is harder > than it seems... But if a few persons on this list want to team up to > design an API, that would be wonderful indeed. (A small group is > better than individual designers in that several pairs of eyes spot > inconsistencies better.) Any chance of syntactic support for regexps in the language? (As in Perl and particularly Regexp/OCaml[1], for example). Rich. [1] http://web.yl.is.s.u-tokyo.ac.jp/~oiwa/pub/caml/regexp-pp-0.9.5.1/README.match-regexp -- ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Caml-list] Thread safe Str 2005-01-10 15:49 ` Xavier Leroy 2005-01-10 16:39 ` Richard Jones @ 2005-01-10 18:21 ` Eric C. Cooper 2005-01-10 20:25 ` Martin Jambon 2005-01-11 6:41 ` Chris King 3 siblings, 0 replies; 19+ messages in thread From: Eric C. Cooper @ 2005-01-10 18:21 UTC (permalink / raw) To: caml-list, O'Caml Mailing List On Mon, Jan 10, 2005 at 04:49:17PM +0100, Xavier Leroy wrote: > My initial plan was to have two APIs, the old Str for compatibility > and a newer, better designed one for new programs. Besides being > thread-safe, the new API could also expose the abstract syntax tree > for regexps, allowing easier construction of complex regexps by > programs than can be done by working at the level of the string > representation of regexps. It's just that I never got to design > that new API :-( The API used by Olin Shivers for regexps in scsh might be a good starting point. -- Eric C. Cooper e c c @ c m u . e d u ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Caml-list] Thread safe Str 2005-01-10 15:49 ` Xavier Leroy 2005-01-10 16:39 ` Richard Jones 2005-01-10 18:21 ` Eric C. Cooper @ 2005-01-10 20:25 ` Martin Jambon 2005-01-11 3:54 ` skaller 2005-01-11 6:41 ` Chris King 3 siblings, 1 reply; 19+ messages in thread From: Martin Jambon @ 2005-01-10 20:25 UTC (permalink / raw) To: Xavier Leroy; +Cc: Christophe TROESTLER, O'Caml Mailing List On Mon, 10 Jan 2005, Xavier Leroy wrote: > You are correct: the regexp compiler (written in Caml) and the > execution engine (written in C) are thread-safe, it's only the API (in > Caml) that uses global state. Good to know :-) > The implementation work should be quite small indeed, but don't > underestimate the work needed to design the API. API design is harder > than it seems... But if a few persons on this list want to team up to > design an API, that would be wonderful indeed. (A small group is > better than individual designers in that several pairs of eyes spot > inconsistencies better.) I would be glad to help if I can. However, my concerns are more about how to integrate regular expressions in the language so that they can be checked at compile-time just like the rest of the program. My Camlp4 syntax extension (http://martin.jambon.free.fr/micmatch.html) works just fine for this purpose and I am not asking for any change in the core OCaml syntax ;-) Martin -- Martin Jambon, PhD Researcher in Structural Bioinformatics since the 20th Century The Burnham Institute http://www.burnham.org San Diego, California ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Caml-list] Thread safe Str 2005-01-10 20:25 ` Martin Jambon @ 2005-01-11 3:54 ` skaller 2005-01-11 7:03 ` Chris King 2005-01-11 20:53 ` Martin Jambon 0 siblings, 2 replies; 19+ messages in thread From: skaller @ 2005-01-11 3:54 UTC (permalink / raw) To: Martin Jambon; +Cc: Xavier Leroy, Christophe TROESTLER, O'Caml Mailing List On Tue, 2005-01-11 at 07:25, Martin Jambon wrote: > However, my concerns are more about how to integrate regular expressions > in the language so that they can be checked at compile-time just like the > rest of the program. My Camlp4 syntax extension > (http://martin.jambon.free.fr/micmatch.html) works just fine for this > purpose and I am not asking for any change in the core OCaml syntax ;-) But Str is a hack. Like all NFA based solutions, it's unreliable because it is unsound: possible infinite recursion, indeterminate capture results, and exponential performance. In addition, regular expressions have poor scalability and fail to provide simple i18n support. I think Felix does this the right way: core language support for regular definitions, linear classification and tokensisation, and no captures. If you want captures use the proper tool, namely a parser, which is also supported directly in the language (this is more problematic due to the number of parsers around, Felix currently provides the GLR parser Elkhound). So I would love to see integration of regexp support in Ocaml but I'm very much against Str. If some technology is to be integrated, please use the right technology and integrate Ocamllex. See the ulex package for a model. The problem with micmatch is precisely that it does use Str. BTW: it probably doesn't work either, due to the bug in Str I mentioned in an earlier post. -- John Skaller, mailto:skaller@users.sf.net voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Caml-list] Thread safe Str 2005-01-11 3:54 ` skaller @ 2005-01-11 7:03 ` Chris King 2005-01-11 8:06 ` skaller 2005-01-11 20:53 ` Martin Jambon 1 sibling, 1 reply; 19+ messages in thread From: Chris King @ 2005-01-11 7:03 UTC (permalink / raw) To: skaller; +Cc: O'Caml Mailing List On 11 Jan 2005 14:54:30 +1100, skaller <skaller@users.sourceforge.net> wrote: > If you want captures use the proper tool, namely a parser, > [...] > If some technology is to be integrated, please use the right technology > and integrate Ocamllex. Event-based parsers are not the "proper tool" for many applications. Why write a lexer and all its necessary event handlers when one can just write "s/foo(.*)bar/bar\1foo/g"? Regular expressions were designed for pattern matching and substitution, and the latter function is why they have captures. Lexers were designed for parsing, not substitution, and thus excel at the former while making the latter difficult. Regexps are certainly the wrong tool for parsing, but they do have their place. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Caml-list] Thread safe Str 2005-01-11 7:03 ` Chris King @ 2005-01-11 8:06 ` skaller 2005-01-11 12:08 ` Gerd Stolpmann [not found] ` <875c7e070501111007dc3e86d@mail.gmail.com> 0 siblings, 2 replies; 19+ messages in thread From: skaller @ 2005-01-11 8:06 UTC (permalink / raw) To: Chris King; +Cc: O'Caml Mailing List On Tue, 2005-01-11 at 18:03, Chris King wrote: > On 11 Jan 2005 14:54:30 +1100, skaller <skaller@users.sourceforge.net> wrote: > > If you want captures use the proper tool, namely a parser, > > [...] > > If some technology is to be integrated, please use the right technology > > and integrate Ocamllex. > > Event-based parsers are not the "proper tool" for many applications. What are 'event-based' parsers?? > Why write a lexer and all its necessary event handlers when one can > just write "s/foo(.*)bar/bar\1foo/g"? Just compare a real example from the Alioth Shootout: Specification: --------------------------------------------------- The telephone number pattern: * there may be zero or one telephone numbers per line of input * a telephone number may start at the beginning of the line or be preceeded by a non-digit, (which may be preceeded by anything) * it begins with a 3-digit area code that looks like this (DDD) or DDD (where D is [0-9]) * the area code is followed by one space * which is followed by the 3 digits of the exchange: DDD * the exchange is followed by a space or hyphen [ -] * which is followed by the last 4 digits: DDDD * which can be followed by end of line or a non-digit (which may be followed by anything). ------ FELIX SOLUTION--------------------- regexp digit = ["0123456789"]; regexp digits3 = digit digit digit; regexp digits4 = digits3 digit; regexp area_code = digits3 | "(" digits3 ")"; regexp exchange = digits3; regexp phone = area_code " " exchange (" " | "-") digits4; fun lexit (start:iterator, finish:iterator): iterator * string => reglex start to finish with | phone => check_context (lexeme_start, lexeme_end) | _ => "" endmatch ; --------------- PCRE SOLUTION ------------------------- "(?: ^ | [^\\d\\(]) # must be preceeded by non-digit (\\(\\d\\d\\d\\)|\\d\\d\\d) # match 1: area code [ ] # area code followed by one space \\d\\d\\d # prefix of 3 digits [ -] # separator is either space or dash \\d\\d\\d\\d # last 4 digits (?: \\D|$) # must be followed by a non-digit (or EOL) " ------------------------------------------------------- If you think the PCRE solution is in any way better -- well I'd like to point out it is in fact WRONG. The Felix solution is obviously correct. As an exercise, find the bug in the idiotic PCRE solution. Oh yes, it is extremely stupid. In fact there is a much better solution not using the irregular back match crap which tried to be tricky by matching either empty or ( with empty or ) -- and failed utterly. So as an exercise, find this superior solution. The bottom line is: a properly supported syntax for regular matching is superior even for small regexps. > Regular expressions were > designed for pattern matching and substitution, So called regular expressions are NOT mathematically regular expressions, and they were certainly NOT designed by people that knew what they were doing from a theoretical viewpoint. Regular expressions have a precise mathematical foundation, and they do NOT include captures. Attempts to add captures to the theory have been made, and none are satisfactory IMHO. Certainly none actually agree with any implementations. PCRE has some wording about longest matches and left most capture but these semantics turn out to be inconsistent, and are not what PCRE actually implements. The problem with parsing to obtain captures is that parsers have not traditionally been well integrated into any programming languages (except possibly LISP .. :) The fact is that 'regexps' are actually lame parsers, so it is better to get rid of the irregular crud and provide a proper regular expression facility and a proper parser. IMHO. That way we have solid mathemtical foundations for both subsystems, and the parsing support is actually vastly more capable than mere regexps (since you can build parse trees :) Expressions like: "s/foo(.*)bar/bar\1foo/g" may well make more sense in a language like Perl which (a) is dynamically typed (b) has strong convenient string handling which allows the woeful lack of regular definitions to be replaced by other features such as interpolation. (c) the language was never expected to scale up to programming in the large Ocaml on the other hand is statically typed, doens't have strong convenient string handling, and is expected to scale up to programming in the large. -- John Skaller, mailto:skaller@users.sf.net voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Caml-list] Thread safe Str 2005-01-11 8:06 ` skaller @ 2005-01-11 12:08 ` Gerd Stolpmann 2005-01-11 17:55 ` skaller [not found] ` <875c7e070501111007dc3e86d@mail.gmail.com> 1 sibling, 1 reply; 19+ messages in thread From: Gerd Stolpmann @ 2005-01-11 12:08 UTC (permalink / raw) To: skaller; +Cc: Chris King, O'Caml Mailing List On Die, 2005-01-11 at 09:06, skaller wrote: > On Tue, 2005-01-11 at 18:03, Chris King wrote: > > On 11 Jan 2005 14:54:30 +1100, skaller <skaller@users.sourceforge.net> wrote: > > > If you want captures use the proper tool, namely a parser, > > > [...] > > > If some technology is to be integrated, please use the right technology > > > and integrate Ocamllex. Which is already done. There is a camlp4 extension allowing one to define ocamllex-like scanners as part of normal modules (pa_ocamllex). It is even part of the O'caml distribution, in camlp4/unmaintained. Well, there is also a maintained and internationalised version called ulex by the same author. > > Why write a lexer and all its necessary event handlers when one can > > just write "s/foo(.*)bar/bar\1foo/g"? This is a typical example where one would prefer regexp engines over lex-type scanners. Just because of simplicity. > Just compare a real example from the Alioth Shootout: > > Specification: > --------------------------------------------------- > The telephone number pattern: > > * there may be zero or one telephone numbers per line of input > * a telephone number may start at the beginning of the line or be > preceeded by a non-digit, (which may be preceeded by anything) > * it begins with a 3-digit area code that looks like this (DDD) or > DDD (where D is [0-9]) > * the area code is followed by one space > * which is followed by the 3 digits of the exchange: DDD > * the exchange is followed by a space or hyphen [ -] > * which is followed by the last 4 digits: DDDD > * which can be followed by end of line or a non-digit (which may > be followed by anything). This is already a problem near the limit of writing regexps directly, in one step. I think the problem is not the complexity of the problem, but the string notation of regexps which makes any solution hard to read. However, there are better criterions when to use which lexing technology: - When one wants to have a token stream as result: use (ocaml)lex - When one can profit from recursive lexer definitions: use (ocaml)lex - When the regexp is computed: use regexp engine > ------ FELIX SOLUTION--------------------- > > regexp digit = ["0123456789"]; > regexp digits3 = digit digit digit; > regexp digits4 = digits3 digit; > > regexp area_code = digits3 | "(" digits3 ")"; > regexp exchange = digits3; > > regexp phone = area_code " " exchange (" " | "-") digits4; > > fun lexit (start:iterator, finish:iterator): iterator * string => > reglex start to finish with > | phone => check_context (lexeme_start, lexeme_end) > | _ => "" > endmatch > ; The pa_ocamllex/ulex solution would be very similar to this. > > Regular expressions were > > designed for pattern matching and substitution, > > So called regular expressions are NOT mathematically > regular expressions, and they were certainly NOT designed by > people that knew what they were doing from a theoretical viewpoint. > > Regular expressions have a precise mathematical foundation, > and they do NOT include captures. You mean back references, e.g. ((a|b|c)*)d\1 here matching strings where the part of the string before "d" is identical to the part after "d". The set of matched strings is a context-sensitive language! Capturing just for the purpose of string extraction is not problematic. Back references are unsound when they can match the empty word. > Attempts to add captures to the theory have been made, > and none are satisfactory IMHO. Certainly none actually agree > with any implementations. This is true, but I think it is practically irrelevant. > PCRE has some wording about longest matches > and left most capture but these semantics turn out to > be inconsistent, and are not what PCRE actually implements. There are often several ways of capturing. In practice, I found this never to be a real problem, because there are almost always alternate regexps avoiding such problems. I still think that a regexp engine is a useful addition to a language like ocaml. Even with camlp4 integration ocamllex is an overkill solution for many simple matching problems (notation and execution overhead). Gerd -- ------------------------------------------------------------ Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de ------------------------------------------------------------ ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Caml-list] Thread safe Str 2005-01-11 12:08 ` Gerd Stolpmann @ 2005-01-11 17:55 ` skaller 2005-01-11 20:30 ` Gerd Stolpmann 0 siblings, 1 reply; 19+ messages in thread From: skaller @ 2005-01-11 17:55 UTC (permalink / raw) To: Gerd Stolpmann; +Cc: Chris King, O'Caml Mailing List On Tue, 2005-01-11 at 23:08, Gerd Stolpmann wrote: > Capturing just for the purpose of string extraction is not problematic. Unfortunately you're wrong. I had in fact hoped this was not the case, but having read a couple of paper on it, I now know that, unfortunately, they are. (See below). Alain Frisch is one of the leaders in this area, perhaps he can explain it better. You can get Cardelli and Frisch paper here: http://felix.sf.net/papers/greedy.pdf another by Ville Laurikari here: http://felix.sf.net/papers/spire2000-tnfa.ps and his Master thesis on the topic here: http://felix.sf.net/papers/regex-submatch.ps [and if you can figure out how to actually build a tagged DFA after reading any of that please let me know .. ] Frisch's algorithm used in CDuce works with a forward pass that ignores captures, but records the automaton states, and then a backwards pass the extracts the information (CDuce actually builds trees). Unfortunately this has a problem: it requires a bidirectional iterator, whereas a DFA only requires an input iterator. (NFA's require forward iterators) -- John Skaller, mailto:skaller@users.sf.net voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Caml-list] Thread safe Str 2005-01-11 17:55 ` skaller @ 2005-01-11 20:30 ` Gerd Stolpmann 2005-01-12 7:42 ` skaller 0 siblings, 1 reply; 19+ messages in thread From: Gerd Stolpmann @ 2005-01-11 20:30 UTC (permalink / raw) To: skaller; +Cc: Chris King, O'Caml Mailing List On Die, 2005-01-11 at 18:55, skaller wrote: > On Tue, 2005-01-11 at 23:08, Gerd Stolpmann wrote: > > > > Capturing just for the purpose of string extraction is not problematic. > > Unfortunately you're wrong. I had in fact hoped this was not > the case, but having read a couple of paper on it, > I now know that, unfortunately, they are. (See below). It's a matter of taste. Granted, you cannot define capturing directly for an automaton, only by a possibly poor backtracking algorithm. (This means the approach is "definition by algorithm", not really the best way to do it.) Which leads to a more serious problem: One must know the algorithm of the implementation to avoid running into a hard backtracking case (which can be exponential). There are a lot of tutorials explaining good and bad regexps. The lex-type scanners do not have this problem (but you may run into space problems instead because the automaton becomes large). > Alain Frisch is one of the leaders in this area, perhaps > he can explain it better. You can get Cardelli and Frisch > paper here: > > http://felix.sf.net/papers/greedy.pdf When I understand the direction correctly, this paper explains how to avoid backtracking by using a two-pass algorithm (automaton plus expression-based postprocessing to get the captured strings). > another by Ville Laurikari here: > > http://felix.sf.net/papers/spire2000-tnfa.ps > > and his Master thesis on the topic here: > > http://felix.sf.net/papers/regex-submatch.ps > > [and if you can figure out how to actually build > a tagged DFA after reading any of that please let > me know .. ] > > Frisch's algorithm used in CDuce works with a forward > pass that ignores captures, but records the automaton states, > and then a backwards pass the extracts the information > (CDuce actually builds trees). > > Unfortunately this has a problem: it requires a > bidirectional iterator, whereas a DFA only requires > an input iterator. (NFA's require forward iterators) For string regexps this is not a big problem, the string is in memory. For XML, this is more a problem, especially when the XML document is only available event by event (processing large documents). Gerd -- ------------------------------------------------------------ Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de ------------------------------------------------------------ ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Caml-list] Thread safe Str 2005-01-11 20:30 ` Gerd Stolpmann @ 2005-01-12 7:42 ` skaller 0 siblings, 0 replies; 19+ messages in thread From: skaller @ 2005-01-12 7:42 UTC (permalink / raw) To: Gerd Stolpmann; +Cc: Chris King, O'Caml Mailing List On Wed, 2005-01-12 at 07:30, Gerd Stolpmann wrote: > It's a matter of taste. Granted, you cannot define capturing directly > for an automaton, only by a possibly poor backtracking algorithm. (This > means the approach is "definition by algorithm", not really the best way > to do it.) It's worse than that .. > Which leads to a more serious problem: One must know the algorithm of > the implementation to avoid running into a hard backtracking case (which > can be exponential). .. it's worse .. some cases could be infinite recursion: see Frisch & Cardelli. It is known PCRE has this problem. It was hacked to fix one of the common cases. The hack isn't general. > There are a lot of tutorials explaining good and > bad regexps. Which are worthless junk if you're generating them.. The point is -- this is folklore not mathematics. > The lex-type scanners do not have this problem (but you may > run into space problems instead because the automaton becomes large). Indeed, but they also have another problem: they don't support captures at all. > > > > Unfortunately this has a problem: it requires a > > bidirectional iterator, whereas a DFA only requires > > an input iterator. (NFA's require forward iterators) > > For string regexps this is not a big problem, the string is in memory. > For XML, this is more a problem, especially when the XML document is > only available event by event (processing large documents). Hehe .. but the main use of the algorithm is in the specialist XML processing language XDuce... -- John Skaller, mailto:skaller@users.sf.net voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net ^ permalink raw reply [flat|nested] 19+ messages in thread
[parent not found: <875c7e070501111007dc3e86d@mail.gmail.com>]
[parent not found: <1105471138.2574.88.camel@pelican.wigram>]
[parent not found: <875c7e07050111115618692184@mail.gmail.com>]
* Re: [Caml-list] Thread safe Str [not found] ` <875c7e07050111115618692184@mail.gmail.com> @ 2005-01-11 19:58 ` Chris King 0 siblings, 0 replies; 19+ messages in thread From: Chris King @ 2005-01-11 19:58 UTC (permalink / raw) To: O'Caml Mailing List Forgot to CC this to the list, sorry if anyone gets a dupe.... > I'm sure you'd agree there are separable issues here: > > (1) using a string encoding of a regexp as opposed to > a lex like one -- this has nothing to do with > captures. Yes. As I stated in another e-mail in this thread, I'd love to see an API that exposes the parse tree of a regexp. > (2) Captures I'd like to add (3) Parsing vs. substitution. You can't effectively do the latter without captures (of course it can be done but it's messy). > The fact that the regexp syntax is not checked statically > isn't relevant in a dynamic language since the typing > of the rest of the program isn't either. I think we're talking about different things... I used the "s//g" syntax to represent the substitution function in whatever language is being used, not as an example of something to be compiled. (regexp "foo(.*)bar") certainly has a static type, and if it's malformed it simply raises an exception. > I'm trying to provide that in Felix. It has Python style literals, > and Python style substrings. However it is still clumbsy compared > with Perl (I guess .. I can't write Perl ..) Perl string mangling is clumsy compared with Python. The key is that Python treats strings as arrays (or lists) of characters. > > True. Performing multiple replacements on a single string with a > > regexp is retarded. But so is writing a lexer for a simple one-shot > > replacement job. > > That depends on how hard it is to write a lexer. Specifically, a lexer whose input and output are both strings and which performs substitution. Not pretty, unless captures are provided. > At present, Felix regexps have to be constants. It will be possible > in the bootstrapped compiler to generate them as text, and then > compile and link (i.e. there will be a function sort of like 'eval'). That's not acceptable for, say, an incremental search, though, where a new regexp must be generated on each keystroke. (Yes I know regexps aren't the best way to go about that but I'm sure there are better examples.) > > My point is just that regexps are useful enough to co-exist with lexers. > > But they're the same thing. Lexer provide regular definitions, > which is just a way of naming regexps, and reusing the regexps > by providing the name. Not in the form of lex/flex/ocamllex. Yes, lexer token definitions are equivalent to regexps, but everything else about them is different (specifically, lexers are event-based, and don't provied captures). I'd love to see a regexp engine allowing dynamic creation of token-based regexps, complete with captures. It could easily serve as both the base of a lexer and a substitution engine. Heck, that sounds like a fun project... what I am doing this weekend? :P ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Caml-list] Thread safe Str 2005-01-11 3:54 ` skaller 2005-01-11 7:03 ` Chris King @ 2005-01-11 20:53 ` Martin Jambon 2005-01-12 7:59 ` skaller 1 sibling, 1 reply; 19+ messages in thread From: Martin Jambon @ 2005-01-11 20:53 UTC (permalink / raw) To: skaller Cc: Martin Jambon, Xavier Leroy, Christophe TROESTLER, O'Caml Mailing List On 11 Jan 2005, skaller wrote: > So I would love to see integration of regexp support in Ocaml > but I'm very much against Str. If some technology is to > be integrated, please use the right technology and > integrate Ocamllex. > > See the ulex package for a model. The problem with micmatch > is precisely that it does use Str. It uses either Str or PCRE. The PCRE variant makes me feel very satisfied and as you know, a programmer which feels this way is unlikely to spend time to improve his tools :-) You can also integrate ulex by yourself if it is possible. In that case, it should be relatively straightforward to reuse the existing micmatch core which is already shared between the 2 existing variants. Martin -- Martin Jambon, PhD Researcher in Structural Bioinformatics since the 20th Century The Burnham Institute http://www.burnham.org San Diego, California ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Caml-list] Thread safe Str 2005-01-11 20:53 ` Martin Jambon @ 2005-01-12 7:59 ` skaller 2005-01-12 20:12 ` Martin Jambon 0 siblings, 1 reply; 19+ messages in thread From: skaller @ 2005-01-12 7:59 UTC (permalink / raw) To: Martin Jambon; +Cc: Xavier Leroy, Christophe TROESTLER, O'Caml Mailing List On Wed, 2005-01-12 at 07:53, Martin Jambon wrote: > On 11 Jan 2005, skaller wrote: > > See the ulex package for a model. The problem with micmatch > > is precisely that it does use Str. > > It uses either Str or PCRE. > You can also integrate ulex by yourself if it is possible. I'm not sure .. ulex doesn't use NFA's does it? AFIAK it doesn't support captures. The problem with micmatch is that it probably doesn't do what can be done with a system supported in the core language, linear matching over all cases. Felix does do it: in this code regmatch .. with | re1 => .. | re2 => .. | re3 =>.. endmatch the input characters (from ...) are read exactly once, not only is there no backtracking, since a DFA is used, but it isn't a sequence of comparisons (it's a DFA based tokeniser, the token selecting the RHS expression). I would guess micmatch does not support that, although it probably could. -- John Skaller, mailto:skaller@users.sf.net voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Caml-list] Thread safe Str 2005-01-12 7:59 ` skaller @ 2005-01-12 20:12 ` Martin Jambon 0 siblings, 0 replies; 19+ messages in thread From: Martin Jambon @ 2005-01-12 20:12 UTC (permalink / raw) To: skaller; +Cc: O'Caml Mailing List On 12 Jan 2005, skaller wrote: > On Wed, 2005-01-12 at 07:53, Martin Jambon wrote: > > On 11 Jan 2005, skaller wrote: > > > > See the ulex package for a model. The problem with micmatch > > > is precisely that it does use Str. > > > > It uses either Str or PCRE. > > > You can also integrate ulex by yourself if it is possible. > > I'm not sure .. ulex doesn't use NFA's does it? > AFIAK it doesn't support captures. > > The problem with micmatch is that it probably doesn't do what > can be done with a system supported in the core language, > linear matching over all cases. Felix does do it: in this code > > regmatch .. with > | re1 => .. > | re2 => .. > | re3 =>.. > endmatch > > the input characters (from ...) are read exactly once, > not only is there no backtracking, since a DFA is used, > but it isn't a sequence of comparisons (it's a DFA based > tokeniser, the token selecting the RHS expression). > > I would guess micmatch does not support that, although > it probably could. Sorry, I don't know the details of what can be done with DFA or NFA. The idea of micmatch is to be compatible with the behavior of the regular match...with, which allows us to mix both forms of pattern matching (returning the first match, not the longest match). The following returns `Case1: match "abcde", 1 with RE "abc", 1 -> `Case1 | RE "abcd", _ -> `Case2 And the following simplified code still returns `Case1: match "abcde" with RE "abc" -> `Case1 | RE "abcd" -> `Case2 And similarly this returns "abc": match "abcde" with RE ("abc" | "abcd" as x) -> x I could implement the following syntax: match subj with RE (re1 as x1 => `Case1 x1 as y) | (re2 as x2 => `Case2 x2 as y) -> y And the simplified forms: match subj with RE (re1 => `Case1 as y) | (re2 => `Case2 as y) -> y match subj with RE (re1 => f1 ()) | (re2 => f2 ()) -> () We could then do sophisticated things like that: match subj with RE ... ((re1 => f1 ()) | (re2 => raise Try_next_please)) ... -> ... | ... (* next *) -> ... but I don't know if these additions would be a good thing. Martin -- Martin Jambon, PhD Researcher in Structural Bioinformatics since the 20th Century The Burnham Institute http://www.burnham.org San Diego, California ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Caml-list] Thread safe Str 2005-01-10 15:49 ` Xavier Leroy ` (2 preceding siblings ...) 2005-01-10 20:25 ` Martin Jambon @ 2005-01-11 6:41 ` Chris King 3 siblings, 0 replies; 19+ messages in thread From: Chris King @ 2005-01-11 6:41 UTC (permalink / raw) To: Xavier Leroy; +Cc: O'Caml Mailing List On Mon, 10 Jan 2005 16:49:17 +0100, Xavier Leroy <Xavier.Leroy@inria.fr> wrote: > Besides being thread-safe, the new API could also expose > the abstract syntax tree for regexps, allowing easier construction of > complex regexps by programs than can be done by working at the level > of the string representation of regexps. Nifty! This is something I'd really like to see. > But if a few persons on this list want to team up to > design an API, that would be wonderful indeed. I was actually considering writing a wrapper for Str modeled after Python's object-oriented re library, which I've found to be very well-designed (much better than any syntax-based solution). Being relatively new to OCaml, I'm not sure that classes in the standand library are considered kosher; would it be preferable to use types rather than classes? Also, Python (and IIRC Perl also) support "raw" strings, strings which do not interpret escape sequences. These are very useful for avoiding extra backslash garbage in regular expressions; and (I assume) could be easily added to OCaml to ease the usage of regexps. ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2005-01-12 20:13 UTC | newest] Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2005-01-09 19:30 Thread safe Str Christophe TROESTLER 2005-01-09 20:57 ` [Caml-list] " Gerd Stolpmann 2005-01-10 9:57 ` Alex Baretta 2005-01-10 15:49 ` Xavier Leroy 2005-01-10 16:39 ` Richard Jones 2005-01-10 18:21 ` Eric C. Cooper 2005-01-10 20:25 ` Martin Jambon 2005-01-11 3:54 ` skaller 2005-01-11 7:03 ` Chris King 2005-01-11 8:06 ` skaller 2005-01-11 12:08 ` Gerd Stolpmann 2005-01-11 17:55 ` skaller 2005-01-11 20:30 ` Gerd Stolpmann 2005-01-12 7:42 ` skaller [not found] ` <875c7e070501111007dc3e86d@mail.gmail.com> [not found] ` <1105471138.2574.88.camel@pelican.wigram> [not found] ` <875c7e07050111115618692184@mail.gmail.com> 2005-01-11 19:58 ` Chris King 2005-01-11 20:53 ` Martin Jambon 2005-01-12 7:59 ` skaller 2005-01-12 20:12 ` Martin Jambon 2005-01-11 6:41 ` Chris King
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox