* Str.string_match incorrect @ 2004-12-22 3:49 skaller 2004-12-22 7:44 ` [Caml-list] " Evan Martin 0 siblings, 1 reply; 15+ messages in thread From: skaller @ 2004-12-22 3:49 UTC (permalink / raw) To: caml-list This looks like a fairly fundamental bug in Str module.. (so probably I'm missing something ..) This program: let m = Str.regexp "a";; let b = Str.string_match m "aa" 0;; print_endline (if b then "YES" else "NO");; prints YES. Clearly the string "aa" is NOT matched by the regexp "a". This *would be* the correct result for the Str.string_partial_match function which matches prefixes. However Str.string_match should match the whole string. [In bugtracker] -- John Skaller, mailto:skaller@users.sf.net voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Str.string_match incorrect 2004-12-22 3:49 Str.string_match incorrect skaller @ 2004-12-22 7:44 ` Evan Martin 2004-12-22 8:00 ` William Lovas 0 siblings, 1 reply; 15+ messages in thread From: Evan Martin @ 2004-12-22 7:44 UTC (permalink / raw) To: skaller; +Cc: caml-list On Wed, Dec 22, 2004 at 02:49:30PM +1100, skaller wrote: > This looks like a fairly fundamental bug in Str module.. > (so probably I'm missing something ..) > > This program: > > let m = Str.regexp "a";; > Str.string_match m "aa" 0;; [evaluates to true] This is consistent with the docs, which say: [string_match r s start] tests whether the characters in s starting at position start match the regular expression r. and in general with how regular expression systems work. string_match corresponds to running your automaton directly and seeing whether you end up in an accept state, while string_partial_match effectively adds an extra ".*" to the beginning. (It's more debatable whether this makes sense.) To force a match across the entire string, use $: # Str.string_match (Str.regexp "a$") "aa" 0;; - : bool = false -- Evan Martin martine@danga.com http://neugierig.org ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Str.string_match incorrect 2004-12-22 7:44 ` [Caml-list] " Evan Martin @ 2004-12-22 8:00 ` William Lovas 2004-12-22 8:38 ` Evan Martin ` (2 more replies) 0 siblings, 3 replies; 15+ messages in thread From: William Lovas @ 2004-12-22 8:00 UTC (permalink / raw) To: caml-list On Tue, Dec 21, 2004 at 11:44:55PM -0800, Evan Martin wrote: > This is consistent with the docs, which say: > [string_match r s start] tests whether the characters in s starting at > position start match the regular expression r. > and in general with how regular expression systems work. string_match > corresponds to running your automaton directly and seeing whether you > end up in an accept state, while string_partial_match effectively adds > an extra ".*" to the beginning. > (It's more debatable whether this makes sense.) I concur with your assessment, but i think you're characterization of the semantics of string_partial_match is inaccurate: # Str.string_match (Str.regexp "ab") "a" 0;; - : bool = false # Str.string_partial_match (Str.regexp "ab") "a" 0;; - : bool = true # Str.string_match (Str.regexp ".*ab") "a" 0;; - : bool = false ... unless of course, i've misunderstood you. I don't think there's a simple transformation on regular expressions that allow you to emulate string_partial_match's behavior using only string_match. (Does anyone care to prove me wrong? :) cheers, William ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Str.string_match incorrect 2004-12-22 8:00 ` William Lovas @ 2004-12-22 8:38 ` Evan Martin 2004-12-22 10:37 ` Gerd Stolpmann 2004-12-22 15:57 ` skaller 2 siblings, 0 replies; 15+ messages in thread From: Evan Martin @ 2004-12-22 8:38 UTC (permalink / raw) To: caml-list On Wed, Dec 22, 2004 at 03:00:10AM -0500, William Lovas wrote: > I concur with your assessment, but i think you're characterization of the > semantics of string_partial_match is inaccurate: > > # Str.string_match (Str.regexp "ab") "a" 0;; > - : bool = false > # Str.string_partial_match (Str.regexp "ab") "a" 0;; > - : bool = true > # Str.string_match (Str.regexp ".*ab") "a" 0;; > - : bool = false > > ... unless of course, i've misunderstood you. I don't think there's a > simple transformation on regular expressions that allow you to emulate > string_partial_match's behavior using only string_match. (Does anyone > care to prove me wrong? :) No misunderstanding except for mine. You are correct. (I had assumed, without reading the docs, that the match/partial_match distinction would be like Python's match/search.) -- Evan Martin martine@danga.com http://neugierig.org ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Str.string_match incorrect 2004-12-22 8:00 ` William Lovas 2004-12-22 8:38 ` Evan Martin @ 2004-12-22 10:37 ` Gerd Stolpmann 2004-12-22 15:57 ` skaller 2 siblings, 0 replies; 15+ messages in thread From: Gerd Stolpmann @ 2004-12-22 10:37 UTC (permalink / raw) To: William Lovas; +Cc: caml-list On Mit, 2004-12-22 at 09:00, William Lovas wrote: > On Tue, Dec 21, 2004 at 11:44:55PM -0800, Evan Martin wrote: > > This is consistent with the docs, which say: > > [string_match r s start] tests whether the characters in s starting at > > position start match the regular expression r. > > and in general with how regular expression systems work. string_match > > corresponds to running your automaton directly and seeing whether you > > end up in an accept state, while string_partial_match effectively adds > > an extra ".*" to the beginning. > > (It's more debatable whether this makes sense.) > > I concur with your assessment, but i think you're characterization of the > semantics of string_partial_match is inaccurate: > > # Str.string_match (Str.regexp "ab") "a" 0;; > - : bool = false > # Str.string_partial_match (Str.regexp "ab") "a" 0;; > - : bool = true > # Str.string_match (Str.regexp ".*ab") "a" 0;; > - : bool = false > > ... unless of course, i've misunderstood you. I don't think there's a > simple transformation on regular expressions that allow you to emulate > string_partial_match's behavior using only string_match. (Does anyone > care to prove me wrong? :) The difference between string_match and string_partial_match is quite simple: - string_match: whether a prefix of the string matches the regexp - string_partial_match: whether the string is a prefix of another string matching the regexp On the level of the automaton, string_match is true when the automaton reaches an accept state whereas string_partial_match is true when the automaton is in a valid state at the end of the string and there is still a path to an accept state. Of course, it is easy to emulate string_partial_match when you already have the automaton - just make all states accepting that have a path to an accept state. But there is not a comparably simple rule for the regular expression. You can do an emulating transformation by generating the automaton, adding the additional accept states, and transforming the automaton back to the regexp. There is an algorithm for this, but the length of the regexp explodes (exponentially when I remember correctly). Gerd -- ------------------------------------------------------------ Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de ------------------------------------------------------------ ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Str.string_match incorrect 2004-12-22 8:00 ` William Lovas 2004-12-22 8:38 ` Evan Martin 2004-12-22 10:37 ` Gerd Stolpmann @ 2004-12-22 15:57 ` skaller 2004-12-22 16:58 ` David Brown 2004-12-22 17:26 ` Kurt Welgehausen 2 siblings, 2 replies; 15+ messages in thread From: skaller @ 2004-12-22 15:57 UTC (permalink / raw) To: William Lovas; +Cc: caml-list On Wed, 2004-12-22 at 19:00, William Lovas wrote: > On Tue, Dec 21, 2004 at 11:44:55PM -0800, Evan Martin wrote: > > This is consistent with the docs, which say: > > [string_match r s start] tests whether the characters in s starting at > > position start match the regular expression r. > > and in general with how regular expression systems work. Then they're simply wrong. The fundamental operation is to check if a string is in a regular set of strings. Plainly 'aa' is not in the set { 'a' }. string_match is actually testing if some prefix of the argument is in the regular set -- this is core operation of a lexical analyser. I'm not against having that operation -- Felix does -- but it leaves us without the most fundamental and important operation -- validation. > I concur with your assessment, but i think you're characterization of the > semantics of string_partial_match is inaccurate: Looks like partial_match runs thru the whole string, and if it doesn't encounter an error returns true. This is the same automaton in which all non-error states are considered accepting states. An error state is actually a state where an accepting state is not reachable. There is a transformation: construct the DFA, mark non-error states accepting, then generate the corresponding regexp. Whether that is a 'simple' transformation is another issue :)) -- John Skaller, mailto:skaller@users.sf.net voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Str.string_match incorrect 2004-12-22 15:57 ` skaller @ 2004-12-22 16:58 ` David Brown 2004-12-23 2:33 ` skaller 2004-12-22 17:26 ` Kurt Welgehausen 1 sibling, 1 reply; 15+ messages in thread From: David Brown @ 2004-12-22 16:58 UTC (permalink / raw) To: skaller; +Cc: William Lovas, caml-list On Thu, Dec 23, 2004 at 02:57:25AM +1100, skaller wrote: > On Wed, 2004-12-22 at 19:00, William Lovas wrote: > > On Tue, Dec 21, 2004 at 11:44:55PM -0800, Evan Martin wrote: > > > This is consistent with the docs, which say: > > > [string_match r s start] tests whether the characters in s starting at > > > position start match the regular expression r. > > > and in general with how regular expression systems work. > > Then they're simply wrong. The fundamental operation is > to check if a string is in a regular set of strings. > Plainly 'aa' is not in the set { 'a' }. This is a strange notion of right and wrong. The function behaves exactly as it is specified in the documentation. It is not difficult to append a '$' to the regular expression to only match the entire string. If the function only matched the entire string, yes, appending ".*" would cause it to match partial matches, but then it would be more difficult to extract the matched pattern out. Arguably, another function could be provided to always do full matching of strings, but I suspect it just isn't used all that much. Dave ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Str.string_match incorrect 2004-12-22 16:58 ` David Brown @ 2004-12-23 2:33 ` skaller 2004-12-24 17:40 ` Christopher A. Watford 0 siblings, 1 reply; 15+ messages in thread From: skaller @ 2004-12-23 2:33 UTC (permalink / raw) To: David Brown; +Cc: William Lovas, caml-list On Thu, 2004-12-23 at 03:58, David Brown wrote: > On Thu, Dec 23, 2004 at 02:57:25AM +1100, skaller wrote: > > On Wed, 2004-12-22 at 19:00, William Lovas wrote: > > > On Tue, Dec 21, 2004 at 11:44:55PM -0800, Evan Martin wrote: > > > > This is consistent with the docs, which say: > > > > [string_match r s start] tests whether the characters in s starting at > > > > position start match the regular expression r. > > > > and in general with how regular expression systems work. > > > > Then they're simply wrong. The fundamental operation is > > to check if a string is in a regular set of strings. > > Plainly 'aa' is not in the set { 'a' }. > > This is a strange notion of right and wrong. The function behaves exactly > as it is specified in the documentation. Huh? I'm really confused. Two people think the documentation is correct for the behaviour. I do not -- Saying string x is matched by regexp r means to me the same as x is a member of the regular set of strings r denotes and with that interpretation there is no possible doubt that the documentation is wrong and should say matches a prefix of the argument string in order to describe the behaviour. -- John Skaller, mailto:skaller@users.sf.net voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Str.string_match incorrect 2004-12-23 2:33 ` skaller @ 2004-12-24 17:40 ` Christopher A. Watford 2004-12-25 0:57 ` skaller 0 siblings, 1 reply; 15+ messages in thread From: Christopher A. Watford @ 2004-12-24 17:40 UTC (permalink / raw) To: skaller; +Cc: David Brown, William Lovas, caml-list > > > Then they're simply wrong. The fundamental operation is > > > to check if a string is in a regular set of strings. > > > Plainly 'aa' is not in the set { 'a' }. > > > > This is a strange notion of right and wrong. The function behaves exactly > > as it is specified in the documentation. > > Huh? I'm really confused. Two people think the documentation > is correct for the behaviour. > > I do not -- Saying > > string x is matched by regexp r > > means to me the same as > > x is a member of the regular set of strings r denotes > > and with that interpretation there is no possible doubt that > the documentation is wrong and should say > > matches a prefix of the argument string > > in order to describe the behaviour. > Going by PCRE's documentation the regex /a/ matches: 'a', 'aa', 'baaaaa', 'nota bene', and any other string with 'a' anywhere inside. >From OCaml's description of string_match: val string_match: regexp -> string -> int -> bool [string_match r s start] tests whether the characters in s starting at position start match the regular expression r. The first character of a string has position 0, as usual. This sounds to me like you're doing the PCRE: /a/ if you say: Str.string_match (Str.regexp "a") "ab" 0 ;; And in PCRE /a/ matches "ab" or "aa" or "ba". /^a/ matches "aa", "ab", but not "ba", and /a$/ matches "aa", "ba", but not "ab". So as far as I can tell, OCaml's string_match is correct as the manual describes. -- Christopher A. Watford christopher.watford@gmail.com http://dorm.tunkeymicket.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Str.string_match incorrect 2004-12-24 17:40 ` Christopher A. Watford @ 2004-12-25 0:57 ` skaller 2004-12-25 3:07 ` Christopher A. Watford 2004-12-26 1:14 ` William Lovas 0 siblings, 2 replies; 15+ messages in thread From: skaller @ 2004-12-25 0:57 UTC (permalink / raw) To: Christopher A. Watford; +Cc: David Brown, William Lovas, caml-list On Sat, 2004-12-25 at 04:40, Christopher A. Watford wrote: > Going by PCRE's documentation no idea why I would do that ... > Str.string_match (Str.regexp "a") "ab" 0 ;; > > And in PCRE /a/ matches "ab" or "aa" or "ba". # Str.string_match (Str.regexp "a") "xax" 0;; - : bool = false -- John Skaller, mailto:skaller@users.sf.net voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Str.string_match incorrect 2004-12-25 0:57 ` skaller @ 2004-12-25 3:07 ` Christopher A. Watford 2004-12-25 4:24 ` skaller 2004-12-26 1:14 ` William Lovas 1 sibling, 1 reply; 15+ messages in thread From: Christopher A. Watford @ 2004-12-25 3:07 UTC (permalink / raw) To: skaller; +Cc: David Brown, William Lovas, caml-list On 25 Dec 2004 11:57:05 +1100, skaller <skaller@users.sourceforge.net> wrote: > On Sat, 2004-12-25 at 04:40, Christopher A. Watford wrote: > > > Going by PCRE's documentation > > no idea why I would do that ... > > > Str.string_match (Str.regexp "a") "ab" 0 ;; > > > > And in PCRE /a/ matches "ab" or "aa" or "ba". > > # Str.string_match (Str.regexp "a") "xax" 0;; > - : bool = false > > -- > John Skaller, mailto:skaller@users.sf.net > voice: 061-2-9660-0850, > snail: PO BOX 401 Glebe NSW 2037 Australia > Checkout the Felix programming language http://felix.sf.net > > >From the documentation, as you have pointed out (now I see what you mean), it does not make this constraint obvious. It appears that string_match's documentation should include that an implicit ^ is added. string_partial_match makes this clear however. -- Christopher A. Watford christopher.watford@gmail.com http://dorm.tunkeymicket.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Str.string_match incorrect 2004-12-25 3:07 ` Christopher A. Watford @ 2004-12-25 4:24 ` skaller 0 siblings, 0 replies; 15+ messages in thread From: skaller @ 2004-12-25 4:24 UTC (permalink / raw) To: Christopher A. Watford; +Cc: David Brown, William Lovas, caml-list On Sat, 2004-12-25 at 14:07, Christopher A. Watford wrote: > >From the documentation, as you have pointed out (now I see what you > mean), it does not make this constraint obvious. It appears that > string_match's documentation should include that an implicit ^ is > added. string_partial_match makes this clear however. Except even that isn't quite correct -- ^ matches a newline as well as start of string. When matching on a 'string' the special handling of newline is out of place -- newlines are ordinary characters. When matching on a whole file, as in awk/sed/perl style of tools, the line orientation makes some sense. In any case, the bug is a mismatch, IMHO, between what the function does and what the documentation says it does. I had expected string_match to match a whole string or fail, because that's what the documentation says it does. However there are two compatibility arguments for changing the documentation instead -- non Ocaml user expectations (clearly indicated in replies to my post), and breaking of existing Ocaml code that relies on the actual behaviour not the documented behaviour, misreading it as others seem to have done, and assuming PCRE/Emacs/awk/sed ... or whatever .. semantics. If that path is chosen I think it would be nice to provide a function that actually does what the documentation *originally* specified (that is, what it says right now): match a whole string. This is because the obvious hack: ^r$ does not in fact work if the string contains newlines -- according to the documentation anyhow :) I'm not a heavy user of Str -- I would never use it in a serious application. However I used it in the program I wrote to build test harness for ExtLib (there is now an extlib-test module in CVS which tests ExtLib) and it didn't work. I did indeed fix the problem with ^r$, because I'm checking filenames which seem unlikely to have any newlines in them. Anyhow .. it's up to Xavier: I posted the original note into the bugtracker too :) -- John Skaller, mailto:skaller@users.sf.net voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Str.string_match incorrect 2004-12-25 0:57 ` skaller 2004-12-25 3:07 ` Christopher A. Watford @ 2004-12-26 1:14 ` William Lovas 1 sibling, 0 replies; 15+ messages in thread From: William Lovas @ 2004-12-26 1:14 UTC (permalink / raw) To: caml-list On Sat, Dec 25, 2004 at 11:57:05AM +1100, skaller wrote: > On Sat, 2004-12-25 at 04:40, Christopher A. Watford wrote: > > > Going by PCRE's documentation > > no idea why I would do that ... > > > Str.string_match (Str.regexp "a") "ab" 0 ;; > > > > And in PCRE /a/ matches "ab" or "aa" or "ba". > > # Str.string_match (Str.regexp "a") "xax" 0;; > - : bool = false *That* i would call a bug. I think that of Str.string_match (Str.regexp "a") "xa" 0;; Str.string_match (Str.regexp "a") "ax" 0;; either both should evaluate to true, or neither should. Both evaluating to true would match the typical perl/grep/etc. notion of searching anywhere inside a string to for a matching substring, whereas neither doing so would correspond to the mathematical notion of containment in the language of a regular expression. William ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Str.string_match incorrect 2004-12-22 15:57 ` skaller 2004-12-22 16:58 ` David Brown @ 2004-12-22 17:26 ` Kurt Welgehausen 2004-12-23 2:09 ` skaller 1 sibling, 1 reply; 15+ messages in thread From: Kurt Welgehausen @ 2004-12-22 17:26 UTC (permalink / raw) To: caml-list > On Tue, Dec 21, 2004 at 11:44:55PM -0800, Evan Martin wrote: > > This is consistent with the docs, which say: > > [string_match r s start] tests whether the characters in s starting at > > position start match the regular expression r. > > and in general with how regular expression systems work. > > Then they're simply wrong. The fundamental operation is > to check if a string is in a regular set of strings. > Plainly 'aa' is not in the set { 'a' }. Str.string_match is almost exactly the same as Python's re.match: # Str.string_match (Str.regexp "a") "ab" 0 ;; - : bool = true >>> re.match(re.compile("a"), "ab") #succeeds <_sre.SRE_Match object at 0x8146f60> # Str.string_match (Str.regexp "a") "bab" 0 ;; - : bool = false >>> re.match(re.compile("a"), "bab") #returns None >>> # Str.string_match (Str.regexp "a") "bab" 1 ;; - : bool = true >>> re.match(re.compile("a"), "bab"[1: ]) <_sre.SRE_Match object at 0x814c748> >>> #Python has no parameter to change the start position Regards ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Caml-list] Str.string_match incorrect 2004-12-22 17:26 ` Kurt Welgehausen @ 2004-12-23 2:09 ` skaller 0 siblings, 0 replies; 15+ messages in thread From: skaller @ 2004-12-23 2:09 UTC (permalink / raw) To: Kurt Welgehausen; +Cc: caml-list On Thu, 2004-12-23 at 04:26, Kurt Welgehausen wrote: > Str.string_match is almost exactly the same as Python's > re.match: Which is based on PCRE. I checked the Python documentation and it is *also* incorrect for the same reason. I do not care what the expected behaviour is, provided the documentation describes the semantics, and in this case both Ocaml and Python have it wrong. -- John Skaller, mailto:skaller@users.sf.net voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2004-12-26 1:14 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2004-12-22 3:49 Str.string_match incorrect skaller 2004-12-22 7:44 ` [Caml-list] " Evan Martin 2004-12-22 8:00 ` William Lovas 2004-12-22 8:38 ` Evan Martin 2004-12-22 10:37 ` Gerd Stolpmann 2004-12-22 15:57 ` skaller 2004-12-22 16:58 ` David Brown 2004-12-23 2:33 ` skaller 2004-12-24 17:40 ` Christopher A. Watford 2004-12-25 0:57 ` skaller 2004-12-25 3:07 ` Christopher A. Watford 2004-12-25 4:24 ` skaller 2004-12-26 1:14 ` William Lovas 2004-12-22 17:26 ` Kurt Welgehausen 2004-12-23 2:09 ` skaller
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox