* substring match like "strstr" @ 2000-12-08 6:04 eijiro_sumii 2000-12-08 12:57 ` Gerd Stolpmann 0 siblings, 1 reply; 23+ messages in thread From: eijiro_sumii @ 2000-12-08 6:04 UTC (permalink / raw) To: caml-list; +Cc: sumii Hi all, Is there a substring matching function (like "strstr" in libc) in the standard OCaml library? Of course there are many ways to implement it (by writing it from scratch, using the Str library, interfacing "strstr" in libc, etc.), but they are overkill for my purpose. So I'm wondering whether such a function already exists, but I couldn't find it in the manual... Eijiro ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-08 6:04 substring match like "strstr" eijiro_sumii @ 2000-12-08 12:57 ` Gerd Stolpmann 2000-12-10 13:16 ` eijiro_sumii 0 siblings, 1 reply; 23+ messages in thread From: Gerd Stolpmann @ 2000-12-08 12:57 UTC (permalink / raw) To: eijiro_sumii, caml-list; +Cc: sumii On Fri, 08 Dec 2000, eijiro_sumii@anet.ne.jp wrote: >Hi all, > >Is there a substring matching function (like "strstr" in libc) in the >standard OCaml library? Of course there are many ways to implement it >(by writing it from scratch, using the Str library, interfacing >"strstr" in libc, etc.), but they are overkill for my purpose. So I'm >wondering whether such a function already exists, but I couldn't find >it in the manual... There isn't such a function. You can write it yourself: let find_substring s1 s2 = (* find s2 in s1, or raise Not_found *) if String.length s2 > String.length s1 then raise Not_found; let b = String.create (String.length s2) in let rec search k = if k > String.length s1 - String.length s2 then raise Not_found; String.blit ~src:s1 ~src_pos:k ~dst:b ~dst_pos:0 ~len:(String.length s2); if b = s2 then k else search (k+1) in search 0 ;; This version of find_substring avoids superflous heap allocations, and I think it is tricky anough to be included into the stdlib. (However, if we had strstr as primitive, even the allocation of b could be avoided.) If speed is important, I recommend let find_substring s1 s2 = Str.search_forward (Str.regexp (Str.quote s2)) s1 0 ;; I hope these solutions aren't overkilling. Gerd -- ---------------------------------------------------------------------------- Gerd Stolpmann Telefon: +49 6151 997705 (privat) Viktoriastr. 100 64293 Darmstadt EMail: gerd@gerd-stolpmann.de Germany ---------------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-08 12:57 ` Gerd Stolpmann @ 2000-12-10 13:16 ` eijiro_sumii 2000-12-10 15:39 ` Gerd Stolpmann 2000-12-11 21:07 ` Chris Hecker 0 siblings, 2 replies; 23+ messages in thread From: eijiro_sumii @ 2000-12-10 13:16 UTC (permalink / raw) To: caml-list; +Cc: gerd, sumii > You can write it yourself: ... > If speed is important, I recommend > > let find_substring s1 s2 = > Str.search_forward (Str.regexp (Str.quote s2)) s1 0 > ;; OK, I benchmarked the following 4 implementations for the purpose of comparison. strstr stub to call "strstr" in libc regexp combination of "Str.regexp_string" and "Str.search_forward" OCamlA the simple implementation in OCaml given in the previous message, a little tuned for more efficiency OCamlB another straightforward implementation in OCaml, attached at the end of this message The results (execution time in seconds) were as follows. strstr 55.74 regexp 154.37 OCamlA 302.57 OCamlB 129.23 The application is a kind of library for data mining of gene sequences. It calls the substring matching function more than 400,000,000 times. The machine is a Sun UltraEnterprise 450 with 4 UltraSPARC II (480 MHz) processors and 2GB main memory, but the program itself doesn't make any use of multiple processors or much space. Although I haven't checked the reasons for these results in detail, this might hopefully be of some information for other people, too. And probably I should also consider implementing more sophisticated, efficient algorithm in OCaml... // Eijiro Sumii (http://www.yl.is.s.u-tokyo.ac.jp/~sumii/) // // Ph.D. Student at Department of IS, University of Tokyo // Visiting Scholar at Department of CIS, University of Pennsylvania ---------------------------------------------------------------------- let strstr pat str = let patlen = String.length pat in let strlen = String.length str in let rec is_prefix pos spos epos = if pos < 0 then true else if String.unsafe_get pat pos <> String.unsafe_get str epos then false else is_prefix (pos - 1) spos (epos - 1) in let rec search spos = let epos = spos + patlen - 1 in if epos >= strlen then raise Not_found else if is_prefix (patlen - 1) spos epos then spos else search (spos + 1) in search 0 ---------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-10 13:16 ` eijiro_sumii @ 2000-12-10 15:39 ` Gerd Stolpmann 2000-12-11 3:57 ` eijiro_sumii 2000-12-12 13:58 ` Julian Assange 2000-12-11 21:07 ` Chris Hecker 1 sibling, 2 replies; 23+ messages in thread From: Gerd Stolpmann @ 2000-12-10 15:39 UTC (permalink / raw) To: eijiro_sumii, caml-list; +Cc: gerd, sumii On Sun, 10 Dec 2000, eijiro_sumii@anet.ne.jp wrote: >OK, I benchmarked the following 4 implementations for the purpose of >comparison. > > strstr > stub to call "strstr" in libc > > regexp > combination of "Str.regexp_string" and "Str.search_forward" > > OCamlA > the simple implementation in OCaml given in the previous message, > a little tuned for more efficiency > > OCamlB > another straightforward implementation in OCaml, attached at the > end of this message > >The results (execution time in seconds) were as follows. > > strstr 55.74 > regexp 154.37 > OCamlA 302.57 > OCamlB 129.23 I did not expect that OCamlB performs so well; so I suggested OcamlA and regexp (which are both fast for the bytecode interpreter, too). I think it depends also on the problem size (especially on the length of the substring). Gerd -- ---------------------------------------------------------------------------- Gerd Stolpmann Telefon: +49 6151 997705 (privat) Viktoriastr. 100 64293 Darmstadt EMail: gerd@gerd-stolpmann.de Germany ---------------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-10 15:39 ` Gerd Stolpmann @ 2000-12-11 3:57 ` eijiro_sumii 2000-12-12 13:58 ` Julian Assange 1 sibling, 0 replies; 23+ messages in thread From: eijiro_sumii @ 2000-12-11 3:57 UTC (permalink / raw) To: caml-list; +Cc: gerd, sumii > I did not expect that OCamlB performs so well; so I suggested OcamlA and > regexp (which are both fast for the bytecode interpreter, too). I think it > depends also on the problem size (especially on the length of the substring). Right, I forgot to mention those -- in the benchmark, the "needles" (matched substrings) were 3 characters long and the "haystacks" (searched strings) were about 20 characters long; I used the native code compiler. For much larger input, regexp could possibly do better. Anyway, thank you for the helpful suggestions. // Eijiro Sumii (http://www.yl.is.s.u-tokyo.ac.jp/~sumii/) // // Ph.D. Student at Department of IS, University of Tokyo // Visiting Scholar at Department of CIS, University of Pennsylvania ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-10 15:39 ` Gerd Stolpmann 2000-12-11 3:57 ` eijiro_sumii @ 2000-12-12 13:58 ` Julian Assange 1 sibling, 0 replies; 23+ messages in thread From: Julian Assange @ 2000-12-12 13:58 UTC (permalink / raw) To: gerd; +Cc: eijiro_sumii, caml-list, sumii, proff Gerd Stolpmann <gerd@gerd-stolpmann.de> writes: > >The results (execution time in seconds) were as follows. > > > > strstr 55.74 > > regexp 154.37 > > OCamlA 302.57 > > OCamlB 129.23 > > I did not expect that OCamlB performs so well; so I suggested OcamlA and > regexp (which are both fast for the bytecode interpreter, too). I think it > depends also on the problem size (especially on the length of the substring). > > Gerd If you are really calling strstr 400,000,000 times, I strongly suggest the use of Boyer-Moore. Is your alphabet amino-acids or base-pairs? If the latter, I have developed parallel hashing version of Boyer-More (lets call it Assange-Boyer-More :), which beats the pants off Knuth-Moris-Pratt. 500 long substrings searches are only about 50% slower than 1, although for this application, if you have the memory, you might also want to look at suffix trees or suffix arrays. -- Julian Assange |If you want to build a ship, don't drum up people |together to collect wood or assign them tasks proff@iq.org |and work, but rather teach them to long for the endless proff@gnu.ai.mit.edu |immensity of the sea. -- Antoine de Saint Exupery ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-10 13:16 ` eijiro_sumii 2000-12-10 15:39 ` Gerd Stolpmann @ 2000-12-11 21:07 ` Chris Hecker 2000-12-11 22:22 ` Gerd Stolpmann ` (3 more replies) 1 sibling, 4 replies; 23+ messages in thread From: Chris Hecker @ 2000-12-11 21:07 UTC (permalink / raw) To: eijiro_sumii, caml-list; +Cc: gerd, sumii >The results (execution time in seconds) were as follows. > strstr 55.74 > regexp 154.37 > OCamlA 302.57 > OCamlB 129.23 Any ideas why strstr blows the others away? What's the libc strstr look like? I just looked in the MSVC source and it's a braindead while loop (copied below), so it's not like it's doing a fancy Boyer-Moore or anything. This is exactly the kind of problem on which I'd expect caml to come within 10% of c. Have you tried a purely imperative version? I'd say I'd do the tests myself, but I don't have a bunch of gene sequences laying around. :) Okay, I'm curious, so I'll port the code to caml and include it below as well (as practice for myself). Can you try it in your test harness? I've attached a cheesy test app with your caml, calling msvc's strstr, and my imperative version. I make no claims of profiling accuracy or even code correctness, but here are the results when run from a DOS box on Win98: strstr total = 1.060000 (5,0,19,13) strstr_imp total = 0.510000 (5,0,19,13) strstr_c total = 0.415000 (5,0,19,13) (The weird thing is when I ran it with shell-command under emacs the timings were strstr total = 0.575000 (5,0,19,13) strstr_imp total = 0.500000 (5,0,19,13) strstr_c total = 0.415000 (5,0,19,13) which I found bizarre since strstr became 2x faster, but I haven't had time to look into it. Can somebody else try these tests? ) Looking at the asm output from strstr_imp, there's not that much you could do to make it better. Maybe there are a few peephole things, and there are probably 30% more instructions than you need in this specific case because the code to compare characters goes to the trouble of keeping the ints in the caml format even in registers inside the loop (so they're shifted back and forth), and it converts the chars from the unsafe_gets to full caml ints, which is useless since they just get compared to each other and then dumped. It might be interesting to write a peephole optimizer aimed at peepholing camlopt code, and looking for things like this. Can anybody optimize the caml src for strstr_imp? Chris ----strstrc.c----- #include "/test/ocaml/lib/caml/mlvalues.h" /* ripped from msvc crt src */ char * __cdecl strstr ( const char * str1, const char * str2 ) { char *cp = (char *) str1; char *s1, *s2; if ( !*str2 ) return((char *)str1); while (*cp) { s1 = cp; s2 = (char *) str2; while ( *s1 && *s2 && !(*s1-*s2) ) s1++, s2++; if (!*s2) return(cp); cp++; } return(0); } value Camlstrstr( value pat, value str ) { char *string = String_val(str); return Val_int(strstr(string,String_val(pat))-string); } ------strstrc.mli----- external strstr_c: string -> string -> int = "Camlstrstr" ------strstr.ml------- (* Eijiro Sumii's functional strstr *) let strstr pat str = let patlen = String.length pat in let strlen = String.length str in let rec is_prefix pos spos epos = if pos < 0 then true else if String.unsafe_get pat pos <> String.unsafe_get str epos then false else is_prefix (pos - 1) spos (epos - 1) in let rec search spos = let epos = spos + patlen - 1 in if epos >= strlen then raise Not_found else if is_prefix (patlen - 1) spos epos then spos else search (spos + 1) in search 0 (* checker's lame imperative strstr *) let strstr_imp pat str = let strlen = String.length str in let patlen = String.length pat and i = ref 0 and f = ref strlen in while !i < strlen do let pati = ref 0 and j = ref !i in while !pati < patlen && !j < strlen && String.unsafe_get str !j == String.unsafe_get pat !pati do pati := succ !pati; j := succ !j; done; if !pati = patlen then begin f := !i; i := strlen end else i := succ !i done; !f (* a really questionable timing harness *) let time_strstr name strstr = let t0 = Unix.gettimeofday () in for i = 0 to 100000 do strstr "this" "that this the other"; strstr "this" "this that this the other"; strstr "this" "that tis the other this"; strstr "this" "that the othethisr th"; done; let t = Unix.gettimeofday () -. t0 in Printf.printf "%s total = %f (%d,%d,%d,%d)\n" name t (* a really pathetic regression test *) (strstr "this" "that this the other") (strstr "this" "this that this the other") (strstr "this" "that tis the other this") (strstr "this" "that the othethisr th") (* main *) let _ = time_strstr "strstr" strstr; time_strstr "strstr_imp" strstr_imp; time_strstr "strstr_c" Strstrc.strstr_c; ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-11 21:07 ` Chris Hecker @ 2000-12-11 22:22 ` Gerd Stolpmann 2000-12-12 5:06 ` Chris Hecker 2000-12-12 3:28 ` eijiro_sumii ` (2 subsequent siblings) 3 siblings, 1 reply; 23+ messages in thread From: Gerd Stolpmann @ 2000-12-11 22:22 UTC (permalink / raw) To: Chris Hecker, eijiro_sumii, caml-list; +Cc: gerd, sumii On Mon, 11 Dec 2000, Chris Hecker wrote: >Any ideas why strstr blows the others away? It's the type of brain-dead loop for which C compilers are written. >Have you tried a purely imperative version? I'd say I'd do the tests myself, >but I don't have a bunch of gene sequences laying around. :) It's not always the imperative version which is better. The difference in the generated code of a functional and an imperative version is often minimal. >Looking at the asm output from strstr_imp, there's not that much you could do >to make it better. Maybe there are a few peephole things, and there are >probably 30% more instructions than you need in this specific case because the >code to compare characters goes to the trouble of keeping the ints in the caml >format even in registers inside the loop (so they're shifted back and forth), >and it converts the chars from the unsafe_gets to full caml ints, which is >useless since they just get compared to each other and then dumped. It might >be interesting to write a peephole optimizer aimed at peepholing camlopt code, >and looking for things like this. > >Can anybody optimize the caml src for strstr_imp? let strstr_imp2 pat str = let strlen = String.length str in let patlen = String.length pat and i = ref 0 and f = ref strlen in let i_limit = strlen - patlen in while !i < i_limit do let pati = ref 0 and j = ref !i in while !pati < patlen && String.unsafe_get str !j == String.unsafe_get pat !pati do pati := succ !pati; j := succ !j; done; if !pati = patlen then begin f := !i; i := strlen end else i := succ !i done; !f ;; Perhaps !pati < patlen can also be removed by introducing a "guard" (modifying the last character of pat such that the loop runs always into an inequality). However, this would require extra code managing the temporary modification of pat, and I guess it is not worth-while for short pats. Gerd -- ---------------------------------------------------------------------------- Gerd Stolpmann Telefon: +49 6151 997705 (privat) Viktoriastr. 100 64293 Darmstadt EMail: gerd@gerd-stolpmann.de Germany ---------------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-11 22:22 ` Gerd Stolpmann @ 2000-12-12 5:06 ` Chris Hecker 2000-12-12 12:28 ` Jean-Christophe Filliatre 0 siblings, 1 reply; 23+ messages in thread From: Chris Hecker @ 2000-12-12 5:06 UTC (permalink / raw) To: gerd, eijiro_sumii, caml-list; +Cc: gerd, sumii > let i_limit = strlen - patlen in Ah, duh. Good idea. Here are some new timings, including the _imp2 (I fixed a minor fencepost bug in yours in an admittedly cheesy way; the code's below): Under Win2k cmd.exe: strstr total = 0.610000 (5,0,19,13) strstr_imp total = 0.593000 (5,0,19,13) strstr_imp2 total = 0.563000 (5,0,19,13) strstr_c total = 0.469000 (5,0,19,13) Under Win98 command.com: strstr total = 1.460000 (5,0,19,13) strstr_imp total = 1.020000 (5,0,19,13) strstr_imp2 total = 0.955000 (5,0,19,13) strstr_c total = 0.800000 (5,0,19,13) Under Win98 emacs shell-command: strstr total = 1.125000 (5,0,19,13) strstr_imp total = 1.025000 (5,0,19,13) strstr_imp2 total = 0.960000 (5,0,19,13) strstr_c total = 0.800000 (5,0,19,13) I doubled the number of iterations to try to figure out the emacs/command.com difference, but to no avail. However, then I decided to check to see if this was actually the strstr in the library, so instead of calling that c code, I called string.h's strstr. I discovered that there's actually an intel-asm version of strstr that's compiled into the crt: strstr total = 1.130000 (5,0,19,13) strstr_imp total = 1.020000 (5,0,19,13) strstr_imp2 total = 0.960000 (5,0,19,13) strstr_c total = 0.365000 (5,0,19,13) Ouch! We're back to 3x slower. But hey, no one said ocaml could compete with asm. :) Anyway, as it stands, the ocaml code is 17% slower than the C code at relatively comparable code-simplicity. I'd say the ocaml code is a bit harder to understand and wordier, but that's mostly because ocaml doesn't have modifyable pointers, and the C code just walks the pointers through the array. Also, the C has an early-out return that ocaml can't do (without an exception), and that makes things a little simpler as well. Of course, I'm not saying you can extrapolate anything about real programs from this simple example, but it's interesting to me nonetheless. Chris Here's the updated strstr_imp2. If you really wanted to try to max out the ocaml, we could read the pattern into an int and try to match it a word at a time, etc. (* "optimized" imperative strstr *) let strstr_imp2 pat str = let strlen = String.length str in let patlen = String.length pat and i = ref 0 and f = ref strlen in let limit = strlen - patlen + 1 in while !i < limit do let pati = ref 0 and j = ref !i in while !pati < patlen && String.unsafe_get str !j == String.unsafe_get pat !pati do pati := succ !pati; j := succ !j; done; if !pati = patlen then begin f := !i; i := strlen end else i := succ !i done; !f ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-12 5:06 ` Chris Hecker @ 2000-12-12 12:28 ` Jean-Christophe Filliatre 2000-12-13 10:02 ` eijiro_sumii 0 siblings, 1 reply; 23+ messages in thread From: Jean-Christophe Filliatre @ 2000-12-12 12:28 UTC (permalink / raw) To: Chris Hecker; +Cc: gerd, eijiro_sumii, caml-list, sumii Hello, I missed the beginning of the discussion, but implementing Knuth-Morris-Pratt is quite easy. The code is given below (26 lines), following Sedgewick's book "Algorithms". Notice that the function kmp can be partially applied to a pattern, therefore computing the offset table only once. The complexity is N+M where N is the size of the text and M the size of the pattern. (The following program was proved correct in the Coq Proof Assistant; without the Not_found exception, however) Hope this helps, -- Jean-Christophe FILLIATRE mailto:Jean-Christophe.Filliatre@lri.fr http://www.lri.fr/~filliatr ====================================================================== let initnext p = let m = String.length p in let next = Array.create m 0 in let i = ref 1 and j = ref 0 in while !i < m-1 do if p.[!i] = p.[!j] then begin incr i; incr j; next.(!i) <- !j end else if !j = 0 then begin incr i; next.(!i) <- 0 end else j := next.(!j) done; next let kmp p = let next = initnext p in fun a -> let n = String.length a and m = String.length p and i = ref 0 and j = ref 0 in while !j < m & !i < n do if a.[!i] = p.[!j] then begin incr i; incr j end else if !j = 0 then incr i else j := next.(!j) done; if !j >= m then !i-m else raise Not_found ====================================================================== ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-12 12:28 ` Jean-Christophe Filliatre @ 2000-12-13 10:02 ` eijiro_sumii 2000-12-13 10:17 ` Eijiro Sumii 2000-12-13 10:53 ` Julian Assange 0 siblings, 2 replies; 23+ messages in thread From: eijiro_sumii @ 2000-12-13 10:02 UTC (permalink / raw) To: caml-list; +Cc: Jean-Christophe.Filliatre, proff, jmp, sumii Hi everyone, Thank you very much for a lot of kind advice! Jean-Christophe Filliatre <Jean-Christophe.Filliatre@lri.fr> wrote: > I missed the beginning of the discussion, but implementing > Knuth-Morris-Pratt is quite easy. The code is given below (26 lines), Actually, I found the (almost) same code on the web and tried it. The results (again, execution time in seconds) were: SPARC Pentium strstr 52.68 57.050 kmp 111.52 143.490 The SPARC machine is the same as before. The Pentium machine is running Linux 2.2.17 inside VMWare 2.0.3 (with 48MB virtual main memory) on Windows 2000 with a Pentium II 400MHz processor and 128MB main memory. Since my program calls the function more than 3000 times for _each_ pattern, I did partial application, of course. Even so, kmp in OCaml still performed much worse than strstr in libc (both Sun's and GNU's), at least in this program. If you'd like to check it yourself, the source code is available at the same URL as before (http://www.yl.is.s.u-tokyo.ac.jp/~sumii/tmp/hc.tar.gz). Julian Assange <proff@iq.org> wrote: > I'm interested in this library. Can you tell me a little more? What's > your ETA? It is explained a little at: http://www.hypothesiscreator.net/ It has been developed in C++, but we began to port it to OCaml for clarity and convenience (and fun:-). Julian Assange <proff@iq.org> wrote: > If you are really calling strstr 400,000,000 times, I strongly suggest > the use of Boyer-Moore. Is your alphabet amino-acids or base-pairs? If Can Boyer-Moore (in OCaml) be _much_ faster than KMP, so that it beats strstr in libc? John Prevost <jmp@arsdigita.com> wrote: > I just grabbed the glibc version, and found the following comment at > the head of it: That was exactly what I found too (and mentioned a bit in my previous message). Maybe I should also disassemble and check Sun's strstr...? Regards, Eijiro ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-13 10:02 ` eijiro_sumii @ 2000-12-13 10:17 ` Eijiro Sumii 2000-12-13 10:53 ` Julian Assange 1 sibling, 0 replies; 23+ messages in thread From: Eijiro Sumii @ 2000-12-13 10:17 UTC (permalink / raw) To: caml-list; +Cc: Jean-Christophe.Filliatre, proff, jmp, sumii > Actually, I found the (almost) same code on the web and tried it. The > results (again, execution time in seconds) were: > > SPARC Pentium > strstr 52.68 57.050 > kmp 111.52 143.490 Oops, I forgot to give "-unsafe" to ocamlopt in kmp. With -unsafe, the results have improved (but still worse than strstr): SPARC Pentium kmp 73.50 79.940 Eijiro ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-13 10:02 ` eijiro_sumii 2000-12-13 10:17 ` Eijiro Sumii @ 2000-12-13 10:53 ` Julian Assange 2000-12-13 13:28 ` Eijiro Sumii 1 sibling, 1 reply; 23+ messages in thread From: Julian Assange @ 2000-12-13 10:53 UTC (permalink / raw) To: eijiro_sumii; +Cc: caml-list, Jean-Christophe.Filliatre, jmp, sumii, proff eijiro_sumii@anet.ne.jp writes: > Hi everyone, > > Thank you very much for a lot of kind advice! > > Jean-Christophe Filliatre <Jean-Christophe.Filliatre@lri.fr> wrote: > > I missed the beginning of the discussion, but implementing > > Knuth-Morris-Pratt is quite easy. The code is given below (26 lines), > > Actually, I found the (almost) same code on the web and tried it. The > results (again, execution time in seconds) were: > > SPARC Pentium > strstr 52.68 57.050 > kmp 111.52 143.490 All of these more sophisticated methods are only going to be a win when you are searching for larger strings. strstr is brutally simple, has little setup/cleanup time, and hence is hard to beat for short strings. Additionally gcc, will replace various s* calls with in-line assembly versions. I'm not sure if this includes strstr (it is a little more complicated than most). Cheers, Julian. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-13 10:53 ` Julian Assange @ 2000-12-13 13:28 ` Eijiro Sumii 0 siblings, 0 replies; 23+ messages in thread From: Eijiro Sumii @ 2000-12-13 13:28 UTC (permalink / raw) To: proff; +Cc: caml-list, sumii Julian Assange <proff@iq.org> wrote: > All of these more sophisticated methods are only going to be a win > when you are searching for larger strings. strstr is brutally simple, > has little setup/cleanup time, and hence is hard to beat for short > strings. That is completely correct, and the substrings in my application are indeed short (3 characters), but: eijiro_sumii@anet.ne.jp wrote: | Since my program calls the function more than 3000 times for _each_ | pattern, I did partial application, of course. Even so, kmp in OCaml | still performed much worse than strstr in libc (both Sun's and GNU's), | at least in this program. I expected that 3000 strings for 1 substring, each of which are about 20 characters long as I mentioned before, are enough for paying the "setup/cleanup" overhead. (On the other hand, strstr in libc doesn't benefit from the partial application, of course.) Eijiro ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-11 21:07 ` Chris Hecker 2000-12-11 22:22 ` Gerd Stolpmann @ 2000-12-12 3:28 ` eijiro_sumii 2000-12-13 1:12 ` John Prevost 2000-12-12 10:07 ` Sven LUTHER 2000-12-14 3:36 ` eijiro_sumii 3 siblings, 1 reply; 23+ messages in thread From: eijiro_sumii @ 2000-12-12 3:28 UTC (permalink / raw) To: checker; +Cc: caml-list, gerd, sumii > Any ideas why strstr blows the others away? What's the libc strstr > look like? I have no idea, unfortunately... > I just looked in the MSVC source and it's a braindead while loop > (copied below), so it's not like it's doing a fancy Boyer-Moore or > anything. I don't know anything about strstr in Sun's strstr, but I checked strstr in GNU libc. It is a quite complicated program, but look like a brute-force algorithm (that is, no Knuth-Morris-Pratt or anything like that). > This is exactly the kind of problem on which I'd expect caml to come > within 10% of c. That was what I expected, too. > I'd say I'd do the tests myself, but I don't have a bunch of gene > sequences laying around. :) I've put the current version of my application program at: http://www.yl.is.s.u-tokyo.ac.jp/~sumii/tmp/hc.tar.gz If you like, you're welcome to check it yourself, of course.:) You can run it as "make ; time ./hc". The output should look like: score = 0.348672 score = 0.391356 (snip) score = 0.630415 The OCaml function "strstr" is in the file "strstr.ml". > Okay, I'm curious, so I'll port the code to caml and include it > below as well (as practice for myself). Can you try it in your test > harness? Sure, but please give me a little time. This is a kind of part-time, weekend job for me, and I can't tell when I have time to do it next... Eijiro ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-12 3:28 ` eijiro_sumii @ 2000-12-13 1:12 ` John Prevost 2000-12-13 2:35 ` Chris Hecker 0 siblings, 1 reply; 23+ messages in thread From: John Prevost @ 2000-12-13 1:12 UTC (permalink / raw) To: eijiro_sumii; +Cc: checker, caml-list, gerd, sumii >>>>> "es" == eijiro sumii <eijiro_sumii@anet.ne.jp> writes: >> Any ideas why strstr blows the others away? What's the libc >> strstr look like? es> I have no idea, unfortunately... I just grabbed the glibc version, and found the following comment at the head of it: /* * My personal strstr() implementation that beats most other algorithms. * Until someone tells me otherwise, I assume that this is the * fastest implementation of strstr() in C. * I deliberately chose not to comment it. You should have at least * as much fun trying to understand it, as I had to write it :-). * * Stephen R. van den Berg, berg@pool.informatik.rwth-aachen.de */ Since the text of the function involves not only nasty gotos (particularly frightening is "goto shloop;" but also pointer tricks and register declarations, I'd say it's to be expected that it'll outperform O'Caml. I am, however, seeing if I can figure out the essential algorithm used (if there's really anything smart to it) and will let you know if I come up with anything extra special. The file strstr.c is included below for your edification. I'd argue that any Caml version is likely to a be a wee bit easier to understand. John. /* Return the offset of one string within another. Copyright (C) 1994, 1996, 1997 Free Software Foundation, Inc. This file is part of the GNU C Library. The GNU C Library is free software; you can redistribute it and/or modify it under the terms of the GNU Library General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. The GNU C Library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public License for more details. You should have received a copy of the GNU Library General Public License along with the GNU C Library; see the file COPYING.LIB. If not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. */ /* * My personal strstr() implementation that beats most other algorithms. * Until someone tells me otherwise, I assume that this is the * fastest implementation of strstr() in C. * I deliberately chose not to comment it. You should have at least * as much fun trying to understand it, as I had to write it :-). * * Stephen R. van den Berg, berg@pool.informatik.rwth-aachen.de */ #if HAVE_CONFIG_H # include <config.h> #endif #if defined _LIBC || defined HAVE_STRING_H # include <string.h> #endif typedef unsigned chartype; #undef strstr char * strstr (phaystack, pneedle) const char *phaystack; const char *pneedle; { register const unsigned char *haystack, *needle; register chartype b, c; haystack = (const unsigned char *) phaystack; needle = (const unsigned char *) pneedle; b = *needle; if (b != '\0') { haystack--; /* possible ANSI violation */ do { c = *++haystack; if (c == '\0') goto ret0; } while (c != b); c = *++needle; if (c == '\0') goto foundneedle; ++needle; goto jin; for (;;) { register chartype a; register const unsigned char *rhaystack, *rneedle; do { a = *++haystack; if (a == '\0') goto ret0; if (a == b) break; a = *++haystack; if (a == '\0') goto ret0; shloop: } while (a != b); jin: a = *++haystack; if (a == '\0') goto ret0; if (a != c) goto shloop; rhaystack = haystack-- + 1; rneedle = needle; a = *rneedle; if (*rhaystack == a) do { if (a == '\0') goto foundneedle; ++rhaystack; a = *++needle; if (*rhaystack != a) break; if (a == '\0') goto foundneedle; ++rhaystack; a = *++needle; } while (*rhaystack == a); needle = rneedle; /* took the register-poor approach */ if (a == '\0') break; } } foundneedle: return (char*) haystack; ret0: return 0; } ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-13 1:12 ` John Prevost @ 2000-12-13 2:35 ` Chris Hecker 0 siblings, 0 replies; 23+ messages in thread From: Chris Hecker @ 2000-12-13 2:35 UTC (permalink / raw) To: John Prevost, eijiro_sumii; +Cc: caml-list, gerd, sumii >The file strstr.c is included below for your edification. I'd argue >that any Caml version is likely to a be a wee bit easier to >understand. Okay, so here's a new challenge: write the most readable strstr you can in ocaml. The restriction is you have to actually implement the algorithm, not just call the regex library. :) What is the most functional and readable one we can come up with? And then we'll compare its performance and see if the readable/performance curve is appropriate. Eijiro's was relatively short and functional, but still not completely transparent. My imperative one sucks for readability because of all the references and variables. Chris ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-11 21:07 ` Chris Hecker 2000-12-11 22:22 ` Gerd Stolpmann 2000-12-12 3:28 ` eijiro_sumii @ 2000-12-12 10:07 ` Sven LUTHER 2000-12-14 3:36 ` eijiro_sumii 3 siblings, 0 replies; 23+ messages in thread From: Sven LUTHER @ 2000-12-12 10:07 UTC (permalink / raw) To: Chris Hecker; +Cc: eijiro_sumii, caml-list, gerd, sumii On Mon, Dec 11, 2000 at 01:07:22PM -0800, Chris Hecker wrote: > > >The results (execution time in seconds) were as follows. > > strstr 55.74 > > regexp 154.37 > > OCamlA 302.57 > > OCamlB 129.23 > > Any ideas why strstr blows the others away? What's the libc strstr look like? I just looked in the MSVC source and it's a braindead while loop (copied below), so it's not like it's doing a fancy Boyer-Moore or anything. This is exactly the kind of problem on which I'd expect caml to come within 10% of c. Is this kind of thing not one of the field which can take advantage of the vector units included in the processor, like mmx, altivec and the equivalent in alpha or sparc processors ? Not sure though, it would depend on the actual implementation used and command set. Friendly, Sven Luther ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-11 21:07 ` Chris Hecker ` (2 preceding siblings ...) 2000-12-12 10:07 ` Sven LUTHER @ 2000-12-14 3:36 ` eijiro_sumii 2000-12-14 6:48 ` Chris Hecker 3 siblings, 1 reply; 23+ messages in thread From: eijiro_sumii @ 2000-12-14 3:36 UTC (permalink / raw) To: checker; +Cc: caml-list, gerd, sumii > Okay, I'm curious, so I'll port the code to caml and include it > below as well (as practice for myself). Can you try it in your test > harness? I tried the optimized version of the imperative strstr, along with an optimized version of the functional strstr (attached at the end). The results (on the SPARC machine) were: strstr_imp2 90.57 strstr_fun2 89.64 C strstr 53.76 So the imperative version and the functional version were comparable, though both were slower than the C version. My guess is that they actually compile into rather similar code, because all recursive calls are in tail positions. Anyway, for my application, the libc version seems the most reasonable choice, as far as I use a strstr-like function at all... // Eijiro Sumii (http://www.yl.is.s.u-tokyo.ac.jp/~sumii/) // // Ph.D. Student at Department of IS, University of Tokyo // Visiting Scholar at Department of CIS, University of Pennsylvania let strstr_fun2 pat str = let patlim = String.length pat - 1 in let strlim = String.length str - 1 in let rec sub patpos strpos = (* compare pat[0..patpos] and str[(strpos-patpos)..strpos] *) if patpos < 0 then true else if pat.[patpos] <> str.[strpos] then false else sub (patpos - 1) (strpos - 1) in let rec search strpos = (* compare pat[0..patlim] and str[(strpos-patlim)..strpos] *) if strpos > strlim then raise Not_found else if sub patlim strpos then strpos - patlim else search (strpos + 1) in search patlim ;; ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-14 3:36 ` eijiro_sumii @ 2000-12-14 6:48 ` Chris Hecker 2000-12-14 8:02 ` eijiro_sumii 0 siblings, 1 reply; 23+ messages in thread From: Chris Hecker @ 2000-12-14 6:48 UTC (permalink / raw) To: eijiro_sumii; +Cc: caml-list, gerd, sumii > strstr_imp2 90.57 > strstr_fun2 89.64 > C strstr 53.76 Is that C the simple C one I posted, or the libc one on sparc? Here are my timings with yours added: strstr_fun total = 1.135000 (5,0,19,13) strstr_fun2 total = 1.470000 (5,0,19,13) strstr_imp total = 1.030000 (5,0,19,13) strstr_imp2 total = 0.965000 (5,0,19,13) strstr_c total = 0.790000 (5,0,19,13) strstr_libc total = 0.370000 (5,0,19,13) Your new strstr is actually slower on x86 than your first one, but I haven't looked into why yet. It may also be data (since my data is obviously lame garbage). I should try Alpha Linux... Chris ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-14 6:48 ` Chris Hecker @ 2000-12-14 8:02 ` eijiro_sumii 2000-12-14 21:53 ` Stephan Tolksdorf 0 siblings, 1 reply; 23+ messages in thread From: eijiro_sumii @ 2000-12-14 8:02 UTC (permalink / raw) To: checker; +Cc: caml-list, gerd, sumii > Is that C the simple C one I posted, or the libc one on sparc? It's the one in libc on Solaris, so may be tuned by hand in assembly. > Your new strstr is actually slower on x86 than your first one, but I > haven't looked into why yet. That's interesting. I confirmed it myself on my x86 machine with my program: strstr_imp2 88.530 strstr_fun 124.210 strstr_fun2 127.120 glibc strstr 61.310 Seeing such strong machine dependency, it seems yet more reasonable to use the "default" strstr provided by the system (unless the _pattern_ is long enough, in which case KMP or BM might perform better), even though there is some overhead --- about 5% in my profiling on SPARC --- of interfacing a C function to OCaml. This is a natural (though boring) conclusion, I think. Eijiro ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" 2000-12-14 8:02 ` eijiro_sumii @ 2000-12-14 21:53 ` Stephan Tolksdorf 0 siblings, 0 replies; 23+ messages in thread From: Stephan Tolksdorf @ 2000-12-14 21:53 UTC (permalink / raw) To: caml-list Hello, some time ago I've been interested in highly optimized text search algorithms and wrote some in assembler. A very good analysis of text search algorithms can be found in the 8/97 issue of the German magazin C't (www.heise.de/ct) - "Blitzfindig" by Michael Tamm. This article compares the algoritms "Brute Force" "Boyer-Moore", "Quicksearch" and the advanced "T-Search" in detail. The C source can be found on: ftp://ftp.heise.de/pub/ct/listings/ct9708.zip Have fun... MfG, Stephan Tolksdorf ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr" @ 2000-12-14 21:12 Ruchira Datta 0 siblings, 0 replies; 23+ messages in thread From: Ruchira Datta @ 2000-12-14 21:12 UTC (permalink / raw) To: caml-list John Prevost wrote: >I just grabbed the glibc version, and found the following comment at >the head of it: >/* > * My personal strstr() implementation that beats most other algorithms. > * Until someone tells me otherwise, I assume that this is the > * fastest implementation of strstr() in C. > * I deliberately chose not to comment it. You should have at least > * as much fun trying to understand it, as I had to write it :-). > * > * Stephen R. van den Berg, berg@pool.informatik.rwth-aachen.de */ >Since the text of the function involves not only nasty gotos >(particularly frightening is "goto shloop;" but also pointer tricks >and register declarations, I'd say it's to be expected that it'll >outperform O'Caml. I am, however, seeing if I can figure out the >essential algorithm used (if there's really anything smart to it) and >will let you know if I come up with anything extra special. The frightening tricks are shocking to us functional programmers :-) but once you get over your shock and read through it, you won't have any problem understanding it. At each point, don't ask yourself ``why is this statement written in this obfuscated way?'' Just assume as an article of faith that this incantation is invoked to propitiate the Almighty Assembler :-). Instead ask yourself ``what does this statement do?'' That is fairly simple... There is absolutely nothing smart about the algorithm; it is exactly the brute-force one. For example, the first character of the needle is b, and the second character is c. Suppose we have matched b and c and then find that we don't match the whole needle. Then we back up our haystack pointer all the way back to the character after the last potential match, i.e., to the character that matched c, and start from the beginning again by checking if it matches b. We do this every time regardless of whether b and c are different. I guess this is to save one register variable. The speed comes from the low-level assembly language optimization; C is being used as portable assembler. I am sure it took a lot of ingenuity and very intimate knowledge of the gcc compiler to do such a low-level optimization while remaining in portable C. But it is not something for OCaml users to emulate. Ruchira Datta datta@math.berkeley.edu ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2000-12-15 12:53 UTC | newest] Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2000-12-08 6:04 substring match like "strstr" eijiro_sumii 2000-12-08 12:57 ` Gerd Stolpmann 2000-12-10 13:16 ` eijiro_sumii 2000-12-10 15:39 ` Gerd Stolpmann 2000-12-11 3:57 ` eijiro_sumii 2000-12-12 13:58 ` Julian Assange 2000-12-11 21:07 ` Chris Hecker 2000-12-11 22:22 ` Gerd Stolpmann 2000-12-12 5:06 ` Chris Hecker 2000-12-12 12:28 ` Jean-Christophe Filliatre 2000-12-13 10:02 ` eijiro_sumii 2000-12-13 10:17 ` Eijiro Sumii 2000-12-13 10:53 ` Julian Assange 2000-12-13 13:28 ` Eijiro Sumii 2000-12-12 3:28 ` eijiro_sumii 2000-12-13 1:12 ` John Prevost 2000-12-13 2:35 ` Chris Hecker 2000-12-12 10:07 ` Sven LUTHER 2000-12-14 3:36 ` eijiro_sumii 2000-12-14 6:48 ` Chris Hecker 2000-12-14 8:02 ` eijiro_sumii 2000-12-14 21:53 ` Stephan Tolksdorf 2000-12-14 21:12 Ruchira Datta
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox