* substring match like "strstr"
@ 2000-12-08 6:04 eijiro_sumii
2000-12-08 12:57 ` Gerd Stolpmann
0 siblings, 1 reply; 23+ messages in thread
From: eijiro_sumii @ 2000-12-08 6:04 UTC (permalink / raw)
To: caml-list; +Cc: sumii
Hi all,
Is there a substring matching function (like "strstr" in libc) in the
standard OCaml library? Of course there are many ways to implement it
(by writing it from scratch, using the Str library, interfacing
"strstr" in libc, etc.), but they are overkill for my purpose. So I'm
wondering whether such a function already exists, but I couldn't find
it in the manual...
Eijiro
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-08 6:04 substring match like "strstr" eijiro_sumii
@ 2000-12-08 12:57 ` Gerd Stolpmann
2000-12-10 13:16 ` eijiro_sumii
0 siblings, 1 reply; 23+ messages in thread
From: Gerd Stolpmann @ 2000-12-08 12:57 UTC (permalink / raw)
To: eijiro_sumii, caml-list; +Cc: sumii
On Fri, 08 Dec 2000, eijiro_sumii@anet.ne.jp wrote:
>Hi all,
>
>Is there a substring matching function (like "strstr" in libc) in the
>standard OCaml library? Of course there are many ways to implement it
>(by writing it from scratch, using the Str library, interfacing
>"strstr" in libc, etc.), but they are overkill for my purpose. So I'm
>wondering whether such a function already exists, but I couldn't find
>it in the manual...
There isn't such a function.
You can write it yourself:
let find_substring s1 s2 =
(* find s2 in s1, or raise Not_found *)
if String.length s2 > String.length s1 then raise Not_found;
let b = String.create (String.length s2) in
let rec search k =
if k > String.length s1 - String.length s2 then raise Not_found;
String.blit ~src:s1 ~src_pos:k ~dst:b ~dst_pos:0 ~len:(String.length s2);
if b = s2 then
k
else search (k+1)
in
search 0
;;
This version of find_substring avoids superflous heap allocations, and I think
it is tricky anough to be included into the stdlib. (However, if we had strstr
as primitive, even the allocation of b could be avoided.)
If speed is important, I recommend
let find_substring s1 s2 =
Str.search_forward (Str.regexp (Str.quote s2)) s1 0
;;
I hope these solutions aren't overkilling.
Gerd
--
----------------------------------------------------------------------------
Gerd Stolpmann Telefon: +49 6151 997705 (privat)
Viktoriastr. 100
64293 Darmstadt EMail: gerd@gerd-stolpmann.de
Germany
----------------------------------------------------------------------------
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-08 12:57 ` Gerd Stolpmann
@ 2000-12-10 13:16 ` eijiro_sumii
2000-12-10 15:39 ` Gerd Stolpmann
2000-12-11 21:07 ` Chris Hecker
0 siblings, 2 replies; 23+ messages in thread
From: eijiro_sumii @ 2000-12-10 13:16 UTC (permalink / raw)
To: caml-list; +Cc: gerd, sumii
> You can write it yourself:
...
> If speed is important, I recommend
>
> let find_substring s1 s2 =
> Str.search_forward (Str.regexp (Str.quote s2)) s1 0
> ;;
OK, I benchmarked the following 4 implementations for the purpose of
comparison.
strstr
stub to call "strstr" in libc
regexp
combination of "Str.regexp_string" and "Str.search_forward"
OCamlA
the simple implementation in OCaml given in the previous message,
a little tuned for more efficiency
OCamlB
another straightforward implementation in OCaml, attached at the
end of this message
The results (execution time in seconds) were as follows.
strstr 55.74
regexp 154.37
OCamlA 302.57
OCamlB 129.23
The application is a kind of library for data mining of gene
sequences. It calls the substring matching function more than
400,000,000 times. The machine is a Sun UltraEnterprise 450 with 4
UltraSPARC II (480 MHz) processors and 2GB main memory, but the
program itself doesn't make any use of multiple processors or much
space.
Although I haven't checked the reasons for these results in detail,
this might hopefully be of some information for other people, too.
And probably I should also consider implementing more sophisticated,
efficient algorithm in OCaml...
// Eijiro Sumii (http://www.yl.is.s.u-tokyo.ac.jp/~sumii/)
//
// Ph.D. Student at Department of IS, University of Tokyo
// Visiting Scholar at Department of CIS, University of Pennsylvania
----------------------------------------------------------------------
let strstr pat str =
let patlen = String.length pat in
let strlen = String.length str in
let rec is_prefix pos spos epos =
if pos < 0 then true else
if String.unsafe_get pat pos <>
String.unsafe_get str epos then false else
is_prefix (pos - 1) spos (epos - 1) in
let rec search spos =
let epos = spos + patlen - 1 in
if epos >= strlen then raise Not_found else
if is_prefix (patlen - 1) spos epos then spos else
search (spos + 1) in
search 0
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-10 13:16 ` eijiro_sumii
@ 2000-12-10 15:39 ` Gerd Stolpmann
2000-12-11 3:57 ` eijiro_sumii
2000-12-12 13:58 ` Julian Assange
2000-12-11 21:07 ` Chris Hecker
1 sibling, 2 replies; 23+ messages in thread
From: Gerd Stolpmann @ 2000-12-10 15:39 UTC (permalink / raw)
To: eijiro_sumii, caml-list; +Cc: gerd, sumii
On Sun, 10 Dec 2000, eijiro_sumii@anet.ne.jp wrote:
>OK, I benchmarked the following 4 implementations for the purpose of
>comparison.
>
> strstr
> stub to call "strstr" in libc
>
> regexp
> combination of "Str.regexp_string" and "Str.search_forward"
>
> OCamlA
> the simple implementation in OCaml given in the previous message,
> a little tuned for more efficiency
>
> OCamlB
> another straightforward implementation in OCaml, attached at the
> end of this message
>
>The results (execution time in seconds) were as follows.
>
> strstr 55.74
> regexp 154.37
> OCamlA 302.57
> OCamlB 129.23
I did not expect that OCamlB performs so well; so I suggested OcamlA and
regexp (which are both fast for the bytecode interpreter, too). I think it
depends also on the problem size (especially on the length of the substring).
Gerd
--
----------------------------------------------------------------------------
Gerd Stolpmann Telefon: +49 6151 997705 (privat)
Viktoriastr. 100
64293 Darmstadt EMail: gerd@gerd-stolpmann.de
Germany
----------------------------------------------------------------------------
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-10 15:39 ` Gerd Stolpmann
@ 2000-12-11 3:57 ` eijiro_sumii
2000-12-12 13:58 ` Julian Assange
1 sibling, 0 replies; 23+ messages in thread
From: eijiro_sumii @ 2000-12-11 3:57 UTC (permalink / raw)
To: caml-list; +Cc: gerd, sumii
> I did not expect that OCamlB performs so well; so I suggested OcamlA and
> regexp (which are both fast for the bytecode interpreter, too). I think it
> depends also on the problem size (especially on the length of the substring).
Right, I forgot to mention those -- in the benchmark, the "needles"
(matched substrings) were 3 characters long and the "haystacks"
(searched strings) were about 20 characters long; I used the native
code compiler. For much larger input, regexp could possibly do
better. Anyway, thank you for the helpful suggestions.
// Eijiro Sumii (http://www.yl.is.s.u-tokyo.ac.jp/~sumii/)
//
// Ph.D. Student at Department of IS, University of Tokyo
// Visiting Scholar at Department of CIS, University of Pennsylvania
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-10 13:16 ` eijiro_sumii
2000-12-10 15:39 ` Gerd Stolpmann
@ 2000-12-11 21:07 ` Chris Hecker
2000-12-11 22:22 ` Gerd Stolpmann
` (3 more replies)
1 sibling, 4 replies; 23+ messages in thread
From: Chris Hecker @ 2000-12-11 21:07 UTC (permalink / raw)
To: eijiro_sumii, caml-list; +Cc: gerd, sumii
>The results (execution time in seconds) were as follows.
> strstr 55.74
> regexp 154.37
> OCamlA 302.57
> OCamlB 129.23
Any ideas why strstr blows the others away? What's the libc strstr look like? I just looked in the MSVC source and it's a braindead while loop (copied below), so it's not like it's doing a fancy Boyer-Moore or anything. This is exactly the kind of problem on which I'd expect caml to come within 10% of c.
Have you tried a purely imperative version? I'd say I'd do the tests myself, but I don't have a bunch of gene sequences laying around. :)
Okay, I'm curious, so I'll port the code to caml and include it below as well (as practice for myself). Can you try it in your test harness? I've attached a cheesy test app with your caml, calling msvc's strstr, and my imperative version. I make no claims of profiling accuracy or even code correctness, but here are the results when run from a DOS box on Win98:
strstr total = 1.060000 (5,0,19,13)
strstr_imp total = 0.510000 (5,0,19,13)
strstr_c total = 0.415000 (5,0,19,13)
(The weird thing is when I ran it with shell-command under emacs the timings were
strstr total = 0.575000 (5,0,19,13)
strstr_imp total = 0.500000 (5,0,19,13)
strstr_c total = 0.415000 (5,0,19,13)
which I found bizarre since strstr became 2x faster, but I haven't had time to look into it. Can somebody else try these tests? )
Looking at the asm output from strstr_imp, there's not that much you could do to make it better. Maybe there are a few peephole things, and there are probably 30% more instructions than you need in this specific case because the code to compare characters goes to the trouble of keeping the ints in the caml format even in registers inside the loop (so they're shifted back and forth), and it converts the chars from the unsafe_gets to full caml ints, which is useless since they just get compared to each other and then dumped. It might be interesting to write a peephole optimizer aimed at peepholing camlopt code, and looking for things like this.
Can anybody optimize the caml src for strstr_imp?
Chris
----strstrc.c-----
#include "/test/ocaml/lib/caml/mlvalues.h"
/* ripped from msvc crt src */
char * __cdecl strstr (
const char * str1,
const char * str2
)
{
char *cp = (char *) str1;
char *s1, *s2;
if ( !*str2 )
return((char *)str1);
while (*cp)
{
s1 = cp;
s2 = (char *) str2;
while ( *s1 && *s2 && !(*s1-*s2) )
s1++, s2++;
if (!*s2)
return(cp);
cp++;
}
return(0);
}
value Camlstrstr( value pat, value str )
{
char *string = String_val(str);
return Val_int(strstr(string,String_val(pat))-string);
}
------strstrc.mli-----
external strstr_c: string -> string -> int = "Camlstrstr"
------strstr.ml-------
(* Eijiro Sumii's functional strstr *)
let strstr pat str =
let patlen = String.length pat in
let strlen = String.length str in
let rec is_prefix pos spos epos =
if pos < 0 then true else
if String.unsafe_get pat pos <>
String.unsafe_get str epos then false else
is_prefix (pos - 1) spos (epos - 1) in
let rec search spos =
let epos = spos + patlen - 1 in
if epos >= strlen then raise Not_found else
if is_prefix (patlen - 1) spos epos then spos else
search (spos + 1) in
search 0
(* checker's lame imperative strstr *)
let strstr_imp pat str =
let strlen = String.length str in
let patlen = String.length pat and
i = ref 0 and
f = ref strlen in
while !i < strlen do
let pati = ref 0 and
j = ref !i in
while !pati < patlen && !j < strlen &&
String.unsafe_get str !j == String.unsafe_get pat !pati do
pati := succ !pati;
j := succ !j;
done;
if !pati = patlen then
begin f := !i; i := strlen end
else
i := succ !i
done;
!f
(* a really questionable timing harness *)
let time_strstr name strstr =
let t0 = Unix.gettimeofday () in
for i = 0 to 100000 do
strstr "this" "that this the other";
strstr "this" "this that this the other";
strstr "this" "that tis the other this";
strstr "this" "that the othethisr th";
done;
let t = Unix.gettimeofday () -. t0 in
Printf.printf "%s total = %f (%d,%d,%d,%d)\n" name t
(* a really pathetic regression test *)
(strstr "this" "that this the other")
(strstr "this" "this that this the other")
(strstr "this" "that tis the other this")
(strstr "this" "that the othethisr th")
(* main *)
let _ =
time_strstr "strstr" strstr;
time_strstr "strstr_imp" strstr_imp;
time_strstr "strstr_c" Strstrc.strstr_c;
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-11 21:07 ` Chris Hecker
@ 2000-12-11 22:22 ` Gerd Stolpmann
2000-12-12 5:06 ` Chris Hecker
2000-12-12 3:28 ` eijiro_sumii
` (2 subsequent siblings)
3 siblings, 1 reply; 23+ messages in thread
From: Gerd Stolpmann @ 2000-12-11 22:22 UTC (permalink / raw)
To: Chris Hecker, eijiro_sumii, caml-list; +Cc: gerd, sumii
On Mon, 11 Dec 2000, Chris Hecker wrote:
>Any ideas why strstr blows the others away?
It's the type of brain-dead loop for which C compilers are written.
>Have you tried a purely imperative version? I'd say I'd do the tests myself,
>but I don't have a bunch of gene sequences laying around. :)
It's not always the imperative version which is better. The difference in the
generated code of a functional and an imperative version is often minimal.
>Looking at the asm output from strstr_imp, there's not that much you could do
>to make it better. Maybe there are a few peephole things, and there are
>probably 30% more instructions than you need in this specific case because the
>code to compare characters goes to the trouble of keeping the ints in the caml
>format even in registers inside the loop (so they're shifted back and forth),
>and it converts the chars from the unsafe_gets to full caml ints, which is
>useless since they just get compared to each other and then dumped. It might
>be interesting to write a peephole optimizer aimed at peepholing camlopt code,
>and looking for things like this.
>
>Can anybody optimize the caml src for strstr_imp?
let strstr_imp2 pat str =
let strlen = String.length str in
let patlen = String.length pat and
i = ref 0 and
f = ref strlen in
let i_limit = strlen - patlen in
while !i < i_limit do
let pati = ref 0 and
j = ref !i in
while !pati < patlen &&
String.unsafe_get str !j == String.unsafe_get pat !pati do
pati := succ !pati;
j := succ !j;
done;
if !pati = patlen then
begin f := !i; i := strlen end
else
i := succ !i
done;
!f
;;
Perhaps !pati < patlen can also be removed by introducing a "guard" (modifying
the last character of pat such that the loop runs always into an inequality).
However, this would require extra code managing the temporary modification of
pat, and I guess it is not worth-while for short pats.
Gerd
--
----------------------------------------------------------------------------
Gerd Stolpmann Telefon: +49 6151 997705 (privat)
Viktoriastr. 100
64293 Darmstadt EMail: gerd@gerd-stolpmann.de
Germany
----------------------------------------------------------------------------
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-11 21:07 ` Chris Hecker
2000-12-11 22:22 ` Gerd Stolpmann
@ 2000-12-12 3:28 ` eijiro_sumii
2000-12-13 1:12 ` John Prevost
2000-12-12 10:07 ` Sven LUTHER
2000-12-14 3:36 ` eijiro_sumii
3 siblings, 1 reply; 23+ messages in thread
From: eijiro_sumii @ 2000-12-12 3:28 UTC (permalink / raw)
To: checker; +Cc: caml-list, gerd, sumii
> Any ideas why strstr blows the others away? What's the libc strstr
> look like?
I have no idea, unfortunately...
> I just looked in the MSVC source and it's a braindead while loop
> (copied below), so it's not like it's doing a fancy Boyer-Moore or
> anything.
I don't know anything about strstr in Sun's strstr, but I checked
strstr in GNU libc. It is a quite complicated program, but look like
a brute-force algorithm (that is, no Knuth-Morris-Pratt or anything
like that).
> This is exactly the kind of problem on which I'd expect caml to come
> within 10% of c.
That was what I expected, too.
> I'd say I'd do the tests myself, but I don't have a bunch of gene
> sequences laying around. :)
I've put the current version of my application program at:
http://www.yl.is.s.u-tokyo.ac.jp/~sumii/tmp/hc.tar.gz
If you like, you're welcome to check it yourself, of course.:) You can
run it as "make ; time ./hc". The output should look like:
score = 0.348672
score = 0.391356
(snip)
score = 0.630415
The OCaml function "strstr" is in the file "strstr.ml".
> Okay, I'm curious, so I'll port the code to caml and include it
> below as well (as practice for myself). Can you try it in your test
> harness?
Sure, but please give me a little time. This is a kind of part-time,
weekend job for me, and I can't tell when I have time to do it next...
Eijiro
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-11 22:22 ` Gerd Stolpmann
@ 2000-12-12 5:06 ` Chris Hecker
2000-12-12 12:28 ` Jean-Christophe Filliatre
0 siblings, 1 reply; 23+ messages in thread
From: Chris Hecker @ 2000-12-12 5:06 UTC (permalink / raw)
To: gerd, eijiro_sumii, caml-list; +Cc: gerd, sumii
> let i_limit = strlen - patlen in
Ah, duh. Good idea.
Here are some new timings, including the _imp2 (I fixed a minor fencepost bug in yours in an admittedly cheesy way; the code's below):
Under Win2k cmd.exe:
strstr total = 0.610000 (5,0,19,13)
strstr_imp total = 0.593000 (5,0,19,13)
strstr_imp2 total = 0.563000 (5,0,19,13)
strstr_c total = 0.469000 (5,0,19,13)
Under Win98 command.com:
strstr total = 1.460000 (5,0,19,13)
strstr_imp total = 1.020000 (5,0,19,13)
strstr_imp2 total = 0.955000 (5,0,19,13)
strstr_c total = 0.800000 (5,0,19,13)
Under Win98 emacs shell-command:
strstr total = 1.125000 (5,0,19,13)
strstr_imp total = 1.025000 (5,0,19,13)
strstr_imp2 total = 0.960000 (5,0,19,13)
strstr_c total = 0.800000 (5,0,19,13)
I doubled the number of iterations to try to figure out the emacs/command.com difference, but to no avail.
However, then I decided to check to see if this was actually the strstr in the library, so instead of calling that c code, I called string.h's strstr. I discovered that there's actually an intel-asm version of strstr that's compiled into the crt:
strstr total = 1.130000 (5,0,19,13)
strstr_imp total = 1.020000 (5,0,19,13)
strstr_imp2 total = 0.960000 (5,0,19,13)
strstr_c total = 0.365000 (5,0,19,13)
Ouch! We're back to 3x slower. But hey, no one said ocaml could compete with asm. :)
Anyway, as it stands, the ocaml code is 17% slower than the C code at relatively comparable code-simplicity. I'd say the ocaml code is a bit harder to understand and wordier, but that's mostly because ocaml doesn't have modifyable pointers, and the C code just walks the pointers through the array. Also, the C has an early-out return that ocaml can't do (without an exception), and that makes things a little simpler as well.
Of course, I'm not saying you can extrapolate anything about real programs from this simple example, but it's interesting to me nonetheless.
Chris
Here's the updated strstr_imp2. If you really wanted to try to max out the ocaml, we could read the pattern into an int and try to match it a word at a time, etc.
(* "optimized" imperative strstr *)
let strstr_imp2 pat str =
let strlen = String.length str in
let patlen = String.length pat and
i = ref 0 and
f = ref strlen in
let limit = strlen - patlen + 1 in
while !i < limit do
let pati = ref 0 and
j = ref !i in
while !pati < patlen &&
String.unsafe_get str !j == String.unsafe_get pat !pati do
pati := succ !pati;
j := succ !j;
done;
if !pati = patlen then
begin f := !i; i := strlen end
else
i := succ !i
done;
!f
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-11 21:07 ` Chris Hecker
2000-12-11 22:22 ` Gerd Stolpmann
2000-12-12 3:28 ` eijiro_sumii
@ 2000-12-12 10:07 ` Sven LUTHER
2000-12-14 3:36 ` eijiro_sumii
3 siblings, 0 replies; 23+ messages in thread
From: Sven LUTHER @ 2000-12-12 10:07 UTC (permalink / raw)
To: Chris Hecker; +Cc: eijiro_sumii, caml-list, gerd, sumii
On Mon, Dec 11, 2000 at 01:07:22PM -0800, Chris Hecker wrote:
>
> >The results (execution time in seconds) were as follows.
> > strstr 55.74
> > regexp 154.37
> > OCamlA 302.57
> > OCamlB 129.23
>
> Any ideas why strstr blows the others away? What's the libc strstr look like? I just looked in the MSVC source and it's a braindead while loop (copied below), so it's not like it's doing a fancy Boyer-Moore or anything. This is exactly the kind of problem on which I'd expect caml to come within 10% of c.
Is this kind of thing not one of the field which can take advantage of the
vector units included in the processor, like mmx, altivec and the equivalent
in alpha or sparc processors ? Not sure though, it would depend on the actual
implementation used and command set.
Friendly,
Sven Luther
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-12 5:06 ` Chris Hecker
@ 2000-12-12 12:28 ` Jean-Christophe Filliatre
2000-12-13 10:02 ` eijiro_sumii
0 siblings, 1 reply; 23+ messages in thread
From: Jean-Christophe Filliatre @ 2000-12-12 12:28 UTC (permalink / raw)
To: Chris Hecker; +Cc: gerd, eijiro_sumii, caml-list, sumii
Hello,
I missed the beginning of the discussion, but implementing
Knuth-Morris-Pratt is quite easy. The code is given below (26 lines),
following Sedgewick's book "Algorithms". Notice that the function kmp
can be partially applied to a pattern, therefore computing the offset
table only once. The complexity is N+M where N is the size of the text
and M the size of the pattern.
(The following program was proved correct in the Coq Proof Assistant;
without the Not_found exception, however)
Hope this helps,
--
Jean-Christophe FILLIATRE
mailto:Jean-Christophe.Filliatre@lri.fr
http://www.lri.fr/~filliatr
======================================================================
let initnext p =
let m = String.length p in
let next = Array.create m 0 in
let i = ref 1 and j = ref 0 in
while !i < m-1 do
if p.[!i] = p.[!j] then begin
incr i; incr j; next.(!i) <- !j
end else
if !j = 0 then begin incr i; next.(!i) <- 0 end else j := next.(!j)
done;
next
let kmp p =
let next = initnext p in
fun a ->
let n = String.length a
and m = String.length p
and i = ref 0
and j = ref 0 in
while !j < m & !i < n do
if a.[!i] = p.[!j] then begin
incr i; incr j
end else
if !j = 0 then incr i else j := next.(!j)
done;
if !j >= m then !i-m else raise Not_found
======================================================================
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-10 15:39 ` Gerd Stolpmann
2000-12-11 3:57 ` eijiro_sumii
@ 2000-12-12 13:58 ` Julian Assange
1 sibling, 0 replies; 23+ messages in thread
From: Julian Assange @ 2000-12-12 13:58 UTC (permalink / raw)
To: gerd; +Cc: eijiro_sumii, caml-list, sumii, proff
Gerd Stolpmann <gerd@gerd-stolpmann.de> writes:
> >The results (execution time in seconds) were as follows.
> >
> > strstr 55.74
> > regexp 154.37
> > OCamlA 302.57
> > OCamlB 129.23
>
> I did not expect that OCamlB performs so well; so I suggested OcamlA and
> regexp (which are both fast for the bytecode interpreter, too). I think it
> depends also on the problem size (especially on the length of the substring).
>
> Gerd
If you are really calling strstr 400,000,000 times, I strongly suggest
the use of Boyer-Moore. Is your alphabet amino-acids or base-pairs? If
the latter, I have developed parallel hashing version of Boyer-More
(lets call it Assange-Boyer-More :), which beats the pants off
Knuth-Moris-Pratt. 500 long substrings searches are only about 50%
slower than 1, although for this application, if you have the memory,
you might also want to look at suffix trees or suffix arrays.
--
Julian Assange |If you want to build a ship, don't drum up people
|together to collect wood or assign them tasks
proff@iq.org |and work, but rather teach them to long for the endless
proff@gnu.ai.mit.edu |immensity of the sea. -- Antoine de Saint Exupery
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-12 3:28 ` eijiro_sumii
@ 2000-12-13 1:12 ` John Prevost
2000-12-13 2:35 ` Chris Hecker
0 siblings, 1 reply; 23+ messages in thread
From: John Prevost @ 2000-12-13 1:12 UTC (permalink / raw)
To: eijiro_sumii; +Cc: checker, caml-list, gerd, sumii
>>>>> "es" == eijiro sumii <eijiro_sumii@anet.ne.jp> writes:
>> Any ideas why strstr blows the others away? What's the libc
>> strstr look like?
es> I have no idea, unfortunately...
I just grabbed the glibc version, and found the following comment at
the head of it:
/*
* My personal strstr() implementation that beats most other algorithms.
* Until someone tells me otherwise, I assume that this is the
* fastest implementation of strstr() in C.
* I deliberately chose not to comment it. You should have at least
* as much fun trying to understand it, as I had to write it :-).
*
* Stephen R. van den Berg, berg@pool.informatik.rwth-aachen.de */
Since the text of the function involves not only nasty gotos
(particularly frightening is "goto shloop;" but also pointer tricks
and register declarations, I'd say it's to be expected that it'll
outperform O'Caml. I am, however, seeing if I can figure out the
essential algorithm used (if there's really anything smart to it) and
will let you know if I come up with anything extra special.
The file strstr.c is included below for your edification. I'd argue
that any Caml version is likely to a be a wee bit easier to
understand.
John.
/* Return the offset of one string within another.
Copyright (C) 1994, 1996, 1997 Free Software Foundation, Inc.
This file is part of the GNU C Library.
The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Library General Public License as
published by the Free Software Foundation; either version 2 of the
License, or (at your option) any later version.
The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Library General Public License for more details.
You should have received a copy of the GNU Library General Public
License along with the GNU C Library; see the file COPYING.LIB. If not,
write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA. */
/*
* My personal strstr() implementation that beats most other algorithms.
* Until someone tells me otherwise, I assume that this is the
* fastest implementation of strstr() in C.
* I deliberately chose not to comment it. You should have at least
* as much fun trying to understand it, as I had to write it :-).
*
* Stephen R. van den Berg, berg@pool.informatik.rwth-aachen.de */
#if HAVE_CONFIG_H
# include <config.h>
#endif
#if defined _LIBC || defined HAVE_STRING_H
# include <string.h>
#endif
typedef unsigned chartype;
#undef strstr
char *
strstr (phaystack, pneedle)
const char *phaystack;
const char *pneedle;
{
register const unsigned char *haystack, *needle;
register chartype b, c;
haystack = (const unsigned char *) phaystack;
needle = (const unsigned char *) pneedle;
b = *needle;
if (b != '\0')
{
haystack--; /* possible ANSI violation */
do
{
c = *++haystack;
if (c == '\0')
goto ret0;
}
while (c != b);
c = *++needle;
if (c == '\0')
goto foundneedle;
++needle;
goto jin;
for (;;)
{
register chartype a;
register const unsigned char *rhaystack, *rneedle;
do
{
a = *++haystack;
if (a == '\0')
goto ret0;
if (a == b)
break;
a = *++haystack;
if (a == '\0')
goto ret0;
shloop: }
while (a != b);
jin: a = *++haystack;
if (a == '\0')
goto ret0;
if (a != c)
goto shloop;
rhaystack = haystack-- + 1;
rneedle = needle;
a = *rneedle;
if (*rhaystack == a)
do
{
if (a == '\0')
goto foundneedle;
++rhaystack;
a = *++needle;
if (*rhaystack != a)
break;
if (a == '\0')
goto foundneedle;
++rhaystack;
a = *++needle;
}
while (*rhaystack == a);
needle = rneedle; /* took the register-poor approach */
if (a == '\0')
break;
}
}
foundneedle:
return (char*) haystack;
ret0:
return 0;
}
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-13 1:12 ` John Prevost
@ 2000-12-13 2:35 ` Chris Hecker
0 siblings, 0 replies; 23+ messages in thread
From: Chris Hecker @ 2000-12-13 2:35 UTC (permalink / raw)
To: John Prevost, eijiro_sumii; +Cc: caml-list, gerd, sumii
>The file strstr.c is included below for your edification. I'd argue
>that any Caml version is likely to a be a wee bit easier to
>understand.
Okay, so here's a new challenge: write the most readable strstr you can in ocaml. The restriction is you have to actually implement the algorithm, not just call the regex library. :)
What is the most functional and readable one we can come up with? And then we'll compare its performance and see if the readable/performance curve is appropriate. Eijiro's was relatively short and functional, but still not completely transparent. My imperative one sucks for readability because of all the references and variables.
Chris
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-12 12:28 ` Jean-Christophe Filliatre
@ 2000-12-13 10:02 ` eijiro_sumii
2000-12-13 10:17 ` Eijiro Sumii
2000-12-13 10:53 ` Julian Assange
0 siblings, 2 replies; 23+ messages in thread
From: eijiro_sumii @ 2000-12-13 10:02 UTC (permalink / raw)
To: caml-list; +Cc: Jean-Christophe.Filliatre, proff, jmp, sumii
Hi everyone,
Thank you very much for a lot of kind advice!
Jean-Christophe Filliatre <Jean-Christophe.Filliatre@lri.fr> wrote:
> I missed the beginning of the discussion, but implementing
> Knuth-Morris-Pratt is quite easy. The code is given below (26 lines),
Actually, I found the (almost) same code on the web and tried it. The
results (again, execution time in seconds) were:
SPARC Pentium
strstr 52.68 57.050
kmp 111.52 143.490
The SPARC machine is the same as before. The Pentium machine is
running Linux 2.2.17 inside VMWare 2.0.3 (with 48MB virtual main
memory) on Windows 2000 with a Pentium II 400MHz processor and 128MB
main memory.
Since my program calls the function more than 3000 times for _each_
pattern, I did partial application, of course. Even so, kmp in OCaml
still performed much worse than strstr in libc (both Sun's and GNU's),
at least in this program.
If you'd like to check it yourself, the source code is available at
the same URL as before
(http://www.yl.is.s.u-tokyo.ac.jp/~sumii/tmp/hc.tar.gz).
Julian Assange <proff@iq.org> wrote:
> I'm interested in this library. Can you tell me a little more? What's
> your ETA?
It is explained a little at:
http://www.hypothesiscreator.net/
It has been developed in C++, but we began to port it to OCaml for
clarity and convenience (and fun:-).
Julian Assange <proff@iq.org> wrote:
> If you are really calling strstr 400,000,000 times, I strongly suggest
> the use of Boyer-Moore. Is your alphabet amino-acids or base-pairs? If
Can Boyer-Moore (in OCaml) be _much_ faster than KMP, so that it beats
strstr in libc?
John Prevost <jmp@arsdigita.com> wrote:
> I just grabbed the glibc version, and found the following comment at
> the head of it:
That was exactly what I found too (and mentioned a bit in my previous
message). Maybe I should also disassemble and check Sun's strstr...?
Regards,
Eijiro
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-13 10:02 ` eijiro_sumii
@ 2000-12-13 10:17 ` Eijiro Sumii
2000-12-13 10:53 ` Julian Assange
1 sibling, 0 replies; 23+ messages in thread
From: Eijiro Sumii @ 2000-12-13 10:17 UTC (permalink / raw)
To: caml-list; +Cc: Jean-Christophe.Filliatre, proff, jmp, sumii
> Actually, I found the (almost) same code on the web and tried it. The
> results (again, execution time in seconds) were:
>
> SPARC Pentium
> strstr 52.68 57.050
> kmp 111.52 143.490
Oops, I forgot to give "-unsafe" to ocamlopt in kmp. With -unsafe,
the results have improved (but still worse than strstr):
SPARC Pentium
kmp 73.50 79.940
Eijiro
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-13 10:02 ` eijiro_sumii
2000-12-13 10:17 ` Eijiro Sumii
@ 2000-12-13 10:53 ` Julian Assange
2000-12-13 13:28 ` Eijiro Sumii
1 sibling, 1 reply; 23+ messages in thread
From: Julian Assange @ 2000-12-13 10:53 UTC (permalink / raw)
To: eijiro_sumii; +Cc: caml-list, Jean-Christophe.Filliatre, jmp, sumii, proff
eijiro_sumii@anet.ne.jp writes:
> Hi everyone,
>
> Thank you very much for a lot of kind advice!
>
> Jean-Christophe Filliatre <Jean-Christophe.Filliatre@lri.fr> wrote:
> > I missed the beginning of the discussion, but implementing
> > Knuth-Morris-Pratt is quite easy. The code is given below (26 lines),
>
> Actually, I found the (almost) same code on the web and tried it. The
> results (again, execution time in seconds) were:
>
> SPARC Pentium
> strstr 52.68 57.050
> kmp 111.52 143.490
All of these more sophisticated methods are only going to be a win
when you are searching for larger strings. strstr is brutally simple,
has little setup/cleanup time, and hence is hard to beat for short
strings. Additionally gcc, will replace various s* calls with in-line
assembly versions. I'm not sure if this includes strstr (it is a
little more complicated than most).
Cheers,
Julian.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-13 10:53 ` Julian Assange
@ 2000-12-13 13:28 ` Eijiro Sumii
0 siblings, 0 replies; 23+ messages in thread
From: Eijiro Sumii @ 2000-12-13 13:28 UTC (permalink / raw)
To: proff; +Cc: caml-list, sumii
Julian Assange <proff@iq.org> wrote:
> All of these more sophisticated methods are only going to be a win
> when you are searching for larger strings. strstr is brutally simple,
> has little setup/cleanup time, and hence is hard to beat for short
> strings.
That is completely correct, and the substrings in my application are
indeed short (3 characters), but:
eijiro_sumii@anet.ne.jp wrote:
| Since my program calls the function more than 3000 times for _each_
| pattern, I did partial application, of course. Even so, kmp in OCaml
| still performed much worse than strstr in libc (both Sun's and GNU's),
| at least in this program.
I expected that 3000 strings for 1 substring, each of which are about
20 characters long as I mentioned before, are enough for paying the
"setup/cleanup" overhead. (On the other hand, strstr in libc doesn't
benefit from the partial application, of course.)
Eijiro
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-11 21:07 ` Chris Hecker
` (2 preceding siblings ...)
2000-12-12 10:07 ` Sven LUTHER
@ 2000-12-14 3:36 ` eijiro_sumii
2000-12-14 6:48 ` Chris Hecker
3 siblings, 1 reply; 23+ messages in thread
From: eijiro_sumii @ 2000-12-14 3:36 UTC (permalink / raw)
To: checker; +Cc: caml-list, gerd, sumii
> Okay, I'm curious, so I'll port the code to caml and include it
> below as well (as practice for myself). Can you try it in your test
> harness?
I tried the optimized version of the imperative strstr, along with an
optimized version of the functional strstr (attached at the end). The
results (on the SPARC machine) were:
strstr_imp2 90.57
strstr_fun2 89.64
C strstr 53.76
So the imperative version and the functional version were comparable,
though both were slower than the C version. My guess is that they
actually compile into rather similar code, because all recursive calls
are in tail positions.
Anyway, for my application, the libc version seems the most reasonable
choice, as far as I use a strstr-like function at all...
// Eijiro Sumii (http://www.yl.is.s.u-tokyo.ac.jp/~sumii/)
//
// Ph.D. Student at Department of IS, University of Tokyo
// Visiting Scholar at Department of CIS, University of Pennsylvania
let strstr_fun2 pat str =
let patlim = String.length pat - 1 in
let strlim = String.length str - 1 in
let rec sub patpos strpos =
(* compare pat[0..patpos] and str[(strpos-patpos)..strpos] *)
if patpos < 0 then true else
if pat.[patpos] <> str.[strpos] then false else
sub (patpos - 1) (strpos - 1) in
let rec search strpos =
(* compare pat[0..patlim] and str[(strpos-patlim)..strpos] *)
if strpos > strlim then raise Not_found else
if sub patlim strpos then strpos - patlim else
search (strpos + 1) in
search patlim
;;
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-14 3:36 ` eijiro_sumii
@ 2000-12-14 6:48 ` Chris Hecker
2000-12-14 8:02 ` eijiro_sumii
0 siblings, 1 reply; 23+ messages in thread
From: Chris Hecker @ 2000-12-14 6:48 UTC (permalink / raw)
To: eijiro_sumii; +Cc: caml-list, gerd, sumii
> strstr_imp2 90.57
> strstr_fun2 89.64
> C strstr 53.76
Is that C the simple C one I posted, or the libc one on sparc?
Here are my timings with yours added:
strstr_fun total = 1.135000 (5,0,19,13)
strstr_fun2 total = 1.470000 (5,0,19,13)
strstr_imp total = 1.030000 (5,0,19,13)
strstr_imp2 total = 0.965000 (5,0,19,13)
strstr_c total = 0.790000 (5,0,19,13)
strstr_libc total = 0.370000 (5,0,19,13)
Your new strstr is actually slower on x86 than your first one, but I haven't looked into why yet. It may also be data (since my data is obviously lame garbage). I should try Alpha Linux...
Chris
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-14 6:48 ` Chris Hecker
@ 2000-12-14 8:02 ` eijiro_sumii
2000-12-14 21:53 ` Stephan Tolksdorf
0 siblings, 1 reply; 23+ messages in thread
From: eijiro_sumii @ 2000-12-14 8:02 UTC (permalink / raw)
To: checker; +Cc: caml-list, gerd, sumii
> Is that C the simple C one I posted, or the libc one on sparc?
It's the one in libc on Solaris, so may be tuned by hand in assembly.
> Your new strstr is actually slower on x86 than your first one, but I
> haven't looked into why yet.
That's interesting. I confirmed it myself on my x86 machine with my
program:
strstr_imp2 88.530
strstr_fun 124.210
strstr_fun2 127.120
glibc strstr 61.310
Seeing such strong machine dependency, it seems yet more reasonable to
use the "default" strstr provided by the system (unless the _pattern_
is long enough, in which case KMP or BM might perform better), even
though there is some overhead --- about 5% in my profiling on SPARC
--- of interfacing a C function to OCaml. This is a natural (though
boring) conclusion, I think.
Eijiro
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
2000-12-14 8:02 ` eijiro_sumii
@ 2000-12-14 21:53 ` Stephan Tolksdorf
0 siblings, 0 replies; 23+ messages in thread
From: Stephan Tolksdorf @ 2000-12-14 21:53 UTC (permalink / raw)
To: caml-list
Hello,
some time ago I've been interested in highly optimized text search
algorithms and wrote some in assembler.
A very good analysis of text search algorithms can be found in the
8/97 issue of the German magazin C't (www.heise.de/ct) - "Blitzfindig"
by Michael Tamm.
This article compares the algoritms "Brute Force" "Boyer-Moore",
"Quicksearch" and the advanced "T-Search" in detail.
The C source can be found on: ftp://ftp.heise.de/pub/ct/listings/ct9708.zip
Have fun...
MfG,
Stephan Tolksdorf
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: substring match like "strstr"
@ 2000-12-14 21:12 Ruchira Datta
0 siblings, 0 replies; 23+ messages in thread
From: Ruchira Datta @ 2000-12-14 21:12 UTC (permalink / raw)
To: caml-list
John Prevost wrote:
>I just grabbed the glibc version, and found the following comment at
>the head of it:
>/*
> * My personal strstr() implementation that beats most other algorithms.
> * Until someone tells me otherwise, I assume that this is the
> * fastest implementation of strstr() in C.
> * I deliberately chose not to comment it. You should have at least
> * as much fun trying to understand it, as I had to write it :-).
> *
> * Stephen R. van den Berg, berg@pool.informatik.rwth-aachen.de */
>Since the text of the function involves not only nasty gotos
>(particularly frightening is "goto shloop;" but also pointer tricks
>and register declarations, I'd say it's to be expected that it'll
>outperform O'Caml. I am, however, seeing if I can figure out the
>essential algorithm used (if there's really anything smart to it) and
>will let you know if I come up with anything extra special.
The frightening tricks are shocking to us functional programmers :-) but
once you get over your shock and read through it, you won't have any
problem understanding it. At each point, don't ask yourself ``why is
this statement written in this obfuscated way?'' Just assume as an
article of faith that this incantation is invoked to propitiate the
Almighty Assembler :-). Instead ask yourself ``what does this statement
do?'' That is fairly simple...
There is absolutely nothing smart about the algorithm; it is exactly
the brute-force one. For example, the first character of the needle
is b, and the second character is c. Suppose we have matched b and
c and then find that we don't match the whole needle. Then we back
up our haystack pointer all the way back to the character after the
last potential match, i.e., to the character that matched c, and start
from the beginning again by checking if it matches b. We do this every
time regardless of whether b and c are different. I guess this is to
save one register variable.
The speed comes from the low-level assembly language optimization; C is
being used as portable assembler. I am sure it took a lot of ingenuity
and very intimate knowledge of the gcc compiler to do such a low-level
optimization while remaining in portable C. But it is not something for
OCaml users to emulate.
Ruchira Datta
datta@math.berkeley.edu
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2000-12-15 12:53 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2000-12-08 6:04 substring match like "strstr" eijiro_sumii
2000-12-08 12:57 ` Gerd Stolpmann
2000-12-10 13:16 ` eijiro_sumii
2000-12-10 15:39 ` Gerd Stolpmann
2000-12-11 3:57 ` eijiro_sumii
2000-12-12 13:58 ` Julian Assange
2000-12-11 21:07 ` Chris Hecker
2000-12-11 22:22 ` Gerd Stolpmann
2000-12-12 5:06 ` Chris Hecker
2000-12-12 12:28 ` Jean-Christophe Filliatre
2000-12-13 10:02 ` eijiro_sumii
2000-12-13 10:17 ` Eijiro Sumii
2000-12-13 10:53 ` Julian Assange
2000-12-13 13:28 ` Eijiro Sumii
2000-12-12 3:28 ` eijiro_sumii
2000-12-13 1:12 ` John Prevost
2000-12-13 2:35 ` Chris Hecker
2000-12-12 10:07 ` Sven LUTHER
2000-12-14 3:36 ` eijiro_sumii
2000-12-14 6:48 ` Chris Hecker
2000-12-14 8:02 ` eijiro_sumii
2000-12-14 21:53 ` Stephan Tolksdorf
2000-12-14 21:12 Ruchira Datta
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox