counting words

Mailing list for all users of the OCaml language and system.
 help / color / mirror / Atom feed

* counting words
@ 1999-01-19 16:47 Toby Moth
  1999-01-19 22:52 ` Stefan Monnier
  1999-01-20  2:01 ` Markus Mottl
  0 siblings, 2 replies; 3+ messages in thread
From: Toby Moth @ 1999-01-19 16:47 UTC (permalink / raw)
  To: caml-list

Remember the program in K & R for counting the number of occurrences of all the words in some input (K&R2, 6.5) ? Anyway, I happened to mention this program to a friend whilst a Perl hacker was listening in. To our surprise, the Perl programmer wrote a word counting program, including a binary tree sort, as we spoke. My friend then wrote his own rather less faithful but elegant version of this program in C++. Finally, I decided to write 2 OCAML versions - one using explicit buffering and streams, the other giving much of the work to ocamllex.

The surprise (at least for me) was this - both ocaml versions were significantly faster than the C++ solution. I think the Perl program wins on style. It is closest to the K&R solution and handles white space most cleverly. However, we all know that OCAML can be written stylishly. The point which I am only just beginning to appreciate is that OCAML is remarkably fast. After all, this is just the sort of program which people usually say should be written in C/C++ because it will then run faster.

Here are some results just to give a flavour of relative speeds on an Ultra 5.

[tm1]$ ls -l tm.dv
-rw-rw-r--   1 tm1      this      452849 Jan 13 13:06 tm.dv
[tm1]$ time cat tm.dv | perl ~/lang/perl/wc.pl > pcount 

real    0m28.604s                       PERL
user    0m28.260s                            
sys     0m0.100s
[tm1]$ time cat tm.dv | ~/lang/c/wc > ccount

real    0m2.708s                        C++
user    0m2.260s
sys     0m0.180s
[tm1]$ time cat tm.dv | ~/lang/caml/ywc > mcount

real    0m0.999s                       OCAML
user    0m0.720s
sys     0m0.080s

And here is the code -

PERL ( by Robin Barker )
---
package Tree;

sub new { bless { count => 0 } }
sub putintree {
    my($tree,$value) = @_;
    if( $tree->{count} ) {
	if( $tree->{value} lt $value ) {
	    $tree->{up} -> putintree($value)	
        }
	elsif( $tree->{value} gt $value ) {
	    $tree->{down} -> putintree($value)	
        }
	else { $tree->{count}++ }
    }
    else {
	$tree->{up} = new Tree; 
	$tree->{down} = new Tree; 
	$tree->{value} = $value;
	$tree->{count} = 1;
    }
}

sub show {
    my $tree = shift;
    if( $tree -> {count} ) {
	$tree -> {down} -> show;	
	printf "%4d\t%s\n", $tree->{count}, $tree->{value};
	$tree -> {up} -> show;	
    }
}

package main;

my $tree = new Tree;

while( <> ) { for ( /\w+/g ) { $tree-> putintree ($&) } }

$tree->show;
---

C++ ( by Steve Gardner, also an OCAML fan )
> Steve says:
> "This might be faster if it used lex"
---
#include <iostream>
#include <string>
#include <map>

int main()
{
  string buf;
  typedef map<string, long> mymap;
  mymap vec;

  while (cin >> buf) ++vec[buf];

  for (mymap::const_iterator p = vec.begin(); p != vec.end(); ++p)
    cout << p->first << ": " << p->second << endl;
}
---

OCAML ( by me. Forgive me any blemishes - I mean well )
---
(*** Lexer Approach to Word Count ***)

{
let hash = Hashtbl.create 101

let hash_add string = 
  try
    let v = Hashtbl.find hash string in v:= !v + 1
  with
    Not_found -> Hashtbl.add hash string (ref 1)
} 

rule read = parse
  [' ' '\n' '\r' '\t']+ { read lexbuf } 
| [^' ' '\n' '\r' '\r']+ { hash_add ( Lexing.lexeme lexbuf ); read lexbuf }
| eof { Hashtbl.iter ( fun k v -> Printf.printf "%4d\t%s\n" !v k ) hash }

{ let _ = read_rec ( Lexing.from_channel stdin ) 0 }

---

 There are a few little extra details worth mentioning.

1) The 2nd OCAML program which I haven't included uses the sort of string buffering + stream approach which you can find in the genlex.ml code. This second approach is rather longer, but not much slower.
2) A longer C++ solution using #include <hash_map> was also much slower than either of the OCAML solutions.
3) The speed of the Perl program is irrelevant, since no hashing algorithm was being used.

-- 
Albert Tobias Moth
at work: research scientist at the National Physical Laboratory
         tm1@npl.co.uk   +44 (0)181 943 6016
at home: philosopher, dish washer, diarist, geek
         toby@heynonnyno.demon.co.uk

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: counting words
  1999-01-19 16:47 counting words Toby Moth
@ 1999-01-19 22:52 ` Stefan Monnier
  1999-01-20  2:01 ` Markus Mottl
  1 sibling, 0 replies; 3+ messages in thread
From: Stefan Monnier @ 1999-01-19 22:52 UTC (permalink / raw)
  To: caml-list

>>>>> "Toby" == Toby Moth <tm1@cise.npl.co.uk> writes:
> [tm1]$ time cat tm.dv | perl ~/lang/perl/wc.pl > pcount 

Running for 1999's "useless use of cat" contest ?


	Stefan




^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: counting words
  1999-01-19 16:47 counting words Toby Moth
  1999-01-19 22:52 ` Stefan Monnier
@ 1999-01-20  2:01 ` Markus Mottl
  1 sibling, 0 replies; 3+ messages in thread
From: Markus Mottl @ 1999-01-20  2:01 UTC (permalink / raw)
  To: Toby Moth; +Cc: OCAML

Hello,

> Here are some results just to give a flavour of relative speeds on an Ultra 5.
> 
> [tm1]$ ls -l tm.dv
> -rw-rw-r--   1 tm1      this      452849 Jan 13 13:06 tm.dv
> [tm1]$ time cat tm.dv | perl ~/lang/perl/wc.pl > pcount 
> 
> real    0m28.604s                       PERL
> user    0m28.260s                            
> sys     0m0.100s
> [tm1]$ time cat tm.dv | ~/lang/c/wc > ccount
> 
> real    0m2.708s                        C++
> user    0m2.260s
> sys     0m0.180s
> [tm1]$ time cat tm.dv | ~/lang/caml/ywc > mcount
> 
> real    0m0.999s                       OCAML
> user    0m0.720s
> sys     0m0.080s

Since comparing the performance of compilers is one of my hobbies ;-),
I have tried to follow this "contest" and have conducted a more thorough
test:

I implemented the solution to the problem with "flex" in C++, both with
maps and hash tables; then I wrote the PERL-version with hashes, too
(*very* short!) and then I took a look at the performance of the native
code compiler and interpreter of OCAML. The result of all runs had an
equal amount of characters (though, not in the same order, because hashes
don't sort). Thus, the implementation seems to be correct.

The test was conducted under Linux (PII 400Mhz, 128MB RAM), the
C++-compiler is "egcs-2.91.57" (optimizations: -O2 -fomit-frame-pointer).
The test case was the concatenated text of all the Linux HOWTO's on my
machine (about 7.3MB).

One of the problems with the C++-version (besides an internal compiler
error when trying to code the solution ;-) is that the number of buckets
in the hash table cannot be fixed at a low level. If the table is
created without parameters, it starts out with 193 buckets and increases
this number to 196613! To make up for this, I conducted the test under
OCAML with 101 buckets (as used by Albert) and the maximum found in C++
(196613). In another test I let C++ start out with this high level -
I think that we have a really fair comparison now.

Another problem with C++ was that finding the right use of templates was
everything but easy - the compiler not only sometimes emitted screen-fulls
of expanded templates as error message, but I'd say that the C++-code
looks really ill now! (I hope it's not my fault...)
By the way: I didn't use the STL-"string"-class for efficiency reasons...

Anyway, the speed result is very interesting indeed (in order from best to
worst, three successive tests each):

ocamlopt & ocamllex (196613 buckets):

4.270u 0.150s 0:04.42 100.0%    0+0k 0+0io 542pf+0w
4.250u 0.180s 0:04.66 95.0%     0+0k 0+0io 542pf+0w
4.240u 0.180s 0:04.42 100.0%    0+0k 0+0io 542pf+0w


ocaml (interpreted) & ocamllex (196613 buckets):

9.080u 0.180s 0:09.25 100.1%    0+0k 0+0io 539pf+0w
9.120u 0.150s 0:09.26 100.1%    0+0k 0+0io 539pf+0w
8.990u 0.270s 0:09.45 97.9%     0+0k 0+0io 539pf+0w


ocamlopt & ocamllex (101 buckets):

11.160u 0.270s 0:11.43 100.0%   0+0k 0+0io 542pf+0w
11.200u 0.240s 0:11.44 100.0%   0+0k 0+0io 542pf+0w
11.250u 0.200s 0:11.44 100.0%   0+0k 0+0io 542pf+0w


PERL:

11.340u 0.240s 0:11.58 100.0%   0+0k 0+0io 643pf+0w
11.380u 0.200s 0:11.58 100.0%   0+0k 0+0io 643pf+0w
11.350u 0.230s 0:11.58 100.0%   0+0k 0+0io 643pf+0w


C++ & flex (hash_map, 196613 buckets):

11.560u 0.340s 0:11.90 100.0%   0+0k 0+0io 552pf+0w
11.620u 0.280s 0:11.90 100.0%   0+0k 0+0io 552pf+0w
11.570u 0.320s 0:11.89 100.0%   0+0k 0+0io 552pf+0w

C++ & flex (hash_map, 193++ buckets):

11.780u 0.260s 0:12.04 100.0%   0+0k 0+0io 552pf+0w
11.720u 0.310s 0:12.03 100.0%   0+0k 0+0io 552pf+0w
11.870u 0.160s 0:12.03 100.0%   0+0k 0+0io 552pf+0w


C++ & flex (map):

14.530u 0.290s 0:14.82 100.0%   0+0k 0+0io 551pf+0w
14.540u 0.250s 0:14.80 99.9%    0+0k 0+0io 551pf+0w
14.630u 0.180s 0:14.81 100.0%   0+0k 0+0io 551pf+0w


ocaml (interpreted) & ocamllex (101 buckets):
20.450u 0.130s 0:20.58 100.0%   0+0k 0+0io 541pf+0w
20.300u 0.270s 0:20.57 100.0%   0+0k 0+0io 539pf+0w
20.450u 0.150s 0:20.59 100.0%   0+0k 0+0io 539pf+0w


The funny thing is that the OCAML-interpreter beats off all other
competitors (not the OCAML native code compiler, of course)!
Please tell this your local C++-guru :-)
Even PERL beats C++ without problem.

OCAML translated to native code is *way* faster than C++ (far more
than twice!)

The slowest competitor was the OCAML-interpreter with the unfair bucket
count of 101 - I just added it to show a result for a not unreasonable
number of buckets (196613 is probably a bit exaggerated...)


Now for the source code (shortest first):


PERL (yes, it *can* be this short!):
---------------------------------------------------------------------------
while(<>) { for (/[^ \t\n\r]+/g) { ++$table{$_}; } }
while (($key, $value) = each %table) { printf "%4d\t\%s\n", $value, $key; }
---------------------------------------------------------------------------


ocamllex-file (note: 196613 buckets):
---------------------------------------------------------------------------
{
let hash = Hashtbl.create 101

let hash_add string =
  try
    let v = Hashtbl.find hash string in v:= !v + 1
  with
    Not_found -> Hashtbl.add hash string (ref 1)
}

rule read = parse
  [' ' '\n' '\r' '\t']+ { read lexbuf }
| [^' ' '\n' '\r' '\t']+ { hash_add ( Lexing.lexeme lexbuf ); read lexbuf }
| eof { Hashtbl.iter ( fun k v -> Printf.printf "%4d\t%s\n" !v k ) hash }

{ let _ = read_rec ( Lexing.from_channel stdin ) 0 }
---------------------------------------------------------------------------


flex and C++ with hash (note: 196613 buckets):
Don't forget to use option "-+" to get C++-code!
---------------------------------------------------------------------------
%option noyywrap

  #include <stdio.h>
  #include <string.h>
  #include <hash_map>

  struct eqstr
  {
    bool operator()(const char* s1, const char* s2) const
    {
      return strcmp(s1, s2) == 0;
    }
  };

  typedef hash_map<const char*, int, hash<const char *>, eqstr > mymap;
  mymap vec(196613);

%%

[^ \t\n\r]+ { char *foo = new char[yyleng + 1]; strcpy(foo, yytext);
              ++vec[foo]; }
[ \t\n\r]+  ;

%%

int main()
{
  yyFlexLexer MyScanner;
  MyScanner.yylex();
  for (mymap::const_iterator p = vec.begin(); p != vec.end(); ++p)
    printf ("%4d\t%s\n",  p->second, p->first);
}
---------------------------------------------------------------------------


flex and C++ with map:
Don't forget to use option "-+" to get C++-code!
---------------------------------------------------------------------------
%option noyywrap

  #include <stdio.h>
  #include <string>
  #include <map>

  struct less_str
  {
    bool operator()(const char* s1, const char* s2) const
    {
      return strcmp(s1, s2) < 0;
    }
  };

  typedef map<const char*, int, less_str> mymap;
  mymap vec;

%%

[^ \t\n\r]+ { char *foo = new char[yyleng + 1]; strcpy(foo, yytext);
              ++vec[foo]; }
[ \t\n\r]+  ;

%%

int main()
{
  yyFlexLexer MyScanner;
  MyScanner.yylex();
  for (mymap::const_iterator p = vec.begin(); p != vec.end(); ++p)
    printf ("%4d\t%s\n",  p->second, p->first);
}
---------------------------------------------------------------------------

So much about C++ and efficiency ;-)

Interesting to see that PERL did perform very well on this test. Its
brevity is unmatched! One shouldn't forget that its hash functions
are highly optimized, its regular expression matching capabilities,
as well. Though, it is definitely not a language for implementing data
structures, as we have seen in Albert's example.

I think that this test is really significant. I cannot imagine how to
make C++ so much faster that it approaches OCAML - without low-level
coding hash tables and pattern matching.

This test showed me, what a nice life I have since I use OCAML instead of
C++ (not only in terms of efficiency...)

Best regards,
Markus Mottl

-- 
Markus Mottl, mottl@miss.wu-wien.ac.at, http://miss.wu-wien.ac.at/~mottl




^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~1999-01-20 13:02 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1999-01-19 16:47 counting words Toby Moth
1999-01-19 22:52 ` Stefan Monnier
1999-01-20  2:01 ` Markus Mottl

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox