counting words - Toby Moth

Mailing list for all users of the OCaml language and system.
 help / color / mirror / Atom feed

From: Toby Moth <tm1@cise.npl.co.uk>
To: caml-list@inria.fr
Subject: counting words
Date: Tue, 19 Jan 1999 16:47:19 GMT	[thread overview]
Message-ID: <199901191647.QAA06463@squall.cise.npl.co.uk> (raw)

Remember the program in K & R for counting the number of occurrences of all the words in some input (K&R2, 6.5) ? Anyway, I happened to mention this program to a friend whilst a Perl hacker was listening in. To our surprise, the Perl programmer wrote a word counting program, including a binary tree sort, as we spoke. My friend then wrote his own rather less faithful but elegant version of this program in C++. Finally, I decided to write 2 OCAML versions - one using explicit buffering and streams, the other giving much of the work to ocamllex.

The surprise (at least for me) was this - both ocaml versions were significantly faster than the C++ solution. I think the Perl program wins on style. It is closest to the K&R solution and handles white space most cleverly. However, we all know that OCAML can be written stylishly. The point which I am only just beginning to appreciate is that OCAML is remarkably fast. After all, this is just the sort of program which people usually say should be written in C/C++ because it will then run faster.

Here are some results just to give a flavour of relative speeds on an Ultra 5.

[tm1]$ ls -l tm.dv
-rw-rw-r--   1 tm1      this      452849 Jan 13 13:06 tm.dv
[tm1]$ time cat tm.dv | perl ~/lang/perl/wc.pl > pcount 

real    0m28.604s                       PERL
user    0m28.260s                            
sys     0m0.100s
[tm1]$ time cat tm.dv | ~/lang/c/wc > ccount

real    0m2.708s                        C++
user    0m2.260s
sys     0m0.180s
[tm1]$ time cat tm.dv | ~/lang/caml/ywc > mcount

real    0m0.999s                       OCAML
user    0m0.720s
sys     0m0.080s

And here is the code -

PERL ( by Robin Barker )
---
package Tree;

sub new { bless { count => 0 } }
sub putintree {
    my($tree,$value) = @_;
    if( $tree->{count} ) {
	if( $tree->{value} lt $value ) {
	    $tree->{up} -> putintree($value)	
        }
	elsif( $tree->{value} gt $value ) {
	    $tree->{down} -> putintree($value)	
        }
	else { $tree->{count}++ }
    }
    else {
	$tree->{up} = new Tree; 
	$tree->{down} = new Tree; 
	$tree->{value} = $value;
	$tree->{count} = 1;
    }
}

sub show {
    my $tree = shift;
    if( $tree -> {count} ) {
	$tree -> {down} -> show;	
	printf "%4d\t%s\n", $tree->{count}, $tree->{value};
	$tree -> {up} -> show;	
    }
}

package main;

my $tree = new Tree;

while( <> ) { for ( /\w+/g ) { $tree-> putintree ($&) } }

$tree->show;
---

C++ ( by Steve Gardner, also an OCAML fan )
> Steve says:
> "This might be faster if it used lex"
---
#include <iostream>
#include <string>
#include <map>

int main()
{
  string buf;
  typedef map<string, long> mymap;
  mymap vec;

  while (cin >> buf) ++vec[buf];

  for (mymap::const_iterator p = vec.begin(); p != vec.end(); ++p)
    cout << p->first << ": " << p->second << endl;
}
---

OCAML ( by me. Forgive me any blemishes - I mean well )
---
(*** Lexer Approach to Word Count ***)

{
let hash = Hashtbl.create 101

let hash_add string = 
  try
    let v = Hashtbl.find hash string in v:= !v + 1
  with
    Not_found -> Hashtbl.add hash string (ref 1)
} 

rule read = parse
  [' ' '\n' '\r' '\t']+ { read lexbuf } 
| [^' ' '\n' '\r' '\r']+ { hash_add ( Lexing.lexeme lexbuf ); read lexbuf }
| eof { Hashtbl.iter ( fun k v -> Printf.printf "%4d\t%s\n" !v k ) hash }

{ let _ = read_rec ( Lexing.from_channel stdin ) 0 }

---

 There are a few little extra details worth mentioning.

1) The 2nd OCAML program which I haven't included uses the sort of string buffering + stream approach which you can find in the genlex.ml code. This second approach is rather longer, but not much slower.
2) A longer C++ solution using #include <hash_map> was also much slower than either of the OCAML solutions.
3) The speed of the Perl program is irrelevant, since no hashing algorithm was being used.

-- 
Albert Tobias Moth
at work: research scientist at the National Physical Laboratory
         tm1@npl.co.uk   +44 (0)181 943 6016
at home: philosopher, dish washer, diarist, geek
         toby@heynonnyno.demon.co.uk

next             reply	other threads:[~1999-01-19 17:08 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
1999-01-19 16:47 Toby Moth [this message]
1999-01-19 22:52 ` Stefan Monnier
1999-01-20  2:01 ` Markus Mottl

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=199901191647.QAA06463@squall.cise.npl.co.uk \
    --to=tm1@cise.npl.co.uk \
    --cc=caml-list@inria.fr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox