* zcat vs CamlZip
From: Sam Steingold @ 2006-08-29 18:40 UTC
To: caml-list
I read through a huge *.gz file.
I have two versions of the code:
1. use Unix.open_process_in "zcat foo.gz".
2. use gzip.mli (1.2 2002/02/18) as comes with godi 3.09.
it turns out that the zcat version is 3(!) times as fast as the gzip.mli
one (gzip.mli timings first, zcat timings second):
Run time: 189.435840 sec
  Self:     189.435840 sec
    sys:    183.447465 sec
    user:     5.988375 sec
  Children:   0.000000 sec
    sys:      0.000000 sec
    user:     0.000000 sec
  GC: minor:       169778
      major:       478
      compactions: 3
  Allocated: 5510457762.0 words
Wall clock: 206 sec (00:03:26)
vs
Run time: 58.471655 sec
  Self:     54.855429 sec
    sys:    48.527033 sec
    user:    6.328396 sec
  Children:  3.616226 sec
    sys:     3.168198 sec
    user:    0.448028 sec
  GC: minor:       43174
      major:       229
      compactions: 5
  Allocated: 1401290543.0 words
Wall clock: 78 sec (00:01:18)
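For reference, the ratios implied by the two reports can be worked out directly. This is only arithmetic on the numbers quoted above; it assumes the first report is the gzip.mli run and the second the zcat run (the one with non-zero Children time).

```ocaml
(* Ratios between the two timing reports quoted above.
   Assumption: first report = gzip.mli run, second report = zcat run. *)
let ratio a b = a /. b

let run_time_ratio  = ratio 189.435840 58.471655       (* total CPU time *)
let wall_ratio      = ratio 206.0 78.0                 (* wall clock *)
let allocated_ratio = ratio 5510457762.0 1401290543.0  (* allocated words *)

let () =
  Printf.printf "run time: %.2fx  wall clock: %.2fx  allocated: %.2fx\n"
    run_time_ratio wall_ratio allocated_ratio
```

So the "3 times" figure matches the CPU run time; measured by wall clock the gap is somewhat smaller.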
since gzip.mli lacks an input_line function, I had to roll my own:
let buf = Buffer.create 1024
let gz_input_line gz_in char_counter line_counter =
  Buffer.clear buf;
  let finish () = incr line_counter; Buffer.contents buf in
  let rec loop () =
    let ch = Gzip.input_char gz_in in
    char_counter := Int64.succ !char_counter;
    if ch = '\n' then finish () else ( Buffer.add_char buf ch; loop () ) in
  try loop ()
  with End_of_file ->
    if Buffer.length buf = 0 then raise End_of_file else finish ()
is there something wrong with my gz_input_line?
is this a known performance issue with the CamlZip library?
thanks.
Sam.
* Re: zcat vs CamlZip
From: Bardur Arantsson @ 2006-08-29 18:54 UTC
To: caml-list
Sam Steingold wrote:
> I read through a huge *.gz file.
> I have two versions of the code:
[--snip--]
>
> let buf = Buffer.create 1024
> let gz_input_line gz_in char_counter line_counter =
> Buffer.clear buf;
> let finish () = incr line_counter; Buffer.contents buf in
> let rec loop () =
> let ch = Gzip.input_char gz_in in
This is your most likely culprit. Any kind of "do this for every
character" is usually insanely expensive when you can do it in bulk.
(This is especially true when needing to do system calls, or if the
called function cannot be inlined.)
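A minimal sketch of the two shapes being contrasted, counting newlines in a file. It uses plain stdlib channels rather than Gzip, so it only illustrates the per-character versus bulk pattern and says nothing about CamlZip itself; the function names are made up for the example.

```ocaml
(* Count newlines two ways: one input_char call per byte, versus one input
   call per 64 KiB block followed by a scan of that block.
   Plain stdlib channels only; not a measurement of the Gzip module. *)

let count_per_char path =
  let ic = open_in_bin path in
  let n = ref 0 in
  (try
     while true do
       if input_char ic = '\n' then incr n
     done
   with End_of_file -> ());
  close_in ic;
  !n

let count_in_bulk path =
  let ic = open_in_bin path in
  let block = Bytes.create 65536 in
  let n = ref 0 in
  let rec loop () =
    (* input returns 0 only at end of file *)
    let got = input ic block 0 (Bytes.length block) in
    if got > 0 then begin
      for i = 0 to got - 1 do
        if Bytes.unsafe_get block i = '\n' then incr n
      done;
      loop ()
    end
  in
  loop ();
  close_in ic;
  !n
```

Both functions return the same count; timing them on a large file is one way to see how much the per-call overhead matters on a given machine.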
--
Bardur Arantsson
<bardurREMOVE@THISscientician.net>
If you can't join 'em, beat 'em. Preferably with a big stick.
* Re: [Caml-list] Re: zcat vs CamlZip
From: Florian Hars @ 2006-08-29 19:01 UTC
To: Bardur Arantsson; +Cc: caml-list
Bardur Arantsson wrote:
> Sam Steingold wrote:
>> let ch = Gzip.input_char gz_in in
>
> This is your most likely culprit.
Quite apart from that, zcat is in fact at least twice as fast
as the ocaml gzip module.
Yours, Florian.
* Re: [Caml-list] zcat vs CamlZip
From: Eric Cooper @ 2006-08-29 19:11 UTC
To: caml-list
On Tue, Aug 29, 2006 at 02:40:23PM -0400, Sam Steingold wrote:
> is this a known performance issue with the CamlZip library?
I found the same thing when I was writing approx, so I use a "gunzip"
process with Sys.command. (You can also use open_process_in, but I
just decompress to a temporary file and then reread it. That also
catches corrupt .gz files in a more robust way.)
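A rough sketch of the decompress-to-a-temporary-file approach, assuming a gunzip binary on PATH and a POSIX shell. The names gunzip_command and with_gunzipped are invented for illustration and are not taken from approx.

```ocaml
(* Decompress [src] into a temporary file with an external gunzip process,
   then hand the plain file to [f]. Assumes gunzip is on PATH and that
   Sys.command runs a POSIX shell; both helper names are hypothetical. *)

let gunzip_command src dst =
  Printf.sprintf "gunzip -c %s > %s" (Filename.quote src) (Filename.quote dst)

let with_gunzipped src f =
  let tmp = Filename.temp_file "gunzipped" ".tmp" in
  let cleanup () = try Sys.remove tmp with Sys_error _ -> () in
  match Sys.command (gunzip_command src tmp) with
  | 0 ->
      let ic = open_in_bin tmp in
      let result =
        try f ic with e -> close_in_noerr ic; cleanup (); raise e in
      close_in ic;
      cleanup ();
      result
  | code ->
      (* A corrupt or missing input shows up here as a non-zero exit code,
         before any of its contents have been processed. *)
      cleanup ();
      failwith (Printf.sprintf "gunzip exited with code %d" code)
```

Usage would look like `with_gunzipped "foo.gz" (fun ic -> input_line ic)`; the whole file is decompressed before the callback sees the first byte, which is what makes the corruption check up-front.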
--
Eric Cooper e c c @ c m u . e d u
* Re: zcat vs CamlZip
From: Sam Steingold @ 2006-08-29 19:15 UTC
To: Bardur Arantsson, caml-list
Bardur Arantsson wrote:
> Sam Steingold wrote:
>> I read through a huge *.gz file.
>> I have two versions of the code:
> [--snip--]
>>
>> let buf = Buffer.create 1024
>> let gz_input_line gz_in char_counter line_counter =
>> Buffer.clear buf;
>> let finish () = incr line_counter; Buffer.contents buf in
>> let rec loop () =
>> let ch = Gzip.input_char gz_in in
>
> This is your most likely culprit. Any kind of "do this for every
> character" is usually insanely expensive when you can do it in bulk.
> (This is especially true when needing to do system calls, or if the
> called function cannot be inlined.)
>
yes, I thought about it, but I assumed that the ocaml gzip module
inlines Gzip.input_char (obviously the gzip module needs an internal
cache so Gzip.input_char does not _always_ translate to a system call,
most of the time it just pops a char from the internal buffer).
at any rate, do you really expect that using Gzip.input and then
searching the result for a newline, slicing and dicing to get the
individual input lines, &c &c would be faster?
Sam.
* Re: [Caml-list] Re: zcat vs CamlZip
From: John Carr @ 2006-08-29 19:37 UTC
To: caml-list
> This is your most likely culprit. Any kind of "do this for every
> character" is usually insanely expensive when you can do it in bulk.
I wrote a program that read data from a text file, which
could optionally be compressed. I defined my text file
format to have nearly-fixed length lines so I could call
Gzip.really_input. My program doesn't spend much of its
time reading the text file so I didn't spend much time
making input fast. I just did what I thought was the obvious
optimization of reading a block of characters in the
normal case.
let input_line =
  begin function
      Uncompressed c ->
        input_line c
    | Compressed c ->
        begin match Gzip.input_char c with
            '#' -> while Gzip.input_char c <> '\n' do () done; "#"
          | 'S' ->
              let buf = String.make 11 'S' in
              Gzip.really_input c buf 1 10;
              if String.unsafe_get buf 10 = '\n' then
                String.unsafe_set buf 10 ' '
              else begin
                if Gzip.input_char c <> '\n' then
                  failwith "bad override file"
              end;
              buf
          | _ -> failwith "bad override file"
        end
  end
(Lines are either variable-length comments beginning with '#' or data
lines beginning with 'S' followed by 9 or 10 characters.)
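The same idea can be sketched on a plain stdlib channel: once the first byte identifies the line type, the fixed part of a data line is fetched with a single really_input call. This is an illustration only; read_record is a hypothetical name, and Bytes stands in for the mutable strings used in the code above.

```ocaml
(* Read one line of the format described above from a plain in_channel.
   Comments yield "#"; data lines yield 11 characters, padded with a space
   when the line carries only 9 characters after the 'S'.
   [read_record] is a made-up name for this sketch. *)
let read_record ic =
  match input_char ic with
  | '#' ->
      (* Comment: skip to the end of the line and return a marker. *)
      while input_char ic <> '\n' do () done;
      "#"
  | 'S' ->
      let buf = Bytes.make 11 'S' in
      (* One bulk read: 10 data characters, or 9 plus the newline. *)
      really_input ic buf 1 10;
      if Bytes.get buf 10 = '\n' then Bytes.set buf 10 ' '
      else if input_char ic <> '\n' then failwith "bad override file";
      Bytes.to_string buf
  | _ -> failwith "bad override file"
```

Calling it repeatedly until End_of_file walks the file one record at a time.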
* Re: zcat vs CamlZip
From: Bárður Árantsson @ 2006-08-29 19:48 UTC
To: caml-list
Sam Steingold wrote:
> Bardur Arantsson wrote:
>> Sam Steingold wrote:
>>> I read through a huge *.gz file.
>>> I have two versions of the code:
>> [--snip--]
>>>
>>> let buf = Buffer.create 1024
>>> let gz_input_line gz_in char_counter line_counter =
>>> Buffer.clear buf;
>>> let finish () = incr line_counter; Buffer.contents buf in
>>> let rec loop () =
>>> let ch = Gzip.input_char gz_in in
>>
>> This is your most likely culprit. Any kind of "do this for every
>> character" is usually insanely expensive when you can do it in bulk.
>> (This is especially true when needing to do system calls, or if the
>> called function cannot be inlined.)
>>
>
> yes, I thought about it, but I assumed that the ocaml gzip module
> inlines Gzip.input_char (obviously the gzip module needs an internal
> cache so Gzip.input_char does not _always_ translate to a system call,
> most of the time it just pops a char from the internal buffer).
You can also easily try this in C with fgetc() contrasted with fgets().
The difference is _huge_ even if they both do comparable numbers of
syscalls -- assuming that the buffering is identical (I haven't checked,
but I think it is a reasonable assumption). In the C case, the inlining
is not really guaranteed, but I don't think it is in OCaml either --
though I honestly don't know. You'd have to check the assembler output
to see if the call gets inlined.
Inlining aside, memory prefetching probably also makes a difference in
favor of reading in bulk and then processing "in bulk".
> at any rate, do you really expect that using Gzip.input and then
> searching the result for a newline, slicing and dicing to get the
> individual input lines, &c &c would be faster?
I would guess so, yes.
(There may of course be other reasons for a large portion of the
difference as others have pointed out.)
--
Bardur Arantsson
<bardurREMOVE@THISscientician.net>
- 'Blackmail' is such an ugly word. I prefer 'extortion'. The X
makes it sound cool.
Bender, 'Futurama'
* Re: [Caml-list] Re: zcat vs CamlZip
From: Gerd Stolpmann @ 2006-08-29 19:54 UTC
To: Sam Steingold; +Cc: Bardur Arantsson, caml-list
On Tuesday, 29.08.2006 at 15:15 -0400, Sam Steingold wrote:
> Bardur Arantsson wrote:
> > Sam Steingold wrote:
> >> I read through a huge *.gz file.
> >> I have two versions of the code:
> > [--snip--]
> >>
> >> let buf = Buffer.create 1024
> >> let gz_input_line gz_in char_counter line_counter =
> >> Buffer.clear buf;
> >> let finish () = incr line_counter; Buffer.contents buf in
> >> let rec loop () =
> >> let ch = Gzip.input_char gz_in in
> >
> > This is your most likely culprit. Any kind of "do this for every
> > character" is usually insanely expensive when you can do it in bulk.
> > (This is especially true when needing to do system calls, or if the
> > called function cannot be inlined.)
> >
>
> yes, I thought about it, but I assumed that the ocaml gzip module
> inlines Gzip.input_char (obviously the gzip module needs an internal
> cache so Gzip.input_char does not _always_ translate to a system call,
> most of the time it just pops a char from the internal buffer).
This may be a godi issue, because gzip.cmx is not installed. Inlining
needs the .cmx file. However, I am not sure whether input_char can be
inlined at all. You can find that out with the dumpapprox tool:
dumpapprox path/to/foo.cmx
Look for the "Approximation" section. If the function (or better entry
point) is listed with the "(inline)" flag it can be inlined, otherwise
not.
> at any rate, do you really expect that using Gzip.input and then
> searching the result for a newline, slicing and dicing to get the
> individual input lines, &c &c would be faster?
The question is whether you finally get a loop that can be completely
executed in the CPU's cache, and how many variables need to be read and
written in a loop cycle. Whether functions are inlined or not is usually
not that important. My experience is that the Gzip.input method is
faster.
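A sketch of what such a block-based line reader could look like. It is written against a generic read function of type bytes -> int -> int -> int that returns 0 at end of input; with CamlZip one would presumably pass Gzip.input gz_in (an assumption: recent camlzip versions take bytes there, while the 2006 API used mutable strings). The name make_line_reader and the block_size parameter are invented for the example.

```ocaml
(* Line reader on top of a block-read function [read buf pos len], which
   must return 0 at end of input. One bulk read per block; newlines are
   then found by scanning the block in memory.
   [make_line_reader] is a hypothetical helper, not part of any library. *)
let make_line_reader ?(block_size = 4096) read =
  let block = Bytes.create block_size in
  let pos = ref 0 and len = ref 0 in  (* unread bytes: block[pos .. len-1] *)
  let line = Buffer.create 1024 in
  let finish () =
    let s = Buffer.contents line in
    Buffer.clear line;
    s in
  let rec find_nl i =
    if i >= !len then None
    else if Bytes.unsafe_get block i = '\n' then Some i
    else find_nl (i + 1) in
  let rec next () =
    if !pos >= !len then begin
      (* Block exhausted: refill it with a single bulk read. *)
      len := read block 0 block_size;
      pos := 0;
      if !len > 0 then next ()
      else if Buffer.length line > 0 then finish ()  (* final unterminated line *)
      else raise End_of_file
    end else
      match find_nl !pos with
      | Some i ->
          Buffer.add_subbytes line block !pos (i - !pos);
          pos := i + 1;
          finish ()
      | None ->
          (* The line continues into the next block. *)
          Buffer.add_subbytes line block !pos (!len - !pos);
          pos := !len;
          next ()
  in
  next
```

The returned closure behaves like input_line: each call yields the next line without its newline and raises End_of_file once the input is exhausted.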
Gerd
--
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany
gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de
Phone: +49-6151-153855 Fax: +49-6151-997714
------------------------------------------------------------
* Re: [Caml-list] Re: zcat vs CamlZip
From: Gerd Stolpmann @ 2006-08-29 20:04 UTC
To: Sam Steingold; +Cc: Bardur Arantsson, caml-list
On Tuesday, 29.08.2006 at 15:15 -0400, Sam Steingold wrote:
> at any rate, do you really expect that using Gzip.input and then
> searching the result for a newline, slicing and dicing to get the
> individual input lines, &c &c would be faster?
Ah yes, and there is an easy solution with ocamlnet:
class input_gzip_rec gzip_ch : Netchannels.rec_in_channel =
object(self)
  method input s p l =
    let n = Gzip.input gzip_ch s p l in
    if n = 0 then raise End_of_file;
    n
  method close_in() =
    Gzip.close_in gzip_ch
end
Then use it as follows:
let gz_ch =
  Netchannels.lift_in (`Rec (new input_gzip_rec gz_in))
let line = gz_ch # input_line()
This adds a buffering layer.
Gerd
--
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany
gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de
Phone: +49-6151-153855 Fax: +49-6151-997714
------------------------------------------------------------
* Re: [Caml-list] Re: zcat vs CamlZip
From: malc @ 2006-08-30 0:44 UTC
To: Gerd Stolpmann; +Cc: caml-list
On Tue, 29 Aug 2006, Gerd Stolpmann wrote:
> On Tuesday, 29.08.2006 at 15:15 -0400, Sam Steingold wrote:
>> at any rate, do you really expect that using Gzip.input and then
>> searching the result for a newline, slicing and dicing to get the
>> individual input lines, &c &c would be faster?
>
> Ah yes, and there is an easy solution with ocamlnet:
[..snip..]
> This adds a buffering layer.
The Netchannels buffering looks very elegant, but my (admittedly rather
cursory) testing shows that it's also rather slow.
The following code implements 4 line readers:
  Sam's original [char]
  Netchannels [net]
  open_process_in [zcat]
  buffered, trying to stay compatible with the original interface [block]
While Netchannels does win over the original implementation, it loses to
all other methods (on my machine).
let buf = Buffer.create 1024

let gz_input_line gz_in char_counter line_counter =
  Buffer.clear buf;
  let finish () = incr line_counter; Buffer.contents buf in
  let rec loop () =
    let ch = Gzip.input_char gz_in in
    char_counter := Int64.succ !char_counter;
    if ch = '\n' then finish () else ( Buffer.add_char buf ch; loop (); ) in
  try loop ()
  with End_of_file ->
    if Buffer.length buf = 0 then raise End_of_file else finish ()

class input_gzip_rec gzip_ch : Netchannels.rec_in_channel =
object(self)
  method input s p l =
    let n = Gzip.input gzip_ch s p l in
    if n = 0 then raise End_of_file;
    n
  method close_in() =
    Gzip.close_in gzip_ch
end

let wrap_gz gz_in =
  let s = String.create 4096 in
  let b = Buffer.create 1024 in
  let r = ref (fun _ _ -> assert false) in
  let findlf s start finish =
    let rec loop pos = if pos >= finish then None
      else if String.unsafe_get s pos = '\n' then Some pos else loop (succ pos)
    in loop start
  in
  let rec cont pos char_counter line_counter =
    let n = Gzip.input gz_in s pos (String.length s - pos) in
    let rec subcont pos len char_counter line_counter =
      let finish = pos + len in
      match findlf s pos finish with
      | None ->
          Buffer.add_substring b s pos len;
          cont 0 char_counter line_counter
      | Some lfpos ->
          let runlen = lfpos - pos in
          incr line_counter;
          Buffer.add_substring b s pos runlen;
          let s = Buffer.contents b in
          Buffer.clear b;
          r := subcont (succ lfpos) (len - succ runlen);
          s
    in
    if n = 0
    then raise End_of_file
    else (
      char_counter := Int64.add (Int64.of_int n) !char_counter;
      subcont pos n char_counter line_counter
    )
  in
  let exec c l = !r c l in
  r := cont 0;
  exec

let char () =
  let gz = Gzip.open_in_chan stdin in
  let cc = ref 0L in
  let lc = ref 0 in
  try
    while true
    do
      let _line = gz_input_line gz cc lc in
      ()
    done
  with End_of_file ->
    Format.printf "cc=%Ld lc=%d@." !cc !lc

let block () =
  let gz = Gzip.open_in_chan stdin in
  let cc = ref 0L in
  let lc = ref 0 in
  let lg = wrap_gz gz in
  try
    while true
    do
      let _line = lg cc lc in
      ()
    done
  with End_of_file ->
    Format.printf "cc=%Ld lc=%d@." !cc !lc

let zcat () =
  let ic = Unix.open_process_in "zcat" in
  let cc = ref 0L in
  let lc = ref 0 in
  try
    while true
    do
      let _line = input_line ic in
      cc := Int64.add (Int64.of_int (String.length _line + 1)) !cc;
      incr lc
    done
  with End_of_file ->
    Format.printf "cc=%Ld lc=%d@." !cc !lc

let net () =
  let gz_in = Gzip.open_in_chan stdin in
  let gz_ch = Netchannels.lift_in (`Rec (new input_gzip_rec gz_in)) in
  let cc = ref 0L in
  let lc = ref 0 in
  try
    while true
    do
      let _line = gz_ch#input_line () in
      cc := Int64.add (Int64.of_int (String.length _line + 1)) !cc;
      incr lc
    done
  with End_of_file ->
    Format.printf "cc=%Ld lc=%d@." !cc !lc

let _ =
  match Sys.argv with
  | [| _; "char" |] -> char ()
  | [| _; "zcat" |] -> zcat ()
  | [| _; "block" |] -> block ()
  | [| _; "net" |] -> net ()
  | _ -> prerr_endline (Sys.argv.(0) ^ ": [char|zcat|block|net]")
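One way to put numbers on each reader from inside the program is a small timing wrapper. The sketch below is not part of the posted benchmark; time_it is a hypothetical helper, and since Sys.time reports processor time of the current process only, work done in a zcat child process would not be counted by it.

```ocaml
(* Run [f ()] and report the CPU time it consumed in this process.
   [time_it] is a made-up helper for illustration. Child processes
   (such as zcat) are not included in Sys.time. *)
let time_it label f =
  let t0 = Sys.time () in
  let result = f () in
  let dt = Sys.time () -. t0 in
  Printf.eprintf "%-6s %.3f s CPU\n%!" label dt;
  (result, dt)
```

Wrapping a driver is then a matter of `ignore (time_it "block" block)`; each driver consumes stdin, so only one can be timed per process.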
--
mailto:malc@pulsesoft.com
* Re: [Caml-list] Re: zcat vs CamlZip
From: Jonathan Roewen @ 2006-08-30 0:53 UTC
Cc: caml-list
Have you tried the Unzip module from Extlib? I haven't tried it myself,
but plan on using it later on.
Jonathan
* Re: [Caml-list] zcat vs CamlZip
From: Jeff Henrikson @ 2006-08-30 6:12 UTC
To: Sam Steingold; +Cc: caml-list
I was planning on using the library "ocaml gz" in my application, which
is a binding to zlib. I haven't done any detailed benchmarking, but I
presume its speed is comparable to gzip/gunzip since they just call out
to zlib.
http://ocamlplot.sourceforge.net/
Jeff Henrikson