* [Caml-list] what is the functional way to solve this problem? @ 2003-10-02 12:31 Ram Bhamidipaty 2003-10-02 14:53 ` Michal Moskal 0 siblings, 1 reply; 6+ messages in thread From: Ram Bhamidipaty @ 2003-10-02 12:31 UTC (permalink / raw) To: caml-list Someone suggested that my previous question about dynamically resizing arrays hinted that my solution may be going in a non-functional direction. That might be true. So here is the problem I am trying to solve. I would like to solve it in a "functional way". I want to create an in-memory representation of some file system data. The input data file has three different types of lines in the file: 1. starts with "R": R 0 <dirname> 2. starts with "D": D <dir_num> <parent_dir_num> <name> The <dir_num> associated with each directory are sequentially assigned starting at 1. 3. starts with "F": F <file_size> <name> The first line will always be: "R 0 <dir_name>" where 0 is the directory number for the top level directory. This purpose of this line is provide the starting point for the data in the rest of the file. The D lines indicate directories. The dir_num is an integer that uniquely identifies this particular directory. The parent_dir_num integer is used to locate this directory relative to the other directories in the data file. The F lines indicate data for a single file. The majority of the lines in the file should be F lines. A file listed on an F line is in the directory indicated by the closest D line that came earlier in the file. Once all the data is read in I want to output: A list of the top 100 largest files, A list of the top 100 directories that contain the largest fraction of the total disk space used by all the files in the data file - and in this case the file size for a directory does not include sub-directories. Eventually I want to also categorize the data by user-id as well. I have already written a python application that can read this data file and generate the data I am looking for. I was not happy with the python solution because it was not very fast. Even with using a heap to store the top 100 largest files I was not able to create a python solution that could beat a "grep | sort" pipeline (on unix of course). In the python solution the limiting factor was putting the individual files into a heap and repeatedly calling delete_min on the heap to remove the smallest files. Even though the unix pipe based solution ended up sorting _all_ the files and the python solution was handling a smaller set of data the unix pipe solution was still much faster. There is a thread in google about this experiment. Do a google groups search for group:comp.lang.python + bhamidipaty + "looking for speed-up ideas". The bottom line for the python solution was that the grep+sort pipeline took about 8 seconds and the fastest I could get python was around 16 seconds. Of course the unix pipe solution would not be able to do the other analysis that I wanted. My goal of implementing this in OCaml is to beat the grep+sort combination. If I can create a solution that can output all the date I want - in one pass - AND still be faster than the grep+sort partial solution - _that_ would be cool! Having said all that - I wanted to use an array to hold the data for each directory. I hoped that using an array would be faster than a hash table since I know that the directory numbers are assigned sequentially. Thanks for reading this rather long message and thank you for any advice. -Ram --------------------------------- An example of the data file: R 0 /usr D 1 0 local F 4095165 f1 D 2 1 bin F 189408 f2 F 189445 f3 D 4 1 etc F 3956 f4 D 5 1 info F 2613 f5 F 50111 f6 D 6 1 lib F 610422 f7 D 7 1 man F 82097 f8 --------------------------------- ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Caml-list] what is the functional way to solve this problem? 2003-10-02 12:31 [Caml-list] what is the functional way to solve this problem? Ram Bhamidipaty @ 2003-10-02 14:53 ` Michal Moskal 2003-10-08 14:48 ` Pierre Weis 0 siblings, 1 reply; 6+ messages in thread From: Michal Moskal @ 2003-10-02 14:53 UTC (permalink / raw) To: Ram Bhamidipaty; +Cc: caml-list [-- Attachment #1: Type: text/plain, Size: 2415 bytes --] On Thu, Oct 02, 2003 at 05:31:22AM -0700, Ram Bhamidipaty wrote: > Someone suggested that my previous question about dynamically resizing > arrays hinted that my solution may be going in a non-functional > direction. That might be true. > > So here is the problem I am trying to solve. I would like to solve it > in a "functional way". Use Map. You won't notice a difference. [...] > I have already written a python application that can read this data file and > generate the data I am looking for. I was not happy with the python solution > because it was not very fast. Even with using a heap to store the top 100 > largest files I was not able to create a python solution that could beat > a "grep | sort" pipeline (on unix of course). [...] ./stats < /shm/d 0.11s user 0.00s system 78% cpu 0.139 total grep '^F' /shm/d 0.02s user 0.00s system 22% cpu 0.090 total sort -n -t' ' -k2,2 0.32s user 0.05s system 71% cpu 0.517 total tail -100 0.02s user 0.00s system 3% cpu 0.514 total 86287 263948 1912177 /shm/d Did I win something? ;-) [...] > My goal of implementing this in OCaml is to beat the grep+sort > combination. If I can create a solution that can output all the > date I want - in one pass - AND still be faster than the grep+sort > partial solution - _that_ would be cool! > > Having said all that - I wanted to use an array to hold the data for > each directory. I hoped that using an array would be faster than a > hash table since I know that the directory numbers are assigned > sequentially. Map! Map! Stop thinking in such an imperative way ;-) Key points of my implementation: 1. It also uses heap (simple one from pure-fun) 2. It is compiled with ocamlopt (ocamlc version is 6 times slower) 3. It doesn't use Scanf. For such linear task as this Scanf takes 10 or more times to parse input then actual computations. 4. There is little trick, that heap holds 101 elements, and always discards last one, so we don't need to check if newly inserted element is really bigger then min we just deleted. 5. The code isn't very pretty (it would be better to make heap functor with item count, and so on). It is left as an exercise to the reader. You're free to use it. I also attach program I used to create input. -- : Michal Moskal :: http://www.kernel.pl/~malekith : GCS {C,UL}++++$ a? !tv : When in doubt, use brute force. -- Ken Thompson : {E-,w}-- {b++,e}>+++ h [-- Attachment #2: stats.ml --] [-- Type: text/plain, Size: 4105 bytes --] exception Empty module Heap (Element : Map.OrderedType) = struct module Elem = Element type heap = E | T of int * Elem.t * heap * heap let rank = function E -> 0 | T (r,_,_,_) -> r let makeT x a b = if rank a >= rank b then T (rank b + 1, x, a, b) else T (rank a + 1, x, b, a) let empty = E let is_empty h = h = E let rec merge h1 h2 = match h1, h2 with | _, E -> h1 | E, _ -> h2 | T (_, x, a1, b1), T (_, y, a2, b2) -> if Elem.compare x y <= 0 then makeT x a1 (merge b1 h2) else makeT y a2 (merge h1 b2) let insert x h = merge (T (1, x, E, E)) h let find_min = function E -> raise Empty | T (_, x, _, _) -> x let delete_min = function E -> raise Empty | T (_, x, a, b) -> merge a b let flatten x = let rec loop acc = function | E -> acc | T (_, x, a, b) -> loop (loop (x :: acc) a) b in loop [] x end module Dir = struct type t = { size : int; name : string; parent : t option; no : int; } let compare x y = x.size - y.size end module Int = struct type t = int let compare x y = x - y end module File = struct type t = { size : int; dir : Dir.t; name : string; } let compare x y = x.size - y.size end module Dir_set = Map.Make(Int) module Dir_heap = Heap(Dir) module File_heap = Heap(File) type line = | D of int * int * string | F of int * string | Eof let limit = 100 let read () = try let l = read_line () in let rec scan_int pos = match l.[pos] with | '0' .. '9' -> scan_int (pos + 1) | _ -> pos in let int_from pos = let xend = scan_int pos in (int_of_string (String.sub l pos (xend - pos)), xend + 1) in match l.[0] with | 'D' -> let (x, p) = int_from 2 in let (y, p) = int_from p in let s = String.sub l p (String.length l - p) in D (x, y, s) | 'F' -> let (x, p) = int_from 2 in let s = String.sub l p (String.length l - p) in F (x, s) | _ -> invalid_arg "read" with End_of_file -> Eof let rec loop dh fh ds cur_dir = let push_dir () = let (dh_sz, dh) = dh in if dh_sz > limit then (dh_sz, Dir_heap.insert cur_dir (Dir_heap.delete_min dh)) else (dh_sz + 1, Dir_heap.insert cur_dir dh) in function | Eof -> (push_dir (), fh) | D (no, par, nam) -> let cur_dir = { Dir.parent = Some (Dir_set.find par ds); Dir.name = nam; Dir.size = 0; Dir.no = no; } in loop (push_dir ()) fh (Dir_set.add cur_dir.Dir.no cur_dir ds) cur_dir (read ()) | F (sz, name) -> let f = { File.size = sz; File.name = name; File.dir = cur_dir } in let fh = let (fh_sz, fh) = fh in if fh_sz > limit then (fh_sz, File_heap.insert f (File_heap.delete_min fh)) else (fh_sz + 1, File_heap.insert f fh) in loop dh fh ds {cur_dir with Dir.size = cur_dir.Dir.size + sz} (read ()) let main = let start nam = let dir = { Dir.name = nam; Dir.size = 0; Dir.no = 0; Dir.parent = None } in let ds = Dir_set.add 0 dir Dir_set.empty in loop (0, Dir_heap.empty) (0, File_heap.empty) ds dir (read ()) in let l = read_line () in let (dh, fh) = Scanf.sscanf l "R 0 %[^\n]" start in let dir_name d = let rec loop acc = function | { Dir.name = n; Dir.parent = None } -> n :: acc | { Dir.name = n; Dir.parent = Some p } -> loop (n :: acc) p in String.concat "/" (loop [] d) in let print_dir d = Printf.printf "%10d %s\n" d.Dir.size (dir_name d) in let print_file f = Printf.printf "%10d %s/%s\n" f.File.size (dir_name f.File.dir) f.File.name in let dh = let (dh_sz, dh) = dh in if dh_sz > limit then Dir_heap.delete_min dh else dh in Printf.printf "DIRS:\n"; List.iter print_dir (List.sort Dir.compare (Dir_heap.flatten dh)); let fh = let (fh_sz, fh) = fh in if fh_sz > limit then File_heap.delete_min fh else fh in Printf.printf "FILES:\n"; List.iter print_file (List.sort File.compare (File_heap.flatten fh)); [-- Attachment #3: mkd.ml --] [-- Type: text/plain, Size: 695 bytes --] let id = ref 0 let rec go dir_name par_id = Unix.chdir dir_name; incr id; let id = !id in Printf.printf "D %d %d %s\n" id par_id dir_name; let d = Unix.opendir "." in let dirs = ref [] in let rec loop () = let f = Unix.readdir d in let st = Unix.lstat f in let _ = match f.[0], st.Unix.st_kind with | '.', _ -> () | _, Unix.S_DIR -> dirs := f :: !dirs | _, Unix.S_REG -> Printf.printf "F %d %s\n" st.Unix.st_size f | _ -> () in loop () in begin List.iter (fun f -> go f id) (try loop () with End_of_file -> Unix.closedir d; !dirs) end; Unix.chdir ".." let _ = Printf.printf "R 0 .\n"; go "." 0 ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Caml-list] what is the functional way to solve this problem? 2003-10-02 14:53 ` Michal Moskal @ 2003-10-08 14:48 ` Pierre Weis 2003-10-08 16:09 ` Michal Moskal 0 siblings, 1 reply; 6+ messages in thread From: Pierre Weis @ 2003-10-08 14:48 UTC (permalink / raw) To: Michal Moskal; +Cc: ramb, caml-list Hi Michal, [...] > Key points of my implementation: [...] > 3. It doesn't use Scanf. For such linear task as this Scanf takes 10 or > more times to parse input then actual computations. [...] > -- > : Michal Moskal :: http://www.kernel.pl/~malekith : GCS {C,UL}++++$ a? !tv > : When in doubt, use brute force. -- Ken Thompson : {E-,w}-- {b++,e}>+++ h I'm sorry to report that I was so puzzled by your number 3) key point, that I tested your code and a ``purely scanf'' version to see this ``10 or more times to parse'' behaviour; I didn't notice any runtime difference between the two (native code compiled) versions. Am I missing something ? Could you please try to use the following version of your read function, and report the runtime difference between your hand written code ? (BTW, I compiled the 2 programs using ocamlopt -unsafe -inline 9) let read () = try Scanf.bscanf Scanf.Scanning.stdib " %c" (function | 'D' -> Scanf.bscanf Scanf.Scanning.stdib " %d %d %s" (fun x y s -> D (x, y, s)) | 'F' -> Scanf.bscanf Scanf.Scanning.stdib " %d %s" (fun x s -> F (x, s)) | _ -> invalid_arg "read") with End_of_file -> Eof You should also replace (* let l = read_line () in let (dh, fh) = Scanf.sscanf l "R 0 %[^\n]" start in *) by let (dh, fh) = Scanf.bscanf Scanf.Scanning.stdib "R 0 %s@\n" start in Looking forward for reading from you, Pierre Weis INRIA, Projet Cristal, Pierre.Weis@inria.fr, http://pauillac.inria.fr/~weis/ ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Caml-list] what is the functional way to solve this problem? 2003-10-08 14:48 ` Pierre Weis @ 2003-10-08 16:09 ` Michal Moskal 2003-10-08 20:34 ` Pierre Weis 0 siblings, 1 reply; 6+ messages in thread From: Michal Moskal @ 2003-10-08 16:09 UTC (permalink / raw) To: Pierre Weis; +Cc: ramb, caml-list On Wed, Oct 08, 2003 at 04:48:09PM +0200, Pierre Weis wrote: > Hi Michal, > > [...] > > Key points of my implementation: > [...] > > 3. It doesn't use Scanf. For such linear task as this Scanf takes 10 or > > more times to parse input then actual computations. > [...] > > -- > > : Michal Moskal :: http://www.kernel.pl/~malekith : GCS {C,UL}++++$ a? !tv > > : When in doubt, use brute force. -- Ken Thompson : {E-,w}-- {b++,e}>+++ h > > I'm sorry to report that I was so puzzled by your number 3) key point, > that I tested your code and a ``purely scanf'' version to see this > ``10 or more times to parse'' behaviour; I didn't notice any runtime > difference between the two (native code compiled) versions. Am I > missing something ? Hm, with your version it's only 3 times slower: -rw-r--r-- 1 malekith users 1909887 Oct 8 18:02 /shm/d2 ./statsf < /shm/d2 > /dev/null 0.31s user 0.01s system 103% cpu 0.308 total ./stats < /shm/d2 > /dev/null 0.10s user 0.00s system 155% cpu 0.064 total However, your code is not correct: > Could you please try to use the following version of your read > function, and report the runtime difference between your hand written > code ? > > (BTW, I compiled the 2 programs using ocamlopt -unsafe -inline 9) > > let read () = > try > Scanf.bscanf Scanf.Scanning.stdib " %c" (function > | 'D' -> > Scanf.bscanf Scanf.Scanning.stdib " %d %d %s" ----------------------------------------------------^^ It should be [^\n], as filenames can contain spaces. In this case: ./statsf < /shm/d2 > /dev/null 1.56s user 0.04s system 96% cpu 1.654 total Well, it's 15 times slower. All tests were run on athlon 1.56GHz, under pld-linux, ocaml 3.07, ocamlopt -unsafe -inline 9. -- : Michal Moskal :: http://www.kernel.pl/~malekith : GCS {C,UL}++++$ a? !tv : When in doubt, use brute force. -- Ken Thompson : {E-,w}-- {b++,e}>+++ h ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Caml-list] what is the functional way to solve this problem? 2003-10-08 16:09 ` Michal Moskal @ 2003-10-08 20:34 ` Pierre Weis 2003-10-08 21:22 ` scanf (Re: [Caml-list] what is the functional way to solve this problem?) Michal Moskal 0 siblings, 1 reply; 6+ messages in thread From: Pierre Weis @ 2003-10-08 20:34 UTC (permalink / raw) To: Michal Moskal; +Cc: pierre.weis, ramb, caml-list [...] > Hm, with your version it's only 3 times slower: [...] > However, your code is not correct: > > Scanf.bscanf Scanf.Scanning.stdib " %d %d %s" > ----------------------------------------------------^^ > > It should be [^\n], as filenames can contain spaces. In this case: I did not know the complete specification :) However, your proposed patch is rather inefficient: > It should be [^\n], as filenames can contain spaces. In this case: ---------------^^^^^ You should prefer %s@\n, if efficiency is a concern (which since to be the case :) > Well, it's 15 times slower. Could you please let me have access to the data to be able to test the code without bothering you ? Thank you very much indeed for your testbed case, since you exhibit situations where there is room for improvement in the implementation of scanf (or at least there is the necessity to explain what are the efficient pattern constructs for Scanf). > All tests were run on athlon 1.56GHz, under pld-linux, ocaml 3.07, > ocamlopt -unsafe -inline 9. > -- > : Michal Moskal :: http://www.kernel.pl/~malekith : GCS {C,UL}++++$ a? !tv > : When in doubt, use brute force. -- Ken Thompson : {E-,w}-- {b++,e}>+++ h Thanks for the precision. I use a (now old) 700MHZ pentium III, but since we just compare the efficiency of different versions of the same program, I think that the machine hardware kind is not critical. Best regards, Pierre Weis INRIA, Projet Cristal, Pierre.Weis@inria.fr, http://pauillac.inria.fr/~weis/ ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 6+ messages in thread
* scanf (Re: [Caml-list] what is the functional way to solve this problem?) 2003-10-08 20:34 ` Pierre Weis @ 2003-10-08 21:22 ` Michal Moskal 0 siblings, 0 replies; 6+ messages in thread From: Michal Moskal @ 2003-10-08 21:22 UTC (permalink / raw) To: Pierre Weis; +Cc: caml-list On Wed, Oct 08, 2003 at 10:34:40PM +0200, Pierre Weis wrote: > [...] > > Hm, with your version it's only 3 times slower: > [...] > > However, your code is not correct: > > > Scanf.bscanf Scanf.Scanning.stdib " %d %d %s" > > ----------------------------------------------------^^ > > > > It should be [^\n], as filenames can contain spaces. In this case: > > I did not know the complete specification :) Neither did I ;-) I just generated filesystem data from my /usr/share and there was few files that contained spaces in names. It simply more reasonable to assume filenames can contain spaces then to assume they can't. OTOH filenames can as well contain \n... But we are talking about scanf efficiency, so original problem becomes somewhat irrelevant :-) > However, your proposed patch is rather inefficient: > > > It should be [^\n], as filenames can contain spaces. In this case: > ---------------^^^^^ > > You should prefer %s@\n, if efficiency is a concern (which since to be the > case :) I admit I didn't go that throughly through scanf manual. And there seems to be no indication about efficiency of both solutions. But as I understand [] thing would involve creating some flag array for all 256 characters, for each end every scanf call, which has to be slow. Maybe some caching (for cases where [] really needs to be used)? > > Well, it's 15 times slower. > > Could you please let me have access to the data to be able to test the > code without bothering you ? Sure, but along with stats.ml I posted mkd.ml (from make-data). You can run it on your /usr or /usr/share redirecting input to file. If you need *my* test case, please drop me an email. > Thank you very much indeed for your testbed case, since you exhibit > situations where there is room for improvement in the implementation > of scanf (or at least there is the necessity to explain what are the > efficient pattern constructs for Scanf). I generally not use Scanf. This is because I use OCaml mostly for some compilers, preprocessors etc, that need more sophisticated lexing. It is however nice when using OCaml to do some simple and/or scripting tasks. I used Scanf mostly for programming contest, where input data is formated into sequence of numbers or some simple strings. And found it to be fast enough (you know, this is programming contest, they time you, so you have to be fast). But I used only %d and maybe %s, mainly through: let get_int () = Scanf.scanf " %d" (fun x -> x) Which was more natural to me (didn't still convert myself to all-functional-style-I'm-using-only-Haskell :-) -- : Michal Moskal :: http://www.kernel.pl/~malekith : GCS {C,UL}++++$ a? !tv : When in doubt, use brute force. -- Ken Thompson : {E-,w}-- {b++,e}>+++ h ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2003-10-08 21:25 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2003-10-02 12:31 [Caml-list] what is the functional way to solve this problem? Ram Bhamidipaty 2003-10-02 14:53 ` Michal Moskal 2003-10-08 14:48 ` Pierre Weis 2003-10-08 16:09 ` Michal Moskal 2003-10-08 20:34 ` Pierre Weis 2003-10-08 21:22 ` scanf (Re: [Caml-list] what is the functional way to solve this problem?) Michal Moskal
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox