* OC4MC : OCaml for Multicore architectures @ 2009-09-22 21:30 Philippe Wang 2009-09-23 10:53 ` [Caml-list] " Goswin von Brederlow ` (2 more replies) 0 siblings, 3 replies; 64+ messages in thread From: Philippe Wang @ 2009-09-22 21:30 UTC (permalink / raw) To: caml-list This is some additional "noise" about "OCaml for Multicore architectures" (or "Ok with parallel threads GC"). ---------------------------- Dear list, We have implemented an alternative runtime library for OCaml, one that allows threads to compute in parallel on different cores of now widespread CPUs. This project will be presented at IFL 2009 (http://blogs.shu.edu/projects/IFL2009/). A testing version is available online at http://www.algo-prog.info/ocmc/ It works with OCaml 3.10.2 for Linux x86-64bit, and we haven't met any bugs with the latest build (it doesn't *unexpectedly* crash, not yet). Hope you'll enjoy it, -- Mathias Bourgoin, Adrien Jonquet, Emmanuel Chailloux, Benjamin Canou, Philippe Wang ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-22 21:30 OC4MC : OCaml for Multicore architectures Philippe Wang @ 2009-09-23 10:53 ` Goswin von Brederlow 2009-09-23 12:21 ` Jon Harrop 2009-09-24 0:21 ` Jon Harrop 2009-09-24 14:11 ` Dario Teixeira 2009-09-24 18:24 ` David Teller 2 siblings, 2 replies; 64+ messages in thread From: Goswin von Brederlow @ 2009-09-23 10:53 UTC (permalink / raw) To: Philippe Wang; +Cc: caml-list Philippe Wang <philippe.wang@lip6.fr> writes: > This is some additional "noise" about "OCaml for Multicore > architectures" (or "Ok with parallel threads GC"). > ---------------------------- > > Dear list, > > We have implemented an alternative runtime library for OCaml, one that > allows threads to compute in parallel on different cores of now > widespread CPUs. > > This project will be presented at IFL 2009 > (http://blogs.shu.edu/projects/IFL2009/ > ). > > A testing version available online at > http://www.algo-prog.info/ocmc/ > It works with OCaml 3.10.2 for Linux x86-64bit, we haven't met any > bugs with the latest build (it doesn't *unexpectedly* crash, not yet). > > Hope you'll enjoy, > > -- > Mathias Bourgoin, Adrien Jonquet, Emmanuel Chailloux, Benjamin Canou, > Philippe Wang Has anyone tested this yet? Any success stories? Regards, Goswin ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-23 10:53 ` [Caml-list] " Goswin von Brederlow @ 2009-09-23 12:21 ` Jon Harrop 2009-09-23 13:00 ` Jon Harrop 2009-09-24 0:21 ` Jon Harrop 1 sibling, 1 reply; 64+ messages in thread From: Jon Harrop @ 2009-09-23 12:21 UTC (permalink / raw) To: caml-list On Wednesday 23 September 2009 11:53:09 Goswin von Brederlow wrote: > Has anyone tested this yet? Any success stories? It's compiling. :-) -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-23 12:21 ` Jon Harrop @ 2009-09-23 13:00 ` Jon Harrop 2009-09-23 14:26 ` Philippe Wang 0 siblings, 1 reply; 64+ messages in thread From: Jon Harrop @ 2009-09-23 13:00 UTC (permalink / raw) To: caml-list On Wednesday 23 September 2009 13:21:35 Jon Harrop wrote: > On Wednesday 23 September 2009 11:53:09 Goswin von Brederlow wrote: > > Has anyone tested this yet? Any success stories? > > It's compiling. :-) Oops, I just compiled a vanilla OCaml 3.10, and their patch is not currently downloadable. I assume everyone else is thrashing their server instead of writing contentless posts here? :-) -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-23 13:00 ` Jon Harrop @ 2009-09-23 14:26 ` Philippe Wang 0 siblings, 0 replies; 64+ messages in thread From: Philippe Wang @ 2009-09-23 14:26 UTC (permalink / raw) To: caml-list I've updated the download page; it should be more robust to multiple downloads now. Cheers, Philippe Wang ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-23 10:53 ` [Caml-list] " Goswin von Brederlow 2009-09-23 12:21 ` Jon Harrop @ 2009-09-24 0:21 ` Jon Harrop 2009-09-23 23:15 ` Philippe Wang 1 sibling, 1 reply; 64+ messages in thread From: Jon Harrop @ 2009-09-24 0:21 UTC (permalink / raw) To: caml-list On Wednesday 23 September 2009 11:53:09 Goswin von Brederlow wrote: > Has anyone tested this yet? Any success stories? Well, I've used the build.sh script to build a patched OCaml 3.10.2 that identifies itself as: $ ocamlopt -v The Objective Caml native-code compiler, version 3.10.2+patch-ocaml4multicore-20090823 Standard library directory: /home/jdh30/src/ocaml/parallel/oc4mc-20090823/ocaml-3.10.2/../out/lib/ocaml and I've built their tests: $ cd tests $ make matmul.nc ocamlopt -o "matmul.nc" -thread unix.cmxa threads.cmxa graphics.cmxa "matmul.ml" File "matmul.ml", line 25, characters 8-13: Warning Y: unused variable count. File "matmul.ml", line 26, characters 8-16: Warning Y: unused variable last_col. and run them: $ time ./matmul.nc 1000 8 Temp de calcul: utime 38.930433, stime 0.012000, rtime 38.943138 Fatal error: exception Invalid_argument("index out of bounds") real 0m38.974s user 0m38.942s sys 0m0.028s Note the exception that (I think) should have been caught and handled silently. But I cannot get anything to run in parallel. None of the tests use more than one core and my own busy-wait-loops-on-two-threads test also runs only on one core. Any idea what I'm doing wrong? Is there a flag to enable it or something? One possible cause: I'm running in a 64-bit chroot. -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 64+ messages in thread
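For readers who want to repeat the kind of two-thread check mentioned above, here is a minimal sketch. It is not Jon's actual test (that was not posted); the name `spin` and the iteration count are made up for illustration. It assumes only the standard Thread and Unix libraries and compares process CPU time with wall-clock time: a ratio near 1 suggests the two threads were serialized on one core, while a ratio near 2 suggests they really ran in parallel.

```ocaml
(* Hypothetical two-thread busy-loop check (not the test from the thread).
   Build, for example, with the same libraries the test Makefile uses:
     ocamlopt -thread unix.cmxa threads.cmxa busy.ml -o busy *)

(* Pure CPU work with no allocation: counts the odd numbers in 1..n. *)
let spin n =
  let acc = ref 0 in
  for i = 1 to n do
    acc := !acc + (i land 1)
  done;
  !acc

let () =
  let n = 100_000_000 in
  let wall0 = Unix.gettimeofday () in
  let cpu0 = (Unix.times ()).Unix.tms_utime in
  (* Start two identical workers and wait for both of them. *)
  let workers =
    List.map (fun _ -> Thread.create (fun () -> ignore (spin n)) ()) [1; 2]
  in
  List.iter Thread.join workers;
  let rtime = Unix.gettimeofday () -. wall0 in
  let utime = (Unix.times ()).Unix.tms_utime -. cpu0 in
  (* utime is summed over all threads of the process, rtime is elapsed time. *)
  Printf.printf "utime %f, rtime %f, utime/rtime %.2f\n"
    utime rtime (utime /. rtime)
```

Running the same source once through `make program.nc` and once through `make program.th` (as explained in the reply below) should make the difference between the two runtimes visible in the last column.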
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 0:21 ` Jon Harrop @ 2009-09-23 23:15 ` Philippe Wang 2009-09-24 0:05 ` Jon Harrop 2009-09-24 14:57 ` Philippe Wang 0 siblings, 2 replies; 64+ messages in thread From: Philippe Wang @ 2009-09-23 23:15 UTC (permalink / raw) To: Jon Harrop; +Cc: caml-list "make program.nc" uses the original ocamlopt. "make program.th" uses the newly built ocamlopt with the necessary options (lib links). You can then compare program.nc and program.th. On Thu, Sep 24, 2009 at 2:21 AM, Jon Harrop <jon@ffconsultancy.com> wrote: > On Wednesday 23 September 2009 11:53:09 Goswin von Brederlow wrote: >> Has anyone tested this yet? Any success stories? > > Well, I've used the build.sh script to build a patched OCaml 3.10.2 that > identifies itself as: > > $ ocamlopt -v > The Objective Caml native-code compiler, version > 3.10.2+patch-ocaml4multicore-20090823 > Standard library > directory: /home/jdh30/src/ocaml/parallel/oc4mc-20090823/ocaml-3.10.2/../out/lib/ocaml > > and I've built their tests: > > $ cd tests > $ make matmul.nc > ocamlopt -o "matmul.nc" -thread unix.cmxa threads.cmxa > graphics.cmxa "matmul.ml" > File "matmul.ml", line 25, characters 8-13: > Warning Y: unused variable count. > File "matmul.ml", line 26, characters 8-16: > Warning Y: unused variable last_col. > > and run them: > > $ time ./matmul.nc 1000 8 > Temp de calcul: utime 38.930433, stime 0.012000, rtime 38.943138 > Fatal error: exception Invalid_argument("index out of bounds") > > real 0m38.974s > user 0m38.942s > sys 0m0.028s > > Note the exception that (I think) should have been caught and handled > silently. > > But I cannot get anything to run in parallel. None of the tests use more than > one core and my own busy-wait-loops-on-two-threads test also runs only on one > core. Any idea what I'm doing wrong? Is there a flag to enable it or > something? > > One possible cause: I'm running in a 64-bit chroot. > > -- > Dr Jon Harrop, Flying Frog Consultancy Ltd. 
> http://www.ffconsultancy.com/?e > -- Philippe Wang mail@philippewang.info ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-23 23:15 ` Philippe Wang @ 2009-09-24 0:05 ` Jon Harrop 2009-09-24 0:01 ` Philippe Wang 2009-09-24 14:57 ` Philippe Wang 1 sibling, 1 reply; 64+ messages in thread From: Jon Harrop @ 2009-09-24 0:05 UTC (permalink / raw) To: Philippe Wang, caml-list On Thursday 24 September 2009 00:15:14 you wrote: > make program.nc uses original ocamlopt > > make program.th uses the newly built ocamlopt with the necessary > options (lib links) > > then you can compare program.nc and program.th Aha! Progress, but now I get errors: $ make matmul.th ../out/bin/ocamlopt -ccopt -march=native -ccopt -mtune=native -ccopt -O4 -I /home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../out/lib/ocaml/ -I /home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../runtime/gcs/sc_par /home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../runtime/gcs/sc_par/gci.o -cclib -lgc -cclib -g -thread unix.cmxa threads.cmxa graphics.cmxa -verbose -compact -rectypes -inline 100 -fno-PIC -cclib -lunix -cclib -lpthread "matmul.ml" -o "matmul.th" File "matmul.ml", line 25, characters 8-13: Warning Y: unused variable count. File "matmul.ml", line 26, characters 8-16: Warning Y: unused variable last_col. 
+ as -o matmul.o /tmp/camlasm081590.s + as -o /tmp/camlstartupdac3e2.o /tmp/camlstartup8f7152.s + gcc -o 'matmul.th' -I'/home/jdh30/src/ocaml/parallel/oc4mc-20090823/ocaml-3.10.2/../out/lib/ocaml' -march=native -mtune=native -O4 '/tmp/camlstartupdac3e2.o' '/home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../out/lib/ocaml/std_exit.o' 'matmul.o' '/home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../out/lib/ocaml/graphics.a' '/home/jdh30/src/ocaml/parallel/oc4mc-20090823/ocaml-3.10.2/../out/lib/ocaml/threads/threads.a' '/home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../out/lib/ocaml/unix.a' '/home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../out/lib/ocaml/stdlib.a' '-L/home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../out/lib/ocaml/' '-L/home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../runtime/gcs/sc_par' '-L/home/jdh30/src/ocaml/parallel/oc4mc-20090823/ocaml-3.10.2/../out/lib/ocaml/threads' '-L/home/jdh30/src/ocaml/parallel/oc4mc-20090823/ocaml-3.10.2/../out/lib/ocaml' '-lgraphics' '-lX11' '-lthreadsnat' '-lunix' '-lpthread' '-lunix' '/home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../runtime/gcs/sc_par/gci.o' '-lgc' '-g' '-lunix' '-lpthread' '/home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../out/lib/ocaml/libasmrun.a' -lm -ldl /home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../out/lib/ocaml/libasmrun.a(memory.o): In function `gc_end_roots': memory.c:(.text+0x10): multiple definition of `gc_end_roots' /home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../runtime/gcs/sc_par/gci.o:/home/jdh30/src/ocaml/parallel/oc4mc-20090823/runtime/gcs/sc_par/gci.c:948: first defined here /home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../out/lib/ocaml/libasmrun.a(memory.o): In function `gc_begin_roots': memory.c:(.text+0x12): multiple definition of `gc_begin_roots' /home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../runtime/gcs/sc_par/gci.o:/home/jdh30/src/ocaml/parallel/oc4mc-20090823/runtime/gcs/sc_par/gci.c:947: first defined here 
/home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../out/lib/ocaml/libasmrun.a(finalise.o): In function `caml_final_do_strong_roots': finalise.c:(.text+0x0): multiple definition of `caml_final_do_strong_roots' /home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../runtime/gcs/sc_par/gci.o:/home/jdh30/src/ocaml/parallel/oc4mc-20090823/runtime/gcs/sc_par/gci.c:301: first defined here /home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../runtime/gcs/sc_par/gci.o: In function `stop_the_world': gci.c:(.text+0x38e): undefined reference to `caml_all_threads' gci.c:(.text+0x403): undefined reference to `caml_all_threads' gci.c:(.text+0x410): undefined reference to `caml_all_threads' gci.c:(.text+0x48a): undefined reference to `caml_all_threads' /home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../runtime/gcs/sc_par/gci.o: In function `resume_the_world': gci.c:(.text+0x4c4): undefined reference to `caml_all_threads' /home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../runtime/gcs/sc_par/gci.o:gci.c: (.text+0x57c): more undefined references to `caml_all_threads' follow /home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../runtime/gcs/sc_par/gci.o: In function `termination_action': gci.c:(.text+0x1e94): undefined reference to `remove_thread_from_list' /home/jdh30/src/ocaml/parallel/oc4mc-20090823/tests/../runtime/gcs/sc_par/gci.o: In function `gc_terminate_local': gci.c:(.text+0x1fe5): undefined reference to `remove_thread_from_list' collect2: ld returned 1 exit status Error during linking make: *** [matmul.th] Error 2 -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 0:05 ` Jon Harrop @ 2009-09-24 0:01 ` Philippe Wang 2009-09-24 1:47 ` Jon Harrop 0 siblings, 1 reply; 64+ messages in thread From: Philippe Wang @ 2009-09-24 0:01 UTC (permalink / raw) To: Jon Harrop; +Cc: caml-list Ok... well, I guess that either - something about your environment is too different from ours (in which case build.sh is bad), or - your installation is corrupted (for example by a bad PATH value that mixes up the original ocamlopt with the oc4mc ocamlopt). What I suggest is to use a default PATH (without modifying it for the purpose of OC4MC), and do these steps in a clean directory that is not included in PATH: 1) wget oc4mc-2009XXXX.tgz 2) tar xzf oc4mc-2009XXXX.tgz 3) cd oc4mc-2009XXXX 4) wget ocaml 3.10.2 (tar.gz or tar.bz2) 5) bash build.sh ... wait 6) cd tests 7) make matmul.th 8) time ./matmul.th 1000 8 Sorry it's messy; we are thinking about something cleaner... (there's a matter of lack of time somewhere) cheers, -- Philippe Wang mail@philippewang.info On Thu, Sep 24, 2009 at 2:05 AM, Jon Harrop <jon@ffconsultancy.com> wrote: > On Thursday 24 September 2009 00:15:14 you wrote: >> make program.nc uses original ocamlopt >> >> make program.th uses the newly built ocamlopt with the necessary >> options (lib links) >> >> then you can compare program.nc and program.th > > Aha! 
Progress, but now I get errors: > > $ make matmul.th > [...] > collect2: ld returned 1 exit status > Error during linking > make: *** [matmul.th] Error 2 > > -- > Dr Jon Harrop, Flying Frog Consultancy Ltd. > http://www.ffconsultancy.com/?e > ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 0:01 ` Philippe Wang @ 2009-09-24 1:47 ` Jon Harrop 2009-09-24 9:49 ` Richard Jones ` (2 more replies) 0 siblings, 3 replies; 64+ messages in thread From: Jon Harrop @ 2009-09-24 1:47 UTC (permalink / raw) To: Philippe Wang, caml-list On Thursday 24 September 2009 01:01:58 you wrote: > Ok... well, I guess that > - whether it is something about your environment that is too different > from ours (in which case build.sh is bad), > - whether you have corrupted your installation (it could be by having > a bad PATH value that makes original ocamlopt be mixed up with oc4mc > ocamlopt) > > What I suggest is to use a default PATH (without modifying it for the > purpose of OC4MC), and do these steps in a clean directory that is not > included in PATH : > > 1) wget oc4mc-2009XXXX.tgz > 2) tar xzf oc4mc-2009XXXX.tgz > 3) cd oc4mc-2009XXXX > 4) wget ocaml 3.10.2 (tar.gz or tar.bz2) > 5) bash build.sh > 6) cd tests > 7) make matmul.th > 8) time ./matmul.th 1000 8 > > Sorry it's messy, we are thinking about something cleaner... (there's > a matter of lack of time somewhere) No problem. I'll be happy to get anything working! Following your advice, it seems to work perfectly now: $ ./matmul.th 500 1 Temp de calcul: utime 2.324145, stime 0.020001, rtime 2.325608 $ ./matmul.th 500 2 Temp de calcul: utime 1.780111, stime 0.000000, rtime 0.890797 $ ./matmul.th 500 3 Temp de calcul: utime 1.784111, stime 0.004000, rtime 0.608895 $ ./matmul.th 500 4 Temp de calcul: utime 1.764110, stime 0.004000, rtime 0.451214 $ ./matmul.th 500 5 Temp de calcul: utime 1.768111, stime 0.000000, rtime 0.393285 $ ./matmul.th 500 6 Temp de calcul: utime 1.924120, stime 0.004001, rtime 0.333215 $ ./matmul.th 500 7 Temp de calcul: utime 1.788112, stime 0.000000, rtime 0.302328 $ ./matmul.th 500 8 Temp de calcul: utime 1.992124, stime 0.000000, rtime 0.290383 Wow! 2.6x faster on 2 cores is good. ;-) That's a really fantastic piece of work. 
I'll do my best to study it and write literature about it. May I ask, can you give a rough overview of the design? For example, is there a separate nursery per thread so each thread can allocate a certain amount before incurring a global pause? Do you have any ideas for libraries built on top of this, such as a task parallel library using work-stealing deques? Thanks very much!!! -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 64+ messages in thread
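The "2.6x" figure can be reproduced from the rtime column of the runs above. The sketch below is only arithmetic on those reported numbers; the helper names `speedup` and `efficiency` are made up for illustration. Speedup is taken as rtime with 1 thread divided by rtime with n threads, and efficiency as speedup divided by n. One thing the raw numbers also show: the 1-thread run used about 2.32 s of utime while the multi-thread runs used about 1.8 s in total, so the baseline itself looks slower than expected, which would inflate every speedup computed against it.

```ocaml
(* (threads, rtime in seconds), copied from the matmul.th runs quoted above. *)
let runs =
  [ 1, 2.325608; 2, 0.890797; 3, 0.608895; 4, 0.451214;
    5, 0.393285; 6, 0.333215; 7, 0.302328; 8, 0.290383 ]

(* Speedup relative to the single-thread wall-clock time [base]. *)
let speedup ~base t = base /. t

(* Fraction of ideal linear scaling; values above 1.0 are superlinear. *)
let efficiency ~base ~threads t = speedup ~base t /. float_of_int threads

let () =
  let base = List.assoc 1 runs in
  List.iter
    (fun (n, t) ->
       Printf.printf "%d thread(s): speedup %.2fx, efficiency %.2f\n"
         n (speedup ~base t) (efficiency ~base ~threads:n t))
    runs
```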
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 1:47 ` Jon Harrop @ 2009-09-24 9:49 ` Richard Jones 2009-09-24 10:00 ` rixed ` (2 more replies) 2009-09-24 10:00 ` kcheung 2009-09-24 12:14 ` Philippe Wang 2 siblings, 3 replies; 64+ messages in thread From: Richard Jones @ 2009-09-24 9:49 UTC (permalink / raw) To: Jon Harrop; +Cc: Philippe Wang, caml-list On Thu, Sep 24, 2009 at 02:47:17AM +0100, Jon Harrop wrote: > Wow! 2.6x faster on 2 cores is good. ;-) Isn't that impossible? Or is the multicore GC better than the single threaded one? (Sorry if this is a stupid or obvious question) Rich. -- Richard Jones Red Hat ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 9:49 ` Richard Jones @ 2009-09-24 10:00 ` rixed 2009-09-24 10:40 ` Florian Hars 2009-09-24 11:45 ` Jon Harrop 2 siblings, 0 replies; 64+ messages in thread From: rixed @ 2009-09-24 10:00 UTC (permalink / raw) To: caml-list > > Wow! 2.6x faster on 2 cores is good. ;-) > > Isn't that impossible? Or is the multicore GC better than the single > threaded one? (Sorry if this is a stupid or obvious question) There are so many factors that make the running time unpredictable that nothing is surprising any more. Haven't you read this paper [1] about the length of an environment variable causing a program to be 10% faster or slower? :) [1]: http://www-plan.cs.colorado.edu/diwan/asplos09.pdf ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 9:49 ` Richard Jones 2009-09-24 10:00 ` rixed @ 2009-09-24 10:40 ` Florian Hars 2009-09-24 11:45 ` Jon Harrop 2 siblings, 0 replies; 64+ messages in thread From: Florian Hars @ 2009-09-24 10:40 UTC (permalink / raw) To: Richard Jones; +Cc: Jon Harrop, caml-list, Philippe Wang Richard Jones wrote: > On Thu, Sep 24, 2009 at 02:47:17AM +0100, Jon Harrop wrote: >> Wow! 2.6x faster on 2 cores is good. ;-) > > Isn't that impossible? Or is the multicore GC better than the single > threaded one? (Sorry if this is a stupid or obvious question) It might just happen that the size of the working set and the memory access pattern of the application are just right, so that you get a better interleaving of cache misses and thread execution if you run more than two threads on two cores. Hyperthreading might muddle things further. - Florian ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 9:49 ` Richard Jones 2009-09-24 10:00 ` rixed 2009-09-24 10:40 ` Florian Hars @ 2009-09-24 11:45 ` Jon Harrop 2 siblings, 0 replies; 64+ messages in thread From: Jon Harrop @ 2009-09-24 11:45 UTC (permalink / raw) To: caml-list On Thursday 24 September 2009 10:49:43 Richard Jones wrote: > On Thu, Sep 24, 2009 at 02:47:17AM +0100, Jon Harrop wrote: > > Wow! 2.6x faster on 2 cores is good. ;-) > > Isn't that impossible? Or is the multicore GC better than the single > threaded one? (Sorry if this is a stupid or obvious question) Superlinear scaling is entirely possible because more cores can mean more cache in play. However, I have only seen superlinear scaling on AMD hardware and not Intel hardware. -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 1:47 ` Jon Harrop 2009-09-24 9:49 ` Richard Jones @ 2009-09-24 10:00 ` kcheung 2009-09-24 11:52 ` Jon Harrop 2009-09-24 12:14 ` Philippe Wang 2 siblings, 1 reply; 64+ messages in thread From: kcheung @ 2009-09-24 10:00 UTC (permalink / raw) To: caml-list > On Thursday 24 September 2009 01:01:58 you wrote: > No problem. I'll be happy to get anything working! > > Following your advice, it seems to work perfectly now: I'm not too familiar with concurrency in ocaml. How does OC4MC compare with JoCaml? > > $ ./matmul.th 500 1 > Temp de calcul: utime 2.324145, stime 0.020001, rtime 2.325608 > $ ./matmul.th 500 2 > Temp de calcul: utime 1.780111, stime 0.000000, rtime 0.890797 > $ ./matmul.th 500 3 > Temp de calcul: utime 1.784111, stime 0.004000, rtime 0.608895 > $ ./matmul.th 500 4 > Temp de calcul: utime 1.764110, stime 0.004000, rtime 0.451214 > $ ./matmul.th 500 5 > Temp de calcul: utime 1.768111, stime 0.000000, rtime 0.393285 > $ ./matmul.th 500 6 > Temp de calcul: utime 1.924120, stime 0.004001, rtime 0.333215 > $ ./matmul.th 500 7 > Temp de calcul: utime 1.788112, stime 0.000000, rtime 0.302328 > $ ./matmul.th 500 8 > Temp de calcul: utime 1.992124, stime 0.000000, rtime 0.290383 > > Wow! 2.6x faster on 2 cores is good. ;-) > > That's a really fantastic piece of work. I'll do my best to study it and > write > literature about it. May I ask, can you give a rough overview of the > design? > For example, is there a separate nursery per thread so each thread can > allocate a certain amount before incurring a global pause? Do you have any > ideas for libraries built on top of this, such as a task parallel library > using work-stealing deques? > > Thanks very much!!! > > -- > Dr Jon Harrop, Flying Frog Consultancy Ltd. > http://www.ffconsultancy.com/?e > ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 10:00 ` kcheung @ 2009-09-24 11:52 ` Jon Harrop 2009-09-24 11:55 ` Rakotomandimby Mihamina ` (3 more replies) 0 siblings, 4 replies; 64+ messages in thread From: Jon Harrop @ 2009-09-24 11:52 UTC (permalink / raw) To: caml-list On Thursday 24 September 2009 11:00:57 kcheung@math.carleton.ca wrote: > > On Thursday 24 September 2009 01:01:58 you wrote: > > > > No problem. I'll be happy to get anything working! > > > > Following your advice, it seems to work perfectly now: > > I'm not too familiar with concurrency in ocaml. > How does OC4MC compare with JoCaml? JoCaml is all about concurrency: minimizing latency. Oc4mc is all about parallelism: maximizing throughput. Until now, OCaml sucked at parallelism. You can sometimes obtain some parallelism by forking processes, but that is asymptotically slower than using shared memory. Consequently, oc4mc is a hugely important development in the OCaml world because it means that OCaml programmers can write OCaml programs that use multicore machines efficiently for the first time. The next steps are to get oc4mc into the apt repositories and build some libraries that make parallelism easier (like Microsoft's Task Parallel Library). -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 64+ messages in thread
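As a rough idea of the smallest building block such a library might start from, here is a sketch of a range-partitioned parallel loop on top of the standard Thread module, applied to a matrix product in the spirit of the matmul test. The names `parallel_for` and `matmul` and the chunking scheme are hypothetical; this is neither the OC4MC test source nor the API of Microsoft's Task Parallel Library. With a stock runtime the code should still compute the right result, only without a speedup; any parallel gain assumes a runtime such as oc4mc that lets threads run on several cores.

```ocaml
(* Hypothetical helper: split the index range [0, n) into [chunks] contiguous
   pieces and run [body lo hi] (hi exclusive) for each piece on its own thread.
   Build with, for example: ocamlopt -thread unix.cmxa threads.cmxa par.ml *)
let parallel_for ~chunks n body =
  let chunks = max 1 (min chunks n) in
  let bound k = k * n / chunks in
  let threads =
    Array.init chunks (fun k ->
      Thread.create (fun () -> body (bound k) (bound (k + 1))) ())
  in
  Array.iter Thread.join threads

(* Dense product c = a * b, parallelised over the rows of c. Each thread
   writes a disjoint set of rows, so no locking is needed. *)
let matmul ~chunks a b =
  let n = Array.length a and m = Array.length b in
  let p = if m = 0 then 0 else Array.length b.(0) in
  let c = Array.make_matrix n p 0.0 in
  parallel_for ~chunks n (fun lo hi ->
    for i = lo to hi - 1 do
      for j = 0 to p - 1 do
        let s = ref 0.0 in
        for k = 0 to m - 1 do
          s := !s +. a.(i).(k) *. b.(k).(j)
        done;
        c.(i).(j) <- !s
      done
    done);
  c
```

A call such as `matmul ~chunks:8 a b` would correspond loosely to the second command-line argument of the test program, on the assumption that this argument is the thread count. A work-stealing scheduler would replace the fixed chunking with dynamically balanced tasks.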
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 11:52 ` Jon Harrop @ 2009-09-24 11:55 ` Rakotomandimby Mihamina 2009-09-24 12:11 ` rixed ` (2 subsequent siblings) 3 siblings, 0 replies; 64+ messages in thread From: Rakotomandimby Mihamina @ 2009-09-24 11:55 UTC (permalink / raw) To: caml-list, debian-ocaml-maint 09/24/2009 02:52 PM, Jon Harrop: > The next steps are to get oc4mc into the apt repositories Amen! ;-) -- Architecte Informatique chez Blueline/Gulfsat: Administration Systeme, Recherche & Developpement +261 34 29 155 34 ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 11:52 ` Jon Harrop 2009-09-24 11:55 ` Rakotomandimby Mihamina @ 2009-09-24 12:11 ` rixed 2009-09-24 15:58 ` Jon Harrop 2009-09-24 12:39 ` Stefano Zacchiroli 2009-09-24 15:36 ` Philippe Wang 3 siblings, 1 reply; 64+ messages in thread From: rixed @ 2009-09-24 12:11 UTC (permalink / raw) To: caml-list > Until now, OCaml sucked at parallelism. (...) OCaml programmers > can write OCaml programs that use multicore machines efficiently > for the first time. Subtle and strongly argued, as expected. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 12:11 ` rixed @ 2009-09-24 15:58 ` Jon Harrop 0 siblings, 0 replies; 64+ messages in thread From: Jon Harrop @ 2009-09-24 15:58 UTC (permalink / raw) To: caml-list On Thursday 24 September 2009 13:11:24 rixed@happyleptic.org wrote: > > Until now, OCaml sucked at parallelism. (...) OCaml programmers > > can write OCaml programs that use multicore machines efficiently > > for the first time. > > Subtle and strongly argumented, as expected. I forgot to mention that multithreaded programming is vastly easier than multi-process programming in the context of parallelism because you get automatic memory management and O(1) communication. -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 64+ messages in thread
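The communication-cost point can be illustrated with the standard Marshal module. The sketch below is an assumption-laden illustration, not a benchmark: `send_by_copy` is a made-up name that stands in for what a value goes through when it is sent to another process (serialize, transfer, rebuild), whereas within one process a second thread can simply be handed the same heap block.

```ocaml
(* Hypothetical stand-in for inter-process transfer: the value is flattened
   to bytes and rebuilt, so the cost grows with the size of the data. *)
let send_by_copy (v : float array) : string * float array =
  let wire = Marshal.to_string v [] in
  (wire, (Marshal.from_string wire 0 : float array))

let () =
  let data = Array.init 1000 float_of_int in
  (* Same process: sharing is just passing the pointer, whatever the size. *)
  let shared = data in
  assert (shared == data);
  (* Separate process: the receiver ends up with an equal but distinct copy. *)
  let wire, copy = send_by_copy data in
  assert (copy = data && copy != data);
  Printf.printf "%d floats -> %d bytes serialized\n"
    (Array.length data) (String.length wire)
```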
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 11:52 ` Jon Harrop 2009-09-24 11:55 ` Rakotomandimby Mihamina 2009-09-24 12:11 ` rixed @ 2009-09-24 12:39 ` Stefano Zacchiroli 2009-09-24 13:09 ` Jon Harrop ` (2 more replies) 2009-09-24 15:36 ` Philippe Wang 3 siblings, 3 replies; 64+ messages in thread From: Stefano Zacchiroli @ 2009-09-24 12:39 UTC (permalink / raw) To: caml-list On Thu, Sep 24, 2009 at 12:52:24PM +0100, Jon Harrop wrote: > The next steps are to get oc4mc into the apt repositories and build Uhm, I'm curious: how do you plan to achieve that? AFAICT the patch is only against 3.10.2, and in Debian we're at 3.11.1. Thus far, we have never had support for more than one version of OCaml at a time. If it were worth it, we could surely consider that, but given the current uncertainty about OC4MC's future it doesn't seem justified. So, the real question is: is OC4MC going to be ported to mainline OCaml and supported in the future or not? If the answer is "no", I don't see it arriving in Debian anytime soon. Cheers. -- Stefano Zacchiroli -o- PhD in Computer Science \ PostDoc @ Univ. Paris 7 zack@{upsilon.cc,pps.jussieu.fr,debian.org} -<>- http://upsilon.cc/zack/ Dietro un grande uomo c'è ..| . |. Et ne m'en veux pas si je te tutoie sempre uno zaino ...........| ..: |.... Je dis tu à tous ceux que j'aime ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 12:39 ` Stefano Zacchiroli @ 2009-09-24 13:09 ` Jon Harrop 2009-09-24 16:49 ` Richard Jones 2009-09-25 15:05 ` Xavier Leroy 2009-09-24 13:40 ` Rakotomandimby Mihamina 2009-09-24 13:55 ` Mike Lin 2 siblings, 2 replies; 64+ messages in thread From: Jon Harrop @ 2009-09-24 13:09 UTC (permalink / raw) To: caml-list On Thursday 24 September 2009 13:39:40 Stefano Zacchiroli wrote: > On Thu, Sep 24, 2009 at 12:52:24PM +0100, Jon Harrop wrote: > > The next steps are to get oc4mc into the apt repositories and build > > Uhm, I'm curious: how do you plan to achieve that? Good question. I have no idea, of course. :-) > AFAICT the patch is only against 3.10.2, and in Debian we're at 3.11.1. Philippe, is it feasible to bring your patches up to date wrt OCaml? > Thus far, we have never had support for more than one version of OCaml > at a time. If it were worth we can surely consider that, but the current > uncertainty about OC4MC future doesn't seem enough to justify that. Fair enough. I think this is the single most important development OCaml has seen since its inception so I would personally drop OCaml in favor of oc4mc even if it meant reverting to 3.10.2. There is also the issue that this is x64 only... > So, the real question is: is OC4MC going to be ported to mainline OCaml > and support in the future or not? If the answer is "no", I don't see it > arriving in Debian anytime soon. Yes, that would be ideal. Pretty please, Xavier? ;-) -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 13:09 ` Jon Harrop @ 2009-09-24 16:49 ` Richard Jones 2009-09-24 16:56 ` Philippe Wang 2009-09-24 21:09 ` Jon Harrop 2009-09-25 15:05 ` Xavier Leroy 1 sibling, 2 replies; 64+ messages in thread From: Richard Jones @ 2009-09-24 16:49 UTC (permalink / raw) Cc: caml-list On Thu, Sep 24, 2009 at 02:09:56PM +0100, Jon Harrop wrote: > Fair enough. I think this is the single most important development OCaml has > seen since its inception so I would personally drop OCaml in favor of oc4mc > even if it meant reverting to 3.10.2. I think 'personally' is the key word there. You forget that people are quite happily programming in very slow languages like Perl, Python, Ruby and Visual Basic, and those people vastly outnumber the ones using F#, Haskell, OCaml, SML etc. (They don't even have static safety, dammit!). Rich. -- Richard Jones Red Hat ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 16:49 ` Richard Jones @ 2009-09-24 16:56 ` Philippe Wang 2009-09-24 17:36 ` Richard Jones 2009-09-24 19:39 ` rixed 2009-09-24 21:09 ` Jon Harrop 1 sibling, 2 replies; 64+ messages in thread From: Philippe Wang @ 2009-09-24 16:56 UTC (permalink / raw) To: Richard Jones; +Cc: Philippe Wang, caml-list On Sep 24, 2009, at 18:49 GMT+02:00, Richard Jones wrote: > On Thu, Sep 24, 2009 at 02:09:56PM +0100, Jon Harrop wrote: >> Fair enough. I think this is the single most important development >> OCaml has >> seen since its inception so I would personally drop OCaml in favor >> of oc4mc >> even if it meant reverting to 3.10.2. > > I think 'personally' is the key word there. You forget that people > are quite happily programming in very slow languages like Perl, > Python, Ruby and Visual Basic, and those people vastly outnumber the > ones using F#, Haskell, OCaml, SML etc. (They don't even have static > safety, dammit!). Should we tell them that using CPU for nothing (side-effect for using a "slow language") has a bad effect on global warming? Could it be a wake-up call? :-p half-kidding, Philippe Wang ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 16:56 ` Philippe Wang @ 2009-09-24 17:36 ` Richard Jones 2009-09-24 19:39 ` rixed 1 sibling, 0 replies; 64+ messages in thread From: Richard Jones @ 2009-09-24 17:36 UTC (permalink / raw) To: Philippe Wang; +Cc: caml-list On Thu, Sep 24, 2009 at 06:56:16PM +0200, Philippe Wang wrote: > On Sep 24, 2009, at 18:49 GMT+02:00, Richard Jones wrote: > > >On Thu, Sep 24, 2009 at 02:09:56PM +0100, Jon Harrop wrote: > >>Fair enough. I think this is the single most important development > >>OCaml has > >>seen since its inception so I would personally drop OCaml in favor > >>of oc4mc > >>even if it meant reverting to 3.10.2. > > > >I think 'personally' is the key word there. You forget that people > >are quite happily programming in very slow languages like Perl, > >Python, Ruby and Visual Basic, and those people vastly outnumber the > >ones using F#, Haskell, OCaml, SML etc. (They don't even have static > >safety, dammit!). > > Should we tell them that using CPU for nothing (side-effect for using > a "slow language") has a bad effect on global warming? Could it be a > wake-up call? :-p I've been telling them ... Rich. -- Richard Jones Red Hat ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 16:56 ` Philippe Wang 2009-09-24 17:36 ` Richard Jones @ 2009-09-24 19:39 ` rixed 1 sibling, 0 replies; 64+ messages in thread From: rixed @ 2009-09-24 19:39 UTC (permalink / raw) To: caml-list > Should we tell them that using CPU for nothing (side-effect for using > a "slow language") has a bad effect on global warming? Could it be a > wake-up call? :-p It also has a bad effect on battery life, but that does not stop them from releasing full software stacks for embedded devices based on these languages :-/ ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 16:49 ` Richard Jones 2009-09-24 16:56 ` Philippe Wang @ 2009-09-24 21:09 ` Jon Harrop 2009-09-24 21:26 ` rixed ` (2 more replies) 1 sibling, 3 replies; 64+ messages in thread From: Jon Harrop @ 2009-09-24 21:09 UTC (permalink / raw) To: caml-list On Thursday 24 September 2009 17:49:33 Richard Jones wrote: > On Thu, Sep 24, 2009 at 02:09:56PM +0100, Jon Harrop wrote: > > Fair enough. I think this is the single most important development OCaml > > has seen since its inception so I would personally drop OCaml in favor of > > oc4mc even if it meant reverting to 3.10.2. > > I think 'personally' is the key word there. You forget that people > are quite happily programming in very slow languages like Perl, > Python, Ruby and Visual Basic, Visual Basic has been a *lot* faster than OCaml for several years now, not least because it makes efficient multicore programming easy. Even Python is beating OCaml on benchmarks now: http://flyingfrogblog.blogspot.com/2009/04/f-vs-ocaml-vs-haskell-hash-table.html Even if that were not the case, the idea of cherry picking interpreted scripting languages to compete with because OCaml has fallen so far behind mainstream languages (let alone modern languages) is embarrassing. What's next, OCaml vs Bash for your high performance needs? > and those people vastly outnumber the ones using F#, Haskell, OCaml, SML > etc. (They don't even have static safety, dammit!). If you want to draw aspirations based upon popularity, look at the most popular languages: Java and C#. They are far more popular than OCaml for many reasons but parallel threads to make efficient multicore programming easy is a big one. -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 21:09 ` Jon Harrop @ 2009-09-24 21:26 ` rixed 2009-09-25 4:07 ` Jacques Garrigue 2009-09-25 8:08 ` Stéphane Glondu 2 siblings, 0 replies; 64+ messages in thread From: rixed @ 2009-09-24 21:26 UTC (permalink / raw) To: caml-list > Visual Basic has been a *lot* faster than OCaml for several years now, not > (...) Even Python (...) Java and C#. They are far more popular than OCaml for many > reasons but parallel threads to make efficient multicore programming easy is > a big one. In general you sound like a reasonable and knowledgeable person, yet in some messages you seem to completely lose contact with reality. Either you have a small kid at home who steals your identity when you are away, or, considering that it always happens when the topic gets close to concurrency or the dotnet platform, you might be suffering in some way. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 21:09 ` Jon Harrop 2009-09-24 21:26 ` rixed @ 2009-09-25 4:07 ` Jacques Garrigue 2009-09-25 7:32 ` Hugo Ferreira ` (2 more replies) 2009-09-25 8:08 ` Stéphane Glondu 2 siblings, 3 replies; 64+ messages in thread From: Jacques Garrigue @ 2009-09-25 4:07 UTC (permalink / raw) To: jon; +Cc: caml-list First, like everybody else, I'd like very much to try this out. Is there any chance it could compile on Snow Leopard :-) (I suppose it's near impossible, but still ask...) From: Jon Harrop <jon@ffconsultancy.com> > Visual Basic has been a *lot* faster than OCaml for several years now, not > least because it makes efficient multicore programming easy. Even Python is > beating OCaml on benchmarks now: > > http://flyingfrogblog.blogspot.com/2009/04/f-vs-ocaml-vs-haskell-hash-table.html IIRC, currently Visual Basic is just a "skin" for C#. You have to write all the types, so it's rather hard to call it "Basic". And yes, MS has invested a lot in the CLR, and that pays. Your benchmark seems strange to me, as you are comparing apples with oranges. Hashtables in Python are a basic feature of the language, and they are of course implemented in C. In ocaml, they are implemented in ocaml (except the hashing function, which has to be polymorphic), using an array of association lists! (Actually the pairs are flattened for better performance, but still) What is impressive is that you don't need any special optimization to get reasonably good performance. Actually the only tuning you need is to start from a reasonable table size, which you didn't (never start from 1, you will have to redo all the hashing every time the table needs to be grown). > Even if that were not the case, the idea of cherry picking interpreted > scripting languages to compete with because OCaml has fallen so far behind > mainstream languages (let alone modern languages) is embarrassing. 
What's > next, OCaml vs Bash for your high performance needs? OCaml was never touted as an HPC language! The only claim I've seen is that it intends to stay within 2x of C for most applications. (Which is not so easy these days, gcc getting much faster.) Actually, I believe that Philippe's point is rather different. Making a functional language work well on multicores is difficult. If I tell you that you just have to modify a bit your program to get a near linear speedup, then it looks great. But in practice it is rather having to rethink completely your algorithm, to eventually get a speedup bounded by bandwidth, and starting from a point lower than the original single thread program. There are applications for that (ray tracing is one), but this is not the kind of needs most people have. By the way, I was discussing with numerical computation people working on BLAS the other day, and their answer was clear: if you need high performance, better use a grid than SMP, since bandwidth is paramount. And you have to write in C or FORTRAN (or asm), because the timing of instructions matter. The funniest part was that those people were working on integer computations, but had to stick to floating point, because timing on integers is unpredictable, making synchronization harder. Cheers, Jacques ^ permalink raw reply [flat|nested] 64+ messages in thread
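Jacques Garrigue's point about the initial table size can be tried directly. The following is only a minimal sketch, not the benchmark discussed in this thread: the helper names `fill` and `time` and the element count are assumptions made for illustration, and the actual timings will depend on the machine and the OCaml version.

```ocaml
(* Fill an int -> int table with n bindings, starting from a given
   initial size. A table created with size 1 has to be grown (and all
   of its bindings rehashed) many times on the way to n elements. *)
let fill ~initial_size n =
  let table = Hashtbl.create initial_size in
  for i = 1 to n do
    Hashtbl.replace table i (i * 2)
  done;
  table

(* Hypothetical timing helper: prints the CPU time used by [f]. *)
let time label f =
  let start = Sys.time () in
  let result = f () in
  Printf.printf "%s: %.3fs\n" label (Sys.time () -. start);
  result

let () =
  let n = 200_000 in
  let small = time "initial size 1" (fun () -> fill ~initial_size:1 n) in
  let sized = time "initial size n" (fun () -> fill ~initial_size:n n) in
  (* Both tables hold the same bindings; only the construction cost differs. *)
  assert (Hashtbl.length small = Hashtbl.length sized)
```

Both calls build the same table; the only difference is how often the implementation has to resize along the way.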
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-25 4:07 ` Jacques Garrigue @ 2009-09-25 7:32 ` Hugo Ferreira 2009-09-25 10:17 ` Jon Harrop 2009-09-25 21:39 ` Gerd Stolpmann 2009-09-25 9:33 ` Philippe Wang 2009-09-25 21:39 ` Jon Harrop 2 siblings, 2 replies; 64+ messages in thread From: Hugo Ferreira @ 2009-09-25 7:32 UTC (permalink / raw) To: caml-list Hello, I tried not to get into this discussion but I could not resist commenting on the following: Jacques Garrigue wrote: >... > ... There are applications for that (ray tracing is > one), but this is not the kind of needs most people have. >... As with most technology, people will or will not use something according to the perceived effort/pleasure of learning/using it and the advantages it is supposed to bring. Put it another way; if parallel/concurrent programming could be easily used with a minimum of effort then I believe "most people" would use it simply because it is available. In other words the (ready) availability of (multi-core PCs and) parallel computing support (in OCaml) will certainly influence the number of people that will take advantage of it simply because it is available (cf. the e-mails on this thread). >... > If I tell you that you just have to modify a bit your program to get a > near linear speedup, then it looks great. But in practice it is rather > having to rethink completely your algorithm, to eventually get a > speedup bounded by bandwidth, and starting from a point lower than the > original single thread program. >... Rethinking our application/algorithmic structure may not be a real deterrent. An application does not require parallel/concurrent processing everywhere. It is really a question of identifying where and when this is useful. Much like selecting the most "appropriate" data-structure for any application. It's not an all or nothing proposition. My 2 cents. Regards, Hugo F. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-25 7:32 ` Hugo Ferreira @ 2009-09-25 10:17 ` Jon Harrop 2009-09-25 13:04 ` kcheung 2009-09-25 21:39 ` Gerd Stolpmann 1 sibling, 1 reply; 64+ messages in thread From: Jon Harrop @ 2009-09-25 10:17 UTC (permalink / raw) To: caml-list On Friday 25 September 2009 08:32:26 Hugo Ferreira wrote: > Put it another way; if parallel/concurrent programming could be > easily used with a minimum of effort then I believe "most people" > would use it simply because it is available. Once your run-time supports it, you just need a library that farms tasks out to threads via queues and a lot of parallelism really is easy. > >... > > If I tell you that you just have to modify a bit your program to get a > > near linear speedup, then it looks great. But in practice it is rather > > having to rethink completely your algorithm, to eventually get a > > speedup bounded by bandwidth, and starting from a point lower than the > > original single thread program. > >... > > Rethinking our application/algorithmic structure may not be a real > deterrent. An application does not require parallel/concurrent > processing everywhere. It is really a question of identifying where > and when this is useful. Much like selecting the most "appropriate" > data-structure for any application. It's not an all or nothing > proposition. Right. Parallelizing programs generally consists of identifying a performance bottleneck via measurement and performing the outermost parallelizable loops in parallel. You can do many more clever things but they are far less common. -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-25 10:17 ` Jon Harrop @ 2009-09-25 13:04 ` kcheung 0 siblings, 0 replies; 64+ messages in thread From: kcheung @ 2009-09-25 13:04 UTC (permalink / raw) To: caml-list > On Friday 25 September 2009 08:32:26 Hugo Ferreira wrote: >> Put it another way; if parallel/concurrent programming could be >> easily used with a minimum of effort then I believe "most people" >> would use it simply because it is available. > > Once your run-time supports it, you just need a library that farms tasks > out > to threads via queues and a lot of parallelism really is easy. I wonder if Snow Leopard's Grand Central Dispatch is of relevance here. But then, it'll be OS-specific. ^ permalink raw reply [flat|nested] 64+ messages in thread
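The "library that farms tasks out to threads via queues" mentioned above can be sketched with the standard threads library (`Thread`, `Mutex`, `Condition`) plus `Queue`. This is a hedged illustration, not OC4MC's API: the names `work_queue`, `worker` and `parallel_map` are hypothetical, `workers` is assumed to be at least 1, and on the stock runtime these threads are interleaved rather than run in parallel, so any speedup would rely on a runtime that actually runs threads on separate cores.

```ocaml
(* Build with the threads library, for instance:
   ocamlfind ocamlopt -package threads.posix -linkpkg pool.ml *)

(* A queue of pending tasks, protected by a mutex. *)
type work_queue = {
  tasks : (unit -> unit) Queue.t;
  lock : Mutex.t;
  nonempty : Condition.t;     (* signalled when a task is added or the queue closes *)
  mutable closed : bool;      (* set once no more tasks will be added *)
}

(* Each worker repeatedly takes a task and runs it outside the lock;
   it returns once the queue is closed and drained. *)
let rec worker q =
  Mutex.lock q.lock;
  while Queue.is_empty q.tasks && not q.closed do
    Condition.wait q.nonempty q.lock
  done;
  if Queue.is_empty q.tasks then Mutex.unlock q.lock
  else begin
    let task = Queue.pop q.tasks in
    Mutex.unlock q.lock;
    task ();
    worker q
  end

(* Apply [f] to every element of [input], one task per element.
   Each task writes only to its own slot of [results]. *)
let parallel_map ~workers f input =
  let results = Array.make (Array.length input) None in
  let q = { tasks = Queue.create (); lock = Mutex.create ();
            nonempty = Condition.create (); closed = false } in
  let threads = Array.init workers (fun _ -> Thread.create worker q) in
  Array.iteri
    (fun i x ->
       Mutex.lock q.lock;
       Queue.push (fun () -> results.(i) <- Some (f x)) q.tasks;
       Condition.signal q.nonempty;
       Mutex.unlock q.lock)
    input;
  Mutex.lock q.lock;
  q.closed <- true;
  Condition.broadcast q.nonempty;
  Mutex.unlock q.lock;
  Array.iter Thread.join threads;
  Array.map
    (function Some y -> y | None -> failwith "task did not complete")
    results
```

A call such as `parallel_map ~workers:4 (fun x -> x * x) [|1; 2; 3|]` parallelizes the outermost loop over the array, which is the pattern described in the message above; finer-grained schemes are possible but need more care.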
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-25 7:32 ` Hugo Ferreira 2009-09-25 10:17 ` Jon Harrop @ 2009-09-25 21:39 ` Gerd Stolpmann 1 sibling, 0 replies; 64+ messages in thread From: Gerd Stolpmann @ 2009-09-25 21:39 UTC (permalink / raw) To: Hugo Ferreira; +Cc: caml-list > Rethinking our application/algorithmic structure may not be a real > deterrent. An application does not require parallel/concurrent > processing everywhere. It is really a question of identifying where > and when this is useful. Much like selecting the most "appropriate" > data-structure for any application. It's not an all or nothing > proposition. Well, if you get many cores for free it sounds logical to get the most out of it. If you have to pay for extra cores, it becomes quickly a bad deal. Imagine you can parallelize 50% of the runtime of the application. Even if you have as many cores as you want, and the runtime of the sped-up part drops to almost 0, the other still-sequential 50% limit the overall improvement to only 50%. (That's known as Amdahl's law, Xavier also mentioned it.) So, especially when you have many cores, it is not the number of cores that limit the speed-up in practice, but the fraction of the algorithm that can be parallelized at all. I'm working for a company that uses Ocaml in a highly parallelized world. We are running it on grid-style compute clusters to process text and symbolic data. We are using multi-processing, which is easy to do with current Ocaml. Programs we write often run on more than 100 cores. Guess what our biggest problem is? Getting all the cores busy. Because there is always also some sequential part, or buggy parallel part that limits the overall throughput. We are constantly searching for these "bottlenecks" as our managers call this phenomenon (and we get a lot of pressure because the company pays a lot for these many cores, and they want to see them utilized). 
We have the big advantage that our data sets are already organized in an easy-to-parallelize way, i.e. you can usually split them up into independent portions, and process them independently (but not always). If you cannot do this (like in a multi-core-capable GC where some part of the heap is always shared by all cores), things quickly become very complicated. So I generally do not expect much from such a GC. We are also using Java with its multi-core GC. However, we are sometimes seeing better performance when we don't scale it to the full number of cores the system has, but also combine it with multi-processing (i.e. start several Java processes). My guess is simply that the GC at some point runs into lock contention, and has to do many things sequentially. So, I'm a professional and massive user of multi-core programming. Nevertheless, my first wish is not to get a multi-core GC for shared-memory parallelism, because I doubt we will ever get a satisfactory solution. My first wish is to make single-threaded execution as fast as possible. The second one is to make RPCs cheaper, especially between processes on the same system (or put it this way: I'd like to see that the processes normally have their private heaps and are fully separated, but also that they can use a shared memory segment by explicitly moving values there - in the direction of Richard's Ancient module - so that it is possible to make an RPC call by moving data to this special segment). Of course, I appreciate any work on multi-core improvements, so applause to Philippe and team. Gerd -- ------------------------------------------------------------ Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de Phone: +49-6151-153855 Fax: +49-6151-997714 ------------------------------------------------------------ ^ permalink raw reply [flat|nested] 64+ messages in thread
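The Amdahl's law argument in Gerd Stolpmann's message can be made concrete with a few lines of OCaml. This is a small illustrative sketch; the function name `amdahl_speedup` and its labelled parameters are assumptions, not something taken from the thread. If a fraction p of the runtime can be parallelized over n cores, the speedup is 1 / ((1 - p) + p / n).

```ocaml
(* Amdahl's law: the sequential part (1 - p) is untouched, the parallel
   part p is divided among the cores. *)
let amdahl_speedup ~parallel_fraction ~cores =
  if parallel_fraction < 0.0 || parallel_fraction > 1.0 then
    invalid_arg "amdahl_speedup: parallel_fraction must be in [0, 1]";
  if cores < 1 then
    invalid_arg "amdahl_speedup: cores must be at least 1";
  1.0 /. ((1.0 -. parallel_fraction)
          +. parallel_fraction /. float_of_int cores)

let () =
  (* With half of the runtime parallelizable, the speedup stays below 2x
     however many cores are added. *)
  List.iter
    (fun cores ->
       Printf.printf "%4d cores: speedup %.2fx\n" cores
         (amdahl_speedup ~parallel_fraction:0.5 ~cores))
    [1; 2; 4; 100]
```

For p = 0.5 the bound is a factor of two, which is the "only 50%" reduction in runtime described above; the fraction that can be parallelized, not the core count, sets the limit.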
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-25 4:07 ` Jacques Garrigue 2009-09-25 7:32 ` Hugo Ferreira @ 2009-09-25 9:33 ` Philippe Wang 2009-09-25 21:39 ` Jon Harrop 2 siblings, 0 replies; 64+ messages in thread From: Philippe Wang @ 2009-09-25 9:33 UTC (permalink / raw) To: Jacques Garrigue; +Cc: Philippe Wang, caml-list On Sep 25, 2009, at 6:07 AM, Jacques Garrigue wrote: > First, like everybody else, I'd like very much to try this out. > Is there any chance it could compile on Snow Leopard :-) > (I suppose it's near impossible, but still ask...) I haven't tried that yet, mostly because I guess that it wouldn't work out-of-the-box. However, the .asm file should be ok with OS X and what may clash are configure file behavior and C macros. I should take a closer look at that, since SL now seems to work well. Cheers, -- Philippe Wang Philippe.Wang@lip6.fr http://www-apr.lip6.fr/~pwang/ ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-25 4:07 ` Jacques Garrigue 2009-09-25 7:32 ` Hugo Ferreira 2009-09-25 9:33 ` Philippe Wang @ 2009-09-25 21:39 ` Jon Harrop 2009-09-26 16:55 ` Jon Harrop 2 siblings, 1 reply; 64+ messages in thread From: Jon Harrop @ 2009-09-25 21:39 UTC (permalink / raw) To: caml-list On Friday 25 September 2009 05:07:21 Jacques Garrigue wrote: > Your benchmark seems strange to me, as you are comparing apples with > oranges. In some sense, yes. I was interested in the performance of the defacto-standard hash table implementations and not the performance that can be obtained by reinventing the wheel. > Hashtables in Python are a basic feature of the language, > and they are of course implemented in C. In ocaml, they are > implemented in ocaml (except the hashing function, which has to be > polymorphic), using an array of association lists! > (Actually the pairs are flattened for better performance, but still) > What is impressive is that you don't need any special optimization to > get reasonably good performance. OCaml is 4x slower than F# on that benchmark for several reasons: 1. Overhead of 31-bit int arithmetic. 2. Lack of constant table sizes in the implementation and OCaml's failure to optimize mod-by-a-constant. 3. No monomorphization. You can write a far more efficient hash table implementation in F# than you can in OCaml because it addressed all of those deficiencies. > Actually the only tuning you need is to start from a reasonable table size, > which you didn't... No, the exact opposite is true: OCaml had the unfair advantage of starting from the optimal table size for the problem whereas F# started from the default size and had to resize. If you level the playing field then OCaml is 8x slower than F#. 
> > Even if that were not the case, the idea of cherry picking interpreted > > scripting languages to compete with because OCaml has fallen so far > > behind mainstream languages (let alone modern languages) is embarrassing. > > What's next, OCaml vs Bash for your high performance needs? > > OCaml was never touted as an HPC language! I started learning OCaml because people were running high performance OCaml code on a 256-CPU supercomputer in Cambridge. I have been touting OCaml for HPC ever since. Thousands of scientists and engineers all over the world have used OCaml for technical computing and chose it precisely because it was competitively performant. > The only claim I've seen is that it intends to stay within 2x of C for most > applications. (Which is not so easy these days, gcc getting much faster.) Yes. The infrastructure for compiler writers is improving rapidly as well though, e.g. LLVM. > Actually, I believe that Philippe's point is rather different. > Making a functional language work well on multicores is difficult. > If I tell you that you just have to modify a bit your program to get a > near linear speedup, then it looks great. But in practice it is rather > having to rethink completely your algorithm, Sure. The free lunch is over. However, the solution usually consists either of spawning independent computations or parallelizing outer loops, both of which can be made very easy by the language implementor. > to eventually get a speedup bounded by bandwidth, For some applications under certain circumstances, yes. > and starting from a point lower than the original single thread program. Yes. > There are applications for that (ray tracing is one), but this is not the > kind of needs most people have. Not the kind of needs the remaining OCaml programmers have, perhaps. Outside the OCaml world, a lot of people are now programming for multicores. 
> By the way, I was discussing with numerical computation people working > on BLAS the other day, and their answer was clear: if you need high > performance, better use a grid than SMP, since bandwidth is > paramount. That is a false dichotomy. Grids are inevitably composed of multicores so you will still lose out if you fail to leverage SMP when programming for a grid. > ...And you have to write in C or FORTRAN (or asm), because the timing of > instructions matter. I have written linear algebra code in F# that outperforms Intel's vendor tuned Fortran (the MKL) by a substantial margin on Intel hardware. Moreover, their code only works on certain types whereas mine is generic. OCaml is an excellent language for this kind of work but it requires an implementation with a performance profile that is very different from OCaml's. > The funniest part was that those people were working on integer > computations, but had to stick to floating point, because timing on integers > is unpredictable, making synchronization harder. Interesting. -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-25 21:39 ` Jon Harrop @ 2009-09-26 16:55 ` Jon Harrop 0 siblings, 0 replies; 64+ messages in thread From: Jon Harrop @ 2009-09-26 16:55 UTC (permalink / raw) To: caml-list On Friday 25 September 2009 22:39:42 Jon Harrop wrote: > On Friday 25 September 2009 05:07:21 Jacques Garrigue wrote: > > Hashtables in Python are a basic feature of the language, > > and they are of course implemented in C. In ocaml, they are > > implemented in ocaml (except the hashing function, which has to be > > polymorphic), using an array of association lists! > > (Actually the pairs are flattened for better performance, but still) > > What is impressive is that you don't need any special optimization to > > get reasonably good performance. > > OCaml is 4x slower than F# on that benchmark... That was mapping int -> int where OCaml has the unfair advantage of optimal initial size. If you map float -> float and give F# an initial size then it is over 18x faster than OCaml. The reason is, of course, OCaml's data representation strategy that is optimized for Xavier's Coq. -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 21:09 ` Jon Harrop 2009-09-24 21:26 ` rixed 2009-09-25 4:07 ` Jacques Garrigue @ 2009-09-25 8:08 ` Stéphane Glondu 2 siblings, 0 replies; 64+ messages in thread From: Stéphane Glondu @ 2009-09-25 8:08 UTC (permalink / raw) To: Jon Harrop; +Cc: caml-list Jon Harrop wrote: > If you want to draw aspirations based upon popularity, look at the most > popular languages: Java and C#. [...] In which world? Do you have references? These languages might be the most commercially {backed,advertised,etc.}, so I guess you are referring to the software industry / Windows world. I don't care (and I think I am not the only one on this mailing-list) about what companies use for their proprietary software! Even (most) Windows users don't! However, there is also the free software world. Here are some concrete numbers (taken from Debian): steph@korell:~$ aptitude search '?tag(implemented-in)'|wc -l 7834 steph@korell:~$ aptitude search '?tag(implemented-in::c)'|wc -l 4011 steph@korell:~$ aptitude search '?tag(implemented-in::perl)'|wc -l 1694 steph@korell:~$ aptitude search '?tag(implemented-in::c\+\+)'|wc -l 1030 steph@korell:~$ aptitude search '?tag(implemented-in::python)'|wc -l 831 steph@korell:~$ aptitude search '?tag(implemented-in::ocaml)'|wc -l 185 steph@korell:~$ aptitude search '?tag(implemented-in::java)'|wc -l 180 steph@korell:~$ aptitude search '?tag(implemented-in::c-sharp)'|wc -l 103 steph@korell:~$ aptitude search '?tag(implemented-in::haskell)'|wc -l 72 Note that the same package might be implemented in several languages (e.g. C and target language for bindings). > [...] They are far more popular than OCaml for many > reasons but parallel threads to make efficient multicore programming easy is > a big one. Hummm... I'm very doubtful about this claim. I would say that the "popularity" you are talking about is more about the commercial backing behind the languages than the languages themselves... 
Best regards, -- Stéphane ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 13:09 ` Jon Harrop 2009-09-24 16:49 ` Richard Jones @ 2009-09-25 15:05 ` Xavier Leroy 2009-09-25 23:26 ` Benjamin Canou 1 sibling, 1 reply; 64+ messages in thread From: Xavier Leroy @ 2009-09-25 15:05 UTC (permalink / raw) To: caml-list Jon Harrop wrote: > On Thursday 24 September 2009 13:39:40 Stefano Zacchiroli wrote: >> On Thu, Sep 24, 2009 at 12:52:24PM +0100, Jon Harrop wrote: >>> The next steps are to get oc4mc into the apt repositories and build >> Uhm, I'm curious: how do you plan to achieve that? > > Good question. I have no idea, of course. :-) That would be suicidal. I definitely do not want to belittle the work of Philippe and his teammates -- what they did is an amazing hack indeed --, but you need to keep in mind the difference between a proof-of-concept experiment and a product. In a proof-of-concept experiment, you implement the feature you want to experiment with and keep everything else as simple as possible (otherwise there is little chance that you'll complete the experiment). That's exactly what Philippe et al did, and rightly so: their GC is about the simplest you can think of, they didn't bother adapting some features of the run-time system, they target AMD64/Unix only, etc. Now they have a platform they can experiment with and make measurements on: mission accomplished. In a product, you'd need something that is essentially a drop-in replacement for today's OCaml and can run, say, Coq with at most a 10% slowdown. That's a long way to go (I'd say a couple of years of work). For example, single-generation stop-and-copy GC is known to have terrible performance (both in running time and in latency) for programs that have large data sets and allocate intensively. This is true in the sequential case and even worse in a stop-the-world parallel setting, by Amdahl's law. 
Note that the programs I mentioned above are exactly those that the Caml user community cares most about -- not matrix multiply nor ray tracers, Harrop's propaganda notwithstanding -- and those for which OCaml has been delivering top-class performance for the last 12 years -- again, Harrop's propaganda notwithstanding. On your way to a product, you'd need independently-collectable generations (which means some work on the compiler as well), plus a parallel or even better concurrent major collector. And of course a lot more work on the runtime system and C interface to make everything truly reentrant while remaining portable. And probably some kind of two-level scheduler for threads. And after all that work you'd end up with an extremely low-level and unsafe parallel programming model that you'd need to tame by developing clever libraries that mere mortals can use effectively (Apple's Grand Central was mentioned on this thread; it's a good example)... In summary, Philippe and his coauthors do deserve a round of applause, but please keep a cool head. - Xavier Leroy ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-25 15:05 ` Xavier Leroy @ 2009-09-25 23:26 ` Benjamin Canou 2009-09-26 0:45 ` kcheung 2009-10-10 4:01 ` Jon Harrop 0 siblings, 2 replies; 64+ messages in thread From: Benjamin Canou @ 2009-09-25 23:26 UTC (permalink / raw) To: caml-list Hi everyone, And let's have a little prayer for Philippe who is now in bed, suffering from his head and hands because of his teammates letting him answer all the mail. Just (half) kidding. So, Xavier Leroy wrote (and probably described the work quite well) : > what they did is an amazing hack [1] > indeed --, but you need to keep in mind the difference between a > proof-of-concept experiment and a product. By reading some messages in this thread I think we need to clarify again the context and goals of OC4MC. One of our main goals for OC4MC is to serve as a parallel and shared memory low-level concurrency implementation, on top of which higher level research concurrency libraries and language extensions can be built. And as most of us agree, multicores, and soon manycores, are hard to program, in particular because of the memory bandwidth. So there probably are experiments to be done to help this at the language level, now that we have this parallel runtime. Moreover, and to answer a question that appeared in this thread, we provide our simple GC, but we separated the GC algorithm from the runtime, so OC4MC is also a low-level playground to experiment with your own GCs and choose the one you want to use at linking. To sum up, let's see OC4MC as an experimentation platform that lifts some restrictions of OCaml, but of course neither as a drop-in replacement for the official distribution nor as the future of OCaml. We do not claim that the ideal solution to bring shared memory parallelism to OCaml is, as OC4MC does, only to replace the runtime (and that INRIA can just replace the official runtime by our hacked one).
However, from a pragmatic (and optimistic) point of view, the modifications to the compiler have been kept very lightweight, yet sufficient to break binary compatibility. So if the excitement continues around OC4MC as in this thread, maybe these modifications could be integrated into the distribution since they really do not touch the core of the compiler and cannot cause a lot of maintenance overhead. I will add that we did not make this experiment to beat F# or Python's hashtables, so I will not comment on that here. The point about performance is that it should be *predictable*. We now have rewritten and debugged most of the memory related behaviors present in the original runtime in a more generic (and OC4MC friendly) way to achieve this, and if it's not the case for some particular cases, we'll be glad to (try to) fix these bugs. On the maintenance side, as Philippe said, we already have some half working version with ocaml 3.11.x, but partly because of the changes made to the native runtime in this release and partly because of [1], porting the patch is not trivial. Cheers and have fun experimenting with OC4MC (so it will compensate the amount of debugging we spent on it ;-) ). Benjamin. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-25 23:26 ` Benjamin Canou @ 2009-09-26 0:45 ` kcheung 2009-09-26 1:53 ` Jon Harrop 2009-10-10 4:01 ` Jon Harrop 1 sibling, 1 reply; 64+ messages in thread From: kcheung @ 2009-09-26 0:45 UTC (permalink / raw) To: caml-list > I will add that we did not make this experiment to beat F# or Python's > hashtables, so I will not comment on that here. The point about > performance is that it should be *predictable*. Perhaps an off-topic and naive question: What does it take to beat F# and still have predictable performance? In any case, OC4MC is very encouraging. Congrats to the team! ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-26 0:45 ` kcheung @ 2009-09-26 1:53 ` Jon Harrop 2009-09-26 13:51 ` kcheung 0 siblings, 1 reply; 64+ messages in thread From: Jon Harrop @ 2009-09-26 1:53 UTC (permalink / raw) To: caml-list On Saturday 26 September 2009 01:45:50 kcheung@math.carleton.ca wrote: > Perhaps an off-topic and naive question: What does it take to beat F# and > still have predictable performance? Provided you're talking about today's machines and don't care about pause times, HLVM with a parallel GC (not unlike the oc4mc one) and a task library would beat F# and still have predictable performance. -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-26 1:53 ` Jon Harrop @ 2009-09-26 13:51 ` kcheung 2009-09-26 14:46 ` Jon Harrop 0 siblings, 1 reply; 64+ messages in thread From: kcheung @ 2009-09-26 13:51 UTC (permalink / raw) To: caml-list > On Saturday 26 September 2009 01:45:50 kcheung@math.carleton.ca wrote: >> Perhaps an off-topic and naive question: What does it take to beat F# >> and >> still have predictable performance? > > Provided you're talking about today's machines and don't care about pause > times, HLVM with a parallel GC (not unlike the oc4mc one) and a task > library > would beat F# and still have predictable performance. If I understand correctly, HLVM is an analog of Microsoft's CLR. So theoretically, one can build a compiler for ocaml that compiles to HLVM. Would that make ocaml beat F#? Kevin. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-26 13:51 ` kcheung @ 2009-09-26 14:46 ` Jon Harrop 0 siblings, 0 replies; 64+ messages in thread From: Jon Harrop @ 2009-09-26 14:46 UTC (permalink / raw) To: caml-list On Saturday 26 September 2009 14:51:21 kcheung@math.carleton.ca wrote: > > On Saturday 26 September 2009 01:45:50 kcheung@math.carleton.ca wrote: > >> Perhaps an off-topic and naive question: What does it take to beat F# > >> and > >> still have predictable performance? > > > > Provided you're talking about today's machines and don't care about > > pause times, HLVM with a parallel GC (not unlike the oc4mc one) and a > > task library > > would beat F# and still have predictable performance. > > If I understand correctly, HLVM is an > analog of Microsoft's CLR. HLVM certainly draws upon ideas from the CLR but it is different in many respects. One important advantage of HLVM over the CLR is that it handles structs correctly in the presence of tail calls (thanks to LLVM). This means that tuples can be represented (in the absence of polymorphic recursion) as unboxed C structs which *greatly* reduces the burden on the garbage collector. HLVM also uses a far superior code generator (LLVM) compared to the CLR and OCaml. > So theoretically, > one can build a compiler for ocaml that > compiles to HLVM. Would that make ocaml > beat F#? That would beat the performance of F# with minimal effort. That was the goal of my HLVM hobby project but I was forced to shelve it when the recession hit. Hopefully I'll get back to it in 2010... -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-25 23:26 ` Benjamin Canou 2009-09-26 0:45 ` kcheung @ 2009-10-10 4:01 ` Jon Harrop 1 sibling, 0 replies; 64+ messages in thread From: Jon Harrop @ 2009-10-10 4:01 UTC (permalink / raw) To: caml-list On Saturday 26 September 2009 00:26:50 Benjamin Canou wrote: > On the maintenance side, as Philippe said, we already have some half > working version with ocaml 3.11.x, but partly because of the changes > made to the native runtime in this release and partly because of [1], > porting the patch is not trivial. OC4MC seems to work very well for numerical problems that do not allocate at all but introducing even the slightest mutation (not even in the inner loop) completely destroys performance and scaling. I'm guessing the reason is that any allocations eventually trigger collections and those are copying the entire heap which, in this case, consists almost entirely of float array arrays. My guess was that using big arrays would alleviate this problem by placing most of the data outside the OCaml heap (I'm guessing that oc4mc leaves the element data of a big array alone and copies only the small reference to it?). However, it does not seem to handle bigarrays: ../out/lib/ocaml//libbigarray.a(bigarray_stubs.o): In function `caml_ba_compare': bigarray_stubs.c:(.text+0x1e5): undefined reference to `caml_compare_unordered' bigarray_stubs.c:(.text+0x28d): undefined reference to `caml_compare_unordered' collect2: ld returned 1 exit status Error during linking If I am correct then I would value functioning bigarrays above OCaml 3.11 support. -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 64+ messages in thread
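For readers unfamiliar with the approach suggested in the message above, here is a minimal sketch of storing a float matrix in a bigarray rather than in a `float array array`. It uses only the standard `Bigarray` module. In the standard OCaml runtime the element data of a bigarray is allocated outside the OCaml heap; whether OC4MC treats it the same way is the open question raised above, and this sketch assumes nothing about it. The function names `make_matrix` and `row_sum` are illustrative only.

```ocaml
(* On old releases such as 3.10 or 3.11 this needs the bigarray library
   at link time; recent releases ship Bigarray in the standard library. *)

(* A rows x cols matrix of 64-bit floats, zero-initialized. Only a small
   proxy block lives in the OCaml heap; the elements are stored in memory
   managed outside it. *)
let make_matrix rows cols =
  let m = Bigarray.Array2.create Bigarray.float64 Bigarray.c_layout rows cols in
  Bigarray.Array2.fill m 0.0;
  m

(* Sum of row [i], reading elements with the .{i, j} indexing syntax. *)
let row_sum m i =
  let s = ref 0.0 in
  for j = 0 to Bigarray.Array2.dim2 m - 1 do
    s := !s +. m.{i, j}
  done;
  !s

let () =
  let m = make_matrix 1000 1000 in
  m.{0, 0} <- 1.5;
  m.{0, 999} <- 2.5;
  Printf.printf "sum of row 0: %g\n" (row_sum m 0)
```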
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 12:39 ` Stefano Zacchiroli 2009-09-24 13:09 ` Jon Harrop @ 2009-09-24 13:40 ` Rakotomandimby Mihamina 2009-09-24 14:22 ` Philippe Wang 2009-09-24 14:49 ` Stefano Zacchiroli 2009-09-24 13:55 ` Mike Lin 2 siblings, 2 replies; 64+ messages in thread From: Rakotomandimby Mihamina @ 2009-09-24 13:40 UTC (permalink / raw) To: caml-list 09/24/2009 03:39 PM, Stefano Zacchiroli: > So, the real question is: is OC4MC going to be ported to mainline OCaml > and support in the future or not? I don't write many programs that would really require multiple cores. But I think this is such a good "feature" that it should be included in the main distribution... -- Architecte Informatique chez Blueline/Gulfsat: Administration Systeme, Recherche & Developpement +261 34 29 155 34 ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 13:40 ` Rakotomandimby Mihamina @ 2009-09-24 14:22 ` Philippe Wang 2009-09-24 14:49 ` Stefano Zacchiroli 1 sibling, 0 replies; 64+ messages in thread From: Philippe Wang @ 2009-09-24 14:22 UTC (permalink / raw) To: Rakotomandimby Mihamina; +Cc: caml-list On Thu, Sep 24, 2009 at 3:40 PM, Rakotomandimby Mihamina <mihamina@gulfsat.mg> wrote: > 09/24/2009 03:39 PM, Stefano Zacchiroli: >> >> So, the real question is: is OC4MC going to be ported to mainline OCaml >> and support in the future or not? > > I don't write many programs that would really require multiple cores. > But I think this is such a good "feature" that it should be included in > the main distribution... Thing is that having a runtime library that supports parallel threads costs more than having a runtime library that doesn't. Programs that take advantage of multicore architectures are not easy to write, not easy to maintain, not easy to debug, ... So "it's a great feature, so it should get into mainstream" is not a good enough reason for INRIA's team. It's probably up to the community to find a great way of taking advantage of multicore architectures. One must be aware that - parallel threads vs non-parallel threads : if a program is well suited to parallel computing on multicore CPUs, then it means that a non-parallel-capable runtime library puts the performance bottleneck at the CPU. Then, allowing parallel threads means *moving* this bottleneck (moving, not removing) : indeed, it's very likely that the bottleneck will then be at memory (RAM) bandwidth. See, if your memory is 1000 MHz, having 8 cores means 125MHz/core, which becomes ridiculous; even if it were 2400MHz, it would mean only 300MHz/core: imagine a 300MHz memory bandwidth for a 3GHz core ! So it's *very* important to keep that in mind.
- for programming languages that are from the very beginning quite a bit slower than INRIA OCaml, it's much easier to gain performance because they come from far, sometimes from very very far. Well, from a quite subjective personal point of view, of course it would be really great to give parallel threads capability to mainstream INRIA OCaml, because it would mean having found a (great) acceptable solution. -- Philippe Wang mail@philippewang.info ^ permalink raw reply [flat|nested] 64+ messages in thread
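The back-of-the-envelope figures in the message above (1000 MHz or 2400 MHz of memory clock shared by 8 cores) can be reproduced with a few lines of OCaml. The even split between cores is the message's own deliberately crude model, not a description of how real memory controllers arbitrate; the function name `per_core_mhz` is illustrative.

```ocaml
(* Crude model from the message above: the memory clock is divided evenly
   between cores that are all hitting memory at the same time. *)
let per_core_mhz ~memory_mhz ~cores =
  if cores <= 0 then invalid_arg "per_core_mhz: cores must be positive";
  memory_mhz / cores

let () =
  List.iter
    (fun (memory_mhz, cores) ->
      Printf.printf "%d MHz memory shared by %d cores: %d MHz per core\n"
        memory_mhz cores
        (per_core_mhz ~memory_mhz ~cores))
    [ (1000, 8); (2400, 8) ]
```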
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 13:40 ` Rakotomandimby Mihamina 2009-09-24 14:22 ` Philippe Wang @ 2009-09-24 14:49 ` Stefano Zacchiroli 1 sibling, 0 replies; 64+ messages in thread From: Stefano Zacchiroli @ 2009-09-24 14:49 UTC (permalink / raw) To: caml-list On Thu, Sep 24, 2009 at 04:40:53PM +0300, Rakotomandimby Mihamina wrote: > I don't write many programs that would really require multiple cores. > But I think this is such a good "feature" that it should be included in > the main distribution... I think you miss what that would mean in terms of effort for maintaining the corresponding packages. De facto, it would mean duplicating all source packages of the libraries you want to be able to build against ocaml 3.10.2 + OC4MC. You want PCRE? then you need two PCRE packages (3.11 and 3.10.2 4MC) You want ocamlnet? then you need two ocamlnet packages You got the picture :-) Additionally, it would also mean supporting in-house potential security problems arising from old versions of the compiler (or even 3rd party libraries when you will be forced to "fork" them due to source-level incompatibilities between versions) without any upstream support. Not fun. Cheers. -- Stefano Zacchiroli -o- PhD in Computer Science \ PostDoc @ Univ. Paris 7 zack@{upsilon.cc,pps.jussieu.fr,debian.org} -<>- http://upsilon.cc/zack/ Dietro un grande uomo c'è ..| . |. Et ne m'en veux pas si je te tutoie sempre uno zaino ...........| ..: |.... Je dis tu à tous ceux que j'aime ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 12:39 ` Stefano Zacchiroli 2009-09-24 13:09 ` Jon Harrop 2009-09-24 13:40 ` Rakotomandimby Mihamina @ 2009-09-24 13:55 ` Mike Lin 2009-09-24 14:52 ` Stefano Zacchiroli 2 siblings, 1 reply; 64+ messages in thread From: Mike Lin @ 2009-09-24 13:55 UTC (permalink / raw) To: caml-list On Thu, Sep 24, 2009 at 8:39 AM, Stefano Zacchiroli <zack@debian.org> wrote: > > So, the real question is: is OC4MC going to be ported to mainline OCaml > and support in the future or not? Recalling how mainline had us waiting like 5 years for native exception backtraces, and then another like 3 years for the ability to access the backtrace within the program, I most certainly hope NOT :) (Nothing personal to INRIA, I work on academic projects and well know how these things go, it's just not the most awesome maintenance schedule for one's main PL) ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 13:55 ` Mike Lin @ 2009-09-24 14:52 ` Stefano Zacchiroli 0 siblings, 0 replies; 64+ messages in thread From: Stefano Zacchiroli @ 2009-09-24 14:52 UTC (permalink / raw) To: caml-list On Thu, Sep 24, 2009 at 09:55:53AM -0400, Mike Lin wrote: > On Thu, Sep 24, 2009 at 8:39 AM, Stefano Zacchiroli <zack@debian.org> wrote: > > So, the real question is: is OC4MC going to be ported to mainline OCaml > > and support in the future or not? > > Recalling how mainline had us waiting like 5 years for native exception > backtraces, and then another like 3 years for the ability to access the > backtrace within the program, I most certainly hope NOT :) > (Nothing personal to INRIA, I work on academic projects and well know how > these things go, it's just not the most awesome maintenance schedule for > one's main PL) But the result you are anticipating will actually mean low acceptance of OC4MC among "common" users, possibly close to 0. All "mainstream" ways of distributing OCaml (both .rpm and .deb distros, GODI, ...) are regularly switching to the most recent versions of the compiler. The only people able to stay on 3.10.2 to benefit from OC4MC will be industries which fixed their development on a specific version and do not plan to change. Or am I missing something? Cheers. -- Stefano Zacchiroli -o- PhD in Computer Science \ PostDoc @ Univ. Paris 7 zack@{upsilon.cc,pps.jussieu.fr,debian.org} -<>- http://upsilon.cc/zack/ Dietro un grande uomo c'è ..| . |. Et ne m'en veux pas si je te tutoie sempre uno zaino ...........| ..: |.... Je dis tu à tous ceux que j'aime ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 11:52 ` Jon Harrop ` (2 preceding siblings ...) 2009-09-24 12:39 ` Stefano Zacchiroli @ 2009-09-24 15:36 ` Philippe Wang 2009-09-24 15:50 ` Jon Harrop 3 siblings, 1 reply; 64+ messages in thread From: Philippe Wang @ 2009-09-24 15:36 UTC (permalink / raw) To: caml-list >> I'm not too familiar with concurrency in ocaml. >> How does OC4MC compare with JoCaml? > > JoCaml is all about concurrency: minimizing latency. Oc4mc is all about > parallelism: maximizing throughput. Maybe a nice thing would be to have both in one piece... -- Philippe Wang mail@philippewang.info ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 15:36 ` Philippe Wang @ 2009-09-24 15:50 ` Jon Harrop 0 siblings, 0 replies; 64+ messages in thread From: Jon Harrop @ 2009-09-24 15:50 UTC (permalink / raw) To: caml-list On Thursday 24 September 2009 16:36:17 Philippe Wang wrote: > >> I'm not too familiar with concurrency in ocaml. > >> How does OC4MC compare with JoCaml? > > > > JoCaml is all about concurrency: minimizing latency. Oc4mc is all about > > parallelism: maximizing throughput. > > Maybe a nice thing would be to have both in one piece... Indeed. I have no idea how well received JoCaml has been but am certain that your work is of huge value. -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 1:47 ` Jon Harrop 2009-09-24 9:49 ` Richard Jones 2009-09-24 10:00 ` kcheung @ 2009-09-24 12:14 ` Philippe Wang 2009-09-24 13:11 ` Jon Harrop 2 siblings, 1 reply; 64+ messages in thread From: Philippe Wang @ 2009-09-24 12:14 UTC (permalink / raw) To: Jon Harrop; +Cc: caml-list On Thu, Sep 24, 2009 at 3:47 AM, Jon Harrop <jon@ffconsultancy.com> wrote: > Following your advice, it seems to work perfectly now: :-) > Wow! 2.6x faster on 2 cores is good. ;-) your machine is more generous than ours (which is Intel, not AMD) :-) > That's a really fantastic piece of work. I'll do my best to study it and write > literature about it. May I ask, can you give a rough overview of the design? > For example, is there a separate nursery per thread so each thread can > allocate a certain amount before incurring a global pause? Do you have any > ideas for libraries built on top of this, such as a task parallel library > using work-stealing deques? A few words on the GC's design (that uses the stop&copy algorithm several times) : Heaps : - a set of pages is used to give threads the possibility to allocate memory without interfering with other threads, so that there is no mutex locking at local memory allocation. Each thread is born with an empty page; when it's full, the thread takes another one. - a big heap is shared between all, there is a mutex over it to prevent parallel memory allocation into this one. Collection : - when there are no pages left, a collection stops-the-world and copies living values (of the pages) to the shared heap - when the shared heap is full, a collection stops-the-world and copies all living values (pages+shared heap) to a new shared heap (which can grow if need be) Special operations : - if there is a blocking operation (e.g. mutex lock or I/O operation), the mechanism is roughly the same as original INRIA OCaml's : it tells the GC that there is no need to stop it when stopping the world.
- if there is a thread with no allocation and no blocking operation, the behaviour is the same as INRIA OCaml. The number of pages, the size of a page, and the size of the shared heap can be changed before running a program by setting some environment variables (cf. the last lines of the README file included in the distribution package). -- Philippe Wang mail@philippewang.info ^ permalink raw reply [flat|nested] 64+ messages in thread
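The heap layout described in the message above (per-thread pages, a shared heap, and two kinds of stop-the-world collection) can be sketched as a small single-threaded simulation in OCaml. This is a toy model under stated assumptions, not OC4MC's code: every type and function name is hypothetical, sizes are counted in abstract words, liveness is approximated by a fixed survival percentage, locking is not modelled, and it is assumed that each thread is handed a fresh page after a collection.

```ocaml
(* Toy model only: all names are hypothetical and sizes are in words. *)
type thread = { mutable room : int }   (* words left in the current page *)

type heap = {
  page_size : int;
  total_pages : int;
  survival_pct : int;                  (* assumed share of live data *)
  mutable free_pages : int;            (* pages not owned by any thread *)
  mutable page_words : int;            (* words allocated in all pages *)
  mutable shared_size : int;
  mutable shared_used : int;
  mutable page_collections : int;
  mutable full_collections : int;
}

let make_heap ~page_size ~total_pages ~shared_size ~survival_pct =
  if page_size <= 0 || shared_size <= 0 then invalid_arg "make_heap";
  { page_size; total_pages; survival_pct; free_pages = total_pages;
    page_words = 0; shared_size; shared_used = 0;
    page_collections = 0; full_collections = 0 }

(* Each thread is born with an empty page. *)
let spawn h =
  if h.free_pages = 0 then failwith "no page left for a new thread";
  h.free_pages <- h.free_pages - 1;
  { room = h.page_size }

(* Stop the world: copy page survivors into the shared heap, or, if they
   do not fit, copy everything live into a new (possibly larger) one. *)
let collect h threads =
  let survivors = h.page_words * h.survival_pct / 100 in
  if h.shared_used + survivors <= h.shared_size then begin
    h.page_collections <- h.page_collections + 1;
    h.shared_used <- h.shared_used + survivors
  end else begin
    h.full_collections <- h.full_collections + 1;
    h.shared_used <- h.shared_used * h.survival_pct / 100 + survivors;
    while h.shared_used > h.shared_size do
      h.shared_size <- 2 * h.shared_size
    done
  end;
  h.page_words <- 0;
  h.free_pages <- h.total_pages - List.length threads;
  List.iter (fun t -> t.room <- h.page_size) threads

(* Allocation path: bump inside the page, else take a page, else collect. *)
let rec alloc h threads t n =
  if n > h.page_size then invalid_arg "alloc: larger than a page";
  if n <= t.room then begin
    t.room <- t.room - n;
    h.page_words <- h.page_words + n
  end else if h.free_pages > 0 then begin
    h.free_pages <- h.free_pages - 1;
    t.room <- h.page_size;
    alloc h threads t n
  end else begin
    collect h threads;
    alloc h threads t n
  end

let () =
  let h = make_heap ~page_size:1024 ~total_pages:8 ~shared_size:4096
            ~survival_pct:10 in
  let threads = [ spawn h; spawn h ] in
  for _i = 1 to 10_000 do
    List.iter (fun t -> alloc h threads t 16) threads
  done;
  Printf.printf "page collections: %d, full collections: %d, shared: %d/%d\n"
    h.page_collections h.full_collections h.shared_used h.shared_size
```

Raising `survival_pct` in this model makes full collections more frequent, which matches the intuition elsewhere in the thread that programs keeping large live data sets suffer most from a copying design.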
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 12:14 ` Philippe Wang @ 2009-09-24 13:11 ` Jon Harrop 2009-09-24 14:51 ` Philippe Wang 0 siblings, 1 reply; 64+ messages in thread From: Jon Harrop @ 2009-09-24 13:11 UTC (permalink / raw) To: caml-list On Thursday 24 September 2009 13:14:35 Philippe Wang wrote: > On Thu, Sep 24, 2009 at 3:47 AM, Jon Harrop <jon@ffconsultancy.com> wrote: > > Following your advice, it seems to work perfectly now: > > > :-) > : > > Wow! 2.6x faster on 2 cores is good. ;-) > > your machine is more generous than ours (which is Intel, not AMD) :-) Yes. I don't know why AMD are so much better at this but I have seen it several times now. > > That's a really fantastic piece of work. I'll do my best to study it and > > write literature about it. May I ask, can you give a rough overview of > > the design? For example, is there a separate nursery per thread so each > > thread can allocate a certain amount before incurring a global pause? Do > > you have any ideas for libraries built on top of this, such as a task > > parallel library using work-stealing deques? > > A few words on the GC's design (that uses stop&copy algorithm several > times) : > > Heaps : > - a set of pages are used to give threads the possibility to allocate > memory without interfering with other threads, such as there is no > mutex locking at local memory allocation. Each thread borns with an > empty page, when it's full, the thread takes another one. > - a big heap is shared between all, there is a mutex over it to > prevent parallel memory allocation into this one. > > Collection : > - when there are no pages left, a collection stops-the-world and > copies living values (of the pages) to the shared heap > - when the shared heap is full, a collection stops-the-world and > copies all living values (pages+shared heap) to a new shared heap > (which can be grow if need be) Ok, so this is stop&copy GC with per-thread nurseries/gen0.
Are values such as float arrays copied in their entirety or are they allocated outside the shared heap and only a pointer to them is copied? Is the copy operation parallelized? Is there a write barrier but no read barrier? If so, what exactly does the write barrier do? > Special operations : > - if there is a blocking operation (e.g. mutex lock or I/O operation), > the mechanism is roughly the same as original INRIA OCaml's : it tells > the GC that there is no need to stop it when stopping the world. Can users mark external calls in their bindings as blocking so the GC will treat them appropriately? > - if there is a thread with no allocation and no blocking operation, > the behaviur is the same as INRIA OCaml. > > The number of pages, the size of a page, and the size of the shared > heap can be changed before running a program by setting some > environment variables (cf. last lines README file included in the > distribution package). Great! -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 13:11 ` Jon Harrop @ 2009-09-24 14:51 ` Philippe Wang 0 siblings, 0 replies; 64+ messages in thread From: Philippe Wang @ 2009-09-24 14:51 UTC (permalink / raw) To: Jon Harrop; +Cc: caml-list On Thu, Sep 24, 2009 at 3:11 PM, Jon Harrop <jon@ffconsultancy.com> wrote: > Are values such as float arrays copied in their entirety or are they allocated > outside the shared heap and only a pointer to them is copied? They should be in a heap (page or shared). We don't allocate many things outside the heaps. > Is the copy operation parallelized? Nope. When the world is stopped for the collection, everything is done sequentially until the world is resumed. I don't think it's relevant to parallelize the copy operation (hell to implement&debug, then I don't think that performance would be very interesting because we would probably need a write mutex on the destination heap) > Is there a write barrier but no read barrier? If so, what exactly does the > write barrier do? There is a lock when a thread is created because we need to update the list of existing threads and we have to give it a page. Then, each time a thread wants memory, it checks if the world needs to be stopped. If the world needs to be stopped, it means that there is a necessary collection waiting for the world to be stopped. There is lock if a thread needs to allocate memory in the shared heap so that two threads don't end up using the same space for different things. If two threads want to write in the same block, it's up to the programmer to prevent (or allow) such a thing with a mutex (or whatever other mechanism). >> Special operations : >> - if there is a blocking operation (e.g. mutex lock or I/O operation), >> the mechanism is roughly the same as original INRIA OCaml's : it tells >> the GC that there is no need to stop it when stopping the world. 
> > Can users mark external calls in their bindings as blocking so the GC will > treat them appropriately? Yes, it's the same as INRIA OCaml : enter_blocking_section / leave_blocking_section functions. It's mandatory that in the section between entrance and exit, the thread is not accessing anything allocated in a Caml heap. If there is need to write some value returned by the blocking operation, it should be written in a C side value (on C stack or with C malloc) and put back to Caml heap after exit (and then C free if C malloced). -- Philippe Wang mail@philippewang.info ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-23 23:15 ` Philippe Wang 2009-09-24 0:05 ` Jon Harrop @ 2009-09-24 14:57 ` Philippe Wang 1 sibling, 0 replies; 64+ messages in thread From: Philippe Wang @ 2009-09-24 14:57 UTC (permalink / raw) To: caml-list I've seen a question about 3.11 and I think I didn't answer, so I'm answering here : We have tried to make OC4MC work with OCaml 3.11 (I don't remember the subsubversion number). Currently, it does not work properly (it's still too easy to write a program that crashes or deadlocks). Cheers, Philippe Wang ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-22 21:30 OC4MC : OCaml for Multicore architectures Philippe Wang 2009-09-23 10:53 ` [Caml-list] " Goswin von Brederlow @ 2009-09-24 14:11 ` Dario Teixeira 2009-09-24 14:38 ` Philippe Wang 2009-09-24 18:24 ` David Teller 2 siblings, 1 reply; 64+ messages in thread From: Dario Teixeira @ 2009-09-24 14:11 UTC (permalink / raw) To: caml-list, Philippe Wang Hi, Cheers for the work you guys put into this project! And I'd like to join the crowd that has questions, if I may: a) If I understand correctly, part of prerequisites for implementing the new GC was cleaning up the excessive use of imperative constructs in the compiler's tree. Will the new tree be also more amenable to the implementation of new language constructs such as GADTs? b) Could you quantify the performance penalty (if any) of using the new GC in a single-thread context? And should this penalty be significant, are there provisions for a compile-time choice of which GC to use? c) Is there an understanding between you and the folks at INRIA concerning the eventual merging of this code into the mainline tree? Thanks a lot for your time! Best regards, Dario Teixeira ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures 2009-09-24 14:11 ` Dario Teixeira @ 2009-09-24 14:38 ` Philippe Wang 2009-09-24 15:20 ` Dario Teixeira 2009-09-24 23:28 ` Jon Harrop 0 siblings, 2 replies; 64+ messages in thread From: Philippe Wang @ 2009-09-24 14:38 UTC (permalink / raw) To: Dario Teixeira; +Cc: caml-list, Philippe Wang On Thu, Sep 24, 2009 at 4:11 PM, Dario Teixeira <darioteixeira@yahoo.com> wrote: > Hi, > > Cheers for the work you guys put into this project! And I'd like to join > the crowd that has questions, if I may: > > a) If I understand correctly, part of prerequisites for implementing the > new GC was cleaning up the excessive use of imperative constructs in > the compiler's tree. Will the new tree be also more amenable to the > implementation of new language constructs such as GADTs? Nope... We wanted not to touch the code generator (or any other part of the compiler). Eventually, we had to modify a very little bit the code generator so that it does not compact too much the generated code. That meant changing less than 10 lines of ml code. > b) Could you quantify the performance penalty (if any) of using the new GC > in a single-thread context? And should this penalty be significant, are > there provisions for a compile-time choice of which GC to use? Very few programs that are not written with multicore in mind would not be penalized. I mean our GC is much much dumber than INRIA OCaml's one. Our goal was to show it was possible to have good performance with multicores for OCaml. Maybe someday we'll find some time to optimize the GC, but it's likely not very soon. > c) Is there an understanding between you and the folks at INRIA concerning > the eventual merging of this code into the mainline tree? Almost same answer as the previous one. We have shown that it's possible to enjoy multicore for performance. The changes over the whole runtime library are not easy to merge into mainstream. 
It is very important to know this : the runtime library is written in C (and a little part is in ASM in order to have better performance... but mainly because of the "foreign function interface" so there is no way to ignore it). Its type system really sucks (compared to OCaml's). When you change a very little part, it will tell you that you were wrong, but not with a hard-to-understand type error message : it will be some tricky dirty segmentation fault, which can sometimes take days or weeks, even months, to take down. I guess that if INRIA decides to implement parallel threads capability, they will have to make the runtime library ready (clean up some global variables, tidy the code like remove compatibility.h and such stuff) before thinking about the GC. This could take some time, because it's not good to break everything at once. Then, if they have finished this step, I would be confident that they could integrate an awesome GC. But that's only my personal opinion... Oh, why they wouldn't take OC4MC? ... If I were them, I wouldn't. We have probably broken some stuff such as Weak or Lazy, so there is no chance to bootstrap with OC4MC. Well, I mean that it's better to change INRIA's OCaml with all the lessons learnt than to try to fix OC4MC such that it's fully compatible with latest version of INRIA OCaml. -- Philippe Wang mail@philippewang.info ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures
From: Dario Teixeira @ 2009-09-24 15:20 UTC
To: Philippe Wang; +Cc: caml-list

Hi,

> Very few programs that are not written with multicore in mind would
> not be penalized. I mean our GC is much much dumber than INRIA OCaml's
> one. Our goal was to show it was possible to have good performance with
> multicores for OCaml. Maybe someday we'll find some time to optimize
> the GC, but it's likely not very soon.

Thanks for the clarification. While not detracting from your work (which I think is very interesting and valuable), for me single-thread performance is still paramount. I am working in a domain (doing backend web application programming using the Ocsigen framework) where multi-threaded parallelism is a bit silly, since you can get much better performance and design simplicity by running multiple independent servers (one for each core). Each server runs multiple concurrent Lwt-threads (a cooperative form of green threads) to make sure the CPU is always busy and not waiting on I/O. This solution has the advantage of requiring no process context-switching within each server, while still maximising CPU utilisation.

And I suspect there are many other fields where a similar approach could be used advantageously instead of thread-based parallelism.

> I guess that if INRIA decides to implement parallel threads capability,
> they will have to make the runtime library ready (clean up some global
> variables, tidy the code like remove compatibility.h and such stuff)
> before thinking about the GC. This could take some time, because it's
> not good to break everything at once. Then, if they have finished this
> step, I would be confident that they could integrate an awesome GC.
> But that's only my personal opinion...

Again, it's a question of whether the benefits justify the cost. Personally, I'm in the camp that would rather see improvements to the type system (like native GADTs!)...

Anyway, keep us apprised of your work. It's very welcome.

Best regards,
Dario Teixeira
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures
From: Jon Harrop @ 2009-09-24 23:28 UTC
To: caml-list

On Thursday 24 September 2009 15:38:06 Philippe Wang wrote:
> Very few programs that are not written with multicore in mind would
> not be penalized.
> I mean our GC is much much dumber than INRIA OCaml's one.
> Our goal was to show it was possible to have good performance with
> multicores for OCaml.
> Maybe someday we'll find some time to optimize the GC, but it's likely
> not very soon.

Just to quantify this with a data point: the fastest (serial) version of my ray tracer benchmark is 10x slower with the new GC. However, this is anomalous with respect to complexity and the relative performance is much better for simpler renderings. For example, the new GC is only 1.7x slower with n=6 instead of n=9.

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures
From: Philippe Wang @ 2009-09-24 23:25 UTC
To: Jon Harrop; +Cc: Philippe Wang, caml-list

On Sep 25, 2009, at 1:28 AM, Jon Harrop wrote:
> On Thursday 24 September 2009 15:38:06 Philippe Wang wrote:
>> Very few programs that are not written with multicore in mind would
>> not be penalized.
>> I mean our GC is much much dumber than INRIA OCaml's one.
>> Our goal was to show it was possible to have good performance with
>> multicores for OCaml.
>> Maybe someday we'll find some time to optimize the GC, but it's
>> likely not very soon.
>
> Just to quantify this with a data point: the fastest (serial) version
> of my ray tracer benchmark is 10x slower with the new GC. However,
> this is anomalous with respect to complexity and the relative
> performance is much better for simpler renderings. For example, the
> new GC is only 1.7x slower with n=6 instead of n=9.

Can you tell what data structures (and their sizes if possible) you are using?

Thanks for your feedback.

-- 
Philippe Wang
Philippe.Wang@lip6.fr
http://www-apr.lip6.fr/~pwang/
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures
From: Philippe Wang @ 2009-09-25 14:11 UTC
To: Jon Harrop; +Cc: caml-list

On Fri, Sep 25, 2009 at 1:28 AM, Jon Harrop <jon@ffconsultancy.com> wrote:
> On Thursday 24 September 2009 15:38:06 Philippe Wang wrote:
>> Very few programs that are not written with multicore in mind would
>> not be penalized.
>> I mean our GC is much much dumber than INRIA OCaml's one.
>> Our goal was to show it was possible to have good performance with
>> multicores for OCaml.
>> Maybe someday we'll find some time to optimize the GC, but it's likely
>> not very soon.
>
> Just to quantify this with a data point: the fastest (serial) version of my
> ray tracer benchmark is 10x slower with the new GC. However, this is
> anomalous with respect to complexity and the relative performance is much
> better for simpler renderings. For example, the new GC is only 1.7x slower
> with n=6 instead of n=9.

I just put up a version (20090925) with a bug fix in the allocation of some structures. I hope it removes this anomaly.

-- 
Philippe Wang
mail@philippewang.info
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures
From: Jon Harrop @ 2009-11-08 18:12 UTC
To: caml-list

On Friday 25 September 2009 00:28:57 Jon Harrop wrote:
> Just to quantify this with a data point: the fastest (serial) version of my
> ray tracer benchmark is 10x slower with the new GC. However, this is
> anomalous with respect to complexity and the relative performance is much
> better for simpler renderings. For example, the new GC is only 1.7x slower
> with n=6 instead of n=9.

The new SmartPumpkin release of OC4MC does a lot better. Specifically, the version compiled with partial collections is now only 3.9x slower on a serial ray tracer with n=9 (compared to 10x slower before).

I'll try it in more detail...

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures
From: David Teller @ 2009-09-24 18:24 UTC
To: Philippe Wang; +Cc: caml-list

Well, let me join the chorus and congratulate. I'll need to test this as soon as possible.

Cheers,
David

On Tue, 2009-09-22 at 23:30 +0200, Philippe Wang wrote:
> This is some additional "noise" about "OCaml for Multicore
> architectures" (or "Ok with parallel threads GC").
> ----------------------------
>
> Dear list,
>
> We have implemented an alternative runtime library for OCaml, one that
> allows threads to compute in parallel on different cores of now
> widespread CPUs.
>
> This project will be presented at IFL 2009
> (http://blogs.shu.edu/projects/IFL2009/).
>
> A testing version available online at
> http://www.algo-prog.info/ocmc/
> It works with OCaml 3.10.2 for Linux x86-64bit, we haven't met any
> bugs with the latest build (it doesn't *unexpectedly* crash, not yet).
>
> Hope you'll enjoy,
>
> --
> Mathias Bourgoin, Adrien Jonquet, Emmanuel Chailloux, Benjamin Canou,
> Philippe Wang
[parent not found: <20090924154716.BCD0ABC5A@yquem.inria.fr>]
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures
From: Pascal Cuoq @ 2009-09-24 16:02 UTC
To: caml-list

On Sep 24, 2009, at 5:47 PM, Philippe Wang wrote:

>> Is the copy operation parallelized?
>
> Nope. When the world is stopped for the collection, everything is done
> sequentially until the world is resumed.
> I don't think it's relevant to parallelize the copy operation (hell to
> implement&debug, then I don't think that performance would be very
> interesting because we would probably need a write mutex on the
> destination heap)

Well, you could start copying to the bottom of the next heap with one thread going up and to the top of it with another going down. Assume optimistically that the two threads will not reach the same cacheline at the end of the copies, and you don't need any synchronisation at all between them, except joining at the end.

After checking, if they have reached the same cacheline, you need to reallocate the destination heap anyway.

You still get a single unfragmented free block as a result.

Even better: stop the world just before there remains less than one cacheline of free space and you don't need to check if the two threads have met. You still need to reallocate the destination heap sometimes though.

Oh, and I meant to say, but everyone else was faster than me: well done!

Pascal
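The two-ended copy described in the message above can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: the destination heap is a plain `int array`, the record `to_space` and the functions `copy_up`, `copy_down`, `met` and `free` are hypothetical names invented here, collisions are detected at word granularity rather than cacheline granularity, and nothing below is taken from OC4MC or the INRIA runtime. Each copier only touches its own cursor, and the overlap check is done once at the end, as the message suggests.

```ocaml
(* Destination heap modelled as a flat array of words (hypothetical layout). *)
type to_space = {
  words : int array;
  mutable low : int;   (* next free index for the copier going up *)
  mutable high : int;  (* start of the region filled by the copier going down;
                          equal to the heap size while that region is empty *)
}

let make size = { words = Array.make size 0; low = 0; high = size }

(* Copy [src] at the bottom cursor and move the cursor up. Deliberately
   does not look at [high]: each copier only touches its own cursor.
   Array.blit raises Invalid_argument if the copy leaves the array. *)
let copy_up t src =
  let n = Array.length src in
  let at = t.low in
  Array.blit src 0 t.words at n;
  t.low <- at + n;
  at

(* Copy [src] just below the top cursor and move the cursor down. *)
let copy_down t src =
  let n = Array.length src in
  let at = t.high - n in
  Array.blit src 0 t.words at n;
  t.high <- at;
  at

(* Checked once, after both copiers are done: did the two regions overlap? *)
let met t = t.low > t.high

(* Size of the single free block left in the middle (0 if the copiers met). *)
let free t = if met t then 0 else t.high - t.low
```

In this model, when `met` returns true the contents of the destination heap are no longer trustworthy and the copy would have to be redone into a larger heap, which corresponds to the "reallocate the destination heap" case in the message.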
* Re: [Caml-list] OC4MC : OCaml for Multicore architectures
From: Philippe Wang @ 2009-09-24 16:30 UTC
To: Pascal Cuoq; +Cc: caml-list

On Sep 24, 2009, at 18:02 GMT+02:00, Pascal Cuoq wrote:

> On Sep 24, 2009, at 5:47 PM, Philippe Wang wrote:
>
>>> Is the copy operation parallelized?
>>
>> Nope. When the world is stopped for the collection, everything is
>> done sequentially until the world is resumed.
>> I don't think it's relevant to parallelize the copy operation (hell
>> to implement&debug, then I don't think that performance would be very
>> interesting because we would probably need a write mutex on the
>> destination heap)
>
> Well, you could start copying to the bottom of the next heap with
> one thread going up and to the top of it with another going down.
> Assume optimistically that the two threads will not reach the same
> cacheline at the end of the copies, and you don't need any
> synchronisation at all between them, except joining at the end.
>
> After checking, if they have reached the same cacheline,
> you need to reallocate the destination heap anyway.
>
> You still get a single unfragmented free block as a result.
>
> Even better: stop the world just before there remains less than one
> cacheline of free space and you don't need to check if the two
> threads have met. You still need to reallocate the destination heap
> sometimes though.

A concurrent copy means that there would be bad overhead for single core. It also means making memory bandwidth the bottleneck, as memory copy operations are clearly quickly limited by this bandwidth, not by the CPU. This may hopefully become false in a few years, but hardware manufacturers don't seem to be excited by that; they seem to prefer doing their marketing on the number of cores. Look at GPUs: they have very fast graphical RAM, but they have a huge number of processing units. I don't really see the point in that (i.e. having a huge number of PUs) anyway (except "marketing").

OK, back to GC stuff. A stop&copy algorithm needs to have a set of roots to make the copy of living values. Each thread has its stack, so it has its subset of roots. Then what? Parallelize the copy from each thread? OK, we have to determine the best number of threads according to the number of cores, but more importantly according to the memory bandwidth given per core. (What a nightmare!)

Then there are shared values (in the shared heap for instance, but what if there are lateral pointers due to mutable values?). (We are leaving the nightmare for hell! But some people have been there.)

Copying a living value means that if later you encounter something pointing to its old address, you have to know the new one. This means writing at the old address.

I don't see how we can make *today* something very interesting that is concurrent with a stop&copy algorithm. I believe (but I'm *not* a GC expert at all) concurrent GCs are not based on a stop&copy algorithm but rather on some mark&{do-some-stuff-such-as-"sweep"}.

> Oh, and I meant to say, but everyone else was faster than me:
> well done!

Thank you, and thanks to everyone else who appreciates this work. :-)

Philippe Wang
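The remark above about "writing at the old address" refers to what is usually called a forwarding pointer. A minimal sequential sketch of a copying collection with forwarding pointers follows, in the style of Cheney's algorithm. It is an illustration under assumptions, not OC4MC's collector: the heap is modelled as an OCaml array of cells, addresses are array indices, and the types `value` and `cell` and the function `collect` are hypothetical names introduced here.

```ocaml
(* A field is either an immediate integer or a pointer (an index into a space). *)
type value = Int of int | Ptr of int

(* A heap cell: a live block of fields, or the forwarding marker written
   at the old address once the block has been copied. *)
type cell = Block of value array | Forwarded of int

(* Copy everything reachable from [roots] out of [from_space].
   Returns the compacted destination space and the updated roots.
   [from_space] is mutated: copied cells become [Forwarded new_address]. *)
let collect (from_space : cell array) (roots : value array) =
  let to_space = Array.make (Array.length from_space) (Block [||]) in
  let next = ref 0 in  (* allocation cursor in the destination space *)
  let forward v =
    match v with
    | Int _ -> v
    | Ptr old ->
      (match from_space.(old) with
       | Forwarded fresh -> Ptr fresh        (* already copied: reuse new address *)
       | Block fields ->
         let fresh = !next in
         incr next;
         to_space.(fresh) <- Block (Array.copy fields);
         from_space.(old) <- Forwarded fresh;  (* write at the old address *)
         Ptr fresh)
  in
  (* Copy the blocks directly reachable from the roots, in order. *)
  let new_roots = Array.copy roots in
  Array.iteri (fun i v -> new_roots.(i) <- forward v) roots;
  (* Scan copied blocks; their fields still hold old addresses until forwarded. *)
  let scan = ref 0 in
  while !scan < !next do
    (match to_space.(!scan) with
     | Block fields -> Array.iteri (fun i v -> fields.(i) <- forward v) fields
     | Forwarded _ -> assert false);
    incr scan
  done;
  (Array.sub to_space 0 !next, new_roots)
```

The single shared cursor `next` and the in-place `Forwarded` writes are exactly the pieces of state that several copying threads would contend on, which is one way to read the concern raised in the message about making this concurrent.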