* Re: [Caml-list] Multiplication of matrix in C and OCaml
@ 2007-02-09 14:24 Frederic GAVA
0 siblings, 0 replies; 32+ messages in thread
From: Frederic GAVA @ 2007-02-09 14:24 UTC (permalink / raw)
To: Brian Hurt, ls-ocaml-developer-2006; +Cc: caml-list
> For the C side, I recommend using clock() to time the runs. This also
> factors out the cost of creating the random matricies. Basically, you
> just do:
>
> #include
>
> ...
> {
> clock_t start, stop;
> ...
> start = clock();
> /* stuff to time here */
> stop = clock();
> printf("The amount of time taken = %f seconds\n",
> ((double)(stop-start))/((double) CLOCKS_PER_SEC));
> /* write result to file */
> }
Many parallel C libraries have also there own timer.
MPI=> MPI_Wtime
BSPlib => bsp_time
PVM=>...
It is very usefull because they do not take into account time to initialize the parallel machine...
Frédéric Gava
^ permalink raw reply [flat|nested] 32+ messages in thread
* Multiplication of matrix in C and OCaml @ 2007-02-07 23:42 Frédéric Gava 2007-02-08 2:14 ` [Caml-list] " Jacques Garrigue 0 siblings, 1 reply; 32+ messages in thread From: Frédéric Gava @ 2007-02-07 23:42 UTC (permalink / raw) To: caml-list Dear All, I compared multiplication of matrix in C and OCaml and I was a little surprise to see that the following C code (using -O2) is 8 time faster than the OCaml one (even with -unsafe). Anybody have an idea to optimize my OCaml code or know why is there this "big" difference ? many thanks Frédéric Gava ps: the difference is the same even when I use a non-polymorphic multiplication *************** C Code ****************** typedef struct Complexe {double re; double img;} complexe; __inline__ void* add(void *x, void *y) { complexe a=((complexe*) x)[0]; complexe b=((complexe*) y)[0]; return (void *) &((complexe){a.re+b.re, a.img+b.img}); } __inline__ void* mult(void *x, void *y) { complexe a=((complexe*) x)[0]; complexe b=((complexe*) y)[0]; return (void *) &((complexe){a.re*b.re-a.img*b.img, a.re*b.img+a.img*b.re}); } void multiply_generic(int n, void *zero, void* (*add)(void *a, void *b), void* (*mult)(void *a, void *b), void *a, void *b, void *c) { register int i,j,k; void *tmp; void *cc; for (i=0; i<n; i++) { for (j=0;j<n;j++) { tmp=zero; for (k=0;k<n;k++) tmp=add(tmp,(mult(b+(k*n+j),a+(i*n+k)))); cc=c+(i*n+j); cc=tmp; } } } complexe random_complexe(void) { complexe c = {c.re=(double)(rand()%1000), c.img=(double)(rand()%1000)}; return c; } Call this with (and with random matrix). complexe zero; zero.re=(double)0; zero.img=(double)0; multiply_generic(i,&zero,add,mult,a,b,c); ********************* OCaml Code ******************************* let random_complex n = {Complex.re=Random.float n; im=Random.float n} let matrixRandom n = Array.init (n*n) (fun _ -> random_complex 1000.) let multiplication n zero add mul a b c = let tmp=ref zero in for i=0 to n-1 do for j=0 to n-1 do tmp:=zero; for k=0 to n-1 do tmp:= add (!tmp) (mul b.(k*n+j) a.(i*n+k)) done; c.(i*n+j)<-(!tmp) done done ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-07 23:42 Frédéric Gava @ 2007-02-08 2:14 ` Jacques Garrigue 2007-02-08 9:27 ` Frédéric Gava 2007-02-08 9:56 ` Frédéric Gava 0 siblings, 2 replies; 32+ messages in thread From: Jacques Garrigue @ 2007-02-08 2:14 UTC (permalink / raw) To: gava; +Cc: caml-list From: Frédéric Gava <gava@univ-paris12.fr> > I compared multiplication of matrix in C and OCaml and I was a little > surprise to see that the following C code (using -O2) is 8 time faster > than the OCaml one (even with -unsafe). > > Anybody have an idea to optimize my OCaml code or know why is there this > "big" difference ? > > ps: the difference is the same even when I use a non-polymorphic > multiplication This is not what I see: Multiplying 1000 times two 50x50 matrices on a Pentium M 1.8G, using your code. ocamlopt -unsafe polymorphic: 5.4s ocamlopt -unsafe -inline 100 polymorphic: 5.0s ocamlopt -unsafe monomorphic: 4.4s gcc polymorphic: 6.5s gcc monomorphic: 5.8s gcc -O3 monomorphic: 4.7s So, actually it seems that ocamlopt is even faster than gcc in many cases. Now, I also found a strange result: gcc -O3 monomorphic: 1.4s So I looked at the generated assembler: .globl add .type add, @function add: pushl %ebp movl %esp, %ebp subl $56, %esp leal -56(%ebp), %eax leave ret .size add, .-add .p2align 2,,3 .globl mult .type mult, @function mult: pushl %ebp movl %esp, %ebp subl $56, %esp leal -56(%ebp), %eax leave ret It seems that returning a local value is not well-defined, and gcc just decides that this value being unused inside the function, there is no need to do any computation! --------------------------------------------------------------------------- Jacques Garrigue Nagoya University garrigue at math.nagoya-u.ac.jp <A HREF=http://www.math.nagoya-u.ac.jp/~garrigue/>JG</A> ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-08 2:14 ` [Caml-list] " Jacques Garrigue @ 2007-02-08 9:27 ` Frédéric Gava 2007-02-08 9:38 ` Frédéric Gava 2007-02-08 12:08 ` Jacques Garrigue 2007-02-08 9:56 ` Frédéric Gava 1 sibling, 2 replies; 32+ messages in thread From: Frédéric Gava @ 2007-02-08 9:27 UTC (permalink / raw) To: caml-list [-- Attachment #1: Type: text/plain, Size: 1287 bytes --] > This is not what I see: > Multiplying 1000 times two 50x50 matrices on a Pentium M 1.8G, using > your code. Thanks. Curious. In my cat "/proc/cpuinfo" (using Ubuntu 6.06) processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 13 model name : Intel(R) Pentium(R) M processor 1.60GHz stepping : 6 cpu MHz : 598.127 cache size : 2048 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr mce cx8 sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe est tm2 bogomips : 1197.16 when I multiply two 400x400 matrices, I observe that C is faster than OCaml (2 seconds for C code and 16 seconds for OCaml one). It is perhaps because I use MPI. Do you used -O2 ? > gcc -O3 monomorphic: 4.7s > Now, I also found a strange result: > gcc -O3 monomorphic: 1.4s Why two different times for the same program ? What is monomorphic ? Regards, Frédéric Gava ps: I attach the C code and the OCaml one which used some specific functions of our parallel version of OCaml (here implemented using MPI). I will send the .exe file to Jacques. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: mult.c --] [-- Type: text/x-csrc; name="mult.c", Size: 3227 bytes --] /* param 1 = taille des matrices au depart param 2 = incrementation param 3 = taille des matrices à l'arrive param 4 = nombre d'itérations mpicc -O3 -o cmult cmult.c (using MPICH) mpirun -arch LINUX -np 1 ./cmult 400 2 402 1 */ #include <mpi.h> #include <stdlib.h> #include <stdio.h> #include <math.h> #include <string.h> typedef struct Complexe {double re; double img;} complexe; complexe zero; /* __inline__ complexe complexe_add(complexe a, complexe b) { return (complexe){a.re+b.re, a.img+b.img}; } __inline__ complexe complexe_mult(complexe x, complexe y) { return (complexe) {x.re*y.re-x.img*y.img, x.re*y.img+x.img*y.re}; } */ /* version generic */ __inline__ void* add(void *x, void *y) { complexe a=((complexe*) x)[0]; complexe b=((complexe*) y)[0]; return (void *) &((complexe){a.re+b.re, a.img+b.img}); } __inline__ void* mult(void *x, void *y) { complexe a=((complexe*) x)[0]; complexe b=((complexe*) y)[0]; return (void *) &((complexe){a.re*b.re-a.img*b.img, a.re*b.img+a.img*b.re}); } complexe random_complexe(void) { complexe c = {c.re=(double)(rand()%1000), c.img=(double)(rand()%1000)}; return c; } void multiply_generic(int n, void *zero, void* (*add)(void *a, void *b), void* (*mult)(void *a, void *b), void *a, void *b, void *c) { register int i,j,k; void *tmp; void *cc; void *tt1, *tt2, *tt3; for (i=0; i<n; i++) { for (j=0;j<n;j++) { tmp=zero; for (k=0;k<n;k++) { tmp=add(tmp,(mult(b+(k*n+j),a+(i*n+k)))); /* tt1=malloc(sizeof(complexe)); memcpy(tt1,mult(b+(k*n+j),a+(i*n+k)),sizeof(complexe)); tt2=malloc(sizeof(complexe)); memcpy(tt2,add(tmp,tt1),sizeof(complexe)); free(tt1); memcpy(tmp,tt2,sizeof(complexe)); free(tt2); */ } cc=c+(i*n+j); cc=tmp; /* memcpy(cc,tmp,sizeof(complexe)); */ } } } /* void multiply_complex(int n, complexe *a, complexe *b, complexe *c) { register int i,j,k,t1,t2; complexe tmp; for (i=0; i<n; i++) { for (j=0;j<n;j++) { tmp=zero; for (k=0;k<n;k++) tmp=complexe_add(tmp,(complexe_mult(b[k*n+j],a[i*n+k]))); c[i*n+j]=tmp; } } } */ int main(int argc, char* argv[]) { complexe *a, *b, *c; int i,j,k,t1,pas,maxi,iter,pid; double chrono=0; zero.re=(double)0; zero.img=(double)0; MPI_Init(&argc,&argv); MPI_Comm_rank(MPI_COMM_WORLD,&pid); i=atoi(argv[1]); pas=atoi(argv[2]); maxi=atoi(argv[3]); iter=atoi(argv[4]); srand(1000); while (i<maxi) { /* initialisation des matrices */ a=(complexe *)malloc(i*i*sizeof(complexe)); b=(complexe *)malloc(i*i*sizeof(complexe)); c=(complexe *)malloc(i*i*sizeof(complexe)); for (j=0;j<i;j++) for (k=0,t1=j*i;k<i;k++) { a[t1+k]=random_complexe(); b[t1+k]=random_complexe(); } chrono=MPI_Wtime(); for (k=1;k<=iter;k++) /* multiply_complex(i,a,b,c); */ multiply_generic(i,&zero,add,mult,a,b,c); chrono=MPI_Wtime()-chrono; MPI_Barrier(MPI_COMM_WORLD); if (pid==0) printf("%f\n",chrono/iter); fflush(stdout); /* suppression des vieilles matrices */ free((complexe *)a); free((complexe *)b); free((complexe *)c); i=i+pas; } MPI_Finalize(); return 0; } [-- Attachment #3: mult.ml --] [-- Type: text/plain, Size: 1260 bytes --] (* param 1 = taille des matrices au depart param 2 = incrementation param 3 = taille des matrices à l'arrive param 4 = nombre d'itérations *) open Bsmllib open Bsmlcomm let multiplication n zero add mul a b c = let tmp=ref zero in for i=0 to n-1 do for j=0 to n-1 do tmp:=zero; (* let rec loop tmp k = if k=n then tmp else loop (add tmp (mul b.(k*n+j) a.(i*n+k))) (k+1) in c.(i*n+j)<-loop zero 0 *) for k=0 to n-1 do tmp:= add (!tmp) (mul b.(k*n+j) a.(i*n+k)) done; c.(i*n+j)<-(!tmp) done done let random_complex n = {Complex.re=Random.float n; im=Random.float n} let matrixRandom n = Array.init (n*n) (fun _ -> random_complex 1000.) let _ = Bsmllib.initialize(); Random.self_init(); let i = ref (int_of_string (argv()).(1)) and pas=(int_of_string (argv()).(2)) and maxi=(int_of_string (argv()).(3)) and iter=(int_of_string (argv()).(4)) in while !i<maxi do let a=matrixRandom (!i) and b=matrixRandom (!i) and c=Array.make ((!i)*(!i)) Complex.zero in ignore(bench_fun (fun () -> ignore(multiplication (!i) Complex.zero Complex.add Complex.mul a b c)) iter); i:=!i+pas; done ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-08 9:27 ` Frédéric Gava @ 2007-02-08 9:38 ` Frédéric Gava 2007-02-08 12:08 ` Jacques Garrigue 1 sibling, 0 replies; 32+ messages in thread From: Frédéric Gava @ 2007-02-08 9:38 UTC (permalink / raw) To: gava; +Cc: caml-list > when I multiply two 400x400 matrices, mistake. 600x600 matrices FG ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-08 9:27 ` Frédéric Gava 2007-02-08 9:38 ` Frédéric Gava @ 2007-02-08 12:08 ` Jacques Garrigue 1 sibling, 0 replies; 32+ messages in thread From: Jacques Garrigue @ 2007-02-08 12:08 UTC (permalink / raw) To: gava; +Cc: caml-list From: Frédéric Gava <gava@univ-paris12.fr> > > gcc -O3 monomorphic: 4.7s > > Now, I also found a strange result: > > gcc -O3 monomorphic: 1.4s > > Why two different times for the same program ? What is monomorphic ? The second one is a misprint fron "gcc -O3 polymorphic". By monomorphic I mean that add and mult are used directly (actually inlined with -O3), rather than being passed as parameter to generic_multiply, which should be faster. The funny thing is that this results in correct code even with -O3, because the inlined version does not discard its result, as does the add and mult optimized functions. So the polymorphic code, completely broken, ends up being 3 times faster :-) By the way there was another evident bug in your code: you do pointer arithmetic on void*, which is clearly wrong. Yet, this can be easily corrected by passing sizeof(complexe) as an extra argument to generic_multiply, and adjusting offsets with it. Jacques Garrigue ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-08 2:14 ` [Caml-list] " Jacques Garrigue 2007-02-08 9:27 ` Frédéric Gava @ 2007-02-08 9:56 ` Frédéric Gava 2007-02-08 10:45 ` Xavier Leroy 1 sibling, 1 reply; 32+ messages in thread From: Frédéric Gava @ 2007-02-08 9:56 UTC (permalink / raw) To: Jacques Garrigue; +Cc: caml-list [-- Attachment #1: Type: text/plain, Size: 443 bytes --] I attached the pure sequential version of my program. In the file, I give how I compile, How I run and the result. Here C is 5 time faster than OCaml Many thanks to tell me why my OCaml code is so slow. Regards, Frédéric Gava ps: note that in a more generic version of the C code (that is, do memcpy because the data could be of any size, (for example, matrices of polynomials) the ocaml code is faster and faster than the C one :-) ) [-- Attachment #2: mult.c --] [-- Type: text/x-csrc, Size: 2055 bytes --] /* param 1 = size of the matrices at the beginning param 2 = incrementation param 3 = size of the matrices at the end param 4 = number of iteration gcc -O2 -o cmult mult.c time ./cmult 600 2 602 1 real 0m3.800s user 0m3.764s sys 0m0.020s */ #include <stdlib.h> #include <stdio.h> #include <math.h> typedef struct Complexe {double re; double img;} complexe; complexe zero; /* version generic */ __inline__ void* add(void *x, void *y) { complexe a=((complexe*) x)[0]; complexe b=((complexe*) y)[0]; return (void *) &((complexe){a.re+b.re, a.img+b.img}); } __inline__ void* mult(void *x, void *y) { complexe a=((complexe*) x)[0]; complexe b=((complexe*) y)[0]; return (void *) &((complexe){a.re*b.re-a.img*b.img, a.re*b.img+a.img*b.re}); } complexe random_complexe(void) { complexe c = {c.re=(double)(rand()%1000), c.img=(double)(rand()%1000)}; return c; } void multiply_generic(int n, void *zero, void* (*add)(void *a, void *b), void* (*mult)(void *a, void *b), void *a, void *b, void *c) { register int i,j,k; void *tmp; void *cc; for (i=0; i<n; i++) { for (j=0;j<n;j++) { tmp=zero; for (k=0;k<n;k++) { tmp=add(tmp,(mult(b+(k*n+j),a+(i*n+k)))); } cc=c+(i*n+j); cc=tmp; } } } int main(int argc, char* argv[]) { complexe *a, *b, *c; int i,j,k,t1,pas,maxi,iter; zero.re=(double)0; zero.img=(double)0; i=atoi(argv[1]); pas=atoi(argv[2]); maxi=atoi(argv[3]); iter=atoi(argv[4]); srand(1000); while (i<maxi) { /* initialisation des matrices */ a=(complexe *)malloc(i*i*sizeof(complexe)); b=(complexe *)malloc(i*i*sizeof(complexe)); c=(complexe *)malloc(i*i*sizeof(complexe)); for (j=0;j<i;j++) for (k=0,t1=j*i;k<i;k++) { a[t1+k]=random_complexe(); b[t1+k]=random_complexe(); } for (k=1;k<=iter;k++) multiply_generic(i,&zero,add,mult,a,b,c); /* suppression des vieilles matrices */ free((complexe *)a); free((complexe *)b); free((complexe *)c); i=i+pas; } return 0; } [-- Attachment #3: mult.ml --] [-- Type: text/plain, Size: 1372 bytes --] (* param 1 = size of the matrices at the beginning param 2 = incrementation param 3 = size of the matrices at the end param 4 = number of iteration (Objective Caml version 3.09.1) ocamlopt -unsafe -o ocamlmult mult.ml time ./ocamlmult 600 2 602 1 real 0m17.043s user 0m16.937s sys 0m0.052s *) let multiplication n zero add mul a b c = let tmp=ref zero in for i=0 to n-1 do for j=0 to n-1 do tmp:=zero; (* let rec loop tmp k = if k=n then tmp else loop (add tmp (mul b.(k*n+j) a.(i*n+k))) (k+1) in c.(i*n+j)<-loop zero 0 *) for k=0 to n-1 do tmp:= add (!tmp) (mul b.(k*n+j) a.(i*n+k)) done; c.(i*n+j)<-(!tmp) done done let random_complex n = {Complex.re=Random.float n; im=Random.float n} let matrixRandom n = Array.init (n*n) (fun _ -> random_complex 1000.) let _ = Random.self_init(); let i = ref (int_of_string (Sys.argv).(1)) and pas=(int_of_string (Sys.argv).(2)) and maxi=(int_of_string (Sys.argv).(3)) and iter=(int_of_string (Sys.argv).(4)) in while !i<maxi do let a=matrixRandom (!i) and b=matrixRandom (!i) and c=Array.make ((!i)*(!i)) Complex.zero in for j=1 to iter do ignore(multiplication (!i) Complex.zero Complex.add Complex.mul a b c); done; i:=!i+pas; done ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-08 9:56 ` Frédéric Gava @ 2007-02-08 10:45 ` Xavier Leroy 2007-02-08 15:16 ` Frédéric Gava 0 siblings, 1 reply; 32+ messages in thread From: Xavier Leroy @ 2007-02-08 10:45 UTC (permalink / raw) To: gava; +Cc: Jacques Garrigue, caml-list > In the file, I give how I compile, How I run and the result. Here C is 5 > time faster than OCaml > Many thanks to tell me why my OCaml code is so slow. Because, as Jacques told you already, your C code is wrong. "add" and "mult" invoke undefined behaviors of C and therefore gcc feels free to optimize these functions as no-ops at optimization levels 1 and above. That's a major speedup, for sure. Why don't you check your code for correctness first before drawing conclusions on performance? - Xavier Leroy ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-08 10:45 ` Xavier Leroy @ 2007-02-08 15:16 ` Frédéric Gava 2007-02-09 2:58 ` Jacques Garrigue 0 siblings, 1 reply; 32+ messages in thread From: Frédéric Gava @ 2007-02-08 15:16 UTC (permalink / raw) To: Xavier Leroy; +Cc: Jacques Garrigue, caml-list [-- Attachment #1: Type: text/plain, Size: 1626 bytes --] > Because, as Jacques told you already, your C code is wrong. "add" and > "mult" invoke undefined behaviors of C and therefore gcc feels free to > optimize these functions as no-ops at optimization levels 1 and above. > That's a major speedup, for sure. Why don't you check your code for > correctness first before drawing conclusions on performance? Sorry for the inconvenience and this stupid error: I am a very bad C programmer. But, I do not obtain the performance of Jacques Garrigue :-( I try to bench a parallel matrix multiplication algorithm and test the difference between C+MPI and OCaml+MPI (I try to prove that OCaml is efficient enought for high-performance, in this community, they largely prefer Fortran or C...)) a) with a "polymorphic" C program (using "multiply_complex_generic(i,complexe_add,complexe_mult,a,b,c);") time ./cmult 600 2 602 1 real 0m18.402s user 0m17.333s sys 0m0.044s b) for a monomorphic C programs (using "multiply_complex(i,a,b,c);"); time ./cmult 600 2 602 1 real 0m5.604s user 0m5.556s sys 0m0.036s c) for a polymorphic OCaml program (using "ignore(multiplication_polymorphic (!i) Complex.zero Complex.add Complex.mul a b c);") time ./ocamlmult 600 2 602 1 real 0m16.433s user 0m16.125s sys 0m0.068s d) for a monomorphic OCaml program (using "ignore(multiplication_monomorphic (!i) a b c);") time ./ocamlmult 600 2 602 1 real 0m15.460s user 0m15.345s sys 0m0.072s I win 1 second if I compile with "-inline 100". So I have not the "twice slower at most". thanks, Frédéric Gava ps: here I have test the programs ;-) [-- Attachment #2: mult.c --] [-- Type: text/x-csrc, Size: 2227 bytes --] /* param 1 = size of the matrices at the beginning param 2 = incrementation param 3 = size of the matrices at the end param 4 = number of iteration with gcc -O3 -o cmult mult.c */ #include <stdlib.h> #include <stdio.h> #include <math.h> typedef struct Complexe {double re; double img;} complexe; complexe zero; __inline__ complexe complexe_add(complexe a, complexe b) { return (complexe){a.re+b.re, a.img+b.img}; } __inline__ complexe complexe_mult(complexe x, complexe y) { return (complexe) {x.re*y.re-x.img*y.img, x.re*y.img+x.img*y.re}; } complexe random_complexe(void) { complexe c = {c.re=(double)(rand()%1000), c.img=(double)(rand()%1000)}; return c; } void multiply_complex(int n, complexe *a, complexe *b, complexe *c) { register int i,j,k; complexe tmp; for (i=0; i<n; i++) { for (j=0;j<n;j++) { tmp=zero; for (k=0;k<n;k++) tmp=complexe_add(tmp,(complexe_mult(b[k*n+j],a[i*n+k]))); c[i*n+j]=tmp; } } return; } void multiply_complex_generic(int n, complexe (*add)(complexe x, complexe y) , complexe (*mult)(complexe x, complexe y),complexe *a, complexe *b, complexe *c) { register int i,j,k; complexe tmp; for (i=0; i<n; i++) { for (j=0;j<n;j++) { tmp=zero; for (k=0;k<n;k++) tmp=add(tmp,(mult(b[k*n+j],a[i*n+k]))); c[i*n+j]=tmp; } } return; } int main(int argc, char* argv[]) { complexe *a, *b, *c; int i,j,k,t1,pas,maxi,iter; zero.re=(double)0; zero.img=(double)0; i=atoi(argv[1]); pas=atoi(argv[2]); maxi=atoi(argv[3]); iter=atoi(argv[4]); srand(1000); while (i<maxi) { /* initialisation des matrices */ a=(complexe *)malloc(i*i*sizeof(complexe)); b=(complexe *)malloc(i*i*sizeof(complexe)); c=(complexe *)malloc(i*i*sizeof(complexe)); for (j=0;j<i;j++) for (k=0,t1=j*i;k<i;k++) { a[t1+k]=random_complexe(); b[t1+k]=random_complexe(); } for (k=1;k<=iter;k++) multiply_complex(i,a,b,c); /* multiply_complex_generic(i,complexe_add,complexe_mult,a,b,c); */ /* suppression des vieilles matrices */ free((complexe *)a); free((complexe *)b); free((complexe *)c); i=i+pas; } return 0; } [-- Attachment #3: mult.ml --] [-- Type: text/plain, Size: 1431 bytes --] (* param 1 = size of the matrices at the beginning param 2 = incrementation param 3 = size of the matrices at the end param 4 = number of iteration ocamlopt -unsafe -o ocamlmult mult.ml *) let multiplication_polymorphic n zero add mul a b c = for i=0 to n-1 do for j=0 to n-1 do let rec loop tmp k = if k=n then tmp else loop (add tmp (mul b.(k*n+j) a.(i*n+k))) (k+1) in c.(i*n+j)<-loop zero 0 done done let multiplication_monomorphic n a b c = let tmp=ref Complex.zero in for i=0 to n-1 do for j=0 to n-1 do for k=0 to n-1 do tmp:=Complex.add !tmp (Complex.mul b.(k*n+j) a.(i*n+k)) done; c.(i*n+j)<-(!tmp) done done (* same efficienty with the loop function *) let random_complex n = {Complex.re=Random.float n; im=Random.float n} let matrixRandom n = Array.init (n*n) (fun _ -> random_complex 1000.) let _ = Random.self_init(); let i = ref (int_of_string (Sys.argv).(1)) and pas=(int_of_string (Sys.argv).(2)) and maxi=(int_of_string (Sys.argv).(3)) and iter=(int_of_string (Sys.argv).(4)) in while !i<maxi do let a=matrixRandom (!i) and b=matrixRandom (!i) and c=Array.make ((!i)*(!i)) Complex.zero in for j=1 to iter do (* ignore(multiplication_polymorphic (!i) Complex.zero Complex.add Complex.mul a b c); *) ignore(multiplication_monomorphic (!i) a b c); done; i:=!i+pas; done ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-08 15:16 ` Frédéric Gava @ 2007-02-09 2:58 ` Jacques Garrigue 2007-02-09 9:06 ` ls-ocaml-developer-2006 0 siblings, 1 reply; 32+ messages in thread From: Jacques Garrigue @ 2007-02-09 2:58 UTC (permalink / raw) To: gava; +Cc: caml-list From: Frédéric Gava <gava@univ-paris12.fr> > Sorry for the inconvenience and this stupid error: I am a very bad C > programmer. > > But, I do not obtain the performance of Jacques Garrigue :-( I try to > bench a parallel matrix multiplication algorithm and test the difference > between C+MPI and OCaml+MPI (I try to prove that OCaml is efficient > enought for high-performance, in this community, they largely prefer > Fortran or C...)) > > > a) with a "polymorphic" C program (using > "multiply_complex_generic(i,complexe_add,complexe_mult,a,b,c);") > > time ./cmult 600 2 602 1 > real 0m18.402s > user 0m17.333s > sys 0m0.044s > > b) for a monomorphic C programs (using "multiply_complex(i,a,b,c);"); > > time ./cmult 600 2 602 1 > real 0m5.604s > user 0m5.556s > sys 0m0.036s Interesting. It all depends on the compiler. With gcc 3.4, as provided in FreeBSD, I get almost no difference between your polymorphic and monomorphic versions. But if I switch to gcc 4.1, the monomorphic version is indeed much faster. Actually, what I get is: gcc 3.4 polymorphic: 15s gcc 4.1 polymorphic: 20s gcc 3.4 monomorphic: 15s gcc 4.1 monomorphic: 7s So it looks like gcc 4.1 is better for monomorphic code, but worse for function calls... Note that in my case, this is still within a factor 2 of ocaml (which is about the same as gcc 3.4). But your C compiler may be doing some other platform specific optimizations. The only way to know what is happening is to look at the generated assembler. Jacques Garrigue ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-09 2:58 ` Jacques Garrigue @ 2007-02-09 9:06 ` ls-ocaml-developer-2006 2007-02-09 10:32 ` ls-ocaml-developer-2006 ` (2 more replies) 0 siblings, 3 replies; 32+ messages in thread From: ls-ocaml-developer-2006 @ 2007-02-09 9:06 UTC (permalink / raw) To: caml-list Jacques Garrigue <garrigue@math.nagoya-u.ac.jp> writes: > From: Frédéric Gava <gava@univ-paris12.fr> > >> Sorry for the inconvenience and this stupid error: I am a very bad C >> programmer. >> >> But, I do not obtain the performance of Jacques Garrigue :-( I try to >> bench a parallel matrix multiplication algorithm and test the difference >> between C+MPI and OCaml+MPI (I try to prove that OCaml is efficient >> enought for high-performance, in this community, they largely prefer >> Fortran or C...)) >> >> >> a) with a "polymorphic" C program (using >> "multiply_complex_generic(i,complexe_add,complexe_mult,a,b,c);") >> >> time ./cmult 600 2 602 1 >> real 0m18.402s >> user 0m17.333s >> sys 0m0.044s >> >> b) for a monomorphic C programs (using "multiply_complex(i,a,b,c);"); >> >> time ./cmult 600 2 602 1 >> real 0m5.604s >> user 0m5.556s >> sys 0m0.036s > > Interesting. It all depends on the compiler. > With gcc 3.4, as provided in FreeBSD, I get almost no difference > between your polymorphic and monomorphic versions. But if I switch to > gcc 4.1, the monomorphic version is indeed much faster. Actually, what > I get is: > gcc 3.4 polymorphic: 15s > gcc 4.1 polymorphic: 20s > gcc 3.4 monomorphic: 15s > gcc 4.1 monomorphic: 7s > So it looks like gcc 4.1 is better for monomorphic code, but worse for > function calls... > Note that in my case, this is still within a factor 2 of ocaml (which > is about the same as gcc 3.4). > But your C compiler may be doing some other platform specific > optimizations. The only way to know what is happening is to look at the > generated assembler. I'm just wondering about that. All the data produced during the matrix-multiplication is AFAICS not used for anything. So I wouldn't exclude that in the case of monomorphic code gcc 4.1 is just deciding to ditch most of the actual work after doing some data flow analysis. Gcc 4.1 is (I think) known to do some aggressive optimizations. I'd feel better if the code is benchmarked in a way that the result of the multiplication is output to a file and to subtract the constant contribution of that to the run time that the time is measured for various problem sizes (number of matrices). One would get a linear dependency t(n) = C + K*n and fitting a straight line against the data points could obtain K to compare the efficiency of C (under various compiler versions and options) and Ocaml against each other without having to worry about no-op optimizations or constant startup costs (like load times and run time initialization: which might be a bit higher with Ocaml, though certainly not in the order of seconds). Comparing the output to expected results would also help to ensure that the code is correct which I still find difficult to assert at first glance. Regard -- Markus ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-09 9:06 ` ls-ocaml-developer-2006 @ 2007-02-09 10:32 ` ls-ocaml-developer-2006 2007-02-09 14:22 ` skaller 2007-02-09 12:05 ` Jon Harrop 2007-02-09 13:50 ` Brian Hurt 2 siblings, 1 reply; 32+ messages in thread From: ls-ocaml-developer-2006 @ 2007-02-09 10:32 UTC (permalink / raw) To: caml-list ls-ocaml-developer-2006@m-e-leypold.de writes: > optimizations. I'd feel better if the code is benchmarked in a way > that the result of the multiplication is output to a file and to > subtract the constant contribution of that to the run time that the > time is measured for various problem sizes (number of matrices). Correction: ... the constant contribution of writing the result to a file should be subtracted from the run time by measuring with various problem sizes. Hope that is understandable now. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-09 10:32 ` ls-ocaml-developer-2006 @ 2007-02-09 14:22 ` skaller 2007-02-09 21:47 ` ls-ocaml-developer-2006 2007-02-09 23:56 ` Jon Harrop 0 siblings, 2 replies; 32+ messages in thread From: skaller @ 2007-02-09 14:22 UTC (permalink / raw) To: ls-ocaml-developer-2006; +Cc: caml-list On Fri, 2007-02-09 at 11:32 +0100, ls-ocaml-developer-2006@m-e-leypold.de wrote: > ls-ocaml-developer-2006@m-e-leypold.de writes: > > > optimizations. I'd feel better if the code is benchmarked in a way > > that the result of the multiplication is output to a file and to > > subtract the constant contribution of that to the run time that the > > time is measured for various problem sizes (number of matrices). > > Correction: ... the constant contribution of writing the result to a > file should be subtracted from the run time by measuring with various > problem sizes. > > Hope that is understandable now. There is no need for that IMHO. The right way to benchmark simple things like matrix operations AND simultaneously validate them is to evaluate known laws of arithmetic, such as the associative law, on various samples. This doesn't require any I/O, nor a 'known correct' program to generate the comparison data. Generating the samples is harder .. I once used things like Fibonacci to generate input for integer calculator tests. You really want to run these benchmarks for at least 5 minutes and randomly to eliminate cache effects. If you can generate increasing sized problems based on a single linear integer parameter, my Python test harness can handle randomising the tests, run multiple processes, and also draw a nice plot of the results, such as these: http://felix.sourceforge.net/speed/en_flx_perf_0005.html http://felix.sourceforge.net/speed/en_flx_perf_0012.html full sources etc are here to read: http://felix.sourceforge.net/speed/en_flx_perf_top.html or you can grab from Felix svn archive (the test harness is independent of Felix). Or I can email to you (saves building Felix just to get the test harness ..) Ocamlopt scores quite well thank you! This data is from two AMD64 machines running Ubuntu Linux and is the result of HOURS of run time. I don't know if it is cache, random context switches, or what, but on my boxes variations up to 20% on tests under 5 seconds are normal. Note the measurements are real time. Other measurements are suspect .. you really SHOULD count VM paging for example. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-09 14:22 ` skaller @ 2007-02-09 21:47 ` ls-ocaml-developer-2006 2007-02-09 21:55 ` Andrej Bauer 2007-02-09 23:56 ` Jon Harrop 1 sibling, 1 reply; 32+ messages in thread From: ls-ocaml-developer-2006 @ 2007-02-09 21:47 UTC (permalink / raw) To: caml-list skaller <skaller@users.sourceforge.net> writes: > On Fri, 2007-02-09 at 11:32 +0100, > ls-ocaml-developer-2006@m-e-leypold.de wrote: >> ls-ocaml-developer-2006@m-e-leypold.de writes: >> >> > optimizations. I'd feel better if the code is benchmarked in a way >> > that the result of the multiplication is output to a file and to >> > subtract the constant contribution of that to the run time that the >> > time is measured for various problem sizes (number of matrices). >> >> Correction: ... the constant contribution of writing the result to a >> file should be subtracted from the run time by measuring with various >> problem sizes. >> >> Hope that is understandable now. > There is no need for that IMHO. The right way to benchmark simple > things like matrix operations AND simultaneously validate them > is to evaluate known laws of arithmetic, such as the associative > law, on various samples. Yes, that's another way. But what stays is the fact that (a) you should somehow use the result of your operations to keep the compiler from just optimzing the operation away (since the compiler propably doesn't know about the distributive law, that would be the right way to go, but I can just imagin if you try to test (A * B) * C == A * (B * C) (the associative law) it might just be that the compiler inlines the function calls and then discovers that this is always true (since it just knows the laws of float arithmetics). Perhaps I'm just being paranoid. > This doesn't require any I/O, nor a 'known correct' program > to generate the comparison data. And as far as correctness goes: The program should not invoke undefined behaviour and if you have 2 programs in 2 different languages that you want to benchmark against each other, they should really do the same thing. I don't see how you could get any meaningfull results with an incorrect program or by comparing 2 programs from which one just invokes undefined behaviour in C. > Generating the samples is harder .. I once used things like > Fibonacci to generate input for integer calculator tests. > You really want to run these benchmarks for at least 5 minutes > and randomly to eliminate cache effects. > > If you can generate increasing sized problems based on a single > linear integer parameter, my Python test harness can handle > randomising the tests, run multiple processes, > and also draw a nice plot of the results, such as these: > > http://felix.sourceforge.net/speed/en_flx_perf_0005.html > http://felix.sourceforge.net/speed/en_flx_perf_0012.html > > full sources etc are here to read: > > http://felix.sourceforge.net/speed/en_flx_perf_top.html > > or you can grab from Felix svn archive (the test harness > is independent of Felix). Or I can email to you (saves > building Felix just to get the test harness ..) I'm not the OP (with the original problem, but I'd be interested in the test harness (if it is documented). I've been thinking about writing something similar for testing algorithms (sometimes the scaling behaviour tells you, what you did wrong or where problem lies), but didn't have the time yet. > Ocamlopt scores quite well thank you! This data is from > two AMD64 machines running Ubuntu Linux and is the result > of HOURS of run time. I'd have prefered no straight lines between the points (physicist know why) and more data points. But yes, this is the kind of benchmark I've been suggesting. Which version of gcc did you use? I'm still a bit wary on the OPs results: That gcc generated code just runs 3 times faster than code from another version from gcc seems to point to an optimization bug, IMHO: There is no way that there was a optimization potential of factor 3 in older gcc versions (else I suppose the Intel cc would have generated much faster than gcc, too, because those people presumably knew what they where doing and weren't likely to make "the same mistakes as the gcc people: So, since I don't believ in miracles ...) > I don't know if it is cache, random context switches, or what, > but on my boxes variations up to 20% on tests under 5 seconds are > normal. Almost the same here. I assume you didn't test in single user mode (I tried it once and the resulting data was much smoother). > Note the measurements are real time. Other measurements > are suspect .. you really SHOULD count VM paging for example. :-). Why me? Regards -- Markus (who isn't really in the benchamark business presently, but only wanted to point to the importance of being earnest^W observing scaling behaviour instead of looking at absolute numbers) ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-09 21:47 ` ls-ocaml-developer-2006 @ 2007-02-09 21:55 ` Andrej Bauer 2007-02-09 22:36 ` ls-ocaml-developer-2006 2007-02-10 1:10 ` Brian Hurt 0 siblings, 2 replies; 32+ messages in thread From: Andrej Bauer @ 2007-02-09 21:55 UTC (permalink / raw) Cc: caml-list I hate to add to this long discussion, but since when is floating point multiplication associative? # 1.3 *. (0.7 *. 2.1) = (1.3 *. 0.7) *. 2.1 ;; - : bool = false The suggestion that the results be tested for validity by checking the associative law of multipliciation won't worky very well, and neither will commutativity. In this case, instead of checking that A = B holds, it's better to output the difference between A and B (in case A and B are matrices, output the sup norm of the difference to keep it down to a single number). Andrej ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-09 21:55 ` Andrej Bauer @ 2007-02-09 22:36 ` ls-ocaml-developer-2006 2007-02-09 23:53 ` Jon Harrop 2007-02-10 1:10 ` Brian Hurt 1 sibling, 1 reply; 32+ messages in thread From: ls-ocaml-developer-2006 @ 2007-02-09 22:36 UTC (permalink / raw) To: caml-list Andrej Bauer <Andrej.Bauer@fmf.uni-lj.si> writes: > I hate to add to this long discussion, but since when is floating > point multiplication associative? > > # 1.3 *. (0.7 *. 2.1) = (1.3 *. 0.7) *. 2.1 ;; > - : bool = false Which just means, that "testing assoicativity" wouldn't work as a benchmark. Still I think the compiler (GCC) might take the license to optimize (a * b) *c == a * (b * c) to false with -O3. Gcc 4.1 is optimizing pretty aggressivly (yes even to the extend of doubtful, sometimes simply wrong semantics) AFAIK. > The suggestion that the results be tested for validity by checking the > associative law of multipliciation won't worky very well, and neither > will commutativity. In this case, instead of checking that A = B > holds, it's better to output the difference between A and B (in case A > and B are matrices, output the sup norm of the difference to keep it > down to a single number). :-) But wasn't I/O what Maxim wanted to avoid in a benchmark? Regards -- Markus ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-09 22:36 ` ls-ocaml-developer-2006 @ 2007-02-09 23:53 ` Jon Harrop 2007-02-10 1:41 ` ls-ocaml-developer-2006 0 siblings, 1 reply; 32+ messages in thread From: Jon Harrop @ 2007-02-09 23:53 UTC (permalink / raw) To: caml-list On Friday 09 February 2007 22:36, ls-ocaml-developer-2006@m-e-leypold.de wrote: > Still I think the compiler (GCC) might take the license to > optimize (a * b) *c == a * (b * c) to false with -O3. The -O* flags try to be correct but you can supply flags like -ffast-math to opt for faster but possibly less correct code: http://docs.freebsd.org/info//gcc/gcc.info.Optimize_Options.html Many compilers and languages sacrifice strict double-precision for performance (by not rounding from 80-bit), which is a major headache when you're implementing algorithms that rely upon it, e.g. the exact arithmetic in Triangle: http://www.soest.hawaii.edu/GMT/gmt/README.TRIANGLE -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. OCaml for Scientists http://www.ffconsultancy.com/products/ocaml_for_scientists ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-09 23:53 ` Jon Harrop @ 2007-02-10 1:41 ` ls-ocaml-developer-2006 2007-02-10 2:24 ` Jon Harrop 0 siblings, 1 reply; 32+ messages in thread From: ls-ocaml-developer-2006 @ 2007-02-10 1:41 UTC (permalink / raw) To: caml-list Jon Harrop <jon@ffconsultancy.com> writes: > On Friday 09 February 2007 22:36, ls-ocaml-developer-2006@m-e-leypold.de > wrote: >> Still I think the compiler (GCC) might take the license to >> optimize (a * b) *c == a * (b * c) to false with -O3. > > The -O* flags try to be correct but you can supply flags like -ffast-math to Sorry, I wanted to say "optimize (a * b) *c == a * (b * c) to true", i.e. 1. Would the compiler be wrong/incorrect in doing so? I don't think so. Regards -- Markus ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-10 1:41 ` ls-ocaml-developer-2006 @ 2007-02-10 2:24 ` Jon Harrop 2007-02-10 14:41 ` ls-ocaml-developer-2006 2007-02-11 13:13 ` Christophe Raffalli 0 siblings, 2 replies; 32+ messages in thread From: Jon Harrop @ 2007-02-10 2:24 UTC (permalink / raw) To: caml-list On Saturday 10 February 2007 01:41, ls-ocaml-developer-2006@m-e-leypold.de wrote: > Sorry, I wanted to say "optimize (a * b) *c == a * (b * c) to true", > i.e. 1. Would the compiler be wrong/incorrect in doing so? Yes. > I don't think so. Provided you only specify -O* optimisation flags, I believe that reducing that expression to "true" in the general case would be a compiler bug. If you specify flags like -ffast-math then anything can happen, but it will happen very quickly! ;-) -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. OCaml for Scientists http://www.ffconsultancy.com/products/ocaml_for_scientists ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-10 2:24 ` Jon Harrop @ 2007-02-10 14:41 ` ls-ocaml-developer-2006 2007-02-10 14:52 ` Jon Harrop 2007-02-10 14:55 ` Mattias Engdegård 2007-02-11 13:13 ` Christophe Raffalli 1 sibling, 2 replies; 32+ messages in thread From: ls-ocaml-developer-2006 @ 2007-02-10 14:41 UTC (permalink / raw) To: caml-list Jon Harrop <jon@ffconsultancy.com> writes: > On Saturday 10 February 2007 01:41, ls-ocaml-developer-2006@m-e-leypold.de > wrote: >> Sorry, I wanted to say "optimize (a * b) *c == a * (b * c) to true", >> i.e. 1. Would the compiler be wrong/incorrect in doing so? > > Yes. Just to be sure: Would the compiler be wrong to optimize c * q > c * k to just q > k (all floats). And why? If not, why, in the case above? I don't want letter and verse, but a general hand waving in the right direction would be nice, since I have the impression, that is exactly what Gcc 4.1. is currently doing (though for the integer case). > Provided you only specify -O* optimisation flags, I believe that reducing that > expression to "true" in the general case would be a compiler bug. If you > specify flags like -ffast-math then anything can happen, but it will happen > very quickly! ;-) :-) Regards -- Markus ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-10 14:41 ` ls-ocaml-developer-2006 @ 2007-02-10 14:52 ` Jon Harrop 2007-02-10 15:51 ` ls-ocaml-developer-2006 2007-02-10 14:55 ` Mattias Engdegård 1 sibling, 1 reply; 32+ messages in thread From: Jon Harrop @ 2007-02-10 14:52 UTC (permalink / raw) To: caml-list On Saturday 10 February 2007 14:41, ls-ocaml-developer-2006@m-e-leypold.de wrote: > Just to be sure: Would the compiler be wrong to optimize > > c * q > c * k > > to just > > q > k > > (all floats). And why? Consider c=0, q=3 and k=2: 0 * 3 > 0 * 2 --> false 3 > 2 --> true It is just mathematically incorrect. > If not, why, in the case above? Floats are not reals, they are just an approximation that happens to be very useful. Floats do not obey the same laws (e.g. associativity). However, programmers may be relying upon this fact, e.g. when doing exact float arithmetic. > I don't want > letter and verse, but a general hand waving in the right direction > would be nice, since I have the impression, that is exactly what Gcc > 4.1. is currently doing (though for the integer case). Ints are completely different because they are exact (as modulo integers). So they do obey the same laws and they do not have special constants (like infinity, neg_infinity, nan, -0. and so on as floats do). Xavier has written some enlightening posts here in the past, regarding the adoption of strict IEEE compliance in OCaml: http://caml.inria.fr/pub/ml-archives/caml-list/2004/10/ffa452944f4bb9827f2cdca552f4f823.en.html For an example of someone using the float constants that lie outside the set of reals, look no further than my ray tracer: http://www.ffconsultancy.com/free/ray_tracer/languages.html I used an initial intersection parameter of lambda=infinity to represent no intersection (or intersecting with the sky). I would be mortified if a compiler decided to optimise away my necessary and working code just because infinity is not in the set of real numbers. -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. OCaml for Scientists http://www.ffconsultancy.com/products/ocaml_for_scientists ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-10 14:52 ` Jon Harrop @ 2007-02-10 15:51 ` ls-ocaml-developer-2006 2007-02-10 16:10 ` Xavier Leroy 2007-02-10 16:11 ` Jon Harrop 0 siblings, 2 replies; 32+ messages in thread From: ls-ocaml-developer-2006 @ 2007-02-10 15:51 UTC (permalink / raw) To: caml-list Jon Harrop <jon@ffconsultancy.com> writes: > On Saturday 10 February 2007 14:41, ls-ocaml-developer-2006@m-e-leypold.de > wrote: >> Just to be sure: Would the compiler be wrong to optimize >> >> c * q > c * k >> >> to just >> >> q > k >> >> (all floats). And why? > > Consider c=0, q=3 and k=2: > > 0 * 3 > 0 * 2 --> false > 3 > 2 --> true > > It is just mathematically incorrect. Damn. I should have taken '+'. I've not been paying attention. My question should rather have been: Is the compiler allowed to make optimizations according to known mathematical laws? >> If not, why, in the case above? > > Floats are not reals, they are just an approximation that happens to be very > useful. Floats do not obey the same laws (e.g. associativity). However, > programmers may be relying upon this fact, e.g. when doing exact float > arithmetic. Apart from my obvious error above, I wonder, how much the compiler is required to keep to that expectations and when. Some examples to illustrate: Recently someone complained to the Gcc team that the compiler optimizes a + 100 < 100 to false in case a is an unsigned int. Actually the lnaguage standard allows it: If the addition flows over, anything is allowed (undefined behaviour) and if not, a+100 < 100 is just like a < 0. The expectation that the overflow must be visible in the language was not justfied in this case. I've been wondering about similar optimizations for floats, but didn't get my examples right. The transformtion done by the compiler would of course be forbidden to increase the error. >> I don't want >> letter and verse, but a general hand waving in the right direction >> would be nice, since I have the impression, that is exactly what Gcc >> 4.1. is currently doing (though for the integer case). > Ints are completely different because they are exact (as modulo integers). So > they do obey the same laws and they do not have special constants (like > infinity, neg_infinity, nan, -0. and so on as floats do). Yes, thanks. I hope I have justfied sufficiently my motiviation to ask my question, about which I should have been thinking just for some more minutes (I admit :-). Still: With a certain Gcc version and flags combination the OP saw a threefold improvement in performance. That in intself is suspicious (I don't think that this much optimization potential was left in Gcc ...) and I still would check for optimization errors in this case. Gcc is not bug free either, so one should test the correctness of the compiled program first and wether it really does the work it is supposed to do. Regards -- Markus ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-10 15:51 ` ls-ocaml-developer-2006 @ 2007-02-10 16:10 ` Xavier Leroy 2007-02-10 16:11 ` Jon Harrop 1 sibling, 0 replies; 32+ messages in thread From: Xavier Leroy @ 2007-02-10 16:10 UTC (permalink / raw) To: ls-ocaml-developer-2006; +Cc: caml-list > [Many questions about float arithmetic and optimizations] Do yourself a favor and read Goldberg's excellent reference: "What Every Computer Scientist Should Know About Floating Point Arithmetic" ACM Computing Surveys 23(1), 5-48, 1991 http://citeseer.ist.psu.edu/goldberg91what.html It doesn't read like _TV_digest_, but it's well worth the effort. > Is the compiler allowed to make > optimizations according to known mathematical laws? Yes, provided they actually hold. Goldberg lists a few that hold in IEEE float arithmetic (section "Optimizers"). But pretty much every algebraic law that holds over the reals doesn't hold in floating-point arithmetic. > Still: With a certain Gcc version and flags combination the OP saw a > threefold improvement in performance. That in intself is suspicious (I > don't think that this much optimization potential was left in Gcc ...) > and I still would check for optimization errors in this case. That's a good idea. Note however that there are known optimizations (loop blocking, loop interchange) that can dramatically improve the performance of dense matrix multiply by making better use of the caches. Automatic vectorization (generation of SSE2 instructions that operate over pairs of double-precision floats) could also have a significant impact, although not by a factor of 3. There's only one way to know: read the assembly code generated by gcc. - Xavier Leroy ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-10 15:51 ` ls-ocaml-developer-2006 2007-02-10 16:10 ` Xavier Leroy @ 2007-02-10 16:11 ` Jon Harrop 1 sibling, 0 replies; 32+ messages in thread From: Jon Harrop @ 2007-02-10 16:11 UTC (permalink / raw) To: caml-list On Saturday 10 February 2007 15:51, ls-ocaml-developer-2006@m-e-leypold.de wrote: > Damn. I should have taken '+'. Even for modulo integer arithmetic that still breaks: a+b > a+c when a+c wraps around max_int and becomes negative but a+b does not, then a+b>a+c even though c>b. > Is the compiler allowed to make optimizations according to known > mathematical laws? Provided they are the correct laws, yes. > I've been wondering about similar optimizations for floats, but didn't > get my examples right. The transformtion done by the compiler would of > course be forbidden to increase the error. Yes, which is actually very restrictive. That is why -ffast-math can make C/C++ programs run so much more quickly. In OCaml, -ffast-math has a different purpose (to enable the emission of trig instructions, IIRC) and it exists on x86 and not AMD64 whereas the same flag exists on both platforms for GCC. Incidentally, why is -ffast-math not always enabled on x86 and why isn't it a no-op on AMD64 (rather than causing an "unknown option" error as it does ATM)? > Still: With a certain Gcc version and flags combination the OP saw a > threefold improvement in performance. That in intself is suspicious (I > don't think that this much optimization potential was left in Gcc ...) > and I still would check for optimization errors in this case. Gcc is > not bug free either, so one should test the correctness of the > compiled program first and wether it really does the work it is > supposed to do. Although Frederic is an expert OCaml programmer, it sounds like his C was a little off the mark (having undefined behaviour and allowing arbitrary optimisations as a consequence). Benchmarking is a real can of worms. I've spent a long time trying to benchmark programs and languages and I really think the only feasible way forward is to measure the real time taken to solve real problems. Even this quickly becomes apples and oranges. For example, F# has a concurrent GC and I have a dual core machine, so F# has the advantage that even single threaded programs can exploit both cores when they are allocation intensive. So how am I supposed to compare the performance of the two languages? There is no panacea. In the end I must measure the time taken to perform an important real world task, like spinning a 3D bunny around. Is spinning a bunny serious enough though? If not, maybe I should use a Mandelbrot renderer: http://caml.inria.fr/cgi-bin/hump.en.cgi?contrib=481 -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. OCaml for Scientists http://www.ffconsultancy.com/products/ocaml_for_scientists ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-10 14:41 ` ls-ocaml-developer-2006 2007-02-10 14:52 ` Jon Harrop @ 2007-02-10 14:55 ` Mattias Engdegård 1 sibling, 0 replies; 32+ messages in thread From: Mattias Engdegård @ 2007-02-10 14:55 UTC (permalink / raw) To: ls-ocaml-developer-2006; +Cc: caml-list >Just to be sure: Would the compiler be wrong to optimize > > c * q > c * k > >to just > > q > k With IEEE floating-point it would be wrong for several reasons. Example 1: q and k are finite but c*q and c*k both are infinite. Example 2: q and k are small, so that c*q and c*k both are +-0 Example 3: c is NaN but q and k are not. I'm sure there are other examples involving but not limited to partial or total loss of precision, positive/negative zero, infinities and NaNs. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-10 2:24 ` Jon Harrop 2007-02-10 14:41 ` ls-ocaml-developer-2006 @ 2007-02-11 13:13 ` Christophe Raffalli 1 sibling, 0 replies; 32+ messages in thread From: Christophe Raffalli @ 2007-02-11 13:13 UTC (permalink / raw) Cc: caml-list Hi, Picking up one coefficient, choosed randomaly in the matrix should take a constant time and prevent gcc to do any optimization by guessing the the result is useless. Christophe Raffalli ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-09 21:55 ` Andrej Bauer 2007-02-09 22:36 ` ls-ocaml-developer-2006 @ 2007-02-10 1:10 ` Brian Hurt 2007-02-10 1:16 ` Robert Roessler 1 sibling, 1 reply; 32+ messages in thread From: Brian Hurt @ 2007-02-10 1:10 UTC (permalink / raw) To: Andrej.Bauer; +Cc: caml-list On Fri, 9 Feb 2007, Andrej Bauer wrote: > I hate to add to this long discussion, but since when is floating point > multiplication associative? > > # 1.3 *. (0.7 *. 2.1) = (1.3 *. 0.7) *. 2.1 ;; > - : bool = false Round off error. You can't represent numbers like 7/10 or 13/10 in binary precisely in a finite number of digits, any more than you can represent the numbers 1/3 or 4/7 precisely in base-10 in a finite number of digits. This is a fundamental problem with floating point numbers, and shows up in every language that has them. But this means reordering the calculations will result in slightly different answers. And no, most of the "obvious" solutions don't work. Using rational arithmetic doesn't help if I ask you to calculate the euclidean length of the vector (1, 1) for example (hint: sqrt (1.*.1. +. 1.*.1.)). Interval arithmetic fails miserably on simple algorithms like Newton's method. And so on. On the other hand, the precision given by double precision floating point numbers is generally more than enough- they're precise enough to hold the distance from here to the moon in microns, for example. My advice: if you're doing much of anything with floating point numbers, get a book on numerical analysis and follow the algorithms and recommendations it gives you (any book except "Numerical Recipies", which suck). And don't use floating point for money. Brian ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-10 1:10 ` Brian Hurt @ 2007-02-10 1:16 ` Robert Roessler 0 siblings, 0 replies; 32+ messages in thread From: Robert Roessler @ 2007-02-10 1:16 UTC (permalink / raw) To: caml-list Brian Hurt wrote: > On Fri, 9 Feb 2007, Andrej Bauer wrote: > >> I hate to add to this long discussion, but since when is floating >> point multiplication associative? >> >> # 1.3 *. (0.7 *. 2.1) = (1.3 *. 0.7) *. 2.1 ;; >> - : bool = false > > Round off error. You can't represent numbers like 7/10 or 13/10 in > binary precisely in a finite number of digits, any more than you can > represent the numbers 1/3 or 4/7 precisely in base-10 in a finite number > of digits. This is a fundamental problem with floating point numbers, > and shows up in every language that has them. But this means reordering > the calculations will result in slightly different answers. I had assumed this was the point he was making... Robert Roessler roessler@rftp.com http://www.rftp.com ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-09 14:22 ` skaller 2007-02-09 21:47 ` ls-ocaml-developer-2006 @ 2007-02-09 23:56 ` Jon Harrop 1 sibling, 0 replies; 32+ messages in thread From: Jon Harrop @ 2007-02-09 23:56 UTC (permalink / raw) To: caml-list On Friday 09 February 2007 14:22, skaller wrote: > The right way to benchmark simple > things like matrix operations AND simultaneously validate them > is to evaluate known laws of arithmetic, such as the associative > law, on various samples. They aren't likely to be associative (or commutative). The best way to benchmark (IMHO) is to run programs that actually do something useful. -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. OCaml for Scientists http://www.ffconsultancy.com/products/ocaml_for_scientists ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-09 9:06 ` ls-ocaml-developer-2006 2007-02-09 10:32 ` ls-ocaml-developer-2006 @ 2007-02-09 12:05 ` Jon Harrop 2007-02-09 12:35 ` ls-ocaml-developer-2006 2007-02-09 13:50 ` Brian Hurt 2 siblings, 1 reply; 32+ messages in thread From: Jon Harrop @ 2007-02-09 12:05 UTC (permalink / raw) To: caml-list On Friday 09 February 2007 09:06, ls-ocaml-developer-2006@m-e-leypold.de wrote: > t(n) = C + K*n If you're assuming that the time taken will be proportional to complexity then that is almost certainly quite inaccurate in this case due to cache effects. -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. OCaml for Scientists http://www.ffconsultancy.com/products/ocaml_for_scientists ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-09 12:05 ` Jon Harrop @ 2007-02-09 12:35 ` ls-ocaml-developer-2006 0 siblings, 0 replies; 32+ messages in thread From: ls-ocaml-developer-2006 @ 2007-02-09 12:35 UTC (permalink / raw) To: caml-list Jon Harrop <jon@ffconsultancy.com> writes: > On Friday 09 February 2007 09:06, ls-ocaml-developer-2006@m-e-leypold.de > wrote: >> t(n) = C + K*n > > If you're assuming that the time taken will be proportional to complexity then Not complexity: The number of matrices multiplied ... > that is almost certainly quite inaccurate in this case due to cache effects. Yes, that's too much simplification. I'm aware of that. Exactly for that reason the exact shape of the resulting graph would be interesting, because it reflects where the different caches run out of capacity. Still -- with 15s tun time and random data I assume that the processor L1/L2 caches don't count (I don't see, why multiplying 2 different matrices one after the other should profit from caching -- mind: I scale the number of multiplications, not the size of matrices). BTW: With my short note I didn't want to suggest a complete methodology of measurement, but only wanted to comment that perhaps optimization does still take place (because the results are not used) and just hint to better methods of measurements than just taking the time of a single run at a constant problem size. You (or better the guy with the original problem) might improve on that at will. Regards -- Markus ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-09 9:06 ` ls-ocaml-developer-2006 2007-02-09 10:32 ` ls-ocaml-developer-2006 2007-02-09 12:05 ` Jon Harrop @ 2007-02-09 13:50 ` Brian Hurt 2007-02-09 14:23 ` Gerd Stolpmann 2 siblings, 1 reply; 32+ messages in thread From: Brian Hurt @ 2007-02-09 13:50 UTC (permalink / raw) To: ls-ocaml-developer-2006; +Cc: caml-list >I'd feel better if the code is benchmarked in a way >that the result of the multiplication is output to a file and to >subtract the constant contribution of that to the run time that the >time is measured for various problem sizes (number of matrices). > For the C side, I recommend using clock() to time the runs. This also factors out the cost of creating the random matricies. Basically, you just do: #include <time.h> ... { clock_t start, stop; ... start = clock(); /* stuff to time here */ stop = clock(); printf("The amount of time taken = %f seconds\n", ((double)(stop-start))/((double) CLOCKS_PER_SEC)); /* write result to file */ } In Ocaml, Unix.gettimeofday can be used similiarly: let start = Unix.gettimeofday() in (* stuff to time here *) let stop = Unix.gettimeofday() in Printf.printf "The amount of time taken = %f seconds\n" (stop -. start); (* write results to file *) Brian ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Caml-list] Multiplication of matrix in C and OCaml 2007-02-09 13:50 ` Brian Hurt @ 2007-02-09 14:23 ` Gerd Stolpmann 0 siblings, 0 replies; 32+ messages in thread From: Gerd Stolpmann @ 2007-02-09 14:23 UTC (permalink / raw) To: Brian Hurt; +Cc: ls-ocaml-developer-2006, caml-list Am Freitag, den 09.02.2007, 08:50 -0500 schrieb Brian Hurt: > For the C side, I recommend using clock() to time the runs. > ... > In Ocaml, Unix.gettimeofday can be used similiarly: Ahem, don't understand why you suggest two different methods for C and Ocaml. First, there is Unix.times in O'Caml that also returns the process time like clock(). Second - believe it or not - there is also gettimeofday in C. Personally, I recommend to rely on gettimeofday, i.e. the wall clock time of the system, on an otherwise idle system. That's the time that finally counts, and the way the process time is accumulated by the kernel is somewhat dubious (but may be useful to check whether the system was really idle during the benchmark). Gerd -- ------------------------------------------------------------ Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de Phone: +49-6151-153855 Fax: +49-6151-997714 ------------------------------------------------------------ ^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2007-02-11 13:12 UTC | newest] Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2007-02-09 14:24 [Caml-list] Multiplication of matrix in C and OCaml Frederic GAVA -- strict thread matches above, loose matches on Subject: below -- 2007-02-07 23:42 Frédéric Gava 2007-02-08 2:14 ` [Caml-list] " Jacques Garrigue 2007-02-08 9:27 ` Frédéric Gava 2007-02-08 9:38 ` Frédéric Gava 2007-02-08 12:08 ` Jacques Garrigue 2007-02-08 9:56 ` Frédéric Gava 2007-02-08 10:45 ` Xavier Leroy 2007-02-08 15:16 ` Frédéric Gava 2007-02-09 2:58 ` Jacques Garrigue 2007-02-09 9:06 ` ls-ocaml-developer-2006 2007-02-09 10:32 ` ls-ocaml-developer-2006 2007-02-09 14:22 ` skaller 2007-02-09 21:47 ` ls-ocaml-developer-2006 2007-02-09 21:55 ` Andrej Bauer 2007-02-09 22:36 ` ls-ocaml-developer-2006 2007-02-09 23:53 ` Jon Harrop 2007-02-10 1:41 ` ls-ocaml-developer-2006 2007-02-10 2:24 ` Jon Harrop 2007-02-10 14:41 ` ls-ocaml-developer-2006 2007-02-10 14:52 ` Jon Harrop 2007-02-10 15:51 ` ls-ocaml-developer-2006 2007-02-10 16:10 ` Xavier Leroy 2007-02-10 16:11 ` Jon Harrop 2007-02-10 14:55 ` Mattias Engdegård 2007-02-11 13:13 ` Christophe Raffalli 2007-02-10 1:10 ` Brian Hurt 2007-02-10 1:16 ` Robert Roessler 2007-02-09 23:56 ` Jon Harrop 2007-02-09 12:05 ` Jon Harrop 2007-02-09 12:35 ` ls-ocaml-developer-2006 2007-02-09 13:50 ` Brian Hurt 2007-02-09 14:23 ` Gerd Stolpmann
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox