* [Caml-list] Closing the performance gap to C
From: Christoph Höger @ 2016-12-17 13:01 UTC

Dear all,

find attached two simple runge-kutta iteration schemes. One is written in C,
the other in OCaml. I compared the runtime of both and gcc (-O2) produces an
executable that is roughly 30% faster (to be more precise: 3.52s vs. 2.63s).
That is in itself quite pleasing, I think. I do not understand, however, what
causes this difference. Admittedly, the generated assembly looks completely
different, but both compilers inline all functions and generate one big loop.
OCaml generates a lot more scaffolding, but that is to be expected.

There is however an interesting particularity: OCaml generates 6 calls to
cos, while gcc only needs 3 (and one direct jump). Surprisingly, there are
also calls to cosh, acos and pretty much any other trigonometric function
(initialization of constants, maybe?)

However, the true culprit seems to be an excess of instructions between the
different calls to cos. This is what happens between the first two calls to
cos:

gcc:

  jmpq   400530 <cos@plt>
  nop
  nopw   %cs:0x0(%rax,%rax,1)

  sub    $0x38,%rsp
  movsd  %xmm0,0x10(%rsp)
  movapd %xmm1,%xmm0
  movsd  %xmm2,0x18(%rsp)
  movsd  %xmm1,0x8(%rsp)
  callq  400530 <cos@plt>

ocamlopt:

  callq  401a60 <cos@plt>
  mulsd  (%r12),%xmm0
  movsd  %xmm0,0x10(%rsp)
  sub    $0x10,%r15
  lea    0x25c7b6(%rip),%rax
  cmp    (%rax),%r15
  jb     404a8a <dlerror@plt+0x2d0a>
  lea    0x8(%r15),%rax
  movq   $0x4fd,-0x8(%rax)
  movsd  0x32319(%rip),%xmm1
  movapd %xmm1,%xmm2
  mulsd  %xmm0,%xmm2
  addsd  0x0(%r13),%xmm2
  movsd  %xmm2,(%rax)
  movapd %xmm1,%xmm0
  mulsd  (%r12),%xmm0
  addsd  (%rbx),%xmm0
  callq  401a60 <cos@plt>

Is this caused by some underlying difference in the representation of numeric
values (i.e. tagged ints) or is it reasonable to attack this issue as a hobby
experiment?

thanks for any advice,

Christoph

--
Christoph Höger
Technische Universität Berlin
Fakultät IV - Elektrotechnik und Informatik
Übersetzerbau und Programmiersprachen
Sekr. TEL12-2, Ernst-Reuter-Platz 7, 10587 Berlin
Tel.: +49 (30) 314-24890
E-Mail: christoph.hoeger@tu-berlin.de
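A note on the closing question: as the replies establish, the overhead comes
from float boxing rather than from int tagging. A minimal sketch of the
distinction (not taken from the attached programs):

  (* no allocation: the int result stays in a register as a tagged word *)
  let incr_int n = n + 1

  (* the float result is returned as a fresh one-word heap block (a box)
     unless the compiler can keep it unboxed, e.g. after inlining *)
  let incr_float x = x +. 1.0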
* Re: [Caml-list] Closing the performance gap to C
From: Christoph Höger @ 2016-12-17 13:02 UTC

Oops. Forgot the actual examples.

rk4.c:

  #include <stdio.h>
  #include <math.h>

  double exact(double t) { return sin(t); }

  double dy(double t, double y) { return cos(t); }

  double rk4_step(double y, double t, double h) {
    double k1 = h * dy(t, y);
    double k2 = h * dy(t + 0.5 * h, y + 0.5 * k1);
    double k3 = h * dy(t + 0.5 * h, y + 0.5 * k2);
    double k4 = h * dy(t + h, y + k3);
    return y + (k1 + k4) / 6.0 + (k2 + k3) / 3.0;
  }

  double loop(int steps, double h, int n, double y, double t) {
    if (n < steps) return loop(steps, h, n + 1, rk4_step(y, t, h), t + h);
    else return y;
  }

  int main() {
    double h = 0.1;
    double y = loop(102, h, 1, 1.0, 0.0);
    double err = fabs(y - exact(102 * h));
    int large = 10000000;
    double y2 = loop(large, h, 1, 1.0, 0.0);
    printf("%d\n", (fabs(y2 - (exact(large * h))) < 2. * err));
    return 0;
  }

testrk4.ml:

  let y' t y = cos t

  let exact t = sin t

  let rk4_step y t h =
    let k1 = h *. y' t y in
    let k2 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k1) in
    let k3 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k2) in
    let k4 = h *. y' (t +. h) (y +. k3) in
    y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0

  let rec loop steps h n y t =
    if n < steps then loop steps h (n+1) (rk4_step y t h) (t +. h)
    else y

  let _ =
    let h = 0.1 in
    let y = loop 102 h 1 1.0 0.0 in
    let err = abs_float (y -. (exact ((float_of_int 102) *. h))) in
    let large = 10000000 in
    let y = loop large h 1 1.0 0.0 in
    Printf.printf "%b\n"
      (abs_float (y -. (exact ((float_of_int large) *. h))) < 2. *. err)
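The question of whether the OCaml loop allocates on every iteration (raised
in the replies below) can be checked directly on the program above by
comparing GC counters around a run. A minimal sketch, assuming the
definitions from testrk4.ml are in scope:

  let () =
    let before = Gc.allocated_bytes () in
    let _ = loop 1_000_001 0.1 1 1.0 0.0 in
    let after = Gc.allocated_bytes () in
    (* each boxed float costs 16 bytes (header + payload) on 64-bit, so a
       nonzero per-iteration figure here indicates boxing inside the loop *)
    Printf.printf "bytes allocated per iteration: %.1f\n"
      ((after -. before) /. 1_000_000.)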
* RE: [Caml-list] Closing the performance gap to C
From: Soegtrop, Michael @ 2016-12-19 10:58 UTC

Dear Christoph,

would you mind sharing the assembly code? It would be interesting to see if
OCaml generates x87 style stack based FPU code, or more modern SSE style
register based code. Also it would be interesting to know how gcc was
configured (for what kind of floating point code). For floating point stuff,
the configuration of the CPU can have very substantial impact on the result.

Best regards,

Michael
* Re: [Caml-list] Closing the performance gap to C
From: Gerd Stolpmann @ 2016-12-19 11:51 UTC

Hi Christoph,

the extra code looks very much like an allocation on the minor heap:

  sub    $0x10,%r15
  lea    0x25c7b6(%rip),%rax
  cmp    (%rax),%r15
  jb     404a8a <dlerror@plt+0x2d0a>
  lea    0x8(%r15),%rax
  movq   $0x4fd,-0x8(%rax)

r15 points to the used area of the minor heap - by decrementing it you get an
additional block of memory. It is compared against the beginning of the heap
to check whether a GC is needed. The constant 0x4fd is the header of the new
block (which must always be initialized).

From the source code, it remains unclear what this is used for. Obviously,
the compiler runs out of registers and moves some values to the minor heap
(temporarily). When you call a C function like cos, it is likely that this
happens because the C calling conventions do not preserve the FP registers
(xmm*). This could be improved if the OCaml compiler tried alternate places
for temporarily storing FP values:

- int registers (which is perfectly possible on 64 bit platforms; a number of
  int registers survive C calls)
- the stack

To my knowledge, the OCaml compiler never tries this (but this could be out
of date). This is a fairly specific optimization that makes sense mostly for
purely iterating or aggregating functions like yours that do not store FP
values away.

Gerd
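A side note on the header constant 0x4fd in the snippet above: it can be
decoded with the usual 64-bit block-header layout (size, colour, tag), which
identifies the freshly allocated block as a one-word float box. A small
sketch, assuming the standard runtime encoding:

  let () =
    let header = 0x4fd in
    (* prints size=1 colour=0 tag=253; Obj.double_tag is 253, so the block
       allocated between the cos calls is a boxed float *)
    Printf.printf "size=%d colour=%d tag=%d (Obj.double_tag = %d)\n"
      (header lsr 10) ((header lsr 8) land 3) (header land 0xff)
      Obj.double_tag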
* RE: [Caml-list] Closing the performance gap to C
From: Soegtrop, Michael @ 2016-12-19 14:52 UTC

Dear Gerd,

> When you call a C function like cos it is likely that this happens because
> the C calling conventions do not preserve the FP registers (xmm*). This
> could be improved if the OCaml compiler tried alternate places for
> temporarily storing FP values:

For Windows this doesn't seem to be true. See e.g.:

https://msdn.microsoft.com/en-us/library/ms235286.aspx

which states that XMM0..XMM5 are volatile, while XMM6..XMM15 must be
preserved.

I think for Linux you are right. I couldn't find a better reference than
Wikipedia:

https://en.wikipedia.org/wiki/X86_calling_conventions

see "System V AMD64 ABI" there.

This reference contains a good overview, which matches the above data in
table 4:

http://www.agner.org/optimize/calling_conventions.pdf

So on Windows, there is for sure no need to save XMM6..XMM15, while on Linux
this seems to be an issue.

Best regards,

Michael
* Re: [Caml-list] Closing the performance gap to C
From: Gerd Stolpmann @ 2016-12-19 16:41 UTC

Michael,

look here, it's the "definitive source":

https://github.com/ocaml/ocaml/blob/trunk/asmcomp/amd64/proc.ml

Windows is indeed different. I don't have enough insight into the register
spilling algorithm to say whether this has a significant effect, though.
OCaml code never keeps registers alive, and thus I would bet the spilling
algorithm is designed for that, and it probably does not try to move values
preferably to xmm6-15 in order to avoid spilling during C calls. But that's
just a hypothesis... Does somebody know?

Gerd
* Re: [Caml-list] Closing the performance gap to C
From: Frédéric Bour @ 2016-12-19 17:09 UTC

OCamlopt is able to spill floating point registers. You can even see with
-dalloc that the code will spill a floating point register in the loop.

The problem observed is not because of spilling but simply because of float
boxing and the compilation of recursive calls. The loop seems to compile down
to efficient code ending in a jump, but float unboxing is done in a much
earlier pass in the compiler (cmm).

Passing -dcmm to the compiler, we can see that before the recursive call to
loop the float is boxed again. At this point, it is just a regular OCaml
function call, taking boxed floats.

A simpler code to observe this behavior:

  let rec test f =
    test (f +. 1.0)

  let () = test 0.0

will box at every iteration.
* Re: [Caml-list] Closing the performance gap to C
From: Yotam Barnoy @ 2016-12-19 17:19 UTC

On Mon, Dec 19, 2016 at 12:09 PM, Frédéric Bour wrote:
> The problem observed is not because of spilling but simply because of float
> boxing and the compilation of recursive calls.
> [...]
> A simpler code to observe this behavior:
>
>   let rec test f =
>     test (f +. 1.0)
>
>   let () = test 0.0
>
> will box at every iteration.

Yes, this is a current weak point of OCaml compilation, which I've reported
here (https://caml.inria.fr/mantis/view.php?id=7289). The only real solution
to this currently is to improve Flambda to the point that unboxing is done at
the Flambda level rather than at cmmgen. Unboxing decisions are currently
extremely local, which isn't good enough for recursive functions, which
cannot be inlined.

Currently, the only workaround is to use (a) for/while loops or (b) a global
mutable value instead of a parameter.

-Yotam
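To illustrate the workaround mentioned above (a sketch, not code from the
thread): Frédéric's example rewritten with a local mutable float instead of a
recursive parameter. The reference-based version lets the compiler keep the
accumulator unboxed inside the loop, while the recursive version boxes it
before every call.

  (* recursive version: f is boxed before every call *)
  let rec test f n =
    if n = 0 then f else test (f +. 1.0) (n - 1)

  (* imperative version: the local float ref does not escape, so the
     compiler can keep it in unboxed form for the whole loop *)
  let test_loop n =
    let f = ref 0.0 in
    for _i = 1 to n do
      f := !f +. 1.0
    done;
    !f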
* Re: [Caml-list] Closing the performance gap to C
From: Alain Frisch @ 2016-12-21 11:25 UTC

On 19/12/2016 18:19, Yotam Barnoy wrote:
> Yes, this is a current weak point of OCaml compilation, which I've reported
> here (https://caml.inria.fr/mantis/view.php?id=7289). The only real
> solution to this currently is to improve Flambda to the point that unboxing
> is done at the Flambda level rather than at cmmgen.

The issue could very well be addressed outside flambda, at the cmmgen level.
See https://caml.inria.fr/mantis/view.php?id=5894 . Making flambda aware of
boxing can help its inlining decisions, but the notion of having a
specialized unboxing calling convention is not really tied to flambda.

Pushing #5894 forward could have a direct impact on numerical code. I think
it would really be useful to work on that, even if flambda also does
something along these lines at some point.

-- Alain
* Re: [Caml-list] Closing the performance gap to C
From: Yotam Barnoy @ 2016-12-21 14:45 UTC

> The issue could very well be addressed outside flambda, at the cmmgen
> level. See https://caml.inria.fr/mantis/view.php?id=5894 . Making flambda
> aware of boxing can help its inlining decisions, but the notion of having
> a specialized unboxing calling convention is not really tied to flambda.
>
> Pushing #5894 forward could have a direct impact on numerical code. I
> think it would really be useful to work on that, even if flambda also does
> something along these lines at some point.
>
> -- Alain

I think it's not worth the effort. You need to examine all the code dealing
with a parameter (i.e. its flow) to see if any generic function is called on
that parameter. This is work best left to Flambda. Even if a generic function
is called on said parameter, you can still express a cost-based approximation
to determine whether unboxing is worthwhile. This is all large-scale
optimization rather than peephole optimization, and fits better under
Flambda's umbrella IMO.

-Yotam
* Re: [Caml-list] Closing the performance gap to C
From: Alain Frisch @ 2016-12-21 16:06 UTC

On 21/12/2016 15:45, Yotam Barnoy wrote:
> I think it's not worth the effort. You need to examine all the code dealing
> with a parameter (i.e. its flow) to see if any generic function is called
> on that parameter.

This would be treated a bit like the stubs for optional arguments with a
default value. Any function taking float arguments or returning a float would
be compiled to a specialized version with an unboxed calling convention plus
a stub which implements the generic calling convention and delegates the call
to the specialized version (or copies its body if it is small enough). At the
call site, when the called function is known, one calls the specialized
version instead. This is a systematic, local compilation scheme, similar to
other optimizations made in closure/cmmgen; it does not require a more global
analysis nor a radically different representation of the code.

About the "it's not worth the effort": the effort has largely been done,
since the ticket came with a working patch (some more effort would be needed
to bring it up to date, though). In my opinion, this seems like a rather
low-hanging fruit with very immediate and significant gains. I'd rather have
this soon than wait for flambda to become stable and usable on large
projects.

-- Alain
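The optional-argument analogy refers to the existing transformation that
splits a function with a defaulted optional argument into a full-arity worker
plus a small wrapper that fills in the default. A hand-written sketch of the
idea (not the compiler's actual output; `step` is just an illustrative name):

  (* what you write *)
  let step ?(h = 0.1) t = cos t *. h

  (* roughly what the split produces: a worker with every argument
     supplied, plus a thin wrapper that only computes the default *)
  let step_worker h t = cos t *. h
  let step' ?(h = 0.1) t = step_worker h t

The proposal is analogous: an unboxed-float worker plus a boxing stub, with
known call sites going straight to the worker.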
* Re: [Caml-list] Closing the performance gap to C
From: Gerd Stolpmann @ 2016-12-21 16:31 UTC

Dumb question, because you are effectively suggesting an alternate calling
convention in addition to the existing one: wouldn't it make more sense to
switch to a completely different convention? Like: we always pass fp values
in fp registers, as far as they are available? Sure, there are cases leading
to double allocations of the same value then. But getting this under control
is some kind of CSE problem, and something that flambda can really deal with.

I mean, the current calling conventions only exist because of outdated CPU
architectures (x87). Maybe it is time to rethink this.

Gerd
* Re: [Caml-list] Closing the performance gap to C
From: Yotam Barnoy @ 2016-12-21 16:39 UTC

On Wed, Dec 21, 2016 at 11:31 AM, Gerd Stolpmann wrote:
> Dumb question because you are effectively suggesting an alternate calling
> convention in addition to the existing one: wouldn't it make more sense to
> switch to a completely different convention?

As far as I understand, the current calling convention has to do with generic
functions. Having to be polymorphic over all types, including primitive ones,
such as floating point and ints of different widths, is really difficult.
This is why OOP languages don't do it -- only classes are polymorphic.

And Alain, even if we create these alternate calling conventions, the tricky
part is still detecting when it's ok to call using direct calling
conventions. That's the hard part, and it's best done by Flambda with its
inlining.

-Yotam
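A small illustration of the point about generic functions (a sketch using
only the standard library): a single compiled body serves every instantiation
of a polymorphic function, so a float passed through it has to use the
uniform boxed representation.

  (* one compiled body for every 'a, so 'a must fit the uniform machine
     representation: a word that is either a tagged int or a pointer *)
  let apply_twice f x = f (f x)

  let () =
    Printf.printf "%f %s\n"
      (apply_twice (fun x -> x +. 1.0) 3.0)   (* the float arrives boxed *)
      (apply_twice (fun s -> s ^ "!") "hi")   (* same code path for strings *)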
* Re: [Caml-list] Closing the performance gap to C
From: Gabriel Scherer @ 2016-12-21 16:47 UTC

I think there is an information asymmetry here. Alain has done a lot of work
on float unboxing in old and recent OCaml releases, and if he says that some
approach is a low-hanging fruit it is reasonable to trust him. I suspect,
Yotam, that you have different issues in mind that correspond to the fact
that inlining and specialization at the flambda level could be complementary
with the work that Alain is describing.

(I think it is difficult to accurately discuss these optimization questions
at a high/conceptual level only.)
* Re: [Caml-list] Closing the performance gap to C
From: Yotam Barnoy @ 2016-12-21 16:51 UTC

Fair enough. I'm all for low-hanging fruit.
* Re: [Caml-list] Closing the performance gap to C
From: Mark Shinwell @ 2016-12-21 16:56 UTC

I agree with Gabriel, but we are planning substantial work within Flambda
during the coming year to implement unboxing transformations therein. I do
think we should spend a little time, before diving into this particular
issue, convincing ourselves that these two tracks of work are going to be
complementary rather than conflicting.

Mark
* Re: [Caml-list] Closing the performance gap to C
From: Alain Frisch @ 2016-12-21 17:43 UTC

On 21/12/2016 17:56, Mark Shinwell wrote:
> I agree with Gabriel, but we are planning substantial work within Flambda
> during the coming year to implement unboxing transformations therein. I do
> think we should spend a little time, before diving into this particular
> issue, convincing ourselves that these two tracks of work are going to be
> complementary rather than conflicting.

Dealing with boxing/unboxing in flambda rather than keeping it at the cmm
level certainly seems the way to go in the context of the flambda pipeline.
This will probably need to touch the definition of clambda (to bring it
closer to cmm) and thus cmmgen. Do you believe this is compatible with
continuing to share clambda and cmmgen between the flambda and non-flambda
pipelines? At some point, it might be more natural, or just required, to
compile flambda directly to cmm without going through clambda. What do you
think?

If the flambda pipeline targets cmm directly without going through
clambda/cmmgen, the approach discussed here (for the non-flambda case) should
not interact too much with the flambda version. As far as I can tell, they
would mostly share the plumbing required to pass more info from typedtree to
lambda.

Alain
* Re: [Caml-list] Closing the performance gap to C
From: Mark Shinwell @ 2016-12-22 8:39 UTC

I'm not sure I have an answer to that question at the moment---I suggest
taking this item to a smaller group for discussion. I think we've mooted the
idea of trying to have some variant of Cmmgen that did the Clambda to Cmm
translation without doing e.g. the unboxing part (which Flambda would already
have done). However, something like that isn't entirely straightforward, as
there are various Cmm -> Cmm rewrites that are currently done by Cmmgen and
that we probably need to preserve. I'm pretty uncertain about how to treat
those at the moment; I think it would require some amount of experimentation.

Something tangentially related is that we should be able to get to the stage
fairly soon (within 2017, I think) where Flambda only generates the subset of
the Cmm language that is in SSA form. It doesn't seem to me that keeping
Clambda in the Flambda pipeline is actually a bad thing; it forms quite a
nice progression. It would however be nice to remove the Un_anf pass, which
is only there to satisfy specific pattern-matches in the Cmmgen code, and
slows down compilation.

Mark
* Re: [Caml-list] Closing the performance gap to C
From: Pierre Chambart @ 2016-12-22 17:23 UTC

On 21/12/2016 18:43, Alain Frisch wrote:
> Dealing with boxing/unboxing in flambda rather than keeping it at the cmm
> level certainly seems the way to go in the context of the flambda pipeline.
> [...]

The machinery to do that correctly could certainly be shared, I think.

The way I see it is to change the semantics of the float related primitives
at the clambda and flambda level. These should represent explicitly unboxed
float manipulation. The Closure/Closure_conversion (or Flambda_to_clambda,
for a first patch that does not change how flambda works) passes would
rewrite them to be enclosed by new box/unbox primitives. For instance, in
Cmmgen this would look like:

  | Pmulfloat -> Cop(Cmulf, [transl env arg1; transl env arg2], dbg)
  | Pboxfloat -> box_float dbg (transl env arg)
  | Punboxfloat -> transl_unbox_float dbg env arg

Then, on functions and direct calls at the clambda level, annotate the
calling convention, like in my old patch. This should be sufficient for doing
what was done in that patch without degrading anything, while still allowing
this to be implemented separately in flambda. Of course the Typedtree to
lambda type annotation propagation can be shared without any problem.
Updating the patch for that will be a bit tedious, but simple.

If you don't like the idea of having primitives that do not mean exactly the
same thing at the different stages of the compiler, introducing a new unboxed
version of each float primitive is also possible (with its use forbidden in
lambda). I would like at some point to have a different type for lambda
primitives and clambda primitives. It might be the right time to do that.

Also, there was a problem with the approach of that old patch that bothered
me. The function argument unboxing decision had to guess whether an argument
was going to be used boxed or not. If the guess is wrong you might end up
allocating more than without unboxing. Having the annotation there explicitly
allows the unboxing to be done earlier, with the decision relying on the
annotation. Notice that the transformation was not completely local, as it
had to propagate across functions whose arguments might be used unboxed. In
particular, recursive functions made it a bit complex. This was the job of
the added Specialisation module.

-- Pierre
* Re: [Caml-list] Closing the performance gap to C
From: Alain Frisch @ 2016-12-21 17:35 UTC

On 21/12/2016 17:39, Yotam Barnoy wrote:
> As far as I understand, the current calling convention has to do with
> generic functions.

Indeed. If you write:

  let f x = x +. 1.
  let g y = f (y +. 1.)

you cannot simply choose a specific calling convention more suited to fp
arguments/results, because the same function can be used as:

  let h l = List.map f l

Currently, the code would be compiled as (pseudo code with explicit
boxing/unboxing, and explicit function symbols and closures):

  let f$direct x = box (unbox x +. 1.)
  let f = closure(f$direct)
  let g$direct y = f$direct (box (unbox y +. 1.))
  let g = closure(g$direct)
  let h l = List.map f l

> And Alain, even if we create these alternate calling conventions, the
> tricky part is still detecting when it's ok to call using direct calling
> conventions. That's the hard part, and it's best done by Flambda with its
> inlining.

The point is that it's not really hard: there is already some support in the
compiler to detect direct calls and apply a specific calling convention (cf.
Clambda.Udirect_apply vs Clambda.Ugeneric_apply). It is "just" that this
specific calling convention does not do any specialization (the same function
body is used for the direct calling convention and the generic one).

The suggested approach would compile the code above as (assuming no inlining
of the f$direct and g$direct bodies):

  let f$unboxed x = x +. 1.
  let f$direct x = box (f$unboxed (unbox x))
  let f = closure(f$direct)

  let g$unboxed y =
      unbox (f$direct (box (y +. 1.)))
    = unbox (box (f$unboxed (unbox (box (y +. 1.)))))  (* short-circuit the wrapper *)
    = f$unboxed (y +. 1.)                              (* collapse unbox(box(..)) -> .. *)
  let g$direct y = box (g$unboxed (unbox y))

  let h l = List.map f l

The fact that g can be compiled to a direct call to f (and not to an
arbitrary function) is already known to the compiler. And at the level of
cmm, one can already describe functions that take unboxed floats (cf.
Cmm.machtype, and the fun_args field in Cmm.fundecl). What is missing is
mostly just the plumbing of type information to know that f and g should be
compiled as above (i.e. keep track of the type of their arguments and
result), and arranging for direct calls to short-circuit the boxing wrapper.

Now of course there are cases where this strategy would degrade performance:
imagine a function taking a float, but using it most often as a boxed float
(e.g. passing it to Printf.printf "%f"); all call sites would still pass an
unboxed float (even if the float comes naturally in already boxed form) and
that function would need to box the float again (possibly several times).
This is exactly the same story as for the current unboxing strategy
(described in http://www.lexifi.com/blog/unboxed-floats-ocaml ), where the
compiler always decides to keep a local "float ref" variable in unboxed form,
sometimes leading to extra boxing. But in practice this works well and lets
low-level numerical code run with no boxing at all except at data-structure
boundaries and function calls (for now).

Alain
* Re: [Caml-list] Closing the performance gap to C 2016-12-19 11:51 ` Gerd Stolpmann 2016-12-19 14:52 ` Soegtrop, Michael @ 2016-12-19 15:48 ` Ivan Gotovchits 2016-12-19 16:44 ` Yotam Barnoy 1 sibling, 1 reply; 25+ messages in thread From: Ivan Gotovchits @ 2016-12-19 15:48 UTC (permalink / raw) To: Gerd Stolpmann; +Cc: Christoph Höger, caml-list [-- Attachment #1: Type: text/plain, Size: 9211 bytes --] Hi Christoph, The problem is that your function definitions, like `loop` and `rk4_step`, have too many parameters and OCaml is not able to eliminate them and is actually not trying. It was always a feature of OCaml that poorly written code will be compiled into a poorly written program. OCaml never tries to fix programmer's errors. It will try to minimize the abstraction overhead (often to the zero). But if the abstractions, on the first hand, were chosen incorrectly, then it won't fix the code. In your particular example, the C compiler was quite aggressive and was capable of eliminating unnecessary computations. I wouldn't, however, assume that the C compiler will be always able to do this for you. Let's go through the code: let rec loop steps h n y t = if n < steps then loop steps h (n+1) (rk4_step y t h) (t +. h) else y Here variables `steps` and `h` are loop invariants, so you shouldn't put it into the loop function parameter list. Well yes, a compiler can find out loop invariants and remove them for you. But in our case, to remove them, it will need to change the type of the function. The compiler will not go that far for us. It respects a programmer, and if a programmer decided to make his loop function to depend on `6` arguments, then it means that the computation is actually depending on 6 arguments. So, it will allocate 6 registers to hold loop variables, with all the consequences. Now, let's take a look at the `rk4_step` function: let rk4_step y t h = let k1 = h *. y' t y in let k2 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k1) in let k3 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k2) in let k4 = h *. y' (t +. h) (y +. k3) in y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 This function, is, in fact, a body of the loop, and everything except t is loop invariant here. Moreover, function `y'` is defined as: let y' t y = cos t I.e., it doesn't really use it the second argument. Probably, a compiler should inline the call, and eliminate lots of unecessary computations, and thus free a few registers, but, apparently, OCaml doesn't do this (even in 4.03+flambda). So we should do this manually: let rk4_step y t = let k1 = h *. y' t in let k2 = h *. y' (t +. 0.5*.h) in let k3 = h *. y' (t +. 0.5*.h) in let k4 = h *. y' (t +. h) (y +. k3) in y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 We can even see, that `k3` and `k2` are equal now, so we can eliminate them: let rk4_step y t = let k1 = h *. y' t in let k2 = h *. y' (t +. 0.5*.h) in let k4 = h *. y' (t +. h) (y +. k3) in y +. (k1+.k4)/.6.0 +. k2 *. 1.5 Finally, we don't want to pass `y` into the `rk4_step` every time, as we don't want to require an extra register for it. After all these manual optimizations, we have the following program: let h = 0.1 let exact t = sin t let rk4_step t = let k1 = h *. cos t in let k2 = h *. cos (t +. 0.5*.h) in let k4 = h *. cos (t +. h) in (k1+.k4)/.6.0 +. k2*.1.5 let compute steps = let rec loop n y t = if n < steps then loop (n+1) (y +. rk4_step t) (t +. h) else y in loop 1 1.0 0.0 let () = let y = compute 102 in let err = abs_float (y -. (exact ((float_of_int 102) *. 
h))) in let large = 50000000 in let y = compute large in Printf.printf "%b\n" (abs_float (y -. (exact (float_of_int large) *. h)) < 2. *. err) This program has the same performance as the C one... unless I pass really aggressive optimization options to the C compiler, that will emit a platform specific code, e.g., gcc rk.c -lm -O3 -march=corei7-avx -o rksse These options basically double the performance of the C version, leaving OCaml lagging behind. That is because, OCaml, obviously, cannot follow the latest developments of intel CPU, especially in the field of SSE. The fun part is that when I've tried to compile the same file with clang, the resulting program was even slower than the original non-optimized OCaml. But this is all micro benchmarking of course, so don't jump to fast conclusions (although I like to think that OCaml is faster than Clang :)) As a final remark, my experience in HPC shows that in general you should not really rely on compiler optimizations and hope that the compiler will do the magic for you. Even the GCC compiler. It would be very easy to accidentally amend the above program in a such way, that the optimizations will no longer fire in. Of course, writing in assembly is also not a choice. If you really need to optimize, then you should find out the performance bottleneck and then optimize it manually until you get an expected performance. Alternatively, you can use plain old Fortran to get the reliable performance. And then call it from C or OCaml. Best wishes, Ivan On Mon, Dec 19, 2016 at 6:51 AM, Gerd Stolpmann <info@gerd-stolpmann.de> wrote: > Hi Christoph, > > the extra code looks very much like an allocation on the minor heap: > > sub $0x10,%r15 > lea 0x25c7b6(%rip),%rax > cmp (%rax),%r15 > jb 404a8a <dlerror@plt+0x2d0a> > lea 0x8(%r15),%rax > movq $0x4fd,-0x8(%rax) > > r15 points to the used area of the minor heap - by decrementing it you > get an additional block of memory. It is compared against the beginning > of the heap to check whether GC is needed. The constant 0x4fd is the > header of the new block (which must be always initialized). > > From the source code, it remains unclear for what this is used. > Obviously, the compiler runs out of registers, and moves some values to > the minor heap (temporarily). When you call a C function like cos it is > likely that this happens because the C calling conventions do not > preserve the FP registers (xmm*). This could be improved if the OCaml > compiler tried alternate places for temporarily storing FP values: > > - int registers (which is perfectly possible on 64 bit platforms). > A number of int registers survive C calls. > - stack > > To my knowledge, the OCaml compiler never tries this (but this could be > out of date). This is a fairly specific optimization that makes mostly > sense for purely iterating or aggregating functions like yours that do > not store FP values away. > > Gerd > > Am Samstag, den 17.12.2016, 14:02 +0100 schrieb Christoph Höger: > > Ups. Forgot the actual examples. > > > > Am 17.12.2016 um 14:01 schrieb Christoph Höger: > > > > > > Dear all, > > > > > > find attached two simple runge-kutta iteration schemes. One is > > > written > > > in C, the other in OCaml. I compared the runtime of both and gcc (- > > > O2) > > > produces an executable that is roughly 30% faster (to be more > > > precise: > > > 3.52s vs. 2.63s). That is in itself quite pleasing, I think. I do > > > not > > > understand however, what causes this difference. 
Admittedly, the > > > generated assembly looks completely different, but both compilers > > > inline > > > all functions and generate one big loop. Ocaml generates a lot more > > > scaffolding, but that is to be expected. > > > > > > There is however an interesting particularity: OCaml generates 6 > > > calls > > > to cos, while gcc only needs 3 (and one direct jump). Surprisingly, > > > there are also calls to cosh, acos and pretty much any other > > > trigonometric function (initialization of constants, maybe?) > > > > > > However, the true culprit seems to be an excess of instructions > > > between > > > the different calls to cos. This is what happens between the first > > > two > > > calls to cos: > > > > > > gcc: > > > jmpq 400530 <cos@plt> > > > nop > > > nopw %cs:0x0(%rax,%rax,1) > > > > > > sub $0x38,%rsp > > > movsd %xmm0,0x10(%rsp) > > > movapd %xmm1,%xmm0 > > > movsd %xmm2,0x18(%rsp) > > > movsd %xmm1,0x8(%rsp) > > > callq 400530 <cos@plt> > > > > > > ocamlopt: > > > > > > callq 401a60 <cos@plt> > > > mulsd (%r12),%xmm0 > > > movsd %xmm0,0x10(%rsp) > > > sub $0x10,%r15 > > > lea 0x25c7b6(%rip),%rax > > > cmp (%rax),%r15 > > > jb 404a8a <dlerror@plt+0x2d0a> > > > lea 0x8(%r15),%rax > > > movq $0x4fd,-0x8(%rax) > > > > > > movsd 0x32319(%rip),%xmm1 > > > > > > movapd %xmm1,%xmm2 > > > mulsd %xmm0,%xmm2 > > > addsd 0x0(%r13),%xmm2 > > > movsd %xmm2,(%rax) > > > movapd %xmm1,%xmm0 > > > mulsd (%r12),%xmm0 > > > addsd (%rbx),%xmm0 > > > callq 401a60 <cos@plt> > > > > > > > > > Is this caused by some underlying difference in the representation > > > of > > > numeric values (i.e. tagged ints) or is it reasonable to attack > > > this > > > issue as a hobby experiment? > > > > > > > > > thanks for any advice, > > > > > > Christoph > > > > > > -- > ------------------------------------------------------------ > Gerd Stolpmann, Darmstadt, Germany gerd@gerd-stolpmann.de > My OCaml site: http://www.camlcity.org > Contact details: http://www.camlcity.org/contact.html > Company homepage: http://www.gerd-stolpmann.de > ------------------------------------------------------------ > > > [-- Attachment #2: Type: text/html, Size: 16989 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
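One way to test the allocation hypothesis above empirically is to read the runtime's allocation counter around the hot loop. A minimal sketch, assuming `compute` is bound to whichever loop variant is under test; Gc.minor_words is in the standard library of recent OCaml releases (Gc.allocated_bytes serves the same purpose on older ones):

(* returns the result of f x together with the minor-heap words it allocated *)
let words_allocated f x =
  let before = Gc.minor_words () in
  let y = f x in
  (y, Gc.minor_words () -. before)

(* usage sketch:
   let () =
     let _y, words = words_allocated compute 50_000_000 in
     Printf.printf "minor words allocated: %.0f\n" words
*)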
* Re: [Caml-list] Closing the performance gap to C 2016-12-19 15:48 ` Ivan Gotovchits @ 2016-12-19 16:44 ` Yotam Barnoy 2016-12-19 16:59 ` Ivan Gotovchits 0 siblings, 1 reply; 25+ messages in thread From: Yotam Barnoy @ 2016-12-19 16:44 UTC (permalink / raw) To: Ivan Gotovchits; +Cc: Gerd Stolpmann, Christoph Höger, caml-list Christoph, you don't mention which version of OCaml you're compiling with, and whether FLambda is used. Flambda should be able to do the kind of optimizations Ivan mentions, at least ideally. If it's not able to do them yet, it will be when the rewritten version is deployed. -Yotam On Mon, Dec 19, 2016 at 10:48 AM, Ivan Gotovchits <ivg@ieee.org> wrote: > Hi Christoph, > > The problem is that your function definitions, like `loop` and `rk4_step`, > have too many parameters > and OCaml is not able to eliminate them and is actually not trying. It was > always a feature of OCaml > that poorly written code will be compiled into a poorly written program. > OCaml never tries to fix > programmer's errors. It will try to minimize the abstraction overhead (often > to the zero). But if the abstractions, > on the first hand, were chosen incorrectly, then it won't fix the code. > > In your particular example, the C compiler was quite aggressive and was > capable of eliminating unnecessary computations. > I wouldn't, however, assume that the C compiler will be always able to do > this for you. > > Let's go through the code: > > > let rec loop steps h n y t = > if n < steps then loop steps h (n+1) (rk4_step y t h) (t +. h) else > y > > Here variables `steps` and `h` are loop invariants, so you shouldn't put it > into the loop function parameter list. > Well yes, a compiler can find out loop invariants and remove them for you. > But in our case, to remove them, > it will need to change the type of the function. The compiler will not go > that far for us. It respects a programmer, and if a > programmer decided to make his loop function to depend on `6` arguments, > then it means that the computation is actually depending > on 6 arguments. So, it will allocate 6 registers to hold loop variables, > with all the consequences. > > > Now, let's take a look at the `rk4_step` function: > > let rk4_step y t h = > let k1 = h *. y' t y in > let k2 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k1) in > let k3 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k2) in > let k4 = h *. y' (t +. h) (y +. k3) in > y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 > > > > This function, is, in fact, a body of the loop, and everything except t is > loop invariant here. Moreover, > function `y'` is defined as: > > let y' t y = cos t > > I.e., it doesn't really use it the second argument. Probably, a compiler > should inline the call, and eliminate > lots of unecessary computations, and thus free a few registers, but, > apparently, OCaml doesn't do this > (even in 4.03+flambda). > > So we should do this manually: > > let rk4_step y t = > let k1 = h *. y' t in > let k2 = h *. y' (t +. 0.5*.h) in > let k3 = h *. y' (t +. 0.5*.h) in > let k4 = h *. y' (t +. h) (y +. k3) in > y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 > > We can even see, that `k3` and `k2` are equal now, so we can eliminate them: > > let rk4_step y t = > let k1 = h *. y' t in > let k2 = h *. y' (t +. 0.5*.h) in > let k4 = h *. y' (t +. h) (y +. k3) in > y +. (k1+.k4)/.6.0 +. k2 *. 1.5 > > > Finally, we don't want to pass `y` into the `rk4_step` every time, as we > don't want to require an extra register for it. 
> After all these manual optimizations, we have the following program: > > let h = 0.1 > > > let exact t = sin t > > > let rk4_step t = > > let k1 = h *. cos t in > > let k2 = h *. cos (t +. 0.5*.h) in > > let k4 = h *. cos (t +. h) in > > (k1+.k4)/.6.0 +. k2*.1.5 > > > let compute steps = > > let rec loop n y t = > > if n < steps > > then loop (n+1) (y +. rk4_step t) (t +. h) > > else y in > > loop 1 1.0 0.0 > > > let () = > > let y = compute 102 in > > let err = abs_float (y -. (exact ((float_of_int 102) *. h))) in > > let large = 50000000 in > > let y = compute large in > > Printf.printf "%b\n" > > (abs_float (y -. (exact (float_of_int large) *. h)) < 2. *. err) > > > > This program has the same performance as the C one... unless I pass really > aggressive optimization options > to the C compiler, that will emit a platform specific code, e.g., > > gcc rk.c -lm -O3 -march=corei7-avx -o rksse > > > These options basically double the performance of the C version, leaving > OCaml lagging behind. That is because, > OCaml, obviously, cannot follow the latest developments of intel CPU, > especially in the field of SSE. > > The fun part is that when I've tried to compile the same file with clang, > the resulting program was even slower > than the original non-optimized OCaml. But this is all micro benchmarking of > course, so don't jump to fast conclusions > (although I like to think that OCaml is faster than Clang :)) > > > As a final remark, my experience in HPC shows that in general you should not > really rely on compiler optimizations and hope > that the compiler will do the magic for you. Even the GCC compiler. It would > be very easy to accidentally amend the above program > in a such way, that the optimizations will no longer fire in. Of course, > writing in assembly is also not a choice. If you really need > to optimize, then you should find out the performance bottleneck and then > optimize it manually until you get an expected performance. > Alternatively, you can use plain old Fortran to get the reliable > performance. And then call it from C or OCaml. > > > Best wishes, > Ivan > > > > On Mon, Dec 19, 2016 at 6:51 AM, Gerd Stolpmann <info@gerd-stolpmann.de> > wrote: >> >> Hi Christoph, >> >> the extra code looks very much like an allocation on the minor heap: >> >> sub $0x10,%r15 >> lea 0x25c7b6(%rip),%rax >> cmp (%rax),%r15 >> jb 404a8a <dlerror@plt+0x2d0a> >> lea 0x8(%r15),%rax >> movq $0x4fd,-0x8(%rax) >> >> r15 points to the used area of the minor heap - by decrementing it you >> get an additional block of memory. It is compared against the beginning >> of the heap to check whether GC is needed. The constant 0x4fd is the >> header of the new block (which must be always initialized). >> >> From the source code, it remains unclear for what this is used. >> Obviously, the compiler runs out of registers, and moves some values to >> the minor heap (temporarily). When you call a C function like cos it is >> likely that this happens because the C calling conventions do not >> preserve the FP registers (xmm*). This could be improved if the OCaml >> compiler tried alternate places for temporarily storing FP values: >> >> - int registers (which is perfectly possible on 64 bit platforms). >> A number of int registers survive C calls. >> - stack >> >> To my knowledge, the OCaml compiler never tries this (but this could be >> out of date). 
This is a fairly specific optimization that makes mostly >> sense for purely iterating or aggregating functions like yours that do >> not store FP values away. >> >> Gerd >> >> Am Samstag, den 17.12.2016, 14:02 +0100 schrieb Christoph Höger: >> > Ups. Forgot the actual examples. >> > >> > Am 17.12.2016 um 14:01 schrieb Christoph Höger: >> > > >> > > Dear all, >> > > >> > > find attached two simple runge-kutta iteration schemes. One is >> > > written >> > > in C, the other in OCaml. I compared the runtime of both and gcc (- >> > > O2) >> > > produces an executable that is roughly 30% faster (to be more >> > > precise: >> > > 3.52s vs. 2.63s). That is in itself quite pleasing, I think. I do >> > > not >> > > understand however, what causes this difference. Admittedly, the >> > > generated assembly looks completely different, but both compilers >> > > inline >> > > all functions and generate one big loop. Ocaml generates a lot more >> > > scaffolding, but that is to be expected. >> > > >> > > There is however an interesting particularity: OCaml generates 6 >> > > calls >> > > to cos, while gcc only needs 3 (and one direct jump). Surprisingly, >> > > there are also calls to cosh, acos and pretty much any other >> > > trigonometric function (initialization of constants, maybe?) >> > > >> > > However, the true culprit seems to be an excess of instructions >> > > between >> > > the different calls to cos. This is what happens between the first >> > > two >> > > calls to cos: >> > > >> > > gcc: >> > > jmpq 400530 <cos@plt> >> > > nop >> > > nopw %cs:0x0(%rax,%rax,1) >> > > >> > > sub $0x38,%rsp >> > > movsd %xmm0,0x10(%rsp) >> > > movapd %xmm1,%xmm0 >> > > movsd %xmm2,0x18(%rsp) >> > > movsd %xmm1,0x8(%rsp) >> > > callq 400530 <cos@plt> >> > > >> > > ocamlopt: >> > > >> > > callq 401a60 <cos@plt> >> > > mulsd (%r12),%xmm0 >> > > movsd %xmm0,0x10(%rsp) >> > > sub $0x10,%r15 >> > > lea 0x25c7b6(%rip),%rax >> > > cmp (%rax),%r15 >> > > jb 404a8a <dlerror@plt+0x2d0a> >> > > lea 0x8(%r15),%rax >> > > movq $0x4fd,-0x8(%rax) >> > > >> > > movsd 0x32319(%rip),%xmm1 >> > > >> > > movapd %xmm1,%xmm2 >> > > mulsd %xmm0,%xmm2 >> > > addsd 0x0(%r13),%xmm2 >> > > movsd %xmm2,(%rax) >> > > movapd %xmm1,%xmm0 >> > > mulsd (%r12),%xmm0 >> > > addsd (%rbx),%xmm0 >> > > callq 401a60 <cos@plt> >> > > >> > > >> > > Is this caused by some underlying difference in the representation >> > > of >> > > numeric values (i.e. tagged ints) or is it reasonable to attack >> > > this >> > > issue as a hobby experiment? >> > > >> > > >> > > thanks for any advice, >> > > >> > > Christoph >> > > >> > >> -- >> ------------------------------------------------------------ >> Gerd Stolpmann, Darmstadt, Germany gerd@gerd-stolpmann.de >> My OCaml site: http://www.camlcity.org >> Contact details: http://www.camlcity.org/contact.html >> Company homepage: http://www.gerd-stolpmann.de >> ------------------------------------------------------------ >> >> > ^ permalink raw reply [flat|nested] 25+ messages in thread
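For reference, the inline directive mentioned in the next message uses attribute syntax along these lines on OCaml 4.03+, typically together with something like ocamlopt -O3 on a flambda switch; this is only a sketch of the syntax, and as the follow-up reports, it does not by itself guarantee the specialisation discussed here:

(* definition-site hint *)
let y' t _y = cos t [@@inline always]

(* call-site hint *)
let dy t y = (y' [@inlined]) t y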
* Re: [Caml-list] Closing the performance gap to C 2016-12-19 16:44 ` Yotam Barnoy @ 2016-12-19 16:59 ` Ivan Gotovchits 2016-12-21 9:08 ` Christoph Höger 0 siblings, 1 reply; 25+ messages in thread From: Ivan Gotovchits @ 2016-12-19 16:59 UTC (permalink / raw) To: Yotam Barnoy; +Cc: Gerd Stolpmann, Christoph Höger, caml-list [-- Attachment #1: Type: text/plain, Size: 11614 bytes --] > Flambda should be able to do the kind of optimizations Ivan mentions, at least ideally. I've used 4.03+flambda and it didn't perform the optimizations. I've played a lot with the inline options and `inline` directive and even noticed a strange (on a first glance) thing, that the performance is better, when I explicitly _disable_ of the `trk4_step` function (with `[@@inline never]`) As a side note, not about compilers, but about mathematics. Since the three cosine functions, that are computed in the loop are basically cos(t), cos(t+h) and cos(t + h/2) and `h` is a constant, you can use the `cos(t+x) = cos(t)cos(x) - sin(t)sin(x)` trigonometric identity to factor out the constant factors (that are only depend on `h`), so finally, you will need to compute one cosine and one sine. This will boost the performance of the program, leaving even the SSE corei7 specific version far behind. On Mon, Dec 19, 2016 at 11:44 AM, Yotam Barnoy <yotambarnoy@gmail.com> wrote: > Christoph, you don't mention which version of OCaml you're compiling > with, and whether FLambda is used. Flambda should be able to do the > kind of optimizations Ivan mentions, at least ideally. If it's not > able to do them yet, it will be when the rewritten version is > deployed. > > -Yotam > > > On Mon, Dec 19, 2016 at 10:48 AM, Ivan Gotovchits <ivg@ieee.org> wrote: > > Hi Christoph, > > > > The problem is that your function definitions, like `loop` and > `rk4_step`, > > have too many parameters > > and OCaml is not able to eliminate them and is actually not trying. It > was > > always a feature of OCaml > > that poorly written code will be compiled into a poorly written program. > > OCaml never tries to fix > > programmer's errors. It will try to minimize the abstraction overhead > (often > > to the zero). But if the abstractions, > > on the first hand, were chosen incorrectly, then it won't fix the code. > > > > In your particular example, the C compiler was quite aggressive and was > > capable of eliminating unnecessary computations. > > I wouldn't, however, assume that the C compiler will be always able to do > > this for you. > > > > Let's go through the code: > > > > > > let rec loop steps h n y t = > > if n < steps then loop steps h (n+1) (rk4_step y t h) (t +. h) else > > y > > > > Here variables `steps` and `h` are loop invariants, so you shouldn't put > it > > into the loop function parameter list. > > Well yes, a compiler can find out loop invariants and remove them for > you. > > But in our case, to remove them, > > it will need to change the type of the function. The compiler will not go > > that far for us. It respects a programmer, and if a > > programmer decided to make his loop function to depend on `6` arguments, > > then it means that the computation is actually depending > > on 6 arguments. So, it will allocate 6 registers to hold loop variables, > > with all the consequences. > > > > > > Now, let's take a look at the `rk4_step` function: > > > > let rk4_step y t h = > > let k1 = h *. y' t y in > > let k2 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k1) in > > let k3 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k2) in > > let k4 = h *. y' (t +. 
h) (y +. k3) in > > y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 > > > > > > > > This function, is, in fact, a body of the loop, and everything except t > is > > loop invariant here. Moreover, > > function `y'` is defined as: > > > > let y' t y = cos t > > > > I.e., it doesn't really use it the second argument. Probably, a compiler > > should inline the call, and eliminate > > lots of unecessary computations, and thus free a few registers, but, > > apparently, OCaml doesn't do this > > (even in 4.03+flambda). > > > > So we should do this manually: > > > > let rk4_step y t = > > let k1 = h *. y' t in > > let k2 = h *. y' (t +. 0.5*.h) in > > let k3 = h *. y' (t +. 0.5*.h) in > > let k4 = h *. y' (t +. h) (y +. k3) in > > y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 > > > > We can even see, that `k3` and `k2` are equal now, so we can eliminate > them: > > > > let rk4_step y t = > > let k1 = h *. y' t in > > let k2 = h *. y' (t +. 0.5*.h) in > > let k4 = h *. y' (t +. h) (y +. k3) in > > y +. (k1+.k4)/.6.0 +. k2 *. 1.5 > > > > > > Finally, we don't want to pass `y` into the `rk4_step` every time, as we > > don't want to require an extra register for it. > > After all these manual optimizations, we have the following program: > > > > let h = 0.1 > > > > > > let exact t = sin t > > > > > > let rk4_step t = > > > > let k1 = h *. cos t in > > > > let k2 = h *. cos (t +. 0.5*.h) in > > > > let k4 = h *. cos (t +. h) in > > > > (k1+.k4)/.6.0 +. k2*.1.5 > > > > > > let compute steps = > > > > let rec loop n y t = > > > > if n < steps > > > > then loop (n+1) (y +. rk4_step t) (t +. h) > > > > else y in > > > > loop 1 1.0 0.0 > > > > > > let () = > > > > let y = compute 102 in > > > > let err = abs_float (y -. (exact ((float_of_int 102) *. h))) in > > > > let large = 50000000 in > > > > let y = compute large in > > > > Printf.printf "%b\n" > > > > (abs_float (y -. (exact (float_of_int large) *. h)) < 2. *. err) > > > > > > > > This program has the same performance as the C one... unless I pass > really > > aggressive optimization options > > to the C compiler, that will emit a platform specific code, e.g., > > > > gcc rk.c -lm -O3 -march=corei7-avx -o rksse > > > > > > These options basically double the performance of the C version, leaving > > OCaml lagging behind. That is because, > > OCaml, obviously, cannot follow the latest developments of intel CPU, > > especially in the field of SSE. > > > > The fun part is that when I've tried to compile the same file with clang, > > the resulting program was even slower > > than the original non-optimized OCaml. But this is all micro > benchmarking of > > course, so don't jump to fast conclusions > > (although I like to think that OCaml is faster than Clang :)) > > > > > > As a final remark, my experience in HPC shows that in general you should > not > > really rely on compiler optimizations and hope > > that the compiler will do the magic for you. Even the GCC compiler. It > would > > be very easy to accidentally amend the above program > > in a such way, that the optimizations will no longer fire in. Of course, > > writing in assembly is also not a choice. If you really need > > to optimize, then you should find out the performance bottleneck and then > > optimize it manually until you get an expected performance. > > Alternatively, you can use plain old Fortran to get the reliable > > performance. And then call it from C or OCaml. 
> > > > > > Best wishes, > > Ivan > > > > > > > > On Mon, Dec 19, 2016 at 6:51 AM, Gerd Stolpmann <info@gerd-stolpmann.de> > > wrote: > >> > >> Hi Christoph, > >> > >> the extra code looks very much like an allocation on the minor heap: > >> > >> sub $0x10,%r15 > >> lea 0x25c7b6(%rip),%rax > >> cmp (%rax),%r15 > >> jb 404a8a <dlerror@plt+0x2d0a> > >> lea 0x8(%r15),%rax > >> movq $0x4fd,-0x8(%rax) > >> > >> r15 points to the used area of the minor heap - by decrementing it you > >> get an additional block of memory. It is compared against the beginning > >> of the heap to check whether GC is needed. The constant 0x4fd is the > >> header of the new block (which must be always initialized). > >> > >> From the source code, it remains unclear for what this is used. > >> Obviously, the compiler runs out of registers, and moves some values to > >> the minor heap (temporarily). When you call a C function like cos it is > >> likely that this happens because the C calling conventions do not > >> preserve the FP registers (xmm*). This could be improved if the OCaml > >> compiler tried alternate places for temporarily storing FP values: > >> > >> - int registers (which is perfectly possible on 64 bit platforms). > >> A number of int registers survive C calls. > >> - stack > >> > >> To my knowledge, the OCaml compiler never tries this (but this could be > >> out of date). This is a fairly specific optimization that makes mostly > >> sense for purely iterating or aggregating functions like yours that do > >> not store FP values away. > >> > >> Gerd > >> > >> Am Samstag, den 17.12.2016, 14:02 +0100 schrieb Christoph Höger: > >> > Ups. Forgot the actual examples. > >> > > >> > Am 17.12.2016 um 14:01 schrieb Christoph Höger: > >> > > > >> > > Dear all, > >> > > > >> > > find attached two simple runge-kutta iteration schemes. One is > >> > > written > >> > > in C, the other in OCaml. I compared the runtime of both and gcc (- > >> > > O2) > >> > > produces an executable that is roughly 30% faster (to be more > >> > > precise: > >> > > 3.52s vs. 2.63s). That is in itself quite pleasing, I think. I do > >> > > not > >> > > understand however, what causes this difference. Admittedly, the > >> > > generated assembly looks completely different, but both compilers > >> > > inline > >> > > all functions and generate one big loop. Ocaml generates a lot more > >> > > scaffolding, but that is to be expected. > >> > > > >> > > There is however an interesting particularity: OCaml generates 6 > >> > > calls > >> > > to cos, while gcc only needs 3 (and one direct jump). Surprisingly, > >> > > there are also calls to cosh, acos and pretty much any other > >> > > trigonometric function (initialization of constants, maybe?) > >> > > > >> > > However, the true culprit seems to be an excess of instructions > >> > > between > >> > > the different calls to cos. 
This is what happens between the first > >> > > two > >> > > calls to cos: > >> > > > >> > > gcc: > >> > > jmpq 400530 <cos@plt> > >> > > nop > >> > > nopw %cs:0x0(%rax,%rax,1) > >> > > > >> > > sub $0x38,%rsp > >> > > movsd %xmm0,0x10(%rsp) > >> > > movapd %xmm1,%xmm0 > >> > > movsd %xmm2,0x18(%rsp) > >> > > movsd %xmm1,0x8(%rsp) > >> > > callq 400530 <cos@plt> > >> > > > >> > > ocamlopt: > >> > > > >> > > callq 401a60 <cos@plt> > >> > > mulsd (%r12),%xmm0 > >> > > movsd %xmm0,0x10(%rsp) > >> > > sub $0x10,%r15 > >> > > lea 0x25c7b6(%rip),%rax > >> > > cmp (%rax),%r15 > >> > > jb 404a8a <dlerror@plt+0x2d0a> > >> > > lea 0x8(%r15),%rax > >> > > movq $0x4fd,-0x8(%rax) > >> > > > >> > > movsd 0x32319(%rip),%xmm1 > >> > > > >> > > movapd %xmm1,%xmm2 > >> > > mulsd %xmm0,%xmm2 > >> > > addsd 0x0(%r13),%xmm2 > >> > > movsd %xmm2,(%rax) > >> > > movapd %xmm1,%xmm0 > >> > > mulsd (%r12),%xmm0 > >> > > addsd (%rbx),%xmm0 > >> > > callq 401a60 <cos@plt> > >> > > > >> > > > >> > > Is this caused by some underlying difference in the representation > >> > > of > >> > > numeric values (i.e. tagged ints) or is it reasonable to attack > >> > > this > >> > > issue as a hobby experiment? > >> > > > >> > > > >> > > thanks for any advice, > >> > > > >> > > Christoph > >> > > > >> > > >> -- > >> ------------------------------------------------------------ > >> Gerd Stolpmann, Darmstadt, Germany gerd@gerd-stolpmann.de > >> My OCaml site: http://www.camlcity.org > >> Contact details: http://www.camlcity.org/contact.html > >> Company homepage: http://www.gerd-stolpmann.de > >> ------------------------------------------------------------ > >> > >> > > > [-- Attachment #2: Type: text/html, Size: 15225 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
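A sketch of the trigonometric rewrite suggested above, applied to the simplified rk4_step from the earlier message; the four factors below depend only on h, so each step needs exactly one cos and one sin call. (One could go further and advance cos t and sin t by a rotation each step, removing the per-step calls entirely at the cost of accumulated rounding error.)

let h = 0.1

(* constants that depend only on the step size h *)
let ch2 = cos (0.5 *. h)
let sh2 = sin (0.5 *. h)
let ch1 = cos h
let sh1 = sin h

let rk4_step t =
  let c = cos t and s = sin t in             (* the only trig calls per step *)
  let k1 = h *. c in
  let k2 = h *. (c *. ch2 -. s *. sh2) in    (* = h *. cos (t +. 0.5 *. h) *)
  let k4 = h *. (c *. ch1 -. s *. sh1) in    (* = h *. cos (t +. h) *)
  (k1 +. k4) /. 6.0 +. k2 *. 1.5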
* Re: [Caml-list] Closing the performance gap to C 2016-12-19 16:59 ` Ivan Gotovchits @ 2016-12-21 9:08 ` Christoph Höger 2016-12-23 12:18 ` Oleg 0 siblings, 1 reply; 25+ messages in thread From: Christoph Höger @ 2016-12-21 9:08 UTC (permalink / raw) To: Ivan Gotovchits, Yotam Barnoy; +Cc: Gerd Stolpmann, caml-list [-- Attachment #1.1: Type: text/plain, Size: 14418 bytes --] Dear all, thanks a lot for your valuable input. It seems like there still is something to gain for OCaml, but it is not quite simple to achieve. From your comments I reckon the main culprit is the use of a tail-recursive function instead of a loop, which causes boxing (which causes spilling). I will repeat my experiment with a new setup asap. @Ivan: Your point about the optimization is well taken, but please consider that the runge-kutta integrator is a sophisticated, general numerical method. Usually, the user of such a method has neither the knowledge nor the time to manually adapt it, but rather implements a standard interface. Therefore, the unused arguments and algebraic/trigonometric identities have to be resolved by the compiler. Otherwise, we could as well compute the exact solution, anyways ;). Am 19.12.2016 um 17:59 schrieb Ivan Gotovchits: >> Flambda should be able to do the kind of optimizations Ivan mentions, > at least ideally. > > I've used 4.03+flambda and it didn't perform the optimizations. I've > played a lot with the inline options and > `inline` directive and even noticed a strange (on a first glance) thing, > that the performance is better, when > I explicitly _disable_ of the `trk4_step` function (with `[@@inline never]`) > > As a side note, not about compilers, but about mathematics. > Since the three cosine functions, that are computed in the > loop are basically cos(t), cos(t+h) and cos(t + h/2) and `h` is a > constant, you can use the `cos(t+x) = cos(t)cos(x) - sin(t)sin(x)` > trigonometric > identity to factor out the constant factors (that are only depend on > `h`), so finally, you will need to compute one cosine and one sine. > This will boost the performance of the program, leaving even the SSE > corei7 specific version far behind. > > > On Mon, Dec 19, 2016 at 11:44 AM, Yotam Barnoy <yotambarnoy@gmail.com > <mailto:yotambarnoy@gmail.com>> wrote: > > Christoph, you don't mention which version of OCaml you're compiling > with, and whether FLambda is used. Flambda should be able to do the > kind of optimizations Ivan mentions, at least ideally. If it's not > able to do them yet, it will be when the rewritten version is > deployed. > > -Yotam > > > On Mon, Dec 19, 2016 at 10:48 AM, Ivan Gotovchits <ivg@ieee.org > <mailto:ivg@ieee.org>> wrote: > > Hi Christoph, > > > > The problem is that your function definitions, like `loop` and > `rk4_step`, > > have too many parameters > > and OCaml is not able to eliminate them and is actually not > trying. It was > > always a feature of OCaml > > that poorly written code will be compiled into a poorly written > program. > > OCaml never tries to fix > > programmer's errors. It will try to minimize the abstraction > overhead (often > > to the zero). But if the abstractions, > > on the first hand, were chosen incorrectly, then it won't fix the > code. > > > > In your particular example, the C compiler was quite aggressive > and was > > capable of eliminating unnecessary computations. > > I wouldn't, however, assume that the C compiler will be always > able to do > > this for you. 
> > > > Let's go through the code: > > > > > > let rec loop steps h n y t = > > if n < steps then loop steps h (n+1) (rk4_step y t h) (t +. h) else > > y > > > > Here variables `steps` and `h` are loop invariants, so you > shouldn't put it > > into the loop function parameter list. > > Well yes, a compiler can find out loop invariants and remove them > for you. > > But in our case, to remove them, > > it will need to change the type of the function. The compiler will > not go > > that far for us. It respects a programmer, and if a > > programmer decided to make his loop function to depend on `6` > arguments, > > then it means that the computation is actually depending > > on 6 arguments. So, it will allocate 6 registers to hold loop > variables, > > with all the consequences. > > > > > > Now, let's take a look at the `rk4_step` function: > > > > let rk4_step y t h = > > let k1 = h *. y' t y in > > let k2 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k1) in > > let k3 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k2) in > > let k4 = h *. y' (t +. h) (y +. k3) in > > y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 > > > > > > > > This function, is, in fact, a body of the loop, and everything > except t is > > loop invariant here. Moreover, > > function `y'` is defined as: > > > > let y' t y = cos t > > > > I.e., it doesn't really use it the second argument. Probably, a > compiler > > should inline the call, and eliminate > > lots of unecessary computations, and thus free a few registers, but, > > apparently, OCaml doesn't do this > > (even in 4.03+flambda). > > > > So we should do this manually: > > > > let rk4_step y t = > > let k1 = h *. y' t in > > let k2 = h *. y' (t +. 0.5*.h) in > > let k3 = h *. y' (t +. 0.5*.h) in > > let k4 = h *. y' (t +. h) (y +. k3) in > > y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 > > > > We can even see, that `k3` and `k2` are equal now, so we can > eliminate them: > > > > let rk4_step y t = > > let k1 = h *. y' t in > > let k2 = h *. y' (t +. 0.5*.h) in > > let k4 = h *. y' (t +. h) (y +. k3) in > > y +. (k1+.k4)/.6.0 +. k2 *. 1.5 > > > > > > Finally, we don't want to pass `y` into the `rk4_step` every time, > as we > > don't want to require an extra register for it. > > After all these manual optimizations, we have the following program: > > > > let h = 0.1 > > > > > > let exact t = sin t > > > > > > let rk4_step t = > > > > let k1 = h *. cos t in > > > > let k2 = h *. cos (t +. 0.5*.h) in > > > > let k4 = h *. cos (t +. h) in > > > > (k1+.k4)/.6.0 +. k2*.1.5 > > > > > > let compute steps = > > > > let rec loop n y t = > > > > if n < steps > > > > then loop (n+1) (y +. rk4_step t) (t +. h) > > > > else y in > > > > loop 1 1.0 0.0 > > > > > > let () = > > > > let y = compute 102 in > > > > let err = abs_float (y -. (exact ((float_of_int 102) *. h))) in > > > > let large = 50000000 in > > > > let y = compute large in > > > > Printf.printf "%b\n" > > > > (abs_float (y -. (exact (float_of_int large) *. h)) < 2. *. err) > > > > > > > > This program has the same performance as the C one... unless I > pass really > > aggressive optimization options > > to the C compiler, that will emit a platform specific code, e.g., > > > > gcc rk.c -lm -O3 -march=corei7-avx -o rksse > > > > > > These options basically double the performance of the C version, > leaving > > OCaml lagging behind. That is because, > > OCaml, obviously, cannot follow the latest developments of intel CPU, > > especially in the field of SSE. 
> > > > The fun part is that when I've tried to compile the same file with > clang, > > the resulting program was even slower > > than the original non-optimized OCaml. But this is all micro > benchmarking of > > course, so don't jump to fast conclusions > > (although I like to think that OCaml is faster than Clang :)) > > > > > > As a final remark, my experience in HPC shows that in general you > should not > > really rely on compiler optimizations and hope > > that the compiler will do the magic for you. Even the GCC > compiler. It would > > be very easy to accidentally amend the above program > > in a such way, that the optimizations will no longer fire in. Of > course, > > writing in assembly is also not a choice. If you really need > > to optimize, then you should find out the performance bottleneck > and then > > optimize it manually until you get an expected performance. > > Alternatively, you can use plain old Fortran to get the reliable > > performance. And then call it from C or OCaml. > > > > > > Best wishes, > > Ivan > > > > > > > > On Mon, Dec 19, 2016 at 6:51 AM, Gerd Stolpmann > <info@gerd-stolpmann.de <mailto:info@gerd-stolpmann.de>> > > wrote: > >> > >> Hi Christoph, > >> > >> the extra code looks very much like an allocation on the minor heap: > >> > >> sub $0x10,%r15 > >> lea 0x25c7b6(%rip),%rax > >> cmp (%rax),%r15 > >> jb 404a8a <dlerror@plt+0x2d0a> > >> lea 0x8(%r15),%rax > >> movq $0x4fd,-0x8(%rax) > >> > >> r15 points to the used area of the minor heap - by decrementing > it you > >> get an additional block of memory. It is compared against the > beginning > >> of the heap to check whether GC is needed. The constant 0x4fd is the > >> header of the new block (which must be always initialized). > >> > >> From the source code, it remains unclear for what this is used. > >> Obviously, the compiler runs out of registers, and moves some > values to > >> the minor heap (temporarily). When you call a C function like cos > it is > >> likely that this happens because the C calling conventions do not > >> preserve the FP registers (xmm*). This could be improved if the OCaml > >> compiler tried alternate places for temporarily storing FP values: > >> > >> - int registers (which is perfectly possible on 64 bit platforms). > >> A number of int registers survive C calls. > >> - stack > >> > >> To my knowledge, the OCaml compiler never tries this (but this > could be > >> out of date). This is a fairly specific optimization that makes > mostly > >> sense for purely iterating or aggregating functions like yours > that do > >> not store FP values away. > >> > >> Gerd > >> > >> Am Samstag, den 17.12.2016, 14:02 +0100 schrieb Christoph Höger: > >> > Ups. Forgot the actual examples. > >> > > >> > Am 17.12.2016 um 14:01 schrieb Christoph Höger: > >> > > > >> > > Dear all, > >> > > > >> > > find attached two simple runge-kutta iteration schemes. One is > >> > > written > >> > > in C, the other in OCaml. I compared the runtime of both and > gcc (- > >> > > O2) > >> > > produces an executable that is roughly 30% faster (to be more > >> > > precise: > >> > > 3.52s vs. 2.63s). That is in itself quite pleasing, I think. I do > >> > > not > >> > > understand however, what causes this difference. Admittedly, the > >> > > generated assembly looks completely different, but both compilers > >> > > inline > >> > > all functions and generate one big loop. Ocaml generates a > lot more > >> > > scaffolding, but that is to be expected. 
> >> > > > >> > > There is however an interesting particularity: OCaml generates 6 > >> > > calls > >> > > to cos, while gcc only needs 3 (and one direct jump). > Surprisingly, > >> > > there are also calls to cosh, acos and pretty much any other > >> > > trigonometric function (initialization of constants, maybe?) > >> > > > >> > > However, the true culprit seems to be an excess of instructions > >> > > between > >> > > the different calls to cos. This is what happens between the > first > >> > > two > >> > > calls to cos: > >> > > > >> > > gcc: > >> > > jmpq 400530 <cos@plt> > >> > > nop > >> > > nopw %cs:0x0(%rax,%rax,1) > >> > > > >> > > sub $0x38,%rsp > >> > > movsd %xmm0,0x10(%rsp) > >> > > movapd %xmm1,%xmm0 > >> > > movsd %xmm2,0x18(%rsp) > >> > > movsd %xmm1,0x8(%rsp) > >> > > callq 400530 <cos@plt> > >> > > > >> > > ocamlopt: > >> > > > >> > > callq 401a60 <cos@plt> > >> > > mulsd (%r12),%xmm0 > >> > > movsd %xmm0,0x10(%rsp) > >> > > sub $0x10,%r15 > >> > > lea 0x25c7b6(%rip),%rax > >> > > cmp (%rax),%r15 > >> > > jb 404a8a <dlerror@plt+0x2d0a> > >> > > lea 0x8(%r15),%rax > >> > > movq $0x4fd,-0x8(%rax) > >> > > > >> > > movsd 0x32319(%rip),%xmm1 > >> > > > >> > > movapd %xmm1,%xmm2 > >> > > mulsd %xmm0,%xmm2 > >> > > addsd 0x0(%r13),%xmm2 > >> > > movsd %xmm2,(%rax) > >> > > movapd %xmm1,%xmm0 > >> > > mulsd (%r12),%xmm0 > >> > > addsd (%rbx),%xmm0 > >> > > callq 401a60 <cos@plt> > >> > > > >> > > > >> > > Is this caused by some underlying difference in the > representation > >> > > of > >> > > numeric values (i.e. tagged ints) or is it reasonable to attack > >> > > this > >> > > issue as a hobby experiment? > >> > > > >> > > > >> > > thanks for any advice, > >> > > > >> > > Christoph > >> > > > >> > > >> -- > >> ------------------------------------------------------------ > >> Gerd Stolpmann, Darmstadt, Germany gerd@gerd-stolpmann.de > <mailto:gerd@gerd-stolpmann.de> > >> My OCaml site: http://www.camlcity.org > >> Contact details: http://www.camlcity.org/contact.html > <http://www.camlcity.org/contact.html> > >> Company homepage: http://www.gerd-stolpmann.de > >> ------------------------------------------------------------ > >> > >> > > > > -- Christoph Höger Technische Universität Berlin Fakultät IV - Elektrotechnik und Informatik Übersetzerbau und Programmiersprachen Sekr. TEL12-2, Ernst-Reuter-Platz 7, 10587 Berlin Tel.: +49 (30) 314-24890 E-Mail: christoph.hoeger@tu-berlin.de [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
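To make the point about the standard interface concrete, this is roughly the shape such a reusable integrator takes: the derivative f is a parameter, so dropping its unused y argument and hoisting the loop invariants is only possible once the compiler specialises the integrator for the particular f it is applied to. A sketch that generalises the attached rk4.ml, for illustration only:

let rk4_step f y t h =
  let k1 = h *. f t y in
  let k2 = h *. f (t +. 0.5 *. h) (y +. 0.5 *. k1) in
  let k3 = h *. f (t +. 0.5 *. h) (y +. 0.5 *. k2) in
  let k4 = h *. f (t +. h) (y +. k3) in
  y +. (k1 +. k4) /. 6.0 +. (k2 +. k3) /. 3.0

let integrate f steps h y0 t0 =
  let rec go n y t =
    if n < steps then go (n + 1) (rk4_step f y t h) (t +. h) else y
  in
  go 1 y0 t0

(* the derivative ignores y, but a generic compiler cannot assume that *)
let () = Printf.printf "%f\n" (integrate (fun t _y -> cos t) 102 0.1 1.0 0.0)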
* Re: [Caml-list] Closing the performance gap to C 2016-12-21 9:08 ` Christoph Höger @ 2016-12-23 12:18 ` Oleg 0 siblings, 0 replies; 25+ messages in thread From: Oleg @ 2016-12-23 12:18 UTC (permalink / raw) To: christoph.hoeger; +Cc: ivg, yotambarnoy, caml-list > @Ivan: Your point about the optimization is well taken, but please > consider that the runge-kutta integrator is a sophisticated, general > numerical method. Usually, the user of such a method has neither the > knowledge nor the time to manually adapt it, but rather implements a > standard interface. Therefore, the unused arguments and > algebraic/trigonometric identities have to be resolved by the compiler. > Otherwise, we could as well compute the exact solution, anyways ;). I would like to concur with Ivan and to strongly emphasize his point. This discussion about compiler optimizations has been done before -- decades before -- in the HPC community. We know the conclusion: the era of compiler optimizations is over. All further gains in HPC computing can only be expected from *domain-specific* optimizations, specific to a particular domain. Because the most profitable optimizations are domain-specific, one cannot, should not and must not count on a *general-purpose* compiler to do them. Compiler writers should think of optimizations that are universally profitable, rather than ones applicable only in specific (and not entirely clear-cut) circumstances. The compiler cannot apply algebraic/trigonometric identities! First of all, there are too many of them, and it is not clear in which direction to apply them. Second, most of the identities are not generally valid in the presence of NaNs etc. (for example, 0*x = 0 is not valid for FP computations). A lot has been written about it. This list is not the best place to discuss all this work. Let me give three references: -- The recent talk by Paul Kelly http://www.doc.ic.ac.uk/~phjk/Presentations/2015-09-LCPC-Keynote-PaulKelly-V03-ForDistribution.pdf plus many more publications from his group http://www.doc.ic.ac.uk/~phjk/phjk-Publications.html He talks about finite element methods, far more sophisticated than Runge-Kutta. -- A very old paper with lots of references to the debate about compiler optimizations: http://www-rocq.inria.fr/~acohen/publications/CDGHKP06.ps.gz -- The "Stream Fusion, to Completeness" paper mentioned on this list a couple of months ago. The paper shows yet another example that domain-specific optimizations really work, generating code as if it were written by hand. The paper takes great pains to emphasize why the optimizations cannot be carried out by a general-purpose compiler. Please be sure to read the discussion section near the end, with some notable quotes about the GHC optimizer. ^ permalink raw reply [flat|nested] 25+ messages in thread
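A concrete instance of the point about floating-point identities: rewriting 0 * x to 0 is unsound for floats because of NaN and infinity, as a one-liner shows (both products below evaluate to nan under IEEE 754):

let () = Printf.printf "%f %f\n" (0. *. nan) (0. *. infinity)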
end of thread, other threads:[~2016-12-23 12:18 UTC | newest] Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-12-17 13:01 [Caml-list] Closing the performance gap to C Christoph Höger 2016-12-17 13:02 ` Christoph Höger 2016-12-19 10:58 ` Soegtrop, Michael 2016-12-19 11:51 ` Gerd Stolpmann 2016-12-19 14:52 ` Soegtrop, Michael 2016-12-19 16:41 ` Gerd Stolpmann 2016-12-19 17:09 ` Frédéric Bour 2016-12-19 17:19 ` Yotam Barnoy 2016-12-21 11:25 ` Alain Frisch 2016-12-21 14:45 ` Yotam Barnoy 2016-12-21 16:06 ` Alain Frisch 2016-12-21 16:31 ` Gerd Stolpmann 2016-12-21 16:39 ` Yotam Barnoy 2016-12-21 16:47 ` Gabriel Scherer 2016-12-21 16:51 ` Yotam Barnoy 2016-12-21 16:56 ` Mark Shinwell 2016-12-21 17:43 ` Alain Frisch 2016-12-22 8:39 ` Mark Shinwell 2016-12-22 17:23 ` Pierre Chambart 2016-12-21 17:35 ` Alain Frisch 2016-12-19 15:48 ` Ivan Gotovchits 2016-12-19 16:44 ` Yotam Barnoy 2016-12-19 16:59 ` Ivan Gotovchits 2016-12-21 9:08 ` Christoph Höger 2016-12-23 12:18 ` Oleg