* [Caml-list] Closing the performance gap to C
From: Christoph Höger @ 2016-12-17 13:01 UTC

Dear all,

find attached two simple runge-kutta iteration schemes. One is written in C,
the other in OCaml. I compared the runtime of both and gcc (-O2) produces an
executable that is roughly 30% faster (to be more precise: 3.52s vs. 2.63s).
That is in itself quite pleasing, I think. I do not understand, however, what
causes this difference. Admittedly, the generated assembly looks completely
different, but both compilers inline all functions and generate one big loop.
OCaml generates a lot more scaffolding, but that is to be expected.

There is however an interesting particularity: OCaml generates 6 calls to
cos, while gcc only needs 3 (and one direct jump). Surprisingly, there are
also calls to cosh, acos and pretty much any other trigonometric function
(initialization of constants, maybe?)

However, the true culprit seems to be an excess of instructions between the
different calls to cos. This is what happens between the first two calls to
cos:

gcc:

  jmpq   400530 <cos@plt>
  nop
  nopw   %cs:0x0(%rax,%rax,1)

  sub    $0x38,%rsp
  movsd  %xmm0,0x10(%rsp)
  movapd %xmm1,%xmm0
  movsd  %xmm2,0x18(%rsp)
  movsd  %xmm1,0x8(%rsp)
  callq  400530 <cos@plt>

ocamlopt:

  callq  401a60 <cos@plt>
  mulsd  (%r12),%xmm0
  movsd  %xmm0,0x10(%rsp)
  sub    $0x10,%r15
  lea    0x25c7b6(%rip),%rax
  cmp    (%rax),%r15
  jb     404a8a <dlerror@plt+0x2d0a>
  lea    0x8(%r15),%rax
  movq   $0x4fd,-0x8(%rax)
  movsd  0x32319(%rip),%xmm1
  movapd %xmm1,%xmm2
  mulsd  %xmm0,%xmm2
  addsd  0x0(%r13),%xmm2
  movsd  %xmm2,(%rax)
  movapd %xmm1,%xmm0
  mulsd  (%r12),%xmm0
  addsd  (%rbx),%xmm0
  callq  401a60 <cos@plt>

Is this caused by some underlying difference in the representation of numeric
values (i.e. tagged ints) or is it reasonable to attack this issue as a hobby
experiment?

thanks for any advice,

Christoph

--
Christoph Höger
Technische Universität Berlin
Fakultät IV - Elektrotechnik und Informatik
Übersetzerbau und Programmiersprachen
Sekr. TEL12-2, Ernst-Reuter-Platz 7, 10587 Berlin
Tel.: +49 (30) 314-24890
E-Mail: christoph.hoeger@tu-berlin.de
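A note on the closing question: as the replies establish, the overhead comes
from float boxing rather than from int tagging. A minimal sketch of the
distinction (not taken from the attached programs):

  (* no allocation: the int result stays in a register as a tagged word *)
  let incr_int n = n + 1

  (* the float result is returned as a fresh one-word heap block (a box)
     unless the compiler can keep it unboxed, e.g. after inlining *)
  let incr_float x = x +. 1.0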
* Re: [Caml-list] Closing the performance gap to C
From: Christoph Höger @ 2016-12-17 13:02 UTC

Oops. Forgot the actual examples.

rk4.c:

  #include <stdio.h>
  #include <math.h>

  double exact(double t) { return sin(t); }

  double dy(double t, double y) { return cos(t); }

  double rk4_step(double y, double t, double h) {
    double k1 = h * dy(t, y);
    double k2 = h * dy(t + 0.5 * h, y + 0.5 * k1);
    double k3 = h * dy(t + 0.5 * h, y + 0.5 * k2);
    double k4 = h * dy(t + h, y + k3);
    return y + (k1 + k4) / 6.0 + (k2 + k3) / 3.0;
  }

  double loop(int steps, double h, int n, double y, double t) {
    if (n < steps) return loop(steps, h, n + 1, rk4_step(y, t, h), t + h);
    else return y;
  }

  int main() {
    double h = 0.1;
    double y = loop(102, h, 1, 1.0, 0.0);
    double err = fabs(y - exact(102 * h));
    int large = 10000000;
    double y2 = loop(large, h, 1, 1.0, 0.0);
    printf("%d\n", (fabs(y2 - (exact(large * h))) < 2. * err));
    return 0;
  }

testrk4.ml:

  let y' t y = cos t

  let exact t = sin t

  let rk4_step y t h =
    let k1 = h *. y' t y in
    let k2 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k1) in
    let k3 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k2) in
    let k4 = h *. y' (t +. h) (y +. k3) in
    y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0

  let rec loop steps h n y t =
    if n < steps then loop steps h (n+1) (rk4_step y t h) (t +. h)
    else y

  let _ =
    let h = 0.1 in
    let y = loop 102 h 1 1.0 0.0 in
    let err = abs_float (y -. (exact ((float_of_int 102) *. h))) in
    let large = 10000000 in
    let y = loop large h 1 1.0 0.0 in
    Printf.printf "%b\n"
      (abs_float (y -. (exact ((float_of_int large) *. h))) < 2. *. err)
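The question of whether the OCaml loop allocates on every iteration (raised
in the replies below) can be checked directly on the program above by
comparing GC counters around a run. A minimal sketch, assuming the
definitions from testrk4.ml are in scope:

  let () =
    let before = Gc.allocated_bytes () in
    let _ = loop 1_000_001 0.1 1 1.0 0.0 in
    let after = Gc.allocated_bytes () in
    (* each boxed float costs 16 bytes (header + payload) on 64-bit, so a
       nonzero per-iteration figure here indicates boxing inside the loop *)
    Printf.printf "bytes allocated per iteration: %.1f\n"
      ((after -. before) /. 1_000_000.)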
* RE: [Caml-list] Closing the performance gap to C
From: Soegtrop, Michael @ 2016-12-19 10:58 UTC

Dear Christoph,

would you mind sharing the assembly code? It would be interesting to see if
OCaml generates x87 style stack based FPU code, or more modern SSE style
register based code. Also it would be interesting to know how gcc was
configured (for what kind of floating point code). For floating point stuff,
the configuration of the CPU can have very substantial impact on the result.

Best regards,

Michael
* Re: [Caml-list] Closing the performance gap to C
From: Gerd Stolpmann @ 2016-12-19 11:51 UTC

Hi Christoph,

the extra code looks very much like an allocation on the minor heap:

  sub    $0x10,%r15
  lea    0x25c7b6(%rip),%rax
  cmp    (%rax),%r15
  jb     404a8a <dlerror@plt+0x2d0a>
  lea    0x8(%r15),%rax
  movq   $0x4fd,-0x8(%rax)

r15 points to the used area of the minor heap - by decrementing it you get an
additional block of memory. It is compared against the beginning of the heap
to check whether a GC is needed. The constant 0x4fd is the header of the new
block (which must always be initialized).

From the source code, it remains unclear what this is used for. Obviously,
the compiler runs out of registers and moves some values to the minor heap
(temporarily). When you call a C function like cos, it is likely that this
happens because the C calling conventions do not preserve the FP registers
(xmm*). This could be improved if the OCaml compiler tried alternate places
for temporarily storing FP values:

- int registers (which is perfectly possible on 64 bit platforms; a number of
  int registers survive C calls)
- the stack

To my knowledge, the OCaml compiler never tries this (but this could be out
of date). This is a fairly specific optimization that makes sense mostly for
purely iterating or aggregating functions like yours that do not store FP
values away.

Gerd
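A side note on the header constant 0x4fd in the snippet above: it can be
decoded with the usual 64-bit block-header layout (size, colour, tag), which
identifies the freshly allocated block as a one-word float box. A small
sketch, assuming the standard runtime encoding:

  let () =
    let header = 0x4fd in
    (* prints size=1 colour=0 tag=253; Obj.double_tag is 253, so the block
       allocated between the cos calls is a boxed float *)
    Printf.printf "size=%d colour=%d tag=%d (Obj.double_tag = %d)\n"
      (header lsr 10) ((header lsr 8) land 3) (header land 0xff)
      Obj.double_tag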
* RE: [Caml-list] Closing the performance gap to C
From: Soegtrop, Michael @ 2016-12-19 14:52 UTC

Dear Gerd,

> When you call a C function like cos it is likely that this happens because
> the C calling conventions do not preserve the FP registers (xmm*). This
> could be improved if the OCaml compiler tried alternate places for
> temporarily storing FP values:

For Windows this doesn't seem to be true. See e.g.:

https://msdn.microsoft.com/en-us/library/ms235286.aspx

which states that XMM0..XMM5 are volatile, while XMM6..XMM15 must be
preserved.

I think for Linux you are right. I couldn't find a better reference than
Wikipedia:

https://en.wikipedia.org/wiki/X86_calling_conventions

see "System V AMD64 ABI" there.

This reference contains a good overview, which matches the above data in
table 4:

http://www.agner.org/optimize/calling_conventions.pdf

So on Windows, there is for sure no need to save XMM6..XMM15, while on Linux
this seems to be an issue.

Best regards,

Michael
* Re: [Caml-list] Closing the performance gap to C
From: Gerd Stolpmann @ 2016-12-19 16:41 UTC

Michael,

look here, it's the "definitive source":

https://github.com/ocaml/ocaml/blob/trunk/asmcomp/amd64/proc.ml

Windows is indeed different. I don't have enough insight into the register
spilling algorithm to say whether this has a significant effect, though.
OCaml code never keeps registers alive, and thus I would bet the spilling
algorithm is designed for that, and it probably does not try to move values
preferably to xmm6-15 in order to avoid spilling during C calls. But that's
just a hypothesis... Does somebody know?

Gerd
* Re: [Caml-list] Closing the performance gap to C
From: Frédéric Bour @ 2016-12-19 17:09 UTC

OCamlopt is able to spill floating point registers. You can even see with
-dalloc that the code will spill a floating point register in the loop.

The problem observed is not because of spilling but simply because of float
boxing and the compilation of recursive calls. The loop seems to compile down
to efficient code ending in a jump, but float unboxing is done in a much
earlier pass in the compiler (cmm).

Passing -dcmm to the compiler, we can see that before the recursive call to
loop the float is boxed again. At this point, it is just a regular OCaml
function call, taking boxed floats.

A simpler code to observe this behavior:

  let rec test f =
    test (f +. 1.0)

  let () = test 0.0

will box at every iteration.
* Re: [Caml-list] Closing the performance gap to C
From: Yotam Barnoy @ 2016-12-19 17:19 UTC

On Mon, Dec 19, 2016 at 12:09 PM, Frédéric Bour wrote:
> The problem observed is not because of spilling but simply because of float
> boxing and the compilation of recursive calls.
> [...]
> A simpler code to observe this behavior:
>
>   let rec test f =
>     test (f +. 1.0)
>
>   let () = test 0.0
>
> will box at every iteration.

Yes, this is a current weak point of OCaml compilation, which I've reported
here (https://caml.inria.fr/mantis/view.php?id=7289). The only real solution
to this currently is to improve Flambda to the point that unboxing is done at
the Flambda level rather than at cmmgen. Unboxing decisions are currently
extremely local, which isn't good enough for recursive functions, which
cannot be inlined.

Currently, the only workaround is to use (a) for/while loops or (b) a global
mutable value instead of a parameter.

-Yotam
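To illustrate the workaround mentioned above (a sketch, not code from the
thread): Frédéric's example rewritten with a local mutable float instead of a
recursive parameter. The reference-based version lets the compiler keep the
accumulator unboxed inside the loop, while the recursive version boxes it
before every call.

  (* recursive version: f is boxed before every call *)
  let rec test f n =
    if n = 0 then f else test (f +. 1.0) (n - 1)

  (* imperative version: the local float ref does not escape, so the
     compiler can keep it in unboxed form for the whole loop *)
  let test_loop n =
    let f = ref 0.0 in
    for _i = 1 to n do
      f := !f +. 1.0
    done;
    !f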
* Re: [Caml-list] Closing the performance gap to C
From: Alain Frisch @ 2016-12-21 11:25 UTC

On 19/12/2016 18:19, Yotam Barnoy wrote:
> Yes, this is a current weak point of OCaml compilation, which I've reported
> here (https://caml.inria.fr/mantis/view.php?id=7289). The only real
> solution to this currently is to improve Flambda to the point that unboxing
> is done at the Flambda level rather than at cmmgen.

The issue could very well be addressed outside flambda, at the cmmgen level.
See https://caml.inria.fr/mantis/view.php?id=5894 . Making flambda aware of
boxing can help its inlining decisions, but the notion of having a
specialized unboxing calling convention is not really tied to flambda.

Pushing #5894 forward could have a direct impact on numerical code. I think
it would really be useful to work on that, even if flambda also does
something along these lines at some point.

-- Alain
* Re: [Caml-list] Closing the performance gap to C
From: Yotam Barnoy @ 2016-12-21 14:45 UTC

> The issue could very well be addressed outside flambda, at the cmmgen
> level. See https://caml.inria.fr/mantis/view.php?id=5894 . Making flambda
> aware of boxing can help its inlining decisions, but the notion of having
> a specialized unboxing calling convention is not really tied to flambda.
>
> Pushing #5894 forward could have a direct impact on numerical code. I
> think it would really be useful to work on that, even if flambda also does
> something along these lines at some point.
>
> -- Alain

I think it's not worth the effort. You need to examine all the code dealing
with a parameter (i.e. its flow) to see if any generic function is called on
that parameter. This is work best left to Flambda. Even if a generic function
is called on said parameter, you can still express a cost-based approximation
to determine whether unboxing is worthwhile. This is all large-scale
optimization rather than peephole optimization, and fits better under
Flambda's umbrella IMO.

-Yotam
* Re: [Caml-list] Closing the performance gap to C
From: Alain Frisch @ 2016-12-21 16:06 UTC

On 21/12/2016 15:45, Yotam Barnoy wrote:
> I think it's not worth the effort. You need to examine all the code dealing
> with a parameter (i.e. its flow) to see if any generic function is called
> on that parameter.

This would be treated a bit like the stubs for optional arguments with a
default value. Any function taking float arguments or returning a float would
be compiled to a specialized version with an unboxed calling convention plus
a stub which implements the generic calling convention and delegates the call
to the specialized version (or copies its body if it is small enough). At the
call site, when the called function is known, one calls the specialized
version instead. This is a systematic, local compilation scheme, similar to
other optimizations made in closure/cmmgen; it does not require a more global
analysis nor a radically different representation of the code.

About the "it's not worth the effort": the effort has largely been done,
since the ticket came with a working patch (some more effort would be needed
to bring it up to date, though). In my opinion, this seems like a rather
low-hanging fruit with very immediate and significant gains. I'd rather have
this soon than wait for flambda to become stable and usable on large
projects.

-- Alain
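The optional-argument analogy refers to the existing transformation that
splits a function with a defaulted optional argument into a full-arity worker
plus a small wrapper that fills in the default. A hand-written sketch of the
idea (not the compiler's actual output; `step` is just an illustrative name):

  (* what you write *)
  let step ?(h = 0.1) t = cos t *. h

  (* roughly what the split produces: a worker with every argument
     supplied, plus a thin wrapper that only computes the default *)
  let step_worker h t = cos t *. h
  let step' ?(h = 0.1) t = step_worker h t

The proposal is analogous: an unboxed-float worker plus a boxing stub, with
known call sites going straight to the worker.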
* Re: [Caml-list] Closing the performance gap to C
From: Gerd Stolpmann @ 2016-12-21 16:31 UTC

Dumb question, because you are effectively suggesting an alternate calling
convention in addition to the existing one: wouldn't it make more sense to
switch to a completely different convention? Like: we always pass fp values
in fp registers, as far as they are available? Sure, there are cases leading
to double allocations of the same value then. But getting this under control
is some kind of CSE problem, and something that flambda can really deal with.

I mean, the current calling conventions only exist because of outdated CPU
architectures (x87). Maybe it is time to rethink this.

Gerd
* Re: [Caml-list] Closing the performance gap to C
From: Yotam Barnoy @ 2016-12-21 16:39 UTC

On Wed, Dec 21, 2016 at 11:31 AM, Gerd Stolpmann wrote:
> Dumb question because you are effectively suggesting an alternate calling
> convention in addition to the existing one: wouldn't it make more sense to
> switch to a completely different convention?

As far as I understand, the current calling convention has to do with generic
functions. Having to be polymorphic over all types, including primitive ones,
such as floating point and ints of different widths, is really difficult.
This is why OOP languages don't do it -- only classes are polymorphic.

And Alain, even if we create these alternate calling conventions, the tricky
part is still detecting when it's ok to call using direct calling
conventions. That's the hard part, and it's best done by Flambda with its
inlining.

-Yotam
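A small illustration of the point about generic functions (a sketch using
only the standard library): a single compiled body serves every instantiation
of a polymorphic function, so a float passed through it has to use the
uniform boxed representation.

  (* one compiled body for every 'a, so 'a must fit the uniform machine
     representation: a word that is either a tagged int or a pointer *)
  let apply_twice f x = f (f x)

  let () =
    Printf.printf "%f %s\n"
      (apply_twice (fun x -> x +. 1.0) 3.0)   (* the float arrives boxed *)
      (apply_twice (fun s -> s ^ "!") "hi")   (* same code path for strings *)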
* Re: [Caml-list] Closing the performance gap to C
From: Gabriel Scherer @ 2016-12-21 16:47 UTC

I think there is an information asymmetry here. Alain has done a lot of work
on float unboxing in old and recent OCaml releases, and if he says that some
approach is a low-hanging fruit it is reasonable to trust him. I suspect,
Yotam, that you have different issues in mind that correspond to the fact
that inlining and specialization at the flambda level could be complementary
with the work that Alain is describing.

(I think it is difficult to accurately discuss these optimization questions
at a high/conceptual level only.)
* Re: [Caml-list] Closing the performance gap to C
From: Yotam Barnoy @ 2016-12-21 16:51 UTC

Fair enough. I'm all for low-hanging fruit.
* Re: [Caml-list] Closing the performance gap to C
From: Mark Shinwell @ 2016-12-21 16:56 UTC

I agree with Gabriel, but we are planning substantial work within Flambda
during the coming year to implement unboxing transformations therein. I do
think we should spend a little time, before diving into this particular
issue, convincing ourselves that these two tracks of work are going to be
complementary rather than conflicting.

Mark
* Re: [Caml-list] Closing the performance gap to C
From: Alain Frisch @ 2016-12-21 17:43 UTC

On 21/12/2016 17:56, Mark Shinwell wrote:
> I agree with Gabriel, but we are planning substantial work within Flambda
> during the coming year to implement unboxing transformations therein. I do
> think we should spend a little time, before diving into this particular
> issue, convincing ourselves that these two tracks of work are going to be
> complementary rather than conflicting.

Dealing with boxing/unboxing in flambda rather than keeping it at the cmm
level certainly seems the way to go in the context of the flambda pipeline.
This will probably need to touch the definition of clambda (to bring it
closer to cmm) and thus cmmgen. Do you believe this is compatible with
continuing to share clambda and cmmgen between the flambda and non-flambda
pipelines? At some point, it might be more natural, or just required, to
compile flambda directly to cmm without going through clambda. What do you
think?

If the flambda pipeline targets cmm directly without going through
clambda/cmmgen, the approach discussed here (for the non-flambda case) should
not interact too much with the flambda version. As far as I can tell, they
would mostly share the plumbing required to pass more info from typedtree to
lambda.

Alain
* Re: [Caml-list] Closing the performance gap to C
From: Mark Shinwell @ 2016-12-22 8:39 UTC

I'm not sure I have an answer to that question at the moment---I suggest
taking this item to a smaller group for discussion. I think we've mooted the
idea of trying to have some variant of Cmmgen that did the Clambda to Cmm
translation without doing e.g. the unboxing part (which Flambda would already
have done). However, something like that isn't entirely straightforward, as
there are various Cmm -> Cmm rewrites that are currently done by Cmmgen and
that we probably need to preserve. I'm pretty uncertain about how to treat
those at the moment; I think it would require some amount of experimentation.

Something tangentially related is that we should be able to get to the stage
fairly soon (within 2017, I think) where Flambda only generates the subset of
the Cmm language that is in SSA form. It doesn't seem to me that keeping
Clambda in the Flambda pipeline is actually a bad thing; it forms quite a
nice progression. It would however be nice to remove the Un_anf pass, which
is only there to satisfy specific pattern-matches in the Cmmgen code, and
slows down compilation.

Mark
* Re: [Caml-list] Closing the performance gap to C
From: Pierre Chambart @ 2016-12-22 17:23 UTC

On 21/12/2016 18:43, Alain Frisch wrote:
> Dealing with boxing/unboxing in flambda rather than keeping it at the cmm
> level certainly seems the way to go in the context of the flambda pipeline.
> [...]

The machinery to do that correctly could certainly be shared, I think.

The way I see it is to change the semantics of the float related primitives
at the clambda and flambda level. These should represent explicitly unboxed
float manipulation. The Closure/Closure_conversion (or Flambda_to_clambda,
for a first patch that does not change how flambda works) passes would
rewrite them to be enclosed by new box/unbox primitives. For instance, in
Cmmgen this would look like:

  | Pmulfloat -> Cop(Cmulf, [transl env arg1; transl env arg2], dbg)
  | Pboxfloat -> box_float dbg (transl env arg)
  | Punboxfloat -> transl_unbox_float dbg env arg

Then, on functions and direct calls at the clambda level, annotate the
calling convention, like in my old patch. This should be sufficient for doing
what was done in that patch without degrading anything, while still allowing
this to be implemented separately in flambda. Of course the Typedtree to
lambda type annotation propagation can be shared without any problem.
Updating the patch for that will be a bit tedious, but simple.

If you don't like the idea of having primitives that do not mean exactly the
same thing at the different stages of the compiler, introducing a new unboxed
version of each float primitive is also possible (with its use forbidden in
lambda). I would like at some point to have a different type for lambda
primitives and clambda primitives. It might be the right time to do that.

Also, there was a problem with the approach of that old patch that bothered
me. The function argument unboxing decision had to guess whether an argument
was going to be used boxed or not. If the guess is wrong you might end up
allocating more than without unboxing. Having the annotation there explicitly
allows the unboxing to be done earlier, with the decision relying on the
annotation. Notice that the transformation was not completely local, as it
had to propagate across functions whose arguments might be used unboxed. In
particular, recursive functions made it a bit complex. This was the job of
the added Specialisation module.

-- Pierre
* Re: [Caml-list] Closing the performance gap to C
From: Alain Frisch @ 2016-12-21 17:35 UTC

On 21/12/2016 17:39, Yotam Barnoy wrote:
> As far as I understand, the current calling convention has to do with
> generic functions.

Indeed. If you write:

  let f x = x +. 1.
  let g y = f (y +. 1.)

you cannot simply choose a specific calling convention more suited to fp
arguments/results, because the same function can be used as:

  let h l = List.map f l

Currently, the code would be compiled as (pseudo code with explicit
boxing/unboxing, and explicit function symbols and closures):

  let f$direct x = box (unbox x +. 1.)
  let f = closure(f$direct)
  let g$direct y = f$direct (box (unbox y +. 1.))
  let g = closure(g$direct)
  let h l = List.map f l

> And Alain, even if we create these alternate calling conventions, the
> tricky part is still detecting when it's ok to call using direct calling
> conventions. That's the hard part, and it's best done by Flambda with its
> inlining.

The point is that it's not really hard: there is already some support in the
compiler to detect direct calls and apply a specific calling convention (cf.
Clambda.Udirect_apply vs Clambda.Ugeneric_apply). It is "just" that this
specific calling convention does not do any specialization (the same function
body is used for the direct calling convention and the generic one).

The suggested approach would compile the code above as (assuming no inlining
of the f$direct and g$direct bodies):

  let f$unboxed x = x +. 1.
  let f$direct x = box (f$unboxed (unbox x))
  let f = closure(f$direct)

  let g$unboxed y =
      unbox (f$direct (box (y +. 1.)))
    = unbox (box (f$unboxed (unbox (box (y +. 1.)))))  (* short-circuit the wrapper *)
    = f$unboxed (y +. 1.)                              (* collapse unbox(box(..)) -> .. *)
  let g$direct y = box (g$unboxed (unbox y))

  let h l = List.map f l

The fact that g can be compiled to a direct call to f (and not to an
arbitrary function) is already known to the compiler. And at the level of
cmm, one can already describe functions that take unboxed floats (cf.
Cmm.machtype, and the fun_args field in Cmm.fundecl). What is missing is
mostly just the plumbing of type information to know that f and g should be
compiled as above (i.e. keep track of the type of their arguments and
result), and arranging for direct calls to short-circuit the boxing wrapper.

Now of course there are cases where this strategy would degrade performance:
imagine a function taking a float, but using it most often as a boxed float
(e.g. passing it to Printf.printf "%f"); all call sites would still pass an
unboxed float (even if the float comes naturally in already boxed form) and
that function would need to box the float again (possibly several times).
This is exactly the same story as for the current unboxing strategy
(described in http://www.lexifi.com/blog/unboxed-floats-ocaml ), where the
compiler always decides to keep a local "float ref" variable in unboxed form,
sometimes leading to extra boxing. But in practice this works well and lets
low-level numerical code run with no boxing at all except at data-structure
boundaries and function calls (for now).

Alain
* Re: [Caml-list] Closing the performance gap to C 2016-12-19 11:51 ` Gerd Stolpmann 2016-12-19 14:52 ` Soegtrop, Michael @ 2016-12-19 15:48 ` Ivan Gotovchits 2016-12-19 16:44 ` Yotam Barnoy 1 sibling, 1 reply; 25+ messages in thread From: Ivan Gotovchits @ 2016-12-19 15:48 UTC (permalink / raw) To: Gerd Stolpmann; +Cc: Christoph Höger, caml-list [-- Attachment #1: Type: text/plain, Size: 9211 bytes --] Hi Christoph, The problem is that your function definitions, like `loop` and `rk4_step`, have too many parameters and OCaml is not able to eliminate them and is actually not trying. It was always a feature of OCaml that poorly written code will be compiled into a poorly written program. OCaml never tries to fix programmer's errors. It will try to minimize the abstraction overhead (often to the zero). But if the abstractions, on the first hand, were chosen incorrectly, then it won't fix the code. In your particular example, the C compiler was quite aggressive and was capable of eliminating unnecessary computations. I wouldn't, however, assume that the C compiler will be always able to do this for you. Let's go through the code: let rec loop steps h n y t = if n < steps then loop steps h (n+1) (rk4_step y t h) (t +. h) else y Here variables `steps` and `h` are loop invariants, so you shouldn't put it into the loop function parameter list. Well yes, a compiler can find out loop invariants and remove them for you. But in our case, to remove them, it will need to change the type of the function. The compiler will not go that far for us. It respects a programmer, and if a programmer decided to make his loop function to depend on `6` arguments, then it means that the computation is actually depending on 6 arguments. So, it will allocate 6 registers to hold loop variables, with all the consequences. Now, let's take a look at the `rk4_step` function: let rk4_step y t h = let k1 = h *. y' t y in let k2 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k1) in let k3 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k2) in let k4 = h *. y' (t +. h) (y +. k3) in y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 This function, is, in fact, a body of the loop, and everything except t is loop invariant here. Moreover, function `y'` is defined as: let y' t y = cos t I.e., it doesn't really use it the second argument. Probably, a compiler should inline the call, and eliminate lots of unecessary computations, and thus free a few registers, but, apparently, OCaml doesn't do this (even in 4.03+flambda). So we should do this manually: let rk4_step y t = let k1 = h *. y' t in let k2 = h *. y' (t +. 0.5*.h) in let k3 = h *. y' (t +. 0.5*.h) in let k4 = h *. y' (t +. h) (y +. k3) in y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 We can even see, that `k3` and `k2` are equal now, so we can eliminate them: let rk4_step y t = let k1 = h *. y' t in let k2 = h *. y' (t +. 0.5*.h) in let k4 = h *. y' (t +. h) (y +. k3) in y +. (k1+.k4)/.6.0 +. k2 *. 1.5 Finally, we don't want to pass `y` into the `rk4_step` every time, as we don't want to require an extra register for it. After all these manual optimizations, we have the following program: let h = 0.1 let exact t = sin t let rk4_step t = let k1 = h *. cos t in let k2 = h *. cos (t +. 0.5*.h) in let k4 = h *. cos (t +. h) in (k1+.k4)/.6.0 +. k2*.1.5 let compute steps = let rec loop n y t = if n < steps then loop (n+1) (y +. rk4_step t) (t +. h) else y in loop 1 1.0 0.0 let () = let y = compute 102 in let err = abs_float (y -. (exact ((float_of_int 102) *. 
h))) in let large = 50000000 in let y = compute large in Printf.printf "%b\n" (abs_float (y -. (exact (float_of_int large) *. h)) < 2. *. err) This program has the same performance as the C one... unless I pass really aggressive optimization options to the C compiler, that will emit a platform specific code, e.g., gcc rk.c -lm -O3 -march=corei7-avx -o rksse These options basically double the performance of the C version, leaving OCaml lagging behind. That is because, OCaml, obviously, cannot follow the latest developments of intel CPU, especially in the field of SSE. The fun part is that when I've tried to compile the same file with clang, the resulting program was even slower than the original non-optimized OCaml. But this is all micro benchmarking of course, so don't jump to fast conclusions (although I like to think that OCaml is faster than Clang :)) As a final remark, my experience in HPC shows that in general you should not really rely on compiler optimizations and hope that the compiler will do the magic for you. Even the GCC compiler. It would be very easy to accidentally amend the above program in a such way, that the optimizations will no longer fire in. Of course, writing in assembly is also not a choice. If you really need to optimize, then you should find out the performance bottleneck and then optimize it manually until you get an expected performance. Alternatively, you can use plain old Fortran to get the reliable performance. And then call it from C or OCaml. Best wishes, Ivan On Mon, Dec 19, 2016 at 6:51 AM, Gerd Stolpmann <info@gerd-stolpmann.de> wrote: > Hi Christoph, > > the extra code looks very much like an allocation on the minor heap: > > sub $0x10,%r15 > lea 0x25c7b6(%rip),%rax > cmp (%rax),%r15 > jb 404a8a <dlerror@plt+0x2d0a> > lea 0x8(%r15),%rax > movq $0x4fd,-0x8(%rax) > > r15 points to the used area of the minor heap - by decrementing it you > get an additional block of memory. It is compared against the beginning > of the heap to check whether GC is needed. The constant 0x4fd is the > header of the new block (which must be always initialized). > > From the source code, it remains unclear for what this is used. > Obviously, the compiler runs out of registers, and moves some values to > the minor heap (temporarily). When you call a C function like cos it is > likely that this happens because the C calling conventions do not > preserve the FP registers (xmm*). This could be improved if the OCaml > compiler tried alternate places for temporarily storing FP values: > > - int registers (which is perfectly possible on 64 bit platforms). > A number of int registers survive C calls. > - stack > > To my knowledge, the OCaml compiler never tries this (but this could be > out of date). This is a fairly specific optimization that makes mostly > sense for purely iterating or aggregating functions like yours that do > not store FP values away. > > Gerd > > Am Samstag, den 17.12.2016, 14:02 +0100 schrieb Christoph Höger: > > Ups. Forgot the actual examples. > > > > Am 17.12.2016 um 14:01 schrieb Christoph Höger: > > > > > > Dear all, > > > > > > find attached two simple runge-kutta iteration schemes. One is > > > written > > > in C, the other in OCaml. I compared the runtime of both and gcc (- > > > O2) > > > produces an executable that is roughly 30% faster (to be more > > > precise: > > > 3.52s vs. 2.63s). That is in itself quite pleasing, I think. I do > > > not > > > understand however, what causes this difference. 
Admittedly, the > > > generated assembly looks completely different, but both compilers > > > inline > > > all functions and generate one big loop. Ocaml generates a lot more > > > scaffolding, but that is to be expected. > > > > > > There is however an interesting particularity: OCaml generates 6 > > > calls > > > to cos, while gcc only needs 3 (and one direct jump). Surprisingly, > > > there are also calls to cosh, acos and pretty much any other > > > trigonometric function (initialization of constants, maybe?) > > > > > > However, the true culprit seems to be an excess of instructions > > > between > > > the different calls to cos. This is what happens between the first > > > two > > > calls to cos: > > > > > > gcc: > > > jmpq 400530 <cos@plt> > > > nop > > > nopw %cs:0x0(%rax,%rax,1) > > > > > > sub $0x38,%rsp > > > movsd %xmm0,0x10(%rsp) > > > movapd %xmm1,%xmm0 > > > movsd %xmm2,0x18(%rsp) > > > movsd %xmm1,0x8(%rsp) > > > callq 400530 <cos@plt> > > > > > > ocamlopt: > > > > > > callq 401a60 <cos@plt> > > > mulsd (%r12),%xmm0 > > > movsd %xmm0,0x10(%rsp) > > > sub $0x10,%r15 > > > lea 0x25c7b6(%rip),%rax > > > cmp (%rax),%r15 > > > jb 404a8a <dlerror@plt+0x2d0a> > > > lea 0x8(%r15),%rax > > > movq $0x4fd,-0x8(%rax) > > > > > > movsd 0x32319(%rip),%xmm1 > > > > > > movapd %xmm1,%xmm2 > > > mulsd %xmm0,%xmm2 > > > addsd 0x0(%r13),%xmm2 > > > movsd %xmm2,(%rax) > > > movapd %xmm1,%xmm0 > > > mulsd (%r12),%xmm0 > > > addsd (%rbx),%xmm0 > > > callq 401a60 <cos@plt> > > > > > > > > > Is this caused by some underlying difference in the representation > > > of > > > numeric values (i.e. tagged ints) or is it reasonable to attack > > > this > > > issue as a hobby experiment? > > > > > > > > > thanks for any advice, > > > > > > Christoph > > > > > > -- > ------------------------------------------------------------ > Gerd Stolpmann, Darmstadt, Germany gerd@gerd-stolpmann.de > My OCaml site: http://www.camlcity.org > Contact details: http://www.camlcity.org/contact.html > Company homepage: http://www.gerd-stolpmann.de > ------------------------------------------------------------ > > > [-- Attachment #2: Type: text/html, Size: 16989 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
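One way to test the allocation hypothesis above empirically is to read the runtime's allocation counter around the hot loop. A minimal sketch, assuming `compute` is bound to whichever loop variant is under test; Gc.minor_words is in the standard library of recent OCaml releases (Gc.allocated_bytes serves the same purpose on older ones):

(* returns the result of f x together with the minor-heap words it allocated *)
let words_allocated f x =
  let before = Gc.minor_words () in
  let y = f x in
  (y, Gc.minor_words () -. before)

(* usage sketch:
   let () =
     let _y, words = words_allocated compute 50_000_000 in
     Printf.printf "minor words allocated: %.0f\n" words
*)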
* Re: [Caml-list] Closing the performance gap to C 2016-12-19 15:48 ` Ivan Gotovchits @ 2016-12-19 16:44 ` Yotam Barnoy 2016-12-19 16:59 ` Ivan Gotovchits 0 siblings, 1 reply; 25+ messages in thread From: Yotam Barnoy @ 2016-12-19 16:44 UTC (permalink / raw) To: Ivan Gotovchits; +Cc: Gerd Stolpmann, Christoph Höger, caml-list Christoph, you don't mention which version of OCaml you're compiling with, and whether FLambda is used. Flambda should be able to do the kind of optimizations Ivan mentions, at least ideally. If it's not able to do them yet, it will be when the rewritten version is deployed. -Yotam On Mon, Dec 19, 2016 at 10:48 AM, Ivan Gotovchits <ivg@ieee.org> wrote: > Hi Christoph, > > The problem is that your function definitions, like `loop` and `rk4_step`, > have too many parameters > and OCaml is not able to eliminate them and is actually not trying. It was > always a feature of OCaml > that poorly written code will be compiled into a poorly written program. > OCaml never tries to fix > programmer's errors. It will try to minimize the abstraction overhead (often > to the zero). But if the abstractions, > on the first hand, were chosen incorrectly, then it won't fix the code. > > In your particular example, the C compiler was quite aggressive and was > capable of eliminating unnecessary computations. > I wouldn't, however, assume that the C compiler will be always able to do > this for you. > > Let's go through the code: > > > let rec loop steps h n y t = > if n < steps then loop steps h (n+1) (rk4_step y t h) (t +. h) else > y > > Here variables `steps` and `h` are loop invariants, so you shouldn't put it > into the loop function parameter list. > Well yes, a compiler can find out loop invariants and remove them for you. > But in our case, to remove them, > it will need to change the type of the function. The compiler will not go > that far for us. It respects a programmer, and if a > programmer decided to make his loop function to depend on `6` arguments, > then it means that the computation is actually depending > on 6 arguments. So, it will allocate 6 registers to hold loop variables, > with all the consequences. > > > Now, let's take a look at the `rk4_step` function: > > let rk4_step y t h = > let k1 = h *. y' t y in > let k2 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k1) in > let k3 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k2) in > let k4 = h *. y' (t +. h) (y +. k3) in > y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 > > > > This function, is, in fact, a body of the loop, and everything except t is > loop invariant here. Moreover, > function `y'` is defined as: > > let y' t y = cos t > > I.e., it doesn't really use it the second argument. Probably, a compiler > should inline the call, and eliminate > lots of unecessary computations, and thus free a few registers, but, > apparently, OCaml doesn't do this > (even in 4.03+flambda). > > So we should do this manually: > > let rk4_step y t = > let k1 = h *. y' t in > let k2 = h *. y' (t +. 0.5*.h) in > let k3 = h *. y' (t +. 0.5*.h) in > let k4 = h *. y' (t +. h) (y +. k3) in > y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 > > We can even see, that `k3` and `k2` are equal now, so we can eliminate them: > > let rk4_step y t = > let k1 = h *. y' t in > let k2 = h *. y' (t +. 0.5*.h) in > let k4 = h *. y' (t +. h) (y +. k3) in > y +. (k1+.k4)/.6.0 +. k2 *. 1.5 > > > Finally, we don't want to pass `y` into the `rk4_step` every time, as we > don't want to require an extra register for it. 
> After all these manual optimizations, we have the following program: > > let h = 0.1 > > > let exact t = sin t > > > let rk4_step t = > > let k1 = h *. cos t in > > let k2 = h *. cos (t +. 0.5*.h) in > > let k4 = h *. cos (t +. h) in > > (k1+.k4)/.6.0 +. k2*.1.5 > > > let compute steps = > > let rec loop n y t = > > if n < steps > > then loop (n+1) (y +. rk4_step t) (t +. h) > > else y in > > loop 1 1.0 0.0 > > > let () = > > let y = compute 102 in > > let err = abs_float (y -. (exact ((float_of_int 102) *. h))) in > > let large = 50000000 in > > let y = compute large in > > Printf.printf "%b\n" > > (abs_float (y -. (exact (float_of_int large) *. h)) < 2. *. err) > > > > This program has the same performance as the C one... unless I pass really > aggressive optimization options > to the C compiler, that will emit a platform specific code, e.g., > > gcc rk.c -lm -O3 -march=corei7-avx -o rksse > > > These options basically double the performance of the C version, leaving > OCaml lagging behind. That is because, > OCaml, obviously, cannot follow the latest developments of intel CPU, > especially in the field of SSE. > > The fun part is that when I've tried to compile the same file with clang, > the resulting program was even slower > than the original non-optimized OCaml. But this is all micro benchmarking of > course, so don't jump to fast conclusions > (although I like to think that OCaml is faster than Clang :)) > > > As a final remark, my experience in HPC shows that in general you should not > really rely on compiler optimizations and hope > that the compiler will do the magic for you. Even the GCC compiler. It would > be very easy to accidentally amend the above program > in a such way, that the optimizations will no longer fire in. Of course, > writing in assembly is also not a choice. If you really need > to optimize, then you should find out the performance bottleneck and then > optimize it manually until you get an expected performance. > Alternatively, you can use plain old Fortran to get the reliable > performance. And then call it from C or OCaml. > > > Best wishes, > Ivan > > > > On Mon, Dec 19, 2016 at 6:51 AM, Gerd Stolpmann <info@gerd-stolpmann.de> > wrote: >> >> Hi Christoph, >> >> the extra code looks very much like an allocation on the minor heap: >> >> sub $0x10,%r15 >> lea 0x25c7b6(%rip),%rax >> cmp (%rax),%r15 >> jb 404a8a <dlerror@plt+0x2d0a> >> lea 0x8(%r15),%rax >> movq $0x4fd,-0x8(%rax) >> >> r15 points to the used area of the minor heap - by decrementing it you >> get an additional block of memory. It is compared against the beginning >> of the heap to check whether GC is needed. The constant 0x4fd is the >> header of the new block (which must be always initialized). >> >> From the source code, it remains unclear for what this is used. >> Obviously, the compiler runs out of registers, and moves some values to >> the minor heap (temporarily). When you call a C function like cos it is >> likely that this happens because the C calling conventions do not >> preserve the FP registers (xmm*). This could be improved if the OCaml >> compiler tried alternate places for temporarily storing FP values: >> >> - int registers (which is perfectly possible on 64 bit platforms). >> A number of int registers survive C calls. >> - stack >> >> To my knowledge, the OCaml compiler never tries this (but this could be >> out of date). 
This is a fairly specific optimization that makes mostly >> sense for purely iterating or aggregating functions like yours that do >> not store FP values away. >> >> Gerd >> >> Am Samstag, den 17.12.2016, 14:02 +0100 schrieb Christoph Höger: >> > Ups. Forgot the actual examples. >> > >> > Am 17.12.2016 um 14:01 schrieb Christoph Höger: >> > > >> > > Dear all, >> > > >> > > find attached two simple runge-kutta iteration schemes. One is >> > > written >> > > in C, the other in OCaml. I compared the runtime of both and gcc (- >> > > O2) >> > > produces an executable that is roughly 30% faster (to be more >> > > precise: >> > > 3.52s vs. 2.63s). That is in itself quite pleasing, I think. I do >> > > not >> > > understand however, what causes this difference. Admittedly, the >> > > generated assembly looks completely different, but both compilers >> > > inline >> > > all functions and generate one big loop. Ocaml generates a lot more >> > > scaffolding, but that is to be expected. >> > > >> > > There is however an interesting particularity: OCaml generates 6 >> > > calls >> > > to cos, while gcc only needs 3 (and one direct jump). Surprisingly, >> > > there are also calls to cosh, acos and pretty much any other >> > > trigonometric function (initialization of constants, maybe?) >> > > >> > > However, the true culprit seems to be an excess of instructions >> > > between >> > > the different calls to cos. This is what happens between the first >> > > two >> > > calls to cos: >> > > >> > > gcc: >> > > jmpq 400530 <cos@plt> >> > > nop >> > > nopw %cs:0x0(%rax,%rax,1) >> > > >> > > sub $0x38,%rsp >> > > movsd %xmm0,0x10(%rsp) >> > > movapd %xmm1,%xmm0 >> > > movsd %xmm2,0x18(%rsp) >> > > movsd %xmm1,0x8(%rsp) >> > > callq 400530 <cos@plt> >> > > >> > > ocamlopt: >> > > >> > > callq 401a60 <cos@plt> >> > > mulsd (%r12),%xmm0 >> > > movsd %xmm0,0x10(%rsp) >> > > sub $0x10,%r15 >> > > lea 0x25c7b6(%rip),%rax >> > > cmp (%rax),%r15 >> > > jb 404a8a <dlerror@plt+0x2d0a> >> > > lea 0x8(%r15),%rax >> > > movq $0x4fd,-0x8(%rax) >> > > >> > > movsd 0x32319(%rip),%xmm1 >> > > >> > > movapd %xmm1,%xmm2 >> > > mulsd %xmm0,%xmm2 >> > > addsd 0x0(%r13),%xmm2 >> > > movsd %xmm2,(%rax) >> > > movapd %xmm1,%xmm0 >> > > mulsd (%r12),%xmm0 >> > > addsd (%rbx),%xmm0 >> > > callq 401a60 <cos@plt> >> > > >> > > >> > > Is this caused by some underlying difference in the representation >> > > of >> > > numeric values (i.e. tagged ints) or is it reasonable to attack >> > > this >> > > issue as a hobby experiment? >> > > >> > > >> > > thanks for any advice, >> > > >> > > Christoph >> > > >> > >> -- >> ------------------------------------------------------------ >> Gerd Stolpmann, Darmstadt, Germany gerd@gerd-stolpmann.de >> My OCaml site: http://www.camlcity.org >> Contact details: http://www.camlcity.org/contact.html >> Company homepage: http://www.gerd-stolpmann.de >> ------------------------------------------------------------ >> >> > ^ permalink raw reply [flat|nested] 25+ messages in thread
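For reference, the inline directive mentioned in the next message uses attribute syntax along these lines on OCaml 4.03+, typically together with something like ocamlopt -O3 on a flambda switch; this is only a sketch of the syntax, and as the follow-up reports, it does not by itself guarantee the specialisation discussed here:

(* definition-site hint *)
let y' t _y = cos t [@@inline always]

(* call-site hint *)
let dy t y = (y' [@inlined]) t y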
* Re: [Caml-list] Closing the performance gap to C 2016-12-19 16:44 ` Yotam Barnoy @ 2016-12-19 16:59 ` Ivan Gotovchits 2016-12-21 9:08 ` Christoph Höger 0 siblings, 1 reply; 25+ messages in thread From: Ivan Gotovchits @ 2016-12-19 16:59 UTC (permalink / raw) To: Yotam Barnoy; +Cc: Gerd Stolpmann, Christoph Höger, caml-list [-- Attachment #1: Type: text/plain, Size: 11614 bytes --] > Flambda should be able to do the kind of optimizations Ivan mentions, at least ideally. I've used 4.03+flambda and it didn't perform the optimizations. I've played a lot with the inline options and `inline` directive and even noticed a strange (on a first glance) thing, that the performance is better, when I explicitly _disable_ of the `trk4_step` function (with `[@@inline never]`) As a side note, not about compilers, but about mathematics. Since the three cosine functions, that are computed in the loop are basically cos(t), cos(t+h) and cos(t + h/2) and `h` is a constant, you can use the `cos(t+x) = cos(t)cos(x) - sin(t)sin(x)` trigonometric identity to factor out the constant factors (that are only depend on `h`), so finally, you will need to compute one cosine and one sine. This will boost the performance of the program, leaving even the SSE corei7 specific version far behind. On Mon, Dec 19, 2016 at 11:44 AM, Yotam Barnoy <yotambarnoy@gmail.com> wrote: > Christoph, you don't mention which version of OCaml you're compiling > with, and whether FLambda is used. Flambda should be able to do the > kind of optimizations Ivan mentions, at least ideally. If it's not > able to do them yet, it will be when the rewritten version is > deployed. > > -Yotam > > > On Mon, Dec 19, 2016 at 10:48 AM, Ivan Gotovchits <ivg@ieee.org> wrote: > > Hi Christoph, > > > > The problem is that your function definitions, like `loop` and > `rk4_step`, > > have too many parameters > > and OCaml is not able to eliminate them and is actually not trying. It > was > > always a feature of OCaml > > that poorly written code will be compiled into a poorly written program. > > OCaml never tries to fix > > programmer's errors. It will try to minimize the abstraction overhead > (often > > to the zero). But if the abstractions, > > on the first hand, were chosen incorrectly, then it won't fix the code. > > > > In your particular example, the C compiler was quite aggressive and was > > capable of eliminating unnecessary computations. > > I wouldn't, however, assume that the C compiler will be always able to do > > this for you. > > > > Let's go through the code: > > > > > > let rec loop steps h n y t = > > if n < steps then loop steps h (n+1) (rk4_step y t h) (t +. h) else > > y > > > > Here variables `steps` and `h` are loop invariants, so you shouldn't put > it > > into the loop function parameter list. > > Well yes, a compiler can find out loop invariants and remove them for > you. > > But in our case, to remove them, > > it will need to change the type of the function. The compiler will not go > > that far for us. It respects a programmer, and if a > > programmer decided to make his loop function to depend on `6` arguments, > > then it means that the computation is actually depending > > on 6 arguments. So, it will allocate 6 registers to hold loop variables, > > with all the consequences. > > > > > > Now, let's take a look at the `rk4_step` function: > > > > let rk4_step y t h = > > let k1 = h *. y' t y in > > let k2 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k1) in > > let k3 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k2) in > > let k4 = h *. y' (t +. 
h) (y +. k3) in > > y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 > > > > > > > > This function, is, in fact, a body of the loop, and everything except t > is > > loop invariant here. Moreover, > > function `y'` is defined as: > > > > let y' t y = cos t > > > > I.e., it doesn't really use it the second argument. Probably, a compiler > > should inline the call, and eliminate > > lots of unecessary computations, and thus free a few registers, but, > > apparently, OCaml doesn't do this > > (even in 4.03+flambda). > > > > So we should do this manually: > > > > let rk4_step y t = > > let k1 = h *. y' t in > > let k2 = h *. y' (t +. 0.5*.h) in > > let k3 = h *. y' (t +. 0.5*.h) in > > let k4 = h *. y' (t +. h) (y +. k3) in > > y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 > > > > We can even see, that `k3` and `k2` are equal now, so we can eliminate > them: > > > > let rk4_step y t = > > let k1 = h *. y' t in > > let k2 = h *. y' (t +. 0.5*.h) in > > let k4 = h *. y' (t +. h) (y +. k3) in > > y +. (k1+.k4)/.6.0 +. k2 *. 1.5 > > > > > > Finally, we don't want to pass `y` into the `rk4_step` every time, as we > > don't want to require an extra register for it. > > After all these manual optimizations, we have the following program: > > > > let h = 0.1 > > > > > > let exact t = sin t > > > > > > let rk4_step t = > > > > let k1 = h *. cos t in > > > > let k2 = h *. cos (t +. 0.5*.h) in > > > > let k4 = h *. cos (t +. h) in > > > > (k1+.k4)/.6.0 +. k2*.1.5 > > > > > > let compute steps = > > > > let rec loop n y t = > > > > if n < steps > > > > then loop (n+1) (y +. rk4_step t) (t +. h) > > > > else y in > > > > loop 1 1.0 0.0 > > > > > > let () = > > > > let y = compute 102 in > > > > let err = abs_float (y -. (exact ((float_of_int 102) *. h))) in > > > > let large = 50000000 in > > > > let y = compute large in > > > > Printf.printf "%b\n" > > > > (abs_float (y -. (exact (float_of_int large) *. h)) < 2. *. err) > > > > > > > > This program has the same performance as the C one... unless I pass > really > > aggressive optimization options > > to the C compiler, that will emit a platform specific code, e.g., > > > > gcc rk.c -lm -O3 -march=corei7-avx -o rksse > > > > > > These options basically double the performance of the C version, leaving > > OCaml lagging behind. That is because, > > OCaml, obviously, cannot follow the latest developments of intel CPU, > > especially in the field of SSE. > > > > The fun part is that when I've tried to compile the same file with clang, > > the resulting program was even slower > > than the original non-optimized OCaml. But this is all micro > benchmarking of > > course, so don't jump to fast conclusions > > (although I like to think that OCaml is faster than Clang :)) > > > > > > As a final remark, my experience in HPC shows that in general you should > not > > really rely on compiler optimizations and hope > > that the compiler will do the magic for you. Even the GCC compiler. It > would > > be very easy to accidentally amend the above program > > in a such way, that the optimizations will no longer fire in. Of course, > > writing in assembly is also not a choice. If you really need > > to optimize, then you should find out the performance bottleneck and then > > optimize it manually until you get an expected performance. > > Alternatively, you can use plain old Fortran to get the reliable > > performance. And then call it from C or OCaml. 
> > > > > > Best wishes, > > Ivan > > > > > > > > On Mon, Dec 19, 2016 at 6:51 AM, Gerd Stolpmann <info@gerd-stolpmann.de> > > wrote: > >> > >> Hi Christoph, > >> > >> the extra code looks very much like an allocation on the minor heap: > >> > >> sub $0x10,%r15 > >> lea 0x25c7b6(%rip),%rax > >> cmp (%rax),%r15 > >> jb 404a8a <dlerror@plt+0x2d0a> > >> lea 0x8(%r15),%rax > >> movq $0x4fd,-0x8(%rax) > >> > >> r15 points to the used area of the minor heap - by decrementing it you > >> get an additional block of memory. It is compared against the beginning > >> of the heap to check whether GC is needed. The constant 0x4fd is the > >> header of the new block (which must be always initialized). > >> > >> From the source code, it remains unclear for what this is used. > >> Obviously, the compiler runs out of registers, and moves some values to > >> the minor heap (temporarily). When you call a C function like cos it is > >> likely that this happens because the C calling conventions do not > >> preserve the FP registers (xmm*). This could be improved if the OCaml > >> compiler tried alternate places for temporarily storing FP values: > >> > >> - int registers (which is perfectly possible on 64 bit platforms). > >> A number of int registers survive C calls. > >> - stack > >> > >> To my knowledge, the OCaml compiler never tries this (but this could be > >> out of date). This is a fairly specific optimization that makes mostly > >> sense for purely iterating or aggregating functions like yours that do > >> not store FP values away. > >> > >> Gerd > >> > >> Am Samstag, den 17.12.2016, 14:02 +0100 schrieb Christoph Höger: > >> > Ups. Forgot the actual examples. > >> > > >> > Am 17.12.2016 um 14:01 schrieb Christoph Höger: > >> > > > >> > > Dear all, > >> > > > >> > > find attached two simple runge-kutta iteration schemes. One is > >> > > written > >> > > in C, the other in OCaml. I compared the runtime of both and gcc (- > >> > > O2) > >> > > produces an executable that is roughly 30% faster (to be more > >> > > precise: > >> > > 3.52s vs. 2.63s). That is in itself quite pleasing, I think. I do > >> > > not > >> > > understand however, what causes this difference. Admittedly, the > >> > > generated assembly looks completely different, but both compilers > >> > > inline > >> > > all functions and generate one big loop. Ocaml generates a lot more > >> > > scaffolding, but that is to be expected. > >> > > > >> > > There is however an interesting particularity: OCaml generates 6 > >> > > calls > >> > > to cos, while gcc only needs 3 (and one direct jump). Surprisingly, > >> > > there are also calls to cosh, acos and pretty much any other > >> > > trigonometric function (initialization of constants, maybe?) > >> > > > >> > > However, the true culprit seems to be an excess of instructions > >> > > between > >> > > the different calls to cos. 
This is what happens between the first > >> > > two > >> > > calls to cos: > >> > > > >> > > gcc: > >> > > jmpq 400530 <cos@plt> > >> > > nop > >> > > nopw %cs:0x0(%rax,%rax,1) > >> > > > >> > > sub $0x38,%rsp > >> > > movsd %xmm0,0x10(%rsp) > >> > > movapd %xmm1,%xmm0 > >> > > movsd %xmm2,0x18(%rsp) > >> > > movsd %xmm1,0x8(%rsp) > >> > > callq 400530 <cos@plt> > >> > > > >> > > ocamlopt: > >> > > > >> > > callq 401a60 <cos@plt> > >> > > mulsd (%r12),%xmm0 > >> > > movsd %xmm0,0x10(%rsp) > >> > > sub $0x10,%r15 > >> > > lea 0x25c7b6(%rip),%rax > >> > > cmp (%rax),%r15 > >> > > jb 404a8a <dlerror@plt+0x2d0a> > >> > > lea 0x8(%r15),%rax > >> > > movq $0x4fd,-0x8(%rax) > >> > > > >> > > movsd 0x32319(%rip),%xmm1 > >> > > > >> > > movapd %xmm1,%xmm2 > >> > > mulsd %xmm0,%xmm2 > >> > > addsd 0x0(%r13),%xmm2 > >> > > movsd %xmm2,(%rax) > >> > > movapd %xmm1,%xmm0 > >> > > mulsd (%r12),%xmm0 > >> > > addsd (%rbx),%xmm0 > >> > > callq 401a60 <cos@plt> > >> > > > >> > > > >> > > Is this caused by some underlying difference in the representation > >> > > of > >> > > numeric values (i.e. tagged ints) or is it reasonable to attack > >> > > this > >> > > issue as a hobby experiment? > >> > > > >> > > > >> > > thanks for any advice, > >> > > > >> > > Christoph > >> > > > >> > > >> -- > >> ------------------------------------------------------------ > >> Gerd Stolpmann, Darmstadt, Germany gerd@gerd-stolpmann.de > >> My OCaml site: http://www.camlcity.org > >> Contact details: http://www.camlcity.org/contact.html > >> Company homepage: http://www.gerd-stolpmann.de > >> ------------------------------------------------------------ > >> > >> > > > [-- Attachment #2: Type: text/html, Size: 15225 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
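A sketch of the trigonometric rewrite suggested above, applied to the simplified rk4_step from the earlier message; the four factors below depend only on h, so each step needs exactly one cos and one sin call. (One could go further and advance cos t and sin t by a rotation each step, removing the per-step calls entirely at the cost of accumulated rounding error.)

let h = 0.1

(* constants that depend only on the step size h *)
let ch2 = cos (0.5 *. h)
let sh2 = sin (0.5 *. h)
let ch1 = cos h
let sh1 = sin h

let rk4_step t =
  let c = cos t and s = sin t in             (* the only trig calls per step *)
  let k1 = h *. c in
  let k2 = h *. (c *. ch2 -. s *. sh2) in    (* = h *. cos (t +. 0.5 *. h) *)
  let k4 = h *. (c *. ch1 -. s *. sh1) in    (* = h *. cos (t +. h) *)
  (k1 +. k4) /. 6.0 +. k2 *. 1.5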
* Re: [Caml-list] Closing the performance gap to C 2016-12-19 16:59 ` Ivan Gotovchits @ 2016-12-21 9:08 ` Christoph Höger 2016-12-23 12:18 ` Oleg 0 siblings, 1 reply; 25+ messages in thread From: Christoph Höger @ 2016-12-21 9:08 UTC (permalink / raw) To: Ivan Gotovchits, Yotam Barnoy; +Cc: Gerd Stolpmann, caml-list [-- Attachment #1.1: Type: text/plain, Size: 14418 bytes --] Dear all, thanks a lot for your valuable input. It seems like there still is something to gain for OCaml, but it is not quite simple to achieve. From your comments I reckon the main culprit is the use of a tail-recursive function instead of a loop, which causes boxing (which causes spilling). I will repeat my experiment with a new setup asap. @Ivan: Your point about the optimization is well taken, but please consider that the runge-kutta integrator is a sophisticated, general numerical method. Usually, the user of such a method has neither the knowledge nor the time to manually adapt it, but rather implements a standard interface. Therefore, the unused arguments and algebraic/trigonometric identities have to be resolved by the compiler. Otherwise, we could as well compute the exact solution, anyways ;). Am 19.12.2016 um 17:59 schrieb Ivan Gotovchits: >> Flambda should be able to do the kind of optimizations Ivan mentions, > at least ideally. > > I've used 4.03+flambda and it didn't perform the optimizations. I've > played a lot with the inline options and > `inline` directive and even noticed a strange (on a first glance) thing, > that the performance is better, when > I explicitly _disable_ of the `trk4_step` function (with `[@@inline never]`) > > As a side note, not about compilers, but about mathematics. > Since the three cosine functions, that are computed in the > loop are basically cos(t), cos(t+h) and cos(t + h/2) and `h` is a > constant, you can use the `cos(t+x) = cos(t)cos(x) - sin(t)sin(x)` > trigonometric > identity to factor out the constant factors (that are only depend on > `h`), so finally, you will need to compute one cosine and one sine. > This will boost the performance of the program, leaving even the SSE > corei7 specific version far behind. > > > On Mon, Dec 19, 2016 at 11:44 AM, Yotam Barnoy <yotambarnoy@gmail.com > <mailto:yotambarnoy@gmail.com>> wrote: > > Christoph, you don't mention which version of OCaml you're compiling > with, and whether FLambda is used. Flambda should be able to do the > kind of optimizations Ivan mentions, at least ideally. If it's not > able to do them yet, it will be when the rewritten version is > deployed. > > -Yotam > > > On Mon, Dec 19, 2016 at 10:48 AM, Ivan Gotovchits <ivg@ieee.org > <mailto:ivg@ieee.org>> wrote: > > Hi Christoph, > > > > The problem is that your function definitions, like `loop` and > `rk4_step`, > > have too many parameters > > and OCaml is not able to eliminate them and is actually not > trying. It was > > always a feature of OCaml > > that poorly written code will be compiled into a poorly written > program. > > OCaml never tries to fix > > programmer's errors. It will try to minimize the abstraction > overhead (often > > to the zero). But if the abstractions, > > on the first hand, were chosen incorrectly, then it won't fix the > code. > > > > In your particular example, the C compiler was quite aggressive > and was > > capable of eliminating unnecessary computations. > > I wouldn't, however, assume that the C compiler will be always > able to do > > this for you. 
> > > > Let's go through the code: > > > > > > let rec loop steps h n y t = > > if n < steps then loop steps h (n+1) (rk4_step y t h) (t +. h) else > > y > > > > Here variables `steps` and `h` are loop invariants, so you > shouldn't put it > > into the loop function parameter list. > > Well yes, a compiler can find out loop invariants and remove them > for you. > > But in our case, to remove them, > > it will need to change the type of the function. The compiler will > not go > > that far for us. It respects a programmer, and if a > > programmer decided to make his loop function to depend on `6` > arguments, > > then it means that the computation is actually depending > > on 6 arguments. So, it will allocate 6 registers to hold loop > variables, > > with all the consequences. > > > > > > Now, let's take a look at the `rk4_step` function: > > > > let rk4_step y t h = > > let k1 = h *. y' t y in > > let k2 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k1) in > > let k3 = h *. y' (t +. 0.5*.h) (y +. 0.5*.k2) in > > let k4 = h *. y' (t +. h) (y +. k3) in > > y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 > > > > > > > > This function, is, in fact, a body of the loop, and everything > except t is > > loop invariant here. Moreover, > > function `y'` is defined as: > > > > let y' t y = cos t > > > > I.e., it doesn't really use it the second argument. Probably, a > compiler > > should inline the call, and eliminate > > lots of unecessary computations, and thus free a few registers, but, > > apparently, OCaml doesn't do this > > (even in 4.03+flambda). > > > > So we should do this manually: > > > > let rk4_step y t = > > let k1 = h *. y' t in > > let k2 = h *. y' (t +. 0.5*.h) in > > let k3 = h *. y' (t +. 0.5*.h) in > > let k4 = h *. y' (t +. h) (y +. k3) in > > y +. (k1+.k4)/.6.0 +. (k2+.k3)/.3.0 > > > > We can even see, that `k3` and `k2` are equal now, so we can > eliminate them: > > > > let rk4_step y t = > > let k1 = h *. y' t in > > let k2 = h *. y' (t +. 0.5*.h) in > > let k4 = h *. y' (t +. h) (y +. k3) in > > y +. (k1+.k4)/.6.0 +. k2 *. 1.5 > > > > > > Finally, we don't want to pass `y` into the `rk4_step` every time, > as we > > don't want to require an extra register for it. > > After all these manual optimizations, we have the following program: > > > > let h = 0.1 > > > > > > let exact t = sin t > > > > > > let rk4_step t = > > > > let k1 = h *. cos t in > > > > let k2 = h *. cos (t +. 0.5*.h) in > > > > let k4 = h *. cos (t +. h) in > > > > (k1+.k4)/.6.0 +. k2*.1.5 > > > > > > let compute steps = > > > > let rec loop n y t = > > > > if n < steps > > > > then loop (n+1) (y +. rk4_step t) (t +. h) > > > > else y in > > > > loop 1 1.0 0.0 > > > > > > let () = > > > > let y = compute 102 in > > > > let err = abs_float (y -. (exact ((float_of_int 102) *. h))) in > > > > let large = 50000000 in > > > > let y = compute large in > > > > Printf.printf "%b\n" > > > > (abs_float (y -. (exact (float_of_int large) *. h)) < 2. *. err) > > > > > > > > This program has the same performance as the C one... unless I > pass really > > aggressive optimization options > > to the C compiler, that will emit a platform specific code, e.g., > > > > gcc rk.c -lm -O3 -march=corei7-avx -o rksse > > > > > > These options basically double the performance of the C version, > leaving > > OCaml lagging behind. That is because, > > OCaml, obviously, cannot follow the latest developments of intel CPU, > > especially in the field of SSE. 
> > > > The fun part is that when I've tried to compile the same file with > clang, > > the resulting program was even slower > > than the original non-optimized OCaml. But this is all micro > benchmarking of > > course, so don't jump to fast conclusions > > (although I like to think that OCaml is faster than Clang :)) > > > > > > As a final remark, my experience in HPC shows that in general you > should not > > really rely on compiler optimizations and hope > > that the compiler will do the magic for you. Even the GCC > compiler. It would > > be very easy to accidentally amend the above program > > in a such way, that the optimizations will no longer fire in. Of > course, > > writing in assembly is also not a choice. If you really need > > to optimize, then you should find out the performance bottleneck > and then > > optimize it manually until you get an expected performance. > > Alternatively, you can use plain old Fortran to get the reliable > > performance. And then call it from C or OCaml. > > > > > > Best wishes, > > Ivan > > > > > > > > On Mon, Dec 19, 2016 at 6:51 AM, Gerd Stolpmann > <info@gerd-stolpmann.de <mailto:info@gerd-stolpmann.de>> > > wrote: > >> > >> Hi Christoph, > >> > >> the extra code looks very much like an allocation on the minor heap: > >> > >> sub $0x10,%r15 > >> lea 0x25c7b6(%rip),%rax > >> cmp (%rax),%r15 > >> jb 404a8a <dlerror@plt+0x2d0a> > >> lea 0x8(%r15),%rax > >> movq $0x4fd,-0x8(%rax) > >> > >> r15 points to the used area of the minor heap - by decrementing > it you > >> get an additional block of memory. It is compared against the > beginning > >> of the heap to check whether GC is needed. The constant 0x4fd is the > >> header of the new block (which must be always initialized). > >> > >> From the source code, it remains unclear for what this is used. > >> Obviously, the compiler runs out of registers, and moves some > values to > >> the minor heap (temporarily). When you call a C function like cos > it is > >> likely that this happens because the C calling conventions do not > >> preserve the FP registers (xmm*). This could be improved if the OCaml > >> compiler tried alternate places for temporarily storing FP values: > >> > >> - int registers (which is perfectly possible on 64 bit platforms). > >> A number of int registers survive C calls. > >> - stack > >> > >> To my knowledge, the OCaml compiler never tries this (but this > could be > >> out of date). This is a fairly specific optimization that makes > mostly > >> sense for purely iterating or aggregating functions like yours > that do > >> not store FP values away. > >> > >> Gerd > >> > >> Am Samstag, den 17.12.2016, 14:02 +0100 schrieb Christoph Höger: > >> > Ups. Forgot the actual examples. > >> > > >> > Am 17.12.2016 um 14:01 schrieb Christoph Höger: > >> > > > >> > > Dear all, > >> > > > >> > > find attached two simple runge-kutta iteration schemes. One is > >> > > written > >> > > in C, the other in OCaml. I compared the runtime of both and > gcc (- > >> > > O2) > >> > > produces an executable that is roughly 30% faster (to be more > >> > > precise: > >> > > 3.52s vs. 2.63s). That is in itself quite pleasing, I think. I do > >> > > not > >> > > understand however, what causes this difference. Admittedly, the > >> > > generated assembly looks completely different, but both compilers > >> > > inline > >> > > all functions and generate one big loop. Ocaml generates a > lot more > >> > > scaffolding, but that is to be expected. 
> >> > > > >> > > There is however an interesting particularity: OCaml generates 6 > >> > > calls > >> > > to cos, while gcc only needs 3 (and one direct jump). > Surprisingly, > >> > > there are also calls to cosh, acos and pretty much any other > >> > > trigonometric function (initialization of constants, maybe?) > >> > > > >> > > However, the true culprit seems to be an excess of instructions > >> > > between > >> > > the different calls to cos. This is what happens between the > first > >> > > two > >> > > calls to cos: > >> > > > >> > > gcc: > >> > > jmpq 400530 <cos@plt> > >> > > nop > >> > > nopw %cs:0x0(%rax,%rax,1) > >> > > > >> > > sub $0x38,%rsp > >> > > movsd %xmm0,0x10(%rsp) > >> > > movapd %xmm1,%xmm0 > >> > > movsd %xmm2,0x18(%rsp) > >> > > movsd %xmm1,0x8(%rsp) > >> > > callq 400530 <cos@plt> > >> > > > >> > > ocamlopt: > >> > > > >> > > callq 401a60 <cos@plt> > >> > > mulsd (%r12),%xmm0 > >> > > movsd %xmm0,0x10(%rsp) > >> > > sub $0x10,%r15 > >> > > lea 0x25c7b6(%rip),%rax > >> > > cmp (%rax),%r15 > >> > > jb 404a8a <dlerror@plt+0x2d0a> > >> > > lea 0x8(%r15),%rax > >> > > movq $0x4fd,-0x8(%rax) > >> > > > >> > > movsd 0x32319(%rip),%xmm1 > >> > > > >> > > movapd %xmm1,%xmm2 > >> > > mulsd %xmm0,%xmm2 > >> > > addsd 0x0(%r13),%xmm2 > >> > > movsd %xmm2,(%rax) > >> > > movapd %xmm1,%xmm0 > >> > > mulsd (%r12),%xmm0 > >> > > addsd (%rbx),%xmm0 > >> > > callq 401a60 <cos@plt> > >> > > > >> > > > >> > > Is this caused by some underlying difference in the > representation > >> > > of > >> > > numeric values (i.e. tagged ints) or is it reasonable to attack > >> > > this > >> > > issue as a hobby experiment? > >> > > > >> > > > >> > > thanks for any advice, > >> > > > >> > > Christoph > >> > > > >> > > >> -- > >> ------------------------------------------------------------ > >> Gerd Stolpmann, Darmstadt, Germany gerd@gerd-stolpmann.de > <mailto:gerd@gerd-stolpmann.de> > >> My OCaml site: http://www.camlcity.org > >> Contact details: http://www.camlcity.org/contact.html > <http://www.camlcity.org/contact.html> > >> Company homepage: http://www.gerd-stolpmann.de > >> ------------------------------------------------------------ > >> > >> > > > > -- Christoph Höger Technische Universität Berlin Fakultät IV - Elektrotechnik und Informatik Übersetzerbau und Programmiersprachen Sekr. TEL12-2, Ernst-Reuter-Platz 7, 10587 Berlin Tel.: +49 (30) 314-24890 E-Mail: christoph.hoeger@tu-berlin.de [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
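To make the point about the standard interface concrete, this is roughly the shape such a reusable integrator takes: the derivative f is a parameter, so dropping its unused y argument and hoisting the loop invariants is only possible once the compiler specialises the integrator for the particular f it is applied to. A sketch that generalises the attached rk4.ml, for illustration only:

let rk4_step f y t h =
  let k1 = h *. f t y in
  let k2 = h *. f (t +. 0.5 *. h) (y +. 0.5 *. k1) in
  let k3 = h *. f (t +. 0.5 *. h) (y +. 0.5 *. k2) in
  let k4 = h *. f (t +. h) (y +. k3) in
  y +. (k1 +. k4) /. 6.0 +. (k2 +. k3) /. 3.0

let integrate f steps h y0 t0 =
  let rec go n y t =
    if n < steps then go (n + 1) (rk4_step f y t h) (t +. h) else y
  in
  go 1 y0 t0

(* the derivative ignores y, but a generic compiler cannot assume that *)
let () = Printf.printf "%f\n" (integrate (fun t _y -> cos t) 102 0.1 1.0 0.0)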
* Re: [Caml-list] Closing the performance gap to C 2016-12-21 9:08 ` Christoph Höger @ 2016-12-23 12:18 ` Oleg 0 siblings, 0 replies; 25+ messages in thread From: Oleg @ 2016-12-23 12:18 UTC (permalink / raw) To: christoph.hoeger; +Cc: ivg, yotambarnoy, caml-list > @Ivan: Your point about the optimization is well taken, but please > consider that the runge-kutta integrator is a sophisticated, general > numerical method. Usually, the user of such a method has neither the > knowledge nor the time to manually adapt it, but rather implements a > standard interface. Therefore, the unused arguments and > algebraic/trigonometric identities have to be resolved by the compiler. > Otherwise, we could as well compute the exact solution, anyways ;). I would like to concur with Ivan and to strongly emphasize his point. This discussion about compiler optimizations has been done before -- decades before -- in the HPC community. We know the conclusion: the era of compiler optimizations is over. All further gains in HPC computing can only be expected from *domain-specific* optimizations, specific to a particular domain. Because the most profitable optimizations are domain-specific, one cannot, should not and must not count on a *general-purpose* compiler to do them. Compiler writers should think of optimizations that are universally profitable, rather than ones applicable only in specific (and not entirely clear-cut) circumstances. The compiler cannot apply algebraic/trigonometric identities! First of all, there are too many of them, and it is not clear in which direction to apply them. Second, most of the identities are not generally valid in the presence of NaNs etc. (for example, 0*x = 0 is not valid for FP computations). A lot has been written about it. This list is not the best place to discuss all this work. Let me give three references: -- The recent talk by Paul Kelly http://www.doc.ic.ac.uk/~phjk/Presentations/2015-09-LCPC-Keynote-PaulKelly-V03-ForDistribution.pdf plus many more publications from his group http://www.doc.ic.ac.uk/~phjk/phjk-Publications.html He talks about finite element methods, far more sophisticated than Runge-Kutta. -- A very old paper with lots of references to the debate about compiler optimizations: http://www-rocq.inria.fr/~acohen/publications/CDGHKP06.ps.gz -- The "Stream Fusion, to Completeness" paper mentioned on this list a couple of months ago. The paper shows yet another example that domain-specific optimizations really work, generating code as if it were written by hand. The paper takes great pains to emphasize why the optimizations cannot be carried out by a general-purpose compiler. Please be sure to read the discussion section near the end, with some notable quotes about the GHC optimizer. ^ permalink raw reply [flat|nested] 25+ messages in thread
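A concrete instance of the point about floating-point identities: rewriting 0 * x to 0 is unsound for floats because of NaN and infinity, as a one-liner shows (both products below evaluate to nan under IEEE 754):

let () = Printf.printf "%f %f\n" (0. *. nan) (0. *. infinity)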
end of thread, other threads:[~2016-12-23 12:18 UTC | newest] Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-12-17 13:01 [Caml-list] Closing the performance gap to C Christoph Höger 2016-12-17 13:02 ` Christoph Höger 2016-12-19 10:58 ` Soegtrop, Michael 2016-12-19 11:51 ` Gerd Stolpmann 2016-12-19 14:52 ` Soegtrop, Michael 2016-12-19 16:41 ` Gerd Stolpmann 2016-12-19 17:09 ` Frédéric Bour 2016-12-19 17:19 ` Yotam Barnoy 2016-12-21 11:25 ` Alain Frisch 2016-12-21 14:45 ` Yotam Barnoy 2016-12-21 16:06 ` Alain Frisch 2016-12-21 16:31 ` Gerd Stolpmann 2016-12-21 16:39 ` Yotam Barnoy 2016-12-21 16:47 ` Gabriel Scherer 2016-12-21 16:51 ` Yotam Barnoy 2016-12-21 16:56 ` Mark Shinwell 2016-12-21 17:43 ` Alain Frisch 2016-12-22 8:39 ` Mark Shinwell 2016-12-22 17:23 ` Pierre Chambart 2016-12-21 17:35 ` Alain Frisch 2016-12-19 15:48 ` Ivan Gotovchits 2016-12-19 16:44 ` Yotam Barnoy 2016-12-19 16:59 ` Ivan Gotovchits 2016-12-21 9:08 ` Christoph Höger 2016-12-23 12:18 ` Oleg