* [Caml-list] Ocaml optimizer pitfalls & work-arounds
@ 2017-01-19  6:51 Mark Hayden
  2017-01-19 11:20 ` Nils Becker
  ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Mark Hayden @ 2017-01-19 6:51 UTC (permalink / raw)
To: caml-list

We recently upgraded our Ocaml toolchain from 4.02.3 to Ocaml 4.04.0.
We were looking forward to a performance boost from the optimization
improvements, especially from flambda.  While we generally were able
to achieve significant performance improvement, we were somewhat
surprised by the effort required to avoid certain pitfalls in Ocaml.

This note describes some issues we ran into.  We filed several
reports on Ocaml Mantis regarding our findings.  However it appears
the underlying issues we ran into are unlikely to change:

  Your three reports (0007440, 0007441, 0007442) are manifestations
  of the same fact: the OCaml compiler performs type-based
  optimizations first, then erases types, then performs all the other
  optimizations.  This is very unlikely to change in the near future,
  as it would require a total rewrite of the compiler.

  [X Leroy, https://caml.inria.fr/mantis/view.php?id=7440]

I encourage readers to review the problem reports we submitted, which
include more concrete examples.  I'm posting this note in case there
are others running into similar performance issues with their Ocaml
software who might find it helpful in working around those issues.
I'm not aware of these issues being documented elsewhere, and there
appears to be little prospect of them being addressed in the compiler
in the foreseeable future.  Please chime in if any of this is
inaccurate or there is something I missed.

As an initial example, consider the following Ocaml code to find the
maximum floating point value in an array (that is at least 0.0):

  [Array.fold_left max 0.0 arr]

Now compile this with the latest compiler and maximum optimization.
Because of how the Ocaml optimizer works, this will run about 10-15x
slower (and allocate 2-3 words per array element) than a more
carefully written version that uses specialized operations and avoids
allocation.  See below for one way to achieve this (while still using
a functional-programming style).

  (* Same as Array.fold_left, but with type casting.
   *)
  let [@inline] array_fold_leftf f (x:float) (a:float array) =
    let r = ref x in
    for i = 0 to Array.length a - 1 do
      r := f !r (Array.unsafe_get a i)
    done;
    !r
  ;;

  let [@inline] float_max (v0:float) v1 =
    if v0 > v1 then v0 else v1
  ;;

  let array_float_max a =
    array_fold_leftf float_max 0.0 a
  ;;

The assembly for the "inner loop" of the two examples is shown below.
They were compiled with Ocaml 4.05.dev+flambda, "-O3 -unbox-closures",
MacOS 12.2, AMD64.

Unoptimized example.  Note the test/branch for the array tag, and the
allocation for boxing (we did not include the calls to trigger a minor
gc).  There is a call to the Ocaml runtime for polymorphic
greater-equal.  This is probably not what one would expect from an
optimizing/inlining compiler for a simple case such as this.  Note
that to create this we used our own definition of Array.fold_left
which had an "[@inline]" annotation.
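A minimal sketch of what such a redefinition might look like (a
reconstruction for illustration, not necessarily the exact code used
here): a generic fold_left marked [@inline], applied to the
polymorphic [max].

  let [@inline] fold_left f x a =
    (* Generic over 'a, so neither the array access nor the comparison
       passed in as [f] can be specialized, even after inlining. *)
    let r = ref x in
    for i = 0 to Array.length a - 1 do
      r := f !r (Array.unsafe_get a i)
    done;
    !r
  ;;

  (* Produces the unspecialized inner loop shown below. *)
  let array_float_max_slow arr = fold_left max 0.0 arr ;;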
L215:
        movq    (%rsp), %rbx
        .loc    1 38 14
        movzbq  -8(%rbx), %rax
        cmpq    $254, %rax
        je      L219
        .loc    1 38 14
        movq    -4(%rbx,%rdx,4), %rsi
        movq    %rsi, 24(%rsp)
        jmp     L218
        .align  2
L219:
        .loc    1 38 14
L221:
        subq    $16, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      L222
        leaq    8(%r15), %rsi
        movq    $1277, -8(%rsi)
        .loc    1 38 14
        movsd   -4(%rbx,%rdx,4), %xmm0
        movsd   %xmm0, (%rsi)
        movq    %rsi, 24(%rsp)
L218:
        movq    %rdi, 32(%rsp)
        .file   5 "pervasives.ml"
        .loc    5 65 17
        movq    _caml_greaterequal@GOTPCREL(%rip), %rax
        call    _caml_c_call
L213:
        movq    _caml_young_ptr@GOTPCREL(%rip), %r11
        movq    (%r11), %r15
        cmpq    $1, %rax
        je      L217
        movq    32(%rsp), %rdi
        jmp     L216
        .align  2
L217:
        movq    24(%rsp), %rdi
L216:
        movq    8(%rsp), %rdx
        movq    %rdx, %rax
        addq    $2, %rdx
        movq    %rdx, 8(%rsp)
        movq    16(%rsp), %rbx
        cmpq    %rbx, %rax
        jne     L215

The assembly for the more carefully written case is below.  No
allocation.  No call to external C code.  No test/branch for the array
tag.  This matches what I think most people would like to see.  It is
compact enough that (maybe) it would benefit from unrolling:

l225:
        .loc    1 46 14
        movsd   -4(%rax,%rdi,4), %xmm1
        comisd  %xmm1, %xmm0
        jbe     l227
        jmp     l226
        .align  2
l227:
        movapd  %xmm1, %xmm0
l226:
        movq    %rdi, %rsi
        addq    $2, %rdi
        cmpq    %rbx, %rsi
        jne     l225

The two main lessons we learned were:

* Polymorphic primitives ([Array.get], [compare], [>=], [min]) are
  only specialized if they appear in a context where the types can be
  determined at their exact call site; otherwise a polymorphic
  version is used.  If the use of the primitive is later inlined in a
  context where the type is no longer polymorphic, the function will
  _not_ be specialized by the compiler.

* Use of abstract data types prevents specialization.  In particular,
  defining an abstract data type in a module ("type t ;;") will
  prevent specialization (even after inlining) for any polymorphic
  primitives (e.g., "caml_equal") used with that type.  For instance,
  if the underlying type for [t] is actually [int], other modules
  will still use polymorphic equality instead of a single machine
  instruction.  You can prevent this behavior with the "private"
  keyword in order to export the type information, "type t = private
  int".  Alternatively, the module can include its own specialized
  operations and other modules can be careful to use them.

It bears emphasizing that the issues described in this note apply
even when all of the code is "fully inlined" and uses the highest
level of optimization.  Specialization in the Ocaml compiler occurs
in a stage prior to inlining.  If it hasn't happened before inlining,
it won't happen afterwards.

What kind of effect does lack of specialization have on performance?
Calling the "caml_compare" Ocaml C runtime function to compare
integers can be 10-20x slower than using a single integer comparison
machine instruction.  Ditto for floating point values.  The
unspecialized [Array.get], [Array.set], and (on 32-bit)
[Array.length] have to check the tag on the array to determine if the
array uses the unboxed floating-point representation (I wish Ocaml
didn't use this!).  For instance, the polymorphic [Array.get] checks
the tag on the array and (for floating point arrays) reads the value
and boxes the floating point value (i.e., allocates 2-3 words on the
heap).  Note that when iterating over an array, the check on the tag
is repeated in _each_ loop iteration.

Other impacts of using non-specialized functions:

* Use of polymorphic primitives means floating point values have to
  be boxed, requiring heap allocation.
  Through use of specialized primitives, in many cases floats can
  remain unboxed.

* All the extra native code from using polymorphic primitives
  (checking array tags, conditionally boxing floats, calling out to
  the Ocaml C runtime) can have follow-on effects for further
  inlining.  In other words, when native code can be kept compact,
  more code can be inlined and/or loops unrolled, and this can in
  turn allow further optimization.

Some suggestions others may find helpful:

* Consider using the "private" keyword for any abstract types in your
  modules.  We added over 50 of these to our code base.  It is an
  ugly but effective work-around.

* The min/max functions in the standard library's Pervasives are
  particularly problematic.  They are polymorphic, so their
  comparison will never be specialized.  It can be helpful to define
  specialized functions such as:

    let [@inline] float_max (v0:float) (v1:float) =
      if v0 > v1 then v0 else v1
    ;;

    let [@inline] int_max (v0:int) (v1:int) =
      if v0 > v1 then v0 else v1
    ;;

  These will be compiled to native comparison instructions and
  unboxed values.

* Any use of polymorphism can negatively affect performance.  Be
  careful about inadvertently introducing polymorphism into your
  program, such as this helper function:

    let [@inline] getter v ofs = Array.get v ofs ;;

  This will result in an unspecialized version of [Array.get] being
  inlined at all call sites.  Note that declaring the function at a
  non-polymorphic type in your .mli file will still _not_ affect how
  [getter] is compiled:

    val getter : int array -> int -> int  (* does not affect optimization *)

  You must cast the type in the implementation in order for
  [Array.get] to be specialized:

    let [@inline] getter (v:int array) ofs = Array.get v ofs ;;

* All the iterators in the Array module (e.g., [Array.iter]) in the
  standard library are polymorphic, so they will use unspecialized
  accessors and be affected by the issues described here.  Using the
  following to sum an array of floats may seem elegant:

    Array.fold_left (+.) 0.0 arr

  However, the resulting code is much slower (and allocates 2 floats
  per array entry, i.e., 4-6 words) than a "specialized" version.
  Note that even if [Array.fold_left] and the surrounding code were
  "fully inlined," it is still a polymorphic function, so the
  performance penalty for checking the array tag and boxing the float
  is still present.  See also the earlier example.

* It can be helpful to review the compiled assembly code (using the
  "-S" option to ocamlopt) and look for tell-tale signs of lack of
  specialization, such as calls to [_caml_greaterequal] or allocation
  ([caml_call_gc]) in cases where they are not expected.  You can
  refer to the assembly code generated for the original
  implementation, knowing that the code will not become any more
  specialized when it is inlined.

As I said, by examining our hot spots and following the suggestions
above, we found the resulting native code could be comparable to what
we would expect from C.  It is unfortunate these issues are
(apparently) designed into the Ocaml compiler architecture, because
this would otherwise seem a natural area of improvement for the
compiler.  I would have thought a statically typed language such as
Ocaml would (through its type checker) be well-suited to the simple
kinds of function specialization described in this note.

^ permalink raw reply [flat|nested] 14+ messages in thread
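To make the "private" work-around above concrete, here is a minimal
sketch (the module and function names are made up for illustration):

  module Id : sig
    (* Exporting the representation with "private" lets client code
       see that Id.t is an int, so comparisons on it can be
       specialized; with a fully abstract "type t" they would go
       through the generic caml_equal instead. *)
    type t = private int
    val of_int : int -> t
    val equal : t -> t -> bool   (* specialized comparison, exported too *)
  end = struct
    type t = int
    let of_int n = n
    let equal (a : int) b = a = b
  end

  (* In client code this compiles to a single integer comparison. *)
  let same (a : Id.t) (b : Id.t) = a = b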
* Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds
  2017-01-19  6:51 [Caml-list] Ocaml optimizer pitfalls & work-arounds Mark Hayden
@ 2017-01-19 11:20 ` Nils Becker
  2017-01-19 11:39 ` Gabriel Scherer
  2017-01-19 14:35 ` Alain Frisch
  2017-01-19 13:41 ` Yaron Minsky
  2017-01-21 14:39 ` [Caml-list] <DKIM> " Pierre Chambart
  2 siblings, 2 replies; 14+ messages in thread
From: Nils Becker @ 2017-01-19 11:20 UTC (permalink / raw)
To: caml-list

a while ago there was a proposal for deprecating the specialized
treatment of float arrays, which was not accepted iirc. it seems that
the slowdown of all Array.get calls is quite a high price to pay. what
do we gain from the float array specialization? (honest question - i
don't have a good understanding)

the min, max issues can be avoided without much trouble by using
specialized comparison, as provided in standard library extensions.

n.

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds
  2017-01-19 11:20 ` Nils Becker
@ 2017-01-19 11:39 ` Gabriel Scherer
  2017-01-19 13:26 ` Frédéric Bour
  2017-01-19 14:35 ` Alain Frisch
  1 sibling, 1 reply; 14+ messages in thread
From: Gabriel Scherer @ 2017-01-19 11:39 UTC (permalink / raw)
To: Nils Becker; +Cc: caml-list

> the min, max issues can be avoided without much trouble by using
> specialized comparison, as provided in standard library extensions.

Note that min/max are already optimized from generic to specialized
when the type information is known at type-checking time.  (It used to
be the case that only fully applied calls were optimized; this was
improved by Frédéric Bour in 4.03 to extend to non-applied primitives,
e.g. "let eq : int -> int -> _ = (=)".)  That does not work when those
functions are used inside a functor body at an abstract type (which is
when we want inlining and specialization to interact better), but
neither do Float.equal or Int.compare work there.

On Thu, Jan 19, 2017 at 12:20 PM, Nils Becker
<nils.becker@bioquant.uni-heidelberg.de> wrote:
> [...]

^ permalink raw reply [flat|nested] 14+ messages in thread
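For concreteness, a small sketch of the behaviour described above
(assuming 4.03 or later; the names are illustrative):

  (* Fully applied with known types: specialized at type-checking time. *)
  let is_zero (n : int) = n = 0

  (* Non-applied primitive constrained by an annotation: also
     specialized (the 4.03 improvement mentioned above). *)
  let int_eq : int -> int -> bool = (=)

  (* No type information: stays polymorphic and calls the generic
     caml_equal at run time. *)
  let generic_eq x y = x = y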
* Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds
  2017-01-19 11:39 ` Gabriel Scherer
@ 2017-01-19 13:26 ` Frédéric Bour
  0 siblings, 0 replies; 14+ messages in thread
From: Frédéric Bour @ 2017-01-19 13:26 UTC (permalink / raw)
To: Gabriel Scherer; +Cc: Nils Becker, caml-list

On 19 Jan 2017, at 12:39, Gabriel Scherer <gabriel.scherer@gmail.com> wrote:
>
>> the min, max issues can be avoided without much trouble by using
>> specialized comparison, as provided in standard library extensions.
>
> Note that min/max are already optimized from generic to specialized
> when the type information is known at type-checking time.  (It used to
> be the case that only fully applied calls were optimized; this was
> improved by Frédéric Bour in 4.03 to extend to non-applied primitives,
> e.g. "let eq : int -> int -> _ = (=)".)  That does not work when those
> functions are used inside a functor body at an abstract type (which is
> when we want inlining and specialization to interact better), but
> neither do Float.equal or Int.compare work there.

Alas, min/max cannot benefit from this optimisation.  In Pervasives,
they are defined as:

  let min x y = if x <= y then x else y
  let max x y = if x >= y then x else y

Specialization only happens for primitives; here these are plain
definitions.  So there is no specialization unless the definitions are
copied and annotated :(.

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds
  2017-01-19 11:20 ` Nils Becker
  2017-01-19 11:39 ` Gabriel Scherer
@ 2017-01-19 14:35 ` Alain Frisch
  2017-01-19 15:35 ` Ivan Gotovchits
  2017-01-19 15:41 ` Gerd Stolpmann
  1 sibling, 2 replies; 14+ messages in thread
From: Alain Frisch @ 2017-01-19 14:35 UTC (permalink / raw)
To: Nils Becker, caml-list

On 19/01/2017 12:20, Nils Becker wrote:
> a while ago there was a proposal for deprecating the specialized
> treatment of float arrays, which was not accepted iirc.

For future reference, here are some relevant links:

  https://www.lexifi.com/blog/about-unboxed-float-arrays
  https://github.com/ocaml/ocaml/pull/163
  https://github.com/ocaml/ocaml/pull/616

> it seems that the slowdown of all Array.get calls is quite a high
> price to pay. what do we gain from the float array specialization?
> (honest question - i don't have a good understanding)

As far as I understand, the main argument for keeping the current
special hack is that it helps beginners writing naive numerical code
get something not horribly inefficient, while most proponents of
removing the hack are advanced users who feel that the current
situation gets in the way of very efficient code.

Providing a FloatArray module would address part of the concerns of
these advanced users.  At least it would allow better optimization of
numerical code, and pave the way for a future deprecation/removal of
the special hack.  It would not address the other bad consequences of
the special hack: (i) the cost to the performance of generic array
accesses would remain, and (ii) so would the complexity and corner
cases in the compiler and runtime system.

(Another argument in favor of the status quo is that writing
"FloatArray.get a i" is syntactically heavier than "a.(i)".
Interestingly, I don't feel that the ability to write polymorphic code
on arrays and apply it to (unboxed) float arrays is considered as a
very important property.)

-- Alain

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds
  2017-01-19 14:35 ` Alain Frisch
@ 2017-01-19 15:35 ` Ivan Gotovchits
  2017-01-19 17:02 ` Hezekiah M. Carty
  2017-01-19 15:41 ` Gerd Stolpmann
  1 sibling, 1 reply; 14+ messages in thread
From: Ivan Gotovchits @ 2017-01-19 15:35 UTC (permalink / raw)
To: Alain Frisch; +Cc: Nils Becker, caml-list

On Thu, Jan 19, 2017 at 9:35 AM, Alain Frisch <alain.frisch@lexifi.com> wrote:
>
> (Another argument in favor of the status quo is that writing
> "FloatArray.get a i" is syntactically heavier than "a.(i)".
> Interestingly, I don't feel that the ability to write polymorphic code
> on arrays and apply it to (unboxed) float arrays is considered as a
> very important property.)

This is not a big issue, as you can write:

  module Array = FloatArray
  a.(i)

and it resolves to a call to the FloatArray.get function, since x.(i)
is syntactic sugar for `Array.get x i`.  The same is true for
bigarrays.  This is kind of a non-documented feature, though.

> -- Alain

^ permalink raw reply [flat|nested] 14+ messages in thread
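A minimal sketch of that trick, under the assumption (stated above)
that a.(i) desugars to Array.get a i resolved against whichever module
named Array is in scope; Float_array here is a hypothetical stand-in
for the proposed FloatArray module:

  module Float_array = struct
    type t = float array
    (* The type annotations let the primitives be specialized to
       float arrays, as discussed earlier in the thread. *)
    let get : t -> int -> float = Array.get
    let length : t -> int = Array.length
  end

  (* Shadow Array so that a.(i) below resolves to Float_array.get a i. *)
  module Array = Float_array

  let sum (a : Float_array.t) =
    let s = ref 0.0 in
    for i = 0 to Array.length a - 1 do
      s := !s +. a.(i)
    done;
    !s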
* Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds
  2017-01-19 15:35 ` Ivan Gotovchits
@ 2017-01-19 17:02 ` Hezekiah M. Carty
  0 siblings, 0 replies; 14+ messages in thread
From: Hezekiah M. Carty @ 2017-01-19 17:02 UTC (permalink / raw)
To: Ivan Gotovchits, Alain Frisch; +Cc: Nils Becker, caml-list

On Thu, Jan 19, 2017 at 8:35 AM Ivan Gotovchits <ivg@ieee.org> wrote:
> [...]
>
> This is not a big issue, as you can write:
>
>   module Array = FloatArray
>   a.(i)
>
> and it resolves to a call to the FloatArray.get function, since x.(i)
> is syntactic sugar for `Array.get x i`.  The same is true for
> bigarrays.  This is kind of a non-documented feature, though.

If we get https://github.com/ocaml/ocaml/pull/622 or
https://github.com/ocaml/ocaml/pull/616 then it's even simpler since
FloatArray could have or define its own specialized element access.

Hez

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds
  2017-01-19 14:35 ` Alain Frisch
  2017-01-19 15:35 ` Ivan Gotovchits
@ 2017-01-19 15:41 ` Gerd Stolpmann
  1 sibling, 0 replies; 14+ messages in thread
From: Gerd Stolpmann @ 2017-01-19 15:41 UTC (permalink / raw)
To: Alain Frisch, Nils Becker, caml-list

On Thursday, 2017-01-19 at 15:35 +0100, Alain Frisch wrote:
> (Another argument in favor of the status quo is that writing
> "FloatArray.get a i" is syntactically heavier than "a.(i)".
> Interestingly, I don't feel that the ability to write polymorphic
> code on arrays and apply it to (unboxed) float arrays is considered
> as a very important property.)

Any chance that we get implicits?

Gerd

> -- Alain

--
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
My OCaml site:          http://www.camlcity.org
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds
  2017-01-19  6:51 [Caml-list] Ocaml optimizer pitfalls & work-arounds Mark Hayden
  2017-01-19 11:20 ` Nils Becker
@ 2017-01-19 13:41 ` Yaron Minsky
  2017-01-19 17:59 ` Mark Hayden
  2017-01-21 14:39 ` [Caml-list] <DKIM> " Pierre Chambart
  2 siblings, 1 reply; 14+ messages in thread
From: Yaron Minsky @ 2017-01-19 13:41 UTC (permalink / raw)
To: Mark Hayden; +Cc: caml-list

It seems like your primary issues are around lack of specialization
for two features:

- unboxing in float arrays
- optimization of ad-hoc operations (e.g., polymorphic compare)

My view on this is that it's best not to rely on float array
specialization at all, and I think the best improvement we can make to
OCaml is to remove the ad-hoc specialization of float arrays, and
instead add a separate, specialized (and unboxed) type for arrays of
floats, similar to the Bytes.t type, which is effectively a
specialized byte array.

As you've observed, specialization is brittle, and it's best not to
rely on it too much.  Beyond that, the existence of float arrays
complicates the runtime quite a bit and makes other bugs more likely.

There isn't yet consensus that specialization of float arrays should
be removed, but I'm still hopeful that we'll get there.  It's probably
Jane Street's highest-priority ask for the compiler.

We also avoid use of polymorphic compare and other ad-hoc operations,
preferring to use type-specialized comparators everywhere.  This is
better for semantic as well as performance reasons, since we've seen a
lot of subtle bugs from polymorphic compare doing the wrong thing on
specific types.

It's hard to deny that using type-specialized comparators is more
verbose than polymorphic compare, but hopefully modular implicits will
make this problem go away, and we can get the best of both worlds.
And we hope that Flambda will be up to the job of inlining away the
overhead of the more indirect calling conventions imposed by modular
implicits.

I think that with the above changes, we can probably get pretty far
towards the goal of being able to write OCaml code that is both highly
performant and pretty.  Being able to delay specialization until later
in the compilation pipeline would help more, but I believe we can do
pretty well without it.

y

On Thu, Jan 19, 2017 at 1:51 AM, Mark Hayden <markghayden@yahoo.com> wrote:
> [...]

^ permalink raw reply [flat|nested] 14+ messages in thread
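As an aside, the "type-specialized comparators" style mentioned above
can be approximated today with ordinary annotated functions; the names
below are illustrative and are not Base's actual interface:

  let [@inline] int_equal   (a : int) (b : int) = a = b
  let [@inline] int_compare (a : int) (b : int) =
    if a < b then -1 else if a > b then 1 else 0

  (* Call sites read much as before, but compile to inline integer
     comparisons rather than calls into the C runtime: *)
  let index_of_max (arr : int array) =
    let best = ref 0 in
    for i = 1 to Array.length arr - 1 do
      if int_compare arr.(i) arr.(!best) > 0 then best := i
    done;
    !best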
* Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds
  2017-01-19 13:41 ` Yaron Minsky
@ 2017-01-19 17:59 ` Mark Hayden
  2017-01-19 22:30 ` Yaron Minsky
  0 siblings, 1 reply; 14+ messages in thread
From: Mark Hayden @ 2017-01-19 17:59 UTC (permalink / raw)
To: Yaron Minsky; +Cc: caml-list

I agree that removal of the current treatment of float array would
avoid some of the biggest issues.  However, our issue is not just with
how float arrays are handled.  Comparisons for integer/char/bool types
should always be specialized after inlining.

As things stand (and apparently this is unlikely to change), Ocaml
programmers who want the best performance need to learn to develop
their software so as to carefully tailor the use of polymorphism,
abstraction, min/max, etc., at least in the parts where performance is
important.  This is too bad... the type checker infers the types, but
(according to X Leroy) the type information is no longer available by
the time inlining occurs.

Below is an example taking the maximum (at least 0) of an array of
integers.  The natural way to implement this is:

  let array_max arr =
    Array.fold_left max 0 arr
  ;;

The inner loop compiles to the code below (again, with Array.fold_left
redefined to be annotated with "[@inline]").  Because of how the Ocaml
compiler is architected, the array operations and comparison
operations will not be specialized, even though the code is inlined
and the array type has been inferred to be [int array].  Removing the
special "float array" treatment from Ocaml would at least eliminate
the "dead code" for checking and boxing floats, but the call to
[_caml_greaterequal] would still be used instead of a single
comparison instruction.

L258:
        movq    (%rsp), %rbx
        .loc    1 38 14
        movzbq  -8(%rbx), %rax
        cmpq    $254, %rax
        je      L262
        .loc    1 38 14
        movq    -4(%rbx,%rdx,4), %rsi
        movq    %rsi, 24(%rsp)
        jmp     L261
        .align  2
L262:
        .loc    1 38 14
L264:
        subq    $16, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      L265
        leaq    8(%r15), %rsi
        movq    $1277, -8(%rsi)
        .loc    1 38 14
        movsd   -4(%rbx,%rdx,4), %xmm0
        movsd   %xmm0, (%rsi)
        movq    %rsi, 24(%rsp)
L261:
        movq    %rdi, 32(%rsp)
        .loc    5 65 17
        movq    _caml_greaterequal@GOTPCREL(%rip), %rax
        call    _caml_c_call
L256:
        movq    _caml_young_ptr@GOTPCREL(%rip), %r11
        movq    (%r11), %r15
        cmpq    $1, %rax
        je      L260
        movq    32(%rsp), %rdi
        jmp     L259
        .align  2
L260:
        movq    24(%rsp), %rdi
L259:
        movq    8(%rsp), %rdx
        movq    %rdx, %rax
        addq    $2, %rdx
        movq    %rdx, 8(%rsp)
        movq    16(%rsp), %rbx
        cmpq    %rbx, %rax
        jne     L258

If you write it like this:

  let [@inline] array_fold_left_i f (x:int) (a:int array) =
    let r = ref x in
    for i = 0 to Array.length a - 1 do
      r := f !r (Array.unsafe_get a i)
    done;
    !r
  ;;

  let [@inline] int_max (v0:int) v1 =
    if v0 > v1 then v0 else v1
  ;;

  let array_int_max a =
    array_fold_left_i int_max 0 a
  ;;

Then the inner loop will compile as follows, which is about 5-15x
faster than the code above and which I think most developers would be
pleased with.
L284:
        .loc    1 54 14
        movq    -4(%rbx,%rsi,4), %rdx
        cmpq    %rdx, %rdi
        jle     L286
        jmp     L285
        .align  2
L286:
        movq    %rdx, %rdi
L285:
        movq    %rsi, %rdx
        addq    $2, %rsi
        cmpq    %rax, %rdx
        jne     L284

> On Jan 19, 2017, at 5:41 AM, Yaron Minsky <yminsky@janestreet.com> wrote:
> [...]

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds
  2017-01-19 17:59 ` Mark Hayden
@ 2017-01-19 22:30 ` Yaron Minsky
  2017-01-22 20:06 ` Berke Durak
  0 siblings, 1 reply; 14+ messages in thread
From: Yaron Minsky @ 2017-01-19 22:30 UTC (permalink / raw)
To: Mark Hayden; +Cc: caml-list

On Thu, Jan 19, 2017 at 12:59 PM, Mark Hayden <markghayden@yahoo.com> wrote:
> I agree that removal of the current treatment of float array would
> avoid some of the biggest issues.  However, our issue is not just
> with how float arrays are handled.  Comparisons for integer/char/bool
> types should always be specialized after inlining.

I think eliminating polymorphic compare and adding modular implicits
would take care of this issue.  In that case, you could write:

  a > 3

and modular implicits could be used to pick the specialized comparison
function for this type.  This would fix both the performance problem
and the semantic problems of polymorphic compare.

Modular implicits is a big change, but unlike changes to where type
information is available in the compiler, it's one that is actually in
progress and seems like it's really going to land.

But I agree: for right now, OCaml developers who care about
performance need to be careful about exactly the issues you highlight.
For what it's worth, Base, which is our not-quite-ready-for-prime-time
alternative to the OCaml stdlib, hides polymorphic comparison
functions by default.

> [...]
>>> >>> As I said, by examining our hot spots and following the suggestions >>> above, we found the resulting native code could be comparable to what >>> we would expect from C. It is unfortunate these issues were >>> (apparently) designed into the Ocaml compiler architecture, because >>> otherwise it would have seemed this would be a natural area of >>> improvement for the compiler. I would have thought a staticly typed >>> language such as Ocaml would (through its type checker) be >>> well-suited for the simple types of function specialization described >>> in this note. >>> >>> >>> -- >>> Caml-list mailing list. Subscription management and archives: >>> https://sympa.inria.fr/sympa/arc/caml-list >>> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners >>> Bug reports: http://caml.inria.fr/bin/caml-bugs > ^ permalink raw reply [flat|nested] 14+ messages in thread
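To make the "type-specialized comparators" point concrete, here is a minimal sketch in the spirit of the int_max/float_max helpers quoted above (the names are illustrative, and the performance claims are only those already made in this thread):

(* Passing the polymorphic [compare] as a closure means every element
   comparison goes through the caml_compare runtime function. *)
let sort_slow (l : int list) = List.sort compare l

(* A comparator written against a known type compiles to direct integer
   comparisons; [@inline] gives flambda a chance to substitute it at call
   sites that are themselves specialized. *)
let [@inline] int_compare (a : int) (b : int) =
  if a < b then -1 else if a > b then 1 else 0

let sort_fast (l : int list) = List.sort int_compare l

Even when List.sort itself stays polymorphic, each call to int_compare avoids the C call into caml_compare.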
* Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds 2017-01-19 22:30 ` Yaron Minsky @ 2017-01-22 20:06 ` Berke Durak 2017-01-23 16:33 ` David McClain 0 siblings, 1 reply; 14+ messages in thread From: Berke Durak @ 2017-01-22 20:06 UTC (permalink / raw) To: caml-list [-- Attachment #1: Type: text/plain, Size: 9714 bytes --] [Re-sending to list because of size limit] Of course the same kind of issues apply to Bigarray values. A polymorphic getter: let get a i = a.{i} will have pretty bad performance, unless "a" is annotated with the Bigarray kind and layout. Yesterday I rewrote some numerical double-integration inner loop in Fortran. One of the loops looks like this: do ui = 1,n u = real(ui - 1) / real(n - 1) do vi = 1,ui v = real(vi - 1) / real(n - 1) p = p1 + u*p12 + v*p13 r = norm(s - p) if (r > epsilon) then q = qs(1)*(1 - u - v) + qs(2)*u + qs(3)*v z = z + q/r end if end do end do I was half-disappointed, half pleasantly surprised when I noticed that it only got about 3 times faster than the OCaml version (with gfortran 6.3.0 and -O3 -fcheck-bounds -ffast-math). Granted the OCaml code is similar written in a mostly imperative style with Bigarrays and references, but it's using a function passed to a domain iterator instead of having two nested identical loops. This is with 4.05+pr985+flambda. I examined the Fortran assembly code, and while it looks numerically dense, it looks like gfortran ended up calling its internal pack/unpack primitives: mulsd %xmm1, %xmm2 # v, D.4022 movsd %xmm0, 168(%rsp) # D.4022, A.12 movsd 16(%r15), %xmm0 # *s_110(D), *s_110(D) subsd 88(%rsp), %xmm0 # %sfp, D.4022 subsd 32(%rsp), %xmm0 # %sfp, D.4022 subsd %xmm2, %xmm0 # D.4022, D.4022 movsd %xmm0, 176(%rsp) # D.4022, A.12 call _gfortran_internal_pack # movsd (%rax), %xmm2 # MEM[(real(kind=8)[3] *)_117], D.4022 cmpq %r12, %rax # tmp266, D.4025 movsd 8(%rax), %xmm3 # MEM[(real(kind=8)[3] *)_117], D.4022 movq %rax, %rbx #, D.4025 mulsd %xmm2, %xmm2 # D.4022, D.4022 mulsd %xmm3, %xmm3 # D.4022, D.4022 movsd 16(%rax), %xmm0 # MEM[(real(kind=8)[3] *)_117], D.4022 movsd (%rsp), %xmm1 # %sfp, v mulsd %xmm0, %xmm0 # D.4022, D.4022 addsd %xmm3, %xmm2 # D.4022, D.4022 addsd %xmm2, %xmm0 # D.4022, D.4022 sqrtsd %xmm0, %xmm0 # D.4022, D.4022 je .L4 #, leaq 192(%rsp), %rdi #, tmp328 movq %rax, %rsi # D.4025, movsd %xmm0, 8(%rsp) # D.4022, %sfp call _gfortran_internal_unpack # I'm new at using Fortran, so maybe there are a few simple things to make the code faster. I suspect these calls are due to the vector operations, such as the call to norm(u) and the vector substraction, in spite of norm() being defined in the same Fortran module and is inlined. (Note that pack/unpack aren't that expensive.) My point in describing all this is that if some of the pitfalls described are avoided, OCaml is not bad for numerical code if you compare it to "unoptimized" Fortran code. Getting the "magical optimization" from Fortran compilers (auto-vectorization, or auto-parallelization as provided by PGI Fortran) is neither automatic nor easy. Now a lot of scientists are stuck with Matlab, and the newer generation tends to use Python with Numpy. Assuming the required matrix/vector operations are available in OCaml libraries Lacaml, Fftw, etc. we OCaml programmers will find Python + Numpy to be inferior to OCaml because Python is an inferior language. The benefit of Python is that it's easier to "just get started" since "types won't get in the way", but then it's not any better than Matlab (besides the price tag). 
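For reference, a sketch of the annotated Bigarray getter mentioned at the top of the previous message, assuming (for illustration only) a float64 array with C layout:

open Bigarray

(* Unannotated: the element kind and layout are unknown, so a.{i} must
   dispatch at runtime and the result is a boxed float. *)
let get_poly a i = a.{i}

(* With the kind and layout in the type, the access can be compiled to a
   direct load of an unboxed double (when inlined into a specialized
   context). *)
let [@inline] get (a : (float, float64_elt, c_layout) Array1.t) i = a.{i}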
And yes, Python is a better language than Matlab. (Octave and INRIA's Scilab seem to be slightly better language-wise.) But for writing a non-temporary numerical routine Ocaml is superior since you can produce a type-checked, fast standalone executable efficiently thanks to the high-level programming offered. People dissatisfied with Python's performance often rave about Cython and how wonderful it is. This thing generates C code from type-annotated Python code. Examining the generated C and then assembly code from the heat.pyx examples, the code (with bounds checks disabled) doesn't look glorious. The Cython code looks like this: @cython.boundscheck(False) def solve_heat_buf_nocheck(initial_conditions, double dx, double dt, int iter): cdef numpy.ndarray[double, ndim=2] cur = initial_conditions.copy() cdef numpy.ndarray[double, ndim=2] next = numpy.zeros_like(initial_conditions) cdef int count, i, j, M, N M, N = cur.shape[0], cur.shape[1] cdef double step for count in range(iter): for i in range(1, M-1): for j in range(1, N-1): step = cur[i-1,j] + cur[i+1,j] + cur[i,j-1] + cur[i,j+1] - 4*cur[i,j] next[i,j] = cur[i,j] + dt*step/dx**2 cur, next = next, cur return cur This, with two other Python functions of similar size, generates about 8000 lines of C code. The actual loop is compiled into something that looks like this... if (__pyx_t_23 < 0) __pyx_t_23 += __pyx_pybuffernd_cur.diminfo[0].shape; if (__pyx_t_24 < 0) __pyx_t_24 += __pyx_pybuffernd_cur.diminfo[1].shape; __pyx_v_step = (((((*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_15, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_16, __pyx_pybuffernd_cur.diminfo[1].strides)) + (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_17, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_18, __pyx_pybuffernd_cur.diminfo[1].strides))) + (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_19, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_20, __pyx_pybuffernd_cur.diminfo[1].strides))) + (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_21, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_22, __pyx_pybuffernd_cur.diminfo[1].strides))) - (4.0 * (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_23, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_24, __pyx_pybuffernd_cur.diminfo[1].strides)))); ...repeated a hundred times. I pasted the excerpt at https://gist.github.com/anonymous/e54964121b212e5f783fb10c696ed9e2 This comes with an incredible amount of wrapper code for checking types and recursion limits. Regarding the float array issue, I have these two thoughts: 1) Why not just completely drop the float *array* optimization? If you're writing numerical code, use bigarrays. Those are optimized. I rarely use float arrays. 2) However, I often use structures and tuples of floats (e.g. type vec3 = float * float * float or type vec3 = { x : float; y : float; z : float }). Having the floats in structures/tuples be boxed would be very annoying. I'm not sure (1) and (2) interact. 3) What about 63-bit floats? This would be tricky and non-compliant, but someone had already made the suggestion here: http://blog.frama-c.com/index.php?post/2013/05/09/A-63-bit-floating-point-type-for-64-bit-OCaml These could be provided as a separate type. 4) Same thing for single-precision floats (i.e. actual C floats.) On 64-bit machines these would fit with no problems in a 64-bit word. 
Wasteful, but fast. 5) The really important issue may be float boxing/unboxing when passed to functions. Consider: let accu_over_range i1 i2 f q0 = let rec loop i q = if i > i2 then q else loop (i + 1) (f q i) in loop i1 q0 let _ = Printf.printf "%g\n" (accu_over_range 1 100_000_000 (fun q i -> q +. float i *. float i) 0.0) The inner loop translates into this (with the same 4.05+pr485+flambda) camlFloataccu__loop_133: subq $8, %rsp .L113: movq %rax, %rdi cmpq $200000001, %rdi jle .L112 movq %rbx, %rax addq $8, %rsp ret .L112: .L114: subq $16, %r15 movq caml_young_limit@GOTPCREL(%rip), %rax cmpq (%rax), %r15 jb .L115 leaq 8(%r15), %rsi movq $1277, -8(%rsi) movq %rdi, %rax sarq $1, %rax cvtsi2sdq %rax, %xmm0 movapd %xmm0, %xmm1 mulsd %xmm0, %xmm1 addsd (%rbx), %xmm1 movsd %xmm1, (%rsi) movq %rdi, %rax addq $2, %rax movq %rsi, %rbx jmp .L113 .L115: call caml_call_gc@PLT .L116: jmp .L114 And we see that floats are boxed. Type annotation doesn't help. I did quickly try a few Flambda options such as: ocamlopt -O3 floataccu-anno.ml -unbox-closures -rounds=10 -inline-call-cost=1000 -inlining-report but that didn't change anything. Maybe there is a way? Note that Fortran 2008 currently has "inner functions" which can be defined locally and passed to subroutines. They do capture variables, but the catch is that they are limited to one nesting level. Also some compilers (eg. PGI Fortran) don't support them yet. See: http://www.fortran90.org/src/faq.html#does-fortran-support-closures My point is that efficient calls with float arguments are much more important than this issue. Having to add a few type annotations to avoid polymorphic code is inconvenient, but not being able to use functions efficiently (the fundamental construct!) is a roadblock. 6) For records containing a few floats, there is the Mlton approach of laying out all unboxed fields before boxed ones, however as long as all-float structures are unboxed, this can be worked around by having the used manually place all these fields in a sub-record. -- Berke Durak, VA7OBD (CN88) [-- Attachment #2: Type: text/html, Size: 11256 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
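On point 6 above, a minimal sketch of the sub-record workaround (type and field names are made up for illustration):

(* Mixed int/float record: each float field is stored as a pointer to a
   separately boxed float. *)
type sample_mixed = { id : int; x : float; y : float; z : float }

(* Workaround: gather the floats into an all-float record, which OCaml
   stores as a flat block of unboxed doubles; the outer record then holds
   a single pointer to that block. *)
type vec3 = { vx : float; vy : float; vz : float }
type sample = { sid : int; pos : vec3 }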
* Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds 2017-01-22 20:06 ` Berke Durak @ 2017-01-23 16:33 ` David McClain 0 siblings, 0 replies; 14+ messages in thread From: David McClain @ 2017-01-23 16:33 UTC (permalink / raw) To: caml-list [-- Attachment #1: Type: text/plain, Size: 1045 bytes --] > On Jan 22, 2017, at 13:06, Berke Durak <berke.durak@gmail.com <mailto:berke.durak@gmail.com>> wrote: > > > But for writing a non-temporary numerical routine Ocaml is superior since you can produce a type-checked, fast standalone executable efficiently thanks to the high-level programming offered. This was the conclusion I reached almost 16-17 years ago while working on solving for optical train aberrations from point spread images. At that time we were stuck with RSI/IDL (a variant of Matlab, of sorts), and could not get past 5 or 6 degrees of freedom without havoc striking. We needed 150+ DOF. So I sat down and learned about this new world of FPL and developed a tensor based optimization that worked the first time - once I finally got it all to compile. It wasn’t a huge body of code, perhaps 2-3 KLOC. But my experience was so incredible that I wrote a short paper for Phil Wadler in the ACM proceedings. Couldn’t have come at this from a more distant place - astrophysics and missile borne IR sensors. - DM [-- Attachment #2: Type: text/html, Size: 1721 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Caml-list] <DKIM> Ocaml optimizer pitfalls & work-arounds 2017-01-19 6:51 [Caml-list] Ocaml optimizer pitfalls & work-arounds Mark Hayden 2017-01-19 11:20 ` Nils Becker 2017-01-19 13:41 ` Yaron Minsky @ 2017-01-21 14:39 ` Pierre Chambart 2 siblings, 0 replies; 14+ messages in thread From: Pierre Chambart @ 2017-01-21 14:39 UTC (permalink / raw) To: Mark Hayden, caml-list I'm coming a bit late to the discussion to tell you that this is not completely hopeless. Xavier Leroy was right by telling you that the (mid/back)-end of the compiler is untyped and that this hinder this kind of optimization. But it is not completely impossible. I've recently being working on some more aggressive unboxing which requires propagating some more type information. I took the opportunity to specialize a few more primitive in flambda. It appears that not so much more type annotations are required to be propagated through the intermediate representations to be able to improve something like: [Array.fold_left max 0.0 arr] When translating from the typedtree to Lambda, the application arguments can be annotated with the kind of values they carry. Here the important information is that [arr] is a float array. When fold left is inlined that information can be propagated to its uses. I have some tests showing good results on this. Notice that this is still quite limited. Type propagation on later representations can only follow values and flow forward. This is mainly due to GADTs. For instance type 'a t = Float : float t | Int : int t type ('a, 'b) eq = Eq : ('a, 'a) eq let f (type x) (t : x t) (v : x) (a : x array) (cast : x t -> (x, float) eq) = a.(0) <- v; let Eq = cast t in v +. 3.14 (function t/1019 v/1020 a/1021 cast/1022 (seq (array.set a/1021 0 v/1020) (let (match/1028 = (apply cast/1022 t/1019)) (+. v/1020 3.14))))) Here, for instance, the type of the argument v cannot be deduced from the (+.) operation, and specializing the array assignment using it would lead to segfaults. -- Pierre Chambart Le 19/01/2017 à 07:51, Mark Hayden a écrit : > We recently upgraded our Ocaml toolchain from 4.02.3 to Ocaml 4.04.0. > We were looking forward to a performance boost from the optimization > improvements, especially from flambda. While we generally were able > to achieve significant performance improvement, we were somewhat > surprised by the effort required to avoid certain pitfalls in Ocaml. > > This note describes some issues we ran into. We filed several > reports on Ocaml Mantis regarding our findings. However it appears > the underlying issues we ran into are unlikely to change: > > Your three reports (0007440, 0007441, 0007442) are manifestations > of the same fact: the OCaml compiler performs type-based > optimizations first, then erases types, then performs all the other > optimizations. This is very unlikely to change in the near future, > as it would require a total rewrite of the compiler. > > [X Leroy, https://caml.inria.fr/mantis/view.php?id=7440] > > I encourage readers to review the problem reports we submitted, which > include more concrete examples. I'm posting this note in case there > are others running into similar performance issues with their Ocaml > software and who might find it helpful in working around those > issues. I'm not aware of them being documented elsewhere and there > appears to be little prospect of the issues being addressed in the > compiler in the forseeable future. Please chime in if any of this is > inaccurate or there is something I missed. 
> > As an initial example, consider the following Ocaml code to find the > maximum floating point value in an array (that is at least 0.0): > > [Array.fold_left max 0.0 arr] > > Now compile this with the latest compiler and maximum optimization. > Because of how the Ocaml optimization works, this will run about > 10-15x slower (and allocate 2-3 words per array element) than a more > carefully written version that uses specialized operations and avoids > allocation. See below for one way to achieve this (while still using > a functional-programming style). > > (* Same as Array.fold_left, but with type casting. > *) > let [@inline] array_fold_leftf f (x:float) (a:float array) = > let r = ref x in > for i = 0 to Array.length a - 1 do > r := f !r (Array.unsafe_get a i) > done; > !r > ;; > > let [@inline] float_max (v0:float) v1 = > if v0 > v1 then v0 else v1 > ;; > > let array_float_max a = > array_fold_leftf float_max 0.0 a > ;; > > The assembly for the "inner loop" for the two examples are below. > They were compiled with Ocaml 4.05.dev+flambda, "-O3 > -unbox-closures", MacOS 12.2, AMD64. > > Unoptimized example. Note test/branch for array tag. Allocation for > boxing (we did not include the calls to trigger a minor gc). There > is a call to Ocaml runtime for polymorphic greater-equal. This is > probably not what one would expect from an optimizing/inline compiler > for a simple case such as this. Note that to create this we used our > own definition of Array.fold_left which had an "[@inline]" > annotation. > > L215: > movq (%rsp), %rbx > .loc 1 38 14 > movzbq -8(%rbx), %rax > cmpq $254, %rax > je L219 > .loc 1 38 14 > movq -4(%rbx,%rdx,4), %rsi > movq %rsi, 24(%rsp) > jmp L218 > .align 2 > L219: > .loc 1 38 14 > L221: > subq $16, %r15 > movq _caml_young_limit@GOTPCREL(%rip), %rax > cmpq (%rax), %r15 > jb L222 > leaq 8(%r15), %rsi > movq $1277, -8(%rsi) > .loc 1 38 14 > movsd -4(%rbx,%rdx,4), %xmm0 > movsd %xmm0, (%rsi) > movq %rsi, 24(%rsp) > L218: > movq %rdi, 32(%rsp) > .file 5 "pervasives.ml" > .loc 5 65 17 > movq _caml_greaterequal@GOTPCREL(%rip), %rax > call _caml_c_call > L213: > movq _caml_young_ptr@GOTPCREL(%rip), %r11 > movq (%r11), %r15 > cmpq $1, %rax > je L217 > movq 32(%rsp), %rdi > jmp L216 > .align 2 > L217: > movq 24(%rsp), %rdi > L216: > movq 8(%rsp), %rdx > movq %rdx, %rax > addq $2, %rdx > movq %rdx, 8(%rsp) > movq 16(%rsp), %rbx > cmpq %rbx, %rax > jne L215 > > > The assembly for the more carefully writting case is below. No > allocation. No call to external C code. No test/branch for array > tag. This matches what I think most people would like to see. It is > compact enough that (maybe) it would benefit from unrolling: > > l225: > .loc 1 46 14 > movsd -4(%rax,%rdi,4), %xmm1 > comisd %xmm1, %xmm0 > jbe l227 > jmp l226 > .align 2 > l227: > movapd %xmm1, %xmm0 > l226: > movq %rdi, %rsi > addq $2, %rdi > cmpq %rbx, %rsi > jne l225 > > > The two main learnings we found were: > > * Polymorphic primitives ([Array.get], [compare], [>=], [min]) are > only specialized if they appear in a context where the types can be > determined at their exact call site, otherwise a polymorphic > version is used. If the use of the primitive is later inlined in a > context where the type is no longer polymorphic, the function will > _not_ be specialized by the compiler. > > * Use of abstract data types prevents specialization. 
In particular, > defining an abstract data type in a module ("type t ;;") will > prevent specialization (even after inlining) for any polymorphic > primitives (eg, "caml_equal") used with that type. For instance, > if the underlying type for [t] is actually [int], other modules > will still use polymorphic equality instead of a single machine > instruction. You can prevent this behavior with the "private" > keyword in order to export the type information, "type t = private > int". Alternatively, the module can include its own specialized > operations and other modules can be careful to use them. > > It bears emphasizing that the issues described in this note apply > even when all of the code is "fully inlined" and uses highest level > of optimization. Specialization in the Ocaml compiler occurs in a > stage prior to inlining. If it hasn’t happened before inlining, it > won’t happen afterwards. > > What kind of effect does lack of specialization have on performance? > Calling the "caml_compare" Ocaml C runtime function to compare > integers can be 10-20x times slower than using a single integer > comparison machine instruction. Ditto for floating point values. > The unspecialized [Array.get], [Array.set], and (on 32-bit) > [Array.length] have to check the tag on the array to determine if the > array uses the unboxed floating-point represntation (I wish Ocaml > didn't use this!). For instance, the polymorphic [Array.get] checks > the tag on the array and (for floating point arrays) reads the value > and boxes the floating point value (ie, allocate 2-3 words on the > heap). Note that when iterating over an array, the check on the tag > will be included in _each_ loop iteration. > > Other impacts of using non-specialized functions: > > * Use of polymorphic primitives means floating point values have to > be boxed, requiring heap allocation. Through use of specialized > specialized primitives, in many cases floats can remain unboxed. > > * All the extra native code from using polymorphic primitives > (checking array tags, conditionally boxing floats, calling out to > Ocaml C runtime) can have follow-on effects for further inlining. > In other words, when native code can be kept compact, then more > code can be inlined and/or loops unrolled and this can in turn > allow further optimization. > > Some suggestions others may find helpful: > > * Consider using the "private" keyword for any abstract types in your > modules. We added over 50 of these to our code base. It is an > ugly but effective work-around. > > * The min/max functions in standard library Pervasives are > particularly problematic. They are polymorphic so their comparison > will never be specialized. It can be helpful to define specialized > functions such as: > > let [@inline] float_max (v0:float) (v1:float) = > if v0 > v1 then v0 else v1 > ;; > > let [@inline] int_max (v0:int) (v1:int) = > if v0 > v1 then v0 else v1 > ;; > > These will be compiled to use native machine code and unboxed > values. > > * Any use of polymorphism can negatively affect performance. Be > careful about inadvertently introducing polymorphism into your > program, such as this helper function: > > let [@inline] getter v ofs = Array.get v ofs ;; > > This will result in unspecialized version of [Array.get] being > inlined at all call-sites. 
Note that if your .mli file defines the > function as non-polymorphic that will still _not_ affect how > [getter] is compiled.: > > type getter : t -> int -> int ;; (* does not affect optimization *) > > You must cast the type in the implementation in order for [Array.get] > to be specialized: > > let [@inline] getter (v:int array) ofs = Array.get v ofs ;; > > * All the iterators in the Array module (eg, [Array.iter]) in the > standard library are polymorphic, so will use unspecialized > accessors and be affected by the issues described here. Using the > following to sum and array of floats may seem elegant: > > Array.fold_left (+) 0.0 arr > > However, the resulting code is much slower (and allocates 2 floats > per array entry, ie 4-6 words) than a "specialized" version. Note > that even if [Array.fold_left] and surrounding code were "fully > inlined," it is still a polymorphic function so the above > performance penalty for checking the array tag and boxing the float > is present. See also the earlier example. > > * It can be helpful to review the compiled assembly code (using "-S" > option for ocamlopt) and look for tell-tale signs of lack of > specialization, such as calls to [_caml_greaterequal] or allocation > [caml_call_gc] in cases where they are not expected. You can refer > to the assembly code for the original implementation, know that > that code will not be specialized when inlined. > > As I said, by examining our hot spots and following the suggestions > above, we found the resulting native code could be comparable to what > we would expect from C. It is unfortunate these issues were > (apparently) designed into the Ocaml compiler architecture, because > otherwise it would have seemed this would be a natural area of > improvement for the compiler. I would have thought a staticly typed > language such as Ocaml would (through its type checker) be > well-suited for the simple types of function specialization described > in this note. > > ^ permalink raw reply [flat|nested] 14+ messages in thread
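A minimal sketch of the "type t = private int" workaround discussed in this thread, with illustrative module names:

module Id : sig
  type t                      (* fully abstract *)
  val of_int : int -> t
end = struct
  type t = int
  let of_int x = x
end

module Id_private : sig
  type t = private int        (* representation exported, read-only *)
  val of_int : int -> t
end = struct
  type t = int
  let of_int x = x
end

(* Per the discussion in this thread, equality on Id.t falls back to the
   polymorphic caml_equal, while equality on Id_private.t can be
   specialized to a single integer comparison because the int
   representation is visible to other modules; (x :> int) coercions are
   also available on the private type. *)
let eq_abstract (a : Id.t) (b : Id.t) = a = b
let eq_private (a : Id_private.t) (b : Id_private.t) = a = b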
end of thread, other threads:[~2017-01-23 16:34 UTC | newest] Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-01-19 6:51 [Caml-list] Ocaml optimizer pitfalls & work-arounds Mark Hayden 2017-01-19 11:20 ` Nils Becker 2017-01-19 11:39 ` Gabriel Scherer 2017-01-19 13:26 ` Frédéric Bour 2017-01-19 14:35 ` Alain Frisch 2017-01-19 15:35 ` Ivan Gotovchits 2017-01-19 17:02 ` Hezekiah M. Carty 2017-01-19 15:41 ` Gerd Stolpmann 2017-01-19 13:41 ` Yaron Minsky 2017-01-19 17:59 ` Mark Hayden 2017-01-19 22:30 ` Yaron Minsky 2017-01-22 20:06 ` Berke Durak 2017-01-23 16:33 ` David McClain 2017-01-21 14:39 ` [Caml-list] <DKIM> " Pierre Chambart