From: Berke Durak <berke.durak@gmail.com>
To: caml-list <caml-list@inria.fr>
Subject: Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds
Date: Sun, 22 Jan 2017 12:06:26 -0800
Message-ID: <CAALTfKAUYoDxj_eVbAig8+TGFcyqR0rCaHJoqa_uwZfCvAWkMg@mail.gmail.com>
In-Reply-To: <CACLX4jTwywp4RuyUPnViS9LQrQ=jjdjwmsHZqgWUoGkOVKj2DA@mail.gmail.com>
[Re-sending to list because of size limit]
Of course, the same kind of issue applies to Bigarray values. A polymorphic
getter:
let get a i = a.{i}
will have pretty bad performance, unless "a" is annotated with the Bigarray
kind and layout.
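For instance, a monomorphic version looks like this (a minimal sketch;
float64/C-layout is just one possible choice of kind and layout):
(* Sketch: with the kind and layout fixed in the type, a.{i} compiles to a
   direct load instead of a call through the generic accessor. *)
let get (a : (float, Bigarray.float64_elt, Bigarray.c_layout) Bigarray.Array1.t) i =
  a.{i}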
Yesterday I rewrote a numerical double-integration inner loop in
Fortran. One of the loops looks like this:
do ui = 1,n
   u = real(ui - 1) / real(n - 1)
   do vi = 1,ui
      v = real(vi - 1) / real(n - 1)
      p = p1 + u*p12 + v*p13
      r = norm(s - p)
      if (r > epsilon) then
         q = qs(1)*(1 - u - v) + qs(2)*u + qs(3)*v
         z = z + q/r
      end if
   end do
end do
I was half-disappointed, half pleasantly surprised when I noticed that it
only got about 3 times faster than the OCaml version (with gfortran 6.3.0
and -O3 -fcheck-bounds -ffast-math).
Granted, the OCaml code is written in a similar, mostly imperative style
with Bigarrays and references, but it uses a function passed to a domain
iterator instead of two nested, identical loops. This is with
4.05+pr985+flambda.
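Roughly, the OCaml side has the shape below (a reconstruction for
illustration, not the actual code): the loop body is a closure handed to an
iterator over the triangular (u, v) domain.
(* Iterate f over the triangular (u, v) domain, n points per axis. *)
let iter_triangle n f =
  for ui = 0 to n - 1 do
    let u = float ui /. float (n - 1) in
    for vi = 0 to ui do
      let v = float vi /. float (n - 1) in
      f u v
    done
  done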
I examined the Fortran assembly code, and while it is numerically dense, it
appears that gfortran ended up calling its internal pack/unpack primitives:
mulsd %xmm1, %xmm2 # v, D.4022
movsd %xmm0, 168(%rsp) # D.4022, A.12
movsd 16(%r15), %xmm0 # *s_110(D), *s_110(D)
subsd 88(%rsp), %xmm0 # %sfp, D.4022
subsd 32(%rsp), %xmm0 # %sfp, D.4022
subsd %xmm2, %xmm0 # D.4022, D.4022
movsd %xmm0, 176(%rsp) # D.4022, A.12
call _gfortran_internal_pack #
movsd (%rax), %xmm2 # MEM[(real(kind=8)[3] *)_117], D.4022
cmpq %r12, %rax # tmp266, D.4025
movsd 8(%rax), %xmm3 # MEM[(real(kind=8)[3] *)_117], D.4022
movq %rax, %rbx #, D.4025
mulsd %xmm2, %xmm2 # D.4022, D.4022
mulsd %xmm3, %xmm3 # D.4022, D.4022
movsd 16(%rax), %xmm0 # MEM[(real(kind=8)[3] *)_117], D.4022
movsd (%rsp), %xmm1 # %sfp, v
mulsd %xmm0, %xmm0 # D.4022, D.4022
addsd %xmm3, %xmm2 # D.4022, D.4022
addsd %xmm2, %xmm0 # D.4022, D.4022
sqrtsd %xmm0, %xmm0 # D.4022, D.4022
je .L4 #,
leaq 192(%rsp), %rdi #, tmp328
movq %rax, %rsi # D.4025,
movsd %xmm0, 8(%rsp) # D.4022, %sfp
call _gfortran_internal_unpack #
I'm new to Fortran, so maybe there are a few simple things that would make
the code faster. I suspect these calls are due to the vector operations,
such as the call to norm() and the vector subtraction, even though norm()
is defined in the same Fortran module and is inlined. (Note that
pack/unpack aren't that expensive.)
My point in describing all this is that, if some of the pitfalls described
earlier are avoided, OCaml is not bad for numerical code compared to
"unoptimized" Fortran code. Getting the "magical optimization" from
Fortran compilers (auto-vectorization, or auto-parallelization as provided
by PGI Fortran) is neither automatic nor easy.
Now a lot of scientists are stuck with Matlab, and the newer generation
tends to use Python with Numpy.
Assuming the required matrix/vector operations are available in OCaml
libraries (Lacaml, Fftw, etc.), we OCaml programmers will find Python + Numpy
inferior to OCaml, because Python is an inferior language.
The benefit of Python is that it's easier to "just get started" since
"types won't get in the way", but then the workflow is not any better than
Matlab's (besides the price tag). And yes, Python is a better language than
Matlab. (Octave and INRIA's Scilab seem to be slightly better language-wise.)
But for writing a numerical routine that isn't throwaway code, OCaml is
superior, since the high-level programming it offers lets you produce a
type-checked, fast standalone executable efficiently.
People dissatisfied with Python's performance often rave about Cython and
how wonderful it is. This tool generates C code from type-annotated
Python code. Examining the C and then the assembly code generated from the
heat.pyx examples, the code (with bounds checks disabled) doesn't look
glorious.
The Cython code looks like this:
@cython.boundscheck(False)
def solve_heat_buf_nocheck(initial_conditions, double dx, double dt, int iter):
    cdef numpy.ndarray[double, ndim=2] cur = initial_conditions.copy()
    cdef numpy.ndarray[double, ndim=2] next = numpy.zeros_like(initial_conditions)
    cdef int count, i, j, M, N
    M, N = cur.shape[0], cur.shape[1]
    cdef double step
    for count in range(iter):
        for i in range(1, M-1):
            for j in range(1, N-1):
                step = cur[i-1,j] + cur[i+1,j] + cur[i,j-1] + cur[i,j+1] - 4*cur[i,j]
                next[i,j] = cur[i,j] + dt*step/dx**2
        cur, next = next, cur
    return cur
This, with two other Python functions of similar size, generates about 8000
lines of C code.
The actual loop is compiled into something that looks like this...
    if (__pyx_t_23 < 0) __pyx_t_23 += __pyx_pybuffernd_cur.diminfo[0].shape;
    if (__pyx_t_24 < 0) __pyx_t_24 += __pyx_pybuffernd_cur.diminfo[1].shape;
    __pyx_v_step = (((((*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_15, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_16, __pyx_pybuffernd_cur.diminfo[1].strides))
      + (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_17, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_18, __pyx_pybuffernd_cur.diminfo[1].strides)))
      + (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_19, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_20, __pyx_pybuffernd_cur.diminfo[1].strides)))
      + (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_21, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_22, __pyx_pybuffernd_cur.diminfo[1].strides)))
      - (4.0 * (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_23, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_24, __pyx_pybuffernd_cur.diminfo[1].strides))));
...repeated a hundred times. I pasted the excerpt at
https://gist.github.com/anonymous/e54964121b212e5f783fb10c696ed9e2
This comes with an incredible amount of wrapper code for checking types and
recursion limits.
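For comparison, here is roughly what the same kernel looks like in OCaml with
annotated Bigarrays (a sketch written for illustration; the names and
signature are mine, not taken from heat.pyx):
open Bigarray

(* One time step of the stencil; interior points only, C layout, 0-based. *)
let heat_step (cur : (float, float64_elt, c_layout) Array2.t)
    (next : (float, float64_elt, c_layout) Array2.t) ~dx ~dt =
  let m = Array2.dim1 cur and n = Array2.dim2 cur in
  for i = 1 to m - 2 do
    for j = 1 to n - 2 do
      let step =
        cur.{i-1,j} +. cur.{i+1,j} +. cur.{i,j-1} +. cur.{i,j+1}
        -. 4.0 *. cur.{i,j}
      in
      next.{i,j} <- cur.{i,j} +. dt *. step /. (dx *. dx)
    done
  done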
Regarding the float array issue, I have a few thoughts:
1) Why not just completely drop the float *array* optimization? If you're
writing numerical code, use bigarrays. Those are optimized. I rarely use
float arrays.
2) However, I often use structures and tuples of floats (e.g. type vec3 =
float * float * float or type vec3 = { x : float; y : float; z : float }).
Having the floats in structures/tuples be boxed would be very annoying.
I'm not sure how (1) and (2) interact.
3) What about 63-bit floats? This would be tricky and non-compliant, but
someone has already made the suggestion here:
http://blog.frama-c.com/index.php?post/2013/05/09/A-63-bit-floating-point-type-for-64-bit-OCaml
These could be provided as a separate type.
4) Same thing for single-precision floats (i.e. actual C floats.) On
64-bit machines these would fit with no problems in a 64-bit word.
Wasteful, but fast.
5) The really important issue may be float boxing/unboxing when passed to
functions. Consider:
let accu_over_range i1 i2 f q0 =
  let rec loop i q =
    if i > i2 then q else loop (i + 1) (f q i)
  in
  loop i1 q0

let _ =
  Printf.printf "%g\n"
    (accu_over_range 1 100_000_000 (fun q i -> q +. float i *. float i) 0.0)
The inner loop translates into this (with the same 4.05+pr485+flambda):
camlFloataccu__loop_133:
subq $8, %rsp
.L113:
movq %rax, %rdi
cmpq $200000001, %rdi
jle .L112
movq %rbx, %rax
addq $8, %rsp
ret
.L112:
.L114:
subq $16, %r15
movq caml_young_limit@GOTPCREL(%rip), %rax
cmpq (%rax), %r15
jb .L115
leaq 8(%r15), %rsi
movq $1277, -8(%rsi)
movq %rdi, %rax
sarq $1, %rax
cvtsi2sdq %rax, %xmm0
movapd %xmm0, %xmm1
mulsd %xmm0, %xmm1
addsd (%rbx), %xmm1
movsd %xmm1, (%rsi)
movq %rdi, %rax
addq $2, %rax
movq %rsi, %rbx
jmp .L113
.L115:
call caml_call_gc@PLT
.L116:
jmp .L114
And we see that the floats are boxed. Type annotations don't help. I
quickly tried a few Flambda options, such as:
ocamlopt -O3 floataccu-anno.ml -unbox-closures -rounds=10
-inline-call-cost=1000 -inlining-report
but that didn't change anything.
Maybe there is a way?
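(For reference, the obvious manual workaround is to give up on the
higher-order interface and inline the body by hand, as in the sketch below; I
haven't checked whether the accumulator then stays unboxed, but at least the
indirect call disappears.)
(* Hand-specialised version of the loop above, closure inlined manually. *)
let sum_squares i1 i2 =
  let rec loop i q =
    if i > i2 then q
    else loop (i + 1) (q +. float i *. float i)
  in
  loop i1 0.0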
Note that Fortran 2008 has "inner functions" which can be defined
locally and passed to subroutines. They do capture variables, but the
catch is that they are limited to one nesting level. Also, some compilers
(e.g. PGI Fortran) don't support them yet. See:
http://www.fortran90.org/src/faq.html#does-fortran-support-closures
My point is that efficient calls with float arguments are much more
important than this issue. Having to add a few type annotations to avoid
polymorphic code is inconvenient, but not being able to use functions
efficiently (the fundamental construct!) is a roadblock.
6) For records containing a few floats, there is the MLton approach of
laying out all unboxed fields before boxed ones; however, as long as
all-float records get the flat (unboxed) representation, this can be worked
around by having the user manually place all the float fields in a sub-record.
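For example (made-up type names, just to illustrate the workaround):
(* All-float record: stored flat, fields unboxed. *)
type vec3 = { x : float; y : float; z : float }

(* Mixed record: grouping the floats in [pos] avoids boxing each one
   individually; the cost is one indirection to the vec3 block. *)
type particle = { pos : vec3; name : string }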
--
Berke Durak, VA7OBD (CN88)