From: Berke Durak <berke.durak@gmail.com>
To: caml-list <caml-list@inria.fr>
Subject: Re: [Caml-list] Ocaml optimizer pitfalls & work-arounds
Date: Sun, 22 Jan 2017 12:06:26 -0800
Message-ID: <CAALTfKAUYoDxj_eVbAig8+TGFcyqR0rCaHJoqa_uwZfCvAWkMg@mail.gmail.com>
In-Reply-To: <CACLX4jTwywp4RuyUPnViS9LQrQ=jjdjwmsHZqgWUoGkOVKj2DA@mail.gmail.com>
[Re-sending to list because of size limit]
Of course, the same kind of issue applies to Bigarray values. A polymorphic
getter:
let get a i = a.{i}
will have pretty bad performance, unless "a" is annotated with the Bigarray
kind and layout.
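For instance, a monomorphic version looks like this (a minimal sketch;
float64/C-layout is just one possible choice of kind and layout):
(* Sketch: with the kind and layout fixed in the type, a.{i} compiles to a
   direct load instead of a call through the generic accessor. *)
let get (a : (float, Bigarray.float64_elt, Bigarray.c_layout) Bigarray.Array1.t) i =
  a.{i}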
Yesterday I rewrote a numerical double-integration inner loop in
Fortran. One of the loops looks like this:
do ui = 1,n
   u = real(ui - 1) / real(n - 1)
   do vi = 1,ui
      v = real(vi - 1) / real(n - 1)
      p = p1 + u*p12 + v*p13
      r = norm(s - p)
      if (r > epsilon) then
         q = qs(1)*(1 - u - v) + qs(2)*u + qs(3)*v
         z = z + q/r
      end if
   end do
end do
I was half-disappointed, half pleasantly surprised when I noticed that it
only got about 3 times faster than the OCaml version (with gfortran 6.3.0
and -O3 -fcheck-bounds -ffast-math).
Granted, the OCaml code is written in a similar, mostly imperative style
with Bigarrays and references, but it uses a function passed to a domain
iterator instead of two nested, identical loops. This is with
4.05+pr985+flambda.
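Roughly, the OCaml side has the shape below (a reconstruction for
illustration, not the actual code): the loop body is a closure handed to an
iterator over the triangular (u, v) domain.
(* Iterate f over the triangular (u, v) domain, n points per axis. *)
let iter_triangle n f =
  for ui = 0 to n - 1 do
    let u = float ui /. float (n - 1) in
    for vi = 0 to ui do
      let v = float vi /. float (n - 1) in
      f u v
    done
  done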
I examined the Fortran assembly code, and while it is numerically dense, it
appears that gfortran ended up calling its internal pack/unpack primitives:
mulsd %xmm1, %xmm2 # v, D.4022
movsd %xmm0, 168(%rsp) # D.4022, A.12
movsd 16(%r15), %xmm0 # *s_110(D), *s_110(D)
subsd 88(%rsp), %xmm0 # %sfp, D.4022
subsd 32(%rsp), %xmm0 # %sfp, D.4022
subsd %xmm2, %xmm0 # D.4022, D.4022
movsd %xmm0, 176(%rsp) # D.4022, A.12
call _gfortran_internal_pack #
movsd (%rax), %xmm2 # MEM[(real(kind=8)[3] *)_117], D.4022
cmpq %r12, %rax # tmp266, D.4025
movsd 8(%rax), %xmm3 # MEM[(real(kind=8)[3] *)_117], D.4022
movq %rax, %rbx #, D.4025
mulsd %xmm2, %xmm2 # D.4022, D.4022
mulsd %xmm3, %xmm3 # D.4022, D.4022
movsd 16(%rax), %xmm0 # MEM[(real(kind=8)[3] *)_117], D.4022
movsd (%rsp), %xmm1 # %sfp, v
mulsd %xmm0, %xmm0 # D.4022, D.4022
addsd %xmm3, %xmm2 # D.4022, D.4022
addsd %xmm2, %xmm0 # D.4022, D.4022
sqrtsd %xmm0, %xmm0 # D.4022, D.4022
je .L4 #,
leaq 192(%rsp), %rdi #, tmp328
movq %rax, %rsi # D.4025,
movsd %xmm0, 8(%rsp) # D.4022, %sfp
call _gfortran_internal_unpack #
I'm new to Fortran, so maybe there are a few simple things that would make
the code faster. I suspect these calls are due to the vector operations,
such as the call to norm() and the vector subtraction, even though norm()
is defined in the same Fortran module and is inlined. (Note that
pack/unpack aren't that expensive.)
My point in describing all this is that, if some of the pitfalls described
earlier are avoided, OCaml is not bad for numerical code compared to
"unoptimized" Fortran code. Getting the "magical optimization" from
Fortran compilers (auto-vectorization, or auto-parallelization as provided
by PGI Fortran) is neither automatic nor easy.
Now a lot of scientists are stuck with Matlab, and the newer generation
tends to use Python with Numpy.
Assuming the required matrix/vector operations are available in OCaml
libraries (Lacaml, Fftw, etc.), we OCaml programmers will find Python + Numpy
inferior to OCaml, because Python is an inferior language.
The benefit of Python is that it's easier to "just get started" since
"types won't get in the way", but then the workflow is not any better than
Matlab's (besides the price tag). And yes, Python is a better language than
Matlab. (Octave and INRIA's Scilab seem to be slightly better language-wise.)
But for writing a numerical routine that isn't throwaway code, OCaml is
superior, since the high-level programming it offers lets you produce a
type-checked, fast standalone executable efficiently.
People dissatisfied with Python's performance often rave about Cython and
how wonderful it is. This tool generates C code from type-annotated
Python code. Examining the C and then the assembly code generated from the
heat.pyx examples, the code (with bounds checks disabled) doesn't look
glorious.
The Cython code looks like this:
@cython.boundscheck(False)
def solve_heat_buf_nocheck(initial_conditions, double dx, double dt, int iter):
    cdef numpy.ndarray[double, ndim=2] cur = initial_conditions.copy()
    cdef numpy.ndarray[double, ndim=2] next = numpy.zeros_like(initial_conditions)
    cdef int count, i, j, M, N
    M, N = cur.shape[0], cur.shape[1]
    cdef double step
    for count in range(iter):
        for i in range(1, M-1):
            for j in range(1, N-1):
                step = cur[i-1,j] + cur[i+1,j] + cur[i,j-1] + cur[i,j+1] - 4*cur[i,j]
                next[i,j] = cur[i,j] + dt*step/dx**2
        cur, next = next, cur
    return cur
This, with two other Python functions of similar size, generates about 8000
lines of C code.
The actual loop is compiled into something that looks like this...
    if (__pyx_t_23 < 0) __pyx_t_23 += __pyx_pybuffernd_cur.diminfo[0].shape;
    if (__pyx_t_24 < 0) __pyx_t_24 += __pyx_pybuffernd_cur.diminfo[1].shape;
    __pyx_v_step = (((((*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_15, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_16, __pyx_pybuffernd_cur.diminfo[1].strides))
      + (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_17, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_18, __pyx_pybuffernd_cur.diminfo[1].strides)))
      + (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_19, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_20, __pyx_pybuffernd_cur.diminfo[1].strides)))
      + (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_21, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_22, __pyx_pybuffernd_cur.diminfo[1].strides)))
      - (4.0 * (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_23, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_24, __pyx_pybuffernd_cur.diminfo[1].strides))));
...repeated a hundred times. I pasted the excerpt at
https://gist.github.com/anonymous/e54964121b212e5f783fb10c696ed9e2
This comes with an incredible amount of wrapper code for checking types and
recursion limits.
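For comparison, here is roughly what the same kernel looks like in OCaml with
annotated Bigarrays (a sketch written for illustration; the names and
signature are mine, not taken from heat.pyx):
open Bigarray

(* One time step of the stencil; interior points only, C layout, 0-based. *)
let heat_step (cur : (float, float64_elt, c_layout) Array2.t)
    (next : (float, float64_elt, c_layout) Array2.t) ~dx ~dt =
  let m = Array2.dim1 cur and n = Array2.dim2 cur in
  for i = 1 to m - 2 do
    for j = 1 to n - 2 do
      let step =
        cur.{i-1,j} +. cur.{i+1,j} +. cur.{i,j-1} +. cur.{i,j+1}
        -. 4.0 *. cur.{i,j}
      in
      next.{i,j} <- cur.{i,j} +. dt *. step /. (dx *. dx)
    done
  done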
Regarding the float array issue, I have a few thoughts:
1) Why not just completely drop the float *array* optimization? If you're
writing numerical code, use bigarrays. Those are optimized. I rarely use
float arrays.
2) However, I often use structures and tuples of floats (e.g. type vec3 =
float * float * float or type vec3 = { x : float; y : float; z : float }).
Having the floats in structures/tuples be boxed would be very annoying.
I'm not sure how (1) and (2) interact.
3) What about 63-bit floats? This would be tricky and non-compliant, but
someone has already made the suggestion here:
http://blog.frama-c.com/index.php?post/2013/05/09/A-63-bit-floating-point-type-for-64-bit-OCaml
These could be provided as a separate type.
4) Same thing for single-precision floats (i.e. actual C floats.) On
64-bit machines these would fit with no problems in a 64-bit word.
Wasteful, but fast.
5) The really important issue may be float boxing/unboxing when passed to
functions. Consider:
let accu_over_range i1 i2 f q0 =
  let rec loop i q =
    if i > i2 then q else loop (i + 1) (f q i)
  in
  loop i1 q0

let _ =
  Printf.printf "%g\n"
    (accu_over_range 1 100_000_000 (fun q i -> q +. float i *. float i) 0.0)
The inner loop translates into this (with the same 4.05+pr485+flambda):
camlFloataccu__loop_133:
subq $8, %rsp
.L113:
movq %rax, %rdi
cmpq $200000001, %rdi
jle .L112
movq %rbx, %rax
addq $8, %rsp
ret
.L112:
.L114:
subq $16, %r15
movq caml_young_limit@GOTPCREL(%rip), %rax
cmpq (%rax), %r15
jb .L115
leaq 8(%r15), %rsi
movq $1277, -8(%rsi)
movq %rdi, %rax
sarq $1, %rax
cvtsi2sdq %rax, %xmm0
movapd %xmm0, %xmm1
mulsd %xmm0, %xmm1
addsd (%rbx), %xmm1
movsd %xmm1, (%rsi)
movq %rdi, %rax
addq $2, %rax
movq %rsi, %rbx
jmp .L113
.L115:
call caml_call_gc@PLT
.L116:
jmp .L114
And we see that the floats are boxed. Type annotations don't help. I
quickly tried a few Flambda options, such as:
ocamlopt -O3 floataccu-anno.ml -unbox-closures -rounds=10
-inline-call-cost=1000 -inlining-report
but that didn't change anything.
Maybe there is a way?
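(For reference, the obvious manual workaround is to give up on the
higher-order interface and inline the body by hand, as in the sketch below; I
haven't checked whether the accumulator then stays unboxed, but at least the
indirect call disappears.)
(* Hand-specialised version of the loop above, closure inlined manually. *)
let sum_squares i1 i2 =
  let rec loop i q =
    if i > i2 then q
    else loop (i + 1) (q +. float i *. float i)
  in
  loop i1 0.0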
Note that Fortran 2008 has "inner functions" which can be defined
locally and passed to subroutines. They do capture variables, but the
catch is that they are limited to one nesting level. Also, some compilers
(e.g. PGI Fortran) don't support them yet. See:
http://www.fortran90.org/src/faq.html#does-fortran-support-closures
My point is that efficient calls with float arguments are much more
important than this issue. Having to add a few type annotations to avoid
polymorphic code is inconvenient, but not being able to use functions
efficiently (the fundamental construct!) is a roadblock.
6) For records containing a few floats, there is the MLton approach of
laying out all unboxed fields before boxed ones; however, as long as
all-float records get the flat (unboxed) representation, this can be worked
around by having the user manually place all the float fields in a sub-record.
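For example (made-up type names, just to illustrate the workaround):
(* All-float record: stored flat, fields unboxed. *)
type vec3 = { x : float; y : float; z : float }

(* Mixed record: grouping the floats in [pos] avoids boxing each one
   individually; the cost is one indirection to the vec3 block. *)
type particle = { pos : vec3; name : string }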
--
Berke Durak, VA7OBD (CN88)