[Re-sending to list because of size limit]
Of course, the same kind of issues apply to Bigarray values. A polymorphic getter:
let get a i = a.{i}
will have pretty bad performance unless "a" is annotated with the Bigarray kind and layout.
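For instance, a minimal sketch of the annotated version (assuming a one-dimensional float64 array in C layout):

```ocaml
open Bigarray

(* Polymorphic: the element kind and layout are unknown, so every
   access goes through a generic (slow) path. *)
let get_poly a i = a.{i}

(* Monomorphic: with the kind and layout annotated, the access
   compiles to a direct load. *)
let get (a : (float, float64_elt, c_layout) Array1.t) i = a.{i}
```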
Yesterday I rewrote some numerical double-integration inner loop in Fortran. One of the loops looks like this:
do ui = 1, n
   u = real(ui - 1) / real(n - 1)
   do vi = 1, ui
      v = real(vi - 1) / real(n - 1)
      p = p1 + u*p12 + v*p13
      r = norm(s - p)
      if (r > epsilon) then
         q = qs(1)*(1 - u - v) + qs(2)*u + qs(3)*v
         z = z + q/r
      end if
   end do
end do
I was half disappointed, half pleasantly surprised to find that it was only about 3 times faster than the OCaml version (with gfortran 6.3.0 and -O3 -fcheck=bounds -ffast-math).
Granted, the OCaml code is similar, written in a mostly imperative style with Bigarrays and references, but it uses a function passed to a domain iterator instead of two nested identical loops. This is with 4.05+pr985+flambda.
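For reference, a sketch of what such an inner loop looks like in OCaml (hypothetical names, and a plain nested-loop form rather than the domain iterator; the vectors are all-float records so their fields stay unboxed):

```ocaml
type vec3 = { x : float; y : float; z : float }

(* Vector subtraction and Euclidean norm, as in the Fortran module. *)
let ( -| ) a b = { x = a.x -. b.x; y = a.y -. b.y; z = a.z -. b.z }
let norm a = sqrt (a.x *. a.x +. a.y *. a.y +. a.z *. a.z)

let integrate ~n ~s ~p1 ~p12 ~p13 ~(qs : float array) =
  let z = ref 0.0 in
  for ui = 1 to n do
    let u = float (ui - 1) /. float (n - 1) in
    for vi = 1 to ui do
      let v = float (vi - 1) /. float (n - 1) in
      let p = { x = p1.x +. u *. p12.x +. v *. p13.x;
                y = p1.y +. u *. p12.y +. v *. p13.y;
                z = p1.z +. u *. p12.z +. v *. p13.z } in
      let r = norm (s -| p) in
      if r > epsilon_float then begin
        let q = qs.(0) *. (1.0 -. u -. v) +. qs.(1) *. u +. qs.(2) *. v in
        z := !z +. q /. r
      end
    done
  done;
  !z
```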
I examined the Fortran assembly code, and while it is numerically dense, gfortran ended up calling its internal pack/unpack primitives:
mulsd %xmm1, %xmm2 # v, D.4022
movsd %xmm0, 168(%rsp) # D.4022, A.12
movsd 16(%r15), %xmm0 # *s_110(D), *s_110(D)
subsd 88(%rsp), %xmm0 # %sfp, D.4022
subsd 32(%rsp), %xmm0 # %sfp, D.4022
subsd %xmm2, %xmm0 # D.4022, D.4022
movsd %xmm0, 176(%rsp) # D.4022, A.12
call _gfortran_internal_pack #
movsd (%rax), %xmm2 # MEM[(real(kind=8)[3] *)_117], D.4022
cmpq %r12, %rax # tmp266, D.4025
movsd 8(%rax), %xmm3 # MEM[(real(kind=8)[3] *)_117], D.4022
movq %rax, %rbx #, D.4025
mulsd %xmm2, %xmm2 # D.4022, D.4022
mulsd %xmm3, %xmm3 # D.4022, D.4022
movsd 16(%rax), %xmm0 # MEM[(real(kind=8)[3] *)_117], D.4022
movsd (%rsp), %xmm1 # %sfp, v
mulsd %xmm0, %xmm0 # D.4022, D.4022
addsd %xmm3, %xmm2 # D.4022, D.4022
addsd %xmm2, %xmm0 # D.4022, D.4022
sqrtsd %xmm0, %xmm0 # D.4022, D.4022
je .L4 #,
leaq 192(%rsp), %rdi #, tmp328
movq %rax, %rsi # D.4025,
movsd %xmm0, 8(%rsp) # D.4022, %sfp
call _gfortran_internal_unpack #
I'm new to Fortran, so maybe a few simple changes would make the code faster. I suspect these calls are due to the vector operations, such as the call to norm() and the vector subtraction, even though norm() is defined in the same Fortran module and is inlined. (Note that pack/unpack aren't that expensive.)
My point in describing all this is that if some of the pitfalls described are avoided, OCaml is not bad for numerical code if you compare it to "unoptimized" Fortran code. Getting the "magical optimization" from Fortran compilers (auto-vectorization, or auto-parallelization as provided by PGI Fortran) is neither automatic nor easy.
Now a lot of scientists are stuck with Matlab, and the newer generation tends to use Python with Numpy.
Assuming the required matrix/vector operations are available in OCaml libraries (Lacaml, Fftw, etc.), we OCaml programmers will find Python + Numpy inferior to OCaml, because Python is an inferior language.
The benefit of Python is that it's easier to "just get started" since "types won't get in the way", but then it's not any better than Matlab (besides the price tag). And yes, Python is a better language than Matlab. (Octave and INRIA's Scilab seem to be slightly better language-wise.)
But for writing a non-throwaway numerical routine, OCaml is superior, since the high-level programming it offers lets you produce a type-checked, fast, standalone executable efficiently.
People dissatisfied with Python's performance often rave about Cython and how wonderful it is. It generates C code from type-annotated Python code. Examining the C and then assembly code generated from the heat.pyx examples, the code (with bounds checks disabled) doesn't look glorious.
The Cython code looks like this:
@cython.boundscheck(False)
def solve_heat_buf_nocheck(initial_conditions, double dx, double dt, int iter):
    cdef numpy.ndarray[double, ndim=2] cur = initial_conditions.copy()
    cdef numpy.ndarray[double, ndim=2] next = numpy.zeros_like(initial_conditions)
    cdef int count, i, j, M, N
    M, N = cur.shape[0], cur.shape[1]
    cdef double step
    for count in range(iter):
        for i in range(1, M-1):
            for j in range(1, N-1):
                step = cur[i-1,j] + cur[i+1,j] + cur[i,j-1] + cur[i,j+1] - 4*cur[i,j]
                next[i,j] = cur[i,j] + dt*step/dx**2
        cur, next = next, cur
    return cur
This, with two other Python functions of similar size, generates about 8000 lines of C code.
The actual loop is compiled into something that looks like this...
if (__pyx_t_23 < 0) __pyx_t_23 += __pyx_pybuffernd_cur.diminfo[0].shape;
if (__pyx_t_24 < 0) __pyx_t_24 += __pyx_pybuffernd_cur.diminfo[1].shape;
__pyx_v_step = (((((*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_15, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_16, __pyx_pybuffernd_cur.diminfo[1].strides)) + (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_17, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_18, __pyx_pybuffernd_cur.diminfo[1].strides))) + (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_19, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_20, __pyx_pybuffernd_cur.diminfo[1].strides))) + (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_21, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_22, __pyx_pybuffernd_cur.diminfo[1].strides))) - (4.0 * (*__Pyx_BufPtrStrided2d(double *, __pyx_pybuffernd_cur.rcbuffer->pybuffer.buf, __pyx_t_23, __pyx_pybuffernd_cur.diminfo[0].strides, __pyx_t_24, __pyx_pybuffernd_cur.diminfo[1].strides))));
...repeated a hundred times. I pasted the excerpt at
https://gist.github.com/anonymous/e54964121b212e5f783fb10c696ed9e2
This comes with an incredible amount of wrapper code for checking types and recursion limits.
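For comparison, a sketch of the same stencil written directly in OCaml with kind-annotated Bigarrays (a hypothetical port of the heat.pyx function, not benchmarked):

```ocaml
open Bigarray

let solve_heat (init : (float, float64_elt, c_layout) Array2.t) ~dx ~dt ~iters =
  let m = Array2.dim1 init and n = Array2.dim2 init in
  (* Work on a copy of the input, as the Cython version does with .copy(). *)
  let cur = ref (Array2.create float64 c_layout m n) in
  Array2.blit init !cur;
  let next = ref (Array2.create float64 c_layout m n) in
  Array2.fill !next 0.0;
  for _ = 1 to iters do
    let c = !cur and nx = !next in
    for i = 1 to m - 2 do
      for j = 1 to n - 2 do
        let step =
          c.{i-1,j} +. c.{i+1,j} +. c.{i,j-1} +. c.{i,j+1}
          -. 4.0 *. c.{i,j}
        in
        nx.{i,j} <- c.{i,j} +. dt *. step /. (dx *. dx)
      done
    done;
    (* Swap the buffers, as in the Python "cur, next = next, cur". *)
    cur := nx; next := c
  done;
  !cur
```

With the kind and layout annotated, each c.{i,j} access compiles to a direct load rather than a polymorphic call.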
Regarding the float array issue, I have these two thoughts:
1) Why not just completely drop the float *array* optimization? If you're writing numerical code, use Bigarrays; those are optimized. I rarely use float arrays.
2) However, I often use structures and tuples of floats (e.g. type vec3 = float * float * float or type vec3 = { x : float; y : float; z : float }). Having the floats in structures/tuples be boxed would be very annoying.
I'm not sure whether (1) and (2) interact.
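As a concrete illustration of (2): today an all-float record is stored as one flat block of unboxed doubles, which is exactly the behavior that would be annoying to lose (the tuple form, by contrast, boxes each float):

```ocaml
(* The record form is stored as a single flat block of three unboxed
   doubles (Double_array_tag); float * float * float boxes each float. *)
type vec3 = { x : float; y : float; z : float }

(* Field reads compile to direct float loads from the flat block. *)
let dot a b = a.x *. b.x +. a.y *. b.y +. a.z *. b.z
```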
3) What about 63-bit floats? This would be tricky and non-compliant, but someone has already made the suggestion here:
http://blog.frama-c.com/index.php?post/2013/05/09/A-63-bit-floating-point-type-for-64-bit-OCaml
These could be provided as a separate type.
4) Same thing for single-precision floats (i.e. actual C floats). On 64-bit machines these would fit into a 64-bit word with no problem. Wasteful, but fast.
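For storage at least, single precision is already available today through Bigarrays; elements are converted to and from OCaml's double-precision float at the boundary:

```ocaml
open Bigarray

(* A float32 Bigarray stores actual C floats; reads and writes
   convert to/from OCaml's 64-bit float. *)
let a = Array1.create float32 c_layout 4
let () = a.{0} <- 1.5  (* 1.5 is exactly representable in single precision *)
```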
5) The really important issue may be float boxing/unboxing when passed to functions. Consider:
let accu_over_range i1 i2 f q0 =
  let rec loop i q = if i > i2 then q else loop (i + 1) (f q i) in
  loop i1 q0

let _ =
  Printf.printf "%g\n"
    (accu_over_range 1 100_000_000 (fun q i -> q +. float i *. float i) 0.0)
The inner loop translates into this (with the same 4.05+pr985+flambda):
camlFloataccu__loop_133:
subq $8, %rsp
.L113:
movq %rax, %rdi
cmpq $200000001, %rdi
jle .L112
movq %rbx, %rax
addq $8, %rsp
ret
.L112:
.L114:
subq $16, %r15
movq caml_young_limit@GOTPCREL(%rip), %rax
cmpq (%rax), %r15
jb .L115
leaq 8(%r15), %rsi
movq $1277, -8(%rsi)
movq %rdi, %rax
sarq $1, %rax
cvtsi2sdq %rax, %xmm0
movapd %xmm0, %xmm1
mulsd %xmm0, %xmm1
addsd (%rbx), %xmm1
movsd %xmm1, (%rsi)
movq %rdi, %rax
addq $2, %rax
movq %rsi, %rbx
jmp .L113
.L115:
call caml_call_gc@PLT
.L116:
jmp .L114
And we see that floats are boxed. Type annotations don't help. I quickly tried a few Flambda options, such as:
ocamlopt -O3 floataccu-anno.ml -unbox-closures -rounds=10 -inline-call-cost=1000 -inlining-report
but that didn't change anything.
Maybe there is a way?
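One workaround (a workaround, not a fix for the general higher-order case) is to specialize by hand; a monomorphic version with a non-escaping float ref keeps the accumulator unboxed:

```ocaml
(* Hand-specialized accumulation: the local float ref never escapes
   the function, so ocamlopt keeps it unboxed inside the loop instead
   of allocating a box per iteration. *)
let sum_squares i1 i2 =
  let q = ref 0.0 in
  for i = i1 to i2 do
    q := !q +. float i *. float i
  done;
  !q
```

Of course this defeats the purpose: the whole point was to pass the accumulating function as an argument.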
Note that Fortran 2008 has "inner functions" which can be defined locally and passed to subroutines. They do capture variables, but the catch is that they are limited to one nesting level. Also, some compilers (e.g. PGI Fortran) don't support them yet. See: http://www.fortran90.org/src/faq.html#does-fortran-support-closures
My point is that efficient calls with float arguments are much more important than this issue. Having to add a few type annotations to avoid polymorphic code is inconvenient, but not being able to use functions efficiently (the fundamental construct!) is a roadblock.
6) For records containing a few floats, there is the MLton approach of laying out all unboxed fields before boxed ones. However, as long as all-float records are unboxed, this can be worked around by having the user manually place all the float fields in a sub-record.
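A sketch of that workaround (hypothetical types): group the float fields into an all-float sub-record, which is stored as one flat float block:

```ocaml
(* Mixed record: each float field is stored as a pointer to a
   separately boxed float. *)
type particle_mixed = { label : string; x : float; y : float }

(* Workaround: an all-float sub-record is stored flat (one block of
   unboxed doubles), at the cost of one indirection to reach it. *)
type pos = { px : float; py : float }
type particle = { name : string; pos : pos }
```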
--
Berke Durak, VA7OBD (CN88)