On Mon, Jul 10, 2017 at 6:28 PM, Alexey Egorov <alex.only.d@gmail.com>
wrote:

> Hi Ivan,
>
> thanks for your analysis!
>
> Which version of OCaml you are using? I'm tried on 4.04.1 and trunk
> (with and without flambda) and still getting the same (poor)
> performance for recursive version, no matter if curried or uncurried
> loop is used.
>

It works the same on any version of OCaml. Are you sure, that you're
actually recompiling the code?
But if you insist, then I was trying on 4.03.0 and 4.04.0+flambda.

P.S. I would like to retract my statement that the recursive version is 10%
faster, in fact it is the same, it was a measurement error.


>
> 2017-07-11 2:29 GMT+05:00 Ivan Gotovchits <ivg@ieee.org>:
> > Hi Alexey,
> >
> > The recursion version is slower because you're allocating a tuple on each
> > recursive call. Try to use a curried form of a function, and you will a
> > performance that is even better than the for-loop (10% better in my
> case).
> > E.g.,
> >
> >
> >
> > let loop high =
> >   let rec loop i = function
> >     | t when i > high -> t
> >     | t when i mod 2 = 0 -> loop (i + 1) (t + i * 2)
> >     | t -> loop (i + 1) t in
> >   loop 1 0
> >
> >
> > The for-loop version is indeed slower than the C version because of
> tagging,
> > c.f., the OCaml implementation:
> >
> > movq $1, %rbx
> > movq $3, %rdi
> > cmpq %rax, %rdi
> > jg .L100
> >
> >
> > .L101:
> > movq %rdi, %rsi
> > sarq $1, %rsi
> > movq %rsi, %rdx
> > shr movq $1, %rbx
> > movq $3, %rdi
> > cmpq %rax, %rdi
> > jg .L100
> >
> > .L101:
> > movq %rdi, %rsi
> > sarq $1, %rsi
> > moq $63, %rdx
> > movq %rsi, %rcx
> > addq %rdx, %rcx
> > andq $-2, %rcx
> > subq %rcx, %rsi
> > leaq 1(%rsi,%rsi), %rsi
> > cmpq $1, %rsi
> > jne .L102
> > leaq -2(%rbx,%rdi,2), %rbx
> > .L102:
> > movq %rdi, %rsi
> > addq $2, %rdi
> > cmpq %rax, %rsi
> > jne .L101
> > .L100:
> > movq %rbx, %rax
> > ret
> >
> >
> > With the GCC output:
> >
> >
> > testl %edi, %edi
> > jle .L6
> > addl $1, %edi
> > movl $4, %ecx
> > movl $1, %edx
> > xorl %eax, %eax
> > jmp .L3
> > .L5:
> > leaq (%rax,%rcx), %rsi
> > testb $1, %dl
> > cmove %rsi, %rax
> > addq $2, %rcx
> > .L3:
> > addl $1, %edx
> > cmpl %edi, %edx
> > jne .L5
> > rep ret
> > .L6:
> > xorl %eax, %eax
> > ret
> >
> >
> > Notice this `sar` and `shr` instructions, they are here due to the
> tagging.
> > The GCC compiler was also able to leverage the conditional move
> instruction
> > `cmove`, though it looks nice I'm not sure how much it actually won.
> There
> > is no need to use flambda for the recursive and for-loop as they are
> already
> > optimized to the maximum.
> >
> > Concerning your stream implementation, you may compare it to the Core
> > Kernel's Sequence. It is basically the same, so you can play the
> > benchmarking game :)
> >
> > The general way to implement a stream with a zero performance penalty
> (i.e.,
> > as fast as a for loop) is to use staging. This topic is explored in
> detail
> > in the Oleg Kiselyov et alas Strymonad [1] paper. The implementation, as
> > well as other interesting discussions about streams, are available on his
> > homepage [2] (there are tons of very interesting stuff there!).
> >
> > Best,
> > Ivan
> >
> > [1]: http://okmij.org/ftp/meta-programming/strymonas.pdf
> > [2]: http://okmij.org/ftp/Streams.html
> >
> >
> >
> >
> >
> > On Mon, Jul 10, 2017 at 4:26 PM, Alexey Egorov <alex.only.d@gmail.com>
> > wrote:
> >>
> >> Hello,
> >>
> >> Michael Snoyman recently posted very interesting article about Haskell
> >> streams and Rust iterators (
> >> https://www.fpcomplete.com/blog/2017/07/iterators-streams-rust-haskell
> >> ) and I tried to do the same in OCaml.
> >>
> >> You can find my code here -
> >> https://gist.github.com/anonymous/127e9116b362d561c5dfa9cce6b453f3
> >>
> >> There is four different versions:
> >>
> >> 1) c_loop.c - plain C loop
> >> 2) f_loop.ml - for-loop in OCaml
> >> 3) loop.ml - OCaml loop using recursion
> >> 4) stream.ml - OCaml loop using streams
> >>
> >> Here is the results:
> >>
> >> c_loop.c:   0m0.105s
> >> f_loop.ml:  0m0.184s
> >> loop.ml:     0m0.207s
> >> stream.ml: 0m0.605s
> >>
> >> It's not suprise that C version is fastest - at least, it uses untagged
> >> ints.
> >> What surprises me is that recursion is slower than for-loop - can't
> >> OCaml compile it down to the same imperative loop?
> >>
> >> And it's very dissapointing that stream version is 3 times slower than
> >> recursion-based.
> >> The problem, I believe, is that OCaml is unable to optimize
> >> intermediate stream constructors away (as GHC do).
> >>
> >> I tried to add inline annotations, which helps a little - is there
> >> something I can do to improve it further?
> >>
> >> --
> >> Caml-list mailing list.  Subscription management and archives:
> >> https://sympa.inria.fr/sympa/arc/caml-list
> >> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> >> Bug reports: http://caml.inria.fr/bin/caml-bugs
> >
> >
>