On Mon, Jul 10, 2017 at 6:28 PM, Alexey Egorov wrote: > Hi Ivan, > > thanks for your analysis! > > Which version of OCaml you are using? I'm tried on 4.04.1 and trunk > (with and without flambda) and still getting the same (poor) > performance for recursive version, no matter if curried or uncurried > loop is used. > It works the same on any version of OCaml. Are you sure, that you're actually recompiling the code? But if you insist, then I was trying on 4.03.0 and 4.04.0+flambda. P.S. I would like to retract my statement that the recursive version is 10% faster, in fact it is the same, it was a measurement error. > > 2017-07-11 2:29 GMT+05:00 Ivan Gotovchits : > > Hi Alexey, > > > > The recursion version is slower because you're allocating a tuple on each > > recursive call. Try to use a curried form of a function, and you will a > > performance that is even better than the for-loop (10% better in my > case). > > E.g., > > > > > > > > let loop high = > > let rec loop i = function > > | t when i > high -> t > > | t when i mod 2 = 0 -> loop (i + 1) (t + i * 2) > > | t -> loop (i + 1) t in > > loop 1 0 > > > > > > The for-loop version is indeed slower than the C version because of > tagging, > > c.f., the OCaml implementation: > > > > movq $1, %rbx > > movq $3, %rdi > > cmpq %rax, %rdi > > jg .L100 > > > > > > .L101: > > movq %rdi, %rsi > > sarq $1, %rsi > > movq %rsi, %rdx > > shr movq $1, %rbx > > movq $3, %rdi > > cmpq %rax, %rdi > > jg .L100 > > > > .L101: > > movq %rdi, %rsi > > sarq $1, %rsi > > moq $63, %rdx > > movq %rsi, %rcx > > addq %rdx, %rcx > > andq $-2, %rcx > > subq %rcx, %rsi > > leaq 1(%rsi,%rsi), %rsi > > cmpq $1, %rsi > > jne .L102 > > leaq -2(%rbx,%rdi,2), %rbx > > .L102: > > movq %rdi, %rsi > > addq $2, %rdi > > cmpq %rax, %rsi > > jne .L101 > > .L100: > > movq %rbx, %rax > > ret > > > > > > With the GCC output: > > > > > > testl %edi, %edi > > jle .L6 > > addl $1, %edi > > movl $4, %ecx > > movl $1, %edx > > xorl %eax, %eax > > jmp .L3 > > .L5: > > leaq (%rax,%rcx), %rsi > > testb $1, %dl > > cmove %rsi, %rax > > addq $2, %rcx > > .L3: > > addl $1, %edx > > cmpl %edi, %edx > > jne .L5 > > rep ret > > .L6: > > xorl %eax, %eax > > ret > > > > > > Notice this `sar` and `shr` instructions, they are here due to the > tagging. > > The GCC compiler was also able to leverage the conditional move > instruction > > `cmove`, though it looks nice I'm not sure how much it actually won. > There > > is no need to use flambda for the recursive and for-loop as they are > already > > optimized to the maximum. > > > > Concerning your stream implementation, you may compare it to the Core > > Kernel's Sequence. It is basically the same, so you can play the > > benchmarking game :) > > > > The general way to implement a stream with a zero performance penalty > (i.e., > > as fast as a for loop) is to use staging. This topic is explored in > detail > > in the Oleg Kiselyov et alas Strymonad [1] paper. The implementation, as > > well as other interesting discussions about streams, are available on his > > homepage [2] (there are tons of very interesting stuff there!). > > > > Best, > > Ivan > > > > [1]: http://okmij.org/ftp/meta-programming/strymonas.pdf > > [2]: http://okmij.org/ftp/Streams.html > > > > > > > > > > > > On Mon, Jul 10, 2017 at 4:26 PM, Alexey Egorov > > wrote: > >> > >> Hello, > >> > >> Michael Snoyman recently posted very interesting article about Haskell > >> streams and Rust iterators ( > >> https://www.fpcomplete.com/blog/2017/07/iterators-streams-rust-haskell > >> ) and I tried to do the same in OCaml. > >> > >> You can find my code here - > >> https://gist.github.com/anonymous/127e9116b362d561c5dfa9cce6b453f3 > >> > >> There is four different versions: > >> > >> 1) c_loop.c - plain C loop > >> 2) f_loop.ml - for-loop in OCaml > >> 3) loop.ml - OCaml loop using recursion > >> 4) stream.ml - OCaml loop using streams > >> > >> Here is the results: > >> > >> c_loop.c: 0m0.105s > >> f_loop.ml: 0m0.184s > >> loop.ml: 0m0.207s > >> stream.ml: 0m0.605s > >> > >> It's not suprise that C version is fastest - at least, it uses untagged > >> ints. > >> What surprises me is that recursion is slower than for-loop - can't > >> OCaml compile it down to the same imperative loop? > >> > >> And it's very dissapointing that stream version is 3 times slower than > >> recursion-based. > >> The problem, I believe, is that OCaml is unable to optimize > >> intermediate stream constructors away (as GHC do). > >> > >> I tried to add inline annotations, which helps a little - is there > >> something I can do to improve it further? > >> > >> -- > >> Caml-list mailing list. Subscription management and archives: > >> https://sympa.inria.fr/sympa/arc/caml-list > >> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > >> Bug reports: http://caml.inria.fr/bin/caml-bugs > > > > >