From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by sympa.inria.fr (Postfix) with ESMTPS id 16D8280198 for ; Tue, 11 Jul 2017 14:40:11 +0200 (CEST) Authentication-Results: mail3-smtp-sop.national.inria.fr; spf=None smtp.pra=ivg@ieee.org; spf=Pass smtp.mailfrom=ivg@ieee.org; spf=None smtp.helo=postmaster@mail-yw0-f169.google.com Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of ivg@ieee.org) identity=pra; client-ip=209.85.161.169; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="ivg@ieee.org"; x-sender="ivg@ieee.org"; x-conformance=sidf_compatible Received-SPF: Pass (mail3-smtp-sop.national.inria.fr: domain of ivg@ieee.org designates 209.85.161.169 as permitted sender) identity=mailfrom; client-ip=209.85.161.169; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="ivg@ieee.org"; x-sender="ivg@ieee.org"; x-conformance=sidf_compatible; x-record-type="v=spf1" Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of postmaster@mail-yw0-f169.google.com) identity=helo; client-ip=209.85.161.169; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="ivg@ieee.org"; x-sender="postmaster@mail-yw0-f169.google.com"; x-conformance=sidf_compatible IronPort-PHdr: =?us-ascii?q?9a23=3A9gDUgxZyzjgY3gIn4VQ8Wvf/LSx+4OfEezUN459i?= =?us-ascii?q?sYplN5qZrsyzbnLW6fgltlLVR4KTs6sC0LuJ9fi4EUU7or+5+EgYd5JNUxJXwe?= =?us-ascii?q?43pCcHRPC/NEvgMfTxZDY7FskRHHVs/nW8LFQHUJ2mPw6arXK99yMdFQviPgRp?= =?us-ascii?q?OOv1BpTSj8Oq3Oyu5pHfeQtFiT6/bL9oMBm6sRjau9ULj4dlNqs/0AbCrGFSe+?= =?us-ascii?q?RRy2NoJFaTkAj568yt4pNt8Dletuw4+cJYXqr0Y6o3TbpDDDQ7KG81/9HktQPC?= =?us-ascii?q?TQSU+HQRVHgdnwdSDAjE6BH6WYrxsjf/u+Fg1iSWIdH6QLYpUjul8qlrVQToiD?= =?us-ascii?q?8ZODEl7GHZhMtwjKdBrxKgoRx03orYbY6ROfZ7eK7WYNEUSndbXstJVSNBDIOy?= =?us-ascii?q?YYUMAeQcI+hXs5LwqEESoRakHwSgGP/jxz1Oi3Tr3aM6yeMhEQTe0QIjAdIBqn?= =?us-ascii?q?LUp8j0OqcVTeC1y7fIwinDb/NXxTf985XDfxcgofGSUrJ9asvRxlcxGAzblFmQ?= =?us-ascii?q?rpblPzyM2+kLrmOV4e1gVee1hG4mrQF8ujmvxsE2ionInI0Z0F7E9T9hzIYxIt?= =?us-ascii?q?24T1Z7bcShEJtUry2aLJd2Qtk8TG5yvSY20LgGuZqjcCgFyZQn2x7fa+GcfISS?= =?us-ascii?q?/h3jU+ORLS94hX1/eLK/gBGy/VK8xe37U8m4yFhKoTJBktnLsXANzwbf6s2DSv?= =?us-ascii?q?dl8EehwSqA1wfW6uFcJUA7i7bbJIA7zrEskZoTtFzPHij1mEXzja+WdF8o+u+y?= =?us-ascii?q?6+ToZLjtu5ySN5dshw3gLqgjntazDOc4PwQUQmSW+Pmw2Kf+8UD4RLhHiOA9nL?= =?us-ascii?q?PDv5DAP8sbo7a0Aw9L3YYn7BayFzKm384ZnXkDNV5EeByGg5TwN1HAPfz1DPOy?= =?us-ascii?q?j06jkDdswPDGMbnhDYvXInffl7fheK5x609ayAUt0dBS/4xYBq0FLf7pWUL8tM?= =?us-ascii?q?bUAgI4PgCp2errFdRw24cGVWKKGKCZMafSsVGS5uIoJumBfI4VuCjyK/U+5v7h?= =?us-ascii?q?k2E2lkEHcamux5sXZ2i0Hu56LEWBfXrsntABHH8WsQUkSezqjESOUTpSZ3apQ6?= =?us-ascii?q?Ix/So7CYKjDYfbXI+hmr2B3CGhHp1XfG9KEF6MEW27P7mDDt4IZTKfM4dMnTse?= =?us-ascii?q?UqbpH4Yl2AHoswn+2vxrBuXR8ywc85nk0Y4myffUkEQT6zVyR+uaz2aTRGF1gn?= =?us-ascii?q?hAEz4o04h+rEFwjFCZ3v4r0LRjCdVP6qYRAU8BPpnGwrkmWt0=3D?= X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: =?us-ascii?q?A0D0AADGxmRZhqmhVdFdHAEBBAEBCgEBF?= =?us-ascii?q?wEBBAEBCgEBhBOBFAeFH5pQgkKFbI8HA1wshUoCgzEHQxQBAQEBAQEBAQEBARI?= =?us-ascii?q?BAQEICwsIKC+CMwyCWQEBAQEBAQEjBBkBASwLAQQLCwsDCg0dAgIhARIBBQEKE?= =?us-ascii?q?gYTEg2JeAMIBQgQnww/ix9rgWw6gwUBAQWEJQMKg3gBAQEBAQUBAQEBAQEBGQg?= =?us-ascii?q?SgxYvgx6DeYEMglcpgheCZoJhiAEMjzOHLjuHSIcLS4RugiU+j0KLfYgCFB+BF?= =?us-ascii?q?Q8ngSuBAghJGgaEOiofgg8kNgEEiCkBAQE?= X-IPAS-Result: =?us-ascii?q?A0D0AADGxmRZhqmhVdFdHAEBBAEBCgEBFwEBBAEBCgEBhBO?= =?us-ascii?q?BFAeFH5pQgkKFbI8HA1wshUoCgzEHQxQBAQEBAQEBAQEBARIBAQEICwsIKC+CM?= =?us-ascii?q?wyCWQEBAQEBAQEjBBkBASwLAQQLCwsDCg0dAgIhARIBBQEKEgYTEg2JeAMIBQg?= =?us-ascii?q?Qnww/ix9rgWw6gwUBAQWEJQMKg3gBAQEBAQUBAQEBAQEBGQgSgxYvgx6DeYEMg?= =?us-ascii?q?lcpgheCZoJhiAEMjzOHLjuHSIcLS4RugiU+j0KLfYgCFB+BFQ8ngSuBAghJGga?= =?us-ascii?q?EOiofgg8kNgEEiCkBAQE?= X-IronPort-AV: E=Sophos;i="5.40,346,1496095200"; d="scan'208,217";a="231206112" Received: from mail-yw0-f169.google.com ([209.85.161.169]) by mail3-smtp-sop.national.inria.fr with ESMTP/TLS/AES128-GCM-SHA256; 11 Jul 2017 14:40:09 +0200 Received: by mail-yw0-f169.google.com with SMTP id v193so48167541ywg.2 for ; Tue, 11 Jul 2017 05:40:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ieee-org.20150623.gappssmtp.com; s=20150623; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=+z9OBlKpQMprXI6l6F9hoSN4MRWNcu07tlcJKGwkL/s=; b=nEzSzhVMlkcuneoeNOg8WdC2boUwhEUhN7dgrnujJ8Q3bMCIfpn74/oHY80HzFaOic NpP3SJ5bMuJlSJnU50bYifV9gU7qmfUZaTmFA5Xv4kjsL7yYeomwEWa4l7WTC+YHBwyU j+4W7JIml11M5ovzDtsppq0zJf6fFFvcrWPJctN00icKfICBh6qpLSbN0EylwqZu0eKm 5oR9CpN4KAEDO8BGPdKp0ldJliroOfrrpmoorNIIwa7djJyrh6KhffrB+I1QgQpNZF0A mldZEkZ4HoPLMNyjslRvw/069pN3unmA3N824fr4yiyNJiklM2IkO4fwDCRyJAwwaHpM KepQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=+z9OBlKpQMprXI6l6F9hoSN4MRWNcu07tlcJKGwkL/s=; b=ILeuk7ceRQNdyw5pKwSXGb5CsNYyjO08/pAcUVh4VEI9bCI4q5r0liljX9H9bzarwF ZlFMP3Yg5AXVq9qNwqBAZLX9EpwYvh/qjrgsrEYFWvq89ryAurlmBnVZBZQhDbIRaVuH F947nghIZzYMa3cgcG6DSYWg/toDTJPT/4nQOYqGqPfrHbZ55Eaf1WrPLtTxK3fkR1r+ UJ5Z6tKec303YBMBqXYVkukI7XMeWGcZWHQkR4UhyH7gnlR6XOK7MtNi5WR2n9yG5AmB dwRWFmsrdhIKTPTGxuiKQD53ft1JOv74MpA+XdceMIP79YWTsk5qhwz31JabMHv4gbj0 OF7w== X-Gm-Message-State: AIVw111WnzhNe1nSCFZGemeptXppcYSIU/QjsEB/7h+vXUNmsGLqB6Sc jy2oZgnkKrxbhwokfkGFRIJeZh/4Ye/J X-Received: by 10.129.107.198 with SMTP id g189mr4314891ywc.314.1499776807279; Tue, 11 Jul 2017 05:40:07 -0700 (PDT) MIME-Version: 1.0 Received: by 10.129.39.7 with HTTP; Tue, 11 Jul 2017 05:40:06 -0700 (PDT) In-Reply-To: References: From: Ivan Gotovchits Date: Tue, 11 Jul 2017 08:40:06 -0400 Message-ID: To: Alexey Egorov Cc: caml users Content-Type: multipart/alternative; boundary="001a11473b36d76cc8055409ff6a" Subject: Re: [Caml-list] Optimizing pure-functional streams --001a11473b36d76cc8055409ff6a Content-Type: text/plain; charset="UTF-8" On Mon, Jul 10, 2017 at 6:28 PM, Alexey Egorov wrote: > Hi Ivan, > > thanks for your analysis! > > Which version of OCaml you are using? I'm tried on 4.04.1 and trunk > (with and without flambda) and still getting the same (poor) > performance for recursive version, no matter if curried or uncurried > loop is used. > It works the same on any version of OCaml. Are you sure, that you're actually recompiling the code? But if you insist, then I was trying on 4.03.0 and 4.04.0+flambda. P.S. I would like to retract my statement that the recursive version is 10% faster, in fact it is the same, it was a measurement error. > > 2017-07-11 2:29 GMT+05:00 Ivan Gotovchits : > > Hi Alexey, > > > > The recursion version is slower because you're allocating a tuple on each > > recursive call. Try to use a curried form of a function, and you will a > > performance that is even better than the for-loop (10% better in my > case). > > E.g., > > > > > > > > let loop high = > > let rec loop i = function > > | t when i > high -> t > > | t when i mod 2 = 0 -> loop (i + 1) (t + i * 2) > > | t -> loop (i + 1) t in > > loop 1 0 > > > > > > The for-loop version is indeed slower than the C version because of > tagging, > > c.f., the OCaml implementation: > > > > movq $1, %rbx > > movq $3, %rdi > > cmpq %rax, %rdi > > jg .L100 > > > > > > .L101: > > movq %rdi, %rsi > > sarq $1, %rsi > > movq %rsi, %rdx > > shr movq $1, %rbx > > movq $3, %rdi > > cmpq %rax, %rdi > > jg .L100 > > > > .L101: > > movq %rdi, %rsi > > sarq $1, %rsi > > moq $63, %rdx > > movq %rsi, %rcx > > addq %rdx, %rcx > > andq $-2, %rcx > > subq %rcx, %rsi > > leaq 1(%rsi,%rsi), %rsi > > cmpq $1, %rsi > > jne .L102 > > leaq -2(%rbx,%rdi,2), %rbx > > .L102: > > movq %rdi, %rsi > > addq $2, %rdi > > cmpq %rax, %rsi > > jne .L101 > > .L100: > > movq %rbx, %rax > > ret > > > > > > With the GCC output: > > > > > > testl %edi, %edi > > jle .L6 > > addl $1, %edi > > movl $4, %ecx > > movl $1, %edx > > xorl %eax, %eax > > jmp .L3 > > .L5: > > leaq (%rax,%rcx), %rsi > > testb $1, %dl > > cmove %rsi, %rax > > addq $2, %rcx > > .L3: > > addl $1, %edx > > cmpl %edi, %edx > > jne .L5 > > rep ret > > .L6: > > xorl %eax, %eax > > ret > > > > > > Notice this `sar` and `shr` instructions, they are here due to the > tagging. > > The GCC compiler was also able to leverage the conditional move > instruction > > `cmove`, though it looks nice I'm not sure how much it actually won. > There > > is no need to use flambda for the recursive and for-loop as they are > already > > optimized to the maximum. > > > > Concerning your stream implementation, you may compare it to the Core > > Kernel's Sequence. It is basically the same, so you can play the > > benchmarking game :) > > > > The general way to implement a stream with a zero performance penalty > (i.e., > > as fast as a for loop) is to use staging. This topic is explored in > detail > > in the Oleg Kiselyov et alas Strymonad [1] paper. The implementation, as > > well as other interesting discussions about streams, are available on his > > homepage [2] (there are tons of very interesting stuff there!). > > > > Best, > > Ivan > > > > [1]: http://okmij.org/ftp/meta-programming/strymonas.pdf > > [2]: http://okmij.org/ftp/Streams.html > > > > > > > > > > > > On Mon, Jul 10, 2017 at 4:26 PM, Alexey Egorov > > wrote: > >> > >> Hello, > >> > >> Michael Snoyman recently posted very interesting article about Haskell > >> streams and Rust iterators ( > >> https://www.fpcomplete.com/blog/2017/07/iterators-streams-rust-haskell > >> ) and I tried to do the same in OCaml. > >> > >> You can find my code here - > >> https://gist.github.com/anonymous/127e9116b362d561c5dfa9cce6b453f3 > >> > >> There is four different versions: > >> > >> 1) c_loop.c - plain C loop > >> 2) f_loop.ml - for-loop in OCaml > >> 3) loop.ml - OCaml loop using recursion > >> 4) stream.ml - OCaml loop using streams > >> > >> Here is the results: > >> > >> c_loop.c: 0m0.105s > >> f_loop.ml: 0m0.184s > >> loop.ml: 0m0.207s > >> stream.ml: 0m0.605s > >> > >> It's not suprise that C version is fastest - at least, it uses untagged > >> ints. > >> What surprises me is that recursion is slower than for-loop - can't > >> OCaml compile it down to the same imperative loop? > >> > >> And it's very dissapointing that stream version is 3 times slower than > >> recursion-based. > >> The problem, I believe, is that OCaml is unable to optimize > >> intermediate stream constructors away (as GHC do). > >> > >> I tried to add inline annotations, which helps a little - is there > >> something I can do to improve it further? > >> > >> -- > >> Caml-list mailing list. Subscription management and archives: > >> https://sympa.inria.fr/sympa/arc/caml-list > >> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > >> Bug reports: http://caml.inria.fr/bin/caml-bugs > > > > > --001a11473b36d76cc8055409ff6a Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On M= on, Jul 10, 2017 at 6:28 PM, Alexey Egorov <alex.only.d@gmail.com&= gt; wrote:
Hi Ivan,

thanks for your analysis!

Which version of OCaml you are using? I'm tried on 4.04.1 and trunk
(with and without flambda) and still getting the same (poor)
performance for recursive version, no matter if curried or uncurried
loop is used.

It works the same = on any version of OCaml. Are you sure, that you're actually recompiling= the code?=C2=A0
But if you insist, then I was trying on 4.03.0 a= nd 4.04.0+flambda.=C2=A0

P.S. I would like to retr= act my statement that the recursive version is 10% faster, in fact it is th= e same, it was a measurement error.=C2=A0
=C2=A0

2017-07-11 2:29 GMT+05:00 Ivan Gotovchits <ivg@ieee.org>:
> Hi Alexey,
>
> The recursion version is slower because you're allocating a tuple = on each
> recursive call. Try to use a curried form of a function, and you will = a
> performance that is even better than the for-loop (10% better in my ca= se).
> E.g.,
>
>
>
> let loop high =3D
>=C2=A0 =C2=A0let rec loop i =3D function
>=C2=A0 =C2=A0 =C2=A0| t when i > high -> t
>=C2=A0 =C2=A0 =C2=A0| t when i mod 2 =3D 0 -> loop (i + 1) (t + i * = 2)
>=C2=A0 =C2=A0 =C2=A0| t -> loop (i + 1) t in
>=C2=A0 =C2=A0loop 1 0
>
>
> The for-loop version is indeed slower than the C version because of ta= gging,
> c.f., the OCaml implementation:
>
> movq $1, %rbx
> movq $3, %rdi
> cmpq %rax, %rdi
> jg .L100
>
>
> .L101:
> movq %rdi, %rsi
> sarq $1, %rsi
> movq %rsi, %rdx
> shr movq $1, %rbx
> movq $3, %rdi
> cmpq %rax, %rdi
> jg .L100
>
> .L101:
> movq %rdi, %rsi
> sarq $1, %rsi
> moq $63, %rdx
> movq %rsi, %rcx
> addq %rdx, %rcx
> andq $-2, %rcx
> subq %rcx, %rsi
> leaq 1(%rsi,%rsi), %rsi
> cmpq $1, %rsi
> jne .L102
> leaq -2(%rbx,%rdi,2), %rbx
> .L102:
> movq %rdi, %rsi
> addq $2, %rdi
> cmpq %rax, %rsi
> jne .L101
> .L100:
> movq %rbx, %rax
> ret
>
>
> With the GCC output:
>
>
> testl %edi, %edi
> jle .L6
> addl $1, %edi
> movl $4, %ecx
> movl $1, %edx
> xorl %eax, %eax
> jmp .L3
> .L5:
> leaq (%rax,%rcx), %rsi
> testb $1, %dl
> cmove %rsi, %rax
> addq $2, %rcx
> .L3:
> addl $1, %edx
> cmpl %edi, %edx
> jne .L5
> rep ret
> .L6:
> xorl %eax, %eax
> ret
>
>
> Notice this `sar` and `shr` instructions, they are here due to the tag= ging.
> The GCC compiler was also able to leverage the conditional move instru= ction
> `cmove`, though it looks nice I'm not sure how much it actually wo= n. There
> is no need to use flambda for the recursive and for-loop as they are a= lready
> optimized to the maximum.
>
> Concerning your stream implementation, you may compare it to the Core<= br> > Kernel's Sequence. It is basically the same, so you can play the > benchmarking game :)
>
> The general way to implement a stream with a zero performance penalty = (i.e.,
> as fast as a for loop) is to use staging. This topic is explored in de= tail
> in the Oleg Kiselyov et alas Strymonad [1] paper. The implementation, = as
> well as other interesting discussions about streams, are available on = his
> homepage [2] (there are tons of very interesting stuff there!).
>
> Best,
> Ivan
>
> [1]: http://okmij.org/ftp/meta-programm= ing/strymonas.pdf
> [2]: http://okmij.org/ftp/Streams.html
>
>
>
>
>
> On Mon, Jul 10, 2017 at 4:26 PM, Alexey Egorov <alex.only.d@gmail.com>
> wrote:
>>
>> Hello,
>>
>> Michael Snoyman recently posted very interesting article about Has= kell
>> streams and Rust iterators (
>> https://www.fpcomplet= e.com/blog/2017/07/iterators-streams-rust-haskell
>> ) and I tried to do the same in OCaml.
>>
>> You can find my code here -
>> https://gist.github.com/<= wbr>anonymous/127e9116b362d561c5dfa9cce6b453f3
>>
>> There is four different versions:
>>
>> 1) c_loop.c - plain C loop
>> 2) f_loop.ml - for-loop in OCaml
>> 3) loop.ml - OCaml loop using recursion
>> 4) stream.ml - OCaml loop using streams
>>
>> Here is the results:
>>
>> c_loop.c:=C2=A0 =C2=A00m0.105s
>> = f_loop.ml:=C2=A0 0m0.184s
>> lo= op.ml:=C2=A0 =C2=A0 =C2=A00m0.207s
>> = stream.ml: 0m0.605s
>>
>> It's not suprise that C version is fastest - at least, it uses= untagged
>> ints.
>> What surprises me is that recursion is slower than for-loop - can&= #39;t
>> OCaml compile it down to the same imperative loop?
>>
>> And it's very dissapointing that stream version is 3 times slo= wer than
>> recursion-based.
>> The problem, I believe, is that OCaml is unable to optimize
>> intermediate stream constructors away (as GHC do).
>>
>> I tried to add inline annotations, which helps a little - is there=
>> something I can do to improve it further?
>>
>> --
>> Caml-list mailing list.=C2=A0 Subscription management and archives= :
>> https://sympa.inria.fr/sympa/arc/caml-list
>> Beginner's list:
http://groups.yahoo.com/g= roup/ocaml_beginners
>> Bug reports: http://caml.inria.fr/bin/caml-bugs<= br> >
>

--001a11473b36d76cc8055409ff6a--