* Value types (Was: [Caml-list] ocamlopt LLVM support) @ 2010-12-12 14:54 Jon Harrop 2010-12-12 15:55 ` Török Edwin 2010-12-12 19:53 ` Value types (Was: [Caml-list] ocamlopt LLVM support) Brian Hurt 0 siblings, 2 replies; 21+ messages in thread From: Jon Harrop @ 2010-12-12 14:54 UTC (permalink / raw) To: caml-list The Haskell guys got their best performance improvement moving to LLVM from the hailstone benchmark so it is interesting to examine this case as well. I just added support for 64-bit ints to HLVM to implement that benchmark and my code is: let rec collatzLen ((c, n): int * int64) : int = if n = 1L then c else collatzLen (c+1, if n % 2L = 0L then n / 2L else 3L * n + 1L);; let rec loop ((i, (nlen, n)): int64 * (int * int64)) : int64 = if i = 1L then n else let ilen = collatzLen(1, i) in let nlen, n = if ilen > nlen then ilen, i else nlen, n in loop (i-1L, (nlen, n));; When run without using LLVM's optimization passes, this produces the following output for 10k, 100k and 1M, respectively: - : `Int64 = 6171L Live: 0 0.00704098s total; 0s suspend; 0s mark; 0s sweep - : `Int64 = 77031L Live: 0 0.087815s total; 0s suspend; 0s mark; 0s sweep - : `Int64 = 837799L Live: 0 1.06907s total; 0s suspend; 0s mark; 0s sweep With LLVM's default optimization passes enabled, I get: - : `Int64 = 6171L Live: 0 0.00508595s total; 0s suspend; 0s mark; 0s sweep - : `Int64 = 77031L Live: 0 0.0626569s total; 0s suspend; 0s mark; 0s sweep - : `Int64 = 837799L Live: 0 0.759738s total; 0s suspend; 0s mark; 0s sweep Note the ~30% performance improvement in this case. In other cases, LLVM's default optimization passes can degrade performance significantly. Heres the equivalent OCaml code: let rec collatzLen(c, n) : int = if n = 1L then c else collatzLen (c+1, if Int64.rem n 2L = 0L then Int64.div n 2L else Int64.add (Int64.mul 3L n) 1L);; let rec loop(i, (nlen, n)) = if i = 1L then n else let ilen = collatzLen(1, i) in let nlen, n = if ilen > nlen then ilen, i else nlen, n in loop (Int64.sub i 1L, (nlen, n));; and Haskell: import Data.Int collatzLen :: Int -> Int64 -> Int collatzLen c 1 = c collatzLen c n = collatzLen (c+1) $ if n `mod` 2 == 0 then n `div` 2 else 3*n+1 pmax x n = x `max` (collatzLen 1 n, n) main = print $ foldl pmax (1,1) [2..1000000] and C99: #include <stdio.h> int collatzLen(int c, long long n) { return (n==1 ? c : collatzLen(c+1, (n%2==0 ? n/2 : 3*n+1))); } long long loop(long long m) { int nlen=1; long long n = 1; for (long long i=2; i<=m; ++i) { const int ilen = collatzLen(1, i); if (ilen > nlen) { nlen = ilen; n = i; } } return n; } int main() { printf("%lld", loop(1000000)); } And here are my timings: OCaml: 24.3s ocamlopt Haskell: 24.0s ghc6.10 --make -O2 F#.NET: 3.45s fsc --optimize+ --platform:x86 C: 1.20s gcc -O3 -std=c99 HLVM: 1.07s ./repl --compile HLVM (opt): 0.76s opt -tailcallelim -std-compile-opts These results really demonstrate two things: 1. Unboxing can give huge performance improvements on serial code, let alone parallel code. The optimized HLVM is running 32× faster than the OCaml here. 2. LLVM makes it easy to JIT fast code from OCaml. HLVM is using it to beat GCC-compiled C code here. -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Value types (Was: [Caml-list] ocamlopt LLVM support) 2010-12-12 14:54 Value types (Was: [Caml-list] ocamlopt LLVM support) Jon Harrop @ 2010-12-12 15:55 ` Török Edwin 2010-12-12 17:14 ` Jon Harrop ` (2 more replies) 2010-12-12 19:53 ` Value types (Was: [Caml-list] ocamlopt LLVM support) Brian Hurt 1 sibling, 3 replies; 21+ messages in thread From: Török Edwin @ 2010-12-12 15:55 UTC (permalink / raw) To: Jon Harrop; +Cc: caml-list On Sun, 12 Dec 2010 14:54:14 -0000 "Jon Harrop" <jon@ffconsultancy.com> wrote: > The Haskell guys got their best performance improvement moving to > LLVM from the hailstone benchmark so it is interesting to examine > this case as well. I just added support for 64-bit ints to HLVM to > implement that benchmark and my code is: > > Here’s the equivalent OCaml code: > > let rec collatzLen(c, n) : int = > if n = 1L then c else > collatzLen (c+1, if Int64.rem n 2L = 0L then Int64.div n 2L else OK, but boxing itself has nothing to do with the performance degration here. It is the lack of compiler optimizations on the Int64 type. This could be solved by implementing compiler optimizations for it (or manually optimizing some integer arithmetic that is slow). Lets run the code under a profiler, or look at the assembly (I used 'perf record' and 'perf report'). 2 'idiv' instructions turn up as top offenders in the profile. Problem #1: Int64.rem n 2 -> another idiv instruction A C compiler would optimize this to an 'and' instruction. Change that to 'Int64.logand n 1L = 0L'/ Problem #2: Int64.div n 2 -> idiv instruction. A C compiler would optimize this to a right shift. Changing that to 'Int64.shift_right n 1' speeds up the code. With these changes I get almost the same speed as the C code: $ ocamlopt x.ml -o x && time ./x 837799 real 0m0.664s user 0m0.667s sys 0m0.000s $ gcc -O3 x.c && time ./a.out 837799 real 0m0.635s user 0m0.633s sys 0m0.000s Here's the OCaml code: let rec collatzLen(c, n) : int = if n = 1L then c else collatzLen (c+1, if Int64.logand n 1L = 0L then Int64.shift_right n 1 else Int64.add (Int64.mul 3L n) 1L);; let rec loop(i, (nlen, n)) = if i = 1L then n else let ilen = collatzLen(1, i) in let nlen, n = if ilen > nlen then ilen, i else nlen, n in loop (Int64.sub i 1L, (nlen, n));; let _ = let s = loop (1000000L, (1,1000000L)) in print_int (Int64.to_int s);; > 1. Unboxing can give huge performance improvements on serial code, s/Unboxing/arithmetic optimizations/ Please find an example where the performance benefit is due to unboxing, and not due to arithmetic optimizations performed on the unboxed code. > let alone parallel code. The optimized HLVM is running 32× faster > than the OCaml here. > > 2. LLVM makes it easy to JIT fast code from OCaml. HLVM is using it > to beat GCC-compiled C code here. > One advantage of using LLVM is that it would notice arithmetic optimizations like this and perform it itself (even if you use the boxed representation). Best regards, --Edwin ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: Value types (Was: [Caml-list] ocamlopt LLVM support) 2010-12-12 15:55 ` Török Edwin @ 2010-12-12 17:14 ` Jon Harrop 2010-12-12 17:26 ` Török Edwin 2010-12-12 19:09 ` Benedikt Meurer 2010-12-14 9:54 ` Value types Goswin von Brederlow 2 siblings, 1 reply; 21+ messages in thread From: Jon Harrop @ 2010-12-12 17:14 UTC (permalink / raw) To: 'Török Edwin'; +Cc: caml-list Török Edwin wrote: > Problem #1: Int64.rem n 2 -> another idiv instruction > > A C compiler would optimize this to an 'and' instruction. > Change that to 'Int64.logand n 1L = 0L'/ Yes. LLVM did that for me. > Problem #2: Int64.div n 2 -> idiv instruction. > > A C compiler would optimize this to a right shift. Changing that to > 'Int64.shift_right n 1' speeds > up the code. Yes. LLVM also did that for me. In fact, I have been bitten by ocamlopt not optimizing div and mod by a constant in real OCaml code before. This problem also turns up in the context of hash table implementations where you want to % by the length of the spine. > With these changes I get almost the same speed as the C code: > $ ocamlopt x.ml -o x && time ./x > 837799 > real 0m0.664s > user 0m0.667s > sys 0m0.000s > > $ gcc -O3 x.c && time ./a.out > 837799 > real 0m0.635s > user 0m0.633s > sys 0m0.000s > > Here's the OCaml code: > let rec collatzLen(c, n) : int = > if n = 1L then c else > collatzLen (c+1, if Int64.logand n 1L = 0L then Int64.shift_right > n 1 else Int64.add (Int64.mul 3L n) 1L);; > > let rec loop(i, (nlen, n)) = > if i = 1L then n else > let ilen = collatzLen(1, i) in > let nlen, n = if ilen > nlen then ilen, i else nlen, n in > loop (Int64.sub i 1L, (nlen, n));; > > let _ = > let s = loop (1000000L, (1,1000000L)) in > print_int (Int64.to_int s);; I am unable to reproduce your results. Here, the time falls from 24s to 19.5s (using ocamlopt 3.12.0 on Intel x86) which is still 26× slower than HLVM. > > 1. Unboxing can give huge performance improvements on serial code, > > s/Unboxing/arithmetic optimizations/ > Please find an example where the performance benefit is due to > unboxing, and not due to arithmetic optimizations performed on the > unboxed code. The last example I gave (array of key-value pairs) demonstrates some of the performance improvements offered by unboxing in the heap (12.3× faster than OCaml in that case). I'm still not sure that this example is invalid because I cannot reproduce your results. > > let alone parallel code. The optimized HLVM is running 32× faster > > than the OCaml here. > > > > 2. LLVM makes it easy to JIT fast code from OCaml. HLVM is using it > > to beat GCC-compiled C code here. > > One advantage of using LLVM is that it would notice arithmetic > optimizations like this and perform it itself (even if you use the > boxed representation). Yes. LLVM hopefully optimizes div/mod by any constant which is quite tricky in the general case. Cheers, Jon. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Value types (Was: [Caml-list] ocamlopt LLVM support) 2010-12-12 17:14 ` Jon Harrop @ 2010-12-12 17:26 ` Török Edwin 2010-12-12 18:01 ` Jon Harrop 0 siblings, 1 reply; 21+ messages in thread From: Török Edwin @ 2010-12-12 17:26 UTC (permalink / raw) To: Jon Harrop; +Cc: caml-list [-- Attachment #1: Type: text/plain, Size: 2265 bytes --] On Sun, 12 Dec 2010 17:14:45 -0000 "Jon Harrop" <jon@ffconsultancy.com> wrote: > Török Edwin wrote: > > Problem #1: Int64.rem n 2 -> another idiv instruction > > > > A C compiler would optimize this to an 'and' instruction. > > Change that to 'Int64.logand n 1L = 0L'/ > > Yes. LLVM did that for me. > > > Problem #2: Int64.div n 2 -> idiv instruction. > > > > A C compiler would optimize this to a right shift. Changing that to > > 'Int64.shift_right n 1' speeds > > up the code. > > Yes. LLVM also did that for me. In fact, I have been bitten by > ocamlopt not optimizing div and mod by a constant in real OCaml code > before. This problem also turns up in the context of hash table > implementations where you want to % by the length of the spine. Do you really need to use Int64 for that though? Won't the 63-bit version do? > > > With these changes I get almost the same speed as the C code: > > $ ocamlopt x.ml -o x && time ./x > > 837799 > > real 0m0.664s > > user 0m0.667s > > sys 0m0.000s > > > > $ gcc -O3 x.c && time ./a.out > > 837799 > > real 0m0.635s > > user 0m0.633s > > sys 0m0.000s > > > > Here's the OCaml code: > > let rec collatzLen(c, n) : int = > > if n = 1L then c else > > collatzLen (c+1, if Int64.logand n 1L = 0L then > > Int64.shift_right n 1 else Int64.add (Int64.mul 3L n) 1L);; > > > > let rec loop(i, (nlen, n)) = > > if i = 1L then n else > > let ilen = collatzLen(1, i) in > > let nlen, n = if ilen > nlen then ilen, i else nlen, n in > > loop (Int64.sub i 1L, (nlen, n));; > > > > let _ = > > let s = loop (1000000L, (1,1000000L)) in > > print_int (Int64.to_int s);; > > I am unable to reproduce your results. Here, the time falls from 24s > to 19.5s (using ocamlopt 3.12.0 on Intel x86) which is still 26× > slower than HLVM. Do you still have 'idiv' in the compiled code? See my attached assembly, and compare it with yours please. I was doing the test on 64-bit, with ocamlopt 3.11.2 and 3.12.0. FWIW the original code took 2.8 seconds here, so only 4x slower (this is an AMD Phenom II x6 1090T CPU). It probably depends how fast/slow the 'idiv' is on your CPU. --Edwin [-- Attachment #2: x.s --] [-- Type: application/octet-stream, Size: 4971 bytes --] .section .rodata.cst8,"a",@progbits .align 16 caml_negf_mask: .quad 0x8000000000000000, 0 .align 16 caml_absf_mask: .quad 0x7FFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF .data .globl camlX__data_begin camlX__data_begin: .text .globl camlX__code_begin camlX__code_begin: .data .quad 2048 .globl camlX camlX: .space 16 .data .quad 3319 camlX__2: .quad caml_tuplify2 .quad -3 .quad camlX__loop_1033 .data .quad 3319 camlX__3: .quad caml_tuplify2 .quad -3 .quad camlX__collatzLen_1030 .data .quad 2048 camlX__1: .quad .L100004 .quad .L100005 .quad 2048 .L100005: .quad 3 .quad .L100006 .quad 2303 .L100006: .quad caml_int64_ops .quad 1000000 .quad 2303 .L100004: .quad caml_int64_ops .quad 1000000 .text .align 16 .globl camlX__collatzLen_1030 camlX__collatzLen_1030: subq $8, %rsp .L103: movq %rax, %rdi movq $1, %rsi movq 8(%rbx), %rax cmpq %rsi, %rax jne .L102 movq %rdi, %rax addq $8, %rsp ret .align 4 .L102: xorq %rdx, %rdx movq $1, %rsi movq 8(%rbx), %rax andq %rsi, %rax cmpq %rdx, %rax jne .L101 .L104: subq $24, %r15 movq caml_young_limit@GOTPCREL(%rip), %rax cmpq (%rax), %r15 jb .L105 leaq 8(%r15), %rcx movq $2303, -8(%rcx) movq caml_int64_ops@GOTPCREL(%rip), %rax movq %rax, (%rcx) movq 8(%rbx), %rax sarq $1, %rax movq %rax, 8(%rcx) jmp .L100 .align 4 .L101: .L107: subq $24, %r15 movq caml_young_limit@GOTPCREL(%rip), %rax cmpq (%rax), %r15 jb .L108 leaq 8(%r15), %rcx movq $2303, -8(%rcx) movq caml_int64_ops@GOTPCREL(%rip), %rax movq %rax, (%rcx) movq $1, %rdx movq 8(%rbx), %rsi movq $3, %rax imulq %rsi, %rax addq %rdx, %rax movq %rax, 8(%rcx) .L100: movq %rdi, %rax addq $2, %rax movq %rcx, %rbx jmp .L103 .L108: call caml_call_gc@PLT .L109: jmp .L107 .L105: call caml_call_gc@PLT .L106: jmp .L104 .type camlX__collatzLen_1030,@function .size camlX__collatzLen_1030,.-camlX__collatzLen_1030 .text .align 16 .globl camlX__loop_1033 camlX__loop_1033: subq $24, %rsp .L113: movq %rax, %rdx movq 8(%rbx), %rax movq $1, %rsi movq 8(%rdx), %rdi cmpq %rsi, %rdi jne .L112 addq $24, %rsp ret .align 4 .L112: movq %rax, 8(%rsp) movq %rdx, 16(%rsp) movq (%rbx), %rax movq %rax, 0(%rsp) movq $3, %rax movq %rdx, %rbx call camlX__collatzLen_1030@PLT .L114: movq %rax, %rsi movq 0(%rsp), %rbx cmpq %rbx, %rsi jle .L111 .L115: subq $24, %r15 movq caml_young_limit@GOTPCREL(%rip), %rax cmpq (%rax), %r15 jb .L116 leaq 8(%r15), %rdi movq $2048, -8(%rdi) movq %rsi, (%rdi) movq 16(%rsp), %rax movq %rax, 8(%rdi) jmp .L110 .align 4 .L111: .L118: subq $24, %r15 movq caml_young_limit@GOTPCREL(%rip), %rax cmpq (%rax), %r15 jb .L119 leaq 8(%r15), %rdi movq $2048, -8(%rdi) movq %rbx, (%rdi) movq 8(%rsp), %rax movq %rax, 8(%rdi) .L110: .L121: subq $48, %r15 movq caml_young_limit@GOTPCREL(%rip), %rax cmpq (%rax), %r15 jb .L122 leaq 8(%r15), %rbx movq $2048, -8(%rbx) movq (%rdi), %rax movq %rax, (%rbx) movq 8(%rdi), %rax movq %rax, 8(%rbx) leaq 24(%rbx), %rax movq $2303, -8(%rax) movq caml_int64_ops@GOTPCREL(%rip), %rdi movq %rdi, (%rax) movq $1, %rsi movq 16(%rsp), %rdi movq 8(%rdi), %rdi subq %rsi, %rdi movq %rdi, 8(%rax) jmp .L113 .L122: call caml_call_gc@PLT .L123: jmp .L121 .L119: call caml_call_gc@PLT .L120: jmp .L118 .L116: call caml_call_gc@PLT .L117: jmp .L115 .type camlX__loop_1033,@function .size camlX__loop_1033,.-camlX__loop_1033 .text .align 16 .globl camlX__entry camlX__entry: subq $8, %rsp .L124: movq camlX__3@GOTPCREL(%rip), %rbx movq camlX@GOTPCREL(%rip), %rax movq %rbx, (%rax) movq camlX__2@GOTPCREL(%rip), %rbx movq camlX@GOTPCREL(%rip), %rax movq %rbx, 8(%rax) movq camlX@GOTPCREL(%rip), %rax movq 8(%rax), %rbx movq camlX__1@GOTPCREL(%rip), %rax movq (%rbx), %rdi call *%rdi .L125: movq 8(%rax), %rax salq $1, %rax orq $1, %rax call camlPervasives__string_of_int_1130@PLT .L126: movq %rax, %rbx movq camlPervasives@GOTPCREL(%rip), %rax movq 184(%rax), %rax call camlPervasives__output_string_1191@PLT .L127: movq $1, %rax addq $8, %rsp ret .type camlX__entry,@function .size camlX__entry,.-camlX__entry .data .text .globl camlX__code_end camlX__code_end: .data .globl camlX__data_end camlX__data_end: .long 0 .globl camlX__frametable camlX__frametable: .quad 9 .quad .L127 .word 17 .word 0 .align 8 .long (.L200000 - .) + 0xe0000000 .long 0x168120 .quad .L126 .word 17 .word 0 .align 8 .long (.L200000 - .) + 0xe0000000 .long 0x168270 .quad .L125 .word 16 .word 0 .align 8 .quad .L123 .word 32 .word 2 .word 16 .word 5 .align 8 .quad .L120 .word 32 .word 3 .word 3 .word 8 .word 16 .align 8 .quad .L117 .word 32 .word 2 .word 16 .word 7 .align 8 .quad .L114 .word 32 .word 3 .word 0 .word 8 .word 16 .align 8 .quad .L109 .word 16 .word 2 .word 3 .word 5 .align 8 .quad .L106 .word 16 .word 2 .word 3 .word 5 .align 8 .L200000: .asciz "pervasives.ml" .align 8 .section .note.GNU-stack,"",%progbits ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: Value types (Was: [Caml-list] ocamlopt LLVM support) 2010-12-12 17:26 ` Török Edwin @ 2010-12-12 18:01 ` Jon Harrop 2010-12-12 18:22 ` Török Edwin 0 siblings, 1 reply; 21+ messages in thread From: Jon Harrop @ 2010-12-12 18:01 UTC (permalink / raw) To: 'Török Edwin'; +Cc: caml-list Török Edwin wrote: > Do you really need to use Int64 for that though? Won't the 63-bit > version do? I'm running 32-bit. > > I am unable to reproduce your results. Here, the time falls from 24s > > to 19.5s (using ocamlopt 3.12.0 on Intel x86) which is still 26× > > slower than HLVM. Sorry, I'm actually using an Opteron x86 (logged in from an Intel x86!). > Do you still have 'idiv' in the compiled code? See my attached > assembly, and compare it with yours please. > I was doing the test on 64-bit, with ocamlopt 3.11.2 and 3.12.0. I get what appear to be calls to C code: camlCollatz__collatzLen_1030: subl $8, %esp .L103: movl %eax, 4(%esp) movl %ebx, 0(%esp) pushl $camlCollatz__10 pushl %ebx movl $caml_equal, %eax call caml_c_call .L104: addl $8, %esp cmpl $1, %eax je .L102 movl 4(%esp), %eax addl $8, %esp ret .align 16 .L102: pushl $camlCollatz__8 movl 4(%esp), %eax pushl %eax movl $caml_int64_and, %eax call caml_c_call .L105: addl $8, %esp pushl $camlCollatz__9 pushl %eax movl $caml_equal, %eax call caml_c_call .L106: addl $8, %esp cmpl $1, %eax je .L101 pushl $3 movl 4(%esp), %eax pushl %eax movl $caml_int64_shift_right, %eax call caml_c_call .L107: addl $8, %esp movl %eax, %ebx jmp .L100 .align 16 .L101: movl 0(%esp), %eax pushl %eax pushl $camlCollatz__6 movl $caml_int64_mul, %eax call caml_c_call .L108: addl $8, %esp pushl $camlCollatz__7 pushl %eax movl $caml_int64_add, %eax call caml_c_call .L109: addl $8, %esp movl %eax, %ebx .L100: movl 4(%esp), %eax addl $2, %eax jmp .L103 > FWIW the original code took 2.8 seconds here, so only 4x slower (this > is an AMD Phenom II x6 1090T CPU). It probably depends how fast/slow > the 'idiv' is on your CPU. The performance of idiv is irrelevant here. The bottleneck may be those C calls but I don't understand why they are being generated. Cheers, Jon. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Value types (Was: [Caml-list] ocamlopt LLVM support) 2010-12-12 18:01 ` Jon Harrop @ 2010-12-12 18:22 ` Török Edwin 0 siblings, 0 replies; 21+ messages in thread From: Török Edwin @ 2010-12-12 18:22 UTC (permalink / raw) To: Jon Harrop; +Cc: caml-list On Sun, 12 Dec 2010 18:01:13 -0000 "Jon Harrop" <jon@ffconsultancy.com> wrote: > Török Edwin wrote: > > Do you really need to use Int64 for that though? Won't the 63-bit > > version do? > > I'm running 32-bit. That explains it, in a 32-bit chroot my modified version is still slow. I thought that 32-bit systems are not that relevant anymore (except for Windows, but then people start moving to 64-bit there also). > > > > I am unable to reproduce your results. Here, the time falls from > > > 24s to 19.5s (using ocamlopt 3.12.0 on Intel x86) which is still > > > 26× slower than HLVM. > > Sorry, I'm actually using an Opteron x86 (logged in from an Intel > x86!). > > > Do you still have 'idiv' in the compiled code? See my attached > > assembly, and compare it with yours please. > > I was doing the test on 64-bit, with ocamlopt 3.11.2 and 3.12.0. > > I get what appear to be calls to C code: > > camlCollatz__collatzLen_1030: > subl $8, %esp > .L103: > movl %eax, 4(%esp) > movl %ebx, 0(%esp) > pushl $camlCollatz__10 > pushl %ebx > movl $caml_equal, %eax > call caml_c_call Yes, that is quite bad. I don't know how OCaml's code generator works, but it looks like it calls the C implementation if the CPU doesn't support the operation directly. And since this is 32-bit you need all the extra pushes and movs to do actually call something. If only it could inline those calls, then it could optimize away most of the overhead (LLVM would help here again). > > > FWIW the original code took 2.8 seconds here, so only 4x slower > > (this is an AMD Phenom II x6 1090T CPU). It probably depends how > > fast/slow the 'idiv' is on your CPU. > > The performance of idiv is irrelevant here. The bottleneck may be > those C calls but I don't understand why they are being generated. I think for the same reason gcc has __udivdi3 in libgcc: there is no direct way of executing a 64-bit divide on a 32-bit machine, and it saves code space to do it in a function. However that doesn't make much sense for mul and add, which don't need that many instructions to implement on 32-bit. > > Cheers, > Jon. > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Value types (Was: [Caml-list] ocamlopt LLVM support) 2010-12-12 15:55 ` Török Edwin 2010-12-12 17:14 ` Jon Harrop @ 2010-12-12 19:09 ` Benedikt Meurer 2010-12-12 19:20 ` John Carr ` (3 more replies) 2010-12-14 9:54 ` Value types Goswin von Brederlow 2 siblings, 4 replies; 21+ messages in thread From: Benedikt Meurer @ 2010-12-12 19:09 UTC (permalink / raw) To: caml-list [-- Attachment #1: Type: text/plain, Size: 5708 bytes --] On Dec 12, 2010, at 16:55 , Török Edwin wrote: > On Sun, 12 Dec 2010 14:54:14 -0000 > "Jon Harrop" <jon@ffconsultancy.com> wrote: > >> The Haskell guys got their best performance improvement moving to >> LLVM from the hailstone benchmark so it is interesting to examine >> this case as well. I just added support for 64-bit ints to HLVM to >> implement that benchmark and my code is: >> >> Here’s the equivalent OCaml code: >> >> let rec collatzLen(c, n) : int = >> if n = 1L then c else >> collatzLen (c+1, if Int64.rem n 2L = 0L then Int64.div n 2L else > > OK, but boxing itself has nothing to do with the performance degration > here. It is the lack of compiler optimizations on the Int64 type. This > could be solved by implementing compiler optimizations for it (or > manually optimizing some integer arithmetic that is slow). > > Lets run the code under a profiler, or look at the assembly (I used > 'perf record' and 'perf report'). 2 'idiv' instructions turn up as top > offenders in the profile. > > Problem #1: Int64.rem n 2 -> another idiv instruction > > A C compiler would optimize this to an 'and' instruction. > Change that to 'Int64.logand n 1L = 0L'/ > > Problem #2: Int64.div n 2 -> idiv instruction. > > A C compiler would optimize this to a right shift. Changing that to 'Int64.shift_right n 1' speeds > up the code. This is easy to fix in ocamlopt (see attached patch ocamlopt-natint.patch), by applying the same optimizations already used for constant int's to constant natint's (Int64 is Nativeint on 64bit). Note however, that "mod 2" is not really "and 1", neither is "div 2" equivalent to "lsr 1"; that would be the case for unsigned arithmetic (doesn't matter in your example, tho). I don't see the point of optimizing for x86-32 (neither would I spend my time optimizing anything for VAX these days), but it would be possible to add appropriate cases for Int64 on x86 as well (regalloc would be most difficult here, since that requires support for register pairs to perform 64bit arithmetic). >> 1. Unboxing can give huge performance improvements on serial code, > > s/Unboxing/arithmetic optimizations/ > Please find an example where the performance benefit is due to > unboxing, and not due to arithmetic optimizations performed on the > unboxed code. The boxing involved is relevant, but boxing in general is not the issue. In this special case, the "let nlen, n = if..." code requires heap allocation, because of the way the pattern is compiled. This could be fixed by moving the condition out of the code and using two if's to select n/nlen separately (doesn't speed up that much). Fixing the pattern compiler to handle these cases might be interesting for general benefit. I already mentioned this multiple times, but here we go again: Unboxing optimizations may indeed prove useful if applied wisely (cmmgen.ml is of special interest here, the unboxing optimizations are more or less special cases; that could be extended to include interesting cases like moving boxing out of if-then-else in return position, etc). But (here comes the special "Harrop note") this has absolutely nothing to do with LLVM (and of course really, really, really nothing to do with HLVM). Using a different data representation for the heap requires a nearly complete rewrite of the OCaml system (you would probably need to start at the Typing level); if one wants to do this, enjoy and come back with code. But even then, data representation issues will have to be considered long before it comes to actual code generation (if you are serious, you'll have to think about the runtime first prior to talking about code generation for a non-existing runtime), so even then it has nothing to do with LLVM (or C-- or C or whatever IR you can think of). Combining alloc's across if-then-else constructs further reduces code size in your example (and probably other programs as well), see attached file ocamlopt-comballoc-ifthenelse.patch. It's quick&dirty, but it should illustrate the idea. >> let alone parallel code. The optimized HLVM is running 32× faster >> than the OCaml here. >> >> 2. LLVM makes it easy to JIT fast code from OCaml. HLVM is using it >> to beat GCC-compiled C code here. >> > > One advantage of using LLVM is that it would notice arithmetic > optimizations like this and perform it itself (even if you use the > boxed representation). In case of x86-32, it won't, simply because LLVM will be presented with the calls to caml_int32_* functions. You'd need to change the Cmm code instead (changing the low-level stuff is straight-forward as demonstrated). For 64bit targets, see attached patch. This doesn't mean that LLVM wouldn't be useful (in fact, I've just started an LLVM backend for ocamlopt). But it is important to note that LLVM is not the solution to everything. As the name implies, it's "low level", it does a few "higher level" optimizations for C, but these are special cases (and somewhat ugly if you take the time to read the code). It won't make a faster OCaml magically, just like it didn't make a faster Haskell by itself. I could go on by quoting common "Harrop jokes" like "you need types in the low-level IR", etc. trying to tell him that this is simply wrong; but after reading through the Caml/LISP mailing list archives (thanks for the pointers by several readers), I'm pretty much convinced that Jon simply decided to end his war against LISP just to start a new one against ocamlopt. If anyone is serious about ocamlopt with LLVM, feel free to contact me (tho, my time is limited atm). > Best regards, > --Edwin greets, Benedikt [-- Attachment #2: ocamlopt-comballoc-ifthenelse.patch --] [-- Type: application/octet-stream, Size: 6109 bytes --] diff --git a/asmcomp/comballoc.ml b/asmcomp/comballoc.ml index 5a862b1..82c01f9 100644 --- a/asmcomp/comballoc.ml +++ b/asmcomp/comballoc.ml @@ -27,64 +27,78 @@ let allocated_size = function No_alloc -> 0 | Pending_alloc(reg, ofs) -> ofs +let instr_cons_alloc sz a r n = + if sz != 0 + then instr_cons (Iop(Ialloc sz)) a r n + else n + let rec combine i allocstate = match i.desc with Iend | Ireturn | Iexit _ | Iraise -> - (i, allocated_size allocstate) + (i, allocated_size allocstate, true) | Iop(Ialloc sz) -> begin match allocstate with No_alloc -> - let (newnext, newsz) = + let (newnext, newsz, _) = combine i.next (Pending_alloc(i.res.(0), sz)) in - (instr_cons (Iop(Ialloc newsz)) i.arg i.res newnext, 0) + (instr_cons_alloc newsz i.arg i.res newnext, 0, false) | Pending_alloc(reg, ofs) -> if ofs + sz < Config.max_young_wosize then begin - let (newnext, newsz) = + let (newnext, newsz, safe) = combine i.next (Pending_alloc(reg, ofs + sz)) in - (instr_cons (Iop(Iintop_imm(Iadd, ofs))) [| reg |] i.res newnext, - newsz) + if sz != 0 && ofs != 0 then + (instr_cons (Iop(Iintop_imm(Iadd, ofs))) [|reg|] i.res newnext, newsz, safe) + else if sz != 0 then + (instr_cons (Iop Imove) [|reg|] i.res newnext, newsz, safe) + else + (newnext, newsz, safe) end else begin - let (newnext, newsz) = + let (newnext, newsz, _) = combine i.next (Pending_alloc(i.res.(0), sz)) in - (instr_cons (Iop(Ialloc newsz)) i.arg i.res newnext, ofs) + (instr_cons_alloc newsz i.arg i.res newnext, ofs, false) end end | Iop(Icall_ind | Icall_imm _ | Iextcall _ | Itailcall_ind | Itailcall_imm _) -> let newnext = combine_restart i.next in - (instr_cons_debug i.desc i.arg i.res i.dbg newnext, - allocated_size allocstate) + (instr_cons_debug i.desc i.arg i.res i.dbg newnext, allocated_size allocstate, false) | Iop op -> - let (newnext, sz) = combine i.next allocstate in - (instr_cons_debug i.desc i.arg i.res i.dbg newnext, sz) + let (newnext, sz, safe) = combine i.next allocstate in + (instr_cons_debug i.desc i.arg i.res i.dbg newnext, sz, safe) | Iifthenelse(test, ifso, ifnot) -> - let newifso = combine_restart ifso in - let newifnot = combine_restart ifnot in - let newnext = combine_restart i.next in - (instr_cons (Iifthenelse(test, newifso, newifnot)) i.arg i.res newnext, - allocated_size allocstate) + begin match allocstate, combine ifso allocstate, combine ifnot allocstate with + Pending_alloc(reg, ofs), (newifso, szifso, true), (newifnot, szifnot, true) when szifso = szifnot -> + let (newnext, sznext, safe) = combine i.next (Pending_alloc(reg, ofs + szifso)) in + (instr_cons (Iifthenelse(test, newifso, newifnot)) i.arg i.res newnext, + sznext, + safe) + | _, _, _ -> + let newifso = combine_restart ifso in + let newifnot = combine_restart ifnot in + let newnext = combine_restart i.next in + (instr_cons (Iifthenelse(test, newifso, newifnot)) i.arg i.res newnext, + allocated_size allocstate, false) + end | Iswitch(table, cases) -> let newcases = Array.map combine_restart cases in let newnext = combine_restart i.next in - (instr_cons (Iswitch(table, newcases)) i.arg i.res newnext, - allocated_size allocstate) + (instr_cons (Iswitch(table, newcases)) i.arg i.res newnext, allocated_size allocstate, false) | Iloop(body) -> let newbody = combine_restart body in - (instr_cons (Iloop(newbody)) i.arg i.res i.next, - allocated_size allocstate) + (instr_cons (Iloop(newbody)) i.arg i.res i.next, allocated_size allocstate, false) | Icatch(io, body, handler) -> - let (newbody, sz) = combine body allocstate in + let (newbody, sz, _) = combine body allocstate in let newhandler = combine_restart handler in let newnext = combine_restart i.next in - (instr_cons (Icatch(io, newbody, newhandler)) i.arg i.res newnext, sz) + (instr_cons (Icatch(io, newbody, newhandler)) i.arg i.res newnext, sz, false) | Itrywith(body, handler) -> - let (newbody, sz) = combine body allocstate in + let (newbody, sz, _) = combine body allocstate in let newhandler = combine_restart handler in let newnext = combine_restart i.next in - (instr_cons (Itrywith(newbody, newhandler)) i.arg i.res newnext, sz) + (instr_cons (Itrywith(newbody, newhandler)) i.arg i.res newnext, sz, false) and combine_restart i = - let (newi, _) = combine i No_alloc in newi + let (newi, _, _) = combine i No_alloc in newi let fundecl f = {f with fun_body = combine_restart f.fun_body} diff --git a/asmcomp/selectgen.ml b/asmcomp/selectgen.ml index 7daa239..4221f95 100644 --- a/asmcomp/selectgen.ml +++ b/asmcomp/selectgen.ml @@ -562,6 +562,9 @@ method emit_expr env exp = let (rif, sif) = self#emit_sequence env eif in let (relse, selse) = self#emit_sequence env eelse in let r = join rif sif relse selse in + (* Dummy Ialloc 0 for comballoc.ml *) + let ra = self#regs_for typ_addr in + self#insert (Iop(Ialloc 0)) [||] ra; self#insert (Iifthenelse(cond, sif#extract, selse#extract)) rarg [||]; r @@ -790,6 +793,9 @@ method emit_tail env exp = begin match self#emit_expr env earg with None -> () | Some rarg -> + (* Dummy Ialloc 0 for comballoc.ml *) + let ra = self#regs_for typ_addr in + self#insert (Iop(Ialloc 0)) [||] ra; self#insert (Iifthenelse(cond, self#emit_tail_sequence env eif, self#emit_tail_sequence env eelse)) rarg [||] [-- Attachment #3: ocamlopt-natint.patch --] [-- Type: application/octet-stream, Size: 7218 bytes --] diff --git a/asmcomp/amd64/selection.ml b/asmcomp/amd64/selection.ml index 4921e51..bce8104 100644 --- a/asmcomp/amd64/selection.ml +++ b/asmcomp/amd64/selection.ml @@ -168,6 +168,9 @@ method! select_operation op args = [arg1; Cconst_int n] when self#is_immediate n && n = 1 lsl (Misc.log2 n) -> (Iintop_imm(Idiv, n), [arg1]) + | [arg1; Cconst_natint n] when self#is_immediate_natint n + && n = Nativeint.shift_left 1n (Misc.log2 (Nativeint.to_int n)) -> + (Iintop_imm(Idiv, Nativeint.to_int n), [arg1]) | _ -> (Iintop Idiv, args) end | Cmodi -> @@ -175,6 +178,9 @@ method! select_operation op args = [arg1; Cconst_int n] when self#is_immediate n && n = 1 lsl (Misc.log2 n) -> (Iintop_imm(Imod, n), [arg1]) + | [arg1; Cconst_natint n] when self#is_immediate_natint n + && n = Nativeint.shift_left 1n (Misc.log2 (Nativeint.to_int n)) -> + (Iintop_imm(Imod, Nativeint.to_int n), [arg1]) | _ -> (Iintop Imod, args) end (* Recognize float arithmetic with memory. *) diff --git a/asmcomp/selectgen.ml b/asmcomp/selectgen.ml index 50f949a..7daa239 100644 --- a/asmcomp/selectgen.ml +++ b/asmcomp/selectgen.ml @@ -201,6 +201,8 @@ method is_simple_expr = function method virtual is_immediate : int -> bool +method virtual is_immediate_natint : nativeint -> bool + (* Selection of addressing modes *) method virtual select_addressing : @@ -238,11 +240,21 @@ method select_operation op args = if n = 1 lsl l then (Iintop_imm(Ilsl, l), [arg1]) else self#select_arith_comm Imul args + | (Cmuli, [arg1; Cconst_natint n]) -> + let l = Misc.log2 (Nativeint.to_int n) in + if n = Nativeint.shift_left 1n l + then (Iintop_imm(Ilsl, l), [arg1]) + else self#select_arith_comm Imul args | (Cmuli, [Cconst_int n; arg1]) -> let l = Misc.log2 n in if n = 1 lsl l then (Iintop_imm(Ilsl, l), [arg1]) else self#select_arith_comm Imul args + | (Cmuli, [Cconst_natint n; arg1]) -> + let l = Misc.log2 (Nativeint.to_int n) in + if n = Nativeint.shift_left 1n l + then (Iintop_imm(Ilsl, l), [arg1]) + else self#select_arith_comm Imul args | (Cmuli, _) -> self#select_arith_comm Imul args | (Cdivi, _) -> self#select_arith Idiv args | (Cmodi, _) -> self#select_arith_comm Imod args @@ -270,38 +282,60 @@ method select_operation op args = method private select_arith_comm op = function [arg; Cconst_int n] when self#is_immediate n -> (Iintop_imm(op, n), [arg]) + | [arg; Cconst_natint n] when self#is_immediate_natint n -> + (Iintop_imm(op, Nativeint.to_int n), [arg]) | [arg; Cconst_pointer n] when self#is_immediate n -> (Iintop_imm(op, n), [arg]) + | [arg; Cconst_natpointer n] when self#is_immediate_natint n -> + (Iintop_imm(op, Nativeint.to_int n), [arg]) | [Cconst_int n; arg] when self#is_immediate n -> (Iintop_imm(op, n), [arg]) + | [Cconst_natint n; arg] when self#is_immediate_natint n -> + (Iintop_imm(op, Nativeint.to_int n), [arg]) | [Cconst_pointer n; arg] when self#is_immediate n -> (Iintop_imm(op, n), [arg]) + | [Cconst_natpointer n; arg] when self#is_immediate_natint n -> + (Iintop_imm(op, Nativeint.to_int n), [arg]) | args -> (Iintop op, args) method private select_arith op = function [arg; Cconst_int n] when self#is_immediate n -> (Iintop_imm(op, n), [arg]) + | [arg; Cconst_natint n] when self#is_immediate_natint n -> + (Iintop_imm(op, Nativeint.to_int n), [arg]) | [arg; Cconst_pointer n] when self#is_immediate n -> (Iintop_imm(op, n), [arg]) + | [arg; Cconst_natpointer n] when self#is_immediate_natint n -> + (Iintop_imm(op, Nativeint.to_int n), [arg]) | args -> (Iintop op, args) method private select_shift op = function [arg; Cconst_int n] when n >= 0 && n < Arch.size_int * 8 -> (Iintop_imm(op, n), [arg]) + | [arg; Cconst_natint n] when n >= 0n && n < Nativeint.of_int (Arch.size_int * 8) -> + (Iintop_imm(op, Nativeint.to_int n), [arg]) | args -> (Iintop op, args) method private select_arith_comp cmp = function [arg; Cconst_int n] when self#is_immediate n -> (Iintop_imm(Icomp cmp, n), [arg]) + | [arg; Cconst_natint n] when self#is_immediate_natint n -> + (Iintop_imm(Icomp cmp, Nativeint.to_int n), [arg]) | [arg; Cconst_pointer n] when self#is_immediate n -> (Iintop_imm(Icomp cmp, n), [arg]) + | [arg; Cconst_natpointer n] when self#is_immediate_natint n -> + (Iintop_imm(Icomp cmp, Nativeint.to_int n), [arg]) | [Cconst_int n; arg] when self#is_immediate n -> (Iintop_imm(Icomp(swap_intcomp cmp), n), [arg]) + | [Cconst_natint n; arg] when self#is_immediate_natint n -> + (Iintop_imm(Icomp(swap_intcomp cmp), Nativeint.to_int n), [arg]) | [Cconst_pointer n; arg] when self#is_immediate n -> (Iintop_imm(Icomp(swap_intcomp cmp), n), [arg]) + | [Cconst_natpointer n; arg] when self#is_immediate_natint n -> + (Iintop_imm(Icomp(swap_intcomp cmp), Nativeint.to_int n), [arg]) | args -> (Iintop(Icomp cmp), args) @@ -310,12 +344,20 @@ method private select_arith_comp cmp = function method select_condition = function Cop(Ccmpi cmp, [arg1; Cconst_int n]) when self#is_immediate n -> (Iinttest_imm(Isigned cmp, n), arg1) + | Cop(Ccmpi cmp, [arg1; Cconst_natint n]) when self#is_immediate_natint n -> + (Iinttest_imm(Isigned cmp, Nativeint.to_int n), arg1) | Cop(Ccmpi cmp, [Cconst_int n; arg2]) when self#is_immediate n -> (Iinttest_imm(Isigned(swap_comparison cmp), n), arg2) + | Cop(Ccmpi cmp, [Cconst_natint n; arg2]) when self#is_immediate_natint n -> + (Iinttest_imm(Isigned(swap_comparison cmp), Nativeint.to_int n), arg2) | Cop(Ccmpi cmp, [arg1; Cconst_pointer n]) when self#is_immediate n -> (Iinttest_imm(Isigned cmp, n), arg1) + | Cop(Ccmpi cmp, [arg1; Cconst_natpointer n]) when self#is_immediate_natint n -> + (Iinttest_imm(Isigned cmp, Nativeint.to_int n), arg1) | Cop(Ccmpi cmp, [Cconst_pointer n; arg2]) when self#is_immediate n -> (Iinttest_imm(Isigned(swap_comparison cmp), n), arg2) + | Cop(Ccmpi cmp, [Cconst_natpointer n; arg2]) when self#is_immediate_natint n -> + (Iinttest_imm(Isigned(swap_comparison cmp), Nativeint.to_int n), arg2) | Cop(Ccmpi cmp, args) -> (Iinttest(Isigned cmp), Ctuple args) | Cop(Ccmpa cmp, [arg1; Cconst_pointer n]) when self#is_immediate n -> diff --git a/asmcomp/selectgen.mli b/asmcomp/selectgen.mli index 7c30f9f..8966ad1 100644 --- a/asmcomp/selectgen.mli +++ b/asmcomp/selectgen.mli @@ -23,6 +23,7 @@ class virtual selector_generic : object (* The following methods must or can be overridden by the processor description *) method virtual is_immediate : int -> bool + method virtual is_immediate_natint : nativeint -> bool (* Must be defined to indicate whether a constant is a suitable immediate operand to arithmetic instructions *) method virtual select_addressing : ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Value types (Was: [Caml-list] ocamlopt LLVM support) 2010-12-12 19:09 ` Benedikt Meurer @ 2010-12-12 19:20 ` John Carr 2010-12-14 9:43 ` Value types Goswin von Brederlow 2010-12-12 19:55 ` Value types (Was: [Caml-list] ocamlopt LLVM support) Török Edwin ` (2 subsequent siblings) 3 siblings, 1 reply; 21+ messages in thread From: John Carr @ 2010-12-12 19:20 UTC (permalink / raw) To: Benedikt Meurer; +Cc: caml-list > I don't see the point of optimizing for x86-32 I'm using 32 bit ocaml because my program uses too much memory in 64 bit mode. If there were an ocaml that used 32 bit words in 64 bit mode, I would use that instead. Early 32 to 64 bit transitions offered 32 bit pointers with 64 bit registers, called TSO on Alpha and n32 on MIPS. AMD and Intel did not. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Value types 2010-12-12 19:20 ` John Carr @ 2010-12-14 9:43 ` Goswin von Brederlow 0 siblings, 0 replies; 21+ messages in thread From: Goswin von Brederlow @ 2010-12-14 9:43 UTC (permalink / raw) To: John Carr; +Cc: Benedikt Meurer, caml-list John Carr <jfc@MIT.EDU> writes: >> I don't see the point of optimizing for x86-32 > > I'm using 32 bit ocaml because my program uses too much memory in 64 > bit mode. If there were an ocaml that used 32 bit words in 64 bit > mode, I would use that instead. > > Early 32 to 64 bit transitions offered 32 bit pointers with 64 bit > registers, called TSO on Alpha and n32 on MIPS. AMD and Intel did not. There is a patch for gcc for this for amd64. It could be nice to implement this for ocaml too. It would mostly improve the Int64 module. But the extra registers in 64bit mode should help performance overall too. MfG Goswin ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Value types (Was: [Caml-list] ocamlopt LLVM support) 2010-12-12 19:09 ` Benedikt Meurer 2010-12-12 19:20 ` John Carr @ 2010-12-12 19:55 ` Török Edwin 2010-12-12 22:05 ` Jon Harrop 2010-12-12 21:50 ` Jon Harrop 2010-12-13 8:43 ` Alain Frisch 3 siblings, 1 reply; 21+ messages in thread From: Török Edwin @ 2010-12-12 19:55 UTC (permalink / raw) To: Benedikt Meurer; +Cc: caml-list On Sun, 12 Dec 2010 20:09:00 +0100 Benedikt Meurer <benedikt.meurer@googlemail.com> wrote: > > On Dec 12, 2010, at 16:55 , Török Edwin wrote: > > > [...] > > Problem #2: Int64.div n 2 -> idiv instruction. > > > > A C compiler would optimize this to a right shift. Changing that to > > 'Int64.shift_right n 1' speeds up the code. > > This is easy to fix in ocamlopt (see attached patch > ocamlopt-natint.patch), by applying the same optimizations already > used for constant int's to constant natint's (Int64 is Nativeint on > 64bit). Nice patch. I definitely agree that what can be fixed in ocamlopt's high-level opts should be fixed there, rather than hope that LLVM will just do everything. > [...] > > > > One advantage of using LLVM is that it would notice arithmetic > > optimizations like this and perform it itself (even if you use the > > boxed representation). > > In case of x86-32, it won't, simply because LLVM will be presented > with the calls to caml_int32_* functions. I was thinking that the runtime could be compiled to .bc, and then they would get inlined and optimized away at link time. That is overkill for something as simple as integer arithmetic, but could be useful for more complicated runtime functions (or C bindings). > You'd need to change the > Cmm code instead (changing the low-level stuff is straight-forward as > demonstrated). I agree that not emitting those C calls in the first place is better. > For 64bit targets, see attached patch. > > This doesn't mean that LLVM wouldn't be useful (in fact, I've just > started an LLVM backend for ocamlopt). Great! If you notice some more optimizations missed by ocamlopt while working on it, could you write up a list of those? I think I suggested earlier in this thread that LLVM could be used in two ways: write a backend with it, or port some of the LLVM optimizations to ocamlopt. Maybe it'd be good to write some documentation on ocamlopt's internals, and how one can add more optimizations there. Something simple like what is the IR format (like LLVM's langref), how you perform the optimizations (what is the equivalent of LLVM's passes), and what helper modules can you use (what is the equivalent of LLVM's analysis) would suffice for a start. Does something like this exist already? > But it is important to note > that LLVM is not the solution to everything. As the name implies, > it's "low level", it does a few "higher level" optimizations for C, > but these are special cases (and somewhat ugly if you take the time > to read the code). It won't make a faster OCaml magically, just like > it didn't make a faster Haskell by itself. > > I could go on by quoting common "Harrop jokes" like "you need types > in the low-level IR", etc. trying to tell him that this is simply > wrong; but after reading through the Caml/LISP mailing list archives > (thanks for the pointers by several readers), I'm pretty much > convinced that Jon simply decided to end his war against LISP just to > start a new one against ocamlopt. > > If anyone is serious about ocamlopt with LLVM, feel free to contact > me (tho, my time is limited atm). I wouldn't have much time to dedicate to this, so I can't say I'm serious about it. AFAICT LLVM's OCaml bindings are only good for generating LLVM IR from OCaml, not for actually performing transformations on it (there is no binding to retrieve the type of a value for example). I'll probably be looking into fixing that in the near future, and this may indirectly help your LLVM backend (if you intend to write OCaml specific transformations on the LLVM IR). Best regards, --Edwin ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: Value types (Was: [Caml-list] ocamlopt LLVM support) 2010-12-12 19:55 ` Value types (Was: [Caml-list] ocamlopt LLVM support) Török Edwin @ 2010-12-12 22:05 ` Jon Harrop 2010-12-12 22:27 ` Török Edwin 0 siblings, 1 reply; 21+ messages in thread From: Jon Harrop @ 2010-12-12 22:05 UTC (permalink / raw) To: 'Török Edwin'; +Cc: caml-list Edwin wrote: > AFAICT LLVM's OCaml bindings are only good for generating LLVM IR from > OCaml, not for actually performing transformations on it (there is no > binding to retrieve the type of a value for example). I'll probably be > looking into fixing that in the near future, and this may indirectly > help your LLVM backend (if you intend to write OCaml specific > transformations on the LLVM IR). That's a lot of work. Wouldn't it be preferable to do the passes on the OCaml side and focus on generating high quality LLVM IR? Cheers, Jon. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Value types (Was: [Caml-list] ocamlopt LLVM support) 2010-12-12 22:05 ` Jon Harrop @ 2010-12-12 22:27 ` Török Edwin 2010-12-12 23:41 ` Jon Harrop 0 siblings, 1 reply; 21+ messages in thread From: Török Edwin @ 2010-12-12 22:27 UTC (permalink / raw) To: Jon Harrop; +Cc: caml-list On Sun, 12 Dec 2010 22:05:34 -0000 Jon Harrop <jonathandeanharrop@googlemail.com> wrote: > Edwin wrote: > > AFAICT LLVM's OCaml bindings are only good for generating LLVM IR > > from OCaml, not for actually performing transformations on it > > (there is no binding to retrieve the type of a value for example). > > I'll probably be looking into fixing that in the near future, and > > this may indirectly help your LLVM backend (if you intend to write > > OCaml specific transformations on the LLVM IR). > > That's a lot of work. Wouldn't it be preferable to do the passes on > the OCaml side and focus on generating high quality LLVM IR? Yes, that is probably a better approach (generating the optimized IR in the first place). Best regards, --Edwin ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: Value types (Was: [Caml-list] ocamlopt LLVM support) 2010-12-12 22:27 ` Török Edwin @ 2010-12-12 23:41 ` Jon Harrop 2010-12-13 2:13 ` Eray Ozkural 0 siblings, 1 reply; 21+ messages in thread From: Jon Harrop @ 2010-12-12 23:41 UTC (permalink / raw) To: 'Török Edwin'; +Cc: caml-list Edwin wrote: > On Sun, 12 Dec 2010 22:05:34 -0000 > Jon Harrop <jonathandeanharrop@googlemail.com> wrote: > > Edwin wrote: > > > AFAICT LLVM's OCaml bindings are only good for generating LLVM IR > > > from OCaml, not for actually performing transformations on it > > > (there is no binding to retrieve the type of a value for example). > > > I'll probably be looking into fixing that in the near future, and > > > this may indirectly help your LLVM backend (if you intend to write > > > OCaml specific transformations on the LLVM IR). > > > > That's a lot of work. Wouldn't it be preferable to do the passes on > > the OCaml side and focus on generating high quality LLVM IR? > > Yes, that is probably a better approach (generating the optimized IR in > the first place). FWIW, just the basic existing bindings were still buggy last I looked. For example, HLVM had to use a workaround to add a dummy argument to a function with no arguments because the bindings didn't handle that case correctly. Ironing the bugs out of the existing bindings would be more useful than making them even bigger, IMHO. Going back to Richard's idea, writing an OCaml library that uses LLVM to JIT interface code on-the-fly as a replacement for writing separate C stubs would be incredibly useful and you could even use it to interface to LLVM itself more easily! Cheers, Jon. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Value types (Was: [Caml-list] ocamlopt LLVM support) 2010-12-12 23:41 ` Jon Harrop @ 2010-12-13 2:13 ` Eray Ozkural 0 siblings, 0 replies; 21+ messages in thread From: Eray Ozkural @ 2010-12-13 2:13 UTC (permalink / raw) To: Jon Harrop; +Cc: Török Edwin, caml-list [-- Attachment #1: Type: text/plain, Size: 347 bytes --] It's better to focus on low-level optimizations within LLVM IR I think, high-level optimizations would better be done beforehand. It's not a bad marriage though. :) -- Eray Ozkural, PhD candidate. Comp. Sci. Dept., Bilkent University, Ankara http://groups.yahoo.com/group/ai-philosophy http://myspace.com/arizanesil http://myspace.com/malfunct [-- Attachment #2: Type: text/html, Size: 539 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: Value types (Was: [Caml-list] ocamlopt LLVM support) 2010-12-12 19:09 ` Benedikt Meurer 2010-12-12 19:20 ` John Carr 2010-12-12 19:55 ` Value types (Was: [Caml-list] ocamlopt LLVM support) Török Edwin @ 2010-12-12 21:50 ` Jon Harrop 2010-12-13 8:43 ` Alain Frisch 3 siblings, 0 replies; 21+ messages in thread From: Jon Harrop @ 2010-12-12 21:50 UTC (permalink / raw) To: 'Benedikt Meurer', caml-list Benedict wrote: > > A C compiler would optimize this to a right shift. Changing that to > > 'Int64.shift_right n 1' speeds up the code. > > This is easy to fix in ocamlopt (see attached patch ocamlopt- > natint.patch), by applying the same optimizations already used for > constant int's to constant natint's (Int64 is Nativeint on 64bit). Note > however, that "mod 2" is not really "and 1", neither is "div 2" > equivalent to "lsr 1"; that would be the case for unsigned arithmetic > (doesn't matter in your example, tho). That's great. Does it optimize div and mod by any constant integer as C compilers do? > >> 1. Unboxing can give huge performance improvements on serial code, > > > > s/Unboxing/arithmetic optimizations/ > > Please find an example where the performance benefit is due to > > unboxing, and not due to arithmetic optimizations performed on the > > unboxed code. > > The boxing involved is relevant, but boxing in general is not the > issue. In this special case, the "let nlen, n = if..." code requires > heap allocation, because of the way the pattern is compiled. This could > be fixed by moving the condition out of the code and using two if's to > select n/nlen separately (doesn't speed up that much). Fixing the > pattern compiler to handle these cases might be interesting for general > benefit. > > I already mentioned this multiple times, but here we go again: Unboxing > optimizations may indeed prove useful if applied wisely (cmmgen.ml is > of special interest here, the unboxing optimizations are more or less > special cases; that could be extended to include interesting cases like > moving boxing out of if-then-else in return position, etc). > > But (here comes the special "Harrop note") this has absolutely nothing > to do with LLVM (and of course really, really, really nothing to do > with HLVM). Using a different data representation for the heap requires > a nearly complete rewrite of the OCaml system (you would probably need > to start at the Typing level); if one wants to do this, enjoy and come > back with code. But even then, data representation issues will have to > be considered long before it comes to actual code generation (if you > are serious, you'll have to think about the runtime first prior to > talking about code generation for a non-existing runtime), so even then > it has nothing to do with LLVM (or C-- or C or whatever IR you can > think of). OCaml programmers can benefit from more appropriate data representations by using LLVM as a library to generate code from OCaml. HLVM is an example of this that anyone can play with. > Combining alloc's across if-then-else constructs further reduces code > size in your example (and probably other programs as well), see > attached file ocamlopt-comballoc-ifthenelse.patch. It's quick&dirty, > but it should illustrate the idea. I think that is an even more valuable improvement to ocamlopt than the int optimization. > This doesn't mean that LLVM wouldn't be useful (in fact, I've just > started an LLVM backend for ocamlopt). But it is important to note that > LLVM is not the solution to everything. As the name implies, it's "low > level", it does a few "higher level" optimizations for C, but these are > special cases (and somewhat ugly if you take the time to read the > code). It won't make a faster OCaml magically, just like it didn't make > a faster Haskell by itself. Absolutely. > I could go on by quoting common "Harrop jokes" like "you need types in > the low-level IR", etc. trying to tell him that this is simply wrong; > but after reading through the Caml/LISP mailing list archives (thanks > for the pointers by several readers), I'm pretty much convinced that > Jon simply decided to end his war against LISP just to start a new one > against ocamlopt. Suggesting that OCaml programmers use LLVM as a library because you can get huge performance gains is hardly a "war against ocamlopt". Cheers, Jon. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Value types (Was: [Caml-list] ocamlopt LLVM support) 2010-12-12 19:09 ` Benedikt Meurer ` (2 preceding siblings ...) 2010-12-12 21:50 ` Jon Harrop @ 2010-12-13 8:43 ` Alain Frisch 2010-12-15 10:29 ` Benedikt Meurer 3 siblings, 1 reply; 21+ messages in thread From: Alain Frisch @ 2010-12-13 8:43 UTC (permalink / raw) To: Benedikt Meurer; +Cc: caml-list On 12/12/2010 08:09 PM, Benedikt Meurer wrote: > The boxing involved is relevant, but boxing in general is not the > issue. In this special case, the "let nlen, n = if..." code requires > heap allocation, because of the way the pattern is compiled. This could > be fixed by moving the condition out of the code and using two if's to > select n/nlen separately (doesn't speed up that much). Fixing the > pattern compiler to handle these cases might be interesting for general > benefit. Instead of duplicating the conditional, you could also push the assignments to bound variables down the expression. For instance: let (x, y) = if b then (u, v) else (v, u) in ... can be replaced, conceptually, by: let x = <dummy> in let y = <dummy> in if b then (x <- u; y <- v) else (x <- v; y <- u); ... and similarly when the bound expression is a pattern matching. I've played with this a few months ago and could observe important speedups (27%, 20%) on two micro-benchmarks. The diff is really small: http://caml.inria.fr/cgi-bin/viewcvs.cgi/ocaml/branches/inplace_let/bytecomp/matching.ml?rev=10475&sortby=date&r2=10475&r1=10474 -- Alain Micro benchmark 1: let () = for k = 1 to 10000 do for i = 1 to 100000 do let (x, y) = if i mod 2 = 0 then (1, i * 2) else (2, i * 3) in r := !r * x + y done done Micro benchmark 2: let f x y z = let a, b = match z with | Some (u, v) -> u, v * 2 | None -> 10, 20 in a * x + b * y let () = let r = ref 0 in for k = 1 to 2000 do for i = 1 to 100000 do r := !r + f k i (Some (k, i)); r := !r + f k i None done done ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Value types (Was: [Caml-list] ocamlopt LLVM support) 2010-12-13 8:43 ` Alain Frisch @ 2010-12-15 10:29 ` Benedikt Meurer 2010-12-15 13:15 ` Jon Harrop 0 siblings, 1 reply; 21+ messages in thread From: Benedikt Meurer @ 2010-12-15 10:29 UTC (permalink / raw) To: caml-list On Dec 13, 2010, at 09:43 , Alain Frisch wrote: > On 12/12/2010 08:09 PM, Benedikt Meurer wrote: >> The boxing involved is relevant, but boxing in general is not the >> issue. In this special case, the "let nlen, n = if..." code requires >> heap allocation, because of the way the pattern is compiled. This could >> be fixed by moving the condition out of the code and using two if's to >> select n/nlen separately (doesn't speed up that much). Fixing the >> pattern compiler to handle these cases might be interesting for general >> benefit. > > Instead of duplicating the conditional, you could also push the assignments to bound variables down the expression. For instance: > > let (x, y) = if b then (u, v) else (v, u) in ... > > can be replaced, conceptually, by: > > let x = <dummy> in > let y = <dummy> in > if b then (x <- u; y <- v) else (x <- v; y <- u); > ... > > and similarly when the bound expression is a pattern matching. > > > I've played with this a few months ago and could observe important speedups (27%, 20%) on two micro-benchmarks. > > The diff is really small: > > http://caml.inria.fr/cgi-bin/viewcvs.cgi/ocaml/branches/inplace_let/bytecomp/matching.ml?rev=10475&sortby=date&r2=10475&r1=10474 Nice. But it would be even better to avoid the dummy, in your example let x = u in let y = v in if b then x <- v; y <- u This does not only avoid the dummy, but would also allow lowering to "cmovcc" instructions in the backend selector (atleast for x86-32/64). > -- Alain Benedikt ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: Value types (Was: [Caml-list] ocamlopt LLVM support) 2010-12-15 10:29 ` Benedikt Meurer @ 2010-12-15 13:15 ` Jon Harrop 0 siblings, 0 replies; 21+ messages in thread From: Jon Harrop @ 2010-12-15 13:15 UTC (permalink / raw) To: caml-list Benedikt wrote: > Nice. But it would be even better to avoid the dummy, in your example > > let x = u in > let y = v in > if b then x <- v; y <- u > > This does not only avoid the dummy, but would also allow lowering to > "cmovcc" instructions in the backend selector (atleast for x86-32/64). FWIW, when you're using LLVM (as a library!) to generate equivalent code then you can just represent the tuples as structs (value types) and LLVM will take care of all of this for you. Again, it generates really efficient code with minimal effort. The latest example front-end for HLVM represents tuples as structs so it gets this for free. However, polymorphic recursion is *much* harder to implement with that design choice. Also, there is the question of when boxed vs unboxed tuples are more efficient. Cheers, Jon. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Value types 2010-12-12 15:55 ` Török Edwin 2010-12-12 17:14 ` Jon Harrop 2010-12-12 19:09 ` Benedikt Meurer @ 2010-12-14 9:54 ` Goswin von Brederlow 2 siblings, 0 replies; 21+ messages in thread From: Goswin von Brederlow @ 2010-12-14 9:54 UTC (permalink / raw) To: Edwin; +Cc: Jon Harrop, caml-list Török Edwin <edwintorok@gmail.com> writes: > On Sun, 12 Dec 2010 14:54:14 -0000 > "Jon Harrop" <jon@ffconsultancy.com> wrote: > >> The Haskell guys got their best performance improvement moving to >> LLVM from the hailstone benchmark so it is interesting to examine >> this case as well. I just added support for 64-bit ints to HLVM to >> implement that benchmark and my code is: >> >> Hereâs the equivalent OCaml code: >> >> let rec collatzLen(c, n) : int = >> if n = 1L then c else >> collatzLen (c+1, if Int64.rem n 2L = 0L then Int64.div n 2L else > > OK, but boxing itself has nothing to do with the performance degration > here. It is the lack of compiler optimizations on the Int64 type. This > could be solved by implementing compiler optimizations for it (or > manually optimizing some integer arithmetic that is slow). > > Lets run the code under a profiler, or look at the assembly (I used > 'perf record' and 'perf report'). 2 'idiv' instructions turn up as top > offenders in the profile. > > Problem #1: Int64.rem n 2 -> another idiv instruction > > A C compiler would optimize this to an 'and' instruction. > Change that to 'Int64.logand n 1L = 0L'/ But C would have an uint64_t there (if you are smart). Otherwise that isn't equivalent: void foo(int64_t x) { a = x % 2; } 0000000000000000 <foo>: 0: 48 89 f8 mov %rdi,%rax 3: 48 c1 e8 3f shr $0x3f,%rax 7: 48 01 c7 add %rax,%rdi a: 83 e7 01 and $0x1,%edi d: 48 29 c7 sub %rax,%rdi 10: 48 89 3d 00 00 00 00 mov %rdi,0x0(%rip) # 17 <foo+0x17> 17: c3 retq > Problem #2: Int64.div n 2 -> idiv instruction. > > A C compiler would optimize this to a right shift. Changing that to 'Int64.shift_right n 1' speeds > up the code. Same thing again: void foo(int64_t x) { a = x / 2; } 0000000000000000 <foo>: 0: 48 89 f8 mov %rdi,%rax 3: 48 c1 e8 3f shr $0x3f,%rax 7: 48 8d 3c 38 lea (%rax,%rdi,1),%rdi b: 48 d1 ff sar %rdi e: 48 89 3d 00 00 00 00 mov %rdi,0x0(%rip) # 15 <foo+0x15> 15: c3 retq In both cases gcc avoids the expensive idiv instruction but it needs to handle the sign of the number with some tricks. Still faster than idiv though. An UInt64 module would be nice here. MfG Goswin ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Value types (Was: [Caml-list] ocamlopt LLVM support) 2010-12-12 14:54 Value types (Was: [Caml-list] ocamlopt LLVM support) Jon Harrop 2010-12-12 15:55 ` Török Edwin @ 2010-12-12 19:53 ` Brian Hurt 2010-12-12 20:39 ` Jon Harrop 1 sibling, 1 reply; 21+ messages in thread From: Brian Hurt @ 2010-12-12 19:53 UTC (permalink / raw) To: Jon Harrop; +Cc: caml-list I'm going to regret this. I know I'm going to regret this. On Sun, 12 Dec 2010, Jon Harrop wrote: > Here?s the equivalent OCaml code: > > let rec collatzLen(c, n) : int = > if n = 1L then c else > collatzLen (c+1, if Int64.rem n 2L = 0L then Int64.div n 2L else > Int64.add (Int64.mul 3L n) 1L);; > > let rec loop(i, (nlen, n)) = > if i = 1L then n else > let ilen = collatzLen(1, i) in > let nlen, n = if ilen > nlen then ilen, i else nlen, n in > loop (Int64.sub i 1L, (nlen, n));; Congratulations, Jon, you win today's Captain Obvious award. Using Int64's, which are forced to be boxed, really slows things down. Also, uncurrying all your arguments also slows things down. Running your original code on my 64-bit laptop, it took 6.35s to run the 1M example. The following alternate code only took 0.82s, for a speed up of almost 7.75x. Scaling your timings by a similar amount gives Ocaml a running speed of 3.14s in your set up, or competitive with F#. My code: let rec collatzLen c n = if n = 1 then c else collatzLen (c+1) (if (n land 1) == 0 then (n lsr 1) else ((n * 3) + 1)) ;; let rec loop i nlen n = if i = 1 then n else let ilen = collatzLen 1 i in if (ilen > nlen) then loop (i - 1) ilen i else loop (i - 1) nlen n ;; loop 1000000 0 0 Here is where you insert a lecture about how Ocaml int's being on 63 (or 31) bits aren't "real" ints, and that therefor this isn't a valid comparison at all. Brian ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: Value types (Was: [Caml-list] ocamlopt LLVM support) 2010-12-12 19:53 ` Value types (Was: [Caml-list] ocamlopt LLVM support) Brian Hurt @ 2010-12-12 20:39 ` Jon Harrop 0 siblings, 0 replies; 21+ messages in thread From: Jon Harrop @ 2010-12-12 20:39 UTC (permalink / raw) To: 'Brian Hurt'; +Cc: caml-list Brian Hurt wrote: > On Sun, 12 Dec 2010, Jon Harrop wrote: > > let rec collatzLen(c, n) : int = > > if n = 1L then c else > > collatzLen (c+1, if Int64.rem n 2L = 0L then Int64.div n 2L else > > Int64.add (Int64.mul 3L n) 1L);; > > > > let rec loop(i, (nlen, n)) = > > if i = 1L then n else > > let ilen = collatzLen(1, i) in > > let nlen, n = if ilen > nlen then ilen, i else nlen, n in > > loop (Int64.sub i 1L, (nlen, n));; > > Congratulations, Jon, you win today's Captain Obvious award. Using > Int64's, which are forced to be boxed, really slows things down. Apparently boxing isn't the issue here, as I had assumed. On 32-bit, OCaml compiles each arithmetic operation on the int64s to a C function call. > Also, uncurrying all your arguments also slows things down. I see <3% performance improvement from currying everything. > Running your > original code on my 64-bit laptop, it took 6.35s to run the 1M example. > The following alternate code only took 0.82s, for a speed up of almost > 7.75x. According to Edwin, you should be able to get C-like performance by running the OCaml in 64-bit and replacing the div and mod operations with shifts and logical ANDs. > Scaling your timings by a similar amount gives Ocaml a running > speed of 3.14s in your set up, or competitive with F#. I'd be wary of scaling timings by measurements made across different architectures. OCaml seems to be doing completely different things on x86 and x64 here. Cheers, Jon. ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2010-12-15 13:15 UTC | newest] Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2010-12-12 14:54 Value types (Was: [Caml-list] ocamlopt LLVM support) Jon Harrop 2010-12-12 15:55 ` Török Edwin 2010-12-12 17:14 ` Jon Harrop 2010-12-12 17:26 ` Török Edwin 2010-12-12 18:01 ` Jon Harrop 2010-12-12 18:22 ` Török Edwin 2010-12-12 19:09 ` Benedikt Meurer 2010-12-12 19:20 ` John Carr 2010-12-14 9:43 ` Value types Goswin von Brederlow 2010-12-12 19:55 ` Value types (Was: [Caml-list] ocamlopt LLVM support) Török Edwin 2010-12-12 22:05 ` Jon Harrop 2010-12-12 22:27 ` Török Edwin 2010-12-12 23:41 ` Jon Harrop 2010-12-13 2:13 ` Eray Ozkural 2010-12-12 21:50 ` Jon Harrop 2010-12-13 8:43 ` Alain Frisch 2010-12-15 10:29 ` Benedikt Meurer 2010-12-15 13:15 ` Jon Harrop 2010-12-14 9:54 ` Value types Goswin von Brederlow 2010-12-12 19:53 ` Value types (Was: [Caml-list] ocamlopt LLVM support) Brian Hurt 2010-12-12 20:39 ` Jon Harrop
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox