* testers wanted for experimental SSE2 back-end
From: Xavier Leroy @ 2010-03-09 16:33 UTC
To: caml-list

Hello list,

This is a call for testers concerning an experimental OCaml compiler back-end that uses SSE2 instructions for floating-point arithmetic. This code generation strategy was discussed before on this list, and I include below a summary in Q&A style.

The new back-end is being considered for inclusion in the next major release (3.12), but performance testing done so far at INRIA and by Caml Consortium members is not conclusive. Additional results from members of this list would therefore be very welcome.

We're not terribly interested in small (< 50 LOC), Shootout-style benchmarks, since their performance is very sensitive to code and data placement. However, if some of you have a sizeable (> 500 LOC) body of float-intensive Caml code, we'd be very interested to hear about the compared speed of the SSE2 back-end and the old back-end on your code.

Switching to Q&A style:

Q: Where can I get the code?

A: From the SVN repository:

    svn checkout http://caml.inria.fr/svn/ocaml/branches/sse2 ocaml-sse2

Source-code only. Very lightly tested under Windows, so you might be better off testing under Unix.

Q: What is this SSE2 thingy?

A: An extension of the Intel/AMD x86 instruction set that provides, among other things, 64-bit float arithmetic instructions operating over 64-bit float registers. Before SSE2, the only way to perform 64-bit float arithmetic on x86 was the x87 instructions, which compute in 80-bit precision and use a stack instead of registers.

Q: Why this sudden interest in SSE2?

A: SSE2 has several potential advantages over x87, including:

- The register-based SSE2 model fits the OCaml back-end much better than the stack-based x87 model. In particular, "let"-bound intermediate results of type "float" can be kept in SSE2 registers, while in the current x87 mode they are systematically flushed to the stack.

- SSE2 implements exactly 64-bit IEEE arithmetic, giving float results that are consistent with those obtained on other platforms and with the OCaml bytecode interpreter. The 80-bit format of x87 produces different results and can cause surprises such as "double rounding" errors. (For more explanations, see David Monniaux's excellent article, http://hal.archives-ouvertes.fr/hal-00128124/ )

- Some x86 processors execute SSE2 instructions faster than their x87 counterparts. This speed difference was notable on the Pentium 4 in particular, but is much smaller on more recent processors such as Core 2.

Note that x86-64 bits systems as well as Mac OS X already use SSE2 as their default floating-point model.

SSE2 also has some potential disadvantages:

- The instructions are bigger than x87 instructions, causing some increase in code size and potentially some decrease in instruction cache efficiency.

- Computing intermediate results in 80-bit precision, like x87 does, can improve the numerical stability of poorly-conditioned float computations, although it doesn't make a difference for well-written numerical code.

Q: Is SSE2 universally available on x86 processors?

A: Not universally but pretty close. SSE2 made its debut in 2000, in the Pentium 4 processor. All x86 machines built in the last 4 years or so support SSE2, but pre-Pentium 4 and pre-Athlon64 processors do not.

Q: So if you adopt this new back-end, OCaml will stop working on my trusty 1995-vintage Pentium?

A: No. Under friendly pressure from our Debian friends, we agreed to keep the x87 back-end alive for a while in parallel with the SSE2 back-end. The x87 back-end is selected at configuration time if the processor doesn't support SSE2 or if a special flag is given to the configure script.

Q: I observed a 20% (speedup|slowdown)! Should I tell the world about it?

A: If your benchmark spends all its time in 10 lines of OCaml, maybe not. On such small codes, variations in code and data placement alone (without changing the instructions that are actually executed) can result in performance variations of 20%, so this is just experimental noise. Larger programs are less sensitive to this noise, which is why we're much more interested in results obtained on real OCaml applications. Finally, one micro-benchmark slowed down by a factor of 2 for reasons we couldn't explain.

Q: What are those inconclusive results you mentioned?

A: On medium-sized numerical kernels (e.g. FFT, Gaussian process regression), we've observed speedups of about 8% on Core 2 processors and somewhat higher on recent AMD processors. On bigger OCaml applications that perform floating-point computations but not exclusively, the performance difference was lost in the noise.

Looking forward to interesting experimental results,

- Xavier Leroy
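For readers who want to report numbers, here is a minimal sketch of a timing harness that could be compiled once with each back-end. It is not from the announcement: the names `dot` and `time_best` are hypothetical, the dot product is only a stand-in for a real float-intensive workload, and, as the Q&A above warns, a kernel this small is exactly the kind of benchmark whose timings are dominated by placement noise. Only the standard library function `Sys.time` is assumed.

```ocaml
(* Hypothetical timing harness; a sketch, not part of the original post. *)

(* Stand-in float-intensive kernel: dot product of two float arrays. *)
let dot (a : float array) (b : float array) : float =
  let acc = ref 0.0 in
  for i = 0 to Array.length a - 1 do
    acc := !acc +. a.(i) *. b.(i)
  done;
  !acc

(* Run [f ()] [runs] times after one warm-up call; return the last result
   and the smallest CPU time (in seconds) observed over the timed runs. *)
let time_best ~runs f =
  let best = ref infinity in
  let result = ref (f ()) in
  for _i = 1 to runs do
    let t0 = Sys.time () in
    result := f ();
    let dt = Sys.time () -. t0 in
    if dt < !best then best := dt
  done;
  (!result, !best)

let () =
  let n = 100_000 in
  let a = Array.init n float_of_int in
  let b = Array.make n 0.5 in
  let r, t = time_best ~runs:5 (fun () -> dot a b) in
  Printf.printf "dot = %.1f, best of 5 runs: %.6f s\n" r t
```

Taking the best of several runs rather than the mean is one way to damp scheduler noise; replacing `dot` with a call into a larger application is what the announcement actually asks for.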
* Re: [Caml-list] testers wanted for experimental SSE2 back-end
From: Mike Lin @ 2010-03-10 19:25 UTC
To: Xavier Leroy; +Cc: caml-list

On Tue, Mar 9, 2010 at 11:33 AM, Xavier Leroy <Xavier.Leroy@inria.fr> wrote:

> - The register-based SSE2 model fits the OCaml back-end much better
> than the stack-based x87 model. In particular, "let"-bound intermediate
> results of type "float" can be kept in SSE2 registers, while in
> the current x87 mode they are systematically flushed to the stack.
>
> Note that x86-64 bits systems as well as Mac OS X already use SSE2 as
> their default floating-point model.

I have a bunch of biological sequence analysis stuff that could be interesting but I am already in x86-64 ("Wow! A 64 bit architecture!"). The above seems pretty clear but just to verify - I would not benefit from this new back-end, right?

Mike
* Re: [Caml-list] testers wanted for experimental SSE2 back-end
From: Will M. Farr @ 2010-03-10 20:51 UTC
To: caml-list

On Mar 10, 2010, at 1:25 PM, Mike Lin wrote:

> On Tue, Mar 9, 2010 at 11:33 AM, Xavier Leroy <Xavier.Leroy@inria.fr> wrote:
> Note that x86-64 bits systems as well as Mac OS X already use SSE2 as
> their default floating-point model.
>
> I have a bunch of biological sequence analysis stuff that could be interesting but I am already in x86-64 ("Wow! A 64 bit architecture!"). The above seems pretty clear but just to verify - I would not benefit from this new back-end, right?
>
> Mike

Oops. I just ran a bunch of tests on my Mac OS 10.6 system---does that mean that I compared two sse2 backends? The ocaml-sse2 branch definitely produced different code than the trunk, but that could easily be due to any small difference in the two compilers, and not due to a change of architecture.

Will
* Re: [Caml-list] testers wanted for experimental SSE2 back-end
From: Xavier Leroy @ 2010-03-11 8:42 UTC
To: Mike Lin; +Cc: caml-list

Mike Lin wrote:

> I have a bunch of biological sequence analysis stuff that could be
> interesting but I am already in x86-64 ("Wow! A 64 bit architecture!").
> The above seems pretty clear but just to verify - I would not benefit
> from this new back-end, right?

Right. Sorry for not mentioning this. The x86-64 bit code generator for OCaml uses SSE2 floats, like all C compilers for this platform. The experimental back-end I announced is for x86-32 bit.

Some more Q&A:

Q: I have OCaml installed on my x86 machine, how do I know if it's 32 or 64 bits?

A: Do:

    grep ^ARCH `ocamlopt -where`/Makefile.config

If it says "amd64", it's 64 bits with SSE2 floats. If it says "i386", it's 32 bits with x87 floats. If it says "ia32", it's the experimental back-end: 32 bits with SSE2 floats.

Q: If I compile from sources, which code generator is chosen by default? 32 or 64 bits?

A: OCaml's configure script chooses whatever mode the C compiler defaults to. For instance, on a 32-bit Linux installation, the 32-bit generator is selected, and on a 64-bit Linux installation, it's the 64-bit generator. Mac OS X is more tricky: 10.5 and earlier default to 32 bits, but 10.6 defaults to 64 bits...

Will Farr wrote:

> Oops. I just ran a bunch of tests on my Mac OS 10.6 system---does
> that mean that I compared two sse2 backends? The ocaml-sse2 branch
> definitely produced different code than the trunk, but that could
> easily be due to any small difference in the two compilers, and not
> due to a change of architecture.

It is quite possible you ended up with two 64-bit, SSE2-float back-ends. Oops. Sorry for the time you lost.

And, yes, unrelated changes between release 3.11.2 and the experimental sources I released (based on what will become 3.12.0) can account for small speed differences.

- Xavier Leroy
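As a quick complement to the grep check, a program can also report the word size it was built for. The sketch below uses only `Sys.word_size` from the standard library; the function name `describe_word_size` is hypothetical. Note the limitation: this distinguishes 32-bit from 64-bit only. It cannot tell the "i386" (x87) back-end from the experimental "ia32" (SSE2) one, so the ARCH line in Makefile.config remains the authoritative check for that.

```ocaml
(* Sketch: describe the word size reported by the running program.
   This does not reveal which floating-point instructions are used. *)
let describe_word_size (bits : int) : string =
  match bits with
  | 64 -> "64-bit code generator"
  | 32 -> "32-bit code generator"
  | n -> Printf.sprintf "unexpected word size: %d bits" n

let () = print_endline (describe_word_size Sys.word_size)
```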
* Re: [Caml-list] testers wanted for experimental SSE2 back-end
From: Gaëtan DUBREIL @ 2010-03-13 16:10 UTC
To: caml-list

With my audio processing application of about 1300 LOC, I get a speedup of about 9%. I run the application on an Intel Pentium M under Kubuntu. The application spends all its time doing computations on bigarrays of floats.

For me, the gain is appreciable. In real-time audio processing, a speedup of 9% can be crucial.

Gaëtan Dubreil
* Re: [Caml-list] testers wanted for experimental SSE2 back-end
From: Dmitry Bely @ 2010-03-23 8:58 UTC
To: caml-list

On Tue, Mar 9, 2010 at 7:33 PM, Xavier Leroy <Xavier.Leroy@inria.fr> wrote:

> Hello list,
>
> This is a call for testers concerning an experimental OCaml compiler
> back-end that uses SSE2 instructions for floating-point arithmetic.
> This code generation strategy was discussed before on this list, and I
> include below a summary in Q&A style.
>
> The new back-end is being considered for inclusion in the next major
> release (3.12), but performance testing done so far at INRIA and by
> Caml Consortium members is not conclusive. Additional results
> from members of this list would therefore be very welcome.
>
> We're not terribly interested in small (< 50 LOC), Shootout-style
> benchmarks, since their performance is very sensitive to code and data
> placement. However, if some of you have a sizeable (> 500 LOC) body
> of float-intensive Caml code, we'd be very interested to hear about
> the compared speed of the SSE2 back-end and the old back-end on your
> code.

I cannot provide any benchmark yet but even not taking into account the better register organization there are at least two areas where SSE2 can outperform x87 significantly.

1. Float to integer conversion

Is quite inefficient on x87 because you have to explicitly set and restore rounding mode. Typical

    let round x = truncate (x +. 0.5)

Translates to

    _camlT__round_58:
            sub     esp, 8
    L100:
            fld     L101
            fadd    REAL8 PTR [eax]
            sub     esp, 8
            fnstcw  [esp+4]
            mov     ax, [esp+4]
            mov     ah, 12
            mov     [esp], ax
            fldcw   [esp]
            fistp   DWORD PTR [esp]
            mov     eax, [esp]
            fldcw   [esp+4]
            add     esp, 8
            lea     eax, DWORD PTR [eax+eax+1]
            add     esp, 8
            ret

but just to

    _camlT__round_58:
    L100:
            movlpd  xmm0, L101
            addsd   xmm0, REAL8 PTR [eax]
            cvttsd2si eax, xmm0
            lea     eax, DWORD PTR [eax+eax+1]
            ret

with SSE2.

2. Float compare

Does not set flags on x87 so

    let fmin (x:float) y = if x < y then x else y

ends up with

    _camlT__fmin_58:
            sub     esp, 8
    L101:
            mov     ecx, eax
            fld     REAL8 PTR [ebx]
            fld     REAL8 PTR [ecx]
            fcompp
            fnstsw  ax
            and     ah, 69
            cmp     ah, 1
            jne     L100
            mov     eax, ecx
            add     esp, 8
            ret
    L100:
            mov     eax, ebx
            add     esp, 8
            ret

on SSE2 you just have

    _camlT__fmin_58:
    L101:
            movlpd  xmm1, REAL8 PTR [ebx]
            movlpd  xmm0, REAL8 PTR [eax]
            comisd  xmm1, xmm0
            jbe     L100
            ret
    L100:
            mov     eax, ebx
            ret

As for SSE2 backend presented I have some thoughts regarding the code (fast math functions via x87 are questionable, optimization of floating compare etc.) Where to discuss that - just here or there is some entry in Mantis?

- Dmitry Bely
* Re: [Caml-list] testers wanted for experimental SSE2 back-end
From: Daniel Bünzli @ 2010-03-23 9:07 UTC
To: Dmitry Bely; +Cc: caml-list

> let round x = truncate (x +. 0.5)

Side note, if you are also interested in negative numbers that's not what you want. You want:

> let round x = floor (x +. 0.5)

Best,

Daniel
* Re: [Caml-list] testers wanted for experimental SSE2 back-end
From: Dmitry Bely @ 2010-03-23 9:22 UTC
To: caml-list

On Tue, Mar 23, 2010 at 12:07 PM, Daniel Bünzli <daniel.buenzli@erratique.ch> wrote:

>> let round x = truncate (x +. 0.5)
>
> Side note, if you are also interested in negative numbers

Sure, it was just a code generation example. Probably I should have used the name round_positive.

> that's not what you want. You want :
>
>> let round x = floor (x +. 0.5)

...and you get a floor() C call. I would rather use

    let round x = truncate (x +. (if x > 0. then 0.5 else -0.5))

- Dmitry Bely
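To make the difference between the three rounding variants in this exchange concrete, here is a small sketch that runs them side by side. The names `round_trunc`, `round_floor` and `round_away` are hypothetical labels, not from the thread, and `round_floor` wraps the floor-based version in `truncate` (an assumption) so that all three return an int.

```ocaml
(* The three variants discussed in the thread, under hypothetical names. *)

(* Original example: correct only for non-negative x. *)
let round_trunc x = truncate (x +. 0.5)

(* Floor-based variant: halves round towards +infinity. *)
let round_floor x = truncate (floor (x +. 0.5))

(* Sign-aware variant: halves round away from zero, no call to floor. *)
let round_away x = truncate (x +. (if x > 0. then 0.5 else -0.5))

let () =
  List.iter
    (fun x ->
       Printf.printf "%5.1f -> trunc %2d, floor %2d, away %2d\n"
         x (round_trunc x) (round_floor x) (round_away x))
    [1.4; 2.5; -1.4; -2.5]
```

On -1.4 the first variant yields 0 while the other two yield -1; on -2.5 the floor-based variant yields -2 and the sign-aware one yields -3, so the last two are not interchangeable for exact halves.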
* Re: [Caml-list] testers wanted for experimental SSE2 back-end
From: Xavier Leroy @ 2010-03-29 16:49 UTC
To: Dmitry Bely; +Cc: caml-list

Hello Dmitry,

>> This is a call for testers concerning an experimental OCaml compiler
>> back-end that uses SSE2 instructions for floating-point arithmetic.[...]
>
> I cannot provide any benchmark yet

Too bad :-( I got very little feedback to my call: just one data point (thanks Gaetan). Perhaps most OCaml users interested in numerical computations have switched to x86-64bits already? At any rate, given such a lack of interest, this x86-32/SSE2 port isn't going to make it into the OCaml distribution.

> but even not taking into account
> the better register organization there are at least two areas where
> SSE2 can outperform x87 significantly.
>
> 1. Float to integer conversion
> Is quite inefficient on x87 because you have to explicitly set and
> restore rounding mode.

Right. The mode change makes the conversion about 10x slower on x87 than on SSE2. Apparently, float->int conversion is uncommon in numerical code, otherwise we'd observe bigger speedups on real applications...

> 2. Float compare
> Does not set flags on x87 so

The SSE2 code is prettier than the x87 code, but this doesn't seem to translate into a significant performance gain, in my limited testing.

> As for SSE2 backend presented I have some thoughts regarding the code
> (fast math functions via x87 are questionable,

Most x86-32bits C libraries implement sin(), cos(), etc with the x87 instructions, so I'm curious to know what you find objectionable here.

> optimization of floating compare etc.) Where to discuss that - just
> here or there is some entry in Mantis?

Why not start on this list? We'll move to private e-mail if the discussion becomes too heated :-)

- Xavier Leroy
* Re: [Caml-list] testers wanted for experimental SSE2 back-end
From: Dmitry Bely @ 2010-03-29 18:58 UTC
To: caml-list

On Mon, Mar 29, 2010 at 8:49 PM, Xavier Leroy <Xavier.Leroy@inria.fr> wrote:

>>> This is a call for testers concerning an experimental OCaml compiler
>>> back-end that uses SSE2 instructions for floating-point arithmetic.[...]
>>
>> I cannot provide any benchmark yet
>
> Too bad :-( I got very little feedback to my call: just one data point
> (thanks Gaetan). Perhaps most OCaml users interested in numerical
> computations have switched to x86-64bits already? At any rate, given
> such a lack of interest, this x86-32/SSE2 port isn't going to make it
> into the OCaml distribution.

It's a pity. Probably even my (future) benchmarks won't help...

>> but even not taking into account
>> the better register organization there are at least two areas where
>> SSE2 can outperform x87 significantly.
>>
>> 1. Float to integer conversion
>> Is quite inefficient on x87 because you have to explicitly set and
>> restore rounding mode.
>
> Right. The mode change makes the conversion about 10x slower on x87
> than on SSE2. Apparently, float->int conversion is uncommon in
> numerical code, otherwise we'd observe bigger speedups on real
> applications...
>
>> 2. Float compare
>> Does not set flags on x87 so
>
> The SSE2 code is prettier than the x87 code, but this doesn't seem to
> translate into a significant performance gain, in my limited testing.
>
>> As for SSE2 backend presented I have some thoughts regarding the code
>> (fast math functions via x87 are questionable,
>
> Most x86-32bits C libraries implement sin(), cos(), etc with the x87
> instructions, so I'm curious to know what you find objectionable here.

Microsoft's implementation for P4 and above is SSE2-based. And Intel itself recommends doing so:

[quote]
What Is AM Library?
===================

Ever missed a sine or arctangent instruction among Intel Streaming SIMD Extensions? Ever wished there were a way to calculate logarithm or exponent in about a dozen cycles? Here is a new release of Approximate Math Library (AM Library) -- a set of fast routines to calculate math functions using Intel(R) Streaming SIMD Extensions (SSE) and Streaming SIMD Extensions 2 (SSE2). The Library offers trigonometric, reverse trigonometric, logarithmic, and exponential functions for packed and scalar arguments. The processing speed is many times faster than that of x87 instructions and even of table lookups. The accuracy of AM Library routines can be adequate for many applications. It is comparable with that of reciprocal SSE instructions, and is hundreds times better than what is achievable with lookup tables.

The AM Library is provided along with the full source code and a usage sample.
[end of quote]

http://www.intel.com/design/pentiumiii/devtools/AMaths.zip

Another interesting reading:

http://users.ece.utexas.edu/~adnan/comm/fast-trigonometric-functions-using.pdf

>> optimization of floating compare etc.) Where to discuss that - just
>> here or there is some entry in Mantis?
>
> Why not start on this list? We'll move to private e-mail if the
> discussion becomes too heated :-)

OK

1. My variant of emit_float_test (in many cases eliminates extra jump).

    let emit_float_test cmp neg arg lbl =
      let opcode_jp cmp =
        match (cmp, neg) with
          (Ceq, false) -> ("je", true)
        | (Ceq, true)  -> ("jne", true)
        | (Cne, false) -> ("jne", true)
        | (Cne, true)  -> ("je", true)
        | (Clt, false) -> ("jb", true)
        | (Clt, true)  -> ("jae", true)
        | (Cle, false) -> ("jbe", true)
        | (Cle, true)  -> ("ja", true)
        | (Cgt, false) -> ("ja", false)
        | (Cgt, true)  -> ("jbe", false)
        | (Cge, false) -> ("jae", true)
        | (Cge, true)  -> ("jb", false) in
      let branch_opcode, need_jp = opcode_jp cmp in
      let branch_opcode, arg0, arg1, need_jp =
        match arg.(1).loc with
          Reg _ when need_jp ->
            (* swap args if it excludes jmp *)
            let (branch_opcode_swap, need_jp_swap) =
              opcode_jp (match cmp with
                  Ceq -> Ceq | Cne -> Cne | Clt -> Cgt
                | Cle -> Cge | Cgt -> Clt | Cge -> Cle) in
            if need_jp_swap
            then (branch_opcode, arg.(0), arg.(1), true)
            else (branch_opcode_swap, arg.(1), arg.(0), false)
        | _ -> (branch_opcode, arg.(0), arg.(1), need_jp) in
      begin match cmp with
      | Ceq | Cne -> ` ucomisd `
      | _ -> ` comisd `
      end;
      `{emit_reg arg0}, {emit_reg arg1}\n`;
      let branch_if_not_comparable =
        if cmp = Cne then not neg else neg in
      if need_jp then
        if branch_if_not_comparable then begin
          ` jp {emit_label lbl}\n`;
          ` {emit_string branch_opcode} {emit_label lbl}\n`
        end else begin
          let next = new_label() in
          ` jp {emit_label next}\n`;
          ` {emit_string branch_opcode} {emit_label lbl}\n`;
          `{emit_label next}:\n`
        end
      else begin
        ` {emit_string branch_opcode} {emit_label lbl}\n`
      end

2. My variant of fast math functions (see explanation above)

    let emit_floatspecial = function
        "sqrt" -> ` sqrtsd `
      | _ -> assert false

3. Loading st(0) can be two instructions shorter :)

    ` sub esp, 8\n`;
    ` fstp REAL8 PTR [esp]\n`;
    ` movsd {emit_reg dst}, REAL8 PTR [esp]\n`;
    ` add esp, 8\n`

can be written as

    ` fstp REAL8 PTR [esp-8]\n`;
    ` movlpd {emit_reg dst}, REAL8 PTR [esp-8]\n`;

4. Unnecessary instruction in Lop(Iload(Single, addr))

    ` movss {emit_reg dest}, REAL4 PTR {emit_addressing addr i.arg 0}\n`;
    ` cvtss2sd {emit_reg dest}, {emit_reg dest}\n`

can be written as

    ` cvtss2sd {emit_reg dest}, REAL4 PTR {emit_addressing addr i.arg 0}\n`

- Dmitry Bely
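The jp branches in emit_float_test above exist because comisd and ucomisd report an "unordered" result (at least one NaN operand) through the parity flag, and a negated test is not the same as the opposite comparison once NaN is involved. The sketch below shows the source-level behaviour that the emitted code has to preserve. It is an illustration written for this point, not code from the thread; only the fmin definition is taken from the earlier message.

```ocaml
(* fmin as defined earlier in the thread. *)
let fmin (x : float) y = if x < y then x else y

let () =
  let show name b = Printf.printf "%-18s %b\n" name b in
  (* Every ordered comparison involving NaN is false; only <> is true. *)
  show "nan = nan" (nan = nan);
  show "nan <> nan" (nan <> nan);
  show "nan < 1.0" (nan < 1.0);
  show "nan >= 1.0" (nan >= 1.0);
  (* Negating a test is not the same as the opposite comparison. *)
  show "not (nan < 1.0)" (not (nan < 1.0));
  (* As a consequence, fmin is not symmetric when one argument is NaN. *)
  Printf.printf "fmin nan 1.0 = %f\n" (fmin nan 1.0);
  Printf.printf "fmin 1.0 nan = %f\n" (fmin 1.0 nan)
```

Here `not (nan < 1.0)` is true while `nan >= 1.0` is false, which is the distinction the `neg` argument and the `branch_if_not_comparable` flag have to encode.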