* Ocamlopt x86-32 and SSE2 [not found] <20090509100004.353ADBC5C@yquem.inria.fr> @ 2009-05-09 11:38 ` CUOQ Pascal 2009-05-10 1:52 ` [Caml-list] " Goswin von Brederlow 0 siblings, 1 reply; 16+ messages in thread From: CUOQ Pascal @ 2009-05-09 11:38 UTC (permalink / raw) To: caml-list, caml-list Xavier Leroy <Xavier.Leroy@inria.fr> wrote: >2- Declare pre-SSE2 processors obsolete and convert the current > "i386" port to always use SSE2 float arithmetic. > >3- Support both x87 and SSE2 float arithmetic within the same i386 > port, with a command-line option to activate SSE2, like gcc does. As someone with somewhat of an obsession for keeping obsolete computers in function as long as they are not broken, I have to interject something. I still have a functional Pentium 90 (granted, that's not the newest computer that does not support SSE2, but please hear me). I gave up the idea of bootstrapping OCaml on it years ago because it has 16Mb of memory, and that became insufficient around the time Camlp4 became part of the distribution. I would have had either to modify the compilation flow or cross-compile, both of which were too much work for the meagre resulting cool factor. Now, both the old and the new Camlp4 are fine pieces of software that make use of resources available nowadays to make things possible that weren't before. I am not complaining. I am saying that you have to be consistent in your requirements. My father was using Debian on a 500MHz K6-3D that I had somehow been able to upgrade with enough memory to run one of the two popular desktops. He finally upgraded to a new computer because he could see the characters being displayed one by one in the e-mail client. That, or the motherboard died. I can't remember. It was serendipitous, anyway. There are plenty of embedded processors with an x86 instruction set and no SSE2 around, but these are not in the cool toys that we want to run OCaml on. The cool toys have ARM processors. 
My message is: I am one of the people who have the peculiar mental illness that leads one to suggest a compatible option. Well, I am not. Take option 2 and run with it! >However, packagers are >going to be very unhappy: Debian still lists i486 as its bottom line; >for Fedora, it's Pentium or Pentium II; for Windows, it's "a 1GHz >processor", meaning Pentium III. All these processors lack SSE2 >support. Only MacOS X is SSE2-compatible from scratch. Only Linux distributions are a problem, if OCaml packages are at risk of being rejected. Just because Windows still works on old computers doesn't force every program to do the same (flame bait: and I would add that Windows' support for old computers is mostly unintentional). In Linux distributions, is it completely forbidden to have packages that will not work on the bottom line? This is (I assume) Ocaml 3.12 that we are talking about, which would land sometime in 2010 and arrive in binary distributions that are scheduled to be released in 2011. Will Debian maintain its delusion of supporting the i486 by that time? Pascal ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Caml-list] Ocamlopt x86-32 and SSE2 2009-05-09 11:38 ` Ocamlopt x86-32 and SSE2 CUOQ Pascal @ 2009-05-10 1:52 ` Goswin von Brederlow 2009-05-10 2:16 ` Seo Sanghyeon ` (2 more replies) 0 siblings, 3 replies; 16+ messages in thread From: Goswin von Brederlow @ 2009-05-10 1:52 UTC (permalink / raw) To: CUOQ Pascal; +Cc: caml-list "CUOQ Pascal" <Pascal.CUOQ@cea.fr> writes: > Xavier Leroy <Xavier.Leroy@inria.fr> wrote: >>2- Declare pre-SSE2 processors obsolete and convert the current >> "i386" port to always use SSE2 float arithmetic. >> >>3- Support both x87 and SSE2 float arithmetic within the same i386 >> port, with a command-line option to activate SSE2, like gcc does. >... > In Linux distributions, is it completely forbidden to have packages > that will not work on the bottom line? > This is (I assume) Ocaml 3.12 that we are talking about, which > would land sometime in 2010 and arrive in binary distributions > that are scheduled to be released in 2011. Will Debian maintain > its delusion of supporting the i486 by that time? > > Pascal As you said (in the deleted part) there are plenty of cpus without SSE2 around and Debian will continue to support them. That does not really mean i486 at 25MHz will be used but it is the common bottom line that can easily be supported. Having ocaml require SSE2 is quite unacceptable for someone with a Via C7 cpu (they don't have SSE2, right?) Is it really that much work for ocaml to use option 3? MfG Goswin ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Caml-list] Ocamlopt x86-32 and SSE2 2009-05-10 1:52 ` [Caml-list] " Goswin von Brederlow @ 2009-05-10 2:16 ` Seo Sanghyeon 2009-05-10 3:50 ` Jon Harrop 2009-05-10 8:56 ` CUOQ Pascal 2009-05-10 19:25 ` Florian Weimer 2 siblings, 1 reply; 16+ messages in thread From: Seo Sanghyeon @ 2009-05-10 2:16 UTC (permalink / raw) To: Goswin von Brederlow; +Cc: CUOQ Pascal, caml-list 2009/5/10 Goswin von Brederlow <goswin-v-b@web.de>: > Having ocaml require SSE2 is quite unacceptable for someone with a Via > C7 cpu (they don't have SSE2, right?) Is it really that much work for > ocaml to use option 3? Maybe not, but don't underestimate tiny inconveniences! Even if it is tiny more work to support x87, it could be a difference of doing it and not doing it. http://lesswrong.com/lw/f1/beware_trivial_inconveniences/ -- Seo Sanghyeon ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Caml-list] Ocamlopt x86-32 and SSE2 2009-05-10 2:16 ` Seo Sanghyeon @ 2009-05-10 3:50 ` Jon Harrop 2009-05-11 8:05 ` Dmitry Bely 0 siblings, 1 reply; 16+ messages in thread From: Jon Harrop @ 2009-05-10 3:50 UTC (permalink / raw) To: caml-list On Sunday 10 May 2009 03:16:49 Seo Sanghyeon wrote: > 2009/5/10 Goswin von Brederlow <goswin-v-b@web.de>: > > Having ocaml require SSE2 is quite unacceptable for someone with a Via > > C7 cpu (they don't have SSE2, right?) Is it really that much work for > > ocaml to use option 3? > > Maybe not, but don't underestimate tiny inconveniences! Even if it is > tiny more work to support x87, it could be a difference of doing it and > not doing it. > http://lesswrong.com/lw/f1/beware_trivial_inconveniences/ If you want to avoid inconvenience, why not use LLVM to replace several of the existing backends? -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Caml-list] Ocamlopt x86-32 and SSE2 2009-05-10 3:50 ` Jon Harrop @ 2009-05-11 8:05 ` Dmitry Bely 2009-05-11 9:26 ` Jon Harrop 0 siblings, 1 reply; 16+ messages in thread From: Dmitry Bely @ 2009-05-11 8:05 UTC (permalink / raw) To: Caml List On Sun, May 10, 2009 at 7:50 AM, Jon Harrop <jon@ffconsultancy.com> wrote: > On Sunday 10 May 2009 03:16:49 Seo Sanghyeon wrote: >> 2009/5/10 Goswin von Brederlow <goswin-v-b@web.de>: >> > Having ocaml require SSE2 is quite unacceptable for someone with a Via >> > C7 cpu (they don't have SSE2, right?) Is it really that much work for >> > ocaml to use option 3? >> >> Maybe not, but don't underestimate tiny inconveniences! Even if it is >> tiny more work to support x87, it could be a difference of doing it and >> not doing it. >> http://lesswrong.com/lw/f1/beware_trivial_inconveniences/ > > If you want to avoid inconvenience, why not use LLVM to replace several of the > existing backends? I think it would be the major code rewrite (if ever possible). Merging SSE2 from amd64 into i386 code generator took about a day of my efforts. How much time LLVM integration would require? If it is that simple can you provide a proof-of-the-concept implementation? - Dmitry Bely ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Caml-list] Ocamlopt x86-32 and SSE2 2009-05-11 8:05 ` Dmitry Bely @ 2009-05-11 9:26 ` Jon Harrop 2009-05-11 8:43 ` Dmitry Bely 2009-05-11 9:12 ` Andrey Riabushenko 0 siblings, 2 replies; 16+ messages in thread From: Jon Harrop @ 2009-05-11 9:26 UTC (permalink / raw) To: caml-list On Monday 11 May 2009 09:05:08 Dmitry Bely wrote: > I think it would be the major code rewrite (if ever possible). Merging > SSE2 from amd64 into i386 code generator took about a day of my > efforts. How much time LLVM integration would require? If it is that > simple can you provide a proof-of-the-concept implementation? Well, I can provide a complete garbage collected VM. :-) http://hlvm.forge.ocamlcore.org/ The hard part of writing an LLVM backend for ocamlopt is probably getting LLVM to generate code that is compatible with OCaml's GC, particularly the stack. However, I believe Gordon Henriksen already did this: "Included in the pending LLVM garbage collection code generation changeset is an Ocaml frametable emitter." - http://lists.cs.uiuc.edu/pipermail/llvmdev/2007-November/011527.html Unfortunately, I will not have any spare time until my next book is out... Did any of the OCaml+LLVM student projects get funded in the end? -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Caml-list] Ocamlopt x86-32 and SSE2 2009-05-11 9:26 ` Jon Harrop @ 2009-05-11 8:43 ` Dmitry Bely 2009-05-11 13:47 ` Jon Harrop 2009-05-11 9:12 ` Andrey Riabushenko 1 sibling, 1 reply; 16+ messages in thread From: Dmitry Bely @ 2009-05-11 8:43 UTC (permalink / raw) To: Caml List On Mon, May 11, 2009 at 1:26 PM, Jon Harrop <jon@ffconsultancy.com> wrote: > On Monday 11 May 2009 09:05:08 Dmitry Bely wrote: >> I think it would be the major code rewrite (if ever possible). Merging >> SSE2 from amd64 into i386 code generator took about a day of my >> efforts. How much time LLVM integration would require? If it is that >> simple can you provide a proof-of-the-concept implementation? > > Well, I can provide a complete garbage collected VM. :-) > > http://hlvm.forge.ocamlcore.org/ We are talking about a new backend to Ocaml compiler, aren't we? > The hard part of writing an LLVM backend for ocamlopt is probably getting LLVM > to generate code that is compatible with OCaml's GC, particularly the stack. > However, I believe Gordon Henriksen already did this: > > "Included in the pending LLVM garbage collection code generation > changeset is an Ocaml frametable emitter." - > http://lists.cs.uiuc.edu/pipermail/llvmdev/2007-November/011527.html So it's just pie in the sky. No working implementation has been demonstrated since then. The answer to your "why not use LLVM to replace several of the existing backends?" question is quite obvious. - Dmitry Bely ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Caml-list] Ocamlopt x86-32 and SSE2
  2009-05-11 8:43 ` Dmitry Bely
@ 2009-05-11 13:47 ` Jon Harrop
  0 siblings, 0 replies; 16+ messages in thread
From: Jon Harrop @ 2009-05-11 13:47 UTC
  To: caml-list

On Monday 11 May 2009 09:43:59 Dmitry Bely wrote:
> So it's just pie in the sky. No working implementation has been
> demonstrated since then.

The file "test/CodeGen/Generic/GC/simple_ocaml.ll" in the LLVM 2.5 source
distribution contains the following test code for the OCaml-compatible
frametable emitter:

%struct.obj = type { i8*, %struct.obj* }

define %struct.obj* @fun(%struct.obj* %head) gc "ocaml" {
entry:
        %gcroot.0 = alloca i8*
        %gcroot.1 = alloca i8*

        call void @llvm.gcroot(i8** %gcroot.0, i8* null)
        call void @llvm.gcroot(i8** %gcroot.1, i8* null)

        %local.0 = bitcast i8** %gcroot.0 to %struct.obj**
        %local.1 = bitcast i8** %gcroot.1 to %struct.obj**

        store %struct.obj* %head, %struct.obj** %local.0
        br label %bb.loop
bb.loop:
        %t0 = load %struct.obj** %local.0
        %t1 = getelementptr %struct.obj* %t0, i32 0, i32 1
        %t2 = bitcast %struct.obj* %t0 to i8*
        %t3 = bitcast %struct.obj** %t1 to i8**
        %t4 = call i8* @llvm.gcread(i8* %t2, i8** %t3)
        %t5 = bitcast i8* %t4 to %struct.obj*
        %t6 = icmp eq %struct.obj* %t5, null
        br i1 %t6, label %bb.loop, label %bb.end
bb.end:
        %t7 = malloc %struct.obj
        store %struct.obj* %t7, %struct.obj** %local.1
        %t8 = bitcast %struct.obj* %t7 to i8*
        %t9 = load %struct.obj** %local.0
        %t10 = getelementptr %struct.obj* %t9, i32 0, i32 1
        %t11 = bitcast %struct.obj* %t9 to i8*
        %t12 = bitcast %struct.obj** %t10 to i8**
        call void @llvm.gcwrite(i8* %t8, i8* %t11, i8** %t12)
        ret %struct.obj* %t7
}

declare void @llvm.gcroot(i8** %value, i8* %tag)
declare void @llvm.gcwrite(i8* %value, i8* %obj, i8** %field)
declare i8* @llvm.gcread(i8* %obj, i8** %field)

Compiling this with:

        llvm-as <simple_ocaml.ll | llc

gives:

        .file   "<stdin>"
        .text
        .globl  caml<stdin>__code_begin
caml<stdin>__code_begin:
        .data
        .globl  caml<stdin>__data_begin
caml<stdin>__data_begin:

        .text
        .align  16
        .globl  fun
        .type   fun,@function
fun:
.Leh_func_begin1:
.Llabel1:
        subl    $12, %esp
        movl    $0, 8(%esp)
        movl    $0, 4(%esp)
        movl    16(%esp), %eax
        movl    %eax, 8(%esp)
        .align  16
.LBB1_1:        # bb.loop
        movl    8(%esp), %eax
        cmpl    $0, 4(%eax)
        je      .LBB1_1 # bb.loop
.LBB1_2:        # bb.end
        movl    $8, (%esp)
        call    malloc
.Llabel2:
        movl    %eax, 4(%esp)
        movl    8(%esp), %ecx
        movl    %eax, 4(%ecx)
        addl    $12, %esp
        ret
        .size   fun, .-fun
.Leh_func_end1:
        .section        .eh_frame,"aw",@progbits
.LEH_frame0:
.Lsection_eh_frame:
.Leh_frame_common:
        .long   .Leh_frame_common_end-.Leh_frame_common_begin
.Leh_frame_common_begin:
        .long   0x0
        .byte   0x1
        .asciz  "zR"
        .uleb128        1
        .sleb128        -4
        .byte   0x8
        .uleb128        1
        .byte   0x1B
        .byte   0xC
        .uleb128        4
        .uleb128        4
        .byte   0x88
        .uleb128        1
        .align  4
.Leh_frame_common_end:

.Lfun.eh:
        .long   .Leh_frame_end1-.Leh_frame_begin1
.Leh_frame_begin1:
        .long   .Leh_frame_begin1-.Leh_frame_common
        .long   .Leh_func_begin1-.
        .long   .Leh_func_end1-.Leh_func_begin1
        .uleb128        0
        .byte   0xE
        .uleb128        16
        .byte   0x4
        .long   .Llabel1-.Leh_func_begin1
        .byte   0xD
        .uleb128        4
        .align  4
.Leh_frame_end1:

        .text
        .globl  caml<stdin>__code_end
caml<stdin>__code_end:
        .data
        .globl  caml<stdin>__data_end
caml<stdin>__data_end:
        .long   0
        .globl  caml<stdin>__frametable
caml<stdin>__frametable:
        # live roots for fun
        .long   .Llabel2
        .short  0xC
        .short  0x2
        .word   8
        .word   4
        .align  4
        .section        .note.GNU-stack,"",@progbits

So perhaps it is worth a look.

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e
* Re: [Caml-list] Ocamlopt x86-32 and SSE2
  2009-05-11 9:26 ` Jon Harrop
  2009-05-11 8:43 ` Dmitry Bely
@ 2009-05-11 9:12 ` Andrey Riabushenko
  1 sibling, 0 replies; 16+ messages in thread
From: Andrey Riabushenko @ 2009-05-11 9:12 UTC
  To: caml-list

> Did any of the OCaml+LLVM student projects get funded in the end?

No, unfortunately. Not this time...
* Ocamlopt x86-32 and SSE2
  2009-05-10 1:52 ` [Caml-list] " Goswin von Brederlow
  2009-05-10 2:16 ` Seo Sanghyeon
@ 2009-05-10 8:56 ` CUOQ Pascal
  2009-05-10 14:47 ` [Caml-list] " Richard Jones
  2009-05-10 19:25 ` Florian Weimer
  2 siblings, 1 reply; 16+ messages in thread
From: CUOQ Pascal @ 2009-05-10 8:56 UTC
  To: Goswin von Brederlow; +Cc: caml-list

>That does not
>really mean i486 at 25MHz will be used but it is the common bottom
>line that can easily be supported.

My point is that you're not looking at the whole set of requirements
for OCaml and other existing Debian packages when you look only at
the processor's instruction set. The way to keep old hardware running
is to keep it running old software. Or, if you give me a second to
switch to my Bogart voice, "we will always have 3.11".

>Having ocaml require SSE2 is quite unacceptable for someone with a Via
>C7 cpu (they don't have SSE2, right?)

According to http://en.wikipedia.org/wiki/SSE2, someone using a Via C7
should be fine.

Pascal
* Re: [Caml-list] Ocamlopt x86-32 and SSE2
  2009-05-10 8:56 ` CUOQ Pascal
@ 2009-05-10 14:47 ` Richard Jones
  0 siblings, 0 replies; 16+ messages in thread
From: Richard Jones @ 2009-05-10 14:47 UTC
  To: CUOQ Pascal; +Cc: Goswin von Brederlow, caml-list

On Sun, May 10, 2009 at 10:56:37AM +0200, CUOQ Pascal wrote:
> According to http://en.wikipedia.org/wiki/SSE2, someone using a Via C7
> should be fine.

AMD Geode then ...

$ grep -i flags /proc/cpuinfo
flags : fpu de pse tsc msr cx8 pge cmov mmx mmxext 3dnowext 3dnow up

Rich.

-- 
Richard Jones
Red Hat
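The check Richard does by eye can also be done in a few lines of OCaml.
What follows is only a minimal sketch: has_cpu_flag is a hypothetical
helper (not part of any OCaml library), and it inspects a flags line
passed in as a string rather than reading /proc/cpuinfo itself. The
sample line is the Geode output quoted above.

```ocaml
(* Hypothetical helper, not part of any OCaml library: report whether a
   "flags : ..." line, as printed by /proc/cpuinfo on Linux, lists a
   given CPU feature. *)
let has_cpu_flag (flag : string) (flags_line : string) : bool =
  (* Keep only the part after the first ':' if there is one. *)
  let features =
    match String.index_opt flags_line ':' with
    | Some i ->
        String.sub flags_line (i + 1) (String.length flags_line - i - 1)
    | None -> flags_line
  in
  (* Split on spaces and tabs, drop empty fields, compare whole words. *)
  String.split_on_char ' ' features
  |> List.concat_map (String.split_on_char '\t')
  |> List.filter (fun s -> s <> "")
  |> List.mem flag

(* The flags line from the AMD Geode in the message above. *)
let geode_flags =
  "flags : fpu de pse tsc msr cx8 pge cmov mmx mmxext 3dnowext 3dnow up"

let () =
  Printf.printf "sse2 on this Geode: %b\n" (has_cpu_flag "sse2" geode_flags)
```

Whole-word comparison matters here: a substring search for "mmx" would
also match "mmxext", which is a different feature.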
* Re: [Caml-list] Ocamlopt x86-32 and SSE2 2009-05-10 1:52 ` [Caml-list] " Goswin von Brederlow 2009-05-10 2:16 ` Seo Sanghyeon 2009-05-10 8:56 ` CUOQ Pascal @ 2009-05-10 19:25 ` Florian Weimer 2 siblings, 0 replies; 16+ messages in thread From: Florian Weimer @ 2009-05-10 19:25 UTC (permalink / raw) To: Goswin von Brederlow; +Cc: CUOQ Pascal, caml-list * Goswin von Brederlow: > Having ocaml require SSE2 is quite unacceptable for someone with a Via > C7 cpu (they don't have SSE2, right?) More problematic are AMD's K7 and some of their Sempron processors, I think. AMD introduced SSE2-less CPUs as late as 2004. ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <20090511043120.976EBBC67@yquem.inria.fr>]
* Ocamlopt x86-32 and SSE2
  [not found] <20090511043120.976EBBC67@yquem.inria.fr>
@ 2009-05-11 7:10 ` Pascal Cuoq
  2009-05-12 9:37 ` [Caml-list] " Xavier Leroy
  0 siblings, 1 reply; 16+ messages in thread
From: Pascal Cuoq @ 2009-05-11 7:10 UTC
  To: caml-list

Here's an idea. I don't know if it is relevant, but it looks like it
could be a good compromise (option 2.5, if you will): how about
implementing floating-point operations as function calls (the
functions could be written in C and be part of the runtime library)
when the SSE2 instructions are not available? Is that simpler than
option 3?

Matteo Frigo <athena@fftw.org> wrote:
> Do you guys have any sort of empirical evidence that scalar SSE2
> math is faster than plain old x87?

It's not speed I am after personally, but a correct implementation of
IEEE 754's round-to-nearest mode for doubles. Also, the satisfying
knowledge that the code of the compiler I use is as tight as it can be
and that I could understand it if I had to some day.

Jon Harrop <jon@ffconsultancy.com> wrote:
> Note that you can use the same argument to justify not optimizing
> the x86 backend because power users should be using the (much more
> performant) x64 code gen.

I don't know where you get "much more performant" from. For what I do,
speed of floating-point operations is irrelevant, but not the speed of
the whole application. The whole application is slightly slower (~10%)
with the larger data words despite the improved instruction set. Plus,
memory is also a concern, and for users who have less than 6GiB of
memory, there are actually more addressable data words in x86 mode.

Pascal
* Re: [Caml-list] Ocamlopt x86-32 and SSE2 2009-05-11 7:10 ` Pascal Cuoq @ 2009-05-12 9:37 ` Xavier Leroy 2009-05-12 10:04 ` Sylvain Le Gall 0 siblings, 1 reply; 16+ messages in thread From: Xavier Leroy @ 2009-05-12 9:37 UTC (permalink / raw) To: caml-list This is an interesting discussion with many relevant points being made. Some comments: Matteo Frigo: > Do you guys have any sort of empirical evidence that scalar SSE2 math is > faster than plain old x87? > I ask because every time I tried compiling FFTW with gcc -m32 > -mfpmath=sse, the result has been invariably slower than the vanilla x87 > compilation. (I am talking about scalar arithmetic here. FFTW also > supports SSE2 2-way vector arithmetic, which is of course faster.) gcc does rather clever tricks with the x87 float stack and the fxch instruction, making it look almost like a flat register set and managing to expose some instruction-level parallelism despite the dependencies on the top of the stack. In contrast, ocamlopt uses the x87 stack in a pedestrian, reverse-Polish-notation way, so the benefits of having "real" float registers is bigger. Using the experimental x86-sse2 port that I did in 2003 on a Core2 processor, I see speedups of 10 to 15% on my few standard float benchmarks. However, these benchmarks were written in such a way that the generated x87 code isn't too awful. It is easy to construct examples where the SSE2 code is twice as fast as x87. More generally, the SSE2 code generator is much more forgiving towards changes in program style, and its performance characteristics are more predictable than the x87 code generator. For instance, manual elimination of common subexpressions is almost always a win with SSE2 but quite often a loss with x87 ... Pascal Cuoq: > According to http://en.wikipedia.org/wiki/SSE2, someone using a Via C7 > should be fine. Richard Jones: > AMD Geode then ... Apparently, recent versions of the Geode support SSE2 as well. 
Low-power people love vector instruction sets, because it lets them do common tasks like audio and video decoding more efficiently, ergo with less energy. Sylvain Le Gall: > If INRIA choose to switch to SSE2 there should be at least still a way > to compile on older architecture. Doesn't mean that INRIA need to keep > the old code generator, but should provide a simple emulation for it. In > this case, we will have good performance on new arch for float and we > will still be able to compile on old arch. The least complicated way to preserve backward compatibility with pre-SSE2 hardware is to keep the existing x87 code generator and bolt the SSE2 generator on top of it, Frankenstein-style. Well, either that, or rely on the kernel to trap unimplemented SSE2 instructions and emulate them in software. This is theoretically possible but I'm pretty sure neither Linux nor Windows implement it. David Mentre: > Regarding option 2, I assume that byte-code would still work on i386 > pre-SSE2 machines? So OCaml programs would still work on those machines. You're correct, provided the bytecode interpreter isn't compiled in SSE2 mode itself (see below for one reason one might want to do this). However, packagers would still be unhappy about this: packaged OCaml applications like Unison or Coq are usually compiled to native-code (the additional speed is most welcome in the case of Coq...). Therefore, packagers would have to choose between making these applications SSE2-only or make them slower by compiling them to bytecode. Dmitry Bely: > [Reproducibility of results between bytecode and native] > I wouldn't be so sure. Bytecode runtime is C compiler-dependent (that > does use x87 for floating-point calculations), so rounding errors can > lead to different results. That's right: even though it stores all intermediate float results in 64-bit format, a bytecode interpreter compiled in default x87 mode still exhibits double rounding anomalies. 
One would have to compile it with gcc in SSE2 mode (like MacOS X does by default) to have complete reproducibility between bytecode and native. > Floating point is always approximate... I used to believe strongly in this viewpoint, but after discussion with people who do static analysis or program proof over float programs, I'm not so sure: static analysis and program proof are difficult enough that one doesn't want to complicate them even further to take extended-precision intermediate results and double rounding into account... To finish: I'm still very interested in hearing from packagers. Does Debian, for example, already have some packages that are SSE2-only? Are these packages specially tagged so that the installer will refuse to install them on pre-SSE2 hardware? What's the party line? - Xavier Leroy ^ permalink raw reply [flat|nested] 16+ messages in thread
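The double rounding anomaly mentioned above can be illustrated with a
toy model. This is a sketch under stated assumptions, not a description
of how the x87 unit works internally: it uses small non-negative
integers instead of floats, and the hypothetical function
round_to_multiple rounds to a multiple of 2^shift with ties going to the
even multiple. Rounding first to a fine grid (standing in for the
extended-precision intermediate result) and then to a coarse grid
(standing in for the 64-bit double) can give a different answer than
rounding to the coarse grid once.

```ocaml
(* Toy model: round a non-negative integer [n] to the nearest multiple of
   2^shift, with ties going to the even multiple (round-half-to-even). *)
let round_to_multiple (n : int) (shift : int) : int =
  let step = 1 lsl shift in
  let q = n lsr shift in            (* index of the multiple at or below n *)
  let r = n land (step - 1) in      (* remainder, in [0, step) *)
  let half = step lsr 1 in
  let q' =
    if r = 0 then q                 (* already on the grid *)
    else if r > half then q + 1     (* closer to the multiple above *)
    else if r < half then q         (* closer to the multiple below *)
    else if q land 1 = 1 then q + 1 (* exact tie: pick the even multiple *)
    else q
  in
  q' lsl shift

(* Round 23 to a multiple of 16 in one step... *)
let once = round_to_multiple 23 4

(* ...and in two steps: first to a multiple of 2, then to a multiple of 16. *)
let twice = round_to_multiple (round_to_multiple 23 1) 4

let () = Printf.printf "once: %d, twice: %d\n" once twice
(* prints "once: 16, twice: 32" *)
```

The first rounding turns 23 into 24, which sits exactly halfway between
16 and 32, so the second rounding resolves a tie that the original value
never had.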
* Re: Ocamlopt x86-32 and SSE2
  2009-05-12 9:37 ` [Caml-list] " Xavier Leroy
@ 2009-05-12 10:04 ` Sylvain Le Gall
  2009-05-25 8:23 ` Sylvain Le Gall
  0 siblings, 1 reply; 16+ messages in thread
From: Sylvain Le Gall @ 2009-05-12 10:04 UTC
  To: caml-list

On 12-05-2009, Xavier Leroy <Xavier.Leroy@inria.fr> wrote:
>
> Sylvain Le Gall:
>> If INRIA choose to switch to SSE2 there should be at least still a way
>> to compile on older architecture. Doesn't mean that INRIA need to keep
>> the old code generator, but should provide a simple emulation for it. In
>> this case, we will have good performance on new arch for float and we
>> will still be able to compile on old arch.
>
> The least complicated way to preserve backward compatibility with
> pre-SSE2 hardware is to keep the existing x87 code generator and bolt
> the SSE2 generator on top of it, Frankenstein-style. Well, either
> that, or rely on the kernel to trap unimplemented SSE2 instructions
> and emulate them in software. This is theoretically possible but I'm
> pretty sure neither Linux nor Windows implement it.
>

I was thinking (if it is possible) of using simple function calls for
float operations. This would be very inefficient, but would provide a
very simple compatibility layer.

>
> To finish: I'm still very interested in hearing from packagers. Does
> Debian, for example, already have some packages that are SSE2-only?
> Are these packages specially tagged so that the installer will refuse
> to install them on pre-SSE2 hardware? What's the party line?
>

The most obvious packages I can see are the Linux kernel and libc6:

http://packages.debian.org/lenny/linux-image-2.6.26-2-486
http://packages.debian.org/lenny/linux-image-2.6.26-1-686-bigmem
http://packages.debian.org/lenny/libc6
http://packages.debian.org/lenny/libc6-i686

AFAIK, there is no way for the package manager to make a real
distinction between them (no tag). However, the installer has some clue
about which one to choose and installs the best one for linux and
libc6. Once installed, it is always updated correctly, because the
arch is embedded in the package name.

I think linux and libc6 should be considered exceptions, because they
really provide an important benefit for overall optimization.

For other packages, if an optimization is possible, versions with and
without the optimization are embedded in the package and chosen at
runtime. For example, libavcodec provides i686 and i486 versions:

http://packages.debian.org/sid/i386/libavcodec52/filelist

So in conclusion, there is always a "default" non-SSE2 alternative for
packages that can provide an optimized version. I don't know of any
package that is SSE2-only.

In my opinion, Debian will probably refuse to ship a package that only
provides an SSE2-only version (but I am talking from my point of view).

Regards
Sylvain Le Gall
* Re: Ocamlopt x86-32 and SSE2
  2009-05-12 10:04 ` Sylvain Le Gall
@ 2009-05-25 8:23 ` Sylvain Le Gall
  0 siblings, 0 replies; 16+ messages in thread
From: Sylvain Le Gall @ 2009-05-25 8:23 UTC
  To: caml-list

On 12-05-2009, Sylvain Le Gall <sylvain@le-gall.net> wrote:
> On 12-05-2009, Xavier Leroy <Xavier.Leroy@inria.fr> wrote:
>>
>> Sylvain Le Gall:
>>
>> To finish: I'm still very interested in hearing from packagers. Does
>> Debian, for example, already have some packages that are SSE2-only?
>> Are these packages specially tagged so that the installer will refuse
>> to install them on pre-SSE2 hardware? What's the party line?
>>
>
> In my opinion, Debian will probably refuse to ship a package that only
> provides an SSE2-only version (but I am talking from my point of view).
>

For those who are interested, a discussion just started about dropping
pre-i686 architectures for Debian:

http://permalink.gmane.org/gmane.linux.debian.devel.kernel/47844

The first round of posts seems clearly against this decision. The main
argument is that many schools are using old pre-i686 hardware.

Regards,
Sylvain Le Gall
* Ocamlopt code generator question
@ 2009-04-28 19:36 Dmitry Bely
  2009-05-05 9:24 ` [Caml-list] " Xavier Leroy
  0 siblings, 1 reply; 16+ messages in thread
From: Dmitry Bely @ 2009-04-28 19:36 UTC
  To: Caml List

For amd64 we have in asmcomp/amd64/proc_nt.mlp:

(* xmm0 - xmm15  100 - 115       xmm0 - xmm9: Caml function arguments
                                 xmm0 - xmm3: C function arguments
                                 xmm0: Caml and C function results
                                 xmm6-xmm15 are preserved by C *)

let loc_arguments arg =
  calling_conventions 0 9 100 109 outgoing arg
let loc_parameters arg =
  let (loc, ofs) = calling_conventions 0 9 100 109 incoming arg in loc
let loc_results res =
  let (loc, ofs) = calling_conventions 0 0 100 100 not_supported res in loc

What do these first_float=100 and last_float=109 for loc_arguments and
loc_parameters affect? My impression is that floats are always passed
boxed, so xmm registers are in fact never used to pass parameters. And
float values are returned as a pointer in eax, not a value in xmm0 as
loc_results would suggest.

- Dmitry Bely
* Re: [Caml-list] Ocamlopt code generator question
  2009-04-28 19:36 Ocamlopt code generator question Dmitry Bely
@ 2009-05-05 9:24 ` Xavier Leroy
  2009-05-05 9:41 ` Dmitry Bely
  0 siblings, 1 reply; 16+ messages in thread
From: Xavier Leroy @ 2009-05-05 9:24 UTC
  To: Dmitry Bely; +Cc: Caml List

> For amd64 we have in asmcomp/amd64/proc_nt.mlp:
>
> (* xmm0 - xmm15  100 - 115       xmm0 - xmm9: Caml function arguments
>                                  xmm0 - xmm3: C function arguments
>                                  xmm0: Caml and C function results
>                                  xmm6-xmm15 are preserved by C *)
>
> let loc_arguments arg =
>   calling_conventions 0 9 100 109 outgoing arg
> let loc_parameters arg =
>   let (loc, ofs) = calling_conventions 0 9 100 109 incoming arg in loc
> let loc_results res =
>   let (loc, ofs) = calling_conventions 0 0 100 100 not_supported res in loc
>
> What do these first_float=100 and last_float=109 for loc_arguments and
> loc_parameters affect? My impression is that floats are always passed
> boxed, so xmm registers are in fact never used to pass parameters. And
> float values are returned as a pointer in eax, not a value in xmm0 as
> loc_results would suggest.

The ocamlopt code generators support unboxed floats as function
parameters and results, as well as returning multiple results in
several registers. (Except for the x86-32 bits port, because of the
weird floating-point model of this architecture.) You're right that
the ocamlopt "middle-end" does not currently take advantage of this
possibility, since floats are passed between functions in boxed state.

- Xavier Leroy
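To make the role of the four numeric arguments concrete, here is a
simplified, self-contained sketch of how a function shaped like
calling_conventions could hand out locations. It is not the code from
asmcomp/amd64/proc_nt.mlp: the types arg_type and location are
hypothetical stand-ins, the stack-slot argument (outgoing, incoming,
not_supported) is dropped, and stack slots are assumed to be 8 bytes.
Integer arguments take register numbers first_int..last_int, float
arguments take first_float..last_float, and whatever does not fit goes
to the stack.

```ocaml
(* Hypothetical stand-ins for the compiler's machine types and locations. *)
type arg_type = Int | Float
type location = Reg of int | Stack of int  (* register number or byte offset *)

(* Simplified sketch: returns the location of each argument and the number
   of stack bytes used. Slot size of 8 bytes is an assumption. *)
let calling_conventions first_int last_int first_float last_float args =
  let next_int = ref first_int in
  let next_float = ref first_float in
  let ofs = ref 0 in
  (* Take the next free register of a class, else a fresh stack slot. *)
  let assign next last =
    if !next <= last then begin
      let r = !next in
      incr next;
      Reg r
    end else begin
      let slot = Stack !ofs in
      ofs := !ofs + 8;
      slot
    end
  in
  let locs =
    Array.map
      (function
        | Int -> assign next_int last_int
        | Float -> assign next_float last_float)
      args
  in
  (locs, !ofs)

(* Same register ranges as loc_arguments in the quoted code. *)
let (locs, stack_size) =
  calling_conventions 0 9 100 109 [| Int; Float; Float; Int |]
```

With these ranges the example yields registers 0, 100, 101 and 1 and no
stack use. As the reply above notes, floats currently cross function
boundaries boxed, so in practice the float range would only come into
play once unboxed floats are passed.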
* Re: [Caml-list] Ocamlopt code generator question 2009-05-05 9:24 ` [Caml-list] " Xavier Leroy @ 2009-05-05 9:41 ` Dmitry Bely 2009-05-08 10:21 ` [Caml-list] Ocamlopt x86-32 and SSE2 Xavier Leroy 0 siblings, 1 reply; 16+ messages in thread From: Dmitry Bely @ 2009-05-05 9:41 UTC (permalink / raw) To: Caml List On Tue, May 5, 2009 at 1:24 PM, Xavier Leroy <Xavier.Leroy@inria.fr> wrote: >> For amd64 we have in asmcomp/amd64/proc_nt.mlp: >> >> (* xmm0 - xmm15 100 - 115 xmm0 - xmm9: Caml function arguments >> xmm0 - xmm3: C function arguments >> xmm0: Caml and C function results >> xmm6-xmm15 are preserved by C *) >> >> let loc_arguments arg = >> calling_conventions 0 9 100 109 outgoing arg >> let loc_parameters arg = >> let (loc, ofs) = calling_conventions 0 9 100 109 incoming arg in loc >> let loc_results res = >> let (loc, ofs) = calling_conventions 0 0 100 100 not_supported res in loc >> >> What these first_float=100 and last_float=109 for loc_arguments and >> loc_parameters affect? My impression is that floats are always passed >> boxed, so xmm registers are in fact never used to pass parameters. And >> float values are returned as a pointer in eax, not a value in xmm0 as >> loc_results would suggest. > > The ocamlopt code generators support unboxed floats as function > parameters and results, as well as returning multiple results in > several registers. (Except for the x86-32 bits port, because of the > weird floating-point model of this architecture.) You're right that > the ocamlopt "middle-end" does not currently take advantage of this > possibility, since floats are passed between functions in boxed state. I see. Why I asked this: trying to improve floating-point performance on 32-bit x86 platform I have merged floating-point SSE2 code generator from amd64 ocamlopt back end to i386 one, making ia32sse2 architecture. 
It also inlines sqrt() via the -ffast-math flag and slightly optimizes emit_float_test (usually eliminates an extra jump) - features that are missed in the original amd64 code generator. All this seems to work OK: beyond my own code, all tests found in the OCaml CVS test directory pass. Of course this idea is not new - you had a working IA32+SSE2 back end several years ago [1] but unfortunately never released it to the public. Is this of any interest to anybody? - Dmitry Bely [1] http://caml.inria.fr/pub/ml-archives/caml-list/2003/03/e0db2f3f54ce19e4bad589ffbb082484.fr.html ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Caml-list] Ocamlopt x86-32 and SSE2 2009-05-05 9:41 ` Dmitry Bely @ 2009-05-08 10:21 ` Xavier Leroy 2009-05-10 11:04 ` David MENTRE 0 siblings, 1 reply; 16+ messages in thread From: Xavier Leroy @ 2009-05-08 10:21 UTC (permalink / raw) To: Dmitry Bely; +Cc: Caml List Dmitry Bely wrote: > I see. Why I asked this: trying to improve floating-point performance > on 32-bit x86 platform I have merged floating-point SSE2 code > generator from amd64 ocamlopt back end to i386 one, making ia32sse2 > architecture. It also inlines sqrt() via -ffast-math flag and slightly > optimizes emit_float_test (usually eliminates an extra jump) - > features that are missed in the original amd64 code generator. You just passed black belt in OCaml compiler hacking :-) > Is this of any interest to anybody? I'm definitely interested in the potential improvements to the amd64 code generator. Concerning the i386 code generator (x86 in 32-bit mode), SSE2 float arithmetic does improve performance and fit ocamlopt's compilation model much better than the current x87 float arithmetic, which is a bit of a hack. Several options can be considered: 1- Have an additional "ia32sse2" port of ocamlopt in parallel with the current "i386" port. 2- Declare pre-SSE2 processors obsolete and convert the current "i386" port to always use SSE2 float arithmetic. 3- Support both x87 and SSE2 float arithmetic within the same i386 port, with a command-line option to activate SSE2, like gcc does. I'm really not keen on approach 1. We have too many ports (and their variants for Windows/MSVC) already. Moreover, I suspect packagers would stick to the i386 port for compatibility with old hardware, and most casual users would, too, out of laziness, so this hypothetical "ia32sse2" port would receive little testing. Approach 2 is tempting for me because it would simplify the x86-32 code generator and remove some historical cruft. The issue is that it demands a processor that implements SSE2.
For a list of processors, see http://en.wikipedia.org/wiki/SSE2 As a rule of thumb, almost all desktop PCs bought since 2004 have SSE2, as well as almost all notebooks since 2006. That should be OK for professional users (it's nearly impossible to purchase maintenance beyond 3 years, anyway) and serious hobbyists. However, packagers are going to be very unhappy: Debian still lists i486 as its bottom line; for Fedora, it's Pentium or Pentium II; for Windows, it's "a 1GHz processor", meaning Pentium III. All these processors lack SSE2 support. Only MacOS X is SSE2-compatible from scratch. Approach 3 is probably the best from a user's point of view. But it's going to complicate the code generator: the x87 cruft would still be there, and new cruft would need to be added to support SSE2. Code compiled with the SSE2 flag could link with code compiled without, provided the SSE2 registers are not used for parameter and result passing. But as Dmitry observed, this is already the case in the current ocamlopt compiler. Jean-Marc Eber: >> But again, having better floating point performance (and >> predictable behaviour, compared to the bytecode version) would be a >> big plus for some applications. Dmitry Bely: > Don't quite understand what is "predictable behavior" - any generator > should conform to specs. In my tests x87 and SSE2 backends show the > same results (otherwise it would be called a bug). You haven't tested enough :-). The x87 backend keeps some intermediate results in 80-bit float format, while the SSE2 backend (as well as all other backends and the bytecode interpreter) computes everything in 64-bit format. See David Monniaux's excellent tutorial: http://hal.archives-ouvertes.fr/hal-00128124/en/ Computing intermediate results in extended precision has pros and cons, but my understanding is that the cons slightly outweigh the pros. - Xavier Leroy ^ permalink raw reply [flat|nested] 16+ messages in thread
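The difference between 80-bit and 64-bit intermediates can be made concrete with a small sketch. It is an illustration under stated assumptions, not a test from the thread: the names `big` and `round_trip` are made up here, and `Sys.opaque_identity` (available in later OCaml releases) is used only to discourage compile-time constant folding. On a backend that rounds every intermediate to 64 bits, the product overflows before the division. On an x87-style backend that happens to keep the product in an 80-bit register, the wider exponent range could let the division bring the value back to a finite number.

```ocaml
(* With every intermediate rounded to 64 bits, 1e308 *. 10. exceeds the
   largest finite double and becomes infinity before the division.
   An 80-bit x87 register holding the product has enough exponent range
   to keep it finite, so the final result could differ there. *)
let big = Sys.opaque_identity 1e308  (* discourages constant folding *)

let round_trip = (big *. 10.) /. 10.

let () =
  match classify_float round_trip with
  | FP_infinite ->
      print_endline "intermediate was rounded to 64 bits (overflowed)"
  | _ ->
      Printf.printf "intermediate kept extra range: %g\n" round_trip
```

Which branch runs on an x87 backend would depend on whether the intermediate is spilled to memory, which is exactly the kind of unpredictability discussed above.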
* Re: [Caml-list] Ocamlopt x86-32 and SSE2 2009-05-08 10:21 ` [Caml-list] Ocamlopt x86-32 and SSE2 Xavier Leroy @ 2009-05-10 11:04 ` David MENTRE 2009-05-11 3:43 ` Stefan Monnier 0 siblings, 1 reply; 16+ messages in thread From: David MENTRE @ 2009-05-10 11:04 UTC (permalink / raw) To: Xavier Leroy; +Cc: Dmitry Bely, Caml List Hello, Xavier Leroy <Xavier.Leroy@inria.fr> writes: > 1- Have an additional "ia32sse2" port of ocamlopt in parallel with the > current "i386" port. > > 2- Declare pre-SSE2 processors obsolete and convert the current > "i386" port to always use SSE2 float arithmetic. > > 3- Support both x87 and SSE2 float arithmetic within the same i386 > port, with a command-line option to activate SSE2, like gcc does. Regarding option 2, I assume that byte-code would still work on i386 pre-SSE2 machines? So OCaml programs would still work on those machines. As far as I know, one is using ocamlopt to improve performance. I can't think of any case where one would need native code running on pre-SSE2 machines which are so outdated performance-wise. So I would vote for option 2: always use SSE2 float arithmetic. Sincerely yours, david -- GPG/PGP key: A3AD7A2A David MENTRE <dmentre@linux-france.org> 5996 CC46 4612 9CA4 3562 D7AC 6C67 9E96 A3AD 7A2A ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Ocamlopt x86-32 and SSE2 2009-05-10 11:04 ` David MENTRE @ 2009-05-11 3:43 ` Stefan Monnier 0 siblings, 0 replies; 16+ messages in thread From: Stefan Monnier @ 2009-05-11 3:43 UTC (permalink / raw) To: caml-list > As far as I know, one is using ocamlopt to improve performance. > I can't think of any case where one would need native code running on > pre-SSE2 machines which are so outdated performance-wise. You mean we should make slow machines even slower? Stefan ^ permalink raw reply [flat|nested] 16+ messages in thread