* JIT & HLVM, LLVM @ 2009-09-27 17:18 David McClain 2009-09-27 17:22 ` [Caml-list] " Vincent Aravantinos 0 siblings, 1 reply; 18+ messages in thread From: David McClain @ 2009-09-27 17:18 UTC (permalink / raw) To: caml-list Ahh, I see from doing a bit more research that JIT does *not* run particularly faster than statically compiled code. But rather, it runs faster than interpreted byte-code. I remember many years ago speaking with David Robson, over lunch, about the upcoming changes in Smalltalk, using a form of JIT to improve performance of their method dispatch, and attempting to gain multiple inheritance in that manner for Smalltalk. But there, again, it is a case of attempting to improve on an interpreted byte-code, and not a case of improving over statically compiled native code. But with so many talented bodies working on LLVM, perhaps, in time, a way will be found to gain improvement over static native code. Dr. David McClain dbm@refined-audiometrics.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] JIT & HLVM, LLVM 2009-09-27 17:18 JIT & HLVM, LLVM David McClain @ 2009-09-27 17:22 ` Vincent Aravantinos 2009-09-27 17:35 ` Edgar Friendly 0 siblings, 1 reply; 18+ messages in thread From: Vincent Aravantinos @ 2009-09-27 17:22 UTC (permalink / raw) To: David McClain; +Cc: caml-list Hi, I think what Jon means is that, with JIT, polymorphic functions can be specialized at run-time and allow optimizations that are not currently achieved by the Ocaml native code compiler. V. Le 27 sept. 09 à 19:18, David McClain a écrit : > Ahh, I see from doing a bit more research that JIT does *not* run > particularly faster than statically compiled code. But rather, it > runs faster than interpreted byte-code. > > I remember many years ago speaking with David Robson, over lunch, > about the upcoming changes in Smalltalk, using a form of JIT to > improve performance of their method dispatch, and attempting to gain > multiple inheritance in that manner for Smalltalk. But there, again, > it is a case of attempting to improve on an interpreted byte-code, > and not a case of improving over statically compiled native code. > > But with so many talented bodies working on LLVM, perhaps, in time, > a way will be found to gain improvement over static native code. > > Dr. David McClain > dbm@refined-audiometrics.com ^ permalink raw reply [flat|nested] 18+ messages in thread
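A minimal OCaml sketch of the specialization Vincent describes (the fold example is invented for illustration and is not from the thread):

  (* A polymorphic higher-order function: the native compiler compiles
     this once, generically, with an indirect closure call per element. *)
  let sum_by f xs = List.fold_left (fun acc x -> acc +. f x) 0.0 xs

  (* What a specializing JIT could emit at run-time for the call site
     [sum_by (fun x -> x *. x) data]: a monomorphic copy with the
     closure inlined away. *)
  let sum_squares xs = List.fold_left (fun acc x -> acc +. x *. x) 0.0 xs

  let () = Printf.printf "%g\n" (sum_squares [1.0; 2.0; 3.0])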
* Re: [Caml-list] JIT & HLVM, LLVM 2009-09-27 17:22 ` [Caml-list] " Vincent Aravantinos @ 2009-09-27 17:35 ` Edgar Friendly 2009-09-27 18:51 ` Jon Harrop 2009-09-28 12:25 ` Gerd Stolpmann 0 siblings, 2 replies; 18+ messages in thread From: Edgar Friendly @ 2009-09-27 17:35 UTC (permalink / raw) To: Vincent Aravantinos, caml-list Vincent Aravantinos wrote: > Hi, > > I think what Jon means is that, with JIT, polymorphic functions can be > specialized at run-time > and allow optimizations that are not currently achieved by the Ocaml > native code compiler. > > V. The alternative to specializing at runtime using JIT is to do it at compile time (/ link time) using a form of whole-program analysis. How expensive would this be, and how hard would it be to still support separate compilation? And how much would the OCaml world cry if we didn't have fully-separate compilation? At the moment, and for the foreseeable future, anything binary is bound tightly to the compiler version, and binary distribution of modules seems pretty much impossible due to this and unknown other factors. How much easier would the above be given the source code / some intermediate representation to generate specialized code with? E ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] JIT & HLVM, LLVM 2009-09-27 17:35 ` Edgar Friendly @ 2009-09-27 18:51 ` Jon Harrop 2009-09-27 19:07 ` Edgar Friendly 2009-09-28 12:25 ` Gerd Stolpmann 1 sibling, 1 reply; 18+ messages in thread From: Jon Harrop @ 2009-09-27 18:51 UTC (permalink / raw) To: caml-list On Sunday 27 September 2009 18:35:32 Edgar Friendly wrote: > Vincent Aravantinos wrote: > > I think what Jon means is that, with JIT, polymorphic functions can be > > specialized at run-time and allow optimizations that are not currently > > achieved by the Ocaml native code compiler. > > The alternative to specializing at runtime using JIT is to do it at > compile time (/ link time) using a form of whole-program analysis. How > expensive would this be, and how hard would it be to still support > separate compilation? C++ has those features and they are blamed for slow compilation. However, separate compilation is not the issue so much as dynamic loading. > And how much would the OCaml world cry if we didn't have fully-separate > compilation? I think it is a bad idea to consider how much the OCaml world would cry. Indeed, we cannot even agree on what constitutes the OCaml world. Xavier said that most OCaml users care about the performance of Coq which is, IMHO, insanely inaccurate. The important thing is what we can offer the whole world, not what little remains of the OCaml world (who are a self-selected group of people who, by definition, endure OCaml's shortcomings). I think that an OCaml-like language that addresses OCaml's performance (including parallelism) and FFI issues would be much more widely useful and is an entirely achievable goal. > At the moment, and for the foreseeable future, anything > binary is bound tightly to the compiler version, and binary distribution > of modules seems pretty much impossible due to this and unknown other > factors. Right. Separate compilation and dynamic linking are trivial with a JIT compiler. > How much easier would the above be given the source code / > some intermediate representation to generate specialized code with? Infinitely easier. -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] JIT & HLVM, LLVM 2009-09-27 18:51 ` Jon Harrop @ 2009-09-27 19:07 ` Edgar Friendly 2009-09-27 19:23 ` kcheung 0 siblings, 1 reply; 18+ messages in thread From: Edgar Friendly @ 2009-09-27 19:07 UTC (permalink / raw) To: Jon Harrop; +Cc: caml-list Jon Harrop wrote: > C++ has those features and they are blamed for slow compilation. However, > separate compilation is not the issue so much as dynamic loading. > Yes, dynamic loading doesn't mix with whole-program analysis. Although I wonder if it's possible to have our cake and eat it too, with a fast bytecode compiler and a slower native code compiler that does great specialization. > I think > that an OCaml-like language that addresses OCaml's performance (including > parallelism) and FFI issues would be much more widely useful and is an > entirely achievable goal. > I'm not as committed to abandoning OCaml as you seem, and have hope for incremental improvement of OCaml's weaknesses, although I realize we'll have to break a number of things to get to where we both (and likely many others) want to be. E ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] JIT & HLVM, LLVM 2009-09-27 19:07 ` Edgar Friendly @ 2009-09-27 19:23 ` kcheung 2009-09-27 19:33 ` Jon Harrop 2009-09-27 21:10 ` Jon Harrop 0 siblings, 2 replies; 18+ messages in thread From: kcheung @ 2009-09-27 19:23 UTC (permalink / raw) To: caml-list >> I think >> that an OCaml-like language that addresses OCaml's performance >> (including >> parallelism) and FFI issues would be much more widely useful and is an >> entirely achievable goal. >> > I'm not as committed to abandoning OCaml as you seem, and have hope for > incremental improvement of OCaml's weaknesses, although I realize we'll > have to break a number of things to get to where we both (and likely > many others) want to be. Perhaps the future adoption of OCaml will depend on what OCaml 4.0 is going to be like. If Grand Central Dispatch makes its way into *nix, then I think it is extremely worthwhile for OCaml to have support for something similar. Kevin Cheung. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] JIT & HLVM, LLVM 2009-09-27 19:23 ` kcheung @ 2009-09-27 19:33 ` Jon Harrop 2009-09-27 21:10 ` Jon Harrop 1 sibling, 0 replies; 18+ messages in thread From: Jon Harrop @ 2009-09-27 19:33 UTC (permalink / raw) To: caml-list On Sunday 27 September 2009 20:23:13 kcheung@math.carleton.ca wrote: > > I'm not as committed to abandoning OCaml as you seem, and have hope for > > incremental improvement of OCaml's weaknesses, although I realize we'll > > have to break a number of things to get to where we both (and likely > > many others) want to be. > > Perhaps the future adoption of OCaml will > depend on what OCaml 4.0 is going to be like. > If Grand Central Dispatch makes its way > into *nix, then I think it is extremely > worthwhile for OCaml to have support for > something similar. Someone would have to inspire Xavier and/or Damien to start from scratch and build what we need (and not what Coq needs). Realistically, that is never going to happen. We should thank them for enlightening us but this will only ever get built if we build it ourselves. But, hey, at least we can build it in OCaml. ;-) -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] JIT & HLVM, LLVM 2009-09-27 19:23 ` kcheung 2009-09-27 19:33 ` Jon Harrop @ 2009-09-27 21:10 ` Jon Harrop 2009-09-27 21:45 ` Vincent Aravantinos 2009-09-30 1:08 ` Mikkel Fahnøe Jørgensen 1 sibling, 2 replies; 18+ messages in thread From: Jon Harrop @ 2009-09-27 21:10 UTC (permalink / raw) To: caml-list On Sunday 27 September 2009 20:23:13 kcheung@math.carleton.ca wrote: > If Grand Central Dispatch makes its way > into *nix... Incidentally, what exactly (technically) is Grand Central and how does it relate to existing alternatives and the state-of-the-art? I Googled it but failed to find anything useful... -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] JIT & HLVM, LLVM 2009-09-27 21:10 ` Jon Harrop @ 2009-09-27 21:45 ` Vincent Aravantinos 2009-09-30 1:08 ` Mikkel Fahnøe Jørgensen 1 sibling, 0 replies; 18+ messages in thread From: Vincent Aravantinos @ 2009-09-27 21:45 UTC (permalink / raw) To: Jon Harrop; +Cc: caml-list Le 27 sept. 09 à 23:10, Jon Harrop a écrit : > On Sunday 27 September 2009 20:23:13 kcheung@math.carleton.ca wrote: >> If Grand Central Dispatch makes its way >> into *nix... > > Incidentally, what exactly (technically) is Grand Central and how > does it > relate to existing alternatives and the state-of-the-art? I Googled > it but > failed to find anything useful... http://arstechnica.com/apple/reviews/2009/08/mac-os-x-10-6.ars Not everything relevant in those 23 pages, but you should get answers. V. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] JIT & HLVM, LLVM 2009-09-27 21:10 ` Jon Harrop 2009-09-27 21:45 ` Vincent Aravantinos @ 2009-09-30 1:08 ` Mikkel Fahnøe Jørgensen 2009-09-30 11:57 ` Stéphane Glondu 2009-09-30 15:22 ` Jon Harrop 1 sibling, 2 replies; 18+ messages in thread From: Mikkel Fahnøe Jørgensen @ 2009-09-30 1:08 UTC (permalink / raw) To: Jon Harrop; +Cc: caml-list 2009/9/27 Jon Harrop <jon@ffconsultancy.com>: > On Sunday 27 September 2009 20:23:13 kcheung@math.carleton.ca wrote: >> If Grand Central Dispatch makes its way >> into *nix... > > Incidentally, what exactly (technically) is Grand Central and how does it > relate to existing alternatives and the state-of-the-art? I Googled it but > failed to find anything useful... > Grand Central Dispatch (GCD, where libdispatch is the client-side library) is a very simple piece of code (with some non-trivial implementation details). Some would say it is just a thread queue, but it is quite clever the same way map-reduce is clever and simple. It may help to read the Ars Technica article first, since I dig a little below the user-level API. Two important concepts: GCD does not require threads, and it does not require kernel support. When they are present, GCD becomes efficient at handling multiple cores - or this is the claim, and at least OS X 10.6 Snow Leopard seems to be much smoother than OS X 10.5 Leopard, so I guess there is something to it. But without threads and kernel support, it is still a good programming model, it would seem (haven't burned my fingers yet...) To better understand GCD, think of a single-threaded event-driven application with support for nested event queues. We start with a single thread, later extend the concept to multiple threads and cores: There is a set of global queues with different priorities, one of which is the default. Since we assume a single thread for now, we limit this to only one global queue. Tasks are pushed onto the global queue. A task is just a function with data, i.e. a closure. In C this is a function pointer and a void * to context, or it is a block if you have the blocks extension, but the blocks extension really is optional, although very useful since it can take a standard C block and with a little syntax turn it into a closure with a copy of the parent scope state, which can then quickly be turned into a parallel loop. But this is just syntax, really not too important for the underpinnings of libdispatch. A function can push other functions to the global queue. This in effect creates a continuation, and the concept is similar to continuation passing style or LWT threads, but requires no continuation arguments to drag around. A task should return soon, and avoid blocking operations - especially in a single-threaded setup, but generally blocking is bad in GCD. The interesting part is how this gets scheduled: There is no central scheduler. When our current task function returns (which it should do soon), there is another piece of calling code that earlier picked our task from a queue. Upon task return this code now picks the next task on the queue and calls it - ad infinitum until we reach an empty queue, in which case our control code can exit - i.e. exit the thread. So far this is your standard HTTP web server event loop. Now we add a thread pool to the queue. This means that N threads are all either executing a task popped from the queue, or waiting to pop the next one. Once a task completes, the next task is popped. 
Since there are N threads, we have contention on the queue, so it must be briefly locked with compare-exchange logic. Now we have a concurrent queue with the semantics that tasks are started in order, but execute concurrently. Any and all synchronization between tasks must be managed by client code if they access shared resources. Which brings us to the next point: A task (or the main program thread) can create a new serial queue. A serial queue is really just a task pushed onto the default global queue; this is mostly invisible to the user, but internally it makes the control logic very simple and elegant (thread-local storage is used to identify the current parent queue in the queue creation function). Our active task can now push new tasks onto the serial queue. These tasks are guaranteed to be isolated. The serial queue tasks will eventually be popped from the global concurrent queue either after our function exits, or sooner if there are other vacant threads in the pool. Once the serial queue task is popped, it starts popping tasks off the queue, while still locking pop operations to avoid conflict with new pushes. Note that every single thread cooperates on scheduling by popping the next task from the queue it is currently assigned to. When a task is a queue task, it will temporarily switch the thread to the new queue until that queue is exhausted; then the thread returns focus to the previous queue. A task can choose to wait on the completion of another task it pushes, in which case it is easy to deadlock if one is not careful. It is possible to create an insanely huge number of queues in this system. The only real overhead is the cost of the memory barriers used when pushing and popping. Apple refers to this as islands of serialization in a sea of concurrency. There is also the event signalling subsystem that grabs descriptor events and holds a task until signalled, then pushes the task to a designated queue. This is based on kqueue on FreeBSD and OS X, and some promising work is being done to emulate kqueue for Linux for the purpose of libdispatch integration - this is somewhat related to libevent. Now, it would be really simple to implement the dispatch API in OCaml, either single-threaded, or multithreaded to handle the odd blocking task. On top of this, Apple has built NSOperations, which is a user-level system that allows for various constraints between tasks: task A cannot run before task B and only if there is no yellow car parked in front of the hotel, or something, but GCD runs underneath, and it is almost all queues in user space, and very simple. (Actually I think NSOperations came before, but has now been rewritten to use GCD). OK, so how does this turn into a finely tuned OS-controlled multi-core system? Well, whenever a thread locks the queue, it does this through a compare-exchange op that first prevents other threads from taking control, then verifies that it has not itself been the victim of the prevention step. If it has, the queue is under heavy competition, and this means there are too many threads, so our discouraged thread will prepare for suicide. During a task push operation to a concurrent queue, an attempt is made to allocate a new thread for that queue. This is done by calling a supporting thread creation function. The task is simply queued, so it does not matter if a new thread is not created; it just means less parallelism. The supporting thread creation function is where it all happens. This is deeply buried in the client libdispatch logic. 
The function has two implementations: if the Apple work threads library is present, it will call a kernel function to report that there is now more work. This kernel function may or may not provide additional threads based on a global view of current system load and available cores. Now, since this global view does not know how many of these threads we intend to leave blocking on whatever sync API, it is generally seen as a bad thing to have anything blocking. The event system makes it relatively easy to avoid sync calls to common I/O ops. The portable alternative thread creation function is ugly: whenever a new task is pushed to a global async queue, a new thread is created unless the thread count has reached 255. While this works, a more intelligent thread pool would be useful; one that knows about the number of cores, and possibly some app hint about how many threads we intend to have blocking. So basically, all there is to libdispatch, aka Grand Central Dispatch, is the kernel work threads API, the kqueue event API, and a thin client library that provides async access to pushing and popping from queues. A minor detail left out: when pushing a task, a small list element is allocated from the heap, but returned to a thread-local cache list which is cleaned when a thread commits suicide. There are some additional sugar ops such as creating a parallel for loop by pushing tasks that take an index as argument. A clever detail of this design is to not push the 10,000 array elements at once, but limit to a fair number based on system core count, then have these tasks dynamically restart themselves with a new iteration index which they allocate from a critical-section-protected iteration counter. Such a list iterator is also a task, and the task creating the iteration will block until all iterations have completed, which is somewhat similar to a thread join, but trivial to create. It is not clear to me how much of this kind of blocking is a good thing, because it depends on the underlying thread pool. At least this is how I understand it :-) Now, I'm quite tempted to do this kind of programming in C rather than OCaml, since you get a lot of stuff for free, and you don't have to keep track of a lot of different processes - oh - btw. GCD also makes it easy to monitor process events via the kqueue system, so perhaps a C core server dispatching work to OCaml processes would be a way to go? (Process monitoring might be emulated via a separate thread on Linux; at least this is what the author of the kqueue emulation library currently considers). As to how GCD could be used in OCaml with multi-core support: There is still the issue of concurrent garbage collection. If the thread support is entirely removed and replaced by a GCD-like construct, then the runtime has a much better view over when scheduling happens, and it knows that there is an observable and limited number of real threads - thus making several large per-thread heaps feasible. But still, it doesn't really solve the concurrent GC problem fully. I might have gotten some details wrong, but this is what I came up with after digging around the code and asking a few questions. ^ permalink raw reply [flat|nested] 18+ messages in thread
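To make the queue model above concrete, here is a deliberately single-threaded OCaml sketch (all names invented; the real libdispatch is C, multithreaded, and lock-free where it matters): tasks are closures on a global queue, and a serial queue is itself just a task on the global queue that drains a private queue in order:

  let global_queue : (unit -> unit) Queue.t = Queue.create ()

  let dispatch_async task = Queue.push task global_queue

  (* A serial queue is a private queue plus a drain task; the drain task
     is scheduled on the global queue only when the private queue goes
     from empty to non-empty, so its tasks run in order and never
     concurrently with each other. *)
  let make_serial_queue () =
    let q : (unit -> unit) Queue.t = Queue.create () in
    fun task ->
      let was_empty = Queue.is_empty q in
      Queue.push task q;
      if was_empty then
        dispatch_async (fun () ->
          while not (Queue.is_empty q) do (Queue.pop q) () done)

  (* The "thread pool" degenerates to one loop popping and running tasks. *)
  let run_loop () =
    while not (Queue.is_empty global_queue) do
      (Queue.pop global_queue) ()
    done

  let () =
    let serial = make_serial_queue () in
    dispatch_async (fun () -> print_endline "concurrent task");
    serial (fun () -> print_endline "serial task 1");
    serial (fun () -> print_endline "serial task 2");
    run_loop ()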
* Re: [Caml-list] JIT & HLVM, LLVM 2009-09-30 1:08 ` Mikkel Fahnøe Jørgensen @ 2009-09-30 11:57 ` Stéphane Glondu 2009-09-30 12:54 ` Mikkel Fahnøe Jørgensen 2009-09-30 15:22 ` Jon Harrop 1 sibling, 1 reply; 18+ messages in thread From: Stéphane Glondu @ 2009-09-30 11:57 UTC (permalink / raw) To: Mikkel Fahnøe Jørgensen; +Cc: caml-list Mikkel Fahnøe Jørgensen a écrit : > A function can push other functions to the global queue. This in > effect creates a continuation, and the concept is similar to > continuation passing style or LWT threads, *but requires no > continuation arguments to drag around*. Could you elaborate? Best regards, -- Stéphane ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] JIT & HLVM, LLVM 2009-09-30 11:57 ` Stéphane Glondu @ 2009-09-30 12:54 ` Mikkel Fahnøe Jørgensen 2009-09-30 13:42 ` Stéphane Glondu 0 siblings, 1 reply; 18+ messages in thread From: Mikkel Fahnøe Jørgensen @ 2009-09-30 12:54 UTC (permalink / raw) To: Stéphane Glondu; +Cc: caml-list First, sorry for the rough post (it was late); I see some typos and slightly more confusing mistakes, but hopefully nothing too bad - please ask. >> A function can push other functions to the global queue. This in >> effect creates a continuation, and the concept is similar to >> continuation passing style or LWT threads, *but requires no >> continuation arguments to drag around*. > Could you elaborate? Yes, I'm probably not the best to explain continuation passing style, but I'll give it a shot. In OCaml you do not have continuations, but you do have partial application. This means the function let sum a b = a + b can be called partially like let sum_creator a = sum a This won't execute the sum code, but will have it pending until we get the last argument b. In this sense the sum_creator function is a sleeping thread waiting to execute. We can combine several functions in this way to get a complex calculation which won't evaluate until the final argument b is supplied. For example we can create a complex boolean expression combining several primitive and / or functions that take a variable x as argument. The entire expression is suspended until x is given, and in principle the OR subexpressions could execute in parallel. We can also build parse expressions where it is not a variable x, but the rest of the input string, that is missing. Instead of evaluating a sum or boolean expression, the function could do something else, like filling a buffer from a socket. We can use this to create threads by having an extra dummy argument that we will call k. The k argument does not need to be used at all; we just use it to avoid computation before we need it. We don't really need k, we just need the expression not to evaluate before we say so. This is the continuation being dragged (passed) around. (In monadic stream parsers, the k continuation is replaced by I/O state which is dragged around, and contrary to k it provides useful information. These parsers allow parsing to sleep when the input stream dries up temporarily, and the parsers can explore several parse paths concurrently (useful for error reporting), so there is an analog to concurrent threads consuming I/O.) A function can also return a recursive call to itself supplying new state such as bytes left to read, or a partial call to another function, in both cases supplying new state information, but without supplying the k argument, such that "the next task to execute" is returned instead of actually being executed. Now it starts to get complex, but you get hold of an entire tree of partially applied functions that are parked waiting for the last argument. A clean way to put one function after another so they execute in sequence as dependent tasks is to use a monadic bind operation: http://ocsigen.org/docu/last/Lwt.html#VALbind But this requires the function to be designed in a clean way and conform to certain monadic rules, and getting it wrong creates a mess in the type errors. What we effectively achieve is a lot of functions queued up on the call stack in the form of closures allocated to hold the captured parameters. Now, we might as well just push a closure onto a queue instead of on the call stack. 
This avoids a lot of complexity in the function type design, and we get a lot more flexibility in how we dispatch the tasks (arguably we could do the same or possibly more, in continuation passing style, but it will give you a headache). We might lose some type safety by using queues, but I'm sure one could dream up some queue type system to enforce certain conditions. More importantly, it is much simpler to understand a closure on a queue than continuation passing style. And finally, with queues you have an elegant way of extending this to multiple OS threads (or OCaml green threads), although again this is also doable in continuation passing style. Someone could probably argue that a queue is also just a continuation state or something, but we are dealing with making threaded abstractions that are easy to use and implement, not mathematical equivalence relations. ^ permalink raw reply [flat|nested] 18+ messages in thread
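A small OCaml sketch of the contrast being drawn (invented example; run here by a plain loop rather than real threads):

  (* Continuation passing style: the rest of the computation is an
     explicit argument k that every function must accept and call. *)
  let add_cps a b k = k (a + b)
  let double_cps x k = k (2 * x)
  let pipeline_cps x k = add_cps x 1 (fun y -> double_cps y k)

  (* Queue of closures: the "next step" is pushed, not passed, so the
     worker functions keep their ordinary types. *)
  let queue : (unit -> unit) Queue.t = Queue.create ()
  let spawn f = Queue.push f queue
  let run () =
    while not (Queue.is_empty queue) do (Queue.pop queue) () done

  let () =
    pipeline_cps 20 (Printf.printf "cps: %d\n");
    spawn (fun () ->
      let y = 20 + 1 in
      spawn (fun () -> Printf.printf "queued: %d\n" (2 * y)));
    run ()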
* Re: [Caml-list] JIT & HLVM, LLVM 2009-09-30 12:54 ` Mikkel Fahnøe Jørgensen @ 2009-09-30 13:42 ` Stéphane Glondu 2009-09-30 19:11 ` Mikkel Fahnøe Jørgensen 0 siblings, 1 reply; 18+ messages in thread From: Stéphane Glondu @ 2009-09-30 13:42 UTC (permalink / raw) To: Mikkel Fahnøe Jørgensen; +Cc: caml-list Mikkel Fahnøe Jørgensen a écrit : > But this requires the function to be designed in a clean way and > conform to certain monadic rules, and getting it wrong creates a mess > in the type errors. Actually, I find the typing discipline enforced by the monadic abstraction very helpful (and elegant). > Now, we might as well just push a closure onto a queue instead of on > the call stack. This avoids a lot of complexity in the function type > design, and we get a lot more flexibility in how we dispatch the tasks > (arguably we could do the same or possibly more, in continuation > passing style, but it will give you a headache). This sounds like a narrow (and C-ish) way to tackle things. The bind operator is about composition of threads, not scheduling. > More importantly, it is much simpler to understand a closure on a > queue than continuation passing style. What you call the "call stack" is orthogonal to your "queues of closures": one is about combining results of threaded computations whereas the other is just spawning threads and relying on some external mechanism to schedule them. In other words, I think both can be used together, and achieve different purposes. Best regards, -- Stéphane ^ permalink raw reply [flat|nested] 18+ messages in thread
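To make the composition-not-scheduling point concrete, a toy bind (a deliberately minimal stand-in for Lwt's 'a Lwt.t, not its actual implementation):

  (* A "thread" is just a suspended computation; bind sequences two of
     them without saying anything about when or where they run. *)
  type 'a thread = unit -> 'a
  let return x : 'a thread = fun () -> x
  let bind (t : 'a thread) (f : 'a -> 'b thread) : 'b thread =
    fun () -> f (t ()) ()
  let ( >>= ) = bind

  let demo : int thread =
    return 20 >>= fun x ->
    return (x + 1) >>= fun y ->
    return (2 * y)

  let () = Printf.printf "%d\n" (demo ())  (* prints 42 *)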
* Re: [Caml-list] JIT & HLVM, LLVM 2009-09-30 13:42 ` Stéphane Glondu @ 2009-09-30 19:11 ` Mikkel Fahnøe Jørgensen 0 siblings, 0 replies; 18+ messages in thread From: Mikkel Fahnøe Jørgensen @ 2009-09-30 19:11 UTC (permalink / raw) To: Stéphane Glondu; +Cc: caml-list 2009/9/30 Stéphane Glondu <steph@glondu.net>: > Mikkel Fahnøe Jørgensen a écrit : > Actually, I find the typing discipline enforced by the monadic > abstraction very helpful (and elegant). For some purposes - for example filtering and transforming large data sets - but perhaps less so for ad hoc tasks like background reporting, although of course it can be done. >> Now, we might as well just push a closure onto a queue instead of on >> the call stack. This avoids a lot of complexity in the function type >> design, and we get a lot more flexibility in how we dispatch the tasks >> (arguably we could do the same or possibly more, in continuation >> passing style, but it will give you a headache). > > This sounds like a narrow (and C-ish) way to tackle things. The bind > operator is about composition of threads, not scheduling. Perhaps, but sometimes this makes it easier to get things done, especially for people not trained in the use of monads. Yes, bind makes sure that one task is not scheduled at the same time as, or before, another task. > What you call the "call stack" is orthogonal to your "queues of > closures": one is about combining results of threaded computations > whereas the other is just spawning threads and relying on some external > mechanism to schedule them. Well - yes and no. In general, monads are about combining results, but they are also used for thread control where you do not rely on the results. But it highlights an interesting point: queues do nothing to help you communicate computation results, in parallel or not. Neither does a monad-based thread library. But a map-reduce monadic setup very much does combine results and can also be made to schedule for parallel execution, which is very elegant, but not a good match for more ad-hoc concurrent tasks. > In other words, I think both can be used together, and achieve different > purposes. I agree. It was not my intention to say that monads are bad in any way, and indeed many of our daily collection types are very useful monads with abstractions that make them easy to use. But since monads tend to spread all over the code, I think they should not be used as a solution to everything, and this is basically what I meant by "dragging continuations around". Regards, Mikkel ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Caml-list] JIT & HLVM, LLVM 2009-09-30 1:08 ` Mikkel Fahnøe Jørgensen 2009-09-30 11:57 ` Stéphane Glondu @ 2009-09-30 15:22 ` Jon Harrop 2009-09-30 19:35 ` Mikkel Fahnøe Jørgensen 1 sibling, 1 reply; 18+ messages in thread From: Jon Harrop @ 2009-09-30 15:22 UTC (permalink / raw) To: caml-list On Wednesday 30 September 2009 02:08:06 Mikkel Fahnøe Jørgensen wrote: > 2009/9/27 Jon Harrop <jon@ffconsultancy.com>: > > On Sunday 27 September 2009 20:23:13 kcheung@math.carleton.ca wrote: > >> If Grand Central Dispatch makes its way > >> into *nix... > > > > Incidentally, what exactly (technically) is Grand Central and how does it > > relate to existing alternatives and the state-of-the-art? I Googled it > > but failed to find anything useful... > > Grand Central Dispatch... Thanks for the explanation! This seems to be attacking the same problem as Cilk and the TPL but my immediate impression (based entirely upon your description) is that Cilk and the TPL are probably better foundations. Using one work-stealing deque per core is much more efficient than the work-sharing queue you described for two reasons: 1. Less global synchronization. 2. Subtasks are likely to be executed on the core that spawned them, which improves cache efficiency. The parallel "for" loop you described is similar to the TPL's and leaves a lot to be desired. Specifically, you want to execute clusters of consecutive tasks on the same core for two reasons: 1. Using the same core improves cache efficiency because the end of one task is often spatially related to the start of the next, e.g. parallel quicksort, linear algebra or image processing. 2. Executing clusters of tasks amortizes the cost of queueing tasks. The former is easily solved with the Cilk/TPL design by using recursive subdivision instead of the index sharing scheme you described, because subtasks are likely to be executed on the same core. However, that will not work with a work-sharing queue because subtasks are executed on random cores. Moreover, a trivial extension to this higher-order function allows you to pass in another function that describes the complexity of each task. This allows the recursive function to more intelligently subdivide the given range into clusters of variable-complexity tasks with similar total running times. This is the technique I use in our commercial F# libraries but I have not seen it described elsewhere. Cilk and the TPL also just rely upon the programmer to make tasks sufficiently complex that the time spent queueing and dequeueing them is insignificant. I solved this problem by injecting run-time profiling code (with global synchronizations per cluster of tasks) to measure the proportion of the time being spent spawning rather than executing tasks and then use exponential backoff to increase the cluster size until a sufficiently small proportion of the time is spent spawning. Again, I have not seen this technique described elsewhere. -- Dr Jon Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/?e ^ permalink raw reply [flat|nested] 18+ messages in thread
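A sketch of the recursive subdivision Jon describes (names invented; spawn runs sequentially here, standing in for pushing the closure onto a per-core work-stealing deque):

  let spawn f = f ()  (* placeholder: a real runtime would let an idle core steal this *)

  let rec parallel_for ?(grain = 1024) lo hi body =
    if hi - lo <= grain then
      for i = lo to hi - 1 do body i done
    else begin
      let mid = lo + (hi - lo) / 2 in
      (* Each half is a cluster of consecutive indices, so whichever
         core ends up with a half keeps spatial locality. *)
      spawn (fun () -> parallel_for ~grain lo mid body);
      parallel_for ~grain mid hi body
    end

  let () =
    let a = Array.make 100_000 0.0 in
    parallel_for 0 (Array.length a) (fun i -> a.(i) <- sqrt (float_of_int i))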
* Re: [Caml-list] JIT & HLVM, LLVM 2009-09-30 15:22 ` Jon Harrop @ 2009-09-30 19:35 ` Mikkel Fahnøe Jørgensen 0 siblings, 0 replies; 18+ messages in thread From: Mikkel Fahnøe Jørgensen @ 2009-09-30 19:35 UTC (permalink / raw) To: Jon Harrop; +Cc: caml-list > Using one work-stealing deque per core is much more efficient than the > work-sharing queue you described for two reasons: > > 1. Less global synchronization. > > 2. Subtasks are likely to be executed on the core that spawned them, which > improves cache efficiency. You could also create larger task chunks or schedule multiple tasks on a serial queue to obtain some of the benefits of cache coherency. I'm not saying that this is a better way than the one you suggest. Another issue is latency vs. throughput. If you have a lot of short transactions, like in a trading system, you want things done now, not more efficiently in a while; you'd rather add more cores to the system and waste some. > The parallel "for" loop you described is similar to the TPL's and leaves a lot > to be desired. Specifically, you want to execute clusters of consecutive > tasks on the same core for two reasons: umm yes - it is just a convenience for getting things done I suppose, but again, you could split the loop into nested loops to increase granularity. But in relation to my other recent post, there is also map-reduce for addressing these issues, which works well across network nodes. > The former is easily solved with the Cilk/TPL design by using recursive > subdivision instead of the index sharing scheme you described, because > subtasks are likely to be executed on the same core. I also wondered why GCD insisted on starting indices in order and even wastes time syncing on the index counter, since there is no guarantee of execution order other than start time. I guess it would be easy to modify GCD with a different scheduler - it is about 10 lines of code in the library, and you could set a property on a queue to suggest a preferred scheduler. However, I think the benefit of the current approach is exactly to ensure that as many early indices as possible run in parallel - which you might want for low latency. Another issue that concerns me more is how serial queues behave when more work is added to the queue. If tasks keep feeding the serial queue, it might starve other, more important work, unless the serial queue takes a break - of course the thread is preempted, but it will not give priority to older concurrent tasks still waiting to be pulled from the global concurrent queue. Again I think this is easily fixed, and I might not have understood this fully. For high-performance computing of many similar computations, like pixel processing, I think one should also have a look at OpenCL, and possibly some map-reduce top layer that can distribute work to multiple nodes - such a layer could be built with GCD without relying heavily on GCD for anything other than coordination. Again, here latency is important - the sooner you can ship off work to OpenCL (which feeds work to graphics coprocessors) the more work gets done, rather than having work lining up on the same core to ensure GCD maximizes its own CPU cache usage. This also applies to network communication: if you can send data sooner, the other end can start computing earlier, especially if there is a risk of delays. 
So there is an argument both ways - I think GCD is more intended for systems programming than number crunching (or raytracing :-) I think one should also remember that GCD is a programming abstraction, that there are many ways to implement it, and that as time progresses some changes could be made. ^ permalink raw reply [flat|nested] 18+ messages in thread
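For the granularity point above, a sketch of chunking an index range before dispatch (dispatch is a stand-in parameter; a real chunk size would come from the core count or from profiling):

  (* One task per block of indices rather than per index: fewer
     push/pop round-trips at the cost of coarser load balancing. *)
  let dispatch_apply_chunked dispatch n chunk body =
    let rec loop lo =
      if lo < n then begin
        let hi = min n (lo + chunk) in
        dispatch (fun () -> for i = lo to hi - 1 do body i done);
        loop hi
      end
    in
    loop 0

  (* Trivial sequential "dispatch" for demonstration: *)
  let () = dispatch_apply_chunked (fun f -> f ()) 10 4 (Printf.printf "%d ")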
* Re: [Caml-list] JIT & HLVM, LLVM 2009-09-27 17:35 ` Edgar Friendly 2009-09-27 18:51 ` Jon Harrop @ 2009-09-28 12:25 ` Gerd Stolpmann 1 sibling, 0 replies; 18+ messages in thread From: Gerd Stolpmann @ 2009-09-28 12:25 UTC (permalink / raw) To: Edgar Friendly; +Cc: Vincent Aravantinos, caml-list Am Sonntag, den 27.09.2009, 13:35 -0400 schrieb Edgar Friendly: > Vincent Aravantinos wrote: > > Hi, > > > > I think what Jon means is that, with JIT, polymorphic functions can be > > specialized at run-time > > and allow optimizations that are not currently achieved by the Ocaml > > native code compiler. > > > > V. > > The alternative to specializing at runtime using JIT is to do it at > compile time (/ link time) using a form of whole-program analysis. How > expensive would this be, and how hard would it be to still support > separate compilation? > > And how much would the OCaml world cry if we didn't have fully-separate > compilation? We don't have fully separate compilation anyway (in native mode). The compiler can already dump an intermediate representation of functions into .cmx files, and can use that for cross-module inlining. This feature is optional for the user. Of course, it increases the number of modules that need to be recompiled when something is changed. If we keep that spirit of giving the user a choice: of course the "Ocaml world" would appreciate it if the actual code generation could be delayed in order to gain performance improvements from whole-program analysis. However, for projects with several 100 KLOC the compile time could increase drastically, and I don't think users with such large programs would actually use that feature. Gerd > At the moment, and for the foreseeable future, anything > binary is bound tightly to the compiler version, and binary distribution > of modules seems pretty much impossible due to this and unknown other > factors. How much easier would the above be given the source code / > some intermediate representation to generate specialized code with? > > E > > _______________________________________________ > Caml-list mailing list. Subscription management: > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list > Archives: http://caml.inria.fr > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs > -- ------------------------------------------------------------ Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de Phone: +49-6151-153855 Fax: +49-6151-997714 ------------------------------------------------------------ ^ permalink raw reply [flat|nested] 18+ messages in thread
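A concrete (invented) two-file instance of the .cmx mechanism Gerd describes:

  (* util.ml -- ocamlopt records an intermediate representation of this
     small function in util.cmx, alongside the usual util.cmi. *)
  let double x = 2 * x

  (* main.ml -- when util.cmx is available at compile time, ocamlopt can
     inline Util.double here; with only util.cmi it must emit a call.
     The -inline flag controls how aggressive the inlining is, e.g.
       ocamlopt -inline 100 util.ml main.ml -o main
     and touching util.ml then forces main.ml to be recompiled. *)
  let () = print_int (Util.double 21)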
[parent not found: <20090927214558.D40C5BC5C@yquem.inria.fr>]
* Re: [Caml-list] JIT & HLVM, LLVM [not found] <20090927214558.D40C5BC5C@yquem.inria.fr> @ 2009-09-27 21:59 ` CUOQ Pascal 0 siblings, 0 replies; 18+ messages in thread From: CUOQ Pascal @ 2009-09-27 21:59 UTC (permalink / raw) To: caml-list [-- Attachment #1: Type: text/plain, Size: 236 bytes --] > We should thank them for enlightening us but this will only > ever get built if we build it ourselves. But, hey, at least we can build it > in OCaml. ;-) So that's why I was reminded of John Skaller all this time... Pascal [-- Attachment #2: winmail.dat --] [-- Type: application/ms-tnef, Size: 2529 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2009-09-30 19:56 UTC | newest] Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-09-27 17:18 JIT & HLVM, LLVM David McClain 2009-09-27 17:22 ` [Caml-list] " Vincent Aravantinos 2009-09-27 17:35 ` Edgar Friendly 2009-09-27 18:51 ` Jon Harrop 2009-09-27 19:07 ` Edgar Friendly 2009-09-27 19:23 ` kcheung 2009-09-27 19:33 ` Jon Harrop 2009-09-27 21:10 ` Jon Harrop 2009-09-27 21:45 ` Vincent Aravantinos 2009-09-30 1:08 ` Mikkel Fahnøe Jørgensen 2009-09-30 11:57 ` Stéphane Glondu 2009-09-30 12:54 ` Mikkel Fahnøe Jørgensen 2009-09-30 13:42 ` Stéphane Glondu 2009-09-30 19:11 ` Mikkel Fahnøe Jørgensen 2009-09-30 15:22 ` Jon Harrop 2009-09-30 19:35 ` Mikkel Fahnøe Jørgensen 2009-09-28 12:25 ` Gerd Stolpmann [not found] <20090927214558.D40C5BC5C@yquem.inria.fr> 2009-09-27 21:59 ` CUOQ Pascal
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox