* STM support in OCaml @ 2006-03-07 16:18 Asfand Yar Qazi 2006-03-07 16:50 ` [Caml-list] " Sebastian Egner 2006-03-07 17:15 ` [Caml-list] STM support in OCaml skaller 0 siblings, 2 replies; 29+ messages in thread From: Asfand Yar Qazi @ 2006-03-07 16:18 UTC (permalink / raw) To: caml-list Hi, I've temporarily decided to leave off learning OCaml (although I still intend to learn it at some point) and start learning Haskell due to its support for Software Transactional Memory and lock-free concurrent programming. Are there any plans to add STM to OCaml? And I'm talking highly integrated into the language, not just some bolt-on library. Thanks ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-07 16:18 STM support in OCaml Asfand Yar Qazi @ 2006-03-07 16:50 ` Sebastian Egner 2006-03-07 17:44 ` Michael Hicks 2006-03-07 17:15 ` [Caml-list] STM support in OCaml skaller 1 sibling, 1 reply; 29+ messages in thread From: Sebastian Egner @ 2006-03-07 16:50 UTC (permalink / raw) To: caml-list; +Cc: Asfand Yar Qazi [-- Attachment #1: Type: text/plain, Size: 1389 bytes --] > Are there any plans to add STM to OCaml? And I'm talking highly integrated > into the language, not just some bolt-on library. I implemented 'some bolt-on library' as you call it for myself (for understanding the details of STM) but did not release it. The main lessons I learned, however, are a) A library will be nearly as good as something 'highly integrated into the language'. The main difference is that you cannot annotate fields to be transactional, and define transactional record types; you need to work with explicit transactional reference boxes and arrays. This appears tolerable, and it is a lot simpler, both in terms of implementation and in terms of using it. b) The version of STM as designed for Haskell makes essential use of laziness for composing transactions in a very clean way. You can construct more and more complex transactions by not yet enclosing them into 'atomic'. This is probably the most important benefit of STMs. Unfortunately, this property is essentially impossible in OCaml: No matter how highly integrated STMs are in the language, it will always be easy to write a program that uses ordinary (non-STM) state to keep a record of something it shouldn't---messing up semantics of course. Sorry, I know this is bad news, but that is what I found. I invite everybody to verify this for him- or herself, or prove me wrong. Sebastian. [-- Attachment #2: Type: text/html, Size: 2249 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
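For concreteness, the kind of 'bolt-on' interface Sebastian describes -- explicit transactional boxes plus combinators for building transactions before committing them -- might look roughly like the sketch below. The names (tvar, trans, atomically, and so on) are invented for illustration; they come neither from Sebastian's unreleased library nor from any existing OCaml package.

module type STM = sig
  type 'a tvar                      (* explicit transactional reference box *)
  type 'a trans                     (* a transaction not yet committed *)
  val new_tvar : 'a -> 'a tvar
  val read : 'a tvar -> 'a trans
  val write : 'a tvar -> 'a -> unit trans
  val return : 'a -> 'a trans
  val bind : 'a trans -> ('a -> 'b trans) -> 'b trans
  val atomically : 'a trans -> 'a   (* run, validate and commit *)
end

(* Point (b) above: transactions compose before being enclosed in
   'atomically', and only the caller decides where the atomic boundary goes. *)
module Example (S : STM) = struct
  open S
  let transfer src dst n =
    bind (read src) (fun x ->
    bind (write src (x - n)) (fun () ->
    bind (read dst) (fun y ->
    write dst (y + n))))
  let move_twice a b c n =
    atomically (bind (transfer a b n) (fun () -> transfer b c n))
end

Sebastian's caveat still applies to any such library: nothing prevents the body of a transaction from touching ordinary refs or mutable fields, which the implementation cannot roll back.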
* Re: [Caml-list] STM support in OCaml 2006-03-07 16:50 ` [Caml-list] " Sebastian Egner @ 2006-03-07 17:44 ` Michael Hicks 2006-03-08 0:37 ` Asfand Yar Qazi 2006-03-11 19:43 ` Deadlock free locking scheme (was: Re: [Caml-list] STM support in OCaml) David MENTRE 0 siblings, 2 replies; 29+ messages in thread From: Michael Hicks @ 2006-03-07 17:44 UTC (permalink / raw) To: caml-list There's a longer answer, but one short answer is: check out AtomCaml at http://www.cs.washington.edu/homes/miker/atomcaml/ -Mike On Mar 7, 2006, at 11:50 AM, Sebastian Egner wrote: > > > Are there any plans to add STM to OCaml? And I'm talking highly > integrated > > into the language, not just some bolt-on library. > > I implemented 'some bolt-on library' as you call it for > myself (for understanding the details of STM) but did > not release it. The main lesseons I learned, however, are > > a) A library will be nearly as good as something 'highly > integrated into the language'. The main difference is > that you cannot annotate fields to be transactional, > and define transactional record types; you need to > work with explicit transactional reference boxes and arrays. > This appears tolerable, and it is a lot simpler, both in > terms of implementation and in terms of using it. > > b) The version of STM as designed for Haskell makes essential > use of laziness for composing transactions in a very clean > way. You can construct more and more complex transactions > by not yet enclosing them into 'atomic'. This is probably > the most important benefit of STMs at all. Unfortunately, > this property is essentially impossible in OCaml: No matter > how highly integrated STMs are in the language, it will > always be easy to write a program that uses ordinary (non-STM) > state to keep a record of something it shouldn't---messing > up semantics of course. > > Sorry, I know this is bad news, but that is what I found. > I invite everybody to verify this for him- or herself, > or prove me wrong. > > Sebastian. > _______________________________________________ > Caml-list mailing list. Subscription management: > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list > Archives: http://caml.inria.fr > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-07 17:44 ` Michael Hicks @ 2006-03-08 0:37 ` Asfand Yar Qazi 2006-03-08 5:05 ` Erick Tryzelaar 2006-03-11 19:43 ` Deadlock free locking scheme (was: Re: [Caml-list] STM support in OCaml) David MENTRE 1 sibling, 1 reply; 29+ messages in thread From: Asfand Yar Qazi @ 2006-03-08 0:37 UTC (permalink / raw) To: caml-list Michael Hicks wrote: > There's a longer answer, but one short answer is: check out AtomCaml at > http://www.cs.washington.edu/homes/miker/atomcaml/ > > -Mike > That is actually really really useful just on its own! Now there are 2 things missing: - Multi-threaded support - Optional lazy evaluation and we have a Haskell beater :-) ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-08 0:37 ` Asfand Yar Qazi @ 2006-03-08 5:05 ` Erick Tryzelaar 0 siblings, 0 replies; 29+ messages in thread From: Erick Tryzelaar @ 2006-03-08 5:05 UTC (permalink / raw) To: Asfand Yar Qazi; +Cc: caml-list Asfand Yar Qazi wrote: > Now there are 2 things missing: > > - Multi-threaded support > - Optional lazy evaluation > > and we have a Haskell beater :-) > only one thing left, as ocaml already has optional lazy evaluation: http://caml.inria.fr/pub/docs/manual-ocaml/manual021.html#htoc97 :) -e ^ permalink raw reply [flat|nested] 29+ messages in thread
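For readers following along, the built-in laziness Erick links to needs no extension or extra library: the lazy keyword builds a suspension and Lazy.force evaluates it once and memoizes the result. A tiny illustration:

let expensive = lazy (print_endline "computing"; 6 * 7)

let () =
  print_endline "before forcing";
  Printf.printf "%d\n" (Lazy.force expensive);  (* prints "computing", then 42 *)
  Printf.printf "%d\n" (Lazy.force expensive)   (* memoized: prints only 42 *)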
* Deadlock free locking scheme (was: Re: [Caml-list] STM support in OCaml) 2006-03-07 17:44 ` Michael Hicks 2006-03-08 0:37 ` Asfand Yar Qazi @ 2006-03-11 19:43 ` David MENTRE 1 sibling, 0 replies; 29+ messages in thread From: David MENTRE @ 2006-03-11 19:43 UTC (permalink / raw) To: caml-list; +Cc: Michael Hicks

Hello,

Michael Hicks <mwh@cs.umd.edu> writes:
> There's a longer answer, but one short answer is: check out AtomCaml at
> http://www.cs.washington.edu/homes/miker/atomcaml/

Thank you for the interesting paper, I did not know about that work. Was there an announcement on caml-list?

By the way, as a reply to Asfand Yar Qazi's initial question, there is a simple way to implement a deadlock-free locking scheme, both for byte and native code: call a locking routine that always tries to lock the needed locks *in the same order*. By using the higher-order properties of OCaml, this is very easy for the programmer to use. Such a locking scheme can be used in the following way (imagine you have four bases, named like "participant" or "position", each one protected with a multiple-reader/single-writer lock):

  let do_atomic_work () =
    let unlock = lock_bases { no_locks with
                              participant = Read_lock;
                              position = Read_lock; } in
    let result = do_processing () in
    unlock ();
    result

Compared to the AtomCaml approach, it is as if the do_processing code were executed in a Commit phase.

Please find below the code that implements this scheme (using the reader/writer mutex of [1]). It could probably be optimized for speed (using precomputed locking and unlocking functions stored in a hash table) or improved to support more locks (using an ordered set as input to the lock function), but the general scheme is there.

\chapter{Control of data bases access}

Module [[Basesctrl]] controls the locks to access the data bases. Each of the four data bases (Participants, Classification, Delegation, Position) can be accessed for reading or for writing. A reader/writer lock (see chapter \ref{module:rwmutex}) protects each one of them. All locking and unlocking of those bases is done through this module. We can thus control the locking and unlocking order and thus guarantee the absence of deadlocks.

The locking of the bases is done through the [[lock]] function. It locks bases as desired and then returns an unlocking function that should be called to cancel the locking.

\section{Data structures}

Each lock can either be taken for reading ([[Read_lock]]), for writing ([[Write_lock]]) or not taken at all ([[No_lock]]).

<<basesctrl.ml>>=
type lock_type =
  | Read_lock
  | Write_lock
  | No_lock
@

We define a data structure [[t]] that indicates, for each action on the database, the desired (un)locking for each base.

<<basesctrl.ml>>=
type t = {
  participant : lock_type;
  classification : lock_type;
  delegation : lock_type;
  position : lock_type;
}
@

We define a default lock state [[no_locks]] where no locks are taken.

<<basesctrl.ml>>=
let no_locks = {
  participant = No_lock;
  classification = No_lock;
  delegation = No_lock;
  position = No_lock;
}
@

We also define the actual set of locks that protect bases. We chose a [[Writer_preference]] to give priority to data base modifications, which should be less frequent than data base reading.

\nextchunklabel{code:basesctrl:locks}
<<basesctrl.ml>>=
let pref = Rwmutex.Writer_preference

let participant_lock = Rwmutex.create pref
let classification_lock = Rwmutex.create pref
let delegation_lock = Rwmutex.create pref
let position_lock = Rwmutex.create pref
@

\section{Base unlocking}

Helper function [[unlock_a_base]] is used to unlock [[lock]] with [[desired]] unlocking.

<<basesctrl.ml>>=
let unlock_a_base desired lock =
  match desired with
  | Read_lock -> Rwmutex.read_unlock lock
  | Write_lock -> Rwmutex.write_unlock lock
  | No_lock -> ()
@

Function [[create_unlock]] returns a function that, when called, unlocks the bases as specified in the [[what]] argument.

\nextchunklabel{code:create_unlock}
<<basesctrl.ml>>=
let create_unlock what =
  fun () ->
    unlock_a_base what.position position_lock;
    unlock_a_base what.delegation delegation_lock;
    unlock_a_base what.classification classification_lock;
    unlock_a_base what.participant participant_lock
@

\section{Base locking}

Helper function [[lock_a_base]] is used to lock [[lock]] with [[desired]] locking.

<<basesctrl.ml>>=
let lock_a_base desired lock =
  match desired with
  | Read_lock -> Rwmutex.read_lock lock
  | Write_lock -> Rwmutex.write_lock lock
  | No_lock -> ()
@

Function [[lock_bases]] is called to lock bases. The [[what]] argument gives for each base the desired locking. This function returns a function that should be called to unlock the bases. To avoid deadlocks, the locking order is the reverse of the one used in the unlocking procedure (see \codechunkref{code:create_unlock}).

<<basesctrl.ml>>=
let lock_bases what =
  lock_a_base what.participant participant_lock;
  lock_a_base what.classification classification_lock;
  lock_a_base what.delegation delegation_lock;
  lock_a_base what.position position_lock;
  create_unlock what
@

Best wishes, david

Footnotes: [1] http://www.linux-france.org/~dmentre/code/ocaml_thread_synchro.tar.gz

-- pub 1024D/A3AD7A2A 2004-10-03 David MENTRE <dmentre@linux-france.org> 5996 CC46 4612 9CA4 3562 D7AC 6C67 9E96 A3AD 7A2A ^ permalink raw reply [flat|nested] 29+ messages in thread
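One thing David's usage example does not handle is exceptions: if do_processing raises, the unlock function is never called and the locks stay held. A small wrapper along the following lines would guarantee the unlock on every exit path; this is my sketch, not part of David's code, and it assumes the definitions from basesctrl.ml above plus some do_processing function as in his example.

(* could live at the end of basesctrl.ml *)
let with_locks what f =
  let unlock = lock_bases what in
  try
    let result = f () in
    unlock ();
    result
  with e ->
    unlock ();
    raise e

(* the earlier example then becomes: *)
let do_atomic_work () =
  with_locks
    { no_locks with participant = Read_lock; position = Read_lock }
    do_processing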
* Re: [Caml-list] STM support in OCaml 2006-03-07 16:18 STM support in OCaml Asfand Yar Qazi 2006-03-07 16:50 ` [Caml-list] " Sebastian Egner @ 2006-03-07 17:15 ` skaller 2006-03-07 19:05 ` Asfand Yar Qazi 1 sibling, 1 reply; 29+ messages in thread From: skaller @ 2006-03-07 17:15 UTC (permalink / raw) To: Asfand Yar Qazi; +Cc: caml-list On Tue, 2006-03-07 at 16:18 +0000, Asfand Yar Qazi wrote: > Hi, > > I've temporarily decided to leave off learning OCaml (although I still intend > to learn it at some point) and start learning Haskell due to its support for > Software Transactional Memory and lock-free concurrent programming. AFAIK STM is not lock free. It simply limits the locking period to a bounded time, at the expense of the whole transaction taking unbounded time. The final compare/write/retry must be atomic and is therefore protected by a mutex under the hood. Sebastian Egner said in another post the main advantage of STMs: they have a combinator form, that is, they can be composed. However, despite the lack of safety in a bolt on addition for Ocaml .. the real problem is that STM isn't that useful unless you have (a) a lot of processors (b) a lot of variables so that the risk of contention is low and the cost of long exclusions is high. Ocaml fails to satisfy property (a) since it doesn't support multi-processing. -- John Skaller <skaller at users dot sourceforge dot net> Async PL, Realtime software consultants Checkout Felix: http://felix.sourceforge.net ^ permalink raw reply [flat|nested] 29+ messages in thread
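To make the trade-off skaller describes concrete, here is a deliberately minimal sketch of an optimistic commit loop over a single versioned cell, written against OCaml's threads library. All names are invented, it handles only one variable, and it glosses over memory-ordering details; the point is only that the transaction body runs with no lock held and may be retried any number of times, while the mutex protects just the short validate-and-commit step.

type 'a tvar = { mutable value : 'a; mutable version : int }

let make v = { value = v; version = 0 }

let commit_lock = Mutex.create ()   (* the mutex "under the hood" *)

let rec atomically tv f =
  let seen_version = tv.version in
  let proposed = f tv.value in      (* unbounded amount of work, no lock held *)
  Mutex.lock commit_lock;           (* bounded critical section *)
  if tv.version = seen_version then begin
    tv.value <- proposed;
    tv.version <- tv.version + 1;
    Mutex.unlock commit_lock
  end else begin
    Mutex.unlock commit_lock;       (* someone else committed first: retry *)
    atomically tv f
  end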
* Re: [Caml-list] STM support in OCaml 2006-03-07 17:15 ` [Caml-list] STM support in OCaml skaller @ 2006-03-07 19:05 ` Asfand Yar Qazi 2006-03-08 0:52 ` skaller 0 siblings, 1 reply; 29+ messages in thread From: Asfand Yar Qazi @ 2006-03-07 19:05 UTC (permalink / raw) To: caml-list skaller wrote: > On Tue, 2006-03-07 at 16:18 +0000, Asfand Yar Qazi wrote: > >>Hi, >> >>I've temporarily decided to leave off learning OCaml (although I still intend >>to learn it at some point) and start learning Haskell due to its support for >>Software Transactional Memory and lock-free concurrent programming. > > > AFAIK STM is not lock free. It simply limits the locking period > to a bounded time, at the expense of the whole transaction > taking unbounded time. The final compare/write/retry must > be atomic and is therefore protected by a mutex under the hood. > > Sebastian Egner said in another post the main advantage > of STMs: they have a combinator form, that is, they > can be composed. > > However, despite the lack of safety in a bolt on > addition for Ocaml .. the real problem is that STM > isn't that useful unless you have > > (a) a lot of processors > (b) a lot of variables > > so that the risk of contention is low and the cost > of long exclusions is high. Ocaml fails to satisfy > property (a) since it doesn't support multi-processing. > You make several claims: STM is not lock free. STM is not useful on a small number of processors With all due respect, these papers refute these claims: http://research.microsoft.com/~simonpj/papers/stm/index.htm http://research.microsoft.com/~simonpj/papers/stm/lock-free.htm (That's right - the premier research on STM is being done by Micro$oft - yuck.) As for claim 1. "Lock-free" doesn't mean what you think it does. Quote from first paper, page 11: "The STM implementation guarantees that one transaction can force another to abort only when the first one commits. As a result, the STM implementation is lock-free in the sense that it guarantees at any time that some running transaction can successfully commit." As for claim 2, note the two-processor performance graphs of their tests (second paper, "lock-free.htm", pages 14-15): STM blows the hell out of conventional lock-based parallel processing. As for the lot of variables bit, I intend to have a lot of the buggers :-) ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-07 19:05 ` Asfand Yar Qazi @ 2006-03-08 0:52 ` skaller 2006-03-08 7:08 ` Bardur Arantsson ` (2 more replies) 0 siblings, 3 replies; 29+ messages in thread From: skaller @ 2006-03-08 0:52 UTC (permalink / raw) To: Asfand Yar Qazi; +Cc: caml-list On Tue, 2006-03-07 at 19:05 +0000, Asfand Yar Qazi wrote: > You make several claims: > > STM is not lock free. > STM is not useful on a small number of processors > > As for claim 1. "Lock-free" doesn't mean what you think it does. I know what STM does, thank you: I intend to implement it myself in my own programming language. Maybe you should read more carefully. I said "protected by a mutex under the hood." which means sure, the programmer is not writing locks, but they're used in the implementation and the associated costs are still paid. I really hate it when people try to throw papers against simple logic. I said what the tradeoffs were: "It simply limits the locking period to a bounded time, at the expense of the whole transaction taking unbounded time." and then elaborated the conditions under which this made sense. Long locking periods on a uniprocessor not only do not cause problems, they can actually IMPROVE performance by preventing expensive context switches. A paper is cached here on my website, probably one of the ones you cited. http://felix.sourceforge.net/papers/ea8-composablememory_stm.pdf It's quite interesting and I've bought a dual core CPU specifically to test it out. The only numbers I can give you are based on a simple lock test on a dual core G5 incrementing an integer: 15x SLOWER on a dual processor than a uniprocessor with two threads. No doubt because of the weak support provided by Linux. Windows may do better, haven't tried yet, but I doubt anything older than Vista has suitable API support. In the end, fast concurrency is going to depend on both CPU and board design and OS support. The point of the above paper is not performance: the point is as I said, Sebastian said, AND the paper emphasises: it provides a model which supports composition. I point out that in fact, under the right conditions -- lots of processors and lots of variables -- it will probably provide better performance too. However this is hard to test -- not many of us have access to >2 cores on the same board. There's certainly no way POSIX can deliver good performance: mutexes have to be synchronisation points and that requires ALL the CPUs to flush their caches -- it doesn't scale. Message passing does, since sender and receiver only need to sync the message. Explicit coupling, and both the subset of processors and memory are limited. Oh, and Ocaml supports message passing between processes .. :) -- John Skaller <skaller at users dot sourceforge dot net> Async PL, Realtime software consultants Checkout Felix: http://felix.sourceforge.net ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: STM support in OCaml 2006-03-08 0:52 ` skaller @ 2006-03-08 7:08 ` Bardur Arantsson 0 siblings, 0 replies; 29+ messages in thread From: Bardur Arantsson @ 2006-03-08 7:08 UTC (permalink / raw) To: caml-list skaller wrote: > On Tue, 2006-03-07 at 19:05 +0000, Asfand Yar Qazi wrote: [--snip--] > I point out that in fact, under the right conditions -- lots > of processors and lots of variables -- it will probably provide better > performance too. However this is hard to test -- not many > of us have access to >2 cores on the same board. There certainly > no way POSIX can deliver good performance: mutexes have to be > synchronisation points and that requires ALL the CPUs to > flush their caches -- it doesn't scale. Interestingly, DragonflyBSD seems to be moving toward a slightly weaker (relative to mutex) form of synchronisation which seems somewhat similar to STMs: http://en.wikipedia.org/wiki/Serializing_tokens I haven't looked at it in detail, but it might be possible to use these to implement STM in a mutex-free (cheap) way. (Though you might need some level of hardware support unless you're content with page granularity 'exclusion'). Just thought I'd throw that in there. :) Cheers, -- Bardur Arantsson <bardurREMOVE@THISimada.sdu.dk> <bardurREMOVE@THISscientician.net> - Am I paying for this abuse or is it extra? Edmund Blackadder, 'Blackadder' ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-08 0:52 ` skaller 2006-03-08 7:08 ` Bardur Arantsson @ 2006-03-08 10:38 ` Asfand Yar Qazi 2006-03-08 19:36 ` William Lovas 2 siblings, 0 replies; 29+ messages in thread From: Asfand Yar Qazi @ 2006-03-08 10:38 UTC (permalink / raw) To: caml-list skaller wrote: > On Tue, 2006-03-07 at 19:05 +0000, Asfand Yar Qazi wrote: > > >>You make several claims: >> >>STM is not lock free. >>STM is not useful on a small number of processors >> >>As for claim 1. "Lock-free" doesn't mean what you think it does. > > > I know what STM does, thank you: I intend to implement it > myself in my own programming language. Maybe you should > read more carefully. > > I said "protected by a mutex under the hood." which means > sure, the programmer is not writing locks, but they're used > in the implementation and the associated costs are still paid. > > I really hate it when people try to throw papers against > simple logic. I said what the tradeoffs where: > > "It simply limits the locking period > to a bounded time, at the expense of the whole transaction > taking unbounded time." > > and then elaborated the conditions under which this > made sense. > > Long locking period on a Uniprocessor not only do not > cause problems they can actually IMPROVE performance by preventing > expensive context switches. > > A paper is cached here on my website, probably one of the > ones you cited. > > http://felix.sourceforge.net/papers/ea8-composablememory_stm.pdf > > It's quite interesting and I've bought a dual core CPU specifically > to test it out. The only numbers I can give you are based on a simple > lock test on a dual core G5 incrementing an integer: 15x SLOWER > on a dual processor than a uniprocessor with two threads. > > No doubt because of the weak support provided by Linux. > Windows may do better, haven't tried yet, but I doubt anything > older than Vista has suitable API support. > > In the end, fast concurrency is going to depend on both CPU and > board design and OS support. The point of the above paper is > not performance: the point is as I said, Sebastian said, > AND the paper emphasises: it provides a model which > supports composition. > > I point out that in fact, under the right conditions -- lots > of processors and lots of variables -- it will probably provide better > performance too. However this is hard to test -- not many > of us have access to >2 cores on the same board. There certainly > no way POSIX can deliver good performance: mutexes have to be > synchronisation points and that requires ALL the CPUs to > flush their caches -- it doesn't scale. Message passing does, > since sender and receiver only need to sync the message. > Explicit coupling, and both the subset of processor and > memory are limited. > > Oh, and Ocaml supports message passing between processes .. :) > Bad form on my part old chap - didn't realise your level of expertise. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-08 0:52 ` skaller 2006-03-08 7:08 ` Bardur Arantsson 2006-03-08 10:38 ` [Caml-list] " Asfand Yar Qazi @ 2006-03-08 19:36 ` William Lovas 2006-03-08 20:45 ` Brian Hurt 2 siblings, 1 reply; 29+ messages in thread From: William Lovas @ 2006-03-08 19:36 UTC (permalink / raw) To: caml-list On Wed, Mar 08, 2006 at 11:52:05AM +1100, skaller wrote: > On Tue, 2006-03-07 at 19:05 +0000, Asfand Yar Qazi wrote: > > As for claim 1. "Lock-free" doesn't mean what you think it does. > > I know what STM does, thank you: I intend to implement it > myself in my own programming language. Maybe you should > read more carefully. > > I said "protected by a mutex under the hood." which means > sure, the programmer is not writing locks, but they're used > in the implementation and the associated costs are still paid. As the quoted paper said, though, some running transaction can always commit -- so not all of the associated costs are still paid. In particular, the cost of potential non-termination due to deadlock or livelock is not paid. It doesn't matter that there's a mutex under the hood, since it's used only in such a way as to preserve the property that some transaction can always complete. > I really hate it when people try to throw papers against > simple logic. I said what the tradeoffs where: In this case, i think the paper was right :) Its statement of "lock-free"ness was with regards to semantics, not performance. You may be correct that there is some other property, regarding performance, that the STM Haskell implementation does not have, but it is not what their paper calls "lock-free". William ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-08 19:36 ` William Lovas @ 2006-03-08 20:45 ` Brian Hurt 2006-03-08 21:14 ` Paul Snively 2006-03-08 22:06 ` skaller 0 siblings, 2 replies; 29+ messages in thread From: Brian Hurt @ 2006-03-08 20:45 UTC (permalink / raw) To: William Lovas; +Cc: caml-list On Wed, 8 Mar 2006, William Lovas wrote: > As the quoted paper said, though, some running transaction can always > commit -- so not all of the associated costs are still paid. In > particular, the cost of potential non-termination due to deadlock or > livelock is not paid. It doesn't matter that there's a mutex under the > hood, since it's used only in such a way as to preserve the property that > some transaction can always complete. One comment I will make is that a mutex is expensive, but not *that* expensive. I just wrote a quick program (available if anyone cares) in GNU C that measures the cost, in clocks, of locking and unlocking a posix mutex. On my desktop box (AMD Athlon XP 2200+ 1.8GHz), I'm getting a cost of like 44 clock cycles. Which makes it less expensive than an L2 cache miss. At this point correctness is a much bigger concern of mine than absolute performance. A fair bit of conservative or unnecessary locking is acceptable, given that I can write a complicated and highly scalable application and know that I don't have deadlocks or livelocks. Not having race conditions would also be nice, but I don't think that's possible with mutable data. Brian ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-08 20:45 ` Brian Hurt @ 2006-03-08 21:14 ` Paul Snively 2006-03-08 22:06 ` skaller 1 sibling, 0 replies; 29+ messages in thread From: Paul Snively @ 2006-03-08 21:14 UTC (permalink / raw) To: Brian Hurt; +Cc: William Lovas, caml-list You folks might find Tim Sweeney's slides from POPL '06, which are linked to at <http://lambda-the-ultimate.org/node/1277>, interesting. Tim gives some statistics about Gears of War, a title with about 250,000 lines of combined C++ and UnrealScript code for the title itself, and another 250,000 lines of C++ for the Unreal Engine 3, and goes on to discuss some issues that he has. For example, he has about 10,000 objects active at any given time. Whenever one of these objects is modified, it typically touches 5-10 other objects. And all of this has to happen 30-60 times per second. On slide 50, Tim talks about STM, and says: "~2-4X STM performance overhead is acceptable: if it enables our state-intensive code to scale to many threads, it's still a win." Food for thought. Best regards, Paul On Mar 8, 2006, at 12:45 PM, Brian Hurt wrote: > > > On Wed, 8 Mar 2006, William Lovas wrote: > >> As the quoted paper said, though, some running transaction can always >> commit -- so not all of the associated costs are still paid. In >> particular, the cost of potential non-termination due to deadlock or >> livelock is not paid. It doesn't matter that there's a mutex >> under the >> hood, since it's used only in such a way as to preserve the >> property that >> some transaction can always complete. > > One comment I will make is that a mutex is expensive, but not > *that* expensive. I just wrote a quick program (available if > anyone cares) in GNU C that measures the cost, in clocks, of > locking and unlocking a posix mutex. On my desktop box (AMD Athlon > XP 2200+ 1.8GHz), I'm getting a cost of like 44 clock cycles. > Which makes it less expensive than an L2 cache miss. > > At this point correctness is a much bigger concern of mine than > absolute performance. A fair bit of conservative or unnecessary > locking is acceptable, given than I can write a complicated and > highly scalable application and know that I don't have deadlocks or > livelocks. Not having race conditions would also be nice, but I > don't think that's possible with mutable data. > > Brian ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-08 20:45 ` Brian Hurt 2006-03-08 21:14 ` Paul Snively @ 2006-03-08 22:06 ` skaller 2006-03-08 22:10 ` Gerd Stolpmann ` (2 more replies) 1 sibling, 3 replies; 29+ messages in thread From: skaller @ 2006-03-08 22:06 UTC (permalink / raw) To: Brian Hurt; +Cc: William Lovas, caml-list On Wed, 2006-03-08 at 14:45 -0600, Brian Hurt wrote: > One comment I will make is that a mutex is expensive, but not *that* > expensive. I just wrote a quick program (available if anyone cares) in > GNU C that measures the cost, in clocks, of locking and unlocking a posix > mutex. On my desktop box (AMD Athlon XP 2200+ 1.8GHz), I'm getting a cost > of like 44 clock cycles. Which makes it less expensive than an L2 cache > miss. Ahem. Now try that on an AMDx2 (dual core). The cost goes through the roof if one process has a thread on each core. Because each core has its own cache and both caches have to be flushed/ synchronised. And those caches are BIG! I had hoped to compare dual CPU with dual core by buying a two CPU box (i.e. with two dual cores on it) but they're 2-3 times more expensive because it seems there's no board that supports Athlons (you need Opterons which cost a lot more). I have no idea if Linux, for example, running SMP kernel, is smart enough to know if a mutex is shared between two processing units or not: AFAIK Linux doesn't support interprocess mutex. Windows does. Be interesting to compare. As mentioned before the only data I have at the moment is a two thread counter increment experiment on a dual CPU G5 box, where the speed up from 2 CPUs vs 1 was a factor of 15 .. times SLOWER. -- John Skaller <skaller at users dot sourceforge dot net> Async PL, Realtime software consultants Checkout Felix: http://felix.sourceforge.net ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-08 22:06 ` skaller @ 2006-03-08 22:10 ` Gerd Stolpmann 2006-03-08 23:48 ` skaller 2006-03-09 7:45 ` Andrae Muys 2006-03-08 22:11 ` Brian Hurt 2006-03-11 15:26 ` [Caml-list] " Florian Weimer 2 siblings, 2 replies; 29+ messages in thread From: Gerd Stolpmann @ 2006-03-08 22:10 UTC (permalink / raw) To: skaller; +Cc: Brian Hurt, caml-list, William Lovas Am Donnerstag, den 09.03.2006, 09:06 +1100 schrieb skaller: > On Wed, 2006-03-08 at 14:45 -0600, Brian Hurt wrote: > > > One comment I will make is that a mutex is expensive, but not *that* > > expensive. I just wrote a quick program (available if anyone cares) in > > GNU C that measures the cost, in clocks, of locking and unlocking a posix > > mutex. On my desktop box (AMD Athlon XP 2200+ 1.8GHz), I'm getting a cost > > of like 44 clock cycles. Which makes it less expensive than an L2 cache > > miss. > I have no idea if Linux, for example, running SMP kernel, > is smart enough to know if a mutex is shared between two > processing units or not: AFAIK Linux doesn't support > interprocess mutex. Windows does. Be interesting to > compare. Of course POSIX supports interprocess mutexes: Mutexes are inherited across fork(). And Linux even implements threads as processes, so you can even place mutexes in shared memory (but do not ask me how). Anyway, measuring the costs of a mutex is probably not simple. They are highly optimized for the frequently occurring cases. And today "SMP" behaves often more like NUMA, especially for multi-core CPUs. > As mentioned before the only data I have at the moment > is a two thread counter increment experiment on a dual > CPU G5 box, where the speed up from 2 CPUs vs 1 was > a factor of 15 .. times SLOWER. You mean for a highly congested mutex you saw that slowdown. Gerd -- ------------------------------------------------------------ Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de Phone: +49-6151-153855 Fax: +49-6151-997714 ------------------------------------------------------------ ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-08 22:10 ` Gerd Stolpmann @ 2006-03-08 23:48 ` skaller 0 siblings, 0 replies; 29+ messages in thread From: skaller @ 2006-03-08 23:48 UTC (permalink / raw) To: Gerd Stolpmann; +Cc: Brian Hurt, caml-list, William Lovas On Wed, 2006-03-08 at 23:10 +0100, Gerd Stolpmann wrote: > Am Donnerstag, den 09.03.2006, 09:06 +1100 schrieb skaller: > > I have no idea if Linux, for example, running SMP kernel, > > is smart enough to know if a mutex is shared between two > > processing units or not: AFAIK Linux doesn't support > > interprocess mutex. Windows does. Be interesting to > > compare. > > Of course POSIX supports interprocess mutexes: Mutexes are inherited > across fork(). How does that work with the address space being copied? On Windows, the mutex lives in kernel memory, the user only gets a handle to it. But at least with Xavier's pthread implementation (I assume this is Posix) you can do pthread_mutex_t fastmutex = PTHREAD_MUTEX_INITIALIZER; which indicates the mutex lives in user address space. Fork() copies the user address space (perhaps lazily). So it seems to me mutexes cannot work across fork()? > And Linux even implements threads as processes, so you > can even place mutexes in shared memory (but do not ask me how). I was thinking of Windows, where you can actually create inter-process global objects by naming them, and then find them in another unrelated process using the string name. [Placing a mutex in shared memory is the same as any other memory I guess: you just call pthread_mutex_init] > Anyway, measuring the costs of a mutex is probably not simple. Agreed! > > is a two thread counter increment experiment on a dual > > CPU G5 box, where the speed up from 2 CPUs vs 1 was > > a factor of 15 .. times SLOWER. > > You mean for a highly congested mutex you saw that slowdown. Yes, just one trivial experiment. Enough to suggest that merely upgrading from one core to two won't necessarily buy you a major performance increase. Probably better can be done with the right software, low level stuff using kernel calls, or even processor/board specific patches. AMD64 for example seems to have reasonably sophisticated ways of handling cache coherence which can't be provided generically by an OS like Linux that has to run on processors with different cache management architectures. Although these are low level issues, the really interesting ones are high level. Stuff like STM, control inversion, garbage collection, etc may be more portable and yield more benefits (especially if it is portable). But that is also hard to know without some significant applications and a lot of measurements. Anyhow I personally won't be writing off Ocaml any time soon! -- John Skaller <skaller at users dot sourceforge dot net> Async PL, Realtime software consultants Checkout Felix: http://felix.sourceforge.net ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-08 22:10 ` Gerd Stolpmann 2006-03-08 23:48 ` skaller @ 2006-03-09 7:45 ` Andrae Muys 2006-03-09 9:18 ` David Brown 1 sibling, 1 reply; 29+ messages in thread From: Andrae Muys @ 2006-03-09 7:45 UTC (permalink / raw) To: Gerd Stolpmann; +Cc: skaller, Brian Hurt, caml-list, William Lovas On 09/03/2006, at 8:10 AM, Gerd Stolpmann wrote: > Am Donnerstag, den 09.03.2006, 09:06 +1100 schrieb skaller: >> On Wed, 2006-03-08 at 14:45 -0600, Brian Hurt wrote: >> >>> One comment I will make is that a mutex is expensive, but not *that* >>> expensive. I just wrote a quick program (available if anyone >>> cares) in >>> GNU C that measures the cost, in clocks, of locking and unlocking >>> a posix >>> mutex. On my desktop box (AMD Athlon XP 2200+ 1.8GHz), I'm >>> getting a cost >>> of like 44 clock cycles. Which makes it less expensive than an >>> L2 cache >>> miss. > >> I have no idea if Linux, for example, running SMP kernel, >> is smart enough to know if a mutex is shared between two >> processing units or not: AFAIK Linux doesn't support >> interprocess mutex. Windows does. Be interesting to >> compare. > > Of course POSIX supports interprocess mutexes: Mutexes are inherited > across fork(). As skaller suggested, I can confirm this does not work. Mutexes are not OS resources, they are initialised regions of memory (generally a word). As fork() does copy-on-write, this leads to the mutex being duplicated NOT shared; and yes given that fork() only duplicates the calling thread, that does mean you can end up with mutexes locked in the child with no controlling thread available to unlock them. fork() safety is an important issue to be aware of when working with posix-threads. The best reference I am familiar with is Programming With Threads by Kleiman, Shah, and Smaalders, which includes a whole chapter on this and related topics. > And Linux even implements threads as processes, so you > can even place mutexes in shared memory (but do not ask me how). I've done this in an embedded app I wrote a few years back. It's not hard, you create the shared-memory segment, cast some of it to be of type pthread_mutex_t and pass it to pthread_mutex_init() (I'm working from memory here, so the function/type names may be off). > Anyway, measuring the costs of a mutex is probably not simple. They > are > highly optimized for the frequently occurring cases. And today "SMP" > behaves often more like NUMA, especially for multi-core CPUs. Agreed, however in my experience, if optimistic locking isn't going to work for you then neither are mutex-based shared-memory approaches. Andrae Muys -- Andrae Muys andrae@netymon.com Principal Kowari Consultant Netymon Pty Ltd ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-09 7:45 ` Andrae Muys @ 2006-03-09 9:18 ` David Brown 0 siblings, 0 replies; 29+ messages in thread From: David Brown @ 2006-03-09 9:18 UTC (permalink / raw) To: Andrae Muys; +Cc: Gerd Stolpmann, Brian Hurt, caml-list, William Lovas, skaller On Thu, Mar 09, 2006 at 05:45:10PM +1000, Andrae Muys wrote: > >>I have no idea if Linux, for example, running SMP kernel, > >>is smart enough to know if a mutex is shared between two > >>processing units or not: AFAIK Linux doesn't support > >>interprocess mutex. Windows does. Be interesting to > >>compare. > > > >Of course POSIX supports interprocess mutexes: Mutexes are inherited > >across fork(). > > As skeller suggested, I can confirm this does not work. Mutexes are > not OS resources, they are initialised regions of memory (generally a > word). As fork() does copy-on-write, this leads to the mutex being > duplicated NOT shared; and yes given that fork() only duplicates the > calling thread, that does mean you can end-up with mutexes locked in > the child with no controlling thread available to unlock them. The mutex gets duplicated, including the data structures that indicate the process ids of who might have been waiting for it. fork() is an outstanding way to very thoroughly hose a pthread application. Windows has the "advantage" of just not having fork. Well, not exactly, since Cygwin demonstrates that it can be faked. The Linux mutex code does use kernel scheduling resources (to go to sleep and signal each other), but the initial tests are done through the shared memory. Dave ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-08 22:06 ` skaller 2006-03-08 22:10 ` Gerd Stolpmann @ 2006-03-08 22:11 ` Brian Hurt 2006-03-08 23:05 ` Lodewijk Vöge ` (2 more replies) 2006-03-11 15:26 ` [Caml-list] " Florian Weimer 2 siblings, 3 replies; 29+ messages in thread From: Brian Hurt @ 2006-03-08 22:11 UTC (permalink / raw) To: skaller; +Cc: William Lovas, caml-list [-- Attachment #1: Type: TEXT/PLAIN, Size: 1608 bytes --]

On Thu, 9 Mar 2006, skaller wrote:
> Ahem. Now try that on an AMDx2 (dual core). The cost goes through
> the roof if one process has a thread on each core. Because each
> core has its own cache and both caches have to be flushed/
> synchronised. And those caches are BIG!

Love to. Wanna buy me the box? :-} Seriously- my code is attached, if someone wants to run it on other boxes and post the results, feel free. It's GNU-C/x86 specific, as I'm using GNU C's inline assembler and the rdtsc instruction to get accurate cycle counts.

As to the cache comment: the whole caches don't have to be flushed, just the line the mutex is on. Which makes it approximately the cost of a cache miss- that's a good approximation of the cost of getting an uncontended lock.

> I have no idea if Linux, for example, running SMP kernel,
> is smart enough to know if a mutex is shared between two
> processing units or not: AFAIK Linux doesn't support
> interprocess mutex. Windows does. Be interesting to
> compare.

It doesn't look like the mutex software is even going into the kernel. I don't think the Linux kernel even knows the mutex *exists*, let alone what threads are competing for it. On the x86, at least, lock instructions are not privileged.

> As mentioned before the only data I have at the moment
> is a two thread counter increment experiment on a dual
> CPU G5 box, where the speed up from 2 CPUs vs 1 was
> a factor of 15 .. times SLOWER.

If you're ping-ponging a cache line between two CPUs (and the AMD dual cores count as two CPUs), then I can easily believe that. So?

Brian

[-- Attachment #2: Type: TEXT/X-CSRC, Size: 1847 bytes --]

#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>

#if !defined(__GNUC__) || !defined(__i386__)
#error This code only works with GCC/i386.
#endif

/* The reason this only works under GNU C and the x86 is we're using the
 * rdtsc instruction.
 */
static inline unsigned long long rdtsc() {
    unsigned long long rval;
    asm volatile ("rdtsc" : "=A" (rval));
    return rval;
}

static sem_t waiting_thread_semaphore;
static pthread_mutex_t mutex;

void * waiting_thread_func(void * param __attribute__((unused))) {
    sem_wait(&waiting_thread_semaphore);
    return NULL;
}

int main(void) {
    int i;
    pthread_t waiting_thread;
    void * trash;
    unsigned long long start, stop, time, min;

    /* Create a thread to force us to actually do multi-threaded work */
    sem_init(&waiting_thread_semaphore, 1, 0);
    pthread_create(&waiting_thread, NULL, waiting_thread_func, NULL);
    pthread_mutex_init(&mutex, NULL);

    /* Time how long a rdtsc takes- we do this ten times and take the
     * cheapest run.
     */
    min = ~0ull;
    for (i = 0; i < 10; ++i) {
        start = rdtsc();
        stop = rdtsc();
        time = stop - start;
        if (time < min) {
            min = time;
        }
    }
    printf("Minimum time for a rdtsc instruction (in clocks): %llu\n", min);

    /* Now time how long the pair of mutex lock + unlock take */
    min = ~0ull;
    for (i = 0; i < 10; ++i) {
        start = rdtsc();
        pthread_mutex_lock(&mutex);
        pthread_mutex_unlock(&mutex);
        stop = rdtsc();
        time = stop - start;
        if (time < min) {
            min = time;
        }
    }
    printf("Minimum time for a mutex lock+unlock + rdtsc (in clocks): %llu\n", min);

    /* Clean up the waiting thread we spawned just to be multithreaded. */
    sem_post(&waiting_thread_semaphore);
    pthread_join(waiting_thread, &trash);
    sem_destroy(&waiting_thread_semaphore);
    return 0;
}

^ permalink raw reply [flat|nested] 29+ messages in thread
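A rough OCaml analogue of Brian's measurement can be written against the threads library's Mutex module (which wraps the system's pthread mutexes). This sketch uses wall-clock time rather than rdtsc cycle counts, so it only gives a ballpark per-operation figure; the file name in the build comment is just an assumption.

let () =
  let m = Mutex.create () in
  let iterations = 10_000_000 in
  let t0 = Unix.gettimeofday () in
  for _i = 1 to iterations do
    Mutex.lock m;
    Mutex.unlock m
  done;
  let t1 = Unix.gettimeofday () in
  Printf.printf "uncontended lock+unlock: %.1f ns on average\n"
    ((t1 -. t0) /. float_of_int iterations *. 1e9)

(* one way to build, assuming the file is bench.ml:
   ocamlfind ocamlopt -thread -package threads.posix -linkpkg bench.ml -o bench *)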
* Re: [Caml-list] STM support in OCaml 2006-03-08 22:11 ` Brian Hurt @ 2006-03-08 23:05 ` Lodewijk Vöge 2006-03-09 3:13 ` Brian Hurt 0 siblings, 1 reply; 29+ messages in thread From: Lodewijk Vöge @ 2006-03-08 23:05 UTC (permalink / raw) To: caml-list

On 8-mrt-2006, at 23:11, Brian Hurt wrote:
> Love to. Wanna buy me the box? :-} Seriously- my code is
> attached, if someone wants to run it on other boxes and post the
> results, feel free. It's GNU-C/x86 specific, as I'm using GNU C's
> inline assembler and the rdtsc instruction to get accurate cycle
> counts.

I don't know if rdtsc is accurate on 64bit, but this is what running it on a dual opteron 250 running 64bit linux gives:

$ ./lock
Minimum time for a rdtsc instruction (in clocks): 144099185
Minimum time for a mutex lock+unlock + rdtsc (in clocks): 24

multiple runs hover between this and 37 clocks for the mutex, and more than twice for the rdtsc.

Lodewijk ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-08 23:05 ` Lodewijk Vöge @ 2006-03-09 3:13 ` Brian Hurt 0 siblings, 0 replies; 29+ messages in thread From: Brian Hurt @ 2006-03-09 3:13 UTC (permalink / raw) To: Lodewijk Vöge; +Cc: caml-list [-- Attachment #1: Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed, Size: 680 bytes --] On Thu, 9 Mar 2006, Lodewijk Vöge wrote: > On 8-mrt-2006, at 23:11, Brian Hurt wrote: > >> Love to. Wanna buy me the box? :-} Seriously- my code is attached, if >> someone wants to run it on other boxes and post the results, feel free. >> It's GNU-C/x86 specific, as I'm using GNU C's inline assembler and the >> rdtsc instruction to get accurate cycle counts. > > I don't know if rdtsc is accurate on 64bit, but this is what running it on a > dual opteron 250 running 64bit linux gives: On 64-bit, I think rdtsc only sets the RAX register. At least that's what I'm assuming is going on- 144 million clocks for a rdtsc seems a bit extreme. Brian ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-08 22:11 ` Brian Hurt 2006-03-08 23:05 ` Lodewijk Vöge @ 2006-03-08 23:45 ` Robert Roessler 2006-03-09 0:23 ` skaller 2 siblings, 0 replies; 29+ messages in thread From: Robert Roessler @ 2006-03-08 23:45 UTC (permalink / raw) To: Caml-list Brian Hurt wrote: > On Thu, 9 Mar 2006, skaller wrote: > >> Ahem. Now try that on an AMDx2 (dual core). The cost goes through >> the roof if one process has a thread on each core. Because each >> core has its own cache and both caches have to be flushed/ >> synchronised. And those caches are BIG! > > Love to. Wanna buy me the box? :-} Seriously- my code is attached, if > someone wants to run it on other boxes and post the results, feel free. > It's GNU-C/x86 specific, as I'm using GNU C's inline assembler and the > rdtsc instruction to get accurate cycle counts. > > As to the cache comment: the whole caches don't have to be flushed, just > the line the mutex is on. Which makes it approximately the cost of a > cache miss- that's a good approximation of the cost of getting an > uncontended lock. > >> >> I have no idea if Linux, for example, running SMP kernel, >> is smart enough to know if a mutex is shared between two >> processing units or not: AFAIK Linux doesn't support >> interprocess mutex. Windows does. Be interesting to >> compare. > > It doesn't look like the mutex software is even going into the kernel. I > don't think the Linux kernel even knows the mutex *exists*, let alone > what threads are competing for it. On the x86, at least, lock > instructions are not priveledged. Brian [, this may be fairly off-topic, but], could you possibly comment on how these results may or may not pertain to g_mutex_lock (rather than pthread_mutex_lock)? I have been going to some trouble to operate a glyph-string cache without needing any locking for the *lookup* case - which is doable, only needing locking for *adding*. OTOH, for the flushing operation, it looks a bit tougher... ;) But if g_mutex_lock/g_mutex_unlock is not all that grim... Robert Roessler roessler@rftp.com http://www.rftp.com ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-08 22:11 ` Brian Hurt 2006-03-08 23:05 ` Lodewijk Vöge 2006-03-08 23:45 ` Robert Roessler @ 2006-03-09 0:23 ` skaller 2006-03-09 3:19 ` Brian Hurt 2 siblings, 1 reply; 29+ messages in thread From: skaller @ 2006-03-09 0:23 UTC (permalink / raw) To: Brian Hurt; +Cc: caml-list, William Lovas

On Wed, 2006-03-08 at 16:11 -0600, Brian Hurt wrote:
> As to the cache comment: the whole caches don't have to be flushed, just
> the line the mutex is on.

Unfortunately that isn't correct ;( Consider:

  lock mutex
  read some data
  write back that data
  unlock mutex

The problem with two threads is that 'some data' has to be invalidated too. Otherwise there is no way existing Posix software would work on a multi-processor.

I couldn't believe it! That's why we tried the experiment: did the OS really invalidate the 'whole' caches, or did our trivial increment program fail, because the memory being incremented wasn't coherent? Either answer is BAD: it's a lose/lose situation.

The result is -- we got the right answers and the 15x slowdown which indicates that yes, really the whole cache on both processors is being invalidated on every mutex lock/unlock.

Interestingly .. it should work much better with a pure FPL, there's nothing to synchronise unless the compiler did some nasty optimisations like reusing a closure for self-tail-rec functions, in which case the compiler could tell the OS which store is dirty. I wouldn't expect that to work with a C program though :)

-- John Skaller <skaller at users dot sourceforge dot net> Async PL, Realtime software consultants Checkout Felix: http://felix.sourceforge.net ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-09 0:23 ` skaller @ 2006-03-09 3:19 ` Brian Hurt 2006-03-09 4:32 ` skaller 0 siblings, 1 reply; 29+ messages in thread From: Brian Hurt @ 2006-03-09 3:19 UTC (permalink / raw) To: skaller; +Cc: caml-list, William Lovas On Thu, 9 Mar 2006, skaller wrote: > The problem with two threads is that 'some data' has to > be invalidated too. Otherwise there > is no way existing Posix software would work on a > multi-processor. > > I couldn't believe it! That's why we tried the experiment: > did the OS really invalidate the 'whole' caches, or did our > trivial increment program fail, because the memory being > incremented wasn't coherent? Generally both the lock get and lock remove need to do a memory barrier, and force all writes to complete to cache. Once in cache, the caches do cache coherency. The caches snoop the memory traffic- and if someone is reading memory that the cache knows is more up to date in that cache than in memory, the read is satisfied from that cache and not from memory- and the cache updates memory as well at the same time. I'm simplifying enormously here, but the end result is that no, you don't need to flush the entire cache on lock release. Google "cache coherency" or "MESI" for more information. Brian ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-09 3:19 ` Brian Hurt @ 2006-03-09 4:32 ` skaller 2006-03-09 10:38 ` John Chu 2006-03-09 16:53 ` Stefan Monnier 0 siblings, 2 replies; 29+ messages in thread From: skaller @ 2006-03-09 4:32 UTC (permalink / raw) To: Brian Hurt; +Cc: caml-list, William Lovas On Wed, 2006-03-08 at 21:19 -0600, Brian Hurt wrote: > > I couldn't believe it! That's why we tried the experiment: > > did the OS really invalidate the 'whole' caches, or did our > > trivial increment program fail, because the memory being > > incremented wasn't coherent? > I'm simplifying > enormously here, I know, so was I -- I've read the AMD64 specs. Suppose I have a thread that creates an array which fits roughly in the cache. Then I hand it over to another thread to fold. The first thread has to load stuff from RAM to populate the array, which slows it down. Then it has to write the whole array back to RAM, and the second thread reloads the whole array from RAM into the cache. In that case the whole cache has to be not only flushed to RAM, it has to be reloaded into another cache. Pretty nasty .. on a MP machine. On a uniprocessor, the same CPU will do the fold, and the cache doesn't have to be either saved or reloaded. Yes, that's an extreme case! -- John Skaller <skaller at users dot sourceforge dot net> Async PL, Realtime software consultants Checkout Felix: http://felix.sourceforge.net ^ permalink raw reply [flat|nested] 29+ messages in thread
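Concretely, the handoff skaller describes could be written with OCaml's threads library roughly as below (invented code, not from the thread). Note, though, that OCaml's threads all share one runtime lock, so the producer and the fold never actually run on separate cores at the same time -- which is exactly skaller's earlier point (a) about OCaml and multiprocessing.

let () =
  let n = 1_000_000 in
  let a = Array.make n 0 in
  (* first thread populates the array (pulling it into its cache) *)
  let producer = Thread.create (fun () ->
    for i = 0 to n - 1 do a.(i) <- i done) ()
  in
  Thread.join producer;
  (* second thread folds over it (pulling it into the other cache) *)
  let consumer = Thread.create (fun () ->
    Printf.printf "sum = %d\n" (Array.fold_left (+) 0 a)) ()
  in
  Thread.join consumer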
* Re: [Caml-list] STM support in OCaml 2006-03-09 4:32 ` skaller @ 2006-03-09 10:38 ` John Chu 0 siblings, 0 replies; 29+ messages in thread From: John Chu @ 2006-03-09 10:38 UTC (permalink / raw) To: caml-list > I know, so was I -- I've read the AMD64 specs. [snip] > The first thread > has to load stuff from RAM to populate the array, > which slows it down. > > Then it has to write the whole array back to RAM, > and the second thread reloads the whole array from > RAM into the cache. If you've read the AMD64 specs, then you know about the Owned state in the cache coherence protocol AMD chips use. It allows one processor to update another processor without having to update main memory. i.e., one processor goes from M->O, the other goes from I->S. The O state is there to remind the first processor that on an eviction, it can't simply drop the cache line (as it could a line in the S state). It must write back to main memory as if it were in the M state. You're obviously right that if two processes on two cores use the same data, the data will obviously need to be copied from the cache hierarchy of one core to the other core. What I'm saying is that it does not need to write into main memory first. john ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: STM support in OCaml 2006-03-09 4:32 ` skaller 2006-03-09 10:38 ` John Chu @ 2006-03-09 16:53 ` Stefan Monnier 1 sibling, 0 replies; 29+ messages in thread From: Stefan Monnier @ 2006-03-09 16:53 UTC (permalink / raw) To: caml-list > In that case the whole cache has to be not only > flushed to RAM, it has to be reloaded into another > cache. Pretty nasty .. on a MP machine. Yes, of course, the data needs to be copied between the CPUs. But the cache never needs to be flushed. Stefan ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Caml-list] STM support in OCaml 2006-03-08 22:06 ` skaller 2006-03-08 22:10 ` Gerd Stolpmann 2006-03-08 22:11 ` Brian Hurt @ 2006-03-11 15:26 ` Florian Weimer 2 siblings, 0 replies; 29+ messages in thread From: Florian Weimer @ 2006-03-11 15:26 UTC (permalink / raw) To: skaller; +Cc: Brian Hurt, caml-list, William Lovas > I have no idea if Linux, for example, running SMP kernel, > is smart enough to know if a mutex is shared between two > processing units or not: AFAIK Linux doesn't support > interprocess mutex. Uhm, Linux and GNU libc have supported them for quite some time (many years if you used one of the enterprise variants which backported them to Linux 2.4). ^ permalink raw reply [flat|nested] 29+ messages in thread