* [Caml-list] Allocation profiling for x86-64 native code
From: Mark Shinwell @ 2013-09-13 15:48 UTC
To: caml-list

Large OCaml programs can experience performance degradation due to high garbage collection loads (or possibly due to it being Friday 13th). Understanding the memory usage of such programs can also be difficult.

To this end, I am pleased to release a version of OCaml 4.01 that contains functionality for the memory profiling of native code programs, for the x86-64 architecture. Currently it is only fully working on Linux platforms, but there should be a version for Mac OS X in the near future, and the BSDs.

  opam remote add mshinwell git://github.com/mshinwell/opam-repo-dev
  opam update
  opam switch 4.01-allocation-profiling

The source is on GitHub:
  https://github.com/mshinwell/ocaml/tree/4.01-allocation-profiling

Compiling with ocamlopt -allocation-tracing and running with the letter "T" included in the OCAMLRUNPARAM environment variable enables the functionality in the new [Allocation_profiling] standard library module [1]. You should also ensure that you have the new ocamlmklocs script (installed to the same place as the compiler binaries) on your PATH at compile time.

The runtime system for this compiler contains instrumentation that can produce a global analysis showing the total number of words allocated on the OCaml heaps by source location. This works not only for blocks allocated in OCaml code but also in C stubs. Further, values are instrumented---without space overhead---in order to be able to determine from a snapshot of the heap which value was allocated where; and also to provide a runtime API that can be queried from the instrumented program itself. Following Unix tradition, there is no shiny user interface. Scripts are provided to decode the data from the first two of these (the global analysis and the heap snapshot). There is also a script that can draw a graph of the heap quotiented by the equivalence relation that identifies two blocks iff they were allocated at the same source location.

Programs compiled with allocation profiling will run slower than under normal compilation, but this degradation should not be that significant. They will use a little more memory than normal, but not much: the increase is roughly twice the size of the machine code in your program. (In particular there is no overhead per value allocated.)

Source locations reported are very slightly approximated, but this should not normally cause a problem.

Sometimes the source location that appears in the profile may not be quite the function you're looking for (e.g. some allocation function that's called from multiple places; the allocation function rather than the callers might show up). The system makes some rudimentary efforts to avoid this by looking back up the stack one level under certain conditions, but if you get stuck, you can set a breakpoint in gdb on the function identified in the profile and collect a backtrace every time you pass it. These can then be uniquified by a shell script left as an exercise to the reader. (This technique has been discussed previously on this list.)

This is not yet a fully-polished system, but it has been used at Jane Street on rather large OCaml programs with success. The part that still requires the most work is the runtime API. If you experience long compile times then you can disable the runtime API support by editing the ocamlmklocs script to write an empty file; fixing this is on the list. (See the comment in stdlib/allocation_profiling.mli.)

I would be interested to hear of reports of success or failure; or feature requests. One feature on the near-term list is being able to measure how long a particular value has been in existence.

Have fun.

Mark

P.S. There is some related work going on at OCamlPro using similar techniques. These projects were developed independently, but we expect to collaborate on getting some of this technology into the main distribution.

[1] https://github.com/mshinwell/ocaml/blob/4.01-allocation-profiling/stdlib/allocation_profiling.mli
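The heap graph mentioned in the announcement, quotiented by allocation location, can be illustrated with a small OCaml sketch. This is only a toy model under stated assumptions: the `block` record, the string locations and `quotient_edges` are hypothetical names invented here, and they are not the data format or the script shipped with the compiler.

```ocaml
(* Hypothetical model of a heap snapshot: each block has an id, the source
   location at which it was allocated, and the ids of the blocks it points
   to. None of these names come from the actual profiling scripts. *)
type block = { id : int; loc : string; fields : int list }

module Edge = struct
  type t = string * string
  let compare = compare
end

module EdgeSet = Set.Make (Edge)

(* Collapse the block graph: two blocks are identified iff they share an
   allocation location, so an edge b1 -> b2 becomes loc(b1) -> loc(b2). *)
let quotient_edges (heap : block list) : EdgeSet.t =
  let loc_of = Hashtbl.create 16 in
  List.iter (fun b -> Hashtbl.replace loc_of b.id b.loc) heap;
  List.fold_left
    (fun acc b ->
      List.fold_left
        (fun acc target ->
          match Hashtbl.find_opt loc_of target with
          | Some target_loc -> EdgeSet.add (b.loc, target_loc) acc
          | None -> acc (* pointer outside the snapshot: ignored *))
        acc b.fields)
    EdgeSet.empty heap

(* Four blocks allocated at two made-up locations. *)
let example_heap =
  [ { id = 1; loc = "parser.ml:10"; fields = [ 2; 3 ] };
    { id = 2; loc = "lexer.ml:42"; fields = [] };
    { id = 3; loc = "lexer.ml:42"; fields = [ 4 ] };
    { id = 4; loc = "parser.ml:10"; fields = [] } ]

let () =
  EdgeSet.iter
    (fun (src, dst) -> Printf.printf "%s -> %s\n" src dst)
    (quotient_edges example_heap)
```

The four blocks collapse to two nodes, one per location, and parallel edges between the same pair of locations are merged because the result is a set.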
* Re: [Caml-list] Allocation profiling for x86-64 native code
From: Gerd Stolpmann @ 2013-09-13 16:52 UTC
To: Mark Shinwell; +Cc: caml-list

On Friday, 13.09.2013 at 16:48 +0100, Mark Shinwell wrote:
> Large OCaml programs can experience performance degradation due to high garbage collection loads (or possibly due to it being Friday 13th). Understanding the memory usage of such programs can also be difficult.
>
> To this end, I am pleased to release a version of OCaml 4.01 that contains functionality for the memory profiling of native code programs, for the x86-64 architecture. Currently it is only fully working on Linux platforms, but there should be a version for Mac OS X in the near future, and the BSDs.
> ...
> The runtime system for this compiler contains instrumentation that can produce a global analysis showing the total number of words allocated on the OCaml heaps by source location. This works not only for blocks allocated in OCaml code but also in C stubs. Further, values are instrumented---without space overhead---in order to be able to determine from a snapshot of the heap which value was allocated where; and also to provide a runtime API that can be queried from the instrumented program itself.

A dumb question: how do you do the value instrumentation? Without space overhead? There is not much information in the value itself... do you track value relocations?

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
My OCaml site:          http://www.camlcity.org
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------
* Re: [Caml-list] Allocation profiling for x86-64 native code
From: Mark Shinwell @ 2013-09-16 8:00 UTC
To: Gerd Stolpmann; +Cc: caml-list

On 13 September 2013 17:52, Gerd Stolpmann <info@gerd-stolpmann.de> wrote:
> A dumb question: how do you do the value instrumentation? Without space overhead? There is not much information in the value itself...

The maximum size of blocks is reduced and then an approximation to the instruction pointer at the point of allocation is stored in the spare space in the header word. (More detail to follow in a reply to Jacques-Henri.)

Mark
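The header-word trick described in this reply can be made concrete with a short OCaml sketch. It relies on the description given in Jacques-Henri's message in this thread (the 22 highest header bits hold bits 4..25 of the allocation address). The exact positions of the tag, colour and size fields below are an assumption, as are all the function names; none of this is taken from the patched runtime.

```ocaml
(* Assumed layout of a 64-bit header word (illustrative only):
     bits 0..7    tag
     bits 8..9    GC colour
     bits 10..41  size in words (32 bits; the rest is given up for profiling)
     bits 42..63  bits 4..25 of the allocation-point address *)
let profinfo_bits = 22
let profinfo_shift = 42
let profinfo_mask = Int64.of_int ((1 lsl profinfo_bits) - 1)
let wosize_mask = 0xFFFF_FFFFL

(* Keep only bits 4..25 of a code address. *)
let approximate (addr : int64) : int64 =
  Int64.logand (Int64.shift_right_logical addr 4) profinfo_mask

let make_header ~wosize ~colour ~tag ~alloc_addr : int64 =
  let ( ||| ) = Int64.logor in
  Int64.shift_left (approximate alloc_addr) profinfo_shift
  ||| Int64.shift_left (Int64.logand wosize wosize_mask) 10
  ||| Int64.shift_left (Int64.of_int (colour land 3)) 8
  ||| Int64.of_int (tag land 0xFF)

let wosize_of (hd : int64) : int64 =
  Int64.logand (Int64.shift_right_logical hd 10) wosize_mask

(* Recover the approximate address: the low 4 bits and everything above
   bit 25 are lost. *)
let alloc_point_of (hd : int64) : int64 =
  Int64.shift_left (Int64.shift_right_logical hd profinfo_shift) 4

let () =
  let addr = 0x40_1A37L in
  let hd = make_header ~wosize:3L ~colour:0 ~tag:0 ~alloc_addr:addr in
  Printf.printf "size = %Ld words, allocated near 0x%Lx\n"
    (wosize_of hd) (alloc_point_of hd);
  (* Largest representable block: (2^32 - 1) words of 8 bytes each,
     which is just under 32 GiB. *)
  Printf.printf "max block size = %Ld bytes\n" (Int64.mul wosize_mask 8L);
  (* Two addresses that differ only above bit 25 are indistinguishable. *)
  let far_away = Int64.add addr 0x400_0000L in
  Printf.printf "collision: %b\n" (approximate far_away = approximate addr)
```

Under these assumptions the sketch shows both trade-offs raised in the thread: the size field shrinks to 32 bits, and allocation points are only distinguished at 16-byte granularity within a 64 MiB window of code.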
* Re: [Caml-list] Allocation profiling for x86-64 native code
From: Jacques-Henri Jourdan @ 2013-09-14 22:10 UTC
To: caml-list

This is really interesting work!

Actually, I have had this project of making a memory profiler for OCaml for a few months. My idea was to do some statistical profiling by annotating only a fraction of the allocated blocks. Here are the advantages I thought about:

1- Lower execution time overhead. BTW, what is yours?

2- When annotating a block, we could decide to store more information. Typically, it could be very useful to know (at least a part of) the current backtrace at allocation time.

3- We could also analyze the life of annotated objects more precisely, without much performance constraint, because there are fewer of them. We could for example put watchpoints on them to know when they are accessed, or compute statistics about their lifetime...

4- Your method for annotating blocks uses the 22 highest bits of the block headers to store bits 4..25 of the allocation point address. I can see several (minor) problems with doing that:
  - The maximum size of a block is then limited to 32GB.
  - That does mean that those 22 bits identify the allocation point, and I am not convinced that the probability of collision is negligible in the case of a large code base (like code) non-contiguously loaded in memory because of dynlink, for example.
  - This is not usable for x86-32 bits.

With statistical profiling, we can afford to have a separate table of traced blocks, which we would update at the end of each GC phase. This way, we don't actually "annotate" blocks; we annotate the corresponding table entry instead.

5- To understand some properties of the heap, it is not necessary to walk the whole heap, only the traced blocks.

There is an easy and cheap way to do statistical profiling of memory allocation: we could decide that each allocation exceeding caml_young_limit should receive special treatment in order to be traced.

So, do you think this would be a good idea to implement? Any other comments?

-- 
JH Jourdan
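The sampling idea proposed in this message can be sketched as a pure OCaml simulation. Everything here is an assumption made for illustration: the `sampler` record, the fixed word interval standing in for the caml_young_limit check, and the integer block ids are all hypothetical. A real implementation would live in the C runtime rather than in OCaml code.

```ocaml
(* Toy simulation, not runtime code. A block is sampled when it pushes the
   running word count past the next multiple of [interval], loosely
   mimicking "the allocation that exceeds caml_young_limit gets traced". *)
type entry = { loc : string; words : int }

type sampler = {
  interval : int;                   (* words between two samples *)
  mutable allocated : int;          (* words allocated so far *)
  mutable next_sample : int;        (* threshold for the next sample *)
  traced : (int, entry) Hashtbl.t;  (* side table: block id -> annotation *)
}

let create interval =
  { interval; allocated = 0; next_sample = interval;
    traced = Hashtbl.create 64 }

(* Called for every simulated allocation; only a fraction is recorded. *)
let on_alloc s ~id ~loc ~words =
  s.allocated <- s.allocated + words;
  if s.allocated >= s.next_sample then begin
    Hashtbl.replace s.traced id { loc; words };
    (* Skip past every threshold this allocation crossed. *)
    while s.next_sample <= s.allocated do
      s.next_sample <- s.next_sample + s.interval
    done
  end

(* Called at the end of a simulated GC phase: forget dead blocks, so the
   annotations live in the table and never in the blocks themselves. *)
let after_gc s ~is_alive =
  let dead =
    Hashtbl.fold
      (fun id _ acc -> if is_alive id then acc else id :: acc)
      s.traced []
  in
  List.iter (Hashtbl.remove s.traced) dead

let () =
  let s = create 100 in
  for id = 1 to 50 do
    let loc = if id mod 2 = 0 then "a.ml:3" else "b.ml:7" in
    on_alloc s ~id ~loc ~words:7
  done;
  (* Pretend that only the more recent blocks survived the collection. *)
  after_gc s ~is_alive:(fun id -> id > 25);
  Hashtbl.iter
    (fun id e -> Printf.printf "block %d (%d words) from %s\n" id e.words e.loc)
    s.traced
```

Only the table needs to be walked to inspect the traced blocks, which is the property point 5 relies on. The cost is that the table has to be maintained after each collection.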
* Re: [Caml-list] Allocation profiling for x86-64 native code
From: Mark Shinwell @ 2013-09-16 8:42 UTC
To: Jacques-Henri Jourdan; +Cc: caml-list

On 14 September 2013 23:10, Jacques-Henri Jourdan <jacques-henri.jourdan@ens.fr> wrote:
> statistical profiling by annotating only a fraction of the allocated blocks.

I think it's certainly worth experimenting with different approaches. I've tried to address your points below.

> Here are the advantages I thought about:
>
> 1- Lower execution time overhead. BTW, what is yours?

I don't have any exact figures to hand, and in fact, it will potentially vary quite a lot depending on the amount of allocation. I think the time overhead is maybe 20% at the moment for a large allocation-heavy application. This could be decreased somewhat---firstly by optimizing, and secondly by maybe allowing the "global" (cf. [Allocation_profiling.Global] in the stdlib) analysis to be disabled. (This analysis was actually the predecessor of the value-annotating analysis.) The remaining overhead of annotating the values is small.

> 2- When annotating a block, we could decide to store more information. Typically, it could be very useful to know (at least a part of) the current backtrace at allocation time.

Agreed. I'm hoping to do some work on capturing a partial backtrace in that scenario.

> 3- We could also analyze the life of annotated objects more precisely, without much performance constraint, because there are fewer of them. We could for example put watchpoints on them to know when they are accessed, or compute statistics about their lifetime...

I think if using watchpoints you'd have to pick and choose what you instrument fairly carefully, in any case, otherwise everything will grind to a halt. Lifetime statistics are likely to be supported soon (this should be easy).

> 4- Your method for annotating blocks uses the 22 highest bits of the block headers to store bits 4..25 of the allocation point address. I can see several (minor) problems with doing that:
>   - The maximum size of a block is then limited to 32GB.

I think such blocks are unlikely to occur in practice. I'd argue that it's most likely a mistake to have such large allocations inside the OCaml heap, too.

>   - That does mean that those 22 bits identify the allocation point, and I am not convinced that the probability of collision is negligible in the case of a large code base (like code) non-contiguously loaded in memory because of dynlink, for example.

I neglected to say that this is not expected to work with natdynlink at the moment. I think for x86-64 Linux the current assumption about contiguous code is correct, at least using the normal linker scripts, and the range is probably sufficient. The main place where the approximation could be problematic, I think, is where there are allocation points close together that can't quite be distinguished. In practice I'm not sure this is a problem, though.

>   - This is not usable for x86-32 bits.

I'm not sure x86-32 is worthy of much attention any more (dare I say it!) but 32-bit platforms more generally I think still are of concern. My plan for upstreaming this work includes a patch to enable compiler hackers to adjust the layout of the block header more easily (roughly speaking, removing hard-coded constants in the code) and that may end up including an option to allocate more than one header word per block. This could then be used to solve the 32-bit problem, as well as likely being a useful platform for other experiments.

> With statistical profiling, we can afford to have a separate table of traced blocks, which we would update at the end of each GC phase. This way, we don't actually "annotate" blocks; we annotate the corresponding table entry instead.

This seems like it might cause quite a lot of extra work, and disturb cache behaviour, no? (The current "global" analysis mentioned above in my system will disturb the cache too, but I think if that's turned off, just the value annotation should not.)

Mark
Thread overview: 5+ messages
2013-09-13 15:48 [Caml-list] Allocation profiling for x86-64 native code Mark Shinwell
2013-09-13 16:52 ` Gerd Stolpmann
2013-09-16  8:00   ` Mark Shinwell
2013-09-14 22:10 ` Jacques-Henri Jourdan
2013-09-16  8:42   ` Mark Shinwell