[Caml-list] Allocation profiling for x86-64 native code

Mailing list for all users of the OCaml language and system.

* [Caml-list] Allocation profiling for x86-64 native code
@ 2013-09-13 15:48 Mark Shinwell
  2013-09-13 16:52 ` Gerd Stolpmann
  2013-09-14 22:10 ` Jacques-Henri Jourdan
  0 siblings, 2 replies; 5+ messages in thread
From: Mark Shinwell @ 2013-09-13 15:48 UTC (permalink / raw)
  To: caml-list

Large OCaml programs can experience performance degradation due to high
garbage collection loads (or possibly due to it being Friday 13th).
Understanding the memory usage of such programs can also be difficult.

To this end, I am pleased to release a version of OCaml 4.01 that
contains functionality for the memory profiling of native code programs,
for the x86-64 architecture.  Currently it is only fully working on
Linux, but versions for Mac OS X and the BSDs should follow in the
near future.

  opam remote add mshinwell git://github.com/mshinwell/opam-repo-dev
  opam update
  opam switch 4.01-allocation-profiling

The source is on GitHub:
  https://github.com/mshinwell/ocaml/tree/4.01-allocation-profiling

Using ocamlopt with -allocation-tracing and running in an environment
with the OCAMLRUNPARAM environment variable including the letter "T"
enables the use of the functionality in the new [Allocation_profiling]
standard library module [1].  You should also ensure that you have
the new ocamlmklocs script (installed to the same place as the
compiler binaries) on your PATH at compile time.
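
A small allocation-heavy program is handy for trying out the flag and
the OCAMLRUNPARAM letter described above.  The sketch below uses only
the ordinary standard library; the file name alloc_demo.ml and the
function names are made up for illustration, and the build and run
commands in the comment simply restate the instructions above.

```ocaml
(* alloc_demo.ml: a hypothetical test program, not part of the release.
   Build and run, per the instructions above:
     ocamlopt -allocation-tracing alloc_demo.ml -o alloc_demo
     OCAMLRUNPARAM=T ./alloc_demo *)

(* Allocation site 1: one pair and one cons cell per element. *)
let rec pairs_upto n acc =
  if n = 0 then acc else pairs_upto (n - 1) ((n, n * n) :: acc)

(* Allocation site 2: a fresh float array per call. *)
let make_row width = Array.make width 0.0

let () =
  let pairs = pairs_upto 100_000 [] in
  let rows = Array.init 1_000 (fun _ -> make_row 64) in
  let total = List.fold_left (fun s (_, sq) -> s + sq) 0 pairs in
  Printf.printf "pairs=%d rows=%d checksum=%d\n"
    (List.length pairs) (Array.length rows) total
```

With two clearly separated allocation sites, it should be easy to check
that the profile attributes words to the lines you expect.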

The runtime system for this compiler contains instrumentation that can
produce a global analysis showing the total number of words allocated
on the OCaml heaps by source location.  This works not only for blocks
allocated in OCaml code but also in C stubs.  Further, values are
instrumented---without space overhead---in order to be able to determine
from a snapshot of the heap which value was allocated where; and also
to provide a runtime API that can be queried from the instrumented
program itself.  Following Unix tradition, there is no shiny user
interface.  Scripts are provided to decode the data from the first
two analyses.  There is also a script that can draw a graph of the
heap quotiented by the equivalence relation that identifies two
blocks iff they were allocated at the same source location.
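
To make the quotient graph concrete, here is a minimal sketch of the
idea.  The snapshot type is hypothetical (integer block ids, string
locations); it is not the data format used by the provided scripts.
Blocks allocated at the same location collapse into one node, and an
edge between two locations exists when some block from the first points
to some block from the second.

```ocaml
(* Hypothetical snapshot format, for illustration only. *)
type snapshot = {
  blocks : (int * string) list;  (* block id, allocation location *)
  edges : (int * int) list;      (* pointer from block id to block id *)
}

module Edge_set = Set.Make (struct
  type t = string * string
  let compare = compare
end)

(* Collapse blocks by allocation location and return the distinct
   location-to-location edges in sorted order.  Edges that mention an
   unknown block id are ignored. *)
let quotient snap =
  let loc_of = Hashtbl.create 16 in
  List.iter (fun (id, loc) -> Hashtbl.replace loc_of id loc) snap.blocks;
  let add acc (src, dst) =
    try Edge_set.add (Hashtbl.find loc_of src, Hashtbl.find loc_of dst) acc
    with Not_found -> acc
  in
  Edge_set.elements (List.fold_left add Edge_set.empty snap.edges)

let () =
  let snap = {
    blocks = [ (1, "a.ml:10"); (2, "a.ml:10"); (3, "b.ml:5") ];
    edges = [ (1, 3); (2, 3); (3, 1) ];
  } in
  (* Emit the quotient graph in Graphviz DOT syntax. *)
  print_endline "digraph heap {";
  List.iter
    (fun (a, b) -> Printf.printf "  \"%s\" -> \"%s\";\n" a b)
    (quotient snap);
  print_endline "}"
```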

Programs compiled with allocation profiling will run slower than under
normal compilation, but the slowdown should not be significant.  They
will also use a little more memory than normal: roughly speaking, the
increase is about twice the size of the machine code in your program.
(In particular there is no overhead per value allocated.)

Source locations reported are very slightly approximated, but this
should not normally cause a problem.

Sometimes the source location that appears in the profile may not be
quite the function you're looking for (e.g. some allocation function
that's called from multiple places; the allocation function rather
than the callers might show up).  The system makes a rudimentary
effort to avoid this by looking back up the stack one level under
certain conditions, but if you get stuck, you can set a breakpoint in
gdb on the function identified in the profile and collect a backtrace
every time you pass it.  These can then be uniquified by a shell script
left as an exercise to the reader.  (This technique has been discussed
previously on this list.)
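
For those who would rather not do the exercise in shell, the same
uniquifying step can be sketched in OCaml.  This assumes a log file in
which each backtrace is a run of lines and backtraces are separated by
blank lines, with identical call paths printed identically; that format
is an assumption, not something gdb guarantees without some care in how
the log is produced.

```ocaml
(* Split a list of lines into groups separated by blank lines. *)
let split_backtraces lines =
  let flush cur acc = if cur = [] then acc else List.rev cur :: acc in
  let rec go cur acc = function
    | [] -> List.rev (flush cur acc)
    | "" :: rest -> go [] (flush cur acc) rest
    | line :: rest -> go (line :: cur) acc rest
  in
  go [] [] lines

(* Count occurrences of each distinct backtrace, most frequent first. *)
let count_unique traces =
  let counts = Hashtbl.create 64 in
  List.iter
    (fun t ->
      let n = try Hashtbl.find counts t with Not_found -> 0 in
      Hashtbl.replace counts t (n + 1))
    traces;
  let all = Hashtbl.fold (fun t n acc -> (n, t) :: acc) counts [] in
  List.sort (fun (n1, t1) (n2, t2) -> compare (n2, t1) (n1, t2)) all

let read_lines ic =
  let acc = ref [] in
  (try while true do acc := input_line ic :: !acc done
   with End_of_file -> ());
  List.rev !acc

(* Usage: ./uniq_bt LOGFILE  (does nothing without an argument) *)
let () =
  match Sys.argv with
  | [| _; file |] ->
    let ic = open_in file in
    let traces = split_backtraces (read_lines ic) in
    close_in ic;
    List.iter
      (fun (n, frames) ->
        Printf.printf "%d occurrence(s):\n" n;
        List.iter (fun f -> Printf.printf "  %s\n" f) frames)
      (count_unique traces)
  | _ -> ()
```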

This is not yet a fully-polished system, but it has been used at Jane
Street on rather large OCaml programs with success.  The part that still
requires most work is the runtime API.  If you experience long compile
times then you can disable the runtime API support by editing the
ocamlmklocs script to write an empty file.  (See the comment in
stdlib/allocation_profiling.mli.)  Fixing this is on the list.

I would be interested to hear of reports of success or failure; or
feature requests.  One feature on the near-term list is being able to
measure how long a particular value has been in existence.

Have fun.

Mark

P.S. There is some related work going on at OCamlPro using similar
techniques.  These projects were developed independently, but we expect
to collaborate on getting some of this technology into the main
distribution.

[1] https://github.com/mshinwell/ocaml/blob/4.01-allocation-profiling/stdlib/allocation_profiling.mli


* Re: [Caml-list] Allocation profiling for x86-64 native code
  2013-09-13 15:48 [Caml-list] Allocation profiling for x86-64 native code Mark Shinwell
@ 2013-09-13 16:52 ` Gerd Stolpmann
  2013-09-16  8:00   ` Mark Shinwell
  2013-09-14 22:10 ` Jacques-Henri Jourdan
  1 sibling, 1 reply; 5+ messages in thread
From: Gerd Stolpmann @ 2013-09-13 16:52 UTC (permalink / raw)
  To: Mark Shinwell; +Cc: caml-list

On Friday, 13.09.2013 at 16:48 +0100, Mark Shinwell wrote:
> Large OCaml programs can experience performance degradation due to high
> garbage collection loads (or possibly due to it being Friday 13th).
> Understanding the memory usage of such programs can also be difficult.
> 
> To this end, I am pleased to release a version of OCaml 4.01 that
> contains functionality for the memory profiling of native code programs,
> for the x86-64 architecture.  Currently it is only fully working on
> Linux, but versions for Mac OS X and the BSDs should follow in the
> near future.
> ...
> The runtime system for this compiler contains instrumentation that can
> produce a global analysis showing the total number of words allocated
> on the OCaml heaps by source location.  This works not only for blocks
> allocated in OCaml code but also in C stubs.  Further, values are
> instrumented---without space overhead---in order to be able to determine
> from a snapshot of the heap which value was allocated where; and also
> to provide a runtime API that can be queried from the instrumented
> program itself. 

A dumb question: how do you do the value instrumentation? Without space
overhead? There is not much information in the value itself... do you
track value relocations?

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
My OCaml site:          http://www.camlcity.org
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------



* Re: [Caml-list] Allocation profiling for x86-64 native code
  2013-09-13 16:52 ` Gerd Stolpmann
@ 2013-09-16  8:00   ` Mark Shinwell
  0 siblings, 0 replies; 5+ messages in thread
From: Mark Shinwell @ 2013-09-16  8:00 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: caml-list

On 13 September 2013 17:52, Gerd Stolpmann <info@gerd-stolpmann.de> wrote:
> A dumb question: how do you do the value instrumentation? Without space
> overhead? There is not much information in the value itself...

The maximum size of blocks is reduced and then an
approximation to the instruction pointer at the point of
allocation is stored in the spare space in the header
word.  (More detail to follow in a reply to Jacques-Henri.)
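
As a rough illustration of the header trick, the sketch below packs and
unpacks a 64-bit header word.  It assumes the standard 64-bit layout
(tag in bits 0..7, colour in bits 8..9, size in words above that) and
the split described elsewhere in this thread: the top 22 bits hold bits
4..25 of the allocation address, leaving 32 bits for the size.  All
function names here are hypothetical; the real work is done in the
runtime and the code generator, not in OCaml like this.

```ocaml
let profinfo_bits = 22
let profinfo_shift = 64 - profinfo_bits               (* = 42 *)
let profinfo_mask = Int64.pred (Int64.shift_left 1L profinfo_bits)
let wosize_mask = 0xFFFF_FFFFL                        (* 32 bits remain *)

(* Build a header word carrying an approximate allocation address. *)
let make_header ~wosize ~color ~tag ~alloc_addr =
  let info =
    Int64.logand (Int64.shift_right_logical alloc_addr 4) profinfo_mask in
  let size =
    Int64.shift_left (Int64.logand (Int64.of_int wosize) wosize_mask) 10 in
  let low = Int64.of_int ((color lsl 8) lor tag) in
  Int64.logor (Int64.shift_left info profinfo_shift) (Int64.logor size low)

let header_profinfo hd = Int64.shift_right_logical hd profinfo_shift

let header_wosize hd =
  Int64.to_int (Int64.logand (Int64.shift_right_logical hd 10) wosize_mask)

let header_tag hd = Int64.to_int (Int64.logand hd 0xFFL)

(* Low 26 bits of the allocation address, rounded down to 16 bytes. *)
let approx_alloc_addr hd = Int64.shift_left (header_profinfo hd) 4

let () =
  let hd = make_header ~wosize:3 ~color:0 ~tag:0 ~alloc_addr:0x40123AL in
  Printf.printf "header=0x%Lx wosize=%d approx_addr=0x%Lx\n"
    hd (header_wosize hd) (approx_alloc_addr hd)
```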

Mark


* Re: [Caml-list] Allocation profiling for x86-64 native code
  2013-09-13 15:48 [Caml-list] Allocation profiling for x86-64 native code Mark Shinwell
  2013-09-13 16:52 ` Gerd Stolpmann
@ 2013-09-14 22:10 ` Jacques-Henri Jourdan
  2013-09-16  8:42   ` Mark Shinwell
  1 sibling, 1 reply; 5+ messages in thread
From: Jacques-Henri Jourdan @ 2013-09-14 22:10 UTC (permalink / raw)
  To: caml-list

This is really interesting work!

Actually, I have been planning to write a memory profiler for OCaml
for a few months. My idea was to do statistical profiling by
annotating only a fraction of the allocated blocks.

Here are the advantages I had in mind:

1- Lower execution time overhead. BTW, what is yours?
2- When annotating a block, we could decide to store more information.
Typically, it could be very profitable to know (at least a part of) the
current backtrace while allocating.
3- We could also analyze more precisely the life of annotated objects
without much performance constraint, because there are fewer of them. We
could for example put watchpoints on them to know when they are accessed,
or gather statistics about their lifetime...
4- Your method for annotating blocks uses the 22 highest bits of the
block headers to store bits 4..25 of the allocation point address.
I can see several (minor) problems with doing that:
   - The maximum size of a block is then limited to 32GB.
   - It means that those 22 bits alone identify the allocation point,
and I am not convinced that the probability of collision is negligible
in the case of a large code base (like code) non-contiguously loaded in
memory because of dynlink, for example.
   - This is not usable on 32-bit x86.
With statistical profiling, we can afford having a separate table of
traced blocks, that we would maintain at the end of each GC phase. This
way, we don't actually "annotate" blocks, but we rather annotate the
corresponding table entry.
5- It is not necessary to walk the whole heap to understand some of its
properties, but rather only the traced blocks.

There is an easy and cheap way to do statistical profiling of memory
allocation: we could decide that each allocation exceeding
caml_young_limit should receive a special treatment in order to be traced.
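
The sampling idea can be modelled with a toy bump allocator.  This is
only a simulation under stated assumptions: the record type, the field
names and the reset standing in for a minor collection are all made up,
and every request is assumed to fit in the minor heap.  An allocation
is traced exactly when it is the one that crosses the limit.

```ocaml
type minor_heap = {
  size : int;                            (* minor heap size in words *)
  mutable free : int;                    (* words left before the limit *)
  mutable sampled : (string * int) list; (* traced (site, words), newest first *)
}

let create size = { size; free = size; sampled = [] }

(* Fast path: plain bump allocation.  Slow path: the request crosses the
   limit, so the heap is emptied (standing in for a minor collection)
   and this one allocation is recorded in the side table. *)
let alloc heap ~site ~words =
  if words <= heap.free then heap.free <- heap.free - words
  else begin
    heap.sampled <- (site, words) :: heap.sampled;
    heap.free <- heap.size - words
  end

let () =
  let heap = create 10 in
  List.iter
    (fun (site, words) -> alloc heap ~site ~words)
    [ ("a", 4); ("b", 4); ("c", 4); ("a", 4); ("b", 3) ];
  List.iter
    (fun (site, words) -> Printf.printf "traced %s (%d words)\n" site words)
    (List.rev heap.sampled)
```

In this model the fast path is untouched, so the cost of tracing is
paid only on the slow path, and larger requests are more likely to be
the ones that cross the limit.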

So, do you think this would be a good idea to implement? Any other
comments?

--
JH Jourdan



* Re: [Caml-list] Allocation profiling for x86-64 native code
  2013-09-14 22:10 ` Jacques-Henri Jourdan
@ 2013-09-16  8:42   ` Mark Shinwell
  0 siblings, 0 replies; 5+ messages in thread
From: Mark Shinwell @ 2013-09-16  8:42 UTC (permalink / raw)
  To: Jacques-Henri Jourdan; +Cc: caml-list

On 14 September 2013 23:10, Jacques-Henri Jourdan
<jacques-henri.jourdan@ens.fr> wrote:
> statistical profiling by
> annotating only a fraction of the allocated blocks.

I think it's certainly worth experimenting with different approaches.
I've tried to address your points below.

> Here are the advantages I had in mind:
>
> 1- Lower execution time overhead. BTW, what is yours?

I don't have any exact figures to hand, and in fact, it will
potentially vary quite a lot depending on the amount of allocation.
I think the time overhead is maybe 20% at the moment for a large
allocation-heavy application.

This could be decreased somewhat---firstly by optimizing, and
secondly by maybe allowing the "global"
(cf. [Allocation_profiling.Global] in the stdlib) analysis to be
disabled.  (This analysis was actually the predecessor of the
value-annotating analysis.)  The remaining overhead of annotating
the values is small.
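
For readers unfamiliar with the "global" analysis, its result has the
shape of a table from source location to total words allocated.  A
minimal sketch of that bookkeeping follows; the function names are
hypothetical and this says nothing about how the runtime actually
stores or updates its counters.

```ocaml
(* A fresh table from source location to total words allocated there. *)
let create_totals () : (string, int) Hashtbl.t = Hashtbl.create 64

(* Add [words] to the running total for [location]. *)
let record_allocation totals ~location ~words =
  let sofar = try Hashtbl.find totals location with Not_found -> 0 in
  Hashtbl.replace totals location (sofar + words)

(* Locations ordered by total words allocated, largest first. *)
let report totals =
  let rows = Hashtbl.fold (fun loc w acc -> (loc, w) :: acc) totals [] in
  List.sort (fun (_, w1) (_, w2) -> compare w2 w1) rows

let () =
  let totals = create_totals () in
  record_allocation totals ~location:"a.ml:1" ~words:3;
  record_allocation totals ~location:"b.ml:2" ~words:10;
  record_allocation totals ~location:"a.ml:1" ~words:4;
  List.iter
    (fun (loc, words) -> Printf.printf "%-10s %d words\n" loc words)
    (report totals)
```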

> 2- When annotating a block, we could decide to store more information.
> Typically, it could be very profitable to know (at least a part of) the
> current backtrace while allocating.

Agreed.  I'm hoping to do some work on capturing a partial
backtrace in that scenario.

> 3- We could also analyze more precisely the life of annotated objects
> without much performance constraint, because there are fewer of them. We
> could for example put watchpoints on them to know when they are accessed,
> or gather statistics about their lifetime...

I think if using watchpoints you'd have to pick and choose what
you instrument fairly carefully, in any case, otherwise everything
will grind to a halt.  Lifetime statistics are likely to be supported
soon (this should be easy).

> 4- Your method for annotating blocks uses the 22 highest bits of the
> block headers to store bits 4..25 of the allocation point address.
> I can see several (minor) problems with doing that:
>    - The maximum size of a block is then limited to 32GB.

I think such blocks are unlikely to occur in practice.  I'd argue
that it's most likely a mistake to have such large allocations inside
the OCaml heap, too.
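
The 32GB figure follows from the width of the size field.  The sketch
below assumes the usual 64-bit header with 54 bits of size (in words),
of which 22 are given up, and 8-byte words; the function name is made
up.

```ocaml
(* Largest block, in bytes, for a size field of [wosize_bits] bits. *)
let max_block_bytes ~wosize_bits =
  let max_words = Int64.pred (Int64.shift_left 1L wosize_bits) in
  Int64.mul max_words 8L

let gib bytes = Int64.to_float bytes /. (1024. *. 1024. *. 1024.)

let () =
  Printf.printf "54-bit size field: %.0f GiB\n"
    (gib (max_block_bytes ~wosize_bits:54));
  Printf.printf "32-bit size field: %.0f GiB\n"
    (gib (max_block_bytes ~wosize_bits:32))
```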

>    - It means that those 22 bits alone identify the allocation point,
> and I am not convinced that the probability of collision is negligible
> in the case of a large code base (like code) non-contiguously loaded in
> memory because of dynlink, for example.

I neglected to say that this is not expected to work with natdynlink
at the moment.  I think for x86-64 Linux the current assumption about
contiguous code is correct, at least using the normal linker scripts,
and the range is probably sufficient.

The main place where the approximation could be problematic, I think,
is where there are allocation points close together that can't quite
be distinguished.  In practice I'm not sure this is a problem, though.
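
To see which allocation points become indistinguishable, take the
22-bit identifier to be bits 4..25 of the address, as in the quoted
description.  Under that assumption two addresses collide when they
fall in the same 16-byte window, or when they differ by a multiple of
64MB (2^26 bytes).  The function names and example addresses below are
made up.

```ocaml
(* Bits 4..25 of a code address. *)
let site_id addr =
  Int64.logand (Int64.shift_right_logical addr 4) 0x3F_FFFFL

let collide a b = site_id a = site_id b

let () =
  let show a b =
    Printf.printf "0x%Lx vs 0x%Lx: %s\n" a b
      (if collide a b then "same id" else "distinct")
  in
  show 0x401230L 0x40123CL;                       (* same 16-byte window *)
  show 0x401230L 0x401240L;                       (* adjacent windows *)
  show 0x401230L (Int64.add 0x401230L 0x4000000L) (* 64MB apart *)
```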

>    - This is not usable on 32-bit x86.

I'm not sure x86-32 is worthy of much attention any more (dare
I say it!) but 32-bit platforms more generally I think still are
of concern.  My plan for upstreaming this work includes a patch
to enable compiler hackers to adjust the layout of the block header
more easily (roughly speaking, removing hard-coded constants in
the code) and that may end up including an option to allocate more
than one header word per block.  This could then be used to solve
the 32-bit problem, as well as likely being a useful platform for
other experiments.

> With statistical profiling, we can afford having a separate table of
> traced blocks, that we would maintain at the end of each GC phase. This
> way, we don't actually "annotate" blocks, but we rather annotate the
> corresponding table entry.

This seems like it might cause quite a lot of extra work, and
disturb cache behaviour, no?  (The current "global" analysis
mentioned above in my system will disturb the cache too, but I
think if that's turned off, just the value annotation should not.)

Mark
