[Caml-list] Case study in optimization: porting a compiler from OCaml to F#

Mailing list for all users of the OCaml language and system.
 help / color / mirror / Atom feed

* [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
@ 2013-03-13 17:04 Jon Harrop
  2013-03-13 17:14 ` julien verlaguet
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Jon Harrop @ 2013-03-13 17:04 UTC (permalink / raw)
  To: caml-list

There has been some discussion here about the implications of single- vs
multi-threaded garbage collectors and, in particular, their performance in
the context of the kinds of metaprogramming that OCaml has traditionally
been used for.

I recently ported a compiler written in OCaml to the F# programming language
for a client and performance turned out to be an issue so I'd like to
present this as a case study to provide some real data. Unfortunately I
cannot disclose precise details.

The original compiler was 15kLOC of OCaml code. The amounts of DSL code that
it consumes and C code that it produces can be considerable and compilation
can take minutes. Consequently, performance is valued by my client's
customers and, therefore, the original code had been optimized for OCaml's
performance characteristics.

A direct translation of the OCaml code to F# proved to be over 10x slower.
This was so slow that it impeded testing my translation so I did some
optimization early. Specifically, profiling indicated that the biggest
problem was the high rate of exceptions being raised and caught. Exceptions
are around 600x slower on .NET than in OCaml so this can quickly degrade
performance. I changed all of the hot paths to use union types (usually
option types) instead of exceptions, according to F# idioms. Although this
incurs a lot of unnecessary boxing in F# the performance improvements were
substantial and the F# version became 5x slower than the OCaml.

On a related note, thorough testing showed that my almost-blind translation
of 15kLOC of code was completely error free. I think this is a real
testament to the power of ML's static type system. The only error I have
introduced so far occurred when I was replacing the use of an exception in a
function with a union type.

After demonstrating the correctness of the translation, my effort turned to
trying to improve performance in an attempt to compete with the original
OCaml code. I had believed that this could well prove to be prohibitively
difficult or even impossible because symbolic code is OCaml's main strength.
However, I have managed to make the F# around 8x faster than it was and, in
particular, substantially faster than the original OCaml.

So this non-trivial symbolic code base has not had its performance suffer
from the adoption of a multicore-friendly garbage collector.

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-13 17:04 [Caml-list] Case study in optimization: porting a compiler from OCaml to F# Jon Harrop
@ 2013-03-13 17:14 ` julien verlaguet
  2013-03-13 19:19   ` Jon Harrop
  2013-03-13 17:55 ` Alain Frisch
  2013-03-13 18:27 ` oliver
  2 siblings, 1 reply; 22+ messages in thread
From: julien verlaguet @ 2013-03-13 17:14 UTC (permalink / raw)
  To: jon; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 3210 bytes --]

Hi Jon,

Thanks for sharing this case-study with us!
Have you tried parallelizing the OCaml version? I am thinking pre-forked
processes communicating with pipes?
We write a lot of large-scale static-analysis in OCaml here at Facebook.
Parallelizing them with pre-forked processes gave us very good performances.
I would be curious to see a case-study of pre-forked OCaml vs threaded F#.

Julien

2013/3/13 Jon Harrop <jon@ffconsultancy.com>

>
> There has been some discussion here about the implications of single- vs
> multi-threaded garbage collectors and, in particular, their performance in
> the context of the kinds of metaprogramming that OCaml has traditionally
> been used for.
>
> I recently ported a compiler written in OCaml to the F# programming
> language
> for a client and performance turned out to be an issue so I'd like to
> present this as a case study to provide some real data. Unfortunately I
> cannot disclose precise details.
>
> The original compiler was 15kLOC of OCaml code. The amounts of DSL code
> that
> it consumes and C code that it produces can be considerable and compilation
> can take minutes. Consequently, performance is valued by my client's
> customers and, therefore, the original code had been optimized for OCaml's
> performance characteristics.
>
> A direct translation of the OCaml code to F# proved to be over 10x slower.
> This was so slow that it impeded testing my translation so I did some
> optimization early. Specifically, profiling indicated that the biggest
> problem was the high rate of exceptions being raised and caught. Exceptions
> are around 600x slower on .NET than in OCaml so this can quickly degrade
> performance. I changed all of the hot paths to use union types (usually
> option types) instead of exceptions, according to F# idioms. Although this
> incurs a lot of unnecessary boxing in F# the performance improvements were
> substantial and the F# version became 5x slower than the OCaml.
>
> On a related note, thorough testing showed that my almost-blind translation
> of 15kLOC of code was completely error free. I think this is a real
> testament to the power of ML's static type system. The only error I have
> introduced so far occurred when I was replacing the use of an exception in
> a
> function with a union type.
>
> After demonstrating the correctness of the translation, my effort turned to
> trying to improve performance in an attempt to compete with the original
> OCaml code. I had believed that this could well prove to be prohibitively
> difficult or even impossible because symbolic code is OCaml's main
> strength.
> However, I have managed to make the F# around 8x faster than it was and, in
> particular, substantially faster than the original OCaml.
>
> So this non-trivial symbolic code base has not had its performance suffer
> from the adoption of a multicore-friendly garbage collector.
>
> --
> Dr Jon Harrop, Flying Frog Consultancy Ltd.
> http://www.ffconsultancy.com
>
>
> --
> Caml-list mailing list.  Subscription management and archives:
> https://sympa.inria.fr/sympa/arc/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>

[-- Attachment #2: Type: text/html, Size: 4065 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-13 17:14 ` julien verlaguet
@ 2013-03-13 19:19   ` Jon Harrop
  2013-03-13 19:28     ` oliver
                       ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Jon Harrop @ 2013-03-13 19:19 UTC (permalink / raw)
  To: 'julien verlaguet'; +Cc: caml-list

Julien wrote:
> Thanks for sharing this case-study with us!

No problem. I found it in my drafts folder. :-)

> Have you tried parallelizing the OCaml version?

No. I didn't touch the OCaml code at all.

> I am thinking pre-forked processes communicating with pipes?

That would work but it would be a lot of effort compared to
Array.Parallel.map in F#.

> We write a lot of large-scale static-analysis in OCaml here at Facebook.

Good to hear. :-)

> Parallelizing them with pre-forked processes gave us very good
performances.

I'm not sure how much message passing would be required in this case and
don't have time to investigate.

> I would be curious to see a case-study of pre-forked OCaml vs threaded F#.

Me too. Only problem is that fork-based parallel OCaml code takes a long
time to write in comparison. Incidentally, the F# does not make direct use
of threads.

Cheers,
Jon.

From: julien verlaguet [mailto:julien.verlaguet@gmail.com] 
Sent: 13 March 2013 17:14
To: jon@ffconsultancy.com
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] Case study in optimization: porting a compiler from
OCaml to F#

Hi Jon,

Thanks for sharing this case-study with us!
Have you tried parallelizing the OCaml version? I am thinking pre-forked
processes communicating with pipes?
We write a lot of large-scale static-analysis in OCaml here at Facebook.
Parallelizing them with pre-forked processes gave us very good performances.
I would be curious to see a case-study of pre-forked OCaml vs threaded F#.

Julien
2013/3/13 Jon Harrop <jon@ffconsultancy.com>

There has been some discussion here about the implications of single- vs
multi-threaded garbage collectors and, in particular, their performance in
the context of the kinds of metaprogramming that OCaml has traditionally
been used for.

I recently ported a compiler written in OCaml to the F# programming language
for a client and performance turned out to be an issue so I'd like to
present this as a case study to provide some real data. Unfortunately I
cannot disclose precise details.

The original compiler was 15kLOC of OCaml code. The amounts of DSL code that
it consumes and C code that it produces can be considerable and compilation
can take minutes. Consequently, performance is valued by my client's
customers and, therefore, the original code had been optimized for OCaml's
performance characteristics.

A direct translation of the OCaml code to F# proved to be over 10x slower.
This was so slow that it impeded testing my translation so I did some
optimization early. Specifically, profiling indicated that the biggest
problem was the high rate of exceptions being raised and caught. Exceptions
are around 600x slower on .NET than in OCaml so this can quickly degrade
performance. I changed all of the hot paths to use union types (usually
option types) instead of exceptions, according to F# idioms. Although this
incurs a lot of unnecessary boxing in F# the performance improvements were
substantial and the F# version became 5x slower than the OCaml.

On a related note, thorough testing showed that my almost-blind translation
of 15kLOC of code was completely error free. I think this is a real
testament to the power of ML's static type system. The only error I have
introduced so far occurred when I was replacing the use of an exception in a
function with a union type.

After demonstrating the correctness of the translation, my effort turned to
trying to improve performance in an attempt to compete with the original
OCaml code. I had believed that this could well prove to be prohibitively
difficult or even impossible because symbolic code is OCaml's main strength.
However, I have managed to make the F# around 8x faster than it was and, in
particular, substantially faster than the original OCaml.

So this non-trivial symbolic code base has not had its performance suffer
from the adoption of a multicore-friendly garbage collector.

--
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com

--
Caml-list mailing list.  Subscription management and archives:
https://sympa.inria.fr/sympa/arc/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-13 19:19   ` Jon Harrop
@ 2013-03-13 19:28     ` oliver
  2013-03-14 15:05       ` oliver
  2013-03-15 13:34     ` Pierre-Alexandre Voye
  2013-03-19  1:37     ` Francois Berenger
  2 siblings, 1 reply; 22+ messages in thread
From: oliver @ 2013-03-13 19:28 UTC (permalink / raw)
  To: caml-list

On Wed, Mar 13, 2013 at 07:19:29PM -0000, Jon Harrop wrote:
[...]
> Me too. Only problem is that fork-based parallel OCaml code takes a long
> time to write in comparison.
[...]

Does it?

http://camlp3l.inria.fr/eng.htm


Ciao,
   Oliver

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-13 19:28     ` oliver
@ 2013-03-14 15:05       ` oliver
  2013-03-14 15:15         ` oliver
  0 siblings, 1 reply; 22+ messages in thread
From: oliver @ 2013-03-14 15:05 UTC (permalink / raw)
  To: caml-list

Hello,

as I just have seen an email from Piertre Weis,
hementioned CamlP3l is the old stuff...

On Wed, Mar 13, 2013 at 08:28:16PM +0100, oliver wrote:
> On Wed, Mar 13, 2013 at 07:19:29PM -0000, Jon Harrop wrote:
> [...]
> > Me too. Only problem is that fork-based parallel OCaml code takes a long
> > time to write in comparison.
> [...]
> 
> Does it?
> 
> http://camlp3l.inria.fr/eng.htm
[...]

Old.


The new stuff is:

 "Sklml is a functional parallel skeleton compiler and programming system for
 OCaml programs."

  http://sklml.inria.fr/

Ciao,
   Oliver

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-14 15:05       ` oliver
@ 2013-03-14 15:15         ` oliver
  0 siblings, 0 replies; 22+ messages in thread
From: oliver @ 2013-03-14 15:15 UTC (permalink / raw)
  To: caml-list

On Thu, Mar 14, 2013 at 04:05:18PM +0100, oliver wrote:
> Hello,
> 
> as I just have seen an email from Piertre Weis,
[...]

oops, typo => Pierre Weis


Ciao,
   Oliver

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-13 19:19   ` Jon Harrop
  2013-03-13 19:28     ` oliver
@ 2013-03-15 13:34     ` Pierre-Alexandre Voye
  2013-03-17 12:06       ` Jon Harrop
  2013-03-19  1:37     ` Francois Berenger
  2 siblings, 1 reply; 22+ messages in thread
From: Pierre-Alexandre Voye @ 2013-03-15 13:34 UTC (permalink / raw)
  To: Jon Harrop, caml-list

[-- Attachment #1: Type: text/plain, Size: 398 bytes --]

2013/3/13 Jon Harrop <jon@ffconsultancy.com>

>
>
> > I am thinking pre-forked processes communicating with pipes?
>
> That would work but it would be a lot of effort compared to
> Array.Parallel.map in F#.
>

So you could maybe use Parmap.map ?

Parmap.parmap ~ncores:4 funct (Parmap.L elem_list)




-- 
---------------------
https://twitter.com/#!/ontologiae/
http://linuxfr.org/users/montaigne

[-- Attachment #2: Type: text/html, Size: 1176 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-15 13:34     ` Pierre-Alexandre Voye
@ 2013-03-17 12:06       ` Jon Harrop
  2013-03-19  1:50         ` Francois Berenger
  2013-03-19 12:47         ` Jean-Marc Alliot
  0 siblings, 2 replies; 22+ messages in thread
From: Jon Harrop @ 2013-03-17 12:06 UTC (permalink / raw)
  To: 'Pierre-Alexandre Voye', 'caml-list'

Pierre-Alexandre Voye wrote:
> So you could maybe use Parmap.map ?
> Parmap.parmap ~ncores:4 funct (Parmap.L elem_list)

What happens if the inner function returns results via mutation? I assume you must rearrange the code to return all results explicitly and they will then be deep copied (which destroys scalability due to limited shared memory bandwidth on multicores).

Does it do load balancing? I assume not given that ncores is hardcoded.

Does a parmap with ncores=4 inside a parmap with ncores=4 create 16 processes?

Does it deep copy inputs and/or outputs? I assume so, at least for outputs, because you cannot write results in-place without a shared mutable heap.

Does parmap have a large constant overhead? I assume so if it is forking processes.

Another solution is to prefork and explicitly communicate all inputs using message passing but this is equally problematic. You have to rearrange the code. Deep copying inputs also destroys scalability.

Cheers,
Jon.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-17 12:06       ` Jon Harrop
@ 2013-03-19  1:50         ` Francois Berenger
  2013-03-20 20:54           ` Jon Harrop
  2013-03-19 12:47         ` Jean-Marc Alliot
  1 sibling, 1 reply; 22+ messages in thread
From: Francois Berenger @ 2013-03-19  1:50 UTC (permalink / raw)
  To: caml-list

I have observed and measured perfect scalability with up to 4 cores
of an OCaml program using Parmap.
With more than 4 cores, the scalability was degrading.

I think the scalability of the program depends only on the
granularity of the tasks. The tasks were coarse in my case.

F.

On 03/17/2013 09:06 PM, Jon Harrop wrote:
> Pierre-Alexandre Voye wrote:
>> So you could maybe use Parmap.map ?
>> Parmap.parmap ~ncores:4 funct (Parmap.L elem_list)
>
> What happens if the inner function returns results via mutation? I assume you must rearrange the code to return all results explicitly and they will then be deep copied (which destroys scalability due to limited shared memory bandwidth on multicores).
>
> Does it do load balancing? I assume not given that ncores is hardcoded.
>
> Does a parmap with ncores=4 inside a parmap with ncores=4 create 16 processes?
>
> Does it deep copy inputs and/or outputs? I assume so, at least for outputs, because you cannot write results in-place without a shared mutable heap.
>
> Does parmap have a large constant overhead? I assume so if it is forking processes.
>
> Another solution is to prefork and explicitly communicate all inputs using message passing but this is equally problematic. You have to rearrange the code. Deep copying inputs also destroys scalability.
>
> Cheers,
> Jon.
>
>
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-19  1:50         ` Francois Berenger
@ 2013-03-20 20:54           ` Jon Harrop
  2013-03-20 22:35             ` Roberto Di Cosmo
  0 siblings, 1 reply; 22+ messages in thread
From: Jon Harrop @ 2013-03-20 20:54 UTC (permalink / raw)
  To: 'Francois Berenger', caml-list


Not just the granularity. Also the communication including any communication involved in scatter and gather phases. That differs a lot more between OCaml and F#. Fork does copy-on-write but (IIRC) the GC can incur unnecessary copying but, more importantly, requires the gather phase to deep copy results back to the original process. In contrast, data can be passed by reference in F#.

Would be very interesting to benchmark this...

Cheers,
Jon.

-----Original Message-----
From: caml-list-request@inria.fr [mailto:caml-list-request@inria.fr] On Behalf Of Francois Berenger
Sent: 19 March 2013 01:50
To: caml-list@inria.fr
Subject: Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#

I have observed and measured perfect scalability with up to 4 cores of an OCaml program using Parmap.
With more than 4 cores, the scalability was degrading.

I think the scalability of the program depends only on the granularity of the tasks. The tasks were coarse in my case.

F.

On 03/17/2013 09:06 PM, Jon Harrop wrote:
> Pierre-Alexandre Voye wrote:
>> So you could maybe use Parmap.map ?
>> Parmap.parmap ~ncores:4 funct (Parmap.L elem_list)
>
> What happens if the inner function returns results via mutation? I assume you must rearrange the code to return all results explicitly and they will then be deep copied (which destroys scalability due to limited shared memory bandwidth on multicores).
>
> Does it do load balancing? I assume not given that ncores is hardcoded.
>
> Does a parmap with ncores=4 inside a parmap with ncores=4 create 16 processes?
>
> Does it deep copy inputs and/or outputs? I assume so, at least for outputs, because you cannot write results in-place without a shared mutable heap.
>
> Does parmap have a large constant overhead? I assume so if it is forking processes.
>
> Another solution is to prefork and explicitly communicate all inputs using message passing but this is equally problematic. You have to rearrange the code. Deep copying inputs also destroys scalability.
>
> Cheers,
> Jon.
>
>
>


--
Caml-list mailing list.  Subscription management and archives:
https://sympa.inria.fr/sympa/arc/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-20 20:54           ` Jon Harrop
@ 2013-03-20 22:35             ` Roberto Di Cosmo
  2013-03-21  4:13               ` Mike Lin
  0 siblings, 1 reply; 22+ messages in thread
From: Roberto Di Cosmo @ 2013-03-20 22:35 UTC (permalink / raw)
  To: Jon Harrop; +Cc: 'Francois Berenger', caml-list

Hi Jon,
   a concrete set of well justified benchmarks could serve
the cause more than any abstract discussion; please feel
free to set it up, run it, and analyze the results.

Having spent quite a bit of energy on Parmap, after it
started as a sort of a one-afternoon project, and with
the experience of the now very old OCamlP3l library that
started much of this at the end of the '90s (including a
detour through an experimental reimplementation in Haskell),
I definitely took the St Thomas stance with this kind of issues :-)

--
Roberto

On Wed, Mar 20, 2013 at 08:54:59PM -0000, Jon Harrop wrote:
> 
> Not just the granularity. Also the communication including any communication involved in scatter and gather phases. That differs a lot more between OCaml and F#. Fork does copy-on-write but (IIRC) the GC can incur unnecessary copying but, more importantly, requires the gather phase to deep copy results back to the original process. In contrast, data can be passed by reference in F#.
> 
> Would be very interesting to benchmark this...
> 
> Cheers,
> Jon.
> 
> -----Original Message-----
> From: caml-list-request@inria.fr [mailto:caml-list-request@inria.fr] On Behalf Of Francois Berenger
> Sent: 19 March 2013 01:50
> To: caml-list@inria.fr
> Subject: Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
> 
> I have observed and measured perfect scalability with up to 4 cores of an OCaml program using Parmap.
> With more than 4 cores, the scalability was degrading.
> 
> I think the scalability of the program depends only on the granularity of the tasks. The tasks were coarse in my case.
> 
> F.
> 
> On 03/17/2013 09:06 PM, Jon Harrop wrote:
> > Pierre-Alexandre Voye wrote:
> >> So you could maybe use Parmap.map ?
> >> Parmap.parmap ~ncores:4 funct (Parmap.L elem_list)
> >
> > What happens if the inner function returns results via mutation? I assume you must rearrange the code to return all results explicitly and they will then be deep copied (which destroys scalability due to limited shared memory bandwidth on multicores).
> >
> > Does it do load balancing? I assume not given that ncores is hardcoded.
> >
> > Does a parmap with ncores=4 inside a parmap with ncores=4 create 16 processes?
> >
> > Does it deep copy inputs and/or outputs? I assume so, at least for outputs, because you cannot write results in-place without a shared mutable heap.
> >
> > Does parmap have a large constant overhead? I assume so if it is forking processes.
> >
> > Another solution is to prefork and explicitly communicate all inputs using message passing but this is equally problematic. You have to rearrange the code. Deep copying inputs also destroys scalability.
> >
> > Cheers,
> > Jon.
> >
> >
> >
> 
> 
> --
> Caml-list mailing list.  Subscription management and archives:
> https://sympa.inria.fr/sympa/arc/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
> 
> 
> -- 
> Caml-list mailing list.  Subscription management and archives:
> https://sympa.inria.fr/sympa/arc/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs

-- 
Roberto Di Cosmo
 
------------------------------------------------------------------
Professeur               En delegation a l'INRIA
PPS                      E-mail: roberto@dicosmo.org
Universite Paris Diderot WWW  : http://www.dicosmo.org
Case 7014                Tel  : ++33-(0)1-57 27 92 20
5, Rue Thomas Mann       
F-75205 Paris Cedex 13   Identica: http://identi.ca/rdicosmo
FRANCE.                  Twitter: http://twitter.com/rdicosmo
------------------------------------------------------------------
Attachments:
MIME accepted, Word deprecated
      http://www.gnu.org/philosophy/no-word-attachments.html
------------------------------------------------------------------
Office location:
 
Bureau 320 (3rd floor)
Batiment Sophie Germain
Avenue de France
Metro Bibliotheque Francois Mitterrand, ligne 14/RER C
-----------------------------------------------------------------
GPG fingerprint 2931 20CE 3A5A 5390 98EC 8BFC FCCA C3BE 39CB 12D3                        

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-20 22:35             ` Roberto Di Cosmo
@ 2013-03-21  4:13               ` Mike Lin
  2013-03-21  7:35                 ` Roberto Di Cosmo
  0 siblings, 1 reply; 22+ messages in thread
From: Mike Lin @ 2013-03-21  4:13 UTC (permalink / raw)
  To: Roberto Di Cosmo; +Cc: Jon Harrop, Francois Berenger, caml-list

[-- Attachment #1: Type: text/plain, Size: 5243 bytes --]

Just a comment FWIW, I personally moved away from Parmap because it doesn't
deal well with exceptions raised in the child processes. I know the
brokenness of exception marshalling (PR#0001961) complicates this, but I
came up with something I felt was pretty reasonable in my iteration of the
same concept:
https://github.com/mlin/forkwork
Of course I didn't reimplement some of Parmap's fancier features, but this
solution has been "just right" for my applications. I hope this can inspire
a future improvement to exception handling in Parmap.
Mike




On Wed, Mar 20, 2013 at 3:35 PM, Roberto Di Cosmo <roberto@dicosmo.org>wrote:

> Hi Jon,
>    a concrete set of well justified benchmarks could serve
> the cause more than any abstract discussion; please feel
> free to set it up, run it, and analyze the results.
>
> Having spent quite a bit of energy on Parmap, after it
> started as a sort of a one-afternoon project, and with
> the experience of the now very old OCamlP3l library that
> started much of this at the end of the '90s (including a
> detour through an experimental reimplementation in Haskell),
> I definitely took the St Thomas stance with this kind of issues :-)
>
> --
> Roberto
>
> On Wed, Mar 20, 2013 at 08:54:59PM -0000, Jon Harrop wrote:
> >
> > Not just the granularity. Also the communication including any
> communication involved in scatter and gather phases. That differs a lot
> more between OCaml and F#. Fork does copy-on-write but (IIRC) the GC can
> incur unnecessary copying but, more importantly, requires the gather phase
> to deep copy results back to the original process. In contrast, data can be
> passed by reference in F#.
> >
> > Would be very interesting to benchmark this...
> >
> > Cheers,
> > Jon.
> >
> > -----Original Message-----
> > From: caml-list-request@inria.fr [mailto:caml-list-request@inria.fr] On
> Behalf Of Francois Berenger
> > Sent: 19 March 2013 01:50
> > To: caml-list@inria.fr
> > Subject: Re: [Caml-list] Case study in optimization: porting a compiler
> from OCaml to F#
> >
> > I have observed and measured perfect scalability with up to 4 cores of
> an OCaml program using Parmap.
> > With more than 4 cores, the scalability was degrading.
> >
> > I think the scalability of the program depends only on the granularity
> of the tasks. The tasks were coarse in my case.
> >
> > F.
> >
> > On 03/17/2013 09:06 PM, Jon Harrop wrote:
> > > Pierre-Alexandre Voye wrote:
> > >> So you could maybe use Parmap.map ?
> > >> Parmap.parmap ~ncores:4 funct (Parmap.L elem_list)
> > >
> > > What happens if the inner function returns results via mutation? I
> assume you must rearrange the code to return all results explicitly and
> they will then be deep copied (which destroys scalability due to limited
> shared memory bandwidth on multicores).
> > >
> > > Does it do load balancing? I assume not given that ncores is hardcoded.
> > >
> > > Does a parmap with ncores=4 inside a parmap with ncores=4 create 16
> processes?
> > >
> > > Does it deep copy inputs and/or outputs? I assume so, at least for
> outputs, because you cannot write results in-place without a shared mutable
> heap.
> > >
> > > Does parmap have a large constant overhead? I assume so if it is
> forking processes.
> > >
> > > Another solution is to prefork and explicitly communicate all inputs
> using message passing but this is equally problematic. You have to
> rearrange the code. Deep copying inputs also destroys scalability.
> > >
> > > Cheers,
> > > Jon.
> > >
> > >
> > >
> >
> >
> > --
> > Caml-list mailing list.  Subscription management and archives:
> > https://sympa.inria.fr/sympa/arc/caml-list
> > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> > Bug reports: http://caml.inria.fr/bin/caml-bugs
> >
> >
> > --
> > Caml-list mailing list.  Subscription management and archives:
> > https://sympa.inria.fr/sympa/arc/caml-list
> > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> > Bug reports: http://caml.inria.fr/bin/caml-bugs
>
> --
> Roberto Di Cosmo
>
> ------------------------------------------------------------------
> Professeur               En delegation a l'INRIA
> PPS                      E-mail: roberto@dicosmo.org
> Universite Paris Diderot WWW  : http://www.dicosmo.org
> Case 7014                Tel  : ++33-(0)1-57 27 92 20
> 5, Rue Thomas Mann
> F-75205 Paris Cedex 13   Identica: http://identi.ca/rdicosmo
> FRANCE.                  Twitter: http://twitter.com/rdicosmo
> ------------------------------------------------------------------
> Attachments:
> MIME accepted, Word deprecated
>       http://www.gnu.org/philosophy/no-word-attachments.html
> ------------------------------------------------------------------
> Office location:
>
> Bureau 320 (3rd floor)
> Batiment Sophie Germain
> Avenue de France
> Metro Bibliotheque Francois Mitterrand, ligne 14/RER C
> -----------------------------------------------------------------
> GPG fingerprint 2931 20CE 3A5A 5390 98EC 8BFC FCCA C3BE 39CB 12D3
>
> --
> Caml-list mailing list.  Subscription management and archives:
> https://sympa.inria.fr/sympa/arc/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>

[-- Attachment #2: Type: text/html, Size: 7549 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-21  4:13               ` Mike Lin
@ 2013-03-21  7:35                 ` Roberto Di Cosmo
  2013-03-21 20:07                   ` Roberto Di Cosmo
  0 siblings, 1 reply; 22+ messages in thread
From: Roberto Di Cosmo @ 2013-03-21  7:35 UTC (permalink / raw)
  To: Mike Lin; +Cc: Jon Harrop, Francois Berenger, caml-list

Hi Mike, that is something that one could add rather easily... just
open an issue here: https://github.com/rdicosmo/parmap

2013/3/21 Mike Lin <mlin@mlin.net>:
> Just a comment FWIW, I personally moved away from Parmap because it doesn't
> deal well with exceptions raised in the child processes. I know the
> brokenness of exception marshalling (PR#0001961) complicates this, but I
> came up with something I felt was pretty reasonable in my iteration of the
> same concept:
> https://github.com/mlin/forkwork
> Of course I didn't reimplement some of Parmap's fancier features, but this
> solution has been "just right" for my applications. I hope this can inspire
> a future improvement to exception handling in Parmap.
> Mike
>
>
>
>
> On Wed, Mar 20, 2013 at 3:35 PM, Roberto Di Cosmo <roberto@dicosmo.org>
> wrote:
>>
>> Hi Jon,
>>    a concrete set of well justified benchmarks could serve
>> the cause more than any abstract discussion; please feel
>> free to set it up, run it, and analyze the results.
>>
>> Having spent quite a bit of energy on Parmap, after it
>> started as a sort of a one-afternoon project, and with
>> the experience of the now very old OCamlP3l library that
>> started much of this at the end of the '90s (including a
>> detour through an experimental reimplementation in Haskell),
>> I definitely took the St Thomas stance with this kind of issues :-)
>>
>> --
>> Roberto
>>
>> On Wed, Mar 20, 2013 at 08:54:59PM -0000, Jon Harrop wrote:
>> >
>> > Not just the granularity. Also the communication including any
>> > communication involved in scatter and gather phases. That differs a lot more
>> > between OCaml and F#. Fork does copy-on-write but (IIRC) the GC can incur
>> > unnecessary copying but, more importantly, requires the gather phase to deep
>> > copy results back to the original process. In contrast, data can be passed
>> > by reference in F#.
>> >
>> > Would be very interesting to benchmark this...
>> >
>> > Cheers,
>> > Jon.
>> >
>> > -----Original Message-----
>> > From: caml-list-request@inria.fr [mailto:caml-list-request@inria.fr] On
>> > Behalf Of Francois Berenger
>> > Sent: 19 March 2013 01:50
>> > To: caml-list@inria.fr
>> > Subject: Re: [Caml-list] Case study in optimization: porting a compiler
>> > from OCaml to F#
>> >
>> > I have observed and measured perfect scalability with up to 4 cores of
>> > an OCaml program using Parmap.
>> > With more than 4 cores, the scalability was degrading.
>> >
>> > I think the scalability of the program depends only on the granularity
>> > of the tasks. The tasks were coarse in my case.
>> >
>> > F.
>> >
>> > On 03/17/2013 09:06 PM, Jon Harrop wrote:
>> > > Pierre-Alexandre Voye wrote:
>> > >> So you could maybe use Parmap.map ?
>> > >> Parmap.parmap ~ncores:4 funct (Parmap.L elem_list)
>> > >
>> > > What happens if the inner function returns results via mutation? I
>> > > assume you must rearrange the code to return all results explicitly and they
>> > > will then be deep copied (which destroys scalability due to limited shared
>> > > memory bandwidth on multicores).
>> > >
>> > > Does it do load balancing? I assume not given that ncores is
>> > > hardcoded.
>> > >
>> > > Does a parmap with ncores=4 inside a parmap with ncores=4 create 16
>> > > processes?
>> > >
>> > > Does it deep copy inputs and/or outputs? I assume so, at least for
>> > > outputs, because you cannot write results in-place without a shared mutable
>> > > heap.
>> > >
>> > > Does parmap have a large constant overhead? I assume so if it is
>> > > forking processes.
>> > >
>> > > Another solution is to prefork and explicitly communicate all inputs
>> > > using message passing but this is equally problematic. You have to rearrange
>> > > the code. Deep copying inputs also destroys scalability.
>> > >
>> > > Cheers,
>> > > Jon.
>> > >
>> > >
>> > >
>> >
>> >
>> > --
>> > Caml-list mailing list.  Subscription management and archives:
>> > https://sympa.inria.fr/sympa/arc/caml-list
>> > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
>> > Bug reports: http://caml.inria.fr/bin/caml-bugs
>> >
>> >
>> > --
>> > Caml-list mailing list.  Subscription management and archives:
>> > https://sympa.inria.fr/sympa/arc/caml-list
>> > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
>> > Bug reports: http://caml.inria.fr/bin/caml-bugs
>>
>> --
>> Roberto Di Cosmo
>>
>> ------------------------------------------------------------------
>> Professeur               En delegation a l'INRIA
>> PPS                      E-mail: roberto@dicosmo.org
>> Universite Paris Diderot WWW  : http://www.dicosmo.org
>> Case 7014                Tel  : ++33-(0)1-57 27 92 20
>> 5, Rue Thomas Mann
>> F-75205 Paris Cedex 13   Identica: http://identi.ca/rdicosmo
>> FRANCE.                  Twitter: http://twitter.com/rdicosmo
>> ------------------------------------------------------------------
>> Attachments:
>> MIME accepted, Word deprecated
>>       http://www.gnu.org/philosophy/no-word-attachments.html
>> ------------------------------------------------------------------
>> Office location:
>>
>> Bureau 320 (3rd floor)
>> Batiment Sophie Germain
>> Avenue de France
>> Metro Bibliotheque Francois Mitterrand, ligne 14/RER C
>> -----------------------------------------------------------------
>> GPG fingerprint 2931 20CE 3A5A 5390 98EC 8BFC FCCA C3BE 39CB 12D3
>>
>> --
>> Caml-list mailing list.  Subscription management and archives:
>> https://sympa.inria.fr/sympa/arc/caml-list
>> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
>> Bug reports: http://caml.inria.fr/bin/caml-bugs
>
>



-- 
--Roberto Di Cosmo

------------------------------------------------------------------
Professeur               En delegation a l'INRIA
PPS                      E-mail: roberto@dicosmo.org
Universite Paris Diderot WWW  : http://www.dicosmo.org
Case 7014                Tel  : ++33-(0)1-57 27 92 20
5, Rue Thomas Mann
F-75205 Paris Cedex 13
FRANCE.
------------------------------------------------------------------
Attachments:
   MIME accepted
   Word deprecated, http://www.rfc1149.net/documents/whynotword
------------------------------------------------------------------
Office location:

Bureau 6C15 (6th floor)
175, rue du Chevaleret, XIII
Metro Chevaleret, ligne 6
------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-21  7:35                 ` Roberto Di Cosmo
@ 2013-03-21 20:07                   ` Roberto Di Cosmo
  0 siblings, 0 replies; 22+ messages in thread
From: Roberto Di Cosmo @ 2013-03-21 20:07 UTC (permalink / raw)
  To: Mike Lin; +Cc: Francois Berenger, caml-list

Mike, Francois, please report on the github issue tracker
(https://github.com/rdicosmo/parmap) the behaviour you really want to
see when an exception is raised inside a worker.

The current behaviour is, in general, that an exception caught in a
worker is sent back to the father, and then all the program is
stopped.
If the computation can be done without load balancing, though (calling
the simplemapper schema), no special handling of exceptions is done (I
admit this will need to be changed).

More generally, when a feature is missing in a library, it is useful
to file a feature request, and even more useful to contribute a
patch... that's the whole idea of developing these libraries in the
open, after all...

2013/3/21 Roberto Di Cosmo <roberto@dicosmo.org>:
> Hi Mike, that is something that one could add rather easily... just
> open an issue here: https://github.com/rdicosmo/parmap
>
> 2013/3/21 Mike Lin <mlin@mlin.net>:
>> Just a comment FWIW, I personally moved away from Parmap because it doesn't
>> deal well with exceptions raised in the child processes. I know the
>> brokenness of exception marshalling (PR#0001961) complicates this, but I
>> came up with something I felt was pretty reasonable in my iteration of the
>> same concept:
>> https://github.com/mlin/forkwork
>> Of course I didn't reimplement some of Parmap's fancier features, but this
>> solution has been "just right" for my applications. I hope this can inspire
>> a future improvement to exception handling in Parmap.
>> Mike
>>
>>
>>
>>
>> On Wed, Mar 20, 2013 at 3:35 PM, Roberto Di Cosmo <roberto@dicosmo.org>
>> wrote:
>>>
>>> Hi Jon,
>>>    a concrete set of well justified benchmarks could serve
>>> the cause more than any abstract discussion; please feel
>>> free to set it up, run it, and analyze the results.
>>>
>>> Having spent quite a bit of energy on Parmap, after it
>>> started as a sort of a one-afternoon project, and with
>>> the experience of the now very old OCamlP3l library that
>>> started much of this at the end of the '90s (including a
>>> detour through an experimental reimplementation in Haskell),
>>> I definitely took the St Thomas stance with this kind of issues :-)
>>>
>>> --
>>> Roberto
>>>
>>> On Wed, Mar 20, 2013 at 08:54:59PM -0000, Jon Harrop wrote:
>>> >
>>> > Not just the granularity. Also the communication including any
>>> > communication involved in scatter and gather phases. That differs a lot more
>>> > between OCaml and F#. Fork does copy-on-write but (IIRC) the GC can incur
>>> > unnecessary copying but, more importantly, requires the gather phase to deep
>>> > copy results back to the original process. In contrast, data can be passed
>>> > by reference in F#.
>>> >
>>> > Would be very interesting to benchmark this...
>>> >
>>> > Cheers,
>>> > Jon.
>>> >
>>> > -----Original Message-----
>>> > From: caml-list-request@inria.fr [mailto:caml-list-request@inria.fr] On
>>> > Behalf Of Francois Berenger
>>> > Sent: 19 March 2013 01:50
>>> > To: caml-list@inria.fr
>>> > Subject: Re: [Caml-list] Case study in optimization: porting a compiler
>>> > from OCaml to F#
>>> >
>>> > I have observed and measured perfect scalability with up to 4 cores of
>>> > an OCaml program using Parmap.
>>> > With more than 4 cores, the scalability was degrading.
>>> >
>>> > I think the scalability of the program depends only on the granularity
>>> > of the tasks. The tasks were coarse in my case.
>>> >
>>> > F.
>>> >
>>> > On 03/17/2013 09:06 PM, Jon Harrop wrote:
>>> > > Pierre-Alexandre Voye wrote:
>>> > >> So you could maybe use Parmap.map ?
>>> > >> Parmap.parmap ~ncores:4 funct (Parmap.L elem_list)
>>> > >
>>> > > What happens if the inner function returns results via mutation? I
>>> > > assume you must rearrange the code to return all results explicitly and they
>>> > > will then be deep copied (which destroys scalability due to limited shared
>>> > > memory bandwidth on multicores).
>>> > >
>>> > > Does it do load balancing? I assume not given that ncores is
>>> > > hardcoded.
>>> > >
>>> > > Does a parmap with ncores=4 inside a parmap with ncores=4 create 16
>>> > > processes?
>>> > >
>>> > > Does it deep copy inputs and/or outputs? I assume so, at least for
>>> > > outputs, because you cannot write results in-place without a shared mutable
>>> > > heap.
>>> > >
>>> > > Does parmap have a large constant overhead? I assume so if it is
>>> > > forking processes.
>>> > >
>>> > > Another solution is to prefork and explicitly communicate all inputs
>>> > > using message passing but this is equally problematic. You have to rearrange
>>> > > the code. Deep copying inputs also destroys scalability.
>>> > >
>>> > > Cheers,
>>> > > Jon.
>>> > >
>>> > >
>>> > >
>>> >
>>> >
>>> > --
>>> > Caml-list mailing list.  Subscription management and archives:
>>> > https://sympa.inria.fr/sympa/arc/caml-list
>>> > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
>>> > Bug reports: http://caml.inria.fr/bin/caml-bugs
>>> >
>>> >
>>> > --
>>> > Caml-list mailing list.  Subscription management and archives:
>>> > https://sympa.inria.fr/sympa/arc/caml-list
>>> > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
>>> > Bug reports: http://caml.inria.fr/bin/caml-bugs
>>>
>>> --
>>> Roberto Di Cosmo
>>>
>>> ------------------------------------------------------------------
>>> Professeur               En delegation a l'INRIA
>>> PPS                      E-mail: roberto@dicosmo.org
>>> Universite Paris Diderot WWW  : http://www.dicosmo.org
>>> Case 7014                Tel  : ++33-(0)1-57 27 92 20
>>> 5, Rue Thomas Mann
>>> F-75205 Paris Cedex 13   Identica: http://identi.ca/rdicosmo
>>> FRANCE.                  Twitter: http://twitter.com/rdicosmo
>>> ------------------------------------------------------------------
>>> Attachments:
>>> MIME accepted, Word deprecated
>>>       http://www.gnu.org/philosophy/no-word-attachments.html
>>> ------------------------------------------------------------------
>>> Office location:
>>>
>>> Bureau 320 (3rd floor)
>>> Batiment Sophie Germain
>>> Avenue de France
>>> Metro Bibliotheque Francois Mitterrand, ligne 14/RER C
>>> -----------------------------------------------------------------
>>> GPG fingerprint 2931 20CE 3A5A 5390 98EC 8BFC FCCA C3BE 39CB 12D3
>>>
>>> --
>>> Caml-list mailing list.  Subscription management and archives:
>>> https://sympa.inria.fr/sympa/arc/caml-list
>>> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
>>> Bug reports: http://caml.inria.fr/bin/caml-bugs
>>
>>
>
>
>
> --
> --Roberto Di Cosmo
>
> ------------------------------------------------------------------
> Professeur               En delegation a l'INRIA
> PPS                      E-mail: roberto@dicosmo.org
> Universite Paris Diderot WWW  : http://www.dicosmo.org
> Case 7014                Tel  : ++33-(0)1-57 27 92 20
> 5, Rue Thomas Mann
> F-75205 Paris Cedex 13
> FRANCE.
> ------------------------------------------------------------------
> Attachments:
>    MIME accepted
>    Word deprecated, http://www.rfc1149.net/documents/whynotword
> ------------------------------------------------------------------
> Office location:
>
> Bureau 6C15 (6th floor)
> 175, rue du Chevaleret, XIII
> Metro Chevaleret, ligne 6
> ------------------------------------------------------------------



-- 
--Roberto Di Cosmo

------------------------------------------------------------------
Professeur               En delegation a l'INRIA
PPS                      E-mail: roberto@dicosmo.org
Universite Paris Diderot WWW  : http://www.dicosmo.org
Case 7014                Tel  : ++33-(0)1-57 27 92 20
5, Rue Thomas Mann
F-75205 Paris Cedex 13
FRANCE.
------------------------------------------------------------------
Attachments:
   MIME accepted
   Word deprecated, http://www.rfc1149.net/documents/whynotword
------------------------------------------------------------------
Office location:

Bureau 6C15 (6th floor)
175, rue du Chevaleret, XIII
Metro Chevaleret, ligne 6
------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-17 12:06       ` Jon Harrop
  2013-03-19  1:50         ` Francois Berenger
@ 2013-03-19 12:47         ` Jean-Marc Alliot
  2013-03-20  9:32           ` Roberto Di Cosmo
  1 sibling, 1 reply; 22+ messages in thread
From: Jean-Marc Alliot @ 2013-03-19 12:47 UTC (permalink / raw)
  To: caml-list

We have been using Parmap and are quite happy with it (we use it as an 
alternative to the Ocaml/MPI implementation when MPI is not strictly 
required), with excellent scalability on our applications (even if we 
had to change from time to time a little bit or our code, regarding the 
fact that Parmap does not allow "in place" modifications).

Le 17/03/2013 13:06, Jon Harrop a écrit :
> What happens if the inner function returns results via mutation? I 
> assume you must rearrange the code to return all results explicitly 
> and they will then be deep copied (which destroys scalability due to 
> limited shared memory bandwidth on multicores).
As fas as I can tell there is no "copy" of any form. Parmap only 
collects results. If you do "in place" modifications, they are lost. 
Parmap is, in a way, "functional"...
> Does it do load balancing? I assume not given that ncores is hardcoded. 
There is an optional argument (chunksize) to manually control load 
balancing.
> Does a parmap with ncores=4 inside a parmap with ncores=4 create 16 
> processes? Does it deep copy inputs and/or outputs? 
Regarding outputs, Parmap uses a shared memory area and 
Marshal/Unmarshal to collect outputs. There is an optimization done when 
array of floats are returned, as marshalling is not used, thus reducing 
the overhead.
> I assume so, at least for outputs, because you cannot write results 
> in-place without a shared mutable heap. Does parmap have a large 
> constant overhead? I assume so if it is forking processes. 
Well, it depends on what you call "large constant overhead". Forking is 
not a so expensive primitive on modern Unix systems, because pages are 
only copied when written-to (copy-on-write).
> Another solution is to prefork and explicitly communicate all inputs 
> using message passing but this is equally problematic. You have to 
> rearrange the code. Deep copying inputs also destroys scalability. 
> Cheers, Jon. 

There is an article describing the implementation (I think it is also 
available online):

A “minimal disruption” skeleton experiment: seamless map & reduce 
embedding in OCaml
M. Danelutto, R. Di Cosmo
Procedia Computer Science 9 ( 2012 ) 1837 – 1846


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-19 12:47         ` Jean-Marc Alliot
@ 2013-03-20  9:32           ` Roberto Di Cosmo
  0 siblings, 0 replies; 22+ messages in thread
From: Roberto Di Cosmo @ 2013-03-20  9:32 UTC (permalink / raw)
  To: Jean-Marc Alliot; +Cc: caml-list, Marco Danelutto

Dear Jean-Marc,
     we are quite happy that Parmap is useful in your applications,
and would love to know more about your use cases.

Btw, I uploaded to HAL an author version of the article published in Procedia
Computer Science, so that everybody can access the paper ... it is
available at 

 http://hal.archives-ouvertes.fr/docs/00/69/25/15/PDF/parmap-author-file.pdf

all the best

--
Roberto and Marco

On Tue, Mar 19, 2013 at 01:47:51PM +0100, Jean-Marc Alliot wrote:
> We have been using Parmap and are quite happy with it (we use it as
> an alternative to the Ocaml/MPI implementation when MPI is not
> strictly required), with excellent scalability on our applications
> (even if we had to change from time to time a little bit or our
> code, regarding the fact that Parmap does not allow "in place"
> modifications).
> 
> Le 17/03/2013 13:06, Jon Harrop a écrit :
> >What happens if the inner function returns results via mutation? I
> >assume you must rearrange the code to return all results
> >explicitly and they will then be deep copied (which destroys
> >scalability due to limited shared memory bandwidth on multicores).
> As fas as I can tell there is no "copy" of any form. Parmap only
> collects results. If you do "in place" modifications, they are lost.
> Parmap is, in a way, "functional"...
> >Does it do load balancing? I assume not given that ncores is
> >hardcoded.
> There is an optional argument (chunksize) to manually control load
> balancing.
> >Does a parmap with ncores=4 inside a parmap with ncores=4 create
> >16 processes? Does it deep copy inputs and/or outputs?
> Regarding outputs, Parmap uses a shared memory area and
> Marshal/Unmarshal to collect outputs. There is an optimization done
> when array of floats are returned, as marshalling is not used, thus
> reducing the overhead.
> >I assume so, at least for outputs, because you cannot write
> >results in-place without a shared mutable heap. Does parmap have a
> >large constant overhead? I assume so if it is forking processes.
> Well, it depends on what you call "large constant overhead". Forking
> is not a so expensive primitive on modern Unix systems, because
> pages are only copied when written-to (copy-on-write).
> >Another solution is to prefork and explicitly communicate all
> >inputs using message passing but this is equally problematic. You
> >have to rearrange the code. Deep copying inputs also destroys
> >scalability. Cheers, Jon.
> 
> There is an article describing the implementation (I think it is
> also available online):
> 
> A “minimal disruption” skeleton experiment: seamless map & reduce
> embedding in OCaml
> M. Danelutto, R. Di Cosmo
> Procedia Computer Science 9 ( 2012 ) 1837 – 1846
> 
> 
> -- 
> Caml-list mailing list.  Subscription management and archives:
> https://sympa.inria.fr/sympa/arc/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs

-- 
Roberto Di Cosmo
 
------------------------------------------------------------------
Professeur               En delegation a l'INRIA
PPS                      E-mail: roberto@dicosmo.org
Universite Paris Diderot WWW  : http://www.dicosmo.org
Case 7014                Tel  : ++33-(0)1-57 27 92 20
5, Rue Thomas Mann       
F-75205 Paris Cedex 13   Identica: http://identi.ca/rdicosmo
FRANCE.                  Twitter: http://twitter.com/rdicosmo
------------------------------------------------------------------
Attachments:
MIME accepted, Word deprecated
      http://www.gnu.org/philosophy/no-word-attachments.html
------------------------------------------------------------------
Office location:
 
Bureau 320 (3rd floor)
Batiment Sophie Germain
Avenue de France
Metro Bibliotheque Francois Mitterrand, ligne 14/RER C
-----------------------------------------------------------------
GPG fingerprint 2931 20CE 3A5A 5390 98EC 8BFC FCCA C3BE 39CB 12D3                        

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-13 19:19   ` Jon Harrop
  2013-03-13 19:28     ` oliver
  2013-03-15 13:34     ` Pierre-Alexandre Voye
@ 2013-03-19  1:37     ` Francois Berenger
  2 siblings, 0 replies; 22+ messages in thread
From: Francois Berenger @ 2013-03-19  1:37 UTC (permalink / raw)
  To: caml-list

On 03/14/2013 04:19 AM, Jon Harrop wrote:
> Julien wrote:
>> Thanks for sharing this case-study with us!
>
> No problem. I found it in my drafts folder. :-)
>
>> Have you tried parallelizing the OCaml version?
>
> No. I didn't touch the OCaml code at all.
>
>> I am thinking pre-forked processes communicating with pipes?
>
> That would work but it would be a lot of effort compared to
> Array.Parallel.map in F#.

There is such a function in the Parmap library for OCaml.

>> We write a lot of large-scale static-analysis in OCaml here at Facebook.
>
> Good to hear. :-)
>
>> Parallelizing them with pre-forked processes gave us very good
> performances.
>
> I'm not sure how much message passing would be required in this case and
> don't have time to investigate.
>
>> I would be curious to see a case-study of pre-forked OCaml vs threaded F#.
>
> Me too. Only problem is that fork-based parallel OCaml code takes a long
> time to write in comparison. Incidentally, the F# does not make direct use
> of threads.
>
> Cheers,
> Jon.
>
>
> From: julien verlaguet [mailto:julien.verlaguet@gmail.com]
> Sent: 13 March 2013 17:14
> To: jon@ffconsultancy.com
> Cc: caml-list@inria.fr
> Subject: Re: [Caml-list] Case study in optimization: porting a compiler from
> OCaml to F#
>
> Hi Jon,
>
> Thanks for sharing this case-study with us!
> Have you tried parallelizing the OCaml version? I am thinking pre-forked
> processes communicating with pipes?
> We write a lot of large-scale static-analysis in OCaml here at Facebook.
> Parallelizing them with pre-forked processes gave us very good performances.
> I would be curious to see a case-study of pre-forked OCaml vs threaded F#.
>
> Julien
> 2013/3/13 Jon Harrop <jon@ffconsultancy.com>
>
> There has been some discussion here about the implications of single- vs
> multi-threaded garbage collectors and, in particular, their performance in
> the context of the kinds of metaprogramming that OCaml has traditionally
> been used for.
>
> I recently ported a compiler written in OCaml to the F# programming language
> for a client and performance turned out to be an issue so I'd like to
> present this as a case study to provide some real data. Unfortunately I
> cannot disclose precise details.
>
> The original compiler was 15kLOC of OCaml code. The amounts of DSL code that
> it consumes and C code that it produces can be considerable and compilation
> can take minutes. Consequently, performance is valued by my client's
> customers and, therefore, the original code had been optimized for OCaml's
> performance characteristics.
>
> A direct translation of the OCaml code to F# proved to be over 10x slower.
> This was so slow that it impeded testing my translation so I did some
> optimization early. Specifically, profiling indicated that the biggest
> problem was the high rate of exceptions being raised and caught. Exceptions
> are around 600x slower on .NET than in OCaml so this can quickly degrade
> performance. I changed all of the hot paths to use union types (usually
> option types) instead of exceptions, according to F# idioms. Although this
> incurs a lot of unnecessary boxing in F# the performance improvements were
> substantial and the F# version became 5x slower than the OCaml.
>
> On a related note, thorough testing showed that my almost-blind translation
> of 15kLOC of code was completely error free. I think this is a real
> testament to the power of ML's static type system. The only error I have
> introduced so far occurred when I was replacing the use of an exception in a
> function with a union type.
>
> After demonstrating the correctness of the translation, my effort turned to
> trying to improve performance in an attempt to compete with the original
> OCaml code. I had believed that this could well prove to be prohibitively
> difficult or even impossible because symbolic code is OCaml's main strength.
> However, I have managed to make the F# around 8x faster than it was and, in
> particular, substantially faster than the original OCaml.
>
> So this non-trivial symbolic code base has not had its performance suffer
> from the adoption of a multicore-friendly garbage collector.
>
> --
> Dr Jon Harrop, Flying Frog Consultancy Ltd.
> http://www.ffconsultancy.com
>
>
> --
> Caml-list mailing list.  Subscription management and archives:
> https://sympa.inria.fr/sympa/arc/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>
>
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-13 17:04 [Caml-list] Case study in optimization: porting a compiler from OCaml to F# Jon Harrop
  2013-03-13 17:14 ` julien verlaguet
@ 2013-03-13 17:55 ` Alain Frisch
  2013-03-13 19:44   ` Jon Harrop
  2013-03-13 18:27 ` oliver
  2 siblings, 1 reply; 22+ messages in thread
From: Alain Frisch @ 2013-03-13 17:55 UTC (permalink / raw)
  To: jon; +Cc: caml-list

On 03/13/2013 06:04 PM, Jon Harrop wrote:
> Unfortunately I
> cannot disclose precise details.

Too bad, because the really interesting part would to know (i) what kind 
of optimizations you had to do on the F# version (and in particular 
whether they make us of parallelism), (ii) whether (some of) those 
optimizations could be applicable to the OCaml version as well, and 
(iii) how much effort you initially put into optimizing the OCaml 
version (just tweaking the GC parameters can easily give very 
substantial speedups for symbolic processing).

-- Alain

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-13 17:55 ` Alain Frisch
@ 2013-03-13 19:44   ` Jon Harrop
  2013-03-13 21:02     ` Alain Frisch
  0 siblings, 1 reply; 22+ messages in thread
From: Jon Harrop @ 2013-03-13 19:44 UTC (permalink / raw)
  To: 'Alain Frisch'; +Cc: caml-list

Alain Frisch wrote:
> Too bad, because the really interesting part would to know (i) what kind
of
> optimizations you had to do on the F# version (and in particular whether
> they make us of parallelism),

* Replaced the use of exceptions for control flow with variant types.
* Replaced use of fprintf with lower-level .NET functions.
* Parallelized some of the symbolic code.
* Algorithmic optimization to a search function.

> (ii) whether (some of) those optimizations could be applicable to the
OCaml
> version as well, and

* No need to replace exceptions in OCaml.
* No need to replace printf and friends in OCaml.
* Could try to parallelize the OCaml but it would be hard to do efficiently.
* Algorithmic optimization could also be done to the OCaml (IIRC, this was a
fairly minor speedup).

> (iii) how much effort you initially put into optimizing the OCaml version
(just
> tweaking the GC parameters can easily give very substantial speedups for
> symbolic processing).

None. I didn't change the OCaml code at all and didn't try different GC
parameters. However, it had already been quite heavily optimized.

Cheers,
Jon.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-13 19:44   ` Jon Harrop
@ 2013-03-13 21:02     ` Alain Frisch
  0 siblings, 0 replies; 22+ messages in thread
From: Alain Frisch @ 2013-03-13 21:02 UTC (permalink / raw)
  To: jon; +Cc: caml-list

On 3/13/2013 8:44 PM, Jon Harrop wrote:
> * No need to replace exceptions in OCaml.

It depends.  If you compile in debug mode and run with backtrace 
enabled, exceptions are not so fast any more.

> * No need to replace printf and friends in OCaml.

Printf can be quite slow.  Format direct-style as well.  And 
Format.printf is even worse.

( http://www.lexifi.com/blog/note-about-performance-printf-and-format )


Alain

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-13 17:04 [Caml-list] Case study in optimization: porting a compiler from OCaml to F# Jon Harrop
  2013-03-13 17:14 ` julien verlaguet
  2013-03-13 17:55 ` Alain Frisch
@ 2013-03-13 18:27 ` oliver
  2013-03-13 20:00   ` Jon Harrop
  2 siblings, 1 reply; 22+ messages in thread
From: oliver @ 2013-03-13 18:27 UTC (permalink / raw)
  To: caml-list

Hello Jon,

thanks for your report.
Very interesting.

On Wed, Mar 13, 2013 at 05:04:26PM -0000, Jon Harrop wrote:
[...]
> After demonstrating the correctness of the translation, my effort turned to
> trying to improve performance in an attempt to compete with the original
> OCaml code. I had believed that this could well prove to be prohibitively
> difficult or even impossible because symbolic code is OCaml's main strength.
> However, I have managed to make the F# around 8x faster than it was and, in
> particular, substantially faster than the original OCaml.
[...]

What is missing, is the information, how many cores / processors the machine has,
on which the F# code runs, and if the OCaml code runs on the same machine.
What about Ocaml Byteocde vs. Nativecode?

And what, if the re-designt program would be back-ported to OCaml?

Ciao,
   Oliver

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#
  2013-03-13 18:27 ` oliver
@ 2013-03-13 20:00   ` Jon Harrop
  0 siblings, 0 replies; 22+ messages in thread
From: Jon Harrop @ 2013-03-13 20:00 UTC (permalink / raw)
  To: 'oliver', caml-list

Oliver wrote:
> What is missing, is the information, how many cores / processors the machine
> has, on which the F# code runs, and if the OCaml code runs on the same machine.

Benchmark measurements used for comparison between OCaml and F# were always done on the same machine running Windows. I used two machines:

* 2-core 1.67GHz Intel Atom N570 based netbook.
* 8-core 2.0GHz Intel Xeon E5405 based workstation.

The client verified the relative performance on their own machines.

> What about Ocaml Byteocde vs. Nativecode?

The performance comparison was done using native code. We did not benchmark OCaml bytecode.

> And what, if the re-designt program would be back-ported to OCaml?

Only minor changes were made, no redesign. Some of those changes could be back-ported to the OCaml but there is no motivation to do so.

Cheers,
Jon.

-----Original Message-----
From: caml-list-request@inria.fr [mailto:caml-list-request@inria.fr] On Behalf Of oliver
Sent: 13 March 2013 18:28
To: caml-list@inria.fr
Subject: Re: [Caml-list] Case study in optimization: porting a compiler from OCaml to F#

Hello Jon,

thanks for your report.
Very interesting.

On Wed, Mar 13, 2013 at 05:04:26PM -0000, Jon Harrop wrote:
[...]
> After demonstrating the correctness of the translation, my effort 
> turned to trying to improve performance in an attempt to compete with 
> the original OCaml code. I had believed that this could well prove to 
> be prohibitively difficult or even impossible because symbolic code is OCaml's main strength.
> However, I have managed to make the F# around 8x faster than it was 
> and, in particular, substantially faster than the original OCaml.
[...]

What is missing, is the information, how many cores / processors the machine has, on which the F# code runs, and if the OCaml code runs on the same machine.
What about Ocaml Byteocde vs. Nativecode?

And what, if the re-designt program would be back-ported to OCaml?

Ciao,
   Oliver

--
Caml-list mailing list.  Subscription management and archives:
https://sympa.inria.fr/sympa/arc/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2013-03-21 20:07 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-13 17:04 [Caml-list] Case study in optimization: porting a compiler from OCaml to F# Jon Harrop
2013-03-13 17:14 ` julien verlaguet
2013-03-13 19:19   ` Jon Harrop
2013-03-13 19:28     ` oliver
2013-03-14 15:05       ` oliver
2013-03-14 15:15         ` oliver
2013-03-15 13:34     ` Pierre-Alexandre Voye
2013-03-17 12:06       ` Jon Harrop
2013-03-19  1:50         ` Francois Berenger
2013-03-20 20:54           ` Jon Harrop
2013-03-20 22:35             ` Roberto Di Cosmo
2013-03-21  4:13               ` Mike Lin
2013-03-21  7:35                 ` Roberto Di Cosmo
2013-03-21 20:07                   ` Roberto Di Cosmo
2013-03-19 12:47         ` Jean-Marc Alliot
2013-03-20  9:32           ` Roberto Di Cosmo
2013-03-19  1:37     ` Francois Berenger
2013-03-13 17:55 ` Alain Frisch
2013-03-13 19:44   ` Jon Harrop
2013-03-13 21:02     ` Alain Frisch
2013-03-13 18:27 ` oliver
2013-03-13 20:00   ` Jon Harrop

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox