From: Francois BERENGER <berenger@bioreg.kyushu-u.ac.jp>
To: caml-list@inria.fr
Subject: Re: [Caml-list] Can this code be accelerated by porting it to SPOC, SAREK or MetaOCaml ?
Date: Fri, 16 Jun 2017 18:49:12 +0900 [thread overview]
Message-ID: <d930e2f3-8b74-39bc-b2d1-f69ad6a9dd33@bioreg.kyushu-u.ac.jp> (raw)
In-Reply-To: <CALdWJ+zsKqDm+C74TGO8K6U4JvQO_SPspv41EDk125RQepWRYw@mail.gmail.com>
On 06/15/2017 07:38 PM, Ivan Gotovchits wrote:
> Hi François,
>
> Your code is actually bound by memory, not by computation. The problem
> is that you create lots of data, and all the time is spent in the
> allocation functions and GC. The actual computation consumes less than
> 5% of overall program execution, that can be easily seen by profiling
> (always profile before starting optimization). Thus moving to GPU will
> make this 5% of code run faster, at the expense of even higher overhead
> of moving data between CPU and GPU.
Thanks Ivan for the detailed explanations, the code suggestions and even
benchmarks !
Thanks to profiling, I knew the code snippet I sent is the bottleneck.
I also saw the whole program was spending 40% of its time doing GC.
I went with your deforesting/noalloc approach.
However, the reduction function cannot avoid doing array updates.
But, the results were amazing already: now, my program runs 4 times
faster than before and the numerical results are unchanged. I am impressed.
Thanks a lot!
F.
> So, let's see how we can optimize your code. First of all, your code
> uses too much mutable state, that can have a negative performance impact
> in OCaml due to write barriers. Let's make it a little bit cleaner and
> faster:
>
> let nonmutab points =
> let rec loop xs acc = match xs with
> | [] -> acc
> | (x,w) :: xs ->
> List.fold_left (fun acc (x',w') ->
> (Vector3.dist x x', w *. w') :: acc) acc xs |>
> loop xs in
> loop points []
>
> This code runs a little bit faster (on 10000 points):
>
> original: 14084.1 ms
> nonmutab: 11309.2 ms
>
> It computes absolutely the same result, using the same computational
> core, and allocating the same amount of trash. The only difference is
> that we are not using a mutable state anymore, and we are rewarded for that.
>
> The next step is to consider, do you really need to produce these
> intermediate structures, if the result of your program is the computed
> data, then you can just store it in a file. Or, if you need later this
> data to be reduced to some value, then you shouldn't create an
> intermediate result, and apply the reduction in place (so called
> de-foresting). So, let's generalize a little bit the function so that
> instead of producing a new list, the function just applies a
> user-provided function to all produced elements, along with an accumulator.
>
> let nonalloc f acc points =
> let rec loop xs acc = match xs with
> | [] -> acc
> | (x,w) :: xs ->
> List.fold_left (fun acc (x',w') ->
> f acc (Vector3.dist x x') (w *. w')) acc xs |>
> loop xs in
> loop points acc
>
>
>
> This function will not allocate any unnecessary data (it will though
> allocate a pair of floats per each call). So we can use this function to
> implement the `nonmutab` version, by passing a consing operator and an
> empty list. We can also try to use it to store data. The data storage
> process depends on how fast we can reformat the data, and how fast is
> the sink device. If we will output to /dev/null (that is known to be
> fast), then we can use the Marshal module and get about 300 MB/s
> throughtput. Not bad, but still about 10 seconds of running time. If,
> for example, we just need some scalar metrics, then we don't need to pay
> an overhead of data, and it will be as fast as possible. For example,
> with the following implementations of the kernel function
>
> let flags = [Marshal.No_sharing]
>
> let print () x1 x2 =
> Marshal.to_channel stdout x1 flags;
> Marshal.to_channel stdout x2 flags
>
> let product total x1 x2 = total +. x1 *. x2
>
>
> We will have the following timings:
>
> printall: 9649.54 ms
> nonalloc: 541.186 ms
>
>
> So, the actual computation took only half a second, the rest is data
> processing.
>
> Please find attached the whole example. You can run it with the
> following command:
>
> ocamlbuild -pkgs vector3,unix points.native -- > /dev/null
>
>
> Regards,
> Ivan Gotovchits
>
>
> On Thu, Jun 15, 2017 at 9:09 AM, Ronan Le Hy <ronan.lehy@gmail.com
> <mailto:ronan.lehy@gmail.com>> wrote:
>
> Hi François,
>
> 2017-06-15 8:28 GMT+02:00 Francois BERENGER
> <berenger@bioreg.kyushu-u.ac.jp
> <mailto:berenger@bioreg.kyushu-u.ac.jp>>:
> > I am wondering if some high performance OCaml experts out there
> > can know in advance if some code can go faster by executing it
> > on a GPU.
> > I have some clear bottleneck in my program.
> > Here is how the code looks like:
> > ---
> > let f (points: (Vector3.t * float) list) =
> > let acc = ref [] in
> > let ac p1 x (p2, y) =
> > acc := (Vector3.dist p1 p2, x *. y) :: !acc
> > in
> > let rec loop = function
> > | [] -> ()
> > | (p1, x) :: xs ->
> > L.iter (ac p1 x) xs;
> > loop xs
> > in
> > loop points;
> > !acc
>
> As a baseline before attempting anything on the GPU, I'd vectorize
> this. Put all your vectors in a big matrix. Put all your numbers in a
> vector. Compute the distances and the products all at once using
> Lacaml (make sure you use OpenBLAS as a backend). I'd expect it to be
> much faster than the above loop already.
>
> --
> Ronan
>
> --
> Caml-list mailing list. Subscription management and archives:
> https://sympa.inria.fr/sympa/arc/caml-list
> <https://sympa.inria.fr/sympa/arc/caml-list>
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> <http://groups.yahoo.com/group/ocaml_beginners>
> Bug reports: http://caml.inria.fr/bin/caml-bugs
> <http://caml.inria.fr/bin/caml-bugs>
>
>
prev parent reply other threads:[~2017-06-16 9:49 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-06-15 6:28 Francois BERENGER
2017-06-15 7:09 ` Ronan Le Hy
2017-06-15 10:38 ` Ivan Gotovchits
2017-06-16 9:49 ` Francois BERENGER [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d930e2f3-8b74-39bc-b2d1-f69ad6a9dd33@bioreg.kyushu-u.ac.jp \
--to=berenger@bioreg.kyushu-u.ac.jp \
--cc=caml-list@inria.fr \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox