* marshal and C structures crash
@ 2007-02-07 22:05 Andres Varon
2007-02-07 22:59 ` [Caml-list] " Robert Roessler
0 siblings, 1 reply; 3+ messages in thread
From: Andres Varon @ 2007-02-07 22:05 UTC
To: OCaml List
Hello Everyone,
I would like to ask a question regarding a bug I have been observing
in one program, which I have been unable to fix:
The program in question is a large phylogenetic analysis application
(bioinformatics), which has been written in OCaml and C. It's almost
ready for public beta testing _except_ for this particular bug.
The bulk of the code is in OCaml (~70,000 LOC), with a small fraction
of core functions in C (obviously it's hard to post the code in
question). It runs in both sequential and parallel versions using
MPI, and makes heavy use of polymorphic variants, functors, and
object-oriented features, wherever each best fits our requirements.
The parallel version was broken for a while, but it used to run
without a problem. A few weeks ago, when I updated the code for
parallel runs (using a master-slave distributed model), I started to
observe slaves segfaulting after a while. I narrowed the problem down
to some marshal-related issue that I can reproduce in the sequential
version by doing the following:
1. load some data in the program and marshal what I would have sent
to a slave into a file
2. run the program in a loop that unmarshals the data from the file
and repeats a short script. The loop usually ends with a crash (after
a few iterations).
The data structure being marshaled is pure OCaml (Sets and Maps of
other OCaml structures), so all C structures (wrapped with a
custom tag) are produced locally. The segfault happens if the
computations are concentrated in either one of our only two C custom
types, which were programmed independently by two of us (extremely
different computations).
If I skip the unmarshal step and run the same loop by just
reading the data from the input files, the program works flawlessly,
and valgrind, the watchpoints I have set in gdb, and the many
assertions in our C and OCaml code report nothing. I also have
checks for every array access in our C side to ensure that each
access and write occurs within bounds.
However, if the data comes from the marshaled channel, after a few
iterations the program segfaults, and the reason appears to be
(according to valgrind, and all my attempts to detect a failure as
early as possible) that some custom type is freed while still alive
on the OCaml side (what I catch is a double free, or that the
contents of a DNA sequence are invalid because it has already been
freed). Note, again, that I am completely unable to reproduce the
issue (even a single warning or assertion failure) unless I
unmarshal the data to start with. Moreover, the error occurs with two
data structures that were programmed independently by two
experienced OCaml programmers. I believe that OCaml is duplicating
the custom type, so that I get two OCaml values pointing at the
same C structure. Is that possible? Although one of the C types uses
a pool of arrays to speed up some computations, the other has only
one pointer, going from the OCaml custom type to the C structure, and
from there to a couple of arrays; that's it. Also note that every
type is treated as an immutable data structure, and we provide no
in-place modifications in our OCaml interface.
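To make the hypothesis above concrete: if two OCaml values ever did
wrap the same C structure, a finalizer that frees unconditionally
would run twice on it. The following is a minimal standalone sketch,
not the program's actual code, of one conventional guard, a reference
count. The names struct seq, seq_create, seq_retain, seq_release and
the counter seq_frees are all hypothetical; in a real stub,
seq_release would be the body of the custom type's free function.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a C-side DNA sequence. */
struct seq {
    int refs;              /* number of wrappers pointing here */
    size_t len;            /* number of bases */
    unsigned char *bases;  /* owned by this structure */
};

int seq_frees = 0;         /* counts real deallocations; for checking only */

/* Allocate a sequence owned by exactly one wrapper. NULL on failure. */
struct seq *seq_create(size_t len)
{
    struct seq *s = malloc(sizeof *s);
    if (s == NULL) return NULL;
    s->bases = calloc(len ? len : 1, 1);
    if (s->bases == NULL) { free(s); return NULL; }
    s->len = len;
    s->refs = 1;
    return s;
}

/* To be called whenever a second wrapper is made to point at s. */
struct seq *seq_retain(struct seq *s)
{
    assert(s != NULL && s->refs > 0);
    s->refs++;
    return s;
}

/* Finalizer body: drop one reference, free only on the last one. */
void seq_release(struct seq *s)
{
    assert(s != NULL && s->refs > 0);
    if (--s->refs == 0) {
        free(s->bases);
        free(s);
        seq_frees++;
    }
}
```

With this scheme, creating a second wrapper calls seq_retain, and the
structure is freed only when the last wrapper is finalized. It hides
the symptom rather than explaining how an alias could arise, so it is
better suited to testing the hypothesis than to fixing a root cause.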
Of course, I have been hunting for a bug in my C functions and can't
find anything that could cause the double free (the only way to call
seq_CAML_free is from the garbage collector!) or an out-of-bounds
write. Is there anything special about marshaling that could be
causing this? Even some particular pattern in the way OCaml allocates
memory for the unmarshaling step? Any ideas about what the problem
could be or where I should look?
As you see, I'm lost; I just don't see where else I can place a check
in our code.
For those of you who reached this line of my email, thanks for the
effort! I will listen to any ideas that could pop up in your minds.
best,
Andrés Varón
American Museum of Natural History
* Re: [Caml-list] marshal and C structures crash
From: Robert Roessler @ 2007-02-07 22:59 UTC
To: Caml-list
Andres Varon wrote:
> ...
> For those of you who reached this line of my email, thanks for the
> effort! I will listen to any ideas that could pop up in your minds.
Hey, I will read the full message just to see what someone is doing
with 70K lines of OCaml code! :)
The usual comment - you don't mention any version and platform
details... especially with something that took as long as this
probably did, those might be of interest (particularly since some
teams doing a project of this size might have not been keeping up with
OCaml releases).
It is not crystal clear that you are using "finalize" routines - if
so, they are an obvious (and easy) place to position check code. If
not, why not? It sounds like you might *need* to wrap some of your
values created in C-land in smart-but-thin OCaml objects, if for
nothing else than to more delicately handle lifetime issues.
These "popped up" for me on my initial reading. ;)
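One possible shape for the check code suggested above, as a minimal
standalone sketch: the wrapper stores a small handle instead of a raw
pointer, and the finalizer refuses to free a structure that is no
longer registered as live. Everything here (struct handle,
checked_alloc, checked_finalize, the MAX_LIVE cap) is hypothetical
and assumes a single-threaded debug build; in a real stub,
checked_finalize would be called from the finalize function of the
custom operations.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_LIVE 4096  /* hypothetical cap, sized for a debug run */

static void *live[MAX_LIVE];          /* slot -> live structure, NULL if free */
static uint32_t generation[MAX_LIVE]; /* bumped each time a slot is freed */

/* What a wrapper would store instead of a raw pointer. */
struct handle {
    uint32_t slot;
    uint32_t gen;
};

/* Allocate size bytes and register them as live.
   Returns 0 on success, -1 if malloc fails or the table is full. */
int checked_alloc(size_t size, struct handle *out)
{
    assert(out != NULL);
    for (uint32_t i = 0; i < MAX_LIVE; i++) {
        if (live[i] == NULL) {
            void *p = malloc(size);
            if (p == NULL) return -1;
            live[i] = p;
            out->slot = i;
            out->gen = generation[i];
            return 0;
        }
    }
    return -1;
}

/* Finalizer body. Returns 0 if the structure was freed, or -1 (and
   logs, without freeing anything) if the handle is stale, i.e. the
   structure it named has already been finalized. */
int checked_finalize(struct handle h)
{
    if (h.slot >= MAX_LIVE || live[h.slot] == NULL ||
        generation[h.slot] != h.gen) {
        fprintf(stderr, "finalize: stale handle (slot %u, gen %u)\n",
                (unsigned)h.slot, (unsigned)h.gen);
        return -1;
    }
    free(live[h.slot]);
    live[h.slot] = NULL;
    generation[h.slot]++;
    return 0;
}
```

A second finalization of the same structure then shows up as a logged
stale handle at the moment it happens, rather than as heap corruption
some iterations later. The generation counter also catches a stale
handle whose slot has since been reused by a newer allocation.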
Robert Roessler
roessler@rftp.com
http://www.rftp.com
* Re: [Caml-list] marshal and C structures crash
From: Andres Varon @ 2007-02-08 0:16 UTC
To: Robert Roessler; +Cc: Caml-list
On Feb 7, 2007, at 5:59 PM, Robert Roessler wrote:
> Andres Varon wrote:
>> ...
>> For those of you who reached this line of my email, thanks for the
>> effort! I will listen to any ideas that could pop up in your minds.
>
> Hey, I will read the full message just to see what someone is doing
> with 70K lines of OCaml code! :)
>
Hehe, we detect very complex combinatorial events in DNA sequences,
using different optimality criteria, over an evolutionary tree that
we are searching for. The program was at version 3 and had become
painful to maintain (8 years of many hands working on it and, most
importantly, learning OCaml on it), so it has now been rewritten from
scratch.
> The usual comment - you don't mention any version and platform
> details... especially with something that took as long as this
> probably did, those might be of interest (particularly since some
> teams doing a project of this size might have not been keeping up
> with OCaml releases).
>
I realized that afterwards! In part I didn't mention it because it's
happening consistently on every OCaml version and platform we
target:
3.08.4, 3.09.2, and 3.09.3, running on the following platforms:
Mac OS X - PPC / Intel, Linux x86, Linux AMD64, Linux EMT-64. I
truly believe it is something I am doing wrong on my C side, but for
the life of me, I don't see what it is, and I don't understand why it
shows up only in relation to successive marshals. Note that the
marshaled structure does not include any of my C types wrapped in an
OCaml abstract one. It did at the beginning (that was my first
suspect), but before switching to pure OCaml representations to try
to get rid of the problem, I even compared the output of separate
marshals of the same values multiple times, unmarshaling and
marshaling again and comparing the different repetitions, with no
errors detected.
> It is not crystal clear that you are using "finalize" routines - if
> so, they are an obvious (and easy) place to position check code.
> If not, why not? It sounds like you might *need* to wrap some of
> your values created in C-land in smart-but-thin OCaml objects, if
> for nothing else than to more delicately handle lifetime issues.
>
> These "popped up" for me on my initial reading. ;)
We malloc the C structures and store the pointer to them in a custom
type for which we provide the functions in OCaml. The registration of
the custom type (using a custom_operations structure) includes a
free function, which we provide for each type, to deallocate the
C-allocated memory when the garbage collector does its job.
AFAIK, having a pointer to an allocated C structure wrapped in a
custom type is safe, provided the C structure does not point back to
the OCaml heap, and ours don't: the pointers go in only one direction,
to the C side.
>
> Robert Roessler
> roessler@rftp.com
> http://www.rftp.com
>
>
Thanks!
Andres