* SafeUnmarshal: questions/problems/timings @ 2006-08-29 23:39 Hendrik Tews 2006-08-31 10:01 ` [Caml-list] " Grégoire Henry 0 siblings, 1 reply; 4+ messages in thread From: Hendrik Tews @ 2006-08-29 23:39 UTC (permalink / raw) To: caml-list Dear all, I used the safeUnmarshal module (http://www.pps.jussieu.fr/~henry/marshal/) in order to check if some marshaled ocaml data is compatible with its type (the data is generated in a mixed C++/ocaml program). Here are my experiences: 1. I made some measurements that suggest that the unmarshal has quadratic complexity in the data size, see http://www.cs.ru.nl/~tews/marshal-plot.eps http://www.cs.ru.nl/~tews/marshal-plot-detail.eps If my (simple-minded) estimations are correct it would take more than 2 hours to check 4 MB of marshaled data (which I generate in less than 3 seconds). Is there any hope that the time complexity will improve? 2. Objective Caml version 3.09.3+dev0+ty1 # SafeUnmarshal.copy [^nativeint^] 4;; Segmentation fault 3. Would it be possible to put an english version of http://www.pps.jussieu.fr/~henry/marshal/docTy/Ty.html online? 4. Instead of type-safe unmarshal functions, I am more interested in checking if some value that has been constructed outside ocaml is type correct. Therefore I would suggest to make Check.check available in come way. I am using now let check obj ty = Check.check (Obj.repr obj) (Ty.dump ty) with type val check : 'a -> 'a tyrepr -> bool Am I right that the type parameter of tyrepr is a kind of phantom type that is mainly used to restrict the type of SafeUnmarshal.from_channel? Then I could also use val check : 'a -> 'b tyrepr -> bool ? It would be great if (as a debugging feature) this check could produce some sort of trace that helps to locate those parts that violate the given type. 5. nativeint, int32, and int64 are not supported yet (I would suggest to make the documentation a bit more precise in that point). As fix I use (in Check.check_block): | Tnativeint -> tag = Obj.custom_tag && size = 2 && ((Obj.field obj 0) == (Obj.field (Obj.repr Nativeint.zero) 0)) | Tint32 -> tag = Obj.custom_tag && size = 2 && ((Obj.field obj 0) == (Obj.field (Obj.repr Int32.zero) 0)) | Tint64 -> tag = Obj.custom_tag && size = 3 && ((Obj.field obj 0) == (Obj.field (Obj.repr Int64.zero) 0)) Any comments? On a 64 bit architecture the int64 size should be required to be 2. I would strongly suggest to replace the catch all cases | _ -> false in check.ml with some more precise code (it took me several hours of bug hunting until I found out that the above line made my unmarshal fail without even looking at the nativeints). 6. Thanks for SafeUnmarshal, it helped me a lot when checking the data created inside C++! Bye, Hendrik ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Caml-list] SafeUnmarshal: questions/problems/timings 2006-08-29 23:39 SafeUnmarshal: questions/problems/timings Hendrik Tews @ 2006-08-31 10:01 ` Grégoire Henry 2006-09-01 9:23 ` Hendrik Tews 0 siblings, 1 reply; 4+ messages in thread From: Grégoire Henry @ 2006-08-31 10:01 UTC (permalink / raw) To: Hendrik Tews; +Cc: caml-list, michel [-- Attachment #1: Type: text/plain, Size: 4845 bytes --] Hello, > 1. I made some measurements that suggest that the > unmarshal has quadratic complexity in the data size, see > > http://www.cs.ru.nl/~tews/marshal-plot.eps > http://www.cs.ru.nl/~tews/marshal-plot-detail.eps > > If my (simple-minded) estimations are correct it would take > more than 2 hours to check 4 MB of marshaled data (which I > generate in less than 3 seconds). > > Is there any hope that the time complexity will improve? If your value contains no sharing (hence no cycle), the type checking of safe-unmarshalling should be linear in the size of the data. If there is some sharing (but still no cycle), and if we assume that looking for shared type information takes constant time, then type checking should still be linear. With cycles, things are a bit worse because of the computation of strongly connected components. Therefore, if your data contains shared parts but no cycle, then it should be close to linear. Currently, mapping shared blocks to their type is represented by an association list: accessing such type information could explain this quadratic behaviour. If you are in this situation, this could be the reason and the implementation needs a more clever way of retrieving types of shared blocks. Note that a simple workaround such as using the Map module is incorrect because a memory compaction may occur and break the order used for storing blocks. With cycles, the `height' of cycles (cycles inside cycles) comes as an extra complexity factor. But in order to produce a quadratic behaviour, you need data with a very particular shape. Could you please send us more information about the shape of your data (sharing or not, cycles or not, and wether sharing increase linearly or not with the size)? > 2. Objective Caml version 3.09.3+dev0+ty1 > > # SafeUnmarshal.copy [^nativeint^] 4;; > Segmentation fault See your fifth question :) > 3. Would it be possible to put an english version of > http://www.pps.jussieu.fr/~henry/marshal/docTy/Ty.html online? Certainly. I'm joining a translated ty.mli, I will put it online too. > 4. Instead of type-safe unmarshal functions, I am more interested > in checking if some value that has been constructed outside > ocaml is type correct. Therefore I would suggest to make > Check.check available in come way. I am using now > > let check obj ty = Check.check (Obj.repr obj) (Ty.dump ty) > > with type > > val check : 'a -> 'a tyrepr -> bool > > Am I right that the type parameter of tyrepr is a kind of > phantom type that is mainly used to restrict the type of > SafeUnmarshal.from_channel? Then I could also use > > val check : 'a -> 'b tyrepr -> bool ? Yes and yes. > It would be great if (as a debugging feature) this check could > produce some sort of trace that helps to locate those parts > that violate the given type. My hope for reducing the complexity is to make some "pre-processing" on the value when the unmarshaller reallocates it in the memory. In this case, the Check.check function can't operate on "standard" values, and should not be exported. Yet I don't know if the "preprocessor" would be tightly bound to the unmarshaller or have a chance to be exported too. We can keep exporting a quadratic check for "standard" values. > 5. nativeint, int32, and int64 are not supported yet (I would > suggest to make the documentation a bit more precise in that > point). As fix I use (in Check.check_block): > > | Tnativeint -> > tag = Obj.custom_tag && size = 2 && > ((Obj.field obj 0) == (Obj.field (Obj.repr Nativeint.zero) 0)) > | Tint32 -> > tag = Obj.custom_tag && size = 2 && > ((Obj.field obj 0) == (Obj.field (Obj.repr Int32.zero) 0)) > | Tint64 -> > tag = Obj.custom_tag && size = 3 && > ((Obj.field obj 0) == (Obj.field (Obj.repr Int64.zero) 0)) > > Any comments? On a 64 bit architecture the int64 size should be > required to be 2. let custom_int64_size = 1 + 64 / Sys.word_size;; I will add those cases. > I would strongly suggest to replace the catch all cases > > | _ -> false > > in check.ml with some more precise code (it took me several > hours of bug hunting until I found out that the above line > made my unmarshal fail without even looking at the > nativeints). Something like printing the type clash (looking for ty_xx, found ty_yy)? Or pretty-printing part of the value that have been checked before failure? > 6. Thanks for SafeUnmarshal, it helped me a lot when checking the > data created inside C++! This is quite an original use for unmarshalling functions :-) Many thanks for your reports and suggestions ! -- Grégoire Henry [-- Attachment #2: ty.mli --] [-- Type: application/octet-stream, Size: 4477 bytes --] (* * * A run-time representation for OCaml types * *) (** Functions for de-constructing values of type ['a tyrepr]. *) (** {6 Type declaration} *) (** A value of type [declaration] is produced by the compiler for each type declaration in a module. This allows at runtime to ``break the abstraction barriers'' imposed by the type-checker... *) type declaration (** [name decl] returns the name of the type constructor defined by [decl]. *) val name: declaration -> string (** [fullname decl] returns the name of the type constructor defined by [decl], as well as the line and file name of this decleration. *) val fullname: declaration -> string (** [arity decl] returns the arity of the type constructor defined by [decl]. *) val arity: declaration -> int (** [is_alias decl] returns [true] if [decl] is an alias type declaration. When [decl] is abstract both on the structure and the signature, it returns [false]. *) val is_alias: declaration -> bool (** [is_datatype decl] returns [true] if [decl] is a record or sum type declaration. When [decl] is abstract both on the structure and the signature, it returns [false]. *) val is_datatype: declaration -> bool (** {6 Type expression} *) (** This is the ``untyped'' equivalent of the predefined type ['a tyrepr]. *) type expression (** The [dump] function allows to project a type expression [([^ ty ^] : ty tyrepr)] to the type [expression]. *) val dump: 'a tyrepr -> expression (** [body decl] returns a type expression describing the body of the type declaration [decl]. *) val body: declaration -> expression (** [canonical ty] resolves type aliases and returns a type expression that corresponds to a datatype declaration. Therefore, [desc (canonical ty)] - is a predefined type ; - or is an abstract type ; - or has shape [Tconstr(decl,_)] such that [is_datatype(decl)] is true. *) val canonical: expression -> expression (** Type equality modulo aliases. (It is useless to call [canonical] on [equal] arguments.) *) val equal: expression -> expression -> bool (** Pretty-printing, FIXME ... *) val print: Format.formatter -> expression -> unit (** {6 Description of type expression } *) (** Description of type expressions. *) type description = | Tunit | Tbool | Tint | Tnativeint | Tint32 | Tint64 | Tchar | Tstring | Tfloat | Texn | Tarray of expression | Tlist of expression | Toption of expression | Tlazy of expression | Ttyrepr of expression | Tformat4 of expression * expression * expression * expression | Tconstr of declaration * expression array | Ttuple of expression array | Tarrow of string * expression * expression | Tvariant of bool * (string * int * expression option) array | Tabstract | Tsum of string array * (string * expression array) array | Tdoublerecord of (string * bool) array | Trecord of (string * bool * expression) array | Targ of int | Tpoly of int * expression | Tvar of int (** [desc ty] returns a description of the type expression [ty]. It corresponds to the type expression that we can write using OCaml concrete syntax. It therefore excludes constructors such as [Tabstract], [Tsum], [Tdoublerecord] or [Trecord]. See [repr]. *) val desc: expression -> description (** [repr ty] returns the description of possible memory representations for the type expression [ty]. By contrast with [desc ty], it can be [Tabstract], [Tsum], [Tdoublerecord] ou [Trecord], but not [Tconstr]. *) val repr: expression -> description (** {6 Safe unmarshalling} *) (** The following functions are specific to safe unmarshalling. They operate on implicitly universally quantified type expression. The constant [Tvar i] represents the constant {i Top} of the (currently only in French) formalization. *) (** [cut_head_quantification ty] removes hypothetical headings [Tpoly _] from [ty]. *) val cut_head_quantification: expression -> expression (** [instance ty1 ty2] returns true when [ty1] is a polymorphic instance [ty2]. *) val instance: expression -> expression -> bool (** Memoryless anti-unification (also sometimes called weak anti-unification). We use [Tvar (-1)] for resolving conflicts. *) val antiunif: expression -> expression -> expression (** [is_monomorphic ty] returns [true] when [ty] contains no occurrence of [Tvar (-1)]. *) val is_monomorphic: expression -> bool [-- Attachment #3: Type: text/plain, Size: 1 bytes --] ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Caml-list] SafeUnmarshal: questions/problems/timings 2006-08-31 10:01 ` [Caml-list] " Grégoire Henry @ 2006-09-01 9:23 ` Hendrik Tews 0 siblings, 0 replies; 4+ messages in thread From: Hendrik Tews @ 2006-09-01 9:23 UTC (permalink / raw) To: caml-list Grégoire Henry <gregoire.henry@pps.jussieu.fr> writes: Hendrik wrote: > 1. I made some measurements that suggest that the > unmarshal has quadratic complexity in the data size, see Therefore, if your data contains shared parts but no cycle, then it should be close to linear. Currently, mapping shared blocks to their type is represented by an association list: accessing such type information could explain this quadratic behaviour. That would explain something: An association list hardly scales to several megabyte of data. Could you please send us more information about the shape of your data (sharing or not, cycles or not, and wether sharing increase linearly or not with the size)? The data contains sharing and cycles. I have no idea about the height of the cycles. It might well be the case that the amount of shared data grows faster than the overal size. But almost all shared data has type string * int * int. The data are abstract syntax trees of C++ programs (produced by a modified elsa/elkhound parser). Its type is defined with about 40 recursive variant and record types. Its monomorphic, no aliasing or other fancy features. I would say the type is bigger, but not more complex than "int list". I would be really surprised if the stronly connected components are necessary for my application. I am working on a release of the modified elsa parser (hopefully early next week). Then you can examine the data and its type yourself. Maybe you can follow the approach of the Marshal module, where the user can choose (via the flags) between different algorithms? > 2. Objective Caml version 3.09.3+dev0+ty1 > > # SafeUnmarshal.copy [^nativeint^] 4;; > Segmentation fault See your fifth question :) I don't quite understand. My point is that even if nativeint is not supported yet, it should not end in a segmentation fault, should it? My hope for reducing the complexity is to make some "pre-processing" on the value when the unmarshaller reallocates it in the memory. In this case, the Check.check function can't operate on "standard" values, and should not be exported. Yet I don't know if the "preprocessor" would be tightly bound to the unmarshaller or have a chance to be exported too. We can keep exporting a quadratic check for "standard" values. My application scenario is as follows: I have this big C++ program (a C++ parser) that eventually generates an ocaml value (the abstract syntax tree of a C++ program, possibly > 10MB). I would then like to enter the clean world of ocaml to manipulate/process those abstract syntax trees. If everything goes wrong I would first like to check the integrety of the abstract syntax tree. If the integrety check fails I need to know which node and which of its fields contain the bad data (otherwise I can't fix the problem). To locate the bad data I currently invoke check on the top node, and if it fails I call it on the child nodes and descend into one for which the check fails. That is, I invoke check very often, but always on the same unmodified data. It would be nice if the interface of SafeUnmarshal permits this kind of use. What's necessary is - either a fast unmarshal function with the ability to output some error trace if requested - or a check function that is fast as long as one works on the same unmodified data. > I would strongly suggest to replace the catch all cases > > | _ -> false > > in check.ml with some more precise code (it took me several > hours of bug hunting until I found out that the above line > made my unmarshal fail without even looking at the > nativeints). Something like printing the type clash (looking for ty_xx, found ty_yy)? Or pretty-printing part of the value that have been checked before failure? No, we probably have some misunderstanding here. I meant the following: Instead of | _ -> false enumerate all possible constructors and carefully distinguish those cases for check should return false from those cases that have not been implemented yet and which should therefore trigger an assertion. (Because: your original code delivered false on a nativeint. If it delivered an assertion instead I would have immediately known that nativeints are not supported. Instead I spent several hours to search for the inconsistency in my data.) I tell you how to reproduce those unmarshals that run for hours when I made the release of the modified elsa parser. I would be very much interested in trying a new version of SafeUnmarshal, especially one where the association list problem has been fixed (I still have a problem in one ast of 4.7MB, it's only that check runs from more than 2 hours on it, so I can't track it) Bye, Hendrik PS. I assume that Michel Mouny and Grégoire Henry are subscribed to the caml list, so I don't cc them. ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Caml-list] SafeUnmarshal: questions/problems/timings @ 2006-09-14 13:07 Hendrik Tews 0 siblings, 0 replies; 4+ messages in thread From: Hendrik Tews @ 2006-09-14 13:07 UTC (permalink / raw) To: caml-list Here is the promised followup with more details on the slow safe unmarshalling. On http://www.cs.ru.nl/~tews/nsUnicodeToTeXCMRt1.i.oast you can download 281 KB of marshalled data. On my machine it takes 23 seconds to check with native code. The data is of type annotated translationUnit_type = annotated * annotated topForm_type list You can test it with the following piece of code: open Cc_ast_gen_type open Ast_annotation let file = "/home/tews/src/elsa/elsa/in/big/nsUnicodeToTeXCMRt1.i.oast" ;; try SafeUnmarshal.from_channel [^ annotated translationUnit_type ^] (open_in file); print_endline "OK" with | _ -> print_endline "FAIL" compile with ocamlopt.opt safeUnmarshal.cmxa ast_annotation.ml elsa_util.ml \ ml_ctype.ml cc_ml_types.ml cc_ast_gen_type.ml justunmarshal.ml The additional files are from Olmar, get them here: http://www.sos.cs.ru.nl/cgi-bin/~tews/olmar/viewvc-patch.cgi/elsa/elsa/ast_annotation.ml?revision=olmar-release-2006-09-07 http://www.sos.cs.ru.nl/cgi-bin/~tews/olmar/viewvc-patch.cgi/elsa/elsa/elsa_util.ml?revision=olmar-release-2006-09-07 http://www.sos.cs.ru.nl/cgi-bin/~tews/olmar/viewvc-patch.cgi/elsa/elsa/ml_ctype.ml?revision=olmar-release-2006-09-07 http://www.sos.cs.ru.nl/cgi-bin/~tews/olmar/viewvc-patch.cgi/elsa/elsa/cc_ml_types.ml?revision=olmar-release-2006-09-07 http://www.sos.cs.ru.nl/cgi-bin/~tews/olmar/viewvc-patch.cgi/elsa/elsa/cc_ast_gen_type.ml?revision=olmar-release-2006-09-07 To produce graphs like http://www.cs.ru.nl/~tews/marshal-plot.eps you need to download Olmar, compile it and then - ./regrtest -ocaml in subdir elsa - ./regtest-oast | grep time >data in subdir asttools - gnuplot plot where the plot file is something like reset set grid unset mouse set terminal x11 persist #set terminal postscript enhanced color set xlabel "size (Bytes)" set ylabel "time (s)" set key left z = 2.1 plot "data" using 6:3 title "SafeUnmarshal user time", \ (0.000017088 * x) ** 2.1 Bye, Hendrik ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2006-09-14 13:06 UTC | newest] Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2006-08-29 23:39 SafeUnmarshal: questions/problems/timings Hendrik Tews 2006-08-31 10:01 ` [Caml-list] " Grégoire Henry 2006-09-01 9:23 ` Hendrik Tews 2006-09-14 13:07 Hendrik Tews
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox