* [Caml-list] [ANN] Uutf 0.9.0 and Jsonm 0.9.0
From: Daniel Bünzli @ 2012-05-05 14:48 UTC
To: caml-list, caml-hump
Hello,
I'd like to announce the following two modules. First Uutf:
Uutf is a non-blocking streaming codec to decode and encode the UTF-8,
UTF-16, UTF-16LE and UTF-16BE encoding schemes. It can efficiently
work character by character without blocking on IO. Decoders perform
character position tracking and support newline normalization.
Functions are also provided to fold over the characters of UTF encoded
OCaml string values and to directly encode characters in OCaml Buffer.t
values.
Uutf is made of a single, independent, module and distributed under
the BSD3 license.
Project home page: http://erratique.ch/software/uutf
API doc & examples: http://erratique.ch/software/uutf/doc/Uutf
The aim of Uutf is to provide a convenient abstraction for
non-blocking streaming Unicode text processing and to implement
non-blocking LL(k) parsers over Unicode text. It's used by Jsonm and
will certainly be used by Xmlm in the future.
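To make "folding over the characters of a UTF encoded string" and "encoding characters in a Buffer.t" concrete, here is a minimal sketch that uses only the OCaml standard library. It is not Uutf's API: the names `fold_utf_8`, `sanitize` and `count_malformed` are hypothetical, and the code assumes OCaml 4.14 or later for `String.get_utf_8_uchar`. See the Uutf API doc for the real functions.

```ocaml
(* Stdlib-only illustration (assumes OCaml >= 4.14); NOT Uutf's API. *)

(* Fold [f] over the Unicode scalar values of the UTF-8 string [s].
   A malformed byte sequence is reported as [`Malformed bytes]. *)
let fold_utf_8 f acc s =
  let len = String.length s in
  let rec loop acc i =
    if i >= len then acc else
    let d = String.get_utf_8_uchar s i in
    let n = Uchar.utf_decode_length d in
    let v =
      if Uchar.utf_decode_is_valid d
      then `Uchar (Uchar.utf_decode_uchar d)
      else `Malformed (String.sub s i n)
    in
    loop (f acc v) (i + n)
  in
  loop acc 0

(* Re-encode [s] as UTF-8 into a Buffer.t, substituting the replacement
   character U+FFFD for every malformed sequence. *)
let sanitize s =
  let b = Buffer.create (String.length s) in
  let add () = function
    | `Uchar u -> Buffer.add_utf_8_uchar b u
    | `Malformed _ -> Buffer.add_utf_8_uchar b Uchar.rep
  in
  fold_utf_8 add () s;
  Buffer.contents b

(* Count the malformed sequences found in [s]. *)
let count_malformed s =
  fold_utf_8 (fun n -> function `Malformed _ -> n + 1 | `Uchar _ -> n) 0 s
```

Reporting malformed input as a distinct case, rather than dropping or passing it through silently, leaves the decision (fail, count, or replace by U+FFFD) to the caller.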
The second module is Jsonm:
Jsonm is a non-blocking streaming codec to decode and encode the JSON
data format. It can process JSON text without blocking on IO and
without a complete in-memory representation of the data.
The alternative "uncut" codec also processes whitespace and
(non-standard) JSON with JavaScript comments.
Jsonm is made of a single module and depends on [Uutf]. It is distributed
under the BSD3 license.
Project home page: http://erratique.ch/software/jsonm
API doc & examples: http://erratique.ch/software/jsonm/doc/Jsonm
Basically Jsonm is to JSON what Xmlm is to XML. It's a rather
low-level approach where you work with streams of structural lexemes
which reflect the data model underlying the data language. The
sequence of lexemes is guaranteed to be presented to you according to
a simple grammar or errors are returned. This lets you
consume/produce the data without having the whole data in memory while
abstracting over the idiosyncrasies of the data language. I also hope
it can serve as a basis to define efficient data query combinators.
Jsonm's design is however more convenient than Xmlm's: Jsonm has
precise lexeme position tracking support, best-effort decoding that
lets you continue after an error, a trivial input termination condition
(just decode `End, whereas in Xmlm you have to count), and access to
whitespace so that you can write data filters that preserve as much of
the original data as possible (P.S. I hope to eventually find time to
fix all these defects in an incompatible release of Xmlm).
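The idea of consuming a stream of structural lexemes until `End, without building a tree, can be sketched as follows. This is a self-contained model, not code written against Jsonm itself: the `lexeme` type is a local stand-in whose constructor names are modelled on the lexemes listed in the Jsonm API doc (treat the exact names as an assumption), and `source`, `of_list` and `sum_and_depth` are hypothetical helpers.

```ocaml
(* Local stand-in for a JSON lexeme; modelled on, but not taken from,
   the Jsonm library. *)
type lexeme =
  [ `Null | `Bool of bool | `String of string | `Float of float
  | `Name of string   (* object member name *)
  | `As | `Ae         (* array start / array end *)
  | `Os | `Oe ]       (* object start / object end *)

(* Hypothetical pull-style source: each call yields the next lexeme,
   or `End once the input is exhausted. *)
type source = unit -> [ `Lexeme of lexeme | `End ]

(* Build a source from an in-memory list, for demonstration only. *)
let of_list (ls : lexeme list) : source =
  let rest = ref ls in
  fun () -> match !rest with
    | [] -> `End
    | l :: tl -> rest := tl; `Lexeme l

(* Sum every number and record the maximal nesting depth in one pass.
   Only three scalars of state are kept; no JSON tree is built. *)
let sum_and_depth (next : source) =
  let rec loop sum depth max_depth =
    match next () with
    | `End -> (sum, max_depth)
    | `Lexeme (`Float f) -> loop (sum +. f) depth max_depth
    | `Lexeme (`As | `Os) ->
        let depth = depth + 1 in
        loop sum depth (max depth max_depth)
    | `Lexeme (`Ae | `Oe) -> loop sum (depth - 1) max_depth
    | `Lexeme (`Null | `Bool _ | `String _ | `Name _) ->
        loop sum depth max_depth
  in
  loop 0. 0 0
```

For instance, the lexemes of `{"a": [1, 2.5], "b": 3}` would be fed as ``of_list [`Os; `Name "a"; `As; `Float 1.; `Float 2.5; `Ae; `Name "b"; `Float 3.; `Oe]``. Termination is simply the `End case, with no counting of open elements required.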
If you want to install these modules via odb here are the lines you can
add to your odb package file:
http://erratique.ch/software/odb-packages.txt
Feedback is welcome,
Daniel
P.S. Since the question will likely be asked, here's how I think Jsonm
compares to Yojson. Martin may want to chime in to correct me or offer
a different perspective as I'm certainly biased.
* Jsonm depends on Uutf. Yojson depends on ocamllex, cppo, easy-format and
biniou.
* Jsonm inputs UTF-8, UTF-16, UTF-16LE and UTF-16BE and outputs
UTF-8 encoded JSON. Yojson inputs and outputs UTF-8 encoded
JSON.
* Jsonm reports character stream decoding errors and lets you bypass
them by replacing the invalid bytes with the Unicode replacement
character U+FFFD. Yojson, apparently by design, silently inputs
invalid UTF-8 byte sequences. I consider this to be a security risk
(or at least a wrong security default).
* Jsonm mostly sticks to the standard (with the exception of comments if
you use the uncut codec). Yojson extends the standard in various
ways to support the serialisation of OCaml values. It also
supports the input of JavaScript comments but discards them.
* Jsonm uses only OCaml floats for JSON numbers. This limits the
roundtrip of integers to those that are exactly representable in
this datatype, i.e. the range [-2^53;2^53]. Yojson returns the
integer string literal if the int is greater than [max_int]. Note
however that Jsonm's behaviour is equivalent to the one you have in
JavaScript (and hence in all browsers) so it would anyway be
ill-advised for JSON producers to go beyond this limit.
* Jsonm offers no generic tree-like JSON representation (see the examples in
the doc to see how to build one). Yojson offers many different generic
tree-like representations.
* Jsonm has a non-blocking IO interface. To the best of my knowledge
Yojson doesn't support that.
* Jsonm has a streaming IO interface. To the best of my knowledge
there's an undocumented very low-level streaming input interface in
Yojson but this bears no resemblance to Jsonm's notion of a
streaming interface. There's also an undocumented streaming output
interface but it doesn't seem that you can output an object or an
array without first building an in-memory JSON representation of it.
* Jsonm can perform best-effort decoding, i.e. continue to parse after
an error. To the best of my knowledge Yojson cannot do that.
* Performance. I'm always reluctant to make performance claims in
abstract settings; it all depends on the context. If you like
unscientific benchmarks you can try to test performance between
`ydump` and `jsontrip` which both recode JSON text. Bear in mind
however that the results are highly data dependent and that
internally both programs don't do the same thing, `jsontrip` does
not build a generic in-memory representation of the JSON text. In
my tests on random data `jsontrip` takes between 1.25 and 2.1 times
as long as `ydump`. The upper bound occurs when random numbers
are only integers which `ydump` doesn't parse as floats. On real
GeoJSON data these numbers are between 1.38 and 1.46. But on this
data, processing a 325 MB file, the resident memory used by `ydump`
grows up to 1.2 GB while that of Jsonm's streaming interface, albeit
slower, remains constant at only 3.8 MB. Your mileage may vary.
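The [-2^53;2^53] limit mentioned in the list above for float-only JSON numbers can be checked directly with the standard library. This is a small sketch that assumes a 64-bit OCaml (63-bit native ints); the names `two_53` and `roundtrips` are introduced here for illustration.

```ocaml
(* Assumes a 64-bit OCaml, where native ints are 63 bits wide. *)

(* 2^53: up to this magnitude every integer is exactly representable
   as an IEEE 754 double, i.e. as an OCaml float. *)
let two_53 = 1 lsl 53

(* An integer survives a trip through a float iff converting it back
   yields the same integer. *)
let roundtrips (i : int) = int_of_float (float_of_int i) = i

(* What a float-based decoder sees for the literal 2^53 + 1: the value
   is rounded to the nearest representable double, which is 2^53. *)
let decoded = float_of_string "9007199254740993"
```

Here `roundtrips two_53` holds while `roundtrips (two_53 + 1)` does not, and `decoded` equals `9007199254740992.`. Beyond the limit only some integers (the even ones, at first) survive, so a producer cannot rely on exact integers there.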