Mailing list for all users of the OCaml language and system.
From: Gerd Stolpmann <info@gerd-stolpmann.de>
To: Till Varoquaux <till.varoquaux@gmail.com>
Cc: Vincent Hanquez <tab@snarc.org>, caml-list@yquem.inria.fr
Subject: Re: [Caml-list] XML library for validating MathML
Date: Thu, 18 Sep 2008 13:52:07 +0200
Message-ID: <1221738727.17456.27.camel@flake.lan.gerd-stolpmann.de>
In-Reply-To: <9d3ec8300809180212r7e3dcdf3wd13c5cff69d5034b@mail.gmail.com>


On Thursday, 18.09.2008, at 10:12 +0100, Till Varoquaux wrote:
> PXP is tough to work with and feels a bit crazy, but it is good with
> standards (it can sort out any DTD I have ever thrown at it).
> xml-light is, well, very broken (it doesn't even support charcode
> switching). There are several XML parsers in OCaml and I've had a
> stint with a few of them; the only two I would consider using are
> expat and PXP, with a marked preference for the latter. PXP can be very
> confusing and feels over-engineered at times, but it does the job. And
> remember, parsing XML is a hard job, much harder than we often give it
> credit for...
> 
> Hats off to Gerd for providing us with a proper parser.

Thanks. Initially, I thought XML was an easy format, because it looks
easy. But the specs are really challenging and full of bad compromises;
I would have expected a widely adopted standard to undergo some
evaluation of its practicability before it is published. For instance,
there are very strict rules about where whitespace is required in XML,
and where it must not occur. E.g. <tag x="a"y="b"> is illegal
because of the missing space between the attributes. The whitespace
rules make it practically impossible to use a yacc-generated parser. (My
first attempt was ocamlyacc-based, and it sort of worked after
implementing lots of parsing tricks, but it was impossible to fix all
errors, even though the XML grammar is quite short.) There are
further complications in the XML standard, and in the end it is very
difficult to implement even at the most basic level. So there are
many parsers out there that do not implement the full standard, but
only a subset, because this is easier and parsing is faster.
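
To illustrate the whitespace rule, here is a minimal hand-written sketch
(not PXP code) that checks the attribute part of a start tag against the
XML 1.0 production STag ::= '<' Name (S Attribute)* S? '>'. The function
names (attrs_ok and its helpers) are made up for this example. It assumes
ASCII-only names, does not distinguish name-start characters, and does
not look inside attribute values.

```ocaml
(* Simplified check of the attribute part of a start tag, following
   STag ::= '<' Name (S Attribute)* S? '>'  from XML 1.0.
   Assumptions: ASCII-only names, no name-start check, and attribute
   values are only scanned for their closing quote. *)

let is_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r'

let is_name_char c =
  match c with
  | 'a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '-' | '.' | ':' -> true
  | _ -> false

(* [attrs_ok s] takes the text between the element name and the closing
   '>' and returns true if every attribute is well-formed (in the
   simplified sense above) and preceded by whitespace. *)
let attrs_ok s =
  let n = String.length s in
  let rec skip_space i =
    if i < n && is_space s.[i] then skip_space (i + 1) else i in
  let rec skip_name i =
    if i < n && is_name_char s.[i] then skip_name (i + 1) else i in
  (* Parse one attribute at [i]; return the position just after it. *)
  let attribute i =
    let j = skip_name i in
    if j = i then None
    else
      let j = skip_space j in
      if j >= n || s.[j] <> '=' then None
      else
        let j = skip_space (j + 1) in
        if j >= n || (s.[j] <> '"' && s.[j] <> '\'') then None
        else
          (* Find the matching closing quote character. *)
          match String.index_from_opt s (j + 1) s.[j] with
          | Some k -> Some (k + 1)
          | None -> None
  in
  let rec loop i =
    if i = n then true
    else
      let j = skip_space i in
      if j = n then true            (* trailing optional whitespace *)
      else if j = i then false      (* attribute not preceded by space *)
      else
        match attribute j with
        | Some k -> loop k
        | None -> false
  in
  loop 0
```

With this sketch, attrs_ok " x=\"a\" y=\"b\"" holds, while
attrs_ok " x=\"a\"y=\"b\"" does not. The point is that the whitespace
between attributes is part of the grammar itself, so a tokenizer cannot
simply discard it before the parser sees the input.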

There is much more to say about the shortcomings of XML and of the XML
standardization process. It is now an unnecessarily complicated
technology. I would advise everybody to use it only when there is no way
around it, e.g. for exchange of structured data between organizations.

I now have a few hours of sponsorship for PXP. I'll try to improve the
documentation, because some parts need more explanation (those where
people feel it is over-engineered, but as Vincent pointed out, it's the
standard that demands it).

Gerd


> 
> Till
> 
> On Thu, Sep 18, 2008 at 9:38 AM, Vincent Hanquez <tab@snarc.org> wrote:
> > On Wed, Sep 17, 2008 at 11:58:05AM -0700, Dario Teixeira wrote:
> >> Given a string containing a mathematical expression in the MathML
> >> markup, I need to verify that the expression is indeed valid MathML.
> >> I am therefore looking for an XML library that can verify an expression
> >> against a given DTD.
> >>
> >> Now, I have tried Xml-light, and the code I used is listed below.
> >> Unfortunately, it fails when trying to parse MathML's DTD (it's the
> >> standard DTD from the W3C).  I have tried simpler DTDs, and it does work
> >> with them; am I therefore correct in assuming that Xml-light can only
> >> handle a particular version/subset of DTD features?
> >
> > I don't know about validation (I'd probably suggest looking at PXP, though),
> > but xml-light is very bad at XML compliance. The library (happily) parses
> > XML files that it shouldn't, which says a lot about its validation
> > abilities ...
> >
> > For example, the supported XML character range is not even checked:
> >
> > XML 1.0 specification -- 2.2 Characters
> >
> > Char       ::=          #x9 | #xA | #xD | [#x20-#xD7FF] |
> >                [#xE000-#xFFFD] | [#x10000-#x10FFFF]
> >
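
The Char production quoted above translates almost directly into a
predicate on code points. This is a minimal sketch, not code from
xml-light or PXP; the names is_xml_char and all_xml_chars are made up
for this example, and it assumes the input has already been decoded
into integer code points.

```ocaml
(* XML 1.0, section 2.2:
   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
          | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
   Assumption: [cp] is an already decoded Unicode code point. *)
let is_xml_char (cp : int) : bool =
  cp = 0x9 || cp = 0xA || cp = 0xD
  || (cp >= 0x20 && cp <= 0xD7FF)
  || (cp >= 0xE000 && cp <= 0xFFFD)
  || (cp >= 0x10000 && cp <= 0x10FFFF)

(* True if every code point in the list is allowed in an XML document. *)
let all_xml_chars (cps : int list) : bool =
  List.for_all is_xml_char cps
```

Under this predicate, NUL (0x0), most other control characters, the
surrogate range 0xD800-0xDFFF, and 0xFFFE and 0xFFFF are all rejected.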
> > Other problems include (incomplete list):
> > - complete lack of Unicode awareness
> > - funny and wrong entity handling
> >
> > --
> > Vincent
> >
> > _______________________________________________
> > Caml-list mailing list. Subscription management:
> > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> > Archives: http://caml.inria.fr
> > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> > Bug reports: http://caml.inria.fr/bin/caml-bugs
> >
> 
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------


