xmlm and names(paces)

Mailing list for all users of the OCaml language and system.

* xmlm and names(paces)
@ 2008-02-06 20:44 Bünzli Daniel
  2008-02-06 20:59 ` [Caml-list] " David Teller
  2008-02-06 21:52 ` Alain Frisch
  0 siblings, 2 replies; 8+ messages in thread
From: Bünzli Daniel @ 2008-02-06 20:44 UTC
  To: caml-list List

Hello,

As I previously said on this list I'm adding better namespace support  
to xmlm. Up to now xmlm just parsed qualified names into their prefix  
and local part (prefix, local). Now I'd like to provide the client  
with expanded names (uri, local).

Initially I planned to give the client choice between getting  
qualified names or expanded names. However the prefix of qualified  
names is really meaningless (it can be alpha converted) and thus  
cannot be used to recognize anything in a document. One of the aims of
xmlm is simplicity; as such, I think xmlm should only provide expanded
names.
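
The difference between the two kinds of names can be sketched as follows. This is only an illustration: `split_qname`, `expand` and the association-list environment are hypothetical helpers, not xmlm's interface.

```ocaml
(* Hypothetical helper types, not xmlm's actual API. *)
type qname = string * string      (* (prefix, local); prefix "" if none *)
type expanded = string * string   (* (uri, local); uri "" if no namespace *)

(* Split "p:local" at the first colon; no colon means an empty prefix. *)
let split_qname (s : string) : qname =
  match String.index_opt s ':' with
  | None -> ("", s)
  | Some i -> (String.sub s 0 i, String.sub s (i + 1) (String.length s - i - 1))

(* [env] lists the (prefix, uri) bindings in scope, innermost first.
   A binding for "" plays the role of the default namespace, which applies
   to element names only. An unbound non-empty prefix yields None. *)
let expand (env : (string * string) list) ((prefix, local) : qname)
  : expanded option =
  match List.assoc_opt prefix env with
  | Some uri -> Some (uri, local)
  | None -> if prefix = "" then Some ("", local) else None

let () =
  let ns = "http://www.w3.org/1999/xhtml" in
  (* Two documents choosing different prefixes yield the same expanded name. *)
  assert (expand [ ("html", ns) ] (split_qname "html:p") = Some (ns, "p"));
  assert (expand [ ("h", ns) ] (split_qname "h:p") = Some (ns, "p"));
  assert (expand [] (split_qname "p") = Some ("", "p"));
  assert (expand [] (split_qname "x:y") = None)
```

Only the expanded name is stable across documents, which is the argument for exposing it rather than the prefix.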

However, maybe I'm missing something, so I'd like to ask the list
whether anyone sees a use for clients getting qualified names. If
you do, please tell me.

Best,

Daniel

P.S. There is no distinction between qualified and expanded names if
you parse documents that have no prefixes and no default namespace  
declarations.

P.P.S. Name expansion has a performance cost but if I support only  
expanded names I can better reduce it.


* Re: [Caml-list] xmlm and names(paces)
  2008-02-06 20:44 xmlm and names(paces) Bünzli Daniel
@ 2008-02-06 20:59 ` David Teller
  2008-02-06 21:26   ` Bünzli Daniel
  2008-02-06 21:52 ` Alain Frisch
  1 sibling, 1 reply; 8+ messages in thread
From: David Teller @ 2008-02-06 20:59 UTC
  To: Bünzli Daniel; +Cc: caml-list List

As far as I know, the only difference is when you try and produce
human-readable XML documents. In this case, there are often ad-hoc
conventions regarding which prefix maps to what namespace (e.g. for the
same namespace, xhtml: is more readable than, say, ns1:) -- which might
be useful for people writing, say, editors.

There might also be a difference when browsers are attempting to recover
broken xml, but that's probably not an issue here.

Cheers,
 David

On Wed, 2008-02-06 at 21:44 +0100, Bünzli Daniel wrote:
> However maybe I'm missing something so I'd like to ask the list if
> someone thinks there is any use for clients to get qualified names? If
> you do please tell me.

-- 
David Teller
 Security of Distributed Systems
  http://www.univ-orleans.fr/lifo/Members/David.Teller
 Angry researcher: French Universities need reforms, but the LRU act
brings liquidations. 


* Re: [Caml-list] xmlm and names(paces)
  2008-02-06 20:59 ` [Caml-list] " David Teller
@ 2008-02-06 21:26   ` Bünzli Daniel
  0 siblings, 0 replies; 8+ messages in thread
From: Bünzli Daniel @ 2008-02-06 21:26 UTC
  To: caml-list List

On 6 Feb 2008, at 21:59, David Teller wrote:

> As far as I know, the only difference is when you try and produce
> human-readable XML documents. In this case, there are often ad-hoc
> conventions regarding which prefix maps to what namespace (e.g. for
> the same namespace, xhtml: is more readable than, say, ns1:) -- which
> might be useful for people writing, say, editors.

As a side note, xmlm's handling of namespaces is going to keep the
attributes of the xmlns namespace instead of removing them. This lets
the client know which prefix bindings were made in the document. Given
how output is going to work, chaining xmlm's input with its output
will give you the same prefixes that were present on input (unless the
same namespace is bound to two different prefixes in a given context,
in which case there may be differences).

Best,

Daniel


* Re: [Caml-list] xmlm and names(paces)
  2008-02-06 20:44 xmlm and names(paces) Bünzli Daniel
  2008-02-06 20:59 ` [Caml-list] " David Teller
@ 2008-02-06 21:52 ` Alain Frisch
  2008-02-06 22:56   ` Bünzli Daniel
  1 sibling, 1 reply; 8+ messages in thread
From: Alain Frisch @ 2008-02-06 21:52 UTC
  To: Bünzli Daniel; +Cc: caml-list List

Bünzli Daniel wrote:
> As I previously said on this list I'm adding better namespace support to 
> xmlm. Up to now xmlm just parsed qualified names into their prefix and 
> local part (prefix, local). Now I'd like to provide the client with 
> expanded names (uri, local).
> 
> Initially I planned to give the client choice between getting qualified 
> names or expanded names. However the prefix of qualified names is really 
> meaningless (it can be alpha converted) and thus cannot be used to 
> recognize anything in a document. One of the aims of xmlm is simplicity;
> as such, I think xmlm should only provide expanded names.

The problem with expanded names is that it makes it quite tedious to 
pattern-match on element/attribute names (uri are long!). Of course, it 
is a trivial exercise in Camlp4 to create a nice syntax for that.

Another option is to let the client provide a mapping from uri to fixed 
prefixes. (PXP can do that kind of prefix normalization.)
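
A rough sketch of such a normalization, written on the client side. The dictionary contents and the names `dict`, `normalize` and `describe` are made up for illustration; this is not how PXP exposes the feature.

```ocaml
(* Client-supplied dictionary from namespace URI to a fixed short prefix.
   The entries are illustrative. *)
let dict = [
  ("http://www.w3.org/1999/xhtml", "xhtml");
  ("http://www.w3.org/2000/svg", "svg");
]

(* Replace a known URI with its fixed prefix; keep unknown URIs as they are.
   A real implementation would have to keep the two cases apart, since an
   unknown URI could in principle equal one of the short prefixes. *)
let normalize (uri, local) =
  match List.assoc_opt uri dict with
  | Some p -> (p, local)
  | None -> (uri, local)

(* Patterns can now mention short, stable prefixes instead of full URIs. *)
let describe name =
  match normalize name with
  | ("xhtml", "p") -> "paragraph"
  | ("svg", "circle") -> "circle"
  | (_, local) -> "other: " ^ local

let () =
  assert (describe ("http://www.w3.org/1999/xhtml", "p") = "paragraph");
  assert (describe ("urn:example", "q") = "other: q")
```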

It is also a good idea to be able to parse XML documents that conform
to the XML spec but not the XML Namespaces spec.

What about something like that:

type name = string * [`N of string * string|`U of string * string|`X]

The first component of name gives the full unparsed name from the XML
document. The second component gives the qname decomposition; it can be
either a known normalized prefix (relative to a dictionary provided by
the client) or an unknown URI. Or it can be an error (the document does
not conform to the XML Namespaces spec). If the client does not provide
any dictionary of known prefixes, there will be no `N node. If the
parser is run in a non-namespace mode, there will be only `X nodes.

examples:
   ("html:p", `N ("xhtml", "p"))
         the prefix html refers to the known xhtml namespace

   ("foo:x", `U ("http://unknownnamespaceuri", "x"))
         the prefix foo refers to an unknown uri

   ("x:y", `X)
         the prefix x is not bound to any namespace

   ("x::z", `X)
         name is ill-formed w.r.t. the XML Namespaces spec.
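
To make the four examples concrete, here is a sketch of a function producing such values. The function names, the two association lists, and the choice to report an unprefixed name without a default namespace as `U with an empty URI are assumptions of this sketch.

```ocaml
type name = string * [ `N of string * string | `U of string * string | `X ]

(* A QName has at most one colon and no empty part. Only colon placement is
   checked here, not the full NCName character rules. *)
let split raw =
  match String.split_on_char ':' raw with
  | [ local ] when local <> "" -> Some ("", local)
  | [ prefix; local ] when prefix <> "" && local <> "" -> Some (prefix, local)
  | _ -> None

(* [bindings]: prefix -> uri, as declared in the document.
   [known]: uri -> normalized prefix, supplied by the client. *)
let classify ~bindings ~known raw : name =
  match split raw with
  | None -> (raw, `X)                           (* ill-formed QName *)
  | Some (prefix, local) ->
    (match List.assoc_opt prefix bindings with
     | None when prefix = "" -> (raw, `U ("", local))  (* no namespace *)
     | None -> (raw, `X)                        (* unbound prefix *)
     | Some uri ->
       (match List.assoc_opt uri known with
        | Some p -> (raw, `N (p, local))
        | None -> (raw, `U (uri, local))))

let () =
  let bindings = [ ("html", "http://www.w3.org/1999/xhtml");
                   ("foo", "http://unknownnamespaceuri") ] in
  let known = [ ("http://www.w3.org/1999/xhtml", "xhtml") ] in
  let c = classify ~bindings ~known in
  assert (c "html:p" = ("html:p", `N ("xhtml", "p")));
  assert (c "foo:x" = ("foo:x", `U ("http://unknownnamespaceuri", "x")));
  assert (c "x:y" = ("x:y", `X));
  assert (c "x::z" = ("x::z", `X))
```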

Also, it is necessary to give the client a way to know the namespace 
bindings in scope at any node. Some XML languages like XML-Schema need 
this information. A possible way to do it is just to keep the xmlns 
declarations as regular attributes.
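
If the declarations are kept as attributes, a client can rebuild the bindings in scope itself. The sketch below assumes that attributes arrive with expanded names, that `xmlns:p="..."` is reported as `(ns_xmlns, "p")`, and that the default declaration `xmlns="..."` is reported as `(ns_xmlns, "xmlns")`. Those conventions and the name `push_bindings` are assumptions, not something settled in this thread.

```ocaml
(* The URI that the Namespaces recommendation reserves for the xmlns prefix. *)
let ns_xmlns = "http://www.w3.org/2000/xmlns/"

type attribute = (string * string) * string   (* (uri, local), value *)

(* Extend the bindings in scope with the declarations found on one start tag.
   New bindings are consed in front so that they shadow outer ones when looked
   up with List.assoc_opt. *)
let push_bindings (scope : (string * string) list) (atts : attribute list) =
  List.fold_left
    (fun acc ((uri, local), value) ->
       if uri <> ns_xmlns then acc
       else if local = "xmlns" then ("", value) :: acc   (* default namespace *)
       else (local, value) :: acc)
    scope atts

let () =
  let xs = "http://www.w3.org/2001/XMLSchema" in
  let atts = [ ((ns_xmlns, "xsd"), xs); (("", "name"), "comment") ] in
  assert (push_bindings [] atts = [ ("xsd", xs) ])
```

A client would call this in its start-tag callback and restore the previous scope in the matching end-tag callback.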

As a minor alternative, in order to reduce the syntactic overhead, it is 
possible to use a single string to encode the three possible cases. E.g.:

   "p:xhtml"  Known uri
   "x::http://unknownnamespaceuri" Unknown uri
   "" Ill-formed qname

-- Alain


* Re: [Caml-list] xmlm and names(paces)
  2008-02-06 21:52 ` Alain Frisch
@ 2008-02-06 22:56   ` Bünzli Daniel
  2008-02-06 23:51     ` Bünzli Daniel
  2008-02-07 22:03     ` Alain Frisch
  0 siblings, 2 replies; 8+ messages in thread
From: Bünzli Daniel @ 2008-02-06 22:56 UTC
  To: Alain Frisch; +Cc: caml-list List

On 6 Feb 2008, at 22:52, Alain Frisch wrote:

> The problem with expanded names is that it makes it quite tedious to  
> pattern-match on element/attribute names (uri are long!).

Agreed.

> Another option is to let the client provide a mapping from uri to  
> fixed prefixes. (PXP can do that kind of prefix normalization.)

In xmlm you can do that yourself on input, when your callbacks are
invoked, and do the reverse translation just before outputting start
tags. Of course this means more work for the client, but it makes the
basic interface simpler and it allows the client to use variants
instead of merely shorter prefixes.
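
For instance, a client could translate expanded names to its own variant type on input and back on output. The type `el` and the functions `of_name` and `to_name` below are purely illustrative.

```ocaml
let xhtml = "http://www.w3.org/1999/xhtml"

(* Precise cases for the names the client cares about; the catch-all case
   keeps the expanded name so that nothing is lost on output. *)
type el = P | Div | Other of string * string

(* Input direction: expanded name -> variant. *)
let of_name = function
  | (ns, "p") when ns = xhtml -> P
  | (ns, "div") when ns = xhtml -> Div
  | name -> Other name

(* Output direction: variant -> expanded name. *)
let to_name = function
  | P -> (xhtml, "p")
  | Div -> (xhtml, "div")
  | Other name -> name

let () =
  assert (of_name (xhtml, "p") = P);
  assert (of_name ("urn:example", "p") = Other ("urn:example", "p"));
  (* The two translations are inverses of each other. *)
  assert (to_name (of_name (xhtml, "div")) = (xhtml, "div"))
```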

> It is also a good idea to be able to parse XML documents that conform
> to the XML spec but not the XML Namespaces spec.

But don't you automatically get that?

A document that has no xmlns namespace declarations and no prefixes
will, if parsed according to the xmlns spec, result in names with empty
namespace names.

The other problem I see is external prefix declarations, but for
that, as I did for external entity references, I have a callback that
allows you to bind an undeclared prefix to a URI.

> What about something like that:
>
> type name = string * [`N of string * string|`U of string * string|`X]
[...]

Too heavyweight for my taste. With xmlm I try to give a reasonable
default for XML IO, not the full-blown complexity. So I think going
with qualified names only is ok; the client can transform them its own
way if it wishes (e.g. uri replacement).

> Also, it is necessary to give the client a way to know the namespace  
> bindings in scope at any node. Some XML languages like XML-Schema  
> need this information. A possible way to do it is just to keep the  
> xmlns declarations as regular attributes.

I was planning on keeping the xmlns declarations, but I did not know
that some languages actually *need* this information. What is it used
for in xml-schema?

Thanks for the comments,

Daniel


* Re: [Caml-list] xmlm and names(paces)
  2008-02-06 22:56   ` Bünzli Daniel
@ 2008-02-06 23:51     ` Bünzli Daniel
  2008-02-07 22:03     ` Alain Frisch
  1 sibling, 0 replies; 8+ messages in thread
From: Bünzli Daniel @ 2008-02-06 23:51 UTC
  To: Alain Frisch; +Cc: caml-list List


On 6 Feb 2008, at 23:56, Bünzli Daniel wrote:

>  Of course this means more work for the client, but it makes the  
> basic interface simpler and it allows the client
> to use variants instead of simply shorter prefixes.

Now that I think of it, instead of what I have now:

> type name = string * string
> type attribute = name * string
> type tag = name * attribute list
>
> val input : ?enc:encoding option -> ?strip:bool ->
>    ?ns: (string -> string option) ->
>    ?entity: (string -> string option) ->
>    ?prolog: (dtd -> unit) ->
>    ?prune:(tag -> 'a -> bool) ->
>    ?s:(tag -> 'a -> 'a) ->	
>    ?e:(tag -> 'a -> 'a) ->
>    ?d:(string -> 'a -> 'a)  -> 'a -> input ->
>      [ `Value of 'a | `Error of (int * int) * error ]

(~s is for start tag, ~e is for end tag (the full start tag is given  
again), ~d is for data)

Why not have a callback ~name, which is given the expanded name
(uri, local) and lets the client do whatever it wishes, and
a callback ~att to build attributes:

> val input : ?enc:encoding option -> ?strip:bool ->
>    ?ns: (string -> string option) ->
>    ?entity: (string -> string option) ->
>    name: (string -> string -> 'n) ->
>    att: ('n -> string -> 'att) ->
>    ?prolog: (dtd -> unit) ->
>    ?prune:('n * 'att list -> 'a -> bool) ->
>    ?s:('n * 'att list -> 'a -> 'a) ->	
>    ?e:('n * 'att list -> 'a -> 'a) ->
>    ?d:(string -> 'a -> 'a)  -> 'a -> input ->
>      [ `Value of 'a | `Error of (int * int) * error ]


This allows you to give precise variant cases for the things you
process and have a catch-all case for what you are not interested in.
Output would be made polymorphic accordingly, with the client providing
inverses of ~name ('n -> string * string) and ~att ('att -> 'n *
string).

On the other hand the work performed by ~name and ~att can be done by  
the client in ~s or ~e. The only thing the latter solution brings is  
to avoid folding over the tag type if you want to transform it. So  
finally I don't think it is worth it (especially because the  
polymorphised output feels cumbersome).
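
The fold over the tag type that the client would write itself is short. The sketch below reuses the `name`, `attribute` and `tag` types quoted above; `map_tag` is a hypothetical client-side helper, not part of xmlm.

```ocaml
type name = string * string
type attribute = name * string
type tag = name * attribute list

(* Apply the client's [name] and [att] builders to a whole tag. This is the
   work that dedicated ~name and ~att callbacks would otherwise do inside
   the parser. *)
let map_tag (name : name -> 'n) (att : 'n -> string -> 'att)
    ((n, atts) : tag) : 'n * 'att list =
  (name n, List.map (fun (an, v) -> att (name an) v) atts)

let () =
  (* Toy builders: keep only the upper-cased local part of each name. *)
  let name (_, local) = String.uppercase_ascii local in
  let att n v = n ^ "=" ^ v in
  let t : tag = (("urn:example", "a"), [ (("", "b"), "1") ]) in
  assert (map_tag name att t = ("A", [ "B=1" ]))
```

Calling `map_tag` at the top of ~s and ~e gives the same result as the polymorphic interface without changing the signature of `input`.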

Any comments?

Daniel


* Re: [Caml-list] xmlm and names(paces)
  2008-02-06 22:56   ` Bünzli Daniel
  2008-02-06 23:51     ` Bünzli Daniel
@ 2008-02-07 22:03     ` Alain Frisch
  1 sibling, 0 replies; 8+ messages in thread
From: Alain Frisch @ 2008-02-07 22:03 UTC
  To: Bünzli Daniel; +Cc: caml-list

Bünzli Daniel wrote:
> On 6 Feb 2008, at 22:52, Alain Frisch wrote:
>> It is also a good idea to be able to parse XML documents that conform
>> to the XML spec but not the XML Namespaces spec.
> 
> But don't you automatically get that ?
> 
> A document that has no xmlns namespace declarations and no prefixes if 
> parsed according to the xmlns spec will result in names with empty 
> namespace names.

The following documents are well-formed w.r.t. the XML spec:

<a::::x/>   (syntactically invalid QName w.r.t. XML Namespaces)

<a:x/>   (unbound prefix)

>> type name = string * [`N of string * string|`U of string * string|`X]
> [...]
> 
> Too heavy weight for my taste.

Actually, I agree. What about doing nothing about namespaces?  For the 
event-based API at least, and maybe for the tree-based one as well, it 
doesn't seem too bad to let the client manage it (maybe you can simply 
provide a small module to help manage namespace dictionaries and
resolution).

-- Alain


* xmlm and names(paces)
@ 2008-02-07  8:13 oleg
  2008-02-07  8:59 ` [Caml-list] " Bünzli Daniel
  0 siblings, 1 reply; 8+ messages in thread
From: oleg @ 2008-02-07  8:13 UTC
  To: caml-list

Buenzli Daniel wrote:
> As I previously said on this list I'm adding better namespace support to
> xmlm. Up to now xmlm just parsed qualified names into their prefix and
> local part (prefix, local). Now I'd like to provide the client with
> expanded names (uri, local).
>
> Initially I planned to give the client choice between getting qualified
> names or expanded names. However the prefix of qualified names is really
> meaningless (it can be alpha converted) and thus cannot be used to
> recognize anything in a document. One of the aims of xmlm is simplicity;
> as such, I think xmlm should only provide expanded names.

It should be mentioned that the prefixes of qualified names cannot
just be alpha-converted. It is quite common to see the following,
quoted from http://www.w3.org/TR/xmlschema-0/

	<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
         ...
	  <xsd:element name="comment" type="xsd:string"/>

One can plainly see that the prefix, xsd, appears inside a quoted
string! If one wishes to rename the prefix xsd into just 's', one has
to look inside quoted strings (of course, not every occurrence of xsd
inside quoted string is the prefix. A quoted string, the content of an
attribute, may just as well be an opaque quoted string).
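
A small sketch of why the bindings in scope are needed here: resolving a value such as `xsd:string` requires the same prefix-to-URI table as resolving element names. The function `resolve_value` and the association-list scope are hypothetical.

```ocaml
(* Resolve a QName that occurs as attribute *content*, e.g. type="xsd:string".
   [scope] lists the (prefix, uri) bindings in scope at the element. *)
let resolve_value scope value =
  match String.index_opt value ':' with
  | None -> None                       (* no prefix: treated as opaque here *)
  | Some i ->
    let prefix = String.sub value 0 i in
    let local = String.sub value (i + 1) (String.length value - i - 1) in
    Option.map (fun uri -> (uri, local)) (List.assoc_opt prefix scope)

let () =
  let xs = "http://www.w3.org/2001/XMLSchema" in
  assert (resolve_value [ ("xsd", xs) ] "xsd:string" = Some (xs, "string"));
  (* After renaming the prefix to "s" without rewriting the attribute value,
     the old value no longer resolves. *)
  assert (resolve_value [ ("s", xs) ] "xsd:string" = None);
  assert (resolve_value [ ("s", xs) ] "s:string" = Some (xs, "string"))
```

Only the application knows which attribute values hold QNames, so a generic parser cannot do this rewriting on its own.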

One may really wonder what kind of people wrote all those voluminous
XML recommendations. 

So, ideally one may wish to keep the original prefix (in addition to
its corresponding URL). It is also reasonable for a user to specify a
`shortcut'. Unlike the prefix, which is chosen by the author of the
document, a shortcut is chosen by the person who invokes a parser. In
the SSAX parser, the user specifies the association of URI with
shortcuts. The parser, having resolved the QName prefix to a URI, maps
that URI to the user-specified shortcut, if present. The shortcuts are
extensively discussed in
	http://okmij.org/ftp/Scheme/SXML.html#Namespaces

Incidentally, some of the design decisions of SSAX (despite being
produced by an enemy) might be pertinent to this discussion. SSAX is
actually a SAX parser, or a big macro that builds a parser out of
user-provided callbacks and reasonable defaults. One can use SSAX to
parse XML on the fly or to convert XML to anything one chooses. There
is also an instantiation of SSAX with reasonable callbacks that make
SSAX a DOM parser, converting XML into one particular output format,
SXML. Experience shows that this particular instantiation satisfies
most of the users. Still I have come across several users who needed
the full SSAX (e.g., for streaming conversion of XML into something
else).


* Re: [Caml-list] xmlm and names(paces)
  2008-02-07  8:13 oleg
@ 2008-02-07  8:59 ` Bünzli Daniel
  0 siblings, 0 replies; 8+ messages in thread
From: Bünzli Daniel @ 2008-02-07  8:59 UTC
  To: caml-list List

On 7 Feb 2008, at 09:13, oleg@okmij.org wrote:

> It should be mentioned that the prefixes of qualified names cannot
> just be alpha-converted. It is quite common to see the following,
> quoted from http://www.w3.org/TR/xmlschema-0/
>
> 	<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
>         ...
> 	  <xsd:element name="comment" type="xsd:string"/>

Argh! I now understand Alain's comment about the need to keep that
information. This is a complete misuse of the namespace recommendation,
which says nothing about binding prefixes in attribute and character
data. The W3C is really hopeless.

> One may really wonder what kind of people wrote all those voluminous
> XML recommendations.

You tell me.

> So, ideally one may wish to keep the original prefix (in addition to
> its corresponding URL). It is also reasonable for a user to specify a
> `shortcut'. Unlike the prefix, which is chosen by the author of the
> document, a shortcut is chosen by the person who invokes a parser.

As mentioned in my previous email, with xmlm you can do that by
yourself since all the info is there and you have full control over the
parsing result. However, for the aforementioned case this may mean a
lot of work. On the other hand, XML Schema seems to be seen as a broken
technology (even the XML spec editor says so, iirc). So the question
is whether it is worth complicating the interface to facilitate the
parsing of this obviously broken and marginal (is it?) case.

> Incidentally, some of the design decisions of SSAX (despite being
> produced by an enemy) might be pertinent to this discussion.

Thanks for the link, I will have a look at it (functional programming  
languages are no enemies).

Best,

Daniel

