ElementTree, serialization and namespace prefixes


Mon 27 February 2006 By nuxeo

The way ElementTree
outputs namespaces in serialized output can be a problem with some
applications.

Here is an example of such output:

>>> import cElementTree as etree
>>> stream = """<?xml version="1.0" encoding="UTF-8" ?>
... <doc xmlns="http://bar"
... xmlns:foo="http://foo">
... <foo:sub/>
... </doc>"""
>>>
>>> doc = etree.XML(stream)
>>> print etree.tostring(doc, encoding="UTF-8")
<?xml version="1.0" encoding="UTF-8" ?>
<ns0:doc xmlns:ns0="http://bar">
<ns1:sub xmlns:ns1="http://foo" />
</ns0:doc>
>>>

We can see that each declared namespace is now given a generated alias
(ns0, ns1) and every prefix is rewritten to use it. This is absolutely
correct from an XML point of view, but it can cause trouble when the XML
produced by an elementtree-based Python program is consumed by an
application that does not handle namespaces properly on its side.
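
The original prefixes disappear because ElementTree does not keep them: on parsing, each qualified name is stored as `{uri}local`. Here is a minimal sketch of that, assuming a recent Python where the library ships in the standard library as `xml.etree.ElementTree` (the examples in this post use the standalone cElementTree package instead):

```python
import xml.etree.ElementTree as ET

stream = """<doc xmlns="http://bar"
     xmlns:foo="http://foo">
  <foo:sub/>
</doc>"""

doc = ET.XML(stream)

# Tags are stored with the namespace URI in braces; the prefix
# written in the source document is not part of the tree.
print(doc.tag)     # {http://bar}doc
print(doc[0].tag)  # {http://foo}sub
```

Since only the URI survives, the serializer has to choose a prefix again when writing the document out.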

Here is a workaround I found, though I don't know if others exist:

>>> import cElementTree
>>> import elementtree.ElementTree
>>>
>>> my_namespaces = {'http://foo' : 'foo',
... 'http://bar' : 'bar'}
>>> elementtree.ElementTree._namespace_map.update(my_namespaces)
>>>
>>> stream = """<?xml version="1.0" encoding="UTF-8" ?>
... <doc xmlns="http://bar"
... xmlns:foo="http://foo">
... <foo:sub/>
... </doc>"""
>>>
>>> doc = cElementTree.XML(stream)
>>> print cElementTree.tostring(doc)
<bar:doc xmlns:bar="http://bar">
<foo:sub xmlns:foo="http://foo" />
</bar:doc>

This time the document has been serialized with the prefixes we registered,
instead of generated aliases replacing the prefixes within qualified names.

The idea is to add our own namespace prefixes to the well-known ones that
elementtree defines by default.

The default elementtree ones are defined in elementtree/ElementTree.py
as follows:

_namespace_map = {
# "well-known" namespace prefixes
"http://www.w3.org/XML/1998/namespace": "xml",
"http://www.w3.org/1999/xhtml": "html",
"http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
"http://schemas.xmlsoap.org/wsdl/": "wsdl",
}
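
To make the role of this dictionary concrete, here is a simplified model of how a serializer can pick a prefix for a `{uri}local` tag. It is a sketch for illustration, not the library's actual code, and the names `qualify`, `known` and `in_use` are made up for the example:

```python
def qualify(tag, known, in_use):
    """Turn '{uri}local' into 'prefix:local'.

    known  -- uri -> prefix map of registered prefixes
    in_use -- uri -> prefix map for the document being written
    """
    uri, local = tag[1:].split("}", 1)
    prefix = in_use.get(uri)
    if prefix is None:
        # Prefer a registered prefix, else generate ns0, ns1, ...
        prefix = known.get(uri)
        if prefix is None:
            prefix = "ns%d" % len(in_use)
        in_use[uri] = prefix
    return "%s:%s" % (prefix, local)


# Without any registered prefix, aliases are generated.
in_use = {}
print(qualify("{http://bar}doc", {}, in_use))  # ns0:doc
print(qualify("{http://foo}sub", {}, in_use))  # ns1:sub

# With a registered prefix, that prefix is used instead.
print(qualify("{http://foo}sub", {"http://foo": "foo"}, {}))  # foo:sub
```

Adding entries to the map of known prefixes is therefore enough to change what ends up in the output.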


This is not the cleanest solution I had hoped to find, since it relies on a
private module-level dictionary. Please let me know if you know of any others.
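
One alternative, assuming a Python version where ElementTree ships in the standard library as `xml.etree.ElementTree` (2.7, or 3.2 and later), is the public `register_namespace(prefix, uri)` function, which avoids touching the private map directly. A minimal sketch with the same example document:

```python
import xml.etree.ElementTree as ET

# Registration is global: it affects every later serialization
# done in this process.
ET.register_namespace("foo", "http://foo")
ET.register_namespace("bar", "http://bar")

stream = """<doc xmlns="http://bar"
     xmlns:foo="http://foo">
  <foo:sub/>
</doc>"""

doc = ET.XML(stream)
out = ET.tostring(doc, encoding="unicode")
print(out)  # elements are written as bar:doc and foo:sub
```

Note that the argument order is prefix first, then URI, which is the reverse of the key/value order used in `_namespace_map`.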

The problem I ran into recently was with OpenOffice.org 1.1.x (I don't
know about version 2, though).

At first, I could parse and serialize OpenOffice.org content XML documents
and open them in OpenOffice.org. But as soon as I modified the document from
OpenOffice.org, it did not take the namespace prefix aliases into
consideration when inserting new elements. I used this trick and now
OpenOffice.org is happy. I'm going to report this issue to Laurent to see if
the OpenOffice.org guys are aware of it.

I fixed the issue as shown below. I used the nmspace.mod file from the
OOo DTD to find the relevant OOo namespaces.

OOo_NS = "http://openoffice.org/2000/"

OFFICE_NS = "%soffice" % OOo_NS
TABLE_NS = "%stable" % OOo_NS
STYLE_NS = "%sstyle" % OOo_NS
TEXT_NS = "%stext" % OOo_NS
META_NS = "%smeta" % OOo_NS
SCRIPT_NS = "%sscript" % OOo_NS
DRAWING_NS = "%sdrawing" % OOo_NS
CHART_NS = "%schart" % OOo_NS
NUMBER_NS = "%snumber" % OOo_NS
DATASTYLE_NS = "%sdatastyle" % OOo_NS
DR3D_NS = "%sdr3d" % OOo_NS
FORM_NS = "%sform" % OOo_NS
CONFIG_NS = "%sconfig" % OOo_NS

FO_NS = "http://www.w3.org/1999/XSL/Format"
XLINK_NS = "http://www.w3.org/1999/xlink"
SVG_NS = "http://www.w3.org/2000/svg"
MATH_NS = "http://www.w3.org/1998/Math/MathML"

The following map gives elementtree the prefix to use for each namespace
during XML serialization.

NAMESPACE_MAP = {
OFFICE_NS : 'office',
TABLE_NS : 'table',
STYLE_NS : 'style',
TEXT_NS : 'text',
META_NS : 'meta',
SCRIPT_NS : 'script',
DRAWING_NS : 'drawing',
CHART_NS : 'chart',
NUMBER_NS : 'number',
DATASTYLE_NS : 'datastyle',
DR3D_NS : 'dr3d',
FORM_NS : 'form',
CONFIG_NS : 'config',
MATH_NS : 'math',
SVG_NS : 'svg',
XLINK_NS : 'xlink',
FO_NS : 'fo',
}

import elementtree.ElementTree as etree
etree._namespace_map.update(NAMESPACE_MAP)
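
With a standard-library ElementTree (`xml.etree.ElementTree`, Python 2.7 or 3.2 and later), the same registration can probably be written with the public `register_namespace` function. The sketch below uses only two of the namespaces above, and the element names are just an illustration:

```python
import xml.etree.ElementTree as ET

OOo_NS = "http://openoffice.org/2000/"

# Subset of the map above, still keyed by URI.
NAMESPACE_MAP = {
    OOo_NS + "office": "office",
    OOo_NS + "text": "text",
}

# register_namespace takes (prefix, uri), the reverse of the dict layout.
for uri, prefix in NAMESPACE_MAP.items():
    ET.register_namespace(prefix, uri)

# Build a tiny tree using the '{uri}local' form and serialize it.
root = ET.Element("{%soffice}document-content" % OOo_NS)
ET.SubElement(root, "{%stext}p" % OOo_NS).text = "hello"

out = ET.tostring(root, encoding="unicode")
print(out)
```

The serialized elements should then carry the `office:` and `text:` prefixes rather than generated aliases.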



(Post originally written by Julien Anguenot on the old Nuxeo blogs.)


Category: Product & Development