A common problem when working with the xmlTree API is getting the first child of the root element and finding that it has no content, even though the root's first child element does have content. This is due to an often overlooked aspect of the XML specification: whitespace is significant.
Consider the following XML document:
<?xml version="1.0" encoding="UTF-8"?>
<top>
<item>I'm an item!</item>
<item>So am I!</item>
</top>
It looks like the <top> element has two children, both of which are <item> elements. But because whitespace is significant, it actually has five children, including the text between the elements:
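1. Text (the whitespace between <top> and the first <item>)
2. <item>
3. Text (the whitespace between the two <item>s)
4. <item>
5. Text (the whitespace between the last <item> and the closing </top>)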
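The five children can be checked with any DOM-style parser. The following is a minimal sketch that uses Python's standard xml.dom.minidom module as a stand-in, not the xmlTree API itself; the describe() helper is a hypothetical name introduced only for this illustration.

```python
from xml.dom import minidom

# The example document from above, including its line breaks.
XML = """<?xml version="1.0" encoding="UTF-8"?>
<top>
<item>I'm an item!</item>
<item>So am I!</item>
</top>
"""

def describe(node):
    """Return a short label for a DOM node (hypothetical helper)."""
    if node.nodeType == node.ELEMENT_NODE:
        return "element <%s>" % node.tagName
    if node.nodeType == node.TEXT_NODE:
        return "text %r" % node.data
    return "other"

root = minidom.parseString(XML).documentElement

# List every direct child of <top>, whitespace-only text nodes included.
children = [describe(child) for child in root.childNodes]
for label in children:
    print(label)
```

The loop prints five labels: three whitespace-only text nodes with the two <item> elements between them.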
Going back to the pitfall from the beginning, when we get the first child of the root element we're actually getting the text node between the <top> and <item> elements, instead of the <item> element itself.
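One way around the pitfall is to skip text nodes explicitly when looking for the first child. The sketch below again uses Python's standard xml.dom.minidom module rather than the xmlTree API, and first_element_child() is a hypothetical helper written for this example.

```python
from xml.dom import minidom

XML = """<?xml version="1.0" encoding="UTF-8"?>
<top>
<item>I'm an item!</item>
<item>So am I!</item>
</top>
"""

def first_element_child(node):
    """Return the first child that is an element, or None (hypothetical helper)."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            return child
    return None

root = minidom.parseString(XML).documentElement

# The naive lookup returns the whitespace text node that follows <top>.
naive = root.firstChild

# Skipping non-element nodes returns the first <item> element.
first_item = first_element_child(root)

print(repr(naive.data))
print(first_item.firstChild.data)
```

This leaves the whitespace nodes in the tree, so every lookup that walks children has to apply the same filtering.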
Without those extra text nodes, the document would look like this:
<?xml version="1.0" encoding="UTF-8"?>
<top><item>I'm an item!</item><item>So am I!</item></top>
With many XML documents it is handy to ignore whitespace-only text and think of <top> as having only two children. This can be done by passing the option XML_PARSE_NOBLANKS when parsing the XML data. The parser determines whether each text node contains only whitespace (as defined by the XML spec) and, if it does, discards that node.
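To make the effect of discarding blank nodes concrete, the sketch below applies the same rule by hand after parsing. It uses Python's standard xml.dom.minidom module, which has no such parser option, so this only approximates the behaviour described above and says nothing about how xmlTree implements it. The names remove_blank_text_nodes() and XML_WHITESPACE are assumptions made for this example.

```python
from xml.dom import minidom

XML = """<?xml version="1.0" encoding="UTF-8"?>
<top>
<item>I'm an item!</item>
<item>So am I!</item>
</top>
"""

# The XML spec defines whitespace as space, tab, carriage return and line feed.
XML_WHITESPACE = " \t\r\n"

def remove_blank_text_nodes(node):
    """Recursively drop text nodes that contain only XML whitespace."""
    # Iterate over a copy, because removeChild() mutates node.childNodes.
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if child.data.strip(XML_WHITESPACE) == "":
                node.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            remove_blank_text_nodes(child)

root = minidom.parseString(XML).documentElement
remove_blank_text_nodes(root)

# Only the two <item> elements remain as children of <top>.
names = [child.tagName for child in root.childNodes]
print(names)
print(root.firstChild.firstChild.data)
```

Text nodes that mix whitespace with other characters, such as the content of each <item>, are left untouched.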
See xmlTreeNewDocFromString() and xmlTreeNewDocFromFile() for more information.