Text Nodes and Children

A common problem when working with the xmlTree API is getting the first child of the root element and finding that it has no content, even though the root's first child element does have content. This is due to an often-overlooked aspect of the XML specification:

All whitespace that occurs between XML elements is significant.

Consider the following XML document:

<?xml version="1.0" encoding="UTF-8"?>
<top>
    <item>I'm an item!</item>
    <item>So am I!</item>
</top>

It looks like the <top> element has two children, both of which are <item> elements. But because whitespace is significant, it actually has five children, including the text between the elements:

a text node, containing a newline and 4 spaces (the text between <top> and the first <item>)
an element node, the first <item>
a text node, containing a newline and 4 spaces (the text between the two <item>s)
an element node, the second <item>
a text node, containing only a newline (the text between the last <item> and the closing </top>)

Going back to the pitfall from the beginning, when we get the first child of the root element we're actually getting the text node in between the <top> and <item> elements, instead of the <item> element itself.
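
The same behavior can be observed with most DOM-style parsers. The sketch below does not use the xmlTree API; it uses Python's standard xml.dom.minidom module as a stand-in, so every name in it is Python's rather than xmlTree's. It parses the example document, lists each child of <top>, and checks what the first child actually is.

```python
from xml.dom import minidom

# The example document, passed as bytes so the encoding declaration is honored.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<top>
    <item>I'm an item!</item>
    <item>So am I!</item>
</top>
"""

top = minidom.parseString(XML).documentElement

# Describe every child of <top>, whitespace-only text nodes included.
children = []
for node in top.childNodes:
    if node.nodeType == node.TEXT_NODE:
        children.append(("text", node.data))
    else:
        children.append(("element", node.tagName))

for kind, value in children:
    print(kind, repr(value))

# The first child is the whitespace text node, not the first <item>.
first = top.firstChild
first_is_text = first.nodeType == first.TEXT_NODE
print(first_is_text)
```

Run as written, this should print three text entries and two element entries, in the order listed above, followed by True.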

Without those extra text nodes, the document would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<top><item>I'm an item!</item><item>So am I!</item></top>

With many XML documents it is handy to ignore whitespace-only text and think of <top> as having only two children. This can be done by passing the option XML_PARSE_NOBLANKS when parsing the XML data. The parser then detects text nodes that contain only whitespace (as defined by the XML spec) and discards them.
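
To make the effect concrete, here is a rough Python analogue, again using xml.dom.minidom as a stand-in rather than the xmlTree API. The helper drop_blank_text is hypothetical and is not part of xmlTree; it only approximates the idea by removing text nodes made up entirely of the XML whitespace characters (space, tab, carriage return, line feed). The real option may apply additional rules.

```python
from xml.dom import minidom

XML_WHITESPACE = " \t\r\n"  # the four whitespace characters in the XML spec


def drop_blank_text(node):
    """Recursively remove text nodes that contain only XML whitespace.

    Hypothetical helper for illustration; not an xmlTree function.
    """
    for child in list(node.childNodes):  # copy, since we remove while iterating
        if child.nodeType == child.TEXT_NODE:
            if child.data.strip(XML_WHITESPACE) == "":
                node.removeChild(child)
        else:
            drop_blank_text(child)


XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<top>
    <item>I'm an item!</item>
    <item>So am I!</item>
</top>
"""

top = minidom.parseString(XML).documentElement
drop_blank_text(top)

# Only the two <item> elements remain as children of <top>.
names = [child.tagName for child in top.childNodes]
print(names)

# Text with real content is untouched.
first_text = top.firstChild.firstChild.data
print(first_text)
```

After the blank nodes are dropped, the first child of <top> is the first <item>, which is what most callers expect.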

See xmlTreeNewDocFromString() and xmlTreeNewDocFromFile() for more information.