There are two separate topics to discuss regarding encodings: reading and writing XML, and working with data internally.
The XML API can read and write many character encodings, leveraging
the power of the GNU libiconv library. The reading and writing
encodings need not be the same. For example, a SHIFT_JIS
document can be written as UTF-8, and vice versa.
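As a language-neutral illustration of reading in one encoding and writing in another, here is a minimal Python sketch. It does not use the Vortex XML API: Python's built-in codecs stand in for libiconv, and the helper name transcode is hypothetical.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode bytes from the source encoding, then re-encode in the target."""
    return data.decode(src).encode(dst)

# Simulate a SHIFT_JIS document body (assumption: sample text chosen for illustration).
sjis_bytes = "日本語".encode("shift_jis")

# Read as SHIFT_JIS, write as UTF-8 ...
utf8_bytes = transcode(sjis_bytes, "shift_jis", "utf-8")

# ... and vice versa.
round_trip = transcode(utf8_bytes, "utf-8", "shift_jis")
```

The conversion always passes through a single internal representation, which is the same design the XML API follows with UTF-8.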
When working within the library, everything is UTF-8, regardless of what character encoding it was read from or will be written to. This deserves stressing: UTF-8. UTF-8.
This means that a document may exist on disk in ISO-8859-1, but
when the XML API parses it and you call xmlTreeGetContent() to
get the text from an element, you'll get UTF-8 data. If the
file exists on disk in ASCII, calling xmlTreeGetContent()
will still give UTF-8 data.
Similarly, regardless of whether a document will be output in
BIG5, ISO-8859-7, UTF-32, etc., when adding a new
element with xmlTreeNewElement(), the name and contents must be
given in UTF-8.
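The same "one internal encoding" model can be sketched with Python's standard xml.etree.ElementTree module. This is an analogy only, not the Vortex XML API: Python strings are Unicode rather than UTF-8 bytes, and the sample document and element names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# A document as it might exist on disk in ISO-8859-1 (simulated with bytes).
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><note>café</note>'
).encode("iso-8859-1")

# Parsing converts the content to the internal representation,
# so the source encoding no longer matters after this point.
root = ET.fromstring(latin1_doc)
text = root.text

# New content is supplied in the internal representation as well,
# regardless of the encoding the document will eventually be written in.
child = ET.SubElement(root, "name")
child.text = "naïve"

# The output encoding is chosen only at serialization time.
out_utf8 = ET.tostring(root, encoding="utf-8")
out_latin1 = ET.tostring(root, encoding="iso-8859-1")
```

Code between the parse and the serialization steps never needs to know either the input or the output encoding.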
This may sound restrictive, but it's actually liberating: when
working in code, you never have to worry about what encoding the file
was read from, or what encoding it will be written out as. Always use
UTF-8.
The default encoding when working in Vortex is already UTF-8
(unless manually changed with <urlcp charsettxt>). If you have
data that you need to convert to UTF-8, you can use the
<urlutil charsetconv> Vortex function.