For Windows, where wchar_t corresponds in size and encoding to Xerces’ XMLCh, all that’s needed to convert from a wchar_t literal to a Xerces XMLCh string is, at most, a little reinterpret_cast… But that won’t work on a system with 4-byte wchar_t or a system where wchar_t doesn’t imply Unicode. And although the non-Unicode systems are perhaps not your top porting priority, 4-byte wchar_t systems are common.
The wrapping that I start discussing in this posting, based on the cppx Ownership class, does a simple reinterpret_cast when it can (e.g. for a literal wchar_t string argument in Windows), and a dynamic allocation plus transcoding in other cases.
The Ownership wrapping of the result abstracts away the details of how the string is converted, and whether deallocation is needed. Thus, the calling code does not have to care. It gets raw null-operation efficiency where possible, and otherwise at least the same convenient notation + C++ RAII based safety. 🙂
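The idea of a result that may or may not own its buffer can be sketched without Xerces. In the sketch below MaybeOwned, duplicate, freeChars and freeCallCount are hypothetical names invented for illustration, not part of cppx or Xerces; a null deleter stands for the "nothing to deallocate" case.

```cpp
#include <cstddef>
#include <cstring>

int freeCallCount = 0;      // Hypothetical counter, only to make the effect visible.

// Hypothetical stand-in for a conversion that has to allocate.
inline char const* duplicate( char const* s )
{
    std::size_t const n = std::strlen( s ) + 1;
    char* const copy = new char[n];
    std::memcpy( copy, s, n );
    return copy;
}

inline void freeChars( char const* p ) { delete[] p; ++freeCallCount; }

// Illustration only: a pointer plus an optional deleter.
// A null deleter means "borrowed buffer, nothing to free".
class MaybeOwned
{
private:
    char const*     myPtr;
    void            (*myDeleter)( char const* );

    MaybeOwned( MaybeOwned const& );                // No copying in this sketch.
    MaybeOwned& operator=( MaybeOwned const& );

public:
    MaybeOwned( char const* p, void (*deleter)( char const* ) )
        : myPtr( p ), myDeleter( deleter )
    {}

    ~MaybeOwned() { if( myDeleter != 0 ) { myDeleter( myPtr ); } }

    char const* ptr() const     { return myPtr; }
    bool ownsBuffer() const     { return myDeleter != 0; }
};
```

The calling code uses ptr() the same way in both cases; only the constructor call knows whether a deallocation will happen at the end of the object's lifetime.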
Xerces (pseudo) namespaces
Most everything in Xerces, except ::XMLCh, is in the xercesc namespace. You cannot directly extend the xercesc namespace, at least with a standard-conforming C++ compiler, because it’s just a namespace alias, i.e. namespace xercesc = something. Xerces offers some macros to deal with extension of the xercesc namespace.
And instead of defining nested namespaces Xerces uses classes as namespaces, which I’ll call a pseudo namespace.
For example, the XMLString class, entering the discussion below, is not a string class: it’s a C++ class used as a pseudo namespace, a class where all routines are static (i.e., effectively free-standing routines).
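As a minimal sketch of what a pseudo namespace looks like, here is a class used purely as a scope for static functions. TextUtil and its members are hypothetical names for illustration; this is not Xerces code.

```cpp
#include <cstddef>

// Hypothetical example: a class that only groups static functions.
class TextUtil
{
private:
    TextUtil();     // Declared but not defined: never instantiated.

public:
    // Number of characters before the terminating zero.
    static std::size_t length( char const* s )
    {
        std::size_t n = 0;
        while( s[n] != '\0' ) { ++n; }
        return n;
    }

    static bool isEmpty( char const* s ) { return s[0] == '\0'; }
};
```

Calls are written TextUtil::length( "abc" ), just as with a real namespace. Unlike a real namespace, a class cannot be reopened to add more functions, and it cannot be named in a using namespace directive.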
Xerces strings
XMLCh is a 16-bit unsigned integer that may or may not be wchar_t.
According to one part of the documentation XMLCh is defined as unsigned short, which presumably is how it’s defined on *nix systems. But with MSVC it’s defined as wchar_t (this is also documented!). Presumably that’s done with the good intention of convenience for Windows-only code, but it has the opposite effect for portable code, since it precludes overloading directly on, and detecting, the type of string.
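The overloading problem can be shown with a stand-in type. In the sketch below Utf16Unit, kindOf and overloadsAreDistinct are hypothetical names, and Utf16Unit plays the role of XMLCh as defined on *nix; it assumes a compiler where wchar_t is a distinct built-in type.

```cpp
#include <cstring>

typedef unsigned short Utf16Unit;   // Stand-in for XMLCh as unsigned short (assumption).

// With a distinct 16-bit type these are two separate overloads.
// If Utf16Unit were a typedef for wchar_t, as XMLCh is with MSVC, the second
// definition would be a redefinition of the first and would not compile.
inline char const* kindOf( wchar_t const* )     { return "wchar_t string"; }
inline char const* kindOf( Utf16Unit const* )   { return "UTF-16 string"; }

// Returns true when each argument type selects its own overload.
inline bool overloadsAreDistinct()
{
    Utf16Unit const utf16[] = { 'H', 'i', 0 };
    return std::strcmp( kindOf( L"Hi" ), "wchar_t string" ) == 0
        && std::strcmp( kindOf( utf16 ), "UTF-16 string" ) == 0;
}
```

Portable code that wants to treat wchar_t strings and XMLCh strings differently therefore cannot rely on plain overloading; it has to work on both kinds of platform.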
A Xerces string is represented as zero-terminated UTF-16 in a raw array of Xerces’ XMLCh.
Except for the rather silly & annoying sometimes-it’s-wchar_t definition it’s reasonable and a Good Thing™ that Xerces uses XMLCh instead of wchar_t. A wchar_t can be 16 or 32 bits depending on the C++ implementation, and also the encoding implied by wchar_t depends on the platform – it’s not necessarily Unicode. With XMLCh you know what you have, both size-wise and encoding-wise.
In Windows the mapping from wchar_t to XMLCh is a null-operation, but on some other systems transcoding, i.e. translation from one encoding to another, may be needed.
XMLString::transcode is a set of overloads that transcode to and from default locale char strings.
For more general transcoding Xerces offers the XMLTranscoder interface. You can obtain XMLTranscoder instances from XMLPlatformUtils::fgTransService, which is a static pointer. This does not seem to be very well documented, but at least some of the possible encoding names are defined as XMLCh string constants in the XMLUni pseudo namespace.
Xerces string memory deallocation: a pitfall
There’s a lot of misinformation available on the net. A main problem is that when some code apparently works, some person will post it on the net in the belief that it will also work for others. For example, Brainbell’s “Free IT Training & Computer Tutorials” offers a “Working with Xerces Strings” tutorial, where the unnamed author first defines …
An apparently reasonable way to leverage std::basic_string:
    typedef std::basic_string<XMLCh> XercesString;
… and then goes on to define e.g. …
An apparently safe C++ RAII-based conversion:
    inline XercesString fromNative(const char* str)
    {
        boost::scoped_array<XMLCh> ptr(xercesc::XMLString::transcode(str));
        return XercesString(ptr.get( ));
    }
Both these snippets have UB. The definition of XercesString has a kind of academic UB because std::basic_string does not directly support other character types than char and wchar_t. The fromNative definition has more serious havoc-wreaking UB.
boost::scoped_array will deallocate using a delete[] expression. And although by default a Xerces XMLCh string is allocated by new[], in Windows that new[] expression will typically have been executed in a Xerces DLL. The delete[] expression may, instead, be executed in a program with statically linked runtime library, thus using different memory managers for allocation and deallocation.
And that’s pretty serious.
Xerces offers XMLString::release to deal with this (it’s a Xerces FAQ). It’s just a wrapper that invokes Xerces’ default memory manager. An alternative is to use your own Xerces memory manager.
Wrapping via boost::shared_ptr
The code above can be fixed by replacing std::basic_string with a boost::shared_ptr as the “string carrier”, because boost::shared_ptr supports a custom deleter. This trades the nice-to-have but perhaps not ever needed functionality of std::basic_string for correctness and improved efficiency. Where correctness is, of course, the main reason: that UB is serious!
The XMLString::release wrapper called dispose in the example below (wrapping a convenience wrapper…) is needed because XMLString::release takes a pointer to the pointer as argument, and therefore cannot be used directly as a custom deleter for a boost::shared_ptr:
File [wrapped_by_shared_ptr.cpp], correct and less inefficient:
    #include <iostream>
    #include <assert.h>
    #include <boost/shared_ptr.hpp>
    #include <xercesc/util/PlatformUtils.hpp>   // Initialize, Terminate
    #include <xercesc/util/XMLString.hpp>       // transcode
    #include <xercesc/dom/DOM.hpp>              // DOMxxx

    template< class CharType >
    class ZStr      // Zero-terminated string.
    {
    private:
        boost::shared_ptr< CharType >   myArray;

    public:
        ZStr( CharType const* s, void (*deleter)( CharType* ) )
            : myArray( const_cast< CharType* >( s ), deleter )
        {
            assert( deleter != 0 );
        }

        CharType const* ptr() const { return myArray.get(); }
    };

    namespace myXerces {
        typedef ::XMLCh     Utf16Char;

        inline void dispose( Utf16Char* p ) { xercesc::XMLString::release( &p ); }
        inline void dispose( char* p )      { xercesc::XMLString::release( &p ); }

        inline ZStr< Utf16Char > uStr( char const* s )
        {
            return ZStr< Utf16Char >( xercesc::XMLString::transcode( s ), &dispose );
        }

        inline ZStr< char > cStr( Utf16Char const* s )
        {
            return ZStr< char >( xercesc::XMLString::transcode( s ), &dispose );
        }

        struct Lib
        {
            Lib()  { xercesc::XMLPlatformUtils::Initialize(); }
            ~Lib() { xercesc::XMLPlatformUtils::Terminate(); }
        };
    }    // namespace myXerces

    myXerces::Lib const xercesLibUsage;     // Ensures lib initialized for all code in 'main'.

    int main()
    {
        using namespace std;
        using namespace myXerces;
        using namespace xercesc;
        try
        {
            DOMImplementation& dom = *DOMImplementation::getImplementation();

            // *** Conversion char -> XMLCh:
            DOMDocument* const pDoc = dom.createDocument( 0, uStr( "Bad name!" ).ptr(), 0 );

            cout << "Huh, shouldn't get here!" << endl;
            pDoc->release();
        }
        catch( DOMException const& x )
        {
            // *** Conversion XMLCh -> char:
            cout << "!DOMException: " << cStr( x.getMessage() ).ptr() << endl;
        }
    }
Note: the const_cast is a concession to the established convention of writing deleters with type void(*)(T*) instead of void(*)(T const*). The established convention is a bit impractical. But given that it is established, it would probably be even more impractical to require void(*)(T const*).
How to get rid of the inefficiency
Correctness is nice, but each string conversion in the code above involves two dynamic allocations: first one performed by XMLString::transcode, and then one performed by boost::shared_ptr, which needs to allocate a shared reference counter object. Contrast this with the fact that in Windows wchar_t is compatible in both size and implied encoding with Xerces’ XMLCh! That is, for a wchar_t → XMLCh conversion in Windows, code like the above achieves correctness at the cost of two dynamic allocations plus one copying of the string data, when it could have been a simple reinterpret_cast!
Getting rid of that inefficiency involves:
- instead of char strings, using wchar_t strings as the program’s native strings,
- detecting the minimal amount of work needed for each conversion, which essentially means discriminating on the size of wchar_t and on the implied encoding of wchar_t, which can be done at compile time, and
- instead of a sharing smart pointer, using an ownership transfer smart pointer (in practice that’s all that’s needed).
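The second point, discriminating on the size of wchar_t at compile time, can be sketched as follows. This is only an illustration under stated assumptions: Utf16Unit is a stand-in for XMLCh, the names wcharIsUtf16Sized, appendUtf16 and toUtf16 are hypothetical, and the code assumes that a wide string holds UTF-16 when wchar_t is 16 bits and UTF-32 otherwise, which is not true on every platform. For brevity the sketch always copies; it does not show the reinterpret_cast shortcut.

```cpp
#include <cstddef>
#include <vector>

typedef unsigned short Utf16Unit;   // Stand-in for XMLCh (assumption: 16 bits).

// Known at compile time; true e.g. with MSVC, false with typical Linux g++.
bool const wcharIsUtf16Sized = ( sizeof( wchar_t ) == sizeof( Utf16Unit ) );

// Appends one Unicode code point as one or two UTF-16 code units.
inline void appendUtf16( unsigned long codePoint, std::vector< Utf16Unit >& out )
{
    if( codePoint < 0x10000UL )
    {
        out.push_back( static_cast< Utf16Unit >( codePoint ) );
    }
    else
    {
        // Outside the basic plane: encode as a surrogate pair.
        unsigned long const v = codePoint - 0x10000UL;
        out.push_back( static_cast< Utf16Unit >( 0xD800UL + (v >> 10) ) );
        out.push_back( static_cast< Utf16Unit >( 0xDC00UL + (v & 0x3FFUL) ) );
    }
}

// Converts a zero-terminated wide string to zero-terminated UTF-16.
inline std::vector< Utf16Unit > toUtf16( wchar_t const* s )
{
    std::vector< Utf16Unit > result;
    for( ; *s != L'\0'; ++s )
    {
        unsigned long const unit = static_cast< unsigned long >( *s );
        if( wcharIsUtf16Sized )     // Constant condition, decided by the compiler.
        {
            result.push_back( static_cast< Utf16Unit >( unit ) );
        }
        else
        {
            appendUtf16( unit, result );
        }
    }
    result.push_back( 0 );
    return result;
}
```

Because the condition is a compile-time constant, only one of the two branches does any work on a given platform. Detecting the implied encoding, as opposed to the size, needs platform knowledge that this sketch simply assumes.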
The last point is where the cppx::Ownership class enters the picture. The standard smart pointers, where by “standard” I mean either in the current C++ standard, slated for C++0x, or in Boost, sadly do not include an ownership transfer smart pointer with custom deleter support. cppx::Ownership fills this special niche; it’s usable on its own, and it can be used to implement higher level classes like the ZStr class below.
Wrapping via cppx::Ownership
Converting ZStr to ownership transfer semantics is a first step on the road to efficiency: by itself it only gets rid of the allocation done by boost::shared_ptr, while still allowing e.g. ZStr function results.
Here I use the common C++98 ownership transfer trick, discussed in my previous posting about cppx::Ownership.
An ownership transfer version of ZStr cannot rely on implicit generation of the necessary constructors, since C++98 does not directly support ownership transfer. Thus this code has a few more lines. The “Curiously Recurring Template Pattern” base class cppx::OwnershipTransferring supplies a definition of the Ref type (a Ref simply holds a ZStr*), an operator Ref() for implicit ownership transfer from temporaries, and a method Ref transfer() for explicit ownership transfer from lvalues such as variables:
In file [wrapped_by_ownership.cpp], getting rid of one dynamic allocation:
    #include <iostream>
    #include <assert.h>
    #include <progrock/cppx/pointers/Ownership.h>
    #include <xercesc/util/PlatformUtils.hpp>   // Initialize, Terminate
    #include <xercesc/util/XMLString.hpp>       // transcode
    #include <xercesc/dom/DOM.hpp>              // DOMxxx

    using namespace progrock;

    template< class CharType >
    class ZStr      // Zero-terminated string.
        : public cppx::OwnershipTransferring< ZStr< CharType > >
    {
    private:
        // Ref lives in a dependent base class, so a conforming compiler
        // needs it named explicitly (MSVC 9.0 finds it without this).
        typedef typename cppx::OwnershipTransferring< ZStr< CharType > >::Ref Ref;

        cppx::Ownership< CharType const >   myArray;

        ZStr( ZStr& );                      // No such.
        ZStr& operator=( ZStr const& );     // No such.

    public:
        ZStr( CharType const* s, void (*deleter)( CharType* ) )
            : myArray( s, deleter )
        {
            assert( deleter != 0 );
        }

        ZStr( Ref other ): myArray( 0 ) { swapWith( *other.p ); }

        void swapWith( ZStr& other ) { myArray.swapWith( other.myArray ); }

        CharType const* ptr() const { return myArray.ptr(); }
    };

    //... The rest as before.
Ownership transfer support
Since my last posting I’ve added some features to cppx::Ownership in order to better support arrays and “constification” of the pointee. As a result, in the code above the const_cast required with boost::shared_ptr is gone. However, while the functionality now available is like automatic dotting of i’s and crossing of t’s, just convenience :-), it involved making changes all the way down, including tackling some C++ language subtleties, and so it would be rather too much code to present & discuss!
The most fundamental change was however to factor out the support for ownership transfer into a class template cppx::OwnershipTransferring, so that user classes like ZStr above can more easily have ownership transfer semantics themselves.
It goes like this – rather simple (when I at last thought of it!):
In file [progrock/cppx/pointers/Ownership.h]:
    namespace progrock{ namespace cppx{

        template< class Derived >
        class OwnershipTransferring
        {
        private:
            Derived* pDerived() { return static_cast< Derived* >( this ); }

        protected:
            struct Ref
            {
                Derived* p;
                Ref( Derived* o ): p( o ) {}
            };

            OwnershipTransferring( OwnershipTransferring& ) {}
            OwnershipTransferring() {}

        public:
            operator Ref() { return pDerived(); }       // For temporaries.
            Ref transfer() { return pDerived(); }       // For lvalues.
        };

    } }  // namespace progrock::cppx
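To see the pattern in isolation, without Xerces or the rest of cppx, the sketch below repeats the base class in compact form (outside its namespaces) and applies it to a small class. IntBox and makeBox are hypothetical names used only for this illustration.

```cpp
// Compact copy of the base class above, so that the sketch is self-contained.
template< class Derived >
class OwnershipTransferring
{
private:
    Derived* pDerived() { return static_cast< Derived* >( this ); }

protected:
    struct Ref
    {
        Derived* p;
        Ref( Derived* o ): p( o ) {}
    };

    OwnershipTransferring( OwnershipTransferring& ) {}
    OwnershipTransferring() {}

public:
    operator Ref() { return pDerived(); }       // For temporaries.
    Ref transfer() { return pDerived(); }       // For lvalues.
};

// Hypothetical user class: owns one dynamically allocated int.
class IntBox
    : public OwnershipTransferring< IntBox >
{
private:
    int*    myValue;

    IntBox( IntBox& );                      // No such.
    IntBox& operator=( IntBox const& );     // No such.

public:
    explicit IntBox( int v ): myValue( new int( v ) ) {}

    // Takes over the pointee; the source is left empty.
    IntBox( Ref other ): myValue( 0 ) { swapWith( *other.p ); }

    ~IntBox() { delete myValue; }

    void swapWith( IntBox& other )
    {
        int* const temp = myValue;
        myValue = other.myValue;
        other.myValue = temp;
    }

    bool isEmpty() const    { return myValue == 0; }
    int value() const       { return *myValue; }
};

// Function result: in C++98 the temporary converts implicitly via operator Ref().
inline IntBox makeBox( int v ) { return IntBox( v ); }
```

A function result can initialize a new object directly, as in IntBox a( makeBox( 42 ) ), while transfer from a named object has to be requested explicitly, as in IntBox b( a.transfer() ), after which a is empty. Plain copying, IntBox c( b ), does not compile.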
Summary, so far
I’ve discussed Xerces’ string representation, a common deallocation pitfall, and how to do this correctly using C++ RAII techniques, with a concrete example using boost::shared_ptr.
And I mentioned that for efficiency one can instead
- use wchar_t strings as the program’s native strings,
- detect the minimal amount of work needed for each conversion, and, to support that,
- use an ownership transfer smart pointer.
In this posting I’ve presented code for the last point. This example used cppx::Ownership as the ownership transfer smart pointer. My plan for the next posting is to present and discuss an implementation of the first two points.
Disclaimer
DISCLAIMER: This code has only been tested with MSVC 9.0. The main reason is that the general Xerces installation for MinGW g++ failed (it seemed to create the library fine, but then failed to build some example program), and I did not want to waste a lot of time on figuring out exactly what went wrong. It would have been nice if Xerces could just supply a MinGW g++ specific make file instead of using the general *nix installation.
Anyway, – enjoy!
Hi, in C++11, is not this possible:
typedef std::basic_string<XMLCh> XercesString;
inline XercesString fromNative(const char * str){
auto xDeleter=[&](XMLCh buf[]){xercesc::XMLString::release(&buf); };
std::unique_ptr<XMLCh[],decltype(xDeleter)> ptr(xercesc::XMLString::transcode(str),xDeleter);
return XercesString(ptr.get());
}
inline XercesString fromNative(const std::string & str){
return fromNative(str.c_str());
}
inline std::string toNative(const XMLCh* str){
auto cDeleter=[&](char buf[]){xercesc::XMLString::release(&buf); };
std::unique_ptr<char[],decltype(cDeleter)> ptr(xercesc::XMLString::transcode(str),cDeleter);
return std::string(ptr.get());
}
inline std::string toNative(const XercesString & str){
return toNative(str.c_str());
}