[cppx] Xerces strings simplified by ownership, part I.

For Windows, where wchar_t corresponds in size and encoding to Xerces’ XMLCh, all that’s needed to convert from a wchar_t literal to a Xerces XMLCh string is, at most, a little reinterpret_cast… But that won’t work on a system with 4 byte wchar_t or a system where wchar_t doesn’t imply Unicode. And although the non-Unicode systems are perhaps not your top porting priority, 4 byte wchar_t systems are common.

The wrapping that I start discussing in this posting, based on the cppx Ownership class, does a simple reinterpret_cast when it can (e.g. for a literal wchar_t string argument in Windows), and a dynamic allocation plus transcoding in other cases.

The Ownership wrapping of the result abstracts away the details of how the string is converted, and whether deallocation is needed. Thus, the calling code does not have to care. It gets raw null-operation efficiency where possible, and otherwise at least the same convenient notation + C++ RAII based safety. 🙂

Xerces (pseudo) namespaces

Most everything in Xerces, except ::XMLCh, is in the xercesc namespace. You cannot directly extend the xercesc namespace, at least with a standard-conforming C++ compiler, because it’s just a namespace alias, i.e. namespace xercesc = something. Xerces offers some macros to deal with extension of the xercesc namespace.

And instead of defining nested namespaces Xerces uses classes as namespaces; I’ll call such a class a pseudo namespace.

For example, the XMLString class, entering the discussion below, is not a string class: it’s a C++ class used as a pseudo namespace, a class where all routines are static (i.e., free-standing routines).
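
To make the pseudo namespace idea concrete, here is a minimal sketch. The names TextUtil and Utf16Char are made up for illustration (Utf16Char merely stands in for XMLCh), and nothing here is Xerces code:

```cpp
#include <cstddef>

typedef unsigned short Utf16Char;   // Assumption: stand-in for Xerces' XMLCh.

// Hypothetical pseudo namespace: a class that is never instantiated and
// that only has static member functions.
class TextUtil
{
private:
    TextUtil();     // Declared but not implemented: instances make no sense.

public:
    // Number of code units before the terminating zero.
    static std::size_t length( Utf16Char const* s )
    {
        std::size_t n = 0;
        while( s[n] != 0 ) { ++n; }
        return n;
    }

    static bool isEmpty( Utf16Char const* s ) { return s[0] == 0; }
};
```

A call is written TextUtil::length( s ), which reads just like a namespace-qualified call. Unlike a real namespace, though, a class can’t be reopened to add routines, and a using directive can’t be applied to it.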

Xerces strings

XMLCh is a 16-bit unsigned integer that may or may not be wchar_t.

According to one part of the documentation XMLCh is defined as unsigned short, which presumably is how it’s defined on *nix systems. But with MSVC it’s defined as wchar_t (this is also documented!). Presumably that’s done with the good intention of convenience for Windows-only code, but it has the opposite effect for portable code, since it precludes direct overloading on, and detection of, the type of string.

A Xerces string is represented as zero-terminated UTF-16 in a raw array of Xerces’ XMLCh.

Except for the rather silly & annoying sometimes-it’s-wchar_t definition it’s reasonable and a Good Thing™ that Xerces uses XMLCh instead of wchar_t. A wchar_t can be 16 or 32 bits depending on the C++ implementation, and also the encoding implied by wchar_t depends on the platform – it’s not necessarily Unicode. With XMLCh you know what you have, both size-wise and encoding-wise.
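
As a small illustration of “knowing what you have”, the following sketch counts Unicode code points in a zero-terminated UTF-16 string. It uses C++11’s char16_t as a stand-in for XMLCh (an assumption made only to keep the example self-contained), and the function name nCodePoints is made up:

```cpp
#include <cstddef>

// char16_t (C++11) stands in for XMLCh here: a 16-bit UTF-16 code unit.
// Counts code points in a zero-terminated UTF-16 string, treating a high
// surrogate that is followed by a low surrogate as a single code point.
inline std::size_t nCodePoints( char16_t const* s )
{
    std::size_t n = 0;
    while( *s != 0 )
    {
        bool const isHigh    = (0xD800 <= *s && *s <= 0xDBFF);
        // s[1] is safe to read: *s is not the terminator, so s[1] exists.
        bool const nextIsLow = isHigh && (0xDC00 <= s[1] && s[1] <= 0xDFFF);
        s += (nextIsLow ? 2 : 1);
        ++n;
    }
    return n;
}
```

With a fixed 16-bit unit the surrogate pair logic is the same on every platform. With wchar_t the same text might be two units (16-bit wchar_t) or one unit (32-bit wchar_t), so code like this would need a per-platform variant.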

In Windows the mapping between wchar_t and XMLCh is a null-operation, but on other systems transcoding (translation from one encoding to another) may be needed.

XMLString::transcode is a set of overloads that transcode to and from default locale char strings.

For more general transcoding Xerces offers the XMLTranscoder interface. You can obtain XMLTranscoder instances from XMLPlatformUtils::fgTransService, which is a static pointer. This does not seem to be very well documented, but at least some of the possible encoding names are defined as XMLCh string constants in the XMLUni pseudo namespace.

Xerces string memory deallocation: a pitfall

There’s a lot of disinformation available on the net. A main problem is that when some code apparently works, then some person will post it on the net in the belief that it will also work for others. For example, Brainbell’s “Free IT Training & Computer Tutorials” offers a “Working with Xerces Strings” tutorial, where the unnamed author first defines …

An apparently reasonable way to leverage std::basic_string:

typedef std::basic_string<XMLCh> XercesString;

… and then goes on to define e.g. …

An apparently safe C++ RAII-based conversion:

inline XercesString fromNative(const char* str)
{
    boost::scoped_array<XMLCh> ptr(xercesc::XMLString::transcode(str));
    return XercesString(ptr.get( ));
}

Both these snippets have UB. The definition of XercesString has a kind of academic UB because std::basic_string does not directly support other character types than char and wchar_t. The fromNative definition has more serious havoc-wreaking UB.

boost::scoped_array will deallocate using a delete[] expression. And although by default a Xerces XMLCh string is allocated by new[], in Windows that new[] expression will typically have been executed in a Xerces DLL. The delete[] expression may, instead, be executed in a program with statically linked runtime library, thus using different memory managers for allocation and deallocation.

And that’s pretty serious.

Xerces offers XMLString::release to deal with this (it’s a Xerces FAQ). It’s just a wrapper that invokes Xerces’ default memory manager. An alternative is to use your own Xerces memory manager.
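
The principle is simply that the library that allocated a block should also free it. Here is a sketch of a small RAII guard built on that principle. The namespace lib is a hypothetical stand-in for a library with its own memory manager; its release mimics the pointer-to-pointer calling convention of XMLString::release, but none of this is Xerces code:

```cpp
#include <cstring>

namespace lib {     // Hypothetical stand-in for a library with its own memory manager.
    int nLiveBlocks = 0;    // Bookkeeping, only for demonstration.

    inline char* duplicate( char const* s )
    {
        char* const result = new char[std::strlen( s ) + 1];
        std::strcpy( result, s );
        ++nLiveBlocks;
        return result;
    }

    // Like XMLString::release it takes the address of the pointer.
    // This stand-in frees the block and then nulls the caller's pointer.
    inline void release( char** pp )
    {
        delete[] *pp;
        *pp = 0;
        --nLiveBlocks;
    }
}

// Guard that always deallocates via the library's own release function.
class LibString
{
private:
    char*   myChars;

    LibString( LibString const& );              // No such.
    LibString& operator=( LibString const& );   // No such.

public:
    explicit LibString( char const* s ): myChars( lib::duplicate( s ) ) {}
    ~LibString() { lib::release( &myChars ); }

    char const* ptr() const { return myChars; }
};
```

The guard can’t be copied or returned from a function, which is the limitation that the wrappers discussed in the rest of this posting address.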

Wrapping via boost::shared_ptr

The code above can be fixed by replacing std::basic_string with a boost::shared_ptr as the “string carrier”, because boost::shared_ptr supports a custom deleter. This trades the nice-to-have but perhaps not ever needed functionality of std::basic_string for correctness and improved efficiency. Where correctness is, of course, the main reason: that UB is serious!

The example below wraps XMLString::release in a function called dispose (wrapping a convenience wrapper…). That’s needed because XMLString::release takes a pointer to the pointer as argument and therefore cannot be used directly as a custom deleter for a boost::shared_ptr:

File [wrapped_by_shared_ptr.cpp], correct and less inefficient:

#include <iostream>
#include <assert.h>
#include <boost/shared_ptr.hpp>

#include <xercesc/util/PlatformUtils.hpp>       // Initialize, Terminate
#include <xercesc/util/XMLString.hpp>           // transcode
#include <xercesc/dom/DOM.hpp>                  // DOMxxx

template< class CharType >
class ZStr      // Zero-terminated string.
{
private:
    boost::shared_ptr< CharType >   myArray;

public:
    ZStr( CharType const* s, void (*deleter)( CharType* ) )
        : myArray( const_cast< CharType* >( s ), deleter )
    { assert( deleter != 0 ); }

    CharType const* ptr() const { return myArray.get(); }
};

namespace myXerces {
    typedef ::XMLCh     Utf16Char;

    inline void dispose( Utf16Char* p ) { xercesc::XMLString::release( &p ); }
    inline void dispose( char* p )      { xercesc::XMLString::release( &p ); }

    inline ZStr< Utf16Char > uStr( char const* s )
    {
        return ZStr< Utf16Char >(
            xercesc::XMLString::transcode( s ), &dispose
            );
    }

    inline ZStr< char > cStr( Utf16Char const* s )
    {
        return ZStr< char >(
            xercesc::XMLString::transcode( s ), &dispose
            );
    }

    struct Lib
    {
        Lib()  { xercesc::XMLPlatformUtils::Initialize(); }
        ~Lib() { xercesc::XMLPlatformUtils::Terminate(); }
    };
}   // namespace myXerces

myXerces::Lib const xercesLibUsage; // Ensures lib initialized for all code in 'main'.

int main()
{
    using namespace std;
    using namespace myXerces;
    using namespace xercesc;

    try
    {
        DOMImplementation&  dom     = *DOMImplementation::getImplementation();

        // *** Conversion char -> XMLCh:
        DOMDocument* const  pDoc    =
            dom.createDocument( 0, uStr( "Bad name!" ).ptr(), 0 );

        cout << "Huh, shouldn't get here!" << endl;
        pDoc->release();
    }
    catch( DOMException const& x )
    {
        // *** Conversion XMLCh -> char:
        cout << "!DOMException: " << cStr( x.getMessage() ).ptr() << endl;
    }
}

Note: the const_cast is a concession to the established convention of writing deleters with type void(*)(T*) instead of void(*)(T const*). The established convention is a bit impractical. But given that it is established, it would probably be even more impractical to require void(*)(T const*).
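
The deleter adapter can also be tried out in isolation, without Xerces installed. This sketch uses C++11’s std::shared_ptr as a stand-in for boost::shared_ptr, and a made-up namespace fakelib in place of the Xerces functions; both substitutions are assumptions made only to keep the example self-contained:

```cpp
#include <cstring>
#include <memory>       // std::shared_ptr (C++11), standing in for boost::shared_ptr

namespace fakelib {     // Hypothetical stand-in for the Xerces allocation functions.
    int nReleased = 0;  // Bookkeeping, only for demonstration.

    inline char* transcode( char const* s )
    {
        char* const result = new char[std::strlen( s ) + 1];
        return std::strcpy( result, s );
    }

    // Pointer-to-pointer convention, as with XMLString::release.
    inline void release( char** pp ) { delete[] *pp; *pp = 0; ++nReleased; }
}

// Adapter: shared_ptr calls its deleter with the pointer itself, so the
// pointer-to-pointer function needs a one-line wrapper.
inline void dispose( char* p ) { fakelib::release( &p ); }

inline std::shared_ptr< char > sharedCopy( char const* s )
{
    return std::shared_ptr< char >( fakelib::transcode( s ), &dispose );
}
```

However many copies of the shared pointer exist, dispose runs exactly once, when the last copy goes away.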

How to get rid of the inefficiency

Correctness is nice, but each string conversion in the code above involves two dynamic allocations: first one performed by XMLString::transcode, and then one performed by boost::shared_ptr, which needs to allocate a shared reference counter object. Contrast this with the fact that in Windows wchar_t is compatible in both size and implied encoding with Xerces’ XMLCh! I.e., for a wchar_t → XMLCh conversion in Windows, code like the above achieves correctness at the cost of two dynamic allocations plus one copying of the string data, when it could have been a simple reinterpret_cast!

Getting rid of that inefficiency involves:

  • instead of char strings, using wchar_t strings as the program’s native strings,
  • detecting the minimal amount of work needed for each conversion, which essentially means discriminating on the size of wchar_t and on the implied encoding of wchar_t, which can be done at compile time, and
  • instead of a sharing smart pointer, using an ownership transfer smart pointer (in practice that’s all that’s needed).
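
The second point, compile time discrimination on the size of wchar_t, could be sketched roughly as follows. This is just one possible way of doing it, not necessarily the implementation I’ll present later. All names (Utf16Char, WideToUtf16, Converter) are made up for this sketch, and it assumes that wchar_t text is Unicode: UTF-16 when wchar_t is 16 bits, UTF-32 otherwise.

```cpp
#include <cstddef>

typedef unsigned short Utf16Char;   // Assumption: stand-in for Xerces' XMLCh.

template< bool wideIs16Bits > struct WideToUtf16;

// wchar_t has the same size as Utf16Char: no allocation, no copying.
template<> struct WideToUtf16< true >
{
    static Utf16Char const* convert( wchar_t const* s )
    {
        return reinterpret_cast< Utf16Char const* >( s );
    }
    static void dispose( Utf16Char const* ) {}      // Nothing to free.
};

// Wider wchar_t, assumed to hold UTF-32: allocate and encode as UTF-16.
template<> struct WideToUtf16< false >
{
    static Utf16Char const* convert( wchar_t const* s )
    {
        std::size_t n = 0;      // Number of UTF-16 code units needed.
        for( wchar_t const* p = s; *p != 0; ++p )
        {
            n += (static_cast< unsigned long >( *p ) > 0xFFFFUL ? 2 : 1);
        }
        Utf16Char* const result = new Utf16Char[n + 1];
        Utf16Char* out = result;
        for( wchar_t const* p = s; *p != 0; ++p )
        {
            unsigned long c = static_cast< unsigned long >( *p );
            if( c > 0xFFFFUL )  // Outside the BMP: emit a surrogate pair.
            {
                c -= 0x10000UL;
                *out++ = static_cast< Utf16Char >( 0xD800 + (c >> 10) );
                *out++ = static_cast< Utf16Char >( 0xDC00 + (c & 0x3FF) );
            }
            else
            {
                *out++ = static_cast< Utf16Char >( c );
            }
        }
        *out = 0;
        return result;
    }
    static void dispose( Utf16Char const* p ) { delete[] p; }
};

// The choice is made by the compiler, with no runtime test.
typedef WideToUtf16< sizeof( wchar_t ) == sizeof( Utf16Char ) > Converter;
```

Calling code pairs Converter::convert with Converter::dispose and doesn’t need to know which variant it got. Having to remember that pairing manually is exactly the burden that an ownership transfer smart pointer with a custom deleter removes.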

The last point is where the cppx::Ownership class enters the picture. The standard smart pointers usable from C++98 code, where by “standard” I mean either in the current C++ standard or in Boost, sadly do not include an ownership transfer smart pointer with custom deleter support (std::unique_ptr, slated for C++0x, offers both, but it depends on rvalue references). cppx::Ownership fills this special niche; it’s usable on its own, and it can be used to implement higher level classes like the ZStr class below.
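
For comparison, with a compiler that supports C++11 the combination of ownership transfer and a custom deleter can be sketched with std::unique_ptr. This is not cppx::Ownership and it doesn’t work in C++98; the namespace standin is a hypothetical replacement for the Xerces functions, and OwnedChars and ownedCopy are made-up names:

```cpp
#include <cstring>
#include <memory>       // std::unique_ptr (C++11)
#include <utility>      // std::move

namespace standin {     // Hypothetical stand-in for the Xerces allocation functions.
    int nReleased = 0;  // Bookkeeping, only for demonstration.

    inline char* transcode( char const* s )
    {
        char* const result = new char[std::strlen( s ) + 1];
        return std::strcpy( result, s );
    }

    // Pointer-to-pointer convention, as with XMLString::release.
    inline void release( char** pp ) { delete[] *pp; *pp = 0; ++nReleased; }
}

inline void dispose( char* p ) { standin::release( &p ); }

// The deleter type is part of the smart pointer type. Only the raw pointer
// and a function pointer are stored; no reference counter is allocated.
typedef std::unique_ptr< char[], void (*)( char* ) > OwnedChars;

inline OwnedChars ownedCopy( char const* s )
{
    return OwnedChars( standin::transcode( s ), &dispose );
}
```

Ownership moves out of the function via the return value and from variable to variable via std::move, and dispose runs exactly once, for the final owner.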

Wrapping via cppx::Ownership

Converting ZStr to ownership transfer semantics is a first step on the road to efficiency: by itself it only gets rid of the allocation done by boost::shared_ptr, while still allowing e.g. ZStr function results.

Here I use the common C++98 ownership transfer trick, discussed in my previous posting about cppx::Ownership.

An ownership transfer version of ZStr cannot rely on implicit generation of the necessary constructors, since C++98 does not directly support ownership transfer. Thus this code has a few more lines. The “Curiously Recurring Template Pattern” base class cppx::OwnershipTransferring supplies a definition of the Ref type (a Ref simply holds a ZStr*), an operator Ref() for implicit ownership transfer from temporaries, and a method Ref transfer() for explicit ownership transfer from lvalues such as variables:

In file [wrapped_by_ownership.cpp], getting rid of one dynamic allocation:

#include <iostream>
#include <assert.h>
#include <progrock/cppx/pointers/Ownership.h>

#include <xercesc/util/PlatformUtils.hpp>       // Initialize, Terminate
#include <xercesc/util/XMLString.hpp>           // transcode
#include <xercesc/dom/DOM.hpp>                  // DOMxxx

using namespace progrock;

template< class CharType >
class ZStr      // Zero-terminated string.
    : public cppx::OwnershipTransferring< ZStr< CharType > >
{
private:
    cppx::Ownership< CharType const >   myArray;

    ZStr( ZStr& );                      // No such.
    ZStr& operator=( ZStr const& );     // No such.

public:
    ZStr( CharType const* s, void (*deleter)( CharType* ) )
        : myArray( s, deleter )
    { assert( deleter != 0 ); }

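    // Ref is a member of a dependent base class, so standard-conforming
    // name lookup needs it named explicitly (MSVC accepts it unqualified).
    typedef typename cppx::OwnershipTransferring< ZStr< CharType > >::Ref Ref;

    ZStr( Ref other ): myArray( 0 ) { swapWith( *other.p ); }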
    void swapWith( ZStr& other ) { myArray.swapWith( other.myArray ); }

    CharType const* ptr() const { return myArray.ptr(); }
};

//... The rest as before.

Ownership transfer support

Since my last posting I’ve added some features to cppx::Ownership in order to better support arrays and “constification” of the pointee. As a result, in the code above the const_cast required with boost::shared_ptr is gone. However, while the functionality now available is like automatic dotting of i’s and crossing of t’s, just convenience :-), it involved making changes all the way down, including tackling some C++ language subtleties, and so it would be rather too much code to present & discuss!

The most fundamental change was however to factor out the support for ownership transfer in a class template cppx::OwnershipTransferring, so that user classes like ZStr above more easily can have ownership transfer semantics themselves.

It goes like this – rather simple (when I at last thought of it!):

In file [progrock/cppx/pointers/Ownership.h]:

namespace progrock{ namespace cppx{

    template< class Derived >
    class OwnershipTransferring
    {
    private:
        Derived* pDerived() { return static_cast< Derived* >( this ); }

    protected:
        struct Ref
        {
            Derived*  p;
            Ref( Derived* o ): p( o ) {}
        };

        OwnershipTransferring( OwnershipTransferring& ) {}
        OwnershipTransferring() {}

    public:
        operator Ref() { return pDerived(); }   // For temporaries.
        Ref transfer() { return pDerived(); }   // For lvalues.
    };

} }  // namespace progrock::cppx
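
To see the mechanism at work without Xerces or the rest of cppx, here is a self-contained sketch. It repeats the class template from the listing above (outside its namespaces, for brevity) and adds a made-up Buffer class and makeBuffer function as the user code:

```cpp
#include <algorithm>    // std::swap

template< class Derived >
class OwnershipTransferring
{
private:
    Derived* pDerived() { return static_cast< Derived* >( this ); }

protected:
    struct Ref
    {
        Derived*  p;
        Ref( Derived* o ): p( o ) {}
    };

    OwnershipTransferring( OwnershipTransferring& ) {}
    OwnershipTransferring() {}

public:
    operator Ref() { return pDerived(); }   // For temporaries.
    Ref transfer() { return pDerived(); }   // For lvalues.
};

// Hypothetical user class: owns one dynamically allocated int.
class Buffer
    : public OwnershipTransferring< Buffer >
{
private:
    int*    myValue;

    Buffer( Buffer& );                      // No such.
    Buffer& operator=( Buffer const& );     // No such.

public:
    explicit Buffer( int v ): myValue( new int( v ) ) {}
    ~Buffer() { delete myValue; }

    // Takes over the pointee, leaving the source empty.
    Buffer( Ref other ): myValue( 0 ) { std::swap( myValue, other.p->myValue ); }

    bool isEmpty() const { return myValue == 0; }
    int value() const { return *myValue; }  // Precondition: !isEmpty().
};

// Returning by value works via operator Ref() followed by Buffer( Ref ),
// or via copy elision, depending on the language version.
inline Buffer makeBuffer( int v ) { return Buffer( v ); }
```

A function result can then initialize a variable directly, as in Buffer a = makeBuffer( 42 );, while transfer between variables has to be spelled out, as in Buffer b( a.transfer() );, after which a is empty. The explicit call is the point: an lvalue never silently loses its contents.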

Summary, so far

I’ve discussed Xerces’ string representation, a common deallocation pitfall, and how to do this correctly using C++ RAII techniques, with a concrete example using boost::shared_ptr.

And I mentioned that for efficiency one can instead

  • use wchar_t strings as the program’s native strings,
  • detect the minimal amount of work needed for each conversion, and, to support that,
  • use an ownership transfer smart pointer.

In this posting I’ve presented code for the last point. This example used cppx::Ownership as the ownership transfer smart pointer. My plan for the next posting is to present and discuss an implementation of the first two points.

Disclaimer

DISCLAIMER: This code has only been tested with MSVC 9.0. The main reason is that the general Xerces installation for MinGW g++ failed (it seemed to create the library fine, but then failed to build some example program), and I did not want to waste a lot of time on figuring out exactly what went wrong. It would have been nice if Xerces could just supply a MinGW g++ specific make file instead of using the general *nix installation.

Anyway, – enjoy!

4 comments on “[cppx] Xerces strings simplified by ownership, part I.”

  1. Pingback: cppx: Xerces strings simplified by ownership, part II. | Alf on programming (mostly C++)

  2. Hi, in C++11, is not this possible:
    typedef std::basic_string<XMLCh> XercesString;
    inline XercesString fromNative(const char * str){
    auto xDeleter=[&](XMLCh buf[]){xercesc::XMLString::release(&buf); };
    std::unique_ptr<XMLCh[],decltype(xDeleter)> ptr(xercesc::XMLString::transcode(str),xDeleter);
    return XercesString(ptr.get());
    }

    inline XercesString fromNative(const std::string & str){
    return fromNative(str.c_str());
    }
    inline std::string toNative(const XMLCh* str){
    auto cDeleter=[&](char buf[]){xercesc::XMLString::release(&buf); };
    std::unique_ptr<char[],decltype(cDeleter)> ptr(xercesc::XMLString::transcode(str),cDeleter);
    return std::string(ptr.get());
    }

    inline std::string toNative(const XercesString & str){
    return toNative(str.c_str());
    }

  3. typedef std::basic_string<XMLCh>XercesString;
    inline XercesString fromNative(const char * str){
    auto xDeleter=[&](XMLCh buf[]){xercesc::XMLString::release(&buf); };
    std::unique_ptr<XMLCh[],decltype(xDeleter)> ptr(xercesc::XMLString::transcode(str),xDeleter);
    return XercesString(ptr.get());
    }

    inline XercesString fromNative(const std::string & str){
    return fromNative(str.c_str());
    }
    inline std::string toNative(const XMLCh* str){
    auto cDeleter=[&](char buf[]){xercesc::XMLString::release(&buf); };
    std::unique_ptr<char[],decltype(cDeleter)> ptr(xercesc::XMLString::transcode(str),cDeleter);
    return std::string(ptr.get());
    }

    inline std::string toNative(const XercesString & str){
    return toNative(str.c_str());
    }
