[cppx] Xerces strings simplified by ownership, part II.

In part I of this series I discussed Xerces’ UTF-16 based string representation, a common deallocation pitfall for such strings, and how to convert to and from such strings correctly by using C++ RAII techniques. For the RAII techniques I presented one solution using boost::shared_ptr, and one more efficient solution using a home-brewed ownership transfer class, cppx::Ownership. And I promised to discuss next (i.e. in this installment) how to do it even more efficiently, by using wchar_t strings as the program’s native strings and detecting the minimal amount of work needed for each conversion.

How to discriminate on wchar_t variants

The most efficient conversion from wchar_t const* to Xerces’ XMLCh const* is a direct reinterpret_cast. The template based solution presented here does not quite achieve that because the conversion-to-Xerces-string function cStr has to have a result type that also can handle the case where a reinterpret_cast is not enough. But it’s quite close to a pure reinterpret_cast for the case where a reinterpret_cast would be enough, namely the case where

  • sizeof(wchar_t) == 2, like Xerces’ XMLCh, and
  • wchar_t is Unicode, which for size 2 implies UTF-16 encoding.

Detecting the size of wchar_t is trivial: just apply the sizeof operator. For detecting whether wchar_t is Unicode I first considered some macro scheme, with perhaps reasonable defaults for the most common systems: Windows, Linux and Mac OS X. But that would be just as bad as web “browser detection”, so instead I applied the KISS principle and defined

In file [progrock/cppx/primitive_types.h]:

    enum { wCharIsUnicode = (L'å' == 0xE5) };   // Checking for Latin-1, but it's good enough.

Unicode is a superset of ISO Latin-1, so the above simply checks whether Norwegian “å” has the same numerical value as its ISO Latin-1 code point (which is also its Unicode code point). As a matter of practicality, to find that code point I typed “write” in the Windows command interpreter, typed an “å” when WordPad popped up, and used [Alt X], which is also supported by Microsoft Word. Alternatively I could have used Windows’ CharMap accessory.

I don’t think there are any other wide character encodings that are extensions of ISO Latin-1 and that are used for encoding wchar_t values, but if you know of one then please do tell me! Then I might have to change the above to an “€” currency sign. Or whatever…
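The two conditions (size 2 and Unicode) can be probed together at compile time. The following is only a sketch with made-up names (wCharIsUnicodeSketch, WCharTraitsSketch, wcharTraits); it writes the “å” as a universal character name so that the source file's own encoding does not matter, which is my choice and not what the cppx header does.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration; only the idea of the probe comes from the article.
// The probe compares the wide literal for U+00E5 ("å") with its Latin-1/Unicode value.
enum { wCharIsUnicodeSketch = (L'\u00E5' == 0xE5) };

struct WCharTraitsSketch
{
    std::size_t size;           // sizeof( wchar_t ) with this compiler
    bool        isUnicode;      // result of the Latin-1 probe above
    bool        castIsEnough;   // true when a plain cast to a 16-bit UTF-16 unit would do
};

inline WCharTraitsSketch wcharTraits()
{
    WCharTraitsSketch t;
    t.size          = sizeof( wchar_t );
    t.isUnicode     = (wCharIsUnicodeSketch != 0);
    t.castIsEnough  = (t.size == 2 && t.isUnicode);
    return t;
}
```

With a typical Windows compiler castIsEnough should come out true, and with a typical Linux compiler (4 byte wchar_t) false, but the sketch makes no promise about any particular system.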

The cStr function uses an extended version of the ZStr zero-terminated string class presented in part I. The essential features of ZStr are that it provides ownership transfer and a custom deleter (since Xerces strings must be deallocated in a special way). cStr() simply delegates the conversion to a class that is templated on the wchar_t size and on whether wchar_t is Unicode or not:

In file [progrock/xml/xxerces/basic_types.h]:

namespace progrock{ namespace xml { namespace xxerces {

    typedef ::XMLCh     Utf16Code;                      // Unicode.
    CPPX_STATIC_ASSERT( sizeof( Utf16Code ) == 2 );     // xercesc guarantees this.

    typedef cppx::ZStr< Utf16Code >     Utf16Str;
    typedef cppx::ZStr< char >          CharStr;
    typedef cppx::ZStr< wchar_t >       WCharStr;

    // ... more stuff here, then:

    // The result from cStr is only guaranteed valid for the lifetime of the argument
    // string (this is much like the std::basic_string::c_str method, hence the name).
    // The ownership handling in the result type is for the /possibility/ of a copy.
    // With 2 byte wchar_t (e.g. Windows) no copy is created, it's then just a cast.

    inline Utf16Str
        cStr( wchar_t const s[] )
    {
        typedef detail::CStrImpl< sizeof( wchar_t ), wchar_t, cppx::wCharIsUnicode > Impl;
        return Impl::func( s );
    }

} } }  // namespace progrock::xml::xxerces

There’s also a utility overload that takes a std::wstring argument.
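The idea of a result type that sometimes owns its buffer and sometimes just borrows the argument can be shown without Xerces or cppx. The sketch below is not ZStr: it swaps in C++11's std::unique_ptr with a function pointer as deleter, uses char16_t as a stand-in for XMLCh, and all the names (MaybeOwnedStr, borrow, makeCopy, noDeleteSketch, arrayDeleteSketch) are invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

typedef char16_t Unit16;                                // Stand-in for XMLCh (assumption).
typedef void (*Unit16Deleter)( Unit16 const* );
typedef std::unique_ptr< Unit16 const[], Unit16Deleter > MaybeOwnedStr;

inline void noDeleteSketch( Unit16 const* ) {}                      // Borrowed: nothing to free.
inline void arrayDeleteSketch( Unit16 const* p ) { delete[] p; }    // Owned: free the copy.

// The "just a cast" case: the result refers to the caller's buffer and must not outlive it.
inline MaybeOwnedStr borrow( Unit16 const* s )
{
    return MaybeOwnedStr( s, &noDeleteSketch );
}

// The "a copy was needed" case: the result owns a new buffer and frees it on destruction.
inline MaybeOwnedStr makeCopy( Unit16 const* s )
{
    std::size_t n = 0;
    while( s[n] != 0 ) { ++n; }

    Unit16* const buf = new Unit16[n + 1];
    for( std::size_t i = 0;  i <= n;  ++i ) { buf[i] = s[i]; }      // Includes the terminator.

    Unit16 const* const owned = buf;
    return MaybeOwnedStr( owned, &arrayDeleteSketch );
}
```

Both functions return the same type, so the caller does not need to know which case applied. That is the property cStr relies on.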

The CStrImpl class template additionally takes the wide character type itself as a template parameter, as you can see above. Firstly, this facilitates testing of both the size 2 and the size 4 implementation with a single compiler. And secondly, it keeps the implementations templated (partially, not fully, specialized), so that the implementations that are irrelevant for a given compiler and system are simply not instantiated. That irrelevant code can even be invalid, or missing!

And in fact the implementations for non-Unicode wchar_t are missing below. I haven’t coded them. It would involve transcoding using Xerces, and I didn’t bother with that:

In file [progrock/xml/xxerces/basic_types.h]:

namespace progrock{ namespace xml { namespace xxerces {

    namespace detail {

        template< cppx::Size wcharSize, class StdWChar, bool wCharIsUnicode >
        struct CStrImpl;

        template< class StdWChar >
        struct CStrImpl< 2, StdWChar, true >
        {
            static Utf16Str
                func( StdWChar const s[] )
            {
                // Same size, same encoding: no copy, the result just refers to s.
                return Utf16Str(
                    reinterpret_cast< Utf16Code const* >( s ), cppx::noDelete
                    );
            }
        };

        template< class StdWChar >
        struct CStrImpl< 4, StdWChar, true >
        {
            static Utf16Str
                func( StdWChar const s[] )
            {
                // UTF-32 source: transcode into a new buffer owned by the result.
                return cppx::reinterpretAs< Utf16Code >(
                    cppx::unicode::utf16( cppx::codePointPtr( s ) )
                    );
            }
        };

    }  // namespace detail

} } }  // namespace progrock::xml::xxerces

The size 4 implementation delegates the actual work to cppx::unicode::utf16(), which does a simple UTF-32 to UTF-16 transcoding.
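The dispatch itself can be tried in isolation with a toy that only reports which path would be chosen. This is a sketch under my own assumptions, not the article's CStrImpl: the names ConversionKind, KindSketch and conversionKindFor are hypothetical, C++11's char16_t and char32_t are used as convenient 2 and 4 byte test types, and the “is Unicode” flag is simply hard-wired to true.

```cpp
#include <cassert>
#include <cstddef>

enum KindSketch { castOnly, transcode };

// Primary template intentionally left undefined: an unsupported combination of
// size and encoding is then a compile time error instead of a silent misbehavior.
template< std::size_t unitSize, class Unit, bool isUnicode >
struct ConversionKind;

template< class Unit >
struct ConversionKind< 2, Unit, true >
{
    static KindSketch kind() { return castOnly; }       // 2 byte Unicode: a cast suffices.
};

template< class Unit >
struct ConversionKind< 4, Unit, true >
{
    static KindSketch kind() { return transcode; }      // 4 byte Unicode: UTF-32 to UTF-16.
};

// Selects the implementation from the size of the character type.
// Assumption for this sketch: the character type is Unicode.
template< class Unit >
inline KindSketch conversionKindFor()
{
    return ConversionKind< sizeof( Unit ), Unit, true >::kind();
}
```

Because the character type is a template parameter, both specializations can be exercised with one compiler, e.g. ConversionKind< 2, char16_t, true > and ConversionKind< 4, char32_t, true >, regardless of that compiler's wchar_t size.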

The “abstraction overhead” stems from these function calls, because with the usual C++98 ownership transfer trick the compiler is probably unable to recognize that these calls and apparent conversions can be elided. I haven’t checked the generated machine code, but I expect the calls to be there regardless of compiler and optimization options: compilers are not intelligent, they just embody simple pattern matching and rules based on common code patterns, which probably do not include ownership transfer patterns. But this overhead is of fixed size and quite small, since it only involves some pointer copying.

How to convert from UTF-32 to UTF-16

Original Unicode had exactly 16 bits per code point. Windows NT and programming languages like Java were subsequently based on a 16 bit wide character type. So in order to extend Unicode some solution was needed that would allow all that old “Unicode is 16-bit” code to continue working.

The set of original Unicode code points (i.e. the set of 16-bit unsigned integers) is called the BMP, short for the Basic Multilingual Plane.

And the solution that was adopted was to represent an extended code point as two 16-bit BMP code points. This is called a surrogate pair. With UTF-32 each Unicode code point is represented as a single 32 bit value, which directly is the code point’s value. With UTF-16 each BMP code point is represented as itself, and each code point outside the BMP is represented as two successive BMP code points. The first is the most significant, called the high surrogate, and the second is the least significant, called the low surrogate.
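To make the high/low distinction concrete, here is a small sketch that classifies UTF-16 units and counts the code points in a UTF-16 string. It is not part of cppx; the function names are made up, the C++11 <cstdint> types are my choice, and the input is assumed to be well-formed (every high surrogate is followed by a low surrogate). The numeric ranges are the ones reserved by Unicode for surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// First unit of a pair: 0xD800 through 0xDBFF.
inline bool isHighSurrogate( std::uint16_t unit )
{
    return 0xD800u <= unit && unit < 0xDC00u;
}

// Second unit of a pair: 0xDC00 through 0xDFFF.
inline bool isLowSurrogate( std::uint16_t unit )
{
    return 0xDC00u <= unit && unit < 0xE000u;
}

// Counts code points in a zero-terminated UTF-16 string, assuming well-formed input.
inline std::size_t codePointCount( std::uint16_t const s[] )
{
    std::size_t n = 0;
    for( std::size_t i = 0;  s[i] != 0;  ++i )
    {
        // A low surrogate only completes the pair started by the preceding unit.
        if( !isLowSurrogate( s[i] ) ) { ++n; }
    }
    return n;
}
```

For example, the four units 0x0041, 0xD83D, 0xDE00, 0x00E5 hold three code points, since the two middle units form one surrogate pair.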

In file [progrock/cppx/text/unicode/utf.h]:

namespace progrock{ namespace cppx{ namespace unicode {

    namespace detail {
        template< class Iter, class Value >
        inline void put( Iter& it, Value v ) { *it = v;  ++it; }
    }  // namespace detail

    //--------------------------------------- UTF32 -> UTF16

    template< class CodePoint32Iter, class CodePoint16Iter >
    inline CodePoint16Iter
        copyToUtf16(
            CodePoint32Iter const   startSource,
            CodePoint32Iter const   endSource,
            CodePoint16Iter const   destination
            )
    {
        CodePoint16Iter     out = destination;

        for( CodePoint32Iter it = startSource;  it != endSource;  ++it )
        {
            CodePoint32 const   cp  = *it;

            if( bmpRange.contains( cp ) )
            {
                detail::put( out, static_cast< CodePoint16 >( cp ) );
            }
            else
            {
                Surrogate const surrogate( cp );

                detail::put( out, surrogate.high() );
                detail::put( out, surrogate.low() );
            }
        }
        return out;
    }

    template< class CodePoint32Iter >
    inline Size
        utf16Length( CodePoint32Iter const start, CodePoint32Iter const end )
    {
        return copyToUtf16( start, end, DummyOutIter< CodePoint16 >() ).pos();
    }

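    struct StrAndUtf16Length
    {
        Size    str;        // Number of UTF-32 code points, excluding the terminator.
        Size    utf16;      // Number of UTF-16 units needed, excluding the terminator.

        StrAndUtf16Length( CodePoint32 const s[] )
        {
            CodePoint32 const   endOfBMP    = bmpRange.end;

            cppx::Size      n;
            cppx::Index     i;

            for( i = 0, n = 0;  s[i] != 0;  ++i )
            {
                // One unit per BMP code point, two for a code point beyond the BMP.
                n += 1 + (s[i] >= endOfBMP);
            }
            str = i;  utf16 = n;
        }
    };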

    inline ZStr16
        utf16( CodePoint32 const s[] )
    {
        StrAndUtf16Length const     length( s );
        Ownership< CodePoint16 >    buf( makeNewArray, length.utf16 + 1 );

        copyToUtf16( s, s + length.str, buf.ptr() );
        return buf.constify();
    }

} } }  // namespace progrock::cppx::unicode

In StrAndUtf16Length I introduced special-case code doing the same as could be easily accomplished using the earlier more general code (and in fact I initially used the general code), because the idea of needless inefficiency started to gnaw at me. I haven’t measured. Thus, I’ve committed the gravest sin, listening to a gut feeling and adding more testing overhead and a possible bug vector in exchange for just a potential and probably very marginal efficiency improvement, but it eased my mind.
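The count first, allocate once, then copy scheme can be tried without the cppx machinery. The following is a minimal sketch and not the cppx code: it uses std::vector instead of Ownership, C++11's <cstdint> types instead of CodePoint16 and CodePoint32, and the names utf16UnitCount and toUtf16Sketch are invented here. It assumes valid input, i.e. no surrogate values and nothing above U+10FFFF in the source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pass 1: number of UTF-16 units needed for a zero-terminated UTF-32 string
// (terminator not counted). One unit per BMP code point, two for any other.
inline std::size_t utf16UnitCount( std::uint32_t const s[] )
{
    std::size_t n = 0;
    for( std::size_t i = 0;  s[i] != 0;  ++i )
    {
        n += 1 + (s[i] >= 0x10000u);
    }
    return n;
}

// Pass 2: transcode into a buffer that is sized once, up front.
inline std::vector< std::uint16_t > toUtf16Sketch( std::uint32_t const s[] )
{
    std::vector< std::uint16_t > result;
    result.reserve( utf16UnitCount( s ) + 1 );

    for( std::size_t i = 0;  s[i] != 0;  ++i )
    {
        std::uint32_t const cp = s[i];
        if( cp < 0x10000u )
        {
            result.push_back( static_cast< std::uint16_t >( cp ) );
        }
        else
        {
            std::uint32_t const offset = cp - 0x10000u;     // At most 20 bits.
            result.push_back( static_cast< std::uint16_t >( 0xD800u + (offset >> 10) ) );
            result.push_back( static_cast< std::uint16_t >( 0xDC00u + (offset & 0x3FFu) ) );
        }
    }
    result.push_back( 0 );      // Zero terminator.
    return result;
}
```

For the three code points U+0041, U+1F600 and U+00E5 the count is 4 units, and the result is 0x0041, 0xD83D, 0xDE00, 0x00E5 followed by the terminating zero.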

Conversion to and from surrogate pairs

The Surrogate class used in copyToUtf16 above is pretty simple. Originally I had this as direct inline code, where it was even simpler, just a few lines. But Surrogate is a natural abstraction, and it can be tested on its own, which yields a higher confidence that the code is correct.

The main idea is to deal with offsets. A non-BMP code point is not represented directly: instead the offset of that code point from the end of the BMP range is represented. This offset comprises at most 20 bits, so the upper 10 bits are represented by the high surrogate, and the lower 10 bits are represented by the low surrogate.

And of course those 10-bit values are just offsets, namely offsets into the respective surrogate value ranges:

In file [progrock/cppx/text/unicode/Surrogate.h]:

namespace progrock{ namespace cppx{ namespace unicode {

    enum HighAndLow { highAndLow };

    class Surrogate
    {
    private:
        CodePoint32     myCpOffset;
        CodePoint16     myHighSurrogateOffset;
        CodePoint16     myLowSurrogateOffset;

    public:
        Surrogate( CodePoint32 cp )
        {
            assert( !bmpRange.contains( cp ) );

            myCpOffset              = cp - bmpRange.end;
            myHighSurrogateOffset   =
                static_cast< CodePoint16 >( myCpOffset >> bitsPerHalfSurrogate );
            myLowSurrogateOffset    =
                static_cast< CodePoint16 >( myCpOffset & halfSurrogateSpanMask );
        }

        Surrogate( HighAndLow, CodePoint16 highCp, CodePoint16 lowCp )
        {
            assert( highSurrogateRange.contains( highCp ) );
            assert( lowSurrogateRange.contains( lowCp ) );

            myHighSurrogateOffset   = highCp - highSurrogateRange.start;
            myLowSurrogateOffset    = lowCp - lowSurrogateRange.start;
            myCpOffset              =
                (static_cast< CodePoint32 >( myHighSurrogateOffset ) << bitsPerHalfSurrogate)
                | myLowSurrogateOffset;
        }

        CodePoint16 high() const
        {
            return highSurrogateRange.start + myHighSurrogateOffset;
        }

        CodePoint16 low() const
        {
            return lowSurrogateRange.start + myLowSurrogateOffset;
        }

        CodePoint32 codePoint() const
        {
            return bmpRange.end + myCpOffset;
        }
    };

} } }  // namespace progrock::cppx::unicode
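The offset arithmetic is easy to check by hand on a single code point. The sketch below does the same computation as the Surrogate class but with plain literal constants and free functions; the names PairSketch, encodePair and decodePair are hypothetical, and the C++11 <cstdint> types are used for brevity.

```cpp
#include <cassert>
#include <cstdint>

struct PairSketch
{
    std::uint16_t high;
    std::uint16_t low;
};

// Splits the 20-bit offset from the end of the BMP into two 10-bit halves.
inline PairSketch encodePair( std::uint32_t cp )
{
    assert( 0x10000u <= cp && cp < 0x110000u );

    std::uint32_t const offset = cp - 0x10000u;
    PairSketch p;
    p.high  = static_cast< std::uint16_t >( 0xD800u + (offset >> 10) );
    p.low   = static_cast< std::uint16_t >( 0xDC00u + (offset & 0x3FFu) );
    return p;
}

// Reassembles the offset from the two halves and adds back the end of the BMP.
inline std::uint32_t decodePair( PairSketch p )
{
    std::uint32_t const offset =
        ((p.high - 0xD800u) << 10) | (p.low - 0xDC00u);
    return 0x10000u + offset;
}
```

Worked by hand for U+1F600: the offset is 0x1F600 - 0x10000 = 0xF600, its upper 10 bits are 0x03D and its lower 10 bits are 0x200, so the pair is 0xD800 + 0x03D = 0xD83D and 0xDC00 + 0x200 = 0xDE00.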

Relevant Unicode code point ranges

The possible high and low surrogate values are confined to two special ranges within the BMP. Each such range constitutes 2^10 code points, i.e. it suffices to encode a 10-bit value. So, a surrogate pair can encode 2×10 = 20 bits, which means that with this scheme Unicode allows for 2^20 code points in addition to the 2^16 code points of the BMP, minus the 2^11 code points reserved within the BMP for high and low surrogates.
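Spelled out as arithmetic, with constant names that are made up for this sketch only:

```cpp
#include <cassert>

// Hypothetical names; the values follow directly from the bit counts in the text.
unsigned long const bmpCountSketch          = 1uL << 16;    // 65536 code points in the BMP
unsigned long const pairCountSketch         = 1uL << 20;    // 1048576 values a pair can encode
unsigned long const surrogateCountSketch    = 1uL << 11;    // 2048 = 2 ranges of 1024 each

// All code points: 0 through 0x10FFFF.
unsigned long const allCodePointsSketch     = bmpCountSketch + pairCountSketch;

// Code points left for actual characters once the surrogates are excluded.
unsigned long const usableCodePointsSketch  = allCodePointsSketch - surrogateCountSketch;
```

That gives 0x110000 = 1114112 code points in total, of which 1112064 remain after the surrogate ranges are subtracted.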

In file [progrock/cppx/text/unicode/Surrogate.h]:

namespace progrock{ namespace cppx{ namespace unicode {

    // The BMP (Basic Multilingual Plane) corresponds to original 16-bit Unicode and
    // constitutes the first 64K code points. "High" and "low" surrogates refer to
    // the most and least significant part of the code point represented by a pair.

    namespace detail{
        CodePoint16 const   hs_start        = 0xD800u;
        CodePoint16 const   hs_end          = 0xDC00u;
        CodePoint16 const   ls_start        = 0xDC00u;
        CodePoint16 const   ls_end          = 0xE000u;
        CodePoint32 const   bmp_end         = 0x00010000u;
        CodePoint32 const   cp_end          = 0x00110000u;
    }  // namespace detail

    template< class Type >
    struct Range
    {
        Type const  start;
        Type const  end;        // One past the last value, i.e. the range is half-open.

        Type span() const
        { return end - start; }

        bool contains( CodePoint32 cp ) const
        { return (start <= cp && cp < end); }
    };

    typedef Range< CodePoint16 >    Range16;
    typedef Range< CodePoint32 >    Range32;

    CodePoint16 const   bitsPerHalfSurrogate        = 10;
    CodePoint16 const   halfSurrogateSpan           = 0x0400u;
    CodePoint16 const   halfSurrogateSpanMask       = halfSurrogateSpan - 1;
    CodePoint32 const   bitsPerSurrogate            = 2*bitsPerHalfSurrogate;   // 20
    CodePoint32 const   surrogateRepresentableSpan  = 0x00100000;

    CPPX_STATIC_ASSERT( halfSurrogateSpan == (1uL << bitsPerHalfSurrogate) );
    CPPX_STATIC_ASSERT( surrogateRepresentableSpan == (1uL << bitsPerSurrogate) );

    Range16 const       highSurrogateRange          = { detail::hs_start, detail::hs_end };
    Range16 const       lowSurrogateRange           = { detail::ls_start, detail::ls_end };
    Range16 const       surrogateRange              = { detail::hs_start, detail::ls_end };

    CPPX_STATIC_ASSERT( detail::hs_end - detail::hs_start == halfSurrogateSpan );
    CPPX_STATIC_ASSERT( detail::ls_end - detail::ls_start == halfSurrogateSpan );
    CPPX_STATIC_ASSERT( detail::hs_end == detail::ls_start );

    Range32 const       bmpRange                    = { 0, detail::bmp_end };
    Range32 const       codePointRange              = { 0, detail::cp_end };
    CodePoint32 const   bitsPerCodePoint            = bitsPerSurrogate + 1;     // 21

    CPPX_STATIC_ASSERT( detail::bmp_end == (1uL << 16) );
    CPPX_STATIC_ASSERT( detail::cp_end == detail::bmp_end + surrogateRepresentableSpan );
    CPPX_STATIC_ASSERT( detail::cp_end <= (1uL << bitsPerCodePoint ) );

    // The U+FFFD REPLACEMENT_CHARACTER is the general substitute character in the Unicode
    // Standard. It can be substituted for any “unknown” character in another encoding that
    // cannot be mapped in terms of known Unicode characters (see the Unicode standard
    // Section 5.3, Unknown and Missing Characters).
    CodePoint32 const   replacementCh   = 0x0000FFFD;

} } }  // namespace progrock::cppx::unicode
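The ranges above are half-open: start is included and end is not, which is why the high surrogate range can end exactly where the low surrogate range starts without any overlap. A stand-alone sketch of that convention, with hypothetical names (RangeSketch, highSketch, lowSketch) and built-in types instead of the cppx typedefs:

```cpp
#include <cassert>

template< class Type >
struct RangeSketch
{
    Type    start;      // First value in the range.
    Type    end;        // One past the last value in the range.

    Type span() const
    { return end - start; }

    bool contains( unsigned long v ) const
    { return (start <= v && v < end); }
};

RangeSketch< unsigned short > const highSketch  = { 0xD800u, 0xDC00u };
RangeSketch< unsigned short > const lowSketch   = { 0xDC00u, 0xE000u };
```

Here 0xDBFF is the last high surrogate and 0xDC00 belongs to the low range only, and each span is 0x400 = 1024 values.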

Code point types

Since code points are manipulated and interpreted at the bit level they need (practically) to be represented as unsigned integers:

In file [progrock/cppx/text/codepoint_types.h]:

namespace progrock{ namespace cppx{

    typedef UInt8       CodePoint8;
    typedef UInt16      CodePoint16;
    typedef UInt32      CodePoint32;

    template< Size nBytes >
    struct CodePointOfSize;

    template<> struct CodePointOfSize<1> { typedef CodePoint8     T; };
    template<> struct CodePointOfSize<2> { typedef CodePoint16    T; };
    template<> struct CodePointOfSize<4> { typedef CodePoint32    T; };

    // ... + conversions, string duplication function, and so on.

} }  // namespace progrock::cppx
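A size-to-type mapping like CodePointOfSize lets code pick the unsigned code point type that matches wchar_t without any platform macros. The sketch below illustrates that use; it is an assumption about how such a mapping might be applied, not the elided cppx conversions, and the names CodePointOfSizeSketch, WCharCodePoint and codePointOf are invented. It uses C++11's <cstdint> types in place of UInt8, UInt16 and UInt32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

template< std::size_t nBytes >
struct CodePointOfSizeSketch;       // Undefined for unsupported sizes.

template<> struct CodePointOfSizeSketch<1> { typedef std::uint8_t   T; };
template<> struct CodePointOfSizeSketch<2> { typedef std::uint16_t  T; };
template<> struct CodePointOfSizeSketch<4> { typedef std::uint32_t  T; };

// The unsigned type with the same size as this compiler's wchar_t.
typedef CodePointOfSizeSketch< sizeof( wchar_t ) >::T   WCharCodePoint;

// Converts one wide character to its unsigned code point (or code unit) value.
inline WCharCodePoint codePointOf( wchar_t c )
{
    return static_cast< WCharCodePoint >( c );
}
```

The result is unsigned even where wchar_t itself is a signed type, so shifts and masks on it behave predictably.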

An example usage

On page umpteen-hundred of his Eiffel book Bertrand Meyer remarked that some students had difficulties understanding what it was all about until they’d seen a working Eiffel “Hello, world!” – which he then, at page umpteen-hundred plus one, presented & discussed…

Happily I remembered to present the basic examples already in part I! 🙂

E.g., I showed the following way of providing a Xerces string argument by using a function uStr that produced a ZStr instance with custom deleter:

Just an example:

DOMDocument* const  pDoc    =
    dom.createDocument( 0, uStr( "Bad name!" ).ptr(), 0 );

And with the current discussion almost nothing has changed regarding how-to-use. I changed the name of that function to cStr, to associate it more with how the result of std::basic_string::c_str is only valid for the lifetime of the argument data; I changed the formal argument to wide string; and I also provided an implicit conversion instead of having to invoke a ptr method. But that’s just cosmetics.

What’s changed with the functionality that I have discussed above is that now cStr does not necessarily do more than copy a few pointers around: if the compiler’s wchar_t type fits (size 2, is Unicode, which is so in Windows) then the part I uStr function’s dynamic allocation and transcoding is completely avoided! 🙂


DISCLAIMER: This code has only been tested with MSVC. The main reason is that the general Xerces installation for MinGW g++ failed (it seemed to create the library fine, but then failed to build some example program), and I did not want to waste a lot of time on figuring out exactly what went wrong. It would have been nice if Xerces could just supply a MinGW g++ specific makefile instead of using the general *nix installation.

Anyway, – enjoy!

