A wrapper for UTF-8 Windows apps.

The Windows API is based on UTF-16 encoded wide text. For example, the API function CommandLineToArgvW, which parses a command line, only exists in a wide text version. But the introduction of support for UTF-8 as the process code page in the May 2019 update of Windows 10 greatly increases the incentive to use UTF-8 encoded narrow text internally in Windows GUI applications, i.e. using UTF-8 as the C++ execution character set also in Windows.

This article presents a minimal example of that (a message box with international text, using Windows’ char based API wrappers); shows one way to configure Windows with UTF-8 as the ANSI code page default; and shows how to build such a program with the MinGW g++ and Visual C++ toolchains.

This is discussed in that order in the following sections:

  1. A minimal example.
  2. The header wrappers.
  3. Configuring Windows with UTF-8 as the ANSI code page default.
  4. An application manifest resource specifying UTF-8.
  5. Building with MinGW g++ and with Visual C++.

I apologize for the less than perfect formatting and possible odd things. Every time I edit WordPress removes all instances of the text <windows.h> and wreaks havoc on the rest. This article was originally written as a GitHub-compatible markdown file but it turned out that markdown syntax highlighting, and a lot more, didn’t work in WordPress, so the text had to be very manually re-expressed as a sequence of WordPress “blocks”.

1. A minimal example.

With a suitable wrapper for <windows.h> the C++ code of a program that displays a Windows message box with international text can now be as simple as this:

minimal.cpp

#include <header-wrappers/winapi/windows-h.utf8.hpp>

auto main()
    -> int
{
    const auto& text    = "Every 日本国 кошка loves Norwegian blåbærsyltetøy, nom nom!";
    const auto& title   = "Some Norwegian, Russian & Chinese text in there:";
    MessageBox( 0, text, title, MB_ICONINFORMATION | MB_SETFOREGROUND );
}

Result when the program is built with a specification of UTF-8 as process code page, or alternatively is run in a Windows installation configured with UTF-8 as the ANSI code page default:

Image of OK messagebox

In contrast, here is what it looks like when a corresponding program using <windows.h> directly is built without a specification of UTF-8 as process code page and the Windows ANSI default is codepage 1252, Windows ANSI Western, as in the old days:

Image of ungood Windows ANSI Western messagebox
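The gibberish has a simple mechanical explanation: each byte of the UTF-8 encoded text is taken to be one Windows ANSI Western character. The following portable sketch mimics that misreading. The function name is hypothetical, and as a simplification it assumes Latin-1 instead of the real codepage 1252, which differs from Latin-1 for the bytes 0x80 through 0x9F:

```cpp
#include <string>

// Hypothetical helper: treats every byte of `bytes` as one Latin-1 character,
// and returns those characters encoded as UTF-8. Simplification: real codepage
// 1252 maps the bytes 0x80 through 0x9F differently than Latin-1 does.
inline auto utf8_misread_as_latin1( const std::string& bytes )
    -> std::string
{
    std::string result;
    for( const char ch: bytes ) {
        const unsigned char b = static_cast<unsigned char>( ch );
        if( b < 0x80 ) {
            result += ch;       // ASCII is the same in both encodings.
        } else {
            // A code point in the range 0x80 through 0xFF is two bytes in UTF-8.
            result += static_cast<char>( 0xC0 | (b >> 6) );
            result += static_cast<char>( 0x80 | (b & 0x3F) );
        }
    }
    return result;
}
```

For example, the two UTF-8 bytes of “å”, 0xC3 and 0xA5, come out as the two characters “Ã¥”.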

2. The header wrappers.

The wrapper <header-wrappers/winapi/windows-h.utf8.hpp> supports this new “like ordinary C++” kind of Windows application:

  • it increases the C++-compatibility of <windows.h> by suppressing the min and max macro definitions via option NOMINMAX and by asking for more strictly typed declarations via option STRICT, plus it reduces the size of this gargantuan include (e.g. just now from 80 287 lines to 54 426 lines, with MinGW g++), via option WIN32_LEAN_AND_MEAN,
  • it makes the char based …A-functions such as MessageBoxA available without suffix, i.e. for that example as just MessageBox, by ensuring that option UNICODE is not defined, and
  • it asserts that the effective process codepage is UTF-8, which it might or might not be.

header-wrappers/winapi/windows-h.utf8.hpp

#pragma once
#undef UTF8_WINAPI
#define UTF8_WINAPI
#include "windows-h.hpp"

namespace uuid_0985060C_1AAD_453C_B3F9_A2E543F4CF1E {
    struct Winapi_envelope
    {
        Winapi_envelope()
        {
            static const bool dummy = winapi_h_assert_utf8_codepage();
        }
    };
    
    static const Winapi_envelope ensured_globally_single_utf8_assertion{};
}  // namespace uuid_0985060C_1AAD_453C_B3F9_A2E543F4CF1E

The little complexity above could be avoided by using a C++17 inline variable. It would be more in the C++ spirit of coding to absolutely maximum performance and least verbosity, when there is a choice. However, many people are still stuck with earlier C++ standards, and though a fallback using static instead could be automatically provided, the header would then require Visual C++ 2019 users to add option /Zc:__cplusplus, which is not presently supported by the Visual Studio GUI.
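For comparison, here is a minimal sketch of what the C++17 inline variable alternative could look like. It’s portable only because it uses a hypothetical stand-in for winapi_h_assert_utf8_codepage; all the names in it are made up for the illustration and it is not part of the wrapper:

```cpp
// Requires C++17 or later.

// Hypothetical stand-in for winapi_h_assert_utf8_codepage(), which needs <windows.h>.
// It counts its calls so that the single initialization can be observed.
inline auto n_codepage_checks()
    -> int&
{
    static int n = 0;
    return n;
}

inline auto stand_in_assert_utf8_codepage()
    -> bool
{
    ++n_codepage_checks();
    return true;
}

// A C++17 inline variable: a single object for the whole program, initialized once,
// even when this header is included in many translation units.
inline const bool utf8_codepage_is_asserted = stand_in_assert_utf8_codepage();
```

With this approach neither the uuid-named namespace nor the Winapi_envelope class would be needed.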

Except for that issue the wrapper is designed to be a trivial top-level wrapper so that one can replace it with one’s own equally trivial top-level wrapper, for example in order to communicate a “not UTF-8 process code page” failure to the user in a manner of one’s own choosing.

To wit, the wrapper delegates the first two points to a more basic wrapper <header-wrappers/winapi/windows-h.hpp>, which goes like this:

header-wrappers/winapi/windows-h.hpp

#pragma once
#ifdef MessageBox
#   error "<windows.h> has already been included, possibly with undesired options."
#endif

#include <assert.h>
#ifdef _MSC_VER
#   include <iso646.h>                  // Standard `and` etc. also with MSVC.
#endif

#ifndef _WIN32_WINNT
#   define _WIN32_WINNT     0x0600      // Windows Vista as earliest supported OS.
#endif
#undef WINVER
#define WINVER _WIN32_WINNT

#define IS_NARROW_WINAPI() \
    ("Define UTF8_WINAPI please.", sizeof(*GetCommandLine()) == 1)

#define IS_WIDE_WINAPI() \
    ("Define UNICODE please.", sizeof(*GetCommandLine()) > 1)

// UTF8_WINAPI is a custom macro for this file. UNICODE, _UNICODE and _MBCS are MS macros.
#if defined( UTF8_WINAPI) and defined( UNICODE )
#   error "Inconsistent encoding options, both UNICODE (UTF-16) and UTF8_WINAPI (UTF-8)."
#endif

#undef UNICODE
#undef _UNICODE
#ifdef UTF8_WINAPI
#   define _MBCS        // Mainly for 3rd party code that uses it for platform detection.
#else
#   define UNICODE
#   define _UNICODE     // Mainly for 3rd party code that uses it for platform detection.
#endif
#undef NOMINMAX
#define NOMINMAX
#undef STRICT
#define STRICT
#undef WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN
// After this an `#include <winsock2.h>` will actually include that header.

#include <windows.h>

inline auto winapi_h_assert_utf8_codepage()
    -> bool
{
    #ifdef __GNUC__
        #pragma GCC diagnostic push
        #pragma GCC diagnostic ignored "-Wunused-value"
    #endif
    assert(( "The process codepage isn't UTF-8 (old Windows?).", GetACP() == 65001 ));
    #ifdef __GNUC__
        #pragma GCC diagnostic pop
    #endif
    return true;
}

3. Configuring Windows with UTF-8 as the ANSI code page default.

For portability the program should best be built with UTF-8 process code page specified as an application manifest resource. Alternatively it will work to configure Windows with UTF-8 as the Windows ANSI default, provided it’s Windows 10 with the update of May 2019, or later. But probably few if any ordinary users will want to configure their Windows, or to let a program do that just in order to run that program.

You as a developer may however find the convenience of UTF-8 as the Windows ANSI default highly desirable.

It worked for me to change the item ACP to value 65001, in the semi-documented registry key

Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage

You can change the value e.g. via the “regedit” GUI utility, or the reg command. You must then reboot Windows for the change to take effect.

And voilà! 🙂

But wait! …

Before doing that let’s build the program with a manifest that specifies UTF-8 as process code page. That way the program will work on any post-May-2019 Windows 10 installation, not just “it works on my computer!”. The <windows.h> wrapper shown above ensures that it will not mistakenly run and present gibberish on an earlier Windows version.

4. An application manifest resource specifying UTF-8.

An application manifest is a UTF-8 encoded XML file that, if properly magically named, can just be shipped with the application, but that’s best embedded as a resource in the executable.

The following manifest specifies both UTF-8 as process code page, and that the app uses version 6.0 or later of the Common Controls DLL, as opposed to an earlier version that has the same DLL name. The Common Controls DLL version gives a modern look and feel to buttons, menus, lists, edit fields etc. Why that’s not the default, or why it requires this heavy machinery to specify, would be a mystery if Microsoft were an ordinary company.

Anyway, the text:

resources/application-manifest.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
    <assemblyIdentity type="win32" name="UTF-8 message box" version="1.0.0.0"/>
    <application>
        <windowsSettings>
            <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings"
                >UTF-8</activeCodePage>
        </windowsSettings>
    </application>
    <dependency>
        <dependentAssembly>
            <assemblyIdentity
                type="win32"
                name="Microsoft.Windows.Common-Controls"
                version="6.0.0.0"
                processorArchitecture="*"
                publicKeyToken="6595b64144ccf1df"
                language="*"
                />
        </dependentAssembly>
    </dependency>
</assembly>

Beware: at the time of writing there could be no whitespace such as space or newline on either side of the “UTF-8” activeCodePage value, and it had to be all uppercase.

5. Building with MinGW g++ and with Visual C++.

One way to include that text as a resource in the executable is to use a general resource script:

resources.rc

#include <windows.h>
1 RT_MANIFEST "resources/application-manifest.xml"

Here the 1 is the resource id, and RT_MANIFEST is the resource type (it’s defined as a small integer, namely 24, wrapped via MAKEINTRESOURCE).

With the MinGW GNU tools this script is compiled by windres into an apparently ordinary object file, which is just linked with the main program object file:

[G:\code\minimal_gui\binaries]
> set CFG=-std=c++17 -Wall -pedantic-errors

[G:\code\minimal_gui\binaries]
> g++ %CFG% ..\minimal.cpp -c -o minimal.o

[G:\code\minimal_gui\binaries]
> windres ..\resources.rc -o resources.o

[G:\code\minimal_gui\binaries]
> g++ minimal.o resources.o -mwindows

Here the -mwindows option specifies the GUI subsystem for the executable, so that Windows doesn’t pop up a console window when one runs the program from Windows Explorer.

With Microsoft’s tools the script is compiled by rc into a special binary resource format in a .res file, which is just linked with the main program object file. Options can be passed to the compiler cl.exe via the environment variable CL, and to the linker link.exe via the environment variable LINK. Using an obscure linker option is unfortunately necessary for building a GUI subsystem executable with a standard C++ main function with this toolchain:

[G:\code\minimal_gui\binaries]
> set CL=^
More? /nologo ^
More? /utf-8 /EHsc /GR /FI"iso646.h" /std:c++17 /Zc:__cplusplus /W4 ^
More? /wd4459 /D _CRT_SECURE_NO_WARNINGS /D _STL_SECURE_NO_WARNINGS

[G:\code\minimal_gui\binaries]
> cl ..\minimal.cpp /c
minimal.cpp

[G:\code\minimal_gui\binaries]
> rc /nologo /fo resources.res ..\resources.rc

[G:\code\minimal_gui\binaries]
> set LINK=/entry:mainCRTStartup

[G:\code\minimal_gui\binaries]
> link /nologo minimal.obj resources.res user32.lib /subsystem:windows /out:b.exe

Microsoft now also has special tools (or maybe a special tool) to handle application manifests, but I haven’t used that.

Why COW was deemed ungood for std::string.

COW, short for copy on write, is a way to implement mutable strings so that creating strings and logically copying strings, is reduced to almost nothing; conceptually they become free operations like no-ops.

Basic idea: to share a data buffer among string instances, and only make a copy for a specific instance (the copy on write) when that instance’s data is modified. The general cost of this is only an extra indirection for accessing the value of a string, so a COW implementation is highly desirable. And so the original C++ standard, C++98, and its correction C++03, had special support for COW implementations, and e.g. the g++ compiler’s std::string implementations used COW.

So why was that support dropped in C++11?

In particular, would the same reason or reasons apply to a reference counted immutable string value class?

As we’ll see it does not, it’s just a severe mismatch between the std::string design and the ideal COW requirements. But it took a two hour car trip, driving 120 km on winter roads, for my memory to yet again cough up the relevant scenario where Things Go Wrong™. I guess it’s like the “why can’t I assign a T** to a T const**” question; it’s quite counter-intuitive.

Basic COW string theory: the COPOW principle.

A COW string has two possible states: exclusively owning the buffer, or sharing the buffer with other COW strings.

It starts out in state owning. Assignments and copying initializations can make it sharing. Before executing a “write” operation it must ensure that it’s in owning state, and a transition from sharing to owning involves copying the buffer contents to a new and for now exclusively owned buffer.

With a string type designed for COW any operation will be either non-modifying, a “read” operation, or directly modifying, a “write” operation, which makes it easy to determine whether the string must ensure state owning before executing the operation.

With a std::string, however, references, pointers and iterators to mutable contents are handed out with abandon. Even a simple value indexing of a non-const string, s[i], hands out a reference that can be used to modify the string. And so for a non-const std::string every such hand-out-free-access operation can effectively be a “write” operation, and would have to be regarded as such for a COW implementation (if the current C++ standard had permitted a COW implementation, which it doesn’t).

I call this the principle of copy on possibility of write, or COPOW for short. It’s for strings that aren’t designed for COW. For a COW-oriented design applying COPOW reduces to pure COW.

A code example showing how COW works.

To keep the size of the following example down I don’t address the issue of constant time initialization from literal, but just show how assignment and copy initialization can be reduced to almost nothing:

#include <cppx-core/_all_.hpp>  // https://github.com/alf-p-steinbach/cppx-core

using C_str = const char*;      // Is also available in cppx.

namespace my
{
    $use_cppx( Raw_array_of_, Size );
    $use_std( begin, end, make_shared, vector, shared_ptr );

    class Cow_string
    {
        using Buffer = vector<char>;

        shared_ptr<Buffer>      m_buffer;
        Size                    m_length;

        void ensure_is_owning()
        {
            if( m_buffer.use_count() > 1 )
            {
                m_buffer = make_shared<Buffer>( *m_buffer );
            }
        }

    public:
        auto c_str() const
            -> C_str
        { return m_buffer->data(); }

        auto length() const
            -> Size
        { return m_length; }

        auto operator[]( const Size i ) const
            -> const char&
        { return (*m_buffer)[i]; }

        auto operator[]( const Size i )
            -> char&
        {
            ensure_is_owning();
            return (*m_buffer)[i];
        }

        template< Size n >
        Cow_string( Raw_array_of_<n, const char>& literal ):
            m_buffer( make_shared<Buffer>( literal, literal + n ) ),
            m_length( n - 1 )
        {}
    };
}  // namespace my

Here assignment is the default-generated assignment operator that just assigns the data members m_buffer and m_length, which are a shared_ptr and an integer, and ditto for copy initialization.

And apparently this code abides by the COPOW principle, so it should be safe…

The problem: UB by adding code that just copies.

Consider the following usage code, it’s perfectly fine:

auto main() -> int
{
    my::Cow_string s = "Du store Alpakka!";
    const C_str p = s.c_str();

    // In this block the contents of `s` are not modified.
    {
        $use_std( ignore );
        const char first_char = s[0];
        ignore = first_char;
    }

    $use_std( cout, endl );
    cout << p << endl;
}

This code is fine because the COW string is already in state owning when s[0] is executed on the non-const s. So all that the initialization of first_char does is to copy a char value. Fine.

But if a maintainer innocently just introduces a logical copy of the string value, which is what COW primarily optimizes, and which certainly doesn’t change the conceptual value, then mayhem ensues:

auto main() -> int
{
    my::Cow_string s = "Du store Alpakka!";
    const C_str p = s.c_str();

    // In this block the contents of `s` are not modified.
    {
        $use_std( ignore );
        my::Cow_string other = s;
        ignore = other;

        const char first_char = s[0];
        ignore = first_char;
    }

    $use_std( cout, endl );
    cout << p << endl;      //! Undefined behavior, p is dangling.
}

Uh oh.

Since s here is in state sharing, the COPOW principle makes the s[0] operation copy the shared buffer, to become owning. Then at the end of the block the only remaining owner of the original buffer, the other string, is destroyed, and destroys the buffer. Which leaves the p pointer dangling.
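The buffer hand-over can be observed without dereferencing anything dangling. The following is a minimal std-only sketch of the same scenario, where Mini_cow is a hypothetical cut-down variant of Cow_string (not the class above) and the comparison is done while other is still alive, so that p is still valid at that point:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical cut-down COW string, just enough to reproduce the scenario.
class Mini_cow
{
    using Buffer = std::vector<char>;
    std::shared_ptr<Buffer>     m_buffer;

public:
    explicit Mini_cow( const std::string& s ):
        m_buffer( std::make_shared<Buffer>( s.c_str(), s.c_str() + s.size() + 1 ) )
    {}

    // A "read" operation: never copies the buffer.
    auto c_str() const
        -> const char*
    { return m_buffer->data(); }

    // A possible "write" operation: ensures state owning first (COPOW).
    auto operator[]( const std::size_t i )
        -> char&
    {
        assert( i < m_buffer->size() );
        if( m_buffer.use_count() > 1 ) {
            m_buffer = std::make_shared<Buffer>( *m_buffer );
        }
        return (*m_buffer)[i];
    }
};

// Reports whether `p` still points into the buffer of `s` after the innocent copy.
inline auto p_still_points_into_s()
    -> bool
{
    Mini_cow s( "Du store Alpakka!" );
    const char* const p = s.c_str();

    Mini_cow other = s;                 // Now `s` and `other` share the buffer.
    const char first_char = s[0];       // Non-const indexing: `s` gets a fresh buffer.
    (void) first_char;

    return p == s.c_str();              // Compared while `other` keeps the old buffer alive.
}
```

Here p_still_points_into_s() produces false: after the indexing, p points into the buffer that only other keeps alive.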

For a custom string type like Cow_string this is a user error. The type is just badly designed, so that it’s very easy to inadvertently use it incorrectly. But for a std::string it’s formally a bug in the COW implementation, a bug showing that COPOW is just not enough.

For a std::string, if the standard had permitted a COW implementation, to avoid the above calamity it would be necessary to transition to the owned state, incurring an O(n) copying of string data, every place that a reference, pointer or iterator is handed out, regardless of const-ness of the string. One could maybe call that copy on handing out any reference, COHOAR. It greatly reduces the set of cases where COW has an advantage. The C++ standardization committee deemed that cost too high, the remaining advantages of COW too low, to continue supporting COW. So,

  • the C++03 wordings that supported COW were removed;
  • wording was introduced, especially a general O(1) complexity requirement for [] indexing, that disallowed COW; and
  • functionality such as string_view was added, that relies on having pointers to string buffers, and that itself hands out references.

What about threads?

It’s a common misconception that COW std::strings would be incompatible with multi-threading, or that making it compatible would make it inefficient, because with COW ordinary copying of a string doesn’t yield an actual copy that another thread can access freely.

In order to allow string instances that are used by different threads, to share a buffer, just about every access function, including simple [] indexing, would need to use a mutex.

However, a simple solution is to just not use ordinary copy initialization or assignment to create a string for another thread, but instead a guaranteed content copying operation such as std::string::substr, or initialization from iterator pair. The C++11 standard could have gone in this other direction. It could, in principle, have added to the existing C++03 support for COW strings, noting that COHOAR, not just COPOW, is required, and added a dedicated deep-copy operation or deep-copy support type plus wording about thread safety.

What about reference counted immutable strings?

An immutable string is a string type such as the built in string types in Java, C# or Python, where the available operations don’t support string data modification. Strings can still be assigned. One just can’t directly change the string data, like changing “H” to “J” in “Hello”.

Immutable strings can and in C++ typically will share their string data via reference counting, just as with COW strings. As with COW strings they support amortized constant time initialization from literal, ditto superfast copy assignment and copy initialization, and in addition, if strings don’t need to be zero-terminated they support constant time substring operations. They’re highly desirable.

So, is the problem shown above, a showstopper for immutable strings in C++?

Happily, no. The problem comes about because std::string hands out references, pointers and iterators that can be used to change the string data without involving std::string code, i.e. without its knowledge. That can’t happen with an immutable string.
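As a rough illustration, here is a minimal sketch of a reference counted immutable string with constant time copying and constant time substrings. The class and its member names are hypothetical, the text is not zero-terminated, and the one-time copying of the text in the constructor is a simplification (it’s not the constant time initialization from literal mentioned above):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical sketch; the shared text is never modified after construction.
class Immutable_string
{
    std::shared_ptr<const std::string>  m_data;
    std::size_t                         m_start;
    std::size_t                         m_length;

    Immutable_string(
        std::shared_ptr<const std::string>  data,
        const std::size_t                   start,
        const std::size_t                   length
        ):
        m_data( std::move( data ) ), m_start( start ), m_length( length )
    {}

public:
    explicit Immutable_string( std::string s ):
        m_data( std::make_shared<std::string>( std::move( s ) ) ),
        m_start( 0 ),
        m_length( m_data->size() )
    {}

    auto length() const -> std::size_t  { return m_length; }

    // Pointer to the first char; the text is NOT zero-terminated.
    auto data() const -> const char*    { return m_data->data() + m_start; }

    // Indexing hands out a value, not a reference, so there's nothing to write through.
    auto operator[]( const std::size_t i ) const
        -> char
    {
        assert( i < m_length );
        return data()[i];
    }

    // Constant time: the result just refers to a part of the shared text.
    auto substr( const std::size_t start, const std::size_t n ) const
        -> Immutable_string
    {
        assert( start <= m_length && n <= m_length - start );
        return Immutable_string( m_data, m_start + start, n );
    }

    auto str() const -> std::string     { return std::string( data(), m_length ); }
};
```

Copying an Immutable_string copies only a shared_ptr and two integers, and since no operation can change the text, a pointer obtained from data() stays valid for as long as any string that shares that text exists.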

And figuring out this, whether there was a showstopper, and whether std::string_view (that hands out references) could be used freely in code that would deal with immutable strings, was the reason that I delved into the question of COW std::string again. At one point, long ago, I knew, because I participated in some debates about it, but remembering the problematic use case wasn’t easy. It’s just not intuitive to me, that adding an operation that just copies, can create UB…

Non-crashing Python 3.x output in Windows

Problem

The following little Python 3.x program just crashes with CPython 3.3.4 in a default-configured English Windows:

crash.py3
#encoding=utf-8
print( "Blåbærsyltetøy!")
H:\personal\web\blog alf on programming at wordpress\001\test>chcp 437
Active code page: 437

H:\personal\web\blog alf on programming at wordpress\001\test>type crash.py3 | display_utf8
#encoding=utf-8
print( "Blåbærsyltetøy!")

H:\personal\web\blog alf on programming at wordpress\001\test>crash.py3
Traceback (most recent call last):
  File "H:\personal\web\blog alf on programming at wordpress\001\test\crash.py3", line 2, in <module>
    print( "Blåbærsyltet\xf8y!")
  File "C:\Program Files\CPython 3_3_4\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xf8' in position 12: character maps to <undefined>

H:\personal\web\blog alf on programming at wordpress\001\test>_

Here codepage 437 is the original IBM PC character set, which is the default narrow text interpretation in an English Windows console window.

A partial solution is to change the default console codepage to Windows ANSI, which then at least for CPython matches the encoding for output to a pipe or file, and it’s nice with consistency. But also this has a severely limited character set, with possible crash behavior for any unsupported characters.

Direct console output

Unicode text limited to the Basic Multilingual Plane (essentially original 16-bits Unicode) can be output to a Windows console via the WriteConsoleW Windows API function.

The standard Python ctypes module provides access to the API:

Direct_console_io.py
import ctypes
class Object: pass

winapi = Object()
winapi.STD_INPUT_HANDLE     = -10
winapi.STD_OUTPUT_HANDLE    = -11
winapi.STD_ERROR_HANDLE     = -12
winapi.GetStdHandle         = ctypes.windll.kernel32.GetStdHandle
winapi.CloseHandle          = ctypes.windll.kernel32.CloseHandle
winapi.WriteConsoleW        = ctypes.windll.kernel32.WriteConsoleW

class Direct_console_io:
    def write( self, s ) -> int:
        n_written = ctypes.c_ulong()
        ret = winapi.WriteConsoleW(
            self.std_output_handle, s, len( s ), ctypes.byref( n_written ), 0
            )
        return n_written.value

    def __del__( self ):
        if not winapi: return       # Looks like a bug in CPython 3.x
        winapi.CloseHandle( self.std_error_handle )
        winapi.CloseHandle( self.std_output_handle )
        winapi.CloseHandle( self.std_input_handle )

    def __init__( self ):
        self.dependency = winapi
        self.std_input_handle   = winapi.GetStdHandle( winapi.STD_INPUT_HANDLE )
        self.std_output_handle  = winapi.GetStdHandle( winapi.STD_OUTPUT_HANDLE )
        self.std_error_handle   = winapi.GetStdHandle( winapi.STD_ERROR_HANDLE )

Implementing input is left as an exercise for the reader.

Overriding the standard streams to use direct i/o and UTF-8.

In addition to the silly crashing behavior, the standard streams in CPython 3.x, like sys.stdout, default to Windows ANSI for output to file or pipe. In Python 2.7 this could be reset to more useful UTF-8 by reloading the sys module in order to get back a dynamically removed method that could set the default encoding. No longer in Python 3.x, so this code just creates new stream objects:

Utf8_standard_streams.py
import io
import sys
from Direct_console_io import Direct_console_io

class Dcio_raw_iobase( io.RawIOBase ):
    def writable( self ) -> bool:
        return True

    def write( self, seq_of_bytes ) -> int:
        b = bytes( seq_of_bytes )
        return self.dcio.write( b.decode( 'utf-8' ) )

    def __init__( self ):
        self.dcio = Direct_console_io()

class Dcio_buffered_writer( io.BufferedWriter ):
    def write( self, seq_of_bytes ) -> int:
        return self.raw_stream.write( seq_of_bytes )

    def flush( self ):
        pass

    def __init__( self, raw_iobase ):
        super().__init__( raw_iobase )
        self.raw_stream = raw_iobase

# Module initialization:
def __init__():
    using_console_input = sys.stdin.isatty()
    using_console_output = sys.stdout.isatty()
    using_console_error = sys.stderr.isatty()

    if using_console_output:
        raw_io = Dcio_raw_iobase()
        buf_io = Dcio_buffered_writer( raw_io )
        sys.stdout = io.TextIOWrapper( buf_io, encoding = 'utf-8' )
        sys.stdout.isatty = lambda: True
    else:
        sys.stdout = io.TextIOWrapper( sys.stdout.detach(), encoding = 'utf-8-sig' )

    if using_console_error:
        raw_io = Dcio_raw_iobase()
        buf_io = Dcio_buffered_writer( raw_io )
        sys.stderr = io.TextIOWrapper( buf_io, encoding = 'utf-8' )
        sys.stderr.isatty = lambda: True
    else:
        sys.stderr = io.TextIOWrapper( sys.stderr.detach(), encoding = 'utf-8-sig' )
    return

__init__()

Disclaimer: It’s been a long time since I fiddled with Python, so possibly I’m breaking a number of conventions plus doing things in some less than optimal way. But this was the first path I found through the jungle of apparently arbitrary io class derivations etc. It worked well enough for my purposes (in a little script to convert NRK’s HTML-format subtitles to SubRip format), so, I gather it can be useful also for you – at least as a basis for more robust and/or more general code.

2013 in review

The WordPress.com stats helper monkeys prepared a 2013 annual report for this blog.

Here’s an excerpt:

The concert hall at the Sydney Opera House holds 2,700 people. This blog was viewed about 10,000 times in 2013. If it were a concert at Sydney Opera House, it would take about 4 sold-out performances for that many people to see it.

2012 in review

The WordPress.com stats helper monkeys prepared a 2012 annual report for this blog.

Here’s an excerpt:

600 people reached the top of Mt. Everest in 2012. This blog got about 10,000 views in 2012. If every person who reached the top of Mt. Everest viewed this blog, it would have taken 17 years to get that many views.

C++11 features in Visual C++, yeah!

Recently Microsoft surprised me – positively! – with supporting newfangled C++11 features in the November CTP of their Visual C++ 11.0 compiler:

  • Variadic templates
  • Uniform initialization and initializer_lists
  • Delegating constructors
  • Raw string literals
  • Explicit conversion operators
  • Default template arguments for function templates

The library has not yet been updated to take advantage, but still this means that it’s now possible to avoid macro hell for things such as makeUnique (but unfortunately not yet for e.g. platform-specific Unicode strings, no support yet for the newfangled literals).

On the other hand, Visual C++ implements the regexp library, while g++ does not. And Visual C++ has working exceptions, while AFAIK the dang[1] compiler still has not. So this may just look as if Visual C++ is now overtaking g++ and the dang compiler in the race to support the C++11 standard! 🙂

[1] As Brian Kernighan remarked, dang if I can remember its name!

Liskov’s substitution principle in C++

Part I of III: The LSP and value assignment.

Abstract:
Liskov’s principle of substitution, a principle of generally good class design for polymorphism, strongly implies the slicing behavior of C++ value assignment. Destination value slicing can cause a partial assignment, which can easily break data integrity. Three solutions are (1) static checking of assignment, (2) dynamic checking of assignment, and (3), generally easiest, prohibiting assignment, then possibly providing cloning functionality as an alternative.

Contents:

  • A short review of the LSP.
  • How the LSP (almost) forces slicing in C++.
  • The danger of slicing: type constraint violations.
  • The possible design alternatives wrt. providing value assignment.
  • How to provide assignment with STATIC CHECKING of type constraints.
  • How to provide assignment with DYNAMIC CHECKING of type constraints.
  • How to provide CLONING as an alternative to assignment.

ResultOf a function with C++11

Using std::result_of requires you to specify the function argument types. Which is not very practical when you don’t know the function signature. Happily @potatoswatter (David Krauss) over at Stack Overflow pointed out to me that std::function provides the desired functionality:

template< class Func >
struct ResultOf
{
    typedef typename std::function<
        typename std::remove_pointer<Func>::type
        >::result_type T;
};

Mainly, this saves one from writing a large number of partial specializations for Visual C++ 10.0, which lacks support for C++11 variadic templates.
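A small usage sketch, where the function half is a hypothetical example and the template is repeated so that the code is self-contained:

```cpp
#include <functional>       // std::function
#include <type_traits>      // std::remove_pointer, std::is_same

template< class Func >
struct ResultOf
{
    typedef typename std::function<
        typename std::remove_pointer<Func>::type
        >::result_type T;
};

// Hypothetical example function.
inline auto half( const int x )
    -> double
{ return x/2.0; }

// `decltype( &half )` is the function pointer type `double (*)( int )`, and
// removing the pointer yields the function type that std::function needs.
static_assert(
    std::is_same< ResultOf< decltype( &half ) >::T, double >::value,
    "ResultOf should yield the return type of `half`."
    );
```

Note that no argument types are mentioned at the point of use.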

– Enjoy!

Unicode part 2: UTF-8 stream mode



The Windows console subsystem has a host of Unicode-related bugs. And standard Windows programs such as more (not to mention the C# 4.0 compiler csc) just crash when they’re run from a console window with UTF-8 as active codepage, perplexingly claiming that they’re out of memory. On top of that the C++ runtime libraries of various compilers differ in how they behave. Doing C++ Unicode i/o in Windows consoles is therefore problematic. In this series I show how to work around limitations of the Visual C++ _O_U8TEXT file mode, with the Visual C++ and g++ compilers. This yields an automatic translation between external UTF-8 and internal UTF-16, enabling Windows console i/o of characters in the Basic Multilingual Plane.

Recap

In part 1 I introduced two approaches to Unicode handling in small Windows console programs:

  • The all UTF-8 approach where everything is encoded as UTF-8, and where there are no BOM encoding markers.
     
  • The wide string approach where all external text (including the C++ source code) is encoded as UTF-8, and all internal text is encoded as UTF-16.

The all UTF-8 approach is the approach used in a typical Linux installation. With this approach a novice can remain unaware that he is writing code that handles Unicode: it Just Works™ – in Linux. However, we saw that it mass-failed in Windows:

  • Input with active codepage 65001 (UTF-8) failed due to various bugs.
     
  • Console output with Visual C++ produced gibberish due to the runtime library’s attempt to help by using direct console output.
     
  • I mentioned how wide string literals with non-ASCII characters are incorrectly translated to UTF-16 by Visual C++ due to the necessary lying to Visual C++ about the source code encoding (which is accomplished by not having a BOM at the start of the source code file).

The wide string approach, on the other hand, was shown to have special support in Visual C++, via the _O_U8TEXT file mode, which I called a UTF-8 stream mode. This mode works down at the C FILE level so that wide character operations on C FILE streams get automatic conversion to/from external UTF-8 encoding. That C FILE level support is needed and is practically impossible to do for the application programmer, so it’s a very good thing to have that UTF-8 stream mode…
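To make the conversion concrete, here is a portable sketch of the UTF-8 to UTF-16 direction. It is not how the Visual C++ runtime does it, the function name is hypothetical, and it assumes valid UTF-8 input, with no error handling:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper. Assumes that `s` is valid UTF-8.
inline auto utf16_from_utf8( const std::string& s )
    -> std::u16string
{
    std::u16string result;
    std::size_t i = 0;
    while( i < s.size() ) {
        const unsigned char lead = static_cast<unsigned char>( s[i] );

        // The lead byte tells how many continuation bytes follow.
        const int n_tail = (lead < 0x80? 0 : lead < 0xE0? 1 : lead < 0xF0? 2 : 3);
        assert( i + n_tail < s.size() );    // Otherwise the sequence is truncated.

        // Payload bits of the lead byte: 7, 5, 4 or 3 of them.
        char32_t code = (n_tail == 0? lead : lead & (0x3F >> n_tail));

        // Each continuation byte contributes its 6 low bits.
        for( int k = 1; k <= n_tail; ++k ) {
            code = (code << 6) | (static_cast<unsigned char>( s[i + k] ) & 0x3F);
        }
        i += 1 + n_tail;

        if( code < 0x10000 ) {
            result += static_cast<char16_t>( code );    // Basic Multilingual Plane.
        } else {
            const char32_t v = code - 0x10000;          // Needs a surrogate pair.
            result += static_cast<char16_t>( 0xD800 + (v >> 10) );
            result += static_cast<char16_t>( 0xDC00 + (v & 0x3FF) );
        }
    }
    return result;
}
```

A character in the Basic Multilingual Plane, such as “日” (the UTF-8 bytes 0xE6, 0x97 and 0xA5), becomes a single UTF-16 value, here 0x65E5, while a character outside that plane becomes a surrogate pair.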

I mentioned that as of Visual C++ 10 the UTF-8 mode is not fully implemented, and that it apparently has some bugs. I.e., that it cannot be used directly but needs some scaffolding and fixing. That’s what this second part is about.

Here I do not yet consider the g++ compiler. All this code is Visual C++ specific. However, I have done generally the same with g++, and I will probably discuss that in a third installment.

UTF-8 stream mode: the good (wide stream output)

The good news about the _O_U8TEXT mode is that it works for wide stream output, even for output of narrow strings via the wide stream:

[utf8_mode.msvc.output.good.cpp]
#include <stdexcept>        // std::runtime_error, std::exception
#include <stdlib.h>         // EXIT_SUCCESS, EXIT_FAILURE
#include <iostream>         // std::wcout, std::wcerr, std::endl
#include <string>           // std::string, std::wstring
using namespace std;

#include    <io.h>      // _setmode, _fileno
#include    <fcntl.h>   // _O_U8TEXT

bool throwX( string const& s ) { throw runtime_error( s ); }
bool hopefully( bool v ) { return v; }

void setUtf8Mode( FILE* f, char const name[] )
{
    int const newMode = _setmode( _fileno( f ), _O_U8TEXT );
    hopefully( newMode != -1 )
        || throwX( string() + "setmode failed for " + name );
}

int main()
{
    try
    {
        static char const       narrowText[]    = "Blåbærsyltetøy! 日本国 кошка!";
        static wchar_t const    wideText[]      = L"Blåbærsyltetøy! 日本国 кошка!";

        setUtf8Mode( stdout, "stdout" );

        wcout << "Narrow text: " << narrowText << endl;
        wcout << "Wide text:   " << wideText << endl;
        return EXIT_SUCCESS;
    }
    catch( exception const& x )
    {
        wcerr << "!" << x.what() << endl;
    }
    return EXIT_FAILURE;
}

Visual C++ produces a series of warnings for this source code:

W:\examples> cl utf8_mode.msvc.output.good.cpp /Fe"good"
utf8_mode.msvc.output.good.cpp
utf8_mode.msvc.output.good.cpp(24) : warning C4566: character represented by universal-character-name '\u65E5' cannot be represented in the current code page (1252)
utf8_mode.msvc.output.good.cpp(24) : warning C4566: character represented by universal-character-name '\u672C' cannot be represented in the current code page (1252)
utf8_mode.msvc.output.good.cpp(24) : warning C4566: character represented by universal-character-name '\u56FD' cannot be represented in the current code page (1252)
utf8_mode.msvc.output.good.cpp(24) : warning C4566: character represented by universal-character-name '\u043A' cannot be represented in the current code page (1252)
utf8_mode.msvc.output.good.cpp(24) : warning C4566: character represented by universal-character-name '\u043E' cannot be represented in the current code page (1252)
utf8_mode.msvc.output.good.cpp(24) : warning C4566: character represented by universal-character-name '\u0448' cannot be represented in the current code page (1252)
utf8_mode.msvc.output.good.cpp(24) : warning C4566: character represented by universal-character-name '\u043A' cannot be represented in the current code page (1252)
utf8_mode.msvc.output.good.cpp(24) : warning C4566: character represented by universal-character-name '\u0430' cannot be represented in the current code page (1252)

W:\examples> _

It might look as if these warnings are due to the active codepage in the console window, but they’re not related. Visual C++ is just complaining about information loss when it attempts to convert the narrowText literal from the source code’s UTF-8 to its incorrectly documented C++ execution character set, which is Windows ANSI. Here Windows ANSI is the codepage reported by the GetACP Windows API function (doc here), which defaults to the codepage specified in the ACP value of the, as far as I know, undocumented registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage.

So, the effect of compiling can be a bit different depending on the language that Windows is installed for, which determines the default GetACP codepage. But at least the compilation doesn’t depend on such an ephemeral setting as the active codepage in some console window! And these warnings are good: they’re spot on, for a change. 🙂
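The information loss that the compiler warns about can be sketched portably. The helper below, with the hypothetical name narrowedToAnsiWestern, is a simplification and not the compiler’s actual conversion: it assumes codepage 1252 and only maps the code points that have the same numerical value there, replacing everything else with a question mark.

```cpp
#include <string>       // std::string, std::u32string

// Hypothetical helper: narrows code points to single bytes roughly the way a
// conversion to Windows ANSI Western (codepage 1252) does.
// Assumption: only U+0000..U+007F and U+00A0..U+00FF are mapped, since those have
// the same numerical value in codepage 1252. Characters that codepage 1252 places
// in the range 0x80..0x9F (such as the euro sign) are treated as unrepresentable
// here, which a real conversion would not do.
std::string narrowedToAnsiWestern( std::u32string const& s )
{
    std::string result;
    for( char32_t const c : s )
    {
        bool const isRepresentable = (c < 0x80 || (0xA0 <= c && c <= 0xFF));
        result += (isRepresentable? static_cast<char>( c ) : '?');
    }
    return result;
}
```

The Norwegian characters survive as single bytes, while every Chinese and Russian character ends up as a question mark, which is the kind of narrow text the program is left with after compilation.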

Now, does the program work?

W:\examples> chcp 1252
Active code page: 1252

W:\examples> good
Narrow text: Blåbærsyltetøy! ??? ?????!
Wide text:   Blåbærsyltetøy! 日本国 кошка!

W:\examples> good >good_result

W:\examples> type good_result
Narrow text: Blåbærsyltetøy! ??? ?????!
Wide text:   Blåbærsyltetøy! 日本国 кошка!

W:\examples> chcp 65001
Active code page: 65001

W:\examples> good
Narrow text: Blåbærsyltetøy! ??? ?????!
Wide text:   Blåbærsyltetøy! 日本国 кошка!

W:\examples> type good_result
Narrow text: Blåbærsyltetøy! ??? ?????!
Wide text:   Blåbærsyltetøy! 日本国 кошка!

W:\examples> _

Yes! It manages to present correct output even with active codepage 1252 (Windows ANSI Western) because it uses direct console i/o for this case. And as you can see the redirected output is UTF-8, as it should be.

Even more goodness: the UTF-8 mode also works down at the C library level, using e.g. the wprintf function.

UTF-8 stream mode: the bad (input)

The bad news about the _O_U8TEXT mode is that input is almost as non-functional as with an active codepage 65001.

Apparently the runtime retrieves input as 1 byte per character via the standard input stream. This means that the input is encoded according to the console window’s active codepage. These bytes, which e.g. encode your input text in Windows ANSI, are then interpreted as if they were UTF-8 encoded text.

The interpretation as UTF-8 would only be correct for active codepage 65001. But with active codepage 65001, which is the only one where the interpretation would be correct, the input operation just fails. So, depending on the active codepage the input operation will either produce garbage for non-ASCII characters, or it will fail outright.

[utf8_mode.msvc.input.bad.cpp]
#include <stdexcept>        // std::runtime_error, std::exception
#include <stdlib.h>         // EXIT_SUCCESS, EXIT_FAILURE
#include <iostream>         // std::wcout, std::wcerr, std::endl
#include <string>           // std::string, std::wstring
using namespace std;

#include    <io.h>      // _setmode, _fileno
#include    <fcntl.h>   // _O_U8TEXT

bool throwX( string const& s ) { throw runtime_error( s ); }
bool hopefully( bool v ) { return v; }

void setUtf8Mode( FILE* f, char const name[] )
{
    int const newMode = _setmode( _fileno( f ), _O_U8TEXT );
    hopefully( newMode != -1 )
        || throwX( string() + "setmode failed for " + name );
}

void initStreams()
{
    setUtf8Mode( stdin, "stdin" );
    setUtf8Mode( stdout, "stdout" );
}

wstring lineFrom( wistream& stream )
{
    wstring     result;
    
    getline( stream, result );
    hopefully( !stream.fail() )
        || throwX( "lineFrom: getline failed" );
    return result;
}

int main()
{
    try
    {
        initStreams();

        wcout << "What's your name? ";
        wstring const name = lineFrom( wcin );
        wcout << "Pleased to meet you, " << name << "!" << endl;
        
        int const n = name.length();
        wcout << endl;
        wcout << "I represented your name as " << n << " wide characters:" << endl;
        for( int i = 0;  i < n;  ++i )
        {
            if( i > 0 ) { wcout << " | "; }
            wcout << hex << 0 + name[i] << " '" << name[i] << "'";
        }
        wcout << endl;
       
        return EXIT_SUCCESS;
    }
    catch( exception const& x )
    {
        wcerr << "!" << x.what() << endl;
    }
    return EXIT_FAILURE;
}

Testing this:

W:\examples> chcp 1252
Active code page: 1252

W:\examples> cl utf8_mode.msvc.input.bad.cpp /Fe"bad_input"
utf8_mode.msvc.input.bad.cpp

W:\examples> bad_input
What's your name? Bjørn
Pleased to meet you, Bj�rn!

I represented your name as 5 wide characters:
42 'B' | 6a 'j' | fffd '�' | 72 'r' | 6e 'n'

W:\examples> type good_result
Narrow text: Blåbærsyltetøy! ??? ?????!
Wide text:   Blåbærsyltetøy! 日本国 кошка!

W:\examples> bad_input
What's your name? Blåbærsyltetøy
Pleased to meet you, Blåbærsyltetøy!

I represented your name as 14 wide characters:
42 'B' | 6c 'l' | e5 'å' | 62 'b' | e6 'æ' | 72 'r' | 73 's' | 79 'y' | 6c 'l' | 74 't' | 65 'e' | 74 't' | f8 'ø' | 79 'y'

W:\examples> _

The last run with the funny characters pasted as input shows that interactive input of the UTF-8 byte values works. The input byte values, that look pretty funny in the console, are the UTF-8 byte values that encode “Blåbærsyltetøy”. And with that as the cleartext result the received byte values must have been interpreted directly as UTF-8.

Instead the runtime should have used direct console input, just as it uses direct console output when the standard output stream is connected to a console window. Since it does not (or does not manage to do that correctly), that’s what we have to provide.
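The garbage result for “Bjørn” is what one would expect from any UTF-8 decoder that is fed Windows ANSI bytes. The sketch below is a simplified decoder of my own, not the runtime’s code, and the name codePointsFromUtf8 is hypothetical. It yields U+FFFD for each byte that cannot start or continue a valid sequence; it does not reject overlong forms or surrogate values.

```cpp
#include <cstddef>      // std::size_t
#include <string>       // std::string
#include <vector>       // std::vector

// Hypothetical helper: decodes bytes as UTF-8, producing U+FFFD for invalid bytes.
std::vector<unsigned long> codePointsFromUtf8( std::string const& bytes )
{
    std::vector<unsigned long> result;
    std::size_t i = 0;
    while( i < bytes.size() )
    {
        unsigned const lead = static_cast<unsigned char>( bytes[i] );

        // Number of continuation bytes announced by the lead byte, -1 if invalid.
        int const nTrailing =
            (lead < 0x80? 0 :
            (lead & 0xE0) == 0xC0? 1 :
            (lead & 0xF0) == 0xE0? 2 :
            (lead & 0xF8) == 0xF0? 3 : -1);

        unsigned long codePoint =
            (nTrailing == 0? lead : nTrailing > 0? (lead & (0x3F >> nTrailing)) : 0xFFFD);

        bool isValid = (nTrailing >= 0);
        std::size_t next = i + 1;
        for( int k = 0;  isValid && k < nTrailing;  ++k )
        {
            bool const hasContinuation = (next < bytes.size()
                && (static_cast<unsigned char>( bytes[next] ) & 0xC0) == 0x80);

            if( hasContinuation )
            {
                codePoint = (codePoint << 6) | (static_cast<unsigned char>( bytes[next] ) & 0x3F);
                ++next;
            }
            else
            {
                isValid = false;
            }
        }
        result.push_back( isValid? codePoint : 0xFFFD );
        i = (isValid? next : i + 1);    // On failure, skip just the offending byte.
    }
    return result;
}
```

In codepage 1252 the “ø” is the single byte F8, which is not a valid UTF-8 lead byte, so the decoder produces fffd in the third position just as in the first test run above. The two bytes C3 B8, on the other hand, decode to f8.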

UTF-8 stream mode: the bad (input) – FIXED

It is apparently easy to check whether the standard input stream is connected to a console window, namely via the _isatty function (doc here). However, the documentation hints ominously about _isatty maybe returning true for a stream connected to a serial port, and archaic things like that. Happily one can alternatively use the lower level Windows API console functions, e.g. GetConsoleMode (doc here), which presumably will only succeed for a stream that represents something that supports the Windows console functions.

[is_input_console.cpp]
#include <iostream>
#include <stdio.h>          // stdin, _fileno

#include <io.h>             // _isatty

#include <windows.h>        // GetConsoleMode

int main()
{
    DWORD   consoleMode;

    HANDLE const    inputHandle         = GetStdHandle( STD_INPUT_HANDLE );
    bool const      winapiSaysConsole   = !!GetConsoleMode( inputHandle, &consoleMode );
    bool const      clibSaysConsole     = !!_isatty( _fileno( stdin ) );
    
    using namespace std;
    cerr << boolalpha;
    cerr << "_isatty: " << clibSaysConsole << endl;
    cerr << "GetConsoleMode: " << winapiSaysConsole << endl;
}

W:\examples> cl is_input_console.cpp /Fe"x"
is_input_console.cpp

W:\examples> x
_isatty: true
GetConsoleMode: true

W:\examples> x <nul
_isatty: true
GetConsoleMode: false

W:\examples> x <is_input_console.cpp
_isatty: false
GetConsoleMode: false

W:\examples> _

It surprised me that _isatty here incorrectly identified the Windows nul device as a console window. However, GetConsoleMode got it right. So GetConsoleMode is evidently more reliable for this detection task.

Anyway, with a console identified as such, at the C++ level it’s then possible to override things in std::basic_streambuf (doc here), where the required core functionality maps almost directly to a call of the Windows API function ReadConsole (doc here).
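The stream buffer mechanism itself can be tried out without any Windows API involvement. In the portable sketch below the class FakeConsole is a hypothetical stand-in for ReadConsole that hands out one line per call, and LineInputBuffer is an illustrative name of my own. The only function that needs overriding for basic input is underflow, which refills a buffer and then publishes it via setg.

```cpp
#include <algorithm>    // std::remove
#include <cstddef>      // std::size_t
#include <istream>      // std::wistream
#include <streambuf>    // std::basic_streambuf
#include <string>       // std::wstring, std::getline

// Hypothetical stand-in for ReadConsole: hands out one line of a fixed text per call.
class FakeConsole
{
    std::wstring    text_;
    std::size_t     position_;

public:
    explicit FakeConsole( std::wstring const& text ): text_( text ), position_( 0 ) {}

    std::wstring readLine()
    {
        std::size_t const iNewline = text_.find( L'\n', position_ );
        std::size_t const iEnd =
            (iNewline == std::wstring::npos? text_.size() : iNewline + 1);
        std::wstring const line = text_.substr( position_, iEnd - position_ );
        position_ = iEnd;
        return line;        // Empty when there is no more text.
    }
};

class LineInputBuffer
    : public std::basic_streambuf<wchar_t>
{
    FakeConsole&    source_;
    std::wstring    buffer_;

protected:
    virtual int_type underflow()
    {
        if( gptr() == 0 || gptr() >= egptr() )
        {
            buffer_ = source_.readLine();
            // Drop carriage returns, since a console delivers "\r\n" line endings.
            buffer_.erase(
                std::remove( buffer_.begin(), buffer_.end(), L'\r' ), buffer_.end()
                );
            if( buffer_.empty() ) { return traits_type::eof(); }
            setg( &buffer_[0], &buffer_[0], &buffer_[0] + buffer_.size() );
        }
        return traits_type::to_int_type( *gptr() );
    }

public:
    explicit LineInputBuffer( FakeConsole& source ): source_( source ) {}
};
```

A std::wistream constructed on a LineInputBuffer can then be read with getline in the usual way, and the lines arrive without their carriage returns.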

The program below illustrates the technique by adding the necessary special input support to the previous section’s program, with main unchanged. This program is however not intended to provide directly reusable code. It’s just a problem-specific concrete example written in a reusable-like style:

[utf8_mode.msvc.input.fixed.cpp]
#ifdef  _MSC_VER
#   pragma warning( disable: 4373 )     // "Your override overrides"
#endif

#include <algorithm>        // std::remove
#include <stddef.h>         // ptrdiff_t
#include <stdexcept>        // std::runtime_error, std::exception
#include <stdlib.h>         // EXIT_SUCCESS, EXIT_FAILURE
#include <iostream>         // std::wcout, std::wcerr, std::endl
#include <streambuf>        // std::basic_streambuf
#include <string>           // std::string, std::wstring
using namespace std;

#include    <io.h>      // _setmode, _fileno
#include    <fcntl.h>   // _O_U8TEXT

#undef      UNICODE
#define     UNICODE
#include <windows.h>    // ReadConsole


typedef ptrdiff_t       Size;

bool throwX( string const& s ) { throw runtime_error( s ); }
bool hopefully( bool v ) { return v; }

class DirectInputBuffer
    : public std::basic_streambuf< wchar_t >
{
private:
    wstring     buffer_;

    Size bufferSize() const         { return buffer_.size(); }
    wchar_t* pBufferStart()         { return &buffer_[0]; }
    wchar_t* pBufferEnd()           { return pBufferStart() + bufferSize(); }

    wchar_t* pStart() const         { return eback(); }
    wchar_t* pCurrent() const       { return gptr(); }
    wchar_t* pEnd() const           { return egptr(); }

    static HANDLE inputHandle()
    {
        static HANDLE const handle = GetStdHandle( STD_INPUT_HANDLE );
        return handle;
    }

public:
    typedef std::basic_streambuf< wchar_t >     Base;
    typedef Base::traits_type                   Traits;

    DirectInputBuffer( Base const& anOriginalBuffer )
        : Base( anOriginalBuffer )      // Copies buffer read area pointers.
        , buffer_( 256, L'#' )
    {}

protected:
    virtual streamsize xsgetn( wchar_t* const pBuffer, streamsize const n )
    {
        wchar_t const   ctrlZ   = wchar_t( 1 + ('Z' - 'A') );

        DWORD       nCharactersRead     = 0;

        bool const  readSucceeded       = !!ReadConsole(
            inputHandle(), pBuffer, static_cast< DWORD >( n ), &nCharactersRead, nullptr
            );

        if( readSucceeded )
        {
            wchar_t const* const    pCleanEnd   =
                remove( pBuffer, pBuffer + nCharactersRead, L'\r' );

            nCharactersRead = pCleanEnd - pBuffer;

            bool const isInteractiveEOF =
                (nCharactersRead == 2 && pBuffer[0] == ctrlZ && pBuffer[1] == '\n');
                
            return (isInteractiveEOF? 0 : static_cast< streamsize >( nCharactersRead ));
        }
        return 0;
    }
    
    virtual int_type underflow()
    {
        // Try to get some more input (maximum a line).
        if( pCurrent() == 0 || pCurrent() >= pEnd() )
        {
            streamsize const nCharactersRead =
                xsgetn( pBufferStart(), bufferSize() );

            if( nCharactersRead > 0 )
            {
                setg(
                    pBufferStart(),                     // Reading area start
                    pBufferStart(),                     // Reading area current
                    pBufferStart() + nCharactersRead    // Reading area end
                    );
            }
        }
        
        if( pCurrent() == 0 || pCurrent() >= pEnd() )
        {
            return Traits::eof();
        }
        return Traits::to_int_type( *pCurrent() );
    }
};

void setUtf8Mode( FILE* f, char const name[] )
{
    int const newMode = _setmode( _fileno( f ), _O_U8TEXT );
    hopefully( newMode != -1 )
        || throwX( string() + "setmode failed for " + name );
}

bool inputIsFromConsole()
{
    static HANDLE const inputHandle = GetStdHandle( STD_INPUT_HANDLE );

    DWORD consoleMode;
    return !!GetConsoleMode( inputHandle, &consoleMode );
}

void initStreams()
{
    if( inputIsFromConsole() )
    {
        static DirectInputBuffer buffer( *wcin.rdbuf() );
        wcin.rdbuf( &buffer );
    }
    setUtf8Mode( stdin, "stdin" );
    setUtf8Mode( stdout, "stdout" );
}

wstring lineFrom( wistream& stream )
{
    wstring     result;
    
    getline( stream, result );
    hopefully( !stream.fail() )
        || throwX( "lineFrom: getline failed" );
    return result;
}

int main()
{
    try
    {
        initStreams();

        wcout << "What's your name? ";
        wstring const name = lineFrom( wcin );
        wcout << "Pleased to meet you, " << name << "!" << endl;
        
        int const n = name.length();
        wcout << endl;
        wcout << "I represented your name as " << n << " wide characters:" << endl;
        for( int i = 0;  i < n;  ++i )
        {
            if( i > 0 ) { wcout << " | "; }
            wcout << hex << 0 + name[i] << " '" << name[i] << "'";
        }
        wcout << endl;
       
        return EXIT_SUCCESS;
    }
    catch( exception const& x )
    {
        wcerr << "!" << x.what() << endl;
    }
    return EXIT_FAILURE;
}

Testing this:

W:\examples> chcp 1252
Active code page: 1252

W:\examples> cl utf8_mode.msvc.input.fixed.cpp /Fe"fixed_input"
utf8_mode.msvc.input.fixed.cpp

W:\examples> fixed_input
What's your name? Bjørn
Pleased to meet you, Bjørn!

I represented your name as 5 wide characters:
42 'B' | 6a 'j' | f8 'ø' | 72 'r' | 6e 'n'

W:\examples> fixed_input
What's your name? 日本国 кошка
Pleased to meet you, 日本国 кошка!

I represented your name as 9 wide characters:
65e5 '日' | 672c '本' | 56fd '国' | 20 ' ' | 43a 'к' | 43e 'о' | 448 'ш' | 43a 'к' | 430 'а'

W:\examples> _

Yay! 🙂

But I’d better mention again that in my English-language Windows 7 the console window stores the Chinese characters correctly but is only able to display them as rectangles, ▭. Correctly stored means that copy and paste works, which is why the text dumps above show these characters. However, the Norwegian and Russian characters are both stored and displayed correctly.

Also, I’d better mention that this is only a C++ level solution:

  • Input from the standard input stream should only be done via wcin.

Explained:

Input from the standard input stream should only be done at the C++ iostream level because in practice the UTF-8 mode input operation bugs can only be fixed at the C++ iostream level.

And input should only be done via the C++ wide stream wcin because, first of all, input data can be arbitrary, and secondly, with UTF-8 mode the narrow character operations just fail…

UTF-8 stream mode: the ugly (narrow streams)

C++11 (more precisely the N3290 final draft) §27.4.1/3:

  • “Mixing operations on corresponding wide- and narrow-character streams follows the same semantics as mixing such operations on FILEs, as specified in Amendment 1 of the ISO C standard.”

C99 (more precisely the N869 draft) §7.19.2/4:

  • “Each stream has an orientation. After a stream is associated with an external file, but before any operations are performed on it, the stream is without orientation. Once a wide-character input/output function has been applied to a stream without orientation, the stream becomes a wide-oriented stream. Similarly, once a byte input/output function has been applied to a stream without orientation, the stream becomes a byte-oriented stream. Only a call to the freopen function or the fwide function can otherwise alter the orientation of a stream. (A successful call to freopen removes any orientation.)”

C99 (more precisely the N869 draft) §7.19.2/5:

  • “Byte input/output functions shall not be applied to a wide-oriented stream and wide-character input/output functions shall not be applied to a byte-oriented stream.”

By these rules one would have to decide on using either cerr or wcerr in a program, but not both. That’s not very practical, considering, for example, that one library component might produce error messages via cerr, while another might use wcerr for that purpose. But Visual C++ apparently follows a more practical-for-Windows-programming set of rules.
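For comparison, here is how the orientation rules can be observed with a C library that implements them as the standard describes. This is a portable sketch with hypothetical helper names, using a temporary file so that the standard streams are left alone. Calling fwide with mode 0 only queries the orientation.

```cpp
#include <cstdio>       // std::FILE, std::tmpfile, std::fputs, std::fclose
#include <cwchar>       // std::fwide, std::fputws

enum Orientation { byteOriented = -1, noOrientation = 0, wideOriented = 1 };

// Hypothetical helper: reports a stream's orientation without changing it.
Orientation orientationOf( std::FILE* f )
{
    int const result = std::fwide( f, 0 );      // Mode 0 only queries.
    return (result < 0? byteOriented : result > 0? wideOriented : noOrientation);
}

// Hypothetical helper: the orientation of a fresh temporary file after a single
// output operation, either wide or byte based.
Orientation orientationAfterFirstWrite( bool const useWideOperation )
{
    std::FILE* const f = std::tmpfile();
    if( f == 0 ) { return noOrientation; }      // Could not create the file.

    if( useWideOperation )
    {
        std::fputws( L"x", f );
    }
    else
    {
        std::fputs( "x", f );
    }
    Orientation const result = orientationOf( f );
    std::fclose( f );
    return result;
}
```

With such a library the first operation decides the matter: a wide write should leave the stream wide-oriented, and a byte write should leave it byte-oriented.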

For Visual C++ 10.0 the fwide function is documented as being unimplemented. And from a practical point of view, at least at the level of outputting whole lines it apparently works fine to intermingle use of cout and wcout. So, happily, Visual C++ apparently just disregards the standard’s requirements and does not maintain an impractical explicit C FILE stream orientation.

Except … that with UTF-8 mode for the standard output stream, use of cout crashes the program!

[utf8_mode.msvc.byte_stream.ugly.cpp]
#include <stdexcept>        // std::exception
#include <stdlib.h>         // EXIT_SUCCESS, EXIT_FAILURE
#include <iostream>         // std::cout, std::endl
using namespace std;

#include    <io.h>      // _setmode, _fileno
#include    <fcntl.h>   // _O_U8TEXT

void initStreams()
{
    _setmode( _fileno( stdin ), _O_U8TEXT );
    _setmode( _fileno( stdout ), _O_U8TEXT );
    _setmode( _fileno( stderr ), _O_U8TEXT );
}

int main()
{
    try
    {
        initStreams();

        cout << "Hello, world!" << endl;
        cout << "Did you know, most Norwegians like 'blåbærsyltetøy'?" << endl;
        cerr << ":This is an error message, also mentioning 'blåbærsyltetøy'." << endl;

        return EXIT_SUCCESS;
    }
    catch( exception const& x )
    {
        cerr << "!" << x.what() << endl;
    }
    catch( ... )
    {
        cerr << "!Unknown exception." << endl;
    }
    return EXIT_FAILURE;
}

(Screendump of crash dialog.)

Testing:

W:\examples> ugly

W:\examples> _

The crash is caused by an assertion in the Visual C++ 10.0 fputc implementation, in source file [fputc.c]:

int __cdecl fputc (
        int ch,
        FILE *str
        )
{
    int retval=0;

    _VALIDATE_RETURN((str != NULL), EINVAL, EOF);

    _lock_str(str);
    __try {
        _VALIDATE_STREAM_ANSI_SETRET(str, EINVAL, retval, EOF);    // <-- Uh oh.

        if (retval==0)
        {
            retval = _putc_nolock(ch,str);
        }
    }
    __finally {
        _unlock_str(str);
    }

    return(retval);
}

where _VALIDATE_STREAM_ANSI_SETRET is defined thusly in [internal.h]:

/*
    We use _VALIDATE_STREAM_ANSI_SETRET to ensure that ANSI file operations(
    fprintf etc) aren't called on files opened as UNICODE. We do this check
    only if it's an actual FILE pointer & not a string. It doesn't actually return
        immediately
*/

#define _VALIDATE_STREAM_ANSI_SETRET( stream, errorcode, retval, retexpr)            \
    {                                                                                \
        FILE *_Stream=stream;                                                        \
        int fn;                                                                      \
        _VALIDATE_SETRET(( (_Stream->_flag & _IOSTRG) ||                             \
                           ( fn = _fileno(_Stream),                                  \
                             ( (_textmode_safe(fn) == __IOINFO_TM_ANSI) &&           \
                               !_tm_unicode_safe(fn)))),                             \
                         errorcode, retval, retexpr)                                 \
    }

I.e., with UTF-8 mode Microsoft’s code enforces the standard’s prohibition of mixing wide and narrow character operations on the same stream.

What to do, when the law is suddenly enforced?

UTF-8 stream mode: the ugly (narrow streams) – FIXED

Given the Windows convention of Windows ANSI encoding for internal char-based data, in particular for string literals, it does not make much sense to support cin input operations. With the user typing e.g. “кошка” the program running on a Norwegian machine would just produce something like “?????” as the Windows ANSI-encoded result. I.e. such narrow character input operations would replace a generally preferable hard crash with rather ungood data loss.

On the other hand, especially for students, existing library components may be logging error messages via narrow character operations on the standard error stream, like cerr << "!oops" << endl;. To avoid crashes for that, up at the C++ level one can install new iostream buffers for cout, cerr and clog, where those buffers do something reasonable such as forwarding all output to the wide streams (thereby providing Windows ANSI → UTF-8 translation). Down at the C level there is, however, no other practical crash-fix option than to leave the standard error stream in ANSI (untranslated) mode, so that the functionality down at the C level is much more restricted.

Back up at the C++ level again, this restriction for stderr can be compensated by adding UTF-8 conversion to the wide streams that write to the standard error stream, namely wcerr and wclog. One can and should also add custom direct console output support (more about that below). I.e., one is then essentially emulating the UTF-8 mode for wcerr and wclog, but only for operations at the C++ level.
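The idea of emulating the UTF-8 mode at the C++ level can also be sketched by hand, as an alternative to a codecvt facet. The buffer below is a hypothetical illustration of my own, not the code used in this article’s programs: it encodes each wide character as UTF-8 and forwards the bytes to a narrow stream buffer. Each wchar_t is treated as a complete code point, so with the 16-bit wchar_t of Windows this covers only the Basic Multilingual Plane.

```cpp
#include <ostream>      // std::wostream
#include <sstream>      // std::ostringstream
#include <streambuf>    // std::basic_streambuf, std::streambuf
#include <string>       // std::string, std::wstring

// Hypothetical buffer: wide characters in, UTF-8 bytes out to a narrow buffer.
class Utf8EncodingBuffer
    : public std::basic_streambuf<wchar_t>
{
    std::streambuf*     pByteBuffer_;

    bool putByte( unsigned long const value )
    {
        typedef std::streambuf::traits_type ByteTraits;
        char const ch = static_cast<char>( static_cast<unsigned char>( value ) );
        return !ByteTraits::eq_int_type( pByteBuffer_->sputc( ch ), ByteTraits::eof() );
    }

protected:
    // No put area is set up, so every character written arrives here.
    virtual int_type overflow( int_type const c )
    {
        if( traits_type::eq_int_type( c, traits_type::eof() ) )
        {
            return traits_type::not_eof( c );
        }

        unsigned long const cp = static_cast<unsigned long>( traits_type::to_char_type( c ) );
        bool ok;
        if( cp < 0x80 )
        {
            ok = putByte( cp );
        }
        else if( cp < 0x800 )
        {
            ok = putByte( 0xC0 | (cp >> 6) )
                && putByte( 0x80 | (cp & 0x3F) );
        }
        else if( cp < 0x10000 )
        {
            ok = putByte( 0xE0 | (cp >> 12) )
                && putByte( 0x80 | ((cp >> 6) & 0x3F) )
                && putByte( 0x80 | (cp & 0x3F) );
        }
        else    // Only reachable where wchar_t is wider than 16 bits.
        {
            ok = putByte( 0xF0 | (cp >> 18) )
                && putByte( 0x80 | ((cp >> 12) & 0x3F) )
                && putByte( 0x80 | ((cp >> 6) & 0x3F) )
                && putByte( 0x80 | (cp & 0x3F) );
        }
        return (ok? c : traits_type::eof());
    }

public:
    explicit Utf8EncodingBuffer( std::streambuf* const pByteBuffer )
        : pByteBuffer_( pByteBuffer )
    {}
};

// Hypothetical usage helper: the UTF-8 bytes produced by streaming some wide text.
std::string utf8From( std::wstring const& text )
{
    std::ostringstream  bytes;
    Utf8EncodingBuffer  buffer( bytes.rdbuf() );
    std::wostream       stream( &buffer );

    stream << text;
    return bytes.str();
}
```

In a real program the narrow buffer would be the one that writes to the standard error stream, and an instance of the encoding buffer would be installed in wcerr via rdbuf.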

This means that while the C++ level will/can work as expected, narrow character operations on stderr will only be guaranteed to produce the intended output when they’re limited to the ASCII character repertoire, and narrow character operations on stdin or stdout will crash.

Happily it’s very very easy to recognize a crash and/or garbage output, so that breaches of the programming conventions, …

  • use wcin for input,
     
  • don’t use C FILE level narrow characters operations on stdout, and
     
  • don’t use non-ASCII characters in narrow character operations on stderr

will likely be caught during testing.

Let’s start at the end of the internal processing, with the UTF-8 encoding support for wcerr.

With Visual C++ 10.0 the UTF-8 encoding itself is almost trivial, first of all because Visual C++ 10.0 supports the C++11 codecvt_utf8 facet, and secondly because Visual C++’s standard streams support the codecvt facet (the standard only requires file streams to support it). But as you can see below the console window result is not perfect!

[direct_display_troubles.cpp]
#include <codecvt>          // std::codecvt_utf8 (C++11)
#include <iostream>         // std::wcerr
#include <locale>           // std::locale
#include <memory>           // std::unique_ptr (C++11)
using namespace std;

void setUtf8Conversion( wostream& stream )
{
    typedef codecvt_utf8< wchar_t >     CodeCvt;

    unique_ptr< CodeCvt > pCodeCvt( new CodeCvt );
    locale  utf8Locale( locale(), pCodeCvt.get() );
    pCodeCvt.release();
    stream.imbue( utf8Locale );
}

int main()
{
    wcerr << "'blåbærsyltetøy' with default presentation." << endl;
    
    setUtf8Conversion( wcerr );
    wcerr << "'blåbærsyltetøy' with UTF-8 conversion applied." << endl;
}

Testing this:

W:\examples> chcp 1252
Active code page: 1252

W:\examples> cl direct_display_troubles.cpp /Fe"x"
direct_display_troubles.cpp

W:\examples> x
'blåbærsyltetøy' with default presentation.
'blåbærsyltetøy' with UTF-8 conversion applied.

W:\examples> chcp 65001
Active code page: 65001

W:\examples> x
'bl�b�rsyltet�y' with default presentation.
'bl��b��rsyltet��y' with UTF-8 conversion applied.

W:\examples> (x 2>&1) >result

W:\examples> type result
'bl�b�rsyltet�y' with default presentation.
'blåbærsyltetøy' with UTF-8 conversion applied.

W:\examples> _

As you can see, the Visual C++ runtime library’s direct console output kicks in even without having set UTF-8 mode, and in some inexplicable way presents the resulting UTF-8 bytes – as garbage.

I have no idea how that happens.

However, one cure is to override it with a custom direct console output:

[direct_display_troubles.fixed.cpp]
#ifdef  _MSC_VER
#   pragma warning( disable: 4373 )     // "Your override overrides"
#endif

#include <assert.h>         // assert
#include <codecvt>          // std::codecvt_utf8 (C++11)
#include <iostream>         // std::wcerr
#include <locale>           // std::locale
#include <memory>           // std::unique_ptr (C++11)
#include <streambuf>        // std::basic_streambuf
using namespace std;

#undef  UNICODE
#define UNICODE
#undef  STRICT
#define STRICT
#include <windows.h>    // GetStdHandle, GetConsoleMode, WriteConsole


template< class CharType >
class AbstractOutputBuffer
    : public basic_streambuf< CharType >
{
public:
    typedef basic_streambuf< CharType >     Base;
    typedef typename Base::traits_type      Traits;

    typedef Base                            StreamBuffer;

protected:
    virtual streamsize xsputn( char_type const* const s, streamsize const n ) = 0;

    virtual int_type overflow( int_type const c )
    {
        bool const cIsEOF   = Traits::eq_int_type( c, Traits::eof() );
        int_type const  failureValue    = Traits::eof();
        int_type const  successValue    = (cIsEOF? Traits::not_eof( c ) : c);

        if( !cIsEOF )
        {
            char_type const     ch                  = Traits::to_char_type( c );
            streamsize const    nCharactersWritten  = xsputn( &ch, 1 );

            return (nCharactersWritten == 1? successValue : failureValue);
        }
        return successValue;
    }

public:
    AbstractOutputBuffer()
    {}

    AbstractOutputBuffer( StreamBuffer& existingBuffer )
        : Base( existingBuffer )
    {}
};


class DirectOutputBuffer
    : public AbstractOutputBuffer< wchar_t >
{
public:
    enum StreamId { outputStreamId, errorStreamId, logStreamId };

private:
    StreamId    streamId_;

protected:
    virtual streamsize xsputn( wchar_t const* const s, streamsize const n )
    {
        static HANDLE const outputStreamHandle  = GetStdHandle( STD_OUTPUT_HANDLE );
        static HANDLE const errorStreamHandle   = GetStdHandle( STD_ERROR_HANDLE );

        HANDLE const    streamHandle    =
            (streamId_ == outputStreamId? outputStreamHandle : errorStreamHandle );
        
        DWORD nCharactersWritten    = 0;
        bool writeSucceeded         = !!WriteConsole(
            streamHandle, s, static_cast< DWORD >( n ), &nCharactersWritten, 0
            );
        return (writeSucceeded? static_cast< streamsize >( nCharactersWritten ) : 0);
    }

public:
    DirectOutputBuffer( StreamId streamId = outputStreamId )
        : streamId_( streamId )
    {}
};


void setUtf8Conversion( wostream& stream )
{
    typedef codecvt_utf8< wchar_t >     CodeCvt;

    unique_ptr< CodeCvt > pCodeCvt( new CodeCvt );
    locale  utf8Locale( locale(), pCodeCvt.get() );
    pCodeCvt.release();
    stream.imbue( utf8Locale );
}

bool isConsole( HANDLE streamHandle )
{
    DWORD consoleMode;
    return !!GetConsoleMode( streamHandle, &consoleMode );
}

bool isConsole( DWORD stdStreamId )
{
    return isConsole( GetStdHandle( stdStreamId ) );
}

void setDirectOutputSupport( wostream& stream )
{
    typedef DirectOutputBuffer  DOB;

    if( &stream == &wcout )
    {
        if( isConsole( STD_OUTPUT_HANDLE ) )
        {
            static DOB  outputStreamBuffer( DOB::outputStreamId );
            stream.rdbuf( &outputStreamBuffer );
        }
    }
    else if( &stream == &wcerr )
    {
        if( isConsole( STD_ERROR_HANDLE ) )
        {
            static DOB errorStreamBuffer( DOB::errorStreamId );
            stream.rdbuf( &errorStreamBuffer );
        }
    }
    else if( &stream == &wclog )
    {
        if( isConsole( STD_ERROR_HANDLE ) )
        {
            static DOB logStreamBuffer( DOB::logStreamId );
            stream.rdbuf( &logStreamBuffer );
        }
    }
    else
    {
        assert(( "setDirectOutputSupport: unsupported stream", false ));
    }
}

int main()
{
    wcerr << "'blåbærsyltetøy' with default presentation." << endl;
    
    setUtf8Conversion( wcerr );
    wcerr << "'blåbærsyltetøy' with UTF-8 conversion applied." << endl;
    
    setDirectOutputSupport( wcerr );
    wcerr << "'blåbærsyltetøy' with UTF-8 conversion & direct output applied." << endl;
}

Testing this:

W:\examples> chcp 1252
Active code page: 1252

W:\examples> cl direct_display_troubles.fixed.cpp /Fe"x"
direct_display_troubles.fixed.cpp

W:\examples> x
'blåbærsyltetøy' with default presentation.
'blåbærsyltetøy' with UTF-8 conversion applied.
'blåbærsyltetøy' with UTF-8 conversion & direct output applied.

W:\examples> chcp 65001
Active code page: 65001

W:\examples> x
'bl�b�rsyltet�y' with default presentation.
'bl��b��rsyltet��y' with UTF-8 conversion applied.
'blåbærsyltetøy' with UTF-8 conversion & direct output applied.

W:\examples> (x 2>&1) >result

W:\examples> type result
'bl�b�rsyltet�y' with default presentation.
'blåbærsyltetøy' with UTF-8 conversion applied.
'blåbærsyltetøy' with UTF-8 conversion & direct output applied.

W:\examples> _

Yay! Now we’re ready for tackling the previous section’s program. Again, the main function is exactly as before:

[utf8_mode.msvc.byte_stream.fixed.cpp]
#ifdef  _MSC_VER
#   pragma warning( disable: 4373 )     // "Your override overrides"
#endif

#include <assert.h>         // assert
#include <codecvt>          // std::codecvt_utf8 (C++11)
#include <stdexcept>        // std::exception
#include <stdlib.h>         // EXIT_SUCCESS, EXIT_FAILURE
#include <streambuf>        // std::basic_streambuf
#include <string>           // wstring
#include <iostream>         // std::cout, std::endl
#include <locale>           // std::locale
#include <memory>           // std::unique_ptr (C++11)
using namespace std;

#include    <io.h>      // _setmode, _fileno
#include    <fcntl.h>   // _O_U8TEXT

#undef  UNICODE
#define UNICODE
#undef  STRICT
#define STRICT
#include <windows.h>    // MultiByteToWideChar


template< class CharType >
class AbstractOutputBuffer
    : public basic_streambuf< CharType >
{
public:
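    typedef basic_streambuf< CharType >     Base;
    typedef typename Base::traits_type      Traits;
    typedef typename Base::char_type        char_type;  // Dependent base names must be
    typedef typename Base::int_type         int_type;   // declared for standard name lookup.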

    typedef Base                            StreamBuffer;

protected:
    virtual streamsize xsputn( char_type const* const s, streamsize const n ) = 0;

    virtual int_type overflow( int_type const c )
    {
        bool const cIsEOF   = Traits::eq_int_type( c, Traits::eof() );
        int_type const  failureValue    = Traits::eof();
        int_type const  successValue    = (cIsEOF? Traits::not_eof( c ) : c);

        if( !cIsEOF )
        {
            char_type const     ch                  = Traits::to_char_type( c );
            streamsize const    nCharactersWritten  = xsputn( &ch, 1 );

            return (nCharactersWritten == 1? successValue : failureValue);
        }
        return successValue;
    }

public:
    AbstractOutputBuffer()
    {}

    AbstractOutputBuffer( StreamBuffer& existingBuffer )
        : Base( existingBuffer )
    {}
};


class OutputForwarderBuffer
    : public AbstractOutputBuffer< char >
{
public:
    typedef AbstractOutputBuffer< char >    Base;
    typedef Base::Traits                    Traits;
    
    typedef Base::StreamBuffer              StreamBuffer;
    typedef basic_streambuf<wchar_t>        WideStreamBuffer;

private:
    WideStreamBuffer*       pWideStreamBuffer_;
    wstring                 wideCharBuffer_;

    OutputForwarderBuffer( OutputForwarderBuffer const& );      // No such.
    void operator=( OutputForwarderBuffer const& );             // No such.

protected:
    virtual streamsize xsputn( char const* const s, streamsize const n )
    {
        if( n == 0 ) { return 0; }

        int const   nAsInt  = static_cast<int>( n );    //  Visual C++ sillywarnings.
        wideCharBuffer_.resize( nAsInt );
        int const nWideCharacters = MultiByteToWideChar(
            CP_ACP,             // Windows ANSI
            MB_PRECOMPOSED,     // Always precompose characters (this is the default).
            s,                  // Narrow character string.
            nAsInt,             // Number of bytes in narrow character string.
            &wideCharBuffer_[0],
            nAsInt              // Wide char buffer size.
            );
        assert( nWideCharacters > 0 );
        return pWideStreamBuffer_->sputn( &wideCharBuffer_[0], nWideCharacters );
    }

public:
    OutputForwarderBuffer(
        StreamBuffer&       existingBuffer,
        WideStreamBuffer*   pWideStreamBuffer
        )
        : Base( existingBuffer )
        , pWideStreamBuffer_( pWideStreamBuffer )
    {}
};


class DirectOutputBuffer
    : public AbstractOutputBuffer< wchar_t >
{
public:
    enum StreamId { outputStreamId, errorStreamId, logStreamId };

private:
    StreamId    streamId_;

protected:
    virtual streamsize xsputn( wchar_t const* const s, streamsize const n )
    {
        static HANDLE const outputStreamHandle  = GetStdHandle( STD_OUTPUT_HANDLE );
        static HANDLE const errorStreamHandle   = GetStdHandle( STD_ERROR_HANDLE );

        HANDLE const    streamHandle    =
            (streamId_ == outputStreamId? outputStreamHandle : errorStreamHandle );
        
        DWORD nCharactersWritten    = 0;
        bool writeSucceeded         = !!WriteConsole(
            streamHandle, s, static_cast< DWORD >( n ), &nCharactersWritten, 0
            );
        return (writeSucceeded? static_cast< streamsize >( nCharactersWritten ) : 0);
    }

public:
    DirectOutputBuffer( StreamId streamId = outputStreamId )
        : streamId_( streamId )
    {}
};


void setUtf8Conversion( wostream& stream )
{
    typedef codecvt_utf8< wchar_t >     CodeCvt;

    unique_ptr< CodeCvt > pCodeCvt( new CodeCvt );
    locale  utf8Locale( locale(), pCodeCvt.get() );
    pCodeCvt.release();
    stream.imbue( utf8Locale );
}

bool isConsole( HANDLE streamHandle )
{
    DWORD consoleMode;
    return !!GetConsoleMode( streamHandle, &consoleMode );
}

bool isConsole( DWORD stdStreamId )
{
    return isConsole( GetStdHandle( stdStreamId ) );
}

void setDirectOutputSupport( wostream& stream )
{
    typedef DirectOutputBuffer  DOB;

    if( &stream == &wcout )
    {
        if( isConsole( STD_OUTPUT_HANDLE ) )
        {
            static DOB  outputStreamBuffer( DOB::outputStreamId );
            stream.rdbuf( &outputStreamBuffer );
        }
    }
    else if( &stream == &wcerr )
    {
        if( isConsole( STD_ERROR_HANDLE ) )
        {
            static DOB errorStreamBuffer( DOB::errorStreamId );
            stream.rdbuf( &errorStreamBuffer );
        }
    }
    else if( &stream == &wclog )
    {
        if( isConsole( STD_ERROR_HANDLE ) )
        {
            static DOB logStreamBuffer( DOB::logStreamId );
            stream.rdbuf( &logStreamBuffer );
        }
    }
    else
    {
        assert(( "setDirectOutputSupport: unsupported stream", false ));
    }
}

void initStreams()
{
    // Set up UTF-8 conversions & direct console output:
    _setmode( _fileno( stdin ), _O_U8TEXT );
    _setmode( _fileno( stdout ), _O_U8TEXT );
    setUtf8Conversion( wcerr );  setDirectOutputSupport( wcerr );
    setUtf8Conversion( wclog );  setDirectOutputSupport( wclog );

    // Forward narrow character output to the wide streams:
    static OutputForwarderBuffer    coutBuffer( *cout.rdbuf(), wcout.rdbuf() );
    static OutputForwarderBuffer    cerrBuffer( *cerr.rdbuf(), wcerr.rdbuf() );
    static OutputForwarderBuffer    clogBuffer( *clog.rdbuf(), wclog.rdbuf() );

    cout.rdbuf( &coutBuffer );
    cerr.rdbuf( &cerrBuffer );
    clog.rdbuf( &clogBuffer );
}

int main()
{
    try
    {
        initStreams();

        cout << "Hello, world!" << endl;
        cout << "Did you know, most Norwegians like 'blåbærsyltetøy'?" << endl;
        cerr << ":This is an error message, also mentioning 'blåbærsyltetøy'." << endl;

        return EXIT_SUCCESS;
    }
    catch( exception const& x )
    {
        cerr << "!" << x.what() << endl;
    }
    catch( ... )
    {
        cerr << "!Unknown exception." << endl;
    }
    return EXIT_FAILURE;
}

Testing this:

W:\examples> chcp 1252
Active code page: 1252

W:\examples> cl utf8_mode.msvc.byte_stream.fixed.cpp /Fe"x"
utf8_mode.msvc.byte_stream.fixed.cpp

W:\examples> x
Hello, world!
Did you know, most Norwegians like 'blåbærsyltetøy'?
:This is an error message, also mentioning 'blåbærsyltetøy'.

W:\examples> chcp 65001
Active code page: 65001

W:\examples> x
Hello, world!
Did you know, most Norwegians like 'blåbærsyltetøy'?
:This is an error message, also mentioning 'blåbærsyltetøy'.

W:\examples> (x 2>&1) >result

W:\examples> type result
Hello, world!
Did you know, most Norwegians like 'blåbærsyltetøy'?
:This is an error message, also mentioning 'blåbærsyltetøy'.

W:\examples> _

Summary

To make the Visual C++ _O_U8TEXT UTF-8 stream mode work in general, we had to

  • fix the input-from-console functionality by adding a special direct console input buffer in the C++ level wcin stream,
     
  • avoid crashes for narrow character operations on the standard error stream by keeping that stream in ANSI mode, installing forwarder buffers for cout and cerr, installing a UTF-8 conversion facet in wcerr, and adding direct console output support to wcerr, and
     
  • forsake the use of input and narrow character operations at the C level, except for the standard error stream.
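
The forwarder buffer idea in the second point can be tried out without any Windows API. The following is only a minimal sketch, not the code above: the names Latin1ForwarderBuffer and forwardedText are hypothetical, and instead of calling MultiByteToWideChar it assumes that the narrow text is Latin-1, so that each byte maps directly to one wide character.

```cpp
#include <cassert>      // assert
#include <ostream>      // std::ostream, std::flush
#include <sstream>      // std::wostringstream
#include <streambuf>    // std::streambuf, std::wstreambuf
#include <string>       // std::string, std::wstring

// Hypothetical portable stand-in for a narrow-to-wide forwarder buffer.
// Assumption: the narrow bytes are Latin-1, so byte value == code point.
class Latin1ForwarderBuffer
    : public std::streambuf
{
    std::wstreambuf*    pWide_;

protected:
    std::streamsize xsputn( char const* s, std::streamsize n ) override
    {
        std::wstring wide;
        for( std::streamsize i = 0; i < n; ++i )
        {
            // Go via unsigned char so that bytes >= 0x80 don't become negative.
            wide += static_cast<wchar_t>( static_cast<unsigned char>( s[i] ) );
        }
        // One wide character per byte, so the count written is also a byte count.
        return pWide_->sputn( wide.data(), static_cast<std::streamsize>( wide.size() ) );
    }

    int_type overflow( int_type c ) override
    {
        if( traits_type::eq_int_type( c, traits_type::eof() ) )
        {
            return traits_type::not_eof( c );
        }
        char const ch = traits_type::to_char_type( c );
        return (xsputn( &ch, 1 ) == 1? c : traits_type::eof());
    }

public:
    explicit Latin1ForwarderBuffer( std::wstreambuf* pWide )
        : pWide_( pWide )
    {
        assert( pWide != nullptr );
    }
};

// Writes narrow text through a forwarder and returns what the wide side received.
std::wstring forwardedText( std::string const& narrowText )
{
    std::wostringstream     wideSink;
    Latin1ForwarderBuffer   forwarder( wideSink.rdbuf() );
    std::ostream            narrowStream( &forwarder );

    narrowStream << narrowText << std::flush;
    return wideSink.str();
}
```

Here the narrow stream has no buffer area of its own, so every write ends up in xsputn or overflow and is passed on at once to the wide stream buffer, which is the same division of labour as with the cout and cerr forwarders.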

With all this bug-fixing and support machinery added, it’s almost as if we had to implement the _O_U8TEXT mode from scratch! What does it really buy, then? What is the point of using that mode?

Well, essentially the _O_U8TEXT UTF-8 mode gives conversion to UTF-8 for C level wide character output operations on the standard output stream. It does not do interactive input, and it cannot reasonably be used for the standard error stream. One might therefore say that Microsoft Unicode guru Michael Kaplan’s original blog posting about this mode, where it appeared to be a simple general solution on its own, was a bit too optimistic!

Anyway, to be utterly clear, with this approach one has three main text encodings to deal with in Windows, and three main text encodings to deal with in a typical Linux installation:

            Internal wchar_t data:            Internal char data:    External data:
Windows:    UTF-16                            Windows ANSI           UTF-8
Linux:      Usually UTF-32 but also UTF-16    UTF-8                  UTF-8
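
To make the difference between the table’s Unicode encodings concrete, here is a small sketch that encodes a single Unicode code point as UTF-8 and as UTF-16. The function names toUtf8 and toUtf16 are hypothetical, made up for this illustration, and the code assumes a valid code point (no surrogate values, at most U+10FFFF).

```cpp
#include <cassert>      // assert
#include <string>       // std::string, std::u16string

// Encodes one code point as 1 to 4 UTF-8 bytes.
// Assumption: cp is a valid Unicode scalar value.
std::string toUtf8( char32_t const cp )
{
    assert( cp <= 0x10FFFF );
    std::string result;
    if( cp < 0x80 )
    {
        result += static_cast<char>( cp );
    }
    else if( cp < 0x800 )
    {
        result += static_cast<char>( 0xC0 | (cp >> 6) );
        result += static_cast<char>( 0x80 | (cp & 0x3F) );
    }
    else if( cp < 0x10000 )
    {
        result += static_cast<char>( 0xE0 | (cp >> 12) );
        result += static_cast<char>( 0x80 | ((cp >> 6) & 0x3F) );
        result += static_cast<char>( 0x80 | (cp & 0x3F) );
    }
    else
    {
        result += static_cast<char>( 0xF0 | (cp >> 18) );
        result += static_cast<char>( 0x80 | ((cp >> 12) & 0x3F) );
        result += static_cast<char>( 0x80 | ((cp >> 6) & 0x3F) );
        result += static_cast<char>( 0x80 | (cp & 0x3F) );
    }
    return result;
}

// Encodes one code point as 1 or 2 UTF-16 code units.
// Assumption: cp is a valid Unicode scalar value.
std::u16string toUtf16( char32_t const cp )
{
    assert( cp <= 0x10FFFF );
    std::u16string result;
    if( cp < 0x10000 )
    {
        result += static_cast<char16_t>( cp );
    }
    else
    {
        // Code points beyond the BMP are split into a surrogate pair.
        char32_t const offset = cp - 0x10000;
        result += static_cast<char16_t>( 0xD800 + (offset >> 10) );
        result += static_cast<char16_t>( 0xDC00 + (offset & 0x3FF) );
    }
    return result;
}
```

For example, å (U+00E5) becomes the two bytes C3 A5 in UTF-8 but a single UTF-16 code unit, while 🙂 (U+1F642) needs four UTF-8 bytes and the UTF-16 surrogate pair D83D DE42.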

The main forces in play, yielding the Windows row of the above table, are that …

  • the core Windows API is UTF-16 based, and wchar_t is 16 bits in Windows,
     
  • Visual C++ has Windows ANSI as its C++ execution character set, which e.g. forces ordinary narrow character literals to Windows ANSI encoding even when the source code is UTF-8, and
     
  • one desires UTF-8 for external data both to avoid data loss and for general interoperability.
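
The effect of the execution character set on narrow literals can be observed at compile time. This is a sketch under stated assumptions: byteCount, utf8Bytes, narrowBytes and narrowLiteralsAreUtf8 are hypothetical names, and the non-ASCII letters are written as universal character names so that the result does not depend on how the compiler reads the source file.

```cpp
#include <cstddef>      // std::size_t

// Number of bytes in a string literal, not counting the terminating null.
template< class Char, std::size_t n >
constexpr std::size_t byteCount( Char const (&)[n] )
{
    return (n - 1)*sizeof( Char );
}

// "blåbærsyltetøy" has 14 characters, of which 3 are non-ASCII.

// A u8 literal is UTF-8 encoded regardless of the execution character set:
// 11 ASCII bytes + 3*2 bytes.
constexpr std::size_t utf8Bytes     = byteCount( u8"bl\u00E5b\u00E6rsyltet\u00F8y" );
static_assert( utf8Bytes == 17, "A u8 literal is always UTF-8 encoded." );

// An ordinary narrow literal uses the execution character set, so its size
// depends on the compiler and its options: with a single byte encoding
// such as Windows ANSI Western it is 14 bytes, with UTF-8 it is 17.
constexpr std::size_t narrowBytes   = byteCount( "bl\u00E5b\u00E6rsyltet\u00F8y" );

// True if this build encodes ordinary narrow literals the same way as u8 literals.
constexpr bool narrowLiteralsAreUtf8()
{
    return narrowBytes == utf8Bytes;
}
```

A build where narrowLiteralsAreUtf8() yields false is one where narrow literals will need conversion before they can be presented as UTF-8.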

There is not much that can be done about these forces, so the basic approach to deal with these issues is either very similar to what I have described here, or will otherwise involve pretty painful trade-offs.

Not to say, of course, that the trade-offs in this posting’s approach aren’t a little painful! 🙂

Cheers, & enjoy!