A wrapper for UTF-8 Windows apps.

The Windows API is based on UTF-16 encoded wide text. For example, the API function CommandLineToArgvW, which parses a command line, exists only in a wide text version. But the introduction of support for UTF-8 as the process code page in the May 2019 update of Windows 10 greatly increases the incentive to use UTF-8 encoded narrow text internally in Windows GUI applications, i.e. to use UTF-8 as the C++ execution character set also in Windows.

This article presents a minimal example of that (a message box with international text, using the char based Windows API via wrappers); shows one way to configure Windows with UTF-8 as the ANSI code page default; and shows how to build such a program with the MinGW g++ and Visual C++ toolchains.

This is discussed in that order in the following sections:

  1. A minimal example.
  2. The header wrappers.
  3. Configuring Windows with UTF-8 as the ANSI code page default.
  4. An application manifest resource specifying UTF-8.
  5. Building with MinGW g++ and with Visual C++.

I apologize for the less than perfect formatting and possible oddities. Every time I edit, WordPress removes all instances of the text <windows.h> and wreaks havoc on the rest. This article was originally written as a GitHub-compatible markdown file, but markdown syntax highlighting, and a lot more, didn’t work in WordPress, so the text had to be very manually re-expressed as a sequence of WordPress “blocks”.

1. A minimal example.

With a suitable wrapper for <windows.h> the C++ code of a program that displays a Windows message box with international text can now be as simple as this:

minimal.cpp

#include <header-wrappers/winapi/windows-h.utf8.hpp>

auto main()
    -> int
{
    const auto& text    = "Every 日本国 кошка loves Norwegian blåbærsyltetøy, nom nom!";
    const auto& title   = "Some Norwegian, Russian & Chinese text in there:";
    MessageBox( 0, text, title, MB_ICONINFORMATION | MB_SETFOREGROUND );
}

Result when the program is built with a specification of UTF-8 as process code page, or alternatively is run in a Windows installation configured with UTF-8 as the ANSI code page default:

Image of OK messagebox

In contrast, here is what it looks like when a corresponding program using <windows.h> directly is built without a specification of UTF-8 as process code page and the Windows ANSI default is codepage 1252, Windows ANSI Western, as in the old days:

Image of ungood Windows ANSI Western messagebox

2. The header wrappers.

The wrapper <header-wrappers/winapi/windows-h.utf8.hpp> supports this new “like ordinary C++” kind of Windows application:

  • it increases the C++-compatibility of <windows.h> by suppressing the min and max macro definitions via option NOMINMAX and by asking for more strictly typed declarations via option STRICT, plus it reduces the size of this gargantuan include (e.g. just now from 80 287 lines to 54 426 lines, with MinGW g++), via option WIN32_LEAN_AND_MEAN,
  • it makes the char based …A-functions such as MessageBoxA available without suffix, i.e. for that example as just MessageBox, by ensuring that option UNICODE is not defined, and
  • it asserts that the effective process codepage is UTF-8, which it might or might not be.

header-wrappers/winapi/windows-h.utf8.hpp

#pragma once
#undef UTF8_WINAPI
#define UTF8_WINAPI
#include "windows-h.hpp"

namespace uuid_0985060C_1AAD_453C_B3F9_A2E543F4CF1E {
    // The function-local static below ensures that the assertion runs at most once in
    // the whole program, even though each translation unit gets its own internal
    // linkage Winapi_envelope instance.
    struct Winapi_envelope
    {
        Winapi_envelope()
        {
            static const bool dummy = winapi_h_assert_utf8_codepage();
        }
    };
    
    static const Winapi_envelope ensured_globally_single_utf8_assertion{};
}  // namespace uuid_0985060C_1AAD_453C_B3F9_A2E543F4CF1E

The little complexity above could be avoided by using a C++17 inline variable, which would be more in the C++ spirit of maximum performance and minimum verbosity when there is a choice. However, many people are still stuck with earlier C++ standards, and though a fallback using static instead could be provided automatically, the header would then require Visual C++ 2019 users to add the option /Zc:__cplusplus, which is not presently supported by the Visual Studio GUI.
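
For reference, here is a sketch (not the shipped header, and it assumes C++17) of what the inline variable variant could look like:

#pragma once
#undef UTF8_WINAPI
#define UTF8_WINAPI
#include "windows-h.hpp"

// One shared definition across all translation units; its dynamic initialization
// runs the UTF-8 codepage assertion exactly once, before main is entered.
inline const bool winapi_utf8_codepage_asserted = winapi_h_assert_utf8_codepage();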

Except for that issue the wrapper is designed to be a trivial top-level wrapper, so that one can replace it with one’s own equally trivial top-level wrapper, for example in order to communicate a “not UTF-8 process code page” failure to the user in a manner of one’s own choosing.

To wit, the wrapper delegates the first two points to a more basic wrapper <header-wrappers/winapi/windows-h.hpp>, which goes like this:

header-wrappers/winapi/windows-h.hpp

#pragma once
#ifdef MessageBox
#   error "<windows.h> has already been included, possibly with undesired options."
#endif

#include <assert.h>
#ifdef _MSC_VER
#   include <iso646.h>                  // Standard `and` etc. also with MSVC.
#endif

#ifndef _WIN32_WINNT
#   define _WIN32_WINNT     0x0600      // Windows Vista as earliest supported OS.
#endif
#undef WINVER
#define WINVER _WIN32_WINNT

#define IS_NARROW_WINAPI() \
    ("Define UTF8_WINAPI please.", sizeof(*GetCommandLine()) == 1)

#define IS_WIDE_WINAPI() \
    ("Define UNICODE please.", sizeof(*GetCommandLine()) > 1)

// UTF8_WINAPI is a custom macro for this file. UNICODE, _UNICODE and _MBCS are MS macros.
#if defined( UTF8_WINAPI) and defined( UNICODE )
#   error "Inconsistent encoding options, both UNICODE (UTF-16) and UTF8_WINAPI (UTF-8)."
#endif

#undef UNICODE
#undef _UNICODE
#ifdef UTF8_WINAPI
#   define _MBCS        // Mainly for 3rd party code that uses it for platform detection.
#else
#   define UNICODE
#   define _UNICODE     // Mainly for 3rd party code that uses it for platform detection.
#endif
#undef NOMINMAX
#define NOMINMAX
#undef STRICT
#define STRICT
#undef WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN
// After this an `#include <winsock2.h>` will actually include that header.

#include <windows.h>

inline auto winapi_h_assert_utf8_codepage()
    -> bool
{
    #ifdef __GNUC__
        #pragma GCC diagnostic push
        #pragma GCC diagnostic ignored "-Wunused-value"
    #endif
    assert(( "The process codepage isn't UTF-8 (old Windows?).", GetACP() == 65001 ));
    #ifdef __GNUC__
        #pragma GCC diagnostic pop
    #endif
    return true;
}
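
The IS_NARROW_WINAPI and IS_WIDE_WINAPI macros are not used by the wrapper itself; they appear to be intended for checks in client code. As a usage sketch of my own (not from the original headers), since they expand to compile time constant expressions they can be used in a static_assert:

#include <header-wrappers/winapi/windows-h.utf8.hpp>

static_assert( IS_NARROW_WINAPI(), "This code requires the char based (UTF-8) Windows API." );

auto main() -> int
{
    MessageBox( 0, "Narrow API verified at compile time.", "Check", MB_ICONINFORMATION );
}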

3. Configuring Windows with UTF-8 as the ANSI code page default.

For portability the program should preferably be built with the UTF-8 process code page specified in an application manifest resource. Alternatively it will work to configure Windows with UTF-8 as the Windows ANSI default, provided it’s Windows 10 with the May 2019 update or later. But probably few if any ordinary users will want to reconfigure their Windows, or to let a program do that, just in order to run that program.

You as a developer may however find UTF-8 as the Windows ANSI default highly convenient.

It worked for me to change the item ACP to value 65001, in the semi-documented registry key

Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage

You can change the value e.g. via the “regedit” GUI utility, or the reg command. You must then reboot Windows for the change to take effect.
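
For example, via the reg command from an elevated command prompt (a sketch; as far as I can tell the ACP item is stored as a REG_SZ string):

> reg add "HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage" /v ACP /t REG_SZ /d 65001 /f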

And voilà! 🙂

But wait! …

Before doing that let’s build the program with a manifest that specifies UTF-8 as process code page. That way the program will work on any post-May-2019 Windows 10 installation, not just “it works on my computer!”. The <windows.h> wrapper shown above ensures that it will not mistakenly run and present gibberish on an earlier Windows version.

4. An application manifest resource specifying UTF-8.

An application manifest is a UTF-8 encoded XML file that, if properly magically named, can just be shipped with the application, but that is best embedded as a resource in the executable.

The following manifest specifies both UTF-8 as the process code page, and that the app uses version 6.0 or later of the Common Controls DLL, as opposed to an earlier version that has the same DLL name. The newer Common Controls version gives a modern look and feel to buttons, menus, lists, edit fields etc. Why that’s not the default, or why it requires this heavy machinery to specify, would be a mystery if Microsoft were an ordinary company.

Anyway, the text:

resources/application-manifest.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
    <assemblyIdentity type="win32" name="UTF-8 message box" version="1.0.0.0"/>
    <application>
        <windowsSettings>
            <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings"
                >UTF-8</activeCodePage>
        </windowsSettings>
    </application>
    <dependency>
        <dependentAssembly>
            <assemblyIdentity
                type="win32"
                name="Microsoft.Windows.Common-Controls"
                version="6.0.0.0"
                processorArchitecture="*"
                publicKeyToken="6595b64144ccf1df"
                language="*"
                />
        </dependentAssembly>
    </dependency>
</assembly>

Beware: at the time of writing there could be no whitespace such as space or newline on either side of the “UTF-8” activeCodePage value, and it had to be all uppercase.

5. Building with MinGW g++ and with Visual C++.

One way to include that text as a resource in the executable is to use a general resource script:

resources.rc

#include <windows.h>
1 RT_MANIFEST "resources/application-manifest.xml"

Here the 1 is the resource id (the id that Windows looks for in an executable’s manifest, CREATEPROCESS_MANIFEST_RESOURCE_ID), and RT_MANIFEST is the resource type (defined in the Windows headers as the small integer 24).

With the MinGW GNU tools this script is compiled by windres into an apparently ordinary object file, which is just linked with the main program object file:

[G:\code\minimal_gui\binaries]
> set CFG=-std=c++17 -Wall -pedantic-errors

[G:\code\minimal_gui\binaries]
> g++ %CFG% ..\minimal.cpp -c -o minimal.o

[G:\code\minimal_gui\binaries]
> windres ..\resources.rc -o resources.o

[G:\code\minimal_gui\binaries]
> g++ minimal.o resources.o -mwindows

Here the -mwindows option specifies the GUI subsystem for the executable, so that Windows doesn’t pop up a console window when one runs the program from Windows Explorer.

With Microsoft’s tools the script is compiled by rc into a special binary resource format in a .res file, which is just linked with the main program object file. Options can be passed to the compiler cl.exe via the environment variable CL, and to the linker link.exe via the environment variable LINK. Using an obscure linker option is unfortunately necessary for building a GUI subsystem executable with a standard C++ main function with this toolchain:

[G:\code\minimal_gui\binaries]
> set CL=^
More? /nologo ^
More? /utf-8 /EHsc /GR /FI"iso646.h" /std:c++17 /Zc:__cplusplus /W4 ^
More? /wd4459 /D _CRT_SECURE_NO_WARNINGS /D _STL_SECURE_NO_WARNINGS

[G:\code\minimal_gui\binaries]
> cl ..\minimal.cpp /c
minimal.cpp

[G:\code\minimal_gui\binaries]
> rc /nologo /fo resources.res ..\resources.rc

[G:\code\minimal_gui\binaries]
> set LINK=/entry:mainCRTStartup

[G:\code\minimal_gui\binaries]
> link /nologo minimal.obj resources.res user32.lib /subsystem:windows /out:b.exe

Microsoft also has a dedicated tool, mt.exe, to handle application manifests, but I haven’t used it.
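
For reference, a typical invocation would presumably look something like this (untested by me; it embeds the manifest as resource id 1 into the already linked executable):

> mt -nologo -manifest ..\resources\application-manifest.xml -outputresource:b.exe;#1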

UTF-8 in the Windows API

The May 2019 update of Windows 10 introduced the possibility of setting the ActiveCodePage property of an executable to UTF-8. This is done via the application manifest. The documentation is super-vague on the technical details and history, and in usual Microsoft fashion the functionality is obscured and the little desirable kill-a-gnat can only be done by costly nuclear bombing, so to speak — why let something simple be simple if it can be wrapped in military standard complexity?

But it means that with Visual C++ 2019 one can now use UTF-8 encoding for GUI applications, and for the output of console programs, without any encoding conversions in the code.

In particular, with UTF-8 as the active process codepage the arguments of main now come handily UTF-8 encoded, which means that they can now represent general filenames also in Windows. Hurray! Yippi!
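
For instance, in a program built with the UTF-8 manifest, a sketch like the following (my example) can pass the UTF-8 encoded arguments straight to fopen, international filenames and all:

#include <stdio.h>

auto main( int argc, char** argv )
    -> int
{
    for( int i = 1; i < argc; ++i )
    {
        // With UTF-8 as the active process codepage each argv[i] is UTF-8 encoded,
        // so a name like "blåbærsyltetøy.txt" needs no conversion.
        if( FILE* const f = fopen( argv[i], "r" ) )
        {
            printf( "Opened \"%s\".\n", argv[i] );
            fclose( f );
        }
        else
        {
            printf( "Couldn't open \"%s\".\n", argv[i] );
        }
    }
}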

However, interactive console input of UTF-8 is still limited to ASCII at the API level. And the MinGW g++ 9.2 compiler’s default standard library implementation doesn’t support UTF-8 in the C and C++ locale machinery, e.g. in setlocale, probably because it employs an old version of Microsoft’s runtime library. That means that FILE* or iostreams UTF-8 console output with MinGW g++ 9.2 only works for the default “C” locale.

I experimented by setting the ANSI codepage default in the registry to 65001, the UTF-8 codepage number. After rebooting the console windows came up with active codepage 65001, even though the OEM codepage default was the same old one (850 in my case). That indicates an effort on Microsoft’s part to support UTF-8 all the way in Windows, which if so is fantastically good.


An example application manifest.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
    <assemblyIdentity type="win32" name="UTF-8 app example" version="6.0.0.0"/>
    <application>
        <windowsSettings>
            <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings"
                >UTF-8</activeCodePage>
        </windowsSettings>
    </application>
    <dependency>
        <dependentAssembly>
            <assemblyIdentity
                type="win32"
                name="Microsoft.Windows.Common-Controls"
                version="6.0.0.0"
                processorArchitecture="*"
                publicKeyToken="6595b64144ccf1df"
                language="*"
                />
        </dependentAssembly>
    </dependency>
</assembly>

The second assemblyIdentity part has nothing to do with the UTF-8 support; it just corrects a practically unusable default for the look and feel of buttons etc. Essentially this manifest corrects the two “wrong” defaults: the narrow text encoding, and the look ’n feel. From within an application with this manifest it looks as if both CP_ACP (the global default) and CP_THREAD_ACP (the mysterious thread codepage) are UTF-8, codepage 65001.

In my experimentation UTF-8 had to be specified in uppercase, and it did not work with whitespace such as space or newline on either side.


An example resource script:

#include <windows.h>
1 RT_MANIFEST "resources/application-manifest.xml"

Why COW was deemed ungood for std::string.

COW, short for copy on write, is a way to implement mutable strings so that creating strings and logically copying strings is reduced to almost nothing; conceptually they become free operations, like no-ops.

Basic idea: share a data buffer among string instances, and only make a copy for a specific instance (the copy on write) when that instance’s data is modified. The general cost of this is only an extra indirection for accessing the value of a string, so a COW implementation is highly desirable. And so the original C++ standard, C++98, and its correction C++03, had special support for COW implementations, and e.g. the g++ compiler’s std::string implementation used COW.

So why was that support dropped in C++11?

In particular, would the same reason or reasons apply to a reference counted immutable string value class?

As we’ll see it does not; it’s just a severe mismatch between the std::string design and the ideal COW requirements. But it took a two hour car trip, driving 120 km on winter roads, for my memory to yet again cough up the relevant scenario where Things Go Wrong™. I guess it’s like the “why can’t I assign a T** to a T const**” question; it’s quite counter-intuitive.

Basic COW string theory: the COPOW principle.

A COW string has two possible states: exclusively owning the buffer, or sharing the buffer with other COW strings.

It starts out in state owning. Assignments and copying initializations can make it sharing. Before executing a “write” operation it must ensure that it’s in owning state, and a transition from sharing to owning involves copying the buffer contents to a new and for now exclusively owned buffer.

With a string type designed for COW any operation will be either non-modifying, a “read” operation, or directly modifying, a “write” operation, which makes it easy to determine whether the string must ensure state owning before executing the operation.

With a std::string, however, references, pointers and iterators to mutable contents are handed out with abandon. Even a simple value indexing of a non-const string, s[i], hands out a reference that can be used to modify the string. And so for a non-const std::string every such hand-out-free-access operation can effectively be a “write” operation, and would have to be regarded as such for a COW implementation (if the current C++ standard had permitted a COW implementation, which it doesn’t).

I call this the principle of copy on possibility of write, or COPOW for short. It’s for strings that aren’t designed for COW. For a COW-oriented design applying COPOW reduces to pure COW.

A code example showing how COW works.

To keep the size of the following example down I don’t address the issue of constant time initialization from literal, but just show how assignment and copy initialization can be reduced to almost nothing:

#include <cppx-core/_all_.hpp>  // https://github.com/alf-p-steinbach/cppx-core

using C_str = const char*;      // Is also available in cppx.

namespace my
{
    $use_cppx( Raw_array_of_, Size );
    $use_std( begin, end, make_shared, vector, shared_ptr );

    class Cow_string
    {
        using Buffer = vector<char>;

        shared_ptr<Buffer>      m_buffer;
        Size                    m_length;

        void ensure_is_owning()
        {
            if( m_buffer.use_count() > 1 )
            {
                m_buffer = make_shared<Buffer>( *m_buffer );
            }
        }

    public:
        auto c_str() const
            -> C_str
        { return m_buffer->data(); }

        auto length() const
            -> Size
        { return m_length; }

        auto operator[]( const Size i ) const
            -> const char&
        { return (*m_buffer)[i]; }

        auto operator[]( const Size i )
            -> char&
        {
            ensure_is_owning();
            return (*m_buffer)[i];
        }

        template< Size n >
        Cow_string( Raw_array_of_<n, const char>& literal ):
            m_buffer( make_shared<Buffer>( literal, literal + n ) ),
            m_length( n - 1 )
        {}
    };
}  // namespace my

Here assignment is the default-generated assignment operator that just assigns the data members m_buffer and m_length, which are a shared_ptr and an integer, and ditto for copy initialization.

And apparently this code abides by the COPOW principle, so it should be safe…

The problem: UB by adding code that just copies.

Consider the following usage code; it’s perfectly fine:

auto main() -> int
{
    my::Cow_string s = "Du store Alpakka!";
    const C_str p = s.c_str();

    // In this block the contents of `s` are not modified.
    {
        $use_std( ignore );
        const char first_char = s[0];
        ignore = first_char;
    }

    $use_std( cout, endl );
    cout << p << endl;
}

This code is fine because the COW string is already in state owning when s[0] is executed on the non-const s. So all that the initialization of first_char does is to copy a char value. Fine.

But if a maintainer innocently just introduces a logical copy of the string value, which is what COW primarily optimizes, and which certainly doesn’t change the conceptual value, then mayhem ensues:

auto main() -> int
{
    my::Cow_string s = "Du store Alpakka!";
    const C_str p = s.c_str();

    // In this block the contents of `s` are not modified.
    {
        $use_std( ignore );
        my::Cow_string other = s;
        ignore = other;

        const char first_char = s[0];
        ignore = first_char;
    }

    $use_std( cout, endl );
    cout << p << endl;      //! Undefined behavior, p is dangling.
}

Uh oh.

Since s here is in state sharing, the COPOW principle makes the s[0] operation copy the shared buffer, to become owning. Then at the end of the block the only remaining owner of the original buffer, the other string, is destroyed, and destroys the buffer. Which leaves the p pointer dangling.

For a custom string type like Cow_string this is a user error. The type is just badly designed, so that it’s very easy to inadvertently use it incorrectly. But for a std::string it’s formally a bug in the COW implementation, a bug showing that COPOW is just not enough.

For a std::string, if the standard had permitted a COW implementation, avoiding the above calamity would require transitioning to the owning state, incurring an O(n) copy of the string data, every place that a reference, pointer or iterator is handed out, regardless of the const-ness of the string. One could maybe call that copy on handing out any reference, COHOAR for short. It greatly reduces the set of cases where COW has an advantage. The C++ standardization committee deemed that cost too high, and the remaining advantages of COW too low, to continue supporting COW. So,

  • the C++03 wordings that supported COW were removed;
  • wording was introduced, especially a general O(1) complexity requirement for [] indexing, that disallowed COW; and
  • functionality such as string_view was added, that relies on having pointers to string buffers, and that itself hands out references.

What about threads?

It’s a common misconception that COW std::strings would be incompatible with multi-threading, or that making them compatible would make them inefficient, because with COW ordinary copying of a string doesn’t yield an actual copy that another thread can access freely.

In order to allow string instances that are used by different threads to share a buffer, just about every access function, including simple [] indexing, would need to use a mutex.

However, a simple solution is to not use ordinary copy initialization or assignment to create a string for another thread, but instead a guaranteed content copying operation such as std::string::substr, or initialization from an iterator pair. The C++11 standard could have gone in this other direction: it could, in principle, have built on the existing C++03 support for COW strings, noting that COHOAR, not just COPOW, is required, and added a dedicated deep-copy operation or deep-copy support type, plus wording about thread safety.
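
As an illustration of my own (it compiles with today’s non-COW std::string; with a hypothetical COW implementation the point would be that these forms, unlike a plain copy, guarantee a separate buffer):

#include <iostream>     // std::cout
#include <string>       // std::string
#include <thread>       // std::thread
#include <utility>      // std::move

void worker( const std::string s )          // Receives its own string by value.
{
    std::cout << s << "\n";
}

auto main() -> int
{
    const std::string s = "Du store Alpakka!";

    std::string deep_copy_a = s.substr();               // Whole-string substring.
    std::string deep_copy_b( s.begin(), s.end() );      // Initialization from iterator pair.

    std::thread t( worker, std::move( deep_copy_a ) );
    t.join();
    std::cout << deep_copy_b << "\n";
}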

What about reference counted immutable strings?

An immutable string is a string type, such as the built-in string types in Java, C# or Python, where the available operations don’t support modification of the string data. Strings can still be assigned; one just can’t directly change the string data, like changing “H” to “J” in “Hello”.

Immutable strings can, and in C++ typically will, share their string data via reference counting, just as COW strings do. As with COW strings they support amortized constant time initialization from a literal, ditto superfast copy assignment and copy initialization, and in addition, if the strings don’t need to be zero-terminated, they support constant time substring operations. They’re highly desirable.
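
To illustrate the idea (a sketch of my own, not the cppx library), an immutable string can hold its data via a shared_ptr to const data; copying then just bumps a reference count, and no operation can modify the shared buffer:

#include <cstddef>      // std::size_t
#include <memory>       // std::shared_ptr, std::make_shared
#include <string>       // std::string
#include <utility>      // std::move

namespace my
{
    class Immutable_string
    {
        std::shared_ptr<const std::string>  m_data;     // Shared and never modified.

    public:
        auto length() const -> std::size_t  { return m_data->size(); }
        auto c_str() const  -> const char*  { return m_data->c_str(); }

        auto operator[]( const std::size_t i ) const
            -> char
        { return (*m_data)[i]; }            // Hands out a value, not a reference.

        Immutable_string( std::string value ):
            m_data( std::make_shared<const std::string>( std::move( value ) ) )
        {}
    };
}  // namespace my

Copy initialization and assignment just copy the shared_ptr (whose reference count updates are thread safe), and since no operation hands out a way to modify the buffer, the surprise in the COW example above, where a mere copy led to the original buffer being thrown away, can’t arise.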

So, is the problem shown above a showstopper for immutable strings in C++?

Happily, no. The problem comes about because std::string hands out references, pointers and iterators that can be used to change the string data without involving std::string code, i.e. without its knowledge. That can’t happen with an immutable string.

And figuring this out, whether there was a showstopper, and whether std::string_view (which hands out references) could be used freely in code dealing with immutable strings, was the reason that I delved into the question of COW std::string again. At one point, long ago, I knew, because I participated in some debates about it, but remembering the problematic use case wasn’t easy. It’s just not intuitive to me that adding an operation that just copies can create UB…

Unicode part 1: Windows console i/o approaches



The Windows console subsystem has a host of Unicode-related bugs. And standard Windows programs such as more (not to mention the C# 4.0 compiler csc) just crash when they’re run from a console window with UTF-8 as the active codepage, perplexingly claiming that they’re out of memory. On top of that, the C++ runtime libraries of various compilers differ in how they behave. Doing C++ Unicode i/o in Windows consoles is therefore problematic. In this series I show how to work around the limitations of the Visual C++ _O_U8TEXT file mode, with the Visual C++ and g++ compilers. This yields an automatic translation between external UTF-8 and internal UTF-16, enabling Windows console i/o of characters in the Basic Multilingual Plane.

Introduction

In both Windows and Linux properly internationalized applications use either UTF-16 or UTF-32 for their internal string handling. For example, the popular cross platform ICU library (International Components for Unicode) is based on UTF-16 encoded strings. For this kind of application Windows seems to be the better fit, since Windows’ API is UTF-16 based while Linux’ API is effectively, on a modern installation, UTF-8 based.

Still, in a simple console program one does not typically take on the quite steep overhead of full-fledged Unicode handling.

Instead of using a full-fledged Unicode handling library like ICU, one then relies on just the standard C and C++ libraries, and the Unicode handling reduces to what can easily be expressed using just the direct C++ language and standard library support.

How the Linux “all UTF-8” approach does not work in Windows

In Linux the typical small Unicode console program has everything char-based and UTF-8 encoded. The external data, the internal strings, the string literals, and of course the C or C++ source code, are all UTF-8 encoded. The total unification allows simple programs like this:

[utf8_sans_bom.all_utf8.cpp]
#include <stdexcept>        // std::runtime_error, std::exception
#include <stdlib.h>         // EXIT_SUCCESS, EXIT_FAILURE
#include <iostream>         // std::cout, std::cerr, std::endl
#include <string>           // std::string
using namespace std;

bool throwX( string const& s ) { throw runtime_error( s ); }
bool hopefully( bool v ) { return v; }

string lineFrom( istream& stream )
{
    string result;
    getline( stream, result );
    hopefully( !stream.fail() )
        || throwX( "lineFrom: failed to read line" );
    return result;
}

int main()
{
    try
    {
        static char const       narrowText[]    = "Blåbærsyltetøy! 日本国 кошка!";
        
        cout << "Narrow text: " << narrowText << endl;
        cout << endl;
        cout << "What's your name? ";
        string const name = lineFrom( cin );
        cout << "Glad to meet you, " << name << "!" << endl;
        return EXIT_SUCCESS;
    }
    catch( exception const& x )
    {
        cerr << "!" << x.what() << endl;
    }
    return EXIT_FAILURE;
}

Testing this in Ubuntu 11.10:

[~/blog/examples]
$ g++ utf8_sans_bom.all_utf8.cpp
[~/blog/examples]
$ ./a.out
Narrow text: Blåbærsyltetøy! 日本国 кошка!

What's your name? Bjørn Bråten Sæter
Glad to meet you, Bjørn Bråten Sæter!
[~/blog/examples]
$  _

Yay, it worked OK in Linux!

Testing the very same source code file in Windows using the same Linux-origins compiler (namely g++), and intentionally not specifying any codepage for the console window:

W:\examples> g++ -pedantic -Wall utf8_sans_bom.all_utf8.cpp

W:\examples> a
Narrow text: Blåbærsyltetøy! 日本国 кошка!

What's your name? Bjørn Bråten Sæter
Glad to meet you, Bjorn Bråten Sæter!

W:\examples> _

One reason for the gobbledygook here is that the Windows console by default assumes that the program produces OEM encoded text. That means, it assumes that the text is encoded using the original IBM PC character set, or a variation of that old character set. This encoding assumption is called the console window’s active codepage, and it can be inspected and changed via the chcp command, e.g. from codepage 437 (original IBM PC character set) to 65001 (UTF-8):

W:\examples> chcp
Active code page: 437

W:\examples> chcp 65001
Active code page: 65001

W:\examples> a
Narrow text: Blåbærsyltetøy! 日本国 кошка!huh?

W:\examples> _

Positive: the initial UTF-8 text output appeared to work. The Chinese characters 日本国 displayed as just empty rectangles, but they copied OK. Both the Norwegian and Russian copied OK and also displayed OK.

Negative: input apparently did not work, and it apparently caused some of the program’s output (including the prompt before the input operation) to disappear!

Exactly what went wrong above is difficult to say for sure. It might be the input operation, or it might be something else. However, the exact cause is irrelevant because input fails outright, not just producing weird side effects, if the user types in some non-ASCII characters such as Norwegian æ, ø and å:

W:\examples> a
Narrow text: Blåbærsyltetøy! 日本国 кошка!Bjørn Bråten Sæter
!lineFrom: failed to read line

W:\examples> _

About direct console i/o

Given that total failure for the “all UTF-8” approach has been established, it may seem overkill to also show the unintelligible output with the Windows platform’s major compiler, Visual C++ (here version 10.0), but as you’ll see it’s relevant:

W:\examples> cl utf8_sans_bom.all_utf8.cpp /Fe"b"
utf8_sans_bom.all_utf8.cpp

W:\examples> chcp
Active code page: 65001

W:\examples> b
Narrow text: Bl��b��rsyltet��y! ��������� ����������!

What's your name? Bjørn Bråten Sæter
!lineFrom: failed to read line

W:\examples> _

Here the Visual C++ runtime detects that the standard output is connected to a console window. Instead of sending the text via the ordinary standard output stream, it then attempts to place the correct Unicode UCS-2 encoded characters directly in the console window’s text buffer. However, since the C++ source code was encoded as UTF-8 without a BOM (as is usual in Linux), the Visual C++ compiler erroneously assumed that the source code was encoded as Windows ANSI. And since Visual C++ has Windows ANSI more or less hardwired as its C++ narrow execution character set, it blindly copied the string literal’s bytes into the executable’s string values. The runtime is therefore handed UTF-8 bytes instead of the Windows ANSI bytes that it expects for its direct console i/o, and so its helpful translation to UCS-2 fails…

At the Windows API level the runtime implements direct console output by calling the WriteConsole function instead of the WriteFile function. And similarly, if the console input had worked, it would probably have been via a call to the ReadConsole function instead of the ReadFile function. The WriteConsole function accesses the console window’s text buffer directly and takes a UTF-16 wchar_t based argument, and ditto for ReadConsole.
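
For illustration (a sketch of my own, not the runtime’s actual code), direct console output at the API level looks roughly like this:

#include <windows.h>
#include <wchar.h>      // wcslen

int main()
{
    // Writes wide (UTF-16) text directly into the console window's text buffer,
    // bypassing the C/C++ stream machinery. Only works when the handle refers to
    // an actual console, not to a redirected file or pipe.
    HANDLE const output = GetStdHandle( STD_OUTPUT_HANDLE );
    wchar_t const text[] = L"Direct console output: bl\u00E5b\u00E6rsyltet\u00F8y!\r\n";
    DWORD nWritten = 0;
    WriteConsoleW( output, text, static_cast<DWORD>( wcslen( text ) ), &nWritten, 0 );
}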

Portable source code should be UTF-8 with BOM

One can avoid the direct console i/o by redirecting the output.

Such redirection establishes that the output’s byte level data is good, i.e. that all would have been well for this particular program’s output if not for the interference from the probably well-intentioned direct console i/o help attempt:

W:\examples> echo Bjørn Bråten Sæter | b >result

W:\examples> type result
Narrow text: Blåbærsyltetøy! 日本国 кошка!

What's your name? Glad to meet you, Bjørn Bråten Sæter !

W:\examples> _

And because the data is correct, one can be sure that the Visual C++ compiler was tricked into assuming that the source code was ANSI Western. This then means that any wide string literal, which a Windows compiler has to translate to UTF-16, will be incorrectly translated if it contains any non-ASCII characters. Hence, for portable source code it is not a good idea to encode the source code as UTF-8 without a BOM, for that is effectively to lie to the compiler.

Now that g++ also accepts a BOM at the start of the source code, portable source code should therefore be encoded as UTF-8 with a BOM.

With the BOM in place Visual C++ correctly determines that the source code is UTF-8 encoded, although as of late 2011 this appears to still be undocumented. And with a correct assumption about the source code’s encoding, narrow string literals are correctly translated to Windows ANSI encoded string values in the executable. For Unicode literals in Windows one should therefore use wide string literals, e.g. L"Blåbærsyltetøy! 日本国 кошка!", which in Windows ends up as a UTF-16 encoded string value in the executable.

The Visual C++ UTF-8 stream mode

So: use source code encoded as UTF-8 with a BOM, and use wide string literals. OK, one just has to accept that complication. But how does one then output one of these literals?

After all, doesn’t std::wcout in Windows have a rather strong tendency to translate down to Windows ANSI, not to UTF-8?

Well, in his 2008 blog posting Conventional wisdom is retarded, aka What the @#%&* is _O_U16TEXT? Michael Kaplan explained that

“the [Visual C++] CRT? Starting in 2005/8.0, it knows more about Unicode than any of us having been giving it credit for…”

The Visual C++ runtime library can convert automatically between internal UTF-16 and external UTF-8, if you just ask it to do so by calling the _setmode function with the appropriate file descriptor number and mode flag. E.g., mode _O_U8TEXT causes conversion to/from UTF-8.
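
In code that can look like this (a minimal sketch; the scaffolding that is actually needed is the subject of part 2):

#include <fcntl.h>      // _O_U8TEXT
#include <io.h>         // _setmode, _fileno
#include <stdio.h>      // stdout
#include <iostream>     // std::wcout, std::endl

int main()
{
    // Ask the runtime to translate between internal UTF-16 and external UTF-8.
    // After this only wide output (std::wcout, wprintf) may be used on the stream.
    _setmode( _fileno( stdout ), _O_U8TEXT );
    std::wcout << L"Bl\u00E5b\u00E6rsyltet\u00F8y! \u65E5\u672C\u56FD!" << std::endl;
}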

One reason that many people have not known about the Unicode support that he discusses there, a Visual C++ Unicode stream mode, is that it’s mostly undocumented. Kaplan gives a link to documentation of the deprecated _wsopen function, as one place where the mode flags have been (inadvertently?) documented. However, the main usage is through the _setmode function, where, on the contrary, the official documentation goes on about how _setmode will invoke the “invalid parameter handler” unless the mode argument is either _O_TEXT or _O_BINARY. So, by using this functionality one is not just in ordinary Microsoft undocumented land. One is wholly over in explicitly-documented-as-not-working land.

On the other hand, considering that the official documentation is plain wrong about many things (e.g., for Visual C++ 10 it maintains that the source code encoding is limited to ASCII), that the _setmode documentation is incorrect about the argument checking, and that the g++ compiler provides C level support for the _O_U8TEXT mode feature, one may choose to ignore the will-not-work statements of the documentation and just treat them as a documentation defect, for what good is a feature that can’t be used?

Since there is not really any alternative for getting UTF-8 translation also down at the C library level, this is the approach that I’m going to discuss in more detail in part 2.

It might seem from Kaplan’s blog posting that you don’t have to do more than just set the mode, and go! But as you can expect from something in explicitly-documented-as-not-working land, it’s not fully implemented even in Visual C++. And even less fully implemented in g++…

Summary so far

Above I introduced two approaches to Unicode handling in small Windows console programs:

  • The all UTF-8 approach where everything is encoded as UTF-8, and where there are no BOM encoding markers.
     
  • The wide string approach where all external text (including the C++ source code) is encoded as UTF-8, and all internal text is encoded as UTF-16.

The all UTF-8 approach is the approach used in a typical Linux installation. With this approach a novice can remain unaware that he is writing code that handles Unicode: it Just Works™, in Linux. However, we saw that it failed massively in Windows:

  • Input with active codepage 65001 (UTF-8) failed due to various bugs.
     
  • Console output with Visual C++ produced gibberish due to the runtime library’s attempt to help by using direct console output.
     
  • I mentioned how wide string literals with non-ASCII characters are incorrectly translated to UTF-16 by Visual C++ due to the necessary lying to Visual C++ about the source code encoding (which is accomplished by not having a BOM at the start of the source code file).

The wide string approach, on the other hand, was shown to have special support in Visual C++, via the _O_U8TEXT file mode, which I called a UTF-8 stream mode. But I mentioned that as of Visual C++ 10 this special file mode is not fully implemented and/or has some bugs: it cannot be used directly but needs some scaffolding and fixing. That’s what part 2 is about.


Cheers, & enjoy!

Ch 4 of my Norwegian intro to C++ available

The Norwegian introduction to C++ programming (a bit Windows-specific) is at Google Docs, in PDF format, 4 chapters so far:

Introduksjon til C++-programmering (Windows)

Each file has a nice table of contents, but to see it you need to download the PDF and view it in e.g. Foxit or Adobe Acrobat. Ch 1, the introduction, is just 1 page, though. Ch 2, tooling up with Visual C++ and learning about some Windows stuff, is more pages. And so is ch 3, about basic C++ such as loops and decisions. And ch 4, about creating console programs (all programs so far have been GUI), comes in at some 50 pages!

Perhaps it’ll become a book…

Here’s a table of contents generated by (1) using a Word TOC field and half-documented RD fields to refer to the chapters, (2) pressing [Shift Ctrl F9] in Word (is that still documented anywhere?) to “lock” the text, (3) editing to remove unwanted entries, (4) copying as text to Crimson Editor and saving, and (5) running a very, very hairy C++ program to generate the HTML.

Oh, I see in the preview that instead of a purely numbered list, in the WordPress blog I get letters and roman numerals!

So be it – but there’s also a PDF of the original over at Google docs (link above).

  1. Introduksjon. | 1
  2. Første program, etc. | 1
    1. Gratis verktøy. | 1
    2. Muligens ikke helt typiske installasjonsproblemer… | 2
    3. “Hallo, verden!” i Visual Studio / om IDE prosjekter. | 6
    4. Feilretting i Visual Studio / generelt om C++ typesjekking. | 15
    5. Hva “Hallo, verden!” programteksten betyr. | 18
    6. Spesielt aktuelle Windows-ting for nybegynneren. | 21
      1. Makroer og Unicode/ANSI-versjoner av Windows API-funksjoner. | 22
      2. Moderne utseende på knapper etc. / om DLL-er og manifest-filer. | 23
      3. Ikon og versjonsinformasjon / [.exe]-fil ressurser. | 28
    7. Gir C++ ekstra mye kode og kompleksitet? | 32
    8. Å finne relevant informasjon om ting. | 32
      1. Tipsruter og automatisk fullføring. | 32
      2. Å gå direkte til en aktuell deklarasjon eller definisjon. | 33
      3. Full teknisk dokumentasjon / hjelp / kort om Microsofts “T” datatyper. | 34
      4. Dokumentasjon av C++ språket og C++ standardbiblioteket. | 36
      5. Diskusjonsfora på nettet / FAQ-er. | 38
  3. Et første subsett av C++. | 1
    1. Gjenbruk av egendefinerte headerfiler. | 1
      1. En wrapper for [windows.h]. | 2
      2. Å konfigurere en felles headerfil søkesti i Visual Studio 2010. | 6
      3. En muligens enklere & mer pålitelig måte å konfigurere Visual Studio på. | 9
    2. Grunnleggende data. | 12
      1. Variabler, tilordninger, oppdateringer, regneuttrykk, implisitt konvertering. | 14
      2. Implisitte konverteringer. | 15
      3. Initialisering og const. | 16
    3. Tekstpresentasjon og strenger. | 17
      1. Arrays som buffere, konvertering tall ? tekst. | 17
      2. Strenger, konkatenering og std::wstring-typen, anrop av medlemsfunksjon. | 18
      3. Å lage tekstgenererings-støtte / egendefinerte funksjoner & operatorer. | 22
    4. Løkker, valg og sammenligningsuttrykk. | 27
      1. Sammenligninger og boolske uttrykk. | 32
      2. Valg. | 34
      3. Løkker. | 39
    5. Funksjoner. | 41
      1. Hva du kan og ikke kan gjøre med en C++ funksjon. | 41
      2. Funksjoner som abstraksjonsverktøy. | 41
      3. Verdioverføring og referanseoverføring av argumenter. | 45
  4. Kommandotolkeren. | 1
    1. Windows kommandotolkeren [cmd.exe]. | 2
      1. Å kjøre opp en kommandotolker-instans / konfigurering av konsollvinduer. | 2
      2. Kommandoer / hjelp. | 8
      3. Kommandoredigering & utklippstavle-operasjoner. | 11
      4. Linjekontinuering & tegn-escaping. | 11
      5. Operatorer & sammensatte kommandoer / omdirigering & rørledninger. | 12
      6. Erstatting av miljøvariabel-navn / arv av miljøvariabler. | 15
      7. Kommandotolkerens søk etter programmer: %path% og %pathext%. | 16
    2. Navigasjon. | 17
    3. Å kompilere fra kommandotolkeren. | 21
      1. Å nei! “Hallo, verden!” igjen! | 22
      2. Konsoll kontra GUI subsystem. | 24
      3. Å angi linker-opsjoner til kompilatoren / separat kompilering og linking. | 26
      4. Å be kompilatoren om standard C++, please. | 27
      5. Å angi headerfilkataloger, også kjent som inkluderingskataloger. | 28
    4. Batchfiler – å automatisere f.eks. et standardoppsett. | 31
    5. C++ iostreams. | 33
      1. iostream-objekter for standard datastrømmene. | 33
      2. Datastrøm orientering: nix mix (av char og wchar_t datastrømobjekter). | 36
      3. Å detektere “slutt på datastrømmen” (EOF, end of file). | 36
      4. Innlesing av strenger. | 40
      5. Praktikalitetsdigresjon: hvordan bli kvitt navneromskvalifikasjonene. | 42
      6. Innlesing av tall. | 43
      7. Formatert utskrift med iostream manipulatorer. | 48

Cheers, & enjoy! – Alf

Current TOC for my Norwegian intro to C++

About the Norwegian C++ intro, see my earlier posting.

Not sure if this works or not, but I’m trying to embed a PDF of a Table of Contents generated by Word:

Enjoy! 🙂  [Possibly/probably more to come, after all, I’m referring to chapter 4!]

– Alf

By the way, Olve, as you can see I’ve now added a chapter 3! Not quite at 42 yet… But.

A Norwegian introduction to C++ programming (in Windows)

I’m a compulsive writer, I admit. So, when testing Visual C++ 10.0, via Microsoft’s free Visual C++ Express IDE, I wrote about it. In Norwegian!

Maybe it’ll be a book. Anyway, I always write as if it’s going to be a book! I’m an incorrigible optimist!

It’s at Google Docs, in PDF format, 2 chapters so far:

Introduksjon til C++-programmering (Windows)

Each file has a nice table of contents, but to see it you need to download the PDF and view it in e.g. Foxit or Adobe Acrobat. Ch 1 is just 1 page, though. Ch 2 is more pages.

[Update, 4th of August: I’ve now added chapter 3, “Et første subsett av C++”. It’s great. :-)]

Comments very welcome!

Even if your name is Olve Maudal, say! 🙂

[cppx] B true, or B thrown! (Using the >> throwing pattern)

How often have you declared a variable and invented an ungrokkable three-letter name for it, just to temporarily hold the result of some API function so that you can check it immediately after the call? Let me guess: you’ve done that thousands of times. If so, then here are happy tidings: you can completely avoid introducing all those helper variables! 🙂


[cppx] Is C4099 really a sillywarning? (MSVC sillywarnings)

Evidently some people get to my blog by googling up C4099, the MSVC warning that you’ve used struct in one place and class in another place, for the same class. This is one of the alleged sillywarnings in my sillywarnings suppression header. But given e.g. the discussion at StackOverflow, is it really a sillywarning, or perhaps something to take seriously?


[cppx] Exception translation, part II

In part I I showed how to use the cppx library’s exception translation support, which decouples the specification of how non-standard exceptions should be translated from each routine’s invocation of such translation. The translation can be customized by dynamically installing and uninstalling exception translator routines. Essentially, each routine that wants exception translation must use a catch clause (most often a generic catch(...)) where it invokes cppx::rethrowAsStdX, which in turn invokes the installed exception translator routines and performs a default translation if none of them apply.

In this second part I discuss how that translation machinery works.

In part III I’ll discuss the support for installation and uninstallation of exception translator routines. And perhaps I’ll need a part IV to discuss the cppx exception types! Anyway, now, diving down into the code…
