A wrapper for UTF-8 Windows apps.

The Windows API is based on UTF-16 encoded wide text. For example, the API function CommandLineToArgvW, which parses a command line, exists only in a wide text version. But the support for UTF-8 as the process code page, introduced in the May 2019 update of Windows 10, greatly increases the incentive to use UTF-8 encoded narrow text internally in Windows GUI applications, i.e. to use UTF-8 as the C++ execution character set also in Windows.

This article presents a minimal example of that (a message box with international text, using Windows’ char based API wrappers); shows one way to configure Windows with UTF-8 as the ANSI code page default; and shows how to build such a program with the MinGW g++ and Visual C++ toolchains.

This is discussed in that order in the following sections:

  1. A minimal example.
  2. The header wrappers.
  3. Configuring Windows with UTF-8 as the ANSI code page default.
  4. An application manifest resource specifying UTF-8.
  5. Building with MinGW g++ and with Visual C++.

I apologize for the less than perfect formatting and possible odd things. Every time I edit, WordPress removes all instances of the text <windows.h> and wreaks havoc on the rest. This article was originally written as a GitHub-compatible markdown file, but it turned out that markdown syntax highlighting, and a lot more, didn’t work in WordPress, so the text had to be manually re-expressed as a sequence of WordPress “blocks”.

1. A minimal example.

With a suitable wrapper for <windows.h> the C++ code of a program that displays a Windows message box with international text can now be as simple as this:

minimal.cpp

#include <header-wrappers/winapi/windows-h.utf8.hpp>

auto main()
    -> int
{
    const auto& text    = "Every 日本国 кошка loves Norwegian blåbærsyltetøy, nom nom!";
    const auto& title   = "Some Norwegian, Russian & Chinese text in there:";
    MessageBox( 0, text, title, MB_ICONINFORMATION | MB_SETFOREGROUND );
}

Result when the program is built with a specification of UTF-8 as process code page, or alternatively is run in a Windows installation configured with UTF-8 as the ANSI code page default:

Image of OK messagebox

In contrast, here is what it looks like when a corresponding program using <windows.h> directly is built without a specification of UTF-8 as process code page and the Windows ANSI default is codepage 1252, Windows ANSI Western, as in the old days:

Image of ungood Windows ANSI Western messagebox
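
The garbling can be reproduced without any Windows API at all. The following is a minimal sketch, not part of the original program: a hypothetical function `misread_as_latin1` treats each byte of a UTF-8 string as one character, the way a single-byte ANSI code page does, and re-encodes the result as UTF-8 for display. It assumes Latin-1 as a stand-in for codepage 1252; the two agree for bytes 0xA0 through 0xFF but differ in the range 0x80 through 0x9F.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, for illustration only: interprets each byte of `bytes` as a
// Latin-1 character (an approximation of Windows codepage 1252) and returns that
// text encoded as UTF-8.
inline auto misread_as_latin1( const std::string& bytes )
    -> std::string
{
    std::string result;
    for( const char ch: bytes ) {
        const auto code = static_cast<unsigned char>( ch );
        if( code < 0x80 ) {
            result += ch;                                           // ASCII is unchanged.
        } else {
            result += static_cast<char>( 0xC0 | (code >> 6) );      // UTF-8 lead byte.
            result += static_cast<char>( 0x80 | (code & 0x3F) );    // UTF-8 continuation byte.
        }
    }
    return result;
}
```

For example, the two UTF-8 bytes of “å”, 0xC3 and 0xA5, come out as the two characters “Ã¥”, which is the kind of text that the second message box shows.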

2. The header wrappers.

The wrapper <header-wrappers/winapi/windows-h.utf8.hpp> supports this new “like ordinary C++” kind of Windows application:

  • it increases the C++-compatibility of <windows.h> by suppressing the min and max macro definitions via option NOMINMAX and by asking for more strictly typed declarations via option STRICT, plus it reduces the size of this gargantuan include (e.g. just now from 80 287 lines to 54 426 lines, with MinGW g++), via option WIN32_LEAN_AND_MEAN,
  • it makes the char based …A-functions such as MessageBoxA available without suffix, i.e. for that example as just MessageBox, by ensuring that option UNICODE is not defined, and
  • it asserts that the effective process codepage is UTF-8, which it might or might not be.
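
The second point relies on the way <windows.h> maps each suffix-free name to either the …A or the …W function, depending on whether UNICODE is defined. As a rough sketch of that mechanism only, with hypothetical stand-in names (`demo_box_a`, `demo_box_w`, `DEMO_BOX`) instead of the real API, it can be imitated like this:

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for a pair of API functions such as MessageBoxA and
// MessageBoxW. They just report which variant was called.
inline auto demo_box_a( const char* ) -> std::string { return "narrow"; }
inline auto demo_box_w( const wchar_t* ) -> std::string { return "wide"; }

#undef UNICODE          // What the UTF-8 wrapper ensures before including <windows.h>.

// The same kind of selection that <windows.h> performs for its suffix-free names.
#ifdef UNICODE
#   define DEMO_BOX     demo_box_w
#else
#   define DEMO_BOX     demo_box_a
#endif

inline auto selected_variant()
    -> std::string
{
    return DEMO_BOX( "Hello" );     // Compiles only because the narrow variant is selected.
}
```

With UNICODE defined the call in `selected_variant` would fail to compile, because a narrow literal does not convert to `const wchar_t*`. That is the reason the wrapper makes sure the option is undefined before <windows.h> is included.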

header-wrappers/winapi/windows-h.utf8.hpp

#pragma once
#undef UTF8_WINAPI
#define UTF8_WINAPI
#include "windows-h.hpp"

namespace uuid_0985060C_1AAD_453C_B3F9_A2E543F4CF1E {
    struct Winapi_envelope
    {
        Winapi_envelope()
        {
            static const bool dummy = winapi_h_assert_utf8_codepage();
        }
    };
    
    static const Winapi_envelope ensured_globally_single_utf8_assertion{};
}  // namespace uuid_0985060C_1AAD_453C_B3F9_A2E543F4CF1E

The little complexity above could be avoided by using a C++17 inline variable. It would be more in the C++ spirit of coding to absolutely maximum performance and least verbosity, when there is a choice. However, many people are still stuck with earlier C++ standards, and though a fallback using static instead could be automatically provided, the header would then require Visual C++ 2019 users to add option /Zc:__cplusplus, which is not presently supported by the Visual Studio GUI.

Except for that issue the wrapper is designed to be a trivial top-level wrapper, so that one can replace it with one’s own equally trivial top-level wrapper, for example in order to communicate a “not UTF-8 process code page” failure to the user in a manner of one’s own choosing.

To wit, the wrapper delegates the first two points to a more basic wrapper <header-wrappers/winapi/windows-h.hpp>, which goes like this:

header-wrappers/winapi/windows-h.hpp

#pragma once
#ifdef MessageBox
#   error "<windows.h> has already been included, possibly with undesired options."
#endif

#include <assert.h>
#ifdef _MSC_VER
#   include <iso646.h>                  // Standard `and` etc. also with MSVC.
#endif

#ifndef _WIN32_WINNT
#   define _WIN32_WINNT     0x0600      // Windows Vista as earliest supported OS.
#endif
#undef WINVER
#define WINVER _WIN32_WINNT

#define IS_NARROW_WINAPI() \
    ("Define UTF8_WINAPI please.", sizeof(*GetCommandLine()) == 1)

#define IS_WIDE_WINAPI() \
    ("Define UNICODE please.", sizeof(*GetCommandLine()) > 1)

// UTF8_WINAPI is a custom macro for this file. UNICODE, _UNICODE and _MBCS are MS macros.
#if defined( UTF8_WINAPI) and defined( UNICODE )
#   error "Inconsistent encoding options, both UNICODE (UTF-16) and UTF8_WINAPI (UTF-8)."
#endif

#undef UNICODE
#undef _UNICODE
#ifdef UTF8_WINAPI
#   define _MBCS        // Mainly for 3rd party code that uses it for platform detection.
#else
#   define UNICODE
#   define _UNICODE     // Mainly for 3rd party code that uses it for platform detection.
#endif
#undef NOMINMAX
#define NOMINMAX
#undef STRICT
#define STRICT
#undef WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN
// After this an `#include <winsock2.h>` will actually include that header.

#include <windows.h>

inline auto winapi_h_assert_utf8_codepage()
    -> bool
{
    #ifdef __GNUC__
        #pragma GCC diagnostic push
        #pragma GCC diagnostic ignored "-Wunused-value"
    #endif
    assert(( "The process codepage isn't UTF-8 (old Windows?).", GetACP() == 65001 ));
    #ifdef __GNUC__
        #pragma GCC diagnostic pop
    #endif
    return true;
}
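
As mentioned, one can replace the top-level wrapper in order to report a non-UTF-8 process codepage in one’s own way instead of via `assert`. The following is a minimal sketch of such a replacement check, not the author’s code. The names `utf8_codepage_complaint` and `report_if_not_utf8` are hypothetical, and the Windows-specific part is guarded so that the pure logic also compiles on other systems. The message is kept to pure ASCII because it’s displayed exactly when the codepage is not UTF-8.

```cpp
#include <cassert>
#include <string>

#ifdef _WIN32
#   include <windows.h>     // In real code: the basic wrapper "windows-h.hpp" instead.
#endif

// Hypothetical helper: returns an empty string if `codepage` is UTF-8 (65001), and
// otherwise a pure ASCII explanation that displays correctly in any ANSI codepage.
inline auto utf8_codepage_complaint( const unsigned codepage )
    -> std::string
{
    if( codepage == 65001 ) { return ""; }
    return "This program needs UTF-8 as process codepage, but the codepage is "
        + std::to_string( codepage ) + ".";
}

#ifdef _WIN32
    // Hypothetical replacement for the assertion: informs the user via a message box.
    // Returns true if the process codepage is UTF-8.
    inline auto report_if_not_utf8()
        -> bool
    {
        const std::string complaint = utf8_codepage_complaint( GetACP() );
        if( !complaint.empty() ) {
            MessageBoxA( 0, complaint.c_str(), "Unsupported configuration", MB_ICONERROR );
        }
        return complaint.empty();
    }
#endif
```

A custom top-level wrapper could call `report_if_not_utf8` from the constructor of its envelope class, in the same place where the original calls `winapi_h_assert_utf8_codepage`.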

3. Configuring Windows with UTF-8 as the ANSI code page default.

For portability the program should best be built with UTF-8 process code page specified as an application manifest resource. Alternatively it will work to configure Windows with UTF-8 as the Windows ANSI default, provided it’s Windows 10 with the update of May 2019, or later. But probably few if any ordinary users will want to configure their Windows, or to let a program do that just in order to run that program.

You as a developer may however find the convenience of UTF-8 as the Windows ANSI default highly desirable.

It worked for me to change the item ACP to value 65001, in the semi-documented registry key

Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage

You can change the value e.g. via the “regedit” GUI utility, or the reg command. You must then reboot Windows for the change to take effect.

And voilà! 🙂

But wait! …

Before doing that let’s build the program with a manifest that specifies UTF-8 as process code page. That way the program will work on any post-May-2019 Windows 10 installation, not just “it works on my computer!”. The <windows.h> wrapper shown above ensures that it will not mistakenly run and present gibberish on an earlier Windows version.

4. An application manifest resource specifying UTF-8.

An application manifest is a UTF-8 encoded XML file that, if properly magically named, can just be shipped with the application, but that’s best embedded as a resource in the executable.

The following manifest specifies both UTF-8 as process code page, and that the app uses version 6.0 or later of the Common Controls DLL, as opposed to an earlier version that has the same DLL name. The Common Controls DLL version gives a modern look and feel to buttons, menus, lists, edit fields etc. Why that’s not the default, or why it requires this heavy machinery to specify, would be a mystery if Microsoft were an ordinary company.

Anyway, the text:

resources/application-manifest.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
    <assemblyIdentity type="win32" name="UTF-8 message box" version="1.0.0.0"/>
    <application>
        <windowsSettings>
            <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings"
                >UTF-8</activeCodePage>
        </windowsSettings>
    </application>
    <dependency>
        <dependentAssembly>
            <assemblyIdentity
                type="win32"
                name="Microsoft.Windows.Common-Controls"
                version="6.0.0.0"
                processorArchitecture="*"
                publicKeyToken="6595b64144ccf1df"
                language="*"
                />
        </dependentAssembly>
    </dependency>
</assembly>

Beware: at the time of writing there could be no whitespace such as space or newline on either side of the “UTF-8” activeCodePage value, and it had to be all uppercase.
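
Since that restriction is easy to violate by accident, e.g. when an editor reformats the XML, a build step could check the value. The following is only a sketch of such a check, with the hypothetical name `has_exact_utf8_codepage`. It uses plain text search and not an XML parser, so it assumes that no attribute value in the start tag contains a `>`.

```cpp
#include <cassert>
#include <string>

// Hypothetical lint helper: true if and only if the text between the first
// <activeCodePage ...> start tag and its end tag is exactly "UTF-8".
inline auto has_exact_utf8_codepage( const std::string& manifest )
    -> bool
{
    const std::string end_tag = "</activeCodePage>";

    const auto i_start_tag = manifest.find( "<activeCodePage" );
    if( i_start_tag == std::string::npos ) { return false; }

    const auto i_tag_close = manifest.find( '>', i_start_tag );
    if( i_tag_close == std::string::npos ) { return false; }

    const auto i_end_tag = manifest.find( end_tag, i_tag_close );
    if( i_end_tag == std::string::npos ) { return false; }

    // The value is everything between the start tag's '>' and the end tag.
    const auto i_value = i_tag_close + 1;
    return manifest.substr( i_value, i_end_tag - i_value ) == "UTF-8";
}
```

The layout used in the manifest above, with the start tag’s `>` placed on the following line directly before the value, passes this check, while a value surrounded by spaces or written in lowercase does not.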

5. Building with MinGW g++ and with Visual C++.

One way to include that text as a resource in the executable is to use a general resource script:

resources.rc

#include <windows.h>
1 RT_MANIFEST "resources/application-manifest.xml"

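Here the 1 is the resource id, and RT_MANIFEST is the resource type. The id 1 is the one that Windows looks for when it loads an executable’s manifest (<windows.h> names it CREATEPROCESS_MANIFEST_RESOURCE_ID), and RT_MANIFEST is defined as a small integer, namely 24.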

With the MinGW GNU tools this script is compiled by windres into an apparently ordinary object file, which is just linked with the main program object file:

[G:\code\minimal_gui\binaries]
> set CFG=-std=c++17 -Wall -pedantic-errors

[G:\code\minimal_gui\binaries]
> g++ %CFG% ..\minimal.cpp -c -o minimal.o

[G:\code\minimal_gui\binaries]
> windres ..\resources.rc -o resources.o

[G:\code\minimal_gui\binaries]
> g++ minimal.o resources.o -mwindows

Here the -mwindows option specifies the GUI subsystem for the executable, so that Windows doesn’t pop up a console window when one runs the program from Windows Explorer.

With Microsoft’s tools the script is compiled by rc into a special binary resource format in a .res file, which is just linked with the main program object file. Options can be passed to the compiler cl.exe via the environment variable CL, and to the linker link.exe via the environment variable LINK. Using an obscure linker option is unfortunately necessary for building a GUI subsystem executable with a standard C++ main function with this toolchain:

[G:\code\minimal_gui\binaries]
> set CL=^
More? /nologo ^
More? /utf-8 /EHsc /GR /FI"iso646.h" /std:c++17 /Zc:__cplusplus /W4 ^
More? /wd4459 /D _CRT_SECURE_NO_WARNINGS /D _STL_SECURE_NO_WARNINGS

[G:\code\minimal_gui\binaries]
> cl ..\minimal.cpp /c
minimal.cpp

[G:\code\minimal_gui\binaries]
> rc /nologo /fo resources.res ..\resources.rc

[G:\code\minimal_gui\binaries]
> set LINK=/entry:mainCRTStartup

[G:\code\minimal_gui\binaries]
> link /nologo minimal.obj resources.res user32.lib /subsystem:windows /out:b.exe

Microsoft now also has special tools (or maybe a special tool) to handle application manifests, but I haven’t used that.


UTF-8 in the Windows API

The May 2019 update of Windows 10 introduced the possibility of setting the ActiveCodePage property of an executable to UTF-8. This is done via the application manifest. The documentation is super-vague on the technical details and history, and in usual Microsoft fashion the functionality is obscured and the little desirable kill-a-gnat can only be done by costly nuclear bombing, so to speak — why let something simple be simple if it can be wrapped in military standard complexity?

But it means that with Visual C++ 2019 one can now use UTF-8 encoding for GUI applications, and for the output of console programs, without any encoding conversions in the code.

In particular, with UTF-8 active process codepage the arguments of main now come handily UTF-8 encoded, which means that they can now represent general filenames also in Windows. Hurray! Yippi!
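
One consequence is that the length of an argument in bytes no longer equals its number of characters. As a small illustration, and not something from the Windows API, the hypothetical helper `n_code_points` below counts the code points in an argument, assuming that the argument is valid UTF-8:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts the code points in valid UTF-8 text by counting
// every byte that is not a continuation byte (bit pattern 10xxxxxx).
inline auto n_code_points( const std::string& utf8 )
    -> std::size_t
{
    std::size_t n = 0;
    for( const char ch: utf8 ) {
        if( (static_cast<unsigned char>( ch ) & 0xC0) != 0x80 ) { ++n; }
    }
    return n;
}
```

In a program with UTF-8 as process codepage, `n_code_points( argv[1] )` for the argument “blåbær” gives 6, while `std::string( argv[1] ).size()` gives 8, because “å” and “æ” each occupy two bytes.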

However, interactive console input of UTF-8 is still limited to ASCII at the API level. And the MinGW g++ 9.2 compiler’s default standard library implementation doesn’t support UTF-8 in the C and C++ locale machinery, e.g. in setlocale, probably because it employs an old version of Microsoft’s runtime library. That means that FILE* or iostreams UTF-8 console output with MinGW g++ 9.2 only works for the default “C” locale.

I experimented by setting the ANSI codepage default in the registry to 65001, the UTF-8 codepage number. After rebooting the console windows came up with active codepage 65001, even though the OEM codepage default was the same old one (850 in my case). That indicates an effort on Microsoft’s part to support UTF-8 all the way in Windows, which if so is fantastically good.


An example application manifest.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
    <assemblyIdentity type="win32" name="UTF-8 app example" version="6.0.0.0"/>
    <application>
        <windowsSettings>
            <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings"
                >UTF-8</activeCodePage>
        </windowsSettings>
    </application>
    <dependency>
        <dependentAssembly>
            <assemblyIdentity
                type="win32"
                name="Microsoft.Windows.Common-Controls"
                version="6.0.0.0"
                processorArchitecture="*"
                publicKeyToken="6595b64144ccf1df"
                language="*"
                />
        </dependentAssembly>
    </dependency>
</assembly>

The second assemblyIdentity part has nothing to do with the UTF-8 support; it just corrects a practically unusable default for the look and feel of buttons etc. Essentially this manifest corrects the two “wrong” defaults: the narrow text encoding, and the look ’n feel. From within an application with this manifest it looks as if both CP_ACP (the global default) and CP_THREAD_ACP (the mysterious thread codepage) are UTF-8, codepage 65001.

In my experimentation UTF-8 had to be specified in uppercase, and it did not work with whitespace such as space or newline at either side.


An example resource script:

#include <windows.h>
1 RT_MANIFEST "resources/application-manifest.xml"