The Windows API is based on UTF-16 encoded wide text. For example, the API function CommandLineToArgvW, which parses a command line, exists only in a wide text version. But the introduction of support for UTF-8 as the process code page in the May 2019 update of Windows 10 greatly increases the incentive to use UTF-8 encoded narrow text internally in Windows GUI applications, i.e. to use UTF-8 as the C++ execution character set also in Windows.
This article presents a minimal example of that (a message box with international text, using Windows’ char based API wrappers); shows one way to configure Windows with UTF-8 as the ANSI code page default; and shows how to build such a program with the MinGW g++ and Visual C++ toolchains.
This is discussed in that order in the following sections:
- A minimal example.
- The header wrappers.
- Configuring Windows with UTF-8 as the ANSI code page default.
- An application manifest resource specifying UTF-8.
- Building with MinGW g++ and with Visual C++.
I apologize for the less than perfect formatting and possible odd things. Every time I edit, WordPress removes all instances of the text <windows.h> and wreaks havoc on the rest. This article was originally written as a GitHub-compatible markdown file but it turned out that markdown syntax highlighting, and a lot more, didn’t work in WordPress, so the text had to be very manually re-expressed as a sequence of WordPress “blocks”.
1. A minimal example.
With a suitable wrapper for <windows.h> the C++ code of a program that displays a Windows message box with international text can now be as simple as this:
minimal.cpp
#include <header-wrappers/winapi/windows-h.utf8.hpp>
auto main()
    -> int
{
    const auto& text = "Every 日本国 кошка loves Norwegian blåbærsyltetøy, nom nom!";
    const auto& title = "Some Norwegian, Russian & Chinese text in there:";
    MessageBox( 0, text, title, MB_ICONINFORMATION | MB_SETFOREGROUND );
}
Result when the program is built with a specification of UTF-8 as process code page, or alternatively is run in a Windows installation configured with UTF-8 as the ANSI code page default:
In contrast, here is what it looks like when a corresponding program using <windows.h> directly is built without a specification of UTF-8 as process code page and the Windows ANSI default is codepage 1252, Windows ANSI Western, as in the old days:
2. The header wrappers.
The wrapper <header-wrappers/winapi/windows-h.utf8.hpp> supports this new “like ordinary C++” kind of Windows application:
- it increases the C++-compatibility of <windows.h> by suppressing the min and max macro definitions via option NOMINMAX and by asking for more strictly typed declarations via option STRICT, plus it reduces the size of this gargantuan include (e.g. just now from 80 287 lines to 54 426 lines, with MinGW g++), via option WIN32_LEAN_AND_MEAN,
- it makes the char based …A-functions such as MessageBoxA available without suffix, i.e. for that example as just MessageBox, by ensuring that option UNICODE is not defined, and
- it asserts that the effective process codepage is UTF-8, which it might or might not be.
header-wrappers/winapi/windows-h.utf8.hpp
#pragma once
#undef UTF8_WINAPI
#define UTF8_WINAPI
#include "windows-h.hpp"
namespace uuid_0985060C_1AAD_453C_B3F9_A2E543F4CF1E {
    struct Winapi_envelope
    {
        Winapi_envelope()
        {
            static const bool dummy = winapi_h_assert_utf8_codepage();
        }
    };

    static const Winapi_envelope ensured_globally_single_utf8_assertion{};
}  // namespace uuid_0985060C_1AAD_453C_B3F9_A2E543F4CF1E
The little complexity above could be avoided by using a C++17 inline variable. That would be more in the C++ spirit of going for maximum performance and least verbosity when there is a choice. However, many people are still stuck with earlier C++ standards, and though a fallback using static instead could be automatically provided, the header would then require Visual C++ 2019 users to add option /Zc:__cplusplus, which is not presently supported by the Visual Studio GUI.
Except for that issue the wrapper is designed to be a trivial top-level wrapper so that one can replace it with one’s own equally trivial top-level wrapper, for example in order to communicate a “not UTF-8 process code page” failure to the user in a manner of one’s own choosing.
To wit, the wrapper delegates the first two points to a more basic wrapper <header-wrappers/winapi/windows-h.hpp>, which goes like this:
header-wrappers/winapi/windows-h.hpp
#pragma once
#ifdef MessageBox
# error "<windows.h> has already been included, possibly with undesired options."
#endif
#include <assert.h>
#ifdef _MSC_VER
# include <iso646.h> // Standard `and` etc. also with MSVC.
#endif
#ifndef _WIN32_WINNT
# define _WIN32_WINNT 0x0600 // Windows Vista as earliest supported OS.
#endif
#undef WINVER
#define WINVER _WIN32_WINNT
#define IS_NARROW_WINAPI() \
    ("Define UTF8_WINAPI please.", sizeof(*GetCommandLine()) == 1)
#define IS_WIDE_WINAPI() \
    ("Define UNICODE please.", sizeof(*GetCommandLine()) > 1)
// UTF8_WINAPI is a custom macro for this file. UNICODE, _UNICODE and _MBCS are MS macros.
#if defined( UTF8_WINAPI) and defined( UNICODE )
# error "Inconsistent encoding options, both UNICODE (UTF-16) and UTF8_WINAPI (UTF-8)."
#endif
#undef UNICODE
#undef _UNICODE
#ifdef UTF8_WINAPI
# define _MBCS // Mainly for 3rd party code that uses it for platform detection.
#else
# define UNICODE
# define _UNICODE // Mainly for 3rd party code that uses it for platform detection.
#endif
#undef NOMINMAX
#define NOMINMAX
#undef STRICT
#define STRICT
#undef WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN
// After this an `#include <winsock2.h>` will actually include that header.
#include <windows.h>
inline auto winapi_h_assert_utf8_codepage()
    -> bool
{
#ifdef __GNUC__
#   pragma GCC diagnostic push
#   pragma GCC diagnostic ignored "-Wunused-value"
#endif
    assert(( "The process codepage isn't UTF-8 (old Windows?).", GetACP() == 65001 ));
#ifdef __GNUC__
#   pragma GCC diagnostic pop
#endif
    return true;
}
3. Configuring Windows with UTF-8 as the ANSI code page default.
For portability the program should best be built with UTF-8 process code page specified as an application manifest resource. Alternatively it will work to configure Windows with UTF-8 as the Windows ANSI default, provided it’s Windows 10 with the update of May 2019, or later. But probably few if any ordinary users will want to configure their Windows, or to let a program do that just in order to run that program.
You as a developer may however find the convenience of UTF-8 as the Windows ANSI default highly desirable.
It worked for me to change the item ACP to value 65001, in the semi-documented registry key
Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage
You can change the value e.g. via the “regedit” GUI utility, or the reg command. You must then reboot Windows for the change to take effect.
And voilà! 🙂
But wait! …
Before doing that let’s build the program with a manifest that specifies UTF-8 as process code page. That way the program will work on any post-May-2019 Windows 10 installation, not just “it works on my computer!”. The <windows.h> wrapper shown above ensures that it will not mistakenly run and present gibberish on an earlier Windows version.
4. An application manifest resource specifying UTF-8.
An application manifest is a UTF-8 encoded XML file that, if properly magically named, can be just shipped with the application, but that’s best embedded as a resource in the executable.
The following manifest specifies both UTF-8 as process code page, and that the app uses version 6.0 or later of the Common Controls DLL, as opposed to an earlier version that has the same DLL name. The Common Controls DLL version gives a modern look and feel to buttons, menus, lists, edit fields etc. Why that’s not the default, or why it requires this heavy machinery to specify, would be a mystery if Microsoft were an ordinary company.
Anyway, the text:
resources/application-manifest.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
    <assemblyIdentity type="win32" name="UTF-8 message box" version="1.0.0.0"/>
    <application>
        <windowsSettings>
            <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings"
                >UTF-8</activeCodePage>
        </windowsSettings>
    </application>
    <dependency>
        <dependentAssembly>
            <assemblyIdentity
                type="win32"
                name="Microsoft.Windows.Common-Controls"
                version="6.0.0.0"
                processorArchitecture="*"
                publicKeyToken="6595b64144ccf1df"
                language="*"
                />
        </dependentAssembly>
    </dependency>
</assembly>
Beware: at the time of writing there could be no whitespace such as space or newline on either side of the “UTF-8” activeCodePage value, and it had to be all uppercase.
5. Building with MinGW g++ and with Visual C++.
One way to include that text as a resource in the executable is to use a general resource script:
resources.rc
#include <windows.h>
1 RT_MANIFEST "resources/application-manifest.xml"
Here the 1 is the resource id, which is the id that Windows expects for the manifest of an executable, and the RT_MANIFEST is the resource type (RT_MANIFEST is defined as a small integer, namely 24).
With the MinGW GNU tools this script is compiled by windres into an apparently ordinary object file, which is just linked with the main program object file:
[G:\code\minimal_gui\binaries]
> set CFG=-std=c++17 -Wall -pedantic-errors
[G:\code\minimal_gui\binaries]
> g++ %CFG% ..\minimal.cpp -c -o minimal.o
[G:\code\minimal_gui\binaries]
> windres ..\resources.rc -o resources.o
[G:\code\minimal_gui\binaries]
> g++ minimal.o resources.o -mwindows
Here the -mwindows option specifies the GUI subsystem for the executable, so that Windows doesn’t pop up a console window when one runs the program from Windows Explorer.
With Microsoft’s tools the script is compiled by rc into a special binary resource format in a .res file, which is just linked with the main program object file. Options can be passed to the compiler cl.exe via the environment variable CL, and to the linker link.exe via the environment variable LINK. Using an obscure linker option is unfortunately necessary for building a GUI subsystem executable with a standard C++ main function with this toolchain:
[G:\code\minimal_gui\binaries]
> set CL=^
More? /nologo ^
More? /utf-8 /EHsc /GR /FI"iso646.h" /std:c++17 /Zc:__cplusplus /W4 ^
More? /wd4459 /D _CRT_SECURE_NO_WARNINGS /D _STL_SECURE_NO_WARNINGS
[G:\code\minimal_gui\binaries]
> cl ..\minimal.cpp /c
minimal.cpp
[G:\code\minimal_gui\binaries]
> rc /nologo /fo resources.res ..\resources.rc
[G:\code\minimal_gui\binaries]
> set LINK=/entry:mainCRTStartup
[G:\code\minimal_gui\binaries]
> link /nologo minimal.obj resources.res user32.lib /subsystem:windows /out:b.exe
Microsoft now also has special tools (or maybe a special tool) to handle application manifests, but I haven’t used that.
Excellent article. The one thing I would add is the support for UTF-8 font charset, so GDI can be used with UTF-8 strings. You can see my article about it.
How do I build this minimal project in Visual Studio?
Not working for me because
Building minimal.cpp in visual studio, visual studio automagically and invisibly includes its windows.h before your wrapper.
1>H:\utf8\header-wrappers\winapi\windows-h.hpp(25,1): fatal error C1189: #error: “Inconsistent encoding options, both UNICODE (UTF-16) and UTF8_WINAPI (UTF-8).”
How do I make your wrapper wrap windows.h?
How do I build this in visual studio? I have to set something somewhere so that it does not invisibly and automagically include windows.h on the include path for me.
I have to create a *.sln file that has the right properties, but when I create a *.sln file using “create empty project”, windows.h just somehow gets invisibly included first.