UTF-8 in the Windows API

The May 2019 update of Windows 10 introduced the possibility of setting the ActiveCodePage property of an executable to UTF-8. This is done via the application manifest. The documentation is super-vague on the technical details and history, and in usual Microsoft fashion the functionality is obscured and the little desirable kill-a-gnat can only be done by costly nuclear bombing, so to speak — why let something simple be simple if it can be wrapped in military standard complexity?

But it means that with Visual C++ 2019 one can now use UTF-8 encoding for GUI applications, and for the output of console programs, without any encoding conversions in the code.

In particular, with UTF-8 active process codepage the arguments of main now come handily UTF-8 encoded, which means that they can now represent general filenames also in Windows. Hurray! Yippi!

However, interactive console input of UTF-8 is still limited to ASCII at the API-level. And the MinGW g++ 9.2 compiler’s default standard library implementation doesn’t support UTF-8 in the C and C++ locale machinery, e.g in setlocale, probably because it employs an old version of Microsoft’s runtime library. That means that FILE* or iostreams UTF-8 console output with MinGW g++ 9.2 only works for the default “C” locale.

I experimented by setting the ANSI codepage default in the registry to 65001, the UTF-8 codepage number. After rebooting the console windows came up with active codepage 65001, even though the OEM codepage default was the same old one (850 in my case). That indicates an effort on Microsoft’s part to support UTF-8 all the way in Windows, which if so is fantastically good.

An example application manifest.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
    <assemblyIdentity type="win32" name="UTF-8 app example" version=""/>
            <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings"

The second assemblyIdentity part has nothing to do with the UTF-8 support, it just corrects a practically unusable default for the look and feel of buttons etc. Essentially this manifest corrects the two “wrong” defaults: the narrow text encoding, and the look ’n feel. From within an application with this manifest it looks as both CP_ACP (the global default) and CP_THREAD_ACP (the mysterious thread codepage) are UTF-8, codepage 65001.

In my experimentation UTF-8 had to be specified in uppercase, and it did not work with whitespace such as space or newline at either side.

An example resource script:

#include <windows.h>
1 RT_MANIFEST "resources/application-manifest.xml"

Non-crashing Python 3.x output in Windows

Non-crashing Python 3.x output in Windows


The following little Python 3.x program just crashes with CPython 3.3.4 in a default-configured English Windows:

print( "Blåbærsyltetøy!")
H:\personal\web\blog alf on programming at wordpress\001\test>chcp 437
Active code page: 437

H:\personal\web\blog alf on programming at wordpress\001\test>type crash.py3 | display_utf8
print( "Blåbærsyltetøy!")

H:\personal\web\blog alf on programming at wordpress\001\test>crash.py3
Traceback (most recent call last):
  File "H:\personal\web\blog alf on programming at wordpress\001\test\crash.py3", line 2, in 
    print( "Blåbærsyltet\xf8y!")
  File "C:\Program Files\CPython 3_3_4\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xf8' in position 12: character maps to 

H:\personal\web\blog alf on programming at wordpress\001\test>_

Here codepage 437 is the original IBM PC character set, which is the default narrow text interpretation in an English Windows console window.

A partial solution is to change the default console codepage to Windows ANSI, which then at least for CPython matches the encoding for output to a pipe or file, and it’s nice with consistency. But also this has a severely limited character set, with possible crash behavior for any unsupported characters.

Direct console output

Unicode text limited to the Basic Multilingual Plane (essentially original 16-bits Unicode) can be output to a Windows console via the WriteConsoleW Windows API function.

The standard Python ctypes module provides access to the API:

import ctypes
class Object: pass

winapi = Object()
winapi.STD_INPUT_HANDLE     = -10
winapi.STD_OUTPUT_HANDLE    = -11
winapi.STD_ERROR_HANDLE     = -12
winapi.GetStdHandle         = ctypes.windll.kernel32.GetStdHandle
winapi.CloseHandle          = ctypes.windll.kernel32.CloseHandle
winapi.WriteConsoleW        = ctypes.windll.kernel32.WriteConsoleW

class Direct_console_io:
    def write( self, s ) -> int:
        n_written = ctypes.c_ulong()
        ret = winapi.WriteConsoleW(
            self.std_output_handle, s, len( s ), ctypes.byref( n_written ), 0
        return n_written.value

    def __del__( self ):
        if not winapi: return       # Looks like a bug in CPython 3.x
        winapi.CloseHandle( self.std_error_handle )
        winapi.CloseHandle( self.std_output_handle )
        winapi.CloseHandle( self.std_input_handle )

    def __init__( self ):
        self.dependency = winapi
        self.std_input_handle   = winapi.GetStdHandle( winapi.STD_INPUT_HANDLE )
        self.std_output_handle  = winapi.GetStdHandle( winapi.STD_OUTPUT_HANDLE )
        self.std_error_handle   = winapi.GetStdHandle( winapi.STD_ERROR_HANDLE )

Implementing input is left as an exercise for the reader.

Overriding the standard streams to use direct i/o and UTF-8.

In addition to the silly crashing behavior, the standard streams in CPython 3.x, like sys.stdout, default to Windows ANSI for output to file or pipe. In Python 2.7 this could be reset to more useful UTF-8 by reloading the sys module in order to get back a dynamically removed method that could set the default encoding. No longer in Python 3.x, so this code just creates new stream objects:

import io
import sys
from Direct_console_io import Direct_console_io

class Dcio_raw_iobase( io.RawIOBase ):
    def writable( self ) -> bool:
        return True

    def write( self, seq_of_bytes ) -> int:
        b = bytes( seq_of_bytes )
        return self.dcio.write( b.decode( 'utf-8' ) )

    def __init__( self ):
        self.dcio = Direct_console_io()

class Dcio_buffered_writer( io.BufferedWriter ):
    def write( self, seq_of_bytes ) -> int:
        return self.raw_stream.write( seq_of_bytes )

    def flush( self ):

    def __init__( self, raw_iobase ):
        super().__init__( raw_iobase )
        self.raw_stream = raw_iobase

# Module initialization:
def __init__():
    using_console_input = sys.stdin.isatty()
    using_console_output = sys.stdout.isatty()
    using_console_error = sys.stderr.isatty()

    if using_console_output:
        raw_io = Dcio_raw_iobase()
        buf_io = Dcio_buffered_writer( raw_io )
        sys.stdout = io.TextIOWrapper( buf_io, encoding = 'utf-8' )
        sys.stdout.isatty = lambda: True
        sys.stdout = io.TextIOWrapper( sys.stdout.detach(), encoding = 'utf-8-sig' )

    if using_console_error:
        raw_io = Dcio_raw_iobase()
        buf_io = Dcio_buffered_writer( raw_io )
        sys.stderr = io.TextIOWrapper( buf_io, encoding = 'utf-8' )
        sys.stderr.isatty = lambda: True
        sys.stderr = io.TextIOWrapper( sys.stderr.detach(), encoding = 'utf-8-sig' )


Disclaimer: It’s been a long time since I fiddled with Python, so possibly I’m breaking a number of conventions plus doing things in some less than optimal way. But this was the first path I found through the jungle of apparently arbitrary io class derivations etc. It worked well enough for my purposes (in a little script to convert NRK’s HTML-format subtitles to SubRip format), so, I gather it can be useful also for you – at least as a basis for more robust and/or more general code.