Non-crashing Python 3.x output in Windows

Problem

The following little Python 3.x program just crashes with CPython 3.3.4 in a default-configured English Windows:

crash.py3
#encoding=utf-8
print( "Blåbærsyltetøy!")

H:\personal\web\blog alf on programming at wordpress\001\test>chcp 437
Active code page: 437

H:\personal\web\blog alf on programming at wordpress\001\test>type crash.py3 | display_utf8
#encoding=utf-8
print( "Blåbærsyltetøy!")

H:\personal\web\blog alf on programming at wordpress\001\test>crash.py3
Traceback (most recent call last):
  File "H:\personal\web\blog alf on programming at wordpress\001\test\crash.py3", line 2, in <module>
    print( "Blåbærsyltet\xf8y!")
  File "C:\Program Files\CPython 3_3_4\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xf8' in position 12: character maps to <undefined>

H:\personal\web\blog alf on programming at wordpress\001\test>_

Here codepage 437 is the original IBM PC character set, which is the default narrow text interpretation in an English Windows console window.

A partial solution is to change the default console codepage to Windows ANSI, which at least for CPython then matches the encoding used for output to a pipe or file, and consistency is nice. But Windows ANSI also has a severely limited character set, so any unsupported character can still crash the program.
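
The crash can be reproduced without any console, since it comes down to encoding the string with the console's codepage. Here is a minimal sketch, assuming the cp437 codec that the traceback names; the helper name encode_for_console is made up for this illustration and is not part of CPython:

```python
TEXT = "Blåbærsyltetøy!"

def encode_for_console(text: str, codepage: str = "cp437", errors: str = "strict") -> bytes:
    """Encode text the way a narrow console stream would (hypothetical helper)."""
    return text.encode(codepage, errors)

try:
    encode_for_console(TEXT)
    failure = None
except UnicodeEncodeError as e:
    failure = e     # 'ø' (U+00F8) has no slot in codepage 437

# Asking the codec to substitute instead of raise avoids the crash,
# at the price of losing the unsupported character.
lossy = encode_for_console(TEXT, errors="replace")
print(lossy)
```

The substitution variant shows the trade-off of staying with a narrow codepage: no exception, but the ø is gone from the output.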

Direct console output

Unicode text limited to the Basic Multilingual Plane (essentially the original 16-bit Unicode) can be output to a Windows console via the WriteConsoleW Windows API function.
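
The BMP restriction matters because WriteConsoleW takes its length in UTF-16 code units, and outside the BMP that count differs from the number of characters in a Python string. A small sketch of the difference (the helper names utf16_units and is_bmp_only are my own for this illustration):

```python
def utf16_units(s: str) -> int:
    """Number of UTF-16 code units needed for s (hypothetical helper)."""
    return len(s.encode("utf-16-le")) // 2

def is_bmp_only(s: str) -> bool:
    """True if every character of s lies in the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in s)

bmp_text = "Blåbærsyltetøy! 日本国 кошка!"
astral_text = "G clef: \U0001D11E"      # U+1D11E is outside the BMP

# For BMP-only text the two counts agree; a non-BMP character
# needs a surrogate pair, i.e. one extra code unit.
print(len(bmp_text), utf16_units(bmp_text))
print(len(astral_text), utf16_units(astral_text))
```

For BMP-only text, len( s ) is therefore a usable count to pass along; for other text one would presumably need the code unit count instead.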

The standard Python ctypes module provides access to the API:

Direct_console_io.py
import ctypes
class Object: pass

winapi = Object()
winapi.STD_INPUT_HANDLE     = -10
winapi.STD_OUTPUT_HANDLE    = -11
winapi.STD_ERROR_HANDLE     = -12
winapi.GetStdHandle         = ctypes.windll.kernel32.GetStdHandle
winapi.CloseHandle          = ctypes.windll.kernel32.CloseHandle
winapi.WriteConsoleW        = ctypes.windll.kernel32.WriteConsoleW

class Direct_console_io:
    def write( self, s ) -> int:
        n_written = ctypes.c_ulong()
        ret = winapi.WriteConsoleW(
            self.std_output_handle, s, len( s ), ctypes.byref( n_written ), 0
            )
        return n_written.value

    def __del__( self ):
        if not winapi: return       # Module globals may already be cleared at interpreter shutdown
        winapi.CloseHandle( self.std_error_handle )
        winapi.CloseHandle( self.std_output_handle )
        winapi.CloseHandle( self.std_input_handle )

    def __init__( self ):
        self.dependency = winapi
        self.std_input_handle   = winapi.GetStdHandle( winapi.STD_INPUT_HANDLE )
        self.std_output_handle  = winapi.GetStdHandle( winapi.STD_OUTPUT_HANDLE )
        self.std_error_handle   = winapi.GetStdHandle( winapi.STD_ERROR_HANDLE )

Implementing input is left as an exercise for the reader.

Overriding the standard streams to use direct i/o and UTF-8

In addition to the silly crashing behavior, the standard streams in CPython 3.x, like sys.stdout, default to Windows ANSI for output to file or pipe. In Python 2.7 this could be reset to the more useful UTF-8 by reloading the sys module in order to get back a dynamically removed method that could set the default encoding. That trick is no longer available in Python 3.x, so this code just creates new stream objects:

Utf8_standard_streams.py
import io
import sys
from Direct_console_io import Direct_console_io

class Dcio_raw_iobase( io.RawIOBase ):
    def writable( self ) -> bool:
        return True

    def write( self, seq_of_bytes ) -> int:
        b = bytes( seq_of_bytes )
        return self.dcio.write( b.decode( 'utf-8' ) )

    def __init__( self ):
        self.dcio = Direct_console_io()

class Dcio_buffered_writer( io.BufferedWriter ):
    def write( self, seq_of_bytes ) -> int:
        return self.raw_stream.write( seq_of_bytes )

    def flush( self ):
        pass

    def __init__( self, raw_iobase ):
        super().__init__( raw_iobase )
        self.raw_stream = raw_iobase

# Module initialization:
def __init__():
    using_console_input = sys.stdin.isatty()
    using_console_output = sys.stdout.isatty()
    using_console_error = sys.stderr.isatty()

    if using_console_output:
        raw_io = Dcio_raw_iobase()
        buf_io = Dcio_buffered_writer( raw_io )
        sys.stdout = io.TextIOWrapper( buf_io, encoding = 'utf-8' )
        sys.stdout.isatty = lambda: True
    else:
        sys.stdout = io.TextIOWrapper( sys.stdout.detach(), encoding = 'utf-8-sig' )

    if using_console_error:
        raw_io = Dcio_raw_iobase()
        buf_io = Dcio_buffered_writer( raw_io )
        sys.stderr = io.TextIOWrapper( buf_io, encoding = 'utf-8' )
        sys.stderr.isatty = lambda: True
    else:
        sys.stderr = io.TextIOWrapper( sys.stderr.detach(), encoding = 'utf-8-sig' )
    return

__init__()
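
The rewrapping used for the redirected case can be tried out in isolation by letting an in-memory byte stream stand in for the detached standard stream. This is only a sketch; the helper name utf8_text_stream is an assumption of mine, and the BytesIO object merely plays the role of a file or pipe:

```python
import io

def utf8_text_stream(binary_stream, with_bom: bool = False) -> io.TextIOWrapper:
    """Wrap a binary stream so that text written to it is UTF-8 encoded.
    (Hypothetical helper; 'utf-8-sig' makes the wrapper emit a BOM first.)"""
    encoding = "utf-8-sig" if with_bom else "utf-8"
    return io.TextIOWrapper(binary_stream, encoding=encoding, newline="\n")

raw = io.BytesIO()                      # stands in for a redirected stdout
out = utf8_text_stream(raw, with_bom=True)
out.write("Blåbærsyltetøy!\n")
out.flush()                             # push the text through to the bytes level
written = raw.getvalue()
print(written)
```

The bytes that end up in the stream start with the UTF-8 BOM, which is what the 'utf-8-sig' encoding in the module above is there for.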

Disclaimer: It’s been a long time since I fiddled with Python, so possibly I’m breaking a number of conventions plus doing things in some less than optimal way. But this was the first path I found through the jungle of apparently arbitrary io class derivations etc. It worked well enough for my purposes (in a little script to convert NRK’s HTML-format subtitles to SubRip format), so I gather it can also be useful for you – at least as a basis for more robust and/or more general code.

Unicode part 1: Windows console i/o approaches



The Windows console subsystem has a host of Unicode-related bugs. And standard Windows programs such as more (not to mention the C# 4.0 compiler csc) just crash when they’re run from a console window with UTF-8 as active codepage, perplexingly claiming that they’re out of memory. On top of that the C++ runtime libraries of various compilers differ in how they behave. Doing C++ Unicode i/o in Windows consoles is therefore problematic. In this series I show how to work around limitations of the Visual C++ _O_U8TEXT file mode, with the Visual C++ and g++ compilers. This yields an automatic translation between external UTF-8 and internal UTF-16, enabling Windows console i/o of characters in the Basic Multilingual Plane.

Introduction

In both Windows and Linux properly internationalized applications use either UTF-16 or UTF-32 for their internal string handling. For example, the popular cross platform ICU library (International Components for Unicode) is based on UTF-16 encoded strings. For this kind of application Windows seems to be the better fit, since Windows’ API is UTF-16 based while Linux’ API is effectively, on a modern installation, UTF-8 based.

Still, in a simple console program one does not typically take on the quite steep overhead of full-fledged Unicode handling.

Instead of using a full-fledged Unicode handling library like ICU, one then relies on just the standard C and C++ libraries, and the Unicode handling reduces to what can easily be expressed using just the direct C++ language and standard library support.

How the Linux “all UTF-8” approach does not work in Windows

In Linux the typical small Unicode console program has everything char-based and UTF-8 encoded. The external data, the internal strings, the string literals, and of course the C or C++ source code, are all UTF-8 encoded. The total unification allows simple programs like this:

[utf8_sans_bom.all_utf8.cpp]
#include <stdexcept>        // std::runtime_error, std::exception
#include <stdlib.h>         // EXIT_SUCCESS, EXIT_FAILURE
#include <iostream>         // std::cout, std::cerr, std::endl
#include <string>           // std::string
using namespace std;

bool throwX( string const& s ) { throw runtime_error( s ); }
bool hopefully( bool v ) { return v; }

string lineFrom( istream& stream )
{
    string result;
    getline( stream, result );
    hopefully( !stream.fail() )
        || throwX( "lineFrom: failed to read line" );
    return result;
}

int main()
{
    try
    {
        static char const       narrowText[]    = "Blåbærsyltetøy! 日本国 кошка!";
        
        cout << "Narrow text: " << narrowText << endl;
        cout << endl;
        cout << "What's your name? ";
        string const name = lineFrom( cin );
        cout << "Glad to meet you, " << name << "!" << endl;
        return EXIT_SUCCESS;
    }
    catch( exception const& x )
    {
        cerr << "!" << x.what() << endl;
    }
    return EXIT_FAILURE;
}

Testing this in Ubuntu 11.10:

[~/blog/examples]
$ g++ utf8_sans_bom.all_utf8.cpp
[~/blog/examples]
$ ./a.out
Narrow text: Blåbærsyltetøy! 日本国 кошка!

What's your name? Bjørn Bråten Sæter
Glad to meet you, Bjørn Bråten Sæter!
[~/blog/examples]
$  _

Yay, it worked OK in Linux!

Testing the very same source code file in Windows using the same Linux-origins compiler (namely g++), and intentionally not specifying any codepage for the console window:

W:\examples> g++ -pedantic -Wall utf8_sans_bom.all_utf8.cpp

W:\examples> a
Narrow text: Blåbærsyltetøy! 日本国 кошка!

What's your name? Bjørn Bråten Sæter
Glad to meet you, Bjorn Bråten Sæter!

W:\examples> _

One reason for the gobbledygook here is that the Windows console by default assumes that the program produces OEM encoded text. That means, it assumes that the text is encoded using the original IBM PC character set, or a variation of that old character set. This encoding assumption is called the console window’s active codepage, and it can be inspected and changed via the chcp command, e.g. from codepage 437 (original IBM PC character set) to 65001 (UTF-8):

W:\examples> chcp
Active code page: 437

W:\examples> chcp 65001
Active code page: 65001

W:\examples> a
Narrow text: Blåbærsyltetøy! 日本国 кошка!huh?

W:\examples> _

Positive: the initial UTF-8 text output appeared to work. The Chinese characters 日本国 displayed as just empty rectangles, but they copied OK. Both the Norwegian and Russian copied OK and also displayed OK.

Negative: input did apparently not work, and it apparently caused some of the program’s output (including the prompt before the input operation) to disappear!

Exactly what went wrong above is difficult to say for sure. It might be the input operation, or it might be something else. However, the exact cause is irrelevant because input fails outright, not just producing weird side effects, if the user types in some non-ASCII characters such as Norwegian æ, ø and å:

W:\examples> a
Narrow text: Blåbærsyltetøy! 日本国 кошка!Bjørn Bråten Sæter
!lineFrom: failed to read line

W:\examples> _
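
The effect of the console's OEM assumption, described above for codepage 437, can be modelled by decoding the program's UTF-8 bytes with the wrong codec. A minimal sketch, assuming Python's cp437 codec is a fair stand-in for what the console does with each byte:

```python
text = "Blåbærsyltetøy!"
utf8_bytes = text.encode("utf-8")       # what the program writes

# A console whose active codepage is 437 interprets each byte as one
# OEM character, so every two-byte UTF-8 sequence becomes two symbols.
as_seen_in_cp437 = utf8_bytes.decode("cp437")

# With the matching interpretation (codepage 65001 is UTF-8) the text survives.
as_seen_in_utf8 = utf8_bytes.decode("utf-8")

print(ascii(as_seen_in_cp437))
```

Nothing is lost at the byte level here; only the interpretation differs, which is why changing the active codepage can repair the output.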

About direct console i/o

Given that total failure for the “all UTF-8” approach has been established, it may perhaps appear to be overkill to also show the unintelligible output effect with the Windows platform’s major compiler, Visual C++ (here version 10.0), but as you’ll see it’s relevant:

W:\examples> cl utf8_sans_bom.all_utf8.cpp /Fe"b"
utf8_sans_bom.all_utf8.cpp

W:\examples> chcp
Active code page: 65001

W:\examples> b
Narrow text: Bl��b��rsyltet��y! ��������� ����������!

What's your name? Bjørn Bråten Sæter
!lineFrom: failed to read line

W:\examples> _

Here the Visual C++ runtime detects that the standard output is connected to a console window. Instead of sending the text via the ordinary standard output stream, it then attempts to place the correct Unicode UCS2-encoded characters directly in the console window’s text buffer. However, since the C++ source code was encoded as UTF-8 without BOM (as is usual in Linux), the Visual C++ compiler erroneously assumed that the source code was encoded as Windows ANSI. And since Visual C++ has Windows ANSI sort of hardwired as its C++ narrow character execution character set, it blindly copied the string literal’s bytes to the executable’s string values. The runtime, for its direct console i/o, is therefore given UTF-8 bytes instead of the Windows ANSI bytes that it expects, so its helpful translation to UCS2 fails…

At the Windows API level the runtime implements direct console output by calling the WriteConsole function instead of the WriteFile function. And similarly, if the console input had worked, then it would probably have been via a call to the ReadConsole function instead of the ReadFile function. The WriteConsole function accesses the console window’s text buffer directly and takes a UTF-16 wchar_t based argument, and ditto for ReadConsole.

Portable source code should be UTF-8 with BOM

One can avoid the direct console i/o by redirecting the output.

Such redirection establishes that the output is good at the byte level, i.e. that all would have been well for this particular program’s output except for the interference from the probably well-intentioned direct console i/o help attempt:

W:\examples> echo Bjørn Bråten Sæter | b >result

W:\examples> type result
Narrow text: Blåbærsyltetøy! 日本国 кошка!

What's your name? Glad to meet you, Bjørn Bråten Sæter !

W:\examples> _

And because the data is correct, one can be sure that the Visual C++ compiler was tricked into assuming that the source code was ANSI Western. And this then means that any wide string literal, which a Windows compiler has to translate to UTF-16, will be incorrectly translated if it contains any non-ASCII characters. Hence, for portable source code it is not a good idea to encode the source code as UTF-8 without BOM – for that is effectively to lie to the compiler.
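
The difference between the narrow and the wide literal can be modelled with plain codec calls. This is only a sketch of the reasoning above, not of the compiler itself: it assumes that the misread source is decoded as Windows ANSI Western (Python's cp1252 codec) and that a wide literal is then produced by converting the resulting characters to UTF-16:

```python
literal = "Blåbærsyltetøy!"
source_bytes = literal.encode("utf-8")  # bytes in a BOM-less UTF-8 source file

# Assumption: the compiler reads the file as Windows ANSI Western (cp1252),
# then converts each resulting character to UTF-16 for the wide literal.
misread = source_bytes.decode("cp1252")
wide_wrong = misread.encode("utf-16-le")
wide_right = literal.encode("utf-16-le")

# The narrow literal is copied byte for byte, so it is still valid UTF-8.
narrow = misread.encode("cp1252")

print(ascii(misread))
print(len(wide_wrong) // 2, len(wide_right) // 2)
```

The narrow bytes round-trip unchanged, which matches the good redirected output, while the wide version comes out three code units too long and with the wrong characters.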

Now that also g++ accepts a BOM at the start of the source code, portable source code should therefore be encoded as UTF-8 with BOM.

With the BOM in place Visual C++ correctly determines that the source code is UTF-8 encoded, although as of late 2011 this appears to still be undocumented. And with a correct assumption about the source code’s encoding, narrow string literals are correctly translated to Windows ANSI encoded string values in the executable, which means that characters outside Windows ANSI cannot be represented in them. For Unicode literals in Windows one should therefore use wide string literals, e.g. L"Blåbærsyltetøy! 日本国 кошка!", which in Windows ends up as a UTF-16 encoded string value in the executable.
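
What the BOM buys can be shown with a toy model of the encoding decision. The function decode_source below is hypothetical and only mirrors the behavior described here (UTF-8 when a BOM is present, otherwise a Windows ANSI fallback, assumed to be cp1252); it says nothing about how any compiler is actually implemented:

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """True if the byte string starts with the UTF-8 encoded BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def decode_source(data: bytes, fallback: str = "cp1252") -> str:
    """Hypothetical model of the encoding choice: UTF-8 if a BOM is
    present, otherwise the fallback (Windows ANSI Western here)."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    return data.decode(fallback)

line = 'L"Blåbærsyltetøy!"'
with_bom = codecs.BOM_UTF8 + line.encode("utf-8")
without_bom = line.encode("utf-8")

print(decode_source(with_bom) == line)      # BOM present: text recovered
print(decode_source(without_bom) == line)   # no BOM: text misread
```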

The Visual C++ UTF-8 stream mode

Use source code encoded as UTF-8 with BOM, and use wide string literals, OK (or rather, one just has to accept that complication!), but how does one then output one of these literals?

After all, std::wcout in Windows has a rather strong tendency to translate down to Windows ANSI, not to UTF-8.

Well, in his 2008 blog posting Conventional wisdom is retarded, aka What the @#%&* is _O_U16TEXT? Michael Kaplan explained that

“the [Visual C++] CRT? Starting in 2005/8.0, it knows more about Unicode than any of us having been giving it credit for…”

The Visual C++ runtime library can convert automatically between internal UTF-16 and external UTF-8, if you just ask it to do so by calling the _setmode function with the appropriate file descriptor number and mode flag. E.g., mode _O_U8TEXT causes conversion to/from UTF-8.
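
The translation itself is simple to state. The following sketch models it with Python's codecs, purely to show which bytes are on each side; it does not call the Visual C++ runtime, and the function names are my own:

```python
def utf16_to_utf8(units: bytes) -> bytes:
    """Internal UTF-16 (little-endian) to external UTF-8, as on output."""
    return units.decode("utf-16-le").encode("utf-8")

def utf8_to_utf16(data: bytes) -> bytes:
    """External UTF-8 to internal UTF-16 (little-endian), as on input."""
    return data.decode("utf-8").encode("utf-16-le")

internal = "Bjørn".encode("utf-16-le")  # what a wchar_t string holds in Windows
external = utf16_to_utf8(internal)

print(internal)
print(external)
```

Each of the five characters occupies two bytes internally, while externally only the ø needs two bytes.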

One reason that many people have not known about the Unicode support that he discusses there, a Visual C++ Unicode stream mode, is that it’s mostly undocumented. Kaplan gives a link to documentation of the deprecated _wsopen function, as one place where the mode flags have been (inadvertently?) documented. However, the main usage is through the _setmode function, where, on the contrary, the official documentation goes on about how _setmode will invoke the “invalid parameter handler” unless the mode argument is either _O_TEXT or _O_BINARY. So, by using this functionality one is not just in ordinary Microsoft undocumented land. One is wholly over in explicitly-documented-as-not-working land.

On the other hand, the official documentation is plain wrong about many things (e.g., for Visual C++ 10 it maintains that the source code encoding is limited to ASCII), the _setmode documentation is incorrect about the argument checking, and the g++ compiler provides C level support for the _O_U8TEXT mode feature. Considering all that, one may choose to ignore the will-not-work statements of the documentation and just treat them as a documentation defect, for what good is a feature that can’t be used?

Since there is not really any alternative in order to get UTF-8 translation also down at the C library level, this is the approach that I’m going to discuss in more detail in part 2.

It might seem from Kaplan’s blog posting that you don’t have to do more than just set the mode, and go! But as you can expect from something in explicitly-documented-as-not-working land, it’s not fully implemented even in Visual C++. And even less fully implemented in g++…

Summary so far

Above I introduced two approaches to Unicode handling in small Windows console programs:

  • The all UTF-8 approach where everything is encoded as UTF-8, and where there are no BOM encoding markers.
     
  • The wide string approach where all external text (including the C++ source code) is encoded as UTF-8, and all internal text is encoded as UTF-16.

The all UTF-8 approach is the approach used in a typical Linux installation. With this approach a novice can remain unaware that he is writing code that handles Unicode: it Just Works™ – in Linux. However, we saw that it mass-failed in Windows:

  • Input with active codepage 65001 (UTF-8) failed due to various bugs.
     
  • Console output with Visual C++ produced gibberish due to the runtime library’s attempt to help by using direct console output.
     
  • I mentioned how wide string literals with non-ASCII characters are incorrectly translated to UTF-16 by Visual C++ due to the necessary lying to Visual C++ about the source code encoding (which is accomplished by not having a BOM at the start of the source code file).

The wide string approach, on the other hand, was shown to have special support in Visual C++, via the _O_U8TEXT file mode, which I called a UTF-8 stream mode. But I mentioned that as of Visual C++ 10 this special file mode is not fully implemented and/or it has some bugs: it cannot be used directly but needs some scaffolding and fixing. That’s what part 2 is about.


Cheers, & enjoy!

Ch 4 of my Norwegian intro to C++ available

The Norwegian introduction to C++ programming (a bit Windows-specific) is at Google Docs, in PDF format, 4 chapters so far:

Introduksjon til C++-programmering (Windows)

Each file has a nice table of contents but for that you need to download the PDF and view it in e.g. Foxit or Adobe Acrobat. Ch 1, the Introduction, is just 1 page, though. Ch 2, tooling up with Visual C++ and learning about some Windows stuff, is more pages. And so is ch 3, about basic C++ such as loops and decisions. And ch 4, about creating console programs (all programs up to that point are GUI programs), weighs in at some 50 pages!

Perhaps it’ll become a book…

Here’s a table of contents generated by (1) using a Word TOC field and half-documented RD fields to refer to the chapters, (2) [Shift Ctrl F9] in Word (is that still documented anywhere?) to “lock” the text, (3) edit, removing unwanted entries, (4) copy as text to Crimson Editor, save, and (5) run a very very hairy C++ program to generate the HTML.

Oh, I see in the preview that instead of a purely numbered list, in the WordPress blog I get letters and roman numerals!

So be it – but there’s also a PDF of the original over at Google docs (link above).

  1. Introduksjon. | 1
  2. Første program, etc. | 1
    1. Gratis verktøy. | 1
    2. Muligens ikke helt typiske installasjonsproblemer… | 2
    3. “Hallo, verden!” i Visual Studio / om IDE prosjekter. | 6
    4. Feilretting i Visual Studio / generelt om C++ typesjekking. | 15
    5. Hva “Hallo, verden!” programteksten betyr. | 18
    6. Spesielt aktuelle Windows-ting for nybegynneren. | 21
      1. Makroer og Unicode/ANSI-versjoner av Windows API-funksjoner. | 22
      2. Moderne utseende på knapper etc. / om DLL-er og manifest-filer. | 23
      3. Ikon og versjonsinformasjon / [.exe]-fil ressurser. | 28
    7. Gir C++ ekstra mye kode og kompleksitet? | 32
    8. Å finne relevant informasjon om ting. | 32
      1. Tipsruter og automatisk fullføring. | 32
      2. Å gå direkte til en aktuell deklarasjon eller definisjon. | 33
      3. Full teknisk dokumentasjon / hjelp / kort om Microsofts “T” datatyper. | 34
      4. Dokumentasjon av C++ språket og C++ standardbiblioteket. | 36
      5. Diskusjonsfora på nettet / FAQ-er. | 38
  3. Et første subsett av C++. | 1
    1. Gjenbruk av egendefinerte headerfiler. | 1
      1. En wrapper for [windows.h]. | 2
      2. Å konfigurere en felles headerfil søkesti i Visual Studio 2010. | 6
      3. En muligens enklere & mer pålitelig måte å konfigurere Visual Studio på. | 9
    2. Grunnleggende data. | 12
      1. Variabler, tilordninger, oppdateringer, regneuttrykk, implisitt konvertering. | 14
      2. Implisitte konverteringer. | 15
      3. Initialisering og const. | 16
    3. Tekstpresentasjon og strenger. | 17
      1. Arrays som buffere, konvertering tall ? tekst. | 17
      2. Strenger, konkatenering og std::wstring-typen, anrop av medlemsfunksjon. | 18
      3. Å lage tekstgenererings-støtte / egendefinerte funksjoner & operatorer. | 22
    4. Løkker, valg og sammenligningsuttrykk. | 27
      1. Sammenligninger og boolske uttrykk. | 32
      2. Valg. | 34
      3. Løkker. | 39
    5. Funksjoner. | 41
      1. Hva du kan og ikke kan gjøre med en C++ funksjon. | 41
      2. Funksjoner som abstraksjonsverktøy. | 41
      3. Verdioverføring og referanseoverføring av argumenter. | 45
  4. Kommandotolkeren. | 1
    1. Windows kommandotolkeren [cmd.exe]. | 2
      1. Å kjøre opp en kommandotolker-instans / konfigurering av konsollvinduer. | 2
      2. Kommandoer / hjelp. | 8
      3. Kommandoredigering & utklippstavle-operasjoner. | 11
      4. Linjekontinuering & tegn-escaping. | 11
      5. Operatorer & sammensatte kommandoer / omdirigering & rørledninger. | 12
      6. Erstatting av miljøvariabel-navn / arv av miljøvariabler. | 15
      7. Kommandotolkerens søk etter programmer: %path% og %pathext%. | 16
    2. Navigasjon. | 17
    3. Å kompilere fra kommandotolkeren. | 21
      1. Å nei! “Hallo, verden!” igjen! | 22
      2. Konsoll kontra GUI subsystem. | 24
      3. Å angi linker-opsjoner til kompilatoren / separat kompilering og linking. | 26
      4. Å be kompilatoren om standard C++, please. | 27
      5. Å angi headerfilkataloger, også kjent som inkluderingskataloger. | 28
    4. Batchfiler – å automatisere f.eks. et standardoppsett. | 31
    5. C++ iostreams. | 33
      1. iostream-objekter for standard datastrømmene. | 33
      2. Datastrøm orientering: nix mix (av char og wchar_t datastrømobjekter). | 36
      3. Å detektere “slutt på datastrømmen” (EOF, end of file). | 36
      4. Innlesing av strenger. | 40
      5. Praktikalitetsdigresjon: hvordan bli kvitt navneromskvalifikasjonene. | 42
      6. Innlesing av tall. | 43
      7. Formatert utskrift med iostream manipulatorer. | 48

Cheers, & enjoy! – Alf

Current TOC for my Norwegian intro to C++

About the Norwegian C++ intro, see my earlier posting.

Not sure if this works or not, but I’m trying to embed a PDF of a Table of Contents generated by Word:

Enjoy! 🙂  [Possibly/probably more to come, after all, I’m referring to chapter 4!]

– Alf

By the way, Olve, as you can see I’ve now added a chapter 3! Not quite at 42 yet… But.

A Norwegian introduction to C++ programming (in Windows)

I’m a compulsive writer, I admit. So, when testing Visual C++ 10.0, via Microsoft’s free Visual C++ Express IDE, I wrote about it. In Norwegian!

Maybe it’ll be a book. Anyway, I always write as if it’s going to be a book! I’m an incorrigible optimist!

It’s at Google Docs, in PDF format, 2 chapters so far:

Introduksjon til C++-programmering (Windows)

Each file has a nice table of contents but for that you need to download the PDF and view it in e.g. Foxit or Adobe Acrobat. Ch 1 is just 1 page, though. Ch 2 is more pages.

[Update, 4th of August: I’ve now added chapter 3, “Et første subsett av C++”. It’s great. :-)]

Comments very welcome!

Even if your name is Olve Maudal, say! 🙂