Non-crashing Python 3.x output in Windows

Problem

The following little Python 3.x program just crashes with CPython 3.3.4 in a default-configured English Windows:

crash.py3
#encoding=utf-8
print( "Blåbærsyltetøy!")

H:\personal\web\blog alf on programming at wordpress\001\test>chcp 437
Active code page: 437

H:\personal\web\blog alf on programming at wordpress\001\test>type crash.py3 | display_utf8
#encoding=utf-8
print( "Blåbærsyltetøy!")

H:\personal\web\blog alf on programming at wordpress\001\test>crash.py3
Traceback (most recent call last):
  File "H:\personal\web\blog alf on programming at wordpress\001\test\crash.py3", line 2, in <module>
    print( "Blåbærsyltet\xf8y!")
  File "C:\Program Files\CPython 3_3_4\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xf8' in position 12: character maps to <undefined>

H:\personal\web\blog alf on programming at wordpress\001\test>_

Here codepage 437 is the original IBM PC character set, which is the default narrow text interpretation in an English Windows console window.

A partial solution is to change the default console codepage to Windows ANSI, which at least for CPython then matches the encoding used for output to a pipe or file, and consistency is nice. But Windows ANSI also has a severely limited character set, so any unsupported character can still crash the program.
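
The failure does not need a console to reproduce, since it is the encoding step that fails. Here is a minimal sketch, assuming only the standard codec names cp437 and cp1252 (the latter standing in for Windows ANSI on a Western European setup). The helper try_encode and the Cyrillic sample word are hypothetical, made up for this illustration:

```python
text = "Blåbærsyltetøy!"

def try_encode(s, codec):
    """Return the encoded bytes, or the UnicodeEncodeError that was raised."""
    try:
        return s.encode(codec)
    except UnicodeEncodeError as error:
        return error

# Codepage 437 has no 'ø', so this fails like the console output above.
cp437_result = try_encode(text, "cp437")

# Windows ANSI (here cp1252) can represent the Norwegian letters ...
ansi_result = try_encode(text, "cp1252")

# ... but it fails as soon as the text leaves its small repertoire.
cyrillic_result = try_encode("Жук", "cp1252")

for label, result in [("cp437", cp437_result),
                      ("cp1252", ansi_result),
                      ("cp1252, Cyrillic", cyrillic_result)]:
    print(label, "->", repr(result))
```

So switching the codepage only moves the crash to a different set of characters.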

Direct console output

Unicode text limited to the Basic Multilingual Plane (essentially the original 16-bit Unicode) can be output to a Windows console via the WriteConsoleW Windows API function.
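
The BMP restriction matters because WriteConsoleW takes its length argument in UTF-16 code units, while Python's len counts code points. The two agree only for text within the BMP. A small sketch of the difference, where utf16_length is a hypothetical helper and not part of any library:

```python
def utf16_length(s: str) -> int:
    """Number of UTF-16 code units needed to represent s."""
    return len(s.encode("utf-16-le")) // 2

bmp_text = "Blåbærsyltetøy!"
astral_text = "\U0001F600"      # a single code point outside the BMP

# Within the BMP the two counts are the same.
print(len(bmp_text), utf16_length(bmp_text))

# Outside the BMP each code point needs a surrogate pair, so they differ.
print(len(astral_text), utf16_length(astral_text))
```

A more general writer would presumably pass the UTF-16 length rather than len( s ), but that is outside what the code here attempts.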

The standard Python ctypes module provides access to the API:

Direct_console_io.py
import ctypes
class Object: pass

winapi = Object()
winapi.STD_INPUT_HANDLE     = -10
winapi.STD_OUTPUT_HANDLE    = -11
winapi.STD_ERROR_HANDLE     = -12
winapi.GetStdHandle         = ctypes.windll.kernel32.GetStdHandle
winapi.CloseHandle          = ctypes.windll.kernel32.CloseHandle
winapi.WriteConsoleW        = ctypes.windll.kernel32.WriteConsoleW

class Direct_console_io:
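    def write( self, s ) -> int:
        # len( s ) counts code points; that equals the UTF-16 length that
        # WriteConsoleW expects only for text within the BMP.
        n_written = ctypes.c_ulong()
        winapi.WriteConsoleW(
            self.std_output_handle, s, len( s ), ctypes.byref( n_written ), 0
            )
        return n_written.value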

    def __del__( self ):
        if not winapi: return       # Module globals may already be None at interpreter shutdown
        winapi.CloseHandle( self.std_error_handle )
        winapi.CloseHandle( self.std_output_handle )
        winapi.CloseHandle( self.std_input_handle )

    def __init__( self ):
        self.dependency = winapi
        self.std_input_handle   = winapi.GetStdHandle( winapi.STD_INPUT_HANDLE )
        self.std_output_handle  = winapi.GetStdHandle( winapi.STD_OUTPUT_HANDLE )
        self.std_error_handle   = winapi.GetStdHandle( winapi.STD_ERROR_HANDLE )

Implementing input is left as an exercise for the reader.

Overriding the standard streams to use direct i/o and UTF-8.

In addition to the silly crashing behavior, the standard streams in CPython 3.x, like sys.stdout, default to Windows ANSI for output to a file or pipe. In Python 2.7 this could be reset to the more useful UTF-8 by reloading the sys module in order to get back a dynamically removed function, sys.setdefaultencoding, that could set the default encoding. That is no longer possible in Python 3.x, so this code just creates new stream objects:

Utf8_standard_streams.py
import io
import sys
from Direct_console_io import Direct_console_io

class Dcio_raw_iobase( io.RawIOBase ):
    def writable( self ) -> bool:
        return True

    def write( self, seq_of_bytes ) -> int:
        b = bytes( seq_of_bytes )
        return self.dcio.write( b.decode( 'utf-8' ) )

    def __init__( self ):
        self.dcio = Direct_console_io()

class Dcio_buffered_writer( io.BufferedWriter ):
    def write( self, seq_of_bytes ) -> int:
        return self.raw_stream.write( seq_of_bytes )

    def flush( self ):
        pass

    def __init__( self, raw_iobase ):
        super().__init__( raw_iobase )
        self.raw_stream = raw_iobase

# Module initialization:
def __init__():
    using_console_input = sys.stdin.isatty()
    using_console_output = sys.stdout.isatty()
    using_console_error = sys.stderr.isatty()

    if using_console_output:
        raw_io = Dcio_raw_iobase()
        buf_io = Dcio_buffered_writer( raw_io )
        sys.stdout = io.TextIOWrapper( buf_io, encoding = 'utf-8' )
        sys.stdout.isatty = lambda: True
    else:
        sys.stdout = io.TextIOWrapper( sys.stdout.detach(), encoding = 'utf-8-sig' )

    if using_console_error:
        raw_io = Dcio_raw_iobase()
        buf_io = Dcio_buffered_writer( raw_io )
        sys.stderr = io.TextIOWrapper( buf_io, encoding = 'utf-8' )
        sys.stderr.isatty = lambda: True
    else:
        sys.stderr = io.TextIOWrapper( sys.stderr.detach(), encoding = 'utf-8-sig' )
    return

__init__()
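
The redirected-output branch above can be tried without any console. This is a sketch under the assumption that an io.BytesIO object is an acceptable stand-in for the byte stream that sys.stdout.detach() returns; it shows that the 'utf-8-sig' encoding puts a UTF-8 byte order mark in front of the first output:

```python
import io

text = "Blåbærsyltetøy!"

raw = io.BytesIO()                  # stand-in for sys.stdout.detach()
stream = io.TextIOWrapper( raw, encoding = 'utf-8-sig' )
stream.write( text )
stream.flush()                      # push the encoded bytes down to raw

data = raw.getvalue()
bom = data[:3]                      # the three-byte UTF-8 BOM
payload = data[3:].decode( 'utf-8' )

print( bom, payload == text )
```

The BOM lets Windows tools recognize the file or pipe contents as UTF-8, at the cost of three extra bytes at the start.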

Disclaimer: It’s been a long time since I fiddled with Python, so possibly I’m breaking a number of conventions plus doing things in some less than optimal way. But this was the first path I found through the jungle of apparently arbitrary io class derivations etc. It worked well enough for my purposes (in a little script to convert NRK’s HTML-format subtitles to SubRip format), so, I gather it can be useful also for you – at least as a basis for more robust and/or more general code.

[wordpress] stats broken – proof

[Latest, as of 10th June, just a few hours after posting: Oops, while the stats probably are somewhat broken, this purported “proof” is a mis-interpretation (and that could apply also to my earlier observation): the stats main page only shows the 10 most viewed pages. Argh. Is it better to add a comment like this, showing off my rush-to-conclusions error to all readers, or to silently delete the posting? I left the posting in place. In a way, my eror ilustrats my point that its extremely easy to make errors, especially based on invalid assumptions!]

In my earlier posting “wordpress: broken stats?” I conjectured that the WordPress number-of-views-of-your-blog stats are broken. I then found an instance where the number of views of a posting decreased from 1 to 0 (within the same day), but that’s a bit difficult to show visually! However, right now the stats of this blog offer self-contradicting information in the same page, so I took a snapshot of the screen.

How to create a bug like this Thunderbird bug

[Screenshot of Thunderbird address bug]

You thought it would be safe to post an article to [comp.lang.c++.moderated] using the X-Replace-Address header (or line within the article). No spam, since in the posted article the address will appear as clcppm-poster@this.is.invalid. 🙂 Hah!

[cppx] How to avoid disastrous integer wrap-around

The time is 65535:65535 PM, and the temperature is a nice comfy 4294967295 degrees… Oops. Clearly some signed integer values have been promoted to unsigned somewhere, or perhaps the program has just used unsigned integer arithmetic incorrectly.

There are two main cures for that:

  • make sure that you’re only using unsigned arithmetic in expressions that involve unsigned type numerical values, or …
  • make sure that you’re only using signed arithmetic in expressions that involve unsigned type numerical values.

As of 2010 most programmers still choose the first cure, attempting to use unsigned arithmetic wherever there are unsigned numerical values involved, e.g. being careful to use size_t instead of int for some loop counter. I think the main reason is that it’s an old convention (which is propagated by e.g. the C++ standard library), a case of “frozen history”, but it may also have something to do with each intrusion of unsigned numerical values being viewed as a purely local problem, suggesting that a purely local fix, each time, is appropriate. This then leads to a bug-attracting mix of signed and unsigned type expressions, which moreover are difficult to recognize as respectively signed and unsigned type, and are easy to get wrong.

Here I’m advocating the second cure, a single global convention: do not mix signed and unsigned numerical types. The unsigned types are for bit-level operations (only), while the signed types are for numerical values (only). But since the standard library was designed before the advantages of this clear usage separation were fully realized, and indeed at a time when some computer architectures still conferred some advantages to e.g. an unsigned size(), adopting this more rational and work-saving convention requires some support.
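
The wrap-around itself is easy to demonstrate. Since Python integers never wrap, this sketch uses the fixed-width ctypes integer types as stand-ins for the C++ types; the three helper functions and the item_count example are hypothetical, chosen only to reproduce the numbers in the opening paragraph:

```python
import ctypes

def as_unsigned16( value: int ) -> int:
    """The value after conversion to a 16-bit unsigned integer."""
    return ctypes.c_uint16( value ).value

def as_unsigned32( value: int ) -> int:
    """The value after conversion to a 32-bit unsigned integer."""
    return ctypes.c_uint32( value ).value

def as_signed32( value: int ) -> int:
    """The value after conversion to a 32-bit signed integer."""
    return ctypes.c_int32( value ).value

# A -1 that ends up in an unsigned type gives the clock and the temperature.
print( as_unsigned16( -1 ), as_unsigned32( -1 ) )

item_count = 0      # e.g. what an unsigned size() reports for an empty container

# Unsigned arithmetic: "index of last item" wraps around to a huge value.
last_index_unsigned = as_unsigned32( item_count - 1 )

# Signed arithmetic: the same expression yields an ordinary -1.
last_index_signed = as_signed32( item_count - 1 )

print( last_index_unsigned, last_index_signed )
```

With signed arithmetic a check like "last index < 0" works as expected, while with unsigned arithmetic it can never be true.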

[wordpress] WordPress replacing “C++” tag with “C#”!

It turned out that WordPress replaces all “C++” tags with “C#” tags in their presentation of postings about any particular theme, like the list of postings about programming.

Technically a “C++” tag is first transformed to the tag id “c” by removing all punctuation, and then the presentation of tag id “c” is as “C#” (not even with a proper superscript “#”).
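
The collision is easy to reproduce. The following is not WordPress’ actual code, only a guess at a rule with the same effect (drop punctuation, then lowercase), with tag_id as a hypothetical name:

```python
import string

def tag_id( tag: str ) -> str:
    """Hypothetical tag id rule: remove all punctuation, then lowercase."""
    kept = "".join( ch for ch in tag if ch not in string.punctuation )
    return kept.lower()

# Three different languages, one and the same tag id.
for tag in ( "C++", "C#", "C" ):
    print( repr( tag ), "->", repr( tag_id( tag ) ) )
```

Once the three tags share one id, whatever name is presented for that id will be wrong for two of them.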

I understand that to people at large “C++” or “C#” or just “C” is much the same thing, really, so it shouldn’t matter what WordPress presented it as?

But this is akin to me posting about how I’m the proud owner of a Bentley, and WordPress informing the world that my blog is about how aroused I am as owner of a mare (female horse).

How does one get from “proud owner of Bentley”, to “aroused owner of mare”? First one, i.e. WordPress, generalizes (like “C++” → “c”). Proud, that must mean happy, surely, and Bentley, that must mean a means of transportation, surely. So, he means that he’s the happy owner of a means of transportation.

Then one, i.e. WordPress, specializes, incorrectly. Happy, oh that must mean aroused, and means of transportation, that must mean a mare, and even if that perhaps is not 100% precisely what he meant it does give the rough idea, surely?, and if we chose something else we’d offend someone else!

It’s all so needless.

Just don’t do it, WordPress.