The Windows console subsystem has a host of Unicode-related bugs. And standard Windows programs such as more (not to mention the C# 4.0 compiler csc) just crash when they’re run from a console window with UTF-8 as active codepage, perplexingly claiming that they’re out of memory. On top of that the C++ runtime libraries of various compilers differ in how they behave. Doing C++ Unicode i/o in Windows consoles is therefore problematic. In this series I show how to work around limitations of the Visual C++ _O_U8TEXT file mode, with the Visual C++ and g++ compilers. This yields an automatic translation between external UTF-8 and internal UTF-16, enabling Windows console i/o of characters in the Basic Multilingual Plane.
- Recap
- UTF-8 stream mode: the good (wide stream output)
- UTF-8 stream mode: the bad (input)
- UTF-8 stream mode: the bad (input) – FIXED
- UTF-8 stream mode: the ugly (narrow streams)
- UTF-8 stream mode: the ugly (narrow streams) – FIXED
- Summary
Recap
In part 1 I introduced two approaches to Unicode handling in small Windows console programs:
- The all UTF-8 approach where everything is encoded as UTF-8, and where there are no BOM encoding markers.
- The wide string approach where all external text (including the C++ source code) is encoded as UTF-8, and all internal text is encoded as UTF-16.
The all UTF-8 approach is the approach used in a typical Linux installation. With this approach a novice can remain unaware that he is writing code that handles Unicode: it Just Works™ – in Linux. However, we saw that it mass-failed in Windows:
- Input with active codepage 65001 (UTF-8) failed due to various bugs.
- Console output with Visual C++ produced gibberish due to the runtime library’s attempt to help by using direct console output.
- I mentioned how wide string literals with non-ASCII characters are incorrectly translated to UTF-16 by Visual C++ due to the necessary lying to Visual C++ about the source code encoding (which is accomplished by not having a BOM at the start of the source code file).
The wide string approach, on the other hand, was shown to have special support in Visual C++, via the _O_U8TEXT file mode, which I called a UTF-8 stream mode. This mode works down at the C FILE level so that wide character operations on C FILE streams get automatic conversion to/from external UTF-8 encoding. That C FILE level support is needed, and it is practically impossible for the application programmer to provide, so it’s a very good thing to have that UTF-8 stream mode…
I mentioned that as of Visual C++ 10 the UTF-8 mode is not fully implemented, and that it apparently has some bugs. I.e., that it cannot be used directly but needs some scaffolding and fixing. That’s what this second part is about.
Here I do not yet consider the g++ compiler. All this code is Visual C++ specific. However, I have done generally the same with g++, and I will probably discuss that in a third installment.
UTF-8 stream mode: the good (wide stream output)
The good news about the _O_U8TEXT mode is that it works for wide stream output, even for output of narrow strings via the wide stream:
#include <stdexcept>        // std::runtime_error, std::exception
#include <stdlib.h>         // EXIT_SUCCESS, EXIT_FAILURE
#include <iostream>         // std::wcout, std::wcerr, std::endl
#include <string>           // std::string, std::wstring
using namespace std;

#include <io.h>             // _setmode, _fileno
#include <fcntl.h>          // _O_U8TEXT

bool throwX( string const& s ) { throw runtime_error( s ); }
bool hopefully( bool v ) { return v; }

void setUtf8Mode( FILE* f, char const name[] )
{
    int const newMode = _setmode( _fileno( f ), _O_U8TEXT );
    hopefully( newMode != -1 )
        || throwX( string() + "setmode failed for " + name );
}

int main()
{
    try
    {
        static char const narrowText[] = "Blåbærsyltetøy! 日本国 кошка!";
        static wchar_t const wideText[] = L"Blåbærsyltetøy! 日本国 кошка!";

        setUtf8Mode( stdout, "stdout" );
        wcout << "Narrow text: " << narrowText << endl;
        wcout << "Wide text: " << wideText << endl;
        return EXIT_SUCCESS;
    }
    catch( exception const& x )
    {
        wcerr << "!" << x.what() << endl;
    }
    return EXIT_FAILURE;
}
Visual C++ produces a series of warnings for this source code:
W:\examples> cl utf8_mode.msvc.output.good.cpp /Fe"good"
utf8_mode.msvc.output.good.cpp
utf8_mode.msvc.output.good.cpp(24) : warning C4566: character represented by universal-character-name '\u65E5' cannot be represented in the current code page (1252)
utf8_mode.msvc.output.good.cpp(24) : warning C4566: character represented by universal-character-name '\u672C' cannot be represented in the current code page (1252)
utf8_mode.msvc.output.good.cpp(24) : warning C4566: character represented by universal-character-name '\u56FD' cannot be represented in the current code page (1252)
utf8_mode.msvc.output.good.cpp(24) : warning C4566: character represented by universal-character-name '\u043A' cannot be represented in the current code page (1252)
utf8_mode.msvc.output.good.cpp(24) : warning C4566: character represented by universal-character-name '\u043E' cannot be represented in the current code page (1252)
utf8_mode.msvc.output.good.cpp(24) : warning C4566: character represented by universal-character-name '\u0448' cannot be represented in the current code page (1252)
utf8_mode.msvc.output.good.cpp(24) : warning C4566: character represented by universal-character-name '\u043A' cannot be represented in the current code page (1252)
utf8_mode.msvc.output.good.cpp(24) : warning C4566: character represented by universal-character-name '\u0430' cannot be represented in the current code page (1252)
W:\examples> _
It might look as if these warnings are due to the active codepage in the console window, but they’re not related. Visual C++ is just complaining about information loss when it attempts to convert the narrowText literal from the source code’s UTF-8 to its incorrectly documented C++ execution character set, which is Windows ANSI. Here Windows ANSI means the codepage reported by the GetACP Windows API function, which defaults to the codepage specified in the ACP value of the (as far as I know undocumented) registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage.
So, the effect of compiling can be a bit different depending on the language that Windows is installed for, which determines the default GetACP codepage. But at least the compilation doesn’t depend on such an ephemeral setting as the active codepage in some console window! And these warnings are good: they’re spot on, for a change. 🙂
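To make the information loss concrete, here is a minimal portable sketch of such a lossy narrowing. It is not what the compiler actually runs: it assumes ISO 8859-1 (Latin-1) as a simplified stand-in for Windows ANSI codepage 1252, and the function name narrowedToLatin1 is made up for illustration.

```cpp
#include <cassert>      // assert
#include <string>       // std::string, std::wstring

// Hypothetical helper: narrows wide code units to a single-byte encoding.
// ASSUMPTION: ISO 8859-1 stands in for Windows ANSI codepage 1252 here, so
// any code unit above U+00FF has no single-byte form and becomes '?'.
std::string narrowedToLatin1( std::wstring const& wide )
{
    std::string result;
    for( wchar_t const ch : wide )
    {
        unsigned long const code = static_cast<unsigned long>( ch );
        if( code <= 0xFF )
        {
            result += static_cast<char>( static_cast<unsigned char>( code ) );
        }
        else
        {
            result += '?';      // Information loss, the thing C4566 warns about.
        }
    }
    assert( result.size() == wide.size() );     // One byte per input code unit.
    return result;
}
```

With this stand-in the Norwegian letters survive, while each Japanese and Russian character is replaced by a question mark.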
Now, does the program work?
W:\examples> chcp 1252
Active code page: 1252
W:\examples> good
Narrow text: Blåbærsyltetøy! ??? ?????!
Wide text: Blåbærsyltetøy! 日本国 кошка!
W:\examples> good >good_result
W:\examples> type good_result
Narrow text: BlÃ¥bærsyltetøy! ??? ?????!
Wide text: BlÃ¥bærsyltetøy! 日本国 кошка!
W:\examples> chcp 65001
Active code page: 65001
W:\examples> good
Narrow text: Blåbærsyltetøy! ??? ?????!
Wide text: Blåbærsyltetøy! 日本国 кошка!
W:\examples> type good_result
Narrow text: Blåbærsyltetøy! ??? ?????!
Wide text: Blåbærsyltetøy! 日本国 кошка!
W:\examples> _
Yes! It manages to present correct output even with active codepage 1252 (Windows ANSI Western) because it uses direct console i/o for this case. And as you can see the redirected output is UTF-8, as it should be.
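The “Ã¥” shown when the file is typed under codepage 1252 is simply how the two UTF-8 bytes of “å” look when each byte is displayed as a separate character. A minimal sketch of UTF-8 encoding, limited to code points in the Basic Multilingual Plane and with utf8From as a hypothetical name, shows where those bytes come from:

```cpp
#include <cassert>      // assert
#include <string>       // std::string

// Hypothetical helper: encodes one BMP code point (U+0000 through U+FFFF)
// as UTF-8. Surrogates and code points beyond the BMP are not handled.
std::string utf8From( unsigned const codePoint )
{
    assert( codePoint <= 0xFFFF );
    std::string bytes;
    if( codePoint < 0x80 )
    {
        // 0xxxxxxx: ASCII is encoded as itself.
        bytes += static_cast<char>( codePoint );
    }
    else if( codePoint < 0x800 )
    {
        // 110xxxxx 10xxxxxx
        bytes += static_cast<char>( 0xC0 | (codePoint >> 6) );
        bytes += static_cast<char>( 0x80 | (codePoint & 0x3F) );
    }
    else
    {
        // 1110xxxx 10xxxxxx 10xxxxxx
        bytes += static_cast<char>( 0xE0 | (codePoint >> 12) );
        bytes += static_cast<char>( 0x80 | ((codePoint >> 6) & 0x3F) );
        bytes += static_cast<char>( 0x80 | (codePoint & 0x3F) );
    }
    return bytes;
}
```

For “å”, U+00E5, this gives the bytes C3 A5, which codepage 1252 displays as “Ã” and “¥”.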
Even more goodness: the UTF-8 mode also works down at the C library level, using e.g. the wprintf function.
UTF-8 stream mode: the bad (input)
The bad news about the _O_U8TEXT mode is that input is almost as non-functional as with an active codepage 65001.
Apparently the runtime retrieves input as 1 byte per character via the standard input stream, which means that the input is encoded according to the console window’s active codepage. These bytes, which e.g. encode your input text in Windows ANSI, are then interpreted as if they were UTF-8 encoded text.
That interpretation would only be correct for active codepage 65001, but with that codepage the input operation just fails. So, depending on the active codepage the input operation will either produce garbage for non-ASCII characters, or it will fail outright.
#include <stdexcept>        // std::runtime_error, std::exception
#include <stdlib.h>         // EXIT_SUCCESS, EXIT_FAILURE
#include <iostream>         // std::wcout, std::wcerr, std::endl
#include <string>           // std::string, std::wstring
using namespace std;

#include <io.h>             // _setmode, _fileno
#include <fcntl.h>          // _O_U8TEXT

bool throwX( string const& s ) { throw runtime_error( s ); }
bool hopefully( bool v ) { return v; }

void setUtf8Mode( FILE* f, char const name[] )
{
    int const newMode = _setmode( _fileno( f ), _O_U8TEXT );
    hopefully( newMode != -1 )
        || throwX( string() + "setmode failed for " + name );
}

void initStreams()
{
    setUtf8Mode( stdin, "stdin" );
    setUtf8Mode( stdout, "stdout" );
}

wstring lineFrom( wistream& stream )
{
    wstring result;
    getline( stream, result );
    hopefully( !stream.fail() )
        || throwX( "lineFrom: getline failed" );
    return result;
}

int main()
{
    try
    {
        initStreams();
        wcout << "What's your name? ";
        wstring const name = lineFrom( wcin );
        wcout << "Pleased to meet you, " << name << "!" << endl;

        int const n = name.length();
        wcout << endl;
        wcout << "I represented your name as " << n << " wide characters:" << endl;
        for( int i = 0; i < n; ++i )
        {
            if( i > 0 ) { wcout << " | "; }
            wcout << hex << 0 + name[i] << " '" << name[i] << "'";
        }
        wcout << endl;
        return EXIT_SUCCESS;
    }
    catch( exception const& x )
    {
        wcerr << "!" << x.what() << endl;
    }
    return EXIT_FAILURE;
}
Testing this:
W:\examples> chcp 1252
Active code page: 1252
W:\examples> cl utf8_mode.msvc.input.bad.cpp /Fe"bad_input"
utf8_mode.msvc.input.bad.cpp
W:\examples> bad_input
What's your name? Bjørn
Pleased to meet you, Bj�rn!
I represented your name as 5 wide characters:
42 'B' | 6a 'j' | fffd '�' | 72 'r' | 6e 'n'
W:\examples> type good_result
Narrow text: BlÃ¥bærsyltetøy! ??? ?????!
Wide text: BlÃ¥bærsyltetøy! 日本国 кошка!
W:\examples> bad_input
What's your name? BlÃ¥bærsyltetøy
Pleased to meet you, Blåbærsyltetøy!
I represented your name as 14 wide characters:
42 'B' | 6c 'l' | e5 'å' | 62 'b' | e6 'æ' | 72 'r' | 73 's' | 79 'y' | 6c 'l' | 74 't' | 65 'e' | 74 't' | f8 'ø' | 79 'y'
W:\examples> _
The last run with the funny characters pasted as input shows that interactive input of the UTF-8 byte values works. The input byte values, that look pretty funny in the console, are the UTF-8 byte values that encode “Blåbærsyltetøy”. And with that as the cleartext result the received byte values must have been interpreted directly as UTF-8.
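Both runs can be reproduced with a small UTF-8 decoder. This is a simplified sketch and not the runtime’s actual code: codePointsFromUtf8 is a hypothetical name, only one- to three-byte sequences (the BMP) are handled, overlong forms are not rejected, and every byte that cannot be interpreted is mapped to U+FFFD.

```cpp
#include <cassert>      // assert
#include <cstddef>      // std::size_t
#include <string>       // std::string
#include <vector>       // std::vector

// Hypothetical helper: decodes bytes as UTF-8, substituting U+FFFD for
// anything that is not a valid 1-, 2- or 3-byte sequence.
std::vector<unsigned> codePointsFromUtf8( std::string const& bytes )
{
    unsigned const replacement = 0xFFFD;
    std::vector<unsigned> result;
    std::size_t i = 0;
    while( i < bytes.size() )
    {
        unsigned const lead = static_cast<unsigned char>( bytes[i] );
        ++i;
        int nContinuationBytes = 0;
        unsigned code = lead;
        if( lead < 0x80 )                   { nContinuationBytes = 0; }
        else if( (lead & 0xE0) == 0xC0 )    { nContinuationBytes = 1; code = lead & 0x1F; }
        else if( (lead & 0xF0) == 0xE0 )    { nContinuationBytes = 2; code = lead & 0x0F; }
        else
        {
            // Stray continuation byte, or a lead byte this sketch does not handle.
            result.push_back( replacement );
            continue;
        }

        bool isValid = true;
        for( int k = 0; k < nContinuationBytes && isValid; ++k )
        {
            unsigned const next =
                (i < bytes.size()? static_cast<unsigned char>( bytes[i] ) : 0u);
            if( (next & 0xC0) == 0x80 )     { code = (code << 6) | (next & 0x3F); ++i; }
            else                            { isValid = false; }
        }
        result.push_back( isValid? code : replacement );
    }
    assert( result.size() <= bytes.size() );    // Every code point used at least one byte.
    return result;
}
```

Given the codepage 1252 bytes of “Bjørn”, where “ø” is the single byte F8, this yields 42 6a fffd 72 6e. Given the UTF-8 bytes C3 B8 for “ø” it yields f8 instead.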
Instead the runtime should have used direct console input, just as it uses direct console output when the standard output stream is connected to a console window. Since it does not (or does not manage to do that correctly), that’s what we have to provide.
UTF-8 stream mode: the bad (input) – FIXED
It is apparently easy to check whether the standard input stream is connected to a console window, namely via the _isatty function. However, the documentation hints ominously about _isatty maybe returning true for a stream connected to a serial port, and archaic things like that. Happily one can alternatively use the lower level Windows API console functions, like e.g. GetConsoleMode, which presumably will only succeed for a stream that represents something that supports the Windows console functions.
#include <iostream>
#include <stdio.h>      // stdin, _fileno
#include <io.h>         // _isatty
#include <windows.h>    // GetConsoleMode

int main()
{
    DWORD consoleMode;
    HANDLE const inputHandle = GetStdHandle( STD_INPUT_HANDLE );
    bool const winapiSaysConsole = !!GetConsoleMode( inputHandle, &consoleMode );
    bool const clibSaysConsole = !!_isatty( _fileno( stdin ) );

    using namespace std;
    cerr << boolalpha;
    cerr << "_isatty: " << clibSaysConsole << endl;
    cerr << "GetConsoleMode: " << winapiSaysConsole << endl;
}
W:\examples> cl is_input_console.cpp /Fe"x"
is_input_console.cpp
W:\examples> x
_isatty: true
GetConsoleMode: true
W:\examples> x <nul
_isatty: true
GetConsoleMode: false
W:\examples> x <is_input_console.cpp
_isatty: false
GetConsoleMode: false
W:\examples> _
It surprised me that _isatty here incorrectly identified the Windows nul device as a console window. However, GetConsoleMode got it right. So GetConsoleMode is evidently more reliable for this detection task.
Anyway, with a console identified as such, at the C++ level it’s then possible to override things in std::basic_streambuf, where the required core functionality maps almost directly to a call of the Windows API function ReadConsole.
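As a portable sketch of that idea, the stream buffer below overrides underflow and refills itself from a callback. The callback type ChunkSource is a hypothetical stand-in for a console read such as ReadConsole, and the class name ChunkInputBuffer is likewise made up. The carriage return removal and the Ctrl+Z check are the kind of cleanup a Windows console line reader needs.

```cpp
#include <algorithm>    // std::remove
#include <cassert>      // assert
#include <functional>   // std::function
#include <istream>      // std::wistream (for the usage described after the code)
#include <streambuf>    // std::basic_streambuf
#include <string>       // std::wstring

// Hypothetical stand-in for a console read: each call returns the next raw
// chunk of input, such as L"text\r\n", or an empty string at end of input.
typedef std::function< std::wstring() > ChunkSource;

class ChunkInputBuffer
    : public std::basic_streambuf< wchar_t >
{
private:
    ChunkSource     source_;
    std::wstring    buffer_;

protected:
    virtual int_type underflow()
    {
        if( gptr() == 0 || gptr() >= egptr() )
        {
            // The read area is exhausted: fetch one more chunk and clean it.
            buffer_ = source_();
            buffer_.erase(
                std::remove( buffer_.begin(), buffer_.end(), L'\r' ),
                buffer_.end()
                );

            wchar_t const ctrlZ = wchar_t( 26 );
            bool const isInteractiveEof =
                (buffer_.size() == 2 && buffer_[0] == ctrlZ && buffer_[1] == L'\n');
            if( buffer_.empty() || isInteractiveEof )
            {
                setg( 0, 0, 0 );
                return traits_type::eof();
            }
            setg( &buffer_[0], &buffer_[0], &buffer_[0] + buffer_.size() );
        }
        return traits_type::to_int_type( *gptr() );
    }

public:
    explicit ChunkInputBuffer( ChunkSource const& source )
        : source_( source )
    {
        assert( source_ != nullptr );
    }
};
```

Wrapping an instance as std::wistream stream( &buffer ) lets std::getline deliver the text of each chunk without the carriage return, and a chunk consisting of Ctrl+Z plus newline ends the input.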
The program below illustrates the technique by adding the necessary special input support to the previous section’s program, with main unchanged. This program is however not intended to provide directly reusable code. It’s just a problem-specific concrete example written in a reusable-like style:
#ifdef _MSC_VER
#   pragma warning( disable: 4373 )     // "Your override overrides"
#endif

#include <algorithm>        // std::remove
#include <stddef.h>         // ptrdiff_t
#include <stdexcept>        // std::runtime_error, std::exception
#include <stdlib.h>         // EXIT_SUCCESS, EXIT_FAILURE
#include <iostream>         // std::wcout, std::wcerr, std::endl
#include <streambuf>        // std::basic_streambuf
#include <string>           // std::string, std::wstring
using namespace std;

#include <io.h>             // _setmode, _fileno
#include <fcntl.h>          // _O_U8TEXT

#undef UNICODE
#define UNICODE
#include <windows.h>        // ReadConsole

typedef ptrdiff_t Size;

bool throwX( string const& s ) { throw runtime_error( s ); }
bool hopefully( bool v ) { return v; }

class DirectInputBuffer
    : public std::basic_streambuf< wchar_t >
{
private:
    wstring buffer_;

    Size bufferSize() const     { return buffer_.size(); }
    wchar_t* pBufferStart()     { return &buffer_[0]; }
    wchar_t* pBufferEnd()       { return pBufferStart() + bufferSize(); }

    wchar_t* pStart() const     { return eback(); }
    wchar_t* pCurrent() const   { return gptr(); }
    wchar_t* pEnd() const       { return egptr(); }

    static HANDLE inputHandle()
    {
        static HANDLE const handle = GetStdHandle( STD_INPUT_HANDLE );
        return handle;
    }

public:
    typedef std::basic_streambuf< wchar_t > Base;
    typedef Base::traits_type Traits;

    DirectInputBuffer( Base const& anOriginalBuffer )
        : Base( anOriginalBuffer )      // Copies buffer read area pointers.
        , buffer_( 256, L'#' )
    {}

protected:
    virtual streamsize xsgetn( wchar_t* const pBuffer, streamsize const n )
    {
        wchar_t const ctrlZ = wchar_t( 1 + ('Z' - 'A') );

        DWORD nCharactersRead = 0;
        bool const readSucceeded = !!ReadConsole(
            inputHandle(), pBuffer, static_cast< DWORD >( n ), &nCharactersRead, nullptr
            );

        if( readSucceeded )
        {
            wchar_t const* const pCleanEnd =
                remove( pBuffer, pBuffer + nCharactersRead, L'\r' );
            nCharactersRead = pCleanEnd - pBuffer;
            bool const isInteractiveEOF =
                (nCharactersRead == 2 && pBuffer[0] == ctrlZ && pBuffer[1] == '\n');
            return (isInteractiveEOF? 0 : static_cast< streamsize >( nCharactersRead ));
        }
        return 0;
    }

    virtual int_type underflow()
    {
        // Try to get some more input (maximum a line).
        if( pCurrent() == 0 || pCurrent() >= pEnd() )
        {
            streamsize const nCharactersRead =
                xsgetn( pBufferStart(), bufferSize() );

            if( nCharactersRead > 0 )
            {
                setg(
                    pBufferStart(),                     // Reading area start
                    pBufferStart(),                     // Reading area current
                    pBufferStart() + nCharactersRead    // Reading area end
                    );
            }
        }

        if( pCurrent() == 0 || pCurrent() >= pEnd() )
        {
            return Traits::eof();
        }
        return Traits::to_int_type( *pCurrent() );
    }
};

void setUtf8Mode( FILE* f, char const name[] )
{
    int const newMode = _setmode( _fileno( f ), _O_U8TEXT );
    hopefully( newMode != -1 )
        || throwX( string() + "setmode failed for " + name );
}

bool inputIsFromConsole()
{
    static HANDLE const inputHandle = GetStdHandle( STD_INPUT_HANDLE );

    DWORD consoleMode;
    return !!GetConsoleMode( inputHandle, &consoleMode );
}

void initStreams()
{
    if( inputIsFromConsole() )
    {
        static DirectInputBuffer buffer( *wcin.rdbuf() );
        wcin.rdbuf( &buffer );
    }
    setUtf8Mode( stdin, "stdin" );
    setUtf8Mode( stdout, "stdout" );
}

wstring lineFrom( wistream& stream )
{
    wstring result;
    getline( stream, result );
    hopefully( !stream.fail() )
        || throwX( "lineFrom: getline failed" );
    return result;
}

int main()
{
    try
    {
        initStreams();
        wcout << "What's your name? ";
        wstring const name = lineFrom( wcin );
        wcout << "Pleased to meet you, " << name << "!" << endl;

        int const n = name.length();
        wcout << endl;
        wcout << "I represented your name as " << n << " wide characters:" << endl;
        for( int i = 0; i < n; ++i )
        {
            if( i > 0 ) { wcout << " | "; }
            wcout << hex << 0 + name[i] << " '" << name[i] << "'";
        }
        wcout << endl;
        return EXIT_SUCCESS;
    }
    catch( exception const& x )
    {
        wcerr << "!" << x.what() << endl;
    }
    return EXIT_FAILURE;
}
Testing this:
W:\examples> chcp 1252
Active code page: 1252
W:\examples> cl utf8_mode.msvc.input.fixed.cpp /Fe"fixed_input"
utf8_mode.msvc.input.fixed.cpp
W:\examples> fixed_input
What's your name? Bjørn
Pleased to meet you, Bjørn!
I represented your name as 5 wide characters:
42 'B' | 6a 'j' | f8 'ø' | 72 'r' | 6e 'n'
W:\examples> fixed_input
What's your name? 日本国 кошка
Pleased to meet you, 日本国 кошка!
I represented your name as 9 wide characters:
65e5 '日' | 672c '本' | 56fd '国' | 20 ' ' | 43a 'к' | 43e 'о' | 448 'ш' | 43a 'к' | 430 'а'
W:\examples> _
Yay! 🙂
But I’d better mention again that in my English-language Windows 7 the console window stores the Chinese characters correctly but is only able to display them as rectangles, ▭. Correctly stored means that copy and paste works, which is why the text dumps above show these characters. However, the Norwegian and Russian characters are both stored and displayed correctly.
Also, I’d better mention that this is only a C++ level solution:
- Input from the standard input stream should only be done via wcin.
Explained:
Input from the standard input stream should only be done at the C++ iostream level because in practice the UTF-8 mode input operation bugs can only be fixed at the C++ iostream level.
And input should only be done via the C++ wide stream wcin because, first of all, input data can be arbitrary, and secondly, with UTF-8 mode the narrow character operations just fail…
UTF-8 stream mode: the ugly (narrow streams)
C++11 (more precisely the N3290 final draft) §27.4.1/3:
- “Mixing operations on corresponding wide- and narrow-character streams follows the same semantics as mixing such operations on FILEs, as specified in Amendment 1 of the ISO C standard.”
C99 (more precisely the N869 draft) §7.19.2/4:
- “Each stream has an orientation. After a stream is associated with an external file, but before any operations are performed on it, the stream is without orientation. Once a wide-character input/output function has been applied to a stream without orientation, the stream becomes a wide-oriented stream. Similarly, once a byte input/output function has been applied to a stream without orientation, the stream becomes a byte-oriented stream. Only a call to the freopen function or the fwide function can otherwise alter the orientation of a stream. (A successful call to freopen removes any orientation.)”
C99 (more precisely the N869 draft) §7.19.2/5:
- “Byte input/output functions shall not be applied to a wide-oriented stream and wide-character input/output functions shall not be applied to a byte-oriented stream.”
By these rules one would have to decide on using either cerr or wcerr in a program, but not both. That’s not very practical, considering, for example, that one library component might produce error messages via cerr, while another might use wcerr for that purpose. But Visual C++ apparently follows a more practical-for-Windows-programming set of rules.
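The orientation rules quoted above can be observed with fwide, assuming a C library that implements it as specified. In the sketch below the type and function names are hypothetical. Passing 0 as mode only queries the orientation, and a request for the opposite orientation on an already oriented stream is expected to have no effect.

```cpp
#include <cassert>      // assert
#include <cstdio>       // std::FILE, std::tmpfile, std::fputs, std::fclose
#include <cwchar>       // std::fwide, std::fputws

enum Orientation { noOrientation, byteOriented, wideOriented };

// Queries the orientation without changing it (mode argument 0).
Orientation orientationOf( std::FILE* const f )
{
    assert( f != 0 );
    int const mode = std::fwide( f, 0 );
    return (mode > 0? wideOriented : mode < 0? byteOriented : noOrientation);
}

struct OrientationTrace
{
    Orientation initially;
    Orientation afterFirstWrite;
    Orientation afterRequestingOpposite;
};

// Hypothetical demo: opens a fresh temporary file, does one wide or narrow
// write, then asks fwide for the opposite orientation.
OrientationTrace traceOrientation( bool const firstWriteIsWide )
{
    std::FILE* const f = std::tmpfile();
    assert( f != 0 );

    OrientationTrace trace;
    trace.initially = orientationOf( f );

    if( firstWriteIsWide )  { std::fputws( L"wide", f ); }
    else                    { std::fputs( "narrow", f ); }
    trace.afterFirstWrite = orientationOf( f );

    std::fwide( f, firstWriteIsWide? -1 : +1 );
    trace.afterRequestingOpposite = orientationOf( f );

    std::fclose( f );
    return trace;
}
```

On such a library the first write fixes the orientation, which is why strictly conforming code has to pick either the narrow or the wide functions for a given stream.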
For Visual C++ 10.0 the fwide function is documented as being unimplemented. And from a practical point of view, at least at the level of outputting whole lines it apparently works fine to intermingle use of cout and wcout. So, happily, Visual C++ apparently just disregards the standard’s requirements and does not maintain an impractical explicit C FILE stream orientation.
Except … that with UTF-8 mode for the standard output stream, use of cout crashes the program!
#include <stdexcept>    // std::exception
#include <stdlib.h>     // EXIT_SUCCESS, EXIT_FAILURE
#include <iostream>     // std::cout, std::endl
using namespace std;

#include <io.h>         // _setmode, _fileno
#include <fcntl.h>      // _O_U8TEXT

void initStreams()
{
    _setmode( _fileno( stdin ), _O_U8TEXT );
    _setmode( _fileno( stdout ), _O_U8TEXT );
    _setmode( _fileno( stderr ), _O_U8TEXT );
}

int main()
{
    try
    {
        initStreams();
        cout << "Hello, world!" << endl;
        cout << "Did you know, most Norwegians like 'blåbærsyltetøy'?" << endl;
        cerr << ":This is an error message, also mentioning 'blåbærsyltetøy'." << endl;
        return EXIT_SUCCESS;
    }
    catch( exception const& x )
    {
        cerr << "!" << x.what() << endl;
    }
    catch( ... )
    {
        cerr << "!Unknown exception." << endl;
    }
    return EXIT_FAILURE;
}
W:\examples> ugly
W:\examples> _
The crash is caused by an assertion in the Visual C++ 10.0 fputc implementation, in source file [fputc.c]:
int __cdecl fputc (
        int ch,
        FILE *str
        )
{
    int retval=0;

    _VALIDATE_RETURN((str != NULL), EINVAL, EOF);

    _lock_str(str);
    __try {
        _VALIDATE_STREAM_ANSI_SETRET(str, EINVAL, retval, EOF);     // <-- Uh oh.

        if (retval==0)
        {
            retval = _putc_nolock(ch,str);
        }
    }
    __finally {
        _unlock_str(str);
    }

    return(retval);
}
where _VALIDATE_STREAM_ANSI_SETRET is defined thusly in [internal.h]:
/* We use _VALIDATE_STREAM_ANSI_SETRET to ensure that ANSI file operations( fprintf
   etc) aren't called on files opened as UNICODE. We do this check only if it's
   an actual FILE pointer & not a string. It doesn't actually return immediately */
#define _VALIDATE_STREAM_ANSI_SETRET( stream, errorcode, retval, retexpr) \
    { \
        FILE *_Stream=stream; \
        int fn; \
        _VALIDATE_SETRET(( (_Stream->_flag & _IOSTRG) || \
                           ( fn = _fileno(_Stream), \
                             ( (_textmode_safe(fn) == __IOINFO_TM_ANSI) && \
                               !_tm_unicode_safe(fn)))), \
                         errorcode, retval, retexpr) \
    }
I.e., with UTF-8 mode Microsoft’s code enforces the standard’s prohibition of mixing wide and narrow character operations on the same stream.
What to do, when the law is suddenly enforced?
UTF-8 stream mode: the ugly (narrow streams) – FIXED
Given the Windows convention of Windows ANSI encoding for internal char-based data, in particular for string literals, it does not make much sense to support cin input operations. With the user typing e.g. “кошка” the program running on a Norwegian machine would just produce something like “?????” as the Windows ANSI-encoded result. I.e. such narrow character input operations would replace a generally preferable hard crash with rather ungood data loss.
On the other hand, especially for students, existing library components may be logging error messages via narrow character operations on the standard error stream, like cerr << "!oops" << endl;. To avoid crashes for that, up at the C++ level one can install new iostream buffers for cout, cerr and clog, where those buffers do something reasonable such as forwarding all output to the wide streams (thereby providing Windows ANSI → UTF-8 translation). Down at the C level there is, however, no other practical crash-fix option than to leave the standard error stream in ANSI (untranslated) mode, so that the functionality down at the C level is much more restricted.
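A minimal portable sketch of such a forwarding buffer follows. It is a simplification under a loud assumption: the narrow bytes are taken to be ISO 8859-1, so that each byte value is its own code point, where a real Windows implementation would convert from the Windows ANSI codepage instead. The names ForwardingBuffer and forwardedViaNarrowStream are hypothetical.

```cpp
#include <cassert>      // assert
#include <ostream>      // std::ostream, std::flush
#include <sstream>      // std::wostringstream (used by the demo function)
#include <streambuf>    // std::basic_streambuf
#include <string>       // std::string, std::wstring

class ForwardingBuffer
    : public std::basic_streambuf< char >
{
private:
    std::basic_streambuf< wchar_t >* pWideBuffer_;

protected:
    virtual std::streamsize xsputn( char const* const s, std::streamsize const n )
    {
        // ASSUMPTION: input is ISO 8859-1, so widening is a plain value copy.
        std::wstring wide;
        for( std::streamsize i = 0; i < n; ++i )
        {
            wide += static_cast<wchar_t>( static_cast<unsigned char>( s[i] ) );
        }
        // One wide character per narrow byte, so the counts are comparable.
        return pWideBuffer_->sputn( wide.data(), n );
    }

    virtual int_type overflow( int_type const c )
    {
        if( traits_type::eq_int_type( c, traits_type::eof() ) )
        {
            return traits_type::not_eof( c );
        }
        char const ch = traits_type::to_char_type( c );
        return (xsputn( &ch, 1 ) == 1? c : traits_type::eof());
    }

    virtual int sync() { return pWideBuffer_->pubsync(); }

public:
    explicit ForwardingBuffer( std::basic_streambuf< wchar_t >* const pWideBuffer )
        : pWideBuffer_( pWideBuffer )
    {
        assert( pWideBuffer != 0 );
    }
};

// Demo: text written to a narrow stream ends up in a wide string stream.
std::wstring forwardedViaNarrowStream( std::string const& latin1Text )
{
    std::wostringstream wideTarget;
    ForwardingBuffer forwarder( wideTarget.rdbuf() );
    std::ostream narrowStream( &forwarder );
    narrowStream << latin1Text << std::flush;
    return wideTarget.str();
}
```

The same pattern applies when the target is the buffer of a wide standard stream, installed on the narrow stream with rdbuf.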
Back up at the C++ level again, this restriction for stderr can be compensated by adding UTF-8 conversion to the wide streams that write to the standard error stream, namely wcerr and wclog. One can and should also add custom direct console output support (more about that below). I.e., one is then essentially emulating the UTF-8 mode for wcerr and wclog, but only for operations at the C++ level.
This means that while the C++ level will/can work as expected, narrow character operations on stderr will only be guaranteed to produce the intended output when they’re limited to the ASCII character repertoire, and narrow character operations on stdin or stdout will crash.
Happily it’s very very easy to recognize a crash and/or garbage output, so that breaches of the programming conventions, …
- use wcin for input,
- don’t use C FILE level narrow character operations on stdout, and
- don’t use non-ASCII characters in narrow character operations on stderr
will likely be caught during testing.
Let’s start at the end of the internal processing, with the UTF-8 encoding support for wcerr.
With Visual C++ 10.0 the UTF-8 encoding itself is almost trivial, first of all because Visual C++ 10.0 supports the C++11 codecvt_utf8 facet, and secondly because Visual C++’s standard streams support the codecvt facet (the standard only requires file streams to support it). But as you can see below, the console window result is not perfect!
#include <codecvt>      // std::codecvt_utf8 (C++11)
#include <iostream>     // std::wcerr
#include <locale>       // std::locale
#include <memory>       // std::unique_ptr (C++11)
using namespace std;

void setUtf8Conversion( wostream& stream )
{
    typedef codecvt_utf8< wchar_t > CodeCvt;

    unique_ptr< CodeCvt > pCodeCvt( new CodeCvt );
    locale utf8Locale( locale(), pCodeCvt.get() );
    pCodeCvt.release();
    stream.imbue( utf8Locale );
}

int main()
{
    wcerr << "'blåbærsyltetøy' with default presentation." << endl;
    setUtf8Conversion( wcerr );
    wcerr << "'blåbærsyltetøy' with UTF-8 conversion applied." << endl;
}
Testing this:
W:\examples> chcp 1252
Active code page: 1252
W:\examples> cl direct_display_troubles.cpp /Fe"x"
direct_display_troubles.cpp
W:\examples> x
'blåbærsyltetøy' with default presentation.
'blÃ¥bærsyltetøy' with UTF-8 conversion applied.
W:\examples> chcp 65001
Active code page: 65001
W:\examples> x
'bl�b�rsyltet�y' with default presentation.
'bl��b��rsyltet��y' with UTF-8 conversion applied.
W:\examples> (x 2>&1) >result
W:\examples> type result
'bl�b�rsyltet�y' with default presentation.
'blåbærsyltetøy' with UTF-8 conversion applied.
W:\examples> _
As you can see, the Visual C++ runtime library’s direct console output kicks in even without having set UTF-8 mode, and in some inexplicable way presents the resulting UTF-8 bytes – as garbage.
I have no idea how that happens.
However, one cure is to override it with a custom direct console output:
#ifdef _MSC_VER
#   pragma warning( disable: 4373 )     // "Your override overrides"
#endif

#include <assert.h>     // assert
#include <codecvt>      // std::codecvt_utf8 (C++11)
#include <iostream>     // std::wcerr
#include <locale>       // std::locale
#include <memory>       // std::unique_ptr (C++11)
#include <streambuf>    // std::basic_streambuf
using namespace std;

#undef UNICODE
#define UNICODE
#undef STRICT
#define STRICT
#include <windows.h>    // GetStdHandle, GetConsoleMode, WriteConsole

template< class CharType >
class AbstractOutputBuffer
    : public basic_streambuf< CharType >
{
public:
    typedef basic_streambuf< CharType > Base;
    typedef typename Base::traits_type Traits;
    typedef Base StreamBuffer;

protected:
    virtual streamsize xsputn( char_type const* const s, streamsize const n ) = 0;

    virtual int_type overflow( int_type const c )
    {
        bool const cIsEOF = Traits::eq_int_type( c, Traits::eof() );
        int_type const failureValue = Traits::eof();
        int_type const successValue = (cIsEOF? Traits::not_eof( c ) : c);

        if( !cIsEOF )
        {
            char_type const ch = Traits::to_char_type( c );
            streamsize const nCharactersWritten = xsputn( &ch, 1 );

            return (nCharactersWritten == 1? successValue : failureValue);
        }
        return successValue;
    }

public:
    AbstractOutputBuffer()
    {}

    AbstractOutputBuffer( StreamBuffer& existingBuffer )
        : Base( existingBuffer )
    {}
};

class DirectOutputBuffer
    : public AbstractOutputBuffer< wchar_t >
{
public:
    enum StreamId { outputStreamId, errorStreamId, logStreamId };

private:
    StreamId streamId_;

protected:
    virtual streamsize xsputn( wchar_t const* const s, streamsize const n )
    {
        static HANDLE const outputStreamHandle = GetStdHandle( STD_OUTPUT_HANDLE );
        static HANDLE const errorStreamHandle = GetStdHandle( STD_ERROR_HANDLE );

        HANDLE const streamHandle =
            (streamId_ == outputStreamId? outputStreamHandle : errorStreamHandle );

        DWORD nCharactersWritten = 0;
        bool writeSucceeded = !!WriteConsole(
            streamHandle, s, static_cast< DWORD >( n ), &nCharactersWritten, 0
            );
        return (writeSucceeded? static_cast< streamsize >( nCharactersWritten ) : 0);
    }

public:
    DirectOutputBuffer( StreamId streamId = outputStreamId )
        : streamId_( streamId )
    {}
};

void setUtf8Conversion( wostream& stream )
{
    typedef codecvt_utf8< wchar_t > CodeCvt;

    unique_ptr< CodeCvt > pCodeCvt( new CodeCvt );
    locale utf8Locale( locale(), pCodeCvt.get() );
    pCodeCvt.release();
    stream.imbue( utf8Locale );
}

bool isConsole( HANDLE streamHandle )
{
    DWORD consoleMode;
    return !!GetConsoleMode( streamHandle, &consoleMode );
}

bool isConsole( DWORD stdStreamId )
{
    return isConsole( GetStdHandle( stdStreamId ) );
}

void setDirectOutputSupport( wostream& stream )
{
    typedef DirectOutputBuffer DOB;

    if( &stream == &wcout )
    {
        if( isConsole( STD_OUTPUT_HANDLE ) )
        {
            static DOB outputStreamBuffer( DOB::outputStreamId );
            stream.rdbuf( &outputStreamBuffer );
        }
    }
    else if( &stream == &wcerr )
    {
        if( isConsole( STD_ERROR_HANDLE ) )
        {
            static DOB errorStreamBuffer( DOB::errorStreamId );
            stream.rdbuf( &errorStreamBuffer );
        }
    }
    else if( &stream == &wclog )
    {
        if( isConsole( STD_ERROR_HANDLE ) )
        {
            static DOB logStreamBuffer( DOB::logStreamId );
            stream.rdbuf( &logStreamBuffer );
        }
    }
    else
    {
        assert(( "setDirectOutputSupport: unsupported stream", false ));
    }
}

int main()
{
    wcerr << "'blåbærsyltetøy' with default presentation." << endl;
    setUtf8Conversion( wcerr );
    wcerr << "'blåbærsyltetøy' with UTF-8 conversion applied." << endl;
    setDirectOutputSupport( wcerr );
    wcerr << "'blåbærsyltetøy' with UTF-8 conversion & direct output applied." << endl;
}
Testing this:
W:\examples> chcp 1252
Active code page: 1252
W:\examples> cl direct_display_troubles.fixed.cpp /Fe"x"
direct_display_troubles.fixed.cpp
W:\examples> x
'blåbærsyltetøy' with default presentation.
'blÃ¥bærsyltetøy' with UTF-8 conversion applied.
'blåbærsyltetøy' with UTF-8 conversion & direct output applied.
W:\examples> chcp 65001
Active code page: 65001
W:\examples> x
'bl�b�rsyltet�y' with default presentation.
'bl��b��rsyltet��y' with UTF-8 conversion applied.
'blåbærsyltetøy' with UTF-8 conversion & direct output applied.
W:\examples> (x 2>&1) >result
W:\examples> type result
'bl�b�rsyltet�y' with default presentation.
'blåbærsyltetøy' with UTF-8 conversion applied.
'blåbærsyltetøy' with UTF-8 conversion & direct output applied.
W:\examples> _
Yay! Now we’re ready for tackling the previous section’s program. Again, the main function is exactly as before:
#ifdef  _MSC_VER
#   pragma warning( disable: 4373 )     // "Your override overrides"
#endif

#include <assert.h>         // assert
#include <codecvt>          // std::codecvt_utf8 (C++11)
#include <stdexcept>        // std::exception
#include <stdlib.h>         // EXIT_SUCCESS, EXIT_FAILURE
#include <streambuf>        // std::basic_streambuf
#include <string>           // wstring
#include <iostream>         // std::cout, std::endl
#include <locale>           // std::locale
#include <memory>           // std::unique_ptr (C++11)
using namespace std;

#include <io.h>             // _setmode, _fileno
#include <fcntl.h>          // _O_U8TEXT

#undef  UNICODE
#define UNICODE
#undef  STRICT
#define STRICT
#include <windows.h>        // MultiByteToWideChar

template< class CharType >
class AbstractOutputBuffer
    : public basic_streambuf< CharType >
{
public:
    typedef basic_streambuf< CharType >     Base;
    typedef typename Base::traits_type      Traits;
    typedef Base                            StreamBuffer;

protected:
    virtual streamsize xsputn( char_type const* const s, streamsize const n ) = 0;

    virtual int_type overflow( int_type const c )
    {
        bool const      cIsEOF          = Traits::eq_int_type( c, Traits::eof() );
        int_type const  failureValue    = Traits::eof();
        int_type const  successValue    = (cIsEOF? Traits::not_eof( c ) : c);

        if( !cIsEOF )
        {
            char_type const     ch                  = Traits::to_char_type( c );
            streamsize const    nCharactersWritten  = xsputn( &ch, 1 );

            return (nCharactersWritten == 1? successValue : failureValue);
        }
        return successValue;
    }

public:
    AbstractOutputBuffer()
    {}

    AbstractOutputBuffer( StreamBuffer& existingBuffer )
        : Base( existingBuffer )
    {}
};

class OutputForwarderBuffer
    : public AbstractOutputBuffer< char >
{
public:
    typedef AbstractOutputBuffer< char >    Base;
    typedef Base::Traits                    Traits;
    typedef Base::StreamBuffer              StreamBuffer;
    typedef basic_streambuf<wchar_t>        WideStreamBuffer;

private:
    WideStreamBuffer*       pWideStreamBuffer_;
    wstring                 wideCharBuffer_;

    OutputForwarderBuffer( OutputForwarderBuffer const& );      // No such.
    void operator=( OutputForwarderBuffer const& );             // No such.

protected:
    virtual streamsize xsputn( char const* const s, streamsize const n )
    {
        if( n == 0 ) { return 0; }

        int const   nAsInt  = static_cast<int>( n );    // Visual C++ sillywarnings.
        wideCharBuffer_.resize( nAsInt );
        int const nWideCharacters = MultiByteToWideChar(
            CP_ACP,             // Windows ANSI
            MB_PRECOMPOSED,     // Always precompose characters (this is the default).
            s,                  // Narrow character string.
            nAsInt,             // Number of bytes in narrow character string.
            &wideCharBuffer_[0],
            nAsInt              // Wide char buffer size.
            );
        assert( nWideCharacters > 0 );
        return pWideStreamBuffer_->sputn( &wideCharBuffer_[0], nWideCharacters );
    }

public:
    OutputForwarderBuffer(
        StreamBuffer&       existingBuffer,
        WideStreamBuffer*   pWideStreamBuffer
        )
        : Base( existingBuffer )
        , pWideStreamBuffer_( pWideStreamBuffer )
    {}
};

class DirectOutputBuffer
    : public AbstractOutputBuffer< wchar_t >
{
public:
    enum StreamId { outputStreamId, errorStreamId, logStreamId };

private:
    StreamId    streamId_;

protected:
    virtual streamsize xsputn( wchar_t const* const s, streamsize const n )
    {
        static HANDLE const outputStreamHandle  = GetStdHandle( STD_OUTPUT_HANDLE );
        static HANDLE const errorStreamHandle   = GetStdHandle( STD_ERROR_HANDLE );

        HANDLE const    streamHandle    =
            (streamId_ == outputStreamId? outputStreamHandle : errorStreamHandle );

        DWORD nCharactersWritten = 0;
        bool writeSucceeded = !!WriteConsole(
            streamHandle, s, static_cast< DWORD >( n ), &nCharactersWritten, 0
            );
        return (writeSucceeded? static_cast< streamsize >( nCharactersWritten ) : 0);
    }

public:
    DirectOutputBuffer( StreamId streamId = outputStreamId )
        : streamId_( streamId )
    {}
};

void setUtf8Conversion( wostream& stream )
{
    typedef codecvt_utf8< wchar_t >     CodeCvt;

    unique_ptr< CodeCvt > pCodeCvt( new CodeCvt );
    locale  utf8Locale( locale(), pCodeCvt.get() );
    pCodeCvt.release();
    stream.imbue( utf8Locale );
}

bool isConsole( HANDLE streamHandle )
{
    DWORD consoleMode;
    return !!GetConsoleMode( streamHandle, &consoleMode );
}

bool isConsole( DWORD stdStreamId )
{
    return isConsole( GetStdHandle( stdStreamId ) );
}

void setDirectOutputSupport( wostream& stream )
{
    typedef DirectOutputBuffer  DOB;

    if( &stream == &wcout )
    {
        if( isConsole( STD_OUTPUT_HANDLE ) )
        {
            static DOB  outputStreamBuffer( DOB::outputStreamId );
            stream.rdbuf( &outputStreamBuffer );
        }
    }
    else if( &stream == &wcerr )
    {
        if( isConsole( STD_ERROR_HANDLE ) )
        {
            static DOB  errorStreamBuffer( DOB::errorStreamId );
            stream.rdbuf( &errorStreamBuffer );
        }
    }
    else if( &stream == &wclog )
    {
        if( isConsole( STD_ERROR_HANDLE ) )
        {
            static DOB  logStreamBuffer( DOB::logStreamId );
            stream.rdbuf( &logStreamBuffer );
        }
    }
    else
    {
        assert(( "setDirectOutputSupport: unsupported stream", false ));
    }
}

void initStreams()
{
    // Set up UTF-8 conversions & direct console output:
    _setmode( _fileno( stdin ), _O_U8TEXT );
    _setmode( _fileno( stdout ), _O_U8TEXT );
    setUtf8Conversion( wcerr );  setDirectOutputSupport( wcerr );
    setUtf8Conversion( wclog );  setDirectOutputSupport( wclog );

    // Forward narrow character output to the wide streams:
    static OutputForwarderBuffer    coutBuffer( *cout.rdbuf(), wcout.rdbuf() );
    static OutputForwarderBuffer    cerrBuffer( *cerr.rdbuf(), wcerr.rdbuf() );
    static OutputForwarderBuffer    clogBuffer( *clog.rdbuf(), wclog.rdbuf() );

    cout.rdbuf( &coutBuffer );
    cerr.rdbuf( &cerrBuffer );
    clog.rdbuf( &clogBuffer );
}

int main()
{
    try
    {
        initStreams();

        cout << "Hello, world!" << endl;
        cout << "Did you know, most Norwegians like 'blåbærsyltetøy'?" << endl;
        cerr << ":This is an error message, also mentioning 'blåbærsyltetøy'." << endl;
        return EXIT_SUCCESS;
    }
    catch( exception const& x )
    {
        cerr << "!" << x.what() << endl;
    }
    catch( ... )
    {
        cerr << "!Unknown exception." << endl;
    }
    return EXIT_FAILURE;
}
Testing this:
W:\examples> chcp 1252
Active code page: 1252

W:\examples> cl utf8_mode.msvc.byte_stream.fixed.cpp /Fe"x"
utf8_mode.msvc.byte_stream.fixed.cpp

W:\examples> x
Hello, world!
Did you know, most Norwegians like 'blåbærsyltetøy'?
:This is an error message, also mentioning 'blåbærsyltetøy'.

W:\examples> chcp 65001
Active code page: 65001

W:\examples> x
Hello, world!
Did you know, most Norwegians like 'blåbærsyltetøy'?
:This is an error message, also mentioning 'blåbærsyltetøy'.

W:\examples> (x 2>&1) >result

W:\examples> type result
Hello, world!
Did you know, most Norwegians like 'blåbærsyltetøy'?
:This is an error message, also mentioning 'blåbærsyltetøy'.

W:\examples> _
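The forwarding idea behind OutputForwarderBuffer can also be tried without the Windows API. What follows is only a minimal sketch under stated assumptions, not the program above: the names Latin1ForwarderBuffer and forwardedText are hypothetical, and the Windows ANSI conversion done by MultiByteToWideChar is swapped for a plain Latin-1 widening (byte value equals code point), which agrees with codepage 1252 only for bytes outside the range 0x80 through 0x9F.

```cpp
#include <cassert>      // assert
#include <cstddef>      // std::size_t
#include <ostream>      // std::ostream, std::flush
#include <sstream>      // std::wostringstream
#include <streambuf>    // std::streambuf, std::wstreambuf
#include <string>       // std::string, std::wstring

// Hypothetical portable stand-in for the article's OutputForwarderBuffer.
// ASSUMPTION: narrow text is Latin-1, so widening is a per-byte cast;
// the article's code calls MultiByteToWideChar( CP_ACP, ... ) instead.
class Latin1ForwarderBuffer
    : public std::streambuf
{
private:
    std::wstreambuf*    pWide_;     // Destination wide stream buffer (not owned).

protected:
    // Widens the n narrow characters and passes them on to the wide buffer.
    std::streamsize xsputn( char const* s, std::streamsize n ) override
    {
        std::wstring wide;
        wide.reserve( static_cast<std::size_t>( n ) );
        for( std::streamsize i = 0; i < n; ++i )
        {
            // Go via unsigned char so that bytes >= 0x80 do not sign-extend.
            wide += static_cast<wchar_t>( static_cast<unsigned char>( s[i] ) );
        }
        return pWide_->sputn( wide.data(), static_cast<std::streamsize>( wide.size() ) );
    }

    // Single-character output is routed through xsputn, as in the article.
    int_type overflow( int_type c ) override
    {
        if( traits_type::eq_int_type( c, traits_type::eof() ) )
        {
            return traits_type::not_eof( c );
        }
        char const ch = traits_type::to_char_type( c );
        return (xsputn( &ch, 1 ) == 1? c : traits_type::eof());
    }

    // A flush of the narrow stream flushes the wide destination.
    int sync() override { return pWide_->pubsync(); }

public:
    explicit Latin1ForwarderBuffer( std::wstreambuf* pWide )
        : pWide_( pWide )
    {
        assert( pWide_ != nullptr );
    }
};

// Writes narrowText to a narrow stream whose buffer forwards to a wide
// string stream, and returns what arrived on the wide side.
std::wstring forwardedText( std::string const& narrowText )
{
    std::wostringstream     wideSink;
    Latin1ForwarderBuffer   forwarder( wideSink.rdbuf() );
    std::ostream            narrowStream( &forwarder );

    narrowStream << narrowText << std::flush;
    return wideSink.str();
}
```

With this in place, forwardedText( "bl\xE5" ) yields the three wide characters b, l and å. In the real program the destination is the buffer of wcout, wcerr or wclog rather than a string stream.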
Summary
To make the Visual C++ _O_U8TEXT UTF-8 stream mode work in general, we had to
- fix the input-from-console functionality by adding a special direct console input buffer to the C++ level wcin stream,
- avoid crashes for narrow character operations on the standard error stream by keeping that stream in ANSI mode, installing forwarder buffers for cout and cerr, installing a UTF-8 conversion facet in wcerr, and adding direct console output support to wcerr, and
- forsake the use of input and narrow character operations at the C level, except for the standard error stream.
With all this bug-fixing and support machinery added, it’s almost as if we had to implement the _O_U8TEXT mode from scratch! What does it really buy, then? What is the point of using that mode?
Well, essentially the _O_U8TEXT UTF-8 mode gives conversion to UTF-8 for C level wide character output operations on the standard output stream. It does not do interactive input, and it cannot reasonably be used for the standard error stream. One might therefore say that Microsoft Unicode guru Michael Kaplan’s original blog posting about this mode, where it appeared to be a simple general solution on its own, was a bit too optimistic!
Anyway, to be utterly clear, with this approach one has three main text encodings to deal with in Windows, and three main text encodings to deal with in a typical Linux installation:
| | Internal wchar_t data | Internal char data | External data |
|---|---|---|---|
| Windows | UTF-16 | Windows ANSI | UTF-8 |
| Linux | Usually UTF-32 but also UTF-16 | UTF-8 | UTF-8 |
And the main forces in play yielding the Windows row of the above table, are that …
- the core Windows API is UTF-16 based, and wchar_t is 16 bits in Windows,
- Visual C++ has Windows ANSI as its C++ execution character set, which e.g. forces ordinary narrow character literals to Windows ANSI encoding even when the source code is UTF-8, and
- one desires UTF-8 for external data both to avoid data loss and for general interoperability.
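To make the relationship between internal UTF-16 and external UTF-8 concrete, here is a small hand-written sketch of the conversion for characters in the Basic Multilingual Plane. It is an illustration only: the function name utf8FromBmp is hypothetical, it is not what the runtime library or codecvt_utf8 does internally, and surrogate pairs are deliberately not handled.

```cpp
#include <string>       // std::string, std::u16string

// Hypothetical helper (not from the article): encodes UTF-16 code units as UTF-8.
// ASSUMPTION: every code unit is a complete BMP character, i.e. there are no
// surrogate pairs in the input.
std::string utf8FromBmp( std::u16string const& s )
{
    std::string result;
    for( char16_t const unit : s )
    {
        unsigned const c = unit;
        if( c < 0x80 )
        {
            // ASCII: one byte, unchanged.
            result += static_cast<char>( c );
        }
        else if( c < 0x800 )
        {
            // Two bytes: 110xxxxx 10xxxxxx.
            result += static_cast<char>( 0xC0 | (c >> 6) );
            result += static_cast<char>( 0x80 | (c & 0x3F) );
        }
        else
        {
            // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx.
            result += static_cast<char>( 0xE0 | (c >> 12) );
            result += static_cast<char>( 0x80 | ((c >> 6) & 0x3F) );
            result += static_cast<char>( 0x80 | (c & 0x3F) );
        }
    }
    return result;
}
```

For the 14-character word blåbærsyltetøy the result is 17 bytes, since each of å, æ and ø takes two bytes in UTF-8. As Windows ANSI (codepage 1252) narrow text the same word would be 14 bytes, which is one reason the internal char data and the external data cannot simply be treated as the same thing in Windows.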
There is not much that can be done about these forces, so the basic approach to deal with these issues is either very similar to what I have described here, or will otherwise involve pretty painful trade-offs.
Not to say, of course, that the trade-offs in this posting’s approach aren’t a little painful! 🙂
Cheers, & enjoy!