[cppx] Characters & nasal daemons…

Consider this innocent-looking code:

#include <iostream>
#include <ctype.h>      // isalpha
#include <locale.h>     // setlocale

int main( int argc, char* argv[] )
{
    char const* progname = argv[0];
    setlocale( LC_ALL, "" );        // User's default locale.
    if( isalpha( progname[0] ) )    // Uh...
    {
        std::cout << "My name starts with a letter!" << std::endl;
    }
}

Running the program from a command interpreter, as file [a.exe] it reports that its name starts with a letter, OK, check. As file [7.exe] it doesn’t say anything, OK, check. But as file [å.exe] (‘å’ is the last letter in the Norwegian alphabet) it also says nothing, even on my Norwegian PC!

What’s wrong?

Well, there’s a great many things wrong, including that in Windows the program may receive a full path, instead of just a filename, as argv[0] 🙂 But the main thing here is the use of the isalpha function from the C library, because that function requires an argument value that is representable as unsigned char (and hence non-negative), or else the special value EOF. And with an ordinary 8-bit ASCII-based character set and a signed char type (whether plain char is signed is implementation-defined, but signed is the default with the common Windows compilers) 'å' cannot be in the range 0 through 127, since there’s no such character in ASCII, so as a char value it must be negative…

And the result of that is Undefined Behavior, causing nasal daemons to appear at the most embarrassing moment!
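To make the sign problem concrete, here is a small sketch. It assumes a Latin-1 or Windows-1252 encoding, where ‘å’ is the single byte 0xE5, and 8-bit two’s complement chars; the helper names are hypothetical and not part of cppx:

```cpp
// Hypothetical helpers, for illustration only (not part of cppx).

// The int value a byte gets when it is stored in a signed 8-bit char.
inline int valueAsSignedChar( unsigned char const byte )
{
    return static_cast<signed char>( byte );
}

// The int value that is safe to pass to isalpha and friends.
inline int valueAsByte( char const ch )
{
    return static_cast<unsigned char>( ch );
}
```

Under those assumptions valueAsSignedChar( 0xE5 ) is -27, which is not representable as unsigned char, while valueAsByte( '\xE5' ) is 229, which is.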

The proper thing to do is to cast that character to unsigned char before passing it to isalpha. More generally, if you have a Byte type then it’s probably defined as unsigned char, and indicates more clearly what’s going on. And so for this kind of usage, in the cppx library I define types Byte and SignedByte as follows, …

In [progrock/cppx/primitive_types.h]:

namespace progrock{ namespace cppx{

    typedef unsigned char       Byte;
    typedef signed char         SignedByte;

    // More stuff here …

} }  // namespace progrock::cppx

… as well as do-it-once-and-be-over-with-it wrappers for the C library character classification functions, …

In [progrock/cppx/text/char_util.h]:

namespace progrock{ namespace cppx{

    inline bool isAlphaNum( char const ch )             { return !!isalnum( Byte( ch ) ); }
    inline bool isControl( char const ch )              { return !!iscntrl( Byte( ch ) ); }
    inline bool isAlpha( char const ch )                { return !!isalpha( Byte( ch ) ); }
    inline bool isDigit( char const ch )                { return !!isdigit( Byte( ch ) ); }
    inline bool isGraphic( char const ch   )            { return !!isgraph( Byte( ch ) ); }
    inline bool isLowercase( char const ch )            { return !!islower( Byte( ch ) ); }
    inline bool isGraphicOrSpace( char const ch )       { return !!isprint( Byte( ch ) ); }
    inline bool isPunctuation( char const ch )          { return !!ispunct( Byte( ch ) ); }
    inline bool isWhitespace( char const ch )           { return !!isspace( Byte( ch ) ); }
    inline bool isUppercase( char const ch )            { return !!isupper( Byte( ch ) ); }
    inline bool isHexDigit( char const ch )             { return !!isxdigit( Byte( ch ) ); }

    inline bool isAlphaNum( wchar_t const ch )          { return !!iswalnum( ch ); }
    inline bool isControl( wchar_t const ch )           { return !!iswcntrl( ch ); }
    inline bool isAlpha( wchar_t const ch )             { return !!iswalpha( ch ); }
    inline bool isDigit( wchar_t const ch )             { return !!iswdigit( ch ); }
    inline bool isGraphic( wchar_t const ch   )         { return !!iswgraph( ch ); }
    inline bool isLowercase( wchar_t const ch )         { return !!iswlower( ch ); }
    inline bool isGraphicOrSpace( wchar_t const ch )    { return !!iswprint( ch ); }
    inline bool isPunctuation( wchar_t const ch )       { return !!iswpunct( ch ); }
    inline bool isWhitespace( wchar_t const ch )        { return !!iswspace( ch ); }
    inline bool isUppercase( wchar_t const ch )         { return !!iswupper( ch ); }
    inline bool isHexDigit( wchar_t const ch )          { return !!iswxdigit( ch ); }

} }  // namespace progrock::cppx

In C such wrappers would have the problem of not handling the EOF constant properly (it’s intentionally a value that cannot be represented as char). But in C++ I’ve needed and written a wrapper like isAlpha above numerous times and have never needed EOF-value handling. So I think that these functions are Good, what the doctor ordered :-), and that if EOF needs to be handled some place then it’s no big deal to do that: it’s easy to add complexity on top of simplicity, but difficult to simplify the complex!
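As a rough sketch of what “adding EOF handling on top” could look like, the following layers an int-taking function over a char wrapper. The name isAlphaCode is hypothetical, and isAlpha is repeated here (with a static_cast instead of the Byte typedef) only to keep the example self-contained:

```cpp
#include <cctype>   // std::isalpha
#include <cstdio>   // EOF

// Same idea as the isAlpha wrapper above: convert to unsigned char first.
inline bool isAlpha( char const ch )
{
    return std::isalpha( static_cast<unsigned char>( ch ) ) != 0;
}

// Hypothetical addition: accepts the int result of e.g. std::getchar(),
// which is either EOF or a value in the unsigned char range.
inline bool isAlphaCode( int const code )
{
    if( code == EOF ) { return false; }             // EOF is never a letter.
    return isAlpha( static_cast<char>( code ) );
}
```

The simple wrapper stays as it is; only the code that actually reads from a stream needs the second function.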


2 comments on “[cppx] Characters & nasal daemons…”

  1. A few questions and misc things that occurred while (and after)
    reading 🙂

    – Getting the “full path” at argv[ 0 ] isn’t Windows-specific or
    non-conforming.

    Although C99 is much clearer on this issue, C++03 doesn’t give
    any meaningful requirement so pretty much anything argv[ 0 ]
    will be pointing to is fine (to the standard, that is :-)).

    No idea why the new C++ standard hasn’t been sync’d (didn’t
    they want to do so?). And the difference, here, isn’t just about
    the wording: for instance, C99 has a documentation requirement
    if argc > 0.

    With regards to the wrapper functions:

    – Why aren’t you using the <locale> function templates?
    Note that you could declare your own functions, with just one
    parameter, and, in the implementation file, forward the work
    to those. This would allow

    – avoiding the inclusion of <locale> in every TU using
    the functions

    – have, for each of the isxyz()’s, a one-argument version
    that <locale> lacks completely (more on this later).
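A minimal sketch of that forwarding idea, shown in one piece for brevity. The namespace and function names are hypothetical; in a real split the two declarations would go in a header and the definitions (with the <locale> include) in an implementation file:

```cpp
#include <locale>   // std::isalpha and std::isspace templates, std::locale

namespace example {     // Hypothetical namespace, for illustration only.

    // Header part: one-argument declarations, no <locale> needed by callers.
    bool isAlpha( char ch );
    bool isWhitespace( char ch );

    // Implementation part: forward to the two-argument <locale> templates,
    // using a copy of the current global C++ locale.
    bool isAlpha( char const ch )
    {
        return std::isalpha( ch, std::locale() );
    }

    bool isWhitespace( char const ch )
    {
        return std::isspace( ch, std::locale() );
    }

}  // namespace example
```

Note that the <locale> templates take the character type directly, so no cast to unsigned char is needed at the call site.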

    – I’m not convinced that in this case the Byte typedef indicates
    more clearly what’s going on. After all, we are conceptually
    working with characters, so a “literal” unsigned char might be
    clearer (slightly: the crux of the matter is knowing that we
    are trying to work around the gotcha about the isxxx()
    functions; once you know that… Thinking about it,
    conditionally subtracting CHAR_MIN might arguably be the
    clearest way, since it is not “language-oriented”, just a
    plain arithmetic transformation one would also use outside of
    computing.)

    – The “name manoeuvring” in “GraphicOrSpace” is interesting. It
    might be debatable, but once you are no longer using the
    standard names (whose meaning is known, whether intuitive or
    not with regards to whether a space is “graphic” or
    “printing”) it makes sense to clarify. I’m not sure I like it
    (and generally speaking I don’t like “ThisOrThat” names) but
    it shows care. And that’s good to see.

    Speaking of names… the functions in <locale> have no
    one-argument version (overload; or default-argument for the
    second parameter) which would instead seem quite natural to
    me. E.g.:


    // in namespace std, of course...
    template< typename Ch >
    bool isspace( Ch c, locale const & = locale() ) ;

    I don’t see any reason for this absence other than C
    compatibility. But that’s only because they used the same
    names as C does! Why didn’t they just call them differently?

    – Why const parameters? (Hmm… a short bullet, once in a while
    :-))

    – You might want to mention in the prose that the wchar_t case
    is different and doesn’t need the watch-for-negatives
    workaround?

    – I have never investigated further but I’ve seen incoherent
    results between the one-argument iswxyz()’s (the “C heritage”
    versions) and their <locale> counterparts. At a quick
    test with a Norwegian locale and the codepoints in [0, 255] I
    got four mismatches:

    code point   mismatch on
    ------------------------------------------
    133          iswspace
    170          iswalnum, iswalpha, iswlower
    181          iswalnum, iswalpha, iswlower
    186          iswalnum, iswalpha, iswlower

    The mismatches were all “one way”: true for C++, zero for C.
    And nothing changed (same table, “same one way”) with three or
    four different locales. Oh well.
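A comparison of this kind could be sketched roughly as follows. This is not the test that was actually run; the function name is hypothetical and only the isalpha pair is checked. The C function consults the global C locale (as set by setlocale), while the <locale> template consults the locale object passed in:

```cpp
#include <cwctype>  // std::iswalpha, std::wint_t
#include <locale>   // std::isalpha template, std::locale
#include <vector>

// Hypothetical helper: the code points in [0, 255] for which the C function
// iswalpha and the <locale> template std::isalpha disagree.
inline std::vector<int> alphaMismatches( std::locale const& loc )
{
    std::vector<int> result;
    for( int code = 0; code <= 255; ++code )
    {
        wchar_t const ch = static_cast<wchar_t>( code );
        bool const cSays   = std::iswalpha( static_cast<std::wint_t>( ch ) ) != 0;
        bool const cppSays = std::isalpha( ch, loc );
        if( cSays != cppSays ) { result.push_back( code ); }
    }
    return result;
}
```

Which code points show up, if any, depends on the implementation and on the two locales in effect.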

    – The train of function definitions exhibits the kind of code
    duplication that I’ve never found a good way to avoid.

    A similar example is here:

    <http://breeze.svn.sourceforge.net/viewvc/breeze/trunk/breeze/meta/maximum.hpp?view=log>

    [NOTE: a rant is starting and I don’t know when it will be
    over :-)]

    One could use an external tool to generate them (nothing
    fancy, in this case; your command line interpreter, for
    instance) and integrate it in the build system.

    But I can’t say it’s a solution that I like for such a
    case.

    Or a “local macro”:


    #define CPPX_define_classification_functions( wrapper, backend ) ...

    CPPX_define_classification_functions( isAlphaNum, isalnum )
    CPPX_define_classification_functions( isControl, iscntrl )
    ...

    #undef CPPX_define_classification_functions

    (the plural, “functions”, in the name is because the macro
    definition would take care of both char and wchar_t overloads
    (and possibly more in the future?)).
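One possible shape for such a macro body, purely as a sketch. It deviates from the invocations shown above in one respect, which is an assumption made here: the second argument is the C name suffix (alnum, cntrl, …) rather than the full function name, so that both the is… and the isw… name can be formed by token pasting:

```cpp
#include <cctype>   // std::isalnum etc.
#include <cwctype>  // std::iswalnum etc., std::wint_t

// Assumption: `suffix` is the part of the C function name after "is",
// e.g. alnum for isalnum and iswalnum.
#define CPPX_define_classification_functions( wrapper, suffix )             \
    inline bool wrapper( char const ch )                                    \
    { return std::is##suffix( static_cast<unsigned char>( ch ) ) != 0; }    \
    inline bool wrapper( wchar_t const ch )                                 \
    { return std::isw##suffix( static_cast<std::wint_t>( ch ) ) != 0; }

CPPX_define_classification_functions( isAlphaNum, alnum )
CPPX_define_classification_functions( isControl, cntrl )
CPPX_define_classification_functions( isWhitespace, space )

#undef CPPX_define_classification_functions
```

Each invocation yields the char and wchar_t overloads together, and the macro name does not leak past the #undef.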

    Not something I’d expose at an art gallery, either.

    Of course, the source of duplication is the usage of different
    functions for the different classifications (alnum, xdigit,
    etc.). One can in general specify the classification by using
    an argument, instead. In fact, a function taking such an
    argument is usually already in the library implementation (so
    that all of the various isxxx can just forward the real work
    to it) but it’s an implementation detail.

    Well, this might do:


    bool
    is( std::ctype_base::mask m,
        char c,
        std::locale const & loc = std::locale() )
    {
        return std::use_facet< std::ctype< char > >( loc ).is( m, c ) ;
    }

    but if one has to use <locale> and doesn’t care about
    EOF/WEOF then, of course, why not use its function templates
    (either directly or writing the one-argument versions
    mentioned above).

    (Even if this wasn’t the case, who would want to write
    ctype_base::that_mask at every invocation… one should at
    least add some scaffolding to spare the user this torture 🙂
    —besides, IMHO, making it more robust against the chance
    that the implementation just uses an integral type for
    std::ctype_base::mask).

    A simple approach is using a wrapper function template; e.g.:


    template< int (*testing_function)( int ) >
    bool
    test( char c )
    {
        return testing_function( static_cast< unsigned char >( c ) ) != 0 ;
    }

    but, as it is, this doesn’t do much static checking… you can
    pass whatever function taking an int and returning an int you
    might have at hand 🙂 And I’m afraid (I haven’t actually
    tried) that making it robust will bring in just the same kind
    of code duplication we are trying to avoid. Thoughts?

  2. Hm, I guess I could write several blog postings just about the questions you raise!

    But regarding just why I’m not using the <locale> templates: they don’t work. He he. Additionally I hate the iostream stuff :-). It’s IMHO over-engineered, overly complex and lacking in functionality and reasonable defaults (and where else do you find two-phase initialization in the standard library?). But that would only have mattered if it did work with current compilers, which it doesn’t.

    Consider:

    #include <iostream>
    #include <locale>
    #include <stdlib.h>
    
    int main()
    {
        using namespace std;
        cout.imbue( locale( "" ) );
        cout << 3.14 << endl;
    }
    

    On my computer, Norwegian locale in Windows XP, with g++ (TDM-2 mingw32) 4.4.1 this produces "3.14" (English) while with Visual C++ it produces "3,14" (Norwegian).

    I think the C locale functionality got it mostly right, it's simple and effective. And it works. But one main problem is whether it's thread-safe (I don't know, but the C standard can't guarantee that, and I suspect that it's not thread safe in practice): I'd like the locale to be a thread local variable, sort of like a thread locale…

    Cheers,

    – Alf
