Consider this innocent-looking code:
#include <iostream>
#include <ctype.h>      // isalpha
#include <locale.h>     // setlocale

int main( int argc, char* argv[] )
{
    char const* progname = argv[0];

    setlocale( LC_ALL, "" );        // User's default locale.
    if( isalpha( progname[0] ) )    // Uh...
    {
        std::cout << "My name starts with a letter!" << std::endl;
    }
}
Running the program from a command interpreter, as file [a.exe] it reports that its name starts with a letter, OK, check. As file [7.exe] it doesn’t say anything, OK, check. But as file [å.exe] (‘å’ is the last letter in the Norwegian alphabet) it also says nothing, even on my Norwegian PC!
What’s wrong?
Well, there’s a great many things wrong, including that in Windows the program may receive a full path, instead of just a filename, as argv[0] 🙂 But the main thing here is the use of the isalpha function from the C library, because that function requires an argument value that is non-negative, or else the special value EOF. And with an ordinary 8-bit ASCII-based character set and a signed char type (which is the default with the common PC compilers, although the signedness of plain char is implementation-defined) 'å' cannot have a code in the range 0 through 127, since there’s no such character in ASCII, so its code is above 127 and the char value must be negative… For example, in Latin-1 and Windows-1252 'å' has code 229, which stored in a signed 8-bit char is -27.
And the result of that is Undefined Behavior, causing nasal daemons to appear at the most embarrassing moment!
The proper thing to do is to cast that character to unsigned char before passing it to isalpha. More generally, if you have a Byte type then it’s probably defined as unsigned char, and indicates more clearly what’s going on. And so for this kind of usage, in the cppx library I define types Byte and SignedByte as follows, …
In [progrock/cppx/primitive_types.h]:
namespace progrock{ namespace cppx{

    typedef unsigned char   Byte;
    typedef signed char     SignedByte;

    // More stuff here …

} }  // namespace progrock::cppx
… as well as do-it-once-and-be-over-with-it wrappers for the C library character classification functions, …
In [progrock/cppx/text/char_util.h]:
#include <ctype.h>      // isalnum, isalpha, ...
#include <wctype.h>     // iswalnum, iswalpha, ...
#include <progrock/cppx/primitive_types.h>      // Byte

namespace progrock{ namespace cppx{

    // Narrow characters: cast to Byte so the C functions never see a negative value.
    inline bool isAlphaNum( char const ch )         { return !!isalnum( Byte( ch ) ); }
    inline bool isControl( char const ch )          { return !!iscntrl( Byte( ch ) ); }
    inline bool isAlpha( char const ch )            { return !!isalpha( Byte( ch ) ); }
    inline bool isDigit( char const ch )            { return !!isdigit( Byte( ch ) ); }
    inline bool isGraphic( char const ch )          { return !!isgraph( Byte( ch ) ); }
    inline bool isLowercase( char const ch )        { return !!islower( Byte( ch ) ); }
    inline bool isGraphicOrSpace( char const ch )   { return !!isprint( Byte( ch ) ); }
    inline bool isPunctuation( char const ch )      { return !!ispunct( Byte( ch ) ); }
    inline bool isWhitespace( char const ch )       { return !!isspace( Byte( ch ) ); }
    inline bool isUppercase( char const ch )        { return !!isupper( Byte( ch ) ); }
    inline bool isHexDigit( char const ch )         { return !!isxdigit( Byte( ch ) ); }

    // Wide characters: no cast needed.
    inline bool isAlphaNum( wchar_t const ch )          { return !!iswalnum( ch ); }
    inline bool isControl( wchar_t const ch )           { return !!iswcntrl( ch ); }
    inline bool isAlpha( wchar_t const ch )             { return !!iswalpha( ch ); }
    inline bool isDigit( wchar_t const ch )             { return !!iswdigit( ch ); }
    inline bool isGraphic( wchar_t const ch )           { return !!iswgraph( ch ); }
    inline bool isLowercase( wchar_t const ch )         { return !!iswlower( ch ); }
    inline bool isGraphicOrSpace( wchar_t const ch )    { return !!iswprint( ch ); }
    inline bool isPunctuation( wchar_t const ch )       { return !!iswpunct( ch ); }
    inline bool isWhitespace( wchar_t const ch )        { return !!iswspace( ch ); }
    inline bool isUppercase( wchar_t const ch )         { return !!iswupper( ch ); }
    inline bool isHexDigit( wchar_t const ch )          { return !!iswxdigit( ch ); }

} }  // namespace progrock::cppx
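To make the effect of the cast concrete, here is a minimal sketch, separate from cppx, of the int value that a classification function would receive with and without it. It assumes an 8-bit char and a Latin-1 style encoding where 'å' is the byte 0xE5 (code 229); the names rawCode and byteCode are made up for the illustration.

```cpp
namespace sketch {

    // The int value of a signed 8-bit char. For the byte 0xE5 this is
    // negative, which is exactly what isalpha must not be given.
    inline int rawCode( signed char const ch )
    {
        return ch;
    }

    // The int value after going through unsigned char. This is always in
    // the range 0 through UCHAR_MAX, which is what isalpha expects.
    inline int byteCode( char const ch )
    {
        return static_cast<unsigned char>( ch );
    }

}  // namespace sketch
```

With these assumptions, byteCode( '\xE5' ) is 229, while the same byte viewed as a signed char has the value -27.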
In C such wrappers would have the problem of not handling the EOF constant properly (it’s intentionally a negative value, distinct from every unsigned char value). But in C++ I’ve needed and written a wrapper like isAlpha above numerous times and have never needed EOF-value handling. So I think that these functions are Good, what the doctor ordered :-), and that if EOF needs to be handled some place then it’s no big deal to do that: it’s easy to add complexity on top of simplicity, but difficult to simplify the complex!
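As a sketch of that last point, here is one way EOF handling could be layered on top of a char-based wrapper. It is not part of cppx: the namespace, the local Byte typedef and the name isAlphaCode are assumptions made for the example, and it presumes the int comes from something like getchar(), which yields either EOF or an unsigned char value.

```cpp
#include <ctype.h>      // isalpha
#include <stdio.h>      // EOF

namespace sketch {

    typedef unsigned char Byte;

    // Same idea as the isAlpha wrapper shown earlier.
    inline bool isAlpha( char const ch )
    {
        return !!isalpha( Byte( ch ) );
    }

    // Hypothetical extra layer for int codes: EOF is never a letter,
    // anything else is handed to the simple char-based wrapper.
    inline bool isAlphaCode( int const code )
    {
        return code != EOF && isAlpha( static_cast<char>( code ) );
    }

}  // namespace sketch
```

The simple wrapper stays as it is, and only the code that actually reads from a C stream pays for the extra check.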
A few questions and misc things that occurred while (and after)
reading 🙂
– Getting the “full path” at argv[ 0 ] isn’t Windows-specific or
non-conforming.
Although C99 is much clearer on this issue, C++03 doesn’t give
any meaningful requirement so pretty much anything argv[ 0 ]
will be pointing to is fine (to the standard, that is :-)).
No idea why the new C++ standard hasn’t been sync’d (didn’t
they want to do so?). And the difference, here, isn’t just about
the wording: for instance C99 has a documentation requirement
if argc > 0.
With regards to the wrapper functions:
– Why aren’t you using the <locale> function templates?
Note that you could declare your own functions, with just one
parameter, and, in the implementation file, forward the work
to those. This would allow
– avoiding the inclusion of <locale> in every TU using
the functions
– have, for each of the isxyz()’s, a one-argument version
that <locale> lacks completely (more on this later).
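A minimal sketch of that forwarding idea, assuming the
definitions live in a single implementation file so that only
that file includes <locale>. The namespace and the choice of
the current global locale are assumptions for the example:

```cpp
#include <locale>

namespace sketch {

    // In a real layout only the declarations would go in the header;
    // these definitions would sit in one .cpp file.

    bool isAlpha( char const ch )
    {
        // std::locale() is a copy of the current global locale.
        return std::isalpha( ch, std::locale() );
    }

    bool isAlpha( wchar_t const ch )
    {
        return std::isalpha( ch, std::locale() );
    }

}  // namespace sketch
```

Client code then calls sketch::isAlpha( ch ) with one argument
and never sees <locale>.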
– I’m not convinced that in this case the Byte typedef indicates
more clearly what’s going on. After all, we are conceptually
working with characters, so a “literal” unsigned char might be
clearer (slightly —the crux of the matter is knowing that we
are trying to work around the gotcha about the isxxx()
functions; once you know that… Thinking about it,
conditionally subtracting CHAR_MIN might arguably be the
clearest way, since it is not “language-oriented”, just a
plain arithmetic transformation one would also use outside of
computing.)
– The “name manoeuvring” in “GraphicOrSpace” is interesting. It
might be debatable, but once you are no longer using the
standard names (whose meaning is known, whether intuitive or
not with regards to whether a space is “graphic” or
“printing”) it makes sense to clarify. I’m not sure I like it
(and generally speaking I don’t like “ThisOrThat” names) but
it shows care. And that’s good to see.
Speaking of names… the functions in <locale> have no
one-argument version (overload; or default-argument for the
second parameter) which would instead seem quite natural to
me. E.g.:
// in namespace std, of course...
template< typename Ch >
bool isspace( Ch c, locale const & = locale() ) ;
I don’t see any reason for this absence other than C
compatibility. But that’s only because they used the same
names as C does! Why didn’t they just call them differently?
– Why const parameters? (Hmm… a short bullet, once in a while
:-))
– You might want to mention in the prose that the wchar_t case
is different and doesn’t need the watch-for-negatives
workaround?
– I have never investigated further but I’ve seen incoherent
results between the one-argument iswxyz()’s (the “C heritage”
versions) and their <locale> counterparts. At a quick
test with a Norwegian locale and the codepoints in [0, 255] I
got four mismatches:
code point   mismatch on
-----------------------------------------
133          iswspace
170          iswalnum, iswalpha, iswlower
181          iswalnum, iswalpha, iswlower
186          iswalnum, iswalpha, iswlower
The mismatches were all “one way”: true for C++, zero for C.
And nothing changed (same table, “same one way”) with three or
four different locales. Oh well.
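A sketch of how such a comparison could be run for one of the
classifications. This is not the test that produced the table
above; the function name and the restriction to iswalpha are
assumptions for the example:

```cpp
#include <locale>
#include <vector>
#include <wctype.h>     // iswalpha

namespace sketch {

    // Collects the code points in [0, 255] for which the C function
    // iswalpha and the <locale> function template std::isalpha disagree.
    // iswalpha follows the C locale set with setlocale, while std::isalpha
    // uses the std::locale passed in.
    std::vector<int> alphaMismatches( std::locale const& loc )
    {
        std::vector<int> result;
        for( int code = 0; code <= 255; ++code )
        {
            wchar_t const ch = static_cast<wchar_t>( code );
            bool const cSays    = iswalpha( ch ) != 0;
            bool const cppSays  = std::isalpha( ch, loc );
            if( cSays != cppSays )
            {
                result.push_back( code );
            }
        }
        return result;
    }

}  // namespace sketch
```

To mimic the test described above one would call
setlocale( LC_ALL, "" ) first and pass std::locale( "" ); the
result depends on the platform and library.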
– The train of function definitions exhibits the kind of code
duplication that I’ve never found a good way to avoid.
A similar example is here:
<http://breeze.svn.sourceforge.net/viewvc/breeze/trunk/breeze/meta/maximum.hpp?view=log>
[NOTE: a rant is starting and I don’t know when it will be
over :-)]
One could use an external tool to generate them (nothing
fancy, in this case; your command line interpreter, for
instance) and integrate it in the build system.
But I can’t say it’s a solution that I like for such a
case.
Or a “local macro”:
#define CPPX_define_classification_functions( wrapper, backend ) ...
CPPX_define_classification_functions( isAlphaNum, isalnum )
CPPX_define_classification_functions( isControl, iscntrl )
...
#undef CPPX_define_classification_functions
(the plural, “functions”, in the name is because the macro
definition would take care of both char and wchar_t overloads
(and possibly more in the future?)).
Not something I’d expose at an art gallery, either.
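For illustration only, one shape such a local macro could
take. The macro body is elided above, so this is a guess and
not the intended definition: it uses a hypothetical
three-argument variant (wrapper, narrow backend, wide backend)
with a SKETCH_ prefix and only three of the classifications.

```cpp
#include <ctype.h>      // isalnum, iscntrl, isalpha
#include <wctype.h>     // iswalnum, iswcntrl, iswalpha

namespace sketch {

    // Hypothetical macro: defines the char and wchar_t overloads together.
    #define SKETCH_define_classification_functions( wrapper, narrow, wide ) \
        inline bool wrapper( char const ch )                                \
        {                                                                   \
            return narrow( static_cast<unsigned char>( ch ) ) != 0;         \
        }                                                                   \
        inline bool wrapper( wchar_t const ch )                             \
        {                                                                   \
            return wide( ch ) != 0;                                         \
        }

    SKETCH_define_classification_functions( isAlphaNum, isalnum, iswalnum )
    SKETCH_define_classification_functions( isControl,  iscntrl, iswcntrl )
    SKETCH_define_classification_functions( isAlpha,    isalpha, iswalpha )

    #undef SKETCH_define_classification_functions

}  // namespace sketch
```

Each invocation replaces two hand-written definitions, at the
price of code that a debugger or a reader has to expand
mentally.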
Of course, the source of duplication is the usage of different
functions for the different classifications (alnum, xdigit,
etc.). One can in general specify the classification by using
an argument, instead. In fact, a function taking such an
argument is usually already in the library implementation (so
that all of the various isxxx can just forward the real work
to it) but it’s an implementation detail.
Well, this might do:
bool
is( std::ctype_base::mask m,
    char c,
    std::locale const & loc = std::locale() )
{
    return std::use_facet< std::ctype< char > >( loc ).is( m, c ) ;
}
but if one has to use <locale> and doesn’t care about
EOF/WEOF then, of course, why not use its function templates
(either directly or writing the one-argument versions
mentioned above).
(Even if this wasn’t the case, who would want to write
ctype_base::that_mask at every invocation… one should at
least add some scaffolding to spare the user this torture 🙂
—besides, IMHO, making it more robust against the chance
that the implementation just uses an integral type for
std::ctype_base::mask).
A simple approach is using a wrapper function template; e.g.:
template< int ( *testing_function )( int ) >
bool
test( char c )
{
    return testing_function( static_cast< unsigned char >( c ) ) != 0 ;
}
but, as it is, this doesn’t do much static checking… you can
pass whatever function taking an int and returning an int you
might have at hand 🙂 And I’m afraid (I haven’t actually
tried) that making it robust will bring in just the same kind
of code duplication we are trying to avoid. Thoughts?
Hm, I guess I could write several blog postings just about the questions you raise!
But regarding just why I’m not using the <locale> templates: they don’t work. He he. Additionally I hate the iostream stuff :-), it’s IMHO over-engineered, overly complex and lacking in functionality and reasonable defaults, and where else do you find two-phase initialization in the standard library?, but that would only have mattered if it did work with current compilers, which it doesn’t.
Consider:
On my computer, Norwegian locale in Windows XP, with g++ (TDM-2 mingw32) 4.4.1 this produces "3.14" (English) while with Visual C++ it produces "3,14" (Norwegian).
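The snippet itself is not shown here. As a stand-in, and purely as an assumption about the kind of test involved, a sketch that formats a number with the user's default locale might look like this; the namespace and the name formatted are made up, and the fallback to the classic locale is only there so the sketch also runs where no user locale is configured.

```cpp
#include <locale>
#include <sstream>
#include <stdexcept>
#include <string>

namespace sketch {

    // Formats a number through a stream imbued with the user's default
    // locale, std::locale( "" ). If that locale cannot be constructed,
    // the classic "C" locale is used instead.
    inline std::string formatted( double const value )
    {
        std::ostringstream stream;
        try
        {
            stream.imbue( std::locale( "" ) );
        }
        catch( std::runtime_error const& )
        {
            stream.imbue( std::locale::classic() );
        }
        stream << value;
        return stream.str();
    }

}  // namespace sketch
```

Calling sketch::formatted( 3.14 ) under a Norwegian user locale is where one would look for the difference: a decimal comma if the library honors the locale, a decimal point if it silently falls back to "C".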
I think the C locale functionality got it mostly right, it's simple and effective. And it works. But one main problem is whether it's thread-safe (I don't know, but the C standard can't guarantee that, and I suspect that it's not thread safe in practice): I'd like the locale to be a thread local variable, sort of like a thread locale…
Cheers,
– Alf