[cppx] How to avoid disastrous integer wrap-around

The time is 65535:65535 PM, and the temperature is a nice comfy 4294967295 degrees… Oops. Clearly some signed integer values have been promoted to unsigned somewhere, or perhaps the program has just used unsigned integer arithmetic incorrectly.
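For a concrete picture of what goes wrong, here's a minimal sketch (the value -1 is just a stand-in for any negative reading):

#include <iostream>

int main()
{
    int const temperature = -1;             // A plausible negative reading.
    unsigned const wrapped = temperature;   // Implicit conversion wraps around.

    std::cout << wrapped << " degrees\n";   // "4294967295 degrees" with 32-bit unsigned.
}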

There are two main cures for that:

  • make sure that you’re only using unsigned arithmetic in expressions that involve unsigned type numerical values, or …
  • make sure that you’re only using signed arithmetic in expressions that involve numerical values.

As of 2010 most programmers still choose the first cure, attempting to use unsigned arithmetic wherever there are unsigned numerical values involved, e.g. being careful to use size_t instead of int for some loop counter. I think the main reason is that it’s an old convention (propagated by e.g. the C++ standard library), a case of “frozen history”, but it may also have something to do with each intrusion of unsigned numerical values being viewed as a purely local problem, suggesting that a purely local fix, each time, is appropriate. This then leads to a bug-attracting mix of signed and unsigned type expressions, which moreover are difficult to recognize as respectively signed or unsigned type, and are easy to get wrong.
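As a sketch of how that mix attracts bugs, consider the classic backwards loop with a size_t counter, which never terminates because the counter wraps instead of going negative:

#include <cstddef>
#include <vector>

void printBackwards( std::vector<int> const& v )
{
    // Intended: visit elements from last to first.
    // Actual: when i is 0, --i wraps to a huge value, so i >= 0 is
    // always true and v[i] is eventually accessed out of bounds.
    // (For an empty v, v.size() - 1 wraps immediately.)
    for( std::size_t i = v.size() - 1; i >= 0; --i )
    {
        // ... use v[i] ...
    }
}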

Here I’m advocating the second cure, a single global convention: do not mix signed and unsigned numerical types. The unsigned types are for bit-level operations (only), while the signed types are for numerical values (only). But since the standard library was designed before the advantages of this clear usage separation were fully realized, and indeed at a time when some computer architectures still conferred some advantages to e.g. an unsigned size(), adopting this more rational and work-saving convention requires some support.

A signed size type

It’s only on a 32-bit architecture that int can be safe instead of size_t. In particular, in 64-bit Windows long int is 32-bit while size_t is 64-bit, although in 64-bit *nix long int is typically 64-bit (see e.g. Wikipedia). Hence, some other signed type is needed for sizes and indexes, and that type is the standard library’s ptrdiff_t. It’s guaranteed to be large enough for any array except for an array of char filling more than half of the available address space. And that single special case is not only so rare as to not occur in practice, at all (except possibly for some embedded computer programming); it’s also a case that’s simply not fully supported by current C++.
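A little checking program makes the point concrete (the sizes shown in the comments assume 64-bit Windows; on typical 64-bit *nix all three are 8 bytes):

#include <cstddef>
#include <iostream>

int main()
{
    using namespace std;
    cout << "long int:  " << sizeof( long )      << " bytes\n";  // 4 on 64-bit Windows.
    cout << "size_t:    " << sizeof( size_t )    << " bytes\n";  // 8
    cout << "ptrdiff_t: " << sizeof( ptrdiff_t ) << " bytes\n";  // 8
}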

The standard library defines two types, size_t and ptrdiff_t, not only to provide a choice between signed and unsigned but because these types are used differently: the names indicate different usages. size_t is used for sizes, corresponding to the cppx Size type below. And ptrdiff_t is used for array indices and pointer offsets, corresponding to the cppx Index type below.

In [progrock/cppx/primitive_types.h]:

namespace progrock{ namespace cppx{
    // Some more stuff (e.g. definition of Byte) first, then ...

    namespace signedSize
    {
        CPPX_STATIC_ASSERT( sizeof( size_t ) == sizeof( ptrdiff_t ) );

        typedef ptrdiff_t   T;
        enum
        {
            maximum = ptrdiff_t( size_t( -1 ) / 2 ),
            minimum = -maximum - 1      // Assumes two's complement form.
        };

        // Assert that at least ptrdiff_t is two's complement form. This assertion
        // is stronger, asserting also that there's no bounds checking. But AFAIK
        // it's all that's available from the standard library.
        CPPX_STATIC_ASSERT( std::numeric_limits<T>::is_modulo );
    }  // namespace signedSize

    typedef signedSize::T   Size;
    typedef signedSize::T   Index;

} }  // namespace progrock::cppx

Wrapper functions

C-style casts are generally bad, e.g. easily hiding or inadvertently introducing some reinterpretation or const-casting, so writing Size(v.size()) is ungood. And C++-style casts are verbose, so writing static_cast<Size>(v.size()) is also ungood. Hence, a wrapper function is called for, writing just size(v) :-).

Wrapper functions for size- and indexing-related things also allow you to treat raw arrays and container classes in a uniform way, writing e.g. just endOf(a), instead of a+sizeof(a)/sizeof(*a) if a is a raw array and a.end() if a is e.g. a std::vector. Similarly, in cppx I define a startOf function. These names are intentionally slightly unlike begin and end, so as to avoid any confusion (for size I found it difficult to come up with some different name, though):

In [progrock/cppx/collections/iterator_range_util.h]:

namespace progrock{ namespace cppx{

    template< typename T >
    inline Size size( T const& c )              { return static_cast<Size>( c.size() ); }

    template< typename T, Size N >
    inline Size size( T (&)[N] )                { return N; }

    template< typename T >
    inline typename It<T>::T startOf( T& c )    { return c.begin(); }

    template< typename T, Size N >
    inline T* startOf( T (&a)[N] )              { return a; }

    template< typename T >
    inline typename It<T>::T endOf( T& c )      { return c.end(); }

    template< typename T, Size N >
    inline T* endOf( T (&a)[N] )                { return a + N; }

    #define CPPX_ALL_OF( v )   \
        ::progrock::cppx::startOf( v), ::progrock::cppx::endOf( v )

} }  // namespace progrock::cppx

It’s possible to redesign size so that it can be used at compile time, then wrapped via a macro, but although it’s an interesting academic exercise I have never needed that in practice, and the necessary macro complicates things. Another limitation is that with C++98 these functions can’t be used for locally defined classes, since in C++98 such classes have no linkage and template instantiation requires external linkage. However, like the first point this is not a problem in practice, and anyway, as I understand it, it’s been fixed in C++0x.
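For completeness, that compile time redesign might look like the following sketch (the names sizerOf and CPPX_COUNT_OF are illustrations here, not actual cppx names):

#include <cstddef>

namespace cppx_sketch {
    // Declared but never defined: only sizeof of the result is used.
    // The result type is a reference to an array of N chars.
    template< typename T, std::size_t N >
    char (& sizerOf( T (&)[N] ))[N];
}  // namespace cppx_sketch

// Yields a compile time constant, usable e.g. as an array bound.
#define CPPX_COUNT_OF( a )  sizeof( ::cppx_sketch::sizerOf( a ) )

int main()
{
    int data[42];
    char other[ CPPX_COUNT_OF( data ) ];    // OK: bound known at compile time.
    (void) other;
}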

Summing up, with the above you can write e.g. for (Index i = size(s)-1; i >= 0; --i), and other such natural constructions, with no fear of value wrapping bugs. 🙂
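Putting it together, a small usage sketch (assuming the cppx headers above are available on the include path):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
#include <progrock/cppx/collections/iterator_range_util.h>

int main()
{
    using namespace progrock;

    std::string const  s   = "Hello";
    int                a[] = { 3, 1, 4, 1, 5 };

    // Counting down to 0 is safe with a signed Index.
    for( cppx::Index i = cppx::size( s ) - 1; i >= 0; --i )
    {
        std::cout << s[i];
    }
    std::cout << "\n";                      // Prints "olleH".

    // Raw arrays and containers are treated uniformly.
    std::vector<int> v( cppx::startOf( a ), cppx::endOf( a ) );
    std::sort( CPPX_ALL_OF( v ) );
}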

Automatic iterator type deduction

For e.g. a std::vector<int> const the startOf and endOf functions above automatically deduce the proper iterator type, via the cppx It utility class template:

In [progrock/cppx/collections/It.h]:

namespace progrock{ namespace cppx{

    template< typename Type >
    struct It
    {
        typedef typename Type::iterator T;
    };

    template< typename Type >
    struct It< Type const >
    {
        typedef typename Type::const_iterator T;
    };

    template< typename ElemType, Size N >
    struct It< ElemType[N] >
    {
        typedef ElemType* T;
    };

} }  // namespace progrock::cppx

Cheers, & … enjoy! 🙂

11 comments on “[cppx] How to avoid disastrous integer wrap-around”

  1. Very interesting perspective. I definitely fall into the first category of “use unsigned as much as possible”. I have never thought of using signed values for all numerical values.

    The main reason for me to use unsigned types is that stuff such as counts and sizes feels more natural as unsigned.

    This is thought provoking.

    • Yes, it might seem to be the same. I only hinted at the problems since they are well known. You shouldn’t take that hint too literally… 😉

      For those unfamiliar with the problems of unsigned arithmetic, here are some exercises that I think illustrate some of it:

      1) Write a loop that counts down from 5 to 0 inclusive, and prints each number. First with a signed type loop counter, and then with an unsigned type loop counter. There are many solutions for the latter (none very intuitive; one is sketched below this list).

      2) For your C++ compiler and values of x in the range 3 through 7, what is the value of (x-5)/10 when x is respectively int and unsigned?

      3) For general numbers, if you have a < b+k, then you also have a-k < b, and vice versa. Assume that k = 42. For what values of a and b is it safe to rewrite a < b+k as a-k < b when a and b are signed integers? What about unsigned?
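      To make exercise 1 concrete, here is a sketch of the signed version and one of the unsigned workarounds:

      #include <iostream>

      int main()
      {
          // Signed counter: direct and obvious.
          for( int i = 5; i >= 0; --i ) { std::cout << i << " "; }
          std::cout << "\n";

          // Unsigned counter: one workaround is to stop when the
          // decrement has wrapped around to the maximum value.
          for( unsigned i = 5; i != unsigned( -1 ); --i ) { std::cout << i << " "; }
          std::cout << "\n";
      }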

      Cheers & hth.,

      – Alf

      • The more I think about it, the more I think you are right. I have been doing it wrong the whole time.

        I like the argument in part 3. With signed type, it is always safe to rewrite a < b+k as a-k < b. There is no hole.

        I have recently encountered a few bugs with subtraction and unsigned integers. Since most of the application uses unsigned types, the bugs were difficult to track down. They wouldn’t even have been a problem if we had consistently used signed types.

        … Alan

      • It all depends on what you are doing.

        There are counterexamples:
        1) a point for you 🙂

        2) And what is it for x = -2147483643?

        3) As you probably know, for a large set of values of a and k, (a-k) causes UB.

  2. I don’t know if it is interesting but Scientific and Engineering C++ by Barton & Nackman also used ptrdiff_t:

    typedef ptrdiff_t Subscript;  // Signed, machine-dependent
    

    But only for subscripts; sizes were still represented by some unsigned:

    typedef unsigned short int Dimension;
    

    (don’t ask me why it is “short”: I should recheck the book :-))

    PS:

    > for size I found it difficult to come up with some different name, though

    I call it “count” (a term I’ve used since the C days, when I wanted to distinguish between the “count” of an array of chars and its “length” as a string).

    • Thanks. That’s much better than size! 🙂 I’ve consequently changed my cppx source to countOf. Happily not much other code was affected! I added the “of” suffix because pure count might easily clash with client code names; because the “of” reads nicely; and because it matches startOf and endOf.

  3. The problem is that — as far as the standard is concerned — ptrdiff_t is merely guaranteed to be at least 16 bits (although in practice it usually follows the bitness of the machine and thus, as you say, works for all arrays except extremely large ones whose element type is of size 1). Subtracting two pointers that are too far apart results in undefined behavior (see [expr.add]/6).

    Furthermore, while there are gotchas to be aware of, you rarely perform dangerous operations on indexes, sizes or counts in general. I challenge you to find a real use case where you perform (x-5)/10 where x is an index into an array.

    As for “a < b+k”, indeed one needs to be careful there and make sure that the expression on either side is still an index/size/count.

    • He he. 🙂 No, the limited guaranteed range for ptrdiff_t is not a problem on 32-bit or 64-bit systems, any more than it is a problem for size_t. As you note, ptrdiff_t is of guaranteed sufficient range for any valid pointer subtraction.

      But it can be a problem on 16-bit machines (which happily are quite rare!), because at least according to the N869 committee draft for C99, ptrdiff_t is required to be at least 17 bits.

      So, on such a machine it might introduce needless inefficiency to explicitly use ptrdiff_t, or indeed to do any pointer subtraction, since that’s defined to yield a ptrdiff_t result…

      But programming for a 16-bit machine you have to make many adjustments anyway, writing pretty machine-specific code and/or doing things in ways imposed by the limits of the machine, so it’s not like it’s practically relevant.

      I agree that dangerous operations on indexes, sizes or counts in general are rare. But they’re there. Like, checking the result of .size() or .length() against some computed limit. For example, s.length() > -5 will in practice almost always yield (unexpected) false. That’s due to implicit promotion.
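      A minimal demonstration of that promotion:

      #include <iostream>
      #include <string>

      int main()
      {
          std::string const s = "hi";

          // -5 is converted to the unsigned size_t, a huge value,
          // so the comparison is (unexpectedly) false.
          std::cout << ( s.length() > -5 ) << "\n";   // Prints 0.
      }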

      Happily most compilers will warn about comparing signed and unsigned.

      But as of 2010 they generally don’t warn about computed numerical results.

      Cheers, & thanks for commenting!,

      – Alf
