Why COW was deemed ungood for std::string.

COW, short for copy on write, is a way to implement mutable strings so that creating strings and logically copying strings, is reduced to almost nothing; conceptually they become free operations like no-ops.

Basic idea: to share a data buffer among string instances, and only make a copy for a specific instance (the copy on write) when that instance’s data is modified. The general cost of this is only an extra indirection for accessing the value of a string, so a COW implementation is highly desirable. And so the original C++ standard, C++98, and its correction C++03, had special support for COW implementations, and e.g. the g++ compiler’s std::string implementations used COW.

So why was that support dropped in C++11?

In particular, would the same reason or reasons apply to a reference counted immutable string value class?

As we’ll see it does not, it’s just a severe mismatch between the std::string design and the ideal COW requirements. But it took a two hour car trip, driving 120 kms on winter roads, for my memory to yet again cough up the relevant scenario where Things Go Wrong™. I guess it’s like the “why can’t I assign a T** to a T const** question; it’s quite counter-intuitive.

Basic COW string theory: the COPOW principle.

A COW string has two possible states: exclusively owning the buffer, or sharing the buffer with other COW strings.

It starts out in state owning. Assignments and copying initializations can make it sharing. Before executing a “write” operation it must ensure that it’s in owning state, and a transition from sharing to owning involves copying the buffer contents to a new and for now exclusively owned buffer.

With a string type designed for COW any operation will be either non-modifying, a “read” operation, or directly modifying, a “write” operation, which makes it is easy to determine whether the string must ensure state owning before executing the operation.

With a std::string, however, references, pointers and iterators to mutable contents are handed out with abandon. Even a simple value indexing of a non-const string, s[i], hands out a reference that can be used to modify the string. And so for a non-const std::string every such hand-out-free-access operation can effectively be a “write” operation, and would have to be regarded as such for a COW implementation (if the current C++ standard had permitted a COW implementation, which it doesn’t).

I call this the principle of copy on possibility of write, or COPOW for short. It’s for strings that aren’t designed for COW. For a COW-oriented design applying COPOW reduces to pure COW.

A code example showing how COW works.

To keep the size of the following example down I don’t address the issue of constant time initialization from literal, but just show how assignment and copy initialization can be reduced to almost nothing:

#include <cppx-core/_all_.hpp>  // https://github.com/alf-p-steinbach/cppx-core

using C_str = const char*;      // Is also available in cppx.

namespace my
    $use_cppx( Raw_array_of_, Size );
    $use_std( begin, end, make_shared, vector, shared_ptr );

    class Cow_string
        using Buffer = vector<char>;

        shared_ptr<Buffer>      m_buffer;
        Size                    m_length;

        void ensure_is_owning()
            if( m_buffer.use_count() > 1 )
                m_buffer = make_shared<Buffer>( *m_buffer );

        auto c_str() const
            -> C_str
        { return m_buffer->data(); }

        auto length() const
            -> Size
        { return m_length; }

        auto operator[]( const Size i ) const
            -> const char&
        { return (*m_buffer)[i]; }

        auto operator[]( const Size i )
            -> char&
            return (*m_buffer)[i];

        template< Size n >
        Cow_string( Raw_array_of_<n, const char>& literal ):
            m_buffer( make_shared<Buffer>( literal, literal + n ) ),
            m_length( n - 1 )
}  // namespace my

Here assignment is the default-generated assignment operator that just assigns the data members m_buffer and m_length, which are a shared_ptr and an integer, and ditto for copy initialization.

And apparently this code abides by the COPOW principle, so it should be safe…

The problem: UB by adding code that just copies.

Consider the following usage code, it’s perfectly fine:

auto main() -> int
    my::Cow_string s = "Du store Alpakka!";
    const C_str p = s.c_str();

    // In this block the contents of `s` are not modified.
        $use_std( ignore );
        const char first_char = s[0];
        ignore = first_char;

    $use_std( cout, endl );
    cout << p << endl;

This code is fine because the COW string is already in state owning when s[0] is executed on the non-const s. So all that the initialization of first_char does is to copy a char value. Fine.

But if a maintainer innocently just introduces a logical copy of the string value, which is what COW primarily optimizes, and which certainly doesn’t change the conceptual value, then mayhem ensues:

auto main() -> int
    my::Cow_string s = "Du store Alpakka!";
    const C_str p = s.c_str();

    // In this block the contents of `s` are not modified.
        $use_std( ignore );
        my::Cow_string other = s;
        ignore = other;

        const char first_char = s[0];
        ignore = first_char;

    $use_std( cout, endl );
    cout << p << endl;      //! Undefined behavior, p is dangling.

Uh oh.

Since s here is in state sharing, the COPOW principle makes the s[0] operation copy the shared buffer, to become owning. Then at the end of the block the only remaining owner of the original buffer, the other string, is destroyed, and destroys the buffer. Which leaves the p pointer dangling.

For a custom string type like Cow_string this is a user error. The type is just badly designed, so that it’s very easy to inadvertently use it incorrectly. But for a std::string it’s formally a bug in the COW implementation, a bug showing that COPOW is just not enough.

For a std::string, if the standard had permitted a COW implementation, to avoid the above calamity it would be necessary to transition to the owned state, incurring an O(n) copying of string data, every place that a reference, pointer or iterator is handed out, regardless of const-ness of the string. One could maybe call that copy on handing out any reference, COHOAR. It greatly reduces the set of cases where COW has an advantage. The C++ standardization committee deemed that cost too high, the remaining advantages of COW too low, to continue supporting COW. So,

  • the C++03 wordings that supported COW were removed;
  • wording was introduced, especially a general O(1) complexity requirement for [] indexing, that disallowed COW; and
  • functionality such as string_view was added, that relies on having pointers to string buffers, and that itself hands out references.

What about threads?

It’a common misconception that COW std::strings would be incompatible with multi-threading, or that making it compatible would make it inefficient, because with COW ordinary copying of a string doesn’t yield an actual copy that another thread can access freely.

In order to allow string instances that are used by different threads, to share a buffer, just about every access function, including simple [] indexing, would need to use a mutex.

However, a simple solution is to just not use ordinary copy initialization or assignment to create a string for another thread, but instead a guaranteed content copying operation such as std::string::substr, or initialization from iterator pair. The C++11 standard could have gone in this other direction. It could, in principle, have added to the existing C++03 support for COW strings, noting that COHOAR, not just COPOW, is required, and added a dedicated deep-copy operation or deep-copy support type plus wording about thread safety.

What about reference counted immutable strings?

An immutable string is a string type such as the built in string types in Java, C# or Python, where the available operations don’t support string data modification. Strings can still be assigned. One just can’t directly change the string data, like changing “H” to ”J“ in “Hello”.

Immutable strings can and in C++ typically will share their string data via reference counting, just as with COW strings. As with COW strings they support amortized constant time initialization from literal, ditto superfast copy assignment and copy initialization, and in addition, if strings don’t need to be zero-terminated they support constant time substring operations. They’re highly desirable.

So, is the problem shown above, a showstopper for immutable strings in C++?

Happily, no. The problem comes about because std::string hands out references, pointers and iterators that can be used to change the string data without involving std::string code, i.e. without its knowledge. That can’t happen with an immutable string.

And figuring out this, whether there was a showstopper, and whether std::string_view (that hands out references) could be used freely in code that would deal with immutable strings, was the reason that I delved into the question of COW std::string again. At one point, long ago, I knew, because I participated in some debates about it, but remembering the problematic use case wasn’t easy. It’s just not intuitive to me, that adding an operation that just copies, can create UB…


Ch 4 of my Norwegian intro to C++ available

The Norwegian introduction to C++ programming (a bit Windows-specific) is at Google Docs, in PDF format, 4 chapters so far:

Introduksjon til C++-programmering (Windows)

Each file has a nice table of contents but for that you need to download the PDF and view it in e.g. Foxit or Adobe Acrobat. Ch 1, the Introduction, is just 1 page, though. Ch 2, tooling up with Visual C++ and learning about some Windows stuff, is more pages. And so is ch 3, about basic C++ such as loops and decisions. And ch 4, about creating console programs (all programs so far just GUI), chimes in at some 50 pages!

Perhaps it’ll become a book…

Here’s a table of contents generated by (1) using a Word TOC field and half-documented RD fields to refer to the chapters, (2) [Shift Ctrl F9] in Word (is that still documented anywhere?) to “lock” the text, (3) edit, removing unwanted entries, (4) copy as text to Crimson Editor, save, and (5) run a very very hairy C++ program to generate the HTML.

Oh, I see in the preview that instead of a purely numbered list, in the WordPress blog I get letters and roman numerals!

So be it – but there’s also a PDF of the original over at Google docs (link above).

  1. Introduksjon. | 1
  2. Første program, etc. | 1
    1. Gratis verktøy. | 1
    2. Muligens ikke helt typiske installasjonsproblemer… | 2
    3. “Hallo, verden!” i Visual Studio / om IDE prosjekter. | 6
    4. Feilretting i Visual Studio / generelt om C++ typesjekking. | 15
    5. Hva “Hallo, verden!” programteksten betyr. | 18
    6. Spesielt aktuelle Windows-ting for nybegynneren. | 21
      1. Makroer og Unicode/ANSI-versjoner av Windows API-funksjoner. | 22
      2. Moderne utseende på knapper etc. / om DLL-er og manifest-filer. | 23
      3. Ikon og versjonsinformasjon / [.exe]-fil ressurser. | 28
    7. Gir C++ ekstra mye kode og kompleksitet? | 32
    8. Å finne relevant informasjon om ting. | 32
      1. Tipsruter og automatisk fullføring. | 32
      2. Å gå direkte til en aktuell deklarasjon eller definisjon. | 33
      3. Full teknisk dokumentasjon / hjelp / kort om Microsofts “T” datatyper. | 34
      4. Dokumentasjon av C++ språket og C++ standardbiblioteket. | 36
      5. Diskusjonsfora på nettet / FAQ-er. | 38
  3. Et første subsett av C++. | 1
    1. Gjenbruk av egendefinerte headerfiler. | 1
      1. En wrapper for [windows.h]. | 2
      2. Å konfigurere en felles headerfil søkesti i Visual Studio 2010. | 6
      3. En muligens enklere & mer pålitelig måte å konfigurere Visual Studio på. | 9
    2. Grunnleggende data. | 12
      1. Variabler, tilordninger, oppdateringer, regneuttrykk, implisitt konvertering. | 14
      2. Implisitte konverteringer. | 15
      3. Initialisering og const. | 16
    3. Tekstpresentasjon og strenger. | 17
      1. Arrays som buffere, konvertering tall ? tekst. | 17
      2. Strenger, konkatenering og std::wstring-typen, anrop av medlemsfunksjon. | 18
      3. Å lage tekstgenererings-støtte / egendefinerte funksjoner & operatorer. | 22
    4. Løkker, valg og sammenligningsuttrykk. | 27
      1. Sammenligninger og boolske uttrykk. | 32
      2. Valg. | 34
      3. Løkker. | 39
    5. Funksjoner. | 41
      1. Hva du kan og ikke kan gjøre med en C++ funksjon. | 41
      2. Funksjoner som abstraksjonsverktøy. | 41
      3. Verdioverføring og referanseoverføring av argumenter. | 45
  4. Kommandotolkeren. | 1
    1. Windows kommandotolkeren [cmd.exe]. | 2
      1. Å kjøre opp en kommandotolker-instans / konfigurering av konsollvinduer. | 2
      2. Kommandoer / hjelp. | 8
      3. Kommandoredigering & utklippstavle-operasjoner. | 11
      4. Linjekontinuering & tegn-escaping. | 11
      5. Operatorer & sammensatte kommandoer / omdirigering & rørledninger. | 12
      6. Erstatting av miljøvariabel-navn / arv av miljøvariabler. | 15
      7. Kommandotolkerens søk etter programmer: %path% og %pathext%. | 16
    2. Navigasjon. | 17
    3. Å kompilere fra kommandotolkeren. | 21
      1. Å nei! “Hallo, verden!” igjen! | 22
      2. Konsoll kontra GUI subsystem. | 24
      3. Å angi linker-opsjoner til kompilatoren / separat kompilering og linking. | 26
      4. Å be kompilatoren om standard C++, please. | 27
      5. Å angi headerfilkataloger, også kjent som inkluderingskataloger. | 28
    4. Batchfiler – å automatisere f.eks. et standardoppsett. | 31
    5. C++ iostreams. | 33
      1. iostream-objekter for standard datastrømmene. | 33
      2. Datastrøm orientering: nix mix (av char og wchar_t datastrømobjekter). | 36
      3. Å detektere “slutt på datastrømmen” (EOF, end of file). | 36
      4. Innlesing av strenger. | 40
      5. Praktikalitetsdigresjon: hvordan bli kvitt navneromskvalifikasjonene. | 42
      6. Innlesing av tall. | 43
      7. Formatert utskrift med iostream manipulatorer. | 48

Cheers, & enjoy! – Alf

Current TOC for my Norwegian intro to C++

About the Norwegian C++ intro, see my earlier posting.

Not sure if this works or not, but I’m trying to embed a PDF of a Table of Contents generated by Word:

Enjoy! 🙂  [Possibly/probably more to come, after all, I’m referring to chapter 4!]

– Alf

By the way, Olve, as you can see I’ve now added a chapter 3! Not quite at 42 yet… But.

A Norwegian introduction to C++ programming (in Windows)

I’m a compulsive writer, I admit. So, when testing Visual C++ 10.0, via Microsoft’s free Visual C++ Express IDE, I wrote about it. In Norwegian!

Maybe it’ll be a book. Anyway, I always write as if it’s going to be a book! I’m an incorrigible optimist!

It’s at Google Docs, in PDF format, 2 chapters so far:

Introduksjon til C++-programmering (Windows)

Each file has a nice table of contents but for that you need to download the PDF and view it in e.g. Foxit or Adobe Acrobat. Ch 1 is just 1 page, though. Ch 2 is more pages.

[Update, 4th of August: I’ve now added chapter 3, “Et første subsett av C++”. It’s great. :-)]

Comments very welcome!

Even if your name is Olve Maudal, say! 🙂

[cppx] B true, or B thrown! (Using the >> throwing pattern)

How often have you declared a variable and invented an ungrokkable three-letter name for that variable, just to temporarily hold the result of some API function so that you can check it immediately after the call? Let me guess, you’ve done that thousands of times. And if so then here are happy tidings: you can completely avoid introducing all those helper variables! 🙂

[… More] Read all of this posting →

[cppx] Is C4099 really a sillywarning? (MSVC sillywarnings)

Evidently some people get to my blog by googling up C4099, the MSVC warning that you’ve used struct in one place and class in another place, for the same class. This is one of the alleged sillywarnings in my sillywarnings suppression header. But given e.g. the discussion at StackOverflow, is it really a sillywarning, or perhaps something to take seriously?

[… More] Read all of this posting →

[cppx] Exception translation, part II

In part I I showed how to use the cppx library’s exception translation support, which decouples the specification of how non-standard exceptions should be translated, from each routine’s invocation of such translation. The translation can be customized by dynamically installing and uninstalling exception translator routines. And essentially each routine that wants exception translation must use a catch (this will most often be a generic catch(...)) where it invokes cppx::rethrowAsStdX, which in turn invokes the installed exception translator routines and performs a default translation if none of them apply.

In this second part I discuss how that translation machinery works.

In part III I’ll discuss the support for installation and uninstallation of exception translator routines. And perhaps I’ll need a part IV to discuss the cppx exception types! Anyway, now, diving down into the code…

[… More] Read all of this posting →

[cppx] Exception translation, part I

MFC throws pointers, Xerces throws various types not derived from std::exception, and as I recall the Lotus Notes API throws integers. An interesting diversity. But as the Chinese curse goes: may you live in interesting times! Such non-standard exceptions generally require you to have multiple catches every place such an exception could otherwise escape to higher levels, or where you want to handle any exception with a given general meaning. It’s not that the designers have tried to be clever or that they’re out to get you: it’s just that these libraries stem from before C++ was standardized, i.e., before 1998.

One partial remedy is exception translation. Somehow arrange for any non-standard exception to be caught and result in some corresponding exception derived from std::exception. It does not solve all the problems – but it sure beats doing it by copy-n’-paste coding of exception handlers!

With C++98 exception translation cannot be completely centralized in a portable, reusable way. As far as I know that’s not possible even with the upcoming C++0x standard. But it’s possible to provide portable and application-independent support that does most of the job, and that provides a general convention that largely eliminates the chance of any non-standard exception slipping through, and that’s what I discuss here. [… More] Read all of this posting →

[cppx] Unique identifier values via std::type_info

Like my posting about cloning this posting about unique identifier values is in preparation for discussing the cppx library’s exception translation support. In short the aspect discussed here is how to let the calling code choose an id for a set of installed translators, so that it can remove them all in one operation (specifying that id). And the simplest id you have in C++ is a type with external linkage. You can obtain a unique-wrt.-comparisions id for a class from the typeid operator. The only problem is that that id, of type std::type_info, is not copyable and not generally comparable, so it can’t be used directly as a key in a std::map, say, and the standard fails to guarantee that you will always obtain the same std::type_info instance for the same type, so it’s a bit risky to use a pointer to that instance…

[… More] Read all of this posting →

[C++ basics] Five lesser known special cast operations

If you’re already familiar with casting to inaccessible base, disambiguation casts, casting to most derived object, casting to/from first member of a POD, and accessing the built-in address operator via a cast, then this blog entry is perhaps not for you. 🙂 It’s basics, but for some reason even seasoned C++ programmers are not always aware of all these special cast operations. This is about special case semantics, not syntax or general operations.

[… More] Read all of this posting →