String libraries have been a frequent topic of conversation lately. There’s a lot of things wrong with the C++ standard’s implementation, a lot of missing functionality, and a lot of discussion about the best ways to fix it or work around it. I’m going to go over a few suggestions that I and my fellow colleagues, peers, and instructors all agree on.
The first thing to note is that std::string is not immutable. This means that any string references can have its contents changed at any time, which leads to a greater deal of copying being done to ensure these changes never happen. Given that strings very rarely change – even in applications heavily focused on string/text editing – it makes sense to use immutable strings.
In practice, this basically just means writing a simple wrapper around C-style strings that copies its source string on initialization or assignment, and frees it in the destructor. That might sound inefficient, but in practice this string wrapper will be rarely used. We’re mostly interested in this custom class just to ensure that we aren’t suffering the penalties imposed by poor std::string implementations (for instance, MSVC’s std::string is 32 bytes in size on x86, not including any memory which may be dynamically allocated).
The work horse of our API will be a string range class. This is a pretty simple type which is just a wrapper around a start pointer and an end pointer. This class has a number of neat benefits. First, size of any string can be easily calculated by simply subtracting start from end.
Second, a string range can “look into” any character data buffer. APIs that need to process string data and which take string ranges can easily accept data from string literals, string objects, string builders, memory mapped files, and so on without any additional copies or temporary allocation.
Third, string ranges are quite fast to pass and return, being only two pointers in size. They require absolutely no dynamic allocation, can have trivial copy and move constructors, assignment operators, and destructors.
There’s one case where a string range may not be suitable, and that’s for use with any API that expects a regular C-style NUL-terminated string. This unfortunately includes almost all OS APIs on just about every platform. These calls are suitably rare however that makes temporary copies in these cases is not usually going to be a problem. There’s also a trick that you can use to avoid extraneous copies, with some care. This trick is to simple dereference the end pointer and check if the value is the NUL byte. If it is then the string range already represents a NUL-terminated string. This will usually be safe because your string ranges will be constructed from string literals, string objects, string builders, or other sources of string data where the buffer will always have a NUL byte at the end, so the end pointer of the string range will either be pointing at that NUL byte or at some other valid character inside the source string data. However, it is possible that your application does not always do this; if you use memory mapped files then it’s possible for a string range’s end pointer to be pointing at one byte past the end of the memory mapped range, and dereferencing it is undefined. If you use any string data sources that may not be NUL terminated, it’s important to keep this in mind if you ever plan on using the is-end-NUL trick. One final suggestion is that the default constructor for a string range should set the start and end pointers to reference an empty string literal rather than NULL. This means that your string range’s end pointer will never be NULL, making sure that even default constructed string ranges are safe to dereference and pass to C-style APIs.
When it comes to formatting strings, printf and other unsafe C style APIs are best avoided. C++ iostreams style APIs are also best avoided as they are very slow and use the standard C++ std::string API, which we’re trying to avoid.
Thankfully, building your own is rather easy. A string builder class is basically a wrapper around a std::vector (or an equivalent container of your own) that offers various methods for writing data into the buffer. Simple character append, string append, and number append functions are relatively easy to write. For floating point values, you can use the gvct function (called _gvct on Windows) to get correct and accurate string formatting. This is generally the function used internally by printf implementations.
If you desire, you can add advanced iterator support for using generic algorithms on string buffer data. Insert and cursor operations are also relatively easy to add, if you need them.
The API then simply offers a method that returns a string range, which can be copied into a string wrapper if necessary for longer term storage in memory.
Note that it’s best to ensure that the character buffer is always NUL terminated, making it easy to debug the data by calling OS print/debugoutput functions on the string buffer’s contents. It also enables the string range NUL check mentioned above. It’s also feasible to offer a method that returns an rvalue reference for creating string wrappers directly, particularly if you don’t mind the extra memory padding that the buffer will often have due to automatic buffer expansion algorithms.
A specialized use case of a string builder is to concatenate two or more strings. A simple + operator seems attractive, but remember that this only works for two operands. If you try to use a + operator with three or more strings then multiple temporary strings (and hence allocations/deallocations) become necessary, even with rvalue references and move constructors. However, using a string builder is also not the most efficient means of doing string concatenation.
First, the string builder is going to use an expansion algorithm that results in multiple buffer reallocations, and the final buffer will have extra space. If you are concatenating N strings, you can easily calculate the exact necessary space necessary (the sum of the lengths of all the operands, plus one byte for the NUL character), and you should at most need N copies (from each source to the proper location in the destination). This can be easily implemented as a single variadic function (using a count value or sentinel), a variadic template, or a series of overloaded functions supporting up to some pre-determined number of operands. The most operands I’ve seen in common use is 5, used for constructing paths (a directory, a path separator, a file name, a period, and an extension).
Though rarely needed, it’s easy enough to implement that doing so is a no brainer.
One further major consideration is string interning. String interning is the process of ensuring that there is only a single allocated copy of any particular string. When strings are loaded up or constructed dynamically, the string is searched for in a central table; if found, the existing instance is returned, and otherwise a new entry is made and returned.
One advantage is a reduction in memory usage for applications that make heavy use of frequently entered dynamic strings. Another advantage is that comparing interned strings is O(1) rather than O(n) and requires no dereferencing of the string pointers (because if the pointers are not equal, they must point to different strings).
Note that it is not generally a great idea to intern all strings. Large strings, temporary strings of dynamic nature, and strings that are highly likely to be unique gain no benefits from string interning. Especially for temporary unique strings, interning can mean that the strings never get freed in many simpler and more efficient implementations (as the interned string table is never pruned and entries never removed from it).
It’s also possible to allow multiple interning tables, which may make sense when there are different sets of strings that are unlikely to have values in common. An interned string handle class may in this case contain two pointers: one to the intern table and one to the table entry for the string. Checking strings for equality with this type is still fairly fast; if the table pointer is identical, simply compare the entry pointers, and if the table pointers are different, call the usual string range comparison types.
One final advantage of string interning is that the table entries can include hashes for each string (and in fact usually will, as the best interning table implementation is most often a hash table). This means that using interned strings as keys into other hash tables can be done without needing to rehash the key.
Advanced String Literals
A final neat trick is to use a smart string literal class. Classes like the string wrapper, string builder, and string range would not be constructible from a string literal but would instead take a special literal string class. This class itself is the only constructor/method (other than OS functions and string library internals) that ever accepts a C-style string pointer.
The first advantage is that this lets you easily ban the use of char* in the application framework and safely search/grep for it in static analysis tools, as only the string library code will ever use it.
Second, the string literal can be implemented in such a way that it computes the string length at compile time (most compilers have a builtin version of strlen that does this). Furthermore, on current GCC and future MSVC compilers, this class can do compile-time string hashing, so that use of the literal as a hash table key or for string interning doesn’t require runtime hashing.
Finally, a string literal like this can be passed to a string intern lookup and, if the lookup requires the creation of a new entry, the implementation can know that it’s safe to just store the reference (as a string range, perhaps) without needing to make a copy of the string data.
One special use case for strings is editing. This includes both simple text entry interfaces in a simple GUI toolkit in a game to full-blown text editors and word processors.
It’s worth mentioning that none of the string classes outlined above are particularly suitable for those cases. A GUI text entry may be better off with a fixed-size character buffer class with cursor support (although a string builder with cursor support is also quite servicable). For larger text editors, more advanced string data types are called for. Usually there will be no benefit to interning, and you certainly don’t want to use a single large buffer that needs to resize frequently as text is entered or removed.
String ranges may not even be useful for this case as the more common large string data structures do not guarantee contiguous memory, which a string range requires.
The suggestions above are in many cases made based on performance claims. As with will performance tips, profile and test your changes to make sure they are actual performance improvements for you application.
Also, keep in mind that some applications may have different needs than most others and some of the above tips may not be as beneficial for your application.</char></div>