Experimenting with Fixed Size Delegates with User Data

A fairly common need in many programming tasks is to pass a function off to another function as a parameter. This allows all kinds of fancy higher-order functional programming. It's also really handy for event and messaging systems, particularly when building UIs. Games also make heavy use of such features, though all the peculiar requirements of games often make conventional solutions unusable.

Introduction

Being laid up ill all weekend, I decided to experiment a bit with alternative ways of storing delegates that allowed for storing (somewhat) arbitrary user data inside of them. The end result is not necessarily something that I would recommend dropping into your own projects (especially if you don't find yourself having a need for the functionality), but it is at least a fun venture into some of the darker corners of abusing C++.

In short, this was an experiment, and the article is an explanation of the what and how without any justification for the why.

The Problem

The standard way of binding functions in C++11 is to make use of the std::function<> template. This template can bind up any type that matches a certain signature, including free functions, functor objects, and even C++11 lambdas. Unfortunately, the ability to take any type that matches a particular signature means that it has to support taking objects of any arbitrary size. The result is that the template is required to make heap allocations, at least in some circumstances. While it's completely legal for the template to make use of small object optimizations (that is, only allocating memory if the object passed to it goes over some certain size), this behavior is not guaranteed. Some implementations of std::function<> will always allocate memory for any functor or lambda that is larger than a standard function pointer.

It is of course possible to customize this behavior to an extend using a custom allocator object. As many C++ users know, however, the C++ allocator patterns are painful to use properly, lacking in a few necessary features, and otherwise take what is already a large and bloated template and make it even larger and more bloated. For those of us where iteration times deeply matter, long compile times are a problem, and relying on complicated templates to solve simple problems is frowned upon.

In fact, in the world of games, we often eschew the STL in general simply because it is so poorly optimized for our needs, because customizing it takes more specialized knowledge than replacing it, and because even the simplest STL types are complex beasts that can have a significant impact on compile times. I've been using a custom delegate type rather than std::function<> for some time now, although that type only supported binding regular function pointers or member function pointers (with associated object instance). The inability to make that bound object instance be a smart pointer of any kind has been more of a limitation than the inability to bind extra data to the functions, but supporting arbitrary smart pointer types is tantamount to supporting arbitrary functor object support, when you get right down to it.

Hence, I decided to take a look into a custom replacement for std::function<> that guaranteed no heap allocations, is able to support arbitrary functor objects and lambdas of reasonable sizes, and is tiny and quick to compile.

Type Erasure

By far the most complicated part of this exercise is efficiently handling the necessary type erasure. Type erasure is any technique that allows for the type of a value to be hidden from the user. std::function<> for example is a single type that can wrap a function pointer, a functor object, or a lambda. The user is not required to use a different variant of std::function<> for each type of callable object being wrapped; that would defeat the purpose of using std::function<> in the first place.

The general way of accomplishing type erasure in C++ is to make use of virtual functions and abstract base classes. It is possible to create an interface as a base class and then to make a templated child of that interface, allowing you to internally wrap up any compatible object. For example:


struct function {
struct bindingbase { virtual ~bindingbase() {} virtual void gnvoke() = 0; };

template struct bindingimpl : public bindingbase { TYPE value; binding_impl(const TYPE& value) : value(value) {} virtual void invoke() { value(); } }

binding_base* binding;

function() : binding(nullptr) {} template function(const TYPE& bind) : binding(new binding_impl(bind)) {} ~function() { delete binding; }

void operator()() { if (binding != nullptr) binding->invoke(); } };

Let's take a look at what's going on here. Our function class (which here only deals with functions with no arguments and no return value) uses a special wrapper interface, called bindingbase. When we construct a new function, we pass it in some type, which we wrap up in a bindingimpl. The binding_impl::invoke method will attempt to call its value as a function.

This simple type will already take any callable object (a function pointer, functor, or lambda) that matches the signature void(void). It's that easy. Of course there's a few missing pieces here, such as the lack of copy/move constructors and assignment operators (this sample class breaks the Rule of Three; obviously it's not production ready!), but that gets the job done.

Bounding Size to Avoid Allocation

The problem then is that it must allocate a new instance of bindingimpl<>. This is doubly important since the actual size of any particular bindingimpl<> is unknown, since that depends entirely on the size of the callable object being bound.

The first thing to do then is to set an upper bound on the size of the callable object, create a buffer in our function object, and store a copy of the callable object there.


struct function {
union { double alignme; char buffer[sizeof(void*) * 3]; };

struct bindingbase { virtual ~bindingbase() {} virtual void invoke(void* buffer) = 0; };

template struct bindingimpl : public bindingbase { virtual void invoke(void* buffer) { (*static_cast(buffer))(); } }

binding_base* binding;

function() : binding(nullptr) {} template function(const TYPE& bind) : binding(new binding_impl(bind)) { new (&buffer) TYPE(bind); } ~function() { delete binding; }

void operator()() { if (binding != nullptr) binding->invoke(); } };

That's a decent start. Notice that there is a buffer variable. The size chosen is not entirely arbitrary. More on that in a bit. Notice also that there is a double value called "alignme". This is pretty important. C++ types all have a natural alignment. The alignment for char is 1, the alignment for an int is 4, and the alignment for a double is 8. Technically this is all platform dependent of course, which is why one should never ever put numbers like that directly in code, but rather rely on sizeof() and alignof(). The alignment requirements of a struct or class are typically set to the strictest alignment of any of its members. Since a char array has the alignment of char, which is typically 1, that would mean that we would not easily be able to store any arbitrary value into buffer; the functor or lambda we store may require a stricter alignment. Double is a safe-ish bet for strictest alignment, although it is far from fool-proof. A long double (not that you see those much) may (or may not) have a stricter alignment. SSE/SIMD types often have a stricter alignment. Unfortunately, there is no way to explicitly ask for the strictest alignment, nor is there even a good definition for what that might be; you can ask for near arbitrary alignment of your types with compiler-specific annotations, after all.

The size of the buffer is chosen with that alignment requirement in mind. We're assuming a minimum alignment of 8 bytes because of the double. If we wanted to pack more than one function into a struct or array, we'd like to avoid any wasted space due to padding. Therefor, the size of our function class should be a multiple of 8 bytes. On a 32-bit platform, pointers are typically 32 bits (4 bytes) each. Given that we will be keeping one pointer of our own for the binding object, that means that any odd multiple of 4 will give us a number divisible by 8. Note that I used sizeof(void) rather than any explicit numbers in order to try to avoid making too many assumptions (not that I haven't already made some with the alignment trick, and I will certainly be making a few more by the end of the article). This number will still be a good choice to avoid padding on a 64-bit platform, as any multiple of sizeof(void) will give us a number divisible by 8. However, 3 is a decent choice for the multiple as it would give us a total size of ((3+1)*8)=32 bytes, which evenly divides into cache line sizes. If by some chance you're allocating an array of function objects, that may or may not be handy. Of course it's unlikely to ever be a major issue, so raising the multiple to 5 is entirely reasonable if functions need to hold larger functors/lambdas.

Fixing the Rule of Three Violations

There are some issues with the latest iteration of the code. First, there's still Rule of Three violations. These are now a bit harder to fix since the wrapped callable object type is stuffed into a buffer. The binding object will need to add some explicit methods to copy and destruct an object in the buffer. A bigger issue is that the memory allocation is still there, as the binding object is still allocated.

Let's take care of the simpler issue first: solving the Rule of Three violations.


struct function {
union { double alignme; char buffer[sizeof(void*) * 3]; };

struct bindingbase { virtual ~bindingbase() {} virtual binding_base* clone() = 0; virtual void invoke(void* buffer) = 0; virtual void copy(void* dest, const void* source) = 0; virtual void destruct(void* buffer) = 0; };

template struct bindingimpl : public bindingbase { virtual bindingbase* clone() { return new bindingimpl(); } virtual void invoke(void* buffer) { (static_cast(buffer))(); } virtual void copy(void dest, const void* source) { new (staticcast(dest)) TYPE(*staticcast(source)); } virtual void destruct(void* buffer) { static_cast(buffer)->~TYPE(); } }

binding_base* binding;

function() : binding(nullptr) {} template function(const TYPE& bind) : binding(new binding_impl(bind)) { binding->copy(buffer, &bind); } function(const function& rhs) : binding(nullptr) { if (rhs.binding != nullptr) { binding = rhs.binding->clone(); binding->copy(buffer, rhs.buffer); } } ~function() { if (binding != nullptr) binding->destruct(buffer); delete binding; }

function& operator=(const function& rhs) { if (this != &rhs) { if (binding != nullptr) binding->destruct(buffer);

  delete binding;
  if (rhs.binding != nullptr) {
    binding = rhs.binding->clone();
    binding->copy(buffer, rhs.buffer);
  }
}
return *this;

}

void operator()() { if (binding != nullptr) binding->invoke(); } };

Not a lot of new code there. I added a copy constructor and a copy assignment operator. There was also a need to add a bindingbase::clone() method and corresponding bindingimpl::clone(), since currently the binding object must be copied when assigning one function to another. That of course will go away once the binding allocations are removed. It's important to note that since only the copy constructor operation is supported by the binding object that the copy assignment operator in function destucts its current bound callable object and then constructs the assigned one. It may seem like this is non-optimal; however, it's likely that while copying functors around that different functors will be used, which in turn are entirely different classes and hence there is no copy assignment operator for them anyway. If the function object is largely being used for simpler types that are compatible, like function pointers (where the copy assignment operator is just copying one pointer over the other), this is indeed every so slightly non-optimal, but then the destruct and copy construct operations will be quite cheap.

Avoiding Allocation for Binding Instance

The final major bit then is getting rid of that allocation. And this is where things start to get a little hairy. There is a nice, standards conforming way to do what is desired. That is naturally the first way I implemented things. Unfortunately, certain compilers did not seem to want to do the basic optimizations that I felt where necessary to make that implementation ideal. I'm still entirely on the fence on this one; I really don't like the black magic that went into my final solution purely on principle, but I had a pretty strict target on generated code size and overhead when I started this experiment.

Let's take a quick peek at the ideal solution.


struct bindingt {
typedef void(*invoke
t)(void* buffer); typedef void(copy_t)(void dest, const void* source); typedef void(destruct_t)(void buffer);

invoket invoke; copyt copy; destruct_t destruct;

bindingt(invoket invoke, copyt copy, destructt destruct) : invoke(invoke), copy(copy), destruct(destruct) {} };

template
void invoke(void* buffer) {
(static_cast(buffer))(); } template
void copy(void
dest, const void* source) {
new (staticcast(dest)) TYPE(*staticcastsource)); } template
void destruct(void* buffer) {
static_cast(buffer)->~TYPE(); }

template
const binding* getbinding() {
static const binding
t binding(invoke, copy, destruct); return &binding; }

That bit of code is the creation of a hand-made virtual function table. It supports three "virtual" methods: invoke, copy, and destruct. Just like the binding struct in the function code above. The reason to write this code like so is that there is no need to create an instance of a binding object for every bound function. Instead, a single static instance of the bindingt struct for any given type can be retrieved by calling getbinding<> for the target type. It returns a pointer that never needs to be deallocated or cloned. That binding_t instance can be stored in function, the allocation of binding objects goes away, and the problem is solved.

That is the ideal solution. That is how I wrote the code originally. That is how it should be written. Except for a few problems.

It all comes down to get_binding<>. The problem is that there needs to be this global constant table, just like there is for any real virtual function table, but there's no way in C++ to actually accomplish that without forcing users to use macros and spread code around. One cannot just write code like so:


template
struct bindingwrapper {
static const binding
t binding(invoke, copy, destruct); };

Member variables cannot be instantiated inside of a class. C++11 would make this more doable with constexpr constructors, but alas, the compiler that many of us gamer developers are stuck with is a long ways away from full C++11 conformance, and the day when we actually all have licensed copies of whatever future unreleased version of said compiler supporting C++11 is likely going to be a year or three after that compiler is actually released. Hence I couldn't go the easy route.

The getbinding<> as written simply is not optimized well by many compilers. In theory, the static binding can be initialized at compile time, stuffed into a global table, and calls to getbinding<> could be optimized into a simple pointer load. Compilers don't actually do that. They might inline the call to get_binding<>, but then the code just gets bloated. There's also some thread safety concerns; one might not expect two threads writing the same values to the same global memory to be a real problem, but on certain RISC architectures, things don't always play out that simply.

Overall, looking at how compilers were dealing with the get_binding<> call, I wasn't happy. It's a nice trick for non-performance-sensitive code. For all I know, constructing function delegates is not performance sensitive. It really probably isn't. However, since optimization is the root of all evil, I decided to try a more evil trick to getting the desired level of performance, just to see if I could. (And lo, I could!)

Let's take a stroll down assumption lane. A class with virtual functions is implemented by storing a pointer to a virtual function table inside every instance of that class. This table contains pointers to actual function implementations. This virtual function table pointer is always stored at the very front of any non-derived class. This virtual function table pointer is just a regular pointer, so its size is sizeof(void). A class which has virtual functions but no member variables has only the virtual function table pointer, so the size of the whole class is sizeof(void). Derived classes that override virtual methods but do not add any new member variables will always be the same size as their parent (which is sizeof(void*) if you're keeping up) and keep the virtual function table in the exact same spot.

That's a lot of assumptions. Not a single one of which are guaranteed in any way by the C++ standard. A C++ compiler is free to implement virtual functions any damn way it pleases, so long the result follows all the behavioral rules laid out by the standard. It just so happens that on all "real" compilers that are of interest to me, all those assumptions are true. To be clear: the following code is relies very heavily on implementation-defined behavior. It is not truly portable. It would fail any strict C++ instructors homework requirements. If you wrote this code in an interview and didn't fully explain the assumptions made, why they were safe in this context, and why they're totally unsafe in general, you would not get the job. Clear? Good.

The trick then is to not only stuff the callable object into a buffer but to also stuff the binding object into a buffer. Because the binding object is just a virtual function table pointer, that buffer need only be sizeof(void). Since I was getting a little lazy by the end of coding up this experiment, I actually just left it as a void. This had a useful result: as a void, I could just copy the values like any other void and easily copy virtual function table pointers around.

Granted, the resulting code is full of ugly casts, but that's the nature of this kind of over-optimized low-level hacky code. The amount of reinterpret_cast<>s (more than none) is strong indication that this is all probably just a bad idea.

Looking back at the previous working function example, here it is extended to use the hack rather than allocating binding objects:


struct function {
union { double alignme; char buffer[sizeof(void*) * 3]; };

struct bindingbase { virtual ~bindingbase() {} virtual binding_base* clone() = 0; virtual void invoke(void* buffer) = 0; virtual void copy(void* dest, const void* source) = 0; virtual void destruct(void* buffer) = 0; };

template struct bindingimpl : public bindingbase { virtual bindingbase* clone() { return new bindingimpl(); } virtual void invoke(void* buffer) { (static_cast(buffer))(); } virtual void copy(void dest, const void* source) { new (staticcast(dest)) TYPE(*staticcast(source)); } virtual void destruct(void* buffer) { static_cast(buffer)->~TYPE(); } }

void* binding;

function() : binding(nullptr) {} template function(const TYPE& bind) { new (&binding) bindingimpl(bind)); reinterpretcast(&binding)->copy(buffer, &bind); } function(const function& rhs) : binding(rhs.binding) { if (binding != nullptr) reinterpretcast(&binding)->copy(buffer, rhs.buffer); } ~function() { if (binding != nullptr) reinterpretcast(&binding)->destruct(buffer); }

function& operator=(const function& rhs) { if (this != &rhs) { if (binding != nullptr) reinterpret_cast(&binding)->destruct(buffer);

  binding = rhs.binding;

  if (rhs.binding != nullptr)
    reinterpret_cast<binding_base*>(&binding)->copy(buffer, rhs.buffer);
}
return *this;

}

void operator()() { if (binding != nullptr) reinterpret_cast(&binding)->invoke(); } };

It's actually slightly shorter and smaller, which at least is nice. Most of the black magic comes in these two tidbits of code:


new (&binding) bindingimpl();
reinterpret
cast(&binding)->

The first line creates a new instance of binding_impl on top of the void* binding member. If that makes you question your sanity, good. The binding member does not point at the binding! It is the binding! What it points to is the virtual function table, which is some implementation specific magic which not even I will make assumptions about. The binding member variable is essentially at this point just an opaque handle, nothing more.

The reinterpret_cast<> magic is likewise important. Notice that I'm casting the address of binding, not binding itself. A pointer is needed to the memory that contains the virtual function table in order to actually invoke functions with it. The memory that contains the vtable pointer is the binding member. Hence, a pointer to that member is required.

Summary

That's the gist of what I did. Nothing super complicated, just a lot of super black magic. The actual code (which actually compiles and works; the sample code above most likely does not) is included below. The final code has a few other small optimizations, some changes to the API, and nominally supports C++11 move semantics.

Note that neither the example code in the article or the working code below supports arbitrary function signatures. That alone is an entire article's worth of content, and for this experiment I wanted to focus on internal implementation details rather than all the pain of supporting arbitrary signatures (especially as certain compilers still lack variadic template support, which makes supporting arbitrary signatures much much easier).

TinyExperiments - FixedSizeDelegates