Variable Refresh Monitors and Animation


So variable refresh rates have a lot of hype surrounding them now, with representatives from GPU and monitor vendors are saying that it will just magically make existing games better with no effort required on the part of the developer. This is bullshit and you shouldn’t believe them. If you’re a developer you need to be careful to handle these new refresh rates and if  you’re a customer you should consider just locking your monitor to 60Hz for games that don’t do the right thing.

The problem

The issue is that games do animation. So in order to compute a frame for display it needs to essentially predict when that frame will be presented, and then compute the positions of all the objects at that time. This is only partially true because there’s lag in the display pipeline that we don’t know anything about, but since that lag affects every frame the same you can typically just ignore it and just advance the game state by 1/60Hz each frame and your animation will look smooth.

However, if the update frequency is variable then the amount of time it takes to render and display a frame changes every frame. So when we’re computing the next frame we don’t know how far ahead to “step”. It may have been 16ms last frame, but this frame it may be 15 or 17 – we just can’t tell in advance because we don’t know how long it will take the CPU and GPU to finish the frame. This means that objects no longer move smoothly in the world, they will move in a jerky fashion because the amount they get advanced each frame isn’t consistent with how long it took to actually display the frame.


So the first solution here is to just lock your framerate at some fixed number. 60Hz is a decent option, but if you have variable refresh rates maybe you’ll go for something odd like 45Hz or whatever. In fact, I would be surprised if the sweet spot for smoothness/performance just happens to be 60Hz by sheer coincidence. However, this is troublesome because our graphics APIs don’t give us an easy way to do this right now – perhaps these fancy new monitors could just let us set the refresh rate at arbitrary intervals instead and we can go back to using vsync, perhaps with some kind of clever fallback (if you miss the vsync, it uses variable refresh to present the frame as quickly as possible and then “resets” the framerate at that point, so that the next “vsync” happens 1/refresh later).

However, picking the refresh rate up front is tricky, even if we can pick any number we want. The game is going to run better on some scenes than others, and it will also vary from system to system. Maybe this could be automatic? The game could render each frame with a “fixed” target in mind, this target is set in such a way that we’re very confident we can hit it, and all the physics and animation is updated by that amount of time. If we can simultaneously tune this target based on how long it typically takes to update the frames, we’ll end up running faster where we can, and slower where we need to. So the idea is this:

  1. Render the frame.
  2. If the current frame time is less than the target frame time, do a busy wait then present just at the right time (this should be the overwhelming majority of frames).
  3. If the current frame time is more than the target frame time, present immediately and increase the target time.
  4. If the frame time was significantly less than the target frame time, and has been for some time, slightly decay the target frame time so that we get better fluidity.

There are many ways you could tweak this. For example, maybe you pick a target frame time based on a histogram of the most recent 100 frame times, or something – picking a time so that 99 of them would’ve come in under budget. Or maybe you fit a normal distribution to it and pick the 99th percentile target frame time.

Or maybe something simpler could work. For example, we could make sure our target frame rate is always 10% higher than the longest frame in the last second (or something). This means that as our actual frame times drop the target frame time will drop too (after a second) and we get better fluidity. If our frame times creep up slightly we will pre-emptively bump up our target to 10% on top of that to give ourselves breathing room in case they keep going up. If we do end up with a frame spike that’s worse than 10% of our worst frame in the last second, we will immediately bump up the target frame time to 10% above that new worst case. You would tune that percentage based on your game – if your frame times vary slowly you could use a lower buffer percentage, and if it varies quickly you may need a larger buffer percentage.

The biggest problem with this issue is step 2 above. The busy wait. In a typical renderer this means we can’t actually start the doing rendering for the next frame because we have to baby-sit the present for the previous frame, to make sure it doesn’t fire too early. This wastes valuable CPU cycles and GPU/CPU parallelism. Maybe in DX12 we could have a dedicated “presenter thread” that just submits command buffers and then does the busy-wait to present, while the main render thread actually does the rendering into those command buffers. Really what we want here is a way to tell the OS to sync to an arbitrary time period. For example, if the last couple of frames have all taken around 20 ms (which we can tell by measuring both the CPU time, well as inspecting some GPU profiling queries a frame later to see how long the GPU took) we could call some API to set the vsync period to be 22 ms just to give us some buffer and avoid jittering. This way the OS could potentially use a HW timer to avoid eating up CPU cycles.

Either way, let’s at least stop pretending that variable refresh rates are magic that can just be enabled for any game and look good. The developer needs to know when these things are enabled and take steps to make sure animation and physics remains smooth. It would be ideal if future APIs could actually help us out in doing this, but either way we need to do something to handle this.

Flags are a code smell


In most major code bases, it’s common to have objects with lots of Boolean flags indicating what state they’re in. E.g. is an enemy targeting the player or not? Is it trying to flee or not? If you have a set of mutually exclusive flags an enum can be used instead, but this is often not possible, and doesn’t change much in terms of the downsides.

In my opinion these are code smells. Smells don’t necessarily indicate actual issues, just patterns that are likely to be used incorrectly and thus warrants extra care.

In many cases if you think about how to avoid flags you’re forced change the code in ways that result in clearer code (specifically several simple functions rather than one general function that checks a dozen flags), as well as improved performance due to fewer branches and cache misses. Let me stress that the main gain here is the improved code quality, so don’t be fooled into thinking this is some kind of optimization – better performance is a happy side effect. And yes, there are cases where flags are the clearer option as well (see below).

An example

For a somewhat contrived example, consider a game rendering system. Each object in the world has some rendering state associated with it, this would be things like what meshes and materials to use, but also various flags determining how it’s to be rendered.

Let’s say we have a flag called isDynamic, which tells us if the object can move or if it’s static. We can change the flag throughout the life of the object, but when it’s set to false we assume it’s going to be static for some time (so we pay some up-front cost to pre-calculate some acceleration structures that pay off during actual rendering).

Sometimes you want to temporarily toggle rendering for an object so we check this in the renderer as well by adding an isVisible flag.

In addition, the object as a whole may have an isEnabled flag which lets us disable the whole object (physics, animation, rendering all ignore it when it’s not enabled).

Ok, so far we have three flags which means we have 23 = 8 possible combinations of flags, but do we really have eight separate legal states? No! E.g. an object which is not enabled can’t be visible. You could also argue that an object which is disabled or invisible is neither dynamic nor static. So we have significantly more possible states than legal states, and it’s very easy to accidentally end up in an illegal state by forgetting to check a flag in one of the many branches throughout the renderer (e.g. perhaps you go down the “isDynamic” branch somewhere for an object which is invisible, thus getting yourself into an inconsistent state). As you add more flags the set of possible states explodes, and it’s common for the interdependencies to be complex and easy to mess up in all the places you have to check.

An alternative

Instead of storing all these flags, take a step back and think about what the point of each flag is. The isDynamic flag is there to distinguish between dynamic and static objects. A different way of doing this is for the renderer to simply keep an array of dynamic objects and an array of static objects.

The isVisible flag is different. It means the object should be in neither the dynamic or static array. For the whole-object enabled flag we can do something similar, but for each sub-system.

The exact way you avoid any given flag will differ depending on what the purpose of the flag is, but you can usually find a way to avoid it.

So that’s the point of all this? Well the main benefit is that all the code spread out through the renderer can now be simpler. Instead of checking flags for a complicated set of legal states all over the code, we just have completely separate functions for each kind of state. E.g. there’s one method to go through all the static objects and render them, another to go through all the dynamic ones. These functions may call into shared helper functions here and there (to avoid duplication), but if you want to know how dynamic objects get rendered there’s a single place to start reading, and it will be easier to understand because the code is handling fewer cases.

The complexity isn’t entirely gone, it’s just moved. It’s now in methods like “SetVisible”, “SetDynamic” and so on. This is far better, since the complexity is now localized, making it easier to ensure correctness.

A brief note on performance: As we’re now separating things into different arrays to keep the logic clearer, you can often also split the data up across different arrays, which improves the cache locality for the functions that deal with that particular part of the state. For example, objects may use different data if they’re static or dynamic so if you can avoiding have to “step over” the static data in the dynamic code-path that’s going to save lots of cache misses.

What about checking which state an object is in?

If you use the proposed scheme most code never actually have to check any per-object flags, because the data already comes pre-sorted to the “processing” code (and is kept sorted by the state transition code). However, there are exceptions. Sometimes you really do want to check what state a specific object is in. This means doing an array lookup for each of the different arrays to see if the object is there.

To make this fast, you could store the index for each array in the object, which gives you fast lookups. This does means you’re still storing state-related information in the object itself (one index per array), but it can be private information so you still get the other benefits – the renderer still doesn’t have to check flags.

Another option is to make sure each object uses the same index in all the arrays it’s in. You generate a unique ID for each object, and use this as the location for the object in all arrays. The obvious way to do this is to allocate MAX_OBJECT_COUNT-sized arrays for all possible states, but that can of course be wasteful (and costly – gaps in the array cost cache misses when traversing). A better way is to use a “sparse array”, which only stores objects that are actually in the array, but still allows O(1) lookup. A hash table is a reasonable backing-store for this, but you can do better if you let your IDs be unique ints with uniform distribution (no need to actually hash anything, use the ID as-is in a Robin-hood style hash table).

When not to use this

I said above that flags are a code smell, not code bugs! There’s plenty of cases where using flags is entirely appropriate. E.g. if you set flags multiple times per frame then the increased cost of moving it between different arrays all the time is likely a bad idea. Also, if you use the sparse array idea above checking a flag is a lot more expensive so if you do this frequently you should probably at least cache the result of this lookup in a flag. Another scenario is if you only have one of something, there’s really no reason to have multiple arrays in that case, since most arrays will be empty. Also, sometimes the added cognitive overhead of multiple arrays just isn’t worth it if the code is trivial. Last but not least, sometimes it’s just damn difficult to figure out how to avoid a flag and spending too much effort jumping through hoops to satisfy some rule of thumb is just dogmatic.

Even if you go with the suggested approach above, you may still need an actual flag to handle the state transitions appropriately. E.g. if you make an object visible after it’s been invisible you probably need to know if you should add it to the dynamic or static array now that it’s visible. The point isn’t to have no flags at all (despite the title), it’s to move the resulting complexity to the code that actually deals with the state transitions, instead of spreading (and duplicating) that same complexity all over the code.

More resources

If this approach appeals to you, you really should look into data-oriented design. I’m a fan of this approach (first and foremost consider the data, and how it gets processed, don’t obsess over object identity), but I would advocate moderation. There are often benefits to having some of the state with the object, instead of having everything be implicit based on what component arrays it’s in (in terms of readability, and sometimes even performance). Resist the temptation to be an extremist about this, that’s really no different than the programmer who uses design patterns every chance they get.

For more on data oriented design, check out .

Another good intro to this and other things, from a performance perspective is “Code Clinic: How to Write Code the Compiler can Actually Optimize” by Mike Acton at GDC 2014 (available at the GDC vault with membership – you really need to see the talk here though, slides alone aren’t enough).

Another practical case study is Pitfalls of Object Oriented Programming by Tony Albrecht.


The Perils of Future-Coding



In my opinion, one of the most insidious form of technical debt is what I like to call future-coding. You might also call it over-engineering, second-system syndrome, etc. It’s code written in an overly elaborate and general way in order to, the future-coder reasons, pre-emptively handle use cases that we’ll “probably see in the future”.

I find it more treacherous than plain “stupid code” because it predominantly affects smart people. It’s mostly smart people with limited experience – once you see the downsides of future coding a few times, you tend to knock it off (although there are powerful confirmation biases at work) – but still, these are bright people you’d want to hire and keep around in you organization because they’ll do great things in the future.

Let me illustrate my point using a hypothetical example with two equally smart programmers, where one just has a little bit more experience than the other and is less prone to future-coding.

The Tale of Two Street Lights

Let’s say you’re making a game where a large portion of the time the player is navigating an urban environment. One day the artists ask for a feature to make it easier to place street lights in a level. They say: “Can we have a way to click a path through the level and have the game place lights at an even spacing along it?” This seems like an easy enough request so you start implementing it.

Here’s where our story forks into two divergent paths.

First option – the pragmatic coder

The pragmatic coder writes some quick UI in the editor where the artist clicks on points in the level that gets stored into an array.

On the runtime side he writes a dozen-line for-loop that simply walks the path segments at equal spacing and spawns a street light each time. Job done, the pragmatic coder moves on to the next task.

Second option – the future-coder

The future-coder writes the same UI as the pragmatic coder, but for the runtime he decides upon a different approach. After all, he’s read books about design patterns and large scale software design, and he knows about the importance of the Single Responsibility Principle, and keeping code factored for changes that might occur in the future.

Thus, he decides to abstract over the exact type of path used. Sure, the artists only asked for a simple path today, but in the future they might want a B-spline or some other kind of curve, so he pulls out the logic for the path into its own Path object that can step along the curve in some kind of iterator fashion. Of course, that’s not enough, in order to change the behavior of the curve he puts this Path object behind an interface and adds a factory to spawn the right kind of path based on some kind of configuration parameter. He also updates the UI to allow for future kinds of paths. Maybe he even implements a B-spline option just so that he can test this properly with more than one option.

Now, the code to spawn street lights first creates an abstract path from a factory, but how does he determine the spacing of the objects? The artists asked for simple equal spacing today, but in the future they might want something different, so he writes another abstract class that handles the spacing policy (with a factory to go with it).

Street lights in your game happens to be reasonably symmetric, so today just spawning them with their default orientation works fine, but in the future the artists might want to use this system to spawn objects where the orientation needs to depend on the curve, so our future-coder updates the path object to also supply tangents and bi-tangents, and writes an abstract class to handle the specific placement policy in order to take this new information into account.

Phew, that was a lot of work – he’s added a lot of new classes and glue code, and the code that actually uses it is a lot more opaque and hard to read than what the pragmatic programmer came up with. It will be worth it in the future, however, when this extra flexibility will pay off. The future-coder goes home late today, but feels good that he’s written an elegant system that will stand the test of time, rather than just a dumb one-off implementation.

Reality Kicks In!

Let’s go through a few scenarios of what might happen next.

Good news, no changes!

Turns out this simple system was all the artists ever needed. They never ask for any changes, and the code is shipped unchanged from the original implementation.

The pragmatic coder feels good that his code did exactly what was needed and never caused any issues or technical debt in the project.

The future-coder feels bad that he spent all this time on an extensible system and nobody ever made the most of the flexibility in his elegant design, but at least it was used at all and the artists liked it, so it’s not too bad.

The feature gets cut

A week later the artists decide that they don’t mind placing street lights manually after all, since it gives them extra control. The automatic street light placing feature goes unused, and is eventually deleted.

The pragmatic coder barely remembers the feature and couldn’t give a damn.

The future-coder feels bad that his elegant system was completely deleted. He grumbles to the other programmers about the awful artists and designers who always keep changing the requirements and wasting so much of his time asking for stuff that they end up not using.

Requirements change

The artists are happy with the feature, but they have one minor request: “Yes, I know we asked for equal spacing of the street lights, but actually we want to make sure that there’s always a street light on each corner of the path, and that the spacing within each path segment is slightly adjusted to ensure this”. To the artists this feels like a simple change so they don’t feel too bad about asking for it.

For the pragmatic coder this is a trivial change. He updates his dozen-liner to first compute the number of street lights required for each segment by dividing the length by the desired spacing and rounding it to the nearest integer, then computing an updated spacing by dividing the segment length by this count. This takes a few minutes of his time and he moves on to other things.

The future-coder starts thinking about how to accommodate this new request into his streetlight-placing framework. Shit-shit-shit, he thinks, a frustrated panic starts to spread in his mind. He never anticipated this change and can’t really see how to update his system to handle it. What does it even mean to place a light at “each corner” when the path is completely abstract? After all, if a future path is defined with a spline, the “corner” may not have a control point on it.

And even if you did have a way to define what a “corner” means, how do you feed that data to the object that governs the spacing policy, now that it’s completely decoupled from the path object? This will require some major re-architecting he thinks, and postpones the task until he can find a larger slot in his schedule. Later that night the future-coder sits in a bar with a programmer coworker, complaining about those damn artists that keep changing the specs on him, oblivious to the irony that it’s precisely his overly complicated attempt to be robust to spec-changes that is causing him problems in the first place.

Ah, a solution! He decides to define “corners” in terms of the curvature of the path, with a configurable threshold, and then feeds this data to the spacing policy object so it can use it to compute the appropriate spacing. The code gets kind of messy since the path and spacing classes are now getting more coupled, and there’s a lot of tunable parameters involved, but it does the job so he checks it in and goes home.

The next day the artists informs him that his solution doesn’t work at all. They want to use this as way to ensure exact placement of streetlights even on straight-line stretches of road, by placing dummy control points in the path. So this notion of using curvature to define street light positions just doesn’t work.

Let’s leave the future-coder to work out a solution to this self-inflicted problem. I have no idea what he could be doing to rescue his system at this point, but whatever he comes up with it’s unlikely to be pretty, leading to even more technical debt and complexity.

Requirements change – redemption for our future-coder!

In this scenario the requirements change in a different way. The artists now want to have the option of using either a straight path with sharp corners, or a smooth B-spline path.

The pragmatic coder adds a flag to the UI, and a switch in his runtime code that either calls the old placement code, or calls some new code that interprets the control points as a B-spline and steps through them using his math library’s B-spline code. The code is still easy to understand, almost certainly contains no bugs, and it didn’t take too long to update.

The future-coder feels redeemed! He may even have had B-spline option already implemented so he can simply point the artist at the appropriate UI option. But even if he has to implement the B-spline option from scratch it’s a simple matter of implementing a new Path object that iterates over a B-spline, and update all the relevant factories and configuration options. Yes it’s more code than what the pragmatic coder needed to write to make this change, and it’s still much harder to read and navigate, and it’s not at all as clear that it doesn’t contain bugs. But look at the UML-diagram! It’s so elegant, and follows the single responsibility principle, and dammit we’ve already seen that we’ve needed one additional option already, so in the future when the number of options grows to a dozen or so we’ll really see the benefit of this elegant framework!

The future-coder wins the lottery

This time the scenario is that the artists keep asking for more and more kinds of paths over time, with increasingly elaborate spacing options, and it becomes one of the most reused systems in the whole project.

After the a few different path options have been added, each having a few different spacing options, the pragmatic coder starts seeing the limits of his original design, and decides that a more abstract and generic design will have greater benefits than downsides. So he refactors his code so that the path is specified by an abstract path object, and pulls the spacing policy out into its own object too. At the end of the day he ends up with a system much like the future-coder, but doesn’t feel too bad because he spent almost no time on the original systems in the first place, and the simplicity of them meant he could get the initial system going quickly, and respond to change-requests faster than the future-coder could’ve.

The future-coder feels like a goddamn programming genius. He totally predicted this and it paid off! He is now more certain than ever that just spending some time up front predicting the future will pay off down the line.

He’s blind to the fact that for every time he happens to get it right, there are a hundred other instances where his failed predictions led to unnecessarily complicated code, time wasted debugging because you can’t tell that the code obviously has no deficiencies, and additional development effort because the extra scaffolding required by his general systems made it harder to address changes that weren’t anticipated.

It’s like he’s won at the roulette table in Vegas – he’s convinced that he’s just that good, that he’s found a system. He’ll keep spending his money at the casino, without ever really doing the math on whether or not his overall winnings is higher than his losses. Everyone says they’re up on average in Vegas right? It’s a mystery that Casinos still make money!


We probably all have a bit of the future-coder in us, but we must resist him! The bottom line is that the future-coder almost never wins. In reality, predicting the future is virtually impossible, so don’t even try. It hardly ever pays off. Best case, you’ve just wasted a lot of time, but in reality abstraction has cognitive overhead throughout the project when reading or debugging code, and can make it harder to deal with unanticipated changes (ironically), as well as cause performance issues. And even when it does pay off it doesn’t actually buy you all that much and is actually a net negative because it skews your intuition into thinking that future-coding is a good thing.

The human brain didn’t evolve to think in abstract terms. You will probably have noticed this when trying to understand a new mathematical concept – it’s much easier if you start with concrete examples rather than the fully general definitions and proofs! Simple code that just does one thing in the dumbest way possible takes very little time for someone to understand, whereas if you have to jump through abstract classes and factories (each named in suitably abstract ways) it can take an absurd amount of time to understand something that really isn’t all that complicated.

All these extra virtual calls act as barriers to inlining, in addition to the overhead of the virtual function calls themselves, which cause performance issues. Yes, I know you’ve been told that you shouldn’t worry about performance for code that isn’t in a hotspot, but the reality is that for many applications there aren’t any real hotspots anymore. It’s depressing to look at the profiler for gameplay code and see that your warmest “hotspot” takes 5% of your frame time, and has already been optimized to death. By that time it’s often too late – performance has been siphoned off by tens of thousands of small inefficiencies added over many years, all because people didn’t want to worry about minor performance hits for code that wasn’t on the mythical “hot-path”. Simple code tends to be cheaper to execute, most of the time, in addition to being better in almost every other way as well.

So am I saying that we should just hack things together in the quickest way possible, with no regard for future maintainability? No, of course not. I’m advocating doing things in the simplest way possible, not the quickest way possible. Often, the simplest way requires removing a bunch of cruft that has accumulated over time, deleting stateful flags, pulling logic out into clearly named functions etc. Sometimes, it even requires an abstract factory.

You should optimize for simplicity, readability and modifiability. If you do, you will also end up with code that’s more maintainable, easy to modify and optimize in the future. Take the example above where our heroes needed to add support for a B-spline. Let’s assume that there was no library for interpolating a B-spline already so that they had to add it. I’d expect both types of programmers to pull that code out into its own function. My motivation for doing this isn’t that I’m trying to predict the future, though! You pull it out into its own function because doing so makes the code easier to read today, by making sure the B-spline functionality is hidden behind a sensibly named function instead of cluttering up the “flow” of the main function with unnecessary details. The minor cognitive cost of the extra function call is vastly outweighed by the increased readability of hiding unimportant details. It’s not about predicting the future, it’s about increasing readability.

Sensibly named function calls without side effects almost always add more to readability then they cost, if they’re virtual less so, if they have side effects even less so, if they’re in a class even less so, and so on. It’s all a tradeoff, and the break-even point for any given abstraction will change depending on the specifics of what you’re doing.

Very few abstraction techniques have no merit at all, but the road to programmer hell is paved with best-practices, applied too early. You should postpone applying these abstraction techniques until the cognitive and other costs of using them are clearly outweighed by their benefits (e.g. when you have demonstrated reusability for a component). There are often genuine systems even in game engines where it’s perfectly reasonable to apply many of the abstraction techniques you’ve read about in books about design patterns and large scale software design etc., but the vast majority of code is better off if you keep things simple, and avoid trying to predict the future.

More on Robin Hood Hashing


I posted about Robin Hood hashing in a previous post. In the comments to that post David Wragg pointed out a flaw in my implementation: if you repeatedly delete and reinsert elements into a hashtable, the average probe count keeps rising. This is a good observation, and it’s true. Repeatedly deleting items makes the table slower but it (until you insert enough to trigger a rebuild of the table, which will filter out any tombstones – note: the number tombstones do not affect the rate of rebuilds).

Now, first of all it’s worth noting that the probe count grows slowly. After a slight tweak to the insertion method, at least (if the existing element is deleted, then tweak the replacement condition so it happens even when the probe counts are merely equal, rather than the existing one being strictly smaller). In David’s test case it grew to about 30 after a million deletions (with my hash function it’s more like 20).

The other point to make is that we already accept periodic “batch” operations such as the table-rebuilding that happens whenever you hit the max load factor. For that reason, it’s relatively reasonable to just reinsert all items again (skipping tombstones) whenever some number of deletions have occurred (e.g. 5N or something – it doesn’t have to be very frequent since the average probe count grows so slowly with the number of deletions).

So, there’s a simple and fairly bullet proof solution to this problem, but what if we prefer an incremental approach, where we make each deletion slightly slower, but never have to run a big batch job to improve the average probe count?

I’m not sure if there are existing solutions with established asymptotic properties, but I tried a simple strategy that seems to work pretty well in my test. The idea is this: whenever we delete an element, we scan ahead for a small number of slots and rearrange them to limit (or remove) the impact of the delete on the average probe count. So long as this operation tends to improve the average probe count at a faster rate than the probe count would otherwise grow, you’ll end up with an average probe count that stays reasonable.

Here’s the code that runs at the very end of the “erase” method (after the element has already been destroyed):

int cur_ix = found_ix;      
for (int i = 0; i<NUM_DELETE_FIXUPS; ++i)
    const uint32_t next_ix = (cur_ix + 1) & mask;
    if (next_ix == found_ix) break; // looped around

    // if the next elem is empty or has a zero probe dist, 
    // then any query will always fail on the next elem anyway, 
    // so might as well clear the current one too.
    if (elem_hash(next_ix) == 0 || 
        probe_distance(elem_hash(next_ix), next_ix) == 0)
        elem_hash(cur_ix) = 0;
        return true;

    // Now, the next elem has probe count of at least 1, 
    // so it could use our deleted slot to improve its 
    // query perf. 
    // First, pretend we're a dupe of the next elem.
    elem_hash(cur_ix) = elem_hash(next_ix);

    // Since we're now a dupe of the next elem, ordering is 
    // arbitrary, which means we're allowed to swap places.
    // Note> current slot has been destructed, hence 'new'.
    new(&buffer[cur_ix]) elem(std::move(buffer[next_ix]));

    cur_ix = next_ix;
elem_hash(cur_ix) |= 0x80000000; // mark as deleted
return true;

You can choose a small number for NUM_DELETE_FIXUPS, or remove the condition in the loop (it’ll go through one pass of the table at worst, although that’s highly unlikely since it’ll stop on any empty element or an element with a zero probe distance). In my tests I’ve found that 2 will bound the variance and average probe count, but 3 will keep them slightly lower at marginal extra cost so that’s what I’ve set it to.

Again, this is just me messing around – I haven’t made any real attempt to prove the properties of this approach. There may be more rigorously investigated approaches in the literature somewhere. Given that there exists a simple fix with very low amortized cost (periodic reinsertion of elements), I don’t really consider this issue a deal-breaker even if this incremental approach doesn’t work out.

Language Design Deal Breakers


I’m a bit of a programming language nerd. I love to learn new languages. That said, I spend most of my days writing C++. It’s a truly awful language, with many warts and problems, but I know it well and with enough effort and pain you can get the job done. The biggest benefit of C++ is purely an historical accident: it’s great because it’s popular. There’s no trouble finding people who know it, or libraries that work with it.

The point is, that in order for me, or anyone else, to actually switch from C++ to a different language, there has to be a substantial improvement to overcome the inertia that C++ has built up over the years.

I find that when I look at a new language, there are a number of “deal breakers” which simply mean that regardless of all its other nice features I will never take it as a serious contender. Note that this isn’t a fair fight. Something can be a deal breaker even if C++ doesn’t have that feature either. Any language has to be so much better than C++ that the benefits outweigh the inertia. A deal breaker is a feature that either A) puts it at a disadvantage compared to C++ or B) puts it on equal footing with C++ in an area where C++ is doing particularly poorly.

Here’s my list of language design requirements, which if unmet would be deal breakers for me, what’s yours?

1. Must be statically typed

I know you’re supposed to be diplomatic and claim that there’s two sides to this story, and no real right answer, but really people who think dynamic typing is suitable for large scale software development are just nuts. They’ll claim it’s more flexible and general but that’s just nonsense in my opinion. It may be true compared to utter strawman examples of static typing. If you think this is an argument, go learn Haskell, and then get back to me.

There’s a wonderful correspondence between programs that can be statically typed, and programs that can be statically understood by other people. In other words, programs that you can understand without having to run a debugger or REPL, by just reading code. If you’re writing programs that rely on dynamic typing to work, you’re probably writing exceedingly difficult to understand code. If you are writing code that can be statically understood by other people, you’re probably writing down comments to indicate the allowable types and preconditions anyway (or you should) – so you might as well have those properties checked by the compiler.

Most code does not rely on dynamic typing to work. They could be written just as well in statically typed languages. In that case it’s just a matter of convenience – make specifying types light-weight enough that the extra syntactic overhead doesn’t outweigh the benefits (type inference helps, even the simple “local” kind that C# and C++ 11 has cuts down on much of the noise).

Then there’s the correctness issue. Static typing catches mistakes early. This is a good thing. Now, dynamic typing advocates will say that of course any real program will also have a unit testing suite, so types can be checked that way! This is obviously naïve in the extreme. In the real world writing unit tests for everything very frequently fails the bang-for-buck test – the cost is just too high. This cost comes in two forms, first the cost of writing the tests, and second the friction it introduces to future modifications and refactorings – after all if you have to update 20 unit tests to refactor a messy interface, you’re a lot less likely to do it. Examples of code you probably don’t want to write unit tests for includes gameplay logic which gets written and rewritten twenty times before ship as the result of iteration. Having to write test harnesses and tests for all this throwaway code is madness. Then of course there’s code that’s just plain hard to unit test, such as graphics. The point is that pervasive unit testing is a fantasy – it may be feasible in some domains, but it’s certainly not the case everywhere. Static typing is a great compromise – you basically document the things you would’ve put in comments anyway, but you do so in a standardized way that can be checked by the compiler. This catches many errors for very little cost. When unit tests makes sense (as they sometimes do), you can add those too for even more coverage, but when they don’t you still have a basic sanity check.

Finally, performance. Yes, JITers can be very clever and employ tracing and stuff to close some of the gap here, but of course those techniques could be used by a statically typed language implementation as well. If we can statically decide what methods to call and what types to use we will always be at an advantage compared to a language where this has to be determined at runtime. Not to mention that statically typed languages will encourage you to write code that can be executed efficiently (e.g. variables don’t change types over the course of a program). Also, static typing allows you to make performance guarantees that won’t be randomly broken tomorrow. For example you can store an unboxed array of values, or store a value on the stack or embedded within another object. You don’t have to rely on a runtime tracing system to uncover and apply these optimizations (with typically no way to ensure that it actually occurred), which also means you don’t have to worry that you’ll fall off a performance cliff in the future when something subtle changed which caused the JITer to fail to optimize something. The programmer is in control, and can statically enforce important performance-related choices.

Of course if you want to allow dynamic types as an option for a subset of cases where it makes sense (e.g. dealing with untyped data read from external sources), that’s fine.

2. Memory safety

C++ fails this one, but it’s still a deal breaker. If I was content with worrying about memory scribbles and random access violations I’d just stick to C++. In order to offer enough benefits to convince me to switch from C++, your language needs to be memory safe.

Now, let me also state that I think the language needs an “escape hatch” where you can tag some function or module as “unsafe” and get access to raw pointers, the point is that this needs to be something optional that you have to explicitly enable (using a compiler switch, ideally) for very isolated cases, and you should never be required to use it for typical “application level code”.

This also implies some kind of guarantees about memory reclamation…. We’ll get to that.

3. Yes, memory safety – no null pointer exceptions!

I use a broader definition of memory safety than some. In my view, if a program follows a reference and crashes because the reference is invalid, it’s not a memory safe language. This implies that in addition to making sure you never point to stale memory or memory of the wrong type (through type safety and automatic storage reclamation), you also have to make sure that a reference never points to null.

Null pointer exceptions have no business in modern languages. Getting rid of them costs no performance, and statically eliminates the one big remaining cause of runtime crashes that we still see in otherwise “modern” languages. Yes, it requires some thought (particularly w.r.t. how you do object construction), but this is a solved problem. There is only one valid reason not to do this and that’s legacy code – perhaps you didn’t know what you were doing when you started off, not realizing the issue and how simple the solution is, and now you have too much legacy code to fix it. A language without legacy issues should get this right, and yet many don’t.

Note: it’s not enough to simply have a “non-nullable-pointer” type, not even if it’s checked at compile time. You have to make sure that nullable pointers cannot be dereferenced. In other words, using the pointer-dereferencing operator on a nullable pointer should literally be a compile-time type error. You should have to do a check of the pointer first which introduces a dynamic branch where on one of the sides the type of the pointer has been changed to be non-nullable, which enables the dereferencing operator again.

Having every few statements potentially trigger a runtime crash is an extremely expensive price to pay… for nothing! We’re not talking about a pragmatic tradeoff here, where you pay the cost of reduced reliability but get something else in return – there’s just no real upside to allowing dereferencing of potentially null pointers.

4. Efficient storage reclamation

I mentioned above that I require automatic storage reclamation, well I also require it to be efficient. Too many modern languages fail this. I don’t really mind how they achieve performance, but two ways you could do it is to support pervasive use of value types (i.e. types that get embedded within their owners – which drastically reduces the number of pointers in your heap, and therefore the cost of GC), along with thread-local heaps. Concurrent collection is another solution, though you’d have to be careful not to incur too much of an overhead on the mutator with whatever memory barrier you use. I’ve written about this issue twice in the past: one two

5. Great Windows support

Windows is still the most popular OS by far. If you don’t give me installers, or easily-buildable code, for windows I’m not buying. I realize a lot of academia is used to *nix, which causes a disproportionate preference for linux and mac in projects such as this, but in the real world your customers actually use windows so you should make sure it works.


The most common one of these deal breakers is probably dynamic typing. I usually just close the tab immediately when I see that a language is dynamically typed. So while strictly speaking this is the most common issue, it doesn’t feel like it because I never invest any amount of time even considering those languages anymore. They may have some benefits in terms of theoretical elegance (e.g. Lisp, or even Lua or IO), but for actual work I’ve seen too much of the benefits of static typing, and too much of the horrors of not having it. It’s so obvious, that it barely qualifies as a deal breaker – it’s not even in the running to begin with.

So really, the most common deal breaker is probably point 3 above – Null pointer exceptions. Whenever I see promising languages like Go, D or Nimrod, that still repeat Hoare’s billion dollar mistake I get a bit sad and frustrated. It’s like seeing a beautiful, elegant, car that has just one tiny flaw in that the wheels occasionally fall off. It simply doesn’t matter how many awesome and innovative features you have if you don’t get the fundamentals right.

Robin Hood Hashing should be your default Hash Table implementation


There’s a neat variation on open-addressing based hash tables called Robin Hood hashing. This technique isn’t very well-known, but it makes a huge practical difference because it both improves performance and space utilization compared to other “standard” hash tables (e.g. chaining). It also eliminates the major downside of other open addressing hash tables.

Here are the benefits, summarized:

  • High load factors can be used without seriously affecting performance. 0.9 is perfectly reasonable as a default (0.95 or higher works too, it mainly affects insertion cost a bit).
  • No linked lists or other extra pointers. This reduces cache misses and storage overhead. Your underlying structure can be a simple flat array since it’s just open addressing under the hood.
  • Lookup and insertion logic is fast. Again, no linked lists to traverse or other complications, just a linear probe sequence with a few checks per element.
  • Unlike other open addressing schemes, looking for non-existent elements is still fast.

For a simple benchmark where I inserted 100k english words into a map, then deleted 10% of them, and then looked them all up again, the timings for my simple Robin Hood hash table was 23%, 66% and 25% lower for insertions, deletions and lookups respectively compared to the VS 2012 std::unordered_map, it did this using 30% less memory overall.

It’s all about the variance

So the basic idea is to take normal open addressing, but use one clever trick in order to drastically reduce the variance of the expected average and maximum probe lengths. We’ll se later why reducing variance is so important. To give you an idea how Robin Hood Hashing improves things, the probe length variance for a RH table at a load factor of 0.9 is 0.98, whereas for a normal open addressing scheme it’s 16.2. At a load factor of 0.99 it’s 1.87 and 194 respectively.

The clever trick is just this: when you probe for a position to insert a new element, if the probe length for the existing element is less than the current probe length for the element being inserted, swap the two elements and keep going.

That way elements that were inserted early and thus “lucked out” on their probe lengths, will gradually be moved away from their preferred slot as new elements come in that could make better use of that place in the table (hence the name – the insertion “takes from the rich”, i.e. the elements with low probe counts). It leads to an “evening out” of the probe lengths.

Why is low variance better? Well, with modern cache architectures a probe count of 1 isn’t really much faster than a probe count of 3, because the main cost is fetching the cache line, so the ideal scenario is for all elements to have the same probe count, and having that probe count fitting within one cache line.

It turns out that Robin Hood hashing has the same expected probe count as normal open addressing (about 2.55) – it just has substantially less variance, and therefore ends up with far fewer cache misses. Yes, there are fewer elements that can early out after 1 probe, but there also far fewer elements that end up needing to fetch multiple cache lines because of long probe lengths.

Furthermore, the expected value of the longest probe sequence approaches about 6 for a load of 0.9 (it’s not actually constant, but grows very, very slowly – see this paper), which is pretty reasonable.

What about failed lookups?

So now you’re probably thinking that this sounds good, but in the back of your mind you know that normal open addressing tables do pretty well too, but have a major downside in that searching for non-existing elements is troublesome to handle. To allay your fears I’ll first tell you a very simple (but totally practical) solution that is enabled by the low variance, and then we’ll see the even better version later.

First a recap. The usual problem is this: since the search algorithm terminates when you find an “empty” slot in the underlying array, it can take a very long time to determine that an element doesn’t exist in the array when the table grows full.

How does Robin Hood hashing solve this? Well the simplest solution is to exploit the fact that the expected longest probe count is low (~6). Just modify the standard search algorithm to ignore empty slots (rather than terminate the search) and keep going until you’ve probed longer than the known maximum probe length for the whole table. This maximum probe length will be small, with a very high probability, even for very loaded tables.

This solution would obviously be inefficient for the standard open addressing scheme, since the expected longest probe count is much higher (e.g. at a load of 0.9 and ~64K elems it is about 70).

The details

In order to know what the probe count of an existing element is (which is key to the algorithm) we could just re-hash the element to re-compute its “desired” slot index and then subtract this from its actual location. A faster way is to simply cache the hash value for each key.

Storing the hash value is generally a sensible strategy anyway, because it means you have a fast early-out for comparisons, so we take that approach here. In other words, the hash table elements consist of a 32 bit hash value, the key, and the value. For my implementation I stored the hash values in a separate array in order to get more hash probes per cache line (at the expense of a second mandatory cache miss to actually fetch the key). This gave a small speed-up for my test where the key size was large (sizeof(std::string)), but there’s a #define in the code to put the hash value alongside its key.

In order to indicate that a slot is unused, we modify the hash function to never return 0, and use a stored hash value of 0 to mean “uninitialized slot”.

First, let’s look at the insertion algorithm, since this is where the actual Robin Hood trick comes in.

void insert_helper(uint32_t hash, Key&& key, Value&& val)
    int pos = desired_pos(hash);
    int dist = 0;
        if(elem_hash(pos) == 0)
            construct(pos, hash, move(key), move(val));

        // If the existing elem has probed less than us, 
        // then swap places with existing
        // elem, and keep going to find another slot for that elem.
        int existing_dist = probe_distance(elem_hash(pos), pos);
        if (existing_dist < dist)
                construct(pos, hash, move(key), move(val));
            swap(hash, elem_hash(pos));
            swap(key, buffer[pos].key);
            swap(val, buffer[pos].value);
            dist = existing_dist;

        pos = (pos+1) & mask;

The algorithm is pretty simple. We simply loop until we’ve found an uninitialized slot (hash value == 0). If we found an existing slot whose probe distance is less than our current probe count (‘dist’), we swap places and continue.
Note: using move semantics here matters (e.g. for efficient swapping).

In order to delete an element, we follow the same strategy as for normal open addressing and mark the slot as a tombstone. Tombstones are treated specially in the insert algorithm. We overwrite the tombstone only when we would’ve wanted to swap anyway (see the check for is_deleted above).

We must mark the tombstone using a whole bit rather than just reserving a single hash value (like we did for uninitialized slots), because we need to know the probe count of the tombstone.

bool erase(const Key& key)
    const uint32_t hash = hash_key(key);
    const int ix = lookup_index(key);

    if (ix == -1) return false;

    elem_hash(ix) |= 0x80000000; // mark as deleted
    return true;

This is all pretty straightforward. We first find the element, then we call its destructor, and finally set the “tombstone” bit.

And lastly, the lookup algorithm:

int lookup_index(const Key& key) const
    const uint32_t hash = hash_key(key);
    int pos = desired_pos(hash);
    int dist = 0;
        if (elem_hash(pos) == 0) 
            return -1;
        else if (dist > probe_distance(elem_hash(pos), pos)) 
            return -1;
        else if (elem_hash(pos) == hash && buffer[pos].key == key) 
            return pos;				

        pos = (pos+1) & mask;

This just finds the desired position of the element using the hash, then probes linearly from there on in. There are two conditions to signify that the element doesn’t exist, and finally one check to see if we’ve found the element.

The first exit condition checks for completely uninitialized elements. This is because if the element we are looking for had probed an uninitialized slot during insertion, it would have simply been place there. So the existence of an uninitialized slot along our probe sequence means the element must not be in the table.

The second condition is more interesting. This is our replacement for simple checking the probe distance against a table-wide maximum probe count. We know that when we probe an element during insertion, the one with the longer probe count of the two gets to keep the slot. So if we’re looking for an element that exists, we should expect to never see an existing element with a shorter probe count then our current count (if that had happened, there would’ve been a swap during insertion!).

In other words, we early out as soon as our current probe count exceeds the probe count of the stored element. Since the average probe count for stored elements is 2.55, we can exit pretty quickly for non-existent elements (much earlier than stopping after a table-wide maximum probe count).

This second condition is why we need to maintain the hash value even for tomb stones. Imagine what would happen if we simply marked the slot as uninitialized when it was deleted – the next insertion that comes across it would simply occupy it thereby getting an unfairly low probe count, and more importantly messing up the second condition above. Instead, we just flag deleted elements as tombstones, and only reuse the slot in the insertion algorithm if a swap would’ve happened anyway.

Finally, the last condition simply compares the hashes and the keys and returns the found element.

Here’s a the full code for this blog post: code. I should note that I wrote this specific implementation for this blog post, so it hasn’t been extensively used or tested (or optimized). It’s quite likely that there are bugs.

Thoughts on Game Data Formats


I’ve been thinking about the representation and loading of game data recently. This has largely been the result of stumbling upon a variety of blog posts and libraries such as:


Desired characteristics

So, the basic problem is that you want to store a large variety of “resources” for a game “level” (or “sector” or whatever), describing all sorts of things such as the location of every object, the vertices of every mesh, and so on. There are a number of desirable characteristics for this data format, such as:

  • Efficient on-disc representation
  • Efficient run-time and load-time access. Ideally you can load the data structure into memory and just use it directly without any “translation” step. At the very least any load-time cost should be cheap.
  • Some way of reflecting the data at run-time. Ideally this is something you don’t need for the final release game, at least for most objects, but it’s extremely useful during development to be able to essentially “attach” a visual debugger to a running game and get a description of every object in the level. E.g. for console development your editor could run on the PC, while the actual game runs on the console. The console would reflect its actual runtime state and keep the editor up to date, and the editor would send modifications as messages to the runtime.
  • Some way of evolving the representation of an object. Ideally a new runtime of the game should be able to just transparently load old resources without any fuss, or rebuilding. This is particularly useful for post-release patching, but even during development “rebuilding” thousands of objects (with whatever synchronization overhead and dependency tracking you might also need to do to avoid stomping on other developers’ toes) can quickly become an issue.
  • Multiple artist should be able to work on a level at the same time.

I haven’t implemented anything that provides all these benefits simultaneously in a shipping game, but with hindsight, and after thinking about some of the newfangled “protocol” formats that have become popular recently, I think I have a good approach in mind.


Levels are just folders of resources in the file system. Every single “object” in the level corresponds to a file, e.g. “tree_001291.inst”. The exact internal organization can vary – perhaps sub-folders refer to individual “sectors” (for e.g. streaming purposes), or perhaps the sub-folders are completely arbitrary and used by the designers to organize things however they want. The point is that there is no overall “level file” that anyone has to lock to modify the level. This means multiple artists can add to and modify different parts of the level. Just use normal source control, with exclusive locks, to process the individual files. Any per-level properties would be stored in tiny little resources that get dropped into the level like anything else. E.g. you might have a “physics properties” resource that store a per-level set of parameters like gravity. Keep even these global resources small and focused, to avoid contention. This does means you’ll have many thousands of files per editor, but it’s also super simple to inspect and debug when anything goes wrong.

Resources consist of “instance” data and a “schema“, both of which has a source representation and a runtime representation. For example, the schema might be represented in a simple JSON-like format, the instance source data can vary (e.g. a Maya .mb file, or a .wav file). The runtime representation is binary and efficient, produced at build-time from the source, using an appropriate “builder”. It’s likely that many different kind of instances use the same builder (e.g. simple “property bags” type resources might just take their input as plain JSON, even if each one represents a completely different kind of game object), with a few custom ones requiring more severe processing.

The schema describes the data layout of each resource type. I.e. you might have thousand “goblin” resources in the level, but there will only be a single schema describing the layout of all those instances. The reason for storing as schema separately, as opposed to having each instance be self-describing, is to avoid duplicating type-metadata and field offsets etc. that don’t change, and also because it gives you a convenient place to hang additional (optional) meta data off of, such as RTTI information (and on the editor side of things, UI meta data).

The instance data is basically just the actual bits that refer to a single instance of that resource. E.g. a single health pack, or monster spawn location, or rigid body, or a mesh, or whatever. It contains no “meta data”. It’s a compact format, but stored in a “native” way so that it’s cheap to read at runtime directly without first having to convert it to a runtime representation. Keeping the instance data in the “native” form is important because it means most resource types wouldn’t have any “runtime version” of the on-disc data – it would just use the instance data directly. Some types would probably have some additional “companion” runtime data layered on top, and created at load-time – e.g. D3D header objects and such.

A particular version of a schema is immutable. I.e. once you’ve created (and checked in) a schema you can’t ever change it. You can only create a new version of it (and you can only do that according to specific rules, that don’t invalidate previous instances and schemas). Each instance refers to a particular schema of a particular version. In other words, the game can load any version of any instance, because it uses the schema the instance was built against to interpret the data. So what I said above about there only being one schema for thousands of instances of a given resource type was a lie – there can be multiple versions of each schema. In practice you’ll only have a few versions actively used at any one time, because anytime you modify an instance, it would be saved out using the most recent schema (and you can periodically run a “refresh” batch job to rebuild instances).

The schema’s runtime asset is simple. Each field in the structure has a unique integer identifier starting from zero, with no gaps. This is the only way to identify a field. These identifier cannot change when you create new versions of a schema. You can add new field, and you can mark old fields as “deleted” (which means they no longer take up space in the instance), but you can’t actually reuse an old identifier. Lots of deleted fields imply some overhead in the schema runtime asset, but not in the instance data. These rules ensure that creating a new version of the schema won’t cause old instance to be invalid.

The runtime structure of a schema is basically just an array of offsets, one for each field, indicating where that field is stored within the instance data. There’s also some additional meta-data containing the string representation of each field etc., used for RTTI (ideally stored in a separate chunk that can be skipped if you don’t need it).

Each schema corresponds to a generated C++ class header. This class simply provides a bunch of accessor methods. Each one looks up the offset for the field in question using the offset array from the schema, and the identifier for the field, and returns the memory from the instance data at that location casted to the right type. This does mean an additional indirection for field access, but it should still be reasonably fast (especially since the schema’s offset table is likely to be hot in the cache quite frequently). If the field doesn’t exist in the schema (e.g. it was added after the schema was created – so its identifier is larger than the size of the offset array, or the offset array contains the “deleted” marker), the generated code would simply return the default value (stored inline as a constant, in the generated code). This is the migration path – if you add a new field to a schema, all you need to do is make sure you add the corresponding code to “do the right thing” with that new field, including doing something sensible for default values. Old code will simply ignore that field (their version of the generated class won’t have it, nor any code that relies on it!), and if the new code reads an old instance it will just get the default value for that field.

Further details and rationale

So that’s the basic idea, there are some other details that are important to note, though.

So clearly you don’t want to be loading individual files all the time, which is why there needs to be some kind of “bundling” process that groups resources together. Probably sorts them by (schema,version), so that you load all objects with a given schema at the same time. This wouldn’t just be for release builds, you’d regularly build these bundles even during development. In order that you don’t invalidate a whole bundle and revert to excruciatingly slow load times if you modify a single resource, there needs to be a way for the loader to basically override individual resources from a bundle.

One likely objection is that it’s much simpler to just enforce header/data consistency and always just do a plain memcpy-style resource load. That way there’s no need for the indirection via the schema. Why go through the trouble? Well, in a system with strict header/data dependencies, if you change the layout, you have to rebuild every resource of that type. I really like the simplicity of this approach, but I’ve been down that road and had to live with resource rebuilding being a major iteration hurdle for everyone involved.

The bottom line is that if you have thousands of assets of a particular resource type, every tiny little inefficiency in your build-time code gets multiplied by that number, making your performance margins razor-thin. Even 100ms of build time for a resource can quickly become unacceptable if you have ten thousand of them that need to be rebuilt because you added a flag. It’s not hard to hit those kind of times from just loading DLLs, and hitting the disk once or twice. Any slight inefficiency quickly adds up to many seconds, or even minutes of waiting around. I don’t know about you, but I’d rather not worry about a couple of milliseconds here and there on the build-side of my code. Plus, this approach opens up a pretty nice patching/DLC story.

Another objection may concern the fact that we’re generating code. Why not let our schema source asset simply be our C++ header file and parse it to build the runtime asset for a schema? Or why not annotate it with preprocessor macros and generate the offsets directly in code? Well, in a word – simplicity. Writing parsers is pretty hard. Writing a parser with good error messages is even harder. Voluntarily writing a parser with good error messages for C++ is pretty much grounds for termination. You do not want to be in this business! You might be able to restrict the kind of C++ you allow in this “schema header”, but inevitably someone will mess up and put unsupported code in there, and your cobbled together mess of preprocessor defines, build-time parsers, and maybe even template meta programming will lead to the worst error messages you’ve ever seen. I’ve been down this road. There be dragons. Maybe you can avoid them if you think hard about it going in, but honestly I don’t see any major upside to the approach that pays for the extra risk and complexity.

In contrast, generating code is child’s play. The accessor methods in the generated code is trivial. Anyone can understand and debug it. There’s absolutely nothing clever going on at all when you’re generating code (the code that generates C++ is simple, and the generated code is simple too). Everything is just plain old dumb C++ code that contains absolutely nothing clever. In my opinion, of the common choices for these things, code generation is the right choice. It retains maximum flexibility since you can just generate any code you want (you’re not restricted to what you can do with preprocessor macros or template or whatever), without costing you any simplicity.

You may worry about the overhead of these accessor methods. While it’s reasonably cheap, there’s still a range-check on the offset, and an extra indirection. I’m not overly concerned about that. Part of it is because in my experience it turns out a huge chunk of the data you actually touch lives in a relatively small set of game types. E.g. matrices, quaternions, vectors, colors, etc. These types never actually change representation in practice, so can simply be hard coded as “native” types instead of being represented as schemas of their own. In other words, accessing the toplevel “transform” field does require the extra range check and indirection, but each member of that structure (e.g. 16 matrix members) can be accessed directly as it’s just a plain old C struct. I’d estimate you end up with a few dozen “native” types like this, which improves the constant factors involved by quite a bit.

Plus, if accessor overhead is really an issue, you could make sure your Release builds will never load old instances (by running a global “refresh job” on all instances as you produce your final builds) and hard-code the offset in the generated accessor code instead of going via the schema. I’d avoid this, because it means you may have issues using the “upgrade path” for future patches. Another option is to tag certain type as having an “efficient” format, this would make the schema builder also produce a plain C struct for the type, and a loader method that simply copies each member at load time (using the schema offset table and instance data) – once that’s done it’s just plain old C data that can be accessed at full speed. This costs you load time performance and simplicity, but does offer a nice “backup plan” (e.g. if you want to modify the runtime data and save it back out to the instance source format, for live editing purposes).

So what about on-disc storage? If we need to store all our data in a “native” format, won’t we waste a whole bunch of memory? For example using 32 bits to store an integer when it could be compressed as 10 bits is a bit wasteful, right? Yeah, this is true, and it may be worth worrying about. However, it’s not uncommon to have some kind of transparent compression on the IO-system level anyway (e.g. based on zlib, or Snappy), so combined with the bundling system above most of the redundancy should be compressed away. If not, Cap’n proto has a clever trick where they store the instance data as the actual instance data XOR’d with the default instance (which I would store in the schema). This means most instances consist of tons of zeroes (since most fields aren’t actually changed from their defaults, in practice). Combine this with some really cheap zero-aware compression where you only store non-zero bytes, and you get great on-disc storage characteristics. Of course, the next step is to skip the XOR step altogether. When you compress the instance you do it using a “default-aware” compressor instead of a zero-aware one – i.e. skip storing any bytes that are identical to the default instance and at load time you would initialize the instance by either loading bytes from the resource instance or from the default instance, based on some control stream stored in the resource . Of course, you should measure carefully if this is worth it. There’s something to be said for loading directly from disk into the final destination buffer, without having to even involve the CPU.

Another interesting thing I’d like to try is to separate mutable from immutable data on the level of the schema (with immutable being the default, and this affecting the accessors being generated). Typically the vast majority of the data you’d load this way is immutable (in terms of size) – e.g. vertex data, audio data, texture data, etc., with only a few things actually being modified at runtime. Since we’re generating code from a schema, we can make these kind of global experiments pretty easily.

Anyway, I this seems to combine the best features from all the different systems I’ve worked with, read about, or even implemented myself. Thoughts?