The Perils of Future-Coding


In my opinion, one of the most insidious forms of technical debt is what I like to call future-coding. You might also call it over-engineering, second-system syndrome, etc. It’s code written in an overly elaborate and general way in order to, the future-coder reasons, pre-emptively handle use cases that we’ll “probably see in the future”.

I find it more treacherous than plain “stupid code” because it predominantly affects smart people. It’s mostly smart people with limited experience – once you see the downsides of future-coding a few times, you tend to knock it off (although there are powerful confirmation biases at work) – but still, these are bright people you’d want to hire and keep around in your organization because they’ll do great things in the future.

Let me illustrate my point using a hypothetical example with two equally smart programmers, where one just has a little bit more experience than the other and is less prone to future-coding.

The Tale of Two Street Lights

Let’s say you’re making a game where a large portion of the time the player is navigating an urban environment. One day the artists ask for a feature to make it easier to place street lights in a level. They say: “Can we have a way to click a path through the level and have the game place lights at an even spacing along it?” This seems like an easy enough request so you start implementing it.

Here’s where our story forks into two divergent paths.

First option – the pragmatic coder

The pragmatic coder writes some quick UI in the editor where the artist clicks on points in the level that get stored in an array.

On the runtime side he writes a dozen-line for-loop that simply walks the path segments at equal spacing and spawns a street light each time. Job done, the pragmatic coder moves on to the next task.
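
His dozen-liner might look roughly like this (just a sketch – Vec3, length() and spawn_street_light() are stand-ins for whatever the engine actually provides):

// Walk the clicked path and drop a street light every 'spacing' units.
void place_street_lights(const std::vector<Vec3>& points, float spacing)
{
    float offset = 0.0f; // distance into the current segment of the next light
    for (size_t i = 0; i + 1 < points.size(); ++i)
    {
        const Vec3 delta = points[i + 1] - points[i];
        const float seg_len = length(delta);
        const Vec3 dir = delta / seg_len;

        float t = offset;
        while (t < seg_len)
        {
            spawn_street_light(points[i] + dir * t);
            t += spacing;
        }
        offset = t - seg_len; // carry the leftover distance into the next segment
    }
}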

Second option – the future-coder

The future-coder writes the same UI as the pragmatic coder, but for the runtime he decides upon a different approach. After all, he’s read books about design patterns and large scale software design, and he knows about the importance of the Single Responsibility Principle, and keeping code factored for changes that might occur in the future.

Thus, he decides to abstract over the exact type of path used. Sure, the artists only asked for a simple path today, but in the future they might want a B-spline or some other kind of curve, so he pulls out the logic for the path into its own Path object that can step along the curve in some kind of iterator fashion. Of course, that’s not enough, in order to change the behavior of the curve he puts this Path object behind an interface and adds a factory to spawn the right kind of path based on some kind of configuration parameter. He also updates the UI to allow for future kinds of paths. Maybe he even implements a B-spline option just so that he can test this properly with more than one option.

Now, the code to spawn street lights first creates an abstract path from a factory, but how does he determine the spacing of the objects? The artists asked for simple equal spacing today, but in the future they might want something different, so he writes another abstract class that handles the spacing policy (with a factory to go with it).

Street lights in your game happen to be reasonably symmetric, so today just spawning them with their default orientation works fine, but in the future the artists might want to use this system to spawn objects where the orientation needs to depend on the curve, so our future-coder updates the path object to also supply tangents and bi-tangents, and writes an abstract class to handle the specific placement policy in order to take this new information into account.

Phew, that was a lot of work – he’s added a lot of new classes and glue code, and the code that actually uses it is a lot more opaque and hard to read than what the pragmatic programmer came up with. It will be worth it in the future, however, when this extra flexibility will pay off. The future-coder goes home late today, but feels good that he’s written an elegant system that will stand the test of time, rather than just a dumb one-off implementation.

Reality Kicks In!

Let’s go through a few scenarios of what might happen next.

Good news, no changes!

Turns out this simple system was all the artists ever needed. They never ask for any changes, and the code is shipped unchanged from the original implementation.

The pragmatic coder feels good that his code did exactly what was needed and never caused any issues or technical debt in the project.

The future-coder feels bad that he spent all this time on an extensible system and nobody ever made the most of the flexibility in his elegant design, but at least it was used at all and the artists liked it, so it’s not too bad.

The feature gets cut

A week later the artists decide that they don’t mind placing street lights manually after all, since it gives them extra control. The automatic street light placing feature goes unused, and is eventually deleted.

The pragmatic coder barely remembers the feature and couldn’t give a damn.

The future-coder feels bad that his elegant system was completely deleted. He grumbles to the other programmers about the awful artists and designers who always keep changing the requirements and wasting so much of his time asking for stuff that they end up not using.

Requirements change

The artists are happy with the feature, but they have one minor request: “Yes, I know we asked for equal spacing of the street lights, but actually we want to make sure that there’s always a street light on each corner of the path, and that the spacing within each path segment is slightly adjusted to ensure this”. To the artists this feels like a simple change so they don’t feel too bad about asking for it.

For the pragmatic coder this is a trivial change. He updates his dozen-liner to first compute the number of street lights required for each segment by dividing the length by the desired spacing and rounding it to the nearest integer, then computing an updated spacing by dividing the segment length by this count. This takes a few minutes of his time and he moves on to other things.
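
In code the change is roughly this – again just a sketch, with the same made-up helpers as before:

// A light on every corner, with the in-segment spacing nudged so the last
// light lands exactly on the next corner.
void place_street_lights(const std::vector<Vec3>& points, float desired_spacing)
{
    for (size_t i = 0; i + 1 < points.size(); ++i)
    {
        const Vec3 delta = points[i + 1] - points[i];
        const float seg_len = length(delta);
        const Vec3 dir = delta / seg_len;

        // Light count for this segment, rounded to the nearest integer,
        // then an adjusted spacing computed from that count.
        const int count = std::max(1, (int)std::lround(seg_len / desired_spacing));
        const float spacing = seg_len / count;

        for (int k = 0; k < count; ++k)
            spawn_street_light(points[i] + dir * (spacing * k));
    }
    spawn_street_light(points.back()); // the final corner
}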

The future-coder starts thinking about how to accommodate this new request into his streetlight-placing framework. Shit-shit-shit, he thinks, a frustrated panic starts to spread in his mind. He never anticipated this change and can’t really see how to update his system to handle it. What does it even mean to place a light at “each corner” when the path is completely abstract? After all, if a future path is defined with a spline, the “corner” may not have a control point on it.

And even if you did have a way to define what a “corner” means, how do you feed that data to the object that governs the spacing policy, now that it’s completely decoupled from the path object? This will require some major re-architecting he thinks, and postpones the task until he can find a larger slot in his schedule. Later that night the future-coder sits in a bar with a programmer coworker, complaining about those damn artists that keep changing the specs on him, oblivious to the irony that it’s precisely his overly complicated attempt to be robust to spec-changes that is causing him problems in the first place.

Ah, a solution! He decides to define “corners” in terms of the curvature of the path, with a configurable threshold, and then feeds this data to the spacing policy object so it can use it to compute the appropriate spacing. The code gets kind of messy since the path and spacing classes are now getting more coupled, and there’s a lot of tunable parameters involved, but it does the job so he checks it in and goes home.

The next day the artists inform him that his solution doesn’t work at all. They want to use this as a way to ensure exact placement of streetlights even on straight-line stretches of road, by placing dummy control points in the path. So this notion of using curvature to define street light positions just doesn’t work.

Let’s leave the future-coder to work out a solution to this self-inflicted problem. I have no idea what he could be doing to rescue his system at this point, but whatever he comes up with it’s unlikely to be pretty, leading to even more technical debt and complexity.

Requirements change – redemption for our future-coder!

In this scenario the requirements change in a different way. The artists now want to have the option of using either a straight path with sharp corners, or a smooth B-spline path.

The pragmatic coder adds a flag to the UI, and a switch in his runtime code that either calls the old placement code, or calls some new code that interprets the control points as a B-spline and steps through them using his math library’s B-spline code. The code is still easy to understand, almost certainly contains no bugs, and it didn’t take too long to update.

The future-coder feels redeemed! He may even have had the B-spline option already implemented so he can simply point the artist at the appropriate UI option. But even if he has to implement the B-spline option from scratch it’s a simple matter of implementing a new Path object that iterates over a B-spline, and updating all the relevant factories and configuration options. Yes it’s more code than what the pragmatic coder needed to write to make this change, and it’s still much harder to read and navigate, and it’s not at all as clear that it doesn’t contain bugs. But look at the UML diagram! It’s so elegant, and follows the single responsibility principle, and dammit, we’ve already seen that we needed one additional option, so in the future when the number of options grows to a dozen or so we’ll really see the benefit of this elegant framework!

The future-coder wins the lottery

This time the scenario is that the artists keep asking for more and more kinds of paths over time, with increasingly elaborate spacing options, and it becomes one of the most reused systems in the whole project.

After a few different path options have been added, each having a few different spacing options, the pragmatic coder starts seeing the limits of his original design, and decides that a more abstract and generic design will have greater benefits than downsides. So he refactors his code so that the path is specified by an abstract path object, and pulls the spacing policy out into its own object too. At the end of the day he ends up with a system much like the future-coder’s, but doesn’t feel too bad because he spent almost no time on the original systems in the first place, and the simplicity of them meant he could get the initial system going quickly, and respond to change-requests faster than the future-coder could’ve.

The future-coder feels like a goddamn programming genius. He totally predicted this and it paid off! He is now more certain than ever that just spending some time up front predicting the future will pay off down the line.

He’s blind to the fact that for every time he happens to get it right, there are a hundred other instances where his failed predictions led to unnecessarily complicated code, time wasted debugging because you can’t tell that the code obviously has no deficiencies, and additional development effort because the extra scaffolding required by his general systems made it harder to address changes that weren’t anticipated.

It’s like he’s won at the roulette table in Vegas – he’s convinced that he’s just that good, that he’s found a system. He’ll keep spending his money at the casino, without ever really doing the math on whether or not his overall winnings are higher than his losses. Everyone says they’re up on average in Vegas, right? It’s a mystery that casinos still make money!

Conclusion

We probably all have a bit of the future-coder in us, but we must resist him! The bottom line is that the future-coder almost never wins. In reality, predicting the future is virtually impossible, so don’t even try. It hardly ever pays off. Best case, you’ve just wasted a lot of time, but in reality abstraction has cognitive overhead throughout the project when reading or debugging code, and can make it harder to deal with unanticipated changes (ironically), as well as cause performance issues. And even when it does pay off it doesn’t buy you all that much, and is arguably a net negative because it skews your intuition into thinking that future-coding is a good thing.

The human brain didn’t evolve to think in abstract terms. You will probably have noticed this when trying to understand a new mathematical concept – it’s much easier if you start with concrete examples rather than the fully general definitions and proofs! Simple code that just does one thing in the dumbest way possible takes very little time for someone to understand, whereas if you have to jump through abstract classes and factories (each named in suitably abstract ways) it can take an absurd amount of time to understand something that really isn’t all that complicated.

All these extra virtual calls act as barriers to inlining, in addition to the overhead of the virtual function calls themselves, which cause performance issues. Yes, I know you’ve been told that you shouldn’t worry about performance for code that isn’t in a hotspot, but the reality is that for many applications there aren’t any real hotspots anymore. It’s depressing to look at the profiler for gameplay code and see that your warmest “hotspot” takes 5% of your frame time, and has already been optimized to death. By that time it’s often too late – performance has been siphoned off by tens of thousands of small inefficiencies added over many years, all because people didn’t want to worry about minor performance hits for code that wasn’t on the mythical “hot-path”. Simple code tends to be cheaper to execute, most of the time, in addition to being better in almost every other way as well.

So am I saying that we should just hack things together in the quickest way possible, with no regard for future maintainability? No, of course not. I’m advocating doing things in the simplest way possible, not the quickest way possible. Often, the simplest way requires removing a bunch of cruft that has accumulated over time, deleting stateful flags, pulling logic out into clearly named functions etc. Sometimes, it even requires an abstract factory.

You should optimize for simplicity, readability and modifiability. If you do, you will also end up with code that’s more maintainable, easy to modify and optimize in the future. Take the example above where our heroes needed to add support for a B-spline. Let’s assume that there was no library for interpolating a B-spline already so that they had to add it. I’d expect both types of programmers to pull that code out into its own function. My motivation for doing this isn’t that I’m trying to predict the future, though! You pull it out into its own function because doing so makes the code easier to read today, by making sure the B-spline functionality is hidden behind a sensibly named function instead of cluttering up the “flow” of the main function with unnecessary details. The minor cognitive cost of the extra function call is vastly outweighed by the increased readability of hiding unimportant details. It’s not about predicting the future, it’s about increasing readability.

Sensibly named function calls without side effects almost always add more to readability than they cost; if they’re virtual, less so; if they have side effects, even less so; if they’re in a class, even less so; and so on. It’s all a tradeoff, and the break-even point for any given abstraction will change depending on the specifics of what you’re doing.

Very few abstraction techniques have no merit at all, but the road to programmer hell is paved with best-practices, applied too early. You should postpone applying these abstraction techniques until the cognitive and other costs of using them are clearly outweighed by their benefits (e.g. when you have demonstrated reusability for a component). There are often genuine systems even in game engines where it’s perfectly reasonable to apply many of the abstraction techniques you’ve read about in books about design patterns and large scale software design etc., but the vast majority of code is better off if you keep things simple, and avoid trying to predict the future.

More on Robin Hood Hashing


I posted about Robin Hood hashing in a previous post. In the comments to that post David Wragg pointed out a flaw in my implementation: if you repeatedly delete and reinsert elements into a hashtable, the average probe count keeps rising. This is a good observation, and it’s true. Repeatedly deleting items does make the table slower, until you insert enough new elements to trigger a rebuild of the table, which will filter out any tombstones (note: the number of tombstones does not affect the rate of rebuilds).

Now, first of all it’s worth noting that the probe count grows slowly – at least after a slight tweak to the insertion method (if the existing element is a tombstone, replace it even when the probe counts are merely equal, rather than only when the existing one’s is strictly smaller). In David’s test case it grew to about 30 after a million deletions (with my hash function it’s more like 20).

The other point to make is that we already accept periodic “batch” operations such as the table-rebuilding that happens whenever you hit the max load factor. For that reason, it’s relatively reasonable to just reinsert all items again (skipping tombstones) whenever some number of deletions have occurred (e.g. 5N or something – it doesn’t have to be very frequent since the average probe count grows so slowly with the number of deletions).
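
A sketch of what that might look like, hooked into the erase path (num_deletions_, capacity() and rehash() are assumptions about the table internals, not code from the original post):

// Periodic batch fix: count deletions and rebuild the table once in a while.
// rehash() reinserts every live element, which drops all tombstones.
void note_deletion()
{
    if (++num_deletions_ >= 5 * capacity()) // i.e. roughly every 5N deletions
    {
        num_deletions_ = 0;
        rehash(capacity()); // keep the same capacity; we only want to purge tombstones
    }
}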

So, there’s a simple and fairly bullet proof solution to this problem, but what if we prefer an incremental approach, where we make each deletion slightly slower, but never have to run a big batch job to improve the average probe count?

I’m not sure if there are existing solutions with established asymptotic properties, but I tried a simple strategy that seems to work pretty well in my test. The idea is this: whenever we delete an element, we scan ahead for a small number of slots and rearrange them to limit (or remove) the impact of the delete on the average probe count. So long as this operation tends to improve the average probe count at a faster rate than the probe count would otherwise grow, you’ll end up with an average probe count that stays reasonable.

Here’s the code that runs at the very end of the “erase” method (after the element has already been destroyed):

int cur_ix = found_ix;      
for (int i = 0; i<NUM_DELETE_FIXUPS; ++i)
{
    const uint32_t next_ix = (cur_ix + 1) & mask;
    if (next_ix == found_ix) break; // looped around

    // if the next elem is empty or has a zero probe dist, 
    // then any query will always fail on the next elem anyway, 
    // so might as well clear the current one too.
    if (elem_hash(next_ix) == 0 || 
        probe_distance(elem_hash(next_ix), next_ix) == 0)
    {
        elem_hash(cur_ix) = 0;
        return true;
    }

    // Now, the next elem has probe count of at least 1, 
    // so it could use our deleted slot to improve its 
    // query perf. 
    // First, pretend we're a dupe of the next elem.
    elem_hash(cur_ix) = elem_hash(next_ix);

    // Since we're now a dupe of the next elem, ordering is 
    // arbitrary, which means we're allowed to swap places.
    // Note: current slot has been destructed, hence 'new'.
    new(&buffer[cur_ix]) elem(std::move(buffer[next_ix]));
    buffer[next_ix].~elem();

    cur_ix = next_ix;
}
elem_hash(cur_ix) |= 0x80000000; // mark as deleted
return true;

You can choose a small number for NUM_DELETE_FIXUPS, or remove the condition in the loop (it’ll go through one pass of the table at worst, although that’s highly unlikely since it’ll stop on any empty element or an element with a zero probe distance). In my tests I’ve found that 2 will bound the variance and average probe count, but 3 will keep them slightly lower at marginal extra cost so that’s what I’ve set it to.

Again, this is just me messing around – I haven’t made any real attempt to prove the properties of this approach. There may be more rigorously investigated approaches in the literature somewhere. Given that there exists a simple fix with very low amortized cost (periodic reinsertion of elements), I don’t really consider this issue a deal-breaker even if this incremental approach doesn’t work out.

Language Design Deal Breakers


I’m a bit of a programming language nerd. I love to learn new languages. That said, I spend most of my days writing C++. It’s a truly awful language, with many warts and problems, but I know it well and with enough effort and pain you can get the job done. The biggest benefit of C++ is purely an historical accident: it’s great because it’s popular. There’s no trouble finding people who know it, or libraries that work with it.

The point is that in order for me, or anyone else, to actually switch from C++ to a different language, there has to be a substantial improvement to overcome the inertia that C++ has built up over the years.

I find that when I look at a new language, there are a number of “deal breakers” which simply mean that regardless of all its other nice features I will never take it as a serious contender. Note that this isn’t a fair fight. Something can be a deal breaker even if C++ doesn’t have that feature either. Any language has to be so much better than C++ that the benefits outweigh the inertia. A deal breaker is a feature that either A) puts it at a disadvantage compared to C++ or B) puts it on equal footing with C++ in an area where C++ is doing particularly poorly.

Here’s my list of language design requirements which, if unmet, would be deal breakers for me. What’s yours?

1. Must be statically typed

I know you’re supposed to be diplomatic and claim that there are two sides to this story, and no real right answer, but really people who think dynamic typing is suitable for large scale software development are just nuts. They’ll claim it’s more flexible and general but that’s just nonsense in my opinion. It may be true compared to utter strawman examples of static typing. If you think this is an argument, go learn Haskell, and then get back to me.

There’s a wonderful correspondence between programs that can be statically typed, and programs that can be statically understood by other people. In other words, programs that you can understand without having to run a debugger or REPL, by just reading code. If you’re writing programs that rely on dynamic typing to work, you’re probably writing exceedingly difficult to understand code. If you are writing code that can be statically understood by other people, you’re probably writing down comments to indicate the allowable types and preconditions anyway (or you should) – so you might as well have those properties checked by the compiler.

Most code does not rely on dynamic typing to work. It could be written just as well in statically typed languages. In that case it’s just a matter of convenience – make specifying types light-weight enough that the extra syntactic overhead doesn’t outweigh the benefits (type inference helps; even the simple “local” kind that C# and C++11 have cuts down on much of the noise).

Then there’s the correctness issue. Static typing catches mistakes early. This is a good thing. Now, dynamic typing advocates will say that of course any real program will also have a unit testing suite, so types can be checked that way! This is obviously naïve in the extreme. In the real world writing unit tests for everything very frequently fails the bang-for-buck test – the cost is just too high. This cost comes in two forms: first the cost of writing the tests, and second the friction it introduces to future modifications and refactorings – after all, if you have to update 20 unit tests to refactor a messy interface, you’re a lot less likely to do it. Examples of code you probably don’t want to write unit tests for include gameplay logic which gets written and rewritten twenty times before ship as the result of iteration. Having to write test harnesses and tests for all this throwaway code is madness. Then of course there’s code that’s just plain hard to unit test, such as graphics. The point is that pervasive unit testing is a fantasy – it may be feasible in some domains, but it’s certainly not the case everywhere. Static typing is a great compromise – you basically document the things you would’ve put in comments anyway, but you do so in a standardized way that can be checked by the compiler. This catches many errors for very little cost. When unit tests make sense (as they sometimes do), you can add those too for even more coverage, but when they don’t you still have a basic sanity check.

Finally, performance. Yes, JITers can be very clever and employ tracing and stuff to close some of the gap here, but of course those techniques could be used by a statically typed language implementation as well. If we can statically decide what methods to call and what types to use we will always be at an advantage compared to a language where this has to be determined at runtime. Not to mention that statically typed languages will encourage you to write code that can be executed efficiently (e.g. variables don’t change types over the course of a program). Also, static typing allows you to make performance guarantees that won’t be randomly broken tomorrow. For example you can store an unboxed array of values, or store a value on the stack or embedded within another object. You don’t have to rely on a runtime tracing system to uncover and apply these optimizations (with typically no way to ensure that it actually occurred), which also means you don’t have to worry that you’ll fall off a performance cliff in the future when something subtle changed which caused the JITer to fail to optimize something. The programmer is in control, and can statically enforce important performance-related choices.

Of course if you want to allow dynamic types as an option for a subset of cases where it makes sense (e.g. dealing with untyped data read from external sources), that’s fine.

2. Memory safety

C++ fails this one, but it’s still a deal breaker. If I was content with worrying about memory scribbles and random access violations I’d just stick to C++. In order to offer enough benefits to convince me to switch from C++, your language needs to be memory safe.

Now, let me also state that I think the language needs an “escape hatch” where you can tag some function or module as “unsafe” and get access to raw pointers, the point is that this needs to be something optional that you have to explicitly enable (using a compiler switch, ideally) for very isolated cases, and you should never be required to use it for typical “application level code”.

This also implies some kind of guarantee about memory reclamation… We’ll get to that.

3. Yes, memory safety – no null pointer exceptions!

I use a broader definition of memory safety than some. In my view, if a program follows a reference and crashes because the reference is invalid, it’s not a memory safe language. This implies that in addition to making sure you never point to stale memory or memory of the wrong type (through type safety and automatic storage reclamation), you also have to make sure that a reference never points to null.

Null pointer exceptions have no business in modern languages. Getting rid of them costs no performance, and statically eliminates the one big remaining cause of runtime crashes that we still see in otherwise “modern” languages. Yes, it requires some thought (particularly w.r.t. how you do object construction), but this is a solved problem. There is only one valid reason not to do this and that’s legacy code – perhaps you didn’t know what you were doing when you started off, not realizing the issue and how simple the solution is, and now you have too much legacy code to fix it. A language without legacy issues should get this right, and yet many don’t.

Note: it’s not enough to simply have a “non-nullable-pointer” type, not even if it’s checked at compile time. You have to make sure that nullable pointers cannot be dereferenced. In other words, using the pointer-dereferencing operator on a nullable pointer should literally be a compile-time type error. You should have to do a check of the pointer first which introduces a dynamic branch where on one of the sides the type of the pointer has been changed to be non-nullable, which enables the dereferencing operator again.
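
To make that concrete, here’s a minimal C++ sketch of the shape such a type would take – purely an illustration of “no dereference outside a checked branch”, not any particular language’s actual design:

// A nullable reference that cannot be dereferenced directly. The only way to
// reach the pointee is through a branch that hands you the value on the
// "present" side. All names here are made up for illustration.
template <typename T>
class Nullable
{
public:
    Nullable() : ptr_(nullptr) {}
    explicit Nullable(T* p) : ptr_(p) {}

    template <typename Present, typename Absent>
    void match(Present present, Absent absent) const
    {
        if (ptr_) present(*ptr_); // the dereference only exists in this branch
        else      absent();
    }

private:
    T* ptr_; // deliberately no operator* or operator->
};

// Usage sketch:
//   maybe_player.match([](Player& p) { p.update(); },
//                      []            { /* handle the missing case */ });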

Having every few statements potentially trigger a runtime crash is an extremely expensive price to pay… for nothing! We’re not talking about a pragmatic tradeoff here, where you pay the cost of reduced reliability but get something else in return – there’s just no real upside to allowing dereferencing of potentially null pointers.

4. Efficient storage reclamation

I mentioned above that I require automatic storage reclamation – well, I also require it to be efficient. Too many modern languages fail this. I don’t really mind how they achieve performance, but two ways you could do it are to support pervasive use of value types (i.e. types that get embedded within their owners – which drastically reduces the number of pointers in your heap, and therefore the cost of GC), along with thread-local heaps. Concurrent collection is another solution, though you’d have to be careful not to incur too much of an overhead on the mutator with whatever memory barrier you use. I’ve written about this issue twice in the past: one, two.

5. Great Windows support

Windows is still the most popular OS by far. If you don’t give me installers, or easily-buildable code, for Windows I’m not buying. I realize a lot of academia is used to *nix, which causes a disproportionate preference for Linux and Mac in projects such as this, but in the real world your customers actually use Windows so you should make sure it works.

Conclusion

The most common one of these deal breakers is probably dynamic typing. I usually just close the tab immediately when I see that a language is dynamically typed. So while strictly speaking this is the most common issue, it doesn’t feel like it because I never invest any amount of time even considering those languages anymore. They may have some benefits in terms of theoretical elegance (e.g. Lisp, or even Lua or IO), but for actual work I’ve seen too much of the benefits of static typing, and too much of the horrors of not having it. It’s so obvious, that it barely qualifies as a deal breaker – it’s not even in the running to begin with.

So really, the most common deal breaker is probably point 3 above – Null pointer exceptions. Whenever I see promising languages like Go, D or Nimrod, that still repeat Hoare’s billion dollar mistake I get a bit sad and frustrated. It’s like seeing a beautiful, elegant, car that has just one tiny flaw in that the wheels occasionally fall off. It simply doesn’t matter how many awesome and innovative features you have if you don’t get the fundamentals right.

Robin Hood Hashing should be your default Hash Table implementation


There’s a neat variation on open-addressing based hash tables called Robin Hood hashing. This technique isn’t very well-known, but it makes a huge practical difference because it both improves performance and space utilization compared to other “standard” hash tables (e.g. chaining). It also eliminates the major downside of other open addressing hash tables.

Here are the benefits, summarized:

  • High load factors can be used without seriously affecting performance. 0.9 is perfectly reasonable as a default (0.95 or higher works too, it mainly affects insertion cost a bit).
  • No linked lists or other extra pointers. This reduces cache misses and storage overhead. Your underlying structure can be a simple flat array since it’s just open addressing under the hood.
  • Lookup and insertion logic is fast. Again, no linked lists to traverse or other complications, just a linear probe sequence with a few checks per element.
  • Unlike other open addressing schemes, looking for non-existent elements is still fast.

For a simple benchmark where I inserted 100k English words into a map, then deleted 10% of them, and then looked them all up again, the timings for my simple Robin Hood hash table were 23%, 66% and 25% lower for insertions, deletions and lookups respectively, compared to the VS 2012 std::unordered_map – and it did this using 30% less memory overall.

It’s all about the variance

So the basic idea is to take normal open addressing, but use one clever trick in order to drastically reduce the variance of the expected average and maximum probe lengths. We’ll see later why reducing variance is so important. To give you an idea of how Robin Hood hashing improves things, the probe length variance for a RH table at a load factor of 0.9 is 0.98, whereas for a normal open addressing scheme it’s 16.2. At a load factor of 0.99 it’s 1.87 and 194 respectively.

The clever trick is just this: when you probe for a position to insert a new element, if the probe length for the existing element is less than the current probe length for the element being inserted, swap the two elements and keep going.

That way elements that were inserted early and thus “lucked out” on their probe lengths, will gradually be moved away from their preferred slot as new elements come in that could make better use of that place in the table (hence the name – the insertion “takes from the rich”, i.e. the elements with low probe counts). It leads to an “evening out” of the probe lengths.

Why is low variance better? Well, with modern cache architectures a probe count of 1 isn’t really much faster than a probe count of 3, because the main cost is fetching the cache line, so the ideal scenario is for all elements to have the same probe count, and having that probe count fitting within one cache line.

It turns out that Robin Hood hashing has the same expected probe count as normal open addressing (about 2.55) – it just has substantially less variance, and therefore ends up with far fewer cache misses. Yes, there are fewer elements that can early out after 1 probe, but there are also far fewer elements that end up needing to fetch multiple cache lines because of long probe lengths.

Furthermore, the expected value of the longest probe sequence approaches about 6 for a load of 0.9 (it’s not actually constant, but grows very, very slowly – see this paper), which is pretty reasonable.

What about failed lookups?

So now you’re probably thinking that this sounds good, but in the back of your mind you know that normal open addressing tables do pretty well too, but have a major downside in that searching for non-existing elements is troublesome to handle. To allay your fears I’ll first tell you a very simple (but totally practical) solution that is enabled by the low variance, and then we’ll see the even better version later.

First a recap. The usual problem is this: since the search algorithm terminates when you find an “empty” slot in the underlying array, it can take a very long time to determine that an element doesn’t exist in the array when the table grows full.

How does Robin Hood hashing solve this? Well the simplest solution is to exploit the fact that the expected longest probe count is low (~6). Just modify the standard search algorithm to ignore empty slots (rather than terminate the search) and keep going until you’ve probed longer than the known maximum probe length for the whole table. This maximum probe length will be small, with a very high probability, even for very loaded tables.

This solution would obviously be inefficient for the standard open addressing scheme, since the expected longest probe count is much higher (e.g. at a load of 0.9 and ~64K elems it is about 70).
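
In code, the simple variant is just a bounded linear probe. A sketch, using the same helpers as the real listings further down (max_probe_length_ is assumed to be tracked during insertion):

// Failed lookups give up once we've probed further than any element in the
// table ever has, instead of stopping at the first empty slot.
int lookup_index_simple(const Key& key) const
{
    const uint32_t hash = hash_key(key);
    int pos = desired_pos(hash);
    for (int dist = 0; dist <= max_probe_length_; ++dist)
    {
        if (elem_hash(pos) == hash && buffer[pos].key == key)
            return pos;
        pos = (pos + 1) & mask;
    }
    return -1; // probed past the table-wide maximum: not present
}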

The details

In order to know what the probe count of an existing element is (which is key to the algorithm) we could just re-hash the element to re-compute its “desired” slot index and then subtract this from its actual location. A faster way is to simply cache the hash value for each key.

Storing the hash value is generally a sensible strategy anyway, because it means you have a fast early-out for comparisons, so we take that approach here. In other words, the hash table elements consist of a 32 bit hash value, the key, and the value. For my implementation I stored the hash values in a separate array in order to get more hash probes per cache line (at the expense of a second mandatory cache miss to actually fetch the key). This gave a small speed-up for my test where the key size was large (sizeof(std::string)), but there’s a #define in the code to put the hash value alongside its key.

In order to indicate that a slot is unused, we modify the hash function to never return 0, and use a stored hash value of 0 to mean “uninitialized slot”.
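
The listings below lean on a handful of small helpers. Roughly – and glossing over details of my actual implementation – they look like this:

// Sketch of the helpers used in the listings below. 'mask' is capacity - 1
// (the capacity is a power of two).
uint32_t hash_key(const Key& key) const
{
    uint32_t h = (uint32_t)std::hash<Key>()(key);
    h &= 0x7fffffff; // clear the top bit, which we reserve for tombstones
    h |= (h == 0);   // hash value 0 means "uninitialized slot", so never return it
    return h;
}

bool is_deleted(uint32_t hash) const { return (hash >> 31) != 0; }

int desired_pos(uint32_t hash) const { return hash & mask; }

int probe_distance(uint32_t hash, int slot_index) const
{
    // How far this element currently sits from the slot its hash wants it in.
    return (slot_index + (mask + 1) - desired_pos(hash)) & mask;
}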

First, let’s look at the insertion algorithm, since this is where the actual Robin Hood trick comes in.

void insert_helper(uint32_t hash, Key&& key, Value&& val)
{
    int pos = desired_pos(hash);
    int dist = 0;
    for(;;)
    {			
        if(elem_hash(pos) == 0)
        {			
            construct(pos, hash, move(key), move(val));
            return;
        }

        // If the existing elem has probed less than us, 
        // then swap places with existing
        // elem, and keep going to find another slot for that elem.
        int existing_dist = probe_distance(elem_hash(pos), pos);
        if (existing_dist < dist)
        {	
            if(is_deleted(elem_hash(pos)))
            {
                construct(pos, hash, move(key), move(val));
                return;
            }			
            swap(hash, elem_hash(pos));
            swap(key, buffer[pos].key);
            swap(val, buffer[pos].value);
            dist = existing_dist;
        }

        pos = (pos+1) & mask;
        ++dist;			
    }
}

The algorithm is pretty simple. We simply loop until we’ve found an uninitialized slot (hash value == 0). If we found an existing slot whose probe distance is less than our current probe count (‘dist’), we swap places and continue.
Note: using move semantics here matters (e.g. for efficient swapping).

In order to delete an element, we follow the same strategy as for normal open addressing and mark the slot as a tombstone. Tombstones are treated specially in the insert algorithm. We overwrite the tombstone only when we would’ve wanted to swap anyway (see the check for is_deleted above).

We must mark the tombstone using a whole bit rather than just reserving a single hash value (like we did for uninitialized slots), because we need to know the probe count of the tombstone.

bool erase(const Key& key)
{
    const uint32_t hash = hash_key(key);
    const int ix = lookup_index(key);

    if (ix == -1) return false;

    buffer[ix].~elem();
    elem_hash(ix) |= 0x80000000; // mark as deleted
    --num_elems;
    return true;
}

This is all pretty straightforward. We first find the element, then we call its destructor, and finally set the “tombstone” bit.

And lastly, the lookup algorithm:

int lookup_index(const Key& key) const
{
    const uint32_t hash = hash_key(key);
    int pos = desired_pos(hash);
    int dist = 0;
    for(;;)
    {							
        if (elem_hash(pos) == 0) 
            return -1;
        else if (dist > probe_distance(elem_hash(pos), pos)) 
            return -1;
        else if (elem_hash(pos) == hash && buffer[pos].key == key) 
            return pos;				

        pos = (pos+1) & mask;
        ++dist;
    }
}

This just finds the desired position of the element using the hash, then probes linearly from there on in. There are two conditions to signify that the element doesn’t exist, and finally one check to see if we’ve found the element.

The first exit condition checks for completely uninitialized elements. This is because if the element we are looking for had probed an uninitialized slot during insertion, it would have simply been placed there. So the existence of an uninitialized slot along our probe sequence means the element must not be in the table.

The second condition is more interesting. This is our replacement for simply checking the probe distance against a table-wide maximum probe count. We know that when we probe an element during insertion, the one with the longer probe count of the two gets to keep the slot. So if we’re looking for an element that exists, we should expect to never see an existing element with a shorter probe count than our current count (if that had happened, there would’ve been a swap during insertion!).

In other words, we early out as soon as our current probe count exceeds the probe count of the stored element. Since the average probe count for stored elements is 2.55, we can exit pretty quickly for non-existent elements (much earlier than stopping after a table-wide maximum probe count).

This second condition is why we need to maintain the hash value even for tombstones. Imagine what would happen if we simply marked the slot as uninitialized when it was deleted – the next insertion that comes across it would simply occupy it, thereby getting an unfairly low probe count, and more importantly messing up the second condition above. Instead, we just flag deleted elements as tombstones, and only reuse the slot in the insertion algorithm if a swap would’ve happened anyway.

Finally, the last condition simply compares the hashes and the keys and returns the found element.

Here’s the full code for this blog post: code. I should note that I wrote this specific implementation for this blog post, so it hasn’t been extensively used or tested (or optimized). It’s quite likely that there are bugs.

Thoughts on Game Data Formats


I’ve been thinking about the representation and loading of game data recently. This has largely been the result of stumbling upon a variety of blog posts and libraries such as:

http://gamedevcoder.wordpress.com/2013/02/16/c-plus-plus-rtti-for-games/

http://kentonv.github.io/capnproto/install.html

http://blinkprotocol.org/

http://www.altdevblogaday.com/2012/06/04/read-my-lips-no-more-loading-screens/

 

Desired characteristics

So, the basic problem is that you want to store a large variety of “resources” for a game “level” (or “sector” or whatever), describing all sorts of things such as the location of every object, the vertices of every mesh, and so on. There are a number of desirable characteristics for this data format, such as:

  • Efficient on-disc representation
  • Efficient run-time and load-time access. Ideally you can load the data structure into memory and just use it directly without any “translation” step. At the very least any load-time cost should be cheap.
  • Some way of reflecting the data at run-time. Ideally this is something you don’t need for the final release game, at least for most objects, but it’s extremely useful during development to be able to essentially “attach” a visual debugger to a running game and get a description of every object in the level. E.g. for console development your editor could run on the PC, while the actual game runs on the console. The console would reflect its actual runtime state and keep the editor up to date, and the editor would send modifications as messages to the runtime.
  • Some way of evolving the representation of an object. Ideally a new runtime of the game should be able to just transparently load old resources without any fuss, or rebuilding. This is particularly useful for post-release patching, but even during development “rebuilding” thousands of objects (with whatever synchronization overhead and dependency tracking you might also need to do to avoid stomping on other developers’ toes) can quickly become an issue.
  • Multiple artists should be able to work on a level at the same time.

I haven’t implemented anything that provides all these benefits simultaneously in a shipping game, but with hindsight, and after thinking about some of the newfangled “protocol” formats that have become popular recently, I think I have a good approach in mind.

Overview

Levels are just folders of resources in the file system. Every single “object” in the level corresponds to a file, e.g. “tree_001291.inst”. The exact internal organization can vary – perhaps sub-folders refer to individual “sectors” (for e.g. streaming purposes), or perhaps the sub-folders are completely arbitrary and used by the designers to organize things however they want. The point is that there is no overall “level file” that anyone has to lock to modify the level. This means multiple artists can add to and modify different parts of the level. Just use normal source control, with exclusive locks, to manage the individual files. Any per-level properties would be stored in tiny little resources that get dropped into the level like anything else. E.g. you might have a “physics properties” resource that stores a per-level set of parameters like gravity. Keep even these global resources small and focused, to avoid contention. This does mean you’ll have many thousands of files per level, but it’s also super simple to inspect and debug when anything goes wrong.

Resources consist of “instance” data and a “schema”, both of which have a source representation and a runtime representation. For example, the schema might be represented in a simple JSON-like format, while the instance source data can vary (e.g. a Maya .mb file, or a .wav file). The runtime representation is binary and efficient, produced at build-time from the source, using an appropriate “builder”. It’s likely that many different kinds of instances use the same builder (e.g. simple “property bags” type resources might just take their input as plain JSON, even if each one represents a completely different kind of game object), with a few custom ones requiring more severe processing.

The schema describes the data layout of each resource type. I.e. you might have a thousand “goblin” resources in the level, but there will only be a single schema describing the layout of all those instances. The reason for storing the schema separately, as opposed to having each instance be self-describing, is to avoid duplicating type-metadata and field offsets etc. that don’t change, and also because it gives you a convenient place to hang additional (optional) meta data off of, such as RTTI information (and on the editor side of things, UI meta data).

The instance data is basically just the actual bits that refer to a single instance of that resource. E.g. a single health pack, or monster spawn location, or rigid body, or a mesh, or whatever. It contains no “meta data”. It’s a compact format, but stored in a “native” way so that it’s cheap to read at runtime directly without first having to convert it to a runtime representation. Keeping the instance data in the “native” form is important because it means most resource types wouldn’t have any “runtime version” of the on-disc data – it would just use the instance data directly. Some types would probably have some additional “companion” runtime data layered on top, and created at load-time – e.g. D3D header objects and such.

A particular version of a schema is immutable. I.e. once you’ve created (and checked in) a schema you can’t ever change it. You can only create a new version of it (and you can only do that according to specific rules, that don’t invalidate previous instances and schemas). Each instance refers to a particular schema of a particular version. In other words, the game can load any version of any instance, because it uses the schema the instance was built against to interpret the data. So what I said above about there only being one schema for thousands of instances of a given resource type was a lie – there can be multiple versions of each schema. In practice you’ll only have a few versions actively used at any one time, because anytime you modify an instance, it would be saved out using the most recent schema (and you can periodically run a “refresh” batch job to rebuild instances).

The schema’s runtime asset is simple. Each field in the structure has a unique integer identifier starting from zero, with no gaps. This is the only way to identify a field. These identifiers cannot change when you create new versions of a schema. You can add new fields, and you can mark old fields as “deleted” (which means they no longer take up space in the instance), but you can’t actually reuse an old identifier. Lots of deleted fields imply some overhead in the schema runtime asset, but not in the instance data. These rules ensure that creating a new version of the schema won’t cause old instances to be invalid.

The runtime structure of a schema is basically just an array of offsets, one for each field, indicating where that field is stored within the instance data. There’s also some additional meta-data containing the string representation of each field etc., used for RTTI (ideally stored in a separate chunk that can be skipped if you don’t need it).

Each schema corresponds to a generated C++ class header. This class simply provides a bunch of accessor methods. Each one looks up the offset for the field in question using the offset array from the schema, and the identifier for the field, and returns the memory from the instance data at that location cast to the right type. This does mean an additional indirection for field access, but it should still be reasonably fast (especially since the schema’s offset table is likely to be hot in the cache quite frequently). If the field doesn’t exist in the schema (e.g. it was added after the schema was created – so its identifier is larger than the size of the offset array, or the offset array contains the “deleted” marker), the generated code would simply return the default value (stored inline as a constant, in the generated code). This is the migration path – if you add a new field to a schema, all you need to do is make sure you add the corresponding code to “do the right thing” with that new field, including doing something sensible for default values. Old code will simply ignore that field (its version of the generated class won’t have it, nor any code that relies on it!), and if the new code reads an old instance it will just get the default value for that field.
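
To make that concrete, here’s a hedged sketch of what one generated accessor could look like – the schema layout, field ids and defaults here are all invented for illustration:

// Illustrative sketch of a generated accessor; the real code would be emitted
// by the schema tool.
struct Schema
{
    std::vector<int32_t> field_offsets; // indexed by field id; -1 means "deleted"
};

class GoblinInstance
{
public:
    GoblinInstance(const Schema* schema, const uint8_t* data)
        : schema_(schema), data_(data) {}

    // Field id 3: "max_health", added in schema version 2, default 100.
    int32_t max_health() const
    {
        const size_t field_id = 3;
        if (field_id >= schema_->field_offsets.size())
            return 100; // instance predates this field: return the generated default
        const int32_t offset = schema_->field_offsets[field_id];
        if (offset < 0)
            return 100; // field marked as deleted in this schema version
        int32_t value;
        memcpy(&value, data_ + offset, sizeof(value));
        return value;
    }

private:
    const Schema* schema_;
    const uint8_t* data_;
};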

Further details and rationale

So that’s the basic idea, there are some other details that are important to note, though.

So clearly you don’t want to be loading individual files all the time, which is why there needs to be some kind of “bundling” process that groups resources together. It would probably sort them by (schema, version), so that you load all objects with a given schema at the same time. This wouldn’t just be for release builds; you’d regularly build these bundles even during development. In order that you don’t invalidate a whole bundle and revert to excruciatingly slow load times if you modify a single resource, there needs to be a way for the loader to basically override individual resources from a bundle.

One likely objection is that it’s much simpler to just enforce header/data consistency and always just do a plain memcpy-style resource load. That way there’s no need for the indirection via the schema. Why go through the trouble? Well, in a system with strict header/data dependencies, if you change the layout, you have to rebuild every resource of that type. I really like the simplicity of this approach, but I’ve been down that road and had to live with resource rebuilding being a major iteration hurdle for everyone involved.

The bottom line is that if you have thousands of assets of a particular resource type, every tiny little inefficiency in your build-time code gets multiplied by that number, making your performance margins razor-thin. Even 100ms of build time for a resource can quickly become unacceptable if you have ten thousand of them that need to be rebuilt because you added a flag. It’s not hard to hit those kind of times from just loading DLLs, and hitting the disk once or twice. Any slight inefficiency quickly adds up to many seconds, or even minutes of waiting around. I don’t know about you, but I’d rather not worry about a couple of milliseconds here and there on the build-side of my code. Plus, this approach opens up a pretty nice patching/DLC story.

Another objection may concern the fact that we’re generating code. Why not let our schema source asset simply be our C++ header file and parse it to build the runtime asset for a schema? Or why not annotate it with preprocessor macros and generate the offsets directly in code? Well, in a word – simplicity. Writing parsers is pretty hard. Writing a parser with good error messages is even harder. Voluntarily writing a parser with good error messages for C++ is pretty much grounds for termination. You do not want to be in this business! You might be able to restrict the kind of C++ you allow in this “schema header”, but inevitably someone will mess up and put unsupported code in there, and your cobbled together mess of preprocessor defines, build-time parsers, and maybe even template meta programming will lead to the worst error messages you’ve ever seen. I’ve been down this road. There be dragons. Maybe you can avoid them if you think hard about it going in, but honestly I don’t see any major upside to the approach that pays for the extra risk and complexity.

In contrast, generating code is child’s play. The accessor methods in the generated code are trivial. Anyone can understand and debug them. There’s absolutely nothing clever going on at all when you’re generating code (the code that generates C++ is simple, and the generated code is simple too). Everything is just plain old dumb C++ code that contains absolutely nothing clever. In my opinion, of the common choices for these things, code generation is the right choice. It retains maximum flexibility since you can just generate any code you want (you’re not restricted to what you can do with preprocessor macros or templates or whatever), without costing you any simplicity.

You may worry about the overhead of these accessor methods. While it’s reasonably cheap, there’s still a range-check on the offset, and an extra indirection. I’m not overly concerned about that. Part of it is because in my experience it turns out a huge chunk of the data you actually touch lives in a relatively small set of game types. E.g. matrices, quaternions, vectors, colors, etc. These types never actually change representation in practice, so can simply be hard coded as “native” types instead of being represented as schemas of their own. In other words, accessing the toplevel “transform” field does require the extra range check and indirection, but each member of that structure (e.g. 16 matrix members) can be accessed directly as it’s just a plain old C struct. I’d estimate you end up with a few dozen “native” types like this, which improves the constant factors involved by quite a bit.

Plus, if accessor overhead is really an issue, you could make sure your Release builds will never load old instances (by running a global “refresh job” on all instances as you produce your final builds) and hard-code the offset in the generated accessor code instead of going via the schema. I’d avoid this, because it means you may have issues using the “upgrade path” for future patches. Another option is to tag certain type as having an “efficient” format, this would make the schema builder also produce a plain C struct for the type, and a loader method that simply copies each member at load time (using the schema offset table and instance data) – once that’s done it’s just plain old C data that can be accessed at full speed. This costs you load time performance and simplicity, but does offer a nice “backup plan” (e.g. if you want to modify the runtime data and save it back out to the instance source format, for live editing purposes).

So what about on-disc storage? If we need to store all our data in a “native” format, won’t we waste a whole bunch of memory? For example, using 32 bits to store an integer when it could be compressed as 10 bits is a bit wasteful, right? Yeah, this is true, and it may be worth worrying about. However, it’s not uncommon to have some kind of transparent compression on the IO-system level anyway (e.g. based on zlib, or Snappy), so combined with the bundling system above most of the redundancy should be compressed away. If not, Cap’n Proto has a clever trick where they store the instance data as the actual instance data XOR’d with the default instance (which I would store in the schema). This means most instances consist of tons of zeroes (since most fields aren’t actually changed from their defaults, in practice). Combine this with some really cheap zero-aware compression where you only store non-zero bytes, and you get great on-disc storage characteristics. Of course, the next step is to skip the XOR step altogether. When you compress the instance you do it using a “default-aware” compressor instead of a zero-aware one – i.e. skip storing any bytes that are identical to the default instance, and at load time you would initialize the instance by either loading bytes from the resource instance or from the default instance, based on some control stream stored in the resource. Of course, you should measure carefully if this is worth it. There’s something to be said for loading directly from disk into the final destination buffer, without having to even involve the CPU.
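
The XOR trick itself is almost trivial on the build side. A sketch (the zero-aware compression pass would be whatever scheme you already use):

// Encode an instance against its schema's default instance. Fields left at
// their defaults become runs of zero bytes, which compress extremely well.
void encode_against_default(const uint8_t* instance, const uint8_t* defaults,
                            uint8_t* out, size_t size)
{
    for (size_t i = 0; i < size; ++i)
        out[i] = instance[i] ^ defaults[i];
}

// Decoding at load time is the same operation in reverse:
//   instance[i] = stored[i] ^ defaults[i]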

Another interesting thing I’d like to try is separating mutable from immutable data at the level of the schema (with immutable being the default, and the flag affecting which accessors get generated). Typically the vast majority of the data you’d load this way is immutable (in terms of size) – vertex data, audio data, texture data, and so on – with only a few things actually being modified at runtime. Since we’re generating code from a schema, we can run these kinds of global experiments pretty easily.
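A quick sketch of how that flag might surface in the generated code (entirely hypothetical, since this is an experiment I haven’t run): immutable fields only get a getter reading straight from the loaded data, while fields marked mutable also get a setter writing to a small, separately allocated block.

// Hypothetical output for a schema declaring:
//   float health;           (immutable, the default)
//   mutable int32 ammo;
struct SoldierAccessors {
    const uint8_t* immutableData;  // loaded straight from disc, never written
    uint8_t*       mutableData;    // small writable block for mutable fields only

    float   health() const { float v;   std::memcpy(&v, immutableData + 0, sizeof v); return v; }
    int32_t ammo()   const { int32_t v; std::memcpy(&v, mutableData   + 0, sizeof v); return v; }
    void    setAmmo(int32_t v)         { std::memcpy(mutableData + 0, &v, sizeof v); }
};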

Anyway, this seems to combine the best features of all the different systems I’ve worked with, read about, or implemented myself. Thoughts?

On GC in Games (response to Jeff and Casey)

Standard

So it turns out YouTube comments suck. I’ll write my response to Jeff and Casey’s latest podcast in blog form instead of continuing the discussion there. You can view the podcast here: http://www.youtube.com/watch?v=tK50z_gUpZI

Now, first let me say that I agree with 99% of the sentiment of this podcast. I think writing high performance games in Java or C# is kinda crazy, and the current trend of writing apps in HTML5 and JavaScript and then running it on top of some browser-like environment is positively bonkers. The proliferation of abstraction layers and general “cruft” is a huge pet peeve of mine – I don’t understand why it takes 30 seconds to launch a glorified text editor (like most IDEs – Eclipse, Visual Studio, etc.), when it took a fraction of a second twenty years ago on hardware that was thousands of times slower.

That said, I do think their arguments against GC aren’t quite fair (and note that GC has nothing to do with JITs or VMs). They pile up roughly the right things in the “cons” column, but completely ignore the “pros” column, and as a result act baffled that anyone would ever think GC is appropriate for any reason.

Before I get into it, I should probably link to my previous post on GC where I spend a large chunk of time lamenting how poorly designed C# and Java are w.r.t. GC in particular. Read it here.

To summarize: no mainstream language does this “right”. What you want is a language that’s memory safe, but without relegating every single allocation to a garbage collected heap. 95% of your memory allocations should either be a pure stack allocation, or anchored to the stack (RAII helps), or be tied uniquely to some owning object and die immediately when the parent dies. Furthermore, the language should highly discourage allocations in general – it should be value-oriented like C so that there’s just plain less garbage to deal with in the first place. Rust is a good example of a language of this kind.

You’ll note that most of Jeff and Casey’s ranting is not actually about the GC itself, but about promiscuous allocation behavior. I fully agree with that part, but I think it’s a mistake to conflate the two. GC doesn’t imply that you should heap allocate at the drop of a hat, or that you shouldn’t think about who “owns” what memory.

Here’s the point: Garbage collection is about memory safety. It’s not about convenience, really. Nobody serious argues that GC means you don’t have to worry about resource usage. If you have type safety, array bounds checks, null safety, and garbage collection, you can eliminate memory corruption. That’s why people accept all the downsides of GC even in languages where it comes with much higher than necessary penalties (e.g. Java, C#, Ruby, Lua, Python, and so on… pretty much all mainstream languages).

A couple of weeks ago I spent several days tracking down a heap corruption in a very popular third-party game engine. I haven’t tracked down who’s responsible for the bug (though I have access to their repository history), or exactly how long it’s been there, but from the kind of bug it was I wouldn’t be surprised if it’s been there for many years, and therefore in hundreds (or even thousands?) of shipped games. It just started happening after several years for no real reason (maybe the link order changed just enough, or the order of heap allocations changed just enough, to make it actually show up as a crash).

The main thing to say about this bug (I won’t detail it here because it’s not my code) is that it was caused by three different pieces of code interacting badly, while none of them was doing anything stupid on its own. I can easily see very smart and professional programmers writing these three pieces of code at different times, perhaps iterating on each a few times, and all of a sudden there’s a perfect storm and a latent memory corruption is born.

I mention this because it raises a few important points:

  • Memory corruption is not always caught before you ship. Any argument that manual memory management isn’t so bad because memory corruption is at least transparent and debuggable, unlike an opaque GC, falls flat on its face for this reason. Yes, you have all the code, and it’s not very complicated, but how does that help you if you never even see the bug before you ship? Memory corruption bugs are frequently hard to even repro. They might happen once every thousand hours, due to some rare race condition or some extremely rare sequence of heap events. You could in principle debug them (though it often takes considerable effort and time) if you knew they were there, but sometimes you simply don’t.
  • Memory corruption is often very hard to debug. This goes hand in hand with the previous point. Something scribbles over some memory, and forty minutes later enough errors have cascaded from it to cause a visible crash. It’s extremely hard to trace back in time to figure out the root cause of these things. This is another ding against the “the GC is so opaque” argument. Opacity isn’t just about whether or not you have access to the code – it’s also about how easy it is to fix even if you do. The extreme difficulty of tracking down some of the more subtle memory corruption bugs means that the theoretical transparency you get from owning all the code really doesn’t mean much. With a GC, at least most problems are simple to understand – yes, you may have to “fix” one by tuning some parameters, or even by pre-allocating/reusing memory to avoid the GC altogether (because you can’t break open the GC itself), but that is far less effort and complexity than a lot of heap corruption bugs.
  • Smart people fuck up too. In the comments there were a number of arguments that essentially took the form “real programmers can deal with manual memory management”*. Well, this is an engine developed by some of the best developers in the industry, and it’s used for many thousands of games, including many AAA games. Furthermore, there was absolutely nothing “stupid” going on here. It was all code that looked completely sane and sensible, but due to some very subtle interactions caused a scribble. Also, it’s not hard to go through the release notes for RAD-developed middleware and find fixes for memory corruption bugs – so clearly even RAD engineers (of whom I have a very high opinion) occasionally fuck up here.

With memory safety, most of these bugs simply disappear, and the ones that remain turn into a different kind of bug that’s much easier to track down: a space leak. A dangling pointer in a memory-safe language just means you’ll end up using more memory than you expected, which can be tracked down in minutes using rudimentary heap-analysis tools.

In other words: memory safety eliminates a whole host of bugs and improves the debuggability of others. Even when a GC causes additional issues (and it does – there’s a real cost to GC for sure), it at least does so before you ship, unlike the bugs caused by not having memory safety. This is a very important distinction!

Yes, you should be careful, and I’m certainly not advocating Java or C# here, but when you do consider the tradeoffs you should at least be honest about the downsides of not having memory safety. There is real value in eliminating these issues up front.

In current languages I would probably almost always come down on the side of not paying the cost of GC for high-performance applications. E.g. I’ll generally argue against any kind of interpretation or VM-based scripting altogether (DSLs that compile to native is a different issue), especially if they require a GC. However, I don’t think you need to overstate your case when making the tradeoff.

If I could pay a small fixed cost of, let’s say 0.5ms per frame, but be guaranteed that I’m not going to have to worry about any memory corruption ever again I’d totally take that tradeoff. We’re not there yet, but we really aren’t that far off either – the problem isn’t intrinsic to GC. Plenty of high performance games, even 60Hz ones, have shipped with GC’d scripting languages, and while they don’t usually collect the whole heap, a lot of them do manage to keep the GC overhead around that level of cost. So maybe in the future, instead of paying 0.5ms to GC a small heap that’s completely ruined by a shitty language that generates too much garbage, we could instead GC the whole heap and end up with similar levels of complexity by just creating less garbage in the first place (using a non-shitty language).

 

*Side note: I really hate arguments of the form “real programmers can deal with X” used to dismiss the problem by basically implying that anyone who has a problem just isn’t very good. It’s incredibly insulting and lazy, and no discussion was ever improved by saying it. In my opinion hubris, or extrapolating too far from your own experience, is a far more common sign of incompetence or inexperience than admitting that something is hard.

Runtime Code Specialization

Standard

Specializing code is nothing new, but I’m surprised that modern programming languages don’t attempt to make this easier.

Just to give you some context, the idea is that whenever you have an algorithm that takes multiple parameters where one or more of the parameters are fixed for some period of time, you specialize the code for the fixed parameters in order to achieve greater optimization at runtime. A key idea is that you may not know at compile time what the fixed value is, and the duration for which a value is fixed could be shorter than the execution of the program (mere milliseconds, even).

One example would be to compile a regular expression into machine code – at runtime – whose function is to take the input string and match it against the (fixed) regular expression. This can be many times faster than an algorithm based on interpreting the regular expression as you go.

Other examples are algorithms where there are runtime inefficiencies due to combinatorial explosion, such as converting one image format to another. If you have enough different image formats it’s infeasible to write one function per pair of formats, so you typically write a single conversion function that converts from the source format into some high-precision intermediate format, then calls another function to convert to the destination format. This can be inefficient because you have to switch on the data format inside the conversion loop. With code specialization you look up the two conversion functions once and then call them directly (or even inline them, and then simplify away the intermediate format where possible).
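A small sketch of that idea in C++ (the format names and signatures are invented): the two per-format functions are resolved once, outside the loop, instead of switching on both formats for every pixel.

#include <cstddef>

// Hypothetical high-precision intermediate format.
struct PixelF32 { float r, g, b, a; };

using DecodeFn = PixelF32 (*)(const void* src, size_t i);
using EncodeFn = void     (*)(void* dst, size_t i, const PixelF32& p);

// The naive version switches on both formats inside this loop; here the two
// conversion functions are looked up once and called (or, with specialization,
// inlined) directly.
void ConvertImage(const void* src, DecodeFn decode,
                  void*       dst, EncodeFn encode,
                  size_t pixelCount) {
    for (size_t i = 0; i < pixelCount; ++i)
        encode(dst, i, decode(src, i));
}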

Now, all of this can be done manually, and while things haven’t improved as much as I would like, they are getting better. In C# you could always use the System.Reflection.Emit namespace to generate MSIL at runtime and then compile it into a method. This was tedious and low-level, but better than generating x86 assembly at least! Starting with .NET 3.5 you can use expression trees instead. This is much better: not only can the compiler convert simple lambda expressions directly into expression trees, but even building them manually is much higher level.

For example, starting with a simple function to match “glob” patterns (e.g. “f?o.*”) like this:

// Matches a simple "glob" pattern ('?' = any single char, '*' = any sequence)
// against testString, starting at pattern index i and string index j.
bool match(string pattern, string testString, int i, int j)
{
    if (i == pattern.Length) 
        return j == testString.Length;

    switch (pattern[i])
    {
        case '?'    : 
            // '?' consumes exactly one character.
            return match(pattern, testString, i+1, j+1);
        case '*'    :                    
            // '*' tries every possible tail, including the empty one.
            for (int k = j; k <= testString.Length; ++k)
            {
                if (match(pattern, testString, i+1, k)) 
                      return true;                        
            }
            return false;                

        default     : 
            // Literal character: must match exactly.
            return j < testString.Length && 
                   pattern[i] == testString[j] && 
                   match(pattern, testString, i+1, j+1);
    }
}

We can (manually) convert this into a function that, given an expression holding the string to be tested and the starting index, generates an expression representing whether or not the pattern matches. We do this by interpreting the pattern ourselves and generating code that just tests the passed-in string expression against it. The generated expression never calls any functions; instead we keep traversing the pattern, doing as much of the testing as we can ahead of time, and only put the parts that absolutely depend on the passed-in test string into the expression tree:

// Reflection handles used by the generated code
// (requires System.Linq.Expressions and System.Reflection).
static readonly PropertyInfo StringLengthProp = typeof(string).GetProperty("Length");
static readonly MethodInfo   GetCharsMethod   = typeof(string).GetMethod("get_Chars");

// Generates an expression of type bool
static Expression GenerateMatchExpression(string pattern, int i, ParameterExpression str, Expression j )
{
    var strLenExp = Expression.Property(str, StringLengthProp);

    // If the pattern is empty, we match if we've reached the end of the string
    if (i == pattern.Length) 
        return Expression.Equal(j, strLenExp);

    // Shave off a single char from the string
    if (pattern[i] == '?') 
        return GenerateMatchExpression(pattern, i+1, str, Expression.Increment(j));
    
    // Remove the wild card, and try to match all substrings
    if (pattern[i] == '*')
    {
        var subStringStart = Expression.Variable(typeof(int));              
        var breakLabel = Expression.Label(typeof(bool));
        return Expression.Block(typeof(bool),
            new ParameterExpression[] { subStringStart },
            // start at current index, we'll increment this 
            // to check all "tails", including empty string
            Expression.Assign(subStringStart, j), 
            Expression.Loop(
                Expression.Block(                        
                    // strStart > str.Length
                    Expression.IfThen(Expression.GreaterThan(subStringStart, strLenExp),
                            Expression.Break(breakLabel, Expression.Constant(false))
                    ),

                    //check substring
                    Expression.IfThen(GenerateMatchExpression(pattern, i + 1, str, subStringStart), 
                        Expression.Break(breakLabel, Expression.Constant(true))
                    ),
                    Expression.Assign(subStringStart, Expression.Increment(subStringStart))
                ),
                breakLabel
            )
        );
    }
    
    // Check a single character.
    return  Expression.AndAlso(  
                Expression.LessThan(j, strLenExp), 
                Expression.AndAlso(  
                    Expression.Equal( Expression.Constant(pattern[i]), 
                                      Expression.Call(str, GetCharsMethod, j)),
                    GenerateMatchExpression(pattern, i + 1, str, Expression.Increment(j))
                )
            );
    
}

Then we pass in the string and index by wrapping this expression in a lambda like so:

var strParam = Expression.Parameter(typeof(string));
var exp = Expression.Lambda<Func<string,bool>>(
                GenerateMatchExpression(pattern, 0, strParam, Expression.Constant(0)), strParam
          );        
Func<string,bool> match = exp.Compile();

For example, if we pass in “a*b*c?”, we get an expression tree that looks like this:

.Lambda #Lambda1(System.String $var1) {
    0 < $var1.Length && 'a' == .Call $var1.get_Chars(0) && .Block(System.Int32 $var2) {
        $var2 = .Increment(0);
        .Loop  {
            .Block() {
                .If ($var2 > $var1.Length) {
                    .Break #Label1 { False }
                } .Else {
                    .Default(System.Void)
                };
                .If (
                    $var2 < $var1.Length && 'b' == .Call $var1.get_Chars($var2) && .Block(System.Int32 $var3) {
                        $var3 = .Increment($var2);
                        .Loop  {
                            .Block() {
                                .If ($var3 > $var1.Length) {
                                    .Break #Label2 { False }
                                } .Else {
                                    .Default(System.Void)
                                };
                                .If (
                                    $var3 < $var1.Length && 'c' == .Call $var1.get_Chars($var3) && .Increment(.Increment($var3)) == $var1.Length
                                ) {
                                    .Break #Label2 { True }
                                } .Else {
                                    .Default(System.Void)
                                };
                                $var3 = .Increment($var3)
                            }
                        }
                        .LabelTarget #Label2:
                    }
                ) {
                    .Break #Label1 { True }
                } .Else {
                    .Default(System.Void)
                };
                $var2 = .Increment($var2)
            }
        }
        .LabelTarget #Label1:
    }
}

Note how we get one loop generated for each ‘*’ character, in a nested fashion, and how the entire thing is inlined and specialized. This should compile to pretty efficient code. Of course it could be even more efficient if we first generated a DFA from this and then compiled that, but the point here is code specialization, not finite automata.

Now, this works, but holy hell that took a lot of work to get right, and this was for a fairly trivial example. Imagine if this was something more complicated! Imagine if you had to maintain this! The horror!

So, what do we actually need? We need a language that allows you to specialize (almost) arbitrary functions for some of their parameters, returning a new function with all the constants folded away, virtual functions inlined, and so on. If this were as easy as writing down a few attributes, you could imagine such a language achieving far superior performance even to native languages like C++, simply because it’s infeasible to do this kind of code specialization on a wide scale without language support for it. You could potentially write a C# function that does this: create an expression tree by wrapping the code to be specialized in a lambda, then pass that expression to a function that massages the resulting tree given that some of the parameters are constant. Perhaps the JIT compiler can even be convinced to do some of the optimizations for you once it knows that some of the inputs are constants, but more likely than not you’d have to manually propagate the fact that these things are truly fixed through the expression tree.

This probably isn’t as simple as I make it out to be. For example, you may want to specialize an animation function on the blend tree, while allowing the specific blend values of each node to vary each time you invoke it. This could be done by having the programmer manually separate those two pieces of data and only specializing the first, but ideally you could mix “fixed” data with “varying” data and have specialization only “boil away” the fixed parts, while keeping references to the varying parts.

My contention is that client-facing apps have a lot of code that could benefit from this sort of strategy. Think about something like a game engine running at 60Hz: it’s amazing how much time you spend doing the exact same thing over and over again every frame, where the only thing that differs from frame to frame is the specific integer or float that gets fed to the ALU, while most branches and vtable lookups go the same way every time. Being able to flag things as constant for some duration of time and avoid all that “interpretation overhead” would be a huge win in a bunch of cases.

The next step would be to have a runtime that automatically traces the actual execution and specializes things for you as you go, much like JavaScript compilers do. However, I’m talking about applying this to highly optimized ahead-of-time compiled programs, so the number of things the tracer would have to look out for would presumably be less extensive than for a dynamic language (e.g. we’re not looking to discover that a variable is an integer – we already know that). We’re basically just looking to do some constant folding, some vtable lookups (and subsequent inlining), and things like that. The static compiler could flag where runtime tracing might be profitable, perhaps.

I kinda like the idea of having the programmer explicitly define specializations, though, because it means you can put the onus on the programmer to guarantee things a compiler wouldn’t want to assume – like non-divergence of recursive functions, which means that e.g. a tree-traversal function could be specialized with each recursive call blindly inlined. The programmer would be responsible for ensuring that, taking the specified parameters as constants, the recursion will eventually stop.

Are there any languages out there that do this?