On GC in Games (response to Jeff and Casey)

So it turns out YouTube comments suck. I’ll write my response to Jeff and Casey’s latest podcast in blog form instead of continuing the discussion there. View it here: http://www.youtube.com/watch?v=tK50z_gUpZI

Now, first let me say that I agree with 99% of the sentiment of this podcast. I think writing high performance games in Java or C# is kinda crazy, and the current trend of writing apps in HTML5 and JavaScript and then running them on top of some browser-like environment is positively bonkers. The proliferation of abstraction layers and general “cruft” is a huge pet peeve of mine – I don’t understand why it takes 30 seconds to launch a glorified text editor (like most IDEs – Eclipse, Visual Studio, etc.), when it took a fraction of a second twenty years ago on hardware that was thousands of times slower.

That said, I do think their arguments against GC aren’t quite fair (and note that GC has nothing to do with JITs or VMs). They pile up roughly the right things in the “cons” column, but they completely ignore the “pros” column, and as a result act baffled that anyone would ever think GC is appropriate for any reason.

Before I get into it, I should probably link to my previous post on GC where I spend a large chunk of time lamenting how poorly designed C# and Java are w.r.t. GC in particular. Read it here.

To summarize: no mainstream language does this “right”. What you want is a language that’s memory safe, but without relegating every single allocation to a garbage collected heap. 95% of your memory allocations should either be a pure stack allocation, or anchored to the stack (RAII helps), or be tied uniquely to some owning object and die immediately when the parent dies. Furthermore, the language should highly discourage allocations in general - it should be value-oriented like C so that there’s just plain less garbage to deal with in the first place. Rust is a good example of a language of this kind.
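To make "value-oriented" a bit more concrete, here's a rough sketch in Rust. Every name in it (`Vec3`, `Particle`, `Emitter`, `simulate`) is made up for illustration and doesn't come from any real engine. The point is only that the particles are plain values stored inline, the one heap buffer is uniquely owned by the emitter, and the emitter itself is anchored to a stack frame, so there's nothing left over for a collector to find.

```rust
// All names here (Vec3, Particle, Emitter, simulate) are hypothetical,
// made up for illustration only.

#[derive(Clone, Copy, Debug, PartialEq)]
struct Vec3 {
    x: f32,
    y: f32,
    z: f32,
}

// A plain value type: stored inline wherever it is used, never boxed.
#[derive(Clone, Copy, Debug)]
struct Particle {
    position: Vec3,
    velocity: Vec3,
}

// The emitter uniquely owns its particle buffer. When the emitter dies,
// the buffer is freed right then.
struct Emitter {
    particles: Vec<Particle>,
}

impl Emitter {
    fn new(count: usize) -> Self {
        let seed = Particle {
            position: Vec3 { x: 0.0, y: 0.0, z: 0.0 },
            velocity: Vec3 { x: 1.0, y: 0.0, z: 0.0 },
        };
        // At most one heap allocation for the whole batch, not one per particle.
        Emitter { particles: vec![seed; count] }
    }

    fn step(&mut self, dt: f32) {
        for p in &mut self.particles {
            p.position.x += p.velocity.x * dt;
            p.position.y += p.velocity.y * dt;
            p.position.z += p.velocity.z * dt;
        }
    }
}

fn simulate(count: usize, steps: u32, dt: f32) -> Option<Vec3> {
    let mut emitter = Emitter::new(count); // anchored to this stack frame
    for _ in 0..steps {
        emitter.step(dt);
    }
    // Copy the result out by value; None if there were no particles.
    emitter.particles.first().map(|p| p.position)
} // `emitter` and its buffer are freed here, deterministically

fn main() {
    println!("{:?}", simulate(8, 4, 0.5));
}
```

Note that the per-frame `step` loop allocates nothing at all, which is the kind of default behavior I'd want a language to push you towards.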

You’ll note that most of Jeff and Casey’s ranting is not actually about the GC itself, but about promiscuous allocation behavior, and I fully agree with that, but I think it’s a mistake to conflate the two. GC doesn’t imply that you should heap allocate at the drop of a hat, or that you shouldn’t think about who “owns” what memory.

Here’s the point: Garbage collection is about memory safety. It’s not about convenience, really. Nobody serious argues that GC means you don’t have to worry about resource usage. If you have type safety, array bounds checks, null safety, and garbage collection, you can eliminate memory corruption. That’s why people accept all the downsides of GC even in languages where it comes with much higher than necessary penalties (e.g. Java, C#, Ruby, Lua, Python, and so on… pretty much all mainstream languages).
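As a small illustration of the non-GC parts of that list, here's a sketch (again in Rust, with hypothetical helper names `write_slot` and `find_health`) of what bounds checks and null safety buy you: an out-of-range write or a missing lookup becomes an explicit, checked outcome at the point where it happens, instead of a silent scribble over some neighbouring memory.

```rust
// Hypothetical helpers for illustration only.

// Bounds-checked write: an out-of-range index is reported as an error
// instead of writing past the end of the table.
fn write_slot(table: &mut [u32], index: usize, value: u32) -> Result<(), String> {
    let len = table.len();
    match table.get_mut(index) {
        Some(slot) => {
            *slot = value;
            Ok(())
        }
        None => Err(format!("index {} out of bounds for length {}", index, len)),
    }
}

// Null safety: "not found" is an explicit None that the caller has to
// handle, rather than a null pointer that might get dereferenced later.
fn find_health(entities: &[(u32, u32)], id: u32) -> Option<u32> {
    entities
        .iter()
        .find(|(entity_id, _)| *entity_id == id)
        .map(|(_, health)| *health)
}

fn main() {
    let mut table = [0u32; 4];
    println!("{:?}", write_slot(&mut table, 2, 7));
    println!("{:?}", write_slot(&mut table, 4, 9)); // one past the end
    println!("{:?}", find_health(&[(1, 100), (2, 50)], 3));
}
```

In C the second `write_slot` call would be exactly the kind of write that shows up as a crash somewhere unrelated, much later.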

A couple of weeks ago I spent several days tracking down a heap corruption in a very popular third party game engine. I haven’t tracked down who’s responsible for the bug (though I have access to their repository history), or exactly how long it’s been there, but from the kind of bug it was I wouldn’t be surprised if it’s been there for many years, and therefore in hundreds (or even thousands?) of shipped games. It just started happening after several years for no real reason (maybe the link order changed just enough, or the order of heap allocations changed just enough, to make it actually show up as a crash).

The main thing to say about this bug (I won’t detail it here because it’s not my code) is that it was caused by three different pieces of code interacting badly, but none of them was necessarily doing anything stupid. I can easily see very smart and professional programmers writing these three pieces of code at different times, and going through a few iterations perhaps, and all of a sudden there’s a perfect storm, and a latent memory corruption is born.

I mention this because it raises a few important points:

  • Memory corruption is not always caught before you ship. Any argument that memory corruption under manual memory management isn’t so bad because it’s at least transparent and debuggable, unlike the opaque GC, falls flat on its face for this reason. Yes, you have all the code, and it’s not very complicated, but how does that help you if you never even see the bug before you ship? Memory corruption bugs are frequently difficult to even repro. They might happen once every thousand hours due to some rare race condition, or some extremely rare sequence of heap events. You could in principle debug it (though it often takes considerable effort and time) if you knew it was there, but sometimes you just don’t.

  • Memory corruption is often very hard to debug. Often this goes hand in hand with the previous point. Something scribbles to some memory, and forty minutes later enough errors have cascaded from this to cause a visible crash. It’s extremely hard to trace back in time to figure out the root cause of these things. This is another ding against the “the GC is so opaque” argument. Opacity isn’t just about whether or not you have access to the code – it’s also about how easy it is to fix even if you do. The extreme difficulty of tracking down some of the more subtle memory corruption bugs means that the theoretical transparency you get from owning all the code really doesn’t mean much. With a GC at least most problems are simple to understand – yes you may have to “fix” it by tuning some parameters, or even pre-allocating/reusing memory to avoid the GC altogether (because you can’t break open the GC itself), but this is far less effort and complexity than a lot of heap corruption bugs.

  • Smart people fuck up too. In the comments there were a number of arguments that essentially took the form “real programmers can deal with manual memory management”*. Well, this is an engine developed by some of the best developers in the industry, and it’s used for many thousands of games, including many AAA games. Furthermore, there was absolutely nothing “stupid” going on here. It was all code that looked completely sane and sensible, but due to some very subtle interactions caused a scribble. Also, it’s not hard to go through the release notes for RAD-developed middleware and find fixes for memory corruption bugs – so clearly even RAD engineers (of whom I have a very high opinion) occasionally fuck up here.

With memory safety, most of these bugs simply disappear. The majority of them really don’t happen at all anymore – and the rest turn into a different kind of bug, which is much easier to track down: a space leak (a dangling pointer in a memory safe language just means you’ll end up using more memory than you expected, which can be tracked down in minutes using rudimentary heap analysis tools).
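Here's a sketch of that "dangling pointer becomes a space leak" behaviour. One loud caveat: Rust's `Rc` is reference counting, not a tracing collector, so I'm only using it as a stand-in for the property that matters here, which is that a forgotten handle keeps the object alive instead of pointing at freed memory. The `Texture` and `Cache` types and the `stale_handle_demo` function are hypothetical.

```rust
use std::rc::{Rc, Weak};

// Hypothetical types for illustration only.
struct Texture {
    pixels: Vec<u8>,
}

// Some subsystem that squirrels away a handle and forgets about it.
struct Cache {
    entries: Vec<Rc<Texture>>,
}

// Returns (alive while cached, bytes held by the cache, alive after clearing).
fn stale_handle_demo() -> (bool, usize, bool) {
    let texture = Rc::new(Texture { pixels: vec![0u8; 1024] });
    // A weak handle lets us observe liveness without keeping the data alive.
    let observer: Weak<Texture> = Rc::downgrade(&texture);

    let mut cache = Cache { entries: Vec::new() };
    cache.entries.push(Rc::clone(&texture));

    // The "real" owner is done with the texture. With manual memory
    // management this is the free() that would leave the cache's pointer
    // dangling.
    drop(texture);

    // Here the data is still valid; we are just holding more memory than
    // we meant to.
    let alive_while_cached = observer.upgrade().is_some();
    let held_bytes: usize = cache.entries.iter().map(|t| t.pixels.len()).sum();

    // Finding and clearing the stale handle fixes the leak.
    cache.entries.clear();
    let alive_after_clear = observer.upgrade().is_some();

    (alive_while_cached, held_bytes, alive_after_clear)
}

fn main() {
    println!("{:?}", stale_handle_demo());
}
```

The `held_bytes` number is the sort of thing a heap analysis tool would point you at: who is holding this, and why is it still reachable?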

In other words: memory safety eliminates a whole host of bugs, and improves debuggability of some other bugs. Even when GCs cause additional issues (which they do – there’s a real cost to GC for sure) they at least do so before you ship, unlike the bugs caused by not having memory safety. This is a very important distinction!

Yes, you should be careful, and I’m certainly not advocating Java or C# here, but when you do consider the tradeoffs you should at least be honest about the downsides of not having memory safety. There is real value in eliminating these issues up front.

In current languages I would probably almost always come down on the side of not paying the cost of GC for high-performance applications. E.g. I’ll generally argue against any kind of interpretation or VM-based scripting altogether (DSLs that compile to native are a different issue), especially if it requires a GC. However, I don’t think you need to overstate your case when making the tradeoff.

If I could pay a small fixed cost of, let’s say 0.5ms per frame, but be guaranteed that I’m not going to have to worry about any memory corruption ever again I’d totally take that tradeoff. We’re not there yet, but we really aren’t that far off either – the problem isn’t intrinsic to GC. Plenty of high performance games, even 60Hz ones, have shipped with GC’d scripting languages, and while they don’t usually collect the whole heap, a lot of them do manage to keep the GC overhead around that level of cost. So maybe in the future, instead of paying 0.5ms to GC a small heap that’s completely ruined by a shitty language that generates too much garbage, we could instead GC the whole heap and end up with similar levels of complexity by just creating less garbage in the first place (using a non-shitty language).

 

*Side note: I really hate arguments of the form “real programmers can deal with X” used to dismiss the problem by basically implying that anyone who has a problem just isn’t very good. It’s incredibly insulting and lazy, and no discussion was ever improved by saying it. In my opinion hubris, or extrapolating too far from your own experience, is a far more common sign of incompetence or inexperience than admitting that something is hard.
