Thoughts on Game Data Formats
4/May 2013
I’ve been thinking about the representation and loading of game data recently. This has largely been the result of stumbling upon a variety of blog posts and libraries such as:
http://gamedevcoder.wordpress.com/2013/02/16/c-plus-plus-rtti-for-games/
http://kentonv.github.io/capnproto/install.html
http://www.altdevblogaday.com/2012/06/04/read-my-lips-no-more-loading-screens/
Desired characteristics
So, the basic problem is that you want to store a large variety of “resources” for a game “level” (or “sector” or whatever), describing all sorts of things such as the location of every object, the vertices of every mesh, and so on. There are a number of desirable characteristics for this data format, such as:
Efficient on-disc representation
Efficient run-time and load-time access. Ideally you can load the data structure into memory and just use it directly without any “translation” step. At the very least, any load-time processing should be cheap.
Some way of reflecting the data at run-time. Ideally this is something you don’t need for the final release game, at least for most objects, but it’s extremely useful during development to be able to essentially “attach” a visual debugger to a running game and get a description of every object in the level. E.g. for console development your editor could run on the PC, while the actual game runs on the console. The console would reflect its actual runtime state and keep the editor up to date, and the editor would send modifications as messages to the runtime.
Some way of evolving the representation of an object. Ideally a new version of the game runtime should be able to transparently load old resources without any fuss or rebuilding. This is particularly useful for post-release patching, but even during development, “rebuilding” thousands of objects (with whatever synchronization overhead and dependency tracking you might also need to do to avoid stomping on other developers’ toes) can quickly become an issue.
Multiple artists should be able to work on a level at the same time.
I haven’t implemented anything that provides all these benefits simultaneously in a shipping game, but with hindsight, and after thinking about some of the newfangled “protocol” formats that have become popular recently, I think I have a good approach in mind.
Overview
Levels are just folders of resources in the file system. Every single “object” in the level corresponds to a file, e.g. “tree_001291.inst”. The exact internal organization can vary - perhaps sub-folders refer to individual “sectors” (for e.g. streaming purposes), or perhaps the sub-folders are completely arbitrary and used by the designers to organize things however they want. The point is that there is no overall “level file” that anyone has to lock to modify the level. This means multiple artists can add to and modify different parts of the level. Just use normal source control, with exclusive locks, to manage the individual files. Any per-level properties would be stored in tiny little resources that get dropped into the level like anything else. E.g. you might have a “physics properties” resource that stores a per-level set of parameters like gravity. Keep even these global resources small and focused, to avoid contention. This does mean you’ll have many thousands of files per level, but it’s also super simple to inspect and debug when anything goes wrong.
Resources consist of “instance” data and a “schema”, both of which have a source representation and a runtime representation. For example, the schema might be represented in a simple JSON-like format, while the instance source data can vary (e.g. a Maya .mb file, or a .wav file). The runtime representation is binary and efficient, produced at build-time from the source, using an appropriate “builder”. It’s likely that many different kinds of instances use the same builder (e.g. simple “property bag” type resources might just take their input as plain JSON, even if each one represents a completely different kind of game object), with a few custom ones requiring heavier processing.
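To pin down the terminology, here’s a minimal sketch of what such a “builder” interface could look like in C++; the interface shape and names are my own assumptions, not anything prescribed above:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical build-time interface: a builder turns a resource's source
// representation (a Maya .mb file, a .wav file, plain JSON, ...) into the
// binary runtime instance data, laid out according to a particular schema
// version.
class IResourceBuilder {
public:
    virtual ~IResourceBuilder() {}
    virtual std::vector<uint8_t> Build(const std::string& sourcePath,
                                       uint32_t schemaId,
                                       uint32_t schemaVersion) = 0;
};
```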
The schema describes the data layout of each resource type. I.e. you might have a thousand “goblin” resources in the level, but there will only be a single schema describing the layout of all those instances. The reason for storing the schema separately, as opposed to having each instance be self-describing, is to avoid duplicating type metadata, field offsets, etc. that don’t change, and also because it gives you a convenient place to hang additional (optional) metadata off of, such as RTTI information (and on the editor side of things, UI metadata).
The instance data is basically just the actual bits for a single instance of that resource. E.g. a single health pack, or monster spawn location, or rigid body, or a mesh, or whatever. It contains no “metadata”. It’s a compact format, but stored in a “native” way so that it’s cheap to read at runtime directly, without first having to convert it to a runtime representation. Keeping the instance data in this “native” form is important because it means most resource types wouldn’t have any “runtime version” of the on-disc data – they would just use the instance data directly. Some types would probably have some additional “companion” runtime data layered on top, created at load-time – e.g. D3D header objects and such.
A particular version of a schema is immutable. I.e. once you’ve created (and checked in) a schema you can’t ever change it. You can only create a new version of it (and you can only do that according to specific rules that don’t invalidate previous instances and schemas). Each instance refers to a particular schema of a particular version. In other words, the game can load any version of any instance, because it uses the schema the instance was built against to interpret the data. So what I said above about there only being one schema for thousands of instances of a given resource type was a lie - there can be multiple versions of each schema. In practice you’ll only have a few versions actively used at any one time, because any time you modify an instance it would be saved out using the most recent schema (and you can periodically run a “refresh” batch job to rebuild instances).
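Just to make “each instance refers to a particular schema of a particular version” concrete, here’s one possible shape for an instance header; the field names and the idea of a hashed schema id are assumptions for illustration, not part of the design above:

```cpp
#include <cstdint>

// Hypothetical on-disc header preceding each instance's payload bytes. The
// (schemaId, schemaVersion) pair tells the loader which schema runtime asset
// to use when interpreting the payload that follows.
struct InstanceHeader {
    uint32_t schemaId;      // e.g. a hash of the resource type name ("goblin")
    uint32_t schemaVersion; // the immutable schema version it was built against
    uint32_t payloadSize;   // size in bytes of the native instance data
};
```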
The schema’s runtime asset is simple. Each field in the structure has a unique integer identifier starting from zero, with no gaps. This is the only way to identify a field. These identifiers cannot change when you create new versions of a schema. You can add new fields, and you can mark old fields as “deleted” (which means they no longer take up space in the instance), but you can’t actually reuse an old identifier. Lots of deleted fields imply some overhead in the schema runtime asset, but not in the instance data. These rules ensure that creating a new version of the schema won’t invalidate old instances.
The runtime structure of a schema is basically just an array of offsets, one for each field, indicating where that field is stored within the instance data. There’s also some additional metadata containing the string name of each field and so on, used for RTTI (ideally stored in a separate chunk that can be skipped if you don’t need it).
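As a minimal sketch (the names and the sentinel value for deleted fields are my own assumptions), the schema runtime asset could look something like this:

```cpp
#include <cstdint>

// Sentinel stored in the offset table for fields marked "deleted" in a later
// schema version: they keep their identifier forever but occupy no space in
// the instance data.
static const uint32_t kFieldDeleted = 0xFFFFFFFFu;

struct SchemaRuntime {
    uint32_t version;        // immutable version of this schema
    uint32_t fieldCount;     // number of entries in the offset table
    const uint32_t* offsets; // offsets[fieldId] = byte offset into the instance
                             // data, or kFieldDeleted if the field is gone
    // The optional RTTI/editor chunk (field names, UI metadata, ...) would live
    // separately so a final build can skip it entirely.
};
```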
Each schema corresponds to a generated C++ class header. This class simply provides a bunch of accessor methods. Each one looks up the offset for the field in question using the offset array from the schema, and the identifier for the field, and returns the memory from the instance data at that location, cast to the right type. This does mean an additional indirection for field access, but it should still be reasonably fast (especially since the schema’s offset table is likely to be hot in the cache quite frequently). If the field doesn’t exist in the schema (e.g. it was added after the schema was created – so its identifier falls outside the offset array, or the offset array contains the “deleted” marker), the generated code would simply return the default value (stored inline as a constant, in the generated code). This is the migration path - if you add a new field to a schema, all you need to do is make sure you add the corresponding code to “do the right thing” with that new field, including doing something sensible for default values. Old code will simply ignore that field (its version of the generated class won’t have it, nor any code that relies on it!), and if the new code reads an old instance it will just get the default value for that field.
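For illustration, a generated accessor along those lines might read roughly like the sketch below, reusing the SchemaRuntime type from above. The class shape, the Instance struct, the field identifier and the default value are all assumptions, not a prescribed layout:

```cpp
#include <cstdint>
#include <cstring>

// A loaded resource: the schema it was built against plus its raw native bytes.
struct Instance {
    const SchemaRuntime* schema;
    const uint8_t* data;
};

// Generated from the hypothetical "goblin" schema; one accessor per field.
class Goblin {
public:
    explicit Goblin(const Instance* inst) : inst_(inst) {}

    // Field 7: "maxHealth", with a default of 100 baked in by the generator.
    int32_t maxHealth() const {
        const SchemaRuntime* s = inst_->schema;
        // Field unknown to this (older) schema, or deleted: return the default.
        if (7 >= s->fieldCount || s->offsets[7] == kFieldDeleted)
            return 100;
        int32_t value;
        std::memcpy(&value, inst_->data + s->offsets[7], sizeof(value));
        return value;
    }

private:
    const Instance* inst_;
};
```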
Further details and rationale
So that’s the basic idea. There are some other details that are important to note, though.
So clearly you don’t want to be loading individual files all the time, which is why there needs to be some kind of “bundling” process that groups resources together. It would probably sort them by (schema, version), so that you load all objects with a given schema at the same time. This wouldn’t just be for release builds; you’d regularly build these bundles even during development. So that modifying a single resource doesn’t invalidate a whole bundle and revert you to excruciatingly slow load times, there needs to be a way for the loader to basically override individual resources from a bundle.
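A rough sketch of that build-time grouping step, under the assumption that bundling is essentially a sort over (schema, version) pairs; the record type and names are hypothetical:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical build-time record for a resource about to be packed into a bundle.
struct PendingResource {
    uint32_t schemaId;
    uint32_t schemaVersion;
    std::vector<uint8_t> instanceBytes;
};

// Group resources so that everything sharing a (schema, version) pair ends up
// contiguous in the bundle and can be loaded in one go.
void SortForBundling(std::vector<PendingResource>& resources) {
    std::sort(resources.begin(), resources.end(),
              [](const PendingResource& a, const PendingResource& b) {
                  if (a.schemaId != b.schemaId) return a.schemaId < b.schemaId;
                  return a.schemaVersion < b.schemaVersion;
              });
}
```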
One likely objection is that it’s much simpler to just enforce header/data consistency and always do a plain memcpy-style resource load. That way there’s no need for the indirection via the schema. Why go through the trouble? Well, in a system with strict header/data dependencies, if you change the layout, you have to rebuild every resource of that type. I really like the simplicity of that approach, but I’ve been down that road and had to live with resource rebuilding being a major iteration hurdle for everyone involved.
The bottom line is that if you have thousands of assets of a particular resource type, every tiny little inefficiency in your build-time code gets multiplied by that number, making your performance margins razor-thin. Even 100ms of build time for a resource can quickly become unacceptable if you have ten thousand of them that need to be rebuilt because you added a flag – that’s already well over fifteen minutes of rebuilding. It’s not hard to hit those kinds of times from just loading DLLs and hitting the disk once or twice. Any slight inefficiency quickly adds up to many seconds, or even minutes, of waiting around. I don’t know about you, but I’d rather not worry about a couple of milliseconds here and there on the build side of my code. Plus, this approach opens up a pretty nice patching/DLC story.
Another objection may concern the fact that we’re generating code. Why not let our schema source asset simply be our C++ header file and parse it to build the runtime asset for a schema? Or why not annotate it with preprocessor macros and generate the offsets directly in code? Well, in a word – simplicity. Writing parsers is pretty hard. Writing a parser with good error messages is even harder. Voluntarily writing a parser with good error messages for C++ is pretty much grounds for termination. You do not want to be in this business! You might be able to restrict the kind of C++ you allow in this “schema header”, but inevitably someone will mess up and put unsupported code in there, and your cobbled-together mess of preprocessor defines, build-time parsers, and maybe even template metaprogramming will lead to the worst error messages you’ve ever seen. I’ve been down this road. There be dragons. Maybe you can avoid them if you think hard about it going in, but honestly I don’t see any major upside to the approach that pays for the extra risk and complexity.
In contrast, generating code is child’s play. The accessor methods in the generated code are trivial. Anyone can understand and debug them. There’s absolutely nothing clever going on at all: the code that generates the C++ is simple, and the generated code is simple too – just plain old dumb C++. In my opinion, of the common choices for these things, code generation is the right one. It retains maximum flexibility, since you can just generate any code you want (you’re not restricted to what you can do with preprocessor macros or templates or whatever), without costing you any simplicity.
You may worry about the overhead of these accessor methods. While it’s reasonably cheap, there’s still a range check on the offset, and an extra indirection. I’m not overly concerned about that. Part of the reason is that, in my experience, a huge chunk of the data you actually touch lives in a relatively small set of game types. E.g. matrices, quaternions, vectors, colors, etc. These types never actually change representation in practice, so they can simply be hard-coded as “native” types instead of being represented as schemas of their own. In other words, accessing the top-level “transform” field does require the extra range check and indirection, but each member of that structure (e.g. 16 matrix members) can be accessed directly, as it’s just a plain old C struct. I’d estimate you end up with a few dozen “native” types like this, which improves the constant factors involved by quite a bit.
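As a small illustration of that point, reusing the Instance and SchemaRuntime sketches from above (the Matrix4x4 layout and field identifier are assumptions, and the instance data is assumed to be suitably aligned):

```cpp
// A "native" type: its layout is fixed forever, so it's a plain C struct and
// never gets a schema of its own.
struct Matrix4x4 {
    float m[16];
};

// Generated accessor for a top-level "transform" field (field id 2 here is an
// arbitrary choice). One range check and one indirection locate the field;
// after that, the 16 floats are read directly with no further lookups.
inline const Matrix4x4* GetTransform(const Instance* inst) {
    const SchemaRuntime* s = inst->schema;
    if (2 >= s->fieldCount || s->offsets[2] == kFieldDeleted)
        return nullptr; // caller falls back to e.g. an identity transform
    return reinterpret_cast<const Matrix4x4*>(inst->data + s->offsets[2]);
}
```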
Plus, if accessor overhead is really an issue, you could make sure your release builds never load old instances (by running a global “refresh” job on all instances as you produce your final builds) and hard-code the offsets in the generated accessor code instead of going via the schema. I’d avoid this, because it means you may have issues using the “upgrade path” for future patches. Another option is to tag certain types as having an “efficient” format; this would make the schema builder also produce a plain C struct for the type, plus a loader method that simply copies each member at load time (using the schema offset table and instance data) – once that’s done it’s just plain old C data that can be accessed at full speed. This costs you load-time performance and simplicity, but does offer a nice “backup plan” (e.g. if you want to modify the runtime data and save it back out to the instance source format, for live editing purposes).
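A sketch of what that generator output might look like for a hypothetical “pickup” type, again reusing the Instance and SchemaRuntime types from above; the struct, field identifiers, and defaults are all invented for illustration:

```cpp
#include <cstdint>
#include <cstring>

// Generator-emitted plain struct for a type tagged as "efficient".
struct PickupRuntime {
    int32_t healthRestored; // field id 0
    float   respawnSeconds; // field id 1
};

// Generator-emitted loader: one copy per member through the schema's offset
// table, falling back to the baked-in default for missing or deleted fields.
// After this runs, the data is plain old C and can be accessed at full speed.
inline void LoadPickup(const Instance* inst, PickupRuntime* out) {
    const SchemaRuntime* s = inst->schema;

    out->healthRestored = 25; // default
    if (0 < s->fieldCount && s->offsets[0] != kFieldDeleted)
        std::memcpy(&out->healthRestored, inst->data + s->offsets[0], sizeof(int32_t));

    out->respawnSeconds = 30.0f; // default
    if (1 < s->fieldCount && s->offsets[1] != kFieldDeleted)
        std::memcpy(&out->respawnSeconds, inst->data + s->offsets[1], sizeof(float));
}
```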
So what about on-disc storage? If we need to store all our data in a “native” format, won’t we waste a whole bunch of memory? For example, using 32 bits to store an integer when it could be compressed down to 10 bits is a bit wasteful, right? Yeah, this is true, and it may be worth worrying about. However, it’s not uncommon to have some kind of transparent compression at the IO-system level anyway (e.g. based on zlib, or Snappy), so combined with the bundling system above most of the redundancy should be compressed away. If not, Cap’n Proto has a clever trick where they store the instance data XOR’d with the default instance (which I would store in the schema). This means most instances consist of tons of zeroes (since, in practice, most fields aren’t actually changed from their defaults). Combine this with some really cheap zero-aware compression where you only store non-zero bytes, and you get great on-disc storage characteristics. Of course, the next step is to skip the XOR step altogether: when you compress the instance, you do it using a “default-aware” compressor instead of a zero-aware one – i.e. skip storing any bytes that are identical to the default instance, and at load time initialize the instance by copying bytes either from the resource or from the default instance, based on some control stream stored in the resource. Of course, you should measure carefully whether this is worth it. There’s something to be said for loading directly from disk into the final destination buffer, without having to even involve the CPU.
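Just to make the decode side of such a “default-aware” scheme concrete, here’s a minimal sketch; the control-stream format (alternating byte-sized run lengths, so runs are capped at 255 bytes) is only one possible choice and entirely an assumption:

```cpp
#include <cstddef>
#include <cstdint>

// Rebuild an instance stored as alternating run lengths: first a run of bytes
// to copy from the default instance, then a run of literal bytes stored in the
// resource, and so on until `size` output bytes have been produced.
void DecodeDefaultAware(const uint8_t* defaultInstance,
                        const uint8_t* control, size_t controlSize,
                        const uint8_t* literals,
                        uint8_t* out, size_t size) {
    size_t pos = 0, c = 0;
    while (pos < size && c < controlSize) {
        size_t fromDefault = control[c++]; // bytes identical to the default
        for (size_t i = 0; i < fromDefault && pos < size; ++i, ++pos)
            out[pos] = defaultInstance[pos];
        if (c >= controlSize) break;
        size_t fromLiterals = control[c++]; // bytes that differ from the default
        for (size_t i = 0; i < fromLiterals && pos < size; ++i, ++pos)
            out[pos] = *literals++;
    }
}
```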
Another interesting thing I’d like to try is to separate mutable from immutable data on the level of the schema (with immutable being the default, and this affecting the accessors being generated). Typically the vast majority of the data you’d load this way is immutable (in terms of size) – e.g. vertex data, audio data, texture data, etc., with only a few things actually being modified at runtime. Since we’re generating code from a schema, we can make these kind of global experiments pretty easily.
Anyway, this seems to combine the best features from all the different systems I’ve worked with, read about, or even implemented myself. Thoughts?