Runtime Code Specialization

Standard

Specializing code is nothing new, but I’m surprised that modern programming languages don’t attempt to make this easier.

Just to give you some context, the idea is that whenever you have an algorithm that takes multiple parameters where one or more of the parameters are fixed for some period of time, you specialize the code for the fixed parameters in order to achieve greater optimization at runtime. A key idea is that you may not know at compile time what the fixed value is, and the duration for which a value is fixed could be shorter than the execution of the program (mere milliseconds, even).

One example would be to compile a regular expression into machine code – at runtime – whose function is to take the input string and match it against the (fixed) regular expression. This can be many times faster than an algorithm based on interpreting the regular expression as you go.

Other examples are algorithms where there are runtime inefficiencies due to combinatorial explosion. Such as converting one image format to another. If you have enough different image formats it’s infeasible to write one function for each pair of formats, so you would typically write a single conversion function that converts from the source format into some high precision intermediate format, and then calls another function to convert to the destination format. This can be inefficient because you have to switch on the data format inside the conversion loop. With code specialization you look up the two conversion function once, and then call them directly (or even inline them, then simplify away the intermediate format when possible).

Now, all this can be done manually, and while things haven’t improved as much as I would like, things are getting better. In C# you could always use the System.Reflection.Emit namespace to generate MSIL at runtime, and then compile it into a method. This was tedious and low level, but better than generating x86 assembly at least! Starting with .Net 3.5 you can instead use expression trees. This is much better. Not only can you compile some simple lambda expression directly to expression trees, but even building them manually is much higher level.

For example, starting with a simple function to match “glob” patterns (e.g. “f?o.*”) like this:

bool match(string pattern, string testString, int i, int j)
{
    if (i == pattern.Length) 
        return j == testString.Length;

    switch (pattern[i])
    {
        case '?'    : 
            return matchHelper(pattern, testString, i+1, j+1);
        case '*'    :                    
            for (int k = j; k <= testString.Length; ++k)
            {
                if (matchHelper(pattern, testString, i+1, k)) 
                      return true;                        
            }
            return false;                

        default     : 
            return j < testString.Length && 
                   pattern[i] == testString[j] && 
                   matchHelper(pattern, testString, i+1, j+1);
    }
}

We can (manually) convert this to a function that, passed an expression holding the string to be tested and the starting index, generates an expression representing whether or not the pattern matches. We do this by interpreting the “pattern”, and generating code that just tests a passed in string expression against it. We don’t call any functions in the generated expression, but rather keep traversing the pattern doing as much of the testing as we can ahead of time, only putting the stuff that absolutely depends on the passed in test string into the expression tree:

// Generates an expression of type bool
static Expression GenerateMatchExpression(string pattern, int i, ParameterExpression str, Expression j )
{
    var strLenExp = Expression.Property(str, StringLengthProp);

    // If the pattern is empty, we match if we've reached the end of the string
    if (i == pattern.Length) 
        return Expression.Equal(j, strLenExp);

    // Shave off a single char from the string
    if (pattern[i] == '?') 
        return GenerateMatchExpression(pattern, i+1, str, Expression.Increment(j));
    
    // Remove the wild card, and try to match all substrings
    if (pattern[i] == '*')
    {
        var subStringStart = Expression.Variable(typeof(int));              
        var breakLabel = Expression.Label(typeof(bool));
        return Expression.Block(typeof(bool),
            new ParameterExpression[] { subStringStart },
            // start at current index, we'll increment this 
            // to check all "tails", including empty string
            Expression.Assign(subStringStart, j), 
            Expression.Loop(
                Expression.Block(                        
                    // strStart > str.Length
                    Expression.IfThen(Expression.GreaterThan(subStringStart, strLenExp),
                            Expression.Break(breakLabel, Expression.Constant(false))
                    ),

                    //check substring
                    Expression.IfThen(GenerateMatchExpression(pattern, i + 1, str, subStringStart), 
                        Expression.Break(breakLabel, Expression.Constant(true))
                    ),
                    Expression.Assign(subStringStart, Expression.Increment(subStringStart))
                ),
                breakLabel
            )
        );
    }
    
    // Check a single character.
    return  Expression.AndAlso(  
                Expression.LessThan(j, strLenExp), 
                Expression.AndAlso(  
                    Expression.Equal( Expression.Constant(pattern[i]), 
                                      Expression.Call(str, GetCharsMethod, j)),
                    GenerateMatchExpression(pattern, i + 1, str, Expression.Increment(j))
                )
            );
    
}

Then we pass in the string and index by wrapping this expression in a lambda like so:

var strParam = Expression.Parameter(typeof(string));
var exp = Expression.Lambda<Func<string,bool>>(
                GenerateMatchExpression(pattern, 0, strParam, Expression.Constant(0)), strParam
          );        
Func match = exp.Compile();

For example, if we pass in “a*b*c?”, we get an expression tree that looks like this:

.Lambda #Lambda1(System.String $var1) {
    0 < $var1.Length && 'a' == .Call $var1.get_Chars(0) && .Block(System.Int32 $var2) {
        $var2 = .Increment(0);
        .Loop  {
            .Block() {
                .If ($var2 > $var1.Length) {
                    .Break #Label1 { False }
                } .Else {
                    .Default(System.Void)
                };
                .If (
                    $var2 < $var1.Length && 'b' == .Call $var1.get_Chars($var2) && .Block(System.Int32 $var3) {
                        $var3 = .Increment($var2);
                        .Loop  {
                            .Block() {
                                .If ($var3 > $var1.Length) {
                                    .Break #Label2 { False }
                                } .Else {
                                    .Default(System.Void)
                                };
                                .If (
                                    $var3 < $var1.Length && 'c' == .Call $var1.get_Chars($var3) && .Increment(.Increment($var3)) == $var1.Length
                                ) {
                                    .Break #Label2 { True }
                                } .Else {
                                    .Default(System.Void)
                                };
                                $var3 = .Increment($var3)
                            }
                        }
                        .LabelTarget #Label2:
                    }
                ) {
                    .Break #Label1 { True }
                } .Else {
                    .Default(System.Void)
                };
                $var2 = .Increment($var2)
            }
        }
        .LabelTarget #Label1:
    }
}

Note how we get one loop generated for each ‘*’ character, in a nested fashion, and how the entire thing is inlined and specialized. This should compile to pretty efficient code. Of course it could be even more efficient if we first generated a DFA from this and then compiled that, but the point here is code specialization, not finite automata.

Now, this works, but holy hell that took a lot of work to get right, and this was for a fairly trivial example. Imagine if this was something more complicated! Imagine if you had to maintain this! The horror!

So, what do we actually need? We need a language that allows you to specialize (almost) arbitrary functions for some of their parameters, returning a new function with all the constants folded away, virtual functions inlined, etc. If this was as easy as writing down some simple attributes you could imagine such a language would be able to achieve far superior performance to even native languages like C++ simply because it’s infeasible to do this kind of code specialization on a wide scale unless there’s language support for it. You could potentially write a C# function that does this. I.e. create an expression tree by wrapping the code to be specialized in a lambda, then pass this expression to a function that massages the resulting tree given that some of the parameters are constant. Perhaps the JIT compiler can even be convinced to do some of the optimizations for you once it knows that some of the inputs are constants, but more likely than not you have to manually propagate the fact that these things are truly fixed through the expression tree.

This probably isn’t as simple as I make it out to be. For example, you may want to specialize an animation function on the blend tree, while allowing the specific blend values of each node to vary each time you invoke it. This could be done by having the programmer manually separate these two pieces of data out and only specializing the first, but ideally you could mix and match “fixed” data from “varying” data and have specialization only “boil away” the fixed stuff, while storing references to the varying stuff.

My contention is that client-facing apps have a lot of code that could benefit from this sort of strategy. Thinking about something like a game engine running at 60Hz it’s amazing just how much time you spend doing the exact same thing over and over again every frame where the only thing that’s different from frame to frame is the specific integer or float that gets passed into the ALU, but most branches and vtables go the same way every time. Being able to flag things as being constant for some duration of time and avoid all that “interpretation overhead” would be a huge win in a bunch of cases.

The next step would be to have a runtime that automatically traces the actual execution and specializes things for you as you go. Much like javascript compilers do. However, I’m talking about applying this to highly optimized ahead-of-time compiled programs, so the number of things the tracer would have to look out for would presumably be less extensive than for a dynamic language (e.g. we’re not looking to discover that a variable is an integer – we know that already). We’re basically just looking to do some constant folding, some vtable lookups (and subsequent inlining) and stuff like that. The static compiler could flag where it might be profitable to do runtime tracing, perhaps.

I kinda like the idea of having the programmer explicitly define specializations, though, because it means you can put the onus on the programmer to ensure certain things that a compiler wouldn’t want to assume. Like non-divergence of recursive functions, which means that e.g. a tree traversal function could be specialized with each recursive call being blindly inlined. The programmer would be responsible for proving that, taken the specified parameters as constants, the recursion will eventually stop.

Are there any languages out there that do this?

About these ads

8 thoughts on “Runtime Code Specialization

  1. I think part of the reason that few (if any) products have been released that address runtime code specialization is that x86 CPUs are getting really good at compensating for virtual function calls, indirect memory accesses, etc. As long as you don’t push too many indirections through the pipeline at once, and as long as they’re fairly predictable, the 168-instruction reorder buffer and 2K-entry branch target buffer (using Intel’s Sandy Bridge as an example) will practically make them free.

    This has caused me quite a lot of frustration in the past… I’ve spent many hours on a software rendering algorithm, making sure everything flows through the CPU without stalls, only to find that my performance is virtually unchanged.

    • True, however cache misses for example are still pretty much the biggest consistent source of performance issues. If you can take the hit of all of them during “specialization time” and produce straight line code that never even has to fetch any of that memory when it runs next time (all the decisions it needs to make based on that data have already been made during specialization), you could get significant performance gains.

      Even if you need to fetch some of the “fixed” during runtime (e.g. the “weights” in an animation blend tree might not be fixed, even if the node structure is) you could get big gains because that data is no longer on the critical path for any branches.

      I’m kinda hoping that something like this could make a high level managed language much faster than C++ (because it would be pretty much infeasible to do in C++, or at least much harder). If the only result is that it makes C# go as fast as C++ (by eliminating temporary allocations, extra pointer chasing, vtable chasing etc.) then I’d be happy with that, but I’m kinda thinking it can actually leapfrog it.

      • Fair point. I actually think that getting “faster than C++” code is actually quite attainable.

        A couple days ago I did some basic tests to check whether various C++ compilers would recognize compile-time constants held in various containers and inline them. All tested compilers (VC++2010SP1, G++ 4.7.2, Clang 3.0-6) recognized const ints in const structs and const arrays, but none inlined values from const vectors of const ints. Also, VC++ couldn’t optimize out this almost-trivial loop over the array: “for(int i=0;i<7;i++) if(myArray[i] == 0) putchar(myArray[i+1]);".

        To beat C++, I think simply making the compiler aware of a few of the basic container types (i.e. lists and dicts) for the purpose of const folding would be a good start.

  2. Yup, totally agree, it’s one of the major things I feel is missing from C++.

    For example, any branches inside an inner loop (that are constant for the duration of the loop) could be hoisted out to outside the loop, and a new inner loop generated with that branch folded. This should be an automatic transformation (perhaps programmer guided).

    • Yeah. Somteimes it can do some of that already.. Sometimes. But it’s hard with aliasing analysis etc. to not break these things. E.g. call a single virtual function in the loop and most bets are off because you have no idea if any of the data you’re branching on is modified by that function – no hoisting for you!

      As you say, with specialization this problem can go away because you can inline across vtable lookups so you can actually get a full view of what the loop is doing – including alias analysis – and then hoist out constant stuff.

  3. std.regex in D does that with ctRegex. The regex is processed at compile time and specialized code is generated. It is indeed way faster than all regex engine I know of.

  4. This has been done. Check out DyC, the Synthesis OS, follow-up work like Synthetix OS, etc. A quick search for related articles on Google Scholar will reveal a lot of interesting work.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s