Using SSE intrinsics to optimize scalar code

Just a quick note on optimizing scalar code using approximate SSE instructions. Now, first off let me just say that it’s rarely useful to micro-optimize scalar code (e.g. implementing Vector3 addition using SSE is probably a net loss, or at least wasted effort). If you really want to go fast, what you need to do is find hotspots which operates on a lot of data in a loop and turning that whole thing into SSE code.

However, there are exceptions where throwing in a single SSE instruction in otherwise scalar code could give a small but nice speedup. For those cases, you want to make it easy to exploit these SSE instructions even for people who don’t know how to write SSE code. Luckily MSVC now uses SSE instructions even for scalar math, which means that there’s really no penalty to drop in your own SSE intrinsics here and there.

For example, sometimes you have some code that uses a floating point divide, which is very costy. While MSVC will use the SSE form of this division instruction it will not automatically use the approximate SSE division instruction (since this would change the semantics of the program). Since the approximate division instruction has a latency of 5 instructions compared to 13 for the regular division (on Haswell), and has 5x greater throughput, there is or course an opportunity for a nice speedup here if you don’t need an exact result. To get the approximate division operator you have to use it yourself, by implementing a handy helper like so:

// Returns an approximation of 1.0f/x
__forceinline float rcp(float x)
{   
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
}

This simply sets the float into an SSE register, performs the _mm_rcp_ss intrinsic and converts back to float. In optimized code the extra set and conversion intrinsics go away and the compiler just inserts the rcpss instruction in the right place.

Another common example is the reciprocal square root, or 1.0/sqrtf(x). Implementing this in regular math involves both a sqrt and a division, these are both expensive instructions with a combined latency of 31 clocks. In contrast, the approximate rsqrtss instruction has a latency of only 5 (and a throughput of 1 per clock).

Here’s the straight up wrapper for rsqrt:

__forceinline float rsqrt_fast(float x)
{
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}

This gives pretty decent results, but you might also consider doing a single Newton-Rhapson iteration on top of this approximation to refine the result.

__forceinline float rsqrt(float x)
{
    float xrsqrt_est = rsqrt_fast(x);
    return xrsqrt_est*(1.5f - x*0.5f*xrsqrt_est*xrsqrt_est); // NR iteration
}

In some quick tests this is only about 5% slower than rsqrt_fast while giving significantly more accurate results (it’s still about 60% faster than 1.0f/sqrtf(x)). The performance will depend on what else you’re doing in the surrounding code so you might still want to expose the rsqrt_fast version just in case it matters somewhere.

comments powered by Disqus