Using SSE intrinsics to optimize scalar code
23 Feb 2016
Just a quick note on optimizing scalar code using approximate SSE instructions. Now, first off let me just say that it’s rarely useful to micro-optimize scalar code (e.g. implementing Vector3 addition using SSE is probably a net loss, or at least wasted effort). If you really want to go fast, what you need to do is find hotspots that operate on a lot of data in a loop and turn that whole thing into SSE code.
However, there are exceptions where throwing a single SSE instruction into otherwise scalar code can give a small but nice speedup. For those cases, you want to make these SSE instructions easy to exploit even for people who don’t know how to write SSE code. Luckily MSVC now uses SSE instructions even for scalar math, which means there’s really no penalty to dropping in your own SSE intrinsics here and there.
For example, sometimes you have some code that uses a floating point divide, which is very costly. While MSVC will use the SSE form of the division instruction, it will not automatically use the approximate SSE division instruction (since this would change the semantics of the program). Since the approximate division instruction has a latency of 5 cycles compared to 13 for the regular division (on Haswell), and has 5x greater throughput, there is of course an opportunity for a nice speedup here if you don’t need an exact result. To get the approximate division you have to use it yourself, by wrapping the intrinsic in a handy helper.
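A minimal sketch, assuming the intrinsics from <xmmintrin.h> (the helper name rcp_fast is just illustrative):

```cpp
#include <xmmintrin.h>

// Approximate 1.0f / x via the rcpss instruction.
__forceinline float rcp_fast(float x)
{
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
}
```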
This simply sets the float into an SSE register, performs the _mm_rcp_ss intrinsic, and converts back to float. In optimized code the extra set and conversion intrinsics go away and the compiler just inserts the rcpss instruction in the right place.
Another common example is the reciprocal square root, or 1.0f/sqrtf(x). Implementing this with regular math involves both a square root and a division; these are both expensive instructions, with a combined latency of 31 clocks. In contrast, the approximate rsqrtss instruction has a latency of only 5 (and a throughput of 1 per clock). Here’s the straight up wrapper for rsqrt.
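A sketch along the same lines (rsqrt_fast is the name referenced below):

```cpp
// Approximate 1.0f / sqrtf(x) via the rsqrtss instruction.
__forceinline float rsqrt_fast(float x)
{
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}
```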
This gives pretty decent results, but you might also consider doing a single Newton-Raphson iteration on top of this approximation to refine the result.
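Something like this, using the standard Newton-Raphson step for the reciprocal square root, y' = y * (1.5 - 0.5 * x * y * y):

```cpp
__forceinline float rsqrt(float x)
{
    // Start from the hardware estimate (roughly 12 bits of precision)...
    float y = rsqrt_fast(x);
    // ...then apply one Newton-Raphson step, which roughly doubles the
    // number of accurate bits.
    return y * (1.5f - 0.5f * x * y * y);
}
```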
In some quick tests this is only about 5% slower than rsqrt_fast while giving significantly more accurate results (it’s still about 60% faster than 1.0f/sqrtf(x)). The performance will depend on what else you’re doing in the surrounding code, so you might still want to expose the rsqrt_fast version just in case it matters somewhere.
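For a taste of where a helper like this drops into otherwise scalar code, here’s a hypothetical example using the rsqrt wrapper above to normalize a vector (the Vector3 type here is just for illustration):

```cpp
struct Vector3 { float x, y, z; };

// Normalizing needs 1/sqrt(length^2), which is exactly what rsqrt
// approximates; the rest of the function stays plain scalar code.
Vector3 normalize(Vector3 v)
{
    float rlen = rsqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return Vector3{ v.x * rlen, v.y * rlen, v.z * rlen };
}
```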