Packing and Unpacking Bits Faster than BitConverter

Yesterday I’ve published an article on packing bits, I still don’t know if there’s a library out there that allows you to do that easily and efficiently so I’m using my own functions for that purpose. However, a popular way to pack bytes into other data types is the BitConverter class. Please note the functionality is very different from my packing/unpacking functions because it doesn’t allow you to enter an arbitrary number of bits and simply packs bytes into other data type and unpacks other data into bytes.

Performance is always a concern to me. For this game I’m working on I want to extract the best performance possible from the tools I’m using; it’s not just about having ‘enough’ speed because you can always take advantage of extra performance, for example having a larger map or larger draw area, or ticking the objects at a higher frequency.

I noticed that when compared to the BitConverter my functions are slower, although by no means they can be considered ‘slow’. So I’ve decided to see what it takes to beat BitConverter in terms of performance. It turns out that you can’t really achieve that level of performance easily, and I had to write functions that bit shifts and masks the bits directly with minimal instructions (that resembles clojure sometimes, specifically the “(((((((uint)” part).

In the end I could achieve a much better performance for packing the bits but unpacking is a whole different matter. Here’s the code I’ve used:



I’m populating arrays with random bytes for the tests, and I’m passing a static readonly byte[] {8,8,8,8} to PackBitFields() and GetBitFields(), the BitConvert methods I’m comparing are ToUInt32() and GetBytes(), and also I’m using ‘unsafe’ code for an alternate direct unpacking function (DirectUnpack8888Unsafe).

For a fair comparison, I’m also just casting the variables and accessing the array elements directly because we can consider that the fastest you can get, I’ve changed the naming a bit to make the graph more compact and elegant, but everything should be self-explanatory except that the Custom Pack is the PackBitFields and Custom Unpack is the GetBitFields functions respectively.

The graph shows time per operation but with different iteration counts.

Packing results:

As you can see the Direct Pack is almost as fast as casting. You can’t really go any faster than that I think. Its cost is absolutely negligible. Both BitConverter and the Custom Pack (PackBitFields) have some overhead. That is completely explicable when you look the PackBitFields function closely, and I presume BitConverter also does some extra checks under the hood.

Unpacking results:

This one shows more clearly how the tests don’t scale linearly, I don’t have a good guess on what could be the culprit, but of course the loops have some overhead. I was expecting the cost per operation to be extremely linear provided enough samples and proper benchmarking (pre-JIT and cleaning GC), at least in a hypothetical ‘perfect’ environment. The final results (basically dividing the raw numbers by iters*samples) are calculated in a scientific application so we can rule out rounding or precision issues. DirectUnpack8888Unsafe is the winner here but not by much, and I don’t like to use unsafe code, although that specific snippet looks fail-safe. The Custom Unpack (GetBitFields) performs very well but scales differently than the others. The spike on the casting graph really drew my attention… I’ve no idea how that’s possible.

Here are the results (in seconds) per operation (all of these are *10^-8 for comparing):

Iterations/Samples 1000/10000 100000/1000 1000000/30 10000000/10
Custom Pack 0.919 0.950 0.970 0.961
B.Conv Pack 0.765 0.790 0.810 0.840
Direct Pack 0.325 0.405 0.444 0.410
Acc & Cast 0.309 0.388 0.419 0.407
Custom Unpack 3.154 3.246 7.482 12.281
B.Conv Unpack 2.150 4.395 10.574 10.479
Direct Unpack 2.035 2.162 11.536 12.554
Direct Unsafe 2.150 2.284 10.776 11.233
A & Cast Back 2.004 2.125 12.084 10.882

Here are the raw numbers in any case (the top numbers are the total result of the sum all iterations and samples):

Iterations 1000 100000 1000000 10000000
Samples 10000 1000 30 10
Custom Pack 0.0919455 0.9495634 0.2910572 0.9610064
Custom Unpack 0.3154235 3.2461785 2.2447317 12.2807165
B.Conv Pack 0.0764927 0.7899617 0.2430286 0.8400908
B.Conv Unpack 0.2149990 4.3947272 3.1722101 10.4789114
Direct Pack 0.0325352 0.4048323 0.1332875 0.4104641
Direct Unpack 0.2034563 2.1618007 3.4606810 12.5544608
Direct Unsafe 0.2150088 2.2842652 3.2328402 11.2327730
Acc & Cast 0.0308669 0.3877108 0.1256066 0.4070876
A & Cast Back 0.2004328 2.1245730 3.6251144 10.8822947
PER OPERATION 1000*10000 100000*1000 1000000*30 10000000*10
Custom Pack 9.1946E-09 9.4956E-09 9.7019E-09 9.6101E-09
B.Conv Pack 7.6493E-09 7.8996E-09 8.1010E-09 8.4009E-09
Direct Pack 3.2535E-09 4.0483E-09 4.4429E-09 4.1046E-09
Acc & Cast 3.0867E-09 3.8771E-09 4.1869E-09 4.0709E-09
Custom Unpack 3.1542E-08 3.2462E-08 7.4824E-08 1.2281E-07
B.Conv Unpack 2.1500E-08 4.3947E-08 1.0574E-07 1.0479E-07
Direct Unpack 2.0346E-08 2.1618E-08 1.1536E-07 1.2554E-07
Direct Unsafe 2.1501E-08 2.2843E-08 1.0776E-07 1.1233E-07
A & Cast Back 2.0043E-08 2.1246E-08 1.2084E-07 1.0882E-07

And the raw measurements:

Conclusion: All the functions shown here can be considered high-performance when you take their differences into account. The ‘Custom’ functions allow a better control over your individual bits and provide very decent performance. If you need fine control and want the absolutely best performance, direct bitwise operations and possibly unsafe code are the way to go. You could make a code generator for those functions. I like my old custom functions better because they are more elegant to use and the performance is just fine, but I don’t discard the possibility of using some direct functions too, especially since I’m trying to keep the bit packing ‘formats’ to a minimum in terms of variety.

Leave a Reply