Yesterday I published an article on packing bits. I still don’t know if there’s a library out there that lets you do that easily and efficiently, so I’m using my own functions for the purpose. However, a popular way to pack bytes into other data types is the BitConverter class. Please note its functionality is very different from my packing/unpacking functions: it doesn’t let you specify an arbitrary number of bits; it simply packs bytes into other data types and unpacks other data types back into bytes.
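For reference, a minimal sketch of that BitConverter round trip (the byte values here are my own illustration; the byte order follows the machine’s endianness, typically little-endian):

```csharp
using System;

byte[] bytes = { 0x78, 0x56, 0x34, 0x12 }; // illustrative values

// Pack: reads 4 bytes starting at index 0 into a uint.
// On a little-endian machine this yields 0x12345678.
uint packed = BitConverter.ToUInt32(bytes, 0);

// Unpack: returns the 4 bytes of the uint, in machine byte order.
byte[] unpacked = BitConverter.GetBytes(packed);
```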
Performance is always a concern for me. For the game I’m working on I want to extract the best performance possible from the tools I’m using; it’s not just about having ‘enough’ speed, because you can always take advantage of extra performance, for example by having a larger map or a larger draw area, or by ticking the objects at a higher frequency.
I noticed that, compared to BitConverter, my functions are slower, although by no means can they be considered ‘slow’. So I decided to see what it takes to beat BitConverter in terms of performance. It turns out you can’t really reach that level of performance easily, and I had to write functions that bit-shift and mask the bits directly with minimal instructions (which sometimes resembles Clojure, specifically the “(((((((uint)” part).
In the end I could achieve much better performance for packing the bits, but unpacking is a whole different matter. Here’s the code I used:
Packers:
```csharp
public static uint PackBitFields(byte[] values, byte[] bitFields)
{
    uint retVal = values[0]; // we set the first value right away
    for (int f = 1; f < values.Length; f++)
    {
        retVal <<= bitFields[f]; // we shift the previous value
        retVal |= values[f];     // and add our current value
    }
    return retVal;
}

public static uint DirectPack8888(byte[] values)
{
    return (((((((uint)values[0]) << 8) | values[1]) << 8) | values[2]) << 8) | values[3];
}
```
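As a quick usage sketch (the byte values and the 8/8/8/8 layout are my own illustration), both paths should produce the same packed value:

```csharp
byte[] widths = { 8, 8, 8, 8 };             // four 8-bit fields
byte[] values = { 0x12, 0x34, 0x56, 0x78 }; // illustrative bytes

uint a = PackBitFields(values, widths); // generic loop version
uint b = DirectPack8888(values);        // hand-unrolled shifts

// Both a and b are 0x12345678: values[0] ends up in the most significant byte.
```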
Unpackers:
```csharp
const int MaxBits = sizeof(uint) * 8; // 32: bit width of the packed type

public static byte[] GetBitFields(uint packedBits, byte[] bitFields)
{
    int fields = bitFields.Length - 1;    // number of fields to unpack
    byte[] retArr = new byte[fields + 1]; // init return array
    int curPos = 0;                       // current field bit position (start)
    int lastEnd;                          // position where last field ended
    for (int f = fields; f >= 0; f--)     // loop from last
    {
        lastEnd = curPos;                 // we store where the last value ended
        curPos += bitFields[f];           // we get where the current value starts
        int leftShift = MaxBits - curPos; // how much left shift we gotta apply for the other numbers to overflow into oblivion
        retArr[f] = (byte)((packedBits << leftShift) >> (leftShift + lastEnd)); // we do magic
    }
    return retArr;
}

public static byte[] DirectUnpack8888(uint packed)
{
    return new byte[]
    {
        (byte)(packed >> 24),
        (byte)((packed >> 16) & 255),
        (byte)((packed >> 8) & 255),
        (byte)(packed & 255),
    };
}

public static unsafe byte[] DirectUnpack8888Unsafe(uint packed)
{
    byte[] array = new byte[4];
    fixed (byte* ptr = array)
    {
        *(uint*)ptr = packed;
    }
    return array;
}
```
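One caveat worth flagging (my own observation, not part of the original benchmark): DirectUnpack8888 writes the most significant byte to index 0, while the unsafe version copies the uint’s raw memory layout, so on a little-endian machine the two return the bytes in opposite orders. A quick sketch:

```csharp
uint packed = 0x12345678;

byte[] shifted = DirectUnpack8888(packed);   // { 0x12, 0x34, 0x56, 0x78 }
byte[] raw = DirectUnpack8888Unsafe(packed); // { 0x78, 0x56, 0x34, 0x12 } on little-endian

// BitConverter.GetBytes(packed) matches the unsafe version,
// since both follow the machine's native byte order.
```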
I’m populating arrays with random bytes for the tests, and I’m passing a static readonly byte[] {8,8,8,8} to PackBitFields() and GetBitFields(). The BitConverter methods I’m comparing against are ToUInt32() and GetBytes(), and I’m also using ‘unsafe’ code for an alternate direct unpacking function (DirectUnpack8888Unsafe).
For a fair comparison, I’m also just casting the variables and accessing the array elements directly, since we can consider that the fastest you can get. I’ve changed the naming a bit to make the graph more compact and elegant, but everything should be self-explanatory, except that Custom Pack is the PackBitFields function and Custom Unpack is the GetBitFields function.
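The harness itself isn’t listed in this post; a minimal sketch of what a measurement with pre-JITting and GC cleanup could look like (the Measure helper and its parameters are my own placeholders, not the exact code I used):

```csharp
using System;
using System.Diagnostics;

static double Measure(Action op, int iterations, int samples)
{
    op(); // warm-up call so the method is JIT-compiled before timing

    double totalSeconds = 0;
    for (int s = 0; s < samples; s++)
    {
        GC.Collect(); // clean up between samples to keep the GC out of the timings
        GC.WaitForPendingFinalizers();

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            op();
        sw.Stop();

        totalSeconds += sw.Elapsed.TotalSeconds;
    }
    // per-operation time: raw total divided by iterations * samples
    return totalSeconds / ((double)iterations * samples);
}
```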
The graphs show time per operation, measured at different iteration counts.
Packing results:
As you can see, Direct Pack is almost as fast as casting; I don’t think you can really go any faster than that. Its cost is absolutely negligible. Both BitConverter and the Custom Pack (PackBitFields) have some overhead. That is completely explicable when you look at the PackBitFields function closely, and I presume BitConverter also does some extra checks under the hood.
Unpacking results:
This one shows more clearly that the tests don’t scale linearly. I don’t have a good guess as to the culprit, though of course the loops add some overhead. I was expecting the cost per operation to stay essentially constant given enough samples and proper benchmarking (pre-JITting and cleaning up the GC), at least in a hypothetical ‘perfect’ environment. The final results (basically the raw numbers divided by iterations × samples) are calculated in a scientific application, so we can rule out rounding or precision issues. DirectUnpack8888Unsafe is the winner here, but not by much, and I don’t like using unsafe code, although that specific snippet looks fail-safe. The Custom Unpack (GetBitFields) performs very well but scales differently from the others. The spike in the casting graph really drew my attention… I have no idea how that’s possible.
Here are the per-operation results in seconds (all values are ×10⁻⁸ to ease comparison):
| Iterations/Samples | 1000/10000 | 100000/1000 | 1000000/30 | 10000000/10 |
|---|---|---|---|---|
| Custom Pack | 0.919 | 0.950 | 0.970 | 0.961 |
| B.Conv Pack | 0.765 | 0.790 | 0.810 | 0.840 |
| Direct Pack | 0.325 | 0.405 | 0.444 | 0.410 |
| Acc & Cast | 0.309 | 0.388 | 0.419 | 0.407 |
| Custom Unpack | 3.154 | 3.246 | 7.482 | 12.281 |
| B.Conv Unpack | 2.150 | 4.395 | 10.574 | 10.479 |
| Direct Unpack | 2.035 | 2.162 | 11.536 | 12.554 |
| Direct Unsafe | 2.150 | 2.284 | 10.776 | 11.233 |
| A & Cast Back | 2.004 | 2.125 | 12.084 | 10.882 |
Here are the raw numbers in any case (the top table shows the total times, summed over all iterations and samples; the bottom one divides them out per operation):
Total time (seconds):

| Iterations × Samples | 1000 × 10000 | 100000 × 1000 | 1000000 × 30 | 10000000 × 10 |
|---|---|---|---|---|
| Custom Pack | 0.0919455 | 0.9495634 | 0.2910572 | 0.9610064 |
| Custom Unpack | 0.3154235 | 3.2461785 | 2.2447317 | 12.2807165 |
| B.Conv Pack | 0.0764927 | 0.7899617 | 0.2430286 | 0.8400908 |
| B.Conv Unpack | 0.2149990 | 4.3947272 | 3.1722101 | 10.4789114 |
| Direct Pack | 0.0325352 | 0.4048323 | 0.1332875 | 0.4104641 |
| Direct Unpack | 0.2034563 | 2.1618007 | 3.4606810 | 12.5544608 |
| Direct Unsafe | 0.2150088 | 2.2842652 | 3.2328402 | 11.2327730 |
| Acc & Cast | 0.0308669 | 0.3877108 | 0.1256066 | 0.4070876 |
| A & Cast Back | 0.2004328 | 2.1245730 | 3.6251144 | 10.8822947 |

Per operation (seconds, total ÷ (iterations × samples)):

| Per Operation | 1000 × 10000 | 100000 × 1000 | 1000000 × 30 | 10000000 × 10 |
|---|---|---|---|---|
| Custom Pack | 9.1946E-09 | 9.4956E-09 | 9.7019E-09 | 9.6101E-09 |
| B.Conv Pack | 7.6493E-09 | 7.8996E-09 | 8.1010E-09 | 8.4009E-09 |
| Direct Pack | 3.2535E-09 | 4.0483E-09 | 4.4429E-09 | 4.1046E-09 |
| Acc & Cast | 3.0867E-09 | 3.8771E-09 | 4.1869E-09 | 4.0709E-09 |
| Custom Unpack | 3.1542E-08 | 3.2462E-08 | 7.4824E-08 | 1.2281E-07 |
| B.Conv Unpack | 2.1500E-08 | 4.3947E-08 | 1.0574E-07 | 1.0479E-07 |
| Direct Unpack | 2.0346E-08 | 2.1618E-08 | 1.1536E-07 | 1.2554E-07 |
| Direct Unsafe | 2.1501E-08 | 2.2843E-08 | 1.0776E-07 | 1.1233E-07 |
| A & Cast Back | 2.0043E-08 | 2.1246E-08 | 1.2084E-07 | 1.0882E-07 |
And the raw measurements:
https://gist.github.com/Alan-FGR/24b9b23ce8d7c777c2dd675cd14c6500
Conclusion: all the functions shown here can be considered high-performance when you take their differences into account. The ‘Custom’ functions give you better control over your individual bits and provide very decent performance. If you need fine control and want the absolute best performance, direct bitwise operations, and possibly unsafe code, are the way to go. You could make a code generator for those functions (a sketch follows below). I like my old custom functions better because they are more elegant to use and the performance is just fine, but I don’t rule out using some direct functions too, especially since I’m trying to keep the variety of bit-packing ‘formats’ to a minimum.
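As a sketch of that code-generator idea (entirely my own illustration, reusing the same shift-and-or scheme as PackBitFields): given a list of field widths, emit the source of a hard-coded DirectPack-style function:

```csharp
using System.Linq;
using System.Text;

static string GenerateDirectPacker(byte[] widths)
{
    // e.g. { 8, 8, 8, 8 } produces a "DirectPack8888" method.
    string name = "DirectPack" + string.Concat(widths.Select(w => w.ToString()));

    // Build the nested shift-and-or expression, same scheme as PackBitFields.
    string expr = "(uint)values[0]";
    for (int f = 1; f < widths.Length; f++)
        expr = $"(({expr} << {widths[f]}) | values[{f}])";

    var sb = new StringBuilder();
    sb.AppendLine($"public static uint {name}(byte[] values)");
    sb.AppendLine("{");
    sb.AppendLine($"    return {expr};");
    sb.AppendLine("}");
    return sb.ToString();
}
```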