CPU cache optimization and multi-threading in C#

I’m making a game that contains a massive and very complex cellular automaton grid (like ‘Game of Life’) for the map. It’s basically a 2D side-scroller, but the whole map is dynamic. My plan is to reach 8192×1024 maps that wrap around on X. First I made a simple prototype with classes for the objects. I’m using C# (with C++ for the expensive parts) and a flyweight pattern for memory efficiency. I quickly put together a usable map generator, though it’s still rudimentary (Perlin noise only: terrain and ores).

And after making the ‘viewer’ (I’m using a 2D physics engine) I was able to achieve acceptable performance for the prototype. The next step was to multi-thread the grid ticking code. That was a substantial improvement: since the ticking code runs on a ‘secondary’ thread, it can run at a very low frequency without affecting the framerate; the world simply takes longer to update. I’m also relying on atomic access for most of the shared data.
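A minimal C++ sketch of that setup (all identifiers here are mine, not from the actual game): the grid ticks on a secondary thread at its own pace, publishing a generation counter, while other threads read the cells and the counter atomically.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// The grid: cells are atomics so the render thread can read them while
// the ticker writes, and 'generation' publishes completed ticks.
struct Grid {
    std::vector<std::atomic<uint8_t>> cells;
    std::atomic<uint64_t> generation{0};
    explicit Grid(size_t n) : cells(n) {}
};

// Tick loop run by the secondary thread; a real CA rule would replace
// the placeholder toggle below.
void tickThread(Grid& g, uint64_t generations) {
    for (uint64_t gen = 0; gen < generations; ++gen) {
        for (auto& c : g.cells)
            c.fetch_xor(1, std::memory_order_relaxed);        // placeholder rule
        g.generation.fetch_add(1, std::memory_order_release); // publish the tick
    }
}

// Launch the ticker on its own thread. In the game the main thread keeps
// rendering at full framerate regardless of how slowly this runs; here we
// simply join for demonstration.
void runTicks(Grid& g, uint64_t generations) {
    std::thread ticker(tickThread, std::ref(g), generations);
    ticker.join();
}
```

The key property is that the tick frequency and the framerate are decoupled: a slow tick only means a slowly-evolving world.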

I’m a big fan of data-oriented design, and ‘objects’ were not a thing when I started programming, so my first programs were fully data-oriented with primitives (not to be confused with data-driven). I revamped some code to use a data-oriented approach and take advantage of CPU caching. The CPU I test on is a Haswell Pentium G: it has 32KB each of per-core instruction and data cache on L1 (2×(32KB i + 32KB d)), 256KB of unified per-core cache on L2 (2×256KB), and 3MB of unified shared cache on L3 (1×3MB). Intel calls it ‘Smart Cache’, but I’m not sure what that implies.
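Some back-of-envelope arithmetic on those numbers, assuming the typical 64-byte x86 cache line and the 8-byte packed attribute struct described below (the constants are mine, chosen to match the Haswell figures above):

```cpp
#include <cstddef>

constexpr size_t kLineSize   = 64;        // bytes per cache line (typical x86)
constexpr size_t kRecordSize = 8;         // one packed (id, value) attribute
constexpr size_t kL1d        = 32 * 1024; // 32KB L1 data cache per core

constexpr size_t kPerLine = kLineSize / kRecordSize; // attributes per line
constexpr size_t kPerL1d  = kL1d / kRecordSize;      // attributes resident in L1d

static_assert(kPerLine == 8,   "one miss pulls in 8 adjacent attributes");
static_assert(kPerL1d == 4096, "a 4096-attribute working set fits in L1d");
```

In other words, a single cache miss on a tightly packed array amortizes across the next seven records for free, which is exactly what a class-per-block layout throws away.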

The first thing I did was to use structs instead of classes for the blocks. I carefully chose the data types of all members to keep memory to a minimum, and adopted a ‘modular’ approach for them, something like a very lightweight ECS: each block maintains a collection of ‘attributes’ that are used by the functions declared in the class containing the logic for each block type (the flyweight). So, for example, a block that needs to store its position on the map would have an ‘attribute’ (an explicitly laid-out struct containing a uint enum value for the ID and a uint for multi-purpose use) for each dimension, and would pass that value to the class holding its logic.
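A hypothetical C++ rendering of that attribute idea (the enum values and struct names are illustrative, not the game’s actual code): a block carries only the (id, value) pairs it needs, and the shared flyweight logic object reads them instead of owning per-block state.

```cpp
#include <cstdint>
#include <vector>

// Attribute IDs; mirrors the uint enum described in the post.
enum class AttrId : uint32_t { PosX, PosY, Temperature };

struct Attribute {
    AttrId   id;     // 4-byte enum ID
    uint32_t value;  // 4-byte multi-purpose payload
};
static_assert(sizeof(Attribute) == 8, "tightly packed, 8 per cache line");

struct Block {
    uint16_t typeId = 0;              // index into the flyweight logic table
    std::vector<Attribute> attrs;     // only the attributes this block needs

    uint32_t get(AttrId id, uint32_t fallback = 0) const {
        for (const Attribute& a : attrs)
            if (a.id == id) return a.value;
        return fallback;
    }
};

// Flyweight: one logic instance shared by every block of the same type;
// it asks the block for its attributes rather than storing state itself.
struct FallingBlockLogic {
    uint32_t nextY(const Block& b) const { return b.get(AttrId::PosY) + 1; }
};
```

The point of the indirection is that a block type with no position (say, a purely decorative tile) simply omits those attributes and pays nothing for them.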

After the block object was converted to a struct, the first single-threaded results were very promising: substantially faster than the classes. However, when I multi-threaded it, I was surprised to see only a marginal improvement, if any. It turns out that multi-threading combined with CPU-cache optimization is an extremely complex matter. Basically, the code was generating a lot of cache misses (presumably mostly L3). After ruling out external factors (like the OS scheduler) affecting performance to that extent, the conclusion was not very pleasant: cache optimization, especially if you want to take advantage of the higher levels, is much harder when you’re multi-threading. I can think of a few scenarios where cache-friendly multi-threading is trivial, for example when you can loop in ‘chunks’, but for any realistic real-world application it’s so hard to come up with good solutions at larger scales that I’m wondering if it’s worth the effort.
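The ‘loop in chunks’ case is worth making concrete. In this sketch (my own simplification, not the game’s tick code), each thread owns one contiguous slice of the grid, so no two cores fight over the same cache lines mid-iteration, and the per-thread results are padded to a cache line to avoid false sharing:

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Per-thread result padded to a full cache line so two counters never
// share a line (false sharing would silently serialize the threads).
struct alignas(64) PaddedCount { uint64_t value = 0; };

uint64_t countLive(const std::vector<uint8_t>& grid, unsigned nThreads) {
    std::vector<PaddedCount> partial(nThreads);
    std::vector<std::thread> pool;
    const size_t chunk = grid.size() / nThreads;

    for (unsigned t = 0; t < nThreads; ++t) {
        const size_t begin = t * chunk;
        const size_t end   = (t + 1 == nThreads) ? grid.size() : begin + chunk;
        pool.emplace_back([&, t, begin, end] {
            uint64_t local = 0;             // accumulate in a register first
            for (size_t i = begin; i < end; ++i)
                local += grid[i];           // purely sequential, cache-friendly
            partial[t].value = local;       // a single write per thread
        });
    }
    for (auto& th : pool) th.join();

    uint64_t total = 0;
    for (const auto& p : partial) total += p.value;
    return total;
}
```

A cellular automaton is harder than this because each cell reads its neighbours, so chunk borders still touch lines owned by the adjacent thread; that is exactly where the ‘trivial’ case stops being trivial.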

Bandwidth isn’t the big deal; CPU caching is actually built on it: you move a lot of data (whole cache lines) at high latency into a memory bank you can then access at much lower latency. In a fair analogy, the road has many lanes, so many vehicles can travel simultaneously, but the speed limit isn’t high. CPU caches are much faster than RAM: in a sensible scenario, an L1 hit costs a few cycles against a couple of hundred for a trip to main memory, so you can expect it to be very roughly 50–100x faster, with L2 somewhere around 10–20x. As a game developer, those numbers are exciting, but it’s not easy to actually capture those speeds.

In the end you have to apply common sense to decide whether it’s worth chasing. I honestly don’t know yet if I’m going to go back to simple classes, since they’re easier to program with, and simply stop caring about CPU caching. In C#, an array of structs is a single heap allocation with the structs stored inline, so they sit in adjacent memory (and, unlike class references, their addresses don’t change as you modify the collection, which makes them far more cacheable); but to extract the best performance, depending on your ticking cycle, you have to work on a scoped copy of the segment you’re iterating. I have to reconsider what the game actually requires, for example: what’s a realistic percentage of dynamic blocks in a typical scenario, and what’s a ‘good enough’ size for the world? I’m going to base the decision on those questions, and if that changes in the future, it’s not the end of the world either.
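Both points, contiguity and the scoped-copy trick, can be shown in a few lines of C++ (the struct and the toggle ‘rule’ are placeholders of my own; `std::vector<Block>` here plays the role of the C# struct array):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Block { uint32_t id; uint32_t state; };

// Distance in bytes between adjacent elements: exactly sizeof(Block),
// unlike an array of references to individually heap-allocated objects.
ptrdiff_t strideBytes(const std::vector<Block>& map) {
    return reinterpret_cast<const char*>(&map[1]) -
           reinterpret_cast<const char*>(&map[0]);
}

// Tick [begin, end) by copying the segment out, mutating the local copy
// (which stays hot in cache), then writing it back in a single pass.
void tickSegment(std::vector<Block>& map, size_t begin, size_t end) {
    std::vector<Block> segment(map.begin() + begin, map.begin() + end);
    for (Block& b : segment) b.state ^= 1;   // placeholder tick rule
    std::copy(segment.begin(), segment.end(), map.begin() + begin);
}
```

The copy costs one bulk move per tick, but the iteration itself then runs entirely over a small, contiguous, cache-resident buffer.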

Something to consider is that some languages (and/or standard libraries), most notably Java, don’t provide easy ways (if any) to optimize your code for CPU caching. I neither like nor use Java, but I acknowledge that it’s very popular and one of the most important languages, and chances are the Java programs/games you use, even the high-performance ones, don’t take any advantage of CPU caching beyond the purely accidental cases. So perhaps another thing to think about is: if it’s good enough for them, maybe (and only maybe) it’s good enough for us. 🙂
