I received some feedback on my previous post, “Some performance implications of CAS operations,” indicating that a few clarifications are in order. If I had to summarize the intended conclusion, it’d go something like this:
Sharing is evil, fundamentally limits scalability, and is best avoided.
I have to admit that the post was meant to focus more on concrete data, since I expected the meta-point about sharing to be implied. I figured folks would pick up on the chain of reasoning: (i) sharing memory requires concurrency control, (ii) concurrency control requires CAS, (iii) CAS is expensive, and therefore (iv) sharing memory is expensive. Many people simply don’t understand how crippling CAS can be when placed in a hot path, and I wanted to point out some (albeit extreme) examples of this.
I did have a motivation for the post. A lot of people point at lock-free techniques, software transactional memory, reader/writer locks, etc. as ways to improve scalability. Sadly, this seldom pans out. Each involves CASs of some sort, and, assuming the lock-based equivalent is written properly (that is, to hold locks for very short periods of time), the alternative can in fact often fare worse. I call this game “count the CASs.” It’s the round trips back to shared memory, failed optimistic attempts, cache invalidations, and line ping-ponging that kill you.
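A minimal sketch of “count the CASs,” written in Rust purely for illustration (the language and the names cas_update and count_the_cass are assumptions, not anything from the previous post). Each thread performs an optimistic read-modify-write with a compare-and-swap retry loop, and the code tallies how many CAS attempts the updates actually cost. Every attempt beyond one per update is a failed optimistic attempt caused by another thread touching the same location first.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::thread;

/// Applies `f` to the shared value using an optimistic CAS retry loop.
/// Returns the number of CAS attempts it took (1 means no interference).
fn cas_update(cell: &AtomicU64, f: impl Fn(u64) -> u64) -> usize {
    let mut attempts = 0;
    let mut seen = cell.load(Ordering::Relaxed);
    loop {
        attempts += 1;
        match cell.compare_exchange(seen, f(seen), Ordering::AcqRel, Ordering::Relaxed) {
            Ok(_) => return attempts,
            // Another thread changed the value first: retry with what it wrote.
            Err(actual) => seen = actual,
        }
    }
}

/// Runs `threads` threads that each increment one shared counter
/// `updates_per_thread` times. Returns (final value, total CAS attempts).
fn count_the_cass(threads: usize, updates_per_thread: usize) -> (u64, usize) {
    let cell = AtomicU64::new(0);
    let attempts = AtomicUsize::new(0); // bookkeeping only, updated once per thread
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                let mut local = 0;
                for _ in 0..updates_per_thread {
                    local += cas_update(&cell, |v| v + 1);
                }
                attempts.fetch_add(local, Ordering::Relaxed);
            });
        }
    });
    (cell.load(Ordering::Relaxed), attempts.load(Ordering::Relaxed))
}
```

Calling count_the_cass(4, 10_000) always yields a final value of 40,000, but the attempt count is at least 40,000 and varies from run to run. The gap between the two is the wasted work, and it says nothing yet about the cache traffic each attempt generates.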
Some might accuse me of unfairly targeting CAS. That’s hogwash. I’ve been in the trenches for years writing and optimizing systems-level parallel code on Windows. A parallel for loop can go from scaling perfectly to not scaling at all if you choose the wrong granularity for the loop counter increments. And vice versa. Why? Because the frequency of CASs will bring the memory system to its knees. You simply must consider these kinds of things when developing your data structures and algorithms; easing pressure on the cache hierarchy is the only way to scale beyond a handful of processors.
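The granularity point can be sketched as follows, again in Rust as an illustrative assumption (parallel_sum and its parameters are hypothetical names, and fetch_add stands in for the interlocked increment on the loop counter). Each worker claims `chunk` indices at a time from one shared counter, so the chunk size directly controls how often every thread hits that one contended cache line.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Sums `data` on `workers` threads. Each thread claims `chunk` indices at a
/// time from a shared loop counter. Returns (sum, number of counter updates).
fn parallel_sum(data: &[u64], workers: usize, chunk: usize) -> (u64, usize) {
    assert!(workers > 0 && chunk > 0);
    let next = AtomicUsize::new(0); // shared loop counter: the contended location
    let claims = AtomicUsize::new(0); // bookkeeping only, updated once per thread
    let total = thread::scope(|s| {
        let mut handles = Vec::with_capacity(workers);
        for _ in 0..workers {
            handles.push(s.spawn(|| {
                let mut local_sum = 0u64;
                let mut local_claims = 0usize;
                loop {
                    // One atomic read-modify-write per claimed chunk.
                    let start = next.fetch_add(chunk, Ordering::Relaxed);
                    local_claims += 1;
                    if start >= data.len() {
                        break; // nothing left to claim
                    }
                    let end = start.saturating_add(chunk).min(data.len());
                    local_sum += data[start..end].iter().sum::<u64>();
                }
                claims.fetch_add(local_claims, Ordering::Relaxed);
                local_sum
            }));
        }
        handles.into_iter().map(|h| h.join().unwrap()).sum::<u64>()
    });
    (total, claims.load(Ordering::Relaxed))
}
```

For 10,000 elements and 4 workers, a chunk size of 1 costs 10,004 updates to the shared counter (one per element plus one terminating claim per worker), while a chunk size of 1,000 costs 14. The result is identical in both cases. Only the pressure on the memory system changes.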
The sad truth is that only radical changes to the way we write software will allow fine-grained parallelism to scale to the processor counts we expect within a five-year time horizon. Hiding more and more conveniently inserted CAS operations auto-magically for folks is not doing them any good. Mostly functional programming, combined with concurrency-safe mutation on guaranteed-isolated object graphs, is, in my opinion, the only path forward.
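One possible reading of “mutation on guaranteed-isolated object graphs” can be sketched in Rust, where the compiler enforces the isolation. This is an illustrative assumption rather than a design proposed in this post, and scale_in_place is a hypothetical name. Each thread receives exclusive ownership of a disjoint slice, so the mutation needs no CAS, no lock, and no shared cache line on the hot path.

```rust
use std::thread;

/// Multiplies every element by `factor` in place. The slice is split into
/// disjoint parts, and each part is mutated by exactly one thread.
fn scale_in_place(data: &mut [u64], workers: usize, factor: u64) {
    assert!(workers > 0);
    // Ceiling division; at least 1 because chunks_mut rejects a size of 0.
    let per_worker = ((data.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        for part in data.chunks_mut(per_worker) {
            // `part` is an exclusive borrow: no other thread can reach it.
            s.spawn(move || {
                for x in part.iter_mut() {
                    *x *= factor;
                }
            });
        }
    });
}
```

The only synchronization left is the join at the end of the scope, which happens once per thread rather than once per element.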