RSS 2.0

Personal Info:

Joe Send mail to the author(s) leads the architecture of an experimental OS's developer platform, where he is also chief architect of its programming language. His current mission is to enable writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe founded the Parallel Extensions to .NET project. He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife, writing books, writing music, studying music theory & mathematics, and doing anything involving food & wine.

My books

My music

Disclaimer:
The content of this site are my own personal opinions and do not represent my employer's view in anyway.

© 2012, Joe Duffy

 
 Monday, January 12, 2009

I received some feedback on my previous post, Some performance implications of CAS operations, indicating that a few clarifications are in order.  If I had to summarize the intended conclusion, it’d go something like this:

Sharing is evil, fundamentally limits scalability, and is best avoided.

I have to admit that the post was meant to focus more on concrete data, since I expected the meta-point about sharing to be implied.  I figured folks would pick up on the link: (i) Sharing memory requires concurrency control, (ii) Concurrency control requires CAS, (iii) CAS is expensive, therefore (iv) Sharing memory is expensive.  Many people simply don’t understand how crippling CAS can be when placed in a hot path, and I wanted to point out some (albeit extreme) examples of this point.

I did have a motivation for the post.  A lot of people point at lock-free techniques, software transactional memory, reader/writer locks, etc. as ways to improve scalability.  Sadly this seldom pans out.  Each involves CASs of some sort, and, assuming the lock-based equivalent is written properly (that is, to hold locks for very short periods of time), the alternative can in fact often fare worse.  I call this game “count the CASs.”  It’s the roundtrips back to shared memory, failed optmistic attempts, cache invalidations, and line ping ponging that kills you.

Some might accuse me of unfairly targeting CAS.  That’s hogwash.  I’ve been in the trenches for years writing and optimizing systems-level parallel code on Windows.  A parallel for loop can go from scaling perfectly to not scaling at all if you choose the wrong granularity for the loop counter increments.  And vice versa.  Why?  Because the frequency of CASs will bring the memory system to its knees.  You simply must consider these kinds of things when developing your data structures and algorithms; easing pressure on the cache hierarchy is the only way to scale beyond a handful of processors.

The sad truth is that only radical changes to the way we write software will allow fine-grained parallelism to scale to the numbers we expect in the 5 year time horizon.  Hiding more and more conveniently inserted CAS operations auto-magically for folks is not doing them any good.  Mostly functional combined with concurrency-safe mutation on guaranteed-isolated object graphs is, in my opinion, the only path forward.

1/12/2009 10:46:30 PM (Pacific Standard Time, UTC-08:00)  #   

 

Recent Entries:

Search:

Browse by Date:
<January 2009>
SunMonTueWedThuFriSat
28293031123
45678910
11121314151617
18192021222324
25262728293031
1234567

Browse by Category:

Notables: