Personal Info:
Joe leads the architecture of an experimental OS's developer platform, where
he is also chief architect of its programming language. His current mission is to enable
writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe
founded the Parallel Extensions to .NET project.
He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife,
writing books, writing music,
studying music theory & mathematics, and doing anything involving food & wine.
Disclaimer:
The content of this site is my own personal opinion and does
not represent my employer's view in any way.
© 2012, Joe Duffy
 Saturday, November 12, 2011
I often wish that .NET had erred on the side of offering postmortem instead of premortem finalization.
The distinction here is when exactly the finalizer runs, i.e. after or before the GC has actually reclaimed an object. This governs whether a dying object is (a) accessible from within its own finalizer, and therefore (b) eligible to become resurrected. Postmortem finalization occurs after the object is long gone, and hence says “no” to both of these questions; premortem finalization happens beforehand and hence says “yes.”
.NET chose the latter.
The primary downside of premortem finalization, setting aside the confusing nature of resurrection, is that the object in question cannot be collected until after its finalizer has run. This should be fairly obvious: it is only that second time the object is found to be dead “again” that we know the finalizer has or has not resurrected it.
This may seem like a small matter. But it matters quite a lot when building high performance software. In a garbage collected system, relying on high rates of finalization to keep up with demanding workloads almost never works. But in a premortem finalization system, even moderate demands become cause for concern.
Premortem finalization leads to finalized objects getting promoted to the elder generations before actually dying. If you check the value of GC.GetGeneration(this) within an object’s finalizer, for example, you will notice it is one greater than the generation in which the object was found to be dead the first time. Say it was found dead in Gen1; then GC.GetGeneration(this) will return ‘2’. Yet another collection must happen, in Gen2 to boot, in order to actually reclaim this object. And, of course, it’s not just this object, but also the transitive closure of objects to which it refers.
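The promotion described above can be observed with a small probe. This is only a sketch: the names FinalizerProbe and FinalizerDemo are made up for illustration, and the exact generation reported depends on the runtime and GC configuration, so no particular number is promised here.

```csharp
using System;
using System.Runtime.CompilerServices;

sealed class FinalizerProbe
{
    // Generation observed from inside the finalizer; -1 until the finalizer runs.
    public static int ObservedGeneration = -1;

    ~FinalizerProbe()
    {
        // Premortem: the object is still accessible here, so we can ask the GC about it.
        ObservedGeneration = GC.GetGeneration(this);
    }
}

static class FinalizerDemo
{
    // Allocate in a separate, non-inlined method so that no local in the
    // caller accidentally keeps the probe alive.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void AllocateAndDrop()
    {
        new FinalizerProbe();
    }

    public static int Run()
    {
        AllocateAndDrop();
        GC.Collect();                  // finds the probe dead and queues its finalizer
        GC.WaitForPendingFinalizers(); // waits for the finalizer thread to run it
        return FinalizerProbe.ObservedGeneration;
    }

    static void Main()
    {
        Console.WriteLine("Generation seen in finalizer: " + Run());
    }
}
```

The probe is allocated in Gen0, so a value greater than zero is the promotion the paragraph above talks about.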
This approach penalizes the majority use case of finalizable objects. At least on .NET, most objects merely invoke CloseHandle on an IntPtr in the finalizer. This clearly needn’t hold up freeing the managed state. And resurrection is a dubious scenario anyway: such objects quickly end up in Gen2 where collections are expensive and infrequent. If you’re pooling via resurrection because you create expensive objects at a high rate of birth and death, manual memory management (or a different design altogether) is likely your only savior.
Although Java’s finalizers are also premortem, the JDK offers the facilities necessary to implement postmortem finalization on your own. It entails using WeakReference and ReferenceQueue. See this article if you are curious.
.NET doesn’t offer the notifications required to do the same. You can, however, learn from postmortem finalization to write better premortem finalizers: prefer simple finalizable objects that refer to only the state necessary to implement finalization – which ordinarily means no other managed objects. The SafeHandle abstraction is a good example of this. Most implementations consist of a simple IntPtr. This pattern will ensure that collateral promotion due to finalization is more contained.
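A minimal sketch of that pattern follows. NativeBufferHandle and NativeBuffer are hypothetical names, and Marshal.AllocHGlobal stands in for whatever native resource you actually own. The only finalizable object is the handle, which holds nothing but an IntPtr; the wrapper that carries managed state has no finalizer of its own.

```csharp
using System;
using System.Runtime.InteropServices;

// Finalizable leaf object: holds only an IntPtr, so nothing else is promoted with it.
sealed class NativeBufferHandle : SafeHandle
{
    public NativeBufferHandle(int byteCount)
        : base(IntPtr.Zero, true)
    {
        SetHandle(Marshal.AllocHGlobal(byteCount));
    }

    public override bool IsInvalid
    {
        get { return handle == IntPtr.Zero; }
    }

    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);
        return true;
    }
}

// Non-finalizable wrapper: its managed state is not held up by a finalizer.
sealed class NativeBuffer : IDisposable
{
    private readonly NativeBufferHandle _handle;
    private readonly string _name; // managed state kept out of the finalizable object

    public NativeBuffer(string name, int byteCount)
    {
        _name = name;
        _handle = new NativeBufferHandle(byteCount);
    }

    public string Name { get { return _name; } }
    public bool IsClosed { get { return _handle.IsClosed; } }

    public void Dispose() { _handle.Dispose(); }
}

static class NativeBufferDemo
{
    static void Main()
    {
        using (var buffer = new NativeBuffer("scratch", 64))
        {
            Console.WriteLine(buffer.Name + " closed? " + buffer.IsClosed);
        }
    }
}
```

If a NativeBuffer is dropped without Dispose, only the small handle object waits for finalization; the wrapper and its string can be reclaimed in the first collection that finds them dead.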
After saying all of this, I hope it is just amusing trivia. I'm sure nobody is writing finalizers these days anyway.
 Sunday, October 23, 2011
It's been unbelievably long since I last blogged.
The reason is simple. I've been ecstatic in my job and, every time I think to write something, I quickly end up turning to work and soon find that hours (days? months?) have passed. This is a wonderful problem to have, but not so good for keeping the blog looking fresh and new. (I've also been writing a fair bit of music lately.) Well, this weekend I managed to lock myself out of my VPN access, and decided that this was a sign that I ought to dust off the cobwebs on a blog entry or two that I've had in the works for quite some time.
The topic for today is generics, a feature many of us know and love. Specifically, their impact on software performance, something I frequently see developers struggling to understand and tame in the wild.
The blessing; and, the curse
I absolutely love generics. I can hardly imagine writing code without them these days. The code reuse, higher-order expressiveness, beautiful abstractions, and static type-safety enabled by first class parametric polymorphism are all game-changing. And being a language history wonk, I'm delighted to see many mainstream programming languages stealing a page from ML and theoretical CS generally.
Generics, however, are not free. And in some circumstances, they are, dare I say, rather expensive. Few language features surpass generics in the ability to write a concise and elegant bit of code, which then translates into reams of ugly assembly code out the rear end of the compiler. I am of course speaking mainly to models in which compilation leads to code specialization (like .NET's), versus erasure (like Java's).
Most developers coming from a C++ background understand code expansion deeply, because they program with templates. Unlike templates, however, there is ample runtime type information (RTTI) associated with generic instantiations… such that the costs associated with generics frequently – and perhaps surprisingly – are a superset of those costs normally associated with C++ templates. At the same time, because the compiler understands parametric polymorphism, it can sometimes do a better job optimizing, e.g. with techniques like code sharing.
Basically, with templates and erasure, the equation for predicting code expansion is super simple. You get it all (in the former) or you get none of it (in the latter), but with specialized generics this equation is quite complex.
Paradoxically, these same costs are the main value that generics bring to the table! Write a little type-agnostic code and then "instantiate" that same code over multiple types without repeating yourself. But, generics are not magic; did you ever stop to wonder things like: What machine code is generated for these types? Does the compiler need to specialize the actual code that runs on the processor for unique instantiations, or is it all the same? And if it does need to specialize, where, how, and why? And perhaps most importantly, what hidden costs are there, and how should I think about them while writing code?
Before you read further, note that paranoia need not ensue. The point of this article is merely to raise your awareness. All programmers should know what the abstractions they use cost, and make conscious tradeoffs when writing code with them. The aforementioned benefits of generics really are often "worth it," both in the elegance and reusability of abstractions, and in developer productivity. In my experience, however, the associated costs are so subtle and ill-documented that even people who write highly generic code typically remain unaware of them. Even more subtly, these costs are somewhat different in nature when pre-compiling your code, such as with .NET's NGen technology.
This brief essay will walk through a few such costs in the context of the .NET Framework and CLR's implementation of generics. This is in no way an exhaustive study of generic compilation, and your mileage will vary from one platform to the next. Although the studies presented would apply to other implementations of generics, the reality is that if you're writing code in, say, Java – where type erasure is employed rather than code specialization – then all of this is going to be less relevant to you.
With no further delay, let's get started.
Code, RTTI, oh my
When considering costs, we must always think about both size and speed.
There is at least as much assembly code created for an instantiation as the code you've written for the generic abstraction in C# or MSIL. A simple mental model – which thankfully turns out not to be entirely accurate, owing to some sharing optimizations described below – is that for each instantiation of a generic type or method you get a new copy of that code specialized to the type in question. Obviously, this increases code size. And just as obviously, it will add some runtime cost to JIT compile the code (if you aren't using ahead of time compilation), as well as putting more pressure on I-cache and TLB.
Another source of significant cost is the runtime data structures needed for RTTI and Reflection, like vtables and other metadata. Quite simply, the runtime needs to know the identity of each generic instantiation, to prevent things like casting a List<Int16> to a List<String>, and even List<Object> to List<String>; and given that there is often distinct code generated for unique instantiations, the vtable contents for those different List<T> instantiations are going to look quite different.
And of course, there are statics. Each generic instantiation gets its own set, requiring extra storage and another level of indirection when fetching them. Unique statics mean D-cache and TLB pressure. It turns out that code shared across AppDomains, like mscorlib.dll, already needs such things. But I have found that it's surprisingly common for a developer to throw a static field (or nested class!) onto an outer generic type, without actually needing it to be replicated for each unique instantiation.
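To make the statics point concrete, here is a small sketch. Cache<T> and CacheStats are invented names; the second class shows one way to avoid the replication when you don't need it, by hoisting the field onto a non-generic holder.

```csharp
using System;

// A static declared on a generic type is replicated for every instantiation.
class Cache<T>
{
    public static int Hits; // one copy per T
}

// Hoisting the field into a non-generic holder gives a single shared copy.
static class CacheStats
{
    public static int Hits; // one copy in total
}

static class StaticsDemo
{
    static void Main()
    {
        Cache<int>.Hits++;
        Cache<string>.Hits += 10;
        CacheStats.Hits++;
        CacheStats.Hits++;

        Console.WriteLine(Cache<int>.Hits);    // 1
        Console.WriteLine(Cache<string>.Hits); // 10
        Console.WriteLine(CacheStats.Hits);    // 2
    }
}
```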
In addition to the immediate effects, generic types often refer to other generic types which refer to other generic types … and so on. Instantiating a root type is akin to instantiating the full transitive closure.
To make our discussion friendly and familiar, we shall use the .NET Framework's List<T> type – presumably one of the most commonly used generic types on the planet – to illustrate many of these costs. And unfortunately, you'll also see that many of the common performance pitfalls plague this type too. (So, really, you need not feel bad if your own code is guilty of them too.)
Why the distinct code, anyway?
There is only one copy of List<T>'s code in mscorlib's MSIL. It is essentially just a blueprint for the list class.
When I create a List<Int16> in my program and use it, however, there clearly needs to be some assembly code created in order to execute List<T>'s associated functionality, just with any T's used by List<T>'s code replaced by actual 2-byte short integers. And similarly, if I were to instantiate a List<String>, all those T's need to be replaced by pointer-sized object references, either 4- or 8-bytes depending on machine architecture, that are reported live to the garbage collector.
This is what leads to our simple mental model above, in which each instantiation gets its own copy of the code. In this case, both List<Int16> and List<String> would be entirely independent types at runtime, with wholly separate copies of the machine code.
Certainly if I manually went about creating my own Int16List and StringList types, they would be distinct types with distinct machine code generated. Being a prudent developer, however, I'd probably try to arrange to share as much of the implementation as possible between the two types, perhaps using implementation inheritance. But alas, there's no way I could share it all: any code specific to Int16 or String, for example, would surely differ, both in MSIL and in the native code.
Generics basically give you the ability to do this same thing, without you needing to do the factoring of type-independent and type-specific code yourself. The compiler does that for you.
Why might the code be different? As stated above, Int16 values are 2 bytes and String pointers are native word sized (4 bytes on 32-bit, 8 bytes on 64-bit). All the code that passes values of type T on the stack, either as arguments or return values, moves instances into and out of memory locations (like the T[] backing array), and so on, needs to be specialized based on the size of T. This wouldn't be true of a generics implementation that used type erasure, like Java's, but then you'd need to box the value types on the heap so that everything is a pointer. If T is a Float, we will likely emit code that uses floating point math instead of general purpose registers. Any tables that report GC roots are likely to be different, since object references can be embedded inside struct values that get laid out on the stack. And so on. Some day you might want to compare the machine code for a simple generic Echo<T> method for different kinds of T's; it is really easy to do, and is quite illustrative.
A naïve wish might go as follows. Imagine that I had written my own dedicated Int16List and StringList types, and that we diffed the resulting machine code between the distinct list types; we'd presumably find a fair bit of duplication for all the reasons stated above. It would be a nice property if, when we used the generic List<Int16> and List<String> types, and similarly diffed the resulting assembly, the amount of specialized code would be no greater than the amount of specialized code between our best hand-written Int16List and StringList types. I.e., only parts that need to be different are different.
We could go even further with our wish. Imagine I had a List<DateTime> and List<Int64>. Both are 8-byte values, and do not contain any GC references. If I were writing a specialized 8ByteValueList in C++ and had immense performance constraints, I would, again being a prudent developer, probably use some type unsafe code, with nasty reinterpret_casts, so that I could use the same list type to store any kind of 8-byte value. (Except in C++ I could even store pointers!) It would also be a nice property if generics did some of this for us, while still retaining the type safety we love about generics.
It turns out we will get neither of our wishes exactly, although we will get something close to the spirit of our wishes.
Code sharing
Indeed, the CLR does arrange to share many generic instantiations. The rule is simple, although it is subject to change in the future (being an optimization and all): instantiations over reference types are shared among all reference type instantiations for that generic type/method, whereas instantiations over value types get their own full copy of the code. In other words, List<String> and List<Object> are backed by the same code, but List<DateTime> and List<Int64> get their own.
It is true that, in theory, List<DateTime> and List<Int64> could use the same shared code, because they are of identical size and have GC roots in the same locations (trivially, because neither has one). But there are additional restrictions on generated code that make this problematic, for example if we were talking about Double and Int64. In short, the CLR doesn't actually share value type instantiations as of the 4.0 runtime, although clearly it could in certain situations (value types of the same size with GC roots in the same locations).
As you might guess, this extends to multi-parameter generics in obvious ways. A Dictionary<Object, Object> is shared with a Dictionary<String, String>, etc., and a Dictionary<Int64, Object> is shared with a Dictionary<Int64, String>. A Dictionary<DateTime, DateTime> is not, however, shared with a Dictionary<Int64, Int64> instantiation, as per the above.
My pal Joel Pobar wrote a post eons ago describing how code sharing works in great detail, which I do not intend to rehash. Please refer to his post for an excellent overview of how code sharing works.
An important thing to remember, however, is that no matter how much code sharing happens, you still need distinct RTTI data structures. So although List<Object> and List<String> share the same machine code, they have distinct vtables; sure, each table is full of pointers to the same code functions, but you are still paying for the runtime data structures. A distinct instantiation, therefore, is never actually free!
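The distinct identities are easy to see from C#, whatever the runtime does with the machine code underneath. The sketch below (RttiDemo is a made-up name) uses only typeof, the is operator, and Type.GetGenericTypeDefinition.

```csharp
using System;
using System.Collections.Generic;

static class RttiDemo
{
    public static bool IsListOfObject(object o)
    {
        return o is List<object>;
    }

    static void Main()
    {
        object strings = new List<string>();

        // Distinct runtime types, even if the same machine code backs both.
        Console.WriteLine(typeof(List<string>) == typeof(List<object>)); // False
        Console.WriteLine(IsListOfObject(strings));                       // False

        // Both instantiations come from the same open generic definition.
        Console.WriteLine(typeof(List<string>).GetGenericTypeDefinition()
                          == typeof(List<object>).GetGenericTypeDefinition()); // True
    }
}
```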
Transitive closures
Why am I making such a big deal about code sharing, anyway?
Another surprising aspect of generics is the transitive closure problem. Particularly when doing pre-compilation of generics, each unique instantiation doesn't simply lead to a specialized version of the code associated with the type being directly instantiated. The whole transitive closure of types, starting with that root type, will also be compiled. This can be a surprisingly huge number of types! JIT is much more pay-for-play, such that you get one level of explosion at a time, but once there is code that calls a particular type's method, even if that code is lazily compiled, creation of the type is forced.
To illustrate this, let's take our friend List<T>. Before examining the list, how many generic types would you expect that a single new List<T> instantiation instantiates?
What if I told you that a single List<int> instantiation creates (at least) 28 types? And that, say, five unique instantiations of List<T> might cost you 300K of disk space and 70K of working set? Well, of course, if you are writing a script, or something with fairly loose performance requirements, this might not matter much. But if topics like download time, mobile footprint, and cache performance are important to you, then you probably want to pay attention to this. To a first approximation, size is speed.
Yes, you heard me right: 28 types. Holy smokes... How can this be?!
Nested types are one obvious answer, and indeed List<T> has two: an Enumerator class (which is reasonable), and one to support the legacy synchronized collections pattern (which we presumably wish we didn't have to pay for). The larger answer here, however, is functionality. Yes, functionality! This is a great example where the cost of generics explodes as you add more features. Start simple, keep adding stuff, as has happened to List<T> over the years, and you will soon find that a series of elegant abstractions adds up to a gut-wrenching bucket of bytes.
Here's a quick sketch of the transitive closure of generic types used by List<T>:
List<T>
T[] type
IList<T> type
ICollection<T> type
IEnumerable<T> type
IEnumerator<T> type
ReadOnlyCollection<T> type (AsReadOnly)
(Nothing more than List<T>)
IComparer<T> type (BinarySearch, Sort)
{Array.BinarySearch<T> method (BinarySearch)}
ArraySortHelper<T> type
IArraySortHelper<T> type
GenericArraySortHelper<T> type
EqualityComparer<T> type (Contains)
IEqualityComparer<T> type
IEquatable<T> type
NullableEqualityComparer<T> type
Nullable<T> type
EnumEqualityComparer<T> type
{JitHelpers.UnsafeEnumCast<T> method}
ObjectEqualityComparer<T> type
Predicate<T> delegate type (Find*)
Action<T> delegate type (ForEach)
{Array.LastIndexOf<T> method (LastIndexOf)}
Comparison<T> delegate type (Sort)
Array.FunctorComparer<T> type (Sort)
Comparer<T> type
GenericComparer<T> type
NullableComparer<T> type
ObjectComparer<T> type
{Array.Sort<T> method (Sort)}
ArraySortHelper<T> type (see earlier)
Enumerator inner type
SynchronizedList<T> inner type
IList<T> interface (see earlier)
ICollection<T> interface (see earlier)
IEnumerable<T> interface (see earlier)
{Interlocked.CompareExchange<Object> method (SyncRoot)}
{_emptyArray T[] static field}
I'm not trying to pick on List<T>. This class is only unique in this regard in that it offers a large transitive closure of (mostly useful!) functionality. And it's not the only guilty party. We recently shaved off 100K's of code size on my team, for example, that were being lost simply because all the LINQ methods were declared as instance methods on the base collection class, rather than being extension methods as in .NET. We found nested enumerator and iterator types, cached static lambdas as static fields, and huge transitive closures of other generic types, all allocated when you just touched any collection type. Any collections library is apt to be full of this stuff, since they are highly generic. But collection libraries are certainly not the only places to go sniffing for such problems.
As an aside, it turns out that extension methods are a great way to make generic abstractions more pay-for-play.
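As a rough illustration of the idea, consider the sketch below. Bag<T>, BagExtensions, and CountWhere are hypothetical names, not anything from .NET or from my team's library. The type itself carries only storage; the optional operation lives in a static class, so the intent is that its code and its Func<T, bool> dependency are only pulled in for the T's whose callers actually use it.

```csharp
using System;

// Hypothetical collection type: only core storage lives on the class itself.
class Bag<T>
{
    private T[] _items = new T[4];
    private int _count;

    public void Add(T item)
    {
        if (_count == _items.Length) Array.Resize(ref _items, _count * 2);
        _items[_count++] = item;
    }

    public int Count { get { return _count; } }
    public T this[int index] { get { return _items[index]; } }
}

// Optional functionality lives outside the type, as a generic extension method.
static class BagExtensions
{
    public static int CountWhere<T>(this Bag<T> bag, Func<T, bool> predicate)
    {
        int matches = 0;
        for (int i = 0; i < bag.Count; i++)
        {
            if (predicate(bag[i])) matches++;
        }
        return matches;
    }
}

static class ExtensionDemo
{
    static void Main()
    {
        var bag = new Bag<int>();
        bag.Add(1);
        bag.Add(2);
        bag.Add(3);
        Console.WriteLine(bag.CountWhere(x => x > 1)); // 2
    }
}
```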
Adding it up
Let's see what the above adds up to. I ran some programs through NGen as a quick and dirty experiment, and inspected the on-disk sizes and also the runtime working set sizes. I ensured clrjit.dll was not loaded into the process. Here's what I found. Take these numbers with a grain of salt, as they will change from release to release; they are simply rules of thumb. When in doubt, crank up NGen, DumpBin, and/or start trawling the heap with VADump yourself!
One empty type with no methods in CLR 4.0 seems to cost roughly 0.2K bytes of on-disk metadata, and about 0.7K in x64 working set. (This is a good rule of thumb irrespective of generics… in terms of order of magnitude, you can think "one empty type means 1K of memory.") A single List<S> instantiation, where S is an empty struct, is in the neighborhood of 60K on-disk metadata, and 14K of x64 working set. A single List<C> instantiation, however, where C is an empty class, is only – surprise – about 7K on-disk and 4K in-memory. Why the large discrepancy? Well, it just so happens that mscorlib.dll already includes an instantiation or two of List<T> over reference types, so this 4K is the incremental cost on top of reusing what is there; remember, unique vtables and data structures are still required for RTTI.
Rico did a similar analysis a few years back, and concluded that each unique List<E>, where E was an enum type, cost 8K. Why the increase to 14K over the years? x64 and ever-increasing functionality on the basic collections classes, presumably. Remember, it's not just List<T> that has grown, it's also everything that List<T> uses internally as an implementation detail.
Dynamic specialization with dictionaries
Some specialization in behavior can be accomplished with dynamic runtime behavior, rather than static code specialization. A prime example is the following:
class C
{
    public static void M<T>()
    {
        System.Console.WriteLine(typeof(T).Name);
    }
}
Where does the program get the value of typeof(T) from? If you look at the MSIL, you will see that C# has emitted a ldtoken MSIL instruction. For some struct type, we can compile that as a constant in the code, because it is getting its own copy of the code. What occurs when two instantiations share code, like M<String> and M<Object>, however? As you might guess, there is an indirection.
The thing we usually use for such indirections – the vtable – is nowhere to be found in this particular example, because M is a static method. To deal with this, the compiler inserts an extra "hidden" argument, frequently called a generic dictionary, from which the emitted assembly code can fetch the type token. The cost here typically isn't bad, because many of the operations that pull in the dictionary are already RTTI or Reflection-based, and would require an indirection already (e.g., through a vtable).
The operations which require a dictionary of some kind include anything that has to do with RTTI and yet no vtable is readily accessible: typeof, casts, is and as operators, etc. And as you might guess, if instantiations aren't shared (such as with value types on the CLR), no dictionary is needed, because the code is fully specialized. There are also multiple kinds of dictionaries used by the runtime, depending on whether you are using a generic type, method, or some combination of both.
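Here is a small sketch of such operations in one method (SharedCodeDemo and Describe are names invented for this example). Both typeof(T) and the is check need T's runtime identity; for the reference-type calls, that identity cannot be baked into code that is shared between instantiations.

```csharp
using System;

static class SharedCodeDemo
{
    // typeof(T) and 'is T' both depend on which T the caller picked.
    public static string Describe<T>(object value)
    {
        return typeof(T).Name + ":" + (value is T);
    }

    static void Main()
    {
        Console.WriteLine(Describe<string>("hi")); // String:True
        Console.WriteLine(Describe<object>("hi")); // Object:True
        Console.WriteLine(Describe<Uri>("hi"));    // Uri:False
        Console.WriteLine(Describe<int>("hi"));    // Int32:False
    }
}
```

The first three calls are over reference types and so are candidates for shared code; the Int32 call gets its own specialized copy under the rule described earlier.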
JITting when you didn't mean to
There are two primary ways in which you will JIT compile when using generics, even if you were a good doobie and used NGen to reduce startup time.
One way is if you instantiate a new generic type exported from mscorlib.dll with a type argument also defined in mscorlib.dll, that wasn't already instantiated inside mscorlib.dll. (See my old Generics and Performance blog entry for more details.) You can very easily see this happening by using an instantiation like Dictionary<DateTime, DateTime>, and watching the clrjit.dll module getting loaded.
The other way is generic virtual methods (GVMs). It turns out that GVMs pose incredible difficulty for ahead of time separate compilation, because the compiler cannot know statically which slot in the vtable points at the particular implementation you are about to call. (Unless you use whole program compilation, something not offered by .NET at the present time.) For each such method, there's an unbounded set of possible specialized instantiations a slot might point to, and so the vtable cannot be laid out in a traditional manner. C++ doesn't allow templated virtual methods for this very reason.
Thankfully, GVMs are somewhat rare. However, it only took 5 minutes of poking around to find one that is quite front-and-center in .NET: in the implementation of LINQ, there is an Iterator<T> type that has a method declared as follows:
public abstract IEnumerable<TResult> Select<TResult>(Func<TSource, TResult> selector);
All we need to do is figure out how to tickle that method, and we're guaranteed to JIT. As it turns out, sure enough, the following code does the trick and forces clrjit.dll to get loaded in .NET 4.0:
int[] xs = …;
int[] ys = xs.Where(x => true).Select(x => x).ToArray();
The Iterator<T> type is used for back-to-back Where and Select operators, as a performance optimization that avoids excess allocations and interface dispatch. But because it depends on a GVM, it does incur an initial penalty for using it, even if you have used NGen to avoid runtime code generation.
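If you want to recognize GVMs in your own code, the shape to look for is a method that is both virtual (or abstract) and has its own type parameter. The sketch below uses invented types (Shape, Circle, Square) purely to show that shape; it is not the LINQ implementation.

```csharp
using System;

abstract class Shape
{
    // Generic and virtual at once: every (override, TResult) pair is a distinct
    // body, so no fixed vtable slot can cover all instantiations ahead of time.
    public abstract TResult Accept<TResult>(Func<string, TResult> visitor);
}

sealed class Circle : Shape
{
    public override TResult Accept<TResult>(Func<string, TResult> visitor)
    {
        return visitor("circle");
    }
}

sealed class Square : Shape
{
    public override TResult Accept<TResult>(Func<string, TResult> visitor)
    {
        return visitor("square");
    }
}

static class GvmDemo
{
    static void Main()
    {
        Shape[] shapes = { new Circle(), new Square() };
        foreach (Shape s in shapes)
        {
            // Two instantiations of the same virtual method per shape.
            Console.WriteLine(s.Accept<int>(name => name.Length));
            Console.WriteLine(s.Accept<string>(name => name.ToUpper()));
        }
    }
}
```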
In conclusion
The moral of the story here is not that you should fear generics. Beautiful things can be built with them.
Instead, it's to use generics thoughtfully. Nothing in life is free, and generics are no exception to this rule. If code size is important to you, then you will want to have performance gates measuring your numbers against your goals; if you are working in a codebase that uses generics heavily, and you end up spending any significant time on code size optimizations, you will want to try to track down large transitive closures. As I stated above, you could really be throwing away 100K's of code here.
And as to the surprise JITting, I've seen teams compiling with NGen and having a functional gate that fails any new code that causes clrjit.dll to get loaded at runtime. Although tracking down the root cause might be tricky when that gate fails, at least you won't let the camel's nose under the tent.
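One way such a gate might look is sketched here. JitGate and IsModuleLoaded are hypothetical names, and the check only makes sense in the scenario this post is about: a fully NGen'd .NET Framework application, where clrjit.dll stays unloaded until something needs to be JIT compiled. On runtimes that always load the JIT, the gate would fail unconditionally.

```csharp
using System;
using System.Diagnostics;

static class JitGate
{
    // Returns true if a module with the given file name is loaded in this process.
    public static bool IsModuleLoaded(string moduleName)
    {
        using (Process self = Process.GetCurrentProcess())
        {
            foreach (ProcessModule module in self.Modules)
            {
                if (string.Equals(module.ModuleName, moduleName,
                                  StringComparison.OrdinalIgnoreCase))
                {
                    return true;
                }
            }
        }
        return false;
    }

    static int Main()
    {
        // Exercise the scenario under test before this point.
        bool jitLoaded = IsModuleLoaded("clrjit.dll");
        Console.WriteLine("clrjit.dll loaded: " + jitLoaded);
        return jitLoaded ? 1 : 0; // non-zero exit code fails the gate
    }
}
```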
Investing in tools here is a very good idea.
When it comes down to it, really thinking about what code must be executed by the process is helpful. Step back and imagine you were writing this all in C++, with the associated performance concerns front-and-center: consider how you'd arrange to reuse as much implementation as you can, manage memory efficiently, perhaps employ unsafe tricks that would have violated type safety and so are off-limits in .NET, and all that jazz. Then step back and be grateful that you have a type- and memory-safe environment to help you write more robust code, but also be realistic about what you are paying in exchange.
I hope you've learned a useful thing or two in this article. If you'd like to learn more, here are a few other good resources:
- An MSR paper on the original implementation of .NET generics: http://research.microsoft.com/pubs/64031/designandimplementationofgenerics.pdf
- Rico's "Six Questions About Generics and Performance" blog entry: http://blogs.msdn.com/b/ricom/archive/2004/09/13/229025.aspx
- Joel Pobar's "Generics and Code Sharing" blog post: http://blogs.msdn.com/b/joelpob/archive/2004/11/17/259224.aspx
- My "Generics and Performance" blog entry: http://www.bluebytesoftware.com/blog/2005/03/23/DGUpdateGenericsAndPerformance.aspx
Cheers.
 Wednesday, June 01, 2011
InfoQ recently asked me a few questions about concurrency and programming languages, and here is what I had to say:
http://www.infoq.com/articles/Interview-Joe-Duffy
A little teaser:
"The major shift we face will be that mainstream languages will start to incorporate more concurrency-safety -- immutability and isolation -- and the platform libraries and architectures will better support this style of software decomposition. OOP developers are accustomed to partitioning their program into classes and objects; now they will need to become accustomed to partitioning their program into asynchronous Actors that can run concurrently. Within this sea of asynchrony will lay ordinary imperative code, frequently augmented with fine-grained task and data parallelism."
As an aside, I know I've been super quiet lately. I never thought I'd go months without blogging. My sincere apologies for this; work has been too all-consuming / fun, and I've been unable to carve out much time for anything else. (Speaking of which, we are still hiring: email me at joedu at you-know-where dot com if you are interested.) I'm about to head out to Europe for a few weeks, where hopefully I'll have a bit of time to write up something exciting to share. Cheers.
 Saturday, December 04, 2010
After spending more time than I’d like to admit over the years researching memory model literature (particularly Java’s terrific JMM and related publications), subsequently trying to “fix” the CLR memory model, reviewing proposals for new C++ memory models, and beating my head against the wall for months developing a new memory model that supports weakly ordered processors like the kind you’ll find on ARM in a developer-friendly yet power-efficient way, I have a conclusion to share.
Volatile is evil
Why? Let me recount the reasons:
- It doesn’t mean what you think.
- It used to have a very specific purpose (to ensure memory operations with external side-effects did not get reordered) and has since gotten bastardized and used for many secondarily-related purposes.
- Even if you think it does mean what you think, the annotation scheme is all wrong. Volatile annotates a storage location, and yet what really matters is what happens when accessing said storage locations. The fences occur when you access the variable, not when you declare it. And yet from a readability perspective, they are completely invisible and easy to miss.
- And even if you don’t care about readability, the meaning of volatile changes wildly when you switch platforms. Today it’s store / release, tomorrow it’s write / read fences. Perhaps it’s even sequentially consistent. And the label of “store / release” could actually be a white lie, as with the CLR’s memory model thanks to store buffer forwarding and the lack of fences in the CLR JIT’s x64 stores.
- Performance, man, performance! Sure, sequential consistency as the default sounds nice on the tin, but once you’re running that mobile app on ARM, and sucking up 160 cycles for each write you perform, you’re going to curse volatile like the plague.
And so the moral of the story follows...
Attempting to “fix” volatile is a waste of time
Instead, a new world order has arrived. We must take a two-pronged approach to solving instruction-level interleaving bugs, neither prong having much to do with the traditional definition of memory models or volatiles. We must:
1. Eliminate memory ordering from 99% of developers’ purviews. This is already the case with single-threaded programs, because code motion in compilers and processors is limited to that which affects only concurrent observations. So the answer is pretty clear: developers must move towards single-threaded programming models connected through message passing, optionally with provably race-free fine-grained parallelism inside of those single-threaded worlds.
2. Leave the memory model esoterica to the Einsteins, and radically change its meaning. Data dependence and transitive visibility of memory operations are in. Volatiles on storage locations are out. Instead, we must throw fences into programmers’ faces, and force them to understand each and every one that occurs. And moreover, force them to decide about each and every one that occurs. Specifically, hidden fences thanks to volatile are no more. Those who cannot take it should fall into the 99% bucket already mentioned above, versus the 1% bucket.
Let’s set #1 aside for now, since it’s obviously a huge can of worms.
But what about #2? It is quite encouraging that the C++0x group is firmly on the path of #2. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2145.html for more details. In a nutshell, each location that you’d have ordinarily tagged volatile instead becomes a template atomic type. And then each read / write has the opportunity to specify the desired kind of fence, whether that is acquire (for reads), release (for writes), fully ordered, or relaxed (meaning no fence).
I do think it’d be worth them considering compiler-only fences too. So that relaxed means fully-relaxed, and there is a fence in-between that merely prevents the compiler from optimizing the memory operation. This pays homage to volatile’s legacy in C as merely a variable that mustn’t be subject to optimizations, because operations against these variables pertain to, say, memory-mapped device I/O.
Another nitpick of mine is that I’d have required each access to specify the fence, whereas C++0x implicitly uses full fence if left unspecified at the callsite. It’s a minor convenience, but I like always having the fence spelled out very explicitly in the code. Lock-free access to shared variables is sufficiently dangerous that automatic sequential consistency is the least of your problems.
Nevertheless, the C++0x direction is a massive step forward, and these are just minor details.
My hope is that .NET follows suit. And the timing couldn’t be more apropos than now: we are moving forward in a heavily mobile, distributed-system-on-a-chip, and heterogeneous world, where processor memory models will necessarily continue to weaken. The overly strong x86 memory model, kept alive primarily to ensure compatibility, has simply grown too expensive to accommodate. The power benefits and architectural simplifications are hard to argue with, and because compatibility becomes less of an issue as new platforms arise (e.g. for mobile) and the world moves to the cloud, and hence there is less legacy to worry about, I do hope that processor vendors seize the opportunity. ARM certainly has. It is less about out-of-order execution than it is about coherency costs. Truthfully, I’d be disappointed if anything else happened, even though the risk to compatibility for shrink-wrapped software scares the hell out of me. But this is most certainly the right way to go, long-term. As software platforms move in the direction of #1 – as I also foresee – the need for fences dwindles. The cost of supporting the current .NET memory model is too great and will become a liability with time.
Thankfully, it is quite simple to build a veneer atop .NET that works a lot more like atomic. For example, imagine that we had a new System.Threading.Volatile static class, and that it offered the moral equivalent to atomic inner types for each atomic primitive we can synchronize against:
namespace System.Threading
{
    public static class Volatile
    {
        public struct Int32 {..}
        public struct Int64 {..}
        …
        public struct Reference<T> where T : class {..}
        …
    }
}
Now instead of tagging a location as ‘volatile’, you would use one of these primitives. For example, rather than:
static volatile MySingleton s_instance;
You would say:
static Volatile.Reference<MySingleton> s_instance;
Each struct has a similar set of operations. For example:
namespace System.Threading
{
    public static class Volatile
    {
        public struct Int32
        {
            public Int32(int value);
            public int ReadAcquireFence();
            public int ReadFullFence();
            public int ReadCompilerOnlyFence();
            public int ReadUnfenced();
            public void WriteReleaseFence(int newValue);
            public void WriteFullFence(int newValue);
            public void WriteCompilerOnlyFence(int newValue);
            public void WriteUnfenced(int newValue);
            public int AtomicCompareExchange(int newValue, int comparand);
            public int AtomicExchange(int newValue);
            public int AtomicAdd(int delta);
            public int AtomicIncrement();
            public int AtomicDecrement();
            // Etc…, bitwise ops, other math ops, etc.
        }
    }
}
Of course, only the integer types would offer the increment, decrement, add, and related operators. And it turns out that offering different kinds of fences on the Atomic* operations would be incredibly useful too, because processors like ARM do not couple the fence to the compare-and-swap / load-locked-store-conditional as x86 processors do. Taking advantage of this can be huge if you are writing performance critical code, like a concurrent garbage collector whose atomic swaps need not imply ordering with the surrounding instruction stream. You can quibble over the details, like whether these should use enums instead of the name to encode the fence-kind. I did it this way to keep the implementations branch-free, although with a decent inlining JIT compiler, it’d probably optimize those away thanks to constant propagation.
It’s quite trivial to implement these APIs atop existing .NET primitives. I built a little library that does so, but it was so boring and repetitive I decided not to post it alongside this blog entry as originally intended.
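To give a flavor of what such a veneer could look like, here is a minimal sketch of the 32-bit integer case only. It is an illustration under stated assumptions, not the library mentioned above: the name FencedInt32 is made up, it assumes .NET 4.5 or later (for the Volatile.Read / Volatile.Write and Interlocked.MemoryBarrier methods that ship in the framework today), and "full fence" is assumed to mean a full barrier on both sides of the access. The compiler-only variants are left out because there is no obvious portable way to express them with these primitives, and the Interlocked operations are always fully fenced, so weaker atomic variants are not expressible here either.

```csharp
using System.Threading;

// Hypothetical type name; a sketch of the Int32 case, not a real framework API.
public struct FencedInt32
{
    private int _value;

    public FencedInt32(int value) { _value = value; }

    // Acquire: later memory operations may not move before this read.
    public int ReadAcquireFence() { return Volatile.Read(ref _value); }

    // Assumed meaning of "full fence": a full barrier before and after the read.
    public int ReadFullFence()
    {
        Interlocked.MemoryBarrier();
        int value = _value;
        Interlocked.MemoryBarrier();
        return value;
    }

    // Plain read: no ordering requested.
    public int ReadUnfenced() { return _value; }

    // Release: earlier memory operations may not move after this write.
    public void WriteReleaseFence(int newValue) { Volatile.Write(ref _value, newValue); }

    public void WriteFullFence(int newValue)
    {
        Interlocked.MemoryBarrier();
        _value = newValue;
        Interlocked.MemoryBarrier();
    }

    // Plain write: no ordering requested.
    public void WriteUnfenced(int newValue) { _value = newValue; }

    // Returns the value observed before the exchange, like Interlocked.CompareExchange.
    public int AtomicCompareExchange(int newValue, int comparand)
    {
        return Interlocked.CompareExchange(ref _value, newValue, comparand);
    }

    // Returns the previous value.
    public int AtomicExchange(int newValue) { return Interlocked.Exchange(ref _value, newValue); }

    // These three return the updated value.
    public int AtomicAdd(int delta) { return Interlocked.Add(ref _value, delta); }
    public int AtomicIncrement() { return Interlocked.Increment(ref _value); }
    public int AtomicDecrement() { return Interlocked.Decrement(ref _value); }
}
```

Because this is a mutable struct, it only behaves as intended when it lives in a non-readonly field or local and is never copied; calls made through a readonly field would operate on a temporary copy.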
With the above definition, we can very clearly see the fences involved in doing, say, double-checked locking:
static Volatile.Reference<MySingleton> s_instance;
public static MySingleton Instance
{
    get
    {
        MySingleton instance = s_instance.ReadAcquireFence();
        if (instance == null) {
            MySingleton created = new MySingleton();
            // The compare-exchange returns the value it found: null means we won the race.
            MySingleton prior = s_instance.AtomicCompareExchangeRelease(created, null);
            instance = prior ?? created;
        }
        return instance;
    }
}
We see there are two fences. One is an acquire and, depending on what your memory model says about data dependence, is probably unnecessary. Most sane memory models guarantee that data-dependent loads are not reordered ahead of the loads they depend on. So we needn’t worry that we’ll see a non-null s_instance whose contents haven’t been initialized. (If we were talking structs, it’d be another story.) Nevertheless, it’s definitely required that we use a release-only fence for the publication of the object. This guarantees writes to fields within the MySingleton constructor have completed prior to the write of the new object to the shared instance field. The point here is that you are forced to think about the fences, and you actually see them.
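For comparison, the same publication pattern can be sketched against primitives that ship in .NET today. This assumes .NET 4.5 or later for Volatile.Read, and note that Interlocked.CompareExchange is a full fence, which is stronger than the release-only fence the text calls for; it is a sketch of the pattern, not a drop-in for the hypothetical Volatile.Reference type. Because compare-exchange returns the value it found rather than the value it stored, the result has to be checked to decide which object to return.

```csharp
using System.Threading;

public sealed class MySingleton
{
    private static MySingleton s_instance;

    private MySingleton() { }

    public static MySingleton Instance
    {
        get
        {
            // Acquire read of the shared field.
            MySingleton instance = Volatile.Read(ref s_instance);
            if (instance == null)
            {
                MySingleton created = new MySingleton();
                // Publish only if the field is still null. The return value is
                // whatever was there before: null means this thread won the race.
                MySingleton prior = Interlocked.CompareExchange(ref s_instance, created, null);
                instance = prior ?? created;
            }
            return instance;
        }
    }
}
```

If two threads race, both may construct an object, but only one is ever published and every caller sees that same instance; the loser's object simply becomes garbage.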
Of course, most platforms need to provide the bare minimum of fencing to assure type safety, particularly for languages like C#. My understanding is that C++0x has decided, at least for now, not to offer type-safety in the face of multithreading. That means you might publish an object and, if stores occur out-of-order, the reader could see an object partially initialized with an invalid vtable pointer. In C# and Java, the language designers have thankfully decided to shield programmers from this. The need for fences also extends to unsafe code like strings, where – were it possible for a thread to read the non-zero length before the char* pointer was valid – writes to random memory could occur and hence threaten type-safety. Thankfully, again, C# and Java protect developers from this, mostly due to the automatic zeroing of memory as the GC allocates and hands out new segments.
There are costs to offering this type safety assurance. So you can understand why the C++ designers want to keep fences out of object allocation. If you have #1 above, however, the costs are dramatically lower and more acceptable. But the world is – unfortunately – still a freethreaded one, and we have several years to go before we’ve reached the final destination. As a step forward, however, the death of volatile is a welcome one. Say it with me.
“Sayonara volatile.”
Here’s hoping that .NET 5.0 takes this step forward too.
 Sunday, October 31, 2010
I rambled on and on a few weeks back about how much performance matters. Although I got a lot of contrary feedback, most of it had more to do with my deliberately controversial title than the content of the article. I had intended to post a redux, but something more concise is on my mind lately.
GC-based memory management is a boon to productivity, not to mention program safety. Few would argue with this. However, the most effective developers know how their particular GC works, and optimize their program’s data structure, allocation, and lifetime behavior to suit their particular GC best.
This is dangerous, but a pragmatic fact of life. It is dangerous because who’s to say that the runtime team doesn’t intend to entirely revamp the GC’s collection strategy next release, at which point your thoughtfulness may actually harm you? It’s a pragmatic fact for a few reasons: it’s probably not likely that the behavior of your favorite GC is going to change too fundamentally over time; if it did, you’d need to rethink things anyhow; oh, and when was that next release anyway (2 or more years out); and finally, what do you care about more, the theoretical loose coupling, or real results today?
One of the worst data structures for traversal is a linked list. That’s because its contents are fetched by pointer-chasing, an act that usually destroys locality, unless the data associated with each pointer was carefully constructed to live next to its previous and next pointer’s data. This seldom happens, because the main point of a linked list is to free you from such constraints.
One of the best data structures for traversal, on the other hand, is an array. Adjacent elements are truly adjacent to one another in memory, meaning that as you fetch the i’th element from memory, you’re probably also pulling in the i+1’th, i+2’th, …, and so on, thanks to spatial locality. Of course, if the elements are just pointers, then you're back to the chasing game; as with anything, it depends.
How many elements you prefetch of course depends on the size of the elements with respect to your processor’s cache line. If you’re working with 8-byte elements, and 128-byte cache lines, then you may pay 100 cycles for the first fetch, and then amortize that cost over the subsequent 15 found cheaply in cache for 10 cycles. The result is about 250 cycles total; compare this to a linked list, where you’ll probably spend 1,600 cycles, or more than 6X the cost. And of course you’re trashing other data in the cache in the process. As you traverse more and more of the list, the numbers snowball, and the amortization of locality, or lack thereof, provides a stark contrast.
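The arithmetic behind those figures can be spelled out in a few lines. The cycle counts below are just the rough, illustrative numbers from the paragraph above, and the class and member names are made up for this sketch; it also assumes, as the text does, that every linked-list node costs a full cache miss.

```csharp
public static class TraversalCost
{
    // Illustrative figures taken from the text, not measurements.
    public const int CacheLineBytes = 128;
    public const int ElementBytes = 8;
    public const int MissCycles = 100;
    public const int HitCycles = 10;

    // 128 / 8 = 16 elements share one cache line.
    public static int ElementsPerLine
    {
        get { return CacheLineBytes / ElementBytes; }
    }

    // Array: one miss pulls in the line, the remaining 15 elements hit in cache.
    public static int ArrayCyclesPerLine()
    {
        return MissCycles + (ElementsPerLine - 1) * HitCycles;
    }

    // Linked list: assume each of the same 16 elements is its own miss.
    public static int ListCyclesForSameElements()
    {
        return ElementsPerLine * MissCycles;
    }
}
```

That is 100 + 15 × 10 = 250 cycles for the array against 16 × 100 = 1,600 for the list, a ratio of 6.4.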
There’s another subtle reason why this is important. Stop and think about what happens when a GC occurs.
Yep, that’s right. Your data structures need to be traversed during a GC, after all, to ascertain the liveness of any pointers held within. That scan looks a whole heck of a lot like the same traversal I just described, and enjoys the same locality properties. So we can immediately conclude that data structures whose traversal is efficient will translate into less time spent in the GC chasing pointers, and better cache efficiency.
For programs that are sensitive to long pause times, this is huge. I talk to customers all the time whose programs are sensitive to microsecond-long GC delays, and – aside from ensuring good GC lifetime practices, like ensuring all objects either die young or live long – being conscious about locality can be immensely important. Especially for any long-lived, large data structures that will be subject to Gen2 collections throughout their lifetime.
There is another useful trick to know. If a data structure contains no pointers, the GC has nothing inside it to trace. Obvious, right? A linked list inherently contains pointers, so this trick really doesn’t apply: the GC will need to traverse the whole live portion of the object graph. What’s interesting is that an array, on the other hand, may or may not contain elements that contain pointers. For example, an array of ints clearly has no pointers, whereas an array of string references clearly does. This doesn’t just apply to the primitive types, but also to custom structs, which may or may not contain references. When the GC encounters a pointer-free array, its contents need not be traversed: the array is alive, and that's that. Yet another opportunity to eliminate pointer chasing. Not only does this save the GC from doing some heavy lifting, but the pointer-free structs eliminate the need for the GC write barriers on array stores too.
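One way to check whether an element type is pointer-free is to ask the runtime directly. The sketch below assumes a runtime that provides RuntimeHelpers.IsReferenceOrContainsReferences (.NET Core 2.0 or later; it is not available on the classic .NET Framework), and the struct and helper names are invented for illustration. It only reports the type-level property; what a given GC does with that information is up to the GC.

```csharp
using System.Runtime.CompilerServices;

// Pointer-free: only primitive fields.
public struct Point3
{
    public float X, Y, Z;
}

// Not pointer-free: the string field is a reference the GC must follow.
public struct NamedPoint
{
    public string Name;
    public float X, Y;
}

public static class GcTracing
{
    // True when elements of type T are, or contain, object references,
    // meaning an array of T holds pointers that need tracing.
    public static bool ElementsContainReferences<T>()
    {
        return RuntimeHelpers.IsReferenceOrContainsReferences<T>();
    }
}
```

So an int[] or a Point3[] reports false, while a string[] or a NamedPoint[] reports true.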
So think of all this next time you’re confronted with the decision to employ a tree, graph, or linked list, and whether a dense, and perhaps pointer-free, representation could be beneficial. Even if it means you must replace pointers with index calculations. The locality benefits may not matter, but then again, they may. And at least you can knowingly make a balanced tradeoff, with these potential advantages in mind.
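As a rough sketch of what "replace pointers with index calculations" can mean, here is a singly linked list whose nodes live in one dense array of structs and whose links are array indices rather than object references. All of the names are hypothetical, removal and node reuse are left out, and whether this beats a conventional list for a given workload is something to measure.

```csharp
using System;

// A node is a pointer-free struct: the "next pointer" is just an index.
public struct IndexNode
{
    public int Value;
    public int Next; // index of the next node, or -1 at the end of the list
}

public sealed class IndexLinkedList
{
    private IndexNode[] _nodes;
    private int _count;
    private int _head = -1;

    public IndexLinkedList(int capacity)
    {
        _nodes = new IndexNode[capacity > 0 ? capacity : 4];
    }

    public int Count { get { return _count; } }

    // Push a value at the front; nodes are handed out from the array in order.
    public void AddFirst(int value)
    {
        if (_count == _nodes.Length)
        {
            Array.Resize(ref _nodes, _nodes.Length * 2);
        }
        _nodes[_count].Value = value;
        _nodes[_count].Next = _head;
        _head = _count;
        _count++;
    }

    // Follow the index links from the head, exactly as one would follow pointers.
    public long Sum()
    {
        long sum = 0;
        for (int i = _head; i != -1; i = _nodes[i].Next)
        {
            sum += _nodes[i].Value;
        }
        return sum;
    }
}
```

The whole list is a single object as far as the GC is concerned, and because IndexNode holds no references, the array's contents contain nothing to trace.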
 Saturday, September 18, 2010
I have several positions open on my team here at Microsoft.
My team's responsibility spans multiple aspects of a research operating system’s programming model. The three main areas are concurrency, languages, and frameworks. When I say concurrency, I mean things like asynchrony and message passing, data and task parallelism, distributed parallelism, runtime scheduling and resource management, and heterogeneity and GPGPU. When I say languages, I mean type systems, mostly-functional programming, verified safe concurrency, and both front- and back-end compilation. And when I say frameworks, I mean virtually anything you could imagine wanting out of a platform framework: all things XML, data interoperability (database, web services, etc.), collections, transactions, multimaster synchronization, and even low level things, like regex, numerics, and globalization.
Our team is 100% developers, and we have an “everybody codes, everybody loves to code” culture. Even managers are expected to spend a significant amount of time prolifically writing code.
All of these components are new and built from the ground up. So self-drive and an ability to have a vision and make it happen are incredibly important.
We are always happy to hire great, hard-working people, regardless of years of experience. If you’re extremely strong in one or more of the abovementioned areas, more of a generalist, are an amazing coder, or all of the above, you’d fit in perfectly. This is the most amazing team of people I’ve ever worked with. If you are interested, please email your resume to me at joedu AT microsoft DOT com.
 Monday, September 06, 2010
I can’t tell you how many times I’ve heard the age-old adage echoed inappropriately and out of context:
"We should forget about small efficiencies, say about 97% of the time; premature optimization is the root of all evil" -- Donald E. Knuth, Structured Programming with go to Statements
I have heard the "premature optimization is the root of all evil" statement used by programmers of varying experience at every stage of the software lifecycle, to defend all sorts of choices, ranging from poor architectures, to gratuitous memory allocations, to inappropriate choices of data structures and algorithms, to complete disregard for variable latency in latency-sensitive situations, among others.
Mostly this quip is used to defend sloppy decision-making, or to justify the indefinite deferral of decision-making. In other words, laziness. It is safe to say that the very mention of this oft-misquoted phrase causes an immediate visceral reaction to commence within me... and it’s not a pleasant one.
In this short article, we’ll look at some important principles that are counter to what many people erroneously believe this statement to be saying. To save you time and suspense, I will summarize the main conclusions: I do not advocate contorting oneself in order to achieve a perceived minor performance gain. Even the best performance architects, when following their intuition, are wrong 9 times out of 10 about what matters. (Or maybe 97 times out of 100, based on Knuth’s numbers.) What I do advocate is thoughtful and intentional performance tradeoffs being made as every line of code is written. Always understand the order of magnitude that matters, why it matters, and where it matters. And measure regularly! I am a big believer in statistics, so if a programmer sitting in his or her office writing code thinks just a little bit more about the performance implications of every line of code that is written, he or she will save an entire team that time and then some down the road. Given the choice between two ways of writing a line of code, both with similar readability, writability, and maintainability properties, and yet interestingly different performance profiles, don’t be a bozo: choose the performant approach. Eschew redundant work, and poorly written code. And lastly, avoid gratuitously abstract, generalized, and allocation-heavy code, when slimmer, more precise code will do the trick.
Follow these suggestions and your code will just about always win in both maintainability and performance.
Understand the order of magnitude that matters
First and foremost, you really ought to understand what order of magnitude matters for each line of code you write.
In other words, you need to have a budget; what can you afford, and where can you afford it? The answer here changes dramatically depending on whether you’re writing a device driver, reusable framework library, UI control, highly-connected network application, installation script, etc. No single answer fits all.
I am personally used to writing code where 100 CPU cycles matters. So invoking a function that acquires a lock by way of a shared-memory interlocked instruction that may take 100 cycles is something I am apt to think hard about; even more worrisome is if that acquisition could block waiting for 100,000 cycles. Indeed this situation could become disastrous under load. As you can tell, I write a lot of systems code. If you’re working on a network-intensive application, on the other hand, most of the code you write is going to be impervious to 100 cycle blips, and more sensitive to efficient network utilization, scalability, and end-to-end performance. And if you’re writing a little one-time script, or some testing or debugging program, you may get away with ignoring performance altogether, even multi-million cycle network round-trips.
To be successful at this, you’ll need to know what things cost. If you don’t know what things cost, you’re just flailing in the dark, hoping to get lucky. This includes rule-of-thumb orders of magnitude for primitive operations – e.g. reading / writing a register (nanoseconds, single-digit cycles), a cache hit (nanoseconds, tens of cycles), a cache miss to main memory (nanoseconds, hundreds of cycles), a disk access including page faults (micro- or milliseconds, millions of cycles), and a network roundtrip (milliseconds or seconds, many millions of cycles) – in addition to peering beneath opaque abstractions provided by other programmers, to understand their best, average, and worst case performance.
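To relate cycle counts like these to wall-clock time, a tiny helper is enough. The 2 GHz clock rate below is purely an assumption for illustration, and the class and method names are made up; real costs vary by machine and should be measured.

```csharp
public static class CycleBudget
{
    // Assumption for illustration only: a fixed 2 GHz clock, i.e. 2 cycles per nanosecond.
    public const double AssumedCyclesPerNanosecond = 2.0;

    // Convert an estimated cycle count into nanoseconds at the assumed clock rate.
    public static double CyclesToNanoseconds(long cycles)
    {
        return cycles / AssumedCyclesPerNanosecond;
    }

    // True when an operation's estimated cost fits within what the caller can afford.
    public static bool FitsBudget(long costCycles, long budgetCycles)
    {
        return costCycles <= budgetCycles;
    }
}
```

At that assumed rate, a 100-cycle interlocked operation is about 50 nanoseconds, while blocking for 100,000 cycles is about 50 microseconds, three orders of magnitude more.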
Clearly the concerns and situations you must work to avoid change quite substantially depending on the class of code you are writing, and whether the main function of your program is delivering a user experience (where usability reigns supreme), delivering server-side throughput, etc. Thinking this through is crucial, because it helps avoid true "premature optimization" traps where a programmer ends up writing complicated and convoluted code to save 10 cycles, when he or she really needs to be thinking about architecting the interaction with the network more thoughtfully to asynchronously overlap round-trips. Understanding how performance impacts the main function of your program drives all else.
Pay attention to interoperability between layers of separately authored software that is composed together. The most common cause of hangs is that an API didn’t specify the expected performance, and so a UI programmer ended up using it in an innocuous but inappropriate way, because they couldn’t afford the range of order of magnitude cost that the API’s performance was expected to fall within. Hangs aren’t the only manifestation; O(N^2), or worse, performance can also result, if, for example, a caller didn’t realize the function called was going to enumerate a list in order to generate its results.
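The O(N^2) trap is easy to reproduce. In the sketch below (all names invented for illustration), a caller treats Enumerable.ElementAt as if it were a cheap indexer. On a lazily generated sequence that is not an IList, each call enumerates from the start, and a counter makes the hidden work visible.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class HiddenCost
{
    // Lazily yields 0..n-1 and records in steps[0] how many elements were produced in total.
    public static IEnumerable<int> Counted(int n, int[] steps)
    {
        for (int i = 0; i < n; i++)
        {
            steps[0]++;
            yield return i;
        }
    }

    // Looks linear, but each ElementAt call walks the sequence from the beginning.
    public static long SumByIndex(IEnumerable<int> source, int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i++)
        {
            sum += source.ElementAt(i);
        }
        return sum;
    }

    // A single pass over the same sequence.
    public static long SumByEnumeration(IEnumerable<int> source)
    {
        long sum = 0;
        foreach (int x in source)
        {
            sum += x;
        }
        return sum;
    }
}
```

For n = 100, both methods return 4,950, but the indexed version causes 5,050 elements to be produced (1 + 2 + … + 100) where the single pass produces 100. Nothing in the ElementAt signature hints at the difference.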
It is also important to think about worst case situations. What happens if that lock is held for longer than expected, because the system is under load and the scheduler is overloaded? And what if the owning thread was preempted while holding the lock, and now will not get to run again for quite some time? What happens if the network is saturated because a big news event is underway, or worse, the phone network is intermittently cutting out, the network cable has been unplugged, etc.? What about the case where, because a user has launched far too many applications at once, your memory-intensive operation that usually enjoys nice warmth and locality suddenly begins waiting for the disk on the majority of its memory accesses, due to demand paging? These things happen all the time.
In each of these situations, you can end up paying many more orders of magnitude in cost than you expected under ordinary circumstances. The lock acquisition that usually took 100 CPU cycles now takes several million cycles (as long as a network roundtrip), and the network operation that is usually measured in milliseconds is now measured in tens of seconds, as the software painfully waits for the operation to time out. And your "non-blocking" memory-intensive algorithm on the UI thread just caused a hang, because it’s paging like crazy.
You’ve experienced these problems as a user of modern software, I am sure, and it isn’t fun. An hourglass, spinning donut, unresponsive button clicks, "(Not Responding)" title bars, and bleachy white screens. An important measurement of a programmer’s worth is how well the code they write operates under the extreme and rare circumstances. Because, truth be told, when you have a large user-base, these circumstances aren’t that rare after all. This is more of a "performance in the large" thing, but it turns out that the end result is delivered as a result of many "performance in the small" decisions adding up. A developer wrote code meant to be used in a particular way, but decided what order of magnitude was reasonable based on best case, … and gave no thought to the worst case.
Using the right data structure for the job
This is motherhood and apple pie, Computer Science 101, … bad clichés abound. And yet so many programmers get this wrong, because they simply don’t give it enough thought.
One of my favorite books on the topic ("Programming Pearls") has this to say about them:
"Most programmers have seen them, and most good programmers realize they’ve written at least one. They are huge, messy, ugly programs that should have been short, clean, beautiful programs."
I’ll add one adjective to the "short, clean, beautiful" list: fast.
Data structures drive storage and access behavior, both strongly affecting the size and speed of algorithms and components that make use of them. Worst case really does matter. This too is a case where the right choice will boost not only performance but also the cleanliness of the program.
I’m actually not going to spend too much time on this; when I said this is CS101, I meant it. However, it is crucial to be intentional and smart in this choice. Validate assumptions, and measure.
Ironically, in my experience, many senior programmers can make frighteningly bad data structure choices, often because they are more apt to choose a sophisticated and yet woefully inappropriate one. They may choose a linked list, for example, because they want zero-allocation element linking via an embedded next pointer. And yet they then end up with many list traversals throughout the program, where a dense array representation would have been well worth the extra allocation. The naïve programmer would have happily new’d up a List<T>, and avoided some common pitfalls; yet, here the senior guy is working as hard as humanly possible to avoid a single extra allocation. They over-optimized in one dimension, and ignored many others that mattered more.
This same class of programmer may choose a very complicated lock-free data structure for sharing elements between threads, incurring many more object allocations (and thus increased GC pressure), and a large number of expensive interlocked operations scattered throughout the code. The sexy lure of lock-freedom tricked them into making a bad choice. Perhaps they didn’t quite understand that locks and lock-free data structures share many costs in common. Or perhaps they just hoped to get lucky and squeeze out out-of-this-world scalability thanks to lock-freedom, without actually considering the access patterns necessary to lead to such scalability and whether their program employed them.
These are often held up as examples of "premature optimization", but I hold them up as examples of "careless optimization". The double kicker here is that the time spent building the more complicated solution would have been better spent carefully thinking and measuring, and ultimately deciding not to be overly clever in the first place. This most often plagues the mid-range programmer, who is just smart enough to know about a vast array of techniques, but not yet mature enough to know when not to employ them.
A different, better-performing approach
It’s an all-too-common occurrence. I’ll give code review feedback, asking "Why didn’t you take approach B? It seems to be just as clear, and yet obviously has superior performance." Again, this is in a circumstance where I believe the difference matters, given the order of magnitude that matters for the code in question. And I’ll get a response, "Premature optimization is the root of all evil." At which point I want to smack this certain someone upside the head, because it’s such a lame answer.
The real answer is that the programmer didn’t stop to carefully consider alternatives before coding up solution A. (To be fair, sometimes good solutions evade the best of us.) The reality is that the alternative approach should have been taken; it may be true that it’s "too late" because the implications of the original decision were nontrivial and perhaps far-reaching, but that is too often an unfortunate consequence of not taking the care and thought to do it right in the first place.
These kinds of "peanut butter" problems add up in a hard to identify way. Your performance profiler may not obviously point out the effect of such a bad choice so that it’s staring you in the face. Rather than making one routine 1000% slower, you may have made your entire program 3% slower. Make enough of these sorts of decisions, and you will have dug yourself a hole deep enough to take a considerable percentage of the original development time just digging out. I don’t know about you, but I prefer to clean my house incrementally and regularly rather than letting garbage pile up to the ceilings, with flies buzzing around, before taking out the trash. Simply put, all great software programmers I know are proactive in writing clear, clean, and smart code. They pour love into the code they write.
In this day and age, where mobility and therefore power is king, instructions matter. My boss is fond of saying "the most performant instruction is the one you didn’t have to execute." And it’s true. The best way to save battery power on mobile phones is to execute less code to get the same job done.
To take an example of a technology that I am quite supportive of, but that makes writing inefficient code very easy, let’s look at LINQ-to-Objects. Quick, how many inefficiencies are introduced by this code?
// Requires: using System.Linq;
int[] Scale(int[] inputs, int lo, int hi, int c) {
    var results = from x in inputs
                  where (x >= lo) && (x <= hi)
                  select (x * c);
    return results.ToArray();
}
It’s hard to account for them all.
There are two delegate object allocations, one for the call to Enumerable.Where and the other for the call to Enumerable.Select. These delegates point to potentially two distinct closure objects, each of which has captured enclosing variables. These closure objects are instances of new classes, which occupy nontrivial space in both the binary and at runtime. (And of course, the arguments are now stored in two places, must be copied to the closure objects, and then we must incur extra indirections each time we access them.) In all likelihood, the Where and Select operators are going to allocate new IEnumerable and new IEnumerator objects. For each element in the input, the Where operator will make two interface method calls, one to IEnumerator.MoveNext and the other to IEnumerator.get_Current. It will then make a delegate call, which is slightly more expensive than a virtual method call on the CLR. For each element for which the Where delegate returns ‘true’, the Select operator will have likewise made two interface method calls, in addition to another delegate invocation. Oh, and the implementations of these likely use C# iterators, which produce relatively fat code, and are implemented as state machines which will incur more overhead (switch statements, state variables, etc.) than a hand-written implementation.
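To make the hidden allocations concrete, here is roughly what a compiler has to synthesize for the query, written out by hand. This is a sketch, not actual compiler output: the names ScaleClosure and Desugared are invented (real compilers generate unspeakable names), and it uses a single closure object shared by both lambdas, whereas the text allows for potentially two.

```csharp
using System;
using System.Linq;

// Stand-in for the compiler-generated closure class that captures lo, hi, and c.
public sealed class ScaleClosure
{
    public int lo, hi, c;

    public bool InRange(int x) { return (x >= lo) && (x <= hi); }
    public int Multiply(int x) { return x * c; }
}

public static class Desugared
{
    public static int[] Scale(int[] inputs, int lo, int hi, int c)
    {
        // Heap allocation: the arguments are copied into the closure object.
        var closure = new ScaleClosure { lo = lo, hi = hi, c = c };

        // Two delegate allocations, one per operator.
        Func<int, bool> predicate = closure.InRange;
        Func<int, int> selector = closure.Multiply;

        // Where and Select each return a new enumerable object; ToArray drains them.
        return inputs.Where(predicate).Select(selector).ToArray();
    }
}
```

Every access to lo, hi, or c inside the lambdas now goes through the closure object, which is the extra indirection the paragraph above describes.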
Wow. And we aren’t even done yet. The ToArray method doesn’t know the size of the output, so it must make many allocations. It’ll start with 4 elements, and then keep doubling and copying elements as necessary. And we may end up with excess storage. If we end up with 33,000 elements, for example, we will waste about 128KB of dynamic storage (32,000 X 4-byte ints).
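The waste figure follows directly from the growth policy. The sketch below assumes exactly the strategy described above (start at 4 elements, double until the data fits); actual ToArray implementations differ across framework versions, and the class and method names are made up.

```csharp
public static class GrowthWaste
{
    // Smallest capacity reachable by starting at 'initial' and doubling
    // that can hold 'count' elements.
    public static long CapacityFor(int count, int initial = 4)
    {
        long capacity = initial;
        while (capacity < count)
        {
            capacity *= 2;
        }
        return capacity;
    }

    // Bytes allocated in the final buffer but never filled.
    public static long WastedBytes(int count, int elementBytes)
    {
        return (CapacityFor(count) - count) * elementBytes;
    }
}
```

For 33,000 ints the buffer grows to 65,536 elements, leaving 32,536 unused slots, or 130,144 bytes: the "about 128KB" quoted above. That is before counting the smaller buffers discarded along the way.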
A programmer may have written this routine this way because he or she has recently discovered LINQ, or has heard of the benefits of writing code declaratively. And/or he or she may have decided to introduce a more general purpose implementation of a Scale API versus doing something specific to the use in the particular program that Scale will be immediately used in. This is a great example of why premature generalization is often at odds with writing efficient code.
Imagine an alternative universe, on the other hand, where Scale will only get used once and therefore we can take advantage of certain properties of its usage. Namely, perhaps the input array need not be preserved, and instead we can update the elements matching the criteria in place:
void ScaleInPlace(int[] inputs, int lo, int hi, int c) {
    for (int i = 0; i < inputs.Length; i++) {
        // Non-short-circuiting & evaluates both comparisons unconditionally.
        if ((inputs[i] >= lo) & (inputs[i] <= hi)) {
            inputs[i] *= c;
        }
    }
}
A quick-and-dirty benchmark shows this to be an order of magnitude faster. Again, is it an order of magnitude that you care about? Perhaps not. See my earlier thoughts on that particular topic. But if you care about costs in the 100s or 1000s of cycles range, you probably want to pay heed.
Now, I’m not trying to take potshots at LINQ. It was just an example. In fact, I spent 3 years running a team that delivered PLINQ, a parallel execution engine for LINQ-to-Objects. LINQ is great where you can afford it, and/or where the alternatives do not offer ridiculously better performance. For example, if you can’t do in-place updates, functionally producing new data is going to require allocations no matter which way you slice it. But having watched people using PLINQ, I have witnessed numerous occasions where an inordinately expensive query was made 8 times faster by parallelizing… where the trivial refactoring into a slimmed down algorithm with proper data structures would have sped the code up by 100-fold. Parallelizing a piggy piece of code to make it faster merely uses more of the machine to get the same job done, and will negatively impact power, resource management, and utilization.
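For the case where the input must be preserved, one possible middle ground is a hand-written loop that still allocates, but only once. The sketch below is an assumption about what such a "slimmed down" version might look like, not a measured recommendation; the class name is invented, and it trades a second pass over the input for an exactly sized result array.

```csharp
public static class HandWritten
{
    public static int[] Scale(int[] inputs, int lo, int hi, int c)
    {
        // Pass 1: count the matches so the result can be allocated once, at its exact size.
        int count = 0;
        for (int i = 0; i < inputs.Length; i++)
        {
            if (inputs[i] >= lo && inputs[i] <= hi)
            {
                count++;
            }
        }

        // Pass 2: fill the result; the input array is left untouched.
        int[] results = new int[count];
        int next = 0;
        for (int i = 0; i < inputs.Length; i++)
        {
            int x = inputs[i];
            if (x >= lo && x <= hi)
            {
                results[next++] = x * c;
            }
        }
        return results;
    }
}
```

There are no delegates, closures, enumerators, or growing buffers here, just one array allocation. Whether the extra pass pays for itself depends on input size and cache behavior, so measure before committing to it.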
Another view is that writing code in this declarative manner is better, because it’ll just get faster as the compiler and runtimes enjoy new optimizations. This sounds nice, and seems like taking a high road of some sort. But what usually matters is today: how does the code perform using the latest and greatest technology available today. And if you scratch underneath the surface, it turns out that most of these optimizations are what I call "science fiction" and unlikely to happen. If you write redundant asserts at twenty adjacent layers of your program, well, you’re probably going to pay for them. If you allocate objects like they are cheap apples growing on trees, you’re going to pay for them. True, optimizations might make things faster over time, but usually not in the way you expect and usually not by orders of magnitude unless you are lucky.
A colleague of mine used to call C a WYWIWYG language ("what you write is what you get"), wherein each line of code roughly mapped one-to-one, in a self-evident way, with a corresponding handful of assembly instructions. This is a stark contrast to C#, wherein a single line of code can allocate many objects and have an impact on the surrounding code by introducing numerous silent indirections. For this reason alone, understanding what things cost and paying attention to them is admittedly more difficult – and arguably more important – in C# than it was back in the good ole’ C days. ILDASM is your friend … as is the disassembler. Yes, good systems programmers regularly look at the assembly code generated by the .NET JIT. Don’t assume it generates the code you think it does.
Gratuitous memory allocations
I love C#. I really do. I was reading my "Secure Coding in C and C++" book for fun this weekend, and it reminded me how many of those security vulnerabilities are eliminated by construction thanks to type- and memory-safety.
But the one thing I don’t love is how easy and invisible it makes heap allocations.
The very fact that C++ memory management is painful means most C++ programmers are overly conscious about allocation and footprint. Having to opt in to using pointers means developers are conscious about indirections, rather than having them everywhere by default. These are qualitative, hard-to-back-up statements, but in my experience they are quite true. It’s also cultural.
Because they are so easy, it’s doubly important to be on the lookout for allocations in C#. Each one adds to a hard-to-quantify debt that must be repaid later on when a GC subsequently scans the heap looking for garbage. An API may appear cheap to invoke, but it may have allocated an inordinate amount of memory whose cost is only paid for down the line. This is certainly not "paying it forward."
It’s never been so easy to read a GB worth of data into memory and then haphazardly leave it hanging around for the GC to take care of at some indeterminate point in the future as it is in .NET. Or an entire list of said data. Too many times a .NET program holds onto GBs of working set, when a more memory conscientious approach would have been to devise an incremental loading strategy, employ a denser representation of this data, or some combination of the two. But, hey, memory is plentiful and cheap! And in the worst case, paging to disk is free! Right? Wrong. Think about the worst case.
Depending on the size of the objects allocated, how long they remain live, and how many processors are being used, the subsequent GCs necessary to clean up incessant allocations may impact a program in a difficult to predict way. Allocating a bunch of very large objects that live for long enough to make it out of the nursery, but not forever, for instance, is one of the worst evils you can do. This is known as mid-life crisis. You either want really short-lived objects or really long-lived ones. But in any case, it really matters: the LINQ example earlier shows how easy it is to allocate crap without seeing it in the code.
If I could do it all over again, I would make some changes to C#. I would try to keep pointers, and merely not allow you to free them. Indirections would be explicit. The reference type versus value type distinction would not exist; whether something was a reference would work like C++, i.e. you get to decide. Things get tricky when you start allocating things on the stack, because of the lifetime issues, so we’d probably only support stack allocation for a subset of primitive types. (Or we’d employ conservative escape analysis.) Anyway, the point here is to illustrate that in such a world, you’d be more conscious about how data is laid out in memory, encouraging dense representations over sparse and pointer rich data structures silently spreading all over the place. We don't live in this world, so pretend as though we do; each time you see a reference to an object, think "indirection!" to yourself and react as you would in C++ when you see a pointer dereference.
Allocations are not always bad, of course. It’s easy to become paranoid here. You need to understand your memory manager to know for sure. Most GC-based systems, for example, are heavily tuned to make successive small object allocations very fast. So if you’re programming in C#, you’ll get less "bang for your buck" by fusing a large number of contiguous object allocations into a small number of large object allocations that ultimately occupy an equivalent amount of space, particularly if those objects are short-lived. Lots of little garbage tends to be pretty painless, at least relatively.
Variable latency and asynchrony
There aren’t many ways to introduce a multisecond delay into your program at a moment’s notice. But I/O can do just that.
Code with highly variable latency is dangerous, because it can have dramatically different performance characteristics depending on numerous variables, many of which are driven by environmental conditions outside of your program’s control. As such it is immensely important to document where such variable latency can occur, and to program defensively against it happening.
For example, imagine a team of twenty programmers building some desktop application. The team is just large enough that no one person can understand the full details of how the entire system works. So you’ve got to compose many pieces together. (As I mentioned earlier, composition can lead to hard-to-predict performance characteristics.) Programmer Alice is responsible for serving up a list of fonts, and Programmer Bob is consuming that list to paint it on the UI. Does Bob know what it takes to fetch the list of fonts? Probably not. Does Alice know the full set of concerns that Bob must think about to deliver a responsive UI, like incremental repaints, progress reporting, and cancellation? Probably not. So Alice does the best she knows how to do: she hits the cache, when the font cache is fully populated, and falls back to fetching fonts from the printer otherwise. She returns a List<Font> object from her API. Now Bob just calls the API and paints the results on the UI; the call appears to be quite snappy in his testing. Unfortunately, when the cache is not populated, the "quick" cache hit turns into a series of network roundtrips with the printer; that hangs the UI, but only for a few milliseconds. Even worse, when the printer is accidentally offline, perhaps due to a power outage, the UI freezes for twenty seconds, because that’s the hard-coded timeout. Ouch!
This situation happens all the time. It’s one of the most common causes of UI hangs.
If you’re programming in an environment where asynchrony is first class, Alice could have advertised the fact that, under worst-case circumstances, fetching the font list would take some time. If she were programming in .NET, for example, she’d return a Task<List<Font>> rather than a List<Font>. The API would then be self-documenting, and Bob would know that waiting for the task’s result is dangerous business. He knows, of course, that blocking the UI thread often leads to responsiveness problems. So he would instead use the ContinueWith API to rendezvous with the results once they become available. And Bob may now know he needs to go back and work more closely with Alice on this interface: to ensure cancellation is wired up correctly, and to design a richer interface that facilitates incremental painting and progress reporting.
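A minimal sketch of that shape follows. FontService, FontPicker, the font names, and the 100 ms delay are all hypothetical, and fonts are modeled as plain strings to keep the example self-contained; it is meant to illustrate the Task-returning signature and the ContinueWith rendezvous, not any particular UI framework.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class FontService {
    // Alice's API: the Task<> return type advertises that this may take a while.
    public static Task<List<string>> GetFontsAsync(CancellationToken cancel) {
        return Task.Run(() => {
            cancel.ThrowIfCancellationRequested();
            Thread.Sleep(100); // stand-in for a slow printer roundtrip
            return new List<string> { "Consolas", "Georgia", "Verdana" };
        }, cancel);
    }
}

static class FontPicker {
    // Bob's code: rendezvous with the result instead of blocking on it.
    public static Task ShowFontsAsync(Action<string> paint, CancellationToken cancel) {
        return FontService.GetFontsAsync(cancel).ContinueWith(t => {
            if (t.IsCanceled) { paint("(cancelled)"); return; }
            if (t.IsFaulted) { paint("(fonts unavailable)"); return; }
            foreach (string font in t.Result) paint(font);
        });
    }

    public static void Main() {
        Task done = ShowFontsAsync(Console.WriteLine, CancellationToken.None);
        Console.WriteLine("Caller stays free while fonts load...");
        done.Wait(); // only so the console demo does not exit early
    }
}
```

In a real UI, the continuation would typically be scheduled back onto the UI thread, for example by passing TaskScheduler.FromCurrentSynchronizationContext() to ContinueWith.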
Variable latency is not just problematic for responsiveness reasons. If I/O is expressed synchronously, a program cannot efficiently overlap many such operations. Imagine we must make three network calls as part of completing an operation, and that each one will take 25 milliseconds to complete. If we do synchronous I/O, the whole operation will take at least 75 milliseconds. If we do asynchronous I/O, on the other hand, the operation may take as few as 25 milliseconds to finish. That’s huge.
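The arithmetic can be seen in a small sketch. It assumes a C# compiler with async/await and uses Task.Delay as a stand-in for a 25 ms network call; OverlapDemo and FakeCallAsync are hypothetical names.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

static class OverlapDemo {
    // Stand-in for a network call that takes about 25 ms.
    public static async Task<int> FakeCallAsync(int id) {
        await Task.Delay(25);
        return id;
    }

    // One after another: total latency is roughly the sum of the three calls.
    public static async Task<int> SequentialAsync() {
        int sum = 0;
        for (int id = 1; id <= 3; id++) sum += await FakeCallAsync(id);
        return sum;
    }

    // All three in flight at once: total latency is roughly the slowest call.
    public static async Task<int> OverlappedAsync() {
        Task<int>[] calls = { FakeCallAsync(1), FakeCallAsync(2), FakeCallAsync(3) };
        int[] results = await Task.WhenAll(calls);
        return results.Sum();
    }

    public static void Main() {
        Stopwatch sw = Stopwatch.StartNew();
        int a = SequentialAsync().Result;
        Console.WriteLine("sequential: sum={0}, {1} ms", a, sw.ElapsedMilliseconds);
        sw.Restart();
        int b = OverlappedAsync().Result;
        Console.WriteLine("overlapped: sum={0}, {1} ms", b, sw.ElapsedMilliseconds);
    }
}
```

Timer granularity means the measured times will not be exactly 75 and 25 milliseconds, but the ratio should be apparent.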
If I had my druthers, all I/O would be asynchronous. But that’s not where we are today.
The concern is not limited to just I/O, of course. Compute- and memory-bound work can quickly turn into variable latency work, particularly under stressful situations like when an application is paging. So truthfully any abstraction doing "heavy lifting" should offer an asynchronous alternative.
Examples of "bad" optimizations
It is easy to take it too far. Even if you’re shaving off cycles where each-and-every cycle matters, you may be doing the wrong thing. If anything, I hope this article has convinced you to be thoughtful, and to strive to strike a healthy balance between beautiful code and performance.
Anytime the optimization sacrifices maintainability, it is highly suspect. Indeed, many such optimizations are superficial and may not actually improve the resulting code’s performance.
The worst category of optimization is one that can lead to brittle and insecure code.
One example of this is heavy reliance on stack allocation. In C and C++, doing stack allocation of buffers often leads to difficult choices, like fixing the buffer size and writing to it in place. There is perhaps no single technique that, over the years, has led to more buffer overrun-based exploits. Not only that, but stack overflow in Windows programs is quite catastrophic, and increases in likelihood the more stack allocation that a program does. So doing _alloca in C++ or stackalloc in C# is really just playing with fire, particularly for dynamically sized and potentially big allocations.
Another example is using unsafe code in C#. I can’t tell you how many times I’ve seen programmers employ unsafe pointer arithmetic to avoid the automatic bounds checking generated by the CLR JIT compiler. It is true that in some circumstances this can be a win. But it is also true that most programmers who do this never bothered to crack open the resulting assembly to see that the JIT compiler does a fairly decent job at automatic bounds check hoisting. This is an example where the cost of the optimization outweighs the benefits in most circumstances. The cost to pin memory, the risk of heap corruption due to a failure to properly pin memory or an offset error, and the complication in the code, are all just not worth it. Unless you really have actually measured and found the routine to be a problem.
If it smells way too complicated, it probably is.
In conclusion
I’m not saying Knuth didn’t have a good point. He did. But the "premature optimization is the root of all evil" pop-culture and witty statement is not a license to ignore performance altogether. It’s not justification to be sloppy about writing code. Proactive attention to performance is priceless, especially for certain kinds of product- and systems-level software.
My hope is that this article has helped to instill a better sense, or reinforce an existing good sense, for what matters, where it matters, and why it matters when it comes to performance. Before you write a line of code, you really need to have a budget for what you can afford; and, as you write your code, you must know what that code costs, and keep a constant mental tally of how much of that budget has been spent. Don’t exceed the budget, and most certainly do not ignore it and just hope that wishful thinking will save your behind. Building up this debt will cost you down the road, I promise. And ultimately, test-driven development works for performance too; you will at least know right away once you have exceeded your budget.
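One way to make such a budget executable is a small timing assertion that a test can call. The sketch below is only an illustration: PerfBudget, AssertWithin, and the 500 ms figure are hypothetical, and a wall-clock check like this is noisy on a loaded machine, so real budgets usually need some headroom.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class PerfBudget {
    // Runs the action once to warm up, then throws if a timed run exceeds the budget.
    public static TimeSpan AssertWithin(TimeSpan budget, Action action) {
        action(); // warm-up: exclude JIT and first-touch costs
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        if (sw.Elapsed > budget) {
            throw new InvalidOperationException(
                "Budget of " + budget.TotalMilliseconds + " ms exceeded: took " +
                sw.Elapsed.TotalMilliseconds + " ms");
        }
        return sw.Elapsed;
    }

    public static void Main() {
        // Hypothetical budget: summing a million ints must stay under 500 ms.
        int[] data = new int[1000000];
        long sum = 0;
        AssertWithin(TimeSpan.FromMilliseconds(500), () => {
            for (int i = 0; i < data.Length; i++) sum += data[i];
        });
        Console.WriteLine("within budget (sum={0})", sum);
    }
}
```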
Think about worst case performance. It’s not the only thing that matters, but particularly when the best and worst case differ by an order of magnitude, you will probably need to think more holistically about the composition of caller and callee while building a larger program out of constituent parts.
And lastly, the productivity and safety gains of managed code, thanks to nice abstractions, type- and memory-safety, and automatic memory management, do not have to come at the expense of performance. Indeed this is a stereotype that performance conscious programmers are in a position to break down. All you need to do is slow down and be thoughtful about each and every line of code you write. Remember, programming is as much engineering as it is an art. Measure, measure, measure; and, of course, be thoughtful, intentional, and responsible in crafting beautiful and performant code.
 Sunday, July 11, 2010
That immutability facilitates increased degrees of concurrency is an oft-cited dictum. But is it true? And either way, why?
My view on this matter may be a controversial one. Immutability is an important foundational tool in the toolkit for building concurrent – in addition to reliable and predictable – software. But it is not the only one that matters. Making all your data immutable isn’t going to instantly lead to a massively scalable program. Natural isolation is also critically important, perhaps more so. And, as it turns out, sometimes mutability is just what the doctor ordered, as with large-scale data parallelism.
Isolation first; immutability second; synchronization last
Stepping back for a moment, the recipe for concurrency is rather simple. Say you’ve got multiple concurrent pieces of work running simultaneously (or have a goal of getting there); for discussion’s sake, call them tasks. Take two tasks. The first critical decision has two cases: either these tasks concurrently access overlapping data in shared-memory, or they do not. If they do not, they are isolated, and no precautions associated with racing memory updates are needed. If they do share data, on the other hand, then something else must give. If all concurrently accessed data is immutable, or all functions used to interact with data are pure, then dangerous concurrency hazards are avoided. All is well. If some data is mutable, however, then this is where things get tricky, and higher-level synchronization is needed to make accesses safe. This decision tree is straightforward and clear.
I have listed those four attributes – isolated, immutable and pure, and synchronization – in a very intentional order. Thankfully, this order mirrors the natural top-down hierarchical architecture of most modern object- and component-based programs: we have large containers that communicate through well-defined interfaces, each comprised of layers of such containers, and somewhere towards the leaves, a fair amount of intimate commingling of knowledge regarding data and invariants.
This order also reflects the order of complexity and execution-time costs, from least to most. Isolation is simple, because components depend on each other in loosely-coupled ways, and in fact scales superiorly in a concurrent program because no synchronization is necessary, the “right” data structure may be chosen for the job – immutable or not – and locality is part-and-parcel to the architecture. Immutability at least avoids the morass of synchronization, which can affect programs immensely in complexity, runtime overheads, and write-contention for shared data. It is clear that synchronization is something to avoid at all costs, particularly anything done in an ad-hoc manner like locks.
Making the concurrency
But where did all this concurrency come from, anyway?
It came from two things:
- The coarse-grained breakdown of a program into isolated pieces.
- The fine-grained data parallelism.
On #1: Program fragments that are isolated are already half-way down the road to running concurrently as tasks. The second half of this journey, of course, is teaching them to interact with one another asynchronously, most frequently through message-passing or by sticking them into a pipeline. The details of course depend on what programming language you are using. It may be through agents, actors, active objects, COM objects, EJBs, CCR receivers, web-services, something ad-hoc built with .NET tasks, or some other reification. Nevertheless the isolation is common to all these.
On #2: Data parallelism, it turns out, often works best with mutable data structures. These structures must be partitionable, of course, so that tasks comprising the data parallel operations may operate with logically isolated chunks of this data safely, even if they are parts of the same physical data structure. So chunks of them are isolated even though they don’t appear to be. This is trivially achievable with many important parallel-friendly data structures like arrays, vectors, and matrices. Capturing this isolation in the type system is of course no small task, though region typing gets close (see UIUC’s Data Parallel Java).
But you usually don’t want these structures to be immutable, because they can be modified in constant time and space if they are in their classic simple mutable forms. Programmers doing HPC-style data-parallelism a la FORTRAN, vectorization, and GPGPU know this quite well. Compare this to a world where we are doing data-parallelism over immutable data structures, where modifications often necessitate allocations or more complicated big-oh times due to clever techniques meant to avoid such allocations, as with persistent immutable data structures. This is likely less ideal. It is true that some data parallel operations are not in-place against mutable data – as with PLINQ – at which point purity, but not immutability, is key. The two are related but not identical: immutability pervades the construction of data structures, whereas purity pervades the construction of functions. But if you can get by with one copy of the data, why not do it? Particularly since most datasets amenable to parallel speedups are quite large.
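As a small illustration of logically isolated chunks within one physical structure, here is a sketch using Parallel.For over a mutable array. InPlaceParallel and Scale are hypothetical names; the point is only that each iteration owns exactly one element, so no synchronization and no second copy of the data are needed.

```csharp
using System;
using System.Threading.Tasks;

static class InPlaceParallel {
    // Each iteration touches exactly one element, so iterations never access
    // overlapping data even though they all share the same array.
    public static void Scale(double[] data, double c) {
        Parallel.For(0, data.Length, i => {
            data[i] *= c;
        });
    }

    public static void Main() {
        double[] data = new double[1000000];
        for (int i = 0; i < data.Length; i++) data[i] = i;
        Scale(data, 2.0);
        Console.WriteLine(data[10]);  // 20
    }
}
```

Nothing in the type system enforces the isolation here; it holds only because the lambda indexes the array with its own iteration variable.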
Immutability: the bricks, not the mortar
Notice that the concurrency did not actually come from immutable data structures in either case, however. So what are they good for?
One obvious use, which has little to do with concurrency, is to enforce characteristics of particular data structures in a program. A translation lookup table may not have been meant to be written to except for initialization time, and using an immutable data structure is a wonderful way to enforce this intent.
What about concurrency? Immutable data structures facilitate sharing data amongst otherwise isolated tasks in an efficient zero-copy manner. No synchronization necessary. This is the real payoff.
For example, say we’ve got a document-editor and would like to launch a background task that does spellchecking in parallel. How will the spellchecker concurrently access the document, given that the user may continue editing it simultaneously? Likely we will use an immutable data structure to hold some interesting document state, such as storing text in a piece-table. OneNote, Visual Studio, and many other document-editors use this technique. This is zero-cost snapshot isolation.
Not having immutability in this particular scenario is immensely painful. Isolation won’t work very well. You could model the document as a task, and require the spellchecker to interact with it using messages. Chattiness would be a concern. And, worse, the spellchecker’s messages may now interleave with other messages, like a user editing the document. Those kinds of message-passing races are non-trivial to deal with. Synchronization won’t work well either. Clearly we don’t want to lock the user out of editing his or her document just because spellchecking is occurring. Such a boneheaded design is what leads to spinning donuts, bleached-white screens, and “(Not Responding)” title bars. But clearly we also don’t want to acquire a lock and then make a full copy of the entire document. Perhaps we’d try to copy just what is visible on the screen. This is a dangerous game to play.
Immutability does not solve all of the problems in this scenario, however. Snapshots of any kind lead to a subtle issue that is familiar to those with experience doing multimaster, in which multiple parties have conflicting views on what “the” data ought to be, and in which these views must be reconciled.
In this particular case, the spellchecker sends the results back to the task which spawned it, and presumably owns the document, when it has finished checking some portion of the document. Because the spellchecker was working with an immutable snapshot, however, its answer may now be out-of-date. We have turned the need to deal with message-level interleaving – as described above – into the need to deal with all of the messages that may have interleaved within a window of time. This is where multimaster techniques, such as diffing and merging come into play. Other techniques can be used, of course, like cancelling and ignoring out-of-date results. But it is clear something intentional must be done.
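The snapshot-then-reconcile pattern can be sketched roughly as follows. This is a deliberate simplification: the document is a plain immutable string with a version number rather than a piece table, and DocSnapshot, SpellResult, Editor, and the tiny word list are all hypothetical. It shows only the simplest of the reconciliation strategies mentioned above, namely ignoring out-of-date results.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Immutable snapshot of the document text; an edit produces a new snapshot.
sealed class DocSnapshot {
    public readonly int Version;
    public readonly string Text;
    public DocSnapshot(int version, string text) { Version = version; Text = text; }
    public DocSnapshot WithText(string text) { return new DocSnapshot(Version + 1, text); }
}

sealed class SpellResult {
    public readonly int Version;              // snapshot the result was computed from
    public readonly List<string> Misspelled;
    public SpellResult(int version, List<string> misspelled) {
        Version = version;
        Misspelled = misspelled;
    }
}

static class Editor {
    static readonly HashSet<string> s_dictionary =
        new HashSet<string> { "the", "quick", "brown", "fox" };

    // Runs on a background task against a snapshot; needs no locks.
    public static Task<SpellResult> CheckAsync(DocSnapshot snapshot) {
        return Task.Run(() => {
            var bad = new List<string>();
            foreach (string word in snapshot.Text.Split(' ')) {
                if (word.Length > 0 && !s_dictionary.Contains(word)) bad.Add(word);
            }
            return new SpellResult(snapshot.Version, bad);
        });
    }

    // The owner accepts a result only if no edit happened in the meantime.
    public static bool IsCurrent(SpellResult result, DocSnapshot current) {
        return result.Version == current.Version;
    }

    public static void Main() {
        DocSnapshot doc = new DocSnapshot(0, "the quikc brown fox");
        Task<SpellResult> pending = CheckAsync(doc);
        doc = doc.WithText("the quick brown fox");   // user keeps typing
        SpellResult result = pending.Result;
        Console.WriteLine(IsCurrent(result, doc) ? "apply" : "stale, recheck");
    }
}
```

A merging strategy would replace IsCurrent with logic that maps the result's positions through the intervening edits, which is considerably more work.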
In conclusion
It is safe to say that immutability facilitates important concurrent architectures and algorithms. It can really help big time, for sure. But it is clearly no panacea. Whether mutability or immutability is the right choice for a particular data structure in your program, as with all things, depends.
It could be the case that choosing a piece-table for storing your text facilitates large scales of concurrency in version two of your software application, but that in version one you have no use for it. Making that call ahead of time may pay off in spades down the road, even if it comes at a marginal cost up-front. Or it could be that choosing an immutable data structure costs you in time and space, and you never end up exploiting the fact that you could have shared that particular structure in a zero-cost way across agents in your program.
One thing’s for sure: I’m glad to be programming in languages like C#, F#, Clojure, and Scala, where I’ve got a choice.
 Thursday, July 01, 2010
In .NET today, readonly/initonly-ness is in the eye of the provider. Not the beholder.
Although both C# and the CLR verifier go to great pains to ensure you don't change a readonly/initonly field outside of its constructor (or class constructor, in the case of a static field), this guarantee doesn't imply what you might imagine. It means what it says: you can't change such fields except for in certain contexts.
If you try, C# won't let you, including forming byrefs to them:
v.cs(0,0): error CS0191: A readonly field cannot be assigned to (except in a constructor or a variable initializer)
v.cs(0,0): error CS0192: A readonly field cannot be passed ref or out (except in a constructor)
v.cs(0,0): error CS0198: A static readonly field cannot be assigned to (except in a static constructor or a variable initializer)
v.cs(0,0): error CS0199: A static readonly field cannot be passed ref or out (except in a static constructor)
And neither will the CLR verifier:
[IL]: Error: [c:\v.exe : C::Main][offset 0x00000001] Cannot change initonly field outside its .ctor.
Of course, attempting to invoke an operation on a readonly struct will make a defensive copy locally, and invoke the method against that. This ensures the readonly contents cannot change.
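The defensive copy is easy to observe. In this sketch (Counter, Holder, and DefensiveCopyDemo are hypothetical names), the same mutating method is invoked on a readonly field and on an ordinary field of the same struct type:

```csharp
using System;

struct Counter {
    public int Count;
    public void Increment() { Count++; }
}

class Holder {
    public readonly Counter Fixed;   // readonly: calls go to a temporary copy
    public Counter Mutable;          // not readonly: calls mutate the field itself

    public void IncrementBoth() {
        Fixed.Increment();    // mutates a hidden copy, which is then discarded
        Mutable.Increment();  // mutates the field in place
    }
}

static class DefensiveCopyDemo {
    public static void Main() {
        Holder h = new Holder();
        h.IncrementBoth();
        Console.WriteLine(h.Fixed.Count);    // 0
        Console.WriteLine(h.Mutable.Count);  // 1
    }
}
```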
One unfortunate hole in this safety is with unions. You do not need unsafe code to break readonly, and yet the effect is the same as with an unverifiable program that writes to a readonly field:
struct S1 {
    public readonly int X;
}
struct S2 {
    public int X;
}
[StructLayout(LayoutKind.Explicit)]
struct S3 {
    [FieldOffset(0)]
    public S1 A;
    [FieldOffset(0)]
    public S2 B;
}
Now we can change A.X via B.X, even though A.X is supposedly readonly:
S3 s3 = ...;
int x = s3.A.X;
s3.B.X++;
ASSERT(x == s3.A.X); // false; it is +1
The same would have been true even if the field S3.A was marked readonly.
This is quite an evil trick. I have to be honest that I believe this is a CIL verification hole, and should produce unverifiable MSIL much like when you try to overlay structs containing overlapping GC references. Nevertheless, it is what it is.
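For anyone who wants to see the union trick run end to end, here is a self-contained version of the fragments above. UnionDemo and Observe are hypothetical names, and the struct is default-initialized so that A.X starts at zero.

```csharp
using System;
using System.Runtime.InteropServices;

struct S1 {
    public readonly int X;
}

struct S2 {
    public int X;
}

// Explicit layout places A and B at the same address.
[StructLayout(LayoutKind.Explicit)]
struct S3 {
    [FieldOffset(0)] public S1 A;
    [FieldOffset(0)] public S2 B;
}

static class UnionDemo {
    // Returns the value of A.X before and after writing through B.X.
    public static int[] Observe() {
        S3 s3 = new S3();        // zero-initialized, so A.X starts at 0
        int before = s3.A.X;
        s3.B.X++;                // writes the same four bytes that back A.X
        return new[] { before, s3.A.X };
    }

    public static void Main() {
        int[] seen = Observe();
        Console.WriteLine("{0} -> {1}", seen[0], seen[1]);  // 0 -> 1
    }
}
```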
Let's step back. Why does all of this matter, anyway, and what guarantees were we hoping that readonly would provide?
It would be ideal, I assert, if the guarantee was not just "the target field can only be written to in the constructor", but also "the target field, once read, cannot be observed with a different value later on". This would not be true during construction, but we'd like to say it holds at all other times.
The above example throws a wrench in this idea. As does the following example. But this new example will be more disturbing, because the solution is not a simple verifier change.
What would you expect this program to print to the console?
struct S {
    public readonly int X;
    public S(int x) { X = x; }
    public void MultiplyInto(int c, out S target) {
        System.Console.WriteLine(this.X);
        target = new S(X * c);
        System.Console.WriteLine(this.X); // same? it is, after all, readonly.
    }
}
S s = new S(42);
s.MultiplyInto(10, out s);
As you may or may not have guessed, the output is "42" followed by "420". Yes, the value of 'this.X' changes after we have assigned to 'target' inside MultiplyInto, because the caller aliases the out-param with the 'this' param. Recall that the 'this' parameter of a struct method in C# is passed byref, as is an out-param, so these two references physically point to the same location when that call is made. The assignment to 'target', therefore, actually replaces the entire contents of 'this' all at once. And hence this gives the illusion that readonly fields are shifting.
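The sketch below restates the routine so that the two observations are returned rather than printed, and adds one possible callee-side mitigation: copying 'this' into a local before writing through the out-param. The extra out-params, MultiplyIntoCopied, and AliasDemo are hypothetical additions for illustration; the copy protects only this one method and does nothing for other code that caches the field.

```csharp
using System;

struct S {
    public readonly int X;
    public S(int x) { X = x; }

    // Same shape as the routine in the text, but reports what it observed.
    public void MultiplyInto(int c, out S target, out int before, out int after) {
        before = this.X;
        target = new S(X * c);   // overwrites 'this' if the caller aliased it
        after = this.X;
    }

    // Hypothetical mitigation: snapshot 'this' by value before writing to target.
    public void MultiplyIntoCopied(int c, out S target, out int before, out int after) {
        S self = this;           // copy made before any write through 'target'
        before = self.X;
        target = new S(self.X * c);
        after = self.X;
    }
}

static class AliasDemo {
    public static void Main() {
        int before, after;
        S s = new S(42);
        s.MultiplyInto(10, out s, out before, out after);
        Console.WriteLine("{0} {1}", before, after);   // 42 420

        S t = new S(42);
        t.MultiplyIntoCopied(10, out t, out before, out after);
        Console.WriteLine("{0} {1}", before, after);   // 42 42
    }
}
```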
You might be tempted to say that this can be prevented with alias analysis. But this is deceptively difficult to do. Consider this more complicated example:
class C {
    public S S;
}
void M1(C c) {
    M2(c, out c.S);
}
void M2(C c, out S s) {
    c.S.MultiplyInto(10, out s);
}
It is in no way clear inside M2 that the two aliases refer to the same location. The aliasing occurred higher up in the stack. Although byrefs are restricted to stack-only passing, making the necessary alias analysis tantalizingly close to attainable, it is nontrivial to say the least. Presumably we would have had to have blocked the forming of the byref within M1, rather than its use within M2. We could fall back to runtime checks, but that is also unfortunate for numerous reasons.
The moral of the story? Structs as containers of readonly values are not to be trusted, at least not for situations that call for bulletproof safety, such as a compiler caching values rather than rereading them on the grounds that the fields are readonly. Although C# and the CLR do a good job at verifying readonly/initonly are done right at the initialization site, there are still places where these guarantees break down. Thankfully the byref aliasing problem does not threaten thread-safety, but the union problem does. In conclusion, I do have to imagine all of this will get fixed somewhere down the road; it's just a matter of when and where.
 Sunday, June 27, 2010
Partially-constructed objects are a constant source of difficulty in object-oriented systems. These are objects whose construction has not completed in its entirety prior to another piece of code trying to use them. Such uses are often error-prone, because the object in question is likely not in a consistent state. Because this situation is comparatively rare in the wild, however, most people (safely) ignore or remain ignorant of the problem. Until one day they get bitten.
Not only are partially-constructed objects a source of consternation for everyday programmers, they are also a challenge for language designers wanting to provide guarantees around invariants, immutability and concurrency-safety, and non-nullability. We shall see examples below why this is true. The world would be better off if partially-constructed objects did not exist. Thankfully there is some interesting prior art that moves us in this direction from which to learn.
Seeing such a beast in the wild
In what situations might you see a partially-constructed object? There are two common ones in C++ and C#:
- ‘this’ is leaked out of a constructor to some code that assumes the object has been initialized.
- A failure partway through an object’s construction leads to its destructor or finalizer running against a partially-constructed object.
In the first case, the rule of thumb is “don’t do that.” This is easier said than done. The second case, on the other hand, is a fact of life, and the rule of thumb is “tread with care, and be intentional.” Let’s examine both more closely.
The evils of leaking ‘this’
Leaking ‘this’ during construction to code that expects to see a fully-initialized object is a terrible practice. Before moving on, it’s important to remember initialization order in C++ and C#: base constructors run first, and then more derived constructors. If I have E subclasses D subclasses C, then constructing an instance of E will run C’s constructor, and then D’s, and then lastly E’s. Destructors in C++, of course, run in the reverse order.
Member initializers, on the other hand, run in different orders in C++ versus C#. In C#, they run from most derived first, to least derived. So in the above situation, E’s initializers run, and then D’s, and then C’s. This happens fully before running the ad-hoc constructor code. In C++, however, member initializers run alongside the ordinary construction process. C’s member initializers run just before C’s ad-hoc construction code, and then D’s, and then E’s. Another difference is that C#’s initializers cannot access ‘this’, whereas C++’s initializers can.
For example, this C# program will print E_init, D_init, C_init, C_ctor, D_ctor, and then E_ctor:
using System;
class C {
    int x = M();
    public C() {
        Console.WriteLine("C_ctor");
    }
    private static int M() {
        Console.WriteLine("C_init");
        return 42;
    }
}
class D : C {
    int x = M();
    public D() : base() {
        Console.WriteLine("D_ctor");
    }
    private static int M() {
        Console.WriteLine("D_init");
        return 42;
    }
}
class E : D {
    int x = M();
    public E() : base() {
        Console.WriteLine("E_ctor");
    }
    private static int M() {
        Console.WriteLine("E_init");
        return 42;
    }
}
class Program {
    public static void Main() {
        new E();
    }
}
And this C++ program will print C_init, C_ctor, D_init, D_ctor, E_init, E_ctor, ~E, ~D, and finally ~C:
#include <iostream>
using namespace std;
struct C {
    int x;
    C() : x(M()) { cout << "C_ctor" << endl; }
    ~C() { cout << "~C" << endl; }
    static int M() { cout << "C_init" << endl; return 42; }
};
struct D : C {
    int x;
    D() : x(M()) { cout << "D_ctor" << endl; }
    ~D() { cout << "~D" << endl; }
    static int M() { cout << "D_init" << endl; return 42; }
};
struct E : D {
    int x;
    E() : x(M()) { cout << "E_ctor" << endl; }
    ~E() { cout << "~E" << endl; }
    static int M() { cout << "E_init" << endl; return 42; }
};
int main() {
    E e;
}
It’s interesting to note that the CLR allows constructor chaining to happen in any order. The C# compiler emits the calls to base as the first thing a constructor does, but other languages can choose to do differently. The verifier ensures that a call has occurred somewhere in the constructor body before returning.
This IL program, for example, will print E, D, and then C – the reverse of what C# gives us:
.assembly extern mscorlib { }
.assembly ctor { }
.class C {
    .method public specialname rtspecialname instance void .ctor() cil managed {
        ldstr "C"
        call void [mscorlib]System.Console::WriteLine(string)
        ldarg.0
        call instance void [mscorlib]System.Object::.ctor()
        ret
    }
}
.class D extends C {
    .method public specialname rtspecialname instance void .ctor() cil managed {
        ldstr "D"
        call void [mscorlib]System.Console::WriteLine(string)
        ldarg.0
        call instance void C::.ctor()
        ret
    }
}
.class E extends D {
    .method public specialname rtspecialname instance void .ctor() cil managed {
        ldstr "E"
        call void [mscorlib]System.Console::WriteLine(string)
        ldarg.0
        call instance void D::.ctor()
        ret
    }
}
.class Program {
    .method public static void Main() cil managed {
        .entrypoint
        newobj instance void E::.ctor()
        pop
        ret
    }
}
So why is leaking ‘this’ bad, anyway?
Say you’ve decided in the implementation of D’s constructor that you would like to stick ‘this’ into a global hash-map. Doing this sadly means other code could grab the pointer and begin accessing it before E’s constructor has even run. That is a race at-best and a ticking time-bomb in all likelihood. For example:
class C {
public static Dictionary<int, C> s_globalLookup;
private readonly int m_key;
public C(int key) {
m_key = key;
s_globalLookup.Add(key, this);
}
}
Even though we have taken great care to initialize our readonly field m_key before sticking ‘this’ into a dictionary, any subclasses will not have been initialized at this point. Only if C is sealed can we be assured of this. Another piece of code that grabs the element out of the hashtable and begins calling virtual methods on it, for example, is in a race with the completion of the initialization code for subclasses.
Leaking ‘this’ isn’t always such an explicit act. Merely invoking a virtual method within the constructor may dispatch a more derived class’s override before the more derived class’s constructor has run. And therefore its state is most likely not in a place conducive to correct execution of that override. It is fairly common knowledge that invoking virtual methods during construction is an extraordinarily poor practice, and best avoided.
Failure to construct
Let’s move on to the second issue. Imagine we suffer an exception during construction of an object. Perhaps this is due to a failure to allocate resources, or maybe even due to argument validation. It is clear that a leaked ‘this’ object in such cases will be a problem, because the object will have escaped into the wild even though its initialization failed. Subsequent attempts to use the object will obviously pose problems. What is more subtle is that if the class in question declares a destructor (in C++) or finalizer (in C#), a problem may now be lurking.
Let’s say we have the original situation shown above: E derives from D, which derives from C. Now say an exception happens within D’s constructor. At this point in time, C’s constructor has run to completion, D’s constructor has run partially up to the point of failure, and E’s has not run at all. (And, of course, in the case of C#, all member-initializers for all classes have actually run.) What happens to the cleanup code?
In C++, only constructors that have run will have their corresponding destructors executed. In the above situation, where C, D, and E each declares a destructor, only C’s will be run during stack unwind. It is imperative, therefore, that D handles failure within its constructor rather than relying on the destructor. For example:
class D : C {
int* m_pBuf1;
int* m_pBuf2;
public:
D() {
m_pBuf1 = … alloc …;
m_pBuf2 = … alloc …;
}
~D() {
if (m_pBuf2) … free …;
if (m_pBuf1) … free …;
}
};
If the allocation destined for m_pBuf2 fails by throwing an exception, the destructor for D will not run, and therefore m_pBuf1 will leak. The C++ solution to this particular example is to use smart pointers and member initializers for the allocations, because successfully initialized members do get destructed, even when the constructor (or indeed one of the member initializers) subsequently fails. This means that destructors for a particular class do not have to tolerate that class’s state not having been fully constructed, because those destructors will never run. Destructors must not, however, invoke virtuals, for two obvious reasons: (a) subclasses may not have ever been initialized, and (b) destructors run in reverse order, and so the destruction code for the subclass will have already run by the time a baseclass’s destructor runs.
In C#, finalizers will run, regardless of whether an object’s constructor ran fully, partway through, or not at all. If the object is allocated – and so long as GC.SuppressFinalize isn’t called on it – the finalizer runs. This distinctly means that C# finalizers must always be resilient to partially-constructed objects (unlike C++ destructors). Thankfully finalizers are a rare bird, and therefore this issue is seldom even noticed by .NET programmers.
This issue does not arise in the case of .NET’s IDisposable pattern. If a constructor throws, the assignment to the target local variable does not occur. And therefore the variable enclosed in, say, a using statement remains null. This means that there is no way to possibly invoke Dispose on the object. Moreover, the allocation in using occurs prior to entering the try/finally block. Hence, you really had better be writing constructors that don’t throw and/or protecting such resources with smart-pointer-like things with finalizers, a la SafeHandle.
Impediments to language support
As if these weren’t sufficient cause for concern, I also mentioned earlier – and somewhat vaguely – that partially-constructed objects interfere with language designers’ efforts to add invariants, immutability and concurrency-safety, and non-nullability to the language. And all of these are important properties to consider in our present age of complexity and concurrency, so this point is worth understanding more deeply. Let’s look at each in-order.
Invariants and safe-points
A partially-constructed object obviously may have broken invariants. By definition, invariants are meant to hold at the end of construction, and so if construction never completes, the rules of engagement are being broken.
Imagine, for example:
class C {
int m_x;
int m_y;
invariant m_x < m_y;
public C(int a) {
m_x = a;
m_y = a + 1;
}
}
It is ordinarily very difficult to ensure that each instruction atomically transitions the state of an object from one invariant safe-point to another. A common technique is to designate well-defined points at which invariants must hold. We might model each failure point as one such point. But even in C#, the above program does not satisfy this constraint, because an overflow exception may be generated at the ‘m_y = a + 1’ line (when compiled in a checked context). Or a thread-abort exception may be generated right between those two assignments. Or, if addition were implemented as a JIT helper, an out-of-memory exception could get generated due to failure to JIT the helper function.
In such cases, we’d like to say that the object does not exist. But the sad fact is that the object *does* exist, and if the object has acquired physical resources at the time of failure to construct, we must compensate and reclaim them. The ideal world looks a lot like object construction as transactions, where the end of construction is the commit-point. The state-of-the-art is very different from this, however, and so any static verification and theorem proving that depends on invariants on an object holding, well, invariantly, is subject to being broken by partially-constructed objects.
Immutability… or not
Immutability is also threatened by partially-constructed objects. Immutability is one of many first class techniques for solving concurrency-safety in the language, so this one is quite unfortunate.
In C#, for example, we might be tempted to say that a shallowly immutable type is one whose fields are all marked readonly. (And a deeply immutable type is one whose fields are all readonly, and also refer to types that are immutable.) A readonly field cannot be written to after construction has completed. Unfortunately, if the ‘this’ leaks out during construction, we may see those readonly fields in an uninitialized or even changing state:
class C {
public static C s_c;
readonly int m_x;
public C() {
m_x = 42;
s_c = this;
while (true) {
++m_x;
}
}
}
This is quite evidently a terrible and malicious program. C appears to be immutable, because it only contains readonly fields, but is quite clearly not, because the value of m_x is assigned to multiple times. If we had a guarantee that all readonly fields were definitely assigned once-and-only-once before ‘this’ can escape, then we’d be close to a solution. But of course we have no such guarantee. In C#, at least.
A related issue is co-initialization of objects. This is interesting and relevant, because in such cases we actually want to leak out partially-constructed objects. Imagine we’d like to build a cyclic graph comprised of two nodes, A and B, each referring to the other. With a naïve approach to immutability, we simply cannot make this work. Either A must first refer to B, in which case A refers to a partially-constructed B; or B must first refer to A, in which case B refers to a partially-constructed A. Both are equally bad. The two assignments are not atomic.
Cyclic data structures are a commonly cited weakness of immutability, and an argument in favor of supporting partially-instantiated objects in a first class way, although there are approaches that can work. One example is to separate edges from nodes, and represent them with different data structures. We can then build the nodes A and B, and then build the edges A->B and B->A without needing to use cycles.
It’s not-null, or at least it wasn’t supposed to be
Tony Hoare called it his billion-dollar mistake: the introduction of nulls into a programming language. I think he sells himself short, however, because the absence of nulls in an imperative programming language – however worthy a pursuit – is actually notoriously difficult to attain.
Spec# is one example of a C-style language with non-nullability, in which T! represents a “non-null T”, for any T. Although this is done in a pleasant way conducive to C#-compatibility – a significant goal of Spec# -- I’d personally prefer to see the polarity switched: T would mean “non-null T” and T? would mean “nullable T”, for any T, reference- and value-types included. This is much more like Haskell’s Maybe monad, and is the syntax I’ll use below for illustration purposes. But I digress.
Non-nullability is a wonderful invention, because it is common for 75% or more of the contracts and assertions in modern programs to be about a pointer being non-null prior to dereferencing it, both in C# and in C++. Relying on the type-system to prove the absence of nulls is one big step towards creating programs that are robust and correct out-of-the-gate, particularly for systems software where such degrees of reliability are important. And it cuts down on all those boilerplate contracts sprinkled throughout a program. Instead of:
void M(C c, D d, E e)
requires c != null, d != null, e != null
{
… use(c, d, e) …
}
You simply say:
void M(C c, D d, E e)
{
… use (c, d, e) …
}
No opportunity to miss one, and no need for runtime checks. It’s beautiful.
A problem quickly arises, however, having to do with partially-constructed objects. All of an object’s fields cannot possibly be non-null while the constructor is executing, because the object has been zero’d out and the assignments to its fields have not yet been made. Clearly constructor code needs to be treated “differently” somehow. We cannot simply live with the fact that ‘this’ escaping leads to a partially-constructed object leaking out into the program, because that could lead to serious errors. These serious errors include potential security holes, if unsafe code is manipulating the supposedly non-null pointer. One advantage to adding non-nullability is that runtime null checks can be omitted, because the type system guarantees the absence of nulls in certain places. In this situation, partially-constructed objects lead to holes in our nice type system support. Either the dynamic non-null checks are required as back-stop, or we’ll need some other coping technique.
There are related issues with non-nullability, like with partially-populated arrays. Imagine we’d like to allocate an array of 1M elements of type T, and we will proceed to populate those elements following the array’s allocation. There’s clearly a window of time during which the array contains 1M nulls, and then 1M-1 nulls, …, and, if we finish, 1M-1M nulls, i.e. 0 nulls. It is only at that last transition that the array can be considered to contain non-null T’s. The standard technique is to use an explicit dynamic conversion check, or to require that all of the array’s elements be supplied at construction time.
Coping techniques
There are, thankfully, some interesting ways to cope with partially-constructed objects. There is, in fact, a spectrum of techniques, ranging from escape analysis in various forms, to limitations around how objects are constructed such that a partially-constructed one can never leak, to automatic insertion of dynamic checks to prevent the use of partially-constructed objects, to static annotations that treat partially-constructed instances as first class things in the type system. And of course there’s always the technique of “deal with it”, which is the one that most C++-style languages have chosen, including our beloved C#.
The F# approach: restrictions and dynamic checks
F#, it turns out, does a very good job in preventing partially-initialized objects. A first important step is that fields in F# are readonly by-default, unless you opt-in to mutability using the mutable keyword. Therefore data structures are mostly immutable. And the construction rules are meant to make it very unlikely that you’ll expose a partially-constructed object during construction. How so? It’s simple: such fields must be initialized prior to running ad-hoc construction code, and if you attempt to initialize them multiple times, the compiler supplies an error. In other words, you really have to work hard to screw yourself, unlike C++ and C# where it’s very easy.
It’s of course possible to do some dirty tricks to publish or access a partially-initialized object, despite needing to work very hard at it. There is, however, a nice surprise awaiting us when we try. For example:
type C() =
    abstract member virt : unit -> unit
    default this.virt() = ()
type D() as this =
    inherit C()
    do this.virt()
type E =
    inherit D
    val x : int
    new() = { x = 42; }
    override this.virt() = printf "x: %d" this.x
let e = new E()
This example attempts to perform a virtual invocation from D’s constructor before the more derived class E has been fully initialized. The overriding method simply attempts to print out the value of x. If we compile and run this program, however, we will notice that we get an exception: “InvalidOperationException: the initialization of an object or value resulted in an object or value being accessed recursively before it was fully initialized.”
Pretty neat. The compiler will stick in the checks necessary when ‘this’ is being accessed, to dynamically verify that an object is not being leaked before having been fully initialized. The F# approach can be summed up as trying to make things as airtight as possible statically at compile-time, but admitting that there are holes – primarily due to inheritance – and dealing with them by inserting dynamic runtime checks. This tradeoff clearly makes sense for F#, because it is attempting to attain a robust level of reliability around immutability.
F# also adds non-nullability for the most part. Like Haskell’s Maybe monad, F# adds an option type that can store a single None value which lies outside of the underlying type’s domain to effectively represent null. Because F# is a .NET language it of course also needs to worry about nulls at interop boundaries with other languages like C# and VB. This is a great step forward; first class CLI support would be a nice next step.
A slight variant of the F# idea is to initialize data up the whole class hierarchy in one pass, and then run ad-hoc construction code in a second pass in the usual way. So long as readonly data can be initialized without running the ad-hoc construction code, this helps to statistically cut down on the chances for accidental leaking of ‘this’.
Type system: T versus notconstructed-T
We can model two kinds of T in the type-system: T and notconstructed-T. The constructor for any type T would then see the ‘this’ pointer as a notconstructed-T, and everything else – by default – sees ordinary T’s.
What good does this distinction do? It enables us to add verification and restrictions around the use of notconstructed-T’s and limitations to the conversion between the two types. See this paper by Manuel Fahndrich and Rustan Leino for an example of how this approach was taken in Spec#’s non-nullability work.
For example, we can prohibit conversion between T and notconstructed-T altogether, thereby disallowing escaping ‘this’ references altogether. If the type of ‘this’ within a constructor is different than all other references to type T, and they are not convertible, we’ve successfully walled the problem off in the type system. This protects against erroneous method calls, so that a constructor could not call any of its own methods, because these methods expect an ordinary T whereas the constructor only has a notconstructed-T. And because you cannot state notconstructed-T in the language, you cannot let one leak by storing it into a field.
We could add more sophisticated support, by allowing a programmer to explicitly tag certain non-field references as notconstructed-T. This makes the concept first-class in the language, and allows one to explicitly declare the fact that code is interacting with a partially-constructed T:
class C
{
int m_x;
public C() {
m_x = V();
}
protected virtual int V() notconstructed {
… I know to be careful …
}
}
In this example, the programmer has annotated V with ‘notconstructed’. This enables the call from the constructor because the method’s ‘this’ is a notconstructed-T, and serves as a warning to the programmer that he or she should take care, much like the code written inside a constructor.
We must also decide whether fields are off-limits via notconstructed-T’s. If yes, we can add F#-style dynamic checks for initialization, but only for attempted accesses against a notconstructed object’s fields. This is nice because it means the scope of dynamic checks is limited, and they are used in a pay-for-play manner. And we could even enable a programmer to sidestep the error by stating at the use-site that they understand a particular field access may be to uninitialized memory, like Field.ReadMaybeUninitialized(&m_field).
To be honest, the reason this approach has likely not yet seen widespread use is that the cost is not commensurate with the benefit. At least, in my opinion. To make something like partially-constructed objects a first-class concept in a programming language, programmers would need to want to know where they are dealing with them. For systems programmers, this makes sense. For many other programmers, it would be useless ceremony with no perceived value. And yet the initial approach where nothing new needed to be stated, but yet escaping ‘this’ was prevented, blocks certain patterns of legal use. This is where theory and practice run up against one another. There is, however, presumably a nice middle ground awaiting discovery.
Winding down
I meant for this to be a short post. But the topic really is fascinating, and has been coming up time and time again as we do language work here at Microsoft. But it is truly fascinating mainly because, like nulls, the problem is widespread yet tolerable, and most C++ and C# programmers learn the rules and make do. Partially-constructed objects are a major blemish, but not a crisis that threatens the complete abandonment of imperative programming.
I do believe language trends indicate that more languages will move away from C++-style object initialization and related issues, and more towards immutability and treating data and its initialization separately from imperative ad-hoc initialization code. Haskell, F#, and Clojure, for example, show us some promising paths forward. There is a plethora of other attempts at solving related problems, and I unfortunately could only scratch the surface.
Although these techniques are not new, the primary question – to me – is how close to “the metal” in systems programming these concepts can be made to work. Typically for those situations, you need to rely on a mixture of static verification and complete freedom, because the dynamic checking is too costly and the code to work around overly-limiting static verification also adds too much cost. But as soon as you add complete freedom into the picture, you run into partially-constructed objects as a consequence, and all the issues I’ve mentioned above.
Anyway, I hope it was interesting.
 Sunday, May 16, 2010
My article about Transactional Memory (TM) was picked up by a few news feeds recently.
If I had known this would occur, I would have written it with greater precision. Because my article presents a mixture of technical challenges interspersed among more subjective and cultural issues, I am sure it is difficult to tease out my intended conclusion. To summarize, I merely believe adding TM to a shared memory architecture alone is insufficient.
Indeed, I remain a big fan of transactions. Atomicity, consistency, and isolation, and coming up with strategies for achieving all three in tandem, are part and parcel to architecting software.
After watching Barbara Liskov's OOPSLA Turing Award reprise, I decided to reacquaint myself with some old Argus papers this weekend. It has been some time since I last read them. Argus was Liskov's language for distributed programming and her follow-on to CLU. As with most research done by brilliant people, the work was way ahead of its time, has appeared in ad-hoc incarnations and permutations over time, and remains relevant today. This research is particularly interesting to work that my team is doing right now, especially its notion of guardians. And it is relevant to the TM discussion too.
The Argus approach of using isolation to coarsely partition state and operations into independent bubbles, and then communicating asynchronously between the so-called guardians that are responsible for this state, is an architecture that is common among most highly concurrent programs. This aids state management and fault tolerance. Argus makes an interesting observation that, although guardians may be sent messages concurrently -- and indeed activities themselves may even introduce local concurrency -- manipulation of state can be done safely and even in parallel thanks to transactions.
The requirement is that messages are atomic and commute. Transactions, it turns out, are a convenient way of implementing this requirement.
You will observe a similar architecture in other places, including in some languages that have adopted TM. Haskell has moved in this direction. Everything is purely functional and so, of course, no state is mutated in an unsafe way by default. However, with the introduction of concurrency comes mutable cells for message passing and with parallelism comes indeterminism. You can push the state management problem up indefinitely, but at the top there are almost always mutable operations on real-world state (even if it is "just I/O"). Haskell programs have a safe architecture to begin with, and it is the intentional and careful addition of specific facilities that forces one to focus on the problematic seams. One could say that Haskell starts clean and stays clean, versus most shared memory-based languages which start dirty and try to attain cleanliness (at least when it comes to concurrency).
Why aren't transactions sufficient, then, given that the I in ACID stands for Isolation? You wouldn't model a database as one flat table in which each row is a single byte, however, would you? As you begin to decompose your program into isolated state, your bubbles (or guardians) are the tables, and your objects are the rows. This is just an analogy but I find it useful to think in these terms. Taking a bunch of intermingled state and pouring transactions on top is not going to give you this nicely partitioned separation of state which has proven to be the lifeblood of concurrency.
Even once you've attained a more isolated architecture, however, transactions are not a panacea. They are just one of many viable state management techniques in a programmer's arsenal, hierarchical state machines being another notable example. And in fact, many of the problems I mentioned in the TM article are still worrisome even when you start from the right place. From within a guardian, you may wish to enlist the aid of another unrelated guardian to perform a coordinated atomic activity, because a higher level invariant relationship between them must be preserved. Or an application which composes multiple guardians may wish to do so atomically. Even Argus required manual compensation for such things. This can be solved in part by DTC. But experience suggests that continuing to push the enlistment scope one level higher eventually leads to substantial problems. A topic for another day, I suppose.
My primary conclusion is that TM is a great complement to highly concurrent programs, but only so long as you start from the right place. The Argus and Haskell approaches are conducive to large-scale concurrency, but it is primarily because of the natural isolation those models provide; the addition of transactions addresses problems that remain after taking that step. But without that first step, they would have gotten nowhere.
 Sunday, April 25, 2010
We use static analysis very heavily in my project here at Microsoft, as a way of finding bugs and/or enforcing policies that would have otherwise gone unenforced. Many of the analyses we rely on are in fact minor extensions to the CLR type system, and verge on “effect typing”, an intriguing branch of type systems research that has matured significantly over the years.
Many of these annotations are done on methods, rather than types. A few examples include:
- [MayBlock] indicates that a method is free to call methods that might block.
- [NoAllocations] indicates that a method is allowed neither to allocate nor to call another method that might allocate.
- [Throws(...)] indicates that a method is allowed to throw an exception of a type in the set { … }, or call other methods that may throw exceptions in the set { … }.
- And so on.
It turns out there’s a general way for handling these annotations. And indeed, you will quickly find that pursuing ad-hoc solutions to each independently leads to troubles. We shall briefly look at the generalization.
We must first observe that each falls into one of two categories: additive or subtractive. MayBlock and Throws are additive. They say what is permitted. NoAllocations, on the other hand, is subtractive, because the annotation declares what is not permitted. This distinction, we shall see, is crucial.
First we can imagine that each distinct effect shown above has a distinct effect type.
The types EMayBlock, ENoAllocations, and EThrows correspond to the annotations above. This will permit us to model effects using subtyping polymorphism. We will use the usual notation, i.e. “S <: T” means “S is a subtype of T”, or “an S is substitutable in place of a T”. For example, String <: Object. Throws is special, because it has a type hierarchy of its own beneath the root type. As you might guess, this hierarchy is infinite in size and is comprised of each possible permutation of exception types.
There are two special kinds of effects: the null effect (ENil), and a set of other effects (EMany). The latter permits us to create a new, unique effect type merely by concatenating a list of other effect types.
Each method is then given an EMany effect type containing its full set of effect types. For example:
[MayBlock, Throws(typeof(FileNotFoundException)), NoAllocations] void M() { … }
Is given the distinct effect type EMany { EMayBlock, EThrows(typeof(FileNotFoundException)), ENoAllocations }.
We should make one generalization before moving on. ENil ~ EMany { }. In other words, having no effects is equivalent to a list of no effects. Furthermore, EMany { } ~ EMany { ENil }. In other words, having a list of no effects is equivalent to having no effects.
Now we are ready to weave everything together. The main question confronting us is as follows: What is the subtyping relationship between the various effect types, including the null and list types?
The easiest to do away with is the EMany type. Given two EMany types E and F, then E <: F if, for all effects T in E’s type set, there exists an effect type U in F’s type set such that T <: U. In simpler terms, a list is a subtype of another list so long as all of its components are also subtypes of a component of the other. This is very abstract, but we shall see soon why it is useful.
Now we get to see why the additive and subtractive distinctions are so important:
Given an additive effect type EAdditive, we say ENil <: EAdditive.
Given a subtractive effect type ESubtractive, we say ESubtractive <: ENil.
The first statement says that a method with no effects is substitutable for a method with additive effects, and the second says that a method with subtractive effects is substitutable for a method with no effects. The corollaries are perhaps just as important. A method with additive effects cannot take the place of a method with no effects, whereas a method with subtractive effects can.
For the simple single-effect case, effects depicted in this way represent points on a line, where ENil is zero, subtractive effects are negative integers, and additive effects are positive integers. The lattice obviously becomes rather complicated as many effects accumulate.
Where does substitutability come up with respect to methods, anyway, you may ask? The first is in determining which other methods can be called. If a method M with effects E is trying to call another method N with effects F, this is permitted so long as F <: E. The next is in virtuals and overriding. A virtual with effects E may be overridden by a method with effects F so long as F <: E. The following example illustrates this idea, in addition to the composition of the subtyping rules we have shown so far:
class C {
    public virtual void M() {}
    [MayBlock, NoAllocations] public virtual void N() {}
}
class D : C {
    [MayBlock, NoAllocations] public override void M() {}
    public override void N() {}
}
In this example, the four methods are given the following effect types:
- C::M gets EMany { ENil }, or just ENil.
- C::N gets EMany { EMayBlock, ENoAllocations }.
- D::M gets EMany { EMayBlock, ENoAllocations }.
- D::N gets EMany { ENil }, or just ENil.
What does all this gibberish mean? Well it’s straightforward and intuitive, actually.
We are attempting to add the MayBlock and NoAllocations effects to the overridden M method, which has none. Because MayBlock is additive, this is illegal (someone might call C::M thinking the code will not block), whereas it is OK for NoAllocations (calls through D::M are assured no allocations will happen, even though calls through C::M are guaranteed no such thing). Similarly, we are attempting to remove both effects from the overridden N method. Because MayBlock is additive, this is OK (N isn’t required to block, even though calls through C::N may suspect it of doing so), whereas it is decidedly not OK for NoAllocations (calls through C::N will reasonably assume allocations do not happen, whereas D::N would be free to perform them). It may take some thought to convince yourself that this is correct, but I hope that you find that it is. All of this works because of the subtyping of effect types.
All of this works similarly with delegates. The source delegate signature is akin to the base class in the above example, whereas the target method being bound to is like the override.
Things get a little more complicated when considering the EThrows effect. It is additive, so it is of course true that ENil <: EThrows(*). However, what if we have two different EThrows, and wish to inquire about substitutability of one in place of the other? We can come up with a simple rule that is general purpose for all set-of-type kinds of effects. Namely, consider two instances A and B of the same effect type:
Given an additive effect type EAdditive, then A <: B if, for all types T in A’s set-of-types, there exists a type U in B’s set-of-types such that T <: U.
Given a subtractive effect type ESubtractive, then A <: B if, for all types T in A’s set-of-types, there exists a type U in B’s set-of-types such that U <: T.
These sound quite similar, except that they end differently (i.e. T <: U vs. U <: T). We may illustrate the additive case with EThrows; to illustrate the subtractive case, let us imagine we can declare an ENoAllocations effect type that specifies which precise types may not be allocated:
class A {}
class B : A {}
class C
{
    [Throws(typeof(Exception)), NoAllocations(typeof(A))]
    public virtual void M() {}
    [Throws(typeof(FileNotFoundException)), NoAllocations(typeof(B))]
    public virtual void N() {}
}
class D : C
{
    [Throws(typeof(FileNotFoundException)), NoAllocations(typeof(B))]
    public override void M() {}
    [Throws(typeof(Exception)), NoAllocations(typeof(A))]
    public override void N() {}
}
The results should not be surprising. D::M overrides C::M’s exception list, by being more specific and declaring that FileNotFoundException is thrown instead of just Exception. This is OK. Whereas D::N overrides C::N’s list by being more general purpose, specifying Exception instead of FileNotFoundException. This is clearly not OK. The NoAllocations type works in exactly the reverse. D::M attempts to prohibit allocations of B, but this is merely one possible subtype of the base method C::M’s declaration of A, and therefore this is illegal. Whereas D::N ensures no instances of A are allocated, which of course subsumes the base method C::N’s declaration that no B’s are allocated.
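The two set-of-types rules can be sketched directly with reflection, using Type.IsAssignableFrom to stand in for the <: relation between ordinary types. The TypeSetEffects class and its method names are hypothetical, and the A and B classes simply mirror the ones in the example above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public class A {}
public class B : A {}

public static class TypeSetEffects
{
    // Additive rule: a <: b if every T in a has some U in b with T <: U.
    public static bool AdditiveSubtype(IEnumerable<Type> a, IEnumerable<Type> b)
    {
        // u.IsAssignableFrom(t) is true when t is u or derives from u.
        return a.All(t => b.Any(u => u.IsAssignableFrom(t)));
    }

    // Subtractive rule: a <: b if every T in a has some U in b with U <: T.
    public static bool SubtractiveSubtype(IEnumerable<Type> a, IEnumerable<Type> b)
    {
        return a.All(t => b.Any(u => t.IsAssignableFrom(u)));
    }
}
```

With these helpers, AdditiveSubtype accepts { FileNotFoundException } against { Exception } but not the reverse, and SubtractiveSubtype accepts { A } against { B } but not the reverse, which is the same verdict reached for D::M and D::N above.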
Everything gets a little more interesting when you consider generics. For example, how would we type a general purpose Map method? (This pattern arises quite frequently.) We would presumably want it to somehow “acquire” the effects of the delegate it invokes on all elements in a list. For example:
U[] Map<T, U>(T[] input, Func<T, U> func);
This declaration is stronger than necessary. The Func<T, U> class – prepackaged with the .NET Framework – does not have any effects on it. So it may not, for example, bind to a method that has any additive effects like Throws on it. This is rather unfortunate.
To solve this we could imagine treating effects with parametric polymorphism:
[Effects(E)] U Func<T, U, [EffectParameter] E>(T x);
This fictitious syntax merely says that Func can be instantiated with an effect type E, and that the Func “method” itself acquires the effect E. (Admittedly I should stop using faux-attribute syntax for illustrations since we’ve reached this level of language integration.) Now Map can be declared as such:
[Effects(E)] U[] Map<T, U, [EffectParameter] E>(T[] input, Func<T, U, E> func);
This says that Map has the same effects as the Func that is supplied as an argument. It turns out that we may want to extend this further, by enabling symbolic manipulation of effects. We may wish, for example, to specify that the Func is not allowed to block, by stating it does not have [MayBlock] in it. You could imagine using something very similar to generic constraints to achieve this. It is also interesting to allow concatenation of multiple effect types, both through partial and full specialization. For example, Map above may clearly have effects of its own. You also tend to want generic constraints like, 'where E : F', which of course just depends on the aforementioned subtyping rules. And of course C# 4.0's co- and contravariance can be applied to effects too.
Anyway, I have probably gone beyond most readers’ interest level in this subject. Things sure do get very interesting when you allow symbolic manipulation of effects. They get even more interesting when you begin to think of types as having “permissible effects” attached to them. However, the main thing I wanted to point out with this brief article is that this pattern arises quite frequently. And despite everyone struggling through what seem to be odd corner cases as they develop ad-hoc solutions, there really is a sound generalization behind it all. Many languages have first class effect typing, and I have found it liberating to think of many of these type system annotations through that lens. Perhaps you shall too.
 Sunday, February 28, 2010
Simon Peyton Jones was in town a couple weeks back to deliver a repeat of his ECOOP’09 keynote, “Classes, Jim, but not as we know them. Type classes in Haskell: what, why, and whither”, to a group of internal Microsoft language folks. It was a fantastic talk, and pulled together multiple strains of thought that I’ve been pondering lately, most notably the common thread amongst them.
In the talk, he compared polymorphism in Java-like languages (including C#, which I will refer to instead of Java henceforth) with ML and Haskell. In other words, how does a programmer commonly write code in each language that is maximally reusable? Of course, C# programmers primarily achieve this through subclassing, whereas functional programmers rely on type parameterization. Over the years, however, the former group has begun to borrow a great deal from the latter; as evidence, witness the increasingly pervasive use of generics in both Java and C# over the past decade. The talk was given mainly through the lens of this evolution, which appears to approach an interesting limit if projected far enough into the future.
Type classes came on the scene towards the end of the 1980’s, and immediately became a fertile seed for research and exploration in the relationship between subclass and parametric polymorphism. Type classes are much closer to subclass polymorphism than Haskell’s borrowed ML-style, which is to say parametric polymorphism. This is most intriguing because Haskell does not rely on subclassing, and so the mixture of the two breeds new patterns.
I thought that it might be interesting to compare the mixture of subclass and parametric polymorphism in Haskell vis-à-vis type classes with the same in C# vis-à-vis a mixture of interfaces, generics, and generic constraints. Hence this post. We shall proceed by examining some basic type classes in Haskell with their equals in C#. Though similar, the dissimilarities are as stark as the similarities. And the lack of higher kinds -- particularly when combined with type classes -- means that some Haskell patterns simply are not expressible in C#.
The Simple Case: Equality (or Lack Thereof)
The most basic type class of all is Eq, which allows the comparison of two like-typed pieces of data. This may seem like a commodity if you ordinarily write code in languages like Java and C# which have a strong notion of object identity. In Haskell, however, equality is value equality over algebraic data types rather than objects, so polymorphism over equality operators is quite a bit more important. Indeed, as we shall see, Haskell’s approach is more powerful than == in Java-like languages. (Witness the neverending dichotomy between reference and value equality vis-à-vis Object.Equals in C#.) But alas, let us proceed by crawling in a series of logical steps, rather than leaping to the conclusions.
Haskell’s Eq type class is defined as such:
class Eq a where
    (==), (/=) :: a -> a -> Bool
    x /= y = not (x == y)
    x == y = not (x /= y)
As you see, Eq provides two operators: == and /=. Default implementations of each define == as the inverse of /= and /= as the inverse of ==. Not only is this a convenience, but it also specifies the desired contract implementations ought to abide by. Other types may become members of the Eq class by mapping one or both operators to type-specific functionality. You will immediately recognize the similarity to virtual methods in OOP languages, where the operators can be overridden by subclasses.
Of course all of the primitive data types already implement Eq, so you get value equality over numbers, strings, etc. Imagine we declared a new Coords type – comprised of two integers – and want to make it a member of Eq also – wherein equality is determined by a pairwise comparison of each’s members:
data Coords = Coords { fst :: Integer, snd :: Integer }
We make Coords a member of the Eq type class, and thereby define equality over instances, through the ‘instance Eq Coords where’ construct. This maps type class functions to real implementation functions. The example here defines them inline, though you may of course refer to existing functions instead:
instance Eq Coords where
    (Coords fst1 snd1) == (Coords fst2 snd2) = (fst1 == fst2) && (snd1 == snd2)
Now we can take a ‘[Coords]’ and ask whether a particular ‘Coords’ exists within it.
A function may constrain a type variable to a certain class, and thereby access members of that class. For example, the following ‘isin’ function tests whether an instance of some type ‘a’ is contained within a list of type ‘[a]’. To do this, it demands that ‘a’ is a member of Eq using the syntax “Eq a =>”:
isin :: Eq a => a -> [a] -> Bool
x `isin` [] = False
x `isin` (y:ys) = x == y || (x `isin` ys)
The moral equivalent to the Eq type class in C# is not so easy to decide. The most obvious first guess is the built-in == and != operators. However, we will quickly find that this is not quite right, because these operators are not polymorphic in C#. To illustrate this point, let’s try to write the ‘isin’ method in C#, using generics and the == operator, for example:
bool IsIn<T>(T x, T[] ys)
{
    foreach (T y in ys) {
        if (x == y)
            return true;
    }
    return false;
}
This function will not compile. The reason is that == and != in C# are not defined over all types (specifically not for value types). You can get IsIn to compile by restricting the T to a reference type:
bool IsIn<T>(T x, T[] ys) where T : class
{
… same as above …
}
Although this code is deceptively similar to the Haskell example, it is actually quite different. The == used to compare two instances compiles into the MSIL CEQ operator, effectively hard-coding an object identity comparison. Even if an overloaded == operator for a particular instantiated T is available, the compiler will not bind to it. Why? Because it is overloading and specifically *not* overriding. For example, say we had a MyClass type and an overloaded == operator for comparing two instances:
class MyClass
{
public static bool operator ==(MyClass a, MyClass b) { return true; }
public static bool operator !=(MyClass a, MyClass b) { return false; }
}
According to this, all MyClass objects are equal. However, the following call yields the answer ‘false’:
IsIn<MyClass>(new MyClass(), new MyClass[] { new MyClass() });
The same problem arises should instances of MyClass get referred to by Object references. == and != do not perform any kind of virtual dispatch; the selection of implementation is chosen statically.
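The static binding is easy to observe by putting the generic method next to a non-generic twin. The sketch below assumes the MyClass from above (with Equals and GetHashCode overrides added only to keep the compiler quiet); StaticBindingDemo and IsInMyClass are names made up for this illustration.

```csharp
using System;

public class MyClass
{
    // Overloaded, not overridden: according to these, all instances are equal.
    public static bool operator ==(MyClass a, MyClass b) { return true; }
    public static bool operator !=(MyClass a, MyClass b) { return false; }
    // Added only to avoid compiler warnings about overloading == alone.
    public override bool Equals(object obj) { return obj is MyClass; }
    public override int GetHashCode() { return 0; }
}

public static class StaticBindingDemo
{
    // Generic: the compiler only knows T is a reference type, so == here
    // is a reference comparison no matter which T is supplied.
    public static bool IsIn<T>(T x, T[] ys) where T : class
    {
        foreach (T y in ys) {
            if (x == y)
                return true;
        }
        return false;
    }

    // Non-generic twin: the static type is MyClass, so the compiler
    // binds == to the overloaded operator.
    public static bool IsInMyClass(MyClass x, MyClass[] ys)
    {
        foreach (MyClass y in ys) {
            if (x == y)
                return true;
        }
        return false;
    }
}
```

Calling both with a fresh MyClass and a one-element array holding another fresh MyClass gives false from the generic version and true from the non-generic one, even though the bodies are textually identical.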
Perhaps it is the Equals method inherited from System.Object, then? This, at least, is virtual. And indeed, this gets much closer to Eq. Any type may override Equals, and a generic definition defined in terms of it dispatches virtually and allows subclasses to change behavior on a type-by-type basis:
bool IsIn<T>(T x, T[] ys)
{
    foreach (T y in ys) {
        if (x == null ? y == null : x.Equals(y))
            return true;
    }
    return false;
}
(Even this is slightly different, because it assumes a certain type-agnostic behavior about nulls.)
This is cheating, however. We’ve taken advantage of the fact that someone thought to put an Equals method on System.Object, thereby giving all Ts such a method. There are clearly limits to how many crosscutting things can be added to System.Object before it becomes overwhelmed with concepts, not to mention the size (e.g. v-tables). Moreover, Equals on Object is weakly typed; a better solution is to use interfaces, like the IEquatable<T> interface that introduces a strongly typed Equals method:
public interface IEquatable<T>
{
    bool Equals(T other);
}
And to use a generic type constraint on IsIn’s T, much more akin to what ‘isin’ in Haskell above did:
bool IsIn<T>(T x, T[] ys)
    where T : IEquatable<T>
{
    foreach (T y in ys) {
        if (x == null ? y == null : x.Equals(y))
            return true;
    }
    return false;
}
This is cheating a little less, because we can implement an interface after-the-fact without impacting a class’s type hierarchy. This, in fact, looks remarkably similar to the Haskell ‘isin’ shown earlier, using type classes and parametric polymorphism, where here we have used interfaces in place of type classes.
We might be tempted to define a default NotEquals method over all IEquatable<T> instances, just like Haskell does by implementing the defaults for == and /= as the inverse of each other:
public static class Equatable
{
    public static bool NotEquals<T>(this IEquatable<T> @this, T other)
    {
        return !@this.Equals(other);
    }
}
This is not perfect. It is not polymorphic; see my previous post for an extensive discussion of this and related points. And what about nulls? If '@this' is null, the default implementation is going to AV. We’d need to bake in type-agnostic knowledge of null again. Sigh!
Sadly, it turns out this whole approach in general isn’t quite right anyway. For two reasons:
- First, we still infect the type in question with the interface being implemented; it cannot be done completely outside of the type’s definition, as with type classes.
- Second, type classes in Haskell do not actually require a value of the type in question to dispatch against the class’s functions, whereas we clearly do in the above example: we need to virtually dispatch against the object, and rely on this virtual dispatch to execute different code for each type. This will come up as we look at the numeric classes, but it is a critical difference.
A closer analogy is to use IEqualityComparer<T>:
public interface IEqualityComparer<T>
{
bool Equals(T x, T y);
}
(IEqualityComparer<T> in .NET also has a GetHashCode method on it. Let’s ignore that for now.)
Unfortunately, if our IsIn method were to use IEqualityComparer<T> to do its job, callers would be required to pass an instance explicitly; we cannot infer a “default” comparer based solely on the T:
bool IsIn<T>(T x, T[] ys, IEqualityComparer<T> eq)
{
foreach (T y in ys) {
if (eq.Equals(x, y))
return true;
}
return false;
}
Type classes actually function rather similarly, with two major differences:
- The interface object – called a dictionary – is passed and used implicitly.
- The mapping from types to dictionaries is done implicitly, whereas in .NET you’ll need to find an instance of the interface in question through other means.
This second difference is solved by a little hack in .NET. If you take a look at the EqualityComparer<T>.Default property, you shall see a lot of slightly gross reflection code to return an instance of IEqualityComparer<T> for any arbitrary T. The code checks some well-known types and conditions, and ultimately falls back to the aforementioned interfaces and default Equals method for the most general case. It’s not pretty, but it’s a beautiful hack given the tools at our disposal in C#.
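Here is roughly how that lookup lets IsIn recover the implicit feel of the Haskell version. This is a small sketch, not how any framework method is actually written; the DefaultComparerDemo class name is made up, while EqualityComparer<T>.Default is the real .NET property discussed above.

```csharp
using System;
using System.Collections.Generic;

public static class DefaultComparerDemo
{
    // The explicit form: the caller supplies the "dictionary" by hand.
    public static bool IsIn<T>(T x, T[] ys, IEqualityComparer<T> eq)
    {
        foreach (T y in ys) {
            if (eq.Equals(x, y))
                return true;
        }
        return false;
    }

    // The implicit-looking form: the comparer is found from T alone.
    public static bool IsIn<T>(T x, T[] ys)
    {
        return IsIn(x, ys, EqualityComparer<T>.Default);
    }
}
```

Callers can now write DefaultComparerDemo.IsIn(3, new[] { 1, 2, 3 }) without mentioning a comparer, yet the type-to-comparer mapping still happens at run time inside the library rather than being resolved by the compiler as it is for type classes.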
A Harder Case: Polymorphic Numbers, on Output Parameters
The Eq type class is easy. The functions it defines are polymorphic on their inputs, but not on their outputs; both == and /= return Bool values. Once we transition to polymorphic output parameters or return values, we encounter a pattern quite different from that which is found in most .NET interfaces.
Let’s illustrate these differences by looking at Haskell’s Num type class:
class (Eq a, Show a) => Num a where
(+), (-), (*) :: a -> a -> a
negate :: a -> a
abs, signum :: a -> a
fromInteger :: Integer -> a
Here we see another feature of Haskell type classes: inheritance. Num derives from both Eq and Show – indicated by “(Eq a, Show a) => Num a” – the latter class of which we have not yet shown but is the moral equivalent to .NET’s Object.ToString method. It enables pretty printing of values, clearly something that would be expected to be common among all numeric data types. Haskell’s numeric class hierarchy is quite elegant, enabling highly polymorphic computations. A nice little tutorial can be found here: http://www.haskell.org/tutorial/numbers.html.
But the question at hand is what the C# equivalent would be.
Our first approach would be to mimic the IEquatable<T> solution above:
interface INumeric<T>
{
T Add(T d);
T Subtract(T d);
T Multiply(T d);
T Absolute();
T FromInteger(int x);
}
This works fine, and primitive types in .NET could presumably implement it:
struct int : INumeric<int> { .. }
struct float : INumeric<float> { .. }
struct double : INumeric<double> { .. }
…
This enables polymorphic code, like a Sum method, through the use of generic type constraints:
public static T Sum<T>(params T[] values)
where T : INumeric<T>
{
T accum = default(T);
foreach (T v in values)
accum = v.Add(accum);
return accum;
}
This example works great. Why then, you might wonder, doesn’t LINQ use this instead of providing special-case overloads of Average, Min, Max, Sum, etc. for all well-known primitive data types?
The primary reason is the performance hit taken to perform addition through O(N) interface calls versus O(N) MSIL ADD instructions. It is just a basic fact of life that today’s leading edge separate compilation techniques will not achieve parity with the hand specialized variants. While it is true that the JIT compiler *could* specialize the code for specific Ts and specific interfaces to emit more efficient instructions, like int, float, etc. over INumeric<T> calls, it will not do so today. This reduces the ability to share code – which admittedly is what we want here – and is tangled up in a judgment call based on heuristics. But I digress.
There is a larger problem that arises with other examples, at least from a language expressiveness point-of-view: the need to have an instance in hand to invoke interface methods. FromInteger, for example, is rather awkward to write. In fact, we cannot write a method with INumeric<T> like we could in Haskell:
public static T MakeT<T>(int value)
where T : INumeric<T>
{
… ? …
}
How do we invoke FromInteger, given that no T is available at the time of MakeT’s invocation? You can’t; you need to arrange for an instance to be available. There are ways out of this corner. One solution is to mandate that T has a default constructor:
public static T MakeT<T>(int value)
where T : INumeric<T>, new()
{
return new T().FromInteger(value);
}
That is always acceptable for structs, since they always have such a constructor; but this practice requires that classes be designed to possibly not hold invariants at all times, and so is not always acceptable or at the very least requires design accommodation.
The alternative is probably obvious. Use a similar approach to IEqualityComparer<T>:
interface INumericProvider<T>
{
T Add(T x, T y);
T Subtract(T x, T y);
T Multiply(T x, T y);
T Absolute(T x);
T FromInteger(int x);
}
And now, of course, each method that does polymorphic number crunching must accept an instance of INumericProvider<T>. That’s particularly cumbersome, so it’s more likely that .NET developers would prefer the aforementioned approach, where the type must provide a default constructor.
Admittedly, I seldom run into this particular problem in practice; but when I do, I really wish I had something like Haskell type classes to help me out.
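For completeness, here is what the provider-passing style might look like in use. This is only a sketch: the interface is trimmed to two members to keep it short, and Int32Provider, DoubleProvider, and the Numeric helper class are hypothetical names, since .NET ships no such types.

```csharp
using System;

// Trimmed-down version of the INumericProvider<T> idea from above.
public interface INumericProvider<T>
{
    T Add(T x, T y);
    T FromInteger(int x);
}

// Hypothetical providers for two primitive types.
public sealed class Int32Provider : INumericProvider<int>
{
    public int Add(int x, int y) { return x + y; }
    public int FromInteger(int x) { return x; }
}

public sealed class DoubleProvider : INumericProvider<double>
{
    public double Add(double x, double y) { return x + y; }
    public double FromInteger(int x) { return x; }
}

public static class Numeric
{
    // No instance of T is needed to get started: zero comes from the provider.
    public static T Sum<T>(INumericProvider<T> num, params T[] values)
    {
        T accum = num.FromInteger(0);
        foreach (T v in values)
            accum = num.Add(accum, v);
        return accum;
    }

    // MakeT becomes trivial once the provider is passed explicitly.
    public static T MakeT<T>(INumericProvider<T> num, int value)
    {
        return num.FromInteger(value);
    }
}
```

A call such as Numeric.Sum(new Int32Provider(), 1, 2, 3) works, but the provider argument has to be threaded through every call site, which is exactly the cumbersome part.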
Before moving on, it is worth pointing out one Haskell type class problem that explicit interface object passing in .NET helps to avoid. Should you need multiple implementations of a given class for the same type, as is relatively common with equality comparisons, you must disambiguate in Haskell by separation by module and being careful about what you import. This is similar to C#’s extension methods. With explicitly passed interface objects, however, it is trivial to manage and pass separate objects if you’d like.
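As a quick illustration of that last point, the same IsIn can be asked the same question under two different notions of string equality simply by passing a different comparer. StringComparer.Ordinal and StringComparer.OrdinalIgnoreCase are real .NET comparers; the MultipleInstancesDemo class is made up for this sketch.

```csharp
using System;
using System.Collections.Generic;

public static class MultipleInstancesDemo
{
    // Same shape as the IEqualityComparer<T>-based IsIn shown earlier.
    public static bool IsIn<T>(T x, T[] ys, IEqualityComparer<T> eq)
    {
        foreach (T y in ys) {
            if (eq.Equals(x, y))
                return true;
        }
        return false;
    }

    // Two "instances" of equality for the one type string, chosen per call.
    public static bool ExactMatch(string x, string[] ys)
    {
        return IsIn(x, ys, StringComparer.Ordinal);
    }

    public static bool CaseInsensitiveMatch(string x, string[] ys)
    {
        return IsIn(x, ys, StringComparer.OrdinalIgnoreCase);
    }
}
```

Both comparers coexist in the same scope with no import gymnastics, because the choice is an ordinary argument rather than something resolved from the type.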
Close, but No Cigar: Higher Kinds
There is one last feature that Haskell provides – a pretty big one, I might add – that C# simply cannot do: higher kinded types, or polymorphism over constructed types. This feature is orthogonal to type classes, but gets used pervasively in conjunction with them. An example will make this stunningly clear:
class Monad m where
(>>=) :: m a -> (a -> m b) -> m b
(>>) :: m a -> m b -> m b
return :: a -> m a
fail :: String -> m a
m >> k = m >>= \_ -> k
fail s = error s
Let’s try to transcribe the core of this class in C#, renaming >>= to Bind, and omitting >> because it has a default implementation:
public interface IMonad<M, A>
{
    M<B> Bind<B>(M<A> m, Func<A, M<B>> k);
    M<A> Return(A a);
    M<A> Fail(string s);
}
This approach is tantalizingly close. It suffers from the already-admitted problem that, for any M<A> instance, you will need to pass the appropriate IMonad<M, A> provider object – just as with the IEqualityComparer<T> and INumericProvider<T> examples above.
But the code of course won’t *actually* compile, because the type variable M cannot be constructed as shown here. We find references to M<A> and M<B>, which are complete nonsense to C#. M is just a plain type variable. M is required to be what Haskell calls a type constructor (* -> *), which is a generic type that must be instantiated before it is a terminal type. I’ve written about this before. Although it seems like a trivial omission in C#’s language definition, it strikes at the heart of the type system.
A fictitious syntax for expressing this in C# might be:
public interface IMonad<M, A>
    where M : <>
{
    ...
}
And if, say, M were expected to be a two- or three-parametered type, we would find, respectively:
... where M : <,>
... where M : <,,>
And so on.
This could in theory work. But C#, and more worrisome .NET and the CLR, do not support this presently, and, to be quite honest, likely never will. It is immensely powerful, however. Life without monads is a life destined to continuous repetition. The “LINQ Pattern”, for example, is one case in .NET where, for each ‘source’ type, we must create a “copy” of the original System.Linq.Enumerable variant. And shame on those who wish to write polymorphic code that will work for any LINQ provider.
Winding Down
Let’s wind down. I need to go grab dinner at Mama's Fish House on Maui right now.
I hope to have shown some of the similarities and dissimilarities between type classes and interfaces, and some patterns that arise when these things are mixed with parametric polymorphism. The mix of inheritance for type classes, but not for implementation types, in Haskell is unique. C#, of course, allows inheritance both amongst interfaces and implementations which is both a blessing and a curse.
I do think both camps have something to teach one another. For example, having a default interface lookup mechanism for arbitrary types in C# would be wonderful, and indeed might provide a replacement for extension methods that has more longevity. I’m sure much of this will happen with time; either “in place” as the respective languages evolve, or as new languages are created with time.
But most importantly, I hope that the blog post was educational and fun. Enjoy.
 Tuesday, February 09, 2010
One of my comments in the 2nd edition of the .NET Framework design guidelines (on page 164) was that you can use extension methods as a way of getting default implementations for interface methods. We've actually begun using these techniques here on my team. To illustrate this trick, let's rewind the clock and imagine we were designing new collections APIs from day one.
Let's say we gave the core interfaces the most general methods possible. These may neither be the most user friendly overloads nor the ones that most people use all the time. They would, however, be those from which all the other convenience methods could be implemented. An INewList<T> interface that was designed with these principles in mind may look like this:
public interface INewList<T> : IEnumerable<T>
{
int Count { get; }
T this[int index] { get; set; }
void InsertAt(int index, T item);
void RemoveAt(int index);
}
This interface is missing all the nice convenience methods you will find on .NET's IList<T>, like Add, Clear, Contains, CopyTo, IndexOf, and Remove. So it's not really as nice to use. You can't write an API that takes in an INewList<T> and performs an Add against it, for example, like you can with IList<T>.
One approach to solving this might be to write a concrete class -- much like .NET's System.Collections.ObjectModel.Collection<T> -- that provides concrete implementations of all of these methods, and then other lists can simply subclass that. But we can do better.
Instead, let's give INewList<T> default implementations of all of these methods. How do we do this? That's right: with extension methods. Voila!
public static class NewListExtensions
{
public static void Add<T>(this INewList<T> lst, T item)
{
lst.InsertAt(lst.Count, item);
}
public static void Clear<T>(this INewList<T> lst)
{
int count;
while ((count = lst.Count) > 0) {
lst.RemoveAt(count - 1);
}
}
public static bool Contains<T>(this INewList<T> lst, T item)
{
return lst.IndexOf(item) != -1;
}
public static void CopyTo<T>(this INewList<T> lst, T[] array, int arrayIndex)
{
for (int i = 0; i < lst.Count; i++) {
array[arrayIndex + i] = lst[i];
}
}
public static int IndexOf<T>(this INewList<T> lst, T item)
{
var eq = EqualityComparer<T>.Default;
for (int i = 0; i < lst.Count; i++) {
if (eq.Equals(item, lst[i])) {
return i;
}
}
return -1;
}
public static bool Remove<T>(this INewList<T> lst, T item)
{
int index = lst.IndexOf(item);
if (index == -1) {
return false;
}
lst.RemoveAt(index);
return true;
}
}
Well isn't that neat. We've now given any INewList<T> implementations all these common methods without dirtying their class hierarchies, built atop a tiny core of extensibility. This is much like .NET's Collection<T> which exposes the core as abstract methods. Indeed, we can go even further. Any convenience overloads, like the multitude of CopyTos on List<T> in .NET, can be given to all INewList<T>'s also. And yet implementing INewList<T> remains as braindead simple as it was before: two properties and two methods. In fact, it's simpler than doing a more feature-rich IList<T>, because the convenience methods come for free.
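To show how little an implementer has to write, here is a sketch of a complete list type that supplies only the core members and picks up the conveniences from extension methods. SimpleList<T> is a hypothetical class that just delegates to List<T>; the interface and the two extension methods are copied from above so that the snippet stands alone.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public interface INewList<T> : IEnumerable<T>
{
    int Count { get; }
    T this[int index] { get; set; }
    void InsertAt(int index, T item);
    void RemoveAt(int index);
}

public static class NewListExtensions
{
    public static void Add<T>(this INewList<T> lst, T item)
    {
        lst.InsertAt(lst.Count, item);
    }

    public static int IndexOf<T>(this INewList<T> lst, T item)
    {
        var eq = EqualityComparer<T>.Default;
        for (int i = 0; i < lst.Count; i++) {
            if (eq.Equals(item, lst[i])) {
                return i;
            }
        }
        return -1;
    }
}

// Hypothetical implementation: two properties, two methods, plus the
// enumerator required by IEnumerable<T>. No Add or IndexOf in sight.
public sealed class SimpleList<T> : INewList<T>
{
    private readonly List<T> items = new List<T>();

    public int Count { get { return items.Count; } }
    public T this[int index]
    {
        get { return items[index]; }
        set { items[index] = value; }
    }
    public void InsertAt(int index, T item) { items.Insert(index, item); }
    public void RemoveAt(int index) { items.RemoveAt(index); }
    public IEnumerator<T> GetEnumerator() { return items.GetEnumerator(); }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

Given a SimpleList<string>, calls like list.Add("a") and list.IndexOf("a") compile and behave as if they were members, provided the namespace containing NewListExtensions is in scope.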
It would be even niftier if you could add these methods straight onto INewList<T>, and have the C# compiler emit the extension methods silently for you. In other words:
public interface INewList<T> : IEnumerable<T>
{
... interface methods (as above) ...
void Add(T item)
{
InsertAt(Count, item);
}
void Clear()
{
int count;
while ((count = Count) > 0) {
RemoveAt(count - 1);
}
}
... and so on ...
}
Although this would just be sugar for the NewListExtensions class shown earlier, it sure saves some typing and makes the pattern more apparent and first class.
Though cool, this whole idea is certainly not perfect.
For one, there are no extension properties. So you can't use this trick for properties.
But the more obvious and severe downside to this approach is that these methods are not specialized for the given concrete type. For example, the Clear method is potentially far less efficient than a hand-rolled List<T>, because it does O(N) RemoveAts rather than a single O(1) fixup of the count.
Recall now that the compiler binds more tightly to instance methods than extension methods. So we could implement our own little list class with a faster Clear method if we'd like:
class MyList<T> : INewList<T>
{
    ... the two properties and two methods from INewList<T> ...
    public void Clear()
    {
        ... efficient! ...
    }
}
Now when someone calls Clear on a MyList<T> directly, the compiler will bind to the efficient Clear.
This is still not perfect. If you pass the MyList<T> to an API that takes in an INewList<T>, any calls to Clear will fall back to the extension method. Extension methods are not virtual in any way. You can try to simulate virtual dispatch, but it gets messy quick. For example, say we defined an IFasterList<T> that includes all those convenience methods that lists frequently want to make faster; we can then do a typecheck plus virtual dispatch in the extension method.
For now, let's pretend that's just the Clear method:
public interface IFasterList<T> : INewList<T>
{
void Clear();
}
Of course, MyList<T> above would now implement IFasterList<T>. Invocations through IFasterList<T> will automatically bind to the faster variant; but if objects that implement IFasterList<T> get passed around as INewList<T>s, you lose this ability. So the Clear extension method can now do a typecheck:
public static void Clear<T>(this INewList<T> lst)
{
IFasterList<T> fstLst = lst as IFasterList<T>;
if (fstLst != null) {
fstLst.Clear();
return;
}
int count;
while ((count = lst.Count) > 0) {
lst.RemoveAt(count - 1);
}
}
This works but is obviously a tedious and hard-to-maintain solution. It would be neat if someday C# figured out a way to "magically" reconcile virtual dispatch and extension methods. I don't know if there is a clever solution out there. I am skeptical. Nevertheless, despite this flaw, the above techniques are certainly thought provoking and interesting enough to play around with and consider for your own projects. And at the very least, it's fun. Enjoy.
 Friday, January 08, 2010
Sometimes you need to wait for something before proceeding with a computation.
Perhaps you need to know the value of some integer that is being computed concurrently. Maybe you need to wait for the bytes to flush to disk before telling another process the file is consistent and ready to read. Or you need to get that next row back from the database before painting it on the UI. It could be that you need to wait for the missile to leave the bay before closing the bay door. And so on.
And sometimes there’s simply nothing better to do while waiting for these things to happen other than to let the CPU halt (or let other processes on the machine run). You need to twiddle your thumbs a bit, and exhibit a little patience. Or at least your program does. This is simply an unfortunate fact of life.
This manifests numerous ways in our programming models:
1) Waiting on an event.
2) Waiting to acquire an already-held lock.
3) Finding that the GUI message queue is empty and doing a MsgWaitForMultipleObjectsEx.
4) Finding that the COM RPC queue is empty and doing a CoWaitForMultipleHandles.
5) Issuing an Ada rendezvous ‘accept’ and finding that no messages await you, thus blocking.
6) Issuing an Erlang ‘receive’ and finding that no messages await you, thus blocking.
7) Waiting on a .NET 4.0 task.
8) Issuing a ContinueWith on a .NET 4.0 task.
9) And so on.
There are three big distinctions to make about the characteristic nature of this waiting: namely, (1) what condition's establishment is being sought -- i.e. the reason for the wait, (2) whether multiple such conditions of interest may be waited on simultaneously, and, related, (3) whether waiting for said condition(s) necessarily means that the processing of some other conditions that may arise elsewhere, but require the blocked context to run, cannot occur.
I will be the first to admit that this statement is rather abstract. But it really does matter.
For example, MsgWaitForMultipleObjectsEx is a pumping wait. Not only do you wait for the occurrence of one of several events to get set, but the arrival of a new top-level message at the message queue (either GUI or COM RPC-related) causes immediate processing of that message, presuming the thread is blocked at that call at the time. Although you can be deeply nested in some complicated code, you “jump” to the event loop to run the message handling code. Vanilla WaitForMultipleObjectsEx works in a similar way vis-à-vis APCs, provided the wait is alertable. This is quite different from a fully blocking non-pumping wait, which only waits for one or more very specific events, but does not dispatch messages simultaneously.
Win32 esoterica aside, the concepts appear elsewhere. The moral equivalent in Ada or Erlang is to do a selective-accept or -receive, intentionally not dispatching certain messages that might arrive in the meantime. (To be fair, you can also do this in COM with message filters.) This often happens when you nest accepts and receives. You may be capable of processing messages A-Z at the top-level tail recursive loop; but if that nested accept only knows about message kinds M and N, then there are 24 other kinds that will not be picked up in the meantime.
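To make the selective-receive point concrete, here is a minimal sketch of a mailbox that can either take whatever is at the head or take only the kinds a nested receive knows about. The Mailbox type and its method names are hypothetical, invented for illustration; they are not Ada, Erlang, or COM APIs, and a real receive would block rather than return null.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical single-threaded mailbox, for illustration only.
public sealed class Mailbox
{
    private readonly LinkedList<string> _queue = new LinkedList<string>();

    public int Count { get { return _queue.Count; } }

    public void Post(string kind) { _queue.AddLast(kind); }

    // Top-level loop behavior: take whatever message is at the head.
    public string ReceiveAny()
    {
        if (_queue.Count == 0) return null;
        string head = _queue.First.Value;
        _queue.RemoveFirst();
        return head;
    }

    // Nested, selective behavior: take the first message of an accepted kind
    // and leave every other kind sitting in the queue.
    public string ReceiveSelective(ICollection<string> accepted)
    {
        for (LinkedListNode<string> node = _queue.First; node != null; node = node.Next)
        {
            if (accepted.Contains(node.Value))
            {
                _queue.Remove(node);
                return node.Value;
            }
        }
        return null; // A real receive would block here; the sketch reports "nothing yet".
    }
}
```

If K is posted ahead of M, a selective receive for M or N returns M and K stays queued, unseen until some later receive is willing to take it.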
Not pumping for messages is dangerous. And it can lead to deadlock if you pump for the wrong ones. Like if you’re accepting M or N, yet the triggering of M or N depends on first processing some message K waiting in the queue. COM RPCs with cycles run face first into this. And/or not pumping can lead to responsiveness and scalability problems. Perhaps M or N eventually does arrive, yet little old K needs to wait an indeterminate amount of time before it is seen. Whereas we could have overlapped its processing. This is why most STAs pump while waiting, and, similarly, why many Erlang processes consist of a main loop that is prepared to handle any message the process accepts at that top level loop. They may seem very different but they are strikingly not.
Yet paradoxically pumping for messages is also dangerous. You must predict all the kinds of messages that may reentrantly get executed, and your state at the point of the blocking call must be consistent enough to tolerate them. (At least those that will actually happen.) In COM STAs, this can be wholly unpredictable and indeed because the CLR auto-pumps on STAs the blocking points can be hidden. Overly aggressively admitting messages may seem like the right thing to do, until you’ve wedged yourself into some unforeseen inconsistent state. You can avoid this by making each message handler atomic; see Argus. But if you can't or don't have the discipline to do that, or aren't quite sure, you must not pump. You either avoid pumping altogether or you selectively pump messages that do not touch the state encapsulated by the pump. Or you lock access to state with a non-recursive lock and run the risk of deadlock.
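The reentrancy hazard can be shown with a toy pump. This is only a sketch under simplifying assumptions: MessagePump and ReentrancyDemo are hypothetical names, everything runs on one thread, and messages are plain delegates rather than window or RPC messages.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical single-threaded message pump; not the Win32 or COM API.
public sealed class MessagePump
{
    private readonly Queue<Action> _queue = new Queue<Action>();

    public void Post(Action message) { _queue.Enqueue(message); }

    // Pumping wait: while the condition is unmet, dispatch whatever is queued.
    // Each dispatched handler runs reentrantly, on top of the waiting caller's stack.
    public void PumpingWait(Func<bool> condition)
    {
        while (!condition() && _queue.Count > 0)
        {
            _queue.Dequeue()();
        }
    }
}

public static class ReentrancyDemo
{
    // Returns the total that an audit message observes when it runs in the
    // middle of a transfer. The intended invariant is checking + savings == 100.
    public static int ObservedTotalDuringTransfer()
    {
        var pump = new MessagePump();
        int checking = 100, savings = 0, observed = -1;
        bool acknowledged = false;

        pump.Post(() => observed = checking + savings); // audit message
        pump.Post(() => acknowledged = true);           // the event the transfer waits for

        checking -= 10;                       // invariant broken here
        pump.PumpingWait(() => acknowledged); // audit is dispatched during the wait
        savings += 10;                        // invariant restored

        return observed;
    }
}
```

The audit handler runs during the wait and sees a total of 90, because the state at the blocking call was not consistent enough to tolerate it.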
I have found it clarifying to think about blocking in event loop concurrency and state machine terms, advancing from one state to the next in between waits. It’s a slippery model, but particularly when working in message passing systems that employ event loops, it can help to identify all the familiar problems with shared memory, blocking, and consistency.
Indeed it is interesting how blocking and non-blocking systems can rapidly approach each other. Starting from either extreme tempts you to tiptoe closer and closer to the middle. The familiarity of the other extreme tempts you. Until, alas, you just might meet in the middle.
 Sunday, January 03, 2010
Rewind the clock to mid-2004. Around this time awareness about the looming “concurrency sea change” was rapidly growing industry-wide, and indeed within Microsoft an increasing number of people – myself included – began to focus on how the .NET Framework, CLR, and Visual C++ environments could better accommodate first class concurrent programming. Of course, our colleagues in MSR and researchers in the industry more generally had many years’ head start on us, in some cases dating back 3+ decades. It is safe to say that there was no shortage of prior art to understand and learn from.
One piece of prior art was particularly influential on our thoughts: software transactional memory. (STM, or, in short, just TM.) In fact, right around that time, Tim Harris’s TM work grew in prominence (my first exposure arriving by way of OOPSLA’03’s proceedings, which contained the “Language Support for Lightweight Transactions” paper). TM was immediately fascinating, and simultaneously promising. For a number of reasons:
- TM hid sophisticated synchronization mechanisms under a simple veil.
- It could be implemented using sophisticated (and scalable) techniques, again under a simple veil.
- It built on decades of experience in building scalable and parallel transactional databases.
- Among others. But most of all, it was a bright shiny light in a sea of complexity.
- And how fortunate: Tim was a colleague in our neighboring MSR Cambridge offices (and still is).
In a nutshell, TM offered declarative concurrency-safety. You declare what you’d like in as few simple words as possible, and you get what you want. In this case, those simple words are ‘atomic { S; }’.
Many people latched onto TM rapidly and simultaneously, both inside and outside of Microsoft. I hacked together a little prototype built atop SSCLI (“Rotor”), and another architect on our team built an even more feature-rich prototype using MSIL rewriting. We compared notes, began jointly exploring the design space, and talking more regularly with other colleagues like Tim in MSR. Soon thereafter we kicked off a small working group with about a dozen architects and researchers from around the company, aiming to articulate what a real productized TM might look like. Fun times.
We were eventually given the OK for an official “incubation” project, and multiple years of exploration and hard work ensued. In fact, the fruits of a team of many’s labor recently got released in the form of a Community Technology Preview -- a good conduit for experimentation, but with no commitment to add it to any of Microsoft’s products. To be clear, I had only a small part to play in this ambitious project, and mostly towards the start. Partway through, I stepped away to do PLINQ and Parallel Extensions to .NET, both of which are now part of the .NET Framework 4.0. Dozens of amazing people played a significant role in the project over the years. But I am getting way ahead of myself…
I’ve been away from the nitty-gritty day-to-day details of TM for about 3 years now, which feels sufficiently long to develop a healthy perspective on the project. So here it is. What follows is of course in no way Microsoft’s “official position” on the technology, but rather my own personal one. I’ve interspersed generalizations with specific details because that’s just how my brain thinks about TM.
Towards the North Star
A wondrous property of concurrent programming is the sheer number and diversity of programming models developed over the years. Actors, message-passing, data parallel, auto-vectorization, ...; the titles roll off the tongue, and yet none dominates and pervades. In fact, concurrent programming is a multi-dimensional space with a vast number of worthy points along its many axes.
This rich history is simultaneously a blessing and a daunting curse. But in any case it can make for some very interesting multi-year-long immersion. My UW talk from 1 1/2 years ago just barely touches on the sheer breadth.
TM’s greatest virtue is the first word in its name: transactional. It turns out that, no matter your concurrent programming model du jour, three fundamental concepts crop up again and again: isolation (of state), atomicity (of state transitions), and consistency (of those atomic transitions). We use locks in shared-memory programming, coarse grained messages in message-passing, and functional programming to achieve all of these things in different ways. Transactions are another such mechanism, sure, but more than that, transactions are an all-encompassing way of thinking about how programs behave at their most fundamental core. Transaction is a religion.
Not everybody believes this, and of course why would they: it is an immensely subjective and qualitative statement. Some will claim that models like message passing entirely avoid the likes of “race conditions,” and such, but this is clearly false: state transitions are made, complicated state invariants are erected amongst a sea of smaller isolated states, and care must be taken, just as in shared memory. Even Argus, a beautiful early incarnation of message-passing (via promises) demands that messages are atomic in nature. This property is not checked and, if done improperly, leads to “races in the large.” Even Argus introduced the notion of transactions and persistence in the form of guardians.
Of course, message passing helps push you in the right direction. It is not, however, a panacea.
I was reading my ICFP proceedings recently and was reminded of research done in the context of Erlang that supports this assertion. In it, they apply CHESS-like techniques (with clever search space culling) to find race conditions. Indeed we use similar techniques very successfully for our message-passing programming models on my team here at Microsoft.
Transactions are terrific because they are “automatic”. You declare the boundaries, and the transactional machinery takes care of the rest. This is true of databases and also TM. Countless developers in the wild write massively concurrent programs by issuing operations against databases: they can do this so easily because they grok the simple façade that transactions provide. Numerous server-side state-based applications use transactions to shield programmers from the pitfalls of concurrency. Behold MSDTC. The bet we were making is that similar models would scale down just as well “in the small”.
The canonical syntactic footprint of TM is also beautiful and simple. You say:
atomic {
… concurrency-safe code goes here …
}
And everything in that block is magically concurrency-safe. (Well, you still need to ensure the consistency part, but isolation and atomicity are built-in. Mix this with Eiffel- or Spec#-style contracts and assertions like those in .NET 4.0, run at the end of each transaction, and you’re well on your way to verified consistency also. The ‘check E’ work in Haskell was right along these lines.) You can read and write memory locations, call other methods, all without worrying about whether concurrency-safety will be at risk.
For example, consider three transactions running concurrently:
int x = 0, y = 0, z = 0;
atomic {          atomic {          atomic {
    x++;              y++;              z++;
}                     x++;              y++;
                  }                     x++;
                                    }
No matter the order in which these run, the end result will be x == 3, y == 2, z == 1.
Contrast this elegant simplicity with the many pitfalls of locks:
- Data races. Like forgetting to hold a lock when accessing a certain piece of data. And other flavors of data races, such as holding the wrong lock when accessing a certain piece of data. Not only do these issues not exist under TM, but the solution is not to add countless annotations associating locks with the data they protect; instead, you declare the scope of atomicity, and the rest is automatic.
- Reentrancy. Locks don’t compose. Reentrancy and true recursive acquires are blurred together. If a locked region expects reentrancy, usually due to planned recursion, life is good; if it doesn’t, life is bad. This often manifests as virtual calls that reenter the calling subsystem while invariants remain broken due to a partial state transition. At that point, you’re hosed.
- Performance. The tension between fine-grained locking (better scalability) versus coarse-grained locking (simplicity and superior performance due to fewer lock acquire/release calls) is ever-present. This tension tugs on the cords of correctness, because if a lock is not held for long enough, other threads may be able to access data while invariants are still broken. Scalability pulls you to engage in a delicate tip-toe right up to the edge of the cliff.
- Deadlocks. This one needs no explanation.
In a nutshell, locks are not declarative. Not even close. They are not associated with the data protected by those locks, but rather the code that accesses said data. (For example: in the above code snippet, do we need three locks? Or one? Or …? Imagine we choose three: one for each variable, x, y, and z. What if we increment z, release its associated lock, and some other thread can now see the newly incremented z before the y and x get incremented? Whether this is acceptable depends on the program.) Sure, you can achieve atomicity and isolation, but only by intimately reasoning about your code and understanding the way the locks are implemented. And if you care about performance, you are also going to need to think about hardware esoterica such as CMPXCHG, spin waiting, cache contention, optimistic techniques with version counters and memory models, ABA, and so on.
The contrast is stark. Atomic-block-style transactions provide automatic serializability of whole regions of code, no matter what that code does, and the TM infrastructure does the rest, choosing between: optimistic, pessimistic, coarse, fine, etc. The linearization point of a transaction is clear: the end of the atomic block. TM can even adjust strategies based on the surrounding environment: hardware, dynamic program behavior, etc. (“Policy”.) In comparison to locks, TM is an order of magnitude simpler. There have even been studies whose conclusions support this assertion.
(Transactions unfortunately do not address one other issue, which turns out to be the most fundamental of all: sharing. Indeed, TM is insufficient – even dangerous – on its own because it makes it very easy to share data and access it from multiple threads; first class isolation is far more important to achieve than faux isolation. This is perhaps one major difference between client and server transactions. Most server sessions are naturally isolated, and so transactions only interfere around the edges. I’m showing my cards too early, and will return to this point much, much later in this essay.)
TM also has the attractive quality of automatic rollback of partial state updates. (How did I get this far without discussing rollback?) Concurrency aside, this avoids needing to write backout code to run in the face of unhandled exceptions. In retrospect this capability alone is almost enough to justify TM in limited quantities. Reams of code “out there” contain brittle, untested, and, therefore, incorrect error handling code. We have seen such code lead to problems ranging in severity: reliability issues leading to data loss, security exploits, etc. Were we to replace all those try/catch/rethrow blocks of code with transactions, we could do away with this error prone spaghetti. We’d also eliminate try/filter exploits thanks to Windows/Win32 2-pass SEH. Sometimes I wish we focused on this simple step forward, forgot about concurrency-safety, and baby stepped our way forward. Likely it wouldn’t have been enough, but I still wonder to this day.
We also toyed with the ability to replace reliability-oriented CER blocks with transactions. As you go through a transaction, there is a log of forward progress and how to undo it. So no matter the kind of failure, including OOM, you can roll back the partial state updates with zero allocation required.
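The idea of logging forward progress together with how to undo it can be sketched in a few lines. This is an illustrative toy, not the prototype's log: UndoLog and UndoLogDemo are hypothetical names, state is an int array, and unlike the zero-allocation claim above, this version allocates a closure per write while logging (only the rollback pass itself avoids new allocations).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical undo log: every write first records how to restore the old value.
public sealed class UndoLog
{
    private readonly Stack<Action> _undo = new Stack<Action>();

    public void Write(int[] cells, int index, int value)
    {
        int old = cells[index];
        _undo.Push(() => cells[index] = old); // remember how to back this write out
        cells[index] = value;                 // then update in place
    }

    // Undo the writes in reverse order, newest first.
    public void Rollback()
    {
        while (_undo.Count > 0) _undo.Pop()();
    }
}

public static class UndoLogDemo
{
    // Runs the body; if it throws, backs out its partial updates and rethrows.
    public static void Run(Action<UndoLog> body)
    {
        var log = new UndoLog();
        try
        {
            body(log);
        }
        catch
        {
            log.Rollback();
            throw;
        }
    }
}
```

A body that writes two cells and then throws leaves both cells at their original values, with no hand-written backout code in the body itself.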
At some point we began describing an ‘atomic’ block as though the program used a single global lock for all its concurrency operations. This would be grossly inefficient, of course, and fails to capture the precise isolation and rollback properties, but nevertheless conveys the basic idea. It also, as an aside, foreshadows a few of the difficult problems that lie ahead, namely strong vs. weak atomicity. Even though there is only one, if you forget to hold this one global lock while accessing shared data, you’ve still got a data race on your hands. This model won’t save you. We will return to this later on.
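As a rough illustration of that mental model, here is the three-transaction example from earlier run under one global lock. GlobalLockAtomic and its Atomic helper are hypothetical stand-ins for the ‘atomic’ keyword; the sketch provides mutual exclusion only, with no rollback and no protection for code that touches the variables outside the lock.

```csharp
using System;
using System.Threading.Tasks;

public static class GlobalLockAtomic
{
    private static readonly object s_globalLock = new object();

    // Stand-in for 'atomic { body }' under the single-global-lock mental model.
    public static void Atomic(Action body)
    {
        lock (s_globalLock) { body(); }
    }

    // Runs the three transactions concurrently and returns { x, y, z }.
    public static int[] RunThreeTransactions()
    {
        int x = 0, y = 0, z = 0;
        Parallel.Invoke(
            () => Atomic(() => { x++; }),
            () => Atomic(() => { y++; x++; }),
            () => Atomic(() => { z++; y++; x++; }));
        return new[] { x, y, z };
    }
}
```

Whatever order the three bodies run in, the result is x == 3, y == 2, z == 1, matching the claim made for the atomic blocks.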
Tough Decisions: Life as a Starving Artist
We faced some programming model decisions requiring artistic license early on.
One that we quickly decided was whether to automatically roll back a transaction in response to an unhandled exception thrown from within. Such as with this code:
atomic {
x++;
if (p)
throw new Exception("Whoops");
}
If p evaluates to true, and hence an unhandled exception is thrown, should that x++ be rolled back?
Most on the team said “Yes” as a gut reaction, whereas some argued we should require the programmer to catch-and-rollback by hand. We settled on the automatic approach because it seemed to do what you would expect in all the cases we looked at. Your transaction failed to complete normally and consistently. We also debated whether to support a unilateral “Transaction.Abort()” capability; while we agreed a “Transaction.Commit()” would be silly – the only way to commit a transaction being to reach its end non-exceptionally – the jury remained split on unilateral abort. We eventually found that, particularly when nesting is involved, the ability to detect a dire problem with the universe and bail unilaterally can be useful.
And we also hit some tough snags early on. Some were trivial, like what happens when an exception is thrown out of an atomic block. Of course that exception was likely constructed within the atomic block (‘throw new SomeException()’ being the most common form of ‘throw’), so we decided we probably need to smuggle at least some of that exception’s state out of the transaction. Like its stack trace. And perhaps its message string. I wrote the initial incarnation of the CLR exception subsystem support, and stopped at shallow serialization across the boundary. But this was a slippery slope, and eventually the general problem was seen, leading to more generalized nesting models (which I shall describe briefly below). Another snag, which was quite non-trivial, was the impact to the debugging experience. Depending on various implementation choices – like in-place versus buffered writes – you may need to teach the debugger about TM intrinsically. And some of the challenges were fundamental to building a first class TM implementation. Clearly the GC needed to know about TM and its logging, because it needs to keep both the “before” and “after” state of the transaction alive, in case it needed to roll back. The JIT compiler was very quickly splayed open and on the surgery table. And so on.
Throughout, it became abundantly clear that TM, much like generics, was a systemic and platform-wide technology shift. It didn’t require type theory, but the road ahead sure wasn’t going to be easy.
So we knocked down many early snags, and kept plowing forward, eagerly and excitedly. None of these challenges were insurmountable. We remained hopeful and happy (perhaps even blissful) to continue exploring the space of possible solutions. More irksome snags lurked right around the corner, however. And little did we know that some decisions we were about to make would subject us to some of the biggest such snags. TM’s greatest feature – slap an atomic around a block of code and it just gets better – would turn out to be its greatest challenge… but alas, I am again jumping ahead; more on that later.
Turtles, but How Far Down? Or, Bounded vs. Unbounded Transactions
Not all transactions are equal. There is a broad spectrum of TMs, ranging from those that are bounded to updating, say, 4 memory locations or cache lines, to those that are entirely unbounded. Indeed TM blurs together with other hardware-accelerated synchronization techniques, like speculative lock elision (SLE). The more constrained TM models are often hardware-hybrids, and the limitations imposed are typically due to physical hardware constraints. Models can be pulled along other axes, however, such as whether memory locations must be tagged in order to be used in a transaction or not, etc. Haskell requires this tagging (via TVars) so that side-effects are evident in the type system as with any other kind of monad.
We quickly settled on unbounded transactions. Everything else looked like multi-word CAS and, although we knew multi-word CAS would be immensely useful for developing new lock-free algorithms, our aim was to build something radically new and with broader appeal. If we ended up with a hardware-hybrid, we would expect the software to pick up the slack; you’d get nice acceleration within the hardware constraints, and then “fall off the silent cliff” to software emulation thereafter. Thus the unbounded approach was chosen.
In hindsight, this was a critical decision that had far-reaching implications. And to be honest, I now frequently doubt that it was the right call. We had our hearts in the right places, and the entire industry was trekking down the same path at the same time (with the notable exception of Haskell). But with the wisdom of age and hindsight, I do believe limited forms of TM could be wildly successful at particular tasks and yet would have avoided many of the biggest challenges with unbounded TM.
And believe me: many such challenges arose in the ensuing months.
An example of one challenge that didn’t threaten the model of TM per se, but sure did make our lives more difficult, is the compilation strategy we were forced to adopt. Transactions cost something. To transact a read or write entails a non-trivial amount of extra work; we spent a lot of time optimizing away redundant work, and developing new optimizations that reduced the overhead of TM. But at the end of the day, the cost is not zero – and in fact, the common case is far from it. Imagine you have an unbounded transaction model and are faced with compiling a particular method from MSIL to native code. A simple separate-module-based compiler (i.e. not whole-program) will not necessarily know whether this method will get called from a transaction, or from non-transactional code, such that in the worst case the method must be prepared for transactional access. There are a variety of techniques to use to produce code that supports both: the two extremes are (1) cloning, or (2) sharing w/ conditional dynamic checking. Neither extreme is particularly attractive, and this choice represents a classic space-time tradeoff that entails finding a reasonable middle ground. A JIT compiler can dynamically produce the version that is needed at the moment, but offline compilers – like the CLR’s NGEN – do not have this luxury. And within Microsoft at least, and among shrink-wrap ISVs, offline compilation is of greater importance than JIT compilation. For better or for worse.
The model of unbounded transactions is the hard part. You surround any arbitrary code with ‘atomic { … }’ and everything just works. It sounds beautiful. But just think about what can happen within a transaction: memory operations, calls to other methods, P/Invokes to Win32, COM Interop, allocation of finalizable objects, I/O calls (disk, network, databases, console, …), GUI operations, lock acquisitions, CAS operations, …, the list goes on and on. Versus bounded transactions, where we could say something like: if you do more than N things, the transaction will fail to run – deterministically.
Unbounded really was the golden nugget. But we should not be shy about what this decision implies.
Implementing the Idea
This leads me to a brief tangent on implementation. Given that we didn’t implement TM with a single global lock, as the naïve mental model above suggests, you may wonder how we actually did do it. Three main approaches were seriously considered:
- IL rewriting. Use a tool that passes over the IL post-compilation to inject transaction calls.
- Hook the (JIT) compiler. The runtime itself would intrinsically know how to inject such calls.
- Library-based. All transactional operations would be explicit calls into the TM infrastructure.
Approaches #1 and #2 would look similar, but #3 would be quite different. Instead of:
atomic {
x++;
}
Or:
Atomic.Run(() => {
x++;
});
You might say something like:
Atomic.Run(() => {
Atomic.Write(ref x, Atomic.Read(ref x) + 1);
});
With enough language work, we could have tried to desugar the former into the latter, but when you start crossing method boundaries, everything gets more complicated. (Do you create transactional clones of every method, and rewrite calls from ordinary methods to the transactional clone? This is easy to do with a rewriter or compiler, but quite difficult with a pure library approach.) We also knew we’d need to do some very sophisticated compiler optimizations to get TM’s performance to an acceptable level. So we chose approach #2 for our “real” prototype, and never looked back.
After this architectural approach was decided, a vast array of interesting implementation choices remained.
We moved on to building the primitive library with all the TM APIs that the JIT would introduce calls into. We quickly settled on an approach much like Harris’s (and, at the time, pretty much the industry/research standard): optimistic reads, in-place pessimistic writes, and automatic retry. That means reads do not acquire locks of any sort, and instead, once the end of the transaction has been reached, all reads are validated; if any locations read have been modified concurrently (or an uncommitted value was read), the whole transaction is thrown away and reattempted from the start. Writes work like locks. This approach makes reads cheap: a single read consists of reading the value, and a version number whose address is at a statically known offset. No interlockeds. This is great since reads typically far outnumber writes. Down the line, we explored adding more sophisticated policy than this, which I will detail in brief below.
So the compiler would inject hooks for the above code like so:
while (true) {
    TX tx = new TX();
    try {
        // x++;
        tx.OpenReadOptimistic(ref x);
        int tmp = x;
        tx.OpenWritePessimistic(ref x);
        x = (tmp + 1);
        if (!tx.Validate())
            continue; // a read was invalidated: retry from the top
        tx.Commit();
        break; // committed: leave the retry loop
    }
    catch {
        tx.Rollback();
        throw;
    }
}
Notice there are some obvious overheads in here:
- The atomic block becomes a loop (to support automated retry).
- A new transaction must be allocated and likely placed in TLS (if methods are called).
- A try/catch block is used to initiate rollback on unhandled exceptions.
- Each unique location read in a block requires at least one call to OpenReadOptimistic.
- Each unique location written requires at least one call to OpenWritePessimistic.
- Each location read must be validated (at Validate), and finally the transaction is committed (at Commit).
Much of the work in the compiler was meant to reduce these overheads. For example, if the same location is read multiple times, there’s no need to call OpenReadOptimistic more than once. If the compiler can statically detect this, it may elide some of the calls. If the same location is read and then written – as in the above example – only the write lock must be acquired. If no methods are called, the transaction object can be enregistered, and we needn’t add it to TLS so long as the exception trap code knows how to move it from register to TLS on demand. Et cetera.
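The bookkeeping behind optimistic reads and end-of-transaction validation can be sketched as follows. This is a single-threaded toy that shows only the record-then-recheck idea; VersionedCell, ReadLog, and Writer are hypothetical names, and a real implementation would also need memory ordering, write locks, and detection of uncommitted values.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration; not the prototype's actual data structures.
public sealed class VersionedCell
{
    public int Value;
    public int Version; // bumped each time a writer commits a new value
}

public sealed class ReadLog
{
    private readonly List<KeyValuePair<VersionedCell, int>> _reads =
        new List<KeyValuePair<VersionedCell, int>>();

    // Optimistic read: take no lock, just remember which version was seen.
    public int Read(VersionedCell cell)
    {
        _reads.Add(new KeyValuePair<VersionedCell, int>(cell, cell.Version));
        return cell.Value;
    }

    // Validation: every cell read must still carry the version recorded at read time.
    public bool Validate()
    {
        foreach (KeyValuePair<VersionedCell, int> entry in _reads)
        {
            if (entry.Key.Version != entry.Value) return false;
        }
        return true;
    }
}

public static class Writer
{
    // Stand-in for a committed write by some other transaction.
    public static void CommitWrite(VersionedCell cell, int value)
    {
        cell.Value = value;
        cell.Version++;
    }
}
```

A log that read a cell validates successfully until some writer commits to that cell; after that, validation fails and the transaction would be retried.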
There are other overheads that are not so obvious. Optimistic reads mandate that there is a version number for each location somewhere, and pessimistic writes mandate that there is a lock for each location somewhere.
A straightforward technique is to use a hashing scheme to associate locations with this auxiliary data: each address is hashed to index into a table of version numbers and locks. This leads to false sharing, of course, but reduces space overhead and makes lookup fast. Unfortunately, in a garbage collected environment, addresses are not stable and therefore hashing becomes complicated. You can use object hash codes for this purpose, but .NET hash codes are overridable; and generating them is not nearly as cheap as using the memory location’s address, which by definition is already in-hand. Other alternatives of course exist. You can associate version numbers and locks with the objects themselves, just like monitors and object headers/sync-blocks in the CLR: this provides object-granularity locking. Ahh, the age old tension of fine vs. coarse grained locking comes up again.
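A minimal sketch of the hashing scheme, and of the false sharing it causes, might look like this. It assumes stable, 4-byte-aligned addresses passed in as plain long values (exactly the assumption that a moving garbage collector breaks), and the VersionTable type is a hypothetical name.

```csharp
using System;

// Hypothetical table mapping addresses to shared version slots.
public sealed class VersionTable
{
    private readonly int[] _versions;
    private readonly int _mask;

    public VersionTable(int slotCount)
    {
        if (slotCount <= 0 || (slotCount & (slotCount - 1)) != 0)
            throw new ArgumentException("slotCount must be a power of two", "slotCount");
        _versions = new int[slotCount];
        _mask = slotCount - 1;
    }

    // Drop the low two bits (assumes 4-byte-aligned locations), then mask into the table.
    public int SlotFor(long address) { return (int)((address >> 2) & _mask); }

    public int VersionOf(long address) { return _versions[SlotFor(address)]; }

    public void BumpVersion(long address) { _versions[SlotFor(address)]++; }
}
```

With 16 slots, addresses 0x1000 and 0x1040 land in the same slot, so a write to one invalidates readers of the other even though the locations are unrelated. That is the price paid for the small, fast table.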
We eventually realized we’d want both optimistic and pessimistic reads, the latter of which worked a lot like reader/writer locks. We crammed all these into a clever little word-sized data structure which worked a lot like Vista’s SRWL data structure. Except that it also contained a version number.
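To give a feel for packing a lock and a version number into one word, here is a sketch with an invented layout: bit 0 is the writer bit, bits 1 to 7 hold a reader count, and the remaining bits hold the version. The layout, the TxWord name, and the acquire/release protocol are all assumptions for illustration; neither the prototype's word nor the Windows SRW lock is claimed to be laid out this way.

```csharp
using System;
using System.Threading;

// Hypothetical word layout: [ version : 24 bits | readers : 7 bits | writer : 1 bit ]
public static class TxWord
{
    private const uint WriterBit = 1u;
    private const int ReaderShift = 1;
    private const uint ReaderMask = 0x7Fu << ReaderShift;
    private const int VersionShift = 8;

    public static bool IsWriteLocked(uint word) { return (word & WriterBit) != 0; }
    public static uint ReaderCount(uint word) { return (word & ReaderMask) >> ReaderShift; }
    public static uint Version(uint word) { return word >> VersionShift; }

    public static uint Pack(uint version, uint readers, bool writer)
    {
        return (version << VersionShift)
             | ((readers << ReaderShift) & ReaderMask)
             | (writer ? WriterBit : 0u);
    }

    // Pessimistic write: succeeds only if there is no writer and no pessimistic reader.
    public static bool TryAcquireWrite(ref int word)
    {
        int seen = Volatile.Read(ref word);
        uint bits = unchecked((uint)seen);
        if (IsWriteLocked(bits) || ReaderCount(bits) != 0) return false;
        int locked = unchecked((int)(bits | WriterBit));
        return Interlocked.CompareExchange(ref word, locked, seen) == seen;
    }

    // Releasing the write lock publishes a new version in the same store.
    public static void ReleaseWrite(ref int word)
    {
        uint bits = unchecked((uint)Volatile.Read(ref word));
        uint next = Pack(Version(bits) + 1, 0, false);
        Volatile.Write(ref word, unchecked((int)next));
    }
}
```

Because the lock state and the version live in one word, an optimistic reader can fetch both with a single load and later revalidate with a single compare.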
It was always surprising to me what strange things in the runtime we bumped up against. We realized a nice GC optimization: instead of keeping strong references to all intermediary states in a transaction log, we could keep weak references to all but the “before” and “after” state. This is important when transacting synthetic situations like this:
static BigHonkinFoo s_f;
…
atomic {
for (int i = 0; i < 1000000; i++)
s_f = new BigHonkinFoo();
}
Of course you wouldn’t write that code exactly. But there’s no need to keep alive all but the s_f that existed prior to entering the atomic block and the current one at any given time. But this leads to particularly hairy finalization issues. If a finalizable object is allocated within a transaction (say BigHonkinFoo), and is then reclaimed, its Finalize() method will be scheduled to run on a separate thread. Yet the transaction log may contain references to it. Thus there is a race between the transaction’s final outcome and the invocation of the finalizer. We came up with a clever solution for this, but there were countless other clever solutions for various things not worth diving too deep into.
Hacking is fun. However, it was not going to be what made or broke TM as a model.
Disillusionment Part I: the I/O Problem
It wasn’t long before we realized another sizeable, and more fundamental, challenge with unbounded transactions. Finalizers touched on this. What do we do with atomic blocks that do not simply consist of pure memory reads and writes? (In other words, the majority of blocks of code written today.) This was not just a pesky question of how to compile a piece of code, but rather struck right at the heart of the TM model.
You already saw the OpenReadOptimistic, OpenWritePessimistic, Validate, Commit, and Rollback pseudo-TM infrastructure calls, each of which operated on memory locations. But what about a read or write from a single block or entire file on the filesystem? Or output to the console? Or an entry in the Event Log? What about a web service call across the LAN? Allocation of some native memory? And so on. Ordinarily these kinds of operations will be composed with other memory operations, with some interesting invariant relationship holding between the disparate states. A transaction comprised of a mixture still ought to remain atomic and isolated.
The answer seemed clear. At least in theory. The transaction literature, including Reuter and Gray’s classic, had a simple answer for such things: on-commit and on-rollback actions, to perform or compensate the logical action being performed, respectively. (Around this same time, the Haskell folks were creating just this capability in their STM, where ‘onCommit’ and ‘onRollback’ would take arbitrary lambdas to execute the said operation at the proper time.) Because we were working primarily in .NET – with a side project targeting C++ -- we decided to use the new System.Transactions technology in 2.0 to hook into inherently transactional resources, like transacted NTFS, registry, and, of course, databases.
(Digging through my blog, I found this article written back in June 2006 about building a volatile resource manager for memory allocation/free operations, just as an example.)
This worked, though we were quite obviously swimming upstream. Numerous challenges confronted us.
A significant problem was that not all operations are inherently transactional, so in many cases we were faced with the need to add faux transactions on top of existing non-transactional services. (Already-transactional services were easy, like databases. Except that mixing fine-grain TM transactions with distributed DTC transactions makes my skin crawl.) For example, how would you undo a write to the console? Well, you can’t, really. So we decided maybe the right default for Console.WriteLine was to use an on-commit action to perform the actual write only once the transaction had committed.
But in even thinking this thought, we realized we were standing on shaky ground. What if the WriteLine was followed by something like a ReadLine, for example, where the program was meant to wait for the user to enter something into the console (likely in response to the prompt output by WriteLine)? (This example is a toy, of course, but represents a more fundamental pattern common in networked programs.) The basic problem was immediately clear. Adding isolation to an existing non-isolated operation is not always behavior-preserving, particularly when I/O is involved. Sometimes it is necessary to step outside of the isolation that would otherwise get poured on top by a simple transactional model.
This particular problem isn’t specific to traditional I/O per se.
Foreign function interface calls through .NET’s P/Invoke suffer from similar problems. A call to CreateEvent may be compensatable (via an on-rollback action) with a call to CloseHandle. But this is flawed. Once that event’s HANDLE is requested, and/or it is passed to other Win32 APIs like MsgWaitForMultipleObjects, then the isolation of the faux transaction is broken, and real state must be provided to the Win32 APIs. And if another thread were to look up that HANDLE – perhaps through a name given to it in the call to CreateEvent – it may be able to see and interact with that event before the enclosing transaction has been committed. The abstraction leaks. And even if the abstraction were perfect, it is obvious there’s quite a bit of work to be done in order to transact all the touch points between .NET and Win32, of which there are many. And I mean many.
Other issues wait just around the corner. For example, how would you treat a lock block that was called from within a transaction? (You might say “that’s just not supported”, but when adding TM to an existing, large ecosystem of software, you’ve got to draw the compatibility line somewhere. If you draw it too carelessly, large swaths of existing software will not work; and in this case, that often meant that we claimed to provide unbounded transactions, and yet we would be placing bounds on them such that a lot of existing software could not be composed with transactions. Not good.) A seemingly straightforward answer is to treat a lock block like an atomic block. So if you encounter:
atomic {
lock (obj) { … }
}
it is logically transformed into:
atomic {
atomic { … }
}
On the face of it, this looks okay. (Forget problems like freeform use of Monitor.Enter/Exit for now.) We’re strengthening the atomicity and isolation, so what could go wrong? Well, it turns out that examples like this can also suffer from the “too much isolation” problem. Adding transactions to a lock-block extends the lifetime of the isolation of that particular block’s effects, possibly leading to lack of forward progress. In fact, you don’t need locks to illustrate the problem. Imagine a simple lock-free algorithm that communicates between threads using shared variables:
volatile int flag = 0;
…
// Thread 1:
flag = 1;
while (flag == 1) ;
// Thread 2:
while (flag != 1) ;
flag = 2;
If you invoke this code from within a transaction (on each thread), you’re apt to lead to deadlock. Both transactions’ effects will be isolated from the others’, whereas we are quite obviously intending to publish the updates to the flag variable immediately.
Anyway, the whole lock thing is a bit of a digression. The simple fact is that very little .NET code would actually run inside an atomic block but for things like collections and pure computations due to the I/O problem. You can develop one-off solutions for each problem that arises – and indeed we did so for many of them – and even hang those solutions underneath one general framework – like System.Transactions – but you cannot help but eventually become overwhelmed by the totality of the situation. The team experimented with static checking to turn these dynamic failures into static ones, but this only marginally improved matters.
I could go on and on about the I/O problem, its various incarnations, and what we did about it. Instead I will sum it up: this problem was, and still is, the “elephant in the room” threatening unbounded TM’s broader adoption.
The question ultimately boils down to this: is the world going to be transactional, or is it not?
Whether unbounded transactions foisted upon the world will succeed, I think, depends intrinsically on the answer to this question. It sure looked like the answer was going to be “Yes” back when transactional NTFS and registry were added to Vista. But the momentum appears to have slowed dramatically.
Nesting
Let’s get back to some fun, less depressing material. There are more surprises lurking ahead.
I already mentioned a great virtue of transactions is their ability to nest. But I neglected to say how this works. And in fact when we began, we only recognized one form of nesting. You’re in one atomic block and then enter into another one. What happens if that inner transaction commits or rolls back, before the fate of the outer transaction is known? Intuition guided us to the following answer:
- If the inner transaction rolls back, the outer transaction does not necessarily do so. However, no matter what the outer transaction does, the effects of the inner will not be seen.
- If the inner transaction commits, the effects remain isolated in the outer transaction. It “commits into” the outer transaction, we took to saying. Only if the outer transaction subsequently commits will the inner’s effects be visible; if it rolls back, they are not.
For example, consider this code:
void f() {
    atomic { // Tx0
        x++;
        try {
            g();
        } catch {
            if (p0)
                throw;
        }
        if (p2)
            throw new FooException();
    }
}
void g() {
    atomic { // Tx1
        y++;
        if (p1)
            throw new BarException();
    }
}
Imagine x = y = 0 at the start, and we invoke f. Many outcomes are possible.
- If p1 is true, g will throw an exception, aborting Tx1’s write to y. There are then two possibilities. (1) If p0 is true, the exception is repropagated and Tx0 will also abort, rolling back its write to x; this leaves x == y == 0. (2) If p0 is false, the exception is swallowed, and Tx0 proceeds to commit its write to x; this leaves x == 1, whereas y == 0.
- If p1 is false, on the other hand, g will not throw anything. Tx1 will commit its write to y “into” the outer transaction Tx0. One of two outcomes will now occur depending on the value of p2. (1) If p2 is true, an exception is thrown out of f, and Tx0 rolls back both the inner transaction Tx1’s effects and its own, leaving x == y == 0. (2) Else, f completes ordinarily, and Tx0 commits both Tx1’s and its own effects, leading to x == y == 1.
We expected most people’s intuition to match this behavior.
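The closed-nesting rules can be sketched with per-transaction write buffers. What follows is a single-threaded model for illustration, not how the TM described here was implemented: the Tx class, its string-keyed "memory", and NestingDemo are all hypothetical, and InvalidOperationException stands in for both FooException and BarException.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical single-threaded model of closed nesting; illustrative only.
public sealed class Tx
{
    private readonly Tx parent;                       // null for a top-level transaction
    private readonly Dictionary<string, int> writes = new Dictionary<string, int>();
    private readonly Dictionary<string, int> memory;  // the shared "world"

    public Tx(Dictionary<string, int> memory, Tx parent = null)
    {
        this.memory = memory;
        this.parent = parent;
    }

    // Reads see this transaction's writes, then its ancestors', then memory.
    public int Read(string name)
    {
        for (Tx t = this; t != null; t = t.parent)
        {
            int v;
            if (t.writes.TryGetValue(name, out v)) return v;
        }
        return memory[name];
    }

    public void Write(string name, int value) { writes[name] = value; }

    // An inner commit merges into the parent's buffer; only a top-level
    // commit reaches memory.
    public void Commit()
    {
        Dictionary<string, int> target = parent != null ? parent.writes : memory;
        foreach (KeyValuePair<string, int> w in writes) target[w.Key] = w.Value;
        writes.Clear();
    }

    // Rolling back just drops the buffer; nothing outside ever saw it.
    public void Rollback() { writes.Clear(); }
}

public static class NestingDemo
{
    // Mirrors f and g from the example above; returns the final { x, y }.
    public static int[] Run(bool p0, bool p1, bool p2)
    {
        var memory = new Dictionary<string, int> { { "x", 0 }, { "y", 0 } };
        var tx0 = new Tx(memory);
        try
        {
            tx0.Write("x", tx0.Read("x") + 1);
            try
            {
                G(memory, tx0, p1);
            }
            catch (InvalidOperationException)
            {
                if (p0) throw;
            }
            if (p2) throw new InvalidOperationException("Foo");
            tx0.Commit();
        }
        catch (InvalidOperationException)
        {
            tx0.Rollback();
        }
        return new[] { memory["x"], memory["y"] };
    }

    private static void G(Dictionary<string, int> memory, Tx outer, bool p1)
    {
        var tx1 = new Tx(memory, outer);
        tx1.Write("y", tx1.Read("y") + 1);
        if (p1)
        {
            tx1.Rollback();
            throw new InvalidOperationException("Bar");
        }
        tx1.Commit();   // commits "into" Tx0, not into memory
    }
}
```

In this model, Run(p0: false, p1: true, p2: false) leaves x == 1 and y == 0, Run(false, false, false) leaves x == y == 1, and any path on which Tx0 rolls back leaves x == y == 0.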
The canonical working example was a BankAccount class:
class BankAccount {
decimal m_balance;
public void Deposit(decimal delta) {
atomic { m_balance += delta; }
}
public static void Transfer(
BankAccount a, BankAccount b, decimal delta) {
atomic {
a.Deposit(-delta);
b.Deposit(delta);
}
}
}
This was an illustrative and beautiful example. It made beautiful slide-ware. We are composing the Deposit operations of two separate bank accounts into a single Transfer method. Of course the a.Deposit(-delta) and b.Deposit(delta) calls must be made atomic together; otherwise a failure could lead to missing money, or someone could witness the world with the money in transit (and nowhere except on one thread’s stack) rather than having been transferred atomically. And building the same thing with locks is frustratingly difficult: using fine-grained per-account locks can lead to deadlock very quickly.
Intuitively we walked down many variants of this mode of nesting. We reacquainted ourselves with Moss’s great dissertation on the topic, and remembered this intuitive nesting mode as closed nested transactions. And we shortly recognized the need for another mode: open nested transactions.
To motivate the need for open nesting, imagine we’ve got a hashtable whose physical storage is independent from its logical storage. Resizing the table of buckets, for example, has little to do with whether a particular {key, value} pair exists within those buckets. The resizing operation, in fact, is logically idempotent and isolated: the same set of keys will exist within the table both before and after such an operation. So we can actually commit the physical effects of such an operation eagerly. With a naïve TM implementation, two independent keys hashing to the same bucket will conflict, and the reads and writes for such operations will live as long as the enclosing user-level transactions. Instead, we can serialize logical operations with respect to one another at a “higher level” than physically independent operations do, leading to greater concurrency. Two transactions will only conflict in long-running transactions if they truly operate on the same keys, rather than just happening to hash to the same bucket.
Open nesting forced us to contemplate the sharing of state between outer and inner transactions more deliberately, and gave us some troubles syntactically. We had wanted to say:
atomic { // ordinary closed nesting.
Foo f = new Foo();
atomic(open) { // open nesting.
… f? …
}
}
But is it really legal for the inner transaction here to access the ‘f’, which has been constructed and is presumably uncommitted in the outer transaction? With closed nested transactions there is lock compatibility between the outer and inner transactions. An inner closed nested transaction can of course read a memory location write-locked by the outer transaction, for example. However, the same must not be true of open nesting, because an open nested transaction commits “to the world” rather than into its outer transaction. Allowing it to read and then potentially publish uncommitted state would violate serializability. It’s possible that the inner open nested transaction will commit, whereas the outer will roll back. (The reverse situation is equally problematic.) And yet it’s darn useful to pass state from an outer to an inner transaction – and indeed, often impossible to do anything otherwise. But what if the key itself were a complicated object graph rather than a value, and the key bleeds across transaction boundaries?
Many issues like this arose. Our straightforward answer was that only pass-by-value worked across such a boundary. I don’t think we ever found nirvana here.
We developed other transaction modes also.
As we added data parallel operations within a nested transaction, we realized that we’d need something a lot like closed nesting but with special accommodation for intra-transaction parallelism. This led us to parallel nested transactions, enabling lock sharing from a parent to its many data parallel children. These children could not communicate with one another other than to “commit into” the parent, and subsequently reforking, thereby ensuring non-interference between them. Of course children could share read-locks amongst one another, just not write locks.
And we continued to reject the temptation of adding weakened serializability modes a la relational databases (unrepeatable reads, etc). Although we expected this to arise out of necessity with time, it never did; the various nesting modes we provided seemed to satisfy the typical needs.
A Better Condition Variable
Here’s a brief aside on one of TM’s bonus features.
Some TM variants also provide for “condition variable”-like facilities for coordination among threads. I think Haskell was the first such TM to provide a ‘retry’ and ‘orElse’ capability. When a ‘retry’ is encountered, the current transaction is rolled back, and restarted once the condition being sought becomes true. How does the TM subsystem know when that might be? This is an implementation detail, but one obvious choice is to monitor the reads that occurred leading up to the ‘retry’ – those involved in the evaluation of the predicate – and once any of them changes, to reschedule that transaction to run. Of course, it will reevaluate the predicate and, if it has become false, the transaction will ‘retry’ again.
A simple blocking queue could be written this way. For example:
object TakeOne()
{
atomic {
if (Count == 0)
retry;
return Pop();
}
}
If, upon entering the atomic block, Count is witnessed as being zero, we issue a retry. The transaction subsystem notices we read Count with a particular version number, and then blocks the current transaction until Count’s associated version number changes. The transaction is then rescheduled, and races to read Count once again. After Count is seen as non-zero, the Pop is attempted. The Pop, of course, may fail because of a race – i.e. we read Count optimistically without blocking out writers – but the usual transaction automatic-reattempt logic will kick in to mask the race in that case.
The ‘orElse’ feature is a bit less obvious, though still rather useful. It enables choice among multiple transactions, each of which may end up issuing a ‘retry’. I don’t think I’ve seen it in any TMs except for ours and Haskell’s.
To illustrate, imagine we’ve got 3 blocking queues like the one above. Now imagine we’d like to take from the first of those three that becomes non-empty. ‘orElse’ makes this simple:
BlockingQueue bq1 = …, bq2 = …, bq3 = …;
atomic {
object obj =
orElse {
bq1.TakeOne(),
bq2.TakeOne(),
bq3.TakeOne()
};
}
While ‘orElse’ is perhaps an optional feature, you simply can’t write certain kinds of multithreaded algorithms without ‘retry’. Anything that requires cross-thread communication would need to use spin variables.
Deliberate Plans of Action: Policy
I waved my hands a bit above perhaps without you even knowing it. When I talk about optimistic, pessimistic, and automatic retry, I am baking in a whole lot of policy. It turns out there is a wide array of techniques. The simplest question we faced early on was, when an optimistic read fails to validate at the end of a transaction, when should we reattempt execution of that transaction?
The naïve answer is “immediately”. But obviously that would lead to livelock under some conditions. A more reasonable answer is “spin for N cycles and then retry”. But this too can lead to livelock. A better answer is to either choose some random strategy, or to make an intelligent adaptive choice. We experimented with many such variants, including random backoff, sophisticated waiting and signaling based on the memory locations in question, among others. We even played games like giving transactions karma points for cooperatively acquiescing to other competing transactions, and allowing those transactions with the most karma points to make more forward progress before interrupting them.
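As one concrete point in this design space, here is a sketch of randomized exponential backoff. It is an assumption-laden illustration rather than the policy the team settled on: RetryPolicy, BackoffCeiling, and the attempt delegate (standing in for "run the transaction body and try to validate it") are hypothetical names, and the constants 16 and 1 << 16 are arbitrary.

```csharp
using System;
using System.Threading;

// Hypothetical retry policy for illustration. The attempt delegate stands in
// for "run the transaction body and try to validate/commit", returning false
// when a conflict forces a reattempt.
public static class RetryPolicy
{
    // Largest spin count allowed before the given (0-based) reattempt.
    public static int BackoffCeiling(int attempt, int baseSpins, int maxSpins)
    {
        // Double the window on each attempt; cap the shift and the result so
        // the value neither overflows nor grows without bound.
        long ceiling = (long)baseSpins << Math.Min(attempt, 20);
        return (int)Math.Min(ceiling, maxSpins);
    }

    // Runs attempt until it succeeds or maxAttempts is reached. Returns the
    // number of attempts made, or -1 if every attempt failed.
    public static int Run(Func<bool> attempt, int maxAttempts, Random rng)
    {
        for (int i = 0; i < maxAttempts; i++)
        {
            if (attempt()) return i + 1;
            // A random wait inside the window helps competing transactions
            // fall out of lockstep instead of colliding again.
            int spins = rng.Next(1, BackoffCeiling(i, 16, 1 << 16) + 1);
            Thread.SpinWait(spins);
        }
        return -1;
    }
}
```

System.Random is not thread-safe, so in a real multithreaded setting each thread would want its own instance. The randomness is the point: a fixed "spin for N cycles" delay lets two conflicting transactions keep retrying in step with one another.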
A few good papers supplied useful (and entertaining) reading material on the topic, but to be honest, nobody had a good answer at the time. Thankfully these are all implementation details. So we were free to experiment.
Deadlock breaking also requires policy. Thankfully we can actually roll back the effects of transactions engaged in a deadly embrace with TM, so we merely need to know how often to run the deadlock detection algorithm. There was a similar problem when deciding to back off outer layers of nesting, and in fact this becomes more complicated when deadlocks are involved. Imagine:
// Thread 1:
atomic {
    x++;
    atomic {
        y++;
    }
}
// Thread 2:
atomic {
    y++;
    atomic {
        x++;
    }
}
This deadlock-prone example is tricky because rolling back the inner-most transactions won’t be sufficient to break the deadlock that may occur. Instead the TM policy manager needs to detect that multiple levels of nesting are involved and must be blown away in order to unstick forward progress.
Another variant that went beyond deciding when to favor one transaction over another was to upgrade to pessimistic locking if optimistic let us down. The whole justification behind optimistic is that, …well, we’re optimistic that conflicts won’t happen. So it seems reasonable that, if they do occur, we fall back to something more, …well, pessimistic. There is a dial here too. Perhaps you only want to fall back to pessimistic after failing optimistically N times in a row, where N > 1. As I mentioned above, our single-word lock associated with each object supported both locking and versioning cheaply.
Disillusionment Part II: Weak or Strong Atomicity?
All along, we had this problem nipping at our heels. What happens if code accesses the same memory locations from inside and outside a transaction? We certainly expected this to happen over the life of a program: state surely transitions from public and shared among threads to private to a single thread regularly. But if some location were to be accessed transactionally and non-transactionally at the same time, we’d (presumably) have a real mess on our hands. A supposedly atomic, isolated, etc. transaction would no longer be protected from the evils of racy code.
For example:
// Thread 1:
atomic { // Tx0
    x++;
}
// Thread 2:
x++; // No-Tx
Can we make any statements about the value of x after Tx0 commits (or rolls back)? Not really. It depends on the way the particular TM being used has been implemented. An in-place model that rolls back could undo not only Tx0’s write but also the unprotected x++’s write. And so on.
On one hand, this code is racy. So you could explain away the undefined behavior as being a race condition. On the other hand, it was also troublesome. All those problems with locks begin cropping up all over the place. It would have been ideal if we could notify developers that they made a mistake. Then we could have made the assertion that data races are simply not possible with TM.
(Except for consistency-related ones, of course.)
At the same time, many hardware models were being explored. And of course in hardware you’ve got the physical addresses that variables resolve to and needn’t worry about aliasing. So it was actually possible to issue a fault if a location was used transactionally and non-transactionally at once. But given that our solution was software-based, we were uncomfortable betting the farm on hardware support.
Another approach was static analysis. We could require transactional locations to be tagged, for example. This had the unfortunate consequence of making reusable data structures less, well, reusable. Collections for example presumably need to be usable from within and outside transactions alike. After-the-fact analysis could be applied without tagging, but false positives were common. We never really took a hard stance on this problem, but always assumed the combination of static analysis, tooling, and, perhaps someday, hardware detection would make this problem more diagnosable. But I think we generally resolved ourselves to the fact that our TM would suffer from weak atomicity problems.
We thought this was explainable. Sadly it led to something that surely was not.
Disillusionment Part III: the Privatization Problem
I still remember the day like it was yesterday. A regular weekly team meeting, to discuss our project’s status, future, hard problems, and the like. A summer intern on board from a university doing pioneering work in TM, sipping his coffee. Me, sipping my tea. Then that same intern’s casual statement pointing out an Earth-shattering flaw that would threaten the kind of TM we (and most of the industry at the time) were building. We had been staring at the problem for over a year without having seen it. It is these kinds of moments that frighten me and make me a believer in formal computer science.
Here it is in a nutshell:
bool itIsOwned = false;
MyObj x = new MyObj();
…
// Thread 1:
atomic { // Tx0
    // Claim the state for my use:
    itIsOwned = true;
}
int z = x.field;
...
// Thread 2:
atomic { // Tx1
    if (!itIsOwned)
        x.field += 42;
}
The Tx0 transaction changes itIsOwned to true, and then commits. After it has committed, it proceeds to use whatever state was claimed (in this case an object referred to by variable x) outside of the purview of TM. Meanwhile, another transaction Tx1 has optimistically read itIsOwned as false, and has gone ahead to use x. An update in-place system will allow that transaction to freely change the state of x. Of course, it will roll back here, because itIsOwned changed to true. But by then it is too late: the other thread using x outside of a transaction will see constantly changing state – torn reads even – and who knows what will happen from there. This is a known flaw in any weakly atomic, update in-place TM.
If this example appears contrived, it’s not. It shows up in many circumstances. The first one in which we noticed it was when one transaction removes a node from a linked list, while another transaction is traversing that same list. If the former thread believes it “owns” the removed element simply because it took it out of the list, someone’s going to be disappointed when its state continues to change.
This, we realized, is just part and parcel of an optimistic TM system that does in-place writes. I don’t know that we ever fully recovered from this blow. It was a tough pill to swallow. After that meeting, everything changed: a somber mood was present and I think we all needed a drink. Nevertheless we plowed forward.
We explored a number of alternatives. And so did the industry at large, because that intern in question published a paper on the problem. One obvious solution is to have a transaction that commits a change to a particular location wait until all transactions that have possibly read that location have completed – a technique we called quiescence. We experimented with this approach, but it was extraordinarily complicated, for obvious reasons.
We experimented with blends of pessimistic operations instead of optimistic ones, and with alternative commit protocols, like a “commit ticket” approach that serializes transaction commits. Each of these tended to sacrifice performance greatly. Eventually the team decided to do buffered writes instead of in-place writes, because concurrent modifications made inside a transaction simply do not touch the actual memory being used outside of the transaction unless that transaction successfully commits.
This, however, led to still other problems, like the granular loss of atomicity problem. Depending on the granularity of your buffered writes – we chose object-level – you can end up with false sharing of memory locations between transactional and non-transactional code. Imagine you update two separate fields of an object from within and outside a transaction, respectively, concurrently. Is this legal? Perhaps not. The transaction may bundle state updates to the whole object, rather than just one field.
All these snags led to the realization that we direly needed a memory model for TM.
Disillusionment Part IV: Where is the Killer App?
Throughout all of this, we searched and searched for the killer TM app. It’s unfair to pin this on TM, because the industry as a whole still searches for a killer concurrency app. But as we uncovered more successes in the latter, I became less and less convinced that the killer concurrency apps we will see broadly deployed in the next 5 years needed TM. Most enjoyed natural isolation, like embarrassingly parallel image processing apps. If you had sharing, you were doing something wrong.
In Conclusion
I eventually shifted focus to enforcing coarse-grained isolation through message-passing, and fine-grained isolation through type system support a la Haskell’s state monad. This would help programmers to realize where they accidentally had sharing, I thought, rather than merely masking this sharing and making it all work (albeit inefficiently).
I took this path not because I thought TM had no place in the concurrency ecosystem. But rather because I believed it did have a place, but that several steps would be needed before getting there.
I suspected that, just like with Argus, you’d want transactions around the boundaries. And that you’d probably want something like open nesting for fine-grained scalable data structures, like shared caches. These are often choke points in a coarse-grained locking system, and often cannot be fully isolated, at least in the small. Ironically I am just now arriving there. In the system I work on I see these issues actually staring us in the face.
This is just my own personal view on TM. You may also be interested in reading the current STM.NET team’s views, available on their MSDN blog.
For me the TM project was particularly enjoyable. And it was a great learning experience. I worked with some amazing people, and it was a privilege. You really had the sense that something big was right around the corner, and every day was a rush of enjoyment. Despite running as fast as we could, it seemed like we could just barely keep pace with the research community. Over time more and more researchers turned to TM, and I distinctly recall reading at least one new TM paper per week.
This was also the first time I realized that Microsoft, at its core, really does operate like a collection of many startups. Our TM work was a grassroots movement, and there was no official sponsorship for our effort at the start. It was just a group of people independently getting together to discuss how TM might fit into the direction the industry was headed. Eventually TM started showing up on slide decks in presentations to management, followed by dedicated TM reviews, and even a BillG review. I will never forget, a couple years after that review – during an overall concurrency review – Bill standing up at the whiteboard, drawing the code “atomic { … }” and asking something to the effect: “Why can’t you just use transactional memory for that?” I guess the idea stuck with him too.
Who knows. Maybe in 10 years, the world will be transactional after all.
 Sunday, November 01, 2009
Say you've got a Task<T>. Well, now what?
You know that eventually a T will become available, but until then you're out of luck. You could go ahead and be a naughty little devil by calling Wait on it -- blocking the current thread (eek!) -- or you could call ContinueWith on the task to get back a new Task<U>, representing the work you would do to create some new U object if only you presently had a T in hand. And then perhaps you will find yourself in the same situation for that U.
These are those dataflow graphs I mentioned in the previous blog post. Things of beauty.
To be more concrete about the situation I describe, imagine you've got the following IFoo interface:
interface IFoo
{
int Bar();
string Baz(int x);
}
Now, given a Task<IFoo>, you can't do anything related to an IFoo. And yet presumably that's why you've got the task in the first place: because you care about the IFoo. What if you ultimately want to invoke the Bar method, for example?
Task<IFoo> task = ...;
You can of course block the thread:
// Option A: block the thread.
int resultA = task.Result.Bar();
...
Or you can choose to program in a very clunky way:
// Option B: use dataflow.
Task<int> resultB = task.ContinueWith(t => t.Result.Bar());
But what if, instead, you could do something like this?
// Option C: magic.
Task<int> resultC = task.Bar();
Whoa, wait a minute. We're calling Bar() on a Task<IFoo>? Neat, but how can that be?
This is obviously a trick. All of the members of T are somehow being made available on the Task<T> object, so that they can be called before the task has actually been resolved to a concrete value. Of course, were we to allow this, what you get back to represent the result of such calls would need to be task objects too: hence we get back a Task<int> from the call on Bar(), instead of an int. This is similar to call streams in Barbara Liskov's Argus language (her primary focus immediately after CLU).
This kind of lifting from the inner type outward is much like what you get in languages that allow generic mixins. C# already has one semi-such type, though you may not realize it: Nullable<T> actually allows you to directly access interfaces implemented by T without needing to call Value on it. It's almost like Nullable<T> was defined as deriving from T itself which is clearly not actually possible (for numerous reasons, not the least significant of which is that it's a struct). Try it. This works because the type system treats Nullable<T> and T somewhat uniformly (though you'd be surprised by some dangers lurking within -- effectively Nullable<T> mustn't implement any interfaces *ever* otherwise a type hole would result). But I digress...
Unfortunately, without deep language changes we can't get this to work the way we'd like. I have found numerous occasions where a general lifting capability in C# would be useful: Lazy<T> is but one example. That said, each time we run across an instance, it demands slightly different type system treatment, and it seems unlikely such a general facility would be as usable as the one-off features.
Type systems aside, I am actually using a very dirty trick to make this work: I'm using the new System.Dynamic features in .NET 4.0 to do it all dynamically. You may love or hate this, depending on your stance on type systems. Being an ML guy, I'll let you figure out what I think. (Hint: gross hack!)
We can go further. (Although sadly I won't demonstrate how to do so in this blog post. I had wanted to go all the way, but need to get some actual language work done today, in addition to a little Riemann study, instead of having endless fun tinkering with Visual Studio 2010. Shucks.) Notice that Baz accepts an int as input. Well, what if all we've got is a Task<int>? We can of course also allow that to get passed in too:
Task<string> resultD = task.Baz(42); // Real input. Fine.
Task<int> arg = ...;
Task<string> resultE = task.Baz(arg); // A task as input! Cool!
But wait, there is more! It slices and dices too. The next trick is difficult -- if not impossible -- to do without far reaching language changes. But we could also even bridge the world of ordinary methods too, not just those that have been accessed by tunneling through a Task<T>. For example:
string f(int x) {...}
...
Task<int> task = ...;
Task<string> result = f(task);
Not to even mention:
Task<int> x = ...;
Task<int> y = ...;
Task<int> z = x + y;
This is deep. What we are saying is that anywhere a T is expected, we can supply a Task<T>. Of course once we've entered the world of tasks, we cannot escape until values actually begin resolving. So when we invoke the method f in this example, we of course get back a Task<string> for its result. Once we've stepped onto a turtle's back, well, it's turtles all the way down.
(Which reminds me of the well known tale:
A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the center of a vast collection of stars called our galaxy. At the end of the lecture, a little old lady at the back of the room got up and said: "What you have told us is rubbish. The world is really a flat plate supported on the back of a giant tortoise." The scientist gave a superior smile before replying, "What is the tortoise standing on?" "You're very clever, young man, very clever", said the old lady. "But it's turtles all the way down!"
Tasks are not greasy hamburgers after all, as I had claimed in the last post, but rather they are turtles.
I've wasted all of my energy speaking of turtle hamburgers drenched in asynchronous aioli, and have left only a little to go over the hacked up implementation of this idea. Sigh. Well, we had better get to it.)
In summary: we'll just rely on dynamic dispatch to do the lifting, thanks to the new .NET 4.0 DynamicObject class. This is wildly less efficient than a proper type system design would yield, not to mention the utter lack of static type checking. Of course a proper implementation that designed for this from Day One would also avoid the tremendous amount of object allocation that relying on the current Task<T> objects and ContinueWith overloads imply. But nevertheless, this approach will allow us to at least have a good ole' time and stimulate the creative side of the noggin.
First, I shall provide an extension method for getting a DynamicTask<T> -- the thing that actually derives from DynamicObject and implements the custom dynamic binding:
public static class DynamicTask
{
public static dynamic AsDynamic<T>(this Task<T> task)
{
return new DynamicTask<T>(task);
}
}
Notice that this changes our calling conventions ever so slightly. Namely:
// Option C: magic.
Task<int> resultC = task.AsDynamic().Bar();
The AsDynamic call places the caller into the lifted context. As invocations are made, the results come back as real tasks, not dynamic ones, so continuing a chain of calls requires an AsDynamic() at each step. This is a minor inconvenience, and we could certainly wrap the return values in DynamicTask<T> objects automatically if we wanted to eliminate this problem, i.e. to make chaining less verbose.
Second, we must implement the DynamicTask<T> class. We will do a very simple translation. Given a member access expression 'x.m', where m is either a field or property of type U, we will morph this into the new expression 'x.Task.ContinueWith(v => v.Result.m)', which is of type Task<U>. Similarly, given a method invocation 'x.M(a1,...,aN)', whose return value is of type U, we will morph it into the new expression 'x.Task.ContinueWith(v => v.Result.M(a1,...,aN))', which is of type Task<U> (or just Task if U is the void type). Supporting a task argument where an actual value is expected would require packing the argument with the target into an array, and doing a ContinueWhenAll on it.
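To make the translation concrete, here is a hand-written sketch of the code the binder is meant to produce on our behalf. The class and method names (LiftingByHand, LiftedLength and friends) are made up for illustration; only ContinueWith and TaskFactory.ContinueWhenAll are real Task APIs.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: hand-written versions of the rewrites described above.
public static class LiftingByHand
{
    // 'x.Length' lifted over a Task<string>: x.ContinueWith(v => v.Result.Length)
    public static Task<int> LiftedLength(Task<string> x)
    {
        return x.ContinueWith(v => v.Result.Length);
    }

    // 'x.Substring(start)' lifted over a Task<string>.
    public static Task<string> LiftedSubstring(Task<string> x, int start)
    {
        return x.ContinueWith(v => v.Result.Substring(start));
    }

    // 'x + y' where both operands are tasks: wait for both, then combine.
    public static Task<int> LiftedAdd(Task<int> x, Task<int> y)
    {
        return Task.Factory.ContinueWhenAll(
            new Task[] { x, y },
            _ => x.Result + y.Result);
    }
}
```

The dynamic binder's job is simply to generate the first two shapes automatically; the third is the ContinueWhenAll case that this post leaves unimplemented.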
(Perhaps I will illustrate how to do these other tricks in a later post, but I'm tight for time right now. I'm only sketching the general idea. Even in what I show below, things will be incomplete, because topics such as getting exception propagation right when tasks begin failing are tricky. Ideally the whole dataflow chain will be "broken" by such an exception. Additionally, I've only implemented what was necessary to get a few interesting examples working. The binder, for example, certainly has a few loose ends. Blog reader beware.)
Here is the implementation of DynamicTask<T>:
public class DynamicTask<T> : DynamicObject
{
private Task<T> m_task;
public DynamicTask(Task<T> task)
{
if (task == null) {
throw new ArgumentNullException("task");
}
m_task = task;
}
public Task<T> Task {
get { return m_task; }
}
public override DynamicMetaObject GetMetaObject(Expression parameter) {
if (parameter == null) {
throw new ArgumentNullException("parameter");
}
return new TaskLiftedObject(this, parameter);
}
class TaskLiftedObject : DynamicMetaObject
{
...
}
}
Simple. All of the dynamic magic resides in the implementation of TaskLiftedObject, which derives from the DynamicMetaObject class. It is constructed with an instance of the DynamicTask<T> along with the expression tree that can be used to dynamically load up an instance of that task. All of the dynamic features work with expression trees. For example, in response to an attempt to invoke a method M on a DynamicTask<T>, our binder will need to find the right method M on the underlying T, and then return an expression tree that does the ContinueWith and so forth.
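If expression trees are unfamiliar, a tiny stand-alone sketch may help before we dive in. It builds the tree for 's => s.Length' by hand and compiles it to a delegate; ExpressionTreeDemo is a hypothetical name, while the Expression factory methods are the same ones the binder uses.

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionTreeDemo
{
    // Builds the tree for 's => s.Length' by hand and compiles it to a delegate.
    public static Func<string, int> BuildLengthGetter()
    {
        ParameterExpression s = Expression.Parameter(typeof(string), "s");
        MemberExpression length = Expression.Property(s, "Length");
        Expression<Func<string, int>> lambda =
            Expression.Lambda<Func<string, int>>(length, s);
        return lambda.Compile();
    }
}
```

The binder does the same kind of construction, except that it hands the tree back to the dynamic runtime instead of compiling it itself.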
Let's start cracking open TaskLiftedObject:
class TaskLiftedObject : DynamicMetaObject
{
private DynamicTask<T> m_task;
public TaskLiftedObject(DynamicTask<T> task, Expression expression) :
base(expression, BindingRestrictions.Empty, task)
{
m_task = task;
}
We will override two of DynamicMetaObject's functions. BindGetMember is called when a member is accessed (like a property or field), whereas BindInvokeMember is called when a method call is made. There are several other methods that a proper binder would need to override in order to make delegate dispatch and such work properly. But this suffices to get started:
public override DynamicMetaObject BindGetMember(GetMemberBinder binder)
{
// We have a member access:
// x.m
//
// which must become:
// x.Task.ContinueWith(v => { v.Result.m; })
//
return new DynamicMetaObject(
MakeContinuationTask(Bind(binder.Name, -1), null),
BindingRestrictions.GetInstanceRestriction(Expression, Value),
Value
);
}
public override DynamicMetaObject BindInvokeMember(InvokeMemberBinder binder, DynamicMetaObject[] args)
{
// We have a call:
// x.Foo(a1,...,aN)
//
// which must become:
// x.Task.ContinueWith(v => { v.Result.Foo(a1,...,aN); })
//
Expression[] argsEx = new Expression[args.Length];
for (int i = 0; i < args.Length; i++) {
argsEx[i] = args[i].Expression;
}
return new DynamicMetaObject(
MakeContinuationTask(Bind(binder.Name, binder.CallInfo.ArgumentCount), argsEx),
BindingRestrictions.GetInstanceRestriction(Expression, Value),
Value
);
}
Clearly the workhorses here are Bind and MakeContinuationTask. Bind is responsible for performing dynamic lookup for a matching member on T that has the requested Name and, if a method call is being made, the proper number of parameters. For brevity, I've omitted anything to do with argument type checking, an obvious hole that we'd want to fix some day:
private static MemberInfo Bind(string name, int argCount)
{
// Lookup the target member on the T, rather than the (Dynamic)Task<T>.
return
(from m in typeof(T).GetMembers(BindingFlags.Instance | BindingFlags.Public)
where m.Name.Equals(name) &&
(argCount == -1 ?
!(m is MethodInfo) :
(m is MethodInfo && ((MethodInfo)m).GetParameters().Length == argCount))
select m).
Single();
}
Nothing too interesting here either -- just a bit of hacky reflection code done with a fancy LINQ query. If anything other than exactly one member was found, the call to Single() will throw an exception. If you want to see what a "real" dynamic binder looks like, you won't find it here: check out VB's or IronPython's.
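One possible way to start closing the argument type checking hole, sketched under the assumption that we only care about public instance methods, is to let reflection pick the overload by parameter types. TypedBinder and BindMethod are hypothetical names; Type.GetMethod with a Type[] is the real API doing the work.

```csharp
using System;
using System.Reflection;

public static class TypedBinder
{
    // Looks up a public instance method on 'target' by name and argument types.
    // Returns null when nothing matches.
    public static MethodInfo BindMethod(Type target, string name, Type[] argTypes)
    {
        return target.GetMethod(
            name,
            BindingFlags.Instance | BindingFlags.Public,
            null,       // use the default binder
            argTypes,
            null);      // no parameter modifiers
    }
}
```

Inside the binder, the argument types could plausibly be taken from each argument's DynamicMetaObject.LimitType, though I have not wired that up here.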
Now for the meat. The MakeContinuationTask method takes the target member that we've found dynamically via Bind, as well as an optional array of expression trees, each representing an argument being passed to the target method (and which will be null for property and field access), and manufactures the expression tree that represents the execution of the dynamic call itself:
private Expression MakeContinuationTask(MemberInfo target, Expression[] targetArgs)
{
var lambdaParam = Expression.Parameter(typeof(Task<T>), "v");
var lambdaParamResult = Expression.Property(lambdaParam, "Result");
Expression lambdaBody;
Type lambdaReturnType;
if (target is MethodInfo) {
lambdaBody = Expression.Call(lambdaParamResult, (MethodInfo)target, targetArgs);
lambdaReturnType = ((MethodInfo)target).ReturnParameter.ParameterType;
}
else if (target is PropertyInfo) {
lambdaBody = Expression.Property(lambdaParamResult, (PropertyInfo)target);
lambdaReturnType = ((PropertyInfo)target).PropertyType;
}
else if (target is FieldInfo) {
lambdaBody = Expression.Field(lambdaParamResult, (FieldInfo)target);
lambdaReturnType = ((FieldInfo)target).FieldType;
}
else {
throw new Exception("Unsupported dynamic invoke: " + target.GetType().Name);
}
return Expression.Call(
Expression.Property(
Expression.Convert(this.Expression, typeof(DynamicTask<T>)),
typeof(DynamicTask<T>).GetProperty("Task")
),
GetContinueWith(lambdaReturnType), // ContinueWith
new Expression[] {
// v => { v.Result.M(a0,...,aN) }
Expression.Lambda(lambdaBody, lambdaParam)
}
);
}
You should be able to convince yourself that this code generates the desired transformation described earlier. It uses a method to find the overload of Task<T>.ContinueWith that we want to bind against, and invokes that on the Task<T> contained within the DynamicTask<T> against which the dynamic call was made. It is rather unfortunate that the CLR does not allow the void type as a generic type argument, so we have to be a little bit inconsistent with our treatment of void returns, by choosing a different ContinueWith overload.
If the above reflection code was called hacky, the ContinueWith lookup is worse. It's very inefficient, not to mention fragile (because it depends on the current layout of Task<T>'s overloads, what with instantiating generic methods and the like). C'est la vie:
private static MethodInfo GetContinueWith(Type returnType)
{
// @TODO: caching to avoid expensive lookups each time.
if (returnType == typeof(void)) {
return typeof(Task<T>).GetMethod(
"ContinueWith",
new Type[] { typeof(Action<Task<T>>) }
);
}
else {
foreach (MethodInfo mif in typeof(Task<T>).GetMethods()) {
if (mif.Name == "ContinueWith" && mif.IsGenericMethodDefinition) {
MethodInfo mifOfT = mif.MakeGenericMethod(returnType);
ParameterInfo[] mifParams = mifOfT.GetParameters();
if (mifParams.Length == 1 &&
mifParams[0].ParameterType == typeof(Func<,>).MakeGenericType(typeof(Task<T>), returnType)) {
return mifOfT;
}
}
}
}
throw new Exception("Fatal error: ContinueWith overload not found");
}
}
And that's it. With that, we can get dynamic invocations on unresolved T's via Task<T> objects. Nifty.
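As for the @TODO about caching in GetContinueWith, a minimal sketch might look like the following. It assumes a per-T static cache keyed by the continuation's return type and covers only the non-void case (void would still go through the Action overload as before). ContinueWithCache is a hypothetical name.

```csharp
using System;
using System.Collections.Concurrent;
using System.Reflection;
using System.Threading.Tasks;

// Hypothetical cache keyed by continuation return type; one cache per T.
public static class ContinueWithCache<T>
{
    private static readonly ConcurrentDictionary<Type, MethodInfo> s_cache =
        new ConcurrentDictionary<Type, MethodInfo>();

    public static MethodInfo Get(Type returnType)
    {
        // GetOrAdd may run Lookup more than once under contention, which is
        // harmless here because the lookup has no side effects.
        return s_cache.GetOrAdd(returnType, Lookup);
    }

    private static MethodInfo Lookup(Type returnType)
    {
        Type wanted = typeof(Func<,>).MakeGenericType(typeof(Task<T>), returnType);
        foreach (MethodInfo m in typeof(Task<T>).GetMethods()) {
            if (m.Name != "ContinueWith" || !m.IsGenericMethodDefinition) continue;
            MethodInfo closed = m.MakeGenericMethod(returnType);
            ParameterInfo[] ps = closed.GetParameters();
            if (ps.Length == 1 && ps[0].ParameterType == wanted) return closed;
        }
        throw new InvalidOperationException("ContinueWith overload not found");
    }
}
```

This keeps the fragile overload scan but pays for it only once per (T, return type) pair.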
I'm not saying any of this is a really good idea. Honestly, I'm not. Of course, there's a kernel of a good idea there and the systems we are working on take this kernel to its extreme. By providing a programming model that encourages deep chains of dataflow to be expressed speculatively in a natural and familiar manner, greater degrees of latent parallelism can lie resident in an application waiting to be unlocked as more processors become available. Doing it for real requires impactful changes to the language, supporting infrastructure, and particularly tooling. Just imagine what it means to break into a debugger to inspect deep dataflow graphs that have been constructed by compiler magic underneath you. And the use of ContinueWith is a little lame, because of course the target of our call may be something that can be run speculatively too with first class pipelining, rather than completely delaying the invocation of it.
So we won't be seeing lifted tasks in .NET anytime soon. Writing up this blog post was merely an excuse to toy around with the new C# dynamic features and to have a little recreational time. And to generate excitement about what .NET 4.0 holds in store. I hope you have enjoyed it. Now back to reality.
 Saturday, October 31, 2009
Well, Visual Studio 2010 Beta 2 is out on the street. It contains plenty of neat new things to keep one busy for at least a rainy Saturday. I proved this today.
Of course, Parallel Extensions is in the box. .NET 4.0's Task and Task<T> abstractions are used to implement such things as PLINQ and Parallel.For loops, but of course they are great for representing asynchronous work too. The FromAsync adapters move you from the dark ages of IAsyncResult to the glitzy new space age of tasks.
Not only are tasks tastier than hamburgers, but they enable complex dataflow graphs of asynchronous work to unfold dynamically at runtime, thanks to the ContinueWith method. From a Task<T> you can get a Task<U> that was computed based on the T; ad infinitum. We like dataflow. It is the key to unlocking parallelism, or more accurately, boiling away all else except for dataflow is the key. But what about control flow, you might ask? We like it less. But you can do it, so long as you put in some work. F#'s async workflows make this sort of thing a tad easier, but the raw libraries in .NET 4.0 don't come with any sort of loops or conditional capabilities. Perhaps in the future they will. Nevertheless, in this post I shall demonstrate how to build a couple simple ones.
Not because the lack of them is going to cause unprecedented and unheard of horrors, but rather because in doing so we'll see some neat features of tasks.
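As a quick refresher on the dataflow side, here is a small sketch of the Task<T> to Task<U> chaining mentioned above. DataflowChain and IsLongWord are made-up names and the threshold of 5 is arbitrary; ContinueWith is the only real API involved.

```csharp
using System;
using System.Threading.Tasks;

public static class DataflowChain
{
    // Task<string> -> Task<int> -> Task<bool>, each stage computed from the last.
    public static Task<bool> IsLongWord(Task<string> word)
    {
        Task<int> length = word.ContinueWith(w => w.Result.Length);
        Task<bool> isLong = length.ContinueWith(l => l.Result > 5);
        return isLong;
    }
}
```

Nothing blocks while the graph is being built; each stage runs only once the stage before it has produced a value.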
The two methods I will illustrate in this post are:
public static class TaskControlFlow
{
public static Task For(int from, int to, Func<int, Task> body, int width)
public static Task While(Func<int, bool> condition, Func<int, Task> body, int width)
}
Notice that each body is given the iteration index and is expected to launch asynchronous work and return a Task. The parameters that these methods take are probably obvious. Well, except for the last one. The "width" indicates how many outstanding asynchronous bodies should be in flight at once. The Task returned by For and While won't be considered done until all iterations are done, and any exceptions will be propagated as you might hope. It would be pretty useless otherwise.
For example, we could write a while loop that does something very silly:
TaskControlFlow.While(
i => i < 100,
i => { return CreateTimerTask(250).ContinueWith(_ => Console.WriteLine(i)); },
4
).Wait();
Each body just returns a "timer task" that completes after 250ms and then prints the iteration index to the console. We pass a width of 4, so only four tasks will be outstanding at any given time. Notice we call Wait at the end, since both For and While return tasks representing the in flight work. This could have instead been written using a For loop as follows:
TaskControlFlow.For(0, 100,
i => { return CreateTimerTask(250).ContinueWith(_ => Console.WriteLine(i)); },
4
).Wait();
The CreateTimerTask method, by the way, looks like this:
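private static Task CreateTimerTask(int ms)
{
var tcs = new TaskCompletionSource<bool>();
// Capture the timer in its own callback so it stays reachable until it fires;
// an unreferenced Timer may be collected before its due time.
Timer timer = null;
timer = new Timer(_ => { timer.Dispose(); tcs.SetResult(true); }, null, Timeout.Infinite, Timeout.Infinite);
timer.Change(ms, Timeout.Infinite);
return tcs.Task;
}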
As something more realistic, imagine we wanted to do something with a large number of files, and don't want to block a whole bunch of threads in the process. The following "simple" expression will count up all of the bytes for all of the files in a particular directory, without once blocking the thread -- well, except for the initial call to Directory.GetFiles:
string win = "c:\\...\\";
string[] files = Directory.GetFiles(win);
int total = 0;
TaskControlFlow.For(0, files.Length,
i => {
bool eof = false;
byte[] buff = new byte[4096];
FileStream fs = File.OpenRead(files[i]);
return TaskControlFlow.While(
j => !eof,
j => Task<int>.Factory.
FromAsync<byte[],int,int>(
// The offset argument indexes into buff, not the file, so it stays 0.
fs.BeginRead, fs.EndRead, buff, 0, buff.Length,
null, TaskCreationOptions.None
).
ContinueWith(v => {
// A read of zero bytes marks end-of-file; a short read alone does not.
if (eof = (v.Result == 0)) {
fs.Close();
}
Interlocked.Add(ref total, v.Result);
}),
1
);
},
8
).Wait();
Console.WriteLine(total);
Pretty neat. We've somewhat arbitrarily chosen a width of 8 for this loop. And notice something very subtle but important here: we've chosen a width of 1 for the inner loop that plows through the bytes of a file. This is because we're sharing state, and it would not be safe to launch numerous iterations at once. The same byte[], eof variable, and so forth, would become corrupt. I will mention in passing that it's unfortunate that we've got that interlocked stuck in there to add to the total. Refactoring this so that we could just do a LINQ reduce over the whole thing would be nice. Indeed, it can be done.
We can do away with the For implementation very quickly. It is just implemented in terms of While:
public static Task For(int from, int to, Func<int, Task> body, int width)
{
return While(i => from + i < to, i => body(from + i), width);
}
And it turns out that the While implementation is not terribly complicated either. Here it is:
public static Task While(Func<int, bool> condition, Func<int, Task> body, int width)
{
var tcs = new TaskCompletionSource<bool>();
int currIx = -1; // Current shared index.
int currCount = width; // The number of outstanding tasks.
int canceled = 0; // 1 if at least one body was cancelled.
ConcurrentBag<Exception> exceptions = null; // A collection of exceptions, if any.
// Generate a continuation action: this fires for each body that completes.
Action<Task> fcont = null;
fcont = tsk => {
if (tsk.IsFaulted) {
// Accumulate exceptions.
LazyInitializer.EnsureInitialized(ref exceptions);
foreach (Exception inner in tsk.Exception.InnerExceptions) {
exceptions.Add(inner);
}
}
else if (tsk.IsCanceled) {
// Mark that cancellation has occurred.
canceled = 1;
}
else if (canceled == 0 && exceptions == null) {
// If no cancellations / exceptions are found, attempt to kick off more work.
int ix = Interlocked.Increment(ref currIx);
if (condition(ix)) {
// Generate a new body task, handling exceptions. Then make sure we
// tack on the continuation on that new task, so we can keep on going...
// If the condition yielded 'false', we'll simply fall through and try to finish.
Task btsk;
try {
btsk = body(ix);
}
catch (Exception ex) {
btsk = AlreadyFaulted(ex);
}
btsk.ContinueWith(fcont);
return;
}
}
// If this is the last task, signal completion.
if (Interlocked.Decrement(ref currCount) == 0) {
if (exceptions != null) {
tcs.SetException(exceptions);
}
else if (canceled == 1) {
tcs.SetCanceled();
}
else {
tcs.SetResult(true);
}
}
};
// Fire off the right number of starting tasks.
for (int i = 0; i < width; i++) {
AlreadyDone.ContinueWith(fcont);
}
return tcs.Task;
}
I've commented the code inline to illustrate what is going on. The only other parts not shown are the AlreadyDone and AlreadyFaulted members, which simply give Tasks that are already in a final state. These aren't strictly necessary, but they come in handy in a number of situations:
internal static Task AlreadyDone;
static TaskControlFlow()
{
var tcs = new TaskCompletionSource<bool>();
tcs.SetResult(true);
AlreadyDone = tcs.Task;
}
private static Task AlreadyFaulted(Exception ex)
{
var tcs = new TaskCompletionSource<bool>();
tcs.SetException(ex);
return tcs.Task;
}
And that's it. I'm done for now. Hope you enjoyed it. I've got a few other posts in the works -- primarily the result of a day full of hacking (I got in the office at 7am this morning, and have been here ever since, 14 hours later) -- demonstrating how to do speculative asynchronous work for if/else branches. Finally, I also have a neat example that illustrates how to do deep dataflow-based speculation without having to wait for work to complete. This combines the new .NET 4.0 dynamic capabilities with parallelism, so I'm pretty excited to get it working and write about it.
 Monday, October 19, 2009
Embarrassingly, I neglected to write about the oldest trick in the book in my last post: designing the producer/consumer data structure to reduce false sharing. As I've written about several times previously (e.g. see here), and more so in the book, false sharing is always deadly and must be avoided.
As a simple example, consider a program that merely increments a counter over and over again. Suppose we give P threads their own separate counters, and ask each thread to increment its counter an equal number of times. Each thread can of course do this without synchronization, because the counters are distinct: no locks or even interlocked operations are necessary. Naively, one might expect that running P of them in parallel leads to no interference, and hence perfect parallelization. However, when I run a little benchmark on my 8-way machine, the numbers for increasing values of P tell a very different story:
1 = 22425789
2 = 42023726 (187%)
4 = 175828522 (784%)
8 = 333906288 (1489%)
It is clear that the throughput drops dramatically as P increases. The reason? Each counter, being only 8 bytes wide, shares a cache line with as many as 7 other counters -- or 15 if we're on a machine with 128 byte cache lines. A simple change to the counter's layout, so that individual counters do not share the same cache line, will remedy the situation. The numbers improve dramatically. In fact, they remain constant no matter the value of P:
1 = 21914250
2 = 21900392 (100%)
4 = 21865781 (100%)
8 = 21934008 (100%)
This perfect scaling isn't always possible due to memory bandwidth, but because we're just incrementing a single counter per core this doesn't manifest as a problem.
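The layout change can be sketched roughly like this, assuming 128 bytes per counter is enough to cover both 64- and 128-byte cache lines. PaddedCounter, TwoCounters and PaddingCheck are hypothetical names; the full benchmark at the end of the post uses explicit field offsets instead.

```csharp
using System;
using System.Runtime.InteropServices;

// One counter per 128-byte slot, so no two counters share a cache line.
[StructLayout(LayoutKind.Explicit, Size = 128)]
public struct PaddedCounter
{
    [FieldOffset(0)]
    public long Value;
}

// Two counters laid out back to back; B starts a full slot after A.
[StructLayout(LayoutKind.Sequential)]
public struct TwoCounters
{
    public PaddedCounter A;
    public PaddedCounter B;
}

public static class PaddingCheck
{
    // Reports how far apart the two counters sit, in bytes.
    public static int OffsetOfB()
    {
        return Marshal.OffsetOf(typeof(TwoCounters), "B").ToInt32();
    }
}
```

Marshal.OffsetOf reports the marshaled layout, which for a struct made only of blittable fields like this one should match the managed layout; treat that as an assumption to verify on your own runtime.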
For what it's worth, the machine I am running these tests on is an 8-way, dual-socket, quad-core. Pairs of cores share an L1 cache, and all cores in a socket share an L2 cache. So the pairs {0,1}, {2,3}, {4,5}, and {6,7} are each expected to have distinct L1 caches and the groups {0,1,2,3} and {4,5,6,7} are expected to have distinct L2 caches. The P = 2 number above is run with two threads affinitized such that they share the same L1 cache. If we force them apart, however, we get slightly different results:
2 = 42023726 (187%) -- same L1 cache
2 = 54706505 (244%) -- same L2 cache
2 = 75030977 (335%) -- separate sockets
As expected, the more distance in the cache hierarchy, the greater the slowdown due to the increased ping pong paths.
The specific results are of course unique to my machine, but nevertheless the conclusion is clear: reducing sharing leads to substantial performance gains, particularly with large numbers of threads hammering on the shared lines. Often more so than eliminating other sources of wasted cycles, like interlocked operations. Eliminating those sources is clearly important too, but it really is amazing how deadly and yet difficult to discover false sharing can be: few cases are as obvious as this one.
One aside is worth mentioning before winding down. When I first ran this experiment, I had done it two ways: (1) with fields of a shared object, then using StructLayout(LayoutKind.Explicit) to keep fields apart, and (2) with counters crammed into an array, which then contains padding elements to eliminate the false sharing. The former is shown above. If you try the latter, you may be surprised. The layout of arrays on the CLR is such that an array's length resides before the first element. So unless you pad the first element of the array, all accesses will perform bounds checking that touches the first element's line. Because this line is being mutated by the thread incrementing the first counter, terrible false sharing results. To solve this, we must pad the first element too.
For example, here are the array numbers with false sharing:
1 = 27366202
2 = 125264714 (458%)
4 = 1383953372 (5057%)
8 = 3136996731 (11463%)
Notice the P = 8 case is over 100x slower! Yowzas. After fixing things, with the first element padded, we again observe perfect scaling:
1 = 27393869
2 = 27465999 (100%)
4 = 27370901 (100%)
8 = 27408631 (100%)
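The padded-array indexing boils down to a little arithmetic, sketched here with a stride of 16 longs (16 * 8 = 128 bytes) to match the benchmark program further down. PaddedArrayIndex and its members are illustrative names only.

```csharp
using System;

public static class PaddedArrayIndex
{
    // 16 longs * 8 bytes = 128 bytes between consecutive counters.
    public const int Stride = 16;

    // Slot 0 is left unused so counter 0 stays clear of the array's length word.
    public static int SlotFor(int counter)
    {
        return (counter + 1) * Stride;
    }

    // Allocates room for 'counters' padded slots plus the unused leading slot.
    public static long[] Allocate(int counters)
    {
        return new long[(counters + 1) * Stride];
    }
}
```

Counter i therefore lives at element (i + 1) * 16, and the first 16 elements exist purely as padding after the length.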
Clearly false sharing is not merely a theoretical concern. In fact, during our Beta1 performance milestone in Parallel Extensions, most of our performance problems came down to stamping out false sharing in numerous places: the partitioning logic of parallel for loops, polling cancellation token flags, enumerators allocated at the beginning of a PLINQ query and constantly mutated during its execution, and even in our examples (e.g. see Herb's matrix multiplication example), etc. It is terribly simple to make a mistake and, in a complicated system, terribly difficult to pinpoint the origin of what can be a truly crippling scalability bottleneck.
In the next post, we will go back and take a look at our single-producer / single-consumer buffer, and redesign it to have substantially better cache behavior.
~
For reference, here's the basic program used for a lot of these tests:
//#define CACHE_FRIENDLY
//#define USE_ARRAY
#pragma warning disable 0169
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Threading;
class Program
{
const int P = 1;
#if USE_ARRAY
class Counters
{
long[] m_longs;
internal Counters(int n) {
#if CACHE_FRIENDLY
m_longs = new long[(n+1)*16];
#else
m_longs = new long[n];
#endif
}
public void Increment(int i) {
#if CACHE_FRIENDLY
m_longs[(i+1)*16]++;
#else
m_longs[i]++;
#endif
}
}
#else // USE_ARRAY
#if CACHE_FRIENDLY
[StructLayout(LayoutKind.Explicit)]
#endif
struct Counters
{
#if CACHE_FRIENDLY
[FieldOffset(0)]
#endif
public long a;
#if CACHE_FRIENDLY
[FieldOffset(128)]
#endif
public long b;
#if CACHE_FRIENDLY
[FieldOffset(256)]
#endif
public long c;
#if CACHE_FRIENDLY
[FieldOffset(384)]
#endif
public long d;
#if CACHE_FRIENDLY
[FieldOffset(512)]
#endif
public long e;
#if CACHE_FRIENDLY
[FieldOffset(640)]
#endif
public long f;
#if CACHE_FRIENDLY
[FieldOffset(768)]
#endif
public long g;
#if CACHE_FRIENDLY
[FieldOffset(896)]
#endif
public long h;
}
static Counters s_c = new Counters();
#endif // USE_ARRAY
public static void Main(string[] args)
{
int p = int.Parse(args[0]);
const int iterations = int.MaxValue / 4;
ManualResetEvent mre = new ManualResetEvent(false);
#if USE_ARRAY
Counters c = new Counters(p);
#endif
Thread[] ts = new Thread[p];
for (int i = 0; i < ts.Length; i++) {
int tid = i;
ts[i] = new Thread(delegate() {
SetThreadAffinityMask(GetCurrentThread(), new UIntPtr(1u << tid));
mre.WaitOne();
for (int j = 0; j < iterations; j++)
#if USE_ARRAY
c.Increment(tid);
#else
switch (tid) {
case 0: s_c.a++; break;
case 1: s_c.b++; break;
case 2: s_c.c++; break;
case 3: s_c.d++; break;
case 4: s_c.e++; break;
case 5: s_c.f++; break;
case 6: s_c.g++; break;
case 7: s_c.h++; break;
}
#endif
});
ts[i].Start();
}
Stopwatch sw = Stopwatch.StartNew();
mre.Set();
foreach (Thread t in ts) t.Join();
Console.WriteLine(sw.ElapsedTicks);
}
[System.Runtime.InteropServices.DllImport("kernel32.dll")]
static extern IntPtr GetCurrentThread();
[System.Runtime.InteropServices.DllImport("kernel32.dll")]
static extern UIntPtr SetThreadAffinityMask(IntPtr hThread, UIntPtr dwThreadAffinityMask);
}
 Sunday, October 04, 2009
Commonly two threads must communicate with one another, typically to exchange some piece of information. This arises in low-level shared memory synchronization as in PLINQ’s asynchronous data merging, in the implementation of higher level patterns like message passing, inter-process communication, and in countless other situations. If only two agents partake in this arrangement, however, it is possible to implement a highly efficient exchange protocol. Although the situation is rather special, exploiting this opportunity can lead to some interesting performance benefits.
The standard technique for shared-memory situations is to use a ring buffer. This buffer is ordinarily an array of fixed length that may become full or empty. The two threads in this arrangement assume the role of producer and consumer: the producer adds data to the buffer and may make it full, whereas the consumer removes data from the buffer and may make it empty. It is possible to generalize this to multi-consumers or multi-producers, with some added cost to synchronization. What is described below is for the two thread case.
We will call this a ProducerConsumerRendezvousPoint<T>, and its basic structure looks like this:
using System;
using System.Threading;
public class ProducerConsumerRendezvousPoint<T>
{
private T[] m_buffer;
private volatile int m_consumerIndex;
private volatile int m_consumerWaiting;
private AutoResetEvent m_consumerEvent;
private volatile int m_producerIndex;
private volatile int m_producerWaiting;
private AutoResetEvent m_producerEvent;
public ProducerConsumerRendezvousPoint(int capacity)
{
if (capacity < 2) throw new ArgumentOutOfRangeException("capacity");
m_buffer = new T[capacity];
m_consumerEvent = new AutoResetEvent(false);
m_producerEvent = new AutoResetEvent(false);
}
private int Capacity
{
get { return m_buffer.Length; }
}
private bool IsEmpty
{
get { return (m_consumerIndex == m_producerIndex); }
}
private bool IsFull
{
get { return (((m_producerIndex + 1) % Capacity) == m_consumerIndex); }
}
public void Enqueue(T value)
{
...
}
public T Dequeue()
{
...
}
}
There are some basic invariants to call out:
- Our buffer holds our elements, producer index says at what position the next element enqueued will be stored, and the consumer index says from what position the next request to dequeue an element will retrieve its value.
- We reserve one element in our buffer to differentiate between fullness and emptiness. This is why we demand that capacity be >= 2. We could alternatively know how to distinguish between a free slot and a used one, such as checking for null, but keep things simple for now.
- Thus, IsEmpty is true when the consumer and producer index are the same. Whereas IsFull is true when the consumer is one ahead of the producer, such that producing would make the producer index collide with the consumer index (otherwise leading to IsEmpty).
- It should be obvious that our intent is to block consumption when IsEmpty == true and production when IsFull == true. This is the point of the waiting flags and events.
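These invariants are easy to check in isolation. The following single-threaded sketch models only the index bookkeeping, with no storage and no blocking; RingIndices is a hypothetical class, not part of the buffer itself.

```csharp
using System;

// Single-threaded model of the index bookkeeping only (no storage, no blocking).
public class RingIndices
{
    private readonly int m_capacity;
    private int m_producerIndex;
    private int m_consumerIndex;

    public RingIndices(int capacity)
    {
        if (capacity < 2) throw new ArgumentOutOfRangeException("capacity");
        m_capacity = capacity;
    }

    public bool IsEmpty
    {
        get { return m_consumerIndex == m_producerIndex; }
    }

    public bool IsFull
    {
        get { return (m_producerIndex + 1) % m_capacity == m_consumerIndex; }
    }

    // Advances the producer index, wrapping at the capacity.
    public void Produce()
    {
        if (IsFull) throw new InvalidOperationException("full");
        m_producerIndex = (m_producerIndex + 1) % m_capacity;
    }

    // Advances the consumer index, wrapping at the capacity.
    public void Consume()
    {
        if (IsEmpty) throw new InvalidOperationException("empty");
        m_consumerIndex = (m_consumerIndex + 1) % m_capacity;
    }
}
```

With a capacity of 4, three calls to Produce make IsFull true, which is the reserved slot at work: the buffer never holds more than capacity - 1 elements.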
Now let us look at the implementation first of Enqueue and then Dequeue, paying special attention to the necessary synchronization operations. They look nearly identical:
public void Enqueue(T value)
{
if (IsFull) {
WaitUntilNonFull();
}
m_buffer[m_producerIndex] = value;
Interlocked.Exchange(ref m_producerIndex, (m_producerIndex + 1) % Capacity);
if (m_consumerWaiting == 1) {
m_consumerEvent.Set();
}
}
Enqueue begins, as expected, by checking whether the queue is full. Notice that we have not issued any memory fences yet. The only thread that may make the buffer full is the current one, which will obviously not occur before proceeding, and therefore we needn’t perform any expensive synchronization operation for this check. The value seen may of course be stale but we can deal with that possibility inside the slow path, WaitUntilNonFull. We’ll look at that momentarily.
We then proceed to placing the value in the buffer at the current producer’s index. Only the current thread will update the producer index and a consumer will not read from the current value so long as the producer index refers to it. The value may not even be written atomically, e.g. for T’s that are greater than a pointer sized word. This is okay: only the act of incrementing the index allows a consumer to access the element in question. Writes on the CLR 2.0 memory model are retired in order and the reading side will use an acquire load of the index before accessing the element’s words. Indeed we could use complicated multipart value types that are comprised of lengthy buffers, header words, and so on.
We then increment the producer index, handling the possibility of wrap-around by modding with the capacity. This uses an Interlocked.Exchange for one simple reason: we are about to read a consumer waiting flag, and must prevent the load of that flag from moving prior to the producer index write. The consumer sets this flag when it notices the queue is empty and waits. This enables us to use a “Dekker style” check to minimize synchronization. We could have alternatively just unconditionally set the event, doing away with the interlocked operation altogether. But that call, if it involves kernel transitions, which is quite likely, is going to be much more expensive and would occur on every call to Enqueue. And any event of this kind that doesn’t require kernel transitions is going to at least require an interlocked operation for the same reason we require one here. An alternative technique involves setting when we transition the buffer from empty to non-empty or full to non-full, but this wastes a possibly expensive signal if the other party isn't even currently waiting. If full or empty is a rare situation, then full or empty and with a peer actually physically waiting is even rarer.
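To spell out the ordering requirement, here is a stand-alone model of the publish-then-check step that uses an explicit full fence in place of Interlocked.Exchange. PublishThenCheck and its members are hypothetical, and a counter stands in for setting the event; Thread.MemoryBarrier is the real API.

```csharp
using System;
using System.Threading;

// Hypothetical stand-alone model of the publish-then-check step.
public class PublishThenCheck
{
    private int m_index;
    private int m_peerWaiting;   // set to 1 by the other side before it sleeps
    public int SignalsSent;      // counts how often we would have set the event

    public int Index
    {
        get { return m_index; }
    }

    // What the peer does before it goes to sleep.
    public void MarkPeerWaiting()
    {
        Interlocked.Exchange(ref m_peerWaiting, 1);
    }

    public void Publish(int newIndex)
    {
        m_index = newIndex;       // 1. store the new index
        Thread.MemoryBarrier();   // 2. full fence: the load below cannot move above the store
        if (m_peerWaiting == 1) { // 3. only signal when a peer is actually waiting
            SignalsSent++;
        }
    }
}
```

The Interlocked.Exchange in Enqueue performs steps 1 and 2 in a single operation; splitting them out like this is only meant to make the Dekker-style pairing visible.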
Let’s now look at the WaitUntilNonFull method. It’s really the reverse of what the consumer does, so based on everything said till this point, I am guessing it’s obvious:
private void WaitUntilNonFull()
{
Interlocked.Exchange(ref m_producerWaiting, 1);
try {
while (IsFull) {
m_producerEvent.WaitOne();
}
}
finally {
m_producerWaiting = 0;
}
}
We begin by issuing a memory fence and setting the producer waiting flag. This memory fence is necessary to advertise that we are about to wait, and also to ensure the subsequent check of IsFull is synchronized. The consumer does something very much like the producer does (above) after taking an element: if the producer is waiting, the consumer has made space for it and therefore must signal. But it could be the case that the consumer has already made the queue non-full before it could notice the producer’s waiting flag. We catch this by ensuring the producer’s check of IsFull cannot go before setting the producer waiting; similarly, the consumer cannot make IsFull false without subsequently noticing that the producer is waiting; this avoids deadlock.
Everything else is self explanatory. Well, almost. We need a loop here to catch one subtle situation. Imagine a producer enters into this method thinking the buffer is full. It sets its flag, and then immediately notices that the buffer is not full anymore: a consumer has taken an item and made space. But imagine that consumer noticed that the producer was waiting and hence set its event. This is an auto-reset event, so the next time the producer must wait, the event will have already been set and it’ll likely wake up before IsFull has become false. The loop rechecks IsFull after every wakeup and simply waits again in that case. An alternative way of dealing with this is to call Reset on the event if we didn’t actually wait on the event, but again we keep things simple.
At this point, the consumer side is going to look very familiar:
public T Dequeue()
{
    if (IsEmpty) {
        WaitUntilNonEmpty();
    }
    T value = m_buffer[m_consumerIndex];
    m_buffer[m_consumerIndex] = default(T);
    Interlocked.Exchange(ref m_consumerIndex, (m_consumerIndex + 1) % Capacity);
    if (m_producerWaiting == 1) {
        m_producerEvent.Set();
    }
    return value;
}

private void WaitUntilNonEmpty() {
    Interlocked.Exchange(ref m_consumerWaiting, 1);
    try {
        while (IsEmpty) {
            m_consumerEvent.WaitOne();
        }
    }
    finally {
        m_consumerWaiting = 0;
    }
}
This is near-identical to Enqueue and WaitUntilNonFull, and so needs little explanation. The acquire load inside IsEmpty of the producer index ensures that we observe the producer index for this particular value being beyond the current consumer’s index before loading the value itself, thereby ensuring we see the whole set of written words. The one other thing to point out is that we “null out” the element consumed which, for large buffers, helps to avoid space leaks that would have otherwise been possible.
There are certainly some opportunities for improving this.
For example, we might add a little bit of spinning in the wait cases. This would be worthwhile for cases that exchange data at very fast rates and have small buffers, meaning that the chance of hitting empty and full conditions is quite high. Avoiding the context switch thrashing is likely to lead to hotter caches, because threads will remain runnable for longer, and it also saves the raw costs of the switches themselves.
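As a rough illustration of the spinning idea, here is a minimal sketch built on System.Threading.SpinWait. The helper name TrySpinUntilClear and its maxSpins parameter are hypothetical and are not part of the queue shown above; this is one way it could be done, not the benchmarked implementation.

```csharp
using System;
using System.Threading;

static class SpinSketch
{
    // Hypothetical helper: polls stillBlocked() for at most maxSpins
    // iterations. Returns true if the condition cleared while spinning,
    // in which case the caller can skip the kernel wait entirely.
    public static bool TrySpinUntilClear(Func<bool> stillBlocked, int maxSpins)
    {
        SpinWait spinner = new SpinWait();
        for (int i = 0; i < maxSpins; i++)
        {
            if (!stillBlocked())
                return true;
            spinner.SpinOnce(); // backs off progressively, eventually yielding
        }
        return !stillBlocked();
    }
}
```

A producer could, for instance, call SpinSketch.TrySpinUntilClear(() => IsFull, 50) at the top of WaitUntilNonFull and return early on success, falling back to the flag-and-event path otherwise. The spin count of 50 is an arbitrary placeholder that would need tuning.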
Additionally, we technically could use a single event if we wanted to spend the effort. We’d have to handle a few tricky cases, however: namely, the case where a producer or consumer ends up waiting on an event because it “just missed” the event of interest, thus satisfying the event. Indeed both threads could actually end up waiting on the event simultaneously and we need to somehow ensure the right one eventually gets awakened. This leads to some chatter and probably isn’t worth the added complexity.
Here is a peek at some rough numbers from a little benchmark that has two threads enqueuing and dequeuing elements as fast as humanly (or computerly) possible. This is a particularly unique and unlikely situation, but stresses the implementation in a few interesting ways. In particular, it will stress the contentious slow paths; although these are expected to be rarer, the fast paths are just so easy to get right in this data structure that they are mostly uninteresting to stress performance-wise. There are then a few variants, each based on the original version shown above:
- 2 element capacity, which means we’ll be transitioning from empty to full and back a lot.
- 1024 element capacity, which means we won’t.
- With spinning, using .NET 4.0’s new System.Threading.SpinWait type.
- An implementation that overuses interlocked operations as many naïve programmers would do.
The 2 element capacity situation is common in some message passing systems, e.g. Ada rendezvous, Comega joins, and the like. Whereas the 1,024 element capacity situation is common for more general purpose channels, where some amount of pipelining is anticipated.
I whipped together a benchmark -- so quickly that we can barely trust it, I might add -- to measure these things. Here’s a small table, showing the observed relative costs:
Version   Variant    2 capacity   1024 capacity
As-is     No-spin    100.00%      1.93%
As-is     Spin        56.41%      1.66%
Naïve     No-spin    101.20%      2.09%
Naïve     Spin        67.73%      1.87%
As with most microbenchmarks, take the results with a grain of salt. And there are certainly more interesting variants to compare this against, including a monitor-based implementation that locks around access to the buffer itself. Nevertheless, we can draw a few conclusions from this: as expected, the version that uses a single interlocked on enqueue and single interlocked on dequeue is faster than the naïve version that uses multiple (surprise!); spinning makes a much more interesting difference on the 2 element capacity situation, as expected, because it reduces the number of context switches dramatically; and, finally, the larger capacity enables a producer to race ahead of the consumer, hence incurring far fewer transitions from full to empty to full and so forth.
This post was more of a case study than anything else. There is nothing conclusive or groundbreaking here, and I suppose I should have said that would be the case up front. That said, I’ve seen this technique used in over a dozen situations in actual product code now, so I figured I’d write a little about it, with a focus on how to minimize the synchronization operations. We even contemplated shipping such a type in Parallel Extensions to .NET, but it’s just too darn specialized to warrant it. So the closest thing we provided is BlockingCollection<T>. Enjoy.
 Monday, September 28, 2009
I've officially started down the long road of writing a 2nd edition of Concurrent Programming on Windows, and would like your help.
There are many great new features in Windows 7 and the next versions of .NET, Visual C++ / CRT, and Visual Studio. The book will of course cover them all.
But I am also looking to reshape the 1st edition in many dimensions. I'd like to focus on readability, conciseness, and clearly separating the "must know" topics from the more geeky and advanced ones. This is a common conundrum when writing a technical book. The advanced topics are more likely to appeal to readers of my blog, for instance, but may be daunting for newcomers to concurrency. Tradeoffs abound. Nevertheless the 2nd edition is likely to be slimmed down compared to the 1st.
Any and all feedback, suggestions, and ideas are welcome. What did you like about the 1st edition, and what did you not like? If you could change a handful of things, what would make the top of your list? And was it missing something crucial that you would like to see covered? Please send your feedback to joe AT@ acm DOT org, or simply leave comments here on the blog. Regardless of whether you've read the 1st edition or not.
I sincerely look forward to hearing from you. Cheers.
 Monday, July 27, 2009
I had originally entitled this post "Having your concurrency cake and eating it too", but it sounded a little too silly.
I have grown convinced over the past few years that taming side effects in our programming languages is a necessary prerequisite to attaining ubiquitous parallelism nirvana. Although I am continuously exploring the ways in which we can accomplish this -- ranging from shared nothing isolation, to purely functional programming, and anything and everything in between -- what I wonder the most about is whether the development ecosystem at large is ready and willing for such a change.
It is this that I find the most frightening. I know we can give the world Haskell, or Erlang, or simple incremental steps within familiar environments, like Parallel Extensions. (Indeed, the world already has these things.) But elevating effects to a first class concern in day-to-day programming turns out to be a tough pill to swallow. Particularly since the incremental degrees of parallelism that this switch will unlock are questionable (see this and this); and even if they were pervasive and impressive, it's unclear what percentage of developers will pay what specific price for a 2x, 4x, or even 16x increase in compute performance. It sounds great on paper, but the cost / benefit equation is a complicated one.
"Pay for play" is the standard terminology we use for such things around here, and the solution needs to have the right amount of it.
Many folks with embarrassingly parallel algorithms will succeed just fine in a shared memory + locks + condition variables world, and indeed have already begun to do so. And specialized tools -- like GPGPU programming -- have popped up that, when small kernels of computations are written in a highly constrained way, will parallelize, sometimes impressively. Is this enough? Perhaps for the next 5 years, but surely not much longer after that. It is in my opinion qualitatively very important for the future of computer science that we provide programming environments that are more conducive to safe and automatic parallelism. And yet I cannot stand up with a straight face and proclaim that each and every developer on the face of the planet should practice side effect abstinence. A healthy balance between cognitive familiarity and pragmatic [r]evolution must be found. Many promising approaches are in the works (see UIUC's DPJ), but we are years away.
Until then, parallelism on broadly deployed commercial platforms will likely remain in the realm of specialists.
Of course, Haskell and Erlang both accomplish the no effects feat in a sneaky way. For those interested in foisting parallelism onto the masses, lessons can be learned from these communities. If you buy into purely functional programming, you necessarily buy into programming without effects, and the (sparing) use of monads to represent them. (Or, as my colleague Erik calls it, fundamentalist functional programming.) And if you buy into large scale message passing, you (typically) necessarily also buy into programming without shared memory, leaving behind only strongly isolated effects. The key here is that developers gain many other benefits by switching to these platforms -- and the lack of effects is admittedly a consequential byproduct of this switch. The lack of effects is not center stage. The two approaches have recently begun to converge in what I believe to be the appropriate long-term approach: strong isolation with effects within, and safe, deterministic data parallelism through careful control over sharing, aliasing, and heap separation.
That said, though not center stage, the switch to effectless programming is certainly not painless.
Enabling side effects among otherwise functional code, I think, is a good thing, because it allows familiar algorithms to be encoded in an ordinary imperative way. Familiarity is key: it may sound two faced, but I don't think parallelism is sufficiently top of mind that developers will want to completely rearrange the way that they write software. Perhaps we will evolve in this direction, but a significant leap will fall flat. Moreover, many algorithms actually depend on stateful updates to achieve adequate performance, like write in place graphics buffers. The Haskell state monad strikes a nice balance here: when coupled with the do notation, it embeds imperative-looking effects within a purely functional language.
Furthermore, I really respect that Haskell discourages cheating. (Any unsafePerformIO is viewed with great suspicion.) I quite like mostly-functional programming languages like ML and Scheme, because they tend to be easier on programmers with C backgrounds, but strongly dislike that a mutation can lurk within what appears to be an otherwise pure function. Documenting side effects in the type system is healthy and allows better symbolic reasoning about the dependencies and implicit parallelism contained within, transitively, while still providing a way to get at effectful programming. Haskell does a great job at this. The elimination of dependence ought to be the focus of programmers, and not the elimination of ad-hoc and unstructured access to shared, mutable state. These are algorithmic and important concerns.
What remains unclear is where the boundaries lie. Part and parcel of documenting effects is thinking about them when designing your software. You need to consider whether IList<T>'s Contains method may mutate the list or not, for example, and hold the line on implementations of the interface. Either it returns an 'a' or an 'IO a' -- and this decision is one that has far reaching implications. This is a wholly separate kind of interface contract than what most programmers are accustomed to having to think about during the code-debug-edit cycle. And surely Python and JavaScript developers will not care one way or the other, particularly if it forces more design decisions up front than what is customary today. This bifurcation seems inevitable, and yet there is substantial crossover: C# developers will write Python scripts, and Python developers will consume components written in C#.
And yet, I think we need to venture down this path in order to achieve automatically scalable software. Parallel computers have become incredibly cheap, and so the historical barriers into high performance technical computing have been whittled away to the software skills necessary to write scalable programs; we will likely succeed at expanding this market without radical changes, but if we stopped there, vast reams of client-side software will be left in the dust. I've been making inroads into solving the problem on my end, with a new language that sits between C# and Haskell. I'm biased, have been hard at work on this problem for many years, and yet still struggle to answer these fundamental questions. I am a big believer that there's got to be a happy medium out there. But I'm still very perplexed, and face some very high walls to hurdle. Who will discover the right balance, and when will they do so?
 Monday, July 13, 2009
In this blog post, I'll demonstrate building some very simple (but nice!) synchronization abstractions: a Lock type and a standalone ConditionVariable class. And we'll use a few new types in .NET 4.0 in the process. I had to implement a condition variable recently -- the joys of developing a new operating system / platform from the ground up -- and decided to put together a toy example for a blog post as I went. Warning: this is for educational purposes only.
Not to sound like a broken record, but it is a very good idea to manage locks intentionally. Doing so makes synchronization code easier to write, understand, and, correspondingly, maintain; given the difficult nature of concurrency, any opportunity for simplification is always welcomed. Yes, that means avoiding the CLR's dreadful capability to lock on arbitrary objects. (Which, by the way, is effectively just a holdover from the days where .NET was trying to woo developers from Java onto the platform.) In retrospect, this ability was a bad idea, and we should have provided and embellished a System.Threading.Lock class from Day One.
Well, rewind the clock and imagine we had provided such a Lock class. In fact, here's an overly simple one right here. I'm going to cheat a little, and reuse two locks that come with .NET 4.0: Monitor itself, and the new SpinLock class:
//#define SPIN_LOCK
public class Lock
{
#if SPIN_LOCK
    private SpinLock m_slock = new SpinLock();
#else
    private object m_slock = new object();
#endif
    private ThreadLocal<int> m_acquireCount = new ThreadLocal<int>();

    public void Enter() {
#if SPIN_LOCK
        // SpinLock.Enter requires the flag to be false on entry.
        bool ignoreTaken = false;
        m_slock.Enter(ref ignoreTaken);
#else
        Monitor.Enter(m_slock);
#endif
        m_acquireCount.Value = m_acquireCount.Value + 1;
    }

    public void Exit() {
        m_acquireCount.Value = m_acquireCount.Value - 1;
#if SPIN_LOCK
        m_slock.Exit();
#else
        Monitor.Exit(m_slock);
#endif
    }

    public bool IsHeld {
        get { return m_acquireCount.Value > 0; }
    }

    public int RecursionCount {
        get { return m_acquireCount.Value; }
    }
}
Okay, this is not rocket science. And to be fair, it's missing some critical features, like reliable acquisition (finally available on Monitor in 4.0, and also SpinLock), and lock leveling. But it's a start.
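For reference, the reliable acquisition pattern mentioned here looks roughly like the following. This is a minimal sketch using the Monitor.Enter(object, ref bool) overload from .NET 4.0; the class, field, and method names are made up for illustration and are not part of the Lock class above.

```csharp
using System;
using System.Threading;

static class ReliableAcquireSketch
{
    private static readonly object s_gate = new object();
    private static int s_counter;

    // Increments a shared counter under the lock and returns the new value.
    public static int IncrementUnderLock()
    {
        bool lockTaken = false;
        try
        {
            // lockTaken becomes true only if the lock was actually acquired,
            // so the finally block never releases a lock we do not own.
            Monitor.Enter(s_gate, ref lockTaken);
            return ++s_counter;
        }
        finally
        {
            if (lockTaken)
                Monitor.Exit(s_gate);
        }
    }
}
```

A Lock.Enter built this way would presumably take a ref bool parameter of its own and forward it to the underlying Monitor or SpinLock call.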
Once we've got such a Lock class, we may want to extend it with 1st class condition variable support. Condition variables are core to the monitor concept, and provide a synchronization point that combines a lock with some condition that may be waited upon and triggered. They help to avoid all the pitfalls of standalone events: mainly missed pulses due to the lack of synchronization involved between producers and consumers.
Furthermore, imagine we allow multiple separate ConditionVariable objects per single Lock object. This is a feature that Monitor doesn't currently support (though Win32 CONDITION_VARIABLEs do). This capability would enable us to, say, create a bounded buffer with a single lock to protect the queue, and two separate condition variables: one for the non-empty condition, and the other for the non-full condition. This simplifies the implementation, and helps to avoid deadlock-prone techniques that result from trying to use multiple separate synchronization objects.
The trick is that the Lock and ConditionVariable classes need to be well-integrated. So we will provide a constructor that accepts a Lock object:
public class ConditionVariable
{
    private Lock m_slock;

    public ConditionVariable(Lock slock) {
        if (slock == null)
            throw new ArgumentNullException("slock");
        m_slock = slock;
    }
Once we've got that, there are two basic operations to implement: waiting and pulsing (signaling). To achieve this, we'll give each thread its own ManualResetEventSlim object -- a lightweight event class, new to .NET 4.0. (Ironically, it uses Monitor.Wait and Pulse under the covers.) This event will be stored in an instance of the new .NET 4.0 type, ThreadLocal<T>. (An alternative is to store it in a [ThreadStatic], and reuse the same event across all ConditionVariables. Since we only support waiting on one such condition at a time (currently), there is no reason we can't just have one per thread. This is precisely what the CLR does internally, though it's a shame we can't grab hold of that preexisting event.) In addition to that, we'll need a wait-list, maintained in FIFO order as a Queue<ManualResetEventSlim>:
    private Queue<ManualResetEventSlim> m_waiters =
        new Queue<ManualResetEventSlim>();
    private ThreadLocal<ManualResetEventSlim> m_waitEvent =
        new ThreadLocal<ManualResetEventSlim>();
Waiting does pretty much what you'd imagine. The m_slock object doubly acts as protection against concurrent access to the waiters list. So when a Wait call is made, we demand that the lock is held by the calling thread. Subtly, we also demand that it hasn't been recursively acquired, since that would require exiting the lock multiple times. This can lead to desynchronization bugs. Unfortunately, Monitor does this, but is critically broken as a result. Once the validation occurs, Wait simply places the current thread into the wait list, exits the lock, waits to be awakened, and then reacquires the lock before returning. This is pretty much exactly what the CLR Monitor class does internally:
    public void Wait() {
        int rcount = m_slock.RecursionCount;
        if (rcount == 0)
            throw new InvalidOperationException("Lock is not held.");
        if (rcount > 1)
            throw new InvalidOperationException("Lock is held recursively.");

        // Lazily initialize our event, if necessary.
        ManualResetEventSlim mres = m_waitEvent.Value;
        if (mres == null) {
            mres = m_waitEvent.Value = new ManualResetEventSlim(false);
        }
        else {
            mres.Reset();
        }

        m_waiters.Enqueue(mres);
        m_slock.Exit();
        mres.Wait(); // bugbug: interrupt => desync.
        m_slock.Enter();
    }
Lastly, we must implement the Pulse and PulseAll methods. For kicks, we'll provide an overload of Pulse -- which normally awakens one waiting thread -- that awakens an arbitrary maximum number of threads. So you could say Pulse(4) to awaken at most 4 threads, for example. These methods are even simpler than Wait: they dequeue events off the wait list, and just set them. This awakens the waiters, as desired:
    public void Pulse() {
        Pulse(1);
    }

    public void Pulse(int maxPulses) {
        if (!m_slock.IsHeld)
            throw new InvalidOperationException("Lock is not held.");
        for (int i = 0; i < maxPulses; i++) {
            if (m_waiters.Count > 0) {
                m_waiters.Dequeue().Set();
            }
            else {
                break;
            }
        }
    }

    public void PulseAll() {
        Pulse(int.MaxValue);
    }
}
(This has the unfortunate side effect of two-step dances. The pulse will awaken threads at the mres.Wait() line in Wait, and they immediately try to call m_slock.Enter() as a result. A priority boost may cause them to preempt the pulsing thread, even though they will just end up waiting. A possible fix to this is to even more tightly integrate the Lock and ConditionVariable classes, by having a "deferred pulse" list attached to the lock. Once it has been completely exited, the Lock's Exit method could drain the deferred pulse list and awaken the threads, thus avoiding the two-step dance.)
As to examples, let's take a quick peek at a blocking / bounded queue. When constructed, a capacity is given. Whenever an Enqueue would cause the buffer's contents to exceed the capacity, the producer is blocked until space is made by a consumer. Whenever a Dequeue is attempted on an empty buffer, the consumer is blocked until an item is produced. Though there are opportunities for optimization, this is encoded straightforwardly as follows:
class BlockingQueue<T>
{
    private int m_capacity;
    private Queue<T> m_q;
    private Lock m_qLock;
    private ConditionVariable m_qNonFullCondition;
    private ConditionVariable m_qNonEmptyCondition;

    public BlockingQueue(int capacity) {
        m_capacity = capacity;
        m_q = new Queue<T>();
        m_qLock = new Lock();
        m_qNonFullCondition = new ConditionVariable(m_qLock);
        m_qNonEmptyCondition = new ConditionVariable(m_qLock);
    }

    public void Enqueue(T item) {
        m_qLock.Enter();
        while (m_q.Count == m_capacity)
            m_qNonFullCondition.Wait();
        m_q.Enqueue(item);
        m_qNonEmptyCondition.Pulse();
        m_qLock.Exit();
    }

    public T Dequeue() {
        T item;
        m_qLock.Enter();
        while (m_q.Count == 0)
            m_qNonEmptyCondition.Wait();
        item = m_q.Dequeue();
        m_qNonFullCondition.Pulse();
        m_qLock.Exit();
        return item;
    }
}
The naive approach typically uses a single event to signal the non-empty / non-full transitions. The risk of doing this, of course, is that the wrong kind of thread (producer or consumer) will be signaled, depending on chance and wait arrival order. This is ordinarily only a concern for bounded queues of reasonably small sizes, and high degrees of concurrency, but is still an interesting example of why multiple condition variables per lock are useful.
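For comparison, here is a minimal sketch of the single-wait-list approach, written against the CLR Monitor's one built-in condition rather than a raw event. The class name SingleConditionQueue is hypothetical. Because producers and consumers share the same wait list, the sketch wakes everybody with PulseAll and lets each thread recheck its own condition, which is exactly the extra chatter that two separate condition variables avoid.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical comparison class: one lock, one shared wait list.
class SingleConditionQueue<T>
{
    private readonly object m_gate = new object();
    private readonly Queue<T> m_q = new Queue<T>();
    private readonly int m_capacity;

    public SingleConditionQueue(int capacity) {
        m_capacity = capacity;
    }

    public void Enqueue(T item) {
        lock (m_gate) {
            while (m_q.Count == m_capacity)
                Monitor.Wait(m_gate);
            m_q.Enqueue(item);
            // A single Pulse might wake another producer instead of a
            // consumer, so wake all waiters and let them recheck.
            Monitor.PulseAll(m_gate);
        }
    }

    public T Dequeue() {
        lock (m_gate) {
            while (m_q.Count == 0)
                Monitor.Wait(m_gate);
            T item = m_q.Dequeue();
            Monitor.PulseAll(m_gate);
            return item;
        }
    }
}
```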
Enjoy!
 Tuesday, June 23, 2009
I wrote this memo over 2 1/2 years ago about what to do with concurrent exceptions in Parallel Extensions to .NET. Since Beta1 is now out, I thought posting it may provide some insight into our design decisions. (And yes, most design discussions start this way. Somebody develops a personal itch, dives deep into it, and emerges with a proposal for others to vote up, shoot down, or, as is typically the case, somewhere in the middle (provide constructive feedback, iterate, etc).) I've made only a few slight edits (like replacing code- and type-names), but it's mainly in original form. I still agree with much of what I wrote, although I'd definitely write it differently today. And in retrospect, I would have driven harder to get deeper runtime integration. Perhaps in the next release.
~~~
Concurrency and Exceptions October, 2006
Exceptions raised inside of concurrent workers must be dealt with in a deliberate way. Failures can happen concurrently, and yet often the programmer is working with an API that appears to them as though it’s sequential. The basic question is, then, how do we communicate failure?
The problem
Fork/join concurrency, in which a single “master” thread forks and coordinates with N separate parallel workers, is an incredibly common instance of one of these sequential-looking concurrent operations. The same callback is run by many threads at once, and may fail zero, one, or multiple times. The exception propagation problem is inescapable here and comes with a lot of expectations, because the programmer is presented a traditional stack-based function calling interface papered on top of data or task parallelism underneath.
I am faced with the need for a solution to this problem for PLINQ right now and, while I could invent a one-off solution, we owe it to our customers to come up with a common platform-wide approach (or at least ManyCore-wide). Any solution should compose well across the stack, so that somebody invoking a PLINQ query from within their TPL task that was spawned from a thread pool thread yields the expected and consistent result. And I would like for us to reach consensus for both managed and native programming models.
Before moving on, there is one non-goal to call out. Long-running tasks not under the category of fork/join also deserve some attention, because of the ease with which stack traces can be destroyed and the corresponding impact to debugging, but I will ignore them for now. The problem is not new, exists with the IAsyncResult pattern, and PLINQ doesn’t use this sort of singular asynchronous concurrency. These cases can typically be trivially solved using existing mechanisms, like standard exception marshaling.
No errors, one error, many errors
To understand the core of the issue, imagine we have an API ‘void ForAll<T>(Action<T> a, T[] d)’. It takes a delegate and an array, and for every element ‘e’ in ‘d’ invokes the delegate, passing the element, i.e. ‘a(e)’. If multiple processors are available, the implementation of ForAll may use some heuristic to distribute work among several OS threads, for instance by partitioning the array, probably running one partition on the caller’s thread, and finally joining with these threads before returning so that the caller knows that all of the work is complete when the API returns.
ForAll is not fictitious, and is similar to a number of PLINQ APIs: Where, Select, Join, Sort, etc. It is also exposed directly by the TPL runtime’s Parallel class which intelligently forks and joins with workers.
‘a’ is a user-specified delegate and can do just about anything. That includes, of course, throwing an exception. What’s worse, because ‘a’ is run in several threads concurrently, there may be more than one exception thrown. In fact, there are three distinct possibilities:
- No errors: No invocations of ‘a’ throw an exception.
- One error: A single invocation of ‘a’ throws an exception.
- Many errors: Concurrent invocations of ‘a’ on separate threads throw exceptions.
Clearly letting an exception crash whichever thread the problematic ‘a(e)’ happened to be run on is problematic and confusing. If for no other reason than the IAsyncResult pattern has established precedent. But realistically, the developer would be forced to devise his or her own scheme to marshal the failure back to the calling thread in order for any sort of chance at recovery. They would get it wrong and it would lead to incompatible and poorly composing silos over time. A Byzantine model that fully prohibits exceptions passing fork/join barriers goes against the simple, familiar, and understandable (albeit often deceptively so) model of exceptions.
(That said, marshaling leads to a crappy debugging experience. An already attached debugger will get a break-on-throw notification at the exception on the origin thread, but since we catch, marshal, and (presumably) rethrow, the first and second chances for unhandled exceptions won’t happen until after the exception has been marshaled. This breaks the first pass, and by the time the debugger breaks in, or a crash dump is taken, the stack associated with the origin thread is apt to have gone away, been reused for another task (in the case of the thread pool), etc. We generally try to avoid breaking the first pass in the .NET Framework, but do it in plenty of places: the BCL today already contains tons of try { … } catch { /*cleanup */ throw; }-style exception handlers, for example. For this reason I’m not terribly distraught over the implications of doing it ourselves. And sans deeper integration with the exception subsystem – something we ought to consider – there aren’t many reasonable alternatives.)
What makes this problem really bad is that ForAll appears as though it’s synchronous:
void f() {
    // do some stuff
    ForAll(…, …);
    // do some more stuff, ‘ForAll’ is completely done
}
The method call to ForAll itself is synchronous, but of course its internal execution is not. But still, to the developer, the call to this function represents one task, one logical piece of work, regardless of the fact that the implementation uses multiple threads for execution. As higher level APIs are built atop things like ForAll, the low level parallel infrastructure problem becomes a higher level library or application problem. A Sort that is internally parallel must now decide what exception(s) it will tell callers it may throw.
Nondeterministic exception ordering
We assume the ForAll API stops calling ‘a(e)’ on any given thread when it first encounters an exception. That is, each thread just does something like this:
for (int i = start_idx; i < end_idx; i++) {
    a(d[i]);
}
The for loop terminates when any single iteration throws an exception. Imagine our array contains 2048 elements and that ForAll smears the data across 8 threads, partitioning the array into 256-element sized chunks of contiguous elements. So partition 0 gets elements [0…256), partition 1 gets [256…512), …, and partition 7 gets [1792…2048). Now imagine that ‘a’ throws an exception whenever fed a null element, and that every 256th element in ‘d’, starting at element 10, is null. What can a developer reasonably expect to happen?
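To make the arithmetic concrete, the following sketch builds the 2048-element array described above and reports, for each of the 8 contiguous partitions, the first index at which that partition would hit a null. The class and method names are invented for illustration, and the code assumes the array length divides evenly by the partition count.

```csharp
using System;

static class PartitionSketch
{
    // Builds the example array: every 256th element, starting at 10, is null.
    public static object[] BuildExampleArray()
    {
        object[] d = new object[2048];
        for (int i = 0; i < d.Length; i++)
            d[i] = (i % 256 == 10) ? null : (object)i;
        return d;
    }

    // For each partition, returns the index of the first null element it
    // would encounter, or -1 if the partition contains no null.
    public static int[] FirstFaultPerPartition(object[] d, int partitions)
    {
        int chunk = d.Length / partitions; // assumes an even split
        int[] firstFault = new int[partitions];
        for (int p = 0; p < partitions; p++)
        {
            firstFault[p] = -1;
            for (int i = p * chunk; i < (p + 1) * chunk; i++)
            {
                if (d[i] == null) { firstFault[p] = i; break; }
            }
        }
        return firstFault;
    }
}
```

With this data every one of the 8 partitions contains exactly one null (at indices 10, 266, 522, and so on up to 1802), so up to 8 threads may each throw once, in no particular order.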
On one hand, if we’re trying to preserve the illusion of sequential execution, we would only want to surface the exception from the 10th element. With a sequential loop, this would have prevented the 266th, 522nd, and so on, elements from even being passed to ‘a’. So we might simply say that the “left most” exception (based on ordinal index) is the one that gets propagated. The obvious problem with this is there are races involved: subsequent iterations indeed may have actually run. Alternatively, we might consider only letting the “first” propagate. Unfortunately, that doesn’t work either, because we can’t necessarily determine, for a set of concurrent exceptions, which got thrown first. Even if they have timestamps, they could occur in parallel at indistinguishably close times. Nor does this really matter, because it feels fundamentally wrong.
The reason is that we can’t simply throw away failures without true recoverability in the system, a la STM. The execution of code leading up to the exception did actually happen, after all, and there could be residual effects. We might be masking a terrible problem by throwing failures away, possibly leading to (more) state corruption and (prolonged, perhaps unrecoverable) damage. What if the 10th element was a simple ArgumentNullException that the caller chooses to tolerate, but the 266th element’s exception was in response to a catastrophic error from which the application can’t recover? We can’t choose to propagate the 10th but swallow the 266th. Broadly accepted exceptions best practices suggest that app and library devs never catch and swallow exceptions they cannot reasonably handle. We should do our best to follow the spirit of this guidance too.
Re-propagation
We could employ an approach similar to the IAsyncResult pattern, with some slight tweaks.
If each concurrent copy of ForAll caught any unhandled exceptions and marshaled them to the forking thread, including any exceptions that happen on the forking thread itself, we could then propagate all of them together after the join completes. The question is then: what exactly do we propagate?
If there is just a single exception, it’s tempting to just rethrow it. But I don’t believe this is a good approach for two primary reasons:
- This will destroy the stack trace of the original exception. This means no information about the actual source of the error inside ‘a’ is available. With some help from the CLR team, we might be able to get a special type of ‘rethrow’ that copied the original stack trace before recreating a new one. This is already done for remoted exceptions, and the Exception base class will prefix the original remoted stack trace to the new stack trace.
- This doesn’t scale to handle multiple exceptions. If we could solve #1, it might be attractive because it appears as-if things happened sequentially, but we can’t escape #2, no matter what we do. We could have different behavior in these two cases, but I believe it’s better to remain consistent instead. Otherwise, developers will need to write their exception handlers two ways: one way to handle singular cases, and the other way to handle multiple cases, where the same API may do either nondeterministically.
Given that we need to propagate multiple exceptions, we should wrap them in an aggregate exception object, and propagate that instead. At least this way, the original exceptions will be preserved, stack trace and all. Of course the original exceptions themselves might be other aggregates, handling arbitrary composition.
For sake of discussion, call this aggregate exception System.AggregateException, which of course derives from System.Exception. It exposes the raw Exception objects thrown by the threads, via an ‘Exception[] InnerExceptions’ property, and additional meta-data about each exception: from which thread it was thrown, and any API specific information about the concurrent operation itself. This last part is just to help debuggability. For instance, we might tell the developer that the ArgumentNullException was thrown from a thread pool thread with ID 1011, and that it occurred while invoking the 266th element ‘e’ of array ‘d’. We might also guarantee the exceptions will be stored in the order in which they were marshaled back to the forking thread, just to help the developer (as much as we can) piece together the sequence of events leading to failures.
(Editor’s note: we decided against storing this meta-data information for various reasons.)
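To make the fork/join shape concrete, here is a minimal sketch of a ForAll that marshals every worker's exception back to the forking thread and propagates them together after the join. The class name ParallelSketch, the explicit 'workers' parameter, and the simple range partitioning are assumptions for illustration only; the sketch uses the AggregateException type that ships with .NET 4, and it carries none of the per-thread meta-data discussed above.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class ParallelSketch
{
    // Hypothetical fork/join ForAll: runs 'a' over 'd' on 'workers' threads and
    // propagates every failure together once all workers have joined.
    public static void ForAll<T>(Action<T> a, T[] d, int workers)
    {
        var failures = new List<Exception>();   // marshaled back to the forking thread
        var threads = new Thread[workers];
        int stride = (d.Length + workers - 1) / workers;

        for (int w = 0; w < workers; w++)
        {
            int start = Math.Min(w * stride, d.Length);
            int end = Math.Min(start + stride, d.Length);
            threads[w] = new Thread(() =>
            {
                try
                {
                    for (int i = start; i < end; i++) a(d[i]);
                }
                catch (Exception ex)
                {
                    // Keep the original exception object, stack trace and all.
                    lock (failures) failures.Add(ex);
                }
            });
            threads[w].Start();
        }

        foreach (Thread t in threads) t.Join();  // the "join"

        if (failures.Count > 0)
            throw new AggregateException(failures);
    }

    public static void Main()
    {
        int[] d = new int[2048];
        for (int i = 0; i < d.Length; i++) d[i] = i;

        try
        {
            // The 10th element of each 256-element partition fails.
            ForAll(x => { if (x % 256 == 9) throw new ArgumentException("element " + x); }, d, 8);
        }
        catch (AggregateException pex)
        {
            Console.WriteLine("{0} exceptions", pex.InnerExceptions.Count);  // prints "8 exceptions"
        }
    }
}
```

Each worker stops at its first failure, so with 8 partitions the aggregate holds exactly one exception per partition. The order of InnerExceptions here is simply the order in which workers took the lock, which is the "order in which they were marshaled back" and nothing stronger.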
Now the dev can do whatever he or she wishes in response to the exception. Previously they might have written:
try {
    ForEach(a, d);
} catch (FileNotFoundException fex) {
    // Handler(fex);
}
And now they would have to instead write:
try {
    ForAll(a, d);
} catch (AggregateException pex) {
    List<Exception> unhandled = new List<Exception>();
    foreach (Exception e in pex.InnerExceptions) {
        FileNotFoundException fex = e as FileNotFoundException;
        if (fex == null) {
            // Not a FileNotFoundException: keep the original for repropagation.
            unhandled.Add(e);
        } else {
            // Handler(fex);
        }
    }
    if (unhandled.Count > 0)
        throw new AggregateException(unhandled);
}
In other words, they would catch the AggregateException, enumerate over the inner exceptions, and react to any FileNotFoundExceptions as they would have normally. (Taking into consideration that there might have been multiple.) At the end, if there are any non-FileNotFoundExceptions left over, we propagate a new AggregateException with the handled FileNotFoundExceptions removed. If there was only one remaining, we could, I suppose, try to rethrow just that, but this has the same nondeterminism problems mentioned above.
Very few people will write this code. But one of the most vocal arguments against it is: just throw one singular exception, such as ForAllException, and let it crash, because no developer will handle it. Well, that scheme is no better than throwing the AggregateException. At least the aggregation model lets people write backout and recovery code if they have the patience to deal with the reality that multiple exceptions occurred.
To make this slightly easier, we could expose an API, ‘void Handle(Func<Exception, bool> a)’, that effectively encapsulates the same logic as shown above, repropagating the exception at the end if not all of the exceptions were handled (i.e. the callback returned false for some of them):
try {
    ForAll(a, d);
} catch (AggregateException pex) {
    pex.Handle(delegate(Exception ex) {
        FileNotFoundException fex = ex as FileNotFoundException;
        if (fex != null) {
            // Handle(fex);
            return true;
        }
        return false;
    });
}
(One problem with this approach is that the ‘throw’ inside of Handle will destroy the original stack trace for ‘pex’. An alternative might be for Handle to modify the AggregateException in place, keeping the stack trace intact, returning a bool that the caller switches on and does a ‘throw’ if it returns false; this is unattractive because it’s error prone and could lead to accidentally swallowing, but in the end might help debuggability.)
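The bool-returning alternative can be approximated with a small helper. This is only a sketch under assumptions: AggregateHelpers and AllHandled are hypothetical names, and because the AggregateException that shipped in .NET 4 cannot be modified in place, the handled exceptions stay inside the aggregate when it is rethrown. What it does show is the caller switching on the result and using a bare ‘throw’, which keeps the stack trace of ‘pex’ intact.

```csharp
using System;
using System.IO;

public static class AggregateHelpers
{
    // Hypothetical helper: runs 'handler' on every inner exception and reports
    // whether all of them were handled. It never throws on its own.
    public static bool AllHandled(AggregateException pex, Func<Exception, bool> handler)
    {
        bool all = true;
        foreach (Exception e in pex.InnerExceptions)
        {
            if (!handler(e)) all = false;
        }
        return all;
    }

    public static void Main()
    {
        try
        {
            try
            {
                throw new AggregateException(
                    new FileNotFoundException("missing.txt"),
                    new InvalidOperationException("unexpected"));
            }
            catch (AggregateException pex)
            {
                // A bare 'throw' rethrows 'pex' itself with its stack trace intact.
                if (!AllHandled(pex, e => e is FileNotFoundException)) throw;
            }
        }
        catch (AggregateException outer)
        {
            Console.WriteLine(outer.InnerExceptions.Count);  // prints 2
        }
    }
}
```

Forgetting the ‘throw’ silently swallows everything, which is exactly the error-prone part mentioned above.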
If we cared about eliminating unnecessary catch/rethrows, we could use 1st pass filters instead, but this would only be available to VB and C++/CLI programmers, as C# doesn’t expose filters. For example, in pseudo-code:
try {
    ForAll(a, d);
} catch (fex.InnerExceptions.Contains<FileNotFoundException>()) {
    // Handle …
}
Although interesting, we’re trying to move away from our two pass model. So let’s forget about this for now.
This approach suffers when composing with non-aggregate exception aware code. For it to work well, everybody on the call stack needs to be looking inside the aggregate for “their” exception, handling it, and possibly repropagating. If we want existing BCL APIs to start using data parallelism internally, we would have to be careful here, not to break AppCompat because we start throwing AggregateExceptions instead of the originals.
This is probably where there’s an opportunity for better CLR and tool integration. For instance, you could imagine a world where the CLR automatically unravels the parallel failures, matching and running handlers for specific exceptions inside the aggregate as it goes, but repropagating if all exceptions weren’t handled. This is very hand-wavy and fundamentally changes the way exceptions work, so it would require a lot more thought. A catch block that swallows an exception (today) is just about guaranteed—asynchronous exceptions aside—that the IP will soon reach the next instruction after the try/catch block. This is a pretty basic invariant. With this proposal, that wouldn’t be the case, and would be bound to break large swaths of code. Sticking with the library approach (with all its imperfections) seems like the best plan of attack for now.
Waiting for the “join” to finish
There was something implicit in the design mentioned above. The ForAll API, and others like it, wouldn’t actually propagate exceptions until the fate of all threads was known.
Imagine we have the scenario described earlier (2048 elements, 8 threads), but slightly different: the 0th element causes an exception, but no other. It turns out this is probably a common case, i.e. that only a subset of the partitions will yield an exception. In this case, we would still have to wait for 7*256 = 1,792 elements to be run through ‘a’ before this exception is propagated. Imagine a slightly different case. The 0th element throws a catastrophic exception, and the application is going to terminate as soon as it propagates. ‘a’ simply can’t be run any more, and will keep reporting back this same exception. But it will take 8 of these exceptions to actually stop the application, i.e. by calling ‘a’ on the 0th, 256th, 512th, etc. elements, if we wait for all tasks to complete. If each exception corresponds to some failed attempt at forward progress, one that possibly corrupts state, then the damage is O(N) times “worse” (for some measurement) than in the sequential program, where N is the number of concurrent tasks.
Instead of waiting helplessly, we could try to aggressively shut down these concurrent workers.
At first, you might be tempted to employ CLR asynchronous thread aborts, but this is fraught with peril. Almost all .NET Framework code today is taught that thread abort == AppDomain unload, and reacts accordingly. State corruption stemming from libraries as fundamental as the BCL would be just about guaranteed. Changing this state of mind and the state of our software would be quite the undertaking.
Instead, we can have the concurrent API itself periodically check an ‘abort’ flag shared among all workers. The first thread to propagate an exception would set this flag. And whenever a worker has seen that it has been set, it voluntarily returns instead of finishing processing data:
for (int i = start_idx; i < end_idx && !aborted; i++) {
    a(d[i]);
}
This increases the responsiveness of exception propagation, but clearly isn’t foolproof. There will still be a delay for long-running callbacks. Thankfully, with PLINQ, TPL, and I hope most of our parallel libraries, the units of work will be individually fine-grained, and therefore this technique should suffice.
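Putting the flag together with the aggregation from earlier, a worker loop might look like the following sketch. AbortableForAll, its Invocations counter, and the partitioning are hypothetical names and choices made for illustration; the only part taken from the text is the shared flag that the first failing worker sets and that every worker polls between callbacks.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public sealed class AbortableForAll
{
    private volatile bool m_aborted;   // shared 'abort' flag, polled by every worker
    private readonly List<Exception> m_failures = new List<Exception>();
    private int m_invocations;         // number of callbacks that were started

    public int Invocations { get { return m_invocations; } }

    public void Run(Action<int> a, int[] d, int workers)
    {
        var threads = new Thread[workers];
        int stride = (d.Length + workers - 1) / workers;

        for (int w = 0; w < workers; w++)
        {
            int start = Math.Min(w * stride, d.Length);
            int end = Math.Min(start + stride, d.Length);
            threads[w] = new Thread(() =>
            {
                try
                {
                    // Voluntarily stop as soon as any worker has failed.
                    for (int i = start; i < end && !m_aborted; i++)
                    {
                        Interlocked.Increment(ref m_invocations);
                        a(d[i]);
                    }
                }
                catch (Exception ex)
                {
                    m_aborted = true;  // tell the other workers to wind down
                    lock (m_failures) m_failures.Add(ex);
                }
            });
            threads[w].Start();
        }

        foreach (Thread t in threads) t.Join();

        if (m_failures.Count > 0)
            throw new AggregateException(m_failures);
    }

    public static void Main()
    {
        int[] d = new int[2048];
        for (int i = 0; i < d.Length; i++) d[i] = i;

        var loop = new AbortableForAll();
        try
        {
            // Element 0 fails; every other callback does a little busy work.
            loop.Run(x => { if (x == 0) throw new InvalidOperationException(); Thread.SpinWait(1000); }, d, 8);
        }
        catch (AggregateException pex)
        {
            Console.WriteLine("{0} failure(s) after {1} of {2} callbacks",
                pex.InnerExceptions.Count, loop.Invocations, d.Length);
        }
    }
}
```

How many callbacks run before the flag is noticed depends on timing, so the count printed will vary from run to run. A callback that is already executing always runs to completion, which is why the granularity of the work matters.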
If a concurrent worker is blocked, there’s not a whole lot we can do. Much as with thread aborts, you might be tempted to use Thread.Interrupt to remove it from the wait condition. Unfortunately this will leave state corruption in its wake, because plenty of code does things like WaitHandle.WaitOne(Timeout.Infinite) without checking the return value or expecting a ThreadInterruptedException. The same argument applies to, say, user-mode APCs. Eventually you might also be tempted to use IO cancellation in Windows Vista to cancel errant, runaway network or disk IO requests. This would be great. But this also generally has the same problem as interruption, so until we find a general solution to that, we can’t do any of this.
(Editor’s note: We eventually solved this problem by coming up with a unified cancellation framework.)
One last note
This path forward seems best for now, but it leaves me wanting more.
In the end, this feels like a more fundamental problem. An API like ForAll gives the illusion of an ordinary, old sequential caller/callee relationship. But the callee doesn’t use a stack-based calling approach: instead, it distributes work among many concurrent workers, turning the linear stack into a sort of dynamically unfolding cactus stack (or tree). And SEH exceptions are fundamentally linear stack-based creatures.
In this world, it’s just a simple fact that data all over the place can become corrupt simultaneously. Many things can fail at once because many things are happening at once. It’s inescapable. Recovery is disastrously difficult, so most failures will end in crashes. STM’s promise for automatic recovery offers a glimmer of hope, but without it, I worry that papering a sequential “feel” on top of data/task parallelism is a dangerous game to play.
 Tuesday, June 16, 2009
One of my many focuses lately has been developing a memory ordering model for our project here at Microsoft. There are four main questions to answer when defining such a model:
- What are the ordering guarantees for ordinary loads and stores?
- What are the ordering guarantees for volatile loads and stores?
- What kinds of explicit fences are allowed?
- Where are fences used automatically, e.g. to preserve type safety and security?
These tend to be the differentiation points for any model. Everything else is mostly commodity. Not that there is much else, mind you, but respecting data dependence, not speculating ahead such that exceptions occur that wouldn't have occurred in a sequential execution, and so forth are all must haves, for instance. Most interesting permutations of answers for these questions have already been explored, and industry consensus is being reached, so it would be better to say I've been picking a model rather than defining one.
What's interesting is that memory model designers are often colored by their favorite architecture du jour. If somebody cares primarily about X86, they are apt to choose something very strong. If somebody cares primarily about ARM, however, they are apt to choose something very weak. There is a classic tradeoff here. Stronger means easier to program, while weaker means better performance. For some reason, many of the projects I've worked on have had an abundance of strong hardware (like X86) and a scarcity of weak hardware (like ARM and IA64). The reality sinks in: most developers on the team code to X86, and then when it comes time to getting more serious about the other platforms, code starts breaking all over the place. This is why the CLR went so strong in 2.0, even though IA64 was an important platform to support.
Let's look at some common answers to the above questions.
For #1:
- C++, Visual C++, ECMA 1.0, Java Memory Model, and Prism: no ordering guarantees.
- CLR 2.0: ordered stores, no ordering for loads.
For #2:
- C++: prevents compiler-only code motion, but explicit fences are needed for processor ordering.
- Visual C++, ECMA 1.0, and CLR 2.0: loads are acquire, stores are release ordered.
- Java Memory Model: loads and stores are fully ordered (sequentially consistent).
For #3:
- C++: implementation-specific.
- Visual C++: intrinsics and Win32 APIs.
- ECMA 1.0 and CLR 2.0: locks, and mostly Win32-style interlocked APIs.
- Java Memory Model: locks, compare-and-swap, atomics, etc.
For #4:
Managed environments like the CLR and JVM need to ensure type safety, even if ordinary loads and stores are unordered. This is nontrivial, because the boundary around type safety is blurred. Certainly we must ensure garbage v-table pointers are not seen. But is a thread allowed to read non-zeroed memory behind an object reference? And can it contain garbage (e.g. "values out of thin air")? What about writes done by mutator threads, including write barriers, while a concurrent collector is tracing objects in the heap? Are array lengths part of the set of protected fields that mustn't be read out of order? Strings, since they are commonly used for security checking? And so on.
It is mainly the deep questions around #4, and also some simple compatibility struggles (around things like double checked locking), that caused the stronger answers for #1 in the CLR 2.0.
In any case, I'm advocating a very different approach than the traditional models.
We pick completely weak ordering for ordinary loads and stores, to enable efficient execution on weaker platforms like ARM, PowerPC, IA64, etc. That part isn't new. But here's the clincher. No volatiles. There are special variables that are used to communicate between threads (call them volatile if you'd like), but using them implies no kind of special automatic fencing. Instead, whenever accessing such a variable, at the site of usage, the kind of fence desired must be used (compiler-enforced): full-fence (sequentially consistent), acquire-fence, release-fence, no-fence, or compiler-only-fence (for things like ensuring loads don't get hoisted as loop invariant). Of course, certain kinds of fences are sprinkled throughout the system to guarantee type safety in all of the aforementioned places (and more), but these are implementation details.
(This approach is rather like Herb Sutter's Prism and C++0x atomics. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2664.htm.)
Particularly after managing teams who developed a plethora of lock free code, I love this approach. I can review code and immediately understand what ordering invariants the developer assumed when writing the code. This doesn't really make writing lock free code any simpler, except that it forces you to pause and think about things a bit more carefully than you may have otherwise. But it certainly makes code easier to understand and maintain, and makes it clear to people that sprinkling volatile all over the place isn't going to save your butt: the only thing that will do that is careful thinking and engineering.
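To give a feel for what "the fence is named at the site of usage" looks like, here is a rough approximation using today's .NET primitives. SharedCell, its method names, and FenceDemo are hypothetical; nothing here is compiler-enforced, and the compiler-only fence is left out because I know of no portable .NET API for it. Volatile.Read and Volatile.Write (.NET 4.5 and later) stand in for acquire and release, and Thread.MemoryBarrier and Interlocked stand in for the full fence.

```csharp
using System;
using System.Threading;

// A variable used to communicate between threads. The field is deliberately
// not 'volatile': no access gets a fence unless the call site asks for one.
public sealed class SharedCell
{
    private int m_value;

    public int LoadRelaxed() { return m_value; }                       // no-fence
    public void StoreRelaxed(int v) { m_value = v; }                   // no-fence
    public int LoadAcquire() { return Volatile.Read(ref m_value); }    // acquire-fence
    public void StoreRelease(int v) { Volatile.Write(ref m_value, v); } // release-fence
    public void StoreFull(int v) { Interlocked.Exchange(ref m_value, v); } // full-fence

    public int LoadFull()                                              // full-fence (conservative)
    {
        Thread.MemoryBarrier();
        int v = m_value;
        Thread.MemoryBarrier();
        return v;
    }
}

public static class FenceDemo
{
    // Publishes a payload through a flag. The ordering assumption is visible
    // at each access: release on the flag store, acquire on the flag load.
    public static int Publish()
    {
        var payload = new SharedCell();
        var ready = new SharedCell();
        int seen = 0;

        var consumer = new Thread(() =>
        {
            var sw = new SpinWait();
            while (ready.LoadAcquire() == 0) sw.SpinOnce();
            seen = payload.LoadRelaxed();   // cannot move above the acquire load
        });
        consumer.Start();

        payload.StoreRelaxed(42);
        ready.StoreRelease(1);              // the payload store cannot move below this
        consumer.Join();
        return seen;
    }

    public static void Main()
    {
        Console.WriteLine(Publish());       // prints 42
    }
}
```

A reviewer reading Publish can see which two accesses carry the ordering and which do not, without hunting for a 'volatile' modifier on a field declaration somewhere else.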
 Thursday, June 04, 2009
An interesting alternative to reader/writer locks is to combine pessimistic writing with optimistic reading. This borrows some ideas from transactional memory, although of course the ideas have existed long before. I was reminded of this trick by a colleague on my new team just a couple days ago.
The basic idea is to read a bunch of state optimistically, without taking a lock of any sort, and then prior to using it for meaningful work (which may depend on the state being consistent and correct), a validation step must take place. This validation uses version numbers which writers are responsible for maintaining. Specifically, we'll use two version counters, version1 and version2: the writer increments version1, performs the writes, and then increments version2; and the reader reads version2, performs its reads, and then verifies that version1 is equal to the version2 that it saw at the start. If this verification fails, we'll ordinarily just do a little spinning and then go back around the loop again.
Stop for a moment and ponder something very critical to this algorithm. The writer increments variables in the opposite order of the reader's reads. To see why this works, imagine we start with version1 == version2 == 0. There are two hazards to worry about. (1) A reader begins reading, and writes occur before it has finished. And (2) a reader begins reading while a write is in progress. These are simple to detect, and in fact boil down to the same thing. A reader sees version2 == 0, and the first thing a writer does is version1++. So when the reader attempts to validate, it will notice the version2 it saw != version1 any longer. If the writer has already begun by the time the reader arrives, it is possible for the reader to know it is doomed even before it has started doing any of its reads.
Here is the code in its full glory:
using System;
using System.Threading;

public class OptimisticSynchronizer
{
    private volatile int m_version1;
    private volatile int m_version2;

    public void BeforeWrite() {
        ++m_version1;
    }

    public void AfterWrite() {
        ++m_version2;
    }

    public ReadMark GetReadMark() {
        return new ReadMark(this, m_version2);
    }

    public struct ReadMark
    {
        private OptimisticSynchronizer m_sync;
        private int m_version;

        internal ReadMark(OptimisticSynchronizer sync, int version) {
            m_sync = sync;
            m_version = version;
        }

        public bool IsValid {
            get { return m_sync.m_version1 == m_version; }
        }
    }

    public void DoWrite(Action writer) {
        BeforeWrite();
        try {
            writer();
        } finally {
            AfterWrite();
        }
    }

    public T DoRead<T>(Func<T> reader) {
        T value = default(T);
        SpinWait sw = new SpinWait();
        while (true) {
            ReadMark mark = GetReadMark();
            value = reader();
            if (mark.IsValid) {
                break;
            }
            sw.SpinOnce();
        }
        return value;
    }
}
We leave it to the caller of this class to acquire locks as appropriate to synchronize writers. Typically this will just mean wrapping a Monitor.Enter/Exit around calls to things like BeforeWrite, AfterWrite, and DoWrite. But readers explicitly do not need this same protection. DoRead exemplifies the safe reading pattern, although it can be done by hand via the ReadMark APIs.
It's also worth considering what kinds of fences are truly required for this to work. Logically speaking, we need to ensure the entrance to a protected block (either read or write) is an acquire fence, and exit from the block is a release fence. This is similar to the ordering semantics necessary for a lock block. So long as we use volatile modifiers for the version counters, and for the variables read within the protected block, this will work fine, even on weak models like IA64. The beautiful thing is that we don't need full fences, even on models like X86 that make use of store buffer forwarding. The classic store buffering case we may worry about (on a single-threaded execution) would be something like this, in pseudo-code:
version1++;
X = 42;
Y = 99;
version2++;
tmp = version2;
r0 = X;
r1 = Y;
success = (tmp == version1);
We'd be worried about satisfying some loads out of the store buffer, while satisfying others out of the memory system. But this is safe: if the load of X or Y sees a different processor's writes, then the subsequent load of version1 necessarily must witness the new value written by the other processor too. And therefore the validation will fail as we would expect and hope.
Here is a quick performance benchmark I whipped together, much in the same spirit as my previous reader/writer lock examples. I've measured varying numbers of writers (0%, 5%, 10%, 25%, 50%, and 100%), and each thread spends a certain amount of time inside the "lock region" doing some nonsense busy work. The certain amount of time is measured in terms of number of function calls (0, 10, 100, and 1000), and the work doesn't vary at all depending on whether a thread is reading or writing. I've measured four things: (1) Monitor.Enter/Exit as the baseline (where both readers and writers just acquire the mutually exclusive lock), (2) ReaderWriterLockSlim, (3) the spin-based lock that I showed previously, and (4) the new OptimisticSynchronizer class with optimistic retry. The values are the ratio compared to the baseline (1), so that >1.0x means the particular entry is slower, while <1.0x is faster. I did these measurements on an 8-way machine -- unlike the previous study which was on a 4-way machine -- which means that 0.125x would be a linear speedup compared to the serialized Monitor version:
0% writers:
           0 calls   10 calls   100 calls   1000 calls
RWLSlim    1.26      1.55       1.39        0.38
SpinRWL    0.12      0.17       0.13        0.18
OptSync    0.05      0.08       0.11        0.12

5% writers:
           0 calls   10 calls   100 calls   1000 calls
RWLSlim    1.36      1.70       1.40        1.07
SpinRWL    0.98      1.07       0.55        0.30
OptSync    0.35      0.43       0.31        0.24

10% writers:
           0 calls   10 calls   100 calls   1000 calls
RWLSlim    1.42      1.66       1.23        1.06
SpinRWL    1.41      1.61       0.91        0.51
OptSync    0.56      0.66       0.46        0.31

25% writers:
           0 calls   10 calls   100 calls   1000 calls
RWLSlim    1.36      1.97       1.24        1.03
SpinRWL    2.39      2.22       1.08        0.89
OptSync    0.84      0.99       0.86        0.59

50% writers:
           0 calls   10 calls   100 calls   1000 calls
RWLSlim    1.48      1.80       1.21        1.05
SpinRWL    3.16      3.30       1.81        1.19
OptSync    0.91      0.94       1.10        0.92

100% writers:
           0 calls   10 calls   100 calls   1000 calls
RWLSlim    1.35      1.67       1.22        1.09
SpinRWL    5.84      5.84       2.49        1.18
OptSync    0.93      0.99       1.13        1.17
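For all cases but the 100% writers case, the OptimisticSynchronizer class does extraordinarily well: it beats the Monitor baseline in every configuration with 25% writers or fewer, and stays within roughly 10% of it at 50% writers.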
Although this approach screams performance-wise, it is admittedly much more difficult and error-prone to use. If the variables protected are references to heap objects, you need to worry about using the read protection each time you touch a field. Just like locks, this technique doesn't compose. As with anything other than simple locking, use this technique with great care and caution; although the built-in acquire and release fences shield you from memory model reordering issues, there are some easy traps you can fall into. And as with any optimistic reading, memory safety is a necessity; trying to use these techniques in C++, for example, can easily lead to access violations and memory corruption. Tread with caution.
Update 6/4: This technique, of course, is subject to ABA problems. I failed to mention that originally. That is, if exactly 2^32 writes (or a multiple of that) complete between a reader's read of version2 and its validation against version1, the 32-bit version1 field will wrap around to the value the reader saw, and the reader will (erroneously) validate successfully. Fixing this on 64-bit is simple (use a 64-bit counter) but is less so on 32-bit due to the lack of atomicity on loads and stores of 64-bit values (without using, say, an XCHG or related primitive).
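One way to sidestep the atomicity problem on 32-bit, sketched here under assumptions, is to route every access to the 64-bit counters through Interlocked, whose Read and Increment overloads for long are atomic on all platforms. OptimisticSynchronizer64 and Demo64 are hypothetical names, and this variant pays for full fences on every counter access, which is more than the acquire/release ordering the original class needs.

```csharp
using System;
using System.Threading;

public sealed class OptimisticSynchronizer64
{
    private long m_version1;
    private long m_version2;

    // Writers must still be serialized by the caller (e.g. with a lock).
    public void DoWrite(Action writer)
    {
        Interlocked.Increment(ref m_version1);
        try { writer(); }
        finally { Interlocked.Increment(ref m_version2); }
    }

    public T DoRead<T>(Func<T> reader)
    {
        var sw = new SpinWait();
        while (true)
        {
            long seen = Interlocked.Read(ref m_version2);   // atomic even on 32-bit
            T value = reader();
            if (Interlocked.Read(ref m_version1) == seen) return value;
            sw.SpinOnce();                                  // a write interfered: retry
        }
    }
}

public static class Demo64
{
    private static int s_x, s_y;   // invariant maintained by writers: s_x == s_y

    // Returns how many validated reads observed s_x != s_y (expected: none).
    public static int CountTornReads(int writes)
    {
        var sync = new OptimisticSynchronizer64();
        object writeLock = new object();
        int torn = 0;
        bool done = false;

        var readerThread = new Thread(() =>
        {
            while (!Volatile.Read(ref done))
            {
                bool same = sync.DoRead(() => { int x = s_x; int y = s_y; return x == y; });
                if (!same) torn++;
            }
        });
        readerThread.Start();

        for (int i = 1; i <= writes; i++)
        {
            lock (writeLock) sync.DoWrite(() => { s_x = i; s_y = i; });
        }

        Volatile.Write(ref done, true);
        readerThread.Join();
        return torn;
    }

    public static void Main()
    {
        Console.WriteLine(CountTornReads(100000));   // prints 0
    }
}
```

A 64-bit counter does not remove the wraparound in principle, it just pushes it out to 2^64 writes, which is far beyond what any reader will sit through.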
Update 6/18: My original write-up incorrectly made some hidden assumptions about the use of volatile. This has now been cleared up.
 Thursday, May 28, 2009
Two persons stand on a railway embankment at points A and C, exactly 500 meters apart. Lightning strikes precisely in the center of them, at point B, 250 meters away from both:
<--A----------B----------C-->
Presuming both persons are stationary, does the event (lightning strikes) occur “at the same time” from the perspective of the two persons? In our simplistic one dimensional model, the answer is Yes, precisely because the point of the lightning strike, B, is equidistant from A and C.
The person at point C may just as well be responsible for generating the event, by using some form of light rod instead of a lightning bolt supplied by nature. If the person at C lights such a rod, would the event still occur “at the same time” for both persons? Clearly No, because it will take some amount of time for the event’s occurrence to travel the distance from C to A, specifically the time it takes for light to travel 500 meters. Whereas for C it happens nearly instantaneously.
Practically speaking, this amount of time it takes for the light to travel to A will of course be so minute as to be nearly immeasurable, but nevertheless there are two separate times t and t’, the former being the actual time the rod is lit at C, and the latter being the time at which it is perceived at A. This is commonly referred to as relativity of simultaneity, introduced as Lorentz’s local time in the late 1800's and further formalized by Einstein's special theory in 1905.
Now imagine that a new person is placed at point B, equidistant from A and C, where the original lightning struck. If the person at C lights his rod, will the person at B observe the event before the person at A does? Most certainly. It takes less time for light to travel 250 meters than it does 500 meters.
Let us extend our working example a bit. Imagine again three persons, one at point A, one at point B, and another at point C. Those at B and C hold their own light rods. The person at C lights a rod, and in response to seeing C’s rod lighting, the person at B also lights a rod. The question is, must the person at A witness the light emanating from C’s rod prior to witnessing the light emanating from B’s rod? Unless the person at B’s response is truly instantaneous (which we assume is practically impossible), or unless he can see into the future (which we also assume is impossible), clearly the answer is Yes. Because the rod at B was lit in response to witnessing C’s lighting of the rod, some amount of time must have passed during the response, and the person at A will thus see C’s lighting first (or at the very least simultaneously, assuming near-impossible instantaneous response). We say B’s lighting is causally dependent on C’s lighting.
The main point here is that time is an illusion. There is no global time clock. Instead, events are not only distinguished by some monotonically increasing time value t, but also by a location which is defined by three-space coordinates. This is Minkowski’s four-dimensional space. One event occurring at coordinates { x=0, y=0, z=0, t=99 } may not appear to be simultaneous with some other event at coordinates { x=42, y=42, z=42, t=99 }, depending on the observer's location, even though both events occur at time t = 99.
Perception is relative. There is no global total ordering, only a local (relative) one.
A similar phenomenon is true of multiprocessors. In fact, nearly everything said above applies equally to them, provided that you replace “persons” with “processors”, and lighting of rods with writes to memory and witnessing of the light with reads from memory.
Multiprocessor architects must cope with the increasing bottleneck on a central memory unit, particularly due to shared memory programming. One common means of doing so is to increase the distance between processing elements and the memory units they use, padding this distance with ample levels of cache. Some processor A may have a local memory (cache) that is separate from some other processor C’s, and A’s writes to it will be visible to A before C, for example. And if some processor B sits in between them, it may notice such writes before C does. Locality matters.
Of course, memory ordering models are meant to eliminate such distances from the programmer’s mind, at least to some degree. They provide a set of rules governing the timing and ordering of events. But there is just no denying the laws of physics. My claim is that a proper ordering model ought to obey what can be derived from the special theory of relativity: no more, no less. That is, the fundamental laws of how events occur and are correlated in the real world should be mimicked. This means only two things, as far as I can tell:
- An event stream (writes) originating from a source must appear to happen in order.
- Causality is respected, in that when C causes B, it is implied that A must see C followed by B.
This turns out to be stronger than some models, but also weaker in some regards. Distance and latency are first class, embellished, and allowed. There is some cost to ensuring events leaving a locale do so in order, and that events arriving into a locale also do so in order. Given coarse enough locales, however, this cost ought to be amortized over the cost of inter-locale communication.
Notice that sequential consistency is explicitly discouraged. The ordinary store-followed-by-load ordering that I've written about many times is legal. Considering this phenomenon in the context of light rods and relativity makes it clear why. Imagine that the persons at A and C light their rods simultaneously. If the person at A immediately, after lighting the rod, looks to the right to see if C has lit the rod, the answer will be No; and similarly, if the person at C immediately, after lighting the rod, looks to the left to see if A has lit the rod, the answer will also be No. Although the real reason has to do with gross details like store buffers and cache coherency, the elegant reason supported by the model is that it takes time for light to travel the distance between A and C.
I also want to point out that “memory ordering model” commonly refers to individual loads and stores, at a very low level, but just as well applies to a higher-level model such as might be found in an actor-oriented (message passing) programming language. People often believe memory ordering and interleaving goes away magically with message passing models. This is simply not true, even if instruction-level interleaving is eliminated. The granularity merely coarsens, but the problem still remains the same.
Despite the lack of sequential consistency, implementing such a model can pose challenges, due to restrictions on optimizations like pipelining and out of order execution. (Hey, at least we needn’t worry about processors moving about at different velocities, as in the more interesting parts of special relativity.) But I believe it is necessary. This price paid will be rewarded with a system that human beings can be taught to reason about as they do in the real world. Remember: I am not just talking about memory models in the traditional sense, where people are tempted to sweep the problem under the rug of "only super-developers doing lock-free programming need a model"; it matters for higher level concurrency orchestration patterns too. In the end, let us not forget: correctness and understandability trump performance optimizations for all but the most low-level systems developers, which make up less than 1% of the development population.
1. Relativity: The Special and General Theory. http://en.wikisource.org/wiki/Relativity:_The_Special_and_General_Theory
2. Time, Clocks, and the Ordering of Events in a Distributed System. http://research.microsoft.com/en-us/um/people/lamport/pubs/time-clocks.pdf
 Saturday, May 16, 2009
A while back, I made a big stink about what appeared to be the presence of illegal load-load reorderings in Intel's IA32 memory model. They specifically claim this is impossible in their documentation. Well, last week I was chatting with a colleague, Sebastian Burckhardt, about this disturbing fact. And it turned out he had recently written a paper that formalizes the CLR 2.0 memory model, and in fact treats this phenomenon with a great deal of rigor:
Verifying Compiler Transformations for Concurrent Programs http://research.microsoft.com/pubs/76524/tr-2008-171-latest-03-11-09.pdf
To jog your memory, the problematic example is
X = 1; r0 = X; r1 = Y;
where both X and Y are shared memory locations, and r0 and r1 are processor registers. According to Intel's IA32 memory model, two loads to different locations cannot reorder. But it is completely possible for the load of X to be satisfied out of the store buffer, and for r1=Y to pass the store (thereby also passing the load r0=X). This is a standard Dekker reordering, but the usual example consists of just { X = 1; r1 = Y }.
The key to modeling this is to turn an adjacent store-load affecting the same location into a single instruction. Therefore, the above becomes something like:
r0 = 1; X = r0; r1 = Y;
Now it becomes entirely clear what has gone wrong. I have yet to see a clear description of this phenomenon, but Sebastian's paper does a great job.
During the discussion, Sebastian showed me another disturbing four processor example:
P0          P1          P2          P3
==          ==          ==          ==
X = 1;      r0 = X;     Y = 1;      s0 = X;
            r1 = Y;                 s1 = Y;
Is it possible, after all four processors complete, that { r0 == 1, r1 == 0 } and { s0 == 0, s1 == 1 }? This seems ridiculous, given a memory model where loads cannot reorder. It seems that no serializable execution should lead to this. But let's look at one problematic interleaving. First, we merge the instruction stream on P0 with P1, and also P2 with P3. This effect could occur if these writes are in functions that end up running on the same processor, or running on a machine that shares functional units (like hyperthreading), hierarchies that share a cache, and so on. We end up with:
P0/P1       P2/P3
=====       =====
X = 1;      Y = 1;
r0 = X;     s0 = X;
r1 = Y;     s1 = Y;
Now let's permute these with the new rule introduced above in mind:
P0/P1       P2/P3
=====       =====
r0 = 1;     s0 = X;
r1 = Y;     s1 = 1;
X = r0;     Y = s1;
At this point, it should be obvious what the problematic reordering would be. Let's continue merging these into a single execution order:
P0/P1/P2/P3
===========
r0 = 1;   // #1
r1 = Y;   // #0
s0 = X;   // #0
s1 = 1;   // #1
X = r0;   // #1
Y = s1;   // #1
The outcome? { r0 == 1, r1 == 0 } and { s0 == 0, s1 == 1 }. Whoops.
I have yet to observe this happening in practice, but models that permit store buffer forwarding are fundamentally vulnerable to this reordering. The solution here is the same as with Dekker. Marking the locations volatile is insufficient: you need to insert full memory fences between the adjacent store-load pairs.
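As a minimal sketch of that fix in C# (the type and member names are hypothetical and chosen only for illustration), each thread places a full fence, here Thread.MemoryBarrier, between its store and the load that follows it. With the fences in place, the two threads should not both be able to read 0:

```csharp
using System.Threading;

// Hypothetical illustration of the Dekker pattern with full fences.
public class DekkerFlags
{
    private int m_x;
    private int m_y;

    // Intended to run on one thread: store X, full fence, load Y.
    public int StoreXThenLoadY()
    {
        m_x = 1;
        Thread.MemoryBarrier(); // Full fence: the load of Y may not pass the store to X.
        return m_y;
    }

    // Intended to run on another thread: store Y, full fence, load X.
    public int StoreYThenLoadX()
    {
        m_y = 1;
        Thread.MemoryBarrier(); // Full fence: the load of X may not pass the store to Y.
        return m_x;
    }
}
```

Without the two fences, both methods returning 0 is exactly the Dekker outcome described above.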
As we were hard at work creating PFX, we had a sister team of great talent working with us every step of the way. Their job? To do to Visual Studio 2010 what PFX did to .NET 4.0, by substantially improving the development experience for parallel programming on Windows. This includes both diagnostics & debugging, as well as profiling.
Daniel Moth, the program manager for a lot of the IDE features, just wrote up a comprehensive blog post on the new tasks window:
Parallel Tasks - new Visual Studio 2010 debugger window
The new window gives you a view into all of the tasks in your process, their statuses, and where they are running:

Because both TPL and PLINQ use tasks for execution, it supports both quite nicely. And it has (consistent!) support for both .NET and C++ tasks. The parallel stacks window is also an impressive new feature, and I'm guessing Daniel will also cover that in the coming weeks. Keep your eyes peeled. If all goes well, you'll even get to try them out first-hand, once Beta1 is available.
And if that weren't enough to entice you to visit his blog, check out this nasty machine that Daniel uses to run his kitchen appliances:

Oh, the insanity. I am thinking Task Manager will need revising in Windows 8.
 Friday, May 08, 2009
The parallel computing team just shipped an early release of Axum (fka Maestro), an actor-based programming language with message passing and strong isolation.
I'm personally very excited to see what comes of Axum. It's one step on the long road towards the vision of automatic parallelism. Although I can't claim credit for anything concrete, I was the chief designer of the fine grained isolation model Axum is built atop (something I call "Taming Side Effects" (TSE)). It's a blend of functional programming with imperative programming enabled by using the concepts of Haskell's state monad in a more familiar way. I'll try to blog a bit more about it in coming weeks. It turns out I've recently shifted my focus to a new project with the aim of applying these ideas very broadly for a whole new platform.
Doing incubation work at Microsoft is tough work, because it takes a strong vision and drive to keep pushing forward. You need to take stances that are unconventional, risky, and often just plain unpopular, and drive against all odds. Usually you aren't going to make any money off the ideas for years at a time, so it also takes a supportive management team that is willing to give you creative freedom and cut you checks. Most such efforts fail in a vacuum. But hats off to the team for pushing hard, and going out early to ask what developers think. This is a huge milestone.
 Tuesday, March 31, 2009
I often speak of the need to develop programming models whereby developers write code in the most natural way possible, and it just so happens to be inherently parallel. I don’t believe the lion’s share of developers want to rearchitect and rewrite their code with parallelism at the forefront of their development process. They don’t want to think about shipping memory over to the GPU and launching a highly-specialized data parallel kernel of computation, nor do they want to have to add locks and transactions everywhere to make things safe. I do, however, believe the lion’s share of developers wouldn’t mind if their code ran faster as hardware got faster (via more cores).
(To be clear, there are certainly a lot of developers who will be happy to write specialized code if it means eking out every last bit of performance on their machine. But that’s the minority.)
This viewpoint tends to get a lot of skeptical looks from people who quickly point out that this has been tried countless times before, and always leads to failure. They, of course, are referring back to the 80’s and early 90’s where “dusty deck” parallelization was all the rage, mostly in the realm of vectorization and HPC. To be fair, there were some mild successes in getting floating point loops parallelized, but it’s no wonder these attempts had little longevity. No-touch solutions are always inadequate. Trying to make a fundamentally non-secure program secure, by way of, say, virtualization, may work in some constrained circumstances. But the right solution is to develop models and practices that lead to security-by-construction.
Furthermore, languages were (and in most cases still are) lacking some major prerequisites for large-scale automatic parallelization:
- Safety. Unless a compiler can reason about the determinism of a program when run in parallel, one cannot prove that its results will remain correct when parallelism is added. Compilers are therefore limited to parallelizing highly-specialized recognized patterns, like loops comprised solely of floating point additions of two arrays indexed by the loop iteration.
- Performance. Rampantly parallelizing a huge program wherever possible is dangerous, will drain performance, and make power consumption skyrocket. Dynamic techniques like workstealing and static techniques like nested data parallelism and profile guided feedback need to work together to inform these decisions. Machine-wide resource management needs to know about the memory topology, machine load and policy, and make informed decisions based on them. Although there has been a lot of disparate research in these areas over the years, they have only recently been coming together. Certainly in the 80’s, they were in their infancy.
- Declarative patterns. Most of the prior art was done in FORTRAN, a standard imperative language riddled with loops, effects, and assignments. Programs need to be written with as few dependencies as possible in order to expose large amounts of parallelism, and the von Neumann inspired languages fall short of this aim. Data comprehensions allow set-at-a-time computations to be expressed in a higher-order way, and newer languages like Fortress have language semantics that permit parallel evaluation in many more areas, like argument evaluation. And application models that encourage isolation and loose state coupling allow coarse partitioning.
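To make the contrast in the last bullet concrete, here is a minimal sketch using LINQ and PLINQ's AsParallel. The choice of API and the SumOfSquares type are my own illustration, not something the list above prescribes. The loop version threads every iteration through one accumulator; the query version is a map followed by an associative reduction, which the runtime is free to partition:

```csharp
using System.Linq;

// Hypothetical example type contrasting a loop with a set-at-a-time query.
public static class SumOfSquares
{
    // Imperative loop: the running total is a loop-carried dependence.
    public static long Sequential(int[] values)
    {
        long total = 0;
        foreach (int v in values)
            total += (long)v * v;
        return total;
    }

    // Declarative query: each square is independent, and Sum is an
    // associative reduction, so PLINQ may split the input across cores.
    public static long Declarative(int[] values)
    {
        return values.AsParallel().Select(v => (long)v * v).Sum();
    }
}
```

Both methods compute the same result; only the second states the computation in a form that exposes its lack of dependencies.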
In addition to all three of those things, we must have realistic expectations. Even if a program were fully safe to parallelize, as many Haskell kernels are, we would seldom see perfect scaling. Buying a 128-core machine doesn’t necessarily give you a 128x speedup. Why? Because there are still portions of the code that will end up less parallel than other portions, and some parts may even continue to run sequentially. There will always be I/O and waiting: these are real programs, after all, and real programs tend not to be 100% computation. They need to do something useful with the real world. Moreover, safety does not mean “dependence free.” And data dependencies are ultimately what limit parallelism.
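The limit imposed by the sequential portion is commonly known as Amdahl's law. As a rough sketch (the Scaling type and its parameter names are hypothetical), the best possible speedup for a given parallel fraction and core count can be computed like this:

```csharp
// Hypothetical helper illustrating Amdahl's law.
public static class Scaling
{
    // Upper bound on speedup when only parallelFraction of the work
    // (a value between 0.0 and 1.0) can be spread across the cores.
    public static double MaxSpeedup(double parallelFraction, int cores)
    {
        double sequentialFraction = 1.0 - parallelFraction;
        return 1.0 / (sequentialFraction + parallelFraction / cores);
    }
}
```

For a program that is 95% parallel, Scaling.MaxSpeedup(0.95, 128) comes out to roughly 17.4x, a long way from 128x, and that is before accounting for I/O or contention.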
My stated goal would therefore be that parallelism in programs is solely limited by data dependence. Safety issues are handled by construction. Performance is (mostly) handled by the system, although as with all things, there will be some amount of measurement, hints, and tuning necessary. But hopefully a huge part of tuning performance will be seeking out needless dependencies, or finding new algorithms that have different dependence characteristics. And with that, we can focus our energy on raising the level of abstraction and pushing more declarative patterns that are broadly useful. Over time as more and more programs are written in this fashion, they become more and more naturally parallel.
What do you think? Am I crazy? Perhaps. But I still know we can do it.
 Friday, March 13, 2009
Managed code generally is not hardened against asynchronous exceptions.
“Hardened” simply means “written to preserve state invariants in the face of unexpected failure.” In other words, hardened code can tolerate an exception and continue being called subsequently without a process or machine restart. Conversely, code that is not hardened may react sporadically if continued use is attempted: by corrupting state and subsequently behaving strangely and unpredictably.
Asynchronous exceptions are a foreign concept to native programmers, and arise because there is a runtime underneath all managed code that is silently injecting code on behalf of the original program. The only truly asynchronous exception is ThreadAbortException, but any in the set { OutOfMemoryException, TypeInitializationException, ThreadInterruptedException, StackOverflowException } are often labeled as such. While thread aborts can happen at any line of code outside of a delay-abort region, these other exceptions can be introduced by the CLR at surprising times; i.e., { memory allocations, static member access, blocking calls, any function call }. The effect is that, unlike most exceptions, the points at which they may occur are not obvious. OOMs, for instance, can happen at any method call (due to failure to allocate memory in which to JIT code), implicit boxing, etc.
(As of 2.0, StackOverflowException is no longer relevant because SO triggers a FailFast of the process instead. So saying that managed code is not hardened against SO is an understatement.)
Also, because of the way COM reentrancy works, any blocking call can lead to any arbitrary code dispatched through STA pumping. And that arbitrary code, much like an APC, can fail via any arbitrary exception. These are a lot like asynchronous exceptions. So in truth, code that isn’t written to respond to arbitrary exceptions at all blocking points is technically not hardened either.
.NET doesn’t provide checked exceptions, so the blunt reality is that very little managed code is hardened properly to synchronous exceptions either. I think we do a better job in the framework of carefully engineering the code to resiliently tolerate failure, usually by being very careful about argument validation, but we aren’t perfect. Some things slip through.
If you stop to think about why hardening isn’t done, it’s probably obvious. It’s darn difficult. Especially for asynchronous exceptions where nearly every line of code must be considered. In Win32 programming, most failure points are indicated by return codes. (Although C++ exceptions can sneak through the cracks at surprising times. Like the fact that EnterCriticalSection can throw.) While error codes are cumbersome to program against (since every call needs to be checked for a plethora of conditions, making it easy to miss something), at least the response to failure is explicit. You can decide to propagate and leave state corrupt, fix up state and then propagate, rip the process, or ignore the failure, as appropriate. This becomes part of the API contract. In managed code, you need to know to wrap such calls in try/catch blocks. Nobody does this. It’s insane to even consider doing that. And because nobody does, you can’t even catch exceptions coming out of a single API call and know that, when faced with an OOM (for example), all code on the propagating callgraph has transitively handled the failure in a controlled manner. The very fact that the lock{} statement auto-unlocks without rolling back corrupt state should be indication enough of the current state of affairs.
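To illustrate what rolling back state under a lock looks like, here is a minimal sketch. The Ledger type is hypothetical, and it only addresses a synchronous failure (an arithmetic overflow); it is not hardened against asynchronous exceptions, which could still strike between any two statements:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type: the invariant is that m_total equals the sum of m_items.
public class Ledger
{
    private readonly object m_lock = new object();
    private readonly List<int> m_items = new List<int>();
    private int m_total;

    public int Total { get { lock (m_lock) return m_total; } }
    public int Count { get { lock (m_lock) return m_items.Count; } }

    public void Add(int amount)
    {
        lock (m_lock)
        {
            m_items.Add(amount);
            try
            {
                // checked arithmetic throws OverflowException instead of wrapping.
                m_total = checked(m_total + amount);
            }
            catch
            {
                // Undo the first half of the update before the lock is released,
                // so the next caller never sees m_items and m_total disagree.
                m_items.RemoveAt(m_items.Count - 1);
                throw;
            }
        }
    }
}
```

Without the catch block, the lock statement would still release the lock on the way out, leaving an item recorded in m_items that m_total never accounted for.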
An instance of any of the aforementioned exceptions means the AppDomain is toast.
By toast, I mean that it’s soon going to be unusable, and hopefully actively being unloaded. Code in the framework assumes this, and you should too. All it does is try to get out of the way by not crashing or hanging the ensuing unload. A small fraction of code that deals with process-wide state comprised of resources not under the purview of the CLR GC needs to worry about running and avoiding leaks. This is where things like CERs, CriticalFinalizerObjects, and paired operations stuck in finally blocks come into play. They ensure cross-process state is freed, and that asynchronous exceptions cannot occur in places that would crash or hang a clean unload.
Unfortunately, it’s not always the case that the AppDomain is unloading when such an exception occurs:
- Somebody can call Thread.Abort directly, without killing the AppDomain. They can either call ResetAbort and keep it around, or let it return to the ThreadPool which catches and swallows aborts. In fact, we tell people that synchronous aborts a la Thread.CurrentThread.Abort are “always safe”, whereas we tell people asynchronous aborts are dangerous and best avoided.
- Some framework infrastructure, most notably ASP.NET, even aborts individual threads routinely without unloading the domain. They backstop the ThreadAbortExceptions, call ResetAbort on the thread and reuse it or return it to the CLR ThreadPool. That means any code running in ASP.NET is apt to be corrupted when websites are recycled and AppDomain isolation is not being used.
- Assume AppDomain B is being unloaded. If some thread has called from A to B to C, the thread will immediately suffer an abort. The result is that C will see a thread unwinding with a ThreadAbortException, back into B, and then back to A, at which point the exception turns into a deniable AppDomainUnloadedException that can be caught. But C has seen an in-flight abort and yet it is not being unloaded. The result is that C’s state may be completely corrupt. I believe this should be considered a bug in the CLR.
- We can’t differentiate between soft- and hard-OOMs today. The former are caused by requests to allocate large blocks of memory. Often a failure here isn’t indicative of a disaster. It may be due to a need to allocate 1GB of contiguous memory, and perhaps there is fragmentation. Hard OOMs are often caused by running up against the edge of the machine where no pagefile space is available, and may indicate a failure to JIT some important method, among other things. But because we don’t differentiate, any managed code can catch-and-ignore any kind of OOM, including hard ones.
- Thread interruptions are often used as a form of inter-thread communication. For example, they can be used as a poor man’s cancellation. (This is inappropriate, and cooperative techniques should always be used. But it is widespread.) But because they are used as a means of communication, they are almost always caught and handled in some controlled manner. This is one place where we screwed up by not hardening the frameworks against interrupted blocking calls and reacting intelligently. Checked exceptions would have saved us.
What does all of this mean? Quite simply, the .NET Framework cannot be trusted when any of the aforementioned exceptions are thrown. Ideally the process will come tumbling down shortly thereafter, but improperly written code can catch them and continue trying to limp along. In fact, as I mentioned above, some wildly popular & successful application models do (notably, ASP.NET and WinForms).
This state of affairs is admittedly unfortunate. We don’t properly separate out the truly fatal exceptions from those that we can gracefully recover from. In an ideal world, I’d love to see us do that. For example:
- At some point, we really ought to consider FailFast instead of continuing to run code under failures we know are fatal and dangerous to attempt to recover from, much like we do with SO. At least these failures should be undeniable like thread aborts are. But this is a fairly Byzantine response and is not for the faint of heart. Given that we still live in a world where WinForms wraps the top-most frame of the GUI thread in a catch-all, presents a dialog box, and allows a user to click “Ignore & Continue”, I seriously doubt we’ll get there anytime soon.
- Never expose a ThreadAbortException to code in an AppDomain unless we can guarantee the AppDomain is being unloaded. That means getting rid of the Abort API, and thus indirectly disallowing code from catching and calling ResetAbort. It also means the A calls B calls C case would not allow B to unload until the thread voluntarily unwinds out of C.
- Allow OOMs to be caught only when they are soft. That means a call to ‘new’, and it means the catch must occur inside the same stack frame as the call to ‘new’. Such exceptions can be tolerated if code is properly written, and we will tell developers to be mindful of them. Once such an OOM propagates past the calling stack frame, it will escalate to hard.
- All other OOMs are hard and fatal. This includes failure to allocate memory to JIT code and failure to allocate 20 bytes to box an int. Hard OOMs are thus undeniable.
- Get rid of ThreadInterruptedExceptions. We screwed this up from Day One, and it’s probably too late to fix this. We added cooperative cancellation in .NET 4.0 for a reason.
- TypeInitializationExceptions can probably stay, but we should allow rerunning the cctor upon subsequent accesses. Today, once a class C throws from its cctor, the class can never be constructed. So on the current plan, it only makes sense to FailFast.
I’m sure there are many other things we could do to improve things. But these 6 general themes would be a great start.
I’m just spitballing here. There are no concrete plans to do any of these 6 things as far as I know. And at the end of the day, hardening only improves the statistics of the situation, so it tends to be very difficult to argue for one change over another, particularly if taking the change would make existing programs break. But I really would like to see the base level of reliability in managed code improve with time. Especially with the exciting work going on around contract-checking in the BCL in Visual Studio 10, I hope these topics become top-of-mind for folks again in the near future.
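For contrast with thread interruption, here is a minimal sketch of the cooperative style that the .NET 4.0 cancellation types enable. The CooperativeWorker type is hypothetical; CancellationToken and CancellationTokenSource are the real framework types. The key property is that the worker decides where it is safe to stop, rather than having an exception injected at an arbitrary blocking call:

```csharp
using System.Threading;

// Hypothetical worker that polls for cancellation at a safe point.
public static class CooperativeWorker
{
    // Processes up to maxItems and returns how many were completed
    // before cancellation was observed.
    public static int Run(int maxItems, CancellationToken token)
    {
        int completed = 0;
        while (completed < maxItems)
        {
            // Checked only between items, where state is known to be consistent.
            if (token.IsCancellationRequested)
                break;
            completed++;
        }
        return completed;
    }
}
```

A caller would create a CancellationTokenSource, pass its Token to Run, and call Cancel on the source from another thread when the work is no longer needed.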
 Monday, February 23, 2009
Pop quiz: Can this code deadlock?
SpinLock slockA = new SpinLock();
SpinLock slockB = new SpinLock();
Thread 1            Thread 2
~~~~~~~~            ~~~~~~~~
slockA.Enter();     slockB.Enter();
slockA.Exit();      slockB.Exit();
slockB.Enter();     slockA.Enter();
slockB.Exit();      slockA.Exit();
The answer, as I'm sure you suspiciously guessed, is "it depends."
I previously posted some thoughts about whether a full fence is required when exiting the lock. In that post, I focused primarily on timeliness. But what might be even more frightening is that the answer to my question above is yes, provided two things:
1. Exit doesn't end with a full fence.
2. Enter doesn't start with a full fence.
Just making Exit a store release and Enter a load acquire is insufficient. Here's why.
Imagine a super simple spin lock that satisfies our deadlock criteria:
class SpinLock {
    private volatile int m_taken;

    public void Enter() {
        while (true) {
            if (m_taken == 0 &&
                Interlocked.Exchange(ref m_taken, 1) == 0)
                break;
        }
    }

    public void Exit() {
        m_taken = 0;
    }
}
Clearly Exit satisfies #1. The technique of using an ordinary read of m_taken before resorting to the XCHG call is often known as a TATAS (test-and-test-and-set) lock, and this can help alleviate contention. And it also means we will satisfy #2 above.
To see why deadlock is possible, imagine the following (fully legal) compiler transformation. The compiler first inlines everything, so for Thread 1 we have:
Thread 1
~~~~~~~~
while (true) {
    if (slockA.m_taken == 0 &&
        Interlocked.Exchange(ref slockA.m_taken, 1) == 0)
        break;
}
slockA.m_taken = 0;
while (true) {
    if (slockB.m_taken == 0 &&
        Interlocked.Exchange(ref slockB.m_taken, 1) == 0)
        break;
}
slockB.m_taken = 0;
What has to happen next is pretty subtle. It's even unlikely a compiler would do this intentionally (as far as I can tell). But it's entirely legal to morph the above code into something like this:
Thread 1
~~~~~~~~
while (true) {
    if (slockA.m_taken == 0 &&
        Interlocked.Exchange(ref slockA.m_taken, 1) == 0)
        break;
}
// Hoisted load(s): spin until slockB looks free, while still holding slockA.
while (slockB.m_taken != 0) ;;
slockA.m_taken = 0;
if (Interlocked.Exchange(ref slockB.m_taken, 1) != 0)
    while (slockB.m_taken != 0 ||
           Interlocked.Exchange(ref slockB.m_taken, 1) != 0) ;;
slockB.m_taken = 0;
The load(s) of slockB.m_taken have moved before the store to slockA.m_taken; this is legal, even if they are both marked volatile. A load acquire can move above a store release, and the code remains functionally equivalent. Now, the code required to fix up this code motion is pretty hokey. We clearly can't do the XCHG before the store to slockA.m_taken, so we need to try it afterwards. But that brings about an awkward transformation: if it fails, we must effectively do what the original code did, spinning until we acquire the slockB lock.
Do you see the deadlock yet?
Imagine the compiler did similar code motion on Thread 2:
Thread 2
~~~~~~~~
while (true) {
    if (slockB.m_taken == 0 &&
        Interlocked.Exchange(ref slockB.m_taken, 1) == 0)
        break;
}
// Hoisted load(s): spin until slockA looks free, while still holding slockB.
while (slockA.m_taken != 0) ;;
slockB.m_taken = 0;
if (Interlocked.Exchange(ref slockA.m_taken, 1) != 0)
    while (slockA.m_taken != 0 ||
           Interlocked.Exchange(ref slockA.m_taken, 1) != 0) ;;
slockA.m_taken = 0;
Oh no! See it now?
If Thread 1 and Thread 2 both enter the critical regions for slockA and slockB at the same time, they will end up spin-waiting for the other to leave before exiting their respective lock.
Boom: deadlock.
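One way to rule this out is to make Exit end with a full fence, which breaks condition #1. The following is a minimal sketch under that assumption; the FencedSpinLock name and the IsHeld property are hypothetical, and Volatile.Read comes from a later framework version than the one discussed in this post:

```csharp
using System.Threading;

// Hypothetical variant of the simple TATAS lock whose Exit ends with a full fence.
public class FencedSpinLock
{
    private int m_taken;

    public void Enter()
    {
        while (true)
        {
            // Cheap read first (test-and-test-and-set), then the XCHG.
            if (Volatile.Read(ref m_taken) == 0 &&
                Interlocked.Exchange(ref m_taken, 1) == 0)
                break;
        }
    }

    public void Exit()
    {
        // Interlocked.Exchange acts as a full fence, so a later load (such as
        // the first read in another lock's Enter) cannot move above this store.
        Interlocked.Exchange(ref m_taken, 0);
    }

    // Diagnostic helper for illustration only.
    public bool IsHeld
    {
        get { return Volatile.Read(ref m_taken) != 0; }
    }
}
```

The cost is an interlocked operation on every Exit, which is the trade-off the earlier post on exit fences was weighing.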
 Sunday, February 22, 2009
A few weeks back I recorded a discussion with the infamous Erik Meijer and Charles from Channel9.
Perspectives on Concurrent Programming and Parallelism http://channel9.msdn.com/shows/Going+Deep/Joe-Duffy-Perspectives-on-Concurrent-Programming-and-Parallelism/
In it, I show my cards a bit more than intuition says I should. I'm not good at poker.
To summarize:
- Mostly functional (purity + immutability) is a great default.
- Safe, deterministic mutability (a la runST) is a must-have for cognitive familiarity.
- Isolation is key to achieve the former; type systems can help (a lot).
- Actors, agents, forkIO, <what have you> is a good model, but not the only one. Isolation is (far) more general.
- Transactions can help around the edges.
I'm working on a few papers for public consumption this year where I espouse these ideas. Keep watching for more detail.
 Friday, February 20, 2009
I was very harsh in my previous post about reader/writer locks.
The results are clearly very hardware-specific. And one can certainly argue that better implementations are possible. (In fact, I will show one momentarily.) But no matter which way you slice-and-dice it, a lock implies mutable shared state which implies contention. Herb argued this point quite well, and rather thoroughly, in his recent Dr. Dobb’s article. Interference due to contention means more time spent resolving memory conflicts and less time doing useful work. A reader/writer lock can be infinitely clever, but there is still a consensus protocol that must be established: and that implies a loss of scalability. Pretty simple.
It’s very tricky to develop a consensus protocol that is sufficiently lossy so as to relieve memory contention while at the same time being sufficiently precise that the lock works right. In the case of a spinning reader/writer lock (which is, for what it’s worth, overly naïve an approach for most circumstances), you need to ensure that a writer knows for sure when there are 0 readers, and that each reader knows for sure whether there is 0 or 1 writer. (For blocking reader/writer locks, there’s a whole lot more.) One promising thing to note is that the writer only needs to know whether there are 0 or N readers, but not the specific value N; there’s a fair bit of research on scalable counters (like this) which exploit problems of this nature. Unfortunately, it’s not completely relevant here. You need to know exactly when the transition from N to 0 readers happens in order to let the writer through in a timely fashion; and in order to account for that transition, a consensus among readers is needed. That's hard to do.
More scalable solutions are possible than the simple lock I showed previously. Although writers need to know whether readers are present, the readers themselves couldn't care less about other readers. As a result, we can make the lock slightly more expensive for the writer, because it needs to accumulate the count of readers, but this allows us to make it slightly cheaper for the readers to enter and exit. Where cheaper means less contention.
Here’s one possible algorithm. We’ll keep an array of read flags and a single write flag:
private volatile int m_writer;
private ReadEntry[] m_readers = new ReadEntry[Environment.ProcessorCount * 16];
A few things are noteworthy about the read flags.
First, it’s an array of ReadEntry values. These are just simple structs that wrap a volatile int, but we also pad the struct so that it’s 128 bytes in total size. That avoids the situation where multiple read flags just happen to end up sharing the same cache line (which are usually either 64 or 128 bytes in size), which leads to false sharing in the memory system (destroying our aim to reduce contention).
[StructLayout(LayoutKind.Sequential, Size = 128)]
struct ReadEntry {
    internal volatile int m_taken;
}
Second, we size the array to be 16-times the number of processors. We hash into it based on the calling thread’s unique identifier, so to reduce (but not eliminate) the chance of hashing collisions, we’ll use a few times more buckets than the total number of concurrent threads. Hashing collisions are expensive: they incur some amount of memory contention, and also demand that we use an atomic CAS increment instead of an ordinary ++. (While a super-duper-cheap TLS solution might seem more ideal, there isn’t any good per-object TLS solution to use. The array hashing approach is actually quite fast.)
Notice that we’re using an awful lot of space for a single lock. This means the techniques I show here wouldn’t be readily applicable to a system that uses lots of fine-grained locks, like transactional memory. But similar ideas can be extrapolated, e.g., by using shared lock tables.
Lastly, some invariants among these fields are self-evident. When the writer flag is 0, no writers are waiting; when it is 1, either a writer is actively in the critical section, or there is a writer waiting for readers to exit. When at least one reader flag entry is non-0, there is a reader either inside the lock or attempting to enter it. Thus, no new writer is permitted while there’s a non-0 reader entry, and no new reader is permitted while there’s a non-0 writer flag. This is sufficient to ensure the reader/writer lock properties hold.
Now let’s look at how the EnterReadLock and ExitReadLock methods work.
When a reader arrives, it spins until the writer flag is 0. It then hashes into the read flag array using its unique thread identifier, and atomically increments the read counter. It then needs to recheck that a writer didn’t arrive in the meantime. (The CAS increment means we can safely do this without worry for reordering bugs, like the read of the writer flag passing the write to the reader flag.) If a writer hasn’t arrived, the read lock has been successfully acquired and we’re done; if a writer has arrived, however, the reader needs to back out the change (since the writer might be waiting for the read flag to become 0) and then go back to spinning. It will retry once the writer exits.
private int ReadLockIndex {
    get { return Thread.CurrentThread.ManagedThreadId % m_readers.Length; }
}

public void EnterReadLock() {
    SPW sw = new SPW();
    int tid = ReadLockIndex;

    // Wait until there are no writers.
    while (true) {
        while (m_writer == 1) sw.SpinOnce();

        // Try to take the read lock.
        Interlocked.Increment(ref m_readers[tid].m_taken);
        if (m_writer == 0) {
            // Success, no writer, proceed.
            break;
        }

        // Back off, to let the writer go through.
        Interlocked.Decrement(ref m_readers[tid].m_taken);
    }
}
(Note that SPW is a little type to encapsulate the spin-wait logic, including some amount of backoff to reduce contention. An example implementation is at the bottom of this essay, along with the full reader/writer lock code. .NET 4.0 includes a SpinWait type that provides this same functionality.)
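For readers who want something to experiment with right away, here is a rough sketch of what a type like SPW might look like. This is my own guess at a reasonable shape, not the implementation referred to above: the thresholds are illustrative rather than tuned, and the Count property exists only for inspection.

```csharp
using System;
using System.Threading;

// Hypothetical spin-wait helper with capped exponential backoff.
public struct SPW
{
    private int m_count;

    // Number of times SpinOnce has been called (for inspection only).
    public int Count { get { return m_count; } }

    public void SpinOnce()
    {
        // Avoid overflow by falling back to a small value after a very long wait.
        m_count = (m_count == int.MaxValue) ? 10 : m_count + 1;

        if (Environment.ProcessorCount == 1 || m_count % 20 == 0)
        {
            // Periodically give up the timeslice so a preempted lock holder can run.
            Thread.Yield();
        }
        else
        {
            // Busy-wait, doubling the spin duration each time up to a cap.
            Thread.SpinWait(4 << Math.Min(m_count, 10));
        }
    }
}
```

Because this sketch is a struct, it should be kept in a local variable, as in the lock code here, rather than copied around or stored in a readonly field.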
Exiting the read lock is pretty simple. We just need to decrement our counter.
public void ExitReadLock() {
    // Just note that the current reader has left the lock.
    Interlocked.Decrement(ref m_readers[ReadLockIndex].m_taken);
}
The writer lock is pretty straightforward. It works the same way most spin-based mutually exclusive locks work, using a CAS on the writer flag, but has an extra step after successfully acquiring the lock: a writer must walk the list of read flags, and wait for each one to become 0. (This is similar to Peterson's mutual exclusion algorithm for N-threads.) Because the write flag is set first (using a CAS), and because new readers won’t enter if the flag is set, we can be assured this works correctly without hokey memory reordering problems cropping up.
public void EnterWriteLock() {
    SPW sw = new SPW();
    while (true) {
        if (m_writer == 0 && Interlocked.Exchange(ref m_writer, 1) == 0) {
            // We now hold the write lock, and prevent new readers.
            // But we must ensure no readers exist before proceeding.
            for (int i = 0; i < m_readers.Length; i++)
                while (m_readers[i].m_taken != 0) sw.SpinOnce();
            break;
        }

        // We failed to take the write lock; wait a bit and retry.
        sw.SpinOnce();
    }
}
And exiting the write lock is even simpler than exiting the read lock. We just set the writer flag to 0.
public void ExitWriteLock() {
    // No need for a CAS.
    m_writer = 0;
}
Given all of that, you might wonder how well this bad boy performs. Well, single-threaded performance is a bit worse than the previous spin reader/writer lock: about 1.55x the cost of a monitor acquisition for the read lock instead of 0.95x, and about 5.52x for the write lock instead of 0.85x. This makes sense. There’s simply a whole lot more work going on in this new lock compared to the old, simple one.
But scalability is vastly improved. Our hard work has apparently paid off. Here’s a table much like the one in the previous post: scaling over the equivalent mutually exclusive monitor code, for various percentages of writers and various amounts of "work" (counts of function calls) inside the lock region. (I have left out the legacy .NET ReaderWriterLock type because it is embarrassingly terrible.) Remember: 1.0x means the lock scales the same as monitor, 0.5x means twice as fast, and 2.0x means twice as slow. 0.25x is ideal speedup (4x) since I am running the tests on a four way machine.
0% writers:
                0 calls    10 calls   100 calls  1000 calls
RWLSlim (3.5)   2.11x      2.01x      0.96x      0.32x
SpinRWL(old)    9.63x      7.04x      1.02x      0.26x
SpinRWL(new)    0.39x      0.36x      0.28x      0.25x
5% writers:
                0 calls    10 calls   100 calls  1000 calls
RWLSlim (3.5)   2.29x      2.36x      1.18x      0.61x
SpinRWL(old)    5.69x      5.59x      1.43x      0.94x
SpinRWL(new)    1.01x      0.96x      0.45x      0.38x
10% writers:
                0 calls    10 calls   100 calls  1000 calls
RWLSlim (3.5)   2.26x      2.04x      1.15x      1.00x
SpinRWL(old)    6.87x      5.03x      1.42x      1.34x
SpinRWL(new)    1.60x      1.51x      0.63x      0.53x
25% writers:
                0 calls    10 calls   100 calls  1000 calls
RWLSlim (3.5)   2.09x      2.10x      1.14x      1.00x
SpinRWL(old)    4.70x      4.20x      1.43x      1.69x
SpinRWL(new)    2.81x      2.29x      1.27x      0.73x
50% writers:
                0 calls    10 calls   100 calls  1000 calls
RWLSlim (3.5)   2.18x      1.95x      1.15x      0.95x
SpinRWL(old)    3.23x      3.73x      1.54x      1.39x
SpinRWL(new)    3.16x      2.76x      1.73x      1.10x
100% writers:
                0 calls    10 calls   100 calls  1000 calls
RWLSlim (3.5)   2.18x      1.95x      1.04x      0.92x
SpinRWL(old)    2.63x      2.04x      1.06x      0.87x
SpinRWL(new)    6.79x      3.96x      1.62x      1.06x
You can see there are now several more cases where the new reader/writer lock beats both the .NET 3.5 ReaderWriterLockSlim type and our previous attempt. In fact, we now have a few new scenarios that scale, like 5% or 10% writers where the amount of work being done is at least 100 function calls. (Unfortunately, doing 100 or more function calls inside a lock that uses spin-waiting is dangerous and considered a very bad practice: you should be able to count the number of instructions on your fingers (and toes). But that’s somewhat beside the point.) In summary, so long as there is a fair amount of work going on and the percentage of writers remains very low, we might see a benefit.
So was I overly harsh on reader/writer locks in my last post? Sure, maybe a little. While I am still very disappointed in the current .NET reader/writer locks (and, I imagine, the Vista SRWLock), the results I was able to get here are a bit more promising.
But the point I was trying to get across is the same: sharing is sharing is sharing. Avoid it like the plague.
(Thanks to Tim Harris for sending me private email about my previous posts. The brief discussion inspired me to pick this back up.)
Here’s the full code for the reader/writer lock.
using System;
using System.Threading;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// We use plenty of interlocked operations on volatile fields below. Safe.
#pragma warning disable 0420

/// <summary>
/// A lightweight reader/writer lock. Readers record themselves in one of
/// several padded per-slot counters, writers in a single flag, and the lock
/// only spins when contention arises (no events are necessary).
/// </summary>
public class ReaderWriterSpinLockPerProc {
    private volatile int m_writer;
    private volatile ReadEntry[] m_readers = new ReadEntry[Environment.ProcessorCount * 16];

    [StructLayout(LayoutKind.Sequential, Size = 128)]
    struct ReadEntry {
        internal volatile int m_taken;
    }

    private int ReadLockIndex {
        get { return Thread.CurrentThread.ManagedThreadId % m_readers.Length; }
    }

    public void EnterReadLock() {
        SPW sw = new SPW();
        int tid = ReadLockIndex;
        // Wait until there are no writers.
        while (true) {
            while (m_writer == 1) sw.SpinOnce();
            // Try to take the read lock.
            Interlocked.Increment(ref m_readers[tid].m_taken);
            if (m_writer == 0) {
                // Success, no writer, proceed.
                break;
            }
            // Back off, to let the writer go through.
            Interlocked.Decrement(ref m_readers[tid].m_taken);
        }
    }

    public void EnterWriteLock() {
        SPW sw = new SPW();
        while (true) {
            if (m_writer == 0 && Interlocked.Exchange(ref m_writer, 1) == 0) {
                // We now hold the write lock, and prevent new readers.
                // But we must ensure no readers exist before proceeding.
                for (int i = 0; i < m_readers.Length; i++)
                    while (m_readers[i].m_taken != 0) sw.SpinOnce();
                break;
            }
            // We failed to take the write lock; wait a bit and retry.
            sw.SpinOnce();
        }
    }

    public void ExitReadLock() {
        // Just note that the current reader has left the lock.
        Interlocked.Decrement(ref m_readers[ReadLockIndex].m_taken);
    }

    public void ExitWriteLock() {
        // No need for a CAS.
        m_writer = 0;
    }
}

struct SPW {
    private int m_count;

    internal void SpinOnce() {
        if (m_count++ > 32) {
            Thread.Sleep(0);
        } else if (m_count > 12) {
            Thread.Yield();
        } else {
            Thread.SpinWait(2 << m_count);
        }
    }
}
 Wednesday, February 11, 2009
A couple weeks ago, I illustrated a very simple reader/writer lock that was composed of a single word and used spinning instead of blocking under contention. The reason you might use a lock with a read (aka shared) mode is fairly well known: by allowing multiple readers to enter the lock simultaneously, concurrency is improved and therefore so is scalability. Or so the textbook theory goes.
As a purely theoretical illustration, imagine we’re on a heavily loaded 8-CPU server where a new request arrives every 0.25ms and runs for 1ms. In an ideal world, we could service requests arriving as often as one every 1ms / 8 CPUs = 0.125ms without falling behind. But imagine these requests need to access some shared state, and so there is a bit of serialization required. In fact, let’s imagine each does 0.5ms’ worth of its work inside a lock. If you were to use a mutually exclusive lock, then you’d have an immediate lock convoy on your hands. Even with 8 CPUs you won’t be able to keep up. You’ll start off gradually building up a debt, and eventually come to a crawl. Let’s examine the initial timeline:
Req#  Arrival  Acquire  Release  Wait Time
1     0.0ms    0.0ms    0.5ms    0.0ms
2     0.25ms   0.5ms    1.0ms    0.25ms
3     0.5ms    1.0ms    1.5ms    0.5ms
4     0.75ms   1.5ms    2.0ms    0.75ms
5     1.0ms    2.0ms    2.5ms    1.0ms
6     1.25ms   2.5ms    3.0ms    1.25ms
7     1.5ms    3.0ms    3.5ms    1.5ms
8     1.75ms   3.5ms    4.0ms    1.75ms
Oh jeez, after only the first 8 requests, we’ve fallen way behind.
Each new request adds 0.25ms onto the amount of time the request must wait for the lock. And it’s not going to get any better:
9     2.0ms    4.0ms    4.5ms    2.0ms
10    2.25ms   4.5ms    5.0ms    2.25ms
11    2.5ms    5.0ms    5.5ms    2.5ms
12    2.75ms   5.5ms    6.0ms    2.75ms
... and so on ...
By request #9, requests have to wait for twice as long as they run. Eventually something has to give, or the server will come tumbling down.
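The timeline is just arithmetic on arrival and release times, so it can be reproduced with a few lines of code. The following is a minimal sketch, assuming each request asks for the lock the moment it arrives and the lock is handed out in arrival order; the `ConvoyTimeline` class and its parameter names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class ConvoyTimeline {
    // Returns one { arrival, acquire, release, wait } row per request, in ms.
    // Assumption: a request tries to take the lock as soon as it arrives,
    // and waiters are served strictly in arrival order.
    public static List<double[]> Simulate(int requests, double arrivalGapMs, double holdMs) {
        var rows = new List<double[]>();
        double lockFreeAt = 0.0; // time at which the previous holder releases
        for (int i = 0; i < requests; i++) {
            double arrival = i * arrivalGapMs;
            double acquire = Math.Max(arrival, lockFreeAt);
            double release = acquire + holdMs;
            rows.Add(new[] { arrival, acquire, release, acquire - arrival });
            lockFreeAt = release;
        }
        return rows;
    }
}
```

Calling `ConvoyTimeline.Simulate(12, 0.25, 0.5)` yields the twelve rows shown in the tables. If the arrival gap is at least as long as the hold time (say 0.5ms and 0.5ms), the wait column stays at zero, which is the point at which the convoy disappears.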
Now, imagine we used a reader/writer lock instead. Threads would never wait for each other, and we wouldn’t end up with this never-ending buildup of wait times. In other words, the “Wait Time” column above would always be 0.0ms. And because the arrival rate is less than our theoretical limit of one request per 0.125ms, our lock convoy is gone. Right?
Unfortunately, probably not; this mental model is overly naïve.
Even when a read lock is acquired, there is mutual exclusion going on:
- Some reader/writer locks actually use mutually exclusive locks to protect their own internal state, like the list of current readers! This can come as a surprise, but it’s true of the .NET reader/writer locks. Even Vance’s example does so, and although it uses a spin lock in an attempt to reduce the overhead, there’s still no denying that it’s mutual exclusion.
- And even if they don’t use mutually exclusive locks, like the simple spin-based one from my previous blog post, there are CAS instructions. And a CAS instruction actually amounts to a form of mutual exclusion at the hardware level, because the cache coherency machinery needs to ensure that no two processors try to acquire and modify the same cache line exclusively.
- In addition to all of that overhead, the cost in CPU cycles of acquiring the read-lock is nowhere near zero. Because of the use of locks and/or CAS internally, and the resulting cache contention and line evictions that this will cause, throughput will suffer. If there is contention, threads may end up blocked (if real locks are being used), spinning (if spin locks are being used), or simply optimistically retrying CAS’s due to line ping ponging.
The result?
Read locks are just as bad as mutually exclusive locks when lock hold times are short. In fact, they can be worse, because reader/writer locks are more complicated and therefore cost more than simple mutually exclusive locks: many need to keep track of read lists in order to disallow recursive acquires, maintain multiple event handles so certain kinds of waiters can be awakened over others, and store various kinds of counters and flags. Even my super simple single-word, spin-based reader/writer lock needed to worry about blocking out readers when a writer was waiting, properly incrementing and decrementing the reader count when readers are racing with one another (leading to more complicated CAS on the exit path than ordinary write locks), and so on.
That said, a reader/writer lock would in fact probably work in the situation above. A hold time of 0.5ms is huge, and with only 8 concurrent threads and the arrival rate we’re talking about, the overheads are apt to be quite small in relation to the work being done. Another similar setting in which reader/writer locks commonly make a noticeable difference is in the execution of large database transactions.
But the sad truth is that we tell programmers to keep lock hold times short, and most locks I see are comprised of two dozen instructions or less. So we’re in the microsecond range at the very most, which is certainly not large enough for read locks to pan out.
To illustrate this point, I wrote a little benchmark program that measures the legacy .NET ReaderWriterLock, the 3.5 ReaderWriterLockSlim type, and my little spin reader/writer lock. All it does is spawn 4 threads on my dual-socket, dual-core (4-CPU) machine, and then loop around a fixed number of times acquiring and releasing a certain kind of lock. I’ve written the test so that the amount of work done inside the lock is parameterized as a certain number of non-inlined function calls. I also parameterize the percentage of acquires that will be write-locks. Then I’ve run this a bunch of times, and compared the total time taken with the equivalent code using a CLR Monitor for mutual exclusion instead.
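For readers who want to try something similar, here is a rough sketch of what such a harness might look like. It is not the program used for the numbers in this post: the `RwBench` class, the way writers are chosen (`i % 100 < writerPercent`), and the use of `Stopwatch` ticks are all assumptions made for illustration, and it only exercises ReaderWriterLockSlim.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Threading;

public static class RwBench {
    // The "work" done inside the lock: a call the JIT is told not to inline.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Work() { }

    // Starts `threads` threads, each doing `iterations` acquire/release pairs.
    // Each acquire performs `calls` non-inlined calls while holding the lock,
    // and `writerPercent` out of every 100 acquires take the write lock.
    // Returns elapsed Stopwatch ticks; `writes` reports the write acquires.
    public static long Run(int threads, int iterations, int calls, int writerPercent, out int writes) {
        var rwl = new ReaderWriterLockSlim();
        int writeCount = 0; // only touched while holding the write lock
        var workers = new Thread[threads];
        var sw = Stopwatch.StartNew();
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() => {
                for (int i = 0; i < iterations; i++) {
                    if (i % 100 < writerPercent) {
                        rwl.EnterWriteLock();
                        try { writeCount++; for (int c = 0; c < calls; c++) Work(); }
                        finally { rwl.ExitWriteLock(); }
                    } else {
                        rwl.EnterReadLock();
                        try { for (int c = 0; c < calls; c++) Work(); }
                        finally { rwl.ExitReadLock(); }
                    }
                }
            });
            workers[t].Start();
        }
        foreach (Thread w in workers) w.Join();
        sw.Stop();
        writes = writeCount;
        return sw.ElapsedTicks;
    }
}
```

A Monitor baseline would be the same loop with both branches replaced by a `lock` statement on a shared object; dividing the two elapsed times gives a ratio of the kind reported in this post.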
Here are some results, where each column represents the number of function calls. The entries are the cost relative to Monitor: 1.00x means they are the same, 0.5x means the alternative lock is twice as fast, and 2.0x means the alternative lock is twice as slow. Remember, the ideal situation would be 0.25x: that is, by allowing four threads to run completely concurrently, we run four times faster.
0% writers:
                0 calls    10 calls   100 calls  1000 calls
RWL (legacy)    9.23x      6.46x      0.90x      0.49x
RWLSlim (3.5)   2.11x      2.01x      0.96x      0.32x
SpinRWL         9.63x      7.04x      1.02x      0.26x
5% writers:
                0 calls    10 calls   100 calls  1000 calls
RWL (legacy)    10.55x     8.23x      1.71x      0.63x
RWLSlim (3.5)   2.29x      2.36x      1.18x      0.61x
SpinRWL         5.69x      5.59x      1.43x      0.94x
10% writers:
                0 calls    10 calls   100 calls  1000 calls
RWL (legacy)    20.31x     10.39x     2.34x      0.99x
RWLSlim (3.5)   2.26x      2.04x      1.15x      1.00x
SpinRWL         6.87x      5.03x      1.42x      1.34x
25% writers:
                0 calls    10 calls   100 calls  1000 calls
RWL (legacy)    74.49x     49.59x     9.18x      2.15x
RWLSlim (3.5)   2.09x      2.10x      1.14x      1.00x
SpinRWL         4.70x      4.20x      1.43x      1.69x
50% writers:
                0 calls    10 calls   100 calls  1000 calls
RWL (legacy)    148.34x    98.46x     20.46x     3.63x
RWLSlim (3.5)   2.18x      1.95x      1.15x      0.95x
SpinRWL         3.23x      3.73x      1.54x      1.39x
100% writers:
                0 calls    10 calls   100 calls  1000 calls
RWL (legacy)    170.59x    123.66x    24.04x     4.29x
RWLSlim (3.5)   2.18x      1.95x      1.04x      0.92x
SpinRWL         2.63x      2.04x      1.06x      0.87x
Clearly there are a number of anomalies in these numbers. Why the legacy ReaderWriterLock balloons to 170X the cost of Monitor when we have 100% writers is a very interesting question indeed. Why my simple spin reader/writer lock is 9.63X when we have pure reads and 0 calls, and yet the ReaderWriterLockSlim type is only 2.11X is also interesting. And so on. The numbers are very specific to the version of .NET I am using, and indeed the precise machine configuration, including the number and layout of cores and caches.
But if we look more generally at the numbers, ignoring some of the surprising ones, we can make one interesting and safe conclusion: You need a really low percentage of writers, and a really long amount of time inside the lock, for any scalability wins to show up as a result of using a reader/writer lock. Our best case was the spin reader/writer lock when we had 0% writers and 1000 calls. But clearly if you have no writers, i.e., state is immutable, there’s little point in using any locks whatsoever! This is an extreme result, where threads are hammering on the lock constantly in a tight loop, but if you stop to think about it: When else would a reader/writer lock make a difference? If threads are just getting in and out of the lock very quickly, and arrivals are infrequent, then there is no benefit to allowing multiple threads in at once anyway.
The moral of the story? Besides suggesting that you seriously question whether a reader/writer lock is actually going to buy you anything, it's the same as the conclusion in my previous post on the matter:
Sharing is evil, fundamentally limits scalability, and is best avoided.
 Monday, February 02, 2009
I frequently get asked about the C# compiler's warning CS0420 about taking byrefs to volatile fields. For example, given a program
class P { static volatile int x;
static void Main() { f(ref x); }
static void f(ref int y) { while (y == 0) ; } }
the C# compiler will complain
xx.cs(8,15): warning CS0420: 'P.x': a reference to a volatile field will not be treated as volatile
because of the line containing 'ref x'. (The same applies to 'out' parameters too.) The natural question is, of course, whether to worry about it.
In general, the answer is yes, you must worry. In the above example, the use of the 'y' parameter inside 'f' will not be treated as volatile, as the warning says. What does that mean in practice? For one, the read of 'y' in 'f's while loop could be considered loop invariant by the JIT compiler and hoisted, and you'd possibly loop forever. It also means that on IA64 platforms, such reads will be emitted as ordinary loads instead of the special load-acquire variant that is emitted for volatile loads. This can lead to reordering bugs. In other words, you lose the volatile-ness of the field as soon as you cast it away as an ordinary byref. And unlike C++ where you can have a volatile pointer, there's no way to mark a .NET byref as volatile.
(You can use the Thread.VolatileRead and VolatileWrite methods to use a byref in a volatile manner. Unfortunately they are far more costly than ordinary volatile loads and stores.)
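As a small illustration of reading through a byref in a volatile manner, here is a sketch of the spin loop from the example above with the read made explicit. It uses `Volatile.Read`, which is an assumption on my part about what is available: it is a later addition (.NET Framework 4.5) and not one of the methods named above, and `Thread.VolatileRead(ref y)` could be substituted where it is not. The `ByrefSpin` class and the `maxSpins` bound are made up so the loop always terminates.

```csharp
using System.Threading;

public static class ByrefSpin {
    // Spins until the int behind `y` becomes nonzero, or gives up after
    // `maxSpins` reads. Every read goes through Volatile.Read, so it cannot be
    // treated as loop invariant even though `y` is an ordinary byref.
    public static bool WaitForNonZero(ref int y, int maxSpins) {
        for (int i = 0; i < maxSpins; i++) {
            if (Volatile.Read(ref y) != 0) return true;
        }
        return false;
    }
}
```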
There is one particularly annoying case in which this warning is complete noise: when passing a byref to an API that internally performs volatile (or stronger) loads and stores. I.e., the Interlocked.*, Thread.VolatileRead, and VolatileWrite methods. Because these APIs internally use explicit memory barriers and atomic hardware instructions, the byref will effectively be treated as volatile regardless of whether it was taken from a volatile field or not. And therefore it is safe.
For instance, the compiler will warn you about the following code
static volatile int x;
static void f() { Interlocked.Exchange(ref x, 1); }
even though there is no problem. You can suppress the warning with a "#pragma warning disable" just before the call
static volatile int x;
static void f() {
#pragma warning disable 0420
    Interlocked.Exchange(ref x, 1);
#pragma warning restore 0420
}
and then restore it immediately afterwards. (It's a good idea to restore the warning so that other possibly-problematic instances don't go unnoticed.)
This comes up a whole lot. Why? Because many times you'll mark a field volatile, even though it is updated exclusively with CAS operations, because it's also used in other contexts: e.g., sequences where loads mustn't reorder or erroneously be considered loop invariant. I personally have a habit of always marking these variables as such, mostly as a carryover from Win32 whose InterlockedXX family of APIs demand volatile pointers (i.e., LONG volatile *).
I'm told that this annoying case might be fixed in the next C# compiler, by the way. Until then, I figured I'd throw this up for reference purposes.
 Thursday, January 29, 2009
Reader/writer locks are commonly used when significantly more time is spent reading shared state than writing it (as is often the case), with the aim of improving scalability. The theoretical scalability wins come because the lock can be acquired in a special read-mode, which permits multiple readers to enter at once. A write-mode is also available which offers typical mutual exclusion with respect to all readers and writers. The idea is simple: if many readers can read simultaneously, the theory goes, concurrency improves.
(I’ll be posting an analysis of reader/writer lock scalability in an upcoming post. For a variety of reasons--most related to my recent CAS post--they seldom make a dramatic impact in practice.)
In addition to showing up in libraries--such as Vista’s new SRWLock, .NET’s ReaderWriterLock, and .NET 3.5’s ReaderWriterLockSlim--they are used pervasively in relational databases, distributed transactions, and software transactional memory.
Vance Morrison demonstrated a lightweight reader/writer lock on his blog a couple years back. Although quite small, you can get smaller. Much like the new SpinLock type being made available in .NET 4.0, we can build a ReaderWriterSpinLock that offers several advantages:
- It’s a struct, and so there is no object allocation or space for an object header necessary.
- It’s a single word in size (i.e., 4 bytes).
- No kernel events are ever allocated; we will spin instead.
For cases in which reads are extraordinarily frequent, and writes are extraordinarily rare, this approach can actually be useful. Unfortunately, because one common case in which reader/writer locks scale very well is when hold times are lengthy, as will be shown in my upcoming post, even moderately common writes will result in chewing up a whole lot of wasted CPU time due to the spinning in the third point above. If there’s interest, I will look into implementing a variant of this type that uses events for waiting. Clearly this would sacrifice the second point, the single-word size.
Some design decisions have been made in the name of keeping this thing lightweight:
- No thread affinity will be used.
- And therefore no recursive acquires will be allowed.
The full code is below, at the bottom of this post. But let’s review the details one-by-one.
First, all state is packed into a single field, m_state. We’ll use the 32nd bit to represent whether the write lock is held, and we’ll use the 31st bit to represent whether a writer is attempting to acquire the lock. As with most reader/writer locks, we will give writers priority over readers because they are supposed to be very infrequent. In other words, once a writer arrives, no more read lock acquires will be permitted. The remaining 30 bits will be used to store the reader count. Some masks make this convenient:
private volatile int m_state;
private const int MASK_WRITER_BIT = unchecked((int)0x80000000);
private const int MASK_WRITER_WAITING_BIT = unchecked((int)0x40000000);
private const int MASK_WRITER_BITS = unchecked((int)(MASK_WRITER_BIT | MASK_WRITER_WAITING_BIT));
private const int MASK_READER_BITS = unchecked((int)~MASK_WRITER_BITS);
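To make the bit layout concrete, here is a small sketch that pulls the three pieces of information back out of a state word. The mask values are the ones above; the `RwState` class and its three helper methods are hypothetical and are not part of the lock itself.

```csharp
public static class RwState {
    public const int MASK_WRITER_BIT = unchecked((int)0x80000000);
    public const int MASK_WRITER_WAITING_BIT = unchecked((int)0x40000000);
    public const int MASK_WRITER_BITS = unchecked((int)(MASK_WRITER_BIT | MASK_WRITER_WAITING_BIT));
    public const int MASK_READER_BITS = unchecked((int)~MASK_WRITER_BITS);

    // True when the top bit is set, i.e. a writer holds the lock.
    public static bool WriterHeld(int state) { return (state & MASK_WRITER_BIT) != 0; }

    // True when a writer has announced that it is waiting to get in.
    public static bool WriterWaiting(int state) { return (state & MASK_WRITER_WAITING_BIT) != 0; }

    // The low 30 bits: how many readers are currently inside the lock.
    public static int ReaderCount(int state) { return state & MASK_READER_BITS; }
}
```

For example, a state of `3 | MASK_WRITER_WAITING_BIT` describes three readers inside the lock with a writer waiting behind them.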
Now we can write the four methods: EnterWriteLock, ExitWriteLock, EnterReadLock, ExitReadLock.
Entering the write lock merely entails setting m_state to MASK_WRITER_BIT, provided that we see it available. If it’s not available, we’ll just go ahead and try to set the MASK_WRITER_WAITING_BIT to prevent subsequent read locks from being acquired until we get in. We then go ahead and spin until the lock is available using the new type SpinWait in .NET 4.0, checking the m_state field over and over again. The lock is available if m_state is 0 or MASK_WRITER_WAITING_BIT:
public void EnterWriteLock() {
    SpinWait sw = new SpinWait();
    do {
        // If there are no readers currently, grab the write lock.
        int state = m_state;
        if ((state == 0 || state == MASK_WRITER_WAITING_BIT) &&
            Interlocked.CompareExchange(ref m_state, MASK_WRITER_BIT, state) == state)
            return;

        // Otherwise, if the writer waiting bit is unset, set it. We don't
        // care if we fail -- we'll have to try again the next time around.
        if ((state & MASK_WRITER_WAITING_BIT) == 0)
            Interlocked.CompareExchange(ref m_state, state | MASK_WRITER_WAITING_BIT, state);

        sw.SpinOnce();
    } while (true);
}
Leaving the write lock is actually quite simple. We just set the m_state field to 0, preserving the MASK_WRITER_WAITING_BIT just in case another writer has arrived since we acquired the lock. We use an Interlocked.Exchange (XCHG) operation for this, although we technically could have just done an ordinary write, provided doing so wouldn’t cause memory model or availability problems:
public void ExitWriteLock() {
    // Exiting the write lock is simple: just set the state to 0. We
    // try to keep the writer waiting bit to prevent readers from getting
    // in -- but don't want to resort to a CAS, so we may lose one.
    Interlocked.Exchange(ref m_state, 0 | (m_state & MASK_WRITER_WAITING_BIT));
}
Entering the read lock is even more straightforward. The lock is available for readers when m_state & MASK_WRITER_BITS is 0. In other words, no writer holds the lock and no writer is waiting for the lock. Once we see the lock in such a state, we merely try to add one to the state value and CAS it in. In this way, m_state & MASK_READER_BITS will be equal to the number of concurrent readers in the lock:
public void EnterReadLock() {
    SpinWait sw = new SpinWait();
    do {
        int state = m_state;
        if ((state & MASK_WRITER_BITS) == 0) {
            if (Interlocked.CompareExchange(ref m_state, state + 1, state) == state)
                return;
        }

        sw.SpinOnce();
    } while (true);
}
Lastly, exiting the read lock is the most complicated operation of all. It needs to decrement the reader count, while at the same time preserving the MASK_WRITER_WAITING_BIT:
public void ExitReadLock() {
    SpinWait sw = new SpinWait();
    do {
        // Validate we hold a read lock.
        int state = m_state;
        if ((state & MASK_READER_BITS) == 0)
            throw new Exception("Cannot exit read lock when there are no readers");

        // Try to exit the read lock, preserving the writer waiting bit (if any).
        if (Interlocked.CompareExchange(
                ref m_state,
                ((state & MASK_READER_BITS) - 1) | (state & MASK_WRITER_WAITING_BIT),
                state) == state)
            return;

        sw.SpinOnce();
    } while (true);
}
And that’s it.
Here are some single-threaded performance numbers, comparing the relative costs of several locks out there. These are taken from a large number of acquire/release pairs, i.e., ‘for (int i = 0; i < N; i++) { lock.Enter(); lock.Exit(); }’, for a very large value of N:
Monitor                    0004487479
RWL read lock (legacy)     0023042785  5.13491x
RWL write lock (legacy)    0023118085  5.15169x
SlimRWL read lock (3.5)    0009423579  2.099976x
SlimRWL write lock (3.5)   0008680855  1.934465x
Vance read lock            0004923609  1.097193x
Vance write lock           0004802136  1.070123x
SpinRWL read lock          0004298525  0.9579604x
SpinRWL write lock         0003819024  0.8510431x
The Nx ratios compare the lock in question to Monitor as our baseline. Smaller is better. As you can see, we seem to be on pretty solid ground to start with. But clearly the most interesting part of this whole thing is the scaling numbers--in particular whether read-mode helps with throughput--both for the existing reader/writer locks and our new one. The results may surprise you. That’s coming in the next post...
(Here is the full listing.)
using System;

// We use plenty of interlocked operations on volatile fields below. Safe.
#pragma warning disable 0420

namespace System.Threading {
    /// <summary>
    /// A very lightweight reader/writer lock. It uses a single word of memory, and
    /// only spins when contention arises (no events are necessary).
    /// </summary>
    public struct ReaderWriterSpinLock {
        private volatile int m_state;
        private const int MASK_WRITER_BIT = unchecked((int)0x80000000);
        private const int MASK_WRITER_WAITING_BIT = unchecked((int)0x40000000);
        private const int MASK_WRITER_BITS = unchecked((int)(MASK_WRITER_BIT | MASK_WRITER_WAITING_BIT));
        private const int MASK_READER_BITS = unchecked((int)~MASK_WRITER_BITS);

        public void EnterWriteLock() {
            SpinWait sw = new SpinWait();
            do {
                // If there are no readers currently, grab the write lock.
                int state = m_state;
                if ((state == 0 || state == MASK_WRITER_WAITING_BIT) &&
                    Interlocked.CompareExchange(ref m_state, MASK_WRITER_BIT, state) == state)
                    return;

                // Otherwise, if the writer waiting bit is unset, set it. We don't
                // care if we fail -- we'll have to try again the next time around.
                if ((state & MASK_WRITER_WAITING_BIT) == 0)
                    Interlocked.CompareExchange(ref m_state, state | MASK_WRITER_WAITING_BIT, state);

                sw.SpinOnce();
            } while (true);
        }

        public void ExitWriteLock() {
            // Exiting the write lock is simple: just set the state to 0. We
            // try to keep the writer waiting bit to prevent readers from getting
            // in -- but don't want to resort to a CAS, so we may lose one.
            Interlocked.Exchange(ref m_state, 0 | (m_state & MASK_WRITER_WAITING_BIT));
        }

        public void EnterReadLock() {
            SpinWait sw = new SpinWait();
            do {
                int state = m_state;
                if ((state & MASK_WRITER_BITS) == 0) {
                    if (Interlocked.CompareExchange(ref m_state, state + 1, state) == state)
                        return;
                }

                sw.SpinOnce();
            } while (true);
        }

        public void ExitReadLock() {
            SpinWait sw = new SpinWait();
            do {
                // Validate we hold a read lock.
                int state = m_state;
                if ((state & MASK_READER_BITS) == 0)
                    throw new Exception("Cannot exit read lock when there are no readers");

                // Try to exit the read lock, preserving the writer waiting bit (if any).
                if (Interlocked.CompareExchange(
                        ref m_state,
                        ((state & MASK_READER_BITS) - 1) | (state & MASK_WRITER_WAITING_BIT),
                        state) == state)
                    return;

                sw.SpinOnce();
            } while (true);
        }
    }
}
 Friday, January 16, 2009
I just uploaded a free sample chapter for my Concurrent Programming on Windows book:
2 Synchronization and Time
STATE IS AN important part of any computer system. This point seems so obvious that it sounds silly to say it explicitly. But state within even a single computer program is seldom a simple thing, and, in fact, is often scattered throughout the program, involving complex interrelationships and different components responsible for managing state transitions, persistence, and so on. Some of this state may reside inside a process’s memory—whether that means memory allocated dynamically in the heap (e.g., objects) or on thread stacks—as well as files on-disk, data stored remotely in database systems, spread across one or more remote systems accessed over a network, and so on. The relationships between related parts may be protected by transactions, handcrafted semitransactional systems, or nothing at all.
The broad problems associated with state management, such as keeping all sources of state in-synch, and architecting consistency and recoverability plans all grow in complexity as the system itself grows and are all traditionally very tricky problems. If one part of the system fails, either state must have been protected so as to avoid corruption entirely (which is generally not possible) or some means of recovering from a known safe point must be put into place.
While state management is primarily outside of the scope of this book, state “in-the-small” is fundamental to building concurrent programs. Most Windows systems are built with a strong dependency on shared memory due to the way in which many threads inside a process share access to the same virtual memory address space. The introduction of concurrent access to such state introduces some tough challenges. With concurrency, many parts of the program may simultaneously try to read or write to the same shared memory locations, which, if left uncontrolled, will quickly wreak havoc. This is due to a fundamental concurrency problem called a data race or often just race condition. Because such things manifest only during certain interactions between concurrent parts of the system, it’s all too easy to be given a false sense of security—that the possibility of havoc does not exist.
In this chapter, we’ll take a look at state and synchronization at a fairly high level. We’ll review the three general approaches to managing state in a concurrent system:
- Isolation, ensuring each concurrent part of the system has its own copy of state,
- Immutability, meaning that shared state is read-only and never modified, and
- Synchronization, which ensures multiple concurrent parts that wish to access the same shared state simultaneously cooperate to do so in a safe way.
We won’t explore the real mechanisms offered by Windows and the .NET Framework yet. The aim is to understand the fundamental principles first, leaving many important details for subsequent chapters, though pseudo-code will be used often for illustration.
We also will look at the relationship between state, control flow, and the impact on coordination among concurrent threads in this chapter. This brings about a different kind of synchronization that helps to coordinate state dependencies between threads. This usually requires some form of waiting and notification. We use the term control synchronization to differentiate this from the kind of synchronization described above, which we will term data synchronization.
Read more here...
Relatedly, I was recently interviewed by DZone about the book. You can read my responses here. Enjoy.
 Monday, January 12, 2009
I received some feedback on my previous post, Some performance implications of CAS operations, indicating that a few clarifications are in order. If I had to summarize the intended conclusion, it’d go something like this:
Sharing is evil, fundamentally limits scalability, and is best avoided.
I have to admit that the post was meant to focus more on concrete data, since I expected the meta-point about sharing to be implied. I figured folks would pick up on the link: (i) Sharing memory requires concurrency control, (ii) Concurrency control requires CAS, (iii) CAS is expensive, therefore (iv) Sharing memory is expensive. Many people simply don’t understand how crippling CAS can be when placed in a hot path, and I wanted to point out some (albeit extreme) examples of this point.
I did have a motivation for the post. A lot of people point at lock-free techniques, software transactional memory, reader/writer locks, etc. as ways to improve scalability. Sadly this seldom pans out. Each involves CASs of some sort, and, assuming the lock-based equivalent is written properly (that is, to hold locks for very short periods of time), the alternative can in fact often fare worse. I call this game “count the CASs.” It’s the roundtrips back to shared memory, failed optimistic attempts, cache invalidations, and line ping ponging that kill you.
Some might accuse me of unfairly targeting CAS. That’s hogwash. I’ve been in the trenches for years writing and optimizing systems-level parallel code on Windows. A parallel for loop can go from scaling perfectly to not scaling at all if you choose the wrong granularity for the loop counter increments. And vice versa. Why? Because the frequency of CASs will bring the memory system to its knees. You simply must consider these kinds of things when developing your data structures and algorithms; easing pressure on the cache hierarchy is the only way to scale beyond a handful of processors.
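To illustrate the granularity point, here is a minimal sketch of a parallel loop in which each thread claims a whole chunk of iterations per interlocked operation instead of one iteration at a time. It is an illustration only, not how any particular library implements its parallel for: the `ChunkedFor` class and its parameters are made up, and it assumes `count + threads * chunk` stays well below `int.MaxValue`.

```csharp
using System;
using System.Threading;

public static class ChunkedFor {
    // Runs body(i) exactly once for every i in [0, count), using `threads`
    // threads. Each thread claims `chunk` iterations per interlocked add, so
    // the shared counter is hit roughly count / chunk times instead of count.
    public static void Run(int count, int chunk, int threads, Action<int> body) {
        int next = 0; // the shared loop counter
        var workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() => {
                while (true) {
                    int end = Interlocked.Add(ref next, chunk); // claim a chunk
                    int start = end - chunk;
                    if (start >= count) break;                  // nothing left
                    int stop = Math.Min(end, count);
                    for (int i = start; i < stop; i++) body(i);
                }
            });
            workers[t].Start();
        }
        foreach (Thread w in workers) w.Join();
    }
}
```

With `chunk` set to 1 this degenerates into one interlocked operation per iteration, which is the case that stops scaling; a larger `chunk` trades some load balance for far fewer trips to the shared cache line.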
The sad truth is that only radical changes to the way we write software will allow fine-grained parallelism to scale to the numbers we expect in the 5-year time horizon. Hiding more and more conveniently inserted CAS operations auto-magically for folks is not doing them any good. Mostly functional combined with concurrency-safe mutation on guaranteed-isolated object graphs is, in my opinion, the only path forward.
 Friday, January 09, 2009
Along with type systems, I'm casually interested in formal specifications and verification of software. During lunch today, I watched an internal Microsoft Research talk given by Leslie Lamport. The topic was TLA+ -- his formal verification system -- during which he blurted out a couple amusing quotes:
"Writing is nature's way of letting you know how sloppy your thinking is." --- Guindon (cartoon)
"Math is nature's way of letting you know how sloppy your writing is." --- Leslie Lamport (riffing on Guindon)
And related:
"Formal math is nature's way of letting you know how sloppy your math is." --- Leslie Lamport
They made me chuckle out loud, so I figured I'd share them. Unfortunately the talk isn't available outside the company (as far as I can tell), but Lamport has written a book, Specifying Systems, available online, in addition to dozens of interesting papers, on the topic.
 Thursday, January 08, 2009
CAS operations kill scalability.
(“CAS” means compare-and-swap. This is the term most commonly used in academic literature, but it is commonly referred to under many guises. Windows has historically called it an “interlocked” operation and offers a bunch of such-named Win32 APIs; .NET does the same. This set entails X86 instructions like XCHG, CMPXCHG, and certain instructions prefixed with LOCK, such as INC, ADD, and so on.)
My opening statement is a bit extreme, but it’s true enough. There are several reasons:
0. CAS relies on support in the hardware to ensure atomicity. Namely, most Intel and AMD architectures use a MESI-family cache coherency protocol (MESIF on Intel, MOESI on AMD) to manage cache lines. In such an architecture, CAS operations on uncontended lines that are owned exclusively (E) within a processor’s cache are relatively cheap. But any contention – false or otherwise – leads to invalidations and bus traffic. The more invalidations, the more saturated the bus, and the greater the latency for CAS completion. Cache contention is a scalability killer for non-CAS memory operations too, but the need to acquire a line exclusively makes matters doubly worse when CAS is involved.
1. CAS costs more than ordinary memory operations, in CPU cycles. This is due to the additional burden on the cache hierarchy, and also because of requirements around flushing write buffers, restrictions on speculation across the fences, and impact to a compiler’s ability to optimize around the CAS.
2. CAS is often used in optimistically concurrent operations. That means a failed CAS will lead to a retry of some sort – typically with some kind of backoff – which is purely wasted work that isn’t present when there isn’t any contention. And 0 and 1 both increase the risk of contention.
The most common occurrence of a CAS is upon lock entrance and exit. Although a lock can be built with a single CAS operation, CLR monitors use two (one for Enter and another for Exit). Lock-free algorithms often use CAS in place of locks, but due to memory reordering such algorithms often need explicit fences that are typically encoded as CAS instructions. Although locks are evil, most good developers know to keep lock hold times small. As a result, one of the nastiest impediments to performance and scaling has nothing to do with locks at all; it has to do with the number, frequency, and locality of CAS operations.
As a simple illustration, imagine we’d like to increment a counter 100,000,000 times. There are a few ways we could do this. If we’re just running on a single CPU, we can use ordinary memory operations:
Variant #0:
static volatile int s_counter = 0;
for (int i = 0; i < N; i++)
    s_counter++;
This clearly isn’t threadsafe, but provides a good baseline for the cost of incrementing a counter. The first way we might make it threadsafe is by using a LOCK INC:
Variant #1:
static volatile int s_counter = 0;
for (int i = 0; i < N; i++)
    Interlocked.Increment(ref s_counter);
This is now threadsafe. An alternative way of doing this – commonly needed if we must perform some kind of validation (like overflow prevention) – is to use a CMPXCHG:
Variant #2:
static volatile int s_counter = 0;
for (int i = 0; i < N; i++) {
    int tmp;
    do {
        tmp = s_counter;
    } while (Interlocked.CompareExchange(ref s_counter, tmp + 1, tmp) != tmp);
}
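The validation case mentioned above is where the CMPXCHG form earns its keep. Here is a minimal sketch of a bounded increment; BoundedCounter, TryIncrement, and limit are illustrative names of my choosing, not part of any library.

```csharp
using System;
using System.Threading;

static class BoundedCounter
{
    // Atomically increments `location` unless it has already reached `limit`.
    // Returns false, leaving the value unchanged, when the limit has been hit.
    public static bool TryIncrement(ref int location, int limit)
    {
        while (true)
        {
            int seen = Volatile.Read(ref location);
            if (seen >= limit) return false;   // the validation step
            if (Interlocked.CompareExchange(ref location, seen + 1, seen) == seen)
                return true;                   // our CAS won
            // Another thread changed the value first; reread and try again.
        }
    }

    static void Main()
    {
        int counter = int.MaxValue - 1;
        Console.WriteLine(TryIncrement(ref counter, int.MaxValue)); // True
        Console.WriteLine(TryIncrement(ref counter, int.MaxValue)); // False
    }
}
```

A LOCK INC cannot do this: the check and the update have to be a single atomic step, which is exactly what the compare-and-exchange loop provides.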
An interesting question to ask now is: How much slower will each variant be when cache contention is introduced? In other words, run a copy of each code on P separate processors, incrementing the same s_counter variable by N/P, and compare the running times for different values of P, including 1. You might be surprised by the results.
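For reference, a harness along these lines can be sketched as follows. This is not the code I used for the measurements; ContentionHarness and Measure are hypothetical names, only Variant #1 is shown, the threads are not affinitized, and the sample N is smaller than 100,000,000 so that it finishes quickly.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class ContentionHarness
{
    static int s_counter;

    // Starts p threads that each perform n / p interlocked increments on the
    // same location, releases them together, and times the whole run.
    public static TimeSpan Measure(int n, int p, out int finalValue)
    {
        s_counter = 0;
        int perThread = n / p;
        var threads = new Thread[p];
        using (var go = new ManualResetEventSlim(false))
        {
            for (int t = 0; t < p; t++)
            {
                threads[t] = new Thread(() =>
                {
                    go.Wait(); // line all workers up behind the same gate
                    for (int i = 0; i < perThread; i++)
                        Interlocked.Increment(ref s_counter);
                });
                threads[t].Start();
            }
            Stopwatch sw = Stopwatch.StartNew();
            go.Set();
            foreach (Thread th in threads) th.Join();
            sw.Stop();
            finalValue = s_counter;
            return sw.Elapsed;
        }
    }

    static void Main()
    {
        const int N = 10000000; // assumption: reduced from the article's N
        double baseline = 0;
        foreach (int p in new[] { 1, 2, 4 })
        {
            int final;
            double ms = Measure(N, p, out final).TotalMilliseconds;
            if (p == 1) baseline = ms;
            Console.WriteLine("P = {0}: {1:F2}X (counter = {2})", p, ms / baseline, final);
        }
    }
}
```

Swapping the loop body for the ++ or CompareExchange form gives the other two variants. Expect the ratios to vary from machine to machine and run to run.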
For example, on one of my dual-processor/dual-core (that’s 4-way) Intel machines, the results are as follows. I’ve run Variant #0 even though it’s not threadsafe, simply because it shows the effects of cache contention on ordinary memory loads and stores.
Variant    P = 1     P = 2     P = 4
#0         1.00X     2.11X     3.87X
#1         4.73X    10.74X     7.57X
#2         5.38X    16.70X    73.35X
All numbers are normalized and compared to the ++ code on a single processor. In other words, Variant #0 run on 2 processors is 2.11X the cost of Variant #0 run on 1 processor; similarly, Variant #0 run on 4 processors is 3.87X the cost of Variant #0 run on 1 processor. Variant #1 gets even worse at 4.73X, 10.74X, and 7.57X, respectively. And Variant #2 explodes in cost as more contention is added, going from 5.38X, to 16.70X, to a whopping 73.35X. Adding more concurrency actually makes things substantially worse.
(The absolute numbers are not to be trusted, and there are anomalies undoubtedly introduced based on how threads are scheduled; I’ve not affinitized them, so they may end up sharing sockets at will. A more scientific experiment needs to consider such things.)
The CMPXCHG example (Variant #2) can be improved by strategic spinning when a CAS fails. Part of what makes the numbers so bad – particularly the P = 4 case – is the amount of lost time due to livelock and the associated memory system interference.
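A minimal sketch of what such spinning might look like, using .NET's SpinWait type to back off after a failed attempt. BackoffCounter, Increment, and RunThreads are illustrative names, and whether the backoff actually helps depends on the workload and machine, so measure before adopting it.

```csharp
using System;
using System.Threading;

static class BackoffCounter
{
    static int s_counter;

    // CAS-based increment that backs off after each failed attempt instead of
    // immediately hammering the contended cache line again.
    public static int Increment(ref int location)
    {
        var spinner = new SpinWait();
        while (true)
        {
            int seen = Volatile.Read(ref location);
            if (Interlocked.CompareExchange(ref location, seen + 1, seen) == seen)
                return seen + 1;
            spinner.SpinOnce(); // spins briefly at first, yields after repeated failures
        }
    }

    // Runs threadCount threads that each increment the shared counter perThread times.
    public static int RunThreads(int threadCount, int perThread)
    {
        s_counter = 0;
        var threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            threads[t] = new Thread(() =>
            {
                for (int i = 0; i < perThread; i++) Increment(ref s_counter);
            });
            threads[t].Start();
        }
        foreach (Thread th in threads) th.Join();
        return s_counter;
    }

    static void Main()
    {
        Console.WriteLine(RunThreads(4, 100000)); // 400000
    }
}
```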
This is an extreme example. Few workloads sit in a loop modifying the same location in memory over and over and over again. Even if they do – as in the case of a parallel for loop in which all threads fight to increment the shared “current index” variable – these accesses are ordinarily broken apart by sizeable delays during which useful work is done. Augmenting the test to delay accessing the shared location by a certain number of function calls certainly relieves pressure.
For example, here are the numbers if we add a 2-function-call delay in between accesses:
Variant    P = 1     P = 2     P = 4
#0         1.00X     1.47X     2.78X
#1         2.54X     5.19X     3.67X
#2         2.77X     8.59X    26.55X
And if we add a 64-function-call delay in between accesses, the micro-cost between the three variants doesn’t matter much. But the contention behavior sure is different. And we can even find some cases where the multithreaded variants run faster than the single-threaded counterpart:
Variant    P = 1     P = 2     P = 4
#0         1.00X     0.59X     0.51X
#1         1.00X     0.74X     0.45X
#2         1.00X     0.85X     1.23X
This is the first time we have seen a number < 1.00X. That's a speedup; remember, we are using parallelism after all.
As you might guess, in the region between 2 and 64 function calls the results gradually get better and better; and beyond 64, they get substantially better. In fact, when we insert 128 function calls in between, we get very close to perfect, linear scaling for all 3 variants:
Variant    P = 1     P = 2     P = 4
#0         1.00X     0.50X     0.30X
#1         1.00X     0.52X     0.29X
#2         1.00X     0.52X     0.27X
(As a reminder, 0.50X is a perfect speedup on a 2-CPU machine, and 0.25X is a perfect speedup on a 4-CPU machine.)
The moral of this story is that nothing is free, and CAS is certainly no exception. You should be extremely stingy with adding them to your code, and conscious of the frequency at which threads will perform them. The same is generally true of all memory access patterns when parallelism is in play, but particularly for expensive operations like CAS.
And even if you’re not using CASs directly in your code, you may be using them via some system service. Parallel Extensions uses them in many ways. For instance, when you’re doing a Parallel.For loop, we internally share a counter that is accessed by multiple threads. So even if your algorithm is theoretically embarrassingly parallel, the internal counter management could get in your way. We try to be intelligent by chunking up indices, but we aren’t perfect: if you have very small loop bodies the overhead of CAS could begin to impact scalability. You can work around this by making loop bodies more chunky; one example of how is by doing your own partitioning on top of our library (like executing multiple loop iterations inside the body passed to Parallel.For). Even things like allocating memory with the CLR’s workstation GC require the occasional roundtrip to reserve a thread-local allocation context by issuing a CAS operation against a shared memory location.
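Doing your own partitioning on top of Parallel.For can be sketched like this. ChunkedParallel and its For helper are hypothetical names, the chunk size of 4096 is an arbitrary placeholder, and the code assumes index ranges well below int.MaxValue so the chunk arithmetic cannot overflow.

```csharp
using System;
using System.Threading.Tasks;

static class ChunkedParallel
{
    // Runs body(i) for every i in [fromInclusive, toExclusive), handing the
    // library one chunk of iterations at a time instead of one iteration.
    public static void For(int fromInclusive, int toExclusive, int chunkSize, Action<int> body)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException("chunkSize");
        int count = toExclusive - fromInclusive;
        if (count <= 0) return;
        int chunks = (count + chunkSize - 1) / chunkSize; // round up

        Parallel.For(0, chunks, c =>
        {
            int start = fromInclusive + c * chunkSize;
            int end = Math.Min(start + chunkSize, toExclusive);
            for (int i = start; i < end; i++) body(i); // plain sequential inner loop
        });
    }

    static void Main()
    {
        var squares = new long[1000000];
        For(0, squares.Length, 4096, i => squares[i] = (long)i * i);
        Console.WriteLine(squares[999999]); // 999998000001
    }
}
```

The library now sees a few hundred chunky iterations instead of a million tiny ones, so any shared bookkeeping it does per iteration is amortized across each chunk.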
 Sunday, December 28, 2008
As embarrassing as it is, the errata for Concurrent Programming on Windows is non-empty.
I've posted an initial listing -- full of primarily simple typos like misplaced commas -- at http://www.bluebytesoftware.com/books/winconc/winconc_book_resources.html#Errata.
Sincere thanks to everybody who has reported errors thus far. If you find any additional ones, please email them to me directly: joe AT bluebytesoftware DOT com. We'll attempt to fix as many errors as possible in subsequent printings of the 1st edition and, if that fails, they'll make the 2nd edition.
I've spent the past few months (from September onward) travelling approximately 75% of the time. As a result, I may be slow responding to email concerning the book. I've also not finished putting the code samples up for download; my current ETA for that is mid-January 2009. I already know there are a few more errata entries lurking within, due to some last minute typographical updates made late in the editing process. If only word processing software came complete with built-in compilers... (excuses, excuses)
In any case, I'd love to receive feedback on the book. Even if it's not about an error. Things you like, things you'd like to see improved, things you wish I'd not written about, requests for clarifications, etc. Just drop me an email. Cheers.
 Saturday, November 29, 2008
I've had an obsession with programming languages for some time now. This probably began the first time I learned of LISP. Most people I know have had a similar "Ah-Hah!" moment associated with LISP, but it was when I first truly realized the deep extent to which a programming language shapes thought -- sometimes in negative ways. LISP put it all into perspective.
Since then, the obsession has only become worse through my employment at Microsoft, where I've had the privilege to work alongside and interact with some of the greatest minds in programming languages. This is an absolute honor. I worked on a few compilers and did some language design, particularly when on the Common Language Runtime team, and my favorite project today is my work on type-system support for static enforcement of concurrency safety and guaranteed isolation. I have found great joy in applying underlying concepts in more niche (and extreme) languages like Haskell to more mainstream languages like C#. My favorite pastime is tracing back the lineage of languages to their earliest ideas, especially when this leads to the unearthing of a subtle commonality among them. I have been designing one of my own and, while it is undoubtedly a 5-year project that may never see the light of day, I do it for the love of languages.
This book has been stewing inside me for a while now. And after seeing Guy L. Steele and Richard P. Gabriel's infinitely beautiful "50 in 50" presentation at JAOO this year, I decided it was time for it to escape.
Notation and Thought: Behind Computer Science's Most Influential Programming Languages
“That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted.” --- George Boole, Laws of Thought
Programming languages are not only a notation for expression, but also a medium of thought, akin to the duality between natural written and spoken languages. If you can think it, you can create it. The reverse is also true: if a language poses impediments to your thought process, certain solutions to problems are simply unfathomable. Languages are therefore not just what you see “on paper”--each is a unique tool that can substantially limit, or expand, the creative freedom of the programmer in whose hands it sits. Good languages get out of the way, and great ones do a whole lot more.
In the early days, there was of course nothing that resembled modern day languages. Computers had to be told what to do in excruciating detail. One only has to look at modern day assembly language to see that programming a computer in this manner constrains creativity and slows progress. Alan Turing didn’t even have that when he wrote his classic On Computable Numbers, with an Application to the Entscheidungsproblem paper, but he at least managed to solve some simple problems: by moving a tape reader and reading and writing symbols, he was able to create the modern day equivalent to subroutines and even add up a number or two. But our industry would never have seen the radical advances in enabling technologies, and the widespread computer use, that we enjoy today without significant advances in higher-order abstractions.
Plankalkül, or the plan calculus, is widely recognized as the first real programming language. It was designed by a German computer engineer, Konrad Zuse, and first written down in an unpublished manuscript in 1943. The language offered composite (albeit simplistic) data structures, arrays, named variables, subroutines, and moderately sophisticated control flow and looping constructs. Although it was never used in practice, Plankalkül was surprisingly ahead of its time. It was a big step towards more abstract problem solving.
It should be no surprise that subsequent programming languages are as varied in their design as the humans that created them. This fact can be seen by examining the ensuing decade of computing post-Plankalkül. The 1950s saw the invention of four new major languages that fundamentally shaped the future of language design. FORTRAN, or the FORmula TRANslation language, specialized in describing transformations on data and numerics, and was the first non-assembly language to reach widespread use in performance sensitive situations. LISP, or the LISt Processing language, was developed for symbolic processing and, eventually, found a home in artificial intelligence, pioneering many techniques that are still in use today such as first class functions as data, a recursive style of programming, and garbage collection. Its principles were derived from the mathematical logics of Alonzo Church and Haskell B. Curry, notably Church’s lambda calculus from the 1930s. ALGOL, or the ALGOrithmic Language, focused on describing algorithms elegantly, kick-started the imperative family of languages (of which many popular industry programming languages like C++ and Java are members), and later set the de facto standard style for Computer Science education curricula. Its method of encoding algorithms with assignments was far closer to the von Neumann architecture than was LISP, making the resulting programs behave predictably and efficiently. Lastly, COBOL, or the COmmon Business-Oriented Language, became the first domain-specific language (DSL) that targeted non-programming business and finance experts, broadening the general accessibility of computers. Each of the four has had a crucially important role to play in the history of programming languages.
There has been no shortage of language diversity after the birth of the initial four. In fact, hundreds of languages have since come and gone, some enjoying brief or extended periods of popular use. All that have since come have been deeply influenced by the pioneers, but have also contributed a handful of innovative new ideas that help programmers more clearly think about and express solutions to real-world problems. The lineage of languages has branched off into separately named family trees--such as imperative, functional, logic, declarative and domain-specific--only to reunite intimately with each other down the line. Indeed, it really is just one big happy family.
This book traces this lineage through the most influential languages--those that have deeply impacted the way that programmers think and write--and provides insight into the motivation behind them, their major influences, and the important features that each language contributed. Throughout, it is my hope to develop within the reader a new appreciation of the art of programming computers, an understanding of the impact that language has on our thinking, and an excitement about the future of language design that lies ahead.
Joe Duffy November, 2008
 Tuesday, November 04, 2008
Type classes, kinds, and higher-order polymorphism represent some of Haskell’s most distinctive and important contributions to the world of programming languages. They are all related, and began life as type classes in Wadler and Blott’s 1988 paper, How to make ad-hoc polymorphism less ad hoc. Later, Jones introduced the (then separate) concept of constructor classes in his 1993 paper, A system of constructor classes: overloading and implicit higher-order polymorphism. Eventually these two ideas were unified into a beautiful single set of features (namely, type constructors and kinds) in Haskell.
In this short essay, I’ll explain what these things are and why I’m sad that we don’t have them in C#.
To take the simplest motivating example, say we want to define a generic square function:
square x = x * x
Given a Hindley-Milner type system (with type inference), how should the compiler type this function? The challenge that immediately arises is that, to know the type of x and the function’s return value, we must know something about the function * being called within the body of square. But to know something about that function, we’d need to know the type of x. We’ve entered into a cycle, and have hit a wall. Clearly the type will be something generic, but polymorphic on what?
Imagine that we could infer the type of the * function as follows:
(*) :: a -> a -> b
In other words, * is a function that takes two values, both of type a, and produces some value of another type b. We know its two arguments must be of the same type because in square we pass the same value x to it twice. Given this typing for *, we could then type square similarly as:
square :: a -> b
In other words, square takes a single value of type a and produces a value of type b. The constraint on the type a here is, of course, that some function * is available that is typed as taking an a as input. There’s no obvious way to capture this in the type system, though we might conceive of something like:
square :: (* :: a -> a -> b) => a -> b
In other words, given a type a for which some function * is defined, which takes two a’s and returns a single b, the type of square thus takes an a and produces a b. You can’t say that in Haskell, although we’ll see a bit later that type classes allow similar constraints (with “=>”) to be written.
While this hypothetical typing is extremely general purpose, it would produce considerable challenges in its implementation. Standard ML throws up its hands and resolves overloaded arithmetic operators (like *) to a single default type when nothing else pins them down, meaning that all of the types above (both a and b) collapse to that default. In Standard ML the default is int, so (*) is typed int * int -> int and square is of type int -> int. F# makes the same choice and assumes you’re working with ints. Both Standard ML and F# have amazingly rich type inference systems, but this begins to run right up against the limits of what they can do. We’ll see some harder examples shortly.
You can probably guess that Haskell’s solution to this conundrum is to use higher order polymorphism with a feature of its type system called type classes. They allow us to classify types much in the same way types ordinarily classify objects. We can classify the set of numeric types as follows, for instance:
class Num a where
    (*) :: a -> a -> a
    … other numeric operations …
And then we can go ahead and provide concrete mappings for integers and floating point numbers:
instance Num Int where
    (*) = mulInt
    …
instance Num Float where
    (*) = mulFloat
Each instance of the type class (in this case, Num) is a bit like a dictionary mapping the named functions (in this case, just *) to other functions that are defined for the concrete type supplied in a’s stead (in this case, Int or Float). With this information defined, the Haskell compiler can now infer the type of square as:
square :: Num a => a -> a
This inference really just says that the function square is defined for all types a that are in the type class Num. The “Num a =>” part is a bit like a C# generic type constraint, in that it restricts what kinds of a’s can be supplied. Given what has been stated thus far, that’s just Int and Float. So we can only call the square function with types on which multiplication is properly defined, which is exactly what we want.
At this point, we might want to try defining a similar thing in C# using generics. (And for this simplistic example, and others like Haskell’s Eq a type class, we will succeed.) There are two basic ways we could achieve this. The first is to define an INum<T> interface (or abstract class—pick your poison), and give it an instance method to multiply the target with another number:
interface INum<T> {
    T Mult(T x);
}
We would then have the basic numeric data types like Int32 and Float implement INum<T>:
struct Int32 : INum<Int32> {
    public Int32 Mult(Int32 x) { return value * x; }
    …
}
struct Float : INum<Float> {
    public Float Mult(Float x) { return value * x; }
    …
}
Given these definitions, it would be a breeze to write a Square method that only operates on INum<T>s:
T Square<T>(T x) where T : INum<T> { return x.Mult(x); }
Thankfully, we can recursively reference the T from within the generic type constraint.
Now, of course, there’s no way the C# compiler would infer the necessary INum<T> constraint. But given that we don’t have rich type inference (aside from for local variables) in C#, this doesn’t pose any new problems. Another slight annoyance is that you need to modify the source type to declare support for INum<T>, when a perfectly reasonable implementation could have been provided “from the outside,” but you’ll find that this will only occasionally get under your skin.
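Since we can't actually edit Int32, a runnable version of this first approach has to wrap the primitives. The sketch below does that; Int32Num and DoubleNum are hypothetical wrapper types invented for this example, which is exactly the kind of boilerplate that not being able to add an implementation "from the outside" forces on you.

```csharp
using System;

interface INum<T>
{
    T Mult(T other);
}

// Hypothetical wrapper: System.Int32 cannot be modified to implement INum<T>.
struct Int32Num : INum<Int32Num>
{
    public readonly int Value;
    public Int32Num(int value) { Value = value; }
    public Int32Num Mult(Int32Num other) { return new Int32Num(Value * other.Value); }
}

// Hypothetical wrapper around System.Double, for the same reason.
struct DoubleNum : INum<DoubleNum>
{
    public readonly double Value;
    public DoubleNum(double value) { Value = value; }
    public DoubleNum Mult(DoubleNum other) { return new DoubleNum(Value * other.Value); }
}

static class Numeric
{
    // The constraint plays the role of Haskell's "Num a =>".
    public static T Square<T>(T x) where T : INum<T>
    {
        return x.Mult(x);
    }

    static void Main()
    {
        Console.WriteLine(Square(new Int32Num(7)).Value);            // 49
        Console.WriteLine(Square(new DoubleNum(1.5)).Value == 2.25); // True
    }
}
```

Calling Square on a type that doesn't implement INum<T>, such as string, is rejected at compile time, which is the part of the type class behavior this approach does capture.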
The second way we might go about this is to take an approach similar to .NET’s EqualityComparer<T> class, where we have an abstract base class that represents the ability to do something with instances of Ts. And then we only provide implementations on concrete Ts for which that ability makes sense. For example, we could have a Multiplier<T> that looks a lot like INum<T>:
abstract class Multiplier<T> {
    public abstract T Mult(T x, T y);
}
Multiplier<T> on its own isn’t usable. But we can provide implementations for Int32 and Float:
class Int32Multiplier : Multiplier<Int32> {
    public override Int32 Mult(Int32 x, Int32 y) { return x * y; }
}
class FloatMultiplier : Multiplier<Float> {
    public override Float Mult(Float x, Float y) { return x * y; }
}
// And so on …
Now we can write a slightly different Square method that takes a Multiplier<T> as an extra argument:
T Square<T>(T x, Multiplier<T> m) { return m.Mult(x, x); }
Now there isn’t any kind of generic type constraint on Square’s T, but of course we can only call it if we have a concrete instance of Multiplier<T> in hand. And by definition that means there is a Mult method defined that we can call. (This isn’t entirely true. You can of course call Square<U> for any U, passing in null as the second argument. But presumably the method would check for null and throw. This is a real limitation, however, which would likely push us back in the direction of the original interface solution. If we had non-null types, we could get closer to a fully statically verifiable solution.)
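One way to avoid threading the extra argument through every call is to borrow the other half of the EqualityComparer<T> design: a static Default lookup. The sketch below is an illustration under that assumption, not an existing .NET API; the Default field, SingleMultiplier, and Numeric are names I've made up, and it uses the real int and float types.

```csharp
using System;

abstract class Multiplier<T>
{
    public abstract T Mult(T x, T y);

    // Resolved once per closed type T, in the spirit of EqualityComparer<T>.Default.
    public static readonly Multiplier<T> Default = Create();

    static Multiplier<T> Create()
    {
        if (typeof(T) == typeof(int)) return (Multiplier<T>)(object)new Int32Multiplier();
        if (typeof(T) == typeof(float)) return (Multiplier<T>)(object)new SingleMultiplier();
        return null; // no "instance" exists for this T
    }
}

sealed class Int32Multiplier : Multiplier<int>
{
    public override int Mult(int x, int y) { return x * y; }
}

sealed class SingleMultiplier : Multiplier<float>
{
    public override float Mult(float x, float y) { return x * y; }
}

static class Numeric
{
    public static T Square<T>(T x)
    {
        Multiplier<T> m = Multiplier<T>.Default;
        if (m == null) throw new NotSupportedException("No Multiplier for " + typeof(T));
        return m.Mult(x, x);
    }

    static void Main()
    {
        Console.WriteLine(Square(6)); // 36
        try { Square("six"); }
        catch (NotSupportedException) { Console.WriteLine("no instance for string"); }
    }
}
```

This is more convenient to call, but it trades a compile-time guarantee for a run-time check: Square("six") compiles and only fails when executed.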
Aside from a lot more typing, and the lack of rich type inference, we seem to have reached parity. The simple examples provided in the literature and Haskell’s Standard Prelude can be implemented in such a fashion. But we are kidding ourselves if we think these are the same thing.
The main problem is that C# doesn’t support higher-kinded type parameters. We haven’t yet seen a type class in Haskell that fully exploits this capability, but there are several. The simplest one I know about in the Haskell Standard Prelude is the Functor type class. (Monad is also a great example, but it is complicated, and frightening, enough that it will be a topic for another day.) Functor’s definition is:
class Functor f where
    fmap :: (a -> b) -> f a -> f b
The Functor type class offers a single function, fmap. It takes two things (a function that transforms a value of type a into a value of type b, and some functor value of type f a) and returns some new functor value of type f b. This looks like an ordinary type class, except for one funny (and subtle) aspect. Functor abstracts over type f, but notice that we’re using f in fmap’s second argument and return type by actually applying it to two other types, a and b! In case you’re having a hard time thinking in Haskell, it’s as though we tried to write this in C# using our interface trick from earlier:
interface IFunctor<T> {
    T<B> FMap<A, B>(Func<A, B> f, T<A> a);
}
This won’t compile. We can’t refer to T in the typing of FMap as T<B> and T<A>: it’s not expressible in C# and .NET’s type system. Let’s pretend for a moment, however, that we could. What is an example of a class that might implement this? How about something that deals in terms of Nullable<T> instances?
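// Hypothetical: Nullable<> cannot be passed as a type argument in C#.
class NullableFunctor : IFunctor<Nullable<>> {
    Nullable<B> FMap<A, B>(Func<A, B> f, Nullable<A> a) {
        // Map over the value if there is one; otherwise stay empty.
        return a.HasValue ? new Nullable<B>(f(a.Value)) : new Nullable<B>();
    }
}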
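For contrast, here is roughly what C# lets us write today: one FMap per type constructor, each a free-standing static method. The class names are my own, and the code is only a sketch of the workaround. The two methods share a name and a shape, but nothing in the type system ties them together, so no generic code can be written once against "any functor."

```csharp
using System;
using System.Collections.Generic;

static class NullableFunctor
{
    // fmap specialized to Nullable<>: apply f if a value is present.
    public static B? FMap<A, B>(Func<A, B> f, A? a)
        where A : struct
        where B : struct
    {
        return a.HasValue ? new B?(f(a.Value)) : new B?();
    }
}

static class ListFunctor
{
    // fmap specialized to List<>: apply f to every element.
    public static List<B> FMap<A, B>(Func<A, B> f, List<A> a)
    {
        var result = new List<B>(a.Count);
        foreach (A item in a) result.Add(f(item));
        return result;
    }
}

static class Program
{
    static void Main()
    {
        int? some = 20;
        int? none = null;
        Console.WriteLine(NullableFunctor.FMap<int, int>(x => x + 1, some));          // 21
        Console.WriteLine(NullableFunctor.FMap<int, int>(x => x + 1, none).HasValue); // False
        List<int> doubled = ListFunctor.FMap<int, int>(x => x * 2, new List<int> { 1, 2, 3 });
        Console.WriteLine(string.Join(",", doubled));                                 // 2,4,6
    }
}
```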
All you need to do is take a close look at a 1997 paper by Simon Peyton Jones, Mark Jones, and Erik Meijer, entitled Type classes: an exploration of the design space, and you will find a plethora of even more complicated (and useful) examples that use an innocent-sounding aspect of Haskell’s type system called multi-parameter type classes. All of the types are higher-order and are merely moved around and manipulated like abstract (higher-order) symbols. The type system gracefully gets out of the way and allows you to drop abstract type parameters into any holes they fit in, without mandating that you say too much. The secret sauce—as noted earlier—is kinds.
Kinds are used in the implementation of Haskell’s type system, and you won’t see them mentioned a whole lot anywhere. They basically categorize what kind of types can appear anywhere a type is expected. A great overview (with plenty of context) can be found in Mark P. Jones’s Functional Programming with Overloading and Higher-Order Polymorphism paper and, of course, the Haskell 98 Report.
Here’s a quick rundown. Kinds appear in one of two forms:
- the symbol * represents a concrete type (a.k.a. a monotype), and,
- if k1 and k2 are kinds, then k1 -> k2 is the kind of types that take a type of kind k1 and return a type of kind k2.
Kinds are formed in many ways: the primitive types (such as Char, Int, Float, Double, etc.) are an example of the former, and are of kind *. They “bottom out.” Type constructors like Maybe and [] (the list type), however, are an example of the latter, and are of kind * -> *. That is, they take a type of kind k1 (the first *) and produce a type of kind k2 (the second *). By giving some concrete type T (*) to Maybe, we get back Maybe T (also *). A type constructor is therefore a bit like a function mapping one type to another. The function type constructor (->) has kind * -> * -> *, because a function type is built from two types: the type of its argument (the first *) and the type of its return value (the second *). These compose, so that you might have (* -> *) -> * -> *. And so on. Thinking about kinds can take a bit of getting used to.
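If it helps to think in .NET terms, open generic type definitions give a rough, and admittedly imperfect, analogy: reflection can tell you how many type arguments a definition still expects. The sketch below prints a kind-like signature on that basis. KindOf is a made-up helper, and the analogy stops at first-order kinds, since the CLR has no way to describe something like (* -> *) -> *.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Kinds
{
    // Renders a kind-like signature: "*" for a usable type, and one extra
    // "* ->" per type argument that an open generic definition still expects.
    public static string KindOf(Type t)
    {
        if (!t.IsGenericTypeDefinition) return "*";
        int arity = t.GetGenericArguments().Length;
        return string.Join(" -> ", Enumerable.Repeat("*", arity + 1));
    }

    static void Main()
    {
        Console.WriteLine(KindOf(typeof(int)));           // *
        Console.WriteLine(KindOf(typeof(List<>)));        // * -> *
        Console.WriteLine(KindOf(typeof(List<int>)));     // *
        Console.WriteLine(KindOf(typeof(Dictionary<,>))); // * -> * -> *
    }
}
```

Note that List<> shows up here only as a reflection object; you still can't pass it as a type argument, which is the capability Haskell's kinds provide.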
But the really useful thing here is that kinds allow you to write type classes that abstract over type constructors, like the Functor and Monad classes we have begun to explore above. I.e., given a type t1 of kind k1 -> k2, and a type t2 of kind k1, then t1 t2 is a type expression of kind k2. This can be applied to the occurrences of f a and f b in Functor’s fmap function. In the class Functor f, f is of kind * -> *; that is, it still expects another type before bottoming out. Applying it to a or b, each of kind *, gives f a and f b, both of kind *. When a concrete Functor instance is specified, e.g., by substituting a type constructor T for f, fmap’s arguments become T a and T b. And then we can substitute some concrete U and V types for a and b, yielding the ordinary types T U and T V of kind *.
Now we’re done. And, as if by magic, it all works.
 Sunday, November 02, 2008
A few months back, while writing my new book, I whipped together a tool to dump information about your processor layout using the GetLogicalProcessorInformation function from C#. You can find the code snippet in Chapter 5, Advanced Threads, of my book. (A developer on the Windows Core OS team, Adam Glass, had also written a similar tool in C++.) I will be posting code to the companion site for my book in the coming weeks, at which point you can easily get your hands on it.
Anyway, I sent the code to Mark Russinovich suggesting it might make a useful SysInternals tool, and he agreed. Now it's up on microsoft.com for download, under the name of Coreinfo: http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx. When run, Coreinfo pretty prints information about the mapping from cores to sockets, cores to NUMA nodes, and what kinds of caches are shared on the machine. Particularly for somebody like me who is always running code on different kinds of machines -- and given that parallel code performance heavily depends on memory hierarchy -- I've found this tool to be invaluable. Enjoy.
 Friday, October 31, 2008
Dan Grossman invited me to deliver a talk as part of the University of Washington's Computer Science and Engineering Colloquia series. It was recorded and will eventually air on UWTV, but has also been posted online:
Microsoft's Parallel Computing Platform: Applied Research in a Product Setting
The goal of Microsoft's Parallel Computing Platform (PCP) team is to enable the shift to modern, multi- and manycore hardware, by providing a runtime, programming models, libraries, and tools that make it easy for developers to construct correct, efficient, maintainable, and scalable programs through the use of parallelism. In doing so, decades of industry research have been combined and applied in a myriad of ways. This talk examines PCP's current progress, explicitly relating it to specific research of the past and present, in addition to surveying future efforts and possible research opportunities.
http://norfolk.cs.washington.edu/htbin-post/unrestricted/colloq/details.cgi?id=768
If you're not aware of the work we're doing in Visual Studio 2010 -- both in .NET 4.0 and C++ -- this talk gives a pretty good overview of all of it. It has a researchy feel to it, with plenty of pointers to interesting prior research that has influenced our work along the way.
 Thursday, October 02, 2008
The word “architect” means different things to different people in the context of software engineering. And it varies wildly depending on the kind of organization you’re in. An architect at a medium sized IT shop might focus on connecting disparate business systems together at a high level, but without diving down into code. An architect at a startup may be more like a tech lead, checking in code like mad, but also keeping the rest of the team in check. And a software architect at Microsoft can play an even more varied set of roles because the company is so large and the diversity of projects so great.
A colleague and mentor of mine who I respect greatly says that an architect is the guy (or gal) who is in charge of making those decisions which, if made incorrectly, could sink the project.
There is a lot to be said for this. These decisions are those with the broadest, deepest, and longest lasting impact. The decisions themselves are often made by team members initially, but the architect is responsible for providing constant and rigorous technical oversight. Architects set the high level technical agenda, look ahead several releases, and keep the team on course. They are ultimately to blame if the technical foundation is unsound or the final solution fails to meet expectations. Their butt is on the line.
On one hand, an architect is the lead engineer with most at stake in the project. On the other hand, an architect is more like a member on the project’s board of directors, providing high level guidance and meddling as little as possible (but as much as is necessary) in the day-to-day details.
An architect’s success is measured by what he or she ships to customers, and not by the amazing ideas that were ultimately never realized. This necessarily means an architect’s success is deeply rooted in the team’s culture, work ethic, and ability. He or she needs to work through others to get things done.
There have been some great architects throughout the course of computer science, but who may not have been labeled as such. Linus Torvalds is the architect of Linux, and David Cutler the architect of Windows NT. John Backus was arguably the architect of FORTRAN, Niklaus Wirth the architect of Pascal, Bjarne Stroustrup the architect of C++, James Gosling the architect of Java, and Anders Hejlsberg the architect of C#. Bill Gates was the architect of Microsoft BASIC, and Charles Simonyi the architect of the initial versions of Microsoft Office (Word and Excel). In each case, you can see that the end result is very reflective of one person’s value system and ideas, but took a lot more than just that person to be successful. Each of these people learned to let go of their project just enough that it could achieve the scale that it was meant to achieve, but not so much that the project veered off course. Some projects have multiple architects, but the successful ones usually have one who is really in charge.
Already you can see some subjective opinion being thrown into the mix, and some of it is apt to be controversial. Although not comprehensive, I’ve put together seven guiding principles that I personally aspire to. I’ve certainly not mastered them all, but have always looked up to those people around me who seem to have. Why seven? No reason, really. Over the past few years, I’ve tried to spend as much time as possible learning from successful architects, and these stand out in my mind as the key attributes that appear to be common among them.
0. Inspire and empower people to do their best work.
Architects ultimately succeed or fail based on the quality of people on their team. Knowing how to inspire and empower these people, so that they can do their best work, is therefore one of the most important skills an architect needs in order to be successful.
You can’t do it all yourself. This can be frustrating at times, and at times you might think that you can (particularly in times of frustration). I’ve personally hacked together 1,000s of lines of code that I’m incredibly proud of in a weekend, and that would have taken weeks or months to get done if I had to instead explain the idea to somebody else and wait for them to write those same 1,000s of lines of code. And the 1,000s of lines they write of course wouldn’t end up being the same as the ones you’d have written. And they may decide that they don’t like the design after all, start discussing it with colleagues, stage a mutiny, and ultimately overthrow what once seemed like a great idea. This is a tough pill to swallow. But it’s a sad fact of life that you need to learn to be comfortable with.
The same thing would have happened if you were the one to implement the idea, of course; the difference is that somebody else needs to be empowered to take the kernel of an idea, and run with it. That entails reshaping it as necessary to make it realistic and successful.
I’m not suggesting architects don’t write code (quite the opposite: see #3 below), but you can’t write it all (except for very small projects). If you buy the argument that an architect is just the leading senior engineer on the project, then by definition the architect is probably qualified to write quality code quickly. But what about the code they don’t write? Other people on the team need to write it, and the architect needs to have enough time (where he or she isn’t hacking code) to inspire those people to write the right code. This takes energy and effort. You need to paint a compelling picture of the future, but with enough open-endedness such that the team can flex their creative muscles and fill in the details.
This is the only way to scale. And architects need to scale to achieve broad impact.
Architects should also welcome all ideas with open arms. You want to foster an open and energetic environment on your team, where intellectual debate is the norm. All ideas are fair game.
That’s not to say all ideas are good ones, and ultimately the bad ones need to die a quick and painless death before going too far, but an architect who won’t even entertain new ideas from the team (typically because of NIH, or Not Invented Here, syndrome) often drives away the best engineers. Great engineers hate to be told what to do. They don’t want to feel like they are walking in the shadows of somebody else. They want to use the skills that make them so great, which involves inventing bigger, badder, and more impactful designs. And you want them to use these skills too, because that’s why you hired them: these skills are crucial to the success of your project. Part of your role as the team’s architect is to recognize who on the team has the most potential, and to arrange for them to have as much leeway and creative freedom as possible. You don’t want to end up with a bunch of lackeys whose job is to “just implement” your ideas, because you’ll get what you paid for.
It’s a true sign of success when the culture you impart unto your team allows them to invent things in the spirit of your own design principles, but without you needing to do it yourself. Jim Gray, for example, inspired countless people to do great things. Does he get credit for each of those ideas? Of course not. But was he indirectly responsible for them to some degree, and do they all have a little Jim Gray in them? Absolutely. Being an architect on a team is similar; not every idea has to be your own. In fact, it’s far more powerful if few of them are.
1. Oversight, but not dictatorship.
That brings me to technical oversight. Because an architect is typically not a manager for his or her project (although in some cases he or she may be), arms-length influence needs to be used to get things done. In fact, the architect may have very little say over specific project management, scheduling, and budget decisions, but is typically on the senior leadership team for the project. So when I talk about “leeway” above, I’m talking about the degree to which an architect monitors and attempts to meddle with the progress of the team. While it’s tempting for an architect to set the ship sailing to sea, and then turn around to work on the next big thing, this almost never works. The initial vision and idea is far from a shipping solution, and software engineering only gets interesting once you actually try to build something. Ideas are cheap. The architect needs to help the team work through the ramifications of certain technical decisions that were made up front, and help with the continual course correction.
Because an architect’s butt is ultimately on the line, he or she needs to work as fast as possible to correct problems when something goes wrong. This implies the architect is involved enough to notice when something goes wrong, hopefully well in advance of anybody else seeing it. I’ve seen many models that work, ranging from the architect being the approver for all major design decisions, to the architect simply reviewing all major design decisions after-the-fact, to the architect delegating this responsibility to trusted advisers. For example, Linus Torvalds for the longest time required that all checkins to the Linux code base be reviewed by him. Anders Hejlsberg still effectively approves each C# language design change. In my opinion, the closer to each major decision the architect can afford to be, the better.
Left to its own devices, the team would veer off course in no time. That’s not because of malicious intent, but rather because of the sheer diversity of software engineers. This diversity is present on many levels: in skill level, taste (which is hard to measure: more on that in #2 below), motivation, work ethic, interpretation of the vision, personal beliefs and experience, and so on. An architect acts as a low-pass signal filter, smoothing out any irregularities that deviate too far from the core design principles.
In Tony Hoare’s ACM Turing Award paper of 1981, The Emperor’s Old Clothes, he explains the risk of not providing this kind of architectural oversight:
“’You know what went wrong?’ he shouted - he always shouted – ‘You let your programmers do things which you yourself do not understand.’ I stared in astonishment. He was obviously out of touch with present day realities. How could one person ever understand the whole of a modern software product like the Elliott 503 Mark II software system? I realized later that he was absolutely right; he had diagnosed the true cause of the problem and he had planted the seed of its later solution.”
Sadly, this responsibility often entails being “the bad guy”. Sometimes you need to mercilessly kill an idea because it would put certain parts of the project at risk. Other times you need to let somewhat bad (but not too impactful) ideas go. There’s a tradeoff here, because each time you kill an idea you’re going to leave somebody feeling burned. And you may waste people’s time, depending on how much time has already been invested in that idea. Some battles are best left unfought. There is an art to be learned here: if you can get those with the idea to firmly believe that there has to be a better way, you can avoid being seen as the bad guy. “Sit back and wait” can work in some cases, but it can backfire too.
The deep involvement in the technical design details unfortunately means that the architect can become the bottleneck if he or she is not careful. This can slow the team down. Some slowdown can admittedly be a good thing, because it has the effect of forcing more thoughtfulness in each and every decision. But as the team grows, the granularity of decision oversight necessarily has to change to ensure the team is empowered to make progress. In order for this to work, you need to have trusted individuals who are involved at a finer granularity and will use the same principles and values. This takes trust and time.
2. Taste is a hard thing to measure, but is invaluable.
Software engineers like to measure. Many people try to make design decisions based on quantitative data, even though they know that engineering is more of an art than a science. But there is one common trait that, as far as I can tell, is impossible to measure, and yet common to all of the great software architects I know: good taste. And because it’s impossible to measure, those who lack it have a hard time understanding the difference between a design with good taste and one with bad taste.
There is a certain elegance and beauty to the designs created by architects with good taste. When you see it from a distance, you know it, but when viewed under a microscope—the kind of microscope used when debating the finer points with other engineers on the team—it is much harder to detect. Often it’s incredibly difficult to articulate why some particular design has good taste, which makes it even harder to justify. Eventually people are willing to trust your judgment because they begin to see it too.
In fact, good taste is perhaps one of the most important skills an architect needs to have. Bad taste leads to clunky designs that nobody likes to use. Steve Jobs knows this. And yet taste is probably the most difficult skill for an architect to develop, and one of the subtler ones that few people recognize as being necessary. Many managers think that throwing more engineers at a design problem will solve it, when in reality often all that is necessary is one person with very good taste and an eye for detail.
I’m not certain where taste comes from: an innate skill? Perhaps, but not exclusively. In my best estimation, good taste can be learned from paying close attention to the right things, taking a step back and viewing designs from afar often enough, being learned in what kinds of software have been built and were successful in the past, and having a true love of the code. That last part sounds cheesy, but is true enough to reemphasize: if you don’t feel a certain passion for your code and project, it’s a lot easier to let bad taste run rampant, because your care level isn’t as intense as it needs to be.
3. Write code and get your hands dirty.
The best architects realize that code is king. It rules all else. At the end of the day, Visio diagrams, high level vision documents, whiteboard works of art, design documents, emails, functional specifications, and so on, are all a means to an end, not the end itself. The code is your product, and if you don’t understand the code, you don’t understand the state of the project. And if you don’t understand that, you’re not in a position to know what’s working well, what isn’t working, and you can’t possibly have the deep understanding necessary to influence the engineers on the team. You’ve lost control.
The worst architects couldn’t code themselves out of a cardboard box. If you’re not writing actual product code, you’re not an architect: you’re an ivory tower has-been, and probably doing more damage than you are helping matters. Do your team a favor and move into management as quickly as possible.
Writing code also has the benefit of ensuring that you maintain credibility with the team. It’s easy to dictate crazy and grandiose ideas, but if you’re the one who has to implement such a grandiose idea, you’re apt to be more sympathetic with and mindful of the other engineers of the team. You need to keep yourself grounded and writing real product code will help to ensure your technical decision making carefully considers the implementability and down-to-Earth ramifications of your decisions.
Moreover, you need to be a programming expert. People need to respect your abilities, and you want your team to look up to you. You want them to come and ask for your advice because they want it, and enjoy it, and not force them to deal with you simply because of your position on the team. All of the great architects I’ve worked with have inspired me to grow simply because they know so damn much, and because I learn something new every time I interact with them. If they didn’t write code and understand the nitty gritty technical esoterica, this relationship would have been a shallow one.
4. The power of the dyad: know your weaknesses.
Architects need to play a dual role in understanding both business and technical needs and strategy. The degree of business savviness varies greatly among architects, although the best architects I know have a unique ability to understand both sides of the coin. But at the end of the day, they are first and foremost technology wonks, and the business angle is more of a curious hobby. In music, two notes sounding together form a dyad, while three or more form a chord. The best architects I know realize their relative weakness on the business end of things and partner up with another senior leader with complementary skills, to fill in the gaps: this forms a harmonic interval. A dyad.
The partnership needn’t entirely be “business” vs. “technical”, although in commercial software that’s more often than not the two opposing forces. For example, my impression of the development of Scheme is that Guy Steele played the role of the architect while Gerald Sussman was the more business-oriented advisor, looking at how Scheme might be used to advance the broader research agenda but not necessarily meddling in the technical design details of the project.
If an architect is 80% technology and 20% business, partnering with somebody who is 20% technology and 80% business can be a killer combination. This allows you to bounce ideas off one another, and to get a certain level of objective feedback from a different perspective. If you’ve got a great technical idea, and bounce it off another techno-nerd, you might spend hours or days debating technical details that ultimately boil down to a matter of taste. But if you take that same idea and bounce it off your business partner, he or she is likely to provide more pertinent feedback: does it make sense from a business perspective, will customers need it, will it open up new product or revenue opportunities, are there more pressing matters to focus the team on, etc. These are things that, being a technology guy (or gal), you might not immediately think of. But remember: it’s all about the customer.
5. It's for the customer, not you.
The best engineers often succeed because they focus on scratching a personal itch. That’s what Linus Torvalds, Bjarne Stroustrup, and countless others did. This is why Donald Knuth created TeX. The idea for a new technology thus begins as a very personal and selfish act. “Build something you’d use yourself, and the customers will come” is a common (cliché) idiom. Although there is certainly truth to this, it’s true only because the very fact that it is bothersome to the founding engineer is likely indicative that it’s bothersome to a broader set of people. It’s an example, where an example is just one element in a set that is used to demonstrate some common attribute among all elements in that set. Those people are your customers.
As a technology matures, it’s important to realize—particularly when building commercial software—that actual human beings will want to use the technology. It’s important to understand and respect their needs. It’s important to, at some point, realize that you’re not, in fact, building a system entirely for your own personal use. Not realizing this point can blind you and make you neglect the need to partner with somebody who understands the business angle of things. It can also lead to a feeling of needing to develop the perfect idealized solution and never ship to customers. Hey, when there are endless technical problems to work on, who would want to ship anyway? By its very definition, shipping software means that you’ve solved all of the major technical problems within a certain scope. What fun is that?
The fun is that you’re able to make an impact on your customers’ lives, hopefully for the better. Your initial technical vision has come to fruition, and you can move on. You get to prove your ideas by having real human beings use the end product. If you never get to that state, then you’ve done some possibly interesting research (which is hopefully documented and used by somebody someday in the future to actually impact people by delivering a system based on those ideas), but you haven’t architected a product. You’re a researcher, not an architect.
6. Admit when you're wrong, fall on your sword, and then fix it.
You are going to be wrong sometimes. Trying to do big and bold things necessarily involves some risk. Being an architect requires a careful balance between sticking to your guns—your guiding principles and technical vision—and realizing when things aren’t working out and course correcting before it’s too late. It’s hard to tell when things are beginning to go off course, but when they’ve already gone off course it’s usually obvious. A common telltale sign that things are in trouble is when the team no longer believes in the vision. This may translate into attrition (often of your best engineers first), or just hallway grumblings. Listen carefully. If you’re not involved in the design decisions, writing code, and actually playing a significant role in your team’s daily lives, then you’re apt to miss this. As the architect, you are responsible for responding as quickly as possible to such situations before the shit hits the fan.
Some architects can fall into the trap of using dogma over intellect. Firm principles are of course something I’ve stressed throughout this article. But you need to be honest with yourself and admit when things are not going well. An architect who stands at the helm of a sinking ship, proclaiming that the ship stay its course because the brave new world lies ahead, will only drown (alone) when the ship finally goes underwater. Although this architect can then go around blaming his team for the failure (“if they had only seen the vision and stuck around, we would have succeeded”), the project will be long gone by then. It’s harder, but more noble, to recognize the problems proactively and do your best to fix them.
For example, Tony Hoare describes, in the same ACM Turing Award paper mentioned above, how he felt responsible for the failure of the Elliott 503 Mark II project:
“There was no escape: The entire Elliott 503 Mark II software project had to be abandoned, and with it, over thirty man-years of programming effort, equivalent to nearly one man’s active working life, and I was responsible, both as designer and as manager, for wasting it.”
It can be particularly disturbing to realize that a large number of people have been going off in the wrong direction on your watch. Yes, you wasted their time. But you have to learn what went wrong, internalize it, and commit to never making the same mistake twice. You owe it to them to respond promptly. Everybody on the team will have learned and grown from the circumstances, and if you’re lucky the situation is salvageable. Sometimes it won’t be. But in any case you will gain the respect of many around you by making the right decision; particularly if you’re the only person with the broad technical responsibility, understanding, and insight necessary to make such a decision, people will feel relieved when you make it. And if you don’t make it, people will curse you for it.
In conclusion
I’m sure there are many other laundry lists of skills people might come up with that are necessary to be an effective architect, but these are a few of the things I see and respect in the people I look up to. I’ve named some of these people throughout this article. The most common trait is that they have done great things and left their mark on the industry. Being an architect, in the end, is all about helping others to succeed. If you’re a really good architect, you’ll inspire people and rub off on them. You’ll gain a certain level of respect that is unmistakable and priceless. And that, in my opinion, is far more fulfilling than anything you could accomplish on your own working in a vacuum.
 Wednesday, October 01, 2008
The October 2008 MSDN Magazine issue just went live with 5 articles on concurrency, plus the editor's note. Four of the articles are written by members of the Parallel Computing team here at Microsoft, including one by me:
Enjoy the text, and be careful not to overdose on the excess of parallelism goodness. This edition was timed intentionally to coincide with the PDC. I'm hoping to see you there, because we have a plethora of exciting things to show, spanning managed .NET and native C++ programming. These articles are really just teasers.
 Sunday, September 21, 2008
The enumeration pattern in .NET unfortunately implies some overhead that makes it difficult to compete with ordinary for loops. In fact, the difference between
T[] a = …;
for (int i = 0, c = a.Length; i < c; i++) …action(a[i])…;
and
T[] a = …;
IEnumerator<T> ae = ((IEnumerable<T>)a).GetEnumerator();
while (ae.MoveNext()) …action(ae.Current)…;
is about 3X. That is, the former is 1/3rd the expense of the latter, in terms of raw enumeration overhead. Clearly as action becomes more expensive the significance of this overhead lessens. But if your plan is to invoke a small action over a large number of elements, using an enumerator instead of indexing directly into the array could in fact cause your algorithm to take 3X longer to finish.
There are many reasons for this problem, and they are probably obvious. Using an enumerator requires at least two interface method calls just to extract a single element from the array. Because there are O(length) of these operations, the overhead imposed is O(length) as well. Contrast that with the nice, compact for loop, which emits ldelem IL instructions that access the array directly. This ends up computing some offset (e.g., i * sizeof(T)) and dereferencing right into the array memory. The enumerator needs to do that too, of course, but only after the two interface calls are made. Additionally, it is possible for the JIT compiler to omit the bounds check on the array access if it knows ‘c’ in the predicate ‘i < c’ was computed from ‘a.Length’, because the length of a .NET array is fixed when the array is created and cannot change afterwards.
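To make the two forms concrete, here is a minimal sketch that computes the same sum both ways. The class and method names (EnumerationForms, SumWithFor, SumWithEnumerator) are made up for illustration and are not part of the measurements in this post; the point is only to show where the two interface calls per element come from.

```csharp
using System;
using System.Collections.Generic;

public static class EnumerationForms
{
    // Direct indexing: each iteration is a single array element load.
    public static long SumWithFor(int[] a)
    {
        long sum = 0;
        for (int i = 0, c = a.Length; i < c; i++)
            sum += a[i];
        return sum;
    }

    // Pull-style enumeration through the interface: every element costs
    // one MoveNext call plus one Current call.
    public static long SumWithEnumerator(int[] a)
    {
        long sum = 0;
        using (IEnumerator<int> e = ((IEnumerable<int>)a).GetEnumerator())
        {
            while (e.MoveNext())
                sum += e.Current;
        }
        return sum;
    }
}
```

Both methods return the same value for the same input; only the per-element dispatch cost differs.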
(Strangely, it appears going through IList<T> is even slower than enumeration. In fact, it appears to be more than 3X the cost of going through IList<T>’s enumerator, and over 10X that of indexing into the array using true ldelem instructions instead of interface calls to IList<T>’s element indexer.)
All of this actually makes it somewhat difficult for those on my team building PLINQ to compete with hand written programs. That’s true of LINQ generally. In fact, LINQ tends to be worse, because you string several enumerators together to form a query, often leading to even more overhead attributed to enumeration. So you might reasonably wonder: if people care about performance, then why would they willingly start off 3X “in the hole” in hopes that they will eventually gain it back when they use machines with >= 4 cores? It’s a completely fair criticism (although you must recall that everything I’m talking about is “pure overhead” and once you begin to have sizable computations in the per-element action it matters less and less). We continually do a lot of work to try to recoup these costs.
There are actually many alternative enumeration models, and I think .NET needs to change direction in the future. In addition to the overhead associated with the pattern, .NET’s enumeration pattern is a “pull” model (versus “push”), which makes it incredibly hard to tolerate blocking within calls to MoveNext. Over time, I think we will need to pursue the push model more seriously.
I’ve thrown together a few different examples of alternative enumeration techniques. To cut to the chase, here is a simple micro-benchmark test that enumerates over 1,000,000 elements 25 times, invoking an empty (non-inlineable) method for each element. The per-element work here is quite small (although not empty) and so the results are a bit more extreme than a real workload would show:
Test                      Ticks      % of baseline
For loop (int[])          739255     (baseline)
For loop (IList<int>)     7534609    1019.216%
ForEach loop (int[])      829617     112.2234%
int[] IEnumerator<int>    2152414    291.1599%
IEnumerator<int>          2062876    279.048%
IFastEnumerator<int>      1758992    237.9412%
IForEachable<int> [s]     1103745    149.305%
IForEachable<int> [i]     976742     132.1252%
IForEachable2<int>        957883     129.5741%
These are:
- “For loop (int[])” is an ordinary for loop over the array directly.
- “For loop (IList<int>)” is an ordinary for loop over the array’s IList<T> interface.
- “ForEach loop (int[])” is an ordinary foreach loop over the array directly.
- “int[] IEnumerator<int>” uses the array’s implementation of IEnumerator<T>.
- “IEnumerator<int>” is a custom IEnumerator<T> implementation.
- “IFastEnumerator<int>” is an implementation of a new pull interface (defined below).
- “IForEachable<int>” is an implementation of a new push interface (defined below) that uses delegates to represent the per-element action. The only difference between the “[s]” and “[i]” variants is that the delegate is bound to a static method for “[s]” and an instance method for “[i]”.
- “IForEachable2<int>” is a slight variant of IForEachable<T> (also defined below).
Notice that with IForEachable2<T>, we’ve gotten within 30% of the efficient for loop. Unfortunately, I do get somewhat different numbers when compiling with the /o+ switch:
Test                      Ticks      % of baseline
For loop (int[])          777746     (baseline)
For loop (IList<int>)     7569517    973.2634%
ForEach loop (int[])      735846     94.61264%
int[] IEnumerator<int>    2340361    300.9159%
IEnumerator<int>          2063039    265.2587%
IFastEnumerator<int>      1806568    232.2825%
IForEachable<int> [s]     1090644    140.2314%
IForEachable<int> [i]     946090     121.6451%
IForEachable2<int>        1234201    158.6895%
For comparison purposes, I get numbers like this if the loop body is completely empty except for accessing the current element:
Test                      Ticks      % of baseline
For loop (int[])          452039     (baseline)
For loop (IList<int>)     422732     93.51671%
ForEach loop (int[])      461274     102.043%
int[] IEnumerator<int>    1958711    433.3058%
IEnumerator<int>          1730502    382.8214%
IFastEnumerator<int>      1372421    303.6068%
IForEachable<int> [s]     1091720    241.5101%
IForEachable<int> [i]     958401     212.0173%
IForEachable2<int>        664572     147.0165%
And this (with /o+):
Test                      Ticks      % of baseline
For loop (int[])          262146     (baseline)
For loop (IList<int>)     263302     100.441%
ForEach loop (int[])      372924     142.2581%
int[] IEnumerator<int>    1889132    720.6412%
IEnumerator<int>          1635837    624.0175%
IFastEnumerator<int>      1479579    564.4103%
IForEachable<int> [s]     1096712    418.3592%
IForEachable<int> [i]     962261     367.0706%
IForEachable2<int>        698340     266.3935%
These numbers aren’t quite as meaningful because we have no idea what’s being optimized away by the C# and JIT compilers. For example, they may notice we’re not using the current element at all and therefore eliminate the access altogether. Nevertheless, the relative ranking of efficiency has remained nearly the same (with the notable exception of the array’s IList<T> test, which fares far better here).
(All of these numbers were gathered on a 32-bit OS on a 64-bit machine. Because the JIT compilers for 32-bit and 64-bit are so different, you can expect vastly different results across architectures.)
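For readers who want to reproduce this kind of measurement, the following is a rough sketch of how such a harness might be structured. It is not the harness used for the numbers above: the names EnumerationBenchmark, PerElementAction, TimeForLoop, and PercentOfBaseline are hypothetical, and the per-element method increments a counter (an assumption made here so the call count can be checked), whereas the original test method is described as essentially empty.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

public static class EnumerationBenchmark
{
    // Counts invocations so a caller can confirm the loop really ran.
    public static int Calls;

    // NoInlining keeps the JIT from folding the call into the loop body.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void PerElementAction(int x)
    {
        Calls++;
    }

    // Times 'iterations' passes of an ordinary for loop over the array.
    public static long TimeForLoop(int[] a, int iterations)
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
        {
            for (int i = 0, c = a.Length; i < c; i++)
                PerElementAction(a[i]);
        }
        sw.Stop();
        return sw.ElapsedTicks;
    }

    // Expresses a measurement relative to the baseline, as in the tables.
    public static double PercentOfBaseline(long ticks, long baselineTicks)
    {
        return 100.0 * ticks / baselineTicks;
    }
}
```

Calling EnumerationBenchmark.TimeForLoop(new int[1000000], 25) would mirror the element count and repetition count described above; each of the other variants would get its own timing method with the same shape.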
Anyway, here is what IFastEnumerator<T>, IForEachable<T>, and IForEachable2<T> look like:
interface IFastEnumerable<T>
{
    IFastEnumerator<T> GetEnumerator();
}

interface IFastEnumerator<T>
{
    bool MoveNext(ref T elem);
}

interface IForEachable<T>
{
    void ForEach(Action<T> action);
}

interface IForEachable2<T>
{
    void ForEach(Functor<T> functor);
}

abstract class Functor<T>
{
    public abstract void Invoke(T t);
}
I also have a data type called SimpleList<T> that implements each of these, including IEnumerable<T>. This is what the test harness uses for its benchmarking. So any boneheaded mistakes I’ve made in the implementation of this class could cause us to draw the wrong conclusions about the interfaces themselves. Hopefully there are none:
class SimpleList<T> :
    IEnumerable<T>, IFastEnumerable<T>, IForEachable<T>, IForEachable2<T>
{
    private T[] m_array;

    public SimpleList(T[] array) { m_array = array; }

    // Etc …
}
The class of course implements IEnumerable<T> in the standard way:
IEnumerator<T> IEnumerable<T>.GetEnumerator()
{
    return new ClassicEnumerable(m_array);
}

System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
    return new ClassicEnumerable(m_array);
}

class ClassicEnumerable : IEnumerator<T>
{
    private T[] m_a;
    private int m_index = -1;

    internal ClassicEnumerable(T[] a) { m_a = a; }

    public bool MoveNext() { return ++m_index < m_a.Length; }
    public T Current { get { return m_a[m_index]; } }
    object System.Collections.IEnumerator.Current { get { return Current; } }
    public void Reset() { m_index = -1; }
    public void Dispose() { }
}
The idea behind IFastEnumerable<T> (and specifically IFastEnumerator<T>) is to return the current element during the call to MoveNext itself. This cuts the number of interface method calls necessary to enumerate a list in half. The impact to performance isn’t huge, but it was enough to cut our overhead from about 3X to 2.3X. Every little bit counts:
IFastEnumerator<T> IFastEnumerable<T>.GetEnumerator()
{
return new FastEnumerable(m_array);
}
class FastEnumerable : IFastEnumerator<T>
{
private T[] m_a;
private int m_index = -1;
internal FastEnumerable(T[] a) { m_a = a; }
public bool MoveNext(ref T elem)
{
if (++m_index >= m_a.Length)
return false;
elem = m_a[m_index];
return true;
}
}
(Update: after writing the blog post, I made a couple of slight optimizations that make this a bit tighter (fewer field fetches):
class FastEnumerable : IFastEnumerator<T>
{
private T[] m_a;
private int m_index = -1;
internal FastEnumerable(T[] a) { m_a = a; }
public bool MoveNext(ref T elem)
{
T[] a = m_a;
int i;
if ((i = ++m_index) >= a.Length)
return false;
elem = a[i];
return true;
}
}
The impact to performance isn't huge, but does improve the performance to about 2.1X of the baseline.)
The IForEachable<T> interface is a push model in the sense that the caller provides a delegate and the ForEach method is responsible for invoking it once per element in the collection. ForEach doesn’t return until this is done. In addition to making far fewer method calls to enumerate a collection, it makes no interface method calls per element: the only interface call is the single call to ForEach itself. Delegate dispatch is also much faster than interface method dispatch. The result is nearly twice as fast as the classic IEnumerator<T> pattern (when /o+ isn’t specified). Now we’re really getting somewhere!
void IForEachable<T>.ForEach(Action<T> action)
{
T[] a = m_array;
for (int i = 0, c = a.Length; i < c; i++)
action(a[i]);
}
Delegate dispatch still isn’t quite the speed of virtual method dispatch. And delegates bound to static methods are actually slightly slower than those bound to instance methods, which is why you’ll notice a slight difference in the original “[s]” versus “[i]” measurements. The reason is subtle. There is a delegate dispatch stub that is meant to call the target method. When the delegate refers to an instance method, the ‘this’ reference passed in ECX (the register the x86 CLR calling convention uses for the first argument) points to the delegate object when it is invoked, and the stub can simply replace it with the target object and jump. For static methods, however, all of the arguments need to be “shifted” downward: there is no ‘this’ reference to be passed, so the first actual argument to the static method must take the place of the current value in that register.
The IForEachable2<T> interface just replaces delegate calls with virtual method calls. Somebody calling it will pass an instance of the Functor<T> class with the Invoke method overridden. The implementation of ForEach then looks quite a bit like IForEachable<T>’s, just with virtual method calls in place of delegate calls:
void IForEachable2<T>.ForEach(Functor<T> functor)
{
T[] a = m_array;
for (int i = 0, c = a.Length; i < c; i++)
functor.Invoke(a[i]);
}
And that’s it. Here is the program that drives the little micro-benchmark tests that I showed output for at the beginning:
class Program
{
public static void Main()
{
const int size = 2500000;
Random r = new Random();
int[] array = new int[size];
for (int i = 0; i < size; i++) array[i] = r.Next();
SimpleList<int> list = new SimpleList<int>(array);
const int iters = 25;
long baseline = 0;
GC.Collect();
GC.WaitForPendingFinalizers();
// Regular for loop
{
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < iters; i++)
{
for (int j = 0, c = array.Length; j < c; j++)
DoNothing(array[j]);
}
baseline = sw.ElapsedTicks;
Console.WriteLine("For loop (int[])\t{0} tcks\t% of baseline", baseline);
}
// Regular for loop (IList<int>)
{
Stopwatch sw = Stopwatch.StartNew();
IList<int> ia = array;
for (int i = 0; i < iters; i++)
{
for (int j = 0, c = ia.Count; j < c; j++)
DoNothing(ia[j]);
}
long elapsed = sw.ElapsedTicks;
Console.WriteLine("For loop (IList<int>)\t{0} tcks\t{1}%",
elapsed, 100*(elapsed / (float)baseline));
}
GC.Collect();
GC.WaitForPendingFinalizers();
// Regular foreach loop
{
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < iters; i++)
{
foreach (int x in array)
DoNothing(x);
}
long elapsed = sw.ElapsedTicks;
Console.WriteLine("ForEach loop (int[])\t{0} tcks\t{1}%",
elapsed, 100 * (elapsed / (float)baseline));
}
GC.Collect();
GC.WaitForPendingFinalizers();
// IEnumerator<int> over int[]
{
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < iters; i++)
{
IEnumerator<int> e = ((IEnumerable<int>)array).GetEnumerator();
while (e.MoveNext())
DoNothing(e.Current);
}
long elapsed = sw.ElapsedTicks;
Console.WriteLine("int[] IEnumerator<int>\t{0} tcks\t{1}%",
elapsed, 100 * (elapsed / (float)baseline));
}
// IEnumerator<T>
{
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < iters; i++)
{
IEnumerator<int> e = ((IEnumerable<int>)list).GetEnumerator();
while (e.MoveNext())
DoNothing(e.Current);
}
long elapsed = sw.ElapsedTicks;
Console.WriteLine("IEnumerator<int>\t{0} tcks\t{1}%",
elapsed, 100 * (elapsed / (float)baseline));
}
GC.Collect();
GC.WaitForPendingFinalizers();
// IFastEnumerator<T>
{
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < iters; i++)
{
int x = 0;
IFastEnumerator<int> e = ((IFastEnumerable<int>)list).GetEnumerator();
while (e.MoveNext(ref x))
DoNothing(x);
}
long elapsed = sw.ElapsedTicks;
Console.WriteLine("IFastEnumerator<int>\t{0} tcks\t{1}%",
elapsed, 100 * (elapsed / (float)baseline));
}
GC.Collect();
GC.WaitForPendingFinalizers();
// IForEachable<T>
{
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < iters; i++)
{
Action<int> act = new Action<int>(DoNothing);
((IForEachable<int>)list).ForEach(act);
}
long elapsed = sw.ElapsedTicks;
Console.WriteLine("IForEachable<int> [s]\t{0} tcks\t{1}%",
elapsed, 100 * (elapsed / (float)baseline));
}
GC.Collect();
GC.WaitForPendingFinalizers();
// IForEachable<T>
{
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < iters; i++)
{
DoNothingClosure dnc = new DoNothingClosure();
Action<int> act = new Action<int>(dnc.DoNothing);
((IForEachable<int>)list).ForEach(act);
}
long elapsed = sw.ElapsedTicks;
Console.WriteLine("IForEachable<int> [i]\t{0} tcks\t{1}%",
elapsed, 100 * (elapsed / (float)baseline));
}
GC.Collect();
GC.WaitForPendingFinalizers();
// IForEachable2<T>
{
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < iters; i++)
{
DoNothingFunctor dnf = new DoNothingFunctor();
((IForEachable2<int>)list).ForEach(dnf);
}
long elapsed = sw.ElapsedTicks;
Console.WriteLine("IForEachable2<int>\t{0} tcks\t{1}%",
elapsed, 100 * (elapsed / (float)baseline));
}
}
[System.Runtime.CompilerServices.MethodImpl(
System.Runtime.CompilerServices.MethodImplOptions.NoInlining)]
private static void DoNothing(int x) { }
class DoNothingClosure
{
[System.Runtime.CompilerServices.MethodImpl(
System.Runtime.CompilerServices.MethodImplOptions.NoInlining)]
public void DoNothing(int x) { }
}
class DoNothingFunctor : Functor<int>
{
public override void Invoke(int x) { DoNothing(x); }
}
}
To summarize, .NET enumeration costs something over typical for loops that index straight into arrays. Most programs needn’t worry about these kinds of overheads. If you’re accessing a database, manipulating a large complicated object, or what have you, inside of the individual iterations, then the overheads we’re talking about here are minuscule. In fact, walking 1,000,000 elements is in the microsecond range for all of the benchmarks I showed, even the slowest ones. So none of this is anything to lose sleep over. But if you have a closed system that controls all of its enumeration, it may be worth doing some targeted replacement of enumerators with the more efficient patterns, particularly if you tend to enumerate lots and lots of elements lots and lots of times in your program.
 Wednesday, September 17, 2008
In part 2 of this series, I described a new work stealing queue data structure used for work item management. This structure allows us to push and pop elements into a thread-local work queue without heavy-handed synchronization. Moreover, this distributes a large amount of the scheduling responsibility across the threads (and hence processors). The result is that, for recursively queued work items, scalability is improved and pressure on the typical bottleneck in a thread pool (i.e., the global lock) is alleviated.
What we didn’t do last time was actually integrate the new queue into the thread pool that was shown in part 1. This extension is actually somewhat simple. We’ll continue to use the IThreadPool interface so that we can easily harness and benchmark the various thread pool implementations against each other.
We’ll add a new class LockAndWsqThreadPool, which mimics the design of the original SimpleLockThreadPool class. We’ll only need to add two fields to it:
- private WorkStealingQueue<WorkItem>[] m_wsQueues: This is an array of queues—one per thread in the pool—that will be used to store recursively queued work.
- [ThreadStatic] private static WorkStealingQueue<WorkItem> m_wsq: This represents the unique work stealing queue for a particular thread in the pool.
OK, so with these extensions there are clearly three specific changes we need to make:
- A new thread pool thread needs to allocate its work stealing queue.
- When queuing a new work item, we must check to see if we’re on a pool thread. If so, we will queue the work item into the work stealing queue instead of the global queue.
- When a pool thread looks for work, it needs to:
  - First consult its local work stealing queue.
  - If that fails, it then looks at the global queue.
  - Lastly, if that fails, it needs to steal from other work stealing queues.
Let’s review each one individually. Later we’ll see the full code.
#1 is handled in the DispatchLoop function:
private WorkStealingQueue<WorkItem>[] m_wsQueues =
    new WorkStealingQueue<WorkItem>[Environment.ProcessorCount];

private void DispatchLoop()
{
    // Register a new WSQ.
    WorkStealingQueue<WorkItem> wsq = new WorkStealingQueue<WorkItem>();
    m_wsq = wsq; // Store in TLS.
    AddWsq(wsq);

    try { /* a whole bunch of stuff … */ }
    finally { RemoveWsq(wsq); }
}

private void AddWsq(WorkStealingQueue<WorkItem> wsq)
{
    lock (m_wsQueues)
    {
        for (int i = 0; i < m_wsQueues.Length; i++)
        {
            if (m_wsQueues[i] == null)
            {
                m_wsQueues[i] = wsq;
                return; // Registered; stop so we don't claim every free slot.
            }
            else if (i == m_wsQueues.Length - 1)
            {
                // No free slot was found: double the array and append.
                WorkStealingQueue<WorkItem>[] queues =
                    new WorkStealingQueue<WorkItem>[m_wsQueues.Length*2];
                Array.Copy(m_wsQueues, queues, i+1);
                queues[i+1] = wsq;
                m_wsQueues = queues;
                return;
            }
        }
    }
}

private void RemoveWsq(WorkStealingQueue<WorkItem> wsq)
{
    lock (m_wsQueues)
    {
        for (int i = 0; i < m_wsQueues.Length; i++)
        {
            if (m_wsQueues[i] == wsq)
            {
                m_wsQueues[i] = null;
            }
        }
    }
}
#2, of course, happens within the QueueUserWorkItem function:
public void QueueUserWorkItem(WaitCallback work, object obj)
{
    WorkItem wi = …; /* as before … */

    // Now insert the work item into the queue, possibly waking a thread.
    WorkStealingQueue<WorkItem> wsq = m_wsq;
    if (wsq != null)
    {
        // Single TLS to determine if we're on a pool thread.
        wsq.LocalPush(wi);
        if (m_threadsWaiting > 0) // OK to read lock-free.
            lock (m_queue) { Monitor.Pulse(m_queue); }
    }
    else
    {
        /* as before… queue to the global queue */
    }
}
Lastly, #3 is the most complicated. Searching the local queue is done with a call to wsq.LocalPop. If that fails, the work stealing queue is empty, and the logic then looks a lot like the original thread pool’s dispatch loop logic in that we then look for work in the global queue. If that fails, we will just iterate over the other threads’ work stealing queues, doing a TrySteal operation. If none of them had work, we go back to the global queue, try again, and then finally wait for work to arrive. (See the full code sample below for details.) Notice that there’s a fairly tricky race condition here that we’re leaving unhandled: if we search for work, try to steal, and ultimately find no work, we will then embark on a trip back to the global queue; during this trip, another pool thread might recursively queue work into its work stealing queue and we will miss it. Generally speaking, this is OK because that thread will eventually get to it (presumably), but with some clever synchronization trickery we can actually handle this case. Perhaps I will show such a solution in a subsequent part in this series.
Anyway, what we’re left with is code that looks something like this:
using System;
using System.Security.Permissions;
using System.Threading;

public class LockAndWsqThreadPool : IThreadPool
{
    // Constructors--
    // Two things may be specified:
    //   ConcurrencyLevel == fixed # of threads to use
    //   FlowExecutionContext == whether to capture & flow ExecutionContexts for work items

    public LockAndWsqThreadPool() :
        this(Environment.ProcessorCount, true) { }
    public LockAndWsqThreadPool(int concurrencyLevel) :
        this(concurrencyLevel, true) { }
    public LockAndWsqThreadPool(bool flowExecutionContext) :
        this(Environment.ProcessorCount, flowExecutionContext) { }

    public LockAndWsqThreadPool(int concurrencyLevel, bool flowExecutionContext)
    {
        if (concurrencyLevel <= 0)
            throw new ArgumentOutOfRangeException("concurrencyLevel");

        m_concurrencyLevel = concurrencyLevel;
        m_flowExecutionContext = flowExecutionContext;

        // If suppressing flow, we need to demand permissions.
        if (!flowExecutionContext)
            new SecurityPermission(SecurityPermissionFlag.Infrastructure).Demand();
    }

    // Each work item consists of a closure: work + (optional) state obj + context.
    struct WorkItem
    {
        internal WaitCallback m_work;
        internal object m_obj;
        internal ExecutionContext m_executionContext;

        internal WorkItem(WaitCallback work, object obj)
        {
            m_work = work;
            m_obj = obj;
            m_executionContext = null;
        }

        internal void Invoke()
        {
            // Run normally (delegate invoke) or under context, as appropriate.
            if (m_executionContext == null)
                m_work(m_obj);
            else
                ExecutionContext.Run(m_executionContext, s_contextInvoke, this);
        }

        private static ContextCallback s_contextInvoke = delegate(object obj)
        {
            WorkItem wi = (WorkItem)obj;
            wi.m_work(wi.m_obj);
        };
    }

    private readonly int m_concurrencyLevel;
    private readonly bool m_flowExecutionContext;
    private readonly System.Collections.Queue m_queue = new System.Collections.Queue();
    private WorkStealingQueue<WorkItem>[] m_wsQueues =
        new WorkStealingQueue<WorkItem>[Environment.ProcessorCount];
    private Thread[] m_threads;
    private int m_threadsWaiting;
    private bool m_shutdown;

    [ThreadStatic]
    private static WorkStealingQueue<WorkItem> m_wsq;

    // Methods to queue work.

    public void QueueUserWorkItem(WaitCallback work)
    {
        QueueUserWorkItem(work, null);
    }

    public void QueueUserWorkItem(WaitCallback work, object obj)
    {
        WorkItem wi = new WorkItem(work, obj);

        // If execution context flowing is on, capture the caller's context.
        if (m_flowExecutionContext)
            wi.m_executionContext = ExecutionContext.Capture();

        // Make sure the pool is started (threads created, etc).
        EnsureStarted();

        // Now insert the work item into the queue, possibly waking a thread.
        WorkStealingQueue<WorkItem> wsq = m_wsq;
        if (wsq != null)
        {
            // Single TLS to determine if we're on a pool thread.
            wsq.LocalPush(wi);
            if (m_threadsWaiting > 0) // OK to read lock-free.
                lock (m_queue) { Monitor.Pulse(m_queue); }
        }
        else
        {
            lock (m_queue)
            {
                m_queue.Enqueue(wi);
                if (m_threadsWaiting > 0)
                    Monitor.Pulse(m_queue);
            }
        }
    }

    // Ensures that threads have begun executing.
    private void EnsureStarted()
    {
        if (m_threads == null)
        {
            lock (m_queue)
            {
                if (m_threads == null)
                {
                    m_threads = new Thread[m_concurrencyLevel];
                    for (int i = 0; i < m_threads.Length; i++)
                    {
                        m_threads[i] = new Thread(DispatchLoop);
                        m_threads[i].Start();
                    }
                }
            }
        }
    }

    private void AddWsq(WorkStealingQueue<WorkItem> wsq)
    {
        lock (m_wsQueues)
        {
            for (int i = 0; i < m_wsQueues.Length; i++)
            {
                if (m_wsQueues[i] == null)
                {
                    m_wsQueues[i] = wsq;
                    return; // Registered; stop so we don't claim every free slot.
                }
                else if (i == m_wsQueues.Length - 1)
                {
                    // No free slot was found: double the array and append.
                    WorkStealingQueue<WorkItem>[] queues =
                        new WorkStealingQueue<WorkItem>[m_wsQueues.Length*2];
                    Array.Copy(m_wsQueues, queues, i+1);
                    queues[i+1] = wsq;
                    m_wsQueues = queues;
                    return;
                }
            }
        }
    }

    private void RemoveWsq(WorkStealingQueue<WorkItem> wsq)
    {
        lock (m_wsQueues)
        {
            for (int i = 0; i < m_wsQueues.Length; i++)
            {
                if (m_wsQueues[i] == wsq)
                {
                    m_wsQueues[i] = null;
                }
            }
        }
    }

    // Each thread runs the dispatch loop.
    private void DispatchLoop()
    {
        // Register a new WSQ.
        WorkStealingQueue<WorkItem> wsq = new WorkStealingQueue<WorkItem>();
        m_wsq = wsq; // Store in TLS.
        AddWsq(wsq);

        try
        {
            while (true)
            {
                WorkItem wi = default(WorkItem);

                // Search order: (1) local WSQ, (2) global Q, (3) steals.
                if (!wsq.LocalPop(ref wi))
                {
                    bool searchedForSteals = false;
                    while (true)
                    {
                        lock (m_queue)
                        {
                            // If shutdown was requested, exit the thread.
                            if (m_shutdown)
                                return;

                            // (2) try the global queue.
                            if (m_queue.Count != 0)
                            {
                                // We found a work item! Grab it ...
                                wi = (WorkItem)m_queue.Dequeue();
                                break;
                            }
                            else if (searchedForSteals)
                            {
                                m_threadsWaiting++;
                                try { Monitor.Wait(m_queue); }
                                finally { m_threadsWaiting--; }

                                // If we were signaled due to shutdown, exit the thread.
                                if (m_shutdown)
                                    return;

                                searchedForSteals = false;
                                continue;
                            }
                        }

                        // (3) try to steal.
                        WorkStealingQueue<WorkItem>[] wsQueues = m_wsQueues;
                        int i;
                        for (i = 0; i < wsQueues.Length; i++)
                        {
                            // Skip unused (null) slots and our own queue.
                            WorkStealingQueue<WorkItem> victim = wsQueues[i];
                            if (victim != null && victim != wsq && victim.TrySteal(ref wi))
                                break;
                        }

                        if (i != wsQueues.Length)
                            break;

                        searchedForSteals = true;
                    }
                }

                // ...and Invoke it. Note: exceptions will go unhandled (and crash).
                wi.Invoke();
            }
        }
        finally
        {
            RemoveWsq(wsq);
        }
    }

    // Disposing will signal shutdown, and then wait for all threads to finish.
    public void Dispose()
    {
        m_shutdown = true;
        // m_threads stays null if no work was ever queued, so check it before joining.
        if (m_threads != null)
        {
            lock (m_queue)
            {
                Monitor.PulseAll(m_queue);
            }

            for (int i = 0; i < m_threads.Length; i++)
                m_threads[i].Join();
        }
    }
}
I have a little harness that measures the throughput of the different thread pool implementations for varying degrees of recursively queued work. I’ll share this out too in a subsequent part in this series, once we have a few more variants to pit against each other. Anyway, as you’d imagine, there is very little difference between LockAndWsqThreadPool and SimpleLockThreadPool when all work is queued from external (non-pool) threads. However, when I queue 10,000 items externally and, from each of those, queue 100 items recursively, I see a 3X throughput improvement on my four core machine. When I queue 100 items externally and, from each of those, queue 10,000 items recursively, the improvement is more than 8X. And so on. As the number of cores increases, the improvement only becomes greater.
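Until that harness is posted, here is a rough sketch of the shape such a measurement might take. It is not the harness described above: it targets the built-in System.Threading.ThreadPool as a stand-in (so that it is self-contained), and the names RecursiveQueueHarness and Run are made up for illustration. It queues a number of items from the calling thread, has each of them queue more items from inside the pool, and times how long it takes for all of them to finish.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical sketch: measures recursively queued work on the built-in pool.
static class RecursiveQueueHarness
{
    // Queues `outer` items from the calling (non-pool) thread; each of those
    // queues `inner` further items from a pool thread. Returns how many items ran.
    public static int Run(int outer, int inner, out TimeSpan elapsed)
    {
        int executed = 0;
        int total = outer + outer * inner;

        using (CountdownEvent done = new CountdownEvent(total))
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < outer; i++)
            {
                ThreadPool.QueueUserWorkItem(delegate
                {
                    // Recursive queuing: this runs on a pool thread.
                    for (int j = 0; j < inner; j++)
                    {
                        ThreadPool.QueueUserWorkItem(delegate
                        {
                            Interlocked.Increment(ref executed);
                            done.Signal();
                        });
                    }
                    Interlocked.Increment(ref executed);
                    done.Signal();
                });
            }

            done.Wait(); // Blocks until every outer and inner item has signaled.
            elapsed = sw.Elapsed;
        }

        return executed;
    }
}
```

Swapping the two ThreadPool.QueueUserWorkItem calls for calls on an IThreadPool instance would, presumably, let the same loop compare the pools from this series.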
Another aspect not shown—because of the very limited QueueUserWorkItem-style API we’re building on—is something called “wait inlining.” We do this in TPL. When you recursively queue work items in a divide-and-conquer kind of problem, there’s often more latent parallelism than will be realized. Instead of requiring all of that parallelism to consume a thread, and blocking each time a work item is waited on, we can run work items inline if they haven’t started yet.
One easy way to do this is to limit inlining to only threads that do so from their own local work stealing queue. Because we are guaranteed the local pop/push methods won’t interleave with such inlines, we can just acquire the stealing lock and search the list for the particular element, e.g.:
// Assumes T is constrained to a reference type (where T : class),
// since the method compares against obj and stores null.
public bool Remove(T obj)
{
    for (int i = m_tailIndex - 1; i > m_headIndex; i--)
    {
        if (m_array[i & m_mask] == obj)
        {
            lock (m_foreignLock)
            {
                if (m_array[i & m_mask] != obj)
                    return false; // lost a race.

                // Adjust indices or leave a null in our wake.
                if (i == m_tailIndex - 1)
                    m_tailIndex--;
                else if (i == m_headIndex + 1)
                    m_headIndex++;
                else
                    m_array[i & m_mask] = null;

                return true;
            }
        }
    }

    // Searched the whole queue without finding the element.
    return false;
}
This is just a new method on the WorkStealingQueue<T> data structure. It requires that the local and foreign pop methods now check for null values and restart the relevant operation should one be found. The reason is that if the work item to be removed is not the head or tail item, we cannot prevent subsequent removals from seeing its slot (i.e., the indices must remain the same).
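The inlining decision itself, separate from how the item is pulled out of the queue, can be sketched with a simple claim flag. This is not how TPL implements it, and it is a different technique from the Remove method above: the InlineableWork class and its members are hypothetical names for illustration. Whichever thread claims the item first runs it; a waiter that wins the claim runs the work inline instead of blocking.

```csharp
using System;
using System.Threading;

// Hypothetical sketch of "wait inlining" using a claim flag.
sealed class InlineableWork
{
    private readonly Action m_body;
    private int m_claimed; // 0 = not started, 1 = claimed by some thread
    private readonly ManualResetEventSlim m_done = new ManualResetEventSlim(false);

    public InlineableWork(Action body) { m_body = body; }

    // Called by a pool thread that dequeued the item, or by a waiter.
    // Returns false if another thread already claimed the item.
    public bool TryRun()
    {
        if (Interlocked.CompareExchange(ref m_claimed, 1, 0) != 0)
            return false;

        try { m_body(); }
        finally { m_done.Set(); }
        return true;
    }

    // If the work hasn't started yet, run it right here; otherwise block.
    public void Wait()
    {
        if (!TryRun())
            m_done.Wait();
    }
}
```

With this shape, a pool thread that later dequeues an already-inlined item simply sees TryRun return false and moves on, which plays a role similar to the null check described above.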
Next time, in part 4 of this series, we’ll take a look at what it takes to share threads among multiple instances of the LockAndWsqThreadPool class. This allows many pools to be created within a single AppDomain without requiring entirely separate sets of threads to service each one of them. This capability enables you to isolate different work queues from one another, to ensure that certain components aren’t starved by other (potentially misbehaving) ones.
 Saturday, September 13, 2008
Most programs are tangled webs of data and control dependencies. For sequential programs, this doesn’t matter much aside from putting constraints on the legal optimizations available to a compiler. But it gets worse. Imperative programs today are also full of side-effect dependencies. Unlike data and control dependence—which most compilers can identify and understand the semantics of (aliasing aside)—side-effect dependencies are hidden and the semantic meaning of them is entirely ad-hoc. These can include scribbling to shared memory, writing to the disk, or printing to the console.
One of my goals is to push programming languages in the direction of full disclosure of all kinds of dependencies. I believe this will eventually help to foster ubiquitous parallelism. These dependencies, after all, are what inherently limit the latent parallelism in a program and are “real” in the sense that they are typically algorithmic. I would prefer that developers think about how to modify or rewrite their algorithm to eliminate any unnecessary dependencies, and also to be clever about eliminating necessary ones, rather than trying to navigate a minefield of dependencies that are implicit, undocumented, and often hard to understand. Our tools should be oriented towards aiding such endeavors.
That’s not to say that knowing about dependencies will immediately make all programs parallel programs. Research in automatic parallelism for purely functional languages like Haskell has shown that this is a naïve point of view. My belief is that this is a key step along the path, however. With it, new models and patterns can emerge that reduce dependencies so that parallelism can be introduced without accidentally violating subtle and hidden dependencies, causing races.
The biggest question left unanswered in my mind is the role state will play in software of the future.
That seems like an absurd statement, or a naïve one at the very least. State is everywhere:
- The values held in memory.
- Data locally on disk.
- Data in-flight that is being sent over a network.
- Data stored in the cloud, including on a database, remote filesystem, etc.
Certainly all of these kinds of state will continue to exist far into the future. Data is king, and is one major factor that will drive the shift to parallel computing. The question then is how will concurrent programs interact with this state, read and mutate it, and what isolation and synchronization mechanisms are necessary to do so?
I’ve been working on or around software transactional memory (STM) for over 3 years now. Many think it’s a panacea, and it has been held up as somewhat of a “last hope for mankind” kind of technology. As with anything, it’s best to temper the enthusiasm with some realism. Things are never so simple. STM will be one tool (of many) in the toolkit of programmers writing the next generation of concurrent code. In fact, I have over time come to believe that it’s one of the least radical ones that we need. This is probably bad news, given the vast number of difficulties the community has uncovered in our attempts to efficiently and correctly implement an STM system.
Many programs have ample gratuitous dependencies, simply because of the habits we’ve grown accustomed to over 30 odd years of imperative programming. Our education, mental models, books, best-of-breed algorithms, libraries, and languages all push us in this direction. We like to scribble intermediary state into shared variables because it’s simple to do so and because it maps to our von Neumann model of how the computer works.
We need to get rid of these gratuitous dependencies. Merely papering over them with a transaction—making them “safe”—doesn’t do anything to improve the natural parallelism that a program contains. It just ensures it doesn’t crash. Sure, that’s plenty important, but providing programming models and patterns to eliminate the gratuitous dependencies also achieves the goal of not crashing but with the added benefit of actually improving scalability too. Transactions have worked so well in enabling automatic parallelism in databases because the basic model itself (without transactions) already implies natural isolation among queries. Transactions break down and scalability suffers for programs that aren’t architected in this way. We should learn from the experience of the database community in this regard.
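To make the distinction concrete, here is a small sketch. It uses a lock as a stand-in for a transaction (an assumption made purely for illustration, since no STM API is shown here), and the SumOfSquares class and its method names are made up. The first method keeps the gratuitous dependency and merely makes it safe; the second removes it.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: the same computation with and without a gratuitous dependency.
static class SumOfSquares
{
    // Every iteration scribbles into one shared variable. The lock keeps the
    // result correct, but it also serializes the updates.
    public static long WithSharedAccumulator(int[] data)
    {
        long total = 0;
        object gate = new object();
        Parallel.For(0, data.Length, i =>
        {
            long square = (long)data[i] * data[i];
            lock (gate) { total += square; }
        });
        return total;
    }

    // Each worker accumulates into its own local value; the partial sums are
    // combined only once per worker, when that worker finishes.
    public static long WithLocalAccumulators(int[] data)
    {
        long total = 0;
        Parallel.For<long>(0, data.Length,
            () => 0L,
            (i, loopState, local) => local + (long)data[i] * data[i],
            local => Interlocked.Add(ref total, local));
        return total;
    }
}
```

Both methods return the same answer. Only the second one lets the iterations proceed without contending on shared state.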
There is a kind of natural taxonomy for the structure of concurrent programs:
A. Agents, where isolation is king and interactions are loosely coupled. This is classically referred to as “message passing”, but there are many different reifications of this idea that expose the idea of messages differently: actors (e.g., as in Scheme circa 1980’s), active objects, Ada tasks, Erlang processes, web services, and so on.
B. Task parallelism, where logically independent activities (from a dependence point of view) may be run concurrently. This can range from coarse- to fine-grained, but is normally fixed in number.
C. Data parallelism, in which data drives the coarseness of concurrency.
There is also a natural taxonomy for the way concurrent programs manipulate state:
1. At a coarse-grained level, any changes to state are committed via transactions.
2. At a fine-grained level, all computations are purely functional and without side-effects.
You’ll notice a nice correlation between { (A) & (1) }, and { (B), (C), & (2) }.
And you’ll also notice that I explicitly didn’t mention mutable shared state at all, except for implying mutations would only occur at a coarse granularity and with transactions. This is an oversimplification. Even within the fine-grained computations, guaranteed isolation can allow computations to allocate new state and manipulate it in a myriad of ways. The key here is that the state must be guaranteed to be isolated, and that within such pockets of guaranteed isolation familiar imperative programming can be used. This spans graphs of structured task and data parallelism.
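A small sketch of what such a pocket might look like in today's C#, where the isolation is only by convention rather than guaranteed by the type system. The IsolatedPockets class and its methods are hypothetical. CountPrimesBelow is pure from the outside, yet freely mutates an array that it allocates itself and never lets escape; that is what makes it safe to run many copies in parallel.

```csharp
using System;
using System.Linq;

// Hypothetical sketch: imperative code inside an isolated pocket.
static class IsolatedPockets
{
    // Externally pure: the result depends only on n and no outside state is touched.
    // Internally imperative: the sieve array is allocated here and never escapes.
    public static int CountPrimesBelow(int n)
    {
        if (n < 3) return 0;

        bool[] composite = new bool[n];
        int count = 0;
        for (int i = 2; i < n; i++)
        {
            if (composite[i]) continue;
            count++;
            for (long j = (long)i * i; j < n; j += i)
                composite[(int)j] = true;
        }
        return count;
    }

    // Each call mutates only its own private state, so the calls can run
    // concurrently without any synchronization.
    public static int[] CountMany(int[] limits)
    {
        return limits.AsParallel().AsOrdered()
                     .Select(limit => CountPrimesBelow(limit))
                     .ToArray();
    }
}
```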
Even this is an oversimplification, but as a broadly appealing programming model I think it is what we ought to strive for. There will always be hidden mutation of shared state inside lower level system components. These are often called “benevolent side-effects,” thanks to Hoare, and apply to things like lazy initialization and memoization caches. These will be done by concurrency ninjas who understand locks. And their effects will be isolated by convention.
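As an illustration of a benevolent side-effect, here is a minimal memoization cache. The Memoized class is a hypothetical name; the sketch assumes the wrapped function is pure, which is what makes the hidden mutation harmless. Callers see something that behaves like a plain function, while the shared cache is mutated behind the scenes.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical sketch: a cache whose mutation is invisible to callers.
sealed class Memoized<TArg, TResult>
{
    private readonly Func<TArg, TResult> m_func;
    private readonly ConcurrentDictionary<TArg, TResult> m_cache =
        new ConcurrentDictionary<TArg, TResult>();

    // Assumes func is pure: same argument, same result, no other effects.
    public Memoized(Func<TArg, TResult> func) { m_func = func; }

    public TResult Invoke(TArg arg)
    {
        // Under a race, the function may run more than once for the same
        // argument. That is harmless here only because it is assumed pure.
        return m_cache.GetOrAdd(arg, m_func);
    }
}
```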
Any true effects that must escape a pocket of isolation then get communicated transactionally to others.
Efforts in Haskell have led to similar conclusions. Monads, of course, are the way to get side-effects into a purely functional language like Haskell: http://portal.acm.org/citation.cfm?id=262011. The state monad allows one to manipulate state lazily via a monad, in a semi-imperative way, and a paper called “Lazy functional state threads” by Launchbury and Peyton Jones shows how to combine the state monad with threading to enable a model very similar to the one I describe: http://portal.acm.org/citation.cfm?id=178243.178246. Combine this with STM and we’re getting somewhere: http://portal.acm.org/citation.cfm?id=1065944.1065952. Sadly, I do think Haskell’s syntax is too mathematical for most and that we need a fair bit of sugar on top of the raw use of monads and combining stateful effects. But as an underlying model of computation I think the kernel of the idea is just right.
I admit that I’m a little sad that F# has taken an impure-by-default stance. Given the roots in ML and O’Caml, and the more pragmatic goals of the language, this stance isn’t a surprise. And a bunch of people will be wildly successful and happy using it as-is. F# is, however, Microsoft’s first attempt to hoist functional programming unto our professional development community, and pure-by-default is actually a fairly innocuous (but subtly crucial) position to take. (Except for those damn impure libraries.) I fear we may be missing our “once in every 5 years” chance to do the right thing. But I guess we don’t quite know for sure what the right thing is just yet; we simply didn’t take the leap of faith.
Even with all of this support, we’d be left with an ecosystem of libraries like the .NET Framework itself which have been built atop a fundamentally mutable and imperative system. The path forward here is less clear to me, although having the ability to retain a mutable model within pockets of guaranteed isolation certainly makes me think the libraries are salvageable. Thankfully, the shift will likely be very gradual, and the pieces that pose substantial problems can be rewritten in place incrementally over time. But we need the fundamental language and type system support first.
 Tuesday, August 12, 2008
Miguel de Icaza recently blogged about the addition of Parallel Extensions to the Mono family.
C# 3.0 and Parallel FX/LINQ in Mono
"For a while I wanted to blog about the open source implementation of the Parallel Extensions for Mono that Jeremie Laval has been working on. Jeremie is one of our mentored students in the 2008 Google Summer of Code. Dual CPU laptops are becoming the norm; quad-core computers are now very affordable, and eight CPU machines are routinely purchased as developer workstations. The Parallel Extension API makes it easy to prepare your software to run on multi processor machines by providing constructs that take care of distributing the work to various CPUs based on the computer load and the number of processors available."
 Monday, August 11, 2008
The primary reason a traditional thread pool doesn’t scale is that there’s a single work queue protected by a global lock. For obvious reasons, this can easily become a bottleneck. Two primary things contribute heavily to whether the global lock becomes a limiting factor for a particular workload’s throughput:
- As the size of work items becomes smaller, the frequency at which the pool’s threads must acquire the global lock increases. Moving forward, we expect the granularity of latent parallelism to become smaller such that programs can scale as more processors are added.
- As more processors are added, the arrival rate at the lock will increase when compared to the same workload run with fewer processors. This inherently limits the ability to “get more work through” that single straw that is the global queue.
For coarse-grained work items, and for small numbers of processors, these problems simply aren’t too great. That has been the CLR ThreadPool’s forte for quite some time; most work items range in the 1,000s to 10,000s (or more) of CPU cycles, and 8 processors was considered pushing the limits. Clearly the direction the whole industry is headed in exposes these fundamental flaws very quickly. We’d like to enable work items with 100s and 1,000s of cycles and must scale well beyond 4, 8, 16, 32, 64, ... processors.
Decentralized scheduling techniques can be used to combat this problem. In other words, if we give different components their own work queues, we can eliminate the central bottleneck. This approach works to a degree but becomes complicated very quickly because clearly we don’t want each such queue to have its own pool of dedicated threads. So we’d need some way of multiplexing a very dynamic and comparatively large number of work pools onto a mostly-fixed and comparatively small number of OS threads.
Introducing work stealing
Another technique – and the main subject of this blog post – is to use a so-called work stealing queue (WSQ). A WSQ is a special kind of queue in that it has two ends, and allows lock-free pushes and pops from one end (“private”), but requires synchronization from the other end (“public”). When the queue is sufficiently small that private and public operations could conflict, synchronization is necessary. It is array-based and can grow dynamically. This data structure was made famous in the 90’s when much work on dynamic work scheduling was done in the research community.
In the context of a thread pool, the WSQ can augment the traditional global queue to enable more efficient private queuing and dequeuing. It works roughly as follows:
- We still have a global queue protected by a global lock.
- (We can of course consider the ability to have separate pools to reduce pressure on this.)
- Each thread in the pool has its own private WSQ.
- When work is queued from a pool thread, the work goes into the WSQ, avoiding all locking.
- When work is queued from a non-pool thread, it goes into the global queue.
- When threads are looking for work, they can have a preferred search order:
- Check the local WSQ. Work here can be dequeued without locks.
- Check the global queue. Work here must be dequeued using locks.
- Check other threads’ WSQs. This is called “stealing”, and requires locks.
If you haven’t guessed, this is by-and-large how the Task Parallel Library (TPL) schedules work.
For workloads that recursively queue a lot of work, the use of a per-thread WSQ substantially reduces the synchronization necessary to complete the work, leading to far better throughput. There are also fewer cache effects due to sharing of the global queue information. “Stealing” is our last course of action in the abovementioned search logic, because it has the secondary effect of causing another thread to have to visit the global queue (or steal) sooner. In some sense, it is double the cost of merely getting an item from the global queue.
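To make the search order concrete, here is a minimal sketch. The names Worker, PushLocal and TryFindWork are hypothetical, and a LinkedList guarded by a plain lock stands in for the WSQ, so this shows only the order in which the three sources are consulted, not the lock-free private path.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: each worker owns a private deque and shares one global queue.
// Plain locks are used everywhere for brevity; a real WSQ avoids them on the private end.
public sealed class Worker
{
    private readonly LinkedList<Action> m_local = new LinkedList<Action>(); // stand-in for a WSQ
    private readonly Queue<Action> m_global;  // shared by all workers
    private readonly List<Worker> m_all;      // every worker in the pool, including this one

    public Worker(Queue<Action> global, List<Worker> all)
    {
        m_global = global;
        m_all = all;
        all.Add(this);
    }

    public void PushLocal(Action work)
    {
        lock (m_local) m_local.AddLast(work);
    }

    // Returns a tag naming where the work came from, or null if none was found.
    public string TryFindWork(out Action work)
    {
        // 1. Local queue first: take the newest item (LIFO).
        lock (m_local)
        {
            if (m_local.Count > 0)
            {
                work = m_local.Last.Value;
                m_local.RemoveLast();
                return "local";
            }
        }
        // 2. Then the global queue.
        lock (m_global)
        {
            if (m_global.Count > 0)
            {
                work = m_global.Dequeue();
                return "global";
            }
        }
        // 3. Finally, steal the oldest item (FIFO) from some other worker.
        foreach (Worker victim in m_all)
        {
            if (ReferenceEquals(victim, this)) continue;
            lock (victim.m_local)
            {
                if (victim.m_local.Count > 0)
                {
                    work = victim.m_local.First.Value;
                    victim.m_local.RemoveFirst();
                    return "steal";
                }
            }
        }
        work = null;
        return null;
    }
}
```

With one item in each of the three places, successive calls to TryFindWork on the same worker report "local", then "global", then "steal", and finally null.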
Another (subtle) aspect of WSQs is that they are LIFO for private operations and FIFO for steals. This is inherent in how the WSQ’s synchronization works (and is key to enabling lock-freedom), but has additional rationale:
- By executing the work most recently pushed into the queue in LIFO order, chances are that memory associated with it will still be hot in the cache.
- By stealing in FIFO order, chances are that a larger “chunk” of work will be stolen (possibly reducing the chance of needing additional steals). The reason for this is that many work stealing workloads are divide-and-conquer in nature; in such cases, the recursion forms a tree, and the oldest items in the queue lie closer to the root; hence, stealing one of those implicitly also steals a (potentially) large subtree of computations that will unfold once that piece of work is stolen and run.
This decision clearly changes the regular order of execution when compared to a mostly-FIFO system, and is the reason we’re contemplating exposing options to control this behavior from TPL.
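The ordering difference is easy to see with a single-threaded toy. This sketch assumes a LinkedList as the deque and a hypothetical OrderDemo.Run helper; it pushes 1..n at the private end and then alternates a steal with a local pop.

```csharp
using System;
using System.Collections.Generic;

public static class OrderDemo
{
    // Pushes 1..n at the private end, then alternates a steal and a local pop.
    public static List<string> Run(int n)
    {
        var deque = new LinkedList<int>();
        for (int i = 1; i <= n; i++)
            deque.AddLast(i);                        // private pushes go to the tail

        var log = new List<string>();
        while (deque.Count > 0)
        {
            log.Add("steal " + deque.First.Value);   // steals take the oldest item (FIFO)
            deque.RemoveFirst();
            if (deque.Count > 0)
            {
                log.Add("pop " + deque.Last.Value);  // local pops take the newest item (LIFO)
                deque.RemoveLast();
            }
        }
        return log;
    }
}
```

For n = 4 the log reads: steal 1, pop 4, steal 2, pop 3. In a divide-and-conquer workload, item 1 would be the one closest to the root of the recursion tree.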
A simple WorkStealingQueue<T> type
With all that background behind us, let’s jump straight into a really simple implementation of a work stealing queue written in C#.
public class WorkStealingQueue<T>
{
The queue is array-based, and we keep two indexes: a head and a tail. The tail represents the private end and the head represents the public end. We also maintain a mask that is always equal to the size of the array minus one, helping with some of the bounds-checking arithmetic and handling automatic wraparound for indexing into the array. Because of the way we use the mask (we will assume all legal bits for indexing into the array are on), the array length must always be a power of two. We arbitrarily select the number 32 as the queue’s initial (power of two) size.
private const int INITIAL_SIZE = 32;
private T[] m_array = new T[INITIAL_SIZE];
private int m_mask = INITIAL_SIZE - 1;
private volatile int m_headIndex = 0;
private volatile int m_tailIndex = 0;
We also need a lock to protect the operations that require synchronization.
private object m_foreignLock = new object();
Although they aren’t exercised very much in the code, we have some helper properties. The queue is empty when the head is equal to or greater than the tail, and the count can be computed by subtracting the head from the tail. Because these fields never wrap (because we use the mask), this is correct.
public bool IsEmpty
{
get { return m_headIndex >= m_tailIndex; }
}
public int Count
{
get { return m_tailIndex - m_headIndex; }
}
OK, let’s get into the meat of the implementation. Pushing is the obvious place to start, and, for obvious reasons, we only support private pushes. Public pushes are useless given the protocol explained above, i.e., the only public operation we will support is stealing. Keep in mind when reading this code that m_tailIndex and m_headIndex are both volatile variables.
public void LocalPush(T obj)
{
int tail = m_tailIndex;
First we must check whether there is room in the queue. To do so, we just see if m_tailIndex is less than the sum of m_mask (the size of the list minus one) and m_headIndex. False negatives are OK, and are certainly possible because a concurrent steal may come along and take an element, making room, immediately after the check. We will handle this by synchronizing in a moment.
if (tail < m_headIndex + m_mask)
{
If there is indeed room, we can merely stick the object into the array (masking m_tailIndex with m_mask to ensure we’re within the legal range) and then increment m_tailIndex by one. This may look unsafe, but it is in fact safe: writes retire in order in .NET’s memory model, and we know no other thread is changing m_tailIndex (only private operations write to it) and that no thread will try to access the current array slot into which we’re storing the element.
m_array[tail & m_mask] = obj;
m_tailIndex = tail + 1;
}
Otherwise, we need to head down the slow path which involves resizing.
else
{
We will take the lock and check that we still need to make room.
lock (m_foreignLock)
{
int head = m_headIndex;
int count = m_tailIndex - m_headIndex;
if (count >= m_mask)
{
Assuming we need to make more room, we will just double the size of the array, copy elements, fix up the fields, and move on. Remember that the array length is always a power of two, so we can get the next power of two by simply bitshifting to the left by one. We do that for the mask too, but need to remember to “turn on” the least significant bit by oring one into the mask.
T[] newArray = new T[m_array.Length << 1];
for (int i = 0; i < m_array.Length; i++)
newArray[i] = m_array[(i + head) & m_mask];
m_array = newArray;
m_headIndex = 0;
m_tailIndex = tail = count;
m_mask = (m_mask << 1) | 1;
After we’re done resizing, the m_headIndex is reset to 0, and the m_tailIndex is the previous size of the queue. We can then store into the queue in the same way we would have earlier.
}
m_array[tail & m_mask] = obj;
m_tailIndex = tail + 1;
}
}
}
And that’s that: we’ve added an item into the queue with a local push. Now let’s look at the reverse: removing an element with a local pop. Remember, it’s impossible for a local push and pop to interleave with one another because they must be executed by the same thread serially.
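The masking and resize arithmetic in LocalPush can be exercised in isolation. The following is a sketch only: RingMath, Slot and Grow are hypothetical helpers, not part of the queue, and they mirror the copy loop and mask update shown above.

```csharp
using System;

public static class RingMath
{
    // Maps an ever-growing logical index onto a power-of-two sized array.
    public static int Slot(int index, int mask)
    {
        return index & mask;
    }

    // Copies into an array of twice the size, starting at the logical head,
    // so that the head of the new array is index 0.
    public static T[] Grow<T>(T[] oldArray, int head, out int newMask)
    {
        int oldMask = oldArray.Length - 1;           // length is assumed to be a power of two
        T[] newArray = new T[oldArray.Length << 1];  // next power of two
        for (int i = 0; i < oldArray.Length; i++)
            newArray[i] = oldArray[(i + head) & oldMask];
        newMask = (oldMask << 1) | 1;                // shift left, then turn the low bit back on
        return newArray;
    }
}
```

With a mask of 31, logical index 33 lands in slot 1. Growing the 4-element array { 10, 11, 12, 13 } with a head of 2 gives an 8-element array beginning 12, 13, 10, 11 and a new mask of 7.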
public bool LocalPop(ref T obj)
{
First we read the current value of m_tailIndex. If the queue is currently empty, i.e., m_headIndex >= m_tailIndex, then we just return false right away. This is how “emptiness” is conveyed to callers.
int tail = m_tailIndex;
if (m_headIndex >= tail)
return false;
Next we disable an annoying C# compiler warning (CS0420, raised when a volatile field is passed by reference).
#pragma warning disable 0420
Now we have determined there is at least one element in the queue (or was during our previous check). We will now subtract one from the tail, which effectively removes the element. There is still a chance that we will “lose” in a race with another thread doing a steal, so we’ll need to be very careful. In fact, there is a subtle .NET memory model gotcha to be aware of: we must guarantee our write to take the element does not get trapped in the write buffer beyond a subsequent read of the m_headIndex. If that could happen, we might mistakenly think we took the element, while at the same time a stealing thread thought it took the same element! The result would be that the same item will be dequeued by two threads which could lead to disaster. In a thread pool, it’d amount to the same work item being run twice. To ensure this reordering can’t happen, we must use an XCHG to perform the write to m_tailIndex.
tail -= 1;
Interlocked.Exchange(ref m_tailIndex, tail);
We detect whether we lost the race by checking to see if our dequeuing of the element has made the queue empty. If it hasn’t, we can just read the array element in the new m_tailIndex position and return it.
if (m_headIndex <= tail)
{
obj = m_array[tail & m_mask];
return true;
}
else
{
Otherwise, we take the lock and see what to do. This blocks out all steals. Either we will find that there indeed is an element remaining, and we can just return it as we would have done above, or we must “put the element back” by just incrementing the m_tailIndex. If we have to back out our modification, we just return false to indicate that the queue has become empty. We know we aren’t racing with it becoming non-empty because only private pushes are supported.
lock (m_foreignLock)
{
if (m_headIndex <= tail)
{
// Element still available. Take it.
obj = m_array[tail & m_mask];
return true;
}
else
{
// We lost the race, element was stolen, restore the tail.
m_tailIndex = tail + 1;
return false;
}
}
}
}
Lastly, let’s take a look at the public pop capability. We allow a timeout to be supplied, because it’s often useful during the stealing logic to use a 0-timeout on the first pass through all the WSQs. This can help to eliminate lock wait times and more evenly distribute contention across the list of WSQs.
private bool TrySteal(ref T obj, int millisecondsTimeout)
{
First we acquire the WSQ’s lock, ensuring mutual exclusion among all other concurrent steals, resize operations, and local pops that may make the queue empty.
bool taken = false;
try
{
taken = Monitor.TryEnter(m_foreignLock, millisecondsTimeout);
if (taken)
{
Once inside the lock, we must increment m_headIndex by one. This moves the head towards the tail, and has the effect of taking an element. Now this part gets quite tricky. We must ensure that we don’t remove the last element when racing with a local pop that went down its fast path (i.e., it didn’t acquire the lock). Given two threads racing to take an element—a steal and a local pop—we must ensure precisely one of them “wins”. Having both succeed will lead to the same element being popped twice, and having neither succeed could lead to reporting back an empty queue when in fact an element exists.
To do that, we will write to the m_headIndex variable to tentatively take the element, and must then read the m_tailIndex right afterward to ensure that the queue is still non-empty. As with the pop logic earlier, we need to use an XCHG operation to write the m_headIndex field, otherwise we will potentially suffer from a similar legal memory reordering bug.
int head = m_headIndex;
Interlocked.Exchange(ref m_headIndex, head + 1);
If the queue is non-empty, we just read the element as we usually do: by indexing into the array with the previous m_headIndex value (still held in the local head) using the proper masking. We then return true to indicate an element was found.
if (head < m_tailIndex)
{
obj = m_array[head & m_mask];
return true;
}
Otherwise, the queue is empty and we must return. Clearly this is racy and by the time we return the queue may be non-empty. If the pool will subsequently wait for work to arrive, this must be taken into consideration so as not to incur lost wake-ups.
else
{
m_headIndex = head;
return false;
}
}
}
We of course need to release the lock at the end of it all.
finally
{
if (taken)
Monitor.Exit(m_foreignLock);
}
return false;
}
}
And that’s it! As with most lock-free algorithms, the core idea is surprisingly simple but deceptively subtle and intricate. After seeing it written out and explained in detail, I hope that you’ll have that “Ah hah!” moment that always happens after staring at this kind of code for a little while. In future posts, we’ll take a closer look at the performance differences between this and a traditional globally synchronized queue, and discuss what it takes to merge the two ideas implementation-wise.
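The timeout parameter on the steal operation suggests a two-pass search over the victims. Here is a hedged sketch of that idea. LockedQueue<T> and Stealer.TryStealFromAny are hypothetical stand-ins, and the toy queue is fully lock-based rather than the WorkStealingQueue<T> above; the point is only the 0-timeout first pass followed by a more patient second pass.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in for a queue that exposes a timed steal.
public sealed class LockedQueue<T>
{
    private readonly Queue<T> m_items = new Queue<T>();
    private readonly object m_lock = new object();

    public void Add(T item)
    {
        lock (m_lock) m_items.Enqueue(item);
    }

    public bool TrySteal(ref T obj, int millisecondsTimeout)
    {
        bool taken = false;
        try
        {
            // Give up if the lock cannot be acquired within the timeout.
            taken = Monitor.TryEnter(m_lock, millisecondsTimeout);
            if (taken && m_items.Count > 0)
            {
                obj = m_items.Dequeue();
                return true;
            }
            return false;
        }
        finally
        {
            if (taken) Monitor.Exit(m_lock);
        }
    }
}

public static class Stealer
{
    // Pass 1 uses a 0 timeout so contended victims are skipped without waiting;
    // pass 2 retries every victim with the caller's (longer) timeout.
    public static bool TryStealFromAny<T>(IList<LockedQueue<T>> victims, ref T obj, int patientTimeoutMs)
    {
        foreach (int timeout in new[] { 0, patientTimeoutMs })
        {
            for (int i = 0; i < victims.Count; i++)
            {
                if (victims[i].TrySteal(ref obj, timeout))
                    return true;
            }
        }
        return false;
    }
}
```

A scheduler would typically start the scan at a different victim on each thread to spread contention; that refinement is left out here.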
Appendix
For reference, here’s the full code without all the explanation intertwined:
using System;
using System.Threading;

public class WorkStealingQueue<T>
{
    private const int INITIAL_SIZE = 32;
    private T[] m_array = new T[INITIAL_SIZE];
    private int m_mask = INITIAL_SIZE - 1;
    private volatile int m_headIndex = 0;
    private volatile int m_tailIndex = 0;
    private object m_foreignLock = new object();

    public bool IsEmpty
    {
        get { return m_headIndex >= m_tailIndex; }
    }

    public int Count
    {
        get { return m_tailIndex - m_headIndex; }
    }

    public void LocalPush(T obj)
    {
        int tail = m_tailIndex;
        if (tail < m_headIndex + m_mask)
        {
            m_array[tail & m_mask] = obj;
            m_tailIndex = tail + 1;
        }
        else
        {
            lock (m_foreignLock)
            {
                int head = m_headIndex;
                int count = m_tailIndex - m_headIndex;
                if (count >= m_mask)
                {
                    T[] newArray = new T[m_array.Length << 1];
                    for (int i = 0; i < m_array.Length; i++)
                        newArray[i] = m_array[(i + head) & m_mask];
                    // Reset the field values, incl. the mask.
                    m_array = newArray;
                    m_headIndex = 0;
                    m_tailIndex = tail = count;
                    m_mask = (m_mask << 1) | 1;
                }
                m_array[tail & m_mask] = obj;
                m_tailIndex = tail + 1;
            }
        }
    }

    public bool LocalPop(ref T obj)
    {
        int tail = m_tailIndex;
        if (m_headIndex >= tail)
            return false;
#pragma warning disable 0420
        tail -= 1;
        Interlocked.Exchange(ref m_tailIndex, tail);
        if (m_headIndex <= tail)
        {
            obj = m_array[tail & m_mask];
            return true;
        }
        else
        {
            lock (m_foreignLock)
            {
                if (m_headIndex <= tail)
                {
                    // Element still available. Take it.
                    obj = m_array[tail & m_mask];
                    return true;
                }
                else
                {
                    // We lost the race, element was stolen, restore the tail.
                    m_tailIndex = tail + 1;
                    return false;
                }
            }
        }
    }

    private bool TrySteal(ref T obj, int millisecondsTimeout)
    {
        bool taken = false;
        try
        {
            taken = Monitor.TryEnter(m_foreignLock, millisecondsTimeout);
            if (taken)
            {
                int head = m_headIndex;
                Interlocked.Exchange(ref m_headIndex, head + 1);
                if (head < m_tailIndex)
                {
                    obj = m_array[head & m_mask];
                    return true;
                }
                else
                {
                    m_headIndex = head;
                    return false;
                }
            }
        }
        finally
        {
            if (taken)
                Monitor.Exit(m_foreignLock);
        }
        return false;
    }
}
 Tuesday, July 29, 2008
This is the first part in a series I am going to do on building a custom thread pool. Not that I’m advocating you do such a thing, but I figured it could be interesting to explore the intricacies involved. We’ll start off really simple:
- A CLR monitor based queuing mechanism.
- A static, fixed number of threads.
- The ability to create multiple pools that are isolated from one another.
- Flowing of ExecutionContexts and the ability to turn it off.
As the series progresses, I intend to incorporate some interesting facets such as:
- Dynamic thread injection, so that the number of threads is not fixed.
- Thread sharing among multiple pools in an AppDomain.
- A per-thread work stealing queue to increase the efficiency of recursively queued work.
- Interoperability with I/O completion ports.
- Returning an IAsyncResult object for seamless APM integration.
- Cancelation.
- Anything else that readers suggest might be interesting. Let me know.
And with that, let’s begin.
For now, we’ll use a very simple interface, IThreadPool, under which we can implement various mechanisms and policies. This will make it easier to write generic test harnesses that compare different implementations. For this post we won’t really make use of that capability (much), but we will use it to compare the stock CLR ThreadPool against a very simple custom one.
interface IThreadPool : IDisposable
{
void QueueUserWorkItem(WaitCallback work, object obj);
}
So that we can subsequently compare implementations, we have two simple implementations of IThreadPool. One does safe ThreadPool.QueueUserWorkItem calls, and the other does unsafe ThreadPool.UnsafeQueueUserWorkItem calls. The only difference is that the latter doesn’t flow the ExecutionContext across threads.
class CLRThreadPool : IThreadPool
{
public void QueueUserWorkItem(WaitCallback work, object obj)
{
ThreadPool.QueueUserWorkItem(work, obj);
}
public void Dispose() { }
}
class CLRUnsafeThreadPool : IThreadPool
{
public void QueueUserWorkItem(WaitCallback work, object obj)
{
ThreadPool.UnsafeQueueUserWorkItem(work, obj);
}
public void Dispose() { }
}
Our simple thread pool, SimpleLockThreadPool, will have 6 fields:
- private int m_concurrencyLevel: the number of threads to create statically, specified at construction time (w/ a default of Environment.ProcessorCount);
- private bool m_flowExecutionContext: whether execution context flowing is turned on (the default) or off. Turning it off can provide some performance gains.
- private Queue<WorkItem> m_queue: the queue of actual work items. This object is also used as a monitor. We’ll see what the WorkItem data structure looks like momentarily.
- private Thread[] m_threads: the set of threads actively dequeuing and running work items from this pool. Each instance of SimpleLockThreadPool has its own set.
- private int m_threadsWaiting: a hint used to avoid pulsing on enqueue when no threads are waiting. Threads increment and decrement before and after (respectively) waiting for work.
- private bool m_shutdown: set to true when threads are requested to exit.
Each WorkItem is a struct with three fields. Using a struct avoids superfluous heap allocations.
- internal WaitCallback m_work: the delegate to invoke.
- internal object m_obj: some optional state to pass as the argument to m_work.
- internal ExecutionContext m_executionContext: a context captured at enqueue time, to be used when running the callback. This ensures the appropriate security context and logical call context flow to the work item’s stack, for example.
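To see what capturing a context at enqueue time buys us, here is a small sketch. It assumes a runtime that offers AsyncLocal<T> (which postdates this post) as the ambient data that travels with an ExecutionContext; ContextFlowDemo and its tag values are hypothetical.

```csharp
using System;
using System.Threading;

public static class ContextFlowDemo
{
    // Assumption: AsyncLocal<T> is available; its value flows with ExecutionContext.
    private static readonly AsyncLocal<string> s_tag = new AsyncLocal<string>();

    public static string Run()
    {
        s_tag.Value = "queuer";
        // Snapshot the ambient state, as QueueUserWorkItem would at enqueue time.
        ExecutionContext captured = ExecutionContext.Capture();

        // Pretend we are now a pool thread carrying different ambient state.
        s_tag.Value = "worker";

        string seen = null;
        // The callback runs under the captured context, so it observes "queuer".
        ExecutionContext.Run(captured, delegate(object state) { seen = s_tag.Value; }, null);

        // Once Run returns, the thread's own ambient state ("worker") is back in place.
        return seen + "/" + s_tag.Value;
    }
}
```

ContextFlowDemo.Run() returns "queuer/worker": the callback saw the state captured by the queuing code, and the calling thread kept its own state afterwards.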
There are just 4 methods of interest:
- public void QueueUserWorkItem(WaitCallback work, object obj): implements the IThreadPool interface, and does a few things. It allocates a new WorkItem, optionally captures and stores an ExecutionContext, ensures the pool has started, and then enqueues the WorkItem into the pool, possibly pulsing a single thread (if any are waiting). There’s also a convenient overload that doesn’t take an obj for situations where it isn’t needed.
- private void EnsureStarted(): a simple helper method that will lazily initialize and start the set of threads in a particular pool. These threads just sit in a loop and dequeue work. The lazy aspect ensures that a pool that doesn’t ever get used won’t allocate threads.
- private void DispatchLoop(): this is the main method run by each pool thread. All it does is sit in a loop dequeuing and (if the queue is empty) waiting for new work to arrive. When shutdown is initiated, the method voluntarily quits. If any pool work items throw an exception, this top-level method lets them go unhandled, resulting in a crash of the thread.
- public void Dispose(): shuts down all the threads in a pool. It is synchronous, so it actually waits for them to complete before returning. If work items take a long time to finish, this could be a problem. Extending this to timeout, etc., would be trivial.
And that’s really it. This is a very simple and naïve start, but it will prove to be a good starting point for all of the extensions I mentioned at the outset. Here’s the full code.
using System;
using System.Collections.Generic;
using System.Security.Permissions;
using System.Threading;

public class SimpleLockThreadPool : IThreadPool
{
    // Constructors--
    // Two things may be specified:
    //   ConcurrencyLevel == fixed # of threads to use
    //   FlowExecutionContext == whether to capture & flow ExecutionContexts for work items
    public SimpleLockThreadPool() :
        this(Environment.ProcessorCount, true) { }
    public SimpleLockThreadPool(int concurrencyLevel) :
        this(concurrencyLevel, true) { }
    public SimpleLockThreadPool(bool flowExecutionContext) :
        this(Environment.ProcessorCount, flowExecutionContext) { }
    public SimpleLockThreadPool(int concurrencyLevel, bool flowExecutionContext)
    {
        if (concurrencyLevel <= 0)
            throw new ArgumentOutOfRangeException("concurrencyLevel");
        m_concurrencyLevel = concurrencyLevel;
        m_flowExecutionContext = flowExecutionContext;
        // If suppressing flow, we need to demand permissions.
        if (!flowExecutionContext)
            new SecurityPermission(SecurityPermissionFlag.Infrastructure).Demand();
    }

    // Each work item consists of a closure: work + (optional) state obj + context.
    struct WorkItem
    {
        internal WaitCallback m_work;
        internal object m_obj;
        internal ExecutionContext m_executionContext;

        internal WorkItem(WaitCallback work, object obj)
        {
            m_work = work;
            m_obj = obj;
            m_executionContext = null;
        }

        internal void Invoke()
        {
            // Run normally (delegate invoke) or under context, as appropriate.
            if (m_executionContext == null)
                m_work(m_obj);
            else
                ExecutionContext.Run(m_executionContext, ContextInvoke, null);
        }

        private void ContextInvoke(object obj)
        {
            m_work(m_obj);
        }
    }

    private readonly int m_concurrencyLevel;
    private readonly bool m_flowExecutionContext;
    private readonly Queue<WorkItem> m_queue = new Queue<WorkItem>();
    private Thread[] m_threads;
    private int m_threadsWaiting;
    private bool m_shutdown;

    // Methods to queue work.
    public void QueueUserWorkItem(WaitCallback work)
    {
        QueueUserWorkItem(work, null);
    }

    public void QueueUserWorkItem(WaitCallback work, object obj)
    {
        WorkItem wi = new WorkItem(work, obj);
        // If execution context flowing is on, capture the caller's context.
        if (m_flowExecutionContext)
            wi.m_executionContext = ExecutionContext.Capture();
        // Make sure the pool is started (threads created, etc).
        EnsureStarted();
        // Now insert the work item into the queue, possibly waking a thread.
        lock (m_queue)
        {
            m_queue.Enqueue(wi);
            if (m_threadsWaiting > 0)
                Monitor.Pulse(m_queue);
        }
    }

    // Ensures that threads have begun executing.
    private void EnsureStarted()
    {
        if (m_threads == null)
        {
            lock (m_queue)
            {
                if (m_threads == null)
                {
                    m_threads = new Thread[m_concurrencyLevel];
                    for (int i = 0; i < m_threads.Length; i++)
                    {
                        m_threads[i] = new Thread(DispatchLoop);
                        m_threads[i].Start();
                    }
                }
            }
        }
    }

    // Each thread runs the dispatch loop.
    private void DispatchLoop()
    {
        while (true)
        {
            WorkItem wi = default(WorkItem);
            lock (m_queue)
            {
                // If shutdown was requested, exit the thread.
                if (m_shutdown)
                    return;
                // Find a new work item to execute.
                while (m_queue.Count == 0)
                {
                    m_threadsWaiting++;
                    try { Monitor.Wait(m_queue); }
                    finally { m_threadsWaiting--; }
                    // If we were signaled due to shutdown, exit the thread.
                    if (m_shutdown)
                        return;
                }
                // We found a work item! Grab it ...
                wi = m_queue.Dequeue();
            }
            // ...and Invoke it. Note: exceptions will go unhandled (and crash).
            wi.Invoke();
        }
    }

    // Disposing will signal shutdown, and then wait for all threads to finish.
    public void Dispose()
    {
        lock (m_queue)
        {
            // Set the flag under the lock so that waiting threads reliably observe it.
            m_shutdown = true;
            Monitor.PulseAll(m_queue);
        }
        // Threads are created lazily; a pool that was never used has none to join.
        if (m_threads != null)
        {
            for (int i = 0; i < m_threads.Length; i++)
                m_threads[i].Join();
        }
    }
}
I think everything should be self-explanatory given the earlier explanation of all the fields and types. Let’s take a look at a simple test harness for this. There are a myriad of useful tests, and the one that I will show right now is but one of them. It queues a whole lot of work items, and then blocks waiting for them to complete. I have two variants: one of them allows work items to begin executing before the queuing is done, while the other separates the phases. Here is the general test.
using System;
using System.Diagnostics;
using System.Threading;

class Program
{
    public static void Main(string[] args)
    {
        bool separateQueueFromDrain = bool.Parse(args[0]);
        const int warmupRunsPerThreadPool = 100;
        const int realRunsPerThreadPool = 1000000;
        IThreadPool[] threadPools = new IThreadPool[]
        {
            new CLRThreadPool(),
            new CLRUnsafeThreadPool(),
            new SimpleLockThreadPool(true), // Flow EC
            new SimpleLockThreadPool(false), // Don't flow EC
        };
        long[] queueCost = new long[threadPools.Length];
        long[] drainCost = new long[threadPools.Length];
        Console.WriteLine("+ Running benchmarks ({0}) +", threadPools.Length);
        for (int i = 0; i < threadPools.Length; i++)
        {
            IThreadPool itp = threadPools[i];
            Console.Write("#{0} {1}: ", i, itp.ToString().PadRight(26));
            // Warm up:
            using (CountdownEvent cev = new CountdownEvent(warmupRunsPerThreadPool))
            {
                WaitCallback wc = delegate { cev.Signal(); };
                for (int j = 0; j < warmupRunsPerThreadPool; j++)
                    itp.QueueUserWorkItem(wc, null);
                cev.Wait();
            }
            // Now do the real thing:
            int g0collects = GC.CollectionCount(0);
            int g1collects = GC.CollectionCount(1);
            int g2collects = GC.CollectionCount(2);
            using (CountdownEvent cev = new CountdownEvent(realRunsPerThreadPool))
            using (ManualResetEvent gun = new ManualResetEvent(false))
            {
                WaitCallback wc = delegate {
                    if (separateQueueFromDrain) { gun.WaitOne(); }
                    cev.Signal();
                };
                Stopwatch sw = Stopwatch.StartNew();
                for (int j = 0; j < realRunsPerThreadPool; j++)
                    itp.QueueUserWorkItem(wc, null);
                queueCost[i] = sw.ElapsedTicks;
                sw = Stopwatch.StartNew();
                if (separateQueueFromDrain) { gun.Set(); }
                cev.Wait();
                drainCost[i] = sw.ElapsedTicks;
            }
            g0collects = GC.CollectionCount(0) - g0collects;
            g1collects = GC.CollectionCount(1) - g1collects;
            g2collects = GC.CollectionCount(2) - g2collects;
            Console.WriteLine("q: {0}, d: {1}, t: {2} (collects: 0={3},1={4},2={5})",
                queueCost[i].ToString("#,##0"),
                drainCost[i].ToString("#,##0"),
                (queueCost[i] + drainCost[i]).ToString("#,##0"),
                g0collects,
                g1collects,
                g2collects
            );
            itp.Dispose();
            GC.Collect(2);
            GC.WaitForPendingFinalizers();
        }
        Console.WriteLine();
        Console.WriteLine("+ Comparison against baseline ({0}) +", threadPools[0]);
        for (int i = 0; i < threadPools.Length; i++)
        {
            Console.WriteLine("#{0} {1}: q: {2}x, d: {3}x, t: {4}x",
                i,
                threadPools[i].ToString().PadRight(26),
                queueCost[i] / (float)queueCost[0],
                drainCost[i] / (float)drainCost[0],
                (queueCost[i] + drainCost[i]) / ((float)queueCost[0] + drainCost[0])
            );
        }
    }
}
If we pass ‘true’ on the command line, the phases are separated, and if we pass ‘false’ they are not. The ‘true’ part allows us to home in on the source of overhead (is it the queuing itself, or the dispatching of work items?), but at the expense of needing to keep more of the work items in memory at once (because pool threads can’t drain them as we queue them). We run the test over an array of IThreadPool implementations, and for each one print out the cost to queue work, drain work, and the number of Gen0, Gen1, and Gen2 collections performed for each one. The GC statistics are interesting because they tell us how much more memory (roughly speaking) we are allocating for the same workload on different pool implementations. As our pool gets more complicated, this will be something to keep your eye on.
Here are some sample numbers on my dual-core laptop. Your results will vary. When ‘true’ is passed, I see numbers like the following:
+ Running benchmarks (4) +
#0 CLRThreadPool : q: 3,163,506, d: 5,137,893, t: 8,301,399 (collects: 0=16,1=8,2=3)
#1 CLRUnsafeThreadPool : q: 1,285,806, d: 4,428,451, t: 5,714,257 (collects: 0=5,1=4,2=1)
#2 SimpleLockThreadPool : q: 4,208,686, d: 11,839,614, t: 16,048,300 (collects: 0=104,1=14,2=4)
#3 SimpleLockThreadPool : q: 499,575, d: 3,992,190, t: 4,491,765 (collects: 0=1,1=1,2=1)
+ Comparison against baseline (CLRThreadPool) +
#0 CLRThreadPool : q: 1x, d: 1x, t: 1x
#1 CLRUnsafeThreadPool : q: 0.4064497x, d: 0.8619196x, t: 0.6883487x
#2 SimpleLockThreadPool : q: 1.330387x, d: 2.304371x, t: 1.933204x
#3 SimpleLockThreadPool : q: 0.1579181x, d: 0.7770092x, t: 0.5410853x
And when ‘false’ is passed, I see similar but subtly different numbers:
+ Running benchmarks (4) +
#0 CLRThreadPool : q: 3,476,630, d: 27,592, t: 3,504,222 (collects: 0=20,1=6,2=0)
#1 CLRUnsafeThreadPool : q: 2,636,319, d: 140,653, t: 2,776,972 (collects: 0=5,1=2,2=0)
#2 SimpleLockThreadPool : q: 4,850,171, d: 6,227,052, t: 11,077,223 (collects: 0=95,1=14,2=4)
#3 SimpleLockThreadPool : q: 826,987, d: 132,755, t: 959,742 (collects: 0=1,1=1,2=1)
+ Comparison against baseline (CLRThreadPool) +
#0 CLRThreadPool : q: 1x, d: 1x, t: 1x
#1 CLRUnsafeThreadPool : q: 0.7582973x, d: 5.097601x, t: 0.7924646x
#2 SimpleLockThreadPool : q: 1.395078x, d: 225.6832x, t: 3.161108x
#3 SimpleLockThreadPool : q: 0.2378703x, d: 4.811358x, t: 0.2738816x
Notice right away that we are handily beating the heck out of the CLR thread pool in the case where we don’t flow ExecutionContext objects (the #3 case). In fact, we are only 27% of the cost for the ‘false’ variant. But we unfortunately don’t fare nearly as well when we flow ExecutionContext objects (the #2 case). It turns out that’s because the CLR has a unique advantage over us when compared to our naïve call to ExecutionContext.Capture. Just look at the sizeable difference in Gen0 collections; we are clearly allocating a lot more memory. This will be a topic for a subsequent post.
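The 27% figure is just the total-cost ratio printed by the harness. As a worked check, the hypothetical RelativeCost.Ratio helper below repeats the same arithmetic on the tick counts from the ‘false’ run.

```csharp
using System;

public static class RelativeCost
{
    // Total (queue + drain) ticks of one pool, relative to the baseline pool.
    public static double Ratio(long queueTicks, long drainTicks,
                               long baseQueueTicks, long baseDrainTicks)
    {
        return (queueTicks + drainTicks) / (double)(baseQueueTicks + baseDrainTicks);
    }
}
```

For pool #3 against the CLR baseline, Ratio(826987, 132755, 3476630, 27592) is 959,742 / 3,504,222, or roughly 0.274, which matches the 0.2738816x total reported above.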
 Sunday, July 20, 2008
Here's a slightly more formal approach to proving that the CLR MM is improperly implemented for the particular example I showed earlier.
As the Java MM folks have done, I will use a combination of happens-before and synchronizes-with relations, which allows order in a properly synchronized program to be described as a "flat" sequence with total ordering among elements. Assume < means synchronizes-with. If a happens-before b, and a < b, then any writes in a are visible to any loads in b. This relation is transitive: if a < b and b < c, then a < c. Given this, we can take an observed set of results (the values held in memory locations), a hypothesized execution order (which we can infer from the observation), and validate it against the program order (as written in the source); we do this by taking the MM-specific synchronizes-with relation rules, and see if we can produce the observed output given our belief of the execution order. If we find a contradiction (the execution order required to produce the output could not be produced given the program order and MM rules), either there is an alternative execution order we failed to guess, or we have found a violation of the memory model.
Single threaded programs are easy. Multi threaded programs are hard. We must manually "sequentialize" the program by constructing an interleaving of all executed program operations into a single flat sequence, and permute them as needed to produce the output in order to formulate a hypothesis of the execution order. This is of course very difficult to do, so it only works with very small programs (like the one I will show below).
I will try to define the CLR 2.0 MM in terms of synchronizes-with, although I have to admit it’s going to be difficult to do off the top of my head:
- a < b, given a volatile load a that precedes any other memory operation b. (Loads are acquire.)
- a < b, given any memory operation a that precedes any other store b. (Stores are release.)
- a < b, given two separate memory operations a which precedes b that work with the same memory location. (Data dependence.)
- a < b, given any memory operation a that precedes a full fence b. (Cannot move after a fence.)
- a < b, given a full fence a that precedes some memory operation b. (Cannot move before a fence.)
- a < b, given a lock acquire a that precedes some memory operation b. (Lock acquires are acquire fences.)
- a < b, given a memory operation a that precedes a lock release b. (Lock releases are release fences.)
Let’s take the disturbing example, assuming all loads and stores are volatile.
P0          P1
==========  ==========
X = 1;      Y = 1;
R0 = X;     R2 = Y;
R1 = Y;     R3 = X;
Let’s hypothesize about execution order.
To produce an output in which R1 == R3 == 0, let us observe that it must be the case that X = 1 and Y = 1 must not happen first. If one such instruction does occur first, then any possible outcome leads to R1 and/or R3 holding the value 1. That is because of rule 3: if X = 1 happened first, then X = 1 < R3 = X, leading to R3 == 1 and similarly if Y = 1 happened first, then Y = 1 < R1 = Y, leading to R1 == 1. So let us try to make X = 1 and Y = 1 not happen first.
Indeed, it is impossible for R0 = X or R2 = Y to happen first. This is because of CLR MM rule 3: X = 1; R0 = X leads to data dependence, and thus X = 1 < R0 = X. Similarly, Y = 1 < R2 = Y. Dead end. Let’s try the only other route.
The only remaining possibility to produce the output R1 == R3 == 0 is if R1 = Y or R3 = X occurs first. Let us try to make R1 = Y occur first. Ah-hah! We cannot! Given CLR MM rule 1, R0 = X < R1 = Y. And because of transitivity, this necessarily implies that X = 1 < R1 = Y. The same holds for the other thread’s instructions: Y = 1 < R3 = X. The output R1 == R3 == 0 is therefore a contradiction and disallowed by the CLR MM.
Now, this is light years from a formal proof, but is the reasoning I’ve been using in my mind to explain why this new realization is fundamentally very disturbing and is explicitly not allowed by the CLR MM. Thankfully it seems the JIT team agrees and is willing to fix this for the next release. And, I'm still in search of an example of code that is broken by this problem ...
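The hand permutation above can also be brute-forced. Here is a minimal sketch, assuming the strict model argued for in this post (each thread's operations stay in program order, and stores are visible to everyone immediately). It enumerates every interleaving of the two threads and counts how many end with R1 == R3 == 0. The type and method names (InterleavingCheck, Explore) are mine, invented for illustration.

```csharp
using System;

public static class InterleavingCheck
{
    // One memory operation. Locations: 0 = X, 1 = Y. Registers: R0..R3.
    struct Op { public bool Store; public int Loc; public int Reg; }

    // Each thread's program, in program order.
    static readonly Op[] P0 = {
        new Op { Store = true,  Loc = 0 },            // X = 1
        new Op { Store = false, Loc = 0, Reg = 0 },   // R0 = X
        new Op { Store = false, Loc = 1, Reg = 1 },   // R1 = Y
    };
    static readonly Op[] P1 = {
        new Op { Store = true,  Loc = 1 },            // Y = 1
        new Op { Store = false, Loc = 1, Reg = 2 },   // R2 = Y
        new Op { Store = false, Loc = 0, Reg = 3 },   // R3 = X
    };

    // Counts all interleavings, and those that end with R1 == R3 == 0.
    public static void Explore(out int total, out int bothZero)
    {
        total = 0;
        bothZero = 0;
        Walk(0, 0, new int[2], new int[4], ref total, ref bothZero);
    }

    static void Walk(int i, int j, int[] mem, int[] regs, ref int total, ref int bothZero)
    {
        if (i == P0.Length && j == P1.Length)
        {
            total++;
            if (regs[1] == 0 && regs[3] == 0) bothZero++;
            return;
        }
        // Either thread may take the next step, but only in program order.
        if (i < P0.Length) Step(P0[i], i + 1, j, mem, regs, ref total, ref bothZero);
        if (j < P1.Length) Step(P1[j], i, j + 1, mem, regs, ref total, ref bothZero);
    }

    static void Step(Op op, int i, int j, int[] mem, int[] regs, ref int total, ref int bothZero)
    {
        // Copy the state so sibling branches of the search do not interfere.
        int[] m = (int[])mem.Clone();
        int[] r = (int[])regs.Clone();
        if (op.Store) m[op.Loc] = 1; else r[op.Reg] = m[op.Loc];
        Walk(i, j, m, r, ref total, ref bothZero);
    }

    public static void Main()
    {
        int total, bothZero;
        Explore(out total, out bothZero);
        Console.WriteLine("interleavings: " + total + ", R1 == R3 == 0: " + bothZero);
    }
}
```

There are 20 interleavings of two three-instruction threads, and none of them produces R1 == R3 == 0. That only confirms the argument under the stated model; it says nothing about what a particular JIT or processor actually does.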
 Wednesday, July 16, 2008
The adjacent release/acquire problem is well known. As an example, given the program:
P0            P1
==========    ==========
X = 1;        Y = 1;
R0 = Y;       R1 = X;
The outcome R0 == R1 == 0 is entirely legal. This could happen because writes are delayed in processor store buffers; so before R0 = Y retires, the store X = 1 may have not even left the local processor P0; similarly, before R1 = X retires, the store Y = 1 may not have even left processor P1. It is as if the program was written as follows:
P0            P1
==========    ==========
R0 = Y;       R1 = X;
X = 1;        Y = 1;
The standard way to fix this is to emit a full fence:
P0            P1
==========    ==========
X = 1;        Y = 1;
XCHG;         XCHG;
R0 = Y;       R1 = X;
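In C# terms, the fenced program might look like the following rough sketch, which uses Thread.MemoryBarrier() in place of the XCHG. It runs the two-processor program repeatedly on two real threads and counts how often both loads saw 0. The harness (StoreBufferLitmus, CountBothZero, the Barrier-based lockstep) is my own scaffolding for illustration, and whether the unfenced variant ever shows a nonzero count depends on the machine and the timing.

```csharp
using System;
using System.Threading;

public static class StoreBufferLitmus
{
    static volatile int X, Y;   // the shared locations
    static int R0, R1;          // what each processor observed

    // Runs the program 'iterations' times and returns how many rounds ended
    // with R0 == R1 == 0. With useFence, a full fence separates each store
    // from the load that follows it.
    public static int CountBothZero(int iterations, bool useFence)
    {
        int bothZero = 0;
        using (Barrier barrier = new Barrier(2))
        {
            Thread p1 = new Thread(() =>
            {
                for (int i = 0; i < iterations; i++)
                {
                    barrier.SignalAndWait();              // start the round together
                    Y = 1;
                    if (useFence) Thread.MemoryBarrier();
                    R1 = X;
                    barrier.SignalAndWait();              // both sides finished
                }
            });
            p1.Start();

            for (int i = 0; i < iterations; i++)
            {
                X = 0; Y = 0;                             // reset while P1 is parked
                barrier.SignalAndWait();
                X = 1;
                if (useFence) Thread.MemoryBarrier();
                R0 = Y;
                barrier.SignalAndWait();
                if (R0 == 0 && R1 == 0) bothZero++;
            }
            p1.Join();
        }
        return bothZero;
    }

    public static void Main()
    {
        Console.WriteLine("no fence:   " + CountBothZero(20000, false));
        Console.WriteLine("with fence: " + CountBothZero(20000, true));
    }
}
```

With the fences in place the count should always be 0. Without them, any nonzero count is the store buffer at work.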
But here is one that may be a little surprising:
P0            P1
==========    ==========
X = 1;        Y = 1;
R0 = X;       R2 = Y;
R1 = Y;       R3 = X;
Assuming X and Y are "volatile" to the compiler, is R1 == R3 == 0 a possible outcome in this program?
Based on the rules we provide for .NET's MM, and Intel's whitepaper, one could reasonably argue "no". The reasoning goes as follows. True data dependence prohibits R0 = X from moving before X = 1, and the no load/load reordering rule (e.g. Intel's Rule 2.1) prohibits R1 = Y from moving before R0 = X. Thus, transitively, R1 = Y may not move before X = 1. Similarly, true data dependence prohibits R2 = Y from moving before Y = 1, and the no load/load reordering rule prohibits R3 = X from moving before R2 = Y, and therefore R3 = X may not move before Y = 1. Given this reasoning, the individual instruction streams cannot be reordered in place. And therefore, no interleaving of them will yield R1 == R3 == 0, because either X = 1 or Y = 1 must happen first, and both R1 = Y and R3 = X must come later. Hence at least one of R1 or R3 will observe a value of 1.
Sadly, this reasoning is incorrect. Rule 2.4 in the Intel whitepaper states that "intra-processor forwarding is allowed." They even have an innocent example in the paper, but it actually doesn't exhibit load/load reordering. It does, however, illustrate that stores may be delayed for some time in a write buffer. Perhaps surprisingly, such intra-processor forwarding of buffered stores is actually permitted to satisfy subsequent loads from that location by the same processor before the store has left the processor. This can happen even if it means passing intermediate loads from different memory locations! The result is that load/load reordering is effectively possible under some circumstances. Loads still physically retire in order of course, but because they may be satisfied by pending writes that other processors cannot yet see, it is as if the original program were written as:
P0            P1
==========    ==========
R1 = Y;       R3 = X;
X = 1;        Y = 1;
R0 = X;       R2 = Y;
This fundamentally contradicts what most people believe about .NET's MM, and indeed, Intel's MM as specified in that whitepaper. To be fair, the whitepaper actually does call this out, but in a roundabout and misleading fashion. The text in Rule 2.1, which states that "no loads can be reordered with other loads", is far too strong.
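To convince myself that forwarding alone is enough, here is a toy model of the surprising program, offered as a sketch rather than a description of real hardware. Each processor gets a private one-entry store buffer; a load of a location with a pending local store is satisfied from that buffer, and any other load reads shared memory. Instructions still issue in program order. The search asks whether R0 == R2 == 1 with R1 == R3 == 0 is reachable. The model and its names (StoreBufferModel, SurprisingOutcomeReachable) are assumptions of mine.

```csharp
using System;

public static class StoreBufferModel
{
    // Locations: 0 = X, 1 = Y. Each processor does one store, then two loads.
    static readonly int[] StoreLoc = { 0, 1 };                              // P0: X = 1;  P1: Y = 1
    static readonly int[][] LoadLoc = { new[] { 0, 1 }, new[] { 1, 0 } };   // P0: X, Y;   P1: Y, X
    static readonly int[][] LoadReg = { new[] { 0, 1 }, new[] { 2, 3 } };   // P0: R0, R1; P1: R2, R3

    // True if some execution ends with R0 == R2 == 1 and R1 == R3 == 0.
    public static bool SurprisingOutcomeReachable()
    {
        return Walk(new int[2], new bool[2], new int[2], new int[4]);
    }

    // pc[p]: next instruction of processor p (0 = store, 1 and 2 = loads, 3 = done).
    // buffered[p]: p's store is still sitting in its private store buffer.
    static bool Walk(int[] pc, bool[] buffered, int[] mem, int[] regs)
    {
        if (pc[0] == 3 && pc[1] == 3 && !buffered[0] && !buffered[1])
            return regs[0] == 1 && regs[2] == 1 && regs[1] == 0 && regs[3] == 0;

        for (int p = 0; p < 2; p++)
        {
            // Option 1: p's buffered store drains to shared memory.
            if (buffered[p])
            {
                var b = (bool[])buffered.Clone();
                var m = (int[])mem.Clone();
                b[p] = false;
                m[StoreLoc[p]] = 1;
                if (Walk(pc, b, m, regs)) return true;
            }
            // Option 2: p executes its next instruction, in program order.
            if (pc[p] < 3)
            {
                var c = (int[])pc.Clone();
                var b = (bool[])buffered.Clone();
                var r = (int[])regs.Clone();
                if (pc[p] == 0)
                {
                    b[p] = true;   // the store enters the private buffer first
                }
                else
                {
                    int loc = LoadLoc[p][pc[p] - 1];
                    // Forwarding: a pending local store to this location wins;
                    // otherwise the load reads shared memory.
                    int value = (buffered[p] && StoreLoc[p] == loc) ? 1 : mem[loc];
                    r[LoadReg[p][pc[p] - 1]] = value;
                }
                c[p]++;
                if (Walk(c, b, mem, r)) return true;
            }
        }
        return false;
    }

    public static void Main()
    {
        Console.WriteLine(SurprisingOutcomeReachable());
    }
}
```

The search finds the outcome: both stores sit in their buffers while each processor forwards its own value into R0 or R2, then reads the other location from memory before the other store has drained.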
Anytime a little hole in something as fundamental as MM axioms is uncovered, it is cause for concern. So I found this discovery deeply disturbing. Many abstractions and theorems are proved with the assumption that the MM is rock solid. I know a lot of code I have written relies on such proofs.
That said, I've been racking my brain (and in fact was having nightmares about it last evening) trying to uncover a case where this is worse than the existing release/acquire reordering issue that I opened this post with. Everything I come up with is saved at the last minute by rules 2.1 (for stores) and 2.5 "stores are transitively visible". The basic problem is that a processor can get stuck seeing its own written value for some time, during which other processors cannot, but ultimately it doesn't seem to matter because the buffer will eventually be flushed. Then any intermediary values that the destination may have held while that processor was stuck will have been overwritten anyway, so the outcome should be explainable (albeit racy). I'm still thinking hard about this.
 Monday, June 23, 2008
I just submitted the final manuscript for Concurrent Programming on Windows to Addison-Wesley.
This marks the exciting transition from things happening on my timetable to things happening on AW’s timetable.
A lot has changed for me since I decided to write this book. You might be surprised to hear that I actually signed the contract for it on November 29th, 2005. That’s 2 years and 7 months ago. It’s almost unbelievable that this book took so long to finish. By comparison, my first one took just a little over a year. The road has been a long one, full of personal ups and downs, but it’s no doubt been an exciting trip.
I’ve been at Microsoft the whole time. At the outset, I was a PM on the CLR Team, hacking on software transactional memory and PLINQ as an evening activity. Then I transitioned to doing it full time, but still as a PM. Then I joined the Parallel Computing team as the dev for PLINQ. Then I kicked off the whole Parallel Extensions effort (which is 20 members and growing strong), became the dev lead, and here I am today. It’s pretty strange to say this, but without the book very little of that would have happened. I can’t think of a better way to get entrenched in a technology, experience the breadth, and force yourself to learn every little intricate and often enlightening detail. If you can afford the impact to mental health and personal relationships ;), it’s an activity I highly recommend to anybody wanting to master a technology... not that one can actually master the concurrency beast, but y’know...
In retrospect, it should have taken a year. Maybe next time.
The good news is that you will have the book in your hands soon. (Well, if you decide to buy a copy, that is.) If you manage to make it to my PDC 2008 pre-con session, I’m hoping we will have some copies available. No promises, since I missed my final deadline by a couple weeks, but my fingers are crossed.
Oh yeah, and you can expect me to pick up blogging again now that I’ll have some free time. Hmm, free time? What will I do with myself!
Laissez les bon temps roulez!
 Friday, June 13, 2008
We had an interesting debate at a Parallel Extensions design meeting yesterday, where I tried to convince everybody that a full fence on SpinLock exit is not a requirement. We currently offer an Exit(bool) overload that accepts a flushReleaseWrite argument. This merely changes the lock release from
m_state = 0;
to
Interlocked.Exchange(ref m_state, 0);
The main purpose of this is to announce “availability” of the lock to other processors. More specifically, it ensures that, before the current processor is able to turn around and reacquire the lock in its own private cache, other processors at least have the opportunity to see the write. This is a fairness optimization, and avoiding the CAS on release halves the number of CAS operations necessary (which are expensive), so we would generally like to avoid superfluous ones. It turns out you could easily do this without our help. Instead of
slock.Exit(true);
you could say
slock.Exit();
Thread.MemoryBarrier();
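To make the two release flavors concrete, here is a stripped-down spin lock sketch. It is not the Parallel Extensions SpinLock; the SketchSpinLock type, its spinning policy, and the demo harness are all hypothetical. Only the two Exit paths mirror the m_state = 0 versus Interlocked.Exchange choice described above.

```csharp
// Passing a volatile field by ref to Interlocked is intentional here.
#pragma warning disable 0420
using System;
using System.Threading;

// Hypothetical, simplified spin lock for illustration only.
public class SketchSpinLock
{
    private volatile int m_state;   // 0 = free, 1 = held

    public void Enter()
    {
        SpinWait spinner = new SpinWait();
        // CAS 0 -> 1; back off and retry until it succeeds.
        while (Interlocked.CompareExchange(ref m_state, 1, 0) != 0)
            spinner.SpinOnce();
    }

    public void Exit() { Exit(false); }

    public void Exit(bool flushReleaseWrite)
    {
        if (flushReleaseWrite)
            Interlocked.Exchange(ref m_state, 0);   // release plus a full fence
        else
            m_state = 0;                            // plain release store
    }
}

public static class SpinLockDemo
{
    // Several threads bump a shared counter under the lock.
    public static int Run(int threads, int perThread, bool flush)
    {
        SketchSpinLock slock = new SketchSpinLock();
        int counter = 0;
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++)
        {
            workers[t] = new Thread(() =>
            {
                for (int i = 0; i < perThread; i++)
                {
                    slock.Enter();
                    counter++;              // protected by the lock
                    slock.Exit(flush);
                }
            });
            workers[t].Start();
        }
        foreach (Thread w in workers) w.Join();
        return counter;
    }

    public static void Main()
    {
        Console.WriteLine(Run(4, 100000, false));
        Console.WriteLine(Run(4, 100000, true));
    }
}
```

Both variants keep the counter correct. The flag only changes how promptly other processors get a chance to see the lock become free, which is a fairness matter and not a correctness one.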
Most of the debate about whether the default Exit should use a fence centered around confusion over the strength of volatile vs. a full fence. For example, the C# documentation for volatile is highly misleading (http://msdn.microsoft.com/en-us/library/x13ttww7(VS.71).aspx):
The volatile modifier is usually used for a field that is accessed by multiple threads without using the lock statement to serialize access. Using the volatile modifier ensures that one thread retrieves the most up-to-date value written by another thread.
The confusion is over the “ensures that one thread retrieves the most up-to-date value written by another thread” part. Technically this is somewhat-accurate, but is worded in a very funny and misleading way. To see why, let’s take a step back and consider what volatile actually means in the CLR’s memory model (MM) for a moment, to set context. Note that I did my best to concisely summarize the MM here: http://www.bluebytesoftware.com/blog/2007/11/10/CLR20MemoryModel.aspx.
Volatile on loads means ACQUIRE, no more, no less. (There are additional compiler optimization restrictions, of course, like not allowing hoisting outside of loops, but let’s focus on the MM aspects for now.) The standard definition of ACQUIRE is that subsequent memory operations may not move before the ACQUIRE instruction; e.g. given { ld.acq X, ld Y }, the ld Y cannot occur before ld.acq X. However, previous memory operations can certainly move after it; e.g. given { ld X, ld.acq Y }, the ld.acq Y can indeed occur before the ld X. The only processor Microsoft .NET code currently runs on for which this actually occurs is IA64, but this is a notable area where CLR’s MM is weaker than most machines. Next, all stores on .NET are RELEASE (regardless of volatile, i.e. volatile is a no-op in terms of jitted code). The standard definition of RELEASE is that previous memory operations may not move after a RELEASE operation; e.g. given { st X, st.rel Y }, the st.rel Y cannot occur before st X. However, subsequent memory operations can indeed move before it; e.g. given { st.rel X, ld Y }, the ld Y can move before st.rel X. (I used a load since .NET stores are all release.) Note that RELEASE is the opposite of ACQUIRE: you can think of an acquire as a one-way fence that prohibits passes downward, and a release as a one-way fence that prohibits passes upward. A full fence prohibits both (lock acquire, XCHG, MB, etc).
Note one very interesting thing in this discussion: a release followed by an acquire, given the above rules, does not prohibit movement of the instructions with respect to one another! Given { st.rel X, ld.acq Y }, even though they are both volatile (i.e. acquire and release), so long as X!=Y, it is perfectly legal for the ld.acq Y to move before st.rel X. We aren’t limited to single instructions either, e.g. given { st.rel X, ld.acq A, ld.acq B, ld.acq C }, all three loads (A, B, C) may indeed happen before the store to X. This occurs with regularity in practice, on X86, X64, and IA64, because of store buffering. It would just be too costly to hold up loads until a store has reached all processors. Superscalar execution is meant to hide such latencies.
(As an aside, many people wonder about the difference between loads and stores of variables marked as volatile and calls to Thread.VolatileRead and Thread.VolatileWrite. The difference is that the former APIs are implemented stronger than the jitted code: they achieve acquire/release semantics by emitting full fences on the right side. The APIs are more expensive to call too, but at least allow you to decide on a callsite-by-callsite basis which individual loads and stores need the MM guarantees.)
I have to admit the store buffer problem is mostly theoretical. It rarely comes up in practice. That said, on a system which permits load reordering, imagine:
Initially: X = Y = 0
T0                                T1
X = 5;             // st.rel      while (X == 0) ;   // ld.acq
while (Y == 0) ;   // ld          X = 0;             // st.rel
A = X;             // ld.acq      Y = 5;             // st.rel
After execution, is it possible that A == 5?
If the read of Y is non-volatile on T0 (which would be bad because a compiler may hoist it out of the loop, but ignore compilers for a moment), then the fact that the subsequent read of X is volatile does not save us from a reordering leading to A == 5. This is the { ld, ld.acq } case described earlier. Why might this physically occur? Well, it won’t happen on X86 and X64 because loads are not permitted to reorder. However!! IA64 permits non-acquire loads (non-volatile) to reorder, and so the A = X may actually be satisfied out of the write buffer before the store even leaves the processor. It’s as though the program became:
T0                                T1
X = 5;             // st.rel      while (X == 0) ;   // ld.acq
A = X;             // ld.acq      X = 0;             // st.rel
while (Y == 0) ;   // ld          Y = 5;             // st.rel
Whoops! This should make it apparent that this outcome is indeed a real possibility. And clearly it may cause bugs.
Note 6/13/08: Eric pointed out privately that compilers need only respect the CLR MM, and can freely reorder loads. Thus, this problem may actually arise on non-IA64 machines. Of course he is entirely correct. It was silly of me to overlook that.
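For contrast, here is a sketch of the same two threads with every shared access marked volatile, so the spin on Y becomes a ld.acq and the later read of X cannot be hoisted above it. The harness (OrderedHandoff, RunOnce, and the Thread.Yield calls that keep the spinning polite) is my own scaffolding. Under the rules above, A should always come out as 0 in this variant.

```csharp
using System;
using System.Threading;

public static class OrderedHandoff
{
    static volatile int X;
    static volatile int Y;   // volatile is the point: the spin on Y is now ld.acq

    // Runs T0 and T1 once and returns the value T0 read into A.
    public static int RunOnce()
    {
        X = 0;
        Y = 0;
        int a = -1;

        Thread t0 = new Thread(() =>
        {
            X = 5;                               // st.rel
            while (Y == 0) Thread.Yield();       // ld.acq: later loads stay below it
            a = X;                               // ld.acq
        });
        Thread t1 = new Thread(() =>
        {
            while (X == 0) Thread.Yield();       // ld.acq
            X = 0;                               // st.rel
            Y = 5;                               // st.rel: published after X = 0
        });

        t0.Start();
        t1.Start();
        t0.Join();
        t1.Join();
        return a;
    }

    public static void Main()
    {
        for (int i = 0; i < 5; i++)
            Console.WriteLine("A = " + RunOnce());
    }
}
```

T1 can only write X = 0 after it has seen T0's X = 5, and T0 can only read X after it has seen Y = 5, which T1 released after X = 0. That chain is what rules out A == 5 once the read of Y is an acquire.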
All that said, let’s get back to the original concern about visibility of writes. This issue doesn’t even really involve reordering. Imagine one processor continuously executes a stream of lock acquires and releases, and that the stream goes on indefinitely (perhaps because it’s in a loop):
while (Interlocked.CompareExchange(ref m_state, 1, 0) != 0) ;
m_state = 0;
while (Interlocked.CompareExchange(ref m_state, 1, 0) != 0) ;
m_state = 0;
…
The Interlocked operation acquires the cache line in X mode. After it executes, other processors will notice that the lock is taken. But right away, the processor writes 0 to the line without a fence, and immediately goes on to execute another acquire. It is highly likely that the line will be marked dirty in the processor’s cache by the time that it acquires it in X mode again, something that the cache coherency system makes very cheap. In fact, the write of m_state = 0 probably hasn’t left the write buffer yet due to latency.
So before another processor can even see m_state as 0, the processor will have already gotten around to taking the lock again. Even for volatile loads and stores, there is no MM guarantee that writes will leave the processor immediately; hence the documentation earlier is slightly confusing; yes, the processor doing a volatile read will see the “most recent” value, but that “most recent” value (a) may be satisfied out of the local write buffer, and (b) may simply not have the ability to observe writes that occurred in practice due to the above timeliness issue.
 Thursday, June 05, 2008
We sat down last week with Charles from Channel9 to discuss the new CTP. Both parts got posted today:
We focus on the new aspects of the stack, incl. the new scheduler and CDS, and also discuss what's changed in PLINQ and TPL.
If you have ideas for future videos, or any feedback/questions, you know where to send 'em. joedu AT youknowwhere DOT com.
 Monday, June 02, 2008
We just released a new CTP of Parallel Extensions to .NET: get it here.
Some relevant information is up on our team blog:
I'm really excited to see our entire stack finally shipping as one cohesive unit: the data structures we use throughout the implementation exposed publicly (what we now call CDS), a new scheduler built from the ground up, TPL and PLINQ better together, and lots more. We're still very far from the end of the road, and have plenty of cool stuff still in the works, but we've made a ton of progress in just 6 months since our last CTP.
Have fun and, as always, feedback is much appreciated.
 Thursday, May 29, 2008
PDC'08 is officially on for October 27-30th this year: http://microsoftpdc.com/.
My team will certainly have some really fun stuff to show off, and just glancing at the preliminary list of teaser sessions, it's going to be a blast.
A few of us have teamed up to give a PreCon. That's code-word for a full day of neverending concurrency goodness. From http://microsoftpdc.com/agenda/Preconference.aspx:
Concurrent, Multi-core Programming on Windows and .NET
Presenter(s): David Callahan, Joe Duffy, Stephen Toub
The leap from single-core to multi-core technology is altering computing as we know it, enabling increased productivity, powerful energy-efficient performance, and leading-edge advanced computing experiences. The good news is that Windows and .NET offer rich support for threading and synchronization to take advantage of these new platforms. This session, presented by David Callahan, Microsoft distinguished engineer, Joe Duffy, author of “Concurrent Programming on Windows” (Addison-Wesley), and Stephen Toub, program manager lead for the Concurrency Development Platform team at Microsoft, will cover a broad range of topics, including mechanisms to create, coordinate, and synchronize among threads; best practices for concurrent libraries and apps; and techniques for improving scalability, including lock-free algorithms. Focus will be on .NET programming, including the next generation of parallel programming support within the Framework, but Windows internals and C++ nuggets will be discussed too.
About the presenter(s): David Callahan joined Microsoft in 2005. He is a Distinguished Engineer leading the Parallel Computing Platform Team within Visual Studio® focused on incubating technology for the coming manycore processors. This team is producing exciting new technologies as part of Visual Studio and also driving the Parallel Computing Initiative, a company wide effort to deliver customer value from the power of future high-performance processors. David’s background is in programming languages, parallel programming techniques, and compilation techniques focused on expressing and exploiting concurrency.
Stephen Toub is a Senior Program Manager Lead on the Parallel Computing Platform team at Microsoft, where he spends his days focusing on the next generation of programming models for concurrency. Stephen is also a Contributing Editor for MSDN® Magazine, for which he writes the .NET Matters column, and he’s an avid speaker at conferences like TechEd and DevConnections. Prior to working on the Parallel Computing Platform, Stephen designed and built enterprise applications for companies such as GE, McGraw-Hill, BankOne, and JetBlue. He was a developer for Microsoft Outlook as well as for the Microsoft Office Solution Accelerators. In his spare time, Stephen loves to sing, and he spends as much time as possible with his beautiful wife Tamara.
Joe Duffy leads development for Microsoft's Parallel Extensions to .NET technology, a set of library and runtime technologies for concurrent and parallel computing. He founded the project in 2006 with Parallel Language Integrated Query (aka PLINQ), an innovative declarative parallel query analysis and execution engine. Prior to Parallel Extensions, Joe worked on transactional memory, library and VM support for concurrency in the Common Language Runtime (CLR) team, and has written 3 functional language compilers (Scheme, Common LISP, and Haskell). He has written two books, including Concurrent Programming on Windows (Addison-Wesley, 2008), and in his spare time reads and writes (code and text), plays guitar, and studies music theory.
Be there, or be square. The slides aren't even finalized yet, so let me know if you have particular topics you'd like to learn more about.
 Friday, May 16, 2008
Counting events and doing something once a certain number have been registered is a very common pattern in concurrent programming. In the olden days, COM ref counting was a clear example of this: multiple threads might share a COM object, call Release when done with it, and hence memory management was much simpler. GC has alleviated a lot of that, but the problem of deciding when a shared IDisposable resource should be finally Disposed of in .NET is strikingly similar. And nowadays, things like CountdownEvent are commonly useful for orchestrating multiple workers (see MSDN Magazine), which (although not evident at first) is based on the same counting principle.
Coding up one-off solutions to all of these is actually pretty simple. But doing so seems unfashionably ad-hoc, at least to me. Codifying the pattern can be done in a couple dozen lines of code, so that it can be reused for many purposes. As an example, here is a reusable Counting<T> class, written in C#, that just invokes an action delegate once the count reaches zero:
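// Passing a volatile field by ref to Interlocked is intentional.
#pragma warning disable 0420
using System;
using System.Threading;

public class Counting<T>
{
    private readonly T m_obj;
    private volatile int m_count;
    private readonly Action<T> m_action;

    public Counting(T obj, int initialCount, Action<T> action)
    {
        m_obj = obj;
        m_count = initialCount;
        m_action = action;
    }

    // Adds a reference and returns the new count.
    public int AddRef()
    {
        int c;
        // A result of 1 means the count had already dropped to zero;
        // taking a new reference at that point is an error.
        if ((c = Interlocked.Increment(ref m_count)) == 1)
            throw new InvalidOperationException("Count had already reached zero.");
        return c;
    }

    // Drops a reference; the thread that reaches zero runs the action.
    public int Release()
    {
        int c;
        if ((c = Interlocked.Decrement(ref m_count)) == 0)
            m_action(m_obj);
        return c;
    }

    public T Obj { get { return m_obj; } }
}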
Notice I’ve used the IUnknown vocabulary of AddRef and Release. Old habits die hard.
The CountdownEvent I mentioned earlier is just a simple extension to this basic functionality. In fact, we don’t need to write another class; it’s merely an instance of Counting<T>, where the T is a ManualResetEvent. Setters directly use the Counting<T> object’s Release method to register a signal, while waiters can use the WaitOne method on the raw ManualResetEvent itself. The event will be set once all signals have arrived:
Counting<ManualResetEvent> countingEv = new Counting<ManualResetEvent>(
    new ManualResetEvent(false), N, e => e.Set()
);
…
// Setter:
countingEv.Release();

// Waiter:
countingEv.Obj.WaitOne();
(Exposing a traditional Set/Wait interface would of course be nicer, but even then Counting<T> makes the implementation brain-dead simple.)
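As a rough sketch of what that nicer Set/Wait surface might look like, here is a thin wrapper over Counting<T>. The wrapper name SimpleCountdown and its members are hypothetical, and a trimmed copy of Counting<T> is repeated only so the sketch compiles by itself.

```csharp
// Passing a volatile field by ref to Interlocked is intentional.
#pragma warning disable 0420
using System;
using System.Threading;

// Trimmed copy of Counting<T> from above, repeated so this compiles alone.
public class Counting<T>
{
    private readonly T m_obj;
    private volatile int m_count;
    private readonly Action<T> m_action;

    public Counting(T obj, int initialCount, Action<T> action)
    {
        m_obj = obj;
        m_count = initialCount;
        m_action = action;
    }

    public int Release()
    {
        int c = Interlocked.Decrement(ref m_count);
        if (c == 0) m_action(m_obj);
        return c;
    }

    public T Obj { get { return m_obj; } }
}

// Hypothetical wrapper exposing a Set/Wait interface.
public class SimpleCountdown
{
    private readonly Counting<ManualResetEvent> m_counting;

    public SimpleCountdown(int initialCount)
    {
        m_counting = new Counting<ManualResetEvent>(
            new ManualResetEvent(false), initialCount, e => e.Set());
    }

    // Registers one signal; the last one sets the underlying event.
    public void Set() { m_counting.Release(); }

    // Blocks until all signals have arrived.
    public void Wait() { m_counting.Obj.WaitOne(); }

    // Returns false if the signals have not all arrived within the timeout.
    public bool Wait(int millisecondsTimeout)
    {
        return m_counting.Obj.WaitOne(millisecondsTimeout);
    }
}

public static class CountdownDemo
{
    public static void Main()
    {
        const int N = 4;
        SimpleCountdown countdown = new SimpleCountdown(N);
        for (int i = 0; i < N; i++)
            new Thread(() => countdown.Set()).Start();
        countdown.Wait();
        Console.WriteLine("all " + N + " workers signaled");
    }
}
```

The wrapper adds no synchronization of its own; everything still rides on the single interlocked decrement inside Release.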
Similarly, the “who should dispose” problem is easy to solve with Counting<T>. Say that, instead of setting the event, we actually want to Dispose of some IDisposable object when all threads are done with it:
Counting<ManualResetEvent> ev = new Counting<ManualResetEvent>(
    new ManualResetEvent(false), N, e => e.Dispose()
);
Though this does the trick, we might instead wrap it in a more convenient package:
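public class CountingDispose<T> :
    Counting<T>, IDisposable
    where T : IDisposable
{
    public CountingDispose(T obj, int initialCount) :
        base(obj, initialCount, d => d.Dispose()) { }

    // Each caller's Dispose releases one reference; the last one disposes the underlying object.
    public void Dispose() { Release(); }
}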
Given this definition, threads can use the CountingDispose<T> object as they would any IDisposable thing. This facilitates use in C# using blocks. Only when all threads have called Dispose will Dispose be called on the actual underlying object:
CountingDispose<ManualResetEvent> ev = new CountingDispose<ManualResetEvent>(
    new ManualResetEvent(false), N
);
…
// Some threads wait:
using (ev) {
    … ev.Obj.WaitOne(); …
}

// Some threads set:
using (ev) {
    … ev.Obj.Set(); …
}
I’ve found that the extremely simple Counting<T> idea is a surprisingly powerful one. It’s fairly extensible too; for example, you clearly may want to run actions at different points in the counting, use clever synchronization to ensure actions run at particular points are processed in-order (useful for progress reporting), to reset the count afterwards, and so on. It’s way too simple to claim it’s anything terribly amazing, but I thought I’d share the idea anyway.
 Friday, March 28, 2008
We take code reviews very seriously in our group. No checkin is ever made without a peer developer taking a close look. (Incubation projects are often treated differently than product work, because the loss of agility is intolerable.) A lot of this is done over email, but if there’s anything that is unclear from just looking at the code, a face to face review is done. Feedback ranges from consistency (with guidelines and surrounding code), finding errors or overlooked conditions, providing suggestions on how to more clearly write something, comments, etc.; this ensures that our codebase is always of super high quality.
Concurrency adds some complexity to development, and requires special consideration during code reviews. I thought I’d put some thoughts on paper about what I look for during concurrency-oriented code reviews, in hopes that it will be useful to anybody starting to sink their teeth into concurrency; it may also help you devise your own internal review guidelines. Most of this advice just comes down to knowing a laundry list of best practices, but a lot of it is also knowing what to look for and where to spend your time during a review.
(A couple years ago I wrote a lengthy “Concurrency and its impact on reusable libraries” essay which provides a lot of the motivation behind what I look for. It’s up on my blog, http://www.bluebytesoftware.com/blog/2006/10/26/ConcurrencyAndTheImpactOnReusableLibraries.aspx, and (though slightly out of date) I’m revising it for an Appendix in my upcoming book. If you question why I believe something, chances are that this document will explain my rationale. And it’s far more complete than this short essay; I’ve only hit the high points here.)
Getting started
I first review the code in a traditional sequential code review fashion. When doing this, I earmark all state that I see as either “private” (aka isolated) or “shared”. I then go back and closely review all state that is shared (accessible from many threads) with a fine-tooth comb. Sometimes I’ll do this during my first pass through, but I usually find it helpful to understand the algorithmic structure of the changes first before fully developing an understanding of the concurrency parts.
Changes to existing code should be reviewed just as carefully (if not more carefully) as new code. Concurrency behavior is subtle, and it’s very easy to accidentally violate some unchecked assumption the code was previously making. Liberal use of asserts is therefore very important. Sadly many of the conditions code assumes are simply unassertable (like “object X isn’t shared”). I easily spend about 2x the amount of time reviewing the concurrency aspects of the code than usual sequential aspects. Perhaps more. This extra time is OK, because the concurrency portion is far smaller (in lines of code) than the sequential portion in most of the code I review. There are obvious exceptions to this rule, especially since I’m on a team building low-level primitive data types whose sole purpose in life is to be used in concurrent apps.
Shared state and synchronization
Some state, although shared, is immutable (read-only) and can be safely shared and read from concurrently. Often this is by construction (e.g. immutable value types) but sometimes this is by loose convention (e.g. a data structure is immutable for some period of time, simply by virtue of no threads actively writing to it). Both should be clearly documented in the code. Once mutable shared state is identified, I look for two major things:
- When does it become shared, i.e. publicized, and what is the protocol for the transfer of ownership? Is it done cleanly? And is it well documented?
- When does it become private again, if ever? And is this documented too?
Ideally all shared state would have clean ownership transitions. Any state that is disposable necessarily must have a point at the end of its life where it has a single owner, so it can be safely disposed of (unless ref-counting is used). But for most state the line will be extremely blurry and unenforced. Comments should be used to clarify, in gory detail. I also tend to prefix names of variables that refer to shared objects with the word ‘shared’ itself, so that they jump out.
Many, many bugs arise from some code publicizing some state, sometimes by accident, and then continuing to treat it as though it is private. It is also sometimes tricky to determine this precisely, since sharing can be modal. A list data structure may be shared in some contexts but not others. Knowing what its sharing policy is requires transitive knowledge of callers. Building up this level of global understanding can involve a fair bit of simply sitting back and reading and rereading the code over and over again.
Once the policy around sharing a piece of state is known, it is crucial to understand the intended synchronization policy for that data. Is it protected by a fine-grained monitor? Is it manipulated and read in a lock-free way? And so on. And once the intended policy is known, is the actual policy implemented what was intended?
While this part is extremely important, by the way, I have to admit that I feel this aspect tends to overshadow other things in conversation. This is probably because it’s the most obvious thing to look for. Sadly the world of concurrency is far more subtle than this. I’ve honestly found more bugs resulting from failing to identify shared state properly than resulting from failing to implement the synchronization logic itself properly. Your mileage will of course vary.
Locks
I treat lock-based code and lock-free as two entirely separate beasts. I spend about 5x the time reviewing lock-free code when compared to lock-based code. There is a tax to having lock-free code in any codebase, so as you are reviewing it, also ask yourself: is there a better (or almost-as-good) way that this could have been done using locks? Often the answer is no, due to the benefits lock-freedom brings (no thread can indefinitely starve another).
But if the answer is yes, that the code could be written more clearly using locks, you could save your team a lot of time by convincing the author to change his or her mind. Not only is lock-free code far more difficult to write and test, it carries a large tax during long stress hauls and end-game bug-fixing, an important and time-sensitive period in the development lifecycle of any commercial software product. Maintaining lock-free code also carries an extra long-term cost, particularly when ramping up new hires on it. All of this risks interfering with your ability to work on cool new features at some point. Don’t feel bad about pushing back on this one. Hard.
Carefully review what happens inside of a lock region. Look at every single line with scrutiny.
- Lock hold times should be as short as possible. Hold times should be counted in dozens or hundreds of cycles, not thousands (unless absolutely unavoidable).
- If lock hold times are in the dozens, you can consider using a spin-lock.
- Recursive lock acquisitions are strongly discouraged. If it can happen, did the developer clearly intend it to happen? Or is this possibility accidental? Point it out to them. Also, are there any unexpected points at which reentrancy can occur? E.g. any APC or message-pumping waits? If yes, is there a way to avoid that by simple restructuring of the code?
- Dynamic method calls via delegates or virtual methods while a lock is held should be as rare as possible. Method calls under a lock to user-supplied code should only ever happen if the concurrency behavior is clearly documented and specified for the user, and when invariants hold. All of these cases can lead to reentrancy. Often this requires special code to detect the reentrancy and respond intelligently.
- Lock regions should usually not span multiple methods: for example, acquiring the lock in one method, returning, and having the caller release it in another method is bad form. It is very easy to screw up the control flow and deadlock your library.
- CERs can only use spin-locks currently (because Monitor.ReliableEnter is currently unavailable), if you care about orphaning locks at least (which most CER-cost does). If you see somebody trying to write a CER using a CLR Monitor, their code is probably busted. Thankfully CERs are pretty rare to encounter in practice.
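Two of the points in the list above (short hold times, and no user-supplied code under the lock) often come down to the same restructuring: take a snapshot under the lock and call out after releasing it. Here is a minimal sketch; the Notifier type and its members are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical example type, not from any real codebase.
public class Notifier
{
    private readonly object m_lock = new object();
    private readonly List<Action<int>> m_handlers = new List<Action<int>>();
    private int m_value;

    public void Subscribe(Action<int> handler)
    {
        lock (m_lock) { m_handlers.Add(handler); }
    }

    public void Update(int value)
    {
        Action<int>[] snapshot;
        lock (m_lock)
        {
            // Only the state change and a cheap copy happen under the lock.
            m_value = value;
            snapshot = m_handlers.ToArray();
        }
        // User-supplied delegates run after the lock is released, so they
        // can never reenter this object while its invariants are in flux.
        foreach (Action<int> handler in snapshot)
            handler(value);
    }

    public int Value
    {
        get { lock (m_lock) { return m_value; } }
    }
}

public static class NotifierDemo
{
    public static void Main()
    {
        Notifier n = new Notifier();
        // The handler calls back into the notifier, which is safe here
        // because Update no longer holds the lock when it runs.
        n.Subscribe(v => Console.WriteLine("updated to " + n.Value));
        n.Update(7);
    }
}
```

The cost is that a handler may run slightly after a newer update has landed, which is exactly the kind of behavior that needs to be documented for the user.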
Races that break code are always must-fix bugs, no matter how obscure. If they happen with low frequency on the quad-cores of today, they will probably break with regularity on the 16-cores of tomorrow. The kind of code my team writes needs to remain correct and scale well into the distant future; presumably if you’re writing concurrent code already, yours does too. If you find such a race, the code should not even be checked in until it’s fixed. “But it only happens once in a while” is an inexcusable answer. Benign races are OK but should be clearly documented.
Events
When I see any event-based code (either Monitor Wait/Pulse/PulseAll condition variables or some event type, like AutoResetEvent or ManualResetEvent), the first thing I do is build up a global understanding of all the conditions under which events are set, reset, and waited on. This is to understand the coordination and flow of threads top-down, rather than bottom-up. Because I’ve already reviewed the sequential parts of the algorithm, I typically already know the important state transitions events are guarding before I even get to this point.
Next, there are some simple aspects to specific usage of events that I look for:
- Understanding the relationship between mutual exclusion, the state, and the events is important and subtle. Comments should be used ideally to explain that.
- Does the setting of the event happen in a wake-one (Pulse, Auto-Reset) or wake-all (PulseAll, Manual-Reset) manner? If it’s wake-one, are all waiters homogeneous? Is it always strictly true that waking-one is sufficient and won’t lead to missed wake-ups?
- Waiters that release the lock and then wait should be viewed with suspicion. There’s a race between the release and wait that notoriously causes deadlocks.
- Concurrent code should never use timeouts as a way to work around sloppiness in the way threads wait and signal. A missed wake-up is a bug in the code that must be fixed.
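As a reference point for what a well-behaved waiter looks like, here's a minimal sketch of a condition-variable style queue built on Monitor.Wait and Monitor.Pulse. BlockingQueue is a hypothetical name used only for this illustration. The waiter never releases the lock by hand before waiting, and it re-checks its condition in a loop.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical example type: not from any library.
class BlockingQueue<T>
{
    private readonly object m_lock = new object();
    private readonly Queue<T> m_items = new Queue<T>();

    public void Enqueue(T item)
    {
        lock (m_lock)
        {
            m_items.Enqueue(item);
            // Wake-one is enough here: all waiters wait for the same
            // condition, and one new item can satisfy only one of them.
            Monitor.Pulse(m_lock);
        }
    }

    public T Dequeue()
    {
        lock (m_lock)
        {
            // Monitor.Wait releases the lock and starts waiting atomically,
            // so there is no window between release and wait. The loop
            // re-checks the condition because another thread may have taken
            // the item before this one reacquired the lock.
            while (m_items.Count == 0)
                Monitor.Wait(m_lock);
            return m_items.Dequeue();
        }
    }
}

class Program
{
    public static void Main()
    {
        BlockingQueue<int> q = new BlockingQueue<int>();
        Thread consumer = new Thread(delegate() { Console.WriteLine(q.Dequeue()); });
        consumer.Start();
        q.Enqueue(42);
        consumer.Join();
    }
}
```

Note that there's no timeout anywhere: if a wake-up could be missed, that would be a bug to fix rather than something to paper over.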
Lock-freedom and volatiles
If you’re looking at lock-free code, you need to have a firm grasp on the CLR’s memory model. See http://www.bluebytesoftware.com/blog/2007/11/10/CLR20MemoryModel.aspx for an overview. Don’t think about the machine, think about the logical abstraction provided by the memory model. You also need a firm grasp on the invariants of the data structures involved. Specifically you are looking to see if the structure could ever move into a state, visible by another thread, where one of these invariants doesn’t hold. I explicitly permute (often on a whiteboard or in notepad) the sections of the code that involve shared loads and stores, using knowledge of the legal reorderings given our memory model, to see if the code breaks.
Any variable marked as volatile should be a red flag to carefully review all use of that variable. For every single read and every single write of that variable, you must look at it and convince yourself of why volatile is necessary. If you can’t, ask the person who wrote the code. Sometimes volatile is used because most (but not all) call sites need it; that’s often acceptable. Leaving the variable as non-volatile and selectively using Thread.VolatileRead for the reads that need it is typically too costly. Anyway, comments should always be used to explain why each load and store is volatile, even if it doesn’t strictly need the volatile semantics.
Conversely, any variable that is apparently shared, but not marked volatile, should be an even redder flag. It’s very likely that this is a mistake. Recall that writes happen in-order with the CLR’s memory model, but that reads do not. Anytime there is a relationship between multiple shared variables that are written and read together (without the protection of a lock), they typically both need to be volatile.
Any reads of shared variables used in a tight loop must be marked volatile. Otherwise the compiler may decide to hoist them, causing an infinite loop. This holds even if they are retrieved via simple method calls like property accessors, because those calls may be inlined.
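Here's a minimal sketch of the tight-loop case, using a hypothetical Worker class. The flag is read through a property accessor on every iteration, and the comment on the field says why it is volatile, as suggested above.

```csharp
using System;
using System.Threading;

// Hypothetical example type: not from any library.
class Worker
{
    // volatile: read on every iteration of the loop in Run by one thread and
    // written by another thread in Stop. Without it, the read may be hoisted
    // out of the loop and the loop may never terminate.
    private volatile bool m_stop;
    private int m_iterations;

    public bool ShouldStop { get { return m_stop; } }
    public int Iterations { get { return m_iterations; } }

    public void Run()
    {
        // The accessor may well be inlined; it is still a volatile read.
        while (!ShouldStop)
            m_iterations++;
    }

    public void Stop() { m_stop = true; }
}

class Program
{
    public static void Main()
    {
        Worker w = new Worker();
        Thread t = new Thread(w.Run);
        t.Start();
        Thread.Sleep(50);
        w.Stop();
        t.Join();
        Console.WriteLine("Stopped after {0} iterations", w.Iterations);
    }
}
```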
Thread.MemoryBarrier should typically only occur to deal with store (release) followed by load (acquire) reordering problems. And it’s usually a better idea to use Interlocked.Exchange for the store instead, since it implies a full barrier but combines the write. Sometimes a fence can be used to flush write buffers (like when releasing a spin-lock, to avoid giving the calling thread the unfair ability to turn right around and reacquire it), but this is extremely rare, and often an indication that somebody has an inaccurate mental model of what the fence is meant to do.
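The store-followed-by-load problem is easiest to see in a two-flag handshake, so here's a minimal sketch of one. TwoThreadGate is a hypothetical type made up for this illustration, and it assumes the Volatile class, which arrived in later versions of .NET than the ones discussed here. Each side publishes its own flag with Interlocked.Exchange and then reads the other side's flag. Because the exchange is a full fence, the load cannot move ahead of the store, so both sides can never succeed at once.

```csharp
using System;
using System.Threading;

// Hypothetical example type: not from any library.
class TwoThreadGate
{
    private int m_flag0; // 1 while participant 0 wants (or holds) the gate
    private int m_flag1; // 1 while participant 1 wants (or holds) the gate

    // id must be 0 or 1. Returns false if the other side got there first.
    public bool TryEnter(int id)
    {
        return id == 0
            ? TryEnterCore(ref m_flag0, ref m_flag1)
            : TryEnterCore(ref m_flag1, ref m_flag0);
    }

    private static bool TryEnterCore(ref int mine, ref int other)
    {
        // The store and a full fence in one operation: the load of 'other'
        // below cannot be reordered before this store.
        Interlocked.Exchange(ref mine, 1);
        if (Volatile.Read(ref other) == 0)
            return true;

        Volatile.Write(ref mine, 0); // back off
        return false;
    }

    public void Exit(int id)
    {
        if (id == 0) Volatile.Write(ref m_flag0, 0);
        else Volatile.Write(ref m_flag1, 0);
    }
}

class Program
{
    public static void Main()
    {
        TwoThreadGate gate = new TwoThreadGate();
        Console.WriteLine(gate.TryEnter(0)); // True
        Console.WriteLine(gate.TryEnter(1)); // False: participant 0 holds the gate
        gate.Exit(0);
        Console.WriteLine(gate.TryEnter(1)); // True
    }
}
```

If the first store were an ordinary (even volatile) write, both sides could read 0 and both would enter. This sketch only guarantees that they never both succeed; they can both fail and have to retry.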
Custom spin waiting should be used rarely. If you see it used, the person may not be aware that spin waits need special attention: to work well on HT machines, yield properly to all runnable threads with appropriate amortized costs, to spin only for a reasonable amount of time (in other words, less than the duration of a context switch), and so on. Thread.SpinWait does not do what most people expect, since it only covers the first. Kindly let them know about these things. If any spin waiting is used in a codebase, it’s far better to consolidate all usage into a single primitive that does it all.
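As one possible shape for such a consolidated primitive, here's a minimal sketch of a spin-then-block latch. SpinThenBlockLatch is a hypothetical name. It leans on the System.Threading.SpinWait struct and the Volatile class, both of which shipped in later versions of .NET than the ones this post was written against, so treat those as assumptions.

```csharp
using System;
using System.Threading;

// Hypothetical example type: not from any library.
class SpinThenBlockLatch
{
    private int m_set; // 0 = not signaled, 1 = signaled
    private readonly ManualResetEvent m_event = new ManualResetEvent(false);

    public void Set()
    {
        Volatile.Write(ref m_set, 1);
        m_event.Set();
    }

    public void Wait()
    {
        SpinWait spinner = new SpinWait();

        // Spin only while SpinWait is still busy-waiting. Once it reports
        // that the next spin would yield the thread, the wait has lasted
        // long enough that blocking is the better choice.
        while (!spinner.NextSpinWillYield)
        {
            if (Volatile.Read(ref m_set) != 0)
                return;
            spinner.SpinOnce();
        }

        // Returns immediately if Set already ran.
        m_event.WaitOne();
    }
}

class Program
{
    public static void Main()
    {
        SpinThenBlockLatch latch = new SpinThenBlockLatch();
        Thread setter = new Thread(delegate()
        {
            Thread.Sleep(10);
            latch.Set();
        });
        setter.Start();
        latch.Wait();
        setter.Join();
        Console.WriteLine("Signaled");
    }
}
```

Keeping the spinning policy in one place like this means the policy can be tuned (or deleted) without hunting through the codebase for hand-rolled loops.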
Wrapping up
At the end of each review, ask yourself whether all of the concurrency-oriented parts of the code were clearly explained in the design doc for the feature. Did this carry over to clearly written comments in the implementation? These are some really hard issues to get your head around, so the time spent reviewing the code should not be lost. Somebody, someday down the road, will need to understand the code again (perhaps so that they can maintain it, test it, etc.), and it is your responsibility as a member of the team—regardless of whether you wrote the code—to do your part in making that feasible. You should explicitly go back to the design doc and suggest areas for clarification.
 Wednesday, February 27, 2008
I’ve mentioned before that the CLR has a central wait routine that is used by any synchronization waits in managed code. This covers WaitHandles (AutoResetEvent, ManualResetEvent, etc.), CLR Monitors (Enter, Wait), Thread.Join, any APIs that use such things, and the like. This routine even gets involved for waits that are internal to the CLR VM itself. This is primarily done so that the runtime can pump appropriately on STAs, and was later used to experiment with a fiber-mode scheduler in SQL Server. Two years ago I showed how to use these capabilities to build a deadlock detection tool via the CLR’s hosting APIs. Sadly IO-based waits (like FileStream.Read) do not route through this.
The System.Threading.SynchronizationContext class has a very cool (but not widely known) feature that enables you to extend this central wait routine. To do so requires four steps: subclass SynchronizationContext; call base.SetWaitNotificationRequired; override the virtual Wait method to contain some custom wait logic; and then register your SynchronizationContext via the static SynchronizationContext.SetSynchronizationContext method. After you do this, most waits that occur on that thread will be redirected through your custom Wait method.
Here's a very simple example of this:
    using System;
    using System.Threading;

    class BlockingNotifySynchronizationContext : SynchronizationContext
    {
        public BlockingNotifySynchronizationContext()
        {
            SetWaitNotificationRequired();
        }

        public override int Wait(
            IntPtr[] waitHandles, bool waitAll, int millisecondsTimeout)
        {
            Console.WriteLine("Begin wait: {0} handles for {1} ms",
                waitHandles.Length, millisecondsTimeout);
            int ret = base.Wait(waitHandles, waitAll, millisecondsTimeout);
            Console.WriteLine("Finished wait");
            return ret;
        }
    }

    class Program
    {
        public static void Main()
        {
            SynchronizationContext.SetSynchronizationContext(
                new BlockingNotifySynchronizationContext());
            ManualResetEvent mre = new ManualResetEvent(false);
            mre.WaitOne(1000, false);
        }
    }
If you run this, you'll see some messages printed to the console to do with beginning and finishing waits.
A few things are worth noting:
- The Wait signature looks a lot like WaitForMultipleObjects. In fact, it's fairly trivial to turn around and call it via a P/Invoke. Recovering from APCs is a tad tricky however, and you'd have to do all of your own timeout management, message pumping, and the like.
- You receive an IntPtr[], making it incredibly difficult to correlate the objects being waited on with the actual synchronization objects from which they came (e.g. Monitors, EventWaitHandles, etc.).
- The code that runs inside Wait is the wait itself. In other words, when you return, whatever code initiated the wait is going to assume that the API is being honest and truthful.
Another subtlety is that this code, as written, is subject to stack overflow. Why is that? In this particular instance, Console.WriteLine may need to block internally because it automatically serializes access to the output stream. Well, when that blocks, it just goes through the same central wait routine, which calls back out, and so on and so forth. Obviously this extends to any code that uses locks, including CLR services like cctors. So the code you write here needs to be very carefully written so as not to ever block recursively.
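One way to defend against that recursion is a per-thread reentrancy flag. This is only a sketch under my own assumptions: the class name is hypothetical, and I'm assuming that touching a [ThreadStatic] field inside Wait does not itself block. While the flag is set, nested waits fall straight through to the default wait logic instead of running the custom code again.

```csharp
using System;
using System.Threading;

// Hypothetical example type: not from any library.
class GuardedNotifySynchronizationContext : SynchronizationContext
{
    // True while the current thread is already inside our Wait override.
    [ThreadStatic]
    private static bool t_inWait;

    public GuardedNotifySynchronizationContext()
    {
        SetWaitNotificationRequired();
    }

    public override int Wait(
        IntPtr[] waitHandles, bool waitAll, int millisecondsTimeout)
    {
        // A wait triggered from inside our own notification code (for
        // example a lock taken by Console.WriteLine) skips the custom logic.
        if (t_inWait)
            return base.Wait(waitHandles, waitAll, millisecondsTimeout);

        t_inWait = true;
        try
        {
            Console.WriteLine("Begin wait");
            int ret = base.Wait(waitHandles, waitAll, millisecondsTimeout);
            Console.WriteLine("Finished wait");
            return ret;
        }
        finally
        {
            t_inWait = false;
        }
    }
}

class Program
{
    public static void Main()
    {
        SynchronizationContext.SetSynchronizationContext(
            new GuardedNotifySynchronizationContext());
        ManualResetEvent mre = new ManualResetEvent(false);
        mre.WaitOne(100, false);
    }
}
```

The cost is that waits nested inside the callbacks go unobserved, which is usually what you want for a notification hook anyway.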
Notice that some waits do not call out. The reason is that the callout stems from a routine deep inside the CLR VM itself. Some waits may occur while a GC is in progress, at which point it’s illegal to invoke managed code. The CLR just reverts to using its own default wait logic in such cases.
Lastly this is not a foolproof mechanism. Other components can register their own SynchronizationContexts, replacing the context you’ve installed completely. This may mean you miss some blocking calls. If you are building a ThreadPool, you can always reset it each time the thread is returned, or even use your own ExecutionContexts when running them. It is also possible that such a context will exist by the time you get around to installing your own. For example, ASP.NET, WinForms, and WPF use custom SynchronizationContexts.
If such a context exists already when you install this custom one, you can always defer to it for things like CreateCopy, Send, Post, and Wait. For example, here’s a SynchronizationContext implementation that allows custom before/after wait actions, but otherwise relies on the existing SynchronizationContext (if any) for things like Send, Post, and Wait:
    using System;
    using System.Threading;

    delegate object PreWaitNotification(
        IntPtr[] waitHandles, bool WaitAll, int millisecondsTimeout);

    delegate void PostWaitNotification(
        IntPtr[] waitHandles, bool WaitAll, int millisecondsTimeout,
        int ret, Exception ex, object state);

    class BlockingNotifySynchronizationContext : SynchronizationContext
    {
        private SynchronizationContext m_captured;
        private PreWaitNotification m_pre;
        private PostWaitNotification m_post;

        public BlockingNotifySynchronizationContext(
            PreWaitNotification pre, PostWaitNotification post)
            : this(SynchronizationContext.Current, pre, post)
        {
        }

        public BlockingNotifySynchronizationContext(
            SynchronizationContext captured,
            PreWaitNotification pre, PostWaitNotification post)
        {
            SetWaitNotificationRequired();

            m_captured = captured;
            m_pre = pre;
            m_post = post;
        }

        public override SynchronizationContext CreateCopy()
        {
            return new BlockingNotifySynchronizationContext(
                m_captured == null ? null : m_captured.CreateCopy(),
                m_pre, m_post);
        }

        public override void Post(SendOrPostCallback cb, object s)
        {
            if (m_captured != null)
                m_captured.Post(cb, s);
            else
                base.Post(cb, s);
        }

        public override void Send(SendOrPostCallback cb, object s)
        {
            if (m_captured != null)
                m_captured.Send(cb, s);
            else
                base.Send(cb, s);
        }

        public override int Wait(
            IntPtr[] waitHandles, bool waitAll, int millisecondsTimeout)
        {
            object s = m_pre(waitHandles, waitAll, millisecondsTimeout);
            int ret = 0;
            Exception ex = null;

            try
            {
                if (m_captured != null)
                    ret = m_captured.Wait(waitHandles, waitAll, millisecondsTimeout);
                else
                    ret = base.Wait(waitHandles, waitAll, millisecondsTimeout);
            }
            catch (Exception e)
            {
                ex = e;
                throw;
            }
            finally
            {
                m_post(waitHandles, waitAll, millisecondsTimeout, ret, ex, s);
            }
            return ret;
        }
    }

    class Program
    {
        public static void Main()
        {
            SynchronizationContext.SetSynchronizationContext(
                new BlockingNotifySynchronizationContext(
                    delegate { Console.WriteLine("PRE"); return null; },
                    delegate { Console.WriteLine("POST"); }
                )
            );
            ManualResetEvent mre = new ManualResetEvent(false);
            mre.WaitOne(1000, false);
        }
    }
That’s a fair bit of code, but it's mostly boilerplate. It allows you to easily specify a pre/post action to be invoked upon each blocking call, and will work on ASP.NET, GUI threads, and the like. The pre action can return an object for the post action to inspect. And the post action is given the return value and exception (if any). If no SynchronizationContext was present when installed, it just defers to the base SynchronizationContext implementation of Send, Post, and Wait.
Now what you actually do inside those callbacks, I suppose, is entirely your business …
 Tuesday, February 19, 2008
 Sunday, February 17, 2008
A long time ago, I wrote that you’d never need to write another finalizer again. I’m sorry to say, my friends, that I may have (unintentionally) lied. In my defense, the blog title where I proclaimed this fact did end with “well, almost never.”
Finalizers have historically been used to ensure reclamation of resources that are finite or outside of the purview of the CLR’s GC. Native memory and Windows kernel HANDLEs immediately come to mind. Without a finalizer, resources would leak; server apps would die, client apps would page like crazy, and life would be a mess. For such resources, properly authored frameworks also provide IDisposable implementations to eagerly and deterministically reclaim the resources when they are definitely done. Three years ago, I wrote a lengthy treatise on the subject.
The finalizer is there as a backstop. It is often meant to clean up after bugs, such as when a developer forgets to call Dispose in the first place, tried to but failed due to some runtime execution path skipping it (often exceptions-related), or a framework or library author hasn’t respected the transitive IDisposable rule, meaning that eager reclamation isn’t even possible. It also avoids tricky ref-counting situations as are prevalent in native code: since the GC handles tracking references, you, the programmer, can avoid needing to worry about such low-level mechanics. In all honesty, the finalizer’s main purpose is probably that we wanted to facilitate a RAD and VB-like development experience on .NET, where programmers don’t need to think about resource management at all, unlike C++ where it’s in your face. While one could reasonably argue that IDisposable is all you need (the C++ argument), that would have gone against this goal.
Concurrency changes things a little bit. A thread is just another resource outside of the purview of the CLR’s GC, and is actually backed by a kernel object and associated resources like non-pageable memory for the kernel stack, some data for the TEB and TLS, and 1MB of user-mode stack, to name a few. They also add pressure to the thread scheduler. Threads are fairly expensive to keep around, and “user” code is responsible for creating and destroying them.
Now, it’s true that we are moving towards a world where threads and logical tasks are not one and the same. This is a ThreadPool model. But it’s also true that a task that is running on a thread is effectively keeping that thread alive, and perhaps more concerning, preventing other tasks from running on it. Use of a resource is a kind of actual resource itself, although more difficult to quantify.
So, what does all of this have to do with finalization?
If some object kicks off a bunch of asynchronous work and then becomes garbage (i.e. the consumer of that object no longer needs to access its information), then it’s possible (or even likely) that any outstanding asynchronous operations ought to be killed as soon as possible. Otherwise they will continue to use up system resources (like threads, the CPU, IO, system locks, virtual memory, and so on), all in the name of producing results that will never be needed. The only reason this task stays alive is because the scheduler itself has a reference to it.
Just as with everything discussed above pertaining to non-GC resources, we’d like it to be the case that such a component would offer two methods of cleanup:
- Dispose: to get rid of associated asynchronous computations immediately when the caller knows they no longer need the object.
- Finalization: to get rid of associated resources that are still outstanding when the GC collects the root object that is responsible for managing those asynchronous computations.
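Here's a rough sketch of what those two cleanup paths could look like. Two loud assumptions: BackgroundCounter is a hypothetical type, and I've swapped in CancellationTokenSource and CancellationToken (the cancelation types that shipped in later versions of .NET) in place of the CTP's Task.Cancel, so this is not the API described in this post.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical example type: not from any library.
class BackgroundCounter : IDisposable
{
    private readonly CancellationTokenSource m_cts = new CancellationTokenSource();
    private readonly Task m_work;

    public BackgroundCounter()
    {
        // Capture only the token, never 'this'. If the task referenced its
        // owner, the scheduler would keep the owner alive and the finalizer
        // could not run while the work is outstanding.
        CancellationToken token = m_cts.Token;
        m_work = Task.Factory.StartNew(() => Spin(token), token);
    }

    private static void Spin(CancellationToken token)
    {
        while (true)
        {
            token.ThrowIfCancellationRequested();
            Thread.Sleep(10); // stands in for real work
        }
    }

    public Task Work { get { return m_work; } }

    public void Dispose()
    {
        m_cts.Cancel(); // a non-blocking request; does not wait for the task
        GC.SuppressFinalize(this);
    }

    ~BackgroundCounter()
    {
        // Backstop for callers that forgot Dispose. Assumption: no
        // cancelation callbacks are registered, so this neither blocks nor
        // throws on the finalizer thread.
        m_cts.Cancel();
    }
}

class Program
{
    public static void Main()
    {
        Task work;
        using (BackgroundCounter counter = new BackgroundCounter())
        {
            work = counter.Work;
        }

        try { work.Wait(); }
        catch (AggregateException) { /* cancelation surfaces here */ }
        Console.WriteLine(work.Status);
    }
}
```

Both paths issue the cancelation request asynchronously. Whether Dispose should also wait for the work to wind down is a separate question.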
You’ll notice that we support cancelation in a first class and deeply-ingrained way in the Task Parallel Library. While not exposed in PLINQ (yet), there is actually cancelation support built-in (though not as fundamental as we’d like (yet)). This is a useful hook to allow us to build support for both resource reclamation models. In this sense, cancelation as a pattern of stopping expensive things from happening is quite similar to resource cleanup. Clearly they aren't identical, but we will need to figure out the specific deltas.
I should also point out that we will prefer and push structured parallelism for many reasons. Parallel.For is an example, where the API looks synchronous but is internally highly parallel. One reason we like this model is that the point at which concurrency begins and ends is very specific. The call won’t return until all work is accounted for and completed. It’s only when you bleed computations into the background after a call returns that everything stated above becomes an issue. This is obviously nice for failures (e.g. you are forced to deal with them right away), but also because it alleviates this problem nicely.
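A tiny sketch of the structured style, assuming the Parallel.For overload that takes an inclusive lower bound, an exclusive upper bound, and an Action<int> body. The SumOfSquares helper is a made-up name. The point to notice is that by the time the call returns, every iteration has completed, so there is nothing left running in the background to cancel or clean up.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    // Hypothetical helper for illustration.
    public static long SumOfSquares(int n)
    {
        long total = 0;

        // Looks synchronous to the caller, but iterations run in parallel.
        // Interlocked.Add keeps the shared accumulator correct.
        Parallel.For(0, n, i =>
        {
            Interlocked.Add(ref total, (long)i * i);
        });

        // All iterations are done here; no work has leaked past the call.
        return total;
    }

    public static void Main()
    {
        Console.WriteLine(SumOfSquares(1000)); // 332833500
    }
}
```

Hammering one shared accumulator with interlocked operations is not how you would write this for speed. It just keeps the sketch short.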
I don’t think we’re at a point where we can recommend definite tried-and-true best practices for cancelation of asynchronous work and how it pertains to resource management. I do think we need to get there by the time we ship Parallel Extensions V1.0. And I think we will. Here’s a snapshot summary of my current thinking, however, and I would love to get feedback on it:
- We should tell people to implement IDisposable and to Cancel tasks inside Dispose, when their classes own unstructured asynchronous computations.
- We may or may not want people to implement a finalizer to do the same. I currently believe we will.
- I am undecided about whether these cancelations should be synchronous. In some sense, they should be since you’d like to know that all resources have definitely been reclaimed. But this would mean blocking (possibly indefinitely) on the finalizer thread. That’s a definite no-no. Blocking in Dispose would mean blocking (possibly indefinitely) inside a finally block. That’s also a no-no, although it’s less severe of one than the finalizer. It just means hosts can’t take over threads as easily when they need to abort them. Thankfully we offer the Task.Cancel method which is non-blocking. Possibly we should suggest synchronous cancelation inside of Dispose, and asynchronous inside of the finalizer.
- If we did do synchronous anywhere, presumably with Task.CancelAndWait, we’d need to recommend a practice for communicating failures. Throwing from Dispose is discouraged, but so is swallowing failures. The kind of code usually run inside of Dispose is much less likely to generate exceptions than running arbitrary tasks full of user code. Catch-22.
- There are some cases we can do the cancelation thing ourselves. Whether we do or not is subject to debate, but I believe we should. If we ensured the scheduler’s references are weak, then once all other code in the process drops the reference, we would not schedule it. This implies that tasks are seldom executed “for effect”, which is certainly a judgment call. It might be worth exposing an option that allows “for effect” tasks to be created not subject to this rule.
- The trickiest case is when a task is already running. For short-running tasks, this may not be a huge concern, but a lot of such tasks do recursively queue up additional ones. It would be nice if the fact that its results are no longer needed somehow flowed automatically to the task, perhaps through cancelation. This also means waking tasks from blocking calls.
It’s interesting to point out that the last two items above (5 and 6) were part of the original motivation for the inventors of the future abstraction. They noted that representing computations as futures, and allowing the GC to collect them before they run once they’ve become unreachable, effectively makes computations garbage-collectable. This, I think, is a neat idea, particularly if your program uses futures pervasively.
In any case, I wanted to point these subtleties out, and hear any feedback folks out there might have. What I find particularly interesting about concurrency, as we move forward on things like Parallel Extensions, is that there are a lot of subtle implications to the way programs are written. This includes fundamental things like exceptions and resource management. There are other subtle impacts, like whether the ordering of results coming out of a computation matters. PLINQ surfaced this early on, and I didn’t expect the pervasive nature of the issue. Debugging and profiling are also extraordinarily different. I suspect we’ll continue running into many such things throughout the evolution to highly parallel software.
 Wednesday, February 13, 2008
A couple weeks back, we started filming a bunch of Channel9 videos about various aspects of the Parallel Computing team. This is the larger team responsible for Parallel Extensions to .NET, among other things. We'll of course spend some time on Parallel Extensions in upcoming videos.
But who better to kick it off than Burton Smith, a legend in the parallel computing arena?
On General Purpose Supercomputing and the History and Future of Parallelism http://channel9.msdn.com/Showpost.aspx?postid=382639
Burton's the kind of guy that you run into when meandering the hallways, have a 5 minute conversation, and walk away with at least 20 relevant and fascinating papers on parallel computing to go off and read. But now instead of reading 20 papers, you can just go watch the video.
 Saturday, February 09, 2008
Torn reads are possible whenever you read, without synchronization, a shared value that is misaligned and/or spans more than one addressable pointer-sized region of memory. This can lead to crashes and data corruption due to bogus values being seen. If not careful, torn reads can also violate type safety. If you have a static variable that points to an object of type T, and your program only ever writes references to objects of type T into it, you may still end up accessing a memory location that isn't actually a T. How could this be?
You guessed it. Torn reads apply to pointer values just as much as they do to ordinary values. So a thread reading a pointer in-flux could see bits of its value in separate pieces, blending the state before and after the update. Dereferencing this mutant pointer would lead you off into an unknown place in the address space, and most certainly not to an instance of T, breaking type safety. Since VC++ aligns pointer fields automatically, you'd have to go out of your way with __declspec(align(N)) or an unaligned allocator to create this situation. Similarly with .NET's StructLayoutAttribute. Moreover, it turns out that .NET guards against this problem in its type loader, by rejecting any types containing improperly aligned object references. This is good news, because otherwise a plethora of security vulnerabilities would be possible. But VC++ doesn't offer any such guarantees.
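For ordinary (non-pointer) values in managed code, the usual place this bites is a 64-bit field in a 32-bit process. Here's a minimal sketch, with Counter64 as a hypothetical name, that uses Interlocked.Exchange and Interlocked.Read so that the value is always written and read as a single unit.

```csharp
using System;
using System.Threading;

// Hypothetical example type: not from any library.
class Counter64
{
    // 64 bits: wider than a pointer in a 32-bit process, so plain reads and
    // writes of this field are not guaranteed to be atomic there.
    private long m_value;

    public void Set(long value)
    {
        // Writes all 64 bits as one atomic operation.
        Interlocked.Exchange(ref m_value, value);
    }

    public long Get()
    {
        // A plain 'return m_value;' could observe half of an old value and
        // half of a new one in a 32-bit process. Interlocked.Read cannot.
        return Interlocked.Read(ref m_value);
    }
}

class Program
{
    public static void Main()
    {
        Counter64 c = new Counter64();
        c.Set(0x1111111122222222L);
        Console.WriteLine("{0:X}", c.Get());
    }
}
```

A lock around both accesses would do the same job with less to think about, which is rather the point of the post.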
This is another example where trying to program in a lock-free manner can lead to difficulties that aren't present when you stick to ordinary locking.
 Saturday, February 02, 2008
I'm still plugging away at my new concurrency book.
The fact that we have cover art and that it's available for pre-order on Amazon are both great signs that it's getting close to being done.
The full TOC (as of right now) is available here. A few chapters are still in the works, and there's a bit of editing ahead. I've also decided to write Appendices on PFX and CCR. In the meantime, definitely feel free to pick up a PDF of some early content via RoughCuts. Believe it or not, this is a way to provide real-time feedback that can impact the book before it hits the presses.
I have to say this is the most complicated piece of writing I've ever undertaken. Not only does the material require going very deep in a lot of hard areas, but I've noticed that as a community we're learning to speak better and to use consistent terminology about various aspects of concurrency. To stay relevant, I'm finding a fair bit of content has had to be reworked several times.
At the same time, and from a selfish standpoint, this book has been a wonderful forcing function to learn everything there is to know about concurrency. Not that this is a realistic goal, mind you, but gosh darnit I'm trying. I'm convinced at this point that the best way to become an expert on something is to try to teach it to other people. And what better way than writing a book? If you can't explain it clearly to a broad, on-the-average-less-expert-than-you audience, you probably have a faulty mental model to begin with. The process of merely trying usually reveals this. I encourage everybody out there to try it. At least once.
 Thursday, January 17, 2008
Most schedulers try to keep the number of runnable threads (within a certain context) equal to the number of processors available. Aside from fairness desires, there’s usually little reason to have more: and in fact, having more can lead to more memory pressure due to the cost of stacks and working set held on the stacks, non-pageable kernel memory, per-thread data structures, etc., and also has execution time costs due to increased pressure on the kernel scheduler, more frequent context switches, and poor locality due to threads being swapped in and out of processors. In extreme cases, blocked threads can build up only for all of them to be awoken and released to wreak havoc on the system at once, hurting scalability.
A naïve approach of one-thread-per-processor works great until a thread on one of these processors blocks, either “silently” as a result of a pagefault or “explicitly” due to IO or a synchronization wait. (I should mention that due to the plethora of hidden blocking calls in the kernel, Win32 user-mode code, and the .NET Framework, a lot of IO and synchronization waiting is “silent” too.) In this case, a processor becomes idle (0% utilization) for some period of time. If there is other work that could be happening instead, this is clearly bad.
Many programs spend most of their time blocked.
Four particular solutions to this problem are commonplace on Windows:
- Create more threads than processors and hope for the best. This trades some amount of runtime efficiency for the insurance that processor time won’t go to waste.
- Periodically poll for blocked threads using some kind of daemon and respond to the presence of one by creating a new thread to execute work. Eventually this thread would go away, for instance when the blocked thread awakens. This is the approach used by the CLR ThreadPool, although it caps the total. (The TPL uses this approach today also, but we're changing/augmenting it.) For obvious reasons, this approach is quite flawed: you easily end up with more running threads than processors, and you have to trade off more frequent polling (which implies more runtime overhead) against less frequent polling (which adds latency to the scheduler’s response to a blocked thread).
- Block on an IO Completion Port at periodic intervals--e.g. when dispatching a new work item in a ThreadPool-like thing--which has the effect of throttling running threads. This still requires creating more threads than processors, but helps to ensure few of them run at the same time. Unfortunately, it still does lead to more of them actually running than you’d like since the port can only prevent a thread from running when it goes back and blocks on the port in the kernel. But this is only done periodically.
- Specialized systems like SQL have used Fibers in the past to avoid needing full-fledged threads to replace the blocking ones. To do this, they ensure all blocking goes through a cooperative layer, which notifies a user-mode scheduler (UMS). The user-mode scheduler maintains a list of blocked Fibers, but can multiplex runnable Fibers onto threads, keeping the number of threads equal to the number of processors. A thread effectively never blocks, Fibers do, but this requires all blocking to notify the UMS. Aside from extraordinarily closed world systems, this approach doesn’t usually work. That’s because Fibers are not threads and multiplexing entirely different contexts of work onto a shared pool of threads (at blocking points) can easily lead to thread affinity nightmares.
The CLR facilitates the fourth approach by funneling all synchronization waits in managed code through one point in the VM codebase. This was done initially to ensure consistent message pumping on STA threads, via CoWaitForMultipleHandles. But it was then exploited in 2.0 to expand the CLR Hosting APIs to enable custom hosts to hook all synchronization calls. This is convenient for building interesting debugging tools, like deadlock detecting hosts.
A fifth approach is often viable and even preferable, and that is to avoid blocking altogether. Often referred to as continuation passing style (CPS), the idea is that, where you’d normally have blocked, the callstack is transformed into a resumable continuation. For an example of this, look at Jeff Richter’s ReaderWriterLockGate class: it’s a reader/writer lock with no blocking. Asynchronous IO is supported by files and sockets on Windows, and enables a similar style of programming. The continuation is ordinarily just a closure object that has enough state to restart itself when the sought-after condition arises. When it does arise, the continuation is scheduled on something like the CLR ThreadPool. This avoids burning any threads while the wait occurs.
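To give a flavor of the style, here's a minimal sketch of a mutual-exclusion gate that queues continuations instead of blocking threads. AsyncGate is a hypothetical type written for this illustration; it is not Jeff Richter's ReaderWriterLockGate and makes no attempt to match its API. A caller hands over a closure to run once the gate is free, and the closure is scheduled on the CLR ThreadPool when its turn comes.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical example type: not from any library.
class AsyncGate
{
    private readonly object m_lock = new object(); // held only for a few instructions
    private readonly Queue<Action> m_waiters = new Queue<Action>();
    private bool m_held;

    // Runs 'continuation' once the gate is free. Never blocks the caller.
    public void Enter(Action continuation)
    {
        lock (m_lock)
        {
            if (m_held)
            {
                // Instead of blocking a thread, remember what to run later.
                m_waiters.Enqueue(continuation);
                return;
            }
            m_held = true;
        }
        ThreadPool.QueueUserWorkItem(delegate { continuation(); });
    }

    // Called by a continuation when it is done with the gate.
    public void Exit()
    {
        Action next = null;
        lock (m_lock)
        {
            if (m_waiters.Count > 0)
                next = m_waiters.Dequeue(); // ownership passes straight to the next waiter
            else
                m_held = false;
        }
        if (next != null)
            ThreadPool.QueueUserWorkItem(delegate { next(); });
    }
}

class Program
{
    public static void Main()
    {
        AsyncGate gate = new AsyncGate();
        int counter = 0;
        CountdownEvent done = new CountdownEvent(100);

        for (int i = 0; i < 100; i++)
        {
            gate.Enter(delegate
            {
                counter++;      // safe: only one continuation runs at a time
                gate.Exit();
                done.Signal();
            });
        }

        done.Wait(); // the demo blocks here only so the process doesn't exit early
        Console.WriteLine(counter);
    }
}
```

No thread is ever burned waiting for the gate: a contended Enter costs a queue entry, not a stack.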
For obvious reasons, CPS is usually hard to achieve in .NET: there is no language support for first class continuations in .NET, all synchronization primitives are wait-based, and keeping a whole stack around in memory would be a terrible idea anyway. You’d also need to worry about resources held on the stack, including locks. Instead, you should save only that state which is needed during the continuation. In a message passing system this is much simpler, since most of the program is full of continuations in the form of message handlers. For an example of such a system, check out the Concurrency and Coordination Runtime (CCR) and/or Erlang.
Even in message passing systems, it’s impossible to escape the fundamental blocking issue, since it is platform-wide. And in ordinary imperative programs, the CPS transformation is near-impossible at the leaves of callstacks: unless you have whole program knowledge, who knows what your caller expects? Most APIs are synchronous. Futures and Promises potentially make this style of programming easier, though in the extreme all APIs would need to return a Promise rather than a true value.
Nothing conclusive, just some random thoughts ...
 Wednesday, January 02, 2008
I've been very remiss with blogging lately. This was mostly due to travel (5 weeks out of the past 2 months), but also because I'm focused intensely on the book, and have been super busy working on Parallel Extensions: the December CTP, hiring for our team, planning what we do in the next year, designing stuff, implementing stuff, ... it's been a lot of fun. I've also been trying to learn classical guitar after 15 years of playing electric and some acoustic (metal, rock, blues). It's more challenging than I anticipated... I guess I should have started when I was six.
Somewhere in there, I recorded an episode of .NET Rocks with Carl and Richard about Parallel Extensions: check it out. I haven't listened to it yet, but I distinctly remember having a lot of fun. When the hour was up, I couldn't believe how quickly the time had passed.
Happy New Year, everybody. I promise to blog more in the coming months. Famous last words, eh? ;)
 Thursday, November 29, 2007
Today is an extraordinarily exciting day for me. After about two years of work by several great people across the company, the first Parallel Extensions (a.k.a. Parallel FX) CTP has been posted to MSDN. Check out Soma’s blog post for an overview, and the new MSDN parallel computing dev center for more details. Keep an eye on the team’s new blog too, as we’ll be posting a lot of content there as we make progress on the library; in fact, thanks to Steve (who writes blog posts in his sleep), there’s already a bunch of reading to catch up on!
I began kicking the tires on PLINQ back in October of 2005. In September of 2006, I described PLINQ as “a fully functional prototype” and “research.” Well, it’s come a long way since then, and we’re finally ready for real human beings to start hammering on it. Not only that, but we’ve expanded the scope of the original project significantly, from PLINQ to Parallel FX, to include new imperative data parallel APIs (for situations where expressing a problem in LINQ is unnatural), powerful task APIs that offer waiting and cancelation, all supported by a common work scheduler based on CILK-style work-stealing techniques developed in collaboration with Microsoft Research. And there’s even more to come. Daniel Moth spilled some beans in his screencast on Channel9 when he described the additional data structures, like synchronization primitives and scalable collections, which will come online later. Some of them are even in the new CTP, but have remained internal for now.
The shift to parallel computing will have an industry-wide impact, and will undoubtedly take several phases and many years to tame completely. We have focused on the lowest hanging fruit and the most important foundational shifts in direction we can incite—like encouraging the over-representation of latent parallelism to aid in future scalability—but there are certainly things that the current CTP doesn’t fully address. GPGPUs, verifiable thread safety, automatic parallelism, great tools support, etc., are all topics that are of great interest to us. We have a lot of work to do for the final release of Parallel FX, and expect a whole lot of feedback from the community on specific features and general direction. So let us have it! You can use our Connect site, or even just email me directly at joedu AT you-know-where DOT com.
Consider this an early Christmas present. Now you have something fun to do, in the privacy of your own office, when trying to avoid family members during the holidays. Whoops—did I say that out loud? Enjoy!
 Saturday, November 17, 2007
I recently described an approach to adding immutability to existing, mutability-oriented programming languages such as C#. When motivating such a feature, I alluded to the fact that immutability can make concurrent programming simpler. This seems obvious: a major difficulty in building concurrent systems today is dealing with the ever-changing state of “the world,” requiring synchronization mechanisms to control concurrent reads and writes. This synchronization leads to inefficiencies in the end product, complexity in the design process, and, if not done correctly, bugs: race conditions, deadlocks due to the lack of composability of traditional locking mechanisms, and so forth.
Lock-free algorithms simplify matters (in some ways) by compressing state transitions into single, atomic writes. For instance, a lock-free stack has a single head node; pushing requires swapping the current head with the new node to be pushed, and popping entails swapping the current head with its current next pointer. I mentioned some of the benefits of lock-freedom here. Sadly these lock-free techniques do not really compose. For instance, if I want to pop from a lock-free stack and push onto another, in an atomic fashion, single-word compare-and-swap is insufficient. We’ll return to this later in the context of immutability.
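For readers who have not seen one, the single-head stack described above might look roughly like the following. This is a minimal sketch of the well-known compare-and-swap stack, not code from any library; the name LockFreeStack is chosen for the example, and it assumes a .NET version that provides Volatile.Read.

```csharp
using System.Threading;

public sealed class LockFreeStack<T>
{
    private sealed class Node
    {
        internal readonly T Value;
        internal Node Next;
        internal Node(T value) { Value = value; }
    }

    private Node m_head; // The single word that every state transition swaps.

    public void Push(T value)
    {
        Node node = new Node(value);
        Node oldHead;
        do
        {
            oldHead = Volatile.Read(ref m_head);
            node.Next = oldHead; // Link to the head we observed.
        }
        while (Interlocked.CompareExchange(ref m_head, node, oldHead) != oldHead);
    }

    public bool TryPop(out T value)
    {
        Node oldHead;
        do
        {
            oldHead = Volatile.Read(ref m_head);
            if (oldHead == null)
            {
                value = default(T);
                return false; // Empty stack.
            }
        }
        while (Interlocked.CompareExchange(ref m_head, oldHead.Next, oldHead) != oldHead);

        value = oldHead.Value;
        return true;
    }
}
```

Each operation commits with exactly one CompareExchange on m_head, which is also why moving an element between two such stacks atomically is out of reach: that would require changing two heads at once.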
I will draw an analogy between lock-free algorithms and synchronization involving immutable objects. Imagine an immutable type represents “the world” in a concurrent system. Threads are constantly interacting with the world, by reading its state, and sometimes changing it. Because the world is immutable, we must copy it and publish an entire new one any time we wish our changes to become visible. This is key! It enables optimistic concurrency and synchronization protocols more similar to lock-free algorithms than lock-based algorithms.
Because individual components inside the world can’t change, the entire world can be read from without synchronization. Moreover, the creation of the new world needn’t use synchronization either, so long as we are willing to tolerate the possibility of wasted work, as is always the case with optimistic concurrency: we are being optimistic that the work will indeed not have been wasted, because writes are infrequent. That’s the basic premise of most concurrency algorithms, e.g. reader/writer locks. What requires synchronization is merely the publication of the new world.
    internal static World s_theWorld = new World(…); // The initial world.

    internal void ReadTheWorld()
    {
        World w = s_theWorld; // This copy never changes!
        // Read state from the world …
    }

    internal void ChangeTheWorld()
    {
        World oldWorld;
        World newWorld;
        do
        {
            oldWorld = s_theWorld; // Read the world.
            // Read state to compute the new world …
            newWorld = new World(…);
        }
        while (Interlocked.CompareExchange(ref s_theWorld, newWorld, oldWorld) != oldWorld);
    }
In this scheme, there will always be inherent races between threads trying to call ReadTheWorld and threads trying to call ChangeTheWorld. This is a basic characteristic of the system. But it is more functional. If ReadTheWorld produces correct answers for any world sequentially, then it will also produce correct answers in a parallel program too, since worlds cannot change. And because writes are guaranteed to retire in order on the CLR 2.0 memory model, we can be assured that ReadTheWorld will not be subject to memory reordering bugs.
This is a very powerful technique, making code that reads from and changes the world much simpler to write, understand, and debug. How many times have you longed for a programming technique where code can be tested in a sequential context and still be guaranteed to produce the right answers in a parallel context?
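To see the pattern run end to end, here is a deliberately tiny, concrete instance. The World type with a single Population field and the WorldHolder class are assumptions invented for this sketch; a real world would carry far more state.

```csharp
using System.Threading;

// A trivially small immutable "world": one readonly field, set at construction.
public sealed class World
{
    public readonly int Population;
    public World(int population) { Population = population; }
}

public static class WorldHolder
{
    private static World s_theWorld = new World(0); // The initial world.

    // Readers take a snapshot; the snapshot itself can never change.
    public static World Read()
    {
        return Volatile.Read(ref s_theWorld);
    }

    // Writers copy, modify the copy, and try to publish it atomically.
    public static void AddPeople(int count)
    {
        World oldWorld;
        World newWorld;
        do
        {
            oldWorld = Volatile.Read(ref s_theWorld);
            newWorld = new World(oldWorld.Population + count);
        }
        while (Interlocked.CompareExchange(ref s_theWorld, newWorld, oldWorld) != oldWorld);
    }
}
```

If several threads call AddPeople concurrently, no update is lost: a thread whose CompareExchange fails simply recomputes from the newer world and tries again.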
With all that said, it’s sadly not applicable to every synchronization problem. Some of the problems that arise can be worked around, and others are more difficult. I will outline the most difficult of them.
Problem 1. As with most lock-free algorithms, livelock is a distinct possibility.
Livelock occurs because many threads may attempt to compute a new world simultaneously. If two threads read the same old world, only one will succeed at publishing its copy of the new world. The other thread will have to go back ‘round the loop, re-read the current world, compute a new one, and try again. Computing a new world may take some time which, coupled with high arrival rates, may lead to unfair forward progress and wasted work due to spinning. We hope, however, that in most cases the frequency of writes is small enough to make this a less pressing issue. Also note that the algorithm shown above is truly lock-free: the failure of one thread indicates that another thread has succeeded in publishing a new world. So forward progress of the system as a whole is not compromised by this problem, even though an individual thread may have to retry several times.
Problem 2. ChangeTheWorld may need to perform impure, irreversible side effects.
A nasty issue in today’s programming languages is the reliance on side-effects. (OK, that’s unfair. This is a nasty issue in today’s conventional programming techniques and styles, and the languages we use simply accommodate these techniques. This is mostly because, if we didn’t, developers would find a way to circumvent the system, prefer to use alternative languages, and so on.) But if, corresponding with the update of the world, another side effect must be made, this technique doesn’t accommodate that. We could work around that with other synchronization techniques but it is likely to be cumbersome. In some cases, there is no harm in performing a side effect, so long as it does not “undo” the side effect associated with a newer world.
For instance, say we needed to update a GUI with the results of the new world computation. We would want to ensure the GUI was refreshed with the most recent update. Thankfully we don’t need to worry about the world changing itself, so we may do something like this:
    internal void ChangeTheWorld()
    {
        … as before …
        // The cast gives the anonymous method a concrete delegate type (Windows Forms' MethodInvoker).
        myGuiControl.BeginInvoke((MethodInvoker)delegate
        {
            if (s_theWorld == newWorld)
            {
                // Update the GUI with newWorld.
            }
        });
    }

This ensures we only update the GUI with the latest world. Imagine the sequence: world A is published, world B is published, we refresh the GUI with world B, we receive the BeginInvoke associated with world A now (out of order, which is perfectly reasonable). We must make sure that the graphical depiction of world A doesn’t overwrite world B. Thankfully the automatic mutual exclusion inherent in GUI programming ensures that our dirty check s_theWorld == newWorld is sufficient to prevent this from occurring. Clearly this technique can’t always apply.
Problem 3. This technique doesn’t expand to support multiple immutable fields.
This is a fundamental flaw with techniques that rely on a single, atomic compare-and-swap. Processor vendors have recently published papers for technologies like transactional memory and multi-word compare-and-swap (and in fact, there is plenty of research into pure software implementations of both), which would allow us to fix this problem.
The title of this article, by the way, is a play on Wadler’s paper ‘Linear types can change the world!’ I highly recommend that paper to anybody interested in immutability, isolation, and purity. This paper demonstrates type system support for linearity which, in essence, enables safe reading and mutual exclusion via the type system instead of locking. There is only one world, after all, the paper argues, so the traditional functional programming approach of disallowing mutability is actually a bit unrealistic.
In summary, immutable types add a lot of simultaneous power and simplicity to parallel programming. They are certainly not a panacea, since they require shifting to a more functional style of programming, where side-effects are discouraged and state is updated by making altered copies. This is often non-trivial. But this is one of many steps in what I consider to be the right direction. A direction that will enable us to tolerate large amounts of parallelism... without causing all of us to die of race-condition-induced anxiety attacks before the age of 30.
 Sunday, November 11, 2007
I’ve been asked a number of times about immutable types support for C#. Although C# doesn't offer first class language support in the way that F# does, you can get pretty far with what you do have in your hands already. Nothing prevents you from creating immutable data structures today, of course, but the problem is that there’s no compiler or runtime support to ensure you’ve done it right.
I just hacked together some new attributes and a handful of FxCop rules as an experiment. I’ve been very happy with the result. Sure it’s not baked into the language, but it’s a start. If there’s any interest, I can make the code available so you can play with it too.
Attributes and analysis
Imagine we had an ImmutableAttribute. Annotating a type with it indicates that objects of that particular type are immutable, i.e. that their state never changes after being constructed. This is great from a concurrency standpoint because it means access to such objects does not require synchronization. This can lead to more efficient code that not only has a higher chance of being correct but is also vastly easier to maintain. Well, what kind of restrictions would such a type be subject to?
1. Immutable types must have only initonly fields.
The first rule takes advantage of existing CLR type system and language support for initonly fields (a.k.a. readonly in C#). Marking a field as initonly ensures it is never written to after the constructor has finished executing. So long as all fields are initonly, the class is effectively already “shallow” immutable.
2. Immutable types must only contain fields of immutable types.
The second rule ensures transitivity, or “deep” immutability. The mutable state of a complex object typically, but not always, comprises not only its own fields but also the state of the objects it refers to. With this rule and the prior rule, we’re about 90% there.
3. Immutable types should only subclass other immutable types.
To give the appearance that a particular object is immutable, that object’s type must not depend on other types that are mutable, as articulated by the previous rule. The ‘base’ reference is effectively just another field, and so this rule is derived from the previous one. If an immutable type could inherit mutable members and fields, then it wouldn’t really be immutable after all.
4. Mutable types should not subclass immutable types.
Similar to the previous rule, it is also a bad thing if a mutable subtype can override behavior from the base type, giving the appearance of mutability. Say we have an immutable class IC with a virtual method f, and some mutable subclass MC overrides f to introduce logic that logically mutates the state of an object. Although the rules above are sufficient to ensure that the object is physically immutable, this can circumvent immutability safety through polymorphism. A related piece of advice: public immutable types should be sealed, to prevent outside classes that do not abide by immutability analysis from breaking code which assumes a given type is immutable. Alternatively, virtual members can be eliminated.
5. Immutable types must not leak the ‘this’ reference during construction.
This rule is subtle. Although initonly ensures fields are never written to after construction, these fields may be written to any number of times while an object is actively being constructed. If some code called during construction publishes a reference to the object (e.g. by storing it in a static variable), then other threads might access the object while it still appears to be mutable. They may witness a partially initialized object, an object whose fields are still changing, and so on.
And that’s it! 5 simple rules. It may sound complicated, but the code to perform the static analysis for all but the last one is straightforward and a dozen-or-so lines apiece. A few things are worth mentioning. First, the CLR’s memory model ensures that, after an object is constructed and published, reads of its fields cannot be reordered to break immutability. Additionally, there are many immutable types in the CLR today that are not verified as such. So in my analysis rules, I have hard-coded a set of well known immutable types so that you can use them without problem: Object, String, DateTime, TimeSpan, Boolean, Byte, SByte, Int16, Int32, Int64, IntPtr, UInt16, UInt32, UInt64, UIntPtr, Decimal, Double, Single, and ValueType. Ideally if we supported this in the .NET Framework, we'd annotate them.
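To give a feel for what checking the first three rules involves, here is a rough sketch that uses runtime reflection instead of FxCop. That is a different technique from the static analysis described above, chosen only because it fits in a few lines. The ImmutableAttribute defined here, the ImmutabilityChecker class, and the sample types are all assumptions made up for the example, and the hard-coded list of known immutable types is abbreviated. Rules 4 and 5 are not checked.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical stand-in for the attribute discussed in the text.
[AttributeUsage(AttributeTargets.Class | AttributeTargets.Struct)]
public sealed class ImmutableAttribute : Attribute { }

public static class ImmutabilityChecker
{
    // Abbreviated list of framework types assumed to be immutable.
    private static readonly HashSet<Type> s_knownImmutable = new HashSet<Type>
    {
        typeof(object), typeof(ValueType), typeof(string), typeof(DateTime),
        typeof(TimeSpan), typeof(bool), typeof(int), typeof(long),
        typeof(double), typeof(decimal)
    };

    public static bool IsImmutable(Type type)
    {
        return IsImmutable(type, new HashSet<Type>());
    }

    private static bool IsImmutable(Type type, HashSet<Type> visited)
    {
        if (s_knownImmutable.Contains(type)) return true;
        if (!visited.Add(type)) return true; // Cycle or repeat: already being checked.
        if (!type.IsDefined(typeof(ImmutableAttribute), false)) return false;

        // Rule 3: the base type must itself be immutable.
        if (type.BaseType != null && !IsImmutable(type.BaseType, visited)) return false;

        const BindingFlags flags = BindingFlags.Instance | BindingFlags.Public |
                                   BindingFlags.NonPublic | BindingFlags.DeclaredOnly;
        foreach (FieldInfo field in type.GetFields(flags))
        {
            if (!field.IsInitOnly) return false;                   // Rule 1: initonly fields only.
            if (!IsImmutable(field.FieldType, visited)) return false; // Rule 2: immutable field types.
        }
        return true;
    }
}

[Immutable]
public sealed class Point
{
    private readonly int m_x;
    private readonly int m_y;
    public Point(int x, int y) { m_x = x; m_y = y; }
    public int X { get { return m_x; } }
    public int Y { get { return m_y; } }
}

[Immutable]
public sealed class BrokenPoint
{
    public int X; // Violates rule 1: not readonly.
}

[Immutable]
public sealed class ArrayHolder
{
    private readonly int[] m_items = new int[0]; // Violates rule 2: arrays are mutable.
    public int Length { get { return m_items.Length; } }
}
```

With these definitions, ImmutabilityChecker.IsImmutable accepts Point and rejects both BrokenPoint and ArrayHolder.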
Impact to imperative programming
Programming in an immutable world is rather tricky. As someone who has done most of his programming for the past 10+ years in C-style languages, I just take for granted that data structures change over time. With immutability, there tends to be a whole lot more copying and functional-style function calling, where data structures are passed as an input argument and the “mutated” copy is returned as an output argument. I’m trying to kick the mutability habit, since I fully believe immutability is one key to being successful with massive degrees of parallelism. And it usually leads to cleaner code too.
But it’s hard. Using something as simple as an array field on an immutable type will fail the above rules, since the CLR’s array types are mutable. I’ll explore building one below, but this probably points to a need for better immutability support in the .NET Framework. It’s not too difficult to imagine providing base classes for common needs when building immutable data structures.
Circumventing the analysis
As you begin to explore immutable types in a bit more depth, you’ll realize there are often cases where immutability-by-cleverness is possible. That is to say, although one or more of the rules above have been violated, the end result still appears to be immutable. I can build an immutable list out of a linked list to avoid depending on CLR arrays, and mark the nodes as immutable, but they must refer to elements stored within the list. Should we require the elements to also be immutable? Perhaps, but perhaps not, depending on whether you consider the state of the list to also include the state of the elements inside it. Usually that wouldn't be the case. And, besides, if we know what we’re doing, we can create an immutable list based on an array anyway, which enables O(1) IList<T>-style random access. We just need to be careful to encapsulate the array object and to never store an element into it post-construction.
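The linked-list alternative mentioned above can be sketched in a few lines. ConsList is a hypothetical name for this illustration, and it makes no attempt to decide whether the elements themselves must be immutable.

```csharp
using System.Collections;
using System.Collections.Generic;

// A minimal immutable singly linked list. Every field is readonly, and
// "adding" an element allocates one node that shares the existing tail.
public sealed class ConsList<T> : IEnumerable<T>
{
    public static readonly ConsList<T> Empty = new ConsList<T>();

    private readonly T m_head;
    private readonly ConsList<T> m_tail;
    private readonly int m_count;

    // The empty list: no head, no tail, count of zero.
    private ConsList() { }

    private ConsList(T head, ConsList<T> tail)
    {
        m_head = head;
        m_tail = tail;
        m_count = tail.m_count + 1;
    }

    public int Count { get { return m_count; } }

    // Returns a new list with the value at the front; this list is untouched.
    public ConsList<T> Prepend(T value)
    {
        return new ConsList<T>(value, this);
    }

    public IEnumerator<T> GetEnumerator()
    {
        for (ConsList<T> node = this; node.m_count > 0; node = node.m_tail)
        {
            yield return node.m_head;
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

Prepending is O(1) and copies nothing, but indexed access is O(n), which is the trade-off against the array-backed approach.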
To facilitate working around some of the rules in ways that are often necessary, I have provided options on ImmutableAttribute to suppress certain checks. Additionally, there is a MutableAttribute which can mark certain fields to indicate they are not subject to the same restrictions as other fields on an immutable type.
An ImmutableList<T>
As an illustration, here is an ImmutableList<T>. It implements IList<T>, but sadly it must throw exceptions in several circumstances because both IList<T> and ICollection<T> offer methods that are intrinsically mutable. Undoubtedly there are bugs because I whipped it up quickly and have omitted a lot of needed error checking. I just wanted to give the general idea of how it might be done:

    using System;
    using System.Collections;
    using System.Collections.Generic;

    /// <summary>
    /// A list that has been written to be observationally immutable. A mutable array
    /// is used as the backing store for the list, but no mutable operations are offered.
    /// </summary>
    /// <typeparam name="T">The type of elements contained in the list.</typeparam>
    [Immutable]
    public sealed class ImmutableList<T> : IList<T>
    {
        [Mutable]
        private readonly T[] m_array;

        /// <summary>
        /// Create a new list.
        /// </summary>
        public ImmutableList()
        {
            m_array = new T[0];
        }

        /// <summary>
        /// Create a new list, copying elements from the specified array.
        /// </summary>
        /// <param name="arrayToCopy">An array whose contents will be copied.</param>
        public ImmutableList(T[] arrayToCopy)
        {
            m_array = new T[arrayToCopy.Length];
            Array.Copy(arrayToCopy, m_array, arrayToCopy.Length);
        }

        /// <summary>
        /// Create a new list, copying elements from the specified enumerable.
        /// </summary>
        /// <param name="enumerableToCopy">An enumerable whose contents will
        /// be copied.</param>
        public ImmutableList(IEnumerable<T> enumerableToCopy)
        {
            m_array = new List<T>(enumerableToCopy).ToArray();
        }

        /// <summary>
        /// Retrieves the immutable count of the list.
        /// </summary>
        public int Count
        {
            get { return m_array.Length; }
        }

        /// <summary>
        /// A helper method used below when a mutable method is accessed. Several
        /// operations on the collections interfaces IList&lt;T&gt; and
        /// ICollection&lt;T&gt; are mutable, so we cannot support them. We offer
        /// immutable versions of each.
        /// </summary>
        private static void ThrowMutableException(string copyMethod)
        {
            throw new InvalidOperationException(
                String.Format("Cannot mutate an immutable list; " +
                              "see copying method '{0}'", copyMethod));
        }

        /// <summary>
        /// Whether the list is read only: because the list is immutable, this
        /// is always true.
        /// </summary>
        public bool IsReadOnly
        {
            get { return true; }
        }

        /// <summary>
        /// Accesses the element at the specified index. Because the list is
        /// immutable, the setter will always throw an exception.
        /// </summary>
        /// <param name="index">The index to access.</param>
        /// <returns>The element at the specified index.</returns>
        public T this[int index]
        {
            get { return m_array[index]; }
            set { ThrowMutableException("CopyAndSet"); }
        }

        /// <summary>
        /// Copies the list and adds a new value at the end.
        /// </summary>
        /// <param name="value">The value to add.</param>
        /// <returns>A modified copy of this list.</returns>
        public ImmutableList<T> CopyAndAdd(T value)
        {
            T[] newArray = new T[m_array.Length + 1];
            m_array.CopyTo(newArray, 0);
            newArray[m_array.Length] = value;
            return new ImmutableList<T>(newArray);
        }

        /// <summary>
        /// Returns a new, cleared (empty) immutable list.
        /// </summary>
        /// <returns>A modified copy of this list.</returns>
        public ImmutableList<T> CopyAndClear()
        {
            return new ImmutableList<T>(new T[0]);
        }

        /// <summary>
        /// Copies the list and modifies the specific value at the index provided.
        /// </summary>
        /// <param name="index">The index whose value is to be changed.</param>
        /// <param name="item">The value to store at the specified index.</param>
        /// <returns>A modified copy of this list.</returns>
        public ImmutableList<T> CopyAndSet(int index, T item)
        {
            T[] newArray = new T[m_array.Length];
            m_array.CopyTo(newArray, 0);
            newArray[index] = item;
            return new ImmutableList<T>(newArray);
        }

        /// <summary>
        /// Copies the list and removes a particular element.
        /// </summary>
        /// <param name="item">The element to remove.</param>
        /// <returns>A modified copy of this list.</returns>
        public ImmutableList<T> CopyAndRemove(T item)
        {
            int index = IndexOf(item);
            if (index == -1)
            {
                throw new ArgumentException("Item not found in list.");
            }

            return CopyAndRemoveAt(index);
        }

        /// <summary>
        /// Copies the list and removes a particular element.
        /// </summary>
        /// <param name="index">The index of the element to remove.</param>
        /// <returns>A modified copy of this list.</returns>
        public ImmutableList<T> CopyAndRemoveAt(int index)
        {
            T[] newArray = new T[m_array.Length - 1];
            Array.Copy(m_array, newArray, index);
            Array.Copy(m_array, index + 1, newArray, index, m_array.Length - index - 1);
            return new ImmutableList<T>(newArray);
        }

        /// <summary>
        /// Copies the list and inserts a particular element.
        /// </summary>
        /// <param name="index">The index at which to insert an element.</param>
        /// <param name="item">The element to insert.</param>
        /// <returns>A modified copy of this list.</returns>
        public ImmutableList<T> CopyAndInsert(int index, T item)
        {
            T[] newArray = new T[m_array.Length + 1];
            Array.Copy(m_array, newArray, index);
            newArray[index] = item;
            Array.Copy(m_array, index, newArray, index + 1, m_array.Length - index);
            return new ImmutableList<T>(newArray);
        }

        /// <summary>
        /// This method is unsupported on this type, because it is immutable.
        /// </summary>
        void ICollection<T>.Add(T item)
        {
            ThrowMutableException("CopyAndAdd");
        }

        /// <summary>
        /// This method is unsupported on this type, because it is immutable.
        /// </summary>
        void ICollection<T>.Clear()
        {
            ThrowMutableException("CopyAndClear");
        }

        /// <summary>
        /// Checks whether the specified item is contained in the list.
        /// </summary>
        /// <param name="item">The item to search for.</param>
        /// <returns>True if the item is found, false otherwise.</returns>
        public bool Contains(T item)
        {
            return Array.IndexOf<T>(m_array, item) != -1;
        }

        /// <summary>
        /// Copies the contents of this list to a destination array.
        /// </summary>
        /// <param name="array">The array to copy elements to.</param>
        /// <param name="index">The index at which copying begins.</param>
        public void CopyTo(T[] array, int index)
        {
            m_array.CopyTo(array, index);
        }

        /// <summary>
        /// Retrieves an enumerator for the list's collections.
        /// </summary>
        /// <returns>An enumerator.</returns>
        public IEnumerator<T> GetEnumerator()
        {
            for (int i = 0; i < m_array.Length; i++)
            {
                yield return m_array[i];
            }
        }

        /// <summary>
        /// Retrieves an enumerator for the list's collections.
        /// </summary>
        /// <returns>An enumerator.</returns>
        IEnumerator IEnumerable.GetEnumerator()
        {
            return ((IEnumerable<T>)this).GetEnumerator();
        }

        /// <summary>
        /// Finds the index of the specified element.
        /// </summary>
        /// <param name="item">An item to search for.</param>
        /// <returns>The index of the item, or -1 if it was not found.</returns>
        public int IndexOf(T item)
        {
            return Array.IndexOf<T>(m_array, item);
        }

        /// <summary>
        /// This method is unsupported on this type, because it is immutable.
        /// </summary>
        void IList<T>.Insert(int index, T item)
        {
            ThrowMutableException("CopyAndInsert");
        }

        /// <summary>
        /// This method is unsupported on this type, because it is immutable.
        /// </summary>
        bool ICollection<T>.Remove(T item)
        {
            ThrowMutableException("CopyAndRemove");
            return false;
        }

        /// <summary>
        /// This method is unsupported on this type, because it is immutable.
        /// </summary>
        void IList<T>.RemoveAt(int index)
        {
            ThrowMutableException("CopyAndRemoveAt");
        }
    }
I won’t spend much time going over this code. Just notice that the type is marked with the ImmutableAttribute, the array field is marked with the MutableAttribute (since it’s not itself an immutable type and would fail the analysis otherwise), and that any operations that modify the list must make a copy of the entire thing.
Summary
This has been an interesting exercise. Through it, I have come to realize that first class immutability in the type system is not such a farfetched dream. The most onerous aspect to it is probably the restrictions it imposes on subclassing in the programming model, effectively bifurcating the type system into those types that are mutable and those types that are immutable. But, in the end, I’m not so sure it’s too bad a problem: interchanging the two seems like a very bad idea anyway.
Feedback on all of this would be appreciated. Do you see it as useful? If you had it, would you use it in your programs today? Do you believe that it is one step needed (of many!) to bring us towards a world in which building concurrent programs is simpler?
 Saturday, November 10, 2007
There are several docs out there that describe the CLR memory model, most notably this article.
When describing the model, one can either use acquire/release, barrier/fence, or happens-before terminology. They all achieve the same goal, so I will simply choose one, acquire/release: an acquire operation means no loads or stores may move before it, and a release operation means no loads or stores may move after it. I can explain it with such simple terms because the CLR is homogeneous in the kinds of operations it permits or disallows to cross such a barrier, e.g. there's never a case where loads may cross such a chasm but stores may not.
Despite the great article referenced above, I find that it’s still not entirely straightforward. It is important to code to a well-understood abstract model when writing lock-free code. For reference, here are the rules as I have come to understand them stated as simply as I can:
- Rule 1: Data dependence among loads and stores is never violated.
- Rule 2: All stores have release semantics, i.e. no load or store may move after one.
- Rule 3: All volatile loads are acquire, i.e. no load or store may move before one.
- Rule 4: No loads and stores may ever cross a full-barrier (e.g. Thread.MemoryBarrier, lock acquire, Interlocked.Exchange, Interlocked.CompareExchange, etc.).
- Rule 5: Loads and stores to the heap may never be introduced.
- Rule 6: Loads and stores may only be deleted when coalescing adjacent loads and stores from/to the same location.
Note that by this definition, non-volatile loads are not required to have any sort of barrier associated with them. So loads may be freely reordered, and writes may move after them (though not before, due to Rule 2). With this model, the only case where you’d truly need the strength of a full-barrier provided by Rule 4 is to prevent reordering in the case where a store is followed by a volatile load. Without the barrier, the instructions may reorder: the load may move before the store, since nothing in Rules 2 or 3 forbids it.
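The store-followed-by-volatile-load case is the classic Dekker-style handshake. Here is a minimal sketch, with a hypothetical TwoThreadGate type invented for the example, in which each of two threads raises its own flag and then checks the other's. The full barrier between the store and the load is what rules out both threads entering at once.

```csharp
using System.Threading;

// Hypothetical two-party gate: thread 0 calls TryEnter0/Exit0, thread 1 calls TryEnter1/Exit1.
public sealed class TwoThreadGate
{
    private volatile int m_flag0;
    private volatile int m_flag1;

    public bool TryEnter0()
    {
        m_flag0 = 1;            // Store (release, per Rule 2).
        Thread.MemoryBarrier(); // Full barrier (Rule 4): the load below cannot move above the store.
        if (m_flag1 != 0)       // Volatile load (acquire, per Rule 3).
        {
            m_flag0 = 0;        // The other side got there first; back off.
            return false;
        }
        return true;
    }

    public bool TryEnter1()
    {
        m_flag1 = 1;
        Thread.MemoryBarrier();
        if (m_flag0 != 0)
        {
            m_flag1 = 0;
            return false;
        }
        return true;
    }

    public void Exit0() { m_flag0 = 0; }
    public void Exit1() { m_flag1 = 0; }
}
```

Without the two MemoryBarrier calls, each load could be satisfied before the preceding store became visible, so both threads could read zero and both would enter. With them, at least one thread is guaranteed to see the other's flag. Both may back off at the same time, so a real protocol would need a retry or tie-breaking policy on top of this.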
It is unfortunate that we’ve never gone to the level of detail and thoroughness the Java memory model folks have gone to. We have constructed our model over years of informal work and design-by-example, but something about the JMM approach is far more attractive. Lastly, what I’ve described applies to the implemented memory model, and not to what was specified in ECMA. So this is apt to change from one implementation to the next. I have no idea what Mono implements, for example.
 Friday, November 09, 2007
Lock-free algorithms are often “better” than lock-based algorithms:
- They are, by definition, non-blocking: some thread always makes progress, and no thread ever waits on a lock held by another.
- State transitions are atomic such that failure at any point will not corrupt the data structure.
- Because threads never block, they typically lead to greater throughput, as the granularity of synchronization is a single atomic write or compare-exchange.
- In some cases, lock-free algorithms incur fewer total synchronizing writes (e.g. interlocked ops) and thus can be cheaper from a pure performance standpoint.
But lock-freedom is not a panacea. I’ve gained a lot more experience using lock-free algorithms in the past 3 years: first, when working on memory model improvements for Whidbey, and recently during the implementation of the ParallelFX library and as I write new content for my book.
There are some obvious drawbacks:
- The use of optimistic concurrency can lead to livelock for hot data structures.
- The code is significantly harder to test. Usually its correctness hinges on a correct interpretation of the target machine’s memory model (in my case, the .NET memory model, the topic of an upcoming post), which can be misinterpreted and is hard to validate. Moreover, because the most popular hardware has a stronger memory model than less popular hardware (e.g. X86 vs. IA64), testing needs to explicitly focus on the esoteric hardware to expose races.
- The code is significantly harder to write and maintain, for many of the same reasons.
I have learned about a less obvious drawback the hard way. When initially implementing a certain data structure, you may make a bunch of assumptions about the use cases your class needs to accommodate. And you may actually succeed in writing an implementation that is correct given those assumptions. But over time, as new use cases are discovered, it is much harder to retrofit the code and revalidate that the lock-free algorithms are still correct given the new assumptions. There is no magic oracle that says: hey, adding feature X is going to invalidate assumption Y over there, creating a memory reordering bug.
In several recent cases, I’ve discovered such problems, and dealt with them one-by-one, usually by adding additional memory barriers. As the number of memory barriers increases (each costing roughly ½ the cycles of acquiring and releasing a lock), however, the benefits of lock-free algorithms begin to dwindle. It’s easy to begin with an algorithm that scales and performs nicely, and over time add a memory barrier here and there, and eventually end up with something that performs worse than the lock-based equivalent. Unfortunately, this threshold isn’t always obvious, so you can end up with a real mess on your hands: a buggy, impossible to test, and difficult to understand hunk of code. All the drawbacks of lock-free code with none of the benefits. Whoops.
The moral of the story? Be careful and conservative in your use of lock-free code. There are many well-known published lock-free algorithms, and it’s usually a good idea to stick to them, if you use lock-free code at all. When in doubt, just use locks. Truthfully, they are hard enough.
 Wednesday, October 17, 2007
For those who haven't heard yet, F# is joining the "official" .NET language family. As Soma says, "[t]his is one of the best things that has happened at Microsoft since we created Microsoft Research over 15 years ago." I am just as excited as he is. A huge congrats to Don. He's been working super hard to build and evangelize F# for several years now, and getting the attention of a huge division like DevDiv isn't an easy thing, but in the end sanity does prevail. As a blanket statement, the more functional .NET programmers' brains become, the more latent parallelism the programs they write will contain.
 Saturday, October 13, 2007
Charles from Channel9 recorded a conversation with Anders and me a couple weeks back. The topic? Concurrency. More specifically, Parallel FX (PFX):
Programming in the Age of Concurrency: Concurrent Programming with PFX
Microsoft is developing a number of technologies to simplify the expression of parallelism in code. An example of this work is Parallel Extensions for the .NET Framework (PFX), a managed programming model for data parallelism, task parallelism, scheduling, and coordination on parallel hardware.
PFX makes it easier for developers to write programs that take advantage of parallel hardware (you've all heard of multi-core and what the future holds with many-core...), without having to deal with the complexities of threads and locks in today’s concurrent programming story.
We don't go too deep, but you can bet we'll be doing more of these things as the technology matures and gets closer to general availability. Enjoy!
(Note: you may also be interested in Stephen Toub's PFX interview with Scott Hanselman, available here.)
 Tuesday, October 09, 2007
When COM came onto the scene in the early 90’s, Symmetric Multiprocessor (SMP) architectures had just been introduced into the higher end of the computer market. “Higher end” in this context basically meant server-side computing, a market in which the increase in compute power promised increased throughput for heavily loaded backend systems. Parallelism per se (that is, breaking apart large problems into smaller ones so that multiple processors can hack away to solve the larger problem more quickly) was still limited to the domain of specialty computing, such as the high-end supercomputing and high-performance computing communities. The only economic incentive for Windows programmers to use multithreading, therefore, was limited mostly to servers. Heck, parallelism is still pretty much limited to those domains, but the economic incentives are clearly in the midst of a fundamental change.
As is already well established, server-side computing is highly parallel for several reasons. The most obvious is the steady stream of work a server farm usually enjoys, meaning there is seldom a shortage of compute work to do. Even if work is IO bound, there’s typically at least some work that could use a CPU waiting in the arrival queue to overlap execution with. Moreover, server workloads are usually isolated except for some select and small amount of application-wide state. Each user has his own account, order history, bank transaction information, etc., and therefore the interaction between sessions can be carefully controlled and nearly non-existent, leading (once again) to a good cost/benefit tradeoff, due to the large scalability wins.
Human productivity has always been markedly more important than other software features, like performance, reliability, and security, unless the domains in which programs are being developed require an intense focus on certain attributes. I’m sure the DOA prioritizes security far above productivity, but the same isn’t true of most of the industry. This was true back in the COM days, and is still true to this day (perhaps more so). So it’s safe to conclude that the designers of COM had “ease of development” at the forefront of their minds when creating it. That, coupled with the kind of multithreading in use back then on Windows machines (servers), which put an emphasis on lack of sharing, led to the development of the Single Threaded Apartment (STA) model. Related was COM+’s addition of explicit synchronization contexts, which took the STA auto-synchronization idea and generalized it to make synchronization policies more customizable.
These features made synchronization much simpler. Synchronization is an often-impossible task, and it is less important to be precise about when isolation is pervasive. Instead of having to test a million different machine configurations, various difficult-to-predict-ahead-of-time component interactions, and so on, a component got the STA stamp and was guaranteed safe in a multithreaded environment. The alternative then is the alternative today: go free-threaded (MTA or NTA) and deal with all of the nasty synchronization problems that arise “the old fashioned way.” In other words, use locks and events, and run the risk of race conditions, deadlocks, and various other latent bugs that would ruin the composability and reliability of any less-than-bulletproof component. Sadly, “the old fashioned way” is still “the state of the art” until we build a better mousetrap.
Now, the STA’s gotten a really bad rap over the years. (I’ll ignore synchronization contexts for the time being but just about everything I say applies to them too.) It’s true that STAs cause us a lot of problems when thinking about legacy compatibility, and will make it just that much more difficult to migrate legacy Windows apps over to a massively parallel world, but I’m going to stick my neck out and make a claim that won’t win me friends (and in fact might lose me some): STAs aren’t entirely evil, and are an interesting idea that we as a community can learn a lot from. What’s more, we have years of experience using them. I see a lot of people basically reinventing the STA model, often without realizing it due either to a lack of understanding of (or interest in) COM or simply a lack of pattern matching abilities. “History will repeat itself, because nobody was listening the first time.”
Automatic synchronization is now the holy grail of the new multicore era. STM is another attempt at that. Active objects, however, which have shown up in numerous places, are another attempt, and one more closely related to the STA. Yet another closely related technology is message passing in general, where isolated domains of control do not share state and instead communicate via disconnected message passing. All strive to attain similar goals: improved developer productivity and safety, usually with some performance or scaling overhead. The biggest difference, from my perspective, is that design priorities are now different due to the environment at the time these things are being created. It’s clear today that any automatic synchronization technology we invent should scale to hundreds (perhaps thousands) of processors, not one or two (or, at the extreme, eight), that fine-grained parallelism will become more and more important, and that the degree of sharing will be high, whether that means logically (by message exchange) or physically (in the most literal shared memory sense).
Clearly the worst aspect of COM STAs is that they are obviously not up for the task of scaling like this at such a fine grain, because a single thread is responsible for executing all code for some particular set of objects in the process. It's just plain impossible to parallelize finer than the granularity of a single component, and it's common to glom many components together into one apartment, which is worse. As the number of available processors grows, and/or the number of objects instantiated inside a particular STA that need to interact grows, scalability suffers. Sadly we’ve inherited huge hunks of code that have been written in this fashion, with all of the assumptions about the multithreading environment in which the components will be deployed as immutable laws.
But there are good things about COM STAs! They are brain-dead simple in the most common cases. Synchronization doesn’t take nearly as much brainpower and development time away from the component creation process, improving developer productivity and the robustness of the software written. So long as your STA component never blocks or performs a cross-apartment invocation, life remains very simple. This is an example of a leaky abstraction, however, because it’s not always evident to the programmer when this chasm has been crossed. Proxies do attempt to hide the gunk of crossing the chasm, though at the risk of introducing reentrancy, which itself comes with a lot of baggage. I’d like to stop and point out something at this point, perhaps helping to support the “reinventing the wheel” claim earlier. Active objects and message passing systems generally suffer from similar problems. If one object uses another (by enqueueing a message) and then, at some point, waits for a response message to arrive, there is the risk that the thread which is now blocked will need to itself respond to a message coming from another object. Ahh, the classic reentrancy versus deadlock tradeoff. Event-driven, stackless systems like the Concurrency and Coordination Runtime (CCR), etc., mitigate this problem but require a fundamentally different way of programming. UI programmers are generally more comfortable with this approach. And linear types a la Singularity's exchange heap also offer a promising way to enable concurrency while safely guaranteeing that certain state will not be shared.
In the end, COM STAs are still an invention I wish we could do away with. I think of the technology a bit like a cheap, half-way imitation of Hoare's CSP. But at the same time, I fear we as an industry will continue to reinvent them, just under a different guise or with subtly different nuances. We need to resist the urge to pretend they don't exist just because they contain the letters C, O, and M and because the sound of STA is known to trigger feelings of intense nausea. What’s scary to me is that, STM aside, there doesn’t seem to be any super-promising alternative to the automatic synchronization problem for shared memory, aside from provable declarative and functional safety. As I’ve noted above, true fine-grained message passing has a lot of similar issues, but I do wonder at the end of the day if Joe Armstrong has been right all along. (Well, Tony Hoare really deserves the credit, and perhaps David May too, but Erlang is en vogue currently.) Time will tell.
 Saturday, September 15, 2007
Two articles about ParallelFX (PFX) are in the October issue of MSDN magazine and have been posted online:
- Parallel LINQ: Running Queries on Multi-Core Processors. An overview of an implementation of LINQ-to-Objects and -XML which automagically uses data parallelism internally to execute declarative language queries. It supports the full set of LINQ operators, and several ways of consuming output in parallel.
- Parallel Performance: Optimize Managed Code for Multi-Core Machines. Describes the Task Parallel Library (TPL), a new "thread pool on steroids" with cancelation, waiting, and pool isolation support, among many other things. Uses dynamic work stealing techniques (see here and here) for superior scalability.
As noted in the article, there's a PFX CTP planned for 2007*. Watch my blog for more details when it's available.
*Note: some might wonder why we released the articles before the CTP was actually online. When we originally put the articles in the magazine's pipeline, our intent was that they would in fact line up. And both were meant to align with PDC'07. But when PDC was canceled, we also delayed our CTP so that we had more time to make progress on things that would have otherwise been cut. It's less than ideal, but I'm still confident this was the right choice.
 Wednesday, September 12, 2007
Lock recursion is usually a bad idea. It can seem convenient (at first), but once the slippery slope of making calls from critical regions into complex ecosystems of code is embarked upon (which is usually a necessary pre-requisite to lock recursion, except for some relatively simple cases), it’s easy to accidentally fall right off the edge. This topic was part of the doc I wrote previously about using concurrency inside of reusable libraries. My opinions haven't changed much since then.
Lock recursion coupled with condition variables is even worse. In fact, its behavior might surprise you.
To motivate this, would you ever think of writing code that does something like this?
void BreakAtomicity()
{
    // … I assume somebody called me with a recursive lock on ‘obj’ …
    Monitor.Exit(obj);
    Monitor.Exit(obj);
    // … Do something …
    Monitor.Enter(obj);
    Monitor.Enter(obj);
}
I should certainly hope not! Unless you're crazy or reeeeally know what you're doing. Who knows what state invariants are busted at the time the call to BreakAtomicity was made? Releasing the lock in this manner hoists these ticking timebombs onto the other threads in the process that might want to inspect the shared state. If you, the author of BreakAtomicity, have omniscient knowledge of the entire program, perhaps you know precisely. But, particularly in the case of recursion, where it's all-too-common to engage in practices of sloppy composition, this is actually quite unlikely. Lock recursion is typically used for convenience, not because of a really solid design that is based on clean algorithmic recursion.
What does this example have to do with condition variables anyway? Glad you asked! It matters because of what happens when you wait on a monitor that has been recursively acquired. In such cases, Monitor.Wait will release all recursive acquires as part of its waiting. I.e. if it has been acquired 10 times, it is released 10 times before waiting. It does this, of course, because otherwise it would deadlock waiting for some other thread to make a call to Monitor.Pulse/PulseAll (since a separate thread needs to first acquire the lock in order to do either). This is symmetric, so once the thread has been awoken, it will reacquire the lock as many times as needed before returning to attain the same level of recursion that existed prior to the call.
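A small experiment makes the release-all-acquires behavior visible. This is only a sketch; the RecursiveWaitDemo class and its local names are made up for illustration. One thread takes the lock twice and waits, and a second thread then tries to enter the same lock. If Wait released only one of the two acquires, the second thread could never get in and the program would hang.

```csharp
using System.Threading;

// Illustrative only; the type and variable names are hypothetical.
public static class RecursiveWaitDemo
{
    // Returns true once a second thread has entered the lock while the
    // first thread, holding it recursively, was blocked in Monitor.Wait.
    public static bool Run()
    {
        object obj = new object();
        bool signaled = false;       // guarded by obj
        bool pulserEntered = false;  // guarded by obj
        ManualResetEventSlim heldTwice = new ManualResetEventSlim(false);

        Thread waiter = new Thread(() =>
        {
            lock (obj)       // recursion depth 1
            {
                lock (obj)   // recursion depth 2
                {
                    heldTwice.Set();
                    // Wait gives up both acquires and restores both
                    // before it returns.
                    while (!signaled)
                        Monitor.Wait(obj);
                }
            }
        });

        Thread pulser = new Thread(() =>
        {
            heldTwice.Wait(); // the waiter now owns the lock twice
            lock (obj)        // succeeds only after Wait has released both
            {
                pulserEntered = true;
                signaled = true;
                Monitor.PulseAll(obj);
            }
        });

        waiter.Start();
        pulser.Start();
        waiter.Join();
        pulser.Join();
        return pulserEntered;
    }
}
```

Note that any invariants the outer lock region was in the middle of protecting are fully visible to the pulsing thread here, even though the outer region never asked to wait.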
Now, Monitor.Wait breaks atomicity anyway. This is obvious. It releases and reacquires the lock internally, and so any conditions regarding shared state that exist prior to the call cannot be assumed to exist after the call returns. (Most) people understand this and tend to use Wait in fairly common and safe patterns, such as guarded regions where some predicate is checked for validity at the very front of a critical region before doing anything interesting with state. But the really nasty thing about recursive locks and the Wait behavior described above is that this breaks atomicity for some unknown number of nested critical regions that have existed for some unknown amount of time leading up to the Wait. This is a recipe for pain. My recommendation is probably predictable: just follow the broadly accepted advice that, because lock recursion is evil to begin with, it is best avoided, and you will safely avoid the more complicated case outlined above.
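For reference, the guarded region pattern looks roughly like the following. It is a minimal sketch with a hypothetical SimpleBlockingQueue<T> type, not code from any library. The predicate is rechecked in a loop at the front of the critical region every time Wait returns, and the lock is never acquired recursively.

```csharp
using System.Collections.Generic;
using System.Threading;

// Illustrative sketch; the type name is hypothetical.
public sealed class SimpleBlockingQueue<T>
{
    private readonly Queue<T> m_items = new Queue<T>();
    private readonly object m_lock = new object();

    public void Enqueue(T item)
    {
        lock (m_lock)
        {
            m_items.Enqueue(item);
            Monitor.Pulse(m_lock); // wake one waiting consumer, if any
        }
    }

    public T Dequeue()
    {
        lock (m_lock)
        {
            // Guard: nothing about the queue is assumed to survive a Wait,
            // so the predicate is rechecked each time we wake up.
            while (m_items.Count == 0)
                Monitor.Wait(m_lock);

            return m_items.Dequeue();
        }
    }
}
```

Because the only state touched before the Wait is the predicate itself, there are no half-updated invariants for other threads to trip over.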
It’s worth pointing out that the new CONDITION_VARIABLE in Vista, i.e. SleepConditionVariableCS and SleepConditionVariableSRW, only releases the lock once, despite recursive acquire counts. (SRWL doesn’t officially support recursion, although it works for shared locks since there is no affinity used and it is undetectable.) Deadlocks result instead. From an editorial perspective, I prefer this behavior quite a bit, since it’s easier to debug. (Admittedly, if Monitor’s behavior is what you want, it’s less than straightforward to achieve, unless you know the recursion counter somehow. Although, I will also note that I am convinced very few real people would want Monitor’s current behavior...) My preferred solution to this would have been to throw an exception, since I do think issuing a Wait when locks have been recursively acquired is in most cases a bug. As a workaround, we could have exposed a RecursionCount on Monitor so that a developer could manually exit the lock RecursionCount-1 times before the call to Wait and then reacquire it RecursionCount-1 times after the call returns. (Actually no -- I would have made Monitor non-recursive by default, like the new ReaderWriterLockSlim.) Sadly, I guess I'm only about 10 years too late...
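Monitor itself exposes no recursion count, but in code you own you can approximate the throw-on-recursive-Wait behavior with a thin wrapper. The sketch below is purely hypothetical: CountingLock and its members are my own names, and it assumes all access to the lock goes through the wrapper.

```csharp
using System.Threading;

// Hypothetical wrapper; not part of the .NET Framework.
public sealed class CountingLock
{
    private readonly object m_lock = new object();
    private int m_recursion; // touched only by the thread holding m_lock

    public void Enter()
    {
        Monitor.Enter(m_lock);
        m_recursion++;
    }

    public void Exit()
    {
        if (!Monitor.IsEntered(m_lock))
            throw new SynchronizationLockException();
        m_recursion--;
        Monitor.Exit(m_lock);
    }

    public int RecursionCount
    {
        get { return Monitor.IsEntered(m_lock) ? m_recursion : 0; }
    }

    public void Wait()
    {
        if (!Monitor.IsEntered(m_lock))
            throw new SynchronizationLockException();
        if (m_recursion > 1)
            throw new LockRecursionException(
                "Wait called with the lock held recursively.");

        // The lock is about to be handed to other threads, so the count
        // must read zero while we sleep and one again when we wake.
        m_recursion = 0;
        try { Monitor.Wait(m_lock); }
        finally { m_recursion = 1; }
    }

    public void PulseAll()
    {
        Monitor.PulseAll(m_lock); // throws if the caller does not hold the lock
    }
}
```

A caller that really does want Monitor's release-everything behavior could read RecursionCount and exit the extra levels by hand, but at least that decision would then be visible in the code.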
 Wednesday, August 22, 2007
Most managed code in the .NET Framework has not been hardened against asynchronous exceptions. This includes out of memory (OOM) conditions and asynchronous thread aborts, and is entirely by design. Hardening against OOM, for example, is historically an extraordinarily difficult feat, and few systems undertake the development and QA costs needed to do so. (FWIW, the CLR VM is one such system.) Simply failing gracefully is usually hard enough. Failing gracefully is admittedly leaps and bounds easier in managed code because allocation failures are communicated via exceptions rather than return values, and are thus transitively propagated “by default.” Thread aborts are even more difficult to harden against, however, because they can originate at any instruction (with a handful of exceptions). Ensuring data invariants are protected for every single instruction is clearly just a little difficult.
These things are certainly not impossible. With enough effort, you can make inroads toward solutions for both issues. Portions of the .NET Framework have gone to such lengths. For example, code that manipulates process-wide state spanning AppDomains needs to ensure that this state is not corrupted by an unfortunately placed thread abort when run inside systems like SQL Server that use aborts to tear down boundaries of isolation. While possible, the important thing to understand here is that most of the .NET Framework is in fact not resilient to these things. See this doc as an example of guidance the CLR team provided to other developers inside of Microsoft to this effect. OOMs are in a similar category, though many subsystems take different, inconsistent approaches to memory allocation failures (e.g. WPF takes a different stance than WCF).
All of this is a long winded build up to the following problem: thread interrupts are just about as evil as these sorts of asynchronous exceptions. The failure injection points are more constrained—e.g. an OOM can occur wherever an allocation occurs, a thread abort can happen in between nearly any two instructions, and thread interruptions can only occur at blocking calls that transition the managed thread into the state WaitSleepJoin—but this doesn’t change the fact that most code is unprepared to deal properly with such interruptions. Once again, it’s not that managed code cannot be constructed to be resilient to interruptions—in fact, it’s much easier than OOMs and thread aborts—it’s simply that the .NET Framework hasn’t been constructed to tolerate arbitrary interruptions. If threads are calling into these APIs and thread interruptions are provoked, state corruption, memory leaks, and possible deadlocks can be left in the wake.
To take a brief example of where such a problem might crop up, imagine a thread has blocked on FileStream.EndRead because it is finishing some asynchronous IO operation. After a brief inspection of the code, I’m convinced interrupting the call it makes to WaitHandle.WaitOne internally will lead to a memory leak:
if (1 == Interlocked.CompareExchange(ref result._EndXxxCalled, 1, 0))
{
    __Error.EndReadCalledTwice();
}

WaitHandle handle = result._waitHandle;
if (handle != null)
{
    try
    {
        handle.WaitOne();
    }
    finally
    {
        handle.Close();
    }
}

NativeOverlapped* nativeOverlappedPtr = result._overlapped;
if (nativeOverlappedPtr != null)
{
    Overlapped.Free(nativeOverlappedPtr);
}
The method ensures only one call to EndRead can occur, and will throw on subsequent attempts. So the above code will only ever run once. Sadly, EndRead needs to free the NativeOverlapped structure used internally for asynchronous IO completion. But because the call to Overlapped.Free follows the call to WaitOne, and doesn’t occur inside of a finally block, it won’t execute if WaitOne throws. In summary: interrupt that call to WaitOne, and boom, we leak a NativeOverlapped object. Whether or not this is disastrous of course depends on the precise scenario. A few bytes here and a few bytes there can quickly add up, particularly for long running programs. At least this particular example protects invariants sufficiently well to avoid state corruption that would lead to further unpredictability. But recall that this is just one example. In my experience, the BCL represents some of the most carefully written code in the .NET Framework, so this problem is undoubtedly scattered about all over the place.
Unfortunately, it’s become somewhat common advice that using thread interruption as a synchronization and control mechanism is a GoodThing™. Andrew Birrell, a researcher from Microsoft Research, for example, suggested this in his paper “An Introduction to Programming with C# Threads”:
“Interrupts are most useful when you don’t know exactly what is going on. For example, the target thread might be blocked in any of several packages, or within a single package it might be waiting on any of several objects. In these cases an interrupt is certainly the best solution. Even when other alternatives are available, it might be best to use interrupts just because they are a single unified scheme for provoking thread termination.” (p33)
While I am sure this advice is well intentioned, it is extremely dangerous for the subtle reasons outlined above and can lead to reliability problems in any programs that follow it. My recommendation is to build this kind of higher level synchronization into the code that you actually own, and handle shutdown and interruption logic yourself. This is a bit cumbersome and is more work, but it also ensures that arbitrary blocking points in the libraries you use will not be affected by interruptions.
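As a rough illustration of handling shutdown yourself, here is a sketch of a worker thread that blocks only at a wait it owns and treats a dedicated stop event as the signal to exit. The StoppableWorker class and its members are hypothetical names of mine, and real code would carry actual work items rather than a counter.

```csharp
using System.Threading;

// Illustrative sketch; the type and member names are hypothetical.
public sealed class StoppableWorker
{
    private readonly ManualResetEvent m_stop = new ManualResetEvent(false);
    private readonly AutoResetEvent m_workReady = new AutoResetEvent(false);
    private readonly Thread m_thread;
    private int m_processed;

    public StoppableWorker()
    {
        m_thread = new Thread(Run);
        m_thread.Start();
    }

    public int Processed
    {
        get { return Volatile.Read(ref m_processed); }
    }

    public void SignalWork()
    {
        m_workReady.Set();
    }

    // Call once: asks the worker to exit at its own blocking point,
    // waits for it, then releases the events.
    public void Stop()
    {
        m_stop.Set();
        m_thread.Join();
        m_stop.Dispose();
        m_workReady.Dispose();
    }

    private void Run()
    {
        WaitHandle[] handles = { m_stop, m_workReady };
        while (true)
        {
            // The only place this thread blocks. Index 0 is the stop event,
            // and WaitAny returns the lowest signaled index, so stop wins.
            int which = WaitHandle.WaitAny(handles);
            if (which == 0)
                return;

            Interlocked.Increment(ref m_processed); // stand-in for real work
        }
    }
}
```

The thread never gets yanked out of a library call it doesn't control. It can only be asked to stop at the one wait it was written to handle.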
With the increase in hardware parallelism over the coming years, I worry that the use of interruptions will become more widespread as a popular technique developers use to control threads. And as more and more of the .NET Framework uses higher degrees of concurrency, necessarily requiring more internal synchronization, the number of blocking points that are vulnerable to this kind of abuse will grow accordingly. So, please, do your part… avoid Thread.Interrupt like the plague. In fact, perhaps we should deprecate it.
 Saturday, July 21, 2007
We're hiring developers and testers for the Parallel Computing team in Microsoft's Developer Division.
Update: Here are links to the jobs on microsoft.com:
Our team works regularly with MSR and the CTO, as well as other Developer Division teams like CLR, VS, C#, VB, etc.
If you want to help define and build the next generation of concurrency support in C# and the .NET Framework, this is your chance.
If you want to work closely with supersmart folks like Anders, Burton, and David, wait no longer.
Send me an email at joedu at microsoft dot com if you are interested.
 Friday, July 20, 2007
Whether or not it’s possible for an object to be published before it has been fully constructed is perhaps the most common .NET memory model-related question that arises time and time again. In fact, there was a discussion this week on an internal .NET alias, and another a couple weeks ago in the Joel on Software forums.
The basic question is: Can one thread read a pointer to an object whose constructor has not finished running on a separate thread?
This pattern pops up quite a bit in lazy initialization scenarios, for instance. For example, given some class C:
class C
{
    public int f;
    public static C s_c;

    public C()
    {
        f = 55;
    }
}
And some code that lazily initializes and then uses the object:
if (s_c == null)
    s_c = new C();
Console.WriteLine(s_c.f);
Specifically, is it possible in this case to write 0 (or garbage) to the console, instead of 55?
(Note that related examples, like the Joel on Software thread, use separate initialization routines or steps before publishing the pointer. It boils down to the same issue.)
How could observing any value of ‘f’ other than 55 possibly happen anyway, you might wonder? Well, since some processors are free to execute certain instructions out-of-order, the write of the return value of ‘new C()’ could theoretically retire before the write to that instance’s ‘f’ field. This isn’t an issue on X86, since the processor memory model doesn’t permit it, but architectures like IA64 do permit such reordering. Moreover, some compilers might decide to reorder writes; in this example, if the constructor were to be inlined, the compiler could subsequently use code motion to delay the write to the field.
(Note: obviously the constructor could publish a reference to 'this' before it has finished. In this case, clearly other threads could then access the instance before it was fully constructed.)
On .NET, the answer is no, this kind of code motion and processor reordering is not legal. This specific example was a primary motivation for the strengthening changes we made to the .NET Framework 2.0’s implemented memory model in the CLR. Writes always retire in-order. To forbid out-of-order writes, the CLR’s JIT compiler emits the proper instructions on appropriate architectures (i.e. in this case, ensuring all writes on IA64 are store/release). Although reads can retire out-of-order, the data dependence on the pointer value being published prevents subsequent reads of fields from happening before the read of the pointer itself. So thankfully this simply cannot happen.
A lot of .NET code out there, including code in the Framework itself, would have suddenly been open to reordering bugs when the CLR 2.0 shipped with IA64 support had we not made this decision. We decided to sacrifice some performance on one particular architecture (and possibly subsequent ones) to ensure these tricky races didn’t bite people unexpectedly, and to avoid a costly audit of the entire Framework.
Lastly, I will note a couple things. First, this strength is not specified in ECMA, so other implementations of the CLI do not provide such guarantees. (I hope one day we decide to standardize the stronger model.) I don’t know what Mono implements, but it may be weaker. Second, the Java Memory Model does not prohibit such publication reorderings, unless the assignments are to a ‘final’ field. So I’m sure people who are familiar with the JMM will assume this pattern is broken on .NET and use locks and/or explicit memory barriers instead. This approach is more conservative and still leads to correct code, however, so it really matters very little for most code.
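For those who would rather not lean on the stronger CLR 2.0 model, the conservative version might look something like this. It is a sketch only: the Instance property and s_lock field are names I've added for illustration, and it combines a volatile field with a lock, which is the belt-and-suspenders form of double-checked locking.

```csharp
// Illustrative sketch; Instance and s_lock are hypothetical additions to C.
public sealed class C
{
    public int f;

    // volatile: reads and writes of the published reference are not
    // reordered with the constructor's write to f.
    private static volatile C s_c;
    private static readonly object s_lock = new object();

    public C()
    {
        f = 55;
    }

    public static C Instance
    {
        get
        {
            C c = s_c;
            if (c == null)
            {
                lock (s_lock)
                {
                    c = s_c; // recheck under the lock
                    if (c == null)
                    {
                        c = new C();
                        s_c = c; // publish only after construction completes
                    }
                }
            }
            return c;
        }
    }
}
```

Unlike the racy version above, this also guarantees that only one instance of C is ever constructed, at the cost of a lock on the first few calls.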
 Sunday, June 24, 2007
In response to a previous post, a reader said
“I was under the impression that monitors were implemented in .NET in a multiplexed way, so that events are only allocated to an object while there is contention - and that they aren't "sticky", becoming permanently attached to the object.”
This is absolutely correct. My nulling out of the object reference in the example only has the slight advantage of promoting the object’s collection sooner, but it does not have the effect of speeding up the reclamation of the internally managed monitor state. My original posting erroneously said that it would.
Let’s take a quick step back, and see exactly what this means.
Monitors consist of two capabilities: critical regions (i.e. Monitor.Enter and Exit), to achieve mutual exclusion, and condition variables (i.e. Monitor.Wait, Pulse, and PulseAll), to coordinate between threads. Any CLR object can be used as a monitor.
For the critical region case, the CLR uses an efficient thin lock which simply embeds locking information as a bit pattern inside the object’s header word. Other parts of the system also try to use the header, e.g. when caching an object’s default hash-code, COM interop, etc. There are limits to what can be stored in the header, so use of any two of these things simultaneously causes inflation, meaning the object header’s contents become an index into a table of sync blocks. Sync blocks are just little data structures capable of holding all of that state simultaneously. The CLR manages a system-wide table of them and recycles and reuses them as objects need them. Another event that causes inflation is the first occasion on which a thread tries to enter the critical region while another thread holds it (i.e. contention).
When contention arises, the CLR will spin briefly before truly waiting, but it may eventually need to allocate a Windows kernel event object. This is an auto-reset (synchronization) event, and a handle to it gets stored on the sync block. Waiting threads just wait on it, and threads exiting the critical region will set it (if the wait count is non-zero). Note that this leads to unfair behavior, because threads can steal the critical region between the signal and the wake-up, but helps to prevent convoys.
Condition variables are implemented slightly differently. Each CLR thread object has a single event object dedicated to it. The first time a thread calls Wait on a condition variable, the event is lazily allocated. And then the thread simply places its own thread-local event into a list of events associated with the monitor. Registering the event also requires inflation to a sync block, if it hasn’t happened already, because obviously the event list can’t be stored in the object header. When a Pulse happens, the pulsing thread just signals the first event in the list. Waiting and pulsing are thus actually somewhat fair, but there are other races that can eliminate this that I won't get into. When a PulseAll occurs, the pulsing thread walks the whole list and signals each.
So now back to the question: when are sync blocks reclaimed?
When a GC is triggered, objects in the reachability traversal may have their sync blocks reclaimed, even if the object in question is still alive, and made available again in the system-wide pool of reusable sync blocks. This reclamation can happen so long as the sync block isn’t needed permanently (as would be the case if COM interop information was stored inside of it) and the sync block isn’t marked precious. A sync block is precious anytime there is a thread inside of the object’s critical region, when a thread is waiting to enter the critical region, or when at least one thread has registered its event into the associated condition variable list. Notice that orphaning monitors can thus lead to leaking events, because they will remain precious, unless the monitor object itself becomes unreachable. When a sync block is reclaimed in this fashion, certain reusable state is kept, like the critical region event object, so that the next monitor to use the sync block can reuse it.
 Saturday, June 09, 2007
Windows Vista has a new one-time initialization feature, which I’m pretty envious of, being someone who writes most of his code in C# and answers countless questions about double-checked locking in the CLR. Rather than sprinkling double-checked locking all over your code base, along with the ever-lasting worry in the back of your mind that you’ve gotten the synchronization incorrect, it's a better idea to consolidate it into one place.
That’s the purpose of the LazyInit<T> and LazyInitOnlyOnce<T> structs below. Both let you specify an “initialization” routine (as a delegate) which gets invoked at the appropriate time to lazily initialize the state. The only difference between the two is that LazyInit<T> might invoke your delegate more than once, due to races, but it will ensure only one value “wins”. LazyInitOnlyOnce<T> does the extra work to ensure the initialization routine only gets called once, though at a slightly higher cost: we might need to block a thread, which means allocating a Win32 event.
Why the two? I had originally written this with a Boolean specified at construction time to pick one over the other, but this required an extra object field, which was never used in the LazyInit<T> case, along with a Boolean field. I defined both as structs to make them super lightweight to use, and getting rid of the extra two fields seemed worth the baggage of an extra type, given that such a type could end up used very pervasively throughout a large code-base. As it stands, LazyInit<T> is just the size of a pointer plus the size of T. LazyInitOnlyOnce<T> adds one additional pointer to that.
To start with, both use the same Initializer<T> delegate:
public delegate T Initializer<T>();
And here’s LazyInit<T>, the simpler of the two:
public struct LazyInit<T> where T : class
{
    private Initializer<T> m_init;
    private T m_value;

    public LazyInit(Initializer<T> init)
    {
        m_init = init;
        m_value = null;
    }

    public T Value
    {
        get
        {
            if (m_value == null)
            {
                T newValue = m_init();
                if (Interlocked.CompareExchange(ref m_value, newValue, null) != null &&
                    newValue is IDisposable)
                {
                    ((IDisposable)newValue).Dispose();
                }
            }

            return m_value;
        }
    }
}
Note that T is constrained to a reference type, so that we can use a null check to determine when initialization is needed. We could have used a separate Boolean, but this would have required adding another field as well as considering some trickier memory model issues.
If the Interlocked.CompareExchange fails, it means we lost the lazy initialization race with another thread, and thus just return the value the other thread produced. We also Dispose of the garbage object if it implements IDisposable. This pattern is very common in lazy initialization scenarios, like allocating an expensive kernel object lazily on demand. We’d prefer to get rid of it right away since we know it will never be used.
I wish there was a way to make boxing a compile-time error for some value types. Clearly you don't ever want to box one of these, because making a copy will entirely break the synchronization guarantees.
I’ve omitted some error checking, like ensuring m_init actually got initialized to a non-null value.
Say you need a lazily initialized event on your object. You would just do this:
public class C
{
    private LazyInit<EventWaitHandle> m_event;
    private object m_otherState;

    public C()
    {
        m_event = new LazyInit<EventWaitHandle>(
            delegate { return new ManualResetEvent(false); });
        m_otherState = ...;
    }

    ...

    private void DoSomething()
    {
        ...
        if (... need to set the event ...)
            m_event.Value.Set();
    }
}
And lastly, here’s LazyInitOnlyOnce<T>:
public struct LazyInitOnlyOnce<T> where T : class
{
    private Initializer<T> m_init;
    private T m_value;
    private object m_syncLock;

    public LazyInitOnlyOnce(Initializer<T> init)
    {
        m_init = init;
        m_value = null;
        m_syncLock = null;
    }

    public T Value
    {
        get
        {
            if (m_value == null)
            {
                object newSyncLock = new object();
                object syncLockToUse = Interlocked.CompareExchange(
                    ref m_syncLock, newSyncLock, null);
                if (syncLockToUse == null)
                    syncLockToUse = newSyncLock;

                lock (syncLockToUse)
                {
                    if (m_value == null)
                        m_value = m_init();
                    m_syncLock = null;
                    m_init = null;
                }
            }

            return m_value;
        }
    }
}
We use a monitor to ensure mutual exclusion. I lazily allocate the object used for synchronization, but this is clearly a tradeoff. We pay for the added complexity to the code and the extra interlocked instruction (on the slow path), but avoid having to allocate an extra object when we create the struct itself and keep it alive, when we might not ever need it. There’s already an allocation for the delegate, but this just means there’s one instead of two.
It may also not be obvious why I null out the m_syncLock field before exiting. If we don't, the object will remain live as long as the lazily initialized variable remains live. We want the object to be GC'd as soon as possible, because it is no longer needed.
You can use a class constructor in .NET to achieve a similar effect. Static field initializers, however, execute in the class constructor, meaning if you have multiple lazily initialized objects or static methods, they all get initialized at once. This is much more like LazyInitOnlyOnce<T> than LazyInit<T>, since the CLR uses locks to prevent the class constructor from running on multiple threads at once.
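To make the class constructor approach concrete, here is a minimal sketch. ExpensiveResource, ResourceHolder, and the nested Holder class are hypothetical names I've made up for illustration. Giving each lazily initialized object its own nested holder class is one way to avoid the "everything gets initialized at once" problem, at the cost of one extra type per object:

```csharp
using System;

// Hypothetical resource type, used only for this illustration.
public sealed class ExpensiveResource
{
    public static int CreatedCount; // Counts constructions so the laziness is observable.

    public ExpensiveResource() { CreatedCount++; }
}

public static class ResourceHolder
{
    // Hypothetical nested holder: its class constructor runs on first access
    // to Holder.Instance, and the CLR ensures that it runs only once.
    private static class Holder
    {
        // The explicit (empty) static constructor removes beforefieldinit,
        // so initialization is not allowed to happen earlier than first access.
        static Holder() { }

        internal static readonly ExpensiveResource Instance = new ExpensiveResource();
    }

    public static ExpensiveResource Value
    {
        get { return Holder.Instance; }
    }
}
```

Nothing is constructed until the first read of ResourceHolder.Value, and every later read returns the same instance.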
Anyway, there’s very little that is novel here. But I do believe having these primitives in the .NET Framework would be immensely useful. It would at least help steer people towards the recommended and most efficient lazy initialization pattern, which is to use double-checked locking, rather than having them possibly pursue more complicated designs. It also removes the need to worry about volatile and Thread.MemoryBarrier, for those who aren't familiar with the work we did in the CLR 2.0 to ensure double-checked locking works properly. Lastly, it has the added benefit of getting rid of tricky calls to Interlocked.CompareExchange and lock statements scattered throughout your code, in favor of something more declarative. What do you think?
 Wednesday, May 30, 2007
Intel and AMD processors have had very limited support for SIMD computations in the form of MMX and SSE since the late 90s. Though most programmers live in a MIMD-oriented world, SIMD programming had a surge in research interest in the 80s, and has remained promising for all those years, albeit a bit silently. Vectorization is a fairly popular technique primarily in niche markets such as the FORTRAN and supercomputing communities. Given the rise of GPGPU (see here, here, and here) and rumors floating about in the microprocessor arena, this is an interesting space to watch.
You can get at SSE from managed code, though it requires some hoop jumping and the interop overheads end up killing you. Let's take a quick look at what it takes to use classic loop stripmining techniques for a pairwise multiplication of two arrays.
Since we can't access the SSE instructions directly in managed code, we need to first define a native DLL. We'll call it 'vecthelp.dll' and it just exports a single function:
#include <xmmintrin.h>

const int c_vectorStride = 4;

extern "C" __declspec(dllexport)
void VectMult(float * src1, float * src2, float * dest, int length)
{
    for (int i = 0; i < length; i += c_vectorStride)
    {
        // Vector load, multiply, store.
        __m128 v1 = _mm_load_ps(src1 + i);      // MOVAPS
        __m128 v2 = _mm_load_ps(src2 + i);      // MOVAPS
        __m128 vresult = _mm_mul_ps(v1, v2);    // MULPS
        _mm_store_ps(dest + i, vresult);        // MOVAPS
    }
}
'VectMult' takes two pointers to float arrays, 'src1' and 'src2', of size 'length', and does a pairwise multiplication, storing results into 'dest'. It walks the array with a stride of 4. On each iteration, it does a vector load using the SSE intrinsic '_mm_load_ps', which loads 4 contiguous floats from 'src1' and 'src2' into XMMx registers. Then we multiply them via '_mm_mul_ps' which is a 4-way vector multiply (i.e. the multiplication for each pair occurs in parallel). Lastly, we store the results back to the 'dest' array. Note we naively assume the array's size is a multiple of 4.
To use this routine, we just need to P/Invoke. Well, sadly we also need to do some tricky alignment since SSE demands 16 byte alignment. As I've written before, this isn't easy to achieve on the CLR. I've used stack allocation to avoid pinning the arrays, though clearly for large arrays this would easily lead to stack overflow. It's just for illustration.
using System;
using System.Runtime.InteropServices;

unsafe class Program
{
    // The native export uses the default C calling convention (cdecl),
    // so say so explicitly rather than relying on the P/Invoke default.
    [DllImport("vecthelp.dll", CallingConvention = CallingConvention.Cdecl)]
    private extern static void VectMult(float * src1, float * src2, float * dest, int length);

    public static void Main()
    {
        const int vecsize = 1024 * 16; // 16K floats (64KB per array).

        // Over-allocate by 3 floats so there is room to align up to 16 bytes.
        float * a = stackalloc float[vecsize + (16 / sizeof(float)) - 1];
        float * b = stackalloc float[vecsize + (16 / sizeof(float)) - 1];
        float * c = stackalloc float[vecsize + (16 / sizeof(float)) - 1];

        // To use SSE, we must ensure 16 byte alignment.
        a = (float *)AlignUp(a, 16);
        b = (float *)AlignUp(b, 16);
        c = (float *)AlignUp(c, 16);

        // Initialize 'a' and 'b':
        for (int i = 0; i < vecsize; i++)
        {
            a[i] = i;
            b[i] = vecsize - i;
        }

        // Now perform the multiplication.
        VectMult(a, b, c, vecsize);

        ... do something with c ...
    }

    // Rounds 'p' up to the next multiple of 'alignBytes' (a power of two).
    private static void * AlignUp(void * p, ulong alignBytes)
    {
        ulong addr = (ulong)p;
        ulong newAddr = (addr + alignBytes - 1) & ~(alignBytes - 1);
        return (void *)newAddr;
    }
}
I wish I could report some stellar perf numbers, to the tune of the vector version being 4X faster than a non-vector equivalent. Sadly the P/Invoke overheads kill perf unless the array is unreasonably large. Who needs to multiply two 16MB arrays of floats together? Some people I'm sure, but not many. If the P/Invoke overheads are excluded, however, arrays as small as a few hundred elements see 2X speedup. And for larger arrays it hovers around 3X.
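For reference, the "non-vector equivalent" I have in mind is just the obvious scalar loop. This is a sketch rather than the exact code I measured. ScalarMath and ScalarMult are names I've made up here, and it works on managed arrays instead of raw pointers:

```csharp
using System;

public static class ScalarMath
{
    // Pairwise multiply, one element per iteration: dest[i] = src1[i] * src2[i].
    // Unlike the vector version, 'length' need not be a multiple of 4.
    public static void ScalarMult(float[] src1, float[] src2, float[] dest, int length)
    {
        for (int i = 0; i < length; i++)
        {
            dest[i] = src1[i] * src2[i];
        }
    }
}
```

The vector version does the same work, but retires four of these multiplications per MULPS instruction.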
Clearly as future architectures offer more vector width, these speedups just increase. And perhaps there will eventually be more incentive for native CLR support. Just imagine if we had a 32-core system in which each core had a 16-way vector arithmetic unit: that's 32X16 (512) degrees of parallelism if you can just subdivide the problem appropriately. GPUs, of course, already offer many-fold larger vector width than SSE, which is one reason why GPGPU is attractive. Maybe I'll show how to write a DirectX pixel shader that adds two float arrays together in a future post.
 Saturday, May 19, 2007
I’ve opined on thread affinity several times in the past. I think the term “thread affinity” is en vogue only internal to Microsoft, so it may help to define what it means for the rest of the world.
Many services on Windows have traditionally associated state with the executing thread to keep track of certain ambient contextual information. Errors, security, arbitrary library state. Storing data on the physical thread ensures that it flows around with the logical continuation of work, no matter what APIs are called or how interwoven the stack ends up, and is therefore “always” accessible. Thank our imperative history for this one. People have had to deal with this in Haskell, though since Haskell generally doesn’t have statefulness which persists across callstacks, they came up with a more elegant “implicit parameter” solution.
Affinity creates difficulties for parallel programming models for a number of reasons. We’d like it to be the case that work can be transferred for execution on separate processors when feasible and profitable, and often even implicitly. For example in the query ‘var q = from x in A where p(x) select f(x);’, so long as ‘p’ and ‘f’ are sufficiently complicated and ‘A’ sufficiently large, perhaps we want to run this in parallel. But “transferring work for execution on separate processors” means using many threads to execute the same logical chunk of work. If ‘p’ or ‘f’ rely on thread affinity, what are we to do? Affinity becomes a concurrency blocker here: if part of that work depends on the thread’s identity across multiple steps, then how can we possibly use multiple threads?
One answer is that we first need to know the duration of the affinity if we’re to deal intelligently with it somehow. That’s what the .NET Framework’s Thread.BeginThreadAffinity and EndThreadAffinity are meant for: they denote the boundaries of affinity that has been acquired and then released. But this still doesn’t solve the fundamental problem, which is the mere presence of thread affinity in the first place. We would presumably respond to the affinity by just suppressing parallelism. That’s no good. And sadly affinity isn’t really a well-defined single thing that we can do away with in one well-defined step.
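In code, marking the boundaries looks something like the sketch below. Thread.BeginThreadAffinity and Thread.EndThreadAffinity are the real APIs, while AffinityScope and Run are hypothetical helper names used only for illustration:

```csharp
using System;
using System.Threading;

public static class AffinityScope
{
    // Hypothetical helper: runs 'body' inside a declared thread-affinity region,
    // telling the host that the code depends on the identity of the OS thread.
    public static void Run(Action body)
    {
        Thread.BeginThreadAffinity();
        try
        {
            body();
        }
        finally
        {
            // Always pair the Begin with an End, even if 'body' throws.
            Thread.EndThreadAffinity();
        }
    }
}
```

Anything that touches TLS or thread-owned OS state would go inside the delegate passed to Run. The calls only announce the affinity; they do nothing to remove it.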
Win32 is littered with affinity: error codes are stored in the TEB (accessible via GetLastError), as are impersonation tokens and locale IDs. Arbitrary program and library state can be, and routinely is, stashed away into Thread Local Storage (TLS) for retrieval later on. In fact, most mutual exclusion mechanisms today assume thread affinity: that is, a lock is taken by some thread and then the only agent in the system working under the protection of that lock is that one thread. Various transactional memory nesting forms seek to solve this problem, including what happens when many threads which comprise the same logical piece of work need to write to overlapping data. Heck, stacks are even a subtle form of thread affinity too, in which some portion of the program state is all cobbled up with the thing which is meant to execute the program itself. COM introduced an even more grotesque form of affinity with its apartment threading model, particularly Single Threaded Apartments (STAs), in which components created on an STA are only ever accessed from the single STA thread in that apartment. And let’s not forget all of the GUI frameworks: all of the Windows GUI frameworks are built on the notion of strong affinity. And since the introduction of LIBCMT and MSVCRT those C Runtime library functions which historically relied on global state now rely on TLS, so some of the CRT itself is even guilty (which means those programs that use the guilty parts are also guilty, though perhaps unknowingly). Managed code adds one degree of separation by detaching the CLR thread from the OS thread, which is a step in the right direction; but the .NET Framework is still littered with affinity that is either inherited from Win32 or of its own creation. And so on, and so forth.
All of those examples of thread affinity above are cases where the library developer needed to have an isolated context. There really is no reason that this context needs to be specific to a single OS thread; it just so happens that the context most of them have chosen is in fact specific to one. The .NET Framework's approach of offering a layered and shiftable abstraction on top of the concrete thing is promising... assuming you’re comfortable using that abstraction. CLR remoting offers various forms of contexts that flow in a multitude of ways. Sadly the machinery is complex, not documented satisfactorily, and is, well, tied to remoting. We need something more general purpose and ubiquitous. Maybe the CLR thread is it. Until somebody needs to come along and build something that is one level above CLR threads, I suppose.
So how bad is all of this anyway? It’s actually fairly bad. Any one of these things in isolation is teachable and avoidable, but pile it all up and what you’re left with is a veritable minefield to navigate. Affinity shows up as a huge concurrency blocker alongside other favorites like mutable data structures and impure functions. As if concurrency weren’t hard enough!
 Sunday, May 13, 2007
Everybody’s probably aware of the RegisterWaitForSingleObject method: it exists in the native and CLR thread pools, and does pretty much the same thing in both. (It's called CreateThreadpoolWait and SetThreadpoolWait in Vista.) This feature allows you to consolidate a bunch of waits onto dedicated thread pool wait threads. Each such thread waits on up to 63 registered objects using a wait-any-style WaitForMultipleObjects. When any of the objects become signaled, or a timeout occurs, the wait thread just wakes up and queues a callback to run on the normal thread pool work queue. Then it updates timeouts and possibly removes the object from its wait set, and then goes right back to waiting.
This is great. Fewer threads, more overlapped waiting, better performance. If you wait on 1,024 objects, you only need 17 threads instead of 1,024. Not only do you end up with fewer threads, but your program can actually handle the case where all 1,024 objects become signaled at once, because the thread pool throttles the number of threads that can run callbacks.
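Here's a small sketch of a well-behaved registration, using an event rather than a mutex, plus the arithmetic behind the 17 threads. RegisteredWaitDemo, WaitThreadsNeeded, and RunOnce are names I've made up for illustration, and the 63-objects-per-wait-thread figure is simply the one quoted above:

```csharp
using System;
using System.Threading;

public static class RegisteredWaitDemo
{
    // Wait threads needed if each one can watch at most 63 registered objects.
    public static int WaitThreadsNeeded(int objectCount)
    {
        const int perThread = 63;
        return (objectCount + perThread - 1) / perThread; // Ceiling division.
    }

    // Registers a one-shot wait on an event, signals it, and reports
    // whether the callback ran on a pool thread within five seconds.
    public static bool RunOnce()
    {
        using (AutoResetEvent signal = new AutoResetEvent(false))
        using (ManualResetEvent done = new ManualResetEvent(false))
        {
            RegisteredWaitHandle registration = ThreadPool.RegisterWaitForSingleObject(
                signal,
                (state, timedOut) => done.Set(), // Runs on a worker thread.
                null,
                Timeout.Infinite,
                true);                           // executeOnlyOnce

            signal.Set();
            bool ran = done.WaitOne(5000);
            registration.Unregister(null);
            return ran;
        }
    }
}
```

WaitThreadsNeeded(1024) comes out to 17, since 16 threads only cover 1,008 objects.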
But you really don’t want to register a wait for a mutex. If you stop to think about it for a moment, the reason will become clear. It just doesn’t make any sense with the architecture I just explained.
The pool’s wait threads are the ones that do the actual waits. And when a wait for a mutex is satisfied, the thread which performed the wait now owns the mutex. Uh oh. In our case that means the wait thread owns the mutex. But all the wait thread knows how to do is wait on stuff and queue callbacks. There are two problems here.
The first problem is that the thread which will run the callback lives in the thread pool’s worker queue, and doesn’t actually own the mutex. Which means it can’t actually release the mutex either. In fact, nobody really can, except for the wait thread that performed the wait, but remember all that thread knows how to do is wait on stuff and queue callbacks. A mutex? What the heck is that? Eventually the wait thread may exit and the mutex may become abandoned, but whether this actually happens depends on the ebb and flow of wait registrations.
(With the Win32 thread pool, you can specify the WT_EXECUTEINWAITTHREAD flag during registration, which ensures the callback is run in the wait thread itself and not queued to a worker thread. While this can suffice as a workaround to this problem, it’s generally a bad practice to hold up the wait thread from doing its job. And there is no equivalent in Vista or with the CLR thread pool.)
The second problem may or may not surface depending on whether you’ve specified that the wait callback should execute only once. If the callback executes only once, the thread pool will remove it from its wait set after waking up once. Otherwise, it keeps it in the wait set and goes back to waiting on it right after queuing the callback. Here are the "only once" defaults for the Vista, legacy, and CLR pools: yes in Windows Vista (and no way to specify otherwise, other than reregistering manually), no in the legacy Win32 pool (unless the WT_EXECUTEONLYONCE flag is passed during registration), and you always have to specify in managed code with the executeOnlyOnce argument.
So what's the problem? Because mutexes allow recursive acquires, if the callback is set to execute multiple times, the wait thread will simply go back and wait on all of its objects, including the mutex, after it queues a callback. The same thing that happens with a persistent signal object like a manual-reset event now happens: each time the wait thread tries to wait, the acquisition of the mutex immediately succeeds, incrementing its recursion counter by one, and each time causing the wait thread to queue yet another callback. Ouch. The insanity never stops:
Mutex m = new Mutex();
m.WaitOne();
ThreadPool.RegisterWaitForSingleObject(
    m,
    delegate { Console.WriteLine("The insanity!"); },
    null, -1, false);
m.ReleaseMutex();
The moral of the story? Nothing terribly deep. Thread affinity strikes once again.
 Tuesday, April 24, 2007
Haskell is the most underappreciated yet extraordinarily significant programming language in the world. The syntax is frightening enough to scare off those with weak stomachs, but some of the most interesting and creative research in type systems and, within recent years, parallelism has arisen from the Haskell community. I recently stumbled across a fascinating paper from the ACM SIGPLAN History of Programming Languages Conference (HOPL III) from earlier this year:
A History of Haskell: being lazy with class
Abstract This long (55-page) paper describes the history of Haskell, including its genesis and principles, technical contributions, implementations and tools, and applications and impact.
First I'll admit that I'm a functional programming geek. Second I'll admit that I love reading about technology history. But those biases aside, the paper is really quite good. Recommended reading for anybody who's ever run across a lambda floating around in their dreams. I know that I have.
 Sunday, April 22, 2007
Somebody I respect a lot on our team said something interesting the other day: paraphrasing, "parallelism is about taking one trick and applying it to as many things as possible." Well, what's the trick? The trick is breaking a problem into successively smaller pieces on which disjoint subsets of the overall computation can run concurrently. Pieces in this sense can be little bits of data or instructions, or both. It seems so obvious, but that really is all there is to it. That's not to say it's easy, of course, though some people believe it is. One of the nice things about PLINQ is that you express a big computation and we hide the tricks. But the tricks aren't impossible to do on your own... today, even.
 Thursday, April 19, 2007
Michael Suess, author of the tremendously cool blog thinkingparallel.com, recently ran a series of interviews. He asked five parallelism experts from different domains (Erlang, MPI, OpenMP, POSIX threads, .NET threads) to answer the same set of questions:
It's interesting to see the range of answers given. We agree on many things, but our different backgrounds really shine through in other cases.
 Monday, April 16, 2007
To gather meaningful performance metrics, it's usually a good idea to run several iterations of the same test, averaging the numbers in some way, to eliminate noise from the results. This is true of sequential and fine-grained parallel performance analysis alike. Though it's clearly important for sequential code too, data locality can add enough noise to your parallel tests that you'll want to do something about it. For example, if iteration #1 enjoys some form of temporal locality left over from iteration #0, then all but the first iteration would receive an unfair advantage. This advantage isn't usually present in the real world -- most library code isn't called over and over again in a tight loop -- and could cause test results to appear rosier than what customers will actually experience. Therefore, we probably want to get rid of it.
To eliminate this noise, we can clear the Dcaches (data caches) of all processors used to run tests before each iteration. How do you do that? Well...
Intel offers a WBINVD instruction to clear a processor's Dcache, but sadly it's a privileged instruction and there's no way to get at it from user-mode. So that's a no-go for most Windows programmers. There's also a Win32 function to clear a processor's Icache, but this doesn't work for Dcaches, which is what we're after, so we're out of luck there too.
Instead, we can implement a really low tech solution. Take some random data, sized big enough to fill a processor's L2 cache, and read the whole thing from each processor whose cache we wish to clear before each iteration. This will evict all of the existing lines in the caches that could be left over from previous iterations. All of the new lines will be brought in as shared, and, while they will be evicted when we start using real data in the query, this effectively simulates a cold cache. Here's an example of this:
const int s_garbageSize = 1024 * 1024 * 64; // 64MB.
static IntPtr s_garbage = System.Runtime.InteropServices.Marshal.AllocHGlobal(s_garbageSize);

unsafe static void ClearCaches()
{
    IntPtr thread = GetCurrentThread();
    IntPtr originalMask = IntPtr.Zero;

    for (int i = 0; i < Environment.ProcessorCount; i++)
    {
        // Pin to processor i. The call returns the previous mask; remember the first one.
        IntPtr previousMask = SetThreadAffinityMask(thread, new IntPtr(1 << i));
        if (i == 0) originalMask = previousMask;

        long * gb = (long *)s_garbage.ToPointer();
        for (int j = 0; j < s_garbageSize / sizeof(long); j++)
        {
            long x = *(gb + j); // Read the line (shared).
            long y = Math.Max(Math.Min(x, 0L), 0L); // Prevent optimizing away the read.
        }
    }

    // Restore the original affinity. (A mask of 0 is rejected as an invalid parameter.)
    if (originalMask != IntPtr.Zero)
        SetThreadAffinityMask(thread, originalMask);
}

[DllImport("kernel32.dll")]
static extern IntPtr GetCurrentThread();

[DllImport("kernel32.dll")]
static extern IntPtr SetThreadAffinityMask(IntPtr hThread, IntPtr dwThreadAffinityMask);
This clearly isn't the most efficient implementation. On multi-core architectures, some cores are apt to share some levels of cache, so with the above approach we'll end up doing duplicate (and wasted) work. And few processors on the market have 64MB of L2 cache -- I've just chosen a reasonable number that's bigger than most processors -- so we could try to be more precise there too. You could use the GetLogicalProcessorInformation API, new to Windows Server 2003 (server) and Vista (client), if you really wanted to be a stud. In any case, this does the trick.
 Thursday, April 12, 2007
I wrote an article that appears in the May 2007 issue of MSDN Magazine. It's now online for your reading pleasure:
CLR Inside Out: 9 Reusable Parallel Data Structures and Algorithms
This column is less about the mechanics of a common language runtime (CLR) feature and more about how to efficiently use what you’ve got at your disposal. Selecting the right data structures and algorithms is, of course, one of the most common yet important decisions a programmer must make. The wrong choice can make the difference between success and failure or, as is the case most of the time, good performance and, well, terrible performance. Given that parallel programming is often meant to improve performance and that it is generally more difficult than serial programming, the choices are even more fundamental to your success.
In this column, we’ll take a look at nine reusable data structures and algorithms that are common to many parallel programs and that you should be able to adapt with ease to your own .NET software. Each example is accompanied by fully working, though not completely hardened, tested, and tuned, code. The list is by no means exhaustive, but it represents some of the more common patterns. As you’ll notice, many of the examples build on each other.
(Read more...)
The 9 items are: Countdown Latch, Reusable Spin Wait, Barrier, Blocking Queue, Bounded Buffer, Thin Event, Lock-Free Stack, Loop Tiling, Parallel Reduction. Much of the content is closely related to, or even derived from, content that will appear in my book. (Yes, it's still in the works.)
As Stephen notes on the MSDN Magazine blog, there was a printing error which resulted in the last page of the article being printed twice, one of which overwrote another page in the article. Thankfully the online article doesn't suffer from this same problem. But to remedy this, the article will also appear in next month's magazine, for double the fun.
 Wednesday, April 11, 2007
Late last summer, an interesting issue with traditional optimistic read-based software transactional memory (STM) systems surfaced. We termed this “privatization” and there has been a good deal of research on possible solutions since then. I won’t talk about solutions here, but I will give a quick overview of the problem and a pointer to recent work.
As a quick refresher, optimistic reads are nice because they are invisible. Being invisible eliminates the responsibility from the reading transaction to inform other transactions about the act of reading. Why is that nice? Because informing other transactions requires shared writes, which are expensive and often require atomic (compare-and-swap) writes. Doing that on every read can clearly hamper your performance. With optimistic systems, this step can be skipped, which is a stark contrast to pessimistic read-based systems.
As many have already described (e.g. Harris and Fraser), this is often accomplished by having the reader observe a location-specific version number after the read and ensuring that writing transactions increment this same version number during commit. Transactions later validate the observed version numbers, and if any changed, the transaction rolls back. Notice that the reading transactions continue to do work after the optimistic read, in hopes that the work needn’t be thrown away later due to a conflict. This, as you can imagine, is where the “optimistic” terminology comes from. But this is also why we run into the privatization issue.
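To make the version-number scheme concrete, here is a deliberately simplified sketch for a single location. VersionedCell and its members are hypothetical names, the writer is serialized with a plain lock, and the version is sampled just before the value is read (an ordering I chose for this sketch). A real STM tracks many locations per transaction and has to deal with ownership during commit, which this ignores:

```csharp
using System;
using System.Threading;

// Hypothetical single location guarded by a version number.
public sealed class VersionedCell
{
    private int m_version; // Incremented by every committing writer.
    private int m_value;
    private readonly object m_writeLock = new object();

    // Optimistic (invisible) read: no shared writes, just remember the version.
    public int Read(out int observedVersion)
    {
        observedVersion = Volatile.Read(ref m_version);
        return Volatile.Read(ref m_value);
    }

    // Validation: true only if no writer has committed since the read.
    public bool Validate(int observedVersion)
    {
        return Volatile.Read(ref m_version) == observedVersion;
    }

    // Commit of a writing transaction: publish the value, then bump the version.
    public void CommitWrite(int newValue)
    {
        lock (m_writeLock)
        {
            Volatile.Write(ref m_value, newValue);
            Volatile.Write(ref m_version, m_version + 1);
        }
    }
}
```

A reader calls Read, keeps working with the value, and only discovers at Validate time whether that work has to be thrown away. The window between the two calls is where privatization bites.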
To illustrate, imagine that we have some linked list and two transactions, Tx1 and Tx2. Tx1 walks the list and updates all nodes (perhaps by incrementing some counter), and Tx2 simply removes one node from the list so that it can do some work privately with it. The code might look like this:
class Node { Node next; int value; }
Node head = …;

// Tx1:
atomic {
    S0: Node n = head;
    S1: while (n != null) {
    S2:     n.value++;
    S3:     n = n.next;
        }
}

// Tx2:
Node n;
atomic {
    S4: n = head.next; // take the 2nd element
    S5: head.next = n.next;
}
S6: Console.WriteLine(n.value);
Assuming all nodes have values of 0 to begin with, and Tx2 commits before Tx1, is it possible for Tx2 to print out the value 1 at S6? Perhaps surprisingly (and disappointingly), the answer is yes (with traditional optimistic read systems, as described in the literature).
How? Say Tx1 executes S0 through S3 first. So Tx1’s local variable n now contains a reference to the 2nd node in the list. Then Tx2 runs S4 and S5, removing the 2nd node from the list. Then Tx2 commits successfully, and the IP is sitting at S6 but hasn’t run yet. Note that Tx1 is doomed at this point—it has read a reference to the 2nd node via the head’s next reference which is now out of date—but doesn’t know it and, with traditional optimistic read systems, won’t find out until it tries to commit.
From here, things go terribly wrong. Tx1 may run and write to the 2nd node’s value which, in this particular example, could cause S6 to erroneously print out the value of 1. Worse, for more complicated data structures, invariants may be horribly broken as there are plenty of races: S6 could even execute while Tx1 is in the process of rolling back, etc. This can clearly be catastrophic.
This problem is described in depth in Larus and Rajwar’s recent transactional memory book. Spear et al. recently released a technical report that also overviews the problem and presents some possible approaches to solve it. Some have suggested that data must be once-transactional, always-transactional, but some thought exercises and other simpler examples should be sufficient to convince you that that direction isn't very straightforward either.
 Thursday, March 29, 2007
One of the motivations for doing a new reader/writer lock in Orcas (ReaderWriterLockSlim) was to do away with one particular scalability issue that customers commonly experienced with the old V1.1 reader/writer lock type (ReaderWriterLock). The basic issue stems from exactly how the lock decides when (or in this case, when not) to wake up waiting writers. Jeff Richter’s MSDN article from June of last year highlights this problem. This of course wasn't the primary motivation, but it was just another straw hanging off the camel's back.
Contrast some choice behavior exhibited by the two lock types:
- Both ReaderWriterLock and ReaderWriterLockSlim will block new readers from acquiring the lock as soon as a writer begins waiting to enter. That means that as soon as all active readers exit the lock, the writer will be awoken and allowed to enter the lock. (For all intents and purposes, by the way, we treat upgrade and write locks the same.)
- When a write lock is released, the ReaderWriterLock type will wake up all waiting readers, even if there are waiting writers. Again, new readers are blocked, and once this awoken batch of readers exits the lock again, the next writer in line is awoken.
- When a write lock is released, the ReaderWriterLockSlim type will wake up a waiting writer instead of readers. Readers may only proceed once there are no longer any waiting writers.
These last two points illustrate the basic issue with the old lock. (And the new one, too, to be brutally honest.) If a large number of writers is waiting to enter the ReaderWriterLock, each will be staggered by the amount of time it takes for all intervening readers to enter the lock, do their work, and exit. This can send the wait time for writers through the roof.
To further illustrate the point, imagine we have two writers (W0 and W1) and two readers (R0 and R1), each of which enters the lock, does some work for 1 unit of time, exits, and then goes back around and tries to acquire the lock again:
Thread  Arrival  Enter  Exit
W0      0        0      1
W1      0        3      4
R0      0        1      2
R1      0        1      2
W0      1        5      6
W1      4        8      9
R0      2        4      5
R1      2        4      5
...
R0      5        6      7
R1      5        6      7
Notice that the writers have to wait for a very long time to be serviced in comparison to the readers. If more and more writers show up, this problem becomes magnified, regardless of how many readers there are or the ratio of readers to writers.
The new lock doesn’t suffer from this same problem. But it does suffer from a different one: possible starvation of readers if there is always a writer arriving or waiting at the lock. As Jeff mentions in his MSDN article, most reader/writer locks work best when the ratio of reads to writes is high. In the above example, though, the readers actually would never get to enter the lock:
    Thread  Arrival  Enter  Exit
    W0      0        0      1
    W1      0        1      2
    R0      0        ??     ??
    R1      0        ??     ??
    W0      1        2      3
    W1      2        3      4
    ...
    W0      3        4      5
    W1      4        5      6
So which is better? If writers are less frequent in your scenario—as they usually are—then the new lock will probably fit the bill. If not, you might run into troubles with the new one.
We had originally planned to allow you to configure the contention policy. In fact, if you picked up an earlier Orcas CTP, you probably noticed the ReaderWriterLockSlim constructor that took an enumeration value specifying the contention policy: PrefersReaders, PrefersWritersAndUpgrades, and Fifo. This simply added too much complexity for the short timeframe of the Orcas release, so it recently silently disappeared.
Though it’s the hardest to implement, I do think FIFO (or some heuristic approximation thereof) is the right answer here. Block new readers once a writer begins waiting. When the last reader exits, wake up the waiting writer. When the writer exits, wake up the next n contiguous waiting readers (all readers between the exiting writer and the next writer in the wait queue, if any) or the next writer (if a writer is next in the wait queue). Or, as noted, some approximation of this logic, since it could be fairly costly to orchestrate all the bookkeeping, particularly the event waiting and signaling. But a FIFO-like ordering ensures some strong correlation between arrival time and relative wait time, which, I think, is what most people expect and desire. There are of course convoy problems that can happen when strict FIFO ordering is used, so I would expect we would still allow (some) arriving requests to pass others in line.
This last suggestion is actually quite similar to what the new SRWLOCK in Vista does. ReleaseSRWLockShared and ReleaseSRWLockExclusive signal the next threads in line based on a wait queue structure, without any sort of “prefers readers over writers” (or vice versa) policy. But that’s a topic for a separate day.
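To make the FIFO wake-up rule described above a little more concrete, here is a minimal sketch of just the "who gets woken next" decision. It is not how ReaderWriterLockSlim or SRWLOCK is implemented; the WaiterKind enum, the FifoWakePolicy class, and the NextToWake method are hypothetical names made up for illustration, and all of the event waiting and signaling is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of a waiter in the lock's wait queue.
public enum WaiterKind { Reader, Writer }

public static class FifoWakePolicy
{
    // Called when the lock becomes free (a writer exits, or the last reader exits).
    // Dequeues and returns the waiters to wake: either the single writer at the
    // head of the queue, or the run of contiguous readers ahead of the next writer.
    public static List<WaiterKind> NextToWake(Queue<WaiterKind> waitQueue)
    {
        var woken = new List<WaiterKind>();
        if (waitQueue.Count == 0)
            return woken;

        if (waitQueue.Peek() == WaiterKind.Writer)
        {
            // A writer is next in line: wake exactly one writer.
            woken.Add(waitQueue.Dequeue());
            return woken;
        }

        // Readers are next in line: wake all of them up to the next writer.
        while (waitQueue.Count > 0 && waitQueue.Peek() == WaiterKind.Reader)
            woken.Add(waitQueue.Dequeue());

        return woken;
    }
}
```

With a queue of R, R, W, R, the first call wakes the two leading readers, the second wakes the writer, and the third wakes the trailing reader, so nobody is passed over indefinitely.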
 Friday, March 09, 2007
The CLR commits the entire reserved stack for managed threads. This by default is 1MB per thread, though you can change the values with compiler settings, a PE file editor, or by changing the way you create threads. We've been having a fascinating internal discussion on the topic recently, and I've been surprised how many people were unaware that the CLR engages in this practice. I figure there's bound to be plenty of customers in the real world that are also unaware.
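Of the knobs just mentioned, "changing the way you create threads" is the easiest to show in code: the Thread constructor has an overload that takes a maximum stack size in bytes. The sketch below is only an illustration; the SmallStackThreads class and RunWithStackSize method are hypothetical names, and the 256KB figure used in the note after the block is an arbitrary value, not a recommendation.

```csharp
using System;
using System.Threading;

public static class SmallStackThreads
{
    // Runs 'work' on a new thread whose stack is limited to maxStackSizeBytes
    // (instead of the default taken from the executable's PE header), and
    // waits for it to finish.
    public static void RunWithStackSize(ThreadStart work, int maxStackSizeBytes)
    {
        var thread = new Thread(work, maxStackSizeBytes);
        thread.Start();
        thread.Join();
    }
}
```

For example, SmallStackThreads.RunWithStackSize(DoWork, 256 * 1024) would run a (hypothetical) DoWork method on a thread with a 256KB stack. A smaller stack means less to commit per thread, at the cost of overflowing sooner on deep recursion.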
Let's see some pages
This behavior can be seen quite easily by breaking into a debugger (like WinDbg) and inspecting the status of the virtual memory pages comprising a thread's stack. For example, from WinDbg the !teb command will show you the highest stack address (StackBase) and !vadump will show you the status of all pages in the process's address space. From this you can see that the relevant stack pages are in the state MEM_COMMIT rather than MEM_RESERVE.
Here's a quick example taken from a sample managed program. I've broken right inside the Main function and will dump the TEB:
    0:000> !teb
    TEB at 000007fffffde000
        ExceptionList: 0000000000000000
        StackBase:     0000000000190000
        StackLimit:    0000000000189000
        ...
Based on this information, combined with the fact that we know managed thread stacks are 1MB by default, we can determine what memory addresses to look for: we subtract 100000 (1MB) from 190000 (StackBase) to arrive at the base address of the stack pages: 90000. Now we dump the virtual memory pages:
    0:000> !vadump
    ...
    (1) BaseAddress: 0000000000090000
        RegionSize:  0000000000001000
        State:       00002000 MEM_RESERVE
        Type:        00020000 MEM_PRIVATE

    (2) BaseAddress: 0000000000091000
        RegionSize:  00000000000f0000
        State:       00001000 MEM_COMMIT
        Protect:     00000004 PAGE_READWRITE
        Type:        00020000 MEM_PRIVATE

    (3) BaseAddress: 0000000000181000
        RegionSize:  0000000000001000
        State:       00002000 MEM_RESERVE
        Type:        00020000 MEM_PRIVATE

    (4) BaseAddress: 0000000000182000
        RegionSize:  0000000000007000
        State:       00001000 MEM_COMMIT
        Protect:     00000104 PAGE_READWRITE + PAGE_GUARD
        Type:        00020000 MEM_PRIVATE

    (5) BaseAddress: 0000000000189000
        RegionSize:  0000000000007000
        State:       00001000 MEM_COMMIT
        Protect:     00000004 PAGE_READWRITE
        Type:        00020000 MEM_PRIVATE
    ...
That summarizes the stack memory. But what does it all mean? I've labeled the individual regions above with numbers so I can reference them. And, remember folks, the stack grows downward in the address space, so we'll discuss them in reverse order:
5. The actively used portion of the stack. Notice that the BaseAddress equals the thread's current StackLimit, and that its BaseAddress+RegionSize equals StackBase. The thread is actively using stack memory only within this region.
4. The guard portion of the stack. When an attempt is made to write to an address within this range (i.e. as the thread's stack grows by virtue of the program calling functions, stackalloc'ing, etc.), the memory's guard status is cleared and a fault is triggered. The OS traps this fault, and responds by committing an additional guard region and then resuming at the faulting instruction. What used to be the guard page has now become part of 5, and the program can continue on its merry old way. (Assuming there is room to commit another guard region; if not, stack overflow ensues.) A couple of things are worth noting. Because the CLR commits the entire stack, the OS doesn't really have to "commit" the memory: it just annotates the next region as the new guard. Also notice that the guard in this program is 28KB in size. Normally the guard is just a single page, but the CLR uses SetThreadStackGuarantee to increase the amount of committed stack we are guaranteed to have at any point in time, at least on OSes that support it. This makes responding to stack overflow easier.
3. This is often referred to as the "hard guard page". If you try to write to this, the OS rips down your process. In the wink of an eye, it’s gone, no callbacks, no nice Dr. Watson dumps, it just disappears. As guard pages are committed, this page is moved so that it's just beyond the guard region. I don't know precisely how this happens w/out having to commit more memory (since it's marked MEM_RESERVE), but I suspect the OS just magically rearranges some page table information.
2. This is the rest of the stack. It hasn't been used yet, and it's not part of the guard region. This is where you'll see a difference between a managed thread's stack and a native thread's stack: the pages are marked MEM_COMMIT for managed code, whereas they'd be MEM_RESERVE for native.
1. This is the final destination of the hard guard page. After the whole stack has been committed and the guard rests just before it at the end of the stack, this page will always remain. It is treated as a separate MEM_RESERVE page and will never be committed.
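As a sanity check on the five regions just described, the arithmetic can be replayed in a few lines. This is only a sketch that hard-codes the BaseAddress and RegionSize values from the dump above; the StackLayoutCheck class and its members are hypothetical names used for illustration.

```csharp
using System;

public static class StackLayoutCheck
{
    // Values copied from the !teb output above.
    public const long StackBase = 0x190000;
    public const long ReservedSize = 0x100000;   // 1MB default stack reservation

    // (BaseAddress, RegionSize) for regions (1) through (5) of the !vadump output.
    public static readonly (long Base, long Size)[] Regions =
    {
        (0x090000, 0x001000),   // (1) final hard guard page, MEM_RESERVE
        (0x091000, 0x0f0000),   // (2) unused but committed stack
        (0x181000, 0x001000),   // (3) current hard guard page, MEM_RESERVE
        (0x182000, 0x007000),   // (4) guard region, PAGE_GUARD
        (0x189000, 0x007000),   // (5) actively used stack
    };

    // True if the regions sit back to back, starting at StackBase - 1MB
    // and ending exactly at StackBase.
    public static bool RegionsTileStack()
    {
        long expected = StackBase - ReservedSize;
        foreach (var (regionBase, size) in Regions)
        {
            if (regionBase != expected)
                return false;
            expected = regionBase + size;
        }
        return expected == StackBase;
    }

    // Size of the guard region (4) in KB.
    public static long GuardSizeKB => Regions[3].Size / 1024;
}
```

The regions tile the range 90000 through 190000 with no gaps, and the guard region's 7000 (hex) bytes work out to the 28KB mentioned in item 4.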
One additional thing is worth noting. When the CLR pre-commits the whole stack, it uses VirtualAlloc to do so. This leaves the guard page close to the bottom of the stack, the hard guard page just behind it, and the StackLimit in the TEB set to the address where the guard page ends. This surprises some people, i.e. they expect to see a StackLimit set to, say, 91000 in the above example. But remember, the OS doesn’t get involved at all in our pre-committing of the stack.
Method to the madness
Why in the heck would we do all of this?
Believe it or not, there is a method to the madness. When the OS tries to commit a new guard region, it can fail. It won’t fail due to insufficient virtual memory space (not that such things would matter much on X64 anyhow), because the memory is already reserved. We can handle those cases just fine. Rather, it might fail if there is insufficient pagefile backing to commit the memory. This would manifest as a stack overflow. Sadly, at the point of this stack overflow, the CLR’s vectored exception handler (which responds to ERROR_STACK_OVERFLOW) would then have only the guard region’s worth of stack space in which to do anything reasonably intelligent. (Which, recall, was traditionally one page.) The unhosted CLR just has to issue a failfast in this case, but it also wants to do things like create a Windows Event Log entry, play nicely with Dr. Watson and debuggers, and so on.
This is also required for hosts like SQL Server who try to continue running in the face of stack overflow. In these cases, the CLR has to call out to the host to see what it would like to do. Maybe the host can run in just a page’s worth of stack, and maybe it can’t. The CLR doesn’t try to recover in unhosted situations because it is extremely difficult, and there are even problems with some of Win32 itself not being able to tolerate the presence of stack overflow (most notably CRITICAL_SECTIONs). But the SQL Server engine is a very carefully engineered piece of software and they have a lot of experience (and success, apparently) remaining running in these cases.
If we commit the entire stack, there is no fear of running out of physical space during stack growth, because the whole thing has already been backed.
But of course this is the major downside to this design as well. The pressure this puts on the pagefile is not negligible. If you have 1,000 threads in a process, you need 1GB of pagefile space to back all of their stacks. Sure, that’s a lot of threads, but heck, that’s a lot of disk space too! A stack page won’t require physical memory until it’s actually faulted in (i.e. read from or written to), but the pagefile expense is a high price to pay for what amounts to an obscure (and dubious) condition.
I say “dubious” because you have to wonder: is it even worth it anyway? Probably not. On modern Windows OSes that support SetThreadStackGuarantee, there’s little reason to commit anything but the guaranteed guard region. The CLR uses this API, which means we can size the guaranteed region large enough that we can always run our stack overflow logic within it. Committing any more than this really is just a waste. Even on OSes without this API, however, we’re going to failfast the process in this situation anyhow. Sure, if we didn’t commit the whole thing up front, then these “out of pagefile space” situations might result in an inability to log an Event Log entry, notify a debugger, and so on, but will we really be able to do that anyway given the extreme resource pressure the machine has to be under to create this situation? Probably not.
In the end, it matters little what I think about the design. This is how it is, and I figured you all should know about it.
 Sunday, March 04, 2007
In 2.0 SP1, we changed the threadpool's default maximum worker thread count from 25/CPU to 250/CPU.
The reason wasn't to encourage architectures that saturate so many threads with CPU-bound work, nor was it to suggest that having 249 threads/CPU blocked and 1/CPU actively running is actually a good design. A non-blocking, continuation-passing/event-driven architecture is generally a better approach for the latter case. Rather, we did this to statistically reduce the frequency of accidental and nondeterministic deadlocks.
Believe it or not, deadlocking the threadpool was the most frequently reported threading-related customer problem/complaint during my tenure as the CLR's concurrency PM. There are KB articles and a wealth of customer and Microsoft employee blog posts about this issue.
Many algorithms demand dependencies between parallel tasks. And sometimes—as is frequently the case in data parallel algorithms—the number of tasks can be variable up to a factor of the input size. A "task" in this context is just the closure passed to QueueUserWorkItem. Consider a dumb parallel merge sort, which uses divide-and-conquer style parallelism, for example:
    void Sort(int[] numbers, int from, int to)
    {
        if (from < to)
        {
            ...
            ThreadPool.QueueUserWorkItem(delegate {
                Sort(numbers, from, (from+to+1)/2);    // T1
            });
            Sort(numbers, (from+to+1)/2, to);          // T2
            ... Wait for T1 to finish ...
            ... Merge left and right ...
        }
    }
In this case, T1 is run in parallel and sorts the left half; T2 runs on the calling thread and sorts the right. After T2 runs sequentially, it must wait for T1 to complete before moving on to the merge step. As written, this algorithm is clearly inefficient and deadlocks easily. Pass it an array of size 33 on a 2-CPU machine, and the threadpool's default maximums will ensure that some T1's can't even get scheduled, leaving threadpool threads blocked waiting for their queued (and stuck) counterparts to finish. Depending on how tasks are scheduled at runtime this could deadlock.
Clearly the programmer needs to "stop" dividing the problem at some reasonable point, i.e. limit the maximum number of tasks generated; otherwise the task count will grow with some factor of the input size, causing deadlocks for large inputs (not to mention huge context switch and resource consumption overheads). When might that point be? Perhaps the programmer calculates some degree-of-parallelism (DOP) at the top of the recursive call stack, say log2(#ofCPUs). DOP is passed to the first call to Sort and each subsequent recursive call decrements the DOP by 1. So long as the DOP argument is >0, T1 is run in parallel; otherwise, T1 is run sequentially on the same thread, just like T2. This ensures we don't spawn more tasks than there are CPUs.
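Here is one way the DOP-limited version just described might look. Treat it as a sketch under a few assumptions rather than a recommended implementation: the BoundedParallelSort class name, the half-open [from, to) range convention, the use of a ManualResetEvent to wait for T1, and the Merge helper are all my own choices for illustration.

```csharp
using System;
using System.Threading;

public static class BoundedParallelSort
{
    public static void Sort(int[] numbers)
    {
        // DOP = floor(log2(#ofCPUs)), computed once at the top of the recursion.
        int dop = 0;
        for (int n = Environment.ProcessorCount; n > 1; n >>= 1)
            dop++;
        Sort(numbers, 0, numbers.Length, dop);
    }

    // Sorts the half-open range numbers[from, to).
    public static void Sort(int[] numbers, int from, int to, int dop)
    {
        if (to - from <= 1)
            return;
        int mid = from + (to - from) / 2;

        if (dop > 0)
        {
            // T1 runs on the threadpool, T2 on the calling thread.
            using (var t1Done = new ManualResetEvent(false))
            {
                ThreadPool.QueueUserWorkItem(delegate
                {
                    try { Sort(numbers, from, mid, dop - 1); }
                    finally { t1Done.Set(); }
                });
                Sort(numbers, mid, to, dop - 1);
                t1Done.WaitOne();   // wait for T1 before merging
            }
        }
        else
        {
            // DOP exhausted: run both halves sequentially on this thread.
            Sort(numbers, from, mid, 0);
            Sort(numbers, mid, to, 0);
        }

        Merge(numbers, from, mid, to);
    }

    // Merges the sorted ranges [from, mid) and [mid, to) in place via a temp buffer.
    private static void Merge(int[] a, int from, int mid, int to)
    {
        var tmp = new int[to - from];
        int i = from, j = mid, k = 0;
        while (i < mid && j < to)
            tmp[k++] = a[i] <= a[j] ? a[i++] : a[j++];
        while (i < mid) tmp[k++] = a[i++];
        while (j < to) tmp[k++] = a[j++];
        Array.Copy(tmp, 0, a, from, tmp.Length);
    }
}
```

Note that each parallel level still blocks the calling thread in WaitOne, so the code remains dependent on the threadpool being able to schedule T1; the DOP cap only bounds how many such blocked threads one call to Sort can create.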
And this will probably work. Most of the time.
What if, just by chance, the stars aligned and 25 instances of this algorithm ran simultaneously? Seems farfetched? Maybe so. Consider this: using log2(#ofCPUs) might not be enough in the case that some comparison routines block during sorting, possibly suggesting log2(2 × #ofCPUs) as a better DOP instead. That doubles the number of tasks per sort, and then all we need is 12.5 occurrences. Still a little farfetched, but not quite as much. But wait: there could be other algorithms using the thread pool simultaneously, particularly on a server. (Yes, data parallelism on a server is probably suspect in the first place, but for highly parallel servers with volatile loads, it could be useful.) And remember, the thread pool is shared across all AppDomains in the process, so if you've written a reusable component, you're relying on all other software in the process to behave properly too (which you may have absolutely no control over).
Most of these admittedly represent imperfections in the overall design and architecture of the application, but the sad fact is that they tend to be somewhat common, especially when components are dynamically composed in the process, as is common with server and AddIn-style apps. And they are very nondeterministic and hard to test for. Our platform doesn't offer a mechanism today that allows developers to write code that is intelligently-parallel, particularly when many heterogeneous components are trying to use concurrency for performance. Even with the suggested improvements and the CLR threadpool's old 25/CPU thread limit, the Sort routine could deadlock once in a while, maybe under extreme stress and very hard-to-reproduce conditions. This will occur less frequently, statistically speaking, with the 250/CPU limit. The problem is that all of this is just a heuristic; there aren't any hard numbers or coordination involved.
It’s also worth noting that the threadpool throttles its creation of threads to 2/second once the count has exceeded the #ofCPUs. That means if a programmer sees this situation happening with regularity, they will also observe hard-to-diagnose performance degradations. Once in a while, that sort algorithm might take 10 times longer to run, inexplicably. If this happens a lot, the developer will notice, profile, and fix the issue. While this isn’t great, this problem is typically quite rare in any one (properly written) program, and doesn’t happen with regularity. Our first priority was to prevent the periodic and sporadic hangs, the things eating away at the reliability and uptime of programs, and to trade them off for possible periodic performance blips. Many of those horribly misarchitected programs will still deadlock deterministically with the new limits, and the thread injection throttling will help to discourage them from being written this way. (It would take 125 seconds to create 250 threads on a 1-CPU machine, and it seems unlikely that a 2 minute-plus delay would be tolerated. Some people use SetMinThreads to get around this, which is (usually) inexcusable.)
With all of this said, we clearly have a lot of work to do in the future to encourage better parallel program architectures and to provide better tools for diagnostics in this area. I agree with the basic tenet "use as few threads as possible, but no fewer," but sometimes we have to sacrifice idealism to solve practical real-world problems. In my experience, this should solve many such problems.
 Monday, February 19, 2007
A reader asked for clarification on a past article of mine, regarding my claim that one particular variant of the double checked locking pattern won't work on the .NET 2.0 memory model. The confusion was caused because my advice seems to contradict Vance's MSDN article on the topic.
The problem is with variants of double checked locking that use a flag to indicate that a variable has or has not been initialized, versus using the presence of null to indicate this. This can come in handy if null is a valid initialized value, when the value is a value type, and/or if multiple variables are involved in the initialization.
After following up with a few Microsoft and Intel folks about this, I still believe this to be an issue. Here is what I claim:
- Because standard Intel processors (X86/IA32, EM64T) use non-binding speculative reads, the problem will not happen due to speculation. And because processor consistency memory models don’t permit loads to freely reorder, this won’t happen because of cache hits.
- However, on IA64, non-volatile loads can be freely reordered, and therefore a cache hit can cause the load of the value to pass the load of the flag. I have not been given a clear answer yet on the nature of IA64’s speculation model, but I suspect IA64 is non-binding too, and therefore this cannot occur as a sole result of branch prediction (though that is pretty much immaterial because of cache reordering).
- In talking with some compiler folks here, they also agree that legal compiler transformations (according to .NET 2.0’s memory model) can break the code.
- With that said, no Microsoft compiler we know of will actually make the transformation.
- With some simple (though unlikely) modifications, existing compilers could find it more attractive to apply CSE/PRE, causing the read to move and break the code pattern.
The take-away is not necessarily the specific details, though perhaps those are interesting too. Rather, the primary take-away is that you really ought to use the volatile modifier whenever you aren’t 100% certain that the default memory model will prevent these kinds of reorderings. (And even then, volatile is still a good idea, to declare your intent to other programmers looking at the code.)
As I mentioned in the original article, the use of volatile is enough to ensure this particular example works correctly.
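For readers who haven't seen the original article, the flag-based variant under discussion looks roughly like the sketch below, with the flag marked volatile as recommended. The LazyValue class, its field names, and the Func<int> factory are hypothetical stand-ins for whatever state is actually being lazily initialized.

```csharp
using System;

public sealed class LazyValue
{
    private readonly object _sync = new object();
    private readonly Func<int> _factory;

    private int _value;                  // the lazily initialized state
    private volatile bool _initialized;  // volatile: the read of this flag cannot be
                                         // passed by the later read of _value

    public LazyValue(Func<int> factory)
    {
        _factory = factory;
    }

    public int Value
    {
        get
        {
            if (!_initialized)               // first check, outside the lock
            {
                lock (_sync)
                {
                    if (!_initialized)       // second check, inside the lock
                    {
                        _value = _factory();
                        _initialized = true; // volatile write: publishes _value first
                    }
                }
            }
            return _value;                   // read happens after the flag read
        }
    }
}
```

Without the volatile modifier on the flag, this is exactly the shape in which the load of _value could move ahead of the load of _initialized.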
 Monday, February 12, 2007
Somebody recently asked in a blog comment whether the new ReaderWriterLockSlim uses a full barrier (a.k.a. two-way fence, CMPXCHG, etc.) on lock exit. It does, and I claimed that "it has to". It turns out that my statement was actually too strong. Doing so prevents a certain class of potentially surprising results, so it's a matter of preference to the lock designer whether these results are so surprising as to incur the cost of a full barrier. Vance Morrison's "low lock" article, for instance, shows a spin lock that doesn't make this guarantee. And, FWIW, this is also left unspecified in the CLR 2.0 memory model. Java's memory model permits non-barrier lock releases, though I will also note the JMM is substantially weaker in areas when compared to the CLR's.
Here's an example of a possibly surprising result that non-barrier releases can cause:
Initially x == y == 0.
    Thread 0           Thread 1
    ===============    ===============
    lock(A);           lock(B);
    x = 1;             y = 1;
    unlock(A);         unlock(B);
    t0 = y;            t1 = x;
Is it possible after executing both that: t0 == t1 == 0?
It is simple to reason that this is impossible with sequential consistency. In SC the only way that t0 == 0 is if Thread 0's t0 = y statement (and therefore x = 1, assuming a memory model in which writes happen in order) were to occur before Thread 1's y = 1 (and therefore t1 = x). In this case, t1 = x must subsequently see t1 == x == 1, otherwise the history contradicts SC. The only other possibility is that Thread 1's t1 = x happens before Thread 0's x = 1 and therefore also t0 = y, in which case it must be the case that the subsequent t0 = y by Thread 0 yields t0 == y == 1. In both cases, either x or y must be seen as 1. (Interleavings are clearly possible that result in x == y == 1.)
The CLR 2.0's memory model guarantees that, if the unlock incurs a barrier, the same SC reasoning applies. Unfortunately, if the unlock is not a barrier, then in both cases the load of x or y may occur before the write buffer has been flushed (that is, before the write to x or y and the unlocking write itself have left the buffer), possibly leading to t0 == t1 == 0. This happens even on relatively strong processor consistency memory models such as X86, and on weaker ones such as IA-64 (even when all loads are acquire and all stores are release, which only happens for volatile CLR fields). To ensure the write buffer has been flushed before the read happens, the unlock statement must flush the buffer (or an explicit barrier is needed), accomplished with CMPXCHG on X86 ISAs.
Many would argue that, because the locks taken are different between the two threads, SC does not apply and therefore implementing unlock as a non-barrier write is legal. JMM takes this stance. This actually seems like a fine argument to me, and after thinking about it for a bit, it's probably what I would choose if I were defining a memory model. But the CLR 2.0 MM is generally stronger than most, so people might actually depend on this and expect it to work. This could cause Monitor-based code to break when moved to alternative lock implementations that don't issue full barriers at release time. This is just one example of why it'd be really great to have a canonical specification for the CLR's MM. At least we'd then have a leg to stand on when faced with tricky compatibility trade-offs some day down the line.
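To show where the barrier in question actually lives, here is a toy spin lock whose release is a full fence. It is a sketch for illustration only, not the implementation of Monitor or ReaderWriterLockSlim; the FullFenceSpinLock class and its members are hypothetical names, and the spin count of 20 is arbitrary.

```csharp
using System.Threading;

public sealed class FullFenceSpinLock
{
    private int _held;   // 0 = free, 1 = held

    public void Enter()
    {
        // Interlocked.CompareExchange is a full fence on acquire.
        while (Interlocked.CompareExchange(ref _held, 1, 0) != 0)
            Thread.SpinWait(20);
    }

    public void Exit()
    {
        // Releasing with an interlocked operation makes the unlock a full
        // barrier: earlier writes are flushed before any later read executes.
        // A cheaper alternative, a plain store "_held = 0", would be the
        // non-barrier release discussed above and would permit t0 == t1 == 0.
        Interlocked.Exchange(ref _held, 0);
    }
}
```

Swapping the body of Exit for a plain store (optionally followed by Thread.MemoryBarrier() to restore the fence) is the design choice the whole discussion turns on.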
 Wednesday, February 07, 2007
In Orcas, we offer a new reader/writer lock: System.Threading.ReaderWriterLockSlim.
Motivation for a new lock
The primary reason for creating this type was that we wanted to provide an official reader/writer lock for the .NET Framework that people could actually rely on for performance-critical code. It was no secret that the current ReaderWriterLock type was such a pig, costing somewhere around 6X the cost of a monitor acquisition for uncontended write lock acquires, that most people avoided it entirely. Jeff Richter wrote an entire MSDN article about this, and Vance Morrison showed how to build your own on his weblog. It was really too bad customers couldn't depend on the class in the Framework, and to be honest most devs really shouldn't be in the business of writing and maintaining their own reader/writer lock.
Second, we had a large number of qualms with the existing lock’s design. It has funny recursion semantics (and is in fact broken in a few thread reentrancy cases we know about), and has a dangerous non-atomic upgrade method. Did you know that you actually need to check the WriterSeqNum before and after a call to our ReaderWriterLock’s UpgradeToWriterLock method to ensure it didn’t change during the upgrade? You do. The code actually releases the reader lock before upgrading to the write lock, which allows other threads to sneak in, acquire the lock in between, and possibly change state that was read during the decision to upgrade. The reason? If we upgraded and then released the read lock, two threads trying to simultaneously upgrade would deadlock one another. All of these problems represent very fundamental flaws in the existing type’s design.
So we decided to build a new one that solves all of these problems. To be honest, I would have liked to fix the current one, but the existing API and compatibility responsibilities make that just about impossible. We considered obsoleting the existing one, but as I note at the end of this article, there are still reasons to prefer the old lock.
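For anyone stuck with the old lock, the before-and-after check around the upgrade described above might look something like this sketch. It uses ReaderWriterLock's WriterSeqNum property together with its AnyWritersSince method to detect an intervening writer; the OldLockCounter class, its fields, and the IncrementIfBelow method are hypothetical and exist only to give the pattern something to protect.

```csharp
using System.Threading;

public sealed class OldLockCounter
{
    private readonly ReaderWriterLock _rwl = new ReaderWriterLock();
    private int _count;

    // Increments the counter only while it is below 'limit'.
    // Returns true if the increment happened.
    public bool IncrementIfBelow(int limit)
    {
        _rwl.AcquireReaderLock(Timeout.Infinite);
        try
        {
            if (_count >= limit)
                return false;

            // Snapshot the writer sequence number before upgrading.
            int seqNum = _rwl.WriterSeqNum;
            LockCookie cookie = _rwl.UpgradeToWriterLock(Timeout.Infinite);
            try
            {
                // The upgrade is not atomic: if another writer got in while we
                // were upgrading, what we read above may be stale, so re-check.
                if (_rwl.AnyWritersSince(seqNum) && _count >= limit)
                    return false;

                _count++;
                return true;
            }
            finally
            {
                _rwl.DowngradeFromWriterLock(ref cookie);
            }
        }
        finally
        {
            _rwl.ReleaseReaderLock();
        }
    }

    public int Count
    {
        get
        {
            _rwl.AcquireReaderLock(Timeout.Infinite);
            try { return _count; }
            finally { _rwl.ReleaseReaderLock(); }
        }
    }
}
```

The point is less the specific API calls than the shape: any decision made under the read lock has to be re-validated once the write lock is finally held.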
Three modes: read, write, and upgradeable-read
The new ReaderWriterLockSlim supports three lock modes: Read, Write, and UpgradeableRead, and has the methods EnterReadLock, EnterWriteLock, EnterUpgradeableReadLock, and corresponding TryEnterXXLock and ExitXXLock methods, that do what you’d expect. Read and Write are easy and should be familiar: Read is a typical shared lock mode, where any number of threads can acquire the lock in the Read mode simultaneously, and Write is a mutual exclusion mode, where no other threads are permitted to simultaneously hold the lock in any mode. The UpgradeableRead mode will probably be new to most people, though it’s a concept that’s well known to database folks, and is the magic that allows us to fix the upgrade problem I mentioned earlier. We’ll look at it more closely in a moment.
The performance of the new lock is roughly equal to that of Monitor. When I say “roughly”, I mean that it’s within a factor of 2X in just about all cases. And the new lock favors letting threads acquire the lock in Write mode over Read or UpgradeableRead, since writers tend to be less frequent than readers, generally leading to better scalability. We’d originally considered providing a set of contention management options to choose from, but decided in the end to ship a simpler design that works well for most cases.
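Before getting to upgrades, here is what plain Read and Write usage looks like in practice. This is a minimal sketch; the SharedTable class and its dictionary are hypothetical and just give the lock some state to protect.

```csharp
using System.Collections.Generic;
using System.Threading;

public sealed class SharedTable
{
    private readonly ReaderWriterLockSlim _rwl = new ReaderWriterLockSlim();
    private readonly Dictionary<string, int> _data = new Dictionary<string, int>();

    // Any number of threads may be inside TryGet at the same time.
    public bool TryGet(string key, out int value)
    {
        _rwl.EnterReadLock();
        try
        {
            return _data.TryGetValue(key, out value);
        }
        finally
        {
            _rwl.ExitReadLock();
        }
    }

    // Only one thread may be inside Set, and no readers are active while it runs.
    public void Set(string key, int value)
    {
        _rwl.EnterWriteLock();
        try
        {
            _data[key] = value;
        }
        finally
        {
            _rwl.ExitWriteLock();
        }
    }
}
```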
Upgrading
Let’s look at upgrades more closely now. The UpgradeableRead mode allows you to safely upgrade from Read to Write mode. Remember I mentioned earlier that our old lock breaks atomicity in order to provide deadlock-free upgrade capabilities (which is bad, particularly since most people don't realize it). The new lock neither breaks atomicity nor causes deadlocks. We achieve this by allowing only one thread to be in the UpgradeableRead mode at once, though there may be any number of other threads in Read mode while there’s one UpgradeableRead owner.
Once the lock is held in the UpgradeableRead mode, a thread can then read state to determine whether to downgrade to Read or upgrade to Write. Note that this decision should ideally be made as fast as possible: holding the UpgradeableRead lock forces any new Read acquisitions to wait, though existing Read holders are still permitted to remain active. (Sadly the CLR team seems to have removed two methods, DowngradeToRead and UpgradeToWrite, that I had originally designed for this purpose. I admit what follows isn’t the most obvious way to do it.) To downgrade, you simply call EnterReadLock followed by ExitUpgradeableReadLock: this permits other Read and UpgradeableRead acquisitions to finish that were previously held up by the fact that there was an UpgradeableRead lock held. To upgrade, you similarly call EnterWriteLock: this may actually have to wait until there are no longer any threads that still hold the lock in Read mode. There’s no real reason to also exit the UpgradeableReadLock at this point unlike the downgrade case, though in some cases it makes your code more uniform. E.g.:
    ReaderWriterLockSlim rwl = …;
    …
    bool upgraded = true;
    rwl.EnterUpgradeableReadLock();
    try
    {
        if (… read some state to decide whether to upgrade …)
        {
            rwl.EnterWriteLock();
            try
            {
                … write to state …
            }
            finally
            {
                rwl.ExitWriteLock();
            }
        }
        else
        {
            rwl.EnterReadLock();
            rwl.ExitUpgradeableReadLock();
            upgraded = false;
            try
            {
                … read from state …
            }
            finally
            {
                rwl.ExitReadLock();
            }
        }
    }
    finally
    {
        if (upgraded)
            rwl.ExitUpgradeableReadLock();
    }
Recursive acquires
Another nice feature of the new lock is how it treats recursion. By default, all recursive acquires, aside from the upgrade and downgrade cases already mentioned, are disallowed. This means you can’t call EnterReadLock twice on the same thread without first exiting the lock, for example, and similarly with the other modes. If you try, you get a LockRecursionException thrown at you. You can, however, turn recursion on at construction time: pass the enum value LockRecursionPolicy.SupportsRecursion to your lock’s constructor, and voila, recursion will be permitted. The chosen policy for a given lock is subsequently accessible from its RecursionPolicy property.
There’s one special case that is never permitted, regardless of the lock recursion policy: acquiring a Write lock when a Read lock is held. We considered enabling this, or at least giving a new enum value for it, but decided to hold off for now: if it turns out customers need it, we can always add it later. But it’s dangerous and leads to the same Read-to-Write upgrade deadlocks that the old lock was prone to, and so we didn’t want to lead developers down a path fraught with danger. If you need this kind of recursion, it’s a “simple” matter of changing your design to hoist a call to either EnterWriteLock or EnterUpgradeableReadLock (and the corresponding Exit method) to the outermost scope in which the lock is acquired.
There are corresponding properties IsReadLockHeld, IsWriteLockHeld, and IsUpgradeableReadLockHeld, to determine whether the current thread holds the lock in the specified mode. You can also query the WaitingReadCount, WaitingWriteCount, and WaitingUpgradeCount properties to see how many threads are waiting to acquire the lock in the specific mode, and CurrentReadCount to see how many concurrent readers there are. The RecursiveReadCount, RecursiveWriteCount, and RecursiveUpgradeCount properties tell you how many recursive acquires the current thread has made for the specific mode.
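A small sketch of the two recursion policies and a couple of the properties just listed follows. The RecursionPolicyDemo class and its method names are hypothetical; the lock members used are the ones named in the text.

```csharp
using System.Threading;

public static class RecursionPolicyDemo
{
    // With the default policy, a second EnterReadLock on the same thread throws.
    public static bool ThrowsOnRecursiveRead()
    {
        var rwl = new ReaderWriterLockSlim();   // default: LockRecursionPolicy.NoRecursion
        rwl.EnterReadLock();
        try
        {
            rwl.EnterReadLock();                // recursive acquire
            rwl.ExitReadLock();
            return false;
        }
        catch (LockRecursionException)
        {
            return true;
        }
        finally
        {
            rwl.ExitReadLock();                 // release the first (successful) acquire
        }
    }

    // With SupportsRecursion, the same sequence succeeds and is counted.
    public static int RecursiveReadDepth()
    {
        var rwl = new ReaderWriterLockSlim(LockRecursionPolicy.SupportsRecursion);
        rwl.EnterReadLock();
        rwl.EnterReadLock();
        int depth = rwl.IsReadLockHeld ? rwl.RecursiveReadCount : 0;
        rwl.ExitReadLock();
        rwl.ExitReadLock();
        return depth;
    }
}
```

Each recursive Enter needs its own matching Exit, which is one more reason the default policy of disallowing recursion tends to lead to cleaner code.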
Some limitations: reliability
Lastly, I mentioned there are some caveats around where this lock’s use is appropriate. Well, there’s one, really: it’s not hardened to be reliable. This means a few things.
First, unlike the existing ReaderWriterLock, the ReaderWriterLockSlim type does not cooperate with CLR hosts through the hosting APIs. This means a host will not be given a chance to override various lock behaviors, including performing deadlock detection (as SQL Server does). Thus, you really ought not to use this lock if your code will be run inside SQL Server.
Next, the lock is not robust to asynchronous exceptions such as thread aborts and out of memory conditions. If one of these occurs while in the middle of one of the lock’s methods, the lock state can be corrupted, causing subsequent deadlocks, unhandled exceptions, and (sadly) due to the use of spin locks internally, a pegged 100% CPU. So if you’re going to be running your code in an environment that regularly uses thread aborts or attempts to survive hard OOMs, you’re not going to be happy with this lock. Unfortunately the lock doesn’t even mark critical regions appropriately, so hosts that do make use of thread aborts won’t know that the thread abort could possibly put the AppDomain at risk: many hosts would prefer to wait, or immediately escalate to an AppDomain unload, if an individual thread abort is necessary while the thread is in a critical region. But in the case of ReaderWriterLockSlim, a host has no idea if a thread holds the lock because the implementation doesn’t call Begin- and EndCriticalRegion. And the kind of problems I mentioned in the previous post are always an issue with ReaderWriterLockSlim, because we don't necessarily guarantee that there will be no instructions in the JIT-generated code between the acquisition and entrance to the following try block.
Summary
In summary, the new ReaderWriterLockSlim lock eliminates all of the major adoption blockers that plagued the old ReaderWriterLock. It performs much better, has deadlock-free and atomicity-preserving upgrades, and leads developers to program cleaner designs free of lock recursion. There are some downsides to the new lock, however, that may cause programmers writing hosted or low-level reliability-sensitive code to wait to adopt it. Don’t get me wrong, most people really don’t need to worry about these topics, so I apologize if my words of warning have scared you off: but those that do really need to be told about the state of affairs. Thankfully, I’m confident that many of these issues will be fixed in subsequent releases. And for most developers out there, the new ReaderWriterLockSlim is perfect for the job.
 Monday, January 29, 2007
I previously mentioned the X86 JIT contains a "hack" to ensure that thread aborts can't sneak in between a Monitor.Enter(o) and the subsequent try-block. This ensures that a lock won't be leaked due to a thread abort occurring in the middle of a lock(o) { S1; } block. In the following example, that means an abort can't be triggered at S0:
Monitor.Enter(o); S0; try { S1; } finally { Monitor.Exit(o); }
If an abort could happen at S0, it'd be possible for a thread to acquire lock o, but before entering the try block, be asynchronously aborted, and then not run the finally block to release the lock on o. This would lead to an orphaned lock, and probable deadlocks later on during execution. Debugging an instance of such a deadlock would of course be rather difficult because it depends on a very subtle race condition that must occur within the tiny window of a single instruction. On a single-processor machine, this would require a precariously placed context switch, but as more and more cores are added to the machines that this software runs on, the probability simply increases.
Characterizing this as a "hack" was a little harsh. It's really just a byproduct of the way that the X86 JIT generates code.
For an asynchronous thread abort to be thrown in a thread, that thread must be either: (1) polling for the abort in the EE or (2) running inside of managed code. And even if the thread is in managed code, we may not be able to abort it, as is the case if the thread is currently executing a finally block, inside a constrained execution region, etc. The C# code generation for the lock statement ensures there are no IL instructions between the CALL to Monitor.Enter and the instruction marked as the start of the try block. The JIT correspondingly will not insert any machine instructions in between the two. And since any attempted thread aborts in Monitor.Enter are not polled for after the lock has been acquired and before returning, the soonest subsequent point at which an abort can happen is the first instruction following the call to Monitor.Enter. And at that point, the IP will already be inside the try block (the return from Monitor.Enter returns to the CALL+1), thereby ensuring that the finally block will always run if the lock was acquired.
This might seem like an implementation detail, but the reality is that we can never change it. Too many people depend on this guarantee.
It turns out that Whidbey's X64 JIT does not guarantee this behavior. (I suspect IA64 doesn't either, but don't know for sure.) In fact there's a high probability that this won't work: there is always a NOP instruction between the CALL and the instruction marking the try block in the JITted code. This is done to make it easier to identify try/catch scopes during stack unwind. This means that, yes indeed, an abort can happen at S0 on 64-bit.
This will likely be fixed for the next runtime release, but I can't say for sure.
Update 4/17/08: This was indeed fixed for the X64 JIT in Visual Studio 2008. Note that when compiling C# code targeting both X86 and X64, if you do not use the /o+ switch, this problem can still occur due to extra explicit NOPs inserted before the try.
The framework implements a method Monitor.ReliableEnter, by the way, that could be used to avoid orphaning locks in the face of thread aborts, but it's internal to mscorlib.dll. It sets an out parameter within a region of code that cannot be interrupted by a thread abort, which the caller can then check inside the finally block. The acquisition then gets moved inside so that, if the CALL is reached, the finally block is guaranteed to always run. You'd then write this instead:
bool taken = false; // must be definitely assigned before the finally block reads it
try {
    Monitor.ReliableEnter(o, out taken);
    S1;
}
finally {
    if (taken) Monitor.Exit(o);
}
It's also possible the CLR team would expose this API in the future. We wanted to in Whidbey, but didn't have enough time. If 64-bit code generation was changed so that it doesn't emit a NOP before the try block, however, we probably wouldn't need ReliableEnter after all.
 Monday, January 22, 2007
I was recently asked by a customer how to guarantee alignment of CLR data on 16-byte boundaries. They needed this capability to interoperate with code that uses SSE vector instructions to manipulate the data (which require 16-byte alignment). The bad news is that there’s no real good way of doing this. That is, there isn’t any “align at N bytes” feature for the CLR in which type layout and stack and heap allocation cooperate. The good news is that you can fake it.
(I spoke about alignment with respect to atomic cmpxchg8b instructions previously, right here, for those interested in reading about that too.)
The details of how to go about ensuring 16-byte alignment depend on whether you allocate your data on the stack or the GC heap. For illustration purposes, imagine we’re dealing with an array of float32s. We’d like to ensure the beginning lies on a 16-byte boundary:
- (1) float[] a0 = new float[N]; // GC-allocated array of N floats
- (2) float* a1 = stackalloc float[N]; // stack-allocated array of N floats
If you use the former, GC allocation (1), you’re going to have a really tough time. The GC moves objects around on you as it performs compactions, and only aligns the 1st element of the array on a 4-byte boundary. So even if you manage to get your object allocated on a 16-byte boundary (by chance), it is apt to move during a subsequent GC.
To solve this problem, you’d have to pin the object. Pinning causes GC fragmentation, so I really encourage you to avoid this approach and go with stack allocation, (2), if you can afford it. A float[] on the stack is similarly aligned to begin at a 4-byte boundary, but, unlike (1), it will subsequently not move around. Of course stack allocation is often impossible, or difficult, if you are writing a reusable library that may be called in an unknown context (where the caller may have very little stack left). This is a tradeoff you would have to make. If the pinning is very short lived, i.e. the duration of a single function call, it might be tolerable for you, a la P/Invoke.
Regardless of whether you choose (1) with pinning or (2) by itself, you’ve now got a stable address. And you can use the stable address to calculate the next 16-byte-aligned element in the array from the base address, and then use that as the start of the array. You will need some extra padding at the end for the worst case, which is base + 3, meaning at most 12 bytes, so you need to allocate 3 extra floats in the array. Here’s an example:
static unsafe void* AlignUp(void* p, ulong alignBytes)
{
    ulong addr = new UIntPtr(p).ToUInt64();
    ulong mask = alignBytes - 1; // alignBytes must be a power of two

    // Largest address representable by a pointer in this process.
    ulong maxAddr = IntPtr.Size == 8 ? ulong.MaxValue : uint.MaxValue;
    if (addr > maxAddr - mask)
        throw new OverflowException("overflow");

    ulong newAddr = (addr + mask) & ~mask; // round up to the next multiple
    return new UIntPtr(newAddr).ToPointer();
}
…
float* p = stackalloc float[N + 3];
p = (float*)AlignUp(p, 16);
… use p …
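For readers who want to check the rounding arithmetic outside the CLR, here is a minimal C++ sketch of the same idea. The names align_up and first_aligned_float are hypothetical helpers invented for this illustration, and the sketch assumes the buffer was over-allocated by 3 floats as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rounds addr up to the next multiple of align.
// Assumption: align is a nonzero power of two, so (align - 1) is a bit mask.
inline std::uintptr_t align_up(std::uintptr_t addr, std::uintptr_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    const std::uintptr_t mask = align - 1;
    assert(addr <= UINTPTR_MAX - mask);  // rounding up must not wrap around
    return (addr + mask) & ~mask;
}

// Returns the first 16-byte-aligned float at or after base.
// The caller is assumed to have allocated 3 extra floats (12 bytes) of slack.
inline float* first_aligned_float(float* base) {
    return reinterpret_cast<float*>(
        align_up(reinterpret_cast<std::uintptr_t>(base), 16));
}
```

Because a float buffer starts on at least a 4-byte boundary, the returned pointer is never more than 3 elements past the base, which is why 3 floats of padding are enough.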
Note that if you were to use an array of doubles instead, you’d have some challenges. That’s because an 8-byte value on the 32-bit CLR is only 4-byte aligned, and therefore you can end up with a situation where the next 16-byte boundary falls in the middle of a single element. For example, if the array starts at byte 12, its elements begin at bytes 12, 20, 28, 36, and so on. None of these are 16-byte aligned. Not that it really matters, so long as you allocate enough memory, but you will need to do some casting of the array reference, as shown in the above code, to do the arithmetic.
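The doubles case is easy to verify with a few lines of arithmetic. This is only a sketch: first_aligned_index is a hypothetical helper, and the addresses are plain integers rather than real pointers.

```cpp
#include <cstdint>

// Returns the index of the first of `count` elements whose address is a
// multiple of `align`, or -1 if no element starts on such a boundary.
inline long first_aligned_index(std::uint64_t base, std::uint64_t elem_size,
                                std::uint64_t align, long count) {
    for (long i = 0; i < count; ++i) {
        const std::uint64_t addr = base + static_cast<std::uint64_t>(i) * elem_size;
        if (addr % align == 0) return i;
    }
    return -1;
}
```

With 8-byte elements starting at byte 12, every element address is congruent to 4 or 12 modulo 16, so the search never succeeds; with 4-byte floats starting at byte 12, the very next element (byte 16) qualifies.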
Note also that there’s a StructLayout attribute that allows you to specify alignment, through its Pack field, but sadly this doesn’t impact the GC’s heap or the JIT’s stack alignment, and so it’s useless for our purposes. Though the relative alignment within the data structure will be correct, the absolute alignment is not guaranteed to be so.
OK, so I know all of this isn’t pretty. But it works.
 Sunday, January 07, 2007
Jim Larus, a Microsoft colleague of mine, recently co-authored a book on Transactional Memory with Ravi Rajwar from Intel, one of the most prominent authorities in the TM community. It just became available. You can purchase and download it online at Morgan & Claypool's website: Transactional Memory (Synthesis Lectures on Computer Architecture). The series of which it is part is new, but there are at least a few other great books in its pipeline, including a CMP (chip multi-processor) architecture book by Kunle Olukotun, from Stanford and an architect of Sun's Niagara processor.
 Friday, December 29, 2006
Deadlocks aren’t always because you’ve taken locks in the wrong order.
In many systems, tasks communicate with other tasks through shared buffers. In a concurrent shared memory system, these buffers might be simple queues shared between many threads. In COM and Windows GUI programs these buffers might take the form of a window’s message queue. In any case, if some task A performs a synchronous message send to task B, and task B does a synchronous send to task A, near simultaneously, and if neither task continues to process incoming messages, both will be blocked forever.
This is the classic reentrancy versus deadlock problem. Ensuring that both A and B continue to process incoming messages while blocked on a send will eliminate the deadlock, albeit at the cost of possible reentrancy headaches. Better yet, you could just send messages asynchronously.
Things can get quite a bit more complicated than this, of course. Imagine that we have three operations, A, B, and C, being run over N concurrent streams of data. We use data parallelism to partition the input data into N streams and replicate the operations over them, such that we have A0…AN-1, B0…BN-1, and C0…CN-1 operating on disjoint input. A0 produces data for B0 which produces data for C0, and so on. Elements are pulled on demand from the leaves (A0…AN-1) to the root (C0…CN-1), using a single execution resource (like a thread) per stream, i.e. E0…EN-1.
This is quite a bit like many real data parallel systems, including stream processing.
Imagine that sometimes AN finds that some input data must be given to BM instead of BN (where N != M); there is a similar story for B and C too. We might be tempted to use some form of shared buffer to perform the inter-task communication here. In other words, when A0 finds something for B1, it sends the data to it by placing it into B1’s input buffer. This might be done asynchronously, i.e. A0 needn’t wait for B1 to actually consume the message, hence avoiding the sort of deadlock we noted earlier.
Unfortunately, since it might take some unknowable amount of time for B1 to process its input, we might worry about excessive memory usage for these buffers. So we could put a bound on its maximum size using an ordinary bounded buffer… but once we’ve done that, we have turned what was an asynchronous send into a possibly-synchronous one, and in doing so introduced the same deadlock problem with which I began this whole discussion.
We could solve this by ensuring that, whenever a task must block because the destination buffer is full, it also processes incoming messages in its own buffer. In other words, we use reentrancy. Sadly, things are not always quite so simple.
Imagine this case: A0 has found data for B1, but B1’s buffer has become full. So A0 is now waiting for B1 to process messages from its buffer to make room. Nobody else will produce data for A0 at this point, so it’s stuck waiting. Sadly, B1 too has become blocked, trying to send a message that it has found for C0. Because the same execution resource E0 that must free space in C0’s buffer is currently blocked in A0 waiting for B1 to free space from its buffer, and because the execution resource E1 is also waiting for E0, we now have a very convoluted deadlock on our hands.
The solution? There’s no reason to keep the execution resource E0 occupied in A0 waiting for B1 in this case. E0 could instead be freed up to run C0, freeing space for B1, and untangling the system. Reentrancy strikes again, but this time, in a good way. Note that in a heterogeneous system where these buffers are not controlled by the same resource, this solution is difficult to realize in practice. Maybe A uses a custom bounded buffer written in C# to communicate, B uses SendMessages and COM message queues, and C uses GUI messages. In this case, orchestrating the waiting to be deadlock free becomes a real challenge.
 Friday, December 22, 2006
If you need RegisterWaitForSingleObject-like behavior in the new Vista threadpool -- a great feature, by the way, that amortizes the cost of losing a thread to waiting by cramming up to 63 waits into a single threadpool thread, leading to better scalability on systems that must wait for a lot of things concurrently, sometimes for long or unpredictable times (and the magic behind ASP.NET Asynchronous Pages) -- you'll want to look at the SetThreadpoolWait and related APIs. You get basically the same functionality, with a mostly cleaned up interface and the added benefit of having cleanup groups to take care of resource cleanup (if you so choose):
VOID WINAPI SetThreadpoolWait(PTP_WAIT pwa, HANDLE h, PFILETIME pftTimeout);
One thing annoyed me about this API when I first tried to use it. To be truthful, it still does. Instead of a DWORD timeout specified in 1ms intervals, you'll notice you have to supply a pointer to a FILETIME data structure. Wha-wha-whaat? This differs from just about every other wait API on Windows and is not as simple as it sounds.
The no-timeout case, which is typically -1 (a.k.a. INFINITE), is simple enough: just pass a NULL pointer. The "check, but immediately timeout if unsignaled without waiting," variant, typically specified with 0, is pretty straightforward. Just initialize your FILETIME to contain 0's. There are many ways to do this but using a struct field initializer is probably the easiest:
FILETIME ft = {0,0}; SetThreadpoolWait(..., &ft);
But for real timeouts, how the heck do you get your hands on a properly initialized FILETIME? OK, here's where things get nasty. You can grab ahold of a FILETIME from many places, including from an actual file's creation or modification time, but you probably don't want to do that. GetSystemTimeAsFileTime(&ft) will turn the current time into a FILETIME, which is useless without additional math to turn it into a time relative to now (otherwise, you'd just use a {0,0} FILETIME as noted above). Alternatively, you could start with a SYSTEMTIME (the current can be retrieved with GetSystemTime), which allows you to define a precise year, month, day, hour, minute, second, and/or millisecond, and then turn that into a FILETIME with SystemTimeToFileTime(..., &ft). But how many times do you really have a precise absolute date and time at which you want your wait to stop? Yes, February 3rd, 2019, at 6:29 AM (and 39.66 seconds) please.
Whew! At this point, you're probably bewildered. Most wait timeouts are defined in terms of relative time. Thankfully, you can still define a relative time when calling SetThreadpoolWait. But sadly you have to resort to some messy casting and data conversion to do this. How do you do that? Simple. If the FILETIME's contents represent a negative integer, then the threadpool adds the negation of that to the current system time during registration to come up with an absolute time. So it does the GetSystemTimeAsFileTime/addition nonsense for you. Well, how the heck do you create a negative FILETIME anyway? "Easy"!
__declspec(align(8)) FILETIME ft; *reinterpret_cast<LONG64 *>(&ft) = -1000; SetThreadpoolWait(..., &ft);
That causes the threadpool to use a timeout of 1000 from the current time. 1000 units of what you might ask? FILETIME represents its time with 100ns units, so this particular timeout of -1000 is interpreted to mean 100 microseconds from now. Generally, if you have n milliseconds, you'd calculate n * 1000 (to get microsecond units) * 10 (to get 100 nanosecond units).
[Update: 12/26/2006: I changed the examples to use __declspec(align(8)), ensuring that the FILETIME is aligned on an 8-byte boundary. Pavel noted in comments to this post that, since FILETIME consists of two DWORDs, VC++ will align on 4-byte boundaries. On some architectures -- like IA64 -- treating these as a single 64-bit aggregate piece of data will present alignment faults to the software. (This is unlike the behavior you'll see on X86 and X64, where such faults are fixed transparently by the hardware and/or OS, albeit at a performance hit. You can ask the OS to fix these up on IA64 automatically, with SetErrorMode(SEM_NOALIGNMENTFAULTEXCEPT), but this results in even worse performance than on other architectures.) Raymond wrote about this previously. One solution is to align manually, as shown here; another is to memcpy data to and from the FILETIME/LONG64; yet another, which is probably the cleanest, is to simply access the FILETIME's fields directly. E.g. FILETIME ft = { -m * 1000 * 10, ~0 } or ft.dwLowDateTime = -m * 1000 * 10; ft.dwHighDateTime = ~0, for some milliseconds timeout m (this shortcut only holds while m is nonzero and m * 1000 * 10 fits in 32 bits, i.e. for timeouts under roughly seven minutes).]
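To make the unit conversion and the field-wise alternative concrete, here is a small portable C++ sketch. FileTimeParts is a stand-in I made up so the code compiles without the Windows headers (its two members mirror dwLowDateTime and dwHighDateTime), and relative_timeout_100ns and to_parts are hypothetical helper names, not Win32 APIs.

```cpp
#include <cstdint>

// Stand-in for the two-DWORD FILETIME layout (assumption: low half first).
struct FileTimeParts {
    std::uint32_t low;   // plays the role of dwLowDateTime
    std::uint32_t high;  // plays the role of dwHighDateTime
};

// Converts a relative timeout in milliseconds into a negative count of
// 100ns units. The value is widened to 64 bits before multiplying so that
// large timeouts cannot wrap around in 32-bit arithmetic.
inline std::int64_t relative_timeout_100ns(std::uint32_t ms) {
    return -(static_cast<std::int64_t>(ms) * 1000 * 10);
}

// Splits a 64-bit tick count into its low and high 32-bit halves, which
// avoids reinterpret_cast and therefore any alignment concerns.
inline FileTimeParts to_parts(std::int64_t ticks) {
    const std::uint64_t bits = static_cast<std::uint64_t>(ticks);
    FileTimeParts ft;
    ft.low = static_cast<std::uint32_t>(bits & 0xFFFFFFFFu);
    ft.high = static_cast<std::uint32_t>(bits >> 32);
    return ft;
}
```

Splitting the full 64-bit value this way gives a high half of all ones for short timeouts, but a different high half once the tick count no longer fits in 32 bits, so computing both halves from the widened value is the safer habit.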
To save you some time and frustration in this process, you can just use the following little shim in your code. I do. It gives you a nice, relative DWORD-based timeout interface to the SetThreadpoolWait API:
VOID SetThreadpoolWaitWithMsTimeout(PTP_WAIT pwa, HANDLE h, DWORD dwMilliseconds)
{
    __declspec(align(8)) FILETIME ft;
    PFILETIME pft;

    if (dwMilliseconds == INFINITE) {
        // Just pass NULL to the wait API to signal "no timeout".
        pft = NULL;
    } else {
        pft = &ft;

        // FILETIME uses 100ns intervals. To convert milliseconds into 100ns units,
        // we must multiply by 1000 (to get microseconds) and then by 10 (to get 100ns).
        // We take the negative of this number to specify "relative to now" time.
        // DWORD is unsigned, so widen to LONG64 first: negating or multiplying in
        // 32 bits would wrap and produce a positive (i.e. absolute) time instead.
        *reinterpret_cast<LONG64 *>(pft) = -static_cast<LONG64>(dwMilliseconds) * 1000 * 10;
    }

    SetThreadpoolWait(pwa, h, pft);
}
To be honest, I don't know the reasoning behind the break from tradition here. [Update: 12/26/2006: Richard pointed out, very correctly, that the NTDLL implementations have always dealt in terms of FILETIMEs and performed DWORD ms to 100ns conversions; in that sense, this is hardly a "break from tradition" for Windows generally, but is certainly a new direction for KERNEL32 itself.] Using a FILETIME clearly causes some trickiness for something that should be damn simple, but it does have one distinct advantage. The DWORD-based APIs only permit you to specify relative time, whereas the FILETIME approach allows you to specify relative or absolute, depending on what you need. I have to admit that I can’t think of many cases with wait registrations where you’d want an absolute timeout, but I might be unimaginative. This also has the benefit of removing some degree of ambiguity: with relative, you always have to wonder: time relative to what? Is it relative to the time at which an actual wait thread sees the registration and adds the HANDLE to its list of waitable objects? Or is it relative to the time at which the call to register the wait is made? Or something else? In the end, I’d have preferred a separate API for absolute time… I just like my DWORD-based relative timeouts. They are familiar and comfortable. Continuing to use RegisterWaitForSingleObject is an option, but doesn't let you use environments, cleanup groups, etc. Tradeoffs abound.
 Thursday, December 14, 2006
Raymond just posted a brief entry about lock convoys. While he links to a few other relevant posts, none of them mention the new anti-convoy features that all Windows locks now use. So I thought that I would take the chance to do just that.
Many people claim that fair locks lead to convoys. In my experience, however, few people really know the reason why. Before Windows Vista (client OS) and Windows Server 2003 SP1 (server OS), mutexes, critical sections, and internal locks like kernel pushlocks used a lock handoff mechanism to guarantee true fairness. In other words, the lock would not actually become "available" when released so long as there were threads waiting to acquire it. Instead, the thread releasing the lock would modify the lock's state so that it appeared as if the next thread in the wait queue already owned it. The new owner thread would then simply wake up, typically via the releaser signaling an event, and the owner would find that it already owned the lock, proceeding happily. No other thread can "sneak in" between the time the new owner is woken and the time that it is actually scheduled for execution with this design.
While this sounds nice and, well, fair, it exacerbates convoys. Why? Because it effectively extends the lock hold time by the communication and context switch latency required to wake and reschedule the new owner. Context switches on Windows are anything but cheap, and tend to cost anywhere from 4,000 to 10,000+ cycles. Assume C represents the cost of a context switch. Then with a truly fair lock, the lock will be in an intermediary handed off state, with no thread actually running code under its protection, for about C cycles. It's actually worse than this. Assuming the system is busy, a thread that is woken just goes to the end of the OS's thread scheduler queue, and is required to wait until it gets allocated a timeslice in which to execute. This can make C much larger in practice. And of course on highly loaded systems the condition worsens, which can add insult to injury (as we see momentarily).
To cope with the possibility of a scheduling delay, Windows uses something called a priority boost for any thread waiting on an auto-reset event (or that owns a window): the boost temporarily increases the target thread's priority which subsequently decays after it gets scheduled. Assuming no other high priority threads are runnable, this ensures the latency is very close to C. But C's still pretty darned big...
To illustrate the problem with the fair handoff scheme, imagine we have a lock L for which a new thread arrives every 2,000 cycles. Each such thread runs for 1,000 cycles under the protection of the lock. No problem, right? On average, the lock will be held for 1,000 cycles, unheld for 1,000 cycles, and so on. Assuming the arrival rate is somewhat random, but statistically averages out to the values mentioned, then occasionally we might get some contention, requiring waiting. But for every contentious acquire, there should also be a big gap in time where there are no owners (or where wait queues can be drained). A system with these characteristics should balance out well. It should survive until threads begin arriving more often than once every 1,000 cycles, give or take some epsilon, which is actually a doubling in throughput. It's not even close to capacity. (Real systems depend on many more factors than this simplistic view, but you get the point.)
As soon as the lock is fair, however, this scheme quickly becomes untenable and will come to a grinding halt. Imagine that thread T0 acquires L at cycle 0; if it just so happens that T1 tries to acquire it at cycle 500, then T1 will have to wait. Remember, a new thread arrives every 2,000 cycles on average, but that's just an average. We expect the occasional wait to occur. This wait, unfortunately, causes a domino effect from which the system will never recover. T0 then releases L at cycle 1,000, as expected, handing off ownership to T1; sadly, T1 doesn't actually start running inside the lock until 5,000 (assuming 4,000 for C, and assuming no scheduling delay); in the 4,000 cycles it took for T1 to wake back up and start running, we expect 2 new threads will have arrived on average; these threads would see L as owned by T1 and respond by waiting. By the time those threads execute, another 10,000 cycles (2*(C+1,000)) will have passed, and about 5 more threads will have begun waiting. And so on. This process repeats indefinitely, the requests pile up (hopefully with a bound), and disaster strikes. The system simply won't scale this way.
If you remove the strict fairness policy, however, the system scales. And that, my friends, is why all of the locks in Windows are now unfair.
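The arithmetic above can be played out with a tiny discrete-event simulation. This is a deliberately simplified sketch, not a model of the real Windows scheduler: simulate_lock and ConvoyResult are names I invented, arrivals are perfectly regular apart from one early thread at cycle 500, and the only difference between the two modes is whether a release hands ownership directly to the first waiter (fair) or frees the lock and merely wakes that waiter to try again C cycles later (unfair).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

struct ConvoyResult {
    std::size_t max_waiters;   // longest wait queue observed
    std::int64_t finish_time;  // cycle at which the last release happened
};

inline ConvoyResult simulate_lock(bool fair_handoff, int threads,
                                  std::int64_t arrival_gap = 2000,
                                  std::int64_t hold = 1000,
                                  std::int64_t wake_cost = 4000) {
    assert(threads > 0);
    const int kRelease = 0, kAttempt = 1;  // releases sort first at equal times
    using Event = std::tuple<std::int64_t, int, int>;  // (time, kind, thread id)
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> events;

    // T0 arrives at 0, T1 arrives early at 500 (the one unlucky collision),
    // and thread i >= 2 arrives at (i - 1) * arrival_gap.
    for (int i = 0; i < threads; ++i) {
        const std::int64_t t = (i == 1) ? 500 : std::max(0, i - 1) * arrival_gap;
        events.emplace(t, kAttempt, i);
    }

    std::deque<int> waiters;  // FIFO wait queue
    bool held = false;
    ConvoyResult result{0, 0};

    while (!events.empty()) {
        const Event e = events.top();
        events.pop();
        const std::int64_t now = std::get<0>(e);
        const int kind = std::get<1>(e);
        const int id = std::get<2>(e);

        if (kind == kAttempt) {
            if (!held) {
                held = true;  // uncontended acquire
                events.emplace(now + hold, kRelease, id);
            } else {
                waiters.push_back(id);  // contended: go to the back of the queue
                result.max_waiters = std::max(result.max_waiters, waiters.size());
            }
            continue;
        }

        // Release: wake the thread at the front of the queue, if any.
        result.finish_time = now;
        held = false;
        if (waiters.empty()) continue;
        const int next = waiters.front();
        waiters.pop_front();
        if (fair_handoff) {
            // Ownership is handed off, so the lock stays "held" for the whole
            // wake-up latency plus the new owner's hold time.
            held = true;
            events.emplace(now + wake_cost + hold, kRelease, next);
        } else {
            // The lock is free right away; the woken thread retries later and
            // may find that somebody else has taken it in the meantime.
            events.emplace(now + wake_cost, kAttempt, next);
        }
    }
    return result;
}
```

Comparing simulate_lock(true, 50) with simulate_lock(false, 50) under the default parameters shows the effect: with handoff the wait queue keeps growing for as long as threads keep arriving, while without it the single early collision is absorbed and the queue never holds more than one waiter.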
Of course, Windows locks are still a teensy bit fair. The wait lists for mutually exclusive locks are kept in FIFO order, and the OS always wakes the thread at the front of such wait queues. (APCs regularly disturb this ordering -- a topic for another day -- which actually calls into question the merit of the original design goal of attaining fairness in the first place.) Now when a lock becomes unowned, a FIFO waking algorithm is still used, but the lock is immediately marked as available. Another thread can sneak in and take the lock before the woken thread is even scheduled (although priority boosting is still, somewhat questionably, in the system). If another thread steals the lock, the woken thread may subsequently have to rewait, meaning it must go to the back of the queue, again disturbing the nice FIFO ordering.
The change to unfair locks clearly has the risk of leading to starvation. But timing in concurrent systems tends to be so volatile that, statistically speaking, each thread will eventually get its turn to run. Many more programs would suffer from the convoy problems resulting from fair locks than would notice starvation happening in production systems as a result of unfair locks.
 Wednesday, December 06, 2006
I took this past week off so that I could work on my book. Well, I'm happy to report that I've been successfully writing like a madman, averaging around 15-20 solid pages per day. I still have a long way to go, but I'm getting more confident with the passing of each day that this book will be... well... a book that I'd actually like to sit down and read.
[Update: 12/7/2006: Correction made -- the CLR's JITs do not generate manual alignment code. Instead, we defer to the costly OS handler for alignment fixups.]
In the process of writing the section on data alignment, I realized there is very little documentation on the alignment policy used by the CLR. This is in contrast to Kang Su Gatlin's wonderful MSDN treatise on the subject for VC++, which leaves absolutely nothing hidden in the closet. Well, I still don't have all of the answers for you. Sorry. You'll have to wait for the book. But in the meantime, I've discovered that there's a myth that deserves a little debunking.
In the MSDN documentation for InterlockedCompareExchange64, it says:
"The variables for this function must be aligned on a 64-bit boundary; otherwise, this function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems."
I've also heard and read this from other various sources. I've heard, for example, that LOCK CMPXCHG8B will still do a load/compare/store sequence, but that, if the address isn't 8-byte aligned, the instruction will not be atomic. This would lead to sporadic atomicity failures, probably even more difficult to track down than a typical race. Given that the CLR doesn't faithfully align 64-bit data types on 8-byte boundaries (as we'll see momentarily), I suddenly feared that Interlocked.CompareExchange(ref Int64, ...) was HORRIBLY broken. Without an MP machine at home, I couldn't test this out, so I decided to do a little digging.
In the manuals for many AMD processors and older Intel X86 processors, I found no reference to CMPXCHG8B requiring an aligned address. What I did find, however, in the Intel 64-bit and IA32 System Programmer's Manual Part A was the following (emphasis mine):
"The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommend that locked accesses be aligned on their natural boundaries for better system performance:
- Any boundary for an 8-bit access (locked or otherwise).
- 16-bit boundary for locked word accesses.
- 32-bit boundary for locked doubleword accesses.
- 64-bit boundary for locked quadword accesses."
If I'm reading that right, this means the common wisdom around 8-byte alignment and LOCK CMPXCHG8B is hogwash. (Sadly, proving the absence of some flaky processor that crashes or has unpredictable behavior under certain circumstances is rather difficult, especially if someone at some point thought it was true enough to put it in the MSDN documentation. If somebody out there knows of a real case -- and it's not just hearsay -- please let me know!) If this is true of all X86 processors, it means that Interlocked.CompareExchange(ref Int64, ...) isn't horribly broken on the CLR after all. (Yaay.) It would have been broken... because, as I said earlier, the CLR does NOT align 64-bit values on 8-byte address boundaries...
Conversing briefly with Simon Hall over email, the dev that owns most (all?) of the type layout infrastructure, I've concluded the following: CLR type layout tries to eliminate all misaligned data layout through a combination of padding and field reordering. This means that data of >= 8-bytes on 64-bit always begins on 8-byte boundaries, and data of >= 4-bytes on 32-bit always begins (at least) on 4-byte boundaries. I say "at least" because empirical evidence shows that type layout actually aligns many 8-byte fields on 8-byte boundaries, even on 32-bit. (It turns out this doesn't matter much... neither the 32-bit JIT nor the GC respect this when allocating data.) In summary, the CLR ensures that no field that could have fit inside a single 4/8-byte segment ever spills across a boundary. The CLR also adds necessary padding to StructLayout(Sequential) types, while still preserving the original field ordering.
Therefore, the only cases where we end up with truly misaligned data is with StructLayout(Explicit) and StructLayout(Pack=...) types. For example the simple struct, struct S { [FieldOffset(6)] int i; }, will always be misaligned, on 32- and 64-bit alike. In such cases, our JIT simply generates the naive code and lets the OS perform misalignment fixups. This is actually rather costly, as Kang Su's aforementioned article explains. We could have, like the VC++ compiler, generated the manual alignment code using a combination of loads and shifts, but my guess is that most of our customers don't care and will never notice.
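For comparison, native code that has to touch a field at an odd offset can sidestep fixups entirely by copying bytes instead of dereferencing a misaligned pointer. The following C++ sketch illustrates that general technique only; it is not what the CLR's JIT emits, and load_i32_unaligned and store_i32_unaligned are hypothetical helper names.

```cpp
#include <cstdint>
#include <cstring>

// Reads a 32-bit integer from an address with no alignment guarantee.
// memcpy lets the compiler pick whatever byte-wise or unaligned load
// sequence is legal on the target, so no alignment fault can occur.
inline std::int32_t load_i32_unaligned(const unsigned char* p) {
    std::int32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Writes a 32-bit integer to an address with no alignment guarantee.
inline void store_i32_unaligned(unsigned char* p, std::int32_t v) {
    std::memcpy(p, &v, sizeof v);
}
```

Storing to and loading from offset 6 of a byte buffer, the same offset as the FieldOffset(6) example, round-trips the value without touching the neighboring bytes.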
To preserve the hard work done by type layout, our JITs and the GC guarantee that all allocated data is aligned on at least 4-byte (on 32-bit) or 8-byte (on 64-bit) boundaries. I say "at least" once again because I know, for example, that VC++ aligns stack frames on 16-byte boundaries for 64-bit. I don't claim to understand why. We might do something similar.
Here's an interesting program that just prints out a few field addresses, and whether things are 8-byte aligned. You'll interestingly notice that the int/long fields that are adjacent to one another are padded with 4-bytes in between on 32- and 64-bit, but that the JIT and GC only align on 4-byte addresses on 32-bit. I presume this is so that the layout doesn't have to change between 32- and 64-bit, but I can't say for sure:
using System;
using System.Runtime.InteropServices;

class C
{
    internal S s;
}

struct S
{
    internal int x;
    internal long y;
    internal byte z;
}

unsafe class P
{
    static void Main(string[] args)
    {
        int pad = 5;
        if (args.Length > 0)
            pad = int.Parse(args[0]);

        Console.WriteLine("Field\t[Begin\tEnd)\t%8");
        PrintStackS(pad);
        PrintHeapS(pad);
    }

    static void PrintStackS(int x)
    {
        int * pad = stackalloc int[x];
        S * s = stackalloc S[1];
        PrintAddr(s);
    }

    static void PrintHeapS(int x)
    {
        for (int i = 0; i < x; i++)
            new object();
        C c = new C();
        fixed (S * pcs = &c.s)
        {
            PrintAddr(pcs);
        }
    }

    static unsafe void PrintAddr(S * ps)
    {
        ulong xa = new UIntPtr(&ps->x).ToUInt64();
        Console.WriteLine("X\t{0:X}\t{1:X}\t{2}", xa, xa + sizeof(int), xa % 8);
        ulong ya = new UIntPtr(&ps->y).ToUInt64();
        Console.WriteLine("Y\t{0:X}\t{1:X}\t{2}", ya, ya + sizeof(long), ya % 8);
        ulong za = new UIntPtr(&ps->z).ToUInt64();
        Console.WriteLine("Z\t{0:X}\t{1:X}\t{2}", za, za + sizeof(byte), za % 8);
    }
}
Running it with a few different inputs yields these results:
C:\Temp>8by
Field   [Begin    End)      %8
X       12F440    12F444    0
Y       12F448    12F450    0
Z       12F450    12F451    0
X       1273670   1273674   0
Y       1273678   1273680   0
Z       1273680   1273681   0

C:\Temp>8by 2
Field   [Begin    End)      %8
X       12F44C    12F450    4
Y       12F454    12F45C    4
Z       12F45C    12F45D    4
X       1273664   1273668   4
Y       127366C   1273674   4
Z       1273674   1273675   4
If the CLR ever decides to support a 128-bit CAS operation, Interlocked.CompareExchange(ref Int128, ...), which I hope we will, we would need to guarantee alignment on 16-byte boundaries. In contrast to CMPXCHG8B, CMPXCHG16B does indeed fail when issued against an address that isn't 16-byte aligned. Instead of failing silently, a GP fault is generated. This is difficult, because not only must type layout respect the alignment (you can already get this with StructLayout(..., Pack=16)), but the JIT and the GC would also need to allocate correctly. Or, of course, you could over-allocate a chunk of data and shift the start pointer to the first aligned address inside of it. This might work for the stack, but GC allocated data is going to keep shifting around on you, so this probably won't work very well there. Before the CLR supports Interlocked.CompareExchange(ref Int128, ...), however, I suppose we ought to provide an Int128. :)
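The over-allocate-and-shift trick boils down to rounding an address up to the next multiple of the alignment. Here is a minimal sketch of just that arithmetic. AlignDemo and AlignUp are names invented for illustration, and the sample address is simply one of the stack addresses from the output above.

```csharp
using System;

static class AlignDemo
{
    // Rounds 'address' up to the next multiple of 'alignment'.
    // 'alignment' must be a power of two (e.g. 16 for CMPXCHG16B).
    public static ulong AlignUp(ulong address, ulong alignment)
    {
        if (alignment == 0 || (alignment & (alignment - 1)) != 0)
            throw new ArgumentException("alignment must be a power of two", "alignment");
        return (address + alignment - 1) & ~(alignment - 1);
    }

    static void Main()
    {
        // To guarantee a 16-byte aligned slot of size n, over-allocate
        // n + 15 bytes and start at AlignUp(chunkStart, 16).
        Console.WriteLine("{0:X}", AlignUp(0x12F44C, 16)); // prints 12F450
        Console.WriteLine("{0:X}", AlignUp(0x12F440, 16)); // prints 12F440
    }
}
```

This only helps when the underlying memory stays put, which is exactly why it is a poor fit for unpinned GC heap data.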
 Tuesday, November 28, 2006
There’s surprisingly little information out there on Windows keyed events. That’s probably because the feature is hidden inside the OS, not exposed publicly through the Windows SDK, and is used only by a small fraction of kernel- and user-mode OS features. The Windows Internals book makes brief mention of them, but only in passing on page 160. Since keyed events have been revamped markedly for Vista, a quick write up on them felt appropriate. I had the pleasure to chat at length today with the developer who designed and implemented the feature back in Windows XP. (I typically try to get work done during the day, but it seems the whole Microsoft campus was offline, aside from the two of us, due to the 1 or 2 inches of snow we received last evening). I doubt much of this will make it into my book, since knowing it all won’t necessarily help you write better concurrent software.
First, here’s the quick backgrounder on why keyed events were even necessary. Before Windows 2000, critical sections, when initialized, were truly initialized. That meant their various dynamically allocated blobs of memory were allocated, contained the right bit patterns, and also that the per-section auto-reset event that is used to block and wake threads under contention was allocated and ready. Unfortunately, there are a finite number of kernel object HANDLEs per process, of which auto-reset events consume one, and each object consumes some amount of non-pageable pool memory. It also turns out lots of code uses critical sections. Around the Windows 2000 time frame, a lot more people were writing multithreaded code, primarily for server SMP programs. It’s relatively common nowadays to have hundreds or thousands of critical sections in a single process. And many critical sections are used only occasionally (or never at all), meaning the auto-reset event often isn't even necessary! Aside from the auto-reset event, the entire critical section is pageable and has no impact on a fixed size resource.
This was a problem, and had big scalability impacts. So starting with Windows 2000, the kernel team decided that allocation of the event would be delayed until it’s first needed. That means EnterCriticalSection had to, in response to the first contended acquire, allocate the event. But there’s a problem with this. If memory is low, or the number of HANDLEs in the process had been exhausted, this lazy allocation would actually fail. Suddenly EnterCriticalSection, which would never have failed previously (stack overflow aside), could throw an exception. What’s worse, you couldn’t really recover from these exceptions: the CRITICAL_SECTION data structure was left in an unusable and damaged state. But wait, it gets worse. I’m told there was a sizeable cleanup initiated that involved filing many, many bugs to fix code that used EnterCriticalSection throughout the Windows and related code-bases. Unfortunately, then people realized that LeaveCriticalSection could also fail under some even more obscure circumstances. (If EnterCriticalSection fails, throwing an out of memory exception, the subsequent LeaveCriticalSection would see the damaged state and think it could help out by allocating the event. This too could fail, causing even more corruption.) What to do? Wrap each call to EnterCriticalSection AND LeaveCriticalSection in its own separate __try/__except clause? And do precisely what in response, since the data structure was completely hosed anyway?
The bottom line was that no human being could possibly write reliable software using critical sections. Terminating the process, or isolating those bits of code that used such a damaged critical section somehow, were the only intelligible responses. Most of Microsoft's software, including the CRT and plenty of important applications, would probably not do anything, and remain busted.
Still, the people responsible for the original change believed strongly that the impacts to reliability were the lesser of two evils: that limiting Windows scalability so fundamentally was a complete non-starter. As a short-term solution, then, InitializeCriticalSectionAndSpinCount was hacked so that passing a dwSpinCount argument with its high-bit set, e.g. InitializeCriticalSectionAndSpinCount(&cs, 0x80000000 | realSpinCount), would revert to the pre-Windows 2000 behavior of pre-allocating the event object. No longer would low resources possibly cause exceptions out of EnterCriticalSection and LeaveCriticalSection. But all that code written to use the ordinary InitializeCriticalSection API was still vulnerable. And this just pushed the fundamental reliability vs. scalability decision back onto the poor developer. What a horrible choice to have to make.
This is when keyed events were born. They were added to Windows XP as a new kernel object type, and there is always one global event \KernelObjects\CritSecOutOfMemoryEvent, shared among all processes. There is no need for any of your code to initialize or create it—it’s always there and always available, regardless of the amount of resources on the machine. Having it there always adds a single HANDLE per process, which is a very small price to pay for the benefit that comes along with it. If you dump the handles with !handle in WinDbg, you’ll always see one of type KeyedEvent. Well, what does it do?
A keyed event allows threads to set or wait on it, just like an ordinary event. But having just a single, global event would be pretty useless, given that we’d like to somehow solve the original critical section problem, which effectively requires a single event per critical section. Here's where the ingenuity arises. When a thread waits on or sets the event, they specify a key. This key is just a pointer-sized value, and represents a unique identifier for the event in question. When a thread sets an event for key K, only a single thread that has begun waiting on K is woken (like an auto-reset event). Only waiters in the current process are woken, so K is effectively isolated between processes although there’s a global event. K is most often just a memory address. And there you go: you have an arbitrarily large number of events in the process (bounded by the addressable bytes in the system), but without the cost of allocating a true event object for each one.
By the way, if N waiters must be woken, the same key K must be set N times, meaning that for manual-reset-style sets, the list of waiters somehow needs to be tracked. (Although not an issue for critical sections, this comes up for SRWLs, noted below.) This gives rise to a subtle corner case: if a setter finds the wait list associated with K to be empty, it must wait for a thread to arrive. Yes, that means the thread setting the event can wait too. Why? Because it's just how keyed events work; without it, extra synchronization would be needed to guard against the following sequence: a waiter records that it is about to wait (e.g. in the critical section bits), the setter sees this and sets the keyed event (and leaves), and only then does the waiter actually get around to waiting on the keyed event. This would lead to a missed pulse, and possibly deadlock, if it weren't for the current behavior.
So you can probably imagine how this solves the original problem. When a critical section finds that it can’t allocate a dedicated event due to low resources, it will just wait on and set the keyed event, using the critical section’s address in memory as the key K. You might think: well, gosh, with this nifty new keyed events thingamajiggit, why didn’t they get rid of per-critical section events entirely? I did at least.
There are admittedly some drawbacks to keyed events. First and foremost, the implementation in Windows XP was not the most efficient one. It maintained the wait list as a linked list, so finding and setting a key required an O(n) traversal. Here n is the number of threads waiting globally in the system. The head of the list is in the keyed event object itself, and entries in the linked list are threaded by reusing a chunk of memory on the waiting thread’s ETHREAD for forward- and back-links—cleverly avoiding any dynamic allocation whatsoever (aside from the ETHREAD memory which is already allocated at thread creation time). But given that the event is shared physically across the entire machine, depending on a linked list like this for all critical sections globally would not have scaled very well at all. And this sharing can also result in contention that is difficult to explain, since threads have to use synchronization when accessing the list.
[Update: 2/2/2007: Neill, the dev I mentioned at the outset, just emailed me a correction to my original write-up. I had incorrectly stated that the forward- and back-links happen through TEB memory (which is user-mode); they actually use ETHREAD memory.]
But keyed events have improved quite a bit in Windows Vista. Instead of using a linked list, they now use a hash-table scheme, trading the possibility of hash collisions (and hence some amount of contention unpredictability) in favor of improved lookup performance. This improvement was good enough to use them as the sole event mechanism for the new “slim” reader/writer locks (SRWLs), condition variables, and one-time initialization APIs. Yes, you heard that right… None of these new features use traditional events under the covers. They use keyed events instead. This is in part why the new SRWLs are super light-weight, taking up only a pointer-sized bit of data and not requiring any event objects whatsoever. Critical sections still use auto-reset events, but I understand that this is primarily for AppCompat reasons. It’s admittedly nice when debugging to be able to grab hold of the HANDLE for the internal event and dump information about it, something you can’t do with keyed events, and plenty of customers depend on this information being there.
The improvement that keyed events offer to reliability and the alleviation of HANDLE and non-pageable pressure is overall a very welcome one, and one that will undoubtedly pave the way for new synchronization OS features in the future. I personally hope that one day they are made available to user-mode code through the Windows SDK.
 Saturday, November 18, 2006
I was surprised to find out that attempting to acquire an orphaned native "slim" reader/writer lock (SRWL) on the shutdown path hangs on Windows Vista. Unlike orphaned critical section acquisitions during shutdown on Windows -- which in Vista cause the process to terminate immediately, and which pre-Vista enjoyed "weakening" to avoid deadlocking at the risk of seeing corrupt state -- SRWL's AcquireSRWLockXXX methods are not shutdown aware.
This is actually pretty dangerous, and effectively means you should stay as far away from SRWLs during shutdown as possible. Avoiding synchronization and any sort of cross-thread coordination in DllMain is generally a good rule of thumb anyway, since it runs under the protection of the loader lock, has to tolerate very harsh conditions, and often runs w/out the presence of other active threads.
But this means something even stronger and sets SRWLs distinctly apart from Win32 critical sections. If you're writing a reusable native library whose functionality somebody might conceivably want to use on the shutdown path, you really ought not to be using SRWLs internally. If app developers don't realize you employ internal SRWL synchronization, they might call you and, every so often when the stars align, their users will experience a random hang during shutdown. Library authors might consider giving TryXXX variants of their APIs so that developers can at least deal with the case in which an SRWL has been orphaned.
Notice that hanging is similar to managed code's approach to lock acquisitions during shutdown. There's a big difference, though: the CLR anoints a shutdown watchdog thread that kills the process after 2 seconds when a hang occurs, whereas native code won't. The native hang persists indefinitely.
I'd be a much happier guy if SRWLs mirrored the new Vista orphaned critical section shutdown behavior.
 Thursday, November 09, 2006
The CLR tried to add support for fibers in Whidbey. This was done in response to SQL Server Yukon hosting the runtime in process, aka SQLCLR. Eventually, mostly due to schedule pressure and a long stress bug tail related to fiber-mode, we threw up our hands and declared it unsupported. Given the choice between fixing mainline stress bugs (which almost exclusively use the unhosted CLR, meaning OS threads) and fixing fiber-related stress bugs, the choice was a fairly straightforward one. This impacts SQL customers that want to run in fiber mode, but there are far fewer of those than those who want to run in thread mode.
Perhaps the biggest thing we did to support fibers intrinsically in the runtime was to decouple the CLR thread object from the physical OS thread. Since most managed code accesses thread-specific state through this façade, we are able to redirect calls to threads or fibers as appropriate. And we of course plumbed the EE to call out to hosts so they can perform task management at various points in the code, enabling a non-preemptive scheduler to do its job. When a CLR host with a registered TaskManager object is detected, we defer many tasks to it that we’d ordinarily implement with OS calls. For example, instead of just creating a new OS thread, we will call out through the TaskManager interface so that the thread can use a fiber if it wishes.
We do various other things of interest:
- Because the CLR thread object is per-fiber, any information hanging off of it is also per-fiber. Thread.ManagedThreadId returns a stable ID that flows around with the CLR thread. It is not dependent on the identity of the physical OS thread, which means using it implies no form of affinity. Different fibers running on the same thread return different IDs. Impersonation and locale are also carried around with the fiber instead of the thread. This also ensures we can properly walk stacks, propagate exceptions, and report all of the active roots held on stack frames (for all fibers) to the GC.
- Managed TLS is stored in FLS if a fiber is being used. This includes the ThreadStaticAttribute and Thread.GetData and Thread.SetData routines. We avoid introducing thread affinity when these APIs are used.
- Any time we block in managed code or at most places in the runtime, we call out to the host so that it may SwitchToFiber. This includes calls to WaitHandle.WaitOne, contentious Monitor.Enters, Thread.Sleep, and Thread.Join, as well as any other APIs that use those internally. Some managed code blocks by P/Invoking, either intentionally or unintentionally, which leaves us helpless. The sockets classes in Whidbey, for instance, make possibly-blocking calls to Win32. These should really be cleaned up. Not only does it prevent us from switching in fiber mode, but it also prevents us from pumping the message queue on an STA thread. Apps do this too, such as P/Invoking to MsgWaitForMultipleObjects in order to do some custom message pumping code. The lack of coordination with blocking in the kernel also makes it way too easy to accidentally forfeit an entire CPU for lengthy periods of time.
- We do some things during a fiber switch to shuffle data in and out of TLS. This includes copying the current thread object pointer and AppDomain index from FLS to TLS, for example, as well as doing general book-keeping that is used by the internal fiber switching routines (SwitchIn and SwitchOut).
- Our CLR internal critical sections coordinate with the host. Anytime we create or wait on an event, it is a thin wrapper that calls out to the host. This meant sacrificing some freedom around waiting, like doing away with WaitForMultipleObjectsEx with WAIT_ANY and WAIT_ALL, but ensures seamless integration with a fiber-mode host.
- All thread creation, aborts, and joins are host aware, and call out to the host so they can ensure these events are processed correctly given an alternative scheduling mechanism.
None of this logic kicks in if fibers are simply used underneath the CLR without its knowledge. It all requires close coordination between the host, which is doing user-mode scheduling, and the CLR, which is executing the code running on those fibers. If you call into managed code on a thread that was converted to a fiber, and then later switch fibers without involving the CLR, things will break badly. Our stack walks and exception propagation will rely on the wrong fiber’s stack, and the GC will fail to find roots for stacks that aren’t live on threads, among many, many other things.
Important areas of the BCL and runtime that can introduce thread affinity, then make a call that might block, and later release thread affinity—such as the acquisition and release of an OS CRITICAL_SECTION or Mutex—have been annotated with calls to Thread.BeginThreadAffinity and Thread.EndThreadAffinity. These APIs call out to the host who maintains a recursion counter to track regions of affinity. If a blocking operation happens inside such a region (i.e. count > 0), the host should avoid rescheduling another fiber on the current thread and/or moving the current fiber to another thread. This can create CPU stalls, so we try to avoid it, but is better than the consequence of ignoring the affinity.
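To make the annotation pattern concrete, here is a rough sketch of how a library might bracket a thread-affine region. Thread.BeginThreadAffinity and Thread.EndThreadAffinity are the real APIs; the AffinityDemo class and RunWithThreadAffinity helper are hypothetical, and this is not a claim about how the BCL itself structures its calls.

```csharp
using System;
using System.Threading;

static class AffinityDemo
{
    // Hypothetical helper: runs 'body' inside a region that tells the host
    // the current task depends on the identity of the physical OS thread.
    public static T RunWithThreadAffinity<T>(Func<T> body)
    {
        Thread.BeginThreadAffinity();
        try
        {
            return body();
        }
        finally
        {
            // Always pair the End with the Begin so the host's counter unwinds.
            Thread.EndThreadAffinity();
        }
    }

    static void Main()
    {
        using (Mutex mutex = new Mutex())
        {
            // An OS mutex is owned by a physical thread, so the acquire,
            // the work, and the release all belong in one affinity region.
            int result = RunWithThreadAffinity(() =>
            {
                mutex.WaitOne();
                try { return 42; }
                finally { mutex.ReleaseMutex(); }
            });
            Console.WriteLine(result);
        }
    }
}
```

Outside of a fiber-mode host these calls should have no visible effect, so the pattern costs little to adopt.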
In reality, there is little code today that actually uses these APIs. Large swaths of the .NET Framework have not yet been modified to use these calls and thus remain unprotected. We inherit a lot of the affinity problems from Win32. This can have a dramatic impact on reliability and correctness when used in a fiber-mode host. Switching a fiber that has acquired OS thread affinity can result in data being accidentally shared between units of work (like the ownership of a lock) or movement of work to a separate thread (which then expects to find some TLS, but is surprised when it isn’t there). Both are very bad. If we were serious about supporting fibers underneath managed code, we really ought to do an audit of the libraries to find any dangerous unmarked P/Invokes or OS thread affinity.
Spin loops that don't first go through the user-mode scheduler can wastefully burn CPU cycles. A lot of the .NET Framework and some of the CLR itself spins without host coordination. While not disastrous, presuming they all fall back to waiting eventually, this can have a negative impact on performance and scalability.
The 2.0 CLR’s policy in response to stack overflow is to FailFast the whole process. Too much of Win32 is unreliable in the face of overflow to try and continue running. With fibers in the picture it might be attractive to reserve smaller stacks since presumably the smaller work items will need less. And you're apt to have a lot more of them. This is a dangerous game to play. This trades off some amount of committed memory for an increased chance of overflowing the stack, an event that is clearly catastrophic.
Fibers and debuggers don’t interact well today either. Most debuggers rely on Win32 CONTEXTs pulled from the OS thread, in a fiber-unaware way. Depending on the frequency at which a debugger resamples the context, this can get out of sync in the face of fiber switches. Even if you’ve suspended all threads, you’ll not be able to peer into the stacks of fibers that aren’t currently scheduled. FuncEval and EnC also depend on thread suspension and coordination in a way that makes it hard to predict what will happen when fibers are added to the mix. A lot of the debugging libraries we have, such as System.Diagnostics, are also not fiber-aware and may yield surprising answers to API calls.
In the end, remember that we decided to cut fiber support because of stress bugs. Most of these stress bugs wouldn’t have actually blocked the simple, short-running scenarios, but would have plagued a long-running host like SQL Server. The ICLRTask::SwitchOut API was cut, which is unfortunate: it means you can’t switch out a task while it is running, which effectively makes it impossible to build a fiber-mode host on the 2.0 RTM CLR. Thankfully, re-enabling it (for those playing w/ SSCLI) would be a somewhat trivial exercise.
 Wednesday, November 01, 2006
People often ask whether they should use EventWaitHandle objects or the Monitor.Wait, Pulse, and PulseAll methods for synchronization. There is no simple answer to this question; although, as with most software problems, it can be summarized as: It depends.
EventWaitHandle comes in two flavors, auto- and manual-reset. EventWaitHandle subclasses WaitHandle and offers two subclasses for convenience: AutoResetEvent and ManualResetEvent. These are just thin wrappers on top of the CreateEvent and related APIs in Win32. The differences are deceptively simple.
Auto-reset, when signaled with the EventWaitHandle.Set API (kernel32!SetEvent internally), allows one thread to witness the signal before the event automatically transitions back to the unsignaled state. If there are any waiting threads, one will be chosen and unblocked. The waiting threads are maintained in a FIFO queue, but it’s not strictly FIFO for the same reasons very few things on Windows are FIFO: various events, like device IO completion, kernel-, and user-mode APCs can wake a thread temporarily, removing it and then requeuing it in the wait queue. If a thread is constantly woken to process device IO, it might be starved indefinitely (if the queue is long). If no threads were waiting at the time of this signal, the next thread to wait on the event will not block, and instead just moves the event back to the unsignaled state and returns. This is all done atomically so you are guaranteed only one thread will ever witness a signal.
Manual-reset, when signaled, wakes all threads that are waiting on it. As its name implies, it must be manually reset with the EventWaitHandle.Reset API (kernel32!ResetEvent internally). While the event remains signaled, any threads that try to wait on it will not block and just return from the wait function immediately.
Signaling an already-signaled event has no effect. It’s easy to get into trouble in this area with auto-reset events. If you signal the event N times, expecting N threads to see the signals and do some amount of processing, you’re betting the farm on a race condition, for instance. This is easy to get wrong, and very easily leads to deadlocks. If you have a shared buffer, an attractive design might be to simply call Set on the event each time a new item arrives. The thinking might be that, while threads might not be sleeping, at least one thread will wake up per item and process it. This thinking is dead wrong and can get you into a quagmire. The waking threads would need to contain a loop ‘while (!empty) { … }’ before going back to sleep, otherwise one of the signals may go missing. If production of new items depended on consumers making forward progress, the program might lock up. And it entirely depends on consumers going to sleep in the first place which, if producers typically produce faster than consumers, might only happen occasionally (and hence not show up during testing).
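Here is one way the drain loop might look, sketched with a single producer and a single consumer. All of the names (DrainDemo, Run, itemsAvailable, done) are invented for illustration. The point is only that the consumer treats a signal as "there may be one or more items" rather than "there is exactly one item".

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

static class DrainDemo
{
    // Produces 'itemCount' items on another thread, consumes them on the
    // calling thread, and returns how many items were processed.
    public static int Run(int itemCount)
    {
        Queue<int> queue = new Queue<int>();
        bool done = false; // protected by lock (queue)
        int processed = 0;

        using (AutoResetEvent itemsAvailable = new AutoResetEvent(false))
        {
            Thread producer = new Thread(() =>
            {
                for (int i = 0; i < itemCount; i++)
                {
                    lock (queue) queue.Enqueue(i);
                    // If the event is already signaled this Set is a no-op,
                    // so N Sets do NOT imply N wake-ups.
                    itemsAvailable.Set();
                }
                lock (queue) done = true;
                itemsAvailable.Set();
            });
            producer.Start();

            bool finished = false;
            while (!finished)
            {
                itemsAvailable.WaitOne();
                // Drain everything before sleeping again: while (!empty) { ... }
                while (true)
                {
                    lock (queue)
                    {
                        if (queue.Count == 0) { finished = done; break; }
                        queue.Dequeue();
                    }
                    processed++;
                }
            }
            producer.Join();
        }
        return processed;
    }

    static void Main()
    {
        Console.WriteLine(Run(100000)); // prints 100000
    }
}
```

If the inner loop were replaced with a single Dequeue per wake-up, coalesced signals would leave items stranded in the queue and the consumer could sleep forever.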
Monitor.Wait, Pulse, and PulseAll are very different from their close Win32 event cousins. They are much more akin to the new Windows Vista APIs, SleepConditionVariableCS and SleepConditionVariableSRW. Wait will exit the monitor (lock) for the object in question until another thread pulses the object. Once the thread wakes up, it immediately reacquires the lock on the object. Pulse wakes up one waiting thread, in FIFO order, while PulseAll wakes all waiting threads. Notice that the monitor has no residual effect from the pulse; that is, if no threads were waiting at the exact moment of a pulse, there is no evidence that it actually happened. This leads to the notorious missed-pulse problem. To solve it, you just have to ensure that the wait condition is always tested (in a loop) around the Wait.
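The wait-in-a-loop discipline looks roughly like this. It's a minimal sketch of an unbounded blocking queue; the BlockingQueue type and its members are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

sealed class BlockingQueue<T>
{
    private readonly Queue<T> items = new Queue<T>();
    private readonly object sync = new object();

    public void Enqueue(T item)
    {
        lock (sync)
        {
            items.Enqueue(item);
            // If nobody is waiting right now, this pulse leaves no trace.
            Monitor.Pulse(sync);
        }
    }

    public T Dequeue()
    {
        lock (sync)
        {
            // Test the condition itself, in a loop, rather than relying on
            // having observed a pulse. This is what avoids the missed pulse.
            while (items.Count == 0)
                Monitor.Wait(sync);
            return items.Dequeue();
        }
    }
}

static class BlockingQueueDemo
{
    static void Main()
    {
        BlockingQueue<string> q = new BlockingQueue<string>();
        q.Enqueue("hello");             // pulse happens with no waiters
        Console.WriteLine(q.Dequeue()); // still returns immediately: hello
    }
}
```

Because the state lives in the queue and not in the pulse, it doesn't matter whether the Enqueue happens before or after the consumer starts waiting.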
Note that Wait does something a little dirty. It releases an arbitrary amount of recursive acquisitions. As soon as it does this, other threads can acquire the monitor. If you are not careful with recursion, you can end up Waiting with broken invariants, accidentally letting other threads peer into this state. This is just another bit of evidence that recursion is something that is best avoided.
The first major consideration to make when selecting between EventWaitHandle and Monitor’s methods is whether you need a stand-alone event or a real condition variable that is integrated with locks. That is, the two have very distinct and disjoint feature sets. Win32 events also let you do more sophisticated waits, with the WaitHandle.WaitAll or WaitAny APIs, allowing you to wait for all of the events or a single event in an array to be signaled. So which feature-set do you want? That Win32 events give you events without the synchronization looks simpler, but is probably misleading. You typically need to manage mutual exclusion in some way with events, too, so you’ll end up using a monitor, ReaderWriterLock, Mutex, etc. in addition. The one benefit is that you have more control over locking and can be more conscious of certain policies like recursion. The fact that a Win32 event “sticks” in the signaled state can also be useful to avoid the missed-pulse problem, although with some discipline it is easily avoided with monitors too. Often people end up building a sticky event with a bool and monitor pair. One-time or lazy initialization is an example of this.
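For what it's worth, the bool-and-monitor pair usually ends up looking something like the following. This is only a sketch, with StickyEvent as a hypothetical name; it behaves like a manual-reset event that can never be reset.

```csharp
using System;
using System.Threading;

sealed class StickyEvent
{
    private readonly object sync = new object();
    private bool isSet; // the "sticky" part: once true, it stays true

    public bool IsSet
    {
        get { lock (sync) return isSet; }
    }

    public void Set()
    {
        lock (sync)
        {
            isSet = true;
            Monitor.PulseAll(sync); // wake everybody currently waiting
        }
    }

    public void Wait()
    {
        lock (sync)
        {
            // Late arrivals see isSet == true and never block.
            while (!isSet)
                Monitor.Wait(sync);
        }
    }
}

static class StickyEventDemo
{
    static void Main()
    {
        StickyEvent initialized = new StickyEvent();
        Thread waiter = new Thread(() => initialized.Wait());
        waiter.Start();
        initialized.Set();
        waiter.Join();
        initialized.Wait(); // returns immediately, the event has "stuck"
        Console.WriteLine(initialized.IsSet); // prints True
    }
}
```

No kernel object is allocated per event here, which is part of the appeal for things like one-time initialization.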
Win32 events are fairly heavyweight too. Each one consumes some amount of kernel memory, and setting, resetting, and waiting on one incurs somewhat expensive kernel transitions. In managed code, simply allocating one increases pressure on the GC because of yet another finalizable object to track. In a well-tuned system, you have to manage events carefully, which usually means Disposing of them far before the GC’s finalizer thread has a chance to see one. Even cleverer systems will pool them to amortize the cost of creating and closing the events. This is a double-edged sword. The V1.1 ReaderWriterLock we shipped in the CLR pools events. In my opinion, this is a little too clever and myopic: a good solution would pool events across many components in the process, not just ReaderWriterLocks. Imagine if each type we shipped tried to maintain its own pool of events.
As you may have guessed, Monitor actually uses Windows events underneath it all. Each CLR thread has a manual-reset event, allocated when it is created (or lazily when the thread first wanders into managed code). When a Wait is issued, this per-thread event is stuck on the tail of a linked list associated with the target object’s sync block. We can use a single event per thread since a thread can only ever be waiting for a single object at a given time. (You can’t do a WAIT_ANY or WAIT_ALL on monitors.) The thread then releases the lock on the object (accounting for any recursion), waits on this event, and then reacquires the lock on the object (again, accounting for any recursion). When a Pulse is issued, the head of the object’s linked list of waiters is popped off and its associated event is signaled. Similarly, PulseAll clears the entire linked list and signals all of the events. Notice I said that Pulse operates on the head of the list: we use a strict FIFO ordering (as of 2.0). And since we don’t remove the list entries in the face of an APC, there is no risk of perturbing the FIFO ordering, aside from premature exits due to thread aborts or interruptions.
There are a few things to note about this.
The signals on the thread events happen while the signaler still owns the lock. In other words, the thread calling Pulse(o) will still own the lock on o for some time after the call, yet the thread that called Wait(o) will immediately wake after the Pulse and try to acquire this lock (failing and waiting). Yes, all woken threads have to immediately wait when attempting to reacquire this lock, which is actually pretty crappy. If you’re using PulseAll, this could have a noticeable (and in some cases, dramatic) impact on scalability. Windows uses priority boosts to “hand off” the current time-slice to the recipient of an event signal, similar to what occurs when a GUI event is enqueued into a thread’s message queue, which just exacerbates this effect. You’re just about guaranteed that there will be a scheduler ping-pong effect immediately after a pulse. I am honestly surprised we don’t enqueue the Pulse/PulseAll calls on the object’s sync-block, processing them only once the lock has been exited. Yet another benefit to using events is that you can devise algorithms that signal events outside of critical sections, often leading to improved scalability.
We also don’t do any form of spinning. Events are generally speaking very volatile in terms of timing, so spinning only buys you something if you know that events occur frequently enough that wait-avoidance will pay off. In many low-level concurrent algorithms, this is a worthwhile technique, just as spinning while trying to acquire a CRITICAL_SECTION in Win32 (see InitializeCriticalSectionAndSpinCount and SetCriticalSectionSpinCount) can improve scalability by avoiding expensive kernel-mode transitions due to waiting. In fact, it’s conceivable that somebody would want to use an event that never did a real wait, particularly if you’re dealing with a very tiny race condition that is expected to arise very infrequently. This is also dangerous, however, as it can lead to those rare CPU spikes that are almost impossible to debug and discern from a crash dump. This is pretty simple to build, but very hard to fine-tune so that it performs adequately.
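To show just how simple the "build" half is, here is a sketch of a one-shot event that spins briefly before falling back to a real kernel wait. SpinThenWaitEvent is a hypothetical type, and the spin counts are arbitrary placeholders rather than tuned values, which is of course the hard half.

```csharp
using System;
using System.Threading;

sealed class SpinThenWaitEvent : IDisposable
{
    private readonly ManualResetEvent kernelEvent = new ManualResetEvent(false);
    private volatile bool isSet;

    public void Set()
    {
        isSet = true;      // lets spinners get out without a kernel transition
        kernelEvent.Set(); // wakes anybody who already gave up and blocked
    }

    // Returns true if the wait was satisfied purely by spinning,
    // false if it had to fall back to waiting on the kernel event.
    public bool Wait(int spinIterations)
    {
        for (int i = 0; i < spinIterations; i++)
        {
            if (isSet) return true;
            Thread.SpinWait(20); // arbitrary; real code needs careful tuning
        }
        kernelEvent.WaitOne();
        return false;
    }

    public void Dispose()
    {
        kernelEvent.Close();
    }
}

static class SpinThenWaitDemo
{
    static void Main()
    {
        using (SpinThenWaitEvent e = new SpinThenWaitEvent())
        {
            Thread setter = new Thread(() => e.Set());
            setter.Start();
            bool avoidedKernelWait = e.Wait(1000);
            setter.Join();
            Console.WriteLine("avoided kernel wait: {0}", avoidedKernelWait);
        }
    }
}
```

Whether the demo prints True or False depends on timing, which is a small taste of why tuning this sort of thing is so fiddly.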
So in the end, I will simply fall back to my original answer: It depends.
 Thursday, October 26, 2006
The meat of this article is in section II, where a set of best practices are listed. If you’re not interested in the up-front background and high level direction—or you’ve heard it all before—I suggest skipping ahead right to it.
As has been widely publicized recently, mainstream computer architectures in the future will rely on concurrency as a way to improve performance. This is in contrast to what we’ve grown accustomed to over the past 30+ years (see Olukotun, 05 and Sutter, 05): a constant increase in clock frequency and advances in superscalar execution techniques. In order for software to be successful in this brave new world, we must transition to a fundamentally different way of approaching software development and performance work. Simply reducing the number of cycles an algorithm requires to compute an answer won’t necessarily translate into the fastest possible algorithm that scales well as new processors are adopted. This applies to client & server workloads alike. Dual-core is already widely available—Intel Core Duo is standard on the latest Apples and Dells, among others—with quad-core imminent (in early 2007), and predictions of as many as 32-cores in the 2010-12 timeframe. Each core could eventually carry up to 4 hardware threads to mask memory latencies, equating to 128 threads on that same 32-core processor. This trend will continue into the foreseeable future, with the number of cores doubling every 2 years or so.
If we want apps to get faster as new hardware is purchased, then app developers need to slowly get onto the concurrency bandwagon. Moreover, there is a category of interesting apps and algorithms that only become feasible with the amount of compute power this transition will bring—what would you do with a 100 GHz processor?—ranging from rich immersive experiences complete with vision and speech integration to deeper semantic analysis, understanding, and mining of information. If you’re a library developer and want your libraries to fuel those new-age apps, then you also need to hop onto the concurrency bandwagon. There is a catch-22 here that we must desperately overcome: developers probably can’t build the next generation of concurrent apps until libraries help them, but yet we typically wouldn’t decide to do large-scale proactive architectural and design work for app developers until they were clamoring for it.
Although it may sound rather glamorous & revolutionary at first, this transformation won’t be easy and it certainly won’t happen overnight. Our libraries will slowly and carefully evolve over time to better support this new approach to software development. We’ve done a lot of work laying the foundation over the .NET Framework’s first 3 major releases, but this direction in hardware really does represent a fundamental shift in how software will have to be written.
This document exposes some issues, articulates a general direction for fixing them, and, hopefully, will stimulate a slow evolution of our libraries. App developers will want to take incremental advantage of these new architectures as soon as possible, ramping up over time. The practices in here are based on experience. My hope is that many of them (among others) are eventually integrated into libraries, tools, and standard testing methodologies over time. Nobody in their right mind can keep all these rules in their head.
I. The 20,000 Foot View
There are several major themes library developers must focus on in their design and implementation in order to prepare for multi-core:
A. The level of reliability users demand of apps built on the .NET Framework and CLR is increasing over time. Being brought in-process with SQL Server made the CLR team seriously face this fact in Whidbey. At the same time, with the introduction of more concurrency, subtle timing bugs—like races and deadlocks—will actually occur with an increasing probability. Those rare races that would have required an obscure 5-step sequence of context switches at very specific lines of code on a uni-processor, for example, will start surfacing regularly for apps running on 8-core desktop machines. Library authors have gotten better over time at finding and fixing these types of bugs during long stress hauls before shipping a product, but nobody catches them all. Fixing this will require intense testing focus on these types of bugs, hopefully new tools, and the wide-scale adoption of best practices that statistically reduce the risk, as outlined in this doc.
B. Nobody has seriously worked out the scheduling mechanisms for massively concurrent programs in detail, but it will likely involve some form of user-mode scheduling that keeps logical work separate from physical OS threads. Unfortunately, many libraries assume that the identity of the OS thread remains constant over time in a number of places—something called thread affinity—preventing two important things from happening: (1) multiple pieces of work can’t share the same OS thread, i.e. it has become polluted, and (2) a user-mode scheduler can no longer move work between OS threads as needed. Windows’s GUI APIs are notorious for this, including the Shell APIs, in addition to the reams of COM STA code written and thriving in the wild. Fibers are the “official” mechanism on Windows today for user-mode scheduling, and—although there are several problems today—the CLR and SQL Server teams have experience trying to make serious use of them. Regardless of the solution, thread affinity will remain a problem.
C. Scalability via concurrency will become just as important as sequential performance, if not more important (eventually, for some categories of problems). If you assume that most users will try using your library in their now-concurrent programs, you also have to assume they will notice when you take an overly coarse grained lock, block the thread unexpectedly, or pollute the physical thread such that work can’t remain agile. Moreover, a compute-intensive sequential algorithm lurking in a reusable library and exposed by a coarse-grained API will eventually lead to scalability bottlenecks in customer code. Faced with such issues, developers will have no recourse other than to refactor, rewrite, and/or avoid the use of certain APIs. And even worse, they’ll learn all of this through trial & error.
D. It’s not always clear what APIs will lead to synchronization and variable latency blocking. If a customer is trying to build a scalable piece of code, it’s very important to avoid blocking. And of course GUI developers must avoid blocking to maintain good responsiveness (see Duffy, 06c). But if blocking is inevitable, either because of an API design or architectural issue, developers would rather know about it and choose to use an alternative asynchronous version of the API—such as is used by the System.IO.Stream class—or take the extra steps to “hide” this latency by transferring the wait to a spare thread and then joining with it once the wait is done. Libraries need to get much better at informing users about the performance characteristics of APIs, particularly when it comes to blocking. And everybody needs to get better at exposing the power of Windows asynchronous IO through APIs that use file and network IO internally.
These are all fairly dense and complex issues, and are all intertwined in some way. Many of them can be teased apart and mitigated by following a set of best practices. This is not to say they are all easy to follow. These guidelines should evolve as we as a community learn more, so please let me know if you have specific suggestions, or ideas about how we can make this list more useful. I seriously hope these are reinforced with library and tool support over time.
II. The Details
Locking Models
1. Static state must be thread-safe.
Any library code that accesses shared state must do so thread-safely. For most managed code-bases this means that all accesses to objects reachable through a static variable (i.e. that the library itself places there) must be protected by a lock. The lock has to be held over the entire invariant under protection (e.g. for multi-step operations) to ensure other threads don’t witness state inconsistencies in between the updates. Protecting multi-step invariants requires that the granularity of your lock is big enough, but not so big that it leads to scalability problems. Read-modify-write bugs are also a common mishap here; e.g. if you’re updating a lightweight counter held in a static variable, it must be done with an Interlocked.Increment operation, under a lock, or with some other synchronization mechanism.
Reads and writes to statics whose data types are not assigned atomically (>32 bits on 32-bit, >64 bits on 64-bit) also need to happen under a lock or with the appropriate Interlocked method. If they are not, threads can observe “torn values”; for example, while one thread writes a 64-bit value, 0xdeadbeefcafebabe to a field—which actually involves two individual 32-bit writes in the object code—another thread may run concurrently and see a garbage value, say, 0xdeadbeef00000000, because the high 32-bit word was written first. Similar problems can happen to GUID fields on all architectures, for instance, because GUIDs are 128 bits wide. Longs on 32-bit machines also fall into this category, as do value types built out of said data types.
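To make the read-modify-write and torn-value points concrete, here is a minimal sketch. The RequestStats class and its members are hypothetical names invented for illustration; only the Interlocked and Volatile calls are real Framework APIs (Volatile.Read arrived in later .NET releases than the one this article targets).

```csharp
using System.Threading;

// Hypothetical library type: the library itself puts this state behind
// statics, so the library is responsible for synchronizing it.
public static class RequestStats
{
    private static int s_requestCount;
    private static long s_totalBytes; // 64-bit: plain reads/writes can tear on 32-bit

    public static void Record(long bytes)
    {
        // Atomic read-modify-write; a plain s_requestCount++ can lose updates.
        Interlocked.Increment(ref s_requestCount);
        Interlocked.Add(ref s_totalBytes, bytes);
    }

    public static int RequestCount
    {
        // 32-bit reads are atomic; Volatile.Read just avoids a stale cached value.
        get { return Volatile.Read(ref s_requestCount); }
    }

    public static long TotalBytes
    {
        // Interlocked.Read performs an atomic 64-bit read even on 32-bit platforms.
        get { return Interlocked.Read(ref s_totalBytes); }
    }
}
```

Note that the two counters are updated independently, so a reader can observe a count that is momentarily ahead of the byte total. If the pair forms an invariant, both updates belong under a single lock instead.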
This responsibility doesn’t extend to accesses to instance fields or static fields for objects that library users explicitly share themselves. In other words, only if the library makes state accessible through a static variable does it need to protect it with synchronization. In some cases, a library author may choose to make a stronger guarantee—and clearly document it—but it should certainly be the exception rather than the default choice, for instance with libraries specific to the concurrency domain.
2. Instance state doesn’t need to be thread-safe. In most cases it should not be.
Protecting instance state with locks introduces performance overhead that is often ill-justified. The granularity of these locks is typically too small to protect any operation of interesting size in the app. And if the granularity turns out to be wrong for the caller, you either need to expose the implementation’s locking details so callers can hold the same lock across larger operations, or the locking was a waste of time. Claiming an object performs thread-safe reads/writes to instance fields can even give users a false sense of safety because they might not understand the subtleties around locking granularity.
In V1.0 the .NET Framework shipped synchronizable collections with SyncRoots, for example, which in retrospect turned out to be a bad idea: customers were frequently bitten by races they didn’t understand; and, for those who kept a collection private to a single thread or used higher level synchronization rather than the collection’s lock, the performance overhead was substantial and prohibitive. Thankfully we left that part of the V1.0 design out of our new V2.0 generic collections. We still have numerous types that claim “This type is thread-safe” in the MSDN docs, but this is typically limited to simple, immutable value types.
3. Use isolation & immutability where possible to eliminate races.
If you don’t share and mutate data, it doesn’t need lock protection. CLR strings and many value types, for example, are immutable. Isolation can be used to hide intermediate state transitions, although it typically also requires that multiple copies are maintained and periodically synchronized with a central version to eliminate staleness. Sometimes this approach can be used to improve scalability, particularly for highly shared state. Many CRT malloc/free implementations will use a per-thread pool of memory and occasionally rendezvous with a central process-wide pool to eliminate contention, for example.
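The per-thread-copy-plus-rendezvous idea can be sketched with a counter rather than a memory pool. Everything here (HitCounter, FlushThreshold, the batch size of 64) is a hypothetical illustration, not a Framework type.

```csharp
using System;

// Hypothetical sketch: each thread counts in isolation and only
// occasionally rendezvouses with the shared total.
public static class HitCounter
{
    private const int FlushThreshold = 64; // assumed batch size

    private static readonly object s_lock = new object();
    private static long s_total;           // central copy, protected by s_lock

    [ThreadStatic]
    private static int t_pending;          // per-thread copy, needs no lock

    public static void Hit()
    {
        if (++t_pending >= FlushThreshold)
            Flush();
    }

    // Publishes the calling thread's pending hits to the central total.
    public static void Flush()
    {
        int pending = t_pending;
        t_pending = 0;
        lock (s_lock) { s_total += pending; }
    }

    // May lag behind by whatever other threads have not flushed yet.
    public static long ApproximateTotal
    {
        get { lock (s_lock) { return s_total; } }
    }
}
```

The price of the reduced contention is staleness: the central value is only as fresh as the last flush, which is exactly the trade-off described above.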
4. Document your locking model.
Most library code has a simple locking model: code that manipulates statics is thread-safe and everything else is not (see #1 and #2 above). If your internal locking schemes are more complex, you should document those using asserts (see below), good comments, and by writing detailed dev design docs with information about what locks protect what data to help others understand the synchronization rules. If any of these subtleties are surfaced to users of your class then those must also be explained in product documentation and, preferably, reinforced with some form of tools & analysis support. COM/GUI STAs, for example, are one such esoteric scheme, where the locking model leaks directly into the programming model. As a community, we would be best served if new instances of such specialized models are few and far between; I for one would be interested in hearing of and understanding any such cases.
Using Locks
5. Use the C# ‘lock’ and VB ‘SyncLock’ statements for all synchronized regions.
Following this guidance ensures that locks will be released even in the face of asynchronous thread aborts, leading to fewer deadlocks statistically. The code generated by these statements is such that our finally block will always run and execute the Monitor.Exit if the lock was acquired. This still doesn’t protect code from rude AppDomain unloads—but this is not something most library developers have to worry about, except for reliability-sensitive code that maintains complex process-wide memory invariants, such as code that interops heavily with native code. (See Duffy, 05 for more details.)
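For reference, this is roughly the shape of code a ‘lock’ statement stands for. The Account class is a hypothetical example. The exact expansion has varied between compiler versions; the form shown, using the Monitor.Enter(object, ref bool) overload, is the one used by later C# compilers, so treat it as an approximation rather than what every compiler emits.

```csharp
using System.Threading;

public sealed class Account
{
    private readonly object _sync = new object();
    private decimal _balance;

    // Preferred form: let the language generate the try/finally.
    public void Deposit(decimal amount)
    {
        lock (_sync)
        {
            _balance += amount;
        }
    }

    // Approximately what the statement above expands to.
    public void DepositExpanded(decimal amount)
    {
        bool lockTaken = false;
        try
        {
            Monitor.Enter(_sync, ref lockTaken);
            _balance += amount;
        }
        finally
        {
            // Only release if the acquisition actually succeeded.
            if (lockTaken) Monitor.Exit(_sync);
        }
    }

    public decimal Balance
    {
        get { lock (_sync) { return _balance; } }
    }
}
```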
If you decide to violate this guidance, it should be for one of two reasons: (1) you need to use a CER to absolutely guarantee the lock is released in rude AppDomain unload cases—perhaps because a lock will be used during AppDomain tear-down and you’d like to avoid deadlocks—or (2) you have some more sophisticated Enter/Exit pattern that is not lexical. For (1) I would recommend talking to somebody at Microsoft so we can understand these scenarios better; if there are enough people who need to do this, we may conceivably consider adding better Framework support in the future. For (2) you should first try to avoid this pattern. If it’s unavoidable, you must use finalizers to ensure that locks are not orphaned if the expected releasing thread is aborted and never reaches the Exit. As with (1), you may or may not need to use a critical finalizer based on your reliability requirements.
6. Avoid making calls to someone else’s code while you hold a lock.
This applies to most virtual, interface, and delegate calls while a lock is held—as well as ordinary statically dispatched calls—into subsystems with which you aren’t familiar. The more you know about the code being run when you hold a lock, the better off you will be. If you follow this approach, you’ll encounter far fewer deadlocks, hard-to-reproduce reentrancy bugs, and surprising dynamic composition problems, all of which can lead to hangs when your API is used on the UI thread, reliability problems, and frustration for your customer. Locks simply don’t compose very well; ignoring this and attempting to compose them in this way is fraught with peril.
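One common way to follow this rule with callbacks is to snapshot the subscriber list under the lock and invoke the delegates after releasing it. The PriceFeed type below is hypothetical and only meant as a sketch of that pattern.

```csharp
using System;
using System.Collections.Generic;

public sealed class PriceFeed
{
    private readonly object _sync = new object();
    private readonly List<Action<decimal>> _subscribers = new List<Action<decimal>>();
    private decimal _last;

    public void Subscribe(Action<decimal> callback)
    {
        lock (_sync) { _subscribers.Add(callback); }
    }

    public void Publish(decimal price)
    {
        Action<decimal>[] snapshot;
        lock (_sync)
        {
            _last = price;
            snapshot = _subscribers.ToArray(); // copy while protected
        }

        // The lock is released here: callbacks are someone else's code and
        // may block, take their own locks, or call back into this object.
        foreach (Action<decimal> callback in snapshot)
            callback(price);
    }

    public decimal Last
    {
        get { lock (_sync) { return _last; } }
    }
}
```

The cost is that a callback may observe state that has already moved on by the time it runs, so callers have to tolerate slightly stale notifications.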
7. Avoid blocking while you hold a lock.
Admittedly sometimes violating this advice is unavoidable. Trying to acquire a lock is itself an operation that can block under contention. But blocking on high or variable latency operations such as IO will effectively serialize any other thread trying to acquire that lock “behind” your IO request. If that other thread trying to acquire the lock is on the UI thread, you may have just helped to cause a user-visible hang. The app developer may not understand the cause of this hang if the lock is buried inside of your library, and it may be tricky and error-prone to work around.
Aside from having scalability impacts, blocking while a lock is held can lead to deadlocks and invariants being broken. Any time you block on an STA thread, the CLR uses it as a chance to run the message loop. When run on pre-Windows 2000 we use custom MsgWaitForMultipleObjects pumping code, and post-Windows 2000 we use OLE’s CoWaitForMultipleHandles. While this style of pumping processes only a tiny subset of UI messages, it can dispatch arbitrary COM-to-CLR interop calls. These calls include cross-thread/apartment SendMessages, such as an MTA-to-STA call through a proxy. If this happens while a lock is held, that newly dispatched work also runs under the protection of the lock. If the same object is accessed, this can lead to surprising bugs where invariants are still broken inside the lock. (Note that COM offers ways to exit the SynchronizationContext when blocking in this fashion, but this is outside of the scope of this doc.)
Try to refactor your code so the time you hold a lock is minimal, and any communication across threads, processes, or to/from devices happens at the edges of those lock acquisition/releases. All libraries really should minimize all synchronization to leaf-level code.
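As a sketch of pushing the blocking work to the edges of the lock, consider a cache whose loader may block on IO. DocumentCache and its loader delegate are hypothetical; the point is only that the slow call runs with no lock held and the lock covers just the dictionary updates.

```csharp
using System;
using System.Collections.Generic;

public sealed class DocumentCache
{
    private readonly object _sync = new object();
    private readonly Dictionary<string, string> _cache = new Dictionary<string, string>();
    private readonly Func<string, string> _load; // assumed slow, may block on IO

    public DocumentCache(Func<string, string> load) { _load = load; }

    public string Get(string key)
    {
        string value;
        lock (_sync)
        {
            if (_cache.TryGetValue(key, out value)) return value;
        }

        // Blocking call happens with no lock held.
        string loaded = _load(key);

        lock (_sync)
        {
            // Another thread may have raced us; keep whichever arrived first.
            if (!_cache.TryGetValue(key, out value))
            {
                _cache[key] = loaded;
                value = loaded;
            }
            return value;
        }
    }
}
```

The trade-off is that two threads may occasionally load the same key concurrently. That wasted work is usually far cheaper than serializing every caller behind one IO request.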
8. Assert lock ownership.
Races often result when some leaf-level code assumes a lock has been taken at a higher level on the call-stack, but the caller has forgotten to acquire it. Or maybe the owner of that code recently refactored it and didn’t realize the implicit pre-condition that was accidentally broken. This may go undetected in test suites unless the race actually happens. I personally hope we add a Monitor.IsHeld API in the future which you could wrap in a call to Debug.Assert (or whatever your assert infrastructure happens to be). Sans that, you can build this today by wrapping calls to Monitor.Enter/Exit and maintaining recursion state yourself. It’d be great if somebody developed some type of annotations in the future to make such assertions easier to write and maintain.
Note that the IsHeld functionality should never be used to dynamically influence lock acquisition and release at runtime, e.g. avoiding recursion and taking or releasing based on its value. This indicates poorly factored code. In fact, the only use we would encourage is SomeAssertAPI(Monitor.IsHeld(foo)).
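A minimal sketch of such a wrapper follows. CheckedLock is a hypothetical type, not a Framework API, and it assumes managed thread IDs are never 0. It asserts against recursion rather than tracking a recursion count, which is a simplification of the approach described above.

```csharp
using System.Diagnostics;
using System.Threading;

// Hypothetical wrapper around Monitor that remembers its owner so that
// leaf-level code can assert the lock is held.
public sealed class CheckedLock
{
    private readonly object _sync = new object();
    private int _ownerThreadId; // 0 means unowned; written only while _sync is held

    public void Enter()
    {
        Debug.Assert(!IsHeldByCurrentThread, "recursive acquisition");
        Monitor.Enter(_sync);
        _ownerThreadId = Thread.CurrentThread.ManagedThreadId;
    }

    public void Exit()
    {
        Debug.Assert(IsHeldByCurrentThread, "releasing a lock we do not own");
        _ownerThreadId = 0;
        Monitor.Exit(_sync);
    }

    // For assertions only; never branch on this to decide whether to lock.
    public bool IsHeldByCurrentThread
    {
        get
        {
            return Volatile.Read(ref _ownerThreadId)
                == Thread.CurrentThread.ManagedThreadId;
        }
    }
}
```

Because this bypasses the ‘lock’ statement, callers need their own try/finally around Enter and Exit, and they give up the thread-abort protection that the language keyword provides.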
9. Avoid lock recursion in your design. Use a non-recursive lock where possible.
Recursion typically indicates an over-simplification in your synchronization design that often leads to less reliable code. Some designs use lock recursion as a way to avoid splitting functions into those that take locks and those that assume locks are already taken. This can admittedly lead to a reduction in code size and therefore a shorter time-to-write, but results in a more brittle design in the end.
It is always a better idea to factor code into public entry-points that take non-recursive locks, and internal worker functions that assert a lock is held. Recursive lock calls are redundant work that contributes to raw performance overhead. But worse, depending on recursion can make it more difficult to understand the synchronization behavior of your program, in particular at what boundaries invariants are supposed to hold. Usually we’d like to say that the first line after a lock acquisition represents an invariant “safe point” for an object, but as soon as recursion is introduced this statement can no longer be made confidently. This in turn makes it more difficult to ensure correct and reliable behavior when dynamically composed.
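The factoring can be sketched like this. Inventory is a hypothetical class. The assertion uses Monitor.IsEntered, which was added to the Framework in later releases (4.5 onward) than the one this article was written against; on older runtimes a hand-rolled ownership check would take its place.

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

public sealed class Inventory
{
    private readonly object _sync = new object();
    private readonly Dictionary<string, int> _stock = new Dictionary<string, int>();

    // Public entry points take the lock exactly once...
    public void Add(string item, int quantity)
    {
        lock (_sync) { AdjustLocked(item, quantity); }
    }

    public void Move(string from, string to, int quantity)
    {
        lock (_sync)
        {
            // ...and compose workers rather than other public methods,
            // which would re-acquire the lock recursively.
            AdjustLocked(from, -quantity);
            AdjustLocked(to, quantity);
        }
    }

    public int Count(string item)
    {
        lock (_sync)
        {
            int n;
            return _stock.TryGetValue(item, out n) ? n : 0;
        }
    }

    // Internal worker: assumes the caller holds _sync, and says so.
    private void AdjustLocked(string item, int delta)
    {
        Debug.Assert(Monitor.IsEntered(_sync), "caller must hold _sync");
        int current;
        _stock.TryGetValue(item, out current); // leaves 0 if the item is missing
        _stock[item] = current + delta;
    }
}
```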
As a community, we should transition to non-recursive locks as soon as possible. Most locks that you have in your toolkit today—including Win32 CRITICAL_SECTIONs and the CLR Monitor—screwed this up. Java realized this and now ships non-recursive variants of their locks. Using a non-recursive design requires more discipline, and therefore we expect some scenarios to continue using recursive locks for some time to come. Over time, however, we’d like to wean developers off of lock recursion completely.
10. Don’t build your own lock.
Most locks are built out of simple principles at the core. There’s a state variable, a few interlocked instructions (exposed to managed code through the Interlocked class), with some form of spinning and possibly waiting on an event when contention is detected. Given this, it may look straightforward to build your own. It is deceptively difficult.
The CLR locks have to coordinate with hosts so that they can perform deadlock detection and sophisticated user-mode scheduling for hosted customer-authored code. Some of our locks (Monitor) make higher reliability guarantees so that we can safely use them during AppDomain teardown. We have tuned our Monitor implementation to use an ideal mixture of spinning & waiting across many OS SKUs, CPU architectures, and cache hierarchy arrangements. We make sure we work correctly with Intel HyperThreading. We mark critical regions of code manipulating the lock data structure itself so that would-be thread aborts will be processed correctly while sensitive shared state manipulation is underway. And last but not least, the C# and VB languages offer the ‘lock’ and ‘SyncLock’ keywords whose code-generation pattern ensures that our Framework and our customer’s code doesn’t orphan locks in the face of asynchronous thread aborts. To get all of this right requires a lot of hard work, time, and testing.
With that said, we may not have every lock you could ever want in the Framework today. Spin locks are a popular request that can help scalability of highly concurrent and leaf-level code. Thankfully, Jeff Richter wrote an article and supplied a suitable spin lock on MSDN some time ago. In Orcas, we are tentatively going to supply a new ReaderWriterLockSlim type that offers much better performance and scalability than our current ReaderWriterLock (watch those CTPs). If there’s still some interesting lock you need but we don’t currently offer, please drop me a line and let me know. If you need it, chances are somebody else does too.
11. Don’t call Monitor.Enter on AppDomain-agile objects (e.g. Types and strings).
Instances of some types are shared across AppDomains. The most notable are Type objects for domain neutral assemblies (like mscorlib) and cross-assembly interned strings. While it may look innocuous, locks taken on these things are visible across all AppDomains in the process. As an example, two AppDomains executing this same code will stomp all over each other:
lock (typeof(System.String)) { … }
This can cause severe reliability problems should a lock get orphaned in an add-in or hosted scenario, possibly causing deadlocks from deep within your library that seemingly inexplicably span multiple AppDomains. The resulting code also exhibits false conflicts between code running in multiple domains and therefore can impact scalability in a way that is difficult for customers (and library authors!) to reason about.
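The usual alternative is a private object allocated solely to serve as the lock, which nothing outside the class can reach. Registry here is a hypothetical class used only to illustrate the pattern.

```csharp
public sealed class Registry
{
    // Avoid: these objects may be shared with code you do not control.
    //   lock (typeof(Registry)) { ... }
    //   lock ("Registry.Lock") { ... }

    // Prefer: a private object that exists only to be locked.
    private static readonly object s_sync = new object();
    private static int s_registered;

    public static int Register()
    {
        lock (s_sync)
        {
            return ++s_registered;
        }
    }
}
```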
12. Don’t use a machine- or process-wide synchronization primitive when AppDomain-wide would suffice.
The Mutex and Semaphore types in the .NET Framework should only ever be used for legacy, interop, cross-AppDomain, and cross-process reasons. They are very heavyweight (several orders of magnitude slower than a CLR Monitor, actually) and they also introduce additional reliability and affinity problems: they can be orphaned, process-external DoS attacks can be mounted against them, and they can introduce synchronization bottlenecks that contribute to scalability blockers. Moreover, they are associated with the OS thread, and therefore impose thread affinity. As already noted, this is a bad thing.
13. A race condition or deadlock in library code is always a bug.
Race conditions and deadlocks can be very difficult to fix. Sometimes it requires refactoring a bunch of code to work around a seemingly obscure, corner-case sequence of events. It’s tempting to rearrange things to narrow the window of the race or reduce the likelihood of a deadlock. But please don’t lose sight of the fact that this still represents a correctness problem in the library itself, no matter how narrow the race is made. Sometimes fixing the bug would require breaking changes. Sometimes you simply don’t have enough time to fix the bug in time for product ship. In either case, this is something that should be measured and decided based on the quality bar for the product at the time the bug is found. Remember that as higher degrees of concurrency are used in the hardware, the probability of these bugs resurfacing becomes higher. A murky “won’t fix” race condition in 2008 that repros only once in a while on high end machines could become a costly servicing fix by 2010 that repros routinely on middle-of-the-line hardware. That jump from 32 to 64 cores is a rather substantial one, at least in terms of change to program timing.
Reliability
14. Every lock acquisition might throw an exception. Be prepared for it.
Most locks lazily allocate an event if a lock acquisition encounters contention, including CLR monitors. This allocation can fail during low resource conditions, causing OOMs originating from the entrance to the lock. (Note that a typical non-blocking spin lock cannot fail with OOM, which allows it to be used in some resource constrained scenarios such as inside a CER.) Similarly, a host like SQL Server can perform deadlock detection and even break those deadlocks by generating exceptions that originate from the Enter statement, manifesting as a System.Runtime.InteropServices.COMException.
Often there isn’t much that can be done in response to such an exception. But reliability- and security-sensitive code that must deal with failure robustly should consider this case. We would have liked it to be the case that host deadlocks can be responded to intelligently, but most library code can’t intelligently unwind to a safe point on the stack so that it can back-off and retry the operation. There is simply too much cross-library intermixing on a typical stack. This is why timeout-based Monitor acquisitions with TryEnter are typically a bad idea for deadlock prevention.
15. Lock leveling should be used to avoid deadlocks.
Lock leveling (a.k.a. lock hierarchy) is a scheme in which a relative number (a level) is assigned to every lock, and lock acquisition is restricted such that a thread may only acquire locks at levels strictly lower than those it already holds. Strictly following this discipline guarantees a deadlock free system, and is described in more detail in Duffy, 06b. Without this, libraries are subject to dynamic composition- and reentrancy-induced deadlocks, which cause users trying to write even moderately reliable code a lot of pain and frustration. This pain will only become worse as more of them try to compose our libraries into highly concurrent applications. An alternative to true lock leveling which doesn’t require new BCL types is to stick to non-recursive locks and to ensure that multiple lock acquisitions are done at once, in some well-defined order.
There are two big problems that will surely get in the way of adopting lock leveling today.
First, we don’t have a standard leveled lock type in the .NET Framework today. While the article I referenced contains a sample, the simple fact is that the lion’s share of library developers and customers will not start lock leveling in any serious way without official support. There is also a question of whether programmers can be wildly successful building apps and libraries with lock leveling without good tool and deeper programming model support.
Second, lock leveling is a very onerous discipline. We’ve used it in the CLR code base for the parts of the system that are relatively closed. (I’m fine saying this since I’m basing this off of the Rotor code-base.) Lock leveling doesn’t typically compose well with other libraries because the levels are represented using arbitrary numbering schemes. You might want to extend it to prevent certain cross-assembly calls, interop calls that might acquire Win32 critical sections, or calls into other parts of the system that acquire locks outside of the current hierarchy. These are all features that have to be built on top of the base lock leveling scheme; again, without a standard library for this, it’s unlikely everybody will want to build it all themselves. Lock leveling is not a silver bullet, but it’s probably the best thing we have for avoiding deadlocks with today’s multithreading primitives.
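To give a feel for the base scheme, here is a deliberately small sketch of a leveled lock. LeveledLock is a hypothetical type and is not the sample from the referenced article. It assumes locks are released in reverse order of acquisition and it leaves out everything a production version would need (inter-library level allocation, diagnostics, reliability hardening).

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical sketch; the Framework has no leveled lock type.
public sealed class LeveledLock
{
    [ThreadStatic]
    private static Stack<int> t_heldLevels; // levels held by this thread, innermost on top

    private readonly object _sync = new object();
    private readonly int _level;

    public LeveledLock(int level) { _level = level; }

    public void Enter()
    {
        if (t_heldLevels == null) t_heldLevels = new Stack<int>();

        // Levels strictly decrease, so the top of the stack is the lowest
        // level held. Rejecting >= also rules out recursive acquisition.
        if (t_heldLevels.Count > 0 && _level >= t_heldLevels.Peek())
            throw new InvalidOperationException(
                "Lock level violation: holding " + t_heldLevels.Peek() +
                ", requested " + _level);

        Monitor.Enter(_sync);
        t_heldLevels.Push(_level);
    }

    public void Exit()
    {
        // Assumption: releases happen in reverse order of acquisition.
        if (t_heldLevels == null || t_heldLevels.Count == 0 || t_heldLevels.Peek() != _level)
            throw new InvalidOperationException("Out-of-order lock release");

        t_heldLevels.Pop();
        Monitor.Exit(_sync);
    }
}
```

The violation is detected before blocking, so a bad acquisition order fails deterministically on the first run through that code path instead of deadlocking only when the timing happens to line up.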
16. Restore sensitive invariants in the face of an exception before the 1st pass executes up the stack.
This is in part a security concern as well as a reliability one. The CLR exception model is a 2-pass model which we inherit from Windows SEH. The 1st pass runs before finally blocks execute, meaning that the locks held by the thread are held when up-stack filters are run and get a chance to catch the exception. VB and VC++ expose filters in the language, while C# doesn’t. Code inside of filters can see inside possibly broken invariants because the locks are still held.
Thankfully CAS asserts and impersonation cannot leak in this way, but this can still cause some surprising failures. You can stop the 1st pass and ensure your lock is released by wrapping a try/catch around the sensitive operation and re-throwing the exception from your catch:
try {
    lock (foo) {
        // (Break invariants…)
        // Possibly throw an exception…
        // (Restore invariants…)
    }
}
catch {
    throw;
}
This is only something you should consider if security and reliability requirements dictate it.
17. If class constructors are required to have run for code inside of a lock, consider eagerly running the constructor with a call to RuntimeHelpers.RunClassConstructor.
Reentrancy deadlocks and broken invariants involving cctors are difficult to reason about because behavior is based on program history and timing, often in a nondeterministic way. The problem specific to locks is that running a cctor effectively introduces possibly reentrant points into your code anywhere statics are accessed for a type with a cctor. If running the cctor causes an exception or attempts to access some data structure which the current thread has already locked and placed into an inconsistent state, you may encounter bugs related to these broken invariants. If using a non-recursive lock, this can lead to deadlocks. Calling RuntimeHelpers.RunClassConstructor hoists potential problems such as this to a well-defined point in your code. It is not perfect, as other locks may be held higher up on the call-stack, but it can statistically reduce the chance of problems in your users’ code.
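A minimal sketch of the hoisting, with hypothetical Defaults and Connection types: the class constructor is forced to run before the lock is taken, so the static access inside the lock can no longer trigger it.

```csharp
using System.Runtime.CompilerServices;

public static class Defaults
{
    public static readonly int Timeout;

    // Explicit cctor: without the hoist, it runs lazily on first access.
    static Defaults() { Timeout = 30; }
}

public sealed class Connection
{
    private readonly object _sync = new object();
    private int _timeout;

    public void Reset()
    {
        // Force Defaults' cctor to run here, at a well-defined point,
        // before this method takes its lock.
        RuntimeHelpers.RunClassConstructor(typeof(Defaults).TypeHandle);

        lock (_sync)
        {
            _timeout = Defaults.Timeout; // cctor has already run
        }
    }

    public int Timeout
    {
        get { lock (_sync) { return _timeout; } }
    }
}
```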
18. Don’t use Windows Asynchronous Procedure Calls (APCs) in managed code.
We recently considered adding APC-based file IO completion to the BCL file APIs. Several Win32 IO APIs offer this, and some use it for scalable IO that doesn’t need to use an event or an IO Completion thread. After considering it briefly, we realized how bad of an idea adding similar support to managed code would have been. APCs pollute the OS thread to which they are tied, and are a strange form of thread affinity (more on that later). They can fire at arbitrary alertable blocking points in the code, including after a thread pool thread has been returned back to the pool, after the finalizer thread has gone on to Finalize other objects in the process, or even at some random blocking point deep within the EE (perhaps while we aren’t ready for it). If an APC raises an exception, the state of affairs at the time of the crash is likely to be quite confusing. The stack certainly will be. Not only do APCs represent possible security threats, but they can also introduce many of the subtle reliability problems already outlined. They have been avoided almost entirely in three major releases of the .NET Framework, and we ought to continue avoiding them.
19. Don’t change a thread’s priority.
This could fall into the rules below about “Scheduling & Threads,” because it is semantically tied to the notion of an OS thread, were it not for the large reliability risk inherent in it. Priority changes can cause subtle scalability problems due to priority inversion, including preventing the CLR’s finalizer thread (which runs at high priority) from making forward progress. The OS has support for anti-starvation of threads—including a balance set manager which boosts the priority of a thread waiting for a lock for certain OS synchronization primitives—but this actually doesn’t extend to CLR locks. Testing in isolation will tend not to find priority-related bugs. Instead, app developers trying to compose libraries into their programs will discover them. Users may decide to go ahead and change priorities themselves, but then the onus for breaking a best practice is on them, not us.
20. Always test & retest a wait condition inside of a lock.
A common mistake when writing cross-thread synchronization code is to either forget to retest a condition each time a thread wakes up or to test this condition outside of a lock. If you’re using an EventWaitHandle or Monitor.Wait/Pulse/PulseAll, for example, to put one thread to sleep while another produces some state transition of interest, you typically need to double-check that that state is in the expected condition when waking. This is especially true of single-producer/multi-consumer scenarios, where multiple threads frequently race with one another. For example:
void Put(T obj) {
    lock (myLock) {
        S1; // enqueue it
        Monitor.PulseAll(myLock);
    }
}

T Get() {
    lock (myLock) {
        while (empty) {
            Monitor.Wait(myLock);
        }
        S2; // dequeue and return the item
    }
}
Notice that Get loops around testing the ‘empty’ variable to decide when to wait for a new item, and it does so while holding the lock. Whenever this consumer is woken up, it must retest the variable. If it doesn’t, multiple threads may wake up due to a single new item becoming available, only for all but one of them to find that the queue had already become empty again by the time they reached S2. This is generally easier to get right with Monitors because they combine the lock with the condition variable. Missteps are easier to make with Win32 events because the lock must be managed separately.
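Filling in the S1 and S2 placeholders gives a complete, if simplistic, unbounded blocking queue. The BlockingQueue name and the choice of Queue<T> as the backing store are assumptions made for this sketch.

```csharp
using System.Collections.Generic;
using System.Threading;

public sealed class BlockingQueue<T>
{
    private readonly object _lock = new object();
    private readonly Queue<T> _items = new Queue<T>();

    public void Put(T item)
    {
        lock (_lock)
        {
            _items.Enqueue(item);
            Monitor.PulseAll(_lock); // wake every waiter; each one retests
        }
    }

    public T Get()
    {
        lock (_lock)
        {
            // 'while', not 'if': another consumer may have taken the item
            // between the pulse and this thread reacquiring the lock.
            while (_items.Count == 0)
                Monitor.Wait(_lock);

            return _items.Dequeue();
        }
    }
}
```

Monitor.Wait releases the lock while sleeping and reacquires it before returning, which is why both the test and the dequeue happen under the same lock.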
Scheduling & Threads
21. Don’t write code that depends on the OS thread ID or HANDLE. Use Thread.CurrentThread or Thread.CurrentThread.ManagedThreadId instead.
When code depends on the identity of the actual OS thread, the logical task running that code is bound to the thread. This is a major piece of the thread affinity problem mentioned earlier on. If running on a system where threads are migrated between OS threads using some form of user-mode scheduling—such as fibers—this can break if user-mode switches happen at certain points in the code. If this dependency is enforced (using Thread.BeginThreadAffinity and EndThreadAffinity), at least the system remains correct, but this still limits the ability of the scheduler to maintain overall system forward progress.
Unfortunately, many Win32 and Framework APIs may imply thread affinity when used. Several GUI APIs require that they are called from a thread which owns the message queue for the GUI element in question. Historically, some Microsoft components like the Shell, MSHTML.DLL, and Office COM APIs have also abused this practice. The situation on the server is much better, but it still isn’t perfect. Some APIs we design with the client in mind end up being used on the server, often with less than desirable results. My hope is that the whole platform moves away from these problems in the future.
22. Mark regions of code that do depend on the OS thread identity with Thread.BeginThreadAffinity/EndThreadAffinity.
The corollary to the previous rule is that, if you must have code that depends on the OS identity, you must tell the CLR (and potential host) about it. That’s what the Thread.BeginThreadAffinity and EndThreadAffinity methods, new to V2.0, do. Marking these regions prevents OS thread migration altogether. This is a crappy practice, but is less crappy than allowing thread migration to happen anyway, causing things to break in unexpected and unpredictable ways.
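A small sketch of how such a region might be marked. The AffinityRegion helper is a hypothetical wrapper, not a Framework type; the only real APIs used are the static Thread.BeginThreadAffinity and Thread.EndThreadAffinity methods. The try/finally keeps the calls paired even if the body throws.

```csharp
using System;
using System.Threading;

public static class AffinityRegion
{
    // Hypothetical helper: runs 'body' inside a declared thread-affinity region.
    public static T Run<T>(Func<T> body)
    {
        Thread.BeginThreadAffinity();
        try
        {
            // Code in here may depend on the identity of the OS thread.
            return body();
        }
        finally
        {
            // Always end the region so the host regains its freedom to migrate.
            Thread.EndThreadAffinity();
        }
    }
}
```

Keep the region as short as possible: everything inside it is work the scheduler can no longer move.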
23. Always access TLS through the .NET Framework mechanisms: ThreadStaticAttribute or Thread.GetData/SetData and related members.
The implementation of these APIs abstracts away the dependency on the OS thread, allowing you to store state associated with the logical piece of work. Although they sound very thread-specific, these actually store state based on whatever user-mode scheduling mechanism is being used, and therefore you don’t actually take thread affinity when you use them. For example, we can (in theory) store information into Fiber Local Storage (FLS) or manually move data across fibers rather than using the underlying Windows Thread Local Storage (TLS) mechanisms if a host has decided to use fibers. While it’s tempting to say “Who cares?” for this one, particularly since Whidbey decided not to support fiber mode before shipping, I believe it’s premature: we haven’t seen the death of fibers just yet.
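As an illustration, here is roughly what per-thread state looks like with ThreadStaticAttribute. The CallDepth class and its members are hypothetical names; the attribute is the only Framework piece involved.

```csharp
using System;
using System.Threading;

public static class CallDepth
{
    // Each logical thread sees its own copy. Field initializers on a
    // [ThreadStatic] field run only once, so rely on the default value (0).
    [ThreadStatic]
    private static int t_depth;

    public static int Enter()
    {
        return ++t_depth;
    }

    public static int Current
    {
        get { return t_depth; }
    }
}
```

Two threads calling CallDepth.Enter() never observe each other's counts, and no P/Invoke to the Win32 TLS functions is needed.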
24. Always access the security/impersonation tokens or locale information through the Thread object.
As with the previous item, we abstract away the storage of this information on the Thread object, via the Thread.CurrentCulture, Thread.CurrentUICulture, and Thread.CurrentPrincipal properties. We flow this information across logical async points as required, and therefore using them doesn’t imply any sort of hard OS thread affinity.
25. Always access the “last error” after an interop call via Marshal.GetLastWin32Error.
If you mark a P/Invoke signature with [DllImportAttribute(…, SetLastError=true)], then the CLR will store the Win32 last error on the logical CLR thread. This ensures that, even if a fiber switch happens before you can check this value, your last error will be preserved across the physical OS threads. The Win32 APIs kernel32!GetLastError and kernel32!SetLastError, on the other hand, store this information in the TEB. Therefore, if you are P/Invoking to get at the last error information, you are apt to be quite surprised if you are running in an environment that permits thread migration. You can avoid this by always using the safe Marshal.GetLastWin32Error function.
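A sketch of the pattern, assuming a Windows machine. NativeFile and its Delete method are hypothetical names, and kernel32!DeleteFile is used purely as a convenient example of an API that sets the last error on failure.

```csharp
using System.ComponentModel;
using System.Runtime.InteropServices;

public static class NativeFile
{
    // SetLastError = true asks the CLR to capture the Win32 last error
    // immediately after the native call returns.
    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool DeleteFile(string lpFileName);

    public static void Delete(string path)
    {
        if (!DeleteFile(path))
        {
            // Read the value the CLR captured; do not P/Invoke to GetLastError.
            int error = Marshal.GetLastWin32Error();
            throw new Win32Exception(error);
        }
    }
}
```

Without SetLastError = true on the signature, Marshal.GetLastWin32Error is not guaranteed to return the error from this particular call.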
26. Avoid P/Invoking to other Win32 APIs that access data in the Thread Environment Block (TEB).
Security and locale information is something Win32 stores in the TEB that we already expose in the Framework APIs, so it’s rather easy to follow the advice here. Unfortunately, many Win32 APIs access data from the TEB without necessarily saying so, or look for & possibly lazily create a window message queue (i.e. in USER32), all of which creates a sort of silent thread affinity. In other words, a disaster waiting to happen. I wish I had a big laundry list of black-listed APIs, but I don’t.
Scalability & Performance
27. Consider using a reader/writer lock for read-only synchronization.
A lot of concurrent code has a high read-to-write ratio. Given this, using exclusive synchronization (like CLR monitors) can hurt scalability in situations with a large number of concurrent readers. While starting off with a reader/writer lock could be viewed as a premature optimization, the reality is that many situations warrant using one due to the inherent properties of the problem. If you know you’ll have more concurrent readers than writers, you can probably do some quick back-of-the-napkin math and come to the conclusion that a reader/writer lock is a good first approach. For other cases, refactoring existing code to use one can be a fairly straightforward translation. If you do this, obviously you need to be careful that the read-lock-protected code actually only performs reads to maintain the correctness of your system.
There has been a lot of negative press about the BCL’s ReaderWriterLock. In particular, its acquires cost about 8x as much as successful Monitor acquires. Unfortunately, this has (in the past) prevented many library developers from using reader/writer locks altogether. This is the primary reason we are tentatively supplying a new lock implementation, ReaderWriterLockSlim, in Orcas. The BCL’s synchronization primitives ought not to get in the way of optimal synchronization for your data structures.
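For a sense of what the translation looks like, here is a minimal read-mostly cache. ReadMostlyCache is a hypothetical name; the sketch assumes the ReaderWriterLockSlim API as it eventually shipped in System.Threading (EnterReadLock/ExitReadLock and EnterWriteLock/ExitWriteLock).

```csharp
using System.Collections.Generic;
using System.Threading;

public sealed class ReadMostlyCache<TKey, TValue>
{
    private readonly ReaderWriterLockSlim m_lock = new ReaderWriterLockSlim();
    private readonly Dictionary<TKey, TValue> m_map = new Dictionary<TKey, TValue>();

    public bool TryGet(TKey key, out TValue value)
    {
        // Many readers may hold the read lock at once.
        m_lock.EnterReadLock();
        try
        {
            return m_map.TryGetValue(key, out value);
        }
        finally
        {
            m_lock.ExitReadLock();
        }
    }

    public void Set(TKey key, TValue value)
    {
        // Writers are exclusive with readers and with each other.
        m_lock.EnterWriteLock();
        try
        {
            m_map[key] = value;
        }
        finally
        {
            m_lock.ExitWriteLock();
        }
    }
}
```

Note that TryGet really does only read. Anything that mutates the dictionary under the read lock, even a "harmless" lazy insert, breaks the scheme.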
28. Avoid lock free code at all costs for all but the most critical perf reasons.
Compilers and processors reorder reads and writes to get better perf, but in doing so make it harder to write code that is sensitive to the read/write orderings between multiple threads. The CLR memory model gives a base level of guarantees that we preserve across all hardware platforms. With that said, any sort of dependence on the CLR memory model is advised against; we did that work in 2.0 to strengthen the memory model to eliminate object-publish-before-initialization and double-checked locking bugs that were found throughout the .NET Framework, not to encourage you to write more lock free code.
The reason? Lock free code is impossible to write, maintain, and debug for most developers, even those who have been doing it for years. This is the type of code whose proliferation will lead to poor reliability across the board for managed libraries, longer stress lock downs on multi-core and MP machines, and is best avoided. Use of volatile reads and writes and Thread.MemoryBarrier should be viewed with great suspicion, as it probably means somebody is trying to be more clever than is required.
With all of that said, there are a couple of “blessed” techniques that can be considered when informed by scalability and perf testing (see Morrison, 05):
(a) The simple double checked locking pattern can be used when you need to prevent multiple instances from being created and you don’t want to use a cctor (because the state may not be needed by all users of your class). This pattern takes the form:
    static State s_state;
    static object s_stateLock = new object();

    static State GetState()
    {
        if (s_state == null)
        {
            lock (s_stateLock)
            {
                if (s_state == null)
                {
                    s_state = new State();
                }
            }
        }
        return s_state;
    }
Note that simple variants of this pattern don’t work, such as keeping a separate ‘bool initialized’ variable, due to read reordering (see Duffy, 06).
(b) Optimistic/non-blocking concurrency. In some cases, you can safely use Interlocked operations to avoid a heavyweight lock, such as doing a one-time allocation of data, incrementing counters, or inserting into a list. In other areas, you might use a variable to determine when a read has become dirty, and retry it, typically done via a version number incremented on each update.
Again, you should only pursue these approaches if you’ve measured or done the thought exercise to determine it will pay off. There are additional tricks you can play if you really need to, but most library code should not go any further than what is listed here.
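To illustrate the simpler cases under (b), here is a sketch of a one-time allocation and a counter built on Interlocked. The LowLock and Settings names are hypothetical, and Volatile.Read assumes a later Framework version than the one discussed here. Unlike double-checked locking, this variant may construct more than one Settings instance under a race, but only one is ever published.

```csharp
using System.Threading;

public sealed class Settings { }

public static class LowLock
{
    private static Settings s_settings;
    private static long s_hits;

    public static Settings GetSettings()
    {
        Settings existing = Volatile.Read(ref s_settings);
        if (existing != null)
        {
            return existing;
        }

        Settings fresh = new Settings();
        // CompareExchange returns the previous value: null means 'fresh'
        // was published; non-null means another thread won the race.
        return Interlocked.CompareExchange(ref s_settings, fresh, null) ?? fresh;
    }

    public static long RecordHit()
    {
        // Atomic increment; returns the updated count.
        return Interlocked.Increment(ref s_hits);
    }
}
```

This is only appropriate when constructing a throwaway instance is cheap and free of side effects. If it is not, the lock-based pattern in (a) is the better fit.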
29. Avoid hand-coded spin waits. If you must use one, do it right.
Sometimes it is tempting to put a busy wait in very tightly synchronized regions of code. For instance, when one part of a two-part update is observed then you may know that the second part will be published imminently; instead of giving up the time-slice, it may look appealing to enter a while loop on an MP machine, continuously re-reading whatever state it is waiting to be updated, and then proceed once it sees it. Unless written properly, however, this technique won’t work well on single-CPU and Intel HyperThreaded systems. It’s often simpler to use locks or events (such as Monitor.Wait/Pulse/PulseAll) for this type of cross-thread communication. These employ some reasonable amount of spinning versus waiting for you.
Spin waits can actually improve scalability for profiled bottlenecks or when your scalability goals make it necessary. Note that this is NOT a complete replacement for a good spin lock. If you decide to use a spin wait, follow these guidelines. The worst type of spin wait is a ‘while (!cond) ;’ statement. A properly written wait must yield the thread in the case of a single-CPU system, or issue a Thread.SpinWait with some reasonably small argument (25 is a good starting point, tune from there) on every loop iteration otherwise. This last point ensures good perf on Intel HyperThreading. E.g.:
    {
        uint iters = 0;
        while (!cond)
        {
            if ((++iters % 50) == 0)
            {
                // Every so often we sleep with a 1ms timeout (see #30 for justification).
                Thread.Sleep(1);
            }
            else if (Environment.ProcessorCount == 1)
            {
                // On a single-CPU machine we yield the thread.
                Thread.Sleep(0);
            }
            else
            {
                // Issue YIELD instructions to let the other hardware thread move.
                Thread.SpinWait(25);
            }
        }
    }

The spin count of ‘25’ is fairly arbitrary and should be tuned on the architectures you care about. And you may want to consider backing off or adding some randomization to avoid regular contention patterns. Except for very specialized scenarios, most spin waits will have to fall back to waiting on an event after so many iterations. Remember, spinning is just a waste of CPU time if it happens too frequently or for too long, and can result in an angry customer. A hung app is generally preferable to a machine whose CPUs are spiked at 100% for minutes at a time.
30. When yielding the current thread’s time slice, use Thread.Sleep(1) (eventually).
Calling Thread.Sleep(0) doesn’t let lower priority threads run. If a user has lowered the priority of their thread and uses it to call your API, this can lead to nasty priority inversion situations. Eventually issuing a Thread.Sleep(1) is the best way to avoid this problem, perhaps starting with a 0-timeout and falling back to the 1ms-timeout after a few tries. Particularly if you come from a Win32 background, it might be tempting to P/Invoke to kernel32!SwitchToThread, since it is cheaper than issuing a kernel32!SleepEx (which is what Thread.Sleep does). This is because SleepEx is called in alertable mode, which incurs somewhat expensive checks for APCs. Unfortunately, P/Invoking to SwitchToThread bypasses important thread scheduling hooks that call out to a would-be host. Therefore, you should continue to use Thread.Sleep unless and until the .NET Framework offers an official Yield API.
31. Consider using spin-locks for high traffic leaf-level regions of code.
A spin-lock avoids giving up the time-slice on MP systems, and can lead to more scalable code when used correctly. Context switches in Windows are anything but cheap, ranging from 4,000 to 8,000 cycles on average, and even more on some popular CPU architectures. Giving up the time-slice also means that you’re possibly giving up data in the cache, depending on the data intensiveness of the work that is scheduled as a replacement on the processor. And any time you have cross thread causality, it can cause a rippling effect across many threads, effectively stalling a pipeline of parallel work. As usual, using a spin-lock should always be done in response to a measured problem, not to look clever to your friends.
32. You must understand every instruction executed while a spin lock is held.
Spin locks are powerful but very dangerous. You must ensure the time the lock is held is very small, and that the entire set of instructions run is completely under your control. Virtual method calls and blocking operations are completely out of the question. Because a spin-lock spins rather than blocking under contention, a deadlock will manifest as a spiked CPU and system-wide performance degradation, and therefore is a much more serious bug than a typical hang.
Whenever you use a spin lock you are making a bold statement about your code and thread interactions: it is more profitable for other contending threads to possibly waste CPU cycles than to wait and let other work make forward progress. If this statement turns out to be wrong, a large number of cycles will frequently get thrown away due to spinning, and the overall throughput of the system will suffer. On servers the result could be catastrophic and you may cost your customers money due to an impact to the achievable throughput. Each cycle you waste in a loop waiting for a spin lock to become available is one that could have been used to make forward progress in the app.
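For reference, a bare-bones spin lock can be sketched by combining an Interlocked test-and-set with the backoff from rule 29. SimpleSpinLock is a hypothetical name, and this is deliberately minimal: it is not reentrant, tracks no owner, and offers no timeout.

```csharp
using System;
using System.Threading;

public sealed class SimpleSpinLock
{
    private int m_taken; // 0 = free, 1 = held

    public void Enter()
    {
        uint iters = 0;
        // Try to flip 0 -> 1; if somebody else holds the lock, back off and retry.
        while (Interlocked.CompareExchange(ref m_taken, 1, 0) != 0)
        {
            if ((++iters % 50) == 0)
            {
                Thread.Sleep(1);      // let lower priority threads run (rule 30)
            }
            else if (Environment.ProcessorCount == 1)
            {
                Thread.Sleep(0);      // spinning is pointless on a single CPU
            }
            else
            {
                Thread.SpinWait(25);  // arbitrary starting point; tune it
            }
        }
    }

    public void Exit()
    {
        // Release with a full fence so protected writes are visible first.
        Interlocked.Exchange(ref m_taken, 0);
    }
}
```

The region between Enter and Exit should be a handful of instructions you fully control, typically wrapped in try/finally so the lock is never orphaned.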
33. Consider a low-lock data structure for hot queues and stacks.
Windows has a set of ‘S-List’ APIs that provide a way to do “lock free” pushes and pops from a stack data structure. This can lead to highly scalable, non-blocking algorithms, much in the same way that spin-locks do. Unfortunately it is not a simple matter to use ‘S-Lists’ from managed code, due to the requirements for memory pinning among other things. It’s very easy to build a lock-free stack out of CAS operations which is suitable for these situations. The algorithm goes something like this:
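    class LockFreeStack<T>
    {
        class Node
        {
            internal T m_obj;
            internal Node m_next;

            internal Node(T obj) { m_obj = obj; }
        }

        private Node m_head;

        public void Push(T obj)
        {
            Node n = new Node(obj);
            Node h;
            do
            {
                h = m_head;
                n.m_next = h;
            }
            while (Interlocked.CompareExchange(ref m_head, n, h) != h);
        }

        public T Pop()
        {
            Node h;
            Node nh;
            do
            {
                h = m_head;
                // Fail explicitly on an empty stack rather than dereferencing null.
                if (h == null) throw new InvalidOperationException("Stack is empty.");
                nh = h.m_next;
            }
            while (Interlocked.CompareExchange(ref m_head, nh, h) != h);
            return h.m_obj;
        }
    }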
This sample implementation carefully avoids the well-known ABA problem as a result of two things: (1) it assumes a GC, ensuring memory isn’t reclaimed and reused so long as at least one thread has a reference to it; and (2) we don’t make any attempt to pool nodes. A more efficient solution might pool nodes so that each Push doesn’t have to allocate one, but then would have to also implement an ABA-safe algorithm. This is typically done by widening the target of the CAS so that it can contain a version number in addition to the “next node” reference.
There are other permutations of this lock-free data structure pattern which can be useful. Lock-free queues can be built (see Michael, 96 for an example algorithm), permitting concurrent access to both ends of the data structure. All of the same caveats explained with the earlier lock free item apply.
34. Always use the CLR thread pool or IO Completion mechanisms to introduce concurrency.
The CLR’s thread pool is optimized to ensure good scalability across an entire process. If we end up with multiple custom pools in a process, they will compete for CPUs, over-create threads, and generally lead to scalability degradation. We already (unfortunately) have this situation with the OS’s thread pool competing with the CLR’s. We’d prefer not to have three or more. If you will be in the same process as the CLR, you should use our thread pool too. We’re doing a lot of work over the next couple releases to improve scalability and introduce new features—including substantially improved throughput (available in the last Orcas CTP)—so if it still doesn’t suit your purposes we would certainly like to hear from you.
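As a simple sketch of introducing concurrency through the CLR thread pool rather than hand-made threads, the following fans work out with ThreadPool.QueueUserWorkItem and waits for the last item to finish. PoolExample and SumOfSquares are hypothetical names used only for illustration.

```csharp
using System.Threading;

public static class PoolExample
{
    // Hypothetical helper: squares each input on the CLR thread pool
    // and returns the sum of the results.
    public static long SumOfSquares(int[] inputs)
    {
        if (inputs.Length == 0)
        {
            return 0;
        }

        long total = 0;
        int pending = inputs.Length;
        using (ManualResetEvent done = new ManualResetEvent(false))
        {
            foreach (int value in inputs)
            {
                ThreadPool.QueueUserWorkItem(delegate(object state)
                {
                    int v = (int)state;
                    Interlocked.Add(ref total, (long)v * v);
                    // The last work item to finish signals the waiter.
                    if (Interlocked.Decrement(ref pending) == 0)
                    {
                        done.Set();
                    }
                }, value);
            }
            done.WaitOne();
        }
        return total;
    }
}
```

The pool decides how many threads actually run these items, which is the point: the process-wide scheduler, not each library, controls the degree of concurrency.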
Blocking
35. Document latency expectations for your users.
We haven’t yet come up with a consistent way to describe the performance characteristics of managed APIs. When writing concurrent software, however, it’s frequently very important for developers to understand and reason about the performance of the dependencies they choose to take. This includes things like knowing the probability of blocking—and therefore whether to try and mask latency by transferring work to a separate thread, overlapping IO, and so on—as well as the compute and memory intensiveness of the internal operations. Please make an effort to document such things. Incremental and steady improvements are important in this area.
36. Use the Asynchronous Programming Model (APM) to supply async versions of blocking APIs.
Particularly if you are building a feature that mimics existing Win32 IO APIs or uses such APIs heavily, you should also consider exposing the built-in asynchronous nature of IO on Windows. For example, file and network IO is highly asynchronous in the OS; if you know your API will spend any measurable portion of its execution time blocked waiting for IO, the same customers who use asynchronous file IO APIs will want some way to turn that into asynchronous IO. The only way they can do that is if you go the extra step and provide an Asynchronous Programming Model (APM) version of your API.
Details on precisely how to implement the APM are available in Cwalina, 05. It involves adding ‘IAsyncResult BeginXX(params, AsyncCallback, object)’ and ‘rettype EndXX(IAsyncResult)’ APIs for your ‘rettype XX(params)’ method. As an example, consider System.IO.Stream:

    int Read(byte[] buffer, int offset, int count);
    IAsyncResult BeginRead(byte[] buffer, int offset, int count, AsyncCallback callback, object state);
    int EndRead(IAsyncResult asyncResult);
A good hard-and-fast rule is that if you use an API that offers an asynchronous version, then you too should offer an asynchronous version (and so on, recursively up the call stack).
This is very important to many app developers who need to tightly control the amount of concurrency on the machine. Having lots of IO happening asynchronously can permit operations to overlap in ways they couldn’t otherwise, therefore improving the throughput at which the work is retired. IO Completion Ports, for example, allow highly scalable asynchronous IO without having to introduce additional threads. There is simply no way to build a robust and scalable server program without them. If the library doesn’t expose this capability, customers are left with a clumsy design: they have to manually marshal work to a thread pool thread—or one of their own—to mask the latency, and then rendezvous with it later on. And this doesn’t work at all for massive numbers of in-flight IO requests. Or even worse, users are forced to create, maintain, and use their own incarnations of existing library APIs.
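The "recursively up the call stack" rule can be sketched with a thin wrapper over a Stream. RecordReader and its method names are hypothetical; the sketch simply forwards to the stream's own BeginRead/EndRead, which is the easy case. A library that does extra work around the IO would need its own IAsyncResult implementation, as described in Cwalina, 05.

```csharp
using System;
using System.IO;

public sealed class RecordReader
{
    private readonly Stream m_stream;

    public RecordReader(Stream stream)
    {
        m_stream = stream;
    }

    // Synchronous version: 'rettype XX(params)'.
    public int ReadRecord(byte[] buffer, int offset, int count)
    {
        return m_stream.Read(buffer, offset, count);
    }

    // Asynchronous pair: forward to the underlying stream so callers
    // can overlap the IO instead of burning a blocked thread.
    public IAsyncResult BeginReadRecord(byte[] buffer, int offset, int count,
                                        AsyncCallback callback, object state)
    {
        return m_stream.BeginRead(buffer, offset, count, callback, state);
    }

    public int EndReadRecord(IAsyncResult asyncResult)
    {
        return m_stream.EndRead(asyncResult);
    }
}
```

A caller can pass a callback that invokes EndReadRecord when the read completes, or call EndReadRecord directly and block only at the point where the data is actually needed.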
37. Always block using one of these existing APIs: lock acquisition, WaitHandle.WaitOne, WaitAny, WaitAll, Thread.Sleep, or Thread.Join.
The CLR doesn’t block in a straightforward manner. As noted earlier, we use blocking as an opportunity to run the message loop on STA threads. We also call out to the host to give it a chance to switch fibers or perform any other sort of book-keeping. This is required to ensure good CPU utilization and to achieve the goal of having all CPUs constantly busy on an MP machine, instead of the alternative of wasting potential execution time by letting threads block. P/Invoking or COM interoping to a blocking API completely bypasses this machinery, and we are then at the mercy of that API’s implementation. Aside from the thread switching problems, if this API blocks but doesn’t pump messages on an STA, for instance, we may end up in a cross-apartment deadlock, among other problems.
III. References
[Brumme, 03] Brumme, C. AppDomains (“application domains”). Blog article. June 2003.
[Cwalina, 05] Cwalina, K., Abrams, B. Framework design guidelines: Conventions, idioms, and patterns for reusable .NET libraries. Addison-Wesley Professional. September 2005.
[Duffy, 05] Duffy, J. Atomicity and asynchronous exceptions. Blog article. March 2005.
[Duffy, 06] Duffy, J. Broken variants of double-checked locking. Blog article. January 2006.
[Duffy, 06b] Duffy, J. No more hangs: Advanced techniques to avoid and detect deadlocks in .NET apps. MSDN Magazine. April 2006.
[Duffy, 06c] Duffy, J. Application responsiveness: Using concurrency to enhance user experiences. Dr. Dobb’s Journal. September 2006.
[Michael, 96] Michael, M., Scott, M. Simple, Fast, and Practical Non-blocking and Blocking Concurrent Queue Algorithms. PODC’96.
[Morrison, 05] Morrison, V. Understand the impact of low-lock techniques in multithreaded apps. MSDN Magazine. October 2005.
[Olukotun, 05] Olukotun, K., Hammond, L. The future of microprocessors. ACM Queue, vol. 3, no. 7. September 2005.
[Sutter, 05] Sutter, H., Larus, J. Software and the concurrency revolution. ACM Queue, vol. 3, no. 7. September 2005.
 Tuesday, October 17, 2006
The CLR's approach to monitor acquisition (i.e. Monitor.Enter and Monitor.Exit) during shutdown is very different from native CRITICAL_SECTIONs and mutexes (as described in my last post). In particular, the CLR does not ensure requests to acquire monitors on the shutdown path succeed, preferring instead to cope with the risk of deadlock rather than the risk of broken state invariants.
Managed code is run during orderly shutdowns in two places: the AppDomain.ProcessExit event and inside the Finalize method for all finalizable objects in the heap. (The term "orderly shutdown" is used to distinguish an Environment.Exit from a P/Invoke to kernel32!TerminateProcess, for instance.) Just as with the example described for native code, threads can be suspended while they hold arbitrary locks and have partially mutated state to the point where invariants do not hold any longer. Instead of permitting the shutdown code to observe this state--possibly causing corruption or unhandled exceptions on the finalizer thread--the CLR treats lock acquisitions as it normally does.
If a lock was orphaned in the process of stopping all running threads, then, the shutdown code path will fail to acquire the lock. If these acquisitions are done with non-timeout (or long timeout) acquires, a hang will ensue. To cope with this (and any other sort of hang that might happen), the CLR anoints a watchdog thread to keep an eye on the finalizer thread. Although configurable, by default the CLR will let finalizers run for 2 seconds before becoming impatient; if this timeout is exceeded, the finalizer thread is stopped, and shutdown continues without draining the rest of the finalizer queue.
This is typically not horrible since many finalizers are meant to clean up intra-process state that Windows will clean up automatically anyway. This covers things like file HANDLEs. But it does mean that any additional logic won't be run, like flushing file write-buffers. And for any cross-process state, you're screwed and had better have a fail-safe plan in place, like detecting corrupt machine-wide state and repairing upon the next program restart. (For what it's worth, DLL_PROCESS_DETACH notifications aren't run in all process exits either, so this really is not any worse than what you have with native code today.)
AppDomain unloads are very different beasts. Any reliability-critical code that will run as part of unload (CERs, critical finalizers, and generally any Cer.Success/Consistency.WillNotCorruptState methods) should strictly only ever acquire locks that are always dealt with in a reliable manner throughout the code-base. That statement is actually a little too strong. In reality, either (1) locks must never be orphaned (aside from process exit) or (2) the associated broken state invariants that will occur (e.g. in the face of asynchronous exceptions) can be tolerated.
Unfortunately, we don't give you access to Monitor.ReliableEnter (the BCL team gets to use it, though, as it's internal to mscorlib), which means almost nobody is equipped to do (1) today. It's impossible to write code that will reliably release a monitor in the face of possible asynchronous thread aborts and out of memory exceptions without it. Only a very tiny fraction of the BCL actually deals with locks in such a strictly reliable manner, so as a general rule of thumb very little of it actually acquires and releases locks while executing reliability-critical code, at least not without the risk of deadlock. Hosts will of course use policy to escalate to rude AppDomain unloads in the face of hangs, much like the CLR does by default for process exit.
(Note: Thanks to Jan Kotas--an SDE on the CLR team--for noticing that I confused AppDomain unloads with process exit in my last post, in addition to pointing out that appearances are deceiving: the multi-threaded CRT can actually suffer from the sort of shutdown problems outlined in the last post.)
 Saturday, October 14, 2006
When a Windows process shuts down, one of the very first things to happen is the killing of all but one thread. This sole remaining thread is then responsible for performing shutdown duties, both in kernel and in user mode, including executing the appropriate DLL_PROCESS_DETACH notifications for the DLLs loaded in the process. A great treatise on shutdown and the associated subtleties can be found on, of course, Chris Brumme’s weblog.
It’s entirely possible that at least one of those threads was executing under the protection of one or more critical sections when the shutdown was initiated. Since threads are killed in a fairly hostile manner (not like, say, asynchronous thread aborts which are at least a little less rude, even the so-called rude version of a thread abort), these critical sections will have been left in an acquired state. And any associated program state is apt to be left very inconsistent indeed. Worse, you might imagine that if the shutdown thread later needed to acquire one of those orphaned critical sections, the shutdown process would deadlock.
Although that’s intuitively what you may expect to occur, the OS actually does something a little funny during shutdown to avoid this problem. It effectively ignores calls to kernel32!EnterCriticalSection and kernel32!LeaveCriticalSection. A call to enter a CRITICAL_SECTION will first check to see if it's owned by another thread and, if it is, the section is automatically re-initialized before acquiring it. The result? If one of the previously killed threads, t0, held on to critical section A, for instance, and had partially modified some state protected by it just before the shutdown began, then the shutdown thread, t1, is permitted to freely “acquire” critical section A too, even though it was found as being officially owned by t0.
This means that code running during shutdown must tolerate any corrupt state that may have been left behind as a result. For obvious reasons, this is quite difficult. It's especially difficult if you write some code that somebody believes they can call during shutdown without you having gone through that thought exercise. The multi-threaded CRT uses locks internally for malloc/free, for instance, and reportedly cannot reliably tolerate process exit code-paths, which means you can't even safely rely on memory allocation and freeing during process exit without spurious AVs, heap corruption, and other bad things. Other services are obviously apt to suffer from similar problems, particularly if they consist of arbitrary application logic. You simply can't rely on invariant safe-points holding at lock boundaries when a shutdown is in progress.
Mutexes also enjoy this same "weakening" behavior, at least on Windows XP. This policy doesn’t, however, apply to waits on other kernel synchronization objects, like events and semaphores. If you rely on these during shutdown you’re just asking for a deadlock. Actually if you are regularly using any sort of synchronization in your DllMain—including acquiring critical sections and mutexes—you’re asking for loads of trouble. Shutdown callbacks run under the protection of the OS loader lock, demanding extreme care, but that’s another topic altogether.
Here is a sample VC++ program that shows off this behavior. We declare a bunch of code in the DllMain: process attach initializes a CRITICAL_SECTION and a mutex, and then detach attempts to acquire them. We then define an exported function, GetAndBlock, that acquires the synchronization objects and sleeps for a long time:
    #include <stdio.h>
    #include <windows.h>

    CRITICAL_SECTION g_cs;
    HANDLE g_mutex;

    BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD fdwReason, LPVOID lpReserved)
    {
        switch (fdwReason)
        {
            case DLL_PROCESS_ATTACH:
                InitializeCriticalSection(&g_cs);
                g_mutex = CreateMutex(NULL, FALSE, NULL);
                break;
            case DLL_PROCESS_DETACH:
                printf("%x: Acquiring g_cs during shutdown...", GetCurrentThreadId());
                EnterCriticalSection(&g_cs);
                printf("success.\r\n");

                printf("%x: Acquiring g_mutex during shutdown...", GetCurrentThreadId());
                WaitForSingleObject(g_mutex, INFINITE);
                printf("success.\r\n");

                DeleteCriticalSection(&g_cs);
                CloseHandle(g_mutex);
                break;
        }

        return TRUE;
    }

    __declspec(dllexport) DWORD WINAPI GetAndBlock(LPVOID lpParameter)
    {
        // Acquire the mutual exclusion locks.
        EnterCriticalSection(&g_cs);
        WaitForSingleObject(g_mutex, INFINITE);

        printf("%x: g_cs and g_mutex acquired.\r\n", GetCurrentThreadId());

        // And just wait for a little while...
        SleepEx(25000, TRUE);

        return 0;
    }
And finally we have an EXE that just invokes GetAndBlock and initiates a process shutdown on separate threads. The result is that the shutdown thread acquires the synchronization objects which the GetAndBlock thread currently has ownership of. Post Windows 95, the shutdown thread is always the thread that initiated the shutdown, whereas before that it was (seemingly) chosen at random; so when run on a modern OS at least, this sample is guaranteed to demonstrate the desired behavior:
    #include <windows.h>

    DWORD WINAPI GetAndBlock(LPVOID lpParameter);

    int main()
    {
        HANDLE hT1 = CreateThread(NULL, 0, &GetAndBlock, NULL, 0, NULL);
        SleepEx(100, TRUE);
        ExitProcess(0);
    }

The results of running are a little uneventful:

    C:\...>shutdown.exe
    664: g_cs and g_mutex acquired.
    d18: Acquiring g_cs during shutdown...success.
    d18: Acquiring g_mutex during shutdown...success.
As expected, no hangs occur. If you want to see what happens when a hang does happen, just replace CreateMutex with CreateEvent. It's not pretty.
Update 10/17/2006: Thanks to Jan Kotas for pointing out that the multi-threaded CRT is actually not safe from the sort of issues I talk about in this article. I wasn't able to get it to happen in a test program--one of the great things about repro'ing race conditions :)--but have fixed that part up.
 Tuesday, October 03, 2006
I am often confronted with the question of whether concurrency programming models that employ shared memory are evil. I was asked this question directly on the concurrency panel at JAOO’06 earlier this week, for instance, and STM makes a big bet that such models are tenable.
Without shared memory, it’s tempting to think that traditional concurrency problems go away, as if by magic. If no two pieces of code are simultaneously working on the same location in memory, for instance, there are (seemingly) no race conditions or deadlocks. Most people believe this, and it (on the surface) seems somewhat reasonable. Until you realize that it’s fundamentally flawed.
Shared memory systems are just an abstraction in which data can be named by its virtual memory address. In fact, one could argue that it’s an optimization—that the same sort of systems could be built by mapping virtual memory addresses (at a logical level) to some other location (at a physical level) using an algorithm that doesn't rely on page-tables, TLBs, and so on. Distributed RPC systems in the past have tried this very thing: to map object references to data residing on far-away nodes, and have mostly failed in the process. I’m not trying to convince you that alternative mapping techniques are a good thing, only that abstractly speaking at least, all of the same concurrency control problems will arise in systems that exhibit this fundamental property. Interestingly, shared memory systems have turned into tiny distributed systems with complex cache coherency logic anyway, so one has to wonder where the boundary between shared memory and message passing really lies.
There is a fundamental, undeniable law here:
Any system in which two concurrent agents can name the same piece of data must also exhibit the standard problems of concurrency: broken serializability, race conditions, deadlocks, livelocks, lost event notifications, and so on. Concurrency control is simply a requirement if correctness is desired.
So in reality, the real question at hand should be, would a system in which every concurrent agent operates on its own, completely isolated piece of data be more attractive? I personally think that’s farfetched and unrealistic. Systems with shared data need to have shared data; it's a property of the system being modeled. Even with isolated data, concurrency control would be required if, say, a central copy is rendezvoused with periodically (which, by the way, is the only way I can see such a system remaining correct). And then you have to wonder what copying buys you. It certainly costs you. Data locality is crucial to achieving adequate performance in most low-to-mid-level systems software. Yet copy-on-send message passing systems throw this out the window entirely. I refuse to believe that this will ever be the dominant model of fine-grained concurrency, at least on the current hardware architectures available from Intel and AMD. And certainly not without a whole lot more research and perhaps hardware support.
A distributed system in which many simultaneous clients might access the same piece of data on the server has all the same issues. AJAX systems, for instance, easily lull the author into a false sense of security. But, unfortunately(?), a transaction is a transaction, and if concurrency control isn’t in place, such systems are effectively executing without any isolation or serialization guarantees whatsoever—I just read an article in the latest DDJ where this was explained. I'm surprised a dedicated article actually needs to point this out: concurrent access to data under any other name is still concurrent access to data. And of course, once you start to employ concurrency control, you are susceptible to deadlocks and so on—unless you have a system that can transparently resolve them.
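To make that concrete, here's a minimal sketch (not taken from the DDJ article) of two clients that never share memory. Each one works on its own private copy of a server-side value and sends its result back as a message. The interleaving is scripted rather than threaded, so the outcome is deterministic. The names `Server`, `BlindWrite`, and `TryWrite`, and the version stamp itself, are hypothetical and exist only for illustration.

```csharp
using System;

// Hypothetical server holding one piece of data that several clients can name.
public class Server
{
    private int m_value;
    private int m_version;

    public Server(int initial) { m_value = initial; }

    public int Value { get { return m_value; } }

    // A client "message": hand back a private copy plus the version it came from.
    public int Read(out int version)
    {
        version = m_version;
        return m_value;
    }

    // No concurrency control: the last message to arrive wins.
    public void BlindWrite(int newValue)
    {
        m_value = newValue;
        m_version++;
    }

    // Optimistic concurrency control: reject writes based on a stale copy.
    public bool TryWrite(int newValue, int expectedVersion)
    {
        if (expectedVersion != m_version) return false;
        m_value = newValue;
        m_version++;
        return true;
    }
}

public static class LostUpdateDemo
{
    // Two clients each add 10, but both read before either writes.
    public static int RunWithoutControl()
    {
        Server s = new Server(100);
        int v1, v2;
        int copyA = s.Read(out v1);
        int copyB = s.Read(out v2);
        s.BlindWrite(copyA + 10);
        s.BlindWrite(copyB + 10); // Overwrites A's update.
        return s.Value;
    }

    // Same interleaving, but a rejected client re-reads and retries.
    public static int RunWithVersionCheck()
    {
        Server s = new Server(100);
        int v1, v2;
        int copyA = s.Read(out v1);
        int copyB = s.Read(out v2);
        s.TryWrite(copyA + 10, v1);         // Succeeds.
        while (!s.TryWrite(copyB + 10, v2)) // Fails once: B's copy is stale.
        {
            copyB = s.Read(out v2);
        }
        return s.Value;
    }

    public static void Main()
    {
        Console.WriteLine("No control:    {0}", RunWithoutControl());
        Console.WriteLine("Version check: {0}", RunWithVersionCheck());
    }
}
```

The first run ends at 110 instead of 120: one update is lost even though no memory was shared. The second run reaches 120, but only because a version check was added, and that retry loop is exactly the kind of concurrency control that can livelock under contention.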
Interesting research has been done recently by MSR on static verification to prove the absence of sharing (across processes), called Software Isolated Processes (SIPs), building on the type safe, verifiable subset of IL. STM of course also builds on top of the shared memory programming model; but, although threads can name the same location in memory, this is completely hidden, and concurrency control is still employed in the implementation where necessary. I believe these systems are promising. They also have the benefit of building on the same foundational memory performance equations that software developers are used to relying on today.
 Friday, September 22, 2006
An article I wrote (seemingly ages ago) just appeared in the September issue of Dr. Dobb's journal:
Application Responsiveness: Using concurrency to enhance user experiences Thanks to recent innovation in both hardware graphics processors and client-side development frameworks, GUIs for Windows applications have become more and more visually stunning over time. But throughout the evolution of such frameworks, one problem hasn't gone away—poor responsiveness. Studies show that positive user experiences are linked to application responsiveness and, conversely, that frustrating experiences are often caused by poor responsiveness. More often than not, this lack of responsiveness is due to a series of subtle (and sometimes accidental) design choices made during development. In this article, I examine the root of the responsiveness problem, and then suggest some best practices for eliminating it.
My article only touches on some important issues that are described in detail elsewhere. Here are the references I used:
- D. Duis, J. Johnson. Improving User Interface Responsiveness Despite Performance Limitations. Proc. IEEE Computer Society Intl. Conference. February 1990.
- J. Duffy. No More Hangs: Techniques for Avoiding and Detecting Deadlocks. MSDN Magazine. April 2006.
- G. H. Forman. Obtaining Responsiveness in Resource-Variable Environments. PhD Dissertation, University of Washington. 1998.
- I. Griffiths. Windows Forms: Give Your .NET-based Applications a Fast and Responsive UI with Multiple Threads. MSDN Magazine. February 2003.
- N. Kramer. Threading Models (Windows Presentation Foundation). Weblog essay. June 2005.
- G. Maffeo, P. Silwowicz. Win32 I/O Cancellation in Windows Vista. MSDN. September 2005.
- V. Morrison. Concurrency: What Every Dev Must Know About Multithreaded Apps. MSDN Magazine. August 2005.
- M. E. Russinovich, D. A. Solomon. Microsoft Windows Internals. ISBN 0-735-61917-4, MS Press. December 2004.
- C. Sells. Safe, Simple Multithreading in Windows Forms, Part 1. MSDN. June 2002.
- C. Sells, I. Griffiths. Programming Windows Presentation Foundation. ISBN 0-596-10113-9, O'Reilly. September 2005.
Thanks go to Jeff Richter, Nick Kramer, Alessandro Catorcini [1], and Vance Morrison for reviewing early drafts. Enjoy.
[1] Alessandro, man, you need a blog! ;)
 Wednesday, September 13, 2006
LINQ coaxes developers into writing declarative queries that specify what is to be computed instead of how to compute the results. This is in contrast to the lion's share of imperative programs written today, which are huge rat nests of for-loops, switch statements, and function calls. The result of this new direction? Computationally intensive filters, projections, reductions, sorts, and joins can be evaluated in parallel... transparently... with little-to-no extra input from the developer. The more data the better.
If you buy the hypothesis--still unproven--that developers will write large swaths of code using LINQ, then by inference, they will now also be writing large swaths of implicitly data parallel code. This, my friends, is very good for taking advantage of multi-core processors.
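As a rough illustration of what "implicitly data parallel" can look like, the sketch below uses the `AsParallel` operator from `System.Linq` as it later shipped in .NET 4, which may well differ from the prototype described in this post. The `IsPrime` helper and the input range are made up for the example.

```csharp
using System;
using System.Linq;

public static class ParallelQueryDemo
{
    // Deliberately naive, CPU-bound predicate (an illustrative stand-in).
    public static bool IsPrime(int n)
    {
        if (n < 2) return false;
        for (int i = 2; (long)i * i <= n; i++)
        {
            if (n % i == 0) return false;
        }
        return true;
    }

    // Declarative query: says what to compute, not how to loop.
    public static int CountPrimesSequential(int limit)
    {
        return Enumerable.Range(0, limit).Where(IsPrime).Count();
    }

    // Same query; AsParallel lets the engine partition the work across cores.
    public static int CountPrimesParallel(int limit)
    {
        return Enumerable.Range(0, limit).AsParallel().Where(IsPrime).Count();
    }

    public static void Main()
    {
        Console.WriteLine(CountPrimesSequential(100000));
        Console.WriteLine(CountPrimesParallel(100000));
    }
}
```

The only difference between the two queries is the `AsParallel()` call; the filter and the reduction are written the same way. Whether the parallel version is actually faster depends on the amount of data and the cost of the predicate.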
If you want to get a little glimpse of what I've been spending my time working on, check out these (brief) stories about Parallel LINQ (aka PLINQ), a parallel query execution engine for LINQ:
We've spent many, many months now cranking out a fully functional prototype. The numbers were impressive enough to catch the eye of some key people around the company. And the rest is history... (well, not quite yet...)
I'll no doubt be disclosing more about this in the coming weeks.
(Note: I am in no way committing to any sort of product or release timeframe. This technology is quite early in the lifecycle, and, while unlikely, might never actually make the light of day... Label this puppy as "research" for now.)
 Friday, September 01, 2006
Tim Harris, a Microsoft colleague I've had the pleasure to work a lot with lately, joined Simon Peyton Jones, of Glasgow Haskell fame, to do a Channel9 interview on Software Transactional Memory (STM). I encourage you to check it out.
Update: I didn't say this before, but I would love any feedback about this technology. What if you had this in C# today? (Do you want it in C# today?) Would it make your life simpler? What are the major challenges you'd encounter if you were to start using it in your programs and libraries? What are the major benefits? Feel free to leave comments (either here or in the Channel9 post) or send me email directly at joedu AT microsoft DOT com.
 Tuesday, August 22, 2006
A common technique to avoid giving up your time-slice on multi-CPU machines is to use a hand-coded spin wait. This is appropriate when the cost of a context switch (4,000+ cycles) and ensuing cache effects are more expensive than the possibly wasted cycles used for spinning, which is to say not terribly often. When used properly, however, very little time is spent spinning, and the spin wait is only ever invoked rarely when very specific cross-thread state is seen, such as lock-free code observing a partial update. There are some best practices that must be followed when writing such a spin wait to guarantee good behavior across different machine configurations, i.e. HT, single-CPU, and multi-CPU systems.
A correct wait must issue a yield/pause instruction on each loop iteration to work well on Intel HT machines:
while (!cond) { Thread.SpinWait(20); }
Many implementations should also fall back to a more expensive wait on, say, a Windows event or CLR monitor after spinning a while. This handles the worst case situation in which the thread that is destined to make 'cond' true is not making forward progress as quickly as you'd hoped. A complementary and alternative technique is to simply give up the time-slice in such cases using the Thread.Sleep API:
uint loops = 0;
while (!cond)
{
    if ((++loops % 100) == 0)
    {
        Thread.Sleep(0);
    }
    else
    {
        Thread.SpinWait(20);
    }
}
This approach ensures that, if the machine is saturated, the spin wait doesn't prevent the thread which will set the event from being scheduled and making forward progress.
All of this is pure nonsense and ludicrousness on single-CPU machines. If you're waiting for another thread to set an event... well... it clearly isn't going to do that if you're actively using the one and only CPU to waste cycles spinning! Therefore a natural extension to the above approach is to check for a single-CPU machine and respond by yielding to another thread:
uint loops = 0;
while (!cond)
{
    if (Environment.ProcessorCount == 1 || (++loops % 100) == 0)
    {
        Thread.Sleep(0);
    }
    else
    {
        Thread.SpinWait(20);
    }
}
OK, this is looking rather nice now. But wait. There's a subtle but nasty problem lurking here.
Sleep(0) actually only gives up the current thread's time-slice if a thread at equal priority is ready to run. Don't believe me? Check out the MSDN docs. If you're writing a reusable API that will be called by a user app, they might decide to drop a few of their threads' priorities. Messing with priorities is actually a very dangerous practice, and this is only one illustration of what can go wrong. (Other illustrations are topics for another day.) In summary, plenty of people do it and so reusable libraries need to be somewhat resilient to it; otherwise, we get bugs from customers who have some valid scenario for swapping around priorities, and then we as library developers end up fixing them in service packs. It's less costly to write the right code in the first place.
Here's the problem. If somebody begins the work that will make 'cond' true on a lower priority thread (the producer), and then the timing of the program is such that the higher priority thread that issues this spinning (the consumer) gets scheduled, the consumer will starve the producer completely. This is a classic race. And even though there's an explicit Sleep in there, issuing it doesn't allow the producer to be scheduled because it's at a lower priority. The consumer will just spin forever and unless a free CPU opens up, the producer will never produce. Oops!
You can solve this problem by changing the Sleep to use a parameter of 1:
uint loops = 0;
while (!cond)
{
    if (Environment.ProcessorCount == 1 || (++loops % 100) == 0)
    {
        Thread.Sleep(1);
    }
    else
    {
        Thread.SpinWait(20);
    }
}
This fixes the problem, albeit with the disadvantage that the thread is unconditionally removed from the scheduler temporarily. (We also call SleepEx with an alertable flag which is more expensive due to APC checks, but I digress.) It's unfortunate that a quick 5 minute audit turns up plenty of Sleep(0)'s in the .NET Framework. I hope to get an FxCop rule created to catch this.
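If you find yourself writing this loop in more than one place, it can be packaged up. Here's a minimal sketch of such a helper; `SpinHelper`, `SpinUntil`, and the `yieldEvery` parameter are hypothetical names, and the condition is passed as a delegate purely for convenience (which adds a small call overhead to every iteration).

```csharp
using System;
using System.Threading;

public static class SpinHelper
{
    // Spins until 'cond' returns true, yielding with Sleep(1) every
    // 'yieldEvery' iterations (or on every iteration on a single-CPU machine).
    // Returns the number of loop iterations, for diagnostics only.
    public static uint SpinUntil(Func<bool> cond, uint yieldEvery)
    {
        if (cond == null) throw new ArgumentNullException("cond");
        if (yieldEvery == 0) throw new ArgumentOutOfRangeException("yieldEvery");

        bool singleCpu = (Environment.ProcessorCount == 1);
        uint loops = 0;
        while (!cond())
        {
            loops++;
            if (singleCpu || (loops % yieldEvery) == 0)
            {
                Thread.Sleep(1); // Lets lower-priority threads run too.
            }
            else
            {
                Thread.SpinWait(20);
            }
        }
        return loops;
    }
}

public static class SpinHelperDemo
{
    private static volatile bool s_ready;

    // Starts a low-priority producer and spins until it has set the flag.
    public static bool Run()
    {
        s_ready = false;
        Thread producer = new Thread(delegate()
        {
            Thread.CurrentThread.Priority = ThreadPriority.BelowNormal;
            s_ready = true;
        });
        producer.Start();
        SpinHelper.SpinUntil(delegate() { return s_ready; }, 100);
        producer.Join();
        return s_ready;
    }

    public static void Main()
    {
        Console.WriteLine("Producer observed: {0}", Run());
    }
}
```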
The kernel32!SwitchToThread API doesn't exhibit the problems that Sleep(0) and Sleep(1) do. Unfortunately, you can't reliably get at it from managed code. You can P/Invoke, but it's actually dangerous to do if you end up running in a host. We've overridden thread yielding behavior on the CLR such that we can call out to a host for notification purposes. This was used primarily for fiber mode in SQL Server (which was cut), so that it could use this as an opportunity to switch fibers, but other hosts are free to do what they please. If you don't care about working in a host, then feel free to do this, but please document it clearly and use the following HPA signature so people don't use your type incorrectly unknowingly:
[DllImport("kernel32.dll"), HostProtection(SecurityAction.LinkDemand, ExternalThreading=true)]
static extern bool SwitchToThread();
We're looking at adding a Thread.Yield API in the next rev of the CLR that does this in a host-friendly way. For now, you'll have to rely on Sleep(1).
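For readers on a later runtime: a `Thread.Yield` method did ship in .NET Framework 4. The sketch below shows how the loop might look with it. It assumes .NET 4 or later, and falling back to Sleep(1) when the yield reports that nothing else ran is a choice made for this sketch, not an official recommendation. `YieldingSpin` is a hypothetical name.

```csharp
using System;
using System.Threading;

public static class YieldingSpin
{
    // Assumes .NET Framework 4 or later, where Thread.Yield is available.
    public static void SpinUntil(Func<bool> cond)
    {
        uint loops = 0;
        while (!cond())
        {
            loops++;
            if (Environment.ProcessorCount == 1 || (loops % 100) == 0)
            {
                // Thread.Yield returns false when the OS did not switch to
                // another thread; fall back to Sleep(1) in that case.
                if (!Thread.Yield())
                {
                    Thread.Sleep(1);
                }
            }
            else
            {
                Thread.SpinWait(20);
            }
        }
    }

    private static volatile bool s_done;

    // Spins until a second thread has set the flag.
    public static bool Demo()
    {
        s_done = false;
        Thread t = new Thread(delegate() { s_done = true; });
        t.Start();
        SpinUntil(delegate() { return s_done; });
        t.Join();
        return s_done;
    }

    public static void Main()
    {
        Console.WriteLine(Demo());
    }
}
```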
Thankfully, the starvation problem is not quite *that* bad. The Windows scheduler combats this problem. It uses a balance set manager: a system daemon thread whose responsibility it is to wake up once a second to check for threads that are being starved because of a lower priority than other runnable threads. The goal of this service is to prevent CPU starvation and to minimize the impact of priority inversion. If any threads are found by the balance set manager which have been starved for ~3-4 seconds, those starved threads enjoy a temporary priority boost to priority 15 ("time critical"), virtually ensuring the thread will be scheduled. (Although this won't strictly guarantee it: if your other threads have real-time priorities, i.e. >15, then starvation will continue indefinitely... you're playing with dynamite once you enter that realm.) And once the thread does get scheduled, it also enjoys a quantum boost: its next quantum is stretched to 2x its normal time on client SKUs, and 4x its normal time on server SKUs. The priority decays as each quantum passes, continuing until the thread reaches its original lower priority.
In our example above when Sleep(0) is used, we hope this will unstick the machine and let the producer produce and, finally, the consumer consume. Indeed with some testing, we see it unstick after a little more than 3 seconds. This is still long enough, however, to kill performance on a server application, cause a noticeable perf degradation on the client, and destroy responsiveness in a GUI app. Here's a simple test that exposes the problem (on a single-CPU machine):
using System;
using System.Diagnostics;
using System.Threading;

class Program
{
    public static volatile int x = 0;

    public static void Main()
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();

        SpawnWork();
        while (x == 0) { Thread.Sleep(0); }

        sw.Stop();
        Console.WriteLine("Sleep(0) = {0}", sw.Elapsed);

        x = 0;

        sw.Reset();
        sw.Start();

        SpawnWork();
        while (x == 0) { Thread.Sleep(1); }

        sw.Stop();
        Console.WriteLine("Sleep(1) = {0}", sw.Elapsed);
    }

    private static void SpawnWork()
    {
        ThreadPool.QueueUserWorkItem(delegate
        {
            Thread.CurrentThread.Priority = ThreadPriority.BelowNormal;
            x = 1;
        });
    }
}
And here's some example output which is quite consistent from run to run:
Sleep(0) = 00:00:03.8225238
Sleep(1) = 00:00:00.0000678
As we can see, in the case of Sleep(0), the balance set manager stepped in and boosted our producer thread after ~3-4 seconds as promised. We avoid the problem altogether with Sleep(1).
The moral of the story? Priorities are evil, don't mess with them. Always use Sleep(1) instead of Sleep(0). The Windows balance set manager is cool.
 Monday, August 21, 2006
[Update - 8/22/06 - fixed typos and paid homage to VSTS 2005's code analysis which checks for this problem.]
From the department of Spolsky's Law of Leaky Abstractions, we turn today to accidental lock conflicts across AppDomain boundaries.
The CLR supports various cross-AppDomain marshaling mechanisms, one of which is known by the lovely name of marshal-by-bleed. This simply means that pointers from multiple AppDomains actually refer to the same location in memory. Most of the time some form of marshaling is required for objects so that we can safely isolate separate AppDomains from one another.
In managed code, you can lock on any object through the Monitor type, exposed in C# and VB via the 'lock' and 'SyncLock' keywords, respectively. The implementation of Monitor.Enter/Exit uses space in the object header and/or the object's sync-block to record exclusive ownership of the lock. The fact that objects typically don't bleed across AppDomains is a GoodThing(tm), as this is how add-ins, SQL Server, and other hosts isolate failures between components. When writing code, we typically assume state in one AppDomain can't corrupt state in another, totally independent, AppDomain.
Unfortunately, domain neutral Type objects (as well as other Reflection types, e.g. XXInfos) are actually shared across all AppDomains in the process. They are marshal-by-bleed. Strings also fall into this camp. A string argument to a remoted MarshalByRefObject method invocation may be bled, as can be any process-wide interned string literal. The System.Threading.Thread object (called the thread-base-object, aka TBO, internally) also bleeds across AppDomains. What a bloody mess! (Ha ha.)
So why does this all matter?!
Recall that lock owner information is tied to the instance. If you use any of these things as a target of Monitor.Enter, code running in one AppDomain can actually interfere with code in another AppDomain. That's because they are using the same object and thus the same lock information underneath. What a lousy abstraction--this was never meant to leak through! And it can cause trouble too. If one AppDomain orphans the lock (forgets to release it), it may cause deadlocks in other AppDomains. Even sans deadlocks, this fact can simply yield false conflicts, which can subsequently negatively impact scalability.
For example, consider this code:
lock (typeof(object)) { ... }
Code in AppDomain A uses the same Type object to represent 'typeof(object)' as code in AppDomain B. Therefore they share lock information.
If we run such code from multiple AppDomains, the code yields a conflict:
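// A named event is used because the two AppDomains can't share the event object itself.
EventWaitHandle wh = new EventWaitHandle(false, EventResetMode.ManualReset, "XXX");
lock (typeof(object))
{
    AppDomain ad2 = AppDomain.CreateDomain("2");
    ad2.DoCallBack(delegate
    {
        ThreadPool.QueueUserWorkItem(delegate
        {
            EventWaitHandle wh2 = new EventWaitHandle(false, EventResetMode.ManualReset, "XXX");
            lock (typeof(object)) { wh2.Set(); }
        });
    });
    wh.WaitOne();
}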
If one AppDomain is waiting for a synchronization event from another--as in this example--this can actually yield a deadlock. If you replaced the lock statements in this example with, say, lock ("Foo") { ... }, you'll see the same result due to string literal interning.
Clearly this is a nasty problem, especially if Framework code were to use such patterns. This is one reason you'll notice we strongly discourage locking on Type objects. Even if you're not in mscorlib (by default the only domain neutral assembly), your type can be loaded domain neutral based on hosting policy, among other things. And therefore you may not even catch said bugs during testing.
Note that MarshalByRefObjects aren't subject to these problems. Although operations in one AppDomain can refer to the same instance in another, these accesses go through a proxy. Locking on the proxy is different than locking on the raw underlying object, and thus there are no false conflicts.
The guidance against locking on objects that bleed is enforced by the DoNotLockOnObjectsWithWeakIdentity code analysis rule in VSTS 2005.
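One way to stay clear of all of this is to lock only on an object that nothing outside your type can reach, and that is never a Type, a string, or a Thread. Here's a minimal sketch of that pattern; the `Counter` type is hypothetical and exists only to give the lock something to protect.

```csharp
using System;
using System.Threading;

public class Counter
{
    // Private, per-instance, and not a Type, string, or Thread object, so it
    // can't bleed across AppDomains or be locked by unrelated code.
    private readonly object m_lock = new object();
    private int m_count;

    public void Increment()
    {
        lock (m_lock)
        {
            m_count++;
        }
    }

    public int Count
    {
        get { lock (m_lock) { return m_count; } }
    }
}

public static class CounterDemo
{
    // Hammers one Counter from several threads and returns the final count.
    public static int Run(int threadCount, int incrementsPerThread)
    {
        Counter c = new Counter();
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(delegate()
            {
                for (int j = 0; j < incrementsPerThread; j++) c.Increment();
            });
            threads[i].Start();
        }
        foreach (Thread t in threads) t.Join();
        return c.Count;
    }

    public static void Main()
    {
        Console.WriteLine(Run(4, 100000));
    }
}
```

For static state, a `private static readonly object` field plays the same role, since static fields are allocated per AppDomain.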
If all of this is making you feel rather queasy, fear not. We have a weekly "CLR Foundations" meeting where a large portion of the CLR Team meets to discuss the history of the CLR and .NET Framework. A couple weeks back this topic came up in passing. Most people on the team were quite surprised, and many even seemed to be enshrouded in disbelief. At least we can recognize a mistake after it's been made. ;)
 Sunday, August 13, 2006
I've run across several algorithms lately that have benefited from the use of a Bloom filter. This led me to dig up the original paper (Space/time trade-offs in hash coding with allowable errors) in which the idea was proposed. What surprised me a little was that this technique was invented 35 years ago, by a fellow by the name of Burton Bloom, and yet it remains a simple and effective way to speed up a certain category of modern software problems.
The technique summarizes the contents of a set into a concise value (the filter) from which quick answers may be computed. This value is typically generated using some form of hash-coding of the set elements and stored within a bitmap, enabling ultra-fast bitwise updates and queries. This value promises to never return a false negative, although it is permitted to return false positives. So long as the false positive rate is low, many queries that would ordinarily have come up empty handed after a lengthy search will be retired in constant time. As an element is added to the set, the value is updated to reflect its presence. A tricky part of this technique, however, is that, because the value often summarizes the elements using one-bit-to-many-elements, removing an element from the set typically cannot just reset the corresponding bit. As items are removed over time, this can lead to a stale value, an increasing rate of false positives, and a higher number of lengthy searches. Thus the value must be periodically recomputed to combat this problem.
Let's take an example. Say you had a balanced binary tree that was frequently searched, infrequently deleted from, and whose searches more often lead to misses than they do hits. You can expect O(log n) search time. That's not bad, but for large values of n you may have incentive to speed up the common case, which happens to be the worst case for a perfectly balanced binary tree: a search that turns up empty. Using a Bloom filter gives you a slightly modified algorithm whose search complexity is Ω(1) for our problem's best case, and still O(log n) otherwise. This also adds some notable cost due to the need to periodically recompute the filter value, which is an O(n) operation, but since we infrequently delete items, we expect this to pay off in spades.
Say we already had a BinaryTree<T> data structure. We could easily adopt a Bloom filter with a series of minor modifications.
First, a new field to remember the filter's value. I've chosen a 64-bit bitmap for illustration:
long filterValue = 0;
And since we've used a bitmap, we need a routine to calculate any arbitrary element's single bit position. Remember, false positives are OK, and therefore multiple elements may share the same bit:
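long GetFilterValueForElement(T e)
{
    // Mask off the sign bit so the modulus is never negative, and shift a
    // long (1L) so that all 64 bit positions are actually reachable.
    int bit = (e.GetHashCode() & 0x7FFFFFFF) % (sizeof(long) * 8);
    return 1L << bit;
}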
When adding an item to the set, we must change the filter's value:
filterValue |= GetFilterValueForElement(e);
Here's the beneficial change. While searching for an item, we can add a quick check up front to speed up queries:
if ((filterValue & GetFilterValueForElement(e)) == 0)
{
    // Element not found: we can be assured this is correct.
}
// Perform the existing, lengthier search. Might be a false positive.
And of course, we must periodically do the O(n) operation on the tree to recompute the filter. We might do this whenever the tree becomes empty and every n deletions, for example:
if (isEmpty || (++deletionCount % recomputeFilterPeriod) == 0)
{
    filterValue = 0;
    foreach (T e in this)
    {
        filterValue |= GetFilterValueForElement(e);
    }
}
You could even keep track of the number of false positives seen, and use that to determine when to initiate the recomputation.
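Pulling the fragments above together, here's a minimal, self-contained sketch of a filtered set. It makes a few assumptions: `SortedSet<T>` (a balanced tree available in .NET 4 and later) stands in for the `BinaryTree<T>` mentioned earlier, `FilteredSet<T>` and `FilterHits` are hypothetical names, and elements are assumed to be non-null.

```csharp
using System;
using System.Collections.Generic;

public class FilteredSet<T>
{
    private readonly SortedSet<T> m_tree = new SortedSet<T>();
    private readonly int m_recomputePeriod;
    private long m_filterValue;
    private int m_deletionCount;

    public FilteredSet(int recomputePeriod)
    {
        if (recomputePeriod <= 0) throw new ArgumentOutOfRangeException("recomputePeriod");
        m_recomputePeriod = recomputePeriod;
    }

    // Number of lookups answered by the filter alone (no tree search).
    public int FilterHits { get; private set; }

    private static long GetFilterValueForElement(T e)
    {
        // Mask off the sign bit so the modulus is never negative.
        int bit = (e.GetHashCode() & 0x7FFFFFFF) % 64;
        return 1L << bit;
    }

    public bool Add(T e)
    {
        m_filterValue |= GetFilterValueForElement(e);
        return m_tree.Add(e);
    }

    public bool Contains(T e)
    {
        if ((m_filterValue & GetFilterValueForElement(e)) == 0)
        {
            FilterHits++;
            return false; // Definitely absent: the filter has no false negatives.
        }
        return m_tree.Contains(e); // May still miss: that is a false positive.
    }

    public bool Remove(T e)
    {
        if (!m_tree.Remove(e)) return false;

        // A bit may be shared by several elements, so we can't just clear it.
        // Recompute when the set empties and every m_recomputePeriod deletions.
        if (m_tree.Count == 0 || (++m_deletionCount % m_recomputePeriod) == 0)
        {
            m_filterValue = 0;
            foreach (T x in m_tree) m_filterValue |= GetFilterValueForElement(x);
        }
        return true;
    }
}

public static class FilteredSetDemo
{
    public static void Main()
    {
        FilteredSet<int> set = new FilteredSet<int>(8);
        for (int i = 0; i < 10; i++) set.Add(i * 3);
        Console.WriteLine(set.Contains(9));
        Console.WriteLine(set.Contains(10));
        Console.WriteLine("Answered by filter alone: {0}", set.FilterHits);
    }
}
```

Counting `FilterHits` against total misses gives a cheap way to see whether the filter is still earning its keep, which is the same signal you would use to trigger a recomputation.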
This technique clearly won't work well in data structures and problems in which the deletion-to-query ratio is high, but there are plenty of situations where this technique helps tremendously. Logs of various forms, for example, exhibit the property that they are added to but never deleted. Depending on the density of the data structure, you may want to use an array instead of a bitmap--or a bitmap with more bits--to represent the summary. I ran a quick test and the simple bit-shifting hash-coding mechanism above yielded a full filter (~0) after only 227 random object allocations. You can of course even decide to reduce the density of your bitmap dynamically by detecting a close-to-full map and upgrading to a larger filter data structure. A dense filter often corresponds to a higher false positives rate, in which case you simply waste time maintaining and checking a value that doesn't give you any benefit.
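As a rough sketch of the "bitmap with more bits" option, the class below spreads the summary across an array of 64-bit words and reports how full it is, which is the signal you could use to decide when to upgrade to a larger filter. `WideFilter`, `MightContain`, `FillRatio`, and the sizes used in `Main` are all assumptions made for this example.

```csharp
using System;

public class WideFilter
{
    private readonly ulong[] m_words;
    private readonly int m_bitCount;

    public WideFilter(int wordCount)
    {
        if (wordCount <= 0) throw new ArgumentOutOfRangeException("wordCount");
        m_words = new ulong[wordCount];
        m_bitCount = wordCount * 64;
    }

    private int GetBitIndex(object e)
    {
        // Mask off the sign bit so the modulus is never negative.
        return (e.GetHashCode() & 0x7FFFFFFF) % m_bitCount;
    }

    public void Add(object e)
    {
        int bit = GetBitIndex(e);
        m_words[bit / 64] |= 1UL << (bit % 64);
    }

    // False means definitely absent; true means "maybe present".
    public bool MightContain(object e)
    {
        int bit = GetBitIndex(e);
        return (m_words[bit / 64] & (1UL << (bit % 64))) != 0;
    }

    // Fraction of bits set: the closer to 1.0, the less the filter helps.
    public double FillRatio
    {
        get
        {
            int set = 0;
            foreach (ulong w in m_words)
            {
                for (ulong v = w; v != 0; v &= v - 1) set++; // Clears the lowest set bit.
            }
            return (double)set / m_bitCount;
        }
    }
}

public static class WideFilterDemo
{
    // Adds 'count' fresh objects and returns the resulting fill ratio.
    public static double FillAfter(int wordCount, int count)
    {
        WideFilter f = new WideFilter(wordCount);
        for (int i = 0; i < count; i++) f.Add(new object());
        return f.FillRatio;
    }

    public static void Main()
    {
        Console.WriteLine("1 word,   227 objects: {0:P0} full", FillAfter(1, 227));
        Console.WriteLine("16 words, 227 objects: {0:P0} full", FillAfter(16, 227));
    }
}
```

The exact percentages depend on the hash codes the runtime hands out, so they will vary from run to run; the point is only that the wider filter cannot be more than 227/1024 full after the same number of insertions.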
 Sunday, August 06, 2006
This falls into the "fun hacks" category, meaning the result is not necessarily something you'd want to use in your everyday life. To go a step further, I strongly recommend you don't use the code shown here as-is; read the summary at the end for some rationale behind that statement. Enough with the disclaimer. On with the show...
Some requirements for our cross-process RWLock
Imagine you had the need for:
1. A managed reader/writer lock
2. that runs on down-level (pre-Vista) operating systems,
3. and that optionally works across process boundaries and AppDomains.
What do we already have that might fit the bill? The existing ReaderWriterLock type in the System.Threading namespace works fine for 1 and 2, but not 3. I suppose you could share it across AppDomains--or even processes--with some form of messaging scheme, but let's ignore that for a moment. It's a little *too* clever for my taste. Windows Vista of course comes with a new slim reader/writer lock. It's a close cousin of the Win32 CRITICAL_SECTION, and can be used for cross-AppDomain synchronization. Unfortunately, you have to P/Invoke to get at it from managed code, it won't run on pre-Vista operating systems, and doesn't work for cross-process scenarios.
Let's build it out of duct tape and barbed wire
It turns out that you can build a type that meets our requirements out of existing Windows kernel objects. All it takes is a little imagination. Here's what you need:
1. A semaphore to count the number of readers inside the lock.
2. A mutex to ensure only one writer can be in the lock at a time.
3. (Optionally) a manual-reset event used by writers to ensure no new readers enter the lock while it waits.
The scaffolding for one such implementation--which I will call the IpcReaderWriterLock--is as follows:
public class IpcReaderWriterLock : IDisposable
{
    /** Fields **/

    private const int DEFAULT_MAX_READER_COUNT = 25;
    private const string NAME_PREFIX = @"IpcRWL#";

    private readonly int m_maxReaderCount;
    private Semaphore m_readerSemaphore;
    private EventWaitHandle m_blockReadsEvent;
    private Mutex m_writerMutex;
    private int m_writerRecursionCount;

    /** Constructors **/

    public IpcReaderWriterLock() : this(null, DEFAULT_MAX_READER_COUNT) { }

    public IpcReaderWriterLock(string name) : this(name, DEFAULT_MAX_READER_COUNT) { }

    public IpcReaderWriterLock(int maxReaderCount) : this(null, maxReaderCount) { }

    public IpcReaderWriterLock(string name, int maxReaderCount)
    {
        m_maxReaderCount = maxReaderCount;

        string blockReadsEventName = null;
        string writerMutexName = null;
        string readerSemaphoreName = null;

        if (name != null)
        {
            blockReadsEventName = string.Format("{0}{1}#{2}", NAME_PREFIX, name, "RdEv");
            writerMutexName = string.Format("{0}{1}#{2}", NAME_PREFIX, name, "WrMtx");
            readerSemaphoreName = string.Format("{0}{1}#{2}", NAME_PREFIX, name, "RdSem");
        }

        m_blockReadsEvent = new EventWaitHandle(true, EventResetMode.ManualReset, blockReadsEventName);
        m_writerMutex = new Mutex(false, writerMutexName);
        m_readerSemaphore = new Semaphore(maxReaderCount, maxReaderCount, readerSemaphoreName);
    }

    /** Methods **/

    public void Dispose()
    {
        // Just close all of the kernel objects we opened during construction.
        // Note: this method is not thread-safe. If threads race with
        // one another to call Dispose, some nasty bugs will arise.
        if (m_blockReadsEvent != null)
        {
            m_blockReadsEvent.Close();
            m_blockReadsEvent = null;
        }

        if (m_writerMutex != null)
        {
            m_writerMutex.Close();
            m_writerMutex = null;
        }

        if (m_readerSemaphore != null)
        {
            m_readerSemaphore.Close();
            m_readerSemaphore = null;
        }
    }

    public void EnterReadLock() { ... }
    public void ExitReadLock() { ... }
    public void EnterWriteLock() { ... }
    public void ExitWriteLock() { ... }
}
Notice that we allow naming of the lock. Any name given flows into the kernel objects used underneath, enabling cross-process and cross-AppDomain communication. You just create the same IpcReaderWriterLock with the same name in multiple processes or AppDomains, and they will magically interact with one another (whether you want them to or not). An unnamed lock is isolated inside of the AppDomain in which it was created. Notice also that there's a maximum number of simultaneous readers, the default for which is 25. This isn't terribly important, but any override does impact performance (as described below).
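To see the name-sharing behavior in isolation, here's a small sketch that opens the same named mutex twice and reports whether each call created the kernel object or attached to an existing one. It uses the `Mutex(bool, string, out bool)` constructor; `NamedObjectDemo` and `OpenTwice` are hypothetical names, and the GUID suffix is only there to keep the demo from colliding with anything else on the machine.

```csharp
using System;
using System.Threading;

public static class NamedObjectDemo
{
    // Opens the same named mutex twice and reports whether each call
    // created the underlying kernel object or attached to an existing one.
    public static void OpenTwice(string name, out bool firstCreated, out bool secondCreated)
    {
        using (Mutex first = new Mutex(false, name, out firstCreated))
        using (Mutex second = new Mutex(false, name, out secondCreated))
        {
            // Both wrappers now refer to one kernel object.
        }
    }

    public static void Main()
    {
        bool a, b;
        // The name mimics the prefix scheme used by the lock above.
        OpenTwice(@"IpcRWL#demo#" + Guid.NewGuid().ToString("N"), out a, out b);
        Console.WriteLine("first created: {0}, second created: {1}", a, b);
    }
}
```

The second handle could just as easily have been opened from another AppDomain or another process; the kernel matches on the name alone, which is why two unrelated programs that happen to pick the same name will interact whether they want to or not.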
Read and write lock implementation
Now let's implement the read-lock acquisition and release functions, EnterReadLock and ExitReadLock. We support more than one reader at a time via the use of a semaphore (#1 above). We also support blocking new readers from entering the lock while a writer waits for all readers to exit (#3 above). Thus, both of these functions are quite trivial to write:
public void EnterReadLock()
{
    Thread.BeginCriticalRegion();

    // We first wait on the read blocking event, in case a writer
    // has tried to acquire the lock and wants us to wait.
    m_blockReadsEvent.WaitOne();

    // Now take '1' from the reader semaphore to count the number
    // of simultaneous readers inside the lock.
    m_readerSemaphore.WaitOne();
}

public void ExitReadLock()
{
    // Just release '1' back to the semaphore to let others know
    // the number of simultaneous readers just decreased.
    m_readerSemaphore.Release();

    Thread.EndCriticalRegion();
}
Next comes the write-lock acquisition and release functions, EnterWriteLock and ExitWriteLock. They are slightly more complicated, but not by much. First we acquire the writer mutex. Once we've done that, we increment the recursion count, and ensure that we do some other work only the first time a writer lock is acquired on the thread. We block any new readers from entering, and then we effectively wait for all readers to exit. We do that by acquiring the semaphore n times, where n is the maximum number of readers that we support. Releasing the write lock does the reverse of all of that:
public void EnterWriteLock()
{
    Thread.BeginCriticalRegion();

    // We have to first ensure only one writer can get in at a time.
    m_writerMutex.WaitOne();

    // Increment our recursion count.
    m_writerRecursionCount++;

    // For the first writer who enters, we need to block new readers
    // and wait for any existing readers to exit the lock.
    if (m_writerRecursionCount == 1)
    {
        // Next we block any new readers from entering the lock.
        m_blockReadsEvent.Reset();

        // And lastly, we ensure that all readers have exited the lock.
        // We do this by acquiring the semaphore's capacity. It's
        // unfortunate that the Win32 APIs don't support a take-n
        // function for semaphores.
        for (int i = 0; i < m_maxReaderCount; i++)
        {
            m_readerSemaphore.WaitOne();
        }
    }
}
public void ExitWriteLock()
{
    // We have to do everything in the reverse order as we did
    // during acquisition. Not doing so can lead to subtle bugs,
    // including lost resets and deadlocks.
    m_writerRecursionCount--;

    // The last writer to release has to signal readers.
    if (m_writerRecursionCount == 0)
    {
        // We release the semaphore's capacity back, enabling readers
        // to take from it. Note that as soon as we call this, other
        // threads may wake up and race to acquire the semaphore. In
        // fact, simultaneous readers can get in, even though we still
        // have a writer in here!
        m_readerSemaphore.Release(m_maxReaderCount);

        // Unblock any readers that are waiting. Note: ideally we would
        // do this after signaling writers, so that readers can't sneak
        // in before the writer, but that would be more complicated: we
        // keep it simple for now.
        m_blockReadsEvent.Set();
    }

    // And lastly release the mutex.
    m_writerMutex.ReleaseMutex();

    Thread.EndCriticalRegion();
}
And that's it. A fully functioning reader/writer lock, for some definition of "functioning." The full source code can be found here: IpcReaderWriterLock.cs.
The test case
This example wouldn't be complete without a simple test case to prove that it works. The sample program included in the IpcReaderWriterLock source code creates 20 threads--10 readers and 10 writers--in 10 AppDomains. Each does a piece of work designed to expose race conditions via context switching at sensitive points. It prints out "Success" at the end, assuming it all worked. I see "Success" every time I run it, so it works on my machine at least. Hooray.
Summary: Don't use this thing
OK, OK... this lock is pretty icky and nasty to be honest, and you probably wouldn't want to use it. Ever. A simple write-lock acquisition incurs 27 kernel transitions with the default settings. This ends up costing over 1000 times the cost of a simple monitor acquisition! (Yes, there are three 0's in that number... ouch.) Moreover, the cost increases in proportion to the number of simultaneous readers that the lock instance supports, which is not very good. That's why I've used such a low default: 25. And it's not very reliable either, which, for anything that does cross-AppDomain or cross-process synchronization, can be disastrous. One process that crashes can lead to machine-wide corruption and a user who needs to reboot the machine. A much better lock (performant, reliable, etc.) could be built using memory mapped files, although the implementation would be substantially more complicated.
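To see where the 27 plausibly comes from, here is a rough sketch based on one reading of EnterWriteLock above: one wait on the writer mutex, one reset of the read-blocking event, and one semaphore wait per supported reader. The class and method names are hypothetical, and the count ignores the release path and anything the runtime does internally.

```csharp
using System;

static class WriteLockCost
{
    // Kernel calls made by a first (non-recursive) write-lock acquisition,
    // as read off EnterWriteLock: one mutex wait, one event reset, and one
    // semaphore wait per supported reader. This breakdown is an assumption.
    public static int KernelCallsPerAcquire(int maxReaderCount)
    {
        const int mutexWait = 1;
        const int eventReset = 1;
        return mutexWait + eventReset + maxReaderCount;
    }

    public static void Main()
    {
        // With the default of 25 supported readers this prints 27.
        Console.WriteLine(KernelCallsPerAcquire(25));
    }
}
```

The linear term is the point: doubling the supported reader count roughly doubles the cost of every first write acquisition.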
I almost didn't write this post for all of these reasons. I'm not sure whether I'm doing a disservice to my readers by doing this. Instead, I hope that showing how simple Windows kernel objects can be composed together in interesting, powerful, and non-obvious ways is interesting, if only for trivia reasons. I'd also like to think that it may inspire you to think about things a little differently, perhaps helping you to write clever (but useful!) things out of the building blocks you already have.
 Thursday, August 03, 2006
A glimpse of the future?
32-Core Processors: Intel Reaches for (the) Sun
Intel is rolling out its Core 2 micro-architecture now. The Xeon 5100 server processor aka Woodcrest was released only weeks ago, Core 2 Duo for the desktop (Conroe) is expected on July 27th and the mobile version Merom will follow only weeks later. The next milestone is quad-core processors, which the firm will produce by fitting two Woodcrest dual cores inside a physical processor package (Clovertown). You may have realized that there is a product development pattern behind recent and upcoming Intel multi core processor releases. Amazingly enough, Intel has been studying Sun's UltraSPARC T1 (Niagara) to come up with a radical processor redesign for 2010 that could perform 16 times faster than Woodcrest. This is no marketing blurb, guys; this is technical intelligence from within the Borg collective.
Read More...
I want a 32-CPU MP machine with those puppies on it. 1,024 cores anybody? Engadget claims each of those cores has 4 hardware threads. 4,096 hardware threads anybody?
Engadget goes on to say:
The biggest hurdle of all, however, could be a consumer Microsoft OS that can fully help software take advantage of multiple cores, a task which Vista isn't quite up to.
This is a bit of an exaggeration. Server SKUs will take advantage of multi-core without problem. It's our client situation that needs a little help, and even then the OS has little to do with it.
 Tuesday, August 01, 2006
The September edition of MSDN Magazine contains an article I wrote. Surprise! It's on concurrency:
Using Concurrency for Scalability
There's been a lot of buzz lately about concurrency. The reason for this is due mostly to major hardware vendors' plans to add more processor cores to both client and server machines, and also to the relatively unprepared state of today's software for such hardware. Many articles focus on how to make concurrency safe for your code, but they don't deal with how to get concurrency into your code in the first place. ...
In some sense, a large chunk of the responsibility for making software go faster with the next generation of hardware has been handed off from hardware to software. That means in the medium-to-long term, if you want your code to get faster automatically, you'll have to start thinking about architecting and implementing things differently. This article is meant to explore these architectural issues from the bottom up, and aims to guide you through this new world. In the long run, new programming models are likely to appear that will abstract away a lot of the challenges you will encounter.
Enjoy. The original go-'round was about 10,000 words, far too long for a column-length magazine article. After cutting it back, I do feel like it's a bit fast paced and glosses over some important details. In particular, I had wanted to cut the profiling section but there wasn't enough time in the process to do it. I hope somebody writes a feature-length article on that sometime soon. Nevertheless I think the article will be useful. Any feedback would be appreciated.
I'm also pleased to see that Jeff Richter has a great article on CCR in this same edition (and a sister Channel9 video).
 Tuesday, July 25, 2006
A colleague lent me a copy of W. Daniel Hillis’s PhD thesis, The Connection Machine, which is also available in book form from The MIT Press. I only began reading it last night, but I have been continuously amazed. It’s been enlightening to realize how much framing problems differently (and, in many cases, more naturally) can make programming without concurrency seem ridiculous.
To give you an idea, here's a quote from the thesis:
“When performing simple computations over large amounts of data, von Neumann computers are limited by the bandwidth between memory and processor. This is a fundamental flaw in the von Neumann design; it cannot be eliminated by clever engineering.”
Here is a quick illustration: What’s the most efficient way of copying a source array of 100,000 elements to a destination array of 100,000 elements? With a single CPU this would typically be O(n), where n is the length of the array. If you could minimize costs due to thread creation and communication, and ensure good locality, you might be able to gain some parallel speedup by using multiple CPUs.
With The Connection Machine, however, you can do it in O(1) time. Simply instruct the 100,000 source cells, each of which holds a single array element, to communicate their value to the 100,000 destination cells, and instruct the destination cells to receive and store the value. This happens instantaneously, across the machine, not in serial fashion. If any node a can communicate with any other node b in 1 time unit, the entire array is copied in just 1 time unit, not 100,000! (Designing such an interconnect is, of course, quite difficult, but in theory it is quite nice.)
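For contrast, here is a minimal sketch of the conventional multi-CPU version of that copy: the array is split into chunks and each chunk is copied by its own thread. The names are hypothetical, and on ordinary hardware this is at best O(n/p) for p workers, before counting thread creation and memory bandwidth costs. It is nowhere near the one-time-unit copy described above.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

static class ParallelCopySketch
{
    // Copies source into destination, one thread per contiguous chunk.
    public static void Copy(int[] source, int[] destination, int workerCount)
    {
        int chunk = (source.Length + workerCount - 1) / workerCount;
        List<Thread> workers = new List<Thread>();

        for (int w = 0; w < workerCount; w++)
        {
            int start = w * chunk;  // fresh variable per iteration, safe to capture
            int length = Math.Min(chunk, source.Length - start);
            if (length <= 0) break; // nothing left for this worker

            Thread t = new Thread(() => Array.Copy(source, start, destination, start, length));
            workers.Add(t);
            t.Start();
        }

        foreach (Thread t in workers) t.Join();
    }

    public static void Main()
    {
        int[] source = new int[100000];
        for (int i = 0; i < source.Length; i++) source[i] = i;

        int[] destination = new int[source.Length];
        Copy(source, destination, Environment.ProcessorCount);
        Console.WriteLine(destination[source.Length - 1]);
    }
}
```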
I found it particularly interesting that, back in 1985, at least this author recognized the impending demise of an entirely sequential approach to all problems. I guess the von Neumann-plus-Knuth knockout punch infected the world of programming to the contrary, and it was straight down-hill from there…
 Saturday, July 15, 2006
Stack overflow can be catastrophic for Windows programs. Some Win32 libraries and commercial components may or may not respond intelligently to it. For example, I know that, at least as late as Windows XP, a Win32 CRITICAL_SECTION that has been initialized so as to never block can actually end up stack overflowing in the process of trying to acquire the lock. Yet MSDN claims it cannot fail if the spin count is high enough. A stack overflow here can actually lead to orphaned critical sections, deadlocks, and generally unreliable software in low stack conditions. The Whidbey CLR now does a lot of work to probe for sufficient stack in sections of code that manipulate important resources. And we pre-commit the entire stack to ensure that overflows won't occur due to failure to commit individual pages in the stack. If a stack overflow ever does occur, however, it's considered a major catastrophe--since we can't reason about the state of what native code may have done in the face of it--and therefore, the default unhosted CLR will fail-fast.
In some rare cases, it is useful to query for the remaining stack space on your thread, and change behavior based on it. It could enable you to fail gracefully rather than causing a stack overflow, possibly in Win32, causing the process to terminate. A UI that needs to render some very deep XML tree, and does so using stack recursion, could limit its recursion or show an error message based on this information, for example. It could decide that it needs to spawn a new thread with a larger stack to perform the rendering. Or it may just be a handy way to log an error message during early testing so that the developers can fine tune the stack size or depend less heavily on stack allocations to get the job done.
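As a minimal sketch of the "spawn a new thread with a larger stack" option, the Thread constructor accepts a maximum stack size in bytes. The recursive Depth function below is a hypothetical stand-in for a deep renderer, and the 16MB figure is an arbitrary choice for illustration, not a recommendation.

```csharp
using System;
using System.Threading;

static class BigStackSketch
{
    // Hypothetical stand-in for a deeply recursive operation, such as
    // rendering a very deep XML tree.
    static int Depth(int remaining)
    {
        return remaining == 0 ? 0 : 1 + Depth(remaining - 1);
    }

    // Runs the recursion on a dedicated thread whose stack reserve is
    // given explicitly, instead of relying on the PE header default.
    public static int RunWithStack(int depth, int stackBytes)
    {
        int result = 0;
        Thread worker = new Thread(() => result = Depth(depth), stackBytes);
        worker.Start();
        worker.Join();
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(RunWithStack(50000, 16 * 1024 * 1024));
    }
}
```

The trade-off is the one described above: every such thread reserves (and, under the CLR, commits) the larger stack, so this suits an occasional heavy operation rather than every request.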
I've previously mentioned that the TEB has a StackBase and StackLimit, and that it can be dynamically queried using the ntdll!NtCurrentTeb function. Unfortunately, the StackLimit is only updated as you actually touch pages on the stack, and thus it's not a reliable way to find out how much uncommitted stack is left. The CLR commits the pages using kernel32!VirtualAlloc rather than by actually moving the guard page, so StackLimit is not updated as you might have expected. There's a field, DeallocationStack, at 0xE0C bytes from the beginning of the TEB that will give you this information, but it's undocumented, subject to change in the future, and too brittle to rely on.
The RuntimeHelpers.ProbeForSufficientStack function may look promising at first, but it won't work for this purpose. It probes for a fixed number of bytes (48KB on x86/x64), and if it finds there isn't enough, it induces the normal CLR stack overflow behavior.
The good news is that the kernel32!VirtualQuery function will get you this information. It returns a structure, one field of which is the AllocationBase for the original allocation request. When Windows reserves your stack, it does so as one contiguous piece of memory. The MM remembers the base address supplied at creation time, and it turns out that this is the "end" of your stack (remember, the stack grows downward). With a little P/Invoke magic, it's simple to create a CheckForSufficientStack function using this API. Our new function takes a number of bytes as an argument and returns a bool to indicate whether there is enough stack to satisfy the request:
// Requires: using System; using System.Runtime.InteropServices;
public unsafe static bool CheckForSufficientStack(long bytes) {
    MEMORY_BASIC_INFORMATION stackInfo = new MEMORY_BASIC_INFORMATION();
    // We subtract one page for our request. VirtualQuery rounds UP to the next page.
    // Unfortunately, the stack grows down. If we're on the first page (last page in the
    // VirtualAlloc), we'll be moved to the next page, which is off the stack! Note this
    // doesn't work right for IA64 due to bigger pages.
    IntPtr currentAddr = new IntPtr((long)&stackInfo - 4096);
    // Query for the current stack allocation information.
    VirtualQuery(currentAddr, ref stackInfo,
        new UIntPtr((uint)sizeof(MEMORY_BASIC_INFORMATION)));
    // If the current address minus the base (remember: the stack grows downward in the
    // address space) is greater than the number of bytes requested plus the reserved
    // space at the end, the request has succeeded.
    return (currentAddr.ToInt64() - (long)stackInfo.AllocationBase.ToUInt64()) >
        (bytes + STACK_RESERVED_SPACE);
}
// We are conservative here. We assume that the platform needs a whole 16 pages to
// respond to stack overflow (using an x86/x64 page-size, not IA64). That's 64KB,
// which means that for very small stacks (e.g. 128KB) we'll fail a lot of stack checks
// incorrectly.
private const long STACK_RESERVED_SPACE = 4096 * 16;
// The return value and dwLength are SIZE_T in Win32, hence UIntPtr here.
[DllImport("kernel32.dll")]
private static extern UIntPtr VirtualQuery (
    IntPtr lpAddress,
    ref MEMORY_BASIC_INFORMATION lpBuffer,
    UIntPtr dwLength);
// Pointer-sized fields are declared as UIntPtr so that the layout matches the native
// structure on both x86 and x64.
private struct MEMORY_BASIC_INFORMATION {
    internal UIntPtr BaseAddress;
    internal UIntPtr AllocationBase;
    internal uint AllocationProtect;
    internal UIntPtr RegionSize;
    internal uint State;
    internal uint Protect;
    internal uint Type;
}
If this returns true, you can be guaranteed that an overflow will not occur. Well, modulo stack guarantee issues, that is...
Notice that we have to consider some amount of reserved space at the end of the stack. Platforms typically reserve a certain amount to ensure custom stack overflow processing can be triggered. Windows actually reserves a few pages at the end of the stack for this reason. If, after a stack overflow occurs, a double stack overflow is triggered (that is, stack overflow handling actually exceeds these pages), Windows takes over and kills the process. The CLR prefers to initiate a controlled shut-down: telling the host, if any, and fail-fasting otherwise. This means it needs to reserve even more than Windows does automatically. The kernel32!SetThreadStackGuarantee function can be used for this. In any case, we need to consider that when looking for enough stack space in our function. The code above assumes 16 4KB pages are required; this is more than is typically needed, so it may lead to false positives (but we hope no false negatives). Also note the program above is very x86/x64-specific, and won't work reliably on IA-64: it hard-codes a 4KB page size. It's a trivial exercise to extend this to use information from kernel32!GetSystemInfo to use the right page size dynamically.
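As a rough sketch of removing the hard-coded page size, the reserved space can be computed from the page size reported at run time. Note the swap: instead of P/Invoking kernel32!GetSystemInfo, this uses Environment.SystemPageSize, which only appeared in later .NET versions (4.0 onward). The class name and the 16-page constant carried over from the code above are illustrative assumptions.

```csharp
using System;

static class StackReserveSketch
{
    // Same conservative page count as the hard-coded version above.
    const int ReservedPages = 16;

    // Reserved space in bytes for a given page size.
    public static long ReservedSpace(int pageSize)
    {
        return (long)ReservedPages * pageSize;
    }

    public static void Main()
    {
        // Environment.SystemPageSize stands in for the dwPageSize field
        // that GetSystemInfo would return.
        Console.WriteLine(ReservedSpace(Environment.SystemPageSize));
    }
}
```

The same page size would also replace the literal 4096 that CheckForSufficientStack subtracts from the current address.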
As an example, check out this code:
static unsafe void Main() {
    Test(8*1024, 8*1024, true);
    Test(0, (960*1024) + (8*1024), false);
    Test(960*1024, 8*1024, false);
}
static unsafe void Test(int eatUp, long check, bool expect) {
    byte * bb = stackalloc byte[eatUp];
    Console.WriteLine("eatUp: {0}, check: {1}: {2}",
        eatUp, check,
        CheckForSufficientStack(check) == expect ?
            "SUCCESS" : "FAIL");
}
As I've described previously, the stack size can depend on the EXE PE file or parameters passed when creating a thread. This example assumes a 1MB stack size.
 Saturday, July 08, 2006
The CLR thread pool is a very useful thing. It amortizes the cost of thread creation and deletion--which, on Windows, are not cheap things--over the life of your process, without you having to write the hairy, complex logic to do the same thing yourself. The algorithms it uses have been tuned over three major releases of the .NET Framework now. Unfortunately, it’s still not perfect. In particular, it stutters occasionally.
As I’ve hinted at before, we have a lot of work actively going on right now that we hope will show up over the course of the next couple of CLR versions (keep an eye on those CTPs!). This may include vastly improved performance for work items and IO completions, significantly reducing the overhead of using our thread pool (in some cases to as little as ~1/8th of what it is today), eliminating accidental deadlocks due to lots of blocked thread pool workers, and a slew of useful new features (prioritization, isolation, better debugging, etc.).
One silly thing our thread pool currently does has to do with how it creates new threads. Namely, it severely throttles creation of new threads once you surpass the "minimum" number of threads, which, by default, is the number of CPUs on the machine. We limit ourselves to at most one new thread per 500ms once we reach or surpass this number. This can be pretty bad for some workloads, most notably those that are "bursty"; i.e. those that exhibit interspersed inactive and active periods rather sporadically and unpredictably. ASP.NET is a great example of an environment in which this frequently happens. Here’s an illustration:
- Imagine we have a 4 CPU web server. The "minimum" thread count used is thus 4 (assuming the default).
- The web server has just started up.
- 16 new requests come in within a short period of time.
- The CLR quickly creates 4 thread pool threads to service the first 4 requests. Because we don’t want to add any more for another 500ms, the other 12 requests sit in the queue.
- The 4 thread pool threads are running some arbitrary web page response. Imagine the response generation code does some type of database query that takes 4 seconds to complete. (This is a strong argument for using ASP.NET asynchronous pages (see http://msdn.microsoft.com/msdnmag/issues/05/10/WickedCode/) -- in which case, the 4 thread pool threads would free up to execute 4 new requests almost immediately -- or perhaps simply rearchitecting the seemingly poor database interaction, but ignore this for now.)
- After 500ms, a new thread pool thread is created, and the 5th request is serviced.
- We now wait another 500ms to add another thread, service the next request, and so forth.
If the server has a constant load, eventually the pool will become "primed." But if a burst of work is followed by an inactive period of time, the threads in the thread pool start timing out waiting for new work, and eventually will retire themselves, until the pool shrinks back to the minimum. Imagine that this happens and then a bunch of new work arrives. Oops. This can clearly lead to some nasty scalability nightmares. KB article 821261, Contention, poor performance, and deadlocks when you make Web service requests from ASP.NET applications, describes this problem among others.
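The illustration above can be turned into a small simulation. This is a simplified model rather than the CLR's actual algorithm: all requests arrive at time zero, each takes a fixed service time, threads up to the minimum are created immediately, and afterwards a queued request runs on whichever comes first, an idle thread or the next throttled thread creation. All names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class InjectionThrottleSketch
{
    // Returns the time (in ms) at which each request starts executing.
    public static int[] StartTimes(int requests, int minThreads, int injectionMs, int serviceMs)
    {
        int[] starts = new int[requests];
        List<int> freeAt = new List<int>(); // time at which each thread next goes idle
        int nextInjection = injectionMs;    // earliest time a throttled thread may be added

        for (int r = 0; r < requests; r++)
        {
            if (freeAt.Count < minThreads)
            {
                // Below the minimum: a thread is created right away.
                starts[r] = 0;
                freeAt.Add(serviceMs);
                continue;
            }

            // Find the thread that becomes idle soonest.
            int idle = 0;
            for (int t = 1; t < freeAt.Count; t++)
                if (freeAt[t] < freeAt[idle]) idle = t;

            if (freeAt[idle] <= nextInjection)
            {
                // An existing thread frees up before a new one may be created.
                starts[r] = freeAt[idle];
                freeAt[idle] += serviceMs;
            }
            else
            {
                // Wait for the throttle, then run on a brand new thread.
                starts[r] = nextInjection;
                freeAt.Add(nextInjection + serviceMs);
                nextInjection += injectionMs;
            }
        }
        return starts;
    }

    public static void Main()
    {
        // 16 requests, 4 CPUs, one new thread per 500ms, 4 second requests.
        int[] starts = StartTimes(16, 4, 500, 4000);
        for (int r = 0; r < starts.Length; r++)
            Console.WriteLine("request {0} starts at {1}ms", r + 1, starts[r]);
    }
}
```

Under these assumptions the 5th request starts at 500ms and the last requests do not start until 4000ms, whereas a minimum of 16 threads would start all of them at time zero.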
To "fix" this we added the ability in v1.1 to specify the minimum thread count in the thread pool, either with the configuration file or with the ThreadPool.SetMinThreads API. See KB article 810259, FIX: SetMinThreads and GetMinThreads API Added to Common Language Runtime ThreadPool Class, for details. It turns out that Microsoft Biztalk Server has run into the same problem: FIX: Slow performance on startup when you process a high volume of messages through the SOAP adapter in BizTalk Server 2004. I suspect many other commercial products have run into this as well. And it's rather annoying that each of them has to figure this out after they've shipped something, turning into a support bulletin, an internal bug-fix, and (I would guess) a service pack containing said bug-fix.
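A minimal sketch of the programmatic workaround follows. ThreadPool.GetMinThreads and ThreadPool.SetMinThreads are the real APIs; the helper name and the "four per CPU" figure in Main are arbitrary choices for illustration, not tuned recommendations.

```csharp
using System;
using System.Threading;

static class MinThreadsSketch
{
    // Raises the minimum worker thread count if it is currently lower,
    // leaving the IO completion thread minimum untouched.
    public static bool RaiseMinWorkerThreads(int desiredWorkers)
    {
        int workers, ioThreads;
        ThreadPool.GetMinThreads(out workers, out ioThreads);
        if (desiredWorkers <= workers) return true; // already high enough

        // SetMinThreads returns false if the pool rejects the request,
        // for example when it exceeds the configured maximum.
        return ThreadPool.SetMinThreads(desiredWorkers, ioThreads);
    }

    public static void Main()
    {
        bool ok = RaiseMinWorkerThreads(Environment.ProcessorCount * 4);
        int workers, ioThreads;
        ThreadPool.GetMinThreads(out workers, out ioThreads);
        Console.WriteLine("changed or sufficient: {0}, min workers now: {1}", ok, workers);
    }
}
```

This would typically run once at startup, before the first burst of work arrives.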
I wouldn't actually call what we did a fix. At best, it's a workaround. Hell, one of the KB articles above says that if you want decent scalability you need to change the minWorkerThreads count to 50. Our default is 1! Not too far off, eh? Shouldn't decent scalability be the default behavior?
We need to fix this for real.
Now, of course, it's a hard problem to solve. You don’t want to be too liberal adding threads to the pool because it can cause poor scalability should a large number of those extra threads suddenly become runnable. In an ideal world, no threads block, and having the same number of threads as you have CPUs gives you the best performance. (Better cache utilization, less overhead due to context switching, and so forth.) But software is often far less than ideal. As noted above, ASP.NET asynchronous pages are a great way to achieve this, and compared to injecting a whole bunch of relatively expensive threads into the process, it's obviously a better design. Unfortunately, I am not convinced all of our customers will stumble across this design, nor will it be brain-dead simple to rearchitect an existing site to take advantage of the feature without considerable work.
My hope is that we can solve this problem in the CLR by applying clever heuristics that even out over time. For example, we may start out life being over eager and generous with thread injection, but then "learn our lesson" after running for a period of time. This would lead to stabilization and an increasingly superior performance over the life of the process for the work that the server experiences. For example, if the server often experiences bursts, we will monitor the number of threads that lead to the best throughput during such an active period, and during periods of inactivity we will avoid retiring threads. This ensures that the next time the server is busy, work can be responded to in a more scalable manner, albeit with some extra working set overhead for keeping those threads around for longer. Perhaps more appropriate configuration settings could be dynamically recommended based on statistics gathered during previous up-times of the server. And of course, we can offer more reasonable defaults for clients with short-living processes that might be harmed by over-eagerness with thread injection.
If anybody has experienced this problem in the wild, I’d love any feedback you might have. Feel free to leave a comment or email me at joedu at you-know-where dot com.
 Thursday, July 06, 2006
As I mentioned in a recent post, Windows Vista has new built-in support for deadlock detection. At the time, I couldn't find any publicly available documentation on this feature. Well, I just found it:
Wait Chain Traversal Wait Chain Traversal (WCT) enables debuggers to diagnose application hangs and deadlocks. A wait chain is an alternating sequence of threads and synchronization objects; each thread waits for the object that follows it, which is owned by the subsequent thread in the chain. Read More.
The new APIs OpenThreadWaitChainSession, CloseThreadWaitChainSession, and GetThreadWaitChain permit both asynchronous and synchronous detection and response. MSDN also has a fairly detailed code sample that uses the new APIs to print out the wait chain for all threads in a process.
 Monday, July 03, 2006
The world at large seems to be gravitating towards AJAX applications. I suppose this shouldn’t be surprising, given the relative lack of differentiation when comparing today’s “rich” client apps with what can run inside of a browser. We’ve actually hit a low point on the client if you ask me: I actually prefer Outlook Web Access over the rich client because at least my web browser doesn’t hang as much. While rich media and ink have made client-side interactions (theoretically) more interactive, satisfying, and powerful, I have to admit that my personal day-to-day experience with client software is nowhere near what I’d like it to be.
It’s surprising (to me) that people can build such nice looking, responsive, and (dare I say?) rich programs with a whole lot of ingenuity, blood, sweat, and tears, all using technologies that date back to when I was a teenager and which have evolved at a tremendously slow pace. I mean, standards committees aren’t necessarily known for rapid innovation. What’s even more surprising is that, with a state of the art IDE, Visual Studio, impressive presentation stacks like Windows Forms and Presentation Foundation, and a killer VM and associated Framework, the momentum is decidedly in the web frontier. Thank God for ASP.NET.
The problem isn’t that I don’t understand the AJAX sales pitch. It’s that I can’t believe we haven’t solved those problems on the client and really set it apart.
Despite the A in AJAX standing for Asynchronous, the marriage with multi-core seems even less obvious than with the rich client. When the computational edge is offloaded to a set of back-office machines running in an expensive data center, the need for tera[fl]ops on the client seems slightly farfetched. Those servers sure better have the goods, though.
It seems to me that concurrency in AJAX apps will be more about masking latency and aggregating content from disparate sources than anything else: issuing tons of network requests and letting them complete in an overlapped fashion. Fault tolerance is another really attractive feature that concurrency could buy you. A quick search turned up this post about a Scheme message passing AJAX library. Very handy, and in one of my favorite languages to boot. But I wonder how mainstream these techniques will become?
Maybe I’m not thinking big. Speech and handwriting recognition, data mining over terabytes of personal information (which, by the way, it seems you need access to a hard drive for; see, the client’s good for something! ...or was that GFS?), synthesis and analysis of complex business, scientific, financial, and personal problems, and so on, are all things I can easily see a rich client doing. But then again, there's no real reason that such things couldn't be packaged up into AJAX libraries and all executed inside of the browser. A CPU cycle is a cycle, regardless of whether it's in the data center or running inside a web browser. This is where technologies like Flash/Flex come into play, taking the web experience and incrementally improving it to deliver richer experiences.
Google has already mastered the art of offloading impressive and constantly improving analysis over hordes of data to the data-center. It seems to me that they may also be in a position to transition some of this complex analysis onto the clients connected to that same data-center. A cluster of machine nodes looks surprisingly similar to a cluster of CPU nodes. Cutting down on communication and data transfer costs is key in both domains. And after all, why waste all of that (potentially massive) computing power that the 8-, 16-, 32-, ... core client has available? Instead of dumbing down the algorithms that are shared by [b|m]illions of clients simply so that a finite amount of computing power can be evenly distributed among a statistical peak load of consumers, make the clients do some work for themselves instead. And of course, the "constantly improving" attribute needn’t be lost: after all, it’s just a JS file. (IP of course starts to matter.)
Or is this all something that only a totally integrated client package can provide (whatever that means)? I suppose we’ll eventually find out.
When threads are created on Windows, the caller of the CreateThread API has the option to supply stack reserve/commit sizes. If not specified--i.e. the stack size parameter is 0--Windows just uses the sizes found in the PE header of the executable. Microsoft's linkers by and large use 1MB reserve/2 page commit by default, although most let you override this (e.g. LINK.EXE's /STACK:xxx,[yyy] option and VC++'s CL.EXE /F xxx). The CLR always pre-commits the entire stack for managed threads.
You'll often find situations where a program has been deployed and starts running out of stack space. Many times this is just a bug. But this also often happens when more data is fed to the application than was used during testing, causing deeper recursion or larger stack allocated data structures than is typical. ASP.NET, for example, uses 256KB stack sizes by default to minimize memory pressure due to large numbers of concurrent requests. It does this by setting the PE header's reserve size to 256KB, and relying on the fact that the CLR thread-pool creates its threads with a default stack size. I think WSDL.EXE also uses a 256KB stack to make startup faster. I was recently chatting with a customer who kept stack overflowing WSDL.EXE due to an extremely large XML file they were trying to parse (recursive XML parsers tend to use very deep stacks anyhow).
If you don't have the source code for the program in question, you can always use the EDITBIN.EXE utility that comes in the VC++ SDK to change the PE header's default stack values. Say you have an executable, FOO.EXE, that has been deployed and suddenly starts running out of stack space. You know it's not a bug -- it simply needs to consume more stack than was originally reserved. Running `EDITBIN.EXE FOO.EXE /STACK:2097152`, for example, changes the default stack to 2MB. This of course only works for threads that are created using the default stack size; if they override it explicitly, changing the PE header has no effect. This always works for threads in the CLR's thread pool.
Warning: Using EDITBIN.EXE like this can invalidate support and servicing warranties on commercial executables. You might want to use this approach for workarounds in your own organization or for personal use, but I don't recommend it for, say, Microsoft shipped binaries. There's no guarantee things will continue working as you'd hope, especially if you're shrinking the stack size instead of growing it. And next time you download an update from the Windows Update server, you may find that you've accidentally hosed your machine (although it honestly seems rather unlikely).
 Sunday, July 02, 2006
There are two main reasons to use concurrency.
The first reason is throughput. If you have multiple CPUs, then clearly you need at least as many threads as CPUs to keep them all busy. It's a little odd to talk about client-side workloads in terms of throughput, but we'll have to get used to it as multi-core becomes more prevalent. In the best case, there would be the same number of active threads as there are CPUs, each of which is entirely CPU-bound.
This is a very simplistic view of the world, however, and you typically end up needing more than that. The reason? Latency. Whenever you issue an operation with a non-zero latency, there will be some number of wasted CPU cycles during which computations will not make forward progress. If the latency is high enough, you can mask it with concurrency, and instead overlap some of the computation that needs to get done. Simply put, this maximizes the amount of work that actually gets done in a given amount of time (i.e. throughput).
To illustrate this point, consider Intel’s HyperThreading (HT) technology for a moment. Any memory access, and particularly one that misses cache entirely and goes to main memory, has a noticeable latency (e.g. on the order of tens to hundreds of cycles). Instruction-level parallelism (ILP) can mask this to some degree. But HT also improves instruction-level throughput by overlapping adjacent instruction streams as stalls occur due to latency. This is clearly concurrency-in-the-small and doesn’t incur any noticeable overhead for context switching as do coarser grained forms of concurrency. But for many workloads it can do a surprisingly good job at masking delays. This technology, by the way, is based on technologies pioneered by super-computer makers like Cray and Tera years ago; many such architectures actually don’t use caches, so the latency of accessing main memory is incurred much more frequently, and thus this technique is much more beneficial in practice.
To illustrate this idea further, consider coarse-grained IO, such as issuing a web service request. The latency here is huge when compared to a simple cache miss, often warranting application-level concurrency to mask the latency. Again, if your goal is to maximize throughput, then you’d like to use as many cycles/time as possible, assuming that ensures you get the most work done. Asynchronous overlapped IO via Windows Completion Ports is meant exactly for this purpose (e.g. via the Stream.BeginXXX/EndXXX functions combined with the thread pool), allowing you to resume the paused “continuation” once the IO completes. In the meantime, you can continue performing meaningful work. This technique also often leads to better bandwidth utilization; for example, you can have several pending network requests which complete as individual responses are received, again masking the unpredictable latencies and response times.
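Here is a minimal sketch of the Begin/End call pattern mentioned above: start a read, do other work while it is pending, then collect the result. The helper name is hypothetical. A MemoryStream is used only so the example is self-contained, so nothing is truly overlapped here; real overlap depends on the stream (for example a FileStream opened for asynchronous IO), and a completion callback could be passed instead of calling EndRead inline.

```csharp
using System;
using System.IO;

static class OverlappedReadSketch
{
    // Issues a read, runs otherWork while it is outstanding, then
    // completes the read and returns the number of bytes obtained.
    public static int ReadWhileWorking(Stream stream, byte[] buffer, Action otherWork)
    {
        IAsyncResult pending = stream.BeginRead(buffer, 0, buffer.Length, null, null);
        otherWork();                    // useful work overlapped with the IO
        return stream.EndRead(pending); // blocks only if the IO is still in flight
    }

    public static void Main()
    {
        byte[] data = new byte[] { 1, 2, 3, 4 };
        byte[] buffer = new byte[4];
        using (MemoryStream stream = new MemoryStream(data))
        {
            int read = ReadWhileWorking(stream, buffer,
                () => Console.WriteLine("doing other work"));
            Console.WriteLine("bytes read: {0}", read);
        }
    }
}
```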
A special case is when maximizing throughput of an individual component rather than the system as a whole. The UI thread, for example, is a precious resource that needs to maximize its message dispatching throughput so that latencies are masked from the user. Instead of statistical throughput degradation, failing to do this can lead to disastrous user experiences. This typically involves dispatching events to a separate worker thread whenever any IO might occur during the event's execution. And it may mean sacrificing the throughput of the entire system so that you can maximize throughput and remove waiting from the single component. Other systems with finite resources often exhibit this same characteristic.
The second major reason to use concurrency is fairness. If you are performing some work and suddenly some new work arrives, it often makes sense to start the computations associated with the new work as soon as possible. This allows round-robin servicing (e.g. at thread quantum intervals), ensuring that multiple pieces of work make progress at somewhat equivalent speeds. Anti-starvation of pending requests can often mandate this technique. For example, if you have a shared hosted web server whose pages just block indefinitely, you may end up starving other sites if you don’t create more threads to service them. In some cases, you may actually want to preempt the existing work if the new work is a higher priority. Windows thread priorities are good for that.
For compute-intensive workloads, optimizing for fairness will typically decrease throughput. That’s because you often need to create more threads than you have CPUs to accommodate the new work, and therefore more time is spent context switching and damaging locality. What may not be obvious is that this can actually lead to better throughput for many workloads, because IO can be overlapped and therefore as instruction streams stall, other threads can overlap progress.
Locking messes with all of this. Today's locking mechanisms aren't conducive to optimizing for throughput. The latency involved with racing with other concurrent workers is unpredictable, and measurable at best. It is very difficult to systematically design to hide such latencies. And of course most locks have no idea of fairness or priority. Because context switches can happen while a lock is held, it may be the case that every thread about to be scheduled tries to acquire that same lock. Bam, you suddenly have a lock convoy on your hands. And priorities and threads don't mix very well: priority inversion can happen unexpectedly, leading to substantial loss in throughput at best and deadlock at worst. STM is a glimmer of hope in both regards.
 Thursday, June 29, 2006
I'll be speaking at JAOO'06 in Denmark this October. They have an entire track dedicated to concurrency. If you're in the area (or don't mind the travel), I highly recommend checking it out:
Concurrency and the composition of frameworks
Abstract: Multi-core computer architectures pose both a threat and an opportunity to modern software. The amount of computing power that will soon be available will enable mainstream applications to solve problems that require computing power that has until recently only been available in supercomputers. But it also means that our software needs to evolve alongside to better support and enable the levels of concurrency we'll need to effectively use all of those cores. This fact applies as much to reusable software libraries as it does to applications themselves.
This new direction imposes some new and interesting constraints on the architecture of reusable software components, including the need to remain thread agnostic, expose latency characteristics and mechanisms for hiding latency, and, for computationally expensive library routines, some way to carry them out in parallel based on the context in which they were called. These are all areas which have not yet been researched heavily and which commercial library vendors are only now beginning to seriously deal with.
This talk presents an overview of the problem, identifies some key challenges, and proposes some direction for enabling our software to both take advantage of concurrency and to avoid inhibiting it. While the discussion has been derived from experiences on the Windows and .NET Framework platforms, the ideas presented aim to transcend any specific technology.
If you're in Denmark and want to meet up to chat, definitely drop me a line.
 Wednesday, June 21, 2006
Windows Vista has some great new features for concurrent programming. For those of you still writing native code, it's worth checking them out. For those writing managed code, we have a bunch of great stuff in the pipeline for the future, but unfortunately you'll have to wait. Or convert (back) to the dark side.
The Vista features include:
1. Reader/writer locks. The kernel32 function InitializeSRWLock takes a pointer to an SRWLOCK structure, just like InitializeCriticalSection, and initializes it. AcquireSRWLockExclusive and AcquireSRWLockShared acquire the lock in the specified mode and ReleaseSRWLockXXX releases the lock. This is a "slim" RW lock, meaning it consists of just a pointer-sized value, and is ultra-fast and lightweight, much like existing Win32 CRITICAL_SECTIONs. It should be about the cost of a single interlocked operation to acquire. E.g.
SRWLOCK rwLock;
InitializeSRWLock(&rwLock);
AcquireSRWLockShared(&rwLock);
// ... shared operations ...
ReleaseSRWLockShared(&rwLock);
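For comparison, here is a rough sketch of both lock modes using the portable C++17 std::shared_mutex rather than SRWLOCK itself. SettingsTable is a hypothetical class invented for the example; the exclusive path plays the role of AcquireSRWLockExclusive and the shared path the role of AcquireSRWLockShared.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical read-mostly table guarded by a reader/writer lock.
class SettingsTable {
public:
    // Exclusive mode: only one writer at a time, and no concurrent readers.
    void Set(const std::string& key, int value) {
        assert(!key.empty());  // This sketch treats empty keys as a caller bug.
        std::unique_lock<std::shared_mutex> lock(m_lock);
        m_values[key] = value;
    }

    // Shared mode: any number of readers may hold the lock at once.
    bool TryGet(const std::string& key, int& value) const {
        std::shared_lock<std::shared_mutex> lock(m_lock);
        auto it = m_values.find(key);
        if (it == m_values.end()) {
            return false;
        }
        value = it->second;
        return true;
    }

private:
    mutable std::shared_mutex m_lock;
    std::map<std::string, int> m_values;
};
```

The locks are released automatically when the lock objects go out of scope, which removes the need to pair each acquire with the matching release by hand.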
2. Condition variables. These integrate with RW locks and critical sections, enabling you to do essentially what you can already do with Monitor.Wait/Pulse/PulseAll. InitializeConditionVariable takes a pointer to a CONDITION_VARIABLE and initializes it. SleepConditionVariableCS and SleepConditionVariableSRW release the specified lock (either CRITICAL_SECTION or SRWLOCK) and wait on the condition variable as an atomic action. When the thread wakes up again, it immediately attempts to acquire the lock it released during the wait. WakeConditionVariable wakes a single waiter for the target condition and WakeAllConditionVariable wakes all waiters, much like Pulse and PulseAll. E.g.
Buffer * pBuffer = ...;
PCRITICAL_SECTION pCsBufferLock = ...;
PCONDITION_VARIABLE pCvBufferHasItem = ...;

// Consumer code:
EnterCriticalSection(pCsBufferLock);
while (pBuffer->Count == 0) {
    SleepConditionVariableCS(pCvBufferHasItem, pCsBufferLock, INFINITE);
}
// process item...
LeaveCriticalSection(pCsBufferLock);

// Producer code:
EnterCriticalSection(pCsBufferLock);
pBuffer->Put(NewItem());
LeaveCriticalSection(pCsBufferLock);
WakeAllConditionVariable(pCvBufferHasItem);
More details on condition variables can be found on MSDN.
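The same wait-in-a-loop shape, sketched with the portable std::condition_variable instead of the Vista APIs. ItemQueue is a hypothetical class; the point is only that the wait releases the lock and blocks as one atomic step, and that the condition is rechecked after every wakeup.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>

// Hypothetical unbounded queue with a blocking Take.
class ItemQueue {
public:
    // Producer side: publish the item under the lock, then wake a waiter.
    void Put(int item) {
        {
            std::lock_guard<std::mutex> lock(m_lock);
            m_items.push(item);
        }
        m_hasItem.notify_one();
    }

    // Consumer side: wait() atomically releases the lock and blocks, then
    // reacquires the lock before returning. The loop rechecks the condition
    // because wakeups can be spurious or stolen by another consumer.
    int Take() {
        std::unique_lock<std::mutex> lock(m_lock);
        while (m_items.empty()) {
            m_hasItem.wait(lock);
        }
        assert(!m_items.empty());
        int item = m_items.front();
        m_items.pop();
        return item;
    }

private:
    std::mutex m_lock;
    std::condition_variable m_hasItem;
    std::queue<int> m_items;
};
```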
3. Lazy/one-time initialization. This allows you to write lazy allocation without fully understanding memory models and that sort of nonsense. The new APIs in kernel32, InitOnceXXX, support both synchronous and asynchronous initialization. These have some amount of overhead for the initialization case due to the use of a callback, but in general this will be fast enough for most lazy initialization and much less error prone. Herb Sutter has proposed a similar construct for the VC++ language, and to be honest I wish we had this built into C# too. See the MSDN docs for an example and more details.
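As a rough illustration of the callback-based, synchronous style, here is a sketch using the portable std::call_once, which is a later standard C++ facility and not the Vista API. Config and LazyConfig are hypothetical names, and the init counter exists only so the "exactly once" behavior can be observed.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical payload that is expensive enough to create lazily.
struct Config {
    int value;
};

class LazyConfig {
public:
    // Every caller goes through call_once. The callback runs exactly once,
    // and all callers see the fully initialized object afterwards, without
    // any hand-written double-checked locking.
    const Config& Get() {
        std::call_once(m_once, [this] {
            ++m_initCount;
            m_config = std::make_unique<Config>(Config{42});
        });
        assert(m_config != nullptr);
        return *m_config;
    }

    // Only meaningful after at least one Get() has returned.
    int InitCount() const { return m_initCount; }

private:
    std::once_flag m_once;
    std::unique_ptr<Config> m_config;
    int m_initCount = 0;
};
```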
4. An overhauled thread pool API. The Windows kernel team has actually rewritten the thread pool from the ground up for this release. Their APIs now support creating multiple pools per process, waiting for queues to drain or a specific work item to complete, cancellation of work, cancellation of IO, and new cleanup functionality, including automatically releasing locks. It also has substantial performance improvements due to a large portion of the code residing in user-mode instead of kernel-mode. MSDN has a comparison between the old and new APIs.
5. A bunch of new InterlockedXXX variants.
6. Application deadlock detection. This is separate from the existing Driver Verifier ability to diagnose deadlocks in drivers. This capability integrates with all synchronization mechanisms, from CRITICAL_SECTION to SRWLOCK to Mutex, and keys off of any calls to XXXWaitForYYYObjectZZZ. Unfortunately, I think this is new to the latest Vista SDK, and thus there isn't a lot of information available publicly. This could probably make a good future blog post if there's interest.
Have fun with this stuff, of course. But be careful. Don't poke an eye out.
 Tuesday, June 20, 2006
Jim Johnson started a series back in January that I’m dying to see continued. It's about writing resource managers in System.Transactions, which surprisingly turns out to be incredibly straightforward. Provided you are able to implement the correct ACI[D] transactional qualities for the resource in question, that is. Juval Lowy’s December 2005 MSDN Magazine article on volatile resource managers described how to build what turns out to be essentially mini-transactional memory, without much of the syntax, implicit and transitive qualities, and robustness.
As an example of where you might use a resource manager, imagine that you wanted to ensure that any memory allocations and deallocations inside a transaction scope participate with the System.Transactions ambient transaction. Maybe you'd like your allocations to be in sync with the database server or web service to which you’re also transacting access. I’ll walk through an example of how straightforward writing such a resource manager can be.
First, our starting class is quite simple. It just allocates and frees memory. Sans transactions, it looks like this:
using System;
using System.Runtime.InteropServices;

public static class Mm {
    public static IntPtr Malloc(long bytes) {
        return Marshal.AllocHGlobal(new IntPtr(bytes));
    }

    public static void Free(IntPtr pp) {
        Marshal.FreeHGlobal(pp);
    }
}
Mm.Malloc returns a pointer to ‘bytes’ amount of memory via kernel32!GlobalAlloc (which turns out to be a crappy way to manage memory by the way, and is still alive only to support DDE, the clipboard, and OLE, or so I’m told; it works as an example though). Mm.Free takes a pointer to memory that was previously allocated via Mm.Malloc and frees it. Pretty simple.
OK, that’s not incredibly useful, especially considering that we’re just making single-line invocations to the Marshal class. But it’s a starting point.
Ultimately, what we want to ensure is that at the end of a transaction, any memory allocation and deallocation that happened within the transaction is consistent with the outcome of that transaction. That means, quite simply, that if memory was allocated and the transaction commits, we keep the memory allocated around; but if, on the other hand, the transaction rolls back, we must undo the allocation. Similarly, if we free memory and the transaction commits, then the memory remains freed; if it rolls back, we must undo the freeing.
If we want to build such a thing directly on top of existing facilities we clearly can’t do this precisely as I suggest. How do you undo a call to free in the CRT, for example? You can’t. Once you call free, the memory's gone, returned to the pool, and possibly used before your transaction even knows what to do with itself. But it turns out that we can "fake it" closely enough that most people can’t tell that we're faking it. Here’s what we do instead:
- When somebody allocates memory, we log a compensating action in the transaction that frees the memory should we roll back. If the transaction commits, we do nothing more.
- When somebody frees memory, we defer the call to commit time. If the transaction never commits, we never free the memory.
This is fairly well known in database literature. Take a look at Jim Gray’s 1980 paper, A Transaction Model, where he describes REDO and UNDO actions, to see what I mean. (1980! That was ages ago.) What we’re saying basically is that allocation logs an UNDO action and freeing logs a REDO action. The isolation leaks out of this in some regard--evidence of our "faking it"--because the fact that our freed memory isn’t instantaneously available to other allocations might be noticed, especially under high stress conditions. OOMs may result that would not otherwise have happened, and the working set of the program may increase, especially for long running transactions. C'est la vie.
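Stripped of System.Transactions entirely, the two rules boil down to a pair of action lists. The following C++ sketch is only an illustration of that idea: ActionLog and its methods are hypothetical names, and unlike a real log it makes no attempt at durability or failure handling.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical, in-memory transaction log with two kinds of entries.
class ActionLog {
public:
    // UNDO entry: a compensating action that runs only if we roll back
    // (e.g. free the memory that was allocated inside the transaction).
    void LogUndo(std::function<void()> action) {
        m_undo.push_back(std::move(action));
    }

    // Deferred (REDO-style) entry: work that runs only if we commit
    // (e.g. the free that was requested inside the transaction).
    void LogDeferred(std::function<void()> action) {
        m_deferred.push_back(std::move(action));
    }

    void Commit() { Finish(m_deferred); }
    void Rollback() { Finish(m_undo); }

private:
    void Finish(std::vector<std::function<void()>>& toRun) {
        assert(!m_finished);  // A transaction has exactly one outcome.
        m_finished = true;
        for (auto& action : toRun) {
            action();
        }
        m_undo.clear();
        m_deferred.clear();
    }

    std::vector<std::function<void()>> m_undo;
    std::vector<std::function<void()>> m_deferred;
    bool m_finished = false;
};
```

An allocation inside a transaction would call LogUndo with a lambda that frees the block; a free inside a transaction would call LogDeferred with that same kind of lambda and do nothing else until commit.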
Anyhow… on to the implementation of these ideas. It’s surprisingly simple.
We will allow instances of our Mm class to be created by the implementation. From the viewpoint of a user, the class is still entirely static and cannot be constructed. These instances will become enlistments responsible for implementing transactional semantics and responding to certain event notifications from the System.Transactions machinery. To do so, the class must implement the System.Transactions interface IEnlistmentNotification:
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Transactions;

public sealed class Mm : IEnlistmentNotification {
    /** Fields **/
    private LinkedList<IntPtr> m_freeOnCommit = new LinkedList<IntPtr>();
    private LinkedList<IntPtr> m_freeOnRollback = new LinkedList<IntPtr>();

    [ThreadStatic]
    private static Dictionary<Transaction, Mm> s_currentMm;

    /** Constructors **/
    private Mm() {}

    /** Methods **/
    public static IntPtr Malloc(long bytes) { … }
    public static void Free(IntPtr pp) { … }
}
We’ve added two linked lists to hold the deferred (m_freeOnCommit) and compensating actions (m_freeOnRollback). And we have a thread-static dictionary that maps the current transaction to the enlisted instance of Mm. This is pretty straightforward stuff, although there are a plethora of alternative designs. Now let’s see how we get data into these things. The Malloc and Free implementations will change slightly to check for an existing transaction:
public static IntPtr Malloc(long bytes) {
    IntPtr pp = Marshal.AllocHGlobal(new IntPtr(bytes));

    // If insufficient memory, OOM is thrown and we never log the free.
    Mm mm = GetCurrentMm();
    if (mm != null) {
        // Compensating activity to ensure that if we rollback, we free.
        mm.m_freeOnRollback.AddLast(pp);
    }

    return pp;
}

public static void Free(IntPtr pp) {
    Mm mm = GetCurrentMm();
    if (mm != null) {
        // We defer the freeing of memory in case we don't commit.
        mm.m_freeOnCommit.AddLast(pp);
    } else {
        Marshal.FreeHGlobal(pp);
    }
}
This implements the commit and rollback behavior I described above, i.e. we add the memory location to free on to the deferred or compensated list according to the rules we’ve already established. GetCurrentMm is responsible for lazily allocating and enlisting an instance of Mm. If there is no active ambient transaction, it just returns null:
private static Mm GetCurrentMm() {
    // Are we in a transaction?
    Transaction currTx = Transaction.Current;
    if (currTx == null) {
        // Return null to indicate we're not in a transaction.
        return null;
    }

    // Have we already allocated and enlisted a volatile RM for this transaction?
    Mm currMm = null;
    if (s_currentMm == null) {
        s_currentMm = new Dictionary<Transaction, Mm>();
    } else {
        s_currentMm.TryGetValue(currTx, out currMm);
    }

    // No RM found, create/enlist one.
    if (currMm == null) {
        currMm = new Mm();
        s_currentMm.Add(currTx, currMm);
        currTx.EnlistVolatile(currMm, EnlistmentOptions.None);
    }

    return currMm;
}
And of course we have a RemoveCurrentMm which will be used eventually to remove the enlistment information from our dictionary:
private static void RemoveCurrentMm() {
    Transaction currTx = Transaction.Current;
    if (currTx != null && s_currentMm != null) {
        s_currentMm.Remove(currTx);
    }
}
So now we have all of the information about what should be freed and when, but there’s no code that actually executes the free operations. To do that, all we have to do is implement the IEnlistmentNotification interface properly, iterating the proper list and invoking Marshal.FreeHGlobal on its contents. (The interface also has a Prepare method, not shown here; a volatile resource manager with nothing to vote on can simply call Prepared() on the PreparingEnlistment it is handed.) Commit and Rollback just invoke free on all of the memory addresses in the respective linked list:
void IEnlistmentNotification.Commit(Enlistment enlistment) {
    FreeAll(m_freeOnCommit);
    RemoveCurrentMm();
    enlistment.Done();
}

void IEnlistmentNotification.Rollback(Enlistment enlistment) {
    FreeAll(m_freeOnRollback);
    RemoveCurrentMm();
    enlistment.Done();
}

private void FreeAll(LinkedList<IntPtr> toFree) {
    foreach (IntPtr p in toFree) {
        Marshal.FreeHGlobal(p);
    }
}
We’re assuming in all of those cases that FreeAll and RemoveCurrentMm can’t fail. If our commit or rollback processing failed mid-way, that would put the entire process at risk: memory could be leaked or become corrupt. System.Transactions will respond to that by sending InDoubt notifications to all enlistments. Since the only way we can potentially contain and resolve volatile state corruption is to crash the process, that’s exactly what we do:
void IEnlistmentNotification.InDoubt(Enlistment enlistment) {
    Environment.FailFast("State protected by RM is in question");
}
This is a Byzantine response, sure, but it’s the only way we can guarantee that state doesn’t become corrupt when the transaction’s fate is InDoubt. If the fate of one or more resource managers cannot be determined, we don't know whether to commit or fail for sure. We could guess, of course, but guessing doesn’t lead to pleasant behavior in software, especially when we have to debug it. (And if you're making guesses, you'll probably have to spend more time debugging, so it's a double whammy of sorts.)
And that’s it! Now we can use memory operations inside of a transaction, and have it behave as expected. Just as an example, this test case ensures that writing to memory that was allocated in a transaction that eventually aborts causes an AccessViolation:
bool test1success = false;
IntPtr pMem1 = IntPtr.Zero;
try {
    try {
        using (TransactionScope txScope = new TransactionScope()) {
            pMem1 = Mm.Malloc(1024 * 1024); // get 1MB of space
            throw new Exception(); // cause an abort
        }
    }
    catch {
        // The txn was aborted, we expect writing to the memory to fail.
        uint * pInt = (uint *)pMem1.ToPointer();
        *pInt = 0xdeadbeef;
    }
}
catch (AccessViolationException) {
    test1success = true;
}
Console.WriteLine("Test 1 succeeded: {0} (rollback of malloc)", test1success);
There are three other tests that may be of interest in the source file for the Mm class and associated code: MallocFree.cs.
Of course this approach is not perfect in several areas. One that comes to mind immediately is the fact that we're doing a potentially expensive lookup for an ambient transaction on every memory allocation and deallocation, which could be too much, especially if it happened in some general purpose allocation and deallocation routines. And of course automatically finding the transaction and using it might also be a bad idea. We might instead want the user to opt in to transactional Malloc and Free at the callsite, so that users aren't surprised when their malloc or free never happens (because the transaction rolled back). Nevertheless, this article at least scratched the surface of a very difficult problem and surfaced some interesting issues.
 Saturday, June 17, 2006
Ntdll exports an undocumented function from WinNT.h:
PTEB NtCurrentTeb();
This gives you access to the current thread's TEB (thread environment block), which is a per-thread data structure that holds things like a pointer to the SEH exception chain, stack range, TLS, fiber information, and so forth. This function actually returns you a PTEB, which is defined as _TEB*. _TEB is an internal data structure defined in winternl.h, and consists of a bunch of byte arrays. You can cast this to PNT_TIB (defined as _NT_TIB*), which gives you access to the data in a strongly typed way. And _NT_TIB is a documented data structure, unlike _TEB, meaning you can actually rely on it not breaking between versions of Windows.
For example, this code prints out the current thread's stack base and limit. The base is the start of the user-mode stack, and the limit is the last committed page, which grows as you use more stack:
PNT_TIB pTib = reinterpret_cast<PNT_TIB>(NtCurrentTeb());
printf("Base = %p, Limit = %p\r\n", pTib->StackBase, pTib->StackLimit);
There's a shortcut you can take. On x86, you can always find a pointer to the TEB at offset 18h in the FS segment, i.e. at FS:[18h]:
PNT_TIB pTib;
_asm {
    mov eax,fs:[18h]
    mov pTib,eax
}
printf("Base = %p, Limit = %p\r\n", pTib->StackBase, pTib->StackLimit);
There's an even shorter shortcut you can take. You can actually find the base and limit at fixed offsets within the FS segment, FS:[04h] for the base and FS:[08h] for the limit:
void * pStackBase;
void * pStackLimit;
_asm {
    mov eax,fs:[04h]
    mov pStackBase,eax
    mov eax,fs:[08h]
    mov pStackLimit,eax
}
printf("Base = %p, Limit = %p\r\n", pStackBase, pStackLimit);
Unfortunately, the _asm keyword is not supported on all architectures, so the above code is only guaranteed to work on x86 (e.g. the VC++ Intel Itanium compiler doesn't support it). Furthermore, the hardcoded offsets 04h and 08h are clearly wrong on 64-bit: you need more than 4 bytes to represent a 64-bit pointer. NtCurrentTeb hides all of this and uses whatever platform-specific technique is needed to retrieve the information.
Matt Pietrek's 1996 and 1998 Microsoft Systems Journal articles are the best reference I could find on TEBs, aside from the Windows Internals book.
Believe it or not, this is useful information. I wrote some code recently that took a different code path based on whether it was writing to the stack or the heap, and using the TEB does the trick.
I have written about 7 pages on user-mode stacks in my upcoming concurrency book. This ranges from CLR stack frames to stack overflow to just how stacks work internally in Windows. I haven't found any book or resource that collects all of this information together in one place. It turns out that most developers don't need to worry about stacks at all, but this understanding is crucial to moving forward to more advanced concurrency programming models.
 Sunday, June 11, 2006
After posting my last article on creating a lazy allocation IAsyncResult, I received a few mails on the ordering of the completion sequence. It was wrong and has been updated. Thanks to those who pointed this out.
I used the following incorrect ordering: 1) set IsCompleted to true, 2) invoke the callback, and 3) signal the handle.
This can lead to deadlocks if the callback waits on the handle. My implementation carefully avoided EndXxx-induced deadlocks (by checking IsCompleted before waiting on the handle), but if the callback directly WaitOne's on the IAsyncResult.AsyncWaitHandle property, the callback will of course deadlock. Directly accessing the handle might be attractive to the callback author, especially for higher level orchestration via WaitAny and WaitAll. So it's probably something we'd like to support. One way to avoid this is to invoke the callback asynchronously with BeginInvoke, but a better solution is to use the correct ordering instead.
The correct ordering is: 1) set IsCompleted to true, 2) signal the handle, and 3) invoke the callback.
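To see why the order matters, here is a small sketch in portable C++, with a mutex and condition variable standing in for the wait handle. Completion is a hypothetical class and not the IAsyncResult implementation from the earlier post. Because the handle is signaled before the callback runs, a callback that waits on the handle returns immediately; swap steps 2 and 3 and that same callback would block forever.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical async result whose completion follows the three steps in order.
class Completion {
public:
    explicit Completion(std::function<void(Completion&)> callback)
        : m_callback(std::move(callback)) {}

    bool IsCompleted() const {
        std::lock_guard<std::mutex> lock(m_lock);
        return m_completed;
    }

    // Plays the role of waiting on the AsyncWaitHandle.
    void Wait() {
        std::unique_lock<std::mutex> lock(m_lock);
        m_cv.wait(lock, [this] { return m_signaled; });
    }

    void Complete() {
        {
            std::lock_guard<std::mutex> lock(m_lock);
            assert(!m_completed);  // Completing twice is a caller bug.
            m_completed = true;    // 1) set IsCompleted
            m_signaled = true;     // 2) signal the handle
        }
        m_cv.notify_all();
        if (m_callback) {
            m_callback(*this);     // 3) invoke the callback, last
        }
    }

private:
    mutable std::mutex m_lock;
    std::condition_variable m_cv;
    bool m_completed = false;
    bool m_signaled = false;
    std::function<void(Completion&)> m_callback;
};
```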
The first version I wrote had the correct ordering, since that seemed to be the logical choice. Unfortunately, the Framework Design Guidelines lists the steps in the wrong order, which led me down that path. I've let Brad and Krzys know about this. A customer who read my blog actually mailed Brad about this error too, just about simultaneously. There may be rationale behind this, but we've used the correct ordering in the file and network IO APIs since V1.0 so I think it's just wrong.
It's worth pointing out that the network classes already use a lazy allocation scheme very similar to the one I wrote about. Check out the System.Net.LazyAsyncResult internal type in System.dll. I'm advocating for moving file IO onto the same plan in the next release of the Framework. We'll see how it turns out.
Lastly, some might notice I originally said this would be a two part series. Well, I wrote a whole bunch of code to implement a sophisticated LRU-based resurrection caching scheme--to avoid allocating IAsyncResults every time--and then realized that my example doesn't do anything expensive on IAsyncResult creation that would warrant such a thing. The result? It was actually slower than the ordinary lazy version I posted a couple weeks back. I think the techniques I used are interesting nonetheless, so I am going to try and rework the example to incorporate some expensive buffer management, and then see where I stand. But I'm not promising anything just yet. And this was a great reminder that solving actual profiled problems is always more worthwhile than solving perceived, yet unmeasured, non-problems.
 Tuesday, May 30, 2006
Lock free code is hard. But it can come in handy in a pinch.
There have been some recent internal discussions about the IAsyncResult pattern and performance. Namely that, for high throughput scenarios, where the cost of the asynchronous work is small relative to the cost of instantiating new objects, there is a considerable overhead to using the IAsyncResult pattern. This is due to two allocations necessary to implement the pattern: (1) the object that implements IAsyncResult itself, and (2) the WaitHandle for consumers of the API who must access the IAsyncResult.AsyncWaitHandle property. I will address (2) in this article, since it is much more expensive than (1).
Update: I've posted an addendum to this article here.
Rendezvousing
Just to recap, there are four broad ways to rendezvous with the IAsyncResult pattern:
1. Poll the IAsyncResult.IsCompleted boolean flag. If it's true, the work has completed. If it's false, you can go off and do some interesting work, coming back to check it once in a while.
2. Supply a delegate callback to the BeginXxx method. This callback is invoked when the work completes, passing the IAsyncResult as an argument to your callback.
3. Wait on the IAsyncResult.AsyncWaitHandle. This is a Windows WaitHandle, typically a ManualResetEvent, which allows you to block for a while until the work completes.
4. Call the EndXxx method. Internally, this will often check IsCompleted and, if it's false, will wait on the AsyncWaitHandle.
Notice that in cases 1 and 2, the WaitHandle isn't even needed. And in case 4, it's only needed some fraction of the time. Well, it turns out we can avoid allocating it altogether for those cases where it's not used. We can "just" lazily allocate it. Note that for asynchronous IO, the majority of code will use method 2 above. For scalable servers, we often don't want to tie up an extra thread polling or waiting for completion, since that contradicts the primary benefits of Windows IO Completion Ports.
The Requirements
Notice I enclosed the word just above in quotes when mentioning lazy allocation. We could of course use a lock. But that would require that we allocate an object to lock against. We could of course just lock 'this', but that also comes with a performance overhead. We can get away with lock free code in this case, so long as we recognize a very important race condition that we must tolerate. Imagine this case: Thread A checks IsCompleted. It's false. So it accesses the AsyncWaitHandle property, triggering lazy allocation. Meanwhile, Thread B finishes the async work, and sets IsCompleted to true. We need to ensure a deadlock doesn't ensue.
This race could go one of two ways:
- Thread A lazily allocates and publishes the WaitHandle before Thread B sets IsCompleted to true. Thread B must now witness a non-null WaitHandle when it checks, and it must signal it. If the WaitHandle is left unsignaled, Thread A will wait on it and never be woken up. This is a deadlock.
- Thread B finishes first, setting IsCompleted to true, and seeing a null WaitHandle. Thread A must see IsCompleted as true and consequently return the event in a signaled state. Just like before, if this doesn't happen, Thread A will wait on an unsignaled WaitHandle which will never be signaled. Deadlock.
To ensure both of these cases work, Thread B's read of the WaitHandle field and Thread A's read of IsCompleted must each be preceded by a memory barrier. This ensures the memory accesses aren't reordered at the compiler or processor level, either of which could lead to the deadlock situations we are worried about. The CLR 2.0's memory model is not sufficient even with volatile loads, because the load acquire can still move "before" the store release.
An Implementation
Here is one simplistic implementation of a FastAsyncResult class, with ample comments embedded within to explain things:
using System;
using System.Threading;
class FastAsyncResult : IAsyncResult, IDisposable {
// Fields
private object m_state;
private ManualResetEvent m_waitHandle;
private bool m_isCompleted;
private AsyncCallback m_callback;
internal object m_internal;
// Constructors
internal FastAsyncResult(AsyncCallback callback, object state) {
m_callback = callback;
m_state = state;
}
// Properties
public object AsyncState {
get { return m_state; }
}
public WaitHandle AsyncWaitHandle {
get { return LazyCreateWaitHandle(); }
}
public bool CompletedSynchronously {
get { return false; }
}
public bool IsCompleted {
get { return m_isCompleted; }
}
internal object InternalState {
get { return m_internal; }
}
// Methods
public void Dispose() {
if (m_waitHandle != null) {
m_waitHandle.Close();
}
}
public void SetComplete() {
// We set the boolean first.
m_isCompleted = true;
// And then, if the wait handle was created, we need to signal it. Note the
// use of a memory barrier. This is required to ensure the read of m_waitHandle
// never moves before the store of m_isCompleted; otherwise we might encounter a
// race that leads us to not signal the handle, leading to a deadlock. We can't
// just do a volatile read of m_waitHandle, because it is possible for an acquire
// load to move before a store release.
Thread.MemoryBarrier();
if (m_waitHandle != null) {
m_waitHandle.Set();
}
// If the callback is non-null, we invoke it.
if (m_callback != null) {
m_callback(this);
}
}
private WaitHandle LazyCreateWaitHandle() {
if (m_waitHandle != null) {
return m_waitHandle;
}
ManualResetEvent newHandle = new ManualResetEvent(false);
if (Interlocked.CompareExchange(ref m_waitHandle, newHandle, null) != null) {
// We lost the race. Release the handle we created, it's garbage.
newHandle.Close();
}
if (m_isCompleted) {
// If the result has already completed, we must ensure we return the
// handle in a signaled state. The read of m_isCompleted must never move
// before the publication of m_waitHandle above; the interlocked
// compare-exchange is a full barrier and ensures that. And there's a race
// that could lead to multiple threads setting the event; that's no problem.
m_waitHandle.Set();
}
return m_waitHandle;
}
}
Notice also that we tolerate the race where two threads try to lazily allocate the handle. Only one thread wins. The thread that loses is responsible for cleaning up after itself so that we don't "leak" a WaitHandle (requiring finalization to close it). This is an example of tolerating races instead of preventing them, and is similar to the design we use for jitting code in the runtime, for example.
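For readers more at home in native code, the same shape can be sketched with C++ atomics. This is an analogue under stated assumptions, not a port: Event is a hypothetical stand-in that only records a signaled flag, LazyResult is a made-up name, and the default sequentially consistent atomics play the role that Thread.MemoryBarrier and Interlocked.CompareExchange play in the C# version.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a manual-reset event; it only tracks its state.
struct Event {
    std::atomic<bool> signaled{false};
    void Set() { signaled.store(true); }
};

class LazyResult {
public:
    ~LazyResult() { delete m_event.load(); }

    bool IsCompleted() const { return m_completed.load(); }

    // Called by the thread that finishes the work.
    void SetComplete() {
        m_completed.store(true);
        // Both operations are sequentially consistent, so this load cannot
        // be reordered before the store above.
        Event* e = m_event.load();
        if (e != nullptr) {
            e->Set();
        }
    }

    // Called by any thread that wants to wait; allocates on first use.
    Event* GetEvent() {
        Event* e = m_event.load();
        if (e == nullptr) {
            Event* fresh = new Event();
            Event* expected = nullptr;
            if (m_event.compare_exchange_strong(expected, fresh)) {
                e = fresh;     // We won the race and published our event.
            } else {
                delete fresh;  // We lost the race; ours is garbage.
                e = expected;  // Use the winner's event instead.
            }
            // Publication happened before this read, so if the work has
            // already completed we hand back a signaled event.
            if (m_completed.load()) {
                e->Set();
            }
        }
        assert(e != nullptr);
        return e;
    }

private:
    std::atomic<bool> m_completed{false};
    std::atomic<Event*> m_event{nullptr};
};
```

Either the completing thread sees the published event and sets it, or the allocating thread sees the completed flag and sets it; both may do so, which is harmless.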
Some Initial Results
I'll do a more thorough analysis as follow up to my next post, including profiling traces. But the initial results are promising.
I wrote a test harness that calculates the fibonacci series asynchronously, based on the sample code used in the Framework Design Guidelines book. As you can see by the comparisons, the larger the series being calculated, the less substantial the impact my performance work makes:
Size   Normal     Lazy       Improvement
1      177 ms     62 ms      185%
2      179 ms     63 ms      184%
3      180 ms     65 ms      177%
4      181 ms     66 ms      174%
5      187 ms     69 ms      171%
6      195 ms     79 ms      147%
7      210 ms     97 ms      116%
8      239 ms     122 ms     96%
9      275 ms     165 ms     67%
10     356 ms     257 ms     39%
12     745 ms     631 ms     18%
15     3217 ms    3075 ms    5%
As we would expect, as the ratio of the cost of computation to the cost of allocating the WaitHandle increases (with an increased "size" of the fibonacci series being calculated), the observed performance improvement decreases. (The Improvement column is (Normal - Lazy) / Lazy; for size 1, that is (177 - 62) / 62, or about 185%.) For very small computations, however, this technique can really pay off. In the case of high performance asynchronous IO, for example, where completion often involves simply marshaling some bytes between buffers, this can be a key step in the process of improving system throughput.
As I noted earlier, lock-free techniques are almost never worth the trouble unless you've measured a problem, especially due to the maintenance and testing costs for such code. And I jumped right over using a lock for allocation. It turns out that, in this particular scenario, that technique fares just as well as the lock-free code, and it is a lot simpler. It only incurs slight overhead when checking the handle to see if it has been allocated yet as well as when setting it at completion time. Since my test case never has to check the WaitHandle, it only has to enter the lock upon completion, which is relatively cheap. As always, start simple and then go from there.
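For comparison, here is roughly what the simpler lock-based variant looks like, again as a portable C++ sketch rather than the actual C# code. AsyncResult and Handle are hypothetical stand-ins; a real implementation would wrap an OS event instead of a bool.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical stand-in for a wait handle; "signaled" models the event state.
struct Handle { bool signaled = false; };

class AsyncResult {
public:
    // Lazily allocates the handle under the lock. If the operation already
    // completed, the handle is handed out in the signaled state.
    Handle* wait_handle() {
        std::lock_guard<std::mutex> guard(lock_);
        if (!handle_) {
            handle_ = std::make_unique<Handle>();
            handle_->signaled = completed_;
        }
        return handle_.get();
    }

    // Marks completion; signals the handle only if somebody asked for one.
    void complete() {
        std::lock_guard<std::mutex> guard(lock_);
        completed_ = true;
        if (handle_) handle_->signaled = true;
    }

    // True once a handle has actually been allocated.
    bool handle_allocated() {
        std::lock_guard<std::mutex> guard(lock_);
        return handle_ != nullptr;
    }

private:
    std::mutex lock_;
    bool completed_ = false;
    std::unique_ptr<Handle> handle_;
};
```

A caller that never asks for the handle pays for one uncontended lock acquisition in complete() and never allocates anything, which matches the behavior described above.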
C# 1.0 shipped with the ability to stack allocate data with the stackalloc keyword, much like C++'s alloca. There are restrictions, however, around what you can allocate on the stack: Inline arrays of primitive types or structs that themselves have fields of primitive types (or structs that etc...). That's it. C# 2.0 now allows you to embed similar inline arrays inside other value types, even for those that are allocated inside of a reference type on the heap, by using the fixed keyword.
And of course, you can allocate arrays of those value types on the stack too:
using System;
unsafe class Program {
struct A {
internal int x;
internal fixed byte y[1024];
}
public static void Main() {
byte * bb = stackalloc byte[2048];
Console.WriteLine("&bb : {0:X}", (uint)&bb);
Console.WriteLine("&bb[1] : {0:X}", (uint)&bb[1]);
Console.WriteLine("&bb[2048] : {0:X}", (uint)&bb[2048]);
A * a = stackalloc A[2048];
Console.WriteLine("&a : {0:X}", (uint)&a);
Console.WriteLine("&a->x : {0:X}", (uint)&a->x);
Console.WriteLine("&a->y[0] : {0:X}", (uint)&a->y[0]);
Console.WriteLine("&a->y[2048] : {0:X}", (uint)&a->y[2048]);
Console.WriteLine("&a[1] : {0:X}", (uint)&a[1]);
Console.WriteLine("&a[2048] : {0:X}", (uint)&a[2048]);
}
}
The use of this is of course almost always limited to unmanaged interop scenarios. For example, there's at least one place in the BCL where we use this to stack allocate the binary layout of a security descriptor that we then pass into the Win32 CreateMutex API, which avoids having to create a new interop struct. (Whether such hacks are a good thing to put in our code-base is another topic altogether...)
The stack allocated data doesn't outlive the stack frame, so as soon as you return from the function in which the stackalloc occurs, the data is gone. If you pass a pointer to it and somebody stores it, they could later try to dereference a pointer into dead (and possibly since reused) stack space. And reading or writing too far leads to buffer overruns or underruns, with stray writes bashing other data on the stack. Using this requires compilation with /unsafe, and needless to say, you need to be careful with it (if not avoid it altogether).
 Wednesday, May 24, 2006
In managed code, you can pass ByRefs "down the stack." You can't do much aside from that, however, other than use things like ldind and stind on them. And of course, you can cast them to native pointers, store them elsewhere, and so on, but those sorts of (evil) things are unverifiable.
Right? Well, sort of.
In Whidbey, we made a change to the verification rules such that a function can now return a ByRef to a caller. This of course is safe so long as the ByRef doesn't refer to a stack location. A field ref inside a heap-allocated object, static field ref, or an array element ref are all just fine. And of course, a function can just return a ByRef that was passed to it as an argument. Take a look at this IL:
.assembly extern mscorlib {}
.assembly byrefret {}
.class Program extends [mscorlib]System.Object {
    .field static int32 s_x
    .method static void Main() {
        .entrypoint
        call int32& Program::f()
        ldind.i4
        call void [mscorlib]System.Console::WriteLine(int32)
        call int32& Program::g()
        ldind.i4
        call void [mscorlib]System.Console::WriteLine(int32)
        ret
    }
    .method static int32& f() {
        ldsflda int32 Program::s_x
        ret
    }
    .method static int32& g() {
        .locals init (int32 x)
        ldloca x
        ret
    }
}
Function f verifies just fine, since it just returns a ByRef to a static field, whereas function g fails verification, because it returns a ByRef to a local on the stack. You actually can't write code to produce the IL shown above from any of Microsoft's compilers except for VC++, i.e. C# won't let you say "static ref int f() { return ref s_x; }".
Now, why would you ever want such a thing? VC++ needed it, for example, to implement STL.NET. Traditional STL returns references to elements inside of internal data structures, which can subsequently be modified. Without this support, such values would need to be copied, or the STL.NET APIs would have had to deviate from the traditional STL APIs.
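To make the STL point concrete, here is a tiny standard C++ sketch (element_at and element_copy are made-up helper names). Returning a reference lets the caller modify the element in place, while returning by value hands back a copy that is disconnected from the container:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns a reference to an element stored inside the container. This is safe
// because the element lives on the heap, not in this function's stack frame.
int& element_at(std::vector<int>& items, std::size_t index) {
    return items[index];
}

// Returns a copy instead; writes to the result never reach the container.
int element_copy(const std::vector<int>& items, std::size_t index) {
    return items[index];
}
```

Writing `element_at(v, 1) = 42;` updates the vector itself, which is the behavior a ByRef-returning method gives you on the CLR. The equivalent of function g, returning a reference to a local variable, is exactly the case the verifier still rejects.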
Interestingly, this doesn't change the ECMA Specification. It's always been loose on the issue, saying in Section 12.1.1.2 of Partition I: "Verification restrictions guarantee that, if all code is verifiable, a managed pointer to a value on the evaluation stack doesn't outlast the life of the location to which it points." Since you can't return a ByRef to a stack location, we don't violate this guarantee. Our previous verifier was simply being overly strict.
 Saturday, May 20, 2006
Via DBox and TBray, I stumbled upon Will Continuations Continue?, a great essay about why continuation support in modern VMs is not a good idea after all:
"By far he most compelling use case for continuations are continuation-based web servers. ... Rather than relying on the server’s stack to keep track of what location we’re looking at, the [UI] will be a view on a model ... When you pressed “Buy”, it would pass all the information necessary to complete the transaction onto the server. Consequently, we’ll have no more of a pressing need for continuations than traditional applications have today."
I couldn't agree more, although I arrive at the conclusion via a different line of thought.
Just over a year ago now, I was working on continuations for my Scheme interpreter and compiler, Sencha. I managed to create something that "worked" -- in the sense that the stack could be captured, passed around, and restored; and it even still reported locals as roots to the GC -- but there are so many facets of a modern runtime to consider that true product support would be a massive undertaking. I thought continuations were a good idea. Why? To be honest, the main reason was my simple goal of having a full-fidelity Scheme runtime. But I also admired their power.
In retrospect, I now realize something important: the stack is evil. It's a horrible representation of state, especially for web applications.
The stack is unnecessarily bound to an OS thread, and munges control flow with the "state" of the program. The fact that return addresses for function calls live on it has been the source of many security problems and counter-measures (/GS). When a thread blocks, the entire stack is wasted, even if there is logical work on it that could progress if it weren't for the arbitrary physical association. There's so much crap on it that to summarize the state of your entire program often requires pausing threads and walking their stacks. How dirty and impolite! Freak-of-nature abominations have twisted what the stack was meant for, e.g. COM and GUI reentrancy and APCs, completely disassociating logical and physical representations. You have to reserve a contiguous chunk of the thing per thread (often 1MB), wasting virtual memory space, because Windows doesn't support linked stack regions (not as big a deal on 64-bit as on 32-bit, sure), which also leads to the CLR ripping the process if you ever exceed it (overflow).
So many problems we encounter with parallel programming (among other domains) would go away with a more structured representation of the program as a state machine.
Dharma and the rest of the WF team are delivering just that (in the large). C# 2.0's iterator feature supplies a similar capability (in the small). The Concurrency and Coordination Runtime (CCR) eschews the stack in favor of orchestration and message passing. We'll converge at some point. And it won't be around serializing stacks, it will be around getting rid of the damned things.
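As a small illustration of what "state machine instead of stack" means, here is a hand-written C++ sketch in the spirit of what a compiler produces for an iterator. FibonacciMachine is a hypothetical example of mine, not something taken from WF, C#, or the CCR.

```cpp
#include <cassert>
#include <cstdint>

// A hand-written state machine that yields Fibonacci numbers one at a time.
// All of its state lives in the object rather than on a thread's stack, so it
// can be paused, stored, and resumed later, even from a different thread.
class FibonacciMachine {
public:
    explicit FibonacciMachine(int limit) : limit_(limit) {}

    // Advances to the next value; returns false once `limit` values were produced.
    bool move_next() {
        if (produced_ == limit_) return false;
        current_ = a_;
        std::uint64_t next = a_ + b_;
        a_ = b_;
        b_ = next;
        ++produced_;
        return true;
    }

    // The value produced by the most recent successful move_next().
    std::uint64_t current() const { return current_; }

private:
    int limit_;
    int produced_ = 0;
    std::uint64_t a_ = 0, b_ = 1, current_ = 0;
};
```

Between two calls to move_next() nothing about the computation is tied to a particular thread: the whole "continuation" is four fields in an ordinary object.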
 Monday, May 15, 2006
The use of parallel spreadsheet calculations in Excel 12 is a great example of how software vendors can use multi-core CPUs to vastly improve the user experience.
There is some great stuff here: Intelligent parallel execution based on dependency analysis; near-linear speedup for spreadsheets with minimal dependencies; an extension model, where user-defined functions can be written either thread-safe or thread-unsafe, and be treated accordingly by the engine; user-defined thread-counts for functions that perform blocking operations; among others.
 Sunday, May 07, 2006
One of the challenges when designing reusable software that employs hidden parallelism -- such as a BCL API that parallelizes sorts, for-all style data parallel loops, and so forth -- is deciding, without a whole lot of context, whether to run in parallel or not. A leaf level function call running on a heavily loaded ASP.NET server, for example, probably should not suddenly take over all 16 already-busy CPUs to search an array of 1,000 elements. But if there are 16 idle CPUs and one request to process, doing so could reduce the response latency and make for a happier user. Especially for a search of an array of 1,000,000+ elements, for example. In most cases, before such a function "goes parallel," it has to ask: Is it worth it?
Answering this question is surprisingly tough. Running parallel at a high level might be more profitable, such as enabling multiple incoming ASP.NET requests to be processed, but often fine-grained parallelism can lead to better results. And just as often, a combination of the two works best. Consider an extreme case: Imagine that most ASP.NET web requests for a particular site ultimately acquire a mutually exclusive lock on a resource, essentially serializing a portion of all web requests. Of course, this is a design that's going to kill scalability eventually. But regardless, it could be present to a lesser degree, and might actually be an architectural requirement of the system. Executing some finer-grained operations in parallel might lead to better throughput in this case, especially those performed while the lock is held.
And clearly, the act of parallelizing an algorithm is not just based on the static properties of the system itself, but also dynamic capabilities and utilization of the machine. There are some APIs that allow dynamic querying of the machine state, which can aid in this process, e.g.:
- System.Environment.ProcessorCount: This property (new in 2.0) tells you how many hardware threads are on the system. Note that the number includes hyper-threads on Intel architectures, which really shouldn't be counted as a full parallel unit when deciding whether to parallelize your code. GetSystemInfo can give you richer information, albeit with some P/Invoke nonsense. We should give a better interface into this data for the next version of the Framework.
- Processor:% Processor Time performance counter: This gives you the % utilization of a specific processor and allows asynchronous querying. Using it, you could query each processor on the system to figure out what the overall system utilization is, and specifically how many sub-parts to break your problem into. The CLR thread-pool uses this today to decide when to inject or retire threads. You can use it too to determine whether introducing parallelism is a wise thing to do. Although your code may not have a lot of "context," this is often a good heuristic that even leaf level algorithms can use.
- System:Processor Queue Length performance counter: For more sophisticated situations, you can not only key off of the processor utilization, but also off the queue length of processes waiting to be scheduled. For a really deep queue (say, more than 2x the number of processors), introducing additional work is likely to lead to unnecessary waiting.
Using these is apt to lead to statistically good decisions. But clearly this is a heuristic, and as such the state of the system could change dramatically immediately after obtaining the values, perhaps making your decision look naive and ill-conceived in retrospect. The worst case could be bad, but perhaps not terrible. The worst aspect of this is that performance characteristics could vary dramatically, and your users might prefer predictable execution over sometimes-fast execution. The good news is that each of these functions is fairly cheap to call, amounting to less than 0.5ms total in some quick-and-dirty tests I wrote that read from all three.
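A heuristic built on those three inputs might look something like the following C++ sketch. choose_parallelism is a hypothetical function, and the 10,000 elements-per-worker threshold is an arbitrary assumption of mine; only the "more than 2x the number of processors" queue rule comes from the list above. In real code the inputs would come from the APIs mentioned (or, in C++, from something like std::thread::hardware_concurrency()).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Picks how many workers to use for `elements` items of work.
//   hardware_threads: number of hardware threads on the machine
//   utilization:      current overall CPU utilization in the range [0, 1]
//   queue_length:     runnable threads currently waiting for a CPU
// The per-worker threshold is an illustrative assumption, not a measured value.
unsigned choose_parallelism(std::size_t elements, unsigned hardware_threads,
                            double utilization, unsigned queue_length) {
    const std::size_t min_elements_per_worker = 10000;

    if (hardware_threads <= 1) return 1;
    if (queue_length > 2 * hardware_threads) return 1;  // machine is saturated

    // Estimate how many CPUs are idle right now.
    double u = std::min(1.0, std::max(0.0, utilization));
    unsigned idle_cpus = static_cast<unsigned>(hardware_threads * (1.0 - u));

    // Never use more workers than the amount of work can keep busy.
    std::size_t useful = elements / min_elements_per_worker;
    std::size_t workers = std::min<std::size_t>(idle_cpus, useful);
    return static_cast<unsigned>(std::max<std::size_t>(workers, 1));
}
```

With these assumptions, searching 1,000 elements on 16 busy CPUs stays sequential, while searching 1,000,000 elements on 16 idle CPUs fans out to all 16.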
But spending any time answering the question is tricky business. Assuming the software dynamically executes some code to decide if, and to what degree, we should run in parallel, and assuming these calculations are not done in parallel themselves ;), all of this work amounts to a fixed overhead on some part of the overall system, reducing overall parallel speedup (due to Amdahl's Law). We hope that in the future we can hide a lot of this messy work in the guts of the runtime and WinFX stack, but for now it's mostly up to you to decide.
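To put a number on the Amdahl's Law point, here is the standard formula as a small C++ function (the function name is mine). Any time spent sequentially deciding whether to go parallel counts toward the serial fraction.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's Law: the speedup on `cpus` processors when only `parallel_fraction`
// of the single-threaded running time can be spread across them.
double amdahl_speedup(double parallel_fraction, unsigned cpus) {
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / cpus);
}
```

For example, if 10% of the work is serial (decision overhead included), 16 CPUs give a speedup of 1 / (0.1 + 0.9/16) = 6.4 rather than 16.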
Databases have utilized parallelism for a long time now to effectively scale-up and scale-out with continuously improving chip and cluster technologies. Consider a few high-level examples:
Parallel query execution is employed by all sophisticated modern databases, including SQL Server and Oracle. This comes in two flavors: (1) execution of multiple queries simultaneously which potentially access intersecting resources, and (2) implicit parallelization of individual queries, to achieve speed-ups even when a large quantity of incoming work is not present (e.g. high-cost queries, lots of data, etc.). Often a combination of both is used dynamically in a production system. I won't say much more, other than to refer to an interesting new query technology on the horizon.
Transactions are used as a simple model for concurrency control, enabling high scalability due to dynamic fine-grained locking techniques and policies, while supplying conveniences such as intelligent contention management and deadlock detection. And of course reliability is improved, because of the all-or-nothing semantics of transactions. Even in the face of asynchronous thread aborts, a transaction can ensure inconsistent state isn't left behind to corrupt a process, greatly improving the reliability of software at a surprisingly low cost. Software transactional memory (STM) borrows directly from the field, and applies it to general purpose parallel programming.
Invariants about data in databases are often modeled as integrity checks and foreign key constraints, which help to maintain reliable and consistent execution even in the face of concurrency. This, coupled with transactions, helps to ensure invariants aren't broken at transaction boundaries, and recent work done by MSR explores how this might be applied to general programming. STM combined with a rich system like Spec# could facilitate highly reliable and consistent systems that don't expose latent race conditions in the face of parallel execution.
Assuming you have (1) a lot of data to process, (2) complex computations to perform, and/or (3) simply a lot of individual tasks to accommodate, this model of parallel programming stretches quite far. With many cores per CPU, TB disks, and 100+-GB memories on desktops just around the corner; an order of magnitude more network bandwidth available to consumers; and a continuing explosion of the amount of information humans generate and have to make some sense of, similar approaches could enable the next era of computer applications. I will also observe that surprisingly similar models of computation are precisely what fuel technologies like Google's MapReduce, albeit at a coarser granularity.
 Wednesday, May 03, 2006
Raymond's recent post talks about queueing user-mode APCs in Win32.
When you block in managed code, the CLR is responsible for figuring out the correct style of wait. This ends up in a CoWaitForMultipleHandles (on Win2k+) or MsgWaitForMultipleObjectsEx if you're executing in an STA; else, this ends up in a non-pumping wait, such as WaitForSingleObjectEx/WaitForMultipleObjectsEx. In any case, the wait is alertable, meaning that user-mode APCs will have a chance to run. There are various blocking calls hidden in Win32 and the CLR itself, so it's not guaranteed that all waits are alertable; but any that originate from managed code are, which we hope is a significant percentage.
This code illustrates a simple user-mode APC reentering as we do an alertable wait (via Thread.CurrentThread.Join(0)):
using System;
using System.Runtime.InteropServices;
using System.Threading;
static class Program {
    static void Main() {
        QueueUserAPC(
            delegate { Console.WriteLine("APC fired"); },
            GetCurrentThread(), UIntPtr.Zero);
        Console.WriteLine("Doing join");
        Thread.CurrentThread.Join(0);
        Console.WriteLine("Finishing join");
    }
    delegate void APCProc(UIntPtr dwParam);
    [DllImport("kernel32.dll")]
    static extern uint QueueUserAPC(APCProc pfnAPC, IntPtr hThread, UIntPtr dwData);
    [DllImport("kernel32.dll")]
    static extern IntPtr GetCurrentThread();
}
While this technique seems like an effective way to reuse a thread while it is blocked -- for example, you might contemplate doing this for thread-pool threads -- a little problem called thread affinity tends to arise. I wrote about this in terms of COM reentrancy before. An APC reentering doesn't perform a context transition, so even if we used a logical context to store such state, the problem would still exist. The simple fact is that user-mode APCs are good for system bookkeeping, but not for running general purpose code that modifies arbitrary program state.
 Saturday, April 29, 2006
I don't know what's publicly available about our future ship schedules. But regardless, we begin M1 -- our first real coding milestone for the next version of the CLR -- on Monday. There's been some work going on in the meantime, of course, limited mostly to prototyping, design, and prioritization, but it's finally time to get serious, write real product code, and start hitting dates.
One fairly large item on our schedule is revamping our thread-pool. Our primary aim there is to enable fine-grained parallelism, and to supply new scheduling features that many people have asked for in the past. Today, coarse-grained parallelism is more attractive due to the costs associated with scheduling and dispatching work items, but we are going to change that.
This includes these tentative high level items:
- Low performance overhead of queueing and dispatching work
- Deadlock avoidance (surging) due to 100% blocking
- Queue partitioning and isolation
- Prioritization of work items
- Cancellation of work items, possibly with support for Vista IO Cancellation
- NUMA awareness such as CPU affinitization and/or user-hinted node affinitization
- And, of course, enhanced debugging and diagnostics
We'd love any feedback on any of these, including which sound more or less important to you. And if you have an interesting problem or scenario we might not have considered, please, please, please let me know.
A colleague of mine recently referred me to the Cilk work at MIT. This paper supplies a good overview. We've been slowly arriving at a similar design, so it's great to have prior art from which to draw. The idea most important with respect to the thread-pool is how multiple queues can be backed by a single physical thread store, and further the way in which queues are dynamically load balanced via thread leases and work stealing.
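For readers who haven't seen work stealing before, the core idea fits in a few lines. The following C++ sketch is deliberately naive: it uses a mutex-protected deque per worker, whereas Cilk's real scheduler uses a much cheaper protocol, and it doesn't model thread leases at all. WorkQueue and next_item are hypothetical names.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// One queue per worker. The owner pushes and pops at the back (LIFO, which
// keeps recently created work cache-warm); thieves take from the front.
class WorkQueue {
public:
    void push(int item) {
        std::lock_guard<std::mutex> guard(lock_);
        items_.push_back(item);
    }

    // Called by the owning worker: takes the newest item, if any.
    bool pop_local(int& out) {
        std::lock_guard<std::mutex> guard(lock_);
        if (items_.empty()) return false;
        out = items_.back();
        items_.pop_back();
        return true;
    }

    // Called by other workers: takes the oldest item, if any.
    bool steal(int& out) {
        std::lock_guard<std::mutex> guard(lock_);
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop_front();
        return true;
    }

private:
    std::mutex lock_;
    std::deque<int> items_;
};

// A worker drains its own queue first and only then steals from a victim.
bool next_item(WorkQueue& own, WorkQueue& victim, int& out) {
    return own.pop_local(out) || victim.steal(out);
}
```

Load balancing falls out naturally: a busy worker mostly touches only its own queue, and idle workers go looking for work instead of waiting to be handed some.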
 Saturday, April 22, 2006
By now you’ve probably read things like Herb Sutter’s free lunch paper. And if you follow my blog at all, you’ll know that I do a bit of writing and thinking about how Microsoft can make our platform better suited for the multi-core era that stands in front of us.
Most people, when considering the topic of parallelism vis-à-vis multi-core, start by jumping straight to the bottom of the stack. I’ll admit that I sure did. They think about threads, locks, and the associated headaches. Some even think about the chip architecture and memory hierarchy. They take it for granted that the work exists. But these same people seldom stop to think—or when they do think often hit the same wall—about what workloads will actually substantially benefit from massive amounts of parallelism. This is a difficult topic.
Scientific computing of course has this nailed pretty good already. But how much of the code do you write that actually resembles scientific problems, like n-bodies, heat transfer, fluid dynamics, and the like? My guess is that, for most of Microsoft’s customers, the answer is: Not much. That’s especially true on the client, where data-intensive operations are often shipped to a high-end server for processing, leaving what amounts to quasi-workflow orchestration initiated by UI events, for example. I’m not going to refute the massive gains in CPU scalability we’ve seen over the past 10 years due to superscalar execution, via techniques like pipelining and branch prediction, and the effect that has had on client and server programs alike. But for most application code today, the network and disk are the limiting factors, not the CPU.
Of course, to the extent that there is work the CPU must perform for any problem—even for IO-bound ones—code needs to be architected to separate logical tasks, ensuring that a bunch of otherwise ready-to-run work doesn’t get backed up behind a blocking call unnecessarily. And of course, separating logical work is important for other reasons, like avoiding a hung UI thread. Unfortunately, we don’t make this overly easy today. Neither the Win32 and WinFX APIs nor the associated documentation and tool support are overly helpful when it comes to figuring out the performance characteristics of the code they invoke, including latency and blocking. This makes it tricky to architect things as I suggest. New programming models like the CCR provide the infrastructure that could facilitate such a shift, but it will take hard work to get to a reasonable place.
Back to workloads. Consider server applications for a moment. The model of concurrency there is actually quite simple. And in fact I believe the majority of server programs will be equipped to exploit multi-core right away. Each incoming request is considered a logical task and is assigned to an available thread of work, often using the CLR’s thread-pool. Sharing between concurrent requests is (hopefully) minimal, meaning that the one-thread-per-request model leads to naturally good scaling. This works up to a point. Once the average number of available CPUs surpasses the average number of incoming workers, the need to assign multiple CPUs to a single request becomes more important. This is obviously very workload dependent. Databases already do this with individual queries. They use a single-thread-per-request model, but often use individual query parallelization to get better utilization. SQL Server added support for this in 7.0. I’ve been working quite a bit over the past year on similar techniques for LINQ. I’m almost to the point where I can disclose more information publicly, in the form of a paper.
Search is clearly a workload of recent importance that, whether on the client or server, benefits tremendously from parallel execution. This applies not only to the act of searching, but also to the act of indexing the data in preparation for search. MSN and Google’s current desktop search products are cognizant not to interrupt your primary work by doing indexing while your computer is idle. But given a bunch more cores, they needn’t wait. Further, parallelizing search is a well researched topic. You still need to solve some tough problems like ensuring parallel tasks aren’t contending heavily for the disk (becoming IO bound), but it’s very possible.
There are of course other workloads. Graphics processing on modern computers is extremely parallel, currently handled by the GPU. But I am going to wrap up, and summarize all of this by saying: It remains to be seen whether most mainstream Windows programs can become highly parallel, and if they can, how profitable it will be. We'll also find out over time whether reaching that stage will require radically new programming models and a gradual shift over time. I am optimistic, and confident that parallel execution is the direction we ultimately need to go down. Surely the workloads are there, seemingly obscured by the traditional sequential approach to software.
 Sunday, April 16, 2006
I'm writing an article for an upcoming MSDN Magazine CLR Inside Out column. And I am looking for topic suggestions.
Of course, my expertise is around concurrency, but I'm also a CLR internals-kinda geek. So, what do you want to read about?
I have some ideas. But I'll post them after I hear yours.
I wrote about torn reads previously, in which, because loads from and stores to > 32-bit data types are not actually "atomic" on a 32-bit CPU, obscure magic values are seen in the program from time to time. This isn't as scary as "out of thin air" values, but can be troublesome nonetheless. I noted that, by using a lock, you can serialize access to the location to ensure safety.
You can of course write such thread-safe code that avoids taking a lock, usually motivated by performance. Vance has a pretty detailed write-up of this on MSDN. Most of the time, you shouldn't try to be so clever, as it will get you in trouble sooner or later, and is even worse to debug than a typical race. But for really hot code-paths, it can make a measurable difference. (Note the key word: measurable. If you've measured a problem, you might consider such techniques... but otherwise, stay far, far away. (Have I made enough qualifications and disclaimers yet?))
If you access individual pointer-sized byte segments of the data structure, such as 32-bit aligned segments (e.g. volatile or __declspec(align(x)) in VC++, all values on the CLR), you can load and store in a known order. Furthermore, you need to use the appropriate types of loads and stores with fences in the appropriate places; load/acquire and store/release are usually adequate. You can then use the intrinsic properties of this order to make statements about the correctness of your algorithm.
For example, imagine you have some code that increments a 64-bit counter on a 32-bit system. Aside from overflow, the value always increases. If you always increment the low 32-bits, followed by the high, and if you always read the high, followed by the low, you'll be guaranteed that, should you read a torn value, it will be too low rather than too high (not counting for overflow, of course). Sometimes it can be really low, such as when the low 32-bits wrap back to 0, in which case the higher 32-bit increment needs to carry one. Depending on your situation, this might be precisely what you are looking for. (I wrote some code last week that needed exactly this.)
For example, your typical code might read and write under a lock, in VC++/Win32:
ULONGLONG ReadCounter_Lock(
    volatile ULONGLONG * pTarget, CRITICAL_SECTION * pCs)
{
    ULONGLONG val;
    EnterCriticalSection(pCs);
    val = *pTarget;
    LeaveCriticalSection(pCs);
    return val;
}
ULONGLONG IncrCounter_Lock(
    volatile ULONGLONG * pTarget, CRITICAL_SECTION * pCs)
{
    ULONGLONG val;
    EnterCriticalSection(pCs);
    val = *pTarget;
    *pTarget = val + 1;
    LeaveCriticalSection(pCs);
    return val;
}
But, using the load/store order described above, it can become lock free:
#define LO_LONG(x) (reinterpret_cast<volatile LONG *>((x)))
#define HI_LONG(x) (reinterpret_cast<volatile LONG *>((x)) + 1)
ULONGLONG ReadCounter_NoLock(volatile ULONGLONG * pTarget)
{
    ULONGLONG val;
#ifdef _WIN64
    val = *pTarget;
#else
    // Read high 32-bits first, then low:
    *HI_LONG(&val) = *HI_LONG(pTarget);
    *LO_LONG(&val) = *LO_LONG(pTarget);
#endif
    return val;
}
// Note: unlike IncrCounter_Lock, this returns the value after the increment,
// because InterlockedIncrement returns the incremented value.
ULONGLONG IncrCounter_NoLock(volatile ULONGLONG * pTarget)
{
    ULONGLONG oldVal;
#ifdef _WIN64
    oldVal = static_cast<ULONGLONG>(
        InterlockedIncrement64(reinterpret_cast<volatile LONGLONG *>(pTarget)));
#else
    // Increment the low 32-bits first, then high:
    if ((*LO_LONG(&oldVal) = InterlockedIncrement(LO_LONG(pTarget))) == 0) {
        *HI_LONG(&oldVal) = InterlockedIncrement(HI_LONG(pTarget));
    } else {
        *HI_LONG(&oldVal) = *HI_LONG(pTarget);
    }
#endif
    return oldVal;
}
It's obvious which is simpler to code, understand, and maintain. But the latter technique can come in handy when you're in a pinch.
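If you want to experiment with the ordering trick without Win32, the same idea can be sketched with std::atomic in portable C++. This is a simplified model, not a drop-in replacement for the code above: SplitCounter, increment_counter, and read_counter are hypothetical names, the two halves are stored as separate 32-bit atomics, and everything uses the default sequentially consistent ordering.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// A 64-bit counter stored as two 32-bit halves, as a 32-bit CPU would see it.
struct SplitCounter {
    std::atomic<std::uint32_t> lo{0};
    std::atomic<std::uint32_t> hi{0};
};

// Writer: bump the low half first; carry into the high half on wrap-around.
void increment_counter(SplitCounter& c) {
    if (c.lo.fetch_add(1) == 0xFFFFFFFFu) {
        c.hi.fetch_add(1);
    }
}

// Reader: load the high half first, then the low half. A torn read can then
// only be too low, never too high (ignoring overflow of the full 64 bits).
std::uint64_t read_counter(const SplitCounter& c) {
    std::uint64_t high = c.hi.load();
    std::uint64_t low = c.lo.load();
    return (high << 32) | low;
}
```

The "really low" case from earlier is easy to see here: if a reader runs between the fetch_add on lo (which just wrapped to 0) and the carry into hi, it observes the old high half together with a low half of 0.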
For information on other similar techniques, including multi-word CAS and object-based STM, Tim Harris's recent "Concurrent programming with locks" paper is an excellent read. Most of it isn't built and ready for you to use today, but the details of the algorithms are in there if you'd like to play around a little. And there's a lot of literature out there about creating lock-free data structures. Interestingly, you can end up worse off than if you'd used a lock in the first place. Many such lock-free algorithms are optimistic, meaning that they do a bunch of work hoping not to run into contention; when they do, they have to throw away work, rinse, and repeat. Your mileage can vary dramatically based on workload.
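The optimistic "try, and redo the work on contention" shape is easiest to see in a compare-and-swap loop. Here is a minimal C++ sketch; store_max is a made-up example, not taken from the paper. It raises a shared value to at least some candidate and counts how many attempts had to be thrown away.

```cpp
#include <atomic>
#include <cassert>

// Optimistically raises `target` to at least `candidate`. Returns the number
// of attempts that were wasted, either because another thread changed the
// value first or because the weak compare-exchange failed spuriously.
int store_max(std::atomic<int>& target, int candidate) {
    int wasted = 0;
    int observed = target.load();
    while (observed < candidate) {
        // On failure, compare_exchange_weak reloads `observed` with the
        // current value, and the loop re-checks whether work remains.
        if (target.compare_exchange_weak(observed, candidate)) break;
        ++wasted;
    }
    return wasted;
}
```

Under light contention the wasted count stays at or near zero and this beats a lock; under heavy contention every failed attempt is work done for nothing, which is where the "worse off than a lock" scenario comes from.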
 Thursday, March 30, 2006
Wow, not only does Vance Morrison have a blog--he's a performance architect on the CLR team--but he recently wrote two articles on reader-writer locks:
In them, he walks through a custom implementation of a lock, and then does some insightful performance analysis on it. As usual with Vance's writing, it's very detailed and precise.
 Saturday, March 25, 2006
The profiler that ships with Visual Studio is great for "real" CPU profiling. But let's face it: there are still some situations where a good ole' stopwatch works just fine (of the System.Diagnostics.Stopwatch variety). For example, when you're trying to do some quick and dirty measurement on a very specific region of code, and don't want to deal with the rest of the noise.
The BCL stopwatch isn't inherently thread-safe. Even if you protect access to it (and somehow account for the overhead of doing so), it maintains a single counter. In the past, I've wanted to measure the total amount of time spent inside a select few regions of code, across all threads. The profiler works for this, but you need to get the sampling granularity right, and deal with all of the extra data collected. Then you have to mine it.
So I whipped up a stopwatch that maintains a counter that is the cumulative total of all threads that have started/stopped it, across an entire AppDomain. It's nothing incredibly clever, but I've found it to be quite useful. In many code projects I have, I've simply set up a file that declares a bunch of static ThreadSafeStopwatches, which I can then just call from anywhere.
Here's the code (also available for download here: ThreadSafeStopwatch.cs):
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
/// <summary>
/// This class enables AppDomain-wide profiling of multi-threaded
/// code, by tracking a cumulative number of ticks spent across all
/// threads. If multiple threads start the same watch in parallel,
/// this class ensures that we count time for both threads, and that
/// they do not interfere with each other. It does this by storing an
/// internal stop-watch in TLS, and an instance-wide tick counter.
///
/// Note that the cumulative count is only ever incremented if a
/// thread actually calls Stop in its own stop-watch. If a thread
/// routine terminates w/out stopping the watch, it's as if it never
/// began.
/// </summary>
public sealed class ThreadSafeStopwatch : IDisposable
{
/** Fields **/
private long ticks;
// These thread-specific fields are used to maintain a cache
// of thread-safe stopwatches to actual stopwatches. For long
// running threads, we can build up some amount of trash over
// time, so a cache management scheme is implemented via a
// combination of mechanisms: Dispose, ~ThreadSafeStopwatch,
// GetWatch, and PruneCache.
[ThreadStatic]
private static Dictionary<WeakReference, Stopwatch> threadWatches;
[ThreadStatic]
private static int cacheCounter;
/** Properties **/
public long ElapsedTicks
{
get { return ticks; }
}
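public float ElapsedMilliseconds
{
// Stopwatch ticks are counted in units of Stopwatch.Frequency (ticks
// per second), which need not match TimeSpan's 100ns ticks.
get { return (float)ticks * 1000 / Stopwatch.Frequency; }
}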
/** Methods **/
~ThreadSafeStopwatch()
{
Dispose(false);
}
public void Dispose()
{
Dispose(true);
// Explicit disposal makes the finalizer redundant.
GC.SuppressFinalize(this);
}
private void Dispose(bool disposing)
{
// Clean up the associated cache entry with this object.
// NOTE: I am explicitly not guarding this code-path by if
// (disposing) { ... } because the Dictionary is not finaliz-
// able. Thus, I know it is safe to access it. In the case of
// non-process-exit finalizers, we do want to clean up the
// associated cache entry, thus we access another object on
// the finalizer code-path.
if (!Environment.HasShutdownStarted)
{
Dictionary<WeakReference, Stopwatch> watches = threadWatches;
WeakReference thisRef = new WeakReference(this);
if (watches != null && watches.ContainsKey(thisRef))
{
// Deallocate the cache entry:
watches.Remove(thisRef);
}
}
}
const int threadCachePruneCount = 100;
private Stopwatch GetWatch()
{
if (threadWatches == null)
{
// First time called on this thread, allocate a new dictionary:
threadWatches = new Dictionary<WeakReference, Stopwatch>(
WeakRefEqualityComparer.Comparer);
}
else
{
// This has been called before. Increment the thread counter;
// every so often, we prune out trash being held alive by our
// cache.
if (++cacheCounter % threadCachePruneCount == 0)
{
PruneCache();
cacheCounter = 0;
}
}
// Now look for the associated stopwatch:
Stopwatch sw;
if (threadWatches.TryGetValue(new WeakReference(this), out sw))
return sw;
// If we didn't find the stopwatch, simply return null to
// indicate that the caller needs to allocate a new one:
return null;
}
private void PruneCache()
{
// BUGBUG: This is probably a poor cache management policy. But
// I'm only enabling this in DEBUG builds for now, and until I
// run into a real problem, I'm not spending time on it.
List<WeakReference> toRemoveWrefs = null;
// Look for dead references, and add them to the list (which is
// lazily created, by the way).
foreach (WeakReference wr in threadWatches.Keys)
{
// If the weak-reference is no longer alive, we add it to the
// list of 'to-remove' stop-watches.
if (!wr.IsAlive)
{
if (toRemoveWrefs == null)
toRemoveWrefs = new List<WeakReference>();
toRemoveWrefs.Add(wr);
}
}
// If we found any dead entries, remove them now.
if (toRemoveWrefs != null)
{
foreach (WeakReference wr in toRemoveWrefs)
{
threadWatches.Remove(wr);
}
}
}
[Conditional("DEBUG")]
public void Start()
{
// We look in TLS to see if this thread has already allocated a
// stopwatch for the current thread-safe stopwatch. This is
// thread-safe, of course, since each thread gets their own list
// of Stopwatches (reentrancy aside--there aren't any blocking
// points below):
// Since we are about to retrieve something from TLS, and use it
// across a set of paired operations (Start/Stop), we mark the
// beginning of a thread-affinity region.
Thread.BeginThreadAffinity();
// Access TLS:
Stopwatch sw = GetWatch();
if (sw == null)
{
// No watch was found, allocate a new one and publish it.
sw = new Stopwatch();
threadWatches.Add(new WeakReference(this), sw);
}
// First, ensure we haven't begun it yet. If the stopwatch is
// already running, we ignore this call. This is consistent with
// the System.Diagnostics.Stopwatch class's behavior.
if (!sw.IsRunning)
{
// And if that check succeeds, start the stop-watch ticking.
sw.Start();
}
}
[Conditional("DEBUG")]
public void Reset()
{
// Get the current stopwatch in TLS -- see above comments (in
// Start) for details on thread-safety.
Stopwatch sw = GetWatch();
// If we found one, reset it.
if (sw != null)
sw.Reset();
// And also set our cumulative ticks to 0.
ticks = 0;
}
[Conditional("DEBUG")]
public void Stop()
{
// Get the current stopwatch in TLS -- see above comments (in
// Start) for details on thread-safety.
Stopwatch sw = GetWatch();
// First, ensure we are running. If the stopwatch isn't running
// yet, we ignore this call. This is consistent with the System.
// Diagnostics.Stopwatch class's behavior.
if (sw != null && sw.IsRunning)
{
// Add the stopwatch's total time to our instance counter.
// This has to be an interlocked operation, because the whole
// point of this class is to be shared across threads. 'ticks'
// is the only instance state.
Interlocked.Add(ref ticks, sw.ElapsedTicks);
// We reset the stopwatch because we want to start at 0 upon
// the next invocation to 'Start' -- the cumulative time is
// kept in the 'ticks' variable.
sw.Reset();
// We can now end the thread-affinity that was started in the
// Start operation above.
Thread.EndThreadAffinity();
}
}
class WeakRefEqualityComparer : IEqualityComparer<WeakReference>
{
internal static WeakRefEqualityComparer Comparer =
new WeakRefEqualityComparer();
public bool Equals(WeakReference wr1, WeakReference wr2)
{
// For purposes of our hash-table, if two weak-references
// refer to the same object, we consider them equal.
object o1 = wr1.Target;
object o2 = wr2.Target;
if (!wr1.IsAlive || !wr2.IsAlive)
return false;
// We shouldn't ever have null weak-references that aren't
// dead.
Debug.Assert(o1 != null && o2 != null);
// If the two underlying objects are equal, we pretend the
// weak-refs are too.
return o1.Equals(o2);
}
public int GetHashCode(WeakReference wr)
{
object o = wr.Target;
// Just return 0 for dead objects. We actually shouldn't
// ever use a dead object for hashing, although there could
// be some benign races above that result in this case. I
// haven't convinced myself otherwise, and they will get clean-
// ed up with the normal finalization code-path. It's A-OK.
if (!wr.IsAlive)
return 0;
// Again, shouldn't get a live weak-ref that has a null
// object ref.
Debug.Assert(o != null);
// Now, simply return the underlying object's hash-code.
return o.GetHashCode();
}
}
}
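For comparison, here is a condensed sketch of the same idea. It assumes a framework version that ships System.Threading.ThreadLocal<T> (.NET 4 or later), which the class above does not rely on. The name CumulativeStopwatch is hypothetical, and the sketch deliberately leaves out the weak-reference cache, the pruning, and the [Conditional("DEBUG")] gating:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical, simplified variant: one private Stopwatch per thread,
// plus a shared tick total that is only updated with interlocked ops.
public sealed class CumulativeStopwatch
{
    private long ticks;
    private readonly ThreadLocal<Stopwatch> local =
        new ThreadLocal<Stopwatch>(() => new Stopwatch());

    // Interlocked.Read avoids a torn 64-bit read on 32-bit platforms.
    public long ElapsedTicks
    {
        get { return Interlocked.Read(ref ticks); }
    }

    // Stopwatch ticks are in units of Stopwatch.Frequency per second.
    public double ElapsedMilliseconds
    {
        get { return ElapsedTicks * 1000.0 / Stopwatch.Frequency; }
    }

    public void Start()
    {
        Stopwatch sw = local.Value;
        if (!sw.IsRunning)
            sw.Start();
    }

    public void Stop()
    {
        Stopwatch sw = local.Value;
        if (sw.IsRunning)
        {
            sw.Stop();
            // Publish this thread's contribution to the shared total.
            Interlocked.Add(ref ticks, sw.ElapsedTicks);
            sw.Reset();
        }
    }
}
```

A typical use would be a static field such as `static readonly CumulativeStopwatch ParseWatch = new CumulativeStopwatch();`, with `ParseWatch.Start()` and `ParseWatch.Stop()` bracketing the region of interest on whichever thread runs it.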
The Whidbey version of Rotor just went up for download on MSDN on Thursday. I downloaded it and built it this morning.
While some of the big rock features are getting press (e.g. generics, LCG, anonymous methods/closures, etc.), some of the smaller features are plenty cool, and can be grokked in their entirety in much less time. For example, do a 'grep -i nullable clr/src/*/*.*' if you want to see what went into implementing the Nullable DCR that Soma mentioned over here. And check out stuff like the WrapNonCompliantException function in vm/excep.cpp (and its callsites) to see how non-CLS exceptions get wrapped. And of course there's all that reliability work that went into Whidbey, leading to things like SafeHandles (vm/safehandle.cpp), and OOM and SO hardening.
Most of it's there for you to tinker with. Or to simply print out and enjoy, reading it as you sit beside the fireplace with a nice Bordeaux. To each his (or her) own.
 Wednesday, March 15, 2006
Jim recently asked me to put together a few stand-alone articles, loosely excerpted from my upcoming book.
The first one just went online: One Platform to Rule Them All. This is based on a section from pretty early in the book. In fact, it's near the front of the first chapter.
I chose the title based on T-shirts the CLR team had printed up a few years back, which prominently display this slogan along with some nifty graphics (e.g. a ring inscribed with code). As you can imagine, it's based on a famous book series and, more recently, movie series...
 Wednesday, March 08, 2006
Here are a few recent happenings.
And... my new book is just about to come out (late March/early April). I'm happy to report that I recently accepted an offer to write another, Concurrent Programming on Windows, the goal for which is to be the definitive guide to concurrent programming on Windows, WinFX, and the .NET Framework. It'll show you how to write kick-butt software on these new shnazzy multi-core machines.
 Thursday, February 23, 2006
When you perform a wait on the CLR, we make sure it happens in an STA-friendly manner. This entails using msg-waits, such as MsgWaitForMultipleObjectsEx and/or CoWaitForMultipleHandles. Doing so ensures we pick up and dispatch incoming RPC work mid-stack, while the STA isn't necessarily sitting in a top-level message loop. In fact, an STA that doesn't pump temporarily can easily lead to temporary and permanent hangs (i.e. deadlocks), especially in common COM scenarios where reentrant calls across apartments are made (e.g. MTA->STA->MTA->STA). Even where deadlock isn't possible, failing to pump can have a ripple effect across your process, as components wait for other components to complete intensive work.
I was recently writing about this fact for an article, and realized I had leaped to an incorrect assumption. OLE creates special RPC windows for processing these apartment transitions. I knew that. But I had assumed that we blatantly pump messages for any window. In reality, we don't. We only pump the special "WIN95 RPC Wmsg", "OleMainThreadWndClass", and "OleObjectRpcWindow" RPC windows. Consider why.
If you were on a UI and you pumped your window's message queue, you could end up dispatching new events before old events had completed. If you dispatched a WM_CLOSE message on the same stack you were doing some other UI processing, you'd destroy the window before that other processing was done. Without the GUI message loop taking this into account somehow, you'd crash. There are other factors. Imagine you were processing a click event that required movement of several UI elements. If some bit of infrastructure--or perhaps your code--ended up pumping, UI invariants could be broken. (Lots of FX code pumps for RPC, by the way.) At best, this could lead to strange visual artifacts, and at worst a crash.
You'd also have to deal with the subtleties of reentrancy. I've written about this before. Imagine a UI event had waited on an auto-reset event, did some work, and then set the event. If another event--perhaps the same type--were dispatched while it was doing "some work," it might try to wait on this same event. This would be a deadlock. A pretty difficult one to track down too, especially if it only occurred if the user clicked on certain elements at precise timings to get the reentrancy to occur. This is already a problem with COM interop. But thankfully we don't burden Windows Forms and WPF programming with it too.
 Tuesday, February 21, 2006
I've been pretty busy lately orchestrating all the concurrency work the CLR is undertaking for the next few versions.
We have several long lead incubation projects underway to explore new programming models which--some of us at least--believe will make concurrent programming at least an order of magnitude easier. While I hope we ship them, the fate of those projects is currently unknown. We'll understand better over time what is feasible and in what timeframe.
But there's also a set of features and work items we are actively planning for in the mainline product. This includes looking at new abstractions (e.g. futures, barriers, better lock factoring, a richer variety of locks, lock leveling), fixing existing ones (e.g. nagging issues and obvious missing features (e.g. prioritization, isolation, diagnostics) with our thread-pool, improving RWL's performance), debugging and tool support, and general cleanup activities (e.g. getting our own locking practices in order).
If there's anything you'd like to see us consider in this category, please let me know. Either leave a comment here or send me an email at joedu at microsoft. I'll share it with the rest of the concurrency team, and we'll seriously consider it.
I also have a couple of related articles to watch out for: one on deadlock avoidance and detection in next month's MSDN Magazine, and another on application responsiveness in WPF in DDJ. I'm also going to be on .NET Rocks next week discussing some of this.
Unfortunately, all of this has left very little time to blog. If you have ideas for topics, drop me a line.
 Tuesday, February 07, 2006
I was on a mail thread today, the topic of which was the meaning (and perhaps lack of comprehensiveness) of the statement: “This type is thread safe.” Similar statements are scattered throughout our product documentation, without any good central explanation of their meaning or caveats.
It’s relatively difficult to make such a statement. The .NET Framework is generally written so that all static members are thread-safe, while instance members are not. There are some notable exceptions, mostly to do with immutable types (e.g. the primitives, System.DateTime, System.Type, etc.), but they are infrequent.
This brings me to the types of thread safety issues we’re generally concerned with.
The first problem is torn reads and writes. Operations that deal with data whose size is greater than the native machine-sized pointer (i.e. sizeof(void*)) are not atomic in the ISA. This applies to 64-bit operations on 32-bit platforms, 128-bit on 64-bit, etc. We happen to provide you two intrinsic 64-bit types—Int64 (C# long) and Double (C# double)—which makes this issue a tad tricky. So this code
static long x;
// …
x = 1111222233334444;
consists of two DWORD MOVs at the machine level, one to the most significant half and then the least significant half (in that order, at least in the Whidbey x86 JIT). Reads likewise involve two MOV instructions.
If you’re doing naked reads and writes with such values, instructions can be interleaved such that only one DWORD has been written or read. That means a thread racing with the above assignment (with x initially 0) can see only the most significant half, (x & 0xFFFFFFFF00000000), which is 1111219708624896 in decimal. And because its own read is two MOVs as well, it can just as easily see only the least significant half, (x & 0x00000000FFFFFFFF), or 2524709548 in decimal. The latter is obviously surprising, as the two values (at first glance) don’t seem to be related. This same principle applies to reads and writes of any value type instances whose size is greater than sizeof(void*).
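The arithmetic behind a torn value is easy to check. The sketch below does not provoke a race. It only computes what a reader would see if just one 32-bit half of the assignment had landed on top of an initial value of 0. The TornValueDemo name is hypothetical:

```csharp
using System;

// Computes the two "half-written" values for the constant used above.
public static class TornValueDemo
{
    public const long Full = 1111222233334444;

    // Only the most significant DWORD is visible; the low half is still 0.
    public static long UpperHalfOnly()
    {
        return Full & ~0xFFFFFFFFL;
    }

    // Only the least significant DWORD is visible; the high half is still 0.
    public static long LowerHalfOnly()
    {
        return Full & 0xFFFFFFFFL;
    }
}
```

UpperHalfOnly() evaluates to 1111219708624896 and LowerHalfOnly() to 2524709548; the two always sum back to the full value.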
This can be solved by protecting all access to the data under a lock or via an interlocked operation. Interlocked.Exchange will do the trick for writes, and Interlocked.Read for reads. Note that most platforms offer 128-bit interlocked instructions. Unfortunately, because of the platform-specificity, the Win32 APIs and our System.Threading APIs don’t broadly support them. Hopefully this changes over time. For the same reason you often need two void*-sized writes on 32-bit, you often need the same on 64-bit.
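As a minimal sketch of that advice, the wrapper below funnels every access to a 64-bit field through Interlocked.Exchange and Interlocked.Read. The SharedInt64 type is a hypothetical name used for illustration only:

```csharp
using System.Threading;

// Hypothetical wrapper: all reads and writes of the 64-bit field are
// atomic, even on a 32-bit platform where a plain access is two MOVs.
public sealed class SharedInt64
{
    private long value;

    public void Write(long newValue)
    {
        // Atomically replaces the whole 64-bit value.
        Interlocked.Exchange(ref value, newValue);
    }

    public long Read()
    {
        // Atomically reads the whole 64-bit value.
        return Interlocked.Read(ref value);
    }
}
```

A reader racing with Write can then only ever observe the old value or the new one, never a mix of the two halves.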
In summary, any type that exposes a writable 64-bit field, or which returns a 64-bit value which has been copied from a field that might be in motion, is not thread-safe. And any internal reads and writes need to be done under the protection of a lock or interlocked operation. A method that updates an internal field, for example, can race with a property that returns the current value.
The second problem is read/modify/write sets of instructions. If reads and writes can be multiple instructions long, it should be clear that by default a read/modify/write is at least three. For a 64-bit value on a 32-bit platform, it’s six. At any point in that sequence, interleaved execution can cause an update to go missing. The solution here, again, is to do the whole sequence inside a lock, or to use an interlocked operation. The Interlocked.CompareExchange function is great for this purpose, as it takes advantage of hardware-level read/modify/write instructions. Thankfully they are supported by all modern hardware ISAs.
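One common way to build an arbitrary read/modify/write on top of Interlocked.CompareExchange is a retry loop: read the current value, compute the new one, and publish it only if nobody changed the field in between. The sketch below is illustrative; AtomicMath and AddClamped are hypothetical names, not Framework APIs:

```csharp
using System;
using System.Threading;

public static class AtomicMath
{
    // Atomically adds delta to target, never letting it exceed max.
    // Returns the value that this call actually stored.
    public static long AddClamped(ref long target, long delta, long max)
    {
        while (true)
        {
            long seen = Interlocked.Read(ref target);
            long next = Math.Min(seen + delta, max);

            // Store 'next' only if target still holds 'seen'; otherwise
            // another thread won the race, so re-read and try again.
            if (Interlocked.CompareExchange(ref target, next, seen) == seen)
                return next;
        }
    }
}
```

If the modification is a plain addition, Interlocked.Add or Interlocked.Increment does the same job in a single call; the loop is only needed when the new value is some other function of the old one.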
The last problem is that of ensuring coarser-grained data structure invariants can never be seen in a broken state by concurrent execution. This is especially difficult since arbitrary managed programs don’t capture such invariants in the program itself. Aside from static state, most Framework types don’t even come close to attempting to provide this level of guarantee. Static caches and lazily initialized state, for example, are places where the Framework needs to account for concurrent access. Old-style collections with SyncRoots tried to provide similar protection, but the new generic collections don’t any longer, mostly because of the performance hit you take on sequential code-paths. But those cases are the exception, not the norm.
The immutable types mentioned above are nice in that, aside from initialization-time, they never break their internal invariants. Thus, aside from assignments of instances to shared variables, you needn’t worry about any special synchronization.
In summary, any type that breaks invariants must do so in such a way that these invariants can never be observed due to concurrent execution. This means all access to data needs to be serialized with respect to coarser grained operations updating state. Our Framework isn’t written in this way, so if you share it, you usually take responsibility for locking it.
My preference is for developers to assume that all types are unsafe, and to explicitly lock when accessing them concurrently. Regardless of the documentation’s claims. We simply do not check for these things across releases, and some code that works today might break tomorrow because we accidentally forgot to account for a torn read.
 Thursday, January 26, 2006
Vance Morrison's excellent MSDN article from a few months back talks about why double checked locking is guaranteed to work on the CLR v2.0, and why it is one of the few safe lock-free mechanisms on the runtime. He also sent an email to the develop.com mailing list a while back explaining why this pattern wasn’t guaranteed to work on the ECMA memory model. We did quite a bit of implementation work and testing to tame the crazy memory model of IA-64 on 2.0. (Note that none of this is in the ECMA specification, so if you’re worried about CLI compatibility, beware.)
These modifications not only enable the double checked locking pattern, but also prevent constructors from publishing the newly allocated object before their state has been initialized, as I mentioned in my PDC presentation on concurrency last year. We accomplish this by ensuring writes have 'release' semantics on IA-64, via the st.rel instruction. A single st.rel x guarantees that any other loads and stores leading up to its execution (in the physical instruction stream) must have appeared to have occurred to each logical processor at least by the time x's new value becomes visible to another logical processor. Loads can be given 'acquire' semantics (via the ld.acq instruction), meaning that any other loads and stores that occur after a ld.acq x cannot appear to have occurred prior to the load. The 2.0 memory model does not use ld.acq’s unless you are accessing volatile data (marked w/ the volatile modifier keyword or accessed via the Thread.VolatileRead API). This can lead to some subtle problems.
For example, a slight variant of the double checked lock will not work under our model:
class Singleton
{
    private static object slock = new object();
    private static Singleton instance;
    private static bool initialized;
    private Singleton() {}
    public static Singleton Instance
    {
        get
        {
            if (!initialized)
            {
                lock (slock)
                {
                    if (!initialized)
                    {
                        instance = new Singleton();
                        initialized = true;
                    }
                }
            }
            return instance;
        }
    }
}
You might have decided to use this pattern to determine whether to initialize a value-type, since checking the variable for null isn’t possible. If you had some more complex set of state, perhaps you want to use a single Boolean rather than checking, say, 10 separate variables to see if they have each been initialized. Whatever your reasoning, as written the above code is prone to a subtle race condition.
The problem here is that neither the read of initialized nor the read of instance has 'acquire' semantics. Thus, instance could appear to have been read before initialized, e.g. as follows:
| Time | Thread A | Thread B |
| 0 | | Reads instance as null |
| 1 | Reads initialized as false | |
| 2 | Sets instance to ref to new obj | |
| 3 | Sets initialized to true | |
| 4 | Uses instance (initialized) | |
| 5 | | Reads initialized as true |
| 6 | | Uses instance (null!) |
Thread B ends up returning a null reference. If a caller tried to use it, they might encounter a spurious NullReferenceException, the cause of which is incredibly hard to debug. For example:
void f()
{
    Singleton s = Singleton.Instance;
    s.DoSomething(); // Boom!
}
For this to have happened, Thread B would have had to read instance entirely out of order. It might have done so for any number of reasons. If it recently executed some code that pulled it into cache—either directly or due to locality—it isn’t required to invalidate the cache with non-acquire reads, even though it observed a new write with release semantics, because it's as if the load was moved before the load of initialized. Or superscalar execution might perform branch prediction and retrieve the value of instance, assuming that initialized will be false, pulling it into cache ahead of the read of initialized. Again, because it is a non-acquire read, this is a valid thing to do. If it reads initialized as true, its prediction was actually correct, and it just returns the null value that was pre-fetched. It might even be the case that a compiler along the way moved the read, which is also entirely legal with our memory model.
One possible solution for this is to employ a volatile-read on the first read of the initialized variable, prohibiting the read of instance from moving prior to the read of initialized. Control dependency prevents us from having to use a volatile-read for the reads of both variables.
class Singleton
{
    private static object slock = new object();
    private static Singleton instance;
    private static int initialized;
    private Singleton() {}
    public static Singleton Instance
    {
        get
        {
            if (Thread.VolatileRead(ref initialized) == 0)
            {
                lock (slock)
                {
                    if (initialized == 0)
                    {
                        instance = new Singleton();
                        initialized = 1;
                    }
                }
            }
            return instance;
        }
    }
}
You could instead have inserted a call to Thread.MemoryBarrier, which is a two-way memory fence, between the if-block and the read of instance. But the cost of a barrier is generally higher than that of either a st.rel or a ld.acq, because it affects surrounding instructions and movement in both directions.
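Since reads of fields marked with the volatile modifier are the other case that gets acquire semantics, marking the flag itself volatile should close the same hole without an explicit Thread.VolatileRead call. This is a sketch under that assumption, with a hypothetical class name, rather than a pattern taken from the article mentioned above:

```csharp
public sealed class VolatileFlagSingleton
{
    private static readonly object slock = new object();
    private static VolatileFlagSingleton instance;

    // volatile: reads of the flag are acquire reads, writes are release
    // writes, so the later read of 'instance' cannot move ahead of it.
    private static volatile bool initialized;

    private VolatileFlagSingleton() { }

    public static VolatileFlagSingleton Instance
    {
        get
        {
            if (!initialized)
            {
                lock (slock)
                {
                    if (!initialized)
                    {
                        instance = new VolatileFlagSingleton();
                        initialized = true;
                    }
                }
            }
            return instance;
        }
    }
}
```

The trade-off is that every access to the flag now pays the volatile cost, including the one inside the lock, where the lock already provides the needed ordering.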
The take-away here is not that you must understand the specifics of how cache coherency, speculative execution, and our memory model interact. Rather, it should be that once you venture even slightly outside of the bounds of the few "blessed" lock-free practices described in the article mentioned above, you are opening yourself up to the worst kind of race conditions. Using locks is a simple way to avoid this pain. And hopefully someday in the future, transactional memory will enable performant execution of code with lock elision techniques that lead to the performance of lock-free code, but without any of the mental illness that such techniques have been proven to cause.
 Monday, January 16, 2006
Transactional memory promises to improve the lives of developers everywhere. From races, to deadlocks, to lock granularity and scalability headaches, the concept of transactions cleans up a lot of the worries inherent in the current lock-based concurrency programming model.
You have it today, sort of. Juval Lowy wrote a great piece on MSDN a few months back on how to build System.Transactions resource-managers over volatile in-memory data. I highly recommend checking it out. Maurice Herlihy has also made his SxM library available online here. I wonder if there's an interesting intersection between the two.
Update: I neglected to mention failure atomicity as one of the major benefits of TM, whether running concurrently or not.
 Saturday, January 07, 2006
I've posted before about how you might use C# enumerators to simulate coroutines. Enumerators are a very powerful feature, but unfortunately have one big drawback vis-à-vis their attempt at coroutines: you can yield only from one stack frame deep. The C# compiler state-machine transforms enough information for a single function, but obviously doesn't do that for the entire stack. Real coroutines can yield from an arbitrarily nested callstack, and have that entire stack restored when they are resumed.
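To make the one-frame limitation concrete, here is a small sketch with hypothetical names. Inner cannot yield on Outer's behalf, so Outer has to iterate Inner and re-yield each item itself:

```csharp
using System.Collections.Generic;

public static class NestedIterators
{
    public static IEnumerable<string> Inner()
    {
        yield return "inner-1";
        yield return "inner-2";
    }

    public static IEnumerable<string> Outer()
    {
        yield return "outer";

        // A plain call to Inner() would produce nothing for Outer's
        // consumer; each nested item must be forwarded by hand.
        foreach (string s in Inner())
            yield return s;
    }
}
```

Enumerating Outer() produces "outer", "inner-1", "inner-2". Every additional level of nesting needs another forwarding loop like this one.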
There are other techniques. If you're willing to spend an entire thread to keep the stack alive, for example, you can use events to model coroutines with a standard producer/consumer relationship. The benefit to this approach is that you are in fact able to yield from arbitrarily nested frames. The clear drawback is the performance overhead. Each coroutine will eat up 1MB of reserved stack space from the virtual address space. But, probably worse, each time a new item is requested, an OS context switch is required; and similarly, whenever a new item becomes available (i.e. yielded), a context switch occurs again. This back-and-forth switching is pure overhead that could be eliminated with true coroutines.
(Note: this article describes how to use Fibers to avoid this context switch penalty. Fibers are dynamite on the CLR, however, so tread with caution if you even contemplate using this approach. Furthermore, you can easily dream up ways to serialize the physical stack, a la continuations. You do have access to the current CONTEXT, via GetThreadContext, on Windows and can use the thread's stack base and context ESP to determine the boundaries. But so many things in Windows rely on the TEB, from the CRT to exception handling to GetLastError to arbitrary usage of the TLS, like the way the CLR maintains a list of frame transitions. Never mind having to accurately report roots back to the GC. These nightmares make real coroutines on Windows almost unapproachable, at least for the faint of heart.)
I hacked up a little Coroutine class this morning that uses the thread-per-coroutine approach mentioned above. Up front, I have to warn you: I spent 30 minutes on this thing. It's bound to be buggy, and I took some shortcuts (like not implementing the respective collections interfaces). Rather than walking through bit-by-bit, I've tried to comment the source code to explain how it works:
using System;
using SD = System.Diagnostics;
using System.Threading;
public delegate void CoroutineStart();
public class Coroutine<T> : IDisposable
{
// Fields
private CoroutineStart start;
private Thread thread;
private AutoResetEvent computeNextEvent;
private AutoResetEvent nextAvailableEvent;
private ManualResetEvent doneEvent;
private T current;
// We have a thread-static here so the coroutine needn't track the Coroutine<T> object
// manually. The Yield function is static, so they can just call Coroutine.Yield(v);
[ThreadStatic]
private static Coroutine<T> coroutine;
// Constructors
public Coroutine(CoroutineStart start)
{
this.start = start;
this.thread = new Thread(Worker);
this.computeNextEvent = new AutoResetEvent(false);
this.nextAvailableEvent = new AutoResetEvent(false);
this.doneEvent = new ManualResetEvent(false);
}
// Properties
public T Current
{
// TODO: we could add some error checking here, e.g. if somebody tries to
// read past the end-of-stream.
get { return current; }
}
// Methods
public bool MoveNext()
{
if (thread.ThreadState == ThreadState.Unstarted)
thread.Start();
else
computeNextEvent.Set();
// We wait on the 'next available' and 'done' events simultaneously. And then
// we use this to determine whether the coroutine has finished or not. The consumer
// will typically use this in a loop, e.g. while (c.MoveNext()) { f(c.Current); }.
return (0 ==
WaitHandle.WaitAny(new WaitHandle[] { nextAvailableEvent, doneEvent }));
}
private void Worker()
{
try
{
// Stash the coroutine object in TLS and start the CoroutineStart routine.
coroutine = this;
start();
}
catch (ThreadInterruptedException)
{
// Ignore the interrupt request. We use this as the 'proper' way to shut-down
// a coroutine. This is really a hack. Needs to be revisited.
}
finally
{
// Lastly, signal to the caller that the coroutine is done producing. Note that
// we'd ideally just use the thread executive object directly. But unfortunately the
// managed thread class doesn't expose this WaitHandle. :(
doneEvent.Set();
}
}
public static void Yield(T value)
{
Coroutine<T> c = coroutine;
// First, ensure we're on a coroutine thread.
if (c == null)
throw new InvalidOperationException("You can only yield from a coroutine thread");
// Now, set the coroutine's current value to the argument, signal to the consumer
// that we have a new item, and go to sleep until we're asked to compute the next item.
c.current = value;
c.nextAvailableEvent.Set();
c.computeNextEvent.WaitOne();
}
public void Dispose()
{
// We ensure the thread has stopped here. We use a really ugly interrupt to bring
// it down if not.
if (thread.ThreadState != ThreadState.Aborted &&
thread.ThreadState != ThreadState.Stopped &&
thread.ThreadState != ThreadState.Unstarted)
{
SD.Trace.TraceWarning("Coroutine thread has not stopped when Disposing, in state {0}",
thread.ThreadState);
thread.Interrupt();
// Joining here is questionable at best. It could lead to deadlocks.
thread.Join();
}
// Close out all of the events.
computeNextEvent.Close();
nextAvailableEvent.Close();
doneEvent.Close();
}
}
public static class Coroutine
{
// This is a trick. The C# compiler will infer the method argument <T>, enabling
// us to shunt right over to the Coroutine<T> implementation. This is nice because
// the user can just write Coroutine.Yield(n) instead of Coroutine<T>.Yield(n). The
// annoying part is that you can easily yield something of the wrong type, leading to
// an InvalidOperationException because C<T>.Yield will look in TLS and not find anything.
public static void Yield<T>(T t)
{
Coroutine<T>.Yield(t);
}
}
Now let's see it in action. Given a function Fibonacci, which continuously yields the next item in the Fibonacci sequence:
void Fibonacci()
{
long n0 = 0;
long n1 = 1;
long n;
while (true)
{
n = n0 + n1;
n0 = n1;
n1 = n;
Coroutine.Yield(n);
}
}
We can form a coroutine over it and scroll through the first 10 numbers:
using (Coroutine<long> c = new Coroutine<long>(Fibonacci))
{
int i = 0;
while (c.MoveNext() && i++ < 10)
Console.WriteLine(c.Current);
}
And of course, we can create a coroutine over a function that yields from functions deep in the call stack:
void a()
{
Coroutine.Yield("a");
b();
e();
}
void b()
{
Coroutine.Yield("a.b");
c();
}
void c()
{
Coroutine.Yield("a.b.c");
d();
}
void d()
{
Coroutine.Yield("a.b.c.d");
}
void e()
{
Coroutine.Yield("e");
}
And iterate over it:
using (Coroutine<string> c = new Coroutine<string>(a))
{
while (c.MoveNext())
Console.WriteLine(c.Current);
}
A neat extension to this whole idea might be a BeginMoveNext function that follows the asynchronous programming model. You could then exploit the fact that the consumer and producer are on separate threads to make progress while the producer is calculating the next item in line. Assuming you're on a multi-hardware-thread machine, this would cut down on the context switch penalty by as much as half.
 Tuesday, December 27, 2005
Some fundamental changes were made in the .NET Framework 2.0 that just about obviate the need to ever write a traditional finalizer. A lot of the guidance written here is now obsolete, not because it is incorrect, but rather because there is one important new consideration to make (hosting) and a set of new tools to aid you in the task. Jeff Richter pointed this out to all of us a few months back.
As Stephen Toub discusses in depth in his recent MSDN Magazine article on CLR reliability, resources not under the protection of critical finalizers are doomed for leakage when run inside of sophisticated hosts. SQL Server uses AppDomains as the unit of code isolation, much like Windows’ use of processes. When it tears one down, it expects there to be no resulting residual build-up over time. But if the best you’ve got are ordinary finalizers to clean up resources, a rude AppDomain unload can bypass execution of them entirely, leading to leaks over time. This might happen if a finalizer in the queue with you takes too long to complete, perhaps by deadlocking on entry to a non-pumping STA, causing the host to escalate to a rude unload.
Critical finalizers
During a rude unload, normal finalizers are skipped, finally blocks aren’t run, and only critical finalizers get a chance to make the world sane again. Thus we can immediately form a guiding principle:
Any resource whose natural lifetime outlasts an AppDomain must be protected by a critical finalizer to avoid leaks.
Notice that I say "lifetime outlasts an AppDomain." This is important. Finalizers are often used for process-wide resources, such as file HANDLEs and Semaphores. But a resource whose lifetime is bounded only by the enclosing process surely outlasts any single AppDomain; an ordinary finalizer is not good enough. Another piece of code in the same process might be denied access to the file handle because the (now-dead) AppDomain orphaned an exclusively-opened handle to it. Windows ensures this HANDLE will get released when the process shuts down, but our goal with critical finalization is to do this at AppDomain unload time (avoiding cross-AppDomain interference). In the worst case, not doing so can actually lead to state corruption; a process crash is then likely to ensue, taking down a host like SQL Server with it. Imagine if two AppDomains (perhaps even multiple processes) communicate via memory mapped I/O inside of a shared address space. If an AppDomain gets interrupted by an unload mid-way through a paired operation and intends to clean up state in its finalizer, failure to execute the finalizer might lead to chaos. A critical finalizer should have been used. (And use of BeginDelayAbort, e.g. via a CER. But that’s digging a little too deep for now.)
Critical finalizers are somewhat easier to write when compared to ordinary finalizers, due to the out-of-the-box plumbing that you get. But they impose additional constraints on what you can actually do at finalization time. To implement a critical finalizer, simply subclass the System.Runtime.ConstrainedExecution.CriticalFinalizerObject (CFO) type, provide a way for users to acquire a resource (e.g. in the constructor), and override its Finalize method to perform cleanup. When instantiated, your object will be placed onto the critical-finalization object queue. CFOs can be suppressed as usual with the GC.SuppressFinalize method, and can be re-registered onto the critical-finalization queue with the GC.ReRegisterForFinalize method. The CLR then ensures your object is finalized should a rude unload occur; it also runs critical finalizers in all the cases where ordinary finalizers run, i.e. standard GC finalization, managed shut-down, ordinary AppDomain unload, etc. There is a weak guarantee that CFOs are finalized after other finalizable objects, specifically to accommodate relationships like FileStream needing to flush its buffer before its underlying SafeHandle is released.
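As a rough sketch of the recipe above (not code from the Framework), a CFO that owns a block of unmanaged memory might look like the following. The NativeBuffer name, the IsAllocated property, and the choice of Marshal.AllocHGlobal as the resource are assumptions made purely for illustration. Note that C# spells a Finalize override as a destructor.

```csharp
using System;
using System.Runtime.ConstrainedExecution;
using System.Runtime.InteropServices;

// Hypothetical example type: owns one block of unmanaged memory.
sealed class NativeBuffer : CriticalFinalizerObject
{
    private IntPtr _memory;

    public NativeBuffer(int bytes)
    {
        // Acquire the resource in the constructor, as the text suggests.
        _memory = Marshal.AllocHGlobal(bytes);
    }

    public bool IsAllocated
    {
        get { return _memory != IntPtr.Zero; }
    }

    // C# compiles this destructor into an override of Finalize. Because the
    // type derives from CriticalFinalizerObject, it is a critical finalizer.
    ~NativeBuffer()
    {
        // Kept deliberately tiny: no boxing, no virtual calls, no new objects.
        if (_memory != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(_memory);
            _memory = IntPtr.Zero;
        }
    }
}
```

This sketch has no eager Dispose path and leaves a small window between the allocation and the field assignment, which is part of why the handle types discussed next are usually a better starting point.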
As noted, writing a CFO Finalize method is trickier than a standard finalizer due to additional constraints. This is because it can be called from inside of a CER if the host escalates to a rude unload. It must guarantee that state will not be corrupted as a result of its execution and that it will never fail (i.e. by leaking an exception). And of course you can only call non-virtual methods that make similar guarantees. This means your code has to be written to succeed in the most hostile of situations, for instance in situations where any attempt to allocate memory dynamically will be rejected via an OutOfMemoryException. If you let that exception leak, you’ve violated the contract and can expect the host to respond in any number of ways, including crashing the process immediately. CERs perform eager preparation to statically ensure your code can execute, jitting the transitive closure of methods you invoke, but it’s easy to make a misstep here due to the massive number of hidden allocations in the runtime. A box instruction allocates memory; unbox does, too, but only if you’re unboxing a Nullable<T>; throw has to manufacture a RuntimeWrappedException if you're throwing a non-Exception object; and so forth. And unfortunately there aren’t any tools to prove that you’ve written your CER correctly. Thankfully most developers write bug free code on their first attempt. ;)
Critical- and safe-handles
Using the base CFO type directly has a couple drawbacks. First, it doesn’t fully implement the IDisposable pattern. There are two convenient Framework CFO abstract classes that do, both in the System.Runtime.InteropServices namespace: CriticalHandle and SafeHandle.
The CriticalHandle type is sufficient to get critical finalization semantics: you simply chain to its protected constructor and override its ReleaseHandle method, performing open and close operations inside of them respectively. Your ReleaseHandle implementation can be called from inside of a critical finalization CER, so as with writing CFOs by hand you must make the same guarantees outlined above. This type provides a cleanly factored and encapsulated interface to your users.
But more concerning is the fact that both CFO and CriticalHandle are still prone to security problems that you might need to worry about if you’re building any sort of reusable Framework. BrianGru outlines this situation here. To tackle those issues, you need SafeHandle. Implementing SafeHandle is much like CriticalHandle, in that you chain to the protected constructor, override ReleaseHandle, and abide by CER rules inside of ReleaseHandle. One additional piece is necessary, however: you must implement the abstract IsInvalid property getter and return true or false to indicate whether the SafeHandle refers to an invalid handle. (The SafeHandleMinusOneIsInvalid and SafeHandleZeroOrMinusOneIsInvalid types in the Microsoft.Win32.SafeHandles namespace are there to help out here, returning true if the handle is the value -1 in the first case and true if the handle is the value -1 or 0 in the latter case. A PVOID with a value of 0 (i.e. NULL), for example, would be invalid for a handle to a memory address; SafeHandleZeroOrMinusOneIsInvalid would be perfect for this.) ShawnFa discusses implementing SafeHandle in more detail on his blog.
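To make the shape concrete, here is a minimal sketch of a SafeHandle subclass for a memory address, built on SafeHandleZeroOrMinusOneIsInvalid so that IsInvalid comes for free. The SafeHGlobalHandle name and the Allocate factory are hypothetical. Using Marshal.AllocHGlobal keeps the sketch self-contained, whereas production code would more typically have a P/Invoke signature return the handle type directly.

```csharp
using System;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

// Hypothetical example type: a safe handle over an HGlobal allocation.
sealed class SafeHGlobalHandle : SafeHandleZeroOrMinusOneIsInvalid
{
    // ownsHandle: true means ReleaseHandle runs when the handle is disposed
    // or critically finalized.
    private SafeHGlobalHandle() : base(true)
    {
    }

    public static SafeHGlobalHandle Allocate(int bytes)
    {
        SafeHGlobalHandle result = new SafeHGlobalHandle();
        // Assumption for brevity: allocate, then store. An abort between these
        // two steps would leak, which the P/Invoke marshaler avoids when it
        // constructs the SafeHandle itself.
        result.SetHandle(Marshal.AllocHGlobal(bytes));
        return result;
    }

    // May run inside a CER during a rude unload, so it does the bare minimum.
    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);
        return true;
    }
}
```

A caller would normally wrap it in a using block, e.g. `using (SafeHGlobalHandle h = SafeHGlobalHandle.Allocate(64)) { ... }`, and rely on critical finalization only as the backstop.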
Your CriticalHandle and SafeHandle types should never take on additional business-logic responsibility; make them as light-weight as possible, doing just enough to allocate and free resources. You’ll probably have a number of other functional classes that make use of these handles. The Framework’s Stream types are a classic example. Such types should implement the IDisposable interface and invoke Dispose on the underlying handle, providing an eager way to dispose of the resource. They should furthermore take care to never publicly expose the underlying handle, as doing so could be used to erroneously suppress finalization on a handle, leading to resource leaks.
Did you really mean never?
Almost. There are several situations in which people must still write complex finalizers. The tax they must pay for stepping outside of the simple allocate/deallocate pattern is understanding intimately the big mess outlined here. Most people should consider factoring their real cleanup code to use a SafeHandle, and only then layering specialized code on top of that inside of a normal finalizer.
After a brief email thread with Chris Brumme, a number of legitimate cases of alternative finalizer patterns were identified, including:
- Sophisticated APIs can use finalizers to return expensive objects—like large buffers or database connections—back to a pool, amortizing the cost of creating and destroying them over the life of the application. System.EnterpriseServices does this. This is one of the only cases where resurrection is an acceptable practice. Critical finalization should only be used here if resources are pooled across an entire process. Most resources are AppDomain-local, and thus do not qualify for CFO status.
- Calling GC.RemoveMemoryPressure to compensate for a previous GC.AddMemoryPressure, used to communicate to the GC that the pressure associated with an object's resources is no longer a factor (because it's been cleaned up). This should be protected by a CFO if the resource whose pressure it tracks is also allocated/deallocated under a CFO. It’s unfortunate that the RemoveMemoryPressure API doesn’t make reliability guarantees (e.g. with ReliabilityContractAttribute). If it attempts to allocate memory—I can’t imagine that it would—you could end up crashing the process due to an unhandled OutOfMemoryException. You could consider swallowing such exceptions, at the risk of violating the corruption contract. This is a crappy situation, but if a large quantity of pressure were leaked after an AppDomain unload, a skew could build up over time, affecting all parties in the process, precisely what we’re trying to avoid by using CFOs. You need to make an intelligent tradeoff. We should fix this in a later release.
- Incrementing or decrementing a performance counter or lighter-weight counter like a static field. This is often used to monitor the rate of creation/destruction of objects, and is often turned off in retail builds. Assuming imprecise counting is OK (e.g. if it’s used only for testing purposes), this should not use a CFO. If you do use a CFO, you have to follow the guidelines above. For light-weight counters this is easy (i++ and i-- traditionally don’t allocate memory); but for performance counters it is not.
- Asserting to find cases where an object should have been, but was not, eagerly cleaned up using the IDisposable pattern. Properly written eager disposal is supposed to call GC.SuppressFinalize to eliminate the assert. It would be inappropriate to use CFOs for this purpose. Finally blocks will not run under a rude unload (which includes Dispose methods), and thus under any rude unload situation your CFO will fire.
- Some external resources have elaborate rules for sequencing cleanup. The COM ADO APIs (not ADO.NET) require that fields are cleaned up before rows, which must precede tables, which must precede connections. If objects are cleaned up in a free-threaded manner or in the wrong order, memory corruption will occur. In other words, they violate the standard COM pUnk AddRef/Release rules. Outlook exposes COM APIs with similar sequencing rules. This is traditionally addressed by writing elaborate finalization code that walks the graph on the managed side and initiates the sequenced cleanup. This is the trickiest of all. If you can guarantee you follow the CFO rules outlined above, this probably belongs in a critical finalizer. But it’s quite easy to make a misstep...you're basically playing with dynamite at this point.
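The failure-to-Dispose assert from the list above is the easiest of these to picture. A minimal sketch, assuming a hypothetical TrackedResource type and a debug-only ordinary (non-critical) finalizer, might look like this:

```csharp
using System;
using System.Diagnostics;

// Hypothetical example type: complains if it is collected without Dispose.
class TrackedResource : IDisposable
{
    private bool _disposed;

    public bool IsDisposed
    {
        get { return _disposed; }
    }

    public void Dispose()
    {
        _disposed = true;
        // Eager cleanup happened, so the finalizer (and its assert) is skipped.
        GC.SuppressFinalize(this);
    }

#if DEBUG
    // Ordinary finalizer, deliberately not a CFO: it exists only to flag
    // objects that reached finalization without being disposed.
    ~TrackedResource()
    {
        Debug.Fail("TrackedResource was finalized without being disposed.");
    }
#endif
}
```

Any real resource held by such a type would still live in a SafeHandle field; the finalizer here does diagnostics only.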
If you decide you must write a finalizer, it’s still important to follow the pattern described here, or the condensed version in the .NET Framework Design Guidelines book. This facilitates seamless integration with VC++ 2005’s new destructor, Dispose, and finalization unification features.
Summary
At first glance, it appears that the world is simpler with CFOs. But when you consider that you have to abide by the same rules for normal finalizers plus the new ornery CER rules, life still isn’t very simple at all. CriticalFinalizerObject makes sure your resources don’t leak during hostile takeovers, and SafeHandle makes life more secure and a little easier in that the plumbing required to get IDisposable hooked up is all built for you, but one thing remains the same: Interoperating with unmanaged code is tricky stuff. But thankfully the world will be written in nothing but managed code sometime in the near future. Then we can get rid of all of this hairy finalization code once and for all.
 Tuesday, December 13, 2005
My recent {End Bracket} column, Transactions for Memory, shipped in the January MSDN Magazine. It's now been posted online: http://msdn.microsoft.com/msdnmag/issues/06/01/EndBracket/.
It's admittedly just a teaser, but hopefully strikes a good balance between hand-waviness and a useful explanation of the core ideas.
 Thursday, December 01, 2005
Lots of people try to roll their own thread-pool. Many people have different (good) reasons for doing so.
If you're one of these people, please tell me why. Either leave me a comment or send me an email at joedu@microsoft.com.
But if you're interested in performance, getting a good heuristic isn't as easy as you might think. The goal of such a heuristic is to have one runnable thread per hardware thread at any given moment. (A HT thread isn't equal to a full thread, but for the sake of conversation let's pretend it is.) Achieving this goal is much more complicated than it sounds.
- If you have a task sitting in front of you, it's hard to intelligently determine whether scheduling it on another thread is the right thing to do. It might be quicker just to execute it synchronously on the current thread. When is that the case? When the current number of running threads is equal to or greater than the number of hardware threads. And any decisions must be made statistically, because presumably concurrent tasks could be contemplating new work simultaneously.
- Remember I said running threads. If you have blocked threads, they are not making use of the CPU and thus need to be considered differently in the heuristic. Just a count of threads isn't enough. If you have 16 tasks, 8 hardware threads, and statistically 50% of those tasks will be blocked at any given quantum, you want 16 real threads (8 / 0.5). If they block 75% of the time, you want 32 (8 / 0.25), given enough tasks to fill them. And so forth.
- You aren't the only code on the machine. Another process could be happily hogging as many threads as there are hardware threads, in which case your algorithm just got twice as bad (or half as good) as it was originally. This type of global data is hard to come by. (I should note that most machines have more than 2 processes running simultaneously. I currently have 67 processes running with 605 total threads. That's an average of ~9 threads per process. Clearly this is a real concern.)
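The blocking arithmetic in the list above reduces to dividing the number of hardware threads by the fraction of time a thread is actually runnable. The helper below is only a back-of-the-envelope sketch of that formula; the ThreadSizing and TargetThreadCount names are made up for illustration, and it ignores the other two complications (concurrent decisions and other processes).

```csharp
using System;

// Hypothetical helper: sizes a pool so that, on average, one thread is
// runnable per hardware thread.
static class ThreadSizing
{
    public static int TargetThreadCount(int hardwareThreads, double blockedFraction)
    {
        if (hardwareThreads < 1)
            throw new ArgumentOutOfRangeException("hardwareThreads");
        if (blockedFraction < 0.0 || blockedFraction >= 1.0)
            throw new ArgumentOutOfRangeException("blockedFraction");

        // threads * (1 - blockedFraction) == hardwareThreads, solved for threads.
        return (int)Math.Ceiling(hardwareThreads / (1.0 - blockedFraction));
    }
}
```

For example, `ThreadSizing.TargetThreadCount(8, 0.5)` returns 16, and a blocked fraction of zero simply returns the hardware thread count. In practice the blocked fraction is itself a moving statistical estimate, which is where the real difficulty lies.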
Scheduling a task on another thread is costly. Why? For a number of reasons.
- Because unless you have ample hardware resources to run it, this implies at least one context switch to swap the work in. If it runs longer than that, it means many more. If you have more than one long-running task competing for the same hardware thread, it means they will continually thrash the thread context in an attempt to make forward progress. As Larry puts it so eloquently, "...Context switches are BAD. They represent CPU time that the application could be spending on working for the customer."
- And not only that (and perhaps worse), you're going to mess with the cache hierarchy. Your program might be happily working on conflict-free cache-lines, CASing right in the local cache without locking the bus, and then boom: You pass a pointer to an object to another thread (e.g. on the thread-pool), it pulls in the same lines of cache, and then you're both contending for the same lines back and forth. Your good locality goes right out the window and becomes a tax instead of a blessing. This sort of cache thrashing can kill good performance and scaling.
- Lastly, threads aren't free you know. Just having one around consumes 1MB of reserved stack space (0.5MB in SQL Server). Same goes for fibers.
Some people are interested in using thread-pools for other purposes. (That is: not performance.) They might want to manage a pool of work items, for example, which get scheduled fairly with respect to each other (in the fine-grained sense). No one task will complete very quickly during saturation, but at least each is guaranteed to move forward. A newly enqueued item won't sit festering in the queue while an older item continues bumbling along towards its goal. And sometimes, priorities must be used to evict lesser priority tasks when a higher priority task gets enqueued. These are all perfect cases where user-mode scheduling makes sense. Co-routines or (*cough, cough*) perhaps fibers could be used. Using threads for this simply adds way too much overhead.
Clearly getting this right is difficult. But the consequence of getting it horribly wrong today isn't too bad. (Although really crappy algorithms are noticeable.) When you only have 1-4 hardware threads on the average high-end machine, the difference between a great heuristic and a poor one isn't significant. That will change.
 Sunday, November 27, 2005
Each Windows thread has a Thread Environment Block (i.e. TEB) which is a block of user-mode memory pointed at and reserved for use by the Windows kernel Thread data structure (KTHREAD). In addition to basic OS information like the active SEH filter chain, stack base and limit, and owned critical sections, applications can easily stash data into and retrieve data out of the Thread Local Storage (TLS) area of the TEB. This is done using the Win32 TlsAlloc, TlsGetValue, TlsSetValue, and TlsFree functions. You can view the TEB via the kernel debugger's !thread command.
(The CLR of course offers TLS functionality too, i.e. using ThreadStatics and the System.Threading.Thread's AllocateDataSlot, SetData, and GetData functions. This information does go into the TEB, but it is managed by the CLR. A call to SetData does not translate directly to a call to TlsSetValue.)
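On the managed side, the simplest way to see thread-local state in action is a ThreadStatic field. The following is a small illustrative sketch (the ThreadStaticDemo type and its Run method are hypothetical names): each thread sees its own copy of the field, and a new thread starts with the default value.

```csharp
using System;
using System.Threading;

static class ThreadStaticDemo
{
    // One independent copy of this field exists per thread.
    [ThreadStatic]
    private static int t_value;

    // Returns { value seen by a fresh thread, value seen by the caller afterwards }.
    public static int[] Run()
    {
        int[] observed = new int[2];
        t_value = 42; // Visible only on the calling thread.

        Thread worker = new Thread(delegate()
        {
            observed[0] = t_value; // Fresh thread: sees the default, 0.
            t_value = 7;           // Changes only the worker's copy.
        });
        worker.Start();
        worker.Join();

        observed[1] = t_value; // Caller's copy is untouched: still 42.
        return observed;
    }
}
```

Work that stashes state this way is implicitly tied to the thread it ran on, which is the managed cousin of the affinity problem discussed next.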
Win32--and Windows in general--makes liberal use of thread-local memory. I noted a few uses above (e.g. exception handlers) which are pervasive. Such usage creates an implicit affinity between the workload running on the thread and the physical OS thread itself. What do I mean by affinity? Simply that the work executing on a thread must continue executing on that exact physical thread for it to remain correct. This affinity isn't documented consistently nor is it easy to detect. You might be able to weasel around it by chance. But it makes it extraordinarily difficult to transfer logical work from one physical thread to another.
Imagine what would happen if we made a call to some Win32 function and then decided to swap out the logical work so that we could install new work. SetLastError might have been used to communicate a failure in a function called on either the thread the work is being swapped out of, or the destination once it gets rescheduled. But SetLastError installs the error information into the TEB. GetLastError will then either fail to retrieve information or, more likely, will retrieve somebody else's information, either of which would lead to all sorts of serious problems. Similar issues can happen if we (foolishly) tried to swap out a thread that owned a critical section, or some other thread-specific resource (like a mutex).
This is one major reason why fibers are still problematic as a general task scheduling solution for Windows. And it's a challenge if you even want to consider user-mode scheduling a la continuations. You just can't get around the platform's hidden thread affinity. We've done much better in managed code. Over time we are trying to use ExecutionContext as the currency for logical context information, which can be easily captured and restored by the runtime. But there are examples where we violate this (e.g. monitors), where we use the physical OS thread as the context (to be fair: we do notify hosts of such situations via Thread.BeginThreadAffinity and Thread.EndThreadAffinity).
But you can't escape the fact that the runtime itself is built right on top of Win32.
 Monday, November 21, 2005
I just replied to a set of questions on Brad's blog. But I then thought perhaps the information would be more generally interesting to my readers. So here it is:
In response to the two issues you brought up on your first reply:
- Failure to mention "chain-to-base" for Dispose in the Framework Design Guidelines Book;
- Question about when it is safe to call methods on base types within a Finalize method.
Update: First, I have to mention something up front. I failed to mention that almost nobody should write Dispose/Finalize any longer. SafeHandle is the best way to protect resources that span (or outlive) a process, beginning in 2.0. This alleviates the details of implementing this pattern and gives you reliability guarantees that you otherwise wouldn't get (i.e. critical finalization). If you need to do some form of pooling or asserting on failure-to-Dispose, this is still an option; but all "real" resource cleanup should be encapsulated in a SafeHandle.
Re (1):
If we neglect to say a derived class must chain to its base class Dispose method (if it has one), that’s a book-bug/omission. If you're writing a Dispose(bool), your preference is to call base.Dispose(bool), flowing the bool value to the base method; if the base type has no such overload, calling base.Dispose() suffices. If you're writing only a parameterless Dispose(), you simply must call base.Dispose().
This is important because Dispose implies cleaning up resources; if you don’t chain to the base type, clearly you are going to leak those resources. And it might be worse than just nondeterministic cleanup (when the user expected deterministic). Presumably Dispose on the derived class does a GC.SuppressFinalize(this), meaning the base type will never get placed into the Finalize queue, and thus won't release its resources (well, until process exit). The user of this class would notice this as unbounded resource consumption. I suspect the bug would be incredibly hard to find, too.
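A minimal sketch of chain-to-base looks like the following. The BaseResource and DerivedResource types and their boolean flags are hypothetical stand-ins for real resources; the point is only that the derived Dispose(bool) flows its argument to the base, even if its own cleanup throws.

```csharp
using System;

// Hypothetical base type following the Dispose(bool) pattern.
class BaseResource : IDisposable
{
    private bool _baseReleased;

    public bool BaseReleased
    {
        get { return _baseReleased; }
    }

    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this);
    }

    protected virtual void Dispose(bool disposing)
    {
        _baseReleased = true; // Stand-in for releasing the base type's resources.
    }
}

// Hypothetical derived type that chains to its base.
class DerivedResource : BaseResource
{
    private bool _derivedReleased;

    public bool DerivedReleased
    {
        get { return _derivedReleased; }
    }

    protected override void Dispose(bool disposing)
    {
        try
        {
            _derivedReleased = true; // Stand-in for derived cleanup.
        }
        finally
        {
            // Without this call the base type's resources would leak, since
            // Dispose() has already suppressed finalization for the object.
            base.Dispose(disposing);
        }
    }
}
```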
Re (2):
Generally, you should not make complex method calls from your Finalize method that could result in (accidentally) trying to use a resource which has already been disposed during the destruction process. This is ordinarily more of a concern during virtual method calls, where the most derived type's version is chosen dynamically; if the derived type has overridden a method and tries to use its own resources (which were already relinquished because of destruction ordering), bad things will happen. Calling a base method isn't nearly so dangerous, unless you do it after chaining the base type's destructor. The original document from which the book text was derived acknowledged that Dispose's use of this practice is risky and goes against the general advice.
But we made an exception for Dispose because it is a carefully controlled pattern. (This was in the original document.) Those writing types to follow the pattern are usually more sophisticated users that will feel more comfortable analyzing the call graphs. And virtual calls during destruction aren’t nearly as dangerous as virtual calls during construction. You’re typically concerned that a resource will be used before it’s been initialized (i.e. in the construction case), but presumably code called from your Finalize will be resilient to uninitialized state and isn’t going to make further virtual method calls (which might introduce problems). Clearly if this isn't the case--and it could be difficult to verify that it is through test coverage--you will run into bugs, perhaps manifesting as crashing the Finalizer thread.
Note that the book contains an abridged (and more clear/scrubbed) version of the original document: http://www.bluebytesoftware.com/blog/PermaLink.aspx?guid=88e62cdf-5919-4ac7-bc33-20c06ae539ae. That document is actually quite a mess, doesn't stay on point, and bombards the reader with way too many details. I'm glad it got chopped up for the book.
 Wednesday, November 16, 2005
I wonder a few things.
How many out there write lots of multi-threaded code? If so, why; if not, why?
For those of you that do, is there a set of standard guidelines and practices that you follow? Are they public (e.g. a white-paper, book)? How much experience do you have with them? Any empirical evidence that they are better than nothing (or even another set of guidelines)?
For those practices, do you have tools to support development consistent with those practices (e.g. static analysis, dynamic analysis)? Are they commercial or homegrown? How do you protect against race conditions? How do you protect against deadlocks? Does your locking protocol employ specific rigorous engineering practices, such as using lock hierarchies/leveling, avoidance of dynamic dispatch under a lock, etc.?
Do you do user-mode scheduling? How? (E.g. fibers?)
Do you use our ThreadPool, or did you decide to roll your own? Why?
Are you a Monitor.Enter/Exit or a Win32 kind of guy or gal? Same goes for Monitor.Wait/Pulse/PulseAll and EventWaitHandle. Was this choice based on any data, or was it simply what you're most comfortable with?
These are just some of the things I wonder. Any answers to any questions would be super-cool.
 Tuesday, November 15, 2005
A few weeks back, Erica Wichers asked if Joel and I would tape an episode of MSDN TV. We, of course, kindly accepted the offer.
Joel was responsible for scheduling the shin-dig. We failed to prepare in the weeks to follow, and so when the date loomed we ran like frightened little squirrels. In other words, we pushed the date back another week. Brad would be quick to note this is a common case of pushing work, delaying the inevitable, allowing the snowball of "stuff to do" to grow, etc.
Well, the date then came again. I suspected it would eventually, but again chose to fiddle with other priorities instead of preparing. Thus we were still totally unprepared. The morning of the taping, we headed out for a latte at Victor's in Redmond (as all such situations call for). We'd tentatively planned on doing a short rendition of our Abbott and Costello act at PDC. But we hadn't prepared anything in the way of demos... What to do?!
Coincidentally, Joel had recently published a great paper on the performance characteristics of method calls on the CLR. And I had just finished doing a whole lot of playing around with SOS for some articles and book content I've been working on. We thought: Let's slam the two together, and talk ad-hoc about the flavors of method calls on the CLR. Sold!
We spent the next hour hacking together sample programs, refreshing our memories, and preparing for the taping. The result is here.
I am truthfully amazed at how well it flows together. I used to watch quite a few MSDN TV episodes back in the day (before joining the firm), and they always seemed so scripted. After seeing how it all works, I bet they all happen in a similar last-minute fashion.
 Wednesday, October 12, 2005
Haibo recently posted about nullbox impacts on reflection.
Haibo is a test engineer on the Managed Services Team (MST), the portion of the CLR team that focuses on dynamic programming features. MST owns such things as reflection, delegates, and generally any feature that is used to dynamically interact with the CTS.
Other interesting MST blogs include Jim Hugunin, Joel Pobar, and Kathy Kam. If you want weak delegates (or similar features) in the next version of the CLR, feel free to send any of them exorbitant amounts of email.
 Wednesday, October 05, 2005
Our verifier has knowledge of statically typed boxed value types. Such things are never surfaced to user code, not even through reflection. In user code, the type of a statically typed boxed value can only be referred to as object, both statically and dynamically. This is one of the nice properties of a nominal type system; if you can't name it, you can't refer to it.
We use boxed types for type tracking in the verifier. This is why it's legal to make virtual method calls against boxed value types, for example, and similar reasons apply to the formation of delegates over boxed value types. Using the box instruction's type token argument, we can easily calculate the resulting boxed type. We say the result of boxing a value of type T is an item on the stack of type boxed<T>.
Well, most of the time...
nullbox Rears its Head
Life was simple before the nullbox DCR. In the good ole' days, when you boxed a value of type T, you got back a boxed<T>. Always. But now things change ever-so-slightly. If T is a Nullable<U>, the static type of the item left on the stack is a boxed<U>; all other cases are still typed as boxed<T>. Given that you can only refer to these things as 'object', perhaps this isn't concerning to you.
But from the verifier's perspective, it changes the user-observable semantics. It means you can't verifiably form a delegate over a Nullable<U> ever. The only way to produce something of type boxed<T> is through the use of a 'box' instruction; and we've established box<Nullable<U>> is a boxed<U>. There should be no way to produce a boxed<Nullable<U>>. It similarly means that you can't call virtual methods on a Nullable<U> because, by definition, you must box value types in order to callvirt them.
(This statement doesn't apply to all virtual method dispatch. If you use constrained calls--as the C# compiler does--they do the magic to determine if there's a suitable implementation on the target type and, if so, avoid boxing. They operate on the raw unboxed item as 'this'. This applies when you call ToString on a Nullable, for example, because Nullable overrides it.)
Notice also that this is the very magic that permits you to make interface calls on a Nullable<T> for interfaces that T supports. It gets boxed and the type tracking permits you to perform the operation.
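The runtime-visible half of this is easy to observe from C#. The sketch below (NullBoxDemo and its methods are hypothetical names) boxes a Nullable<int> and inspects what comes out: a value boxes as a plain int, a null boxes as a null reference, and the boxed result supports the interfaces of the underlying type.

```csharp
using System;

static class NullBoxDemo
{
    // Returns the dynamic type of the boxed value, or null if boxing
    // produced a null reference.
    public static Type BoxedTypeOf(int? value)
    {
        object boxed = value; // box of a Nullable<int>
        return boxed == null ? null : boxed.GetType();
    }

    // Calls an interface that int implements. The cast boxes the Nullable<int>
    // to a boxed int (a null 'left' would box to null and fail on the call).
    public static int CompareViaInterface(int? left, int right)
    {
        IComparable comparable = (IComparable)left;
        return comparable.CompareTo(right);
    }
}
```

Here `NullBoxDemo.BoxedTypeOf(5)` is `typeof(int)` rather than `typeof(int?)`, and `NullBoxDemo.BoxedTypeOf(null)` is null.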
Type Hole?
A type hole is a situation where the dynamic type of something doesn’t match what the static verifier said it would be. This is very bad. By definition, something which is verifiably type safe should not be able to corrupt data at runtime as a result of such type mismatches. A sound type system has no holes. The change associated with nullbox, however, introduces one specific cause for concern. We call it a “wart” but won’t go so far as to call it a “type hole.”
Consider what happens if a box instruction were to operate using a type token which was parameterized by some generic parameter in the scope. In other words, the thing being boxed was of some type T (perhaps declared on the enclosing method) that wasn’t known at verification time. Our verification rules state that the result of box 'T' is always boxed<T>. But as we’ve seen already, if this scope were instantiated with Nullable<U> as the argument for the type parameter T, this would be a lie. The dynamic type is actually a boxed<U>, yet we said it was a boxed<Nullable<U>> (indirectly).
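The generic scenario can be sketched in a couple of lines of C#. In the hypothetical Box<T> helper below, the conversion to object compiles to a box instruction whose type token is the method's type parameter, so the code is verified once against the abstract T, not against each instantiation.

```csharp
using System;

static class GenericBoxDemo
{
    // The return statement boxes 'value' using the type token T. Statically
    // the result is treated as a boxed T; dynamically, when T is Nullable<U>,
    // it is a boxed U (or a null reference).
    public static object Box<T>(T value)
    {
        return value;
    }
}
```

Instantiated as `GenericBoxDemo.Box<int?>(5)`, the returned object's GetType() is `typeof(int)`, which is exactly the mismatch between the verifier's story and the runtime's behavior described above.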
Unfortunately, because the verifier operates on abstract type forms, not precise generic instantiations, fixing this isn't quite as simple as you might imagine. Verification at JIT-time could actually determine this, but our model does not surface such lazy verification. In other words, the verifier does not have the information it needs at verification time to determine whether it is lying.
Why isn't this a type hole? If there was a property of an item statically typed as boxed<Nullable<U>> that you could verifiably make use of, but that wasn't also a property of boxed<U>, this would be precisely the definition of a type hole stated above. We'd discover the problem at runtime. Right?
*Whew!*
It turns out you can’t do any such things.
First of all, you cannot do anything to an unconstrained type parameter T except perform operations statically verifiable against ‘object’. A boxed<U> (the dynamic type of a static boxed<Nullable<U>>) is obviously a derivative of ‘object’, so this is OK. You couldn't verifiably form a delegate over a boxed<Nullable<U>>, for example, but that is because of the verifier, not because of the absence of Nullable's methods at runtime.
Next, you can also constrain type parameters to things that implement specific interfaces. So if Nullable<U> implemented an interface that its inner U didn’t support, you could legally make interface calls which would cause holes at runtime. But (thankfully) Nullable<U> doesn’t implement any interfaces. (Notice we removed INullableValue from Whidbey altogether. :P)
Furthermore, Nullable<U>’s type parameter is constrained to value types, so even if you could represent a constraint that allowed only value type values for T (which you can’t, you can only use the ‘valuetype’ constraint which omits Nullable directly) you couldn’t do anything dangerous.
Summary?
Move along. Nothing to see here. I just typed up a brief summary because we were discussing this at great lengths during ECMA specification. It sparked a lot of interesting discussion. And it's a little bit of trivia to impress your friends with.
I personally get a tad queasy when I think about proving the soundness of a type system through the use of members of that type system, but that's a separate topic...
 Tuesday, October 04, 2005
 Thursday, September 29, 2005
I’ve talked about Thread Aborts before. And I spoke briefly at PDC about why you shouldn’t lock on objects shared across AppDomains (ADs). But I wanted to spend a brief moment fusing the two together to illustrate the point. There are some interesting factors at play here.
The Guidance
To begin with, our Design Guidelines advise:
Do not lock on any public types, or on instances you do not control. Notice that the common constructs lock (this), lock (typeof (Type)), and lock (“myLock”) violate this guideline.
Most people don’t intuitively understand the why behind not locking on Types and Strings. I’ll leave the public type discussion off the table for this post.
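A minimal sketch of the alternative the guideline implies: lock on a private object that nobody else can reach. The class and member names here (Counter, sync, Demo) are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Threading;

// Hypothetical class for illustration; not taken from the Design Guidelines.
class Counter
{
    // Private and never handed out, so no other component can lock on it.
    private readonly object sync = new object();
    private int count;

    public int Increment()
    {
        lock (sync)
        {
            return ++count;
        }
    }
}

class Demo
{
    static void Main()
    {
        Counter c = new Counter();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.Length; i++)
        {
            threads[i] = new Thread(delegate()
            {
                for (int j = 0; j < 1000; j++) c.Increment();
            });
            threads[i].Start();
        }
        foreach (Thread t in threads) t.Join();

        // 4 threads x 1000 increments, plus this one.
        Console.WriteLine(c.Increment()); // prints 4001
    }
}
```

Because the lock object is an ordinary heap instance private to Counter, it is neither a Type nor an interned string, so it is not one of the shared objects discussed in this post.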
The Reasons
To understand why this is problematic, first you have to understand that we share objects across ADs. And you need to understand when we do it. Various Reflection bits and bytes—such as instances of the Type class—are one such case, when they refer to a domain neutral assembly. mscorlib always gets loaded domain neutral, for example; other assemblies can fall into this category too, based on Hosting policies. Interned strings are also shared across ADs, so a “Hello, World” literal in AD #A is the same precise object as that “Hello, World” in AD #B in the same process. All of the above are called cross-AD bled objects or AD-agile instances, something discussed in reasonable detail here.
A conclusion that you can make right away is that code in ADs #A and #B locking on a shared object can interfere with each other, even if it’s only by coincidence. Subtle timing oddities might arise (including starving another AD’s completely unrelated, opaque body of code for a seemingly unknown reason), but in many cases the effects won’t be so catastrophic. And in some rare situations it might even be intentional. But let’s take it a step further.
Next, you need to know a little about how we perform AD unloads. If you want to know a lot about this, go read Chris’s excellent post on the subject (the same as above). But I will try to summarize, and in doing so will paint a naïve picture of the world. During ordinary AD unloads we are careful to ensure that threads are unwound in an orderly fashion. That is to say, finally blocks lexically surrounding the instruction pointer are run, and of course objects in that AD are given a chance to run their Finalizers. This happens because a ThreadAbortException is generated at the current point of execution in each thread which actively has a stack in the AD.
Assuming you’ve written your code to use a lock statement (or at least to release the Monitor in a finally block), this orderly thread unwinding permits you to release any locks held. You may catch a Thread Abort, but it is a so-called undeniable exception, meaning it will be reraised at the end of your catch blocks. This is quite visible during an ordinary unload. And of course, Aborts are suspended in the case that you’re in a CER, unmanaged chunk of code that isn’t polling for aborts, a finally block, and so on. Lastly, if you see an Abort happen when your code holds a Monitor, you can be assured the entire AD is being ripped—not just a single Thread; this assumption is safe because we work with Hosts (via Begin- and EndCriticalRegion) to let them know when the whole AD could become corrupt as the result of a single ThreadAbort.
But if you piss SQL Server off by taking too long in one of your finally blocks (for example), it will get a tad snippy. Specifically, it can respond by escalating to a rude AD unload. A rude unload does not tear the AD down by injecting ThreadAbortExceptions and enabling them to percolate back to the top of the call-stack. Rather, it rips it down very aggressively, bypassing lexically relevant finally blocks, only giving a best effort attempt at running CERs, and executing critical finalizers (CFOs) only. Of course, this isn’t nearly as aggressive as a P/Invoke to kernel32!TerminateProcess, but it’s not quite as polite as an ordinary unload.
This means, as a very specific example, that a finally block wishing to execute Monitor.Exit won’t even get run. And if the Exit doesn’t run, that Monitor will be permanently left stamped with the Thread’s ID as the owner. But the Thread has gone bye-bye. Orphaned. Until you’ve created 4,294,967,295 threads such that the Thread IDs wrap around and the old ID gets assigned to a new Thread, and that thread spuriously decides to Exit the Monitor without acquiring it first, your system is going to be locked up for a bit. In other words, deadlocked.
Side Note: Arguably this behavior is correct in any case; if two ADs were intentionally coordinating work, an orphaned lock is better than observing corrupt data structures. But for accidentally shared objects, perhaps it's overly draconian. But I digress.
In fact, any host might do this based on a variety of policies. Some might choose to perform rude AD unloads all of the time, while others might not do it at all. Most of them will use an escalation policy rather than doing it outright—such as SQL Server—but anything’s fair game when the host is in control. A matrix of which hosts do what would be nice, but I don’t have one. We have a nifty tool internally that allows simulation of any of these policies, but you can just as “easily” do it yourself by navigating the Hosting APIs. The general idea is described in more detail in Stephen Toub’s recent excellent MSDN article, and in gory detail in the Customizing the CLR book.
A Demonstration
Let’s first take a look at and observe the effects of a scenario which locks on cross-AD objects:
using System;
using System.Threading;

class Program {
    static void Main() {
        // Start up a new AppDomain that hogs a lock.
        AppDomain ad = AppDomain.CreateDomain("FooDomain");
        ad.DoCallBack(delegate {
            Thread t = new Thread(delegate() {
                lock (typeof(string)) {
                    try {
                        Console.WriteLine("AD#B: Got it.");
                        Thread.Sleep(10000);
                    }
                    catch (Exception e) {
                        Console.WriteLine("AD#B: {0}", e);
                        //Thread.Sleep(5000); // provoke a rude unload?
                    }
                }
            });
            t.Start();
        });

        // Pause briefly.
        Thread.Sleep(500);

        // This will fail because AD#B owns the shared lock.
        bool b = Monitor.TryEnter(typeof(string), 500);
        if (b) {
            Console.WriteLine("AD#A: Got it.");
            Monitor.Exit(typeof(string));
        }

        // Kill the other AppDomain.
        AppDomain.Unload(ad);
        Console.WriteLine("AD#A: AD#B is dead.");

        // Is the lock orphaned? If we provoked a rude unload, this should hang.
        lock (typeof(string)) {
            Console.WriteLine("AD#A: I got in!");
        }
    }
}
I hope the code is simple enough to be obvious. A brief explanation is warranted:
- From an existing Thread T1 in AD #A, we create a new AD #B, and start a new Thread of execution T2 running inside of it;
- T1 resumes and waits briefly to ensure T2 can make forward progress first;
- T2 locks on typeof(String), and then goes to sleep for a while;
- Meanwhile, T1 resumes, attempts to acquire the lock, and fails (T2 holds the lock, and the String type is shared across ADs);
- T1 then initiates an Unload on AD #B;
- The result is a ThreadAbort in T2, the finally block releases the Monitor, and AD #B is successfully unloaded;
- T1 in AD #A successfully acquires the lock.
Throughout all of this, there is some nice text being printed to the console. I see the following:
AD#B: Got it.
AD#B: System.Threading.ThreadAbortException: Thread was being aborted.
   at System.Threading.Thread.SleepInternal(Int32 millisecondsTimeout)
   at Program.<Main>b__1()
AD#A: AD#B is dead.
AD#A: I got in!
Looks Fine, Eh?
Well that works just fine in unhosted scenarios, as we might have expected it to. The lock-protected bits of code stomp on each other, but at least AD #B happily gives up the lock during an ordinary unload. Note that if the code running in AD #B were careless, it might not have protected the lock acquisition/release in a try/finally, in which case AD #A would be screwed. It would deadlock when it attempted to acquire the lock.
But things get worse. More subtle deadlocks can occur, even if AD #B were written correctly through the use of the C# ‘lock’ statement. As we’ve already established, this might happen if the code were run inside a host that employed rude AD unloads, such as SQL Server. If a thread initiated a rude AD unload in AD #B while it held the lock, the same exact code that worked in the unhosted case would deadlock as soon as AD #A’s last attempt to acquire the lock executed. Presumably SQL Server would notice this deadlock and kill the code—perhaps leading to both ADs ultimately being unloaded—but I am not 100% certain about this.
A Possible Refinement
Through a combination of CERs, we can get our code working again. Note that—if it’s not obvious by now—the real solution is to avoid locking on cross-AD bled objects! Just don’t do it and you won’t get into this trouble. But of course, the geek inside instigates more fun…
Brian Grunkemeyer, a developer on our team, wrote a great piece of code sometime between Beta2 of Whidbey and now. It’s a method on Monitor called ReliableEnter, and it permits you to acquire a Monitor and know reliably whether it succeeded. It does so with a Boolean byref parameter which is set inside of a ThreadAbort-safe region of native code. This means that you can actually rely on the value of the Boolean in a cleanup CER, for example, to indicate whether the Monitor was successfully acquired or not, while at the same time not actually suspending ThreadAborts by wrapping the whole acquisition in a CER.
Unfortunately, we were unable to make it accessible in Whidbey. It’s an internal method, and it got added too late. We’ll probably do that in the future. To make calling it cheap and possible, I wrote a little hack that uses a DynamicMethod to bind to it. In fact I did a little more than just that. I’m not going to analyze it in detail. Feel free to ask questions if you wonder how it works:
using System;
using System.Reflection;
using System.Reflection.Emit;
using System.Threading;

delegate void MonitorAction();

class ReliableMonitor {
    class Holder<T> {
        internal Holder() { this.value = default(T); }
        internal Holder(T value) { this.value = value; }
        internal T value;
    }

    delegate void ReliableEnterDelegate(object obj, Holder<bool> taken);
    private static ReliableEnterDelegate monReliableEnter;

    static ReliableMonitor() {
        // Bind to the internal Monitor.ReliableEnter method through a DynamicMethod.
        MethodInfo reMi = typeof(Monitor).GetMethod("ReliableEnter",
            BindingFlags.Static | BindingFlags.NonPublic);
        DynamicMethod dm = new DynamicMethod("Mon_ReliableEnter", null,
            new Type[] { typeof(object), typeof(Holder<bool>) }, typeof(Program), true);
        ILGenerator ilg = dm.GetILGenerator();
        ilg.Emit(OpCodes.Ldarg_0);
        ilg.Emit(OpCodes.Ldarg_1);
        // Pass the address of Holder<bool>.value as the byref Boolean.
        ilg.Emit(OpCodes.Ldflda, typeof(Holder<bool>).GetField("value",
            BindingFlags.Instance | BindingFlags.NonPublic));
        ilg.Emit(OpCodes.Call, reMi);
        ilg.Emit(OpCodes.Ret);
        monReliableEnter = (ReliableEnterDelegate)dm.CreateDelegate(typeof(ReliableEnterDelegate));
    }

    internal static void Enter(object obj) {
        Monitor.Enter(obj);
    }

    internal static void RunWithLock(object obj, MonitorAction action) {
        Holder<bool> taken = new Holder<bool>();

        System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(
            delegate {
                monReliableEnter(obj, taken);
                action();
            },
            delegate {
                // Only release the Monitor if the acquisition actually succeeded.
                if (taken.value) {
                    Monitor.Exit(obj);
                    taken.value = false;
                }
            },
            null);
    }

    internal static void Exit(object obj) {
        Monitor.Exit(obj);
    }
}
Notice the RunWithLock method. It uses a great method RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup located in the System.Runtime.CompilerServices namespace. We call it SRCSRHECWGC—pronounced “shreek shreck woogy-cuck”—for short around here. Well, we don't really call it that, but I think I will from now on. SRCSRHECWGC runs the first delegate and uses some CER magic to guarantee that the cleanup code passed as the second argument executes in the face of rude AD unloads. At least the type of failures we’re concerned about here. It might not do its job very well if you pull the plug on your computer, for example.
If we were to rewrite our code above to use the RunWithLock method, it could survive a rude AD unload and skirt the frightening onset of a deadlock:
class Program {
    static void Main() {
        // Start up a new AppDomain that hogs a lock.
        AppDomain ad = AppDomain.CreateDomain("FooDomain");
        ad.DoCallBack(delegate {
            Thread t = new Thread(delegate() {
                ReliableMonitor.RunWithLock(typeof(string), delegate {
                    try {
                        Console.WriteLine("AD#B: Got it.");
                        Thread.Sleep(10000);
                    }
                    catch (Exception e) {
                        Console.WriteLine("AD#B: {0}", e);
                        //Thread.Sleep(5000); // provoke a rude unload
                    }
                });
            });
            t.Start();
        });

        // Pause briefly.
        Thread.Sleep(500);

        // This will fail because AD#B owns the shared lock.
        bool b = Monitor.TryEnter(typeof(string), 500);
        if (b) {
            Console.WriteLine("AD#A: Got it.");
            Monitor.Exit(typeof(string));
        }

        // Kill the other AppDomain.
        AppDomain.Unload(ad);
        Console.WriteLine("AD#A: AD#B is dead.");

        // Is the lock orphaned? If we provoked a rude unload, this should hang.
        lock (typeof(string)) {
            Console.WriteLine("AD#A: I got in!");
        }
    }
}
This has the effect that we wanted: the RunWithLock method guarantees that we release the lock even in the face of a rude unload. The result? AD #A does not deadlock.
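For readers on later runtimes, a hedged note: .NET Framework 4 added a public Monitor.Enter(object, ref bool) overload with the same "tell me reliably whether I got it" shape as the internal method described above. The sketch below assumes that overload is available, and the names (LockTakenDemo, sync) are made up for illustration. It shows only the acquisition pattern; a plain finally block like this one is still skipped by a rude unload, so it is not a substitute for the CER-based cleanup.

```csharp
using System;
using System.Threading;

// Hypothetical demo class; assumes .NET Framework 4 or later.
class LockTakenDemo
{
    private static readonly object sync = new object();

    static void Main()
    {
        bool taken = false;
        try
        {
            // 'taken' is set to true only if the Monitor was really acquired.
            Monitor.Enter(sync, ref taken);
            Console.WriteLine("Holding lock: {0}", taken); // prints "Holding lock: True"
        }
        finally
        {
            // Release only what we actually own.
            if (taken) Monitor.Exit(sync);
        }
    }
}
```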
Hoorah.
And they all rejoiced.
 Thursday, September 22, 2005
We made a change in Whidbey recently that impacts the verification of calls to virtual methods.
Invoking Virtual Methods Statically
Valid IL could previously invoke a precise implementation of a virtual method with a call instruction instead of a callvirt. The target type's exact method token could be specified, bypassing all dynamic dispatch altogether. For example, given two classes A and B
class A { public virtual void f() { Console.WriteLine("A::f"); } }
class B : A { public override void f() { Console.WriteLine("B::f"); } }
a consumer would ordinarily emit IL to perform a virtual dispatch, looking something like this in IL
newobj instance void B::.ctor()
callvirt instance void A::f()
The result is of course a properly dispatched virtual call which resolves to B's override and prints out "B::f". But somebody could do this instead
newobj instance void B::.ctor()
call instance void A::f()
The result of which is an ordinary statically dispatched call to A's implementation of f, printing out "A::f".
Some consider this a violation of privacy through inheritance. Lots of code is written under the assumption that overriding a virtual method is sufficient to guarantee the custom logic within gets called. Intuitively, this makes sense, and C# lulls you into this sense of security because it always emits calls to virtual methods as callvirts. C++ offers language syntax to do precisely this, however, e.g.
B b; b.A::f();
I don't know of any other language that supports this type of call directly, but presumably somebody else followed in C++'s footsteps here. C# (and others) use this technique to implement 'call to base' functionality. Some compilers emit this type of IL so that their method resolution code can bind to virtual methods in a custom way. And others could do it in an attempt to "devirtualize" method calls when they know there are no overrides.
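The 'call to base' case can be sketched in C# using the same A and B shapes as above. The Demo class is hypothetical; the IL noted in the comments is what the C# compiler typically emits for these two call sites.

```csharp
using System;

class A
{
    public virtual void f() { Console.WriteLine("A::f"); }
}

class B : A
{
    public override void f()
    {
        // Non-virtual call on 'this': call instance void A::f()
        base.f();
        Console.WriteLine("B::f");
    }
}

class Demo
{
    static void Main()
    {
        A a = new B();
        // Virtual dispatch: callvirt instance void A::f(), which resolves to B's override.
        a.f();
        // prints:
        // A::f
        // B::f
    }
}
```

Note that the non-virtual call here is made on the caller's own 'this' pointer.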
Verification Changes
Late in Whidbey, some folks decided this is subtly strange enough that we at least don't want partially trusted code to be doing it. That it's even possible is often surprising to people. We resolved the mismatch between expectations and reality through the introduction of a new verification rule.
The rule restricts the manner in which callers can make non-virtual calls to virtual methods, specifically by only permitting it if the target method is being called on the caller's 'this' pointer. This effectively allows an object to call up (or down, although that would be odd) its own type hierarchy. With this change, the above example fails verification, "The 'this' parameter to the call must be the calling method's 'this' parameter."
Identity Tracking
The verifier implements this magic using a technique called identity tracking. We don't use this style of tracking in many places. The verifier ordinarily tracks only the static type of items on the stack. But in this case, it needs to be comfortable that you're using the same arg.0 pointer for the method call as was passed onto the caller's stack frame. If you've executed a starg 0 in the IL stream, for example, you won't be permitted to make the call. Even if you do a ldarg.0 followed by a starg 0, the verifier tosses you out the window.
A catch here is that while you might be operating dynamically on the 'this' pointer, the verifier avoids statically tracking pointers across method calls. An example of where this can produce a false positive is as follows
class A {
    public virtual void f() { Console.WriteLine("Foo::f"); }
}

class B : A {
    public override void f() { Console.WriteLine("Bar::f"); }

    private B Echo(B b) { return b; }

    public void FailsVerification() { Echo(this).A::f(); }
}
It's clear that FailsVerification is really just invoking methods on its this pointer. But it does so in a roundabout fashion. (Of course that 'A::f()' syntax is pseudo-code; it would compile in C++, but C# doesn't offer such a feature.) Regardless, the IL that gets produced isn't verifiable.
 Thursday, September 08, 2005
Nearly 20 people from the CLR Team will be at PDC next week. This includes our Product Unit Manager, some senior Architects, Program Managers, Devs, and Testers. Of course, we'd love to meet one-on-one with folks from your company, or even you individually. We have a room available, and are flexible on the times.
If you're interested, just send an email to: PDC0511s@microsoft.com. Let us know what you're interested in. Note: Being the CLR Team, we're admittedly focused on the low level goop...But almost anything's fair game: AppCompat, Security, Reliability, Performance, Concurrency, Base Class Libraries, Garbage Collection, the Future of the CLR, etc.
If you want to meet with me in particular, email me at: joedu@microsoft.com.
 Sunday, August 28, 2005
This is a fun example that illustrates a few topics I'm discussing at PDC in a couple weeks.
What, if anything, can cause Thread 2's assert below to fire?
class Foo {
    static Foo lastFoo;

    string state;
    bool initialized;

    public Foo() {
        state = "Developers, Developers, Developers!";
        initialized = true;
    }
}

// Thread #1:
lastFoo = new Foo();

// Thread #2:
Foo f = lastFoo;
Debug.Assert(f.initialized == true && f.state != null);
For purposes of illustration, imagine that lastFoo has already been initialized to some Foo prior to threads #1 and #2 executing.
 Thursday, August 25, 2005
Following on the tail of Mr. Abrams's Channel9 video (is there one too many s's in there?), check out our video where we discuss the CLR Team's presence at the PDC (Part 1 and Part 2).
We talk with:
There were a few folks we didn't get to chat with:
In watching them, I think I frontloaded the talk in a selfish way. My two talks are first...It wasn't intentional, I swear! But oh well... Check it out!
I even made up a new word in the process: expressitivity. Doh! ;)
 Tuesday, August 23, 2005
You know you're a geek when:
- You read processor manuals for fun.
- ...
I've been deeply internalizing the memory models implemented on various flavors of x86, IA-32, AMD-64, and IA-64 lately. And then rationalizing how our various JITs manage to implement the new strengthened Whidbey memory model on each architecture. Believe it or not, I love this stuff. One of the perks of being a Microsoft employee is that you can gain access to dual-proc/dual-core/HT machines, AMD-64 and IA-64 boxes, and basically anything else you could imagine. Now if there were just more hours in the day.
Here are a few good resources if you're interested in doing some research yourself:
My PDC talk touches on some of the details of memory models briefly. I wish I could do an entire talk on cache coherency, branch prediction, pipelining, instruction reordering, and the like...But I think that would put most attendees to sleep. There needs to be more me's in the world.
 Friday, August 12, 2005
Check out Soma's post about the Nullable<T> DCR we recently implemented...we referred to the project as nullbox internally. This one kept me up at night on a few occasions, but was a lot of fun. :) Huge risk, but based on lots of feedback it was the right thing to do. And the team executed perfectly, nailing our target dates at each step along the way.
I alluded to this work here and here. I was vague and avoided answering probing comments intentionally. Now I can answer them...so ask away!
The core of this change is that the IL box instruction has been modified to recognize Nullable<T>s. For non-Nullables, behavior remains the same; but upon seeing one, it inspects its HasValue property. If HasValue is true, box peeks inside the structure, extracts the T value, and boxes that instead; otherwise, box simply leaves behind a null reference. Obviously, unbox has also been changed to allow nulls to be unboxed back into Nullable<T> structures. This had a rippling effect in the CLR codebase and also required changes to late-bound semantics to mimic the static case.
The result is that given
int? x = null; object y = x;
both expressions
x == null
y == null
evaluate to true. And furthermore, given
bool F<T>(T t) { return t == null; }
the following expressions
F(x)
F(y)
also evaluate to true.
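The box and unbox behavior described above can be observed directly from C#. This is a small sketch; the class and variable names are hypothetical.

```csharp
using System;

class NullBoxDemo
{
    static void Main()
    {
        int? some = 42;
        int? none = null;

        object boxedSome = some;   // box extracts the inner int and boxes that
        object boxedNone = none;   // box leaves behind a null reference

        Console.WriteLine(boxedSome.GetType());  // prints System.Int32
        Console.WriteLine(boxedNone == null);    // prints True

        // unbox accepts null and produces an empty Nullable<int>.
        int? roundTrip = (int?)boxedNone;
        Console.WriteLine(roundTrip.HasValue);   // prints False

        // A boxed int unboxes into a Nullable<int> that has a value.
        int? back = (int?)boxedSome;
        Console.WriteLine(back.Value);           // prints 42
    }
}
```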
I intend to post a more detailed summary of the DCR over the coming week[s].
 Saturday, August 06, 2005
I got to work on a fun DCR with Chris Brumme back around the time we were shipping Whidbey Beta2. (DCR means Design Change Request, essentially an unplanned change to the design of a component.) We went back and forth as to whether or not to release it with Beta2, but given that the implementation would have been right up against our lock down period, the risk was too high. Thus, it'll first appear in our next CTP, RC, or whatever release comes out before Whidbey RTMs.
The Problem
The crux of the problem is this. Lots of code gets written in C# assuming that catch (Exception) is sufficient to backstop any exception a piece of CLR software can generate. It turns out that, while doing so is not CLS compliant, IL can throw just about anything. The throw instruction will happily take a reference on the stack to any managed object--not just those whose type falls into the Exception type hierarchy--and unwind the stack with it in hand.
A typical user (and even some Framework developers) writes exception handling code that looks like this (without all the Console.WriteLines of course :P):
try {
    Console.WriteLine("Inside try...");
    F();
    Console.WriteLine("Exiting try");
}
catch (Exception e) {
    Console.WriteLine("In catch ({0})", e);
}

Console.WriteLine("Outside try...exiting gracefully");
Now, this will work perfectly fine if F did as follows:
static void F() {
    // foo...
    throw new InvalidOperationException();
    // bar...
}
InvalidOperationException derives from Exception, so the catch block picks it up. But what if F did this?
static void F() {
    // foo...
    throw 0;
    // bar...
}
Well, thankfully you can't write that in C#. But you can in verifiable IL:
.method private hidebysig static void F() cil managed
{
    .maxstack 1
    .locals init (int32 V_0)
    ldc.i4.0
    box [mscorlib]System.Int32
    throw
}
The specific type, int in this case, really doesn't matter. It could be any other reference type that doesn't somehow derive from Exception, a value type, or even a null reference!
You might turn your nose up at the idea of catching all exceptions. I did. But consider if you need to roll back sensitive state that was introduced inside the try block. I've already covered why doing this in the finally block only might not be sufficient. If F() were a virtual method that a user could override and somehow supply an object of their choosing, a malicious user could use this (along with an exception filter) to mount a nasty security attack. Coming from a Java background, I was initially very surprised how real this problem is...The world becomes much more complex when you interop so tightly with the OS. For example, the CLR has to work well with SEH primarily for situations where mixed call stacks make unmanaged-to-managed (and vice versa) transitions. Suffice it to say that the two pass model introduces lots of complexities.
The Solution
Many people think that this is inherently a C# problem. Isn't it C#, not the runtime, that forces people to think in terms of Exception-derived exception hierarchies? Certainly there is precedent that indicates throwing arbitrary objects is a fine thing for a language to do. Just take a look at C++ and Python. And furthermore, C# actually enables you to fix this problem:
try {
    F();
}
catch {
    // ...
}
This approach has two problems. First, the catch-all handler doesn't expose to the programmer the exception that was thrown. C# could have changed this (e.g. with TLS data exposed through a static member, e.g. Exception.GetLastThrown, or something like that). That still wouldn't solve the problem that things that aren't exceptions don't accumulate a stack trace as they pass through the stack, making them nearly impossible to debug. But probably worse, the average programmer doesn't even know this is a problem! Including those who are writing code for the Frameworks that Microsoft ships. But they really shouldn't have to know. This problem spans many languages, and it really made sense for the runtime to help them out.
We solved the problem by introducing some new behavior inside the exception subsystem of the CLR. It's mostly transparent to the user. When something gets thrown that is not derived from Exception, we instantiate a new System.Runtime.CompilerServices.RuntimeWrappedException, supply the originally thrown object as an instance field of that puppy, and propagate that instead. It's public; most people will never catch such things directly, but you can if you need to access the thing that got thrown in the first place.
This has some nice benefits. The C# user can continue writing catch (Exception), and--since RuntimeWrappedException derives from Exception--will receive any non-CLS exceptions. The try/catch block we had originally written will just work for free now. And furthermore, we now capture stack trace for everything, meaning that debugging and crash dumps are immediately much more useful. Lastly, there's still a playground for languages that wish to continue participating in throwing exceptions not derived from Exception.
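A rough sketch of what this looks like from the C# side. Since C# cannot throw a non-Exception object itself, the example uses a DynamicMethod to emit a throw of a plain string; that is just a convenient way to produce such a throw for illustration, not something from the design described here. The names WrapDemo and Thrower are hypothetical, and the sketch assumes the calling assembly is compiled with the C# compiler so that wrapping is in effect.

```csharp
using System;
using System.Reflection.Emit;
using System.Runtime.CompilerServices;

// Hypothetical delegate type for invoking the emitted method.
delegate void Thrower();

class WrapDemo
{
    static void Main()
    {
        // Emit IL equivalent to: ldstr "not an exception"; throw
        DynamicMethod dm = new DynamicMethod("ThrowString", typeof(void), Type.EmptyTypes);
        ILGenerator il = dm.GetILGenerator();
        il.Emit(OpCodes.Ldstr, "not an exception");
        il.Emit(OpCodes.Throw);
        Thrower thrower = (Thrower)dm.CreateDelegate(typeof(Thrower));

        try
        {
            thrower();
        }
        catch (RuntimeWrappedException e)
        {
            // The originally thrown object is available on the wrapper.
            Console.WriteLine("Wrapped: {0}", e.WrappedException);
        }
        catch (Exception e)
        {
            // Anything else that derives from Exception lands here.
            Console.WriteLine("Other: {0}", e);
        }
    }
}
```

The more specific handler comes first only to reach the wrapped object; a plain catch (Exception) on its own would also see the wrapper, which is the point of the change.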
Supporting Naughty Languages
This last point actually complicates the design quite a bit. We queried our language community, and perhaps not-so-surprisingly, there are a lot of compilers that can throw anything. C++/CLI is one of them. So we had to preserve the existing semantics for those languages, while still enabling C# users to get the benefits of this change. Thus was born System.Runtime.CompilerServices.RuntimeCompatibilityAttribute. The C# and VB compilers will auto-decorate any compiled assemblies with this attribute, setting its property WrapNonClsExceptions to true. The runtime keys off of that to determine whether the old or new behavior is desired. The default is that we don't surface the aforementioned wrapping behavior (although as an implementation detail, we still do it). We expect more of these compatibility-preserving changes in the future, which resulted in the somewhat generic attribute naming.
If the attribute is absent, or present and WrapNonClsExceptions is set to false, we still actually wrap the exception internally so we can (1) maintain good stack traces for debugging and (2) clean up and optimize some of the exception code paths that had to branch based on the type of the exception. But we unwrap it as we match it against catch handlers. And we unwrap it when we deliver it to catch filters. So these languages don't know anything ever changed.
It actually gets a bit more complicated than this, however. For cross-language call stacks, we do the unwrapping based on whatever the assembly containing the catch clause wants to do. Say method M in C++/CLI assembly A throws an int; M is called by method N in C# assembly B. At throw time, we construct a new RuntimeWrappedException and use that for propagation. If assembly A catches it, all it sees is the int...It never knows we wrapped it. But if it leaks, and assembly B had wrapped the call to M with a catch (Exception), that handler will actually see a RuntimeWrappedException. Furthermore, consider if there were another C++/CLI assembly C; if N didn't catch the leaked int, it would surface in C as if it never got wrapped. This is what users expect to happen, and it composes very nicely.
Most users won't even know about this change. But hopefully their code gets more secure and robust for free.
 Friday, July 22, 2005
The CLR was designed to work very well in a COM world. This design choice is not at all surprising given the history of programming on Windows, and that the CLR began life as the COM+ 2.0 Runtime (among other temporary names). When it comes to concurrency in this world, however, there's a whole host of crap that can go wrong. Thankfully most of the time it doesn't.
Before moving on, if you have a finite amount of time, I'd recommend reading Chris Brumme's weblog on Apartments and Pumping. It's exponentially more worth your time than this post. I'm going to assume you have been introduced to at least a few of the concepts there. Most mortals on this planet haven't. You'll also want to come to my Programming w/ Concurrency talk at PDC, where I'll discuss such "to the metal" details.
OK. I've hyped it up. But I don't really have that much to say.
Using monitors for critical sections
When somebody accesses a shared piece of memory from multiple units of parallel execution, some form of locking is usually necessary. For a very small class of programmers, avoiding locks and retaining correctness is possible, but it's rocket science. Most people give up quickly if they even think to try in the first place (except for double checked locking, which is often copied from some book or website on the topic). If it's a simple primitive operation, interlocked operations might work. But in other cases, you need a coarser grained critical section-ish lock. For managed programmers, this is Monitor (i.e. 'lock' keyword in C#).
A class that has a private shared static variable, for example, would lock on it before mutating its contents. Imagine we have a class Coords:
class Coords { public int x; public int y; }
Our program decides it needs to maintain the invariant that x == y (don't ask why), and here's the code a developer might write:
class MyComponent : ServicedComponent {
    private static Coords myCoords = new Coords();

    void DoWork() {
        lock (myCoords) {
            myCoords.x++;
            DoMoreWork();
            myCoords.y++;
        }
    }

    void DoMoreWork() { /* code that tolerates broken invariants */ }
}
So long as we never leak the myCoords instance (raising the risk somebody accesses it w/out locking), we're safe. Right?
Not quite.
Enter STA
You might not have noticed that MyComponent derives from ServicedComponent. This is a ContextBoundObject that lives by all of the standard COM component rules. If it's instantiated inside an STA (Single Threaded Apartment), all access is serialized, as is the case with ordinary COM components. Now, this might seem a tad esoteric, but consider if you have a class that's called by a user who wrote their own ServicedComponent. It might seem more real, and is equally as problematic.
Chris's article above talks at great length about message pumping. STAs have to pump messages, otherwise queued messages could get starved. For UI applications, this pisses users off. For other applications, it can lead to fairness issues at best and incorrect code at worst. We pump for you so you don't need to worry about it, but we might do it in places you might not expect. This ends up being nearly anywhere you can block.
Let's pretend DoMoreWork above did this:
void DoMoreWork() { Thread.CurrentThread.Join(0); }
Join waits for the target thread to complete execution or the timeout to expire, whichever comes first. Since we call it on our own thread, it should be clear which occurs first. (You are still awake, right?)
When you pump, code can reenter on top of your existing stack. Let's look at the entire snippet of code:
    [ComVisible(true)]
    public class MyComponent : ServicedComponent
    {
        private static Coords c = new Coords();

        public void DoWork(int n)
        {
            Console.WriteLine("{0}->", n);

            lock (c)
            {
                // Check invariant x==y upon entry
                int x = c.x, y = c.y;
                Console.WriteLine("{0}:{1},{2}", n, x, y);
                Debug.Assert(x == y,
                    string.Format("Broken invariant on entry (#{0}, {1}!={2})", n, x, y));

                c.x++;
                DoMoreWork();
                c.y++;

                // Ensure invariant x==y upon exit
                x = c.x; y = c.y;
                Debug.Assert(x == y,
                    string.Format("Broken invariant on exit (#{0}, {1}!={2})", n, x, y));
            }

            Console.WriteLine("{0}<-", n);
        }

        private void DoMoreWork() { Thread.CurrentThread.Join(0); }
    }
Recap: The call to DoMoreWork from the DoWork function occurs while invariants are broken. And DoMoreWork (or a function that DoMoreWork calls, e.g. some opaque method inside the Framework) pumps. This is a recipe for bad things.
I also added some Console.WriteLines and Debug.Asserts in there so you can watch the world fall down.
Breaking monitors with reentrancy
The situation we need to get into in order to show off this neat parlor trick is as follows:
- A bunch of MyComponents are created inside an STA server;
- We try to make a load of calls to DoWork on those components from an MTA client;
- This requires that the MTA code reenter the STA to execute;
- Our STA thread pumps while invariants are broken, thus reentering another set of work (and enabling it to see us in an inconsistent state).
It's not quite as difficult as it sounds, thanks to the CLR's accommodating interaction with the world of COM.
    class Program
    {
        const int threadCount = 5;

        [STAThread]
        static void Main()
        {
            // Create our components in our STA server (note the STAThread on Main)
            MyComponent[] components = new MyComponent[threadCount];
            for (int i = 0; i < threadCount; i++)
                components[i] = new MyComponent();

            // Instantiate a bunch of MTA threads to work on the STA component
            List<Thread> threads = new List<Thread>(threadCount);
            for (int i = 0; i < threadCount; i++)
            {
                int v = i;
                Thread t = new Thread(delegate () { components[v].DoWork(v); });
                t.SetApartmentState(ApartmentState.MTA); // default--here for illustration
                threads.Add(t);
            }

            // Let 'em loose
            threads.ForEach(delegate (Thread t) { t.Start(); });

            // If you haven't Aborted by now, wait for completion
            threads.ForEach(delegate (Thread t) { t.Join(); });
        }
    }
This glob of code does exactly what my bullets indicate. The whole thing can be downloaded here. Note: ensure you compile this with the DEBUG symbol defined, otherwise your calls to Debug.Assert won't be present and you won't get the desired effect of being bombarded with assert dialogs.
It's quite nice that the CLR goes out of its way to marshal across contexts, moving our code over from the MTA to the thread in the STA, executing it, and marshaling back. And furthermore, the pumping it is doing is in good faith. It's trying to make our application responsive and fair.
Unfortunately, I see the following output when I run the code:
    Constructing components in a STA server...
    Instantiating 5 MTA threads to operate on our components...
    Starting up MTA threads...
    Waiting for MTA completion...
    3->
    3:0,0
    3<-
    2->
    2:1,1
    2<-
    1->
    1:2,2
    0->
    0:3,2
    4->
    4:4,2
    4<-
    0<-
    1<-
Notice the "3,2" line. That prints out "x,y"... and does so at a point in the program where they should always be equal. Unfortunately, we've got reentrant code inside our lock, and it now has access to broken invariants! Your mileage may vary based on the inherent race condition. To be fair, this is also a byproduct of our decision to make monitors reentrant. But this decision was made for recursion, not reentrancy. It turns out we don't recognize the difference.
Of course, the above example doesn't demonstrate anything too terrible. But if you happened to apply some sensitive thread-wide state that you intended to roll back before enabling other code to run, for example, it means you absolutely want to avoid pumping inside a critical section. That means mostly avoiding opaque method calls, even if you suspect they don't pump. In the future, they could. In practice, this is tough to achieve. And in practice, it usually doesn't matter.
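The recursion-versus-reentrancy point can be seen without any COM at all. Here is a minimal sketch, assuming a plain callback as a stand-in for the pump: the `whileBroken` delegate and the `ReentrancyDemo` class are hypothetical names for illustration, not part of the original sample. The nested call runs on the same thread, so the monitor lets it straight in while x != y.

```csharp
using System;

class Coords { public int x; public int y; }

static class ReentrancyDemo
{
    static readonly Coords c = new Coords();

    // 'whileBroken' is a hypothetical stand-in for pumping: it runs on the
    // same thread while the lock is held and the invariant x == y is broken.
    // Returns whether the invariant held when this call entered the lock.
    public static bool DoWork(Action whileBroken)
    {
        lock (c)
        {
            bool consistentOnEntry = (c.x == c.y);
            c.x++;
            if (whileBroken != null) whileBroken(); // reentrancy point
            c.y++;
            return consistentOnEntry;
        }
    }

    public static bool NestedCallSawConsistentState()
    {
        bool nested = true;
        // The nested DoWork acquires the same monitor recursively, so the
        // lock does nothing to stop it from observing x != y.
        DoWork(delegate { nested = DoWork(null); });
        return nested;
    }

    static void Main()
    {
        Console.WriteLine(NestedCallSawConsistentState()); // False
    }
}
```

The monitor cannot tell a deliberate recursive call from code that merely happened to land on top of the same stack; both look like the owning thread asking for a lock it already holds.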
I made a pretty simple mistake just now. I knew the source of the problem immediately when I saw the exception, but it's fairly interesting.
What's wrong with this code?
    MyComponent[] myComponents = Create25Components();
    for (int i = 0; i < 25; i++)
    {
        Thread t = new Thread(delegate () {
            // do some stuff
            myComponents[i].DoWork();
            // do some more stuff
        });
        t.Start();
    }
Oops! you say? Oops! for sure.
The reference to the induction variable i is treated like any other variable that C# captures inside an anonymous delegate: it is hoisted into a closure class, and every thread shares access to that one instance. Reading i inside the thread simply dereferences the shared memory location at the time the thread executes, not at the time the delegate was created. So assuming the parent thread is able to spin through the loop quickly, all of your threads will probably see the final value of i, which is 25. It turns out 25 is an invalid index for the array, resulting in an IndexOutOfRangeException or two. If the threads get a chance to run sooner, they will see some number in between, but probably not the correct one!
One (of many) solutions is to write this instead:
    MyComponent[] myComponents = Create25Components();
    for (int i = 0; i < 25; i++)
    {
        MyComponent mc = myComponents[i];
        Thread t = new Thread(delegate () {
            // do some stuff
            mc.DoWork();
            // do some more stuff
        });
        t.Start();
    }
Easy mistake.
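To watch the capture behavior without a thread race muddying the result, here is a small sketch that stores the delegates and only invokes them after the loop has finished. `CaptureDemo` and its two methods are names made up for this illustration.

```csharp
using System;
using System.Collections.Generic;

static class CaptureDemo
{
    // Every delegate captures the one shared loop variable 'i'.
    public static List<int> SharedCapture(int n)
    {
        List<Func<int>> readers = new List<Func<int>>();
        for (int i = 0; i < n; i++)
            readers.Add(delegate { return i; });

        List<int> seen = new List<int>();
        foreach (Func<int> r in readers) seen.Add(r()); // runs after the loop ended
        return seen;
    }

    // Each iteration copies the value into a fresh local before capturing it.
    public static List<int> CopiedCapture(int n)
    {
        List<Func<int>> readers = new List<Func<int>>();
        for (int i = 0; i < n; i++)
        {
            int copy = i;
            readers.Add(delegate { return copy; });
        }

        List<int> seen = new List<int>();
        foreach (Func<int> r in readers) seen.Add(r());
        return seen;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(",", SharedCapture(3))); // 3,3,3
        Console.WriteLine(string.Join(",", CopiedCapture(3))); // 0,1,2
    }
}
```

With threads in the picture the shared version is no longer deterministic, which is exactly why the bug tends to show up as an occasional exception rather than a reliable one.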
 Thursday, July 21, 2005
It seems JScript was a bit ahead of its time, in much the same sense that Lisp was. More accurately ECMAScript was, the standard on which Microsoft's implementation was based.
In 3 lines of code, here is a REPL that compiles and works. It is written in JScript.NET itself (jsc.exe):
    import System;
    while (true)
        print(eval(Console.ReadLine()));
A more robust version can be written in roughly 50 lines of code:
    import System;
    import System.Text;

    function read() {
        var program = new StringBuilder();

        while (true) {
            Console.Write("{0}{0} ", program.Length == 0 ? ">" : ".");
            var line = Console.ReadLine();
            if (line != "") {
                if (program.Length == 0 && line.StartsWith("!q"))
                    break;
                else
                    program.AppendLine(line);
            } else if (program.Length != 0) {
                break;
            }
        }

        return program.ToString();
    }

    function evalprint(program) {
        try {
            var o = eval(program);
            Console.WriteLine("=> {0}", o);
        } catch (e) {
            Console.ForegroundColor = ConsoleColor.Red;
            Console.Error.WriteLine(e.ToString());
            Console.ResetColor();
        }
    }

    while (true) {
        var program = read();
        if (program == "") break;
        evalprint(program);
    }
It's just a matter of time before C# and the world at large embrace the power of eval().
 Sunday, July 17, 2005
Note to self: if you happen to write code that AVs the FJIT in Rotor, this prevents you from building the FX tree. I honestly don't quite know why the runtime is loaded there and what managed code executes, but according to Jan Kotas, a dev on the CLR team, it is. And furthermore, the debugging experience kind of sucked... the build log had no errors in it, but binplace failed to find the output DLL when it tried to copy. Sure enough it didn't get built. I'm pasting the error for folks searching in the future:
    Binplacing - objd\rotor_x86\system.xml.dll for all platforms
    binplace : warning BNP0000: CopyFile(C:\dev\play\sscli-1.0__STM\fx\src\xml\objd\rotor_x86\System.Xml.dll,C:\dev\play\sscli-1.0__STM\build\v1.x86chk.rotor\.\System.Xml.dll) failed 2
    binplace : error BNP0000: Unable to place file objd\rotor_x86\System.Xml.dll - exiting.
It killed at least 1 hour of my time tracking it down. Running a quick test under the devenv debugger:
C:\dev\play\sscli-1.0__STM\tests\il_bvt\base\objd>devenv /debugexe %TARGETCOMPLUS%\clix.exe ceq.exe
Did the trick. Stupid bug, easily fixed, and now I'm back building the tree again. Hoorah.
I've spent a bit of time this weekend on my concurrency talk for PDC. It's taking me longer than expected, mostly because I'm writing "a story" up front... before I even think about touching PPT or writing code. The end result will be a great story to tell captured in a paper and--so that I have a convenient way to guide me through the talk--a slide deck. Too many people use PPT as a crutch for presentations, and most of the time it shows.
The talk's focus is on the hows and whys of concurrency with a good mix of the realities of the Windows platform thrown in. This necessarily involves some mechanics (e.g. best practices with explicit threading and the ThreadPool, synchronization, locks, lock-free programming), but also a detailed look inside our platform's legacy, how and why we got here, why some of our legacy still affects how we write concurrent code (anti-concurrency), and where we're headed.
If you're interested in reading up on this stuff, I'd recommend any of the following books. It just so happens that they're all sitting in front of me and being used as references:
Another great related resource that you might want to check out is an article Vance Morrison wrote for August's MSDN magazine. Vance is one of the most senior guys on the team, and is the architect for the CLR's JIT. Bottom line: one of the smartest guys I've ever met.
 Thursday, July 14, 2005
As I'm sure you've heard, the PDC talk abstracts just went live.
Jan Gray and I are doing a two part series on Concurrency. His talk (Part I) focuses on the philosophy, hardware, and primitives. I jump up a notch (in Part II) to look at how the Windows platform exposes concurrency, and some of the abstractions we ship to help you take advantage of it:
Programming with Concurrency (Part II)—Multithreaded Programming with Shared Memory
In this session, see hands-on examples illustrating how best to achieve parallelism safely using multithreaded techniques on Longhorn and the Common Language Runtime (CLR). We walk through some common scenarios, APIs, best practices and pitfalls, and take an in-depth look at both managed and native technologies such as threading on the CLR, Windows threads, and OpenMP. To protect your code from concurrency hazards, we discuss how Longhorn and CLR can help you handle deadlocks and other hangs as well as shared memory exhaustion. We touch on more advanced topics such as CLR explicit threading and the thread pool and asynchronous programming. Legacy issues that impact concurrent programming today such as COM and UI apartment threading and thread affinity are considered. You can expect to walk away from this session with the knowledge necessary to get started on writing efficient and reliable concurrent programs.
Session Level(s): 400
Track(s): Fundamentals
I'm also co-presenting with Mr. Pobar in a talk every compiler geek would love... (and anybody who wants to see an Aussie and a Bostonian duke it out on stage over my assertion that Lisp is the one and only truth in the world (don't worry, I'm just an academic ;) )...)
CLR: Writing a Managed Script Compiler in One Hour
Learn about writing a scripting language compiler that targets Intermediate Language in a single session. Coverage of key decision/choice points when compiling a language on the Common Language Runtime (CLR) are discussed, including the decision to statically or dynamically type and the impacts this has on your design. Includes coverage of writing a late-binder, showing that everything really can be typeless in your source language. Demonstration uses a strawman scripting language as the base language.
Track(s): Tools & Languages
Are you going to PDC? Either one of these talks interest you?
 Sunday, July 10, 2005
I'm wrapping up a chapter in my book on Unmanaged Interop this evening. In the process, I've fallen back in love with some classics on my bookshelf:
COM is still cool in my book. (Pun not intended.)
And furthermore, I can't tell you what a blessing it is to be able to write about a topic, encounter a question or two, and walk right down the hall to the guy's office who 1) knows the absolute most about a specific technology and 2) is kind enough to answer questions in exceeding detail. I hope this translates into a better end product.
Here are just a few (externally available) amazing resources related to hosting, CERs, and SafeHandles, the new face of Unmanaged Interop for V2.0:
The Designing .NET Framework Class Libraries series that aired on MSDN a while back was good fun. The format was fun for us here at Microsoft (video on Friday, chat on the following Wednesday). I hope those who participated agreed. If not, I'd love to get your feedback: what did you like, and what didn't you like about the format? Should we do it precisely the same way if we do that type of thing in the future, or make a couple of tweaks?
Here's a cross index to each of the talks with associated chat transcripts:
- Setting the Stage
- Naming Conventions
- Rich Type System
- Member Types
- Designing Inheritance Hierarchies
- API Usability
- Designing Progressive APIs
- CLR Performance Tips**
- Designing for a Managed Memory World**
- Understanding Interoperability**
- Packaging, Assemblies, and Namespaces
- FxCop in Depth
- Enabling Development Tools**
- Security**
- Q&A**
We had a great viewing count (I don't have the #s off hand), comparable to some of the more popular MSDN articles and downloads.
Unfortunately, a few chat transcripts are still missing from the home page. I apologize for this; it's in part my fault. The good news: I've been assured they'll be up this week.
In the interim, this cross index should suffice. Those marked with ** are currently missing their chat transcripts. The videos for all are available from the Talk Homepage links. Enjoy!
 Friday, July 08, 2005
Brad just pointed out the recent DotNetRocks episode with the PDC planning folks. They give a good insight into the insanity that goes behind planning such a huge event. I'm helping to organize the CLR Team's presence there, but my part is minimal compared to some guys (like the folks in the interview).
I'm really psyched about PDC this year. We've got a plethora of great talks lined up, and some of the best technical speakers you'll find this side of tha' Mississippi. Just take a look at the talks we've released thus far...
And they have an RSS feed so you can keep an eye on the talks as they are released!
If you haven't convinced your boss to pay yet, better work harder (and faster).
The fun begins 2 months from this weekend!
When access to a location in memory is shared among multiple tasks executing in parallel, some form of serialization is necessary in order to guarantee consistent and predictable logic. Furthermore, in many situations, a number of such reads and writes to shared memory are expected to happen “all at once,” in other words atomically. Serializability and atomicity are both often implemented using mutual exclusion locks. This is bread and butter stuff.
An important concept in concurrent programming is forward progress. This is the idea that as many parallel tasks as possible should make as much progress towards their goal as possible in every unit of execution time. Even if you manage to divvy up the work such that all tasks can execute completely logically independently from each other (called linear parallelization, something that is actually difficult to achieve in practice), sharing resources such as memory can quickly bog down your theoretical linear speedup. Shared memory prevents each task from making forward progress because there are points of execution where access to resources must be serialized. That means code has to wait in line in order to execute. That’s generally bad.
What an ambitious introduction. Unfortunately, I must constrain the rest of this particular article to some very precise, more manageable topics… Else I would never complete it, and might end up with a book on my hands. And furthermore, I am going to constrain my conversation to the CLR, with a focus on the Monitor APIs. I intend to write a series of these articles over the coming months, since I’ve been writing a lot about the topic in general lately.
Eliminating deadlocks
Deadlocks are well documented out there, and are simple to understand. Thus I will start with them. Deadlocks are by far the #1 forward progress inhibitors. While contention over shared memory can prevent all but one parallel unit from making forward progress (in the extreme case, where all tasks request access to the same resource simultaneously), deadlocks prevent all units involved in the access from making forward progress. Without detection and correction logic, your program is likely to come to a grinding halt.
For example, consider two bits of code running in parallel:
    #1:                             #2:
    lock (a)                        lock (b)
    {                               {
        lock (b)                        lock (a)
        {                               {
            // atomic code                  // more atomic code
        }                               }
    }                               }
As written, these can easily get into a so-called deadly embrace. Because they acquire locks in the opposite order, it’s not a difficult stretch to imagine #1 acquiring a, #2 acquiring b, and then #1 trying to acquire b (blocking forever), and #2 trying to acquire a (also blocking forever). The result is often a hung application or background worker thread, and a frustrated user having to open up Task Manager so they can slam the End Task button tens of times… and then waiting for dumprep.exe to get done with its jazz.
The solution to this problem is often “acquire and release locks in the same order,” but that’s seldom achievable in practice. It’s more likely that a and b are acquired in entirely separate functions, deep in some complex call-stack, which can moreover alter the flow of control at runtime. It’s not always a statically detectable situation. Another solution is to write your code so that it can back off of lock acquisitions if it suspects a deadlock has occurred. With the new Monitor.TryEnter API, this is relatively trivial to do (in the simple case).
Regardless of how ridiculously simplified this scenario is, let’s start here. It’s easier to understand and solve.
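For the simple two-lock case, the back-off idea can be sketched roughly as follows. This is only an illustration under my own assumptions: `BackOffDemo`, `TryEnterBoth`, `ExitBoth`, and the 15 ms timeout are hypothetical choices; the only framework calls used are Monitor.TryEnter, Monitor.Exit, and Thread.Sleep.

```csharp
using System;
using System.Threading;

static class BackOffDemo
{
    // Tries to take both locks. If the second can't be had within the timeout,
    // the first is released again so a competing thread can make progress.
    public static bool TryEnterBoth(object first, object second, int timeoutMs)
    {
        if (!Monitor.TryEnter(first, timeoutMs)) return false;

        bool gotSecond = false;
        try
        {
            gotSecond = Monitor.TryEnter(second, timeoutMs);
            return gotSecond;
        }
        finally
        {
            if (!gotSecond) Monitor.Exit(first); // back off
        }
    }

    public static void ExitBoth(object first, object second)
    {
        Monitor.Exit(second);
        Monitor.Exit(first);
    }

    static void Main()
    {
        object a = new object(), b = new object();

        while (!TryEnterBoth(a, b, 15))
            Thread.Sleep(0); // yield, then retry

        try { Console.WriteLine("holding a and b"); }
        finally { ExitBoth(a, b); }
    }
}
```

A timeout expiring doesn't prove there was a deadlock, only that the lock was busy for that long, so this trades the risk of hanging for the risk of needless retries.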
A quick note on SQL Server
Through the CLR’s hosting APIs, you can actually hook all blocking points, including Monitor.Enter calls. SQL Server (and possibly other sophisticated hosts) uses this to detect deadlocks and prevent them from occurring. Unfortunately, I don’t know their policy for handling, but presumably it is a fair one whereby a victim is chosen at random and killed. This is consistent with the way SQL Server handles deadlocks between data transactions. Chris Brumme’s weblog entry on Hosting has a plethora of related information.
Lock ordering and optimistic deadlock back-off
An old fashioned solution to this problem is to mentally tag all locks in your program, and ensure that you acquire them in a consistent manner. You could use a simple algorithm, such as “sort by variable name.” This works so long as you never alias a memory location. Oh, and so long as you don’t make a mistake when you’re writing the code (and anybody else who is touching your program). But this would be error prone and laborious. We can do better.
We could, for example, write a function that accepts a list of objects and does a few things in the process of locking on them:
- Sorts the objects in identity order to ensure consistent lock acquisition ordering;
- Uses a simple back-off strategy in case there are other lockers not using our ordered locking scheme.
The code might look like this:
    static int deadlockWait = 15;

    static bool EnterLocks(params object[] locks)
    {
        return EnterLocks(-1, locks);
    }

    static bool EnterLocks(int retryCount, params object[] locks)
    {
        // Clone and sort our locks by object identity.
        object[] locksCopy = (object[])locks.Clone();
        Array.Sort<object>(locksCopy, delegate(object x, object y) {
            int hx = x == null ? 0 : RuntimeHelpers.GetHashCode(x);
            int hy = y == null ? 0 : RuntimeHelpers.GetHashCode(y);
            return hx.CompareTo(hy);
        });

        // Now begin the lock acquisition process.
        bool successful = false;
        for (int i = 0; !successful && (retryCount == -1 || i < retryCount); i++)
        {
            successful = true;
            for (int j = 0; j < locksCopy.Length; j++)
            {
                try
                {
                    if (!Monitor.TryEnter(locksCopy[j], deadlockWait))
                    {
                        // We couldn't acquire this lock, ensure we back off.
                        successful = false;
                        break;
                    }
                }
                catch
                {
                    // An exception occurred--we don't know whether we got the last lock
                    // or not. Assume we did. We indicate that by incrementing the counter.
                    j++;
                    successful = false;
                    throw;
                }
                finally
                {
                    if (!successful)
                    {
                        for (int k = 0; k < j; k++)
                        {
                            try { Monitor.Exit(locksCopy[k]); }
                            catch (SynchronizationLockException) { /* eat it */ }
                        }
                        Thread.Sleep(0); // Might increase chances that a thread will steal a lock (good).
                    }
                }
            }
        }

        return successful;
    }
This method is actually sufficiently complex that it warrants a bit of discussion. Most of the complexity stems from our paranoia about orphaning locks coupled with the back-off algorithm. Notice that we first sort the list of locks, using the System.Runtime.CompilerServices.RuntimeHelpers.GetHashCode method for comparisons (this function returns a hash code based on an object’s identity, ignoring any GetHashCode override; such hash codes are not guaranteed to be unique, so two distinct locks can occasionally tie in the sort, which is one more reason the back-off logic is there). We then use a loop to try acquisition of the locks. If an acquisition fails, we begin the back-off logic by unraveling any locks we had acquired previously, yielding the thread to increase the chance that another possibly deadlocked thread is able to make forward progress, and starting over again.
Of course, a real function would probably offer a timeout variant. The timeout for Monitor.TryEnter isn’t configurable, the retryCount is near meaningless to the user, and the routine is still subject to denial of service attacks whereby somebody grabs a lock and holds on to it forever. In that case, we’ll loop forever (unless an explicit retryCount is provided; it defaults to -1, which means infinite). We also need a similar, although much simpler, ExitLocks mechanism. I’ve omitted these implementations for brevity. Lastly, in the face of asynchronous aborts, this code falls on its face. Nevertheless, it demonstrates the concepts (I hope).
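For what it's worth, here is one hypothetical shape the omitted ExitLocks could take. This is a guess at the simplest thing that pairs with EnterLocks, not the original implementation; the `LockHelpers` class name and the choice to swallow SynchronizationLockException (mirroring the back-off path in EnterLocks) are my assumptions.

```csharp
using System;
using System.Threading;

static class LockHelpers
{
    // Hypothetical counterpart to EnterLocks: releases every lock in the array.
    // Release order doesn't matter for correctness, so no sorting is needed.
    public static void ExitLocks(params object[] locks)
    {
        for (int i = locks.Length - 1; i >= 0; i--)
        {
            if (locks[i] == null) continue;
            try { Monitor.Exit(locks[i]); }
            catch (SynchronizationLockException) { /* not held by this thread; eat it */ }
        }
    }

    static void Main()
    {
        object a = new object(), b = new object();
        Monitor.Enter(a);
        Monitor.Enter(b);

        ExitLocks(a, b);

        Console.WriteLine(Monitor.IsEntered(a) || Monitor.IsEntered(b)); // False
    }
}
```

Eating the exception keeps cleanup paths simple, at the cost of hiding genuine bugs where a caller releases something it never owned.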
Cross call-stack ordering and back-off
Again, this strategy works only if you know all of your locks up front. With deep call-stacks, this may not be the case. For example, consider:
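    // 'a' and 'b' are the shared lock objects; the parameter is named 'first'
    // so that it doesn't hide 'b'.
    void f(bool first)
    {
        if (first) { lock (a) { g(!first); } }
        else       { lock (b) { g(!first); } }
    }

    void g(bool first)
    {
        if (first)
        {
            lock (a)
            {
                // some atomic function
            }
        }
        else
        {
            lock (b)
            {
                // some other atomic function
            }
        }
    }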
If these were called from two parallel tasks, one task run as f(true) the other as f(false), you’d have a similar, although much more complex and difficult to follow, deadlock scenario. We might be able to (almost) solve this, too, however, with some really ugly hacks that I wouldn’t suggest anybody uses in real code. With that caveat, let’s take a look at them…
We could learn a thing or two from STM. If we performed only idempotent and reversible operations inside the atomic (lock-protected) block, you could imagine a more complex back-off strategy that spanned multiple levels of a call-stack. This requires you to make a lot of assumptions, use exceptions for control flow, and, quite truthfully, adopt some unorthodox strategies (including polluting your thread with state). Moreover, without some form of transactional memory, rollback in the case of failure has to be done manually. These are in general bad practices, but the result seems to exhibit some redeeming qualities.
Here’s a big steaming pile of code that attempts to demonstrate a possible implementation:
    static LocalDataStoreSlot atomicSlot;

    static Par() { atomicSlot = Thread.AllocateNamedDataSlot("AtomicContext"); }

    internal class AtomicFailedException : Exception
    {
        public AtomicFailedException() {}
    }

    internal class AtomicContext
    {
        internal AtomicContext parent;
        internal List<object> toLock = new List<object>();
    }

    static bool DoAtomically(Action<object> action, params object[] locks)
    {
        return DoAtomically(action, null, locks);
    }

    static bool DoAtomically(Action<object> action, Action<object> cleanup, params object[] locks)
    {
        return DoAtomically(action, cleanup, 10, locks);
    }

    static bool DoAtomically(Action<object> action, Action<object> cleanup,
        int retryCount, params object[] locks)
    {
        bool entered = false;

        // We have to maintain our context so that we can unravel the parent correctly.
        AtomicContext ctx = new AtomicContext();
        ctx.toLock.AddRange(locks);
        ctx.parent = (AtomicContext)Thread.GetData(atomicSlot);
        Thread.SetData(atomicSlot, ctx);
        try
        {
            for (int i = 0; !entered && i < retryCount; i++)
            {
                // Snapshot the locks for this attempt: a failing child may append
                // to ctx.toLock while the action runs, and we must release exactly
                // the set we acquired (including locks escalated earlier).
                object[] held = ctx.toLock.ToArray();
                if (entered = EnterLocks(10, held))
                {
                    bool retryRequested = false;
                    try
                    {
                        action(null);
                    }
                    catch (AtomicFailedException)
                    {
                        if (cleanup != null) cleanup(null);
                        retryRequested = true;
                    }
                    finally
                    {
                        if (entered) ExitLocks(held);
                        if (retryRequested) entered = false;
                    }
                }
            }
        }
        finally
        {
            // Reset the context to what it was before we polluted it.
            AtomicContext cctx = (AtomicContext)Thread.GetData(atomicSlot);
            Thread.SetData(atomicSlot, cctx.parent);
            if (!entered && cctx.parent != null)
            {
                cctx.parent.toLock.AddRange(cctx.toLock);
                throw new AtomicFailedException();
            }
        }

        return entered;
    }
The last overload is obviously the most complex, and the meat of the implementation. DoAtomically uses a back-off strategy not unlike the first EnterLocks function. In fact, it uses EnterLocks for lock acquisition. DoAtomically maintains a context of the locks that must be acquired, and can be chained such that there is a parent/child relationship between two contexts (representing multiple DoAtomically calls in a single call stack).
The function then goes ahead and attempts to acquire each object that must be locked. If it succeeds, it calls the delegate that was supplied as an argument. This delegate can likewise make DoAtomically calls which will recursively detect deadlocks and perform escalation if they occur. Note: there is some noise here. Because of the small timeout we use, a function that holds a lock for an extended period of time can give the impression of a deadlock. This number could probably use some tuning. Further, I haven’t tested the interaction between this code and non-DoAtomically code. Presumably, it would be more susceptible to livelock, but wouldn’t actually fail or deadlock (assuming the other code doesn’t mount a denial of service).
The escalation policy we use is to perform cleanup logic (since we tried to execute the action, there could be broken invariants that must be restored), mutate the parent context so that it will attempt to acquire the locks the child tried to acquire (and failed), and essentially unravel the stack to the parent (using an exception—ugh—I think continuations would make this a much prettier situation). The parent then tries to acquire its own locks in addition to the child locks that got escalated. This can be an arbitrarily nested call-stack, so a parent could end up with more than just a single child’s locks to acquire. But this ensures an entire call stack’s worth of locks are acquired in an ordered fashion, and furthermore backed off of all at once. The obvious downside to this approach is that you end up taking a coarser grained lock than necessary, but with the benefit of avoiding deadlocks.
Assuming all of the back off and retry succeeds, it will return a true indicating success. If it doesn’t, and it’s exhausted all of its retries and escalation space, the topmost atomic block will simply return false to indicate failure. Honestly, an exception in this case might be more appropriate.
An overly simple example
A small test function that uses this (sorry, I didn't have time to write up a more complex one), is as follows:
    static int i = 0;
    static object x = new object();
    static object y = new object();

    static void Main()
    {
        List<Thread> ts = new List<Thread>();
        for (int j = 0; j < 20; j++)
        {
            Thread t = new Thread(new ThreadStart(delegate {
                DoAtomically(delegate {
                    i++;
                    Console.WriteLine("{0}, {1}", Thread.CurrentThread.ManagedThreadId, i);
                    i--;
                }, x, y);
            }));
            ts.Add(t);
        }

        ts.ForEach(delegate (Thread t) { t.Start(); });
        ts.ForEach(delegate (Thread t) { t.Join(); });
    }
Of course, all threads should print out the number 1.
A brief word on livelock
A quick word on livelock with the above design. The escalation policy defined above (your standard back-off, yield, and retry) is highly susceptible to livelock. This is a situation where code is trying to make progress, but chasing its own tail, or continually hitting conflicts. Consider what happens if a very long block and a very short block are competing in a deadlock fashion for the same resources.
The policy defined above will always back-off and retry, meaning that a short transaction has less work to do in order to perform its task. If the larger block is higher priority than the smaller one, we’re unfairly favoring the small block simply due to its size. But similarly, a long running block could acquire a lot of resources, and the smaller block could quickly try (and retry) to acquire locks, fail, and give up.
Lock leveling or some more intelligent queuing system might help out here. But I’ve written enough already.
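To give a flavor of what lock leveling means, here is a rough sketch. Everything in it is an assumption made for illustration: the `LeveledLock` type, the convention that a thread may only acquire locks in strictly decreasing level order, and the requirement that locks be released in reverse order. It addresses ordering mistakes rather than the fairness problem itself, since a level violation fails fast instead of entering the back-off-and-retry dance.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

sealed class LeveledLock
{
    // Locks currently held by this thread, most recent on top.
    [ThreadStatic] static Stack<LeveledLock> held;

    readonly object sync = new object();
    public readonly int Level;

    public LeveledLock(int level) { Level = level; }

    public void Enter()
    {
        if (held == null) held = new Stack<LeveledLock>();

        // Assumed convention: each new lock must have a strictly lower level
        // than the one most recently taken by this thread.
        if (held.Count > 0 && held.Peek().Level <= Level)
            throw new InvalidOperationException("lock level violation");

        Monitor.Enter(sync);
        held.Push(this);
    }

    public void Exit()
    {
        if (held == null || held.Count == 0 || held.Peek() != this)
            throw new InvalidOperationException("locks must be released in reverse order");

        held.Pop();
        Monitor.Exit(sync);
    }
}

static class LevelingDemo
{
    static void Main()
    {
        LeveledLock high = new LeveledLock(2), low = new LeveledLock(1);

        high.Enter(); low.Enter();   // decreasing levels: allowed
        low.Exit(); high.Exit();

        low.Enter();
        try { high.Enter(); }        // increasing levels: rejected
        catch (InvalidOperationException e) { Console.WriteLine(e.Message); }
        finally { low.Exit(); }
    }
}
```

Because every thread is forced into the same acquisition order, the opposite-order pattern that causes a deadly embrace gets caught as a programming error at the point of acquisition.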
Future topics
If you’re interested in a particular concurrency-related topic, let me know!
I’d like to spend more time in the future on:
- Events and signaling;
- Managing large groups of complex parallel tasks;
- Implicit parallelism, e.g. using compiler code generation and IL rewriting;
- STA, COM and UI programming, reentrancy;
- More on livelock—it happens in a lot of contexts—and some ideas on how to solve them;
- Lock free programming, and why you should avoid it.
Feedback will help me write about things you want to know about.
Happy hacking!
 Saturday, July 02, 2005
Those guys on the VC++ team have been busy workers. In the Whidbey release of C++/CLI (was Managed C++), they've added a whole big batch of new features. The best part? Some of these are things you simply can't do in C#. Put another way, C++/CLI exposes a larger set of the underlying features that the CTS/CLI has to offer.
For example, want to create a ref type on the stack? Fine.
MyType mt("Foo");
On the GC heap? Alright.
MyType^ mt = gcnew MyType("Foo");
(Note if you're wondering "how'd they do that?" The answer is that the first case only has stack semantics. It still lives on the GC heap. In other words, it acts very similar to 'using' in C#, but it maps nicely to the C++ programmer's existing understanding of stack versus heap semantics.)
Similarly, did you want to maintain a typed reference to a boxed value type? Ok.
int^ i = gcnew int(5);
This compiles to IL which uses modopts to store typing and boxing information so that the runtime/JIT know how to treat it, and for the verifier so that it knows it's being used in a type-sound manner. Did you need a Nullable<int>? Nonsense! Just set your reference to null and you've got it:
    int^ i = nullptr;   // now it's null
    i = gcnew int(5);   // now it's not
Furthermore, with stack semantics for ref types, deterministic finalization is simple. Just write a destructor for your type (it gets compiled down to a Dispose method), and it gets invoked for you when leaving the scope. Just like the old C++ days. This means you can say:
    {
        StreamReader sr(...);
        // do some stuff with stream
    }
And sr gets disposed just prior to leaving the block scope. You can also create your own standard resource mgmt wrappers, like those that come with TR1 (e.g. tr1::shared_ptr<>). Using the terms that Rico Mariani came up with in a meeting a while back, you've got "the bang" and "the twiddle"...
    !MyType() {}   // the bang: a finalizer
    ~MyType() {}   // the twiddle: a Dispose method
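For readers coming from C#, the rough counterpart looks something like the sketch below. Note the syntax trap: in C# the twiddle declares a finalizer, whereas in C++/CLI it declares the Dispose method. The `DisposeCalls` counter and the `DisposeDemo` class exist only for this illustration.

```csharp
using System;

sealed class MyType : IDisposable
{
    public static int DisposeCalls; // illustration only

    // Counterpart of the C++/CLI twiddle (~MyType): deterministic cleanup.
    public void Dispose()
    {
        DisposeCalls++;
        GC.SuppressFinalize(this); // the finalizer no longer needs to run
    }

    // Counterpart of the C++/CLI bang (!MyType): a finalizer.
    ~MyType() { }
}

static class DisposeDemo
{
    static void Main()
    {
        using (MyType mt = new MyType())
        {
            // do some stuff with mt
        }   // Dispose runs here, just prior to leaving the block scope

        Console.WriteLine(MyType.DisposeCalls); // 1
    }
}
```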
They've also implemented STL with full interoperability with Whidbey's generics.
They've also implemented OpenMP, a fairly ubiquitous shared memory parallelism library that I've been using a lot for research recently. Now they just need MPI and the world would be complete.
I'm using C++ for many things lately (mostly due to my Rotor work), and I have to say: as I use it more and more, I am starting to miss it. But admittedly I do sometimes prefer the cozy confines of managed code. C++/CLI enables me to nicely sit in between the two worlds, getting the best of both (and leaving behind the worst). There's a hell of a lot more to it than this post surfaces. Check it out.
Happy hacking!
 Monday, June 27, 2005
I need some feedback from the community here.
You see, we've got a veeerry interesting late-game DCR that we're wrapping up this week. It was on the order of 5 weeks of development work to implement. In other words, not small. (DCR == "Design Change Request," i.e. not a bug. We're changing the design of a feature we've already implemented.) I'll be less secretive once I am able to. In fact, I can't wait to blog about this one in detail...
Anyhow.
We're fundamentally mucking with the type system. (Yes, this makes me quite queasy.) In doing so we're introducing a bit of an oddity. The unfortunate thing is that we won't have a Beta with this functionality included. So... software being the inexact science it is, I figured maybe I'd get a response or two from folks out there. Not quite the same, but better than nothing I proclaim!
Now that I've played it up, it's really quite simple. What if
(new T()).GetType() == typeof(T)
evaluated to false? How horrible would that be? Think about it. I believe this is a fairly fundamental invariant we preserve when spanning the static and dynamic type systems.
If we broke it, would you lose sleep? Hate us? Throw your copy of Whidbey into the trash compactor and pick up Java 5 instead? (Go back to COM, VB6, ... DOS programming perhaps?)
I'm being flippant. Mostly because I'm extraordinarily tired. But I really wonder.
 Tuesday, June 14, 2005
I'm on a CLR Road Trip right now, which basically means a bunch of us on the team are out visiting customers. We're trying to get a clearer understanding of how folks use (or would like to use) our technology, and also to get feedback on our future direction. Despite the 19 hour day yesterday (no joke, 6:30am-1:30am--that doesn't even include any flying or anything, we flew in the night before!), I'm having a blast. We've done a couple of these in the past.
I gave a "CLR Internals" talk to the Vancouver .NET user group last night. Thanks to those who showed up! I found this MSDN article which drills nicely into some of my main topics: http://msdn.microsoft.com/msdnmag/issues/05/05/JITCompiler/. I'll try to post a deck in the next few weeks.
We're travelling to Calgary tonight and will be talking at the Calgary .NET user group in the next couple days.
 Saturday, June 04, 2005
I remember the distinct feeling I got the first time I entered 'csc.exe' at the command line and realized that the C# compiler wasn’t doing any exception checking for me. Surely I had done something incorrectly. Or so I thought. After a bit of time searching around, asking around, and banging my head against the wall (still got the fracture in my skull), I came to the realization that C# had chosen not to support checked exceptions. Hmm.
Surprised? Sure. Confused? Yep. Sad? Quite.
Once I began to understand the implications of this design choice, I just became more and more confused. How the hell do I know in what manner this method could fail?! Trial and error? Manually fuzzing an API and interpreting the errors it throws? Wow, that seems like a great process, eh? Guessing? Not caring? And never mind the problem of it changing the exceptions it decides to throw in the next version once I managed to figure it all out. (Remember: exceptions aren’t a static part of a method’s signature like they are in Java, so the implementation is free to silently alter its exception throwing policy without notice.) The horridness of this situation seemed to spiral into a rotting pit of smelly banana peels.
The next phase of my surprise, disgust, confusion, <insert word here> was to scour the tools and documentation. How could a hole this huge be left gaping open? Surely either the Object Browser in VS or the SDK documentation would fully expose this data, or maybe a magical Intellisense switch that revealed the Truth. Well, the answer was a resounding no. The SDK docs did an OK job (they’re getting better over time, but not nearly as good as JavaDoc can do with the information stored in metadata), but they weren’t complete, had to be grokked out of band, and were free to change silently (all still significant problems IMHO). And they didn’t do much good for my own code and its exceptions!
I learned the ropes programming in x86 ASM and C, when I was ~13 and into hacking MUDs and other games on Amiga, Linux, and then DOS. Then I moved to C++. And then I moved to Java, and I stayed there for 5-ish years. Along the way I experimented with LISP and Smalltalk, but never to a large degree. My first professional programming experience was with C++ and COM, but by and large I’ve written more lines of real project/product code in Java.
So my move to C# was one made with a lot of expectations around how things worked, and it took me some time to get through a whole slew of these little gotchas—small discrepancies between the JVM/Java and CLI/C#. But you know what? There’s only one that stands out in my mind today, and continues to bother the hell out of me: Checked exceptions.
Arguments against checked exceptions.
If you haven’t already, you should read this interview [http://www.artima.com/intv/handcuffs.html] with Anders Hejlsberg, the topic of which is C#’s decision not to support checked exceptions. I keep referencing C# as being the decision maker; while it’s true that the CLI could have natively supported them and thus it's in part their decision too, I have little doubt they’d be there if C# 1.0 wanted them. Further, Java's implementation is compiler-specific, and has no JVM support other than the metadata, so it's not clear the CLI would have even had to provide support.
Anders is a smart guy, one of very few Distinguished Engineers at Microsoft, and has his head more than screwed on right and tight. The article sounds reasonable, although a number of times I'm left thinking the data is incomplete. Maybe my head's not quite right. Or maybe Java has corrupted my mind.
I honestly see this debate as another incarnation of the static vs. dynamic typing debate that often plagues the language space. There's not a right answer in principle, but there's certainly an answer to what's right for the majority of users of your language. Anders certainly understands the target audience of C# more than I do, so his call was probably right. But I'm left feeling neglected, poor little Java programmer man.
Many people jump on the anti-checked exceptions bandwagon without ever having done significant programming in Java. I’m sure Anders isn’t in this camp. But a lot of folks are. And until you’ve done significant programming and maintenance on a complex system with checked exceptions, you are probably not going to appreciate the safety and self-documentation of intent that it provides. You absolutely cannot understand the benefits of checked exceptions by just writing a 10-20 line program.
Some common claims made against checked exceptions come down to:
• Most developers subvert them.
• They make it difficult to version code.
• You usually don’t want to catch an exception, you want to let it leak.
I disagree wholeheartedly with each of these statements. And here’s why.
Most developers subvert them.
Based on a decent amount of Java experience, this is incorrect by a long shot. In those 10-20 line programs I mentioned, yes a lot of folks write "throws Exception" at the end of their methods, swallow exceptions, or generally subvert the checked exceptions system. But do we really want to tune the C# language for such small programs? I would argue for Python, Perl, or another lighter weight language that doesn’t even do static type checking (since that’s another similar "annoyance" which prevents programs from compiling) for this category of programs. When checked exceptions are understood or provide value, subversion is unnecessary. I admit some users don't understand them (and hence use them incorrectly in the way somebody might use any language feature incorrectly), and that they don't provide value in the small, simple, script-ish program cases.
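To make "subversion" concrete, here is a minimal Java sketch of the two habits mentioned above next to an honest signature. Every name in it (SubversionDemo, ConfigMissingException, the lookup methods) is made up for illustration.

```java
// Hypothetical names throughout; a sketch of what subverting checked exceptions looks like.
class SubversionDemo {
    static class ConfigMissingException extends Exception {
        ConfigMissingException(String key) { super("missing config key: " + key); }
    }

    // Honest signature: callers can see exactly how this fails.
    static String lookup(java.util.Map<String, String> config, String key)
            throws ConfigMissingException {
        String value = config.get(key);
        if (value == null) throw new ConfigMissingException(key);
        return value;
    }

    // Subversion 1: widen to "throws Exception", which tells callers nothing useful.
    static String lookupVague(java.util.Map<String, String> config, String key)
            throws Exception {
        return lookup(config, key);
    }

    // Subversion 2: swallow the exception and carry on with a bogus value.
    static String lookupSwallowing(java.util.Map<String, String> config, String key) {
        try {
            return lookup(config, key);
        } catch (ConfigMissingException e) {
            return null; // the failure is now hidden from the caller
        }
    }

    public static void main(String[] args) {
        java.util.Map<String, String> empty = new java.util.HashMap<>();
        System.out.println(lookupSwallowing(empty, "port"));
    }
}
```

Both shortcuts compile happily, which is the complaint; neither is forced on you once the exception types mean something to the caller.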
They make it difficult to version code.
Hell yes they do! But it's "difficult" in a good way, the same way that it's "difficult" to change the semantics or signature of a piece of code which is relied on throughout a complex system. If you could change it without compiler help, well, you'd be working in a dynamic language (and we ain't going there right now buddy).
There are two basic cases to consider: versioning public APIs consumed by somebody else and internal methods.
The API case is just like any other static feature of an API. Can you change the signature of your method after you ship it? Yes you can, if you want to break people. The same is true of checked exceptions. If you don’t have a statically checked list of exceptions you can throw, you’re going to break somebody anyhow.
If my API does this today:
object Foo() {
    // …
    if (theNetworkIsUnavailable)
        throw new FooException();
    // …
}
And tomorrow it does:
object Foo() {
    // …
    if (theNetworkIsUnavailable)
        throw new NetworkConnectionUnavailable();
    // …
}
Your code that used to say:
try {
    object o = Foo();
} catch (FooException e) {
    // Do something with it
}
Will no longer catch the error condition, since NetworkConnectionUnavailable will just head right past the FooException catch block. Hopefully your program has a backstop to catch this and respond accordingly, but depending on your error handling logic, this is likely to result in bugs either way. Is this the type of thing you want silently slipping past a compiler? Probably not. If you’re writing an API, error conditions are like any other pre-/post-condition, and should be treated as such. If the type system can enforce this, all the better for program correctness I say. (Static vs. dynamic languages again, ugh...)
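Here is the same scenario as a runnable Java sketch. To mimic the C# model I've made both exception types unchecked (they extend RuntimeException), and the class and method names (VersioningDemo, fooV1, fooV2, callAgainst) are invented for the example.

```java
// Hypothetical names; unchecked (RuntimeException) types stand in for C#'s model.
class VersioningDemo {
    static class FooException extends RuntimeException {}
    static class NetworkConnectionUnavailable extends RuntimeException {}

    static boolean theNetworkIsUnavailable = true;

    // Version 1 of the API signals the failure with FooException.
    static Object fooV1() {
        if (theNetworkIsUnavailable) throw new FooException();
        return new Object();
    }

    // Version 2 signals the same failure with a new type; the signature is unchanged.
    static Object fooV2() {
        if (theNetworkIsUnavailable) throw new NetworkConnectionUnavailable();
        return new Object();
    }

    // Caller written against version 1: it only knows about FooException.
    static String callAgainst(int version) {
        try {
            Object o = (version == 1) ? fooV1() : fooV2();
            return "got " + o;
        } catch (FooException e) {
            return "handled FooException";
        }
    }

    public static void main(String[] args) {
        System.out.println(callAgainst(1));
        try {
            System.out.println(callAgainst(2));
        } catch (RuntimeException e) {
            // Only the backstop sees the new failure mode.
            System.out.println("escaped: " + e.getClass().getSimpleName());
        }
    }
}
```

Had the two types extended Exception instead, fooV2 would not compile until its throws clause named NetworkConnectionUnavailable, and the caller would not compile until it handled or declared it.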
The implementation case is simply not a problem. An implementation, by its definition, is a method which other programs or units don't get to see or use—i.e. non-public—and thus any problems are caught at compile time. You don't have to recompile dependencies, for example, which you may or may not have access to. This is just like changing the type or parameter list for a method… You make the change, see where things fall out, and fix them. It's the ordinary static language "beat the compiler" game. This is ordinarily a good thing, as it forces you to take a look at the exception handling code at the call sites to ensure they remain correct with the new exception behavior.
You usually don’t want to catch an exception, you want to let it leak.
I would argue about the use of "usually" here, a more correct word being "sometimes." And because this is a less-than-average situation, I don’t think the language should be tuned for it. It needs to support it, sure, but I am arguing it’s not the common case.
Note that Java supports unchecked exceptions in the type system. Basically, anything that derives from RuntimeException or Error needn’t be checked. These are common errors like NullPointerException (a RuntimeException) and OutOfMemoryError and StackOverflowError (both Errors), which a program almost never catches. Note that RuntimeException does derive from Exception, so those don’t escape past general error handling code; Error subclasses sit outside Exception and are only stopped by a handler for Error or Throwable. These are by and large the errors that developers merely want to leak, so the checked exception subsystem doesn’t get in your way here at all!
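A quick Java sketch of where these types sit in the hierarchy. The class and method names (UncheckedDemo, classify, callWithGeneralHandler) are mine; the exception types are the standard library's.

```java
// Hypothetical demo class; the Throwable subclasses used are standard Java types.
class UncheckedDemo {
    // No throws clause needed: NullPointerException is a RuntimeException.
    static int length(String s) {
        return s.length();
    }

    // Sorts a throwable into the categories the compiler cares about.
    static String classify(Throwable t) {
        if (t instanceof RuntimeException) return "unchecked, still an Exception";
        if (t instanceof Exception) return "checked";
        if (t instanceof Error) return "unchecked Error, not an Exception";
        return "other Throwable";
    }

    // A general handler still sees the unchecked RuntimeException.
    static String callWithGeneralHandler() {
        try {
            length(null);
            return "no error";
        } catch (Exception e) {
            return "caught " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(callWithGeneralHandler());
        System.out.println(classify(new java.io.IOException()));
        System.out.println(classify(new StackOverflowError()));
    }
}
```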
I also believe the style of coding which has a main entrypoint wrapped in an exception handler is not the best way to do exceptions management. Perhaps I’m whacked, but in most Java systems I’ve worked on, letting an exception rip up the callstack to the toplevel is not the preferred approach, and wouldn't make it past a single code review. Usually there’s a common function used to publish an error (e.g. pop UI, write to the log, and so on) once it gets caught, but unless we need to tear down the application, the exception never shoots through the entire program to the top, leaving holes in the brains of your callers. Sure we have a handler at the top in case something escapes, but it’s not relied on as a crutch when we might be able to recover.
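The style I mean looks roughly like this in Java: catch where you can recover, hand the exception to one shared publishing function, and carry on. This is only a sketch, and PublishDemo, publishError, ParseFailedException, and the 8080 fallback are all invented for the example.

```java
// Hypothetical names; a sketch of recovering near the failure and publishing it in one place.
class PublishDemo {
    static class ParseFailedException extends Exception {
        ParseFailedException(String message, Throwable cause) { super(message, cause); }
    }

    // Stand-in for "pop UI, write to the log, and so on".
    static final java.util.List<String> published = new java.util.ArrayList<>();

    static void publishError(String where, Exception e) {
        published.add(where + ": " + e.getMessage());
    }

    // Low-level code translates the failure into a checked, documented exception.
    static int parsePort(String text) throws ParseFailedException {
        try {
            return Integer.parseInt(text);
        } catch (NumberFormatException e) {
            throw new ParseFailedException("bad port '" + text + "'", e);
        }
    }

    // The caller that can recover does so here, instead of letting it rip to the top.
    static int portOrDefault(String text) {
        try {
            return parsePort(text);
        } catch (ParseFailedException e) {
            publishError("config", e);
            return 8080; // assumed fallback for the sketch
        }
    }

    public static void main(String[] args) {
        System.out.println(portOrDefault("abc"));
        System.out.println(published);
    }
}
```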
But it ain’t perfect.
I agree that Java’s implementation of checked exceptions isn’t perfect. But I feel much more comfortable in it than I do in C#’s fully unchecked system. Both suffer from a common problem of code duplication or over-catching in the handlers. My experience also shows that Java’s exception class hierarchy in the JSE is the better of the two (I know for sure at least two other smart people agree with this), and its users tend to get their own exception hierarchies right. Without a good factoring, a scaling problem ensues as you need to deal with a larger quantity of exceptions. You end up with 10 incarnations of similar catch code, one for each exception type that can get thrown, or you over-catch, or you give up and subvert the system. But provided that the factoring is clean, you can easily skirt this issue.
Still, having a "catch (Exception ex) where ex : FooException, BarException, FooBarException { }" syntax which caught any instance of those three and stored it into an Exception-typed variable would help to eliminate some of the nastiness of exception handling code. An implementation of this using exception filters in CLI would be trivial.
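For comparison, Java 7's multi-catch is close in spirit to this idea. A minimal sketch follows, with the three exception names borrowed from the syntax above and everything else (MultiCatchDemo, work, run) invented.

```java
// Hypothetical names; shows Java 7 multi-catch handling three unrelated types in one block.
class MultiCatchDemo {
    static class FooException extends Exception {}
    static class BarException extends Exception {}
    static class FooBarException extends Exception {}

    // Fails in a different way depending on mode; mode 0 succeeds.
    static void work(int mode) throws FooException, BarException, FooBarException {
        switch (mode) {
            case 1: throw new FooException();
            case 2: throw new BarException();
            case 3: throw new FooBarException();
            default: break;
        }
    }

    static String run(int mode) {
        try {
            work(mode);
            return "ok";
        } catch (FooException | BarException | FooBarException ex) {
            // One handler body; ex is treated as the common supertype (Exception here).
            return "handled " + ex.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        for (int mode = 0; mode <= 3; mode++) {
            System.out.println(run(mode));
        }
    }
}
```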
Not having to deal with checked exceptions when you don’t care would be a nice thing to be able to express, for example, in the case of smaller programs. A compiler switch would suffice here so long as the runtime doesn’t actually enforce the "handle everything but what you’ve said you’re going to throw" policy. The JVM itself doesn’t do this check; it’s a compiler-only enforced policy, so it's reasonable to expect that a CLI implementation would be a compiler decision too. This JVM behavior exists so that versioning can occur without recompilation.
There are other possible improvements, and certainly a whole category of other statically-detectable things (pre-/post-conditions, invariants), but I’ll stop here. Since this is mostly a language decision, it's interesting to see some approaches to solving the problem. For example, check out Spec# [http://research.microsoft.com/specsharp/], a nifty rifty roo MSR project.
Checked exceptions as implemented in Java ain’t perfect, but sufficiently close that I miss them.
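One way to see that the check lives only in the compiler is the well-known "sneaky throw" trick, sketched here in Java. It leans on generic type erasure, is not something I'd recommend in real code, and the names (SneakyDemo, sneakyThrow, undeclared) are made up.

```java
// Hypothetical names; demonstrates that the JVM will happily throw an undeclared checked exception.
class SneakyDemo {
    // The cast to E is erased at runtime, so t is rethrown as-is. At the call site
    // E is chosen to be RuntimeException, so the compiler asks for no throws clause.
    @SuppressWarnings("unchecked")
    static <E extends Throwable> void sneakyThrow(Throwable t) throws E {
        throw (E) t;
    }

    // Declares nothing, yet throws a checked IOException at runtime.
    static void undeclared() {
        SneakyDemo.<RuntimeException>sneakyThrow(new java.io.IOException("not declared"));
    }

    static String run() {
        try {
            undeclared();
            return "no error";
        } catch (Exception e) {
            return "caught " + e.getClass().getName();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The runtime raises no objection to the IOException leaving undeclared(), which is the same property that lets a callee change what it throws without its callers being recompiled.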
 Saturday, May 28, 2005
My implementation of Software Transactional Memory (STM) on Rotor is coming along nicely. I mentioned it briefly here and here. I've taken a different approach than most, actually taking advantage of the JIT and EE to do my dirty work for me. Some prefer to stay inside the cozy confines of managed code, but I'd like to understand better the impact that my design might have on non-transactional code. I admit that my approach is likely not viable for real commercial use mostly due to its intrusive nature. Unless all code were transactional, which it's not. But this is partly what I'd like to understand better.
Most people confuse the idea of memory transactions with, say,
|