
Personal Info:

Joe leads the architecture of an experimental OS's developer platform, where he is also chief architect of its programming language. His current mission is to enable writing large-scale software that is reliable, secure, and scalable by construction. Before this, Joe founded the Parallel Extensions to .NET project. He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife, writing books, writing music, studying music theory & mathematics, and doing anything involving food & wine.

Disclaimer:
The content of this site is my own personal opinion and does not represent my employer's view in any way.

© 2012, Joe Duffy

 
 Thursday, November 09, 2006

The CLR tried to add support for fibers in Whidbey.  This was done in response to SQL Server Yukon hosting the runtime in process, aka SQLCLR.  Eventually, mostly due to schedule pressure and a long stress bug tail related to fiber-mode, we threw up our hands and declared it unsupported.  Given the choice between fixing mainline stress bugs (which almost exclusively use the unhosted CLR, meaning OS threads) and fixing fiber-related stress bugs, the choice was a fairly straightforward one.  This impacts SQL customers that want to run in fiber mode, but there are far fewer of those than those who want to run in thread mode.

Perhaps the biggest thing we did to support fibers intrinsically in the runtime was to decouple the CLR thread object from the physical OS thread.  Since most managed code accesses thread-specific state through this façade, we are able to redirect calls to threads or fibers as appropriate.  And we of course plumbed the EE to call out to hosts so they can perform task management at various points in the code, enabling a non-preemptive scheduler to do its job.  When a CLR host with a registered TaskManager object is detected, we defer many tasks to it that we’d ordinarily implement with OS calls.  For example, instead of just creating a new OS thread, we will call out through the TaskManager interface so that the thread can use a fiber if it wishes.

We do various other things of interest:

  1. Because the CLR thread object is per-fiber, any information hanging off of it is also per-fiber.  Thread.ManagedThreadId returns a stable ID that flows around with the CLR thread.  It is not dependent on the identity of the physical OS thread, which means using it implies no form of affinity.  Different fibers running on the same thread return different IDs.  Impersonation and locale are also carried around with the fiber instead of the thread.  This also ensures we can properly walk stacks, propagate exceptions, and report all of the active roots held on stack frames (for all fibers) to the GC.
  2. Managed TLS is stored in FLS if a fiber is being used.  This includes the ThreadStaticAttribute and Thread.GetData and Thread.SetData routines.  We avoid introducing thread affinity when these APIs are used.
  3. Any time we block in managed code or at most places in the runtime, we call out to the host so that it may SwitchToFiber.  This includes calls to WaitHandle.WaitOne, contentious Monitor.Enters, Thread.Sleep, and Thread.Join, as well as any other APIs that use those internally.  Some managed code blocks by P/Invoking, either intentionally or unintentionally, which leaves us helpless.  The sockets classes in Whidbey, for instance, make possibly-blocking calls to Win32.  These should really be cleaned up.  Not only does it prevent us from switching in fiber mode, but it also prevents us from pumping the message queue on an STA thread.  Apps do this too, such as P/Invoking to MsgWaitForMultipleObjects in order to do some custom message pumping code.  The lack of coordination with blocking in the kernel also makes it way too easy to accidentally forfeit an entire CPU for lengthy periods of time.
  4. We do some things during a fiber switch to shuffle data in and out of TLS.  This includes copying the current thread object pointer and AppDomain index from FLS to TLS, for example, as well as doing general book-keeping that is used by the internal fiber switching routines (SwitchIn and SwitchOut).
  5. Our CLR internal critical sections coordinate with the host.  Anytime we create or wait on an event, it is a thin wrapper that calls out to the host.  This meant sacrificing some freedom around waiting, like doing away with WaitForMultipleObjectsEx with WAIT_ANY and WAIT_ALL, but ensures seamless integration with a fiber-mode host.
  6. All thread creation, aborts, and joins are host aware, and call out to the host so they can ensure these events are processed correctly given an alternative scheduling mechanism.

None of this logic kicks in if fibers are simply used underneath the CLR without its knowledge.  It all requires close coordination between the host, which is doing user-mode scheduling, and the CLR, which is executing the code running on those fibers.  If you call into managed code on a thread that was converted to a fiber, and then later switch fibers without involving the CLR, things will break badly.  Our stack walks and exception propagation will rely on the wrong fiber’s stack, and the GC will fail to find roots for stacks that aren’t live on threads, among many, many other things.

Important areas of the BCL and runtime that can introduce thread affinity, then make a call that might block, and later release thread affinity (such as the acquisition and release of an OS CRITICAL_SECTION or Mutex) have been annotated with calls to Thread.BeginThreadAffinity and Thread.EndThreadAffinity.  These APIs call out to the host, which maintains a recursion counter to track regions of affinity.  If a blocking operation happens inside such a region (i.e. count > 0), the host should avoid rescheduling another fiber on the current thread and/or moving the current fiber to another thread.  This can create CPU stalls, so we try to avoid it, but it is better than the consequence of ignoring the affinity.
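As a minimal sketch of what such an annotated region can look like, assuming a hypothetical helper called AffinityRegion (the helper and its method name are illustrative and not taken from the BCL source; only Thread.BeginThreadAffinity, Thread.EndThreadAffinity, and Mutex are real APIs):

```csharp
using System;
using System.Threading;

// Hypothetical helper: runs an action while holding an OS mutex and
// tells the host that the code is bound to the current OS thread
// for the duration.
public static class AffinityRegion
{
    public static void RunUnderMutex(Mutex mutex, Action body)
    {
        // Notify the host before taking on OS thread affinity.
        Thread.BeginThreadAffinity();
        try
        {
            mutex.WaitOne();              // ownership is tied to the OS thread
            try
            {
                body();
            }
            finally
            {
                mutex.ReleaseMutex();     // must run on the acquiring OS thread
            }
        }
        finally
        {
            // Affinity ends only after the mutex has been released.
            Thread.EndThreadAffinity();
        }
    }
}
```

The Begin/End pair brackets the whole acquire-to-release span, so a host that tracks the counter knows not to move the work to another OS thread while the mutex is held.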

In reality, there is little code today that actually uses these APIs.  Large swaths of the .NET Framework have not yet been modified to use these calls and thus remain unprotected.  We inherit a lot of the affinity problems from Win32.  This can have a dramatic impact on reliability and correctness when used in a fiber-mode host.  Switching a fiber that has acquired OS thread affinity can result in data being accidentally shared between units of work (like the ownership of a lock) or movement of work to a separate thread (which then expects to find some TLS, but is surprised when it isn’t there).  Both are very bad.  If we were serious about supporting fibers underneath managed code, we really ought to do an audit of the libraries to find any dangerous unmarked P/Invokes or OS thread affinity.

Spin loops that don’t first go through the user-mode scheduler can wastefully burn CPU cycles.  A lot of the .NET Framework and some of the CLR itself spins without host coordination.  While not disastrous, presuming they all fall back to waiting eventually, this can have a negative impact on performance and scalability.

The 2.0 CLR’s policy in response to stack overflow is to FailFast the whole process.  Too much of Win32 is unreliable in the face of overflow to try and continue running.  With fibers in the picture it might be attractive to reserve smaller stacks since presumably the smaller work items will need less.  And you're apt to have a lot more of them.  This is a dangerous game to play.  This trades off some amount of committed memory for an increased chance of overflowing the stack, an event that is clearly catastrophic.

Fibers and debuggers don’t interact well today either.  Most debuggers rely on Win32 CONTEXTs pulled from the OS thread, in a fiber-unaware way.  Depending on the frequency at which a debugger resamples the context, this can get out of sync in the face of fiber switches.  Even if you’ve suspended all threads, you’ll not be able to peer into the stacks of fibers that aren’t currently scheduled.  FuncEval and EnC also depend on thread suspension and coordination in a way that makes it hard to predict what will happen when fibers are added to the mix.  A lot of the debugging libraries we have, such as System.Diagnostics, are also not fiber-aware and may yield surprising answers to API calls.

In the end, remember that we decided to cut fiber support because of stress bugs.  Most of these stress bugs wouldn’t have actually blocked the simple, short-running scenarios, but would have plagued a long-running host like SQL Server.  The ICLRTask::SwitchOut API was cut, which is unfortunate:  it means you can’t switch out a task while it is running, which effectively makes it impossible to build a fiber-mode host on the 2.0 RTM CLR.  Thankfully, re-enabling it (for those playing with SSCLI) would be a somewhat trivial exercise.

11/9/2006 5:32:44 PM (Pacific Standard Time, UTC-08:00)

 Wednesday, November 01, 2006

People often ask whether they should use EventWaitHandle objects or the Monitor.Wait, Pulse, and PulseAll methods for synchronization.  There is no simple answer to this question; although, as with most software problems, it can be summarized as:  It depends.

EventWaitHandle comes in two flavors, auto- and manual-reset.  EventWaitHandle subclasses WaitHandle and offers two subclasses for convenience: AutoResetEvent and ManualResetEvent.  These are just thin wrappers on top of the CreateEvent and related APIs in Win32.  The differences are deceptively simple.

Auto-reset, when signaled with the EventWaitHandle.Set API (kernel32!SetEvent internally), allows one thread to witness the signal before the event automatically transitions back to the unsignaled state.  If there are any waiting threads, one will be chosen and unblocked.  The waiting threads are maintained in a FIFO queue, but it’s not strictly FIFO for the same reasons very few things on Windows are FIFO: various events, like device IO completion and kernel- and user-mode APCs, can wake a thread temporarily, removing it from the wait queue and then requeuing it.  If a thread is constantly woken to process device IO, it might be starved indefinitely (if the queue is long).  If no threads were waiting at the time of the signal, the next thread to wait on the event will not block, and instead just moves the event back to the unsignaled state and returns.  This is all done atomically so you are guaranteed only one thread will ever witness a signal.

Manual-reset, when signaled, wakes all threads that are waiting on it.  As its name implies, it must be manually reset with the EventWaitHandle.Reset API (kernel32!ResetEvent internally).  While the event remains signaled, any threads that try to wait on it will not block and just return from the wait function immediately.
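The difference between the two flavors can be observed directly with zero-timeout waits.  The following is a small sketch; the ResetModes class and its method are hypothetical names used only for illustration:

```csharp
using System;
using System.Threading;

public static class ResetModes
{
    // Signals the event once, then polls it `attempts` times and
    // returns how many of those polls observed the signaled state.
    public static int CountSuccessfulWaits(EventWaitHandle ev, int attempts)
    {
        ev.Set();
        int successes = 0;
        for (int i = 0; i < attempts; i++)
        {
            // WaitOne(0) tests the event without blocking.
            if (ev.WaitOne(0))
            {
                successes++;
            }
        }
        return successes;
    }
}
```

Passing `new AutoResetEvent(false)` with three attempts returns 1, because the first successful wait consumes the signal.  Passing `new ManualResetEvent(false)` returns 3, because the event stays signaled until Reset is called.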

Signaling an already-signaled event has no effect.  It’s easy to get into trouble in this area with auto-reset events.  If you signal the event N times, expecting N threads to see the signals and do some amount of processing, you’re betting the farm on a race condition, for instance.  This is easy to get wrong and can very easily lead to deadlocks.  If you have a shared buffer, an attractive design might be to simply call Set on the event each time a new item arrives.  The thinking might be that, while threads might not be sleeping, at least one thread will wake up per item and process it.  This thinking is dead wrong and can get you into a quagmire.  The waking threads would need to contain a loop ‘while (!empty) { … }’ before going back to sleep, otherwise one of the signals may go missing.  If production of new items depended on consumers making forward progress, the program might lock up.  And it entirely depends on consumers going to sleep in the first place which, if producers typically produce faster than consumers, might only happen occasionally (and hence not show up during testing).
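Here is one way the drain loop might look.  This is only a sketch: SignaledQueue and its members are hypothetical names, and the queue is protected by a plain monitor lock for simplicity.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical shared buffer that signals an auto-reset event per item.
public sealed class SignaledQueue<T>
{
    private readonly Queue<T> items = new Queue<T>();
    private readonly AutoResetEvent itemAvailable = new AutoResetEvent(false);

    public void Enqueue(T item)
    {
        lock (items) { items.Enqueue(item); }
        // If the event is already signaled, this Set has no effect,
        // so N calls do not guarantee N wake-ups.
        itemAvailable.Set();
    }

    // Waits for a signal (up to the timeout), then drains everything
    // that is queued.  Returns the number of items processed, which can
    // be 0 on a timeout or if another consumer already took the items.
    public int WaitAndDrain(Action<T> process, int millisecondsTimeout)
    {
        if (!itemAvailable.WaitOne(millisecondsTimeout))
        {
            return 0;
        }
        int count = 0;
        while (true)
        {
            T item;
            lock (items)
            {
                if (items.Count == 0) break;   // the 'while (!empty)' test
                item = items.Dequeue();
            }
            process(item);                     // run outside the lock
            count++;
        }
        return count;
    }
}
```

Enqueuing three items before any consumer waits leaves the event signaled just once, yet a single WaitAndDrain call still processes all three because it loops until the queue is empty.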

Monitor.Wait, Pulse, and PulseAll are very different from their close Win32 event cousins.  They are much more akin to the new Windows Vista APIs, SleepConditionVariableCS and SleepConditionVariableSRW.  Wait will exit the monitor (lock) for the object in question until another thread pulses the object.  Once the thread wakes up, it immediately reacquires the lock on the object.  Pulse wakes up one waiting thread, in FIFO order, while PulseAll wakes all waiting threads.  Notice that the monitor has no residual effect from the pulse; that is, if no threads were waiting at the exact moment of a pulse, there is no evidence that it actually happened.  This leads to the notorious missed-pulse problem.  To solve it, you just have to ensure that the wait condition is always tested (in a loop) around the Wait.
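A minimal sketch of the wait-in-a-loop pattern, using a hypothetical BlockingQueue class (the class is illustrative; Monitor.Wait and Monitor.Pulse are the real APIs under discussion):

```csharp
using System.Collections.Generic;
using System.Threading;

public sealed class BlockingQueue<T>
{
    private readonly Queue<T> items = new Queue<T>();
    private readonly object sync = new object();

    public void Enqueue(T item)
    {
        lock (sync)
        {
            items.Enqueue(item);
            Monitor.Pulse(sync);   // wakes one waiter; a no-op if nobody waits
        }
    }

    public T Dequeue()
    {
        lock (sync)
        {
            // Test the condition before and after every Wait.  A pulse sent
            // before we got here leaves no trace, and another thread may
            // have taken the item by the time we reacquire the lock.
            while (items.Count == 0)
            {
                Monitor.Wait(sync);
            }
            return items.Dequeue();
        }
    }
}
```

Because the condition lives in the queue itself rather than in the pulse, an Enqueue that happens before any thread calls Dequeue is not lost: the consumer sees a non-empty queue and never waits.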

Note that Wait does something a little dirty.  It releases an arbitrary amount of recursive acquisitions.  As soon as it does this, other threads can acquire the monitor.  If you are not careful with recursion, you can end up Waiting with broken invariants, accidentally letting other threads peer into this state.  This is just another bit of evidence that recursion is something that is best avoided.

The first major consideration to make when selecting between EventWaitHandle and Monitor’s methods is whether you need a stand-alone event or a real condition variable that is integrated with locks.  That is, the two have very distinct and disjoint feature sets.  Win32 events also let you do more sophisticated waits, with the WaitHandle.WaitAll or WaitAny APIs, allowing you to wait for all of the events or a single event in an array to be signaled.  So which feature-set do you want?  That Win32 events give you events without the synchronization looks simpler, but is probably misleading.  You typically need to manage mutual exclusion in some way with events, too, so you’ll end up using a monitor, ReaderWriterLock, Mutex, etc. in addition.  The one benefit is that you have more control over locking and can be more conscious of certain policies like recursion.  The fact that a Win32 event “sticks” in the signaled state can also be useful to avoid the missed-pulse problem, although with some discipline it is easily avoided with monitors too.  Often people end up building a sticky event with a bool and monitor pair.  One-time or lazy initialization is an example of this.
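The bool and monitor pair mentioned above might look something like this.  StickyEvent is a hypothetical name, and this sketch omits timeouts and Reset for brevity:

```csharp
using System.Threading;

// A "sticky" event built from a flag and a monitor.
public sealed class StickyEvent
{
    private readonly object sync = new object();
    private bool signaled;

    public void Set()
    {
        lock (sync)
        {
            signaled = true;          // the flag is what makes the signal stick
            Monitor.PulseAll(sync);   // wake every thread currently waiting
        }
    }

    public void Wait()
    {
        lock (sync)
        {
            // Threads arriving after Set see the flag and return at once.
            while (!signaled)
            {
                Monitor.Wait(sync);
            }
        }
    }
}
```

This behaves like a manual-reset event that is never reset, which is the shape one-time initialization usually needs, without allocating a kernel object.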

Win32 events are fairly heavyweight too.  Each one consumes some amount of kernel memory, and setting, resetting, and waiting on one incurs somewhat expensive kernel transitions.  In managed code, simply allocating one increases pressure on the GC because of yet another finalizable object to track.  In a well-tuned system, you have to manage events carefully, which usually means Disposing of them far before the GC’s finalizer thread has a chance to see one.  Even cleverer systems will pool them to amortize the cost of creating and closing the events.  This is a double-edged sword.  The V1.1 ReaderWriterLock we shipped in the CLR pools events.  In my opinion, this is a little too clever and myopic: a good solution would pool events across many components in the process, not just ReaderWriterLocks.  Imagine if each type we shipped tried to maintain its own pool of events.

As you may have guessed, Monitor actually uses Windows events underneath it all.  Each CLR thread has a manual-reset event, allocated when it is created (or lazily when the thread first wanders into managed code).  When a Wait is issued, this per-thread event is stuck on the tail of a linked list associated with the target object’s sync block.  We can use a single event per thread since a thread can only ever be waiting for a single object at a given time.  (You can’t do a WAIT_ANY or WAIT_ALL on monitors.)  The thread then releases the lock on the object (accounting for any recursion), waits on this event, and then reacquires the lock on the object (again, accounting for any recursion).  When a Pulse is issued, the head of the object’s linked list of waiters is popped off and its associated event is signaled.  Similarly, PulseAll clears the entire linked list and signals all of the events.  Notice I said that Pulse operates on the head of the list: we use a strict FIFO ordering (as of 2.0).  And since we don’t remove the list entries in the face of an APC, there is no risk of perturbing the FIFO ordering, aside from premature exits due to thread aborts or interruptions.

There are a few things to note about this.

The signals on the thread events happen while the signaler still owns the lock.  In other words, the thread calling Pulse(o) will still own the lock on o for some time after the call, yet the thread that called Wait(o) will immediately wake after the Pulse and try to acquire this lock (failing and waiting).  Yes, all woken threads have to immediately wait when attempting to reacquire this lock, which is actually pretty crappy.  If you’re using PulseAll, this could have a noticeable (and in some cases, dramatic) impact on scalability.  Windows uses priority boosts to “hand off” the current time-slice to the recipient of an event signal, similar to what occurs when a GUI event is enqueued into a thread’s message queue, which just exacerbates this effect.  You’re just about guaranteed that there will be a scheduler ping-pong effect immediately after a pulse.  I am honestly surprised we don’t enqueue the Pulse/PulseAll calls on the object’s sync-block, processing them only once the lock has been exited.  Yet another benefit to using events is that you can devise algorithms that signal events outside of critical sections, often leading to improved scalability.

We also don’t do any form of spinning.  Events are generally speaking very volatile in terms of timing, so spinning only buys you something if you know that events occur frequently enough that wait-avoidance will pay off.  In many low-level concurrent algorithms, this is a worthwhile technique, just as spinning while trying to acquire a CRITICAL_SECTION in Win32 (see InitializeCriticalSectionAndSpinCount and SetCriticalSectionSpinCount) can improve scalability by avoiding expensive kernel-mode transitions due to waiting.  In fact, it’s conceivable that somebody would want to use an event that never did a real wait, particularly if you’re dealing with a very tiny race condition that is expected to arise very infrequently.  This is also dangerous, however, as it can lead to those rare CPU spikes that are almost impossible to debug and discern from a crash dump.  This is pretty simple to build, but very hard to fine-tune so that it performs adequately.
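To make the idea concrete, here is a rough sketch of an event that spins briefly before falling back to a real wait.  SpinThenWaitEvent is a hypothetical class, and the spin counts are arbitrary placeholders rather than tuned values:

```csharp
using System.Threading;

public sealed class SpinThenWaitEvent
{
    private volatile bool signaled;
    private readonly ManualResetEvent kernelEvent = new ManualResetEvent(false);

    public void Set()
    {
        signaled = true;      // cheap flag for spinners
        kernelEvent.Set();    // releases threads that fell back to waiting
    }

    // Returns true if the signal was observed while spinning, or false
    // if the call fell through to a kernel wait.
    public bool Wait(int spinIterations)
    {
        for (int i = 0; i < spinIterations; i++)
        {
            if (signaled)
            {
                return true;
            }
            Thread.SpinWait(20);   // burn a few cycles without a kernel transition
        }
        kernelEvent.WaitOne();     // give up spinning and block
        return false;
    }
}
```

The hard part is choosing spinIterations: too low and every wait pays for the kernel transition anyway, too high and the spinning itself becomes the CPU spike described above.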

So in the end, I will simply fall back to my original answer:  It depends.

11/1/2006 9:26:26 PM (Pacific Daylight Time, UTC-07:00)

 Thursday, October 26, 2006

The meat of this article is in section II, where a set of best practices is listed.  If you’re not interested in the up-front background and high-level direction, or you’ve heard it all before, I suggest skipping ahead right to it.

As has been widely publicized recently, mainstream computer architectures in the future will rely on concurrency as a way to improve performance.  This is in contrast to what we’ve grown accustomed to over the past 30+ years (see Olukotun, 05 and Sutter, 05): a constant increase in clock frequency and advances in superscalar execution techniques.  In order for software to be successful in this brave new world, we must transition to a fundamentally different way of approaching software development and performance work.  Simply reducing the number of cycles an algorithm requires to compute an answer won’t necessarily translate into the fastest possible algorithm that scales well as new processors are adopted.  This applies to client & server workloads alike.  Dual-core is already widely available (Intel Core Duo is standard on the latest Apples and Dells, among others), with quad-core imminent (in early 2007), and predictions of as many as 32 cores in the 2010-12 timeframe.  Each core could eventually carry up to 4 hardware threads to mask memory latencies, equating to 128 threads on that same 32-core processor.  This trend will continue into the foreseeable future, with the number of cores doubling every 2 years or so.

If we want apps to get faster as new hardware is purchased, then app developers need to slowly get onto the concurrency bandwagon.  Moreover, there is a category of interesting apps and algorithms that only become feasible with the amount of compute power this transition will bring—what would you do with a 100 GHz processor?—ranging from rich immersive experiences complete with vision and speech integration to deeper semantic analysis, understanding, and mining of information.  If you’re a library developer and want your libraries to fuel those new-age apps, then you also need to hop onto the concurrency bandwagon.  There is a catch-22 here that we must desperately overcome:  developers probably can’t build the next generation of concurrent apps until libraries help them, but yet we typically wouldn’t decide to do large-scale proactive architectural and design work for app developers until they were clamoring for it.

Although it may sound rather glamorous & revolutionary at first, this transformation won’t be easy and it certainly won’t happen overnight.  Our libraries will slowly and carefully evolve over time to better support this new approach to software development.  We’ve done a lot of work laying the foundation over the .NET Framework’s first 3 major releases, but this direction in hardware really does represent a fundamental shift in how software will have to be written.

This document exposes some issues, articulates a general direction for fixing them, and, hopefully, will stimulate a slow evolution of our libraries.  App developers will want to take incremental advantage of these new architectures as soon as possible, ramping up over time.  The practices in here are based on experience.  My hope is that many of them (among others) are eventually integrated into libraries, tools, and standard testing methodologies over time.  Nobody in their right mind can keep all these rules in their head.

I. The 20,000 Foot View

There are several major themes library developers must focus on in their design and implementation in order to prepare for multi-core:

A. The level of reliability users demand of apps built on the .NET Framework and CLR is increasing over time.  Being brought in-process with SQL Server made the CLR team seriously face this fact in Whidbey.  At the same time, with the introduction of more concurrency, subtle timing bugs—like races and deadlocks—will actually occur with an increasing probability.  Those rare races that would have required an obscure 5-step sequence of context switches at very specific lines of code on a uni-processor, for example, will start surfacing regularly for apps running on 8-core desktop machines.  Library authors have gotten better over time at finding and fixing these types of bugs during long stress hauls before shipping a product, but nobody catches them all.  Fixing this will require intense testing focus on these types of bugs, hopefully new tools, and the wide-scale adoption of best practices that statistically reduce the risk, as outlined in this doc.

B. Nobody has seriously worked out the scheduling mechanisms for massively concurrent programs in detail, but it will likely involve some form of user-mode scheduling that keeps logical work separate from physical OS threads.  Unfortunately, many libraries assume that the identity of the OS thread remains constant over time in a number of places—something called thread affinity—preventing two important things from happening: (1) multiple pieces of work can’t share the same OS thread, i.e. it has become polluted, and (2) a user-mode scheduler can no longer move work between OS threads as needed.  Windows’s GUI APIs are notorious for this, including the Shell APIs, in addition to the reams of COM STA code written and thriving in the wild.  Fibers are the “official” mechanism on Windows today for user-mode scheduling, and—although there are several problems today—the CLR and SQL Server teams have experience trying to make serious use of them.  Regardless of the solution, thread affinity will remain a problem.

C. Scalability via concurrency will become just as important as sequential performance, if not more important (eventually, for some categories of problems).  If you assume that most users will try using your library in their now-concurrent programs, you also have to assume they will notice when you take an overly coarse-grained lock, block the thread unexpectedly, or pollute the physical thread such that work can’t remain agile.  Moreover, a compute-intensive sequential algorithm lurking in a reusable library and exposed by a coarse-grained API will eventually lead to scalability bottlenecks in customer code.  Faced with such issues, developers will have no recourse other than to refactor, rewrite, and/or avoid the use of certain APIs.  And even worse, they’ll learn all of this through trial & error.

D. It’s not always clear what APIs will lead to synchronization and variable latency blocking.  If a customer is trying to build a scalable piece of code, it’s very important to avoid blocking.  And of course GUI developers must avoid blocking to maintain good responsiveness (see Duffy, 06c).  But if blocking is inevitable, either because of an API design or architectural issue, developers would rather know about it and choose to use an alternative asynchronous version of the API—such as is used by the System.IO.Stream class—or take the extra steps to “hide” this latency by transferring the wait to a spare thread and then joining with it once the wait is done.  Libraries need to get much better at informing users about the performance characteristics of APIs, particularly when it comes to blocking.  And everybody needs to get better at exposing the power of Windows asynchronous IO through APIs that use file and network IO internally.

These are all fairly dense and complex issues, and are all intertwined in some way.  Many of them can be teased apart and mitigated by following a set of best practices.  This is not to say they are all easy to follow.  These guidelines should evolve as we as a community learn more, so please let me know if you have specific suggestions, or ideas about how we can make this list more useful.  I seriously hope these are reinforced with library and tool support over time.

II. The Details

Locking Models

1. Static state must be thread-safe.

Any library code that accesses shared state must do so thread-safely.  For most managed code-bases this means that all accesses to objects reachable through a static variable (i.e. that the library itself places there) must be protected by a lock.  The lock has to be held over the entire invariant under protection (e.g. for multi-step operations) to ensure other threads don’t witness state inconsistencies in between the updates.  Protecting multi-step invariants requires that the granularity of your lock is big enough, but not so big that it leads to scalability problems.  Read-modify-write bugs are also a common mishap here; e.g. if you’re updating a lightweight counter held in a static variable, it must be done with an Interlocked.Increment operation, under a lock, or with some other synchronization mechanism.
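For the counter case, a sketch might look like the following.  RequestStats and its members are hypothetical names chosen for illustration:

```csharp
using System.Threading;

public static class RequestStats
{
    // Shared static state that the library itself owns.
    private static int requestCount;

    // A plain 'requestCount++' is a read, an add, and a write.  Two
    // threads can interleave those steps and lose an update, so the
    // increment is done as one atomic operation instead.
    public static int RecordRequest()
    {
        return Interlocked.Increment(ref requestCount);
    }

    public static int Count
    {
        // A compare-exchange that never changes the value acts as an
        // atomic, fenced read.
        get { return Interlocked.CompareExchange(ref requestCount, 0, 0); }
    }
}
```

An interlocked operation is enough here because the invariant covers a single variable.  As soon as two or more fields must change together, a lock held across all of the updates is needed instead.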

Reads and writes to statics whose data types are not assigned atomically (>32 bits on 32-bit, >64 bits on 64-bit) also need to happen under a lock or with the appropriate Interlocked method.  If they are not, threads can observe “torn values”; for example, while one thread writes a 64-bit value, 0xdeadbeefcafebabe to a field—which actually involves two individual 32-bit writes in the object code—another thread may run concurrently and see a garbage value, say, 0xdeadbeef00000000, because the high 32-bit word was written first.  Similar problems can happen to GUID fields on all architectures, for instance, because GUIDs are 128 bits wide.  Longs on 32-bit machines also fall into this category, as do value types built out of said data types.
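A minimal sketch of protecting a 64-bit static against torn values, assuming a hypothetical LastSeen class that stores a single long:

```csharp
using System.Threading;

public static class LastSeen
{
    // 64 bits wide: a plain read or write is not atomic on a 32-bit CLR.
    private static long value;

    public static void Update(long newValue)
    {
        // Writes all 64 bits as one indivisible operation.
        Interlocked.Exchange(ref value, newValue);
    }

    public static long Read()
    {
        // Reads all 64 bits as one indivisible operation.
        return Interlocked.Read(ref value);
    }
}
```

Both sides have to cooperate: an atomic write paired with a plain read (or the reverse) can still observe a torn value on 32-bit.  Wider types such as Guid have no Interlocked support and need a lock.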

This responsibility doesn’t extend to accesses to instance fields or static fields for objects that library users explicitly share themselves.  In other words, only if the library makes state accessible through a static variable does it need to protect it with synchronization.  In some cases, a library author may choose to make a stronger guarantee—and clearly document it—but it should certainly be the exception rather than the default choice, for instance with libraries specific to the concurrency domain.

2. Instance state doesn’t need to be thread-safe.  In most cases it should not be.

Protecting instance state with locks introduces performance overhead that is often ill-justified.  The granularity of these locks is typically too small to protect any operation of interesting size in the app.  And if the granularity turns out to be wrong, you either need to expose the implementation’s locking details or the locking was a waste of time.  Claiming an object performs thread-safe reads/writes to instance fields can even give users a false sense of safety because they might not understand the subtleties around locking granularity.

In V1.0 the .NET Framework shipped synchronizable collections with SyncRoots, for example, which in retrospect turned out to be a bad idea:  customers were frequently bitten by races they didn’t understand; and, for those who kept a collection private to a single thread or used higher level synchronization rather than the collection’s lock, the performance overhead was substantial and prohibitive.  Thankfully we left that part of the V1.0 design out of our new V2.0 generic collections.  We still have numerous types that claim “This type is thread-safe” in the MSDN docs, but this is typically limited to simple, immutable value types.

3. Use isolation & immutability where possible to eliminate races.

If you don’t share and mutate data, it doesn’t need lock protection.  CLR strings and many value types, for example, are immutable.  Isolation can be used to hide intermediate state transitions, although typically also requires that multiple copies are maintained and periodically synchronized with a central version to eliminate staleness.  Sometimes this approach can be used to improve scalability particularly for highly shared state.  Many CRT malloc/free implementations will use a per-thread pool of memory and occasionally rendezvous with a central process-wide pool to eliminate contention, for example.
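The same isolate-then-rendezvous idea can be sketched in managed code with a per-thread counter that is occasionally folded into a shared total.  IsolatedCounter, its members, and the threshold of 1000 are hypothetical choices made for illustration:

```csharp
using System;
using System.Threading;

public static class IsolatedCounter
{
    // Each thread gets its own copy, so no synchronization is needed.
    [ThreadStatic] private static int local;

    // The shared, central version.  Touched only when a thread flushes.
    private static int total;

    private const int FlushThreshold = 1000;

    public static void Increment()
    {
        if (++local >= FlushThreshold)
        {
            Flush();
        }
    }

    // Folds this thread's private count into the shared total.
    public static void Flush()
    {
        Interlocked.Add(ref total, local);
        local = 0;
    }

    // May lag behind the true count by whatever threads have not flushed.
    public static int Total
    {
        get { return Interlocked.CompareExchange(ref total, 0, 0); }
    }
}
```

The trade-off is the staleness mentioned above: contention on the shared total drops by roughly the threshold factor, but readers of Total see a value that trails the real count until every thread has flushed.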

4. Document your locking model.

Most library code has a simple locking model:  code that manipulates statics is thread-safe and everything else is not (see #1 and #2 above).  If your internal locking schemes are more complex, you should document those using asserts (see below), good comments, and by writing detailed dev design docs with information about what locks protect what data to help others understand the synchronization rules.  If any of these subtleties are surfaced to users of your class then those must also be explained in product documentation and, preferably, reinforced with some form of tools & analysis support.  COM/GUI STAs, for example, are one such esoteric scheme, where the locking model leaks directly into the programming model.  As a community, we would be best served if new instances of such specialized models are few and far between; I for one would be interested in hearing of and understanding any such cases.

Using Locks

5. Use the C# ‘lock’ and VB ‘SyncLock’ statements for all synchronized regions.

Following this guidance ensures that locks will be released even in the face of asynchronous thread aborts, leading to fewer deadlocks statistically.  The code generated by these statements is such that our finally block will always run and execute the Monitor.Exit if the lock was acquired.  This still doesn’t protect code from rude AppDomain unloads—but this is not something most library developers have to worry about, except for reliability-sensitive code that maintains complex process-wide memory invariants, such as code that interops heavily with native code.  (See Duffy, 05 for more details.)
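To show what the statement buys you, here is a sketch of a lock block next to a hand-written equivalent.  The Account class is hypothetical, and the expansion is only approximately what the compiler of this era emits (later C# compilers use the Monitor.Enter(object, ref bool) overload instead):

```csharp
using System.Threading;

public sealed class Account
{
    private readonly object sync = new object();
    private int balance;

    // Preferred: the compiler generates the try/finally for you.
    public void Deposit(int amount)
    {
        lock (sync)
        {
            balance += amount;
        }
    }

    // Roughly the hand-written equivalent of the method above.
    public void DepositExpanded(int amount)
    {
        Monitor.Enter(sync);
        try
        {
            balance += amount;
        }
        finally
        {
            Monitor.Exit(sync);   // runs even if the body throws
        }
    }

    public int Balance
    {
        get { lock (sync) { return balance; } }
    }
}
```

Writing the expanded form by hand invites mistakes such as putting code between the Enter and the try, which is one more reason to stick with the statement.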

If you decide to violate this guidance, it should be for one of two reasons: (1) you need to use a CER to absolutely guarantee the lock is released in rude AppDomain unload cases—perhaps because a lock will be used during AppDomain tear-down and you’d like to avoid deadlocks—or (2) you have some more sophisticated Enter/Exit pattern that is not lexical.  For (1) I would recommend talking to somebody at Microsoft so we can understand these scenarios better; if there are enough people who need to do this, we may conceivably consider adding better Framework support in the future.  For (2) you should first try to avoid this pattern.  If it’s unavoidable, you must use finalizers to ensure that locks are not orphaned if the expected releasing thread is aborted and never reaches the Exit.  As with (1), you may or may not need to use a critical finalizer based on your reliability requirements.
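
To make rule 5 a bit more concrete, here is a rough sketch of a ‘lock’ block next to a hand-written approximation of what the compiler generates for it.  The exact expansion differs between compiler versions, so treat the second method as an illustration rather than the real code-generation pattern.  The Counter class and its members are hypothetical names used only for this example.

```csharp
using System.Threading;

public sealed class Counter
{
    private readonly object _sync = new object();
    private int _count;

    // Preferred: the compiler emits the try/finally for you.
    public int IncrementWithLock()
    {
        lock (_sync)
        {
            return ++_count;
        }
    }

    // Roughly what the statement above expands to.  This is an approximation,
    // not the exact code any particular compiler version emits.
    public int IncrementExpanded()
    {
        Monitor.Enter(_sync);
        try
        {
            return ++_count;
        }
        finally
        {
            // Runs on both normal and exceptional exit from the protected region.
            Monitor.Exit(_sync);
        }
    }
}
```

Both methods behave the same for callers; the point is simply that the first form leaves no room to forget the finally block.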

6. Avoid making calls to someone else’s code while you hold a lock.

This applies to most virtual, interface, and delegate calls while a lock is held—as well as ordinary statically dispatched calls—into subsystems with which you aren’t familiar.  The more you know about the code being run when you hold a lock, the better off you will be.  If you follow this approach, you’ll encounter far fewer deadlocks, hard-to-reproduce reentrancy bugs, and surprising dynamic composition problems, all of which can lead to hangs when your API is used on the UI thread, reliability problems, and frustration for your customer.  Locks simply don’t compose very well; ignoring this and attempting to compose them in this way is fraught with peril.

7. Avoid blocking while you hold a lock.

Admittedly sometimes violating this advice is unavoidable.  Trying to acquire a lock is itself an operation that can block under contention.  But blocking on high or variable latency operations such as IO will effectively serialize any other thread trying to acquire that lock “behind” your IO request.  If that other thread trying to acquire the lock is on the UI thread, you may have just helped to cause a user-visible hang.  The app developer may not understand the cause of this hang if the lock is buried inside of your library, and it may be tricky and error-prone to work around.

Aside from having scalability impacts, blocking while a lock is held can lead to deadlocks and invariants being broken.  Any time you block on an STA thread, the CLR uses it as a chance to run the message loop.  When run on pre-Windows 2000 we use custom MsgWaitForMultipleObjects pumping code, and post-Windows 2000 we use OLE’s CoWaitForMultipleHandles.  While this style of pumping processes only a tiny subset of UI messages, it can dispatch arbitrary COM-to-CLR interop calls.  These calls include cross-thread/apartment SendMessages, such as an MTA-to-STA call through a proxy.  If this happens while a lock is held, that newly dispatched work also runs under the protection of the lock.  If the same object is accessed, this can lead to surprising bugs where invariants are still broken inside the lock.  (Note that COM offers ways to exit the SynchronizationContext when blocking in this fashion, but this is outside of the scope of this doc.)

Try to refactor your code so the time you hold a lock is minimal, and any communication across threads, processes, or to/from devices happens at the edges of those lock acquisition/releases.  All libraries really should minimize all synchronization to leaf-level code.
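
One way to push blocking work to the edges of a lock, sketched under some assumptions: swap the shared buffer out while holding the lock, then do the slow IO after releasing it.  BufferedLog and its members are hypothetical names, and the sketch assumes you can tolerate batches from concurrent Flush calls being written in either order.

```csharp
using System.Collections.Generic;
using System.IO;

public sealed class BufferedLog
{
    private readonly object _sync = new object();
    private List<string> _pending = new List<string>();

    public void Add(string line)
    {
        lock (_sync) { _pending.Add(line); }
    }

    // Writes everything buffered so far and returns the number of lines written.
    public int Flush(TextWriter writer)
    {
        List<string> batch;
        lock (_sync)
        {
            // Only a quick pointer swap happens under the lock.
            batch = _pending;
            _pending = new List<string>();
        }

        // The potentially slow IO happens after the lock has been released.
        foreach (string line in batch)
        {
            writer.WriteLine(line);
        }
        return batch.Count;
    }
}
```

A thread calling Add never waits behind somebody else’s IO; at worst it waits for the swap.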

8. Assert lock ownership.

Races often result when some leaf-level code assumes a lock has been taken at a higher level on the call-stack, but the caller has forgotten to acquire it.  Or maybe the owner of that code recently refactored it and didn’t realize the implicit pre-condition that was accidentally broken.  This may go undetected in test suites unless the race actually happens.  I personally hope we add a Monitor.IsHeld API in the future which you could wrap in a call to Debug.Assert (or whatever your assert infrastructure happens to be).  Sans that, you can build this today by wrapping calls to Monitor.Enter/Exit and maintaining recursion state yourself.  It’d be great if somebody developed some type of annotations in the future to make such assertions easier to write and maintain.

Note that the IsHeld functionality should never be used to dynamically influence lock acquisition and release at runtime, e.g. avoiding recursion and taking or releasing based on its value.  This indicates poorly factored code.  In fact, the only use we would encourage is SomeAssertAPI(Monitor.IsHeld(foo)).
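
Here is a minimal sketch of the hand-rolled approach described above: a wrapper around Monitor.Enter/Exit that remembers its owner so that worker functions can assert ownership.  CheckedLock, Wallet, and their members are hypothetical names, not Framework types.  (Later versions of the Framework expose Monitor.IsEntered, which can serve the same purpose without a wrapper.)  The sketch is not hardened against asynchronous thread aborts, and IsHeld is meant for asserts only.

```csharp
using System.Diagnostics;
using System.Threading;

// Hypothetical wrapper that tracks the owning thread of a Monitor-based lock.
public sealed class CheckedLock
{
    private readonly object _sync = new object();
    private Thread _owner;     // only written while _sync is held
    private int _recursion;    // only touched while _sync is held

    public void Enter()
    {
        Monitor.Enter(_sync);
        _owner = Thread.CurrentThread;
        _recursion++;
    }

    public void Exit()
    {
        Debug.Assert(IsHeld, "Exit called by a thread that does not own the lock.");
        if (--_recursion == 0) _owner = null;
        Monitor.Exit(_sync);
    }

    // For assertions only; never branch on this to decide whether to lock.
    public bool IsHeld
    {
        get { return _owner == Thread.CurrentThread; }
    }
}

public sealed class Wallet
{
    private readonly CheckedLock _lock = new CheckedLock();
    private decimal _balance;

    // Public entry point: takes the lock.
    public void Deposit(decimal amount)
    {
        _lock.Enter();
        try { DepositCore(amount); }
        finally { _lock.Exit(); }
    }

    public decimal Balance
    {
        get
        {
            _lock.Enter();
            try { return _balance; }
            finally { _lock.Exit(); }
        }
    }

    // Internal worker: assumes the lock is held and says so.
    private void DepositCore(decimal amount)
    {
        Debug.Assert(_lock.IsHeld, "DepositCore requires the wallet lock.");
        _balance += amount;
    }
}
```

The split between Deposit and DepositCore is the same factoring recommended in #9: entry points take the lock, workers assert it.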

9. Avoid lock recursion in your design.  Use a non-recursive lock where possible.

Recursion typically indicates an over-simplification in your synchronization design that often leads to less reliable code.  Some designs use lock recursion as a way to avoid splitting functions into those that take locks and those that assume locks are already taken.  This can admittedly lead to a reduction in code size and therefore a shorter time-to-write, but results in a more brittle design in the end.

It is always a better idea to factor code into public entry-points that take non-recursive locks, and internal worker functions that assert a lock is held.  Recursive lock calls are redundant work that contributes to raw performance overhead.  But worse, depending on recursion can make it more difficult to understand the synchronization behavior of your program, in particular at what boundaries invariants are supposed to hold.  Usually we’d like to say that the first line after a lock acquisition represents an invariant “safe point” for an object, but as soon as recursion is introduced this statement can no longer be made confidently.  This in turn makes it more difficult to ensure correct and reliable behavior when dynamically composed.

As a community, we should transition to non-recursive locks as soon as possible.  Most locks that you have in your toolkit today—including Win32 CRITICAL_SECTIONs and the CLR Monitor—screwed this up.  Java realized this and now ships non-recursive variants of their locks.  Using a non-recursive design requires more discipline, and therefore we expect some scenarios to continue using recursive locks for some time to come.  Over time, however, we’d like to wean developers off of lock recursion completely.

10. Don’t build your own lock.

Most locks are built out of simple principles at the core.  There’s a state variable, a few interlocked instructions (exposed to managed code through the Interlocked class), with some form of spinning and possibly waiting on an event when contention is detected.  Given this, it may look straightforward to build your own.  This is deceptively difficult.

The CLR locks have to coordinate with hosts so that they can perform deadlock detection and sophisticated user-mode scheduling for hosted customer-authored code.  Some of our locks (Monitor) make higher reliability guarantees so that we can safely use them during AppDomain teardown.  We have tuned our Monitor implementation to use an ideal mixture of spinning & waiting across many OS SKUs, CPU architectures, and cache hierarchy arrangements.  We make sure we work correctly with Intel HyperThreading.  We mark critical regions of code manipulating the lock data structure itself so that would-be thread aborts will be processed correctly while sensitive shared state manipulation is underway.  And last but not least, the C# and VB languages offer the ‘lock’ and ‘SyncLock’ keywords whose code-generation pattern ensures that our Framework and our customer’s code doesn’t orphan locks in the face of asynchronous thread aborts.  To get all of this right requires a lot of hard work, time, and testing.

With that said, we may not have every lock you could ever want in the Framework today.  Spin locks are a popular request that can help scalability of highly concurrent and leaf-level code.  Thankfully, Jeff Richter wrote an article and supplied a suitable spin lock on MSDN some time ago.  In Orcas, we are tentatively going to supply a new ReaderWriterLockSlim type that offers much better performance and scalability than our current ReaderWriterLock (watch those CTPs).  If there’s still some interesting lock you need but we don’t currently offer, please drop me a line and let me know.  If you need it, chances are somebody else does too.

11. Don’t call Monitor.Enter on AppDomain-agile objects (e.g. Types and strings).

Instances of some Type objects are shared across AppDomains.  The most notable are Type objects for domain neutral assemblies (like mscorlib) and cross-assembly interned strings.  While it may look innocuous, locks taken on these things are visible across all AppDomains in the process.  As an example, two AppDomains executing this same code will stomp all over each other:

lock (typeof(System.String)) { … }

This can cause severe reliability problems should a lock get orphaned in an add-in or hosted scenario, possibly causing deadlocks from deep within your library that seemingly inexplicably span multiple AppDomains.  The resulting code also exhibits false conflicts between code running in multiple domains and therefore can impact scalability in a way that is difficult for customers (and library authors!) to reason about.
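
The usual way to stay clear of this is to lock on a private object that nothing outside your class can reach.  A small sketch, using a hypothetical SettingsCache class:

```csharp
using System.Collections.Generic;

public static class SettingsCache
{
    // A plain object allocated by this class.  Unlike a Type object or an
    // interned string, no other code (or AppDomain) can take a lock on it.
    private static readonly object s_lock = new object();
    private static readonly Dictionary<string, string> s_values =
        new Dictionary<string, string>();

    public static void Set(string key, string value)
    {
        lock (s_lock) { s_values[key] = value; }
    }

    public static bool TryGet(string key, out string value)
    {
        lock (s_lock) { return s_values.TryGetValue(key, out value); }
    }
}
```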

12. Don’t use a machine- or process-wide synchronization primitive when AppDomain-wide would suffice.

The Mutex and Semaphore types in the .NET Framework should only ever be used for legacy, interop, cross-AppDomain, and cross-process reasons.  They are very heavy-weight (several orders of magnitude slower than a CLR Monitor, actually) and they also introduce additional reliability and affinity problems: they can be orphaned, process-external DoS attacks can be mounted, and they can introduce synchronization bottlenecks that contribute to scalability blockers.  Moreover, they are associated with the OS thread, and therefore impose thread affinity.  As already noted, this is a bad thing.

13. A race condition or deadlock in library code is always a bug.

Race conditions and deadlocks can be very difficult to fix.  Sometimes it requires refactoring a bunch of code to work around a seemingly obscure, corner-case sequence of events.  It’s tempting to rearrange things to narrow the window of the race or reduce the likelihood of a deadlock.  But please don’t lose sight of the fact that this still represents a correctness problem in the library itself, no matter how narrow the race is made.  Sometimes fixing the bug would require breaking changes.  Sometimes you simply don’t have enough time to fix the bug in time for product ship.  In either case, this is something that should be measured and decided based on the quality bar for the product at the time the bug is found.  Remember that as higher degrees of concurrency are used in the hardware, the probability of these bugs resurfacing becomes higher.  A murky “won’t fix” race condition in 2008 that repros only once in a while on high end machines could become a costly servicing fix by 2010 that repros routinely on middle-of-the-line hardware.  That jump from 32 to 64 cores is a rather substantial one, at least in terms of change to program timing.

Reliability

14. Every lock acquisition might throw an exception.  Be prepared for it.

Most locks lazily allocate an event if a lock acquisition encounters contention, including CLR monitors.  This allocation can fail during low resource conditions, causing OOMs originating from the entrance to the lock.  (Note that a typical non-blocking spin lock cannot fail with OOM, which allows it to be used in some resource constrained scenarios such as inside a CER.)  Similarly, a host like SQL Server can perform deadlock detection and even break those deadlocks by generating exceptions that originate from the Enter statement, manifesting as a System.Runtime.InteropServices.COMException.

Often there isn’t much that can be done in response to such an exception.  But reliability- and security-sensitive code that must deal with failure robustly should consider this case.  We would have liked it to be the case that host deadlocks can be responded to intelligently, but most library code can’t intelligently unwind to a safe point on the stack so that it can back-off and retry the operation.  There is simply too much cross-library intermixing on a typical stack.  This is why timeout-based Monitor acquisitions with TryEnter are typically a bad idea for deadlock prevention.

15. Lock leveling should be used to avoid deadlocks.

Lock leveling (a.k.a. lock hierarchy) is a scheme in which a relative number (a level) is assigned to every lock, and lock acquisition is restricted such that a thread can only acquire locks at levels strictly lower than those of the locks it already holds.  Strictly following this discipline guarantees a deadlock free system, and is described in more detail in Duffy, 06b.  Without this, libraries are subject to dynamic composition- and reentrancy-induced deadlocks, which cause users trying to write even moderately reliable code a lot of pain and frustration.  This pain will only become worse as more of them try to compose our libraries into highly concurrent applications.  An alternative to true lock leveling which doesn’t require new BCL types is to stick to non-recursive locks and to ensure that multiple lock acquisitions are done at once, in some well-defined order.

There are two big problems that will surely get in the way of adopting lock leveling today.

First, we don’t have a standard leveled lock type in the .NET Framework today.  While the article I referenced contains a sample, the simple fact is that the lion’s share of library developers and customers will not start lock leveling in any serious way without official support.  There is also a question of whether programmers can be wildly successful building apps and libraries with lock leveling without good tool and deeper programming model support.

Second, lock leveling is a very onerous discipline.  We’ve used it in the CLR code base for the parts of the system that are relatively closed.  (I’m fine saying this since I’m basing this off of the Rotor code-base.)  Lock leveling doesn’t typically compose well with other libraries because the levels are represented using arbitrary numbering schemes.  You might want to extend it to prevent certain cross-assembly calls, interop calls that might acquire Win32 critical sections, or calls into other parts of the system that acquire locks outside of the current hierarchy.  These are all features that have to be built on top of the base lock leveling scheme; again, without a standard library for this, it’s unlikely everybody will want to build it all themselves.

Lock leveling is not a silver bullet, but it’s probably the best thing we have for avoiding deadlocks with today’s multithreading primitives.
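
As a rough illustration of the lighter-weight alternative mentioned earlier (acquiring multiple locks at once, in a well-defined order), here is a sketch that uses a unique Id as a stand-in for the lock level.  OrderedAccount and its members are hypothetical, and the sketch assumes Ids are unique across all accounts.  It is not a leveled lock type; nothing enforces the ordering outside of Transfer.

```csharp
using System;

public sealed class OrderedAccount
{
    private readonly object _sync = new object();
    private readonly int _id;      // assumed unique; acts as this lock's "level"
    private decimal _balance;

    public OrderedAccount(int id, decimal balance)
    {
        _id = id;
        _balance = balance;
    }

    public decimal Balance
    {
        get { lock (_sync) { return _balance; } }
    }

    public static void Transfer(OrderedAccount from, OrderedAccount to, decimal amount)
    {
        if (from._id == to._id)
            throw new ArgumentException("Accounts must be distinct.");

        // Always take the higher-numbered lock first, regardless of the
        // direction of the transfer, so two opposing transfers cannot deadlock.
        OrderedAccount first = from._id > to._id ? from : to;
        OrderedAccount second = from._id > to._id ? to : from;

        lock (first._sync)
        {
            lock (second._sync)
            {
                from._balance -= amount;
                to._balance += amount;
            }
        }
    }
}
```

Transfer(a, b, …) and Transfer(b, a, …) running concurrently both lock the same account first, which is what rules out the classic two-lock deadlock.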

16. Restore sensitive invariants in the face of an exception before the 1st pass executes up the stack.

This is in part a security concern as well as a reliability one.  The CLR exception model is a 2-pass model which we inherit from Windows SEH.  The 1st pass runs before finally blocks execute, meaning that the locks held by the thread are held when up-stack filters are run and get a chance to catch the exception.  VB and VC++ expose filters in the language, while C# doesn’t.   Code inside of filters can see inside possibly broken invariants because the locks are still held.

Thankfully CAS asserts and impersonation cannot leak in this way, but this can still cause some surprising failures.  You can stop the 1st pass and ensure your lock is released by wrapping a try/catch around the sensitive operation and re-throwing the exception from your catch:

try {
    lock (foo) {
        // (Break invariants…)
        // Possibly throw an exception…
        // (Restore invariants…)
    }
} catch {
    throw;
}

This is only something you should consider if security and reliability requirements dictate it.

17. If class constructors are required to have run for code inside of a lock, consider eagerly running the constructor with a call to RuntimeHelpers.RunClassConstructor.

Reentrancy deadlocks and broken invariants involving cctors are difficult to reason about because behavior is based on program history and timing, often in a nondeterministic way.  The problem specific to locks is that running a cctor effectively introduces possibly reentrant points into your code anywhere statics are accessed for a type with a cctor.  If running the cctor causes an exception or attempts to access some data structure which the current thread has already locked and placed into an inconsistent state, you may encounter bugs related to these broken invariants.  If using a non-recursive lock, this can lead to deadlocks.  Calling RuntimeHelpers.RunClassConstructor hoists potential problems such as this to a well-defined point in your code.  It is not perfect, as other locks may be held higher up on the call-stack, but it can statistically reduce the chance of problems in your users’ code.
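
A small sketch of what hoisting the cctor looks like in practice.  Defaults and Tracker are hypothetical types; the only Framework call of interest is RuntimeHelpers.RunClassConstructor, which takes the type’s RuntimeTypeHandle and does nothing if the cctor has already run.

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;

public static class Defaults
{
    public static readonly int Limit;

    // The class constructor we would rather not run for the first time
    // while a lock is held.
    static Defaults()
    {
        Limit = 10;
    }
}

public sealed class Tracker
{
    private readonly object _sync = new object();
    private readonly List<int> _items = new List<int>();

    public bool TryAdd(int item)
    {
        // Force the cctor to run here, at a well-defined point outside the lock.
        RuntimeHelpers.RunClassConstructor(typeof(Defaults).TypeHandle);

        lock (_sync)
        {
            // This static access can no longer trigger the cctor under the lock.
            if (_items.Count >= Defaults.Limit) return false;
            _items.Add(item);
            return true;
        }
    }
}
```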

18. Don’t use Windows Asynchronous Procedure Calls (APCs) in managed code.

We recently considered adding APC-based file IO completion to the BCL file APIs.  Several Win32 IO APIs offer this, and some use it for scalable IO that doesn’t need to use an event or an IO Completion thread.  After considering it briefly, we realized how bad of an idea adding similar support to managed code would have been.  APCs pollute the OS thread to which they are tied, and are a strange form of thread affinity (more on that later).  They can fire at arbitrary alertable blocking points in the code, including after a thread pool thread has been returned back to the pool, after the finalizer thread has gone on to Finalize other objects in the process, or even at some random blocking point deep within the EE (perhaps while we aren’t ready for it).  If an APC raises an exception, the state of affairs at the time of the crash is likely to be quite confusing.  The stack certainly will be.  Not only do APCs represent possible security threats, but they can also introduce many of the subtle reliability problems already outlined.  They have been avoided almost entirely in three major releases of the .NET Framework, and we ought to continue avoiding them.

19. Don’t change a thread’s priority.

This could fall into the rules below about “Scheduling & Threads,” because it is semantically tied to the notion of an OS thread, were it not for the large reliability risk inherent in it.  Priority changes can cause subtle scalability problems due to priority inversion, including preventing the CLR’s finalizer thread (which runs at high priority) from making forward progress.  The OS has support for anti-starvation of threads—including a balance set manager which boosts the priority of a thread waiting for a lock for certain OS synchronization primitives—but this actually doesn’t extend to CLR locks.  Testing in isolation will tend not to find priority-related bugs.  Instead, app developers trying to compose libraries into their programs will discover them.  Users may decide to go ahead and change priorities themselves, but then the onus for breaking a best practice is on them, not us.

20. Always test & retest a wait condition inside of a lock.

A common mistake when writing cross-thread synchronization code is to either forget to retest a condition each time a thread wakes up or to test this condition outside of a lock.  If you’re using an EventWaitHandle or Monitor.Wait/Pulse/PulseAll, for example, to put one thread to sleep while another produces some state transition of interest, you typically need to double-check that that state is in the expected condition when waking.  This is especially true of single-producer/multi-consumer scenarios, where multiple threads frequently race with one another.  For example:

void Put(T obj) {
    lock (myLock) {
        S1; // enqueue it
        Monitor.PulseAll(myLock);
    }
}

T Get() {
    lock (myLock) {
        while (empty) {
            Monitor.Wait(myLock);
        }
        S2; // dequeue and return the item
    }
}

Notice that Get loops around testing the ‘empty’ variable to decide when to wait for a new item, and it does so while holding the lock.  Whenever this consumer is woken up, it must retest the variable.  If it doesn’t, multiple threads may wake up due to a single new item becoming available only for all but one of them to find that the queue actually became empty by the time it reached S2.  This is generally easier to do with Monitors because they combine the lock with the condition variable.  Missteps with Win32 events are easier because the lock must be separately managed.
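
For completeness, here is one way the skeleton above might be filled in as a self-contained class.  BlockingQueue<T> is a hypothetical name, and S1/S2 are marked in comments to line up with the discussion.  It is unbounded and has no shutdown or timeout support, so consider it a sketch of the wait pattern only.

```csharp
using System.Collections.Generic;
using System.Threading;

public sealed class BlockingQueue<T>
{
    private readonly object _lock = new object();
    private readonly Queue<T> _queue = new Queue<T>();

    public void Put(T obj)
    {
        lock (_lock)
        {
            _queue.Enqueue(obj);        // S1: enqueue it
            Monitor.PulseAll(_lock);    // wake every waiter; each one retests
        }
    }

    public T Get()
    {
        lock (_lock)
        {
            // Retest on every wake-up, while holding the lock: another
            // consumer may have taken the item before we reacquired it.
            while (_queue.Count == 0)
            {
                Monitor.Wait(_lock);
            }
            return _queue.Dequeue();    // S2: dequeue and return the item
        }
    }
}
```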

Scheduling & Threads

21. Don’t write code that depends on the OS thread ID or HANDLE.  Use Thread.CurrentThread or Thread.CurrentThread.ManagedThreadId instead.

When code depends on the identity of the actual OS thread, the logical task running that code is bound to the thread.  This is a major piece of the thread affinity problem mentioned earlier on.  If running on a system where threads are migrated between OS threads using some form of user-mode scheduling—such as fibers—this can break if user-mode switches happen at certain points in the code.  If this dependency is enforced (using Thread.BeginThreadAffinity and EndThreadAffinity), at least the system remains correct, but this still limits the ability of the scheduler to maintain overall system forward progress.

Unfortunately, many Win32 and Framework APIs may imply thread affinity when used.  Several GUI APIs require that they are called from a thread which owns the message queue for the GUI element in question.  Historically, some Microsoft components like the Shell, MSHTML.DLL, and Office COM APIs have also abused this practice.  The situation on the server is much better, but it still isn’t perfect.  Some APIs we design with the client in mind end up being used on the server, often with less than desirable results.  My hope is that the whole platform moves away from these problems in the future.

22. Mark regions of code that do depend on the OS thread identity with Thread.BeginThreadAffinity/EndThreadAffinity.

The corollary to the previous rule is that, if you must have code that depends on the OS identity, you must tell the CLR (and potential host) about it.  That’s what the Thread.BeginThreadAffinity and EndThreadAffinity methods do, new to V2.0.  Marking these regions prevents OS thread migration altogether.  This is a crappy practice, but is less crappy than allowing thread migration to happen anyway, causing things to break in unexpected and unpredictable ways.
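
Marking such a region is mechanical; the important part is pairing the calls in a try/finally so the region always ends.  In the sketch below, AffinityExample and RunWithThreadAffinity are hypothetical names, and the delegate stands in for whatever work genuinely depends on the physical OS thread.

```csharp
using System;
using System.Threading;

public static class AffinityExample
{
    // Runs 'action' inside a region marked as depending on OS thread identity.
    public static void RunWithThreadAffinity(Action action)
    {
        Thread.BeginThreadAffinity();
        try
        {
            // Hypothetical work that depends on the physical OS thread.
            action();
        }
        finally
        {
            // Always end the region, even if the work throws.
            Thread.EndThreadAffinity();
        }
    }
}
```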

23. Always access TLS through the .NET Framework mechanisms: ThreadStaticAttribute or Thread.GetData/SetData and related members.

The implementation of these APIs abstracts away the dependency on the OS thread, allowing you to store state associated with the logical piece of work.  Although they sound very thread-specific, these actually store state based on whatever user-mode scheduling mechanism is being used, and therefore you don’t actually take thread affinity when you use them.  For example, we can (in theory) store information into Fiber Local Storage (FLS) or manually move data across fibers rather than using the underlying Windows Thread Local Storage (TLS) mechanisms if a host has decided to use fibers.  While it’s tempting to say “Who cares?” for this one, particularly since Whidbey decided not to support fiber mode before shipping, I believe it’s premature: we haven’t seen the death of fibers just yet.
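
A quick sketch of the ThreadStaticAttribute flavor, using a hypothetical RequestContext class that keeps a per-thread nesting depth.  Note that the field has no initializer: an initializer on a [ThreadStatic] field only runs for the first thread that touches the type, so it is safer to rely on the default value.

```csharp
using System;

public static class RequestContext
{
    // Each managed thread sees its own copy of this field, starting at 0.
    [ThreadStatic]
    private static int t_depth;

    public static int Enter()
    {
        return ++t_depth;
    }

    public static int Exit()
    {
        return --t_depth;
    }

    public static int Depth
    {
        get { return t_depth; }
    }
}
```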

24. Always access the security/impersonation tokens or locale information through the Thread object.

As with the previous item, we abstract away the storage of this information on the Thread object, via the Thread.CurrentCulture, Thread.CurrentUICulture, and Thread.CurrentPrincipal properties.  We flow this information across logical async points as required, and therefore using them doesn’t imply any sort of hard OS thread affinity.

25. Always access the “last error” after an interop call via Marshal.GetLastWin32Error.

If you mark a P/Invoke signature with [DllImportAttribute(…, SetLastError=true)], then the CLR will store the Win32 last error on the logical CLR thread.  This ensures that, even if a fiber switch happens before you can check this value, your last error will be preserved across the physical OS threads.  The Win32 APIs kernel32!GetLastError and kernel32!SetLastError, on the other hand, store this information in the TEB.  Therefore, if you are P/Invoking to get at the last error information, you are apt to be quite surprised if you are running in an environment that permits thread migration.  You can avoid this by always using the safe Marshal.GetLastWin32Error function.
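
As a sketch of the pattern, here is a P/Invoke wrapper around kernel32!DeleteFile that reads the error the CLR captured for the call.  NativeFile and Delete are hypothetical names, and the example only makes sense on Windows.

```csharp
using System.ComponentModel;
using System.Runtime.InteropServices;

public static class NativeFile
{
    // SetLastError = true asks the CLR to capture the Win32 last error
    // immediately after the call returns.
    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool DeleteFile(string lpFileName);

    public static void Delete(string path)
    {
        if (!DeleteFile(path))
        {
            // Read the value the CLR saved, rather than P/Invoking to
            // kernel32!GetLastError ourselves.
            int error = Marshal.GetLastWin32Error();
            throw new Win32Exception(error);
        }
    }
}
```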

26. Avoid P/Invoking to other Win32 APIs that access data in the Thread Environment Block (TEB).

Security and locale information is something Win32 stores in the TEB that we already expose in the Framework APIs, so it’s rather easy to follow the advice here.  Unfortunately, many Win32 APIs access data from the TEB without necessarily saying so, or look for & possibly lazily create a window message queue (i.e. in USER32), all of which creates a sort of silent thread affinity.  In other words, a disaster waiting to happen.  I wish I had a big laundry list of black-listed APIs, but I don’t.

Scalability & Performance

27. Consider using a reader/writer lock for read-only synchronization.

A lot of concurrent code has a high read-to-write ratio.  Given this, using exclusive synchronization (like CLR monitors) can hurt scalability in situations with large numbers of concurrent readers.  While starting off with a reader/writer lock could be viewed as a premature optimization, the reality is that many situations warrant using one due to the inherent properties of the problem.  If you know you’ll have more concurrent readers than writers, you can probably do some quick back-of-the-napkin math and come to the conclusion that a reader/writer lock is a good first approach.  For other cases, refactoring existing code to use one can be a fairly straightforward translation.  If you do this, obviously you need to be careful that the read-lock-protected code actually only performs reads to maintain the correctness of your system.

There has been a lot of negative press about the BCL’s ReaderWriterLock.  In particular, a successful acquire costs about 8x what a successful Monitor acquire does.  Unfortunately, this has (in the past) prevented many library developers from using reader/writer locks altogether.  This is the primary reason we are tentatively supplying a new lock implementation, ReaderWriterLockSlim, in Orcas.  The BCL’s synchronization primitives ought not to get in the way of optimal synchronization for your data structures.
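
For a feel of what the translation looks like, here is a sketch of a read-mostly table.  It assumes ReaderWriterLockSlim with the EnterReadLock/ExitReadLock and EnterWriteLock/ExitWriteLock members it eventually shipped with; ReadMostlyTable itself is a hypothetical class.

```csharp
using System.Collections.Generic;
using System.Threading;

public sealed class ReadMostlyTable
{
    private readonly ReaderWriterLockSlim _rw = new ReaderWriterLockSlim();
    private readonly Dictionary<string, int> _data = new Dictionary<string, int>();

    // Many readers may hold the read lock at the same time.
    public bool TryGet(string key, out int value)
    {
        _rw.EnterReadLock();
        try
        {
            // Only reads are performed under the read lock.
            return _data.TryGetValue(key, out value);
        }
        finally
        {
            _rw.ExitReadLock();
        }
    }

    // Writers are exclusive with respect to readers and other writers.
    public void Set(string key, int value)
    {
        _rw.EnterWriteLock();
        try
        {
            _data[key] = value;
        }
        finally
        {
            _rw.ExitWriteLock();
        }
    }
}
```

Since there is no ‘lock’-style keyword for this type, the try/finally has to be written by hand, which is one more reason to keep the protected regions small.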

28. Avoid lock free code at all costs for all but the most critical perf reasons.

Compilers and processors reorder reads and writes to get better perf, but in doing so make it harder to write code that is sensitive to the read/write orderings between multiple threads.  The CLR memory model gives a base level of guarantees that we preserve across all hardware platforms.  With that said, any sort of dependence on the CLR memory model is advised against; we did that work in 2.0 to strengthen the memory model to eliminate object-publish-before-initialization and double-checked locking bugs that were found throughout the .NET Framework, not to encourage you to write more lock free code.

The reason?  Lock free code is impossible to write, maintain, and debug for most developers, even those who have been doing it for years.  This is the type of code whose proliferation will lead to poor reliability across the board for managed libraries, longer stress lock downs on multi-core and MP machines, and is best avoided.  Use of volatile reads and writes and Thread.MemoryBarrier should be viewed with great suspicion, as it probably means somebody is trying to be more clever than is required.

With all of that said, there are a couple of “blessed” techniques that can be considered when informed by scalability and perf testing (see Morrison, 2005):

(a) The simple double checked locking pattern can be used when you need to prevent multiple instances from being created and you don’t want to use a cctor (because the state may not be needed by all users of your class).  This pattern takes the form:

static State s_state;
static object s_stateLock = new object();
static State GetState() {
    if (s_state == null) {
        lock (s_stateLock) {
            if (s_state == null) {
                s_state = new State();
            }
        }
    }
    return s_state;
}

Note that simple variants of this pattern don’t work, such as keeping a separate ‘bool initialized’ variable, due to read reordering (see Duffy, 06).

(b) Optimistic/non-blocking concurrency.  In some cases, you can safely use Interlocked operations to avoid a heavyweight lock, such as doing a one-time allocation of data, incrementing counters, or inserting into a list.  In other areas, you might use a variable to determine when a read has become dirty, and retry it, typically done via a version number incremented on each update.

Again, you should only pursue these approaches if you’ve measured or done the thought exercise to determine it will pay off.  There are additional tricks you can play if you really need to, but most library code should not go any further than what is listed here.
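
To illustrate two of the Interlocked uses from (b), here is a sketch of a one-time allocation and a counter.  LazyBox and its members are hypothetical.  Unlike double-checked locking, racing threads may each allocate a candidate; only one is ever published and the rest become garbage, so this only suits state that is cheap and harmless to create more than once.

```csharp
using System.Threading;

public sealed class LazyBox
{
    private object _value;   // published at most once
    private int _hits;

    public object Value
    {
        get
        {
            if (_value == null)
            {
                object candidate = new object();
                // Publish the candidate only if the field is still null.
                // If another thread won the race, ours is simply dropped.
                Interlocked.CompareExchange(ref _value, candidate, null);
            }
            return _value;
        }
    }

    // Atomically increments the counter and returns the new value.
    public int RecordHit()
    {
        return Interlocked.Increment(ref _hits);
    }
}
```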

29. Avoid hand-coded spin waits.  If you must use one, do it right.

Sometimes it is tempting to put a busy wait in very tightly synchronized regions of code.  For instance, when one part of a two-part update is observed then you may know that the second part will be published imminently; instead of giving up the time-slice, it may look appealing to enter a while loop on an MP machine, continuously re-reading whatever state it is waiting to be updated, and then proceed once it sees it.  Unless written properly, however, this technique won’t work well on single-CPU and Intel HyperThreaded systems.  It’s often simpler to use locks or events (such as Monitor.Wait/Pulse/PulseAll) for this type of cross-thread communication.  These employ some reasonable amount of spinning versus waiting for you.

Spin waits can actually improve scalability for profiled bottlenecks or when your scalability goals make it necessary.  Note that this is NOT a complete replacement for a good spin lock.  If you decide to use a spin wait, follow these guidelines.  The worst type of spin wait is a ‘while (!cond) ;’ statement.  A properly written wait must yield the thread in the case of a single-CPU system, or issue a Thread.SpinWait with some reasonably small argument (25 is a good starting point, tune from there) on every loop iteration otherwise.  This last point ensures good perf on Intel HyperThreading.  E.g.:

{
    uint iters = 0;
    while (!cond) {
        if ((++iters % 50) == 0) {
            // Every so often we sleep with a 1ms timeout (see #30 for justification).
            Thread.Sleep(1);
        } else if (Environment.ProcessorCount == 1) {
            // On a single-CPU machine we yield the thread.
            Thread.Sleep(0);
        } else {
            // Issue YIELD instructions to let the other hardware thread move.
            Thread.SpinWait(25);
        }
    }
}

The spin count of ‘25’ is fairly arbitrary and should be tuned on the architectures you care about.  And you may want to consider backing off or adding some randomization to avoid regular contention patterns.  Except for very specialized scenarios, most spin waits will have to fall back to waiting on an event after so many iterations.  Remember, spinning is just a waste of CPU time if it happens too frequently or for too long, and can result in an angry customer.  A hung app is generally preferable to a machine whose CPUs are spiked at 100% for minutes at a time.

30. When yielding the current thread’s time slice, use Thread.Sleep(1) (eventually).

Calling Thread.Sleep(0) doesn’t let lower priority threads run.  If a user has lowered the priority of their thread and uses it to call your API, this can lead to nasty priority inversion situations.  Eventually issuing a Thread.Sleep(1) is the best way to avoid this problem, perhaps starting with a 0-timeout and falling back to the 1ms-timeout after a few tries.  Particularly if you come from a Win32 background, it might be tempting to P/Invoke to kernel32!SwitchToThread, since it is cheaper than issuing a kernel32!SleepEx (which is what Thread.Sleep does).  This is because SleepEx is called in alertable mode, which incurs somewhat expensive checks for APCs.  Unfortunately, P/Invoking to SwitchToThread bypasses important thread scheduling hooks that call out to a would-be host.  Therefore, you should continue to use Thread.Sleep unless and until the .NET Framework offers an official Yield API.
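One way to package the "0-timeout first, 1ms-timeout eventually" policy is a tiny helper that takes the number of attempts made so far.  The Yielder name and the threshold of 5 are assumptions for the sake of the sketch; tune the threshold for your own scenario.

```csharp
using System.Threading;

// Hypothetical helper: picks the Sleep timeout based on how many times
// the caller has already yielded while waiting for the same condition.
public static class Yielder {
    // Arbitrary starting point; not a recommended constant.
    private const int AttemptsBeforeRealSleep = 5;

    public static int TimeoutFor(int attempt) {
        // Sleep(0) only yields to threads of equal or higher priority;
        // Sleep(1) eventually lets lower priority threads run as well.
        return attempt < AttemptsBeforeRealSleep ? 0 : 1;
    }

    public static void Yield(int attempt) {
        Thread.Sleep(TimeoutFor(attempt));
    }
}
```

A wait loop would call Yielder.Yield(i) with an incrementing i, so that a low priority thread holding whatever you are waiting for is guaranteed a chance to run after a handful of tries.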

31. Consider using spin-locks for high traffic leaf-level regions of code.

A spin-lock avoids giving up the time-slice on MP systems, and can lead to more scalable code when used correctly.  Context switches in Windows are anything but cheap, ranging from 4,000 to 8,000 cycles on average, and even more on some popular CPU architectures.  Giving up the time-slice also means that you’re possibly giving up data in the cache, depending on the data intensiveness of the work that is scheduled as a replacement on the processor.  And any time you have cross thread causality, it can cause a rippling effect across many threads, effectively stalling a pipeline of parallel work.  As usual, using a spin-lock should always be done in response to a measured problem, not to look clever to your friends.

32. You must understand every instruction executed while a spin lock is held.

Spin locks are powerful but very dangerous.  You must ensure the time the lock is held is very small, and that the entire set of instructions run is completely under your control.  Virtual method calls and blocking operations are completely out of the question.  Because a spin-lock spins rather than blocking under contention, a deadlock will manifest as a spiked CPU and system-wide performance degradation, and therefore is a much more serious bug than a typical hang.

Whenever you use a spin lock you are making a bold statement about your code and thread interactions:  it is more profitable for other contending threads to possibly waste CPU cycles than to wait and let other work make forward progress.  If this statement turns out to be wrong, a large number of cycles will frequently get thrown away due to spinning, and the overall throughput of the system will suffer.  On servers the result could be catastrophic and you may cost your customers money due to an impact to the achievable throughput.  Each cycle you waste in a loop waiting for a spin lock to become available is one that could have been used to make forward progress in the app.
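For reference, a bare-bones spin lock built on a single CAS might look like the sketch below.  SimpleSpinLock is a made-up name, and this version is deliberately naive: it is not reentrant, it does not track ownership, and it makes no attempt to stay reliable in the face of thread aborts.  Treat it as an illustration of the shape, not something to ship.

```csharp
using System;
using System.Threading;

// Hypothetical, deliberately minimal spin lock. Not reentrant, no owner
// tracking, and not hardened against asynchronous exceptions.
public sealed class SimpleSpinLock {
    private int m_taken;  // 0 = free, 1 = held

    public void Enter() {
        uint iters = 0;
        // Try to flip 0 -> 1; anything else means somebody else holds the lock.
        while (Interlocked.CompareExchange(ref m_taken, 1, 0) != 0) {
            if ((++iters % 50) == 0) {
                Thread.Sleep(1);      // Let lower priority threads run now and then.
            } else if (Environment.ProcessorCount == 1) {
                Thread.Sleep(0);      // Single CPU: the owner can't run while we spin.
            } else {
                Thread.SpinWait(25);  // Arbitrary count; tune on real hardware.
            }
        }
    }

    public void Exit() {
        // A full-fence write, so updates made under the lock are visible
        // before the lock is observed as free.
        Interlocked.Exchange(ref m_taken, 0);
    }
}
```

The code between Enter and Exit should be a handful of instructions you fully understand, e.g. bumping a counter or swapping a couple of fields, and nothing that can block or call out to code you don't own.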

33. Consider a low-lock data structure for hot queues and stacks.

Windows has a set of ‘S-List’ APIs that provide a way to do “lock free” pushes and pops from a stack data structure.  This can lead to highly scalable, non-blocking algorithms, much in the same way that spin-locks do.  Unfortunately it is not a simple matter to use ‘S-Lists’ from managed code, due to the requirements for memory pinning among other things.  It’s very easy to build a lock-free stack out of CAS operations which is suitable for these situations.  The algorithm goes something like this: 

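using System;
using System.Threading;

class LockFreeStack<T> {
    class Node {
        // Internal so the enclosing class can reach them (C# fields default to private).
        internal readonly T m_obj;
        internal Node m_next;

        internal Node(T obj) { m_obj = obj; }
    }

    private Node m_head;

    public void Push(T obj) {
        Node n = new Node(obj);
        Node h;
        do {
            h = m_head;
            n.m_next = h;
            // Retry if another thread changed m_head since we read it.
        } while (Interlocked.CompareExchange(ref m_head, n, h) != h);
    }

    public T Pop() {
        Node h;
        Node nh;
        do {
            h = m_head;
            if (h == null) {
                // Nothing to pop; fail explicitly rather than dereferencing null.
                throw new InvalidOperationException("Stack is empty.");
            }
            nh = h.m_next;
        } while (Interlocked.CompareExchange(ref m_head, nh, h) != h);
        return h.m_obj;
    }
}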

This sample implementation carefully avoids the well-known ABA problem as a result of two things: (1) it assumes a GC, ensuring memory isn’t reclaimed and reused so long as at least one thread has a reference to it; and (2) we don’t make any attempt to pool nodes. A more efficient solution might pool nodes so that each Push doesn’t have to allocate one, but then would have to also implement an ABA-safe algorithm. This is typically done by widening the target of the CAS so that it can contain a version number in addition to the “next node” reference.

There are other permutations of this lock-free data structure pattern which can be useful.  Lock-free queues can be built (see Michael, 96 for an example algorithm), permitting concurrent access to both ends of the data structure.  All of the same caveats explained with the earlier lock free item apply.

34. Always use the CLR thread pool or IO Completion mechanisms to introduce concurrency.

The CLR’s thread pool is optimized to ensure good scalability across an entire process.  If we end up with multiple custom pools in a process, they will compete for CPUs, over-create threads, and generally lead to scalability degradation.  We already (unfortunately) have this situation with the OS’s thread pool competing with the CLR’s.  We’d prefer not to have three or more.  If you will be in the same process as the CLR, you should use our thread pool too.  We’re doing a lot of work over the next couple releases to improve scalability and introduce new features—including substantially improved throughput (available in the last Orcas CTP)—so if it still doesn’t suit your purposes we would certainly like to hear from you.

Blocking

35. Document latency expectations for your users.

We haven’t yet come up with a consistent way to describe the performance characteristics of managed APIs.  When writing concurrent software, however, it’s frequently very important for developers to understand and reason about the performance of the dependencies they choose to take.  This includes things like knowing the probability of blocking—and therefore whether to try and mask latency by transferring work to a separate thread, overlapping IO, and so on—as well as the compute and memory intensiveness of the internal operations.  Please make an effort to document such things.  Incremental and steady improvements are important in this area.

36. Use the Asynchronous Programming Model (APM) to supply async versions of blocking APIs.

Particularly if you are building a feature that mimics existing Win32 IO APIs or uses such APIs heavily, you should also consider exposing the built-in asynchronous nature of IO on Windows.  For example, file and network IO is highly asynchronous in the OS; if you know your API will spend any measurable portion of its execution time blocked waiting for IO, the same customers who use asynchronous file IO APIs will want some way to turn that into asynchronous IO.  The only way they can do that is if you go the extra step and provide an Asynchronous Programming Model (APM) version of your API.

Details on precisely how to implement the APM are available in Cwalina, 05.  It involves adding ‘IAsyncResult BeginXX(params, AsyncCallback, object)’ and ‘rettype EndXX(IAsyncResult)’ APIs for your ‘rettype XX(params)’ method.  As an example, consider System.IO.Stream:

int Read(byte[] buffer, int offset, int count);
IAsyncResult BeginRead(byte[] buffer, int offset, int count,
                                   AsyncCallback callback, object state);
int EndRead(IAsyncResult asyncResult);

A good hard-and-fast rule is that if you use an API that offers an asynchronous version, then you too should offer an asynchronous version (and so on, recursively up the call stack).
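As a sketch of what "and so on, recursively up the call stack" means in the simplest case, here is a wrapper whose only blocking operation is a Stream read, so its Begin/End pair can forward straight to the stream's.  ChunkReader and its method names are hypothetical.  The shared buffer means only one read may be outstanding at a time, and decoding each chunk as UTF-8 on its own ignores characters split across chunk boundaries; a real implementation would have to deal with both.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical wrapper: exposes both a blocking call and its APM pair,
// forwarding the asynchronous work to the underlying Stream.
public sealed class ChunkReader {
    private readonly Stream m_stream;
    private readonly byte[] m_buffer = new byte[4096];

    public ChunkReader(Stream stream) { m_stream = stream; }

    // Synchronous version: may block on IO.
    public string ReadChunk() {
        int n = m_stream.Read(m_buffer, 0, m_buffer.Length);
        return Encoding.UTF8.GetString(m_buffer, 0, n);
    }

    // Asynchronous version: hands back the stream's own IAsyncResult.
    public IAsyncResult BeginReadChunk(AsyncCallback callback, object state) {
        return m_stream.BeginRead(m_buffer, 0, m_buffer.Length, callback, state);
    }

    // Completes the operation started by BeginReadChunk and decodes the bytes read.
    public string EndReadChunk(IAsyncResult asyncResult) {
        int n = m_stream.EndRead(asyncResult);
        return Encoding.UTF8.GetString(m_buffer, 0, n);
    }
}
```

A caller passes a callback to BeginReadChunk and calls EndReadChunk from inside it, exactly as they would with BeginRead and EndRead.  APIs that do more than a single underlying IO per call need their own IAsyncResult implementation, which is where the guidance in Cwalina, 05 comes in.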

This is very important to many app developers who need to tightly control the amount of concurrency on the machine.  Having lots of IO happening asynchronously can permit operations to overlap in ways they couldn’t otherwise, therefore improving the throughput at which the work is retired.  IO Completion Ports, for example, allow highly scalable asynchronous IO without having to introduce additional threads.  There is simply no way to build a robust and scalable server program without them.  If the library doesn’t expose this capability, customers are left with a clumsy design: they have to manually marshal work to a thread pool thread—or one of their own—to mask the latency, and then rendezvous with it later on.  And this doesn’t work at all for massive numbers of in-flight IO requests.  Or even worse, users are forced to create, maintain, and use their own incarnations of existing library APIs.

37. Always block using one of these existing APIs: lock acquisition, WaitHandle.WaitOne, WaitAny, WaitAll, Thread.Sleep, or Thread.Join.

The CLR doesn’t block in a straightforward manner.  As noted earlier, we use blocking as an opportunity to run the message loop on STA threads.  We also call out to the host to give it a chance to switch fibers or perform any other sort of book-keeping.  This is required to ensure good CPU utilization and to achieve the goal of having all CPUs constantly busy on an MP machine, instead of the alternative of wasting potential execution time by letting threads block.  P/Invoking or COM interoping to a blocking API completely bypasses this machinery, and we are then at the mercy of that API’s implementation.  Aside from the thread switching problems, if this API blocks but doesn’t pump messages on an STA, for instance, we may end up in a cross-apartment deadlock, among other problems.

III. References

[Brumme, 03]  Brumme, C.  AppDomains (“application domains”).  Blog article.  June 2003.
[Cwalina, 05]  Cwalina, K., Abrams, B.  Framework design guidelines: Conventions, idioms, and patterns for reusable .NET libraries.  Addison-Wesley Professional.  September 2005.
[Duffy, 05]  Duffy, J.  Atomicity and asynchronous exceptions.  Blog article.  March 2005.
[Duffy, 06]  Duffy, J.  Broken variants of double-checked locking.  Blog article.  January 2006.
[Duffy, 06b]  Duffy, J.  No more hangs: Advanced techniques to avoid and detect deadlocks in .NET apps.  MSDN Magazine.  April 2006.
[Duffy, 06c]  Duffy, J.  Application responsiveness: Using concurrency to enhance user experiences.  Dr. Dobb’s Journal.  September 2006.
[Michael, 96]  Michael, M., Scott, M.  Simple, Fast, and Practical Non-blocking and Blocking Concurrent Queue Algorithms.  PODC’96.
[Morrison, 05]  Morrison, V.  Understand the impact of low-lock techniques in multithreaded apps.  MSDN Magazine.  October 2005.
[Olukotun, 05]  Olukotun, K., Hammond, L.  The future of microprocessors.  ACM queue, vol. 3, no. 7.  September 2005.
[Sutter, 05]  Sutter, H., Larus, J.  Software and the concurrency revolution.  ACM queue, vol. 3, no. 7.  September 2005.

10/26/2006 1:05:12 PM (Pacific Daylight Time, UTC-07:00)  #   

 Saturday, October 21, 2006

As many of you know, I'm right in the middle of writing a new book on concurrency. I'm curious what people would like to see covered and in what proportions.

If you had $100 to spend as you see fit across any number of the following concurrency topics, how would you distribute it?

(0) Parallel algorithms (Comp Sci., technology agnostic)
(1) Architecting large scale concurrent programs
(2) Windows concurrency internals
(3) CLR concurrency internals
(4) Windows (Win32) concurrency API best practices
(5) CLR concurrency API best practices
(6) Client-side concurrency
(7) Server-side concurrency
(8) Reusable concurrency primitives (building and using them)
(9) Performance and scalability
(A) Debugging

I know there's some overlap and that each isn't entirely orthogonal, but that's okay. If there's something missing, feel free to add it to your response and allocate some funds for it.

I'm looking forward to seeing what y'all think!

10/21/2006 2:24:14 PM (Pacific Daylight Time, UTC-07:00)  #   

 Tuesday, October 17, 2006

The CLR's approach to monitor acquisition (i.e. Monitor.Enter and Monitor.Exit) during shutdown is very different from native CRITICAL_SECTIONs and mutexes (as described in my last post). In particular, the CLR does not ensure requests to acquire monitors on the shutdown path succeed, preferring instead to cope with the risk of deadlock rather than the risk of broken state invariants.

Managed code is run during orderly shutdowns in two places: the AppDomain.ProcessExit event and inside the Finalize method for all finalizable objects in the heap. (The term "orderly shutdown" is used to distinguish an Environment.Exit from a P/Invoke to kernel32!TerminateProcess, for instance.) Just as with the example described for native code, threads can be suspended while they hold arbitrary locks and have partially mutated state to the point where invariants do not hold any longer. Instead of permitting the shutdown code to observe this state--possibly causing corruption or unhandled exceptions on the finalizer thread--the CLR treats lock acquisitions as it normally does.

If a lock was orphaned in the process of stopping all running threads, then, the shutdown code path will fail to acquire the lock. If these acquisitions are done with non-timeout (or long timeout) acquires, a hang will ensue. To cope with this (and any other sort of hang that might happen), the CLR anoints a watchdog thread to keep an eye on the finalizer thread. Although configurable, by default the CLR will let finalizers run for 2 seconds before becoming impatient; if this timeout is exceeded, the finalizer thread is stopped, and shutdown continues without draining the rest of the finalizer queue.
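If you do need to take a lock on the shutdown path, a short timed acquire at least turns a potential hang into a skipped step.  The sketch below uses Monitor.TryEnter with a timeout; ShutdownWork and TryRunUnderLock are made-up names, and the right timeout is an assumption you would have to pick yourself, keeping it well under the watchdog's budget.

```csharp
using System;
using System.Threading;

// Hypothetical helper for ProcessExit handlers and finalizers.
public static class ShutdownWork {
    // Runs 'work' under 'gate' if the lock can be taken within timeoutMs.
    // Returns false, without running 'work', if the lock looks orphaned or busy.
    public static bool TryRunUnderLock(object gate, int timeoutMs, Action work) {
        if (!Monitor.TryEnter(gate, timeoutMs)) {
            return false;  // Give up rather than hang the shutdown thread.
        }
        try {
            work();
            return true;
        } finally {
            Monitor.Exit(gate);
        }
    }
}
```

A ProcessExit handler might call ShutdownWork.TryRunUnderLock(s_gate, 250, FlushBuffers), where s_gate and FlushBuffers stand in for your own lock object and cleanup routine.  Keep in mind that acquiring the lock says nothing about whether the state it protects is still consistent; it just means nobody was holding it at the time.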

This is typically not horrible since many finalizers are meant to clean up intra-process state that Windows will clean up automatically anyway. This covers things like file HANDLEs. But it does mean that any additional logic won't be run, like flushing file write-buffers. And for any cross-process state, you're screwed and had better have a fail-safe plan in place, like detecting corrupt machine-wide state and repairing upon the next program restart. (For what it's worth, DLL_PROCESS_DETACH notifications aren't run in all process exits either, so this really is not any worse than what you have with native code today.)

AppDomain unloads are very different beasts. Any reliability-critical code that will run as part of unload (CERs, critical finalizers, and generally any Cer.Success/Consistency.WillNotCorruptState methods) should strictly only ever acquire locks that are always dealt with in a reliable manner throughout the code-base. That statement is actually a little too strong. In reality, either (1) locks must never be orphaned (aside from process exit) or (2) the associated broken state invariants that will occur (e.g. in the face of asynchronous exceptions) can be tolerated.

Unfortunately, we don't give you access to Monitor.ReliableEnter (the BCL team gets to use it, though, as it's internal to mscorlib), which means almost nobody is equipped to do (1) today. Without it, it's impossible to write code that will reliably release a monitor in the face of possible asynchronous thread aborts and out of memory exceptions. Only a very tiny fraction of the BCL actually deals with locks in such a strictly reliable manner, so as a general rule of thumb very little of it can acquire and release locks while executing reliability-critical code without the risk of deadlock. Hosts will of course use policy to escalate to rude AppDomain unloads in the face of hangs, much like the CLR does by default for process exit.

(Note: Thanks to Jan Kotas--an SDE on the CLR team--for noticing that I confused AppDomain unloads with process exit in my last post, in addition to pointing out that appearances are deceiving: the multi-threaded CRT can actually suffer from the sort of shutdown problems outlined in the last post.)

10/17/2006 12:42:42 PM (Pacific Daylight Time, UTC-07:00)  #   

 Saturday, October 14, 2006

When a Windows process shuts down, one of the very first things to happen is the killing of all but one thread. This sole remaining thread is then responsible for performing shutdown duties, both in kernel and in user mode, including executing the appropriate DLL_PROCESS_DETACH notifications for the DLLs loaded in the process. A great treatise on shutdown and the associated subtleties can be found on, of course, Chris Brumme’s weblog.

It’s entirely possible that at least one of those threads was executing under the protection of one or more critical sections when the shutdown was initiated. Since threads are killed in a fairly hostile manner (not like, say, asynchronous thread aborts which are at least a little less rude, even the so-called rude version of a thread abort), these critical sections will have been left in an acquired state. And any associated program state is apt to be left very inconsistent indeed. Worse, you might imagine that if the shutdown thread later needed to acquire one of those orphaned critical sections, the shutdown process would deadlock.

Although that’s intuitively what you may expect to occur, the OS actually does something a little funny during shutdown to avoid this problem. It effectively ignores calls to kernel32!EnterCriticalSection and kernel32!LeaveCriticalSection. A call to enter a CRITICAL_SECTION will first check to see if it's owned by another thread and, if it is, the section is automatically re-initialized before acquiring it. The result? If one of the previously killed threads, t0, held on to critical section A, for instance, and had partially modified some state protected by it just before the shutdown began, then the shutdown thread, t1, is permitted to freely “acquire” critical section A too, even though it was found as being officially owned by t0.

This means that code running during shutdown must tolerate any corrupt state that may have been left behind as a result. For obvious reasons, this is quite difficult. It's especially difficult if you write some code that somebody believes they can call during shutdown without you having gone through that thought exercise. The multi-threaded CRT uses locks internally for malloc/free, for instance, and reportedly cannot reliably tolerate process exit code-paths, which means you can't even safely rely on memory allocation and freeing during process exit without spurious AVs, heap corruption, and other bad things. Other services are obviously apt to suffer from similar problems, particularly if they consist of arbitrary application logic. You simply can't rely on invariant safe-points holding at lock boundaries when a shutdown is in progress.

Mutexes also enjoy this same "weakening" behavior, at least on Windows XP. This policy doesn’t, however, apply to waits on other kernel synchronization objects, like events and semaphores. If you rely on these during shutdown you’re just asking for a deadlock. Actually if you are regularly using any sort of synchronization in your DllMain—including acquiring critical sections and mutexes—you’re asking for loads of trouble. Shutdown callbacks run under the protection of the OS loader lock, demanding extreme care, but that’s another topic altogether.

Here is a sample VC++ program that shows off this behavior. We declare a bunch of code in the DllMain: process attach initializes a CRITICAL_SECTION and a mutex, and then detach attempts to acquire them. We then define an exported function, GetAndBlock, that acquires the synchronization objects and sleeps for a long time:

#include <stdio.h>
#include <windows.h>

CRITICAL_SECTION g_cs;
HANDLE g_mutex;

BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD fdwReason, LPVOID lpReserved)
{
    switch (fdwReason) {
        case DLL_PROCESS_ATTACH:
            InitializeCriticalSection(&g_cs);
            g_mutex = CreateMutex(NULL, FALSE, NULL);
            break;
        case DLL_PROCESS_DETACH:
            printf("%x: Acquiring g_cs during shutdown...", GetCurrentThreadId());
            EnterCriticalSection(&g_cs);
            printf("success.\r\n");

            printf("%x: Acquiring g_mutex during shutdown...", GetCurrentThreadId());
            WaitForSingleObject(g_mutex, INFINITE);
            printf("success.\r\n");

            DeleteCriticalSection(&g_cs);
            CloseHandle(g_mutex);
            break;
    }

    return TRUE;
}

__declspec(dllexport) DWORD WINAPI GetAndBlock(LPVOID lpParameter) {
    // Acquire the mutual exclusion locks.
    EnterCriticalSection(&g_cs);
    WaitForSingleObject(g_mutex, INFINITE);

    printf("%x: g_cs and g_mutex acquired.\r\n", GetCurrentThreadId());

    // And just wait for a little while...
    SleepEx(25000, TRUE);

    return 0;
}

And finally we have an EXE that just invokes GetAndBlock and initiates a process shutdown on separate threads. The result is that the shutdown thread acquires the synchronization objects which the GetAndBlock thread currently has ownership of. Post Windows 95, the shutdown thread is always the thread that initiated the shutdown, whereas before that it was (seemingly) chosen at random; so when run on a modern OS at least, this sample is guaranteed to demonstrate the desired behavior:

#include <windows.h>

DWORD WINAPI GetAndBlock(LPVOID lpParameter);

int main() {
    HANDLE hT1 = CreateThread(NULL, 0, &GetAndBlock, NULL, 0, NULL);
    SleepEx(100, TRUE);
    ExitProcess(0);
}

The results of running it are rather uneventful:

C:\...>shutdown.exe
664: g_cs and g_mutex acquired.
d18: Acquiring g_cs during shutdown...success.
d18: Acquiring g_mutex during shutdown...success.

As expected, no hangs occur. If you want to see what happens when a hang does happen, just replace CreateMutex with CreateEvent. It's not pretty.

Update 10/17/2006: Thanks to Jan Kotas for pointing out that the multi-threaded CRT is actually not safe from the sort of issues I talk about in this article. I wasn't able to get it to happen in a test program--one of the great things about repro'ing race conditions :)--but have fixed that part up.

10/14/2006 10:30:11 PM (Pacific Daylight Time, UTC-07:00)  #   

 Tuesday, October 10, 2006

It's probably old news on the street, but I just (happily) received my copy of the new edition of the dragon book yesterday. Yes, after 20 years, there is now a 2nd edition of the cult classic, Compilers: Principles, Techniques, and Tools. I have to admit I like the old cover much better--what can I say, I'm a cheesy cartoons over cheesy 3d models kind of guy--but the fact that there are 3 entirely new chapters on topics near and dear to my heart more than makes up for it: one on runtimes, and two on parallelism.
10/10/2006 8:45:08 AM (Pacific Daylight Time, UTC-07:00)  #   

 Tuesday, October 03, 2006

I am often confronted with the question of whether concurrency programming models that employ shared memory are evil. I was asked this question directly on the concurrency panel at JAOO’06 earlier this week, for instance, and STM makes a big bet that such models are tenable.

Without shared memory, it’s tempting to think that traditional concurrency problems go away, as if by magic. If no two pieces of code are simultaneously working on the same location in memory, for instance, there are (seemingly) no race conditions or deadlocks. Most people believe this, and it (on the surface) seems somewhat reasonable. Until you realize that it’s fundamentally flawed.

Shared memory systems are just an abstraction in which data can be named by its virtual memory address. In fact, one could argue that it’s an optimization—that the same sort of systems could be built by mapping virtual memory addresses (at a logical level) to some other location (at a physical level) using an algorithm that doesn't rely on page-tables, TLBs, and so on. Distributed RPC systems in the past have tried this very thing: to map object references to data residing on far-away nodes, and have mostly failed in the process. I’m not trying to convince you that alternative mapping techniques are a good thing, only that abstractly speaking at least, all of the same concurrency control problems will arise in systems that exhibit this fundamental property. Interestingly, shared memory systems have turned into tiny distributed systems with complex cache coherency logic anyway, so one has to wonder where the boundary between shared memory and message passing really lies.

There is a fundamental, undeniable law here:

Any system in which two concurrent agents can name the same piece of data must also exhibit the standard problems of concurrency: broken serializability, race conditions, deadlocks, livelocks, lost event notifications, and so on. Concurrency control is simply a requirement if correctness is desired.

So in reality, the real question at hand should be, would a system in which every concurrent agent operates on its own, completely isolated piece of data be more attractive? I personally think that’s far-fetched and unrealistic. Systems with shared data need to have shared data; it's a property of the system being modeled. Even with isolated data, concurrency control would be required if, say, a central copy is rendezvoused with periodically (which, by the way, is the only way I can see such a system remaining correct). And then you have to wonder what copying buys you. It certainly costs you. Data locality is crucial to achieving adequate performance in most low-to-mid-level systems software. Yet copy-on-send message passing systems throw this out the window entirely. I refuse to believe that this will ever be the dominant model of fine-grained concurrency, at least on the current hardware architectures available from Intel and AMD. And certainly not without a whole lot more research and perhaps hardware support.

A distributed system in which many simultaneous clients might access the same piece of data on the server has all the same issues. AJAX systems, for instance, easily lull the author into a false sense of security. But, unfortunately(?), a transaction is a transaction, and if concurrency control isn’t in place, such systems are effectively executing without any isolation or serialization guarantees whatsoever—I just read an article in the latest DDJ where this was explained. I'm surprised a dedicated article actually needs to point this out: concurrent access to data under any other name is still concurrent access to data. And of course, once you start to employ concurrency control, you are susceptible to deadlocks and so on—unless you have a system that can transparently resolve them.
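Here is a small sketch of the problem with no shared memory in sight.  Two "clients" talk to a "server" purely by exchanging copies of a value, the interleaving is written out by hand on a single thread, and an update still gets lost.  CounterServer, LostUpdateDemo, and the version-number check are all made up for illustration; the version check is just one simple form of optimistic concurrency control.

```csharp
using System;

// Hypothetical server that owns a counter. Clients never touch its memory;
// they only receive copies of the value and send new values back.
public sealed class CounterServer {
    private int m_value;
    private int m_version;

    public void Read(out int value, out int version) {
        value = m_value;
        version = m_version;
    }

    // Blind write: the last writer silently wins.
    public void Write(int value) {
        m_value = value;
        m_version++;
    }

    // Optimistic concurrency control: rejected if the data changed since the read.
    public bool WriteIfUnchanged(int value, int expectedVersion) {
        if (expectedVersion != m_version) return false;
        m_value = value;
        m_version++;
        return true;
    }
}

public static class LostUpdateDemo {
    // Both clients read 0 and then each writes back "what I read, plus one".
    public static int TwoBlindIncrements() {
        CounterServer server = new CounterServer();
        int a, b, va, vb;
        server.Read(out a, out va);  // Client A sees 0.
        server.Read(out b, out vb);  // Client B sees 0.
        server.Write(a + 1);         // A stores 1.
        server.Write(b + 1);         // B also stores 1; A's increment is lost.
        server.Read(out a, out va);
        return a;                    // 1, even though two increments were requested.
    }
}
```

With WriteIfUnchanged, the second client's write is rejected and it has to re-read and retry, which is exactly the kind of concurrency control (and the associated retry and conflict handling) that copying the data was supposed to make unnecessary.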

Interesting research has been done recently by MSR on static verification to prove the absence of sharing (across processes), called Software Isolated Processes (SIPs), building on the type safe, verifiable subset of IL. STM of course also builds on top of the shared memory programming model; but, although threads can name the same location in memory, this is completely hidden, and concurrency control is still employed in the implementation where necessary. I believe these systems are promising. They also have the benefit of building on the same foundational memory performance equations that software developers are used to relying on today.

10/3/2006 4:13:32 PM (Pacific Daylight Time, UTC-07:00)  #   

 Monday, October 02, 2006

Here are the slides for my JAOO'06 talk Concurrency and the composition of frameworks:

JAOO-06__ConcurrencyAndFrameworks.ppt (2.06 MB)

10/2/2006 7:16:10 AM (Pacific Daylight Time, UTC-07:00)  #   

 Friday, September 22, 2006

An article I wrote (seemingly ages ago) just appeared in the September issue of Dr. Dobb's journal:

Application Responsiveness: Using concurrency to enhance user experiences
Thanks to recent innovation in both hardware graphics processors and client-side development frameworks, GUIs for Windows applications have become more and more visually stunning over time. But throughout the evolution of such frameworks, one problem hasn't gone away—poor responsiveness. Studies show that positive user experiences are linked to application responsiveness and, conversely, that frustrating experiences are often caused by poor responsiveness. More often than not, this lack of responsiveness is due to a series of subtle (and sometimes accidental) design choices made during development. In this article, I examine the root of the responsiveness problem, and then suggest some best practices for eliminating it.

My article only touches on some important issues that are described in detail elsewhere.  Here are the references I used:

  1. D. Duis, J. Johnson. Improving User Interface Responsiveness Despite Performance Limitations. Proc. IEEE Computer Society Intl. Conference. February 1990.
  2. J. Duffy. No More Hangs: Techniques for Avoiding and Detecting Deadlocks. MSDN Magazine. April 2006.
  3. G. H. Forman. Obtaining Responsiveness in Resource-Variable Environments. PhD Dissertation, University of Washington. 1998.
  4. I. Griffiths. Windows Forms: Give Your .NET-based Applications a Fast and Responsive UI with Multiple Threads. MSDN Magazine. February 2003.
  5. N. Kramer. Threading Models (Windows Presentation Foundation). Weblog essay. June 2005.
  6. G. Maffeo, P. Silwowicz. Win32 I/O Cancellation in Windows Vista. MSDN. September 2005.
  7. V. Morrison. Concurrency: What Every Dev Must Know About Multithreaded Apps. MSDN Magazine. August 2005.
  8. M. E. Russinovich, D. A. Solomon. Microsoft Windows Internals. ISBN 0-735-61917-4, MS Press. December 2004.
  9. C. Sells. Safe, Simple Multithreading in Windows Forms, Part 1. MSDN. June 2002.
  10. C. Sells, I. Griffiths. Programming Windows Presentation Foundation. ISBN 0-596-10113-9, O'Reilly. September 2005.

Thanks go to Jeff Richter, Nick Kramer, Alessandro Catorcini (*), and Vance Morrison for reviewing early drafts.  Enjoy.

(*) Alessandro, man, you need a blog! ;)

9/22/2006 11:27:36 AM (Pacific Daylight Time, UTC-07:00)  #   

 
