RSS 2.0

Personal Info:

Joe Send mail to the author(s) leads the architecture of an experimental OS's developer platform, where he is also chief architect of its programming language. His current mission is to enable writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe founded the Parallel Extensions to .NET project. He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife, writing books, writing music, studying music theory & mathematics, and doing anything involving food & wine.

My books

My music

Disclaimer:
The content of this site are my own personal opinions and do not represent my employer's view in anyway.

© 2012, Joe Duffy

 
 Wednesday, July 16, 2008

The adjacent release/acquire problem is well known.  As an example, given the program:

P0          P1
==========  ==========
X = 1;      Y = 1;
R0 = Y;     R1 = X;

The outcome R0 == R1 == 0 is entirely legal.  This could happen because writes are delayed in processor store buffers; so before R0 = Y retires, the store X = 1 may have not even left the local processor P0; similarly, before R1 = X retires, the store Y = 1 may not have even left processor P1.  It is as if the program was written as follows:

P0          P1
==========  ==========
R0 = Y;     R1 = X;
X = 1;      Y = 1;

The standard way to fix this is to emit a full fence:

P0          P1
==========  ==========
X = 1;      Y = 1;
XCHG;       XCHG;
R0 = Y;     R1 = X;

But here is one that may be a little surprising:

P0          P1
==========  ==========
X = 1;      Y = 1;
R0 = X;     R2 = Y;
R1 = Y;     R3 = X;

Assuming X and Y are "volatile" to the compiler, is R1 == R3 == 0 a possible outcome in this program?

Based on the rules we provide for .NET's MM, and Intel's whitepaper, one could reasonably argue "no".  The reasoning goes as follows.  True data dependence prohibits R0 = X from moving before X = 1, and the no load/load reordering rule (e.g. Intel's Rule 2.1) prohibits R1 = Y from moving before R0 = X.  Thus, transitively, R1 = Y may not move before X = 1.  Similarly, true data dependence prohibits R2 = Y from moving before Y = 1, and the no load/load reordering rule prohibits R3 = X from moving before R2 = Y, and therefore R3 = X may not move before Y = 1.  Given this reasoning, the individual instruction streams cannot be reordered in place.  And therefore, no interleaving of them will yield R1 == R3 == 0, because either X = 1 or Y = 1 must happen first, and both R1 = Y and R3 = X must come later.  Hence at least one of R1 or R3 will observe a value of 1.

Sadly, this reasoning is incorrect.  Rule 2.4 in the Intel whitepaper states that "intra-processor forwarding is allowed."  They even have an innocent example in the paper, but it actually doesn't exhibit load/load reordering.  It does, however, illustrate that stores may be delayed for some time in a write buffer.  Perhaps surprisingly, such intra-processor forwarding of buffered stores is actually permitted to satisfy subsequent loads from that location by the same processor before the store has left the processor.  This can happen even if it means passing intermediate loads from different memory locations!  The result is that load/load reordering is effectively possible under some circumstances.  Loads still physically retire in order of course, but because they may be satisfied by pending writes that other processors cannot yet see, it is as if the original program were written as:

P0          P1
==========  ==========
R1 = Y;     R3 = X;
X = 1;      Y = 1;
R0 = X;     R2 = Y;

The fundamentally contradicts what most people believe about .NET's MM, and indeed, Intel's MM as specified in that whitepaper.  To be fair, the whitepaper actually does call this out, but in a roundabout and misleading fashion.  The text in Rule 2.1, which states that "no loads can be reordered with other loads", is far too strong.

Anytime a little hole in something as fundamental as MM axioms is uncovered, it is cause for concern.  So I found this discovery deeply disturbing.  Many abstractions and theorems are proved with the assumption that the MM is rock solid.  I know a lot of code I have written relies on such proofs.

That said, I've been racking my brain (and in fact was having nightmares about it last evening) trying to uncover a case where this is worse than the existing release/acquire reordering issue that I opened this post with.  Everything I come up with is saved at the last minute by rules 2.1 (for stores) and 2.5 "stores are transitively visible".  The basic problem is that a processor can get stuck seeing its own written value for some time, during which other processors cannot, but ultimately it doesn't seem to matter because the buffer will eventually be flushed.  Then any intermediary values that the destination may have held while that processor was stuck will have been overwritten anyway, so the outcome should be explainable (albeit racey).  I'm still thinking hard about this.

7/16/2008 7:43:00 PM (Pacific Daylight Time, UTC-07:00)  #   

 Monday, June 23, 2008

I just submitted the final manuscript for Concurrent Programming on Windows to Addison-Wesley.

This marks the exciting transition from things happening on my timetable to things happening on AW’s timetable.

A lot has changed for me since I decided to write this book.  You might be surprised to hear that I actually signed the contract for it on November 29th, 2005.  That’s 2 years and 7 months ago.  It’s almost unbelievable that this book took so long to finish.  By comparison, my first one took just a little over a year.  The road has been a long one, full of personal ups and downs, but it’s no doubt been an exciting trip.

I’ve been at Microsoft the whole time.  At the outset, I was a PM on the CLR Team, hacking on software transactional memory and PLINQ as an evening activity.  Then I transitioned to doing it full time, but still as a PM.  Then I joined the Parallel Computing team as the dev for PLINQ.  Then I kicked off the whole Parallel Extensions effort (which is 20 members and growing strong), became the dev lead, and here I am today.  It’s pretty strange to say this, but without the book very little of that would have happened.  I can’t think of a better way to get entrenched in a technology, experience the breadth, and force yourself to learn every little intricate and often enlightening detail.  If you can afford the impact to mental health and personal relationships ;), it’s an activity I highly recommend to anybody wanting to master a technology...  not that one can actually master the concurrency beast, but y’know...

In retrospect, it should have taken a year.  Maybe next time.

The good news is that you will have the book in your hands soon.  (Well, if you decide to buy a copy, that is.)  If you manage to make it to my PDC 2008 pre-con session, I’m hoping we will have some copies available.  No promises, since I missed my final deadline by a couple weeks, but my fingers are crossed.

Oh yeah, and you can expect me to pick up blogging again now that I’ll have some free time.  Hmm, free time?  What will I do with myself!

Laissez les bon temps roulez!

6/23/2008 12:34:18 AM (Pacific Daylight Time, UTC-07:00)  #   

 Friday, June 13, 2008

We had an interesting debate at a Parallel Extensions design meeting yesterday, where I tried to convince everybody that a full fence on SpinLock exit is not a requirement.  We currently offer an Exit(bool) overload that accepts a flushReleaseWrite argument.  This merely changes the lock release from

m_state = 0;

to

Interlocked.Exchange(ref m_state, 0);

The main purpose of this is to announce “availability” of the locks to other processors.  More specifically, it ensures that before the current processor is able to turn around and reacquire the lock in its own private cache, that other processors at least have the opportunity to see the write.  This is a fairness optimization, and avoiding the CAS on release halves the number of CAS operations necessary (which are expensive), so we would generally like to avoid superflous ones.  It turns out you could easily do this without our help.  Instead of

slock.Exit(true);

you could say

slock.Exit();
Thread.MemoryBarrier();

Most of the debate about whether the default Exit should use a fence centered around confusion over the strength of volatile vs. a full fence.  For example, the C# documentation for volatile is highly misleading (http://msdn.microsoft.com/en-us/library/x13ttww7(VS.71).aspx):

The volatile modifier is usually used for a field that is accessed by multiple threads without using the lock statement to serialize access. Using the volatile modifier ensures that one thread retrieves the most up-to-date value written by another thread.

The confusion is over the “ensures that one thread receives the most up-to-date value written by another thread” part.  Technically this is somewhat-accurate, but is worded in a very funny and misleading way.  To see why, let’s take a step back and consider what volatile actually means in the CLR’s memory model (MM) for a moment, to set context.  Note that I did my best to concisely summarize the MM here: http://www.bluebytesoftware.com/blog/2007/11/10/CLR20MemoryModel.aspx.

Volatile on loads means ACQUIRE, no more, no less.  (There are additional compiler optimization restrictions, of course, like not allowing hoisting outside of loops, but let’s focus on the MM aspects for now.)  The standard definition of ACQUIRE is that subsequent memory operations may not move before the ACQUIRE instruction; e.g. given { ld.acq X, ld Y }, the ld Y cannot occur before ld.acq X.  However, previous memory operations can certainly move after it; e.g. given { ld X, ld.acq Y }, the ld.acq Y can indeed occur before the ld X.  The only processor Microsoft .NET code currently runs on for which this actually occurs is IA64, but this is a notable area where CLR’s MM is weaker than most machines.  Next, all stores on .NET are RELEASE (regardless of volatile, i.e. volatile is a no-op in terms of jitted code).  The standard definition of RELEASE is that previous memory operations may not move after a RELEASE operation; e.g. given { st X, st.rel Y }, the st.rel Y cannot occur before st X.  However, subsequent memory operations can indeed move before it; e.g. given { st.rel X, ld Y }, the ld Y can move before st.rel X.  (I used a load since .NET stores are all release.)  Note that RELEASe is the opposite of ACQUIRE: you can think of an acquire as a one-way fence that prohibits passes downward, and a release as a one-way fence that prohibits passes upward.  A full fence prohibits both (lock acquire, XCHG, MB, etc).

Note one very interesting thing in this discussion: a release followed by an acquire, given the above rules, does not prohibit movement of the instructions with respect to one another!  Given { st.rel X, ld.acq Y }, even though they are both volatile (i.e. acquire and release), so long as X!=Y, it is perfectly legal for the ld.acq Y to move before st.rel X.  We aren’t limited to single instructions either, e.g. { st.rel X, ld.acq A, ld.acq B, ld.acq C }, all three loads (A, B, C) may indeed happen before the X.  This occurs with regularity in practice, on X86, X64, and IA64, because of store buffering.  It would just be too costly to hold up loads until a store has reached all processors.  Superscalar execution is meant to hide such latencies.

(As an aside, many people wonder about the difference between loads and stores of variables marked as volatile and calls to Thread.VolatileRead and Thread.VolatileWrite.  The difference is that the former APIs are implemented stronger than the jitted code: they achieve acquire/release semantics by emitting full fences on the right side.  The APIs are more expensive to call too, but at least allow you to decide on a callsite-by-callsite basis which individual loads and stores need the MM guarantees.)

I have to admit the store buffer problem is mostly theoretical.  It rarely comes up in practice.  That said, on a system which permits load reordering, imagine:

Initially: X = Y = 0

T0                       T1
X = 5; // st.rel         while (X == 0) ; // ld.acq
while (Y == 0) ; // ld   X = 0; // st.rel
A = X; // ld.acq         Y = 5; // st.rel

After execution, is it possible that A == 5?

If the read of Y is non-volatile on T0 (which would be bad because a compiler may hoist it out of the loop, but ignore compilers for a moment), then the fact that the subsequent read of X is volatile does not save us from a reordering leading to A == 5.  This is the { ld, ld.acq } case described earlier.  Why might this physically occur?  Well, it won’t happen on X86 and X64 because loads are not permitted to reorder.  However!!  IA64 permits non-acquire loads (non-volatile) to reorder, and so the A = X may actually be satisfied out of the write buffer before the store even leaves the processor.  It’s as though the program became:

T0                       T1
X = 5; // st.rel         while (X == 0) ; // ld.acq
A = X; // ld.acq         X = 0; // st.rel
while (Y == 0) ; // ld   Y = 5; // st.rel

Whoops!  This should make it apparent that this outcome is indeed a real possibility.  And clearly it may cause bugs.

Note 6/13/08: Eric pointed out privately that compilers need only respect the CLR MM, and can freely reorder loads.  Thus, this problem may actually arise on non-IA64 machines.  Of course he is entirely correct.  It was silly of me to overlook that.

All that said, let’s get back to the original concern about visibility of writes.  This issue doesn’t even really involve reordering.  Imagine one processor continuously executes a stream of lock acquires and releases, and that the stream goes on indefinitely (perhaps because it’s in a loop):

while (Interlocked.CompareExchange(ref m_state, 1, 0) != 0) ;
m_state = 0;
while (Interlocked.CompareExchange(ref m_state, 1, 0) != 0) ;
m_state = 0;

The Interlocked operation acquires the cache line in X mode.  After it executes, other processors will notice that the lock is taken.  But right away, the processor writes 0 to the line without a fence, and immediately goes on to execute another acquire.  It is highly likely that the line will be marked dirty in the processor’s cache by the time that it acquires it in X mode again, something that the cache coherency system makes very cheap.  In fact, the write of m_state = 0 probably hasn’t left the write buffer yet due to latency.

So before another processor can even see m_state as 0, the processor will have already gotten around to taking the lock again.  Even for volatile loads and stores, there is no MM guarantee that writes will leave the processor immediately; hence the documentaiton earlier is slightly confusing; yes, the processor doing a volatile read will see the “most recent” value, but that “most recent” value (a) may be satisfied out of the local write buffer, and (b) may simply not have the ability to observe writes that occurred in practice due to the above timeliness issue. 

6/13/2008 12:52:45 PM (Pacific Daylight Time, UTC-07:00)  #   

 Thursday, June 05, 2008

We sat down last week with Charles from Channel9 to discuss the new CTP.  Both parts got posted today:

We focus on the new aspects of the stack, incl. the new scheduler and CDS, and also discuss what's changed in PLINQ and TPL.

If you have ideas for future videos, or any feedback/questions, you know where to send 'em.  joedu AT youknowwhere DOT com.

6/5/2008 5:14:47 PM (Pacific Daylight Time, UTC-07:00)  #   

 Monday, June 02, 2008

We just released a new CTP of Parallel Extensions to .NET: get it here.

Some relevant information is up on our team blog:

I'm really excited to see our entire stack finally shipping as one cohesive unit: the data structures we use throughout the implementation exposed publicly (what we now call CDS), a new scheduler built from the ground up, TPL and PLINQ better together, and lots more.  We're still very far from the end of the road, and have plenty of cool stuff still in the works, but we've made a ton of progress in just 6 months since our last CTP.

Have fun and, as always, feedback is much appreciated.

6/2/2008 9:00:26 AM (Pacific Daylight Time, UTC-07:00)  #   

 Thursday, May 29, 2008

PDC'08 is officially on for October 27-30th this year: http://microsoftpdc.com/.

My team will certainly have some really fun stuff to show off, and just glancing at the preliminary list of teaser sessions, it's going to be a blast.

A few of us have teamed up to give a PreCon.  That's code-word for a full day of neverending concurrency goodness.  From http://microsoftpdc.com/agenda/Preconference.aspx:

Concurrent, Multi-core Programming on Windows and .NET
Presenter(s): David Callahan, Joe Duffy, Stephen Toub
The leap from single-core to multi-core technology is altering computing as we know it, enabling increased productivity, powerful energy-efficient performance, and leading-edge advanced computing experiences. The good news is that Windows and .NET offer rich support for threading and synchronization to take advantage of these new platforms. This session, presented by David Callahan, Microsoft distinguished engineer, Joe Duffy, author of “Concurrent Programming on Windows” (Addison-Wesley), and Stephen Toub, program manager lead for the Concurrency Development Platform team at Microsoft, will cover a broad range of topics, including mechanisms to create, coordinate, and synchronize among threads; best practices for concurrent libraries and apps; and techniques for improving scalability, including lock-free algorithms. Focus will be on .NET programming, including the next generation of parallel programming support within the Framework, but Windows internals and C++ nuggets will be discussed too.

About the presenter(s):
David Callahan joined Microsoft in 2005. He is a Distinguished Engineer leading the Parallel Computing Platform Team within Visual Studio® focused on incubating technology for the coming manycore processors. This team is producing exciting new technologies as part of Visual Studio and also driving the Parallel Computing Initiative, a company wide effort to deliver customer value from the power of future high-performance processors. David’s background is in programming languages, parallel programming techniques, and compilation techniques focused on expressing and exploiting concurrency.

Stephen Toub is a Senior Program Manager Lead on the Parallel Computing Platform team at Microsoft, where he spends his days focusing on the next generation of programming models for concurrency. Stephen is also a Contributing Editor for MSDN® Magazine, for which he writes the .NET Matters column, and he’s an avid speaker at conferences like TechEd and DevConnections. Prior to working on the Parallel Computing Platform, Stephen designed and built enterprise applications for companies such as GE, McGraw-Hill, BankOne, and JetBlue. He was a developer for Microsoft Outlook as well as for the Microsoft Office Solution Accelerators. In his spare time, Stephen loves to sing, and he spends as much time as possible with his beautiful wife Tamara.

Joe Duffy leads development for Microsoft's Parallel Extensions to .NET technology, a set of library and runtime technologies for concurrent and parallel computing. He founded the project in 2006 with Parallel Language Integrated Query (aka PLINQ), an innovative declarative parallel query analysis and execution engine. Prior to Parallel Extensions, Joe worked on transactional memory, library and VM support for concurrency in the Common Language Runtime (CLR) team, and has written 3 functional language compilers (Scheme, Common LISP, and Haskell). He has written two books, including Concurrent Programming on Windows (Addison-Wesley, 2008), and in his spare time reads and writes (code and text), plays guitar, and studies music theory.

Be there, or be square.  The slides aren't even finalized yet, so let me know if you have particular topics you'd like to learn more about.

5/29/2008 11:19:11 PM (Pacific Daylight Time, UTC-07:00)  #   

 Friday, May 16, 2008

Counting events and doing something once a certain number have been registered is a highly common pattern that comes up in concurrent programming a lot.  In the olden days, COM ref counting was a clear example of this: multiple threads might share a COM object, call Release when done with it, and hence memory management was much simpler.  GC has alleviated a lot of that, but the problem of deciding when a shared IDisposable resource should be finally Disposed of in .NET is strikingly similar.  And now-a-days, things like CountdownEvent are commonly useful for orchestrating multiple workers (see MSDN Magazine), which (although not evident at first) is based on the same counting principle.

Coding up one-off solutions to all of these is actually pretty simple.  But doing so seems unfashionably ad-hoc, at least to me.  Codifying the pattern can be done in a couple dozen lines of code, so that it can be reused for many purposes.  As an example, here is a reusable Counting<T> class, written in C#, that just invokes action delegate once the count reaches zero:

#pragma warning disable 0420

 

using System;

using System.Threading;

 

public class Counting<T>

{

    private readonly T m_obj;

    private volatile int m_count;

    private readonly Action<T> m_action;

 

    public Counting(T obj, int initialCount, Action<T> action)

    {

        m_obj = obj;

        m_count = initialCount;

        m_action = action;

    }

 

    public int AddRef()

    {

        int c;

        if ((c = Interlocked.Increment(ref m_count)) == 1)

            throw new Exception();

        return c;

    }

 

    public int Release()

    {

        int c;

        if ((c = Interlocked.Decrement(ref m_count)) == 0)

            m_action(m_obj);

        return c;

    }

 

    public T Obj { get { return m_obj; } }

}

Notice I’ve used the IUnknown vocabulary of AddRef and Release.  Old habits die hard.

The CountdownEvent I mentioned earlier is just a simple extension to this basic functionality.  In fact, we don’t need to write another class; it’s merely an instance of Counting<T>, where the T is a ManualResetEvent.  Setters directly use the Counting<T> object’s Release method to register a signal, while waiters can use the WaitOne method on the raw ManualResetEvent itself.  The event will be set once all signals have arrived:

Counting<ManualResetEvent> countingEv = new Counting<ManualResetEvent>(

    new ManualResetEvent(false), N, e => e.Set()

);

 

// Setter:

countingEv.Release();

 

// Waiter:

countingEv.Obj.WaitOne();

(Exposing a traditional Set/Wait interface would of course be nicer, but even then Counting<T> makes the implementation brain-dead simple.)

Similarly, the “who should dispose” problem is easy to solve with Counting<T>.  Say that, instead of setting the event, we actually want to Dispose of some IDisposable object when all threads are done with it:

Counting<ManualResetEvent> ev = new Counting<ManualResetEvent>(

    new ManualResetEvent(false), N, e => e.Dispose()

);

Though this does the trick, we might instead wrap it in a more convenient package:

public class CountingDispose<T> :

        Counting<T>, IDisposable

        where T : IDisposable

{

    public CountingDispose(T obj, int initialCount) :

        base(obj, initialCount, d => d.Dispose()) { }

}

Given this definition, threads can use the CountingDispose<T> object as they would any IDisposable thing.  This facilitates use in C# using blocks.  Only when all threads have called Dispose will Dispose be called on the actual underlying object:

CountingDispose<ManualResetEvent> ev = new CountingDispose<ManualResetEvent>(

    new ManualResetEvent(false), N

);

 

// Some threads wait:

using (ev) {

    … ev.WaitOne(); …

}

 

// Some threads set:

using (ev) {

    … ev.Set(); …

}

I’ve found that the extremely simple Counting<T> idea is a surprisingly powerful one.  It’s fairly extensible too; for example, you clearly may want to run actions at different points in the counting, use clever synchronization to ensure actions run at particular points are processed in-order (useful for progress reporting), to reset the count afterwards, and so on.  It’s way too simple to claim it’s anything terribly amazing, but thought I’d share the idea anyway.

5/16/2008 8:18:59 PM (Pacific Daylight Time, UTC-07:00)  #   

 Friday, March 28, 2008

We take code reviews very seriously in our group.  No checkin is ever made without a peer developer taking a close look.  (Incubation projects are often treated differently than product work, because the loss of agility is intolerable.)  A lot of this is done over email, but if there’s anything that is unclear from just looking at the code, a face to face review is done.  Feedback ranges from consistency (with guidelines and surrounding code), finding errors or overlooked conditions, providing suggestions on how to more clearly write something, comments, etc.; this ensures that our codebase is always of super high quality.

Concurrency adds some complexity to development, and requires special consideration during code reviews.  I thought I’d put some thoughts on paper about what I look for during concurrency-oriented code reviews, in hopes that it will be useful to anybody starting to sink their teeth into concurrency; it may also help you devise your own internal review guidelines.  Most of this advice just comes down to knowing a laundry list of best practices, but a lot of it is also knowing what to look for and where to spend your time during a review.

(A couple years ago I wrote a lengthy “Concurrency and its impact on reusable libraries” essay which provides a lot of the motivation behind what I look for.  It’s up on my blog, http://www.bluebytesoftware.com/blog/2006/10/26/ConcurrencyAndTheImpactOnReusableLibraries.aspx, and (though slightly out of date) I’m revising it for an Appendix in my upcoming book.  If you question why I believe something, chances are that this document will explain my rationale.  And it’s far more complete than this short essay; I’ve only hit the high points here.)

Getting started

I first review the code in a traditional sequential code review fashion.  When doing this, I earmark all state that I see as either “private” (aka isolated) or “shared”.  I then go back and closely review all state that is shared (accessible from many threads) with a fine-tooth comb.  Sometimes I’ll do this during my first pass through, but I usually find it helpful to understand the algorithmic structure of the changes first before fully developing an understanding of the concurrency parts.

Changes to existing code should be reviewed just as carefully (if not more carefully) as new code.  Concurrency behavior is subtle, and it’s very easy to accidentally violate some unchecked assumption the code was previously making.  Liberal use of asserts is therefore very important.  Sadly many of the conditions code assumes are simply unassertable (like “object X isn’t shared”).
I easily spend about 2x the amount of time reviewing the concurrency aspects of the code than usual sequential aspects.  Perhaps more.  This extra time is OK, because the concurrency portion is far smaller (in lines of code) than the sequential portion in most of the code I review.  There are obvious exceptions to this rule, especially since I’m on a team building low-level primitive data types whose sole purpose in life is to be used in concurrent apps.

Shared state and synchronization

Some state, although shared, is immutable (read-only) and can be safely shared and read from concurrently.  Often this is by construction (e.g. immutable value types) but sometimes this is by loose convention (e.g. a data structure is immutable for some period of time, simply by virtue of no threads actively writing to it).  Both should be clearly documented in the code.
Once mutable shared state is identified, I look for two major things:

  1. When does it become shared, i.e. publicized, and what is the protocol for the transfer of ownership?  Is it done cleanly?  And is it well documented?
  2. When does it once again become private again, if ever?  And is this documented too?

Ideally all shared state would have clean ownership transitions.  Any state that is disposable necessarily must have a point at the end of its life where it has a single owner, so it can be safely disposed of (unless ref-counting is used).  But for most state the line will be extremely blurry and unenforced.  Comments should be used to clarify, in gory detail.  I also tend to prefix names of variables that refer to shared objects with the word ‘shared’ itself, so that they jump out.

Many, many bugs arise from some code publicizing some state, sometimes by accident, and then continuing to treat it as though it is private.  It is also sometimes tricky to determine this precisely, since sharing can be modal.  A list data structure may be shared in some contexts but not others.  Knowing what its sharing policy is requires transitive knowledge of callers.  Building up this level of global understanding can involve a fair bit of simply sitting back and reading and rereading the code over and over again.

Once the policy around sharing a piece of state is known, it is crucial to understand the intended synchronization policy for that data.  Is it protected by a fine-grained monitor?  Is it manipulated and read in a lock-free way?  And so on.  And once the intended policy is known, is the actual policy implemented what was intended?

While this part is extremely important, by the way, I have to admit that I feel this aspect tends to overshadow other things in conversation.  This is probably because it’s the most obvious thing to look for.  Sadly the world of concurrency is far more subtle than this.  I’ve honestly found more bugs resulting from failing to identify shared state properly than resulting from failing to implement the synchronization logic itself properly.  Your mileage will of course vary.

Locks

I treat lock-based code and lock-free as two entirely separate beasts.  I spend about 5x the time reviewing lock-free code when compared to lock-based code.  There is a tax to having lock-free code in any codebase, so as you are reviewing it, also ask yourself: is there a better (or almost-as-good) way that this could have been done using locks?  Often the answer is no, due to the benefits lock-freedom brings (no thread can indefinitely starve another).

But if the answer is yes, that the code could be written more clearly using locks, you could save your team a lot of time by convincing the author to change his or her mind.  Not only is lock-free code far more difficult to write and test, it carries a large tax during long stress hauls and end-game bug-fixing, an important and time-sensitive period in the development lifecycle of any commercial software product.  Maintaining lock-free code also carries an extra long-term cost, particularly when ramping up new hires on it.  All of this risks interfering with your ability to work on cool new features at some point.  Don’t feel bad about pushing back on this one.  Hard.

Carefully review what happens inside of a lock region.  Look at every single line with scrutiny.

  • Lock hold times should be as short as possible.  Hold times should be counted in dozens or hundreds of cycles, not thousands (unless absolutely unavoidable).
  • If lock hold times are in the dozens, you can consider using a spin-lock.
  • Recursive lock acquisitions are strongly discouraged.  If it can happen, did the developer clearly intend it to happen?  Or is this possibility accidental?  Point it out to them.  Also, are there any unexpected points at which reentrancy can occur?  E.g. any APC or message-pumping waits?  If yes, is there a way to avoid that by simple restructuring of the code?
  • Dynamic method calls via delegates or virtual methods while a lock is held should be as rare as possible.  Method calls under a lock to user-supplied code should only ever happen if the concurrency behavior is clearly documented and specified for the user, and when invariants hold.  All of these cases can lead to reentrancy.  Often this requires special code to detect the reentrancy and respond intelligently.
  • Lock regions should usually not span multiple methods: for example, acquiring the lock in one method, returning, and having the caller release it in another method is bad form.  It is very easy to screw up the control flow and deadlock your library.
  • CERs can only use spin-locks currently (because Monitor.ReliableEnter is currently unavailable), if you care about orphaning locks at least (which most CER-cost does).  If you see somebody trying to write a CER using a CLR Monitor, their code is probably busted.  Thankfully CERs are pretty rare to encounter in practice.

Races that break code are always must-fix bugs, no matter how obscure.  If they happen with low frequency on the quad-cores of today, they will probably break with regularity on the 16-cores of tomorrow.  The kinds of code my team writes needs to remains correct and scale well into the distant future; presumably if you’re writing concurrent code already, yours does too.  If you find such a race, the code should not even be checked in until it’s fixed.  “But it only happens once in a while” is an inexcusable answer.  Benign races are OK but should be clearly documented.

Events

When I see any event-based code (either Monitor Wait/Pulse/PulseAll condition variables or some event type, like AutoResetEvent or ManualResetEvent), the first thing I do is build up a global understanding of all the conditions under which events are set, reset, and waited on.  This is to understand the coordination and flow of threads top-down, rather than bottom-up.  Because I’ve already reviewed the sequential parts of the algorithm, I typically already know the important state transitions events are guarding before I even get to this point.

Next, there are some simple aspects to specific usage of events that I look for:

  • Understanding the relationship between mutual exclusion, the state, and the events is important and subtle.  Comments should be used ideally to explain that.
  • Does the setting of the event happen in a wake-one (Pulse, Auto-Reset) or wake-all (PulseAll, Manual-Reset) manner?  If it’s wake-one, are all waiters homogeneous?  Is it always strictly true that waking-one is sufficient and won’t lead to missed wake-ups?
  • Waiters that release the lock and then wait should be viewed with suspicion.  There’s a race between the release and wait that notoriously causes deadlocks.
  • Concurrent code should never use timeouts as a way to work around sloppiness in the way threads wait and signal.  A missed wake-up is a bug in the code that must be fixed.

Lock-freedom and volatiles

If you’re looking at lock-free code, you need to have a firm grasp on the CLR’s memory model.  See http://www.bluebytesoftware.com/blog/2007/11/10/CLR20MemoryModel.aspx for an overview.  Don’t think about the machine, think about the logical abstraction provided by the memory model.  You also need a firm grasp on the invariants of the data structures involved.  Specifically you are looking to see if the structure could ever move into a state, visible by another thread, where one of these invariants doesn’t hold.  I explicitly permute (often on a whiteboard or in notepad) the sections of the code that involve shared loads and stores, using knowledge of the legal reorderings given our memory model, to see if the code breaks.

Any variable marked as volatile should be a red flag to carefully review all use of that variable.  For every single read and every single write of that variable, you must look at it and convince yourself of why volatile is necessary.  If you can’t, ask the person who wrote the code.  Sometimes volatile is used because most (but not all) call sites need it; that’s often acceptable.  Leaving the variable as non-volatile and selectively using Thread.VolatileRead for the reads that need it is typically too costly.  Anyway, comments should always be used to explain why each load and store is volatile, even if it doesn’t strictly need the volatile semantics.

Conversely, any variable that is apparently shared, but not marked volatile, should be an even redder flag.  It’s very likely that this is a mistake.  Recall that writes happen in-order with the CLR’s memory model, but that reads do not.  Anytime there is a relationship between multiple shared variables that are written and read together (without the protection of a lock), they typically both need to be volatile.

Any reads of shared variables used in a tight loop must be marked volatile.  Otherwise the compiler may decide to hoist them, causing an infinite loop.  Even if they are retrieved via simple method calls like property accessors (due to inlining).

Thread.MemoryBarrier should typically only occur to deal with store (release) followed by load (acquire) reordering problems.  And it’s usually a better idea to use an InterlockedExchange for the store instead, since it implies a full barrier but combines the write.  Sometimes a fence can be used to flush write buffers—like when releasing a spin-lock to avoid giving the calling thread the unfair ability to turn right around and reacquire it—but this is extremely rare, and often an indication that somebody has an inaccurate mental model of what the fence is meant to do.

Custom spin waiting should be used rarely.  If you see it used, the person may not be aware that spin waits need special attention: to work well on HT machines, yield properly to all runnable threads with appropriate amortized costs, to spin only for a reasonable amount of time (in other words, less than the duration of a context switch), and so on.  Thread.SpinWait does not do what most people expect, since it only covers the first.  Kindly let them know about these things.  If any spin waiting is used in a codebase, it’s far better to consolidate all usage into a single primitive that does it all.

Wrapping up

At the end of each review, ask yourself whether all of the concurrency-oriented parts of the code were clearly explained in the design doc for the feature.  Did this carry over to clearly written comments in the implementation?  These are some really hard issues to get your head around, so the time spent reviewing the code should not be lost.  Somebody, someday down the road, will need to understand the code again (perhaps so that they can maintain it, test it, etc.), and it is your responsibility as a member of the team—regardless of whether you wrote the code—to do your part in making that feasible.  You should explicitly go back to the design doc and suggest areas for clarification.

3/28/2008 10:04:50 AM (Pacific Daylight Time, UTC-07:00)  #   

 Wednesday, February 27, 2008

I’ve mentioned before that the CLR has a central wait routine that is used by any synchronization waits in managed code.  This covers WaitHandles (AutoResetEvent, ManualResetEvent, etc.), CLR Monitors (Enter, Wait), Thread.Join, any APIs that use such things, and the like.  This routine even gets involved for waits that are internal to the CLR VM itself.  This is primarily done so that the runtime can pump appropriately on STAs, and was later used to experiment with fiber-mode scheduler in SQL Server.  Two years ago I showed how to use these capabilities to build a deadlock detection tool via the CLR’s hosting APIs.  Sadly IO-based waits (like FileStream.Read) do not route through this.

The System.Threading.SynchronizationContext class has a very cool (but not widely known) feature that enables you to extend this central wait routine.  To do so requires four steps: subclass SynchronizationContext; call base.SetWaitNotificationRequired; override the virtual Wait method to contain some custom wait logic; and then register your SynchronizationContext via the static SynchronizationContext.SetSynchronizationContext method.  After you do this, most waits that occur on that thread will be redirected through your custom Wait method.

Here's a very simple example of this:

using System;
using System.Threading;

class BlockingNotifySynchronizationContext : SynchronizationContext {
    public BlockingNotifySynchronizationContext() {
        SetWaitNotificationRequired();
    }

    public override int Wait(
            IntPtr[] waitHandles, bool waitAll, int millisecondsTimeout) {
        Console.WriteLine("Begin wait: {0} handles for {1} ms",
            waitHandles.Length, millisecondsTimeout);
        int ret = base.Wait(waitHandles, waitAll, millisecondsTimeout);
        Console.WriteLine("Finished wait");
        return ret;
    }
}

class Program {
    public static void Main() {
        SynchronizationContext.SetSynchronizationContext(
            new BlockingNotifySynchronizationContext());
        ManualResetEvent mre = new ManualResetEvent(false);
        mre.WaitOne(1000, false);
    }
}

If you run this, you'll see some messages printed to the console to do with beginning and finishing waits.

A few things are worth noting:

  • The Wait signature looks a lot like WaitForMultipleObjects.  In fact, it's fairly trivial to turn around and call it via a P/Invoke.  Recovering from APCs is a tad tricky however, and you'd have to do all of your own timeout management, message pumping, and the like.
  • You receive an IntPtr[], making it incredibly difficult to correlate the objects being waited on with the actual synchronization objects from which they came (e.g. Monitors, EventWaitHandles, etc.).
  • The code that runs inside Wait is the wait itself.  In other words, when you return, whatever code initiated the wait is going to assume that the API is being honest and truthful.

Another subtlety is that this code, as written, is subject to stack overflow.  Why is that?  In this particular instance, Console.WriteLine may need to block internally because it automatically serializes access to the output stream.  Well, when that blocks, it just goes through the same central wait routine, which calls back out, and so on and so forth.  Obviously this extends to any code that uses locks, including CLR services like cctors.  So the code you write here needs to be very carefully written so as not to ever block recursively.

Notice that some waits do not call out.  The reason is that the callout stems from a routine deep inside the CLR VM itself.  Some waits may occur while a GC is in progress, at which point it’s illegal to invoke managed code.  The CLR just reverts to using its own default wait logic in such cases.

Lastly this is not a foolproof mechanism.  Other components can register their own SynchronizationContexts, replacing the context you’ve installed completely.  This may mean you miss some blocking calls.  If you are building a ThreadPool, you can always reset it each time the thread is returned, or even use your own ExecutionContexts when running them.  It is also possible that such a context will exist by the time you get around to installing your own.  For example, ASP.NET, WinForms, and WPF use custom SynchronizationContexts. 

If such a context exists already when you install this custom one, you can always defer to it for things like CreateCopy, Send, Post, and Wait.  For example, here’s a SynchronizationContext implementation that allows custom before/after wait actions, but otherwise relies on the existing SynchronizationContext (if any) for things like Send, Post, and Wait:

using System;
using System.Threading;

delegate object PreWaitNotification(
    IntPtr[] waitHandles, bool WaitAll, int millisecondsTimeout);
delegate void PostWaitNotification(
    IntPtr[] waitHandles, bool WaitAll, int millisecondsTimeout,
    int ret, Exception ex, object state);

class BlockingNotifySynchronizationContext : SynchronizationContext {
    private SynchronizationContext m_captured;
    private PreWaitNotification m_pre;
    private PostWaitNotification m_post;

    public BlockingNotifySynchronizationContext(
            PreWaitNotification pre, PostWaitNotification post) :
        this(SynchronizationContext.Current, pre, post) {
    }

    public BlockingNotifySynchronizationContext(
            SynchronizationContext captured, PreWaitNotification pre, PostWaitNotification post) {
        SetWaitNotificationRequired();

        m_captured = captured;
        m_pre = pre;
        m_post = post;
    }

    public override SynchronizationContext CreateCopy() {
        return new BlockingNotifySynchronizationContext(
            m_captured == null ? null : m_captured.CreateCopy(), m_pre, m_post);
    }

    public override void Post(SendOrPostCallback cb, object s) {
        if (m_captured != null)
            m_captured.Post(cb, s);
        else
            base.Post(cb, s);
    }

    public override void Send(SendOrPostCallback cb, object s) {
        if (m_captured != null)
            m_captured.Send(cb, s);
        else
            base.Send(cb, s);
    }

    public override int Wait(IntPtr[] waitHandles, bool waitAll, int millisecondsTimeout) {
        object s = m_pre(waitHandles, waitAll, millisecondsTimeout);
        int ret = 0;
        Exception ex = null;

        try {
            if (m_captured != null)
                ret = m_captured.Wait(waitHandles, waitAll, millisecondsTimeout);
            else
                ret = base.Wait(waitHandles, waitAll, millisecondsTimeout);
        } catch (Exception e) {
            ex = e;
            throw;
        } finally {
            m_post(waitHandles, waitAll, millisecondsTimeout, ret, ex, s);
        }
        return ret;
    }
}

class Program {
    public static void Main() {
        SynchronizationContext.SetSynchronizationContext(
            new BlockingNotifySynchronizationContext(
                delegate { Console.WriteLine("PRE"); return null; },
                delegate { Console.WriteLine("POST"); }
            )
        );
        ManualResetEvent mre = new ManualResetEvent(false);
        mre.WaitOne(1000, false);
    }
}

That’s a fair bit of code, but it's mostly boilerplate.  It allows you to easily specify a pre/post action to be invoked upon each blocking call, and will work on ASP.NET, GUI threads, and the like.  The pre action can return an object for the post action to inspect.  And the post action is given the return value and exception (if any).  If no SynchronizationContext was present when installed, it just defers to the base SynchronizationContext implementation of Send, Post, and Wait.

Now what you actually do inside those callbacks, I suppose, is entirely your business …

2/27/2008 2:47:47 PM (Pacific Standard Time, UTC-08:00)  #   

 Tuesday, February 19, 2008

A few of us got in a room two weeks ago with Charles to discuss the Task Parallel Library component of Parallel Extensions.

It's now available on Channel9: http://channel9.msdn.com/Showpost.aspx?postid=384229.

2/19/2008 11:27:36 AM (Pacific Standard Time, UTC-08:00)  #   

 

Recent Entries:

Search:

Browse by Date:
<July 2008>
SunMonTueWedThuFriSat
293012345
6789101112
13141516171819
20212223242526
272829303112
3456789

Browse by Category:

Notables: