RSS 2.0

Personal Info:

Joe Send mail to the author(s) leads the architecture of an experimental OS's developer platform, where he is also chief architect of its programming language. His current mission is to enable writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe founded the Parallel Extensions to .NET project. He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife, writing books, writing music, studying music theory & mathematics, and doing anything involving food & wine.

My books

My music

Disclaimer:
The content of this site are my own personal opinions and do not represent my employer's view in anyway.

© 2012, Joe Duffy

 
 Thursday, May 28, 2009

Two persons stand on a railway embankment at points A and C, exactly 500 meters apart.  Lightning strikes precisely in the center of them, at point B, 250 meters away from both:

<--A----------B----------C-->

Presuming both persons are stationary, does the event (lightning strikes) occur “at the same time” from the perspective of the two persons?  In our simplistic one dimensional model, the answer is Yes, precisely because the point of the lightning strike, B, is equidistant from A and C.

The person at point C may just as well be responsible for generating the event, by using some form of light rod instead of a lightning bolt supplied by nature.  If the person at C lights such a rod, would the event still occur “at the same time” for both persons?  Clearly No, because it will take some amount of time for the event’s occurrence to travel the distance from C to A, specifically the time it takes for light to travel 500 meters.  Whereas for C it happens nearly instantaneously.

Practically speaking, this amount of time it takes for the light to travel to A will of course be so minute as to be nearly immeasurable, but nevertheless there are two separate times t and t’, the former being the actual time the rod is lit at C, and the latter being the time at which it is perceived at A.  This is commonly referred to as relativity of simultaneity, introduced as Lorentz’s local time in the late 1800's and further formalized by Einstein's special theory in 1905.

Now imagine that a new person is placed at point B, equidistant from A and C, where the original lightning struck.  If the person at C lights his rod, will the person at B observe the event before the person at A does?  Most certainly.  It takes less time for light to travel 250 meters than it does 500 meters.

Let us extend our working example a bit.  Imagine again three persons, one at point A, one at point B, and another at point C.  Those at B and C hold their own light rods.  The person at C lights a rod, and in response to seeing C’s rod lighting, the person at B also lights a rod.  The question is, must the person at A witness the light emanating from C’s rod prior to witnessing the light emanating from B’s rod?  Unless the person at B’s response is truly instantaneous (which we assume is practically impossible), or unless he can see into the future (which we also assume is impossible), clearly the answer is Yes.  Because the rod at B was lit in response to witnessing C’s lighting of the rod, some amount of time must have passed during the response, and the person at A will thus see C’s lighting first (or at the very least simultaneously, assuming near-impossible instantaneous response).  We say B’s lighting is causally dependent on C’s lighting.

The main point here is that time is an illusion.  There is no global time clock.  Instead, events are not only distinguished by some monotonically increasing time value t, but also by a location which is defined by three-space coordinates.  This is Minkowski’s four-dimensional space.  One event occurring at coordinates { x=0, y=0, z=0, t=99 } may not appear to be simultaneous with some other event at coordinates { x=42, y=42, z=42, t=99 }, depending on the observer's location, even though both events occur at time t = 99.

Perception is relative.  There is no global total ordering, only a local (relative) one.

A similar phenomenon is true of multiprocessors.  In fact, nearly everything said above applies equally to them, provided that you replace “persons” with “processors”, and lighting of rods with writes to memory and witnessing of the light with reads from memory.

Multiprocessor architects must cope with the increasing bottleneck on a central memory unit, particularly due to shared memory programming.  One common means of doing so is to increase the distance between processing elements and the memory units they use, padding this distance with ample levels of cache.  Some processor A may have a local memory (cache) that is separate from some other processor C’s, and A’s writes to it will be visible to A before C, for example.  And if some processor B sits in between them, it may notice such writes before C does.  Locality matters.

Of course, memory ordering models are meant to eliminate such distances from the programmer’s mind, at least to some degree.  They provide a set of rules governing the timing and ordering of events.  But there is just no denying the laws of physics.  My claim is that a proper ordering model ought to obey what can be derived from the special theory of relativity: no more, no less.  That is, the fundamental laws of how events occur and are correlated in the real world should be mimicked.  This means only two things, as far as I can tell:

  1. An event stream (writes) originating from a source must appear to happen in order.
  2. Causality is respected, in that when C causes B, it is implied that A must see C followed by B.

This turns out to be stronger than some models, but also weaker in some regards.  Distance and latency are first class, embellished, and allowed.  There is some cost to ensuring events leaving a locale do so in order, and that events arriving into a locale also do so in order.  Given coarse enough locales, however, this cost ought to be amortized over the cost of inter-locale communication.

Notice that sequential consistency is explicitly discouraged.  The ordinary store-followed-by-load ordering that I've written about many times is legal.  Considering this phenomena in the context of light rods and relativity makes it clear why.  Imagine that the persons at A and C light their rods simultaneously.  If the person at A immediately, after lighting the rod, looks to the right to see if C has lit the rod, the answer will be No; and similarly, if the person at C immediately, after lighting the rod, looks to the left to see if A has lit the rod, the answer will also be No.  Although the real reason has to do with gross details like store buffers and cache coherency, the elegant reason supported by the model is that it takes time for light to travel the distance between A and C.

I also want to point out that “memory ordering model” commonly refers to individual loads and stores, at a very low level, but just as well applies to a higher-level model such as might be found in an actor-oriented (message passing) programming language.  People often believe memory ordering and interleaving goes away magically with message passing models.  This is simply not true, even if instruction-level interleaving is eliminated.  The granularity merely coarsens, but the problem still remains the same.

Despite the lack of sequential consistency, implementing such a model can pose challenges, due to restrictions on optimizations like pipelining and out of order execution.  (Hey, at least we needn’t worry about processors moving about at different velocities, as in the more interesting parts of special relativity.)  But I believe it is necessary.  This price paid will be rewarded with a system that human beings can be taught to reason about as they do in the real world.  Remember: I am not just talking about memory models in the traditional sense, where people are tempted to sweep the problem under the rug of "only super-developers doing lock-free programming need a model"; it matters for higher level concurrency orchestration patterns too.  In the end, let us not forget: correctness and understandability trump performance optimizations for all but the most low-level systems developers, which make up less than 1% of the development population.

1. Relativity: The Special and General Theory. http://en.wikisource.org/wiki/Relativity:_The_Special_and_General_Theory
2. Time, Clocks, and the Ordering of Events in a Distributed System. http://research.microsoft.com/en-us/um/people/lamport/pubs/time-clocks.pdf

5/28/2009 10:29:41 AM (Pacific Daylight Time, UTC-07:00)  #   

 Saturday, May 16, 2009

A while back, I made a big stink about what appeared to be the presence of illegal load-load reorderings in Intel's IA32 memory model.  They specifically claim this is impossible in their documentation.  Well, last week I was chatting with a colleague, Sebastian Burckhardt, about this disturbing fact.  And it turned out he had recently written a paper that formalizes the CLR 2.0 memory model, and in fact treats this phenomenon with a great deal of rigor:

Verifying Compiler Transformations for Concurrent Programs
http://research.microsoft.com/pubs/76524/tr-2008-171-latest-03-11-09.pdf

To jog your memory, the problematic example is

X = 1;
r0 = X;
r1 = Y;

where both X and Y are shared memory locations, and r0 and r1 are processor registers.  According to Intel's IA32 memory model, two loads to different locations cannot reorder.  But it is completely possible for the load of X to be satisfied out of the store buffer, and for r1=Y to pass the store (thereby also passing the load r0=X).  This is a standard Dekker reordering, but the usual example consists of just { X = 1; r1 = Y }.

The key to modeling this is to turn an adjacent store-load affecting the same location into a single instruction.  Therefore, the above becomes something like:

r0 = 1;
X = r0;
r1 = Y;

Now it becomes entirely clear what has gone wrong.  I have yet to see a clear description of this phenomenon, but Sebastian's paper does a great job.

During the discussion, Sebastian showed me another disturbing four processor example:

P0        P1         P2        P3
==        ==         ==        ==
X = 1;    r0 = X;    Y = 1;    s0 = X;
          r1 = Y;              s1 = Y;

Is it possible, after all four processors complete, that { r0 == 1, r1 == 0 } and { s0 == 0, s1 == 1 }?  This seems ridiculous, given a memory model where loads cannot reorder.  It seems that no serializable execution should lead to this.  But let's look at one problematic interleaving.  First, we merge the instruction stream on P0 with P1, and also P2 with P3.  This effect could occur if these writes are in functions that end up running on the same processor, or running on a machine that shares functional units (like hyperthreading), hierarchies that share a cache, and so on.  We end up with:

P0/P1     P2/P3
=====     =====
X = 1     Y = 1;
r0 = X;   s0 = X;
r1 = Y;   s1 = Y;

Now let's permute these with the new rule introduced above in mind:

P0/P1     P2/P3
=====     =====
r0 = 1;   s0 = X;
r1 = Y;   s1 = 1;
X = r0;   Y = s1;

At this point, it should be obvious what the problematic reordering would be.  Let's continue merging these into a single execution order:

P0/P1/P2/P3
===========
r0 = 1; // #1
r1 = Y; // #0
s0 = X; // #0
s1 = 1; // #1
X = r0; // #1
Y = r1; // #1

The outcome?  { r0 == 1, r1 == 0 } and { s0 == 0, s1 == 1 }.  Whoops.

I have yet to observe this happening in practice, but models that permit store buffer forwarding are fundamentally vulnerable to this reordering.  The solution here is the same as with Dekker.  Marking the volatiles is insufficient: you need to insert full memory fences between the store-load adjacent pairs.

5/16/2009 10:31:07 PM (Pacific Daylight Time, UTC-07:00)  #   

As we were hard at work creating PFX, we had a sister team of great talent working with us every step of the way.  Their job?  To do to Visual Studio 2010 what PFX did to .NET 4.0, by substantially improving the development experience for parallel programming on Windows.  This includes both diagnostics & debugging, as well as profiling.

Daniel Moth, the program manager for a lot of the IDE features, just wrote up a comprehensive blog post on the new tasks window:

Parallel Tasks - new Visual Studio 2010 debugger window

The new window gives you a view into all of the tasks in your process, their statuses, and where they are running:

Because both TPL and PLINQ use tasks for execution, it supports both quite nicely.  And it has (consistent!) support for both .NET and C++ tasks.  The parallel stacks window is also an impressive new feature, and I'm guessing Daniel will also cover that in the coming weeks.  Keep your eyes peeled. If all goes well, you'll even get to try them out first-hand, once Beta1 is available.

And if that weren't enough to entice you to visit his blog, check out this nasty machine that Daniel uses to run his kitchen appliances:

Oh, the insanity.  I am thinking Task Manager will need revising in Windows 8.

5/16/2009 5:27:35 PM (Pacific Daylight Time, UTC-07:00)  #   

 Friday, May 08, 2009

The parallel computing team just shipped an early release Axum (fka Maestro), an actor based programming language with message passing and strong isolation.

I'm personally very excited to see what comes of Axum.  It's one step on the long road towards the vision of automatic parallelism.  Although I can't claim credit for anything concrete, I was the chief designer of the fine grained isolation model Axum is built atop (something I call "Taming Side Effects" (TSE)).  It's a blend of functional programming with imperative programming enabled by using the concepts of Haskell's state monad in a more familiar way.  I'll try to blog a bit more about it in coming weeks.  It turns out I've recently shifted my focus to a new project with the aim of applying these ideas very broadly for a whole new platform.

Doing incubation work at Microsoft is tough work, because it takes a strong vision and drive to keep pushing forward.  You need to take stances that are unconventional, risky, and often just plain unpopular, and drive against all odds.  Usually you aren't going to make any money off the ideas for years at a time, so it also takes a supportive management team who is willing to give you creative freedom and cut you checks.  Most such efforts fail in a vaccuum.  But hats off to the team for pushing hard, and going out early to ask what developers think.  This is a huge milestone.

5/8/2009 12:05:04 PM (Pacific Daylight Time, UTC-07:00)  #   

 Tuesday, March 31, 2009

I often speak of the need to develop programming models whereby developers write code in the most natural way possible, and it just so happens to be inherently parallel.  I don’t believe the lion’s share of developers want to rearchitect and rewrite their code with parallelism at the forefront of their development process.  They don’t want to think about shipping memory over to the GPU and launching a highly-specialized data parallel kernel of computation, nor do they want to have to add locks and transactions everywhere to make things safe.  But I do, however, believe the lion’s share of developers wouldn’t mind if their code ran faster as hardware got faster (via more cores).

(To be clear, there are certainly a lot of developers who will be happy to write specialized code if it means eking out every last bit of performance on their machine.  But that’s the minority.)

This viewpoint tends to get a lot of skeptical looks from people who quickly point out that this has been tried countless times before, and always leads to failure.  They, of course, are referring back to the 80’s and early 90’s where “dusty deck” parallelization was all the rage, mostly in the realm of vectorization and HPC.  To be fair, there were some mild successes in getting floating point loops parallelized, but there’s no wonder these attempts had little longevity.  No touch solutions are always inadequate.  Trying to make a fundamentally non-secure program secure, by way of, say, virtualization, may work in some constrained circumstances.  But the right solution is to develop models and practices that lead to security-by-construction.

Furthermore, languages were (and in most cases still are) lacking some major prerequisites for large-scale automatic parallelization:

  1. Safety.  Unless a compiler can reason about the determinism of a program when run in parallel, one cannot prove that its results will remain correct when parallelism is added.  Compilers are therefore limited to parallelizing highly-specialized recognized patterns, like loops comprised solely of floating point additions of two arrays indexed by the loop iteration.
  2. Performance.  Rampantly parallelizing a huge program wherever possible is dangerous, will drain performance, and make power consumption skyrocket.  Dynamic techniques like workstealing and static techniques like nested data parallelism and profile guided feedback need to work together to inform these decisions.  Machine-wide resource management needs to know about the memory topology, machine load and policy, and make informed decisions based on them.  Although there has been a lot of disparate research in these areas over the years, they have only recently been coming together.  Certainly in the 80’s, they were in their infancy.
  3. Declarative patterns.  Most of the prior art was done in FORTRAN, a standard imperative language riddled with loops, effects, and assignments.  Programs need to be written with as few dependencies as possible in order to expose large amounts of parallelism, and the von Neumann inspired languages fall short of this aim.  Data comprehensions allow set-at-a-time computations to be expressed in a higher-order way, and newer languages like Fortress have language semantics that permit parallel evaluation in many more areas, like argument evaluation.  And application models that encourage isolation and loose state coupling allow coarse partitioning.

In addition to all of those three things, we must have realistic expectations.  Even if a program were fully safe to parallelize, as many Haskell kernels are, we would seldom see perfect scaling.  Buying a 128-core machine doesn’t necessarily give you a 128x speedup.  Why?  Because there are still portions of the code that will end up less parallel than other portions, and some parts may even continue to run sequentially.  There will always be I/O and waiting: these are real programs, after all, and real programs tend not to be 100% computation.  They need to do something useful with the real world.  Moreover, safety does not mean “dependence free.”  And data dependencies are ultimately what limit parallelism.

My stated goal would therefore be that parallelism in programs is solely limited by data dependence.  Safety issues are handled by construction.  Performance is (mostly) handled by the system, although as with all things, there will be some amount of measurement, hints, and tuning necessary.  But hopefully a huge part of tuning performance will be seeking out needless dependencies, or finding new algorithms that have different dependence characteristics.  And with that, we can focus our energy on raising the level of abstraction and pushing more declarative patterns that are broadly useful.  Over time as more and more programs are written in this fashion, they become more and more naturally parallel.

What do you think?  Am I crazy?  Perhaps.  But I still know we can do it.

3/31/2009 8:29:49 PM (Pacific Daylight Time, UTC-07:00)  #   

 Friday, March 13, 2009

Managed code generally is not hardened against asynchronous exceptions.

“Hardened” simply means “written to preserve state invariants in the face of unexpected failure.”  In other words, hardened code can tolerate an exception and continue being called subsequently without a process or machine restart.  Conversely, code that is not hardened may react sporadically if continued use is attempted: by corrupting state and subsequently behaving strangely and unpredictably.

Asynchronous exceptions are a foreign concept to native programmers, and arise because there is a runtime underneath all managed code that is silently injecting code on behalf of the original program.  The only truly asynchronous exception is ThreadAbortException, but any in the set { OutOfMemoryException, TypeInitializationException, ThreadInterruptedException, StackOverflowException } are often labeled as such.  While thread aborts can happen at any line of code outside a delay-abort regions, these other exceptions can be introduced by the CLR at surprising times; i.e., { memory allocations, static member access, blocking calls, any function call }.  The effect is that, unlike most exceptions, the points at which they may occur are not obvious.  OOMs, for instance, can happen at any method call (due to failure to allocate memory in which to JIT code), implicit boxing, etc.

(As of 2.0, StackOverflowException is no longer relevant because SO triggers a FailFast of the process instead.  So saying that managed code is not hardened against SO is an understatement.)

Also, because of the way COM reentrancy works, any blocking call can lead to any arbitrary code dispatched through STA pumping.  And that arbitrary code, much like an APC, can fail via any arbitrary exception.  These are a lot like asynchronous exceptions.  So in truth, code that isn’t written to respond to arbitrary exceptions at all blocking points is technically not hardened either.

.NET doesn’t provide checked exceptions, so the blunt reality is that very little managed code is hardened properly to synchronous exceptions either.  I think we do a better job in the framework of carefully engineering the code to resiliently tolerate failure, usually by being very careful about argument validation, but we aren’t perfect.  Some things slip through.

If you stop to think about why hardening isn’t done, it’s probably obvious.  It’s darn difficult.  Especially for asynchronous exceptions where nearly every line of code must be considered.  In Win32 programming, most failure points are indicated by return codes.  (Although C++ exceptions can sneak through the cracks at surprising times.  Like the fact that EnterCriticalSection can throw.)  While error codes are cumbersome to program against (since every call needs to be checked for a plethora of conditions, making it easy to miss something), at least the response to failure is explicit.  You can decide to propagate and leave state corrupt, fix up state and then propagate, rip the process, or ignore the failure, as appropriate.  This becomes part of the API contract.  In managed code, you need to know to wrap such calls in try/catch blocks.  Nobody does this.  It’s insane to even consider doing that.  And because nobody does, you can’t even catch exceptions coming out of a single API call and know that, when faced with an OOM (for example), that all code on the propagating callgraph has transitively handled the failure in a controlled manner.  The very fact that the lock{} statement auto-unlocks without rolling back corrupt state should be indication enough of the current state of affairs.

An instance of any of the aforementioned exceptions means the AppDomain is toast.

By toast, I mean that it’s soon going to be unusable, and hopefully actively being unloaded.  Code in the framework assumes this, and you should too.  All it does is try to get out of the way by not crashing or hanging the ensuing unload.  A small fraction of code that deals with process-wide state comprised of resources not under the purview of the CLR GC needs to worry about running and avoiding leaks.  This is where things like CERs, CriticalFinalizerObjects, and paired operations stuck in finally blocks come into play.  They ensure cross-process state is freed, and that asynchronous exceptions cannot occur in places that would crash or hang a clean unload.

Unfortunately, it’s not always the case that the AppDomain is unloading when such an exception occurs:

  • Somebody can call Thread.Abort directly, without killing the AppDomain.  They can either call ResetAbort and keep it around, or let it return to the ThreadPool which catches and swallows aborts.  In fact, we tell people that synchronous aborts a la Thread.CurrentThread.Abort is “always safe”, whereas we tell people asynchronous aborts are dangerous and best avoided.
  • Some framework infrastructure, most notably ASP.NET, even aborts individual threads routinely without unloading the domain.  They backstop the ThreadAbortExceptions, call ResetAbort on the thread and reuse it or return it to the CLR ThreadPool.  That means any code running in ASP.NET is apt to be corrupted when websites are recycled and AppDomain isolation is not being used.
  • Assume AppDomain B is being unloaded.  If some thread has called from A to B to C, the thread will immediately suffer an abort.  The result is that C will see a thread unwinding with a ThreadAbortException, back into B, and then back to A, at which point the exception turns into a deniable AppDomainUnloadedException that can be caught.  But C has seen an in-flight abort and yet it is not being unloaded.  The result is that C’s state may be completely corrupt.  I believe this should be considered a bug in the CLR.
  • We can’t differentiate between soft- and hard-OOMs today.  The former are caused by requests to allocate large blocks memory.  Often a failure here isn’t indicative of a disaster.  It may be due to a need to allocate 1GB of contiguous memory, and perhaps there is fragmentation.  Hard OOMs are often caused by running up against the edge of the machine where no pagefile space is available, and may indicate a failure to JIT some important method, among other things.  But because we don’t differentiate, any managed code can catch-and-ignore any kind of OOM, including hard ones.
  • Thread interruptions are often used as a form of inter-thread communication.  For example, they can be used as a poor man’s cancellation.  (This is inappropriate, and cooperative techniques should always be used.  But it is widespread.)  But because they are used as a means of communication, they are almost always caught and handled in some controlled manner.  This is one place where we screwed up by not hardening the frameworks against interrupted blocking calls and reacting intelligently.  Checked exceptions would have saved us.

What does all of this mean?  Quite simply, the .NET Framework cannot be trusted when any of the aforementioned exceptions are thrown.  Ideally the process will come tumbling down shortly thereafter, but improperly written code can catch them and continue trying to limp along.  In fact, as I mentioned above, some wildly popular & successful application models do (notably, ASP.NET and WinForms).

This state of affairs is admittedly unfortunate.  We don’t properly separate out the truly fatal exceptions from those that we can gracefully recover from.  In an ideal world, I’d love to see us do that.  For example:

  1. At some point, we really ought to consider FailFast instead of continuing to run code under failures we know are fatal and dangerous to attempt to recover from, much like we do with SO.  At least these failures should be undeniable like thread aborts are.  But this is a fairly Byzantine response and is not for the faint of heart.  Given that we still live in a world where WinForms wraps the top-most frame of the GUI thread in a catch-all, presents a dialog box, and allows a user to click “Ignore & Continue”, I seriously doubt we’ll get there anytime soon.
  2. Never expose a ThreadAbortException to code in an AppDomain unless we can guarantee the AppDomain is being unloaded.  That means getting rid of the Abort API, and thus indirectly disallowing code from catching and calling ResetAbort.  It also means the A calls B calls C case would not allow B to unload until the thread voluntarily unwinds out of C.
  3. Allow OOMs to be caught only when they are soft.  That means a call to ‘new’, and it means the catch much occur inside the same stack frame as the call to ‘new’.  Such exceptions can be tolerated if code is properly written, and we will tell developed to be mindful of them.  Once such an OOM propagates past the calling stack frame, they will escalate to hard.
  4. All other OOMs are hard and fatal.  This includes failure to allocate memory to JIT code and failure to allocate 20 bytes to box an int.  Hard OOMs are thus undeniable.
  5. Get rid of ThreadInterruptedExceptions.  We screwed this up from Day One, and it’s probably too late to fix this.  We added cooperative cancellation in .NET 4.0 for a reason.
  6. TypeInitializationExceptions can probably stay, but we should allow rerunning the cctor upon subsequent accesses.  Today, once a class C throws from its cctor, the class can never be constructed.  So on the current plan, it only makes sense to FailFast.

I’m sure there are many other things we could do to improve things.  But these 6 general themes would be a great start.

I’m just spitballing here.  There are no concrete plans to do any of these 6 things as far as I know.  And at the end of the day, hardening only improves the statistics of the situation, so it tends to be very difficult to argue for one change over another, particularly if taking the change would make existing programs break.  But I really would like to see the base level of reliability in managed code improve with time.  Especially with the exciting work going on around contract-checking in the BCL in Visual Studio 10, I hope these topics become top-of-mind for folks again in the near future.

3/13/2009 1:24:25 PM (Pacific Daylight Time, UTC-07:00)  #   

 Friday, March 06, 2009

A developer from the Parallel Extensions team, Igor Ostrovsky, recently whipped together a really neat Silverlight game.  It's called RoboZZle, and he calls it a "social puzzle game":

"RoboZZle is an online puzzle game that challenges players to program a robot to pick up all stars on a game board. The game mechanics are simple, yet allow for a wide variety of challenges that call for very different solution approaches." (from the blog entry introducing it.)

The coolest feature of this game is that you win by programming.  And there's a whole community surrounding it, complete with forums, the ability to create your own games, a ranking system, its own blog and continuously updated news, etc.  Check it out.

3/6/2009 3:36:36 PM (Pacific Standard Time, UTC-08:00)  #   

 Monday, February 23, 2009

Pop quiz: Can this code deadlock?

SpinLock slockA = new SpinLock();

SpinLock slockB = new SpinLock();

 

Thread 1             Thread 2

~~~~~~~~             ~~~~~~~~

slockA.Enter();      slockB.Enter();

slockA.Exit();       slockB.Exit();

slockB.Enter();      slockA.Enter();

slockB.Exit();       slockA.Exit();

The answer, as I'm sure you suspiciously guessed, is "it depends."

I previously posted some thoughts about whether a full fence is required when exiting the lock.  In that post, I focused primarily on timeliness.  But what might be even more frightening is that the answer to my question above is yes, provided two things:

1. Exit doesn't end with a full fence.
2. Enter doesn't start with a full fence.

Just making Exit a store release and Enter a load acquire is insufficient.  Here's why.

Imagine a super simple spin lock that satisfies our deadlock criteria:

class SpinLock {

    private volatile int m_taken;

 

    public void Enter() {

        while (true) {

            if (m_taken == 0 &&

                    Interlocked.Exchange(ref m_taken, 1) == 0)

                break;

        }

    }

 

    public void Exit() {

        m_taken = 0;

    }

}

Clearly Exit satisfies #1.  The technique of using an ordinary read of m_taken before resorting to the XCHG call is often known as a TATAS (test-and-test-and-set) lock, and this can help alleviate contention.  And it also means we will satisfy #2 above.

To see why deadlock is possible, imagine the following (fully legal) compiler transformation.  The compiler first inlines everything, so for Thread 1 we have:

Thread 1

~~~~~~~~

while (true) {

    if (slockA.m_taken == 0 &&

            Interlocked.Exchange(ref slockA.m_taken, 1) == 0)

        break;

}

slockA.m_taken = 0;

while (true) {

    if (slockB.m_taken == 0 &&

            Interlocked.Exchange(ref slockB.m_taken, 1) == 0)

        break;

}

slockB.m_taken = 0;

What has to happen next is pretty subtle.  It's even unlikely a compiler would do this intentionally (as far as I can tell).  But it's entirely legal to morph the above code into something like this:

Thread 1

~~~~~~~~

while (true) {

    if (slockA.m_taken == 0 &&

            Interlocked.Exchange(ref slockA.m_taken, 1) == 0)

        break;

}

while (slockB.m_taken == 0) ;;

slockA.m_taken = 0;

if (Interlocked.CompareExchange(ref slockB.m_taken, 1) != 0)

    while (slockB.m_taken != 0 ||

            Interlocked.Exchange(ref slockB.m_taken, 1) != 0) ;;

slockB.m_taken = 0;

The load(s) of slockB.m_taken have moved before the store to slockA.m_taken; this is legal, even if they are both marked volatile.  A load acquire can move above a store release, and the code remains functionally equivalent.  Now, the code required to fix up this code motion is pretty hokey.  We clearly can't do the XCHG before the store to slockA.m_taken, so we need to try it afterwards.  But that brings about an awkward transformation: if it fails, we must effectively do what the original code did, spinning until we acquire the slockB lock.

Do you see the deadlock yet?

Imagine the compiler did similar code motion on Thread 2:

Thread 2

~~~~~~~~

while (true) {

    if (slockB.m_taken == 0 &&

            Interlocked.Exchange(ref slockB.m_taken, 1) == 0)

        break;

}

while (slockA.m_taken == 0) ;;

slockB.m_taken = 0;

if (Interlocked.CompareExchange(ref slockA.m_taken, 1) != 0)

    while (slockA.m_taken != 0 ||

            Interlocked.Exchange(ref slockA.m_taken, 1) != 0) ;;

slockA.m_taken = 0;

Oh no!  See it now?

If Thread 1 and Thread 2 both enter the critical regions for slockA and slockB at the same times, they will end up spin-waiting for the other to leave before exiting their respective lock.

Boom: deadlock.


 

2/23/2009 8:59:14 PM (Pacific Standard Time, UTC-08:00)  #   

 Sunday, February 22, 2009

A few weeks back I recorded a discussion with the infamous Erik Meijer and Charles from Channel9.

Perspectives on Concurrent Programming and Parallelism
http://channel9.msdn.com/shows/Going+Deep/Joe-Duffy-Perspectives-on-Concurrent-Programming-and-Parallelism/

In it, I show my cards a bit more than intuition says I should.  I'm not good at poker.

To summarize:

  • Mostly functional (purity + immutability) is a great default.
  • Safe, determinstic mutability (a la runST) is a must-have for cognitive familiarity.
  • Isolation is key to achieve the former; type systems can help (a lot).
  • Actors, agents, forkIO, <what have you> is a good model, but not the only one.  Isolation is (far) more general.
  • Transactions can help around the edges.

I'm working on a few papers for public consumption this year where I espouse these ideas.  Keep watching for more detail.

2/22/2009 11:34:10 PM (Pacific Standard Time, UTC-08:00)  #   

 Friday, February 20, 2009

I was very harsh in my previous post about reader/writer locks.

The results are clearly very hardware-specific.  And one can certainly argue that better implementations are possible.  (In fact, I will show one momentarily.)  But no matter which way you slice-and-dice it, a lock implies mutable shared state which implies contention.  Herb argued this point quite well, and rather thoroughly, in his recent Dr. Dobb’s article.  Interference due to contention means more time spent resolving memory conflicts and less time doing useful work.  A reader/writer lock can be infinitely clever, but there is still a consensus protocol that must be established: and that implies a loss of scalability.  Pretty simple.

It’s very tricky to develop a consensus protocol that is sufficiently lossy so as to relieve memory contention while at the same time being sufficiently precise that the lock works right.  In the case of a spinning reader/writer lock (which is, for what it’s worth, overly naïve an approach for most circumstances), you need to ensure that a writer knows for sure when there are 0 readers, and that each reader knows for sure whether there is 0 or 1 writer.  (For blocking reader/writer locks, there’s a whole lot more.)  One promising thing to note is that the writer only needs to know whether there are 0 or N readers, but not the specific value N; there’s a fair bit of research on scalable counters (like this) which exploit problems of this nature.  Unfortunately, it’s not completely relevant here.  You need to know exactly when the transition from N to 0 readers happens in order to let the writer through in a timely fashion; and in order to account for that transition, a consensus among readers is needed.  That's hard to do.

More scalable solutions are possible than the simple lock I showed previously.  Although writers need to know whether readers are present, the readers themselves could care less about other readers.  As a result, we can make the lock slightly more expensive for the writer, because it needs to accumulate the count of readers, but this allows us to make it it slightly cheaper for the readers to enter and exit.  Where cheaper means less contention.

Here’s one possible algorithm.  We’ll keep an array of read flags and a single write flag:

private volatile int m_writer;

private ReadEntry[] m_readers = new ReadEntry[Environment.ProcessorCount * 16];

A few things are noteworthy about the read flags.

First, it’s an array of ReadEntry values.  These are just simple structs that wrap a volatile int, but we also pad the struct so that it’s 128 bytes in total size.  That avoids the situation where multiple read flags just happen to end up sharing the same cache line (which are usually either 64 or 128 bytes in size), which leads to false sharing in the memory system (destroying our aim to reduce contention).

[StructLayout(LayoutKind.Sequential, Size = 128)]

struct ReadEntry {

    internal volatile int m_taken;

}

Second, we size the array to be 16-times the number of processors.  We hash into it based on the calling thread’s unique identifier, so to reduce (but not eliminate) the chance of hashing collisions, we’ll use a few times more buckets than the total number of concurrent threads.  Hashing collisions are expensive: they incur some amount of memory contention, and also demand that we use an atomic CAS increment instead of an ordinary ++.  (While a super-duper-cheap TLS solution might seem more ideal, there isn’t any good per-object TLS solution to use.  The array hashing approach is actually quite fast.)

Notice that we’re using an awful lot of space for a single lock.  This means the techniques I show here wouldn’t be readily applicable to a system that uses lots of fine-grained locks, like transactional memory.  But similar ideas can be extrapolated, e.g., by using shared lock tables.

Lastly, some invariants among these fields are self-evident.  When the writer flag is 0, no writers are waiting; when it is 1, either a writer is actively in the critical section, or there is a writer waiting for readers to exit.  When at least one reader flag entry is non-0, there is a reader either inside the lock or attempting to enter it.  Thus, no new writer is permitted while there’s a non-0 reader entry, and no new reader is permitted while there’s a non-0 writer flag.  This is sufficient to ensure the reader/writer lock properties hold.

Now let’s look at how the EnterReadLock and ExitReadLock methods work.

When a reader arrives, it spins until the writer flag is non-0.  It then hashes into the read flag array using its unique thread identifier, and then atomically increments the read counter.  It then needs to recheck that a writer didn’t arrive in the meantime.  (The CAS increment means we can safely do this without worry for reordering bugs, like the read of the writer flag passing the write to the reader flag.)  If a writer hasn’t arrived, the read lock has been successfully acquired and we’re done; if a writer has arrived, however, the reader needs to back out the change (since the writer might be waiting for the read flag to become 0) and then go back to spinning.  It will retry again once the writer exits.

private int ReadLockIndex {

    get { return Thread.CurrentThread.ManagedThreadId % m_readers.Length; }

}

 

public void EnterReadLock() {

    SPW sw = new SPW();

    int tid = ReadLockIndex;

 

    // Wait until there are no writers.

    while (true) {

        while (m_writer == 1) sw.SpinOnce();

 

        // Try to take the read lock.

        Interlocked.Increment(ref m_readers[tid].m_taken);

        if (m_writer == 0) {

            // Success, no writer, proceed.

            break;

        }

 

        // Back off, to let the writer go through.

        Interlocked.Decrement(ref m_readers[tid].m_taken);

    }

}

(Note that SPW is a little type to encapsulate the spin-wait logic, including some amount of backoff to reduce contention.  An example implementation at the bottom of this essay, along with the full reader/writer lock code.  .NET 4.0 includes a SpinWait type that provides this same functionality.)

Exiting the read lock is pretty simple.  We just need to decrement our counter.

public void ExitReadLock() {

    // Just note that the current reader has left the lock.

    Interlocked.Decrement(ref m_readers[ReadLockIndex].m_taken);

}

The writer lock is pretty straightforward.  It works the same way most spin-based mutually exclusive locks work, but using a CAS on the writer flag, but has an extra step after successfully acquiring the lock: a writer must walk the list of read flags, and wait for each one to become 0.  (This is similar to Peterson's mutual exclusion algorithm for N-threads.)  Because the write flag is set first (using a CAS), and because new readers won’t enter if the flag is set, we can be assured this works correctly without hokey memory reordering problems cropping up.

public void EnterWriteLock() {

    SPW sw = new SPW();

    while (true) {

        if (m_writer == 0 && Interlocked.Exchange(ref m_writer, 1) == 0) {

            // We now hold the write lock, and prevent new readers.

            // But we must ensure no readers exist before proceeding.

            for (int i = 0; i < m_readers.Length; i++)

                while (m_readers[i].m_taken != 0) sw.SpinOnce();

 

            break;

        }

 

        // We failed to take the write lock; wait a bit and retry.

        sw.SpinOnce();

    }

}

And exiting the write lock is even simpler than exiting the read lock.  We just set the writer flag to 0.

public void ExitWriteLock() {

    // No need for a CAS.

    m_writer = 0;

}

Given all of that, you might wonder how well this bad boy performs.  Well, single-threaded performance is a bit worse than the previous spin reader/writer lock: about 1.55x the cost of a monitor acquisition for the read lock instead of 0.95x, and about 5.52x for the write lock instead of 0.85X.  This makes sense.  There’s simply a whole lot more work going on in this new lock compared to the old, simple one.

But scalability is vastly improved.  Our hard work has apparently paid off.  Here’s a table much like the one in the previous post: scaling over the equivalent mutually exclusive monitor code, for various percentages of writers and various amounts of "work" (counts of function calls) inside the lock region.  (I have left out the legacy .NET ReaderWriterLock type because it is embarassingly terrible.)  Remember: 1.0x means it scales the new lock is the same as monitor, 0.5x means twice as fast, and 2.0x means twice as slow.  0.25x is ideal speedup (4x) since I am running the tests on a four way machine.

0% writers:

              0 calls       10 calls      100 calls     1000 calls

RWLSlim (3.5) 2.11x         2.01x         0.96x         0.32x

SpinRWL(old)  9.63x         7.04x         1.02x         0.26x

SpinRWL(new)  0.39x         0.36x         0.28x         0.25x

 

5% writers:

              0 calls       10 calls      100 calls     1000 calls

RWLSlim (3.5) 2.29x         2.36x         1.18x         0.61x

SpinRWL(old)  5.69x         5.59x         1.43x         0.94x

SpinRWL(new)  1.01x         0.96x         0.45x         0.38x

 

10% writers:

              0 calls       10 calls      100 calls     1000 calls

RWLSlim (3.5) 2.26x         2.04x         1.15x         1.00x

SpinRWL(old)  6.87x         5.03x         1.42x         1.34x

SpinRWL(new)  1.60x         1.51x         0.63x         0.53x

 

25% writers:

              0 calls       10 calls      100 calls     1000 calls

RWLSlim (3.5) 2.09x         2.10x         1.14x         1.00x

SpinRWL(old)  4.70x         4.20x         1.43x         1.69x

SpinRWL(new)  2.81x         2.29x         1.27x         0.73x

 

50% writers:

              0 calls       10 calls      100 calls     1000 calls

RWLSlim (3.5) 2.18x         1.95x         1.15x         0.95x

SpinRWL(old)  3.23x         3.73x         1.54x         1.39x

SpinRWL(new)  3.16x         2.76x         1.73x         1.10x

 

100% writers:

              0 calls       10 calls      100 calls     1000 calls

RWLSlim (3.5) 2.18x         1.95x         1.04x         0.92x

SpinRWL(old)  2.63x         2.04x         1.06x         0.87x

SpinRWL(new)  6.79x         3.96x         1.62x         1.06x

You can see there are now several more cases where the new reader/writer lock beats out both the .NET 3.5 ReaderWriterLockSlim type in addition to our previous attempt.  In fact, we now have a few new scenarios that scale, like 5% or 10% writers where the amount of work being done is at least 100 function calls.  (Unfortunately, doing 100 or more function calls inside a lock that uses spin-waiting is dangerous and considered a very bad practice: you should be able to count the number of instructions on your fingers (and toes).  But that’s somewhat beside the point.)  In summary, so long as there is a fair amount of work going on and the percentage of writers remains very low, we might see a benefit.

So was I overly harsh on reader/writer locks in my last post?  Sure, maybe a little.  While I am still very disappointed in the current .NET reader/writer locks (and, I imagine, the Vista SRWLock), the results I was able to get here are a bit more promising.

But the point I was trying to get across is the same: sharing is sharing is sharing.  Avoid it like the plague.

(Thanks to Tim Harris for sending me private email about my previous posts.  The brief discussion inspired me to pick this back up.)

 

Here’s the full code for the reader/writer lock.

using System;

using System.Threading;

using System.Collections.Concurrent;

using System.Collections.Generic;

using System.Runtime.InteropServices;

 

// We use plenty of interlocked operations on volatile fields below.  Safe.

#pragma warning disable 0420

 

/// <summary>

/// A very lightweight reader/writer lock.  It uses a single word of memory, and

/// only spins when contention arises (no events are necessary).

/// </summary>

public class ReaderWriterSpinLockPerProc {

    private volatile int m_writer;

    private volatile ReadEntry[] m_readers = new ReadEntry[Environment.ProcessorCount * 16];

 

    [StructLayout(LayoutKind.Sequential, Size = 128)]

    struct ReadEntry {

        internal volatile int m_taken;

    }

 

    private int ReadLockIndex {

        get { return Thread.CurrentThread.ManagedThreadId % m_readers.Length; }

    }

 

    public void EnterReadLock() {

        SPW sw = new SPW();

        int tid = ReadLockIndex;

 

        // Wait until there are no writers.

        while (true) {

            while (m_writer == 1) sw.SpinOnce();

 

            // Try to take the read lock.

            Interlocked.Increment(ref m_readers[tid].m_taken);

            if (m_writer == 0) {

                // Success, no writer, proceed.

                break;

            }

 

            // Back off, to let the writer go through.

            Interlocked.Decrement(ref m_readers[tid].m_taken);

        }

    }

 

    public void EnterWriteLock() {

        SPW sw = new SPW();

        while (true) {

            if (m_writer == 0 && Interlocked.Exchange(ref m_writer, 1) == 0) {

                // We now hold the write lock, and prevent new readers.

                // But we must ensure no readers exist before proceeding.

                for (int i = 0; i < m_readers.Length; i++)

                    while (m_readers[i].m_taken != 0) sw.SpinOnce();

 

                break;

            }

 

            // We failed to take the write lock; wait a bit and retry.

            sw.SpinOnce();

        }

    }

 

    public void ExitReadLock() {

        // Just note that the current reader has left the lock.

        Interlocked.Decrement(ref m_readers[ReadLockIndex].m_taken);

    }

 

    public void ExitWriteLock() {

        // No need for a CAS.

        m_writer = 0;

    }

}

 

struct SPW {

    private int m_count;

    internal void SpinOnce() {

        if (m_count++ > 32) {

            Thread.Sleep(0);

        } else if (m_count > 12) {

            Thread.Yield();

        } else {

            Thread.SpinWait(2 << m_count);

        }

    }

}

2/20/2009 7:33:45 PM (Pacific Standard Time, UTC-08:00)  #   

 

Recent Entries:

Search:

Browse by Date:
<May 2009>
SunMonTueWedThuFriSat
262728293012
3456789
10111213141516
17181920212223
24252627282930
31123456

Browse by Category:

Notables: