RSS 2.0

Personal Info:

Joe Send mail to the author(s) leads the architecture of an experimental OS's developer platform, where he is also chief architect of its programming language. His current mission is to enable writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe founded the Parallel Extensions to .NET project. He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife, writing books, writing music, studying music theory & mathematics, and doing anything involving food & wine.

My books

My music

Disclaimer:
The content of this site are my own personal opinions and do not represent my employer's view in anyway.

© 2012, Joe Duffy

 
 Sunday, November 01, 2009

Say you've got a Task<T>.  Well, now what?

You know that eventually a T will become available, but until then you're out of luck.  You could go ahead and be a naughty little devil by calling Wait on it -- blocking the current thread (eek!) -- or you could call ContinueWith on the task to get back a new Task<U>, representing the work you would do to create some new U object if only you presently had a T in hand.  And then perhaps you will find yourself in the same situation for that U.

These are those dataflow graphs I mentioned in the previous blog post.  Things of beauty.

To be more concrete about the situation I describe, imagine you've got the following IFoo interface:

interface IFoo

{

    int Bar();

    string Baz(int x);

}

Now, given a Task<IFoo>, you can't do anything related to an IFoo.  And yet presumably that's why you've got the task in the first place: because you care about the IFoo.  What if you ultimately want to invoke the Bar method, for example?

Task<IFoo> task = ...;

You can of course block the thread:

// Option A: block the thread.

int resultA = task.Result.Bar();

...

Or you can choose to program in a very clunky way:

// Option B: use dataflow.

Task<int> resultB = task.ContinueWith(t => t.Result.Bar());

But what if, instead, you could do something like this?

// Option C: magic.

Task<int> resultC = task.Bar();

Whoa, wait a minute.  We're calling Bar() on a Task<IFoo>?  Neat, but how can that be?

This is obviously a trick.  All of the members of T are somehow being made available on the Task<T> object, so that they can be called before the task has actually been resolved to a concrete value.  Of course, were we to allow this, what you get back to represent the result of such calls would need to be task objects too: hence we get back a Task<int> from the call on Bar(), instead of an int.  This is similar to call streams in Barbara Liskov's Argus language (her primary focus immediately after CLU).

This kind of lifting from the inner type outward is much like what you get in languages that allow generic mixins.  C# already has one semi-such type, though you may not realize it: Nullable<T> actually allows you to directly access interfaces implemented by T without needing to call Value on it.  It's almost like Nullable<T> was defined as deriving from T itself which is clearly not actually possible (for numerous reasons, not the least significant of which is that it's a struct).  Try it.  This works because the type system treats Nullable<T> and T somewhat uniformly (though you'd be surprised by some dangers lurking within -- effectively Nullable<T> mustn't implement any interfaces *ever* otherwise a type hole would result).  But I digress...

Unfortunately without deep language changes we can't get this to work the way we'd like.  I have found numerous occasions where a general lifting capability in C# would be useful: Lazy<T> is but one example.  That said, each time we run across an instance, it demands slightly different type system treatment, and it seems unlikely such a general facility would be as usable as the one off features.

Type systems aside, I am actually using a very dirty trick to make this work: I'm using the new System.Dynamic features in .NET 4.0 to do it all dynamically.  You may love or hate this, depending on your stance on type systems.  Being an ML guy, I'll let you figure out what I think.  (Hint: gross hack!)

We can go further.  (Although sadly I won't demonstrate how to do so in this blog post.  I had wanted to go all the way, but need to get some actual language work done today, in addition to a little Riemann study, instead of having endless fun tinkering with Visual Studio 2010.  Shucks.)  Notice that Baz accepts an int as input.  Well, what if all we've got is a Task<int>?  We can of course also allow that to get passed in too:

Task<string> resultD = task.Baz(42); // Real input.  Fine.

Task<int> arg = ...;

Task<string> resultE = task.Baz(arg); // A task as input!  Cool!

But wait, there is more!  It slices and dices too.  The next trick is difficult -- if not impossible -- to do without far reaching language changes.  But we could also even bridge the world of ordinary methods too, not just those that have been accessed by tunneling through a Task<T>.  For example:

string f(int x) {...}

...

Task<int> task = ...;

Task<string> result = f(task);

Not to even mention:

Task<int> x = ...;

Task<int> y = ...;

Task<int> z = x + y;

This is deep.  What we are saying is that anywhere a T is expected, we can supply a Task<T>.  Of course once we've entered the world of tasks, we cannot escape until values actually begin resolving.  So when we invoke the method f in this example, we of course get back a Task<string> for its result.  Once we've stepped onto a turtle's back, well, it's turtles all the way down.

(Which reminds me of the well known tale:

A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the center of a vast collection of stars called our galaxy. At the end of the lecture, a little old lady at the back of the room got up and said: "What you have told us is rubbish. The world is really a flat plate supported on the back of a giant tortoise." The scientist gave a superior smile before replying, "What is the tortoise standing on?" "You're very clever, young man, very clever", said the old lady. "But it's turtles all the way down!"

Tasks are not greasy hamburgers after all, as I had claimed in the last post, but rather they are turtles.

I've wasted all of my energy speaking of turtle hamburgers drenched in asynchronous aioli, and have left only a little to go over the hacked up implementation of this idea.  Sigh.  Well, we had better get to it.)

In summary: we'll just rely on dynamic dispatch to do the lifting, thanks to the new .NET 4.0 DynamicObject class.  This is wildly less efficient than a proper type system design would yield, not to mention the utter lack of static type checking.  Of course a proper implementation that designed for this from Day One would also avoid the tremendous amount of object allocation that relying on the current Task<T> objects and ContinueWith overloads imply.  But nevertheless, this approach will allow us to at least have a good ole' time and stimulate the creative side of the noggin.

First, I shall provide an extension method for getting a DynamicTask<T> -- the thing that actually derives from DynamicObject and implements the custom dynamic binding:

public static class DynamicTask

{

    public static dynamic AsDynamic<T>(this Task<T> task)

    {

        return new DynamicTask<T>(task);

    }

}

Notice that this changes our calling conventions ever so slightly.  Namely:

// Option C: magic.

Task<int> resultC = task.AsDynamic().Bar();

The AsDynamic places the caller into the lifted context.  As invocations are made, the results become real tasks, and not dynamic ones, such that to continue the calling will require many AsDynamic()s.  This is a minor inconvenience and we could certainly automatically wrap the return values in DynamicTask<T> objects if we wanted to eliminate this problem, i.e. to make chaining less verbose.

Second, we must implement the DynamicTask<T> class.  We will do a very simple translation.  Given a member access expression 'x.m', where m is either a field or property of type U, we will morph this into the new expression 'x.Task.ContinueWith(v => v.Result.m)', which is of type Task<U>.  Similarly, given a method invocation 'x.M(a1,...,aN)', whose return value is of type U, we will morph it into the new expression 'x.Task.ContinueWith(v => v.Result.M(a1,...,aN))', which is of type Task<U> (or just Task if U is the void type).  To support the ability to pass a task argument where an actual one is expected would require packing the argument with the target into an array, and doing a ContinueWhenAll on it.

(Perhaps I will illustrate how to do these other tricks in a later post, but I'm tight for time right now.  I'm only sketching the general idea.  Even in what I show below, things will be incomplete, because topics such as getting exception propagation right when tasks begin failing are tricky.  Ideally the whole dataflow chain will be "broken" by such an exception.  Additionally, I've only implemented what was necessary to get a few interesting examples working.  The binder, for example, certainly has a few loose ends.  Blog reader beware.)

Here is the implementation of DynamicTask<T>:

public class DynamicTask<T> : DynamicObject

{

    private Task<T> m_task;

 

    public DynamicTask(Task<T> task)

    {

        if (task == null) {

            throw new ArgumentNullException("task");

        }

        m_task = task;

    }

 

    public Task<T> Task {

        get { return m_task; }

    }

 

    public override DynamicMetaObject GetMetaObject(Expression parameter) {

        if (parameter == null) {

            throw new Exception("parameter");

        }

        return new TaskLiftedObject(this, parameter);

    }

 

    class TaskLiftedObject : DynamicMetaObject

    {

        ...

    }

}

Simple.  All of the dynamic magic resides in the implementation of TaskLiftedObject, which derives from the DynamicMetaObject class.  It is constructed with an instance of the DynamicTask<T> along with the expression tree that can be used to dynamically load up an instance of that task.  All of the dynamic features work with expression trees.  For example, in response to an attempt to invoke a method M on a DynamicTask<T>, our binder will need to find the right method M on the underlying T, and then return an expression tree that does the ContinueWith and so forth.

Let's start cracking open TaskLiftedObject:

    class TaskLiftedObject : DynamicMetaObject

    {

        private DynamicTask<T> m_task;

 

        public TaskLiftedObject(DynamicTask<T> task, Expression expression) :

            base(expression, BindingRestrictions.Empty, task)

        {

            m_task = task;

        }

We will override two of DynamicMetaObject's functions.  BindGetMember is called when a member is accessed (like a property or field), whereas BindInvokeMember is called when a method call is made.  There are several other methods that a proper binder would need to override in order to make delegate dispatch and such work properly.  But this suffices to get started:

        public override DynamicMetaObject BindGetMember(GetMemberBinder binder)

        {

            // We have a member access:

            //     x.m

            //

            // which must become:

            //     x.Task.ContinueWith(v => { v.Result.m; })

            //

 

            return new DynamicMetaObject(

                MakeContinuationTask(Bind(binder.Name, -1), null),

                BindingRestrictions.GetInstanceRestriction(Expression, Value),

                Value

            );

        }

 

        public override DynamicMetaObject BindInvokeMember(InvokeMemberBinder binder, DynamicMetaObject[] args)

        {

            // We have a call:

            //     x.Foo(a1,...,aN)

            //

            // which must become:

            //     x.Task.ContinueWith(v => { v.Result.Foo(a1,...,aN); })

            //

 

            Expression[] argsEx = new Expression[args.Length];

            for (int i = 0; i < args.Length; i++) {

                argsEx[i] = args[i].Expression;

            }

 

            return new DynamicMetaObject(

                MakeContinuationTask(Bind(binder.Name, binder.CallInfo.ArgumentCount), argsEx),

                BindingRestrictions.GetInstanceRestriction(Expression, Value),

                Value

            );

        }

Clearly the workhorses here are Bind and MakeContinuationTask.  Bind is responsible for performing dynamic lookup for a matching member on T that has the requested Name and, if a method call is being made, the proper number of parameters.  For brevity, I've omitted anything to do with argument type checking, an obvious hole that we'd want to fix some day:

        private static MemberInfo Bind(string name, int argCount)

        {

            // Lookup the target member on the T, rather than the (Dynamic)Task<T>.

            return

                (from m in typeof(T).GetMembers(BindingFlags.Instance | BindingFlags.Public)

                 where m.Name.Equals(name) &&

                       (argCount == -1 ?

                           !(m is MethodInfo) :

                           ((MethodInfo)m).GetParameters().Length == argCount)

                 select m).

                Single();

        }

Nothing too interesting here either -- just a bit of hacky reflection code done with a fancy LINQ query.  If anything other than exactly one method was found, the call to Single() will throw an exception.  If you want to see what a "real" dynamic binder looks like, you won't find it here: check out VB's or IronPython's.

Now for the meat.  The MakeContinuationTask method takes the target member that we've found dynamically via Bind, as well as an optional array of expression trees, each representing an argument being passed to the target method (and which will be null for property and field access), and manufactures the expression tree that represents the execution of the dynamic call itself:

        private Expression MakeContinuationTask(MemberInfo target, Expression[] targetArgs)

        {

            var lambdaParam = Expression.Parameter(typeof(Task<T>), "v");

            var lambdaParamResult = Expression.Property(lambdaParam, "Result");

 

            Expression lambdaBody;

            Type lambdaReturnType;

            if (target is MethodInfo) {

                lambdaBody = Expression.Call(lambdaParamResult, (MethodInfo)target, targetArgs);

                lambdaReturnType = ((MethodInfo)target).ReturnParameter.ParameterType;

            }

            else if (target is PropertyInfo) {

                lambdaBody = Expression.Property(lambdaParamResult, (PropertyInfo)target);

                lambdaReturnType = ((PropertyInfo)target).PropertyType;

            }

            else if (target is FieldInfo) {

                lambdaBody = Expression.Field(lambdaParamResult, (FieldInfo)target);

                lambdaReturnType = ((FieldInfo)target).FieldType;

            }

            else {

                throw new Exception("Unsupported dynamic invoke: " + target.GetType().Name);

            }

 

            return Expression.Call(

                Expression.Property(

                    Expression.Convert(this.Expression, typeof(DynamicTask<T>)),

                    typeof(DynamicTask<T>).GetProperty("Task")

                ),

                GetContinueWith(lambdaReturnType), // ContinueWith

                new Expression[] {

                    // v => { v.Result.M(a0,...,aN) }

                    Expression.Lambda(lambdaBody, lambdaParam)

                }

            );

        }

You should be able to convince yourself that this code generates the desired transformation described earlier.  It uses a method to find the overload of Task<T>.ContinueWith that we want to bind against, and invokes that on the Task<T> contained within the DynamicTask<T> against which the dynamic call was made.  It is rather unfortunate that the CLR does not allow the void type as a generic type argument, so we have to be a little bit inconsistent with our treatment of void returns, by choosing a different ContinueWith overload.

If the above reflection code was called hacky, the ContinueWith lookup is worse.  It's very inefficient, not to mention fragile (because it depends on the current layout of Task<T>'s overloads, what with instantiating generic methods and the like).  C'est la vie:

        private static MethodInfo GetContinueWith(Type returnType)

        {

            // @TODO: caching to avoid expensive lookups each time.

            if (returnType == typeof(void)) {

                return typeof(Task<T>).GetMethod(

                    "ContinueWith",

                    new Type[] { typeof(Action<Task<T>>) }

                );

            }

            else {

                foreach (MethodInfo mif in typeof(Task<T>).GetMethods()) {

                    if (mif.Name == "ContinueWith" && mif.IsGenericMethodDefinition) {

                        MethodInfo mifOfT = mif.MakeGenericMethod(returnType);

                        ParameterInfo[] mifParams = mifOfT.GetParameters();

                        if (mifParams.Length == 1 &&

                                mifParams[0].ParameterType == typeof(Func<,>).MakeGenericType(typeof(Task<T>), returnType)) {

                            return mifOfT;

                        }

                    }

                }

            }

 

            throw new Exception("Fatal error: ContinueWith overload not found");

        }

    }

And that's it.  With that, we can get dynamic invocations on unresolved T's via Task<T> objects.  Nifty.

I'm not saying any of this is a really good idea.  Honestly, I'm not.  Of course, there's a kernel of a good idea there and the systems we are working on take this kernel to its extreme.  By providing a programming model that encourages deep chains of datafow to be expressed speculatively in a natural and familiar manner, greater degrees of latent parallelism can lie resident in an application waiting to be unlocked as more processors become available.  Doing it for real requires impactful changes to the language, supporting infrastructure, and particularly tooling.  Just imagine what it means to break into a debugger to inspect deep dataflow graphs that have been constructed by compiler magic underneath you.  And the use of ContinueWith is a little lame, because of course the target of our call may be something that can be run speculatively too with first class pipleining, rather than completely delaying the invocation of it.

So we won't be seeing lifted tasks in .NET anytime soon.  Writing up this blog post was merely an excuse to toy around with the new C# dynamic features and to have a little recreational time.   And to generate excitement about what .NET 4.0 holds in store.  I hope you have enjoyed it.  Now back to reality.

11/1/2009 1:49:28 PM (Pacific Standard Time, UTC-08:00)  #   

 Saturday, October 31, 2009

Well, Visual Studio 2010 Beta 2 is out on the street.  It contains plenty of neat new things to keep one busy for at least a rainy Saturday.  I proved this today.

Of course, Parallel Extensions is in the box.  .NET 4.0's Task and Task<T> abstractions are used to implement such things as PLINQ and Parallel.For loops, but of course they are great for representing asynchronous work too.  The FromAsync adapters move you from the dark ages of IAsyncResult to the glitzy new space age of tasks.

Not only are tasks tastier than hamburgers, but they enable complex dataflow graphs of asynchronous work to unfold dynamically at runtime, thanks to the ContinueWith method.  From a Task<T> you can get a Task<U> that was computed based on the T; ad infinitum.  We like dataflow.  It is the key to unlocking parallelism, or more accurately, boiling away all else except for dataflow is the key.  But what about control flow, you might ask?  We like it less.  But you can do it, so long as you put in some work.  F#'s async workflows make this sort of thing a tad easier, but the raw libraries in .NET 4.0 don't come with any sort of loops or conditional capabilities.  Perhaps in the future they will.  Nevertheless, in this post I shall demonstrate how to build a couple simple ones.

Not because the lack of them is going to cause unprecidented and unheard of horrors, but rather because in doing so we'll see some neat features of tasks.

The two methods I will illustrate in this post are:

public static class TaskControlFlow

{

    public static Task For(int from, int to, Func<int, Task> body, int width)

    public static Task While(Func<int, bool> condition, Func<int, Task> body, int width)

}

Notice that each body is given the iteration index and is expected to launch asynchronous work and return a Task.  The parameters that these methods take are probably obvious.  Well, except for the last one.  The "width" indicates how many outstanding asynchronous bodies should be in flight at once.  The Task returned by For and While won't be considered done until all iterations are done, and any exceptions will be propagated as you might hope.  It would be pretty useless otherwise.

For example, we could write a while loop that does something very silly:

TaskControlFlow.While(

    i => i < 100,

    i => { return CreateTimerTask(250).ContinueWith(_ => Console.WriteLine(i)); },

    4

).Wait();

This just prints returns a "timer task" that completes after 250ms and prints out the iteration to the console. We pass a width of 4, so only four tasks will be outstanding at any given time.  Notice we call Wait at the end, since both For and While return tasks representing the in flight work.  This could have instead been written using a For loop as follows:

TaskControlFlow.For(0, 100,

    i => { return CreateTimerTask(250).ContinueWith(_ => Console.WriteLine(i)); },

    4

).Wait();

The CreateTimerTask method, by the way, looks like this:

private static Task CreateTimerTask(int ms)

{

    var tcs = new TaskCompletionSource<bool>();

    new Timer(x => ((TaskCompletionSource<bool>)x).SetResult(true), tcs, ms, -1);

    return tcs.Task;

}

As something more realistic, imagine we wanted to do something with a large number of files, and don't want to block a whole bunch of threads in the process.  The following "simple" expression will count up all of the bytes for all of the files in a particular directory, without once blocking the thread -- well, except for the initial call to Directory.GetFiles:

string win = "c:\\...\\";

string[] files = Directory.GetFiles(win);

int total = 0;

TaskControlFlow.For(0, files.Length,

    i => {

        bool eof = false;

        int offset = 0;

        byte[] buff = new byte[4096];

        FileStream fs = File.OpenRead(files[i]);

        return TaskControlFlow.While(

            j => !eof,

            j => Task<int>.Factory.

                FromAsync<byte[],int,int>(

                    fs.BeginRead, fs.EndRead, buff, offset, buff.Length,

                    null, TaskCreationOptions.None

                ).

                ContinueWith(v => {

                    if (eof = v.Result < buff.Length) {

                        fs.Close();

                    }

                    offset += v.Result;

                    Interlocked.Add(ref total, v.Result);

                }),

                1

        );

    },

    8

).Wait();

Console.WriteLine(total);

Pretty neat.  We've somewhat arbitrarily chosen a width of 8 for this loop.  And notice something very subtle but important here: we've chosen a width of 1 for the inner loop that plows through the bytes of a file.  This is because we're sharing state, and it would not be safe to launch numerous iterations at once.  The same byte[], eof variable, and so forth, would become corrupt.  I will mention in passing that it's unfortunate that we've got that interlocked stuck in there to add to the total.  Refactoring this so that we could just do a LINQ reduce over the whole thing would be nice.  Indeed, it can be done.

We can do away with the For implementation very quickly.  It is just implemented in terms of While:

public static Task For(int from, int to, Func<int, Task> body, int width)

{

    return While(i => from + i < to, body, width);

}

And it turns out that the While implementation is not terribly complicated either.  Here it is:

public static Task While(Func<int, bool> condition, Func<int, Task> body, int width)

{

    var tcs = new TaskCompletionSource<bool>();

 

    int currIx = -1; // Current shared index.

    int currCount = width; // The number of outstanding tasks.

 

    int canceled = 0; // 1 if at least one body was cancelled.

    ConcurrentBag<Exception> exceptions = null; // A collection of exceptions, if any.

 

    // Generate a continuation action: this fires for each body that completes.

    Action<Task> fcont = null;

    fcont = tsk => {

        if (tsk.IsFaulted) {

            // Accumulate exceptions.

            LazyInitializer.EnsureInitialized(ref exceptions);

            foreach (Exception inner in tsk.Exception.InnerExceptions) {

                exceptions.Add(inner);

            }

        }

        else if (tsk.IsCanceled) {

            // Mark that cancellation has occurred.

            canceled = 1;

        }

        else if (canceled == 0 && exceptions == null) {

            // If no cancellations / exceptions are found, attempt to kick off more work.

            int ix = Interlocked.Increment(ref currIx);

            if (condition(ix)) {

                // Generate a new body task, handling exceptions.  Then make sure we

                // tack on the continuation on that new task, so we can keep on going...

                // If the condition yielded 'false', we'll simply fall through and try to finish.

                Task btsk;

                try {

                    btsk = body(ix);

                }

                catch (Exception ex) {

                    btsk = AlreadyFaulted(ex);

                }

                btsk.ContinueWith(fcont);

                return;

            }

        }

 

        // If this is the last task, signal completion.

        if (Interlocked.Decrement(ref currCount) == 0) {

            if (exceptions != null) {

                tcs.SetException(exceptions);

            }

            else if (canceled == 1) {

                tcs.SetCanceled();

            }

            else {

                tcs.SetResult(true);

            }

        }

    };

 

    // Fire off the right number of starting tasks.

    for (int i = 0; i < width; i++) {

        AlreadyDone.ContinueWith(fcont);

    }

 

    return tcs.Task;

}

I've commented the code inline to illustrate what is going on.  The only other part that isn't shown are the AlreadyDone and AlreadyFaulted members, which simply give Tasks that are already in a final state.  This isn't strictly necessary, but come in handy in a number of situations:

internal static Task AlreadyDone;

 

static TaskControlFlow()

{

    var tcs = new TaskCompletionSource<bool>();

    tcs.SetResult(true);

    AlreadyDone = tcs.Task;

}

 

private static Task AlreadyFaulted(Exception ex)

{

    var tcs = new TaskCompletionSource<bool>();

    tcs.SetException(ex);

    return tcs.Task;

}

And that's it.  I'm done for now.  Hope you enjoyed it.  I've got a few other posts in the works -- primarily the result of a day full of hacking (I got in the office at 7am this morning, and have been here ever since, 14 hours later) -- demonstrating how to do speculative asynchronous work for if/else branches.  Finally, I also have a neat example that illustrates how to do deep dataflow-based speculation without having to wait for work to complete.  This combines the new .NET 4.0 dynamic capabilities with parallelism, so I'm pretty excited to get it working and write about it.

10/31/2009 8:06:24 PM (Pacific Daylight Time, UTC-08:00)  #   

My team is hiring.  For example:

https://careers.microsoft.com/JobDetails.aspx?jid=7594&lang=en

As the job description says, we are focused mainly concurrency & parallelism at each layer of the system: kernel-mode, user-mode, frameworks, languages, compilers, ...

The sky is the limit and nothing is off the table.  I've never had so much fun in my life as I am having right now, and couldn't imagine a better group of people to do it with.  I think this is what NT must have felt like.

If you're interested, email me at joedu AT microsoft DOT com, or apply via that link.

10/31/2009 2:02:15 PM (Pacific Daylight Time, UTC-07:00)  #   

 Monday, October 19, 2009

Embarrassingly, I neglected to write about the oldest trick in the book in my last post: designing the producer/consumer data structure to reduce false sharing.  As I've written about several times previously (e.g. see here), and more so in the book, false sharing is always deadly and must be avoided.

As a simple example, consider a program that merely increments a shared counter over and over again.  If we give P threads their own separate counters, and ask them to increment the respective counter an equal number of times.  Each thread can of course do this without synchronization, because the counters are distinct: no locks or even interlocked operations are necessary.  Naively, one might expect that running P of them in parallel leads to no interference, and hence perfect parallelization.  However, when I run a little benchmark on my 8-way machine, the numbers for increasing values of P tell a very different story:

1 = 22425789

2 = 42023726   (187%)

4 = 175828522  (784%)

8 = 333906288  (1489%)

It is clear that the throughput drops dramatically as P increases.  The reason?  Each counter, being only 8 bytes wide, shares a cache line with as many as 7 other counters -- or 15 if we're on a machine with 128 byte cache lines.  A simple change to the counter's layout, so that individual counters do not share the same cache line, will remedy the situation.  The numbers improve dramatically.  In fact, they remain constant no matter the value of P:

1 = 21914250

2 = 21900392   (100%)

4 = 21865781   (100%)

8 = 21934008   (100%)

This perfect scaling isn't always possible due to memory bandwidth, but because we're just incrementing a single counter per core this doesn't manifest as a problem.

For what it's worth, the machine I am running these tests on is an 8-way, dual-socket, quad-core.  Pairs of cores share an L1 cache, and all cores in a socket share an L2 cache.  So the pairs {0,1}, {2,3}, {4,5}, and {6,7} are each expected to have distinct L1 caches and the groups {0,1,2,3} and {4,5,6,7} are expect to have distinct L2 caches.  The 2 number above is run with two threads affinitized such that they share the same L1 cache.  If we force them apart, however, we get slightly different results:

2 = 42023726   (187%) -- same L1 cache

2 = 54706505   (244%) -- same L2 cache

2 = 75030977   (335%) -- separate sockets

As expected, the more distance in the cache hierarchy, the greater the slowdown due to the increased ping pong paths.

The specific results are of course unique to my machine, but nevertheless the conclusion is clear: reducing sharing leads to substantial performance gains, particularly with large numbers of threads hammering on the shared lines.  Often more so than eliminating other sources of wasted cycles, like interlocked operations.  Eliminating those sources is clearly important too, but it really is amazing how deadly and yet difficult to discover false sharing can be: few cases are as obvious as this one.

One aside is worth mentioning before winding down.  When I first ran this experiment, I had done it two ways: (1) with fields of a shared object, then using StructLayout(LayoutKind=Explicit) to keep fields apart, and (2) with counters crammed into an array, which then contains padding elements to eliminate the false sharing.  The former is shown above.  If you try the latter, you may be surprised.  The layout of arrays on the CLR is such that an array's length resides before the first element.  So unless you pad the first element of the array, all accesses will perform bounds checking that touches the first element's line.  Because this line is being mutated by the thread incrementing the first counter, terrible false sharing results.  To solve this, we must pad the first element too.

For example, here are the array numbers with false sharing:

1 = 27366202

2 = 125264714  (458%)

4 = 1383953372 (7969%)

8 = 3136996731 (11463%)

Notice the P = 8 case is over 100x slower!  Yowzas.  After fixing things, with the first element padded, we again observe perfect scaling:

1 = 27393869

2 = 27465999   (100%)

4 = 27370901   (100%)

8 = 27408631   (100%)

Clearly false sharing is not merely a theoretical concern.  In fact, during our Beta1 performance milestone in Parallel Extensions, most of our performance problems came down to stamping out false sharing in numerous places: the partitioning logic of parallel for loops, polling cancellation token flags, enumerators allocated at the beginning of a PLINQ query and constantly mutated during its execution, and even in our examples (e.g. see Herb's matrix multiplication example), etc.  It is terribly simple to make a mistake and, in a complicated system, terribly difficult to pinpoint the origin of what can be a truly crippling scalability bottleneck.

In the next post, we will go back and take a look at our single-producer / single-consumer buffer, and redesign it to have substantially better cache behavior.

~

For reference, here's the basic program used for a lot of these tests:

//#define CACHE_FRIENDLY

//#define USE_ARRAY

 

#pragma warning disable 0169

 

using System;

using System.Diagnostics;

using System.Runtime.InteropServices;

using System.Threading;

 

class Program

{

    const int P = 1;

 

#if USE_ARRAY

    class Counters

    {

        long[] m_longs;

 

        internal Counters(int n) {

#if CACHE_FRIENDLY

            m_longs = new long[(n+1)*16];

#else

            m_longs = new long[n];

#endif

        }

 

        public void Increment(int i) {

#if CACHE_FRIENDLY

            m_longs[(i+1)*16]++;

#else

            m_longs[i]++;

#endif

        }       

    }

#else // USE_ARRAY

#if CACHE_FRIENDLY

    [StructLayout(LayoutKind.Explicit)]

#endif

    struct Counters

    {

#if CACHE_FRIENDLY

        [FieldOffset(0)]

#endif

        public long a;

#if CACHE_FRIENDLY

        [FieldOffset(128)]

#endif

        public long b;

#if CACHE_FRIENDLY

        [FieldOffset(256)]

#endif

        public long c;

#if CACHE_FRIENDLY

        [FieldOffset(384)]

#endif

        public long d;

#if CACHE_FRIENDLY

        [FieldOffset(512)]

#endif

        public long e;

#if CACHE_FRIENDLY

        [FieldOffset(640)]

#endif

        public long f;

#if CACHE_FRIENDLY

        [FieldOffset(768)]

#endif

        public long g;

#if CACHE_FRIENDLY

        [FieldOffset(896)]

#endif

        public long h;

    }

 

    static Counters s_c = new Counters();

#endif // USE_ARRAY

 

    public static void Main(string[] args)

    {

        int p = int.Parse(args[0]);

        const int iterations = int.MaxValue / 4;

        ManualResetEvent mre = new ManualResetEvent(false);

 

#if USE_ARRAY

        Counters c = new Counters(p);

#endif

 

        Thread[] ts = new Thread[p];

        for (int i = 0; i < ts.Length; i++) {

            int tid = i;

            ts[i] = new Thread(delegate() {

                SetThreadAffinityMask(GetCurrentThread(), new UIntPtr(1u << tid));

                mre.WaitOne();

                for (int j = 0; j < iterations; j++)

#if USE_ARRAY

                    c.Increment(tid);

#else

                    switch (tid) {

                        case 0: s_c.a++; break;

                        case 1: s_c.b++; break;

                        case 2: s_c.c++; break;

                        case 3: s_c.d++; break;

                        case 4: s_c.e++; break;

                        case 5: s_c.f++; break;

                        case 6: s_c.g++; break;

                        case 7: s_c.h++; break;

                    }

#endif

            });

            ts[i].Start();

        }

 

        Stopwatch sw = Stopwatch.StartNew();

        mre.Set();

        foreach (Thread t in ts) t.Join();

        Console.WriteLine(sw.ElapsedTicks);

    }

 

    [System.Runtime.InteropServices.DllImport("kernel32.dll")]

    static extern IntPtr GetCurrentThread();

 

    [System.Runtime.InteropServices.DllImport("kernel32.dll")]

    static extern UIntPtr SetThreadAffinityMask(IntPtr hThread, UIntPtr dwThreadAffinityMask);

}

10/19/2009 5:59:20 PM (Pacific Daylight Time, UTC-07:00)  #   

 Sunday, October 04, 2009

Commonly two threads must communicate with one another, typically to exchange some piece of information.  This arises in low-level shared memory synchronization as in PLINQ’s asynchronous data merging, in the implementation of higher level patterns like message passing, inter-process communication, and in countless other situations.  If only two agents partake in this arrangement, however, it is possible to implement a highly efficient exchange protocol.  Although the situation is rather special, exploiting this opportunity can lead to some interesting performance benefits.

The standard technique for shared-memory situations is to use a ring buffer.  This buffer is ordinarily an array of fixed length that may become full or empty.  The two threads in this arrangement assume the role of producer and consumer: the producer adds data to the buffer and may make it full, whereas the consumer removes data from the buffer and may make it empty.  It is possible to generalize this to multi-consumers or multi-producers, with some added cost to synchronization.  What is described below is for the two thread case.

We will call this a ProducerConsumerRendezvousBuffer<T>, and its basic structure looks like this:

using System;

using System.Threading;

 

public class ProducerConsumerRendezvousPoint<T>

{

    private T[] m_buffer;

    private volatile int m_consumerIndex;

    private volatile int m_consumerWaiting;

    private AutoResetEvent m_consumerEvent;

    private volatile int m_producerIndex;

    private volatile int m_producerWaiting;

    private AutoResetEvent m_producerEvent;

 

    public ProducerConsumerRendezvousPoint(int capacity)

    {

        if (capacity < 2) throw new ArgumentOutOfRangeException("capacity");

        m_buffer = new T[capacity];

        m_consumerEvent = new AutoResetEvent(false);

        m_producerEvent = new AutoResetEvent(false);

    }

 

    private int Capacity

    {

        get { return m_buffer.Length; }

    }

 

    private bool IsEmpty

    {

        get { return (m_consumerIndex == m_producerIndex); }

    }

 

    private bool IsFull

    {

        get { return (((m_producerIndex + 1) % Capacity) == m_consumerIndex); }

    }

 

    public void Enqueue(T value)

    {

        ...

    }

 

    public T Dequeue()

    {

        ...

    }

}

There are some basic invariants to call out:

  • Our buffer holds our elements, producer index says at what position the next element enqueued will be stored, and the consumer index says from what position the next request to dequeue an element will retrieve its value.
  • We reserve one element in our buffer to differentiate between fullness and emptiness.  This is why we demand that capacity be >= 2.  We could alternatively know how to distinguish between a free slot and a used one, such as checking for null, but keep things simple for now.
  • Thus, IsEmpty is true when the consumer and producer index are the same.  Whereas IsFull is true when the consumer is one ahead of the producer, such that producing would make the producer index collide with the consumer index (otherwise leading to IsEmpty).
  • It should be obvious that our intent is to block consumption when IsEmpty == true and production when IsFull == true.  This is the point of the waiting flags and events.

Now let us look at the implementation first of Enqueue and then Dequeue, paying special attention to the necessary synchronization operations.  They look nearly identical:

    public void Enqueue(T value)

    {

        if (IsFull) {

            WaitUntilNonFull();

        }

 

        m_buffer[m_producerIndex] = value;

        Interlocked.Exchange(ref m_producerIndex, (m_producerIndex + 1) % Capacity);

 

        if (m_consumerWaiting == 1) {

            m_consumerEvent.Set();

        }

    }

Enqueue begins, as expected, by checking whether the queue is full.  Notice that we have not yet issued any memory fences yet.  The only thread that may make the buffer full is the current one, which will obviously not occur before proceeding, and therefore we needn’t perform any expensive synchronization operation for this check.  The value seen may of course be stale but we can deal with that possibility inside the slow path, WaitUntilNonFull.  We’ll look at that momentarily.

We then proceed to placing the value in the buffer at the current producer’s index.  Only the current thread will update the producer index and a consumer will not read from the current value so long as the producer index refers to it.  The value may not even be written atomically, e.g. for T’s that are greater than a pointer sized word.  This is okay: only the act of incrementing the index allows a consumer to access the element in question.  Writes on the CLR 2.0 memory model are retired in order and the reading side will use an acquire load of the index before accessing the element’s words.  Indeed we could use complicated multipart value types that are comprised of lengthy buffers, header words, and so on.

We then increment the producer index, handling the possibility of wrap-around by modding with the capacity.  This uses an Interlocked.Exchange for one simple reason: we are about to read a consumer waiting flag, and must prevent the load of that flag from moving prior to the producer index write.  The consumer sets this flag when it notices the queue is empty and waits.  This enables us to use a “Dekker style” check to minimize synchronization.  We could have alternatively just unconditionally set the event, doing away with the interlocked operation altogether.  But that call, if it involves kernel transitions, which is quite likely, is going to be much more expensive and would occur on every call to Enqueue.  And any event of this kind that doesn’t require kernel transitions is going to at least require an interlocked operation for the same reason we require one here.  An alternative technique involves setting when we transition the buffer from empty to non-empty or full to non-full, but this wastes a possibly expensive signal if the other party isn't even currently waiting.  If full or empty is a rare situation, then full or empty and with a peer actually physically waiting is even rarer.

Let’s now look at the WaitUntilNonFull method.  It’s really the reverse of what the consumer does, so based on everything said till this point, I am guessing it’s obvious:

    private void WaitUntilNonFull()

    {

        Interlocked.Exchange(ref m_producerWaiting, 1);

        try {

            while (IsFull) {

                m_producerEvent.WaitOne();

            }

        }

        finally {

            m_producerWaiting = 0;

        }

    }

We begin by issuing a memory fence and setting the producer waiting flag.  This memory fence is necessary to advertise that we are about to wait, and also to ensure the subsequent check of IsFull is synchronized.  The consumer does something very much like the producer does (above) after taking an element: if the producer is waiting, the consumer has made space for it and therefore must signal.  But it could be the case that the consumer has already made the queue non-full before it could notice the producer’s waiting flag.  We catch this by ensuring the producer’s check of IsFull cannot go before setting the producer waiting; similarly, the consumer cannot make IsFull false without subsequently noticing that the producer is waiting; this avoids deadlock.

Everything else is self explanatory.  Well, almost.  We need a loop here to catch one subtle situation.  Imagine a producer enters into this method thinking the buffer is full.  It sets its flag, and then immediately notices that the buffer is not full anymore.  A consumer has generated a new item of interest.  But imagine that consumer noticed that the producer was waiting and hence set its event.  This is an auto-reset event, so the next time the producer must wait, the event will have already been set and it’ll likely wake up before IsFull has become true.  An alternative way of dealing with this is to call Reset on the event if we didn’t actually wait on the event, but again we keep things simple.

At this point, the consumer side is going to look very familiar:

    public T Dequeue()

    {

        if (IsEmpty) {

            WaitUntilNonEmpty();

        }

 

        T value = m_buffer[m_consumerIndex];

        m_buffer[m_consumerIndex] = default(T);

        Interlocked.Exchange(ref m_consumerIndex, (m_consumerIndex + 1) % Capacity);

 

        if (m_producerWaiting == 1) {

            m_producerEvent.Set();

        }

 

        return value;

    }

 

    private void WaitUntilNonEmpty() {

        Interlocked.Exchange(ref m_consumerWaiting, 1);

        try {

            while (IsEmpty) {

                m_consumerEvent.WaitOne();

            }

        }

        finally {

            m_consumerWaiting = 0;

        }

    }

This is near-identical to Enqueue and WaitUntilNonFull, and so needs little explanation.  The acquire load inside IsEmpty of the producer index ensures that we observe the producer index for this particular value being beyond the current consumer’s index before loading the value itself, thereby ensuring we see the whole set of written words.  The one other thing to point out is that we “null out” the element consumed which, for large buffers, helps to avoid space leaks that would have otherwise been possible.

There are certainly some opportunities for improving this.

For example, we might add a little bit of spinning in the wait cases.  This would be worthwhile for cases that exchange data at very fast rates and have small buffers, meaning that the chance of hitting empty and full conditions is quite high.  Avoiding the context switch thrashing is likely to lead to hotter caches, because threads will remain runnable for longer, and the raw costs of switching themselves.

Additionally, we technically could use a single event if we wanted to spend the effort.  We’d have to handle a few tricky cases, however: namely, the case where a producer or consumer ends up waiting on an event because it “just missed” the event of interest, thus satisfying the event.  Indeed both threads could actually end up waiting on the event simultaneously and we need to somehow ensure the right one eventually gets awakened.  This leads to some chatter and probably isn’t worth the added complexity.

Here is a peek at some rough numbers from a little benchmark that has two threads enqueuing and dequeuing elements as fast as humanly (or computerly) possible.  This is a particularly unique and unlikely situation, but stresses the implementation in a few interesting ways.  In particular, it will stress the contentious slow paths; although these are expected to be rarer, the fast paths are just so easy to get right in this data structure that they are mostly uninteresting to stress performance-wise.  There are then a few variants, each based on the original version shown above:

  • 2 element capacity, which means we’ll be transitioning from empty to full and back a lot.
  • 1024 element capacity, which means we won’t.
  • With spinning, using .NET 4.0’s new System.Threading.SpinWait type.
  • An implementation that overuses interlocked operations as many naïve programmers would do.

The 2 element capacity situation is common in some message passing systems, e.g. Ada rendezvous, Comega joins, and the like.  Whereas the 1,024 element capacity situation is common for more general purpose channels, where some amount of pipelining is anticipated.

I whipped together a benchmark -- so quickly that we can barely trust it, I might add -- to measure these things.  Here’s a small table, showing the observed relative costs:

                     2 capacity    1024 capacity

As-is  No-spin       100.00%       1.93%

       Spin          56.41%        1.66%

Naïve  No-spin       101.20%       2.09%

       Spin          67.73%        1.87%

As with most microbenchmarks, take the results with a grain of salt.  And there are certainly more interesting variants to compare this against, including a monitor-based implementation that locks around access to the buffer itself.  Nevertheless, we can draw a few conclusions from this: as expected, the version that uses a single interlocked on enqueue and single interlocked on dequeue is faster than the naïve version that uses multiple (surprise!); spinning makes a much more interesting difference on the 2 element capacity situation, as expected, because it reduces the number of context switches dramatically; and, finally, the larger capacity enables a producer to race ahead of the consumer, hence avoiding far fewer transitions from full to empty to full and so forth.

This post was more of a case study than anything else.  There is nothing conclusive or groundbreaking here, and I suppose I should have said that would be the case up front.  That said, I’ve seen this technique used in over a dozen situations in actual product code now, so I figured I’d write a little about it, with a focus on how to minimize the synchronization operations.  We even contemplated shipping such a type in Parallel Extensions to .NET, but it’s just too darn specialized to warrant it.  So the closest thing we provided is BlockingCollection<T>.  Enjoy.

10/4/2009 6:03:17 PM (Pacific Daylight Time, UTC-07:00)  #   

 Monday, September 28, 2009

I've officially started down the long road of writing a 2nd edition of Concurrent Programming on Windows, and would like your help.

There are many great new features in Windows 7 and the next versions of .NET, Visual C++ / CRT, and Visual Studio.  The book will of course cover them all.

But I am also looking to reshape the 1st edition in many dimensions.  I'd like to focus on readability, conciseness, and clearly separating the "must know" topics from the more geeky and advanced ones.  This is a common conundrum when writing a technical book.  The advanced topics are more likely to appeal to readers of my blog, for instance, but may be daunting for newcomers to concurrency.  Tradeoffs abound.  Nevertheless the 2nd edition is likely to be slimmed down compared to the 1st.

Any and all feedback, suggestions, and ideas are welcome.  What did you like about the 1st edition, and what did you not like?  If you could change a handful of things, what would make the top of your list?  And was it missing something crucial that you would like to see covered?  Please send your feedback to joe AT@ acm DOT org, or simply leave comments here on the blog.  Regardless of whether you've read the 1st edition or not.

I sincerely look forward to hearing from you.  Cheers.

9/28/2009 5:47:03 PM (Pacific Daylight Time, UTC-07:00)  #   

 Monday, July 27, 2009

I had originally entitled this post "Having your concurrency cake and eating it too", but it sounded a little too silly.

I have grown convinced over the past few years that taming side effects in our programming languages is a necessary prerequisite to attaining ubiquitous parallelism nirvana.  Although I am continuously exploring the ways in which we can accomplish this -- ranging from shared nothing isolation, to purely functional programming, and anything and everything in between -- what I wonder the most about is whether the development ecosystem at large is ready and willing for such a change.

It is this that I find the most frightening.  I know we can give the world Haskell, or Erlang, or simple incremental steps within familiar environments, like Parallel Extensions.  (Indeed, the world already has these things.)  But elevating effects to a first class concern in day-to-day programming turns out to be a tough pill to swallow.  Particularly since the incremental degrees of parallelism that this switch will unlock are questionable (see this and this); and even if they were pervasive and impressive, it's unclear what percentage of developers will pay what specific price for a 2x, 4x, or even 16x increase in compute performance.  It sounds great on paper, but the cost / benefit equation is a complicated one.

"Pay for play" is the standard terminology we use for such things around here, and the solution needs to have the right amount of it.

Many folks with embarrassingly parallel algorithms will succeed just fine in a shared memory + locks + condition variables world, and indeed have already begun to do so.  And specialized tools -- like GPGPU programming -- have popped up that, when small kernels of computations are written in a highly constrained way, will parallelize, sometimes impressively.  Is this enough?  Perhaps for the next 5 years, but surely not much longer after that.  It is in my opinion qualitatively very important for the future of computer science that we provide programming environments that are more conducive to safe and automatic parallelism.  And yet I cannot stand up with a straight face and proclaim that each and every developer on the face of the planet should practice side effect abstinence.  A healthy balance between cognitive familiarity and pragmatic [r]evolution must be found.  Many promising approaches are in the works (see UIUC's DPJ), but we are years away.

Until then, parallelism on broadly deployed commercial platforms will likely remain in the realm of specialists.

Of course, Haskell and Erlang both accomplish the no effects feat in a sneaky way.  For those interested in foisting parallelism unto the masses, lessons can be learned from these communities.  If you buy into purely functional programming, you necessarily buy into programming without effects, and the (sparing) use of monads to represent them.  (Or, as my colleague Erik calls it, fundamentalist functional programming).)  And if you buy into large scale message passing, you (typically) necessarily also buy into programming without shared memory, leaving behind only strongly isolated effects.  The key here is that developers gain many other benefits by switching to these platforms -- and the lack of effects is admittedly a consequential byproduct of this switch.  The lack of effects are not center stage.  The two approaches have recently begun to converge in what I believe to be the appropriate long-term approach: strong isolation with effects within, and safe, deterministic data parallelism through careful control over sharing, aliasing, and heap separation.

That said, though not center stage, the switch to effectless programming is certainly not painless.

Enabling side effects among otherwise functional code, I think, is a good thing, because it allows familiar algorithms to be encoded in an ordinary imperative way.  Familiarity is key: it may sound two faced, but I don't think parallelism is sufficiently top of mind that developers will want to completely rearrange the way that they write software.  Perhaps we will evolve in this direction, but a significant leap will fall flat.  Moreover, many algorithms actually depend on stateful updates to achieve adequate performance, like write in place graphics buffers.  The Haskell state monad strikes a nice balance between embedding imperative-looking effects, when coupled with the do notation, within a strictly functional language.

Furthermore, I really respect that Haskell discourages cheating.  (Any unsafePerformIO is viewed with great suspicion.)  I quite like mostly-functional programming languages like ML and Scheme, because they tend to be easier on programmers with C backgrounds, but strongly dislike that a mutation can lurk within what appears to be an otherwise pure function.  Documenting side effects in the type system is healthy and allows better symbolic reasoning about the dependencies and implicit parallelism contained within, transitively, while still providing a way to get at effectful programming.  Haskell does a great job at this.  The elimination of dependence ought to be the focus of programmers, and not the elimination of ad-hoc and unstructured access to shared, mutable state.  These are algorithmic and important concerns.

What remains unclear is where the boundaries lie.  Part and parcel of documenting effects is thinking about them when designing your software.  You need to consider whether IList<T>'s Contains method may mutate the list or not, for example, and hold the line on implementations of the interface.  Either it returns an 'a' or an 'IO a' -- and this decision is one that has far reaching implications.  This is a wholly separate kind of interface contract than what most programmers are accustomed to having to think about during the code-debug-edit cycle.  And surely Python and JavaScript developers will not care one way or the other, particularly if it forces more design decisions up front than what is customary today.  This bifurcation seems inevitable, and yet there is substantial crossover: C# developers will write Python scripts, and Python developers will consume components written in C#.

And yet, I think we need to venture down this path in order achieve automatically scalable software.  Parallel computers have become incredibly cheap, and so the historical barriers into high performance technical computing have been whittled away to the software skills necessary to write scalable programs; we will likely succeed at expanding this market without radical changes, but if we stopped there, vast reams of client-side software will be left in the dust.  I've been making inroads into solving the problem on my end, with a new language that sits between C# and Haskell.  I'm biased, have been hard at work on this problem for many years, and yet still struggle to answer these fundamental questions.  I am a big believer that there's got to be a happy medium out there.  But I'm still very perplexed, and face some very high walls to hurdle.  Who will discover the right balance, and when will they do so?

7/27/2009 3:57:18 PM (Pacific Daylight Time, UTC-07:00)  #   

 Monday, July 13, 2009

In this blog post, I'll demonstrate building some very simple (but nice!) synchronization abstractions: a Lock type and a standalone ConditionVariable class.  And we'll use a few new types in .NET 4.0 in the process.  I had to implement a condition variable recently -- the joys of developing a new operating system / platform from the ground up -- and decided to put together a toy example for a blog post as I went.  Warning: this is for educational purposes only.

Not to sound like a broken record, but it is a very good idea to manage locks intentionally.  Doing so makes synchronization code easier to write, understand, and, correspondingly, maintain; given the difficult nature of concurrency, any opportunity for simplification is always welcomed.  Yes, that means avoiding the CLR's dreadful capability to lock on arbitrary objects.  (Which, by the way, is effectively just a holdover from the days where .NET was trying to woo developers from Java onto the platform.)  In retrospect, this ability was a bad idea, and we should have provided and embellished a System.Threading.Lock class from Day One.

Well, rewind the clock and imagine we had provided such a Lock class.  In fact, here's an overly simple one right here.  I'm going to cheat a little, and reuse two locks that come with .NET 4.0: Monitor itself, and the new SpinLock class:

//#define SPIN_LOCK

 

public class Lock

{

#if SPIN_LOCK

    private SpinLock m_slock = new SpinLock();

#else

    private object m_slock = new object();

#endif

    private ThreadLocal<int> m_acquireCount = new ThreadLocal<int>();

 

    public void Enter() {

#if SPIN_LOCK

        bool ignoreTaken;

        m_slock.Enter(ref ignoreTaken);

#else

        Monitor.Enter(m_slock);

#endif

        m_acquireCount.Value = m_acquireCount.Value + 1;

    }

 

    public void Exit() {

        m_acquireCount.Value = m_acquireCount.Value - 1;

#if SPIN_LOCK

        m_slock.Exit();

#else

        Monitor.Exit(m_slock);

#endif

    }

 

    public bool IsHeld {

        get { return m_acquireCount.Value > 0; }

    }

 

    public int RecursionCount {

        get { return m_acquireCount.Value; }

    }

}

Okay, this is not rocket science.  And to be fair, it's missing some critical features, like reliable acquisition (finally available on Monitor in 4.0, and also SpinLock), and lock leveling.  But it's a start.

Once we've got such a Lock class, we may want to extend it with 1st class condition variable support.  Condition variables are core to the monitor concept, and provide a synchronization point that combines a lock with some condition that may be waited upon and triggered.  They help to avoid all the pitfalls of standalone events: mainly missed pulses due to the lack of synchronization involved between producers and consumers.

Furthermore, imagine we allow multiple separate ConditionVariable objects per single Lock object.  This is a feature that Monitor doesn't currently support (though Win32 CONDITION_VARIABLEs do).  This capability would enable us to, say, create a bounded buffer with a single lock to protect the queue, and two separate condition variables: one for the non-empty condition, and the other for the non-full condition.  This simplifes the implementation, and helps to avoid deadlock-prone techniques that result from trying to use multiple separate synchronization objects.

The trick is that the Lock and ConditionVariable class need to be well-integrated.  So we will provide a constructor that accepts a Lock object:

public class ConditionVariable

{

    private Lock m_slock;

 

    public ConditionVariable(Lock slock) {

        if (slock == null)

            throw new ArgumentNullException("slock");

 

        m_slock = slock;

    }

Once we've got that, there are two basic operations to implement: waiting and pulsing (signaling).  To achieve this, we'll give each thread its own ManualResetEventSlim object -- a lightweight event class, new to .NET 4.0.  (Ironically, it uses Monitor.Wait and Pulse under the covers.)  This event will be stored in an instance of the new .NET 4.0 type, ThreadLocal<T>.  (An alternative is to store it in a [ThreadStatic], and reuse the same event across all ConditionVariables.  Since we only support waiting on one such condition at a time (currently), there is no reason we can't just have one per thread.  This is precisely what the CLR does internally, though it's a shame we can't grab hold of that preexisting event.)  In addition to that, we'll need a wait-list, maintained in FIFO order as a Queue<ManualResetEventSlim>:

    private Queue<ManualResetEventSlim> m_waiters =

        new Queue<ManualResetEventSlim>();

    private ThreadLocal<ManualResetEventSlim> m_waitEvent =

        new ThreadLocal<ManualResetEventSlim>();

Waiting does pretty much what you'd imagine.  The m_slock object doubly acts as protection against concurrent access to the waiters list.  So when a Wait call is made, we demand that the lock is held by the calling thread.  Subtly, we also demand that it hasn't been recursively acquired, since that would require exiting the lock multiple times.  This can lead to desynchronization bugs.  Unfortunately, Monitor does this, but is critically broken as a result.  Once the validation occurs, Wait simply places the current thread into the wait list, exits the lock, waits to be awakened, and then reacquires the lock before returning.  This is pretty much exactly what the CLR Monitor class does internally:

    public void Wait() {

        int rcount = m_slock.RecursionCount;

        if (rcount == 0)

            throw new InvalidOperationException("Lock is not held.");

        if (rcount > 1)

            throw new InvalidOperationException("Lock is held recursively.");

 

        // Lazily initialze our event, if necessary.

        ManualResetEventSlim mres = m_waitEvent.Value;

        if (mres == null) {

            mres = m_waitEvent.Value = new ManualResetEventSlim(false);

        }

        else {

            mres.Reset();

        }

 

        m_waiters.Enqueue(mres);

        m_slock.Exit();

        mres.Wait(); // bugbug: interrupt => desync.

        m_slock.Enter();

    }

Lastly, we must implement the Pulse and PulseAll methods.  For kicks, we'll provide an overload of Pulse -- which normally awakens one waiting thread -- that awakens an arbitrary maximum number of threads.  So you could say Pulse(4) to awaken at most 4 threads, for example.  These methods are even simpler than Wait: they dequeue events off the wait list, and just set them.  This awakens the waiters, as desired:

    public void Pulse() {

        Pulse(1);

    }

 

    public void Pulse(int maxPulses) {

        if (!m_slock.IsHeld)

            throw new InvalidOperationException("Lock is not held.");

        for (int i = 0; i < maxPulses; i++) {

            if (m_waiters.Count > 0) {

                m_waiters.Dequeue().Set();

            }

            else {

                break;

            }

        }

    }

 

    public void PulseAll() {

        Pulse(int.MaxValue);

    }

}

(This has the unfortunate side effect of two-step dances.  The pulse will awaken threads at the mres.Wait() line in Wait, and they immediately try to call m_slock.Enter() as a result.  A priority boost may cause them to preempt the pulsing thread, even though they will just end up waiting.  A possible fix to this is to even more tightly integrate the Lock and ConditionVariable classes, by having a "deferred pulse" list attached to the lock.  Once it has been completely exited, the Lock's Exit method could drain the deferred pulse list and awaken the threads, thus avoiding the two-step dance.)

As to examples, let's take a quick peek at a blocking / bounded queue.  When constructed, a capacity is given.  Whenever an Enqueue would cause the buffer's contents to exceed the capacity, the producer is blocked until space is made by a consumer.  Whenever a Dequeue is attempted on an empty buffer, the consumer is blocked until an item is produced.  Though there are opportunities for optimization, this is encoded straightforwardly as follows:

class BlockingQueue<T>

{

    private int m_capacity;

    private Queue<T> m_q;

    private Lock m_qLock;

    private ConditionVariable m_qNonFullCondition;

    private ConditionVariable m_qNonEmptyCondition;

 

    public BlockingQueue(int capacity) {

        m_capacity = capacity;

        m_q = new Queue<T>();

        m_qLock = new Lock();

        m_qNonFullCondition = new ConditionVariable(m_qLock);

        m_qNonEmptyCondition = new ConditionVariable(m_qLock);

    }

 

    public void Enqueue(T item) {

        m_qLock.Enter();

        while (m_q.Count == m_capacity)

            m_qNonFullCondition.Wait();

        m_q.Enqueue(item);

        m_qNonEmptyCondition.Pulse();

        m_qLock.Exit();

    }

 

    public T Dequeue() {

        T item;

 

        m_qLock.Enter();

        while (m_q.Count == 0)

            m_qNonEmptyCondition.Wait();

        item = m_q.Dequeue();

        m_qNonFullCondition.Pulse();

        m_qLock.Exit();

 

        return item;

    }

}

The naive approach typically uses a single event to signal the non-empty / non-full transitions.  The risk of doing this, of course, is that the wrong kind of thread (producer or consumer) will be signaled, depending on chance and wait arrival order.  This is ordinarily only a concern for bounded queues of reasonably small sizes, and high degrees of concurrency, but is still an interesting example of why multiple condition variables per lock is useful.

Enjoy!

7/13/2009 9:52:50 PM (Pacific Daylight Time, UTC-07:00)  #   

 Tuesday, June 23, 2009

I wrote this memo over 2 1/2 years ago about what to do with concurrent exceptions in Parallel Extensions to .NET.  Since Beta1 is now out, I thought posting it may provide some insight into our design decisions.  (And yes, most design discussions start this way.  Somebody develops a personal itch, dives deep into it, and emerges with a proposal for others to vote up, shoot down, or, as is typically the case, somewhere in the middle (provide constructive feedback, iterate, etc).)  I've made only a few slight edits (like replacing code- and type-names), but it's mainly in original form.  I still agree with much of what I wrote, although I'd definitely write it differently today.  And in retrospect, I would have driven harder to get deeper runtime integration.  Perhaps in the next release.

~~~

Concurrency and Exceptions
October, 2006

Exceptions raised inside of concurrent workers must be dealt with in a deliberate way.  Failures can happen concurrently, and yet often the programmer is working with an API that appears to them as though it’s sequential.  The basic question is, then, how do we communicate failure?

The problem

Fork/join concurrency, in which a single “master” thread forks and coordinates with N separate parallel workers, is an incredibly common instance of one of these sequential-looking concurrent operations.  The same callback is run by many threads at once, and may fail zero, one, or multiple times.  The exception propagation problem is inescapable here and comes with a lot of expectations, because the programmer is presented a traditional stack-based function calling interface papered on top of data or task parallelism underneath.

I am faced with the need for a solution to this problem for PLINQ right now and, while I could invent a one-off solution, we owe it to our customers to come up with a common platform-wide approach (or at least ManyCore-wide).  Any solution should compose well across the stack, so that somebody invoking a PLINQ query from within their TPL task that was spawned from a thread pool thread yields the expected and consistent result.  And I would like for us to reach consensus for both managed and native programming models.

Before moving on, there is one non-goal to call out.  Long-running tasks not under the category of fork/join also deserve some attention, because of the ease with which stack traces can be destroyed and the corresponding impact to debugging, but I will ignore them for now.  The problem is not new, exists with the IAsyncResult pattern, and PLINQ doesn’t use this sort of singular asynchronous concurrency.  These cases can typically be trivially solved using existing mechanisms, like standard exception marshaling.

No errors, one error, many errors

To understand the core of the issue, imagine we have an API ‘void ForAll<T>(Action<T> a, T[] d)’.  It takes a delegate and an array, and for every element ‘e’ in ‘d’ invokes the delegate, passing the element, i.e. ‘a(e)’.  If multiple processors are available, the implementation of ForAll may use some heuristic to distribute work among several OS threads, for instance by partitioning the array, probably running one partition on the caller’s thread, and finally joining with these threads before returning so that the caller knows that all of the work is complete when the API returns.

ForAll is not fictitious, and is similar to a number of PLINQ APIs: Where, Select, Join, Sort, etc.  It is also exposed directly by the TPL runtime’s Parallel class which intelligently forks and joins with workers.

‘a’ is a user-specified delegate and can do just about anything.  That includes, of course, throwing an exception.  What’s worse, because ‘a’ is run in several threads concurrently, there may be more than one exception thrown.  In fact, there are three distinct possibilities:

  1. No errors: No invocations of ‘a’ throw an exception.
  2. One error: A single invocation of ‘a’ throws an exception.
  3. Many errors: Concurrent invocations of ‘a’ on separate threads throw exceptions.

Clearly letting an exception crash whichever thread the problematic ‘a(e)’ happened to be run on is problematic and confusing.  If for no other reason than the IAsyncResult pattern has established precedent.  But realistically, the developer would be forced to devise his or her own scheme to marshal the failure back to the calling thread in order for any sort of chance at recovery.  They would get it wrong and it would lead to incompatible and poorly composing silos over time.  A Byzantine model that fully prohibits exceptions passing fork/join barriers goes against the simple, familiar, and understandable (albeit often deceptively so) model of exceptions.

(That said, marshaling leads to a crappy debugging experience.  An already attached debugger will get a break-on-throw notification at the exception on the origin thread, but since we catch, marshal, and (presumably) rethrow, the first and second chances for unhandled exceptions won’t happen until after the exception been marshaled.  This breaks the first pass, and by the time the debugger breaks in, or a crash dump is taken, the stack associated with the origin thread is apt to have gone away, been reused for another task (in the case of the thread pool), etc.  We generally try to avoid breaking the first pass in the .NET Framework, but do it in plenty of places: the BCL today already contains tons of try { … } catch { /*cleanup */ throw; }-style exception handlers, for example.  For this reason I’m not terribly distraught over the implications of doing it ourselves.  And sans deeper integration with the exception subsystem – something we ought to consider – there aren’t many reasonable alternatives.)

What makes this problem really bad is that ForAll appears as though it’s synchronous:

void f() {

    // do some stuff

    ForAll(…, …);

    // do some more stuff, ‘ForAll’ is completely done

}

The method call to ForAll itself is synchronous, but of course its internal execution is not.  But still, to the developer, the call to this function represents one task, one logical piece of work, regardless of the fact that the implementation uses multiple threads for execution.  As higher level APIs are built atop things like ForAll, the low level parallel infrastructure problem becomes a higher level library or application problem.  A Sort that is internally parallel must now decide what exception(s) it will tell callers it may throw.

Nondeterministic exception ordering

We assume the ForAll API stops calling ‘a(e)’ on any given thread when it first encounters an exception.  That is, each thread just does something like this:

for (int i = start_idx; i < end_idx; i++) {

    a(d[i]);

}

The for loop terminates when any single iteration throws an exception.  Imagine our array contains 2048 elements and that ForAll smears the data across 8 threads, partitioning the array into 256-element sized chunks of contiguous elements.  So partition 0 gets elements [0…256), partition 1 gets [256…512), …, and partition 7 gets [1792…2048).  Now imagine that ‘a’ throws an exception whenever fed a null element, and that every 256th element in ‘d’, starting at element 10, is null.  What can a developer reasonably expect to happen?

On one hand, if we’re trying to preserve the illusion of sequential execution, we would only want to surface the exception from the 10th element.  With a sequential loop, this would have prevented the 266th, 522nd, and so on, elements from even being passed to ‘a’.  So we might simply say that the “left most” exception (based on ordinal index) is the one that gets propagated.  The obvious problem with this is there are races involved: subsequent iterations indeed may have actually run.  Alternatively, we might consider only letting the “first” propagate.  Unfortunately, that doesn’t work either, because we unfortunately can’t necessarily determine, for a set of concurrent exceptions, which got thrown first.  Even if they have timestamps, they could occur in parallel at indistinguishably close times.  Nor does this really matter, because it feels fundamentally wrong.

The reason is that we can’t simply throw away failures without true recoverability in the system, a la STM.  The execution of code leading up to the exception did actually happen, after all, and there could be residual effects.  We might be masking a terrible problem by throwing failures away, possibly leading to (more) state corruption and (prolonged, perhaps unrecoverable) damage.  What if the 10th element was a simple ArgumentNullException that the caller chooses to tolerate, but the 266th element’s exception was in response to a catastrophic error from which the application can’t recover?  We can’t choose to propagate the 10th but swallow the 266th.  Broadly accepted exceptions best practices suggest that app and library devs never catch and swallow exceptions they cannot reasonably handle.  We should do our best to follow the spirit of this guidance too.

Re-propagation

We could employ an approach similar to the IAsyncResult pattern, with some slight tweaks.

If each concurrent copy of ForAll caught any unhandled exceptions and marshaled them to the forking thread, including any exceptions that happen on the forking thread itself, we could then propagate all of them together after the join completes.  The question is then: what exactly do we propagate?

If there is just a single exception, it’s tempting to just rethrow it.  But I don’t believe this is a good approach for two primary reasons:

  1. This will destroy the stack trace of the original exception.  This means no information about the actual source of the error inside ‘a’ is available.  With some help from the CLR team, we might be able to get a special type of ‘rethrow’ that copied the original stack trace before recreating a new one.  This is already done for remoted exceptions, and the Exception base class will prefix the original remoted stack trace to the new stack trace.
  2. This doesn’t scale to handle multiple exceptions.  If we could solve #1, it might be attractive because it appears as-if things happened sequentially, but we can’t escape #2, no matter what we do.  We could have different behavior in these two cases, but I believe it’s better to remain consistent instead.  Otherwise, developers will need to write their exception handles two ways: one way to handle singular cases, and the other way to handle multiple cases, where the same API may do either nondeterministically.

Given that we need to propagate multiple exceptions, we should wrap them in an aggregate exception object, and propagate that instead.  At least this way, the original exceptions will be preserved, stack trace and all.  Of course the original exceptions themselves might be other aggregates, handling arbitrary composition.

For sake of discussion, call this aggregate exception System.AggregateException, which of course derives from System.Exception.  It exposes the raw Exception objects thrown by the threads, via an ‘Exception[] InnerExceptions’ property, and additional meta-data about each exception: from which thread it was thrown, and any API specific information about the concurrent operation itself.  This last part is just to help debuggability.  For instance, we might tell the developer that the ArgumentNullException was thrown from a thread pool thread with ID 1011, and that it occurred while invoking the 266th element ‘e’ of array ‘d’.  We might also guarantee the exceptions will be stored in the order in which they were marshaled back to the forking thread, just to help the developer (as much as we can) piece together the sequence of events leading to failures.

(Editor’s note: we decided against storing this meta-data information for various reasons.)

Now the dev can do whatever he or she wishes in response to the exception.  Previously they might have written:

try {

    ForEach(a, d);

} catch (FileNotFoundException fex) {

    // Handler(fex);

}

And now they would have to instead write:

try {

    ForAll(a, d);

} catch (AggregateException pex) {

    List<Exception> unhandled = new List<Exception>();

 

    foreach (Exception e in pex.InnerExceptions) {

        FileNotFoundException fex = e as FileNotFoundException;

        if (fex == null) {

            unhandled.Add(fex);

        } else {

            // Handler(fex);

        }

    }

 

    if (unhandled.Count > 0)

        throw new AggregateException(unhandled);

}

In other words, they would catch the AggregateException, enumerate over the inner exceptions, and react to any FileNotFoundExceptions as they would have normally.  (Taking into consideration that there might have been multiple.)  At the end, if there are any non-FileNotFoundExceptions left over, we propagate a new AggregateException with the handled FileNotFoundExceptions removed.  If there was only one remaining, we could, I suppose, try to rethrow just that, but this has the same nondeterminism problems mentioned above.

Very few people will write this code.  But one of the most vocal arguments against it is: just throw one singular exception, such as ForAllException, and let it crash, because no developer will handle it.  Well, that scheme is no better than throwing the AggregateException.  At least the aggregation model lets people write backout and recovery code if they have the patience to deal with the reality that multiple exceptions occurred.

To make this slightly easier, we could expose an API, ‘void Handle(Func<Exception, bool> a) where T : Exception’, that effectively encapsulates the same logic as shown above, repropagating the exception at the end if all the exceptions weren’t handled (i.e. some weren’t of type T):

try {

    ForAll(a, d);

} catch (AggregateException pex) {

    pex.Handle(delegate(Exception ex) {

        FileNotFoundException fex = ex as FileNotFoundException;

        if (fex != null) {

            // Handle(fex);

            return true;

        }

        return false;

    });

}

(One problem with this approach is that the ‘throw’ inside of Handle will destroy the original stack trace for ‘pex’.  An alternative might be for Handle to modify the AggregateException in place, keeping the stack trace intact, returning a bool that the caller switches on and does a ‘throw’ if it returns false; this is unattractive because it’s error prone and could lead to accidentally swallowing, but in the end might help debuggability.)

If we cared about eliminating unnecessary catch/rethrows, we could use 1st pass filters instead, but this would only be available to VB and C++/CLI programmers, as C# doesn’t expose filters.  For example, in pseudo-code:

try {

    ForAll(a, d);

} catch (fex.InnerExceptions.Contains<FileNotFoundException>()) {

    // Handle …

}

Although interesting, we’re trying to move away from our two pass model.  So let’s forget about this for now.

This approach suffers when composing with non-aggregate exception aware code.  For it to work well, everybody on the call stack needs to be looking inside the aggregate for “their” exception, handling it, and possibly repropagating.  If we want existing BCL APIs to start using data parallelism internally, we would have to be careful here, not to break AppCompat because we start throwing AggregateExceptions instead of the originals.

This is probably where there’s an opportunity for better CLR and tool integration.  For instance, you could imagine a world where the CLR automatically unravels the parallel failures, matching and running handlers for specific exceptions inside the aggregate as it goes, but repropagating if all exceptions weren’t handled.  This is very hand-wavy and fundamentally changes the way exceptions work, so it would require a lot more thought.  A catch block that swallows an exception (today) is just about guaranteed—asynchronous exceptions aside—that the IP will soon reach the next instruction after the try/catch block.  This is a pretty basic invariant.  With this proposal, that wouldn’t be the case, and would be bound to break large swaths of code.  Sticking with the library approach (with all its imperfections) seems like the best plan of attack for now.

Waiting for the “join” to finish

There was something implicit in the design mentioned above.  The ForAll API, and others like it, wouldn’t actually propagate exceptions until the fate of all threads was known.

Imagine we have the scenario described earlier (2048 elements, 8 threads), but slightly different: the 0th element causes an exception, but no other.  It turns out this is probably a common case, i.e. that only a subset of the partitions will yield an exception.  In this case, we would still have to wait for 7*256 = 1,792 elements to be run through ‘a’ before this exception is propagated.  Imagine a slightly different case.  The 0th element throws a catastrophic exception, and the application is going to terminate as soon as it propagates.  ‘a’ simply can’t be run any more, and will keep reporting back this same exception.  But it will take 8 of these exceptions to actually stop the application, i.e. by calling ‘a’ on the 0th, 256th, 512th, etc. elements, if we wait for all tasks to complete.  If each exception corresponds to some failed attempt at forward progress, one that possibly corrupts state, then the damage is O(N) times “worse” (for some measurement) than in the sequential program, where N is the number of concurrent tasks.

Instead of waiting helplessly, we could try to aggressively shut down these concurrent workers.

At first, you might be tempted to employ CLR asynchronous thread aborts, but this is fraught with peril.  Almost all .NET Framework code today is taught that thread abort == AppDomain unload, and reacts accordingly.  State corruption stemming from libraries as fundamental as the BCL would be just about guaranteed.  Changing this state of mind and the state of our software would be quite the undertaking.

Instead, we can have the concurrent API itself periodically check an ‘abort’ flag shared among all workers.  The first thread to propagate an exception would set this flag.  And whenever a worker has seen that it has been set, it voluntarily returns instead of finishing processing data:

for (int i = start_idx; i < end_idx && !aborted; i++) {

    a(d[i]);

}

This increases the responsiveness of exception propagation, but clearly isn’t foolproof.  There will still be a delay for long-running callbacks.  Thankfully, with PLINQ, TPL, and I hope most of our parallel libraries, the units of work will be individually fine-grained, and therefore this technique should suffice.

If a concurrent worker is blocked, there’s not a whole lot we can do.  Much like thread aborts, you might be tempted to use Thread.Interrupt to remove it from the wait condition.  Unfortunately this will leave state corruption in its wake, because plenty of code does things like WaitHandle.WaitOne(Timeout.Infinite) without checking the return value or expecting a ThreadInterruptionException.  The same argument applies to, say, user-mode APCs.  Eventually you might also be tempted to use IO cancellation in Windows Vista to cancel errant, runaway network or disk IO requests.  This would be great.  But this also generally has the same problem as interruption, so until we find a general solution to that, we can’t do any of this.

(Editor’s note: We eventually solved this problem by coming up with a unified cancellation framework.)

One last note

This path forward seems best for now, but it leaves me wanting more.

In the end, this feels like a more fundamental problem.  An API like ForAll gives the illusion of an ordinary, old sequential caller/callee relationship.  But the callee doesn’t use a stack-based calling approach: instead, it distributes work among many concurrent workers, turning the linear stack into a sort of dynamically unfolding cactus stack (or tree).  And SEH exceptions are fundamentally linear stack-based creatures.

In this world, it’s just a simple fact that data all over the place can become corrupt simultaneously.  Many things can fail at once because many things are happening at once.  It’s inescapable.  Recovery is disastrously difficult, so most failures will end in crashes.  STM’s promise for automatic recovery offers a glimmer of hope, but without it, I worry that papering a sequential “feel” on top of data/task parallelism is a dangerous game to play.

6/23/2009 8:13:52 PM (Pacific Daylight Time, UTC-07:00)  #   

 Tuesday, June 16, 2009

One of my many focuses lately has been developing a memory ordering model for our project here at Microsoft.  There are four main questions to answer when defining such a model:

  1. What are the ordering guarantees for ordinary loads and stores?
  2. What are the ordering guarantees for volatile loads and stores?
  3. What kinds of explicit fences are allowed?
  4. Where are fences used automatically, e.g. to preserve type safety and security?

These tend to be the differentiation points for any model.  Everything else is mostly commodity.  Not that there is much else, mind you, but respecting data dependence, not speculating ahead such that exceptions occur that wouldn't have occurred in a sequential execution, and so forth are all must haves, for instance.  Most interesting permutations of answers for these questions have already been explored, and industry consensus is being reached, so it would be better to say I've been picking a model rather than defining one.

What's interesting is that memory model designers are often colored by their favorite architecture du jour.  If somebody cares primarily about X86, they are apt to choose something very strong.  If somebody cares primarily about ARM, however, they are apt to choose something very weak.  There is a classic tradeoff here.  Stronger means easier to program, while weaker means better performance.  For some reason, many of the projects I've worked on have had an abundance of strong hardware (like X86) and a scarcity of weak hardware (like ARM and IA64).  The reality sinks in: most developers on the team code to X86, and then when it comes time to getting more serious about the other platforms, code starts breaking all over the place.  This is why the CLR went so strong in 2.0, even though IA64 was an important platform to support.

Let's look at some common answers to the above questions.

For #1:

  • C++, Visual C++, ECMA 1.0, Java Memory Model, and Prism: no ordering guarantees.
  • CLR 2.0: ordered stores, no ordering for loads.

For #2:

  • C++: prevents compiler-only code motion, but explicit fences are needed for processor ordering.
  • Visual C++, ECMA 1.0, and CLR 2.0: loads are acquire, stores are release ordered.
  • Java Memory Model: loads and stores are fully ordered (sequentially consistent).

For #3:

  • C++: implementation-specific.
  • Visual C++: intrinsics and Win32 APIs.
  • ECMA 1.0 and CLR 2.0: locks, and mostly Win32-style interlocked APIs.
  • Java Memory Model: locks, compare-and-swap, atomics, etc.

For #4:

Managed environments like the CLR and JVM need to ensure type safety, even if ordinary loads and stores are unordered.  This is nontrivial, because the boundary around type safety is blurred.  Certainly we must ensure garbage v-table pointers are not seen.  But is a thread allowed to read non-zeroed memory behind an object reference?  And can it contain garbage (e.g. "values out of thin air")?  What about writes done by mutator threads, including write barriers, while a concurrent collector is tracing objects in the heap?  Are array lengths part of the set of protected fields that mustn't be read out of order?  Strings, since they are commonly used for security checking?  And so on.

It is mainly the deep questions around #4, and also some simple compatibility struggles (around things like double checked locking), that caused the stronger answers for #1 in the CLR 2.0.

In any case, I'm advocating a very different approach than the traditional models.

We pick completely weak ordering for ordinary loads and stores, to enable efficient execution on weaker platforms like ARM, PowerPC, IA64, etc.  That part isn't new.  But here's the clincher.  No volatiles.  There are special variables that are used to communicate between threads (call them volatile if you'd like), but using them implies no kind of special automatic fencing.  Instead, whenever accessing such a variable, at the site of usage, the kind of fence desired must be used (compiler-enforced): full-fence (sequentially consistent), acquire-fence, release-fence, no-fence, or compiler-only-fence (for things like ensuring loads don't get hoisted as loop invariant).  Of course, certain kinds of fences are sprinkled throughout the system to guarantee type safety in all of the aforementioned places (and more), but these are implementation details.

(This approach is rather like Herb Sutter's Prism and C++0x atomics.  See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2664.htm.)

Particularly after managing teams who developed a plethora of lock free code, I love this approach.  I can review code and immediately understand what ordering invariants the developer assumed when writing the code.  This doesn't really make writing lock free code any simpler, except that it forces you to pause and think about things a bit more carefully than you may have otherwise.  But it certainly makes code easier to understand and maintain, and makes it clear to people that sprinkling volatile all over the place isn't going to save your butt: the only thing that will do that is careful thinking and engineering.

6/16/2009 11:53:26 PM (Pacific Daylight Time, UTC-07:00)  #   

 

Recent Entries:

Search:

Browse by Date:
<November 2009>
SunMonTueWedThuFriSat
25262728293031
1234567
891011121314
15161718192021
22232425262728
293012345

Browse by Category:

Notables: