RSS 2.0

Personal Info:

Joe Send mail to the author(s) leads the architecture of an experimental OS's developer platform, where he is also chief architect of its programming language. His current mission is to enable writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe founded the Parallel Extensions to .NET project. He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife, writing books, writing music, studying music theory & mathematics, and doing anything involving food & wine.

My books

My music

Disclaimer:
The content of this site are my own personal opinions and do not represent my employer's view in anyway.

© 2012, Joe Duffy

 
 Wednesday, July 30, 2008

I have two new positions open on my team.

  1. http://members.microsoft.com/careers/search/details.aspx?JobID=955B4B91-F863-492C-B839-63964E7966B9
  2. http://members.microsoft.com/careers/search/details.aspx?JobID=6B22807F-71BF-4939-ACB4-EC380106AFBF

If you want to shape the future of the runtimes, languages, and libraries that millions of developers use every day, this is a perfect opportunity.

And you'll get to work with some amazing people too.

You can submit your resume on the website or just email me at joedu @ you-know-where dot com.

7/30/2008 1:46:39 PM (Pacific Daylight Time, UTC-07:00)  #   

 Tuesday, July 29, 2008

This is the first part in a series I am going to do on building a custom thread pool.  Not that I’m advocating you do such a thing, but I figured it could be interesting to explore the intricacies involved.  We’ll start off really simple:

  • A CLR monitor based queuing mechanism.
  • A static, fixed number of threads.
  • The ability to create multiple pools that are isolated from one another.
  • Flowing of ExecutionContexts and the ability to turn it off.

As the series progresses, I intend to incorporate some interesting facets such as:

  • Dynamic thread injection, so that the number of threads is not fixed.
  • Thread sharing among multiple pools in an AppDomain.
  • A per-thread work stealing queue to increase the efficiency of recursively queued work.
  • Interoperability with I/O completion ports.
  • Returning an IAsyncResult object for seamless APM integration.
  • Cancelation.
  • Anything else that readers suggest might be interesting.  Let me know.

And with that, let’s begin.

For now, we’ll use a very simple interface, IThreadPool, under which we can implement various mechanisms and policies.  This will make it easier to write generic test harnesses that compare different implementations.  For this post we won’t really make use of that capability (much), but we will use it to compare the stock CLR ThreadPool against a very simple custom one.

interface IThreadPool : IDisposable

{

    void QueueUserWorkItem(WaitCallback work, object obj);

}

So that we can subsequently compare implementations, we have two simple implementations of IThreadPool.  One does safe ThreadPool.QueueUserWorkItem calls, and the other does unsafe ThreadPool.UnsafeQueueUserWorkItem calls.  The only difference is that the latter doesn’t flow the ExecutionContext across threads.

class CLRThreadPool : IThreadPool

{

    public void QueueUserWorkItem(WaitCallback work, object obj)

    {

        ThreadPool.QueueUserWorkItem(work, obj);

    }

 

    public void Dispose() { }

}

 

class CLRUnsafeThreadPool : IThreadPool

{

    public void QueueUserWorkItem(WaitCallback work, object obj)

    {

        ThreadPool.UnsafeQueueUserWorkItem(work, obj);

    }

 

    public void Dispose() { }

}

Our simple thread pool, SimpleLockThreadPool, will have 7 fields:

  • private int m_concurrencyLevel: the number of threads to create statically, specified at construction time (w/ a default of Environment.ProcessorCount); 
  • private bool m_flowExecutionContext: whether execution context flowing is turned on (the default) or off.  Turning it off can provide some performance gains.
  • private Queue<WorkItem> m_queue: the queue of actual work items.  This object is also used as a monitor.  We’ll see what the WorkItem data structure looks like momentarily.
  • private Thread[] m_threads: the set of threads actively dequeuing and running work items from this pool.  Each instance of SimpleLockThreadPool has its own set.
  • private int m_threadsWaiting: a hint used to avoid pulsing on enqueue when no threads are waiting.  Threads increment and decrement before and after (respectively) waiting for work.
  • private bool m_shutdown: set to true when threads are requested to exit.

Each WorkItem is a struct with three fields.  Using a struct avoids superfluous heap allocations.

  • internal WaitCallback m_work: the delegate to invoke.
  • internal object m_obj: some optional state to pass as the argument to m_work.
  • internal ExecutionContext m_executionContext: a context captured at enqueue time, to be used when running the callback.  This ensures the appropriate security context and logical call context flow to the work item’s stack, for example.

There are just 4 methods of interest:

  • public void QueueUserWorkItem(WaitCallback work, object obj): implements the IThreadPool interface, and does a few things.  It allocates a new WorkItem, optionally captures and stores an ExecutionContext, ensures the pool has started, and then enqueues the WorkItem into the pool, possibly pulsing a single thread (if any are waiting).  There’s also a convenient overload that doesn’t take an obj for situations where it isn’t needed.
  • private void EnsureStarted(): a simple helper method that will lazily initialize and start the set of threads in a particular pool.  These threads just sit in a loop and dequeue work.  The lazy aspect ensures that a pool that doesn’t ever get used won’t allocate threads.
  • private void DispatchLoop(): this is the main method run by each pool thread.  All it does is sit in a loop dequeuing and (if the queue is empty) waiting for new work to arrive.  When shutdown is initiated, the method voluntarily quits.  If any pool work items throw an exception, this top-level method lets them go unhandled, resulting in a crash of the thread.
  • public void Dispose(): shuts down all the threads in a pool.  It is synchronous, so it actually waits for them to complete before returning.  If work items take a long time to finish, this could be a problem.  Extending this to timeout, etc., would be trivial.

And that’s really it.  This is a very simple and naïve start, but it will prove to be a good starting point for all of the extensions I mentioned at the outset.  Here’s the full code.

public class SimpleLockThreadPool : IThreadPool

{

 

    // Constructors--

    // Two things may be specified:

    //   ConcurrencyLevel == fixed # of threads to use

    //   FlowExecutionContext == whether to capture & flow ExecutionContexts for work items

 

    public SimpleLockThreadPool() :

        this(Environment.ProcessorCount, true) { }

 

    public SimpleLockThreadPool(int concurrencyLevel) :

        this(concurrencyLevel, true) { }

 

    public SimpleLockThreadPool(bool flowExecutionContext) :

        this(Environment.ProcessorCount, flowExecutionContext) { }

 

    public SimpleLockThreadPool(int concurrencyLevel, bool flowExecutionContext)

    {

        if (concurrencyLevel <= 0)

            throw new ArgumentOutOfRangeException("concurrencyLevel");

 

        m_concurrencyLevel = concurrencyLevel;

        m_flowExecutionContext = flowExecutionContext;

 

        // If suppressing flow, we need to demand permissions.

        if (!flowExecutionContext)

            new SecurityPermission(SecurityPermissionFlag.Infrastructure).Demand();

    }

 

    // Each work item consists of a closure: work + (optional) state obj + context.

 

    struct WorkItem

    {

        internal WaitCallback m_work;

        internal object m_obj;

        internal ExecutionContext m_executionContext;

 

        internal WorkItem(WaitCallback work, object obj)

        {

            m_work = work;

            m_obj = obj;

            m_executionContext = null;

        }

 

        internal void Invoke()

        {

            // Run normally (delegate invoke) or under context, as appropriate.

            if (m_executionContext == null)

                m_work(m_obj);

            else

                ExecutionContext.Run(m_executionContext, ContextInvoke, null);

        }

 

        private void ContextInvoke(object obj)

        {

            m_work(m_obj);

        }

    }

 

    private readonly int m_concurrencyLevel;

    private readonly bool m_flowExecutionContext;

    private readonly Queue<WorkItem> m_queue = new Queue<WorkItem>();

    private Thread[] m_threads;

    private int m_threadsWaiting;

    private bool m_shutdown;

 

    // Methods to queue work.

 

    public void QueueUserWorkItem(WaitCallback work)

    {

        QueueUserWorkItem(work, null);

    }

 

    public void QueueUserWorkItem(WaitCallback work, object obj)

    {

        WorkItem wi = new WorkItem(work, obj);

 

        // If execution context flowing is on, capture the caller's context.

        if (m_flowExecutionContext)

            wi.m_executionContext = ExecutionContext.Capture();

 

        // Make sure the pool is started (threads created, etc).

        EnsureStarted();

 

        // Now insert the work item into the queue, possibly waking a thread.

        lock (m_queue)

        {

            m_queue.Enqueue(wi);

            if (m_threadsWaiting > 0)

                Monitor.Pulse(m_queue);

        }

    }

 

    // Ensures tha threads have begun executing.

 

    private void EnsureStarted()

    {

        if (m_threads == null)

        {

            lock (m_queue)

            {

                if (m_threads == null)

                {

                    m_threads = new Thread[m_concurrencyLevel];

                    for (int i = 0; i < m_threads.Length; i++)

                    {

                        m_threads[i] = new Thread(DispatchLoop);

                        m_threads[i].Start();

                    }

                }

            }

        }

    }

 

    // Each thread runs the dispatch loop.

 

    private void DispatchLoop()

    {

        while (true)

        {

            WorkItem wi = default(WorkItem);

 

            lock (m_queue)

            {

                // If shutdown was requested, exit the thread.

                if (m_shutdown)

                    return;

 

                // Find a new work item to execute.

                while (m_queue.Count == 0)

                {

                    m_threadsWaiting++;

                    try { Monitor.Wait(m_queue); }

                    finally { m_threadsWaiting--; }

 

                    // If we were signaled due to shutdown, exit the thread.

                    if (m_shutdown)

                        return;

                }

 

                // We found a work item! Grab it ...

                wi = m_queue.Dequeue();

            }

 

            // ...and Invoke it. Note: exceptions will go unhandled (and crash).

            wi.Invoke();

        }

    }

 

    // Disposing will signal shutdown, and then wait for all threads to finish.

 

    public void Dispose()

    {

        m_shutdown = true;

        lock (m_queue)

        {

            Monitor.PulseAll(m_queue);

        }

 

        for (int i = 0; i < m_threads.Length; i++)

            m_threads[i].Join();

    }

}

I think everything should be self-explanatory given the earlier explanation of all the fields and types.  Let’s take a look at a simple test harness for this.  There are a myriad of useful tests, and the one that I will show right now is but one of them.  It queues a whole lot of work items, and then blocks waiting for them to complete.  I have two variants: one of them allows work items to begin executing before the queuing is done, while the other separates the phases.  Here is the general test.

class Program

{

    public static void Main(string[] args)

    {

        bool separateQueueFromDrain = bool.Parse(args[0]);

        const int warmupRunsPerThreadPool = 100;

        const int realRunsPerThreadPool = 1000000;

 

        IThreadPool[] threadPools = new IThreadPool[]

        {

            new CLRThreadPool(),

            new CLRUnsafeThreadPool(),

            new SimpleLockThreadPool(true), // Flow EC

            new SimpleLockThreadPool(false), // Don't flow EC

        };

 

        long[] queueCost = new long[threadPools.Length];

        long[] drainCost = new long[threadPools.Length];

 

        Console.WriteLine("+ Running benchmarks ({0}) +", threadPools.Length);

 

        for (int i = 0; i < threadPools.Length; i++)

        {

            IThreadPool itp = threadPools[i];

            Console.Write("#{0} {1}: ", i, itp.ToString().PadRight(26));

 

            // Warm up:

            using (CountdownEvent cev = new CountdownEvent(warmupRunsPerThreadPool))

            {

                WaitCallback wc = delegate { cev.Decrement(); };

                for (int j = 0; j < warmupRunsPerThreadPool; j++)

                    itp.QueueUserWorkItem(wc, null);

                cev.Wait();

            }

 

            // Now do the real thing:

            int g0collects = GC.CollectionCount(0);

            int g1collects = GC.CollectionCount(1);

            int g2collects = GC.CollectionCount(2);

 

            using (CountdownEvent cev = new CountdownEvent(realRunsPerThreadPool))

            using (ManualResetEvent gun = new ManualResetEvent(false))

            {

                WaitCallback wc = delegate {

                    if (separateQueueFromDrain) { gun.WaitOne(); }

                    cev.Decrement();

                };

                Stopwatch sw = Stopwatch.StartNew();

                for (int j = 0; j < realRunsPerThreadPool; j++)

                    itp.QueueUserWorkItem(wc, null);

                queueCost[i] = sw.ElapsedTicks;

                sw = Stopwatch.StartNew();

                if (separateQueueFromDrain) { gun.Set(); }

                cev.Wait();

                drainCost[i] = sw.ElapsedTicks;

            }

 

            g0collects = GC.CollectionCount(0) - g0collects;

            g1collects = GC.CollectionCount(1) - g1collects;

            g2collects = GC.CollectionCount(2) - g2collects;

 

            Console.WriteLine("q: {0}, d: {1}, t: {2} (collects: 0={3},1={4},2={5})",

                queueCost[i].ToString("#,##0"),

                drainCost[i].ToString("#,##0"),

                (queueCost[i] + drainCost[i]).ToString("#,##0"),

                g0collects,

                g1collects,

                g2collects

            );

 

            itp.Dispose();

            GC.Collect(2);

            GC.WaitForPendingFinalizers();

        }

 

        Console.WriteLine();

        Console.WriteLine("+ Comparison against baseline ({0}) +", threadPools[0]);

        for (int i = 0; i < threadPools.Length; i++)

        {

            Console.WriteLine("#{0} {1}: q: {2}x, d: {3}x, t: {4}x",

                i,

                threadPools[i].ToString().PadRight(26),

                queueCost[i] / (float)queueCost[0],

                drainCost[i] / (float)drainCost[0],

                (queueCost[i] + drainCost[i]) / ((float)queueCost[0] + drainCost[0])

            );

        }

    }

}

If we pass ‘true’ on the command line, the phases are separated, and if we pass ‘false’ they are not.  The ‘true’ part allows us to hone in on the source of overhead (is it the queuing itself, or the dispatching of work items?), but at the expense of needing to keep more of the work items in memory at once (because pool threads can’t drain them as we queue them).  We run the test over an array of IThreadPool implementations, and for each one print out the cost to queue work, drain work, and the number of Gen0, Gen1, and Gen2 collections performed for each one.  The GC statistics are interesting because they tell us how much more memory (roughly speaking) we are allocating for the same workload on different pool implementations.  As our pool gets more complicated, this will be something to keep your eye on.

Here are some sample numbers on my dual-core laptop.  Your results will vary.  When ‘true’ is passed, I see numbers like the following:

+ Running benchmarks (4) +

#0 CLRThreadPool             : q: 3,163,506, d: 5,137,893, t: 8,301,399 (collects: 0=16,1=8,2=3)

#1 CLRUnsafeThreadPool       : q: 1,285,806, d: 4,428,451, t: 5,714,257 (collects: 0=5,1=4,2=1)

#2 SimpleLockThreadPool      : q: 4,208,686, d: 11,839,614, t: 16,048,300 (collects: 0=104,1=14,2=4)

#3 SimpleLockThreadPool      : q: 499,575, d: 3,992,190, t: 4,491,765 (collects: 0=1,1=1,2=1)

 

+ Comparison against baseline (CLRThreadPool) +

#0 CLRThreadPool             : q: 1x, d: 1x, t: 1x

#1 CLRUnsafeThreadPool       : q: 0.4064497x, d: 0.8619196x, t: 0.6883487x

#2 SimpleLockThreadPool      : q: 1.330387x, d: 2.304371x, t: 1.933204x

#3 SimpleLockThreadPool      : q: 0.1579181x, d: 0.7770092x, t: 0.5410853x

And when ‘false’ is passed, I see similar but subtly different numbers:

+ Running benchmarks (4) +

#0 CLRThreadPool             : q: 3,476,630, d: 27,592, t: 3,504,222 (collects: 0=20,1=6,2=0)

#1 CLRUnsafeThreadPool       : q: 2,636,319, d: 140,653, t: 2,776,972 (collects: 0=5,1=2,2=0)

#2 SimpleLockThreadPool      : q: 4,850,171, d: 6,227,052, t: 11,077,223 (collects: 0=95,1=14,2=4)

#3 SimpleLockThreadPool      : q: 826,987, d: 132,755, t: 959,742 (collects: 0=1,1=1,2=1)

 

+ Comparison against baseline (CLRThreadPool) +

#0 CLRThreadPool             : q: 1x, d: 1x, t: 1x

#1 CLRUnsafeThreadPool       : q: 0.7582973x, d: 5.097601x, t: 0.7924646x

#2 SimpleLockThreadPool      : q: 1.395078x, d: 225.6832x, t: 3.161108x

#3 SimpleLockThreadPool      : q: 0.2378703x, d: 4.811358x, t: 0.2738816x

Notice right away that we are handily beating the heck out of the CLR thread pool in the case where we don’t flow ExecutionContext objects (the #3 case).  In fact, we are only 27% the cost for the ‘false’ variant.  But we unfortunately don’t fare nearly as well when we flow ExecutionContext objects (the #2 case).  It turns out that’s because the CLR has a unique advantage over us when compared to our naïve call to ExecutionContext.Capture.  Just look at the sizeable difference in Gen0 collections; we are clearly allocating a lot more memory.  This will be a topic for a subsequent post.

7/29/2008 1:44:01 AM (Pacific Daylight Time, UTC-07:00)  #   

 Sunday, July 20, 2008

Here's a slightly more formal approach to proving that the CLR MM is improperly implemented for the particular example I showed earlier.

As the Java MM folks have done, I will use a combination of happens-before and synchronizes-with relations, which allows order in a properly synchronized program to be describe as a "flat" sequence with total ordering among elements.  Assume < means synchronizes-with.  If a happens-before b, and a < b, then any writes in a are visible to any loads in b.  This relation is transitive: if a < b and b < c, then a < c.  Given this, we can take an observed set of results (the values held in memory locations), a hypothesized execution order (which we can infer from the observation), and validate it against the program order (as written in the source); we do this by taking the MM-specific synchronizes-with relation rules, and see if we can produce the observed output given our belief of the execution order.  If we find a contradiction (the execution order required to produce the output could not be produced given the program order and MM rules), either there is an alternative execution order we failed to guess, or we have found a violation of the memory model.

Single threaded programs are easy.  Multi threaded programs are hard.  We must manually "sequentialize" the program by constructing an interleaving of all executed program operations into a single flat sequence, and permute them as needed to produce the output in order to formulate a hypothesis of the execution order.  This is of course very difficult to do, so it only works with very small programs (like the one I will show below).

I will try to define the CLR 2.0 MM in terms of synchronizes-with, although I have to admit it’s going to be difficult to do off the top of my head:

  1. a < b, given a volatile load a that precedes any other memory operation b.  (Loads are acquire.)
  2. a < b, given any memory operation a that precedes any other store b.  (Stores are release.)
  3. a < b, given two separate memory operations a which precedes b that work with the same memory location.  (Data dependence.)
  4. a < b, given any memory operation a that precedes a full fence b.  (Cannot move after a fence.)
  5. a < b, given a full fence a that precedes some memory operation b.  (Cannot move before a fence.)
  6. a < b, given a lock acquire a that precedes some memory operation b.  (Lock acquires are acquire fences.)
  7. a < b, given a memory operation a that precedes a lock release b.  (Lock releases are release fences.)

Let’s take the disturbing example, assuming all loads and stores are volatile.

X = 1;              Y = 1;
R0 = X;             R2 = Y;
R1 = Y;             R3 = X;

Let’s hypothesize about execution order.

To produce an output in which R1 == R3 == 0, let us observe that it must be the case that X = 1 and Y = 1 must not happen first.  If one such instruction does occur first, then any possible outcome leads to R1 and/or R3 holding the value 1.  That is because of rule 3: if X = 1 happened first, then X = 1 < R3 = X, leading to R3 == 1 and similarly if Y = 1 happened first, then Y= 1 < R1 = Y, leading to R1 == 1.  So let us try to make X = 1 and Y = 1 not happen first.

Indeed, it is impossible for R0 = X or R2 = Y to happen first.  This is because of CLR MM rule 3: X = 1; R0 = X leads to data dependence, and thus X = 1 < R0 = X.  Similarly, Y = 1 < R2 = Y.  Dead end.  Let’s try the only other route.

The only remaining possibility to produce the output R1 == R3 == 0 is if R1 = Y or R3 = X occurs first.  Let us try to make R1 = Y occur first.  Ah-hah!  We cannot!  Given CLR MM rule 1, R0 = X < R1 = Y.  And because of transitivity, this necessarily implies that X = 1 < R1 = Y.  The same holds for the other thread’s instructions: Y = 1 < R3 = X.  The output R1 == R3 == 0 is therefore a contradiction and disallowed by the CLR MM.

Now, this is light years from a formal proof, but is the reasoning I’ve been using in my mind to explain why this new realization is fundamentally very disturbing and is explicitly not allowed by the CLR MM. Thankfully it seems the JIT team agrees and is willing to fix this for the next release. And, I'm still in search of an example of code that is broken by this problem ...

7/20/2008 1:14:14 AM (Pacific Daylight Time, UTC-07:00)  #   

 Wednesday, July 16, 2008

The adjacent release/acquire problem is well known.  As an example, given the program:

P0          P1
==========  ==========
X = 1;      Y = 1;
R0 = Y;     R1 = X;

The outcome R0 == R1 == 0 is entirely legal.  This could happen because writes are delayed in processor store buffers; so before R0 = Y retires, the store X = 1 may have not even left the local processor P0; similarly, before R1 = X retires, the store Y = 1 may not have even left processor P1.  It is as if the program was written as follows:

P0          P1
==========  ==========
R0 = Y;     R1 = X;
X = 1;      Y = 1;

The standard way to fix this is to emit a full fence:

P0          P1
==========  ==========
X = 1;      Y = 1;
XCHG;       XCHG;
R0 = Y;     R1 = X;

But here is one that may be a little surprising:

P0          P1
==========  ==========
X = 1;      Y = 1;
R0 = X;     R2 = Y;
R1 = Y;     R3 = X;

Assuming X and Y are "volatile" to the compiler, is R1 == R3 == 0 a possible outcome in this program?

Based on the rules we provide for .NET's MM, and Intel's whitepaper, one could reasonably argue "no".  The reasoning goes as follows.  True data dependence prohibits R0 = X from moving before X = 1, and the no load/load reordering rule (e.g. Intel's Rule 2.1) prohibits R1 = Y from moving before R0 = X.  Thus, transitively, R1 = Y may not move before X = 1.  Similarly, true data dependence prohibits R2 = Y from moving before Y = 1, and the no load/load reordering rule prohibits R3 = X from moving before R2 = Y, and therefore R3 = X may not move before Y = 1.  Given this reasoning, the individual instruction streams cannot be reordered in place.  And therefore, no interleaving of them will yield R1 == R3 == 0, because either X = 1 or Y = 1 must happen first, and both R1 = Y and R3 = X must come later.  Hence at least one of R1 or R3 will observe a value of 1.

Sadly, this reasoning is incorrect.  Rule 2.4 in the Intel whitepaper states that "intra-processor forwarding is allowed."  They even have an innocent example in the paper, but it actually doesn't exhibit load/load reordering.  It does, however, illustrate that stores may be delayed for some time in a write buffer.  Perhaps surprisingly, such intra-processor forwarding of buffered stores is actually permitted to satisfy subsequent loads from that location by the same processor before the store has left the processor.  This can happen even if it means passing intermediate loads from different memory locations!  The result is that load/load reordering is effectively possible under some circumstances.  Loads still physically retire in order of course, but because they may be satisfied by pending writes that other processors cannot yet see, it is as if the original program were written as:

P0          P1
==========  ==========
R1 = Y;     R3 = X;
X = 1;      Y = 1;
R0 = X;     R2 = Y;

The fundamentally contradicts what most people believe about .NET's MM, and indeed, Intel's MM as specified in that whitepaper.  To be fair, the whitepaper actually does call this out, but in a roundabout and misleading fashion.  The text in Rule 2.1, which states that "no loads can be reordered with other loads", is far too strong.

Anytime a little hole in something as fundamental as MM axioms is uncovered, it is cause for concern.  So I found this discovery deeply disturbing.  Many abstractions and theorems are proved with the assumption that the MM is rock solid.  I know a lot of code I have written relies on such proofs.

That said, I've been racking my brain (and in fact was having nightmares about it last evening) trying to uncover a case where this is worse than the existing release/acquire reordering issue that I opened this post with.  Everything I come up with is saved at the last minute by rules 2.1 (for stores) and 2.5 "stores are transitively visible".  The basic problem is that a processor can get stuck seeing its own written value for some time, during which other processors cannot, but ultimately it doesn't seem to matter because the buffer will eventually be flushed.  Then any intermediary values that the destination may have held while that processor was stuck will have been overwritten anyway, so the outcome should be explainable (albeit racey).  I'm still thinking hard about this.

7/16/2008 7:43:00 PM (Pacific Daylight Time, UTC-07:00)  #   

 Monday, June 23, 2008

I just submitted the final manuscript for Concurrent Programming on Windows to Addison-Wesley.

This marks the exciting transition from things happening on my timetable to things happening on AW’s timetable.

A lot has changed for me since I decided to write this book.  You might be surprised to hear that I actually signed the contract for it on November 29th, 2005.  That’s 2 years and 7 months ago.  It’s almost unbelievable that this book took so long to finish.  By comparison, my first one took just a little over a year.  The road has been a long one, full of personal ups and downs, but it’s no doubt been an exciting trip.

I’ve been at Microsoft the whole time.  At the outset, I was a PM on the CLR Team, hacking on software transactional memory and PLINQ as an evening activity.  Then I transitioned to doing it full time, but still as a PM.  Then I joined the Parallel Computing team as the dev for PLINQ.  Then I kicked off the whole Parallel Extensions effort (which is 20 members and growing strong), became the dev lead, and here I am today.  It’s pretty strange to say this, but without the book very little of that would have happened.  I can’t think of a better way to get entrenched in a technology, experience the breadth, and force yourself to learn every little intricate and often enlightening detail.  If you can afford the impact to mental health and personal relationships ;), it’s an activity I highly recommend to anybody wanting to master a technology...  not that one can actually master the concurrency beast, but y’know...

In retrospect, it should have taken a year.  Maybe next time.

The good news is that you will have the book in your hands soon.  (Well, if you decide to buy a copy, that is.)  If you manage to make it to my PDC 2008 pre-con session, I’m hoping we will have some copies available.  No promises, since I missed my final deadline by a couple weeks, but my fingers are crossed.

Oh yeah, and you can expect me to pick up blogging again now that I’ll have some free time.  Hmm, free time?  What will I do with myself!

Laissez les bon temps roulez!

6/23/2008 12:34:18 AM (Pacific Daylight Time, UTC-07:00)  #   

 Friday, June 13, 2008

We had an interesting debate at a Parallel Extensions design meeting yesterday, where I tried to convince everybody that a full fence on SpinLock exit is not a requirement.  We currently offer an Exit(bool) overload that accepts a flushReleaseWrite argument.  This merely changes the lock release from

m_state = 0;

to

Interlocked.Exchange(ref m_state, 0);

The main purpose of this is to announce “availability” of the locks to other processors.  More specifically, it ensures that before the current processor is able to turn around and reacquire the lock in its own private cache, that other processors at least have the opportunity to see the write.  This is a fairness optimization, and avoiding the CAS on release halves the number of CAS operations necessary (which are expensive), so we would generally like to avoid superflous ones.  It turns out you could easily do this without our help.  Instead of

slock.Exit(true);

you could say

slock.Exit();
Thread.MemoryBarrier();

Most of the debate about whether the default Exit should use a fence centered around confusion over the strength of volatile vs. a full fence.  For example, the C# documentation for volatile is highly misleading (http://msdn.microsoft.com/en-us/library/x13ttww7(VS.71).aspx):

The volatile modifier is usually used for a field that is accessed by multiple threads without using the lock statement to serialize access. Using the volatile modifier ensures that one thread retrieves the most up-to-date value written by another thread.

The confusion is over the “ensures that one thread receives the most up-to-date value written by another thread” part.  Technically this is somewhat-accurate, but is worded in a very funny and misleading way.  To see why, let’s take a step back and consider what volatile actually means in the CLR’s memory model (MM) for a moment, to set context.  Note that I did my best to concisely summarize the MM here: http://www.bluebytesoftware.com/blog/2007/11/10/CLR20MemoryModel.aspx.

Volatile on loads means ACQUIRE, no more, no less.  (There are additional compiler optimization restrictions, of course, like not allowing hoisting outside of loops, but let’s focus on the MM aspects for now.)  The standard definition of ACQUIRE is that subsequent memory operations may not move before the ACQUIRE instruction; e.g. given { ld.acq X, ld Y }, the ld Y cannot occur before ld.acq X.  However, previous memory operations can certainly move after it; e.g. given { ld X, ld.acq Y }, the ld.acq Y can indeed occur before the ld X.  The only processor Microsoft .NET code currently runs on for which this actually occurs is IA64, but this is a notable area where CLR’s MM is weaker than most machines.  Next, all stores on .NET are RELEASE (regardless of volatile, i.e. volatile is a no-op in terms of jitted code).  The standard definition of RELEASE is that previous memory operations may not move after a RELEASE operation; e.g. given { st X, st.rel Y }, the st.rel Y cannot occur before st X.  However, subsequent memory operations can indeed move before it; e.g. given { st.rel X, ld Y }, the ld Y can move before st.rel X.  (I used a load since .NET stores are all release.)  Note that RELEASe is the opposite of ACQUIRE: you can think of an acquire as a one-way fence that prohibits passes downward, and a release as a one-way fence that prohibits passes upward.  A full fence prohibits both (lock acquire, XCHG, MB, etc).

Note one very interesting thing in this discussion: a release followed by an acquire, given the above rules, does not prohibit movement of the instructions with respect to one another!  Given { st.rel X, ld.acq Y }, even though they are both volatile (i.e. acquire and release), so long as X!=Y, it is perfectly legal for the ld.acq Y to move before st.rel X.  We aren’t limited to single instructions either, e.g. { st.rel X, ld.acq A, ld.acq B, ld.acq C }, all three loads (A, B, C) may indeed happen before the X.  This occurs with regularity in practice, on X86, X64, and IA64, because of store buffering.  It would just be too costly to hold up loads until a store has reached all processors.  Superscalar execution is meant to hide such latencies.

(As an aside, many people wonder about the difference between loads and stores of variables marked as volatile and calls to Thread.VolatileRead and Thread.VolatileWrite.  The difference is that the former APIs are implemented stronger than the jitted code: they achieve acquire/release semantics by emitting full fences on the right side.  The APIs are more expensive to call too, but at least allow you to decide on a callsite-by-callsite basis which individual loads and stores need the MM guarantees.)

I have to admit the store buffer problem is mostly theoretical.  It rarely comes up in practice.  That said, on a system which permits load reordering, imagine:

Initially: X = Y = 0

T0                       T1
X = 5; // st.rel         while (X == 0) ; // ld.acq
while (Y == 0) ; // ld   X = 0; // st.rel
A = X; // ld.acq         Y = 5; // st.rel

After execution, is it possible that A == 5?

If the read of Y is non-volatile on T0 (which would be bad because a compiler may hoist it out of the loop, but ignore compilers for a moment), then the fact that the subsequent read of X is volatile does not save us from a reordering leading to A == 5.  This is the { ld, ld.acq } case described earlier.  Why might this physically occur?  Well, it won’t happen on X86 and X64 because loads are not permitted to reorder.  However!!  IA64 permits non-acquire loads (non-volatile) to reorder, and so the A = X may actually be satisfied out of the write buffer before the store even leaves the processor.  It’s as though the program became:

T0                       T1
X = 5; // st.rel         while (X == 0) ; // ld.acq
A = X; // ld.acq         X = 0; // st.rel
while (Y == 0) ; // ld   Y = 5; // st.rel

Whoops!  This should make it apparent that this outcome is indeed a real possibility.  And clearly it may cause bugs.

Note 6/13/08: Eric pointed out privately that compilers need only respect the CLR MM, and can freely reorder loads.  Thus, this problem may actually arise on non-IA64 machines.  Of course he is entirely correct.  It was silly of me to overlook that.

All that said, let’s get back to the original concern about visibility of writes.  This issue doesn’t even really involve reordering.  Imagine one processor continuously executes a stream of lock acquires and releases, and that the stream goes on indefinitely (perhaps because it’s in a loop):

while (Interlocked.CompareExchange(ref m_state, 1, 0) != 0) ;
m_state = 0;
while (Interlocked.CompareExchange(ref m_state, 1, 0) != 0) ;
m_state = 0;

The Interlocked operation acquires the cache line in X mode.  After it executes, other processors will notice that the lock is taken.  But right away, the processor writes 0 to the line without a fence, and immediately goes on to execute another acquire.  It is highly likely that the line will be marked dirty in the processor’s cache by the time that it acquires it in X mode again, something that the cache coherency system makes very cheap.  In fact, the write of m_state = 0 probably hasn’t left the write buffer yet due to latency.

So before another processor can even see m_state as 0, the processor will have already gotten around to taking the lock again.  Even for volatile loads and stores, there is no MM guarantee that writes will leave the processor immediately; hence the documentaiton earlier is slightly confusing; yes, the processor doing a volatile read will see the “most recent” value, but that “most recent” value (a) may be satisfied out of the local write buffer, and (b) may simply not have the ability to observe writes that occurred in practice due to the above timeliness issue. 

6/13/2008 12:52:45 PM (Pacific Daylight Time, UTC-07:00)  #   

 Thursday, June 05, 2008

We sat down last week with Charles from Channel9 to discuss the new CTP.  Both parts got posted today:

We focus on the new aspects of the stack, incl. the new scheduler and CDS, and also discuss what's changed in PLINQ and TPL.

If you have ideas for future videos, or any feedback/questions, you know where to send 'em.  joedu AT youknowwhere DOT com.

6/5/2008 5:14:47 PM (Pacific Daylight Time, UTC-07:00)  #   

 Monday, June 02, 2008

We just released a new CTP of Parallel Extensions to .NET: get it here.

Some relevant information is up on our team blog:

I'm really excited to see our entire stack finally shipping as one cohesive unit: the data structures we use throughout the implementation exposed publicly (what we now call CDS), a new scheduler built from the ground up, TPL and PLINQ better together, and lots more.  We're still very far from the end of the road, and have plenty of cool stuff still in the works, but we've made a ton of progress in just 6 months since our last CTP.

Have fun and, as always, feedback is much appreciated.

6/2/2008 9:00:26 AM (Pacific Daylight Time, UTC-07:00)  #   

 Thursday, May 29, 2008

PDC'08 is officially on for October 27-30th this year: http://microsoftpdc.com/.

My team will certainly have some really fun stuff to show off, and just glancing at the preliminary list of teaser sessions, it's going to be a blast.

A few of us have teamed up to give a PreCon.  That's code-word for a full day of neverending concurrency goodness.  From http://microsoftpdc.com/agenda/Preconference.aspx:

Concurrent, Multi-core Programming on Windows and .NET
Presenter(s): David Callahan, Joe Duffy, Stephen Toub
The leap from single-core to multi-core technology is altering computing as we know it, enabling increased productivity, powerful energy-efficient performance, and leading-edge advanced computing experiences. The good news is that Windows and .NET offer rich support for threading and synchronization to take advantage of these new platforms. This session, presented by David Callahan, Microsoft distinguished engineer, Joe Duffy, author of “Concurrent Programming on Windows” (Addison-Wesley), and Stephen Toub, program manager lead for the Concurrency Development Platform team at Microsoft, will cover a broad range of topics, including mechanisms to create, coordinate, and synchronize among threads; best practices for concurrent libraries and apps; and techniques for improving scalability, including lock-free algorithms. Focus will be on .NET programming, including the next generation of parallel programming support within the Framework, but Windows internals and C++ nuggets will be discussed too.

About the presenter(s):
David Callahan joined Microsoft in 2005. He is a Distinguished Engineer leading the Parallel Computing Platform Team within Visual Studio® focused on incubating technology for the coming manycore processors. This team is producing exciting new technologies as part of Visual Studio and also driving the Parallel Computing Initiative, a company wide effort to deliver customer value from the power of future high-performance processors. David’s background is in programming languages, parallel programming techniques, and compilation techniques focused on expressing and exploiting concurrency.

Stephen Toub is a Senior Program Manager Lead on the Parallel Computing Platform team at Microsoft, where he spends his days focusing on the next generation of programming models for concurrency. Stephen is also a Contributing Editor for MSDN® Magazine, for which he writes the .NET Matters column, and he’s an avid speaker at conferences like TechEd and DevConnections. Prior to working on the Parallel Computing Platform, Stephen designed and built enterprise applications for companies such as GE, McGraw-Hill, BankOne, and JetBlue. He was a developer for Microsoft Outlook as well as for the Microsoft Office Solution Accelerators. In his spare time, Stephen loves to sing, and he spends as much time as possible with his beautiful wife Tamara.

Joe Duffy leads development for Microsoft's Parallel Extensions to .NET technology, a set of library and runtime technologies for concurrent and parallel computing. He founded the project in 2006 with Parallel Language Integrated Query (aka PLINQ), an innovative declarative parallel query analysis and execution engine. Prior to Parallel Extensions, Joe worked on transactional memory, library and VM support for concurrency in the Common Language Runtime (CLR) team, and has written 3 functional language compilers (Scheme, Common LISP, and Haskell). He has written two books, including Concurrent Programming on Windows (Addison-Wesley, 2008), and in his spare time reads and writes (code and text), plays guitar, and studies music theory.

Be there, or be square.  The slides aren't even finalized yet, so let me know if you have particular topics you'd like to learn more about.

5/29/2008 11:19:11 PM (Pacific Daylight Time, UTC-07:00)  #   

 Friday, May 16, 2008

Counting events and doing something once a certain number have been registered is a highly common pattern that comes up in concurrent programming a lot.  In the olden days, COM ref counting was a clear example of this: multiple threads might share a COM object, call Release when done with it, and hence memory management was much simpler.  GC has alleviated a lot of that, but the problem of deciding when a shared IDisposable resource should be finally Disposed of in .NET is strikingly similar.  And now-a-days, things like CountdownEvent are commonly useful for orchestrating multiple workers (see MSDN Magazine), which (although not evident at first) is based on the same counting principle.

Coding up one-off solutions to all of these is actually pretty simple.  But doing so seems unfashionably ad-hoc, at least to me.  Codifying the pattern can be done in a couple dozen lines of code, so that it can be reused for many purposes.  As an example, here is a reusable Counting<T> class, written in C#, that just invokes action delegate once the count reaches zero:

#pragma warning disable 0420

 

using System;

using System.Threading;

 

public class Counting<T>

{

    private readonly T m_obj;

    private volatile int m_count;

    private readonly Action<T> m_action;

 

    public Counting(T obj, int initialCount, Action<T> action)

    {

        m_obj = obj;

        m_count = initialCount;

        m_action = action;

    }

 

    public int AddRef()

    {

        int c;

        if ((c = Interlocked.Increment(ref m_count)) == 1)

            throw new Exception();

        return c;

    }

 

    public int Release()

    {

        int c;

        if ((c = Interlocked.Decrement(ref m_count)) == 0)

            m_action(m_obj);

        return c;

    }

 

    public T Obj { get { return m_obj; } }

}

Notice I’ve used the IUnknown vocabulary of AddRef and Release.  Old habits die hard.

The CountdownEvent I mentioned earlier is just a simple extension to this basic functionality.  In fact, we don’t need to write another class; it’s merely an instance of Counting<T>, where the T is a ManualResetEvent.  Setters directly use the Counting<T> object’s Release method to register a signal, while waiters can use the WaitOne method on the raw ManualResetEvent itself.  The event will be set once all signals have arrived:

Counting<ManualResetEvent> countingEv = new Counting<ManualResetEvent>(

    new ManualResetEvent(false), N, e => e.Set()

);

 

// Setter:

countingEv.Release();

 

// Waiter:

countingEv.Obj.WaitOne();

(Exposing a traditional Set/Wait interface would of course be nicer, but even then Counting<T> makes the implementation brain-dead simple.)

Similarly, the “who should dispose” problem is easy to solve with Counting<T>.  Say that, instead of setting the event, we actually want to Dispose of some IDisposable object when all threads are done with it:

Counting<ManualResetEvent> ev = new Counting<ManualResetEvent>(

    new ManualResetEvent(false), N, e => e.Dispose()

);

Though this does the trick, we might instead wrap it in a more convenient package:

public class CountingDispose<T> :

        Counting<T>, IDisposable

        where T : IDisposable

{

    public CountingDispose(T obj, int initialCount) :

        base(obj, initialCount, d => d.Dispose()) { }

}

Given this definition, threads can use the CountingDispose<T> object as they would any IDisposable thing.  This facilitates use in C# using blocks.  Only when all threads have called Dispose will Dispose be called on the actual underlying object:

CountingDispose<ManualResetEvent> ev = new CountingDispose<ManualResetEvent>(

    new ManualResetEvent(false), N

);

 

// Some threads wait:

using (ev) {

    … ev.WaitOne(); …

}

 

// Some threads set:

using (ev) {

    … ev.Set(); …

}

I’ve found that the extremely simple Counting<T> idea is a surprisingly powerful one.  It’s fairly extensible too; for example, you clearly may want to run actions at different points in the counting, use clever synchronization to ensure actions run at particular points are processed in-order (useful for progress reporting), to reset the count afterwards, and so on.  It’s way too simple to claim it’s anything terribly amazing, but thought I’d share the idea anyway.

5/16/2008 8:18:59 PM (Pacific Daylight Time, UTC-07:00)  #   

 

Recent Entries:

Search:

Browse by Date:
<July 2008>
SunMonTueWedThuFriSat
293012345
6789101112
13141516171819
20212223242526
272829303112
3456789

Browse by Category:

Notables: