|
Personal Info:
Joe  leads the architecture of an experimental OS's developer platform, where
he is also chief architect of its programming language. His current mission is to enable
writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe
founded the Parallel Extensions to .NET project.
He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife,
writing books, writing music,
studying music theory & mathematics, and doing anything involving food & wine.
My books
My music
Disclaimer:
The content of this site are my own personal opinions and do
not represent my employer's view in anyway.
© 2012, Joe Duffy
|
|
 Thursday, February 23, 2006
When you perform a wait on the CLR, we make sure it happens in an STA-friendly manner. This entails using msg-waits, such as MsgWaitForMultipleObjectsEx and/or CoWaitForMultipleHandles. Doing so ensures we pick up and dispatch incoming RPC work mid-stack, while the STA isn't necessarily sitting in a top-level message loop. In fact, an STA that doesn't pump temporarily can easily lead to temporary and permanent hangs (i.e. deadlocks), especially in common COM scenarios where reentrant calls across apartments are made (e.g. MTA->STA->MTA->STA). Even where deadlock isn't possible, failing to pump can have a ripple effect across your process, as components wait for other components to complete intensive work.
I was recently writing about this fact for an article, and realized I had leaped to an incorrect assumption. OLE creates special RPC windows for processing these apartment transitions. I knew that. But I had assumed that we blatently pump for messages for any windows. But in reality, we don't. We only pump the special "WIN95 RPC Wmsg", "OleMainThreadWndClass", and "OleObjectRpcWindow" RPC windows. Consider why.
If you were on a UI and you pumped your window's message queue, you could end up dispatching new events before old events had completed. If you dispatched a WM_CLOSE message on the same stack you were doing some other UI processing, you'd destroy the window before that other processing was done. Without the GUI message loop taking this into account somehow, you'd crash. There are other factors. Imagine you were processing a click event that required movement of several UI elements. If some bit of infrastructure--or perhaps your code--ended up pumping, UI invariants could be broken. (Lots of FX code pumps for RPC, by the way.) At best, this could lead to strange visual artifacts, and at worst a crash.
You'd also have to deal with the subtleties of reentrancy. I've written about this before. Imagine a UI event had waited on an auto-reset event, did some work, and then set the event. If another event--perhaps the same type--were dispatched while it was doing "some work," it might try to wait on this same event. This would be a deadlock. A pretty difficult one to track down too, especially if it only occurred if the user clicked on certain elements at precise timings to get the reentrancy to occur. This is already a problem with COM interop. But thankfully we don't burden Windows Forms and WPF programming with it too.
 Tuesday, February 21, 2006
I've been pretty busy lately orchestrating all the concurrency work the CLR is undertaking for the next few versions.
We have several long lead incubation projects underway to explore new programming models which--some of us at least--believe will make concurrent programming at least an order of magnitude easier. While I hope we ship them, the fate of those projects is currently unknown. We'll understand better over time what is feasible and in what timeframe.
But there's also a set of features and work items we are actively planning for in the mainline product. This includes looking at new abstractions (e.g. futures, barriers, better lock factoring, a richer variety of locks, lock leveling), fixing existing ones (e.g. nagging issues and obvious missing features (e.g. prioritization, isolation, diagnostics) with our thread-pool, improving RWL's performance), debugging and tool support, and general cleanup activities (e.g. getting our own locking practices in order).
If there's anything you'd like to see us consider in this category, please let me know. Either leave a comment here or send me an email at joedu at microsoft. I'll share it with the rest of the concurrency team, and we'll seriously consider it.
I also have a couple related articles to watch out for: one on deadlock avoidance and detection in next month's MSDN magainzine, and another on application responsiveness in WPF in DDJ. I'm also going to be on .NET Rocks next week discussing some of this.
Unfortunately, all of this has left very little time to blog. If you have ideas for topics, drop me a line.
 Tuesday, February 07, 2006
I was on a mail thread today, the topic for which was the meaning—and perhaps lack of comprehensiveness—of the statement: “This type is thread safe.” Similar statements are scattered throughout our product documentation, without any good central explanation of its meaning and any caveats.
It’s relatively difficult to make such a statement. The .NET Framework is generally written so that all static members are thread-safe, while instance members are not. There are some notable exceptions, mostly to do with immutable types (e.g. the primitives, System.DateTime, System.Type, etc.), but they are infrequent.
This brings me to the types of thread safety issues we’re generally concerned with.
The first problem is torn reads and writes. Operations that deal with data whose size is greater than the native machine-sized pointer (i.e. sizeof(void*)) are not atomic in the ISA. This applies to 64-bit operations on 32-bit platforms, 128-bit on 64-bit, etc. We happen to provide you two intrinsic 64-bit types—Int64 (C# long) and Double (C# double)—which makes this issue a tad tricky. So this code
static long x; // … x = 1111222233334444;
consists of two DWORD MOVs at the machine level, one to the most significant half and then the least significant half (in that order, at least in the Whidbey x86 JIT). Reads likewise involve two MOV instructions.
If you’re doing naked reads and writes with such values, instructions can be interleaved such that only one DWORD has been written. That means a thread racing with the above assignment can see x with a value of (x & 0xFFFFFFFF00000000), or 2524709548 in decimal. This is obviously surprising, as the two values (at first glance) don’t seem to be related. And this same principle applies to reads and writes of any value type instances whose size is greater than sizeof(void*).
This can be solved by protecting all access to the data under a lock or via an interlocked operation. Interlocked.Exchange will do the trick for writes, and Interlocked.Read for reads. Note that most platforms offer 128-bit interlocked instructions. Unfortunately, because of the platform-specificity, the Win32 APIs and our System.Threading APIs don’t broadly support them. Hopefully this changes over time. For the same reason you often need two void*-sized writes on 32-bit, you often need the same on 64-bit.
In summary, any type that exposes a writable 64-bit field, or which returns a 64-bit value which has been copied by a field that might be in motion, is not thread-safe. And any internal reads and writes need to be done under the protection of a lock or interlocked operation. A method that updates an internal field, for example, can race with a property that returns the current value.
The second problem is read/modify/write sets of instructions. If reads and writes can be multiple instructions long, it should be clear that by default a read/modify/write is at least three. For a 64-bit value on a 32-bit platform, it’s six. At any point in that invocation, interleaved execution can cause an update to go missing. The solution here, again, is to do this inside a lock. The Interlocked.CompareExchange function is great for this purpose, as it takes advantage of hardware-level read/modify/write instructions. Thankfully they are supported by all modern hardware ISAs.
The last problem is that of ensuring coarser-grained data structure invariants can never be seen in a broken state by concurrent execution. This is especially difficult since arbitrary managed programs don’t capture such invariants in the program itself. Aside from static state, most Framework types don’t even come close to attempting to provide this level of guarantee. Static caches and lazily initialized state, for example, are places where the Framework needs to account for concurrent access. Old-style collections with SyncRoots tried to provide similar protection, but the new generic collections don’t any longer, mostly because of the performance hit you take on sequential code-paths. But those cases are the exception, not the norm.
The immutable types mentioned above are nice in that, aside from initialization-time, they never break their internal invariants. Thus, aside from assignments of instances to shared variables, you needn’t worry about any special synchronization.
In summary, any type that breaks invariants must do so in such a way that these invariants can never be observed due to concurrent execution. This means all access to data needs to be serialized with respect to coarser grained operations updating state. Our Framework isn’t written in this way, so if you share it, you usually take responsibility for locking it.
My preference is for developers to assume that all types are unsafe, and to explicitly lock when accessing them concurrently. Regardless of the documentation’s claims. We simply do not check for these things across releases, and some code that works today might break tomorrow because we accidentally forgot to account for a torn read.
 Thursday, January 26, 2006
Vance Morrison's excellent MSDN article from a few months back talks about why double checked locking is guaranteed to work on the CLR v2.0, and why it is one of the few safe lock-free mechanisms on the runtime. He also sent an email to the develop.com mailing list a while back explaining why this pattern wasn’t guaranteed to work on the ECMA memory model. We did quite a bit of implementation work and testing to tame the crazy memory model of IA-64 on 2.0. (Note that none of this is in the ECMA specification, so if you’re worried about CLI compatibility, beware.)
These modifications not only enable the double checked locking pattern, but also prevent constructors from publishing the newly allocated object before their state has been initialized, as I mentioned in my PDC presentation on concurrency last year. We accomplish this by ensuring writes have 'release' semantics on IA-64, via the st.rel instruction. A single st.rel x guarantees that any other loads and stores leading up to its execution (in the physical instruction stream) must have appeared to have occurred to each logical processor at least by the time x's new value becomes visible to another logical processor. Loads can be given 'acquire' semantics (via the ld.acq instruction), meaning that any other loads and stores that occur after a ld.acq x cannot appear to have occurred prior to the load. The 2.0 memory model does not use ld.acq’s unless you are accessing volatile data (marked w/ the volatile modifier keyword or accessed via the Thread.VolatileRead API). This can lead to some subtle problems.
For example, a slight variant of the double checked lock will not work under our model:
class Singleton { private static object slock = new object(); private static Singleton instance; private static bool initialized; private Singleton() {} public Instance { get { if (!initialized) { lock (slock) { if (!initialized) { instance = new Singleton(); initialized = true; } } } return instance; } } }
You might have decided to use this pattern to determine whether to initialize a value-type, since checking the variable for null isn’t possible. If you had some more complex set of state, perhaps you want to use a single Boolean rather than checking, say, 10 separate variables to see if they have each been initialized. Whatever your reasoning, as written the above code is prone to a subtle race condition.
The problem here is that both reads of initialized and instance do not have 'acquire' semantics. Thus, instance could appear to have been read before initialized, e.g. as follows:
| Time |
Thread A |
Thread B |
| 0 |
|
Reads instance as null |
| 1 |
Reads initialized as false |
| 2 |
Sets instance to ref to new obj |
| 3 |
Sets initialized to true |
| 4 |
Uses instance (initialized) |
| 5 |
|
Reads initialized as true |
| 6 |
|
Uses instance (null!) |
Thread B ends up returning a null reference. If a caller tried to use it, they might encounter a spurious NullReferenceException, the cause of which is incredibly hard to debug. For example:
void f() { Singleton s = Singleton.Instance; s.DoSomething(); // Boom! }
For this to have happened, Thread B would have had to read instance entirely out of order. It might have done so for any number of reasons. If it recently executed some code that pulled it into cache—either directly or due to locality—it isn’t required to invalidate the cache with non-acquire reads, even though it observed a new write with release semantics, because it's as if the load was moved before the load of initialized. Or superscalar execution might perform branch prediction and retrieve the value of instance, assuming that initialized will be false, pulling it into cache ahead of the read of initialized. Again, because it is a non-acquire read, this is a valid thing to do. If it reads initialized as true, its prediction was actually correct, and it just returns the null value that was pre-fetched. It might even be the case that a compiler along the way moved the read, which is also entirely legal with our memory model.
One possible solution for this is to employ a volatile-read on the first read of the initialized variable, prohibiting the read of instance from moving prior to the read of initialized. Control dependency prevents us from having to use a volatile-read for the reads of both variables.
class Singleton { private static object slock = new object(); private static Singleton instance; private static int initialized; private Singleton() {} public Instance { get { if (Thread.VolatileRead(ref initialized) == 0) { lock (slock) { if (initialized == 0) { instance = new Singleton(); initialized = 1; } } } return instance; } } }
You could have instead inserted a call to Thread.MemoryBarrier instead, which is a two way memory-fence, in between if-block and the read of instance, but the cost of a barrier is generally higher than both a st.rel and ld.acq because it affects surrounding instructions and movement in both directions.
The take-away here is not that you must understand the specifics of how cache coherency, speculative execution, and our memory model interact. Rather, it should be that once you venture even slightly outside of the bounds of the few "blessed" lock-free practices mentioned in the article mentioned above, you are opening yourself up to the worst kind of race conditions. Using locks is a simple way to avoid this pain. And hopefully someday in the future, transactional memory will enable performant execution of code with lock elision techniques that lead to the performance of lock-free code, but without any of the mental illness that such techniques have been proven to cause.
 Saturday, January 21, 2006
I drink about 4x more tea than I do coffee. Actually, I wouldn't drink coffee at all if there were decent tea stores in the area with readily available to-go offerings.
I just ordered a bunch of great teas from Upton Tea Imports, including a new 2006 Darjeeling First Flush. I can't wait until they arrive:
- TD56: Tindharia Estate FTGFOP1 First Flush (EX-1)
- TS70: Temi Estate FTGFOP1 CL
- TA93: Nahorhabi Estate FTGFOP1 SPL CL
- ZO87: Ginseng Tie-Guan-Yin Oolong
- ZM44: Osmanthus Oolong Se Chung
- ZW84: Organic Fuding White Treasure
- ZW90: Organic White Point Reserve
- ZW99: China White Paklum Tips Reserve
- TA98: Mothola Estate White Tea
- TJ77: Spring Harvest Kabusencha
If you're not a tea drinker, or the closest thing to real tea you've had is a soppy Stash tea bag, I encourage you to try one of Upton's sampler sets.
This deck, from a recent POPL presentation, offers a fascinating view on the future of mainstream games-programming languages, specifically around safety and concurrency: http://www.cs.princeton.edu/~dpw/popl/06/Tim-POPL.ppt.
(Of course, if I had a link to a video-taping of the talk or a detailed paper, that would be more useful; but I don't.)
An interesting conclusion--reinforcing a growing commonplace belief, at least here at Microsoft--is that there is no single concurrency model that works for all problems. In other words, there is no silver bullet. A combination of isolation (via STM), with implicit- and data-parallelism is discussed.
 Monday, January 16, 2006
Transactional memory promises to improve the lives of developers everywhere. From races, to deadlocks, to lock granularity and scalability headaches, the concept of transactions cleans up a lot of the worries inherent in the current lock-based concurrency programming model.
You have it today, sort of. Juval Lowy wrote a great piece on MSDN a few months back on how to build System.Transactions resource-managers over volatile in-memory data. I highly recommend checking it out. Maurice Herlihy has also made his SxM library available online here. I wonder if there's an interesting intersection between the two.
Update: I neglected to mention failure atomicity as one of the major benefits of TM, whether running concurrently or not.
 Tuesday, January 10, 2006
I recently uploaded a new page to serve as the online-companion for my .NET Framework 2.0 book.
It doesn't contain much right now, but it does have a lot of links to references I used throughout (books and articles): /books/netfx20/netfx20_book_resources.html.
 Saturday, January 07, 2006
I've posted before about how you might use C# enumerators to simulate coroutines. Enumerators are a very powerful feature, but unfortunately have one big drawback vis-à-vis their attempt at coroutines: you can yield only from one stack frame deep. The C# compiler state-machine transforms enough information for a single function, but obviously doesn't do that for the entire stack. Real coroutines can yield from an arbitrarily nested callstack, and have that entire stack restored when they are resumed.
There are other techniques. If you're willing to spend an entire thread to keep the stack alive, for example, you can use events to model coroutines with a standard producer/consumer relationship. The benefit to this approach is that you are in fact able to yield from arbitrarily nested frames. The clear drawback is the performance overhead. Each coroutine will eat up 1MB of reserved stack space from the virtual address space. But, probably worse, each time a new item is requested, an OS context switch is required; and similarly, whenever a new item becomes available (i.e. yielded), a context switch occurs again. This back-and-forth switching is pure overhead that could be eliminated with true coroutines.
(Note: this article describes how to use Fibers to avoid this context switch penalty. Fibers are dynamite on the CLR, however, so tread with caution if you even contemplate using this approach. Furthermore, you can easily dream up ways to serialize the physical stack, a la continuations. You do have access to the current CONTEXT, via GetThreadContext, on Windows and can use the thread's stack base and context ESP to determine the boundaries. But so many things in Windows rely on the TEB, from the CRT to exception handling to GetLastError to arbitrary usage of the TLS, like the way the CLR maintains a list of frame transitions. Nevermind having to accurately report roots back to the GC. These nightmares make real coroutines on Windows almost unapproachable, at least for the faint of heart.)
I hacked up a little Coroutine class this morning that uses the thread-per-coroutine approach mentioned above. Up front, I have to warn you: I spent 30 minutes on this thing. It's bound to be buggy, and I took some shortcuts (like not implementing the respective collections interfaces). Rather than walking through bit-by-bit, I've tried to comment the source code to explain how it works:
using System;
using SD = System.Diagnostics;
using System.Threading;
public delegate void CoroutineStart();
public class Coroutine<T> : IDisposable
{
// Fields
private CoroutineStart start;
private Thread thread;
private AutoResetEvent computeNextEvent;
private AutoResetEvent nextAvailableEvent;
private ManualResetEvent doneEvent;
private T current;
// We have a thread-static here so the coroutine needn't track the Coroutine<T> object
// manually. The Yield function is static, so they can just call Coroutine.Yield(v);
[ThreadStatic]
private static Coroutine<T> coroutine;
// Constructors
public Coroutine(CoroutineStart start)
{
this.start = start;
this.thread = new Thread(Worker);
this.computeNextEvent = new AutoResetEvent(false);
this.nextAvailableEvent = new AutoResetEvent(false);
this.doneEvent = new ManualResetEvent(false);
}
// Properties
public T Current
{
// TODO: we could add some error checking here, e.g. if somebody tries to
// read past the end-of-stream.
get { return current; }
}
// Methods
public bool MoveNext()
{
if (thread.ThreadState == ThreadState.Unstarted)
thread.Start();
else
computeNextEvent.Set();
// We wait on the 'next available' and 'done' events simultaneously. And then
// we use this to determine whether the coroutine has finished or not. The consumer
// will typically use this in a loop, e.g. while (c.MoveNext()) { f(c.Current); }.
return (0 ==
WaitHandle.WaitAny(new WaitHandle[] { nextAvailableEvent, doneEvent }));
}
private void Worker()
{
try
{
// Stash the coroutine object in TLS and start the CoroutineStart routine.
coroutine = this;
start();
}
catch (ThreadInterruptedException)
{
// Ignore the interrupt request. We use this as the 'proper' way to shut-down
// a couroutine. This is really a hack. Needs to be revisited.
}
finally
{
// Lastly, signal to the caller that the coroutine is done producing. Note that
// we'd ideally just use the thread executive object directly. But unfortunately the
// managed thread class doesn't expose this WaitHandle. :(
doneEvent.Set();
}
}
public static void Yield(T value)
{
Coroutine<T> c = coroutine;
// First, ensure we're on a coroutine thread.
if (c == null)
throw new InvalidOperationException("You can only yield from a coroutine thread");
// Now, set the coroutine's current value to the argument, signal to the consumer
// that we have a new item, and go to sleep until we're asked to compute the next item.
c.current = value;
c.nextAvailableEvent.Set();
c.computeNextEvent.WaitOne();
}
public void Dispose()
{
// We ensure the thread has stopped here. We use a really ugly interrupt to bring
// it down if not.
if (thread.ThreadState != ThreadState.Aborted &&
thread.ThreadState != ThreadState.Stopped &&
thread.ThreadState != ThreadState.Unstarted)
{
SD.Trace.TraceWarning("Coroutine thread has not stopped when Disposing, in state {0}",
thread.ThreadState);
thread.Interrupt();
// Joining here is questionable at best. It could lead to deadlocks.
thread.Join();
}
// Close out all of the events.
computeNextEvent.Close();
nextAvailableEvent.Close();
doneEvent.Close();
}
}
public static class Coroutine
{
// This is a trick. The C# compiler will infer the method argument <T>, enabling
// us to shunt right over to the Coroutine<T> implementation. This is nice because
// the user can just write Coroutine.Yield(n) instead of Coroutine<T>.Yield(n). The
// annoying part is that you can easily yield something of the wrong type, leading to
// an IllegalOperationException because C<T>.Yield will look in TLS and not find anything.
public static void Yield<T>(T t)
{
Coroutine<T>.Yield(t);
}
}
Now let's see it in action. Given a function Fibonacci, which continuously yields the next item in the Fibonacci sequence:
void Fibonnaci()
{
long n0 = 0;
long n1 = 1;
long n;
while (true)
{
n = n0 + n1;
n0 = n1;
n1 = n;
Coroutine.Yield(n);
}
}
We can form a coroutine over it and scroll through the first 10 numbers:
using (Coroutine<long> c = new Coroutine<long>(Fibonnaci))
{
int i = 0;
while (c.MoveNext() && i++ < 10)
Console.WriteLine(c.Current);
}
And of course, we can create a coroutine over a function that yields from functions deep in the call stack:
void a()
{
Coroutine.Yield("a");
b();
e();
}
void b()
{
Coroutine.Yield("a.b");
c();
}
void c()
{
Coroutine.Yield("a.b.c");
d();
}
void d()
{
Coroutine.Yield("a.b.c.d");
}
void e()
{
Coroutine.Yield("e");
}
And iterate over it:
using (Coroutine<string> c = new Coroutine<string>(a))
{
while (c.MoveNext())
Console.WriteLine(c.Current);
}
A neat extension to this whole idea might be a BeginMoveNext function that follows the asynchronous programming model. You could then exploit the fact that the consumer and producer are on separate threads to make progress while the producer is calculating the next item in line. Assuming you're on a multi-hardware-thread machine, this would cut down on the context switch penalty by as much as half.
 Thursday, December 29, 2005
|
|
Recent Entries:
Search:
Browse by Date:
| | Sun | Mon | Tue | Wed | Thu | Fri | Sat | | 29 | 30 | 31 | 1 | 2 | 3 | 4 | | 5 | 6 | 7 | 8 | 9 | 10 | 11 | | 12 | 13 | 14 | 15 | 16 | 17 | 18 | | 19 | 20 | 21 | 22 | 23 | 24 | 25 | | 26 | 27 | 28 | 1 | 2 | 3 | 4 | | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
Browse by Category:
Notables:
|