Thursday, June 29, 2006

I'll be speaking at JAOO'06 in Denmark this October. They have an entire track dedicated to concurrency. If you're in the area (or don't mind the travel), I highly recommend checking it out:

Concurrency and the composition of frameworks

Abstract:
Multi-core computer architectures pose both a threat and an opportunity to modern software. The amount of computing power that will soon be available will enable mainstream applications to solve problems that require computing power that has until recently only been available in supercomputers. But it also means that our software needs to evolve alongside to better support and enable the levels of concurrency we'll need to effectively use all of those cores. This fact applies as much to reusable software libraries as it does to applications themselves.

This new direction imposes some new and interesting constraints on the architecture of reusable software components, including the need to remain thread agnostic, expose latency characteristics and mechanisms for hiding latency, and, for computationally expensive library routines, some way to carry them out in parallel based on the context in which they were called. These are all areas which have not yet been researched heavily and which commercial library vendors are only now beginning to seriously deal with.

This talk presents an overview of the problem, identifies some key challenges, and proposes some direction for enabling our software to both take advantage of concurrency and to avoid inhibiting it. While the discussion has been derived from experiences on the Windows and .NET Framework platforms, the ideas presented aim to transcend any specific technology.

If you're in Denmark and want to meet up to chat, definitely drop me a line.

6/29/2006 1:00:24 PM (Pacific Daylight Time, UTC-07:00)  #   

 Saturday, June 24, 2006

It shouldn't be news to anybody that Bill is transitioning into a new role in 2 years. I figured I'd dump some of my thoughts about this onto paper. Remember: This is in no way the official company view on the matter, nor is it motivated by any sort of company-private information.

First, I am surprised that Bill didn't make this transition sooner. I think it's admirable how he's been able to maneuver through the technology details for so long. He refused to completely give up his technical edge. And I think it's cool that he can venture out into entirely new verticals with the premise in hand that all you need is a bunch of really smart, motivated people to succeed. It's risky. And it's a sort of an antithesis to traditionalist business management views.

From talking to colleagues about this news, however, I think people tend to downplay the importance of having Bill around. Externally, there's no doubt he's a huge part of our PR, whether you like him or not. That's probably why the stock has been flat after the announcement. Not having him in charge of technical direction may open up some new avenues that we wouldn't have otherwise explored. New blood is always healthy. At the same time, though, it takes a functioning system and runs the risk of disrupting it. But I'm more worried about the internal climate...

Just as Dave Cutler is a god to the NT Team, Bill is a role model for every technical person in the company. He's a geek. He talks like one, he looks like one, and he acts like one. He was very successful at a young age, was self-taught, and didn't need college to do and succeed at what he loved. And he has an inconceivable level of power and influence both within and outside the company. For a place that's full of uber-geek MS-for-lifers who joined the firm straight out of college, and who wouldn't think about leaving (with Bill around at least), all of these are very important traits. While Ray Ozzie is a very accomplished and intelligent guy, he's missing almost all of the traits I mentioned above. And I think people will notice.

In the past month alone, I've had a BillG Review and received feedback from him on two spring ThinkWeek papers that I submitted. These were my first personal interactions with Bill, and perhaps my last. I was impressed. He's scary smart. There's no doubt he's very clever and can effortlessly cut through complexity to understand the core of really deep technical problems. There's no way the new senior technical leadership will have the same traits and to the same degree. It's not that they aren't great people. I've had several meetings with Craig Mundie recently, our CTO, and he's extraordinarily insightful and talented. I found myself blown away by some points he was making. But Bill's just too good to beat.

So here's the gist of it all. Microsoft in the past has been a technically motivated company. Bill's passion was around how we could use technology to change the world. But at the same time, he cared about how that technology was architected and built. He didn't simply spew MBA mumbo-jumbo. Microsoft feels like a company full of 100s of start-ups, each of which reports to the same technical leader, all fighting tooth and nail to build the greatest technology possible. I think all of that is going to change. I conjecture that we're at the beginning of a major shift, where Microsoft will slowly evolve from a technology-driven company to a business-driven company. The two are not mutually exclusive, obviously, but the balance will shift. We'll do more projects based on business reasons and less based on pure technology reasons. We'll waste less money in the process. It had to happen sooner or later. But for the geeks like me, I have to wonder whether it will remain as much of an enjoyable place to work. Or whether those who are looking for such an environment will be forced to go elsewhere...

6/24/2006 11:37:00 AM (Pacific Daylight Time, UTC-07:00)  #   

 Wednesday, June 21, 2006

Windows Vista has some great new features for concurrent programming. For those of you still writing native code, it's worth checking them out. For those writing managed code, we have a bunch of great stuff in the pipeline for the future, but unfortunately you'll have to wait. Or convert (back) to the dark side.

The Vista features include:

1. Reader/writer locks. The kernel32 function InitializeSRWLock takes a pointer to a SRWLOCK structure, just like InitializeCriticalSection, and initializes it. AcquireSRWLockExclusive and AcquireSRWLockShared acquire the lock in the specific mode and ReleaseSRWLockXXX releases the lock. This is a "slim" RW lock, meaning it's actually comprised of a pointer-sized value, and is ultra-fast and lightweight, much like existing Win32 CRITICAL_SECTIONs. It should be about the cost of a single interlocked operation to acquire. E.g.

SRWLOCK rwLock;
InitializeSRWLock(&rwLock);
AcquireSRWLockShared(&rwLock);
// ... shared operations ...
ReleaseSRWLockShared(&rwLock);

2. Condition variables. These integrate with RW locks and critical sections, enabling you to do essentially what you can already do with Monitor.Wait/Pulse/PulseAll. InitializeConditionVariable takes a pointer to a CONDITION_VARIABLE and initializes it. SleepConditionVariableCS and SleepConditionVariableSRW release the specified lock (either CRITICAL_SECTION or SRWLOCK) and wait on the condition variable as an atomic action. When the thread wakes up again, it immediately attempts to acquire the lock it released during the wait. WakeConditionVariable wakes a single waiter for the target condition and WakeAllConditionVariable wakes all waiters, much like Pulse and PulseAll. E.g.

Buffer * pBuffer = ...;
PCRITICAL_SECTION pCsBufferLock = ...;
PCONDITION_VARIABLE pCvBufferHasItem = ...;

// Producer code:
EnterCriticalSection(pCsBufferLock);
while (pBuffer->Count == 0) {
    SleepConditionVariableCS(pCvBufferHasItem, pCsBufferLock, INFINITE);
}
// process item...
LeaveCriticalSection(pCsBufferLock);

// Consumer code:
EnterCriticalSection(pCsBufferLock);
pBuffer->Put(NewItem());
LeaveCriticalSection(pCsBufferLock);
WakeAllConditionVariable(pCvBufferHasItem);

More details on condition variables can be found on MSDN.

3. Lazy/one-time initialization. This allows you to write lazy allocation without fully understanding memory models and that sort of nonsense. The new APIs in kernel32, InitXXX, support both synchronous and asynchronous initialization. These have some amount of overhead for the initialization case due to the use of a callback, but in general this will be fast enough for most lazy initialization and much less error prone. Herb Sutter has proposed a similar construct for the VC++ language, and to be honest I wish we had this built-in to C# too. See the MSDN docs for an example and more details.

4. An overhauled thread pool API. The Windows kernel team has actually rewritten the thread pool from the ground up for this release. Their APIs now support creating multiple pools per process, waiting for queues to drain or a specific work item to complete, cancellation of work, cancellation of IO, and new cleanup functionality, including automatically releasing locks. It also has substantial performance improvements due to a large portion of the code residing in user-mode instead of kernel-mode. MSDN has a comparison between the old and new APIs.

5. A bunch of new InterlockedXXX variants.

6. Application deadlock detection. This is separate from the existing Driver Verifier ability to diagnose deadlocks in drivers. This capability integrates with all synchronization mechanisms, from CRITICAL_SECTION to SRWLOCK to Mutex, and keys off of any calls to XXXWaitForYYYObjectZZZ. Unfortunately, I think this is new to the latest Vista SDK, and thus there isn't a lot of information available publicly. This could probably make a good future blog post if there's interest.

Have fun with this stuff, of course. But be careful. Don't poke an eye out.

6/21/2006 11:44:52 PM (Pacific Daylight Time, UTC-07:00)  #   

 Tuesday, June 20, 2006

Jim Johnson started a series back in January that I’m dying to see continued. It's about writing resource managers in System.Transactions, which surprisingly turns out to be incredibly straightforward. Provided you are able to implement the correct ACI[D] transactional qualities for the resource in question, that is. Juval Lowy’s December 2005 MSDN Magazine article on volatile resource managers described how to build what turns out to be essentially mini-transactional memory, without much of the syntax, implicit and transitive qualities, and robustness.

As an example of where you might use a resource manager, imagine that you wanted to ensure that any memory allocations and deallocations inside a transaction scope participate with the System.Transactions ambient transaction. Maybe you'd like your allocations to be in sync with the database server or web service to which you’re also transacting access. I’ll walk through an example of how straightforward writing such a resource manager can be.

First, our starting class is quite simple. It just allocates and frees memory. Sans transactions, it looks like this:

using System.Runtime.InteropServices;

public static class Mm {
    public static IntPtr Malloc(long bytes) {
        return Marshal.AllocHGlobal(new IntPtr(bytes));
    }
    public static void Free(IntPtr pp) {
        Marshal.FreeHGlobal(pp);
    }
}

Mm.Malloc returns a pointer to ‘bytes’ amount of memory via kernel32!GlobalAlloc (which turns out to be a crappy way to manage memory by the way, and is still alive only to support DDE, the clipboard, and OLE, or so I’m told; it works as an example though). Mm.Free takes a pointer to memory that was previously allocated via Mm.Malloc and frees it. Pretty simple.

OK, that’s not incredibly useful, especially considering that we’re just making single-line invocations to the Marshal class. But it’s a starting point.

Ultimately, what we want to ensure is that at the end of a transaction, any memory allocation and deallocation that happened within the transaction is consistent with the outcome of that transaction. That means, quite simply, that if memory was allocated and the transaction commits, we keep the memory allocated around; but if, on the other hand, the transaction rolls back, we must undo the allocation. Similarly, if we free memory and the transaction commits, then the memory remains freed; if it rolls back, we must undo the freeing.

If we want to build such a thing directly on top of existing facilities we clearly can’t do this precisely as I suggest. How do you undo a call to free in the CRT, for example? You can’t. Once you call free, the memory's gone, returned to the pool, and possibly used before your transaction even knows what to do with itself. But it turns out that we can "fake it" sufficiently close enough that most people can’t tell that we're faking it. Here’s what we do instead:

  1. When somebody allocates memory, we log a compensating action in the transaction that frees the memory should we roll back. If the transaction commits, we do nothing more.
  2. When somebody frees memory, we defer the call to commit time. If it never commit, we never free the memory.

This is fairly well known in database literature. Take a look at Jim Gray’s 1980 paper, A Transaction Model, where he describes REDO and UNDO actions, to see what I mean. (1980! That was ages ago.) What we’re saying basically is that allocation logs an UNDO action and freeing logs a REDO action. The isolation leaks out of this in some regard--evidence of our "faking it"--because the fact that our freed memory isn’t instantaneously available to other allocations might be noticed, especially under high stress conditions. OOMs may result that would have not otherwise happened, and the working set of the program may increase, especially for long running transactions. Cest la vie.

Anyhow… on to the implementation of these ideas. It’s surprisingly simple.

We will allow instances of our Mm class to be created by the implementation. From the viewpoint of a user, the class is still entirely static and cannot be constructed. These instances will become enlistments responsible for implementing transactional semantics and responding to certain event notifications from the System.Transactions machinery. To do so, the class must implement the System.Transactions interface IEnlistmentNotification:

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Transactions;

public sealed class Mm : IEnlistmentNotification {

    /** Fields **/

    private LinkedList<IntPtr> m_freeOnCommit = new LinkedList<IntPtr>();
    private LinkedList<IntPtr> m_freeOnRollback = new LinkedList<IntPtr>();

    [ThreadStatic]
    private static Dictionary<Transaction, Mm> s_currentMm;

    /** Constructors **/

    private Mm() {}

    /** Methods **/

    public static IntPtr Malloc(long bytes) {
        …
    }

    public static void Free(IntPtr pp) {
        …
    }

}

We’ve added two linked lists to hold the deferred (m_freeOnCommit) and compensating actions (m_freeOnRollback). And we have a thread-static dictionary that maps the current transaction to the enlisted instance of Mm. This is pretty straightforward stuff, although there are a plethora of alternative designs. Now let’s see how we get data into these things. The Malloc and Free implementations will change slightly to check for an existing transaction:

public static IntPtr Malloc(long bytes) {
    IntPtr pp = Marshal.AllocHGlobal(new IntPtr(bytes));

    // If insufficient memory, OOM is thrown and we never log the free.
    Mm mm = GetCurrentMm();
    if (mm != null) {
        // Compensating activity to ensure that if we rollback, we free.
        mm.m_freeOnRollback.AddLast(pp);
    }

    return pp;
}

public static void Free(IntPtr pp) {
    Mm mm = GetCurrentMm();
    if (mm != null) {
        // We defer the freeing of memory in case we don't commit.
        mm.m_freeOnCommit.AddLast(pp);
    } else {
        Marshal.FreeHGlobal(pp);
    }
}

This implements the commit and rollback behavior I described above, i.e. we add the memory location to free on to the deferred or compensated list according to the rules we’ve already established. GetCurrentMm is responsible for lazily allocating and enlisting an instance of Mm. If there is no active ambient transaction, it just returns null:

private static Mm GetCurrentMm() {
    // Are we in a transaction?
    Transaction currTx = Transaction.Current;
    if (currTx == null) {
        // Return null to indicate we're not in a transaction.
        return null;
    }

    // Have we already allocated and enlisted a volatile RM for this transaction?
    Mm currMm = null;
    if (s_currentMm == null) {
        s_currentMm = new Dictionary<Transaction, Mm>();
    } else {
        s_currentMm.TryGetValue(currTx, out currMm);           
    }

    // No RM found, create/enlist one.
    if (currMm == null) {
        currMm = new Mm();
        s_currentMm.Add(currTx, currMm);
        currTx.EnlistVolatile(currMm, EnlistmentOptions.None);
    }

    return currMm;
}

And of course we have a RemoveCurrentMm which will be used eventually to remove the enlistment information from our dictionary:

private static void RemoveCurrentMm() {
    Transaction currTx = Transaction.Current;
    if (currTx != null && s_currentMm != null) {
        s_currentMm.Remove(currTx);
    }
}

So now we have all of the information about what should be freed and when, but there’s no code that actually executes the free operations. To do that, all we have to do is implement the IEnlistmentNotification interface properly, iterating the proper list, and invoking Malloc.FreeHGlobal on the contents. In other words, Commit and Rollback just invoke free on all of the memory addresses in the respective linked list:

void IEnlistmentNotification.Commit(Enlistment enlistment) {
    FreeAll(m_freeOnCommit);
    RemoveCurrentMm();
    enlistment.Done();
}

void IEnlistmentNotification.Rollback(Enlistment enlistment) {
    FreeAll(m_freeOnRollback);
    RemoveCurrentMm();
    enlistment.Done();
}

private void FreeAll(LinkedList<IntPtr> toFree) {
    foreach (IntPtr p in toFree) {
        Marshal.FreeHGlobal(p);
    }
}

We’re assuming in all of those cases that FreeAll and RemoveCurrentMm can’t fail. If our commit or rollback processing failed mid-way, that would put the entire process at risk: memory could be leaked or become corrupt. System.Transactions will respond to that by sending InDoubt notifications to all enlistments. Since the only way we can potentially contain and resolve volatile state corruption is to crash the process, that’s exactly what we do:

void IEnlistmentNotification.InDoubt(Enlistment enlistment) {
    Environment.FailFast("State protected by RM is in question");
}

This is a Byzantine response, sure, but it’s the only way we can guarantee that state doesn’t become corrupt when the transaction’s fate is InDoubt. If the fate of one or more resource managers cannot be determined, we don't know whether to commit or fail for sure. We could guess, of course, but guessing doesn’t lead to pleasant behavior in software, especially when we have to debug it. (And if you're making guesses, you'll probably have to spend more time debugging, so it's a double whammy of sorts.)

And that’s it! Now we can use memory operations inside of a transaction, and have it behave as expected. Just as an example, this test case ensures that writing to memory that was allocated in a transaction that eventually aborts causes an AccessViolation:

bool test1success = false;
IntPtr pMem1 = IntPtr.Zero;
try {
    try {
        using (TransactionScope txScope = new TransactionScope()) {
            pMem1 = Mm.Malloc(1024 * 1024); // get 1MB of space
            throw new Exception(); // cause an abort
        }
    } catch {
        // The txn was aborted, we expect reading from memory to fail.
        uint * pInt = (uint *)pMem1.ToPointer();
        *pInt = 0xdeadbeef;
    }
} catch (AccessViolationException) {
    test1success = true;
}
Console.WriteLine("Test 1 succeeded: {0} (rollback of malloc)", test1success);

There are three other tests that may be of interest in the source file for the Mm class and associated code: MallocFree.cs.

Of course this approach is not perfect in several areas. One that comes to mind immediately is the fact that we're doing a potentially expensive lookup for an ambient transaction on every memory allocation and deletion, which could be too much, especially if it happened in some general purpose allocation and deallocation routines. And of course automatically finding the transaction and using might also be a bad idea. We might instead want the user to opt-in to transactional Malloc and Free at the callsite, so that users aren't surprised when their malloc or free never happens (the transaction rolled back). Nevertheless, this article at least cracked the surface of a very difficult problem and surfaced some interesting issues.

6/20/2006 4:45:09 PM (Pacific Daylight Time, UTC-07:00)  #   

 Saturday, June 17, 2006

Ntdll exports an undocumented function from WinNT.h:

PTEB NtCurrentTeb();

This gives you access to the current thread's TEB (thread environment block), which is a per-thread data structure that holds things like a pointer to the SEH exception chain, stack range, TLS, fiber information, and so forth. This function actually returns you a PTEB, which is defined as _TEB*. _TEB is an internal data structure defined in winternl.h, and consists of a bunch of byte arrays. You can cast this to PNT_TIB (defined as _NT_TIB*), which gives you access to the data in a strongly typed way. And _NT_TIB is a documented data structure, unlike _TEB, meaning you can actually rely on it not breaking between versions of Windows.

For example, this code prints out the current thread's stack base and limit. The base is the start of the user-mode stack, and the limit is the last committed page, which grows as you use more stack:

PNT_TIB pTib = reinterpret_cast<PNT_TIB>(NtCurrentTeb());
printf("Base = %p, Limit = %p\r\n",
    pTib->StackBase, pTib->StackLimit);

There's a shortcut you can take. You can always find a pointer to the TEB in the register FS:[18h]:

PNT_TIB pTib; 
_asm {
    mov eax,fs:[18h]
    mov pTib,eax
}
printf("Base = %p, Limit = %p\r\n",
    pTib->StackBase, pTib->StackLimit);

There's an even shorter shortcut you can take. You can actually find the base and limit in different segments of the FS register, FS:[04h] for the base and FS:[08h] for the limit:

void * pStackBase;
void * pStackLimit;
_asm {
    mov eax,fs:[04h]
    mov pStackBase,eax
    mov eax,fs:[08h]
    mov pStackLimit,eax
}
printf("Base = %p, Limit = %p\r\n",
    pStackBase, pStackLimit);

Unfortunately, the _asm keyword is not supported on all architectures, so the above code is only guaranteed to work on x86 (e.g. the VC++ Intel Itanium compiler doesn't support it). Furthermore, the hardcoded offsets 04h and 08h are clearly wrong on 64-bit: you need more than 4 bytes to represent a 64-bit pointer. NtCurrentTeb hides all of this and uses whatever platform-specific technique is needed to retrieve the information.

Matt Pietrek's 1996 and 1998 Microsoft Systems Jounral articles are the best reference I could find on TEBs, aside from the Windows Internals book.

Believe it or not, this is useful information. I wrote some code recently that took a different code path based on whether it was writing to the stack or the heap, and using the TEB does the trick.

I have written about 7 pages on user-mode stacks in my upcoming concurrency book. This ranges from CLR stack frames to stack overflow to just how stacks work internally in Windows. I haven't found any book or resource that collects all of this information together in one place. It turns out that most developers don't need to worry about stacks at all, but this understanding is crucial to moving forward to more advanced concurrency programming models.

6/17/2006 12:48:13 PM (Pacific Daylight Time, UTC-07:00)  #   

 Sunday, June 11, 2006

After posting my last article on creating a lazy allocation IAsyncResult, I received a few mails on the ordering of the completion sequence. It was wrong and has been updated. Thanks to those who pointed this out.

I used the following incorrect ordering: 1) set IsCompleted to true, 2) invoke the callback, and 3) signal the handle.

This can lead to deadlocks if the callback waits on the handle. My implementation carefully avoided EndXxx-induced deadlocks (by checking IsCompleted before waiting on the handle), but if the callback directly WaitOne's on the IAsyncResult.AsyncWaitHandle property, the callback will of course deadlock. Directly accessing the handle might be attractive to the callback author, especially for higher level orchestration via WaitAny and WaitAll. So it's probably something we'd like to support. One way to avoid this is to invoke the callback asynchronously with BeginInvoke, but a better solution is to use the correct ordering instead.

The correct ordering is: 1) set IsCompleted to true, 2) signal the handle, and 3) invoke the callback.

The first version I wrote had the correct ordering, since that seemed to be the logical choice. Unfortunately, the Framework Design Guidelines lists the steps in the wrong order, which led me down that path. I've let Brad and Krzys about this. A customer who read my blog actually mailed Brad about this error too, just about simultaneously. There may be rationale behind this, but we've used the correct ordering in the file and network IO APIs since V1.0 so I think it's just wrong.

It's worth pointing out that the network classes already use a lazy allocation scheme very similar to the one I wrote about. Check out the System.Net.LazyAsyncResult internal type in System.dll. I'm advocating for moving file IO onto the same plan in the next release of the Framework. We'll see how it turns out.

Lastly, some might notice I originally said this would be a two part series. Well, I wrote a whole bunch of code to implement a sophisticated LRU-based resurrection caching scheme--to avoid allocating IAsyncResults every time--and then realized that my example doesn't do anything expensive on IAsyncResult creation that would warrant such a thing. The result? It was actually slower than the ordinary lazy version I posted a couple weeks back. I think the techniques I used are interesting nonetheless, so I am going to try and rework the example to incorporate some expensive buffer management, and then see where I stand. But I'm not promising anything just yet. And this was a great reminder that solving actual profiled problems is always more worthwhile than solving perceived, yet unmeasured, non-problems.

6/11/2006 8:45:34 PM (Pacific Daylight Time, UTC-07:00)  #   

 Tuesday, May 30, 2006

Lock free code is hard. But it can come in handy in a pinch.

There have been some recent internal discussions about the IAsyncResult pattern and performance. Namely that, for high throughput scenarios, where the cost of the asynchronous work is small relative to the cost of instantiating new objects, there is a considerable overhead to using the IAsyncResult pattern. This is due to two allocations necessary to implement the pattern: (1) the object that implements IAsyncResult itself, and (2) the WaitHandle for consumers of the API who must access the IAsyncResult.AsyncWaitHandle property. I will address (2) in this article, since it is much more expensive than (1).

Update: I've posted an addendum to this article here.

Rendezvousing

Just to recap, there are four broad ways to rendezvous with the IAsyncResult pattern:

  1. You can poll the IAsyncResult.IsCompleted boolean flag. If it's true, the work has completed. If it's false, you can go off and do some interesting work, coming back to check it once in a while.
  2. Supplying a delegate callback to the BeginXxx method. This callback is invoked when the work completes, passing the IAsyncResult as an argument to your callback.
  3. Waiting on the IAsyncResult.AsyncWaitHandle. This is a Windows WaitHandle, typically a ManualResetEvent, which allows you to block for a while until the work completes.
  4. Call the EndXxx method. Internally, this will often check IsCompleted and, if it's false, will wait on the AsyncWaitHandle.

Notice that in cases 1 and 2, the WaitHandle isn't even needed. And in case 4, it's only needed some fraction of the time. Well, it turns out we can avoid allocating it altogether for those cases where it's not used. We can "just" lazily allocate it. Note that for asynchronous IO, the majority of code will use method 2 above. For scalable servers, we often don't want to tie up an extra thread polling or waiting for completion, since that contradicts the primary benefits of Windows IO Completion Ports.

The Requirements

Notice I enclosed the word just above in quotes when mentioning lazy allocation. We could of course use a lock. But that would require that we allocate an object to lock against. We could of course just lock 'this', but that also comes with a performance overhead. We can get away with lock free code in this case, so long as we recognize a very important race condition that we must tolerate. Imagine this case: Thread A checks IsCompleted. It's false. So it accesses the AsyncWaitHandle property, triggering lazy allocation. Meanwhile, Thread B finishes the async work, and sets IsCompleted to true. We need to ensure a deadlock doesn't ensue.

This race could go one of two ways:

  1. Thread A lazily allocates and publishes the WaitHandle before Thread B sets IsCompleted to true. Thread B must now witness a non-null WaitHandle when it checks, and it must return the WaitHandle in the signaled state. If it returns an unsigned WaitHandle, Thread A will wait on it, and never be woken up. This is a deadlock.
  2. Thread B finishes first, setting IsCompleted to true, and seeing a null WaitHandle. Thread A must see IsCompleted as true and consequently return the event in a signaled state. Just like before, if this doesn't happen, Thread A will wait on an unsigned WaitHandle which will never be signaled. Deadlock.

To ensure both of these cases work, Thread A's read of the WaitHandle field and Thread B's read of IsCompleted must be preceded by a memory barrier. This ensures the memory accesses aren't reordered at the compiler or processor level, either of which could lead to the deadlock situations we are worried about. The CLR 2.0's memory model is not sufficient even with volatile loads, because the load acquire can still move "before" the store release.

An Implementation

Here is one simplistic implementation of a FastAsyncResult class, with ample comments embedded within to explain things:

class FastAsyncResult : IAsyncResult, IDisposable {

 

    // Fields

 

    private object m_state;

    private ManualResetEvent m_waitHandle;

    private bool m_isCompleted;

    private AsyncCallback m_callback;

    internal object m_internal;

 

    // Constructors

 

    internal FastAsyncResult(AsyncCallback callback, object state) {

       m_callback = callback;

       m_state = state;

    }

 

    // Properties

 

    public object AsyncState {

        get { return m_state; }

    }

 

    public WaitHandle AsyncWaitHandle {

        get { return LazyCreateWaitHandle(); }

    }

 

    public bool CompletedSynchronously {

        get { return false; }

    }

 

    public bool IsCompleted {

        get { return m_isCompleted; }

    }

 

    internal object InternalState {

        get { return m_internal; }

    }

 

    // Methods

 

    public void Dispose() {

        if (m_waitHandle != null) {

            m_waitHandle.Close();

        }

    }

 

    public void SetComplete() {

        // We set the boolean first.

        m_isCompleted = true;

 

        // And then, if the wait handle was created, we need to signal it. Note the

        // use of a memory barrier. This is required to ensure the read of m_waitHandle

        // never moves before the store of m_isCompleted; otherwise we might encounter a

        // race that leads us to not signal the handle, leading to a deadlock. We can't

        // just do a volatile read of m_waitHandle, because it is possible for an acquire

        // load to move before a store release.

        Thread.MemoryBarrier();

        if (m_waitHandle != null) {

            m_waitHandle.Set();

        }

 

        // If the callback is non-null, we invoke it.

        if (m_callback != null) {

            m_callback(this);

        }

    }

 

    private WaitHandle LazyCreateWaitHandle() {

        if (m_waitHandle != null) {

            return m_waitHandle;

        }

 

        ManualResetEvent newHandle = new ManualResetEvent(false);

        if (Interlocked.CompareExchange(ref m_waitHandle, newHandle, null) != null) {

            // We lost the race. Release the handle we created, it's garbage.

            newHandle.Close();

        }

 

        if (m_isCompleted) {

            // If the result has already completed, we must ensure we return the

            // handle in a signaled state. The read of m_isCompleted must never move

            // before the read of m_waitHandle earlier; the use of an interlocked

            // compare-exchange just above ensures that. And there's a race that could

            // lead to multiple threads setting the event; that's no problem.

            m_waitHandle.Set();

        }

 

        return m_waitHandle;

    }

 

}

Notice also that we tolerate the race where two threads try to lazily allocate the handle. Only one thread wins. The thread that loses is responsible for cleaning up after itself so that we don't "leak" a WaitHandle (requiring finalization to close it). This is an example of tolerating races instead of preventing them, and is similar to the design we use for jitting code in the runtime, for example.

Some Initial Results

I'll do a more thorough analysis as follow up to my next post, including profiling traces. But the initial results are promising.

I wrote a test harness that calculates the fibonacci series asynchronously, based on the sample code used in the Framework Design Guidelines book. As you can see by the comparisons, the larger the series being calculated, the less substantial the impact my performance work makes:

Size         Normal       Lazy         Improvement  
1            177 ms       62 ms        185%
2            179 ms       63 ms        184%
3            180 ms       65 ms        177%
4            181 ms       66 ms        174%
5            187 ms       69 ms        171%
6            195 ms       79 ms        147%
7            210 ms       97 ms        116%
8            239 ms       122 ms       96%
9            275 ms       165 ms       67%
10           356 ms       257 ms       39%
12           745 ms       631 ms       18%
15           3217 ms      3075 ms      05%

As we would expect, as the ratio of the cost of computation to the cost of allocating the WaitHandle increases (with an increased "size" of the fibonacci series being calculated), the observed performance improvement also decreases. For very small computations, however, this technique can really pay off. In the case of high performance asynchronous IO, for example, where completion often involves simply marshaling some bytes between buffers, this can be a key step in the process of improving system throughput.

As I noted earlier, lock free techniques are almost never worth the trouble unless you've measured a problem, especially due to the maintenance and testing costs for such code. And I jumped right over using a lock for allocation. It turns out in this particular scenario, that technique fares just as well as the lock free code, albeit with a lot more simplicity. It only incurs slight overhead when checking the handle to see if it has been allocated yet as well as when setting it at completion time. Since my test case never has to check the WaitHandle, it only has to enter the lock upon completion, which is relatively cheap. As always, start simple and then go from there.

5/30/2006 9:00:31 PM (Pacific Daylight Time, UTC-07:00)  #   

C# 1.0 shipped with the ability to stack allocate data with the stackalloc keyword, much like C++'s alloca. There are restrictions, however, around what you can allocate on the stack: Inline arrays of primitive types or structs that themselves have fields of primitive types (or structs that etc...). That's it. C# 2.0 now allows you to embed similar inline arrays inside other value types, even for those that are allocated inside of a reference type on the heap, by using the fixed keyword.

And of course, you can allocate arrays of those value types on the stack too:

using System;

 

unsafe class Program {

    struct A {

        internal int x;

        internal fixed byte y[1024];

    }

 

    public static void Main() {

        byte * bb = stackalloc byte[2048];

        Console.WriteLine("&bb         : {0:X}", (uint)&bb);

        Console.WriteLine("&bb[1]      : {0:X}", (uint)&bb[1]);

        Console.WriteLine("&bb[2048]   : {0:X}", (uint)&bb[2048]);

 

        A * a = stackalloc A[2048];

        Console.WriteLine("&a          : {0:X}", (uint)&a);

        Console.WriteLine("&a->x       : {0:X}", (uint)&a->x);

        Console.WriteLine("&a->y[0]    : {0:X}", (uint)&a->y[0]);

        Console.WriteLine("&a->y[2048] : {0:X}", (uint)&a->y[2048]);

        Console.WriteLine("&a[1]       : {0:X}", (uint)&a[1]);

        Console.WriteLine("&a[2048]    : {0:X}", (uint)&a[2048]);

    }

}

The use of this is of course almost always limited to unmanaged interop scenarios. For example, there's at least one place in the BCL where we use this to stack allocate the binary layout of a security descriptor that we then pass into the Win32 CreateMutex API, which avoids having to create a new interop struct. (Whether such hacks are a good thing to put in our code-base is another topic altogether...)

The stack allocated data doesn't outlive the stack frame, so as soon as you return from the function in which the stackalloc occurs, the data is gone. If you pass a pointer to it and somebody stores it, they could later try to dereference a pointer into dead (and possibly since reused) stack space. And reading too far can lead to buffer over- or underflows which bash other data on the stack. Using this requires compilation with /unsafe, and needless to say, you need to be careful with it (if not avoid it altogether).

5/30/2006 12:29:58 PM (Pacific Daylight Time, UTC-07:00)  #   

 Wednesday, May 24, 2006

In managed code, you can pass ByRefs "down the stack." You can't do much aside from that, however, other than use things like ldind and stind on them. And of course, you can cast them to native pointers, store them elsewhere, and so on, but those sorts of (evil) things are unverifiable.

Right? Well, sort of.

In Whidbey, we made a change to the verification rules such that a function can now return a ByRef to a caller. This of course is safe so long as the ByRef doesn't refer to a stack location. A field ref inside a heap-allocated object, static field ref, or an array element ref are all just fine. And of course, a function can just return a ByRef that was passed to it as an argument. Take a look at this IL:

.assembly extern mscorlib {}
.assembly byrefret {}

.class Program extends [mscorlib]System.Object {

    .field static int32 s_x

    .method static void Main() {
        .entrypoint

        call int32& Program::f()
        ldind.i4
        call void [mscorlib]System.Console::WriteLine(int32)

        call int32& Program::g()
        ldind.i4
        call void [mscorlib]System.Console::WriteLine(int32)

        ret
    }

    .method static int32& f() {
        ldsflda int32 Program::s_x
        ret
    }

    .method static int32& g() {
        .locals init (int32 x)
        ldloca x
        ret
    }

}

Function f verifies just fine, since it just returns a ByRef to a static field, whereas function g fails verification, because it returns a ByRef to a local on the stack. You actually can't write code to produce the IL shown above from any of Microsoft's compilers except for VC++, i.e. C# won't let you say "static ref int f() { return ref s_x; }".

Now, why would you ever want such a thing? VC++ needed it, for example, to implement STL.NET. Traditional STL returns references to elements inside of internal data structures, which can subsequently be modified. Without this support, such values would need to be copied, or the STL.NET APIs would have had to deviate from the traditional STL APIs.

Interestingly, this doesn't change the ECMA Specification. It's always been loose on the issue, saying in Section 12.1.1.2 of Partition I: "Verification restrictions guarantee that, if all code is verifiable, a managed pointer to a value on the evaluation stack doesn't outlast the life of the location to which it points." Since you can't return a ByRef to a stack location, we don't violate this guarantee. Our previous verifier was simply being overly strict.

5/24/2006 6:23:45 PM (Pacific Daylight Time, UTC-07:00)  #   

 Saturday, May 20, 2006

Via DBox and TBray, I stumbled upon Will Continuations Continue?, a great essay about why continuation support in modern VMs is not a good idea after all:

"By far he most compelling use case for continuations are continuation-based web servers. ... Rather than relying on the server’s stack to keep track of what location we’re looking at, the [UI] will be a view on a model ... When you pressed “Buy”, it would pass all the information necessary to complete the transaction onto the server. Consequently, we’ll have no more of a pressing need for continuations than traditional applications have today."

I couldn't agree more, although I arrive at the conclusion via a different line of thought.

Just over a year ago now, I was working on continuations for my Scheme interpreter and compiler, Sencha. I managed to create something that "worked" -- in the sense that the stack could be captured, passed around, and restored; and it even still reported locals as roots to the GC -- but there are so many facets of a modern runtime to consider that true product support would be a massive undertaking. I thought continuations were a good idea. Why? To be honest, the main reason was my simple goal of having a full-fidelity Scheme runtime. But I also admired their power.

In retrospect, I now realize something important: the stack is evil. It's a horrible representation of state, especially for web applications.

The stack is unnecessarily bound to an OS thread, and munges control flow with the "state" of the program. The fact that return addresses for function calls lives on it has been the source of many security problems and counter-measures (/GS). When a thread blocks, the entire stack is wasted, even if there is logical work on it that could progress if it weren't for the arbitrary physical association. There's so much crap on it that to summarize the state of your entire program often requires pausing threads and walking their stacks. How dirty and impolite! Freak-of-nature abominations have twisted what the stack was meant for, e.g. COM and GUI reentrancy and APCs, completely disassociating logical and physical representations. You have to reserve a contiguous chunk of the thing per thread (often 1MB), wasting virtual memory space, because Windows doesn't support linked stack regions (not as big a deal on 64-bit as on 32-bit, sure), which also leads to the CLR ripping the process if you ever exceed it (overflow).

So many problems we encounter with parallel programming (among other domains) would go away with a more structured representation of the program as a state machine.

Dharma and the rest of the WF team are delivering just that (in the large). C# 2.0's iterator feature supplies a similar capability (in the small). The Concurrency and Coordination Runtime (CCR) eschews stack in favor of orchestration and message passing. We'll converge at some point. And it won't be around serializing stacks, it will be around getting rid of the damned things.

5/20/2006 12:24:31 PM (Pacific Daylight Time, UTC-07:00)  #   

 

RSS 2.0

Me
 

Joe Send mail to the author(s) is an architect and developer on a systems incubation project at Microsoft.

Recent

Search

Browse

Disclaimer:
The content of this site are my own personal opinions and do not represent my employer's view in anyway.

© 2013, Joe Duffy