RSS 2.0

Personal Info:

Joe Send mail to the author(s) leads the architecture of an experimental OS's developer platform, where he is also chief architect of its programming language. His current mission is to enable writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe founded the Parallel Extensions to .NET project. He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife, writing books, writing music, studying music theory & mathematics, and doing anything involving food & wine.

My books

My music

Disclaimer:
The content of this site are my own personal opinions and do not represent my employer's view in anyway.

© 2012, Joe Duffy

 
 Monday, August 21, 2006

[Update - 8/22/06 - fixed typos and paid homage to VSTS 2005's code analysis which checks for this problem.]

From the department of Spolsky's Law of Leaky Abstractions, we turn today to accidental lock conflicts across AppDomain boundaries.

The CLR supports various cross-AppDomain marshaling mechanisms, one of which is known by the lovely name of marshal-by-bleed. This simply means that pointers from multiple AppDomains actually refer to the same location in memory. Most of the time some form of marshaling is required for objects so that we can safely isolate separate AppDomains from one another.

In managed code, you can lock on any object through the Monitor type, exposed in C# and VB via the 'lock' and 'SyncLock' keywords, respectively. The implementation of Monitor.Enter/Exit uses space in the object header and/or the object's sync-block to record exclusive ownership of the lock. The fact that objects typically don't bleed across AppDomains is a GoodThing(tm), as this is how add-ins, SQL Server, and other hosts isolate failures between components. When writing code, we typically assume state in one AppDomain can't corrupt state in another, totally independent, AppDomain.

Unfortunately, domain neutral Type objects (as well as other Reflection types, e.g. XXInfos) are actually shared across all AppDomains in the process. They are marshal-by-bleed. Strings also fall into this camp. A string argument to a remoted MarshalByRefObject method invocation may be bled, as can be any process-wide interned string literal. The System.Threading.Thread object (called the thread-base-object, aka TBO, internally) also bleeds across AppDomains. What a bloody mess! (Ha ha.)

So why does this all matter?!

Recall that lock owner information is tied to the instance. If you use any of these things as a target of Monitor.Enter, code running in one AppDomain can actually interfere with code in another AppDomain. That's because they are using the same object and thus the same lock information underneath. What a lousy abstraction--this was never meant to leak through! And it can cause trouble too. If one AppDomain orphans the lock (forgets to release it), it may cause deadlocks in other AppDomains. Even sans deadlocks, this fact can simply yield false conflicts, which can subsequently negatively impact scalability.

For example, consider this code:

lock (typeof(object)) {
    ...
}

Code in AppDomain A uses the same Type object to represent 'typeof(object)' as code in AppDomain B. Therefore they share lock information.

If we run such code from multiple AppDomains, the code yields a conflict:

WaitHandle wh = new ManualResetEvent("XXX", false);
lock (typeof(object)) {
    AppDomain ad2 = AppDomain.CreateDomain("2");
    ad2.DoCallBack(delegate {
        ThreadPool.QueueUserWorkItem(delegate {
            WaitHandle wh2 = new ManualResetEvent("XXX", false);
            lock (typeof(object)) {
                wh2.Set();
            }
        });
    });
    wh.WaitOne();
}

If one AppDomain is waiting for a synchronization event from another--as in this example--this can actually yield a deadlock. If you replaced the lock statements in this example with, say, lock ("Foo") { ... }, you'll see the same result due to string literal interning.

Clearly this is nasty problem, especially if Framework code were to use such patterns. This is one reason you'll notice we strongly discourage locking on Type objects. Even if you're not in mscorlib (by default the only domain neutral assembly), your type can be loaded domain neutral based on hosting policy, among other things. And therefore you may not even catch said bugs during testing.

Note that MarshalByRefObjects aren't subject to these problems. Although operations in one AppDomain can refer to the same instance in another, these accesses go through a proxy. Locking on the proxy is different than locking on the raw underlying object, and thus no false conflicts.

This is enforced with the DoNotLockOnObjectsWithWeakIdentity VSTS 2005 code analysis rule.

If all of this is making you feel rather queasy, fear not. We have a weekly "CLR Foundations" meeting where a large portion of the CLR Team meets to discuss the history of the CLR and .NET Framework. A couple weeks back this topic came up in passing. Most people on the team were quite surprised, and many even seemed to be enshrouded in disbelief. At least we can recognize a mistake after it's been made. ;)

8/21/2006 8:50:14 PM (Pacific Daylight Time, UTC-07:00)  #   

 Sunday, August 13, 2006

I've run across several algorithms lately that have benefited from the use of a Bloom filter. This led me to dig up the original paper (Space/time trade-offs in hash coding with allowable errors) in which the idea was proposed. What surprised me a little was that this technique was invented 35 years ago, by a fellow by the name of Burton Bloom, and yet it remains a simple and effective way to speed up a certain category of modern software problems.

The technique summarizes the contents of a set into a concise value (the filter) from which quick answers may be computed. This value is typically generated using some form of hash-coding of the set elements and stored within a bitmap, enabling ultra-fast bitwise updates and queries. This value promises to never return a false negative, although it is permitted to return false positives. So long as the false positive rate is low, many queries that would ordinarily have come up empty handed after a lengthy search will be retired in constant time. As an element is added to the set, the value is updated to reflect its presence. A tricky part of this technique, however, is that, because the value often summarizes the elements using one-bit-to-many-elements, removing an element from the set typically cannot just reset the corresponding bit. As items are removed over time, this can lead to a stale value, an increasing rate of false positives, and a higher number of lengthy searches. Thus the value must be periodically recomputed to combat this problem.

Let's take an example. Say you had a balanced binary tree that was frequently searched, infrequently deleted from, and whose searches more often lead to misses than they do hits. You can expect O(log n) search time. That's not bad, but for large values of n you may have incentive to speed up the common case, which happens to be the worst case for a perfectly balanced binary tree: a search that turns up empty. Using a Bloom filter gives you a slightly modified algorithm whose search complexity is Ω(1) for our problem's best case, and still O(log n) otherwise. This also adds some notable cost due to the need to periodically recompute the filter value, which is an O(n) operation, but since we infrequently delete items, we expect this to pay off in spades.

Say we already had a BinaryTree<T> data structure. We could easily adopt a Bloom filter with a series of minor modifications.

First, a new field to remember the filter's value. I've chosen a 64-bit bitmap for illustration:

long filterValue = 0;

And since we've used a bitmap, we need a routine to calculate any arbitrary element's single bit position. Remember, false positives are OK, and therefore multiple elements may share the same bit:

long GetFilterValueForElement(T e) {
    return 1 << (e.GetHashCode() % (sizeof(long) * 8));
}

When adding an item to the set, we must change the filter's value:

filterValue |= GetFilterValueForElement(e);

Here's the beneficial change. While searching for an item, we can add a quick check up front to speed up queries:

if ((filterValue & GetFilterValueForElement(e)) == 0) {
    // Element not found: we can be assured this is correct.
}
// Perform the existing, lengthier search. might be a false positive.

And of course, we must periodically do the O(n) operation on the tree to recompute the filter. We might do this whenever the tree becomes empty and every n deletions, for example:

if (isEmpty || (++deletionCount % recomputeFilterPeriod) == 0) {
    filterValue = 0;
    foreach (T e in this) {
        filterValue |= GetFilterValueForElement(e);
    }
}

You could even keep track of the number of false positives seen, and use that to determine when to initiate the recomputation.

This technique clearly won't work well in data structures and problems in which the deletion-to-query ratio is high, but there are plenty of situations where this technique helps tremendously. Logs of various forms, for example, exhibit the property that they are added to but never deleted. Depending on the density of the data structure, you may want to use an array instead of a bitmap--or a bitmap with more bits--to represent the summary. I ran a quick test and the simple bit-shifting hash-coding mechanism above yielded a full filter (~0) after only 227 random object allocations. You can of course even decide to reduce the density of your bitmap dynamically by detecting a close-to-full map and upgrading to a larger filter data structure. A dense filter often corresponds to a higher false positives rate, in which case you simply waste time maintaining and checking a value that doesn't give you any benefit.

8/13/2006 1:45:48 PM (Pacific Daylight Time, UTC-07:00)  #   

 Wednesday, August 09, 2006

The Lang.NET Symposium was held here in Redmond, WA last week.  My pal, Erik Meijer, the monadic maniac of Head in the BoxLambda the Ultimate, and now VB 9 fame, helped pull together the killer agenda, which is always a recipe for a good time.  Thottam, a PM on the CLR team, just posted a great summary on his blog.  He sent it out over email last week, and I'm really glad he posted it for everyone else to see.  There are a bunch of good language geek links lurking within. 

8/9/2006 8:42:02 PM (Pacific Daylight Time, UTC-07:00)  #   

 Sunday, August 06, 2006

This falls into the "fun hacks" category, meaning the result is not necessarily something you'd want to use in your everyday life. To go a step further, I strongly recommend you don't use the code shown here as-is; read the summary at the end for some rationale behind that statement. Enough with the disclaimer. On with the show...

Some requirements for our cross-process RWLock

Imagine you had the need for:

1. A managed reader/writer lock
2. that runs on down-level (pre-Vista) operating systems,
3. and that optionally works across process boundaries and AppDomains.

What do we already have that might fit the bill? The existing ReaderWriterLock type in the System.Threading namespace works fine for 1 and 2, but not 3. I suppose you could share it across AppDomains--or even processes--with some form of messaging scheme, but let's ignore that for a moment. It's a little *too* clever for my taste. Windows Vista of course comes with a new slim reader/writer lock. It's a close cousin of the Win32 CRITICAL_SECTION, and can be used for cross-AppDomain synchronization. Unfortunately, you have to P/Invoke to get at it from managed code, it won't run on pre-Vista operating systems, and doesn't work for cross-process scenarios.

Let's build it out of duct tape and barbed wire

It turns out that you can build a type that meets our requirements out of existing Windows kernel objects. All it takes is a little imagination. Here's what you need:

1. A semaphore to count the number of readers inside the lock.
2. A mutex to ensure only one writer can be in the lock at a time.
3. (Optionally) a manual-reset event used by writers to ensure no new readers enter the lock while it waits.

The scaffolding for one such implementation--which I will call the IpcReaderWriterLock--is as follows:

public class IpcReaderWriterLock : IDisposable

{

    /** Fields **/

 

    private const int DEFAULT_MAX_READER_COUNT = 25;

    private const string NAME_PREFIX = @"IpcRWL#";

 

    private readonly int m_maxReaderCount;

    private Semaphore m_readerSemaphore;

    private EventWaitHandle m_blockReadsEvent;

    private Mutex m_writerMutex;

    private int m_writerRecursionCount;

 

    /** Constructors **/

 

    public IpcReaderWriterLock() : this(null, DEFAULT_MAX_READER_COUNT) { }

    public IpcReaderWriterLock(string name) : this(name, DEFAULT_MAX_READER_COUNT) { }

    public IpcReaderWriterLock(int maxReaderCount) : this(null, maxReaderCount) { }

 

    public IpcReaderWriterLock(string name, int maxReaderCount)

    {

        m_maxReaderCount = maxReaderCount;

 

        string blockReadsEventName = null;

        string writerMutexName = null;

        string readerSemaphoreName = null;

        if (name != null)

        {

            blockReadsEventName = string.Format("{0}{1}#{2}", NAME_PREFIX, name, "RdEv");

            writerMutexName = string.Format("{0}{1}#{2}", NAME_PREFIX, name, "WrMtx");

            readerSemaphoreName = string.Format("{0}{1}#{2}", NAME_PREFIX, name, "RdSem");

        }

 

        m_blockReadsEvent = new EventWaitHandle(true, EventResetMode.ManualReset, blockReadsEventName);

        m_writerMutex = new Mutex(false, writerMutexName);

        m_readerSemaphore = new Semaphore(maxReaderCount, maxReaderCount, readerSemaphoreName);

    }

 

    /** Methods **/

 

    public void Dispose()

    {

        // Just close all of the kernel objects we opened during construction.

        // Note: this method is not thread-safe. If threads race with

        // one another to call Dispose, some nasty bugs will arise.

 

        if (m_blockReadsEvent != null)

        {

            m_blockReadsEvent.Close();

            m_blockReadsEvent = null;

        }

 

        if (m_writerMutex != null)

        {

            m_writerMutex.Close();

            m_writerMutex = null;

        }

 

        if (m_readerSemaphore != null)

        {

            m_readerSemaphore.Close();

            m_readerSemaphore = null;

        }

    }

 

    public void EnterReadLock() { ... } 

    public void ExitReadLock() { ... } 

    public void EnterWriteLock()  { ... }

    public void ExitWriteLock() { ... }

}

Notice that we allow naming of the lock. Any name given flows into the kernel objects used underneath, enabling cross-process and cross-AppDomain communication. You just create the same IpcReaderWriterLock with the same name in multiple processes or AppDomains, and they will magically interact with one another (whether you want them to or not). An unnamed lock is isolated inside of the AppDomain in which it was created. Notice also that there's a maximum number of simultaneous readers, the default for which is 25. This isn't terribly important, but any override does impact performance (as described below).

Read and write lock implementation

Now let's implement the read-lock acquisition and release functions, EnterReadLock and ExitReadLock. We support more than one reader at a time via the use of a semaphore (#1 above). We also support preventing blocking new readers from entering the lock while the writer waits for all readers to exit (#3 above). Thus, both of these functions are quite trivial to write:

public void EnterReadLock()

{

    Thread.BeginCriticalRegion();

 

    // We first wait on the read blocking event, in case a writer

    // has tried to acquire the lock and wants us to wait.

    m_blockReadsEvent.WaitOne();

 

    // Now take '1' from the reader semaphore to count the number

    // of simultaneous readers inside the lock.

    m_readerSemaphore.WaitOne();

}

 

public void ExitReadLock()

{

    // Just release '1' back to the semaphore to let others know

    // the number of simultaneous readers just decreased.

    m_readerSemaphore.Release();

 

    Thread.EndCriticalRegion();

}

Next comes the write-lock acquisition and release functions, EnterWriteLock and ExitWriteLock. They are slightly more complicated, but not by much. First we acquire the writer mutex. Once we've done that, we increment the recursion count, and ensure that we do some other work only the first time a writer lock is acquired on the thread. We block any new readers from entering, and then we effectively wait for all readers to exit. We do that by acquiring the semaphore n times, where n is the maximum number of readers that we support. Releasing the write lock does the reverse of all of that:

public void EnterWriteLock()

{

    Thread.BeginCriticalRegion();

 

    // We have to first ensure only one writer can get in at a time.

    m_writerMutex.WaitOne();

 

    // Increment our recursion count.

    m_writerRecursionCount++;

 

    // For the first writer who enters, we need to block new readers

    // and wait for any existing readers to exit the lock.

    if (m_writerRecursionCount == 1)

    {

        // Next we block any new readers from entering the lock.

        m_blockReadsEvent.Reset();

 

        // And lastly, we ensure that all readers have exited the lock.

        // We do this by acquiring the semaphore's capacity. It's

        // unfortunate that the Win32 APIs don't support a take-n

        // function for semaphores.

        for (int i = 0; i < m_maxReaderCount; i++)

        {

            m_readerSemaphore.WaitOne();

        }

    }

}

 

public void ExitWriteLock()

{

    // We have to do everything in the reverse order as we did

    // during acquisition. Not doing so can lead to subtle bugs,

    // including lost resets and deadlocks.

 

    m_writerRecursionCount--;

 

    // The last writer to release has to signal readers.

    if (m_writerRecursionCount == 0)

    {

        // We release the semaphore's capacity back, enabling readers

        // to take from it. Note that as soon as we call this, other

        // threads may wake up and race to acquire the semaphore. In

        // fact, simultaneous readers can get in, even though we still

        // have a writer in here!

        m_readerSemaphore.Release(m_maxReaderCount);

 

        // Unblock any readers that are waiting. Note: ideally we would

        // do this after signaling writers, so that readers can't sneak

        // in before the writer, but that would be more complicated: we

        // keep it simple for now.

        m_blockReadsEvent.Set();

    }

 

    // And lastly release the mutex.

    m_writerMutex.ReleaseMutex();

 

    Thread.EndCriticalRegion();

}

And that's it. A fully functioning reader/writer lock, for some definition of "functioning." The full source code can be found here: IpcReaderWriterLock.cs.

The test case

This example wouldn't be complete with a simple test case to prove that it works. The sample program included in the IpcReaderWriterLock source code creates 20 threads--10 readers and 10 writers--in 10 AppDomains. Each does a piece of work designed to expose race conditions via context switching at sensitive points. It prints out "Success" at the end, assuming it all worked. I see "Success" every time I run it, so it works on my machine at least. Hooray.

Summary: Don't use this thing

OK, OK... this lock is pretty icky and nasty to be honest, and you probably wouldn't want to use it. Ever. A simple write-lock acquisition incurs 27 kernel transitions with the default settings. This ends up costing over 1000-times the cost a simple monitor acquisition! (Yes, there are three 0's in that number... ouch.) Moreover, the cost increases proportional to the number of simultaneous readers that the lock instance supports, which is not very good. That's why I've used such a low default: 25. And it's not very reliable either, which, for anything that does cross-AppDomain or cross-process synchronization can be disastrous. One process that crashes can lead to machine-wide corruption and a user who needs to reboot the machine. A much better lock (performant, reliable, etc.) could be built using memory mapped files, although the implementation would be substantially more complicated.

I almost didn't write this post for all of these reasons. I'm not sure whether I'm doing a disservice to my readers by doing this. Instead, I hope that showing how simple Windows kernel objects can be composed together in interesting, powerful, and non-obvious ways is interesting, if only for trivia reasons. I'd also like to think that it may inspire you to think about things a little differently, perhaps helping you to write clever (but useful!) things out of the building blocks you already have.

8/6/2006 4:29:13 PM (Pacific Daylight Time, UTC-07:00)  #   

 Thursday, August 03, 2006

A glimpse of the future?

32-Core Processors: Intel Reaches for (the) Sun

Intel is of rolling out its Core 2 micro-architecture now. The Xeon 5100 server processor aka Woodcrest was released only weeks ago, Core 2 Duo for the desktop (Conroe) is expected on July 27th and the mobile version Merom will follow only weeks later. The next milestone is quad-core processors, which the firm will produce by fitting two Woodcrest dual cores inside a physical processor package (Clovertown). You may have realized that there is a product development pattern behind recent and upcoming Intel multi core processor releases. Amazingly enough, Intel has been studying Sun's UltraSPARC T1 (Niagara) to come up with a radical processor redesign for 2010 that could perform 16 times faster than Woodcrest. This is no marketing blurb, guys; this is technical intelligence from within the Borg collective.

Read More...

I want a 32-CPU MP machine with those puppies on it. 1,024 cores anybody? Engadget claims each of those cores has 4 hardware threads. 4,096 hardware threads anybody?

Engadget goes on to say:

The biggest hurdle of all, however, could be a consumer Microsoft OS that can fully help software take advantage of multiple cores, a task which Vista isn't quite up to.

This is a bit of an exaggeration. Server SKUs will take advantage of multi-core without problem. It's our client situation that needs a little help, and even then the OS has little to do with it.

8/3/2006 8:40:38 PM (Pacific Daylight Time, UTC-07:00)  #   

 Tuesday, August 01, 2006

The September edition of MSDN Magazine contains an article I wrote.  Surprise!  It's on concurrency:

Using Concurrency for Scalability

There's been a lot of buzz lately about concurrency. The reason for this is due mostly to major hardware vendors' plans to add more processor cores to both client and server machines, and also to the relatively unprepared state of today's software for such hardware. Many articles focus on how to make concurrency safe for your code, but they don't deal with how to get concurrency into your code in the first place. ...

In some sense, a large chunk of the responsibility for making software go faster with the next generation of hardware has been handed off from hardware to software. That means in the medium-to-long term, if you want your code to get faster automatically, you'll have to start thinking about architecting and implementing things differently. This article is meant to explore these architectural issues from the bottom up, and aims to guide you through this new world. In the long run, new programming models are likely to appear that will abstract away a lot of the challenges you will encounter.

Enjoy.  The original go-'round was about 10,000 words, far too long for a column-length magazine article.  After cutting it back, I do feel like it's a bit fast paced and glosses over some important details.  In particular, I had wanted to cut the profiling section but there wasn't enough time in the process to do it.  I hope somebody writes a feature-length article on that sometime soon.  Nevertheless I think the article will be useful.  Any feedback would be appreciated.

I'm also pleased to see that Jeff Richter has a great article on CCR in this same edition (and a sister Channel9 video).

8/1/2006 8:52:26 PM (Pacific Daylight Time, UTC-07:00)  #   

 Tuesday, July 25, 2006

A colleague lent me a copy of W. Daniel Hillis’s PhD thesis, The Connection Machine, which is also available in book form from The MIT Press. I only began reading it last night, but I have been continuously amazed. It’s been enlightening to realize how much framing problems differently (and, in many cases, more naturally) can make programming without concurrency seem ridiculous.

To give you an idea, here's a quote from the thesis:

“When performing simple computations over large amounts of data, von Neumann computers are limited by the bandwidth between memory and processor. This is a fundamental flaw in the von Neumann design; it cannot be eliminated by clever engineering.”

Here is a quick illustration: What’s the most efficient way of copying a source array of 100,000 elements to a destination array of 100,000 elements? With a single-CPU this would typically be O(n), where n is the length of the array. If you could minimize costs due to thread creation and communication, and ensure good locality, you might be able to gain some parallel speedup by using multiple CPUs.

With The Connection Machine, however, you can do it in O(1) time. Simply instruct the 100,000 source cells, each of which holds a single array element, to communicate their value to the 100,000 destination cells, and instruct the destination cells to receive and store the value. This happens instantaneously, across the machine, not in serial fashion. If any node a can communicate with any other node b in 1 time unit, the entire array is copied in just 1 time unit, not 100,000! (Designing such an interconnect is, of course, quite difficult, but in theory it is quite nice.)

I found it particularly interesting that, back in 1985, at least this author recognized the impending demise of an entirely sequential approach to all problems. I guess the von Neumann-plus-Knuth knockout punch infected the world of programming to the contrary, and it was straight down-hill from there…

7/25/2006 11:32:03 PM (Pacific Daylight Time, UTC-07:00)  #   

 Saturday, July 15, 2006

Stack overflow can be catostrophic for Windows programs. Some Win32 libraries and commercial components may or may not respond intelligently to it. For example, I know that, at least as late as Windows XP, a Win32 CRITICAL_SECTION that has been initialized so as to never block can actually end up stack overflowing in the process of trying to acquire the lock. Yet MSDN claims it cannot fail if the spin count is high enough. A stack overflow here can actually lead to orphaned critical sections, deadlocks, and generally unreliable software in low stack conditions. The Whidbey CLR now does a lot of work to probe for sufficient stack in sections of code that manipulate important resources. And we pre-commit the entire stack to ensure that overflows won't occur due to failure to commit individual pages in the stack. If a stack overflow ever does occur, however, it's considered a major catastrophy--since we can't reason about the state of what native code may have done in the face of it--and therefore, the default unhosted CLR will fail-fast.

In some rare cases, it is useful to query for the remaining stack space on your thread, and change behavior based on it. It could enable you to fail gracefully rather than causing a stack overflow, possibly in Win32, causing the process to terminate.  A UI that needs to render some very deep XML tree, and does so using stack recursion, could limit its recursion or show an error message based on this information, for example.  It could decide that it needs to spawn a new thread with a larger stack to perform the rendering.  Or it may just be a handy way to log an error message during early testing so that the developers can fine tune the stack size or depend less heavily on stack allocations to get the job done.

I've previously mentioned that the TEB has a StackBase and StackLimit, and that it can be dynamically queried using the ntdll!NtCurrentTeb function. Unfortunately, the StackLimit is only updated as you actually touch pages on the stack, and thus it's not a reliable way to find out how much uncommitted stack is left. The CLR uses kernel32!VirtualAlloc to commit the pages, not by actually moving the guard page, so StackLimit is not updated as you might have expected. There's an undocumented field, DeallocationStack, at 0xE0C bytes from the beginning of the TEB that will give you this information, but that's undocumented, subject to change in the future, and is too brittle to rely on.

The RuntimeHelpers.ProbeForSufficientStack function may look promising at first, but it won't work for this purpose. It probes for a fixed number of bytes (48KB on x86/x64), and if it finds there isn't enough, it induces the normal CLR stack overflow behavior.

The good news is that the kernel32!VirtualQuery function will get you this information. It returns a structure, one field of which is the AllocationBase for the original allocation request. When Windows reserves your stack, it does so as one contiguous piece of memory. The MM remembers the base address supplied at creation time, and it turns out that this is the "end" of your stack (remember, the stack grows downward). With a little P/Invoke magic, it's simple to create a CheckForSufficientStack function using this API. Our new function takes a number of bytes as an argument and returns a bool to indicate whether there is enough stack to satisfy the request:

public unsafe static bool CheckForSufficientStack(long bytes) {

    MEMORY_BASIC_INFORMATION stackInfo = new MEMORY_BASIC_INFORMATION();

 

    // We subtract one page for our request. VirtualQuery rounds UP to the next page.

    // Unfortunately, the stack grows down. If we're on the first page (last page in the

    // VirtualAlloc), we'll be moved to the next page, which is off the stack! Note this

    // doesn't work right for IA64 due to bigger pages.

    IntPtr currentAddr = new IntPtr((uint)&stackInfo - 4096);

 

    // Query for the current stack allocation information.

    VirtualQuery(currentAddr, ref stackInfo, sizeof(MEMORY_BASIC_INFORMATION));

 

    // If the current address minus the base (remember: the stack grows downward in the

    // address space) is greater than the number of bytes requested plus the reserved

    // space at the end, the request has succeeded.

    return ((uint)currentAddr.ToInt64() - stackInfo.AllocationBase) >

        (bytes + STACK_RESERVED_SPACE);

}

 

// We are conservative here. We assume that the platform needs a whole 16 pages to

// respond to stack overflow (using an x86/x64 page-size, not IA64). That's 64KB,

// which means that for very small stacks (e.g. 128KB) we'll fail a lot of stack checks

// incorrectly.

private const long STACK_RESERVED_SPACE = 4096 * 16;

 

[DllImport("kernel32.dll")]

private static extern int VirtualQuery (

    IntPtr lpAddress,

    ref MEMORY_BASIC_INFORMATION lpBuffer,

    int dwLength);

 

private struct MEMORY_BASIC_INFORMATION {

    internal uint BaseAddress;

    internal uint AllocationBase;

    internal uint AllocationProtect;

    internal uint RegionSize;

    internal uint State;

    internal uint Protect;

    internal uint Type;

}

If this returns true, you can be guaranteed that an overflow will not occur. Well, modulo stack guarantee issues, that is...

Notice that we have to consider some amount of reserved space at the end of the stack. Platforms typically reserve a certain amount to ensure custom stack overflow processing can be triggered. Windows actually reserves a few pages at the end of the stack for this reason. If, after a stack overflow occurs, a double stack overflow is triggered (that is, stack overflow handling actually exceeds these pages), Windows takes over and kills the process. The CLR prefers to initiate a controlled shut-down: telling the host, if any, and fail-fasting otherwise. This means it needs to reserve even more than Windows does automatically. The kernel32!SetThreadStackGuarantee can be used for this. In any case, we need to consider that when looking for enough stack space in our function. The code above assumes 16 4KB pages are required; this is more than is typically needed, so it may lead to false positives (but we hope no false negatives). Also note the program above is very x86/x64-specific, and won't work reliably on IA-64: it hard-codes a 4KB page size. It's a trivial excercise to extend this to use information from kernel32!GetSystemInfo to use the right page size dynamically.

As an example, check out this code:

static unsafe void Main() {

    Test(8*1024, 8*1024, true);

    Test(0, (960*1024) + (8*1024), false);

    Test(960*1024, 8*1024, false);

}

 

static unsafe void Test(int eatUp, long check, bool expect) {

    byte * bb = stackalloc byte[eatUp];

    Console.WriteLine("eatUp: {0}, check: {1}: {2}",

        eatUp, check,

        CheckForSufficientStack(check) == expect ?

        "SUCCESS" : "FAIL");

}

As I've described previously, the stack size can depend on the EXE PE file or parameters passed when creating a thread. This example assumes a 1MB stack size.

7/15/2006 12:39:43 PM (Pacific Daylight Time, UTC-07:00)  #   

 Saturday, July 08, 2006

The CLR thread pool is a very useful thing. It amortizes the cost of thread creation and deletion--which, on Windows, are not cheap things--over the life of your process, without you having to write the hairy, complex logic to do the same thing yourself. The algorithms it uses have been tuned over three major releases of the .NET Framework now. Unfortunately, it’s still not perfect. In particular, it stutters occasionally.

As I’ve hinted at before, we have a lot of work actively going on right now that we hope to show up over the course of the next couple CLR versions (keep an eye on those CTPs!). This may include vastly improved performance for work items and IO completions, significantly reducing the overhead of using our thread pool (in some cases to as little as ~1/8th of what it is today), eliminating accidental deadlocks due to lots of blocked thread pool workers, and a slew of useful new features (prioritization, isolation, better debugging, etc.).

One silly thing our thread pool currently does has to do with how it creates new threads. Namely, it severely throttles creation of new threads once you surpass the "minimum" number of threads, which, by default, is the number of CPUs on the machine. We limit ourselves to at most one new thread per 500ms once we reach or surpass this number. This can be pretty bad for some workloads, most notable those that are "bursty"; i.e. those that exhibit interspersed inactive and active periods rather sporadically and unpredictably. ASP.NET is a great example of an environment in which this frequerntly happens. Here’s an illustration:

  1. Imagine we have a 4 CPU web server. The "minimum" thread count used thus 4 (assuming the default).
  2. The web server has just started up.
  3. 16 new requests come in within a short period of time.
  4. Ther CLR quickly create 4 thread pool threads to service the first 4 requests. Because we don’t want to add any more for another 500ms, the other 12 requests sit in the queue.
  5. The 4 thread pool threads are running some arbitrary web page response. Imagine the response generation code does some type of database query that takes 4 seconds to complete. (This is a strong argument for using ASP.NET asynchronous pages (see http://msdn.microsoft.com/msdnmag/issues/05/10/WickedCode/) -- in which case, the 4 thread pool threads would free up to execute 4 new requests almost immediately -- or perhaps simply rearchitecting the seemingly poor database interaction, but ignore this for now.)
  6. After 500ms, a new thread pool thread is created, and the 5th request is serviced.
  7. We now wait another 500ms to add another thread, service, the next request, and so forth.

If the server has a constant load, eventually the pool will become "primed." But if a burst of work is followed by an inactive period of time, the threads in the thread pool start timing out waiting for new work, and eventually will retire themselves, until the pool shrinks back to the minimum. Imagine that this happens and then a bunch of new work arrives. Oops. This can clearly lead to some nasty scalability nightmares. KB article 821261, Contention, poor performance, and deadlocks when you make Web service requests from ASP.NET applications, describes this problem among others.

To "fix" this we added the ability in v1.1 to specify the minimum thread count in the thread pool, either with the configuratoin file or with the ThreadPool.SetMinThreads API. See KB article 810259, FIX: SetMinThreads and GetMinThreads API Added to Common Language Runtime ThreadPool Class, for details. It turns out that Microsoft Biztalk Server has run into the same problem: FIX: Slow performance on startup when you process a high volume of messages through the SOAP adapter in BizTalk Server 2004. I suspect many other commercial products have run into this as well. And it's rather annoying that each of them have to figure this out after they've shipped something, turning into a support bulletin, an internal bug-fix, and (I would guess) a service pack containing said bug-fix.

I wouldn't actually call what we did a fix. At best, it's a workaround. Hell, one of the KB articles above says that if you want decent scalability you need to change the minWorkerThreads count to 50. Our default is 1! Not too far off, eh? Shouldn't decent scalability be the default behavior?

We need to fix this for real.

Now, of course, it's a hard problem to solve. You don’t want to be too liberal adding threads to the pool because it can cause poor scalability should a large number of those extra threads suddenly become runnable. In an ideal world, no threads block, and having the same number of threads as you have CPUs gives you the best performance. (Better cache utilization, less overhead due to context switching, and so forth.) But software is often far less than ideal. As noted above, ASP.NET asynchronous pages are a great way to acheive this, and compared to injecting a whole bunch of relatively expensive threads into the process, it's obviously a better design. Unfortunately, I am not convinced all of our customers will stumble across this design, nor will it be brain-dead simple to rearchitect an existing site to take advantage of the feature without considerable work.

My hope is that we can solve this problem in the CLR by applying clever heuristics that even out over time. For example, we may start out life being over eager and generous with thread injection, but then "learn our lesson" after running for a period of time. This would lead to stabalization and an increasingly superior performance over the life of the process for the work that the server experiences. For example, if the server often experiences bursts, we will monitor the number of threads that lead to the best throughput during such an active period, and during periods of inactivity we will avoid retiring threads. This ensures that the next time the server is busy, work can be responded to in a more scalable manner, albeit with some extra working set overhead for keeping those threads around for longer. Perhaps more appropriate configuration settings could be dynamically recommended based on statistics gatherer during previous up-times of the server. And of course, we can offer more reasonable defaults for clients with short-living processes that might be harmed by over-eagerness with thread injection.

If anybody has experienced this problem in the wild, I’d love any feedback you might have. Feel free to leave a comment or email me at joedu at you-know-where dot com.

7/8/2006 9:18:26 PM (Pacific Daylight Time, UTC-07:00)  #   

 Thursday, July 06, 2006

As I mentioned in a recent post, Windows Vista has new built-in support for deadlock detection. At the time, I couldn't find any publicly available documentation on this feature. Well, I just found it:

Wait Chain Traversal
Wait Chain Traversal (WCT) enables debuggers to diagnose application hangs and deadlocks. A wait chain is an alternating sequence of threads and synchronization objects; each thread waits for the object that follows it, which is owned by the subsequent thread in the chain. Read More.

The new APIs OpenThreadWaitChainSession, CloseThreadWaitChainSession, and GetThreadWaitChain permit both asynchronous and synchronous detection and response. MSDN also has a fairly detailed code sample that uses the new APIs to print out the wait chain for all threads in a process.

7/6/2006 12:20:48 PM (Pacific Daylight Time, UTC-07:00)  #   

 

Recent Entries:

Search:

Browse by Date:
<August 2006>
SunMonTueWedThuFriSat
303112345
6789101112
13141516171819
20212223242526
272829303112
3456789

Browse by Category:

Notables: