RSS 2.0

Personal Info:

Joe Send mail to the author(s) leads the architecture of an experimental OS's developer platform, where he is also chief architect of its programming language. His current mission is to enable writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe founded the Parallel Extensions to .NET project. He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife, writing books, writing music, studying music theory & mathematics, and doing anything involving food & wine.

My books

My music

Disclaimer:
The content of this site are my own personal opinions and do not represent my employer's view in anyway.

© 2012, Joe Duffy

 
 Tuesday, October 03, 2006

I am often confronted with the question of whether concurrency programming models that employ shared memory are evil. I was asked this question directly on the concurrency panel at JAOO’06 earlier this week, for instance, and STM makes a big bet that such models are tenable.

Without shared memory, it’s tempting to think that traditional concurrency problems go away, as if by magic. If no two pieces of code are simultaneously working on the same location in memory, for instance, there are (seemingly) no race conditions or deadlocks. Most people believe this, and it (on the surface) seems somewhat reasonable. Until you realize that it’s fundamentally flawed.

Shared memory systems are just an abstraction in which data can be named by its virtual memory address. In fact, one could argue that it’s an optimization—that the same sort of systems could be built by mapping virtual memory addresses (at a logical level) to some other location (at a physical level) using an algorithm that doesn't rely on page-tables, TLBs, and so on. Distributed RPC systems in the past have tried this very thing: to map object references to data residing on far-away nodes, and have mostly failed in the process. I’m not trying to convince you that alternative mapping techniques are a good thing, only that abstractly speaking at least, all of the same concurrency control problems will arise in systems that exhibit this fundamental property. Interestingly, shared memory systems have turned into tiny distributed systems with complex cache coherency logic anyway, so one has to wonder where the boundary between shared memory and message passing really lies.

There is a fundamental, undeniable law here:

Any system in which two concurrent agents can name the same piece of data must also exhibit the standard problems of concurrency: broken serializability, race conditions, deadlocks, livelocks, lost event notifications, and so on. Concurrency control is simply a requirement if correctness is desired.

So in reality, the real question at hand should be, would a system in which every concurrent agent operates on its own, completely isolated piece of data be more attractive? I personally think that’s farfetched and unrealistic. Systems with shared data need to have shared data; it's a property of the system being modeled. Even with isolated data, concurrency control would be required if, say, a central copy is rendezvoused with periodically (which, by the way, is the only way I can see such a system remaining correct). And then you have to wonder what copying buys you. It certainly costs you. Data locality is crucial to achieving adequate performance in most low-to-mid-level systems software. Yet copy-on-send message passing systems throw this out the window entirely. I refuse to believe that this will ever be the dominant model of fine-grained concurrency, at least on the current hardware architectures available by Intel and AMD. And certainly not without a whole lot more research and perhaps hardware support.

A distributed system in which many simultaneous clients might access the same piece of data on the server has all the same issues. AJAX systems, for instance, easily lull the author into a false sense of security. But, unfortunately(?), a transaction is a transaction, and if concurrency control isn’t in place, such systems are effectively executing without any isolation or serialization guarantees whatsoever—I just read an article in the latest DDJ where this was explained. I'm surprised a dedicated article actually needs to point this out: concurrent access to data under any other name is still concurrent access to data. And of course, once you start to employ concurrency control, you are susceptible to deadlocks and so on—unless you have a system that can transparently resolve them.

Interesting research has been done recently by MSR on static verification to prove the absence of sharing (across processes)—called Software Isolated Processes (SIPs)—building on the type safe, verifiable subset of IL. STM of course also builds on top of the shared memory programming model; but, although threads can name the same location in memory, this is completely hidden—concurrency control is still employed in the implementation where necessary. I believe this systems are promising. They also have the benefit of building on the same foundational memory performance equations that software developers are used to relying on today.

10/3/2006 4:13:32 PM (Pacific Daylight Time, UTC-07:00)  #   

 Monday, October 02, 2006

Here are the slides for my JAOO'06 talk Concurrency and the composition of frameworks:

JAOO-06__ConcurrencyAndFrameworks.ppt (2.06 MB)

10/2/2006 7:16:10 AM (Pacific Daylight Time, UTC-07:00)  #   

 Friday, September 22, 2006

An article I wrote (seemingly ages ago) just appeared in the September issue of Dr. Dobb's journal:

Application Responsiveness: Using concurrency to enhance user experiences
Thanks to recent innovation in both hardware graphics processors and client-side development frameworks, GUIs for Windows applications have become more and more visually stunning over time. But throughout the evolution of such frameworks, one problem hasn't gone away—poor responsiveness. Studies show that positive user experiences are linked to application responsiveness and, conversely, that frustrating experiences are often caused by poor responsiveness. More often than not, this lack of responsiveness is due to a series of subtle (and sometimes accidental) design choices made during development. In this article, I examine the root of the responsiveness problem, and then suggest some best practices for eliminating it.

My article only touches on some important issues that are described in detail elsewhere.  Here are the references I used:

  1. D. Duis, J. Johnson. Improving User Interface Responsiveness Despite Performance Limitations. Proc. IEEE Computer Society Intl. Conference. February 1990.
  2. J. Duffy. No More Hangs: Techniques for Avoiding and Detecting DeadlocksMSDN Magazine. April 2006.
  3. G. H. Forman. Obtaining Responsiveness in Resource-Variable Environments. PhD Dissertation, University of Washington. 1998.
  4. I. Griffiths. Windows Forms: Give Your .NET-based Applications a Fast and Responsive UI with Multiple Threads. MSDN Magazine. February 2003.
  5. N. Kramer. Threading Models (Windows Presentation Foundation). Weblog essary. June 2005.
  6. G. Maffeo, P. Silwowicz. Win32 I/O Cancellation in Windows Vista. MSDN. September 2005.
  7. V. Morrison. Concurrency: What Every Dev Must Know About Multithreaded Apps. MSDN Magazine. August 2005.
  8. M. E. Russinovich, D. A. Solomon. Microsoft Windows Internals. ISBN 0-735-61917-4, MS Press. December 2004.
  9. C. Sells. Safe, Simple Multithreading in Windows Forms, Part 1. MSDN. June 2002.
  10. C. Sells, I. Griffiths. Programming Windows Presentation Foundation. ISBN 0-596-10113-9, O'Reilly. September 2005.

Thanks go to Jeff Richter, Nick Kramer, Alessandro Catorcini1, and Vance Morrison for reviewing early drafts.  Enjoy.

1. Alessandro, man, you need a blog! ;)

9/22/2006 11:27:36 AM (Pacific Daylight Time, UTC-07:00)  #   

 Wednesday, September 13, 2006

LINQ coaxes developers into writing declarative queries that specify what is to be computed instead of how to compute the results. This is in contrast to the lion's share of imperative programs written today, which are huge rat nests of for-loops, switch statements, and function calls. The result of this new direction? Computationally intensive filters, projections, reductions, sorts, and joins can be evaluated in parallel... transparently... with little-to-no extra input from the developer. The more data the better.

If you buy the hypothesis--still unproven--that developers will write large swaths of code using LINQ, then by inference, they will now also be writing large swaths of implicitly data parallel code. This, my friends, is very good for taking advantage of multi-core processors.

If you want to get a little glimpse of what I've been spending my time working on, check out these (brief) stories about Parallel LINQ (aka PLINQ), a parallel query execution engine for LINQ:

We've spent many, many months now cranking out a fully functional prototype. The numbers were impressive enough to catch the eye of some key people around the company. And the rest is history... (well, not quite yet...)

I'll no doubt be disclosing more about this in the coming weeks.

(Note: I am in no way committing to any sort of product or release timeframe. This technology is quite early in the lifecycle, and, while unlikely, might never actually make the light of day... Label this puppy as "research" for now.)

9/13/2006 4:48:33 AM (Pacific Daylight Time, UTC-07:00)  #   

 Friday, September 01, 2006

Tim Harris, a Microsoft colleague I've had the please to work a lot with lately, joined Simon Peyton Jones, of Glasgow Haskell fame, to do a Channel9 interview on Software Transactional Memory (STM). I encourage you to check it out.

Update: I didn't say this before, but I would love any feedback about this technology. What if you had this in C# today? (Do you want it in C# today?) Would it make your life simpler? What are the major challenges you'd encounter if you were to start using it in your programs and libraries? What are the major benefits? Feel free to leave comments (either here or in the Channel9 post) or send me email directly at joedu AT microsoft DOT com.

9/1/2006 6:10:53 PM (Pacific Daylight Time, UTC-07:00)  #   

 Tuesday, August 22, 2006

A common technique to avoid giving up your time-slice on multi-CPU machines is to use a hand-coded spin wait. This is appropriate when the cost of a context switch (4,000+ cycles) and ensuing cache effects are more expensive than the possibly wasted cycles used for spinning, which is to say not terribly often. When used properly, however, very little time is spent spinning, and the spin wait is only ever invoked rarely when very specific cross-thread state is seen, such as lock-free code observing a partial update. There are some best practices that must be followed when writing such a spin wait to guarantee good behavior across different machine configurations, i.e. HT, single-CPU, and multi-CPU systems.

A correct wait must issue a yield/pause instruction on each loop iteration to work well on Intel HT machines:

while (!cond) {
    Thread.SpinWait(20);
}

Many implementations should also fall back to a more expensive wait on, say, a Windows event or CLR monitor after spinning a while. This handles the worst case situation in which the thread that is destined to make 'cond' true is not making forward progress as quickly as you'd hoped. A complementary and alternative technique is to simply give up the time-slice in such cases using the Thread.Sleep API:

uint loops = 0;
while (!cond) {
    if ((++loops % 100) == 0) {
        Thread.Sleep(0);
    } else {
        Thread.SpinWait(20);
    }
}

This approach ensures that, if the machine is saturated, the spin wait doesn't prevent the thread which will set the event from being scheduled and making forward progress.

All of this is pure nonsense and ludicrousness on single-CPU machines. If you're waiting for another thread to set an event... well... it clearly isn't going to do that if you're actively using the one and only CPU to waste cycles spinning! Therefore a natural extension to the above approach is to check for a single-CPU machine and respond by yielding to another thread:

uint loops = 0;
while (!cond) {
    if (Environment.ProcessorCount == 1 || (++loops % 100) == 0) {
        Thread.Sleep(0);
    } else {
        Thread.SpinWait(20);
    }
}

OK, this is looking rather nice now. But wait. There's a subtle but nasty problem lurking here.

Sleep(0) actually only gives up the current thread's time-slice if a thread at equal priority is ready to run. Don't believe me? Check out the MSDN docs. If you're writing a reusable API that will be called by a user app, they might decide to drop a few of their threads' priorities. Messing with priorities is actually a very dangerous practice, and this is only one illustration of what can go wrong. (Other illustrations are topics for another day.) In summary, plenty of people do it and so reusable libraries need to be somewhat resilient to it; otherwise, we get bugs from customers who have some valid scenario for swapping around priorities, and then we as library developers end up fixing them in service packs. It's less costly to write the right code in the first place.

Here's the problem. If somebody begins the work that will make 'cond' true on a lower priority thread (the producer), and then the timing of the program is such that the higher priority thread that issues this spinning (the consumer) gets scheduled, the consumer will starve the producer completely. This is a classic race. And even though there's an explicit Sleep in there, issuing it doesn't allow the producer to be scheduled because it's at a lower priority. The consumer will just spin forever and unless a free CPU opens up, the producer will never produce. Oops!

You can solve this problem by changing the Sleep to use a parameter of 1:

uint loops = 0;
while (!cond) {
    if (Environment.ProcessorCount == 1 || (++loops % 100) == 0) {
        Thread.Sleep(1);
    } else {
        Thread.SpinWait(20);
    }
}

This fixes the problem, albeit with the disadvantage that the thread is unconditionally removed from the scheduler temporarily. (We also call SleepEx with an alertable flag which is more expensive due to APC checks, but I digress.) It's unfortunate that a quick 5 minute audit turns up plenty of Sleep(0)'s in the .NET Framework. I hope to get an FxCop rule created to catch this.

The kernel32!SwitchToThread API doesn't exhibit the problems that Sleep(0) and Sleep(1) do. Unfortunately, you can't reliably get at it from managed code. You can P/Invoke, but it's actually dangerous to do if you end up running in a host. We've overridden thread yielding behavior on the CLR such that we can call out to a host for notification purposes. This was used primarily for fiber mode in SQL Server (which was cut), so that it could use this as an opportunity to switch fibers, but other hosts are free to do what they please. If you don't care about working in a host, then feel free to do this, but please document it clearly and use the following HPA signature so people don't use your type incorrectly unknowingly:

[DllImport("kernel32.dll"), HostProtection(SecurityAction.LinkDemand, ExternalThreading=true)]
static extern bool SwitchToThread();

We're looking at adding a Thread.Yield API in the next rev of the CLR that does this in a host-friendly way. For now, you'll have to rely on Sleep(1).

Thankfully, the starvation problem is not quite *that* bad. The Windows scheduler combats this problem. It uses a balance set manager: a system daemon thread whose responsibility it is to wake up once a second to check for threads that are being starved because of a lower priority than other runnable threads. The goal of this service is to prevent CPU starvation and to minimize the impact of priority inversion. If any threads are found by the balance set manager which have been starved for ~3-4 seconds, those starved threads enjoy a temporary priority boost to priority 15 ("time critical"), virtually ensuring the thread will be scheduled. (Although this won't strictly guarantee it: if your other threads have real-time priorities, i.e. >15, then starvation will continue indefinitely... you're playing with dynamite once you enter that realm.) And once the thread does get scheduled, it also enjoys a quantum boost: its next quantum is stretched to 2x its normal time on client SKUs, and 4x its normal time on server SKUs. The priority decays as each quantum passes, continuing until the thread reaches its original lower priority.

In our example above when Sleep(0) is used, we hope this will unstick the machine and let the producer produce and finally the consumer to consume. Indeed with some testing, we see it unstick after a little more than 3 seconds. This is still long enough, however, to kill performance on a server application, cause a noticeable perf degradation on the client, and destroy responsiveness in a GUI app. Here's a simple test that exposes the problem (on a single-CPU machine):

using System;
using System.Diagnostics;
using System.Threading;

class Program {
    public static volatile int x = 0;

    public static void Main() {
        Stopwatch sw = new Stopwatch();
        sw.Start();

        SpawnWork();
        while (x == 0) {
            Thread.Sleep(0);
        }

        sw.Stop();
        Console.WriteLine("Sleep(0) = {0}", sw.Elapsed);

        x = 0;

        sw.Reset();
        sw.Start();

        SpawnWork();
        while (x == 0) {
            Thread.Sleep(1);
        }

        sw.Stop();
        Console.WriteLine("Sleep(1) = {0}", sw.Elapsed);
    }

    private static void SpawnWork() {
        ThreadPool.QueueUserWorkItem(delegate {
            Thread.CurrentThread.Priority = ThreadPriority.BelowNormal;
            x = 1;
        });
    }
}

And here's some example output which is quite consistent from run to run:

Sleep(0) = 00:00:03.8225238
Sleep(1) = 00:00:00.0000678

As we can see, in the case of Sleep(0), the balance set manager stepped in and boosted our producer thread after ~3-4 seconds as promised. We avoid the problem altogether with Sleep(1).

The moral of the story? Priorities are evil, don't mess with them. Always use Sleep(1) instead of Sleep(0). The Windows balance set manager is cool.

8/22/2006 10:18:05 PM (Pacific Daylight Time, UTC-07:00)  #   

 Monday, August 21, 2006

[Update - 8/22/06 - fixed typos and paid homage to VSTS 2005's code analysis which checks for this problem.]

From the department of Spolsky's Law of Leaky Abstractions, we turn today to accidental lock conflicts across AppDomain boundaries.

The CLR supports various cross-AppDomain marshaling mechanisms, one of which is known by the lovely name of marshal-by-bleed. This simply means that pointers from multiple AppDomains actually refer to the same location in memory. Most of the time some form of marshaling is required for objects so that we can safely isolate separate AppDomains from one another.

In managed code, you can lock on any object through the Monitor type, exposed in C# and VB via the 'lock' and 'SyncLock' keywords, respectively. The implementation of Monitor.Enter/Exit uses space in the object header and/or the object's sync-block to record exclusive ownership of the lock. The fact that objects typically don't bleed across AppDomains is a GoodThing(tm), as this is how add-ins, SQL Server, and other hosts isolate failures between components. When writing code, we typically assume state in one AppDomain can't corrupt state in another, totally independent, AppDomain.

Unfortunately, domain neutral Type objects (as well as other Reflection types, e.g. XXInfos) are actually shared across all AppDomains in the process. They are marshal-by-bleed. Strings also fall into this camp. A string argument to a remoted MarshalByRefObject method invocation may be bled, as can be any process-wide interned string literal. The System.Threading.Thread object (called the thread-base-object, aka TBO, internally) also bleeds across AppDomains. What a bloody mess! (Ha ha.)

So why does this all matter?!

Recall that lock owner information is tied to the instance. If you use any of these things as a target of Monitor.Enter, code running in one AppDomain can actually interfere with code in another AppDomain. That's because they are using the same object and thus the same lock information underneath. What a lousy abstraction--this was never meant to leak through! And it can cause trouble too. If one AppDomain orphans the lock (forgets to release it), it may cause deadlocks in other AppDomains. Even sans deadlocks, this fact can simply yield false conflicts, which can subsequently negatively impact scalability.

For example, consider this code:

lock (typeof(object)) {
    ...
}

Code in AppDomain A uses the same Type object to represent 'typeof(object)' as code in AppDomain B. Therefore they share lock information.

If we run such code from multiple AppDomains, the code yields a conflict:

WaitHandle wh = new ManualResetEvent("XXX", false);
lock (typeof(object)) {
    AppDomain ad2 = AppDomain.CreateDomain("2");
    ad2.DoCallBack(delegate {
        ThreadPool.QueueUserWorkItem(delegate {
            WaitHandle wh2 = new ManualResetEvent("XXX", false);
            lock (typeof(object)) {
                wh2.Set();
            }
        });
    });
    wh.WaitOne();
}

If one AppDomain is waiting for a synchronization event from another--as in this example--this can actually yield a deadlock. If you replaced the lock statements in this example with, say, lock ("Foo") { ... }, you'll see the same result due to string literal interning.

Clearly this is nasty problem, especially if Framework code were to use such patterns. This is one reason you'll notice we strongly discourage locking on Type objects. Even if you're not in mscorlib (by default the only domain neutral assembly), your type can be loaded domain neutral based on hosting policy, among other things. And therefore you may not even catch said bugs during testing.

Note that MarshalByRefObjects aren't subject to these problems. Although operations in one AppDomain can refer to the same instance in another, these accesses go through a proxy. Locking on the proxy is different than locking on the raw underlying object, and thus no false conflicts.

This is enforced with the DoNotLockOnObjectsWithWeakIdentity VSTS 2005 code analysis rule.

If all of this is making you feel rather queasy, fear not. We have a weekly "CLR Foundations" meeting where a large portion of the CLR Team meets to discuss the history of the CLR and .NET Framework. A couple weeks back this topic came up in passing. Most people on the team were quite surprised, and many even seemed to be enshrouded in disbelief. At least we can recognize a mistake after it's been made. ;)

8/21/2006 8:50:14 PM (Pacific Daylight Time, UTC-07:00)  #   

 Sunday, August 13, 2006

I've run across several algorithms lately that have benefited from the use of a Bloom filter. This led me to dig up the original paper (Space/time trade-offs in hash coding with allowable errors) in which the idea was proposed. What surprised me a little was that this technique was invented 35 years ago, by a fellow by the name of Burton Bloom, and yet it remains a simple and effective way to speed up a certain category of modern software problems.

The technique summarizes the contents of a set into a concise value (the filter) from which quick answers may be computed. This value is typically generated using some form of hash-coding of the set elements and stored within a bitmap, enabling ultra-fast bitwise updates and queries. This value promises to never return a false negative, although it is permitted to return false positives. So long as the false positive rate is low, many queries that would ordinarily have come up empty handed after a lengthy search will be retired in constant time. As an element is added to the set, the value is updated to reflect its presence. A tricky part of this technique, however, is that, because the value often summarizes the elements using one-bit-to-many-elements, removing an element from the set typically cannot just reset the corresponding bit. As items are removed over time, this can lead to a stale value, an increasing rate of false positives, and a higher number of lengthy searches. Thus the value must be periodically recomputed to combat this problem.

Let's take an example. Say you had a balanced binary tree that was frequently searched, infrequently deleted from, and whose searches more often lead to misses than they do hits. You can expect O(log n) search time. That's not bad, but for large values of n you may have incentive to speed up the common case, which happens to be the worst case for a perfectly balanced binary tree: a search that turns up empty. Using a Bloom filter gives you a slightly modified algorithm whose search complexity is Ω(1) for our problem's best case, and still O(log n) otherwise. This also adds some notable cost due to the need to periodically recompute the filter value, which is an O(n) operation, but since we infrequently delete items, we expect this to pay off in spades.

Say we already had a BinaryTree<T> data structure. We could easily adopt a Bloom filter with a series of minor modifications.

First, a new field to remember the filter's value. I've chosen a 64-bit bitmap for illustration:

long filterValue = 0;

And since we've used a bitmap, we need a routine to calculate any arbitrary element's single bit position. Remember, false positives are OK, and therefore multiple elements may share the same bit:

long GetFilterValueForElement(T e) {
    return 1 << (e.GetHashCode() % (sizeof(long) * 8));
}

When adding an item to the set, we must change the filter's value:

filterValue |= GetFilterValueForElement(e);

Here's the beneficial change. While searching for an item, we can add a quick check up front to speed up queries:

if ((filterValue & GetFilterValueForElement(e)) == 0) {
    // Element not found: we can be assured this is correct.
}
// Perform the existing, lengthier search. might be a false positive.

And of course, we must periodically do the O(n) operation on the tree to recompute the filter. We might do this whenever the tree becomes empty and every n deletions, for example:

if (isEmpty || (++deletionCount % recomputeFilterPeriod) == 0) {
    filterValue = 0;
    foreach (T e in this) {
        filterValue |= GetFilterValueForElement(e);
    }
}

You could even keep track of the number of false positives seen, and use that to determine when to initiate the recomputation.

This technique clearly won't work well in data structures and problems in which the deletion-to-query ratio is high, but there are plenty of situations where this technique helps tremendously. Logs of various forms, for example, exhibit the property that they are added to but never deleted. Depending on the density of the data structure, you may want to use an array instead of a bitmap--or a bitmap with more bits--to represent the summary. I ran a quick test and the simple bit-shifting hash-coding mechanism above yielded a full filter (~0) after only 227 random object allocations. You can of course even decide to reduce the density of your bitmap dynamically by detecting a close-to-full map and upgrading to a larger filter data structure. A dense filter often corresponds to a higher false positives rate, in which case you simply waste time maintaining and checking a value that doesn't give you any benefit.

8/13/2006 1:45:48 PM (Pacific Daylight Time, UTC-07:00)  #   

 Wednesday, August 09, 2006

The Lang.NET Symposium was held here in Redmond, WA last week.  My pal, Erik Meijer, the monadic maniac of Head in the BoxLambda the Ultimate, and now VB 9 fame, helped pull together the killer agenda, which is always a recipe for a good time.  Thottam, a PM on the CLR team, just posted a great summary on his blog.  He sent it out over email last week, and I'm really glad he posted it for everyone else to see.  There are a bunch of good language geek links lurking within. 

8/9/2006 8:42:02 PM (Pacific Daylight Time, UTC-07:00)  #   

 Sunday, August 06, 2006

This falls into the "fun hacks" category, meaning the result is not necessarily something you'd want to use in your everyday life. To go a step further, I strongly recommend you don't use the code shown here as-is; read the summary at the end for some rationale behind that statement. Enough with the disclaimer. On with the show...

Some requirements for our cross-process RWLock

Imagine you had the need for:

1. A managed reader/writer lock
2. that runs on down-level (pre-Vista) operating systems,
3. and that optionally works across process boundaries and AppDomains.

What do we already have that might fit the bill? The existing ReaderWriterLock type in the System.Threading namespace works fine for 1 and 2, but not 3. I suppose you could share it across AppDomains--or even processes--with some form of messaging scheme, but let's ignore that for a moment. It's a little *too* clever for my taste. Windows Vista of course comes with a new slim reader/writer lock. It's a close cousin of the Win32 CRITICAL_SECTION, and can be used for cross-AppDomain synchronization. Unfortunately, you have to P/Invoke to get at it from managed code, it won't run on pre-Vista operating systems, and doesn't work for cross-process scenarios.

Let's build it out of duct tape and barbed wire

It turns out that you can build a type that meets our requirements out of existing Windows kernel objects. All it takes is a little imagination. Here's what you need:

1. A semaphore to count the number of readers inside the lock.
2. A mutex to ensure only one writer can be in the lock at a time.
3. (Optionally) a manual-reset event used by writers to ensure no new readers enter the lock while it waits.

The scaffolding for one such implementation--which I will call the IpcReaderWriterLock--is as follows:

public class IpcReaderWriterLock : IDisposable

{

    /** Fields **/

 

    private const int DEFAULT_MAX_READER_COUNT = 25;

    private const string NAME_PREFIX = @"IpcRWL#";

 

    private readonly int m_maxReaderCount;

    private Semaphore m_readerSemaphore;

    private EventWaitHandle m_blockReadsEvent;

    private Mutex m_writerMutex;

    private int m_writerRecursionCount;

 

    /** Constructors **/

 

    public IpcReaderWriterLock() : this(null, DEFAULT_MAX_READER_COUNT) { }

    public IpcReaderWriterLock(string name) : this(name, DEFAULT_MAX_READER_COUNT) { }

    public IpcReaderWriterLock(int maxReaderCount) : this(null, maxReaderCount) { }

 

    public IpcReaderWriterLock(string name, int maxReaderCount)

    {

        m_maxReaderCount = maxReaderCount;

 

        string blockReadsEventName = null;

        string writerMutexName = null;

        string readerSemaphoreName = null;

        if (name != null)

        {

            blockReadsEventName = string.Format("{0}{1}#{2}", NAME_PREFIX, name, "RdEv");

            writerMutexName = string.Format("{0}{1}#{2}", NAME_PREFIX, name, "WrMtx");

            readerSemaphoreName = string.Format("{0}{1}#{2}", NAME_PREFIX, name, "RdSem");

        }

 

        m_blockReadsEvent = new EventWaitHandle(true, EventResetMode.ManualReset, blockReadsEventName);

        m_writerMutex = new Mutex(false, writerMutexName);

        m_readerSemaphore = new Semaphore(maxReaderCount, maxReaderCount, readerSemaphoreName);

    }

 

    /** Methods **/

 

    public void Dispose()

    {

        // Just close all of the kernel objects we opened during construction.

        // Note: this method is not thread-safe. If threads race with

        // one another to call Dispose, some nasty bugs will arise.

 

        if (m_blockReadsEvent != null)

        {

            m_blockReadsEvent.Close();

            m_blockReadsEvent = null;

        }

 

        if (m_writerMutex != null)

        {

            m_writerMutex.Close();

            m_writerMutex = null;

        }

 

        if (m_readerSemaphore != null)

        {

            m_readerSemaphore.Close();

            m_readerSemaphore = null;

        }

    }

 

    public void EnterReadLock() { ... } 

    public void ExitReadLock() { ... } 

    public void EnterWriteLock()  { ... }

    public void ExitWriteLock() { ... }

}

Notice that we allow naming of the lock. Any name given flows into the kernel objects used underneath, enabling cross-process and cross-AppDomain communication. You just create the same IpcReaderWriterLock with the same name in multiple processes or AppDomains, and they will magically interact with one another (whether you want them to or not). An unnamed lock is isolated inside of the AppDomain in which it was created. Notice also that there's a maximum number of simultaneous readers, the default for which is 25. This isn't terribly important, but any override does impact performance (as described below).

Read and write lock implementation

Now let's implement the read-lock acquisition and release functions, EnterReadLock and ExitReadLock. We support more than one reader at a time via the use of a semaphore (#1 above). We also support preventing blocking new readers from entering the lock while the writer waits for all readers to exit (#3 above). Thus, both of these functions are quite trivial to write:

public void EnterReadLock()

{

    Thread.BeginCriticalRegion();

 

    // We first wait on the read blocking event, in case a writer

    // has tried to acquire the lock and wants us to wait.

    m_blockReadsEvent.WaitOne();

 

    // Now take '1' from the reader semaphore to count the number

    // of simultaneous readers inside the lock.

    m_readerSemaphore.WaitOne();

}

 

public void ExitReadLock()

{

    // Just release '1' back to the semaphore to let others know

    // the number of simultaneous readers just decreased.

    m_readerSemaphore.Release();

 

    Thread.EndCriticalRegion();

}

Next comes the write-lock acquisition and release functions, EnterWriteLock and ExitWriteLock. They are slightly more complicated, but not by much. First we acquire the writer mutex. Once we've done that, we increment the recursion count, and ensure that we do some other work only the first time a writer lock is acquired on the thread. We block any new readers from entering, and then we effectively wait for all readers to exit. We do that by acquiring the semaphore n times, where n is the maximum number of readers that we support. Releasing the write lock does the reverse of all of that:

public void EnterWriteLock()

{

    Thread.BeginCriticalRegion();

 

    // We have to first ensure only one writer can get in at a time.

    m_writerMutex.WaitOne();

 

    // Increment our recursion count.

    m_writerRecursionCount++;

 

    // For the first writer who enters, we need to block new readers

    // and wait for any existing readers to exit the lock.

    if (m_writerRecursionCount == 1)

    {

        // Next we block any new readers from entering the lock.

        m_blockReadsEvent.Reset();

 

        // And lastly, we ensure that all readers have exited the lock.

        // We do this by acquiring the semaphore's capacity. It's

        // unfortunate that the Win32 APIs don't support a take-n

        // function for semaphores.

        for (int i = 0; i < m_maxReaderCount; i++)

        {

            m_readerSemaphore.WaitOne();

        }

    }

}

 

public void ExitWriteLock()

{

    // We have to do everything in the reverse order as we did

    // during acquisition. Not doing so can lead to subtle bugs,

    // including lost resets and deadlocks.

 

    m_writerRecursionCount--;

 

    // The last writer to release has to signal readers.

    if (m_writerRecursionCount == 0)

    {

        // We release the semaphore's capacity back, enabling readers

        // to take from it. Note that as soon as we call this, other

        // threads may wake up and race to acquire the semaphore. In

        // fact, simultaneous readers can get in, even though we still

        // have a writer in here!

        m_readerSemaphore.Release(m_maxReaderCount);

 

        // Unblock any readers that are waiting. Note: ideally we would

        // do this after signaling writers, so that readers can't sneak

        // in before the writer, but that would be more complicated: we

        // keep it simple for now.

        m_blockReadsEvent.Set();

    }

 

    // And lastly release the mutex.

    m_writerMutex.ReleaseMutex();

 

    Thread.EndCriticalRegion();

}

And that's it. A fully functioning reader/writer lock, for some definition of "functioning." The full source code can be found here: IpcReaderWriterLock.cs.

The test case

This example wouldn't be complete with a simple test case to prove that it works. The sample program included in the IpcReaderWriterLock source code creates 20 threads--10 readers and 10 writers--in 10 AppDomains. Each does a piece of work designed to expose race conditions via context switching at sensitive points. It prints out "Success" at the end, assuming it all worked. I see "Success" every time I run it, so it works on my machine at least. Hooray.

Summary: Don't use this thing

OK, OK... this lock is pretty icky and nasty to be honest, and you probably wouldn't want to use it. Ever. A simple write-lock acquisition incurs 27 kernel transitions with the default settings. This ends up costing over 1000-times the cost a simple monitor acquisition! (Yes, there are three 0's in that number... ouch.) Moreover, the cost increases proportional to the number of simultaneous readers that the lock instance supports, which is not very good. That's why I've used such a low default: 25. And it's not very reliable either, which, for anything that does cross-AppDomain or cross-process synchronization can be disastrous. One process that crashes can lead to machine-wide corruption and a user who needs to reboot the machine. A much better lock (performant, reliable, etc.) could be built using memory mapped files, although the implementation would be substantially more complicated.

I almost didn't write this post for all of these reasons. I'm not sure whether I'm doing a disservice to my readers by doing this. Instead, I hope that showing how simple Windows kernel objects can be composed together in interesting, powerful, and non-obvious ways is interesting, if only for trivia reasons. I'd also like to think that it may inspire you to think about things a little differently, perhaps helping you to write clever (but useful!) things out of the building blocks you already have.

8/6/2006 4:29:13 PM (Pacific Daylight Time, UTC-07:00)  #   

 

Recent Entries:

Search:

Browse by Date:
<October 2006>
SunMonTueWedThuFriSat
24252627282930
1234567
891011121314
15161718192021
22232425262728
2930311234

Browse by Category:

Notables: