
Personal Info:

Joe leads the architecture of an experimental OS's developer platform, where he is also chief architect of its programming language. His current mission is to enable writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe founded the Parallel Extensions to .NET project. He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife, writing books, writing music, studying music theory & mathematics, and doing anything involving food & wine.

Disclaimer:
The contents of this site are my own personal opinions and do not represent my employer's view in any way.

© 2012, Joe Duffy

 
 Thursday, September 29, 2005

I’ve talked about Thread Aborts before. And I spoke briefly at PDC about why you shouldn’t lock on objects shared across AppDomains (ADs). But I wanted to spend a brief moment fusing the two together to illustrate the point. There are some interesting factors at play here.

The Guidance

To begin with, our Design Guidelines advise:

Do not lock on any public types, or on instances you do not control. Notice that the common constructs lock (this), lock (typeof (Type)), and lock (“myLock”) violate this guideline.

Most people don’t intuitively understand the why behind not locking on Types and Strings. I’ll leave the public type discussion off the table for this post.
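As a minimal sketch of what the guideline points toward, the class below locks on a private object that no other code (and no other AD) can reach. The Counter type and its members are hypothetical names used only for illustration.

```csharp
using System;
using System.Threading;

// Hypothetical type, for illustration only.
class Counter {
    // Private lock object: callers cannot lock on it, and it is neither
    // a Type nor an interned string, so it is not shared across ADs.
    private readonly object sync = new object();
    private int count;

    public int Increment() {
        lock (sync) {
            return ++count;
        }
    }
}

class CounterDemo {
    static void Main() {
        Counter c = new Counter();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.Length; i++) {
            threads[i] = new Thread(delegate() {
                for (int j = 0; j < 1000; j++) c.Increment();
            });
            threads[i].Start();
        }
        foreach (Thread t in threads) t.Join();

        // 4 threads x 1000 increments, plus this one.
        Console.WriteLine(c.Increment()); // 4001
    }
}
```

Compare this with lock (this), where any caller holding a reference to the Counter could take the same lock.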

The Reasons

To understand why this is problematic, first you have to understand that we share objects across ADs. And you need to understand when we do it. Various Reflection bits and bytes—such as instances of the Type class—are one such case, when they refer to a domain neutral assembly. mscorlib always gets loaded domain neutral, for example; other assemblies can fall into this category too, based on Hosting policies. Interned strings are also shared across ADs, so a “Hello, World” literal in AD #A is the same precise object as that “Hello, World” in AD #B in the same process. All of the above are called cross-AD bled objects or AD-agile instances, something discussed in reasonable detail here.
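The interning half of this can be observed inside a single AD. The sketch below only shows that equal literals are one object within a process; it does not create a second AD, so the cross-AD sharing described above is taken on the post's word rather than demonstrated.

```csharp
using System;

class InternDemo {
    static void Main() {
        string a = "Hello, World";
        string b = "Hello, " + "World";          // folded into the same literal at compile time
        string c = new string(a.ToCharArray());  // equal contents, freshly allocated object

        Console.WriteLine(object.ReferenceEquals(a, b));                 // True
        Console.WriteLine(object.ReferenceEquals(a, c));                 // False
        Console.WriteLine(object.ReferenceEquals(a, string.Intern(c)));  // True
    }
}
```

So two unrelated pieces of code that each write lock ("myLock") are contending for the same Monitor, whether they meant to or not.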

A conclusion that you can make right away is that code in ADs #A and #B that locks on a shared object can interfere with each other, even if it’s only by coincidence. Subtle timing oddities might arise (including starving another AD’s completely unrelated opaque body of code for a seemingly unknown reason), but in many cases the effects won’t be so catastrophic. And in some rare situations it might even be intentional. But let’s take it a step further.

Next, you need to know a little about how we perform AD unloads. If you want to know a lot about this, go read Chris’s excellent post on the subject (the same as above). But I will try to summarize, and in doing so will paint a naïve picture of the world. During ordinary AD unloads we are careful to ensure that threads are unwound in an orderly fashion. That is to say, finally blocks lexically surrounding the instruction pointer are run, and of course objects in that AD are given a chance to run their Finalizers. This happens because a ThreadAbortException is generated at the current point of execution in each thread which actively has a stack in the AD.

Assuming you’ve written your code to use a lock statement (or at least to release the Monitor in a finally block), this orderly thread unwinding permits you to release any locks held. You may catch a Thread Abort, but it is a so-called undeniable exception, meaning it will be reraised at the end of your catch blocks. This is quite visible during an ordinary unload. And of course, Aborts are suspended while you’re in a CER, an unmanaged chunk of code that isn’t polling for aborts, a finally block, and so on. Lastly, if you see an Abort happen when your code holds a Monitor, you can be assured the entire AD is being ripped down, not just a single Thread; this assumption is safe because we work with Hosts (via Begin- and EndCriticalRegion) to let them know when the whole AD could become corrupt as the result of a single ThreadAbort.

But if you piss SQL Server off by taking too long in one of your finally blocks (for example), it will get a tad snippy. Specifically, it can respond by escalating to a rude AD unload. A rude unload does not tear the AD down by injecting ThreadAbortExceptions and enabling them to percolate back to the top of the call-stack. Rather, it rips it down very aggressively, bypassing lexically relevant finally blocks, only giving a best effort attempt at running CERs, and executing critical finalizers (CFOs) only. Of course, this isn’t nearly as aggressive as a P/Invoke to kernel32!TerminateProcess, but it’s not quite as polite as an ordinary unload.

This means, as a very specific example, that a finally block wishing to execute Monitor.Exit won’t even get run. And if the Exit doesn’t run, that Monitor will be permanently left stamped with the Thread’s ID as the owner. But the Thread has gone bye-bye. Orphaned. Until you’ve created 4,294,967,295 threads such that the Thread IDs wrap around and the old ID gets assigned to a new Thread, and that thread spuriously decides to Exit the Monitor without acquiring it first, your system is going to be locked up for a bit. In other words, deadlocked.
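A rude unload is hard to provoke without a host, but the orphaning itself can be approximated in a single AD: a thread that enters a Monitor and then simply ends, without ever calling Exit, leaves the same kind of ownerless lock behind. This is a rough stand-in under that assumption, not a reproduction of a rude unload, and the OrphanDemo name is made up for the example.

```csharp
using System;
using System.Threading;

class OrphanDemo {
    static readonly object sync = new object();

    // Returns whether the calling thread could take the lock
    // after its previous owner has already terminated.
    internal static bool TryAfterOwnerExit() {
        Thread careless = new Thread(delegate() {
            Monitor.Enter(sync);   // acquired, never released
        });
        careless.Start();
        careless.Join();           // the owning thread is gone now

        bool got = Monitor.TryEnter(sync, 500);
        if (got) Monitor.Exit(sync);
        return got;
    }

    static void Main() {
        Console.WriteLine("acquired after owner exited: " + TryAfterOwnerExit());
    }
}
```

By the reasoning above, the Monitor stays stamped with the dead thread's ID, so the TryEnter should time out and report False. A plain Enter in its place would wait forever.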

Side Note: Arguably this is the right behavior in any case; if two ADs were intentionally coordinating work, an orphaned lock is better than observing corrupt data structures. But for accidentally shared objects, perhaps it's overly draconian. But I digress.

In fact, any host might do this based on a variety of policies. Some might choose to perform rude AD unloads all of the time, while others might not do it at all. Most of them will use an escalation policy rather than doing it outright—such as SQL Server—but anything’s fair game when the host is in control. A matrix of which hosts do what would be nice, but I don’t have one. We have a nifty tool internally that allows simulation of any of these policies, but you can just as “easily” do it yourself by navigating the Hosting APIs. The general idea is described in more detail in Stephen Toub’s recent excellent MSDN article, and in gory detail in the Customizing the CLR book.

A Demonstration

Let’s first observe the effects of a scenario that locks on cross-AD objects:

using System;
using System.Threading;
 
class Program {
    static void Main() {
        // Start up a new AppDomain that hogs a lock.
        AppDomain ad = AppDomain.CreateDomain("FooDomain");
        ad.DoCallBack(delegate {
            Thread t = new Thread(delegate() {
                lock(typeof(string)) {
                    try {
                        Console.WriteLine("AD#B: Got it.");
                        Thread.Sleep(10000);
                    } catch (Exception e) {
                         Console.WriteLine("AD#B: {0}", e);
                         //Thread.Sleep(5000); // provoke a rude unload?
                    }
                }
            });
            t.Start();
        });
 
        // Pause briefly.
        Thread.Sleep(500);
 
        // This will fail because AD#B owns the shared lock.
        bool b = Monitor.TryEnter(typeof(string), 500);
        if (b) {
            Console.WriteLine("AD#A: Got it.");
            Monitor.Exit(typeof(string));
        }
 
        // Kill the other AppDomain.
        AppDomain.Unload(ad);
        Console.WriteLine("AD#A: AD#B is dead.");
 
        // Is the lock orphaned? If we provoked a rude unload, this should hang.
        lock(typeof(string)) {
            Console.WriteLine("AD#A: I got in!");
        }
    }
}

I hope the code is simple enough to be obvious. A brief explanation is warranted:

  1. From an existing Thread T1 in AD #A, we create a new AD #B, and start a new Thread of execution T2 running inside of it;
  2. T1 resumes and waits briefly to ensure T2 can make forward progress first;
  3. T2 locks on typeof(String), and then goes to sleep for a while;
  4. Meanwhile, T1 resumes, attempts to acquire the lock, and fails (because the lock is held by T2 because the String type is shared across ADs);
  5. T1 then initiates an Unload on AD #B;
  6. The result is a ThreadAbort in T2, the finally block releases the Monitor, and AD #B is successfully unloaded;
  7. T1 in AD #A successfully acquires the lock.

Throughout all of this, there is some nice text being printed to the console. I see the following:

AD#B: Got it.
AD#B: System.Threading.ThreadAbortException: Thread was being aborted.
   at System.Threading.Thread.SleepInternal(Int32 millisecondsTimeout)
   at Program.<Main>b__1()
AD#A: AD#B is dead.
AD#A: I got in!

Looks Fine, Eh?

Well that works just fine in unhosted scenarios, as we might have expected it to. The lock-protected bits of code stomp on each other, but at least AD #B happily gives up the lock during an ordinary unload. Note that if the code running in AD #B were careless, it might not have protected the lock acquisition/release in a try/finally, in which case AD #A would be screwed. It would deadlock when it attempted to acquire the lock.

But things get worse. More subtle deadlocks can occur, even if AD #B were written correctly through the use of the C# ‘lock’ statement. As we’ve already established, this might happen if the code were run inside a host that employed rude AD unloads, such as SQL Server. If a rude unload of AD #B were initiated while one of its threads held the lock, the same exact code that worked in the unhosted case would deadlock as soon as AD #A’s last attempt to acquire the lock executed. Presumably SQL Server would notice this deadlock and kill the code (perhaps leading to both ADs ultimately being unloaded), but I am not 100% certain about this.

A Possible Refinement

Through a combination of CERs, we can get our code working again. Note that—if it’s not obvious by now—the real solution is to avoid locking on cross-AD bled objects! Just don’t do it and you won’t get into this trouble. But of course, the geek inside instigates more fun…

Brian Grunkemeyer, a developer on our team, wrote a great piece of code sometime between Beta2 of Whidbey and now. It’s a method on Monitor called ReliableEnter, and it permits you to acquire a Monitor and know reliably whether it succeeded. It does so with a Boolean byref parameter which is set inside of a ThreadAbort-safe region of native code. This means that you can actually rely on the value of the Boolean in a cleanup CER, for example, to indicate whether the Monitor was successfully acquired or not, while at the same time not actually suspending ThreadAborts by wrapping the whole acquisition in a CER.
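To make the Boolean-byref pattern concrete: later releases (.NET Framework 4 and up) expose a public overload, Monitor.Enter(object, ref bool), with the same shape. The sketch below uses that public overload as a stand-in. It is not the internal ReliableEnter method itself and is not available in Whidbey; Monitor.IsEntered additionally assumes .NET Framework 4.5 or later.

```csharp
using System;
using System.Threading;

class LockTakenDemo {
    static readonly object sync = new object();

    // Returns the value of the flag as observed inside the lock.
    internal static bool RunOnce() {
        bool taken = false;
        bool seenInside = false;
        try {
            // The flag is set as part of the acquisition, so cleanup code
            // can trust it even if an exception interrupts this method.
            Monitor.Enter(sync, ref taken);
            seenInside = taken;
        } finally {
            if (taken) {
                Monitor.Exit(sync);
                taken = false;
            }
        }
        return seenInside;
    }

    static void Main() {
        Console.WriteLine("taken inside lock: " + RunOnce());                 // True
        Console.WriteLine("held after exit: " + Monitor.IsEntered(sync));     // False
    }
}
```

The cleanup path never has to guess whether the acquisition succeeded; it only reads the flag.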

Unfortunately, we were unable to make it accessible in Whidbey. It’s an internal method, and it got added too late. We’ll probably do that in the future. To make calling it cheap and possible, I wrote a little hack that uses a DynamicMethod to bind to it. In fact I did a little more than just that. I’m not going to analyze it in detail. Feel free to ask questions if you wonder how it works:

using System;
using System.Reflection;
using System.Reflection.Emit;
using System.Threading;

delegate void MonitorAction();
class ReliableMonitor
{
    class Holder<T>
    {
        internal Holder() { this.value = default(T); }
        internal Holder(T value) { this.value = value; }
        internal T value;
    }

    delegate void ReliableEnterDelegate(object obj, Holder<bool> taken);
    private static ReliableEnterDelegate monReliableEnter;

    static ReliableMonitor()
    {
        MethodInfo reMi = typeof(Monitor).GetMethod("ReliableEnter", BindingFlags.Static | BindingFlags.NonPublic);
        DynamicMethod dm = new DynamicMethod("Mon_ReliableEnter", null,
            new Type[] { typeof(object), typeof(Holder<bool>) }, typeof(Program), true);
        ILGenerator ilg = dm.GetILGenerator();
        ilg.Emit(OpCodes.Ldarg_0);
        ilg.Emit(OpCodes.Ldarg_1);
        ilg.Emit(OpCodes.Ldflda, typeof(Holder<bool>).GetField("value", BindingFlags.Instance | BindingFlags.NonPublic));
        ilg.Emit(OpCodes.Call, reMi);
        ilg.Emit(OpCodes.Ret);
        monReliableEnter = (ReliableEnterDelegate)dm.CreateDelegate(typeof(ReliableEnterDelegate));
    }

    internal static void Enter(object obj)
    {
        Monitor.Enter(obj);
    }

    internal static void RunWithLock(object obj, MonitorAction action)
    {
        Holder<bool> taken = new Holder<bool>();

        System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(
            delegate
            {
                monReliableEnter(obj, taken);
                action();
            },
            delegate
            {
                if (taken.value)
                {
                    Monitor.Exit(obj);
                    taken.value = false;
                }
            },
            null);
    }

    internal static void Exit(object obj)
    {
        Monitor.Exit(obj);
    }
}

Notice the RunWithLock method. It uses a great method RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup located in the System.Runtime.CompilerServices namespace. We call it SRCSRHECWGC—pronounced “shreek shreck woogy-cuck”—for short around here. Well, we don't really call it that, but I think I will from now on. SRCSRHECWGC runs the first delegate and uses some CER magic to guarantee that the cleanup code passed as the second argument executes in the face of rude AD unloads. At least the type of failures we’re concerned about here. It might not do its job very well if you pull the plug on your computer, for example.
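For readers who have not used it, here is a small sketch of the calling shape of ExecuteCodeWithGuaranteedCleanup on its own, with the delegate parameters written out. It only shows ordering under an ordinary exception and says nothing about rude unloads; CleanupDemo is a made-up name, and newer .NET versions flag the method as obsolete with a compiler warning.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

class CleanupDemo {
    // Runs a failing body and records the order in which things happen.
    internal static List<string> Run() {
        List<string> log = new List<string>();
        try {
            RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(
                delegate(object state) {
                    log.Add("body");
                    throw new InvalidOperationException("boom");
                },
                delegate(object state, bool exceptionThrown) {
                    // Second delegate: runs whether or not the body threw.
                    log.Add("cleanup, exceptionThrown=" + exceptionThrown);
                },
                null);
        } catch (InvalidOperationException) {
            log.Add("caught");
        }
        return log;
    }

    static void Main() {
        foreach (string line in Run()) Console.WriteLine(line);
        // body
        // cleanup, exceptionThrown=True
        // caught
    }
}
```

RunWithLock ignores both delegate parameters and relies on the captured Holder<bool> instead, which is why its anonymous methods omit the parameter lists.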

If we were to rewrite our code above to use the RunWithLock method, it could survive a rude AD unload and skirt the frightening onset of a deadlock:

class Program {

    static void Main() {
        // Start up a new AppDomain that hogs a lock.
        AppDomain ad = AppDomain.CreateDomain("FooDomain");
        ad.DoCallBack(delegate {
            Thread t = new Thread(delegate() {
                ReliableMonitor.RunWithLock(typeof(string),
                    delegate {
                        try {
                            Console.WriteLine("AD#B: Got it.");
                            Thread.Sleep(10000);
                        } catch (Exception e) {
                             Console.WriteLine("AD#B: {0}", e);
                             //Thread.Sleep(5000); // provoke a rude unload
                        }
                    });
            });
            t.Start();
        });
 
        // Pause briefly.
        Thread.Sleep(500);
 
        // This will fail because AD#B owns the shared lock.
        bool b = Monitor.TryEnter(typeof(string), 500);
        if (b) {
            Console.WriteLine("AD#A: Got it.");
            Monitor.Exit(typeof(string));
        }
 
        // Kill the other AppDomain.
        AppDomain.Unload(ad);
        Console.WriteLine("AD#A: AD#B is dead.");
 
        // Is the lock orphaned? If we provoked a rude unload, this should hang.
        lock(typeof(string)) {
            Console.WriteLine("AD#A: I got in!");
        }
    }
}

This has the effect that we wanted. The RunWithLock method guarantees that we release the lock even in the face of a rude unload. The result? AD #A does not deadlock.

Hoorah.

And they all rejoiced.

9/29/2005 2:49:37 PM (Pacific Daylight Time, UTC-07:00)  #   

 Tuesday, September 27, 2005

Many people forget that, regardless of n-tiers of process, it's all just code.

Shipping software is about ensuring the right code gets written at the right time, that it does something amazing, and that it works when it's supposed to.

In the end, that's really all that matters. Don't forget it.

9/27/2005 2:41:37 PM (Pacific Daylight Time, UTC-07:00)  #   

 Thursday, September 22, 2005

We made a change in Whidbey recently that impacts the verification of calls to virtual methods.

Invoking Virtual Methods Statically

Valid IL could previously invoke a precise implementation of a virtual method with a call instruction instead of a callvirt. The target type's exact method token could be specified, bypassing all dynamic dispatch altogether. For example, given two classes A and B

class A {
    public virtual void f() {
        Console.WriteLine("A::f");
    }
}

class B : A {
    public override void f() {
        Console.WriteLine("B::f");
    }
}

a consumer would ordinarily emit IL to perform a virtual dispatch, looking something like this in IL

newobj instance void B::.ctor()
callvirt instance void A::f()

The result is of course a properly dispatched virtual call which resolves to B's override and prints out "B::f". But somebody could do this instead

newobj instance void B::.ctor()
call instance void A::f()

The result of which is an ordinary statically dispatched call to A's implementation of f, printing out "A::f".
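C# will not emit the second sequence for you, but the difference can be sketched with a DynamicMethod that emits either opcode. This variant of A and B returns strings instead of printing so that the result is easy to inspect; the CallDemo and Invoker names are made up for the example, and it assumes a fully trusted process where the emitted IL is not subject to verification.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

public class A {
    public virtual string f() { return "A::f"; }
}

public class B : A {
    public override string f() { return "B::f"; }
}

class CallDemo {
    internal delegate string Invoker(A target);

    // Builds "ldarg.0; <callOp> A::f; ret" with the opcode supplied by the caller.
    internal static Invoker Build(OpCode callOp) {
        MethodInfo f = typeof(A).GetMethod("f");
        DynamicMethod dm = new DynamicMethod(
            "Invoke_f", typeof(string), new Type[] { typeof(A) }, typeof(CallDemo).Module);
        ILGenerator il = dm.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(callOp, f);
        il.Emit(OpCodes.Ret);
        return (Invoker)dm.CreateDelegate(typeof(Invoker));
    }

    static void Main() {
        A obj = new B();
        Console.WriteLine(Build(OpCodes.Callvirt)(obj)); // B::f
        Console.WriteLine(Build(OpCodes.Call)(obj));     // A::f
    }
}
```

The only difference between the two delegates is the opcode, yet the second one skips B's override entirely.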

Some consider this a violation of privacy through inheritance. Lots of code is written under the assumption that overriding a virtual method is sufficient to guarantee that the custom logic within gets called. Intuitively, this makes sense, and C# lulls you into this sense of security because it always emits calls to virtual methods as callvirts. C++ offers language syntax to do precisely this, however, e.g.

B b;
b.A::f();

I don't know of any other language that supports this type of call directly, but presumably somebody else followed in C++'s footsteps here. C# (and others) use this technique to implement 'call to base' functionality. Some compilers emit this type of IL so that their method resolution code can bind to virtual methods in a custom way. And others could do it in an attempt to "devirtualize" method calls when they know there are no overrides.

Verification Changes

Late in Whidbey, some folks decided this is subtly strange enough that we at least don't want partially trusted code to be doing it. That it's even possible is often surprising to people. We resolved the mismatch between expectations and reality through the introduction of a new verification rule.

The rule restricts the manner in which callers can make non-virtual calls to virtual methods, specifically by only permitting it if the target method is being called on the caller's 'this' pointer. This effectively allows an object to call up (or down, although that would be odd) its own type hierarchy. With this change, the above example fails verification, "The 'this' parameter to the call must be the calling method's 'this' parameter."
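The case the rule still permits is the familiar one. In the sketch below, base.f() compiles to a non-virtual call to A's f on the caller's own 'this' pointer, which is exactly the shape the new rule allows. As before, this variant of A and B returns strings rather than printing, and BaseCallDemo is a made-up name.

```csharp
using System;

class A {
    public virtual string f() { return "A::f"; }
}

class B : A {
    public override string f() {
        // Non-virtual call up the hierarchy, made on 'this'.
        return base.f() + " then B::f";
    }
}

class BaseCallDemo {
    static void Main() {
        A obj = new B();
        Console.WriteLine(obj.f()); // A::f then B::f
    }
}
```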

Identity Tracking

The verifier implements this magic using a technique called identity tracking. We don't use this style of tracking in many places. The verifier ordinarily tracks only the static type of items on the stack. But in this case, it needs to be comfortable that you're using the same arg.0 pointer for the method call as was passed onto the caller's stack frame. If you've executed a starg 0 in the IL stream, for example, you won't be permitted to make the call. Even if you do a ldarg.0 followed by a starg 0, the verifier tosses you out the window.

A catch here is that while you might be operating dynamically on the 'this' pointer, the verifier avoids statically tracking pointers across method calls. An example of where this can produce a false positive is as follows

class A {
    public virtual void f() {
        Console.WriteLine("A::f");
    }
}

class B : A {
    public override void f() {
        Console.WriteLine("B::f");
    }

    private B Echo(B b) { return b; }

    public void FailsVerification() {
        Echo(this).A::f();
    }
}

It's clear that FailsVerification is really just invoking methods on its this pointer. But it does so in a roundabout fashion. (Of course that 'A::f()' syntax is pseudo-code; it would compile in C++, but C# doesn't offer such a feature.) Regardless, the IL that gets produced isn't verifiable.

9/22/2005 1:37:52 PM (Pacific Daylight Time, UTC-07:00)  #   

 Wednesday, September 14, 2005

2 talks down, 0 to go.

What a great feeling. Both talks are currently in the Top 10 for session evals. We'll see if I can stay on top.

Update: Please fill out the session evals on CommNet if you attended my talks. Thanks!
Update #2: The decks for my talks are available: (1) Programming w/ Concurrency and (2) Writing a Compiler.

Time to fall asleep reading Virtual Machines: Versatile Platforms for Systems and Processes.

9/14/2005 12:23:59 AM (Pacific Daylight Time, UTC-07:00)  #   

 Thursday, September 08, 2005

Nearly 20 people from the CLR Team will be at PDC next week. This includes our Product Unit Manager, some senior Architects, Program Managers, Devs, and Testers. Of course, we'd love to meet one-on-one with folks from your company, or even you individually. We have a room available, and are flexible on the times.

If you're interested, just send an email to: PDC0511s@microsoft.com. Let us know what you're interested in. Note: Being the CLR Team, we're admittedly focused on the low level goop...But almost anything's fair game: AppCompat, Security, Reliability, Performance, Concurrency, Base Class Libraries, Garbage Collection, the Future of the CLR, etc.

If you want to meet with me in particular, email me at: joedu@microsoft.com.

9/8/2005 10:10:32 PM (Pacific Daylight Time, UTC-07:00)  #   

 Sunday, August 28, 2005

This is a fun example that illustrates a few topics I'm discussing at PDC in a couple weeks.

What, if anything, can cause Thread 2's assert below to fire?

class Foo {
    static Foo lastFoo;

    string state;
    bool initialized;

    public Foo() {
        state = "Developers, Developers, Developers!";
        initialized = true;
    }
}

// Thread #1:            // Thread #2:
lastFoo = new Foo();     Foo f = lastFoo;
                         Debug.Assert(f.initialized == true &&
                             f.state != null);

For purposes of illustration, imagine that lastFoo has already been initialized to some Foo prior to threads #1 and #2 executing.

8/28/2005 6:30:09 PM (Pacific Daylight Time, UTC-07:00)  #   

 Thursday, August 25, 2005

Following on the tail of Mr. Abrams's Channel9 video (is there one too many s's in there?), check out our video where we discuss the CLR Team's presence at the PDC (Part 1 and Part 2).

We talk with:

There were a few folks we didn't get to chat with:

In watching them, I think I frontloaded the talk in a selfish way. My two talks are first...It wasn't intentional, I swear! But oh well... Check it out!

I even made up a new word in the process: expressitivity. Doh! ;)

8/25/2005 10:43:49 PM (Pacific Daylight Time, UTC-07:00)  #   

 Tuesday, August 23, 2005

You know you're a geek when:

  1. You read processor manuals for fun.
  2. ...

I've been deeply internalizing the memory models implemented on various flavors of x86, IA-32, AMD-64, and IA-64 lately. And then rationalizing how our various JITs manage to implement the new strengthened Whidbey memory model on each architecture. Believe it or not, I love this stuff. One of the perks of being a Microsoft employee is that you can gain access to dual-proc/dual-core/HT machines, AMD-64 and IA-64 boxes, and basically anything else you could imagine. Now if there were just more hours in the day.

Here are a few good resources if you're interested in doing some research yourself:

My PDC talk touches on some of the details of memory models briefly. I wish I could do an entire talk on cache coherency, branch prediction, pipelining, instruction reordering, and the like...But I think that would put most attendees to sleep. There needs to be more me's in the world.

8/23/2005 10:57:14 PM (Pacific Daylight Time, UTC-07:00)  #   

 Sunday, August 14, 2005

I encountered a great quote recently while listening to Business at the Speed of Thought by Bill Gates. It really hit home with me:

"We always overestimate the change that will occur in the next two years and underestimate the change that will occur in the next ten. Don't let yourself be lulled into inaction."

Of course, the definition of "inaction" is up for grabs. Perhaps complacency would have been a more appropriate word. Action is great, but sufficient action is another thing altogether.

8/14/2005 11:38:26 AM (Pacific Daylight Time, UTC-07:00)  #   

 Friday, August 12, 2005

Check out Soma's post about the Nullable<T> DCR we recently implemented...we referred to the project as nullbox internally. This one kept me up at night on a few occasions, but was a lot of fun. :) Huge risk, but based on lots of feedback it was the right thing to do. And the team executed perfectly, nailing our target dates at each step along the way.

I alluded to this work here and here. I was vague and avoided answering probing comments intentionally. Now I can answer them...so ask away!

The core of this change is that the IL box instruction has been modified to recognize Nullable<T>s. For non-Nullables, behavior remains the same; but upon seeing one, it inspects its HasValue property. If HasValue is true, box peeks inside the structure, extracts the T value, and boxes that instead; otherwise, box simply leaves behind a null reference. Obviously, unbox has also been changed to allow nulls to be unboxed back into Nullable<T> structures. This had a rippling effect in the CLR codebase and also required changes to late-bound semantics to mimic the static case.

The result is that given

int? x = null;
object y = x;

both expressions

x == null
y == null

evaluate to true. And furthermore, given

bool F<T>(T t) {
    return t == null;
}

the following expressions

F(x)
F(y)

also evaluate to true.
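Pulling those fragments into one program, and adding the round trip through box and unbox described earlier, gives a sketch like the following. NullableBoxDemo is a made-up name; the behavior shown assumes a runtime that includes this change.

```csharp
using System;

class NullableBoxDemo {
    internal static bool F<T>(T t) { return t == null; }

    static void Main() {
        int? x = null;
        object y = x;                          // boxing an empty Nullable<int> leaves a null reference
        Console.WriteLine(y == null);          // True
        Console.WriteLine(F(x));               // True
        Console.WriteLine(F(y));               // True

        int? z = 42;
        object boxed = z;                      // boxes the underlying int, not the Nullable<int>
        Console.WriteLine(boxed.GetType());    // System.Int32

        int? fromNull = (int?)y;               // null unboxes back into an empty Nullable<int>
        int? fromInt = (int?)boxed;            // a boxed int unboxes into a Nullable<int> with a value
        Console.WriteLine(fromNull.HasValue);  // False
        Console.WriteLine(fromInt.Value);      // 42
    }
}
```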

I intend to post a more detailed summary of the DCR over the coming week[s].

8/12/2005 9:18:30 AM (Pacific Daylight Time, UTC-07:00)  #   

 
