
Personal Info:

Joe is a lead architect on an OS incubation project at Microsoft, and was the architect for Parallel Extensions to .NET. He is an author and frequent speaker.

Disclaimer:
The contents of this site are my own personal opinions and do not represent my employer's view in any way.

© 2010, Joe Duffy

 
 Friday, March 13, 2009

Managed code generally is not hardened against asynchronous exceptions.

“Hardened” simply means “written to preserve state invariants in the face of unexpected failure.”  In other words, hardened code can tolerate an exception and continue being called subsequently without a process or machine restart.  Conversely, code that is not hardened may misbehave if continued use is attempted: it can corrupt state and subsequently behave strangely and unpredictably.

Asynchronous exceptions are a foreign concept to native programmers, and arise because there is a runtime underneath all managed code that is silently injecting code on behalf of the original program.  The only truly asynchronous exception is ThreadAbortException, but any in the set { OutOfMemoryException, TypeInitializationException, ThreadInterruptedException, StackOverflowException } are often labeled as such.  While thread aborts can happen at any line of code outside a delay-abort region, these other exceptions can be introduced by the CLR at surprising times; i.e., at { memory allocations, static member access, blocking calls, any function call }, respectively.  The effect is that, unlike most exceptions, the points at which they may occur are not obvious.  OOMs, for instance, can happen at any method call (due to failure to allocate memory in which to JIT code), implicit boxing, etc.
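To make the "not obvious" part concrete, here is a minimal sketch.  The Config, Probe and HiddenFailurePoints types are hypothetical, invented purely for illustration.  The static constructor fails, and the failure surfaces at what looks like a plain field read; the comments also mark an implicit boxing allocation of the kind mentioned above.

```csharp
using System;

// Hypothetical helper: counts how often Config's static constructor runs.
static class Probe
{
    public static int InitializerRuns;
}

// Hypothetical type whose static constructor always fails.
static class Config
{
    public static readonly int Timeout;

    static Config()
    {
        Probe.InitializerRuns++;
        throw new InvalidOperationException("settings unavailable");
    }
}

static class HiddenFailurePoints
{
    // Reads Config.Timeout and reports which exception the caller observed.
    public static string ReadTimeout()
    {
        try
        {
            int t = Config.Timeout;   // looks like a plain field read
            object boxed = t;         // implicit boxing: a hidden heap allocation
            return boxed.ToString();
        }
        catch (TypeInitializationException e)
        {
            // The original failure is wrapped by the runtime.
            return e.GetType().Name + " <- " + e.InnerException.GetType().Name;
        }
    }
}
```

Calling ReadTimeout() a second time fails the same way, and the static constructor is not run again, so the type stays unusable for the life of the AppDomain.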

(As of 2.0, StackOverflowException is no longer relevant because SO triggers a FailFast of the process instead.  So saying that managed code is not hardened against SO is an understatement.)

Also, because of the way COM reentrancy works, any blocking call can lead to any arbitrary code dispatched through STA pumping.  And that arbitrary code, much like an APC, can fail via any arbitrary exception.  These are a lot like asynchronous exceptions.  So in truth, code that isn’t written to respond to arbitrary exceptions at all blocking points is technically not hardened either.

.NET doesn’t provide checked exceptions, so the blunt reality is that very little managed code is hardened properly to synchronous exceptions either.  I think we do a better job in the framework of carefully engineering the code to resiliently tolerate failure, usually by being very careful about argument validation, but we aren’t perfect.  Some things slip through.

If you stop to think about why hardening isn’t done, it’s probably obvious.  It’s darn difficult.  Especially for asynchronous exceptions where nearly every line of code must be considered.  In Win32 programming, most failure points are indicated by return codes.  (Although C++ exceptions can sneak through the cracks at surprising times.  Like the fact that EnterCriticalSection can throw.)  While error codes are cumbersome to program against (since every call needs to be checked for a plethora of conditions, making it easy to miss something), at least the response to failure is explicit.  You can decide to propagate and leave state corrupt, fix up state and then propagate, rip the process, or ignore the failure, as appropriate.  This becomes part of the API contract.  In managed code, you need to know to wrap such calls in try/catch blocks.  Nobody does this.  It’s insane to even consider doing that.  And because nobody does, you can’t even catch exceptions coming out of a single API call and know, when faced with an OOM (for example), that all code on the propagating callgraph has transitively handled the failure in a controlled manner.  The very fact that the lock{} statement auto-unlocks without rolling back corrupt state should be indication enough of the current state of affairs.
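The lock{} point can be sketched in a few lines.  The Ledger class below is hypothetical, and the failMidway flag simply stands in for any exception that happens to land between two related writes.  The lock is released during unwinding, but the half-finished update stays behind for the next caller to trip over.

```csharp
using System;
using System.Threading;

// Hypothetical example type.  Invariant: _a + _b == Total.
sealed class Ledger
{
    public const int Total = 100;

    private readonly object _gate = new object();
    private int _a = 100;
    private int _b = 0;

    public void Transfer(int amount, bool failMidway)
    {
        lock (_gate)
        {
            _a -= amount;
            if (failMidway)
            {
                // Stands in for any exception thrown between the two writes.
                throw new InvalidOperationException("simulated failure");
            }
            _b += amount;
        }   // the monitor is exited on both the normal and the exceptional path
    }

    public bool InvariantHolds()
    {
        lock (_gate) { return _a + _b == Total; }
    }

    // Probes from another thread, because Monitor is reentrant on the owning thread.
    public bool LockIsFree()
    {
        bool taken = false;
        var probe = new Thread(() =>
        {
            if (Monitor.TryEnter(_gate))
            {
                taken = true;
                Monitor.Exit(_gate);
            }
        });
        probe.Start();
        probe.Join();
        return taken;
    }
}
```

After a failed Transfer, LockIsFree() reports true while InvariantHolds() reports false.  A hardened version would have to restore _a before letting the exception propagate, and would have to do so for every possible failure point inside the lock.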

An instance of any of the aforementioned exceptions means the AppDomain is toast.

By toast, I mean that it’s soon going to be unusable, and hopefully actively being unloaded.  Code in the framework assumes this, and you should too.  All it does is try to get out of the way by not crashing or hanging the ensuing unload.  A small fraction of code that deals with process-wide state comprised of resources not under the purview of the CLR GC needs to worry about running and avoiding leaks.  This is where things like CERs, CriticalFinalizerObjects, and paired operations stuck in finally blocks come into play.  They ensure cross-process state is freed, and that asynchronous exceptions cannot occur in places that would crash or hang a clean unload.
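As a rough sketch of the "paired operations stuck in finally blocks" idiom: the NativeScratch class below is hypothetical, and the Outstanding counter exists only so the pairing can be observed.  It relies on the .NET Framework behavior that a thread abort is not delivered while a finally block is running.  It does not cover rude unloads or process exit, which is where CriticalFinalizerObject and CERs come in.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical example type.
static class NativeScratch
{
    // Number of native blocks currently allocated by this class.
    public static int Outstanding;

    public static byte RoundTrip(byte value, bool failDuringUse)
    {
        IntPtr block = IntPtr.Zero;
        try
        {
            // Acquire inside a finally block so that an abort cannot land
            // between the allocation and the assignment to 'block'.
            try { }
            finally
            {
                block = Marshal.AllocHGlobal(1);
                Outstanding++;
            }

            Marshal.WriteByte(block, value);
            if (failDuringUse)
            {
                throw new InvalidOperationException("simulated failure");
            }
            return Marshal.ReadByte(block);
        }
        finally
        {
            // The paired release runs on every exit path.
            if (block != IntPtr.Zero)
            {
                Marshal.FreeHGlobal(block);
                Outstanding--;
            }
        }
    }
}
```

Whether the call returns or throws, Outstanding ends up back at zero.  Note that this only prevents the leak; it says nothing about whether the rest of the program's state is still consistent.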

Unfortunately, it’s not always the case that the AppDomain is unloading when such an exception occurs:

  • Somebody can call Thread.Abort directly, without killing the AppDomain.  They can either call ResetAbort and keep it around, or let it return to the ThreadPool which catches and swallows aborts.  In fact, we tell people that synchronous aborts a la Thread.CurrentThread.Abort are “always safe”, whereas we tell people asynchronous aborts are dangerous and best avoided.
  • Some framework infrastructure, most notably ASP.NET, even aborts individual threads routinely without unloading the domain.  They backstop the ThreadAbortExceptions, call ResetAbort on the thread and reuse it or return it to the CLR ThreadPool.  That means any code running in ASP.NET is apt to be corrupted when websites are recycled and AppDomain isolation is not being used.
  • Assume AppDomain B is being unloaded.  If some thread has called from A to B to C, the thread will immediately suffer an abort.  The result is that C will see a thread unwinding with a ThreadAbortException, back into B, and then back to A, at which point the exception turns into a deniable AppDomainUnloadedException that can be caught.  But C has seen an in-flight abort and yet it is not being unloaded.  The result is that C’s state may be completely corrupt.  I believe this should be considered a bug in the CLR.
  • We can’t differentiate between soft- and hard-OOMs today.  The former are caused by requests to allocate large blocks of memory.  Often a failure here isn’t indicative of a disaster.  It may be due to a need to allocate 1GB of contiguous memory, and perhaps there is fragmentation.  Hard OOMs are often caused by running up against the edge of the machine where no pagefile space is available, and may indicate a failure to JIT some important method, among other things.  But because we don’t differentiate, any managed code can catch-and-ignore any kind of OOM, including hard ones.
  • Thread interruptions are often used as a form of inter-thread communication.  For example, they can be used as a poor man’s cancellation.  (This is inappropriate, and cooperative techniques should always be used.  But it is widespread.)  But because they are used as a means of communication, they are almost always caught and handled in some controlled manner.  This is one place where we screwed up by not hardening the frameworks against interrupted blocking calls and reacting intelligently.  Checked exceptions would have saved us.
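The "poor man’s cancellation" in the last bullet typically looks something like the sketch below.  The InterruptDemo class is hypothetical; Thread.Interrupt and ThreadInterruptedException are the real APIs.  The worker treats the interrupt as a stop request and handles it in a controlled way.

```csharp
using System;
using System.Threading;

// Hypothetical example type.
static class InterruptDemo
{
    // Starts a worker that blocks forever, interrupts it, and reports how it exited.
    public static string Run()
    {
        string outcome = "not started";
        var worker = new Thread(() =>
        {
            try
            {
                Thread.Sleep(Timeout.Infinite);   // a blocking call
                outcome = "woke normally";
            }
            catch (ThreadInterruptedException)
            {
                // The worker interprets the interrupt as a request to stop.
                outcome = "interrupted";
            }
        });

        worker.Start();
        worker.Interrupt();   // delivered once the worker is blocked, or next blocks
        worker.Join();
        return outcome;
    }
}
```

The trouble is that the exception is thrown out of whichever blocking call the thread happens to be in.  Here that is the worker’s own Sleep, but it could just as easily be a lock acquisition or a wait several frames down inside library code that was never written to expect it.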

What does all of this mean?  Quite simply, the .NET Framework cannot be trusted when any of the aforementioned exceptions are thrown.  Ideally the process will come tumbling down shortly thereafter, but improperly written code can catch them and continue trying to limp along.  In fact, as I mentioned above, some wildly popular & successful application models do (notably, ASP.NET and WinForms).

This state of affairs is admittedly unfortunate.  We don’t properly separate out the truly fatal exceptions from those that we can gracefully recover from.  In an ideal world, I’d love to see us do that.  For example:

  1. At some point, we really ought to consider FailFast instead of continuing to run code under failures we know are fatal and dangerous to attempt to recover from, much like we do with SO.  At least these failures should be undeniable like thread aborts are.  But this is a fairly Byzantine response and is not for the faint of heart.  Given that we still live in a world where WinForms wraps the top-most frame of the GUI thread in a catch-all, presents a dialog box, and allows a user to click “Ignore & Continue”, I seriously doubt we’ll get there anytime soon.
  2. Never expose a ThreadAbortException to code in an AppDomain unless we can guarantee the AppDomain is being unloaded.  That means getting rid of the Abort API, and thus indirectly disallowing code from catching and calling ResetAbort.  It also means the A calls B calls C case would not allow B to unload until the thread voluntarily unwinds out of C.
  3. Allow OOMs to be caught only when they are soft.  That means a call to ‘new’, and it means the catch must occur inside the same stack frame as the call to ‘new’.  Such exceptions can be tolerated if code is properly written, and we will tell developers to be mindful of them.  Once such an OOM propagates past the calling stack frame, it will escalate to hard.
  4. All other OOMs are hard and fatal.  This includes failure to allocate memory to JIT code and failure to allocate 20 bytes to box an int.  Hard OOMs are thus undeniable.
  5. Get rid of ThreadInterruptedExceptions.  We screwed this up from Day One, and it’s probably too late to fix this.  We added cooperative cancellation in .NET 4.0 for a reason.
  6. TypeInitializationExceptions can probably stay, but we should allow rerunning the cctor upon subsequent accesses.  Today, once a class C throws from its cctor, the class can never be constructed.  So on the current plan, it only makes sense to FailFast.
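To illustrate what item 3 would still permit, here is a sketch of a soft OOM handled in the same stack frame as the call to ‘new’.  The SoftOom class and its method names are hypothetical.  The example assumes that a request for a byte array of int.MaxValue elements is rejected with an OutOfMemoryException, which is the behavior I would expect on current runtimes but is worth verifying on yours.

```csharp
using System;

// Hypothetical example type.
static class SoftOom
{
    // The catch sits in the same frame as the 'new', so nothing else
    // has unwound past a half-finished operation.
    public static byte[] TryAllocate(int length)
    {
        try
        {
            return new byte[length];
        }
        catch (OutOfMemoryException)
        {
            return null;   // signal failure to the caller instead of propagating
        }
    }

    // Falls back to a smaller buffer when the preferred size is unavailable.
    public static byte[] AllocateWithFallback(int preferred, int fallback)
    {
        return TryAllocate(preferred) ?? TryAllocate(fallback);
    }
}
```

For example, SoftOom.AllocateWithFallback(int.MaxValue, 1024) should fall back to the small buffer.  Under the proposal, an OOM that escapes TryAllocate, or one raised by JIT or boxing rather than an explicit ‘new’, would be treated as hard and could not be caught like this.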

I’m sure there are many other things we could do to improve things.  But these 6 general themes would be a great start.

I’m just spitballing here.  There are no concrete plans to do any of these 6 things as far as I know.  And at the end of the day, hardening only improves the statistics of the situation, so it tends to be very difficult to argue for one change over another, particularly if taking the change would make existing programs break.  But I really would like to see the base level of reliability in managed code improve with time.  Especially with the exciting work going on around contract-checking in the BCL in Visual Studio 10, I hope these topics become top-of-mind for folks again in the near future.

3/13/2009 1:24:25 PM (Pacific Daylight Time, UTC-07:00)

 
