RSS 2.0

Personal Info:

Joe Send mail to the author(s) is the lead developer and architect for Parallel Extensions to .NET, tinkers with type systems, and is an author and speaker.

Blogroll:
Other
News
 C|Net
 Kuro5hin
 The Register
Technology
 <?xmlhack?>
 Daily WTF
 DevX
 Hacknot
 Java Today
 Microsoft Top 10 Downloads
 MSDN
 MSDN: "Longhorn"
 MSDN: XML Developer Center
 Slashdot
 Techdirt
 theserverside.com
 W3C
 Web Pages That Suck
 XML Cover Pages
 XML Journal
 xml.com
Technology Blogs
 Aaron Skonnard [PluralSight]
 Adam Bosworth [Google]
 Andy Rich [MS/C++]
 Arpan Desai [MS/XML]
 BCL Team [MS]
 Bill Clementson [Lisp]
 Bill de hÓra
 Bruce Eckel [J]
 Bruce Tate [J]
 Casey Chestnut
 Cedric Beust [Google]
 Chris Anderson [MS/Avalon]
 Chris Lyon [MS]
 Christian Weyer
 Clemens Vasters [newtelligence]
 Craig Andera [PluralSight]
 Dan Sugalski [Parrot]
 Daniel Cazzulino
 Dave Chappel
 Dave Roberts [Lisp]
 Dave Thomas [PragProg]
 Dave Winer
 Dion Almaer [J]
 Don Demsak
 Doug Purdy [MS/Indigo]
 Drew Marsh
 Eric Gunnerson [MS]
 Eric Rudder [MS]
 Eric Sink
 Fritz Onion [PluaralSight]
 Gavin King [J/Hibernate]
 Grady Booch [IBM]
 Hervey Wilson [MS/Indigo]
 Hillel Cooperman [MS/Shell]
 Howard Lewis Ship [J/Apache]
 Ingo Rammer [PluralSight]
 James Gosling [J/Sun]
 James Strachan [J/Groovy]
 Jason Matusow [MS/OSS]
 Jeffrey Schlimmer [MS/Indigo]
 Joe Beda [Google]
 Joel Spoelsky
 Jon Udell
 Josh Ledgard [MS/Evang]
 Joshua Allen [MS]
 Lambda
 Larry Osterman [MS]
 Maoni Stephens [MS/CLR]
 Mark Fussell [MS/XML]
 Martin Fowler
 Martin Gudgin [MS/Indigo]
 Me
 Michael Howard [MS]
 Miguel de Icaza [Mono]
 Mike Clark
 Omri Gazitt [MS/Indigo]
 Pat Helland [MS/PAG]
 Pinku Surana
 Raymond Chen [MS]
 Rich Lander [MS/CLR]
 Rob Howard
 Rob Relyea [MS/Avalon]
 Robert Cringely
 S. Somasegar [MS/DevDiv]
 Sam Gentile
 Scoble [MS/Evang]
 Scott Guthrie [MS/WebNet]
 Scott Hanselman
 Sean McGrath [J]
 Simon Fell
 Stanley Lippman [MS/C++]
 Steve Maine
 Steve Swartz [MS/Indigo]
 Steve Vinoski
 Steven Clarke [MS/Usability]
 Stuart Halloway
 Ted Leung
 Ted Neward [DM]
 Tim Bray [Sun]
 Tim Ewald [Mindreef]
 Tim O'Reilly
 Werner Vogels [Amazon]
 Wintellect
 Yasser Shohoud [MS/Indigo]
Top 20
 Brad Abrams [MS/CLR]
 Chris Brumme [MS/CLR]
 Chris Sells [MS/Ultra]
 Cyrus Najmabadi [MS/C#]
 Dominic Cooney [MS/XAF]
 Don Box [MS/Ultra]
 Don Syme [MS/R]
 Guido van Rossum [Python]
 Herb Sutter [MS/C++]
 Ian Griffiths
 Jason Zander [MS/CLR]
 Jim Hugunin [MS/CLR]
 Joel Pobar [MS/CLR]
 Krzysztof Cwalina [MS/CLR]
 Patrick Logan
 Paul Graham
 Rico Mariani [MS/CLR]
 Rory Blyth [MS/DN]
 Sam Ruby
 Wesner Moise
VC/Business Blogs
 Ed Sim
 Fred Wilson
 Jonathan Schwartz [J/Sun]
 Lawrence Lessig [Stanford]
 Mark Cuban
 Michael Hyatt
 Pierre Omidyar
 Ross Mayfield
 VentureBlog
 Weekly Read
Wine, Food & Tea
 The Silk Road of Wine
 Vinography: a wine blog
 Wine Whys

Disclaimer:
The content of this site are my own personal opinions and do not represent my employer's view in anyway.

© 2009, Joe Duffy

 
 Sunday, March 04, 2007

In 2.0 SP1, we changed the threadpool's default maximum worker thread count from 25/CPU to 250/CPU.

The reason wasn't to encourage architectures that saturate so many threads with CPU-bound work, nor was it to suggest that having 249 threads/CPU blocked and 1/CPU actively running is actually a good design.  A non-blocking, continuation-passing/event-driven architecture is generally a better approach for the latter case.  Rather, we did this to statistically reduce the frequency of accidental and nondeterministic deadlocks.

Believe it or not, deadlocking the threadpool was the most frequently reported threading-related customer problem/complaint during my tenure as the CLR's concurrency PM.  There are KB articles and a wealth of customer and Microsoft employee blog posts about this issue. 

Many algorithms demand dependencies between parallel tasks.  And sometimes—as is frequently the case in data parallel algorithms—the number of tasks can be variable up to a factor of the input size.  A "task" in this context is just the closure passed to QueueUserWorkItem.  Consider a dumb parallel merge sort, which uses divide-and-conquer style parallelism, for example:

void Sort(int[] numbers, int from, int to) {
    if (from < to) { ...
        ThreadPool.QueueUserWorkItem(delegate { Sort(numbers, from, (from+to+1)/2); }); // T1
        Sort(numbers, (from+to+1)/2, to); // T2
        ... Wait for T1 to finish ...
        ... Merge left and right ...
    }
}

In this case, T1 is run in parallel and sorts the left half; T2 runs on the calling thread and sorts the right.  After T2 runs sequentially, it must wait for T1 to complete before moving on to the merge step.  As written, this algorithm is clearly inefficient and deadlocks easily.  Pass it an array of size 33 on a 2-CPU machine, and the threadpool's default maximums will ensure that some T1's can't even get scheduled, leaving threadpool threads blocked waiting for their queued (and stuck) counterparts to finish.  Depending on how tasks are scheduled at runtime this could deadlock.

Clearly the programmer needs to "stop" dividing the problem at some reasonable point, i.e. limit the maximum number of tasks generated; otherwise the task count will grow with some factor of the input size, causing deadlocks for large inputs (not to mention huge context switch and resource consumption overheads).  When might that point be?  Perhaps the programmer calculates some degree-of-parallelism (DOP) at the top of the recursive call stack, say log2(#ofCPUs).  DOP is passed to the first call to Sort and each subsequent recursive call decrements the DOP by 1.  So long as the DOP argument is >0, T1 is run in parallel; otherwise, T1 is run sequentially on the same thread, just like T2.  This ensures we don't spawn more tasks than there are CPUs.

And this will probably work.  Most of the time.

What if, just by chance, the stars aligned and 25 instances of this algorithm ran simultaneously?  Seems farfetched?  Maybe so.  Consider this: using log2(#ofCPUs) might not be enough in the case that some comparison routines block during sorting, possibly suggesting log2(2(#ofCPUs)) as a better DOP instead.  And then all we need is 12.5 occurrences.  Still a little farfetched, but not quite as much.  But wait: there could be other algorithms using the thread pool simultaneously, particularly on a server.  (Yes, data parallelism on a server is probably suspect in the first place, but for highly parallel servers with volatile loads, it could be useful.)  And remember, the thread pool is shared across all AppDomains in the process, so if you've written a reusable component, you're relying on all other software in the process to behave properly too (which you may have absolutely no control over).

Most of these admittedly represent imperfections in the overall design and architecture of the application, but the sad fact is that they tend to be somewhat common.  Especially when components are dynamically composed in the process, as is common with server and AddIn-style apps.  And they are very nondeterministic and hard to test for.  Our platform doesn't offer a mechanism today that allows developers to write code that is intelligently-parallel, particularly when many heterogeneous components are trying to use concurrency for performance.  Even with the suggested improvements and the CLR threadpool's old 25/CPU thread limit, the Sort routine could deadlock once in a while, maybe under extreme stress and very hard to reproduce conditions.  This will occur less frequently, statistically speaking, with the 250/CPU limit.  The problem is that all of this is just a heuristic, there aren't any hard numbers and coordination involved.

It’s also worth noting that the threadpool throttles its creation of threads to 2/second once the count has exceeded the #ofCPUs.  That means if a programmer sees this situation happening with regularity, they will also observe hard to diagnose performance degradations.  Once in a while, that sort algorithm might take 10-times longer to run, inexplicably.  If this happens a lot, the developer will notice, profile, and fix the issue.  While this isn’t great, this problem is typically quite rare in any one (properly written) program, and doesn’t happen with regularity.  Our first priority was to prevent the periodic and sporadic hangs, the things eating away at the reliability and uptime of programs, to trade them off for possible periodic performance blips.  Many of those horribly misarchitected programs will still deadlock deterministically with the new limits, and the thread injection throttling will help to discourage them from being written this way.  (It would take 125 seconds to create 250 threads on a 1-CPU machine, and it seems unlikely that a 2 minute-plus delay would be tolerated.  Some people use SetMinThreads to get around this, which is (usually) inexcusable.)

With all of this said, we clearly have a lot of work to do in the future to encourage better parallel program architectures and to provide better tools for diagnostics in this area.  I agree with the basic tenet "use as few threads as possible, but no fewer," but sometimes we have to sacrifice idealism to solve practical real-world problems.  In my experience, this should solve many such problems.

 

Recent Entries:

Search:

Browse by Date:
<July 2009>
SunMonTueWedThuFriSat
2829301234
567891011
12131415161718
19202122232425
2627282930311
2345678

Browse by Category:

Notables:

Currently Up To:

Reading...

Listening...

Watching...