RSS 2.0

Personal Info:

Joe Send mail to the author(s) works on parallel libraries, infrastructure, and programming models in Microsoft's Developer Division.

Blogroll:
Other
News
 C|Net
 Kuro5hin
 The Register
Technology
 <?xmlhack?>
 Daily WTF
 DevX
 Hacknot
 Java Today
 Microsoft Top 10 Downloads
 MSDN
 MSDN: "Longhorn"
 MSDN: XML Developer Center
 Slashdot
 Techdirt
 theserverside.com
 W3C
 Web Pages That Suck
 XML Cover Pages
 XML Journal
 xml.com
Technology Blogs
 Aaron Skonnard [PluralSight]
 Adam Bosworth [Google]
 Andy Rich [MS/C++]
 Arpan Desai [MS/XML]
 BCL Team [MS]
 Bill Clementson [Lisp]
 Bill de hÓra
 Bruce Eckel [J]
 Bruce Tate [J]
 Casey Chestnut
 Cedric Beust [Google]
 Chris Anderson [MS/Avalon]
 Chris Lyon [MS]
 Christian Weyer
 Clemens Vasters [newtelligence]
 Craig Andera [PluralSight]
 Dan Sugalski [Parrot]
 Daniel Cazzulino
 Dave Chappel
 Dave Roberts [Lisp]
 Dave Thomas [PragProg]
 Dave Winer
 Dion Almaer [J]
 Don Demsak
 Doug Purdy [MS/Indigo]
 Drew Marsh
 Eric Gunnerson [MS]
 Eric Rudder [MS]
 Eric Sink
 Fritz Onion [PluaralSight]
 Gavin King [J/Hibernate]
 Grady Booch [IBM]
 Hervey Wilson [MS/Indigo]
 Hillel Cooperman [MS/Shell]
 Howard Lewis Ship [J/Apache]
 Ingo Rammer [PluralSight]
 James Gosling [J/Sun]
 James Strachan [J/Groovy]
 Jason Matusow [MS/OSS]
 Jeffrey Schlimmer [MS/Indigo]
 Joe Beda [Google]
 Joel Spoelsky
 Jon Udell
 Josh Ledgard [MS/Evang]
 Joshua Allen [MS]
 Lambda
 Larry Osterman [MS]
 Maoni Stephens [MS/CLR]
 Mark Fussell [MS/XML]
 Martin Fowler
 Martin Gudgin [MS/Indigo]
 Me
 Michael Howard [MS]
 Miguel de Icaza [Mono]
 Mike Clark
 Omri Gazitt [MS/Indigo]
 Pat Helland [MS/PAG]
 Pinku Surana
 Raymond Chen [MS]
 Rich Lander [MS/CLR]
 Rob Howard
 Rob Relyea [MS/Avalon]
 Robert Cringely
 S. Somasegar [MS/DevDiv]
 Sam Gentile
 Scoble [MS/Evang]
 Scott Guthrie [MS/WebNet]
 Scott Hanselman
 Sean McGrath [J]
 Simon Fell
 Stanley Lippman [MS/C++]
 Steve Maine
 Steve Swartz [MS/Indigo]
 Steve Vinoski
 Steven Clarke [MS/Usability]
 Stuart Halloway
 Ted Leung
 Ted Neward [DM]
 Tim Bray [Sun]
 Tim Ewald [Mindreef]
 Tim O'Reilly
 Werner Vogels [Amazon]
 Wintellect
 Yasser Shohoud [MS/Indigo]
Top 20
 Brad Abrams [MS/CLR]
 Chris Brumme [MS/CLR]
 Chris Sells [MS/Ultra]
 Cyrus Najmabadi [MS/C#]
 Dominic Cooney [MS/XAF]
 Don Box [MS/Ultra]
 Don Syme [MS/R]
 Guido van Rossum [Python]
 Herb Sutter [MS/C++]
 Ian Griffiths
 Jason Zander [MS/CLR]
 Jim Hugunin [MS/CLR]
 Joel Pobar [MS/CLR]
 Krzysztof Cwalina [MS/CLR]
 Patrick Logan
 Paul Graham
 Rico Mariani [MS/CLR]
 Rory Blyth [MS/DN]
 Sam Ruby
 Wesner Moise
VC/Business Blogs
 Ed Sim
 Fred Wilson
 Jonathan Schwartz [J/Sun]
 Lawrence Lessig [Stanford]
 Mark Cuban
 Michael Hyatt
 Pierre Omidyar
 Ross Mayfield
 VentureBlog
 Weekly Read
Wine, Food & Tea
 The Silk Road of Wine
 Vinography: a wine blog
 Wine Whys

Disclaimer:
The content of this site are my own personal opinions and do not represent my employer's view in anyway.

© 2008, Joe Duffy

 
 Sunday, March 04, 2007

In 2.0 SP1, we changed the threadpool's default maximum worker thread count from 25/CPU to 250/CPU.

The reason wasn't to encourage architectures that saturate so many threads with CPU-bound work, nor was it to suggest that having 249 threads/CPU blocked and 1/CPU actively running is actually a good design.  A non-blocking, continuation-passing/event-driven architecture is generally a better approach for the latter case.  Rather, we did this to statistically reduce the frequency of accidental and nondeterministic deadlocks.

Believe it or not, deadlocking the threadpool was the most frequently reported threading-related customer problem/complaint during my tenure as the CLR's concurrency PM.  There are KB articles and a wealth of customer and Microsoft employee blog posts about this issue. 

Many algorithms demand dependencies between parallel tasks.  And sometimes—as is frequently the case in data parallel algorithms—the number of tasks can be variable up to a factor of the input size.  A "task" in this context is just the closure passed to QueueUserWorkItem.  Consider a dumb parallel merge sort, which uses divide-and-conquer style parallelism, for example:

void Sort(int[] numbers, int from, int to) {
    if (from < to) { ...
        ThreadPool.QueueUserWorkItem(delegate { Sort(numbers, from, (from+to+1)/2); }); // T1
        Sort(numbers, (from+to+1)/2, to); // T2
        ... Wait for T1 to finish ...
        ... Merge left and right ...
    }
}

In this case, T1 is run in parallel and sorts the left half; T2 runs on the calling thread and sorts the right.  After T2 runs sequentially, it must wait for T1 to complete before moving on to the merge step.  As written, this algorithm is clearly inefficient and deadlocks easily.  Pass it an array of size 33 on a 2-CPU machine, and the threadpool's default maximums will ensure that some T1's can't even get scheduled, leaving threadpool threads blocked waiting for their queued (and stuck) counterparts to finish.  Depending on how tasks are scheduled at runtime this could deadlock.

Clearly the programmer needs to "stop" dividing the problem at some reasonable point, i.e. limit the maximum number of tasks generated; otherwise the task count will grow with some factor of the input size, causing deadlocks for large inputs (not to mention huge context switch and resource consumption overheads).  When might that point be?  Perhaps the programmer calculates some degree-of-parallelism (DOP) at the top of the recursive call stack, say log2(#ofCPUs).  DOP is passed to the first call to Sort and each subsequent recursive call decrements the DOP by 1.  So long as the DOP argument is >0, T1 is run in parallel; otherwise, T1 is run sequentially on the same thread, just like T2.  This ensures we don't spawn more tasks than there are CPUs.

And this will probably work.  Most of the time.

What if, just by chance, the stars aligned and 25 instances of this algorithm ran simultaneously?  Seems farfetched?  Maybe so.  Consider this: using log2(#ofCPUs) might not be enough in the case that some comparison routines block during sorting, possibly suggesting log2(2(#ofCPUs)) as a better DOP instead.  And then all we need is 12.5 occurrences.  Still a little farfetched, but not quite as much.  But wait: there could be other algorithms using the thread pool simultaneously, particularly on a server.  (Yes, data parallelism on a server is probably suspect in the first place, but for highly parallel servers with volatile loads, it could be useful.)  And remember, the thread pool is shared across all AppDomains in the process, so if you've written a reusable component, you're relying on all other software in the process to behave properly too (which you may have absolutely no control over).

Most of these admittedly represent imperfections in the overall design and architecture of the application, but the sad fact is that they tend to be somewhat common.  Especially when components are dynamically composed in the process, as is common with server and AddIn-style apps.  And they are very nondeterministic and hard to test for.  Our platform doesn't offer a mechanism today that allows developers to write code that is intelligently-parallel, particularly when many heterogeneous components are trying to use concurrency for performance.  Even with the suggested improvements and the CLR threadpool's old 25/CPU thread limit, the Sort routine could deadlock once in a while, maybe under extreme stress and very hard to reproduce conditions.  This will occur less frequently, statistically speaking, with the 250/CPU limit.  The problem is that all of this is just a heuristic, there aren't any hard numbers and coordination involved.

It’s also worth noting that the threadpool throttles its creation of threads to 2/second once the count has exceeded the #ofCPUs.  That means if a programmer sees this situation happening with regularity, they will also observe hard to diagnose performance degradations.  Once in a while, that sort algorithm might take 10-times longer to run, inexplicably.  If this happens a lot, the developer will notice, profile, and fix the issue.  While this isn’t great, this problem is typically quite rare in any one (properly written) program, and doesn’t happen with regularity.  Our first priority was to prevent the periodic and sporadic hangs, the things eating away at the reliability and uptime of programs, to trade them off for possible periodic performance blips.  Many of those horribly misarchitected programs will still deadlock deterministically with the new limits, and the thread injection throttling will help to discourage them from being written this way.  (It would take 125 seconds to create 250 threads on a 1-CPU machine, and it seems unlikely that a 2 minute-plus delay would be tolerated.  Some people use SetMinThreads to get around this, which is (usually) inexcusable.)

With all of this said, we clearly have a lot of work to do in the future to encourage better parallel program architectures and to provide better tools for diagnostics in this area.  I agree with the basic tenet "use as few threads as possible, but no fewer," but sometimes we have to sacrifice idealism to solve practical real-world problems.  In my experience, this should solve many such problems.

3/4/2007 9:08:41 PM (Pacific Standard Time, UTC-08:00)
Very interesting post as always Joe! How's the book coming along? I'm dying to read it :-) I was wondering if there was any kind of update you could give us on where PLINQ is going?
3/5/2007 4:59:45 AM (Pacific Standard Time, UTC-08:00)
Hi Joe,

I just ran the following code on my dual core vista box with VS 2005 SP1 and the result was 50 worker threads

int wt = 0;
int iot = 0;
ThreadPool.GetMaxThreads(out wt, out iot);
Console.WriteLine("{0} {1}", wt, iot);

Is VS SP1 not the vehicle for CLR 2.0 SP1 or is there a config setting I'd need to use to see the new limit?
3/5/2007 9:17:42 PM (Pacific Standard Time, UTC-08:00)
Tom: the book's coming...slowly...but I'm still actively working on it. I'll give a more detailed book and PLINQ update in the next couple weeks.

Richard: VS SP1 is not the vehicle for CLR 2.0 SP1, believe it or not. (That would be too intuitive. ;) ) I'm not in tune with when SP1 will be out the door, and in what form, and under what title, sorry.

--joe
4/8/2007 8:07:40 PM (Pacific Daylight Time, UTC-07:00)
Joe, when you say 25/CPU and 250/CPU, does it mean 25 per process / CPU and now increased to 250 per process / CPU since the thread pool theards are per process, right?
4/11/2007 9:06:34 PM (Pacific Daylight Time, UTC-07:00)
Atul: correct. Each process maintains its own pool of threads: now 250/CPU.
4/16/2007 7:02:31 AM (Pacific Daylight Time, UTC-07:00)
Hi Joe,

Thanks very much for the article.
Could you please tell us when will be released .NET 2.0 SP1 ?
Correct me if I'm wrong but it was due to be released at Q1 2007 ...
12/7/2007 4:27:43 AM (Pacific Standard Time, UTC-08:00)
It all comes to responsability and organization... if you know how to combine these two then you are all set.
1/3/2008 8:19:05 PM (Pacific Standard Time, UTC-08:00)
Hi Joe,
Thank you for such a nice article. Increasing the thread pool limit makes perfect sense for applications which allows plug-in but I am a bit skeptical about this change from the Server applications point of view.
Is 250 threads/cpu an option or is it imposed implicitly?
I am concerned about production servers which usually host numerous ASP.NET web applications. If the upper limit is forced to be 250, an attacker can generate too many requests for one particular application which will increase active threads in ThreadPool; as each thread takes 1 mb of the virtual memory, other applications on the server may also be effected and increase the chances of Out Of Memory exceptions.
I believe such a big pool size if even half consumed i.e 150 active threads, will impact performance of other applications because of excessive context switching.
I would really appreciate it if you could comment on my worries: P
Thank you so much
Regards
Imran

p.s. I am really looking forward to your upcoming book.
Name
E-mail
Home page

Comment (HTML not allowed)  

Enter the code shown (prevents robots):

 

Recent Entries:

Search:

Browse by Date:
<November 2008>
SunMonTueWedThuFriSat
2627282930311
2345678
9101112131415
16171819202122
23242526272829
30123456

Browse by Category:

Notables:

Currently Up To:

Reading...

Listening...

Watching...