|
Personal Info:
Joe  leads the architecture of an experimental OS's developer platform, where
he is also chief architect of its programming language. His current mission is to enable
writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe
founded the Parallel Extensions to .NET project.
He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife,
writing books, writing music,
studying music theory & mathematics, and doing anything involving food & wine.
My books
My music
Disclaimer:
The content of this site are my own personal opinions and do
not represent my employer's view in anyway.
© 2012, Joe Duffy
|
|
 Thursday, March 29, 2007
One of the motivations of doing a new reader/writer lock in Orcas (ReaderWriterLockSlim) was to do away with one particular scalability issue that customers commonly experienced with the old V1.1 reader/writer lock type (ReaderWriterLock). The basic issue stems from exactly how the lock decides when (or in this case, when not) to wake up waking writers. Jeff Richter’s MSDN article from June of last year highlights this problem. This of course wasn't the primary motivation, but it was just another straw hanging off the camel's back.
Contrast some choice behavior exhibited by the two lock types:
- Both ReaderWriterLock and ReaderWriterLockSlim will block new readers from acquiring the lock as soon as a writer begins waiting to enter. That means that as soon as all active readers exit the lock, the writer will be awoken and allowed to enter the lock. (For all intents and purposes, by the way we treat upgrade and write locks the same.)
- When a write lock is released, the ReaderWriterLock type will wake up all waiting readers, even if there are waiting writers. Again, new readers are blocked and once this awoken batch of read locks exit the lock again, the next writer in line is awoken.
- When a write lock is released, the ReaderWriterLockSlim type will wake up a waiting writer instead of readers. Readers may only proceed once there are no longer any waiting writers.
These last two points illustrate the basic issue with the old lock. (And the new one, too, to be brutally honest.) If a large number of writers is waiting to enter the ReaderWriterLock, each will be staggered by the amount of time it takes for all intervening readers to enter the lock, do their work, and exit. This can send the wait time for writers through the roof.
To further illustrate the point, imagine we have two writers (W0 and W1) and two readers (R0 and R1), each of which enters the lock, does some work for 1 unit of time, exits, and then goes back around and tries to acquire the lock again:
Thread Arrival Enter Exit W0 0 0 1 W1 0 3 4 R0 0 1 2 R1 0 1 2 W0 1 5 6 W1 4 8 9 R0 2 4 5 R1 2 4 5 ... R0 5 6 7 R1 5 6 7
Notice that the writers have to wait for a very long time to be serviced in comparison to the readers. If more and more writers show up, this problem becomes magnified, regardless of how many readers there are or the ratio of readers to writers.
The new lock doesn’t suffer from this same problem. But it does suffer from a different one: possible starvation of readers if there is always a writer arriving or waiting at the lock. As Jeff mentions in his MSDN article, most reader/writer locks work best when the ratio of reads to writes is high. In the above example, though, the readers actually would never get to enter the lock:
Thread Arrival Enter Exit W0 0 0 1 W1 0 1 2 R0 0 ?? ?? R1 0 ?? ?? W0 1 2 3 W1 2 3 4 ... W0 3 4 5 W1 4 5 6
So which is better? If writers are less frequent in your scenario—as they usually are—then the new lock will probably fit the bill. If not, you might run into troubles with the new one.
We had originally planned to allow you to configure the contention policy. In fact, if you picked up an earlier Orcas CTP, you probably noticed the ReaderWriterLockSlim constructor that took an enumeration value specifying the contention policy: PrefersReaders, PrefersWritersAndUpgrades, and Fifo. This simply added too much complexity for the short timeframe of the Orcas release, so it recently silently disappeared.
Though it’s the hardest to implement, I do think FIFO (or some heuristic approximation thereof) is the right answer here. Block new readers once a writer begins waiting. When the last reader exits, wake up the waiting writer. When the writer exits, wake up the next n contiguous waiting readers (all readers between the exiting writer and the next writer in the wait queue, if any) or the next writer (if the writer is next in the wait queue). Or, as noted, some approximation of this logic, since it could be fairly costly to orchestrate all the bookeeping, particularly the event waiting and signaling. But a FIFO-like ordering ensures some strong correlation between arrival time and relative wait time, which, I think, is what most people expect and desire. There are of course convoy problems that can happen when strict FIFO ordering is used, so I would expect we would still allow (some) arriving requests to pass others in line.
This last suggestion is actually quite similar to what the new SRWLOCK in Vista does. ReleaseSRWLockShared and ReleaseSRWLockExclusive signal the next threads in line based on a wait queue structure, without any sort of “prefers readers over writers” (or vice versa) policy. But that’s a topic for a separate day.
 Saturday, March 24, 2007
Prior to coming to Microsoft, I had been a developer for about 6 years. Back in 2004, when contemplating moving to a different company, I decided I was ready for a shift in focus. Namely, I wanted to move away from the details of writing code and move towards higher-level architecture, design, and strategy. So I joined Microsoft as a program manager (PM).
Now, less than 3 years later, I’ve moved back to being a software development engineer (SDE, i.e. developer). Is it forever? Who knows? I didn’t move back because I disliked the PM job (as some might assume). In fact, I thoroughly enjoyed most aspects of being a PM. After a few months in my new role I actually find myself missing some of them from time to time. There wasn’t any single thing that spurred the decision: it was fairly complicated, and involved a mixture of pursuing a specific opportunity, seeking growth, realizing I was missing many things that I enjoyed doing as a developer, and, I suppose, some degree of relapsing back to something I am more comfortable with.
I've written up some details about why I decided to change. Who knows, maybe it will help somebody out there choose the right role for themselves?
The specific opportunity
The opportunity I’m referring to was, of course, PLINQ, and it came about at just the right time. I spent the year leading up to my decision gradually focusing more and more on PLINQ. At first, PLINQ was just a crazy idea. Then I did some prototyping in my free time, had some quick successes, and was amazed at how simple it was to write parallel programs with LINQ queries. I showed some people, and most really latched on to the idea; they could understand the value of the system with just a 30 second elevator pitch. These positive reactions motivated me to stick with the idea, and to keep driving hard. Then there was interest from the C# team, interest from executives, a BillG ThinkWeek paper, and finally, somebody wanted to do it for real.
In some sense, it was like a start-up. I had an idea, figured it was an idea worth pursuing, did some prototyping, sold it among my peers, and finally convinced a VC team to fund the execution of the idea. Well, if it truly were a start-up, what would I then want to do? Code all night long, obviously! I wanted to architect and design the actual product version, implement the darned thing, and feel the whole process from start to finish from several perspectives. Doing PLINQ as a technical lead/developer seemed like a natural next step.
I should also note that I had considered taking developer positions before the PLINQ job. But in all of those instances, I chose not to for two reasons: I was having a blast as the concurrency PM on the CLR team and because the right opportunity hadn't arisen to steal me away.
This was also a great growth opportunity for me. Though I was a developer in the past, it was seldom my own idea that I was implementing, and this time it’s different. I had set the ship sailing in a direction that I was happy with, and experiencing the consequences of that direction is something I seriously wanted. It’s less about control than it is about feeling more ownership and responsibility for the direction in which our team is headed. It’s also a way of putting my butt on the line, hopefully winning more trust and respect in the process, showing that I can actually follow through on good ideas (rather than just having the idea itself). Most PMs certainly get a lot more of the influence and coarse-grained ownership opportunity than most SDEs, but my particular situation was such that my high-level contributions, leadership skills, and domain-experience will be more valuable and influential where I am now. This last part turned out to be less of a SDE vs. PM thing (i.e. I could have probably had those things as a PLINQ PM too), but it played a huge role in my decision to change teams.
I missed writing code
During the many long nights of prototyping PLINQ, it also kept reminding me how much I loved coding and how much I missed it as a PM. Put simply: I love programming. Anybody who’s done something they love and gotten lost in it knows what “flow” is. I used to experience flow when I played music, but I haven’t done that for so long that programming is now the only time I really feel it. Maybe it has to do with the fact that I started coding when my brain was still in development (13 years old), and so I tend to be comfortable thinking in code. You know: no conscious thought, just a direct conduit from your brain to your fingertips. There is a sense of time standing still, the code simply appearing on the screen in a magical blur, and at some point the thing actually works. You can get up, stretch, make some tea, map out the next few hours for a moment, and then you’re back into it. Lost in the fun of it all. It’s just great, and nothing I ever did as a PM gave me this feeling.
There’s an art to writing code. Sure, there’s a non-negligible scientific and mathematical component to programming, but I like to think of development as the artful application of those formal techniques. Inventing, structuring, and reifying abstractions, naming them, the careful placement of comments and whitespace, traversing the state space in your head, using intuition to make certain trade-offs, and so on.
Perhaps the most creative and artistic activity as a PM that I really enjoyed was writing. But to be honest, I’ve always enjoyed the kind of writing I do on my own time—writing blog essays and books—more than the more prevalent style of writing that a PM tends to do (say, design specifications). I would get the chance to write a strategic or influential technical document from time to time, but the rate at which I will contribute such things as a SDE seems like it will be the same as when I was a PM. It has been so far, at least. If you have the insight and know how, when, and with whom to share and present it, people will listen, regardless of your title.
Designing new things was also of course a creative activity I loved as a PM, but I’m doing more design work as a developer than when I was a PM. Not all SDEs design higher-level things, but I do regularly and work closely with many architects, distinguished engineers, and technical fellows in the process. Developers also get to design more at a slightly lower level of abstraction. But everything's just an abstraction anyway, so the "level" doesn't matter much (to me) anyway.
Prior to PLINQ, my Sencha interpreter/compiler (Scheme at first and later, Common LISP) was my hobby development project. (And, of course, there were many late nights spent on TopCoder.) I was basically spending my days as a PM and my nights as a developer. To be honest, I enjoy doing both things and, aside from the time it took to do both, it made me very happy. But in retrospect, it turned out to be rather unhealthy. Writing code was my sole hobby, and I evicted all of my pre-Microsoft hobbies from my system to make room for it. Now I get to write code during the day, do the components of PM that I enjoyed most, and still preserve my sanity. I’m just getting back into the things I used to love doing: writing more than just technical blog essays and books, playing guitar, sports, working out, food and wine, and basically just getting outside and feeling healthier, both mentally and physically. Maybe I'm getting old.
Fractured role structure
There are plenty of things I miss as a PM. But I associate the absence of many such things with where we are with the PLINQ project. High-level planning, becoming familiar with the competition and pertinent research in the area, thinking about product strategy, etc. are all certainly crucial activities throughout the project, but they are much more prevalent and involved at the outset. As the project shifts into execution mode, there’s a lot of thinking about release plans, customer interaction, and team building and management. I enjoy those things too, but I also love the design, architecture, and implementation of software. I still get to do them, just not as frequently. As a developer, you tend not to be involved in as many management discussions and decisions, but I can cope with this knowing I have a solid PM and management team working with me. I know that if there's anything that requires my attention or where my feedback would be valuable, they won't hesitate to bring it to me.
To be completely honest, I dislike the completely fractured role structure that most of Microsoft employs. (And “dislike” is putting it mildly.) I do believe that people need to specialize, particularly as they grow in their career. You simply can’t do it all and expect to move up to the next level of your career without abstracting some details. But when you get down to it, there really isn’t much of a difference between PM, SDE, and SDE/T. I've been doing interviews for PM and SDE lately, and we expect one to have a better business- and customer-sense, while the other needs to be solid algorithmically and implementation-wise, but there's a substantial overlap. There are obviously certain responsibilities which are unique to each role, but more often than not, the roles are fluid and (when things are going well) people are able to fill in whatever gaps happen to arise that they are good at filling. I just don’t think this kind of specialization necessarily needs to be forced. Maybe I'm influenced by my own style of sitting on the edge between the two roles. Each project needs its own unique structure, and some dedicated managers, but it doesn’t happen according to some magical formula. As I understand things, this is how it works at Google, and just about every research department I’ve ever known (including MSR).
I should also mention that the grass isn't all a vibrant shade of green on the other side of the fence. A PM gets to operate at a higher-level, seldom spending more than 10 minutes triaging and talking to a developer about any particular bug, for example. In contrast, a developer must focus on details most of the time. This might mean spending anywhere from 1 minute to 1 week on just a single bug. Some people enjoy getting lost in details (in fact, it can help lead to flow, though bug fixing is more investigative work than anything else, which isn't as conducive). Checking in a bug makes it really easy to quantify “success” and some people need this. (Not me.) I prefer more ambiguous tasks which usually means design or coding an entirely new algorithm, and I find fixing bugs tedious. This means the late-in-the-game product cycle will be a little rough for me. But that’s OK: the details are of course part of the engineering process, so I do it and don’t ever find myself thinking “wow, I dislike this” while I'm doing it. A lot of developers tend to be tools geeks, too. They love building complex ecosystems of engineering support tools. I’m not into that. I value solid engineering practices, but prefer things to be as simple as possible such that nothing gets in between me, the code, and the compiler. Reality isn’t quite always like this, and tools are often the most effective means to ensure certain quality properties of your product, but I still strive to eliminate (or avoid adding) as much unnecessary obstructions as possible. A wise man once said "keep it as simple as possible, but no simpler."
Some final words
At the end of the day, when I look at influential people I've bonded with across the company, most (but certainly not all) rose up through the ranks as developers. That’s clearly very subjective and personal: that’s not to say there aren’t great PMs (past or present), only that most architects, distinguished engineers, technical fellows, and researchers have or are coders at heart. Moreover, I simply see myself being more successful as a SDE. Not only have I done it for longer, but the career skills required to move from developer to architect to distinguished engineer (in Microsoft) align more closely with the skills I already have. While PMs can move to architecture with time, it simply looks like a much longer road from where I stand.
And after saying all of this, I truly believe job title doesn’t matter that much at all. Ability and good intuition matter. We’re all just shipping software. At the end of the day, we do what needs to get done for that to happen, regardless of what hat we happen to be wearing at the time.
 Friday, March 09, 2007
The CLR commits the entire reserved stack for managed threads. This by default is 1MB per thread, though you can change the values with compiler settings, a PE file editor, or by changing the way you create threads. We've been having a fascinating internal discussion on the topic recently, and I've been surprised how many people were unaware that the CLR engages in this practice. I figure there's bound to be plenty of customers in the real world that are also unaware.
Let's see some pages
This behavior can be seen quite easily by breaking into a debugger (like WinDbg) and inspecting the status of the virtual memory pages comprising a thread's stack. For example, from WinDbg the !teb command will show you the highest stack address (StackBase) and !vadump will show you the status of all pages in the process's address space. From this you can see that the relevant stack pages are in the state MEM_COMMIT rather than MEM_RESERVE.
Here's a quick example taken from a sample managed program. I've broken right inside the Main function and will dump the TEB:
0:000> !teb TEB at 000007fffffde000 ExceptionList: 0000000000000000 StackBase: 0000000000190000 StackLimit: 0000000000189000 ...
Based on this information, combined with the fact that we know managed thread stacks are 1MB by default, we can determine what memory addresses to look for: we subtract 100000 (1MB) from 190000 (StackBase) to arrive at the base address of the stack pages: 90000. Now we dump the virtual memory pages:
0:000> !vadump
... (1) BaseAddress: 0000000000090000 RegionSize: 0000000000001000 State: 00002000 MEM_RESERVE Type: 00020000 MEM_PRIVATE
(2) BaseAddress: 0000000000091000 RegionSize: 00000000000f0000 State: 00001000 MEM_COMMIT Protect: 00000004 PAGE_READWRITE Type: 00020000 MEM_PRIVATE
(3) BaseAddress: 0000000000181000 RegionSize: 0000000000001000 State: 00002000 MEM_RESERVE Type: 00020000 MEM_PRIVATE
(4) BaseAddress: 0000000000182000 RegionSize: 0000000000007000 State: 00001000 MEM_COMMIT Protect: 00000104 PAGE_READWRITE + PAGE_GUARD Type: 00020000 MEM_PRIVATE
(5) BaseAddress: 0000000000189000 RegionSize: 0000000000007000 State: 00001000 MEM_COMMIT Protect: 00000004 PAGE_READWRITE Type: 00020000 MEM_PRIVATE ...
That summarizes the stack memory. But what does it all mean? I've labeled the individual regions above with numbers so I can reference them. And, remember folks, the stack grows downward in the address space, so we'll discuss them in reverse order:
5. The actively used portion of the stack. Notice that the BaseAddress equals the thread's current StackLimit, and that its BaseAddress+RegionSize equals StackBase. The thread is actively using stack memory only within this region.
4. The guard portion of the stack. When an attempt is made to write to an address within this range (i.e. as the thread's stack grows by virtue of the program calling functions, stackalloc'ing, etc.), the memory's guard status is cleared and a fault is triggered. The OS traps this fault, and responds by committing additional guard region and then resuming at the faulting instruction. What used to be the guard page has now become part of 5, and the program can continue on its merry old way. (Assuming there is room to commit another guard region; if not, stack overflow ensues.) A couple things are worth noting. Because the CLR commits the entire stack, the OS doesn't really have to "commit" the memory: it just annotates the next region as the new guard. Also notice that the guard in this program is 28KB in size. Normally the guard is just a single page, but the CLR uses SetThreadStackGuarantee to increase the amount of committed stack we are guaranteed to have at any point in time, at least on OSes that support it. This makes responding to stack overflow easier.
3. This is often referred to as the "hard guard page". If you try to write to this, the OS rips down your process. In the wink of an eye, it’s gone, no callbacks, no nice Dr. Watson dumps, it just disappears. As guard pages are committed, this page is moved so that it's just beyond the guard region. I don't know precisely how this happens w/out having to commit more memory (since it's marked MEM_RESERVE), but I suspect the OS just magically rearranges some page table information.
2. This is the rest of the stack. It hasn't been used yet, and it's not part of the guard region. This is where you'll see a difference between a managed thread's stack and a native thread's stack: the pages are marked MEM_COMMIT for managed code, whereas they'd be MEM_RESERVE for native.
1. This is the final destination of the hard guard page, after the whole stack has been committed and the guard rests just before it at the end of the stack, this page will always remain. It is treated as a separate MEM_RESERVED page and will never be committed.
One additional thing is worth noting. When the CLR pre-commits the whole stack, it uses VirtualAlloc to do so. This leaves the guard page close to the bottom of the stack, the hard guard page just behind it, and the StackLimit in the TEB, set to the address where the guard page ends. This surprises some people, i.e. they expect to see a StackLimit set to, say, 91000 in the above example. But remember, the OS doesn’t get involved at all in our pre-committing of the stack.
Method to the madness
Why in the heck would we do all of this?
Believe it or not, there is a method to the madness. When the OS tries to commit a new guard region, it can fail. It won’t fail due to insufficient virtual memory space (not that such things would matter much on X64 anyhow), because the memory is already reserved. We can handle those cases just fine. Rather, it might fail if there is insufficient pagefile backing to commit the memory. This would manifest as a stack overflow. Sadly, at the point of this stack overflow, the CLR’s vectored exception handler (which responds to ERROR_STACK_OVERFLOW) would then have only the guard region’s worth of stack space in which to do anything reasonably intelligent. (Which, recall, was traditionally one page.) The unhosted CLR just has to issue a failfast in this case, but it also wants to do things like create a Windows Event Log entry, play nicely with Dr. Watson and debuggers, and so on.
This is also required for hosts like SQL Server who try to continue running in the face of stack overflow. In these cases, the CLR has to call out to the host to see what it would like to do. Maybe the host can run in just a page’s worth of stack, and maybe it can’t. The CLR doesn’t try to recover in unhosted situations because it is extremely difficult, and there are even problems with some of Win32 itself not being able to tolerate the presence of stack overflow (most notably CRITICAL_SECTIONs). But the SQL Server engine is a very carefully engineered piece of software and they have a lot of experience (and success, apparently) remaining running in these cases.
If we commit the entire stack, there is no fear of running out of physical space during stack growth, because the whole thing has already been backed.
But of course this is the major downside to this design as well. The pressure this puts on the pagefile is not negligible. If you have 1,000 threads in a process, you need 1GB of pagefile space to back all of their stacks. Sure, that’s a lot of threads, but heck, that’s a lot of disk space too! A stack page won’t require physical memory until it’s actually faulted in (i.e. read from or written to), but the pagefile expense is a high price to pay for what amounts to be an obscure (and dubious) condition.
I say “dubious” because you have to wonder: is it even worth it anyway? Probably not. On modern Windows OSes that support SetThreadStackGuarantee, there’s little reason to commit anything but the guaranteed guard region. The CLR uses this API, which means we can size the guaranteed region large enough to so we can always run our stack overflow logic within it. Committing any more than this really is just a waste. Even on OSes without this API, however, we’re going to failfast the process in this situation anyhow. Sure, if we didn’t commit the whole thing up front, then these “out of pagefile space” situations might result in an inability to log an Event Log entry, notify a debugger, and so on, but will we really be able to do that anyway given the extreme resource pressure the machine has to be under to create this situation? Probably not.
In the end, it matters little what I think about the design. This is how it is, and I figured you all should know about it.
 Sunday, March 04, 2007
In 2.0 SP1, we changed the threadpool's default maximum worker thread count from 25/CPU to 250/CPU.
The reason wasn't to encourage architectures that saturate so many threads with CPU-bound work, nor was it to suggest that having 249 threads/CPU blocked and 1/CPU actively running is actually a good design. A non-blocking, continuation-passing/event-driven architecture is generally a better approach for the latter case. Rather, we did this to statistically reduce the frequency of accidental and nondeterministic deadlocks.
Believe it or not, deadlocking the threadpool was the most frequently reported threading-related customer problem/complaint during my tenure as the CLR's concurrency PM. There are KB articles and a wealth of customer and Microsoft employee blog posts about this issue.
Many algorithms demand dependencies between parallel tasks. And sometimes—as is frequently the case in data parallel algorithms—the number of tasks can be variable up to a factor of the input size. A "task" in this context is just the closure passed to QueueUserWorkItem. Consider a dumb parallel merge sort, which uses divide-and-conquer style parallelism, for example:
void Sort(int[] numbers, int from, int to) { if (from < to) { ... ThreadPool.QueueUserWorkItem(delegate { Sort(numbers, from, (from+to+1)/2); }); // T1 Sort(numbers, (from+to+1)/2, to); // T2 ... Wait for T1 to finish ... ... Merge left and right ... } }
In this case, T1 is run in parallel and sorts the left half; T2 runs on the calling thread and sorts the right. After T2 runs sequentially, it must wait for T1 to complete before moving on to the merge step. As written, this algorithm is clearly inefficient and deadlocks easily. Pass it an array of size 33 on a 2-CPU machine, and the threadpool's default maximums will ensure that some T1's can't even get scheduled, leaving threadpool threads blocked waiting for their queued (and stuck) counterparts to finish. Depending on how tasks are scheduled at runtime this could deadlock.
Clearly the programmer needs to "stop" dividing the problem at some reasonable point, i.e. limit the maximum number of tasks generated; otherwise the task count will grow with some factor of the input size, causing deadlocks for large inputs (not to mention huge context switch and resource consumption overheads). When might that point be? Perhaps the programmer calculates some degree-of-parallelism (DOP) at the top of the recursive call stack, say log2(#ofCPUs). DOP is passed to the first call to Sort and each subsequent recursive call decrements the DOP by 1. So long as the DOP argument is >0, T1 is run in parallel; otherwise, T1 is run sequentially on the same thread, just like T2. This ensures we don't spawn more tasks than there are CPUs.
And this will probably work. Most of the time.
What if, just by chance, the stars aligned and 25 instances of this algorithm ran simultaneously? Seems farfetched? Maybe so. Consider this: using log2(#ofCPUs) might not be enough in the case that some comparison routines block during sorting, possibly suggesting log2(2(#ofCPUs)) as a better DOP instead. And then all we need is 12.5 occurrences. Still a little farfetched, but not quite as much. But wait: there could be other algorithms using the thread pool simultaneously, particularly on a server. (Yes, data parallelism on a server is probably suspect in the first place, but for highly parallel servers with volatile loads, it could be useful.) And remember, the thread pool is shared across all AppDomains in the process, so if you've written a reusable component, you're relying on all other software in the process to behave properly too (which you may have absolutely no control over).
Most of these admittedly represent imperfections in the overall design and architecture of the application, but the sad fact is that they tend to be somewhat common. Especially when components are dynamically composed in the process, as is common with server and AddIn-style apps. And they are very nondeterministic and hard to test for. Our platform doesn't offer a mechanism today that allows developers to write code that is intelligently-parallel, particularly when many heterogeneous components are trying to use concurrency for performance. Even with the suggested improvements and the CLR threadpool's old 25/CPU thread limit, the Sort routine could deadlock once in a while, maybe under extreme stress and very hard to reproduce conditions. This will occur less frequently, statistically speaking, with the 250/CPU limit. The problem is that all of this is just a heuristic, there aren't any hard numbers and coordination involved.
It’s also worth noting that the threadpool throttles its creation of threads to 2/second once the count has exceeded the #ofCPUs. That means if a programmer sees this situation happening with regularity, they will also observe hard to diagnose performance degradations. Once in a while, that sort algorithm might take 10-times longer to run, inexplicably. If this happens a lot, the developer will notice, profile, and fix the issue. While this isn’t great, this problem is typically quite rare in any one (properly written) program, and doesn’t happen with regularity. Our first priority was to prevent the periodic and sporadic hangs, the things eating away at the reliability and uptime of programs, to trade them off for possible periodic performance blips. Many of those horribly misarchitected programs will still deadlock deterministically with the new limits, and the thread injection throttling will help to discourage them from being written this way. (It would take 125 seconds to create 250 threads on a 1-CPU machine, and it seems unlikely that a 2 minute-plus delay would be tolerated. Some people use SetMinThreads to get around this, which is (usually) inexcusable.)
With all of this said, we clearly have a lot of work to do in the future to encourage better parallel program architectures and to provide better tools for diagnostics in this area. I agree with the basic tenet "use as few threads as possible, but no fewer," but sometimes we have to sacrifice idealism to solve practical real-world problems. In my experience, this should solve many such problems.
 Monday, February 19, 2007
A reader asked for clarification on a past article of mine, regarding my claim that one particular variant of the double checked locking pattern won't work on the .NET 2.0 memory model. The confusion was caused because my advice seems to contradict Vance's MSDN article on the topic.
The problem is with variants of double checked locking that use a flag to indicate that a variable has or has not been initialized, versus using the presence of null to indicate this. This can come in handy if null is a valid initialized value, when the value is a value type, and/or if multiple variables are involved in the initialization.
After following up with a few Microsoft and Intel folks about this, I still believe this to be an issue. Here is what I claim:
- Because standard Intel processors (X86/IA32, EM64T) use non-binding speculative reads, the problem will not happen due to speculation. And because processor consistency memory models don’t permit loads to freely reorder, this won’t happen because of cache hits.
- However, on IA64, non-volatile loads can be freely reordered, and therefore a cache hit can cause the load of the value to pass the load of the flag. I have not been given a clear answer yet on the nature of IA64’s speculation model, but I suspect IA64 is non-binding too, and therefore this cannot occur as a sole result of branch prediction (though that is pretty much immaterial because of cache reordering).
- In talking with some compiler folks here, they also agree that legal compiler transformations (according to .NET 2.0’s memory model) can break the code.
- With that said, no Microsoft compiler we know of will actually make the transformation.
- With some simple (though unlikely) modifications, existing compilers could find it more attractive to apply CSE/PRE, causing the read to move and break the code pattern.
The take-away is not necessarily the specific details, though perhaps those are interesting too. Rather, the primary take-away is that you really ought to use the volatile modifier whenever you aren’t 100% certain that the default memory model will prevent these kinds of reorderings. (And even then, volatile is still a good idea, to declare your intent to other programmers looking at the code.)
As I mentioned in the original article, the use of volatile is enough to ensure this particular example works correctly.
 Monday, February 12, 2007
Somebody recently asked in a blog comment whether the new ReaderWriterLockSlim uses a full barrier (a.k.a. two-way fence, CMPXCHG, etc.) on lock exit. It does, and I claimed that "it has to". It turns out that my statement was actually too strong. Doing so prevents a certain class of potentially surprising results, so it's a matter of preference to the lock designer whether these results are so surprising as to incur the cost of a full barrier. Vance Morrison's "low lock" article, for instance, shows a spin lock that doesn't make this guarantee. And, FWIW, this is also left unspecified in the CLR 2.0 memory model. Java's memory model permits non-barrier lock releases, though I will also note the JMM is substantially weaker in areas when compared to the CLR's.
Here's an example of a possibly surprising result that non-barrier releases can cause:
Initially x == y == 0.
Thread 0 Thread 1 =============== =============== lock(A); lock(B); x = 1; y = 1; unlock(A); unlock(B); t0 = y; t1 = x;
Is it possible after executing both that: t0 == t1 == 0?
It is simple to reason that this is impossible with sequential consistency. In SC the only way that t0 == 0 is if Thread 0's t0 = y statement (and therefore x = 1, assuming a memory model in which writes happen in order) were to occur before Thread 1's y = 1 (and therefore t1 = x). In this case, t1 = x must subsequently see t1 == x == 1, otherwise the history contradicts SC. The only other possibility is that Thread 1's t1 = x happens before Thread 0's x = 1 and therefore also t0 = y, in which case it must be the case that the subsequent t0 = y by Thread 0 yields t0 == y == 1. In both cases, either x or y must be seen as 1. (Interleavings are clearly possible that result in x == y == 1.)
The CLR 2.0's memory model guarantees that, if the unlock incurs a barrier, the same SC reasoning applies. Unfortunately, if the unlock is not a barrer, then in both cases the load of x or y may occur before the write buffer has been flushed, meaning the write to x or y and the unlocking write itself, possibly leading to t0 == t1 == 0. This happens even on relatively strong processor consistency memory models such as X86, and on weaker ones such as IA-64 (even when all loads are acquire and all stores are release, which only happens for volatile CLR fields). To ensure the write buffer has been flushed before the read happens, the unlock statement must flush the buffer (or an explicit barrier is needed), accomplished with CMPXCHG on X86 ISAs.
Many would argue that, because the locks taken are different between the two threads, SC does not apply and therefore implementing unlock as a non-barrier write is legal. JMM takes this stance. This actually seems like a fine argument to me, and after thinking about it for a bit, it's probably what I would choose if I were defining a memory model. But the CLR 2.0 MM is generally stronger than most, so people might actually depend on this and expect it to work. This could cause Monitor-based code to break when moved to alternative lock implementations that don't issue full barriers at release time. This is just one example of why it'd be really great to have a canonical specification for the CLR's MM. At least we'd then have a leg to stand on when faced with tricky compatability trade-offs some day down the line.
 Wednesday, February 07, 2007
In Orcas, we offer a new reader/writer lock: System.Threading.ReaderWriterLockSlim.
Motivation for a new lock
The primary reason for creating this type was that we wanted to provide an official reader/writer lock for the .NET Framework that people could actually rely on for performance-critical code. It was no secret that the current ReaderWriterLock type was such a pig, costing somewhere around 6X the cost of a monitor acquisition for uncontended write lock acquires, that most people avoided it entirely. Jeff Richter wrote an entire MSDN article about this, and Vance Morrison showed how to build your own on his weblog. It was really too bad customers couldn't depend on the class in the Framework, and to be honest most devs really shouldn't be in the business of writing and maintaining their own reader/writer lock.
Second, we had a large number of qualms with the existing lock’s design. It had funny recursion semantics (and is in fact broken in a few thread reentrancy cases we know about), and has a dangerous non-atomic upgrade method. Did you know that you actually need to check the WriterSeqNum before and after a call to our ReaderWriterLock’s UpgradeToWriteLock method to ensure it didn’t change during the upgrade? You do. The code actually releases the reader lock before upgrading to the write lock, which allows other threads to sneak in, acquire the lock in between, and possibly change state that was read during the decision to upgrade. The reason? If we upgraded and then released the read lock, two threads trying to simultaneously upgrade would deadlock one another. All of these problems represent very fundamental flaws in the existing type’s design.
So we decided to build a new one that solves all of these problems. To be honest, I would have liked to fix the current one, but the existing API and compatibility responsibilities make that just about impossible. We considered obsoleting the existing one, but as I note at the end of this article, there are still reasons to prefer the old lock.
Three modes: read, write, and upgradeable-read
The new ReaderWriterLockSlim supports three lock modes: Read, Write, and UpgradeableRead, and has the methods EnterReadLock, EnterWriteLock, EnterUpgradeableReadLock, and corresponding TryEnterXXLock and ExitXXLock methods, that do what you’d expect. Read and Write are easy and should be familiar: Read is a typical shared lock mode, where any number of threads can acquire the lock in the Read mode simultaneously, and Write is a mutual exclusion mode, where no other threads are permitted to simultaneously hold the lock in any mode. The UpgradeableReadLock will probably be new to most people, though it’s a concept that’s well known to database folks, and is the magic that allows us to fix the upgrade problem I mentioned earlier. We’ll look at it more closely in a moment.
The performance of the new lock is roughly equal to that of Monitor. When I say “roughly”, I mean that it’s within a factor of 2X in just about all cases. And the new lock favors letting threads acquire the lock in Write mode over Read or UpgradeableRead, since writers tend to be less frequent than readers, generally leading to better scalability. We’d originally considered providing a set of contention management options to choose from, but decided in the end to ship a simpler design that works well for most cases.
Upgrading
Let’s look at upgrades more closely now. The UpgradeableRead mode allows you to safely upgrade from Read to Write mode. Remember I mentioned earlier that our old lock breaks atomicity in order to provide deadlock-free upgrade capabilities (which is bad, particularly since most people don't realize it). The new lock neither breaks atomicity nor causes deadlocks. We acheive this by allowing only one thread to be in the UpgradeableRead mode at once, though there may be any number of other threads in Read mode while there’s one UpgradeableRead owner.
Once the lock is held in the UpgradeableRead mode, a thread can then read state to determine whether to downgrade to Read or upgrade to Write. Note that this decision should ideally be made as fast as possible: holding the UpgradeableRead lock forces any new Read acquisitions to wait, though existing Read holders are still permitted to remain active. (Sadly the CLR team seems to have removed two methods, DowngradeToRead and UpgradeToWrite, that I had originally designed for this purpose. I admit what follows isn’t the most obvious way to do it.) To downgrade, you simply call EnterReadLock followed by ExitUpgradeableReadLock: this permits other Read and UpgradeableRead acquisitions to finish that were previously held up by the fact that there was an UpgradeableRead lock held. To upgrade, you similarly call EnterWriteLock: this may actually have to wait until there are no longer any threads that still hold the lock in Read mode. There’s no real reason to also exit the UpgradeableReadLock at this point unlike the downgrade case, though in some cases it makes your code more uniform. E.g.:
ReaderWriterLockSlim rwl = …; … bool upgraded = true; rwl.EnterUpgradeableReadLock(); try { if (… read some state to decide whether to upgrade …) { rwl.EnterWriteLock(); try { … write to state … } finally { rwl.ExitWriteLock(); } } else { rwl.EnterReadLock(); rwl.ExitUpgradeableReadLock(); upgraded = false; try { … read from state … } finally { rwl.ExitReadLock(); } } } finally { if (upgraded) rwl.ExitUpgradeableReadLock(); }
Recursive acquires
Another nice feature with the new lock is how it treats recursion. By default, all recursive acquires, aside from the upgrade and downgrade cases already mentioned, is disallowed. This means you can’t call EnterReadLock twice on the same thread without first exiting the lock, for example, and similarly with the other modes. If you try, you get a LockRecursionException thrown at you. You can, however, turn recursion on at construction time: pass the enum value LockRecursionPolicy.SupportsRecursion to your lock’s constructor, and voila, recursion will be permitted. The chosen policy for a given lock is subsequently accessible from its RecursionPolicy property.
There’s one special case that is never permitted, regardless of the lock recursion policy: acquiring a Write lock when a Read lock is held. We considered enabling this, or at least giving a new enum value for it, but decided to hold off for now: if it turns out customers need it, we can always add it later. But it’s dangerous and leads to the same Read-to-Write upgrade deadlocks that the old lock was prone to, and so we didn’t want to lead developers down a path fraught with danger. If you need this kind of recursion, it’s a “simple” matter of changing your design to hoist a call to either EnterWriteLock or EnterUpgradeableReadLock (and the corresponding Exit method) to the outermost scope in which the lock is acquired.
There are corresponding properties IsReadLockHeld, IsWriteLockHeld, and IsUpgradeableReadLockHeld, to determine whether the current thread holds the lock in the specified mode. You can also query the WaitingReadCount, WaitingWriteCount, and WaitingUpgradeCount properties to see how many threads are waiting to acquire the lock in the specific mode, and CurrentReadCount to see how many concurrent readers there are. The RecursiveReadCount, RecursiveWriteCount, and RecursiveUpgradeCount properties tell you how many recursive acquires the current thread has made for the specific mode.
Some limitations: reliability
Lastly, I mentioned there are some caveats around where this lock’s use is appropriate. Well, there’s one, really: it’s not hardened to be reliable. This means a few things.
First, unlike the existing ReaderWriterLock, the ReaderWriterLockSlim type does not cooperate with CLR hosts through the hosting APIs. This means a host will not be given a chance to override various lock behaviors, including performing deadlock detection (as SQL Server does). Thus, you really ought not to use this lock if your code will be run inside SQL Server.
Next, the lock is not robust to asynchronous exceptions such as thread aborts and out of memory conditions. If one of these occurs while in the middle of one of the lock’s methods, the lock state can be corrupt, causing subsequent deadlocks, unhandled exceptions, and (sadly) due to the use of spin locks internally, a pegged 100% CPU. So if you’re going to be running your code in an environment that regularly uses thread aborts or attempts to survive hard OOMs, you’re not going to be happy with this lock. Unfortunately the lock doesn’t even mark critical regions appropriately, so hosts that do make use of thread aborts won’t know that the thread abort could possibly put the AppDomain at risk: many hosts would prefer to wait, or immediately escalate to an AppDomain unload, if an individual thread abort is necessary while the thread is in a critical region. But in the case of ReaderWriterLockSlim, a host has no idea if a thread holds the lock because the implementation doesn’t call Begin- and EndCriticalRegion. And the kind of problems I mentioned in the previous post are always an issue with ReaderWriterLockSlim, because we don't necessarily guarantee that there will be no instructions in the JIT-generated code between the acquisition and entrance to the following try block.
Summary
In summary, the new ReaderWriterLockSlim lock eliminates all of the major adoption blockers that plagued the old ReaderWriterLock. It performs much better, has deadlock-free and atomicity-preserving upgrades, and leads developers to program cleaner designs free of lock recursion. There are some downsides to the new lock, however, that may cause programmers writing hosted or low-level reliability-sensitive code to wait to adopt it. Don’t get me wrong, most people really don’t need to worry about these topics, so I apologize if my words of warning have scared you off: but those that do really need to be told about the state of affairs. Thankfully, I’m confident that many of these issues will be fixed in subsequent releases. And for most developers out there, the new ReaderWriterLockSlim is perfect for the job.
 Saturday, February 03, 2007
[Update 2/19/07: the search for Jim has been called off after just about three weeks. Thanks to everybody who helped, and best wishes to Jim's family.]
Jim Gray, a Microsoft technical fellow who is best known for his seminal work on database transactions and data management, has been missing for almost a week. He took his 40-foot sailboat, Tenacious, last Sunday to scatter the remains of his late mother near the Farallon Islands off the coast of California, and has been missing ever since. This article has additional details.
Although the Coast Guard has called off the search, friends and family are chartering private aircraft, scouring satellite imagery, and NASA even made a special run of its ER-2 aircraft to generate additional data. Werner Vogels has more information on his website, including how you can help with the search.
Jim actually provided important feedback while I was thinking and writing about PLINQ, and was one of the most supportive and excited reviewers of the project. Not to mention all of the papers he wrote or was involved with that inspired a lot of the direction. I’d be very sad to see his life cut short, particularly in this way.
 Monday, January 29, 2007
I previously mentioned the X86 JIT contains a "hack" to ensure that thread aborts can't sneak in between a Monitor.Enter(o) and the subsequent try-block. This ensures that a lock won't be leaked due to a thread abort occurring in the middle of a lock(o) { S1; } block. In the following example, that means an abort can't be triggered at S0:
Monitor.Enter(o); S0; try { S1; } finally { Monitor.Exit(o); }
If an abort could happen at S0, it'd be possible for a thread to acquire lock o, but before entering the try block, be asynchronously aborted, and then not run the finally block to release the lock on o. This would lead to an orphaned lock, and probable deadlocks later on during execution. Debugging an instance of such a deadlock would of course be rather difficult because it depends on a very subtle race condition that must occur within the tiny window of a single instruction. On a single-processor machine, this would require a precariously placed context switch, but as more and more cores are added to the machines that this software runs on, the probability simply increases.
Characterizing this as a "hack" was a little harsh. It's really just a byproduct of the way that the X86 JIT generates code.
For an asynchronous thread abort to be thrown in a thread, that thread must be either: (1) polling for the abort in the EE or (2) running inside of managed code. And even if the thread is in managed code, we may not be able to abort it, as is the case if the thread is currently executing a finally block, inside a constrained execution region, etc. The C# code generation for the lock statement ensures there are no IL instructions between the CALL to Monitor.Enter and the instruction marked as the start of the try block. The JIT correspondingly will not insert any machine instructions in between the two. And since any attempted thread aborts in Monitor.Enter are not polled for after the lock has been acquired and before returning, the soonest subsequent point at which an abort can happen is the first instruction following the call to Monitor.Enter. And at that point, the IP will already be inside the try block (the return from Monitor.Enter returns to the CALL+1), thereby ensuring that the finally block will always run if the lock was acquired.
This might seem like an implementation detail, but the reality is that we can never change it. Too many people depend on this guarantee.
It turns out that Whidbey's X64 JIT does not guarantee this behavior. (I suspect IA64 doesn't either, but don't know for sure.) In fact there's a high probability that this won't work: there is always a NOP instruction before the CALL and the instruction marking the try block in the JITted code. This is done to make it easier to identify try/catch scopes during stack unwind. This means that, yes indeed, an abort can happen at S0 on 64-bit.
This will likely be fixed for the next runtime release, but I can't say for sure.
Update 4/17/08: This was indeed fixed for the X64 JIT in Visual Studio 2008. Note that when compiling C# code targeting both X86 and X64, if you do not use the /o+ switch, this problem can still occur due to extra explicit NOPs inserted before the try.
The framework implements a method Monitor.ReliableEnter, by the way, that could be used to avoid orphaning locks in the face of thread aborts, but it's internal to mscorlib.dll. It sets an out parameter within a region of code that cannot be interrupted by a thread abort, which the caller can then check inside the finally block. The acquisition then gets moved inside so that, if the CALL is reached, the finally block is guaranteed to always run. You'd then write this instead:
bool taken; try { Monitor.ReliableEnter(o, out taken); S1; } finally { if (taken) Monitor.Exit(o); }
It's also possible the CLR team would expose this API in the future. We wanted to in Whidbey, but didn't have enough time. If 64-bit code generation was changed so that it doesn't emit a NOP before the try block, however, we probably wouldn't need ReliableEnter after all.
 Tuesday, January 23, 2007
Joel tagged me on this silly 5 things you didn't know about me meme. Since it's floating around, and because I’d hate to disappoint a crazy Australian who’s apt to shoot vinegar in my eyes with a Super Soaker (woof, woof), I figure I'll give it a shot:
- I played in a heavy metal band when I was a teenager. I was the lead guitarist, and loved to shred it up, with a sort of { 80's metal, 90's heavy metal, neo-classical, blues } style, influenced heavily by Slayer, Metallica, Pantera, GnR, Yngwie Malmsteen, Joe Satriani, and Stevie Ray Vaughan. (Odd combination? You bet!) Don’t ask for pictures, I have conveniently "lost" them. I stopped playing guitar for several years, but picked it back up in the past year and have been playing daily (focusing on building better music theory technical depth, but still shredding: my recent obsession is Dragonforce). I also created several independent industrial and experimental techno CDs. I try again every now and again, but I seem to have lost the patience.
- I stopped playing in the heavy metal band because I "accidentally" punched a guy in the face in a mosh pit at a Sepultura concert. This broke my hand in several places and was quite unpleasant. (I did stay for the whole set, I'm happy to report. The bar was kind enough to supply a plastic cup with some ice.) The emergency room wrapped it in a cast, and then after 4 weeks, it had healed incorrectly. So then I had to have it manually rebroken, and spent the entire summer in a new cast. At the end, I had no hand or arm strength, my finger was crooked (making it hard to hold a pick), and my band had broken up because we couldn't play out (and they apparently didn't have the motivation to replace me temporarily). This was the end of my dreams of being a rock star.
- When I was 18 years old, I weighed a whopping 260 pounds. Most of it was muscle, resulting from me working out literally every single day, and I have the stretch marks to show it. I was a big dude, benched a little more than my weight, and even played center for a short period in high school football (I wasn’t keen on the fact that it’s apparently protocol that the quarterback routinely lays his hands on the center’s butt). Now I weigh about 160 and have no muscle.
- I started programming when I was 12. This came about because a local gaming place (Gamer's Guild in Milford, MA) had set up a new BBS system, on which they provided several MUDs, MOOs, MUSHes, etc., on their Commodore Amiga 3000's and 4000. I became hooked, especially when I found out you could get access to the source code and actually change it. When CircleMUD came out, circa '92 (IIRC), I got serious about hacking code, doing the bulk of my MUD programming on a hand-assembled Slackware Linux 1.0 IBM PC. Ironically, yours truly (a Microsoft employee) did most of his learning on the Linux OS, using GNU's C Compiler. (I actually refused to use DOS until I started doing games programming that required graphics and found some cool DOS toolkits. I refused Windows even longer, until I got sick of using 'lynx' to browse the WWW. I have always found Windows API programming, with UPPER_CASE type names everywhere, rather ugly and I simply didn't get it at first.) I set up my own 8-line BBS and later in '94 converted to co-located Internet hosting, running a heavily modified CircleMUD (3.1?) for a year and a half. There was typically around 20 players signed on at any given point.
- I am VERY obsessive about things. I want to buy as many books as I can get my hold on, read them all the time, write code every waking moment, drink wine every night, go out for 15 course dinners every night, play guitar and learn every scale, major and minor, mode, modification, chord, arpeggio, I like to work, work, work until I am barely awake, and so on. It drives my friends completely nuts, but it keeps me busy. I've been that way ever since I was a little kid, so I don't think it's going to change anytime soon.
I don't know many people, so I'm going to commit meme heresy and not link to anybody else. Sorry.
|
|
Recent Entries:
Search:
Browse by Date:
| | Sun | Mon | Tue | Wed | Thu | Fri | Sat | | 25 | 26 | 27 | 28 | 1 | 2 | 3 | | 4 | 5 | 6 | 7 | 8 | 9 | 10 | | 11 | 12 | 13 | 14 | 15 | 16 | 17 | | 18 | 19 | 20 | 21 | 22 | 23 | 24 | | 25 | 26 | 27 | 28 | 29 | 30 | 31 | | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Browse by Category:
Notables:
|