 Tuesday, July 25, 2006
A colleague lent me a copy of W. Daniel Hillis’s PhD thesis, The Connection Machine, which is also available in book form from The MIT Press. I only began reading it last night, but I have been continuously amazed. It’s been enlightening to realize how much framing problems differently (and, in many cases, more naturally) can make programming without concurrency seem ridiculous.
To give you an idea, here's a quote from the thesis:
“When performing simple computations over large amounts of data, von Neumann computers are limited by the bandwidth between memory and processor. This is a fundamental flaw in the von Neumann design; it cannot be eliminated by clever engineering.”
Here is a quick illustration: What’s the most efficient way of copying a source array of 100,000 elements to a destination array of 100,000 elements? With a single CPU, this would typically be O(n), where n is the length of the array. If you could minimize costs due to thread creation and communication, and ensure good locality, you might be able to gain some parallel speedup by using multiple CPUs.
With The Connection Machine, however, you can do it in O(1) time. Simply instruct the 100,000 source cells, each of which holds a single array element, to communicate their value to the 100,000 destination cells, and instruct the destination cells to receive and store the value. This happens instantaneously, across the machine, not in serial fashion. If any node a can communicate with any other node b in 1 time unit, the entire array is copied in just 1 time unit, not 100,000! (Designing such an interconnect is, of course, quite difficult, but in theory it is quite nice.)
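To make the contrast concrete, here is a minimal sketch of the same copy written as a data-parallel operation in C#. It is not a Connection Machine: Parallel.For only spreads the iterations over however many cores the machine has, so the running time is roughly O(n/p) for p cores rather than O(1). The class and method names are made up for illustration.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelCopyDemo
{
    // Copies source into a new array, one independent assignment per index.
    public static int[] Copy(int[] source)
    {
        int[] destination = new int[source.Length];

        // Parallel.For may run iterations on several cores at once. Each
        // iteration touches only its own index, so no locking is needed.
        Parallel.For(0, source.Length, i => destination[i] = source[i]);

        return destination;
    }
}
```

The point is the shape of the program: the copy is expressed as "every element moves itself" rather than as a loop with an inherent order, which is the framing the thesis argues for.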
I found it particularly interesting that, back in 1985, at least this author recognized the impending demise of an entirely sequential approach to all problems. I guess the von Neumann-plus-Knuth knockout punch infected the world of programming to the contrary, and it was straight down-hill from there…
 Saturday, July 15, 2006
Stack overflow can be catastrophic for Windows programs. Some Win32 libraries and commercial components may or may not respond intelligently to it. For example, I know that, at least as late as Windows XP, a Win32 CRITICAL_SECTION that has been initialized so as to never block can actually end up stack overflowing in the process of trying to acquire the lock. Yet MSDN claims it cannot fail if the spin count is high enough. A stack overflow here can actually lead to orphaned critical sections, deadlocks, and generally unreliable software in low stack conditions. The Whidbey CLR now does a lot of work to probe for sufficient stack in sections of code that manipulate important resources. And we pre-commit the entire stack to ensure that overflows won't occur due to failure to commit individual pages in the stack. If a stack overflow ever does occur, however, it's considered a major catastrophe--since we can't reason about the state of what native code may have done in the face of it--and therefore, the default unhosted CLR will fail-fast.
In some rare cases, it is useful to query for the remaining stack space on your thread, and change behavior based on it. It could enable you to fail gracefully rather than causing a stack overflow, possibly in Win32, causing the process to terminate. A UI that needs to render some very deep XML tree, and does so using stack recursion, could limit its recursion or show an error message based on this information, for example. It could decide that it needs to spawn a new thread with a larger stack to perform the rendering. Or it may just be a handy way to log an error message during early testing so that the developers can fine tune the stack size or depend less heavily on stack allocations to get the job done.
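As a rough sketch of the "spawn a new thread with a larger stack" option, the Thread constructor accepts a maximum stack size in bytes. The recursive Depth function below is a stand-in for walking a deep tree, and the class name, method names, and sizes are illustrative assumptions rather than recommendations.

```csharp
using System;
using System.Threading;

public static class BigStackDemo
{
    // Recursion stands in for walking a deeply nested tree.
    static int Depth(int remaining)
    {
        return remaining == 0 ? 0 : 1 + Depth(remaining - 1);
    }

    // Runs the recursive work on a dedicated thread with the requested stack size.
    public static int RunWithStack(int depth, int stackBytes)
    {
        int result = -1;
        Thread worker = new Thread(() => result = Depth(depth), stackBytes);
        worker.Start();
        worker.Join();   // wait for the worker so 'result' is safe to read
        return result;
    }
}
```

A call such as BigStackDemo.RunWithStack(50000, 16 * 1024 * 1024) gives the recursion 16MB to work with, independent of the stack size recorded in the executable's PE header.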
I've previously mentioned that the TEB has a StackBase and StackLimit, and that it can be dynamically queried using the ntdll!NtCurrentTeb function. Unfortunately, the StackLimit is only updated as you actually touch pages on the stack, and thus it's not a reliable way to find out how much uncommitted stack is left. The CLR commits the pages with kernel32!VirtualAlloc rather than by actually moving the guard page, so StackLimit is not updated as you might have expected. There's a field, DeallocationStack, at 0xE0C bytes from the beginning of the TEB that will give you this information, but it's undocumented, subject to change in the future, and too brittle to rely on.
The RuntimeHelpers.ProbeForSufficientStack function may look promising at first, but it won't work for this purpose. It probes for a fixed number of bytes (48KB on x86/x64), and if it finds there isn't enough, it induces the normal CLR stack overflow behavior.
The good news is that the kernel32!VirtualQuery function will get you this information. It returns a structure, one field of which is the AllocationBase for the original allocation request. When Windows reserves your stack, it does so as one contiguous piece of memory. The MM remembers the base address supplied at creation time, and it turns out that this is the "end" of your stack (remember, the stack grows downward). With a little P/Invoke magic, it's simple to create a CheckForSufficientStack function using this API. Our new function takes a number of bytes as an argument and returns a bool to indicate whether there is enough stack to satisfy the request:
public unsafe static bool CheckForSufficientStack(long bytes) {
    MEMORY_BASIC_INFORMATION stackInfo = new MEMORY_BASIC_INFORMATION();
    // We subtract one page for our request. VirtualQuery rounds UP to the next page.
    // Unfortunately, the stack grows down. If we're on the first page (last page in the
    // VirtualAlloc), we'll be moved to the next page, which is off the stack! Note this
    // doesn't work right for IA64 due to bigger pages.
    // (Pointer arithmetic instead of a uint cast, so x64 addresses aren't truncated.)
    IntPtr currentAddr = new IntPtr((byte*)&stackInfo - 4096);
    // Query for the current stack allocation information.
    VirtualQuery(currentAddr, ref stackInfo, new IntPtr(sizeof(MEMORY_BASIC_INFORMATION)));
    // If the current address minus the base (remember: the stack grows downward in the
    // address space) is greater than the number of bytes requested plus the reserved
    // space at the end, the request has succeeded.
    return (currentAddr.ToInt64() - stackInfo.AllocationBase.ToInt64()) >
        (bytes + STACK_RESERVED_SPACE);
}
// We are conservative here. We assume that the platform needs a whole 16 pages to
// respond to stack overflow (using an x86/x64 page-size, not IA64). That's 64KB,
// which means that for very small stacks (e.g. 128KB) we'll fail a lot of stack checks
// incorrectly.
private const long STACK_RESERVED_SPACE = 4096 * 16;
// VirtualQuery takes and returns a SIZE_T, which is pointer-sized.
[DllImport("kernel32.dll")]
private static extern IntPtr VirtualQuery(
    IntPtr lpAddress,
    ref MEMORY_BASIC_INFORMATION lpBuffer,
    IntPtr dwLength);
// Pointer-sized fields are IntPtr so the layout matches on both x86 and x64.
private struct MEMORY_BASIC_INFORMATION {
    internal IntPtr BaseAddress;
    internal IntPtr AllocationBase;
    internal uint AllocationProtect;
    internal IntPtr RegionSize;
    internal uint State;
    internal uint Protect;
    internal uint Type;
}
If this returns true, you can be guaranteed that an overflow will not occur. Well, modulo stack guarantee issues, that is...
Notice that we have to consider some amount of reserved space at the end of the stack. Platforms typically reserve a certain amount to ensure custom stack overflow processing can be triggered. Windows actually reserves a few pages at the end of the stack for this reason. If, after a stack overflow occurs, a double stack overflow is triggered (that is, stack overflow handling actually exceeds these pages), Windows takes over and kills the process. The CLR prefers to initiate a controlled shut-down: telling the host, if any, and fail-fasting otherwise. This means it needs to reserve even more than Windows does automatically. The kernel32!SetThreadStackGuarantee function can be used for this. In any case, we need to consider that when looking for enough stack space in our function. The code above assumes 16 4KB pages are required; this is more than is typically needed, so the check may report insufficient stack when there is actually enough (but, we hope, never the reverse). Also note the program above is very x86/x64-specific, and won't work reliably on IA-64: it hard-codes a 4KB page size. It's a trivial exercise to extend this to use information from kernel32!GetSystemInfo to use the right page size dynamically.
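One possible way to drop the hard-coded 4KB, sketched below, swaps in the managed Environment.SystemPageSize property (available in later versions of the Framework) in place of a P/Invoke to GetSystemInfo. The StackReserve class and its members are hypothetical names, and the 16-page figure is simply the same assumption the STACK_RESERVED_SPACE constant makes.

```csharp
using System;

public static class StackReserve
{
    // Same conservative assumption as STACK_RESERVED_SPACE: 16 pages.
    public const int ReservedPages = 16;

    // Reserved space computed from the page size reported at run time.
    public static long ReservedBytes()
    {
        return (long)Environment.SystemPageSize * ReservedPages;
    }

    // The address to hand to VirtualQuery: one page below a local's address,
    // mirroring the "- 4096" adjustment without assuming the page size.
    public static long ProbeAddress(long localAddress)
    {
        return localAddress - Environment.SystemPageSize;
    }
}
```

On a machine with 4KB pages, ReservedBytes() yields the same 64KB as the constant; with larger pages the reservation grows accordingly.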
As an example, check out this code:
static unsafe void Main() {
    Test(8*1024, 8*1024, true);
    Test(0, (960*1024) + (8*1024), false);
    Test(960*1024, 8*1024, false);
}
static unsafe void Test(int eatUp, long check, bool expect) {
    byte * bb = stackalloc byte[eatUp];
    Console.WriteLine("eatUp: {0}, check: {1}: {2}",
        eatUp, check,
        CheckForSufficientStack(check) == expect ?
            "SUCCESS" : "FAIL");
}
As I've described previously, the stack size can depend on the EXE PE file or parameters passed when creating a thread. This example assumes a 1MB stack size.
 Saturday, July 08, 2006
The CLR thread pool is a very useful thing. It amortizes the cost of thread creation and deletion--which, on Windows, are not cheap things--over the life of your process, without you having to write the hairy, complex logic to do the same thing yourself. The algorithms it uses have been tuned over three major releases of the .NET Framework now. Unfortunately, it’s still not perfect. In particular, it stutters occasionally.
As I’ve hinted at before, we have a lot of work actively going on right now that we hope to show up over the course of the next couple CLR versions (keep an eye on those CTPs!). This may include vastly improved performance for work items and IO completions, significantly reducing the overhead of using our thread pool (in some cases to as little as ~1/8th of what it is today), eliminating accidental deadlocks due to lots of blocked thread pool workers, and a slew of useful new features (prioritization, isolation, better debugging, etc.).
One silly thing our thread pool currently does has to do with how it creates new threads. Namely, it severely throttles creation of new threads once you surpass the "minimum" number of threads, which, by default, is the number of CPUs on the machine. We limit ourselves to at most one new thread per 500ms once we reach or surpass this number. This can be pretty bad for some workloads, most notably those that are "bursty"; i.e. those that exhibit interspersed inactive and active periods rather sporadically and unpredictably. ASP.NET is a great example of an environment in which this frequently happens. Here’s an illustration:
- Imagine we have a 4 CPU web server. The "minimum" thread count used is thus 4 (assuming the default).
- The web server has just started up.
- 16 new requests come in within a short period of time.
- The CLR quickly creates 4 thread pool threads to service the first 4 requests. Because we don’t want to add any more for another 500ms, the other 12 requests sit in the queue.
- The 4 thread pool threads are running some arbitrary web page response. Imagine the response generation code does some type of database query that takes 4 seconds to complete. (This is a strong argument for using ASP.NET asynchronous pages (see http://msdn.microsoft.com/msdnmag/issues/05/10/WickedCode/) -- in which case, the 4 thread pool threads would free up to execute 4 new requests almost immediately -- or perhaps simply rearchitecting the seemingly poor database interaction, but ignore this for now.)
- After 500ms, a new thread pool thread is created, and the 5th request is serviced.
- We now wait another 500ms to add another thread, service the next request, and so forth.
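The arithmetic in the scenario above can be sketched with a toy model. This is not the CLR's actual injection algorithm: it assumes every request arrives at time zero, every request takes the same fixed time, at least one thread exists up front, and one extra thread may be added per injection interval. ThrottleModel and StartTimes are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class ThrottleModel
{
    // Returns the time (in ms) at which each queued request starts running.
    // Assumes minThreads >= 1.
    public static long[] StartTimes(int requests, int minThreads,
                                    long serviceMs, long injectMs)
    {
        long[] start = new long[requests];
        List<long> freeAt = new List<long>();   // when each thread is next idle
        for (int i = 0; i < minThreads; i++) freeAt.Add(0);
        long nextInject = injectMs;             // earliest time a thread may be added

        for (int r = 0; r < requests; r++)
        {
            // Find the existing thread that becomes idle first.
            int best = 0;
            for (int t = 1; t < freeAt.Count; t++)
                if (freeAt[t] < freeAt[best]) best = t;

            if (freeAt[best] <= nextInject)
            {
                // An existing thread frees up before we could add a new one.
                start[r] = freeAt[best];
                freeAt[best] = start[r] + serviceMs;
            }
            else
            {
                // Otherwise wait for the throttle and inject a new thread.
                start[r] = nextInject;
                freeAt.Add(nextInject + serviceMs);
                nextInject += injectMs;
            }
        }
        return start;
    }
}
```

Under these assumptions, StartTimes(16, 4, 4000, 500) starts the 5th request at 500ms and the 16th at 4000ms, while StartTimes(16, 16, 4000, 500) starts all sixteen at time zero.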
If the server has a constant load, eventually the pool will become "primed." But if a burst of work is followed by an inactive period of time, the threads in the thread pool start timing out waiting for new work, and eventually will retire themselves, until the pool shrinks back to the minimum. Imagine that this happens and then a bunch of new work arrives. Oops. This can clearly lead to some nasty scalability nightmares. KB article 821261, Contention, poor performance, and deadlocks when you make Web service requests from ASP.NET applications, describes this problem among others.
To "fix" this we added the ability in v1.1 to specify the minimum thread count in the thread pool, either with the configuration file or with the ThreadPool.SetMinThreads API. See KB article 810259, FIX: SetMinThreads and GetMinThreads API Added to Common Language Runtime ThreadPool Class, for details. It turns out that Microsoft BizTalk Server has run into the same problem: FIX: Slow performance on startup when you process a high volume of messages through the SOAP adapter in BizTalk Server 2004. I suspect many other commercial products have run into this as well. And it's rather annoying that each of them has to figure this out after they've shipped something, turning into a support bulletin, an internal bug-fix, and (I would guess) a service pack containing said bug-fix.
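For reference, a minimal sketch of the programmatic route looks like the following. ThreadPool.GetMinThreads and ThreadPool.SetMinThreads are the real APIs; the PoolConfig wrapper and the choice to leave the IO completion minimum untouched are just illustrative.

```csharp
using System;
using System.Threading;

public static class PoolConfig
{
    // Raises the worker-thread minimum to at least 'desired', leaving the
    // IO completion thread minimum unchanged. Returns the resulting minimum.
    public static int RaiseMinWorkers(int desired)
    {
        int workers, completionPorts;
        ThreadPool.GetMinThreads(out workers, out completionPorts);

        if (desired > workers)
        {
            // SetMinThreads returns false if the request is rejected, for
            // example when it exceeds the pool's configured maximum.
            ThreadPool.SetMinThreads(desired, completionPorts);
        }

        // Re-read the value so the caller sees what actually took effect.
        ThreadPool.GetMinThreads(out workers, out completionPorts);
        return workers;
    }
}
```

Calling PoolConfig.RaiseMinWorkers(50) early in process startup lets the pool grow to 50 workers without the 500ms delay between injections.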
I wouldn't actually call what we did a fix. At best, it's a workaround. Hell, one of the KB articles above says that if you want decent scalability you need to change the minWorkerThreads count to 50. Our default is 1! Not too far off, eh? Shouldn't decent scalability be the default behavior?
We need to fix this for real.
Now, of course, it's a hard problem to solve. You don’t want to be too liberal adding threads to the pool because it can cause poor scalability should a large number of those extra threads suddenly become runnable. In an ideal world, no threads block, and having the same number of threads as you have CPUs gives you the best performance. (Better cache utilization, less overhead due to context switching, and so forth.) But software is often far less than ideal. As noted above, ASP.NET asynchronous pages are a great way to achieve this, and compared to injecting a whole bunch of relatively expensive threads into the process, it's obviously a better design. Unfortunately, I am not convinced all of our customers will stumble across this design, nor will it be brain-dead simple to rearchitect an existing site to take advantage of the feature without considerable work.
My hope is that we can solve this problem in the CLR by applying clever heuristics that even out over time. For example, we may start out life being over eager and generous with thread injection, but then "learn our lesson" after running for a period of time. This would lead to stabilization and increasingly superior performance over the life of the process for the work that the server experiences. For example, if the server often experiences bursts, we will monitor the number of threads that lead to the best throughput during such an active period, and during periods of inactivity we will avoid retiring threads. This ensures that the next time the server is busy, work can be responded to in a more scalable manner, albeit with some extra working set overhead for keeping those threads around for longer. Perhaps more appropriate configuration settings could be dynamically recommended based on statistics gathered during previous up-times of the server. And of course, we can offer more reasonable defaults for clients with short-living processes that might be harmed by over-eagerness with thread injection.
If anybody has experienced this problem in the wild, I’d love any feedback you might have. Feel free to leave a comment or email me at joedu at you-know-where dot com.
 Thursday, July 06, 2006
As I mentioned in a recent post, Windows Vista has new built-in support for deadlock detection. At the time, I couldn't find any publicly available documentation on this feature. Well, I just found it:
Wait Chain Traversal: Wait Chain Traversal (WCT) enables debuggers to diagnose application hangs and deadlocks. A wait chain is an alternating sequence of threads and synchronization objects; each thread waits for the object that follows it, which is owned by the subsequent thread in the chain.
The new APIs OpenThreadWaitChainSession, CloseThreadWaitChainSession, and GetThreadWaitChain permit both asynchronous and synchronous detection and response. MSDN also has a fairly detailed code sample that uses the new APIs to print out the wait chain for all threads in a process.
 Monday, July 03, 2006
The world at large seems to be gravitating towards AJAX applications. I suppose this shouldn’t be surprising, given the relative lack of differentiation when comparing today’s “rich” client apps with what can run inside of a browser. We’ve actually hit a low point on the client if you ask me: I actually prefer Outlook Web Access over the rich client because at least my web browser doesn’t hang as much. While rich media and ink have made client-side interactions (theoretically) more interactive, satisfying, and powerful, I have to admit that my personal day-to-day experience with client software is nowhere near what I’d like it to be.
It’s surprising (to me) that people can build such nice looking, responsive, and (dare I say?) rich programs with a whole lot of ingenuity, blood, sweat, and tears, all using technologies that date back to when I was a teenager and which have evolved at a tremendously slow pace. I mean, standards committees aren’t necessarily known for rapid innovation. What’s even more surprising is that, with a state of the art IDE, Visual Studio, impressive presentation stacks like Windows Forms and Presentation Foundation, and a killer VM and associated Framework, the momentum is decidedly in the web frontier. Thank God for ASP.NET.
The problem isn’t that I don’t understand the AJAX sales pitch. It’s that I can’t believe we haven’t solved those problems on the client and really set it apart.
Despite the A in AJAX standing for Asynchronous, the marriage with multi-core seems even less obvious than it does for the rich client. When the computational edge is offloaded to a set of back-office machines running in an expensive data center, the need for tera[fl]ops on the client seems slightly farfetched. Those servers sure better have the goods, though.
It seems to me that concurrency in AJAX apps will be more about masking latency and aggregating content from disparate sources than anything else: issuing tons of network requests and letting them complete in an overlapped fashion. Fault tolerance is another really attractive feature that concurrency could buy you. A quick search turned up this post about a Scheme message passing AJAX library. Very handy, and in one of my favorite languages to boot. But I wonder how mainstream these techniques will become?
Maybe I’m not thinking big. Speech and handwriting recognition, data mining over terabytes of personal information (which, by the way, it seems you need access to a hard drive for; see, the client’s good for something! ...or was that GFS?), synthesis and analysis of complex business, scientific, financial, and personal problems, and so on, are all things I can easily see a rich client doing. But then again, there's no real reason that such things couldn't be packaged up into AJAX libraries and all executed inside of the browser. A CPU cycle is a cycle, regardless of whether it's in the data center or running inside a web browser. This is where technologies like Flash/Flex come into play, taking the web experience and incrementally improving it to deliver richer experiences.
Google has already mastered the art of offloading impressive and constantly improving analysis over hordes of data to the data-center. It seems to me that they may also be in a position to transition some of this complex analysis onto the clients connected to that same data-center. A cluster of machine nodes looks surprisingly similar to a cluster of CPU nodes. Cutting down on communication and data transfer costs is key in both domains. And after all, why waste all of that (potentially massive) computing power that the 8-, 16-, 32-, ... core client has available? Instead of dumbing down the algorithms that are shared by [b|m]illions of clients simply so that a finite amount of computing power can be evenly distributed among a statistical peak load of consumers, make the clients do some work for themselves instead. And of course, the "constantly improving" attribute needn’t be lost: after all, it’s just a JS file. (IP of course starts to matter.)
Or is this all something that only a totally integrated client package can provide (whatever that means)? I suppose we’ll eventually find out.
When threads are created on Windows, the caller of the CreateThread API has the option to supply stack reserve/commit sizes. If not specified--i.e. the stack size parameter is 0--Windows just uses the sizes found in the PE header of the executable. Microsoft's linkers by and large use 1MB reserve/2 page commit by default, although most let you override this (e.g. LINK.EXE's /STACK:xxx,[yyy] option and VC++'s CL.EXE /F xxx). The CLR always pre-commits the entire stack for managed threads.
You'll often find situations where a program has been deployed and starts running out of stack space. Many times this is just a bug. But this also often happens when more data is fed to the application than was used during testing, causing deeper recursion or larger stack allocated data structures than is typical. ASP.NET, for example, uses 256KB stack sizes by default to minimize memory pressure due to large numbers of concurrent requests. It does this by setting the PE header's reserve size to 256KB, and relying on the fact that the CLR thread-pool creates its threads with a default stack size. I think WSDL.EXE also uses a 256KB stack to make startup faster. I was recently chatting with a customer who kept stack overflowing WSDL.EXE due to an extremely large XML file they were trying to parse (recursive XML parsers tend to use very deep stacks anyhow).
If you don't have the source code for the program in question, you can always use the EDITBIN.EXE utility that comes in the VC++ SDK to change the PE header's default stack values. Say you have an executable, FOO.EXE, that has been deployed and suddenly starts running out of stack space. You know it's not a bug -- it simply needs to consume more stack than was originally reserved. Running `EDITBIN.EXE FOO.EXE /STACK:2097152`, for example, changes the default stack to 2MB. This of course only works for threads that are created using the default stack size; if they override it explicitly, changing the PE header has no effect. This always works for threads in the CLR's thread pool.
Warning: Using EDITBIN.EXE like this can invalidate support and servicing warranties on commercial executables. You might want to use this approach for workarounds in your own organization or for personal use, but I don't recommend it for, say, Microsoft shipped binaries. There's no guarantee things will continue working as you'd hope, especially if you're shrinking the stack size instead of growing it. And next time you download an update from the Windows Update server, you may find that you've accidentally hosed your machine (although it honestly seems rather unlikely).
 Sunday, July 02, 2006
There are two main reasons to use concurrency.
The first reason is throughput. If you have multiple CPUs, then clearly you need at least as many threads as CPUs to keep them all busy. It's a little odd to talk about client-side workloads in terms of throughput, but we'll have to get used to it as multi-core becomes more prevalent. In the best case, there would be the same number of active threads as there are CPUs, each of which is entirely CPU-bound.
This is a very simplistic view of the world, however, and you typically end up needing more than that. The reason? Latency. Whenever you issue an operation with a non-zero latency, there will be some number of wasted CPU cycles during which computations will not make forward progress. If the latency is high enough, you can mask it with concurrency, and instead overlap some of the computation that needs to get done. Simply put, this maximizes the amount of work that actually gets done in a given amount of time (i.e. throughput).
To illustrate this point, consider Intel’s HyperThreading (HT) technology for a moment. Any memory access, and particularly one that misses cache entirely and goes to main memory, has a noticeable latency (e.g. on the order of tens to hundreds of cycles). Instruction-level parallelism (ILP) can mask this to some degree. But HT also improves instruction-level throughput by overlapping adjacent instruction streams as stalls occur due to latency. This is clearly concurrency-in-the-small and doesn’t incur any noticeable overhead for context switching as do coarser grained forms of concurrency. But for many workloads it can do a surprisingly good job at masking delays. This technology, by the way, is based on technologies pioneered by super-computer makers like Cray and Tera years ago; many such architectures actually don’t use caches, so the latency of accessing main memory is incurred much more frequently, and thus this technique is much more beneficial in practice.
To illustrate this idea further, consider coarse-grained IO, such as issuing a web service request. The latency here is huge when compared to a simple cache miss, often warranting application-level concurrency to mask the latency. Again, if your goal is to maximize throughput, then you’d like to use as many cycles/time as possible, assuming that ensures you get the most work done. Asynchronous overlapped IO via Windows Completion Ports is meant exactly for this purpose (e.g. via the Stream.BeginXXX/EndXXX functions combined with the thread pool), allowing you to resume the paused “continuation” once the IO completes. In the meantime, you can continue performing meaningful work. This technique also often leads to better bandwidth utilization; for example, you can have several pending network requests which complete as individual responses are received, again masking the unpredictable latencies and response times.
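A minimal sketch of the Begin/End pattern is shown below. Stream.BeginRead and Stream.EndRead are the real APIs; the AsyncReadDemo wrapper is hypothetical. Note that true overlapped IO through a completion port needs a stream that supports it (for example a network stream, or a FileStream opened for asynchronous access); a MemoryStream works with this code but only demonstrates the calling pattern.

```csharp
using System;
using System.IO;
using System.Threading;

public static class AsyncReadDemo
{
    // Starts a read, leaves the caller free to do other work, and collects
    // the byte count from the completion callback.
    public static int ReadOverlapped(Stream stream, byte[] buffer)
    {
        int bytesRead = -1;
        using (ManualResetEvent done = new ManualResetEvent(false))
        {
            stream.BeginRead(buffer, 0, buffer.Length, ar =>
            {
                // Runs when the IO completes, typically on a pool thread.
                bytesRead = stream.EndRead(ar);
                done.Set();
            }, null);

            // ... other useful work would overlap with the IO here ...

            done.WaitOne();   // block only once the result is actually needed
        }
        return bytesRead;
    }
}
```

In a real server you would usually not wait at all: the callback itself would carry on with the rest of the request, which is the "continuation" described above.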
A special case is when maximizing throughput of an individual component rather than the system as a whole. The UI thread, for example, is a precious resource that needs to maximize its message dispatching throughput so that latencies are masked from the user. Instead of statistical throughput degradation, failing to do this can lead to disastrous user experiences. This typically involves dispatching events to a separate worker thread whenever any IO might occur during the event's execution. And it may mean sacrificing the throughput of the entire system so that you can maximize throughput and remove waiting from the single component. Other systems with finite resources often exhibit this same characteristic.
The second major reason to use concurrency is fairness. If you are performing some work and suddenly some new work arrives, it often makes sense to start the computations associated with the new work as soon as possible. This allows round-robin servicing (e.g. at thread quantum intervals), ensuring that multiple pieces of work make progress at somewhat equivalent speeds. Anti-starvation of pending requests can often mandate this technique. For example, if you have a shared hosted web server whose pages just block indefinitely, you may end up starving other sites if you don’t create more threads to service them. In some cases, you may actually want to preempt the existing work if the new work is a higher priority. Windows thread priorities are good for that.
For compute-intensive workloads, optimizing for fairness will typically decrease throughput. That’s because you often need to create more threads than you have CPUs to accommodate the new work, and therefore more time is spent context switching and damaging locality. What may not be obvious is that this can actually lead to better throughput for many workloads, because IO can be overlapped and therefore as instruction streams stall, other threads can overlap progress.
Locking messes with all of this. Today's locking mechanisms aren't conducive to optimizing for throughput. The latency involved with racing with other concurrent workers is unpredictable but measurable at best. It is very difficult to systematically design to hide such latencies. And of course most locks have no idea of fairness or priority. Because context switches can happen while a lock is held, it may be the case that every thread about to be scheduled tries to acquire that same lock. Bam, you suddenly have a lock convoy on your hands. And priorities and threads don't mix very well: priority inversion can happen unexpectedly, leading to substantial loss in throughput at best and deadlock at worst. STM is a glimmer of hope in both regards.
 Thursday, June 29, 2006
I'll be speaking at JAOO'06 in Denmark this October. They have an entire track dedicated to concurrency. If you're in the area (or don't mind the travel), I highly recommend checking it out:
Concurrency and the composition of frameworks
Abstract: Multi-core computer architectures pose both a threat and an opportunity to modern software. The amount of computing power that will soon be available will enable mainstream applications to solve problems that require computing power that has until recently only been available in supercomputers. But it also means that our software needs to evolve alongside to better support and enable the levels of concurrency we'll need to effectively use all of those cores. This fact applies as much to reusable software libraries as it does to applications themselves.
This new direction imposes some new and interesting constraints on the architecture of reusable software components, including the need to remain thread agnostic, expose latency characteristics and mechanisms for hiding latency, and, for computationally expensive library routines, some way to carry them out in parallel based on the context in which they were called. These are all areas which have not yet been researched heavily and which commercial library vendors are only now beginning to seriously deal with.
This talk presents an overview of the problem, identifies some key challenges, and proposes some direction for enabling our software to both take advantage of concurrency and to avoid inhibiting it. While the discussion has been derived from experiences on the Windows and .NET Framework platforms, the ideas presented aim to transcend any specific technology.
If you're in Denmark and want to meet up to chat, definitely drop me a line.
 Saturday, June 24, 2006
It shouldn't be news to anybody that Bill is transitioning into a new role in 2 years. I figured I'd dump some of my thoughts about this onto paper. Remember: This is in no way the official company view on the matter, nor is it motivated by any sort of company-private information.
First, I am surprised that Bill didn't make this transition sooner. I think it's admirable how he's been able to maneuver through the technology details for so long. He refused to completely give up his technical edge. And I think it's cool that he can venture out into entirely new verticals with the premise in hand that all you need is a bunch of really smart, motivated people to succeed. It's risky. And it's a sort of an antithesis to traditionalist business management views.
From talking to colleagues about this news, however, I think people tend to downplay the importance of having Bill around. Externally, there's no doubt he's a huge part of our PR, whether you like him or not. That's probably why the stock has been flat after the announcement. Not having him in charge of technical direction may open up some new avenues that we wouldn't have otherwise explored. New blood is always healthy. At the same time, though, it takes a functioning system and runs the risk of disrupting it. But I'm more worried about the internal climate...
Just as Dave Cutler is a god to the NT Team, Bill is a role model for every technical person in the company. He's a geek. He talks like one, he looks like one, and he acts like one. He was very successful at a young age, was self-taught, and didn't need college to do and succeed at what he loved. And he has an inconceivable level of power and influence both within and outside the company. For a place that's full of uber-geek MS-for-lifers who joined the firm straight out of college, and who wouldn't think about leaving (with Bill around at least), all of these are very important traits. While Ray Ozzie is a very accomplished and intelligent guy, he's missing almost all of the traits I mentioned above. And I think people will notice.
In the past month alone, I've had a BillG Review and received feedback from him on two spring ThinkWeek papers that I submitted. These were my first personal interactions with Bill, and perhaps my last. I was impressed. He's scary smart. There's no doubt he's very clever and can effortlessly cut through complexity to understand the core of really deep technical problems. There's no way the new senior technical leadership will have the same traits to the same degree. It's not that they aren't great people. I've recently had several meetings with Craig Mundie, our CTO, and he's extraordinarily insightful and talented. I found myself blown away by some points he was making. But Bill's just too good to beat.
So here's the gist of it all. Microsoft in the past has been a technically motivated company. Bill's passion was around how we could use technology to change the world. But at the same time, he cared about how that technology was architected and built. He didn't simply spew MBA mumbo-jumbo. Microsoft feels like a company full of hundreds of start-ups, each of which reports to the same technical leader, all fighting tooth and nail to build the greatest technology possible. I think all of that is going to change. I conjecture that we're at the beginning of a major shift, where Microsoft will slowly evolve from a technology-driven company to a business-driven company. The two are not mutually exclusive, obviously, but the balance will shift. We'll do more projects for business reasons and fewer for pure technology reasons. We'll waste less money in the process. It had to happen sooner or later. But as one of the geeks, I have to wonder whether it will remain as enjoyable a place to work. Or whether those who are looking for such an environment will be forced to go elsewhere...
 Wednesday, June 21, 2006
Windows Vista has some great new features for concurrent programming. For those of you still writing native code, it's worth checking them out. For those writing managed code, we have a bunch of great stuff in the pipeline for the future, but unfortunately you'll have to wait. Or convert (back) to the dark side.
The Vista features include:
1. Reader/writer locks. The kernel32 function InitializeSRWLock takes a pointer to an SRWLOCK structure and initializes it, much as InitializeCriticalSection does for a CRITICAL_SECTION. AcquireSRWLockExclusive and AcquireSRWLockShared acquire the lock in the specified mode, and the matching ReleaseSRWLockXXX function releases it. This is a "slim" RW lock, meaning it's just a pointer-sized value, and it is ultra-fast and lightweight, much like existing Win32 CRITICAL_SECTIONs. Acquiring it should cost about as much as a single interlocked operation. E.g.
SRWLOCK rwLock;
InitializeSRWLock(&rwLock);
AcquireSRWLockShared(&rwLock);
// ... shared operations ...
ReleaseSRWLockShared(&rwLock);
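If you don't have a Vista SDK handy, the shared/exclusive split can be sketched in portable C++17 with std::shared_mutex. To be clear, this is a standard-library analogue and not the SRWLOCK API itself, and SharedCounter and RunCounterDemo are made-up names used only for illustration:

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical example type: a counter guarded by a reader/writer lock.
class SharedCounter {
public:
    // Readers take the lock in shared mode, so many can hold it at once
    // (the role AcquireSRWLockShared plays in the Win32 snippet).
    int Read() const {
        std::shared_lock<std::shared_mutex> lock(m_lock);
        return m_value;
    }

    // Writers take the lock exclusively
    // (the role AcquireSRWLockExclusive plays).
    void Increment() {
        std::unique_lock<std::shared_mutex> lock(m_lock);
        ++m_value;
    }

private:
    mutable std::shared_mutex m_lock;
    int m_value = 0;
};

// Starts `writers` threads that each increment `perWriter` times while
// `readers` threads read concurrently. Returns the final counter value.
int RunCounterDemo(int writers, int readers, int perWriter) {
    SharedCounter counter;
    std::vector<std::thread> threads;
    for (int i = 0; i < writers; ++i) {
        threads.emplace_back([&counter, perWriter] {
            for (int j = 0; j < perWriter; ++j) counter.Increment();
        });
    }
    for (int i = 0; i < readers; ++i) {
        threads.emplace_back([&counter, writers, perWriter] {
            for (int j = 0; j < perWriter; ++j) {
                int seen = counter.Read();
                // A reader never observes a value outside the valid range.
                assert(seen >= 0 && seen <= writers * perWriter);
                (void)seen;
            }
        });
    }
    for (auto& t : threads) t.join();
    return counter.Read();
}
```

Calling RunCounterDemo(4, 4, 1000) should hand back 4000, since every increment happens under the exclusive lock. The scoped lock objects release automatically, which is the one thing the raw Acquire/Release pairs won't do for you.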
2. Condition variables. These integrate with RW locks and critical sections, enabling you to do essentially what you can already do with Monitor.Wait/Pulse/PulseAll. InitializeConditionVariable takes a pointer to a CONDITION_VARIABLE and initializes it. SleepConditionVariableCS and SleepConditionVariableSRW release the specified lock (either CRITICAL_SECTION or SRWLOCK) and wait on the condition variable as an atomic action. When the thread wakes up again, it immediately attempts to acquire the lock it released during the wait. WakeConditionVariable wakes a single waiter for the target condition and WakeAllConditionVariable wakes all waiters, much like Pulse and PulseAll. E.g.
Buffer * pBuffer = ...;
PCRITICAL_SECTION pCsBufferLock = ...;
PCONDITION_VARIABLE pCvBufferHasItem = ...;

// Consumer code: wait until the buffer has an item, then process it.
EnterCriticalSection(pCsBufferLock);
while (pBuffer->Count == 0) {
    SleepConditionVariableCS(pCvBufferHasItem, pCsBufferLock, INFINITE);
}
// process item...
LeaveCriticalSection(pCsBufferLock);

// Producer code: add an item, then wake any waiting consumers.
EnterCriticalSection(pCsBufferLock);
pBuffer->Put(NewItem());
LeaveCriticalSection(pCsBufferLock);
WakeAllConditionVariable(pCvBufferHasItem);
More details on condition variables can be found on MSDN.
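For comparison, here is roughly the same buffer handoff written against std::condition_variable. It's a portable stand-in for the Vista functions, not a use of them, and ItemBuffer and SumThroughBuffer are hypothetical names invented for this sketch:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical buffer type standing in for the Buffer used above.
class ItemBuffer {
public:
    // Producer side: add an item under the lock, then wake one waiter
    // after the lock has been released.
    void Put(int item) {
        {
            std::lock_guard<std::mutex> lock(m_lock);
            m_items.push(item);
        }
        m_hasItem.notify_one();
    }

    // Consumer side: wait() atomically releases the lock and sleeps, then
    // reacquires the lock before returning. The predicate form rechecks the
    // condition in a loop, so spurious wakeups are harmless.
    int Take() {
        std::unique_lock<std::mutex> lock(m_lock);
        m_hasItem.wait(lock, [this] { return !m_items.empty(); });
        assert(!m_items.empty());
        int item = m_items.front();
        m_items.pop();
        return item;
    }

private:
    std::mutex m_lock;
    std::condition_variable m_hasItem;
    std::queue<int> m_items;
};

// Pushes 1..count through the buffer from a producer thread and returns
// the sum of everything the consumer thread took out.
long long SumThroughBuffer(int count) {
    ItemBuffer buffer;
    long long sum = 0;
    std::thread consumer([&] {
        for (int i = 0; i < count; ++i) sum += buffer.Take();
    });
    std::thread producer([&] {
        for (int i = 1; i <= count; ++i) buffer.Put(i);
    });
    producer.join();
    consumer.join();
    return sum;
}
```

The important part is the same in both versions: the waiter rechecks its condition in a loop after waking, because being woken is only a hint that the condition might now hold.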
3. Lazy/one-time initialization. This allows you to write lazy allocation without fully understanding memory models and that sort of nonsense. The new APIs in kernel32, InitOnceXXX, support both synchronous and asynchronous initialization. These have some amount of overhead for the initialization case due to the use of a callback, but in general this will be fast enough for most lazy initialization and much less error prone. Herb Sutter has proposed a similar construct for the VC++ language, and to be honest I wish we had this built into C# too. See the MSDN docs for an example and more details.
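To give a feel for the synchronous flavor, here is a minimal sketch using std::call_once from standard C++. This is an analogue of the idea rather than the kernel32 API, and Config, GetConfig, and RaceToInitialize are hypothetical names that exist only for this example:

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical expensive resource that should be built at most once.
struct Config {
    int value;
};

std::once_flag g_configOnce;
Config* g_config = nullptr;
std::atomic<int> g_initCalls{0};  // counts how often the initializer ran

// Returns the lazily created Config. std::call_once runs the lambda exactly
// once even under contention, and every caller that returns from it sees
// the fully constructed object.
Config& GetConfig() {
    std::call_once(g_configOnce, [] {
        ++g_initCalls;
        g_config = new Config{42};  // deliberately kept for the process lifetime
    });
    assert(g_config != nullptr);
    return *g_config;
}

// Calls GetConfig from several threads at once and returns how many times
// the initializer actually executed.
int RaceToInitialize(int threadCount) {
    std::vector<std::thread> threads;
    for (int i = 0; i < threadCount; ++i) {
        threads.emplace_back([] { (void)GetConfig(); });
    }
    for (auto& t : threads) t.join();
    return g_initCalls.load();
}
```

No matter how many threads race into GetConfig, the initializer count stays at one, and nobody has to reason about double-checked locking or memory barriers by hand.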
4. An overhauled thread pool API. The Windows kernel team has actually rewritten the thread pool from the ground up for this release. Their APIs now support creating multiple pools per process, waiting for queues to drain or a specific work item to complete, cancellation of work, cancellation of IO, and new cleanup functionality, including automatically releasing locks. It also has substantial performance improvements due to a large portion of the code residing in user-mode instead of kernel-mode. MSDN has a comparison between the old and new APIs.
5. A bunch of new InterlockedXXX variants.
6. Application deadlock detection. This is separate from the existing Driver Verifier ability to diagnose deadlocks in drivers. This capability integrates with all synchronization mechanisms, from CRITICAL_SECTION to SRWLOCK to Mutex, and keys off of any calls to XXXWaitForYYYObjectZZZ. Unfortunately, I think this is new to the latest Vista SDK, and thus there isn't a lot of information available publicly. This could probably make a good future blog post if there's interest.
Have fun with this stuff, of course. But be careful. Don't poke an eye out.
Me
Joe is an architect and developer on a systems incubation project at Microsoft.
Disclaimer:
The content of this site is my own personal opinion and does
not represent my employer's view in any way.
© 2013, Joe Duffy