|
Personal Info:
Joe  leads the architecture of an experimental OS's developer platform, where
he is also chief architect of its programming language. His current mission is to enable
writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe
founded the Parallel Extensions to .NET project.
He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife,
writing books, writing music,
studying music theory & mathematics, and doing anything involving food & wine.
My books
My music
Disclaimer:
The content of this site are my own personal opinions and do
not represent my employer's view in anyway.
© 2012, Joe Duffy
|
|
 Saturday, July 08, 2006
The CLR thread pool is a very useful thing. It amortizes the cost of thread creation and deletion--which, on Windows, are not cheap things--over the life of your process, without you having to write the hairy, complex logic to do the same thing yourself. The algorithms it uses have been tuned over three major releases of the .NET Framework now. Unfortunately, it’s still not perfect. In particular, it stutters occasionally.
As I’ve hinted at before, we have a lot of work actively going on right now that we hope to show up over the course of the next couple CLR versions (keep an eye on those CTPs!). This may include vastly improved performance for work items and IO completions, significantly reducing the overhead of using our thread pool (in some cases to as little as ~1/8th of what it is today), eliminating accidental deadlocks due to lots of blocked thread pool workers, and a slew of useful new features (prioritization, isolation, better debugging, etc.).
One silly thing our thread pool currently does has to do with how it creates new threads. Namely, it severely throttles creation of new threads once you surpass the "minimum" number of threads, which, by default, is the number of CPUs on the machine. We limit ourselves to at most one new thread per 500ms once we reach or surpass this number. This can be pretty bad for some workloads, most notable those that are "bursty"; i.e. those that exhibit interspersed inactive and active periods rather sporadically and unpredictably. ASP.NET is a great example of an environment in which this frequerntly happens. Here’s an illustration:
- Imagine we have a 4 CPU web server. The "minimum" thread count used thus 4 (assuming the default).
- The web server has just started up.
- 16 new requests come in within a short period of time.
- Ther CLR quickly create 4 thread pool threads to service the first 4 requests. Because we don’t want to add any more for another 500ms, the other 12 requests sit in the queue.
- The 4 thread pool threads are running some arbitrary web page response. Imagine the response generation code does some type of database query that takes 4 seconds to complete. (This is a strong argument for using ASP.NET asynchronous pages (see http://msdn.microsoft.com/msdnmag/issues/05/10/WickedCode/) -- in which case, the 4 thread pool threads would free up to execute 4 new requests almost immediately -- or perhaps simply rearchitecting the seemingly poor database interaction, but ignore this for now.)
- After 500ms, a new thread pool thread is created, and the 5th request is serviced.
- We now wait another 500ms to add another thread, service, the next request, and so forth.
If the server has a constant load, eventually the pool will become "primed." But if a burst of work is followed by an inactive period of time, the threads in the thread pool start timing out waiting for new work, and eventually will retire themselves, until the pool shrinks back to the minimum. Imagine that this happens and then a bunch of new work arrives. Oops. This can clearly lead to some nasty scalability nightmares. KB article 821261, Contention, poor performance, and deadlocks when you make Web service requests from ASP.NET applications, describes this problem among others.
To "fix" this we added the ability in v1.1 to specify the minimum thread count in the thread pool, either with the configuratoin file or with the ThreadPool.SetMinThreads API. See KB article 810259, FIX: SetMinThreads and GetMinThreads API Added to Common Language Runtime ThreadPool Class, for details. It turns out that Microsoft Biztalk Server has run into the same problem: FIX: Slow performance on startup when you process a high volume of messages through the SOAP adapter in BizTalk Server 2004. I suspect many other commercial products have run into this as well. And it's rather annoying that each of them have to figure this out after they've shipped something, turning into a support bulletin, an internal bug-fix, and (I would guess) a service pack containing said bug-fix.
I wouldn't actually call what we did a fix. At best, it's a workaround. Hell, one of the KB articles above says that if you want decent scalability you need to change the minWorkerThreads count to 50. Our default is 1! Not too far off, eh? Shouldn't decent scalability be the default behavior?
We need to fix this for real.
Now, of course, it's a hard problem to solve. You don’t want to be too liberal adding threads to the pool because it can cause poor scalability should a large number of those extra threads suddenly become runnable. In an ideal world, no threads block, and having the same number of threads as you have CPUs gives you the best performance. (Better cache utilization, less overhead due to context switching, and so forth.) But software is often far less than ideal. As noted above, ASP.NET asynchronous pages are a great way to acheive this, and compared to injecting a whole bunch of relatively expensive threads into the process, it's obviously a better design. Unfortunately, I am not convinced all of our customers will stumble across this design, nor will it be brain-dead simple to rearchitect an existing site to take advantage of the feature without considerable work.
My hope is that we can solve this problem in the CLR by applying clever heuristics that even out over time. For example, we may start out life being over eager and generous with thread injection, but then "learn our lesson" after running for a period of time. This would lead to stabalization and an increasingly superior performance over the life of the process for the work that the server experiences. For example, if the server often experiences bursts, we will monitor the number of threads that lead to the best throughput during such an active period, and during periods of inactivity we will avoid retiring threads. This ensures that the next time the server is busy, work can be responded to in a more scalable manner, albeit with some extra working set overhead for keeping those threads around for longer. Perhaps more appropriate configuration settings could be dynamically recommended based on statistics gatherer during previous up-times of the server. And of course, we can offer more reasonable defaults for clients with short-living processes that might be harmed by over-eagerness with thread injection.
If anybody has experienced this problem in the wild, I’d love any feedback you might have. Feel free to leave a comment or email me at joedu at you-know-where dot com.
 Thursday, July 06, 2006
As I mentioned in a recent post, Windows Vista has new built-in support for deadlock detection. At the time, I couldn't find any publicly available documentation on this feature. Well, I just found it:
Wait Chain Traversal Wait Chain Traversal (WCT) enables debuggers to diagnose application hangs and deadlocks. A wait chain is an alternating sequence of threads and synchronization objects; each thread waits for the object that follows it, which is owned by the subsequent thread in the chain. Read More.
The new APIs OpenThreadWaitChainSession, CloseThreadWaitChainSession, and GetThreadWaitChain permit both asynchronous and synchronous detection and response. MSDN also has a fairly detailed code sample that uses the new APIs to print out the wait chain for all threads in a process.
 Monday, July 03, 2006
The world at large seems to be gravitating towards AJAX applications. I suppose this shouldn’t be surprising, given the relative lack of differentiation when comparing today’s “rich” client apps with what can run inside of a browser. We’ve actually hit a low point on the client if you ask me: I actually prefer Outlook Web Access over the rich client because at least my web browser doesn’t hang as much. While rich media and ink have made client-side interactions (theoretically) more interactive, satisfying, and powerful, I have to admit that my personal day-to-day experience with client software is anywhere near what I’d like it to be.
It’s surprising (to me) that people can build such nice looking, responsive, and (dare I say?) rich programs with a whole lot of ingenuity, blood, sweat, and tears, all using technologies that date back to when I was a teenager and which have evolved at a tremendously slow pace. I mean, standards committees aren’t necessarily known for rapid innovation. What’s even more surprising is that, with a state of the art IDE, Visual Studio, impressive presentation stacks like Windows Forms and Presentation Foundation, and a killer VM and associated Framework, the momentum is decidedly in the web frontier. Thank God for ASP.NET.
The problem isn’t that I don’t understand the AJAX sales pitch. It’s that I can’t believe we haven’t solved those problems on the client and really set it apart.
Despite the A in AJAX standing from Asynchronous, the marriage with multi-core seems less obvious than the rich client even. When the computational edge is offloaded to a set of back-office machines running in an expensive data center, the need for tera[fl]ops on the client seems slightly farfetched. Those servers sure better have the goods, though.
It seems to me that concurrency in AJAX apps will be more about masking latency and aggregating content from disparate sources than anything else: issuing tons of network requests and letting them complete in an overlapped fashion. Fault tolerance is another really attractive feature that concurrency could buy you. A quick search turned up this post about a Scheme message passing AJAX library. Very handy, and in one of my favorite languages to boot. But I wonder how mainstream these techniques will become?
Maybe I’m not thinking big. Speech and handwriting recognition, data mining over terabytes of personal information (which, by the way, it seems you need access to a hard drive for; see, the client’s good for something! ...or was that GFS?), synthesis and analysis of complex business, scientific, financial, and personal problems, and so on, are all things I can easily see a rich client doing. But than again, there's no real reason that such things couldn't be packaged up into AJAX libraries and all executed inside of the browser. A CPU cycle is a cycle, regardless of whether it's in the data center or running inside a web browser. This is where technologies like Flash/Flex come into play, taking the web experience and incrementally improving it to deliver richer experiences.
Google has already mastered the art of offloading impressive and constantly improving analysis over hoardes of data to the data-center. It seems to me that they may also be in a position to transition some of this complex analysis onto the clients connected to that same data-center. A cluster of machine nodes lookes surprisingly similar to a cluster of CPU nodes. Cutting down on communication and data transfer costs is key in both domains. And after all, why waste all of that (potentially massive) computing power that the 8-, 16-, 32-, ... core client has available? Instead of dumbing down the algorithms that are shared by [b|m]illions of clients simply so that a finite amount of computing power can be evenly distributed among a statistical peak load of consumers, make the clients do some work for themselves instead. And of course, the "constantly improving" attribute needn’t be lost: after all, it’s just a JS file. (IP of course starts to matter.)
Or is this all something that only a totally integrated client package can provide (whatever that means)? I suppose we’ll eventually find out.
When threads are created on Windows, the caller of the CreateThread API has the option to supply stack reserve/commit sizes. If not specified--i.e. the stack size parameter is 0--Windows just uses the sizes found in the PE header of the executable. Microsoft's linkers by and large use 1MB reserve/2 page commit by default, although most let you override this (e.g. LINK.EXE's /STACK:xxx,[yyy] option and VC++'s CL.EXE /F xxx). The CLR always pre-commits the entire stack for managed threads.
You'll often find situations where a program has been deployed and starts running out of stack space. Many times this is just a bug. But this also often happens when more data is fed to the application than was used during testing, causing deeper recursion or larger stack allocated data structures than is typical. ASP.NET, for example, uses 256KB stack sizes by default to minimize memory pressure due to large numbers of concurrent requests. It does this by setting the PE header's reserve size to 256KB, and relying on the fact that the CLR thread-pool creates its threads with a default stack size. I think WSDL.EXE also uses a 256KB stack to make startup faster. I was recently chatting with a customer who kept stack overflowing WSDL.EXE due to an extremely large XML file they were trying to parse (recursive XML parsers tend to use very deep stacks anyhow).
If you don't have the source code for the program in question, you can always use the EDITBIN.EXE utility that comes in the VC++ SDK to change the PE header's default stack values. Say you have an executable, FOO.EXE, that has been deployed and suddenly starts running out of stack space. You know it's not a bug -- it simply needs to consume more stack than was originally reserved. Running `EDITBIN.EXE FOO.EXE /STACK:2097152`, for example, changes the default stack to 2MB. This of course only works for threads that are created using the default stack size; if they override it explicitly, changing the PE header has no effect. This always works for threads in the CLR's thread pool.
Warning: Using EDITBIN.EXE like this can invalidate support and servicing warranties on commercial executables. You might want to use this approach for workarounds in your own organization or for personal use, but I don't recommend it for, say, Microsoft shipped binaries. There's no guarantee things will continue working as you'd hope, especially if you're shrinking the stack size instead of growing it. And next time you download an update from the Windows Update server, you may find that you've accidentally hosed your machine (although it honestly seems rather unlikely).
 Sunday, July 02, 2006
There are two main reasons to use concurrency.
The first reason is throughput. If you have multiple CPUs, then clearly you need at least as many threads as CPUs to keep them all busy. It's a little odd to talk about client-side workloads in terms of throughput, but we'll have to get used to it as multi-core becomes more prevalent. In the best case, there would be the same number of active threads as there are CPUs, each of which are entirely CPU-bound.
This is a very simplistic view of the world, however, and you typically end up needing more than that. The reason? Latency. Whenever you issue an operation with a non-zero latency, there will be some number of wasted CPU cycles during which computations will not make forward progress. If the latency is high enough, you can mask it with concurrency, and instead overlap some of the computation that needs to get done. Simply put, this maximizes the amount of work that actually gets done in a given amount of time (i.e. throughput).
To illustrate this point, consider Intel’s HyperThreading (HT) technology for a moment. Any memory access—and particularly those that miss cache entirely and go to main memory—have a noticeable latency (e.g. on the order of tens to hundreds of cycles). Instruction-level parallelism (ILP) can mask this to some degree. But HT also improves instruction-level throughput by overlapping adjacent instruction streams as stalls occur due to latency. This is clearly concurrency-in-the-small and doesn’t incur any noticeable overhead for context switching as do coarser grained forms of concurrency. But for many workloads it can do a surprisingly good job at masking delays. This technology, by the way, is based on technologies pioneered by super-computer makers like Cray and Tera years ago; many such architectures actually don’t use caches, so the latency of accessing main memory is incurred much more frequently, and thus this technique is much more beneficial in practice.
To illustrate this idea further, consider coarse-grained IO, such as issuing a web service request. The latency here is huge when compared to a simple cache miss, often warranting application-level concurrency to mask the latency. Again, if your goal is to maximize throughput, then you’d like to use as many cycles/time as possible, assuming that ensures you get the most work done. Asynchronous overlapped IO via Windows Completion Ports is meant exactly for this purpose (e.g. via the Stream.BeginXXX/EndXXX functions combined with the thread pool), allowing you to resume the paused “continuation” once the IO completes. In the meantime, you can continue performing meaningful work. This technique also often leads to better bandwidth utilization; for example, you can have several pending network requests which complete as individual responses are received, again masking the unpredictable latencies and response times.
A special case is when maximizing throughput of an individual component rather than the system as a whole. The UI thread, for example, is a precious resource that needs to maximize its message dispatching throughput so that latencies are masked from the user. Instead of statistical throughput degradation, failing to do this can lead to disastrous user experiences. This typically involves dispatching events to a separate worker thread whenever any IO might occur during the event's execution. And it may mean sacrificing the throughput of the entire system so that you can maximize throughput and remove waiting from the single component. Other systems with finite resources often exhibit this same characteristic.
The second major reason to use concurrency is fairness. If you are performing some work and suddenly some new work arrives, it often makes sense to start the computations associated with the new work as soon as possible. This allows round-robin servicing (e.g. at thread quantum intervals), ensuring that multiple pieces of work make progress at somewhat equivalent speeds. Anti-starvation of pending requests can often mandate this technique. For example, if you have a shared hosted web server whose pages just block indefinitely, you may end up starving other sites if you don’t create more threads to service them. In some cases, you may actually want to preempt the existing work if the new work is a higher priority. Windows thread priorities are good for that.
For compute-intensive workloads, optimizing for fairness will typically decrease throughput. That’s because you often need to create more threads than you have CPUs to accommodate the new work, and therefore more time is spent context switching and damaging locality. What may not be obvious is that this can actually lead to better throughput for many workloads, because IO can be overlapped and therefore as instruction streams stall, other threads can overlap progress.
Locking messes with all of this. Today's locking mechanisms aren't conducive to optimizing for throughput. The latency involved with racing with other concurrent workers is unpredictable but measurable at best. It is very difficult to systematically design to hide such latencies. And of course most locks have no idea of fairness or priority. Because context switches can happen while a lock is held, it may be the case that every thread about to be scheduled tries to acquire that same lock. Bam, you suddenly have a lock convoy on your hands. And priorities and threads don't mix very well, priority inversion can happen unexpectedly, leading to substantial loss in throughput at best and deadlock at worst. STM is a glimmer of hope in both regards.
 Thursday, June 29, 2006
I'll be speaking at JAOO'06 in Denmark this October. They have an entire track dedicated to concurrency. If you're in the area (or don't mind the travel), I highly recommend checking it out:
Concurrency and the composition of frameworks
Abstract: Multi-core computer architectures pose both a threat and an opportunity to modern software. The amount of computing power that will soon be available will enable mainstream applications to solve problems that require computing power that has until recently only been available in supercomputers. But it also means that our software needs to evolve alongside to better support and enable the levels of concurrency we'll need to effectively use all of those cores. This fact applies as much to reusable software libraries as it does to applications themselves.
This new direction imposes some new and interesting constraints on the architecture of reusable software components, including the need to remain thread agnostic, expose latency characteristics and mechanisms for hiding latency, and, for computationally expensive library routines, some way to carry them out in parallel based on the context in which they were called. These are all areas which have not yet been researched heavily and which commercial library vendors are only now beginning to seriously deal with.
This talk presents an overview of the problem, identifies some key challenges, and proposes some direction for enabling our software to both take advantage of concurrency and to avoid inhibiting it. While the discussion has been derived from experiences on the Windows and .NET Framework platforms, the ideas presented aim to transcend any specific technology.
If you're in Denmark and want to meet up to chat, definitely drop me a line.
 Saturday, June 24, 2006
It shouldn't be news to anybody that Bill is transitioning into a new role in 2 years. I figured I'd dump some of my thoughts about this onto paper. Remember: This is in no way the official company view on the matter, nor is it motivated by any sort of company-private information.
First, I am surprised that Bill didn't make this transition sooner. I think it's admirable how he's been able to maneuver through the technology details for so long. He refused to completely give up his technical edge. And I think it's cool that he can venture out into entirely new verticals with the premise in hand that all you need is a bunch of really smart, motivated people to succeed. It's risky. And it's a sort of an antithesis to traditionalist business management views.
From talking to colleagues about this news, however, I think people tend to downplay the importance of having Bill around. Externally, there's no doubt he's a huge part of our PR, whether you like him or not. That's probably why the stock has been flat after the announcement. Not having him in charge of technical direction may open up some new avenues that we wouldn't have otherwise explored. New blood is always healthy. At the same time, though, it takes a functioning system and runs the risk of disrupting it. But I'm more worried about the internal climate...
Just as Dave Cutler is a god to the NT Team, Bill is a role model for every technical person in the company. He's a geek. He talks like one, he looks like one, and he acts like one. He was very successful at a young age, was self-taught, and didn't need college to do and succeed at what he loved. And he has an inconceivable level of power and influence both within and outside the company. For a place that's full of uber-geek MS-for-lifers who joined the firm straight out of college, and who wouldn't think about leaving (with Bill around at least), all of these are very important traits. While Ray Ozzie is a very accomplished and intelligent guy, he's missing almost all of the traits I mentioned above. And I think people will notice.
In the past month alone, I've had a BillG Review and received feedback from him on two spring ThinkWeek papers that I submitted. These were my first personal interactions with Bill, and perhaps my last. I was impressed. He's scary smart. There's no doubt he's very clever and can effortlessly cut through complexity to understand the core of really deep technical problems. There's no way the new senior technical leadership will have the same traits and to the same degree. It's not that they aren't great people. I've had several meetings with Craig Mundie recently, our CTO, and he's extraordinarily insightful and talented. I found myself blown away by some points he was making. But Bill's just too good to beat.
So here's the gist of it all. Microsoft in the past has been a technically motivated company. Bill's passion was around how we could use technology to change the world. But at the same time, he cared about how that technology was architected and built. He didn't simply spew MBA mumbo-jumbo. Microsoft feels like a company full of 100s of start-ups, each of which reports to the same technical leader, all fighting tooth and nail to build the greatest technology possible. I think all of that is going to change. I conjecture that we're at the beginning of a major shift, where Microsoft will slowly evolve from a technology-driven company to a business-driven company. The two are not mutually exclusive, obviously, but the balance will shift. We'll do more projects based on business reasons and less based on pure technology reasons. We'll waste less money in the process. It had to happen sooner or later. But for the geeks like me, I have to wonder whether it will remain as much of an enjoyable place to work. Or whether those who are looking for such an environment will be forced to go elsewhere...
 Wednesday, June 21, 2006
Windows Vista has some great new features for concurrent programming. For those of you still writing native code, it's worth checking them out. For those writing managed code, we have a bunch of great stuff in the pipeline for the future, but unfortunately you'll have to wait. Or convert (back) to the dark side.
The Vista features include:
1. Reader/writer locks. The kernel32 function InitializeSRWLock takes a pointer to a SRWLOCK structure, just like InitializeCriticalSection, and initializes it. AcquireSRWLockExclusive and AcquireSRWLockShared acquire the lock in the specific mode and ReleaseSRWLockXXX releases the lock. This is a "slim" RW lock, meaning it's actually comprised of a pointer-sized value, and is ultra-fast and lightweight, much like existing Win32 CRITICAL_SECTIONs. It should be about the cost of a single interlocked operation to acquire. E.g.
SRWLOCK rwLock; InitializeSRWLock(&rwLock); AcquireSRWLockShared(&rwLock); // ... shared operations ... ReleaseSRWLockShared(&rwLock);
2. Condition variables. These integrate with RW locks and critical sections, enabling you to do essentially what you can already do with Monitor.Wait/Pulse/PulseAll. InitializeConditionVariable takes a pointer to a CONDITION_VARIABLE and initializes it. SleepConditionVariableCS and SleepConditionVariableSRW release the specified lock (either CRITICAL_SECTION or SRWLOCK) and wait on the condition variable as an atomic action. When the thread wakes up again, it immediately attempts to acquire the lock it released during the wait. WakeConditionVariable wakes a single waiter for the target condition and WakeAllConditionVariable wakes all waiters, much like Pulse and PulseAll. E.g.
Buffer * pBuffer = ...; PCRITICAL_SECTION pCsBufferLock = ...; PCONDITION_VARIABLE pCvBufferHasItem = ...;
// Producer code: EnterCriticalSection(pCsBufferLock); while (pBuffer->Count == 0) { SleepConditionVariableCS(pCvBufferHasItem, pCsBufferLock, INFINITE); } // process item... LeaveCriticalSection(pCsBufferLock);
// Consumer code: EnterCriticalSection(pCsBufferLock); pBuffer->Put(NewItem()); LeaveCriticalSection(pCsBufferLock); WakeAllConditionVariable(pCvBufferHasItem);
More details on condition variables can be found on MSDN.
3. Lazy/one-time initialization. This allows you to write lazy allocation without fully understanding memory models and that sort of nonsense. The new APIs in kernel32, InitXXX, support both synchronous and asynchronous initialization. These have some amount of overhead for the initialization case due to the use of a callback, but in general this will be fast enough for most lazy initialization and much less error prone. Herb Sutter has proposed a similar construct for the VC++ language, and to be honest I wish we had this built-in to C# too. See the MSDN docs for an example and more details.
4. An overhauled thread pool API. The Windows kernel team has actually rewritten the thread pool from the ground up for this release. Their APIs now support creating multiple pools per process, waiting for queues to drain or a specific work item to complete, cancellation of work, cancellation of IO, and new cleanup functionality, including automatically releasing locks. It also has substantial performance improvements due to a large portion of the code residing in user-mode instead of kernel-mode. MSDN has a comparison between the old and new APIs.
5. A bunch of new InterlockedXXX variants.
6. Application deadlock detection. This is separate from the existing Driver Verifier ability to diagnose deadlocks in drivers. This capability integrates with all synchronization mechanisms, from CRITICAL_SECTION to SRWLOCK to Mutex, and keys off of any calls to XXXWaitForYYYObjectZZZ. Unfortunately, I think this is new to the latest Vista SDK, and thus there isn't a lot of information available publicly. This could probably make a good future blog post if there's interest.
Have fun with this stuff, of course. But be careful. Don't poke an eye out.
 Tuesday, June 20, 2006
Jim Johnson started a series back in January that I’m dying to see continued. It's about writing resource managers in System.Transactions, which surprisingly turns out to be incredibly straightforward. Provided you are able to implement the correct ACI[D] transactional qualities for the resource in question, that is. Juval Lowy’s December 2005 MSDN Magazine article on volatile resource managers described how to build what turns out to be essentially mini-transactional memory, without much of the syntax, implicit and transitive qualities, and robustness.
As an example of where you might use a resource manager, imagine that you wanted to ensure that any memory allocations and deallocations inside a transaction scope participate with the System.Transactions ambient transaction. Maybe you'd like your allocations to be in sync with the database server or web service to which you’re also transacting access. I’ll walk through an example of how straightforward writing such a resource manager can be.
First, our starting class is quite simple. It just allocates and frees memory. Sans transactions, it looks like this:
using System.Runtime.InteropServices;
public static class Mm { public static IntPtr Malloc(long bytes) { return Marshal.AllocHGlobal(new IntPtr(bytes)); } public static void Free(IntPtr pp) { Marshal.FreeHGlobal(pp); } }
Mm.Malloc returns a pointer to ‘bytes’ amount of memory via kernel32!GlobalAlloc (which turns out to be a crappy way to manage memory by the way, and is still alive only to support DDE, the clipboard, and OLE, or so I’m told; it works as an example though). Mm.Free takes a pointer to memory that was previously allocated via Mm.Malloc and frees it. Pretty simple.
OK, that’s not incredibly useful, especially considering that we’re just making single-line invocations to the Marshal class. But it’s a starting point.
Ultimately, what we want to ensure is that at the end of a transaction, any memory allocation and deallocation that happened within the transaction is consistent with the outcome of that transaction. That means, quite simply, that if memory was allocated and the transaction commits, we keep the memory allocated around; but if, on the other hand, the transaction rolls back, we must undo the allocation. Similarly, if we free memory and the transaction commits, then the memory remains freed; if it rolls back, we must undo the freeing.
If we want to build such a thing directly on top of existing facilities we clearly can’t do this precisely as I suggest. How do you undo a call to free in the CRT, for example? You can’t. Once you call free, the memory's gone, returned to the pool, and possibly used before your transaction even knows what to do with itself. But it turns out that we can "fake it" sufficiently close enough that most people can’t tell that we're faking it. Here’s what we do instead:
- When somebody allocates memory, we log a compensating action in the transaction that frees the memory should we roll back. If the transaction commits, we do nothing more.
- When somebody frees memory, we defer the call to commit time. If it never commit, we never free the memory.
This is fairly well known in database literature. Take a look at Jim Gray’s 1980 paper, A Transaction Model, where he describes REDO and UNDO actions, to see what I mean. (1980! That was ages ago.) What we’re saying basically is that allocation logs an UNDO action and freeing logs a REDO action. The isolation leaks out of this in some regard--evidence of our "faking it"--because the fact that our freed memory isn’t instantaneously available to other allocations might be noticed, especially under high stress conditions. OOMs may result that would have not otherwise happened, and the working set of the program may increase, especially for long running transactions. Cest la vie.
Anyhow… on to the implementation of these ideas. It’s surprisingly simple.
We will allow instances of our Mm class to be created by the implementation. From the viewpoint of a user, the class is still entirely static and cannot be constructed. These instances will become enlistments responsible for implementing transactional semantics and responding to certain event notifications from the System.Transactions machinery. To do so, the class must implement the System.Transactions interface IEnlistmentNotification:
using System; using System.Collections.Generic; using System.Runtime.InteropServices; using System.Transactions;
public sealed class Mm : IEnlistmentNotification {
/** Fields **/
private LinkedList<IntPtr> m_freeOnCommit = new LinkedList<IntPtr>(); private LinkedList<IntPtr> m_freeOnRollback = new LinkedList<IntPtr>();
[ThreadStatic] private static Dictionary<Transaction, Mm> s_currentMm;
/** Constructors **/
private Mm() {}
/** Methods **/
public static IntPtr Malloc(long bytes) { … }
public static void Free(IntPtr pp) { … }
}
We’ve added two linked lists to hold the deferred (m_freeOnCommit) and compensating actions (m_freeOnRollback). And we have a thread-static dictionary that maps the current transaction to the enlisted instance of Mm. This is pretty straightforward stuff, although there are a plethora of alternative designs. Now let’s see how we get data into these things. The Malloc and Free implementations will change slightly to check for an existing transaction:
public static IntPtr Malloc(long bytes) { IntPtr pp = Marshal.AllocHGlobal(new IntPtr(bytes));
// If insufficient memory, OOM is thrown and we never log the free. Mm mm = GetCurrentMm(); if (mm != null) { // Compensating activity to ensure that if we rollback, we free. mm.m_freeOnRollback.AddLast(pp); }
return pp; }
public static void Free(IntPtr pp) { Mm mm = GetCurrentMm(); if (mm != null) { // We defer the freeing of memory in case we don't commit. mm.m_freeOnCommit.AddLast(pp); } else { Marshal.FreeHGlobal(pp); } }
This implements the commit and rollback behavior I described above, i.e. we add the memory location to free on to the deferred or compensated list according to the rules we’ve already established. GetCurrentMm is responsible for lazily allocating and enlisting an instance of Mm. If there is no active ambient transaction, it just returns null:
private static Mm GetCurrentMm() { // Are we in a transaction? Transaction currTx = Transaction.Current; if (currTx == null) { // Return null to indicate we're not in a transaction. return null; }
// Have we already allocated and enlisted a volatile RM for this transaction? Mm currMm = null; if (s_currentMm == null) { s_currentMm = new Dictionary<Transaction, Mm>(); } else { s_currentMm.TryGetValue(currTx, out currMm); }
// No RM found, create/enlist one. if (currMm == null) { currMm = new Mm(); s_currentMm.Add(currTx, currMm); currTx.EnlistVolatile(currMm, EnlistmentOptions.None); }
return currMm; }
And of course we have a RemoveCurrentMm which will be used eventually to remove the enlistment information from our dictionary:
private static void RemoveCurrentMm() { Transaction currTx = Transaction.Current; if (currTx != null && s_currentMm != null) { s_currentMm.Remove(currTx); } }
So now we have all of the information about what should be freed and when, but there’s no code that actually executes the free operations. To do that, all we have to do is implement the IEnlistmentNotification interface properly, iterating the proper list, and invoking Malloc.FreeHGlobal on the contents. In other words, Commit and Rollback just invoke free on all of the memory addresses in the respective linked list:
void IEnlistmentNotification.Commit(Enlistment enlistment) { FreeAll(m_freeOnCommit); RemoveCurrentMm(); enlistment.Done(); }
void IEnlistmentNotification.Rollback(Enlistment enlistment) { FreeAll(m_freeOnRollback); RemoveCurrentMm(); enlistment.Done(); }
private void FreeAll(LinkedList<IntPtr> toFree) { foreach (IntPtr p in toFree) { Marshal.FreeHGlobal(p); } }
We’re assuming in all of those cases that FreeAll and RemoveCurrentMm can’t fail. If our commit or rollback processing failed mid-way, that would put the entire process at risk: memory could be leaked or become corrupt. System.Transactions will respond to that by sending InDoubt notifications to all enlistments. Since the only way we can potentially contain and resolve volatile state corruption is to crash the process, that’s exactly what we do:
void IEnlistmentNotification.InDoubt(Enlistment enlistment) { Environment.FailFast("State protected by RM is in question"); }
This is a Byzantine response, sure, but it’s the only way we can guarantee that state doesn’t become corrupt when the transaction’s fate is InDoubt. If the fate of one or more resource managers cannot be determined, we don't know whether to commit or fail for sure. We could guess, of course, but guessing doesn’t lead to pleasant behavior in software, especially when we have to debug it. (And if you're making guesses, you'll probably have to spend more time debugging, so it's a double whammy of sorts.)
And that’s it! Now we can use memory operations inside of a transaction, and have it behave as expected. Just as an example, this test case ensures that writing to memory that was allocated in a transaction that eventually aborts causes an AccessViolation:
bool test1success = false; IntPtr pMem1 = IntPtr.Zero; try { try { using (TransactionScope txScope = new TransactionScope()) { pMem1 = Mm.Malloc(1024 * 1024); // get 1MB of space throw new Exception(); // cause an abort } } catch { // The txn was aborted, we expect reading from memory to fail. uint * pInt = (uint *)pMem1.ToPointer(); *pInt = 0xdeadbeef; } } catch (AccessViolationException) { test1success = true; } Console.WriteLine("Test 1 succeeded: {0} (rollback of malloc)", test1success);
There are three other tests that may be of interest in the source file for the Mm class and associated code: MallocFree.cs.
Of course this approach is not perfect in several areas. One that comes to mind immediately is the fact that we're doing a potentially expensive lookup for an ambient transaction on every memory allocation and deletion, which could be too much, especially if it happened in some general purpose allocation and deallocation routines. And of course automatically finding the transaction and using might also be a bad idea. We might instead want the user to opt-in to transactional Malloc and Free at the callsite, so that users aren't surprised when their malloc or free never happens (the transaction rolled back). Nevertheless, this article at least cracked the surface of a very difficult problem and surfaced some interesting issues.
 Saturday, June 17, 2006
Ntdll exports an undocumented function from WinNT.h:
PTEB NtCurrentTeb();
This gives you access to the current thread's TEB (thread environment block), which is a per-thread data structure that holds things like a pointer to the SEH exception chain, stack range, TLS, fiber information, and so forth. This function actually returns you a PTEB, which is defined as _TEB*. _TEB is an internal data structure defined in winternl.h, and consists of a bunch of byte arrays. You can cast this to PNT_TIB (defined as _NT_TIB*), which gives you access to the data in a strongly typed way. And _NT_TIB is a documented data structure, unlike _TEB, meaning you can actually rely on it not breaking between versions of Windows.
For example, this code prints out the current thread's stack base and limit. The base is the start of the user-mode stack, and the limit is the last committed page, which grows as you use more stack:
PNT_TIB pTib = reinterpret_cast<PNT_TIB>(NtCurrentTeb()); printf("Base = %p, Limit = %p\r\n", pTib->StackBase, pTib->StackLimit);
There's a shortcut you can take. You can always find a pointer to the TEB in the register FS:[18h]:
PNT_TIB pTib; _asm { mov eax,fs:[18h] mov pTib,eax } printf("Base = %p, Limit = %p\r\n", pTib->StackBase, pTib->StackLimit);
There's an even shorter shortcut you can take. You can actually find the base and limit in different segments of the FS register, FS:[04h] for the base and FS:[08h] for the limit:
void * pStackBase; void * pStackLimit; _asm { mov eax,fs:[04h] mov pStackBase,eax mov eax,fs:[08h] mov pStackLimit,eax } printf("Base = %p, Limit = %p\r\n", pStackBase, pStackLimit);
Unfortunately, the _asm keyword is not supported on all architectures, so the above code is only guaranteed to work on x86 (e.g. the VC++ Intel Itanium compiler doesn't support it). Furthermore, the hardcoded offsets 04h and 08h are clearly wrong on 64-bit: you need more than 4 bytes to represent a 64-bit pointer. NtCurrentTeb hides all of this and uses whatever platform-specific technique is needed to retrieve the information.
Matt Pietrek's 1996 and 1998 Microsoft Systems Jounral articles are the best reference I could find on TEBs, aside from the Windows Internals book.
Believe it or not, this is useful information. I wrote some code recently that took a different code path based on whether it was writing to the stack or the heap, and using the TEB does the trick.
I have written about 7 pages on user-mode stacks in my upcoming concurrency book. This ranges from CLR stack frames to stack overflow to just how stacks work internally in Windows. I haven't found any book or resource that collects all of this information together in one place. It turns out that most developers don't need to worry about stacks at all, but this understanding is crucial to moving forward to more advanced concurrency programming models.
|
|
Recent Entries:
Search:
Browse by Date:
| | Sun | Mon | Tue | Wed | Thu | Fri | Sat | | 25 | 26 | 27 | 28 | 29 | 30 | 1 | | 2 | 3 | 4 | 5 | 6 | 7 | 8 | | 9 | 10 | 11 | 12 | 13 | 14 | 15 | | 16 | 17 | 18 | 19 | 20 | 21 | 22 | | 23 | 24 | 25 | 26 | 27 | 28 | 29 | | 30 | 31 | 1 | 2 | 3 | 4 | 5 |
Browse by Category:
Notables:
|