|
Personal Info:
Joe  leads the architecture of an experimental OS's developer platform, where
he is also chief architect of its programming language. His current mission is to enable
writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe
founded the Parallel Extensions to .NET project.
He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife,
writing books, writing music,
studying music theory & mathematics, and doing anything involving food & wine.
My books
My music
Disclaimer:
The content of this site are my own personal opinions and do
not represent my employer's view in anyway.
© 2012, Joe Duffy
|
|
 Sunday, June 11, 2006
After posting my last article on creating a lazy allocation IAsyncResult, I received a few mails on the ordering of the completion sequence. It was wrong and has been updated. Thanks to those who pointed this out.
I used the following incorrect ordering: 1) set IsCompleted to true, 2) invoke the callback, and 3) signal the handle.
This can lead to deadlocks if the callback waits on the handle. My implementation carefully avoided EndXxx-induced deadlocks (by checking IsCompleted before waiting on the handle), but if the callback directly WaitOne's on the IAsyncResult.AsyncWaitHandle property, the callback will of course deadlock. Directly accessing the handle might be attractive to the callback author, especially for higher level orchestration via WaitAny and WaitAll. So it's probably something we'd like to support. One way to avoid this is to invoke the callback asynchronously with BeginInvoke, but a better solution is to use the correct ordering instead.
The correct ordering is: 1) set IsCompleted to true, 2) signal the handle, and 3) invoke the callback.
The first version I wrote had the correct ordering, since that seemed to be the logical choice. Unfortunately, the Framework Design Guidelines lists the steps in the wrong order, which led me down that path. I've let Brad and Krzys about this. A customer who read my blog actually mailed Brad about this error too, just about simultaneously. There may be rationale behind this, but we've used the correct ordering in the file and network IO APIs since V1.0 so I think it's just wrong.
It's worth pointing out that the network classes already use a lazy allocation scheme very similar to the one I wrote about. Check out the System.Net.LazyAsyncResult internal type in System.dll. I'm advocating for moving file IO onto the same plan in the next release of the Framework. We'll see how it turns out.
Lastly, some might notice I originally said this would be a two part series. Well, I wrote a whole bunch of code to implement a sophisticated LRU-based resurrection caching scheme--to avoid allocating IAsyncResults every time--and then realized that my example doesn't do anything expensive on IAsyncResult creation that would warrant such a thing. The result? It was actually slower than the ordinary lazy version I posted a couple weeks back. I think the techniques I used are interesting nonetheless, so I am going to try and rework the example to incorporate some expensive buffer management, and then see where I stand. But I'm not promising anything just yet. And this was a great reminder that solving actual profiled problems is always more worthwhile than solving perceived, yet unmeasured, non-problems.
 Tuesday, May 30, 2006
Lock free code is hard. But it can come in handy in a pinch.
There have been some recent internal discussions about the IAsyncResult pattern and performance. Namely that, for high throughput scenarios, where the cost of the asynchronous work is small relative to the cost of instantiating new objects, there is a considerable overhead to using the IAsyncResult pattern. This is due to two allocations necessary to implement the pattern: (1) the object that implements IAsyncResult itself, and (2) the WaitHandle for consumers of the API who must access the IAsyncResult.AsyncWaitHandle property. I will address (2) in this article, since it is much more expensive than (1).
Update: I've posted an addendum to this article here.
Rendezvousing
Just to recap, there are four broad ways to rendezvous with the IAsyncResult pattern:
- You can poll the IAsyncResult.IsCompleted boolean flag. If it's true, the work has completed. If it's false, you can go off and do some interesting work, coming back to check it once in a while.
- Supplying a delegate callback to the BeginXxx method. This callback is invoked when the work completes, passing the IAsyncResult as an argument to your callback.
- Waiting on the IAsyncResult.AsyncWaitHandle. This is a Windows WaitHandle, typically a ManualResetEvent, which allows you to block for a while until the work completes.
- Call the EndXxx method. Internally, this will often check IsCompleted and, if it's false, will wait on the AsyncWaitHandle.
Notice that in cases 1 and 2, the WaitHandle isn't even needed. And in case 4, it's only needed some fraction of the time. Well, it turns out we can avoid allocating it altogether for those cases where it's not used. We can "just" lazily allocate it. Note that for asynchronous IO, the majority of code will use method 2 above. For scalable servers, we often don't want to tie up an extra thread polling or waiting for completion, since that contradicts the primary benefits of Windows IO Completion Ports.
The Requirements
Notice I enclosed the word just above in quotes when mentioning lazy allocation. We could of course use a lock. But that would require that we allocate an object to lock against. We could of course just lock 'this', but that also comes with a performance overhead. We can get away with lock free code in this case, so long as we recognize a very important race condition that we must tolerate. Imagine this case: Thread A checks IsCompleted. It's false. So it accesses the AsyncWaitHandle property, triggering lazy allocation. Meanwhile, Thread B finishes the async work, and sets IsCompleted to true. We need to ensure a deadlock doesn't ensue.
This race could go one of two ways:
- Thread A lazily allocates and publishes the WaitHandle before Thread B sets IsCompleted to true. Thread B must now witness a non-null WaitHandle when it checks, and it must return the WaitHandle in the signaled state. If it returns an unsigned WaitHandle, Thread A will wait on it, and never be woken up. This is a deadlock.
- Thread B finishes first, setting IsCompleted to true, and seeing a null WaitHandle. Thread A must see IsCompleted as true and consequently return the event in a signaled state. Just like before, if this doesn't happen, Thread A will wait on an unsigned WaitHandle which will never be signaled. Deadlock.
To ensure both of these cases work, Thread A's read of the WaitHandle field and Thread B's read of IsCompleted must be preceded by a memory barrier. This ensures the memory accesses aren't reordered at the compiler or processor level, either of which could lead to the deadlock situations we are worried about. The CLR 2.0's memory model is not sufficient even with volatile loads, because the load acquire can still move "before" the store release.
An Implementation
Here is one simplistic implementation of a FastAsyncResult class, with ample comments embedded within to explain things:
class FastAsyncResult : IAsyncResult, IDisposable {
// Fields
private object m_state;
private ManualResetEvent m_waitHandle;
private bool m_isCompleted;
private AsyncCallback m_callback;
internal object m_internal;
// Constructors
internal FastAsyncResult(AsyncCallback callback, object state) {
m_callback = callback;
m_state = state;
}
// Properties
public object AsyncState {
get { return m_state; }
}
public WaitHandle AsyncWaitHandle {
get { return LazyCreateWaitHandle(); }
}
public bool CompletedSynchronously {
get { return false; }
}
public bool IsCompleted {
get { return m_isCompleted; }
}
internal object InternalState {
get { return m_internal; }
}
// Methods
public void Dispose() {
if (m_waitHandle != null) {
m_waitHandle.Close();
}
}
public void SetComplete() {
// We set the boolean first.
m_isCompleted = true;
// And then, if the wait handle was created, we need to signal it. Note the
// use of a memory barrier. This is required to ensure the read of m_waitHandle
// never moves before the store of m_isCompleted; otherwise we might encounter a
// race that leads us to not signal the handle, leading to a deadlock. We can't
// just do a volatile read of m_waitHandle, because it is possible for an acquire
// load to move before a store release.
Thread.MemoryBarrier();
if (m_waitHandle != null) {
m_waitHandle.Set();
}
// If the callback is non-null, we invoke it.
if (m_callback != null) {
m_callback(this);
}
}
private WaitHandle LazyCreateWaitHandle() {
if (m_waitHandle != null) {
return m_waitHandle;
}
ManualResetEvent newHandle = new ManualResetEvent(false);
if (Interlocked.CompareExchange(ref m_waitHandle, newHandle, null) != null) {
// We lost the race. Release the handle we created, it's garbage.
newHandle.Close();
}
if (m_isCompleted) {
// If the result has already completed, we must ensure we return the
// handle in a signaled state. The read of m_isCompleted must never move
// before the read of m_waitHandle earlier; the use of an interlocked
// compare-exchange just above ensures that. And there's a race that could
// lead to multiple threads setting the event; that's no problem.
m_waitHandle.Set();
}
return m_waitHandle;
}
}
Notice also that we tolerate the race where two threads try to lazily allocate the handle. Only one thread wins. The thread that loses is responsible for cleaning up after itself so that we don't "leak" a WaitHandle (requiring finalization to close it). This is an example of tolerating races instead of preventing them, and is similar to the design we use for jitting code in the runtime, for example.
Some Initial Results
I'll do a more thorough analysis as follow up to my next post, including profiling traces. But the initial results are promising.
I wrote a test harness that calculates the fibonacci series asynchronously, based on the sample code used in the Framework Design Guidelines book. As you can see by the comparisons, the larger the series being calculated, the less substantial the impact my performance work makes:
Size Normal Lazy Improvement 1 177 ms 62 ms 185% 2 179 ms 63 ms 184% 3 180 ms 65 ms 177% 4 181 ms 66 ms 174% 5 187 ms 69 ms 171% 6 195 ms 79 ms 147% 7 210 ms 97 ms 116% 8 239 ms 122 ms 96% 9 275 ms 165 ms 67% 10 356 ms 257 ms 39% 12 745 ms 631 ms 18% 15 3217 ms 3075 ms 05%
As we would expect, as the ratio of the cost of computation to the cost of allocating the WaitHandle increases (with an increased "size" of the fibonacci series being calculated), the observed performance improvement also decreases. For very small computations, however, this technique can really pay off. In the case of high performance asynchronous IO, for example, where completion often involves simply marshaling some bytes between buffers, this can be a key step in the process of improving system throughput.
As I noted earlier, lock free techniques are almost never worth the trouble unless you've measured a problem, especially due to the maintenance and testing costs for such code. And I jumped right over using a lock for allocation. It turns out in this particular scenario, that technique fares just as well as the lock free code, albeit with a lot more simplicity. It only incurs slight overhead when checking the handle to see if it has been allocated yet as well as when setting it at completion time. Since my test case never has to check the WaitHandle, it only has to enter the lock upon completion, which is relatively cheap. As always, start simple and then go from there.
C# 1.0 shipped with the ability to stack allocate data with the stackalloc keyword, much like C++'s alloca. There are restrictions, however, around what you can allocate on the stack: Inline arrays of primitive types or structs that themselves have fields of primitive types (or structs that etc...). That's it. C# 2.0 now allows you to embed similar inline arrays inside other value types, even for those that are allocated inside of a reference type on the heap, by using the fixed keyword.
And of course, you can allocate arrays of those value types on the stack too:
using System;
unsafe class Program {
struct A {
internal int x;
internal fixed byte y[1024];
}
public static void Main() {
byte * bb = stackalloc byte[2048];
Console.WriteLine("&bb : {0:X}", (uint)&bb);
Console.WriteLine("&bb[1] : {0:X}", (uint)&bb[1]);
Console.WriteLine("&bb[2048] : {0:X}", (uint)&bb[2048]);
A * a = stackalloc A[2048];
Console.WriteLine("&a : {0:X}", (uint)&a);
Console.WriteLine("&a->x : {0:X}", (uint)&a->x);
Console.WriteLine("&a->y[0] : {0:X}", (uint)&a->y[0]);
Console.WriteLine("&a->y[2048] : {0:X}", (uint)&a->y[2048]);
Console.WriteLine("&a[1] : {0:X}", (uint)&a[1]);
Console.WriteLine("&a[2048] : {0:X}", (uint)&a[2048]);
}
}
The use of this is of course almost always limited to unmanaged interop scenarios. For example, there's at least one place in the BCL where we use this to stack allocate the binary layout of a security descriptor that we then pass into the Win32 CreateMutex API, which avoids having to create a new interop struct. (Whether such hacks are a good thing to put in our code-base is another topic altogether...)
The stack allocated data doesn't outlive the stack frame, so as soon as you return from the function in which the stackalloc occurs, the data is gone. If you pass a pointer to it and somebody stores it, they could later try to dereference a pointer into dead (and possibly since reused) stack space. And reading too far can lead to buffer over- or underflows which bash other data on the stack. Using this requires compilation with /unsafe, and needless to say, you need to be careful with it (if not avoid it altogether).
 Wednesday, May 24, 2006
In managed code, you can pass ByRefs "down the stack." You can't do much aside from that, however, other than use things like ldind and stind on them. And of course, you can cast them to native pointers, store them elsewhere, and so on, but those sorts of (evil) things are unverifiable.
Right? Well, sort of.
In Whidbey, we made a change to the verification rules such that a function can now return a ByRef to a caller. This of course is safe so long as the ByRef doesn't refer to a stack location. A field ref inside a heap-allocated object, static field ref, or an array element ref are all just fine. And of course, a function can just return a ByRef that was passed to it as an argument. Take a look at this IL:
.assembly extern mscorlib {} .assembly byrefret {}
.class Program extends [mscorlib]System.Object {
.field static int32 s_x
.method static void Main() { .entrypoint
call int32& Program::f() ldind.i4 call void [mscorlib]System.Console::WriteLine(int32)
call int32& Program::g() ldind.i4 call void [mscorlib]System.Console::WriteLine(int32)
ret }
.method static int32& f() { ldsflda int32 Program::s_x ret }
.method static int32& g() { .locals init (int32 x) ldloca x ret }
}
Function f verifies just fine, since it just returns a ByRef to a static field, whereas function g fails verification, because it returns a ByRef to a local on the stack. You actually can't write code to produce the IL shown above from any of Microsoft's compilers except for VC++, i.e. C# won't let you say "static ref int f() { return ref s_x; }".
Now, why would you ever want such a thing? VC++ needed it, for example, to implement STL.NET. Traditional STL returns references to elements inside of internal data structures, which can subsequently be modified. Without this support, such values would need to be copied, or the STL.NET APIs would have had to deviate from the traditional STL APIs.
Interestingly, this doesn't change the ECMA Specification. It's always been loose on the issue, saying in Section 12.1.1.2 of Partition I: "Verification restrictions guarantee that, if all code is verifiable, a managed pointer to a value on the evaluation stack doesn't outlast the life of the location to which it points." Since you can't return a ByRef to a stack location, we don't violate this guarantee. Our previous verifier was simply being overly strict.
 Saturday, May 20, 2006
Via DBox and TBray, I stumbled upon Will Continuations Continue?, a great essay about why continuation support in modern VMs is not a good idea after all:
"By far he most compelling use case for continuations are continuation-based web servers. ... Rather than relying on the server’s stack to keep track of what location we’re looking at, the [UI] will be a view on a model ... When you pressed “Buy”, it would pass all the information necessary to complete the transaction onto the server. Consequently, we’ll have no more of a pressing need for continuations than traditional applications have today."
I couldn't agree more, although I arrive at the conclusion via a different line of thought.
Just over a year ago now, I was working on continuations for my Scheme interpreter and compiler, Sencha. I managed to create something that "worked" -- in the sense that the stack could be captured, passed around, and restored; and it even still reported locals as roots to the GC -- but there are so many facets of a modern runtime to consider that true product support would be a massive undertaking. I thought continuations were a good idea. Why? To be honest, the main reason was my simple goal of having a full-fidelity Scheme runtime. But I also admired their power.
In retrospect, I now realize something important: the stack is evil. It's a horrible representation of state, especially for web applications.
The stack is unnecessarily bound to an OS thread, and munges control flow with the "state" of the program. The fact that return addresses for function calls lives on it has been the source of many security problems and counter-measures (/GS). When a thread blocks, the entire stack is wasted, even if there is logical work on it that could progress if it weren't for the arbitrary physical association. There's so much crap on it that to summarize the state of your entire program often requires pausing threads and walking their stacks. How dirty and impolite! Freak-of-nature abominations have twisted what the stack was meant for, e.g. COM and GUI reentrancy and APCs, completely disassociating logical and physical representations. You have to reserve a contiguous chunk of the thing per thread (often 1MB), wasting virtual memory space, because Windows doesn't support linked stack regions (not as big a deal on 64-bit as on 32-bit, sure), which also leads to the CLR ripping the process if you ever exceed it (overflow).
So many problems we encounter with parallel programming (among other domains) would go away with a more structured representation of the program as a state machine.
Dharma and the rest of the WF team are delivering just that (in the large). C# 2.0's iterator feature supplies a similar capability (in the small). The Concurrency and Coordination Runtime (CCR) eschews stack in favor of orchestration and message passing. We'll converge at some point. And it won't be around serializing stacks, it will be around getting rid of the damned things.
 Friday, May 19, 2006
It's been a while since I last posted a "recent reads" list.
| Show-Stopper! The Breakneck Race to Create Windows NT and the Next Generation at Microsoft -- G. Pascal Zachary |
 |
10 of 10. I read this book in nearly one sitting. I couldn't put it down. This book details the story of the conception, design, and implementation of the Windows NT OS. It's a great "from the trenches" report of what it must have been like to work on the project, and stars many familiar faces, not the least of which is Dave Cutler. Some DevDiv familiar show up, such as David Treadwell (was dev on WinSock, now VP for the .NET Framework) and S. Somasegar (was tester, now VP for DevDiv), among many Windows core architects who are still at the company today. It's out of print, but I found a like-new copy for a few bucks (yaay). |
| The Calculi of Lambda Conversion (AM-6, Annals of Mathematics Studies) -- Alonzo Church |
 |
10 of 10. λ! Who am I to rate such a classic book? This is the seminal work for all modern functional languages (LISP is modern, yes?). I had to read it twice... carefully... to follow everything. (Perhaps I'm slow?) But it's only 77 pages. The text covers α conversion, β reduction, and η conversion, in addition to normal and head normal forms. And the best part of all? It's very concise, not very wordy, and follows a nice, natural progression. |
| Chances Are...: Adventures in Probability -- Michael Kaplan, Ellen Kaplan |
 |
8 of 10. This book is fun. It's a bit lighter on the math than I'd prefer, but nevertheless offers a great historical insight into the evolution of probability. It begins in the 1600s, and details its origins in mathematics and science, and -- surprise! -- its practical use as a tool for gamblers. Eventually it discusses impacts to more interesting parts of society, such as the development of an insurance industry, evaluation of new drugs, and combat and war. A welcome break from my typical computer nerd books. |
| Dr. Euler's Fabulous Formula: Cures Many Mathematical Ills -- Paul J. Nahin |
 |
8 of 10. Ahhh, a book after my heart. A quote from the opening says it all:
I used to think math was no fun 'Cause I couldn't see how it was done Now Euler's my hero For I now see why zero Equals eπi + 1
The book details the historical development and importance of Euler's formula. Throughout, there is quite a bit of description-by-example by way of complex number mathematics, in addition to great historical accounts. |
| Why Most Things Fail: Evolution, Extinction, and Economics -- Paul Ormerod |
 |
7 of 10. First, let me admit: I was a little disappointed by the broad title and relative narrow focus of the text. While some correlation is drawn between evolution, extinction, and economics, most of the book is spent describing why uncertainty in business--and the aparrent disregard for such uncertainty in commonplace naive business theory--leads to failure. He also uses examples from game theory on other related topics to draw such conclusions. The book should have been much longer, as I found myself at the end wondering, did I miss some big pieces? With that said, much of it is unique content backed by real research, so I'm sure developing the ideas took quite a bit of time. |
| The New Turing Omnibus: Sixty-Six Excursions in Computer Science -- A. K. Dewdney |
 |
6 of 10. I don't think I learned a whole lot from this particular book, but it was at least entertaining to read. I brought it along with me on a trip, and liked the format: Short, concise, often under 5-page essays on some topic in computer science. While I was traveling, this enabled me to pick it up and read an entire essay when I had only a brief period of time. The topics do range quite dramatically, and the content is a little "dumbed down," but it is a great coffee table addition. |
| The Devil's Cup: A History of the World According to Coffee -- Stewart Lee Allen |
 |
8 of 10. OK, this is definitely the odd book out. But I read it in about a day and a half, last weekend, and couldn't put it down. The book really isn't as much about coffee as it is about the author's crazy travels from Africa to Yemen to Europe and back to the US, in search of the "local brews." Quite a bit of historical insight is given, and it's a fun ride. I enjoyed it, and it was a much needed break from the techno babble and funny symbols. :) |
 Monday, May 15, 2006
The use of parallel spreadsheet calculations in Excel 12 is a great example of how software vendors can use multi-core CPUs to vastly improve the user experience.
There is some great stuff here: Intelligent parallel execution based on dependency analysis; near-linear speedup for spreadsheets with minimal dependencies; an extension model, where user-defined functions can be written either thread-safe or thread-unsafe, and be treated accordingly by the engine; user-defined thread-counts for functions that perform blocking operations; among others.
 Sunday, May 07, 2006
One of the challenges when designing reusable software that employs hidden parallelism -- such as a BCL API that parallelizes sorts, for-all style data parallel loops, and so forth -- is deciding, without a whole lot of context, whether to run in parallel or not. A leaf level function call running on a heavily loaded ASP.NET server, for example, probably should not suddenly take over all 16 already-busy CPUs to search an array of 1,000 elements. But if there's 16 idle CPUs and one request to process, doing so could reduce the response latency and make for a happier user. Especially for a search of an array of 1,000,000+ elements, for example. In most cases, before such a function "goes parallel," it has to ask: Is it worth it?
Answering this question is surprisingly tough. Running parallel at a high level might be more profitable, such as enabling multiple incoming ASP.NET requests to be processed, but often fine-grained parallelism can lead to better results. And just as often, a combination of the two works best. Consider an extreme case: Imagine that most ASP.NET web requests for a particular site ultimately acquire a mutual exclusive lock on a resource, essentially serializing a portion of all web requests. Of course, this is a design that's going to kill scalability eventually. But regardless, it could be present to a lesser degree, and might actually be an architectural requirement of the system. Executing some finer-grained operations in parallel might lead to better throughput in this case, especially those performed while the lock is held.
And clearly, the act of parallelizing an algorithm is not just based on the static properties of the system itself, but also dynamic capabilities and utilization of the machine. There are some APIs that allow dynamic querying of the machine state, which can aid in this process, e.g.:
- System.Environment.ProcessorCount: This property (new in 2.0) tells you how many hardware threads are on the system. Note that the number includes hyper-threads on Intel architectures, which really shouldn't be counted as a full parallel unit when deciding whether to parallelize your code. GetSystemInfo can give you richer information, albeit with some P/Invoke nonsense. We should give a better interface into this data for the next version of the Framework.
- Processor:% Processor Time performance counter: This gives you the % utilization of a specific processor and allows asynchronous querying. Using it, you could query each processor on the system to figure out what the overall system utilization is, and specifically how many sub-parts to break your problem into. The CLR thread-pool uses this today to decide when to inject or retire threads. You can use it too to determine whether introducing parallelism is a wise thing to do. Although your code may not have a lot of "context," this is often a good heuristic that even leaf level algorithms can use.
- System:Processor Queue Length performance counter: For more sophisticated situations, you can not only key off of the processor utilization, but also off the queue length of processes waiting to be scheduled. For a really deep queue (say, more than 2x the number of processors), introducing additional work is likely to lead to unnecessary waiting.
Using these are apt to lead to statistically good decisions. But clearly this is a heuristic, and as such the state of the system could change dramatically immediately after obtaining the values, perhaps making your deicision look naive and ill-conceived in retrospect. The worst case could be bad, but perhaps not terrible. The worst aspect of this is that performance characteristics could vary dramatically, and your users might respect predictable execution over sometimes-fast execution. The good news is that each of these functions are fairly cheap to call, amounting to less than 0.5ms total in some quick-and-dirty tests I wrote that read from all three.
But spending any time answering the question is tricky business. Assuming the software dynamically executes some code to decide if, and to what degree, we should run in parallel, and assuming these calculations are not done in parallel themselves ;), all of this work amounts to a fixed overhead on some part of the overall system, reducing overall parallel speedup (due to Amdahl's Law). We hope that in the future we can hide a lot of this messy work in the guts of the runtime and WinFX stack, but for now it's mostly up to you to decide.
Databases have utilized parallelism for a long time now to effectively scale-up and scale-out with continuously improving chip and cluster technologies. Consider a few high-level examples:
Parallel query execution is employed by all sophisticated modern databases, including SQL Server and Oracle. This comes in two flavors: (1) execution of multiple queries simultaneously which potentially access intersecting resources, and (2) implicit parallelization of individual queries, to acheive speed-ups even when a large quantity of incoming work is not present (e.g. high-cost queries, lots of data, etc.). Often a combination of both is used dynamically in a production system. I won't say much more, other than to refer to an interesting new query technology on the horizon.
Transactions are used as a simple model for concurrency control, enabling high scalability due to dynamic fine-grained locking techniques and policies, while supplying conveniences such as intelligent contention management and deadlock detection. And of course reliability is improved, because of the all-or-nothing semantics of transactions. Even in the face of asynchronous thread aborts, a transaction can ensure inconsistent state isn't left behind to corrupt a process, greatly improving the reliability of software at a surprisingly low cost. Software transactional memory (STM) borrows directly from the field, and applies it to general purpose parallel programming.
Invariants about data in databases are often modeled as integrity checks and foreign key constraints, which help to maintain reliable and consistent execution even in the face of concurrency. This, coupled with transactions, helps to ensure invariants aren't broken at transaction boundaries, and recent work done by MSR explores how this might be applied to general programming. STM combined with a rich system like Spec# could facilitate highly reliable and consistent systems that don't expose latent race conditions in the face of parallel execution.
Assuming you have (1) a lot of data to process, (2) complex computations to perform, and/or (3) simply a lot of individual tasks to accomodate, this model of parallel programming stretches quite far. With many cores per CPU, TB disks, and 100+-GB memories on desktops just around the corner; an order of magnitude more network bandwith available to consumers; and a continuing explosion of the amount of information humans generate and have to make some sense of, similar approaches could enable the next era of computer applications. I will also observe that surprisingly similar models of computation are precisely what fuel technologies like Google's MapReduce, albeit at a coarser granularity.
 Wednesday, May 03, 2006
Raymond's recent post talks about queueing user-mode APCs in Win32.
When you block in managed code, the CLR is responsible for figuring out the correct style of wait. This ends up in a CoWaitForMultipleHandles (on Win2k+) or MsgWaitForMultipleObjectsEx if you're executing in an STA; else, this ends up in a non-pumping wait, such as WaitForSingleObjectEx/WaitForMultipleObjectsEx. In any case, the wait is alertable, meaning that user-mode APCs will have a chance to run. There are various blocking calls hidden in Win32 and the CLR itself, so it's not guaranteed that all waits are alertable; but any that originate from managed code are, which we hope is a significant percentage.
This code illustrates a simple user-mode APC reentering as we do an alertable wait (via Thread.CurrentThread.Join(0)):
using System; using System.Runtime.InteropServices; using System.Threading;
static class Program {
static void Main() { QueueUserAPC( delegate { Console.WriteLine("APC fired"); }, GetCurrentThread(), UIntPtr.Zero);
Console.WriteLine("Doing join"); Thread.CurrentThread.Join(0); Console.WriteLine("Finishing join"); }
delegate void APCProc(UIntPtr dwParam);
[DllImport("kernel32.dll")] static extern uint QueueUserAPC(APCProc pfnAPC, IntPtr hThread, UIntPtr dwData);
[DllImport("kernel32.dll")] static extern IntPtr GetCurrentThread();
}
While this technique seems like an effective way to reuse a thread while it is blocked -- for example, you might contemplate doing this for thread-pool threads -- a little problem called thread affinity tends to arise. I wrote about this in terms of COM reentrancy before. An APC reentering doesn't perform a context transition, so even if we used a logical context to store such state, the problem would still exist. The simple fact is that user-mode APCs are good for system bookkeeping, but not for running general purpose code that modifies arbitrary program state.
|
|
Recent Entries:
Search:
Browse by Date:
Browse by Category:
Notables:
|