|
Personal Info:
Joe  leads the architecture of an experimental OS's developer platform, where
he is also chief architect of its programming language. His current mission is to enable
writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe
founded the Parallel Extensions to .NET project.
He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife,
writing books, writing music,
studying music theory & mathematics, and doing anything involving food & wine.
My books
My music
Disclaimer:
The content of this site are my own personal opinions and do
not represent my employer's view in anyway.
© 2012, Joe Duffy
|
|
 Tuesday, June 23, 2009
I wrote this memo over 2 1/2 years ago about what to do with concurrent exceptions in Parallel Extensions to .NET. Since Beta1 is now out, I thought posting it may provide some insight into our design decisions. (And yes, most design discussions start this way. Somebody develops a personal itch, dives deep into it, and emerges with a proposal for others to vote up, shoot down, or, as is typically the case, somewhere in the middle (provide constructive feedback, iterate, etc).) I've made only a few slight edits (like replacing code- and type-names), but it's mainly in original form. I still agree with much of what I wrote, although I'd definitely write it differently today. And in retrospect, I would have driven harder to get deeper runtime integration. Perhaps in the next release.
~~~
Concurrency and Exceptions October, 2006
Exceptions raised inside of concurrent workers must be dealt with in a deliberate way. Failures can happen concurrently, and yet often the programmer is working with an API that appears to them as though it’s sequential. The basic question is, then, how do we communicate failure?
The problem
Fork/join concurrency, in which a single “master” thread forks and coordinates with N separate parallel workers, is an incredibly common instance of one of these sequential-looking concurrent operations. The same callback is run by many threads at once, and may fail zero, one, or multiple times. The exception propagation problem is inescapable here and comes with a lot of expectations, because the programmer is presented a traditional stack-based function calling interface papered on top of data or task parallelism underneath.
I am faced with the need for a solution to this problem for PLINQ right now and, while I could invent a one-off solution, we owe it to our customers to come up with a common platform-wide approach (or at least ManyCore-wide). Any solution should compose well across the stack, so that somebody invoking a PLINQ query from within their TPL task that was spawned from a thread pool thread yields the expected and consistent result. And I would like for us to reach consensus for both managed and native programming models.
Before moving on, there is one non-goal to call out. Long-running tasks not under the category of fork/join also deserve some attention, because of the ease with which stack traces can be destroyed and the corresponding impact to debugging, but I will ignore them for now. The problem is not new, exists with the IAsyncResult pattern, and PLINQ doesn’t use this sort of singular asynchronous concurrency. These cases can typically be trivially solved using existing mechanisms, like standard exception marshaling.
No errors, one error, many errors
To understand the core of the issue, imagine we have an API ‘void ForAll<T>(Action<T> a, T[] d)’. It takes a delegate and an array, and for every element ‘e’ in ‘d’ invokes the delegate, passing the element, i.e. ‘a(e)’. If multiple processors are available, the implementation of ForAll may use some heuristic to distribute work among several OS threads, for instance by partitioning the array, probably running one partition on the caller’s thread, and finally joining with these threads before returning so that the caller knows that all of the work is complete when the API returns.
ForAll is not fictitious, and is similar to a number of PLINQ APIs: Where, Select, Join, Sort, etc. It is also exposed directly by the TPL runtime’s Parallel class which intelligently forks and joins with workers.
‘a’ is a user-specified delegate and can do just about anything. That includes, of course, throwing an exception. What’s worse, because ‘a’ is run in several threads concurrently, there may be more than one exception thrown. In fact, there are three distinct possibilities:
- No errors: No invocations of ‘a’ throw an exception.
- One error: A single invocation of ‘a’ throws an exception.
- Many errors: Concurrent invocations of ‘a’ on separate threads throw exceptions.
Clearly letting an exception crash whichever thread the problematic ‘a(e)’ happened to be run on is problematic and confusing. If for no other reason than the IAsyncResult pattern has established precedent. But realistically, the developer would be forced to devise his or her own scheme to marshal the failure back to the calling thread in order for any sort of chance at recovery. They would get it wrong and it would lead to incompatible and poorly composing silos over time. A Byzantine model that fully prohibits exceptions passing fork/join barriers goes against the simple, familiar, and understandable (albeit often deceptively so) model of exceptions.
(That said, marshaling leads to a crappy debugging experience. An already attached debugger will get a break-on-throw notification at the exception on the origin thread, but since we catch, marshal, and (presumably) rethrow, the first and second chances for unhandled exceptions won’t happen until after the exception been marshaled. This breaks the first pass, and by the time the debugger breaks in, or a crash dump is taken, the stack associated with the origin thread is apt to have gone away, been reused for another task (in the case of the thread pool), etc. We generally try to avoid breaking the first pass in the .NET Framework, but do it in plenty of places: the BCL today already contains tons of try { … } catch { /*cleanup */ throw; }-style exception handlers, for example. For this reason I’m not terribly distraught over the implications of doing it ourselves. And sans deeper integration with the exception subsystem – something we ought to consider – there aren’t many reasonable alternatives.)
What makes this problem really bad is that ForAll appears as though it’s synchronous:
void f() {
// do some stuff
ForAll(…, …);
// do some more stuff, ‘ForAll’ is completely done
}
The method call to ForAll itself is synchronous, but of course its internal execution is not. But still, to the developer, the call to this function represents one task, one logical piece of work, regardless of the fact that the implementation uses multiple threads for execution. As higher level APIs are built atop things like ForAll, the low level parallel infrastructure problem becomes a higher level library or application problem. A Sort that is internally parallel must now decide what exception(s) it will tell callers it may throw.
Nondeterministic exception ordering
We assume the ForAll API stops calling ‘a(e)’ on any given thread when it first encounters an exception. That is, each thread just does something like this:
for (int i = start_idx; i < end_idx; i++) {
a(d[i]);
}
The for loop terminates when any single iteration throws an exception. Imagine our array contains 2048 elements and that ForAll smears the data across 8 threads, partitioning the array into 256-element sized chunks of contiguous elements. So partition 0 gets elements [0…256), partition 1 gets [256…512), …, and partition 7 gets [1792…2048). Now imagine that ‘a’ throws an exception whenever fed a null element, and that every 256th element in ‘d’, starting at element 10, is null. What can a developer reasonably expect to happen?
On one hand, if we’re trying to preserve the illusion of sequential execution, we would only want to surface the exception from the 10th element. With a sequential loop, this would have prevented the 266th, 522nd, and so on, elements from even being passed to ‘a’. So we might simply say that the “left most” exception (based on ordinal index) is the one that gets propagated. The obvious problem with this is there are races involved: subsequent iterations indeed may have actually run. Alternatively, we might consider only letting the “first” propagate. Unfortunately, that doesn’t work either, because we unfortunately can’t necessarily determine, for a set of concurrent exceptions, which got thrown first. Even if they have timestamps, they could occur in parallel at indistinguishably close times. Nor does this really matter, because it feels fundamentally wrong.
The reason is that we can’t simply throw away failures without true recoverability in the system, a la STM. The execution of code leading up to the exception did actually happen, after all, and there could be residual effects. We might be masking a terrible problem by throwing failures away, possibly leading to (more) state corruption and (prolonged, perhaps unrecoverable) damage. What if the 10th element was a simple ArgumentNullException that the caller chooses to tolerate, but the 266th element’s exception was in response to a catastrophic error from which the application can’t recover? We can’t choose to propagate the 10th but swallow the 266th. Broadly accepted exceptions best practices suggest that app and library devs never catch and swallow exceptions they cannot reasonably handle. We should do our best to follow the spirit of this guidance too.
Re-propagation
We could employ an approach similar to the IAsyncResult pattern, with some slight tweaks.
If each concurrent copy of ForAll caught any unhandled exceptions and marshaled them to the forking thread, including any exceptions that happen on the forking thread itself, we could then propagate all of them together after the join completes. The question is then: what exactly do we propagate?
If there is just a single exception, it’s tempting to just rethrow it. But I don’t believe this is a good approach for two primary reasons:
- This will destroy the stack trace of the original exception. This means no information about the actual source of the error inside ‘a’ is available. With some help from the CLR team, we might be able to get a special type of ‘rethrow’ that copied the original stack trace before recreating a new one. This is already done for remoted exceptions, and the Exception base class will prefix the original remoted stack trace to the new stack trace.
- This doesn’t scale to handle multiple exceptions. If we could solve #1, it might be attractive because it appears as-if things happened sequentially, but we can’t escape #2, no matter what we do. We could have different behavior in these two cases, but I believe it’s better to remain consistent instead. Otherwise, developers will need to write their exception handles two ways: one way to handle singular cases, and the other way to handle multiple cases, where the same API may do either nondeterministically.
Given that we need to propagate multiple exceptions, we should wrap them in an aggregate exception object, and propagate that instead. At least this way, the original exceptions will be preserved, stack trace and all. Of course the original exceptions themselves might be other aggregates, handling arbitrary composition.
For sake of discussion, call this aggregate exception System.AggregateException, which of course derives from System.Exception. It exposes the raw Exception objects thrown by the threads, via an ‘Exception[] InnerExceptions’ property, and additional meta-data about each exception: from which thread it was thrown, and any API specific information about the concurrent operation itself. This last part is just to help debuggability. For instance, we might tell the developer that the ArgumentNullException was thrown from a thread pool thread with ID 1011, and that it occurred while invoking the 266th element ‘e’ of array ‘d’. We might also guarantee the exceptions will be stored in the order in which they were marshaled back to the forking thread, just to help the developer (as much as we can) piece together the sequence of events leading to failures.
(Editor’s note: we decided against storing this meta-data information for various reasons.)
Now the dev can do whatever he or she wishes in response to the exception. Previously they might have written:
try {
ForEach(a, d);
} catch (FileNotFoundException fex) {
// Handler(fex);
}
And now they would have to instead write:
try {
ForAll(a, d);
} catch (AggregateException pex) {
List<Exception> unhandled = new List<Exception>();
foreach (Exception e in pex.InnerExceptions) {
FileNotFoundException fex = e as FileNotFoundException;
if (fex == null) {
unhandled.Add(fex);
} else {
// Handler(fex);
}
}
if (unhandled.Count > 0)
throw new AggregateException(unhandled);
}
In other words, they would catch the AggregateException, enumerate over the inner exceptions, and react to any FileNotFoundExceptions as they would have normally. (Taking into consideration that there might have been multiple.) At the end, if there are any non-FileNotFoundExceptions left over, we propagate a new AggregateException with the handled FileNotFoundExceptions removed. If there was only one remaining, we could, I suppose, try to rethrow just that, but this has the same nondeterminism problems mentioned above.
Very few people will write this code. But one of the most vocal arguments against it is: just throw one singular exception, such as ForAllException, and let it crash, because no developer will handle it. Well, that scheme is no better than throwing the AggregateException. At least the aggregation model lets people write backout and recovery code if they have the patience to deal with the reality that multiple exceptions occurred.
To make this slightly easier, we could expose an API, ‘void Handle(Func<Exception, bool> a) where T : Exception’, that effectively encapsulates the same logic as shown above, repropagating the exception at the end if all the exceptions weren’t handled (i.e. some weren’t of type T):
try {
ForAll(a, d);
} catch (AggregateException pex) {
pex.Handle(delegate(Exception ex) {
FileNotFoundException fex = ex as FileNotFoundException;
if (fex != null) {
// Handle(fex);
return true;
}
return false;
});
}
(One problem with this approach is that the ‘throw’ inside of Handle will destroy the original stack trace for ‘pex’. An alternative might be for Handle to modify the AggregateException in place, keeping the stack trace intact, returning a bool that the caller switches on and does a ‘throw’ if it returns false; this is unattractive because it’s error prone and could lead to accidentally swallowing, but in the end might help debuggability.)
If we cared about eliminating unnecessary catch/rethrows, we could use 1st pass filters instead, but this would only be available to VB and C++/CLI programmers, as C# doesn’t expose filters. For example, in pseudo-code:
try {
ForAll(a, d);
} catch (fex.InnerExceptions.Contains<FileNotFoundException>()) {
// Handle …
}
Although interesting, we’re trying to move away from our two pass model. So let’s forget about this for now.
This approach suffers when composing with non-aggregate exception aware code. For it to work well, everybody on the call stack needs to be looking inside the aggregate for “their” exception, handling it, and possibly repropagating. If we want existing BCL APIs to start using data parallelism internally, we would have to be careful here, not to break AppCompat because we start throwing AggregateExceptions instead of the originals.
This is probably where there’s an opportunity for better CLR and tool integration. For instance, you could imagine a world where the CLR automatically unravels the parallel failures, matching and running handlers for specific exceptions inside the aggregate as it goes, but repropagating if all exceptions weren’t handled. This is very hand-wavy and fundamentally changes the way exceptions work, so it would require a lot more thought. A catch block that swallows an exception (today) is just about guaranteed—asynchronous exceptions aside—that the IP will soon reach the next instruction after the try/catch block. This is a pretty basic invariant. With this proposal, that wouldn’t be the case, and would be bound to break large swaths of code. Sticking with the library approach (with all its imperfections) seems like the best plan of attack for now.
Waiting for the “join” to finish
There was something implicit in the design mentioned above. The ForAll API, and others like it, wouldn’t actually propagate exceptions until the fate of all threads was known.
Imagine we have the scenario described earlier (2048 elements, 8 threads), but slightly different: the 0th element causes an exception, but no other. It turns out this is probably a common case, i.e. that only a subset of the partitions will yield an exception. In this case, we would still have to wait for 7*256 = 1,792 elements to be run through ‘a’ before this exception is propagated. Imagine a slightly different case. The 0th element throws a catastrophic exception, and the application is going to terminate as soon as it propagates. ‘a’ simply can’t be run any more, and will keep reporting back this same exception. But it will take 8 of these exceptions to actually stop the application, i.e. by calling ‘a’ on the 0th, 256th, 512th, etc. elements, if we wait for all tasks to complete. If each exception corresponds to some failed attempt at forward progress, one that possibly corrupts state, then the damage is O(N) times “worse” (for some measurement) than in the sequential program, where N is the number of concurrent tasks.
Instead of waiting helplessly, we could try to aggressively shut down these concurrent workers.
At first, you might be tempted to employ CLR asynchronous thread aborts, but this is fraught with peril. Almost all .NET Framework code today is taught that thread abort == AppDomain unload, and reacts accordingly. State corruption stemming from libraries as fundamental as the BCL would be just about guaranteed. Changing this state of mind and the state of our software would be quite the undertaking.
Instead, we can have the concurrent API itself periodically check an ‘abort’ flag shared among all workers. The first thread to propagate an exception would set this flag. And whenever a worker has seen that it has been set, it voluntarily returns instead of finishing processing data:
for (int i = start_idx; i < end_idx && !aborted; i++) {
a(d[i]);
}
This increases the responsiveness of exception propagation, but clearly isn’t foolproof. There will still be a delay for long-running callbacks. Thankfully, with PLINQ, TPL, and I hope most of our parallel libraries, the units of work will be individually fine-grained, and therefore this technique should suffice.
If a concurrent worker is blocked, there’s not a whole lot we can do. Much like thread aborts, you might be tempted to use Thread.Interrupt to remove it from the wait condition. Unfortunately this will leave state corruption in its wake, because plenty of code does things like WaitHandle.WaitOne(Timeout.Infinite) without checking the return value or expecting a ThreadInterruptionException. The same argument applies to, say, user-mode APCs. Eventually you might also be tempted to use IO cancellation in Windows Vista to cancel errant, runaway network or disk IO requests. This would be great. But this also generally has the same problem as interruption, so until we find a general solution to that, we can’t do any of this.
(Editor’s note: We eventually solved this problem by coming up with a unified cancellation framework.)
One last note
This path forward seems best for now, but it leaves me wanting more.
In the end, this feels like a more fundamental problem. An API like ForAll gives the illusion of an ordinary, old sequential caller/callee relationship. But the callee doesn’t use a stack-based calling approach: instead, it distributes work among many concurrent workers, turning the linear stack into a sort of dynamically unfolding cactus stack (or tree). And SEH exceptions are fundamentally linear stack-based creatures.
In this world, it’s just a simple fact that data all over the place can become corrupt simultaneously. Many things can fail at once because many things are happening at once. It’s inescapable. Recovery is disastrously difficult, so most failures will end in crashes. STM’s promise for automatic recovery offers a glimmer of hope, but without it, I worry that papering a sequential “feel” on top of data/task parallelism is a dangerous game to play.
 Tuesday, June 16, 2009
One of my many focuses lately has been developing a memory ordering model for our project here at Microsoft. There are four main questions to answer when defining such a model:
- What are the ordering guarantees for ordinary loads and stores?
- What are the ordering guarantees for volatile loads and stores?
- What kinds of explicit fences are allowed?
- Where are fences used automatically, e.g. to preserve type safety and security?
These tend to be the differentiation points for any model. Everything else is mostly commodity. Not that there is much else, mind you, but respecting data dependence, not speculating ahead such that exceptions occur that wouldn't have occurred in a sequential execution, and so forth are all must haves, for instance. Most interesting permutations of answers for these questions have already been explored, and industry consensus is being reached, so it would be better to say I've been picking a model rather than defining one.
What's interesting is that memory model designers are often colored by their favorite architecture du jour. If somebody cares primarily about X86, they are apt to choose something very strong. If somebody cares primarily about ARM, however, they are apt to choose something very weak. There is a classic tradeoff here. Stronger means easier to program, while weaker means better performance. For some reason, many of the projects I've worked on have had an abundance of strong hardware (like X86) and a scarcity of weak hardware (like ARM and IA64). The reality sinks in: most developers on the team code to X86, and then when it comes time to getting more serious about the other platforms, code starts breaking all over the place. This is why the CLR went so strong in 2.0, even though IA64 was an important platform to support.
Let's look at some common answers to the above questions.
For #1:
- C++, Visual C++, ECMA 1.0, Java Memory Model, and Prism: no ordering guarantees.
- CLR 2.0: ordered stores, no ordering for loads.
For #2:
- C++: prevents compiler-only code motion, but explicit fences are needed for processor ordering.
- Visual C++, ECMA 1.0, and CLR 2.0: loads are acquire, stores are release ordered.
- Java Memory Model: loads and stores are fully ordered (sequentially consistent).
For #3:
- C++: implementation-specific.
- Visual C++: intrinsics and Win32 APIs.
- ECMA 1.0 and CLR 2.0: locks, and mostly Win32-style interlocked APIs.
- Java Memory Model: locks, compare-and-swap, atomics, etc.
For #4:
Managed environments like the CLR and JVM need to ensure type safety, even if ordinary loads and stores are unordered. This is nontrivial, because the boundary around type safety is blurred. Certainly we must ensure garbage v-table pointers are not seen. But is a thread allowed to read non-zeroed memory behind an object reference? And can it contain garbage (e.g. "values out of thin air")? What about writes done by mutator threads, including write barriers, while a concurrent collector is tracing objects in the heap? Are array lengths part of the set of protected fields that mustn't be read out of order? Strings, since they are commonly used for security checking? And so on.
It is mainly the deep questions around #4, and also some simple compatibility struggles (around things like double checked locking), that caused the stronger answers for #1 in the CLR 2.0.
In any case, I'm advocating a very different approach than the traditional models.
We pick completely weak ordering for ordinary loads and stores, to enable efficient execution on weaker platforms like ARM, PowerPC, IA64, etc. That part isn't new. But here's the clincher. No volatiles. There are special variables that are used to communicate between threads (call them volatile if you'd like), but using them implies no kind of special automatic fencing. Instead, whenever accessing such a variable, at the site of usage, the kind of fence desired must be used (compiler-enforced): full-fence (sequentially consistent), acquire-fence, release-fence, no-fence, or compiler-only-fence (for things like ensuring loads don't get hoisted as loop invariant). Of course, certain kinds of fences are sprinkled throughout the system to guarantee type safety in all of the aforementioned places (and more), but these are implementation details.
(This approach is rather like Herb Sutter's Prism and C++0x atomics. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2664.htm.)
Particularly after managing teams who developed a plethora of lock free code, I love this approach. I can review code and immediately understand what ordering invariants the developer assumed when writing the code. This doesn't really make writing lock free code any simpler, except that it forces you to pause and think about things a bit more carefully than you may have otherwise. But it certainly makes code easier to understand and maintain, and makes it clear to people that sprinkling volatile all over the place isn't going to save your butt: the only thing that will do that is careful thinking and engineering.
 Thursday, June 04, 2009
An interesting alternative to reader/writer locks is to combine pessimistic writing with optimistic reading. This borrows some ideas from transactional memory, although of course the ideas have existed long before. I was reminded of this trick by a colleague on my new team just a couple days ago.
The basic idea is to read a bunch of state optimistically, without taking a lock of any sort, and then prior to using it for meaningful work (which may depend on the state being consistent and correct), a validation step must take place. This validation uses version numbers which writers are responsible for maintaining. Specifically, we'll use two version counters, version1 and version2: the writer increments version1, performs the writes, and then increments version2; and the reader reads version2, performs its reads, and then verifies that version 1 is equal to the version2 that it saw at the start. If this verification fails, we'll ordinarily just do a little spinning and then go back around the loop again.
Stop for a moment and ponder something very critical to this algorithm. The writer increments variables in the opposite order of the reader's reads. To see why this works, imagine we start with version1 == version2 == 0. There are two hazards to worry about. (1) A reader begins reading, and writes occur before it has finished. And (2) a reader begins reading while a write is in progress. These are simple to detect, and in fact boil down to the same thing. A reader sees version2 == 0, and the first thing a writer does is version1++. So when the reader attempts to validate, it will notice the version2 it saw != version1 any longer. If the writer has already begun by the time the reader arrives, it is possible for the reader to know it is doomed even before it has started doing any of its reads.
Here is the code in its full glory:
using System;
using System.Threading;
public class OptimisticSynchronizer
{
private volatile int m_version1;
private volatile int m_version2;
public void BeforeWrite() {
++m_version1;
}
public void AfterWrite() {
++m_version2;
}
public ReadMark GetReadMark() {
return new ReadMark(this, m_version2);
}
public struct ReadMark
{
private OptimisticSynchronizer m_sync;
private int m_version;
internal ReadMark(OptimisticSynchronizer sync, int version) {
m_sync = sync;
m_version = version;
}
public bool IsValid {
get { return m_sync.m_version1 == m_version; }
}
}
public void DoWrite(Action writer) {
BeforeWrite();
try {
writer();
} finally {
AfterWrite();
}
}
public T DoRead<T>(Func<T> reader) {
T value = default(T);
SpinWait sw = new SpinWait();
while (true) {
ReadMark mark = GetReadMark();
value = reader();
if (mark.IsValid) {
break;
}
sw.SpinOnce();
}
return value;
}
}
We leave it to the caller of this class to acquire locks as appropriate to synchronize writers. Typically this will just mean wrapping a Monitor.Enter/Exit around calls to things like BeforeWrite, AfterWrite, and DoWrite. But readers explicitly do not need this same protection. DoRead exemplifies the safe reading pattern, although it can be done by hand via the ReadMark APIs.
It's also worth considering what kinds of fences are truly required for this to work. Logically speaking, we need to ensure the entrance to a protected block (either read or write) is an acquire fence, and exit from the block is a release fence. This is similar to the ordering semenaitcs necessary for a lock block. So long as we use volatile modifiers for the version counters, and for the variables read within the protected block, this will work fine. Even on weak models like IA64. The beautiful thing is that we don't need full fences, even on models like X86 that make use of store buffer forwarding The classic store buffering case we may worry about (on a single-threaded execution) would be something like this, in pseudo-code:
version1++;
X = 42;
Y = 99;
version2++;
tmp = version2;
r0 = X;
r1 = Y;
success = (tmp == version1);
We'd be worried about satisfying some loads out of the store buffer, while satisfying others out of the memory system. But this is safe: if the load of X or Y sees a different processor's writes, then the subsequent load of version1 necessarily must witness the new value written by the other processor too. And therefore the validation will fail as we would expect and hope.
Here is a quick performance benchmark I whipped together, much in the same spirit as my previous reader/writer lock examples. I've measured varying numbers of writers (0%, 5%, 10%, 25%, 50%, and 100%), and each thread spends a certain amount of time inside the "lock region" doing some nonsense busy work. The certain amount of time is measured in terms of number of function calls (0, 10, 100, and 1000), and the work doesn't vary at all depending on whether a thread is reading or writing. I've measured four things: (1) Monitor.Enter/Exit as the baseline (where both readers and writers just acquire the mutually exclusive lock), (2) ReaderWriterLockSlim, (3) the spin-based lock that I showed previously, and (4) the new OptimisticSynchronizer class with optimistic retry. The values are the ratio compared to the baseline (1), so that >1.0x means the particular entry is slower, while <1.0x is faster. I did these measurements on an 8-way machine -- unlike the previous study which was on a 4-way machine -- which means that 0.125x would be a linear speedup compared to the serialized Monitor version:
0% writers:
0 calls 10 calls 100 calls 1000 calls
RWLSlim 1.26 1.55 1.39 0.38
SpinRWL 0.12 0.17 0.13 0.18
OptSync 0.05 0.08 0.11 0.12
5% writers:
0 calls 10 calls 100 calls 1000 calls
RWLSlim 1.36 1.70 1.40 1.07
SpinRWL 0.98 1.07 0.55 0.30
OptSync 0.35 0.43 0.31 0.24
10% writers:
0 calls 10 calls 100 calls 1000 calls
RWLSlim 1.42 1.66 1.23 1.06
SpinRWL 1.41 1.61 0.91 0.51
OptSync 0.56 0.66 0.46 0.31
25% writers:
0 calls 10 calls 100 calls 1000 calls
RWLSlim 1.36 1.97 1.24 1.03
SpinRWL 2.39 2.22 1.08 0.89
OptSync 0.84 0.99 0.86 0.59
50% writers:
0 calls 10 calls 100 calls 1000 calls
RWLSlim 1.48 1.80 1.21 1.05
SpinRWL 3.16 3.30 1.81 1.19
OptSync 0.91 0.94 1.10 0.92
100% writers:
0 calls 10 calls 100 calls 1000 calls
RWLSlim 1.35 1.67 1.22 1.09
SpinRWL 5.84 5.84 2.49 1.18
OptSync 0.93 0.99 1.13 1.17
For all cases but the 100% writers case, the OptimisticSynchronizer class does extraordinarily well.
Although this approach screams performance-wise, it is admittedly much more difficult and error-prone to use. If the variables protected are references to heap objects, you need to worry about using the read protection each time you touch a field. Just like locks, this technique doesn't compose. As with anything other than simple locking, use this technique with great care and caution; although the built-in acquire and release fences shield you from memory model reordering issues, there are some easy traps you can fall into. And as with any optimistic reading, memory safety is a necessity; trying to use these techniques in C++, for example, can easily lead to access violations and memory corruption. Tread with caution.
Update 6/4: This technique, of course, is subject to ABA problems. I failed to mention that originally. That is, if between reading version2, Int32.MaxValue writers perform writes, the version1 field will wrap around such that the reader will (erroneously) successfully validate. Fixing this on 64-bit is simple (use a 64-bit counter) but is less so on 32-bit due to the lack of atomicity on loads and stores of 64-bit values (without using, say, an XCHG or related primitive).
Update 6/18: My original write-up incorrectly made some hidden assumptions about the use of volatile. This has now been cleared up.
 Thursday, May 28, 2009
Two persons stand on a railway embankment at points A and C, exactly 500 meters apart. Lightning strikes precisely in the center of them, at point B, 250 meters away from both:
<--A----------B----------C-->
Presuming both persons are stationary, does the event (lightning strikes) occur “at the same time” from the perspective of the two persons? In our simplistic one dimensional model, the answer is Yes, precisely because the point of the lightning strike, B, is equidistant from A and C.
The person at point C may just as well be responsible for generating the event, by using some form of light rod instead of a lightning bolt supplied by nature. If the person at C lights such a rod, would the event still occur “at the same time” for both persons? Clearly No, because it will take some amount of time for the event’s occurrence to travel the distance from C to A, specifically the time it takes for light to travel 500 meters. Whereas for C it happens nearly instantaneously.
Practically speaking, this amount of time it takes for the light to travel to A will of course be so minute as to be nearly immeasurable, but nevertheless there are two separate times t and t’, the former being the actual time the rod is lit at C, and the latter being the time at which it is perceived at A. This is commonly referred to as relativity of simultaneity, introduced as Lorentz’s local time in the late 1800's and further formalized by Einstein's special theory in 1905.
Now imagine that a new person is placed at point B, equidistant from A and C, where the original lightning struck. If the person at C lights his rod, will the person at B observe the event before the person at A does? Most certainly. It takes less time for light to travel 250 meters than it does 500 meters.
Let us extend our working example a bit. Imagine again three persons, one at point A, one at point B, and another at point C. Those at B and C hold their own light rods. The person at C lights a rod, and in response to seeing C’s rod lighting, the person at B also lights a rod. The question is, must the person at A witness the light emanating from C’s rod prior to witnessing the light emanating from B’s rod? Unless the person at B’s response is truly instantaneous (which we assume is practically impossible), or unless he can see into the future (which we also assume is impossible), clearly the answer is Yes. Because the rod at B was lit in response to witnessing C’s lighting of the rod, some amount of time must have passed during the response, and the person at A will thus see C’s lighting first (or at the very least simultaneously, assuming near-impossible instantaneous response). We say B’s lighting is causally dependent on C’s lighting.
The main point here is that time is an illusion. There is no global time clock. Instead, events are not only distinguished by some monotonically increasing time value t, but also by a location which is defined by three-space coordinates. This is Minkowski’s four-dimensional space. One event occurring at coordinates { x=0, y=0, z=0, t=99 } may not appear to be simultaneous with some other event at coordinates { x=42, y=42, z=42, t=99 }, depending on the observer's location, even though both events occur at time t = 99.
Perception is relative. There is no global total ordering, only a local (relative) one.
A similar phenomenon is true of multiprocessors. In fact, nearly everything said above applies equally to them, provided that you replace “persons” with “processors”, and lighting of rods with writes to memory and witnessing of the light with reads from memory.
Multiprocessor architects must cope with the increasing bottleneck on a central memory unit, particularly due to shared memory programming. One common means of doing so is to increase the distance between processing elements and the memory units they use, padding this distance with ample levels of cache. Some processor A may have a local memory (cache) that is separate from some other processor C’s, and A’s writes to it will be visible to A before C, for example. And if some processor B sits in between them, it may notice such writes before C does. Locality matters.
Of course, memory ordering models are meant to eliminate such distances from the programmer’s mind, at least to some degree. They provide a set of rules governing the timing and ordering of events. But there is just no denying the laws of physics. My claim is that a proper ordering model ought to obey what can be derived from the special theory of relativity: no more, no less. That is, the fundamental laws of how events occur and are correlated in the real world should be mimicked. This means only two things, as far as I can tell:
- An event stream (writes) originating from a source must appear to happen in order.
- Causality is respected, in that when C causes B, it is implied that A must see C followed by B.
This turns out to be stronger than some models, but also weaker in some regards. Distance and latency are first class, embellished, and allowed. There is some cost to ensuring events leaving a locale do so in order, and that events arriving into a locale also do so in order. Given coarse enough locales, however, this cost ought to be amortized over the cost of inter-locale communication.
Notice that sequential consistency is explicitly discouraged. The ordinary store-followed-by-load ordering that I've written about many times is legal. Considering this phenomena in the context of light rods and relativity makes it clear why. Imagine that the persons at A and C light their rods simultaneously. If the person at A immediately, after lighting the rod, looks to the right to see if C has lit the rod, the answer will be No; and similarly, if the person at C immediately, after lighting the rod, looks to the left to see if A has lit the rod, the answer will also be No. Although the real reason has to do with gross details like store buffers and cache coherency, the elegant reason supported by the model is that it takes time for light to travel the distance between A and C.
I also want to point out that “memory ordering model” commonly refers to individual loads and stores, at a very low level, but just as well applies to a higher-level model such as might be found in an actor-oriented (message passing) programming language. People often believe memory ordering and interleaving goes away magically with message passing models. This is simply not true, even if instruction-level interleaving is eliminated. The granularity merely coarsens, but the problem still remains the same.
Despite the lack of sequential consistency, implementing such a model can pose challenges, due to restrictions on optimizations like pipelining and out of order execution. (Hey, at least we needn’t worry about processors moving about at different velocities, as in the more interesting parts of special relativity.) But I believe it is necessary. This price paid will be rewarded with a system that human beings can be taught to reason about as they do in the real world. Remember: I am not just talking about memory models in the traditional sense, where people are tempted to sweep the problem under the rug of "only super-developers doing lock-free programming need a model"; it matters for higher level concurrency orchestration patterns too. In the end, let us not forget: correctness and understandability trump performance optimizations for all but the most low-level systems developers, which make up less than 1% of the development population.
1. Relativity: The Special and General Theory. http://en.wikisource.org/wiki/Relativity:_The_Special_and_General_Theory 2. Time, Clocks, and the Ordering of Events in a Distributed System. http://research.microsoft.com/en-us/um/people/lamport/pubs/time-clocks.pdf
 Saturday, May 16, 2009
A while back, I made a big stink about what appeared to be the presence of illegal load-load reorderings in Intel's IA32 memory model. They specifically claim this is impossible in their documentation. Well, last week I was chatting with a colleague, Sebastian Burckhardt, about this disturbing fact. And it turned out he had recently written a paper that formalizes the CLR 2.0 memory model, and in fact treats this phenomenon with a great deal of rigor:
Verifying Compiler Transformations for Concurrent Programs http://research.microsoft.com/pubs/76524/tr-2008-171-latest-03-11-09.pdf
To jog your memory, the problematic example is
X = 1; r0 = X; r1 = Y;
where both X and Y are shared memory locations, and r0 and r1 are processor registers. According to Intel's IA32 memory model, two loads to different locations cannot reorder. But it is completely possible for the load of X to be satisfied out of the store buffer, and for r1=Y to pass the store (thereby also passing the load r0=X). This is a standard Dekker reordering, but the usual example consists of just { X = 1; r1 = Y }.
The key to modeling this is to turn an adjacent store-load affecting the same location into a single instruction. Therefore, the above becomes something like:
r0 = 1; X = r0; r1 = Y;
Now it becomes entirely clear what has gone wrong. I have yet to see a clear description of this phenomenon, but Sebastian's paper does a great job.
During the discussion, Sebastian showed me another disturbing four processor example:
P0 P1 P2 P3 == == == == X = 1; r0 = X; Y = 1; s0 = X; r1 = Y; s1 = Y;
Is it possible, after all four processors complete, that { r0 == 1, r1 == 0 } and { s0 == 0, s1 == 1 }? This seems ridiculous, given a memory model where loads cannot reorder. It seems that no serializable execution should lead to this. But let's look at one problematic interleaving. First, we merge the instruction stream on P0 with P1, and also P2 with P3. This effect could occur if these writes are in functions that end up running on the same processor, or running on a machine that shares functional units (like hyperthreading), hierarchies that share a cache, and so on. We end up with:
P0/P1 P2/P3 ===== ===== X = 1 Y = 1; r0 = X; s0 = X; r1 = Y; s1 = Y;
Now let's permute these with the new rule introduced above in mind:
P0/P1 P2/P3 ===== ===== r0 = 1; s0 = X; r1 = Y; s1 = 1; X = r0; Y = s1;
At this point, it should be obvious what the problematic reordering would be. Let's continue merging these into a single execution order:
P0/P1/P2/P3 =========== r0 = 1; // #1 r1 = Y; // #0 s0 = X; // #0 s1 = 1; // #1 X = r0; // #1 Y = r1; // #1
The outcome? { r0 == 1, r1 == 0 } and { s0 == 0, s1 == 1 }. Whoops.
I have yet to observe this happening in practice, but models that permit store buffer forwarding are fundamentally vulnerable to this reordering. The solution here is the same as with Dekker. Marking the volatiles is insufficient: you need to insert full memory fences between the store-load adjacent pairs.
As we were hard at work creating PFX, we had a sister team of great talent working with us every step of the way. Their job? To do to Visual Studio 2010 what PFX did to .NET 4.0, by substantially improving the development experience for parallel programming on Windows. This includes both diagnostics & debugging, as well as profiling.
Daniel Moth, the program manager for a lot of the IDE features, just wrote up a comprehensive blog post on the new tasks window:
Parallel Tasks - new Visual Studio 2010 debugger window
The new window gives you a view into all of the tasks in your process, their statuses, and where they are running:

Because both TPL and PLINQ use tasks for execution, it supports both quite nicely. And it has (consistent!) support for both .NET and C++ tasks. The parallel stacks window is also an impressive new feature, and I'm guessing Daniel will also cover that in the coming weeks. Keep your eyes peeled. If all goes well, you'll even get to try them out first-hand, once Beta1 is available.
And if that weren't enough to entice you to visit his blog, check out this nasty machine that Daniel uses to run his kitchen appliances:

Oh, the insanity. I am thinking Task Manager will need revising in Windows 8.
 Friday, May 08, 2009
The parallel computing team just shipped an early release Axum (fka Maestro), an actor based programming language with message passing and strong isolation.
I'm personally very excited to see what comes of Axum. It's one step on the long road towards the vision of automatic parallelism. Although I can't claim credit for anything concrete, I was the chief designer of the fine grained isolation model Axum is built atop (something I call "Taming Side Effects" (TSE)). It's a blend of functional programming with imperative programming enabled by using the concepts of Haskell's state monad in a more familiar way. I'll try to blog a bit more about it in coming weeks. It turns out I've recently shifted my focus to a new project with the aim of applying these ideas very broadly for a whole new platform.
Doing incubation work at Microsoft is tough work, because it takes a strong vision and drive to keep pushing forward. You need to take stances that are unconventional, risky, and often just plain unpopular, and drive against all odds. Usually you aren't going to make any money off the ideas for years at a time, so it also takes a supportive management team who is willing to give you creative freedom and cut you checks. Most such efforts fail in a vaccuum. But hats off to the team for pushing hard, and going out early to ask what developers think. This is a huge milestone.
 Tuesday, March 31, 2009
I often speak of the need to develop programming models whereby developers write code in the most natural way possible, and it just so happens to be inherently parallel. I don’t believe the lion’s share of developers want to rearchitect and rewrite their code with parallelism at the forefront of their development process. They don’t want to think about shipping memory over to the GPU and launching a highly-specialized data parallel kernel of computation, nor do they want to have to add locks and transactions everywhere to make things safe. But I do, however, believe the lion’s share of developers wouldn’t mind if their code ran faster as hardware got faster (via more cores).
(To be clear, there are certainly a lot of developers who will be happy to write specialized code if it means eking out every last bit of performance on their machine. But that’s the minority.)
This viewpoint tends to get a lot of skeptical looks from people who quickly point out that this has been tried countless times before, and always leads to failure. They, of course, are referring back to the 80’s and early 90’s where “dusty deck” parallelization was all the rage, mostly in the realm of vectorization and HPC. To be fair, there were some mild successes in getting floating point loops parallelized, but there’s no wonder these attempts had little longevity. No touch solutions are always inadequate. Trying to make a fundamentally non-secure program secure, by way of, say, virtualization, may work in some constrained circumstances. But the right solution is to develop models and practices that lead to security-by-construction.
Furthermore, languages were (and in most cases still are) lacking some major prerequisites for large-scale automatic parallelization:
- Safety. Unless a compiler can reason about the determinism of a program when run in parallel, one cannot prove that its results will remain correct when parallelism is added. Compilers are therefore limited to parallelizing highly-specialized recognized patterns, like loops comprised solely of floating point additions of two arrays indexed by the loop iteration.
- Performance. Rampantly parallelizing a huge program wherever possible is dangerous, will drain performance, and make power consumption skyrocket. Dynamic techniques like workstealing and static techniques like nested data parallelism and profile guided feedback need to work together to inform these decisions. Machine-wide resource management needs to know about the memory topology, machine load and policy, and make informed decisions based on them. Although there has been a lot of disparate research in these areas over the years, they have only recently been coming together. Certainly in the 80’s, they were in their infancy.
- Declarative patterns. Most of the prior art was done in FORTRAN, a standard imperative language riddled with loops, effects, and assignments. Programs need to be written with as few dependencies as possible in order to expose large amounts of parallelism, and the von Neumann inspired languages fall short of this aim. Data comprehensions allow set-at-a-time computations to be expressed in a higher-order way, and newer languages like Fortress have language semantics that permit parallel evaluation in many more areas, like argument evaluation. And application models that encourage isolation and loose state coupling allow coarse partitioning.
In addition to all of those three things, we must have realistic expectations. Even if a program were fully safe to parallelize, as many Haskell kernels are, we would seldom see perfect scaling. Buying a 128-core machine doesn’t necessarily give you a 128x speedup. Why? Because there are still portions of the code that will end up less parallel than other portions, and some parts may even continue to run sequentially. There will always be I/O and waiting: these are real programs, after all, and real programs tend not to be 100% computation. They need to do something useful with the real world. Moreover, safety does not mean “dependence free.” And data dependencies are ultimately what limit parallelism.
My stated goal would therefore be that parallelism in programs is solely limited by data dependence. Safety issues are handled by construction. Performance is (mostly) handled by the system, although as with all things, there will be some amount of measurement, hints, and tuning necessary. But hopefully a huge part of tuning performance will be seeking out needless dependencies, or finding new algorithms that have different dependence characteristics. And with that, we can focus our energy on raising the level of abstraction and pushing more declarative patterns that are broadly useful. Over time as more and more programs are written in this fashion, they become more and more naturally parallel.
What do you think? Am I crazy? Perhaps. But I still know we can do it.
 Friday, March 13, 2009
Managed code generally is not hardened against asynchronous exceptions.
“Hardened” simply means “written to preserve state invariants in the face of unexpected failure.” In other words, hardened code can tolerate an exception and continue being called subsequently without a process or machine restart. Conversely, code that is not hardened may react sporadically if continued use is attempted: by corrupting state and subsequently behaving strangely and unpredictably.
Asynchronous exceptions are a foreign concept to native programmers, and arise because there is a runtime underneath all managed code that is silently injecting code on behalf of the original program. The only truly asynchronous exception is ThreadAbortException, but any in the set { OutOfMemoryException, TypeInitializationException, ThreadInterruptedException, StackOverflowException } are often labeled as such. While thread aborts can happen at any line of code outside a delay-abort regions, these other exceptions can be introduced by the CLR at surprising times; i.e., { memory allocations, static member access, blocking calls, any function call }. The effect is that, unlike most exceptions, the points at which they may occur are not obvious. OOMs, for instance, can happen at any method call (due to failure to allocate memory in which to JIT code), implicit boxing, etc.
(As of 2.0, StackOverflowException is no longer relevant because SO triggers a FailFast of the process instead. So saying that managed code is not hardened against SO is an understatement.)
Also, because of the way COM reentrancy works, any blocking call can lead to any arbitrary code dispatched through STA pumping. And that arbitrary code, much like an APC, can fail via any arbitrary exception. These are a lot like asynchronous exceptions. So in truth, code that isn’t written to respond to arbitrary exceptions at all blocking points is technically not hardened either.
.NET doesn’t provide checked exceptions, so the blunt reality is that very little managed code is hardened properly to synchronous exceptions either. I think we do a better job in the framework of carefully engineering the code to resiliently tolerate failure, usually by being very careful about argument validation, but we aren’t perfect. Some things slip through.
If you stop to think about why hardening isn’t done, it’s probably obvious. It’s darn difficult. Especially for asynchronous exceptions where nearly every line of code must be considered. In Win32 programming, most failure points are indicated by return codes. (Although C++ exceptions can sneak through the cracks at surprising times. Like the fact that EnterCriticalSection can throw.) While error codes are cumbersome to program against (since every call needs to be checked for a plethora of conditions, making it easy to miss something), at least the response to failure is explicit. You can decide to propagate and leave state corrupt, fix up state and then propagate, rip the process, or ignore the failure, as appropriate. This becomes part of the API contract. In managed code, you need to know to wrap such calls in try/catch blocks. Nobody does this. It’s insane to even consider doing that. And because nobody does, you can’t even catch exceptions coming out of a single API call and know that, when faced with an OOM (for example), that all code on the propagating callgraph has transitively handled the failure in a controlled manner. The very fact that the lock{} statement auto-unlocks without rolling back corrupt state should be indication enough of the current state of affairs.
An instance of any of the aforementioned exceptions means the AppDomain is toast.
By toast, I mean that it’s soon going to be unusable, and hopefully actively being unloaded. Code in the framework assumes this, and you should too. All it does is try to get out of the way by not crashing or hanging the ensuing unload. A small fraction of code that deals with process-wide state comprised of resources not under the purview of the CLR GC needs to worry about running and avoiding leaks. This is where things like CERs, CriticalFinalizerObjects, and paired operations stuck in finally blocks come into play. They ensure cross-process state is freed, and that asynchronous exceptions cannot occur in places that would crash or hang a clean unload.
Unfortunately, it’s not always the case that the AppDomain is unloading when such an exception occurs:
- Somebody can call Thread.Abort directly, without killing the AppDomain. They can either call ResetAbort and keep it around, or let it return to the ThreadPool which catches and swallows aborts. In fact, we tell people that synchronous aborts a la Thread.CurrentThread.Abort is “always safe”, whereas we tell people asynchronous aborts are dangerous and best avoided.
- Some framework infrastructure, most notably ASP.NET, even aborts individual threads routinely without unloading the domain. They backstop the ThreadAbortExceptions, call ResetAbort on the thread and reuse it or return it to the CLR ThreadPool. That means any code running in ASP.NET is apt to be corrupted when websites are recycled and AppDomain isolation is not being used.
- Assume AppDomain B is being unloaded. If some thread has called from A to B to C, the thread will immediately suffer an abort. The result is that C will see a thread unwinding with a ThreadAbortException, back into B, and then back to A, at which point the exception turns into a deniable AppDomainUnloadedException that can be caught. But C has seen an in-flight abort and yet it is not being unloaded. The result is that C’s state may be completely corrupt. I believe this should be considered a bug in the CLR.
- We can’t differentiate between soft- and hard-OOMs today. The former are caused by requests to allocate large blocks memory. Often a failure here isn’t indicative of a disaster. It may be due to a need to allocate 1GB of contiguous memory, and perhaps there is fragmentation. Hard OOMs are often caused by running up against the edge of the machine where no pagefile space is available, and may indicate a failure to JIT some important method, among other things. But because we don’t differentiate, any managed code can catch-and-ignore any kind of OOM, including hard ones.
- Thread interruptions are often used as a form of inter-thread communication. For example, they can be used as a poor man’s cancellation. (This is inappropriate, and cooperative techniques should always be used. But it is widespread.) But because they are used as a means of communication, they are almost always caught and handled in some controlled manner. This is one place where we screwed up by not hardening the frameworks against interrupted blocking calls and reacting intelligently. Checked exceptions would have saved us.
What does all of this mean? Quite simply, the .NET Framework cannot be trusted when any of the aforementioned exceptions are thrown. Ideally the process will come tumbling down shortly thereafter, but improperly written code can catch them and continue trying to limp along. In fact, as I mentioned above, some wildly popular & successful application models do (notably, ASP.NET and WinForms).
This state of affairs is admittedly unfortunate. We don’t properly separate out the truly fatal exceptions from those that we can gracefully recover from. In an ideal world, I’d love to see us do that. For example:
- At some point, we really ought to consider FailFast instead of continuing to run code under failures we know are fatal and dangerous to attempt to recover from, much like we do with SO. At least these failures should be undeniable like thread aborts are. But this is a fairly Byzantine response and is not for the faint of heart. Given that we still live in a world where WinForms wraps the top-most frame of the GUI thread in a catch-all, presents a dialog box, and allows a user to click “Ignore & Continue”, I seriously doubt we’ll get there anytime soon.
- Never expose a ThreadAbortException to code in an AppDomain unless we can guarantee the AppDomain is being unloaded. That means getting rid of the Abort API, and thus indirectly disallowing code from catching and calling ResetAbort. It also means the A calls B calls C case would not allow B to unload until the thread voluntarily unwinds out of C.
- Allow OOMs to be caught only when they are soft. That means a call to ‘new’, and it means the catch much occur inside the same stack frame as the call to ‘new’. Such exceptions can be tolerated if code is properly written, and we will tell developed to be mindful of them. Once such an OOM propagates past the calling stack frame, they will escalate to hard.
- All other OOMs are hard and fatal. This includes failure to allocate memory to JIT code and failure to allocate 20 bytes to box an int. Hard OOMs are thus undeniable.
- Get rid of ThreadInterruptedExceptions. We screwed this up from Day One, and it’s probably too late to fix this. We added cooperative cancellation in .NET 4.0 for a reason.
- TypeInitializationExceptions can probably stay, but we should allow rerunning the cctor upon subsequent accesses. Today, once a class C throws from its cctor, the class can never be constructed. So on the current plan, it only makes sense to FailFast.
I’m sure there are many other things we could do to improve things. But these 6 general themes would be a great start.
I’m just spitballing here. There are no concrete plans to do any of these 6 things as far as I know. And at the end of the day, hardening only improves the statistics of the situation, so it tends to be very difficult to argue for one change over another, particularly if taking the change would make existing programs break. But I really would like to see the base level of reliability in managed code improve with time. Especially with the exciting work going on around contract-checking in the BCL in Visual Studio 10, I hope these topics become top-of-mind for folks again in the near future.
 Friday, March 06, 2009
A developer from the Parallel Extensions team, Igor Ostrovsky, recently whipped together a really neat Silverlight game. It's called RoboZZle, and he calls it a "social puzzle game":
"RoboZZle is an online puzzle game that challenges players to program a robot to pick up all stars on a game board. The game mechanics are simple, yet allow for a wide variety of challenges that call for very different solution approaches." (from the blog entry introducing it.)
The coolest feature of this game is that you win by programming. And there's a whole community surrounding it, complete with forums, the ability to create your own games, a ranking system, its own blog and continuously updated news, etc. Check it out.
|
|
Recent Entries:
Search:
Browse by Date:
| | Sun | Mon | Tue | Wed | Thu | Fri | Sat | | 31 | 1 | 2 | 3 | 4 | 5 | 6 | | 7 | 8 | 9 | 10 | 11 | 12 | 13 | | 14 | 15 | 16 | 17 | 18 | 19 | 20 | | 21 | 22 | 23 | 24 | 25 | 26 | 27 | | 28 | 29 | 30 | 1 | 2 | 3 | 4 | | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
Browse by Category:
Notables:
|