|
Personal Info:
Joe  leads the architecture of an experimental OS's developer platform, where
he is also chief architect of its programming language. His current mission is to enable
writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe
founded the Parallel Extensions to .NET project.
He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife,
writing books, writing music,
studying music theory & mathematics, and doing anything involving food & wine.
My books
My music
Disclaimer:
The content of this site are my own personal opinions and do
not represent my employer's view in anyway.
© 2012, Joe Duffy
|
|
 Sunday, March 04, 2007
In 2.0 SP1, we changed the threadpool's default maximum worker thread count from 25/CPU to 250/CPU.
The reason wasn't to encourage architectures that saturate so many threads with CPU-bound work, nor was it to suggest that having 249 threads/CPU blocked and 1/CPU actively running is actually a good design. A non-blocking, continuation-passing/event-driven architecture is generally a better approach for the latter case. Rather, we did this to statistically reduce the frequency of accidental and nondeterministic deadlocks.
Believe it or not, deadlocking the threadpool was the most frequently reported threading-related customer problem/complaint during my tenure as the CLR's concurrency PM. There are KB articles and a wealth of customer and Microsoft employee blog posts about this issue.
Many algorithms demand dependencies between parallel tasks. And sometimes—as is frequently the case in data parallel algorithms—the number of tasks can be variable up to a factor of the input size. A "task" in this context is just the closure passed to QueueUserWorkItem. Consider a dumb parallel merge sort, which uses divide-and-conquer style parallelism, for example:
void Sort(int[] numbers, int from, int to) { if (from < to) { ... ThreadPool.QueueUserWorkItem(delegate { Sort(numbers, from, (from+to+1)/2); }); // T1 Sort(numbers, (from+to+1)/2, to); // T2 ... Wait for T1 to finish ... ... Merge left and right ... } }
In this case, T1 is run in parallel and sorts the left half; T2 runs on the calling thread and sorts the right. After T2 runs sequentially, it must wait for T1 to complete before moving on to the merge step. As written, this algorithm is clearly inefficient and deadlocks easily. Pass it an array of size 33 on a 2-CPU machine, and the threadpool's default maximums will ensure that some T1's can't even get scheduled, leaving threadpool threads blocked waiting for their queued (and stuck) counterparts to finish. Depending on how tasks are scheduled at runtime this could deadlock.
Clearly the programmer needs to "stop" dividing the problem at some reasonable point, i.e. limit the maximum number of tasks generated; otherwise the task count will grow with some factor of the input size, causing deadlocks for large inputs (not to mention huge context switch and resource consumption overheads). When might that point be? Perhaps the programmer calculates some degree-of-parallelism (DOP) at the top of the recursive call stack, say log2(#ofCPUs). DOP is passed to the first call to Sort and each subsequent recursive call decrements the DOP by 1. So long as the DOP argument is >0, T1 is run in parallel; otherwise, T1 is run sequentially on the same thread, just like T2. This ensures we don't spawn more tasks than there are CPUs.
And this will probably work. Most of the time.
What if, just by chance, the stars aligned and 25 instances of this algorithm ran simultaneously? Seems farfetched? Maybe so. Consider this: using log2(#ofCPUs) might not be enough in the case that some comparison routines block during sorting, possibly suggesting log2(2(#ofCPUs)) as a better DOP instead. And then all we need is 12.5 occurrences. Still a little farfetched, but not quite as much. But wait: there could be other algorithms using the thread pool simultaneously, particularly on a server. (Yes, data parallelism on a server is probably suspect in the first place, but for highly parallel servers with volatile loads, it could be useful.) And remember, the thread pool is shared across all AppDomains in the process, so if you've written a reusable component, you're relying on all other software in the process to behave properly too (which you may have absolutely no control over).
Most of these admittedly represent imperfections in the overall design and architecture of the application, but the sad fact is that they tend to be somewhat common. Especially when components are dynamically composed in the process, as is common with server and AddIn-style apps. And they are very nondeterministic and hard to test for. Our platform doesn't offer a mechanism today that allows developers to write code that is intelligently-parallel, particularly when many heterogeneous components are trying to use concurrency for performance. Even with the suggested improvements and the CLR threadpool's old 25/CPU thread limit, the Sort routine could deadlock once in a while, maybe under extreme stress and very hard to reproduce conditions. This will occur less frequently, statistically speaking, with the 250/CPU limit. The problem is that all of this is just a heuristic, there aren't any hard numbers and coordination involved.
It’s also worth noting that the threadpool throttles its creation of threads to 2/second once the count has exceeded the #ofCPUs. That means if a programmer sees this situation happening with regularity, they will also observe hard to diagnose performance degradations. Once in a while, that sort algorithm might take 10-times longer to run, inexplicably. If this happens a lot, the developer will notice, profile, and fix the issue. While this isn’t great, this problem is typically quite rare in any one (properly written) program, and doesn’t happen with regularity. Our first priority was to prevent the periodic and sporadic hangs, the things eating away at the reliability and uptime of programs, to trade them off for possible periodic performance blips. Many of those horribly misarchitected programs will still deadlock deterministically with the new limits, and the thread injection throttling will help to discourage them from being written this way. (It would take 125 seconds to create 250 threads on a 1-CPU machine, and it seems unlikely that a 2 minute-plus delay would be tolerated. Some people use SetMinThreads to get around this, which is (usually) inexcusable.)
With all of this said, we clearly have a lot of work to do in the future to encourage better parallel program architectures and to provide better tools for diagnostics in this area. I agree with the basic tenet "use as few threads as possible, but no fewer," but sometimes we have to sacrifice idealism to solve practical real-world problems. In my experience, this should solve many such problems.
 Monday, February 19, 2007
A reader asked for clarification on a past article of mine, regarding my claim that one particular variant of the double checked locking pattern won't work on the .NET 2.0 memory model. The confusion was caused because my advice seems to contradict Vance's MSDN article on the topic.
The problem is with variants of double checked locking that use a flag to indicate that a variable has or has not been initialized, versus using the presence of null to indicate this. This can come in handy if null is a valid initialized value, when the value is a value type, and/or if multiple variables are involved in the initialization.
After following up with a few Microsoft and Intel folks about this, I still believe this to be an issue. Here is what I claim:
- Because standard Intel processors (X86/IA32, EM64T) use non-binding speculative reads, the problem will not happen due to speculation. And because processor consistency memory models don’t permit loads to freely reorder, this won’t happen because of cache hits.
- However, on IA64, non-volatile loads can be freely reordered, and therefore a cache hit can cause the load of the value to pass the load of the flag. I have not been given a clear answer yet on the nature of IA64’s speculation model, but I suspect IA64 is non-binding too, and therefore this cannot occur as a sole result of branch prediction (though that is pretty much immaterial because of cache reordering).
- In talking with some compiler folks here, they also agree that legal compiler transformations (according to .NET 2.0’s memory model) can break the code.
- With that said, no Microsoft compiler we know of will actually make the transformation.
- With some simple (though unlikely) modifications, existing compilers could find it more attractive to apply CSE/PRE, causing the read to move and break the code pattern.
The take-away is not necessarily the specific details, though perhaps those are interesting too. Rather, the primary take-away is that you really ought to use the volatile modifier whenever you aren’t 100% certain that the default memory model will prevent these kinds of reorderings. (And even then, volatile is still a good idea, to declare your intent to other programmers looking at the code.)
As I mentioned in the original article, the use of volatile is enough to ensure this particular example works correctly.
 Monday, February 12, 2007
Somebody recently asked in a blog comment whether the new ReaderWriterLockSlim uses a full barrier (a.k.a. two-way fence, CMPXCHG, etc.) on lock exit. It does, and I claimed that "it has to". It turns out that my statement was actually too strong. Doing so prevents a certain class of potentially surprising results, so it's a matter of preference to the lock designer whether these results are so surprising as to incur the cost of a full barrier. Vance Morrison's "low lock" article, for instance, shows a spin lock that doesn't make this guarantee. And, FWIW, this is also left unspecified in the CLR 2.0 memory model. Java's memory model permits non-barrier lock releases, though I will also note the JMM is substantially weaker in areas when compared to the CLR's.
Here's an example of a possibly surprising result that non-barrier releases can cause:
Initially x == y == 0.
Thread 0 Thread 1 =============== =============== lock(A); lock(B); x = 1; y = 1; unlock(A); unlock(B); t0 = y; t1 = x;
Is it possible after executing both that: t0 == t1 == 0?
It is simple to reason that this is impossible with sequential consistency. In SC the only way that t0 == 0 is if Thread 0's t0 = y statement (and therefore x = 1, assuming a memory model in which writes happen in order) were to occur before Thread 1's y = 1 (and therefore t1 = x). In this case, t1 = x must subsequently see t1 == x == 1, otherwise the history contradicts SC. The only other possibility is that Thread 1's t1 = x happens before Thread 0's x = 1 and therefore also t0 = y, in which case it must be the case that the subsequent t0 = y by Thread 0 yields t0 == y == 1. In both cases, either x or y must be seen as 1. (Interleavings are clearly possible that result in x == y == 1.)
The CLR 2.0's memory model guarantees that, if the unlock incurs a barrier, the same SC reasoning applies. Unfortunately, if the unlock is not a barrer, then in both cases the load of x or y may occur before the write buffer has been flushed, meaning the write to x or y and the unlocking write itself, possibly leading to t0 == t1 == 0. This happens even on relatively strong processor consistency memory models such as X86, and on weaker ones such as IA-64 (even when all loads are acquire and all stores are release, which only happens for volatile CLR fields). To ensure the write buffer has been flushed before the read happens, the unlock statement must flush the buffer (or an explicit barrier is needed), accomplished with CMPXCHG on X86 ISAs.
Many would argue that, because the locks taken are different between the two threads, SC does not apply and therefore implementing unlock as a non-barrier write is legal. JMM takes this stance. This actually seems like a fine argument to me, and after thinking about it for a bit, it's probably what I would choose if I were defining a memory model. But the CLR 2.0 MM is generally stronger than most, so people might actually depend on this and expect it to work. This could cause Monitor-based code to break when moved to alternative lock implementations that don't issue full barriers at release time. This is just one example of why it'd be really great to have a canonical specification for the CLR's MM. At least we'd then have a leg to stand on when faced with tricky compatability trade-offs some day down the line.
 Wednesday, February 07, 2007
In Orcas, we offer a new reader/writer lock: System.Threading.ReaderWriterLockSlim.
Motivation for a new lock
The primary reason for creating this type was that we wanted to provide an official reader/writer lock for the .NET Framework that people could actually rely on for performance-critical code. It was no secret that the current ReaderWriterLock type was such a pig, costing somewhere around 6X the cost of a monitor acquisition for uncontended write lock acquires, that most people avoided it entirely. Jeff Richter wrote an entire MSDN article about this, and Vance Morrison showed how to build your own on his weblog. It was really too bad customers couldn't depend on the class in the Framework, and to be honest most devs really shouldn't be in the business of writing and maintaining their own reader/writer lock.
Second, we had a large number of qualms with the existing lock’s design. It had funny recursion semantics (and is in fact broken in a few thread reentrancy cases we know about), and has a dangerous non-atomic upgrade method. Did you know that you actually need to check the WriterSeqNum before and after a call to our ReaderWriterLock’s UpgradeToWriteLock method to ensure it didn’t change during the upgrade? You do. The code actually releases the reader lock before upgrading to the write lock, which allows other threads to sneak in, acquire the lock in between, and possibly change state that was read during the decision to upgrade. The reason? If we upgraded and then released the read lock, two threads trying to simultaneously upgrade would deadlock one another. All of these problems represent very fundamental flaws in the existing type’s design.
So we decided to build a new one that solves all of these problems. To be honest, I would have liked to fix the current one, but the existing API and compatibility responsibilities make that just about impossible. We considered obsoleting the existing one, but as I note at the end of this article, there are still reasons to prefer the old lock.
Three modes: read, write, and upgradeable-read
The new ReaderWriterLockSlim supports three lock modes: Read, Write, and UpgradeableRead, and has the methods EnterReadLock, EnterWriteLock, EnterUpgradeableReadLock, and corresponding TryEnterXXLock and ExitXXLock methods, that do what you’d expect. Read and Write are easy and should be familiar: Read is a typical shared lock mode, where any number of threads can acquire the lock in the Read mode simultaneously, and Write is a mutual exclusion mode, where no other threads are permitted to simultaneously hold the lock in any mode. The UpgradeableReadLock will probably be new to most people, though it’s a concept that’s well known to database folks, and is the magic that allows us to fix the upgrade problem I mentioned earlier. We’ll look at it more closely in a moment.
The performance of the new lock is roughly equal to that of Monitor. When I say “roughly”, I mean that it’s within a factor of 2X in just about all cases. And the new lock favors letting threads acquire the lock in Write mode over Read or UpgradeableRead, since writers tend to be less frequent than readers, generally leading to better scalability. We’d originally considered providing a set of contention management options to choose from, but decided in the end to ship a simpler design that works well for most cases.
Upgrading
Let’s look at upgrades more closely now. The UpgradeableRead mode allows you to safely upgrade from Read to Write mode. Remember I mentioned earlier that our old lock breaks atomicity in order to provide deadlock-free upgrade capabilities (which is bad, particularly since most people don't realize it). The new lock neither breaks atomicity nor causes deadlocks. We acheive this by allowing only one thread to be in the UpgradeableRead mode at once, though there may be any number of other threads in Read mode while there’s one UpgradeableRead owner.
Once the lock is held in the UpgradeableRead mode, a thread can then read state to determine whether to downgrade to Read or upgrade to Write. Note that this decision should ideally be made as fast as possible: holding the UpgradeableRead lock forces any new Read acquisitions to wait, though existing Read holders are still permitted to remain active. (Sadly the CLR team seems to have removed two methods, DowngradeToRead and UpgradeToWrite, that I had originally designed for this purpose. I admit what follows isn’t the most obvious way to do it.) To downgrade, you simply call EnterReadLock followed by ExitUpgradeableReadLock: this permits other Read and UpgradeableRead acquisitions to finish that were previously held up by the fact that there was an UpgradeableRead lock held. To upgrade, you similarly call EnterWriteLock: this may actually have to wait until there are no longer any threads that still hold the lock in Read mode. There’s no real reason to also exit the UpgradeableReadLock at this point unlike the downgrade case, though in some cases it makes your code more uniform. E.g.:
ReaderWriterLockSlim rwl = …; … bool upgraded = true; rwl.EnterUpgradeableReadLock(); try { if (… read some state to decide whether to upgrade …) { rwl.EnterWriteLock(); try { … write to state … } finally { rwl.ExitWriteLock(); } } else { rwl.EnterReadLock(); rwl.ExitUpgradeableReadLock(); upgraded = false; try { … read from state … } finally { rwl.ExitReadLock(); } } } finally { if (upgraded) rwl.ExitUpgradeableReadLock(); }
Recursive acquires
Another nice feature with the new lock is how it treats recursion. By default, all recursive acquires, aside from the upgrade and downgrade cases already mentioned, is disallowed. This means you can’t call EnterReadLock twice on the same thread without first exiting the lock, for example, and similarly with the other modes. If you try, you get a LockRecursionException thrown at you. You can, however, turn recursion on at construction time: pass the enum value LockRecursionPolicy.SupportsRecursion to your lock’s constructor, and voila, recursion will be permitted. The chosen policy for a given lock is subsequently accessible from its RecursionPolicy property.
There’s one special case that is never permitted, regardless of the lock recursion policy: acquiring a Write lock when a Read lock is held. We considered enabling this, or at least giving a new enum value for it, but decided to hold off for now: if it turns out customers need it, we can always add it later. But it’s dangerous and leads to the same Read-to-Write upgrade deadlocks that the old lock was prone to, and so we didn’t want to lead developers down a path fraught with danger. If you need this kind of recursion, it’s a “simple” matter of changing your design to hoist a call to either EnterWriteLock or EnterUpgradeableReadLock (and the corresponding Exit method) to the outermost scope in which the lock is acquired.
There are corresponding properties IsReadLockHeld, IsWriteLockHeld, and IsUpgradeableReadLockHeld, to determine whether the current thread holds the lock in the specified mode. You can also query the WaitingReadCount, WaitingWriteCount, and WaitingUpgradeCount properties to see how many threads are waiting to acquire the lock in the specific mode, and CurrentReadCount to see how many concurrent readers there are. The RecursiveReadCount, RecursiveWriteCount, and RecursiveUpgradeCount properties tell you how many recursive acquires the current thread has made for the specific mode.
Some limitations: reliability
Lastly, I mentioned there are some caveats around where this lock’s use is appropriate. Well, there’s one, really: it’s not hardened to be reliable. This means a few things.
First, unlike the existing ReaderWriterLock, the ReaderWriterLockSlim type does not cooperate with CLR hosts through the hosting APIs. This means a host will not be given a chance to override various lock behaviors, including performing deadlock detection (as SQL Server does). Thus, you really ought not to use this lock if your code will be run inside SQL Server.
Next, the lock is not robust to asynchronous exceptions such as thread aborts and out of memory conditions. If one of these occurs while in the middle of one of the lock’s methods, the lock state can be corrupt, causing subsequent deadlocks, unhandled exceptions, and (sadly) due to the use of spin locks internally, a pegged 100% CPU. So if you’re going to be running your code in an environment that regularly uses thread aborts or attempts to survive hard OOMs, you’re not going to be happy with this lock. Unfortunately the lock doesn’t even mark critical regions appropriately, so hosts that do make use of thread aborts won’t know that the thread abort could possibly put the AppDomain at risk: many hosts would prefer to wait, or immediately escalate to an AppDomain unload, if an individual thread abort is necessary while the thread is in a critical region. But in the case of ReaderWriterLockSlim, a host has no idea if a thread holds the lock because the implementation doesn’t call Begin- and EndCriticalRegion. And the kind of problems I mentioned in the previous post are always an issue with ReaderWriterLockSlim, because we don't necessarily guarantee that there will be no instructions in the JIT-generated code between the acquisition and entrance to the following try block.
Summary
In summary, the new ReaderWriterLockSlim lock eliminates all of the major adoption blockers that plagued the old ReaderWriterLock. It performs much better, has deadlock-free and atomicity-preserving upgrades, and leads developers to program cleaner designs free of lock recursion. There are some downsides to the new lock, however, that may cause programmers writing hosted or low-level reliability-sensitive code to wait to adopt it. Don’t get me wrong, most people really don’t need to worry about these topics, so I apologize if my words of warning have scared you off: but those that do really need to be told about the state of affairs. Thankfully, I’m confident that many of these issues will be fixed in subsequent releases. And for most developers out there, the new ReaderWriterLockSlim is perfect for the job.
 Saturday, February 03, 2007
[Update 2/19/07: the search for Jim has been called off after just about three weeks. Thanks to everybody who helped, and best wishes to Jim's family.]
Jim Gray, a Microsoft technical fellow who is best known for his seminal work on database transactions and data management, has been missing for almost a week. He took his 40-foot sailboat, Tenacious, last Sunday to scatter the remains of his late mother near the Farallon Islands off the coast of California, and has been missing ever since. This article has additional details.
Although the Coast Guard has called off the search, friends and family are chartering private aircraft, scouring satellite imagery, and NASA even made a special run of its ER-2 aircraft to generate additional data. Werner Vogels has more information on his website, including how you can help with the search.
Jim actually provided important feedback while I was thinking and writing about PLINQ, and was one of the most supportive and excited reviewers of the project. Not to mention all of the papers he wrote or was involved with that inspired a lot of the direction. I’d be very sad to see his life cut short, particularly in this way.
 Monday, January 29, 2007
I previously mentioned the X86 JIT contains a "hack" to ensure that thread aborts can't sneak in between a Monitor.Enter(o) and the subsequent try-block. This ensures that a lock won't be leaked due to a thread abort occurring in the middle of a lock(o) { S1; } block. In the following example, that means an abort can't be triggered at S0:
Monitor.Enter(o); S0; try { S1; } finally { Monitor.Exit(o); }
If an abort could happen at S0, it'd be possible for a thread to acquire lock o, but before entering the try block, be asynchronously aborted, and then not run the finally block to release the lock on o. This would lead to an orphaned lock, and probable deadlocks later on during execution. Debugging an instance of such a deadlock would of course be rather difficult because it depends on a very subtle race condition that must occur within the tiny window of a single instruction. On a single-processor machine, this would require a precariously placed context switch, but as more and more cores are added to the machines that this software runs on, the probability simply increases.
Characterizing this as a "hack" was a little harsh. It's really just a byproduct of the way that the X86 JIT generates code.
For an asynchronous thread abort to be thrown in a thread, that thread must be either: (1) polling for the abort in the EE or (2) running inside of managed code. And even if the thread is in managed code, we may not be able to abort it, as is the case if the thread is currently executing a finally block, inside a constrained execution region, etc. The C# code generation for the lock statement ensures there are no IL instructions between the CALL to Monitor.Enter and the instruction marked as the start of the try block. The JIT correspondingly will not insert any machine instructions in between the two. And since any attempted thread aborts in Monitor.Enter are not polled for after the lock has been acquired and before returning, the soonest subsequent point at which an abort can happen is the first instruction following the call to Monitor.Enter. And at that point, the IP will already be inside the try block (the return from Monitor.Enter returns to the CALL+1), thereby ensuring that the finally block will always run if the lock was acquired.
This might seem like an implementation detail, but the reality is that we can never change it. Too many people depend on this guarantee.
It turns out that Whidbey's X64 JIT does not guarantee this behavior. (I suspect IA64 doesn't either, but don't know for sure.) In fact there's a high probability that this won't work: there is always a NOP instruction before the CALL and the instruction marking the try block in the JITted code. This is done to make it easier to identify try/catch scopes during stack unwind. This means that, yes indeed, an abort can happen at S0 on 64-bit.
This will likely be fixed for the next runtime release, but I can't say for sure.
Update 4/17/08: This was indeed fixed for the X64 JIT in Visual Studio 2008. Note that when compiling C# code targeting both X86 and X64, if you do not use the /o+ switch, this problem can still occur due to extra explicit NOPs inserted before the try.
The framework implements a method Monitor.ReliableEnter, by the way, that could be used to avoid orphaning locks in the face of thread aborts, but it's internal to mscorlib.dll. It sets an out parameter within a region of code that cannot be interrupted by a thread abort, which the caller can then check inside the finally block. The acquisition then gets moved inside so that, if the CALL is reached, the finally block is guaranteed to always run. You'd then write this instead:
bool taken; try { Monitor.ReliableEnter(o, out taken); S1; } finally { if (taken) Monitor.Exit(o); }
It's also possible the CLR team would expose this API in the future. We wanted to in Whidbey, but didn't have enough time. If 64-bit code generation was changed so that it doesn't emit a NOP before the try block, however, we probably wouldn't need ReliableEnter after all.
 Tuesday, January 23, 2007
Joel tagged me on this silly 5 things you didn't know about me meme. Since it's floating around, and because I’d hate to disappoint a crazy Australian who’s apt to shoot vinegar in my eyes with a Super Soaker (woof, woof), I figure I'll give it a shot:
- I played in a heavy metal band when I was a teenager. I was the lead guitarist, and loved to shred it up, with a sort of { 80's metal, 90's heavy metal, neo-classical, blues } style, influenced heavily by Slayer, Metallica, Pantera, GnR, Yngwie Malmsteen, Joe Satriani, and Stevie Ray Vaughan. (Odd combination? You bet!) Don’t ask for pictures, I have conveniently "lost" them. I stopped playing guitar for several years, but picked it back up in the past year and have been playing daily (focusing on building better music theory technical depth, but still shredding: my recent obsession is Dragonforce). I also created several independent industrial and experimental techno CDs. I try again every now and again, but I seem to have lost the patience.
- I stopped playing in the heavy metal band because I "accidentally" punched a guy in the face in a mosh pit at a Sepultura concert. This broke my hand in several places and was quite unpleasant. (I did stay for the whole set, I'm happy to report. The bar was kind enough to supply a plastic cup with some ice.) The emergency room wrapped it in a cast, and then after 4 weeks, it had healed incorrectly. So then I had to have it manually rebroken, and spent the entire summer in a new cast. At the end, I had no hand or arm strength, my finger was crooked (making it hard to hold a pick), and my band had broken up because we couldn't play out (and they apparently didn't have the motivation to replace me temporarily). This was the end of my dreams of being a rock star.
- When I was 18 years old, I weighed a whopping 260 pounds. Most of it was muscle, resulting from me working out literally every single day, and I have the stretch marks to show it. I was a big dude, benched a little more than my weight, and even played center for a short period in high school football (I wasn’t keen on the fact that it’s apparently protocol that the quarterback routinely lays his hands on the center’s butt). Now I weigh about 160 and have no muscle.
- I started programming when I was 12. This came about because a local gaming place (Gamer's Guild in Milford, MA) had set up a new BBS system, on which they provided several MUDs, MOOs, MUSHes, etc., on their Commodore Amiga 3000's and 4000. I became hooked, especially when I found out you could get access to the source code and actually change it. When CircleMUD came out, circa '92 (IIRC), I got serious about hacking code, doing the bulk of my MUD programming on a hand-assembled Slackware Linux 1.0 IBM PC. Ironically, yours truly (a Microsoft employee) did most of his learning on the Linux OS, using GNU's C Compiler. (I actually refused to use DOS until I started doing games programming that required graphics and found some cool DOS toolkits. I refused Windows even longer, until I got sick of using 'lynx' to browse the WWW. I have always found Windows API programming, with UPPER_CASE type names everywhere, rather ugly and I simply didn't get it at first.) I set up my own 8-line BBS and later in '94 converted to co-located Internet hosting, running a heavily modified CircleMUD (3.1?) for a year and a half. There was typically around 20 players signed on at any given point.
- I am VERY obsessive about things. I want to buy as many books as I can get my hold on, read them all the time, write code every waking moment, drink wine every night, go out for 15 course dinners every night, play guitar and learn every scale, major and minor, mode, modification, chord, arpeggio, I like to work, work, work until I am barely awake, and so on. It drives my friends completely nuts, but it keeps me busy. I've been that way ever since I was a little kid, so I don't think it's going to change anytime soon.
I don't know many people, so I'm going to commit meme heresy and not link to anybody else. Sorry.
 Monday, January 22, 2007
I was recently asked by a customer how to guarantee alignment of CLR data on 16-byte boundaries. They needed this capability to interoperate with code that uses SSE vector instructions to manipulate the data (which require 16-byte alignment). The bad news is that there’s no real good way of doing this. That is, there isn’t any “align at N bytes” feature for the CLR in which type layout and stack and heap allocation cooperate. The good news is that you can fake it.
(I spoke about alignment with respect to atomic cmpxchg8b instructions previously, right here, for those interested in reading about that too.)
The details of how to go about ensuring 16-byte alignment depend on whether you allocate your data on the stack or the GC heap. For illustration purposes, imagine we’re dealing with an array of float32[]’s. We’d like to ensure the beginning lies on a 16-byte boundary:
- float [] a0 = new float[N]; // GC-allocated array of N floats
- float * a1 = stackalloc float[N]; // stack-allocated array of N floats
If you use the former, GC allocation (1), you’re going to have a really tough time. The GC moves objects around on you as it performs compactions, and only aligns the 1st element of the array on a 4-byte boundary. So even if you manage to get your object allocated on a 16-byte boundary (by chance), it is apt to move during a subsequent GC.
To solve this problem, you’d have to pin the object. Pinning causes GC fragmentation, so I really encourage you to avoid this approach and go with stack allocation, (2), if you can afford it. A float[] on the stack is similarly aligned to begin at a 4-byte boundary, but, unlike (1), it will subsequently not move around. Of course stack allocation is often impossible, or difficult, if you are writing a reusable library that may be called in an unknown context (where the caller may have very little stack left). This is a tradeoff you would have to make. If the pinning is very short lived, i.e. the duration of a single function call, it might be tolerable for you, a la P/Invoke.
Regardless of whether you choose (1) with pinning or (2) by itself, you’ve now got a stable address. And you can use the stable address to calculate the next 16-byte element in the array from the base address, and then use that as the start of the array. You will need some extra padding at the end for the worst case, which is base + 3, meaning at most 12 bytes, so you need to allocate 3 extra floats in the array. Here’s an example:
void * AlignUp(void * p, ulong alignBytes) { ulong addr = new UIntPtr(p).ToUInt64(); alignBytes -= 1; // adjust pointer for arithmetic if (((1<<(IntPtr.Size*4 - 1)) - alignBytes) <= addr) throw new Exception(“overflow”); ulong newAddr = (addr + alignBytes) & ~alignBytes; return new UIntPtr(newAddr).ToPointer(); } … float * p = stackalloc float[N + 3]; p = (float *)AlignUp(p, 16); … use p …
Note that if you were to use an array of doubles instead, you’d have some challenges. That’s because a 8-byte value on the 32-bit CLR is only 4-byte aligned, and therefore you can end up with a situation where the next 16-byte granularity is in the middle of a single element. For example, 12 + 8 = 20 byte, +8 = 28 byte, +8 = 36 byte, and so on. None of these are 16-byte aligned. Not that it really matters, so long as you allocate enough memory, but you will need to do some casting of the array reference, as shown in the above code, to do the arithmetic.
Note also that there’s a StructLayout attribute that allows you to specify alignment, through its padding field, but sadly this doesn’t impact the GC’s heap or the JIT’s stack alignment, and so it’s useless for our purposes. Though the relative alignment within the data structure will be correct, the absolute alignment is not guaranteed to be so.
OK, so I know all of this isn’t pretty. But it works.
 Monday, January 08, 2007
I will be giving a PLINQ talk at the Declarative Aspects of Multicore Programming (DAMP) workshop at POPL next week:
[Update: 1/23/07 -- added a link to the slide deck below.]
PLINQ: A Query Language for Data Parallel Programming
Microsoft’s language integrated query (LINQ) technology provides an expressive syntax and suite of APIs which facilitate classic data parallel operations: filtering, mapping, reductions, loop tiling, sorts, nested loops parallelism, prefix scans, and more. In this talk, we look at a new set of extensions to the LINQ technology, called parallel LINQ (PLINQ), which automatically optimizes and parallelizes query operations based on dynamic runtime information. We believe that the use of a familiar SQL-like syntax will broaden the reach of PLINQ in industry, and provides programmers with a more declarative and flexible way of expressing data-intensive computations.
Download presentation (PPT).
The workshop is being chaired by Guy Blelloch from CMU, is located in Nice, France, and features some interesting recent research in the field of parallel programming. If anybody will be in town and wants to meet up, please drop me a line at joedu AT microsoft.
 Sunday, January 07, 2007
 |
Jim Larus, a Microsoft colleague of mine, recently co-authored a book on Transactional Memory with Ravi Rajwar from Intel, one of the most prominent authorities in the TM community. It just became available. You can purchase and download it online at Morgan & Claypool's website: Transactional Memory (Synthesis Lectures on Computer Architecture). The series of which it is part is new, but there are at least a few other great books in its pipeline, including a CMP (chip multi-processor) architecture book by Kunle Olukotun, from Stanford and an architect of Sun's Niagara processor. |
|
|
Recent Entries:
Search:
Browse by Date:
| | Sun | Mon | Tue | Wed | Thu | Fri | Sat | | 25 | 26 | 27 | 28 | 1 | 2 | 3 | | 4 | 5 | 6 | 7 | 8 | 9 | 10 | | 11 | 12 | 13 | 14 | 15 | 16 | 17 | | 18 | 19 | 20 | 21 | 22 | 23 | 24 | | 25 | 26 | 27 | 28 | 29 | 30 | 31 | | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Browse by Category:
Notables:
|