|
Personal Info:
Joe  leads the architecture of an experimental OS's developer platform, where
he is also chief architect of its programming language. His current mission is to enable
writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe
founded the Parallel Extensions to .NET project.
He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife,
writing books, writing music,
studying music theory & mathematics, and doing anything involving food & wine.
My books
My music
Disclaimer:
The content of this site are my own personal opinions and do
not represent my employer's view in anyway.
© 2012, Joe Duffy
|
|
 Wednesday, September 12, 2007
Lock recursion is usually a bad idea. It can seem convenient (at first), but once the slippery slope of making calls from critical regions into complex ecosystems of code is embarked upon (which is usually a necessary pre-requisite to lock recursion, except for some relatively simple cases), it’s easy to accidentally fall right off the edge. This topic was part of the doc I wrote previously about using concurrency inside of reusable libraries. My opinions haven't changed much since then.
Lock recursion coupled with condition variables is even worse. In fact, its behavior might surprise you.
To motivate this, would you ever think of writing code that does something like this?
void BreakAtomicity() { … I assume somebody called me with a recursive lock on ‘obj’ … Monitor.Exit(obj); Monitor.Exit(obj); … Do something … Monitor.Enter(obj); Monitor.Enter(obj); }
I should certainly hope not! Unless you're crazy or reeeeally know what you're doing. Who knows what state invariants are busted at the time the call to BreakAtomicity was made? Releasing the lock in this manner hoists these ticking timebombs onto the other threads into the process that might want to inspect the shared state. If you, the author of BreakAtomicity, have all-knowing omnipresent knowledge of the entire program, perhaps you know precisely. But, particularly in the case of recursion, where it's all-too-common to engage in practices of sloppy composition, this is actually quite unlikely. Lock recursion is typically used for convenience, not because of a really solid design that is based on clean algorithmic recursion.
What does this example have to do with condition variables anyway? Glad you asked! It matters because of what happens when you wait on a monitor that has been recursively acquired. In such cases, Monitor.Wait will release all recursive acquires as part of its waiting. I.e. if it has been acquired 10 times, it is released 10 times before waiting. It does this, of course, because otherwise it would deadlock waiting for some other thread to make a call to Monitor.Pulse/PulseAll (since a separate thread needs to first acquire the lock in order to do either). This is symmetric, so once the thread has been awoken, it will reacquire the lock as many times as needed before returning to attain the same level of recursion that existed prior to the call.
Now, Monitor.Wait breaks atomicity anyway. This is obvious. It releases and reacquires the lock internally, and so any conditions regarding shared state that exist prior to the call cannot be assumed to exist after the call returns. (Most) people understand this and tend to use Wait in fairly common and safe patterns, such as guarded regions where some predicate is checked for validity at the very front of a critical region before doing anything interesting with state. But the really nasty thing about recursive locks and the Wait behavior described above is that this breaks atomicity for some unknowing number of nested critical regions that have existed for some unknown amount of time leading up to the Wait. This is a recipe for pain. My recommendation is probably predictable: just following the broadly accepted advice that, because lock recursion is evil to begin with, it is best avoided, and you will safely avoid the more complicated case outlined above.
It’s worth pointing out that the new CONDITION_VARIABLE in Vista, i.e. SleepConditionVariableCS and SleepConditionVariableSRW, only release the lock once, despite recursive acquire counts. (SRWL doesn’t officiailly support recursion, although it works for shared locks since there is no affinity used and it is undetectable.) Deadlocks result instead. From an editorial perspective, I prefer this behavior quite a bit, since it’s easier to debug. (Admittedly, if Monitor’s behavior is what you want, it’s less than straightforward to achieve, unless you know the recursion counter somehow. Although, I will also note that I am convinced very few real people would want Monitor’s current behavior...) My preferred solution to this would have been to throw an exception, since I do think issuing a Wait when locks have been recursively acquired is in most cases a bug. As a workaround, we could have exposed a RecursionCount on Monitor so that a developer could manually exit the lock RecursionCount-1 times before the call to Wait and then reacquire it RecursionCount-1 times after the call returns. (Actually no -- I would have made Monitor non-recursive by default, like the new ReaderWriterLockSlim.) Sadly, I guess I'm only about 10 years too late...
 Wednesday, August 22, 2007
Most managed code in the .NET Framework has not been hardened against asynchronous exceptions. This includes out of memory (OOM) conditions and asynchronous thread aborts, and is entirely by design. Hardening against OOM, for example, is historically an extraordinarily difficult feat, and few systems undertake the development and QA costs needed to do so. (FWIW, the CLR VM is one such system.) Simply failing gracefully is usually hard enough. Failing gracefully is admittedly leaps and bounds easier in managed code because allocation failures are communicated via exceptions rather than return values, and are thus transitively propagated “by default.” Thread aborts are even more difficult to harden against, however, because they can originate at any instruction (with a handful of exceptions). Ensuring data invariants are protected for every single instruction is clearly just a little difficult.
These things are certainly not impossible. With enough effort, you can make inroads toward solutions for both issues. Portions of the .NET Framework have gone to such lengths. For example, code that manipulates process-wide state spanning AppDomains needs to ensure that this state is not corrupted by an unfortunately placed thread abort when run inside systems like SQL Server that use aborts to tear down boundaries of isolation. While possible, the important thing to understand here is that most of the .NET Framework is in fact not resilient to these things. See this doc as an example of guidance the CLR team provided to other developers inside of Microsoft to this effect. OOMs are in a similar category, though many subsystems take different, inconsistent approaches to memory allocation failures (e.g. WPF takes a different stance than WCF).
All of this is a long winded build up to the following problem: thread interrupts are just about as evil as these sorts of asynchronous exceptions. The failure injection points are more constrained—e.g. an OOM can occur wherever an allocation occurs, a thread abort can happen in between nearly any two instructions, and thread interruptions can only occur at blocking calls that transition the managed thread into the state WaitSleepJoin—but this doesn’t change the fact that most code is unprepared to deal properly with such interruptions. Once again, it’s not that managed code cannot be constructed to be resilient to interruptions—in fact, it’s much easier than OOMs and thread aborts—it’s simply that the .NET Framework hasn’t been constructed to tolerate arbitrary interruptions. If threads are calling into these APIs and thread interruptions are provoked, state corruption, memory leaks, and possible deadlocks can be left in the wake.
To take a brief example of where such a problem might crop up, imagine a thread has blocked on FileStream.EndRead because it is finishing some asynchronous IO operation. After a brief inspection of the code, I’m convinced interrupting the call it makes to WaitHandle.WaitOne internally will lead to a memory leak:
if (1 == Interlocked.CompareExchange(ref result._EndXxxCalled, 1, 0)) { __Error.EndReadCalledTwice(); } WaitHandle handle = result._waitHandle; if (handle != null) { try { handle.WaitOne(); } finally { handle.Close(); } } NativeOverlapped* nativeOverlappedPtr = result._overlapped; if (nativeOverlappedPtr != null) { Overlapped.Free(nativeOverlappedPtr); }
The method ensures only one call to EndRead can occur, and will throw on subsequent attempts. So the above code will only ever run once. Sadly, EndRead needs to free the NativeOverlapped structure used internally for asynchronous IO completion. But because the call to Overlapped.Free follows the call to WaitOne, and doesn’t occur inside of a finally block, it won’t execute. In summary: interrupt that call to WaitOne, and boom, we leak a NativeOverlapped object. Whether or not this is disastrous of course depends on the precise scenario. A few bytes here and a few bytes there can quickly add up, particularly for long running programs. At least this particular example protects invariants sufficiently well to avoid state corruption that would lead to further unpredictability. But recall that this is just one example. In my experience, the BCL represents some of the most carefully written code in the .NET Framework, so this problem is undoubtedly scattered about all over the place.
Unfortunately, it’s become somewhat common advice that using thread interruption as a synchronization and control mechanism is a GoodThing™. Andrew Birrell, a researcher from Microsoft Research, for example, suggested this in his paper “An Introduction to Programming with C# Threads”:
“Interrupts are most useful when you don’t know exactly what is going on. For example, the target thread might be blocked in any of several packages, or within a single package it might be waiting on any of several objects. In these cases an interrupt is certainly the best solution. Even when other alternatives are available, it might be best to use interrupts just because they are a single unified scheme for provoking thread termination.” (p33)
While I am sure this advice is well intentioned, it is extremely dangerous for the subtle reasons outlined above and can lead to reliability problems in any programs that follow it. My recommendation is to build this kind of higher level synchronization into the code that you actually own, and handle shutdown and interruption logic yourself. This is a bit cumbersome and is more work, but it also ensures that arbitrary blocking points in the libraries you use will not be affected by interruptions.
With the increase in hardware parallelism over the coming years, I worry that the use of interruptions will become more widespread as a popular technique developers use to control threads. And as more and more of the .NET Framework uses higher degrees of concurrency, necessarily requiring more internal synchronization, the number of blocking points that are vulnerable to this kind of abuse will grow accordingly. So, please, do your part… avoid Thread.Interrupt like the plague. In fact, perhaps we should deprecate it.
 Monday, July 30, 2007
My book, Concurrent Programming on Windows, is shaping up quite nicely. (Given that I've been working on it for over a year now, I suppose it had better be!) I’ve been surprised at the amazing level of anticipation and excitement from blog readers, coworkers, and Microsoft customers, and I really can’t wait for it to be finished. Thanks for the patience so far.
I feel like I’m almost on the home stretch. End of September is my current target for completion. It’s looking like it’ll contain 18 chapters, with 3 appendices, and will have a total running length of somewhere around 700 pages. The reasons it has taken so long are numerous, but the primary reason is that the content is quite deep and detail-oriented—more than I expected at the start—and I’ve wanted to take the time to get it just right rather than cut corners. My editors recently gave me feedback that there will be very little developmental editing required, since I’m (ahem) very, uhh, meticulous when it comes to writing. And feedback from technical reviewers has been very positive as well. I think both are good news.
I’m confident the end product will be worth the wait.
In the meantime, it seems that some of the abstractions I’ve built while writing the book will likely become part of a future release of the .NET Framework. Keep an eye out on Channel9 for some additional details in a few months time. Right now is a super exiting time to be in the field of computer science, that’s for sure... Laissez les bons temps rouler!
 Saturday, July 21, 2007
We're hiring developers and testers for the Parallel Computing team in Microsoft's Developer Division.
Update: Here are links to the jobs on microsoft.com:
Our team works regularly with MSR and the CTO, as well as other Developer Division teams like CLR, VS, C#, VB, etc.
If you want to help define and build the next generation of concurrency support in C# and the .NET Framework, this is your chance.
If you want to work closely with supersmart folks like Anders, Burton, and David, wait no longer.
Send me an email at joedu at microsoft dot com if you are interested.
 Friday, July 20, 2007
Whether or not it’s possible for an object to be published before it has been fully constructed is perhaps the most common .NET memory model-related question that arises time and time again. In fact, there was a discussion this week on an internal .NET alias, and another a couple weeks ago in the Joel on Software forums.
The basic question is: Can one thread read a pointer to an object whose constructor has not finished running on a separate thread?
This pattern pops up quite a bit in lazily initialization scenarios, for instance. For example, given some class C:
class C { public int f; public static C s_c; public C() { f = 55; } }
And some code that lazily initializes and then uses the object:
if (s_c == null) s_c = new C(); Console.WriteLine(s_c.f);
Specifically, is it possible in this case to write 0 (or garbage) to the console, instead of 55?
(Note that related examples, like the Joel on Software thread, use separate initialization routines or steps before publishing the pointer. It boils down to the same issue.)
How could observing anything other ‘f’ value than 55 possibly happen anyway, you might wonder? Well, since some processors are free to execute certain instructions out-of-order, the write of the return value of ‘new C()’ could theoretically retire before the write to that instance’s ‘f’ field. This isn’t an issue on X86, since the processor memory model doesn’t permit it, but architectures like IA64 do permit such reordering. Moreover, some compilers might decide to reorder writes; in this example, if the constructor were to be inlined, the compiler could subsequently use code motion to delay the write to the field.
(Note: obviously the constructor could publish a reference to 'this' before it has finished. In this case, clearly other threads could then access the instance before it was fully constructed.)
On .NET, the answer is no, this kind of code motion and processor reordering is not legal. This specific example was a primary motivation for the strengthening changes we made to the .NET Framework 2.0’s implemented memory model in the CLR. Writes always retire in-order. To forbid out-of-order writes, the CLR’s JIT compiler emits the proper instructions on appropriate architectures (i.e. in this case, ensuring all writes on IA64 are store/release). Although reads can retire out-of-order, the data dependence on the pointer value being published prevents subsequent read of fields from happening before the read of the pointer itself. So thankfully this simply cannot happen.
A lot of .NET code out there, including code in the Framework itself, would have suddenly been open to reordering bugs when the CLR 2.0 shipped with IA64 support had we not made this decision. We decided to sacrifice some performance on one particular architecture (and possibly subsequent ones) to ensure these tricky races didn’t bite people unexpectedly, and to avoid a costly audit of the entire Framework.
Lastly, I will note a couple things. First, this strength is not specified in ECMA, so other implementations of the CLI do not provide such guarantees. (I hope one day we decide to standardize the stronger model.) I don’t know what Mono implements, but it may be weaker. Second, the Java Memory Model does not prohibit such publication reorderings, unless the assignments are to a ‘final’ field. So I’m sure people who are familiar with the JMM will assume this pattern is broken on .NET and use locks and/or explicit memory barriers instead. This approach is more conservative and still leads to correct code, however, so it really matters very little for most code.
 Sunday, June 24, 2007
In response to a previous post, a reader said
“I was under the impression that monitors were implemented in .NET in a multiplexed way, so that events are only allocated to an object while there is contention - and that they aren't "sticky", becoming permanently attached to the object.”
This is absolutely correct. My nulling out of the object reference in the example only has the slight advantage of promoting the object’s collection sooner, but it does not have the effect of speeding up the reclamation of the internally managed monitor state. My original posting erroneously said that it would.
Let’s take a quick step back, and see exactly what this means.
Monitors are comprised of two capabilities: critical regions (i.e. Monitor.Enter and Exit), to achieve mutual exclusion, and condition variables (i.e. Monitor.Wait, Pulse, and PulseAll), to coordinate between threads. Any CLR object can be used as a monitor.
For the critical region case, the CLR uses an efficient thin lock which simply embeds locking information as a bit pattern inside the object’s header word. Other parts of the system also try to use the header, e.g. when caching an object’s default hash-code, COM interop, etc. There are limits to what can be stored in the header, so use of any two of these things simultaneously causes inflation, meaning the object header’s contents become an index into a table of sync blocks. Sync blocks are just little data structures capable of holding all of that state simultaneously. The CLR manages a system-wide table of them and recycles and reuses them as objects need them. Another event that causes inflation is the first occasion on which a thread tries to enter the critical region while another thread holds it (i.e. contention).
When contention arises, the CLR will spin briefly before truly waiting, but it may eventually need to allocate a Windows kernel event object. This is an auto-reset (synchronization) event, and a handle to it gets stored on the sync block. Waiting threads just wait on it, and threads exiting the critical region will set it (if the wait count is non-zero). Note that this leads to unfair behavior, because threads can steal the critical region between the signal and the wake-up, but helps to prevent convoys.
Condition variables are implemented slightly differently. Each CLR thread object has a single event object dedicated to it. The first time a thread calls Wait on a condition variable, the event is lazily allocated. And then the thread simply places its own thread-local event into a list of events associated with the monitor. Registering the event also requires inflation to a sync block, if it hasn’t happened already, because obviously the event list can’t be stored in the object header. When a Pulse happens, the pulsing thread just signals the first event in the list. Waiting and pulsing is thus actually somewhat fair, but there are other races that can eliminate this that I won't get into. When a PulseAll occurs, the pulsing thread walks the whole list and signals each.
So now back to the question: when are sync blocks reclaimed?
When a GC is triggered, objects in the reachability traversal may have their sync blocks reclaimed, even if the object in question is still alive, and made available again in the system-wide pool of reusable sync blocks. This reclamation can happen so long as the sync block isn’t needed permanently (as would be the case if COM interop information was stored inside of it) and the sync block isn’t marked precious. A sync block is precious anytime there is a thread inside of the object’s critical region, when a thread is waiting to enter the critical region, or when at least one thread has registered its event into the associated condition variable list. Notice that orphaning monitors can thus lead to leaking events, because they will remain precious, unless the monitor object itself becomes unreachable. When a sync block is reclaimed in this fashion, certain reusable state is kept, like the critical region event object, so that the next monitor to use the sync block can reuse it.
 Saturday, June 09, 2007
Windows Vista has a new one-time initialization feature, which I’m pretty envious of being someone who writes most of his code in C# and answers countless questions about double-checked locking in the CLR. Rather than sprinkling double-checked locking all over your code base, along with the ever-lasting worry in the back of your mind that you’ve gotten the synchronization incorrect, it's a better idea to consolidate it into one place.
That’s the purpose of the LazyInit<T> and LazyInitOnlyOnce<T> structs below. Both let you specify an “initialization” routine (as a delegate) which gets invoked at the appropriate time to lazily initialize the state. The only difference between the two is that LazyInit<T> might invoke your delegate more than once, due to races, but it will ensure only one value “wins”. LazyInitOnlyOnce<T> does the extra work to ensure the initialization routine only gets called once, though at a slightly higher cost: we might need to block a thread, which means allocating a Win32 event.
Why the two? I had originally written this with a Boolean specified at construction time to pick one over the other, but this required an extra object field which, for LazyInit<T> which was never used, along with a Boolean field. I defined both as structs to make them super lightweight to use, and getting rid of the extra two fields seemed worth the extra baggage of an extra class, given that such a type could end up used very pervasively throughout a large code-base. As it stands, LazyInit<T> is just the size of a pointer plus the size of T. LazyInitOnlyOnce<T> adds one additional pointer to that.
To start with, both use the same Initializer<T> delegate:
public delegate T Initializer<T>();
And here’s LazyInit<T>, the simpler of the two:
public struct LazyInit<T> where T : class { private Initializer<T> m_init; private T m_value;
public LazyInit(Initializer<T> init) { m_init = init; m_value = null; }
public T Value { get { if (m_value == null) { T newValue = m_init(); if (Interlocked.CompareExchange(ref m_value, newValue, null) != null && newValue is IDisposable) { ((IDisposable)newValue).Dispose(); } }
return m_value; } } }
Note that T is constrained to a reference type, so that we can use a null check to determine when initialization is needed. We could have used a separate Boolean, but this would required adding another field as well as considering some trickier memory model issues.
If the Interlocked.CompareExchange fails, it means we lost the lazy initialization race with another thread, and thus just return the value the other thread produced. We also Dispose of the garbage object if it implements IDisposable. This pattern is very common in lazy initialization scenarios, like allocating an expensive kernel object lazily on demand. We’d prefer to get rid of it right away since we know it will never be used.
I wish there was a way to make boxing a compile-time error for some value types. Clearly you don't ever want to box one of these, because making a copy will entirely break the synchronization guarantees.
I’ve omitted some error checking, like ensuring m_init actually got initialized to a non-null value.
Say you need a lazily initialized event on your object. You would just do this:
public class C { private LazyInit<EventWaitHandle> m_event; private object m_otherState; public C() { m_event = new LazyInit<EventWaitHandle>( delegate { return new ManualResetEvent(false); }); m_otherState = ...; } ... private void DoSomething() { ... if (... need to set the event ...) m_event.Value.Set(); } }
And lastly, here’s LazyInitOnlyOnce<T>:
public struct LazyInitOnlyOnce<T> where T : class { private Initializer<T> m_init; private T m_value; private object m_syncLock;
public LazyInitOnlyOnce(Initializer<T> init) { m_init = init; m_value = null; m_syncLock = null; }
public T Value { get { if (m_value == null) { object newSyncLock = new object(); object syncLockToUse = Interlocked.CompareExchange( ref m_syncLock, newSyncLock, null); if (syncLockToUse == null) syncLockToUse = newSyncLock; lock (syncLockToUse) { if (m_value == null) m_value = m_init(); m_syncLock = null; m_init = null; } }
return m_value; } } }
We use a monitor to ensure mutual exclusion. I lazily allocate the object used for synchronization, but this is clearly a tradeoff. We pay for the added complexity to the code and the extra interlocked instruction (on the slow path), but avoid having to allocate an extra object when we create the struct itself and keep it alive, when we might not ever need it. There’s already an allocation for the delegate, but this just means there’s one instead of two.
It may also not be obvious why I null out the m_syncLock field before exiting. If we don't, the object will remain live as long as the lazily initialized variable remains live. We want the object to be GC'd as soon as possible, because it is no longer needed.
You can use a class constructor in .NET to acheive a similar effect. Static field initializers, however, execute in the class constructor, meaning if you have multiple lazily initialized objects or static methods, they all get initialized at once. This is much more like LazyInitOnlyOnce<T> than LazyInit<T>, since the CLR uses locks to prevent the class constructor from running on multiple threads at once.
Anyway, there’s very little that is novel here. But I do believe having these primitives in the .NET Framework would be immensely useful. It would at least help steer people towards the recommended and most efficient lazy initialization pattern, which is to use double-checked locking, rather than having them possibly pursue more complicated designs. It also removes the need to worry about volatile and Thread.MemoryBarrier, for those that aren't knowledgeable of the work we did in the CLR 2.0 to ensure double-checked locking works properly. Lastly, it has the added benefit of getting rid of tricky calls to Interlocked.CompareExchange and lock statements scattered throughout your code, in favor of something more declarative. What do you think?
 Friday, June 08, 2007
It's coming to be that time of year at Microsoft: review time. Actually, it's a multi-month process, but the first step involves assessing how well (or, in some cases, how poorly) you've done at certain tasks you set out to pursue a year prior. Part of this process also involves discussing career aspirations with those who care to listen and help, which usually means your manager.
So, Joe: What do you want to be when you grow up? ;)
Technical Fellow is the highest title at Microsoft awarded to technical contributors. Some TFs manage people too, but most do not. According to Microsoft.com, there are 18. Two people I work with closely, Anders Hejlsberg and Burton Smith, are TFs, and, until just recently, Jim Gray was also. There are others I know and recognize, but I am not deeply familiar with all of them. This lead me to wonder about the contributions and accomplishments the others have had throughout their careers, and so I did some reading. Aside from causing a bit of depression and a realization that I have quite a long mountain to climb ahead of me, I ended up reading Butler Lampson’s immensely wonderful paper “Hints for Computer System Design” (HTML, PDF).
Yeah, yeah, I’m only 24 years behind the times. In the paper, Butler provides many sound principles backed by concrete examples illustrating tradeoffs between functionality, speed, and fault-tolerance, drawn mostly from his experience building operating and distributed systems. As I read it the paper, I was struck by how much his advice applies to building just about any kind of complicated software system, including frameworks.
A few random teaser quotes:
“Designing a computer system is very different from designing an algorithm:
The external interface (that is, the requirement) is less precisely defined, more complex, and more subject to change.
The system has much more internal structure, and hence many internal interfaces.
The measure of success is much less clear.
The designer usually finds himself floundering in a sea of possibilities, unclear about how one choice will limit his freedom to make other choices, or affect the size and performance of the entire system. There probably isn’t a ‘best’ way to build the system, or even any major part of it; much more important is to avoid choosing a terrible way, and to have clear division of responsibilities among the parts.”
---
“Do one thing at a time, and do it well. An interface should capture the minimum essentials of an abstraction. Don’t generalize; generalizations are generally wrong.”
---
“Keep secrets of the implementation. Secrets are assumptions about an implementation that client programs are not allowed to make. In other words, they are things that can change; the interface defines the things that cannot change (without simultaneous changes to both implementation and client). Obviously, it is easier to program and modify a system if its parts make fewer assumptions about each other. On the other hand, the system may not be easier to design—it’s hard to design a good interface.”
---
“One way to improve performance is to increase the number of assumptions that one part of a system makes about another; the additional assumptions often allow less work to be done, sometimes a lot less.”
---
“When in doubt, use brute force. Especially as the cost of hardware declines, a straightforward, easily analyzed solution that requires a lot of special-purpose computing cycles is better than a complex, poorly characterized one that may work well if certain assumptions are satisfied.”
Needless to say, I strongly recommend the paper. Now back to my career planning. I think I’ll start by writing a prolific and influential paper that will still be highly relevant, quoted, and recommended 24 years into the future. Thankfully I'm still comparatively young so if I get started soon I just might be around to see the 24 year mark.
 Wednesday, May 30, 2007
Intel and AMD processors have had very limited support for SIMD computations in the form of MMX and SSE since the late 90s. Though most programmers live in a MIMD-oriented world, SIMD programming had a surge in research interest in the 80s, and has remained promising for all those years, albeit a bit silently. Vectorization is a fairly popular technique primarily in niche markets such as the FORTRAN and supercomputing communities. Given the rise of GPGPU (see here, here, and here) and rumors floating about in the microprocessor arena, this is an interesting space to watch.
You can get at SSE from managed code, though it requires some hoop jumping and the interop overheads end up killing you. Let's take a quick look at what it takes to use classic loop stripmining techniques for a pairwise multiplication of two arrays.
Since we can't access the SSE instructions directly in managed code, we need to first define a native DLL. We'll call it 'vecthelp.dll' and it just exports a single function:
#include <xmmintrin.h>
const int c_vectorStride = 4;
extern "C" __declspec(dllexport) void VectMult(float * src1, float * src2, float * dest, int length) { for (int i = 0; i < length; i += c_vectorStride) { // Vector load, multiply, store. __m128 v1 = _mm_load_ps(src1 + i); // MOVAPS __m128 v2 = _mm_load_ps(src2 + i); // MOVAPS __m128 vresult = _mm_mul_ps(v1, v2); // MULPS _mm_store_ps(dest + i, vresult); // MOVAPS } }
'VectMult' takes two pointers to float arrays, 'src1' and 'src2', of size 'length', and does a pairwise multiplication, storing results into 'dest'. It walks the array with a stride of 4. On each iteration, it does a vector load using the SSE intrnsic '_mm_load_ps', which loads 4 contiguous floats from 'src1' and 'src2' into XMMx registers. Then we multiply them via '_mm_mul_ps' which is a 4-way vector multiply (i.e. the multiplication for each pair occurs in parallel). Lastly, we store the results back to the 'dest' array. Note we naively assume the array's size is a multiple of 4.
To use this routine, we just need to P/Invoke. Well, sadly we also need to do some tricky alignment since SSE demands 16 byte alignment. As I've written before, this isn't easy to acheive on the CLR. I've used stack allocation to avoid pinning the arrays, though clearly for large arrays this would easily lead to stack overflow. It's just for illustration.
using System;
unsafe class Program { [System.Runtime.InteropServices.DllImport("vecthelp.dll")] private extern static void VectMult(float * src1, float * src2, float * dest, int length);
public static void Main() { const int vecsize = 1024 * 16; // 16KB of floats.
float * a = stackalloc float[vecsize + (16 / sizeof(float)) - 1]; float * b = stackalloc float[vecsize + (16 / sizeof(float)) - 1]; float * c = stackalloc float[vecsize + (16 / sizeof(float)) - 1];
// To use SSE, we must ensure 16 byte alignment. a = (float *)AlignUp(a, 16); b = (float *)AlignUp(b, 16); c = (float *)AlignUp(c, 16);
// Initialize 'a' and 'b': for (int i = 0; i < vecsize; i++) { a[i] = i; b[i] = vecsize - i; }
// Now perform the multiplication. VectMult(a, b, c, vecsize);
... do something with c ... }
private static void * AlignUp(void * p, ulong alignBytes) { ulong addr = (ulong)p; ulong newAddr = (addr + alignBytes - 1) & ~(alignBytes - 1); return (void *)newAddr; } }
I wish I could report some stellar perf numbers, to the tune of the vector version being 4X faster than a non-vector equivalent. Sadly the P/Invoke overheads kill perf unless the array is unreasonably large. Who needs to multiply two 16MB arrays of floats together? Some people I'm sure, but not many. If the P/Invoke overheads are excluded, however, arrays as small as a few hundred elements see 2X speedup. And for larger arrays it hovers around 3X.
Clearly as future architectures offer more vector width, these speedups just increase. And perhaps there will eventually be more incentive for native CLR support. Just imagine if we had a 32-core system in which each core had a 16-way vector arithmetic unit: that's 32X16 (512) degrees of parallelism if you can just subdivide the problem appropriately. GPUs, of course, already offer many-fold larger vector width than SSE, which is one reason why GPGPU is attractive. Maybe I'll show how to write a DirectX pixel shader that adds two float arrays together in a future post.
 Saturday, May 19, 2007
I’ve opined on thread affinity several times in the past. I think the term “thread affinity” is en vogue only internal to Microsoft, so it may help to define what it means for the rest of the world.
Many services on Windows have traditionally associated state with the executing thread to keep track of certain ambient contextual information. Errors, security, arbitrary library state. Storing data on the physical thread ensures that it flows around with the logical continuation of work, no matter what APIs are called or how interwoven the stack ends up, and is therefore “always” accessible. Thank our imperative history for this one. People have had to deal with this in Haskell, though since Haskell generally doesn’t have statefulness which persists across callstacks, they came up with a more elegant “implicit parameter” solution.
Affinity creates difficulties for parallel programming models for a number of reasons. We’d like it to be the case that work can be transferred for execution on separate processors when feasible and profitable, and often even implicitly. For example in the query ‘var q = from x in A where p(x) select f(x);’, so long as ‘p’ and ‘f’ are sufficiently complicated and ‘A’ sufficiently large, perhaps we want to run this in parallel. But “transferring work for execution on separate processors” means using many threads to execute the same logical chunk of work. If ‘p’ or ‘f’ rely on thread affinity, what are we to do? Affinity becomes a concurrency blocker here: if part of that work depends on the thread’s identity across multiple steps, then how can we possibly use multiple threads?
One answer is that we first need to know the duration of the affinity if we’re to deal intelligently with it somehow. That’s what the .NET Framework’s Thread.BeginThreadAffinity and EndThreadAffinity are meant for: they denote the boundaries of affinity that has been acquired and then released. But this still doesn’t solve the fundamental problem, which is the mere presence of thread affinity in the first place. We would presumably respond to the affinity by just suppressing parallelism. That’s no good. And sadly affinity isn’t really a well-defined single thing that we can do away with in one well-defined step.
Win32 is littered with affinity: error codes are stored in the TEB (accessible via GetLastError), as are impersonation tokens and locale IDs. Arbitrary program and library can be—and routinely is—stashed away into Thread Local Storage (TLS) for retrieval later on. In fact, most mutual exclusion mechanisms today assume thread affinity: that is, a lock is taken by some thread and then the only agent in the system working under the protection of that lock is that one thread. Various transactional memory nesting forms seek to solve this problem, including what happens when many threads which comprise the same logical piece of work need to write to overlapping data. Heck, stacks are even a subtle form of thread affinity too, in which some portion of the program state is all cobbled up with the thing which is meant to execute the program itself. COM introduced an even more grotesque form of affinity with its Threading apartment model, particularly Single Threaded Apartments (STAs), in which components created on an STA are only ever accessed from the single STA thread in that apartment. And let’s not forget all of the GUI frameworks: all of the Windows GUI frameworks are built on the notion of strong affinity. And since the introduction of LIBCMT and MSVCRT those C Runtime library functions which historically relied on global state now rely on TLS, so some of the CRT itself is even guilty (which means those programs that use the guilty parts are also guilty, though perhaps unknowingly). Managed code adds one degree of separation by detaching the CLR thread from the OS thread, which is a step in the right direction; but the .NET Framework is still littered with affinity that is either inherited from Win32 or of its own creation. And so on, and so forth.
All of those examples of thread affinity above are cases where the library developer needed to have an isolated context. There really is no reason that this context needs to be specific to a single OS thread, it just so happens the context that most of them chosen is in fact specific to one. The .NET Framework's approach of offering a layered and shiftable abstraction on top of the concrete thing is promising... assuming you’re comfortable using that abstraction. CLR remoting offers various forms of contexts that flow in a multitude of ways. Sadly the machinery is complex, not documented satisfactorily, and is, well, tied to remoting. We need something more general purpose and ubiquitous. Maybe the CLR thread is it. Until somebody needs to come along and build something that is one level above CLR threads, I suppose.
So how bad is all of this anyway? It’s actually fairly bad. Any one of these things in isolation is teachable and avoidable, but pile it all up and what you’re left with is a veritable minefield to navigate. Affinity shows up as a huge concurrency blocker alongside other favorites like mutable data structures and impure functions. As if concurrency weren’t hard enough!
|
|
Recent Entries:
Search:
Browse by Date:
| | Sun | Mon | Tue | Wed | Thu | Fri | Sat | | 26 | 27 | 28 | 29 | 30 | 31 | 1 | | 2 | 3 | 4 | 5 | 6 | 7 | 8 | | 9 | 10 | 11 | 12 | 13 | 14 | 15 | | 16 | 17 | 18 | 19 | 20 | 21 | 22 | | 23 | 24 | 25 | 26 | 27 | 28 | 29 | | 30 | 1 | 2 | 3 | 4 | 5 | 6 |
Browse by Category:
Notables:
|