|
Personal Info:
Joe  leads the architecture of an experimental OS's developer platform, where
he is also chief architect of its programming language. His current mission is to enable
writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe
founded the Parallel Extensions to .NET project.
He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife,
writing books, writing music,
studying music theory & mathematics, and doing anything involving food & wine.
My books
My music
Disclaimer:
The content of this site are my own personal opinions and do
not represent my employer's view in anyway.
© 2012, Joe Duffy
|
|
 Wednesday, May 30, 2007
Intel and AMD processors have had very limited support for SIMD computations in the form of MMX and SSE since the late 90s. Though most programmers live in a MIMD-oriented world, SIMD programming had a surge in research interest in the 80s, and has remained promising for all those years, albeit a bit silently. Vectorization is a fairly popular technique primarily in niche markets such as the FORTRAN and supercomputing communities. Given the rise of GPGPU (see here, here, and here) and rumors floating about in the microprocessor arena, this is an interesting space to watch.
You can get at SSE from managed code, though it requires some hoop jumping and the interop overheads end up killing you. Let's take a quick look at what it takes to use classic loop stripmining techniques for a pairwise multiplication of two arrays.
Since we can't access the SSE instructions directly in managed code, we need to first define a native DLL. We'll call it 'vecthelp.dll' and it just exports a single function:
#include <xmmintrin.h>
const int c_vectorStride = 4;
extern "C" __declspec(dllexport) void VectMult(float * src1, float * src2, float * dest, int length) { for (int i = 0; i < length; i += c_vectorStride) { // Vector load, multiply, store. __m128 v1 = _mm_load_ps(src1 + i); // MOVAPS __m128 v2 = _mm_load_ps(src2 + i); // MOVAPS __m128 vresult = _mm_mul_ps(v1, v2); // MULPS _mm_store_ps(dest + i, vresult); // MOVAPS } }
'VectMult' takes two pointers to float arrays, 'src1' and 'src2', of size 'length', and does a pairwise multiplication, storing results into 'dest'. It walks the array with a stride of 4. On each iteration, it does a vector load using the SSE intrnsic '_mm_load_ps', which loads 4 contiguous floats from 'src1' and 'src2' into XMMx registers. Then we multiply them via '_mm_mul_ps' which is a 4-way vector multiply (i.e. the multiplication for each pair occurs in parallel). Lastly, we store the results back to the 'dest' array. Note we naively assume the array's size is a multiple of 4.
To use this routine, we just need to P/Invoke. Well, sadly we also need to do some tricky alignment since SSE demands 16 byte alignment. As I've written before, this isn't easy to acheive on the CLR. I've used stack allocation to avoid pinning the arrays, though clearly for large arrays this would easily lead to stack overflow. It's just for illustration.
using System;
unsafe class Program { [System.Runtime.InteropServices.DllImport("vecthelp.dll")] private extern static void VectMult(float * src1, float * src2, float * dest, int length);
public static void Main() { const int vecsize = 1024 * 16; // 16KB of floats.
float * a = stackalloc float[vecsize + (16 / sizeof(float)) - 1]; float * b = stackalloc float[vecsize + (16 / sizeof(float)) - 1]; float * c = stackalloc float[vecsize + (16 / sizeof(float)) - 1];
// To use SSE, we must ensure 16 byte alignment. a = (float *)AlignUp(a, 16); b = (float *)AlignUp(b, 16); c = (float *)AlignUp(c, 16);
// Initialize 'a' and 'b': for (int i = 0; i < vecsize; i++) { a[i] = i; b[i] = vecsize - i; }
// Now perform the multiplication. VectMult(a, b, c, vecsize);
... do something with c ... }
private static void * AlignUp(void * p, ulong alignBytes) { ulong addr = (ulong)p; ulong newAddr = (addr + alignBytes - 1) & ~(alignBytes - 1); return (void *)newAddr; } }
I wish I could report some stellar perf numbers, to the tune of the vector version being 4X faster than a non-vector equivalent. Sadly the P/Invoke overheads kill perf unless the array is unreasonably large. Who needs to multiply two 16MB arrays of floats together? Some people I'm sure, but not many. If the P/Invoke overheads are excluded, however, arrays as small as a few hundred elements see 2X speedup. And for larger arrays it hovers around 3X.
Clearly as future architectures offer more vector width, these speedups just increase. And perhaps there will eventually be more incentive for native CLR support. Just imagine if we had a 32-core system in which each core had a 16-way vector arithmetic unit: that's 32X16 (512) degrees of parallelism if you can just subdivide the problem appropriately. GPUs, of course, already offer many-fold larger vector width than SSE, which is one reason why GPGPU is attractive. Maybe I'll show how to write a DirectX pixel shader that adds two float arrays together in a future post.
 Saturday, May 19, 2007
I’ve opined on thread affinity several times in the past. I think the term “thread affinity” is en vogue only internal to Microsoft, so it may help to define what it means for the rest of the world.
Many services on Windows have traditionally associated state with the executing thread to keep track of certain ambient contextual information. Errors, security, arbitrary library state. Storing data on the physical thread ensures that it flows around with the logical continuation of work, no matter what APIs are called or how interwoven the stack ends up, and is therefore “always” accessible. Thank our imperative history for this one. People have had to deal with this in Haskell, though since Haskell generally doesn’t have statefulness which persists across callstacks, they came up with a more elegant “implicit parameter” solution.
Affinity creates difficulties for parallel programming models for a number of reasons. We’d like it to be the case that work can be transferred for execution on separate processors when feasible and profitable, and often even implicitly. For example in the query ‘var q = from x in A where p(x) select f(x);’, so long as ‘p’ and ‘f’ are sufficiently complicated and ‘A’ sufficiently large, perhaps we want to run this in parallel. But “transferring work for execution on separate processors” means using many threads to execute the same logical chunk of work. If ‘p’ or ‘f’ rely on thread affinity, what are we to do? Affinity becomes a concurrency blocker here: if part of that work depends on the thread’s identity across multiple steps, then how can we possibly use multiple threads?
One answer is that we first need to know the duration of the affinity if we’re to deal intelligently with it somehow. That’s what the .NET Framework’s Thread.BeginThreadAffinity and EndThreadAffinity are meant for: they denote the boundaries of affinity that has been acquired and then released. But this still doesn’t solve the fundamental problem, which is the mere presence of thread affinity in the first place. We would presumably respond to the affinity by just suppressing parallelism. That’s no good. And sadly affinity isn’t really a well-defined single thing that we can do away with in one well-defined step.
Win32 is littered with affinity: error codes are stored in the TEB (accessible via GetLastError), as are impersonation tokens and locale IDs. Arbitrary program and library can be—and routinely is—stashed away into Thread Local Storage (TLS) for retrieval later on. In fact, most mutual exclusion mechanisms today assume thread affinity: that is, a lock is taken by some thread and then the only agent in the system working under the protection of that lock is that one thread. Various transactional memory nesting forms seek to solve this problem, including what happens when many threads which comprise the same logical piece of work need to write to overlapping data. Heck, stacks are even a subtle form of thread affinity too, in which some portion of the program state is all cobbled up with the thing which is meant to execute the program itself. COM introduced an even more grotesque form of affinity with its Threading apartment model, particularly Single Threaded Apartments (STAs), in which components created on an STA are only ever accessed from the single STA thread in that apartment. And let’s not forget all of the GUI frameworks: all of the Windows GUI frameworks are built on the notion of strong affinity. And since the introduction of LIBCMT and MSVCRT those C Runtime library functions which historically relied on global state now rely on TLS, so some of the CRT itself is even guilty (which means those programs that use the guilty parts are also guilty, though perhaps unknowingly). Managed code adds one degree of separation by detaching the CLR thread from the OS thread, which is a step in the right direction; but the .NET Framework is still littered with affinity that is either inherited from Win32 or of its own creation. And so on, and so forth.
All of those examples of thread affinity above are cases where the library developer needed to have an isolated context. There really is no reason that this context needs to be specific to a single OS thread, it just so happens the context that most of them chosen is in fact specific to one. The .NET Framework's approach of offering a layered and shiftable abstraction on top of the concrete thing is promising... assuming you’re comfortable using that abstraction. CLR remoting offers various forms of contexts that flow in a multitude of ways. Sadly the machinery is complex, not documented satisfactorily, and is, well, tied to remoting. We need something more general purpose and ubiquitous. Maybe the CLR thread is it. Until somebody needs to come along and build something that is one level above CLR threads, I suppose.
So how bad is all of this anyway? It’s actually fairly bad. Any one of these things in isolation is teachable and avoidable, but pile it all up and what you’re left with is a veritable minefield to navigate. Affinity shows up as a huge concurrency blocker alongside other favorites like mutable data structures and impure functions. As if concurrency weren’t hard enough!
 Sunday, May 13, 2007
Everybody’s probably aware of the RegisterWaitForSingleObject method: it exists in the native and CLR thread pools, and does pretty much the same thing in both. (It's called CreateThreadpoolWait and SetThreadpoolWait in Vista.) This feature allows you to consolidate a bunch of waits onto dedicated thread pool wait threads. Each such thread waits on up to 63 registered objects using a wait-any-style WaitForMultipleObjects. When any of the objects become signaled, or a timeout occurs, the wait thread just wakes up and queues a callback to run on the normal thread pool work queue. Then it updates timeouts and possibly removes the object from its wait set, and then goes right back to waiting.
This is great. Fewer threads, more overlapped waiting, better performance. If you wait on 1,024 objects, you only need 17 threads instead of 1,024. Not only do you end up with fewer threads, but your program can actually handle the case where all 1,024 objects become signaled at once, because the thread pool throttles the number of threads that can run callbacks.
But you really don’t want to register a wait for a mutex. If you stop to think about it for a moment, the reason will become clear. It just doesn’t make any sense with the architecture I just explained.
The pool’s wait threads are the ones that do the actual waits. And when a wait for a mutex is satisfied, the thread which performed the wait now owns the mutex. Uh oh. In our case that means the wait thread owns the mutex. But all the wait thread knows how to do is wait on stuff and queue callbacks. There are two problems here.
The first problem is that the thread which will run the callback lives in the thread pool’s worker queue, and doesn’t actually own the mutex. Which means it can’t actually release the mutex either. In fact, nobody really can, except for the wait thread that performed the wait, but remember all that thread knows how to do is wait on stuff and queue callbacks. A mutex? What the heck is that? Eventually the wait thread may exit and the mutex may become abandoned, but whether this actually happens depends on the ebb and flow of wait registrations.
(With the Win32 thread pool, you can specify the WT_EXECUTEINWAITTHREAD flag during registration, which ensures the callback is run in the wait thread itself and not queued to worker thread. While this can suffice as a workaround to this problem, it’s generally a bad practice to hold up the wait thread from doing its job. And there is no equivalent in Vista or with the CLR thread pool.)
The second problem may or may not surface depending on whether you’ve specified that the wait callback should execute only once. If the callback executes only once, the thread pool will remove it from its wait set after waking up once. Otherwise, it keeps it in the wait set and goes back to waiting on it right after queuing the callback. Here are the "only once" defaults for the Vista, legacy, and CLR pools: yes in Windows Vista (and no way to specify otherwise, other than reregistering manually), no in the legacy Win32 pool (unless the WT_EXECUTEONLYONCE flag is passed during registration), and you always have to specify in managed code with the executeOnlyOnce argument.
So what's the problem? Because mutexes allow recursive acquires, then if the callback is set to execute multiple times, the wait thread will simply go back and wait on all of its objects, including the mutex, after it queues a callback. The same thing that happens with a persistent signal object like a manual-reset event now happens. Each time the wait thread tries to wait, the acquisition of the mutex immediately succeeds, incrementing its recursion counter by one, and each time causing the wait thread to queue yet another callback. Ouch. The insanity never stops:
Mutex m = new Mutex(); m.WaitOne(); ThreadPool.RegisterWaitForSingleObject( m, delegate { Console.WriteLine("The insanity!"); }, null, -1, false); m.ReleaseMutex();
The moral of the story? Nothing terribly deep. Thread affinity strikes once again.
 Sunday, April 29, 2007
Long time readers will remember that I used to regularly blog about what I've been reading lately. (See here, here, here, and here for examples.) Well, it's been close to a year since I've posted such a list. So here's a little bit of a clearance of my back- log. (And this isn't even everything! I guess I read a lot.) I've separated the list into geek and non-geek books.
Geek books:
| The Soul of a New Machine -- Tracy Kidder |
 |
10 of 10. I'm only 16 years late on this one. This book is about DG's race to build a machine to contend with DEC's VAX, and has some great story telling. It reads almost like a fiction book, with great character building, a nice storyline running through the book, etc. The geek factor is low on this one, though I suppose it _is_ a book about a bunch of geeks building hardware, so it's a little up there. Evidently it won the Pulitzer prize. |
| Programming Pearls -- Jon Bentley |
 |
10 of 10. Another one in the category of "how the heck did I not read this sooner?" This book is a collection of essays on various programming tasks, and gives tons of insight on engineering software in general. The prose is even entertaining too. This book now occupies a special place on my bookshelf. |
| Java Concurrency in Practice -- Brian Goetz, et. al |
 |
10 of 10. This is a very down-to-earth, pragmatic overview of concurrency in Java. It even has chapters on testing and debugging, which get very little coverage in most articles but are clearly important. Though its focus is on Java, many of the ideas are more general, and thus it's a must-read for any serious Windows or .NET concurrency programmer. It's not as good as my upcoming book, but given that you can't buy mine yet, it will do for now ;). |
| The Old New Thing: Practical Development Throughout the Evolution of Windows -- Raymond Chen |
 |
8 of 10. I'm sure everybody who reads my blog reads Raymond's too. If you do, then the book will be quite familiar to you. It's a collection of essays, most of which are edited versions of ones that have appeared on Raymond's blog in the past. I really like the layout and organization of chapters. One downside: Raymond apparently tried to keep it from becoming too geeky (he mentions several times, "for you non-programmers, you can skip this chapter"), but c'mon: how many non-programmers are actually going to read this book? IMHO he should have just let loose and never looked back. |
| Software Pioneers -- Manfred Broy (editor), Ernst Denert (editor) |
 |
8 of 10. This book is a collection of classic articles by precisely what the book's title suggests, software pioneers: Friedrich L. Bauer (ALGOL), Ole-Johan Dahl (Simula), Niklaus Wirth (Pascal), Fred Brooks (Mythical Man Month, OS/360), Alan Kay (GUIs), Rudolf Bayer (B-trees), Peter Chen (E/R modeling), Edsger Dijkstra (obvious), Tony Hoare (CSPs, axiomatic semantics), David Parnas, John Guttag (abstract data types), Michael Jackson (JSP), Tom DeMarco (structured analysis), Michael Fagan (code inspection), Barry Boehm (engineering economics) and Erich Gamma (design patterns). Most of the papers are available on the net, but having them in a printed hardcover is really nice. There are also some new write-ups and an accompanying CD which contains talks from all of the above people. Nice. |
| An APL Compiler -- Timothy Budd |
 |
8 of 10. An overview of Timothy Budd's APL compiler, from front to back end. (For those not up on APL, it's a great language. See the next item.) Also contains a chapter on compiling high level program statements to vector ISAs, which is quite timely and interesting. |
| APL with a Mathematical Accent -- C. A. Reiter, W. R. Jones |
 |
8 of 10. This is the best overview and reference book on APL I've encountered to date. For $120, you get the cheesiest binding and printing ever, so be prepared to be disappointed in that department, but the content is well worth it. Covers APL from start to finish and has some handy reference charts. |
| Inside OS/2 -- Gordon Letwin |
 |
8 of 10. I picked this book up after reading Larry Osterman's blog entry about how acquiring a critical section in OS/2 effectively suspended all threads in the system until it was released. Sure enough, that's how it worked. Funny. Anyway, this book was fun because it takes you back in time, and, since Gordon was the chief architect for OS/2, the writing gives a ton of insight into software design and architecture when OS/2 was being developed. Though there is plenty of detail which is fairly useless today (like how to specifically use DosCWait), I enjoyed skimming it. |
| Dreaming in Code: Two Dozen Programmers, Three Years, 4,732 Bugs, and One Quest for Transcendent Software -- Scott Rosenberg |
 |
7 of 10. In the style of Soul of a New Machine (see above), this book takes you through the development of Chandler, a start -up software project lead by Mitch Kapor, the founder of Lotus. I generally agree with most of the reviews on Amazon: the book idea was good, the writing is very good, but the project he chose to follow was crap. As far as I know, Chandler is now dead, and the start-up didn't seem like a place buzzing with immensely passionate people, no matter how hard the author tried to convey that. He should have chosen Windows Vista or something ;). |
Non-(computer-)geek books:
| Molecular Gastronomy: Exploring the Science of Flavor -- Hervey This |
 |
10 of 10. This is THE book on molecular gastronomy, literally. Hervey basically invented the field, from what I understand, and this book is a great collection of easy-to-read essays on the topic. You'll walk away with a better scientific understanding of what cooking is all about, including how various new-age techniques work, and perhaps even the confidence to try some of it out on your own. This is a must read for any serious food geek. |
| The Making of a Chef: Mastering Heat at the Culinary Institute -- Michael Ruhlman |
 |
10 of 10. In this book, the author enrolls at the Cullinary Institute of America (CIA) and goes through basically the full cirriculum. The book is written excellently, and you feel like you're right there in class with him. And you certainly walk away with a deep and lucid appreciation for those who are slaves to the knife, cooking up delicious food because they love it. I was considering a stint at the CIA ... until I read this ... I'm now convinced that I couldn't handle it ;). |
| This is your Brain on Music: The Science of a Human Obsession -- Daniel J. Levitin |
 |
10 of 10. So many of you probably don't know that I'm a music geek too. I write a lot of material -- actually, I've been writing a whole lot more lately -- though I don't play live at all anymore. This book has some light introduction to music theory, but the great part is the way the author dives into cognitive neuroscience and the effect of music on people, their brains, and their psychology. The book is incredibly unique and will instill a newfound appreciation of every nuance of that next breakbeat you hear. |
| The Soul of a Chef: The Journey Toward Perfection -- Michael Ruhlman |
 |
10 of 10. Great follow-up to The Making of a Chef. The delves into the make-up of a chef, starting with the master chef exam at the CIA, and then profiling two chefs: Michael Symon (Lola) and Thomas Keller (French Laundry, per se). While the whole thing is great, the 1/3 of the book devoted to Thomas Keller makes the book wortwhile. |
| A Cook's Tour -- Anthony Bourdain |
 |
9 of 10. After reading Kitchen Confidential, I couldn't not read A Cook's Tour. Anthony is extremely entertaining, crude, and raw. That's what he does best. This is the book version of his late TV show with the same title. In it, he travels the world, tasting regional cuisines and reporting "from the trenches". Lots of mouth watering street food, bizzare encounters, exotic beverages, etc. If you can't afford to do a foodie world-tour yourself, this is the next best thing. |
| Guns, Germs, and Steel: The Fates of Human Societies -- Jared M. Diamond |
 |
9 of 10. This book needs little introduction. It won the Pulitzer prize, after all. The book describes how certain socieites came to dominate various geographies of the world, including (as the title suggests) the role of guns, germs (disease), and steel in the process. For obvious reasons, the writing is a tad dry, but it's so jam packed full of interesting data and written incredibly methodically, both of which more than make up for the dryness. |
| The Omnivore's Dilemma: A Natural History of Four Meals -- Michael Pollan |
 |
9 of 10. This book takes you through many different forms of eating, from corn and our new industrialized food system, to farming, to hunting and foraging, and beyond. Each section concludes with a meal characteristic of one particular style of food creation. Though at times the writing gives a hint of an agenda, the writing is generally excellent, and many facts are presented for consideration. |
| The Reach of a Chef: Beyond the Kitchen -- Michael Ruhlman |
 |
8 of 10. This is the third of Ruhlman's XXX of a Chef series of books, and I really enjoyed it. This one looks at chefs and their influence inside and outside of the kitchen. That includes Grant Achatz (of Alinea), how he studied under Keller and became an American pioneer of molecular gastronomy, various Food TV celebs, and so on. Great read. |
| The Essays of Warren Buffett: Lessons for Corporate America -- Warren Buffet, Lawrence A. Cunningham (editor) |
 |
8 of 10. Sometime in 2005, I stumbled across the archive of Warren Buffet's letters to the Berkshire Hathaway shareholders, dating back to 1977. I immediately printed them all out and have a stack of many 1,000 pages in my office to this day. Though they are fairly lengthy, there are many great lessons to be learned from reading them. This book is an edited down version of those essays. Very edited and abridged, actually, but some of the more important points are pulled out and analyzed. And the best part is that if you want to drill into more detail, all of the letters are available online. |
| Judgment of Paris: California vs. France and the Historic 1976 Paris Tasting that Revolutionized Wine -- Michael Ruhlman |
 |
7 of 10. All wine lovers need to read this. I gave it 7 out of 10 simply because the writing style is actually pretty bad IMHO. But the content itself is great, and of historical signifciance. Wikipedia has a page on the event, but in summary, in 1976 many Bordeaux 1st growths, etc. were pitted against various California winemakers in a blind tasting. The tasting was held in Paris, and the judges were all French. Everybody believed the US wines would fall short, but it turned out that the US fared quite well: the book is the story of events leading up to and including this tasting. |
| The Nasty Bits: Collected Varietal Cuts, Usable Trim, Scraps, and Bones -- Anthony Bourdain |
 |
7 of 10. To be entirely honest, this book was VERY entertaining, but fell a bit short of my expectations. This is a collection of fairly disjoint essays, almost the kind of thing you'd expect to see on a blog. Each reads very well on its own, but it lacked the kind of cohesive feel I was looking for. Nevertheless, Anthony is always entertaining, crude, funny, and makes me laugh. This book was no exception. |
 Tuesday, April 24, 2007
Haskell is the most underappreciated yet extraordinarily significant programming language in the world. The syntax is frightening enough to scare off those with weak stomaches, but some of the most interesting and creative research in type systems and, within recent years, parallelism have arisen from the Haskell community. I recently stumbled across a fascinating paper from the ACM SIGPLAN History of Programming Languages Conference (HOPL'III) from earlier this year:
A History of Haskell: being lazy with class
Abstract This long (55-page) paper describes the history of Haskell, including its genesis and principles, technical contributions, implementations and tools, and applications and impact.
First I'll admit that I'm a functional programming geek. Second I'll admit that I love reading about technology history. But those biases aside, the paper is really quite good. Recommended reading for anybody who's ever run across a lambda floating around in their dreams. I know that I have.
 Sunday, April 22, 2007
Somebody I respect a lot on our team said something interesting the other day: paraphrasing, "parallelism is about taking one trick and applying it to as many things as possible." Well, what's the trick? The trick is breaking a problem into successively smaller pieces on which disjoint subsets of the overall computation can run concurrently. Pieces in this sense can be little bits of data or instructions, or both. It seems so obvious, but that really is all there is to it. That's not to say it's easy, of course, though some people believe it is. One of the nice things about PLINQ is that you express a big computation and we hide the tricks. But the tricks aren't impossible to do on your own... today, even.
 Thursday, April 19, 2007
Michael Suess, author of the tremendously cool blog thinkingparallel.com, recently ran a series of interviews. He asked five parallelism experts from different domains (Erlang, MPI, OpenMP, POSIX threads, .NET threads) to answer the same set of questions:
It's interesting to see the range of answers given. We agree on many things, but our different backgrounds really shine through in other cases.
 Monday, April 16, 2007
To gather meaningful performance metrics, it's usually a good idea to run several iterations of the same test, averaging the numbers in some way, to eliminate noise from the results. This is true of sequential and fine-grained parallel performance analysis alike. Though it's clearly important for sequential code too, data locality can add enough noise to your parallel tests that you'll want to do something about it. For example, if iteration #1 enjoys some form of temporal locality left over from iteration #0, then all but the first iteration would receive an unfair advantage. This advantage isn't usually present in the real world -- most library code isn't called over and over again in a tight loop -- and could cause test results to appear rosier than what customers will actually experience. Therefore, we probably want to get rid of it.
To eliminate this noise, we can clear the Dcaches (data caches) of all processors used to run tests before each iteration. How do you do that? Well...
Intel offers a WBINVD instruction to clear a processor's Dcache, but sadly it's a privileged instruction and there's no way to get at it from user-mode. So that's a no-go for most Windows programmers. There's also a Win32 function to clear a processor's Icache, but this doesn't work for Dcaches, which is what we're after, so we're out of luck there too.
Instead, we can implement a really low tech solution. Take some random data, sized big enough to fill a processor's L2 cache, and read the whole thing from each processor whose cache we wish to clear before each iteration. This will evict all of the existing lines in the caches that could be left over from previous iterations. All of the new lines will be brought in as shared, and, while they will be evicted when we start using real data in the query, this effectively simulates a cold cache. Here's an example of this:
const int s_garbageSize = 1024 * 1024 * 64; // 64MB. static IntPtr s_garbage = System.Runtime.InteropServices.Marshal.AllocHGlobal(s_garbageSize);
unsafe static void ClearCaches() { for (int i = 0; i < Environment.ProcessorCount; i++) { SetThreadAffinityMask(GetCurrentThread(), new IntPtr(1 << i)); long * gb = (long *)s_garbage.ToPointer(); for (int j = 0; j < s_garbageSize / sizeof(long); j++) { long x = *(gb + j); // Read the line (shared). long y = Math.Max(Math.Min(x, 0L), 0L); // Prevent optimizing away the read. } } SetThreadAffinityMask(GetCurrentThread(), new IntPtr(0)); }
[DllImport("kernel32.dll")] static extern IntPtr GetCurrentThread(); [DllImport("kernel32.dll")] static extern IntPtr SetThreadAffinityMask(IntPtr hThread, IntPtr dwThreadAffinityMask);
This clearly isn't the most efficient implementation. On multi-core architectures, some cores are apt to share some levels of cache, so with the above approach we'll end up doing duplicate (and wasted) work. And few processors on the market have 64MB of L2 cache -- I've just chosen a reasonable number that's bigger than most processors -- so we could try to be more precise there too. You could use the GetLogicalProcessorInformation API, new to Windows Server 2003 (server) and Vista (client), if you really wanted to be a stud. In any case, this does the trick.
 Thursday, April 12, 2007
I wrote an article that appears in the May 2007 issue of MSDN Magazine. It's now online for your reading pleasure:
CLR Inside Out: 9 Reusable Parallel Data Structures and Algorithms
This column is less about the mechanics of a common language runtime (CLR) feature and more about how to efficiently use what you’ve got at your disposal. Selecting the right data structures and algorithms is, of course, one of the most common yet important decisions a programmer must make. The wrong choice can make the difference between success and failure or, as is the case most of the time, good performance and, well, terrible performance. Given that parallel programming is often meant to improve performance and that it is generally more difficult than serial programming, the choices are even more fundamental to your success.
In this column, we’ll take a look at nine reusable data structures and algorithms that are common to many parallel programs and that you should be able to adapt with ease to your own .NET software. Each example is accompanied by fully working, though not completely hardened, tested, and tuned, code. The list is by no means exhaustive, but it represents some of the more common patterns. As you’ll notice, many of the examples build on each other.
(Read more...)
The 9 items are: Countdown Latch, Reusable Spin Wait, Barrier, Blocking Queue, Bounded Buffer, Thin Event, Lock-Free Stack, Loop Tiling, Parallel Reduction. Much of the content is closely related to, or even derived from, content that will appear in my book. (Yes, it's still in the works.)
As Stephen notes on the MSDN Magazine blog, there was a printing error which resulted in the last page of the article being printed twice, one of which overwrote another page in the article. Thankfully the online article doesn't suffer from this same problem. But to remedy this, the article will also appear in next month's magazine, for double the fun.
 Wednesday, April 11, 2007
Late last summer, an interesting issue with traditional optimistic read-based software transactional memory (STM) systems surfaced. We termed this “privatization” and there has been a good deal of research on possible solutions since then. I won’t talk about solutions here, but I will give a quick overview of the problem and a pointer to recent work.
As a quick refresher, optimistic reads are nice because they are invisible. Being invisible eliminates the responsibility from the reading transaction to inform other transactions about the act of reading. Why is that nice? Because informing other transactions requires shared writes, which are expensive and often require atomic (compare-and-swap) writes. Doing that on every read can clearly hamper your performance. With optimistic systems, this step can be skipped, which is a stark contrast to pessimistic read-based systems.
As many have already described (e.g. Harris and Fraser), this is often accomplished by having the reader observe a location-specific version number after the read and ensuring that writing transactions increment this same version number during commit. Transactions later validate the observed version numbers, and if any changed, the transaction rolls back. Notice that the reading transactions continue to do work after the optimistic read, in hopes that the work needn’t be thrown away later due to a conflict. This, as you can imagine, is where the “optimistic” terminology comes from. But this is also why we run into the privatization issue.
To illustrate, imagine that we have some linked list and two transactions, Tx1 and Tx2. Tx1 walks the list and updates all nodes (perhaps by incrementing some counter), and Tx2 simply removes one node from the list so that it can do some work privately with it. The code might look like this:
class Node { Node next; int value; } Node head = …;
// Tx1: atomic { S0: Node n = head; S1: while (n != null) { S2: n.value++; S3: n = n.next; } }
// Tx2: Node n; atomic { S4: n = head.next; // take the 2nd element S5: head.next = n.next; } S6: Console.WriteLine(n.value);
Assuming all nodes have values of 0 to begin with, and Tx2 commits before Tx1, is it possible for Tx2 to print out the value 1 at S6? Perhaps surprisingly (and disappointingly), the answer is yes (with traditional optimistic read systems, as described in the literature).
How? Say Tx1 executes S0 through S3 first. So Tx1’s local variable n now contains a reference to the 2nd node in the list. Then Tx2 runs S4 and S5, removing the 2nd node from the list. Then Tx2 commits successfully, and the IP is sitting at S6 but hasn’t run yet. Note that Tx1 is doomed at this point—it has read a reference to the 2nd node via the head’s next reference which is now out of date—but doesn’t know it and, with traditional optimistic read systems, won’t find out until it tries to commit.
From here, things go terribly wrong. Tx1 may run and write to the 2nd node’s value which, in this particular example, could cause S6 to erroneously print out the value of 1. Worse, for more complicated data structures, invariants may be horribly broken as there are plenty of races: S6 could even execute while Tx1 is in the process of rolling back, etc. This can clearly be catastrophic.
This problem is described in depth in Larus and Rajwar’s recent transactional memory book. Spear, et. al recently released a technical report that also overviews the problem and presents some possible approaches to solve it. Some have suggested that data must be once-transactional, always-transactional, but some thought exercises and other simpler examples should be sufficient to convince you that that direction isn't very straightforward either.
|
|
Recent Entries:
Search:
Browse by Date:
| | Sun | Mon | Tue | Wed | Thu | Fri | Sat | | 29 | 30 | 1 | 2 | 3 | 4 | 5 | | 6 | 7 | 8 | 9 | 10 | 11 | 12 | | 13 | 14 | 15 | 16 | 17 | 18 | 19 | | 20 | 21 | 22 | 23 | 24 | 25 | 26 | | 27 | 28 | 29 | 30 | 31 | 1 | 2 | | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Browse by Category:
Notables:
|