RSS 2.0

Personal Info:

Joe Send mail to the author(s) leads the architecture of an experimental OS's developer platform, where he is also chief architect of its programming language. His current mission is to enable writing large-scale software that is reliable, secure, and scalable by-construction. Before this, Joe founded the Parallel Extensions to .NET project. He has been granted 19 patents, with 49 pending. When not working, Joe enjoys travelling with his wife, writing books, writing music, studying music theory & mathematics, and doing anything involving food & wine.

My books

My music

Disclaimer:
The content of this site are my own personal opinions and do not represent my employer's view in anyway.

© 2012, Joe Duffy

 
 Saturday, June 09, 2007

Windows Vista has a new one-time initialization feature, which I’m pretty envious of being someone who writes most of his code in C# and answers countless questions about double-checked locking in the CLR.  Rather than sprinkling double-checked locking all over your code base, along with the ever-lasting worry in the back of your mind that you’ve gotten the synchronization incorrect, it's a better idea to consolidate it into one place.

That’s the purpose of the LazyInit<T> and LazyInitOnlyOnce<T> structs below.  Both let you specify an “initialization” routine (as a delegate) which gets invoked at the appropriate time to lazily initialize the state.   The only difference between the two is that LazyInit<T> might invoke your delegate more than once, due to races, but it will ensure only one value “wins”.  LazyInitOnlyOnce<T> does the extra work to ensure the initialization routine only gets called once, though at a slightly higher cost: we might need to block a thread, which means allocating a Win32 event.

Why the two?  I had originally written this with a Boolean specified at construction time to pick one over the other, but this required an extra object field which, for LazyInit<T> which was never used, along with a Boolean field.  I defined both as structs to make them super lightweight to use, and getting rid of the extra two fields seemed worth the extra baggage of an extra class, given that such a type could end up used very pervasively throughout a large code-base.  As it stands, LazyInit<T> is just the size of a pointer plus the size of T.  LazyInitOnlyOnce<T> adds one additional pointer to that.

To start with, both use the same Initializer<T> delegate:

public delegate T Initializer<T>();

And here’s LazyInit<T>, the simpler of the two:

public struct LazyInit<T> where T : class {
    private Initializer<T> m_init;
    private T m_value;

    public LazyInit(Initializer<T> init) {
        m_init = init;
        m_value = null;
    }

    public T Value {
        get {
            if (m_value == null) {
                T newValue = m_init();
                if (Interlocked.CompareExchange(ref m_value, newValue, null) != null &&
                        newValue is IDisposable) {
                    ((IDisposable)newValue).Dispose();
                }
            }

            return m_value;
        }
    }
}

Note that T is constrained to a reference type, so that we can use a null check to determine when initialization is needed.  We could have used a separate Boolean, but this would required adding another field as well as considering some trickier memory model issues.

If the Interlocked.CompareExchange fails, it means we lost the lazy initialization race with another thread, and thus just return the value the other thread produced.  We also Dispose of the garbage object if it implements IDisposable.  This pattern is very common in lazy initialization scenarios, like allocating an expensive kernel object lazily on demand.  We’d prefer to get rid of it right away since we know it will never be used.

I wish there was a way to make boxing a compile-time error for some value types.  Clearly you don't ever want to box one of these, because making a copy will entirely break the synchronization guarantees.

I’ve omitted some error checking, like ensuring m_init actually got initialized to a non-null value.

Say you need a lazily initialized event on your object.  You would just do this:

public class C {
    private LazyInit<EventWaitHandle> m_event;
    private object m_otherState;
    public C() {
        m_event = new LazyInit<EventWaitHandle>(
                delegate { return new ManualResetEvent(false); });
        m_otherState = ...;
    }
    ...
    private void DoSomething() {
        ...
        if (... need to set the event ...)
            m_event.Value.Set();
    }

}

And lastly, here’s LazyInitOnlyOnce<T>:

public struct LazyInitOnlyOnce<T> where T : class {
    private Initializer<T> m_init;
    private T m_value;
    private object m_syncLock;

    public LazyInitOnlyOnce(Initializer<T> init)
    {
        m_init = init;
        m_value = null;
        m_syncLock = null;
    }

    public T Value {
        get {
            if (m_value == null) {
                object newSyncLock = new object();
                object syncLockToUse = Interlocked.CompareExchange(
                        ref m_syncLock, newSyncLock, null);
                if (syncLockToUse == null)
                    syncLockToUse = newSyncLock;
                lock (syncLockToUse) {
                    if (m_value == null)
                        m_value = m_init();
                    m_syncLock = null;
                    m_init = null;
                }
            }

            return m_value;
        }
    }
}

We use a monitor to ensure mutual exclusion.  I lazily allocate the object used for synchronization, but this is clearly a tradeoff.  We pay for the added complexity to the code and the extra interlocked instruction (on the slow path), but avoid having to allocate an extra object when we create the struct itself and keep it alive, when we might not ever need it.  There’s already an allocation for the delegate, but this just means there’s one instead of two.

It may also not be obvious why I null out the m_syncLock field before exiting.  If we don't, the object will remain live as long as the lazily initialized variable remains live.  We want the object to be GC'd as soon as possible, because it is no longer needed.

You can use a class constructor in .NET to acheive a similar effect.  Static field initializers, however, execute in the class constructor, meaning if you have multiple lazily initialized objects or static methods, they all get initialized at once.  This is much more like LazyInitOnlyOnce<T> than LazyInit<T>, since the CLR uses locks to prevent the class constructor from running on multiple threads at once.

Anyway, there’s very little that is novel here.  But I do believe having these primitives in the .NET Framework would be immensely useful.  It would at least help steer people towards the recommended and most efficient lazy initialization pattern, which is to use double-checked locking, rather than having them possibly pursue more complicated designs.  It also removes the need to worry about volatile and Thread.MemoryBarrier, for those that aren't knowledgeable of the work we did in the CLR 2.0 to ensure double-checked locking works properly.  Lastly, it has the added benefit of getting rid of tricky calls to Interlocked.CompareExchange and lock statements scattered throughout your code, in favor of something more declarative.  What do you think?

6/9/2007 2:09:14 PM (Pacific Daylight Time, UTC-07:00)  #   

 Friday, June 08, 2007

It's coming to be that time of year at Microsoft: review time.  Actually, it's a multi-month process, but the first step involves assessing how well (or, in some cases, how poorly) you've done at certain tasks you set out to pursue a year prior.  Part of this process also involves discussing career aspirations with those who care to listen and help, which usually means your manager.

So, Joe: What do you want to be when you grow up?  ;)

Technical Fellow is the highest title at Microsoft awarded to technical contributors.  Some TFs manage people too, but most do not.  According to Microsoft.com, there are 18.  Two people I work with closely, Anders Hejlsberg and Burton Smith, are TFs, and, until just recently, Jim Gray was also.  There are others I know and recognize, but I am not deeply familiar with all of them.  This lead me to wonder about the contributions and accomplishments the others have had throughout their careers, and so I did some reading.  Aside from causing a bit of depression and a realization that I have quite a long mountain to climb ahead of me, I ended up reading Butler Lampson’s immensely wonderful paper “Hints for Computer System Design” (HTML, PDF).

Yeah, yeah, I’m only 24 years behind the times.  In the paper, Butler provides many sound principles backed by concrete examples illustrating tradeoffs between functionality, speed, and fault-tolerance, drawn mostly from his experience building operating and distributed systems.  As I read it the paper, I was struck by how much his advice applies to building just about any kind of complicated software system, including frameworks.

A few random teaser quotes:

“Designing a computer system is very different from designing an algorithm:

The external interface (that is, the requirement) is less precisely defined, more complex, and more subject to change.

The system has much more internal structure, and hence many internal interfaces.

The measure of success is much less clear.

The designer usually finds himself floundering in a sea of possibilities, unclear about how one choice will limit his freedom to make other choices, or affect the size and performance of the entire system. There probably isn’t a ‘best’ way to build the system, or even any major part of it; much more important is to avoid choosing a terrible way, and to have clear division of responsibilities among the parts.”

---

“Do one thing at a time, and do it well. An interface should capture the minimum essentials of an abstraction. Don’t generalize; generalizations are generally wrong.”

---

“Keep secrets of the implementation. Secrets are assumptions about an implementation that client programs are not allowed to make. In other words, they are things that can change; the interface defines the things that cannot change (without simultaneous changes to both implementation and client). Obviously, it is easier to program and modify a system if its parts make fewer assumptions about each other. On the other hand, the system may not be easier to design—it’s hard to design a good interface.”

---

“One way to improve performance is to increase the number of assumptions that one part of a system makes about another; the additional assumptions often allow less work to be done, sometimes a lot less.”

---

“When in doubt, use brute force. Especially as the cost of hardware declines, a straightforward, easily analyzed solution that requires a lot of special-purpose computing cycles is better than a complex, poorly characterized one that may work well if certain assumptions are satisfied.”

Needless to say, I strongly recommend the paper.  Now back to my career planning.  I think I’ll start by writing a prolific and influential paper that will still be highly relevant, quoted, and recommended 24 years into the future.  Thankfully I'm still comparatively young so if I get started soon I just might be around to see the 24 year mark.

6/8/2007 9:19:23 PM (Pacific Daylight Time, UTC-07:00)  #   

 Wednesday, May 30, 2007

Intel and AMD processors have had very limited support for SIMD computations in the form of MMX and SSE since the late 90s.  Though most programmers live in a MIMD-oriented world, SIMD programming had a surge in research interest in the 80s, and has remained promising for all those years, albeit a bit silently.  Vectorization is a fairly popular technique primarily in niche markets such as the FORTRAN and supercomputing communities.  Given the rise of GPGPU (see here, here, and here) and rumors floating about in the microprocessor arena, this is an interesting space to watch.

You can get at SSE from managed code, though it requires some hoop jumping and the interop overheads end up killing you.  Let's take a quick look at what it takes to use classic loop stripmining techniques for a pairwise multiplication of two arrays.

Since we can't access the SSE instructions directly in managed code, we need to first define a native DLL.  We'll call it 'vecthelp.dll' and it just exports a single function:

#include <xmmintrin.h>

const int c_vectorStride = 4;

extern "C" __declspec(dllexport)
void VectMult(float * src1, float * src2, float * dest, int length) {
    for (int i = 0; i < length; i += c_vectorStride) {
        // Vector load, multiply, store.
        __m128 v1 = _mm_load_ps(src1 + i); // MOVAPS
        __m128 v2 = _mm_load_ps(src2 + i); // MOVAPS
        __m128 vresult = _mm_mul_ps(v1, v2); // MULPS
        _mm_store_ps(dest + i, vresult); // MOVAPS
    }
}

'VectMult' takes two pointers to float arrays, 'src1' and 'src2', of size 'length', and does a pairwise multiplication, storing results into 'dest'.  It walks the array with a stride of 4.  On each iteration, it does a vector load using the SSE intrnsic '_mm_load_ps', which loads 4 contiguous floats from 'src1' and 'src2' into XMMx registers.  Then we multiply them via '_mm_mul_ps' which is a 4-way vector multiply (i.e. the multiplication for each pair occurs in parallel).  Lastly, we store the results back to the 'dest' array.  Note we naively assume the array's size is a multiple of 4.

To use this routine, we just need to P/Invoke.  Well, sadly we also need to do some tricky alignment since SSE demands 16 byte alignment.  As I've written before, this isn't easy to acheive on the CLR.  I've used stack allocation to avoid pinning the arrays, though clearly for large arrays this would easily lead to stack overflow.  It's just for illustration.

using System;

unsafe class Program {
    [System.Runtime.InteropServices.DllImport("vecthelp.dll")]
    private extern static void VectMult(float * src1, float * src2, float * dest, int length);

    public static void Main() {
        const int vecsize = 1024 * 16; // 16KB of floats.

        float * a = stackalloc float[vecsize + (16 / sizeof(float)) - 1];
        float * b = stackalloc float[vecsize + (16 / sizeof(float)) - 1];
        float * c = stackalloc float[vecsize + (16 / sizeof(float)) - 1];

        // To use SSE, we must ensure 16 byte alignment.
        a = (float *)AlignUp(a, 16);
        b = (float *)AlignUp(b, 16);
        c = (float *)AlignUp(c, 16);

        // Initialize 'a' and 'b':
        for (int i = 0; i < vecsize; i++) {
            a[i] = i;
            b[i] = vecsize - i;
        }

        // Now perform the multiplication.
        VectMult(a, b, c, vecsize);

        ... do something with c ...
    }

    private static void * AlignUp(void * p, ulong alignBytes) {
        ulong addr = (ulong)p;
        ulong newAddr = (addr + alignBytes - 1) & ~(alignBytes - 1);
        return (void *)newAddr;
    }
}

I wish I could report some stellar perf numbers, to the tune of the vector version being 4X faster than a non-vector equivalent.  Sadly the P/Invoke overheads kill perf unless the array is unreasonably large.  Who needs to multiply two 16MB arrays of floats together?  Some people I'm sure, but not many.  If the P/Invoke overheads are excluded, however, arrays as small as a few hundred elements see 2X speedup.  And for larger arrays it hovers around 3X.

Clearly as future architectures offer more vector width, these speedups just increase.  And perhaps there will eventually be more incentive for native CLR support.  Just imagine if we had a 32-core system in which each core had a 16-way vector arithmetic unit: that's 32X16 (512) degrees of parallelism if you can just subdivide the problem appropriately.  GPUs, of course, already offer many-fold larger vector width than SSE, which is one reason why GPGPU is attractive.  Maybe I'll show how to write a DirectX pixel shader that adds two float arrays together in a future post.

5/30/2007 12:45:26 AM (Pacific Daylight Time, UTC-07:00)  #   

 Saturday, May 19, 2007

I’ve opined on thread affinity several times in the past.  I think the term “thread affinity” is en vogue only internal to Microsoft, so it may help to define what it means for the rest of the world.

Many services on Windows have traditionally associated state with the executing thread to keep track of certain ambient contextual information.  Errors, security, arbitrary library state.  Storing data on the physical thread ensures that it flows around with the logical continuation of work, no matter what APIs are called or how interwoven the stack ends up, and is therefore “always” accessible.  Thank our imperative history for this one.  People have had to deal with this in Haskell, though since Haskell generally doesn’t have statefulness which persists across callstacks, they came up with a more elegant “implicit parameter” solution.

Affinity creates difficulties for parallel programming models for a number of reasons.  We’d like it to be the case that work can be transferred for execution on separate processors when feasible and profitable, and often even implicitly.  For example in the query ‘var q = from x in A where p(x) select f(x);’, so long as ‘p’ and ‘f’ are sufficiently complicated and ‘A’ sufficiently large, perhaps we want to run this in parallel.  But “transferring work for execution on separate processors” means using many threads to execute the same logical chunk of work.  If ‘p’ or ‘f’ rely on thread affinity, what are we to do?  Affinity becomes a concurrency blocker here: if part of that work depends on the thread’s identity across multiple steps, then how can we possibly use multiple threads?

One answer is that we first need to know the duration of the affinity if we’re to deal intelligently with it somehow.  That’s what the .NET Framework’s Thread.BeginThreadAffinity and EndThreadAffinity are meant for: they denote the boundaries of affinity that has been acquired and then released.  But this still doesn’t solve the fundamental problem, which is the mere presence of thread affinity in the first place.  We would presumably respond to the affinity by just suppressing parallelism.  That’s no good.  And sadly affinity isn’t really a well-defined single thing that we can do away with in one well-defined step.

Win32 is littered with affinity: error codes are stored in the TEB (accessible via GetLastError), as are impersonation tokens and locale IDs.  Arbitrary program and library can be—and routinely is—stashed away into Thread Local Storage (TLS) for retrieval later on.  In fact, most mutual exclusion mechanisms today assume thread affinity: that is, a lock is taken by some thread and then the only agent in the system working under the protection of that lock is that one thread.  Various transactional memory nesting forms seek to solve this problem, including what happens when many threads which comprise the same logical piece of work need to write to overlapping data.  Heck, stacks are even a subtle form of thread affinity too, in which some portion of the program state is all cobbled up with the thing which is meant to execute the program itself.  COM introduced an even more grotesque form of affinity with its Threading apartment model, particularly Single Threaded Apartments (STAs), in which components created on an STA are only ever accessed from the single STA thread in that apartment.  And let’s not forget all of the GUI frameworks: all of the Windows GUI frameworks are built on the notion of strong affinity.  And since the introduction of LIBCMT and MSVCRT those C Runtime library functions which historically relied on global state now rely on TLS, so some of the CRT itself is even guilty (which means those programs that use the guilty parts are also guilty, though perhaps unknowingly).  Managed code adds one degree of separation by detaching the CLR thread from the OS thread, which is a step in the right direction; but the .NET Framework is still littered with affinity that is either inherited from Win32 or of its own creation.  And so on, and so forth.

All of those examples of thread affinity above are cases where the library developer needed to have an isolated context.  There really is no reason that this context needs to be specific to a single OS thread, it just so happens the context that most of them chosen is in fact specific to one.  The .NET Framework's approach of offering a layered and shiftable abstraction on top of the concrete thing is promising... assuming you’re comfortable using that abstraction.  CLR remoting offers various forms of contexts that flow in a multitude of ways.  Sadly the machinery is complex, not documented satisfactorily, and is, well, tied to remoting.  We need something more general purpose and ubiquitous.  Maybe the CLR thread is it.  Until somebody needs to come along and build something that is one level above CLR threads, I suppose.

So how bad is all of this anyway?  It’s actually fairly bad.  Any one of these things in isolation is teachable and avoidable, but pile it all up and what you’re left with is a veritable minefield to navigate.  Affinity shows up as a huge concurrency blocker alongside other favorites like mutable data structures and impure functions.  As if concurrency weren’t hard enough!

5/19/2007 5:22:53 PM (Pacific Daylight Time, UTC-07:00)  #   

 Sunday, May 13, 2007

Everybody’s probably aware of the RegisterWaitForSingleObject method: it exists in the native and CLR thread pools, and does pretty much the same thing in both.  (It's called CreateThreadpoolWait and SetThreadpoolWait in Vista.)  This feature allows you to consolidate a bunch of waits onto dedicated thread pool wait threads.  Each such thread waits on up to 63 registered objects using a wait-any-style WaitForMultipleObjects.  When any of the objects become signaled, or a timeout occurs, the wait thread just wakes up and queues a callback to run on the normal thread pool work queue.  Then it updates timeouts and possibly removes the object from its wait set, and then goes right back to waiting.

This is great.  Fewer threads, more overlapped waiting, better performance.  If you wait on 1,024 objects, you only need 17 threads instead of 1,024.  Not only do you end up with fewer threads, but your program can actually handle the case where all 1,024 objects become signaled at once, because the thread pool throttles the number of threads that can run callbacks.

But you really don’t want to register a wait for a mutex.  If you stop to think about it for a moment, the reason will become clear.  It just doesn’t make any sense with the architecture I just explained.

The pool’s wait threads are the ones that do the actual waits.  And when a wait for a mutex is satisfied, the thread which performed the wait now owns the mutex.  Uh oh.  In our case that means the wait thread owns the mutex.  But all the wait thread knows how to do is wait on stuff and queue callbacks.  There are two problems here.

The first problem is that the thread which will run the callback lives in the thread pool’s worker queue, and doesn’t actually own the mutex.  Which means it can’t actually release the mutex either.  In fact, nobody really can, except for the wait thread that performed the wait, but remember all that thread knows how to do is wait on stuff and queue callbacks.  A mutex?  What the heck is that?  Eventually the wait thread may exit and the mutex may become abandoned, but whether this actually happens depends on the ebb and flow of wait registrations.

(With the Win32 thread pool, you can specify the WT_EXECUTEINWAITTHREAD flag during registration, which ensures the callback is run in the wait thread itself and not queued to worker thread.  While this can suffice as a workaround to this problem, it’s generally a bad practice to hold up the wait thread from doing its job.  And there is no equivalent in Vista or with the CLR thread pool.)

The second problem may or may not surface depending on whether you’ve specified that the wait callback should execute only once.  If the callback executes only once, the thread pool will remove it from its wait set after waking up once.  Otherwise, it keeps it in the wait set and goes back to waiting on it right after queuing the callback.  Here are the "only once" defaults for the Vista, legacy, and CLR pools: yes in Windows Vista (and no way to specify otherwise, other than reregistering manually), no in the legacy Win32 pool (unless the WT_EXECUTEONLYONCE flag is passed during registration), and you always have to specify in managed code with the executeOnlyOnce argument.

So what's the problem?  Because mutexes allow recursive acquires, then if the callback is set to execute multiple times, the wait thread will simply go back and wait on all of its objects, including the mutex, after it queues a callback.  The same thing that happens with a persistent signal object like a manual-reset event now happens.  Each time the wait thread tries to wait, the acquisition of the mutex immediately succeeds, incrementing its recursion counter by one, and each time causing the wait thread to queue yet another callback.  Ouch.  The insanity never stops:

Mutex m = new Mutex();
m.WaitOne();
ThreadPool.RegisterWaitForSingleObject(
    m, delegate { Console.WriteLine("The insanity!"); }, null, -1, false);
m.ReleaseMutex();

The moral of the story?  Nothing terribly deep.  Thread affinity strikes once again.

5/13/2007 8:09:37 PM (Pacific Daylight Time, UTC-07:00)  #   

 Sunday, April 29, 2007

Long time readers will remember that I used to regularly blog about what I've been reading lately. (See here, here, here, and here for examples.) Well, it's been close to a year since I've posted such a list. So here's a little bit of a clearance of my back- log. (And this isn't even everything! I guess I read a lot.) I've separated the list into geek and non-geek books.

Geek books:

The Soul of a New Machine -- Tracy Kidder

10 of 10.
I'm only 16 years late on this one. This book is about DG's race to build a machine to contend with DEC's VAX, and has some great story telling. It reads almost like a fiction book, with great character building, a nice storyline running through the book, etc. The geek factor is low on this one, though I suppose it _is_ a book about a bunch of geeks building hardware, so it's a little up there. Evidently it won the Pulitzer prize.


Programming Pearls -- Jon Bentley

10 of 10.
Another one in the category of "how the heck did I not read this sooner?" This book is a collection of essays on various programming tasks, and gives tons of insight on engineering software in general. The prose is even entertaining too. This book now occupies a special place on my bookshelf.


Java Concurrency in Practice -- Brian Goetz, et. al

10 of 10.
This is a very down-to-earth, pragmatic overview of concurrency in Java. It even has chapters on testing and debugging, which get very little coverage in most articles but are clearly important. Though its focus is on Java, many of the ideas are more general, and thus it's a must-read for any serious Windows or .NET concurrency programmer. It's not as good as my upcoming book, but given that you can't buy mine yet, it will do for now ;).


The Old New Thing: Practical Development Throughout the Evolution of Windows -- Raymond Chen

8 of 10.
I'm sure everybody who reads my blog reads Raymond's too. If you do, then the book will be quite familiar to you. It's a collection of essays, most of which are edited versions of ones that have appeared on Raymond's blog in the past. I really like the layout and organization of chapters. One downside: Raymond apparently tried to keep it from becoming too geeky (he mentions several times, "for you non-programmers, you can skip this chapter"), but c'mon: how many non-programmers are actually going to read this book? IMHO he should have just let loose and never looked back.


Software Pioneers -- Manfred Broy (editor), Ernst Denert (editor)

8 of 10.
This book is a collection of classic articles by precisely what the book's title suggests, software pioneers: Friedrich L. Bauer (ALGOL), Ole-Johan Dahl (Simula), Niklaus Wirth (Pascal), Fred Brooks (Mythical Man Month, OS/360), Alan Kay (GUIs), Rudolf Bayer (B-trees), Peter Chen (E/R modeling), Edsger Dijkstra (obvious), Tony Hoare (CSPs, axiomatic semantics), David Parnas, John Guttag (abstract data types), Michael Jackson (JSP), Tom DeMarco (structured analysis), Michael Fagan (code inspection), Barry Boehm (engineering economics) and Erich Gamma (design patterns). Most of the papers are available on the net, but having them in a printed hardcover is really nice. There are also some new write-ups and an accompanying CD which contains talks from all of the above people. Nice.


An APL Compiler -- Timothy Budd

8 of 10.
An overview of Timothy Budd's APL compiler, from front to back end. (For those not up on APL, it's a great language. See the next item.) Also contains a chapter on compiling high level program statements to vector ISAs, which is quite timely and interesting.


APL with a Mathematical Accent -- C. A. Reiter, W. R. Jones

8 of 10.
This is the best overview and reference book on APL I've encountered to date. For $120, you get the cheesiest binding and printing ever, so be prepared to be disappointed in that department, but the content is well worth it. Covers APL from start to finish and has some handy reference charts.


Inside OS/2 -- Gordon Letwin

8 of 10.
I picked this book up after reading Larry Osterman's blog entry about how acquiring a critical section in OS/2 effectively suspended all threads in the system until it was released. Sure enough, that's how it worked. Funny. Anyway, this book was fun because it takes you back in time, and, since Gordon was the chief architect for OS/2, the writing gives a ton of insight into software design and architecture when OS/2 was being developed. Though there is plenty of detail which is fairly useless today (like how to specifically use DosCWait), I enjoyed skimming it.


Dreaming in Code: Two Dozen Programmers, Three Years, 4,732 Bugs, and One Quest for Transcendent Software -- Scott Rosenberg

7 of 10.
In the style of Soul of a New Machine (see above), this book takes you through the development of Chandler, a start -up software project lead by Mitch Kapor, the founder of Lotus. I generally agree with most of the reviews on Amazon: the book idea was good, the writing is very good, but the project he chose to follow was crap. As far as I know, Chandler is now dead, and the start-up didn't seem like a place buzzing with immensely passionate people, no matter how hard the author tried to convey that. He should have chosen Windows Vista or something ;).


Non-(computer-)geek books:

Molecular Gastronomy: Exploring the Science of Flavor -- Hervey This

10 of 10.
This is THE book on molecular gastronomy, literally. Hervey basically invented the field, from what I understand, and this book is a great collection of easy-to-read essays on the topic. You'll walk away with a better scientific understanding of what cooking is all about, including how various new-age techniques work, and perhaps even the confidence to try some of it out on your own. This is a must read for any serious food geek.


The Making of a Chef: Mastering Heat at the Culinary Institute -- Michael Ruhlman

10 of 10.
In this book, the author enrolls at the Cullinary Institute of America (CIA) and goes through basically the full cirriculum. The book is written excellently, and you feel like you're right there in class with him. And you certainly walk away with a deep and lucid appreciation for those who are slaves to the knife, cooking up delicious food because they love it. I was considering a stint at the CIA ... until I read this ... I'm now convinced that I couldn't handle it ;).


This is your Brain on Music: The Science of a Human Obsession -- Daniel J. Levitin

10 of 10.
So many of you probably don't know that I'm a music geek too. I write a lot of material -- actually, I've been writing a whole lot more lately -- though I don't play live at all anymore. This book has some light introduction to music theory, but the great part is the way the author dives into cognitive neuroscience and the effect of music on people, their brains, and their psychology. The book is incredibly unique and will instill a newfound appreciation of every nuance of that next breakbeat you hear.


The Soul of a Chef: The Journey Toward Perfection -- Michael Ruhlman

10 of 10.
Great follow-up to The Making of a Chef. The delves into the make-up of a chef, starting with the master chef exam at the CIA, and then profiling two chefs: Michael Symon (Lola) and Thomas Keller (French Laundry, per se). While the whole thing is great, the 1/3 of the book devoted to Thomas Keller makes the book wortwhile.


Heat: An Amateur's Adventures as Kitchen Slave, Line Cook, Pasta-Maker, and Apprentice to a Dante-Quoting Butcher in Tuscany -- Bill Buford

10 of 10.
I couldn't put this book down for a whole weekend, and then I was sad when it was done. The writing is extraordinary, and makes you feel like you're there, learning to make various pastas, to be a line cook at Babbo in New York, to try and learn what you can from Mario, and to travel to Italy to learn from some very interesting characters indeed. Easily one of my favorite reads from 2006.


A Cook's Tour -- Anthony Bourdain

9 of 10.
After reading Kitchen Confidential, I couldn't not read A Cook's Tour. Anthony is extremely entertaining, crude, and raw. That's what he does best. This is the book version of his late TV show with the same title. In it, he travels the world, tasting regional cuisines and reporting "from the trenches". Lots of mouth watering street food, bizzare encounters, exotic beverages, etc. If you can't afford to do a foodie world-tour yourself, this is the next best thing.


Guns, Germs, and Steel: The Fates of Human Societies -- Jared M. Diamond

9 of 10.
This book needs little introduction. It won the Pulitzer prize, after all. The book describes how certain socieites came to dominate various geographies of the world, including (as the title suggests) the role of guns, germs (disease), and steel in the process. For obvious reasons, the writing is a tad dry, but it's so jam packed full of interesting data and written incredibly methodically, both of which more than make up for the dryness.


The Omnivore's Dilemma: A Natural History of Four Meals -- Michael Pollan

9 of 10.
This book takes you through many different forms of eating, from corn and our new industrialized food system, to farming, to hunting and foraging, and beyond. Each section concludes with a meal characteristic of one particular style of food creation. Though at times the writing gives a hint of an agenda, the writing is generally excellent, and many facts are presented for consideration.


The Reach of a Chef: Beyond the Kitchen -- Michael Ruhlman

8 of 10.
This is the third of Ruhlman's XXX of a Chef series of books, and I really enjoyed it. This one looks at chefs and their influence inside and outside of the kitchen. That includes Grant Achatz (of Alinea), how he studied under Keller and became an American pioneer of molecular gastronomy, various Food TV celebs, and so on. Great read.


The Essays of Warren Buffett: Lessons for Corporate America -- Warren Buffet, Lawrence A. Cunningham (editor)

8 of 10.
Sometime in 2005, I stumbled across the archive of Warren Buffet's letters to the Berkshire Hathaway shareholders, dating back to 1977. I immediately printed them all out and have a stack of many 1,000 pages in my office to this day. Though they are fairly lengthy, there are many great lessons to be learned from reading them. This book is an edited down version of those essays. Very edited and abridged, actually, but some of the more important points are pulled out and analyzed. And the best part is that if you want to drill into more detail, all of the letters are available online.


Judgment of Paris: California vs. France and the Historic 1976 Paris Tasting that Revolutionized Wine -- Michael Ruhlman

7 of 10.
All wine lovers need to read this. I gave it 7 out of 10 simply because the writing style is actually pretty bad IMHO. But the content itself is great, and of historical signifciance. Wikipedia has a page on the event, but in summary, in 1976 many Bordeaux 1st growths, etc. were pitted against various California winemakers in a blind tasting. The tasting was held in Paris, and the judges were all French. Everybody believed the US wines would fall short, but it turned out that the US fared quite well: the book is the story of events leading up to and including this tasting.


The Nasty Bits: Collected Varietal Cuts, Usable Trim, Scraps, and Bones -- Anthony Bourdain

7 of 10.
To be entirely honest, this book was VERY entertaining, but fell a bit short of my expectations. This is a collection of fairly disjoint essays, almost the kind of thing you'd expect to see on a blog. Each reads very well on its own, but it lacked the kind of cohesive feel I was looking for. Nevertheless, Anthony is always entertaining, crude, funny, and makes me laugh. This book was no exception.


4/29/2007 12:22:39 AM (Pacific Daylight Time, UTC-07:00)  #   

 Tuesday, April 24, 2007

Haskell is the most underappreciated yet extraordinarily significant programming language in the world.  The syntax is frightening enough to scare off those with weak stomaches, but some of the most interesting and creative research in type systems and, within recent years, parallelism have arisen from the Haskell community.  I recently stumbled across a fascinating paper from the ACM SIGPLAN History of Programming Languages Conference (HOPL'III) from earlier this year:

A History of Haskell: being lazy with class

Abstract This long (55-page) paper describes the history of Haskell, including its genesis and principles, technical contributions, implementations and tools, and applications and impact.

First I'll admit that I'm a functional programming geek.  Second I'll admit that I love reading about technology history.  But those biases aside, the paper is really quite good.  Recommended reading for anybody who's ever run across a lambda floating around in their dreams.  I know that I have.

4/24/2007 10:20:49 PM (Pacific Daylight Time, UTC-07:00)  #   

 Sunday, April 22, 2007

Somebody I respect a lot on our team said something interesting the other day: paraphrasing, "parallelism is about taking one trick and applying it to as many things as possible."  Well, what's the trick?  The trick is breaking a problem into successively smaller pieces on which disjoint subsets of the overall computation can run concurrently.  Pieces in this sense can be little bits of data or instructions, or both.  It seems so obvious, but that really is all there is to it.  That's not to say it's easy, of course, though some people believe it is.  One of the nice things about PLINQ is that you express a big computation and we hide the tricks.  But the tricks aren't impossible to do on your own... today, even.

4/22/2007 11:45:55 PM (Pacific Daylight Time, UTC-07:00)  #   

 Thursday, April 19, 2007

Michael Suess, author of the tremendously cool blog thinkingparallel.com, recently ran a series of interviews.  He asked five parallelism experts from different domains (Erlang, MPI, OpenMP, POSIX threads, .NET threads) to answer the same set of questions:

It's interesting to see the range of answers given.  We agree on many things, but our different backgrounds really shine through in other cases.

4/19/2007 9:34:06 AM (Pacific Daylight Time, UTC-07:00)  #   

 Monday, April 16, 2007

To gather meaningful performance metrics, it's usually a good idea to run several iterations of the same test, averaging the numbers in some way, to eliminate noise from the results.  This is true of sequential and fine-grained parallel performance analysis alike.  Though it's clearly important for sequential code too, data locality can add enough noise to your parallel tests that you'll want to do something about it.  For example, if iteration #1 enjoys some form of temporal locality left over from iteration #0, then all but the first iteration would receive an unfair advantage.  This advantage isn't usually present in the real world -- most library code isn't called over and over again in a tight loop -- and could cause test results to appear rosier than what customers will actually experience.  Therefore, we probably want to get rid of it.

To eliminate this noise, we can clear the Dcaches (data caches) of all processors used to run tests before each iteration.  How do you do that?  Well...

Intel offers a WBINVD instruction to clear a processor's Dcache, but sadly it's a privileged instruction and there's no way to get at it from user-mode.  So that's a no-go for most Windows programmers.  There's also a Win32 function to clear a processor's Icache, but this doesn't work for Dcaches, which is what we're after, so we're out of luck there too.

Instead, we can implement a really low tech solution.  Take some random data, sized big enough to fill a processor's L2 cache, and read the whole thing from each processor whose cache we wish to clear before each iteration.  This will evict all of the existing lines in the caches that could be left over from previous iterations.  All of the new lines will be brought in as shared, and, while they will be evicted when we start using real data in the query, this effectively simulates a cold cache.  Here's an example of this:

const int s_garbageSize = 1024 * 1024 * 64; // 64MB.
static IntPtr s_garbage = System.Runtime.InteropServices.Marshal.AllocHGlobal(s_garbageSize);

unsafe static void ClearCaches() {
    for (int i = 0; i < Environment.ProcessorCount; i++) {
        SetThreadAffinityMask(GetCurrentThread(), new IntPtr(1 << i));
        long * gb = (long *)s_garbage.ToPointer();
        for (int j = 0; j < s_garbageSize / sizeof(long); j++) {
            long x = *(gb + j); // Read the line (shared).
            long y = Math.Max(Math.Min(x, 0L), 0L); // Prevent optimizing away the read.
        }
    }
    SetThreadAffinityMask(GetCurrentThread(), new IntPtr(0));
}

[DllImport("kernel32.dll")]
static extern IntPtr GetCurrentThread();
[DllImport("kernel32.dll")]
static extern IntPtr SetThreadAffinityMask(IntPtr hThread, IntPtr dwThreadAffinityMask);

This clearly isn't the most efficient implementation.  On multi-core architectures, some cores are apt to share some levels of cache, so with the above approach we'll end up doing duplicate (and wasted) work.  And few processors on the market have 64MB of L2 cache -- I've just chosen a reasonable number that's bigger than most processors -- so we could try to be more precise there too.  You could use the GetLogicalProcessorInformation API, new to Windows Server 2003 (server) and Vista (client), if you really wanted to be a stud.  In any case, this does the trick.

4/16/2007 6:11:02 PM (Pacific Daylight Time, UTC-07:00)  #   

 

Recent Entries:

Search:

Browse by Date:
<June 2007>
SunMonTueWedThuFriSat
272829303112
3456789
10111213141516
17181920212223
24252627282930
1234567

Browse by Category:

Notables: