|
Personal Info:
Blogroll:
Disclaimer:
The content of this site are my own personal opinions and do
not represent my employer's view in anyway.
© 2010, Joe Duffy
|
|
 Wednesday, February 11, 2009
A couple weeks ago, I illustrated a very simple reader/writer lock that was comprised of a single word and used spinning instead of blocking under contention. The reason you might use a lock with a read (aka shared) mode is fairly well known: by allowing multiple readers to enter the lock simultaneously, concurrency is improved and therefore so does scalability. Or so the textbook theory goes.
As a purely theoretical illustration, imagine we’re on a heavily loaded 8-CPU server where a new request arrives every 0.25ms and runs for 1ms. In an ideal world, we could service requests coming in at a rate of 1ms / 8-CPUs = 0.125ms without falling behind. But imagine these requests need to access some shared state, and so there is a bit of serialization required. In fact, let’s imagine each does 0.5ms’ worth of its work inside a lock. If you were to use a mutually exclusive lock, then you’d have an immediate lock convoy on your hands. Even with 8-CPUs you won’t be able to keep up. You’ll start off gradually building up a debt, and eventually come to a crawl. Let’s examine the initial timeline:
Req# Arrival Acquire Release Wait Time
1 0.0ms 0.0ms 0.5ms 0.0ms
2 0.25ms 0.5ms 1.0ms 0.25ms
3 0.5ms 1.0ms 1.5ms 0.5ms
4 0.75ms 1.5ms 2.0ms 0.75ms
5 1.0ms 2.0ms 2.5ms 1.0ms
6 1.25ms 2.5ms 3.0ms 1.25ms
7 1.5ms 3.0ms 3.5ms 1.5ms
8 1.75ms 3.5ms 4.0ms 1.75ms
Oh jeez, after only the first 8 requests, we’ve fallen way behind.
Each new request adds 0.25ms onto the amount of time the request must wait for the lock. And it’s not going to get any better:
9 2.0ms 4.0ms 4.5ms 2ms
10 2.25ms 4.5ms 5.0ms 2.25ms
11 2.5ms 5.0ms 5.5ms 2.5ms
12 2.75ms 5.5ms 6.0ms 2.75ms
... and so on ...
By request #9, requests have to wait for twice as long as they run. Eventually something has to give, or the server will come tumbling down.
Now, imagine we used a reader/writer lock instead. Threads would never wait for each other, and we wouldn’t end up with this never-ending buildup of wait times. In other words, the “Wait Time” column above would always be 0.0ms. And because the arrival rate is less than our theoretical limit of one request per 0.125ms, our lock convoy is gone. Right?
Unfortunately, probably not; this mental model is overly naïve.
Even when a read lock is acquired, there is mutual exclusion going on:
- Some reader/writer locks actually use mutually exclusive locks to protect their own internal state, like the list of current readers! This can come as a surprise, but it’s true of the .NET reader/writer locks. Vance’s example even does and, although it uses a spin lock in an attempt to reduce the overhead, there’s still no denying that it’s mutual exclusion.
- And even if they don’t use mutually exclusive locks, like the simple spin-based one from my previous blog post, there are CAS instructions. And a CAS instruction actually amounts to a form of mutual exclusion at the hardware level, because the cache coherency machinery needs to ensure that no two processors try to acquire and modify the same cache line exclusively.
- In addition to all of that overhead, the cost in CPU cycles of acquiring the read-lock is nowhere near zero. Because of the use of locks and/or CAS internally, and the resulting cache contention and line evictions that this will cause, throughput will suffer. If there is contention, threads may end up blocked (if real locks are being used), spinning (if spin locks are being used), or simply optimistically retrying CAS’s due to line ping ponging.
The result?
Read locks are just as bad as mutually exclusive locks when lock hold times are short. In fact, they can be worse, because reader/writer locks are more complicated and therefore cost more than simple mutually exclusive locks: many need to keep track of read lists in order to disallow recursive acquires, maintain multiple event handles so certain kinds of waiters can be awakened over others, and store various kinds of counters and flags. Even my super simple single-word, spin-based reader/writer lock needed to worry about blocking out readers when a writer was waiting, properly incrementing and decrementing the reader count when readers are racing with one another (leading to more complicated CAS on the exit path than ordinary write locks), and so on.
That said, a reader/writer lock would in fact probably work in the situation above. A hold time of 0.5ms is huge, and with only 8 concurrent threads and the arrival rate we’re talking about, the overheads are apt to be quite small in relation to the work being done. Another similar setting in which reader/writer locks commonly make a noticeable difference is in the execution of large database transactions.
But the sad truth is that we tell programmers to keep lock hold times short, and most locks I see are comprised of two dozen instructions or less. So we’re in the microsecond range at the very most, which is certainly not large enough for read locks to pan out.
To illustrate this point, I wrote a little benchmark program that benchmarks the legacy .NET ReaderWriterLock, the 3.5 ReaderWriterLockSlim type, and my little spin reader/writer lock. All it does is spawn 4 threads on my dual-socket, dual-core (4-CPU) machine, and then loop around so many times acquiring and releasing a certain kind of lock. I’ve written the test so that the amount of work done inside the lock is parameterized as a certain number of non-inlined function calls. I also parameterize the percentage of acquires that will be write-locks. Then I’ve run this a bunch of times, and compared the total time taken with the equivalent code using a CLR Monitor for mutual exclusion instead.
Here are some results, where each column represents the number of function calls. The entries are the cost relative to Monitor: 1.00x means they are the same, 0.5x means the alternative lock is twice as fast, and 2.0x means the alternative lock is twice as slow. Remember, the ideal situation would be 0.25x: that is, by allowing four threads to run completely concurrently, we run four times faster.
0% writers:
0 calls 10 calls 100 calls 1000 calls
RWL (legacy) 9.23x 6.46x 0.90x 0.49x
RWLSlim (3.5) 2.11x 2.01x 0.96x 0.32x
SpinRWL 9.63x 7.04x 1.02x 0.26x
5% writers:
0 calls 10 calls 100 calls 1000 calls
RWL (legacy) 10.55x 8.23x 1.71x 0.63x
RWLSlim (3.5) 2.29x 2.36x 1.18x 0.61x
SpinRWL 5.69x 5.59x 1.43x 0.94x
10% writers:
0 calls 10 calls 100 calls 1000 calls
RWL (legacy) 20.31x 10.39x 2.34x 0.99x
RWLSlim (3.5) 2.26x 2.04x 1.15x 1.00x
SpinRWL 6.87x 5.03x 1.42x 1.34x
25% writers:
0 calls 10 calls 100 calls 1000 calls
RWL (legacy) 74.49x 49.59x 9.18x 2.15x
RWLSlim (3.5) 2.09x 2.10x 1.14x 1.00x
SpinRWL 4.70x 4.20x 1.43x 1.69x
50% writers:
0 calls 10 calls 100 calls 1000 calls
RWL (legacy) 148.34x 98.46x 20.46x 3.63x
RWLSlim (3.5) 2.18x 1.95x 1.15x 0.95x
SpinRWL 3.23x 3.73x 1.54x 1.39x
100% writers:
0 calls 10 calls 100 calls 1000 calls
RWL (legacy) 170.59x 123.66x 24.04x 4.29x
RWLSlim (3.5) 2.18x 1.95x 1.04x 0.92x
SpinRWL 2.63x 2.04x 1.06x 0.87x
Clearly there are a number of anomalies in these numbers. Why the legacy ReaderWriterLock balloons to 170X the cost of Monitor when we have 100% writers is a very interesting question indeed. Why my simple spin reader/writer lock is 9.63X when we have pure reads and 0 calls, and yet the ReaderWriterLockSlim type is only 2.11X is also interesting. And so on. The numbers are very specific to the version of .NET I am using, and indeed the precise machine configuration, including the number and layout of cores and caches.
But if we look more generally at the numbers, ignoring some of the surprising ones, we can make one interesting and safe conclusion: You need a really low percentage of writers, and a really long amount of time inside the lock, for any scalability wins to show up as a result of using a reader/writer lock. Our best case was the spin reader/writer lock when we had 0% writers and 1000 calls. But clearly if you have no writers, i.e., state is immutable, there’s little point in using any locks whatsoever! This is an extreme result, where threads are hammering on the lock constantly in a tight loop, but if you stop to think about it: When else would a reader/writer lock make a difference? If threads are just getting in and out of the lock very quickly, and arrivals are infrequent, then there is no benefit to allowing multiple threads in at once anyway.
The moral of the story? Besides suggesting that you seriously question whether a reader/writer lock is actually going to buy you anything, it's the same as the conclusion in my previous post on the matter:
Sharing is evil, fundamentally limits scalability, and is best avoided.
 Monday, February 02, 2009
I frequently get asked about the C# compiler's warning CS0420 about taking byrefs to volatile fields. For example, given a program
class P { static volatile int x;
static void Main() { f(ref x); }
static void f(ref int y) { while (y == 0) ; } }
the C# compiler will complain
xx.cs(8,15): warning CS0420: 'P.x': a reference to a volatile field will not be treated as volatile
because of the line containing 'ref x'. (The same applies to 'out' parameters too.) The natural question is, of course, whether to worry about it.
In general, the answer is yes, you must worry. In the above example, the use of the 'y' parameter inside 'f' will not be treated as volatile, as the warning says. What does that mean in practice? For one, the read of 'y' in 'f's while loop could be considered loop invariant by the JIT compiler and hoisted, and you'd possibly loop forever. It also means that on IA64 platforms, such reads will be emitted as ordinary loads instead of the special load-acquire variant that is emitted for volatile loads. This can lead to reordering bugs. In other words, you lose the volatile-ness of the field as soon as you cast it away as an ordinary byref. And unlike C++ where you can have a volatile pointer, there's no way to mark a .NET byref as volatile.
(You can use the Thread.VolatileRead and VolatileWrite methods to use a byref in a volatile manner. Unfortunately they are far more costly than ordinary volatile loads and stores.)
There is one particularly annoying case in which this warning is complete noise: when passing a byref to an API that internally performs volatile (or stronger) loads and stores. I.e., the Interlocked.*, Thread.VolatileRead, and VolatileWrite methods. Because these APIs internally use explicit memory barriers and atomic hardware instructions, the byref will effectively be treated as volatile regardless of whether it was taken from a volatile field or not. And therefore it is safe.
For instance, the compiler will warn you about the following code
volatile int x;
static void f() { Interlocked.Exchange(ref x, 1); }
even though there is no problem. You can suppress the warning with a "#pragma warning disable" just before the call
volatile int x;
static void f() { #pragma warning disable 0420 Interlocked.Exchange(ref x, 1); #pragma warning restore 0420 }
and then restore it immediately afterwards. (It's a good idea to restore the warning so that you catch other possibly-problematic instances from being missed.)
This comes up a whole lot. Why? Because many times you'll mark a field volatile, even though it is updated exclusively with CAS operations, because it's also used in other contexts: e.g., sequences where loads mustn't reorder or erroneously be considered loop invariant. I personally have a habit of always marking these variables as such, mostly as a carryover from Win32 whose InterlockedXX family of APIs demand volatile pointers (i.e., volatile * LONG).
I'm told that this annoying case might be fixed in the next C# compiler, by the way. Until then, I figured I'd throw this up for reference purposes.
 Thursday, January 29, 2009
Reader/writer locks are commonly used when significantly more time is spent reading shared state than writing it (as is often the case), with the aim of improving scalability. The theoretical scalability wins come because the lock can be acquired in a special read-mode, which permits multiple readers to enter at once. A write-mode is also available which offers typical mutual exclusion with respect to all readers and writers. The idea is simple: if many readers can read simultaneously, the theory goes, concurrency improves.
(I’ll be posting an analysis of reader/writer lock scalability in an upcoming post. For a variety of reasons--most related to my recent CAS post--they seldom make a dramatic impact in practice.)
In addition to showing up in libraries--such as Vista’s new SRWLock, .NET’s ReaderWriterLock, and .NET 3.5’s ReaderWriterLockSlim--they are used pervasively in relational databases, distributed transactions, and software transactional memory.
Vance Morrison demonstrated a lightweight reader/writer lock on his blog a couple years back. Although quite small, you can get smaller. Much like the new SpinLock type being made available in .NET 4.0, we can build a ReaderWriterSpinLock that offers several advantages:
- It’s a struct, and so there is no object allocation or space for an object header necessary.
- It’s a single word in size (i.e., 4 bytes).
- No kernel events are ever allocated; we will spin instead.
For cases in which reads are extraordinarily frequent, and writes are extraordinarily rare, this approach can actually be useful. Unfortunately, because one common case in which reader/writer locks scale very well is when hold times are lengthy, as will be shown in my upcoming post, even moderately common writes will result in chewing up a whole lot of wasted CPU time due to (3). If there’s interest, I will look into implementing a variant of this type that uses events for waiting. Clearly this would sacrifice (2).
Some design decisions have been made in the name of keeping this thing lightweight:
- No thread affinity will be used.
- And therefore no recursive acquires will be allowed.
The full code is below, at the bottom of this post. But let’s review the details one-by-one.
First, all state is packed into a single field, m_state. We’ll use the 32nd bit to represent whether the write lock is held, and we’ll use the 31st bit to represent whether a writer is attempting to acquire the lock. As with most reader/writer locks, we will give writers priority over readers because they are supposed to be very infrequent. In other words, once a writer arrives, no more read lock acquires will be permitted. The remaining 30 bits will be used to store the reader count. Some masks make this convenient:
private volatile int m_state; private const int MASK_WRITER_BIT = unchecked((int)0x80000000); private const int MASK_WRITER_WAITING_BIT = unchecked((int)0x40000000); private const int MASK_WRITER_BITS = unchecked((int)(MASK_WRITER_BIT | MASK_WRITER_WAITING_BIT)); private const int MASK_READER_BITS = unchecked((int)~MASK_WRITER_BITS);
Now we can write the four methods: EnterWriteLock, ExitWriteLock, EnterReadLock, ExitReadLock.
Entering the write lock merely entails setting m_state to MASK_WRITER_BIT, provided that we see it available. If it’s not available, we’ll just go ahead and try to set the MASK_WRITER_WAITING_BIT to prevent subsequent read locks from being acquired until we get in. We then go ahead and spin until the lock is available using the new type SpinWait in .NET 4.0, checking the m_state field over and over again. The lock is available if m_state is 0 or MASK_WRITER_WAITING_BIT:
public void EnterWriteLock() { SpinWait sw = new SpinWait(); do { // If there are no readers currently, grab the write lock. int state = m_state; if ((state == 0 || state == MASK_WRITER_WAITING_BIT) && Interlocked.CompareExchange(ref m_state, MASK_WRITER_BIT, state) == state) return;
// Otherwise, if the writer waiting bit is unset, set it. We don't // care if we fail -- we'll have to try again the next time around. if ((state & MASK_WRITER_WAITING_BIT) == 0) Interlocked.CompareExchange(ref m_state, state | MASK_WRITER_WAITING_BIT, state);
sw.SpinOnce(); } while (true); }
Leaving the write lock is actually quite simple. We just set the m_state field to 0, preserving the MASK_WRITER_WAITING_BIT just in case another writer has arrived since we acquired the lock. We use an Interlocked.Exchange (XCHG) operation for this, although we technically could have just done an ordinary write, provided doing so wouldn’t cause memory model or availability problems:
public void ExitWriteLock() { // Exiting the write lock is simple: just set the state to 0. We // try to keep the writer waiting bit to prevent readers from getting // in -- but don't want to resort to a CAS, so we may lose one. Interlocked.Exchange(ref m_state, 0 | (m_state & MASK_WRITER_WAITING_BIT)); }
Entering the read lock is even more straightforward. The lock is available for readers when m_state & MASK_WRITER_BITS is 0. In other words, no writer holds the lock and no writer is waiting for the lock. Once we see the lock in such a state, we merely try to add one to the state value and CAS it in. In this way, m_state & MASK_READER_BITS will be equal to the number of concurrent readers in the lock:
public void EnterReadLock() { SpinWait sw = new SpinWait(); do { int state = m_state; if ((state & MASK_WRITER_BITS) == 0) { if (Interlocked.CompareExchange(ref m_state, state + 1, state) == state) return; }
sw.SpinOnce(); } while (true); }
Lastly, exiting the read lock is the most complicated operation of all. It needs to decrement the reader count, while at the same time preserving the MASK_WRITER_WAITING_BIT:
public void ExitReadLock() { SpinWait sw = new SpinWait(); do { // Validate we hold a read lock. int state = m_state; if ((state & MASK_READER_BITS) == 0) throw new Exception("Cannot exit read lock when there are no readers");
// Try to exit the read lock, preserving the writer waiting bit (if any). if (Interlocked.CompareExchange( ref m_state, ((state & MASK_READER_BITS) - 1) | (state & MASK_WRITER_WAITING_BIT), state) == state) return;
sw.SpinOnce(); } while (true); }
And that’s it.
Here are some single-threaded performance numbers, comparing the relative costs of several locks out there. These are taken from a large number of acquire/release pairs, i.e., ‘for (int i = 0; i < N; i++) { lock.Enter(); lock.Exit(); }’, for a very large value of N:
Monitor 0004487479 RWL read lock (legacy) 0023042785 5.13491x RWL write lock (legacy) 0023118085 5.15169x SlimRWL read lock (3.5) 0009423579 2.099976x SlimRWL write lock (3.5) 0008680855 1.934465x Vance read lock 0004923609 1.097193x Vance write lock 0004802136 1.070123x SpinRWL read lock 0004298525 0.9579604x SpinRWL write lock 0003819024 0.8510431x
The Nx ratios compare the lock in question to Monitor as our baseline. Smaller is better. As you can see, we seem to be on pretty solid ground to start with. But clearly the most interesting part of this whole thing is the scaling numbers--in particular whether read-mode helps with throughput--both for the existing reader/writer locks and our new one. The results may surprise you. That’s coming in the next post...
(Here is the full listing.)
using System;
// We use plenty of interlocked operations on volatile fields below. Safe. #pragma warning disable 0420
namespace System.Threading { /// <summary> /// A very lightweight reader/writer lock. It uses a single word of memory, and /// only spins when contention arises (no events are necessary). /// </summary> public struct ReaderWriterSpinLock { private volatile int m_state; private const int MASK_WRITER_BIT = unchecked((int)0x80000000); private const int MASK_WRITER_WAITING_BIT = unchecked((int)0x40000000); private const int MASK_WRITER_BITS = unchecked((int)(MASK_WRITER_BIT | MASK_WRITER_WAITING_BIT)); private const int MASK_READER_BITS = unchecked((int)~MASK_WRITER_BITS);
public void EnterWriteLock() { SpinWait sw = new SpinWait(); do { // If there are no readers currently, grab the write lock. int state = m_state; if ((state == 0 || state == MASK_WRITER_WAITING_BIT) && Interlocked.CompareExchange(ref m_state, MASK_WRITER_BIT, state) == state) return;
// Otherwise, if the writer waiting bit is unset, set it. We don't // care if we fail -- we'll have to try again the next time around. if ((state & MASK_WRITER_WAITING_BIT) == 0) Interlocked.CompareExchange(ref m_state, state | MASK_WRITER_WAITING_BIT, state);
sw.SpinOnce(); } while (true); }
public void ExitWriteLock() { // Exiting the write lock is simple: just set the state to 0. We // try to keep the writer waiting bit to prevent readers from getting // in -- but don't want to resort to a CAS, so we may lose one. Interlocked.Exchange(ref m_state, 0 | (m_state & MASK_WRITER_WAITING_BIT)); }
public void EnterReadLock() { SpinWait sw = new SpinWait(); do { int state = m_state; if ((state & MASK_WRITER_BITS) == 0) { if (Interlocked.CompareExchange(ref m_state, state + 1, state) == state) return; }
sw.SpinOnce(); } while (true); }
public void ExitReadLock() { SpinWait sw = new SpinWait(); do { // Validate we hold a read lock. int state = m_state; if ((state & MASK_READER_BITS) == 0) throw new Exception("Cannot exit read lock when there are no readers");
// Try to exit the read lock, preserving the writer waiting bit (if any). if (Interlocked.CompareExchange( ref m_state, ((state & MASK_READER_BITS) - 1) | (state & MASK_WRITER_WAITING_BIT), state) == state) return;
sw.SpinOnce(); } while (true); } } }
 Friday, January 16, 2009
I just uploaded a free sample chapter for my Concurrent Programming on Windows book:
2 Synchronization and Time
STATE IS AN important part of any computer system. This point seems so obvious that it sounds silly to say it explicitly. But state within even a single computer program is seldom a simple thing, and, in fact, is often scattered throughout the program, involving complex interrelationships and different components responsible for managing state transitions, persistence, and so on. Some of this state may reside inside a process’s memory—whether that means memory allocated dynamically in the heap (e.g., objects) or on thread stacks—as well as files on-disk, data stored remotely in database systems, spread across one or more remote systems accessed over a network, and so on. The relationships between related parts may be protected by transactions, handcrafted semitransactional systems, or nothing at all.
The broad problems associated with state management, such as keeping all sources of state in-synch, and architecting consistency and recoverability plans all grow in complexity as the system itself grows and are all traditionally very tricky problems. If one part of the system fails, either state must have been protected so as to avoid corruption entirely (which is generally not possible) or some means of recovering from a known safe point must be put into place.
While state management is primarily outside of the scope of this book, state “in-the-small” is fundamental to building concurrent programs. Most Windows systems are built with a strong dependency on shared memory due to the way in which many threads inside a process share access to the same virtual memory address space. The introduction of concurrent access to such state introduces some tough challenges. With concurrency, many parts of the program may simultaneously try to read or write to the same shared memory locations, which, if left uncontrolled, will quickly wreak havoc. This is due to a fundamental concurrency problem called a data race or often just race condition. Because such things manifest only during certain interactions between concurrent parts of the system, it’s all too easy to be given a false sense of security—that the possibility of havoc does not exist.
In this chapter, we’ll take a look at state and synchronization at a fairly high level. We’ll review the three general approaches to managing state in a concurrent system:
- Isolation, ensuring each concurrent part of the system has its own copy of state.
- Immutability, meaning that shared state is read-only and never modified, and
- Synchronization, which ensures multiple concurrent parts that wish to access the same shared state simultaneously cooperate to do so in a safe way.
We won’t explore the real mechanisms offered by Windows and the .NET Framework yet. The aim is to understand the fundamental principles first, leaving many important details for subsequent chapters, though pseudo-code will be used often for illustration.
We also will look at the relationship between state, control flow, and the impact on coordination among concurrent threads in this chapter. This brings about a different kind of synchronization that helps to coordinate state dependencies between threads. This usually requires some form of waiting and notification. We use the term control synchronization to differentiate this from the kind of synchronization described above, which we will term data synchronization.
Read more here...
Related, I was recently interviewed by DZone about the book. You can read my responses here. Enjoy.
 Monday, January 12, 2009
I received some feedback on my previous post, Some performance implications of CAS operations, indicating that a few clarifications are in order. If I had to summarize the intended conclusion, it’d go something like this:
Sharing is evil, fundamentally limits scalability, and is best avoided.
I have to admit that the post was meant to focus more on concrete data, since I expected the meta-point about sharing to be implied. I figured folks would pick up on the link: (i) Sharing memory requires concurrency control, (ii) Concurrency control requires CAS, (iii) CAS is expensive, therefore (iv) Sharing memory is expensive. Many people simply don’t understand how crippling CAS can be when placed in a hot path, and I wanted to point out some (albeit extreme) examples of this point.
I did have a motivation for the post. A lot of people point at lock-free techniques, software transactional memory, reader/writer locks, etc. as ways to improve scalability. Sadly this seldom pans out. Each involves CASs of some sort, and, assuming the lock-based equivalent is written properly (that is, to hold locks for very short periods of time), the alternative can in fact often fare worse. I call this game “count the CASs.” It’s the roundtrips back to shared memory, failed optmistic attempts, cache invalidations, and line ping ponging that kills you.
Some might accuse me of unfairly targeting CAS. That’s hogwash. I’ve been in the trenches for years writing and optimizing systems-level parallel code on Windows. A parallel for loop can go from scaling perfectly to not scaling at all if you choose the wrong granularity for the loop counter increments. And vice versa. Why? Because the frequency of CASs will bring the memory system to its knees. You simply must consider these kinds of things when developing your data structures and algorithms; easing pressure on the cache hierarchy is the only way to scale beyond a handful of processors.
The sad truth is that only radical changes to the way we write software will allow fine-grained parallelism to scale to the numbers we expect in the 5 year time horizon. Hiding more and more conveniently inserted CAS operations auto-magically for folks is not doing them any good. Mostly functional combined with concurrency-safe mutation on guaranteed-isolated object graphs is, in my opinion, the only path forward.
 Friday, January 09, 2009
Along with type systems, I'm casually interested in formal specifications and verification of software. During lunch today, I watched an internal Microsoft Research talk given by Leslie Lamport. The topic was TLA+ -- his formal verification system -- during which he blurted out a couple amusing quotes:
"Writing is nature's way of letting you know how sloppy your thinking is." --- Guindon (cartoon)
"Math is nature's way of letting you know how sloppy your writing is." --- Leslie Lamport (riffing on Guindon)
And related:
"Formal math is nature's way of letting you know how sloppy your math is." --- Leslie Lamport
They made me chuckle out loud, so I figured I'd share them. Unfortunately the talk isn't available outside the company (as far as I can tell), but Lamport has written a book, Specifying Systems, available online, in addition to dozens of interesting papers, on the topic.
 Thursday, January 08, 2009
CAS operations kill scalability.
(“CAS” means compare-and-swap. This is the term most commonly used in academic literature, but it is commonly referred to under many guises. Windows has historically called it an “interlocked” operation and offers a bunch of such-named Win32 APIs; .NET does the same. This set entails X86 instructions like XCHG, CMPXCHG, and certain instructions prefixed with LOCK, such as INC, ADD, and so on.)
My opening statement is a bit extreme, but it’s true enough. There are several reasons:
0. CAS relies on support in the hardware to ensure atomicity. Namely, most Intel and AMD architectures use a MOSEI cache coherency protocol to manage cache lines. In such an architecture, CAS operations on uncontended lines that are owned exclusively (E) within a processor’s cache are relatively cheap. But any contention – false or otherwise – leads to invalidations and bus traffic. The more invalidations, the more saturated the bus, and the greater the latency for CAS completion. Cache contention is a scalability killer for non-CAS memory operations too, but the need to acquire a line exclusively makes matters doubly worse when CAS is involved.
1. CAS costs more than ordinary memory operations, in CPU cycles. This is due to the additional burden on the cache hierarchy, and also because of requirements around flushing write buffers, restrictions on speculation across the fences, and impact to a compiler’s ability to optimize around the CAS.
2. CAS is often used in optimistically concurrent operations. That means a failed CAS will lead to a retry of some sort – typically with some kind of backoff – which is purely wasted work that isn’t present when there isn’t any contention. And 0 and 1 both increase the risk of contention.
The most common occurrence of a CAS is upon lock entrance and exit. Although a lock can be built with a single CAS operation, CLR monitors use two (one for Enter and another for Exit). Lock-free algorithms often use CAS in place of locks, but due to memory reordering such algorithms often need explicit fences that are typically encoded as CAS instructions. Although locks are evil, most good developers know to keep lock hold times small. As a result, one of the nastiest impediments to performance and scaling has nothing to do with locks at all; it has to do with the number, frequency, and locality of CAS operations.
As a simple illustration, imagine we’d like to increment a counter 100,000,000 times. There are a few ways we could do this. If we’re just running on a single CPU, we can use ordinary memory operations:
Variant #0: static volatile int s_counter = 0; for (int i = 0; i < N; i++) s_counter++;
This clearly isn’t threadsafe, but provides a good baseline for the cost of incrementing a counter. The first way we might make it threadsafe is by using a LOCK INC:
Variant #1: static volatile int s_counter = 0; for (int i = 0; i < N; i++) Interlocked.Increment(ref s_counter);
This is now threadsafe. An alternative way of doing this – commonly needed if we must perform some kind of validation (like overflow prevention) – is to use a CMPXCHG:
Variant #2: static volatile int s_counter = 0; for (int i = 0; i < N; i++) { int tmp; do { tmp = s_counter; } while (Interlocked.CompareExchange(ref s_counter, tmp+1, tmp) != tmp); }
An interesting question to ask now is: How much slower will each variant be when cache contention is introduced? In other words, run a copy of each code on P separate processors, incrementing the same s_counter variable by N/P, and compare the running times for different values of P, including 1. You might be surprised by the results.
For example, on one of my dual-processor/dual-core (that’s 4-way) Intel machines, the results are as follows. I’ve run Variant #0 even though it’s not threadsafe, simply because it shows the effects of cache contention on ordinary memory loads and stores.
#0, P = 1: 1.00X #1, P = 1: 4.73X #2, P = 1: 5.38X #0, P = 2: 2.11X #1, P = 2: 10.74X #2, P = 2: 16.70X #0, P = 4: 3.87X #1, P = 4: 7.57X #2, P = 4: 73.35X
All numbers are normalized and compared to the ++ code on a single processor. In other words, Variant #0 run on 2 processors is 2.11X the cost of Variant #0 run on 1 processor; similarly, Variant #0 run on 4 processors is 3.87X the cost of Variant #0 run on 1 processor. Variant #1 gets even worse at 4.73X, 10.74X, and 7.57X, respectively. And Variant #2 explodes in cost as more contention is added, going from 5.38X, to 16.70X, to a whopping 73.35X. Adding more concurrency actually makes things substantially worse.
(The absolute numbers are not to be trusted, and there are anomalies undoubtedly introduced based on how threads are scheduled; I’ve not affinitized them, so they may end up sharing sockets at will. A more scientific experiment needs to consider such things.)
The CMPXCHG example (Variant #2) can be improved by strategic spinning when a CAS fails. Part of what makes the numbers so bad – particularly the P = 4 case – is the amount of lost time due to livelock and the associated memory system interference.
This is an extreme example. Few workloads sit in a loop modifying the same location in memory over and over and over again. Even if they do – as in the case of a parallel for loop in which all threads fight to increment the shared “current index” variable – these accesses are ordinarily broken apart by sizeable delays during which useful work is done. Augmenting the test to delay accessing the shared location by a certain number of function calls certainly relieves pressure.
For example, here are the numbers if we add a 2-function-call delay in between accesses:
#0, P = 1: 1.00X #1, P = 1: 2.54X #2, P = 1: 2.77X #0, P = 2: 1.47X #1, P = 2: 5.19X #2, P = 2: 8.59X #0, P = 4: 2.78X #1, P = 4: 3.67X #2, P = 4: 26.55X
And if we add a 64-function-call delay in between accesses, the micro-cost between the three variants doesn’t matter much. But the contention behavior sure is different. And we can even find some cases where the multithreaded variants run faster than the single-threaded counterpart:
#0, P = 1: 1.00X #1, P = 1: 1.00X #2, P = 1: 1.00X #0, P = 2: 0.59X #1, P = 2: 0.74X #2, P = 2: 0.85X #0, P = 4: 0.51X #1, P = 4: 0.45X #2, P = 4: 1.23X
This is the first time we have seen a number < 1.00X. That's a speedup; remember, we are using parallelism after all.
As you might guess, in the region between 2 and 64 function calls the results gradually get better and better; and beyond 64, they get substantially better. In fact, when we insert 128 function calls in between, we get very close to perfect, linear scaling for all 3 variants:
#0, P = 1: 1.00X #1, P = 1: 1.00X #2, P = 1: 1.00X #0, P = 2: 0.50X #1, P = 2: 0.52X #2, P = 2: 0.52X #0, P = 4: 0.30X #1, P = 4: 0.29X #2, P = 4: 0.27X
(As a reminder, 0.50X is a perfect speedup on a 2-CPU machine, and 0.25X is a perfect speedup on a 4-CPU machine.)
The moral of this story is that nothing is free, and CAS is certainly no exception. You should be extremely stingy with adding them to your code, and conscious of the frequency at which threads will perform them. The same is generally true of all memory access patterns when parallelism is in play, but particularly for expensive operations like CAS.
And even if you’re not using CAS’s directly in your code, you may be using them via some system service. Parallel Extensions uses them in many ways. For instance, when you’re doing a Parallel.For loop, we internally share a counter that is accessed by multiple threads. So even if your algorithm is theoretically embarrassingly parallel, the internally counter management could get in your way. We try to be intelligent by chunking up indices, but we aren’t perfect: if you have very small loop bodies the overhead of CAS could begin to impact scalability. You can work around this by making loop bodies more chunky; one example of how is by doing your own partitioning on top of our library (like executing multiple loop iterations inside the body passed to Parallel.For). Even things like allocating memory with the CLR’s workstation GC requires the occasional roundtrip to reserve a thread-local allocation context by issuing a CAS operation against a shared memory location.
 Sunday, December 28, 2008
As embarassing as it is, the errata for Concurrent Programming on Windows is non-empty.
I've posted an initial listing -- full of primarily simple typos like misplaced commas -- at http://www.bluebytesoftware.com/books/winconc/winconc_book_resources.html#Errata.
Sincere thanks to everybody who has reported errors thus far. If you find any additional ones, please email them to me directly: joe AT bluebytesoftware DOT com. We'll attempt to fix as many errors as possible in subsequent printings of the 1st edition and, if that fails, they'll make the 2nd edition.
I've spent the past few months (from September onward) travelling approximately 75% of the time. As a result, I may be slow responding to email concerning the book. I've also not finished putting together the code samples up for download; my current ETA for that is mid-January 2009. I already know there are a few more errata entries lurking within, due to some last minute typographical updates made late in the editing process. If only word processing software came complete with built-in compilers... (excuses, excuses)
In any case, I'd love to receive feedback on the book. Even if it's not about an error. Things you like, things you'd like to see improved, things you wish I'd not written about, requests for clarifications, etc. Just drop me an email. Cheers.
 Saturday, November 29, 2008
I've had an obsession with programming languages for some time now. This probably began the first time I learned of LISP. Most people I know have had a similar "Ah-Hah!" moment associated with LISP, but it was when I first truly realized the deep extent to which a programming language shapes thought -- sometimes in negative ways. LISP put it all into perspective.
Since then, the obsession has only become worse through my employment at Microsoft, where I've had the privilege to work alongside and interact with some of the greatest minds in programming languages. This is an absolute honor. I worked on a few compilers and did some language design, particularly when on the Common Language Runtime team, and my favorite project today is my work on type-system support for static enforcment of concurrency safety and guaranteed isolation. I have found great joy in applying underlying concepts in more niche (and extreme) languages like Haskell to more mainstream languages like C#. My favorite pasttime is tracing back the lineage of languages to their earliest ideas, especially when this leads to the unearthing of a subtle commonality among them. I have been designing one of my own and, while it is undoubtedly a 5-year project that may never see the light of day, I do it for the love of languages.
This book has been stewing inside me for a while now. And after seeing Guy L. Steele and Richard P. Gabriel's infinitely beautiful "50 in 50" presentation at JAOO this year, I decided it was time for it to escape.
Notation and Thought: Behind Computer Science's Most Influential Programming Languages
“That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted.” --- George Boole, Laws of Thought
Programming languages are not only a notation for expression, but also a medium of thought, akin to the duality between natural written and spoken languages. If you can think it, you can create it. The reverse is also true: if a language poses impediments to your thought process, certain solutions to problems are simply unfathomable. Languages are therefore not just what you see “on paper”--each is a unique tool that can substantially limit, or expand, the creative freedom of the programmer in whose hands it sits. Good languages get out of the way, and great ones do a whole lot more.
In the early days, there was of course nothing that resembled modern day languages. Computers had to be told what to do in excruciating detail. One only has to look at modern day assembly language to see that programming a computer in this manner constrains creativity and slows progress. Alan Turing didn’t even have that when he wrote his classic On Computable Numbers with an Application to the Entscheidungsproblem paper, but he at least managed to solve some simple problems: by moving a tape reader and reading and writing symbols, he was able to create the modern day equivalent to subroutines and even add up a number or two. But our industry would have never seen radical advances in enabling technologies, and widespread computer use, that we enjoy today without significant advances in higher-order abstractions.
Plankalkül, or the plan calculus, is widely recognized as the first real programming language. It was designed by a German computer engineer, Konrad Zuse, and first written down in an unpublished manuscript in 1943. The language offered composite (albeit simplistic) data structures, arrays, named variables, subroutines, and moderately sophisticated control flow and looping constructs. Although it was never used in practice, Plankalkül was surprisingly ahead of its time. It was a big step towards more abstract problem solving.
It should be no surprise that subsequent programming languages are as varied in their design as the humans that created them. This fact can be seen by examining the ensuing decade of computing post-Plankalkül. The 1950s saw the invention of four new major languages that fundamentally shaped the future of language design. FORTRAN, or the FORmula TRANslation language, specialized in describing transformations on data and numerics, and was the first non- assembly language to reach widespread use in performance sensitive situations. LISP, or the LISt Processing language, was developed for symbolic processing and, eventually, found a home in artificial intelligence, pioneering many techniques that are still in use today such as first class functions as data, a recursive style of programming, and garbage collection. Its principles were derived from the mathematical logics of Alonzo Church and Haskell B. Curry, notably Church’s lambda calculus from the 1930s. ALGOL, or the ALGOrightmic Language, focused on describing algorithms elegantly, kick-started the imperative family of languages (of which many popular industry programming languages like C++ and Java are members), and later set the de facto standard style for Computer Science education curricula. Its method of encoding algorithms with assignments was far closer to the von Neumann architecture than was LISP, making the resulting programs behave predictably and efficiently. Lastly, COBOL, or the COmmon Business-Oriented Language, became the first domain-specific language (DSL) that targeted non-programming business and finance experts, broadening the general accessibility of computers. Each of the four has had a crucially important role to play in the history of programming languages.
There has been no shortage of language diversity after the birth of the initial four. In fact, hundreds of languages have since come and gone, some enjoying brief or extended periods of popular use. All that have since come have been deeply influenced by the pioneers, but have also contributed a handful of innovative new ideas that help programmers more clearly think about and express solutions to real-world problems. The lineage of languages has branched off into separately named family trees--such as imperative, functional, logic, declarative and domain-specific--only to reunite intimately with each other down the line. Indeed, it really is just one big happy family.
This book traces this lineage through the most influential languages--those that have deeply impacted the way that programmers think and write--and provides insight into the motivation behind them, their major influences, and the important features that each language contributed. Throughout, it is my hope to develop within the reader a new appreciation of the art of programming computers, an understanding of the impact that language has on our thinking, and an excitement about the future of language design that lies ahead.
Joe Duffy November, 2008
 Tuesday, November 04, 2008
Type classes, kinds, and higher-order polymorphism represent some of Haskell’s most unique and important contributions to the world of programming languages. They are all related, and began life as type classes in Wadler and Blott’s 1988 paper, How to make ad-hoc polymorphism less ad hoc. Eventually, Jones introduced the (then separate) concept of constructor classes, in his 1993 paper, A system of constructor classes: overloading and implicit higher-order polymorphism. Eventually these two ideas were unified into a beautiful single set of features (namely, type constructors and kinds) in Haskell.
In this short essay, I’ll explain what these things are and why I’m sad that we don’t have them in C#.
To take the simplest motivating example, say we want to define a generic square function:
square x = x * x
Given a Hindley-Milney type system (with type inference), how should the compiler type this function? The challenge that immediately arises is that, to know the type of x and the function’s return value, we must know something about the function * being called within the body of square. But to know something about that function, we’d need to know the type of x. We’ve entered into a cycle, and have hit a wall. Clearly the type will be something generic, but polymorphic on what?
Imagine that we could infer the type of the * function as follows:
(*) :: a -> a -> b
In other words, * is a function that takes two values, both of type a, and produces some value of another type b. We know its two arguments must be of the same type because in square we pass the same value x to it twice. Given this typing for *, we could then type square similarly as:
square :: a -> b
In other words, square takes a single value of type a and produces a value of type b. The constraint on the type a here is, of course, that some function * is available that is typed as taking an a as input. There’s no obvious way to capture this in the type system, though we might conceive of something like:
square :: (* :: a -> a -> b) => a -> b
In other words, given a type a for which some function * is defined, which takes two a’s and returns a single b, the type of square thus takes an a and produces a b. You can’t say that in Haskell, although we’ll see a bit later that type classes allow similar constraints (with “=>”) to be written.
While this hypothetical typing is extremely general purpose, it would produce considerable challenges in its implementation. Standard ML throws up its hands and infers all mathematical operators (like *) as working with floats, meaning that all of the types above (both a and b) will be inferred under the type of float. (*) is of type float -> float -> float, and square is of type float -> float. Similarly, F# assumes you’re working with ints. Both Standard ML and F# have amazingly rich type inference systems, but this begins to run right up against the limits of what they can do. We’ll see some harder examples shortly.
You can probably guess that Haskell’s solution to this conundrum is to use higher order polymorphism with a feature of its type system called type classes. They allow us to classify types much in the same way types ordinarily classify objects. We can classify the set of numeric types as follows, for instance:
class Num a where
(*) :: a -> a -> a
… other numeric operations …
And then we can go ahead and provide concrete mappings for integers and floating point numbers:
instance Num Int where
(*) = addInt
…
instance Num Float where
(*) = addFloat
Each instance of the type class (in this case, Num) is a bit like a dictionary mapping the named functions (in this case, just *) to other functions that are defined for the concrete type (in this case, supplied in a’s stead). With this information defined, the Haskell compiler can now infer the type of square as:
square :: Num a => a -> a
This inference really just says that the function square is defined for all types a that are in the type class Num. The “Num a =>” part is a bit like a C# generic type constraint, in that it restricts what kinds of a’s can be supplied. Given what has been stated thus far, that’s just Int and Float. So we can only call the square function with types on which multiplication is properly defined, which is exactly what we want.
At this point, we might want to try defining a similar thing in C# using generics. (And for this simplistic example, and others like Haskell’s Eq a type class, we will succeed.) There are two basic ways we could achieve this. The first is to define an INum<T> interface (or abstract class—pick your poison), and give it an instance method to multiply the target with another number:
interface INum<T> {
T Mult(T x);
}
We would then have the basic numeric data types like Int32 and Float implement INum<T>:
struct Int32 : INum<Int32> {
public Int32 Mult(Int32 x) { return value * x; }
…
}
struct Float : INum<Float> {
public Float Mult(Float x) { return value * x; }
…
}
Given these definitions, it would be a breeze to write a Square method that only operates on INum<T>s:
T Square<T>(T x) where T : INum<T> { return x.Mult(x); }
Thankfully, we can recursively reference the T from within the generic type constraint.
Now, of course, there’s no way the C# compiler would infer the necessary INum<T> constraint. But given that we don’t have rich type inference (aside from for local variables) in C#, this doesn’t pose any new problems. Another slight annoyance is that you need to modify the source type to declare support for INum<T>, when a perfectly reasonable implementation could have been provided “from the outside,” but you’ll find that this will only occasionally get under your skin.
The second way we might go about this is to take an approach similar to .NET’s EqualityComparer<T> class, where we have an abstract base class that represents the ability to do something with instances of Ts. And then we only provide implementations on concrete Ts for which that ability makes sense. For example, we could have a Multiplier<T> that looks a lot like INum<T>:
abstract class Multiplier<T> {
public abstract T Mult(T x, T y);
}
Multiplier<T> on its own isn’t usable. But we can provide implementations for Int32 and Float:
class Int32Multiplier : Multiplier<Int32> {
public override Int32 Mult(Int32 x, Int32 y) { return x * y; }
}
class FloatMultiplier : Multiplier<Float> {
public override Float Mult(Float x, Float y) { return x * y; }
}
// And so on …
Now we can write a slightly different Square method that takes a Multiplier<T> as an extra argument:
T Square<T>(T x, Multiplier<T> m) { return m.Mult(x, x); }
Now there isn’t any kind of generic type constraint on Square’s T, but of course we can only call it if we have a concrete instance of Multiplier<T> in hand. And by definition that means there is a Mult method defined that we can call. (This isn’t wholeheartedly true. You can of course call Square<U> for any U, passing in null as the second argument. But presumably the method would check for null and throw. This is a real limitation, however, which would likely push us back in the direction of the original interface solution. If we had non-null types, we could get closer to a fully statically verifiable solution.)
Aside from a lot more typing, and the lack of rich type inference, we seem to have reached parity. The simple examples provided in the literature and Haskell’s Standard Prelude can be implemented in such a fashion. But we are kidding ourselves if we think these are the same thing.
The main problem is that C# doesn’t support higher-kinded type parameters. We haven’t yet seen a type class in Haskell that fully exploits this capability, but there are several. The simplest one I know about in the Haskell Standard Prelude is the Functor type. (Monad is also a great example, but is a bit more complicated (and sufficiently frightening) that this will be a topic for another day.) Functor’s definition is:
class Functor f where
fmap :: (a -> b) -> f a -> f b
The Functor type class offers a single function, fmap. It takes two things—a function that transforms a value of type a into a value of type b and some functor value of type f a—and returns some new functor value of type f b. This looks like an ordinary type class, except for one funny (and subtle) aspect. Functor abstracts over type f, but notice that we’re using f in fmap’s second argument and return type by actually constructing it with two other types a and b! In case you’re having a hard time thinking in Haskell, it’s as though we tried to write this in C# using our interface trick from earlier:
interface IFunctor<T> {
T<B> FMap<A, B>(Func<A, B> f, T<A> a);
}
This won’t compile. We can’t refer to T in the typing of FMap as T<B> and T<A>: it’s not expressible in C# and .NET’s type system. Let’s pretend for a moment, however, that we could. What is an example of class that might implement this? How about something that deals in terms of Nullable<T> instances?
class NullableFunctor<T> : IFunctor<Nullable<>> {
Nullable<B> FMap<A, B>(Func<A, B> f, Nullable<A> a) {
return new Nullable<B>(f(a.Value));
}
}
All you need to do is take a close look at a 1997 paper by Simon Peyton Jones, Mark Jones, and Erik Meijer, entitled Type classes: an exploration of the design space, and you will find a plethora of even more complicated (and useful) examples that use an innocent-sounding aspect of Haskell’s type system called multi-parameter type classes. All of the types are higher-order and are merely moved around and manipulated like abstract (higher-order) symbols. The type system gracefully gets out of the way and allows you to drop abstract type parameters into any holes they fit in, without mandating that you say too much. The secret sauce—as noted earlier—is kinds.
Kinds are used in the implementation of Haskell’s type system, and you won’t mention a whole lot about them anywhere. They basically categorize what kind of types can appear anywhere a type is expected. A great overview (with plenty of context) can be found in Mark P. Jones’s Functional Programming with Overloading and Higher-Order Polymorphism paper and, of course, the Haskell 98 Report.
Here’s a quick rundown. Kinds appear in one of two forms:
- the symbol * represents a concrete type (a.k.a. a monotype), and,
- if k1 and k2 are kinds, then k1 -> k2 is the kind of types that take a type of kind k1 and return a type of kind k2.
Kinds are formed in many ways: the primitive types (such as Char, Int, Float, Double, etc.) are an example of the former, and are of kind *. They “bottom out.” Type constructors, however, like Functor are an example of the latter, and are of kind * -> *. That is, they take a kind k1 (the first *) and produce another kind k2 (the second *). By giving some concrete type T (*) to Functor, we get back a Functor T (also *). The latter is therefore a bit like a function mapping one kind to another. Functions have a kind of * -> * -> *, because a function has two types: the type of arguments (the first *) and the type of its return value (the second *). These compose, so that you might have (* -> *) -> * -> *. And so on. Thinking about kinds can take a bit of getting used to.
But the really useful thing here is that kinds allow you to write higher order type constructors like those we have begun to explore above, like Functors and Monads. I.e., given a type t1 of kind k1 -> k2, and a type t2 of kind k1, then t1 t2 is a type expression of kind k2. This can be applied to the occurrences of f a and f b in Functor’s fmap function. In the type Functor f they are of kind * -> * -> *. When a concrete Functor instance is specified, e.g., by substituting T for f, this turns fmap’s T a and T b arguments to kind * -> *. That is, they still both expect another kind before bottoming out. And therefore we can substitute some concrete U and V types for a and b, to reduce them from kind * -> * to kind *.
Now we’re done. And, as if by magic, it all works.
|
|
Recent Entries:
Search:
Browse by Date:
| | Sun | Mon | Tue | Wed | Thu | Fri | Sat | | 25 | 26 | 27 | 28 | 29 | 30 | 31 | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | | 8 | 9 | 10 | 11 | 12 | 13 | 14 | | 15 | 16 | 17 | 18 | 19 | 20 | 21 | | 22 | 23 | 24 | 25 | 26 | 27 | 28 | | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Browse by Category:
Notables:
Currently Up To:
|