Sunday, October 28, 2012

A glimpse of some research we've done recently just appeared at OOPSLA last week:

Uniqueness and Reference Immutability for Safe Parallelism

A key challenge for concurrent programming is that side-effects (memory operations) in one thread can affect the behavior of another thread. In this paper, we present a type system to restrict the updates to memory to prevent these unintended side-effects. We provide a novel combination of immutable and unique (isolated) types that ensures safe parallelism (race freedom and deterministic execution). The type system includes support for polymorphism over type qualifiers, and can easily create cycles of immutable objects. Key to the system's flexibility is the ability to recover immutable or externally unique references after violating uniqueness without any explicit alias tracking. Our type system models a prototype extension to C# that is in active use by a Microsoft team. We describe their experiences building large systems with this extension. We prove the soundness of the type system by an embedding into a program logic.

The official ACM page is here, and a tech report version is available on MSR's website.

As I said, this is just a glimpse. Its focus was mainly on the type soundness work we've done jointly with MSR, and less about the language, syntax, and uses. You'll have to use your imagination to fill in the rest ;-)

10/28/2012 3:58:30 PM (Pacific Daylight Time, UTC-07:00)  #   

 Tuesday, July 17, 2012

It's been quite some time since I blogged.

The reason is simple, as always: I'm having the time of my life at work.

I will try to do better blogwise in the coming months. But in the meantime ...

If you'd like to join me, Adam, and Krzysztof in our quest to build the best APIs and developer platform known to mankind, please shoot me an email with you resume. Or just apply.

In short, you could be having the time of your life too. Why wait?

7/17/2012 5:49:08 PM (Pacific Daylight Time, UTC-07:00)  #   

 Saturday, November 12, 2011

I often wish that .NET had erred on the side of offering postmortem instead of premortem finalization.

The distinction here is when exactly the finalizer runs, i.e. after or before the GC has actually reclaimed an object. This governs whether a dying object is (a) accessible from within its own finalizer, and therefore (b) eligible to become resurrected. Postmortem finalization occurs after the object is long gone, and hence says “no” to both of these questions; premortem finalization happens beforehand and hence says “yes.”

.NET chose the latter.

The primary downside of premortem finalization, setting aside the confusing nature of resurrection, is that the object in question cannot be collected until after its finalizer has run. This should be fairly obvious: it is only that second time the object is found to be dead “again” that we know the finalizer has or has not resurrected it.

This may seem like a small matter. But it matters quite a lot when building high performance software. In a garbage collected system, relying on high rates of finalization to keep up with demanding workloads almost never works. But in a premortem finalization system, even moderate demands become cause for concern.

Premortem finalization leads to finalized objects getting promoted to the elder generations before actually dying. If you check the value of GC.GetGeneration(this) within an object’s finalizer, for example, you will notice it is one greater than the generation in which the object was found to be dead the first time. Say it was found dead in Gen1; then GC.GetGeneration(this) will return ‘2’. Yet another collection must happen, in Gen2 to boot, in order to actually reclaim this object. And, of course, it’s not just this object, but also the transitive closure of objects to which it refers.

This approach penalizes the majority use case of finalizable objects. At least on .NET, most objects merely invoke CloseHandle on an IntPtr in the finalizer. This clearly needn’t hold up freeing the managed state. And resurrection is a dubious scenario anyway: such objects quickly end up in Gen2 where collections are expensive and infrequent. If you’re pooling via resurrection because you create expensive objects at a high rate of birth and death, manual memory management (or a different design altogether) is likely your only savior.

Although Java’s finalizers are also premortem, the JDK offers the facilities necessary to implement postmortem finalization on your own. It entails using WeakReference and ReferenceQueue. See this article if you are curious.

.NET doesn’t offer the notifications required to do the same. You can, however, learn from postmortem finalization to write better premotem finalizers: prefer simple finalizable objects that refer to only the state necessary to implement finalization – which ordinarily means no other managed objects. The SafeHandle abstraction is a good example of this. Most implementations are comprised of a simple IntPtr. This pattern will ensure that collateral promotion due to finalization is more contained.

After saying all of this, I hope it is just amusing trivia. I'm sure nobody is writing finalizers these days anyway.

11/12/2011 2:03:13 PM (Pacific Standard Time, UTC-08:00)  #   

 Sunday, October 23, 2011

It's been unbelievably long since I last blogged.

The reason is simple. I've been ecstatic in my job and, every time I think to write something, I quickly end up turning to work and soon find that hours (days? months?) have passed. This is a wonderful problem to have, but not so good for keeping the blog looking fresh and new. (I've also been writing a fair bit of music lately.) Well, this weekend I managed to lock myself out of my VPN access, and decided that this was a sign that I ought to dust off the cobwebs on a blog entry or two that I've had in the works for quite some time.

The topic for today is generics, a feature many of us know and love. Specifically, their impact on software performance, something I frequently see developers struggling to understand and tame in the wild.

The blessing; and, the curse

I absolutely love generics. I can hardly imagine writing code without them these days. The code reuse, higher-order expressiveness, beautiful abstractions, and static type-safety enabled by first class parametric polymorphism are all game-changing. And being a language history wonk, I'm delighted to see many mainstream programming languages stealing a page from ML and theoretical CS generally.

Generics, however, are not free. And in some circumstances, they are, dare I say, rather expensive. Few language features surpass generics in the ability to write a concise and elegant bit of code, which then translates into reams of ugly assembly code out the rear end of the compiler. I am of course speaking mainly to models in which compilation leads to code specialization (like .NET's), versus erasure (like Java's).

Most developers coming from a C++ background understand code expansion deeply, because they program with templates. Unlike templates, however, there is ample runtime type information (RTTI) associated with generic instantiations… such that the costs associated with generics frequently – and perhaps surprisingly – are a superset of those costs normally associated with C++ templates. At the same time, because the compiler understands parametric polymorphism, it can sometimes do a better job optimizing, e.g. with techniques like code sharing.

Basically, with templates and erasure, the equation for predicting code expansion is super simple. You get it all (in the former) or you get none of it (in the latter), but with specialized generics this equation is quite complex.

Paradoxically, these same costs are the main value that generics bring to the table! Write a little type-agnostic code and then "instantiate" that same code over multiple types without repeating yourself. But, generics are not magic; did you ever stop to wonder things like: What machine code is generated for these types? Does the compiler need to specialize the actual code that runs on the processor for unique instantiations, or is it all the same? And if it does need to specialize, where, how, and why? And perhaps most importantly, what hidden costs are there, and how should I think about them while writing code?

Before reading further, paranoia need not ensue. The point of this article is merely to raise your awareness. All programmers should know what the abstractions they use cost, and make conscious tradeoffs when writing code with them. The aforementioned benefits of generics really are often "worth it," both in the elegance and reusability of abstractions, and in developer productivity. In my experience, however, the associated costs are so subtle and ill-documented that even people who write highly generic code typically remain unaware of them. Even more subtly, these costs are somewhat different in nature when pre-compiling your code, such as with .NET's NGen technology.

This brief essay will walk through a few such costs in the context of the .NET Framework and CLR's implementation of generics. This is in no way an exhaustive study of generic compilation, and your mileage will vary from one platform to the next. Although the studies presented would apply to other implementations of generics, the reality is that if you're writing code in, say, Java – where type erasure is employed rather than code specialization – then all of this is going to be less relevant to you.

With no further delay, let's get started.

Code, RTTI, oh my

When considering costs, we must always think about both size and speed.

There is at least as much assembly code created for an instantiation as the code you've written for the generic abstraction in C# or MSIL. A simple mental model – that thankfully turns out isn't entirely accurate, thanks to some sharing optimizations described below – is that for each instantiation of a generic type or method you get a new copy of that code specialized to the type in question. Obviously, this increases code size. And just as obviously, it will add some runtime cost to JIT compile the code (if you aren't using ahead of time compilation), as well as putting more pressure on I-cache and TLB.

Another source of significant cost is the runtime data structures needed for RTTI and Reflection, like vtables and other metadata. Quite simply, the runtime needs to know the identity of each generic instantiation, to prevent things like casting a List<Int16> to a List<String>, and even List<Object> to List<String>; and given that there is often distinct code generated for unique instantiations, the vtable contents for those different List<T> instantiations are going to look quite different.

And of course, there are statics. Each generic instantiation gets its own set, requiring extra storage and another level of indirection when fetching them. Unique statics means D-cache and TLB pressure. It turns out that code shared across AppDomains, like mscorlib.dll, already need such things. But I have found that it's surprisingly common for a developer to throw a static field (or nested class!) onto an outer generic type, without actually needing it to be replicated for each unique instantiation.

In addition to the immediate effects, generic types often refer to other generic types which refer to other generic types … and so on. Instantiating a root type is akin to instantiating the full transitive closure.

To make our discussion friendly and familiar, we shall use the .NET Framework's List<T> type – presumably one of the most commonly used generic types on the planet – to illustrate many of these costs. And unfortunately, you'll also see that many of the common performance pitfalls plague this type too. (So, really, you need not feel bad if your own code is guilty of them too.)

Why the distinct code, anyway?

There is only one copy of List<T>'s code in mscorlib's MSIL. It is essentially just a blueprint for the list class.

When I create a List<Int16> in my program and use it, however, there clearly needs to be some assembly code created in order to execute List<T>'s associated functionality, just with any T's used by List<T>'s code replaced by actual 2-byte short integers. And similarly, if I were to instantiate a List<String>, all those T's need to be replaced by pointer-sized object references, either 4- or 8-bytes depending on machine architecture, that are reported live to the garbage collector.

This is what leads to our simple mental model above, in which each instantiation gets its own copy of the code. In this case, both List<Int16> and List<String> would be entirely independent types at runtime, with wholly separate copies of the machine code.

Certainly if I manually went about creating my own Int16List and StringList types, they would be distinct types with distinct machine code generated. Being a prudent developer, however, I'd probably try to arrange to share as much of the implementation as possible between the two types, perhaps using implementation inheritance. But alas, there's no way I could share it all: any code specific to Int16 or String, for example, would surely differ, both in MSIL and in the native code.

Generics basically give you the ability to do this same thing, without you needing to do the factoring of type-independent and type-specific code yourself. The compiler does that for you.

Why might the code be different? As stated above, Int16 values are 2 bytes and String pointers are native word sized (4 bytes on 32-bit, 8 bytes on 64-bit). All the code that passes values of type T on the stack, either as arguments or return values, moves instances into and out of memory locations (like the T[] backing array), and so on, needs to be specialized based on the size of T. This wouldn't be true of a generics implementation that used type erasure, like Java's, but then you'd need to box the value types on the heap so that everything is a pointer. If T is a Float, we will likely emit code that uses floating point math instead of general purpose registers. Any tables that report GC roots are likely to be different, since object references can be embedded inside struct values that get laid out on the stack. And so on. Some day you might want to compare the machine code for a simple generic Echo<T> method for different kinds of T's; it is really easy to do, and is quite illustrative.

A naïve wish might go as follows. Imagine that I had written my own dedicated Int16List and StringList types, and that we diffed the resulting machine code between the distinct list types; we'd presumably find a fair bit of duplication for all the reasons stated above. It would be a nice property if, when we used the generic List<Int16> and List<String> types, and similarly diffed the resulting assembly, the amount of specialized code would be no greater than the amount of specialized code between our best hand-written Int16List and StringList types. I.e., only parts that need to be different are different.

We could go even further with our wish. Imagine I had a List<DateTime> and List<Int64>. Both are 8-byte values, and do not contain any GC references. If I were writing a specialized 8ByteValueList in C++ and had immense performance constraints, I would, again being a prudent developer, probably use some type unsafe code, with nasty reinterpret_casts, so that I could use the same list type to store any kind of 8-byte value. (Except in C++ I could even store pointers!) It would also be a nice property if generics did some of this for us, while still retaining the type safety we love about generics.

It turns out we will get neither of our wishes exactly, although we will get something close to the spirit of our wishes.

Code sharing

Indeed, the CLR does arrange to share many generic instantiations. The rule is simple, although it is subject to change in the future (being an optimization and all): instantiations over reference types are shared among all reference type instantiations for that generic type/method, whereas instantiations over value types get their own full copy of the code. In other words, List<String> and List<Object> are backed by the same code, but List<DateTime> and List<Int64> get their own.

It is true that, in theory, List<DateTime> and List<Int64> could use the same shared code, because they are of identical size and have GC roots in the same locations (trivially, because neither has one). But there are additional restrictions on generated code that makes this problematic, for example if we were talking about Double and Int64. In short, the CLR doesn't actually share value type instantiations as of the 4.0 runtime, although clearly it could in certain situations (value types of the same size with GC roots in the same locations).

As you might guess, this extends to multi-parameter generics in obvious ways. A Dictionary<Object, Object> is shared with a Dictionary<String, String>, etc., and a Dictionary<Int64, Object> is shared with a Dictionary<Int64, String>. A Dictionary<DateTime, DateTime> is not, however, shared with a Dictionary<Int64, Int64> instantiation, as per the above.

My pal Joel Pobar wrote a post eons ago describing how code sharing works in great detail, which I do not intend to rehash. Please refer to his post for an excellent overview of how code sharing works.

An important thing to remember, however, is that no matter how much code sharing happens, you still need distinct RTTI data structures. So although List<Object> and List<String> share the same machine code, they have distinct vtables; sure, each table is full of pointers to the same code functions, but you are still paying for the runtime data structures. A distinct instantiation, therefore, is never actually free!

Transitive closures

Why am I making such a big deal about code sharing, anyway?

Another surprising aspect of generics is the transitive closure problem. Particularly when doing pre-compilation of generics, each unique instantiation doesn't simply lead to a specialized version of the code associated with the type being directly instantiated. The whole transitive closure of types, starting with that root type, will also be compiled. This can be a surprisingly huge number of types! JIT is much more pay-for-play, such that you get one level of explosion at a time, but once there is code that calls a particular type's method, even if that code is lazily compiled, creation of the type is forced.

To illustrate this, let's take our friend List<T>. Before examining the list, how many generic types would you expect that a single new List<T> instantiation instantiates?

What if I told you that a single List<int> instantiation creates (at least) 28 types? And that, say, five unique instantiations of List<T> might cost you 300K of disk space and 70K of working set? Well, of course, if you are writing a script, or something with fairly loose performance requirements, this might not matter much. But if topics like download time, mobile footprint, and cache performance are important to you, then you probably want to pay attention to this. To a first approximation, size is speed.

Yes, you heard me right: 28 types. Holy smokes... How can this be?!

Nested types are one obvious answer, and indeed List<T> has two: an Enumerator class (which is reasonable), and one to support the legacy synchronized collections pattern (which we presumably wish we didn't have to pay for). The larger answer here, however, is functionality. Yes, functionality! This is a great example where the cost of generics explodes as you add more features. Start simple, keep adding stuff, as has happened to List<T> over the years, and you will soon find that a series of elegant abstractions adds up to a gut-wrenching bucket of bytes.

Here's a quick sketch of the transitive closure of generic types used by List<T>:

List<T>
	T[] type
	IList<T> type
		ICollection<T> type
			IEnumerable<T> type
				IEnumerator<T> type
	ReadOnlyCollection<T> type (AsReadOnly)
		(Nothing more than List<T>)
	IComparer<T> type (BinarySearch, Sort)
	{Array.BinarySearch<T> method (BinarySearch)}
		ArraySortHelper<T> type
			IArraySortHelper<T> type
			GenericArraySortHelper<T> type
	EqualityComparer<T> type (Contains)
		IEqualityComparer<T> type
		IEquatable<T> type
		NullableEqualityComparer<T> type
			Nullable<T> type
		EnumEqualityComparer<T> type
			{JitHelpers.UnsafeEnumCast<T> method}
		ObjectEqualityComparer<T> type
	Predicate<T> delegate type (Find*)
	Action<T> delegate type (ForEach)
	{Array.LastIndexOf<T> method (LastIndexOf)}
	Comparison<T> delegate type (Sort)
	Array.FunctorComparer<T> type (Sort)
		Comparer<T> type
		GenericComparer<T> type
		NullableComparer<T> type
		ObjectComparer<T> type
	{Array.Sort<T> method (Sort)}
		ArraySortHelper<T> type (see earlier)
	Enumerator inner type
	SynchronizedList<T> inner type
		IList<T> interface (see earlier)
		ICollection<T> interface (see earlier)
		IEnumerable<T> interface (see earlier)
	{Interlocked.CompareExchange<Object> method (SyncRoot)}
	{_emptyArray T[] static field}

I'm not trying to pick on List<T>. This class is only unique in this regard in that it offers a large transitive closure of (mostly useful!) functionality. And it's not the only guilty party. We recently shaved off 100K's of code size on my team, for example, that were being lost simply because all the LINQ methods were declared as instance methods on the base collection class, rather than being extension methods as in .NET. We found nested enumerator and iterator types, cached static lambdas as static fields, and huge transitive closures of other generic types, all allocated when you just touched any collection type. Any collections library is apt to be full of this stuff, since they are highly generic. But collection libraries are certainly not the only places to go sniffing for such problems.

As an aside, it turns out that extension methods are a great way to make generic abstractions more pay-for-play.

Adding it up

Let's see what the above adds up to. I ran some programs through NGen as a quick and dirty experiment, and inspected the on-disk sizes and also the runtime working set sizes. I ensured clrjit.dll was not loaded into the process. Here's what I found. Take these numbers with a grain of salt, as they will change from release to release; they are simply rules of thumb. When in doubt, crank up NGen, DumpBin, and/or start trawling the heap with VADump yourself!

One empty type with no methods in CLR 4.0 seems to cost roughly 0.2K bytes of on-disk metadata, and about 0.7K in x64 working set. (This is a good rule of thumb irrespective of generics… in terms of order of magnitude, you can think "one empty type means 1K of memory.") A single List<S> instantiation, where S is an empty struct, is in the neighborhood of 60K on-disk metadata, and 14K of x64 working set. A single List<C> instantiation, however, where C is an empty class, is only – surprise – about 7K on-disk and 4K in-memory. Why the large discrepancy? Well, it just so happens that mscorlib.dll already includes an instantiation or two of List<T> over reference types, so this 4K is the incremental cost on top of reusing what is there; remember, there are still unique vtables and data structures still required for RTTI.

Rico did a similar analysis a few years back, and concluded that each unique List, where E was an enum type, cost 8K. Why the increase to 14K over the years? x64 and ever-increasing functionality on the basic collections classes, presumably. Remember, it's not just List<T> that has grown, it's also everything that List<T> uses internally as an implementation detail.

Dynamic specialization with dictionaries

Some specialization in behavior can be accomplished with dynamic runtime behavior, rather than static code specialization. A prime example is the following:

class C
{
    public static void M<T>()
    {
        System.Console.WriteLine(typeof(T).Name);
    }
}

Where does the program get the value of typeof(T) from? If you look at the MSIL, you will see that C# has emitted a ldtoken MSIL instruction. For some struct type, we can compile that as a constant in the code, because it is getting its own copy of the code. What occurs when two instantiations share code, like M<String> and M<Object>, however? As you might guess, there is an indirection.

The thing we usually use for such indirections – the vtable – is nowhere to be found in this particular example, because M is a static method. To deal with this, the compiler inserts an extra "hidden" argument, frequently called a generic dictionary, from which the emitted assembly code can fetch the type token. The cost here typically isn't bad, because many of the operations that pull in the dictionary are already RTTI or Reflection-based, and would require an indirection already (e.g., through a vtable).

The operations which require a dictionary of some kind include anything that has to do with RTTI and yet no vtable is readily accessible: typeof, casts, is and as operators, etc. And as you might guess, if instantiations aren't shared (such as with value types on the CLR), no dictionary is needed, because the code is fully specialized. There are also multiple kinds of dictionaries used by the runtime, depending on whether you are using a generic type, method, or some combination of both.

JITting when you didn't mean to

There are two primary ways in which you will JIT compile when using generics, even if you were good doobie and used NGen to reduce startup time.

One way is if you instantiate a new generic type exported from mscorlib.dll with a type argument also defined in mscorlib.dll, that wasn't already instantiated inside mscorlib.dll. (See my old Generics and Performance blog entry for more details.) You can very easily see this happening by using an instantiation like Dictionary<DateTime, DateTime>, and watching the clrjit.dll module getting loaded.

The other way is generic virtual methods (GVMs). It turns out that GVMs pose incredible difficulty for ahead of time separate compilation, because the compiler cannot know statically which slot in the vtable points at the particular implementation you are about to call. (Unless you use whole program compilation, something not offered by .NET at present time.) For each such method, there's an unbounded set of possible specialized instantiations a slot might point to, and so the vtable cannot be laid out in a traditional manner. C++ doesn't allow templated virtual methods for this very reason.

Thankfully, GVMs are somewhat rare. However, it only took 5 minutes of poking around to find one that is quite front-and-center in .NET: in the implementation of LINQ, there is an Iterator<T> type that has a method declared as follows:

public abstract IEnumerable<TResult> Select<TResult>(Func<TSource, TResult> selector);

All we need to do is figure out how to tickle that method, and we're guaranteed to JIT. As it turns out, sure enough, the following code does the trick and forces clrjit.dll to get loaded in .NET 4.0:

int[] xs = …;
int[] ys = xs.Where(x => true).Select(x => x).ToArray();

The Iterator<T> type is used for back-to-back Where and Select operators, as a performance optimization that avoids excess allocations and interface dispatch. But because it depends on a GVM, it does incur an initial penalty for using it, even if you have used NGen to avoid runtime code generation.

In conclusion

The moral of the story here is not that you should fear generics. Beautiful things can be built with them.

Instead, it's to use generics thoughtfully. Nothing in life is free, and generics are no exception to this rule. If code size is important to you, then you will want to have performance gates measuring your numbers against your goals; if you are working in a codebase that uses generics heavily, and you end up spending any significant time on code size optimizations, you will want to try to track down large transitive closures. As I stated above, you could really be throwing away 100K's of code here.

And as to the surprise JITting, I've seen teams compiling with NGen and having a functional gate that fails any new code that causes clrjit.dll to get loaded at runtime. Although tracking down the root cause might be tricky when that gate fails, at least you won't let the camel's nose under the tent.

Investing in tools here is a very good idea.

When it comes down to it, really thinking about what code must be executed by the process is helpful. Step back and imagine you were writing this all in C++, with the associated performance concerns front-and-center: consider how you'd arrange to reuse as much implementation as you can, manage memory efficiently, perhaps employ unsafe tricks that would have violated type safety and so are offlimits in .NET, and all that jazz. Then step back and be grateful that you have a type- and memory-safe environment to help you write more robust code, but also be realistic about what you are paying in exchange.

I hope you've learned a useful thing or two in this article. If you'd like to learn more, here are a few other good resources:

  1. An MSR paper on the original implementation of .NET generics: http://research.microsoft.com/pubs/64031/designandimplementationofgenerics.pdf
  2. Rico's "Six Questions About Generics and Performance" blog entry: http://blogs.msdn.com/b/ricom/archive/2004/09/13/229025.aspx
  3. Joel Pobar's "Generics and Code Sharing" blog post: http://blogs.msdn.com/b/joelpob/archive/2004/11/17/259224.aspx
  4. My "Generics and Performance" blog entry: http://www.bluebytesoftware.com/blog/2005/03/23/DGUpdateGenericsAndPerformance.aspx

Cheers.

10/23/2011 3:26:02 PM (Pacific Daylight Time, UTC-07:00)  #   

 Wednesday, June 01, 2011

InfoQ recently asked me a few questions about concurrency and programming languages, and here is what I had to say:

http://www.infoq.com/articles/Interview-Joe-Duffy

A little teaser:

"The major shift we face will be that mainstream languages will start to incorporate more concurrency-safety -- immutability and isolation -- and the platform libraries and architectures will better support this style of software decomposition. OOP developers are accustomed to partitioning their program into classes and objects; now they will need to become accustomed to partitioning their program into asynchronous Actors that can run concurrently. Within this sea of asynchrony will lay ordinary imperative code, frequently augmented with fine-grained task and data parallelism."

As an aside, I know I've been super quiet lately.  I never thought I'd go months without blogging.  My sincere apologies for this; work has been too all-consuming / fun, and I've been unable to carve out much time for anything else.  (Speaking of which, we are still hiring: email me at joedu at you-know-where dot com if you are interested.)  I'm about to head out to Europe for a few weeks, where hopefully I'll have a bit of time to write up something exciting to share.  Cheers.

6/1/2011 10:57:44 AM (Pacific Daylight Time, UTC-07:00)  #   

 Saturday, December 04, 2010

After spending more time than I’d like to admit over the years researching memory model literature (particularly Java’s terrific JMM and related publications), subsequently trying to “fix” the CLR memory model, reviewing proposals for new C++ memory models, and beating my head against the wall for months developing a new memory model that supports weakly ordered processors like the kind you’ll find on ARM in a developer-friendly yet power-efficient way, I have a conclusion to share.

Volatile is evil

Why? Let me recount the reasons:

  1. It doesn’t mean what you think.
  2. It used to have a very specific purpose — to enure memory operations with external side-effects did not get reordered — and has since gotten bastardized and used for many secondarily-related purposes.
  3. Even if you think it does mean what you think, the annotation scheme is all wrong. Volatile annotates a storage location, and yet what really matters is what happens when accessing said storage locations. The fences occur when you access the variable, not when you declare it. And yet from a readability perspective, they are completely invisible and easy to miss.
  4. And even if you don’t care about readability, the meaning of volatile changes wildly when you switch platforms. Today it’s store / release, tomorrow it’s write / read fences. Perhaps it’s even sequentially consistent. And the label of “store / release” could actually be a white lie, as with the CLR’s memory model thanks to store buffer forwarding and the lack of fences in the CLR JIT’s x64 stores.
  5. Performance, man, performance! Sure sequential consistency as the default sounds nice on the tin, but once you’re running that mobile app on ARM, and sucking up 160 cycles for each write you perform, you’re going to curse volatile like the plague.

And so the moral of the story follows...

Attempting to “fix” volatile is a waste of time

Instead, a new world order has arrived. We must take a two-pronged approach to solving instruction-level interleaving bugs, neither prong having much to do with the traditional definition of memory models or volatiles. We must:

  1. Eliminate memory ordering from 99% of developers’ purviews. This is already the case with single-threaded programs, because code motion in compilers and processors is limited to what only affects concurrent observations. So the answer is pretty clear: developers must move towards single-threaded programming models connected through message passing, optionally with provably race-free fine-grained parallelism inside of those single-threaded worlds.
  2. Leave the memory model esoterica to the Einsteins, and radically change its meaning. Data dependence and transitive visibility of memory operations are in. Volatiles on storage locations are out. Instead, we must throw fences into programmer’s faces, and force them to understand each and every one that occurs. And moreover, force them to decide about each and every one that occurs. Specifically, hidden fences thanks to volatile are no longer. Those who cannot take it should fall into the 99% bucket already mentioned above, versus the 1% bucket.

Let’s set #1 aside for now, since it’s obviously a huge can of worms.

But what about #2? It is quite encouraging that the C++0x group is firmly on the path of #2. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2145.html for more details. In a nutshell, each location that you’d have ordinarily tagged volatile instead becomes a template atomic type. And then each read / write has the opportunity to specify the desired kind of fence, whether that is acquire (for reads), release (for writes), fully ordered, or relaxed (meaning no fence).

I do think it’d be worth them considering compiler-only fences too. So that relaxed means fully-relaxed, and there is a fence in-between that merely prevents the compiler from optimizing the memory operation. This pays homage to volatile’s legacy in C as merely a variable that mustn’t be subject to optimizations, because operations against these variables pertain to, say, memory-mapped device I/O.

Another nitpick of mine is that I’d have required each access to specify the fence, whereas C++0x implicitly uses full fence if left unspecified at the callsite. It’s a minor convenience, but I like always having the fence spelled out very explicitly in the code. Lock-free access to shared variables is sufficiently dangerous that automatic sequential consistency is the least of your problems.

Nevertheless, the C++0x direction is a massive good step forward, and these are just minor details.

My hope is that .NET follows suit. And the timing couldn’t be more apropos as “now”: we are moving forward in a heavily mobile, distributed-system-on-a-chip, and heterogeneous world, where processor memory models will necessarily continue to weaken. The overly strong x86 memory model, kept alive primarily to ensure compatibility, has simply grown too expensive to accommodate. The power benefits and architectural simplifications are hard to argue with, and because compatibility becomes less of an issue as new platforms arise (e.g. for mobile), the world moves to the cloud, and hence there is legacy to worry about, I do hope that processor vendors seize the opportunity. ARM certainly has. It is less about out-of-order execution as it is about coherency costs. Truthfully, I’d be disappointed if anything else happened, even though the risk to compatibility for shrink-wrapped software scares the hell out of me. But this is most certainly the right way to go, long-term. As software platforms move in the direction of #1 – as I also foresee – the need for fences dwindles. The cost of supporting the current .NET memory model is too great and will become a liability with time.

Thankfully, it is quite simple to build a veneer atop .NET that works a lot more like atomic. For example, imagine that we had a new System.Threading.Volatile static class, and that it offered the moral equivalent to atomic inner types for each atomic primitive we can synchronize against:

namespace System.Threading
{
    public static class Volatile
    {
        public struct Int32 {..}
        public struct Int64 {..}
        …
        public struct Reference where T : class {..}
        …
    }
}

Now instead of tagging a location as ‘volatile’, you would use one of these primitives. For example, rather than:

static volatile MySingleton s_instance;

You would say:

static Volatile.Reference<MySingleton> s_instance;

Each class has a similar set of operations. For example:

namespace System.Threading
{
    public static class Volatile
    {
        public struct Int32
        {
            public Int32(int value);

            public int ReadAcquireFence();
            public int ReadFullFence();
            public int ReadCompilerOnlyFence();
            public int ReadUnfenced();

            public void WriteReleaseFence(int newValue);
            public void WriteFullFence(int newValue);
            public void WriteCompilerOnlyFence(int newValue);
            public void WriteUnfenced(int newValue);

            public int AtomicCompareExchange(int newValue, int comparand);
            public int AtomicExchange(int newValue);
            public int AtomicAdd(int delta);
            public int AtomicIncrement();
            public int AtomicDecrement();

            // Etc…, bitwise ops, other math ops, etc.
        }
    }
}

Of course, only the integer types would offer the increment, decrement, add, and related operators. And it turns out that offering different kinds of fences on the Atomic* operations would be incredibly useful too, because processors like ARM do not couple the fence to the compare-and-swap / load-locked-store-conditional as x86 processors do. Taking advantage of this can be huge if you are writing performance critical code, like a concurrent garbage collector whose atomic swaps need not imply ordering with the surrounding instruction stream. You can quibble over the details, like whether these should use enums instead of the name to encode the fence-kind. I did it this way to keep the implementations branch-free, although with a decent inlining JIT compiler, it’d probably optimize those away thanks to constant propagation.

It’s quite trivial to implement these APIs atop existing .NET primitives. I built a little library that does so, but it was so boring and repetitive I decided not to post it alongside this blog entry as originally intended.

With the above definition, we can very clearly see the fences involved in doing, say, double-checked locking:

static Volatile.Reference<MySingleton> s_instance;

public static MySingleton Instance
{
    get
    {
        MySingleton instance = s_instance.ReadAcquireFence();
        if (instance == null) {
            instance = new MySingleton();
            instance = s_instance.AtomicCompareExchangeRelease(instance, null);
        }
        return instance;
    }
}

We see there are two fences. One is an acquire and, depending on what your memory model says about data dependence, is probably unnecessary. Most sane memory models guarantee that data dependent loads do not pass. So we needn’t worry that we’ll see a non-null s_instance whose contents haven’t been initialized. (If we were talking structs, it’d be another story.) Nevertheless, it’s definitely required that we use a release-only fence for the publication of the object. This guarantees writes to fields within the MySingleton constructor have completed prior to the write of the new object to the shared instance field. The point here is that you are forced to think about the fences, and you actually see them.

Of course, most platforms need to provide the bare minimum of fencing to assure type safety, particularly for languages like C#. My understanding is that C++0x has decided, at least for now, not to offer type-safety in the face of multithreading. That means you might publish an object and, if stores occur out-of-order, the reader could see an object partially initialized with an invalid vtable pointer. In C# and Java, the language designers have thankfully decided to shield programmers from this. The need for fences also extends to unsafe code like strings, where – were it possible for a thread to read the non-zero length before the char* pointer was valid – writes to random memory could occur and hence threaten type-safety. Thankfully, again, C# and Java protect developers from this, mostly due to the automatic zero’ing of memory as the GC allocates and hands out new segments.

There are costs to offering this type safety assurance. So you can understand why the C++ designers want to keep fences out of object allocation. If you have #1 above, however, the costs are dramatically lower and more acceptable. But the world is – unfortunately – still a freethreaded one, and we have several years to go before we’ve reached the final destination. As a step forward, however, the death of volatile is a welcomed one. Say it with me.

“Sayonara volatile.”

Here’s hoping that .NET 5.0 takes this step forward too.

12/4/2010 3:16:34 PM (Pacific Standard Time, UTC-08:00)  #   

 Sunday, October 31, 2010

I rambled on and on a few weeks back about how much performance matters. Although I got a lot of contrary feedback, most of it had to do with my deliberately controversial title than the content of the article. I had intended to post a redux, but something more concise is on my mind lately.

GC-based memory management is a boon to productivity, not to mention program safety. Few would argue with this. However, the most effective developers know how their particular GC works, and optimize their program’s data structure, allocation, and lifetime behavior to suit their particular GC best.

This is dangerous, but a pragmatic fact of life. It is dangerous because who’s to say that the runtime team doesn’t intend to entirely revamp the GC’s collection strategy next release, at which point your thoughtfulness may actually harm you? It’s a pragmatic fact for a few reasons: it’s probably not likely that the behavior of your favorite GC is going to change too fundamentally over time; if it did, you’d need to rethink things anyhow; oh, and when was that next release anyway (2 or more years out); and finally, what do you care about more, the theoretical loose coupling, or real results today?

One of the worst data structures for traversal is a linked list. That’s because its contents are fetched by pointer-chasing, an act that usually destroys locality, unless the data associated with each pointer was carefully constructed to live next to its previous and next pointer’s data. This seldom happens, because the main point of a linked list is to free you from such constraints.

One of the best data structures for traversal, on the other hand, is an array. Adjacent elements are truly adjacent to one another in memory, meaning that as you fetch the i’th element from memory, you’re probably pulling in the i’th, i+1’th, …, and so on, thanks to spatial locality. Of course, if the elements are just pointers, then you're back to the chasing game; as with anything, it depends.

How many elements you prefetch of course depends on the size of the elements with respect to your processor’s cache line. If you’re working with 8-byte elements, and 128-byte cache lines, then you may pay 100 cycles for the first fetch, and then amortize that cost over the subsequent 15 found cheaply in cache for 10 cycles. The result is about 250 cycles total; compare this to a linked list, where you’ll probably spend 1,600 cycles, or more than 6X the cost. And of course you’re trashing other data in the cache in the process. As you traverse more and more of the list, the numbers snowball, and the amortization of locality, or lack thereof, provides a stark contrast.

There’s another subtle reason why this is important. Stop and think about what happens when a GC occurs.

Yep, that’s right. Your data structures need to be traversed during a GC, after all, to ascertain the liveness of any pointers held within. That scan looks a whole heck of a lot like the same traversal I just described, and enjoys the same locality properties. So we can immediately conclude that data structures whose traversal is efficient will translate into less time spent in the GC chasing pointers, and better cache efficiency.

For programs that are sensitive to long pause times, this is huge. I talk to customers all the time whose programs are sensitive to microsecond-long GC delays, and – aside from ensuring good GC lifetime practices, like ensuring all objects either die young or live long – being conscious about locality can be immensely important. Especially for any long-lived, large data structures that will be subject to Gen2 collections throughout their lifetime.

There is another useful trick to know. If a data structure contains no pointers, the GC will not have to trace these pointers. Obvious, right? A linked list inherently contains pointers, so this trick really doesn’t apply: the GC will need to traverse the whole live portions of the object graph. What’s interesting is that an array, on the other hand, may or may not contain elements that contain pointers. For example, an array of ints clearly has no pointers, whereas an array of string references clearly does. This doesn’t just apply to the primitive types, but also custom structs which may or may not contain references. When the GC encounters such an array, its contents need not be traversed: instead, the array is alive, and that's that. Yet another opportunity to eliminate pointer chasing. Not only does this save the GC from doing some heavy lifting, but the pointer-free structs eliminate the need for the GC write barriers on array stores too.

So think of all this next time you’re confronted with the decision to employ a tree, graph, or linked list, and whether a dense, and perhaps pointer-free, representation could be beneficial. Even if it means you must replace pointers with index calculations. The locality benefits may not matter, but then again, they may. And at least you can knowingly make a balanced tradeoff, with these potential advantages in mind.

10/31/2010 5:41:41 PM (Pacific Daylight Time, UTC-07:00)  #   

 Saturday, September 18, 2010

I have several positions open on my team here at Microsoft.

My team's responsibility spans multiple aspects of a research operating system’s programming model. The three main areas are concurrency, languages, and frameworks. When I say concurrency, I mean things like asynchrony and message passing, data and task parallelism, distributed parallelism, runtime scheduling and resource management, and heterogeneity and GPGPU. When I say languages, I mean type systems, mostly-functional programming, verified safe concurrency, and both front- and back-end compilation. And when I say frameworks, I mean virtually anything you could imagine wanting out of a platform framework: all things XML, data interoperability (database, web services, etc.), collections, transactions, multimaster synchronization, and even low level things, like regex, numerics, and globalization.

Our team is 100% developers, and we have an “everybody codes, everybody loves to code” culture. Even managers are expected to spend a significant amount of time prolifically writing code.

All of these components are new and built from the ground up. So self-drive and an ability to have a vision and make it happen are incredibly important.

We are always happy to hire great, hard-working people, regardless of years of experience. If you’re extremely strong in one or more of the abovementioned areas, more of a generalist, are an amazing coder, or all of the above, you’d fit in perfectly. This is the most amazing team of people I’ve ever worked with. If you are interested, please email your resume to me at joedu AT microsoft DOT com.

9/18/2010 11:42:27 AM (Pacific Daylight Time, UTC-07:00)  #   

 Monday, September 06, 2010

I can’t tell you how many times I’ve heard the age-old adage echoed inappropriately and out of context:

"We should forget about small efficiencies, say about 97% of the time; premature optimization is the root of all evil"
            -- Donald E. Knuth, Structured Programming with go to Statements

I have heard the "premature optimization is the root of all evil" statement used by programmers of varying experience at every stage of the software lifecycle, to defend all sorts of choices, ranging from poor architectures, to gratuitous memory allocations, to inappropriate choices of data structures and algorithms, to complete disregard for variable latency in latency-sensitive situations, among others.

Mostly this quip is used defend sloppy decision-making, or to justify the indefinite deferral of decision-making. In other words, laziness. It is safe to say that the very mention of this oft-misquoted phrase causes an immediate visceral reaction to commence within me... and it’s not a pleasant one.

In this short article, we’ll look at some important principles that are counter to what many people erroneously believe this statement to be saying. To save you time and suspense, I will summarize the main conclusions: I do not advocate contorting oneself in order to achieve a perceived minor performance gain. Even the best performance architects, when following their intuition, are wrong 9 times out of 10 about what matters. (Or maybe 97 times out of 100, based on Knuth’s numbers.) What I do advocate is thoughtful and intentional performance tradeoffs being made as every line of code is written. Always understand the order of magnitude that matters, why it matters, and where it matters. And measure regularly! I am a big believer in statistics, so if a programmer sitting in his or her office writing code thinks just a little bit more about the performance implications of every line of code that is written, he or she will save an entire team that time and then some down the road. Given the choice between two ways of writing a line of code, both with similar readability, writability, and maintainability properties, and yet interestingly different performance profiles, don’t be a bozo: choose the performant approach. Eschew redundant work, and poorly written code. And lastly, avoid gratuitously abstract, generalized, and allocation-heavy code, when slimmer, more precise code will do the trick.

Follow these suggestions and your code will just about always win in both maintainability and performance.

Understand the order of magnitude that matters

First and foremost, you really ought to understand what order of magnitude matters for each line of code you write.

In other words, you need to have a budget; what can you afford, and where can you afford it? The answer here changes dramatically depending on whether you’re writing a device driver, reusable framework library, UI control, highly-connected network application, installation script, etc. No single answer fits all.

I am personally used to writing code where 100 CPU cycles matters. So invoking a function that acquires a lock by way of a shared-memory interlocked instruction that may take 100 cycles is something I am apt to think hard about; even more worrisome is if that acquisition could block waiting for 100,000 cycles. Indeed this situation could become disastrous under load. As you can tell, I write a lot of systems code. If you’re working on a network-intensive application, on the other hand, most of the code you write is going to be impervious to 100 cycle blips, and more sensitive to efficient network utilization, scalability, and end-to-end performance. And if you’re writing a little one-time script, or some testing or debugging program, you may get away with ignoring performance altogether, even multi-million cycle network round-trips.

To be successful at this, you’ll need to know what things cost. If you don’t know what things cost, you’re just flailing in the dark, hoping to get lucky. This includes rule of thumb order of magnitudes for primitive operations – e.g. reading / writing a register (nanoseconds, single-digit cycles), a cache hit (nanoseconds, tens of cycles), a cache miss to main memory (nanoseconds, hundreds of cycles), a disk access including page faults (micro- or milliseconds, millions of cycles), and a network roundtrip (milliseconds or seconds, many millions of cycles) – in addition to peering beneath opaque abstractions provided by other programmers, to understand their best, average, and worst case performance.

Clearly the concerns and situations you must work to avoid change quite substantially depending on the class of code you are writing, and whether the main function of your program is delivering a user experience (where usability reigns supreme), delivering server-side throughput, etc. Thinking this through is crucial, because it helps avoid true "premature optimization" traps where a programmer ends up writing complicated and convoluted code to save 10 cycles, when he or she really needs to be thinking about architecting the interaction with the network more thoughtfully to asynchronously overlap round-trips. Understanding how performance impacts the main function of your program drives all else.

Pay attention to interoperability between layers of separately authored software that is composed together. The most common cause of hangs is that an API didn’t specify the expected performance, and so a UI programmer ended up using it in an innocuous but inappropriate way, because they couldn’t afford the range of order of magnitude cost that the API’s performance was expected to fall within. Hangs aren’t the only manifestation; O(N^2), or worse, performance can also result, if, for example, a caller didn’t realize the function called was going to enumerate a list in order to generate its results.

It is also important to think about worst case situations. What happens if that lock is held for longer than expected, because the system is under load and the scheduler is overloaded? And what if the owning thread was preempted while holding the lock, and now will not get to run again for quite some time? What happens if the network is saturated because a big news event is underway, or worse, the phone network is intermittently cutting out, the network cable has been unplugged, etc.? What about the case where, because a user has launched far too many applications at once, your memory-intensive operation that usually enjoys nice warmth and locality suddenly begins waiting for the disk on the majority of its memory accesses, due to demand paging? These things happen all the time.

In each of these situations, you can end up paying many more orders of magnitude in cost than you expected under ordinary circumstances. The lock acquisition that usually took 100 CPU cycles now takes several million cycles (as long as a network roundtrip), and the network operation that is usually measured in milliseconds is now measured in tens of seconds, as the software painfully waits for the operation to time out. And your "non-blocking" memory-intensive algorithm on the UI thread just caused a hang, because it’s paging like crazy.

You’ve experienced these problems as a user of modern software, I am sure, and it isn’t fun. An hourglass, spinning donut, unresponsive button clicks, "(Not Responsive)" title bars, and bleachy white screens. An important measurement of a programmer’s worth is how good the code they write operates under the extreme and rare circumstances. Because, truth be told, when you have a large user-base, these circumstances aren’t that rare after all. This is more of a "performance in the large" thing, but it turns out that the end result is delivered as a result of many "performance in the small" decisions adding up. A developer wrote code meant to be used in a particular way, but decided what order of magnitude was reasonable based on best case, … and gave no thought to the worst case.

Using the right data structure for the job

This is motherhood and apple pie, Computer Science 101, … bad clichés abound. And yet so many programmers get this wrong, because they simply don’t give it enough thought.

One of my favorite books on the topic ("Programming Pearls") has this to say about them:

"Most programmers have seen them, and most good programmers realize they’ve written at least one.
            They are huge, messy, ugly programs that should have been short, clean, beautiful programs."

I’ll add one adjective to the "short, clean, beautiful" list: fast.

Data structures drive storage and access behavior, both strongly affecting the size and speed of algorithms and components that make use of them. Worst case really does matter. This too is a case where the right choice will boost not only performance but also the cleanliness of the program.

I’m actually not going to spend too much time on this; when I said this is CS101, I meant it. However, it is crucial to be intentional and smart in this choice. Validate assumptions, and measure.

Ironically, in my experience, many senior programmers can make frighteningly bad data structure choices, often because they are more apt to choose a sophisticated and yet woefully inappropriate one. They may choose a linked list, for example, because they want zero-allocation element linking via an embedded next pointer. And yet they then end up with many lists traversals throughout the program, where a dense array representation would have been well worth the extra allocation. The naïve programmer would have happily new’d up a List<T>, and avoided some common pitfalls; yet, here the senior guy is working as hard as humanly possible to avoid a single extra allocation. They over-optimized in one dimension, and ignored many others that mattered more.

This same class of programmer may choose a very complicated lock-free data structure for sharing elements between threads, incurring many more object allocations (and thus increased GC pressure), and a large number of expensive interlocked operations scattered throughout the code. The sexy lure of lock-freedom tricked them into making a bad choice. Perhaps they didn’t quite understand that locks and lock-free data structures share many costs in common. Or perhaps they just hoped to get lucky and squeeze out out-of-this-world scalability thanks to lock-freedom, without actually considering the access patterns necessary to lead to such scalability and whether their program employed them.

These are often held up as examples of "premature optimization", but I hold them up as examples of "careless optimization". The double kicker here is that the time spent building the more complicated solution would have been better spent carefully thinking and measuring, and ultimately deciding not to be overly clever in the first place. This most often plagues the mid-range programmer, who is just smart enough to know about a vast array of techniques, but not yet mature enough to know when not to employ them.

A different, better-performing approach

It’s an all-too-common occurrence. I’ll give code review feedback, asking "Why didn’t you take approach B? It seems to be just as clear, and yet obviously has superior performance." Again, this is in a circumstance where I believe the difference matters, given the order of magnitude that matters for the code in question. And I’ll get a response, "Premature optimization is the root of all evil." At which point I want to smack this certain someone upside the head, because it’s such a lame answer.

The real answer is that the programmer didn’t stop to carefully consider alternatives before coding up solution A. (To be fair, sometimes good solutions evade the best of us.) The reality is that the alternative approach should have been taken; it may be true that it’s "too late" because the implications of the original decision were nontrivial and perhaps far-reaching, but that is too often an unfortunate consequence of not taking the care and thought to do it right in the first place.

These kinds of "peanut butter" problems add up in a hard to identify way. Your performance profiler may not obviously point out the effect of such a bad choice so that it’s staring you in your face. Rather than making one routine 1000% slower, you may have made your entire program 3% slower. Make enough of these sorts of decisions, and you will have dug yourself a hole deep enough to take a considerable percentage of the original development time just digging out. I don’t know about you, but I prefer to clean my house incrementally and regularly rather than letting garbage pile up to the ceilings, with flies buzzing around, before taking out the trash. Simply put, all great software programmers I know are proactive in writing clear, clean, and smart code. They pour love into the code they write.

In this day and age, where mobility and therefore power is king, instructions matter. My boss is fond of saying "the most performant instruction is the one you didn’t have to execute." And it’s true. The best way to save battery power on mobile phones is to execute less code to get the same job done.

To take an example of a technology that I am quite supportive of, but that makes writing inefficient code very easy, let’s look at LINQ-to-Objects. Quick, how many inefficiencies are introduced by this code?

int[] Scale(int[] inputs, int lo, int hi, int c) {
    var results = from x in inputs
                  where (x >= lo) && (x <= hi)
                  select (x * c);
    return results.ToArray();
}

It’s hard to account for them all.

There are two delegate object allocations, one for the call to Enumerable.Where and the other for the call to Enumerable.Select. These delegates point to potentially two distinct closure objects, each of which has captured enclosing variables. These closure objects are instances of new classes, which occupy nontrivial space in both the binary and at runtime. (And of course, the arguments are now stored in two places, must be copied to the closure objects, and then we must incur extra indirections each time we access them.) In all likelihood, the Where and Select operators are going to allocate new IEnumerable and new IEnumerator objects. For each element in the input, the Where operator will make two interface method calls, one to IEnumerator.MoveNext and the other to IEnumerator.get_Current. It will then make a delegate call, which is slightly more expensive than a virtual method call on the CLR. For each element for which the Where delegate returns ‘true’, the Select operator will have likewise made two interface method calls, in addition to another delegate invocation. Oh, and the implementations of these likely use C# iterators, which produce relatively fat code, and are implemented as state machines which will incur more overhead (switch statements, state variables, etc.) than a hand-written implementation.

Wow. And we aren’t even done yet. The ToArray method doesn’t know the size of the output, so it must make many allocations. It’ll start with 4 elements, and then keep doubling and copying elements as necessary. And we may end up with excess storage. If we end up with 33,000 elements, for example, we will waste about 128KB of dynamic storage (32,000 X 4-byte ints).

A programmer may have written this routine this way because he or she has recently discovered LINQ, or has heard of the benefits of writing code declaratively. And/or he or she may have decided to introduce a more general purpose implementation of a Scale API versus doing something specific to the use in the particular program that Scale will be immediately used in. This is a great example of why premature generalization is often at odds with writing efficient code.

Imagine an alternative universe, on the other hand, where Scale will only get used once and therefore we can take advantage of certain properties of its usage. Namely, perhaps the input array need not be preserved, and instead we can update the elements matching the criteria in place:

void ScaleInPlace(int[] inputs, int lo, int hi, int c) {
    for (int i = 0; i < inputs.Length; i++) {
        if ((inputs[i] >= lo) & (inputs[i] <= hi)) {
            inputs[i] *= c;
        }
    }
}

A quick-and-dirty benchmark shows this to be an order of magnitude faster. Again, is it an order of magnitude that you care about? Perhaps not. See my earlier thoughts on that particular topic. But if you care about costs in the 100s or 1000s of cycles range, you probably want to pay heed.

Now, I’m not trying to take potshots at LINQ. It was just an example. In fact, I spent 3 years running a team that delivered PLINQ, a parallel execution engine for LINQ-to-Objects. LINQ is great where you can afford it, and/or where the alternatives do not offer ridiculously better performance. For example, if you can’t do in-place updates, functionally producing new data is going to require allocations no matter which way you slice it. But having watched people using PLINQ, I have witnessed numerous occasions where an inordinately expensive query was made 8-times faster by parallelizing… where the trivial refactoring into a slimmed down algorithm with proper data structures would have speed the code up by 100-fold. Parallelizing a piggy piece of code to make it faster merely uses more of the machine to get the same job done, and will negatively impact power, resource management, and utilization.

Another view is that writing code in this declarative manner is better, because it’ll just get faster as the compiler and runtimes enjoy new optimizations. This sounds nice, and seems like taking a high road of some sort. But what usually matters is today: how does the code perform using the latest and greatest technology available today. And if you scratch underneath the surface, it turns out that most of these optimizations are what I call "science fiction" and unlikely to happen. If you write redundant asserts at twenty adjacent layers of your program, well, you’re probably going to pay for them. If you allocate objects like they are cheap apples growing on trees, you’re going to pay for them. True, optimizations might make things faster over time, but usually not in the way you expect and usually not by orders of magnitude unless you are lucky.

A colleague of mine used to call C a WYWIWYG language—"what you write is what you get"—wherein each line of code roughly mapped one-to-one, in a self-evident way, with a corresponding handful of assembly instructions. This is a stark contrast to C#, wherein a single line of code can allocate many objects and have an impact to the surrounding code by introducing numerous silent indirections. For this reason alone, understanding what things cost and paying attention to them is admittedly more difficult – and arguably more important – in C# than it was back in the good ole’ C days. ILDASM is your friend … as is the disassembler. Yes, good systems programmers regularly look at the assembly code generated by the .NET JIT. Don’t assume it generates the code you think it does.

Gratuitous memory allocations

I love C#. I really do. I was reading my "Secure Coding in C and C++" book for fun this weekend, and it reminded me how many of those security vulnerabilities are eliminated by construction thanks to type- and memory-safety.

But the one thing I don’t love is how easy and invisible it makes heap allocations.

The very fact that C++ memory management is painful means most C++ programmers are overly-conscious about allocation and footprint. Having to opt-in to using pointers means developers are conscious about indirections, rather than having them everywhere by default. These are qualitative, hard-to-backup statements, but in my experience they are quite true. It’s also cultural.

Because they are so easy, it’s doubly important to be on the lookout for allocations in C#. Each one adds to a hard-to-quantify debt that must be repaid later on when a GC subsequently scans the heap looking for garbage. An API may appear cheap to invoke, but it may have allocated an inordinate amount of memory whose cost is only paid for down the line. This is certainly not "paying it forward."

It’s never been so easy to read a GB worth of data into memory and then haphazardly leave it hanging around for the GC to take care of at some indeterminate point in the future as it is in .NET. Or an entire list of said data. Too many times a .NET program holds onto GBs of working set, when a more memory conscientious approach would have been to devise an incremental loading strategy, employ a denser representation of this data, or some combination of the two. But, hey, memory is plentiful and cheap! And in the worst case, paging to disk is free! Right? Wrong. Think about the worst case.

Depending on the size of the objects allocated, how long they remain live, and how many processors are being used, the subsequent GCs necessary to clean up incessant allocations may impact a program in a difficult to predict way. Allocating a bunch of very large objects that live for long enough to make it out of the nursery, but not forever, for instance, is one of the worst evils you can do. This is known as mid-life crisis. You either want really short-lived objects or really long-lived ones. But in any case, it really matters: the LINQ example earlier shows how easy it is to allocate crap without seeing it in the code.

If I could do it all over again, I would make some changes to C#. I would try to keep pointers, and merely not allow you to free them. Indirections would be explicit. The reference type versus value type distinction would not exist; whether something was a reference would work like C++, i.e. you get to decide. Things get tricky when you start allocating things on the stack, because of the lifetime issues, so we’d probably only support stack allocation for a subset of primitive types. (Or we’d employ conservative escape analysis.) Anyway, the point here is to illustrate that in such a world, you’d be more conscious about how data is laid out in memory, encouraging dense representations over sparse and pointer rich data structures silently spreading all over the place. We don't live in this world, so pretend as though we do; each time you see a reference to an object, think "indirection!" to yourself and react as you would in C++ when you see a pointer dereference.

Allocations are not always bad, of course. It’s easy to become paranoid here. You need to understand your memory manager to know for sure. Most GC-based systems, for example, are heavily tuned to make successive small object allocations very fast. So if you’re programming in C#, you’ll get less "bang for your buck" by fusing a large number of contiguous object allocations into a small number of large object allocations that ultimately occupy an equivalent amount of space, particularly if those objects are short-lived. Lots of little garbage tends to be pretty painless, at least relatively.

Variable latency and asynchrony

There aren’t many ways to introduce a multisecond delay into your program at a moment’s notice. But I/O can do just that.

Code with highly variable latency is dangerous, because it can have dramatically different performance characteristics depending on numerous variables, many of which are driven by environmental conditions outside of your program’s control. As such it is immensely important to document where such variable latency can occur, and to program defensively against it happening.

For example, imagine a team of twenty programmers building some desktop application. The team is just large enough that no one person can understand the full details of how the entire system works. So you’ve got to compose many pieces together. (As I mentioned earlier, composition can lead to hard-to-predict performance characteristics.) Programmer Alice is responsible for serving up a list of fonts, and Programmer Bob is consuming that list to paint it on the UI. Does Bob know what it takes to fetch the list of fonts? Probably not. Does Alice know the full set of concerns that Bob must think about to deliver a responsive UI, like incremental repaints, progress reporting, and cancellation? Probably not. So Alice does the best she knows how to do: she hits the cache, when the font cache is fully populated, and falls back to fetching fonts from the printer otherwise. She returns a List<Font> object from her API. Now Bob just calls the API and paints the results on the UI; the call appears to be quite snappy in his testing. Unfortunately, when the cache is not populated, the "quick" cache hit turns into a series of network roundtrips with the printer; that hangs the UI, but only for a few milliseconds. Even worse, when the printer is accidentally offline, perhaps due to a power outage, the UI freezes for twenty seconds, because that’s the hard-coded timeout. Ouch!

This situation happens all the time. It’s one of the most common causes of UI hangs.

If you’re programming in an environment where asynchrony is first class, Alice could have advertised the fact that, under worst-case circumstances, fetching the font list would take some time. If she were programming in .NET, for example, she’d return a Task<List<Font>> rather than a List<Font>. The API would then be self-documenting, and Bob would know that waiting for the task’s result is dangerous business. He knows, of course, that blocking the UI thread often leads to responsiveness problems. So he would instead use the ContinueWith API to rendezvous with the results once they become available. And Bob may now know he needs to go back and work more closely with Alice on this interface: to ensure cancellation is wired up correctly, and to design a richer interface that facilitates incremental painting and progress reporting.

Variable latency is not just problematic for responsiveness reasons. If I/O is expressed synchronously, a program cannot efficiently overlap many of them. Imagine we must make three network calls as part of completing an operation, and that each one will take 25 milliseconds to complete. If we do synchronous I/O, the whole operation will take at least 75 milliseconds. If we launch do asynchronous I/O, on the other hand, the operation may take as few as 25 milliseconds to finish. That’s huge.

If I had my druthers, all I/O would be asynchronous. But that’s not where we are today.

The concern is not limited to just I/O, of course. Compute- and memory-bound work can quickly turn into variable latency work, particularly under stressful situations like when an application is paging. So truthfully any abstraction doing "heavy lifting" should offer an asynchronous alternative.

Examples of "bad" optimizations

It is easy to take it too far. Even if you’re shaving off cycles where each-and-every cycle matters, you may be doing the wrong thing. If anything, I hope this article has convinced you to be thoughtful, and to strive to strike a healthy balance between beautiful code and performance.

Anytime the optimization sacrifices maintainability, it is highly suspect. Indeed, many such optimizations are superficial and may not actually improve the resulting code’s performance.

The worst category of optimization is one that can lead to brittle and insecure code.

One example of this is heavy reliance on stack allocation. In C and C++, doing stack allocation of buffers often leads to difficult choices, like fixing the buffer size and writing to it in place. There is perhaps no single technique that, over the years, has led to the most buffer overrun-based exploits. Not only that, but stack overflow in Windows programs is quite catastrophic, and increases in likelihood the more stack allocation that a program does. So doing _alloca in C++ or stackalloc in C# is really just playing with fire, particularly for dynamically sized and potentially big allocations.

Another example is using unsafe code in C#. I can’t tell you how many times I’ve seen programmers employ unsafe pointer arithmetic to avoid the automatic bounds checking generated by the CLR JIT compiler. It is true that in some circumstances this can be a win. But it is also true that most programmers who do this never bothered to crack open the resulting assembly to see that the JIT compiler does a fairly decent job at automatic bounds check hoisting. This is an example where the cost of the optimization outweighs the benefits in most circumstances. The cost to pin memory, the risk of heap corruption due to a failure to properly pin memory or an offset error, and the complication in the code, are all just not worth it. Unless you really have actually measured and found the routine to be a problem.

If it smells way too complicated, it probably is.

In conclusion

I’m not saying Knuth didn’t have a good point. He did. But the "premature optimization is the root of all evil" pop-culture and witty statement is not a license to ignore performance altogether. It’s not justification to be sloppy about writing code. Proactive attention to performance is priceless, especially for certain kinds of product- and systems-level software.

My hope is that this article has helped to instill a better sense, or reinforce an existing good sense, for what matters, where it matters, and why it matters when it comes to performance. Before you write a line of code, you really need to have a budget for what you can afford; and, as you write your code, you must know what that code costs, and keep a constant mental tally of how much of that budget has been spent. Don’t exceed the budget, and most certainly do not ignore it and just hope that wishful thinking will save your behind. Building up this debt will cost you down the road, I promise. And ultimately, test-driven development works for performance too; you will at least know right away once you have exceeded your budget.

Think about worst case performance. It’s not the only thing that matters, but particularly when the best and worst case differ by an order of magnitude, you will probably need to think more holistically about the composition of caller and callee while building a larger program out of constituent parts.

And lastly, the productivity and safety gains of managed code, thanks to nice abstractions, type- and memory-safety, and automatic memory management, do not have to come at the expense of performance. Indeed this is a stereotype that performance conscious programmers are in a position to break down. All you need to do is slow down and be thoughtful about each and every line of code you write. Remember, programming is as much engineering as it is an art. Measure, measure, measure; and, of course, be thoughtful, intentional, and responsible in crafting beautiful and performant code.

9/6/2010 11:38:49 AM (Pacific Daylight Time, UTC-07:00)  #   

 Sunday, July 11, 2010

That immutability facilitates increased degrees of concurrency is an oft-cited dictum. But is it true? And either way, why?

My view on this matter may be a controversial one. Immutability is an important foundational tool in the toolkit for building concurrent – in addition to reliable and predictable – software. But it is not the only one that matters. Making all your data immutable isn’t going to instantly lead to a massively scalable program. Natural isolation is also critically important, perhaps more so. And, as it turns out, sometimes mutability is just what the doctor ordered, as with large-scale data parallelism.

Isolation first; immutability second; synchronization last

Stepping back for a moment, the recipe for concurrency is rather simple. Say you’ve got multiple concurrent pieces of work running simultaneously (or have a goal of getting there); for discussion’s sake, call them tasks. Take two tasks. The first critical decision has two cases: either these tasks concurrently access overlapping data in shared-memory, or they do not. If they do not, they are isolated, and no precautions associated with racing memory updates are needed. If they do share data, on the other hand, then something else must give. If all concurrently accessed data is immutable, or all functions used to interact with data are pure, then dangerous concurrency hazards are avoided. All is well. If some data is mutable, however, then this is where things get tricky, and higher-level synchronization is needed to make accesses safe. This decision tree is straightforward and clear.

I have listed those four attributes – isolated, immutable and pure, and synchronization – in a very intentional order. Thankfully, this order mirrors the natural top-down hierarchical architecture of most modern object- and component-based programs: we have large containers that communicate through well-defined interfaces, each comprised of layers of such containers, and somewhere towards the leaves, a fair amount of intimate commingling of knowledge regarding data and invariants.

This order also reflects the order of complexity and execution-time costs, from least to most. Isolation is simple, because components depend on each other in loosely-coupled ways, and in fact scales superiorly in a concurrent program because no synchronization is necessary, the “right” data structure may be chosen for the job – immutable or not – and locality is part-and-parcel to the architecture. Immutability at least avoids the morass of synchronization, which can affect programs immensely in complexity, runtime overheads, and write-contention for shared data. It is clear that synchronization is something to avoid at all costs, particularly anything done in an ad-hoc manner like locks.

Making the concurrency

But where did all this concurrency come from, anyway?

It came from two things:

  1. The coarse-grained breakdown of a program into isolated pieces.
  2. The fine-grained data parallelism.

On #1: Program fragments that are isolated are already half-way down the road to running concurrently as tasks. The second half of this journey, of course, is teaching them to interact with one another asynchronously, most frequently through message-passing or by sticking them into a pipeline. The details of course depend on what programming language you are using. It may be through agents, actors, active objects, COM objects, EJBs, CCR receivers, web-services, something ad-hoc built with .NET tasks, or some other reification. Nevertheless the isolation is common to all these.

On #2: Data parallelism, it turns out, often works best with mutable data structures. These structures must be partitionable, of course, so that tasks comprising the data parallel operations may operate with logically isolated chunks of this data safely, even if they are parts of the same physical data structure. So chunks of them are isolated even though they don’t appear to be. This is trivially achievable with many important parallel-friendly data structures like arrays, vectors, and matrices. Capturing this isolation in the type system is of course no small task, though region typing gets close (see UIUC’s Data Parallel Java).

But you usually don’t want these structures to be immutable, because they can be modified in constant-time and space if they are their classic simple mutable forms. Programmers doing HPC-style data-parallelism a la FORTRAN, vectorization, and GPGPU know this quite well. Compare this a world where we are doing data-parallelism over immutable data structures, where modifications often necessitate allocations or more complicated big-oh times due to clever techniques meant to avoid such allocations, as with persistent immutable data structures. This is likely less ideal. It is true that some data parallel operations are not in-place against mutable data – as with PLINQ – at which point purity, but not immutability, is key. The two are related but not identical: immutability pervades the construction of data structures, whereas purity pervades the construction of functions. But if you can get by with one copy of the data, why not do it? Particularly since most datasets amenable to parallel speedups are quite large.

Immutability: the bricks, not the mortar

Notice that the concurrency did not actually come from immutable data structures in either case, however. So what are they good for?

One obvious use, which has little to do with concurrency, is to enforce characteristics of particular data structures in a program. A translation lookup table may not have been meant to be written to except for initialization time, and using an immutable data structure is a wonderful way to enforce this intent.

What about concurrency? Immutable data structures facilitate sharing data amongst otherwise isolated tasks in an efficient zero-copy manner. No synchronization necessary. This is the real payoff.

For example, say we’ve got a document-editor and would like to launch a background task that does spellchecking in parallel. How will the spellchecker concurrently access the document, given that the user may continue editing it simultaneously? Likely we will use an immutable data structure to hold some interesting document state, such as storing text in a piece-table. OneNote, Visual Studio, and many other document-editors use this technique. This is zero-cost snapshot isolation.

Not having immutability in this particular scenario is immensely painful. Isolation won’t work very well. You could model the document as a task, and require the spellchecker to interact with it using messages. Chattiness would be a concern. And, worse, the spellchecker’s messages may now interleave with other messages, like a user editing the document. Those kinds of message-passing races are non-trivial to deal with. Synchronization won’t work well either. Clearly we don’t want to lock the user out of editing his or her document just because spellchecking is occurring. Such a boneheaded design is what leads to spinning donuts, bleached-white screens, and “(Not Responding)” title bars. But clearly we don’t want to acquire a lock and then make a full copy of the entire document. Perhaps we’d try to copy just what is visible on the screen. This is a dangerous game to play.

Immutability does not solve all of the problems in this scenario, however. Snapshots of any kind lead to a subtle issue that is familiar to those with experience doing multimaster, in which multiple parties have conflicting views on what “the” data ought to be, and in which these views must be reconciled.

In this particular case, the spellchecker sends the results back to the task which spawned it, and presumably owns the document, when it has finished checking some portion of the document. Because the spellchecker was working with an immutable snapshot, however, its answer may now be out-of-date. We have turned the need to deal with message-level interleaving – as described above – into the need to deal with all of the messages that may have interleaved within a window of time. This is where multimaster techniques, such as diffing and merging come into play. Other techniques can be used, of course, like cancelling and ignoring out-of-date results. But it is clear something intentional must be done.

In conclusion

It is safe to say that immutability facilitates important concurrent architectures and algorithms. It can really help big time, for sure. But it is clearly no panacea. Whether mutability or immutability is the right choice for a particular data structure in your program, as with all things, depends.

It could be the case that choosing a piece-table for storing your text facilitates large-scales of concurrency in version two of your software application, but that in version one you have no use for it. Making that call ahead of time may pay in spades down the road, even if it comes at a marginal cost up-front. Or it could be that choosing an immutable data structure costs you in time and space, and you never end up exploiting the fact that you could have shared that particular structure in a zero-cost way across agents in your program.

One thing’s for sure: I’m glad to be programming in languages like C#, F#, Clojure, and Scala, where I’ve got a choice.

7/11/2010 8:18:30 PM (Pacific Daylight Time, UTC-07:00)  #   

 

RSS 2.0

Me
 

Joe Send mail to the author(s) is an architect and developer on a systems incubation project at Microsoft.

Recent

Search

Browse

Disclaimer:
The content of this site are my own personal opinions and do not represent my employer's view in anyway.

© 2013, Joe Duffy