| |
 Sunday, October 23, 2011
It's been unbelievably long since I last blogged.
The reason is simple. I've been ecstatic in my job and, every time I think to write something, I quickly end up turning to work and soon find that hours (days? months?) have passed. This is a wonderful problem to have, but not so good for keeping the blog looking fresh and new. (I've also been writing a fair bit of music lately.) Well, this weekend I managed to lock myself out of my VPN access, and decided that this was a sign that I ought to dust off the cobwebs on a blog entry or two that I've had in the works for quite some time.
The topic for today is generics, a feature many of us know and love. Specifically, their impact on software performance, something I frequently see developers struggling to understand and tame in the wild.
The blessing; and, the curse
I absolutely love generics. I can hardly imagine writing code without them these days. The code reuse, higher-order expressiveness, beautiful abstractions, and static type-safety enabled by first class parametric polymorphism are all game-changing. And being a language history wonk, I'm delighted to see many mainstream programming languages stealing a page from ML and theoretical CS generally.
Generics, however, are not free. And in some circumstances, they are, dare I say, rather expensive. Few language features surpass generics in the ability to write a concise and elegant bit of code, which then translates into reams of ugly assembly code out the rear end of the compiler. I am of course speaking mainly to models in which compilation leads to code specialization (like .NET's), versus erasure (like Java's).
Most developers coming from a C++ background understand code expansion deeply, because they program with templates. Unlike templates, however, there is ample runtime type information (RTTI) associated with generic instantiations… such that the costs associated with generics frequently – and perhaps surprisingly – are a superset of those costs normally associated with C++ templates. At the same time, because the compiler understands parametric polymorphism, it can sometimes do a better job optimizing, e.g. with techniques like code sharing.
Basically, with templates and erasure, the equation for predicting code expansion is super simple. You get it all (in the former) or you get none of it (in the latter), but with specialized generics this equation is quite complex.
Paradoxically, these same costs are the main value that generics bring to the table! Write a little type-agnostic code and then "instantiate" that same code over multiple types without repeating yourself. But, generics are not magic; did you ever stop to wonder things like: What machine code is generated for these types? Does the compiler need to specialize the actual code that runs on the processor for unique instantiations, or is it all the same? And if it does need to specialize, where, how, and why? And perhaps most importantly, what hidden costs are there, and how should I think about them while writing code?
Before reading further, paranoia need not ensue. The point of this article is merely to raise your awareness. All programmers should know what the abstractions they use cost, and make conscious tradeoffs when writing code with them. The aforementioned benefits of generics really are often "worth it," both in the elegance and reusability of abstractions, and in developer productivity. In my experience, however, the associated costs are so subtle and ill-documented that even people who write highly generic code typically remain unaware of them. Even more subtly, these costs are somewhat different in nature when pre-compiling your code, such as with .NET's NGen technology.
This brief essay will walk through a few such costs in the context of the .NET Framework and CLR's implementation of generics. This is in no way an exhaustive study of generic compilation, and your mileage will vary from one platform to the next. Although the studies presented would apply to other implementations of generics, the reality is that if you're writing code in, say, Java – where type erasure is employed rather than code specialization – then all of this is going to be less relevant to you.
With no further delay, let's get started.
Code, RTTI, oh my
When considering costs, we must always think about both size and speed.
There is at least as much assembly code created for an instantiation as the code you've written for the generic abstraction in C# or MSIL. A simple mental model – that thankfully turns out isn't entirely accurate, thanks to some sharing optimizations described below – is that for each instantiation of a generic type or method you get a new copy of that code specialized to the type in question. Obviously, this increases code size. And just as obviously, it will add some runtime cost to JIT compile the code (if you aren't using ahead of time compilation), as well as putting more pressure on I-cache and TLB.
Another source of significant cost is the runtime data structures needed for RTTI and Reflection, like vtables and other metadata. Quite simply, the runtime needs to know the identity of each generic instantiation, to prevent things like casting a List<Int16> to a List<String>, and even List<Object> to List<String>; and given that there is often distinct code generated for unique instantiations, the vtable contents for those different List<T> instantiations are going to look quite different.
And of course, there are statics. Each generic instantiation gets its own set, requiring extra storage and another level of indirection when fetching them. Unique statics means D-cache and TLB pressure. It turns out that code shared across AppDomains, like mscorlib.dll, already need such things. But I have found that it's surprisingly common for a developer to throw a static field (or nested class!) onto an outer generic type, without actually needing it to be replicated for each unique instantiation.
In addition to the immediate effects, generic types often refer to other generic types which refer to other generic types … and so on. Instantiating a root type is akin to instantiating the full transitive closure.
To make our discussion friendly and familiar, we shall use the .NET Framework's List<T> type – presumably one of the most commonly used generic types on the planet – to illustrate many of these costs. And unfortunately, you'll also see that many of the common performance pitfalls plague this type too. (So, really, you need not feel bad if your own code is guilty of them too.)
Why the distinct code, anyway?
There is only one copy of List<T>'s code in mscorlib's MSIL. It is essentially just a blueprint for the list class.
When I create a List<Int16> in my program and use it, however, there clearly needs to be some assembly code created in order to execute List<T>'s associated functionality, just with any T's used by List<T>'s code replaced by actual 2-byte short integers. And similarly, if I were to instantiate a List<String>, all those T's need to be replaced by pointer-sized object references, either 4- or 8-bytes depending on machine architecture, that are reported live to the garbage collector.
This is what leads to our simple mental model above, in which each instantiation gets its own copy of the code. In this case, both List<Int16> and List<String> would be entirely independent types at runtime, with wholly separate copies of the machine code.
Certainly if I manually went about creating my own Int16List and StringList types, they would be distinct types with distinct machine code generated. Being a prudent developer, however, I'd probably try to arrange to share as much of the implementation as possible between the two types, perhaps using implementation inheritance. But alas, there's no way I could share it all: any code specific to Int16 or String, for example, would surely differ, both in MSIL and in the native code.
Generics basically give you the ability to do this same thing, without you needing to do the factoring of type-independent and type-specific code yourself. The compiler does that for you.
Why might the code be different? As stated above, Int16 values are 2 bytes and String pointers are native word sized (4 bytes on 32-bit, 8 bytes on 64-bit). All the code that passes values of type T on the stack, either as arguments or return values, moves instances into and out of memory locations (like the T[] backing array), and so on, needs to be specialized based on the size of T. This wouldn't be true of a generics implementation that used type erasure, like Java's, but then you'd need to box the value types on the heap so that everything is a pointer. If T is a Float, we will likely emit code that uses floating point math instead of general purpose registers. Any tables that report GC roots are likely to be different, since object references can be embedded inside struct values that get laid out on the stack. And so on. Some day you might want to compare the machine code for a simple generic Echo<T> method for different kinds of T's; it is really easy to do, and is quite illustrative.
A naïve wish might go as follows. Imagine that I had written my own dedicated Int16List and StringList types, and that we diffed the resulting machine code between the distinct list types; we'd presumably find a fair bit of duplication for all the reasons stated above. It would be a nice property if, when we used the generic List<Int16> and List<String> types, and similarly diffed the resulting assembly, the amount of specialized code would be no greater than the amount of specialized code between our best hand-written Int16List and StringList types. I.e., only parts that need to be different are different.
We could go even further with our wish. Imagine I had a List<DateTime> and List<Int64>. Both are 8-byte values, and do not contain any GC references. If I were writing a specialized 8ByteValueList in C++ and had immense performance constraints, I would, again being a prudent developer, probably use some type unsafe code, with nasty reinterpret_casts, so that I could use the same list type to store any kind of 8-byte value. (Except in C++ I could even store pointers!) It would also be a nice property if generics did some of this for us, while still retaining the type safety we love about generics.
It turns out we will get neither of our wishes exactly, although we will get something close to the spirit of our wishes.
Code sharing
Indeed, the CLR does arrange to share many generic instantiations. The rule is simple, although it is subject to change in the future (being an optimization and all): instantiations over reference types are shared among all reference type instantiations for that generic type/method, whereas instantiations over value types get their own full copy of the code. In other words, List<String> and List<Object> are backed by the same code, but List<DateTime> and List<Int64> get their own.
It is true that, in theory, List<DateTime> and List<Int64> could use the same shared code, because they are of identical size and have GC roots in the same locations (trivially, because neither has one). But there are additional restrictions on generated code that makes this problematic, for example if we were talking about Double and Int64. In short, the CLR doesn't actually share value type instantiations as of the 4.0 runtime, although clearly it could in certain situations (value types of the same size with GC roots in the same locations).
As you might guess, this extends to multi-parameter generics in obvious ways. A Dictionary<Object, Object> is shared with a Dictionary<String, String>, etc., and a Dictionary<Int64, Object> is shared with a Dictionary<Int64, String>. A Dictionary<DateTime, DateTime> is not, however, shared with a Dictionary<Int64, Int64> instantiation, as per the above.
My pal Joel Pobar wrote a post eons ago describing how code sharing works in great detail, which I do not intend to rehash. Please refer to his post for an excellent overview of how code sharing works.
An important thing to remember, however, is that no matter how much code sharing happens, you still need distinct RTTI data structures. So although List<Object> and List<String> share the same machine code, they have distinct vtables; sure, each table is full of pointers to the same code functions, but you are still paying for the runtime data structures. A distinct instantiation, therefore, is never actually free!
Transitive closures
Why am I making such a big deal about code sharing, anyway?
Another surprising aspect of generics is the transitive closure problem. Particularly when doing pre-compilation of generics, each unique instantiation doesn't simply lead to a specialized version of the code associated with the type being directly instantiated. The whole transitive closure of types, starting with that root type, will also be compiled. This can be a surprisingly huge number of types! JIT is much more pay-for-play, such that you get one level of explosion at a time, but once there is code that calls a particular type's method, even if that code is lazily compiled, creation of the type is forced.
To illustrate this, let's take our friend List<T>. Before examining the list, how many generic types would you expect that a single new List<T> instantiation instantiates?
What if I told you that a single List<int> instantiation creates (at least) 28 types? And that, say, five unique instantiations of List<T> might cost you 300K of disk space and 70K of working set? Well, of course, if you are writing a script, or something with fairly loose performance requirements, this might not matter much. But if topics like download time, mobile footprint, and cache performance are important to you, then you probably want to pay attention to this. To a first approximation, size is speed.
Yes, you heard me right: 28 types. Holy smokes... How can this be?!
Nested types are one obvious answer, and indeed List<T> has two: an Enumerator class (which is reasonable), and one to support the legacy synchronized collections pattern (which we presumably wish we didn't have to pay for). The larger answer here, however, is functionality. Yes, functionality! This is a great example where the cost of generics explodes as you add more features. Start simple, keep adding stuff, as has happened to List<T> over the years, and you will soon find that a series of elegant abstractions adds up to a gut-wrenching bucket of bytes.
Here's a quick sketch of the transitive closure of generic types used by List<T>:
List<T>
T[] type
IList<T> type
ICollection<T> type
IEnumerable<T> type
IEnumerator<T> type
ReadOnlyCollection<T> type (AsReadOnly)
(Nothing more than List<T>)
IComparer<T> type (BinarySearch, Sort)
{Array.BinarySearch<T> method (BinarySearch)}
ArraySortHelper<T> type
IArraySortHelper<T> type
GenericArraySortHelper<T> type
EqualityComparer<T> type (Contains)
IEqualityComparer<T> type
IEquatable<T> type
NullableEqualityComparer<T> type
Nullable<T> type
EnumEqualityComparer<T> type
{JitHelpers.UnsafeEnumCast<T> method}
ObjectEqualityComparer<T> type
Predicate<T> delegate type (Find*)
Action<T> delegate type (ForEach)
{Array.LastIndexOf<T> method (LastIndexOf)}
Comparison<T> delegate type (Sort)
Array.FunctorComparer<T> type (Sort)
Comparer<T> type
GenericComparer<T> type
NullableComparer<T> type
ObjectComparer<T> type
{Array.Sort<T> method (Sort)}
ArraySortHelper<T> type (see earlier)
Enumerator inner type
SynchronizedList<T> inner type
IList<T> interface (see earlier)
ICollection<T> interface (see earlier)
IEnumerable<T> interface (see earlier)
{Interlocked.CompareExchange<Object> method (SyncRoot)}
{_emptyArray T[] static field}
I'm not trying to pick on List<T>. This class is only unique in this regard in that it offers a large transitive closure of (mostly useful!) functionality. And it's not the only guilty party. We recently shaved off 100K's of code size on my team, for example, that were being lost simply because all the LINQ methods were declared as instance methods on the base collection class, rather than being extension methods as in .NET. We found nested enumerator and iterator types, cached static lambdas as static fields, and huge transitive closures of other generic types, all allocated when you just touched any collection type. Any collections library is apt to be full of this stuff, since they are highly generic. But collection libraries are certainly not the only places to go sniffing for such problems.
As an aside, it turns out that extension methods are a great way to make generic abstractions more pay-for-play.
Adding it up
Let's see what the above adds up to. I ran some programs through NGen as a quick and dirty experiment, and inspected the on-disk sizes and also the runtime working set sizes. I ensured clrjit.dll was not loaded into the process. Here's what I found. Take these numbers with a grain of salt, as they will change from release to release; they are simply rules of thumb. When in doubt, crank up NGen, DumpBin, and/or start trawling the heap with VADump yourself!
One empty type with no methods in CLR 4.0 seems to cost roughly 0.2K bytes of on-disk metadata, and about 0.7K in x64 working set. (This is a good rule of thumb irrespective of generics… in terms of order of magnitude, you can think "one empty type means 1K of memory.") A single List<S> instantiation, where S is an empty struct, is in the neighborhood of 60K on-disk metadata, and 14K of x64 working set. A single List<C> instantiation, however, where C is an empty class, is only – surprise – about 7K on-disk and 4K in-memory. Why the large discrepancy? Well, it just so happens that mscorlib.dll already includes an instantiation or two of List<T> over reference types, so this 4K is the incremental cost on top of reusing what is there; remember, there are still unique vtables and data structures still required for RTTI.
Rico did a similar analysis a few years back, and concluded that each unique List, where E was an enum type, cost 8K. Why the increase to 14K over the years? x64 and ever-increasing functionality on the basic collections classes, presumably. Remember, it's not just List<T> that has grown, it's also everything that List<T> uses internally as an implementation detail.
Dynamic specialization with dictionaries
Some specialization in behavior can be accomplished with dynamic runtime behavior, rather than static code specialization. A prime example is the following:
class C
{
public static void M<T>()
{
System.Console.WriteLine(typeof(T).Name);
}
}
Where does the program get the value of typeof(T) from? If you look at the MSIL, you will see that C# has emitted a ldtoken MSIL instruction. For some struct type, we can compile that as a constant in the code, because it is getting its own copy of the code. What occurs when two instantiations share code, like M<String> and M<Object>, however? As you might guess, there is an indirection.
The thing we usually use for such indirections – the vtable – is nowhere to be found in this particular example, because M is a static method. To deal with this, the compiler inserts an extra "hidden" argument, frequently called a generic dictionary, from which the emitted assembly code can fetch the type token. The cost here typically isn't bad, because many of the operations that pull in the dictionary are already RTTI or Reflection-based, and would require an indirection already (e.g., through a vtable).
The operations which require a dictionary of some kind include anything that has to do with RTTI and yet no vtable is readily accessible: typeof, casts, is and as operators, etc. And as you might guess, if instantiations aren't shared (such as with value types on the CLR), no dictionary is needed, because the code is fully specialized. There are also multiple kinds of dictionaries used by the runtime, depending on whether you are using a generic type, method, or some combination of both.
JITting when you didn't mean to
There are two primary ways in which you will JIT compile when using generics, even if you were good doobie and used NGen to reduce startup time.
One way is if you instantiate a new generic type exported from mscorlib.dll with a type argument also defined in mscorlib.dll, that wasn't already instantiated inside mscorlib.dll. (See my old Generics and Performance blog entry for more details.) You can very easily see this happening by using an instantiation like Dictionary<DateTime, DateTime>, and watching the clrjit.dll module getting loaded.
The other way is generic virtual methods (GVMs). It turns out that GVMs pose incredible difficulty for ahead of time separate compilation, because the compiler cannot know statically which slot in the vtable points at the particular implementation you are about to call. (Unless you use whole program compilation, something not offered by .NET at present time.) For each such method, there's an unbounded set of possible specialized instantiations a slot might point to, and so the vtable cannot be laid out in a traditional manner. C++ doesn't allow templated virtual methods for this very reason.
Thankfully, GVMs are somewhat rare. However, it only took 5 minutes of poking around to find one that is quite front-and-center in .NET: in the implementation of LINQ, there is an Iterator<T> type that has a method declared as follows:
public abstract IEnumerable<TResult> Select<TResult>(Func<TSource, TResult> selector);
All we need to do is figure out how to tickle that method, and we're guaranteed to JIT. As it turns out, sure enough, the following code does the trick and forces clrjit.dll to get loaded in .NET 4.0:
int[] xs = …;
int[] ys = xs.Where(x => true).Select(x => x).ToArray();
The Iterator<T> type is used for back-to-back Where and Select operators, as a performance optimization that avoids excess allocations and interface dispatch. But because it depends on a GVM, it does incur an initial penalty for using it, even if you have used NGen to avoid runtime code generation.
In conclusion
The moral of the story here is not that you should fear generics. Beautiful things can be built with them.
Instead, it's to use generics thoughtfully. Nothing in life is free, and generics are no exception to this rule. If code size is important to you, then you will want to have performance gates measuring your numbers against your goals; if you are working in a codebase that uses generics heavily, and you end up spending any significant time on code size optimizations, you will want to try to track down large transitive closures. As I stated above, you could really be throwing away 100K's of code here.
And as to the surprise JITting, I've seen teams compiling with NGen and having a functional gate that fails any new code that causes clrjit.dll to get loaded at runtime. Although tracking down the root cause might be tricky when that gate fails, at least you won't let the camel's nose under the tent.
Investing in tools here is a very good idea.
When it comes down to it, really thinking about what code must be executed by the process is helpful. Step back and imagine you were writing this all in C++, with the associated performance concerns front-and-center: consider how you'd arrange to reuse as much implementation as you can, manage memory efficiently, perhaps employ unsafe tricks that would have violated type safety and so are offlimits in .NET, and all that jazz. Then step back and be grateful that you have a type- and memory-safe environment to help you write more robust code, but also be realistic about what you are paying in exchange.
I hope you've learned a useful thing or two in this article. If you'd like to learn more, here are a few other good resources:
- An MSR paper on the original implementation of .NET generics: http://research.microsoft.com/pubs/64031/designandimplementationofgenerics.pdf
- Rico's "Six Questions About Generics and Performance" blog entry: http://blogs.msdn.com/b/ricom/archive/2004/09/13/229025.aspx
- Joel Pobar's "Generics and Code Sharing" blog post: http://blogs.msdn.com/b/joelpob/archive/2004/11/17/259224.aspx
- My "Generics and Performance" blog entry: http://www.bluebytesoftware.com/blog/2005/03/23/DGUpdateGenericsAndPerformance.aspx
Cheers.
 Wednesday, June 01, 2011
InfoQ recently asked me a few questions about concurrency and programming languages, and here is what I had to say:
http://www.infoq.com/articles/Interview-Joe-Duffy
A little teaser:
"The major shift we face will be that mainstream languages will start to incorporate more concurrency-safety -- immutability and isolation -- and the platform libraries and architectures will better support this style of software decomposition. OOP developers are accustomed to partitioning their program into classes and objects; now they will need to become accustomed to partitioning their program into asynchronous Actors that can run concurrently. Within this sea of asynchrony will lay ordinary imperative code, frequently augmented with fine-grained task and data parallelism."
As an aside, I know I've been super quiet lately. I never thought I'd go months without blogging. My sincere apologies for this; work has been too all-consuming / fun, and I've been unable to carve out much time for anything else. (Speaking of which, we are still hiring: email me at joedu at you-know-where dot com if you are interested.) I'm about to head out to Europe for a few weeks, where hopefully I'll have a bit of time to write up something exciting to share. Cheers.
 Saturday, December 04, 2010
After spending more time than I’d like to admit over the years researching memory model literature (particularly Java’s terrific JMM and related publications), subsequently trying to “fix” the CLR memory model, reviewing proposals for new C++ memory models, and beating my head against the wall for months developing a new memory model that supports weakly ordered processors like the kind you’ll find on ARM in a developer-friendly yet power-efficient way, I have a conclusion to share.
Volatile is evil
Why? Let me recount the reasons:
- It doesn’t mean what you think.
- It used to have a very specific purpose — to enure memory operations with external side-effects did not get reordered — and has since gotten bastardized and used for many secondarily-related purposes.
- Even if you think it does mean what you think, the annotation scheme is all wrong. Volatile annotates a storage location, and yet what really matters is what happens when accessing said storage locations. The fences occur when you access the variable, not when you declare it. And yet from a readability perspective, they are completely invisible and easy to miss.
- And even if you don’t care about readability, the meaning of volatile changes wildly when you switch platforms. Today it’s store / release, tomorrow it’s write / read fences. Perhaps it’s even sequentially consistent. And the label of “store / release” could actually be a white lie, as with the CLR’s memory model thanks to store buffer forwarding and the lack of fences in the CLR JIT’s x64 stores.
- Performance, man, performance! Sure sequential consistency as the default sounds nice on the tin, but once you’re running that mobile app on ARM, and sucking up 160 cycles for each write you perform, you’re going to curse volatile like the plague.
And so the moral of the story follows...
Attempting to “fix” volatile is a waste of time
Instead, a new world order has arrived. We must take a two-pronged approach to solving instruction-level interleaving bugs, neither prong having much to do with the traditional definition of memory models or volatiles. We must:
- Eliminate memory ordering from 99% of developers’ purviews. This is already the case with single-threaded programs, because code motion in compilers and processors is limited to what only affects concurrent observations. So the answer is pretty clear: developers must move towards single-threaded programming models connected through message passing, optionally with provably race-free fine-grained parallelism inside of those single-threaded worlds.
- Leave the memory model esoterica to the Einsteins, and radically change its meaning. Data dependence and transitive visibility of memory operations are in. Volatiles on storage locations are out. Instead, we must throw fences into programmer’s faces, and force them to understand each and every one that occurs. And moreover, force them to decide about each and every one that occurs. Specifically, hidden fences thanks to volatile are no longer. Those who cannot take it should fall into the 99% bucket already mentioned above, versus the 1% bucket.
Let’s set #1 aside for now, since it’s obviously a huge can of worms.
But what about #2? It is quite encouraging that the C++0x group is firmly on the path of #2. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2145.html for more details. In a nutshell, each location that you’d have ordinarily tagged volatile instead becomes a template atomic type. And then each read / write has the opportunity to specify the desired kind of fence, whether that is acquire (for reads), release (for writes), fully ordered, or relaxed (meaning no fence).
I do think it’d be worth them considering compiler-only fences too. So that relaxed means fully-relaxed, and there is a fence in-between that merely prevents the compiler from optimizing the memory operation. This pays homage to volatile’s legacy in C as merely a variable that mustn’t be subject to optimizations, because operations against these variables pertain to, say, memory-mapped device I/O.
Another nitpick of mine is that I’d have required each access to specify the fence, whereas C++0x implicitly uses full fence if left unspecified at the callsite. It’s a minor convenience, but I like always having the fence spelled out very explicitly in the code. Lock-free access to shared variables is sufficiently dangerous that automatic sequential consistency is the least of your problems.
Nevertheless, the C++0x direction is a massive good step forward, and these are just minor details.
My hope is that .NET follows suit. And the timing couldn’t be more apropos as “now”: we are moving forward in a heavily mobile, distributed-system-on-a-chip, and heterogeneous world, where processor memory models will necessarily continue to weaken. The overly strong x86 memory model, kept alive primarily to ensure compatibility, has simply grown too expensive to accommodate. The power benefits and architectural simplifications are hard to argue with, and because compatibility becomes less of an issue as new platforms arise (e.g. for mobile), the world moves to the cloud, and hence there is legacy to worry about, I do hope that processor vendors seize the opportunity. ARM certainly has. It is less about out-of-order execution as it is about coherency costs. Truthfully, I’d be disappointed if anything else happened, even though the risk to compatibility for shrink-wrapped software scares the hell out of me. But this is most certainly the right way to go, long-term. As software platforms move in the direction of #1 – as I also foresee – the need for fences dwindles. The cost of supporting the current .NET memory model is too great and will become a liability with time.
Thankfully, it is quite simple to build a veneer atop .NET that works a lot more like atomic. For example, imagine that we had a new System.Threading.Volatile static class, and that it offered the moral equivalent to atomic inner types for each atomic primitive we can synchronize against:
namespace System.Threading
{
public static class Volatile
{
public struct Int32 {..}
public struct Int64 {..}
…
public struct Reference where T : class {..}
…
}
}
Now instead of tagging a location as ‘volatile’, you would use one of these primitives. For example, rather than:
static volatile MySingleton s_instance;
You would say:
static Volatile.Reference<MySingleton> s_instance;
Each class has a similar set of operations. For example:
namespace System.Threading
{
public static class Volatile
{
public struct Int32
{
public Int32(int value);
public int ReadAcquireFence();
public int ReadFullFence();
public int ReadCompilerOnlyFence();
public int ReadUnfenced();
public void WriteReleaseFence(int newValue);
public void WriteFullFence(int newValue);
public void WriteCompilerOnlyFence(int newValue);
public void WriteUnfenced(int newValue);
public int AtomicCompareExchange(int newValue, int comparand);
public int AtomicExchange(int newValue);
public int AtomicAdd(int delta);
public int AtomicIncrement();
public int AtomicDecrement();
// Etc…, bitwise ops, other math ops, etc.
}
}
}
Of course, only the integer types would offer the increment, decrement, add, and related operators. And it turns out that offering different kinds of fences on the Atomic* operations would be incredibly useful too, because processors like ARM do not couple the fence to the compare-and-swap / load-locked-store-conditional as x86 processors do. Taking advantage of this can be huge if you are writing performance critical code, like a concurrent garbage collector whose atomic swaps need not imply ordering with the surrounding instruction stream. You can quibble over the details, like whether these should use enums instead of the name to encode the fence-kind. I did it this way to keep the implementations branch-free, although with a decent inlining JIT compiler, it’d probably optimize those away thanks to constant propagation.
It’s quite trivial to implement these APIs atop existing .NET primitives. I built a little library that does so, but it was so boring and repetitive I decided not to post it alongside this blog entry as originally intended.
With the above definition, we can very clearly see the fences involved in doing, say, double-checked locking:
static Volatile.Reference<MySingleton> s_instance;
public static MySingleton Instance
{
get
{
MySingleton instance = s_instance.ReadAcquireFence();
if (instance == null) {
instance = new MySingleton();
instance = s_instance.AtomicCompareExchangeRelease(instance, null);
}
return instance;
}
}
We see there are two fences. One is an acquire and, depending on what your memory model says about data dependence, is probably unnecessary. Most sane memory models guarantee that data dependent loads do not pass. So we needn’t worry that we’ll see a non-null s_instance whose contents haven’t been initialized. (If we were talking structs, it’d be another story.) Nevertheless, it’s definitely required that we use a release-only fence for the publication of the object. This guarantees writes to fields within the MySingleton constructor have completed prior to the write of the new object to the shared instance field. The point here is that you are forced to think about the fences, and you actually see them.
Of course, most platforms need to provide the bare minimum of fencing to assure type safety, particularly for languages like C#. My understanding is that C++0x has decided, at least for now, not to offer type-safety in the face of multithreading. That means you might publish an object and, if stores occur out-of-order, the reader could see an object partially initialized with an invalid vtable pointer. In C# and Java, the language designers have thankfully decided to shield programmers from this. The need for fences also extends to unsafe code like strings, where – were it possible for a thread to read the non-zero length before the char* pointer was valid – writes to random memory could occur and hence threaten type-safety. Thankfully, again, C# and Java protect developers from this, mostly due to the automatic zero’ing of memory as the GC allocates and hands out new segments.
There are costs to offering this type safety assurance. So you can understand why the C++ designers want to keep fences out of object allocation. If you have #1 above, however, the costs are dramatically lower and more acceptable. But the world is – unfortunately – still a freethreaded one, and we have several years to go before we’ve reached the final destination. As a step forward, however, the death of volatile is a welcomed one. Say it with me.
“Sayonara volatile.”
Here’s hoping that .NET 5.0 takes this step forward too.
 Sunday, October 31, 2010
I rambled on and on a few weeks back about how much performance matters. Although I got a lot of contrary feedback, most of it had to do with my deliberately controversial title than the content of the article. I had intended to post a redux, but something more concise is on my mind lately.
GC-based memory management is a boon to productivity, not to mention program safety. Few would argue with this. However, the most effective developers know how their particular GC works, and optimize their program’s data structure, allocation, and lifetime behavior to suit their particular GC best.
This is dangerous, but a pragmatic fact of life. It is dangerous because who’s to say that the runtime team doesn’t intend to entirely revamp the GC’s collection strategy next release, at which point your thoughtfulness may actually harm you? It’s a pragmatic fact for a few reasons: it’s probably not likely that the behavior of your favorite GC is going to change too fundamentally over time; if it did, you’d need to rethink things anyhow; oh, and when was that next release anyway (2 or more years out); and finally, what do you care about more, the theoretical loose coupling, or real results today?
One of the worst data structures for traversal is a linked list. That’s because its contents are fetched by pointer-chasing, an act that usually destroys locality, unless the data associated with each pointer was carefully constructed to live next to its previous and next pointer’s data. This seldom happens, because the main point of a linked list is to free you from such constraints.
One of the best data structures for traversal, on the other hand, is an array. Adjacent elements are truly adjacent to one another in memory, meaning that as you fetch the i’th element from memory, you’re probably pulling in the i’th, i+1’th, …, and so on, thanks to spatial locality. Of course, if the elements are just pointers, then you're back to the chasing game; as with anything, it depends.
How many elements you prefetch of course depends on the size of the elements with respect to your processor’s cache line. If you’re working with 8-byte elements, and 128-byte cache lines, then you may pay 100 cycles for the first fetch, and then amortize that cost over the subsequent 15 found cheaply in cache for 10 cycles. The result is about 250 cycles total; compare this to a linked list, where you’ll probably spend 1,600 cycles, or more than 6X the cost. And of course you’re trashing other data in the cache in the process. As you traverse more and more of the list, the numbers snowball, and the amortization of locality, or lack thereof, provides a stark contrast.
There’s another subtle reason why this is important. Stop and think about what happens when a GC occurs.
Yep, that’s right. Your data structures need to be traversed during a GC, after all, to ascertain the liveness of any pointers held within. That scan looks a whole heck of a lot like the same traversal I just described, and enjoys the same locality properties. So we can immediately conclude that data structures whose traversal is efficient will translate into less time spent in the GC chasing pointers, and better cache efficiency.
For programs that are sensitive to long pause times, this is huge. I talk to customers all the time whose programs are sensitive to microsecond-long GC delays, and – aside from ensuring good GC lifetime practices, like ensuring all objects either die young or live long – being conscious about locality can be immensely important. Especially for any long-lived, large data structures that will be subject to Gen2 collections throughout their lifetime.
There is another useful trick to know. If a data structure contains no pointers, the GC will not have to trace these pointers. Obvious, right? A linked list inherently contains pointers, so this trick really doesn’t apply: the GC will need to traverse the whole live portions of the object graph. What’s interesting is that an array, on the other hand, may or may not contain elements that contain pointers. For example, an array of ints clearly has no pointers, whereas an array of string references clearly does. This doesn’t just apply to the primitive types, but also custom structs which may or may not contain references. When the GC encounters such an array, its contents need not be traversed: instead, the array is alive, and that's that. Yet another opportunity to eliminate pointer chasing. Not only does this save the GC from doing some heavy lifting, but the pointer-free structs eliminate the need for the GC write barriers on array stores too.
So think of all this next time you’re confronted with the decision to employ a tree, graph, or linked list, and whether a dense, and perhaps pointer-free, representation could be beneficial. Even if it means you must replace pointers with index calculations. The locality benefits may not matter, but then again, they may. And at least you can knowingly make a balanced tradeoff, with these potential advantages in mind.
 Saturday, September 18, 2010
I have several positions open on my team here at Microsoft.
My team's responsibility spans multiple aspects of a research operating system’s programming model. The three main areas are concurrency, languages, and frameworks. When I say concurrency, I mean things like asynchrony and message passing, data and task parallelism, distributed parallelism, runtime scheduling and resource management, and heterogeneity and GPGPU. When I say languages, I mean type systems, mostly-functional programming, verified safe concurrency, and both front- and back-end compilation. And when I say frameworks, I mean virtually anything you could imagine wanting out of a platform framework: all things XML, data interoperability (database, web services, etc.), collections, transactions, multimaster synchronization, and even low level things, like regex, numerics, and globalization.
Our team is 100% developers, and we have an “everybody codes, everybody loves to code” culture. Even managers are expected to spend a significant amount of time prolifically writing code.
All of these components are new and built from the ground up. So self-drive and an ability to have a vision and make it happen are incredibly important.
We are always happy to hire great, hard-working people, regardless of years of experience. If you’re extremely strong in one or more of the abovementioned areas, more of a generalist, are an amazing coder, or all of the above, you’d fit in perfectly. This is the most amazing team of people I’ve ever worked with. If you are interested, please email your resume to me at joedu AT microsoft DOT com.
 Monday, September 06, 2010
I can’t tell you how many times I’ve heard the age-old adage echoed inappropriately and out of context:
"We should forget about small efficiencies, say about 97% of the time; premature optimization is the root of all evil" -- Donald E. Knuth, Structured Programming with go to Statements
I have heard the "premature optimization is the root of all evil" statement used by programmers of varying experience at every stage of the software lifecycle, to defend all sorts of choices, ranging from poor architectures, to gratuitous memory allocations, to inappropriate choices of data structures and algorithms, to complete disregard for variable latency in latency-sensitive situations, among others.
Mostly this quip is used defend sloppy decision-making, or to justify the indefinite deferral of decision-making. In other words, laziness. It is safe to say that the very mention of this oft-misquoted phrase causes an immediate visceral reaction to commence within me... and it’s not a pleasant one.
In this short article, we’ll look at some important principles that are counter to what many people erroneously believe this statement to be saying. To save you time and suspense, I will summarize the main conclusions: I do not advocate contorting oneself in order to achieve a perceived minor performance gain. Even the best performance architects, when following their intuition, are wrong 9 times out of 10 about what matters. (Or maybe 97 times out of 100, based on Knuth’s numbers.) What I do advocate is thoughtful and intentional performance tradeoffs being made as every line of code is written. Always understand the order of magnitude that matters, why it matters, and where it matters. And measure regularly! I am a big believer in statistics, so if a programmer sitting in his or her office writing code thinks just a little bit more about the performance implications of every line of code that is written, he or she will save an entire team that time and then some down the road. Given the choice between two ways of writing a line of code, both with similar readability, writability, and maintainability properties, and yet interestingly different performance profiles, don’t be a bozo: choose the performant approach. Eschew redundant work, and poorly written code. And lastly, avoid gratuitously abstract, generalized, and allocation-heavy code, when slimmer, more precise code will do the trick.
Follow these suggestions and your code will just about always win in both maintainability and performance.
Understand the order of magnitude that matters
First and foremost, you really ought to understand what order of magnitude matters for each line of code you write.
In other words, you need to have a budget; what can you afford, and where can you afford it? The answer here changes dramatically depending on whether you’re writing a device driver, reusable framework library, UI control, highly-connected network application, installation script, etc. No single answer fits all.
I am personally used to writing code where 100 CPU cycles matters. So invoking a function that acquires a lock by way of a shared-memory interlocked instruction that may take 100 cycles is something I am apt to think hard about; even more worrisome is if that acquisition could block waiting for 100,000 cycles. Indeed this situation could become disastrous under load. As you can tell, I write a lot of systems code. If you’re working on a network-intensive application, on the other hand, most of the code you write is going to be impervious to 100 cycle blips, and more sensitive to efficient network utilization, scalability, and end-to-end performance. And if you’re writing a little one-time script, or some testing or debugging program, you may get away with ignoring performance altogether, even multi-million cycle network round-trips.
To be successful at this, you’ll need to know what things cost. If you don’t know what things cost, you’re just flailing in the dark, hoping to get lucky. This includes rule of thumb order of magnitudes for primitive operations – e.g. reading / writing a register (nanoseconds, single-digit cycles), a cache hit (nanoseconds, tens of cycles), a cache miss to main memory (nanoseconds, hundreds of cycles), a disk access including page faults (micro- or milliseconds, millions of cycles), and a network roundtrip (milliseconds or seconds, many millions of cycles) – in addition to peering beneath opaque abstractions provided by other programmers, to understand their best, average, and worst case performance.
Clearly the concerns and situations you must work to avoid change quite substantially depending on the class of code you are writing, and whether the main function of your program is delivering a user experience (where usability reigns supreme), delivering server-side throughput, etc. Thinking this through is crucial, because it helps avoid true "premature optimization" traps where a programmer ends up writing complicated and convoluted code to save 10 cycles, when he or she really needs to be thinking about architecting the interaction with the network more thoughtfully to asynchronously overlap round-trips. Understanding how performance impacts the main function of your program drives all else.
Pay attention to interoperability between layers of separately authored software that is composed together. The most common cause of hangs is that an API didn’t specify the expected performance, and so a UI programmer ended up using it in an innocuous but inappropriate way, because they couldn’t afford the range of order of magnitude cost that the API’s performance was expected to fall within. Hangs aren’t the only manifestation; O(N^2), or worse, performance can also result, if, for example, a caller didn’t realize the function called was going to enumerate a list in order to generate its results.
It is also important to think about worst case situations. What happens if that lock is held for longer than expected, because the system is under load and the scheduler is overloaded? And what if the owning thread was preempted while holding the lock, and now will not get to run again for quite some time? What happens if the network is saturated because a big news event is underway, or worse, the phone network is intermittently cutting out, the network cable has been unplugged, etc.? What about the case where, because a user has launched far too many applications at once, your memory-intensive operation that usually enjoys nice warmth and locality suddenly begins waiting for the disk on the majority of its memory accesses, due to demand paging? These things happen all the time.
In each of these situations, you can end up paying many more orders of magnitude in cost than you expected under ordinary circumstances. The lock acquisition that usually took 100 CPU cycles now takes several million cycles (as long as a network roundtrip), and the network operation that is usually measured in milliseconds is now measured in tens of seconds, as the software painfully waits for the operation to time out. And your "non-blocking" memory-intensive algorithm on the UI thread just caused a hang, because it’s paging like crazy.
You’ve experienced these problems as a user of modern software, I am sure, and it isn’t fun. An hourglass, spinning donut, unresponsive button clicks, "(Not Responsive)" title bars, and bleachy white screens. An important measurement of a programmer’s worth is how good the code they write operates under the extreme and rare circumstances. Because, truth be told, when you have a large user-base, these circumstances aren’t that rare after all. This is more of a "performance in the large" thing, but it turns out that the end result is delivered as a result of many "performance in the small" decisions adding up. A developer wrote code meant to be used in a particular way, but decided what order of magnitude was reasonable based on best case, … and gave no thought to the worst case.
Using the right data structure for the job
This is motherhood and apple pie, Computer Science 101, … bad clichés abound. And yet so many programmers get this wrong, because they simply don’t give it enough thought.
One of my favorite books on the topic ("Programming Pearls") has this to say about them:
"Most programmers have seen them, and most good programmers realize they’ve written at least one. They are huge, messy, ugly programs that should have been short, clean, beautiful programs."
I’ll add one adjective to the "short, clean, beautiful" list: fast.
Data structures drive storage and access behavior, both strongly affecting the size and speed of algorithms and components that make use of them. Worst case really does matter. This too is a case where the right choice will boost not only performance but also the cleanliness of the program.
I’m actually not going to spend too much time on this; when I said this is CS101, I meant it. However, it is crucial to be intentional and smart in this choice. Validate assumptions, and measure.
Ironically, in my experience, many senior programmers can make frighteningly bad data structure choices, often because they are more apt to choose a sophisticated and yet woefully inappropriate one. They may choose a linked list, for example, because they want zero-allocation element linking via an embedded next pointer. And yet they then end up with many lists traversals throughout the program, where a dense array representation would have been well worth the extra allocation. The naïve programmer would have happily new’d up a List<T>, and avoided some common pitfalls; yet, here the senior guy is working as hard as humanly possible to avoid a single extra allocation. They over-optimized in one dimension, and ignored many others that mattered more.
This same class of programmer may choose a very complicated lock-free data structure for sharing elements between threads, incurring many more object allocations (and thus increased GC pressure), and a large number of expensive interlocked operations scattered throughout the code. The sexy lure of lock-freedom tricked them into making a bad choice. Perhaps they didn’t quite understand that locks and lock-free data structures share many costs in common. Or perhaps they just hoped to get lucky and squeeze out out-of-this-world scalability thanks to lock-freedom, without actually considering the access patterns necessary to lead to such scalability and whether their program employed them.
These are often held up as examples of "premature optimization", but I hold them up as examples of "careless optimization". The double kicker here is that the time spent building the more complicated solution would have been better spent carefully thinking and measuring, and ultimately deciding not to be overly clever in the first place. This most often plagues the mid-range programmer, who is just smart enough to know about a vast array of techniques, but not yet mature enough to know when not to employ them.
A different, better-performing approach
It’s an all-too-common occurrence. I’ll give code review feedback, asking "Why didn’t you take approach B? It seems to be just as clear, and yet obviously has superior performance." Again, this is in a circumstance where I believe the difference matters, given the order of magnitude that matters for the code in question. And I’ll get a response, "Premature optimization is the root of all evil." At which point I want to smack this certain someone upside the head, because it’s such a lame answer.
The real answer is that the programmer didn’t stop to carefully consider alternatives before coding up solution A. (To be fair, sometimes good solutions evade the best of us.) The reality is that the alternative approach should have been taken; it may be true that it’s "too late" because the implications of the original decision were nontrivial and perhaps far-reaching, but that is too often an unfortunate consequence of not taking the care and thought to do it right in the first place.
These kinds of "peanut butter" problems add up in a hard to identify way. Your performance profiler may not obviously point out the effect of such a bad choice so that it’s staring you in your face. Rather than making one routine 1000% slower, you may have made your entire program 3% slower. Make enough of these sorts of decisions, and you will have dug yourself a hole deep enough to take a considerable percentage of the original development time just digging out. I don’t know about you, but I prefer to clean my house incrementally and regularly rather than letting garbage pile up to the ceilings, with flies buzzing around, before taking out the trash. Simply put, all great software programmers I know are proactive in writing clear, clean, and smart code. They pour love into the code they write.
In this day and age, where mobility and therefore power is king, instructions matter. My boss is fond of saying "the most performant instruction is the one you didn’t have to execute." And it’s true. The best way to save battery power on mobile phones is to execute less code to get the same job done.
To take an example of a technology that I am quite supportive of, but that makes writing inefficient code very easy, let’s look at LINQ-to-Objects. Quick, how many inefficiencies are introduced by this code?
int[] Scale(int[] inputs, int lo, int hi, int c) {
var results = from x in inputs
where (x >= lo) && (x <= hi)
select (x * c);
return results.ToArray();
}
It’s hard to account for them all.
There are two delegate object allocations, one for the call to Enumerable.Where and the other for the call to Enumerable.Select. These delegates point to potentially two distinct closure objects, each of which has captured enclosing variables. These closure objects are instances of new classes, which occupy nontrivial space in both the binary and at runtime. (And of course, the arguments are now stored in two places, must be copied to the closure objects, and then we must incur extra indirections each time we access them.) In all likelihood, the Where and Select operators are going to allocate new IEnumerable and new IEnumerator objects. For each element in the input, the Where operator will make two interface method calls, one to IEnumerator.MoveNext and the other to IEnumerator.get_Current. It will then make a delegate call, which is slightly more expensive than a virtual method call on the CLR. For each element for which the Where delegate returns ‘true’, the Select operator will have likewise made two interface method calls, in addition to another delegate invocation. Oh, and the implementations of these likely use C# iterators, which produce relatively fat code, and are implemented as state machines which will incur more overhead (switch statements, state variables, etc.) than a hand-written implementation.
Wow. And we aren’t even done yet. The ToArray method doesn’t know the size of the output, so it must make many allocations. It’ll start with 4 elements, and then keep doubling and copying elements as necessary. And we may end up with excess storage. If we end up with 33,000 elements, for example, we will waste about 128KB of dynamic storage (32,000 X 4-byte ints).
A programmer may have written this routine this way because he or she has recently discovered LINQ, or has heard of the benefits of writing code declaratively. And/or he or she may have decided to introduce a more general purpose implementation of a Scale API versus doing something specific to the use in the particular program that Scale will be immediately used in. This is a great example of why premature generalization is often at odds with writing efficient code.
Imagine an alternative universe, on the other hand, where Scale will only get used once and therefore we can take advantage of certain properties of its usage. Namely, perhaps the input array need not be preserved, and instead we can update the elements matching the criteria in place:
void ScaleInPlace(int[] inputs, int lo, int hi, int c) {
for (int i = 0; i < inputs.Length; i++) {
if ((inputs[i] >= lo) & (inputs[i] <= hi)) {
inputs[i] *= c;
}
}
}
A quick-and-dirty benchmark shows this to be an order of magnitude faster. Again, is it an order of magnitude that you care about? Perhaps not. See my earlier thoughts on that particular topic. But if you care about costs in the 100s or 1000s of cycles range, you probably want to pay heed.
Now, I’m not trying to take potshots at LINQ. It was just an example. In fact, I spent 3 years running a team that delivered PLINQ, a parallel execution engine for LINQ-to-Objects. LINQ is great where you can afford it, and/or where the alternatives do not offer ridiculously better performance. For example, if you can’t do in-place updates, functionally producing new data is going to require allocations no matter which way you slice it. But having watched people using PLINQ, I have witnessed numerous occasions where an inordinately expensive query was made 8-times faster by parallelizing… where the trivial refactoring into a slimmed down algorithm with proper data structures would have speed the code up by 100-fold. Parallelizing a piggy piece of code to make it faster merely uses more of the machine to get the same job done, and will negatively impact power, resource management, and utilization.
Another view is that writing code in this declarative manner is better, because it’ll just get faster as the compiler and runtimes enjoy new optimizations. This sounds nice, and seems like taking a high road of some sort. But what usually matters is today: how does the code perform using the latest and greatest technology available today. And if you scratch underneath the surface, it turns out that most of these optimizations are what I call "science fiction" and unlikely to happen. If you write redundant asserts at twenty adjacent layers of your program, well, you’re probably going to pay for them. If you allocate objects like they are cheap apples growing on trees, you’re going to pay for them. True, optimizations might make things faster over time, but usually not in the way you expect and usually not by orders of magnitude unless you are lucky.
A colleague of mine used to call C a WYWIWYG language—"what you write is what you get"—wherein each line of code roughly mapped one-to-one, in a self-evident way, with a corresponding handful of assembly instructions. This is a stark contrast to C#, wherein a single line of code can allocate many objects and have an impact to the surrounding code by introducing numerous silent indirections. For this reason alone, understanding what things cost and paying attention to them is admittedly more difficult – and arguably more important – in C# than it was back in the good ole’ C days. ILDASM is your friend … as is the disassembler. Yes, good systems programmers regularly look at the assembly code generated by the .NET JIT. Don’t assume it generates the code you think it does.
Gratuitous memory allocations
I love C#. I really do. I was reading my "Secure Coding in C and C++" book for fun this weekend, and it reminded me how many of those security vulnerabilities are eliminated by construction thanks to type- and memory-safety.
But the one thing I don’t love is how easy and invisible it makes heap allocations.
The very fact that C++ memory management is painful means most C++ programmers are overly-conscious about allocation and footprint. Having to opt-in to using pointers means developers are conscious about indirections, rather than having them everywhere by default. These are qualitative, hard-to-backup statements, but in my experience they are quite true. It’s also cultural.
Because they are so easy, it’s doubly important to be on the lookout for allocations in C#. Each one adds to a hard-to-quantify debt that must be repaid later on when a GC subsequently scans the heap looking for garbage. An API may appear cheap to invoke, but it may have allocated an inordinate amount of memory whose cost is only paid for down the line. This is certainly not "paying it forward."
It’s never been so easy to read a GB worth of data into memory and then haphazardly leave it hanging around for the GC to take care of at some indeterminate point in the future as it is in .NET. Or an entire list of said data. Too many times a .NET program holds onto GBs of working set, when a more memory conscientious approach would have been to devise an incremental loading strategy, employ a denser representation of this data, or some combination of the two. But, hey, memory is plentiful and cheap! And in the worst case, paging to disk is free! Right? Wrong. Think about the worst case.
Depending on the size of the objects allocated, how long they remain live, and how many processors are being used, the subsequent GCs necessary to clean up incessant allocations may impact a program in a difficult to predict way. Allocating a bunch of very large objects that live for long enough to make it out of the nursery, but not forever, for instance, is one of the worst evils you can do. This is known as mid-life crisis. You either want really short-lived objects or really long-lived ones. But in any case, it really matters: the LINQ example earlier shows how easy it is to allocate crap without seeing it in the code.
If I could do it all over again, I would make some changes to C#. I would try to keep pointers, and merely not allow you to free them. Indirections would be explicit. The reference type versus value type distinction would not exist; whether something was a reference would work like C++, i.e. you get to decide. Things get tricky when you start allocating things on the stack, because of the lifetime issues, so we’d probably only support stack allocation for a subset of primitive types. (Or we’d employ conservative escape analysis.) Anyway, the point here is to illustrate that in such a world, you’d be more conscious about how data is laid out in memory, encouraging dense representations over sparse and pointer rich data structures silently spreading all over the place. We don't live in this world, so pretend as though we do; each time you see a reference to an object, think "indirection!" to yourself and react as you would in C++ when you see a pointer dereference.
Allocations are not always bad, of course. It’s easy to become paranoid here. You need to understand your memory manager to know for sure. Most GC-based systems, for example, are heavily tuned to make successive small object allocations very fast. So if you’re programming in C#, you’ll get less "bang for your buck" by fusing a large number of contiguous object allocations into a small number of large object allocations that ultimately occupy an equivalent amount of space, particularly if those objects are short-lived. Lots of little garbage tends to be pretty painless, at least relatively.
Variable latency and asynchrony
There aren’t many ways to introduce a multisecond delay into your program at a moment’s notice. But I/O can do just that.
Code with highly variable latency is dangerous, because it can have dramatically different performance characteristics depending on numerous variables, many of which are driven by environmental conditions outside of your program’s control. As such it is immensely important to document where such variable latency can occur, and to program defensively against it happening.
For example, imagine a team of twenty programmers building some desktop application. The team is just large enough that no one person can understand the full details of how the entire system works. So you’ve got to compose many pieces together. (As I mentioned earlier, composition can lead to hard-to-predict performance characteristics.) Programmer Alice is responsible for serving up a list of fonts, and Programmer Bob is consuming that list to paint it on the UI. Does Bob know what it takes to fetch the list of fonts? Probably not. Does Alice know the full set of concerns that Bob must think about to deliver a responsive UI, like incremental repaints, progress reporting, and cancellation? Probably not. So Alice does the best she knows how to do: she hits the cache, when the font cache is fully populated, and falls back to fetching fonts from the printer otherwise. She returns a List<Font> object from her API. Now Bob just calls the API and paints the results on the UI; the call appears to be quite snappy in his testing. Unfortunately, when the cache is not populated, the "quick" cache hit turns into a series of network roundtrips with the printer; that hangs the UI, but only for a few milliseconds. Even worse, when the printer is accidentally offline, perhaps due to a power outage, the UI freezes for twenty seconds, because that’s the hard-coded timeout. Ouch!
This situation happens all the time. It’s one of the most common causes of UI hangs.
If you’re programming in an environment where asynchrony is first class, Alice could have advertised the fact that, under worst-case circumstances, fetching the font list would take some time. If she were programming in .NET, for example, she’d return a Task<List<Font>> rather than a List<Font>. The API would then be self-documenting, and Bob would know that waiting for the task’s result is dangerous business. He knows, of course, that blocking the UI thread often leads to responsiveness problems. So he would instead use the ContinueWith API to rendezvous with the results once they become available. And Bob may now know he needs to go back and work more closely with Alice on this interface: to ensure cancellation is wired up correctly, and to design a richer interface that facilitates incremental painting and progress reporting.
Variable latency is not just problematic for responsiveness reasons. If I/O is expressed synchronously, a program cannot efficiently overlap many of them. Imagine we must make three network calls as part of completing an operation, and that each one will take 25 milliseconds to complete. If we do synchronous I/O, the whole operation will take at least 75 milliseconds. If we launch do asynchronous I/O, on the other hand, the operation may take as few as 25 milliseconds to finish. That’s huge.
If I had my druthers, all I/O would be asynchronous. But that’s not where we are today.
The concern is not limited to just I/O, of course. Compute- and memory-bound work can quickly turn into variable latency work, particularly under stressful situations like when an application is paging. So truthfully any abstraction doing "heavy lifting" should offer an asynchronous alternative.
Examples of "bad" optimizations
It is easy to take it too far. Even if you’re shaving off cycles where each-and-every cycle matters, you may be doing the wrong thing. If anything, I hope this article has convinced you to be thoughtful, and to strive to strike a healthy balance between beautiful code and performance.
Anytime the optimization sacrifices maintainability, it is highly suspect. Indeed, many such optimizations are superficial and may not actually improve the resulting code’s performance.
The worst category of optimization is one that can lead to brittle and insecure code.
One example of this is heavy reliance on stack allocation. In C and C++, doing stack allocation of buffers often leads to difficult choices, like fixing the buffer size and writing to it in place. There is perhaps no single technique that, over the years, has led to the most buffer overrun-based exploits. Not only that, but stack overflow in Windows programs is quite catastrophic, and increases in likelihood the more stack allocation that a program does. So doing _alloca in C++ or stackalloc in C# is really just playing with fire, particularly for dynamically sized and potentially big allocations.
Another example is using unsafe code in C#. I can’t tell you how many times I’ve seen programmers employ unsafe pointer arithmetic to avoid the automatic bounds checking generated by the CLR JIT compiler. It is true that in some circumstances this can be a win. But it is also true that most programmers who do this never bothered to crack open the resulting assembly to see that the JIT compiler does a fairly decent job at automatic bounds check hoisting. This is an example where the cost of the optimization outweighs the benefits in most circumstances. The cost to pin memory, the risk of heap corruption due to a failure to properly pin memory or an offset error, and the complication in the code, are all just not worth it. Unless you really have actually measured and found the routine to be a problem.
If it smells way too complicated, it probably is.
In conclusion
I’m not saying Knuth didn’t have a good point. He did. But the "premature optimization is the root of all evil" pop-culture and witty statement is not a license to ignore performance altogether. It’s not justification to be sloppy about writing code. Proactive attention to performance is priceless, especially for certain kinds of product- and systems-level software.
My hope is that this article has helped to instill a better sense, or reinforce an existing good sense, for what matters, where it matters, and why it matters when it comes to performance. Before you write a line of code, you really need to have a budget for what you can afford; and, as you write your code, you must know what that code costs, and keep a constant mental tally of how much of that budget has been spent. Don’t exceed the budget, and most certainly do not ignore it and just hope that wishful thinking will save your behind. Building up this debt will cost you down the road, I promise. And ultimately, test-driven development works for performance too; you will at least know right away once you have exceeded your budget.
Think about worst case performance. It’s not the only thing that matters, but particularly when the best and worst case differ by an order of magnitude, you will probably need to think more holistically about the composition of caller and callee while building a larger program out of constituent parts.
And lastly, the productivity and safety gains of managed code, thanks to nice abstractions, type- and memory-safety, and automatic memory management, do not have to come at the expense of performance. Indeed this is a stereotype that performance conscious programmers are in a position to break down. All you need to do is slow down and be thoughtful about each and every line of code you write. Remember, programming is as much engineering as it is an art. Measure, measure, measure; and, of course, be thoughtful, intentional, and responsible in crafting beautiful and performant code.
 Sunday, July 11, 2010
That immutability facilitates increased degrees of concurrency is an oft-cited dictum. But is it true? And either way, why?
My view on this matter may be a controversial one. Immutability is an important foundational tool in the toolkit for building concurrent – in addition to reliable and predictable – software. But it is not the only one that matters. Making all your data immutable isn’t going to instantly lead to a massively scalable program. Natural isolation is also critically important, perhaps more so. And, as it turns out, sometimes mutability is just what the doctor ordered, as with large-scale data parallelism.
Isolation first; immutability second; synchronization last
Stepping back for a moment, the recipe for concurrency is rather simple. Say you’ve got multiple concurrent pieces of work running simultaneously (or have a goal of getting there); for discussion’s sake, call them tasks. Take two tasks. The first critical decision has two cases: either these tasks concurrently access overlapping data in shared-memory, or they do not. If they do not, they are isolated, and no precautions associated with racing memory updates are needed. If they do share data, on the other hand, then something else must give. If all concurrently accessed data is immutable, or all functions used to interact with data are pure, then dangerous concurrency hazards are avoided. All is well. If some data is mutable, however, then this is where things get tricky, and higher-level synchronization is needed to make accesses safe. This decision tree is straightforward and clear.
I have listed those four attributes – isolated, immutable and pure, and synchronization – in a very intentional order. Thankfully, this order mirrors the natural top-down hierarchical architecture of most modern object- and component-based programs: we have large containers that communicate through well-defined interfaces, each comprised of layers of such containers, and somewhere towards the leaves, a fair amount of intimate commingling of knowledge regarding data and invariants.
This order also reflects the order of complexity and execution-time costs, from least to most. Isolation is simple, because components depend on each other in loosely-coupled ways, and in fact scales superiorly in a concurrent program because no synchronization is necessary, the “right” data structure may be chosen for the job – immutable or not – and locality is part-and-parcel to the architecture. Immutability at least avoids the morass of synchronization, which can affect programs immensely in complexity, runtime overheads, and write-contention for shared data. It is clear that synchronization is something to avoid at all costs, particularly anything done in an ad-hoc manner like locks.
Making the concurrency
But where did all this concurrency come from, anyway?
It came from two things:
- The coarse-grained breakdown of a program into isolated pieces.
- The fine-grained data parallelism.
On #1: Program fragments that are isolated are already half-way down the road to running concurrently as tasks. The second half of this journey, of course, is teaching them to interact with one another asynchronously, most frequently through message-passing or by sticking them into a pipeline. The details of course depend on what programming language you are using. It may be through agents, actors, active objects, COM objects, EJBs, CCR receivers, web-services, something ad-hoc built with .NET tasks, or some other reification. Nevertheless the isolation is common to all these.
On #2: Data parallelism, it turns out, often works best with mutable data structures. These structures must be partitionable, of course, so that tasks comprising the data parallel operations may operate with logically isolated chunks of this data safely, even if they are parts of the same physical data structure. So chunks of them are isolated even though they don’t appear to be. This is trivially achievable with many important parallel-friendly data structures like arrays, vectors, and matrices. Capturing this isolation in the type system is of course no small task, though region typing gets close (see UIUC’s Data Parallel Java).
But you usually don’t want these structures to be immutable, because they can be modified in constant-time and space if they are their classic simple mutable forms. Programmers doing HPC-style data-parallelism a la FORTRAN, vectorization, and GPGPU know this quite well. Compare this a world where we are doing data-parallelism over immutable data structures, where modifications often necessitate allocations or more complicated big-oh times due to clever techniques meant to avoid such allocations, as with persistent immutable data structures. This is likely less ideal. It is true that some data parallel operations are not in-place against mutable data – as with PLINQ – at which point purity, but not immutability, is key. The two are related but not identical: immutability pervades the construction of data structures, whereas purity pervades the construction of functions. But if you can get by with one copy of the data, why not do it? Particularly since most datasets amenable to parallel speedups are quite large.
Immutability: the bricks, not the mortar
Notice that the concurrency did not actually come from immutable data structures in either case, however. So what are they good for?
One obvious use, which has little to do with concurrency, is to enforce characteristics of particular data structures in a program. A translation lookup table may not have been meant to be written to except for initialization time, and using an immutable data structure is a wonderful way to enforce this intent.
What about concurrency? Immutable data structures facilitate sharing data amongst otherwise isolated tasks in an efficient zero-copy manner. No synchronization necessary. This is the real payoff.
For example, say we’ve got a document-editor and would like to launch a background task that does spellchecking in parallel. How will the spellchecker concurrently access the document, given that the user may continue editing it simultaneously? Likely we will use an immutable data structure to hold some interesting document state, such as storing text in a piece-table. OneNote, Visual Studio, and many other document-editors use this technique. This is zero-cost snapshot isolation.
Not having immutability in this particular scenario is immensely painful. Isolation won’t work very well. You could model the document as a task, and require the spellchecker to interact with it using messages. Chattiness would be a concern. And, worse, the spellchecker’s messages may now interleave with other messages, like a user editing the document. Those kinds of message-passing races are non-trivial to deal with. Synchronization won’t work well either. Clearly we don’t want to lock the user out of editing his or her document just because spellchecking is occurring. Such a boneheaded design is what leads to spinning donuts, bleached-white screens, and “(Not Responding)” title bars. But clearly we don’t want to acquire a lock and then make a full copy of the entire document. Perhaps we’d try to copy just what is visible on the screen. This is a dangerous game to play.
Immutability does not solve all of the problems in this scenario, however. Snapshots of any kind lead to a subtle issue that is familiar to those with experience doing multimaster, in which multiple parties have conflicting views on what “the” data ought to be, and in which these views must be reconciled.
In this particular case, the spellchecker sends the results back to the task which spawned it, and presumably owns the document, when it has finished checking some portion of the document. Because the spellchecker was working with an immutable snapshot, however, its answer may now be out-of-date. We have turned the need to deal with message-level interleaving – as described above – into the need to deal with all of the messages that may have interleaved within a window of time. This is where multimaster techniques, such as diffing and merging come into play. Other techniques can be used, of course, like cancelling and ignoring out-of-date results. But it is clear something intentional must be done.
In conclusion
It is safe to say that immutability facilitates important concurrent architectures and algorithms. It can really help big time, for sure. But it is clearly no panacea. Whether mutability or immutability is the right choice for a particular data structure in your program, as with all things, depends.
It could be the case that choosing a piece-table for storing your text facilitates large-scales of concurrency in version two of your software application, but that in version one you have no use for it. Making that call ahead of time may pay in spades down the road, even if it comes at a marginal cost up-front. Or it could be that choosing an immutable data structure costs you in time and space, and you never end up exploiting the fact that you could have shared that particular structure in a zero-cost way across agents in your program.
One thing’s for sure: I’m glad to be programming in languages like C#, F#, Clojure, and Scala, where I’ve got a choice.
 Thursday, July 01, 2010
In .NET today, readonly/initonly-ness is in the eye of the provider. Not the beholder.
Although both C# and the CLR verifier go to great pains to ensure you don't change a readonly/initonly field outside of its constructor (or class constructor, in the case of a static field), this guarantee doesn't imply what you might imagine. It means what it says: you can't change such fields except for in certain contexts.
If you try, C# won't let you, including forming byrefs to them:
v.cs(0,0): error CS0191: A readonly field cannot be assigned to (except in a constructor or a variable initializer)
v.cs(0,0): error CS0192: A readonly field cannot be passed ref or out (except in a constructor)
v.cs(0,0): error CS0198: A static readonly field cannot be assigned to (except in a static constructor or a variable initializer)
v.cs(0,0): error CS0199: A static readonly field cannot be passed ref or out (except in a static constructor)
And neither will the CLR verifier:
[IL]: Error: [c:\v.exe : C::Main][offset 0x00000001] Cannot change initonly
field outside its .ctor.
Of course, attempting to invoke an operation on a readonly struct will make a defensive copy locally, and invoke the method against that. This ensures the readonly contents cannot change.
One unfortunate hole in this safety is with unions. You do not need unsafe code to break readonly, and yet the effect is the same as with an unverifiable program that writes to a readonly field:
struct S1 {
public readonly int X;
}
struct S2 {
public int X;
}
[StructLayout(LayoutKind.Explicit)]
struct S3 {
[FieldOffset(0)]
public S1 A;
[FieldOffset(0)]
public S2 B;
}
Now we can change A.X via B.X, even though A.X is supposedly readonly:
S3 s3 = ...;
int x = s3.A.X;
s3.B.X++;
ASSERT(x == s3.A.X); // false; it is +1
The same would have been true even if the field S3.A was marked readonly.
This is quite an evil trick. I have to be honest that I believe this is a CIL verification hole, and should produce unverifiable MSIL much like when you try to overlay structs containing overlapping GC references. Nevertheless, it is what it is.
Let's step back. Why does all of this matter, anyway, and what guarantees were we hoping that readonly would provide?
It would be ideal, I assert, if the guarantee was not just "the target field can only be written to in the constructor", but also "the target field, once read, cannot be observed with a different value later on". This would not be true during construction, but we'd like to say it holds at all other times.
The above example throws a wrench in this idea. As does the following example. But this new example will be more disturbing, because the solution is not a simple verifier change.
What would you expect this program to print to the console?
struct S {
public readonly int X;
public S(int x) { X = x; }
public void MultiplyInto(int c, out S target) {
System.Console.WriteLine(this.X);
target = new S(X * c);
System.Console.WriteLine(this.X); // same? it is, after all, readonly.
}
}
S s = new S(42);
s.MultiplyInto(10, out s);
As you may or may not have guessed, the output is "42" followed by "420". Yes, the value of 'this.X' changes after we have assigned to 'target' inside MultiplyTo, because the caller aliases the out-param with the 'this' param. Recall that parameter passing for structs in C# is done byref, so that these two references actually physically point to the same location when that call is made. The assignment to 'target', therefore, actually replaces the entire contents of 'this' all at once. And hence this gives the illusion that readonly fields are shifting.
You might be tempted to say that this can be prevented with alias analysis. But this is deceptively difficult to do. Consider this more complicated example:
class C {
public struct S S;
}
void M1(C c) {
M2(c, out c.S);
}
void M2(C c, out S s) {
c.S.MultiplyInto(10, out s);
}
It is in no way clear inside M2 that the two aliases refer to the same location. The aliasing occurred higher up in the stack. Although byrefs are restricted to stack-only passing, making the necessary alias analysis tantalizingly close to attainable, it is nontrivial to say the least. Presumably we would have had to have blocked the forming of the byref within M1, rather than its use within M2. We could fall back to runtime checks, but that is also unfortunate for numerous reasons.
The moral of the story? Structs as containers of readonly values are not to be trusted, at least not for situations that call for bulletproof safety, such as caching values in the compiler rather than rereading them, because the fields are readonly. Although C# and the CLR do a good job at verifying readonly/initonly are done right at the initialization site, there are still places where these guarantees break down. Thankfully the byref aliasing problem does not threaten thread-safety, but the union problem does. And in conclusion, I do have to imagine all of this will get fixed somewhere down the road, it's just a matter of when and where.
 Sunday, June 27, 2010
Partially-constructed objects are a constant source of difficulty in object-oriented systems. These are objects whose construction has not completed in its entirety prior to another piece of code trying to use them. Such uses are often error-prone, because the object in question is likely not in a consistent state. Because this situation is comparatively rare in the wild, however, most people (safely) ignore or remain ignorant of the problem. Until one day they get bitten.
Not only are partially-constructed objects a source of consternation for everyday programmers, they are also a challenge for language designers wanting to provide guarantees around invariants, immutability and concurrency-safety, and non-nullability. We shall see examples below why this is true. The world would be better off if partially-constructed objects did not exist. Thankfully there is some interesting prior art that moves us in this direction from which to learn.
Seeing such a beast in the wild
In what situations might you see a partially-constructed object? There are two common ones in C++ and C#:
- ‘this’ is leaked out of a constructor to some code that assumes the object has been initialized.
- A failure partway through an object’s construction leads to its destructor or finalizer running against a partially-constructed object.
In the first case, the rule of thumb is “don’t do that.” This is easier said than done. The second case, on the other hand, is a fact of life, and the rule of thumb is “tread with care, and be intentional.” Let’s examine both more closely.
The evils of leaking ‘this’
Leaking ‘this’ during construction to code that expects to see a fully-initialized object is a terrible practice. Before moving on, it’s important to remember initialization order in C++ and C#: base constructors run first, and then more derived constructors. If I have E subclasses D subclasses C, then constructing an instance of E will run C’s constructor, and then D’s, and then lastly E’s. Destructors in C++, of course, run in the reverse order.
Member initializers, on the other hand, run in different orders in C++ versus C#. In C#, they run from most derived first, to least derived. So in the above situation, E’s initializers run, and then D’s, and then C’s. This happens fully before running the ad-hoc constructor code. In C++, however, member initializers run alongside the ordinary construction process. C’s member initializers run just before C’s ad-hoc construction code, and then D’s, and then E’s. Another difference is that C#’s initializers cannot access ‘this’, whereas C++’s initializers can.
For example, this C# program will print E_init, D_init, C_init, C_ctor, D_ctor, and then E_ctor:
using System;
class C {
int x = M();
public C() {
Console.WriteLine("C_ctor");
}
private static int M() {
Console.WriteLine("C_init");
return 42;
}
}
class D : C {
int x = M();
public D() : base() {
Console.WriteLine("D_ctor");
}
private static int M() {
Console.WriteLine("D_init");
return 42;
}
}
class E : D {
int x = M();
public E() : base() {
Console.WriteLine("E_ctor");
}
private static int M() {
Console.WriteLine("E_init");
return 42;
}
}
class Program {
public static void Main() {
new E();
}
}
And this C++ program will print C_init, C_ctor, D_init, D_ctor, E_init, E_ctor, ~E, ~D, and finally ~C:
#include
using namespace std;
struct C {
int x;
C() : x(M()) { cout << "C_ctor" << endl; }
~C() { cout << "~C" << endl; }
static int M() { cout << "C_init" << endl; return 42; }
};
struct D : C {
int x;
D(): x(M()) { cout << "D_ctor" << endl; }
~D() { cout << "~D" << endl; }
static int M() { cout << "D_init" << endl; return 42; }
};
struct E : D {
int x;
E() : x(M()) { cout << "E_ctor" << endl; }
~E() { cout << "~E" << endl; }
static int M() { cout << "E_init" << endl; return 42; }
};
static void main() {
E e;
}
It’s interesting to note that the CLR allows constructor chaining to happen in any order. The C# compiler emits the calls to base as the first thing a constructor does, but other languages can choose to do differently. The verifier ensures that a call has occurred somewhere in the constructor body before returning.
This IL program, for example, will print E, D, and then C – the reverse of what C# gives us:
.assembly extern mscorlib { }
.assembly ctor { }
.class C {
.method public specialname rtspecialname instance void .ctor() cil managed {
ldstr "C"
call void [mscorlib]System.Console::WriteLine(string)
ldarg.0
call instance void [mscorlib]System.Object::.ctor()
ret
}
}
.class D extends C {
.method public specialname rtspecialname instance void .ctor() cil managed {
ldstr "D"
call void [mscorlib]System.Console::WriteLine(string)
ldarg.0
call instance void C::.ctor()
ret
}
}
.class E extends D {
.method public specialname rtspecialname instance void .ctor() cil managed {
ldstr "E"
call void [mscorlib]System.Console::WriteLine(string)
ldarg.0
call instance void D::.ctor()
ret
}
}
.class Program {
.method public static void Main() cil managed {
.entrypoint
newobj instance void E::.ctor()
pop
ret
}
}
So why is leaking ‘this’ bad, anyway?
Say you’ve decided in the implementation of D’s constructor that you would like to stick ‘this’ into a global hash-map. Doing this sadly means other code could grab the pointer and begin accessing it before E’s constructor has even run. That is a race at-best and a ticking time-bomb in all likelihood. For example:
class C {
public static Dictionary s_globalLookup;
private readonly int m_key;
public C(int key) {
m_key = key;
s_globalLookup.Add(key, this);
}
}
Even though we have taken great care to initialize our readonly field m_key before sticking ‘this’ into a dictionary, any subclasses will not have been initialized at this point. Only if C is sealed can we be assured of this. Another piece of code that grabs the element out of the hashtable and begins calling virtual methods on it, for example, is in a race with the completion of the initialization code for subclasses.
Leaking ‘this’ isn’t always such an explicit act. Merely invoking a virtual method within the constructor may dispatch a more derived class’s override before the more derived class’s constructor has run. And therefore its state is most likely not in a place conducive to correct execution of that override. It is fairly common knowledge that invoking virtual methods during construction is an extraordinarily poor practice, and best avoided.
Failure to construct
Let’s move on to the second issue. Imagine we suffer an exception during construction of an object. Perhaps this is due to a failure to allocate resources, or maybe even due to argument validation. It is clear that a leaked ‘this’ object in such cases will be a problem, because the object will have escaped into the wild even though its initialization failed. Subsequent attempts to use the object will obviously pose problems. What is more subtle is that if the class in question declares a destructor (in C++) or finalizer (in C#), a problem may now be lurking.
Let’s say we have the original situation shown above: C derives from D derives from E. Now say an exception happens within D’s constructor. At this point in time, C’s constructor has run to completion, D’s constructor has run partially up to the point of failure, and E’s has not run at all. (And, of course, in the case of C#, all member-initializers for all classes have actually run.) What happens to the cleanup code?
In C++, only constructors that have run will have their corresponding destructors executed. In the above situation, where C, D, and E each declares a destructor, only C’s will be run during stack unwind. It is imperative, therefore, that D handles failure within its constructor rather than relying on the destructor. For example:
class D : C {
int* m_pBuf1;
int* m_pBuf2;
public:
D() {
m_pBuf1 = … alloc …;
m_pBuf2 = … alloc …;
}
~D() {
if (m_pBuf2) … free …;
if (m_pBuf1) … free …;
}
}
If the allocation destined for m_pBuf2 fails by throwing an exception, the destructor for D will not run, and therefore m_pBuf1 will leak. The C++ solution to this particular example is to use smart pointers and member initializers for the allocations, because successfully initialized members do get destructed, even when the constructor (or indeed one of the member initializers) subsequently fails. This means that destructors for a particular class do not have to tolerate that class’s state not having been fully constructed, because those destructors will never run. Destructors must not, however, invoke virtuals, for two obvious reasons: (a) subclasses may not have ever been initialized, and (b) destructors run in reverse order, and so the destruction code for the subclass will have already run by the time a baseclass’s destructor runs.
In C#, finalizers will run, regardless of whether an object’s constructor ran fully, partway through, or not at all. If the object is allocated – and so long as GC.SuppressFinalize isn’t called on it – the finalizer runs. This distinctly means that C# finalizers must always be resilient to partially-constructed objects (unlike C++ destructors). Thankfully finalizers are a rare bird, and therefore this issue is seldom even noticed by .NET programmers.
This issue does not arise in the case of .NET’s IDisposable pattern. If a constructor throws, the assignment to the target local variable does not occur. And therefore the variable enclosed in, say, a using statement remains null. This means that there is no way to possibly invoke Dispose on the object. Moreover, the allocation in using occurs prior to entering the try/finally block. Hence, you really had better be writing constructors that don’t throw and/or protecting such resources with smart-pointer-like things with finalizers, a la SafeHandle.
Impediments to language support
As if these weren’t sufficient cause for concern, I also mentioned earlier – and somewhat vaguely – that partially-constructed objects interfere with language designers’ efforts to add invariants, immutability and concurrency-safety, and non-nullability to the language. And all of these are important properties to consider in our present age of complexity and concurrency, so this point is worth understanding more deeply. Let’s look at each in-order.
Invariants and safe-points
A partially-constructed object obviously may have broken invariants. By definition, invariants are meant to hold at the end of construction, and so if construction never completes, the rules of engagement are being broken.
Imagine, for example:
class C {
int m_x;
int m_y;
invariant m_x < m_y;
public C(int a) {
m_x = a;
m_y = a + 1;
}
}
It is ordinarily very difficult to ensure that each instruction atomically transitions the state of an object from one invariant safe-point to another. A common technique is to define well-defined points at which invariants must hold. We might model each failure point as one such technique. But even in C#, the above program does not satisfy this constraint, because an overflow exception may be generated at the ‘m_y = a + 1’ line. Or a thread-abort exception may be generated right in the middle of those two functions. Or, if addition were implemented as a JIT helper, an out-of-memory exception could get generated due to failure to JIT the helper function.
In such cases, we’d like to say that the object does not exist. But the sad fact is that the object *does* exist, and if the object has acquired physical resources at the time of failure to construct, we must compensate and reclaim them. The ideal world looks a lot like object construction as transactions, where the end of construction is the commit-point. The state-of-the-art is very different from this, however, and so any static verification and theorem proving that depends on invariants on an object holding, well, invariantly, is subject to being broken by partially-constructed objects.
Immutability… or not
Immutability is also threatened by partially-constructed objects. Immutability is a one of many first class techniques for solving concurrency-safety in the language, so this one is quite unfortunate.
In C#, for example, we might be tempted to say that a shallowly immutable type is one whose fields are all marked readonly. (And a deeply immutable type is one whose fields are all readonly, and also refer to types that are immutable.) A readonly field cannot be written to after construction has completed. Unfortunately, if the ‘this’ leaks out during construction, we may see those readonly fields in an uninitialized or even changing state:
class C {
public static C s_c;
readonly int m_x;
public C() {
m_x = 42;
s_c = this;
while (true) {
++m_x;
}
}
}
This is quite evidently a terrible and malicious program. C appears to be immutable, because it only contains readonly fields, but is quite clearly not, because the value of m_x is assigned to multiple times. If we had a guarantee that all readonly fields were definitely assigned once-and-only-once before ‘this’ can escape, then we’d be close to a solution. But of course we have no such guarantee. In C#, at least.
A related issue is co-initialization of objects. This is interesting and relevant, because in such cases we actually want to leak out partially-constructed objects. Imagine we’d like to build a cyclic graph comprised of two nodes, A and B, each referring to the other. With a naïve approach to immutability, we simply cannot make this work. Either A must first refer to B, in which case A refers to a partially-constructed B; or B must first refer to A, in which case B refers to a partially-constructed A. Both are equally as bad. The two assignments are not atomic.
Cyclic data structures are a commonly cited weakness of immutability, and an argument in favor of supporting partially-instantiated objects in a first class way, although there are approaches that can work. One example is to separate edges from nodes, and represent them with different data structures. We can then build the nodes A and B, and then build the edges A->B and B->A without needing to use cycles.
It’s not-null, or at least it wasn’t supposed to be
Tony Hoare called it his billion-dollar mistake: the introduction of nulls into a programming language. I think he sells himself short, however, because the absence of nulls in an imperative programming language – however worthy a pursuit – is actually a notoriously difficult to attain.
Spec# is one example of a C-style language with non-nullability, in which T! represents a “non-null T”, for any T. Although this is done in a pleasant way conducive to C#-compatibility – a significant goal of Spec# -- I’d personally prefer to see the polarity switched: T would mean “non-null T” and T? would mean “nullable T”, for any T, reference- and value-types included. This is much more like Haskell’s Maybe monad, and is the syntax I’ll use below for illustration purposes. But I digress.
Non-nullability is a wonderful invention, because it is common for 75% or more of the contracts and assertions in modern programs to be about a pointer being non-null prior to dereferencing it, both in C# and in C++. Relying on the type-system to prove the absence of nulls is one big step towards creating programs that are robust and correct out-of-the-gate, particularly for systems software where such degrees of reliability are important. And it cuts down on all those boilerplate contracts sprinkled throughout a program. Instead of:
void M(C c, D d, E e)
requires c != null, d != null, e != null
{
… use(c, d, e) …
}
You simply say:
void M(C c, D d, E e)
{
… use (c, d, e) …
}
No opportunity to miss one, and no need for runtime checks. It’s beautiful.
A problem quickly arises, however, having to do with partially-constructed objects. All of an object’s fields cannot possibly be non-null while the constructor is executing, because the object has been zero’d out and the assignments to its fields have not yet been made. Clearly constructor code needs to be treated “differently” somehow. We cannot simply live with the fact that ‘this’ escaping leads to a partially-constructed object leaking out into the program, because that could lead to serious errors. These serious errors include potential security holes, if unsafe code is manipulating the supposedly non-null pointer. One advantage to adding non-nullability is that runtime null checks can be omitted, because the type system guarantees the absence of nulls in certain places. In this situation, partially-constructed objects lead to holes in our nice type system support. Either the dynamic non-null checks are required as back-stop, or we’ll need some other coping technique.
There are related issues with non-nullability, like with partially-populated arrays. Imagine we’d like to allocate an array of 1M elements of type T, and we will proceed to populate those elements following the array’s allocation. There’s clearly a window of time during which the array contains 1M nulls, and then 1M-1 nulls, …, and if we finish 1M-1M nulls, i.e. 0 nulls. It is only at that last transition that the array can be considered to contain non-null T’s. The standard technique is to use an explicit dynamic conversion check, or to force the creation of the list to supply all of the elements of the array at construction time.
Coping techniques
There are, thankfully, some interesting ways to cope with partially-constructed objects. There is, in fact, a spectrum of techniques, ranging from escape analysis in various forms, to limitations around how objects are constructed such that a partially-constructed one can never leak, to automatic insertion of dynamic checks to prevent the use of partially-constructed objects, to static annotations that treat partially-constructed instances as first class things in the type system. And of course there’s always the technique of “deal with it”, which is the one that most C++-style languages have chosen, including our beloved C#.
The F# approach: restrictions and dynamic checks
F#, it turns out, does a very good job in preventing partially-initialized objects. A first important step is that fields in F# are readonly by-default, unless you opt-in to mutability using the mutable keyword. Therefore data structures are mostly immutable. And the construction rules are meant to make it very unlikely that you’ll expose a partially-constructed object during construction. How so? It’s simple: such fields must be initialized prior to running ad-hoc construction code, and if you attempt to initialize them multiple times, the compiler supplies an error. In other words, you really have to work hard to screw yourself, unlike C++ and C# where it’s very easy.
It’s of course possible to do some dirty tricks to publish or access a partially-initialized object, despite needing to work very hard at it. There is, however, a nice surprise awaiting us when we try. For example:
type C() =
abstract member virt : unit -> unit
default this.virt() = ()
type D() as this =
inherit C()
do this.virt()
type E =
inherit D
val x : int
new() = { x = 42; }
override this.virt() = printf "x: %d" this.x
let e = new E()
This example attempts to perform a virtual invocation from C before the more derived class has been fully initialized. This overridden virtual simply (attempts to) prints out the value of x. If we compile and run this program, however, we will notice that we get an exception: “InvalidOperationException: the initialization of an object or value resulted in an object or value being accessed recursively before it was fully initialized.”
Pretty neat. The compiler will stick in the checks necessary when ‘this’ is being accessed, to dynamically verify that an object is not being leaked before having been fully initialized. The F# approach can be summed up as trying to make things airtight as possible statically at compile-time, but admitting that there are holes – primarily due to inheritance – and dealing with them by inserting dynamic runtime checks. This tradeoff clearly makes sense for F#, because it is attempting to attain a robust level of reliability around immutability.
F# also adds non-nullability for the most part. Like Haskell’s Maybe monad, F# adds an option type that can store a single None value which lies outside of the underlying type’s domain to effectively represent null. Because F# is a .NET language it of course also needs to worry about nulls at interop boundaries with other languages like C# and VB. This is a great step forward; first class CLI support would be a nice next step.
A slight variant of the F# idea is to initialize data up the whole class hierarchy in one pass, and then run ad-hoc construction code in a second pass in the usual way. So long as readonly data can be initialized without running the ad-hoc construction code, this helps to statistically cut down on the chances for accidental leaking of ‘this’.
Type system: T versus notconstructed-T
We can model two kinds of T in the type-system: T and notconstructed-T. The constructor for any type T would then see the ‘this’ pointer as an notconstructed-T, and everything else – by default – sees ordinary T’s.
What good does this distinction do? It enables us to add verification and restrictions around the use of notconstructed-T’s and limitations to the conversion between the two types. See this paper by Manuel Fahndrich and Rustan Leino for an example of how this approach was taken in Spec#’s non-nullability work.
For example, we can prohibit conversion between T and notconstructed-T altogether, thereby disallowing escaping ‘this’ references altogether. If the type of ‘this’ within a constructor is different than all other references to type T, and they are not convertible, we’ve successfully walled the problem off in the type system. This protects against erroneous method calls, so that a constructor could not call any of its own methods, because these methods expect an ordinary T whereas the constructor only has a notconstructed-T. And because you cannot state notconstructed-T in the language, you cannot let one leak by storing it into a field.
We could add more sophisticated support, by allowing a programmer to explicitly tag certain non-field references as T-notconstructed. This makes the concept first-class in the language, and allows one to explicitly declare the fact that code is interacting with a partially-constructed T:
class C
{
int m_x;
public C() {
m_x = V();
}
protected virtual int V() notconstructed {
… I know to be careful …
}
}
In this example, the programmer has annotated V with ‘notconstructed’. This enables the call from the constructor because the method’s ‘this’ is an uninitialized-T, and serves as a warning to the programmer that he or she should take care, much like the code written inside a constructor.
We must also decide whether fields are offlimits via notconstructed-T’s. If yes, we can add F#-style dynamic checks for initialization, but only for attempted accesses against notconstructed object’s fields. This is nice because it means the scope of dynamic checks are limited, and used in a pay-for-play manner. And we could even enable a programmer to sidestep the error by stating at the use-site that they understand a particular field access may be to uninitialized memory, like Field.ReadMaybeUninitialized(&m_field).
To be honest, the reason this approach has likely not yet seen widespread use is that the cost is not commensurate with the benefit. At least, in my opinion. To make something like partially-constructed objects a first-class concept in a programming language, programmers would need to want to know where they are dealing with them. For systems programmers, this makes sense. For many other programmers, it would be useless ceremony with no perceived value. And yet the initial approach where nothing new needed to be stated, but yet escaping ‘this’ was prevented, blocks certain patterns of legal use. This is where theory and practice run up against one another. There is, however, presumably a nice middle ground awaiting discovery.
Winding down
I meant for this to be a short post. But the topic really is fascinating, and has been coming up time and time again as we do language work here at Microsoft. But it is truly fascinating mainly because, like nulls, the problem is widespread yet tolerable, and most C++ and C# programmers learn the rules and make do. Partially-constructed objects are a major blemish, but not a crisis that threatens the complete abandonment of imperative programming.
I do believe language trends indicate that more will move away from C++-style object initialization and related issues, and more towards immutability and treating data and its initialization separately from imperative ad-hoc initialization code. Haskell, F#, and Clojure, for example, show us some promising paths forward. There are a plethora of other attempts at solving related problems, and I unfortunately could only scratch the surface.
Although these techniques are not new, the primary question – to me – is how close to “the metal” in systems programming these concepts can be made to work. Typically for those situations, you need to rely on a mixture of static verification and complete freedom, because the dynamic checking is too costly and the code to work around overly-limiting static verification also adds too much cost. But as soon as you add complete freedom into the picture, you run into partially-constructed objects as a consequence, and all the issues I’ve mentioned above.
Anyway, I hope it was interesting.
 Sunday, May 16, 2010
My article about Transactional Memory (TM) was picked up by a few news feeds recently.
If I had known this would occur, I would have written it with greater precision. Because my article presents a mixture of technical challenges interspersed among more subjective and cultural issues, I am sure it is difficult to tease out my intended conclusion. To summarize, I merely believe adding TM to a shared memory architecture alone is insufficient.
Indeed, I remain a big fan of transactions. Atomicity, consistency, and isolation, and coming up with strategies for achieving all three in tandem, are part and parcel to architecting software.
After watching Barbara Liskov's OOPSLA Turing Award reprise, I decided to reacquaint myself with some old Argus papers this weekend. It has been some time since I last read them. Argus was Liskov's language for distributed programming and her follow-on to CLU. As with most research done by brilliant people, the work was way ahead of its time, has appeared in ad-hoc incarnations and permutations over time, and remains relevant today. This research is particularly interesting to work that my team is doing right now, especially its notion of guardians. And it is relevant to the TM discussion too.
The Argus approach of using isolation to coarsely partition state and operations into independent bubbles, and then communicating asynchronously between the so-called guardians that are responsible for this state, is an architecture that is common among most highly concurrent programs. This aids state management and fault tolerance. Argus makes an interesting observation that, although guardians may be sent messages concurrently -- and indeed activities themselves may even introduce local concurrency -- manipulation of state can be done safely and even in parallel thanks to transactions.
The requirement is that messages are atomic and commute. Transactions, it turns out, are a convenient way of implementing this requirement.
You will observe a similar architecture in other places, including in some languages that have adopted TM. Haskell has moved in this direction. Everything is purely functional and so, of course, no state is mutated in an unsafe way by default. However, with the introduction of concucrrency comes mutable cells for message passing and with parallelism comes indeterminism. You can push the state management problem up indefinitely, but at the top there are almost always mutable operations on real-world state (even if it is "just I/O"). Haskell programs have a safe architecture to begin with, and it is the intentional and careful addition of specific facilities that forces one to focus on the problematic seams. One could say that Haskell starts clean and stays clean, versus most shared memory-based languages which start dirty and try to attain cleanliness (at least when it comes to concurrency).
Why aren't transactions sufficient, then, given that the I in ACID stands for Isolation? You wouldn't model a database as one flat table in which each row is a single byte, however, would you? As you begin to decompose your program into isolated state, your bubbles (or guardians) are the tables, and your objects are the rows. This is just an analogy but I find it useful to think in these terms. Taking a bunch of intermingled state and pouring transactions on top is not going to give you this nicely partitioned separation of state which has proven to be the lifeblood of concurrency.
Even once you've attained a more isolated architecture, however, transactions are not a panacea. They are just one of many viable state management techniques in a programmer's arsenal, hierarchical state machines being another notable example. And in fact, many of the problems I mentioned in the TM article are still worrisome even when you start from the right place. From within a guardian, you may wish to enlist the aid of another unrelated guardian to perform a coordinated atomic activity, because a higher level invariant relationship between them must be preserved. Or an application which composes multiple guardians may wish to do so atomically. Even Argus required manual compensation for such things. This can be solved in part by DTC. But experience suggests that continuing to push the enlistment scope one level higher eventually leads to substantial problems. A topic for another day, I suppose.
My primary conclusion is that TM is a great complement to highly concurrent programs, but only so long as you start from the right place. The Argus and Haskell approaches are conducive to large-scale concurrency, but it is primarily because of the natural isolation those models provide; the addition of transactions address problems that remain after taking that step. But without that first step, they would have gotten nowhere.
|
|
Me
Joe  is an architect and developer on a systems incubation project at Microsoft.
Recent
Search
Browse
Disclaimer:
The content of this site are my own personal opinions and do
not represent my employer's view in anyway.
© 2013, Joe Duffy
|
|