<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:pingback="http://madskills.com/public/xml/rss/module/pingback/" version="2.0">
  <channel>
    <title>Generalities &amp; Details: Adventures in the High-tech Underbelly</title>
    <link>http://www.bluebytesoftware.com/blog/</link>
    <description>Joe Duffy's Weblog</description>
    <language>en-us</language>
    <copyright>Joe Duffy</copyright>
    <lastBuildDate>Sat, 12 Nov 2011 22:03:13 GMT</lastBuildDate>
    <generator>newtelligence dasBlog 1.8.5223.2</generator>
    <managingEditor>joe@bluebytesoftware.com</managingEditor>
    <webMaster>joe@bluebytesoftware.com</webMaster>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=e1e621c4-c9d2-47ed-9144-f881a3c00588</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,e1e621c4-c9d2-47ed-9144-f881a3c00588.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
      I often wish that .NET had erred on the side of offering postmortem instead of premortem
      finalization.
   </p>
        <p>
      The distinction here is when exactly the finalizer runs, i.e. <em>after</em> or <em>before</em> the
      GC has actually reclaimed an object. This governs whether a dying object is (a) accessible
      from within its own finalizer, and therefore (b) eligible to become resurrected. Postmortem
      finalization occurs after the object is long gone, and hence says “no” to both of
      these questions; premortem finalization happens beforehand and hence says “yes.”
   </p>
        <p>
      .NET chose the latter.
   </p>
        <p>
      The primary downside of premortem finalization, setting aside the confusing nature
      of resurrection, is that the object in question cannot be collected until <em>after</em> its
      finalizer has run. This should be fairly obvious: it is only that second time the
      object is found to be dead “again” that we know the finalizer has or has not resurrected
      it.
   </p>
        <p>
      This may seem like a small matter. But it matters quite a lot when building high performance
      software. In a garbage collected system, relying on high rates of finalization to
      keep up with demanding workloads almost never works. But in a premortem finalization
      system, even moderate demands become cause for concern.
   </p>
        <p>
      Premortem finalization leads to finalized objects getting promoted to the elder generations
      before actually dying. If you check the value of GC.GetGeneration(this) within an
      object’s finalizer, for example, you will notice it is one greater than the generation
      in which the object was found to be dead the first time. Say it was found dead in
      Gen1; then GC.GetGeneration(this) will return ‘2’. Yet another collection must happen,
      in Gen2 to boot, in order to actually reclaim this object. And, of course, it’s not
      just this object, but also the transitive closure of objects to which it refers.
   </p>
        <p>
      This approach penalizes the majority use case of finalizable objects. At least on
      .NET, most objects merely invoke CloseHandle on an IntPtr in the finalizer. This clearly
      needn’t hold up freeing the managed state. And resurrection is a dubious scenario
      anyway: such objects quickly end up in Gen2 where collections are expensive and infrequent.
      If you’re pooling via resurrection because you create expensive objects at a high
      rate of birth and death, manual memory management (or a different design altogether)
      is likely your only savior.
   </p>
        <p>
      Although Java’s finalizers are also premortem, the JDK offers the facilities necessary
      to implement postmortem finalization on your own. It entails using WeakReference<T>
         and ReferenceQueue<T>
            . See <a href="http://java.sun.com/developer/technicalArticles/javase/finalization/">this
            article</a> if you are curious.
         </T></T></p>
        <p>
      .NET doesn’t offer the notifications required to do the same. You can, however, learn
      from postmortem finalization to write better premotem finalizers: prefer simple finalizable
      objects that refer to only the state necessary to implement finalization – which ordinarily
      means no other managed objects. The SafeHandle abstraction is a good example of this.
      Most implementations are comprised of a simple IntPtr. This pattern will ensure that
      collateral promotion due to finalization is more contained.
   </p>
        <p>
      After saying all of this, I hope it is just amusing trivia. I'm sure nobody is writing
      finalizers these days anyway.
   </p>
        <img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=e1e621c4-c9d2-47ed-9144-f881a3c00588" />
      </body>
      <title>A brief note on object mortality</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,e1e621c4-c9d2-47ed-9144-f881a3c00588.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2011/11/12/ABriefNoteOnObjectMortality.aspx</link>
      <pubDate>Sat, 12 Nov 2011 22:03:13 GMT</pubDate>
      <description>&lt;p&gt;
   I often wish that .NET had erred on the side of offering postmortem instead of premortem
   finalization.
&lt;/p&gt;
&lt;p&gt;
   The distinction here is when exactly the finalizer runs, i.e. &lt;em&gt;after&lt;/em&gt; or &lt;em&gt;before&lt;/em&gt; the
   GC has actually reclaimed an object. This governs whether a dying object is (a) accessible
   from within its own finalizer, and therefore (b) eligible to become resurrected. Postmortem
   finalization occurs after the object is long gone, and hence says “no” to both of
   these questions; premortem finalization happens beforehand and hence says “yes.”
&lt;/p&gt;
&lt;p&gt;
   .NET chose the latter.
&lt;/p&gt;
&lt;p&gt;
   The primary downside of premortem finalization, setting aside the confusing nature
   of resurrection, is that the object in question cannot be collected until &lt;em&gt;after&lt;/em&gt; its
   finalizer has run. This should be fairly obvious: it is only that second time the
   object is found to be dead “again” that we know the finalizer has or has not resurrected
   it.
&lt;/p&gt;
&lt;p&gt;
   This may seem like a small matter. But it matters quite a lot when building high performance
   software. In a garbage collected system, relying on high rates of finalization to
   keep up with demanding workloads almost never works. But in a premortem finalization
   system, even moderate demands become cause for concern.
&lt;/p&gt;
&lt;p&gt;
   Premortem finalization leads to finalized objects getting promoted to the elder generations
   before actually dying. If you check the value of GC.GetGeneration(this) within an
   object’s finalizer, for example, you will notice it is one greater than the generation
   in which the object was found to be dead the first time. Say it was found dead in
   Gen1; then GC.GetGeneration(this) will return ‘2’. Yet another collection must happen,
   in Gen2 to boot, in order to actually reclaim this object. And, of course, it’s not
   just this object, but also the transitive closure of objects to which it refers.
&lt;/p&gt;
&lt;p&gt;
   This approach penalizes the majority use case of finalizable objects. At least on
   .NET, most objects merely invoke CloseHandle on an IntPtr in the finalizer. This clearly
   needn’t hold up freeing the managed state. And resurrection is a dubious scenario
   anyway: such objects quickly end up in Gen2 where collections are expensive and infrequent.
   If you’re pooling via resurrection because you create expensive objects at a high
   rate of birth and death, manual memory management (or a different design altogether)
   is likely your only savior.
&lt;/p&gt;
&lt;p&gt;
   Although Java’s finalizers are also premortem, the JDK offers the facilities necessary
   to implement postmortem finalization on your own. It entails using WeakReference&lt;T&gt;
      and ReferenceQueue&lt;T&gt;
         . See &lt;a href="http://java.sun.com/developer/technicalArticles/javase/finalization/"&gt;this
         article&lt;/a&gt; if you are curious.
&lt;/p&gt;
&lt;p&gt;
   .NET doesn’t offer the notifications required to do the same. You can, however, learn
   from postmortem finalization to write better premotem finalizers: prefer simple finalizable
   objects that refer to only the state necessary to implement finalization – which ordinarily
   means no other managed objects. The SafeHandle abstraction is a good example of this.
   Most implementations are comprised of a simple IntPtr. This pattern will ensure that
   collateral promotion due to finalization is more contained.
&lt;/p&gt;
&lt;p&gt;
   After saying all of this, I hope it is just amusing trivia. I'm sure nobody is writing
   finalizers these days anyway.
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=e1e621c4-c9d2-47ed-9144-f881a3c00588" /&gt;</description>
      <category>Technology</category>
    </item>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=2d01b5eb-7f19-4a4c-a3b3-1cecb1e3dc75</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,2d01b5eb-7f19-4a4c-a3b3-1cecb1e3dc75.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
      It's been unbelievably long since I last blogged.
   </p>
        <p>
      The reason is simple. I've been ecstatic in my job and, every time I think to write
      something, I quickly end up turning to work and soon find that hours (days? months?)
      have passed. This is a wonderful problem to have, but not so good for keeping the
      blog looking fresh and new. (I've also been writing <a href="http://soundcloud.com/45nm_joeduffy/sets/current/">a
      fair bit of music</a> lately.) Well, this weekend I managed to lock myself out of
      my VPN access, and decided that this was a sign that I ought to dust off the cobwebs
      on a blog entry or two that I've had in the works for quite some time.
   </p>
        <p>
      The topic for today is generics, a feature many of us know and love. Specifically,
      their impact on software performance, something I frequently see developers struggling
      to understand and tame in the wild. 
   </p>
        <h1>The blessing; and, the curse
   </h1>
        <p>
      I absolutely love generics. I can hardly imagine writing code without them these days.
      The code reuse, higher-order expressiveness, beautiful abstractions, and static type-safety
      enabled by first class parametric polymorphism are all game-changing. And being a
      language history wonk, I'm delighted to see many mainstream programming languages
      stealing a page from ML and theoretical CS generally.
   </p>
        <p>
      Generics, however, are not free. And in some circumstances, they are, dare I say,
      rather expensive. Few language features surpass generics in the ability to write a
      concise and elegant bit of code, which then translates into reams of ugly assembly
      code out the rear end of the compiler. I am of course speaking mainly to models in
      which compilation leads to code specialization (like .NET's), versus erasure (like
      Java's).
   </p>
        <p>
      Most developers coming from a C++ background understand code expansion deeply, because
      they program with templates. Unlike templates, however, there is ample runtime type
      information (RTTI) associated with generic instantiations… such that the costs associated
      with generics frequently – and perhaps surprisingly – are a superset of those costs
      normally associated with C++ templates. At the same time, because the compiler understands
      parametric polymorphism, it can sometimes do a better job optimizing, e.g. with techniques
      like code sharing.
   </p>
        <p>
      Basically, with templates and erasure, the equation for predicting code expansion
      is super simple. You get it all (in the former) or you get none of it (in the latter),
      but with specialized generics this equation is quite complex.
   </p>
        <p>
      Paradoxically, these same costs are the main value that generics bring to the table!
      Write a little type-agnostic code and then "instantiate" that same code over multiple
      types without repeating yourself. But, generics are not magic; did you ever stop to
      wonder things like: What machine code is generated for these types? Does the compiler
      need to specialize the actual code that runs on the processor for unique instantiations,
      or is it all the same? And if it does need to specialize, where, how, and why? And
      perhaps most importantly, what hidden costs are there, and how should I think about
      them while writing code?
   </p>
        <p>
      Before reading further, paranoia need not ensue. The point of this article is merely
      to raise your awareness. All programmers should know what the abstractions they use
      cost, and make conscious tradeoffs when writing code with them. The aforementioned
      benefits of generics really are often "worth it," both in the elegance and reusability
      of abstractions, and in developer productivity. In my experience, however, the associated
      costs are so subtle and ill-documented that even people who write highly generic code
      typically remain unaware of them. Even more subtly, these costs are somewhat different
      in nature when pre-compiling your code, such as with .NET's NGen technology.
   </p>
        <p>
      This brief essay will walk through a few such costs in the context of the .NET Framework
      and CLR's implementation of generics. This is in no way an exhaustive study of generic
      compilation, and your mileage will vary from one platform to the next. Although the
      studies presented would apply to other implementations of generics, the reality is
      that if you're writing code in, say, Java – where type erasure is employed rather
      than code specialization – then all of this is going to be less relevant to you.
   </p>
        <p>
      With no further delay, let's get started.
   </p>
        <h1>Code, RTTI, oh my
   </h1>
        <p>
      When considering costs, we must always think about both size and speed.
   </p>
        <p>
      There is at least as much assembly code created for an instantiation as the code you've
      written for the generic abstraction in C# or MSIL. A simple mental model – that thankfully
      turns out isn't entirely accurate, thanks to some sharing optimizations described
      below – is that for each instantiation of a generic type or method you get a new copy
      of that code specialized to the type in question. Obviously, this increases code size.
      And just as obviously, it will add some runtime cost to JIT compile the code (if you
      aren't using ahead of time compilation), as well as putting more pressure on I-cache
      and TLB.
   </p>
        <p>
      Another source of significant cost is the runtime data structures needed for RTTI
      and Reflection, like vtables and other metadata. Quite simply, the runtime needs to
      know the identity of each generic instantiation, to prevent things like casting a
      List&lt;Int16&gt; to a List&lt;String&gt;, and even List&lt;Object&gt; to List&lt;String&gt;;
      and given that there is often distinct code generated for unique instantiations, the
      vtable contents for those different List&lt;T&gt; instantiations are going to look
      quite different.
   </p>
        <p>
      And of course, there are statics. Each generic instantiation gets its own set, requiring
      extra storage and another level of indirection when fetching them. Unique statics
      means D-cache and TLB pressure. It turns out that code shared across AppDomains, like
      mscorlib.dll, already need such things. But I have found that it's surprisingly common
      for a developer to throw a static field (or nested class!) onto an outer generic type,
      without actually needing it to be replicated for each unique instantiation.
   </p>
        <p>
      In addition to the immediate effects, generic types often refer to other generic types
      which refer to other generic types … and so on. Instantiating a root type is akin
      to instantiating the full transitive closure.
   </p>
        <p>
      To make our discussion friendly and familiar, we shall use the .NET Framework's List&lt;T&gt;
      type – presumably one of the most commonly used generic types on the planet – to illustrate
      many of these costs. And unfortunately, you'll also see that many of the common performance
      pitfalls plague this type too. (So, really, you need not feel bad if your own code
      is guilty of them too.)
   </p>
        <h1>Why the distinct code, anyway?
   </h1>
        <p>
      There is only one copy of List&lt;T&gt;'s code in mscorlib's MSIL. It is essentially
      just a blueprint for the list class.
   </p>
        <p>
      When I create a List&lt;Int16&gt; in my program and use it, however, there clearly
      needs to be some assembly code created in order to execute List&lt;T&gt;'s associated
      functionality, just with any T's used by List&lt;T&gt;'s code replaced by actual 2-byte
      short integers. And similarly, if I were to instantiate a List&lt;String&gt;, all
      those T's need to be replaced by pointer-sized object references, either 4- or 8-bytes
      depending on machine architecture, that are reported live to the garbage collector.
   </p>
        <p>
      This is what leads to our simple mental model above, in which each instantiation gets
      its own copy of the code. In this case, both List&lt;Int16&gt; and List&lt;String&gt;
      would be entirely independent types at runtime, with wholly separate copies of the
      machine code.
   </p>
        <p>
      Certainly if I manually went about creating my own Int16List and StringList types,
      they would be distinct types with distinct machine code generated. Being a prudent
      developer, however, I'd probably try to arrange to share as much of the implementation
      as possible between the two types, perhaps using implementation inheritance. But alas,
      there's no way I could share it all: any code specific to Int16 or String, for example,
      would surely differ, both in MSIL and in the native code.
   </p>
        <p>
      Generics basically give you the ability to do this same thing, without you needing
      to do the factoring of type-independent and type-specific code yourself. The compiler
      does that for you.
   </p>
        <p>
      Why might the code be different? As stated above, Int16 values are 2 bytes and String
      pointers are native word sized (4 bytes on 32-bit, 8 bytes on 64-bit). All the code
      that passes values of type T on the stack, either as arguments or return values, moves
      instances into and out of memory locations (like the T[] backing array), and so on,
      needs to be specialized based on the size of T. This wouldn't be true of a generics
      implementation that used type erasure, like Java's, but then you'd need to box the
      value types on the heap so that everything is a pointer. If T is a Float, we will
      likely emit code that uses floating point math instead of general purpose registers.
      Any tables that report GC roots are likely to be different, since object references
      can be embedded inside struct values that get laid out on the stack. And so on. Some
      day you might want to compare the machine code for a simple generic Echo&lt;T&gt;
      method for different kinds of T's; it is really easy to do, and is quite illustrative.
   </p>
        <p>
      A naïve wish might go as follows. Imagine that I had written my own dedicated Int16List
      and StringList types, and that we diffed the resulting machine code between the distinct
      list types; we'd presumably find a fair bit of duplication for all the reasons stated
      above. It would be a nice property if, when we used the generic List&lt;Int16&gt;
      and List&lt;String&gt; types, and similarly diffed the resulting assembly, the amount
      of specialized code would be no greater than the amount of specialized code between
      our best hand-written Int16List and StringList types. I.e., only parts that need to
      be different are different.
   </p>
        <p>
      We could go even further with our wish. Imagine I had a List&lt;DateTime&gt; and List&lt;Int64&gt;.
      Both are 8-byte values, and do not contain any GC references. If I were writing a
      specialized 8ByteValueList in C++ and had immense performance constraints, I would,
      again being a prudent developer, probably use some type unsafe code, with nasty reinterpret_casts,
      so that I could use the same list type to store any kind of 8-byte value. (Except
      in C++ I could even store pointers!) It would also be a nice property if generics
      did some of this for us, while still retaining the type safety we love about generics.
   </p>
        <p>
      It turns out we will get neither of our wishes exactly, although we will get something
      close to the spirit of our wishes.
   </p>
        <h1>Code sharing
   </h1>
        <p>
      Indeed, the CLR does arrange to share many generic instantiations. The rule is simple,
      although it is subject to change in the future (being an optimization and all): instantiations
      over reference types are shared among all reference type instantiations for that generic
      type/method, whereas instantiations over value types get their own full copy of the
      code. In other words, List&lt;String&gt; and List&lt;Object&gt; are backed by the
      same code, but List&lt;DateTime&gt; and List&lt;Int64&gt; get their own.
   </p>
        <p>
      It is true that, in theory, List&lt;DateTime&gt; and List&lt;Int64&gt; could use the
      same shared code, because they are of identical size and have GC roots in the same
      locations (trivially, because neither has one). But there are additional restrictions
      on generated code that makes this problematic, for example if we were talking about
      Double and Int64. In short, the CLR doesn't actually share value type instantiations
      as of the 4.0 runtime, although clearly it could in certain situations (value types
      of the same size with GC roots in the same locations).
   </p>
        <p>
      As you might guess, this extends to multi-parameter generics in obvious ways. A Dictionary&lt;Object,
      Object&gt; is shared with a Dictionary&lt;String, String&gt;, etc., and a Dictionary&lt;Int64,
      Object&gt; is shared with a Dictionary&lt;Int64, String&gt;. A Dictionary&lt;DateTime,
      DateTime&gt; is not, however, shared with a Dictionary&lt;Int64, Int64&gt; instantiation,
      as per the above.
   </p>
        <p>
      My pal <a href="http://callvirt.net/blog/">Joel Pobar</a> wrote <a href="http://blogs.msdn.com/b/joelpob/archive/2004/11/17/259224.aspx">a
      post eons ago describing how code sharing works</a> in great detail, which I do not
      intend to rehash. Please refer to his post for an excellent overview of how code sharing
      works.
   </p>
        <p>
      An important thing to remember, however, is that no matter how much code sharing happens,
      you still need distinct RTTI data structures. So although List&lt;Object&gt; and List&lt;String&gt;
      share the same machine code, they have distinct vtables; sure, each table is full
      of pointers to the same code functions, but you are still paying for the runtime data
      structures. A distinct instantiation, therefore, is never actually free!
   </p>
        <h1>Transitive closures
   </h1>
        <p>
      Why am I making such a big deal about code sharing, anyway?
   </p>
        <p>
      Another surprising aspect of generics is the transitive closure problem. Particularly
      when doing pre-compilation of generics, each unique instantiation doesn't simply lead
      to a specialized version of the code associated with the type being directly instantiated.
      The whole transitive closure of types, starting with that root type, will also be
      compiled. This can be a surprisingly huge number of types! JIT is much more pay-for-play,
      such that you get one level of explosion at a time, but once there is code that calls
      a particular type's method, even if that code is lazily compiled, creation of the
      type is forced.
   </p>
        <p>
      To illustrate this, let's take our friend List&lt;T&gt;. Before examining the list,
      how many generic types would you expect that a single new List&lt;T&gt; instantiation
      instantiates?
   </p>
        <p>
      What if I told you that a single List&lt;int&gt; instantiation creates (at least)
      28 types? And that, say, five unique instantiations of List&lt;T&gt; might cost you
      300K of disk space and 70K of working set? Well, of course, if you are writing a script,
      or something with fairly loose performance requirements, this might not matter much.
      But if topics like download time, mobile footprint, and cache performance are important
      to you, then you probably want to pay attention to this. To a first approximation,
      size <u><i>is</i></u> speed.
   </p>
        <p>
      Yes, you heard me right: 28 types. Holy smokes... How can this be?!
   </p>
        <p>
      Nested types are one obvious answer, and indeed List&lt;T&gt; has two: an Enumerator
      class (which is reasonable), and one to support the legacy synchronized collections
      pattern (which we presumably wish we didn't have to pay for). The larger answer here,
      however, is functionality. Yes, functionality! This is a great example where the cost
      of generics explodes as you add more features. Start simple, keep adding stuff, as
      has happened to List&lt;T&gt; over the years, and you will soon find that a series
      of elegant abstractions adds up to a gut-wrenching bucket of bytes.
   </p>
        <p>
      Here's a quick sketch of the transitive closure of generic types used by List&lt;T&gt;:
   </p>
        <p>
        </p>
        <pre>
          <code>List&lt;T&gt; T[] type IList&lt;T&gt; type ICollection&lt;T&gt; type
   IEnumerable&lt;T&gt; type IEnumerator&lt;T&gt; type ReadOnlyCollection&lt;T&gt; type
   (AsReadOnly) (Nothing more than List&lt;T&gt;) IComparer&lt;T&gt; type (BinarySearch,
   Sort) {Array.BinarySearch&lt;T&gt; method (BinarySearch)} ArraySortHelper&lt;T&gt;
   type IArraySortHelper&lt;T&gt; type GenericArraySortHelper&lt;T&gt; type EqualityComparer&lt;T&gt;
   type (Contains) IEqualityComparer&lt;T&gt; type IEquatable&lt;T&gt; type NullableEqualityComparer&lt;T&gt;
   type Nullable&lt;T&gt; type EnumEqualityComparer&lt;T&gt; type {JitHelpers.UnsafeEnumCast&lt;T&gt;
   method} ObjectEqualityComparer&lt;T&gt; type Predicate&lt;T&gt; delegate type (Find*)
   Action&lt;T&gt; delegate type (ForEach) {Array.LastIndexOf&lt;T&gt; method (LastIndexOf)}
   Comparison&lt;T&gt; delegate type (Sort) Array.FunctorComparer&lt;T&gt; type (Sort)
   Comparer&lt;T&gt; type GenericComparer&lt;T&gt; type NullableComparer&lt;T&gt; type
   ObjectComparer&lt;T&gt; type {Array.Sort&lt;T&gt; method (Sort)} ArraySortHelper&lt;T&gt;
   type (see earlier) Enumerator inner type SynchronizedList&lt;T&gt; inner type IList&lt;T&gt;
   interface (see earlier) ICollection&lt;T&gt; interface (see earlier) IEnumerable&lt;T&gt;
   interface (see earlier) {Interlocked.CompareExchange&lt;Object&gt; method (SyncRoot)}
   {_emptyArray T[] static field} </code>
        </pre>
        <p>
      I'm not trying to pick on List&lt;T&gt;. This class is only unique in this regard
      in that it offers a large transitive closure of (mostly useful!) functionality. And
      it's not the only guilty party. We recently shaved off 100K's of code size on my team,
      for example, that were being lost simply because all the LINQ methods were declared
      as instance methods on the base collection class, rather than being extension methods
      as in .NET. We found nested enumerator and iterator types, cached static lambdas as
      static fields, and huge transitive closures of other generic types, all allocated
      when you just touched any collection type. Any collections library is apt to be full
      of this stuff, since they are highly generic. But collection libraries are certainly
      not the only places to go sniffing for such problems.
   </p>
        <p>
      As an aside, it turns out that extension methods are a great way to make generic abstractions
      more pay-for-play.
   </p>
        <h1>Adding it up
   </h1>
        <p>
      Let's see what the above adds up to. I ran some programs through NGen as a quick and
      dirty experiment, and inspected the on-disk sizes and also the runtime working set
      sizes. I ensured clrjit.dll was not loaded into the process. Here's what I found.
      Take these numbers with a grain of salt, as they will change from release to release;
      they are simply rules of thumb. When in doubt, crank up NGen, DumpBin, and/or start
      trawling the heap with VADump yourself!
   </p>
        <p>
      One empty type with no methods in CLR 4.0 seems to cost roughly 0.2K bytes of on-disk
      metadata, and about 0.7K in x64 working set. (This is a good rule of thumb irrespective
      of generics… in terms of order of magnitude, you can think "one empty type means 1K
      of memory.") A single List&lt;S&gt; instantiation, where S is an empty struct, is
      in the neighborhood of 60K on-disk metadata, and 14K of x64 working set. A single
      List&lt;C&gt; instantiation, however, where C is an empty class, is only – surprise
      – about 7K on-disk and 4K in-memory. Why the large discrepancy? Well, it just so happens
      that mscorlib.dll already includes an instantiation or two of List&lt;T&gt; over reference
      types, so this 4K is the incremental cost on top of reusing what is there; remember,
      there are still unique vtables and data structures still required for RTTI.
   </p>
        <p>
      Rico did <a href="http://blogs.msdn.com/b/ricom/archive/2005/08/26/performance-quiz-7-generics-improvements-and-costs-solution.aspx">a
      similar analysis a few years back</a>, and concluded that each unique List<E>
         , where E was an enum type, cost 8K. Why the increase to 14K over the years? x64 and
         ever-increasing functionality on the basic collections classes, presumably. Remember,
         it's not just List&lt;T&gt; that has grown, it's also everything that List&lt;T&gt;
         uses internally as an implementation detail.
      </E></p>
        <h1>Dynamic specialization with dictionaries
   </h1>
        <p>
      Some specialization in behavior can be accomplished with dynamic runtime behavior,
      rather than static code specialization. A prime example is the following:
   </p>
        <p>
        </p>
        <pre>
          <code>class C { public static void M&lt;T&gt;() { System.Console.WriteLine(typeof(T).Name);
   } } </code>
        </pre>
        <p>
      Where does the program get the value of typeof(T) from? If you look at the MSIL, you
      will see that C# has emitted a ldtoken MSIL instruction. For some struct type, we
      can compile that as a constant in the code, because it is getting its own copy of
      the code. What occurs when two instantiations share code, like M&lt;String&gt; and
      M&lt;Object&gt;, however? As you might guess, there is an indirection.
   </p>
        <p>
      The thing we usually use for such indirections – the vtable – is nowhere to be found
      in this particular example, because M is a static method. To deal with this, the compiler
      inserts an extra "hidden" argument, frequently called a generic dictionary, from which
      the emitted assembly code can fetch the type token. The cost here typically isn't
      bad, because many of the operations that pull in the dictionary are already RTTI or
      Reflection-based, and would require an indirection already (e.g., through a vtable).
   </p>
        <p>
      The operations which require a dictionary of some kind include anything that has to
      do with RTTI and yet no vtable is readily accessible: typeof, casts, is and as operators,
      etc. And as you might guess, if instantiations aren't shared (such as with value types
      on the CLR), no dictionary is needed, because the code is fully specialized. There
      are also multiple kinds of dictionaries used by the runtime, depending on whether
      you are using a generic type, method, or some combination of both.
   </p>
        <h1>JITting when you didn't mean to
   </h1>
        <p>
      There are two primary ways in which you will JIT compile when using generics, even
      if you were good doobie and used NGen to reduce startup time.
   </p>
        <p>
      One way is if you instantiate a new generic type exported from mscorlib.dll with a
      type argument also defined in mscorlib.dll, that wasn't already instantiated inside
      mscorlib.dll. (See my old <a href="http://www.bluebytesoftware.com/blog/2005/03/23/DGUpdateGenericsAndPerformance.aspx">Generics
      and Performance</a> blog entry for more details.) You can very easily see this happening
      by using an instantiation like Dictionary&lt;DateTime, DateTime&gt;, and watching
      the clrjit.dll module getting loaded.
   </p>
        <p>
      The other way is generic virtual methods (GVMs). It turns out that GVMs pose incredible
      difficulty for ahead of time separate compilation, because the compiler cannot know
      statically which slot in the vtable points at the particular implementation you are
      about to call. (Unless you use whole program compilation, something not offered by
      .NET at present time.) For each such method, there's an unbounded set of possible
      specialized instantiations a slot might point to, and so the vtable cannot be laid
      out in a traditional manner. C++ doesn't allow templated virtual methods for this
      very reason.
   </p>
        <p>
      Thankfully, GVMs are somewhat rare. However, it only took 5 minutes of poking around
      to find one that is quite front-and-center in .NET: in the implementation of LINQ,
      there is an Iterator&lt;T&gt; type that has a method declared as follows:
   </p>
        <p>
        </p>
        <pre>
          <code>public abstract IEnumerable&lt;TResult&gt; Select&lt;TResult&gt;(Func&lt;TSource,
   TResult&gt; selector); </code>
        </pre>
        <p>
      All we need to do is figure out how to tickle that method, and we're guaranteed to
      JIT. As it turns out, sure enough, the following code does the trick and forces clrjit.dll
      to get loaded in .NET 4.0:
   </p>
        <p>
        </p>
        <pre>
          <code>int[] xs = …; int[] ys = xs.Where(x =&gt; true).Select(x =&gt; x).ToArray(); </code>
        </pre>
        <p>
      The Iterator&lt;T&gt; type is used for back-to-back Where and Select operators, as
      a performance optimization that avoids excess allocations and interface dispatch.
      But because it depends on a GVM, it does incur an initial penalty for using it, even
      if you have used NGen to avoid runtime code generation.
   </p>
        <h1>In conclusion
   </h1>
        <p>
      The moral of the story here is <i>not</i> that you should fear generics. Beautiful
      things can be built with them.
   </p>
        <p>
      Instead, it's to use generics thoughtfully. Nothing in life is free, and generics
      are no exception to this rule. If code size is important to you, then you will want
      to have performance gates measuring your numbers against your goals; if you are working
      in a codebase that uses generics heavily, and you end up spending any significant
      time on code size optimizations, you will want to try to track down large transitive
      closures. As I stated above, you could really be throwing away 100K's of code here.
   </p>
        <p>
      And as to the surprise JITting, I've seen teams compiling with NGen and having a functional
      gate that fails any new code that causes clrjit.dll to get loaded at runtime. Although
      tracking down the root cause might be tricky when that gate fails, at least you won't
      let the camel's nose under the tent.
   </p>
        <p>
      Investing in tools here is a very good idea.
   </p>
        <p>
      When it comes down to it, really thinking about what code must be executed by the
      process is helpful. Step back and imagine you were writing this all in C++, with the
      associated performance concerns front-and-center: consider how you'd arrange to reuse
      as much implementation as you can, manage memory efficiently, perhaps employ unsafe
      tricks that would have violated type safety and so are offlimits in .NET, and all
      that jazz. Then step back and be grateful that you have a type- and memory-safe environment
      to help you write more robust code, but also be realistic about what you are paying
      in exchange.
   </p>
        <p>
      I hope you've learned a useful thing or two in this article. If you'd like to learn
      more, here are a few other good resources:
   </p>
        <ol>
          <li>
         An MSR paper on the original implementation of .NET generics: <a href="http://research.microsoft.com/pubs/64031/designandimplementationofgenerics.pdf">http://research.microsoft.com/pubs/64031/designandimplementationofgenerics.pdf</a></li>
          <li>
         Rico's "Six Questions About Generics and Performance" blog entry: <a href="http://blogs.msdn.com/b/ricom/archive/2004/09/13/229025.aspx">http://blogs.msdn.com/b/ricom/archive/2004/09/13/229025.aspx</a></li>
          <li>
         Joel Pobar's "Generics and Code Sharing" blog post: <a href="http://blogs.msdn.com/b/joelpob/archive/2004/11/17/259224.aspx">http://blogs.msdn.com/b/joelpob/archive/2004/11/17/259224.aspx</a></li>
          <li>
         My "Generics and Performance" blog entry: <a href="http://www.bluebytesoftware.com/blog/2005/03/23/DGUpdateGenericsAndPerformance.aspx">http://www.bluebytesoftware.com/blog/2005/03/23/DGUpdateGenericsAndPerformance.aspx</a></li>
        </ol>
        <p>
      Cheers.
   </p>
        <img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=2d01b5eb-7f19-4a4c-a3b3-1cecb1e3dc75" />
      </body>
      <title>On generics and (some of) the associated overheads</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,2d01b5eb-7f19-4a4c-a3b3-1cecb1e3dc75.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2011/10/23/OnGenericsAndSomeOfTheAssociatedOverheads.aspx</link>
      <pubDate>Sun, 23 Oct 2011 22:26:02 GMT</pubDate>
      <description>&lt;p&gt;
   It's been unbelievably long since I last blogged.
&lt;/p&gt;
&lt;p&gt;
   The reason is simple. I've been ecstatic in my job and, every time I think to write
   something, I quickly end up turning to work and soon find that hours (days? months?)
   have passed. This is a wonderful problem to have, but not so good for keeping the
   blog looking fresh and new. (I've also been writing &lt;a href="http://soundcloud.com/45nm_joeduffy/sets/current/"&gt;a
   fair bit of music&lt;/a&gt; lately.) Well, this weekend I managed to lock myself out of
   my VPN access, and decided that this was a sign that I ought to dust off the cobwebs
   on a blog entry or two that I've had in the works for quite some time.
&lt;/p&gt;
&lt;p&gt;
   The topic for today is generics, a feature many of us know and love. Specifically,
   their impact on software performance, something I frequently see developers struggling
   to understand and tame in the wild. 
&lt;h1&gt;The blessing; and, the curse
&lt;/h1&gt;
&lt;p&gt;
   I absolutely love generics. I can hardly imagine writing code without them these days.
   The code reuse, higher-order expressiveness, beautiful abstractions, and static type-safety
   enabled by first class parametric polymorphism are all game-changing. And being a
   language history wonk, I'm delighted to see many mainstream programming languages
   stealing a page from ML and theoretical CS generally.
&lt;/p&gt;
&lt;p&gt;
   Generics, however, are not free. And in some circumstances, they are, dare I say,
   rather expensive. Few language features surpass generics in the ability to write a
   concise and elegant bit of code, which then translates into reams of ugly assembly
   code out the rear end of the compiler. I am of course speaking mainly to models in
   which compilation leads to code specialization (like .NET's), versus erasure (like
   Java's).
&lt;/p&gt;
&lt;p&gt;
   Most developers coming from a C++ background understand code expansion deeply, because
   they program with templates. Unlike templates, however, there is ample runtime type
   information (RTTI) associated with generic instantiations… such that the costs associated
   with generics frequently – and perhaps surprisingly – are a superset of those costs
   normally associated with C++ templates. At the same time, because the compiler understands
   parametric polymorphism, it can sometimes do a better job optimizing, e.g. with techniques
   like code sharing.
&lt;/p&gt;
&lt;p&gt;
   Basically, with templates and erasure, the equation for predicting code expansion
   is super simple. You get it all (in the former) or you get none of it (in the latter),
   but with specialized generics this equation is quite complex.
&lt;/p&gt;
&lt;p&gt;
   Paradoxically, these same costs are the main value that generics bring to the table!
   Write a little type-agnostic code and then "instantiate" that same code over multiple
   types without repeating yourself. But, generics are not magic; did you ever stop to
   wonder things like: What machine code is generated for these types? Does the compiler
   need to specialize the actual code that runs on the processor for unique instantiations,
   or is it all the same? And if it does need to specialize, where, how, and why? And
   perhaps most importantly, what hidden costs are there, and how should I think about
   them while writing code?
&lt;/p&gt;
&lt;p&gt;
   Before reading further, paranoia need not ensue. The point of this article is merely
   to raise your awareness. All programmers should know what the abstractions they use
   cost, and make conscious tradeoffs when writing code with them. The aforementioned
   benefits of generics really are often "worth it," both in the elegance and reusability
   of abstractions, and in developer productivity. In my experience, however, the associated
   costs are so subtle and ill-documented that even people who write highly generic code
   typically remain unaware of them. Even more subtly, these costs are somewhat different
   in nature when pre-compiling your code, such as with .NET's NGen technology.
&lt;/p&gt;
&lt;p&gt;
   This brief essay will walk through a few such costs in the context of the .NET Framework
   and CLR's implementation of generics. This is in no way an exhaustive study of generic
   compilation, and your mileage will vary from one platform to the next. Although the
   studies presented would apply to other implementations of generics, the reality is
   that if you're writing code in, say, Java – where type erasure is employed rather
   than code specialization – then all of this is going to be less relevant to you.
&lt;/p&gt;
&lt;p&gt;
   With no further delay, let's get started.
&lt;/p&gt;
&lt;h1&gt;Code, RTTI, oh my
&lt;/h1&gt;
&lt;p&gt;
   When considering costs, we must always think about both size and speed.
&lt;/p&gt;
&lt;p&gt;
   There is at least as much assembly code created for an instantiation as the code you've
   written for the generic abstraction in C# or MSIL. A simple mental model – that thankfully
   turns out isn't entirely accurate, thanks to some sharing optimizations described
   below – is that for each instantiation of a generic type or method you get a new copy
   of that code specialized to the type in question. Obviously, this increases code size.
   And just as obviously, it will add some runtime cost to JIT compile the code (if you
   aren't using ahead of time compilation), as well as putting more pressure on I-cache
   and TLB.
&lt;/p&gt;
&lt;p&gt;
   Another source of significant cost is the runtime data structures needed for RTTI
   and Reflection, like vtables and other metadata. Quite simply, the runtime needs to
   know the identity of each generic instantiation, to prevent things like casting a
   List&amp;lt;Int16&amp;gt; to a List&amp;lt;String&amp;gt;, and even List&amp;lt;Object&amp;gt; to List&amp;lt;String&amp;gt;;
   and given that there is often distinct code generated for unique instantiations, the
   vtable contents for those different List&amp;lt;T&amp;gt; instantiations are going to look
   quite different.
&lt;/p&gt;
&lt;p&gt;
   And of course, there are statics. Each generic instantiation gets its own set, requiring
   extra storage and another level of indirection when fetching them. Unique statics
   means D-cache and TLB pressure. It turns out that code shared across AppDomains, like
   mscorlib.dll, already need such things. But I have found that it's surprisingly common
   for a developer to throw a static field (or nested class!) onto an outer generic type,
   without actually needing it to be replicated for each unique instantiation.
&lt;/p&gt;
&lt;p&gt;
   In addition to the immediate effects, generic types often refer to other generic types
   which refer to other generic types … and so on. Instantiating a root type is akin
   to instantiating the full transitive closure.
&lt;/p&gt;
&lt;p&gt;
   To make our discussion friendly and familiar, we shall use the .NET Framework's List&amp;lt;T&amp;gt;
   type – presumably one of the most commonly used generic types on the planet – to illustrate
   many of these costs. And unfortunately, you'll also see that many of the common performance
   pitfalls plague this type too. (So, really, you need not feel bad if your own code
   is guilty of them too.)
&lt;/p&gt;
&lt;h1&gt;Why the distinct code, anyway?
&lt;/h1&gt;
&lt;p&gt;
   There is only one copy of List&amp;lt;T&amp;gt;'s code in mscorlib's MSIL. It is essentially
   just a blueprint for the list class.
&lt;/p&gt;
&lt;p&gt;
   When I create a List&amp;lt;Int16&amp;gt; in my program and use it, however, there clearly
   needs to be some assembly code created in order to execute List&amp;lt;T&amp;gt;'s associated
   functionality, just with any T's used by List&amp;lt;T&amp;gt;'s code replaced by actual 2-byte
   short integers. And similarly, if I were to instantiate a List&amp;lt;String&amp;gt;, all
   those T's need to be replaced by pointer-sized object references, either 4- or 8-bytes
   depending on machine architecture, that are reported live to the garbage collector.
&lt;/p&gt;
&lt;p&gt;
   This is what leads to our simple mental model above, in which each instantiation gets
   its own copy of the code. In this case, both List&amp;lt;Int16&amp;gt; and List&amp;lt;String&amp;gt;
   would be entirely independent types at runtime, with wholly separate copies of the
   machine code.
&lt;/p&gt;
&lt;p&gt;
   Certainly if I manually went about creating my own Int16List and StringList types,
   they would be distinct types with distinct machine code generated. Being a prudent
   developer, however, I'd probably try to arrange to share as much of the implementation
   as possible between the two types, perhaps using implementation inheritance. But alas,
   there's no way I could share it all: any code specific to Int16 or String, for example,
   would surely differ, both in MSIL and in the native code.
&lt;/p&gt;
&lt;p&gt;
   Generics basically give you the ability to do this same thing, without you needing
   to do the factoring of type-independent and type-specific code yourself. The compiler
   does that for you.
&lt;/p&gt;
&lt;p&gt;
   Why might the code be different? As stated above, Int16 values are 2 bytes and String
   pointers are native word sized (4 bytes on 32-bit, 8 bytes on 64-bit). All the code
   that passes values of type T on the stack, either as arguments or return values, moves
   instances into and out of memory locations (like the T[] backing array), and so on,
   needs to be specialized based on the size of T. This wouldn't be true of a generics
   implementation that used type erasure, like Java's, but then you'd need to box the
   value types on the heap so that everything is a pointer. If T is a Float, we will
   likely emit code that uses floating point math instead of general purpose registers.
   Any tables that report GC roots are likely to be different, since object references
   can be embedded inside struct values that get laid out on the stack. And so on. Some
   day you might want to compare the machine code for a simple generic Echo&amp;lt;T&amp;gt;
   method for different kinds of T's; it is really easy to do, and is quite illustrative.
&lt;/p&gt;
&lt;p&gt;
   A naïve wish might go as follows. Imagine that I had written my own dedicated Int16List
   and StringList types, and that we diffed the resulting machine code between the distinct
   list types; we'd presumably find a fair bit of duplication for all the reasons stated
   above. It would be a nice property if, when we used the generic List&amp;lt;Int16&amp;gt;
   and List&amp;lt;String&amp;gt; types, and similarly diffed the resulting assembly, the amount
   of specialized code would be no greater than the amount of specialized code between
   our best hand-written Int16List and StringList types. I.e., only parts that need to
   be different are different.
&lt;/p&gt;
&lt;p&gt;
   We could go even further with our wish. Imagine I had a List&amp;lt;DateTime&amp;gt; and List&amp;lt;Int64&amp;gt;.
   Both are 8-byte values, and do not contain any GC references. If I were writing a
   specialized 8ByteValueList in C++ and had immense performance constraints, I would,
   again being a prudent developer, probably use some type unsafe code, with nasty reinterpret_casts,
   so that I could use the same list type to store any kind of 8-byte value. (Except
   in C++ I could even store pointers!) It would also be a nice property if generics
   did some of this for us, while still retaining the type safety we love about generics.
&lt;/p&gt;
&lt;p&gt;
   It turns out we will get neither of our wishes exactly, although we will get something
   close to the spirit of our wishes.
&lt;/p&gt;
&lt;h1&gt;Code sharing
&lt;/h1&gt;
&lt;p&gt;
   Indeed, the CLR does arrange to share many generic instantiations. The rule is simple,
   although it is subject to change in the future (being an optimization and all): instantiations
   over reference types are shared among all reference type instantiations for that generic
   type/method, whereas instantiations over value types get their own full copy of the
   code. In other words, List&amp;lt;String&amp;gt; and List&amp;lt;Object&amp;gt; are backed by the
   same code, but List&amp;lt;DateTime&amp;gt; and List&amp;lt;Int64&amp;gt; get their own.
&lt;/p&gt;
&lt;p&gt;
   It is true that, in theory, List&amp;lt;DateTime&amp;gt; and List&amp;lt;Int64&amp;gt; could use the
   same shared code, because they are of identical size and have GC roots in the same
   locations (trivially, because neither has one). But there are additional restrictions
   on generated code that makes this problematic, for example if we were talking about
   Double and Int64. In short, the CLR doesn't actually share value type instantiations
   as of the 4.0 runtime, although clearly it could in certain situations (value types
   of the same size with GC roots in the same locations).
&lt;/p&gt;
&lt;p&gt;
   As you might guess, this extends to multi-parameter generics in obvious ways. A Dictionary&amp;lt;Object,
   Object&amp;gt; is shared with a Dictionary&amp;lt;String, String&amp;gt;, etc., and a Dictionary&amp;lt;Int64,
   Object&amp;gt; is shared with a Dictionary&amp;lt;Int64, String&amp;gt;. A Dictionary&amp;lt;DateTime,
   DateTime&amp;gt; is not, however, shared with a Dictionary&amp;lt;Int64, Int64&amp;gt; instantiation,
   as per the above.
&lt;/p&gt;
&lt;p&gt;
   My pal &lt;a href="http://callvirt.net/blog/"&gt;Joel Pobar&lt;/a&gt; wrote &lt;a href="http://blogs.msdn.com/b/joelpob/archive/2004/11/17/259224.aspx"&gt;a
   post eons ago describing how code sharing works&lt;/a&gt; in great detail, which I do not
   intend to rehash. Please refer to his post for an excellent overview of how code sharing
   works.
&lt;/p&gt;
&lt;p&gt;
   An important thing to remember, however, is that no matter how much code sharing happens,
   you still need distinct RTTI data structures. So although List&amp;lt;Object&amp;gt; and List&amp;lt;String&amp;gt;
   share the same machine code, they have distinct vtables; sure, each table is full
   of pointers to the same code functions, but you are still paying for the runtime data
   structures. A distinct instantiation, therefore, is never actually free!
&lt;/p&gt;
&lt;h1&gt;Transitive closures
&lt;/h1&gt;
&lt;p&gt;
   Why am I making such a big deal about code sharing, anyway?
&lt;/p&gt;
&lt;p&gt;
   Another surprising aspect of generics is the transitive closure problem. Particularly
   when doing pre-compilation of generics, each unique instantiation doesn't simply lead
   to a specialized version of the code associated with the type being directly instantiated.
   The whole transitive closure of types, starting with that root type, will also be
   compiled. This can be a surprisingly huge number of types! JIT is much more pay-for-play,
   such that you get one level of explosion at a time, but once there is code that calls
   a particular type's method, even if that code is lazily compiled, creation of the
   type is forced.
&lt;/p&gt;
&lt;p&gt;
   To illustrate this, let's take our friend List&amp;lt;T&amp;gt;. Before examining the list,
   how many generic types would you expect that a single new List&amp;lt;T&amp;gt; instantiation
   instantiates?
&lt;/p&gt;
&lt;p&gt;
   What if I told you that a single List&amp;lt;int&amp;gt; instantiation creates (at least)
   28 types? And that, say, five unique instantiations of List&amp;lt;T&amp;gt; might cost you
   300K of disk space and 70K of working set? Well, of course, if you are writing a script,
   or something with fairly loose performance requirements, this might not matter much.
   But if topics like download time, mobile footprint, and cache performance are important
   to you, then you probably want to pay attention to this. To a first approximation,
   size &lt;u&gt;&lt;i&gt;is&lt;/i&gt;&lt;/u&gt; speed.
&lt;/p&gt;
&lt;p&gt;
   Yes, you heard me right: 28 types. Holy smokes... How can this be?!
&lt;/p&gt;
&lt;p&gt;
   Nested types are one obvious answer, and indeed List&amp;lt;T&amp;gt; has two: an Enumerator
   class (which is reasonable), and one to support the legacy synchronized collections
   pattern (which we presumably wish we didn't have to pay for). The larger answer here,
   however, is functionality. Yes, functionality! This is a great example where the cost
   of generics explodes as you add more features. Start simple, keep adding stuff, as
   has happened to List&amp;lt;T&amp;gt; over the years, and you will soon find that a series
   of elegant abstractions adds up to a gut-wrenching bucket of bytes.
&lt;/p&gt;
&lt;p&gt;
   Here's a quick sketch of the transitive closure of generic types used by List&amp;lt;T&amp;gt;:
&lt;/p&gt;
&lt;p&gt;
&lt;pre&gt;&lt;code&gt;List&amp;lt;T&amp;gt; T[] type IList&amp;lt;T&amp;gt; type ICollection&amp;lt;T&amp;gt; type IEnumerable&amp;lt;T&amp;gt;
type IEnumerator&amp;lt;T&amp;gt; type ReadOnlyCollection&amp;lt;T&amp;gt; type (AsReadOnly) (Nothing
more than List&amp;lt;T&amp;gt;) IComparer&amp;lt;T&amp;gt; type (BinarySearch, Sort) {Array.BinarySearch&amp;lt;T&amp;gt;
method (BinarySearch)} ArraySortHelper&amp;lt;T&amp;gt; type IArraySortHelper&amp;lt;T&amp;gt; type
GenericArraySortHelper&amp;lt;T&amp;gt; type EqualityComparer&amp;lt;T&amp;gt; type (Contains) IEqualityComparer&amp;lt;T&amp;gt;
type IEquatable&amp;lt;T&amp;gt; type NullableEqualityComparer&amp;lt;T&amp;gt; type Nullable&amp;lt;T&amp;gt;
type EnumEqualityComparer&amp;lt;T&amp;gt; type {JitHelpers.UnsafeEnumCast&amp;lt;T&amp;gt; method}
ObjectEqualityComparer&amp;lt;T&amp;gt; type Predicate&amp;lt;T&amp;gt; delegate type (Find*) Action&amp;lt;T&amp;gt;
delegate type (ForEach) {Array.LastIndexOf&amp;lt;T&amp;gt; method (LastIndexOf)} Comparison&amp;lt;T&amp;gt;
delegate type (Sort) Array.FunctorComparer&amp;lt;T&amp;gt; type (Sort) Comparer&amp;lt;T&amp;gt;
type GenericComparer&amp;lt;T&amp;gt; type NullableComparer&amp;lt;T&amp;gt; type ObjectComparer&amp;lt;T&amp;gt;
type {Array.Sort&amp;lt;T&amp;gt; method (Sort)} ArraySortHelper&amp;lt;T&amp;gt; type (see earlier)
Enumerator inner type SynchronizedList&amp;lt;T&amp;gt; inner type IList&amp;lt;T&amp;gt; interface
(see earlier) ICollection&amp;lt;T&amp;gt; interface (see earlier) IEnumerable&amp;lt;T&amp;gt; interface
(see earlier) {Interlocked.CompareExchange&amp;lt;Object&amp;gt; method (SyncRoot)} {_emptyArray
T[] static field} &lt;/code&gt;&lt;/pre&gt;
&gt;
&lt;p&gt;
   I'm not trying to pick on List&amp;lt;T&amp;gt;. This class is only unique in this regard
   in that it offers a large transitive closure of (mostly useful!) functionality. And
   it's not the only guilty party. We recently shaved off 100K's of code size on my team,
   for example, that were being lost simply because all the LINQ methods were declared
   as instance methods on the base collection class, rather than being extension methods
   as in .NET. We found nested enumerator and iterator types, cached static lambdas as
   static fields, and huge transitive closures of other generic types, all allocated
   when you just touched any collection type. Any collections library is apt to be full
   of this stuff, since they are highly generic. But collection libraries are certainly
   not the only places to go sniffing for such problems.
&lt;/p&gt;
&lt;p&gt;
   As an aside, it turns out that extension methods are a great way to make generic abstractions
   more pay-for-play.
&lt;/p&gt;
&lt;h1&gt;Adding it up
&lt;/h1&gt;
&lt;p&gt;
   Let's see what the above adds up to. I ran some programs through NGen as a quick and
   dirty experiment, and inspected the on-disk sizes and also the runtime working set
   sizes. I ensured clrjit.dll was not loaded into the process. Here's what I found.
   Take these numbers with a grain of salt, as they will change from release to release;
   they are simply rules of thumb. When in doubt, crank up NGen, DumpBin, and/or start
   trawling the heap with VADump yourself!
&lt;/p&gt;
&lt;p&gt;
   One empty type with no methods in CLR 4.0 seems to cost roughly 0.2K bytes of on-disk
   metadata, and about 0.7K in x64 working set. (This is a good rule of thumb irrespective
   of generics… in terms of order of magnitude, you can think "one empty type means 1K
   of memory.") A single List&amp;lt;S&amp;gt; instantiation, where S is an empty struct, is
   in the neighborhood of 60K on-disk metadata, and 14K of x64 working set. A single
   List&amp;lt;C&amp;gt; instantiation, however, where C is an empty class, is only – surprise
   – about 7K on-disk and 4K in-memory. Why the large discrepancy? Well, it just so happens
   that mscorlib.dll already includes an instantiation or two of List&amp;lt;T&amp;gt; over reference
   types, so this 4K is the incremental cost on top of reusing what is there; remember,
   there are still unique vtables and data structures still required for RTTI.
&lt;/p&gt;
&lt;p&gt;
   Rico did &lt;a href="http://blogs.msdn.com/b/ricom/archive/2005/08/26/performance-quiz-7-generics-improvements-and-costs-solution.aspx"&gt;a
   similar analysis a few years back&lt;/a&gt;, and concluded that each unique List&lt;E&gt;
      , where E was an enum type, cost 8K. Why the increase to 14K over the years? x64 and
      ever-increasing functionality on the basic collections classes, presumably. Remember,
      it's not just List&amp;lt;T&amp;gt; that has grown, it's also everything that List&amp;lt;T&amp;gt;
      uses internally as an implementation detail.
&lt;/p&gt;
&lt;h1&gt;Dynamic specialization with dictionaries
&lt;/h1&gt;
&lt;p&gt;
   Some specialization in behavior can be accomplished with dynamic runtime behavior,
   rather than static code specialization. A prime example is the following:
&lt;/p&gt;
&lt;p&gt;
&lt;pre&gt;&lt;code&gt;class C { public static void M&amp;lt;T&amp;gt;() { System.Console.WriteLine(typeof(T).Name);
} } &lt;/code&gt;&lt;/pre&gt;
&gt;
&lt;p&gt;
   Where does the program get the value of typeof(T) from? If you look at the MSIL, you
   will see that C# has emitted a ldtoken MSIL instruction. For some struct type, we
   can compile that as a constant in the code, because it is getting its own copy of
   the code. What occurs when two instantiations share code, like M&amp;lt;String&amp;gt; and
   M&amp;lt;Object&amp;gt;, however? As you might guess, there is an indirection.
&lt;/p&gt;
&lt;p&gt;
   The thing we usually use for such indirections – the vtable – is nowhere to be found
   in this particular example, because M is a static method. To deal with this, the compiler
   inserts an extra "hidden" argument, frequently called a generic dictionary, from which
   the emitted assembly code can fetch the type token. The cost here typically isn't
   bad, because many of the operations that pull in the dictionary are already RTTI or
   Reflection-based, and would require an indirection already (e.g., through a vtable).
&lt;/p&gt;
&lt;p&gt;
   The operations which require a dictionary of some kind include anything that has to
   do with RTTI and yet no vtable is readily accessible: typeof, casts, is and as operators,
   etc. And as you might guess, if instantiations aren't shared (such as with value types
   on the CLR), no dictionary is needed, because the code is fully specialized. There
   are also multiple kinds of dictionaries used by the runtime, depending on whether
   you are using a generic type, method, or some combination of both.
&lt;/p&gt;
&lt;h1&gt;JITting when you didn't mean to
&lt;/h1&gt;
&lt;p&gt;
   There are two primary ways in which you will JIT compile when using generics, even
   if you were good doobie and used NGen to reduce startup time.
&lt;/p&gt;
&lt;p&gt;
   One way is if you instantiate a new generic type exported from mscorlib.dll with a
   type argument also defined in mscorlib.dll, that wasn't already instantiated inside
   mscorlib.dll. (See my old &lt;a href="http://www.bluebytesoftware.com/blog/2005/03/23/DGUpdateGenericsAndPerformance.aspx"&gt;Generics
   and Performance&lt;/a&gt; blog entry for more details.) You can very easily see this happening
   by using an instantiation like Dictionary&amp;lt;DateTime, DateTime&amp;gt;, and watching
   the clrjit.dll module getting loaded.
&lt;/p&gt;
&lt;p&gt;
   The other way is generic virtual methods (GVMs). It turns out that GVMs pose incredible
   difficulty for ahead of time separate compilation, because the compiler cannot know
   statically which slot in the vtable points at the particular implementation you are
   about to call. (Unless you use whole program compilation, something not offered by
   .NET at present time.) For each such method, there's an unbounded set of possible
   specialized instantiations a slot might point to, and so the vtable cannot be laid
   out in a traditional manner. C++ doesn't allow templated virtual methods for this
   very reason.
&lt;/p&gt;
&lt;p&gt;
   Thankfully, GVMs are somewhat rare. However, it only took 5 minutes of poking around
   to find one that is quite front-and-center in .NET: in the implementation of LINQ,
   there is an Iterator&amp;lt;T&amp;gt; type that has a method declared as follows:
&lt;/p&gt;
&lt;p&gt;
&lt;pre&gt;&lt;code&gt;public abstract IEnumerable&amp;lt;TResult&amp;gt; Select&amp;lt;TResult&amp;gt;(Func&amp;lt;TSource,
TResult&amp;gt; selector); &lt;/code&gt;&lt;/pre&gt;
&gt;
&lt;p&gt;
   All we need to do is figure out how to tickle that method, and we're guaranteed to
   JIT. As it turns out, sure enough, the following code does the trick and forces clrjit.dll
   to get loaded in .NET 4.0:
&lt;/p&gt;
&lt;p&gt;
&lt;pre&gt;&lt;code&gt;int[] xs = …; int[] ys = xs.Where(x =&gt; true).Select(x =&gt; x).ToArray(); &lt;/code&gt;&lt;/pre&gt;
&gt;
&lt;p&gt;
   The Iterator&amp;lt;T&amp;gt; type is used for back-to-back Where and Select operators, as
   a performance optimization that avoids excess allocations and interface dispatch.
   But because it depends on a GVM, it does incur an initial penalty for using it, even
   if you have used NGen to avoid runtime code generation.
&lt;/p&gt;
&lt;h1&gt;In conclusion
&lt;/h1&gt;
&lt;p&gt;
   The moral of the story here is &lt;i&gt;not&lt;/i&gt; that you should fear generics. Beautiful
   things can be built with them.
&lt;/p&gt;
&lt;p&gt;
   Instead, it's to use generics thoughtfully. Nothing in life is free, and generics
   are no exception to this rule. If code size is important to you, then you will want
   to have performance gates measuring your numbers against your goals; if you are working
   in a codebase that uses generics heavily, and you end up spending any significant
   time on code size optimizations, you will want to try to track down large transitive
   closures. As I stated above, you could really be throwing away 100K's of code here.
&lt;/p&gt;
&lt;p&gt;
   And as to the surprise JITting, I've seen teams compiling with NGen and having a functional
   gate that fails any new code that causes clrjit.dll to get loaded at runtime. Although
   tracking down the root cause might be tricky when that gate fails, at least you won't
   let the camel's nose under the tent.
&lt;/p&gt;
&lt;p&gt;
   Investing in tools here is a very good idea.
&lt;/p&gt;
&lt;p&gt;
   When it comes down to it, really thinking about what code must be executed by the
   process is helpful. Step back and imagine you were writing this all in C++, with the
   associated performance concerns front-and-center: consider how you'd arrange to reuse
   as much implementation as you can, manage memory efficiently, perhaps employ unsafe
   tricks that would have violated type safety and so are offlimits in .NET, and all
   that jazz. Then step back and be grateful that you have a type- and memory-safe environment
   to help you write more robust code, but also be realistic about what you are paying
   in exchange.
&lt;/p&gt;
&lt;p&gt;
   I hope you've learned a useful thing or two in this article. If you'd like to learn
   more, here are a few other good resources:
&lt;/p&gt;
&lt;ol&gt;
   &lt;li&gt;
      An MSR paper on the original implementation of .NET generics: &lt;a href="http://research.microsoft.com/pubs/64031/designandimplementationofgenerics.pdf"&gt;http://research.microsoft.com/pubs/64031/designandimplementationofgenerics.pdf&lt;/a&gt; 
   &lt;li&gt;
      Rico's "Six Questions About Generics and Performance" blog entry: &lt;a href="http://blogs.msdn.com/b/ricom/archive/2004/09/13/229025.aspx"&gt;http://blogs.msdn.com/b/ricom/archive/2004/09/13/229025.aspx&lt;/a&gt; 
   &lt;li&gt;
      Joel Pobar's "Generics and Code Sharing" blog post: &lt;a href="http://blogs.msdn.com/b/joelpob/archive/2004/11/17/259224.aspx"&gt;http://blogs.msdn.com/b/joelpob/archive/2004/11/17/259224.aspx&lt;/a&gt; 
   &lt;li&gt;
      My "Generics and Performance" blog entry: &lt;a href="http://www.bluebytesoftware.com/blog/2005/03/23/DGUpdateGenericsAndPerformance.aspx"&gt;http://www.bluebytesoftware.com/blog/2005/03/23/DGUpdateGenericsAndPerformance.aspx&lt;/a&gt; 
&lt;/ol&gt;
&lt;p&gt;
   Cheers.
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=2d01b5eb-7f19-4a4c-a3b3-1cecb1e3dc75" /&gt;</description>
      <category>Technology</category>
    </item>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=d59f5b8a-0373-4143-8f98-999e044a4ad0</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,d59f5b8a-0373-4143-8f98-999e044a4ad0.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
      InfoQ recently asked me a few questions about concurrency and programming languages,
      and here is what I had to say:
   </p>
        <blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
          <p>
            <a href="http://www.infoq.com/articles/Interview-Joe-Duffy">http://www.infoq.com/articles/Interview-Joe-Duffy</a>
          </p>
        </blockquote>
        <p>
      A little teaser:
   </p>
        <blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
          <p>
            <em>"The major shift we face will be that mainstream languages will start to incorporate
      more concurrency-safety -- immutability and isolation -- and the platform libraries
      and architectures will better support this style of software decomposition. OOP developers
      are accustomed to partitioning their program into classes and objects; now they will
      need to become accustomed to partitioning their program into asynchronous Actors that
      can run concurrently. Within this sea of asynchrony will lay ordinary imperative code,
      frequently augmented with fine-grained task and data parallelism."</em>
          </p>
        </blockquote>
        <p>
      As an aside, I know I've been super quiet lately.  I never thought I'd go months
      without blogging.  My sincere apologies for this; work has been too all-consuming
      / fun, and I've been unable to carve out much time for anything else.  (Speaking
      of which, we are still hiring: email me at joedu at you-know-where dot com if you
      are interested.)  I'm about to head out to Europe for a few weeks, where hopefully
      I'll have a bit of time to write up something exciting to share.  Cheers.
   </p>
        <img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=d59f5b8a-0373-4143-8f98-999e044a4ad0" />
      </body>
      <title>An interview with InfoQ</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,d59f5b8a-0373-4143-8f98-999e044a4ad0.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2011/06/01/AnInterviewWithInfoQ.aspx</link>
      <pubDate>Wed, 01 Jun 2011 17:57:44 GMT</pubDate>
      <description>&lt;p&gt;
   InfoQ recently asked me a few questions about concurrency and programming languages,
   and here is what I had to say:
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p&gt;
   &lt;a href="http://www.infoq.com/articles/Interview-Joe-Duffy"&gt;http://www.infoq.com/articles/Interview-Joe-Duffy&lt;/a&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   A little teaser:
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p&gt;
   &lt;em&gt;"The major shift we face will be that mainstream languages will start to incorporate
   more concurrency-safety -- immutability and isolation -- and the platform libraries
   and architectures will better support this style of software decomposition. OOP developers
   are accustomed to partitioning their program into classes and objects; now they will
   need to become accustomed to partitioning their program into asynchronous Actors that
   can run concurrently. Within this sea of asynchrony will lay ordinary imperative code,
   frequently augmented with fine-grained task and data parallelism."&lt;/em&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   As an aside, I know I've been super quiet lately.&amp;nbsp; I never thought I'd go months
   without blogging.&amp;nbsp; My sincere apologies for this; work has been too all-consuming
   / fun, and I've been unable to carve out much time for anything else.&amp;nbsp; (Speaking
   of which, we are still hiring: email me at joedu at you-know-where dot com if you
   are interested.)&amp;nbsp; I'm about to head out to Europe for a few weeks, where hopefully
   I'll have a bit of time to write up something exciting to share.&amp;nbsp; Cheers.
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=d59f5b8a-0373-4143-8f98-999e044a4ad0" /&gt;</description>
      <category>Technology</category>
    </item>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=a0484627-c752-45f6-a2ac-414130cb3d2f</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,a0484627-c752-45f6-a2ac-414130cb3d2f.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
      After spending more time than I’d like to admit over the years researching memory
      model literature (particularly Java’s terrific JMM and related publications), subsequently
      trying to “fix” the CLR memory model, reviewing proposals for new C++ memory models,
      and beating my head against the wall for months developing a new memory model that
      supports weakly ordered processors like the kind you’ll find on ARM in a developer-friendly
      yet power-efficient way, I have a conclusion to share.
   </p>
        <h1>Volatile is evil
   </h1>
        <p>
      Why? Let me recount the reasons:
   </p>
        <ol>
          <li>
         It doesn’t mean what you think. 
      </li>
          <li>
         It used to have a very specific purpose — to enure memory operations with external
         side-effects did not get reordered — and has since gotten bastardized and used for
         many secondarily-related purposes. 
      </li>
          <li>
         Even if you think it does mean what you think, the annotation scheme is all wrong.
         Volatile annotates a storage location, and yet what really matters is what happens
         when accessing said storage locations. The fences occur when you access the variable,
         not when you declare it. And yet from a readability perspective, they are completely
         invisible and easy to miss. 
      </li>
          <li>
         And even if you don’t care about readability, the meaning of volatile changes wildly
         when you switch platforms. Today it’s store / release, tomorrow it’s write / read
         fences. Perhaps it’s even sequentially consistent. And the label of “store / release”
         could actually be a white lie, as with the CLR’s memory model thanks to store buffer
         forwarding and the lack of fences in the CLR JIT’s x64 stores. 
      </li>
          <li>
         Performance, man, performance! Sure sequential consistency as the default sounds nice
         on the tin, but once you’re running that mobile app on ARM, and sucking up 160 cycles
         for each write you perform, you’re going to curse volatile like the plague. 
      </li>
        </ol>
        <p>
      And so the moral of the story follows...
   </p>
        <h1>Attempting to “fix” volatile is a waste of time
   </h1>
        <p>
      Instead, a new world order has arrived. We must take a two-pronged approach to solving
      instruction-level interleaving bugs, neither prong having much to do with the traditional
      definition of memory models or volatiles. We must:
   </p>
        <ol>
          <li>
         Eliminate memory ordering from 99% of developers’ purviews. This is already the case
         with single-threaded programs, because code motion in compilers and processors is
         limited to what only affects concurrent observations. So the answer is pretty clear:
         developers must move towards single-threaded programming models connected through
         message passing, optionally with provably race-free fine-grained parallelism inside
         of those single-threaded worlds. 
      </li>
          <li>
         Leave the memory model esoterica to the Einsteins, and radically change its meaning.
         Data dependence and transitive visibility of memory operations are in. Volatiles on
         storage locations are out. Instead, we must throw fences into programmer’s faces,
         and force them to understand each and every one that occurs. And moreover, force them
         to decide about each and every one that occurs. Specifically, hidden fences thanks
         to volatile are no longer. Those who cannot take it should fall into the 99% bucket
         already mentioned above, versus the 1% bucket. 
      </li>
        </ol>
        <p>
      Let’s set #1 aside for now, since it’s obviously a huge can of worms.
   </p>
        <p>
      But what about #2? It is quite encouraging that the C++0x group is firmly on the path
      of #2. See <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2145.html">http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2145.html</a> for
      more details. In a nutshell, each location that you’d have ordinarily tagged volatile
      instead becomes a template atomic<T>
         type. And then each read / write has the opportunity to specify the desired kind of
         fence, whether that is acquire (for reads), release (for writes), fully ordered, or
         relaxed (meaning no fence).
      </T></p>
        <p>
      I do think it’d be worth them considering compiler-only fences too. So that relaxed
      means fully-relaxed, and there is a fence in-between that merely prevents the compiler
      from optimizing the memory operation. This pays homage to volatile’s legacy in C as
      merely a variable that mustn’t be subject to optimizations, because operations against
      these variables pertain to, say, memory-mapped device I/O.
   </p>
        <p>
      Another nitpick of mine is that I’d have required each access to specify the fence,
      whereas C++0x implicitly uses full fence if left unspecified at the callsite. It’s
      a minor convenience, but I like always having the fence spelled out very explicitly
      in the code. Lock-free access to shared variables is sufficiently dangerous that automatic
      sequential consistency is the least of your problems.
   </p>
        <p>
      Nevertheless, the C++0x direction is a massive good step forward, and these are just
      minor details.
   </p>
        <p>
      My hope is that .NET follows suit. And the timing couldn’t be more apropos as “now”:
      we are moving forward in a heavily mobile, distributed-system-on-a-chip, and heterogeneous
      world, where processor memory models will necessarily continue to weaken. The overly
      strong x86 memory model, kept alive primarily to ensure compatibility, has simply
      grown too expensive to accommodate. The power benefits and architectural simplifications
      are hard to argue with, and because compatibility becomes less of an issue as new
      platforms arise (e.g. for mobile), the world moves to the cloud, and hence there is
      legacy to worry about, I do hope that processor vendors seize the opportunity. ARM
      certainly has. It is less about out-of-order execution as it is about coherency costs.
      Truthfully, I’d be disappointed if anything else happened, even though the risk to
      compatibility for shrink-wrapped software scares the hell out of me. But this is most
      certainly the right way to go, long-term. As software platforms move in the direction
      of #1 – as I also foresee – the need for fences dwindles. The cost of supporting the
      current .NET memory model is too great and will become a liability with time.
   </p>
        <p>
      Thankfully, it is quite simple to build a veneer atop .NET that works a lot more like
      atomic<T>
         . For example, imagine that we had a new System.Threading.Volatile static class, and
         that it offered the moral equivalent to atomic<T>
            inner types for each atomic primitive we can synchronize against:
         </T></T></p>
        <p>
        </p>
        <pre>
          <code>namespace System.Threading { public static class Volatile { public
   struct Int32 {..} public struct Int64 {..} … public struct Reference<T>
      where T : class {..} … } }
   </T></code>
        </pre>
        <p>
      Now instead of tagging a location as ‘volatile’, you would use one of these primitives.
      For example, rather than:
   </p>
        <p>
        </p>
        <pre>
          <code>static volatile MySingleton s_instance;</code>
        </pre>
        <p>
      You would say:
   </p>
        <p>
        </p>
        <pre>
          <code>static Volatile.Reference&lt;MySingleton&gt; s_instance;</code>
        </pre>
        <p>
      Each class has a similar set of operations. For example:
   </p>
        <p>
        </p>
        <pre>
          <code>namespace System.Threading { public static class Volatile { public
   struct Int32 { public Int32(int value); public int ReadAcquireFence(); public int
   ReadFullFence(); public int ReadCompilerOnlyFence(); public int ReadUnfenced(); public
   void WriteReleaseFence(int newValue); public void WriteFullFence(int newValue); public
   void WriteCompilerOnlyFence(int newValue); public void WriteUnfenced(int newValue);
   public int AtomicCompareExchange(int newValue, int comparand); public int AtomicExchange(int
   newValue); public int AtomicAdd(int delta); public int AtomicIncrement(); public int
   AtomicDecrement(); // Etc…, bitwise ops, other math ops, etc. } } }</code>
        </pre>
        <p>
      Of course, only the integer types would offer the increment, decrement, add, and related
      operators. And it turns out that offering different kinds of fences on the Atomic*
      operations would be incredibly useful too, because processors like ARM do not couple
      the fence to the compare-and-swap / load-locked-store-conditional as x86 processors
      do. Taking advantage of this can be huge if you are writing performance critical code,
      like a concurrent garbage collector whose atomic swaps need not imply ordering with
      the surrounding instruction stream. You can quibble over the details, like whether
      these should use enums instead of the name to encode the fence-kind. I did it this
      way to keep the implementations branch-free, although with a decent inlining JIT compiler,
      it’d probably optimize those away thanks to constant propagation.
   </p>
        <p>
      It’s quite trivial to implement these APIs atop existing .NET primitives. I built
      a little library that does so, but it was so boring and repetitive I decided not to
      post it alongside this blog entry as originally intended.
   </p>
        <p>
      With the above definition, we can very clearly see the fences involved in doing, say,
      double-checked locking:
   </p>
        <p>
        </p>
        <pre>
          <code>static Volatile.Reference&lt;MySingleton&gt; s_instance; public static
   MySingleton Instance { get { MySingleton instance = s_instance.ReadAcquireFence();
   if (instance == null) { instance = new MySingleton(); instance = s_instance.AtomicCompareExchangeRelease(instance,
   null); } return instance; } }</code>
        </pre>
        <p>
      We see there are two fences. One is an acquire and, depending on what your memory
      model says about data dependence, is probably unnecessary. Most sane memory models
      guarantee that data dependent loads do not pass. So we needn’t worry that we’ll see
      a non-null s_instance whose contents haven’t been initialized. (If we were talking
      structs, it’d be another story.) Nevertheless, it’s definitely required that we use
      a release-only fence for the publication of the object. This guarantees writes to
      fields within the MySingleton constructor have completed prior to the write of the
      new object to the shared instance field. The point here is that you are forced to
      think about the fences, and you actually see them.
   </p>
        <p>
      Of course, most platforms need to provide the bare minimum of fencing to assure type
      safety, particularly for languages like C#. My understanding is that C++0x has decided,
      at least for now, not to offer type-safety in the face of multithreading. That means
      you might publish an object and, if stores occur out-of-order, the reader could see
      an object partially initialized with an invalid vtable pointer. In C# and Java, the
      language designers have thankfully decided to shield programmers from this. The need
      for fences also extends to unsafe code like strings, where – were it possible for
      a thread to read the non-zero length before the char* pointer was valid – writes to
      random memory could occur and hence threaten type-safety. Thankfully, again, C# and
      Java protect developers from this, mostly due to the automatic zero’ing of memory
      as the GC allocates and hands out new segments.
   </p>
        <p>
      There are costs to offering this type safety assurance. So you can understand why
      the C++ designers want to keep fences out of object allocation. If you have #1 above,
      however, the costs are dramatically lower and more acceptable. But the world is –
      unfortunately – still a freethreaded one, and we have several years to go before we’ve
      reached the final destination. As a step forward, however, the death of volatile is
      a welcomed one. Say it with me.
   </p>
        <h1>“Sayonara volatile.”
   </h1>
        <p>
      Here’s hoping that .NET 5.0 takes this step forward too.
   </p>
        <img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=a0484627-c752-45f6-a2ac-414130cb3d2f" />
      </body>
      <title>Sayonara volatile</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,a0484627-c752-45f6-a2ac-414130cb3d2f.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2010/12/04/SayonaraVolatile.aspx</link>
      <pubDate>Sat, 04 Dec 2010 23:16:34 GMT</pubDate>
      <description>&lt;p&gt;
   After spending more time than I’d like to admit over the years researching memory
   model literature (particularly Java’s terrific JMM and related publications), subsequently
   trying to “fix” the CLR memory model, reviewing proposals for new C++ memory models,
   and beating my head against the wall for months developing a new memory model that
   supports weakly ordered processors like the kind you’ll find on ARM in a developer-friendly
   yet power-efficient way, I have a conclusion to share.
&lt;/p&gt;
&lt;h1&gt;Volatile is evil
&lt;/h1&gt;
&lt;p&gt;
   Why? Let me recount the reasons:
&lt;/p&gt;
&lt;ol&gt;
   &lt;li&gt;
      It doesn’t mean what you think. 
   &lt;li&gt;
      It used to have a very specific purpose — to enure memory operations with external
      side-effects did not get reordered — and has since gotten bastardized and used for
      many secondarily-related purposes. 
   &lt;li&gt;
      Even if you think it does mean what you think, the annotation scheme is all wrong.
      Volatile annotates a storage location, and yet what really matters is what happens
      when accessing said storage locations. The fences occur when you access the variable,
      not when you declare it. And yet from a readability perspective, they are completely
      invisible and easy to miss. 
   &lt;li&gt;
      And even if you don’t care about readability, the meaning of volatile changes wildly
      when you switch platforms. Today it’s store / release, tomorrow it’s write / read
      fences. Perhaps it’s even sequentially consistent. And the label of “store / release”
      could actually be a white lie, as with the CLR’s memory model thanks to store buffer
      forwarding and the lack of fences in the CLR JIT’s x64 stores. 
   &lt;li&gt;
      Performance, man, performance! Sure sequential consistency as the default sounds nice
      on the tin, but once you’re running that mobile app on ARM, and sucking up 160 cycles
      for each write you perform, you’re going to curse volatile like the plague. 
&lt;/ol&gt;
&lt;p&gt;
   And so the moral of the story follows...
&lt;/p&gt;
&lt;h1&gt;Attempting to “fix” volatile is a waste of time
&lt;/h1&gt;
&lt;p&gt;
   Instead, a new world order has arrived. We must take a two-pronged approach to solving
   instruction-level interleaving bugs, neither prong having much to do with the traditional
   definition of memory models or volatiles. We must:
&lt;/p&gt;
&lt;ol&gt;
   &lt;li&gt;
      Eliminate memory ordering from 99% of developers’ purviews. This is already the case
      with single-threaded programs, because code motion in compilers and processors is
      limited to what only affects concurrent observations. So the answer is pretty clear:
      developers must move towards single-threaded programming models connected through
      message passing, optionally with provably race-free fine-grained parallelism inside
      of those single-threaded worlds. 
   &lt;li&gt;
      Leave the memory model esoterica to the Einsteins, and radically change its meaning.
      Data dependence and transitive visibility of memory operations are in. Volatiles on
      storage locations are out. Instead, we must throw fences into programmer’s faces,
      and force them to understand each and every one that occurs. And moreover, force them
      to decide about each and every one that occurs. Specifically, hidden fences thanks
      to volatile are no longer. Those who cannot take it should fall into the 99% bucket
      already mentioned above, versus the 1% bucket. 
&lt;/ol&gt;
&lt;p&gt;
   Let’s set #1 aside for now, since it’s obviously a huge can of worms.
&lt;/p&gt;
&lt;p&gt;
   But what about #2? It is quite encouraging that the C++0x group is firmly on the path
   of #2. See &lt;a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2145.html"&gt;http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2145.html&lt;/a&gt; for
   more details. In a nutshell, each location that you’d have ordinarily tagged volatile
   instead becomes a template atomic&lt;T&gt;
      type. And then each read / write has the opportunity to specify the desired kind of
      fence, whether that is acquire (for reads), release (for writes), fully ordered, or
      relaxed (meaning no fence).
&lt;/p&gt;
&lt;p&gt;
   I do think it’d be worth them considering compiler-only fences too. So that relaxed
   means fully-relaxed, and there is a fence in-between that merely prevents the compiler
   from optimizing the memory operation. This pays homage to volatile’s legacy in C as
   merely a variable that mustn’t be subject to optimizations, because operations against
   these variables pertain to, say, memory-mapped device I/O.
&lt;/p&gt;
&lt;p&gt;
   Another nitpick of mine is that I’d have required each access to specify the fence,
   whereas C++0x implicitly uses full fence if left unspecified at the callsite. It’s
   a minor convenience, but I like always having the fence spelled out very explicitly
   in the code. Lock-free access to shared variables is sufficiently dangerous that automatic
   sequential consistency is the least of your problems.
&lt;/p&gt;
&lt;p&gt;
   Nevertheless, the C++0x direction is a massive good step forward, and these are just
   minor details.
&lt;/p&gt;
&lt;p&gt;
   My hope is that .NET follows suit. And the timing couldn’t be more apropos as “now”:
   we are moving forward in a heavily mobile, distributed-system-on-a-chip, and heterogeneous
   world, where processor memory models will necessarily continue to weaken. The overly
   strong x86 memory model, kept alive primarily to ensure compatibility, has simply
   grown too expensive to accommodate. The power benefits and architectural simplifications
   are hard to argue with, and because compatibility becomes less of an issue as new
   platforms arise (e.g. for mobile), the world moves to the cloud, and hence there is
   legacy to worry about, I do hope that processor vendors seize the opportunity. ARM
   certainly has. It is less about out-of-order execution as it is about coherency costs.
   Truthfully, I’d be disappointed if anything else happened, even though the risk to
   compatibility for shrink-wrapped software scares the hell out of me. But this is most
   certainly the right way to go, long-term. As software platforms move in the direction
   of #1 – as I also foresee – the need for fences dwindles. The cost of supporting the
   current .NET memory model is too great and will become a liability with time.
&lt;/p&gt;
&lt;p&gt;
   Thankfully, it is quite simple to build a veneer atop .NET that works a lot more like
   atomic&lt;T&gt;
      . For example, imagine that we had a new System.Threading.Volatile static class, and
      that it offered the moral equivalent to atomic&lt;T&gt;
         inner types for each atomic primitive we can synchronize against:
&lt;/p&gt;
&lt;p&gt;
&lt;pre&gt;&lt;code&gt;namespace System.Threading { public static class Volatile { public struct
Int32 {..} public struct Int64 {..} … public struct Reference&lt;T&gt;
   where T : class {..} … } }
&lt;/code&gt;&lt;/pre&gt;
&gt;
&lt;p&gt;
   Now instead of tagging a location as ‘volatile’, you would use one of these primitives.
   For example, rather than:
&lt;/p&gt;
&lt;p&gt;
&lt;pre&gt;&lt;code&gt;static volatile MySingleton s_instance;&lt;/code&gt;&lt;/pre&gt;
&gt;
&lt;p&gt;
   You would say:
&lt;/p&gt;
&lt;p&gt;
&lt;pre&gt;&lt;code&gt;static Volatile.Reference&amp;lt;MySingleton&amp;gt; s_instance;&lt;/code&gt;&lt;/pre&gt;
&gt;
&lt;p&gt;
   Each class has a similar set of operations. For example:
&lt;/p&gt;
&lt;p&gt;
&lt;pre&gt;&lt;code&gt;namespace System.Threading { public static class Volatile { public struct
Int32 { public Int32(int value); public int ReadAcquireFence(); public int ReadFullFence();
public int ReadCompilerOnlyFence(); public int ReadUnfenced(); public void WriteReleaseFence(int
newValue); public void WriteFullFence(int newValue); public void WriteCompilerOnlyFence(int
newValue); public void WriteUnfenced(int newValue); public int AtomicCompareExchange(int
newValue, int comparand); public int AtomicExchange(int newValue); public int AtomicAdd(int
delta); public int AtomicIncrement(); public int AtomicDecrement(); // Etc…, bitwise
ops, other math ops, etc. } } }&lt;/code&gt;&lt;/pre&gt;
&gt;
&lt;p&gt;
   Of course, only the integer types would offer the increment, decrement, add, and related
   operators. And it turns out that offering different kinds of fences on the Atomic*
   operations would be incredibly useful too, because processors like ARM do not couple
   the fence to the compare-and-swap / load-locked-store-conditional as x86 processors
   do. Taking advantage of this can be huge if you are writing performance critical code,
   like a concurrent garbage collector whose atomic swaps need not imply ordering with
   the surrounding instruction stream. You can quibble over the details, like whether
   these should use enums instead of the name to encode the fence-kind. I did it this
   way to keep the implementations branch-free, although with a decent inlining JIT compiler,
   it’d probably optimize those away thanks to constant propagation.
&lt;/p&gt;
&lt;p&gt;
   It’s quite trivial to implement these APIs atop existing .NET primitives. I built
   a little library that does so, but it was so boring and repetitive I decided not to
   post it alongside this blog entry as originally intended.
&lt;/p&gt;
&lt;p&gt;
   With the above definition, we can very clearly see the fences involved in doing, say,
   double-checked locking:
&lt;/p&gt;
&lt;p&gt;
&lt;pre&gt;&lt;code&gt;static Volatile.Reference&amp;lt;MySingleton&amp;gt; s_instance; public static
MySingleton Instance { get { MySingleton instance = s_instance.ReadAcquireFence();
if (instance == null) { instance = new MySingleton(); instance = s_instance.AtomicCompareExchangeRelease(instance,
null); } return instance; } }&lt;/code&gt;&lt;/pre&gt;
&gt;
&lt;p&gt;
   We see there are two fences. One is an acquire and, depending on what your memory
   model says about data dependence, is probably unnecessary. Most sane memory models
   guarantee that data dependent loads do not pass. So we needn’t worry that we’ll see
   a non-null s_instance whose contents haven’t been initialized. (If we were talking
   structs, it’d be another story.) Nevertheless, it’s definitely required that we use
   a release-only fence for the publication of the object. This guarantees writes to
   fields within the MySingleton constructor have completed prior to the write of the
   new object to the shared instance field. The point here is that you are forced to
   think about the fences, and you actually see them.
&lt;/p&gt;
&lt;p&gt;
   Of course, most platforms need to provide the bare minimum of fencing to assure type
   safety, particularly for languages like C#. My understanding is that C++0x has decided,
   at least for now, not to offer type-safety in the face of multithreading. That means
   you might publish an object and, if stores occur out-of-order, the reader could see
   an object partially initialized with an invalid vtable pointer. In C# and Java, the
   language designers have thankfully decided to shield programmers from this. The need
   for fences also extends to unsafe code like strings, where – were it possible for
   a thread to read the non-zero length before the char* pointer was valid – writes to
   random memory could occur and hence threaten type-safety. Thankfully, again, C# and
   Java protect developers from this, mostly due to the automatic zero’ing of memory
   as the GC allocates and hands out new segments.
&lt;/p&gt;
&lt;p&gt;
   There are costs to offering this type safety assurance. So you can understand why
   the C++ designers want to keep fences out of object allocation. If you have #1 above,
   however, the costs are dramatically lower and more acceptable. But the world is –
   unfortunately – still a freethreaded one, and we have several years to go before we’ve
   reached the final destination. As a step forward, however, the death of volatile is
   a welcomed one. Say it with me.
&lt;/p&gt;
&lt;h1&gt;“Sayonara volatile.”
&lt;/h1&gt;
&lt;p&gt;
   Here’s hoping that .NET 5.0 takes this step forward too.
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=a0484627-c752-45f6-a2ac-414130cb3d2f" /&gt;</description>
      <category>Technology</category>
    </item>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=558a8e4d-efe4-4f61-99e1-f46f8af76e0e</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,558a8e4d-efe4-4f61-99e1-f46f8af76e0e.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
      I rambled on and on a few weeks back about how much performance matters. Although
      I got a lot of contrary feedback, most of it had to do with my deliberately controversial
      title than the content of the article. I had intended to post a redux, but something
      more concise is on my mind lately.
   </p>
        <p>
      GC-based memory management is a boon to productivity, not to mention program safety.
      Few would argue with this. However, the most effective developers know how their particular
      GC works, and optimize their program’s data structure, allocation, and lifetime behavior
      to suit their particular GC best.
   </p>
        <p>
      This is dangerous, but a pragmatic fact of life. It is dangerous because who’s to
      say that the runtime team doesn’t intend to entirely revamp the GC’s collection strategy
      next release, at which point your thoughtfulness may actually harm you? It’s a pragmatic
      fact for a few reasons: it’s probably not likely that the behavior of your favorite
      GC is going to change too fundamentally over time; if it did, you’d need to rethink
      things anyhow; oh, and when was that next release anyway (2 or more years out); and
      finally, what do you care about more, the theoretical loose coupling, or real results
      today?
   </p>
        <p>
      One of the worst data structures for traversal is a linked list. That’s because its
      contents are fetched by pointer-chasing, an act that usually destroys locality, unless
      the data associated with each pointer was carefully constructed to live next to its
      previous and next pointer’s data. This seldom happens, because the main point of a
      linked list is to free you from such constraints.
   </p>
        <p>
      One of the best data structures for traversal, on the other hand, is an array. Adjacent
      elements are truly adjacent to one another in memory, meaning that as you fetch the
      i’th element from memory, you’re probably pulling in the i’th, i+1’th, …, and so on,
      thanks to spatial locality. Of course, if the elements are just pointers, then you're
      back to the chasing game; as with anything, it depends.
   </p>
        <p>
      How many elements you prefetch of course depends on the size of the elements with
      respect to your processor’s cache line. If you’re working with 8-byte elements, and
      128-byte cache lines, then you may pay 100 cycles for the first fetch, and then amortize
      that cost over the subsequent 15 found cheaply in cache for 10 cycles. The result
      is about 250 cycles total; compare this to a linked list, where you’ll probably spend
      1,600 cycles, or more than 6X the cost. And of course you’re trashing other data in
      the cache in the process. As you traverse more and more of the list, the numbers snowball,
      and the amortization of locality, or lack thereof, provides a stark contrast.
   </p>
        <p>
      There’s another subtle reason why this is important. Stop and think about what happens
      when a GC occurs.
   </p>
        <p>
      Yep, that’s right. Your data structures need to be traversed during a GC, after all,
      to ascertain the liveness of any pointers held within. That scan looks a whole heck
      of a lot like the same traversal I just described, and enjoys the same locality properties.
      So we can immediately conclude that data structures whose traversal is efficient will
      translate into less time spent in the GC chasing pointers, and better cache efficiency.
   </p>
        <p>
      For programs that are sensitive to long pause times, this is huge. I talk to customers
      all the time whose programs are sensitive to microsecond-long GC delays, and – aside
      from ensuring good GC lifetime practices, like ensuring all objects either die young
      or live long – being conscious about locality can be immensely important. Especially
      for any long-lived, large data structures that will be subject to Gen2 collections
      throughout their lifetime.
   </p>
        <p>
      There is another useful trick to know. If a data structure contains no pointers, the
      GC will not have to trace these pointers. Obvious, right? A linked list inherently
      contains pointers, so this trick really doesn’t apply: the GC will need to traverse
      the whole live portions of the object graph. What’s interesting is that an array,
      on the other hand, may or may not contain elements that contain pointers. For example,
      an array of ints clearly has no pointers, whereas an array of string references clearly
      does. This doesn’t just apply to the primitive types, but also custom structs which
      may or may not contain references. When the GC encounters such an array, its contents
      need not be traversed: instead, the array is alive, and that's that. Yet another opportunity
      to eliminate pointer chasing. Not only does this save the GC from doing some heavy
      lifting, but the pointer-free structs eliminate the need for the GC write barriers
      on array stores too.
   </p>
        <p>
      So think of all this next time you’re confronted with the decision to employ a tree,
      graph, or linked list, and whether a dense, and perhaps pointer-free, representation
      could be beneficial. Even if it means you must replace pointers with index calculations.
      The locality benefits may not matter, but then again, they may. And at least you can
      knowingly make a balanced tradeoff, with these potential advantages in mind.
   </p>
        <img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=558a8e4d-efe4-4f61-99e1-f46f8af76e0e" />
      </body>
      <title>Dense, and pointer-free</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,558a8e4d-efe4-4f61-99e1-f46f8af76e0e.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2010/11/01/DenseAndPointerfree.aspx</link>
      <pubDate>Mon, 01 Nov 2010 00:41:41 GMT</pubDate>
      <description>&lt;p&gt;
   I rambled on and on a few weeks back about how much performance matters. Although
   I got a lot of contrary feedback, most of it had to do with my deliberately controversial
   title than the content of the article. I had intended to post a redux, but something
   more concise is on my mind lately.
&lt;/p&gt;
&lt;p&gt;
   GC-based memory management is a boon to productivity, not to mention program safety.
   Few would argue with this. However, the most effective developers know how their particular
   GC works, and optimize their program’s data structure, allocation, and lifetime behavior
   to suit their particular GC best.
&lt;/p&gt;
&lt;p&gt;
   This is dangerous, but a pragmatic fact of life. It is dangerous because who’s to
   say that the runtime team doesn’t intend to entirely revamp the GC’s collection strategy
   next release, at which point your thoughtfulness may actually harm you? It’s a pragmatic
   fact for a few reasons: it’s probably not likely that the behavior of your favorite
   GC is going to change too fundamentally over time; if it did, you’d need to rethink
   things anyhow; oh, and when was that next release anyway (2 or more years out); and
   finally, what do you care about more, the theoretical loose coupling, or real results
   today?
&lt;/p&gt;
&lt;p&gt;
   One of the worst data structures for traversal is a linked list. That’s because its
   contents are fetched by pointer-chasing, an act that usually destroys locality, unless
   the data associated with each pointer was carefully constructed to live next to its
   previous and next pointer’s data. This seldom happens, because the main point of a
   linked list is to free you from such constraints.
&lt;/p&gt;
&lt;p&gt;
   One of the best data structures for traversal, on the other hand, is an array. Adjacent
   elements are truly adjacent to one another in memory, meaning that as you fetch the
   i’th element from memory, you’re probably pulling in the i’th, i+1’th, …, and so on,
   thanks to spatial locality. Of course, if the elements are just pointers, then you're
   back to the chasing game; as with anything, it depends.
&lt;/p&gt;
&lt;p&gt;
   How many elements you prefetch of course depends on the size of the elements with
   respect to your processor’s cache line. If you’re working with 8-byte elements, and
   128-byte cache lines, then you may pay 100 cycles for the first fetch, and then amortize
   that cost over the subsequent 15 found cheaply in cache for 10 cycles. The result
   is about 250 cycles total; compare this to a linked list, where you’ll probably spend
   1,600 cycles, or more than 6X the cost. And of course you’re trashing other data in
   the cache in the process. As you traverse more and more of the list, the numbers snowball,
   and the amortization of locality, or lack thereof, provides a stark contrast.
&lt;/p&gt;
&lt;p&gt;
   There’s another subtle reason why this is important. Stop and think about what happens
   when a GC occurs.
&lt;/p&gt;
&lt;p&gt;
   Yep, that’s right. Your data structures need to be traversed during a GC, after all,
   to ascertain the liveness of any pointers held within. That scan looks a whole heck
   of a lot like the same traversal I just described, and enjoys the same locality properties.
   So we can immediately conclude that data structures whose traversal is efficient will
   translate into less time spent in the GC chasing pointers, and better cache efficiency.
&lt;/p&gt;
&lt;p&gt;
   For programs that are sensitive to long pause times, this is huge. I talk to customers
   all the time whose programs are sensitive to microsecond-long GC delays, and – aside
   from ensuring good GC lifetime practices, like ensuring all objects either die young
   or live long – being conscious about locality can be immensely important. Especially
   for any long-lived, large data structures that will be subject to Gen2 collections
   throughout their lifetime.
&lt;/p&gt;
&lt;p&gt;
   There is another useful trick to know. If a data structure contains no pointers, the
   GC will not have to trace these pointers. Obvious, right? A linked list inherently
   contains pointers, so this trick really doesn’t apply: the GC will need to traverse
   the whole live portions of the object graph. What’s interesting is that an array,
   on the other hand, may or may not contain elements that contain pointers. For example,
   an array of ints clearly has no pointers, whereas an array of string references clearly
   does. This doesn’t just apply to the primitive types, but also custom structs which
   may or may not contain references. When the GC encounters such an array, its contents
   need not be traversed: instead, the array is alive, and that's that. Yet another opportunity
   to eliminate pointer chasing. Not only does this save the GC from doing some heavy
   lifting, but the pointer-free structs eliminate the need for the GC write barriers
   on array stores too.
&lt;/p&gt;
&lt;p&gt;
   So think of all this next time you’re confronted with the decision to employ a tree,
   graph, or linked list, and whether a dense, and perhaps pointer-free, representation
   could be beneficial. Even if it means you must replace pointers with index calculations.
   The locality benefits may not matter, but then again, they may. And at least you can
   knowingly make a balanced tradeoff, with these potential advantages in mind.
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=558a8e4d-efe4-4f61-99e1-f46f8af76e0e" /&gt;</description>
      <category>Technology</category>
    </item>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=d3af3274-061f-4cb4-889f-fa92bf6792dc</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,d3af3274-061f-4cb4-889f-fa92bf6792dc.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
      I have several positions open on my team here at Microsoft.
   </p>
        <p>
      My team's responsibility spans multiple aspects of a research operating system’s programming
      model. The three main areas are concurrency, languages, and frameworks. When I say
      concurrency, I mean things like asynchrony and message passing, data and task parallelism,
      distributed parallelism, runtime scheduling and resource management, and heterogeneity
      and GPGPU. When I say languages, I mean type systems, mostly-functional programming,
      verified safe concurrency, and both front- and back-end compilation. And when I say
      frameworks, I mean virtually anything you could imagine wanting out of a platform
      framework: all things XML, data interoperability (database, web services, etc.), collections,
      transactions, multimaster synchronization, and even low level things, like regex,
      numerics, and globalization.
   </p>
        <p>
      Our team is 100% developers, and we have an “everybody codes, everybody loves to code”
      culture. Even managers are expected to spend a significant amount of time prolifically
      writing code.
   </p>
        <p>
      All of these components are new and built from the ground up. So self-drive and an
      ability to have a vision and make it happen are incredibly important.
   </p>
        <p>
      We are always happy to hire great, hard-working people, regardless of years of experience.
      If you’re extremely strong in one or more of the abovementioned areas, more of a generalist,
      are an amazing coder, or all of the above, you’d fit in perfectly. This is the most
      amazing team of people I’ve ever worked with. If you are interested, please email
      your resume to me at joedu AT microsoft DOT com.
   </p>
        <img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=d3af3274-061f-4cb4-889f-fa92bf6792dc" />
      </body>
      <title>We are hiring</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,d3af3274-061f-4cb4-889f-fa92bf6792dc.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2010/09/18/WeAreHiring.aspx</link>
      <pubDate>Sat, 18 Sep 2010 18:42:27 GMT</pubDate>
      <description>&lt;p&gt;
   I have several positions open on my team here at Microsoft.
&lt;/p&gt;
&lt;p&gt;
   My team's responsibility spans multiple aspects of a research operating system’s programming
   model. The three main areas are concurrency, languages, and frameworks. When I say
   concurrency, I mean things like asynchrony and message passing, data and task parallelism,
   distributed parallelism, runtime scheduling and resource management, and heterogeneity
   and GPGPU. When I say languages, I mean type systems, mostly-functional programming,
   verified safe concurrency, and both front- and back-end compilation. And when I say
   frameworks, I mean virtually anything you could imagine wanting out of a platform
   framework: all things XML, data interoperability (database, web services, etc.), collections,
   transactions, multimaster synchronization, and even low level things, like regex,
   numerics, and globalization.
&lt;/p&gt;
&lt;p&gt;
   Our team is 100% developers, and we have an “everybody codes, everybody loves to code”
   culture. Even managers are expected to spend a significant amount of time prolifically
   writing code.
&lt;/p&gt;
&lt;p&gt;
   All of these components are new and built from the ground up. So self-drive and an
   ability to have a vision and make it happen are incredibly important.
&lt;/p&gt;
&lt;p&gt;
   We are always happy to hire great, hard-working people, regardless of years of experience.
   If you’re extremely strong in one or more of the abovementioned areas, more of a generalist,
   are an amazing coder, or all of the above, you’d fit in perfectly. This is the most
   amazing team of people I’ve ever worked with. If you are interested, please email
   your resume to me at joedu AT microsoft DOT com.
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=d3af3274-061f-4cb4-889f-fa92bf6792dc" /&gt;</description>
      <category>Miscellaneous;Technology</category>
    </item>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=4db70333-295b-441f-80f9-21b90bd44287</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,4db70333-295b-441f-80f9-21b90bd44287.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <title>The 'premature optimization is evil' myth</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,4db70333-295b-441f-80f9-21b90bd44287.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2010/09/06/ThePrematureOptimizationIsEvilMyth.aspx</link>
      <pubDate>Mon, 06 Sep 2010 18:38:49 GMT</pubDate>
      <description>&lt;p&gt;
   I can’t tell you how many times I’ve heard the age-old adage echoed inappropriately
   and out of context:
&lt;/p&gt;
&lt;p&gt;
   &lt;i&gt;&lt;blockquote&gt;"We should forget about small efficiencies, say about 97% of the time;
   premature optimization is the root of all evil"&lt;br&gt;
   -- Donald E. Knuth, &lt;a href="http://portal.acm.org/citation.cfm?id=356640"&gt;Structured
   Programming with go to Statements&lt;/a&gt;&lt;/blockquote&gt;&lt;/i&gt;
&lt;/p&gt;
&lt;p&gt;
   I have heard the "premature optimization is the root of all evil" statement used by
   programmers of varying experience at every stage of the software lifecycle, to defend
   all sorts of choices, ranging from poor architectures, to gratuitous memory allocations,
   to inappropriate choices of data structures and algorithms, to complete disregard
   for variable latency in latency-sensitive situations, among others.
&lt;/p&gt;
&lt;p&gt;
   Mostly this quip is used defend sloppy decision-making, or to justify the indefinite
   deferral of decision-making. In other words, laziness. It is safe to say that the
   very mention of this oft-misquoted phrase causes an immediate visceral reaction to
   commence within me... and it’s not a pleasant one.
&lt;/p&gt;
&lt;p&gt;
   In this short article, we’ll look at some important principles that are counter to
   what many people erroneously believe this statement to be saying. To save you time
   and suspense, I will summarize the main conclusions: I do not advocate contorting
   oneself in order to achieve a perceived minor performance gain. Even the best performance
   architects, when following their intuition, are wrong 9 times out of 10 about what
   matters. (Or maybe 97 times out of 100, based on Knuth’s numbers.) What I do advocate
   is thoughtful and intentional performance tradeoffs being made as every line of code
   is written. Always understand the order of magnitude that matters, why it matters,
   and where it matters. And measure regularly! I am a big believer in statistics, so
   if a programmer sitting in his or her office writing code thinks just a little bit
   more about the performance implications of every line of code that is written, he
   or she will save an entire team that time and then some down the road. Given the choice
   between two ways of writing a line of code, both with similar readability, writability,
   and maintainability properties, and yet interestingly different performance profiles,
   don’t be a bozo: choose the performant approach. Eschew redundant work, and poorly
   written code. And lastly, avoid gratuitously abstract, generalized, and allocation-heavy
   code, when slimmer, more precise code will do the trick.
&lt;/p&gt;
&lt;p&gt;
   Follow these suggestions and your code will just about always win in both maintainability
   and performance.
&lt;/p&gt;
&lt;h1&gt;Understand the order of magnitude that matters
&lt;/h1&gt;
&lt;p&gt;
   First and foremost, you really ought to understand what order of magnitude matters
   for each line of code you write.
&lt;/p&gt;
&lt;p&gt;
   In other words, you need to have a budget; what can you afford, and where can you
   afford it? The answer here changes dramatically depending on whether you’re writing
   a device driver, reusable framework library, UI control, highly-connected network
   application, installation script, etc. No single answer fits all.
&lt;/p&gt;
&lt;p&gt;
   I am personally used to writing code where 100 CPU cycles matters. So invoking a function
   that acquires a lock by way of a shared-memory interlocked instruction that may take
   100 cycles is something I am apt to think hard about; even more worrisome is if that
   acquisition could block waiting for 100,000 cycles. Indeed this situation could become
   disastrous under load. As you can tell, I write a lot of systems code. If you’re working
   on a network-intensive application, on the other hand, most of the code you write
   is going to be impervious to 100 cycle blips, and more sensitive to efficient network
   utilization, scalability, and end-to-end performance. And if you’re writing a little
   one-time script, or some testing or debugging program, you may get away with ignoring
   performance altogether, even multi-million cycle network round-trips.
&lt;/p&gt;
&lt;p&gt;
   To be successful at this, you’ll need to know what things cost. If you don’t know
   what things cost, you’re just flailing in the dark, hoping to get lucky. This includes
   rule of thumb order of magnitudes for primitive operations – e.g. reading / writing
   a register (nanoseconds, single-digit cycles), a cache hit (nanoseconds, tens of cycles),
   a cache miss to main memory (nanoseconds, hundreds of cycles), a disk access including
   page faults (micro- or milliseconds, millions of cycles), and a network roundtrip
   (milliseconds or seconds, many millions of cycles) – in addition to peering beneath
   opaque abstractions provided by other programmers, to understand their best, average,
   and worst case performance.
&lt;/p&gt;
&lt;p&gt;
   Clearly the concerns and situations you must work to avoid change quite substantially
   depending on the class of code you are writing, and whether the main function of your
   program is delivering a user experience (where usability reigns supreme), delivering
   server-side throughput, etc. Thinking this through is crucial, because it helps avoid
   true "premature optimization" traps where a programmer ends up writing complicated
   and convoluted code to save 10 cycles, when he or she really needs to be thinking
   about architecting the interaction with the network more thoughtfully to asynchronously
   overlap round-trips. Understanding how performance impacts the main function of your
   program drives all else.
&lt;/p&gt;
&lt;p&gt;
   Pay attention to interoperability between layers of separately authored software that
   is composed together. The most common cause of hangs is that an API didn’t specify
   the expected performance, and so a UI programmer ended up using it in an innocuous
   but inappropriate way, because they couldn’t afford the range of order of magnitude
   cost that the API’s performance was expected to fall within. Hangs aren’t the only
   manifestation; O(N^2), or worse, performance can also result, if, for example, a caller
   didn’t realize the function called was going to enumerate a list in order to generate
   its results.
&lt;/p&gt;
&lt;p&gt;
   It is also important to think about worst case situations. What happens if that lock
   is held for longer than expected, because the system is under load and the scheduler
   is overloaded? And what if the owning thread was preempted while holding the lock,
   and now will not get to run again for quite some time? What happens if the network
   is saturated because a big news event is underway, or worse, the phone network is
   intermittently cutting out, the network cable has been unplugged, etc.? What about
   the case where, because a user has launched far too many applications at once, your
   memory-intensive operation that usually enjoys nice warmth and locality suddenly begins
   waiting for the disk on the majority of its memory accesses, due to demand paging?
   These things happen all the time.
&lt;/p&gt;
&lt;p&gt;
   In each of these situations, you can end up paying many more orders of magnitude in
   cost than you expected under ordinary circumstances. The lock acquisition that usually
   took 100 CPU cycles now takes several million cycles (as long as a network roundtrip),
   and the network operation that is usually measured in milliseconds is now measured
   in tens of seconds, as the software painfully waits for the operation to time out.
   And your "non-blocking" memory-intensive algorithm on the UI thread just caused a
   hang, because it’s paging like crazy.
&lt;/p&gt;
&lt;p&gt;
   You’ve experienced these problems as a user of modern software, I am sure, and it
   isn’t fun. An hourglass, spinning donut, unresponsive button clicks, "(Not Responsive)"
   title bars, and bleachy white screens. An important measurement of a programmer’s
   worth is how good the code they write operates under the extreme and rare circumstances.
   Because, truth be told, when you have a large user-base, these circumstances aren’t
   that rare after all. This is more of a "performance in the large" thing, but it turns
   out that the end result is delivered as a result of many "performance in the small"
   decisions adding up. A developer wrote code meant to be used in a particular way,
   but decided what order of magnitude was reasonable based on best case, … and gave
   no thought to the worst case.
&lt;/p&gt;
&lt;h1&gt;Using the right data structure for the job
&lt;/h1&gt;
&lt;p&gt;
   This is motherhood and apple pie, Computer Science 101, … bad clichés abound. And
   yet so many programmers get this wrong, because they simply don’t give it enough thought.
&lt;/p&gt;
&lt;p&gt;
   One of my favorite books on the topic ("Programming Pearls") has this to say about
   them:
&lt;/p&gt;
&lt;p&gt;
   &lt;i&gt;&lt;blockquote&gt;"Most programmers have seen them, and most good programmers realize
   they’ve written at least one.&lt;br&gt;
   They are huge, messy, ugly programs that should have been short, clean, beautiful
   programs."&lt;/blockquote&gt;&lt;/i&gt;
&lt;/p&gt;
&lt;p&gt;
   I’ll add one adjective to the "short, clean, beautiful" list: fast.
&lt;/p&gt;
&lt;p&gt;
   Data structures drive storage and access behavior, both strongly affecting the size
   and speed of algorithms and components that make use of them. Worst case really does
   matter. This too is a case where the right choice will boost not only performance
   but also the cleanliness of the program.
&lt;/p&gt;
&lt;p&gt;
   I’m actually not going to spend too much time on this; when I said this is CS101,
   I meant it. However, it is crucial to be intentional and smart in this choice. Validate
   assumptions, and measure.
&lt;/p&gt;
&lt;p&gt;
   Ironically, in my experience, many senior programmers can make frighteningly bad data
   structure choices, often because they are more apt to choose a sophisticated and yet
   woefully inappropriate one. They may choose a linked list, for example, because they
   want zero-allocation element linking via an embedded next pointer. And yet they then
   end up with many lists traversals throughout the program, where a dense array representation
   would have been well worth the extra allocation. The naïve programmer would have happily
   new’d up a List&amp;lt;T&amp;gt;, and avoided some common pitfalls; yet, here the senior guy
   is working as hard as humanly possible to avoid a single extra allocation. They over-optimized
   in one dimension, and ignored many others that mattered more.
&lt;/p&gt;
&lt;p&gt;
   This same class of programmer may choose a very complicated lock-free data structure
   for sharing elements between threads, incurring many more object allocations (and
   thus increased GC pressure), and a large number of expensive interlocked operations
   scattered throughout the code. The sexy lure of lock-freedom tricked them into making
   a bad choice. Perhaps they didn’t quite understand that locks and lock-free data structures
   share many costs in common. Or perhaps they just hoped to get lucky and squeeze out
   out-of-this-world scalability thanks to lock-freedom, without actually considering
   the access patterns necessary to lead to such scalability and whether their program
   employed them.
&lt;/p&gt;
&lt;p&gt;
   These are often held up as examples of "premature optimization", but I hold them up
   as examples of "careless optimization". The double kicker here is that the time spent
   building the more complicated solution would have been better spent carefully thinking
   and measuring, and ultimately deciding not to be overly clever in the first place.
   This most often plagues the mid-range programmer, who is just smart enough to know
   about a vast array of techniques, but not yet mature enough to know when not to employ
   them.
&lt;/p&gt;
&lt;h1&gt;A different, better-performing approach
&lt;/h1&gt;
&lt;p&gt;
   It’s an all-too-common occurrence. I’ll give code review feedback, asking "Why didn’t
   you take approach B? It seems to be just as clear, and yet obviously has superior
   performance." Again, this is in a circumstance where I believe the difference matters,
   given the order of magnitude that matters for the code in question. And I’ll get a
   response, "Premature optimization is the root of all evil." At which point I want
   to smack this certain someone upside the head, because it’s such a lame answer.
&lt;/p&gt;
&lt;p&gt;
   The real answer is that the programmer didn’t stop to carefully consider alternatives
   before coding up solution A. (To be fair, sometimes good solutions evade the best
   of us.) The reality is that the alternative approach should have been taken; it may
   be true that it’s "too late" because the implications of the original decision were
   nontrivial and perhaps far-reaching, but that is too often an unfortunate consequence
   of not taking the care and thought to do it right in the first place.
&lt;/p&gt;
&lt;p&gt;
   These kinds of "peanut butter" problems add up in a hard to identify way. Your performance
   profiler may not obviously point out the effect of such a bad choice so that it’s
   staring you in your face. Rather than making one routine 1000% slower, you may have
   made your entire program 3% slower. Make enough of these sorts of decisions, and you
   will have dug yourself a hole deep enough to take a considerable percentage of the
   original development time just digging out. I don’t know about you, but I prefer to
   clean my house incrementally and regularly rather than letting garbage pile up to
   the ceilings, with flies buzzing around, before taking out the trash. Simply put,
   all great software programmers I know are proactive in writing clear, clean, and smart
   code. They pour love into the code they write.
&lt;/p&gt;
&lt;p&gt;
   In this day and age, where mobility and therefore power is king, instructions matter.
   My boss is fond of saying "the most performant instruction is the one you didn’t have
   to execute." And it’s true. The best way to save battery power on mobile phones is
   to execute less code to get the same job done.
&lt;/p&gt;
&lt;p&gt;
   To take an example of a technology that I am quite supportive of, but that makes writing
   inefficient code very easy, let’s look at LINQ-to-Objects. Quick, how many inefficiencies
   are introduced by this code?
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt;&lt;blockquote&gt;&lt;pre&gt;&lt;= hi)
                  select (x * c);
    return results.ToArray();
}
&lt;/pre&gt;
int[] Scale(int[] inputs, int lo, int hi, int c) {
    var results = from x in inputs
                  where (x &gt;= lo) &amp;&amp; (x 
   &lt;/blockquote&gt;&lt;/code&gt;
&lt;/p&gt;
&lt;p&gt;
   It’s hard to account for them all.
&lt;/p&gt;
&lt;p&gt;
   There are two delegate object allocations, one for the call to Enumerable.Where and
   the other for the call to Enumerable.Select. These delegates point to potentially
   two distinct closure objects, each of which has captured enclosing variables. These
   closure objects are instances of new classes, which occupy nontrivial space in both
   the binary and at runtime. (And of course, the arguments are now stored in two places,
   must be copied to the closure objects, and then we must incur extra indirections each
   time we access them.) In all likelihood, the Where and Select operators are going
   to allocate new IEnumerable&lt;T&gt;
      and new IEnumerator&lt;T&gt;
         objects. For each element in the input, the Where operator will make two interface
         method calls, one to IEnumerator&lt;T&gt;
            .MoveNext and the other to IEnumerator&lt;T&gt;
               .get_Current. It will then make a delegate call, which is slightly more expensive
               than a virtual method call on the CLR. For each element for which the Where delegate
               returns ‘true’, the Select operator will have likewise made two interface method calls,
               in addition to another delegate invocation. Oh, and the implementations of these likely
               use C# iterators, which produce relatively fat code, and are implemented as state
               machines which will incur more overhead (switch statements, state variables, etc.)
               than a hand-written implementation.
&lt;/p&gt;
&lt;p&gt;
   Wow. And we aren’t even done yet. The ToArray method doesn’t know the size of the
   output, so it must make many allocations. It’ll start with 4 elements, and then keep
   doubling and copying elements as necessary. And we may end up with excess storage.
   If we end up with 33,000 elements, for example, we will waste about 128KB of dynamic
   storage (32,000 X 4-byte ints).
&lt;/p&gt;
&lt;p&gt;
   A programmer may have written this routine this way because he or she has recently
   discovered LINQ, or has heard of the benefits of writing code declaratively. And/or
   he or she may have decided to introduce a more general purpose implementation of a
   Scale API versus doing something specific to the use in the particular program that
   Scale will be immediately used in. This is a great example of why premature generalization
   is often at odds with writing efficient code.
&lt;/p&gt;
&lt;p&gt;
   Imagine an alternative universe, on the other hand, where Scale will only get used
   once and therefore we can take advantage of certain properties of its usage. Namely,
   perhaps the input array need not be preserved, and instead we can update the elements
   matching the criteria in place:
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt;&lt;blockquote&gt;&lt;pre&gt;&lt; inputs.Length; i++) {
        if ((inputs[i] &gt;&lt;= hi)) {
            inputs[i] *= c;
        }
    }
}
&lt;/pre&gt;
void ScaleInPlace(int[] inputs, int lo, int hi, int c) {
    for (int i = 0; i = lo) &amp; (inputs[i] 
   &lt;/blockquote&gt;&lt;/code&gt;
&lt;/p&gt;
&lt;p&gt;
   A quick-and-dirty benchmark shows this to be an order of magnitude faster. Again,
   is it an order of magnitude that you care about? Perhaps not. See my earlier thoughts
   on that particular topic. But if you care about costs in the 100s or 1000s of cycles
   range, you probably want to pay heed.
&lt;/p&gt;
&lt;p&gt;
   Now, I’m not trying to take potshots at LINQ. It was just an example. In fact, I spent
   3 years running a team that delivered PLINQ, a parallel execution engine for LINQ-to-Objects.
   LINQ is great where you can afford it, and/or where the alternatives do not offer
   ridiculously better performance. For example, if you can’t do in-place updates, functionally
   producing new data is going to require allocations no matter which way you slice it.
   But having watched people using PLINQ, I have witnessed numerous occasions where an
   inordinately expensive query was made 8-times faster by parallelizing… where the trivial
   refactoring into a slimmed down algorithm with proper data structures would have speed
   the code up by 100-fold. Parallelizing a piggy piece of code to make it faster merely
   uses more of the machine to get the same job done, and will negatively impact power,
   resource management, and utilization.
&lt;/p&gt;
&lt;p&gt;
   Another view is that writing code in this declarative manner is better, because it’ll
   just get faster as the compiler and runtimes enjoy new optimizations. This sounds
   nice, and seems like taking a high road of some sort. But what usually matters is
   today: how does the code perform using the latest and greatest technology available
   today. And if you scratch underneath the surface, it turns out that most of these
   optimizations are what I call "science fiction" and unlikely to happen. If you write
   redundant asserts at twenty adjacent layers of your program, well, you’re probably
   going to pay for them. If you allocate objects like they are cheap apples growing
   on trees, you’re going to pay for them. True, optimizations might make things faster
   over time, but usually not in the way you expect and usually not by orders of magnitude
   unless you are lucky.
&lt;/p&gt;
&lt;p&gt;
   A colleague of mine used to call C a WYWIWYG language—"what you write is what you
   get"—wherein each line of code roughly mapped one-to-one, in a self-evident way, with
   a corresponding handful of assembly instructions. This is a stark contrast to C#,
   wherein a single line of code can allocate many objects and have an impact to the
   surrounding code by introducing numerous silent indirections. For this reason alone,
   understanding what things cost and paying attention to them is admittedly more difficult
   – and arguably more important – in C# than it was back in the good ole’ C days. ILDASM
   is your friend … as is the disassembler. Yes, good systems programmers regularly look
   at the assembly code generated by the .NET JIT. Don’t assume it generates the code
   you think it does.
&lt;/p&gt;
&lt;h1&gt;Gratuitous memory allocations
&lt;/h1&gt;
&lt;p&gt;
   I love C#. I really do. I was reading my "Secure Coding in C and C++" book for fun
   this weekend, and it reminded me how many of those security vulnerabilities are eliminated
   by construction thanks to type- and memory-safety.
&lt;/p&gt;
&lt;p&gt;
   But the one thing I don’t love is how easy and invisible it makes heap allocations.
&lt;/p&gt;
&lt;p&gt;
   The very fact that C++ memory management is painful means most C++ programmers are
   overly-conscious about allocation and footprint. Having to opt-in to using pointers
   means developers are conscious about indirections, rather than having them everywhere
   by default. These are qualitative, hard-to-backup statements, but in my experience
   they are quite true. It’s also cultural.
&lt;/p&gt;
&lt;p&gt;
   Because they are so easy, it’s doubly important to be on the lookout for allocations
   in C#. Each one adds to a hard-to-quantify debt that must be repaid later on when
   a GC subsequently scans the heap looking for garbage. An API may appear cheap to invoke,
   but it may have allocated an inordinate amount of memory whose cost is only paid for
   down the line. This is certainly not "paying it forward."
&lt;/p&gt;
&lt;p&gt;
   It’s never been so easy to read a GB worth of data into memory and then haphazardly
   leave it hanging around for the GC to take care of at some indeterminate point in
   the future as it is in .NET. Or an entire list of said data. Too many times a .NET
   program holds onto GBs of working set, when a more memory conscientious approach would
   have been to devise an incremental loading strategy, employ a denser representation
   of this data, or some combination of the two. But, hey, memory is plentiful and cheap!
   And in the worst case, paging to disk is free! Right? Wrong. Think about the worst
   case.
&lt;/p&gt;
&lt;p&gt;
   Depending on the size of the objects allocated, how long they remain live, and how
   many processors are being used, the subsequent GCs necessary to clean up incessant
   allocations may impact a program in a difficult to predict way. Allocating a bunch
   of very large objects that live for long enough to make it out of the nursery, but
   not forever, for instance, is one of the worst evils you can do. This is known as
   mid-life crisis. You either want really short-lived objects or really long-lived ones.
   But in any case, it really matters: the LINQ example earlier shows how easy it is
   to allocate crap without seeing it in the code.
&lt;/p&gt;
&lt;p&gt;
   If I could do it all over again, I would make some changes to C#. I would try to keep
   pointers, and merely not allow you to free them. Indirections would be explicit. The
   reference type versus value type distinction would not exist; whether something was
   a reference would work like C++, i.e. you get to decide. Things get tricky when you
   start allocating things on the stack, because of the lifetime issues, so we’d probably
   only support stack allocation for a subset of primitive types. (Or we’d employ conservative
   escape analysis.) Anyway, the point here is to illustrate that in such a world, you’d
   be more conscious about how data is laid out in memory, encouraging dense representations
   over sparse and pointer rich data structures silently spreading all over the place.
   We don't live in this world, so pretend as though we do; each time you see a reference
   to an object, think "indirection!" to yourself and react as you would in C++ when
   you see a pointer dereference.
&lt;/p&gt;
&lt;p&gt;
   Allocations are not always bad, of course. It’s easy to become paranoid here. You
   need to understand your memory manager to know for sure. Most GC-based systems, for
   example, are heavily tuned to make successive small object allocations very fast.
   So if you’re programming in C#, you’ll get less "bang for your buck" by fusing a large
   number of contiguous object allocations into a small number of large object allocations
   that ultimately occupy an equivalent amount of space, particularly if those objects
   are short-lived. Lots of little garbage tends to be pretty painless, at least relatively.
&lt;/p&gt;
&lt;h1&gt;Variable latency and asynchrony
&lt;/h1&gt;
&lt;p&gt;
   There aren’t many ways to introduce a multisecond delay into your program at a moment’s
   notice. But I/O can do just that.
&lt;/p&gt;
&lt;p&gt;
   Code with highly variable latency is dangerous, because it can have dramatically different
   performance characteristics depending on numerous variables, many of which are driven
   by environmental conditions outside of your program’s control. As such it is immensely
   important to document where such variable latency can occur, and to program defensively
   against it happening.
&lt;/p&gt;
&lt;p&gt;
   For example, imagine a team of twenty programmers building some desktop application.
   The team is just large enough that no one person can understand the full details of
   how the entire system works. So you’ve got to compose many pieces together. (As I
   mentioned earlier, composition can lead to hard-to-predict performance characteristics.)
   Programmer Alice is responsible for serving up a list of fonts, and Programmer Bob
   is consuming that list to paint it on the UI. Does Bob know what it takes to fetch
   the list of fonts? Probably not. Does Alice know the full set of concerns that Bob
   must think about to deliver a responsive UI, like incremental repaints, progress reporting,
   and cancellation? Probably not. So Alice does the best she knows how to do: she hits
   the cache, when the font cache is fully populated, and falls back to fetching fonts
   from the printer otherwise. She returns a List&amp;lt;Font&amp;gt; object from her API. Now
   Bob just calls the API and paints the results on the UI; the call appears to be quite
   snappy in his testing. Unfortunately, when the cache is not populated, the "quick"
   cache hit turns into a series of network roundtrips with the printer; that hangs the
   UI, but only for a few milliseconds. Even worse, when the printer is accidentally
   offline, perhaps due to a power outage, the UI freezes for twenty seconds, because
   that’s the hard-coded timeout. Ouch!
&lt;/p&gt;
&lt;p&gt;
   This situation happens all the time. It’s one of the most common causes of UI hangs.
&lt;/p&gt;
&lt;p&gt;
   If you’re programming in an environment where asynchrony is first class, Alice could
   have advertised the fact that, under worst-case circumstances, fetching the font list
   would take some time. If she were programming in .NET, for example, she’d return a
   Task&amp;lt;List&amp;lt;Font&amp;gt;&amp;gt; rather than a List&amp;lt;Font&amp;gt;. The API would then be
   self-documenting, and Bob would know that waiting for the task’s result is dangerous
   business. He knows, of course, that blocking the UI thread often leads to responsiveness
   problems. So he would instead use the ContinueWith API to rendezvous with the results
   once they become available. And Bob may now know he needs to go back and work more
   closely with Alice on this interface: to ensure cancellation is wired up correctly,
   and to design a richer interface that facilitates incremental painting and progress
   reporting.
&lt;/p&gt;
&lt;p&gt;
   Variable latency is not just problematic for responsiveness reasons. If I/O is expressed
   synchronously, a program cannot efficiently overlap many of them. Imagine we must
   make three network calls as part of completing an operation, and that each one will
   take 25 milliseconds to complete. If we do synchronous I/O, the whole operation will
   take at least 75 milliseconds. If we launch do asynchronous I/O, on the other hand,
   the operation may take as few as 25 milliseconds to finish. That’s huge.
&lt;/p&gt;
&lt;p&gt;
   If I had my druthers, all I/O would be asynchronous. But that’s not where we are today.
&lt;/p&gt;
&lt;p&gt;
   The concern is not limited to just I/O, of course. Compute- and memory-bound work
   can quickly turn into variable latency work, particularly under stressful situations
   like when an application is paging. So truthfully any abstraction doing "heavy lifting"
   should offer an asynchronous alternative.
&lt;/p&gt;
&lt;h1&gt;Examples of "bad" optimizations
&lt;/h1&gt;
&lt;p&gt;
   It is easy to take it too far. Even if you’re shaving off cycles where each-and-every
   cycle matters, you may be doing the wrong thing. If anything, I hope this article
   has convinced you to be thoughtful, and to strive to strike a healthy balance between
   beautiful code and performance.
&lt;/p&gt;
&lt;p&gt;
   Anytime the optimization sacrifices maintainability, it is highly suspect. Indeed,
   many such optimizations are superficial and may not actually improve the resulting
   code’s performance.
&lt;/p&gt;
&lt;p&gt;
   The worst category of optimization is one that can lead to brittle and insecure code.
&lt;/p&gt;
&lt;p&gt;
   One example of this is heavy reliance on stack allocation. In C and C++, doing stack
   allocation of buffers often leads to difficult choices, like fixing the buffer size
   and writing to it in place. There is perhaps no single technique that, over the years,
   has led to the most buffer overrun-based exploits. Not only that, but stack overflow
   in Windows programs is quite catastrophic, and increases in likelihood the more stack
   allocation that a program does. So doing _alloca in C++ or stackalloc in C# is really
   just playing with fire, particularly for dynamically sized and potentially big allocations.
&lt;/p&gt;
&lt;p&gt;
   Another example is using unsafe code in C#. I can’t tell you how many times I’ve seen
   programmers employ unsafe pointer arithmetic to avoid the automatic bounds checking
   generated by the CLR JIT compiler. It is true that in some circumstances this can
   be a win. But it is also true that most programmers who do this never bothered to
   crack open the resulting assembly to see that the JIT compiler does a fairly decent
   job at automatic bounds check hoisting. This is an example where the cost of the optimization
   outweighs the benefits in most circumstances. The cost to pin memory, the risk of
   heap corruption due to a failure to properly pin memory or an offset error, and the
   complication in the code, are all just not worth it. Unless you really have actually
   measured and found the routine to be a problem.
&lt;/p&gt;
&lt;p&gt;
   If it smells way too complicated, it probably is.
&lt;/p&gt;
&lt;h1&gt;In conclusion
&lt;/h1&gt;
&lt;p&gt;
   I’m not saying Knuth didn’t have a good point. He did. But the "premature optimization
   is the root of all evil" pop-culture and witty statement is not a license to ignore
   performance altogether. It’s not justification to be sloppy about writing code. Proactive
   attention to performance is priceless, especially for certain kinds of product- and
   systems-level software.
&lt;/p&gt;
&lt;p&gt;
   My hope is that this article has helped to instill a better sense, or reinforce an
   existing good sense, for what matters, where it matters, and why it matters when it
   comes to performance. Before you write a line of code, you really need to have a budget
   for what you can afford; and, as you write your code, you must know what that code
   costs, and keep a constant mental tally of how much of that budget has been spent.
   Don’t exceed the budget, and most certainly do not ignore it and just hope that wishful
   thinking will save your behind. Building up this debt will cost you down the road,
   I promise. And ultimately, test-driven development works for performance too; you
   will at least know right away once you have exceeded your budget.
&lt;/p&gt;
&lt;p&gt;
   Think about worst case performance. It’s not the only thing that matters, but particularly
   when the best and worst case differ by an order of magnitude, you will probably need
   to think more holistically about the composition of caller and callee while building
   a larger program out of constituent parts.
&lt;/p&gt;
&lt;p&gt;
   And lastly, the productivity and safety gains of managed code, thanks to nice abstractions,
   type- and memory-safety, and automatic memory management, do not have to come at the
   expense of performance. Indeed this is a stereotype that performance conscious programmers
   are in a position to break down. All you need to do is slow down and be thoughtful
   about each and every line of code you write. Remember, programming is as much engineering
   as it is an art. Measure, measure, measure; and, of course, be thoughtful, intentional,
   and responsible in crafting beautiful and performant code.
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=4db70333-295b-441f-80f9-21b90bd44287" /&gt;</description>
      <category>Technology</category>
    </item>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=7a57f623-d65d-4212-973d-29bdcf61dd3a</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,7a57f623-d65d-4212-973d-29bdcf61dd3a.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
      That immutability facilitates increased degrees of concurrency is an oft-cited dictum.
      But is it true? And either way, why?
   </p>
        <p>
      My view on this matter may be a controversial one. Immutability is an important foundational
      tool in the toolkit for building concurrent – in addition to reliable and predictable
      – software. But it is not the only one that matters. Making all your data immutable
      isn’t going to instantly lead to a massively scalable program. Natural isolation is
      also critically important, perhaps more so. And, as it turns out, sometimes mutability
      is just what the doctor ordered, as with large-scale data parallelism.
   </p>
        <p>
        </p>
        <h1>Isolation first; immutability second; synchronization last
   </h1>
        <p>
      Stepping back for a moment, the recipe for concurrency is rather simple. Say you’ve
      got multiple concurrent pieces of work running simultaneously (or have a goal of getting
      there); for discussion’s sake, call them <em>tasks</em>. Take two tasks. The first
      critical decision has two cases: either these tasks concurrently access overlapping
      data in shared-memory, or they do not. If they do not, they are <em>isolated</em>,
      and no precautions associated with racing memory updates are needed. If they do share
      data, on the other hand, then something else must give. If all concurrently accessed
      data is <em>immutable</em>, or all functions used to interact with data are <em>pure</em>,
      then dangerous concurrency hazards are avoided. All is well. If some data is mutable,
      however, then this is where things get tricky, and higher-level <em>synchronization</em> is
      needed to make accesses safe. This decision tree is straightforward and clear.
   </p>
        <p>
      I have listed those four attributes – isolated, immutable and pure, and synchronization
      – in a very intentional order. Thankfully, this order mirrors the natural top-down
      hierarchical architecture of most modern object- and component-based programs: we
      have large containers that communicate through well-defined interfaces, each comprised
      of layers of such containers, and somewhere towards the leaves, a fair amount of intimate
      commingling of knowledge regarding data and invariants.
   </p>
        <p>
      This order also reflects the order of complexity and execution-time costs, from least
      to most. Isolation is simple, because components depend on each other in loosely-coupled
      ways, and in fact scales superiorly in a concurrent program because no synchronization
      is necessary, the “right” data structure may be chosen for the job – immutable or
      not – and locality is part-and-parcel to the architecture. Immutability at least avoids
      the morass of synchronization, which can affect programs immensely in complexity,
      runtime overheads, and write-contention for shared data. It is clear that synchronization
      is something to avoid at all costs, particularly anything done in an ad-hoc manner
      like locks.
   </p>
        <p>
        </p>
        <h1>Making the concurrency
   </h1>
        <p>
      But where did all this concurrency come from, anyway?
   </p>
        <p>
      It came from two things:
   </p>
        <ol>
          <li>
         The coarse-grained breakdown of a program into isolated pieces. 
      </li>
          <li>
         The fine-grained data parallelism. 
      </li>
        </ol>
        <p>
      On #1: Program fragments that are isolated are already half-way down the road to running
      concurrently as tasks. The second half of this journey, of course, is teaching them
      to interact with one another asynchronously, most frequently through message-passing
      or by sticking them into a pipeline. The details of course depend on what programming
      language you are using. It may be through agents, actors, active objects, COM objects,
      EJBs, CCR receivers, web-services, something ad-hoc built with .NET tasks, or some
      other reification. Nevertheless the isolation is common to all these.
   </p>
        <p>
      On #2: Data parallelism, it turns out, often works best with mutable data structures.
      These structures must be partitionable, of course, so that tasks comprising the data
      parallel operations may operate with logically isolated chunks of this data safely,
      even if they are parts of the same physical data structure. So chunks of them are
      isolated even though they don’t appear to be. This is trivially achievable with many
      important parallel-friendly data structures like arrays, vectors, and matrices. Capturing
      this isolation in the type system is of course no small task, though region typing
      gets close (see UIUC’s Data Parallel Java).
   </p>
        <p>
      But you usually don’t want these structures to be immutable, because they can be modified
      in constant-time and space if they are their classic simple mutable forms. Programmers
      doing HPC-style data-parallelism a la FORTRAN, vectorization, and GPGPU know this
      quite well. Compare this a world where we are doing data-parallelism over immutable
      data structures, where modifications often necessitate allocations or more complicated
      big-oh times due to clever techniques meant to avoid such allocations, as with persistent
      immutable data structures. This is likely less ideal. It is true that some data parallel
      operations are not in-place against mutable data – as with PLINQ – at which point
      purity, but not immutability, is key. The two are related but not identical: immutability
      pervades the construction of data structures, whereas purity pervades the construction
      of functions. But if you can get by with one copy of the data, why not do it? Particularly
      since most datasets amenable to parallel speedups are quite large.
   </p>
        <h1>Immutability: the bricks, not the mortar
   </h1>
        <p>
      Notice that the concurrency did not actually come from immutable data structures in
      either case, however. So what are they good for?
   </p>
        <p>
      One obvious use, which has little to do with concurrency, is to enforce characteristics
      of particular data structures in a program. A translation lookup table may not have
      been meant to be written to except for initialization time, and using an immutable
      data structure is a wonderful way to enforce this intent.
   </p>
        <p>
      What about concurrency? Immutable data structures facilitate sharing data amongst
      otherwise isolated tasks in an efficient zero-copy manner. No synchronization necessary.
      This is the real payoff.
   </p>
        <p>
      For example, say we’ve got a document-editor and would like to launch a background
      task that does spellchecking in parallel. How will the spellchecker concurrently access
      the document, given that the user may continue editing it simultaneously? Likely we
      will use an immutable data structure to hold some interesting document state, such
      as storing text in a piece-table. OneNote, Visual Studio, and many other document-editors
      use this technique. This is zero-cost snapshot isolation.
   </p>
        <p>
      Not having immutability in this particular scenario is immensely painful. Isolation
      won’t work very well. You could model the document as a task, and require the spellchecker
      to interact with it using messages. Chattiness would be a concern. And, worse, the
      spellchecker’s messages may now interleave with other messages, like a user editing
      the document. Those kinds of message-passing races are non-trivial to deal with. Synchronization
      won’t work well either. Clearly we don’t want to lock the user out of editing his
      or her document just because spellchecking is occurring. Such a boneheaded design
      is what leads to spinning donuts, bleached-white screens, and “(Not Responding)” title
      bars. But clearly we don’t want to acquire a lock and then make a full copy of the
      entire document. Perhaps we’d try to copy just what is visible on the screen. This
      is a dangerous game to play.
   </p>
        <p>
      Immutability does not solve all of the problems in this scenario, however. Snapshots
      of any kind lead to a subtle issue that is familiar to those with experience doing
      multimaster, in which multiple parties have conflicting views on what “the” data ought
      to be, and in which these views must be reconciled.
   </p>
        <p>
      In this particular case, the spellchecker sends the results back to the task which
      spawned it, and presumably owns the document, when it has finished checking some portion
      of the document. Because the spellchecker was working with an immutable snapshot,
      however, its answer may now be out-of-date. We have turned the need to deal with message-level
      interleaving – as described above – into the need to deal with all of the messages
      that may have interleaved within a window of time. This is where multimaster techniques,
      such as diffing and merging come into play. Other techniques can be used, of course,
      like cancelling and ignoring out-of-date results. But it is clear something intentional
      must be done.
   </p>
        <p>
        </p>
        <h1>In conclusion
   </h1>
        <p>
      It is safe to say that immutability facilitates important concurrent architectures
      and algorithms. It can really help big time, for sure. But it is clearly no panacea.
      Whether mutability or immutability is the right choice for a particular data structure
      in your program, as with all things, depends.
   </p>
        <p>
      It could be the case that choosing a piece-table for storing your text facilitates
      large-scales of concurrency in version two of your software application, but that
      in version one you have no use for it. Making that call ahead of time may pay in spades
      down the road, even if it comes at a marginal cost up-front. Or it could be that choosing
      an immutable data structure costs you in time and space, and you never end up exploiting
      the fact that you could have shared that particular structure in a zero-cost way across
      agents in your program.
   </p>
        <p>
      One thing’s for sure: I’m glad to be programming in languages like C#, F#, Clojure,
      and Scala, where I’ve got a choice.
   </p>
        <img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=7a57f623-d65d-4212-973d-29bdcf61dd3a" />
      </body>
      <title>Thoughts on immutability and concurrency</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,7a57f623-d65d-4212-973d-29bdcf61dd3a.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2010/07/12/ThoughtsOnImmutabilityAndConcurrency.aspx</link>
      <pubDate>Mon, 12 Jul 2010 03:18:30 GMT</pubDate>
      <description>&lt;p&gt;
   That immutability facilitates increased degrees of concurrency is an oft-cited dictum.
   But is it true? And either way, why?
&lt;/p&gt;
&lt;p&gt;
   My view on this matter may be a controversial one. Immutability is an important foundational
   tool in the toolkit for building concurrent – in addition to reliable and predictable
   – software. But it is not the only one that matters. Making all your data immutable
   isn’t going to instantly lead to a massively scalable program. Natural isolation is
   also critically important, perhaps more so. And, as it turns out, sometimes mutability
   is just what the doctor ordered, as with large-scale data parallelism.
&lt;/p&gt;
&lt;p&gt;
&lt;h1&gt;Isolation first; immutability second; synchronization last
&lt;/h1&gt;
&gt;
&lt;p&gt;
   Stepping back for a moment, the recipe for concurrency is rather simple. Say you’ve
   got multiple concurrent pieces of work running simultaneously (or have a goal of getting
   there); for discussion’s sake, call them &lt;em&gt;tasks&lt;/em&gt;. Take two tasks. The first
   critical decision has two cases: either these tasks concurrently access overlapping
   data in shared-memory, or they do not. If they do not, they are &lt;em&gt;isolated&lt;/em&gt;,
   and no precautions associated with racing memory updates are needed. If they do share
   data, on the other hand, then something else must give. If all concurrently accessed
   data is &lt;em&gt;immutable&lt;/em&gt;, or all functions used to interact with data are &lt;em&gt;pure&lt;/em&gt;,
   then dangerous concurrency hazards are avoided. All is well. If some data is mutable,
   however, then this is where things get tricky, and higher-level &lt;em&gt;synchronization&lt;/em&gt; is
   needed to make accesses safe. This decision tree is straightforward and clear.
&lt;/p&gt;
&lt;p&gt;
   I have listed those four attributes – isolated, immutable and pure, and synchronization
   – in a very intentional order. Thankfully, this order mirrors the natural top-down
   hierarchical architecture of most modern object- and component-based programs: we
   have large containers that communicate through well-defined interfaces, each comprised
   of layers of such containers, and somewhere towards the leaves, a fair amount of intimate
   commingling of knowledge regarding data and invariants.
&lt;/p&gt;
&lt;p&gt;
   This order also reflects the order of complexity and execution-time costs, from least
   to most. Isolation is simple, because components depend on each other in loosely-coupled
   ways, and in fact scales superiorly in a concurrent program because no synchronization
   is necessary, the “right” data structure may be chosen for the job – immutable or
   not – and locality is part-and-parcel to the architecture. Immutability at least avoids
   the morass of synchronization, which can affect programs immensely in complexity,
   runtime overheads, and write-contention for shared data. It is clear that synchronization
   is something to avoid at all costs, particularly anything done in an ad-hoc manner
   like locks.
&lt;/p&gt;
&lt;p&gt;
&lt;h1&gt;Making the concurrency
&lt;/h1&gt;
&gt;
&lt;p&gt;
   But where did all this concurrency come from, anyway?
&lt;/p&gt;
&lt;p&gt;
   It came from two things:
&lt;/p&gt;
&lt;ol&gt;
   &lt;li&gt;
      The coarse-grained breakdown of a program into isolated pieces. 
   &lt;li&gt;
      The fine-grained data parallelism. 
&lt;/ol&gt;
&lt;p&gt;
   On #1: Program fragments that are isolated are already half-way down the road to running
   concurrently as tasks. The second half of this journey, of course, is teaching them
   to interact with one another asynchronously, most frequently through message-passing
   or by sticking them into a pipeline. The details of course depend on what programming
   language you are using. It may be through agents, actors, active objects, COM objects,
   EJBs, CCR receivers, web-services, something ad-hoc built with .NET tasks, or some
   other reification. Nevertheless the isolation is common to all these.
&lt;/p&gt;
&lt;p&gt;
   On #2: Data parallelism, it turns out, often works best with mutable data structures.
   These structures must be partitionable, of course, so that tasks comprising the data
   parallel operations may operate with logically isolated chunks of this data safely,
   even if they are parts of the same physical data structure. So chunks of them are
   isolated even though they don’t appear to be. This is trivially achievable with many
   important parallel-friendly data structures like arrays, vectors, and matrices. Capturing
   this isolation in the type system is of course no small task, though region typing
   gets close (see UIUC’s Data Parallel Java).
&lt;/p&gt;
&lt;p&gt;
   But you usually don’t want these structures to be immutable, because they can be modified
   in constant-time and space if they are their classic simple mutable forms. Programmers
   doing HPC-style data-parallelism a la FORTRAN, vectorization, and GPGPU know this
   quite well. Compare this a world where we are doing data-parallelism over immutable
   data structures, where modifications often necessitate allocations or more complicated
   big-oh times due to clever techniques meant to avoid such allocations, as with persistent
   immutable data structures. This is likely less ideal. It is true that some data parallel
   operations are not in-place against mutable data – as with PLINQ – at which point
   purity, but not immutability, is key. The two are related but not identical: immutability
   pervades the construction of data structures, whereas purity pervades the construction
   of functions. But if you can get by with one copy of the data, why not do it? Particularly
   since most datasets amenable to parallel speedups are quite large.
&lt;/p&gt;
&lt;h1&gt;Immutability: the bricks, not the mortar
&lt;/h1&gt;
&lt;p&gt;
   Notice that the concurrency did not actually come from immutable data structures in
   either case, however. So what are they good for?
&lt;/p&gt;
&lt;p&gt;
   One obvious use, which has little to do with concurrency, is to enforce characteristics
   of particular data structures in a program. A translation lookup table may not have
   been meant to be written to except for initialization time, and using an immutable
   data structure is a wonderful way to enforce this intent.
&lt;/p&gt;
&lt;p&gt;
   What about concurrency? Immutable data structures facilitate sharing data amongst
   otherwise isolated tasks in an efficient zero-copy manner. No synchronization necessary.
   This is the real payoff.
&lt;/p&gt;
&lt;p&gt;
   For example, say we’ve got a document-editor and would like to launch a background
   task that does spellchecking in parallel. How will the spellchecker concurrently access
   the document, given that the user may continue editing it simultaneously? Likely we
   will use an immutable data structure to hold some interesting document state, such
   as storing text in a piece-table. OneNote, Visual Studio, and many other document-editors
   use this technique. This is zero-cost snapshot isolation.
&lt;/p&gt;
&lt;p&gt;
   Not having immutability in this particular scenario is immensely painful. Isolation
   won’t work very well. You could model the document as a task, and require the spellchecker
   to interact with it using messages. Chattiness would be a concern. And, worse, the
   spellchecker’s messages may now interleave with other messages, like a user editing
   the document. Those kinds of message-passing races are non-trivial to deal with. Synchronization
   won’t work well either. Clearly we don’t want to lock the user out of editing his
   or her document just because spellchecking is occurring. Such a boneheaded design
   is what leads to spinning donuts, bleached-white screens, and “(Not Responding)” title
   bars. But clearly we don’t want to acquire a lock and then make a full copy of the
   entire document. Perhaps we’d try to copy just what is visible on the screen. This
   is a dangerous game to play.
&lt;/p&gt;
&lt;p&gt;
   Immutability does not solve all of the problems in this scenario, however. Snapshots
   of any kind lead to a subtle issue that is familiar to those with experience doing
   multimaster, in which multiple parties have conflicting views on what “the” data ought
   to be, and in which these views must be reconciled.
&lt;/p&gt;
&lt;p&gt;
   In this particular case, the spellchecker sends the results back to the task which
   spawned it, and presumably owns the document, when it has finished checking some portion
   of the document. Because the spellchecker was working with an immutable snapshot,
   however, its answer may now be out-of-date. We have turned the need to deal with message-level
   interleaving – as described above – into the need to deal with all of the messages
   that may have interleaved within a window of time. This is where multimaster techniques,
   such as diffing and merging come into play. Other techniques can be used, of course,
   like cancelling and ignoring out-of-date results. But it is clear something intentional
   must be done.
&lt;/p&gt;
&lt;p&gt;
&lt;h1&gt;In conclusion
&lt;/h1&gt;
&gt;
&lt;p&gt;
   It is safe to say that immutability facilitates important concurrent architectures
   and algorithms. It can really help big time, for sure. But it is clearly no panacea.
   Whether mutability or immutability is the right choice for a particular data structure
   in your program, as with all things, depends.
&lt;/p&gt;
&lt;p&gt;
   It could be the case that choosing a piece-table for storing your text facilitates
   large-scales of concurrency in version two of your software application, but that
   in version one you have no use for it. Making that call ahead of time may pay in spades
   down the road, even if it comes at a marginal cost up-front. Or it could be that choosing
   an immutable data structure costs you in time and space, and you never end up exploiting
   the fact that you could have shared that particular structure in a zero-cost way across
   agents in your program.
&lt;/p&gt;
&lt;p&gt;
   One thing’s for sure: I’m glad to be programming in languages like C#, F#, Clojure,
   and Scala, where I’ve got a choice.
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=7a57f623-d65d-4212-973d-29bdcf61dd3a" /&gt;</description>
      <category>Technology</category>
    </item>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=4a49c0a0-3e4e-4d9f-b9de-599d89a94283</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,4a49c0a0-3e4e-4d9f-b9de-599d89a94283.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
      In .NET today, readonly/initonly-ness is in the eye of the provider. Not the beholder. 
   </p>
        <p>
      Although both C# and the CLR verifier go to great pains to ensure you don't change
      a readonly/initonly field outside of its constructor (or class constructor, in the
      case of a static field), this guarantee doesn't imply what you might imagine. It means
      what it says: you can't change such fields except for in certain contexts. 
   </p>
        <p>
      If you try, C# won't let you, including forming byrefs to them: 
   </p>
        <p>
          <code>
            <blockquote>
              <pre>
v.cs(0,0): error CS0191: A readonly field cannot be assigned to (except in a constructor or a variable initializer)
v.cs(0,0): error CS0192: A readonly field cannot be passed ref or out (except in a constructor)
v.cs(0,0): error CS0198: A static readonly field cannot be assigned to (except in a static constructor or a variable initializer)
v.cs(0,0): error CS0199: A static readonly field cannot be passed ref or out (except in a static constructor)
</pre>
            </blockquote>
          </code>
        </p>
        <p>
        </p>
        <p>
      And neither will the CLR verifier: 
   </p>
        <p>
          <code>
            <blockquote>
              <pre>
[IL]: Error: [c:\v.exe : C::Main][offset 0x00000001] Cannot change initonly
    field outside its .ctor.
</pre>
            </blockquote>
          </code>
        </p>
        <p>
        </p>
        <p>
      Of course, attempting to invoke an operation on a readonly struct will make a defensive
      copy locally, and invoke the method against that. This ensures the readonly contents
      cannot change. 
   </p>
        <p>
      One unfortunate hole in this safety is with unions. You do not need unsafe code to
      break readonly, and yet the effect is the same as with an unverifiable program that
      writes to a readonly field: 
   </p>
        <p>
          <code>
            <blockquote>
              <pre>struct S1 {
    public readonly int X;
}

struct S2 {
    public int X;
}

[StructLayout(LayoutKind.Explicit)]
struct S3 {
    [FieldOffset(0)]
    public S1 A;
    [FieldOffset(0)]
    public S2 B;
}
</pre>
            </blockquote>
          </code>
        </p>
        <p>
        </p>
        <p>
      Now we can change A.X via B.X, even though A.X is supposedly readonly: 
   </p>
        <p>
          <code>
            <blockquote>
              <pre>S3 s3 = ...;
int x = s3.A.X;
s3.B.X++;
ASSERT(x == s3.A.X); // false; it is +1
</pre>
            </blockquote>
          </code>
        </p>
        <p>
        </p>
        <p>
      The same would have been true even if the field S3.A was marked readonly. 
   </p>
        <p>
      This is quite an evil trick. I have to be honest that I believe this is a CIL verification
      hole, and should produce unverifiable MSIL much like when you try to overlay structs
      containing overlapping GC references. Nevertheless, it is what it is. 
   </p>
        <p>
      Let's step back. Why does all of this matter, anyway, and what guarantees were we
      hoping that readonly would provide?
   </p>
        <p>
      It would be ideal, I assert, if the guarantee was not just "the target field can only
      be written to in the constructor", but also "the target field, once read, cannot be
      observed with a different value later on". This would not be true during construction,
      but we'd like to say it holds at all other times. 
   </p>
        <p>
      The above example throws a wrench in this idea. As does the following example. But
      this new example will be more disturbing, because the solution is not a simple verifier
      change. 
   </p>
        <p>
      What would you expect this program to print to the console? 
   </p>
        <p>
          <code>
            <blockquote>
              <pre>struct S {
    public readonly int X;

    public S(int x) { X = x; }

    public void MultiplyInto(int c, out S target) {
        System.Console.WriteLine(this.X);
        target = new S(X * c);
        System.Console.WriteLine(this.X); // same? it is, after all, readonly.
    }
}

S s = new S(42);
s.MultiplyInto(10, out s);
</pre>
            </blockquote>
          </code>
        </p>
        <p>
        </p>
        <p>
      As you may or may not have guessed, the output is "42" followed by "420". Yes, the
      value of 'this.X' changes after we have assigned to 'target' inside MultiplyTo, because
      the caller aliases the out-param with the 'this' param. Recall that parameter passing
      for structs in C# is done byref, so that these two references actually physically
      point to the same location when that call is made. The assignment to 'target', therefore,
      actually replaces the entire contents of 'this' all at once. And hence this gives
      the illusion that readonly fields are shifting. 
   </p>
        <p>
      You might be tempted to say that this can be prevented with alias analysis. But this
      is deceptively difficult to do. Consider this more complicated example: 
   </p>
        <p>
          <code>
            <blockquote>
              <pre>class C {
    public struct S S;
}

void M1(C c) {
    M2(c, out c.S);
}

void M2(C c, out S s) {
    c.S.MultiplyInto(10, out s);
}
</pre>
            </blockquote>
          </code>
        </p>
        <p>
        </p>
        <p>
      It is in no way clear inside M2 that the two aliases refer to the same location. The
      aliasing occurred higher up in the stack. Although byrefs are restricted to stack-only
      passing, making the necessary alias analysis tantalizingly close to attainable, it
      is nontrivial to say the least. Presumably we would have had to have blocked the forming
      of the byref within M1, rather than its use within M2. We could fall back to runtime
      checks, but that is also unfortunate for numerous reasons. 
   </p>
        <p>
      The moral of the story? Structs as containers of readonly values are not to be trusted,
      at least not for situations that call for bulletproof safety, such as caching values
      in the compiler rather than rereading them, because the fields are readonly. Although
      C# and the CLR do a good job at verifying readonly/initonly are done right at the
      initialization site, there are still places where these guarantees break down. Thankfully
      the byref aliasing problem does not threaten thread-safety, but the union problem
      does. And in conclusion, I do have to imagine all of this will get fixed somewhere
      down the road, it's just a matter of when and where. 
   </p>
        <img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=4a49c0a0-3e4e-4d9f-b9de-599d89a94283" />
      </body>
      <title>When is a readonly field not readonly?</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,4a49c0a0-3e4e-4d9f-b9de-599d89a94283.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2010/07/01/WhenIsAReadonlyFieldNotReadonly.aspx</link>
      <pubDate>Thu, 01 Jul 2010 19:41:20 GMT</pubDate>
      <description>&lt;p&gt;
   In .NET today, readonly/initonly-ness is in the eye of the provider. Not the beholder. 
&lt;/p&gt;
&lt;p&gt;
   Although both C# and the CLR verifier go to great pains to ensure you don't change
   a readonly/initonly field outside of its constructor (or class constructor, in the
   case of a static field), this guarantee doesn't imply what you might imagine. It means
   what it says: you can't change such fields except for in certain contexts. 
&lt;/p&gt;
&lt;p&gt;
   If you try, C# won't let you, including forming byrefs to them: 
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt; &lt;blockquote&gt;&lt;pre&gt;
v.cs(0,0): error CS0191: A readonly field cannot be assigned to (except in a constructor or a variable initializer)
v.cs(0,0): error CS0192: A readonly field cannot be passed ref or out (except in a constructor)
v.cs(0,0): error CS0198: A static readonly field cannot be assigned to (except in a static constructor or a variable initializer)
v.cs(0,0): error CS0199: A static readonly field cannot be passed ref or out (except in a static constructor)
&lt;/pre&gt;
   &lt;/blockquote&gt;&lt;/code&gt; 
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;
   And neither will the CLR verifier: 
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt; &lt;blockquote&gt;&lt;pre&gt;
[IL]: Error: [c:\v.exe : C::Main][offset 0x00000001] Cannot change initonly
    field outside its .ctor.
&lt;/pre&gt;
   &lt;/blockquote&gt;&lt;/code&gt; 
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;
   Of course, attempting to invoke an operation on a readonly struct will make a defensive
   copy locally, and invoke the method against that. This ensures the readonly contents
   cannot change. 
&lt;/p&gt;
&lt;p&gt;
   One unfortunate hole in this safety is with unions. You do not need unsafe code to
   break readonly, and yet the effect is the same as with an unverifiable program that
   writes to a readonly field: 
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt; &lt;blockquote&gt;&lt;pre&gt;struct S1 {
    public readonly int X;
}

struct S2 {
    public int X;
}

[StructLayout(LayoutKind.Explicit)]
struct S3 {
    [FieldOffset(0)]
    public S1 A;
    [FieldOffset(0)]
    public S2 B;
}
&lt;/pre&gt;
   &lt;/blockquote&gt;&lt;/code&gt; 
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;
   Now we can change A.X via B.X, even though A.X is supposedly readonly: 
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt; &lt;blockquote&gt;&lt;pre&gt;S3 s3 = ...;
int x = s3.A.X;
s3.B.X++;
ASSERT(x == s3.A.X); // false; it is +1
&lt;/pre&gt;
   &lt;/blockquote&gt;&lt;/code&gt; 
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;
   The same would have been true even if the field S3.A was marked readonly. 
&lt;/p&gt;
&lt;p&gt;
   This is quite an evil trick. I have to be honest that I believe this is a CIL verification
   hole, and should produce unverifiable MSIL much like when you try to overlay structs
   containing overlapping GC references. Nevertheless, it is what it is. 
&lt;/p&gt;
&lt;p&gt;
   Let's step back. Why does all of this matter, anyway, and what guarantees were we
   hoping that readonly would provide?
&lt;/p&gt;
&lt;p&gt;
   It would be ideal, I assert, if the guarantee was not just "the target field can only
   be written to in the constructor", but also "the target field, once read, cannot be
   observed with a different value later on". This would not be true during construction,
   but we'd like to say it holds at all other times. 
&lt;/p&gt;
&lt;p&gt;
   The above example throws a wrench in this idea. As does the following example. But
   this new example will be more disturbing, because the solution is not a simple verifier
   change. 
&lt;/p&gt;
&lt;p&gt;
   What would you expect this program to print to the console? 
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt; &lt;blockquote&gt;&lt;pre&gt;struct S {
    public readonly int X;

    public S(int x) { X = x; }

    public void MultiplyInto(int c, out S target) {
        System.Console.WriteLine(this.X);
        target = new S(X * c);
        System.Console.WriteLine(this.X); // same? it is, after all, readonly.
    }
}

S s = new S(42);
s.MultiplyInto(10, out s);
&lt;/pre&gt;
   &lt;/blockquote&gt;&lt;/code&gt; 
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;
   As you may or may not have guessed, the output is "42" followed by "420". Yes, the
   value of 'this.X' changes after we have assigned to 'target' inside MultiplyTo, because
   the caller aliases the out-param with the 'this' param. Recall that parameter passing
   for structs in C# is done byref, so that these two references actually physically
   point to the same location when that call is made. The assignment to 'target', therefore,
   actually replaces the entire contents of 'this' all at once. And hence this gives
   the illusion that readonly fields are shifting. 
&lt;/p&gt;
&lt;p&gt;
   You might be tempted to say that this can be prevented with alias analysis. But this
   is deceptively difficult to do. Consider this more complicated example: 
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt; &lt;blockquote&gt;&lt;pre&gt;class C {
    public struct S S;
}

void M1(C c) {
    M2(c, out c.S);
}

void M2(C c, out S s) {
    c.S.MultiplyInto(10, out s);
}
&lt;/pre&gt;
   &lt;/blockquote&gt;&lt;/code&gt; 
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;
   It is in no way clear inside M2 that the two aliases refer to the same location. The
   aliasing occurred higher up in the stack. Although byrefs are restricted to stack-only
   passing, making the necessary alias analysis tantalizingly close to attainable, it
   is nontrivial to say the least. Presumably we would have had to have blocked the forming
   of the byref within M1, rather than its use within M2. We could fall back to runtime
   checks, but that is also unfortunate for numerous reasons. 
&lt;/p&gt;
&lt;p&gt;
   The moral of the story? Structs as containers of readonly values are not to be trusted,
   at least not for situations that call for bulletproof safety, such as caching values
   in the compiler rather than rereading them, because the fields are readonly. Although
   C# and the CLR do a good job at verifying readonly/initonly are done right at the
   initialization site, there are still places where these guarantees break down. Thankfully
   the byref aliasing problem does not threaten thread-safety, but the union problem
   does. And in conclusion, I do have to imagine all of this will get fixed somewhere
   down the road, it's just a matter of when and where. 
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=4a49c0a0-3e4e-4d9f-b9de-599d89a94283" /&gt;</description>
      <category>Technology</category>
    </item>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=1f0b032e-5871-4749-9c74-71fe205c9bc5</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,1f0b032e-5871-4749-9c74-71fe205c9bc5.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <title>On partially-constructed objects</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,1f0b032e-5871-4749-9c74-71fe205c9bc5.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2010/06/27/OnPartiallyconstructedObjects.aspx</link>
      <pubDate>Sun, 27 Jun 2010 19:52:29 GMT</pubDate>
      <description>&lt;p&gt;
   Partially-constructed objects are a constant source of difficulty in object-oriented
   systems. These are objects whose construction has not completed in its entirety prior
   to another piece of code trying to use them. Such uses are often error-prone, because
   the object in question is likely not in a consistent state. Because this situation
   is comparatively rare in the wild, however, most people (safely) ignore or remain
   ignorant of the problem. Until one day they get bitten. 
&lt;/p&gt;
&lt;p&gt;
   Not only are partially-constructed objects a source of consternation for everyday
   programmers, they are also a challenge for language designers wanting to provide guarantees
   around invariants, immutability and concurrency-safety, and non-nullability. We shall
   see examples below why this is true. The world would be better off if partially-constructed
   objects did not exist. Thankfully there is some interesting prior art that moves us
   in this direction from which to learn. 
&lt;/p&gt;
&lt;p&gt;
&lt;h1&gt;Seeing such a beast in the wild
&lt;/h1&gt;
&gt;
&lt;p&gt;
   In what situations might you see a partially-constructed object? There are two common
   ones in C++ and C#: 
&lt;/p&gt;
&lt;ul&gt;
   &lt;li&gt;
      ‘this’ is leaked out of a constructor to some code that assumes the object has been
      initialized.&lt;/li&gt;
   &lt;li&gt;
      A failure partway through an object’s construction leads to its destructor or finalizer
      running against a partially-constructed object.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
   In the first case, the rule of thumb is “don’t do that.” This is easier said than
   done. The second case, on the other hand, is a fact of life, and the rule of thumb
   is “tread with care, and be intentional.” Let’s examine both more closely. 
&lt;/p&gt;
&lt;p&gt;
&lt;h2&gt;The evils of leaking ‘this’
&lt;/h2&gt;
&gt;
&lt;p&gt;
   Leaking ‘this’ during construction to code that expects to see a fully-initialized
   object is a terrible practice. Before moving on, it’s important to remember initialization
   order in C++ and C#: base constructors run first, and then more derived constructors.
   If I have E subclasses D subclasses C, then constructing an instance of E will run
   C’s constructor, and then D’s, and then lastly E’s. Destructors in C++, of course,
   run in the reverse order. 
&lt;/p&gt;
&lt;p&gt;
   Member initializers, on the other hand, run in different orders in C++ versus C#.
   In C#, they run from most derived first, to least derived. So in the above situation,
   E’s initializers run, and then D’s, and then C’s. This happens fully before running
   the ad-hoc constructor code. In C++, however, member initializers run alongside the
   ordinary construction process. C’s member initializers run just before C’s ad-hoc
   construction code, and then D’s, and then E’s. Another difference is that C#’s initializers
   cannot access ‘this’, whereas C++’s initializers can. 
&lt;/p&gt;
&lt;p&gt;
   For example, this C# program will print E_init, D_init, C_init, C_ctor, D_ctor, and
   then E_ctor: 
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt; &lt;blockquote&gt; &lt;pre&gt;
using System;
class C {
    int x = M();
    public C() {
        Console.WriteLine("C_ctor");
    }
    private static int M() {
        Console.WriteLine("C_init");
        return 42;
    }
}

class D : C {
    int x = M();
    public D() : base() {
        Console.WriteLine("D_ctor");
    }
    private static int M() {
        Console.WriteLine("D_init");
        return 42;
    }
}

class E : D {
    int x = M();
    public E() : base() {
        Console.WriteLine("E_ctor");
    }
    private static int M() {
        Console.WriteLine("E_init");
        return 42;
    }
}

class Program {
    public static void Main() {
        new E();
    }
}
&lt;/pre&gt;
   &lt;/blockquote&gt; &lt;/code&gt; 
&lt;/p&gt;
&lt;p&gt;
   And this C++ program will print C_init, C_ctor, D_init, D_ctor, E_init, E_ctor, ~E,
   ~D, and finally ~C: 
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt; &lt;blockquote&gt; &lt;pre&gt;
#include 
   &lt;iostream&gt;&lt;&lt; "C_ctor" &lt;&lt; endl;   }
    ~C() { cout &lt;&lt; "~C" &lt;&lt; endl; }
    static int M() { cout &lt;&lt; "C_init" &lt;&lt; endl; return 42; }
};

struct D : C {
   int x;
    D(): x(M()) { cout &lt;&lt; "D_ctor" &lt;&lt; endl; }
    ~D() { cout &lt;&lt; "~D" &lt;&lt; endl; }
    static int M() { cout &lt;&lt; "D_init" &lt;&lt; endl; return 42; }
};

struct E : D {
   int x;
    E() : x(M()) { cout &lt;&lt; "E_ctor" &lt;&lt; endl; }
    ~E() { cout &lt;&lt; "~E" &lt;&lt; endl; }
    static int M() { cout &lt;&lt; "E_init" &lt;&lt; endl; return 42; }
};

static void main() {
    E e;
}
&lt;/pre&gt;
      using namespace std; struct C { int x; C() : x(M()) { cout 
   &lt;/blockquote&gt; &lt;/code&gt; 
&lt;/p&gt;
&lt;p&gt;
   It’s interesting to note that the CLR allows constructor chaining to happen in any
   order. The C# compiler emits the calls to base as the first thing a constructor does,
   but other languages can choose to do differently. The verifier ensures that a call
   has occurred somewhere in the constructor body before returning. 
&lt;/p&gt;
&lt;p&gt;
   This IL program, for example, will print E, D, and then C – the reverse of what C#
   gives us: 
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt; &lt;blockquote&gt; &lt;pre&gt;
.assembly extern mscorlib { }
.assembly ctor { }

.class C {
  .method public specialname rtspecialname instance void .ctor() cil managed {
    ldstr      "C"
    call       void [mscorlib]System.Console::WriteLine(string)
    ldarg.0
    call       instance void [mscorlib]System.Object::.ctor()
    ret
  }
}

.class D extends C {
  .method public specialname rtspecialname instance void .ctor() cil managed {
    ldstr      "D"
    call       void [mscorlib]System.Console::WriteLine(string)
    ldarg.0
    call       instance void C::.ctor()
    ret
  }
}

.class E extends D {
  .method public specialname rtspecialname instance void .ctor() cil managed {
    ldstr      "E"
    call       void [mscorlib]System.Console::WriteLine(string)
    ldarg.0
    call       instance void D::.ctor()
    ret
  }
}

.class Program {
  .method public static void Main() cil managed {
    .entrypoint
    newobj     instance void E::.ctor()
    pop
    ret
  }
}
&lt;/pre&gt;
   &lt;/blockquote&gt; &lt;/code&gt; 
&lt;/p&gt;
&lt;p&gt;
   So why is leaking ‘this’ bad, anyway? 
&lt;/p&gt;
&lt;p&gt;
   Say you’ve decided in the implementation of D’s constructor that you would like to
   stick ‘this’ into a global hash-map. Doing this sadly means other code could grab
   the pointer and begin accessing it before E’s constructor has even run. That is a
   race at-best and a ticking time-bomb in all likelihood. For example: 
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt; &lt;blockquote&gt; &lt;pre&gt;&lt;int, C&gt;
class C {
    public static Dictionary s_globalLookup;
    private readonly int m_key;
    public C(int key) {
        m_key = key;
        s_globalLookup.Add(key, this);
    }
}
&lt;/pre&gt;
   &lt;/blockquote&gt; &lt;/code&gt; 
&lt;/p&gt;
&lt;p&gt;
   Even though we have taken great care to initialize our readonly field m_key before
   sticking ‘this’ into a dictionary, any subclasses will not have been initialized at
   this point. Only if C is sealed can we be assured of this. Another piece of code that
   grabs the element out of the hashtable and begins calling virtual methods on it, for
   example, is in a race with the completion of the initialization code for subclasses. 
&lt;/p&gt;
&lt;p&gt;
   Leaking ‘this’ isn’t always such an explicit act. Merely invoking a virtual method
   within the constructor may dispatch a more derived class’s override before the more
   derived class’s constructor has run. And therefore its state is most likely not in
   a place conducive to correct execution of that override. It is fairly common knowledge
   that invoking virtual methods during construction is an extraordinarily poor practice,
   and best avoided. 
&lt;/p&gt;
&lt;p&gt;
&lt;h2&gt;Failure to construct
&lt;/h2&gt;
&gt;
&lt;p&gt;
   Let’s move on to the second issue. Imagine we suffer an exception during construction
   of an object. Perhaps this is due to a failure to allocate resources, or maybe even
   due to argument validation. It is clear that a leaked ‘this’ object in such cases
   will be a problem, because the object will have escaped into the wild even though
   its initialization failed. Subsequent attempts to use the object will obviously pose
   problems. What is more subtle is that if the class in question declares a destructor
   (in C++) or finalizer (in C#), a problem may now be lurking. 
&lt;/p&gt;
&lt;p&gt;
   Let’s say we have the original situation shown above: C derives from D derives from
   E. Now say an exception happens within D’s constructor. At this point in time, C’s
   constructor has run to completion, D’s constructor has run partially up to the point
   of failure, and E’s has not run at all. (And, of course, in the case of C#, all member-initializers
   for all classes have actually run.) What happens to the cleanup code? 
&lt;/p&gt;
&lt;p&gt;
   In C++, only constructors that have run will have their corresponding destructors
   executed. In the above situation, where C, D, and E each declares a destructor, only
   C’s will be run during stack unwind. It is imperative, therefore, that D handles failure
   within its constructor rather than relying on the destructor. For example: 
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt; &lt;blockquote&gt; &lt;pre&gt;
class D : C {
    int* m_pBuf1;
    int* m_pBuf2;
public:
    D() {
        m_pBuf1 = … alloc …;
        m_pBuf2 = … alloc …;
    }
    ~D() {
        if (m_pBuf2) … free …;
        if (m_pBuf1) … free …;
    }
}
&lt;/pre&gt;
   &lt;/blockquote&gt; &lt;/code&gt; 
&lt;/p&gt;
&lt;p&gt;
   If the allocation destined for m_pBuf2 fails by throwing an exception, the destructor
   for D will not run, and therefore m_pBuf1 will leak. The C++ solution to this particular
   example is to use smart pointers and member initializers for the allocations, because
   successfully initialized members do get destructed, even when the constructor (or
   indeed one of the member initializers) subsequently fails. This means that destructors
   for a particular class do not have to tolerate that class’s state not having been
   fully constructed, because those destructors will never run. Destructors must not,
   however, invoke virtuals, for two obvious reasons: (a) subclasses may not have ever
   been initialized, and (b) destructors run in reverse order, and so the destruction
   code for the subclass will have already run by the time a baseclass’s destructor runs. 
&lt;/p&gt;
&lt;p&gt;
   In C#, finalizers will run, regardless of whether an object’s constructor ran fully,
   partway through, or not at all. If the object is allocated – and so long as GC.SuppressFinalize
   isn’t called on it – the finalizer runs. This distinctly means that C# finalizers
   must always be resilient to partially-constructed objects (unlike C++ destructors).
   Thankfully finalizers are a rare bird, and therefore this issue is seldom even noticed
   by .NET programmers. 
&lt;/p&gt;
&lt;p&gt;
   This issue does not arise in the case of .NET’s IDisposable pattern. If a constructor
   throws, the assignment to the target local variable does not occur. And therefore
   the variable enclosed in, say, a using statement remains null. This means that there
   is no way to possibly invoke Dispose on the object. Moreover, the allocation in using
   occurs prior to entering the try/finally block. Hence, you really had better be writing
   constructors that don’t throw and/or protecting such resources with smart-pointer-like
   things with finalizers, a la SafeHandle. 
&lt;/p&gt;
&lt;p&gt;
&lt;h1&gt;Impediments to language support
&lt;/h1&gt;
&gt;
&lt;p&gt;
   As if these weren’t sufficient cause for concern, I also mentioned earlier – and somewhat
   vaguely – that partially-constructed objects interfere with language designers’ efforts
   to add invariants, immutability and concurrency-safety, and non-nullability to the
   language. And all of these are important properties to consider in our present age
   of complexity and concurrency, so this point is worth understanding more deeply. Let’s
   look at each in-order. 
&lt;/p&gt;
&lt;p&gt;
&lt;h2&gt;Invariants and safe-points
&lt;/h2&gt;
&gt;
A partially-constructed object obviously may have broken invariants. By definition,
invariants are meant to hold at the end of construction, and so if construction never
completes, the rules of engagement are being broken. &gt;
&lt;p&gt;
   Imagine, for example: 
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt; &lt;blockquote&gt; &lt;pre&gt;&lt; m_y;
    public C(int a) {
        m_x = a;
        m_y = a + 1;
    }
}
&lt;/pre&gt;
class C {
    int m_x;
    int m_y;
    invariant m_x 

   &lt;/blockquote&gt; &lt;/code&gt; 
&lt;/p&gt;
&lt;p&gt;
   It is ordinarily very difficult to ensure that each instruction atomically transitions
   the state of an object from one invariant safe-point to another. A common technique
   is to define well-defined points at which invariants must hold. We might model each
   failure point as one such technique. But even in C#, the above program does not satisfy
   this constraint, because an overflow exception may be generated at the ‘m_y = a +
   1’ line. Or a thread-abort exception may be generated right in the middle of those
   two functions. Or, if addition were implemented as a JIT helper, an out-of-memory
   exception could get generated due to failure to JIT the helper function. 
&lt;/p&gt;
&lt;p&gt;
   In such cases, we’d like to say that the object does not exist. But the sad fact is
   that the object *does* exist, and if the object has acquired physical resources at
   the time of failure to construct, we must compensate and reclaim them. The ideal world
   looks a lot like object construction as transactions, where the end of construction
   is the commit-point. The state-of-the-art is very different from this, however, and
   so any static verification and theorem proving that depends on invariants on an object
   holding, well, invariantly, is subject to being broken by partially-constructed objects. 
&lt;/p&gt;
&lt;p&gt;
&lt;h2&gt;Immutability… or not
&lt;/h2&gt;
&gt;
&lt;p&gt;
   Immutability is also threatened by partially-constructed objects. Immutability is
   a one of many first class techniques for solving concurrency-safety in the language,
   so this one is quite unfortunate. 
&lt;/p&gt;
&lt;p&gt;
   In C#, for example, we might be tempted to say that a shallowly immutable type is
   one whose fields are all marked readonly. (And a deeply immutable type is one whose
   fields are all readonly, and also refer to types that are immutable.) A readonly field
   cannot be written to after construction has completed. Unfortunately, if the ‘this’
   leaks out during construction, we may see those readonly fields in an uninitialized
   or even changing state: 
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt; &lt;blockquote&gt; &lt;pre&gt;
class C {
    public static C s_c;
    readonly int m_x;
    public C() {
        m_x = 42;
        s_c = this;
        while (true) {
            ++m_x;
        }
    }
}
&lt;/pre&gt;
   &lt;/blockquote&gt; &lt;/code&gt; 
&lt;/p&gt;
&lt;p&gt;
   This is quite evidently a terrible and malicious program. C appears to be immutable,
   because it only contains readonly fields, but is quite clearly not, because the value
   of m_x is assigned to multiple times. If we had a guarantee that all readonly fields
   were definitely assigned once-and-only-once before ‘this’ can escape, then we’d be
   close to a solution. But of course we have no such guarantee. In C#, at least. 
&lt;/p&gt;
&lt;p&gt;
   A related issue is co-initialization of objects. This is interesting and relevant,
   because in such cases we actually want to leak out partially-constructed objects.
   Imagine we’d like to build a cyclic graph comprised of two nodes, A and B, each referring
   to the other. With a naïve approach to immutability, we simply cannot make this work.
   Either A must first refer to B, in which case A refers to a partially-constructed
   B; or B must first refer to A, in which case B refers to a partially-constructed A.
   Both are equally as bad. The two assignments are not atomic. 
&lt;/p&gt;
&lt;p&gt;
   Cyclic data structures are a commonly cited weakness of immutability, and an argument
   in favor of supporting partially-instantiated objects in a first class way, although
   there are approaches that can work. One example is to separate edges from nodes, and
   represent them with different data structures. We can then build the nodes A and B,
   and then build the edges A-&gt;B and B-&gt;A without needing to use cycles. 
&lt;/p&gt;
&lt;p&gt;
&lt;h2&gt;It’s not-null, or at least it wasn’t supposed to be
&lt;/h2&gt;
&gt;
&lt;p&gt;
   Tony Hoare called it his billion-dollar mistake: the introduction of nulls into a
   programming language. I think he sells himself short, however, because the absence
   of nulls in an imperative programming language – however worthy a pursuit – is actually
   a notoriously difficult to attain. 
&lt;/p&gt;
&lt;p&gt;
   Spec# is one example of a C-style language with non-nullability, in which T! represents
   a “non-null T”, for any T. Although this is done in a pleasant way conducive to C#-compatibility
   – a significant goal of Spec# -- I’d personally prefer to see the polarity switched:
   T would mean “non-null T” and T? would mean “nullable T”, for any T, reference- and
   value-types included. This is much more like Haskell’s Maybe monad, and is the syntax
   I’ll use below for illustration purposes. But I digress. 
&lt;/p&gt;
&lt;p&gt;
   Non-nullability is a wonderful invention, because it is common for 75% or more of
   the contracts and assertions in modern programs to be about a pointer being non-null
   prior to dereferencing it, both in C# and in C++. Relying on the type-system to prove
   the absence of nulls is one big step towards creating programs that are robust and
   correct out-of-the-gate, particularly for systems software where such degrees of reliability
   are important. And it cuts down on all those boilerplate contracts sprinkled throughout
   a program. Instead of: 
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt; &lt;blockquote&gt; &lt;pre&gt;
void M(C c, D d, E e)
    requires c != null, d != null, e != null
{
    … use(c, d, e) …
}
&lt;/pre&gt;
   &lt;/blockquote&gt; &lt;/code&gt; 
&lt;/p&gt;
&lt;p&gt;
   You simply say: 
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt; &lt;blockquote&gt; &lt;pre&gt;
void M(C c, D d, E e)
{
    … use (c, d, e) …
}
&lt;/pre&gt;
   &lt;/blockquote&gt; &lt;/code&gt; 
&lt;/p&gt;
&lt;p&gt;
   No opportunity to miss one, and no need for runtime checks. It’s beautiful. 
&lt;/p&gt;
&lt;p&gt;
   A problem quickly arises, however, having to do with partially-constructed objects.
   All of an object’s fields cannot possibly be non-null while the constructor is executing,
   because the object has been zero’d out and the assignments to its fields have not
   yet been made. Clearly constructor code needs to be treated “differently” somehow.
   We cannot simply live with the fact that ‘this’ escaping leads to a partially-constructed
   object leaking out into the program, because that could lead to serious errors. These
   serious errors include potential security holes, if unsafe code is manipulating the
   supposedly non-null pointer. One advantage to adding non-nullability is that runtime
   null checks can be omitted, because the type system guarantees the absence of nulls
   in certain places. In this situation, partially-constructed objects lead to holes
   in our nice type system support. Either the dynamic non-null checks are required as
   back-stop, or we’ll need some other coping technique. 
&lt;/p&gt;
&lt;p&gt;
   There are related issues with non-nullability, like with partially-populated arrays.
   Imagine we’d like to allocate an array of 1M elements of type T, and we will proceed
   to populate those elements following the array’s allocation. There’s clearly a window
   of time during which the array contains 1M nulls, and then 1M-1 nulls, …, and if we
   finish 1M-1M nulls, i.e. 0 nulls. It is only at that last transition that the array
   can be considered to contain non-null T’s. The standard technique is to use an explicit
   dynamic conversion check, or to force the creation of the list to supply all of the
   elements of the array at construction time. 
&lt;/p&gt;
&lt;p&gt;
&lt;h1&gt;Coping techniques
&lt;/h1&gt;
&gt;
&lt;p&gt;
   There are, thankfully, some interesting ways to cope with partially-constructed objects.
   There is, in fact, a spectrum of techniques, ranging from escape analysis in various
   forms, to limitations around how objects are constructed such that a partially-constructed
   one can never leak, to automatic insertion of dynamic checks to prevent the use of
   partially-constructed objects, to static annotations that treat partially-constructed
   instances as first class things in the type system. And of course there’s always the
   technique of “deal with it”, which is the one that most C++-style languages have chosen,
   including our beloved C#. 
&lt;/p&gt;
&lt;p&gt;
&lt;h2&gt;The F# approach: restrictions and dynamic checks
&lt;/h2&gt;
&gt;
&lt;p&gt;
   F#, it turns out, does a very good job in preventing partially-initialized objects.
   A first important step is that fields in F# are readonly by-default, unless you opt-in
   to mutability using the mutable keyword. Therefore data structures are mostly immutable.
   And the construction rules are meant to make it very unlikely that you’ll expose a
   partially-constructed object during construction. How so? It’s simple: such fields
   must be initialized prior to running ad-hoc construction code, and if you attempt
   to initialize them multiple times, the compiler supplies an error. In other words,
   you really have to work hard to screw yourself, unlike C++ and C# where it’s very
   easy. 
&lt;/p&gt;
&lt;p&gt;
   It’s of course possible to do some dirty tricks to publish or access a partially-initialized
   object, despite needing to work very hard at it. There is, however, a nice surprise
   awaiting us when we try. For example: 
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt; &lt;blockquote&gt; &lt;pre&gt;
type C() =
    abstract member virt : unit -&gt; unit
    default this.virt() = ()

type D() as this =
    inherit C()
    do this.virt()

type E =
    inherit D
    val x : int
    new() = { x = 42; }
    override this.virt() = printf "x: %d" this.x

let e = new E()
&lt;/pre&gt;
   &lt;/blockquote&gt; &lt;/code&gt; 
&lt;/p&gt;
&lt;p&gt;
   This example attempts to perform a virtual invocation from C before the more derived
   class has been fully initialized. This overridden virtual simply (attempts to) prints
   out the value of x. If we compile and run this program, however, we will notice that
   we get an exception: “InvalidOperationException: the initialization of an object or
   value resulted in an object or value being accessed recursively before it was fully
   initialized.” 
&lt;/p&gt;
&lt;p&gt;
   Pretty neat. The compiler will stick in the checks necessary when ‘this’ is being
   accessed, to dynamically verify that an object is not being leaked before having been
   fully initialized. The F# approach can be summed up as trying to make things airtight
   as possible statically at compile-time, but admitting that there are holes – primarily
   due to inheritance – and dealing with them by inserting dynamic runtime checks. This
   tradeoff clearly makes sense for F#, because it is attempting to attain a robust level
   of reliability around immutability. 
&lt;/p&gt;
&lt;p&gt;
   F# also adds non-nullability for the most part. Like Haskell’s Maybe monad, F# adds
   an option type that can store a single None value which lies outside of the underlying
   type’s domain to effectively represent null. Because F# is a .NET language it of course
   also needs to worry about nulls at interop boundaries with other languages like C#
   and VB. This is a great step forward; first class CLI support would be a nice next
   step. 
&lt;/p&gt;
&lt;p&gt;
   A slight variant of the F# idea is to initialize data up the whole class hierarchy
   in one pass, and then run ad-hoc construction code in a second pass in the usual way.
   So long as readonly data can be initialized without running the ad-hoc construction
   code, this helps to statistically cut down on the chances for accidental leaking of
   ‘this’. 
&lt;/p&gt;
&lt;p&gt;
&lt;h2&gt;Type system: T versus notconstructed-T
&lt;/h2&gt;
&gt;
&lt;p&gt;
   We can model two kinds of T in the type-system: T and notconstructed-T. The constructor
   for any type T would then see the ‘this’ pointer as an notconstructed-T, and everything
   else – by default – sees ordinary T’s. 
&lt;/p&gt;
&lt;p&gt;
   What good does this distinction do? It enables us to add verification and restrictions
   around the use of notconstructed-T’s and limitations to the conversion between the
   two types. See &lt;a href="http://portal.acm.org/citation.cfm?id=949305.949332"&gt;this
   paper by Manuel Fahndrich and Rustan Leino&lt;/a&gt; for an example of how this approach
   was taken in Spec#’s non-nullability work. 
&lt;/p&gt;
&lt;p&gt;
   For example, we can prohibit conversion between T and notconstructed-T altogether,
   thereby disallowing escaping ‘this’ references altogether. If the type of ‘this’ within
   a constructor is different than all other references to type T, and they are not convertible,
   we’ve successfully walled the problem off in the type system. This protects against
   erroneous method calls, so that a constructor could not call any of its own methods,
   because these methods expect an ordinary T whereas the constructor only has a notconstructed-T.
   And because you cannot state notconstructed-T in the language, you cannot let one
   leak by storing it into a field. 
&lt;/p&gt;
&lt;p&gt;
   We could add more sophisticated support, by allowing a programmer to explicitly tag
   certain non-field references as T-notconstructed. This makes the concept first-class
   in the language, and allows one to explicitly declare the fact that code is interacting
   with a partially-constructed T: 
&lt;/p&gt;
&lt;p&gt;
   &lt;code&gt; &lt;blockquote&gt; &lt;pre&gt;
class C
{
    int m_x;
    public C() {
        m_x = V();
    }
    protected virtual int V() notconstructed {
        … I know to be careful …
    }
}
&lt;/pre&gt;
   &lt;/blockquote&gt; &lt;/code&gt; 
&lt;/p&gt;
&lt;p&gt;
   In this example, the programmer has annotated V with ‘notconstructed’. This enables
   the call from the constructor because the method’s ‘this’ is an uninitialized-T, and
   serves as a warning to the programmer that he or she should take care, much like the
   code written inside a constructor. 
&lt;/p&gt;
&lt;p&gt;
   We must also decide whether fields are offlimits via notconstructed-T’s. If yes, we
   can add F#-style dynamic checks for initialization, but only for attempted accesses
   against notconstructed object’s fields. This is nice because it means the scope of
   dynamic checks are limited, and used in a pay-for-play manner. And we could even enable
   a programmer to sidestep the error by stating at the use-site that they understand
   a particular field access may be to uninitialized memory, like Field.ReadMaybeUninitialized(&amp;m_field). 
&lt;/p&gt;
&lt;p&gt;
   To be honest, the reason this approach has likely not yet seen widespread use is that
   the cost is not commensurate with the benefit. At least, in my opinion. To make something
   like partially-constructed objects a first-class concept in a programming language,
   programmers would need to want to know where they are dealing with them. For systems
   programmers, this makes sense. For many other programmers, it would be useless ceremony
   with no perceived value. And yet the initial approach where nothing new needed to
   be stated, but yet escaping ‘this’ was prevented, blocks certain patterns of legal
   use. This is where theory and practice run up against one another. There is, however,
   presumably a nice middle ground awaiting discovery. 
&lt;/p&gt;
&lt;p&gt;
&lt;h1&gt;Winding down
&lt;/h1&gt;
&gt;
&lt;p&gt;
   I meant for this to be a short post. But the topic really is fascinating, and has
   been coming up time and time again as we do language work here at Microsoft. But it
   is truly fascinating mainly because, like nulls, the problem is widespread yet tolerable,
   and most C++ and C# programmers learn the rules and make do. Partially-constructed
   objects are a major blemish, but not a crisis that threatens the complete abandonment
   of imperative programming. 
&lt;/p&gt;
&lt;p&gt;
   I do believe language trends indicate that more will move away from C++-style object
   initialization and related issues, and more towards immutability and treating data
   and its initialization separately from imperative ad-hoc initialization code. Haskell,
   F#, and Clojure, for example, show us some promising paths forward. There are a plethora
   of other attempts at solving related problems, and I unfortunately could only scratch
   the surface. 
&lt;/p&gt;
&lt;p&gt;
   Although these techniques are not new, the primary question – to me – is how close
   to “the metal” in systems programming these concepts can be made to work. Typically
   for those situations, you need to rely on a mixture of static verification and complete
   freedom, because the dynamic checking is too costly and the code to work around overly-limiting
   static verification also adds too much cost. But as soon as you add complete freedom
   into the picture, you run into partially-constructed objects as a consequence, and
   all the issues I’ve mentioned above. 
&lt;/p&gt;
&lt;p&gt;
   Anyway, I hope it was interesting. 
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=1f0b032e-5871-4749-9c74-71fe205c9bc5" /&gt;</description>
      <category>Technology</category>
    </item>
  </channel>
</rss>
