<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:pingback="http://madskills.com/public/xml/rss/module/pingback/" version="2.0">
  <channel>
    <title>Generalities &amp; Details: Adventures in the High-tech Underbelly</title>
    <link>http://www.bluebytesoftware.com/blog/</link>
    <description>Joe Duffy's Weblog</description>
    <language>en-us</language>
    <copyright>Joe Duffy</copyright>
    <lastBuildDate>Sun, 20 Jul 2008 08:14:14 GMT</lastBuildDate>
    <generator>newtelligence dasBlog 1.8.5223.2</generator>
    <managingEditor>joe@bluebytesoftware.com</managingEditor>
    <webMaster>joe@bluebytesoftware.com</webMaster>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=cba61bab-7325-4b57-9db5-8b528f525cb1</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,cba61bab-7325-4b57-9db5-8b528f525cb1.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <wfw:comment>http://www.bluebytesoftware.com/blog/CommentView,guid,cba61bab-7325-4b57-9db5-8b528f525cb1.aspx</wfw:comment>
      <wfw:commentRss>http://www.bluebytesoftware.com/blog/SyndicationService.asmx/GetEntryCommentsRss?guid=cba61bab-7325-4b57-9db5-8b528f525cb1</wfw:commentRss>
      <slash:comments>1</slash:comments>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
      Here's a slightly more formal approach to proving that the CLR MM is improperly implemented
      for the <a href="http://www.bluebytesoftware.com/blog/2008/07/17/LoadsCannotPassOtherLoadsIsAMyth.aspx">particular
      example I showed earlier</a>.
   </p>
        <p>
      As the Java MM folks have done, I will use a combination of happens-before and synchronizes-with
      relations, which allows order in a properly synchronized program to be describe as
      a "flat" sequence with total ordering among elements.  Assume &lt; means synchronizes-with. 
      If a happens-before b, and a &lt; b, then any writes in a are visible to any loads
      in b.  This relation is transitive: if a &lt; b and b &lt; c, then a &lt; c. 
      Given this, we can take an observed set of results (the values held in memory locations),
      a hypothesized execution order (which we can infer from the observation), and validate
      it against the program order (as written in the source); we do this by taking the
      MM-specific synchronizes-with relation rules, and see if we can produce the observed
      output given our belief of the execution order.  If we find a contradiction (the
      execution order required to produce the output could not be produced given the program
      order and MM rules), either there is an alternative execution order we failed to guess,
      or we have found a violation of the memory model.
   </p>
        <p>
      Single threaded programs are easy.  Multi threaded programs are hard.  We
      must manually "sequentialize" the program by constructing an interleaving of all executed
      program operations into a single flat sequence, and permute them as needed to produce
      the output in order to formulate a hypothesis of the execution order.  This is
      of course very difficult to do, so it only works with very small programs (like the
      one I will show below).
   </p>
        <p>
      I will try to define the CLR 2.0 MM in terms of synchronizes-with, although I have
      to admit it’s going to be difficult to do off the top of my head:
   </p>
        <ol>
          <li>
         a &lt; b, given a volatile load a that precedes any other memory operation b. 
         (Loads are acquire.) 
      </li>
          <li>
         a &lt; b, given any memory operation a that precedes any other store b. 
         (Stores are release.) 
      </li>
          <li>
         a &lt; b, given two separate memory operations a which precedes b that work with
         the same memory location.  (Data dependence.) 
      </li>
          <li>
         a &lt; b, given any memory operation a that precedes a full fence b.  (Cannot
         move after a fence.) 
      </li>
          <li>
         a &lt; b, given a full fence a that precedes some memory operation b.  (Cannot
         move before a fence.) 
      </li>
          <li>
         a &lt; b, given a lock acquire a that precedes some memory operation b.  (Lock
         acquires are acquire fences.) 
      </li>
          <li>
         a &lt; b, given a memory operation a that precedes a lock release b.  (Lock releases
         are release fences.)</li>
        </ol>
        <p>
      Let’s take the disturbing example, assuming all loads and stores are volatile. 
   </p>
        <blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
          <p>
            <font face="Courier New">X = 1;              Y
      = 1;<br />
      R0 = X;             R2
      = Y;<br />
      R1 = Y;             R3
      = X;</font>
          </p>
        </blockquote>
        <p>
      Let’s hypothesize about execution order. 
   </p>
        <p>
      To produce an output in which R1 == R3 == 0, let us observe that it must be the case
      that X = 1 and Y = 1 must not happen first.  If one such instruction does occur
      first, then any possible outcome leads to R1 and/or R3 holding the value 1. 
      That is because of rule 3: if X = 1 happened first, then X = 1 &lt; R3 = X, leading
      to R3 == 1 and similarly if Y = 1 happened first, then Y= 1 &lt; R1 = Y, leading to
      R1 == 1.  So let us try to make X = 1 and Y = 1 not happen first. 
   </p>
        <p>
      Indeed, it is impossible for R0 = X or R2 = Y to happen first.  This is because
      of CLR MM rule 3: X = 1; R0 = X leads to data dependence, and thus X = 1 &lt; R0 =
      X.  Similarly, Y = 1 &lt; R2 = Y.  Dead end.  Let’s try the only other
      route.
   </p>
        <p>
      The only remaining possibility to produce the output R1 == R3 == 0 is if R1 = Y or
      R3 = X occurs first.  Let us try to make R1 = Y occur first.  Ah-hah! 
      We cannot!  Given CLR MM rule 1, R0 = X &lt; R1 = Y.  And because of transitivity,
      this necessarily implies that X = 1 &lt; R1 = Y.  The same holds for the other
      thread’s instructions: Y = 1 &lt; R3 = X.  The output R1 == R3 == 0 is therefore
      a contradiction and disallowed by the CLR MM.
   </p>
        <p>
      Now, this is light years from a formal proof, but is the reasoning I’ve been using
      in my mind to explain why this new realization is fundamentally very disturbing and
      is explicitly <strong>not</strong> allowed by the CLR MM. Thankfully it seems the
      JIT team agrees and is willing to fix this for the next release. And, I'm still in
      search of an example of code that is broken by this problem ...
   </p>
        <img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=cba61bab-7325-4b57-9db5-8b528f525cb1" />
      </body>
      <title>A bit more formalism as to why CLR's MM is broken on x86/x64</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,cba61bab-7325-4b57-9db5-8b528f525cb1.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2008/07/20/ABitMoreFormalismAsToWhyCLRsMMIsBrokenOnX86x64.aspx</link>
      <pubDate>Sun, 20 Jul 2008 08:14:14 GMT</pubDate>
      <description>&lt;p&gt;
   Here's a slightly more formal approach to proving that the CLR MM is improperly implemented
   for the &lt;a href="http://www.bluebytesoftware.com/blog/2008/07/17/LoadsCannotPassOtherLoadsIsAMyth.aspx"&gt;particular
   example I showed earlier&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
   As the Java MM folks have done, I will use a combination of happens-before and synchronizes-with
   relations, which allows order in a properly synchronized program to be describe as
   a "flat" sequence with total ordering among elements.&amp;nbsp; Assume &amp;lt; means synchronizes-with.&amp;nbsp;
   If a happens-before b, and a &amp;lt; b, then any writes in a are visible to any loads
   in b.&amp;nbsp; This relation is transitive: if a &amp;lt; b and b &amp;lt; c, then a &amp;lt; c.&amp;nbsp;
   Given this, we can take an observed set of results (the values held in memory locations),
   a hypothesized execution order (which we can infer from the observation), and validate
   it against the program order (as written in the source); we do this by taking the
   MM-specific synchronizes-with relation rules, and see if we can produce the observed
   output given our belief of the execution order.&amp;nbsp; If we find a contradiction (the
   execution order required to produce the output could not be produced given the program
   order and MM rules), either there is an alternative execution order we failed to guess,
   or we have found a violation of the memory model.
&lt;/p&gt;
&lt;p&gt;
   Single threaded programs are easy.&amp;nbsp; Multi threaded programs are hard.&amp;nbsp; We
   must manually "sequentialize" the program by constructing an interleaving of all executed
   program operations into a single flat sequence, and permute them as needed to produce
   the output in order to formulate a hypothesis of the execution order.&amp;nbsp; This is
   of course very difficult to do, so it only works with very small programs (like the
   one I will show below).
&lt;/p&gt;
&lt;p&gt;
   I will try to define the CLR 2.0 MM in terms of synchronizes-with, although I have
   to admit it’s going to be difficult to do off the top of my head:
&lt;/p&gt;
&lt;ol&gt;
   &lt;li&gt;
      a &amp;lt; b, given a volatile load a that&amp;nbsp;precedes any other memory operation b.&amp;nbsp;
      (Loads are acquire.) 
   &lt;li&gt;
      a &amp;lt; b, given any memory operation a that&amp;nbsp;precedes any other store b.&amp;nbsp;
      (Stores are release.) 
   &lt;li&gt;
      a &amp;lt; b, given two separate memory operations a which&amp;nbsp;precedes b that work with
      the same memory location.&amp;nbsp; (Data dependence.) 
   &lt;li&gt;
      a &amp;lt; b, given any memory operation a that&amp;nbsp;precedes a full fence b.&amp;nbsp; (Cannot
      move after a fence.) 
   &lt;li&gt;
      a &amp;lt; b, given a full fence a that&amp;nbsp;precedes some memory operation b.&amp;nbsp; (Cannot
      move before a fence.) 
   &lt;li&gt;
      a &amp;lt; b, given a lock acquire a that precedes some memory operation b.&amp;nbsp; (Lock
      acquires are acquire fences.) 
   &lt;li&gt;
      a &amp;lt; b, given a memory operation a that precedes a lock release b.&amp;nbsp; (Lock releases
      are release fences.)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;
   Let’s take the disturbing example, assuming all loads and stores are volatile. 
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p&gt;
   &lt;font face="Courier New"&gt;X = 1;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Y
   = 1;&lt;br&gt;
   R0 = X;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; R2
   = Y;&lt;br&gt;
   R1 = Y;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; R3
   = X;&lt;/font&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   Let’s hypothesize about execution order. 
&lt;/p&gt;
&lt;p&gt;
   To produce an output in which R1 == R3 == 0, let us observe that it must be the case
   that X = 1 and Y = 1 must not happen first.&amp;nbsp; If one such instruction does occur
   first, then any possible outcome leads to R1 and/or R3 holding the value 1.&amp;nbsp;
   That is because of rule 3: if X = 1 happened first, then X = 1 &amp;lt; R3 = X, leading
   to R3 == 1 and similarly if Y = 1 happened first, then Y= 1 &amp;lt; R1 = Y, leading to
   R1 == 1.&amp;nbsp; So let us try to make X = 1 and Y = 1 not happen first. 
&lt;/p&gt;
&lt;p&gt;
   Indeed, it is impossible for R0 = X or R2 = Y to happen first.&amp;nbsp; This is because
   of CLR MM rule 3: X = 1; R0 = X leads to data dependence, and thus X = 1 &amp;lt; R0 =
   X.&amp;nbsp; Similarly, Y = 1 &amp;lt; R2 = Y.&amp;nbsp; Dead end.&amp;nbsp; Let’s try the only other
   route.
&lt;/p&gt;
&lt;p&gt;
   The only remaining possibility to produce the output R1 == R3 == 0 is if R1 = Y or
   R3 = X occurs first.&amp;nbsp; Let us try to make R1 = Y occur first.&amp;nbsp; Ah-hah!&amp;nbsp;
   We cannot!&amp;nbsp; Given CLR MM rule 1, R0 = X &amp;lt; R1 = Y.&amp;nbsp; And because of transitivity,
   this necessarily implies that X = 1 &amp;lt; R1 = Y.&amp;nbsp; The same holds for the other
   thread’s instructions: Y = 1 &amp;lt; R3 = X.&amp;nbsp; The output R1 == R3 == 0 is therefore
   a contradiction and disallowed by the CLR MM.
&lt;/p&gt;
&lt;p&gt;
   Now, this is light years from a formal proof, but is the reasoning I’ve been using
   in my mind to explain why this new realization is fundamentally very disturbing and
   is explicitly &lt;strong&gt;not&lt;/strong&gt; allowed by the CLR MM. Thankfully it seems the
   JIT team agrees and is willing to fix this for the next release. And, I'm still in
   search of an example of code that is broken by this problem ...
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=cba61bab-7325-4b57-9db5-8b528f525cb1" /&gt;</description>
      <comments>http://www.bluebytesoftware.com/blog/CommentView,guid,cba61bab-7325-4b57-9db5-8b528f525cb1.aspx</comments>
      <category>Technology</category>
    </item>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=8bb27ef2-d53a-4e9f-b4e4-cc1607d06f37</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,8bb27ef2-d53a-4e9f-b4e4-cc1607d06f37.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <wfw:comment>http://www.bluebytesoftware.com/blog/CommentView,guid,8bb27ef2-d53a-4e9f-b4e4-cc1607d06f37.aspx</wfw:comment>
      <wfw:commentRss>http://www.bluebytesoftware.com/blog/SyndicationService.asmx/GetEntryCommentsRss?guid=8bb27ef2-d53a-4e9f-b4e4-cc1607d06f37</wfw:commentRss>
      <slash:comments>21</slash:comments>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
      The adjacent release/acquire problem is well known.  As an example, given the
      program:
   </p>
        <blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
          <p>
            <font face="Courier New">P0         
      P1<br />
      ==========  ==========<br />
      X = 1;      Y = 1;<br />
      R0 = Y;     R1 = X;</font>
          </p>
        </blockquote>
        <p>
      The outcome R0 == R1 == 0 is entirely legal.  This could happen because writes
      are delayed in processor store buffers; so before R0 = Y retires, the store X = 1
      may have not even left the local processor P0; similarly, before R1 = X retires, the
      store Y = 1 may not have even left processor P1.  It is as if the program was
      written as follows:
   </p>
        <blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
          <p>
            <font face="Courier New">P0         
      P1<br />
      ==========  ==========<br />
      R0 = Y;     R1 = X;<br />
      X = 1;      Y = 1;</font>
          </p>
        </blockquote>
        <p>
      The standard way to fix this is to emit a full fence:
   </p>
        <blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
          <p>
            <font face="Courier New">P0         
      P1<br />
      ==========  ==========<br />
      X = 1;      Y = 1;<br />
      XCHG;       XCHG;<br />
      R0 = Y;     R1 = X;</font>
          </p>
        </blockquote>
        <p>
      But here is one that may be a little surprising:
   </p>
        <blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
          <p>
            <font face="Courier New">P0         
      P1<br />
      ==========  ==========<br />
      X = 1;      Y = 1;<br />
      R0 = X;     R2 = Y;<br />
      R1 = Y;     R3 = X;</font>
          </p>
        </blockquote>
        <p>
      Assuming X and Y are "volatile" to the compiler, is R1 == R3 == 0 a possible outcome
      in this program?
   </p>
        <p>
      Based on the rules we provide for <a href="http://www.bluebytesoftware.com/blog/2007/11/10/CLR20MemoryModel.aspx">.NET's
      MM</a>, and <a href="http://www.intel.com/products/processor/manuals/318147.pdf">Intel's
      whitepaper</a>, one could reasonably argue "no".  The reasoning goes as follows. 
      True data dependence prohibits R0 = X from moving before X = 1, and the no load/load
      reordering rule (e.g. Intel's Rule 2.1) prohibits R1 = Y from moving before R0 = X. 
      Thus, transitively, R1 = Y may not move before X = 1.  Similarly, true data dependence
      prohibits R2 = Y from moving before Y = 1, and the no load/load reordering rule prohibits
      R3 = X from moving before R2 = Y, and therefore R3 = X may not move before Y = 1. 
      Given this reasoning, the individual instruction streams cannot be reordered in place. 
      And therefore, no interleaving of them will yield R1 == R3 == 0, because either X
      = 1 or Y = 1 must happen first, and both R1 = Y and R3 = X must come later. 
      Hence at least one of R1 or R3 will observe a value of 1.
   </p>
        <p>
      Sadly, this reasoning is incorrect.  Rule 2.4 in the Intel whitepaper states
      that "intra-processor forwarding is allowed."  They even have an innocent example
      in the paper, but it actually doesn't exhibit load/load reordering.  It does,
      however, illustrate that stores may be delayed for some time in a write buffer. 
      Perhaps surprisingly, such intra-processor forwarding of buffered stores is actually
      permitted to satisfy subsequent loads from that location by the same processor <em>before</em> the
      store has left the processor.  This can happen even if it means passing intermediate
      loads from different memory locations!  The result is that load/load reordering
      is effectively possible under some circumstances.  Loads still physically retire
      in order of course, but because they may be satisfied by pending writes that other
      processors cannot yet see, it is as if the original program were written as:
   </p>
        <blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
          <p>
            <font face="Courier New">P0         
      P1<br />
      ==========  ==========<br />
      R1 = Y;     R3 = X;<br />
      X = 1;      Y = 1;<br />
      R0 = X;     R2 = Y;</font>
          </p>
        </blockquote>
        <p>
      The fundamentally contradicts what most people believe about .NET's MM, and indeed,
      Intel's MM as specified in that whitepaper.  To be fair, the whitepaper actually
      does call this out, but in a roundabout and misleading fashion.  The text in
      Rule 2.1, which states that "no loads can be reordered with other loads", is far too
      strong.
   </p>
        <p>
      Anytime a little hole in something as fundamental as MM axioms is uncovered, it is
      cause for concern.  So I found this discovery deeply disturbing.  Many abstractions
      and theorems are proved with the assumption that the MM is rock solid. 
      I know a lot of code I have written relies on such proofs.
   </p>
        <p>
      That said, I've been racking my brain (and in fact was having nightmares about it
      last evening) trying to uncover a case where this is worse than the existing release/acquire
      reordering issue that I opened this post with.  Everything I come up with is
      saved at the last minute by rules 2.1 (for stores) and 2.5 "stores are transitively
      visible".  The basic problem is that a processor can get stuck seeing its own
      written value for some time, during which other processors cannot, but ultimately
      it doesn't seem to matter because the buffer will eventually be flushed.  Then
      any intermediary values that the destination may have held while that processor was
      stuck will have been overwritten anyway, so the outcome should be explainable (albeit
      racey).  I'm still thinking hard about this.
   </p>
        <img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=8bb27ef2-d53a-4e9f-b4e4-cc1607d06f37" />
      </body>
      <title>"Loads cannot pass other loads" is a ~myth</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,8bb27ef2-d53a-4e9f-b4e4-cc1607d06f37.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2008/07/17/LoadsCannotPassOtherLoadsIsAMyth.aspx</link>
      <pubDate>Thu, 17 Jul 2008 02:43:00 GMT</pubDate>
      <description>&lt;p&gt;
   The adjacent release/acquire problem is well known.&amp;nbsp; As an example, given the
   program:
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p&gt;
   &lt;font face="Courier New"&gt;P0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   P1&lt;br&gt;
   ==========&amp;nbsp; ==========&lt;br&gt;
   X = 1;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Y = 1;&lt;br&gt;
   R0 = Y;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; R1 = X;&lt;/font&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   The outcome R0 == R1 == 0 is entirely legal.&amp;nbsp; This could happen because writes
   are delayed in processor store buffers; so before R0 = Y retires, the store X = 1
   may have not even left the local processor P0; similarly, before R1 = X retires, the
   store Y = 1 may not have even left processor P1.&amp;nbsp; It is as if the program was
   written as follows:
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p&gt;
   &lt;font face="Courier New"&gt;P0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   P1&lt;br&gt;
   ==========&amp;nbsp; ==========&lt;br&gt;
   R0 = Y;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; R1 = X;&lt;br&gt;
   X = 1;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Y = 1;&lt;/font&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   The standard way to fix this is to emit a full fence:
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p&gt;
   &lt;font face="Courier New"&gt;P0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   P1&lt;br&gt;
   ==========&amp;nbsp; ==========&lt;br&gt;
   X = 1;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Y = 1;&lt;br&gt;
   XCHG;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; XCHG;&lt;br&gt;
   R0 = Y;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; R1 = X;&lt;/font&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   But here is one that may be a little surprising:
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p&gt;
   &lt;font face="Courier New"&gt;P0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   P1&lt;br&gt;
   ==========&amp;nbsp; ==========&lt;br&gt;
   X = 1;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Y = 1;&lt;br&gt;
   R0 = X;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; R2 = Y;&lt;br&gt;
   R1 = Y;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; R3 = X;&lt;/font&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   Assuming X and Y are "volatile" to the compiler, is R1 == R3 == 0 a possible outcome
   in this program?
&lt;/p&gt;
&lt;p&gt;
   Based on the rules we provide for &lt;a href="http://www.bluebytesoftware.com/blog/2007/11/10/CLR20MemoryModel.aspx"&gt;.NET's
   MM&lt;/a&gt;, and &lt;a href="http://www.intel.com/products/processor/manuals/318147.pdf"&gt;Intel's
   whitepaper&lt;/a&gt;, one could reasonably argue "no".&amp;nbsp; The reasoning goes as follows.&amp;nbsp;
   True data dependence prohibits R0 = X from moving before X = 1, and the no load/load
   reordering rule (e.g. Intel's Rule 2.1) prohibits R1 = Y from moving before R0 = X.&amp;nbsp;
   Thus, transitively, R1 = Y may not move before X = 1.&amp;nbsp; Similarly, true data dependence
   prohibits R2 = Y from moving before Y = 1, and the no load/load reordering rule prohibits
   R3 = X from moving before R2 = Y, and therefore R3 = X may not move before Y = 1.&amp;nbsp;
   Given this reasoning, the individual instruction streams cannot be reordered in place.&amp;nbsp;
   And therefore, no interleaving of them will yield R1 == R3 == 0, because either X
   = 1 or Y = 1 must happen first, and both R1 = Y and R3 = X must come later.&amp;nbsp;
   Hence at least one of R1 or R3 will observe a value of 1.
&lt;/p&gt;
&lt;p&gt;
   Sadly, this reasoning is incorrect.&amp;nbsp; Rule 2.4 in the Intel whitepaper states
   that "intra-processor forwarding is allowed."&amp;nbsp; They even have an innocent example
   in the paper, but it actually doesn't exhibit load/load reordering.&amp;nbsp; It does,
   however, illustrate that stores may be delayed for some time in a write buffer.&amp;nbsp;
   Perhaps surprisingly, such intra-processor forwarding of buffered stores is actually
   permitted to satisfy subsequent loads from that location by the same processor &lt;em&gt;before&lt;/em&gt; the
   store has left the processor.&amp;nbsp; This can happen even if it means passing intermediate
   loads from different memory locations!&amp;nbsp; The result is that load/load reordering
   is effectively possible under some circumstances.&amp;nbsp; Loads still physically retire
   in order of course, but because they may be satisfied by pending writes that other
   processors cannot yet see, it is as if the original program were written as:
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p&gt;
   &lt;font face="Courier New"&gt;P0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   P1&lt;br&gt;
   ==========&amp;nbsp; ==========&lt;br&gt;
   R1 = Y;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; R3 = X;&lt;br&gt;
   X = 1;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Y = 1;&lt;br&gt;
   R0 = X;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; R2 = Y;&lt;/font&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   The fundamentally contradicts what most people believe about .NET's MM, and indeed,
   Intel's MM as specified in that whitepaper.&amp;nbsp; To be fair, the whitepaper actually
   does call this out, but in a roundabout and misleading fashion.&amp;nbsp; The text in
   Rule 2.1, which states that "no loads can be reordered with other loads", is far too
   strong.
&lt;/p&gt;
&lt;p&gt;
   Anytime a little hole in something as fundamental as MM axioms is uncovered, it is
   cause for concern.&amp;nbsp; So I found this discovery deeply disturbing.&amp;nbsp; Many abstractions
   and theorems are proved with the assumption that&amp;nbsp;the MM is rock solid.&amp;nbsp;
   I know a lot of code I have written relies on such proofs.
&lt;/p&gt;
&lt;p&gt;
   That said, I've been racking my brain (and in fact was having nightmares about it
   last evening) trying to uncover a case where this is worse than the existing release/acquire
   reordering issue that I opened this post with.&amp;nbsp; Everything I come up with is
   saved at the last minute by rules 2.1 (for stores) and 2.5 "stores are transitively
   visible".&amp;nbsp; The basic problem is that a processor can get stuck seeing its own
   written value for some time, during which other processors cannot, but ultimately
   it doesn't seem to matter because the buffer will eventually be flushed.&amp;nbsp; Then
   any intermediary values that the destination may have held while that processor was
   stuck will have been overwritten anyway, so the outcome should be explainable (albeit
   racey).&amp;nbsp; I'm still thinking hard about this.
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=8bb27ef2-d53a-4e9f-b4e4-cc1607d06f37" /&gt;</description>
      <comments>http://www.bluebytesoftware.com/blog/CommentView,guid,8bb27ef2-d53a-4e9f-b4e4-cc1607d06f37.aspx</comments>
      <category>Technology</category>
    </item>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=dd98d41b-92cb-40b5-a276-b1a8b3d08c54</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,dd98d41b-92cb-40b5-a276-b1a8b3d08c54.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <wfw:comment>http://www.bluebytesoftware.com/blog/CommentView,guid,dd98d41b-92cb-40b5-a276-b1a8b3d08c54.aspx</wfw:comment>
      <wfw:commentRss>http://www.bluebytesoftware.com/blog/SyndicationService.asmx/GetEntryCommentsRss?guid=dd98d41b-92cb-40b5-a276-b1a8b3d08c54</wfw:commentRss>
      <slash:comments>9</slash:comments>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
      I just submitted the final manuscript for <a href="http://www.bluebytesoftware.com/books/winconc/winconc_book_resources.html">Concurrent
      Programming on Windows</a> to Addison-Wesley.
   </p>
        <p>
      This marks the exciting transition from things happening on my timetable to things
      happening on AW’s timetable.
   </p>
        <p>
      A lot has changed for me since I decided to write this book.  You might
      be surprised to hear that I actually signed the contract for it on November 29th,
      2005.  That’s 2 years and 7 months ago.  It’s almost unbelievable that this
      book took so long to finish.  By comparison, my <a href="http://www.bluebytesoftware.com/books/netfx20/netfx20_book_resources.html">first
      one</a> took just a little over a year.  The road has been a long one, full of
      personal ups and downs (I was, at the beginning, actually engaged to get married),
      but it’s no doubt been an exciting trip.
   </p>
        <p>
      I’ve been at Microsoft the whole time.  At the outset, I was a PM on the CLR
      Team, hacking on software transactional memory and PLINQ as an evening activity. 
      Then I transitioned to doing it full time, but still as a PM.  Then I joined
      the Parallel Computing team as the dev for PLINQ.  Then I kicked off
      the whole Parallel Extensions effort (which is 20 members and growing strong), became
      the dev lead, and here I am today.  It’s pretty strange to say this, but without
      the book very little of that would have happened.  I can’t think of a better
      way to get entrenched in a technology, experience the breadth, and force yourself
      to learn every little intricate and often enlightening detail.  If you can afford
      the impact to mental health and personal relationships ;), it’s an activity I
      highly recommend to anybody wanting to master a technology...  not that one can
      actually master the concurrency beast, but y’know...
   </p>
        <p>
      In retrospect, it should have taken a year.  Maybe next time.
   </p>
        <p>
      The good news is that you will have the book in your hands soon.  (Well, if you
      decide to buy a copy, that is.)  If you manage to make it to <a href="http://www.bluebytesoftware.com/blog/2008/05/30/PDC08ConcurrentMulticoreProgrammingOnWindowsAndNET.aspx">my
      PDC 2008</a> pre-con session, I’m hoping we will have some copies available. 
      No promises, since I missed my final deadline by a couple weeks, but my fingers are
      crossed.
   </p>
        <p>
      Oh yeah, and you can expect me to pick up blogging again now that I’ll have some free
      time.  Hmm, free time?  What will I do with myself!
   </p>
        <p>
          <strong>
            <em>Laissez les bon temps roulez!</em>
          </strong>
        </p>
        <img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=dd98d41b-92cb-40b5-a276-b1a8b3d08c54" />
      </body>
      <title>Final manuscript for Concurrent Programming on Windows has been submitted</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,dd98d41b-92cb-40b5-a276-b1a8b3d08c54.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2008/06/23/FinalManuscriptForConcurrentProgrammingOnWindowsHasBeenSubmitted.aspx</link>
      <pubDate>Mon, 23 Jun 2008 07:34:18 GMT</pubDate>
      <description>&lt;p&gt;
   I just submitted the final manuscript for &lt;a href="http://www.bluebytesoftware.com/books/winconc/winconc_book_resources.html"&gt;Concurrent
   Programming on Windows&lt;/a&gt; to Addison-Wesley.
&lt;/p&gt;
&lt;p&gt;
   This marks the exciting transition from things happening on my timetable to things
   happening on AW’s timetable.
&lt;/p&gt;
&lt;p&gt;
   A lot has changed for me&amp;nbsp;since I decided to write this book.&amp;nbsp; You might
   be surprised to hear that I actually signed the contract for it on November 29th,
   2005.&amp;nbsp; That’s 2 years and 7 months ago.&amp;nbsp; It’s almost unbelievable that this
   book took so long to finish.&amp;nbsp; By comparison, my &lt;a href="http://www.bluebytesoftware.com/books/netfx20/netfx20_book_resources.html"&gt;first
   one&lt;/a&gt; took just a little over a year.&amp;nbsp; The road has been a long one, full of
   personal ups and downs (I was, at the beginning, actually engaged to get married),
   but it’s no doubt been an exciting trip.
&lt;/p&gt;
&lt;p&gt;
   I’ve been at Microsoft the whole time.&amp;nbsp; At the outset, I was a PM on the CLR
   Team, hacking on software transactional memory and PLINQ as an evening activity.&amp;nbsp;
   Then I transitioned to doing it full time, but still as a PM.&amp;nbsp; Then I joined
   the Parallel Computing team as the&amp;nbsp;dev for&amp;nbsp;PLINQ.&amp;nbsp; Then I kicked off
   the whole Parallel Extensions effort (which is 20 members and growing strong), became
   the dev lead, and here I am today.&amp;nbsp; It’s pretty strange to say this, but without
   the book very little of that would have happened.&amp;nbsp; I can’t think of a better
   way to get entrenched in a technology, experience the breadth, and force yourself
   to learn every little intricate and often enlightening detail.&amp;nbsp; If you can afford
   the impact to mental health and personal relationships&amp;nbsp;;), it’s an activity I
   highly recommend to anybody wanting to master a technology...&amp;nbsp; not that one can
   actually master the concurrency beast, but y’know...
&lt;/p&gt;
&lt;p&gt;
   In retrospect, it should have taken a year.&amp;nbsp; Maybe next time.
&lt;/p&gt;
&lt;p&gt;
   The good news is that you will have the book in your hands soon.&amp;nbsp; (Well, if you
   decide to buy a copy, that is.)&amp;nbsp; If you manage to make it to &lt;a href="http://www.bluebytesoftware.com/blog/2008/05/30/PDC08ConcurrentMulticoreProgrammingOnWindowsAndNET.aspx"&gt;my
   PDC 2008&lt;/a&gt; pre-con session, I’m hoping we will have some copies available.&amp;nbsp;
   No promises, since I missed my final deadline by a couple weeks, but my fingers are
   crossed.
&lt;/p&gt;
&lt;p&gt;
   Oh yeah, and you can expect me to pick up blogging again now that I’ll have some free
   time.&amp;nbsp; Hmm, free time?&amp;nbsp; What will I do with myself!
&lt;/p&gt;
&lt;p&gt;
   &lt;strong&gt;&lt;em&gt;Laissez les bon temps roulez!&lt;/em&gt;&lt;/strong&gt;
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=dd98d41b-92cb-40b5-a276-b1a8b3d08c54" /&gt;</description>
      <comments>http://www.bluebytesoftware.com/blog/CommentView,guid,dd98d41b-92cb-40b5-a276-b1a8b3d08c54.aspx</comments>
      <category>Personal;Technology</category>
    </item>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=dd3aff8a-7f8d-4de6-a2e7-d199662b68f4</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,dd3aff8a-7f8d-4de6-a2e7-d199662b68f4.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <wfw:comment>http://www.bluebytesoftware.com/blog/CommentView,guid,dd3aff8a-7f8d-4de6-a2e7-d199662b68f4.aspx</wfw:comment>
      <wfw:commentRss>http://www.bluebytesoftware.com/blog/SyndicationService.asmx/GetEntryCommentsRss?guid=dd3aff8a-7f8d-4de6-a2e7-d199662b68f4</wfw:commentRss>
      <slash:comments>4</slash:comments>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
      We had an interesting debate at a Parallel Extensions design meeting yesterday, where
      I tried to convince everybody that a full fence on SpinLock exit is not a requirement. 
      We currently offer an Exit(bool) overload that accepts a flushReleaseWrite argument. 
      This merely changes the lock release from 
   </p>
        <blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
          <p>
            <font face="Courier New">m_state = 0; </font>
          </p>
        </blockquote>
        <p>
      to 
   </p>
        <blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
          <p>
            <font face="Courier New">Interlocked.Exchange(ref m_state, 0); </font>
          </p>
        </blockquote>
        <p>
      The main purpose of this is to announce “availability” of the locks to other processors. 
      More specifically, it ensures that before the current processor is able to turn around
      and reacquire the lock in its own private cache, that other processors at least have
      the opportunity to see the write.  This is a fairness optimization, and avoiding
      the CAS on release halves the number of CAS operations necessary (which are expensive),
      so we would generally like to avoid superflous ones.  It turns out you could
      easily do this without our help.  Instead of 
   </p>
        <blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
          <p>
            <font face="Courier New">slock.Exit(true);</font>
          </p>
        </blockquote>
        <p>
      you could say 
   </p>
        <blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
          <p>
            <font face="Courier New">slock.Exit();<br />
      Thread.MemoryBarrier(); </font>
          </p>
        </blockquote>
        <p>
      Most of the debate about whether the default Exit should use a fence centered around
      confusion over the strength of volatile vs. a full fence.  For example, the C#
      documentation for volatile is highly misleading (<a href="http://msdn.microsoft.com/en-us/library/x13ttww7(VS.71).aspx">http://msdn.microsoft.com/en-us/library/x13ttww7(VS.71).aspx</a>):
   </p>
        <blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
          <p>
            <em>The volatile modifier is usually used for a field that is accessed by multiple
      threads without using the lock statement to serialize access. Using the volatile modifier </em>
            <u>ensures
      that one thread retrieves the most up-to-date value written by another thread</u>
            <em>. </em>
          </p>
        </blockquote>
        <p>
      The confusion is over the “ensures that one thread receives the most up-to-date value
      written by another thread” part.  Technically this is somewhat-accurate, but
      is worded in a very funny and misleading way.  To see why, let’s take a step
      back and consider what volatile actually means in the CLR’s memory model (MM) for
      a moment, to set context.  Note that I did my best to concisely summarize the
      MM here: <a href="http://www.bluebytesoftware.com/blog/2007/11/10/CLR20MemoryModel.aspx">http://www.bluebytesoftware.com/blog/2007/11/10/CLR20MemoryModel.aspx</a>. 
   </p>
        <p>
      Volatile on loads means ACQUIRE, no more, no less.  (There are additional compiler
      optimization restrictions, of course, like not allowing hoisting outside of loops,
      but let’s focus on the MM aspects for now.)  The standard definition of ACQUIRE
      is that subsequent memory operations may not move before the ACQUIRE instruction;
      e.g. given { ld.acq X, ld Y }, the ld Y cannot occur before ld.acq X.  However,
      previous memory operations can certainly move after it; e.g. given { ld X, ld.acq
      Y }, the ld.acq Y can indeed occur before the ld X.  The only processor Microsoft
      .NET code currently runs on for which this actually occurs is IA64, but this is a
      notable area where CLR’s MM is weaker than most machines.  Next, all stores on
      .NET are RELEASE (regardless of volatile, i.e. volatile is a no-op in terms of jitted
      code).  The standard definition of RELEASE is that previous memory operations
      may not move after a RELEASE operation; e.g. given { st X, st.rel Y }, the st.rel
      Y cannot occur before st X.  However, subsequent memory operations can indeed
      move before it; e.g. given { st.rel X, ld Y }, the ld Y can move before st.rel X. 
      (I used a load since .NET stores are all release.)  Note that RELEASe is the
      opposite of ACQUIRE: you can think of an acquire as a one-way fence that prohibits
      passes downward, and a release as a one-way fence that prohibits passes upward. 
      A full fence prohibits both (lock acquire, XCHG, MB, etc).
   </p>
        <p>
      Note one very interesting thing in this discussion: a release followed by an acquire,
      given the above rules, does not prohibit movement of the instructions with respect
      to one another!  Given { st.rel X, ld.acq Y }, even though they are both volatile
      (i.e. acquire and release), so long as X!=Y, it is perfectly legal for the ld.acq
      Y to move before st.rel X.  We aren’t limited to single instructions either,
      e.g. { st.rel X, ld.acq A, ld.acq B, ld.acq C }, all three loads (A, B, C) may indeed
      happen before the X.  This occurs with regularity in practice, on X86, X64, and
      IA64, because of store buffering.  It would just be too costly to hold up loads
      until a store has reached all processors.  Superscalar execution is meant to
      hide such latencies.
   </p>
        <p>
      (As an aside, many people wonder about the difference between loads and stores of
      variables marked as volatile and calls to Thread.VolatileRead and Thread.VolatileWrite. 
      The difference is that the former APIs are implemented stronger than the jitted code:
      they achieve acquire/release semantics by emitting full fences on the right side. 
      The APIs are more expensive to call too, but at least allow you to decide on a callsite-by-callsite
      basis which individual loads and stores need the MM guarantees.)
   </p>
        <p>
      I have to admit the store buffer problem is mostly theoretical.  It rarely comes
      up in practice.  That said, on a system which permits load reordering, imagine: 
   </p>
        <blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
          <p>
            <em>Initially: X = Y = 0 </em>
          </p>
          <p>
            <font face="Courier New">
              <strong>T0                      
      T1 </strong>
              <br />
      X = 5; // st.rel         while (X == 0) ;
      // ld.acq<br />
      while (Y == 0) ; // ld   X = 0; // st.rel<br />
      A = X; // ld.acq         Y = 5; // st.rel</font>
          </p>
          <p>
            <em>After execution, is it possible that A == 5? </em>
          </p>
        </blockquote>
        <p>
      If the read of Y is non-volatile on T0 (which would be bad because a compiler may
      hoist it out of the loop, but ignore compilers for a moment), then the fact that the
      subsequent read of X is volatile does not save us from a reordering leading to A ==
      5.  This is the { ld, ld.acq } case described earlier.  Why might this physically
      occur?  Well, it won’t happen on X86 and X64 because loads are not permitted
      to reorder.  However!!  IA64 permits non-acquire loads (non-volatile) to
      reorder, and so the A = X may actually be satisfied out of the write buffer before
      the store even leaves the processor.  It’s as though the program became: 
   </p>
        <blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
          <p>
            <font face="Courier New">
              <strong>T0                      
      T1 
      <br /></strong>X = 5; // st.rel         while (X
      == 0) ; // ld.acq<br />
      A = X; // ld.acq         X = 0; // st.rel<br />
      while (Y == 0) ; // ld   Y = 5; // st.rel</font>
          </p>
        </blockquote>
        <p>
      Whoops!  This should make it apparent that this outcome is indeed a real possibility. 
      And clearly it may cause bugs.
   </p>
        <p>
          <em>Note 6/13/08: <a href="http://blogs.msdn.com/ericeil/">Eric</a> pointed out privately
      that compilers need only respect the CLR MM, and can freely reorder loads.  Thus,
      this problem may actually arise on non-IA64 machines.  Of course he is entirely
      correct.  It was silly of me to overlook that.</em>
        </p>
        <p>
      All that said, let’s get back to the original concern about visibility of writes. 
      This issue doesn’t even really involve reordering.  Imagine one processor continuously
      executes a stream of lock acquires and releases, and that the stream goes on indefinitely
      (perhaps because it’s in a loop): 
   </p>
        <blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
          <p>
            <font face="Courier New">while (Interlocked.CompareExchange(ref m_state, 1, 0) !=
      0) ;<br />
      m_state = 0;<br />
      while (Interlocked.CompareExchange(ref m_state, 1, 0) != 0) ;<br />
      m_state = 0;<br />
      … </font>
          </p>
        </blockquote>
        <p>
      The Interlocked operation acquires the cache line in X mode.  After it executes,
      other processors will notice that the lock is taken.  But right away, the processor
      writes 0 to the line without a fence, and immediately goes on to execute another acquire. 
      It is highly likely that the line will be marked dirty in the processor’s cache by
      the time that it acquires it in X mode again, something that the cache coherency system
      makes very cheap.  In fact, the write of m_state = 0 probably hasn’t left the
      write buffer yet due to latency.
   </p>
        <p>
      So before another processor can even see m_state as 0, the processor will have already
      gotten around to taking the lock again.  Even for volatile loads and stores,
      there is no MM guarantee that writes will leave the processor immediately; hence the
      documentaiton earlier is slightly confusing; yes, the processor doing a volatile read
      will see the “most recent” value, but that “most recent” value (a) may be satisfied
      out of the local write buffer, and (b) may simply not have the ability to observe
      writes that occurred in practice due to the above timeliness issue. 
   </p>
        <img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=dd3aff8a-7f8d-4de6-a2e7-d199662b68f4" />
      </body>
      <title>Volatile reads and writes, and timeliness</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,dd3aff8a-7f8d-4de6-a2e7-d199662b68f4.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2008/06/13/VolatileReadsAndWritesAndTimeliness.aspx</link>
      <pubDate>Fri, 13 Jun 2008 19:52:45 GMT</pubDate>
      <description>&lt;p&gt;
   We had an interesting debate at a Parallel Extensions design meeting yesterday, where
   I tried to convince everybody that a full fence on SpinLock exit is not a requirement.&amp;nbsp;
   We currently offer an Exit(bool) overload that accepts a flushReleaseWrite argument.&amp;nbsp;
   This merely changes the lock release from 
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p&gt;
   &lt;font face="Courier New"&gt;m_state = 0; &lt;/font&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   to 
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p&gt;
   &lt;font face="Courier New"&gt;Interlocked.Exchange(ref m_state, 0); &lt;/font&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   The main purpose of this is to announce “availability” of the locks to other processors.&amp;nbsp;
   More specifically, it ensures that before the current processor is able to turn around
   and reacquire the lock in its own private cache, that other processors at least have
   the opportunity to see the write.&amp;nbsp; This is a fairness optimization, and avoiding
   the CAS on release halves the number of CAS operations necessary (which are expensive),
   so we would generally like to avoid superflous ones.&amp;nbsp; It turns out you could
   easily do this without our help.&amp;nbsp; Instead of 
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p&gt;
   &lt;font face="Courier New"&gt;slock.Exit(true);&lt;/font&gt; 
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   you could say 
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p&gt;
   &lt;font face="Courier New"&gt;slock.Exit();&lt;br&gt;
   Thread.MemoryBarrier(); &lt;/font&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   Most of the debate about whether the default Exit should use a fence centered around
   confusion over the strength of volatile vs. a full fence.&amp;nbsp; For example, the C#
   documentation for volatile is highly misleading (&lt;a href="http://msdn.microsoft.com/en-us/library/x13ttww7(VS.71).aspx"&gt;http://msdn.microsoft.com/en-us/library/x13ttww7(VS.71).aspx&lt;/a&gt;):
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p&gt;
   &lt;em&gt;The volatile modifier is usually used for a field that is accessed by multiple
   threads without using the lock statement to serialize access. Using the volatile modifier &lt;/em&gt;&lt;u&gt;ensures
   that one thread retrieves the most up-to-date value written by another thread&lt;/u&gt;&lt;em&gt;. &lt;/em&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   The confusion is over the “ensures that one thread receives the most up-to-date value
   written by another thread” part.&amp;nbsp; Technically this is somewhat-accurate, but
   is worded in a very funny and misleading way.&amp;nbsp; To see why, let’s take a step
   back and consider what volatile actually means in the CLR’s memory model (MM) for
   a moment, to set context.&amp;nbsp; Note that I did my best to concisely summarize the
   MM here: &lt;a href="http://www.bluebytesoftware.com/blog/2007/11/10/CLR20MemoryModel.aspx"&gt;http://www.bluebytesoftware.com/blog/2007/11/10/CLR20MemoryModel.aspx&lt;/a&gt;. 
&lt;/p&gt;
&lt;p&gt;
   Volatile on loads means ACQUIRE, no more, no less.&amp;nbsp; (There are additional compiler
   optimization restrictions, of course,&amp;nbsp;like not allowing hoisting outside of loops,
   but let’s focus on the MM aspects for now.)&amp;nbsp; The standard definition of ACQUIRE
   is that subsequent memory operations may not move before the ACQUIRE instruction;
   e.g. given { ld.acq X, ld Y }, the ld Y cannot occur before ld.acq X.&amp;nbsp; However,
   previous memory operations can certainly move after it; e.g. given { ld X, ld.acq
   Y }, the ld.acq Y can indeed occur before the ld X.&amp;nbsp; The only processor Microsoft
   .NET code currently runs on for which this actually occurs is IA64, but this is a
   notable area where CLR’s MM is weaker than most machines.&amp;nbsp; Next, all stores on
   .NET are RELEASE (regardless of volatile, i.e. volatile is a no-op in terms of jitted
   code).&amp;nbsp; The standard definition of RELEASE is that previous memory operations
   may not move after a RELEASE operation; e.g. given { st X, st.rel Y }, the st.rel
   Y cannot occur before st X.&amp;nbsp; However, subsequent memory operations can indeed
   move before it; e.g. given { st.rel X, ld Y }, the ld Y can move before st.rel X.&amp;nbsp;
   (I used a load since .NET stores are all release.)&amp;nbsp; Note that RELEASe is the
   opposite of ACQUIRE: you can think of an acquire as a one-way fence that prohibits
   passes downward, and a release as a one-way fence that prohibits passes upward.&amp;nbsp;
   A full fence prohibits both (lock acquire, XCHG, MB, etc).
&lt;/p&gt;
&lt;p&gt;
   Note one very interesting thing in this discussion: a release followed by an acquire,
   given the above rules, does not prohibit movement of the instructions with respect
   to one another!&amp;nbsp; Given { st.rel X, ld.acq Y }, even though they are both volatile
   (i.e. acquire and release), so long as X!=Y, it is perfectly legal for the ld.acq
   Y to move before st.rel X.&amp;nbsp; We aren’t limited to single instructions either,
   e.g. { st.rel X, ld.acq A, ld.acq B, ld.acq C }, all three loads (A, B, C) may indeed
   happen before the X.&amp;nbsp; This occurs with regularity in practice, on X86, X64, and
   IA64, because of store buffering.&amp;nbsp; It would just be too costly to hold up loads
   until a store has reached all processors.&amp;nbsp; Superscalar execution is meant to
   hide such latencies.
&lt;/p&gt;
&lt;p&gt;
   (As an aside, many people wonder about the difference between loads and stores of
   variables marked as volatile and calls to Thread.VolatileRead and Thread.VolatileWrite.&amp;nbsp;
   The difference is that the former APIs are implemented stronger than the jitted code:
   they achieve acquire/release semantics by emitting full fences on the right side.&amp;nbsp;
   The APIs are more expensive to call too, but at least allow you to decide on a callsite-by-callsite
   basis which individual loads and stores need the MM guarantees.)
&lt;/p&gt;
&lt;p&gt;
   I have to admit the store buffer problem is mostly theoretical.&amp;nbsp; It rarely comes
   up in practice.&amp;nbsp; That said, on a system which permits load reordering, imagine: 
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p&gt;
   &lt;em&gt;Initially: X = Y = 0 &lt;/em&gt;
&lt;/p&gt;
&lt;p&gt;
   &lt;font face="Courier New"&gt;&lt;strong&gt;T0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   T1 &lt;/strong&gt;
   &lt;br&gt;
   X = 5; // st.rel&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; while (X == 0) ;
   // ld.acq&lt;br&gt;
   while (Y == 0) ; // ld&amp;nbsp;&amp;nbsp; X = 0; // st.rel&lt;br&gt;
   A = X; // ld.acq&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Y = 5; // st.rel&lt;/font&gt; 
&lt;/p&gt;
&lt;p&gt;
   &lt;em&gt;After execution, is it possible that A == 5? &lt;/em&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   If the read of Y is non-volatile on T0 (which would be bad because a compiler may
   hoist it out of the loop, but ignore compilers for a moment), then the fact that the
   subsequent read of X is volatile does not save us from a reordering leading to A ==
   5.&amp;nbsp; This is the { ld, ld.acq } case described earlier.&amp;nbsp; Why might this physically
   occur?&amp;nbsp; Well, it won’t happen on X86 and X64 because loads are not permitted
   to reorder.&amp;nbsp; However!!&amp;nbsp; IA64 permits non-acquire loads (non-volatile) to
   reorder, and so the A = X may actually be satisfied out of the write buffer before
   the store even leaves the processor.&amp;nbsp; It’s as though the program became: 
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p&gt;
   &lt;font face="Courier New"&gt;&lt;strong&gt;T0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   T1 
   &lt;br&gt;
   &lt;/strong&gt;X = 5; // st.rel&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; while (X
   == 0) ; // ld.acq&lt;br&gt;
   A = X; // ld.acq&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; X = 0; // st.rel&lt;br&gt;
   while (Y == 0) ; // ld&amp;nbsp;&amp;nbsp; Y = 5; // st.rel&lt;/font&gt; 
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   Whoops!&amp;nbsp; This should make it apparent that this outcome is indeed a real possibility.&amp;nbsp;
   And clearly it may cause bugs.
&lt;/p&gt;
&lt;p&gt;
   &lt;em&gt;Note 6/13/08: &lt;a href="http://blogs.msdn.com/ericeil/"&gt;Eric&lt;/a&gt; pointed out privately
   that compilers need only respect the CLR MM, and can freely reorder loads.&amp;nbsp; Thus,
   this problem may actually arise on non-IA64 machines.&amp;nbsp; Of course he is entirely
   correct.&amp;nbsp; It was silly of me to overlook that.&lt;/em&gt;
&lt;/p&gt;
&lt;p&gt;
   All that said, let’s get back to the original concern about visibility of writes.&amp;nbsp;
   This issue doesn’t even really involve reordering.&amp;nbsp; Imagine one processor continuously
   executes a stream of lock acquires and releases, and that the stream goes on indefinitely
   (perhaps because it’s in a loop): 
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p&gt;
   &lt;font face="Courier New"&gt;while (Interlocked.CompareExchange(ref m_state, 1, 0) !=
   0) ;&lt;br&gt;
   m_state = 0;&lt;br&gt;
   while (Interlocked.CompareExchange(ref m_state, 1, 0) != 0) ;&lt;br&gt;
   m_state = 0;&lt;br&gt;
   … &lt;/font&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   The Interlocked operation acquires the cache line in X mode.&amp;nbsp; After it executes,
   other processors will notice that the lock is taken.&amp;nbsp; But right away, the processor
   writes 0 to the line without a fence, and immediately goes on to execute another acquire.&amp;nbsp;
   It is highly likely that the line will be marked dirty in the processor’s cache by
   the time that it acquires it in X mode again, something that the cache coherency system
   makes very cheap.&amp;nbsp; In fact, the write of m_state = 0 probably hasn’t left the
   write buffer yet due to latency.
&lt;/p&gt;
&lt;p&gt;
   So before another processor can even see m_state as 0, the processor will have already
   gotten around to taking the lock again.&amp;nbsp; Even for volatile loads and stores,
   there is no MM guarantee that writes will leave the processor immediately; hence the
   documentaiton earlier is slightly confusing; yes, the processor doing a volatile read
   will see the “most recent” value, but that “most recent” value (a) may be satisfied
   out of the local write buffer, and (b) may simply not have the ability to observe
   writes that occurred in practice due to the above timeliness issue.&amp;nbsp;
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=dd3aff8a-7f8d-4de6-a2e7-d199662b68f4" /&gt;</description>
      <comments>http://www.bluebytesoftware.com/blog/CommentView,guid,dd3aff8a-7f8d-4de6-a2e7-d199662b68f4.aspx</comments>
      <category>Technology</category>
    </item>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=393b17f8-ce5d-40f7-80e6-414e4841fdde</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,393b17f8-ce5d-40f7-80e6-414e4841fdde.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <wfw:comment>http://www.bluebytesoftware.com/blog/CommentView,guid,393b17f8-ce5d-40f7-80e6-414e4841fdde.aspx</wfw:comment>
      <wfw:commentRss>http://www.bluebytesoftware.com/blog/SyndicationService.asmx/GetEntryCommentsRss?guid=393b17f8-ce5d-40f7-80e6-414e4841fdde</wfw:commentRss>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
      We sat down last week with Charles from Channel9 to discuss the new CTP.  Both
      parts got posted today:
   </p>
        <ul>
          <li>
         Part 1: <a href="http://channel9.msdn.com/shows/Going+Deep/Inside-Parallel-Extensions-for-NET-2008-CTP-Part-1/">http://channel9.msdn.com/shows/Going+Deep/Inside-Parallel-Extensions-for-NET-2008-CTP-Part-1/</a></li>
          <li>
         Part 2: <a href="http://channel9.msdn.com/shows/Going+Deep/Inside-Parallel-Extensions-for-NET-2008-CTP-Part-2/">http://channel9.msdn.com/shows/Going+Deep/Inside-Parallel-Extensions-for-NET-2008-CTP-Part-2/</a></li>
        </ul>
        <p>
      We focus on the new aspects of the stack, incl. the new scheduler and CDS, and also
      discuss what's changed in PLINQ and TPL.
   </p>
        <p>
      If you have ideas for future videos, or any feedback/questions, you know where to
      send 'em.  joedu AT youknowwhere DOT com.
   </p>
        <img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=393b17f8-ce5d-40f7-80e6-414e4841fdde" />
      </body>
      <title>A tour of the new Parallel Extensions CTP on Channel9</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,393b17f8-ce5d-40f7-80e6-414e4841fdde.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2008/06/06/ATourOfTheNewParallelExtensionsCTPOnChannel9.aspx</link>
      <pubDate>Fri, 06 Jun 2008 00:14:47 GMT</pubDate>
      <description>&lt;p&gt;
   We sat down last week with Charles from Channel9 to discuss the new CTP.&amp;nbsp; Both
   parts got posted today:
&lt;/p&gt;
&lt;ul&gt;
   &lt;li&gt;
      Part 1: &lt;a href="http://channel9.msdn.com/shows/Going+Deep/Inside-Parallel-Extensions-for-NET-2008-CTP-Part-1/"&gt;http://channel9.msdn.com/shows/Going+Deep/Inside-Parallel-Extensions-for-NET-2008-CTP-Part-1/&lt;/a&gt;
   &lt;/li&gt;
   &lt;li&gt;
      Part 2: &lt;a href="http://channel9.msdn.com/shows/Going+Deep/Inside-Parallel-Extensions-for-NET-2008-CTP-Part-2/"&gt;http://channel9.msdn.com/shows/Going+Deep/Inside-Parallel-Extensions-for-NET-2008-CTP-Part-2/&lt;/a&gt;
   &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
   We focus on the new aspects of the stack, incl. the new scheduler and CDS, and also
   discuss what's changed in PLINQ and TPL.
&lt;/p&gt;
&lt;p&gt;
   If you have ideas for future videos, or any feedback/questions, you know where to
   send 'em.&amp;nbsp; joedu AT youknowwhere DOT com.
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=393b17f8-ce5d-40f7-80e6-414e4841fdde" /&gt;</description>
      <comments>http://www.bluebytesoftware.com/blog/CommentView,guid,393b17f8-ce5d-40f7-80e6-414e4841fdde.aspx</comments>
      <category>Technology</category>
    </item>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=847358d1-b8bc-4fa9-a293-b11c7726846b</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,847358d1-b8bc-4fa9-a293-b11c7726846b.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <wfw:comment>http://www.bluebytesoftware.com/blog/CommentView,guid,847358d1-b8bc-4fa9-a293-b11c7726846b.aspx</wfw:comment>
      <wfw:commentRss>http://www.bluebytesoftware.com/blog/SyndicationService.asmx/GetEntryCommentsRss?guid=847358d1-b8bc-4fa9-a293-b11c7726846b</wfw:commentRss>
      <slash:comments>4</slash:comments>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
      We just released a new CTP of Parallel Extensions to .NET: <a href="http://www.microsoft.com/downloads/details.aspx?FamilyId=348F73FD-593D-4B3C-B055-694C50D2B0F3&amp;displaylang=en">get
      it here</a>.
   </p>
        <p>
      Some relevant information is up on our team blog:
   </p>
        <ul>
          <li>
         What's new: <a href="http://blogs.msdn.com/pfxteam/archive/2008/06/02/8567093.aspx">http://blogs.msdn.com/pfxteam/archive/2008/06/02/8567093.aspx</a></li>
          <li>
         Known issues: <a href="http://blogs.msdn.com/pfxteam/archive/2008/06/02/8567816.aspx">http://blogs.msdn.com/pfxteam/archive/2008/06/02/8567816.aspx</a> (hey,
         nobody's perfect)</li>
        </ul>
        <p>
      I'm really excited to see our entire stack finally shipping as one cohesive unit:
      the data structures we use throughout the implementation exposed publicly (what
      we now call CDS), a new scheduler built from the ground up, TPL and PLINQ better together,
      and lots more.  We're still very far from the end of the road, and have plenty
      of cool stuff still in the works, but we've made a ton of progress in just 6
      months since our last CTP.
   </p>
        <p>
      Have fun and, as always, feedback is much appreciated.
   </p>
        <img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=847358d1-b8bc-4fa9-a293-b11c7726846b" />
      </body>
      <title>New CTP of Parallel Extensions is available</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,847358d1-b8bc-4fa9-a293-b11c7726846b.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2008/06/02/NewCTPOfParallelExtensionsIsAvailable.aspx</link>
      <pubDate>Mon, 02 Jun 2008 16:00:26 GMT</pubDate>
      <description>&lt;p&gt;
   We just released a new CTP of Parallel Extensions to .NET: &lt;a href="http://www.microsoft.com/downloads/details.aspx?FamilyId=348F73FD-593D-4B3C-B055-694C50D2B0F3&amp;amp;displaylang=en"&gt;get
   it here&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
   Some relevant information is up on our team blog:
&lt;/p&gt;
&lt;ul&gt;
   &lt;li&gt;
      What's new: &lt;a href="http://blogs.msdn.com/pfxteam/archive/2008/06/02/8567093.aspx"&gt;http://blogs.msdn.com/pfxteam/archive/2008/06/02/8567093.aspx&lt;/a&gt;
   &lt;/li&gt;
   &lt;li&gt;
      Known issues: &lt;a href="http://blogs.msdn.com/pfxteam/archive/2008/06/02/8567816.aspx"&gt;http://blogs.msdn.com/pfxteam/archive/2008/06/02/8567816.aspx&lt;/a&gt;&amp;nbsp;(hey,
      nobody's perfect)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
   I'm really excited to see our entire stack finally shipping as one cohesive unit:
   the data structures we use throughout the implementation exposed publicly&amp;nbsp;(what
   we now call CDS), a new scheduler built from the ground up, TPL and PLINQ better together,
   and lots more.&amp;nbsp; We're still very far from the end of the road, and have plenty
   of cool stuff still in the works, but&amp;nbsp;we've made a ton of progress in just&amp;nbsp;6
   months since our last CTP.
&lt;/p&gt;
&lt;p&gt;
   Have fun and, as always, feedback is much appreciated.
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=847358d1-b8bc-4fa9-a293-b11c7726846b" /&gt;</description>
      <comments>http://www.bluebytesoftware.com/blog/CommentView,guid,847358d1-b8bc-4fa9-a293-b11c7726846b.aspx</comments>
      <category>Technology</category>
    </item>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=5e8ae5a6-1126-4a4f-8d66-3e19e434a011</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,5e8ae5a6-1126-4a4f-8d66-3e19e434a011.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <wfw:comment>http://www.bluebytesoftware.com/blog/CommentView,guid,5e8ae5a6-1126-4a4f-8d66-3e19e434a011.aspx</wfw:comment>
      <wfw:commentRss>http://www.bluebytesoftware.com/blog/SyndicationService.asmx/GetEntryCommentsRss?guid=5e8ae5a6-1126-4a4f-8d66-3e19e434a011</wfw:commentRss>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
      PDC'08 is officially <em>on </em>for October 27-30th this year: <a href="http://microsoftpdc.com/">http://microsoftpdc.com/</a>.
   </p>
        <p>
      My team will certainly have some really fun stuff to show off, and just glancing at
      the preliminary list of teaser sessions, it's going to be a blast.
   </p>
        <p>
      A few of us have teamed up to give a PreCon.  That's code-word for a
      full day of neverending concurrency goodness.  From <a href="http://microsoftpdc.com/agenda/Preconference.aspx">http://microsoftpdc.com/agenda/Preconference.aspx</a>:
   </p>
        <blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
          <p>
            <strong>
              <font size="3">Concurrent, Multi-core Programming on Windows and .NET</font>
            </strong>
            <br />
            <em>Presenter(s): David Callahan, Joe Duffy, Stephen Toub</em>
            <br />
      The leap from single-core to multi-core technology is altering computing as we know
      it, enabling increased productivity, powerful energy-efficient performance, and leading-edge
      advanced computing experiences. The good news is that Windows and .NET offer rich
      support for threading and synchronization to take advantage of these new platforms.
      This session, presented by David Callahan, Microsoft distinguished engineer, Joe Duffy,
      author of “Concurrent Programming on Windows” (Addison-Wesley), and Stephen Toub,
      program manager lead for the Concurrency Development Platform team at Microsoft, will
      cover a broad range of topics, including mechanisms to create, coordinate, and synchronize
      among threads; best practices for concurrent libraries and apps; and techniques for
      improving scalability, including lock-free algorithms. Focus will be on .NET programming,
      including the next generation of parallel programming support within the Framework,
      but Windows internals and C++ nuggets will be discussed too. 
   </p>
          <p>
            <em>About the presenter(s): </em>
            <br />
      David Callahan joined Microsoft in 2005. He is a Distinguished Engineer leading the
      Parallel Computing Platform Team within Visual Studio® focused on incubating technology
      for the coming manycore processors. This team is producing exciting new technologies
      as part of Visual Studio and also driving the Parallel Computing Initiative, a company
      wide effort to deliver customer value from the power of future high-performance processors.
      David’s background is in programming languages, parallel programming techniques, and
      compilation techniques focused on expressing and exploiting concurrency. 
   </p>
          <p>
      Stephen Toub is a Senior Program Manager Lead on the Parallel Computing Platform team
      at Microsoft, where he spends his days focusing on the next generation of programming
      models for concurrency. Stephen is also a Contributing Editor for MSDN® Magazine,
      for which he writes the .NET Matters column, and he’s an avid speaker at conferences
      like TechEd and DevConnections. Prior to working on the Parallel Computing Platform,
      Stephen designed and built enterprise applications for companies such as GE, McGraw-Hill,
      BankOne, and JetBlue. He was a developer for Microsoft Outlook as well as for the
      Microsoft Office Solution Accelerators. In his spare time, Stephen loves to sing,
      and he spends as much time as possible with his beautiful wife Tamara. 
   </p>
          <p>
      Joe Duffy leads development for Microsoft's Parallel Extensions to .NET technology,
      a set of library and runtime technologies for concurrent and parallel computing. He
      founded the project in 2006 with Parallel Language Integrated Query (aka PLINQ), an
      innovative declarative parallel query analysis and execution engine. Prior to Parallel
      Extensions, Joe worked on transactional memory, library and VM support for concurrency
      in the Common Language Runtime (CLR) team, and has written 3 functional language compilers
      (Scheme, Common LISP, and Haskell). He has written two books, including Concurrent
      Programming on Windows (Addison-Wesley, 2008), and in his spare time reads and writes
      (code and text), plays guitar, and studies music theory. 
   </p>
        </blockquote>
        <p>
      Be there, or be square.  The slides aren't even finalized yet, so let me know
      if you have particular topics you'd like to learn more about.
   </p>
        <img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=5e8ae5a6-1126-4a4f-8d66-3e19e434a011" />
      </body>
      <title>PDC'08: Concurrent, Multi-core Programming on Windows and .NET</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,5e8ae5a6-1126-4a4f-8d66-3e19e434a011.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2008/05/30/PDC08ConcurrentMulticoreProgrammingOnWindowsAndNET.aspx</link>
      <pubDate>Fri, 30 May 2008 06:19:11 GMT</pubDate>
      <description>&lt;p&gt;
   PDC'08 is officially &lt;em&gt;on &lt;/em&gt;for October 27-30th this year: &lt;a href="http://microsoftpdc.com/"&gt;http://microsoftpdc.com/&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
   My team will certainly have some really fun stuff to show off, and just glancing at
   the preliminary list of teaser sessions, it's going to be a blast.
&lt;/p&gt;
&lt;p&gt;
   A few of&amp;nbsp;us have teamed up to give a PreCon.&amp;nbsp; That's code-word for&amp;nbsp;a
   full day of neverending concurrency goodness.&amp;nbsp; From &lt;a href="http://microsoftpdc.com/agenda/Preconference.aspx"&gt;http://microsoftpdc.com/agenda/Preconference.aspx&lt;/a&gt;:
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p&gt;
   &lt;strong&gt;&lt;font size=3&gt;Concurrent, Multi-core Programming on Windows and .NET&lt;/font&gt;&lt;/strong&gt;
   &lt;br&gt;
   &lt;em&gt;Presenter(s): David Callahan, Joe Duffy, Stephen Toub&lt;/em&gt;
   &lt;br&gt;
   The leap from single-core to multi-core technology is altering computing as we know
   it, enabling increased productivity, powerful energy-efficient performance, and leading-edge
   advanced computing experiences. The good news is that Windows and .NET offer rich
   support for threading and synchronization to take advantage of these new platforms.
   This session, presented by David Callahan, Microsoft distinguished engineer, Joe Duffy,
   author of “Concurrent Programming on Windows” (Addison-Wesley), and Stephen Toub,
   program manager lead for the Concurrency Development Platform team at Microsoft, will
   cover a broad range of topics, including mechanisms to create, coordinate, and synchronize
   among threads; best practices for concurrent libraries and apps; and techniques for
   improving scalability, including lock-free algorithms. Focus will be on .NET programming,
   including the next generation of parallel programming support within the Framework,
   but Windows internals and C++ nuggets will be discussed too. 
&lt;/p&gt;
&lt;p&gt;
   &lt;em&gt;About the presenter(s): &lt;/em&gt;
   &lt;br&gt;
   David Callahan joined Microsoft in 2005. He is a Distinguished Engineer leading the
   Parallel Computing Platform Team within Visual Studio® focused on incubating technology
   for the coming manycore processors. This team is producing exciting new technologies
   as part of Visual Studio and also driving the Parallel Computing Initiative, a company
   wide effort to deliver customer value from the power of future high-performance processors.
   David’s background is in programming languages, parallel programming techniques, and
   compilation techniques focused on expressing and exploiting concurrency. 
&lt;/p&gt;
&lt;p&gt;
   Stephen Toub is a Senior Program Manager Lead on the Parallel Computing Platform team
   at Microsoft, where he spends his days focusing on the next generation of programming
   models for concurrency. Stephen is also a Contributing Editor for MSDN® Magazine,
   for which he writes the .NET Matters column, and he’s an avid speaker at conferences
   like TechEd and DevConnections. Prior to working on the Parallel Computing Platform,
   Stephen designed and built enterprise applications for companies such as GE, McGraw-Hill,
   BankOne, and JetBlue. He was a developer for Microsoft Outlook as well as for the
   Microsoft Office Solution Accelerators. In his spare time, Stephen loves to sing,
   and he spends as much time as possible with his beautiful wife Tamara. 
&lt;/p&gt;
&lt;p&gt;
   Joe Duffy leads development for Microsoft's Parallel Extensions to .NET technology,
   a set of library and runtime technologies for concurrent and parallel computing. He
   founded the project in 2006 with Parallel Language Integrated Query (aka PLINQ), an
   innovative declarative parallel query analysis and execution engine. Prior to Parallel
   Extensions, Joe worked on transactional memory, library and VM support for concurrency
   in the Common Language Runtime (CLR) team, and has written 3 functional language compilers
   (Scheme, Common LISP, and Haskell). He has written two books, including Concurrent
   Programming on Windows (Addison-Wesley, 2008), and in his spare time reads and writes
   (code and text), plays guitar, and studies music theory. 
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   Be there, or be square.&amp;nbsp; The slides aren't even finalized yet, so let me know
   if you have particular topics you'd like to learn more about.
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=5e8ae5a6-1126-4a4f-8d66-3e19e434a011" /&gt;</description>
      <comments>http://www.bluebytesoftware.com/blog/CommentView,guid,5e8ae5a6-1126-4a4f-8d66-3e19e434a011.aspx</comments>
      <category>Technology</category>
    </item>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=353c8058-be59-46c1-bd99-da9f0ed2e9ed</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,353c8058-be59-46c1-bd99-da9f0ed2e9ed.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <wfw:comment>http://www.bluebytesoftware.com/blog/CommentView,guid,353c8058-be59-46c1-bd99-da9f0ed2e9ed.aspx</wfw:comment>
      <wfw:commentRss>http://www.bluebytesoftware.com/blog/SyndicationService.asmx/GetEntryCommentsRss?guid=353c8058-be59-46c1-bd99-da9f0ed2e9ed</wfw:commentRss>
      <title>Concurrent counting</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,353c8058-be59-46c1-bd99-da9f0ed2e9ed.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2008/05/17/ConcurrentCounting.aspx</link>
      <pubDate>Sat, 17 May 2008 03:18:59 GMT</pubDate>
      <description>&lt;p&gt;
   Counting events and doing something once a certain number have been registered is
   a highly common pattern that comes up in concurrent programming a lot.&amp;nbsp; In the
   olden days, COM ref counting was a clear example of this: multiple threads might share
   a COM object, call Release when done with it, and hence memory management was much
   simpler.&amp;nbsp; GC has alleviated a lot of that, but the problem of deciding when a
   shared IDisposable resource should be finally Disposed of in .NET is strikingly similar.&amp;nbsp;
   And now-a-days, things like CountdownEvent are commonly useful for orchestrating multiple
   workers (see &lt;a href="http://msdn.microsoft.com/en-us/magazine/cc163427.aspx"&gt;MSDN
   Magazine&lt;/a&gt;), which (although not evident at first) is based on the same counting
   principle.
&lt;/p&gt;
&lt;p&gt;
   Coding up one-off solutions to all of these is actually pretty simple.&amp;nbsp; But doing
   so seems unfashionably ad-hoc, at least to me.&amp;nbsp; Codifying the pattern can be
   done in a couple dozen lines of code, so that it can be reused for many purposes.&amp;nbsp;
   As an example, here is a reusable Counting&amp;lt;T&amp;gt; class, written in C#, that just
   invokes action delegate once the count reaches zero:
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;#pragma
   warning disable 0420&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;
   &lt;o:p&gt;
      &lt;font color=#000000&gt;&amp;nbsp;&lt;/font&gt;
   &lt;/o:p&gt;
   &lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;using System;&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;using System.Threading;&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;
   &lt;o:p&gt;
      &lt;font color=#000000&gt;&amp;nbsp;&lt;/font&gt;
   &lt;/o:p&gt;
   &lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;public class Counting&amp;lt;T&amp;gt;&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;{&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; private
   readonly T m_obj;&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; private
   volatile int m_count;&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; private
   readonly Action&amp;lt;T&amp;gt; m_action;&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; public
   Counting(T obj, int initialCount, Action&amp;lt;T&amp;gt; action)&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   m_obj = obj;&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   &amp;nbsp;m_count = initialCount;&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   m_action = action;&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; public
   int AddRef()&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   int c;&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   if ((c = Interlocked.Increment(ref m_count)) == 1)&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   throw new Exception();&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   return c;&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; public
   int Release()&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   int c;&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   if ((c = Interlocked.Decrement(ref m_count)) == 0)&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   m_action(m_obj);&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   return c;&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;
   &lt;o:p&gt;
      &lt;font color=#000000&gt;&amp;nbsp;&lt;/font&gt;
   &lt;/o:p&gt;
   &lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; public
   T Obj { get { return m_obj; } }&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;}&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   Notice I’ve used the IUnknown vocabulary of AddRef and Release.&amp;nbsp; Old habits die
   hard.
&lt;/p&gt;
&lt;p&gt;
   The CountdownEvent I mentioned earlier is just a simple extension to this basic functionality.&amp;nbsp;
   In fact, we don’t need to write another class; it’s merely an instance of Counting&amp;lt;T&amp;gt;,
   where the T is a ManualResetEvent.&amp;nbsp; Setters directly use the Counting&amp;lt;T&amp;gt;
   object’s Release method to register a signal, while waiters can use the WaitOne method
   on the raw ManualResetEvent itself.&amp;nbsp; The event will be set once all signals have
   arrived:
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;Counting&amp;lt;ManualResetEvent&amp;gt;
   countingEv = new Counting&amp;lt;ManualResetEvent&amp;gt;(&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; new ManualResetEvent(false),
   N, e =&amp;gt; e.Set()&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;);&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;…&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;
   &lt;o:p&gt;
      &lt;font color=#000000&gt;&amp;nbsp;&lt;/font&gt;
   &lt;/o:p&gt;
   &lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;// Setter:&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;countingEv.Release();&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;
   &lt;o:p&gt;
      &lt;font color=#000000&gt;&amp;nbsp;&lt;/font&gt;
   &lt;/o:p&gt;
   &lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;// Waiter:&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;countingEv.Obj.WaitOne();&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   (Exposing a traditional Set/Wait interface would of course be nicer, but even then
   Counting&amp;lt;T&amp;gt; makes the implementation brain-dead simple.)
&lt;/p&gt;
&lt;p&gt;
   Similarly, the “who should dispose” problem is easy to solve with Counting&amp;lt;T&amp;gt;.&amp;nbsp;
   Say that, instead of setting the event, we actually want to Dispose of some IDisposable
   object when all threads are done with it:
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;Counting&amp;lt;ManualResetEvent&amp;gt;
   ev = new Counting&amp;lt;ManualResetEvent&amp;gt;(&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; new ManualResetEvent(false),
   N, e =&amp;gt; e.Dispose()&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;);&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   Though this does the trick, we might instead wrap it in a more convenient package:
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;public class CountingDispose&amp;lt;T&amp;gt;
   :&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   Counting&amp;lt;T&amp;gt;, IDisposable&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   where T : IDisposable&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;{&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; public
   CountingDispose(T obj, int initialCount) :&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
   base(obj, initialCount, d =&amp;gt; d.Dispose()) { }&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;}&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   Given this definition, threads can use the CountingDispose&amp;lt;T&amp;gt; object as they
   would any IDisposable thing.&amp;nbsp; This facilitates use in C# using blocks.&amp;nbsp;
   Only when all threads have called Dispose will Dispose be called on the actual underlying
   object:
&lt;/p&gt;
&lt;blockquote dir=ltr style="MARGIN-RIGHT: 0px"&gt; 
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;CountingDispose&amp;lt;ManualResetEvent&amp;gt;
   ev = new CountingDispose&amp;lt;ManualResetEvent&amp;gt;(&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; new ManualResetEvent(false),
   N&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;);&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;…&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;
   &lt;o:p&gt;
      &lt;font color=#000000&gt;&amp;nbsp;&lt;/font&gt;
   &lt;/o:p&gt;
   &lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;// Some threads wait:&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;using (ev) {&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; … ev.WaitOne();
   …&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;}&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;
   &lt;o:p&gt;
      &lt;font color=#000000&gt;&amp;nbsp;&lt;/font&gt;
   &lt;/o:p&gt;
   &lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;// Some threads set:&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;using (ev) {&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; … ev.Set();
   …&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;p class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;
   &lt;span style="FONT-FAMILY: Consolas"&gt;&lt;font color=#000000&gt;}&lt;o:p&gt;&lt;/o:p&gt;
   &lt;/font&gt;&lt;/span&gt;
&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;
   I’ve found that the extremely simple Counting&amp;lt;T&amp;gt; idea is a surprisingly powerful
   one.&amp;nbsp; It’s fairly extensible too; for example, you clearly may want to run actions
   at different points in the counting, use clever synchronization to ensure actions
   run at particular points are processed in-order (useful for progress reporting), to
   reset the count afterwards, and so on.&amp;nbsp; It’s way too simple to claim it’s anything
   terribly amazing, but thought I’d share the idea anyway.
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=353c8058-be59-46c1-bd99-da9f0ed2e9ed" /&gt;</description>
      <comments>http://www.bluebytesoftware.com/blog/CommentView,guid,353c8058-be59-46c1-bd99-da9f0ed2e9ed.aspx</comments>
      <category>Technology</category>
    </item>
    <item>
      <trackback:ping>http://www.bluebytesoftware.com/blog/Trackback.aspx?guid=6e62a753-75b9-4486-92e8-e32c38a85ad4</trackback:ping>
      <pingback:server>http://www.bluebytesoftware.com/blog/pingback.aspx</pingback:server>
      <pingback:target>http://www.bluebytesoftware.com/blog/PermaLink,guid,6e62a753-75b9-4486-92e8-e32c38a85ad4.aspx</pingback:target>
      <dc:creator>
      </dc:creator>
      <wfw:comment>http://www.bluebytesoftware.com/blog/CommentView,guid,6e62a753-75b9-4486-92e8-e32c38a85ad4.aspx</wfw:comment>
      <wfw:commentRss>http://www.bluebytesoftware.com/blog/SyndicationService.asmx/GetEntryCommentsRss?guid=6e62a753-75b9-4486-92e8-e32c38a85ad4</wfw:commentRss>
      <slash:comments>11</slash:comments>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
      We take code reviews very seriously in our group.  No checkin is ever made without
      a peer developer taking a close look.  (Incubation projects are often treated
      differently than product work, because the loss of agility is intolerable.) 
      A lot of this is done over email, but if there’s anything that is unclear from just
      looking at the code, a face to face review is done.  Feedback ranges from consistency
      (with guidelines and surrounding code), finding errors or overlooked conditions, providing
      suggestions on how to more clearly write something, comments, etc.; this ensures that
      our codebase is always of super high quality.
   </p>
        <p>
      Concurrency adds some complexity to development, and requires special consideration
      during code reviews.  I thought I’d put some thoughts on paper about what I look
      for during concurrency-oriented code reviews, in hopes that it will be useful to anybody
      starting to sink their teeth into concurrency; it may also help you devise your own
      internal review guidelines.  Most of this advice just comes down to knowing a
      laundry list of best practices, but a lot of it is also knowing what to look for and
      where to spend your time during a review.
   </p>
        <p>
      (A couple years ago I wrote a lengthy “Concurrency and its impact on reusable libraries”
      essay which provides a lot of the motivation behind what I look for.  It’s up
      on my blog, <a href="http://www.bluebytesoftware.com/blog/2006/10/26/ConcurrencyAndTheImpactOnReusableLibraries.aspx">http://www.bluebytesoftware.com/blog/2006/10/26/ConcurrencyAndTheImpactOnReusableLibraries.aspx</a>,
      and (though slightly out of date) I’m revising it for an Appendix in <a href="http://www.bluebytesoftware.com/books/winconc/winconc_book_resources.html">my
      upcoming book</a>.  If you question why I believe something, chances are that
      this document will explain my rationale.  And it’s far more complete than this
      short essay; I’ve only hit the high points here.)
   </p>
        <p>
          <strong>Getting started</strong>
        </p>
        <p>
      I first review the code in a traditional sequential code review fashion.  When
      doing this, I earmark all state that I see as either “private” (aka isolated) or “shared”. 
      I then go back and closely review all state that is shared (accessible from many threads)
      with a fine-tooth comb.  Sometimes I’ll do this during my first pass through,
      but I usually find it helpful to understand the algorithmic structure of the changes
      first before fully developing an understanding of the concurrency parts.
   </p>
        <p>
      Changes to existing code should be reviewed just as carefully (if not more carefully)
      as new code.  Concurrency behavior is subtle, and it’s very easy to accidentally
      violate some unchecked assumption the code was previously making.  Liberal use
      of asserts is therefore very important.  Sadly many of the conditions code assumes
      are simply unassertable (like “object X isn’t shared”).<br />
      I easily spend about 2x the amount of time reviewing the concurrency aspects of the
      code than usual sequential aspects.  Perhaps more.  This extra time is OK,
      because the concurrency portion is far smaller (in lines of code) than the sequential
      portion in most of the code I review.  There are obvious exceptions to this rule,
      especially since I’m on a team building low-level primitive data types whose sole
      purpose in life is to be used in concurrent apps.
   </p>
        <p>
          <strong>Shared state and synchronization</strong>
        </p>
        <p>
      Some state, although shared, is immutable (read-only) and can be safely shared and
      read from concurrently.  Often this is by construction (e.g. immutable value
      types) but sometimes this is by loose convention (e.g. a data structure is immutable
      for some period of time, simply by virtue of no threads actively writing to it). 
      Both should be clearly documented in the code.<br />
      Once mutable shared state is identified, I look for two major things:
   </p>
        <ol>
          <li>
         When does it become shared, i.e. publicized, and what is the protocol for the transfer
         of ownership?  Is it done cleanly?  And is it well documented? 
      </li>
          <li>
         When does it once again become private again, if ever?  And is this documented
         too?</li>
        </ol>
        <p>
      Ideally all shared state would have clean ownership transitions.  Any state that
      is disposable necessarily must have a point at the end of its life where it has a
      single owner, so it can be safely disposed of (unless ref-counting is used). 
      But for most state the line will be extremely blurry and unenforced.  Comments
      should be used to clarify, in gory detail.  I also tend to prefix names of variables
      that refer to shared objects with the word ‘shared’ itself, so that they jump out.
   </p>
        <p>
      Many, many bugs arise from some code publicizing some state, sometimes by accident,
      and then continuing to treat it as though it is private.  It is also sometimes
      tricky to determine this precisely, since sharing can be modal.  A list data
      structure may be shared in some contexts but not others.  Knowing what its sharing
      policy is requires transitive knowledge of callers.  Building up this level of
      global understanding can involve a fair bit of simply sitting back and reading and
      rereading the code over and over again.
   </p>
        <p>
      Once the policy around sharing a piece of state is known, it is crucial to understand
      the intended synchronization policy for that data.  Is it protected by a fine-grained
      monitor?  Is it manipulated and read in a lock-free way?  And so on. 
      And once the intended policy is known, is the actual policy implemented what was intended?
   </p>
        <p>
      While this part is extremely important, by the way, I have to admit that I feel this
      aspect tends to overshadow other things in conversation.  This is probably because
      it’s the most obvious thing to look for.  Sadly the world of concurrency is far
      more subtle than this.  I’ve honestly found more bugs resulting from failing
      to identify shared state properly than resulting from failing to implement the synchronization
      logic itself properly.  Your mileage will of course vary.
   </p>
        <p>
          <strong>Locks</strong>
        </p>
        <p>
      I treat lock-based code and lock-free as two entirely separate beasts.  I spend
      about 5x the time reviewing lock-free code when compared to lock-based code. 
      There is a tax to having lock-free code in any codebase, so as you are reviewing it,
      also ask yourself: is there a better (or almost-as-good) way that this could have
      been done using locks?  Often the answer is no, due to the benefits lock-freedom
      brings (no thread can indefinitely starve another). 
   </p>
        <p>
      But if the answer is yes, that the code could be written more clearly using locks,
      you could save your team a lot of time by convincing the author to change his or her
      mind.  Not only is lock-free code far more difficult to write and test, it carries
      a large tax during long stress hauls and end-game bug-fixing, an important and time-sensitive
      period in the development lifecycle of any commercial software product.  Maintaining
      lock-free code also carries an extra long-term cost, particularly when ramping up
      new hires on it.  All of this risks interfering with your ability to work on
      cool new features at some point.  Don’t feel bad about pushing back on this one. 
      Hard.
   </p>
        <p>
      Carefully review what happens inside of a lock region.  Look at every single
      line with scrutiny.
   </p>
        <ul>
          <li>
         Lock hold times should be as short as possible.  Hold times should be counted
         in dozens or hundreds of cycles, not thousands (unless absolutely unavoidable). 
      </li>
          <li>
         If lock hold times are in the dozens, you can consider using a spin-lock. 
      </li>
          <li>
         Recursive lock acquisitions are strongly discouraged.  If it can happen, did
         the developer clearly intend it to happen?  Or is this possibility accidental? 
         Point it out to them.  Also, are there any unexpected points at which reentrancy
         can occur?  E.g. any APC or message-pumping waits?  If yes, is there a way
         to avoid that by simple restructuring of the code? 
      </li>
          <li>
         Dynamic method calls via delegates or virtual methods while a lock is held should
         be as rare as possible.  Method calls under a lock to user-supplied code should
         only ever happen if the concurrency behavior is clearly documented and specified for
         the user, and when invariants hold.  All of these cases can lead to reentrancy. 
         Often this requires special code to detect the reentrancy and respond intelligently. 
      </li>
          <li>
         Lock regions should usually not span multiple methods: for example, acquiring the
         lock in one method, returning, and having the caller release it in another method
         is bad form.  It is very easy to screw up the control flow and deadlock your
         library. 
      </li>
          <li>
         CERs can only use spin-locks currently (because Monitor.ReliableEnter is currently
         unavailable), if you care about orphaning locks at least (which most CER-cost does). 
         If you see somebody trying to write a CER using a CLR Monitor, their code is probably
         busted.  Thankfully CERs are pretty rare to encounter in practice.</li>
        </ul>
        <p>
      Races that break code are always must-fix bugs, no matter how obscure.  If they
      happen with low frequency on the quad-cores of today, they will probably break with
      regularity on the 16-cores of tomorrow.  The kinds of code my team writes needs
      to remains correct and scale well into the distant future; presumably if you’re writing
      concurrent code already, yours does too.  If you find such a race, the code should
      not even be checked in until it’s fixed.  “But it only happens once in a while”
      is an inexcusable answer.  Benign races are OK but should be clearly documented.
   </p>
        <p>
          <strong>Events</strong>
        </p>
        <p>
      When I see any event-based code (either Monitor Wait/Pulse/PulseAll condition variables
      or some event type, like AutoResetEvent or ManualResetEvent), the first thing I do
      is build up a global understanding of all the conditions under which events are set,
      reset, and waited on.  This is to understand the coordination and flow of threads
      top-down, rather than bottom-up.  Because I’ve already reviewed the sequential
      parts of the algorithm, I typically already know the important state transitions events
      are guarding before I even get to this point.
   </p>
        <p>
      Next, there are some simple aspects to specific usage of events that I look for:
   </p>
        <ul>
          <li>
         Understanding the relationship between mutual exclusion, the state, and the events
         is important and subtle.  Comments should be used ideally to explain that. 
      </li>
          <li>
         Does the setting of the event happen in a wake-one (Pulse, Auto-Reset) or wake-all
         (PulseAll, Manual-Reset) manner?  If it’s wake-one, are all waiters homogeneous? 
         Is it always strictly true that waking-one is sufficient and won’t lead to missed
         wake-ups?</li>
          <li>
         Waiters that release the lock and then wait should be viewed with suspicion. 
         There’s a race between the release and wait that notoriously causes deadlocks.</li>
          <li>
         Concurrent code should never use timeouts as a way to work around sloppiness in the
         way threads wait and signal.  A missed wake-up is a bug in the code that must
         be fixed.</li>
        </ul>
        <p>
          <strong>Lock-freedom and volatiles</strong>
        </p>
        <p>
      If you’re looking at lock-free code, you need to have a firm grasp on the CLR’s memory
      model.  See <a href="http://www.bluebytesoftware.com/blog/2007/11/10/CLR20MemoryModel.aspx">http://www.bluebytesoftware.com/blog/2007/11/10/CLR20MemoryModel.aspx</a> for
      an overview.  Don’t think about the machine, think about the logical abstraction
      provided by the memory model.  You also need a firm grasp on the invariants of
      the data structures involved.  Specifically you are looking to see if the structure
      could ever move into a state, visible by another thread, where one of these invariants
      doesn’t hold.  I explicitly permute (often on a whiteboard or in notepad) the
      sections of the code that involve shared loads and stores, using knowledge of the
      legal reorderings given our memory model, to see if the code breaks.
   </p>
        <p>
      Any variable marked as volatile should be a red flag to carefully review all use of
      that variable.  For every single read and every single write of that variable,
      you must look at it and convince yourself of why volatile is necessary.  If you
      can’t, ask the person who wrote the code.  Sometimes volatile is used because
      most (but not all) call sites need it; that’s often acceptable.  Leaving the
      variable as non-volatile and selectively using Thread.VolatileRead for the reads that
      need it is typically too costly.  Anyway, comments should always be used to explain
      why each load and store is volatile, even if it doesn’t strictly need the volatile
      semantics.
   </p>
        <p>
      Conversely, any variable that is apparently shared, but not marked volatile, should
      be an even redder flag.  It’s very likely that this is a mistake.  Recall
      that writes happen in-order with the CLR’s memory model, but that reads do not. 
      Anytime there is a relationship between multiple shared variables that are written
      and read together (without the protection of a lock), they typically both need to
      be volatile.
   </p>
        <p>
      Any reads of shared variables used in a tight loop must be marked volatile. 
      Otherwise the compiler may decide to hoist them, causing an infinite loop.  Even
      if they are retrieved via simple method calls like property accessors (due to inlining).
   </p>
        <p>
      Thread.MemoryBarrier should typically only occur to deal with store (release) followed
      by load (acquire) reordering problems.  And it’s usually a better idea to use
      an InterlockedExchange for the store instead, since it implies a full barrier but
      combines the write.  Sometimes a fence can be used to flush write buffers—like
      when releasing a spin-lock to avoid giving the calling thread the unfair ability to
      turn right around and reacquire it—but this is extremely rare, and often an indication
      that somebody has an inaccurate mental model of what the fence is meant to do.
   </p>
        <p>
      Custom spin waiting should be used rarely.  If you see it used, the person may
      not be aware that spin waits <a href="http://www.bluebytesoftware.com/blog/2006/08/23/PriorityinducedStarvationWhySleep1IsBetterThanSleep0AndTheWindowsBalanceSetManager.aspx">need
      special attention</a>: to work well on HT machines, yield properly to all runnable
      threads with appropriate amortized costs, to spin only for a reasonable amount of
      time (in other words, less than the duration of a context switch), and so on. 
      Thread.SpinWait does not do what most people expect, since it only covers the first. 
      Kindly let them know about these things.  If any spin waiting is used in a codebase,
      it’s far better to consolidate all usage into a single primitive that does it all.
   </p>
        <p>
          <strong>Wrapping up</strong>
        </p>
        <p>
      At the end of each review, ask yourself whether all of the concurrency-oriented parts
      of the code were clearly explained in the design doc for the feature.  Did this
      carry over to clearly written comments in the implementation?  These are some
      really hard issues to get your head around, so the time spent reviewing the code should
      not be lost.  Somebody, someday down the road, will need to understand the code
      again (perhaps so that they can maintain it, test it, etc.), and it is your responsibility
      as a member of the team—regardless of whether you wrote the code—to do your part in
      making that feasible.  You should explicitly go back to the design doc and suggest
      areas for clarification.
   </p>
        <img width="0" height="0" src="http://www.bluebytesoftware.com/blog/aggbug.ashx?id=6e62a753-75b9-4486-92e8-e32c38a85ad4" />
      </body>
      <title>Concurrency-oriented code reviews</title>
      <guid>http://www.bluebytesoftware.com/blog/PermaLink,guid,6e62a753-75b9-4486-92e8-e32c38a85ad4.aspx</guid>
      <link>http://www.bluebytesoftware.com/blog/2008/03/28/ConcurrencyorientedCodeReviews.aspx</link>
      <pubDate>Fri, 28 Mar 2008 17:04:50 GMT</pubDate>
      <description>&lt;p&gt;
   We take code reviews very seriously in our group.&amp;nbsp; No checkin is ever made without
   a peer developer taking a close look.&amp;nbsp; (Incubation projects are often treated
   differently than product work, because the loss of agility is intolerable.)&amp;nbsp;
   A lot of this is done over email, but if there’s anything that is unclear from just
   looking at the code, a face to face review is done.&amp;nbsp; Feedback ranges from consistency
   (with guidelines and surrounding code), finding errors or overlooked conditions, providing
   suggestions on how to more clearly write something, comments, etc.; this ensures that
   our codebase is always of super high quality.
&lt;/p&gt;
&lt;p&gt;
   Concurrency adds some complexity to development, and requires special consideration
   during code reviews.&amp;nbsp; I thought I’d put some thoughts on paper about what I look
   for during concurrency-oriented code reviews, in hopes that it will be useful to anybody
   starting to sink their teeth into concurrency; it may also help you devise your own
   internal review guidelines.&amp;nbsp; Most of this advice just comes down to knowing a
   laundry list of best practices, but a lot of it is also knowing what to look for and
   where to spend your time during a review.
&lt;/p&gt;
&lt;p&gt;
   (A couple years ago I wrote a lengthy “Concurrency and its impact on reusable libraries”
   essay which provides a lot of the motivation behind what I look for.&amp;nbsp; It’s up
   on my blog, &lt;a href="http://www.bluebytesoftware.com/blog/2006/10/26/ConcurrencyAndTheImpactOnReusableLibraries.aspx"&gt;http://www.bluebytesoftware.com/blog/2006/10/26/ConcurrencyAndTheImpactOnReusableLibraries.aspx&lt;/a&gt;,
   and (though slightly out of date) I’m revising it for an Appendix in &lt;a href="http://www.bluebytesoftware.com/books/winconc/winconc_book_resources.html"&gt;my
   upcoming book&lt;/a&gt;.&amp;nbsp; If you question why I believe something, chances are that
   this document will explain my rationale.&amp;nbsp; And it’s far more complete than this
   short essay; I’ve only hit the high points here.)
&lt;/p&gt;
&lt;p&gt;
   &lt;strong&gt;Getting started&lt;/strong&gt;
&lt;/p&gt;
&lt;p&gt;
   I first review the code in a traditional sequential code review fashion.&amp;nbsp; When
   doing this, I earmark all state that I see as either “private” (aka isolated) or “shared”.&amp;nbsp;
   I then go back and closely review all state that is shared (accessible from many threads)
   with a fine-tooth comb.&amp;nbsp; Sometimes I’ll do this during my first pass through,
   but I usually find it helpful to understand the algorithmic structure of the changes
   first before fully developing an understanding of the concurrency parts.
&lt;/p&gt;
&lt;p&gt;
   Changes to existing code should be reviewed just as carefully (if not more carefully)
   as new code.&amp;nbsp; Concurrency behavior is subtle, and it’s very easy to accidentally
   violate some unchecked assumption the code was previously making.&amp;nbsp; Liberal use
   of asserts is therefore very important.&amp;nbsp; Sadly many of the conditions code assumes
   are simply unassertable (like “object X isn’t shared”).&lt;br&gt;
   I easily spend about 2x the amount of time reviewing the concurrency aspects of the
   code than usual sequential aspects.&amp;nbsp; Perhaps more.&amp;nbsp; This extra time is OK,
   because the concurrency portion is far smaller (in lines of code) than the sequential
   portion in most of the code I review.&amp;nbsp; There are obvious exceptions to this rule,
   especially since I’m on a team building low-level primitive data types whose sole
   purpose in life is to be used in concurrent apps.
&lt;/p&gt;
&lt;p&gt;
   &lt;strong&gt;Shared state and synchronization&lt;/strong&gt;
&lt;/p&gt;
&lt;p&gt;
   Some state, although shared, is immutable (read-only) and can be safely shared and
   read from concurrently.&amp;nbsp; Often this is by construction (e.g. immutable value
   types) but sometimes this is by loose convention (e.g. a data structure is immutable
   for some period of time, simply by virtue of no threads actively writing to it).&amp;nbsp;
   Both should be clearly documented in the code.&lt;br&gt;
   Once mutable shared state is identified, I look for two major things:
&lt;/p&gt;
&lt;ol&gt;
   &lt;li&gt;
      When does it become shared, i.e. publicized, and what is the protocol for the transfer
      of ownership?&amp;nbsp; Is it done cleanly?&amp;nbsp; And is it well documented? 
   &lt;/li&gt;
   &lt;li&gt;
      When does it once again become private again, if ever?&amp;nbsp; And is this documented
      too?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;
   Ideally all shared state would have clean ownership transitions.&amp;nbsp; Any state that
   is disposable necessarily must have a point at the end of its life where it has a
   single owner, so it can be safely disposed of (unless ref-counting is used).&amp;nbsp;
   But for most state the line will be extremely blurry and unenforced.&amp;nbsp; Comments
   should be used to clarify, in gory detail.&amp;nbsp; I also tend to prefix names of variables
   that refer to shared objects with the word ‘shared’ itself, so that they jump out.
&lt;/p&gt;
&lt;p&gt;
   Many, many bugs arise from some code publicizing some state, sometimes by accident,
   and then continuing to treat it as though it is private.&amp;nbsp; It is also sometimes
   tricky to determine this precisely, since sharing can be modal.&amp;nbsp; A list data
   structure may be shared in some contexts but not others.&amp;nbsp; Knowing what its sharing
   policy is requires transitive knowledge of callers.&amp;nbsp; Building up this level of
   global understanding can involve a fair bit of simply sitting back and reading and
   rereading the code over and over again.
&lt;/p&gt;
&lt;p&gt;
   Once the policy around sharing a piece of state is known, it is crucial to understand
   the intended synchronization policy for that data.&amp;nbsp; Is it protected by a fine-grained
   monitor?&amp;nbsp; Is it manipulated and read in a lock-free way?&amp;nbsp; And so on.&amp;nbsp;
   And once the intended policy is known, is the actual policy implemented what was intended?
&lt;/p&gt;
&lt;p&gt;
   While this part is extremely important, by the way, I have to admit that I feel this
   aspect tends to overshadow other things in conversation.&amp;nbsp; This is probably because
   it’s the most obvious thing to look for.&amp;nbsp; Sadly the world of concurrency is far
   more subtle than this.&amp;nbsp; I’ve honestly found more bugs resulting from failing
   to identify shared state properly than resulting from failing to implement the synchronization
   logic itself properly.&amp;nbsp; Your mileage will of course vary.
&lt;/p&gt;
&lt;p&gt;
   &lt;strong&gt;Locks&lt;/strong&gt;
&lt;/p&gt;
&lt;p&gt;
   I treat lock-based code and lock-free as two entirely separate beasts.&amp;nbsp; I spend
   about 5x the time reviewing lock-free code when compared to lock-based code.&amp;nbsp;
   There is a tax to having lock-free code in any codebase, so as you are reviewing it,
   also ask yourself: is there a better (or almost-as-good) way that this could have
   been done using locks?&amp;nbsp; Often the answer is no, due to the benefits lock-freedom
   brings (no thread can indefinitely starve another). 
&lt;/p&gt;
&lt;p&gt;
   But if the answer is yes, that the code could be written more clearly using locks,
   you could save your team a lot of time by convincing the author to change his or her
   mind.&amp;nbsp; Not only is lock-free code far more difficult to write and test, it carries
   a large tax during long stress hauls and end-game bug-fixing, an important and time-sensitive
   period in the development lifecycle of any commercial software product.&amp;nbsp; Maintaining
   lock-free code also carries an extra long-term cost, particularly when ramping up
   new hires on it.&amp;nbsp; All of this risks interfering with your ability to work on
   cool new features at some point.&amp;nbsp; Don’t feel bad about pushing back on this one.&amp;nbsp;
   Hard.
&lt;/p&gt;
&lt;p&gt;
   Carefully review what happens inside of a lock region.&amp;nbsp; Look at every single
   line with scrutiny.
&lt;/p&gt;
&lt;ul&gt;
   &lt;li&gt;
      Lock hold times should be as short as possible.&amp;nbsp; Hold times should be counted
      in dozens or hundreds of cycles, not thousands (unless absolutely unavoidable). 
   &lt;/li&gt;
   &lt;li&gt;
      If lock hold times are in the dozens, you can consider using a spin-lock. 
   &lt;/li&gt;
   &lt;li&gt;
      Recursive lock acquisitions are strongly discouraged.&amp;nbsp; If it can happen, did
      the developer clearly intend it to happen?&amp;nbsp; Or is this possibility accidental?&amp;nbsp;
      Point it out to them.&amp;nbsp; Also, are there any unexpected points at which reentrancy
      can occur?&amp;nbsp; E.g. any APC or message-pumping waits?&amp;nbsp; If yes, is there a way
      to avoid that by simple restructuring of the code? 
   &lt;/li&gt;
   &lt;li&gt;
      Dynamic method calls via delegates or virtual methods while a lock is held should
      be as rare as possible.&amp;nbsp; Method calls under a lock to user-supplied code should
      only ever happen if the concurrency behavior is clearly documented and specified for
      the user, and when invariants hold.&amp;nbsp; All of these cases can lead to reentrancy.&amp;nbsp;
      Often this requires special code to detect the reentrancy and respond intelligently. 
   &lt;/li&gt;
   &lt;li&gt;
      Lock regions should usually not span multiple methods: for example, acquiring the
      lock in one method, returning, and having the caller