RSS 2.0

Personal Info:

Joe Send mail to the author(s) works on parallel libraries, infrastructure, and programming models in Microsoft's Developer Division.

Blogroll:
Other
News
 C|Net
 Kuro5hin
 The Register
Technology
 <?xmlhack?>
 Daily WTF
 DevX
 Hacknot
 Java Today
 Microsoft Top 10 Downloads
 MSDN
 MSDN: "Longhorn"
 MSDN: XML Developer Center
 Slashdot
 Techdirt
 theserverside.com
 W3C
 Web Pages That Suck
 XML Cover Pages
 XML Journal
 xml.com
Technology Blogs
 Aaron Skonnard [PluralSight]
 Adam Bosworth [Google]
 Andy Rich [MS/C++]
 Arpan Desai [MS/XML]
 BCL Team [MS]
 Bill Clementson [Lisp]
 Bill de hÓra
 Bruce Eckel [J]
 Bruce Tate [J]
 Casey Chestnut
 Cedric Beust [Google]
 Chris Anderson [MS/Avalon]
 Chris Lyon [MS]
 Christian Weyer
 Clemens Vasters [newtelligence]
 Craig Andera [PluralSight]
 Dan Sugalski [Parrot]
 Daniel Cazzulino
 Dave Chappel
 Dave Roberts [Lisp]
 Dave Thomas [PragProg]
 Dave Winer
 Dion Almaer [J]
 Don Demsak
 Doug Purdy [MS/Indigo]
 Drew Marsh
 Eric Gunnerson [MS]
 Eric Rudder [MS]
 Eric Sink
 Fritz Onion [PluaralSight]
 Gavin King [J/Hibernate]
 Grady Booch [IBM]
 Hervey Wilson [MS/Indigo]
 Hillel Cooperman [MS/Shell]
 Howard Lewis Ship [J/Apache]
 Ingo Rammer [PluralSight]
 James Gosling [J/Sun]
 James Strachan [J/Groovy]
 Jason Matusow [MS/OSS]
 Jeffrey Schlimmer [MS/Indigo]
 Joe Beda [Google]
 Joel Spoelsky
 Jon Udell
 Josh Ledgard [MS/Evang]
 Joshua Allen [MS]
 Lambda
 Larry Osterman [MS]
 Maoni Stephens [MS/CLR]
 Mark Fussell [MS/XML]
 Martin Fowler
 Martin Gudgin [MS/Indigo]
 Me
 Michael Howard [MS]
 Miguel de Icaza [Mono]
 Mike Clark
 Omri Gazitt [MS/Indigo]
 Pat Helland [MS/PAG]
 Pinku Surana
 Raymond Chen [MS]
 Rich Lander [MS/CLR]
 Rob Howard
 Rob Relyea [MS/Avalon]
 Robert Cringely
 S. Somasegar [MS/DevDiv]
 Sam Gentile
 Scoble [MS/Evang]
 Scott Guthrie [MS/WebNet]
 Scott Hanselman
 Sean McGrath [J]
 Simon Fell
 Stanley Lippman [MS/C++]
 Steve Maine
 Steve Swartz [MS/Indigo]
 Steve Vinoski
 Steven Clarke [MS/Usability]
 Stuart Halloway
 Ted Leung
 Ted Neward [DM]
 Tim Bray [Sun]
 Tim Ewald [Mindreef]
 Tim O'Reilly
 Werner Vogels [Amazon]
 Wintellect
 Yasser Shohoud [MS/Indigo]
Top 20
 Brad Abrams [MS/CLR]
 Chris Brumme [MS/CLR]
 Chris Sells [MS/Ultra]
 Cyrus Najmabadi [MS/C#]
 Dominic Cooney [MS/XAF]
 Don Box [MS/Ultra]
 Don Syme [MS/R]
 Guido van Rossum [Python]
 Herb Sutter [MS/C++]
 Ian Griffiths
 Jason Zander [MS/CLR]
 Jim Hugunin [MS/CLR]
 Joel Pobar [MS/CLR]
 Krzysztof Cwalina [MS/CLR]
 Patrick Logan
 Paul Graham
 Rico Mariani [MS/CLR]
 Rory Blyth [MS/DN]
 Sam Ruby
 Wesner Moise
VC/Business Blogs
 Ed Sim
 Fred Wilson
 Jonathan Schwartz [J/Sun]
 Lawrence Lessig [Stanford]
 Mark Cuban
 Michael Hyatt
 Pierre Omidyar
 Ross Mayfield
 VentureBlog
 Weekly Read
Wine, Food & Tea
 The Silk Road of Wine
 Vinography: a wine blog
 Wine Whys

Disclaimer:
The content of this site are my own personal opinions and do not represent my employer's view in anyway.

© 2008, Joe Duffy

 
 Friday, June 13, 2008

We had an interesting debate at a Parallel Extensions design meeting yesterday, where I tried to convince everybody that a full fence on SpinLock exit is not a requirement.  We currently offer an Exit(bool) overload that accepts a flushReleaseWrite argument.  This merely changes the lock release from

m_state = 0;

to

Interlocked.Exchange(ref m_state, 0);

The main purpose of this is to announce “availability” of the locks to other processors.  More specifically, it ensures that before the current processor is able to turn around and reacquire the lock in its own private cache, that other processors at least have the opportunity to see the write.  This is a fairness optimization, and avoiding the CAS on release halves the number of CAS operations necessary (which are expensive), so we would generally like to avoid superflous ones.  It turns out you could easily do this without our help.  Instead of

slock.Exit(true);

you could say

slock.Exit();
Thread.MemoryBarrier();

Most of the debate about whether the default Exit should use a fence centered around confusion over the strength of volatile vs. a full fence.  For example, the C# documentation for volatile is highly misleading (http://msdn.microsoft.com/en-us/library/x13ttww7(VS.71).aspx):

The volatile modifier is usually used for a field that is accessed by multiple threads without using the lock statement to serialize access. Using the volatile modifier ensures that one thread retrieves the most up-to-date value written by another thread.

The confusion is over the “ensures that one thread receives the most up-to-date value written by another thread” part.  Technically this is somewhat-accurate, but is worded in a very funny and misleading way.  To see why, let’s take a step back and consider what volatile actually means in the CLR’s memory model (MM) for a moment, to set context.  Note that I did my best to concisely summarize the MM here: http://www.bluebytesoftware.com/blog/2007/11/10/CLR20MemoryModel.aspx.

Volatile on loads means ACQUIRE, no more, no less.  (There are additional compiler optimization restrictions, of course, like not allowing hoisting outside of loops, but let’s focus on the MM aspects for now.)  The standard definition of ACQUIRE is that subsequent memory operations may not move before the ACQUIRE instruction; e.g. given { ld.acq X, ld Y }, the ld Y cannot occur before ld.acq X.  However, previous memory operations can certainly move after it; e.g. given { ld X, ld.acq Y }, the ld.acq Y can indeed occur before the ld X.  The only processor Microsoft .NET code currently runs on for which this actually occurs is IA64, but this is a notable area where CLR’s MM is weaker than most machines.  Next, all stores on .NET are RELEASE (regardless of volatile, i.e. volatile is a no-op in terms of jitted code).  The standard definition of RELEASE is that previous memory operations may not move after a RELEASE operation; e.g. given { st X, st.rel Y }, the st.rel Y cannot occur before st X.  However, subsequent memory operations can indeed move before it; e.g. given { st.rel X, ld Y }, the ld Y can move before st.rel X.  (I used a load since .NET stores are all release.)  Note that RELEASe is the opposite of ACQUIRE: you can think of an acquire as a one-way fence that prohibits passes downward, and a release as a one-way fence that prohibits passes upward.  A full fence prohibits both (lock acquire, XCHG, MB, etc).

Note one very interesting thing in this discussion: a release followed by an acquire, given the above rules, does not prohibit movement of the instructions with respect to one another!  Given { st.rel X, ld.acq Y }, even though they are both volatile (i.e. acquire and release), so long as X!=Y, it is perfectly legal for the ld.acq Y to move before st.rel X.  We aren’t limited to single instructions either, e.g. { st.rel X, ld.acq A, ld.acq B, ld.acq C }, all three loads (A, B, C) may indeed happen before the X.  This occurs with regularity in practice, on X86, X64, and IA64, because of store buffering.  It would just be too costly to hold up loads until a store has reached all processors.  Superscalar execution is meant to hide such latencies.

(As an aside, many people wonder about the difference between loads and stores of variables marked as volatile and calls to Thread.VolatileRead and Thread.VolatileWrite.  The difference is that the former APIs are implemented stronger than the jitted code: they achieve acquire/release semantics by emitting full fences on the right side.  The APIs are more expensive to call too, but at least allow you to decide on a callsite-by-callsite basis which individual loads and stores need the MM guarantees.)

I have to admit the store buffer problem is mostly theoretical.  It rarely comes up in practice.  That said, on a system which permits load reordering, imagine:

Initially: X = Y = 0

T0                       T1
X = 5; // st.rel         while (X == 0) ; // ld.acq
while (Y == 0) ; // ld   X = 0; // st.rel
A = X; // ld.acq         Y = 5; // st.rel

After execution, is it possible that A == 5?

If the read of Y is non-volatile on T0 (which would be bad because a compiler may hoist it out of the loop, but ignore compilers for a moment), then the fact that the subsequent read of X is volatile does not save us from a reordering leading to A == 5.  This is the { ld, ld.acq } case described earlier.  Why might this physically occur?  Well, it won’t happen on X86 and X64 because loads are not permitted to reorder.  However!!  IA64 permits non-acquire loads (non-volatile) to reorder, and so the A = X may actually be satisfied out of the write buffer before the store even leaves the processor.  It’s as though the program became:

T0                       T1
X = 5; // st.rel         while (X == 0) ; // ld.acq
A = X; // ld.acq         X = 0; // st.rel
while (Y == 0) ; // ld   Y = 5; // st.rel

Whoops!  This should make it apparent that this outcome is indeed a real possibility.  And clearly it may cause bugs.

Note 6/13/08: Eric pointed out privately that compilers need only respect the CLR MM, and can freely reorder loads.  Thus, this problem may actually arise on non-IA64 machines.  Of course he is entirely correct.  It was silly of me to overlook that.

All that said, let’s get back to the original concern about visibility of writes.  This issue doesn’t even really involve reordering.  Imagine one processor continuously executes a stream of lock acquires and releases, and that the stream goes on indefinitely (perhaps because it’s in a loop):

while (Interlocked.CompareExchange(ref m_state, 1, 0) != 0) ;
m_state = 0;
while (Interlocked.CompareExchange(ref m_state, 1, 0) != 0) ;
m_state = 0;

The Interlocked operation acquires the cache line in X mode.  After it executes, other processors will notice that the lock is taken.  But right away, the processor writes 0 to the line without a fence, and immediately goes on to execute another acquire.  It is highly likely that the line will be marked dirty in the processor’s cache by the time that it acquires it in X mode again, something that the cache coherency system makes very cheap.  In fact, the write of m_state = 0 probably hasn’t left the write buffer yet due to latency.

So before another processor can even see m_state as 0, the processor will have already gotten around to taking the lock again.  Even for volatile loads and stores, there is no MM guarantee that writes will leave the processor immediately; hence the documentaiton earlier is slightly confusing; yes, the processor doing a volatile read will see the “most recent” value, but that “most recent” value (a) may be satisfied out of the local write buffer, and (b) may simply not have the ability to observe writes that occurred in practice due to the above timeliness issue. 

6/13/2008 12:52:45 PM (Pacific Daylight Time, UTC-07:00)  #    Comments [5]

 

Recent Entries:

Search:

Browse by Date:
<June 2008>
SunMonTueWedThuFriSat
25262728293031
1234567
891011121314
15161718192021
22232425262728
293012345

Browse by Category:

Notables:

Currently Up To:

Reading...

Listening...

Watching...