Personal Info:

Joe is a lead architect on an OS incubation project at Microsoft, and was the architect for Parallel Extensions to .NET. He is an author and frequent speaker.

Disclaimer:
The contents of this site are my own personal opinions and do not represent my employer's views in any way.

© 2010, Joe Duffy

 
 Wednesday, December 06, 2006

I took this past week off so that I could work on my book.  Well, I'm happy to report that I've been successfully writing like a madman, averaging around 15-20 solid pages per day.  I still have a long way to go, but I'm getting more confident with the passing of each day that this book will be...  well...  a book that I'd actually like to sit down and read.

[Update: 12/7/2006: Correction made -- the CLR's JITs do not generate manual alignment code.  Instead, we defer to the costly OS handler for alignment fixups.]

In the process of writing the section on data alignment, I realized there is very little documentation on the alignment policy used by the CLR.  This is in contrast to Kang Su Gatlin's wonderful MSDN treatise on the subject for VC++, which leaves absolutely nothing hidden in the closet.  Well, I still don't have all of the answers for you.  Sorry.  You'll have to wait for the book.  But in the meantime, I've discovered that there's a myth that deserves a little debunking.

In the MSDN documentation for InterlockedCompareExchange64, it says:

"The variables for this function must be aligned on a 64-bit boundary; otherwise, this function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems."

I've also heard and read this from various other sources.  I've heard, for example, that LOCK CMPXCHG8B will still do a load/compare/store sequence, but that, if the address isn't 8-byte aligned, the instruction will not be atomic.  This would lead to sporadic atomicity failures, probably even more difficult to track down than a typical race.  Given that the CLR doesn't faithfully align 64-bit data types on 8-byte boundaries (as we'll see momentarily), I suddenly feared that Interlocked.CompareExchange(ref Int64, ...) was HORRIBLY broken.  Without an MP machine at home, I couldn't test this out, so I decided to do a little digging.
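For readers who haven't used the API in question, here is a minimal sketch of a typical 64-bit compare-and-swap loop in C#.  The CasCounter class and its members are hypothetical names made up for illustration; only Interlocked.CompareExchange and Interlocked.Read are real APIs.  Note that the field here is an ordinary static, so this shows the usage pattern only and says nothing about the misaligned case.

```csharp
using System;
using System.Threading;

// Hypothetical illustration: a counter incremented with a 64-bit CAS loop.
static class CasCounter {
    static long s_value;

    // Atomically adds 1, retrying whenever another thread changed the value in between.
    static void Increment() {
        long seen, updated;
        do {
            seen = Interlocked.Read(ref s_value);   // atomic 64-bit read, even on 32-bit
            updated = seen + 1;
            // Stores 'updated' only if s_value still equals 'seen';
            // returns whatever value it actually found there.
        } while (Interlocked.CompareExchange(ref s_value, updated, seen) != seen);
    }

    // Runs 'threads' workers doing 'iterations' increments each; returns the final count.
    public static long Run(int threads, int iterations) {
        s_value = 0;
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() => {
                for (int i = 0; i < iterations; i++) Increment();
            });
            workers[t].Start();
        }
        foreach (Thread w in workers) w.Join();
        return Interlocked.Read(ref s_value);
    }

    static void Main() {
        // If the compare-exchange were not atomic, some increments could be lost.
        Console.WriteLine(Run(4, 100000));
    }
}
```

If the compare-exchange really did lose atomicity, the symptom would be a final count lower than threads times iterations, and only occasionally, which is exactly why such a bug would be so hard to track down.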

In the manuals for many AMD processors and older Intel X86 processors, I found no reference to CMPXCHG8B requiring an aligned address.  What I did find, however, in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A (System Programming Guide, Part 1) was the following (emphasis mine):

"The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommended that locked accesses be aligned on their natural boundaries for better system performance:

  • Any boundary for an 8-bit access (locked or otherwise).
  • 16-bit boundary for locked word accesses.
  • 32-bit boundary for locked doubleword accesses.
  • 64-bit boundary for locked quadword accesses."

If I'm reading that right, this means the common wisdom around 8-byte alignment and LOCK CMPXCHG8B is hogwash.  (Sadly, proving the absence of some flaky processor that crashes or has unpredictable behavior under certain circumstances is rather difficult, especially if someone at some point thought it was true enough to put it in the MSDN documentation.  If somebody out there knows of a real case -- and it's not just hearsay -- please let me know!)  If this is true of all X86 processors, it means that Interlocked.CompareExchange(ref Int64, ...) isn't horribly broken on the CLR after all.  (Yaay.)  It would have been broken...  because, as I said earlier, the CLR does NOT align 64-bit values on 8-byte address boundaries...

Conversing briefly with Simon Hall over email, the dev that owns most (all?) of the type layout infrastructure, I've concluded the following:  CLR type layout tries to eliminate all misaligned data layout through a combination of padding and field reordering.  This means that data of >= 8 bytes on 64-bit always begins on 8-byte boundaries, and data of >= 4 bytes on 32-bit always begins (at least) on 4-byte boundaries.  I say "at least" because empirical evidence shows that type layout actually aligns many 8-byte fields on 8-byte boundaries, even on 32-bit.  (It turns out this doesn't matter much...  neither the 32-bit JIT nor the GC respect this when allocating data.)  In summary, the CLR ensures that no field that could have fit inside a single 4/8-byte segment ever spills across a boundary.  The CLR also adds necessary padding to StructLayout(Sequential) types, while still preserving the original field ordering.

Therefore, the only cases where we end up with truly misaligned data are StructLayout(Explicit) and StructLayout(Pack=...) types.  For example, the simple struct, [StructLayout(LayoutKind.Explicit)] struct S { [FieldOffset(6)] int i; }, will always be misaligned, on 32- and 64-bit alike.  In such cases, our JIT simply generates the naive code and lets the OS perform misalignment fixups.  This is actually rather costly, as Kang Su's aforementioned article explains.  We could have, like the VC++ compiler, generated the manual alignment code using a combination of loads and shifts, but my guess is that most of our customers don't care and will never notice.
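To make the difference between the two layout kinds concrete, here is a small sketch that asks the runtime for field offsets.  The type names Misaligned and Padded and the LayoutOffsets helper are hypothetical.  One assumption worth flagging: Marshal.OffsetOf reports the marshaled (unmanaged) layout, which I'd expect to match the managed layout for simple blittable structs like these, but I haven't verified that on every platform.

```csharp
using System;
using System.Runtime.InteropServices;

// Explicit layout: the int is forced to offset 6, so it straddles a 4-byte boundary.
[StructLayout(LayoutKind.Explicit)]
struct Misaligned {
    [FieldOffset(6)] internal int i;
}

// Sequential layout: field order is preserved and the runtime inserts any padding.
[StructLayout(LayoutKind.Sequential)]
struct Padded {
    internal int x;
    internal long y;
    internal byte z;
}

static class LayoutOffsets {
    // Returns the byte offset of the named field within the (marshaled) layout of 'type'.
    public static int OffsetOf(Type type, string field) {
        return Marshal.OffsetOf(type, field).ToInt32();
    }

    // True if a field of 'size' bytes at 'offset' starts on a multiple of its own size.
    public static bool IsNaturallyAligned(int offset, int size) {
        return offset % size == 0;
    }

    static void Main() {
        int i = OffsetOf(typeof(Misaligned), "i");
        Console.WriteLine("Misaligned.i at {0}, aligned: {1}",
            i, IsNaturallyAligned(i, sizeof(int)));

        foreach (string f in new[] { "x", "y", "z" })
            Console.WriteLine("Padded.{0} at {1}", f, OffsetOf(typeof(Padded), f));
    }
}
```

The explicit struct reports offset 6 for its int no matter where an instance is allocated, which is why no amount of care in the JIT or GC can rescue it.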

To preserve the hard work done by type layout, our JITs and the GC guarantee that all allocated data is aligned on at least 4-byte (on 32-bit) or 8-byte (on 64-bit) boundaries.  I say "at least" once again because I know, for example, that VC++ aligns stack frames on 16-byte boundaries for 64-bit.  I don't claim to understand why.  We might do something similar.

Here's a program that just prints out a few field addresses, and whether things are 8-byte aligned.  Notice that the int/long fields that are adjacent to one another are padded with 4 bytes in between on 32- and 64-bit, but that the JIT and GC only align on 4-byte addresses on 32-bit.  I presume this is so that the layout doesn't have to change between 32- and 64-bit, but I can't say for sure:

using System;
using System.Runtime.InteropServices;

class C {
    internal S s;
}

struct S {
    internal int x;
    internal long y;
    internal byte z;
}

unsafe class P {
    static void Main(string[] args) {
        int pad = 5;
        if (args.Length > 0) pad = int.Parse(args[0]);

        Console.WriteLine("Field\t[Begin\tEnd)\t%8");
        PrintStackS(pad);
        PrintHeapS(pad);
    }

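    // Allocates x ints on the stack before S, so different inputs can shift S's address.
    static void PrintStackS(int x) {
        int * pad = stackalloc int[x];
        S * s = stackalloc S[1];
        PrintAddr(s);
    }

    // Allocates x throwaway objects first, so that C can land at a different heap address.
    static void PrintHeapS(int x) {
        for (int i = 0; i < x; i++) new object();
        C c = new C();
        // Pin c so the GC can't move it while we inspect field addresses.
        fixed (S * pcs = &c.s) {
            PrintAddr(pcs);
        }
    }

    // Prints each field's [begin, end) address range and its begin address modulo 8.
    static unsafe void PrintAddr(S * ps) {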
        ulong xa = new UIntPtr(&ps->x).ToUInt64();
        Console.WriteLine("X\t{0:X}\t{1:X}\t{2}", xa, xa + sizeof(int), xa % 8);
        ulong ya = new UIntPtr(&ps->y).ToUInt64();
        Console.WriteLine("Y\t{0:X}\t{1:X}\t{2}", ya, ya + sizeof(long), ya % 8);
        ulong za = new UIntPtr(&ps->z).ToUInt64();
        Console.WriteLine("Z\t{0:X}\t{1:X}\t{2}", za, za + sizeof(byte), za % 8);
    }
}

Running it with a few different inputs yields these results:

C:\Temp>8by
Field   [Begin  End)    %8
X       12F440  12F444  0
Y       12F448  12F450  0
Z       12F450  12F451  0
X       1273670 1273674 0
Y       1273678 1273680 0
Z       1273680 1273681 0

C:\Temp>8by 2
Field   [Begin  End)    %8
X       12F44C  12F450  4
Y       12F454  12F45C  4
Z       12F45C  12F45D  4
X       1273664 1273668 4
Y       127366C 1273674 4
Z       1273674 1273675 4

If the CLR ever decides to support a 128-bit CAS operation, Interlocked.CompareExchange(ref Int128, ...), which I hope we will, we would need to guarantee alignment on 16-byte boundaries.  In contrast to CMPXCHG8B, CMPXCHG16B does indeed fail when issued against an address that isn't 16-byte aligned.  Instead of failing silently, a GP fault is generated.  This is difficult, because not only must type layout respect the alignment (you can already get this with StructLayout(..., Pack=16)), but the JIT and the GC would also need to allocate correctly.  Or, of course, you could over-allocate a chunk of data and shift the start pointer to the first aligned address inside of it.  This might work for the stack, but for GC allocated data this is going to keep shifting around on you, and probably won't work very well.  Before the CLR supports Interlocked.CompareExchange(ref Int128, ...), however, I suppose we ought to provide an Int128.  :)
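The over-allocate-and-shift trick for the stack can be sketched roughly as follows.  This is only an illustration under my own assumptions: AlignedStackBuffer and AlignUp are hypothetical names, the alignment is assumed to be a power of two, and the code needs to be compiled with /unsafe.

```csharp
using System;

static unsafe class AlignedStackBuffer {
    // Rounds 'address' up to the next multiple of 'alignment' (must be a power of two).
    public static ulong AlignUp(ulong address, ulong alignment) {
        return (address + alignment - 1) & ~(alignment - 1);
    }

    static void Main() {
        const int Size = 16;       // bytes we actually need
        const int Alignment = 16;  // boundary that CMPXCHG16B would require

        // Over-allocate by Alignment - 1 bytes so an aligned start always fits inside.
        byte* raw = stackalloc byte[Size + Alignment - 1];
        byte* aligned = (byte*)AlignUp((ulong)raw, Alignment);

        Console.WriteLine("raw %16 = {0}, aligned %16 = {1}",
            (ulong)raw % 16, (ulong)aligned % 16);
    }
}
```

This only holds up because stack memory doesn't move.  Doing the same inside a GC-allocated buffer would require keeping the object pinned for as long as the aligned pointer is in use, since a compacting GC could otherwise relocate the buffer to an address with a different remainder.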

 
