Personal Info:

Joe works on parallel libraries, infrastructure, and programming models in Microsoft's Developer Division.

Disclaimer:
The contents of this site are my own personal opinions and do not represent my employer's view in any way.

© 2008, Joe Duffy

 
 Monday, January 22, 2007

I was recently asked by a customer how to guarantee alignment of CLR data on 16-byte boundaries.  They needed this capability to interoperate with code that manipulates the data using SSE vector instructions, which require 16-byte alignment.  The bad news is that there’s no really good way of doing this.  That is, there isn’t any “align at N bytes” feature for the CLR in which type layout and stack and heap allocation cooperate.  The good news is that you can fake it.

(I spoke about alignment with respect to atomic cmpxchg8b instructions previously, right here, for those interested in reading about that too.)

The details of how to go about ensuring 16-byte alignment depend on whether you allocate your data on the stack or the GC heap.  For illustration purposes, imagine we’re dealing with an array of float32s.  We’d like to ensure that its first element lies on a 16-byte boundary:

  1. float [] a0 = new float[N]; // GC-allocated array of N floats
  2. float * a1 = stackalloc float[N]; // stack-allocated array of N floats

If you use the former, GC allocation (1), you’re going to have a really tough time.  The GC moves objects around on you as it performs compactions, and only aligns the 1st element of the array on a 4-byte boundary.  So even if you manage to get your object allocated on a 16-byte boundary (by chance), it is apt to move during a subsequent GC.

To solve this problem, you’d have to pin the object.  Pinning causes GC fragmentation, so I really encourage you to avoid this approach and go with stack allocation, (2), if you can afford it.  A float[] on the stack is similarly aligned to begin at a 4-byte boundary, but, unlike (1), it will subsequently not move around.  Of course stack allocation is often impossible, or difficult, if you are writing a reusable library that may be called in an unknown context (where the caller may have very little stack left).  This is a tradeoff you would have to make.  If the pinning is very short lived, i.e. the duration of a single function call, it might be tolerable for you, a la P/Invoke.

Regardless of whether you choose (1) with pinning or (2) by itself, you’ve now got a stable address.  You can use it to calculate the first 16-byte-aligned address at or after the base address, and then treat that as the start of the array.  You will need some extra padding at the end for the worst case: because the base is 4-byte aligned, the next 16-byte boundary is at most 12 bytes away, so you need to allocate 3 extra floats in the array.  Here’s an example:

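static unsafe void * AlignUp(void * p, ulong alignBytes) {
    // alignBytes must be a power of two (e.g. 16).
    ulong addr = new UIntPtr(p).ToUInt64();
    ulong mask = alignBytes - 1; // low-order bits to clear
    // Largest address a pointer can hold on this platform (32- or 64-bit).
    ulong maxAddr = (IntPtr.Size == 8) ? ulong.MaxValue : uint.MaxValue;
    if (addr > maxAddr - mask) throw new OverflowException("overflow");
    ulong newAddr = (addr + mask) & ~mask; // round up to the next boundary
    return new UIntPtr(newAddr).ToPointer();
}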

float * p = stackalloc float[N + 3]; // 3 extra floats = 12 bytes of worst-case padding
p = (float *)AlignUp(p, 16);
… use p …
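
The same trick applies to option (1) if you pin the array for the duration of the call.  Here is a minimal sketch of that variant, assuming the pin is short-lived; the class and method names (PinnedAlignment, SumAligned) are made up for illustration, and the loops stand in for whatever SSE-based code would actually consume the pointer.  It must be compiled with unsafe code enabled.

```csharp
using System;

static class PinnedAlignment
{
    // Rounds addr up to the next multiple of alignBytes (which must be a power of two).
    static ulong AlignUp(ulong addr, ulong alignBytes)
    {
        ulong mask = alignBytes - 1;
        return checked(addr + mask) & ~mask; // throws OverflowException on wraparound
    }

    // Fills and sums n floats that start on a 16-byte boundary inside a pinned GC array.
    public static unsafe float SumAligned(int n)
    {
        float[] buffer = new float[n + 3];   // 3 extra floats = 12 bytes of worst-case padding
        fixed (float* pBase = buffer)        // the array stays pinned only inside this block
        {
            float* p = (float*)AlignUp((ulong)pBase, 16);
            for (int i = 0; i < n; i++) p[i] = i;   // placeholder for the real aligned work
            float sum = 0;
            for (int i = 0; i < n; i++) sum += p[i];
            return sum;
        }
    }
}
```

The aligned pointer is only valid inside the fixed block; once the block exits, the GC is free to move the array again.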

Note that if you were to use an array of doubles instead, you’d have some challenges.  That’s because an 8-byte value on the 32-bit CLR is only 4-byte aligned, and therefore you can end up with a situation where the next 16-byte boundary falls in the middle of a single element.  For example, if the array begins at address 12, its elements sit at addresses 12, 20, 28, 36, and so on.  None of these are 16-byte aligned.  Not that it really matters, so long as you allocate enough memory, but you will need to do some casting of the array pointer, as shown in the above code, to do the arithmetic.
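
One way to sidestep the element-boundary problem for doubles is to allocate raw bytes, align, and only then cast to double*.  The following is a sketch under that assumption, not the only way to do it; the names (AlignedDoubles, Demo) are hypothetical.  It pads by 15 bytes rather than 12 so that it makes no assumption about how the stack buffer itself is aligned, and it omits the overflow check for brevity.

```csharp
using System;

static class AlignedDoubles
{
    // Returns true if n doubles were placed on a 16-byte boundary inside the buffer.
    public static unsafe bool Demo(int n)
    {
        int bytes = n * sizeof(double) + 15;   // 15 bytes covers any starting address
        byte* raw = stackalloc byte[bytes];

        // Round the byte address up to the next multiple of 16, then cast.
        ulong addr = ((ulong)raw + 15UL) & ~15UL;
        double* p = (double*)addr;

        for (int i = 0; i < n; i++) p[i] = i * 0.5;   // placeholder for the real aligned work

        bool aligned = (ulong)p % 16 == 0;
        bool inBounds = (byte*)(p + n) <= raw + bytes;
        return aligned && inBounds;
    }
}
```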

Note also that there’s a StructLayout attribute that allows you to specify alignment, through its Pack field, but sadly this doesn’t impact the GC’s heap or the JIT’s stack alignment, and so it’s useless for our purposes.  Though the relative alignment of fields within the data structure will be correct, the absolute alignment of the structure itself is not guaranteed to be so.
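
To make the relative-versus-absolute distinction concrete, here is a small sketch that inspects field offsets with Marshal.OffsetOf.  The type names (Packed, LayoutDemo) are invented for the example.  Pack changes where Value sits relative to the start of the struct, but nothing in the attribute says where the struct itself will land in memory.

```csharp
using System;
using System.Runtime.InteropServices;

// Pack = 1 removes the padding normally inserted before Value.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct Packed
{
    public byte Tag;
    public double Value;
}

static class LayoutDemo
{
    // Offset of Value from the start of the struct: a relative quantity only.
    public static int ValueOffset()
    {
        return (int)Marshal.OffsetOf(typeof(Packed), "Value");
    }

    // Total marshaled size of the struct with Pack = 1.
    public static int PackedSize()
    {
        return Marshal.SizeOf(typeof(Packed));
    }
}
```

With Pack = 1, ValueOffset() is 1 and PackedSize() is 9.  Neither number tells you anything about the address at which an instance is allocated.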

OK, so I know all of this isn’t pretty.  But it works.

 
