RSS 2.0

Personal Info:

Joe Send mail to the author(s) works on parallel libraries, infrastructure, and programming models in Microsoft's Developer Division.

Blogroll:
Other
News
 C|Net
 Kuro5hin
 The Register
Technology
 <?xmlhack?>
 Daily WTF
 DevX
 Hacknot
 Java Today
 Microsoft Top 10 Downloads
 MSDN
 MSDN: "Longhorn"
 MSDN: XML Developer Center
 Slashdot
 Techdirt
 theserverside.com
 W3C
 Web Pages That Suck
 XML Cover Pages
 XML Journal
 xml.com
Technology Blogs
 Aaron Skonnard [PluralSight]
 Adam Bosworth [Google]
 Andy Rich [MS/C++]
 Arpan Desai [MS/XML]
 BCL Team [MS]
 Bill Clementson [Lisp]
 Bill de hÓra
 Bruce Eckel [J]
 Bruce Tate [J]
 Casey Chestnut
 Cedric Beust [Google]
 Chris Anderson [MS/Avalon]
 Chris Lyon [MS]
 Christian Weyer
 Clemens Vasters [newtelligence]
 Craig Andera [PluralSight]
 Dan Sugalski [Parrot]
 Daniel Cazzulino
 Dave Chappel
 Dave Roberts [Lisp]
 Dave Thomas [PragProg]
 Dave Winer
 Dion Almaer [J]
 Don Demsak
 Doug Purdy [MS/Indigo]
 Drew Marsh
 Eric Gunnerson [MS]
 Eric Rudder [MS]
 Eric Sink
 Fritz Onion [PluaralSight]
 Gavin King [J/Hibernate]
 Grady Booch [IBM]
 Hervey Wilson [MS/Indigo]
 Hillel Cooperman [MS/Shell]
 Howard Lewis Ship [J/Apache]
 Ingo Rammer [PluralSight]
 James Gosling [J/Sun]
 James Strachan [J/Groovy]
 Jason Matusow [MS/OSS]
 Jeffrey Schlimmer [MS/Indigo]
 Joe Beda [Google]
 Joel Spoelsky
 Jon Udell
 Josh Ledgard [MS/Evang]
 Joshua Allen [MS]
 Lambda
 Larry Osterman [MS]
 Maoni Stephens [MS/CLR]
 Mark Fussell [MS/XML]
 Martin Fowler
 Martin Gudgin [MS/Indigo]
 Me
 Michael Howard [MS]
 Miguel de Icaza [Mono]
 Mike Clark
 Omri Gazitt [MS/Indigo]
 Pat Helland [MS/PAG]
 Pinku Surana
 Raymond Chen [MS]
 Rich Lander [MS/CLR]
 Rob Howard
 Rob Relyea [MS/Avalon]
 Robert Cringely
 S. Somasegar [MS/DevDiv]
 Sam Gentile
 Scoble [MS/Evang]
 Scott Guthrie [MS/WebNet]
 Scott Hanselman
 Sean McGrath [J]
 Simon Fell
 Stanley Lippman [MS/C++]
 Steve Maine
 Steve Swartz [MS/Indigo]
 Steve Vinoski
 Steven Clarke [MS/Usability]
 Stuart Halloway
 Ted Leung
 Ted Neward [DM]
 Tim Bray [Sun]
 Tim Ewald [Mindreef]
 Tim O'Reilly
 Werner Vogels [Amazon]
 Wintellect
 Yasser Shohoud [MS/Indigo]
Top 20
 Brad Abrams [MS/CLR]
 Chris Brumme [MS/CLR]
 Chris Sells [MS/Ultra]
 Cyrus Najmabadi [MS/C#]
 Dominic Cooney [MS/XAF]
 Don Box [MS/Ultra]
 Don Syme [MS/R]
 Guido van Rossum [Python]
 Herb Sutter [MS/C++]
 Ian Griffiths
 Jason Zander [MS/CLR]
 Jim Hugunin [MS/CLR]
 Joel Pobar [MS/CLR]
 Krzysztof Cwalina [MS/CLR]
 Patrick Logan
 Paul Graham
 Rico Mariani [MS/CLR]
 Rory Blyth [MS/DN]
 Sam Ruby
 Wesner Moise
VC/Business Blogs
 Ed Sim
 Fred Wilson
 Jonathan Schwartz [J/Sun]
 Lawrence Lessig [Stanford]
 Mark Cuban
 Michael Hyatt
 Pierre Omidyar
 Ross Mayfield
 VentureBlog
 Weekly Read
Wine, Food & Tea
 The Silk Road of Wine
 Vinography: a wine blog
 Wine Whys

Disclaimer:
The content of this site are my own personal opinions and do not represent my employer's view in anyway.

© 2008, Joe Duffy

 
 Wednesday, May 30, 2007

Intel and AMD processors have had very limited support for SIMD computations in the form of MMX and SSE since the late 90s.  Though most programmers live in a MIMD-oriented world, SIMD programming had a surge in research interest in the 80s, and has remained promising for all those years, albeit a bit silently.  Vectorization is a fairly popular technique primarily in niche markets such as the FORTRAN and supercomputing communities.  Given the rise of GPGPU (see here, here, and here) and rumors floating about in the microprocessor arena, this is an interesting space to watch.

You can get at SSE from managed code, though it requires some hoop jumping and the interop overheads end up killing you.  Let's take a quick look at what it takes to use classic loop stripmining techniques for a pairwise multiplication of two arrays.

Since we can't access the SSE instructions directly in managed code, we need to first define a native DLL.  We'll call it 'vecthelp.dll' and it just exports a single function:

#include <xmmintrin.h>

const int c_vectorStride = 4;

extern "C" __declspec(dllexport)
void VectMult(float * src1, float * src2, float * dest, int length) {
    for (int i = 0; i < length; i += c_vectorStride) {
        // Vector load, multiply, store.
        __m128 v1 = _mm_load_ps(src1 + i); // MOVAPS
        __m128 v2 = _mm_load_ps(src2 + i); // MOVAPS
        __m128 vresult = _mm_mul_ps(v1, v2); // MULPS
        _mm_store_ps(dest + i, vresult); // MOVAPS
    }
}

'VectMult' takes two pointers to float arrays, 'src1' and 'src2', of size 'length', and does a pairwise multiplication, storing results into 'dest'.  It walks the array with a stride of 4.  On each iteration, it does a vector load using the SSE intrnsic '_mm_load_ps', which loads 4 contiguous floats from 'src1' and 'src2' into XMMx registers.  Then we multiply them via '_mm_mul_ps' which is a 4-way vector multiply (i.e. the multiplication for each pair occurs in parallel).  Lastly, we store the results back to the 'dest' array.  Note we naively assume the array's size is a multiple of 4.

To use this routine, we just need to P/Invoke.  Well, sadly we also need to do some tricky alignment since SSE demands 16 byte alignment.  As I've written before, this isn't easy to acheive on the CLR.  I've used stack allocation to avoid pinning the arrays, though clearly for large arrays this would easily lead to stack overflow.  It's just for illustration.

using System;

unsafe class Program {
    [System.Runtime.InteropServices.DllImport("vecthelp.dll")]
    private extern static void VectMult(float * src1, float * src2, float * dest, int length);

    public static void Main() {
        const int vecsize = 1024 * 16; // 16KB of floats.

        float * a = stackalloc float[vecsize + (16 / sizeof(float)) - 1];
        float * b = stackalloc float[vecsize + (16 / sizeof(float)) - 1];
        float * c = stackalloc float[vecsize + (16 / sizeof(float)) - 1];

        // To use SSE, we must ensure 16 byte alignment.
        a = (float *)AlignUp(a, 16);
        b = (float *)AlignUp(b, 16);
        c = (float *)AlignUp(c, 16);

        // Initialize 'a' and 'b':
        for (int i = 0; i < vecsize; i++) {
            a[i] = i;
            b[i] = vecsize - i;
        }

        // Now perform the multiplication.
        VectMult(a, b, c, vecsize);

        ... do something with c ...
    }

    private static void * AlignUp(void * p, ulong alignBytes) {
        ulong addr = (ulong)p;
        ulong newAddr = (addr + alignBytes - 1) & ~(alignBytes - 1);
        return (void *)newAddr;
    }
}

I wish I could report some stellar perf numbers, to the tune of the vector version being 4X faster than a non-vector equivalent.  Sadly the P/Invoke overheads kill perf unless the array is unreasonably large.  Who needs to multiply two 16MB arrays of floats together?  Some people I'm sure, but not many.  If the P/Invoke overheads are excluded, however, arrays as small as a few hundred elements see 2X speedup.  And for larger arrays it hovers around 3X.

Clearly as future architectures offer more vector width, these speedups just increase.  And perhaps there will eventually be more incentive for native CLR support.  Just imagine if we had a 32-core system in which each core had a 16-way vector arithmetic unit: that's 32X16 (512) degrees of parallelism if you can just subdivide the problem appropriately.  GPUs, of course, already offer many-fold larger vector width than SSE, which is one reason why GPGPU is attractive.  Maybe I'll show how to write a DirectX pixel shader that adds two float arrays together in a future post.

5/30/2007 12:45:26 AM (Pacific Daylight Time, UTC-07:00)  #    Comments [4]
Tracked by:
"Interesting Finds: May 30, 2007" (Jason Haley) [Trackback]
"Interesting Finds: May 30, 2007" (Jason Haley) [Trackback]

 

Recent Entries:

Search:

Browse by Date:
<August 2008>
SunMonTueWedThuFriSat
272829303112
3456789
10111213141516
17181920212223
24252627282930
31123456

Browse by Category:

Notables:

Currently Up To:

Reading...

Listening...

Watching...